Learning Curve Analysis (LCA) with Brute Force Method
The scripts contained here run the Learning Curve Analysis sequentially (non-parallelized) and produce results that are standardized and compatible with IMPROVE LCA postprocessing scripts.
Requirements
An IMPROVE-compliant model and its environment
Installation and Setup
Create the IMPROVE general environment:
conda create -n IMPROVE python=3.6
conda activate IMPROVE
pip install improvelib
Install the model of choice, IMPROVE, and benchmark datasets:
cd <WORKING_DIR>
git clone https://github.com/JDACS4C-IMPROVE/<MODEL>
cd <MODEL>
source setup_improve.sh
Create a Conda environment path in the model directory:
conda env create -f <MODEL_ENV>.yml -p ./<MODEL_ENV_NAME>/
Parameter Configuration
This workflow uses IMPROVE parameter handling. You should create a config file following the template of lca_swarm_params.ini
with the parameters appropriate for your experiment. Parameters may also be specified on the command line.
input_dir
: Path to benchmark data. If using a DRP model with standard setup, this should be ./csa_data/raw_data
lca_splits_dir
: Path to LCA splits, as generated by the LCA splits generator.
output_dir
: Path to save the LCA results.
model_name
: Name of the model as used in scripts (i.e., the <model_name> in <model_name>_preprocess_improve.py). Note that this is case-sensitive.
model_scripts_dir
: Path to the model repository as cloned above. Can be an absolute or relative path.
model_environment
: Name of the model environment as created above. Can be a path, or just the name of the environment directory if it is located in model_scripts_dir.
dataset
: Name of the dataset as used in the split names. Note that this is case-sensitive.
split_nums
: List of strings of the numbers of splits.
y_col_name
: Name of column to use in y data (default: auc).
cuda_name
: Name of the CUDA device (e.g. 'cuda:0'). If None is specified, model default parameters will be used (default: None).
epochs
: Number of epochs to train for. If None is specified, model default parameters will be used (default: None).
input_supp_data_dir
: Path to the supplementary data directory, if required. If None is specified, model default parameters will be used (default: None).
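A quick check before launching a long run can catch a missing parameter early. The sketch below is not part of the workflow. It assumes the config is a standard INI file readable by Python's configparser, that all parameters sit in one section (the section name here is a guess, so use the one from the template), and that every parameter listed above without a documented default must be set. All values in the example config are placeholders, and the syntax shown for split_nums is an assumption.

```python
import configparser

# Parameters listed above without a documented default.
# Treating them as required is an assumption of this sketch.
REQUIRED = [
    "input_dir", "lca_splits_dir", "output_dir", "model_name",
    "model_scripts_dir", "model_environment", "dataset", "split_nums",
]


def missing_params(config_text, section="DEFAULT"):
    """Return the required parameter names absent from the given INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Assumption: all parameters live in a single section.
    values = parser[section] if section in parser else {}
    return [name for name in REQUIRED if name not in values]


# Hypothetical config; every value below is a placeholder.
example = """
[DEFAULT]
input_dir = ./csa_data/raw_data
lca_splits_dir = ./lca_splits
output_dir = ./lca_output
model_name = mymodel
model_scripts_dir = ./MyModel
model_environment = mymodel_env
dataset = MYDATASET
split_nums = ["0", "1"]
y_col_name = auc
"""

print(missing_params(example))  # prints []
```

To check a real file, read its text first, for example missing_params(open("yourconfig.ini").read()).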
Usage
Activate the IMPROVE environment:
conda activate <PATH/TO/MODEL>/<MODEL_ENV_NAME>
Run LCA brute force with your configuration file:
python lca_bruteforce.py --config <yourconfig.ini>
Output
The output is written to the specified output_dir
with the following structure (the split and size directories reflect the splits and sizes used in your run):
output_dir/
├── infer
│ ├── split_0
│ │ ├── sz_[0]
│ │ │ ├── param_log_file.txt
│ │ │ ├── test_scores.json
│ │ │ └── test_y_data_predicted.csv
│ │ ├── sz_[1]
│ │ ├── ...
│ │ └── sz_[n]
│ ├── split_1
│ ├── ...
│ └── split_9
├── ml_data
│ ├── split_0
│ │ ├── sz_[0]
│ │ │ ├── param_log_file.txt
│ │ │ ├── train_y_data.csv
│ │ │ ├── val_y_data.csv
│ │ │ ├── test_y_data.csv
│ │ │ └── train/val/test x_data, and other files per model
│ │ ├── sz_[1]
│ │ ├── ...
│ │ └── sz_[n]
│ ├── split_1
│ ├── ...
│ └── split_9
└── models
  ├── split_0
  │ ├── sz_[0]
  │ │ ├── param_log_file.txt
  │ │ ├── val_scores.json
  │ │ ├── val_y_data_predicted.csv
  │ │ └── trained model file
  │ ├── sz_[1]
  │ ├── ...
  │ └── sz_[n]
  ├── split_1
  ├── ...
  └── split_9
We recommend using the postprocessing script for LCA to aggregate the results.
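For a quick look at the results before running the postprocessing script, the test scores can be gathered by walking the infer directory. The sketch below is not the IMPROVE postprocessing script. The function name collect_test_scores is hypothetical, and the sketch assumes each test_scores.json holds a flat JSON object of metric names and values.

```python
import json
from pathlib import Path


def collect_test_scores(output_dir):
    """Gather every infer/split_*/sz_*/test_scores.json into a list of records."""
    records = []
    pattern = "infer/split_*/sz_*/test_scores.json"
    for path in sorted(Path(output_dir).glob(pattern)):
        with path.open() as fh:
            scores = json.load(fh)
        # Directory names identify the split and the training-set size.
        record = {"split": path.parent.parent.name, "size": path.parent.name}
        # Assumption: the file is a flat JSON object of metrics.
        record.update(scores)
        records.append(record)
    return records
```

Calling collect_test_scores with the path given as output_dir returns one dictionary per split and size, which can be passed to pandas.DataFrame for plotting if pandas is available.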