Learning Curve Analysis (LCA) with Brute Force Method

The scripts contained here run the Learning Curve Analysis sequentially (non-parallelized) and produce results that are standardized and compatible with IMPROVE LCA postprocessing scripts.

Requirements

Installation and Setup

Create the IMPROVE general environment:

conda create -n IMPROVE python=3.6
conda activate IMPROVE
pip install improvelib

Install the model of choice, IMPROVE, and benchmark datasets:

cd <WORKING_DIR>
git clone https://github.com/JDACS4C-IMPROVE/<MODEL>
cd <MODEL>
source setup_improve.sh

Create the model's Conda environment at a path inside the model directory:

conda env create -f <MODEL_ENV>.yml -p ./<MODEL_ENV_NAME>/

Parameter Configuration

This workflow uses IMPROVE parameter handling. You should create a config file following the template of lca_swarm_params.ini with the parameters appropriate for your experiment. Parameters may also be specified on the command line.

  • input_dir: Path to benchmark data. If using a DRP model with standard setup, this should be ./csa_data/raw_data

  • lca_splits_dir: Path to LCA splits, as generated by the LCA splits generator.

  • output_dir: Path to save the LCA results.

  • model_name: Name of the model as it appears in the script names (e.g. <model_name>_preprocess_improve.py). Note that this is case-sensitive.

  • model_scripts_dir: Path to the model repository as cloned above. Can be an absolute or relative path.

  • model_environment: Name of the model environment as created above. Can be a path, or just the name of the environment directory if it is located in model_scripts_dir.

  • dataset: Name of the dataset as used in the split names. Note that this is case-sensitive.

  • split_nums: List of the split numbers to run, given as strings.

  • y_col_name: Name of column to use in y data (default: auc).

  • cuda_name: Name of the CUDA device (e.g. ‘cuda:0’). If None is specified, model default parameters will be used (default: None).

  • epochs: Number of epochs to train for. If None is specified, model default parameters will be used (default: None).

  • input_supp_data_dir: Path to the supplementary data directory, if required. If None is specified, model default parameters will be used (default: None).
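
The parameters above can be collected into a config file. The following is a minimal sketch using Python's standard configparser; the section name (DEFAULT), the list syntax for split_nums, and every value other than the documented defaults are assumptions for illustration, so check them against the lca_swarm_params.ini template before use.

```python
import configparser
import io

# Hypothetical values; replace them with the paths and names from your setup.
params = {
    "input_dir": "./csa_data/raw_data",
    "lca_splits_dir": "./lca_splits",     # assumed location
    "output_dir": "./lca_output",         # assumed location
    "model_name": "mymodel",              # placeholder, case-sensitive
    "model_scripts_dir": "./MyModel",     # placeholder
    "model_environment": "mymodel_env",   # placeholder
    "dataset": "CCLE",                    # placeholder, case-sensitive
    "split_nums": '["0", "1"]',           # list syntax is an assumption
    "y_col_name": "auc",                  # documented default
    "epochs": "10",                       # omit to use the model default
}

config = configparser.ConfigParser()
config["DEFAULT"] = params  # the section name is an assumption

# Render the config as text; write it to <yourconfig.ini> once it looks right.
buffer = io.StringIO()
config.write(buffer)
config_text = buffer.getvalue()
print(config_text)
```

Optional parameters such as cuda_name and input_supp_data_dir are left out here so that the model defaults apply.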

Usage

Activate the model environment created above:

conda activate <PATH/TO/MODEL>/<MODEL_ENV_NAME>

Run LCA brute force with your configuration file:

python lca_bruteforce.py --config <yourconfig.ini>
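
Before launching a long sequential run, it may help to confirm that the configured paths exist. The sketch below is not part of the workflow: check_lca_paths is a hypothetical helper, and it only checks the directories and the <model_name>_preprocess_improve.py script named in the parameter descriptions.

```python
from pathlib import Path


def check_lca_paths(input_dir, lca_splits_dir, model_scripts_dir, model_name):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    directories = [
        ("input_dir", input_dir),
        ("lca_splits_dir", lca_splits_dir),
        ("model_scripts_dir", model_scripts_dir),
    ]
    for label, value in directories:
        if not Path(value).is_dir():
            problems.append(f"{label} is not a directory: {value}")
    # model_name is case-sensitive, so the script name must match exactly.
    script = Path(model_scripts_dir) / f"{model_name}_preprocess_improve.py"
    if not script.is_file():
        problems.append(f"missing script: {script}")
    return problems


# Placeholder arguments; substitute the values from your config file.
for problem in check_lca_paths("./csa_data/raw_data", "./lca_splits", "./MyModel", "mymodel"):
    print(problem)
```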

Output

The output is written to the specified output_dir with the following structure (the split_ and sz_ directory names reflect the splits and sizes used in the run):

output_dir/
├── infer
│   ├── split_0
│   │   ├── sz_[0]
│   │   │   ├── param_log_file.txt
│   │   │   ├── test_scores.json
│   │   │   └── test_y_data_predicted.csv
│   │   ├── sz_[1]
│   │   ├── ...
│   │   └── sz_[n]
│   ├── split_1
│   ├── ...
│   └── split_9
├── ml_data
│   ├── split_0
│   │   ├── sz_[0]
│   │   │   ├── param_log_file.txt
│   │   │   ├── train_y_data.csv
│   │   │   ├── val_y_data.csv
│   │   │   ├── test_y_data.csv
│   │   │   └── train/val/test x_data, and other files per model
│   │   ├── sz_[1]
│   │   ├── ...
│   │   └── sz_[n]
│   ├── split_1
│   ├── ...
│   └── split_9
└── models
    ├── split_0
    │   ├── sz_[0]
    │   │   ├── param_log_file.txt
    │   │   ├── val_scores.json
    │   │   ├── val_y_data_predicted.csv
    │   │   └── trained model file
    │   ├── sz_[1]
    │   ├── ...
    │   └── sz_[n]
    ├── split_1
    ├── ...
    └── split_9

We recommend using the postprocessing script for LCA to aggregate the results.
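
For a quick look at the results without the postprocessing script, the test scores can be gathered by walking the infer directory. This is only a sketch and not the IMPROVE postprocessing script: collect_test_scores is a hypothetical helper, it assumes the split_*/sz_*/test_scores.json layout of the output tree, and it makes no assumption about which keys the JSON files contain.

```python
import json
from pathlib import Path


def collect_test_scores(output_dir):
    """Gather every test_scores.json under output_dir/infer into a list of rows."""
    rows = []
    infer_dir = Path(output_dir) / "infer"
    if not infer_dir.is_dir():
        return rows
    for scores_file in sorted(infer_dir.glob("split_*/sz_*/test_scores.json")):
        size_dir = scores_file.parent
        with open(scores_file) as handle:
            scores = json.load(handle)
        rows.append(
            {
                "split": size_dir.parent.name,  # e.g. the split_0 directory name
                "size": size_dir.name,          # the sz_ directory name
                "scores": scores,               # file content, stored unmodified
            }
        )
    return rows


# Placeholder path; use the output_dir from your config file.
for row in collect_test_scores("./lca_output"):
    print(row["split"], row["size"], row["scores"])
```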