Learning Curve Analysis (LCA) with Brute Force Method
=========================================================

The scripts contained here run the Learning Curve Analysis sequentially (non-parallelized) and produce results that are standardized and compatible with IMPROVE LCA postprocessing scripts.

Requirements
---------------------

* :doc:`IMPROVE general environment <INSTALLATION>`
* An IMPROVE-compliant model and its environment

Installation and Setup
------------------------

Create the IMPROVE general environment:

.. code-block:: bash

    conda create -n IMPROVE python=3.6
    conda activate IMPROVE
    pip install improvelib


Install the model of choice, IMPROVE, and benchmark datasets:

.. code-block:: bash

    cd <WORKING_DIR>
    git clone https://github.com/JDACS4C-IMPROVE/<MODEL>
    cd <MODEL>
    source setup_improve.sh


Create the model's Conda environment at a path inside the model directory:

.. code-block:: bash

    conda env create -f <MODEL_ENV>.yml -p ./<MODEL_ENV_NAME>/


Parameter Configuration
--------------------------

This workflow uses IMPROVE parameter handling. You should create a config file following the template of :code:`lca_swarm_params.ini` with the parameters appropriate for your experiment. Parameters may also be specified on the command line.

* :code:`input_dir`: Path to benchmark data. If using a DRP model with standard setup, this should be :code:`./csa_data/raw_data`
* :code:`lca_splits_dir`: Path to LCA splits, as generated by the LCA splits generator.
* :code:`output_dir`: Path to save the LCA results. 
* :code:`model_name`: Name of the model as it appears in the model's script names (e.g. :code:`<model_name>_preprocess_improve.py`). Note that this is case-sensitive.
* :code:`model_scripts_dir`: Path to the model repository as cloned above. Can be an absolute or relative path.
* :code:`model_environment`: Name of the model environment as created above. Can be a path, or just the name of environment directory if it is located in :code:`model_scripts_dir`.
* :code:`dataset`: Name of the dataset as used in the split names. Note that this is case-sensitive.
* :code:`split_nums`: List of the split numbers to run, given as strings.
* :code:`y_col_name`: Name of column to use in y data (default: auc).
* :code:`cuda_name`: Name of the CUDA device (e.g. 'cuda:0'). If None is specified, model default parameters will be used (default: None).
* :code:`epochs`: Number of epochs to train for. If None is specified, model default parameters will be used (default: None).
* :code:`input_supp_data_dir`: Path to the supplementary data directory, if required. If None is specified, model default parameters will be used (default: None).
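
Before launching a long run, it can help to confirm that the config file sets every parameter listed above without a default. The following is a minimal sketch using Python's standard :code:`configparser`. It is not part of the IMPROVE tooling: the helper :code:`missing_params`, the section name :code:`LCA`, and all example values are hypothetical, and treating the parameters without a stated default as required is an assumption.

.. code-block:: python

    import configparser

    # Parameters listed above without a stated default.
    # Treating them as required is an assumption of this sketch.
    REQUIRED_KEYS = [
        "input_dir", "lca_splits_dir", "output_dir", "model_name",
        "model_scripts_dir", "model_environment", "dataset", "split_nums",
    ]

    def missing_params(config_text, section):
        """Return the required keys that are absent from the given section."""
        parser = configparser.ConfigParser()
        parser.read_string(config_text)
        if not parser.has_section(section):
            return list(REQUIRED_KEYS)
        return [key for key in REQUIRED_KEYS if not parser.has_option(section, key)]

    # Hypothetical config contents: the section name "LCA" and every value
    # below are placeholders, not taken from the real template.
    example = """
    [LCA]
    input_dir = ./csa_data/raw_data
    lca_splits_dir = ./lca_splits
    output_dir = ./lca_output
    model_name = mymodel
    model_scripts_dir = ./MyModel
    model_environment = mymodel_env
    dataset = mydataset
    split_nums = ["0", "1"]
    """.replace("\n    ", "\n")  # strip the indentation added for readability

    print(missing_params(example, "LCA"))

An empty list means no required key is missing. Adjust the section name to match the one used in your own config file.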

Usage
----------

Activate the IMPROVE environment:

.. code-block:: bash

    conda activate <PATH/TO/MODEL>/<MODEL_ENV_NAME>


Run LCA brute force with your configuration file:

.. code-block:: bash

    python lca_bruteforce.py --config <yourconfig.ini>


Output
-------

The output will be written to the specified :code:`output_dir` with the following structure (the split and size directories reflect the splits and sizes used in your experiment):

.. code-block:: text

    output_dir/
    ├── infer
    │   ├── split_0
    │   │   ├── sz_[0]
    │   │   │   ├── param_log_file.txt
    │   │   │   ├── test_scores.json
    │   │   │   └── test_y_data_predicted.csv
    │   │   ├── sz_[1]
    │   │   ├── ...
    │   │   └── sz_[n]
    │   ├── split_1
    │   ├── ...
    │   └── split_9
    ├── ml_data
    │   ├── split_0
    │   │   ├── sz_[0]
    │   │   │   ├── param_log_file.txt
    │   │   │   ├── train_y_data.csv
    │   │   │   ├── val_y_data.csv
    │   │   │   ├── test_y_data.csv
    │   │   │   └── train/val/test x_data, and other files per model
    │   │   ├── sz_[1]
    │   │   ├── ...
    │   │   └── sz_[n]
    │   ├── split_1
    │   ├── ...
    │   └── split_9
    └── models
        ├── split_0
        │   ├── sz_[0]
        │   │   ├── param_log_file.txt
        │   │   ├── val_scores.json
        │   │   ├── val_y_data_predicted.csv
        │   │   └── trained model file
        │   ├── sz_[1]
        │   ├── ...
        │   └── sz_[n]
        ├── split_1
        ├── ...
        └── split_9


We recommend using the :doc:`postprocessing <using_lc_postprocess>` script for LCA to aggregate the results.
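
For a quick look at the raw results without the postprocessing script, the test scores can also be gathered directly from the directory layout shown above. This is a minimal sketch, not the IMPROVE postprocessing code: the function name :code:`collect_test_scores` is hypothetical, and the sketch assumes only that each :code:`test_scores.json` holds a JSON object and that the size directories start with :code:`sz_`.

.. code-block:: python

    import json
    from pathlib import Path

    def collect_test_scores(output_dir):
        """Gather every infer/split_*/sz_*/test_scores.json into a list of records."""
        records = []
        pattern = "infer/split_*/sz_*/test_scores.json"
        for path in sorted(Path(output_dir).glob(pattern)):
            with path.open() as handle:
                scores = json.load(handle)  # contents are kept as-is
            records.append({
                "split": path.parent.parent.name,  # e.g. "split_0"
                "size": path.parent.name,          # the sz_* directory name
                "scores": scores,
            })
        return records

Calling :code:`collect_test_scores("<output_dir>")` returns one record per split and size, which can then be printed or loaded into a table for inspection.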