Learning Curve Split Generator
=================================

This repository contains scripts that generate data splits for learning curve analysis (LCA) using drug response data from various sources. The scripts create training sets of progressively larger sizes, so that model performance can be analyzed as a function of the amount of training data.

Requirements
-----------------

* Python 3.x
* Pandas
* NumPy

Installation and Setup
------------------------

You can install the required packages using :code:`pip`:

.. code-block:: bash

    pip install pandas numpy

Parameter Configuration
--------------------------

The workflow takes the following command-line parameters:

* :code:`--data_file_path`: Full path to the input data file. For DRP models with benchmark data, this should be :code:`/csa_data/raw_data/y_data/response.tsv`.
* :code:`--splits_dir`: Full path to the directory where the split files will be saved.
* :code:`--lc_sizes`: Number of subset sizes to generate (default: 10).
* :code:`--min_size`: The lower bound for the subset size (default: 128).
* :code:`--max_size`: The upper bound for the subset size (default: None, which sets it to the length of the dataset).
* :code:`--lc_step_scale`: Scale used to space the subset sizes along the learning curve (options: linear, log, log2, log10).

Usage
---------

Run :code:`generate_lc_split_files.py` to generate learning curve data splits based on the provided parameters:

.. code-block:: bash

    python generate_lc_split_files.py --data_file_path \
        --splits_dir \
        --lc_sizes \
        --min_size \
        --max_size \
        --lc_step_scale

Example:

.. code-block:: bash

    python generate_lc_split_files.py \
        --data_file_path ../../../csa_data/raw_data/y_data/response.tsv \
        --splits_dir ../../../csa_data/raw_data/splits \
        --lc_sizes 10 \
        --min_size 1024 \
        --lc_step_scale log

Alternatively, run the bash script :code:`gen_lc_splits.sh`:

.. code-block:: bash

    bash gen_lc_splits.sh PATH_TO_DATA_FILE PATH_TO_SPLITS_DIR

This script invokes :code:`generate_lc_split_files.py` with the data file path, the splits directory, the number of LC sizes, the minimum and maximum LC sizes, and the scaling method for size increments.

Output
--------

The output consists of multiple text files written to the specified splits directory. Each file contains the row indices that make up the subset for one of the learning curve sizes.
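
To illustrate how :code:`--lc_sizes`, :code:`--min_size`, :code:`--max_size`, and :code:`--lc_step_scale` might interact, here is a minimal sketch of computing the subset sizes. It is not taken from :code:`generate_lc_split_files.py`: the function name :code:`lc_subset_sizes` is hypothetical, and the sketch assumes that all three log options yield the same geometric spacing between the two bounds, which may differ from what the script actually does.

.. code-block:: python

    import numpy as np


    def lc_subset_sizes(n_samples, lc_sizes=10, min_size=128,
                        max_size=None, lc_step_scale="log"):
        """Hypothetical helper: return increasing subset sizes for a learning curve."""
        # Assumption: max_size=None means "use the full dataset".
        max_size = n_samples if max_size is None else min(max_size, n_samples)
        if min_size > max_size:
            raise ValueError("min_size must not exceed max_size")

        if lc_step_scale == "linear":
            # Evenly spaced sizes between the two bounds.
            sizes = np.linspace(min_size, max_size, lc_sizes)
        elif lc_step_scale in ("log", "log2", "log10"):
            # Evenly spaced on a log axis. With fixed endpoints and a fixed
            # number of points, the log base does not change the result.
            sizes = np.geomspace(min_size, max_size, lc_sizes)
        else:
            raise ValueError(f"unknown lc_step_scale: {lc_step_scale}")

        # Round to integers and drop duplicates created by rounding.
        return sorted({int(round(s)) for s in sizes})


    # Example: four sizes between 10 and 10000 samples on a log scale.
    print(lc_subset_sizes(10000, lc_sizes=4, min_size=10, lc_step_scale="log10"))

With a log scale the small sizes are sampled more densely than the large ones, which is often where a learning curve changes fastest; a linear scale spreads the sizes uniformly instead.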