Learning Curve Split Generator

This repository contains scripts that generate data splits for learning curve analysis (LCA) using drug response data from various sources. The scripts create training sets of progressively larger sizes, so that model performance can be analyzed as a function of the amount of training data.

Requirements

  • Python 3.x

  • Pandas

  • NumPy

Installation and Setup

You can install the required packages using pip:

pip install pandas numpy

Parameter Configuration

The workflow accepts the following command-line parameters:

  • --data_file_path: Full path to the input data file. For DRP models with benchmark data, this should be ‘/csa_data/raw_data/y_data/response.tsv’.

  • --splits_dir: Full path to the directory where the split files will be saved.

  • --lc_sizes: Number of subset sizes to generate (default: 10).

  • --min_size: The lower bound for the subset size (default: 128).

  • --max_size: The upper bound for the subset size (default: None, which sets it to the length of the dataset).

  • --lc_step_scale: Scale of progressive sampling of subset sizes in a learning curve (options: linear, log, log2, log10).
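
To illustrate how --lc_sizes, --min_size, --max_size, and --lc_step_scale might interact, here is a minimal sketch of computing subset sizes. It is not taken from generate_lc_split_files.py; the function name lc_subset_sizes is hypothetical, and the sketch assumes that all three log options produce the same geometric spacing between the two bounds, which the actual script may handle differently.

```python
import numpy as np


def lc_subset_sizes(n_rows, lc_sizes=10, min_size=128, max_size=None, scale="log"):
    """Return increasing training-subset sizes between min_size and max_size.

    Hypothetical helper for illustration only; the real script may differ.
    """
    if max_size is None:
        max_size = n_rows  # documented default: the length of the dataset
    max_size = min(max_size, n_rows)
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")

    if scale == "linear":
        # Evenly spaced sizes
        sizes = np.linspace(min_size, max_size, lc_sizes)
    elif scale in ("log", "log2", "log10"):
        # Geometrically spaced sizes (assumption: the log base is ignored here,
        # since fixed endpoints and a fixed count give the same points)
        sizes = np.geomspace(min_size, max_size, lc_sizes)
    else:
        raise ValueError(f"unknown scale: {scale}")

    # Round to integers and drop any duplicates created by rounding
    return sorted({int(round(float(s))) for s in sizes})


print(lc_subset_sizes(10000, lc_sizes=5, min_size=100, scale="log"))
# [100, 316, 1000, 3162, 10000]
```

With a log scale, most of the sizes fall near the lower bound, which is often where a learning curve changes fastest; a linear scale spreads them evenly.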

Usage

Run generate_lc_split_files.py to generate learning curve data splits based on the provided parameters:

python generate_lc_split_files.py --data_file_path <path_to_data_file> \
    --splits_dir <path_to_splits_directory> \
    --lc_sizes <number_of_sizes> \
    --min_size <minimum_size> \
    --max_size <maximum_size> \
    --lc_step_scale <scale>

Example:

python generate_lc_split_files.py \
    --data_file_path ../../../csa_data/raw_data/y_data/response.tsv \
    --splits_dir ../../../csa_data/raw_data/splits \
    --lc_sizes 10 \
    --min_size 1024 \
    --lc_step_scale log

Alternatively, run the bash script gen_lc_splits.sh:

bash gen_lc_splits.sh PATH_TO_DATA_FILE PATH_TO_SPLITS_DIR

The bash script invokes generate_lc_split_files.py with parameters such as the data file path, the splits directory, the number of LC sizes, the minimum and maximum LC sizes, and the scaling method for size increments.

Output

The output consists of multiple text files in the specified splits directory. Each file contains indices of rows corresponding to the specified learning curve sizes.
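
As a rough sketch of how a generated split file could be consumed, the snippet below loads the response data and selects the rows listed in one split file. The function name load_split_subset and the example split file name are hypothetical, and the sketch assumes that each split file holds one zero-based positional row index per line; check the actual files before relying on this.

```python
import numpy as np
import pandas as pd


def load_split_subset(data_file_path, split_file_path):
    """Return the rows of the data file selected by one split file.

    Assumptions (not confirmed by the scripts): the data file is
    tab-separated with a header row, and the split file lists one
    zero-based positional row index per line.
    """
    df = pd.read_csv(data_file_path, sep="\t")
    # ndmin=1 keeps a single-index file from collapsing to a scalar
    idx = np.loadtxt(split_file_path, dtype=int, ndmin=1)
    return df.iloc[idx]


# Example call (the split file name is a placeholder):
# subset = load_split_subset(
#     "../../../csa_data/raw_data/y_data/response.tsv",
#     "../../../csa_data/raw_data/splits/<split_file>.txt",
# )
```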