Preprocessing Script Example with XGBoost for DRP

This script, xgboostdrp_preprocess_improve.py, preprocesses raw benchmark data and generates ML data files for the XGBoost prediction model. The naming convention for the preprocessing script is <MODEL>_preprocess_improve.py.

Inputs: The benchmark data directory is set with params['input_dir']. Alternate data can be provided; see using_alternate_data. Models can use any of the provided benchmark studies/splits by changing the values of params['train_split_file'], params['val_split_file'], and params['test_split_file'].

  • y_data: The file with the factor IDs and observation values.

    • response.tsv

  • x_data: Files with the factor IDs and features for each feature type.

    • cancer_gene_expression.tsv

    • drug_mordred.tsv

  • splits: Split files listing which rows of the response file to use for each stage (see the sketch after this list).

    • CCLE_split_0_train.txt

    • CCLE_split_0_val.txt

    • CCLE_split_0_test.txt
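For orientation, each split file is a plain-text list of row indices into the response file, one index per line. Below is a minimal sketch of loading one split by hand; the improvelib helpers used later (e.g. get_response_data) do this for you, and the y_data/ and splits/ subdirectory layout under params['input_dir'] is an assumption for illustration.

import pandas as pd
from pathlib import Path

input_dir = Path("csa_data/raw_data")  # hypothetical value of params['input_dir']
response = pd.read_csv(input_dir / "y_data" / "response.tsv", sep="\t")
# Each split file holds one response-row index per line.
split_ids = pd.read_csv(input_dir / "splits" / "CCLE_split_0_train.txt",
                        header=None)[0].tolist()
response_train = response.iloc[split_ids].reset_index(drop=True)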

Note

The x_data used by your model may be different. Use the appropriate parameter(s) for your application to specify which data is used by your model (e.g. params['cell_transcriptomic_file']). Information about the parameters can be found here: Preprocess Parameters.

Outputs: All the outputs from the preprocess script are saved in params['output_dir']. The processed x and y data are used as inputs to the ML/DL prediction model in the train and infer scripts.

  • Processed x data: Processed feature data for XGBoost model.

    • train_data.parquet

    • val_data.parquet

    • test_data.parquet

  • Transformation dictionaries: Dictionaries that contain the transformations as determined on the training set.

    • drugs_transform.json

    • omics_transform.json

  • Processed y data: File with true y values and IDs (additional metadata is acceptable).

    • train_y_data.csv

    • val_y_data.csv

    • test_y_data.csv

  • Parameter log: Log of the params dictionary.

    • param_log_file.txt

  • Timer log: Log of the Timer dictionary.

    • runtime_preprocess.json

Note

The y data files should be generated with the standardized names using save_stage_ydf, regardless of the model being curated. The format of the x data files can differ depending on the model and is specified with params['data_format']. In addition to the data files mentioned above, the preprocessing script can be used to save additional utility data required in train and infer.

Preprocessing Imports

This section consists of five basic components:

  • Import basic functionality: sys, Path, etc.

  • Import core improvelib functionality: we import this as ‘frm’ for historical reasons.

  • Import application-specific functionality: this will need to be changed if curating a model for another application.

  • Import model-specific functionality: at minimum this should include the model parameter definitions, but can also include other packages your model requires (Polars, numpy, etc.).

  • Get file path: filepath is used by the Config and this line should be present as is.

import sys
from pathlib import Path
import pandas as pd

# Core improvelib imports
import improvelib.utils as frm
# Application-specific (DRP) imports
from improvelib.applications.drug_response_prediction.config import DRPPreprocessConfig
import improvelib.applications.drug_response_prediction.drp_utils as drp

# Model-specific imports
from model_params_def import preprocess_params

filepath = Path(__file__).resolve().parent

Note

You may need to add imports if your model requires other packages and you may need to change the application-specific imports if using another application.

Preprocessing run()

This function contains the bulk of the code: it loads the benchmark data, checks the feature data, determines the transformation values on the training set, preprocesses the data for the train, val, and test sets, and saves the preprocessed ML data. Here we walk through the function.

Define the function. The parameter dictionary must be passed to this function:

def run(params):

Load feature data. We load the feature data with get_x_data.

print("Load omics data.")
omics = frm.get_x_data(file = params['cell_transcriptomic_file'],
                       benchmark_dir = params['input_dir'],
                       column_name = params['canc_col_name'])

print("Load drug data.")
drugs = frm.get_x_data(file = params['drug_mordred_file'],
                       benchmark_dir = params['input_dir'],
                       column_name = params['drug_col_name'])
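Each call returns a pandas DataFrame whose ID column is named by the corresponding parameter. An optional sanity check is sketched below; the shapes in the comments are assumptions about the benchmark data.

# Optional sanity checks on the loaded feature frames.
print(omics.shape)  # e.g. (n_cell_lines, 1 + n_genes)
print(drugs.shape)  # e.g. (n_drugs, 1 + n_descriptors)
assert params['canc_col_name'] in omics.columns
assert params['drug_col_name'] in drugs.columns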

Validity check of feature representations. This is not needed for this model.

Note

If you are using any features for which some of the information may not be valid, this is where you will want to check the validity of the features and remove invalid entries from the feature dataframe if needed. We highly recommend this for any preprocessing that uses RDKit, as converting to mol often returns None or an error. Another scenario is if drug target information is used and not all drugs have target information.
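As an illustration only (this XGBoost model does not need it), here is a minimal sketch of such a check for SMILES-based drug features; drop_invalid_smiles and its smiles_col argument are hypothetical names, not improvelib functions.

from rdkit import Chem

def drop_invalid_smiles(drugs_df, smiles_col):
    # Keep only rows whose SMILES string parses to a valid RDKit Mol;
    # Chem.MolFromSmiles returns None for unparsable strings.
    valid = drugs_df[smiles_col].apply(lambda s: Chem.MolFromSmiles(s) is not None)
    return drugs_df[valid].reset_index(drop=True)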

Determine preprocessing on training data. Transformations include subsetting to certain features, imputing any missing values, or scaling the data. For example, in this XGBoost model we have set cell_transcriptomic_transform = [['subset', 'LINCS_SYMBOL'], ['scale', 'std']], which will first subset the genes to only those in the LINCS1000 set and then apply the standard scaler to the data. See determine_transform for more information on all the available transformations.

First we load the y (response) data for the training split with get_response_data.

print("Load train response data.")
response_train = frm.get_response_data(split_file=params["train_split_file"],
                                       benchmark_dir=params['input_dir'],
                                       response_file=params['y_data_file'])

Next we find the y data that has matching features for all the feature types we are using (for XGBoost we are using transcriptomics for cell line features and Mordred descriptors for drug features) with get_response_with_features. Then we subset to the features that are present in this training y data set with get_features_in_response.

print("Find intersection of training data.")
response_train = frm.get_response_with_features(response_train, omics, params['canc_col_name'])
response_train = frm.get_response_with_features(response_train, drugs, params['drug_col_name'])
omics_train = frm.get_features_in_response(omics, response_train, params['canc_col_name'])
drugs_train = frm.get_features_in_response(drugs, response_train, params['drug_col_name'])

Then we determine the transformation values on the features that are in the training set, for the transformations specified in the parameters (here params['cell_transcriptomic_transform'] and params['drug_mordred_transform']), using determine_transform. This saves the values of the transformation to a dictionary that will be used later to transform each (train, val, test) dataset.

print("Determine transformations.")
frm.determine_transform(omics_train, 'omics_transform', params['cell_transcriptomic_transform'], params['output_dir'])
frm.determine_transform(drugs_train, 'drugs_transform', params['drug_mordred_transform'], params['output_dir'])

Construct ML data for every stage (train, val, test). We highly recommend doing the preprocessing of the data in a loop to ensure the preprocessing is the same for all three stage datasets. We set up the loop as follows:

stages = {"train": params["train_split_file"],
           "val": params["val_split_file"],
           "test": params["test_split_file"]}

for stage, split_file in stages.items():
    print(f"Prepare data for stage {stage}.")

Inside this loop we find the intersection of the data, transform the data, merge the data, and save the data. First we load the response data and find the intersection with get_response_data, get_response_with_features, and get_features_in_response:

print(f"Find intersection of {stage} data.")
response_stage = frm.get_response_data(split_file=split_file,
                                       benchmark_dir=params['input_dir'],
                                       response_file=params['y_data_file'])
response_stage = frm.get_response_with_features(response_stage, omics, params['canc_col_name'])
response_stage = frm.get_response_with_features(response_stage, drugs, params['drug_col_name'])
omics_stage = frm.get_features_in_response(omics, response_stage, params['canc_col_name'])
drugs_stage = frm.get_features_in_response(drugs, response_stage, params['drug_col_name'])

Next we transform the data using the dictionaries we created above with transform_data:

print(f"Transform {stage} data.")
omics_stage = frm.transform_data(omics_stage, 'omics_transform', params['output_dir'])
drugs_stage = frm.transform_data(drugs_stage, 'drugs_transform', params['output_dir'])

Then we assign the response columns to y_df_cols so we can pull out these columns to save later. We use pandas.merge to merge all the features. Shuffling the data is optional, depending on your model.

print(f"Merge {stage} data")
y_df_cols = response_stage.columns.tolist()
data = response_stage.merge(omics_stage, on=params["canc_col_name"], how="inner")
data = data.merge(drugs_stage, on=params["drug_col_name"], how="inner")
data = data.sample(frac=1.0).reset_index(drop=True) # shuffle
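If you need the shuffle to be reproducible across runs, you can fix the random state; this is a minor optional variation on the line above, and the seed value 0 is arbitrary.

data = data.sample(frac=1.0, random_state=0).reset_index(drop=True)  # deterministic shuffle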

Finally we save the x data and the y data. The x data file name is determined with build_ml_data_file_name. The y data is saved with save_stage_ydf. In this implementation, we add the y values (params['y_col_name']) to the x data file, and we will isolate this column in the train script, but this can be omitted and the saved y dataframe can be used.

print(f"Save {stage} data")
xdf = data.drop(columns=y_df_cols)
xdf[params['y_col_name']] = data[params['y_col_name']]
data_fname = frm.build_ml_data_file_name(data_format=params["data_format"], stage=stage)
xdf.to_parquet(Path(params["output_dir"]) / data_fname)
# [Req] Save y dataframe for the current stage
ydf = data[y_df_cols]
frm.save_stage_ydf(ydf, stage, params["output_dir"])

Note

There is flexibility in how this is implemented, but it is essential that the y data is saved with the IDs and ground truth, in the same order as the features to ensure the predictions and scores are saved accurately.
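For orientation, the train script can later load these artifacts roughly as follows. This is a sketch, not the actual train script, and it assumes the y column was appended to the x data as done above.

import pandas as pd
from pathlib import Path

train_path = Path(params["output_dir"]) / frm.build_ml_data_file_name(
    data_format=params["data_format"], stage="train")
train_df = pd.read_parquet(train_path)
y_train = train_df.pop(params["y_col_name"])  # isolate the target column
x_train = train_df                            # remaining columns are the features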

Return the output directory.

return params["output_dir"]

Preprocessing main() and main guard

The main() function is called upon script execution and gets the parameters, calls run(), and records the time it takes for the model to run. Each line is explained below:

  • The first line (cfg = DRPPreprocessConfig()) initializes the configuration object for each script as appropriate.

  • The second line initializes the parameters. Parameters set by command line (e.g. --input_dir /my/path/to/dir) take precedence over the values in the config file, which take precedence over the default values provided by improvelib.

    • pathToModelDir is the path to the model directory; filepath is already defined in the template by filepath = Path(__file__).resolve().parent.

    • default_config is the default configuration file, as a string.

    • additional_definitions is the list of model-specific parameters.

  • The third line initializes the Timer.

  • The fourth line calls run() with the parameters. As discussed, run() contains the model code.

  • The fifth line ends the Timer and saves the time to a JSON file in the output_dir.

  • The last (optional) line prints a message indicating that the script is finished and ran successfully.

def main(args):
    cfg = DRPPreprocessConfig()
    params = cfg.initialize_parameters(pathToModelDir=filepath,
                                       default_config="xgboostdrp_params.ini",
                                       additional_definitions=preprocess_params)
    timer_preprocess = frm.Timer()
    ml_data_outdir = run(params)
    timer_preprocess.save_timer(dir_to_save=params["output_dir"],
                                filename='runtime_preprocess.json',
                                extra_dict={"stage": "preprocess"})
print("\nFinished data preprocessing.")

Note

You will need to change the name of default_config to the one for your model, and the Config if you are using an application other than DRP.

The main guard below prevents unintended execution and should be present as is:

if __name__ == "__main__":
    main(sys.argv[1:])