Preprocessing Script Example with XGBoost for DRP
==================================================

This script, :code:`xgboostdrp_preprocess_improve.py`, preprocesses raw benchmark data and generates the ML data files for the XGBoost prediction model. The naming convention for the preprocessing script is :code:`<model_name>_preprocess_improve.py`.

**Inputs:**

The benchmark data directory is set with :code:`params['input_dir']`. Alternate data can be provided; see :doc:`using_alternate_data`. Models can use any of the provided benchmark studies/splits by changing the values of :code:`params['train_split_file']`, :code:`params['val_split_file']`, and :code:`params['test_split_file']`.

* **y_data**: The file with the factor IDs and observation values.

  * response.tsv

* **x_data**: Files with the factor IDs and features for each feature type.

  * cancer_gene_expression.tsv
  * drug_mordred.tsv

* **splits**: The split files specifying which rows of the response file to use for each stage.

  * CCLE_split_0_train.txt
  * CCLE_split_0_val.txt
  * CCLE_split_0_test.txt

.. note::
   The x_data used by your model may be different. Use the appropriate parameter(s) for your application to specify which data is used by your model (e.g. :code:`params['cell_transcriptomic_file']`). Information about the parameters can be found here: :doc:`api_preprocess`.

**Outputs:**

All the outputs from the *preprocess* script are saved in :code:`params['output_dir']`. The processed x and y data are used as inputs to the ML/DL prediction model in the *train* and *infer* scripts.

* **Processed x data**: Processed feature data for the XGBoost model.

  * train_data.parquet
  * val_data.parquet
  * test_data.parquet

* **Transformation dictionaries**: Dictionaries that contain the transformation values as determined on the training set.

  * drugs_transform.json
  * omics_transform.json

* **Processed y data**: Files with the true y values and IDs (additional metadata is acceptable).

  * train_y_data.csv
  * val_y_data.csv
  * test_y_data.csv

* **Parameter log**: Log of the :code:`params` dictionary.

  * param_log_file.txt

* **Timer log**: Runtime recorded by the :code:`Timer`.

  * runtime_preprocess.json

.. note::
   The y data files should be saved under the standardized names with :doc:`api_utils_save_stage_ydf` regardless of the model being curated. The format of the x data files can differ depending on the model and is specified with :code:`params['data_format']`.

In addition to the data files mentioned above, the preprocessing script can be used to save additional utility data required in *train* and *infer*.
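As a quick sanity check of this output contract, the sketch below loads the processed train files back in and confirms that the x and y rows line up. This is illustrative only: the :code:`./exp_result` path is a hypothetical output directory, and the file names assume this model's parquet :code:`data_format`.

.. code-block:: python

   import pandas as pd
   from pathlib import Path

   output_dir = Path("./exp_result")  # hypothetical value of params['output_dir']

   # Processed x data (parquet, per this model's params['data_format'])
   x_train = pd.read_parquet(output_dir / "train_data.parquet")
   # Processed y data, saved under the standardized name by save_stage_ydf
   y_train = pd.read_csv(output_dir / "train_y_data.csv")

   # Rows are written in the same order, so the two files must align
   assert len(x_train) == len(y_train)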
Preprocessing Imports
^^^^^^^^^^^^^^^^^^^^^^

This section consists of five basic components:

* **Import basic functionality**: sys, Path, etc.
* **Import core improvelib functionality**: we import this as :code:`frm` for historical reasons.
* **Import application-specific functionality**: this will need to be changed if curating a model for another application.
* **Import model-specific functionality**: at minimum this should include the model parameter definitions, but it can also include other packages your model requires (Polars, NumPy, etc.).
* **Get file path**: :code:`filepath` is used by the Config, and this line should be present as is.

.. code-block:: python

   import sys
   from pathlib import Path

   import pandas as pd

   # Core improvelib imports
   import improvelib.utils as frm

   # Application-specific (DRP) imports
   from improvelib.applications.drug_response_prediction.config import DRPPreprocessConfig
   import improvelib.applications.drug_response_prediction.drp_utils as drp

   # Model-specific imports
   from model_params_def import preprocess_params

   filepath = Path(__file__).resolve().parent

.. note::
   You may need to add imports if your model requires other packages, and you may need to change the application-specific imports if using another application.

Preprocessing :code:`run()`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This function contains the bulk of the code: it loads the benchmark data, checks the feature data, determines the transformation values on the training set, preprocesses the data for the train, val, and test sets, and saves the preprocessed ML data. Here we walk through the function.

**Define the function.** The parameter dictionary must be passed to this function:

.. code-block:: python

   def run(params):

**Load feature data.** We load the feature data with :doc:`api_utils_get_x_data`.

.. code-block:: python

   print("Load omics data.")
   omics = frm.get_x_data(file=params['cell_transcriptomic_file'],
                          benchmark_dir=params['input_dir'],
                          column_name=params['canc_col_name'])

   print("Load drug data.")
   drugs = frm.get_x_data(file=params['drug_mordred_file'],
                          benchmark_dir=params['input_dir'],
                          column_name=params['drug_col_name'])

**Validity check of feature representations.** This is not needed for this model.

.. note::
   If you are using any features for which some of the information may not be valid, this is where you should check the validity of the features and remove any invalid entries from the feature dataframe. We highly recommend this for any preprocessing that uses RDKit, as converting to :code:`mol` often returns None or raises an error. Another scenario is when drug target information is used and not all drugs have target information.
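For illustration, a validity check for a SMILES-based model might look like the sketch below. It is not part of this XGBoost script (Mordred descriptors need no such check), and the :code:`drug_id`/:code:`smiles` column names are hypothetical.

.. code-block:: python

   import pandas as pd
   from rdkit import Chem

   # Hypothetical drug table with a SMILES column (not part of this model)
   drug_smiles = pd.DataFrame({
       "drug_id": ["d1", "d2", "d3"],
       "smiles": ["CCO", "c1ccccc1", "not_a_smiles"],
   })

   # Chem.MolFromSmiles returns None when a SMILES string cannot be parsed,
   # so keep only the rows with a valid representation.
   valid = drug_smiles["smiles"].apply(lambda s: Chem.MolFromSmiles(s) is not None)
   print(f"Dropping {(~valid).sum()} drugs with unparsable SMILES.")
   drug_smiles = drug_smiles[valid].reset_index(drop=True)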
**Determine preprocessing on training data.** Transformations include subsetting to certain features, imputing any missing values, or scaling the data. For example, in this XGBoost model we have set :code:`cell_transcriptomic_transform = [['subset', 'LINCS_SYMBOL'], ['scale', 'std']]`, which first subsets the genes to only those in the LINCS1000 set and then applies the standard scaler to the data. See :doc:`api_utils_determine_transform` for more information on all the available transformations.

First we load the y (response) data for the training split with :doc:`api_utils_get_response_data`.

.. code-block:: python

   print("Load train response data.")
   response_train = frm.get_response_data(split_file=params["train_split_file"],
                                          benchmark_dir=params['input_dir'],
                                          response_file=params['y_data_file'])

Next we find the y data that has matching features for all the feature types we are using (for XGBoost, transcriptomics for cell line features and Mordred descriptors for drug features) with :doc:`api_utils_get_response_with_features`. Then we subset to the features that are present in this training y data with :doc:`api_utils_get_features_in_response`.

.. code-block:: python

   print("Find intersection of training data.")
   response_train = frm.get_response_with_features(response_train, omics, params['canc_col_name'])
   response_train = frm.get_response_with_features(response_train, drugs, params['drug_col_name'])
   omics_train = frm.get_features_in_response(omics, response_train, params['canc_col_name'])
   drugs_train = frm.get_features_in_response(drugs, response_train, params['drug_col_name'])

Then we determine the transformation values on the features that are in the training set, for the transformations specified in the parameters (here :code:`params['cell_transcriptomic_transform']` and :code:`params['drug_mordred_transform']`), using :doc:`api_utils_determine_transform`. This saves the values of each transformation to a dictionary that will be used later to transform the train, val, and test datasets.

.. code-block:: python

   print("Determine transformations.")
   frm.determine_transform(omics_train, 'omics_transform', params['cell_transcriptomic_transform'], params['output_dir'])
   frm.determine_transform(drugs_train, 'drugs_transform', params['drug_mordred_transform'], params['output_dir'])

**Construct ML data for every stage (train, val, test).** We highly recommend preprocessing the data in a loop to ensure the preprocessing is identical for all three stage datasets. We set up the loop as follows:

.. code-block:: python

   stages = {"train": params["train_split_file"],
             "val": params["val_split_file"],
             "test": params["test_split_file"]}

   for stage, split_file in stages.items():
       print(f"Prepare data for stage {stage}.")

Inside this loop we find the intersection of the data, transform the data, merge the data, and save the data. First we load the response data and find the intersection of the data with :doc:`api_utils_get_response_data`, :doc:`api_utils_get_response_with_features`, and :doc:`api_utils_get_features_in_response`:

.. code-block:: python

   print(f"Find intersection of {stage} data.")
   response_stage = frm.get_response_data(split_file=split_file,
                                          benchmark_dir=params['input_dir'],
                                          response_file=params['y_data_file'])
   response_stage = frm.get_response_with_features(response_stage, omics, params['canc_col_name'])
   response_stage = frm.get_response_with_features(response_stage, drugs, params['drug_col_name'])
   omics_stage = frm.get_features_in_response(omics, response_stage, params['canc_col_name'])
   drugs_stage = frm.get_features_in_response(drugs, response_stage, params['drug_col_name'])

Next we transform the data, using the dictionaries we created above, with :doc:`api_utils_transform_data`:

.. code-block:: python

   print(f"Transform {stage} data.")
   omics_stage = frm.transform_data(omics_stage, 'omics_transform', params['output_dir'])
   drugs_stage = frm.transform_data(drugs_stage, 'drugs_transform', params['output_dir'])

Then we assign the response columns to :code:`y_df_cols` so we can pull out these columns to save later. We use :code:`pandas.merge` to merge all the features. Shuffling the data is optional, depending on your model.

.. code-block:: python

   print(f"Merge {stage} data.")
   y_df_cols = response_stage.columns.tolist()
   data = response_stage.merge(omics_stage, on=params["canc_col_name"], how="inner")
   data = data.merge(drugs_stage, on=params["drug_col_name"], how="inner")
   data = data.sample(frac=1.0).reset_index(drop=True)  # shuffle
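Note that :code:`sample(frac=1.0)` draws a fresh random ordering on every run. If you want the saved ML data to be reproducible across preprocessing runs, a seed can be passed to the shuffle. A minimal sketch, where the seed value 0 is an arbitrary illustrative choice:

.. code-block:: python

   # Deterministic shuffle: repeated preprocessing runs produce the same
   # row order; the fixed seed here is an arbitrary illustrative choice.
   data = data.sample(frac=1.0, random_state=0).reset_index(drop=True)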
Finally we save the x data and the y data. The x data file name is determined with :doc:`api_utils_build_ml_data_file_name`, and the y data is saved with :doc:`api_utils_save_stage_ydf`. In this implementation we add the y values (:code:`params['y_col_name']`) to the x data file and isolate this column again in the *train* script; this can be omitted, in which case the saved y dataframe is used instead.

.. code-block:: python

   print(f"Save {stage} data.")
   xdf = data.drop(columns=y_df_cols)
   xdf[params['y_col_name']] = data[params['y_col_name']]
   data_fname = frm.build_ml_data_file_name(data_format=params["data_format"], stage=stage)
   xdf.to_parquet(Path(params["output_dir"]) / data_fname)

   # [Req] Save y dataframe for the current stage
   ydf = data[y_df_cols]
   frm.save_stage_ydf(ydf, stage, params["output_dir"])

.. note::
   There is flexibility in how this is implemented, but it is essential that the y data is saved with the IDs and ground truth, in the same order as the features, so that the predictions and scores are saved accurately.

**Return the output directory.**

.. code-block:: python

   return params["output_dir"]

Preprocessing :code:`main()` and main guard
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :code:`main()` function is called upon script execution; it gets the parameters, calls :code:`run()`, and records the time it takes for the model to run. Each line is explained below:

* The first line (:code:`cfg = DRPPreprocessConfig()`) initializes the configuration object for each script as appropriate.
* The second line initializes the parameters. Parameters set on the command line (e.g. :code:`--input_dir /my/path/to/dir`) take precedence over the values in the config file, which take precedence over the default values provided by improvelib.

  * :code:`pathToModelDir` is the current path in the system. :code:`filepath` is already present in the template via :code:`filepath = Path(__file__).resolve().parent`.
  * :code:`default_config` is the default configuration file, as a string.
  * :code:`additional_definitions` is the list of model-specific parameters.

* The third line initializes the :doc:`api_utils_Timer`.
* The fourth line calls :code:`run()` with the parameters. As discussed, :code:`run()` contains the model code.
* The fifth line ends the :doc:`api_utils_Timer` and saves the elapsed time to a JSON file in the output directory.
* The last (optional) line prints a message indicating that the script finished and ran successfully.

.. code-block:: python

   def main(args):
       cfg = DRPPreprocessConfig()
       params = cfg.initialize_parameters(pathToModelDir=filepath,
                                          default_config="xgboostdrp_params.ini",
                                          additional_definitions=preprocess_params)
       timer_preprocess = frm.Timer()
       ml_data_outdir = run(params)
       timer_preprocess.save_timer(dir_to_save=params["output_dir"],
                                   filename='runtime_preprocess.json',
                                   extra_dict={"stage": "preprocess"})
       print("\nFinished data preprocessing.")

.. note::
   You will need to change the name of :code:`default_config` to the one for your model, and change the Config class if you are using an application other than DRP.

The main guard below prevents unintended execution and should be present as is:

.. code-block:: python

   if __name__ == "__main__":
       main(sys.argv[1:])