Tutorial

IMPROVE comparison workflows require that prediction models adhere to a unified code interface.

Supervised learning models typically involve three fundamental components: data preparation, model training with hyperparameter optimization (HPO), and performance evaluation [1]. Following this norm, we propose three distinct scripts, each dedicated to one of these components. Separating the components into individual scripts aims to improve code readability, provenance, and maintainability. The first script preprocesses the input data, the second trains the model, and the third runs the model in inference mode. The scripts should be modular and flexible so that they can be combined and integrated into workflows, which facilitates efficient and manageable workflow design and implementation.

To adhere to the unified code interface, model repositories are required to provide a Python script for each of the three components and one text file specifying default parameter values. All three scripts use functionality from the IMPROVE library.
1. Preprocessing. Preprocessing transforms benchmark data (a.k.a. raw data) into model-specific input data (a.k.a. ML data). Examples of raw data and ML data in the context of drug response prediction (DRP) are described in applications.
2. Training. Training and optimization of the model. This often includes hyperparameter optimization (HPO) and early stopping with validation data to mitigate overfitting.
3. Inference. Computing predictions and evaluating model prediction performance using the preprocessed data (from step 1) and the trained model (from step 2).
4. Parameter file. A file that contains default parameters for the three scripts.
[Figure: General steps in developing and using a prediction model.]

In the sections below, we provide an example of the three scripts, using a LightGBM model for drug response prediction with the cross-study analysis benchmark data. These scripts demonstrate the interface, the required code components, and the use of the IMPROVE library. The complete code for this example can be found in this repo.

In the code examples below, required code sections are designated with [Req]. These sections refer to functionality that models must integrate in their scripts.

Preprocessing

This script preprocesses raw benchmark data (e.g., cross-study analysis) and generates data files for a LightGBM prediction model. The naming convention for the preprocessing script is MODELNAME_preprocess_improve.py. For example: lgbm_preprocess_improve.py.
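
The naming convention can be sketched as a small helper. The function improve_script_names is hypothetical and not part of the IMPROVE library; the train and infer names follow the conventions stated in the Training and Inference sections of this tutorial.

```python
def improve_script_names(model_name: str) -> dict:
    """Return the conventional script name for each stage (hypothetical helper)."""
    stages = ("preprocess", "train", "infer")
    return {stage: f"{model_name}_{stage}_improve.py" for stage in stages}


names = improve_script_names("lgbm")
print(names["preprocess"])  # lgbm_preprocess_improve.py
```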

All the outputs from the preprocessing script are saved in params["ml_data_outdir"].

Outputs from running the preprocessing script:
  1. Model input data files.

    This script creates three data files corresponding to train, validation, and test data. These data files are used as inputs to the ML/DL prediction model in the training and inference scripts. The structure of the data in these files is highly dependent on the prediction model. Therefore, the training and inference scripts should provide and use appropriate functionality for loading the data and passing it to the model. The file format is specified by params["data_format"]. For example:

    for LightGBM model: train_data.parquet, val_data.parquet, test_data.parquet
    for GraphDRP model: train_data.pt, val_data.pt, test_data.pt
  2. Y data files.

    The script also creates dataframes with true Y values and additional metadata. Regardless of the prediction model, the script generates:

    train_y_data.csv, val_y_data.csv, and test_y_data.csv.

Note that in addition to the data files mentioned above, the preprocessing script can be used to save additional utility data required by the data loader.
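
As a rough illustration of the outputs listed above, the sketch below builds the expected file names for each stage. The helper expected_outputs is hypothetical, and the <stage>_data<ext> and <stage>_y_data.csv patterns are inferred from the examples in this section. In the actual scripts, frm.build_ml_data_name() and frm.save_stage_ydf() are responsible for naming and saving these files.

```python
from pathlib import Path


def expected_outputs(ml_data_outdir: str, data_format: str) -> dict:
    """Map each stage to its (ML data file, y data file) paths. Hypothetical helper."""
    outdir = Path(ml_data_outdir)
    return {
        stage: (outdir / f"{stage}_data{data_format}", outdir / f"{stage}_y_data.csv")
        for stage in ("train", "val", "test")
    }


# Example with the GraphDRP file format mentioned above
outputs = expected_outputs("./ml_data/CCLE-CCLE/split_0", ".pt")
for stage, (x_file, y_file) in outputs.items():
    print(stage, x_file.name, y_file.name)
```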

Below is a preprocessing script that takes cross-study analysis benchmark data and generates training, validation, and test data files. The script is available in this repo. Another example of a preprocessing script can be found in the repo for the DL model GraphDRP.

import sys
from pathlib import Path
from typing import Dict

import pandas as pd
import joblib

# [Req] IMPROVE/CANDLE imports
from improve import framework as frm
from improve import drug_resp_pred as drp

# Model-specific imports
from model_utils.utils import gene_selection, scale_df

filepath = Path(__file__).resolve().parent # [Req]

# ---------------------
# [Req] Parameter lists
# ---------------------
# Two parameter lists are required:
# 1. app_preproc_params
# 2. model_preproc_params
#
# The values for the parameters in both lists should be specified in a
# parameter file that is passed as default_model arg in
# frm.initialize_parameters().

# 1. App-specific params (App: monotherapy drug response prediction)
# Note! This list should not be modified (i.e., no params should be added to
# or removed from the list).
#
# There are two types of params in the list: default and required
# default:   default values should be used
# required:  these params must be specified for the model in the param file
app_preproc_params = [
{"name": "y_data_files", # default
    "type": str,
    "help": "List of files that contain the y (prediction variable) data. \
            Example: [['response.tsv']]",
},
{"name": "x_data_canc_files", # required
    "type": str,
    "help": "List of feature files including gene_system_identifer. Examples: \n\
            1) [['cancer_gene_expression.tsv', ['Gene_Symbol']]] \n\
            2) [['cancer_copy_number.tsv', ['Ensembl', 'Entrez']]].",
},
{"name": "x_data_drug_files", # required
    "type": str,
    "help": "List of feature files. Examples: \n\
            1) [['drug_SMILES.tsv']] \n\
            2) [['drug_SMILES.tsv'], ['drug_ecfp4_nbits512.tsv']]",
},
{"name": "canc_col_name",
    "default": "improve_sample_id", # default
    "type": str,
    "help": "Column name in the y (response) data file that contains the cancer sample ids.",
},
{"name": "drug_col_name", # default
    "default": "improve_chem_id",
    "type": str,
    "help": "Column name in the y (response) data file that contains the drug ids.",
},
]

# 2. Model-specific params (Model: LightGBM)
# All params in model_preproc_params are optional.
# If no params are required by the model, then it should be an empty list.
model_preproc_params = [
{"name": "use_lincs",
    "type": frm.str2bool,
    "default": True,
    "help": "Flag to indicate if landmark genes are used for gene selection.",
},
{"name": "scaling",
    "type": str,
    "default": "std",
    "choice": ["std", "minmax", "miabs", "robust"],
    "help": "Scaler for gene expression and Mordred descriptors data.",
},
{"name": "ge_scaler_fname",
    "type": str,
    "default": "x_data_gene_expression_scaler.gz",
    "help": "File name to save the gene expression scaler object.",
},
{"name": "md_scaler_fname",
    "type": str,
    "default": "x_data_mordred_scaler.gz",
    "help": "File name to save the Mordred scaler object.",
},
]

# [Req] Combine the two lists (the combined parameter list will be passed to
# frm.initialize_parameters() in main()).
preprocess_params = app_preproc_params + model_preproc_params
# ---------------------

# [Req]
def run(params: Dict):
    """ Run data preprocessing.

    Args:
        params (dict): dict of CANDLE/IMPROVE parameters and parsed values.

    Returns:
        str: directory name that was used to save the preprocessed (generated)
            ML data files.
    """

    # ------------------------------------------------------
    # [Req] Build paths and create output dir
    # ------------------------------------------------------
    # Build paths for raw_data, x_data, y_data, splits
    params = frm.build_paths(params)

    # Create output dir for model input data (to save preprocessed ML data)
    frm.create_outdir(outdir=params["ml_data_outdir"])

    # ------------------------------------------------------
    # [Req] Load X data (feature representations)
    # ------------------------------------------------------
    # Use the provided data loaders to load data that is required by the model.
    #
    # Benchmark data includes three dirs: x_data, y_data, splits.
    # The x_data contains files that represent feature information such as
    # cancer representation (e.g., omics) and drug representation (e.g., SMILES).
    #
    # Prediction models utilize various types of feature representations.
    # Drug response prediction (DRP) models generally use omics and drug features.
    #
    # If the model uses omics data types that are provided as part of the benchmark
    # data, then the model must use the provided data loaders to load the data files
    # from the x_data dir.
    print("\nLoad omics data.")
    omics_obj = drp.OmicsLoader(params)
    # print(omics_obj)
    ge = omics_obj.dfs['cancer_gene_expression.tsv'] # return gene expression

    print("\nLoad drugs data.")
    drugs_obj = drp.DrugsLoader(params)
    # print(drugs_obj)
    md = drugs_obj.dfs['drug_mordred.tsv'] # return the Mordred descriptors
    md = md.reset_index()  # TODO. implement reset_index() inside the loader

    # ------------------------------------------------------
    # Further preprocess X data
    # ------------------------------------------------------
    # Gene selection (based on LINCS landmark genes)
    if params["use_lincs"]:
        genes_fpath = filepath/"landmark_genes"
        ge = gene_selection(ge, genes_fpath, canc_col_name=params["canc_col_name"])

    # Prefix gene column names with "ge."
    fea_sep = "."
    fea_prefix = "ge"
    ge = ge.rename(columns={fea: f"{fea_prefix}{fea_sep}{fea}" for fea in ge.columns[1:]})

    # ------------------------------------------------------
    # Create feature scaler
    # ------------------------------------------------------
    # Load and combine responses
    print("Create feature scaler.")
    rsp_tr = drp.DrugResponseLoader(params,
                                    split_file=params["train_split_file"],
                                    verbose=False).dfs["response.tsv"]
    rsp_vl = drp.DrugResponseLoader(params,
                                    split_file=params["val_split_file"],
                                    verbose=False).dfs["response.tsv"]
    rsp = pd.concat([rsp_tr, rsp_vl], axis=0)

    # Retain feature rows that are present in the y data (response dataframe)
    # Intersection of omics features, drug features, and responses
    rsp = rsp.merge(ge[params["canc_col_name"]], on=params["canc_col_name"], how="inner")
    rsp = rsp.merge(md[params["drug_col_name"]], on=params["drug_col_name"], how="inner")
    ge_sub = ge[ge[params["canc_col_name"]].isin(rsp[params["canc_col_name"]])].reset_index(drop=True)
    md_sub = md[md[params["drug_col_name"]].isin(rsp[params["drug_col_name"]])].reset_index(drop=True)

    # Scale gene expression
    _, ge_scaler = scale_df(ge_sub, scaler_name=params["scaling"])
    ge_scaler_fpath = Path(params["ml_data_outdir"]) / params["ge_scaler_fname"]
    joblib.dump(ge_scaler, ge_scaler_fpath)
    print("Scaler object for gene expression: ", ge_scaler_fpath)

    # Scale Mordred descriptors
    _, md_scaler = scale_df(md_sub, scaler_name=params["scaling"])
    md_scaler_fpath = Path(params["ml_data_outdir"]) / params["md_scaler_fname"]
    joblib.dump(md_scaler, md_scaler_fpath)
    print("Scaler object for Mordred:         ", md_scaler_fpath)

    del rsp, rsp_tr, rsp_vl, ge_sub, md_sub

    # ------------------------------------------------------
    # [Req] Construct ML data for every stage (train, val, test)
    # ------------------------------------------------------
    # All models must load response data (y data) using DrugResponseLoader().
    # Below, we iterate over the 3 split files (train, val, test) and load
    # response data, filtered by the split ids from the split files.

    # Dict with split files corresponding to the three sets (train, val, and test)
    stages = {"train": params["train_split_file"],
              "val": params["val_split_file"],
              "test": params["test_split_file"]}

    for stage, split_file in stages.items():

        # --------------------------------
        # [Req] Load response data
        # --------------------------------
        rsp = drp.DrugResponseLoader(params,
                                     split_file=split_file,
                                     verbose=False).dfs["response.tsv"]

        # --------------------------------
        # Data prep
        # --------------------------------
        # Retain (canc, drug) responses for which both omics and drug features
        # are available.
        rsp = rsp.merge(ge[params["canc_col_name"]], on=params["canc_col_name"], how="inner")
        rsp = rsp.merge(md[params["drug_col_name"]], on=params["drug_col_name"], how="inner")
        ge_sub = ge[ge[params["canc_col_name"]].isin(rsp[params["canc_col_name"]])].reset_index(drop=True)
        md_sub = md[md[params["drug_col_name"]].isin(rsp[params["drug_col_name"]])].reset_index(drop=True)

        # Scale features
        ge_sc, _ = scale_df(ge_sub, scaler=ge_scaler) # scale gene expression
        md_sc, _ = scale_df(md_sub, scaler=md_scaler) # scale Mordred descriptors

        # --------------------------------
        # [Req] Save ML data files in params["ml_data_outdir"]
        # The implementation of this step depends on the model.
        # --------------------------------
        # [Req] Build data name
        data_fname = frm.build_ml_data_name(params, stage)

        print("Merge data")
        data = rsp.merge(ge_sc, on=params["canc_col_name"], how="inner")
        data = data.merge(md_sc, on=params["drug_col_name"], how="inner")
        data = data.sample(frac=1.0).reset_index(drop=True) # shuffle

        print("Save data")
        data = data.drop(columns=["study"]) # to_parquet() throws an error since "study" contains mixed values
        data.to_parquet(Path(params["ml_data_outdir"])/data_fname) # saves ML data file to parquet

        # Prepare the y dataframe for the current stage
        fea_list = ["ge", "mordred"]
        fea_cols = [c for c in data.columns if (c.split(fea_sep)[0]) in fea_list]
        meta_cols = [c for c in data.columns if (c.split(fea_sep)[0]) not in fea_list]
        ydf = data[meta_cols]

        # [Req] Save y dataframe for the current stage
        frm.save_stage_ydf(ydf, params, stage)

    return params["ml_data_outdir"]

# [Req]
def main(args):
    # [Req]
    additional_definitions = preprocess_params
    params = frm.initialize_parameters(
        filepath,
        default_model="lgbm_params.txt",
        additional_definitions=additional_definitions,
        required=None,
    )
    ml_data_outdir = run(params)
    print("\nFinished data preprocessing.")

# [Req]
if __name__ == "__main__":
    main(sys.argv[1:])
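
The script above prefixes gene expression columns with "ge." and later separates feature columns from metadata columns by that prefix. The sketch below illustrates this logic on plain lists of column names. The column names (including "auc" and "mordred.ABC") are made up for illustration, and the assumption that Mordred columns already carry a "mordred." prefix is inferred from fea_list in the script.

```python
fea_sep = "."
fea_list = ["ge", "mordred"]

# Illustrative column names; real names come from the benchmark data.
ge_columns = ["improve_sample_id", "TP53", "BRCA1"]

# Prefix every column except the first (the sample id), as in the script.
renamed = [ge_columns[0]] + [f"ge{fea_sep}{c}" for c in ge_columns[1:]]

# Columns of a merged dataframe (assumed layout for illustration).
columns = ["improve_chem_id", "auc"] + renamed + ["mordred.ABC"]

# Split columns by the prefix that precedes the separator.
fea_cols = [c for c in columns if c.split(fea_sep)[0] in fea_list]
meta_cols = [c for c in columns if c.split(fea_sep)[0] not in fea_list]
print(fea_cols)
print(meta_cols)
```

Only the metadata columns end up in the y dataframe that is saved for each stage.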

As mentioned earlier, all the required code sections are designated with [Req]. One of the requirements is to define two lists of dictionaries: app_preproc_params and model_preproc_params. Each dict specifies the keyword arguments of one parameter.
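
To give a feel for how such dictionaries can act as keyword arguments, the sketch below feeds two definitions to Python's argparse. This is only an analogy: the parsing in the actual scripts is handled by frm.initialize_parameters(), whose internals are not described here. The str2bool function is a stand-in for frm.str2bool with assumed behaviour.

```python
import argparse


def str2bool(value: str) -> bool:
    """Stand-in for frm.str2bool (assumed behaviour: parse common true strings)."""
    return str(value).lower() in ("true", "t", "yes", "1")


param_defs = [
    {"name": "use_lincs", "type": str2bool, "default": True,
     "help": "Flag to indicate if landmark genes are used for gene selection."},
    {"name": "scaling", "type": str, "default": "std",
     "help": "Scaler for gene expression and Mordred descriptors data."},
]

parser = argparse.ArgumentParser()
for definition in param_defs:
    kwargs = dict(definition)      # copy so the original definition is untouched
    name = kwargs.pop("name")      # "name" becomes the option string
    parser.add_argument(f"--{name}", **kwargs)

# Parse an explicit argument list instead of sys.argv
args = parser.parse_args(["--scaling", "minmax", "--use_lincs", "false"])
print(args.scaling, args.use_lincs)
```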

The app_preproc_params list is a collection of application-specific parameters for the preprocessing step. The application in this case is monotherapy drug response prediction. This list should be copied to the script as is. There are two types of params in this list: default and required.
  • default: default values should be used

  • required: the values for params must be specified in the parameter file

app_preproc_params = [
{"name": "y_data_files", # default
    "type": str,
    "help": "List of files that contain the y (prediction variable) data. \
            Example: [['response.tsv']]",
},
{"name": "x_data_canc_files", # required
    "type": str,
    "help": "List of feature files including gene_system_identifer. Examples: \n\
            1) [['cancer_gene_expression.tsv', ['Gene_Symbol']]] \n\
            2) [['cancer_copy_number.tsv', ['Ensembl', 'Entrez']]].",
},
{"name": "x_data_drug_files", # required
    "type": str,
    "help": "List of feature files. Examples: \n\
            1) [['drug_SMILES.tsv']] \n\
            2) [['drug_SMILES.tsv'], ['drug_ecfp4_nbits512.tsv']]",
},
{"name": "canc_col_name",
    "default": "improve_sample_id", # default
    "type": str,
    "help": "Column name in the y (response) data file that contains the cancer sample ids.",
},
{"name": "drug_col_name", # default
    "default": "improve_chem_id",
    "type": str,
    "help": "Column name in the y (response) data file that contains the drug ids.",
},
]
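
A quick way to confirm that the required params were supplied is to check the parsed parameter dictionary. The helper missing_required below is hypothetical and not part of the IMPROVE library; the two required names are the ones marked "# required" in the list above.

```python
required_names = ["x_data_canc_files", "x_data_drug_files"]


def missing_required(params: dict, required: list) -> list:
    """Return required parameter names that are absent or None (hypothetical helper)."""
    return [name for name in required if params.get(name) is None]


# Example parameter dict in which the drug feature files were not specified
params = {
    "y_data_files": [["response.tsv"]],
    "x_data_canc_files": [["cancer_gene_expression.tsv", ["Gene_Symbol"]]],
}
print(missing_required(params, required_names))
```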

The model_preproc_params list is a collection of model-specific parameters for the preprocessing step. All params in this list are optional. If the model requires no such params, the list should be empty.

model_preproc_params = [
    {"name": "use_lincs",
    "type": frm.str2bool,
    "default": True,
    "help": "Flag to indicate if landmark genes are used for gene selection.",
    },
    {"name": "scaling",
    "type": str,
    "default": "std",
    "choice": ["std", "minmax", "miabs", "robust"],
    "help": "Scaler for gene expression and Mordred descriptors data.",
    },
    {"name": "ge_scaler_fname",
    "type": str,
    "default": "x_data_gene_expression_scaler.gz",
    "help": "File name to save the gene expression scaler object.",
    },
    {"name": "md_scaler_fname",
    "type": str,
    "default": "x_data_mordred_scaler.gz",
    "help": "File name to save the Mordred scaler object.",
    },
]

Training

The training script is used for model training as well as hyperparameter optimization (HPO). The script generates a trained model, along with model predictions and prediction performance scores computed on the validation data. The naming convention for the training script is MODELNAME_train_improve.py. For example: lgbm_train_improve.py.

All the outputs from the training script are saved in params["model_outdir"].

Outputs from running the training script:
  1. Trained model.

    The training script loads the train and validation data that were generated during the preprocessing step. The train data and validation data are used for, respectively, model training and early stopping. When the model converges (i.e., prediction performance stops improving on validation data), the model is saved into a file. The model file name and file format are specified by, respectively, params["model_file_name"] and params["model_file_format"]. For example:

    for LightGBM model: model.txt
    for GraphDRP model: model.pt
  2. Predictions on validation data.

    Model predictions are calculated using the trained model on validation data. The predictions are saved as a dataframe in val_y_data_predicted.csv.

  3. Prediction performance scores on validation data.

    The performance scores are calculated using the model predictions and the true Y values for the performance metrics specified in the metrics_list. The scores are saved as json in val_scores.json.
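
For intuition about the scores, the sketch below computes four of the metrics with the standard library only and prints them as JSON. It assumes that "mse", "rmse", "pcc", and "r2" stand for mean squared error, root mean squared error, Pearson correlation coefficient, and coefficient of determination; "scc" is omitted for brevity. This is not the implementation used by the IMPROVE library, and the function name compute_scores is hypothetical.

```python
import json
import math


def compute_scores(y_true, y_pred):
    """Compute mse, rmse, pcc, and r2 with the standard library only."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)             # total sum of squares
    ss_pred = sum((p - mean_p) ** 2 for p in y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "pcc": cov / math.sqrt(ss_tot * ss_pred),
        "r2": 1.0 - ss_res / ss_tot,
    }


scores = compute_scores([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(json.dumps(scores, indent=4))
```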

Below is a training script that takes the generated data from the preprocessing step and trains a LightGBM model. This script is available in this repo. Another example for a training script can be found in a repo for the GraphDRP model.

import sys
from pathlib import Path
from typing import Dict

import pandas as pd
import lightgbm as lgb

# [Req] IMPROVE/CANDLE imports
from improve import framework as frm

# Model-specific imports
from model_utils.utils import extract_subset_fea

# [Req] Imports from preprocess script
from lgbm_preprocess_improve import preprocess_params

filepath = Path(__file__).resolve().parent # [Req]

# ---------------------
# [Req] Parameter lists
# ---------------------
# Two parameter lists are required:
# 1. app_train_params
# 2. model_train_params
#
# The values for the parameters in both lists should be specified in a
# parameter file that is passed as default_model arg in
# frm.initialize_parameters().

# 1. App-specific params (App: monotherapy drug response prediction)
# Currently, there are no app-specific params for this script.
app_train_params = []

# 2. Model-specific params (Model: LightGBM)
# All params in model_train_params are optional.
# If no params are required by the model, then it should be an empty list.
model_train_params = [
    {"name": "learning_rate",
    "type": float,
    "default": 0.1,
    "help": "Learning rate for the optimizer."
    },
]

# Combine the two lists (the combined parameter list will be passed to
# frm.initialize_parameters() in main()).
train_params = app_train_params + model_train_params
# ---------------------

# [Req] List of metrics names to compute prediction performance scores
metrics_list = ["mse", "rmse", "pcc", "scc", "r2"]


# [Req]
def run(params: Dict):
    """ Run model training.

    Args:
        params (dict): dict of CANDLE/IMPROVE parameters and parsed values.

    Returns:
        dict: prediction performance scores computed on validation data
            according to the metrics_list.
    """
    # ------------------------------------------------------
    # [Req] Create output dir and build model path
    # ------------------------------------------------------
    # Create output dir for trained model, val set predictions, val set
    # performance scores
    frm.create_outdir(outdir=params["model_outdir"])

    # Build model path
    modelpath = frm.build_model_path(params, model_dir=params["model_outdir"])

    # ------------------------------------------------------
    # [Req] Create data names for train and val sets
    # ------------------------------------------------------
    train_data_fname = frm.build_ml_data_name(params, stage="train")
    val_data_fname = frm.build_ml_data_name(params, stage="val")

    # ------------------------------------------------------
    # Load model input data (ML data)
    # ------------------------------------------------------
    tr_data = pd.read_parquet(Path(params["train_ml_data_dir"])/train_data_fname)
    vl_data = pd.read_parquet(Path(params["val_ml_data_dir"])/val_data_fname)

    fea_list = ["ge", "mordred"]
    fea_sep = "."

    # Train data
    xtr = extract_subset_fea(tr_data, fea_list=fea_list, fea_sep=fea_sep)
    ytr = tr_data[[params["y_col_name"]]]
    print("xtr:", xtr.shape)
    print("ytr:", ytr.shape)

    # Val data
    xvl = extract_subset_fea(vl_data, fea_list=fea_list, fea_sep=fea_sep)
    yvl = vl_data[[params["y_col_name"]]]
    print("xvl:", xvl.shape)
    print("yvl:", yvl.shape)

    # ------------------------------------------------------
    # Prepare, train, and save model
    # ------------------------------------------------------
    # Prepare model and train settings
    ml_init_args = {'n_estimators': 1000, 'max_depth': -1,
                    'learning_rate': params["learning_rate"],
                    'num_leaves': 31, 'n_jobs': 8, 'random_state': None}
    model = lgb.LGBMRegressor(objective='regression', **ml_init_args)

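    # Train model
    # Note: LightGBM >= 4.0 removed the 'verbose' and 'early_stopping_rounds'
    # arguments of fit(); newer versions use callbacks such as
    # lgb.early_stopping() instead.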
    ml_fit_args = {'verbose': False, 'early_stopping_rounds': 50}
    ml_fit_args['eval_set'] = (xvl, yvl)
    model.fit(xtr, ytr, **ml_fit_args)

    # Save model
    model.booster_.save_model(str(modelpath))
    del model

    # ------------------------------------------------------
    # Load best model and compute predictions
    # ------------------------------------------------------
    # Load the best saved model (as determined based on val data)
    model = lgb.Booster(model_file=str(modelpath))

    # Compute predictions
    val_pred = model.predict(xvl)
    val_true = yvl.values.squeeze()

    # ------------------------------------------------------
    # [Req] Save raw predictions in dataframe
    # ------------------------------------------------------
    frm.store_predictions_df(
        params,
        y_true=val_true, y_pred=val_pred, stage="val",
        outdir=params["model_outdir"]
    )

    # ------------------------------------------------------
    # [Req] Compute performance scores
    # ------------------------------------------------------
    val_scores = frm.compute_performace_scores(
        params,
        y_true=val_true, y_pred=val_pred, stage="val",
        outdir=params["model_outdir"], metrics=metrics_list
    )

    return val_scores

# [Req]
def main(args):
    # [Req]
    additional_definitions = preprocess_params + train_params
    params = frm.initialize_parameters(
        filepath,
        default_model="lgbm_params.txt",
        additional_definitions=additional_definitions,
        required=None,
    )
    val_scores = run(params)
    print("\nFinished model training.")

# [Req]
if __name__ == "__main__":
    main(sys.argv[1:])

Similar to the preprocessing script, the training script requires defining two parameter lists: app_train_params and model_train_params.

# Currently, there are no app-specific params for this script.
app_train_params = []

# All params in model_train_params are optional.
# If no params are required by the model, then it should be an empty list.
model_train_params = [
    {"name": "learning_rate",
    "type": float,
    "default": 0.1,
    "help": "Learning rate for the optimizer."
    },
]
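
The training script relies on early stopping (early_stopping_rounds of 50) to decide when the model has converged. The general idea can be sketched in plain Python: track the best validation loss and stop once it has not improved for a given number of rounds. The function best_iteration is hypothetical and only illustrates the concept; it is not LightGBM's implementation.

```python
def best_iteration(val_losses, patience):
    """Return the index of the best validation loss, stopping once the loss
    has not improved for `patience` consecutive iterations."""
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds
    return best_idx


losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
print(best_iteration(losses, patience=3))
```

With a patience of 3, training stops before the final, lower loss is reached, which shows the trade-off between a short patience and the risk of stopping too early.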

Inference

The inference script runs the trained model in inference mode to compute predictions on input data. The script generates model predictions and prediction performance scores for the test data. The naming convention for the inference script is MODELNAME_infer_improve.py. For example: lgbm_infer_improve.py.

All the outputs from the inference script are saved in params["infer_outdir"].

Outputs from running the inference script:
  1. Predictions on test data.

    Model predictions are calculated using the trained model on test data. The predictions are saved as a dataframe in test_y_data_predicted.csv.

  2. Prediction performance scores on test data.

    The performance scores are calculated using the model predictions and the true Y values for the performance metrics specified in the metrics_list. The scores are saved as json in test_scores.json.
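
The predictions file can be pictured as a table of true and predicted values. The sketch below writes such a table with the csv module. The helper save_predictions and the column names "y_true" and "y_pred" are assumptions made for illustration; in the actual scripts, frm.store_predictions_df() saves the file, and its column layout may differ.

```python
import csv
import tempfile
from pathlib import Path


def save_predictions(y_true, y_pred, outdir, stage):
    """Write true and predicted values side by side (hypothetical helper)."""
    outpath = Path(outdir) / f"{stage}_y_data_predicted.csv"
    with open(outpath, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["y_true", "y_pred"])  # assumed column names
        writer.writerows(zip(y_true, y_pred))
    return outpath


# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = save_predictions([0.5, 0.7], [0.4, 0.8], tmp, stage="test")
    saved_name = path.name
    saved_text = path.read_text()
print(saved_name)
print(saved_text)
```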

Below is an inference script that takes the test data generated in the preprocessing step and the LightGBM model trained in the training step. This script is available in this repo. Another example of an inference script can be found in a repo for the GraphDRP model.

import sys
from pathlib import Path
from typing import Dict

import pandas as pd
import lightgbm as lgb

# [Req] IMPROVE/CANDLE imports
from improve import framework as frm
from improve.metrics import compute_metrics

# Model-specific imports
from model_utils.utils import extract_subset_fea

# [Req] Imports from preprocess and train scripts
from lgbm_preprocess_improve import preprocess_params
from lgbm_train_improve import metrics_list, train_params

filepath = Path(__file__).resolve().parent # [Req]

# ---------------------
# [Req] Parameter lists
# ---------------------
# Two parameter lists are required:
# 1. app_infer_params
# 2. model_infer_params
#
# The values for the parameters in both lists should be specified in a
# parameter file that is passed as default_model arg in
# frm.initialize_parameters().

# 1. App-specific params (App: monotherapy drug response prediction)
# Currently, there are no app-specific params in this script.
app_infer_params = []

# 2. Model-specific params (Model: LightGBM)
# All params in model_infer_params are optional.
# If no params are required by the model, then it should be an empty list.
model_infer_params = []

# [Req] Combine the two lists (the combined parameter list will be passed to
# frm.initialize_parameters() in main()).
infer_params = app_infer_params + model_infer_params
# ---------------------

# [Req]
def run(params: Dict):
    """ Run model inference.

    Args:
        params (dict): dict of CANDLE/IMPROVE parameters and parsed values.

    Returns:
        dict: prediction performance scores computed on test data according
            to the metrics_list.
    """

    # ------------------------------------------------------
    # [Req] Create output dir
    # ------------------------------------------------------
    frm.create_outdir(outdir=params["infer_outdir"])

    # ------------------------------------------------------
    # [Req] Create data name for test set
    # ------------------------------------------------------
    test_data_fname = frm.build_ml_data_name(params, stage="test")

    # ------------------------------------------------------
    # Load model input data (ML data)
    # ------------------------------------------------------
    te_data = pd.read_parquet(Path(params["test_ml_data_dir"])/test_data_fname)

    fea_list = ["ge", "mordred"]
    fea_sep = "."

    # Test data
    xte = extract_subset_fea(te_data, fea_list=fea_list, fea_sep=fea_sep)
    yte = te_data[[params["y_col_name"]]]

    # ------------------------------------------------------
    # Load best model and compute predictions
    # ------------------------------------------------------
    # Build model path
    modelpath = frm.build_model_path(params, model_dir=params["model_dir"]) # [Req]

    # Load LightGBM
    model = lgb.Booster(model_file=str(modelpath))

    # Predict
    test_pred = model.predict(xte)
    test_true = yte.values.squeeze()

    # ------------------------------------------------------
    # [Req] Save raw predictions in dataframe
    # ------------------------------------------------------
    frm.store_predictions_df(
        params,
        y_true=test_true, y_pred=test_pred, stage="test",
        outdir=params["infer_outdir"]
    )

    # ------------------------------------------------------
    # [Req] Compute performance scores
    # ------------------------------------------------------
    test_scores = frm.compute_performace_scores(
        params,
        y_true=test_true, y_pred=test_pred, stage="test",
        outdir=params["infer_outdir"], metrics=metrics_list
    )

    return test_scores

# [Req]
def main(args):
    # [Req]
    additional_definitions = preprocess_params + train_params + infer_params
    params = frm.initialize_parameters(
        filepath,
        default_model="lgbm_params.txt",
        additional_definitions=additional_definitions,
        required=None,
    )
    test_scores = run(params)
    print("\nFinished model inference.")

# [Req]
if __name__ == "__main__":
    main(sys.argv[1:])

Similar to the training script, the inference script requires defining two parameter lists: app_infer_params and model_infer_params. In the case of LightGBM, both lists are empty.

# Currently, there are no app-specific params in this script.
app_infer_params = []

# All params in model_infer_params are optional.
# If no params are required by the model, then it should be an empty list.
model_infer_params = []
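
Each script passes the parameter lists of the preceding stages together with its own list to frm.initialize_parameters() (for inference: preprocess_params + train_params + infer_params). Because the lists are concatenated, it can be useful to check that no parameter name is defined twice. The helper duplicate_names is a hypothetical sanity check, not part of the IMPROVE library, and the shortened lists are for illustration only.

```python
from collections import Counter


def duplicate_names(*param_lists):
    """Return parameter names defined more than once across the given lists."""
    counts = Counter(p["name"] for params in param_lists for p in params)
    return sorted(name for name, n in counts.items() if n > 1)


# Shortened stand-ins for the lists defined in the three scripts
preprocess_params = [{"name": "scaling"}, {"name": "use_lincs"}]
train_params = [{"name": "learning_rate"}]
infer_params = []

print(duplicate_names(preprocess_params, train_params, infer_params))
```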

Parameter file

The parameter file is a text file that contains default parameters for all three scripts. The path to this file is passed to frm.initialize_parameters() as the default_model argument. The functionality behind frm.initialize_parameters() is provided by the CANDLE library.

Example of passing the parameter file to frm.initialize_parameters().

filepath = Path(__file__).resolve().parent

params = frm.initialize_parameters(
    filepath,
    default_model="lgbm_params.txt",
    additional_definitions=additional_definitions,
    required=None,
)

Example showing the content of the parameter file for LightGBM.

[Global_Params]
model_name = "LGBM"

[Preprocess]
train_split_file = "CCLE_split_0_train.txt"
val_split_file = "CCLE_split_0_val.txt"
test_split_file = "CCLE_split_0_test.txt"
ml_data_outdir = "./ml_data/CCLE-CCLE/split_0"
data_format = ".parquet"
y_data_files = [["response.tsv"]]
x_data_canc_files = [["cancer_gene_expression.tsv", ["Gene_Symbol"]]]
x_data_drug_files = [["drug_mordred.tsv"]]
use_lincs = True
scaling = "std"

[Train]
train_ml_data_dir = "./ml_data/CCLE-CCLE/split_0"
val_ml_data_dir = "./ml_data/CCLE-CCLE/split_0"
model_outdir = "./out_models/CCLE/split_0"
model_file_name = "model"
model_file_format = ".txt"

[Infer]
test_ml_data_dir = "./ml_data/CCLE-CCLE/split_0"
model_dir = "./out_models/CCLE/split_0"
infer_outdir = "./out_infer/CCLE-CCLE/split_0"
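
The file uses an INI-style layout with one section per script, and its values look like Python literals. The sketch below reads a shortened copy of the file with the standard library to show the structure. This is only an illustration: CANDLE performs its own parsing, and flattening all sections into a single dictionary is an assumption made here.

```python
import ast
import configparser

# Shortened copy of the parameter file shown above
PARAM_TEXT = """
[Global_Params]
model_name = "LGBM"

[Preprocess]
data_format = ".parquet"
y_data_files = [["response.tsv"]]
use_lincs = True

[Train]
model_file_name = "model"
model_file_format = ".txt"
"""

config = configparser.ConfigParser()
config.read_string(PARAM_TEXT)

# Flatten all sections into one dict, evaluating each value as a Python literal.
params = {
    key: ast.literal_eval(value)
    for section in config.sections()
    for key, value in config[section].items()
}
print(params["model_name"], params["y_data_files"], params["use_lincs"])
```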

References

1. A. Partin et al. “Deep learning methods for drug response prediction in cancer: Predominant and emerging trends”, Frontiers in Medicine, Section Prediction Oncology, 2023