Training Script Example with XGBoost for DRP
This script, xgboostdrp_train_improve.py, executes XGBoost training with the processed data from the preprocess step. The script generates a trained model, model predictions, and prediction performance scores calculated on the validation data. When the model converges (i.e., prediction performance stops improving on the validation data), the model is saved to a file. The naming convention for the training script is <MODEL>_train_improve.py.
Inputs:
The directory containing the preprocessed data is set with params['input_dir'], which should be the same directory that params['output_dir'] pointed to in the preprocess script.
The train data and val data are loaded and used for model training and early stopping, respectively.
Processed x data: Processed feature data for the XGBoost model.
- train_data.parquet
- val_data.parquet
Processed y data: Files with true y values and IDs (additional metadata is acceptable).
- train_y_data.csv
- val_y_data.csv
Outputs:
All the outputs from the train script are saved in params['output_dir'].
Trained model: The model file name and file format are specified by params["model_file_name"] and params["model_file_format"], respectively.
- model.json
Predictions on validation data: The predictions from the best model.
- val_y_data_predicted.csv
Prediction performance scores on validation data: The performance scores are calculated using the model predictions and the true y values. The metrics calculated differ for regression vs. classification models, as specified by params['metric_type'].
- val_scores.json
Parameter log: Log of the params dictionary.
- param_log_file.txt
Timer log: Log of the runtime recorded by the Timer.
- runtime_train.json
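As a quick check after a run, you can list the output directory and compare it to the files above. This is a minimal sketch; the directory name exp_result is a hypothetical stand-in for your params['output_dir'], and the model file name depends on params["model_file_name"] and params["model_file_format"]:

from pathlib import Path

# Hypothetical output directory; substitute your params["output_dir"].
output_dir = Path("exp_result")

# Expected files (with the default names described above):
# model.json, val_y_data_predicted.csv, val_scores.json,
# param_log_file.txt, runtime_train.json
for f in sorted(output_dir.iterdir()):
    print(f.name)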
Training Imports
Similar to the imports for the preprocess script, this section consists of five basic components:
Import basic functionality: sys, Path, etc.
Import core improvelib functionality: we import this as ‘frm’ for historical reasons.
Import application-specific functionality: this will need to be changed if curating a model for another application.
Import model-specific functionality: at minimum this should include the model parameter definitions, but it can also include other packages your model requires (Polars, numpy, etc.).
Get file path: filepath is used by the Config and this line should be present as is.
Notice that here we are using the config for train and we are importing XGBoost under the model-specific imports:
import sys
from pathlib import Path
import pandas as pd
# Core improvelib imports
import improvelib.utils as frm
# Application-specific (DRP) imports
from improvelib.applications.drug_response_prediction.config import DRPTrainConfig
# Model-specific imports
from model_params_def import train_params
import xgboost as xgb
filepath = Path(__file__).resolve().parent
Note
You may need to add imports if your model requires other packages, and you may need to change the application-specific imports if you are using another application.
Training run()
This function contains the bulk of the code: it loads the preprocessed data, trains the model, and saves the model, predictions, and performance scores on the validation data. Here we walk through the function.
Define the function. The parameter dictionary must be passed to this function:
def run(params):
Build model path with build_model_path and create the data names for train and val sets with build_ml_data_file_name. This ensures consistency in naming. This code should be the same for all models:
modelpath = frm.build_model_path(
    model_file_name=params["model_file_name"],
    model_file_format=params["model_file_format"],
    model_dir=params["output_dir"]
)
train_data_fname = frm.build_ml_data_file_name(data_format=params["data_format"], stage="train")
val_data_fname = frm.build_ml_data_file_name(data_format=params["data_format"], stage="val")
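For reference, these helpers are expected to reproduce the input file names listed above (the values in the comments are an assumption based on this example's data_format):

# Assuming params["data_format"] == ".parquet" as in this example:
# train_data_fname -> "train_data.parquet"
# val_data_fname   -> "val_data.parquet"
print(train_data_fname, val_data_fname)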
Load the train and val data. Here we load the x data with the response values for each stage, isolate the y data, and drop the y column from the remaining feature data:
train_data = pd.read_parquet(Path(params["input_dir"]) / train_data_fname)
val_data = pd.read_parquet(Path(params["input_dir"]) / val_data_fname)
# Train data
ytr = train_data[[params["y_col_name"]]]
xtr = train_data.drop(columns=[params['y_col_name']])
# Val data
yvl = val_data[[params["y_col_name"]]]
xvl = val_data.drop(columns=[params['y_col_name']])
Note
This may differ based on how you saved the x data in preprocess.
Prepare, train, and save the model. Here we create a dictionary of XGBoost parameters whose values are set to the IMPROVE parameters, so that if a parameter is changed (e.g., on the command line or via HPO workflows) it is properly used in the XGBoost model. We also initialize the XGBRegressor, fit the model using the validation set for early stopping, save the model to modelpath, and delete the model to free memory.
xgb_args = {'learning_rate': params['learning_rate'],
            'n_estimators': params['epochs'],
            'early_stopping_rounds': params['patience'],
            'max_depth': params['max_depth'],
            'min_child_weight': params['min_child_weight'],
            'subsample': params['subsample'],
            'colsample_bytree': params['colsample_bytree'],
            'gamma': params['gamma'],
            'lambda': params['lambda'],
            'alpha': params['alpha'],
            }
model = xgb.XGBRegressor(objective='reg:squarederror', **xgb_args)
model.fit(xtr, ytr, eval_set=[(xvl, yvl)])
model.save_model(str(modelpath))
del model
Note
This will look very different depending on the implementation of your model. Key aspects to include are listed here, with a generic sketch after the list:
- Using the improvelib parameters params['learning_rate'], params['epochs'], and params['patience'] (see Train Parameters for other parameters).
- Unhardcoding other model-specific parameters that users may want to change (see Configuration Example with XGBoost for DRP and Creating Model-Specific Parameters for more details).
- Using the validation set for early stopping.
- Saving the model with the modelpath.
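As an illustration of these key aspects for an epoch-based model, a rough sketch follows. This is not improvelib or XGBoost API; train_one_epoch, evaluate, and save_checkpoint are hypothetical placeholders for your model's own routines:

best_val_loss = float("inf")
epochs_without_improvement = 0
for epoch in range(params["epochs"]):
    train_one_epoch(model, xtr, ytr, lr=params["learning_rate"])  # hypothetical helper
    val_loss = evaluate(model, xvl, yvl)                          # hypothetical helper
    if val_loss < best_val_loss:
        # Validation loss improved: reset patience and save the best model.
        best_val_loss = val_loss
        epochs_without_improvement = 0
        save_checkpoint(model, modelpath)                         # hypothetical helper
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= params["patience"]:
            break  # early stopping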
Load best model and compute predictions.
# Load the best saved model (as determined based on val data)
model = xgb.XGBRegressor()
model.load_model(str(modelpath))
# Compute predictions
val_pred = model.predict(xvl)
val_true = yvl.values.squeeze()
Note
The implementation of this will vary depending on your model. The predictions and ground truth should be numpy arrays.
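If you are unsure whether your arrays have the right form, an optional sanity check like the following sketch can catch shape mismatches before scoring:

import numpy as np

# Both should end up as 1-D numpy arrays of equal length.
val_pred = np.asarray(val_pred).squeeze()
val_true = np.asarray(val_true).squeeze()
assert val_pred.shape == val_true.shape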
Save raw predictions in a dataframe. The predictions from the model on the validation data, as well as the ground truth, are saved with store_predictions_df.
frm.store_predictions_df(
    y_true=val_true,
    y_pred=val_pred,
    stage="val",
    y_col_name=params["y_col_name"],
    output_dir=params["output_dir"],
    input_dir=params["input_dir"]
)
Note
The variables val_true and val_pred can be whatever you use in your script, but the rest should be the same. params['y_col_name'] can be changed in the configuration file.
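To inspect the saved predictions, you can read the file back. The file name val_y_data_predicted.csv matches the Outputs section above; the exact columns depend on your y data file and params['y_col_name']:

import pandas as pd
from pathlib import Path

# Read back the saved validation predictions for a quick look.
preds_df = pd.read_csv(Path(params["output_dir"]) / "val_y_data_predicted.csv")
print(preds_df.head())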
Compute performance scores. Performance scores on the validation data are calculated and saved with compute_performance_scores.
val_scores = frm.compute_performance_scores(
    y_true=val_true,
    y_pred=val_pred,
    stage="val",
    metric_type=params["metric_type"],
    output_dir=params["output_dir"]
)
Note
The variables val_true and val_pred can be whatever you use in your script, but the rest should be the same. params['metric_type'] can be changed in the configuration file if you are using a classification model instead of a regression model.
Return the validation scores:
return val_scores
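val_scores is a dictionary mapping metric names to values. For a regression metric_type it might look like the comment below (metric names and values are illustrative, not a guaranteed output):

# Illustrative only; the actual metrics depend on params["metric_type"].
# val_scores ~ {"mse": 0.0123, "rmse": 0.1109, "r2": 0.84, ...}
print(val_scores)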
Training main() and main guard
Similar to the main() function in preprocess, this is called upon script execution; it gets the parameters, calls run(), and records the time it takes for the model to run. Each line is explained below:
- The first line (cfg = DRPTrainConfig()) initializes the configuration object for each script as appropriate.
- The second line initializes the parameters. Parameters set on the command line (e.g. --input_dir /my/path/to/dir) take precedence over the values in the config file, which take precedence over the default values provided by improvelib. pathToModelDir is the current path in the system; filepath is already present in the template via filepath = Path(__file__).resolve().parent. default_config is the default configuration file, as a string. additional_definitions is the list of model-specific parameters.
- The third line initializes the Timer.
- The fourth line calls run() with the parameters. As discussed, run() contains the model code.
- The fifth line ends the Timer and saves the time to a JSON file in the output_dir.
- The last (optional) line prints a message indicating that the script is finished and ran successfully.
def main(args):
    cfg = DRPTrainConfig()
    params = cfg.initialize_parameters(pathToModelDir=filepath,
                                       default_config="xgboostdrp_params.ini",
                                       additional_definitions=train_params)
    timer_train = frm.Timer()
    val_scores = run(params)
    timer_train.save_timer(dir_to_save=params["output_dir"],
                           filename='runtime_train.json',
                           extra_dict={"stage": "train"})
    print("\nFinished model training.")
Note
You will need to change the name of default_config to the one for your model, and the Config if you are using an application other than DRP.
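For example, if your model were named mymodel, the parameter initialization might look like the sketch below (the file name mymodel_params.ini is hypothetical; non-DRP applications would also swap DRPTrainConfig for the appropriate Config class):

# Hypothetical adaptation for a model named "mymodel".
params = cfg.initialize_parameters(pathToModelDir=filepath,
                                   default_config="mymodel_params.ini",
                                   additional_definitions=train_params)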
The main guard below prevents unintended execution and should be present as is:
if __name__ == "__main__":
    main(sys.argv[1:])