v0.1.0-alpha
To ensure compatibility with the IMPROVE software release v0.1.0-alpha, please update your curated model. Follow the instructions below and refer to the checklist at the bottom of the page. In addition, use models GraphDRP and LGBM as examples. TODO make sure links for the models are correct!
Overview
IMPROVE version 0.1.0-alpha aims to expand the user base and encourage broader adoption of the software. This version features updates to accommodate various users and contributors, both internal and external, including those involved with the development of the core IMPROVE library, application-specific modules (such as drug response prediction and drug property prediction), benchmark datasets, and model contributions. Additionally, this version provides a simplified and more user-friendly interface, as demonstrated by intuitive help outputs, comprehensive READMEs, and documentation that facilitate easy switching between versions.
This version is now available on PyPI for pip installation. TODO: update PyPI and link here
Parameters
Parameters are detailed in the IMPROVE API. Note that the parameters for each step (i.e. preprocess, train, infer) are now defined separately.
Deprecated Parameters
Preprocess
ml_data_outdir is now output_dir.
Train
train_ml_data_dir is now input_dir.
val_ml_data_dir is now input_dir.
model_outdir is now output_dir.
y_data_preds_suffix, json_scores_suffix, and pred_col_name_suffix are now hard-coded.
Infer
test_ml_data_dir is now input_data_dir.
model_dir is now input_model_dir.
infer_outdir is now output_dir.
y_data_preds_suffix, json_scores_suffix, and pred_col_name_suffix are now hard-coded.
test_batch is now infer_batch.
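The renames above can be applied mechanically to an old parameter dictionary. The following is a minimal sketch, not part of improvelib: the helper name migrate_params and the example values are hypothetical, and only the name mapping itself is taken from the list above.

```python
# Old-to-new parameter names per script, transcribed from the list above.
RENAMED = {
    "preprocess": {"ml_data_outdir": "output_dir"},
    "train": {
        "train_ml_data_dir": "input_dir",
        "val_ml_data_dir": "input_dir",
        "model_outdir": "output_dir",
    },
    "infer": {
        "test_ml_data_dir": "input_data_dir",
        "model_dir": "input_model_dir",
        "infer_outdir": "output_dir",
        "test_batch": "infer_batch",
    },
}

# Parameters that are now hard-coded in train and infer and should be dropped.
NOW_HARD_CODED = {"y_data_preds_suffix", "json_scores_suffix", "pred_col_name_suffix"}


def migrate_params(stage, old_params):
    """Return a copy of old_params using the v0.1.0-alpha names (hypothetical helper)."""
    renamed = RENAMED[stage]
    new_params = {}
    for name, value in old_params.items():
        if stage in ("train", "infer") and name in NOW_HARD_CODED:
            continue  # no longer configurable
        new_name = renamed.get(name, name)
        # train_ml_data_dir and val_ml_data_dir both become input_dir,
        # so they must agree before they can be merged.
        if new_name in new_params and new_params[new_name] != value:
            raise ValueError(f"conflicting values for {new_name!r}")
        new_params[new_name] = value
    return new_params


# Example with made-up values.
old_train = {
    "train_ml_data_dir": "ml_data",
    "val_ml_data_dir": "ml_data",
    "model_outdir": "out_model",
    "json_scores_suffix": "scores",
    "epochs": 10,
}
new_train = migrate_params("train", old_train)
```

Because the train and validation directories collapse into a single input_dir, the sketch raises an error rather than silently picking one when the two old values differ.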
Updating v0.0.3 curated models
Updating Environment
Create an environment without candlelib. Because candlelib previously installed many packages as dependencies, you may now need to add some of those packages to your environment explicitly.
For now, set the PYTHONPATH as usual; this will be replaced with pip install shortly. Alternatively, set up the environment by running this bash script:
source setup_improve.sh
Running this script clones the IMPROVE repo, checks out the required branch, and sets the PYTHONPATH (it also downloads the CSA benchmark dataset if it is not already downloaded). No environment variables need to be set. The directory previously given by IMPROVE_DATA_DIR is now set on the command line with
--input_dir your/path/to/csa_data/raw_data
or in the config.
Updating Import Statements
For initializing parameters, there is a different import for each of the three scripts:
Preprocess
from improvelib.applications.drug_response_prediction.config import DRPPreprocessConfig
Train
from improvelib.applications.drug_response_prediction.config import DRPTrainConfig
Infer
from improvelib.applications.drug_response_prediction.config import DRPInferConfig
If your code uses str2bool, change the import to the following:
from improvelib.utils import str2bool
For other framework functions (previously from improve import framework as frm), use:
import improvelib.utils as frm
For DataLoaders in Preprocess, use the following:
DrugsLoader
import improvelib.applications.drug_response_prediction.drug_utils as drugs_utils
OmicsLoader
import improvelib.applications.drug_response_prediction.omics_utils as omics_utils
In the body of the code, change references to drp.OmicsLoader() and drp.DrugsLoader() to omics_utils.OmicsLoader() and drugs_utils.DrugsLoader(), respectively.
DrugResponseLoader
import improvelib.applications.drug_response_prediction.drp_utils as drp
Updating main()
Create the cfg object for the appropriate script:
Preprocess
cfg = DRPPreprocessConfig()
Train
cfg = DRPTrainConfig()
Infer
cfg = DRPInferConfig()
Use only the parameters relevant to each model script as additional_definitions. For example, in the infer script use additional_definitions = infer_params instead of additional_definitions = preprocess_params + train_params + infer_params.
Initialize parameters. Note that default_config is now used instead of default_model to specify the default configuration file.
params = cfg.initialize_parameters(
    pathToModelDir=filepath,
    default_config="your_configuration_file.txt",
    additional_definitions=additional_definitions
)
Updating IMPROVE Functions
Building paths is now done automatically. This line should be removed:
params = frm.build_paths(params)
Rename build_ml_data_name to build_ml_data_file_name in preprocess, train, and infer, and update the arguments. Parameters are now explicitly passed. See example:
frm.build_ml_data_file_name(data_format=params["data_format"], stage="test")
Update the arguments in build_model_path in train and infer. Parameters are now explicitly passed. Make sure model_dir is params["output_dir"] in train and params["input_model_dir"] in infer. See example for infer:
frm.build_model_path(
    model_file_name=params["model_file_name"],
    model_file_format=params["model_file_format"],
    model_dir=params["input_model_dir"]
)
Update the arguments in save_stage_ydf in preprocess. Parameters are now explicitly passed. See example:
frm.save_stage_ydf(ydf=rsp, stage=stage, output_dir=params["output_dir"])
Update the arguments in store_predictions_df in train and infer. Parameters are now explicitly passed. See example:
frm.store_predictions_df(
    y_true=val_true,
    y_pred=val_pred,
    stage="val",
    y_col_name=params["y_col_name"],
    output_dir=params["output_dir"]
)
Update the arguments in compute_performance_scores in train and infer. Note that "performance" is now spelled correctly in the function name. Parameters are now explicitly passed. The parameter metric_type is set to regression by default and should not need to be changed for DRP models. See example:
val_scores = frm.compute_performance_scores(
    y_true=val_true,
    y_pred=val_pred,
    stage="val",
    metric_type=params["metric_type"],
    output_dir=params["output_dir"]
)
In infer, compute_performance_scores should only be called if calc_infer_scores is True. Wrap the call in an if statement. See example:
if params["calc_infer_scores"]:
    test_scores = frm.compute_performance_scores(
        y_true=test_true,
        y_pred=test_pred,
        stage="test",
        metric_type=params["metric_type"],
        output_dir=params["output_dir"]
    )
If your code uses compute_metrics (usually in train), update the arguments. See example:
compute_metrics(train_true, train_pred, params["metric_type"])
The list metrics_list is no longer required and should be deleted. This list is now hard-coded in compute_metrics using metric_type.
In infer, make sure that run() does not return test_scores, as this is now only generated if calc_infer_scores is True.
Updating References to Input and Output Directories
All scripts have a single output_dir. Preprocess and train scripts have a single input_dir.
The infer script has two input directories, one for the saved model (input_model_dir) and one for the ML data for the inference split (input_data_dir).
These are all set by default to the current working directory, but it is important to ensure that the correct input directories (i.e. model and data) are used in the code in the infer script so that workflows function correctly.
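To illustrate why the two infer input directories must not be mixed up, here is a small sketch that builds the paths an infer script might read and write. It uses only pathlib; the helper resolve_infer_paths, the keys model_file and test_data_file, and all file names are hypothetical and are not part of improvelib.

```python
from pathlib import Path


def resolve_infer_paths(params):
    """Build infer paths from the three directory parameters (hypothetical helper)."""
    model_dir = Path(params["input_model_dir"])  # where train saved the model
    data_dir = Path(params["input_data_dir"])    # where preprocess saved the ML data
    output_dir = Path(params["output_dir"])      # where infer writes its results
    return {
        "model_file": model_dir / params["model_file"],
        "test_data_file": data_dir / params["test_data_file"],
        # Hypothetical output file name, for illustration only.
        "predictions_file": output_dir / "test_predictions.csv",
    }


# Example with made-up directories, as a workflow might pass them.
params = {
    "input_model_dir": "exp/train_out",
    "input_data_dir": "exp/ml_data",
    "output_dir": "exp/infer_out",
    "model_file": "model.pt",
    "test_data_file": "test_data.pt",
}
paths = resolve_infer_paths(params)
```

In a workflow the model and the inference data usually live in different directories, so loading the model from input_data_dir (or the data from input_model_dir) would only work by accident when everything defaults to the current working directory.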
Updating Model-specific Parameter Definitions
Model-specific parameter definitions should be in a file named model_params_def.py. This file should contain three lists, one for each script (see below). Import each list into the appropriate script (e.g. for preprocess use from model_params_def import preprocess_params). For more information see Creating Model-Specific Parameters.
from improvelib.utils import str2bool

preprocess_params = []
train_params = []
infer_params = []
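As a rough illustration of what a filled-in model_params_def.py could look like, the sketch below assumes each definition is a dictionary with name, type, default, and help keys; that layout is an assumption here, so check Creating Model-Specific Parameters for the authoritative format. The parameter names and defaults are hypothetical, and built-in types are used so the example runs without improvelib.

```python
# Sketch of model_params_def.py. The dictionary keys (name, type, default, help)
# are an assumed layout; parameter names and defaults are hypothetical.
preprocess_params = [
    {"name": "use_landmark_genes", "type": bool, "default": True,
     "help": "Hypothetical flag: restrict gene features to a landmark set."},
]

train_params = [
    {"name": "hidden_dim", "type": int, "default": 128,
     "help": "Hypothetical size of the hidden layer."},
    {"name": "dropout", "type": float, "default": 0.2,
     "help": "Hypothetical dropout rate."},
]

infer_params = [
    {"name": "save_embeddings", "type": bool, "default": False,
     "help": "Hypothetical flag: also write embeddings during inference."},
]

# In the infer script, only the infer list is passed on
# (in a real script: from model_params_def import infer_params).
additional_definitions = infer_params
infer_names = [definition["name"] for definition in additional_definitions]
```

Keeping one list per script means the infer script never sees training-only options such as the hypothetical hidden_dim above.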
Updating the Default Configuration File
The new improvelib API now only reads the parameters in the relevant section as each script is run.
If there are parameters that are used in more than one script (e.g. model_file_name in both train and infer), these will have to be set in both the [Train] and [Infer] sections of the config.
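The effect of per-section reading can be sketched with Python's standard configparser. This is not how improvelib is implemented; it only illustrates why a shared parameter has to appear in each section that needs it. The helper read_section and the config values are hypothetical.

```python
import configparser

# Hypothetical default config; model_file_name is repeated in [Train] and [Infer].
CONFIG_TEXT = """
[Preprocess]
data_format = .pt

[Train]
model_file_name = model
epochs = 20

[Infer]
model_file_name = model
infer_batch = 256
"""


def read_section(text, section):
    """Return only the parameters of one section as a dict of strings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser[section])


train_cfg = read_section(CONFIG_TEXT, "Train")
infer_cfg = read_section(CONFIG_TEXT, "Infer")

# The infer script sees model_file_name only because [Infer] repeats it;
# epochs is defined in [Train] alone and is therefore not visible to infer.
shared = sorted(set(train_cfg) & set(infer_cfg))
```

If model_file_name were removed from [Infer] in this sketch, infer_cfg would no longer contain it, even though [Train] still defines it.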
Changes to Running Code
The path to csa_data can be set in the config or by command line. See example:
python graphdrp_preprocess_improve.py --input_dir /your/path/to/csa_data/raw_data
The default input and output directories are the current working directory, but they can be set in the config or by command line. Remember that input_dir should not be used in infer; use input_data_dir and input_model_dir instead. See example:
python graphdrp_infer_improve.py --input_data_dir /your/path/to/data --input_model_dir /your/path/to/model --output_dir /your/path/to/results
With the above changes to compute_performance_scores in infer, inference scores are no longer computed automatically. To compute them, set calc_infer_scores = True in the config or pass --calc_infer_scores True on the command line.
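The flag's effect on an infer script can be sketched with argparse alone. The str2bool below is a local stand-in whose accepted spellings are an assumption (improvelib.utils.str2bool may behave differently), the default of False is inferred from the text above, and run() returns placeholder labels instead of real predictions or scores.

```python
import argparse


def str2bool(value):
    """Local stand-in for improvelib.utils.str2bool (accepted spellings assumed)."""
    if isinstance(value, bool):
        return value
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_infer_args(argv):
    """Parse only the flag discussed here; the default of False is an assumption."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--calc_infer_scores", type=str2bool, default=False)
    return vars(parser.parse_args(argv))


def run(params):
    """Placeholder infer step: always stores predictions, scores only on request."""
    produced = ["predictions"]
    if params["calc_infer_scores"]:
        produced.append("scores")
    # Nothing score-related is returned, since scores may not exist.
    return produced


with_scores = run(parse_infer_args(["--calc_infer_scores", "True"]))
without_scores = run(parse_infer_args([]))
```

Passing the value through a converter such as str2bool matters because a plain type=bool would treat any non-empty string, including "False", as True.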
If your model uses Supplemental Data
There should be a shell script in the repo that downloads the data. Use input_supp_data_dir to set the path to this directory.
INTERNAL USE - Curated Model Checklist - v0.1.0
All of the following should be completed for the update of curated models from the legacy version (v0.0.3) to the latest version (v0.1.0).
Tag the legacy version
Make sure your model works with the legacy version (tagged v0.0.3-beta) of the IMPROVE lib: https://github.com/JDACS4C-IMPROVE/IMPROVE/tree/v0.0.3-beta
This means that all three model scripts run with the CSA benchmark datasets.
Update the README.md to follow, as much as possible, the same structure as in these examples. Make sure the install instructions refer to the v0.0.3-beta tag. Code should have setup_improve.sh and download_csa.sh.
Create branch legacy-v0.0.3-beta. See examples:
Create tag v0.0.3-beta with git tag v0.0.3-beta, then git push origin v0.0.3-beta. See examples:
Change environment and code with the above instructions and confirm it runs successfully. This code should stay on the develop branch for now.
Code should not use environment variables.
Code should not be dependent on candlelib.
In infer, use input_model_dir and input_data_dir as appropriate so the CSA workflow functions properly.
Parameters should be defined in model_params_def.py and these lists imported into the appropriate scripts (i.e. preprocess, train, infer).
Default config should be named MODELNAME_params.txt.
Update the README to include the new instructions for setting up the environment with pip installation of improvelib (and without candlelib).
Update setup_improve.sh to the correct improvelib branch (improve_branch="develop").
Check the documentation page for your model (Drug Response Prediction Models) and make sure it is accurate. Tell Natasha if it isn’t.
Send Natasha a list of your model-specific parameters (or a link to them).
Tell Alex the model has been updated according to this page.