v0.1.0-alpha

To ensure compatibility with the IMPROVE software release v0.1.0-alpha, please update your curated model. Follow the instructions below and refer to the checklist at the bottom of the page. In addition, use models GraphDRP and LGBM as examples. TODO make sure links for the models are correct!

Overview

IMPROVE version 0.1.0-alpha aims to expand the user base and encourage broader adoption of the software. This version features updates to accommodate various users and contributors, both internal and external, including those involved with the development of the core IMPROVE library, application-specific modules (such as drug response prediction and drug property prediction), benchmark datasets, and model contributions. Additionally, this version provides a simplified and more user-friendly interface, as demonstrated by intuitive help outputs, comprehensive READMEs, and documentation that facilitate easy switching between versions.

This version is now available on pypi for pip installation. TODO: update pypi AND link here

Parameters

Parameters are detailed in IMPROVE API. Of note, the parameters for each step (i.e. preprocess, train, infer) are now separate.

Deprecated Parameters

Preprocess
- ml_data_outdir is now output_dir
Train
- train_ml_data_dir is now input_dir
- val_ml_data_dir is now input_dir
- model_outdir is now output_dir
- y_data_preds_suffix, json_scores_suffix, and pred_col_name_suffix are now hard-coded.
Infer
- test_ml_data_dir is now input_data_dir
- model_dir is now input_model_dir
- infer_outdir is now output_dir
- y_data_preds_suffix, json_scores_suffix, and pred_col_name_suffix are now hard-coded.
- test_batch is now infer_batch.

Updating v0.0.3 curated models

DRP Benchmarks v2 update

Benchmarks no longer have a triple header, and are flexibly loaded. Documentation will be fully updated after push to develop.

Delete:

import improvelib.applications.drug_response_prediction.drug_utils as drugs_utils
import improvelib.applications.drug_response_prediction.omics_utils as omics_utils

Example change for omics:

omics_obj = omics_utils.OmicsLoader(params)
ge = omics_obj.dfs['cancer_gene_expression.tsv']

to

ge = drp.get_cell_transcriptomics(file = params['cell_transcriptomic_file'],
                                benchmark_dir = params['input_dir'],
                                cell_column_name = params['canc_col_name'],
                                norm = params['cell_transcriptomic_transform'])
ge = ge.reset_index()

Example change for drug:

drugs_obj = drugs_utils.DrugsLoader(params)
md = drugs_obj.dfs['drug_mordred.tsv']

to

md = drp.get_drug_mordred(file = params['drug_mordred_file'],
            benchmark_dir = params['input_dir'],
            drug_column_name = params['drug_col_name'])

Example change for response (change all instances):

rsp_tr = drp.DrugResponseLoader(params,
                                split_file=params["train_split_file"],
                                verbose=False).dfs["response.tsv"]

to

rsp_tr = drp.get_response_data(split_file=params["train_split_file"],
                               benchmark_dir=params['input_dir'],
                               response_file=params['y_data_file'])

Updating Environment

Make an environment without candle lib. Since many packages are installed by candlelib, you may need to add other packages to your environment now.
For now, set the PYTHONPATH as usual, this will be replaced with pip install shortly. You can also run this bash script with source setup_improve.sh to set up the environment. Running this script will clone IMPROVE repo, checkout the required branch, and set the PYTHONPATH (it will also download the csa benchmark dataset if it’s not already downloaded).
No environment variables need to be set, the IMPROVE_DATA_DIR directory is now set by command line with --input_dir your/path/to/csa_data/raw_data or in the config.

Updating Import Statements

For initalizing parameters, there is a different import for each of the three scripts:

Preprocess

from improvelib.applications.drug_response_prediction.config import DRPPreprocessConfig

Train

from improvelib.applications.drug_response_prediction.config import DRPTrainConfig

Infer

from improvelib.applications.drug_response_prediction.config import DRPInferConfig

If your code uses str2bool, change the import to the following:
```
from improvelib.utils import str2bool
```
For other framework functions (previously from improve import framework as frm) use:
```
import improvelib.utils as frm
```
For DataLoaders in Preprocess, use the following:
- DrugsLoader
```
import improvelib.applications.drug_response_prediction.drug_utils as drugs_utils
```
- OmicsLoader
```
import improvelib.applications.drug_response_prediction.omics_utils as omics_utils
```
In the body of the code, references to drp.OmicsLoader() and drp.DrugssLoader() should be changed to omics_utils.OmicsLoader() and drugs_utils.DrugsLoader(), respectively.
- DrugResponseLoader
```
import improvelib.applications.drug_response_prediction.drp_utils as drp
```

Updating `main()`

Create the cfg object for the appropriate script:

Preprocess

cfg = DRPPreprocessConfig()

Train

cfg = DRPTrainConfig()

Infer

cfg = DRPInferConfig()

Use relevant parameters for each of the model scripts as additional_definitions. For example, in the infer script use additional_definitions = infer_params instead of additional_definitions = preprocess_params + train_params + infer_params

Initialize parameters. Note that instead of default_model now default_config is used to specify the default configuration file.

params = cfg.initialize_parameters(
    pathToModelDir=filepath,
    default_config="your_configuration_file.txt",
    additional_definitions=additional_definitions
)

Updating IMPROVE Functions

Building paths is now done automatically. This line should be removed:
```
params = frm.build_paths(params)
```
Update the name of build_ml_data_name to build_ml_data_file_name in preprocess, train, and infer and update the arguments. Parameters are now explicitly passed. See example:
```
frm.build_ml_data_file_name(data_format=params["data_format"], stage="test")
```
Update the arguments in build_model_path in train and infer. Parameters are now explicitly passed. Make sure model_dir is params["output_dir"] in train and params["input_model_dir"] in infer. See example for infer:
```
frm.build_model_path(model_file_name=params["model_file_name"],
    model_file_format=params["model_file_format"],
    model_dir=params["input_model_dir"])
```
Update the arguments in save_stage_ydf in preprocess. Parameters are now explicitly passed. See example:
```
frm.save_stage_ydf(ydf=rsp, stage=stage, output_dir=params["output_dir"])
```

Update the arguments in store_predictions_df in train and infer. Parameters are now explicitly passed. See example:

frm.store_predictions_df(
    y_true=val_true,
    y_pred=val_pred,
    stage="val",
    y_col_name=params["y_col_name"],
    output_dir=params["output_dir"]
)

Update the arguments in compute_performance_scores in train and infer. Note “performance” is now spelled correctly. Parameters are now explicitly passed. The parameter metric_type is set to regression by default and should not need to be changed for DRP models. See example:
```
val_scores = frm.compute_performance_scores(
    y_true=val_true,
    y_pred=val_pred,
    stage="val",
    metric_type=params["metric_type"],
    output_dir=params["output_dir"]
)
```

In infer, compute_performance_scores should only be called if calc_infer_scores is True. Wrap this in an if statement. See example:

if params["calc_infer_scores"]:
    test_scores = frm.compute_performance_scores(
        y_true=test_true,
        y_pred=test_pred,
        stage="test",
        metric_type=params["metric_type"],
        output_dir=params["output_dir"]
    )

If your code uses compute_metrics (usually in train), update the arguments. See example:
```
compute_metrics(train_true, train_pred, params["metric_type"])
```
The list metrics_list is not required now and should be deleted. This list is hard-coded in compute_metrics using metric_type.
In infer, make sure that run() does not return test_scores, as this is now only generated if calc_infer_scores is True.

Updating References to Input and Output Directories

All scripts have a single output_dir. Preprocess and train scripts have a single input_dir. The infer script has two input directories, one for the saved model (input_model_dir) and one for the ML data for the inference split (input_data_dir). These are all set by default to the current working directory, but it is important to ensure that the correct input directories (i.e. model and data) are used in the code in the infer script so that workflows function correctly.

Updating Model-specific Parameter Definitions

Model-specific parameter definitions should be in a file named model_params_def.py. This file should contain three lists, one for each script (see below). These lists should be imported into the appropriate scripts (e.g. for preprocess use from model_params_def import preprocess_params). For more information see Creating Model-Specific Parameters.

from improvelib.utils import str2bool

preprocess_params = []
train_params = []
infer_params = []

Updating the Default Configuration File

The new improvelib API now only reads the parameters in the relevant section as each script is run. If there are parameters that are used in more than one script (e.g. model_file_name in both train and infer), these will have to be set in both the [Train] and [Infer] sections of the config.

Changes to Running Code

The path to csa_data can be set in the config or by command line. See example:

python graphdrp_preprocess_improve.py --input_dir /your/path/to/csa_data/raw_data

The default input and output directories are current working directory, but can be set in the config or by command line. Remember input_dir should not be used in infer, use input_data_dir and input_model_dir. See example:
```
python graphdrp_infer_improve.py --input_data_dir /your/path/to/data --input_model_dir /your/path/to/model --output_dir /your/path/to/results
```
With the above changes to compute_performance_scores in Infer, inference scores will not automatically be computed. Set calc_infer_scores = True in the config or --calc_infer_scores True on the command line.

If your model uses Supplemental Data

There should be a shell script that downloads the data in the repo. Use input_supp_data_dir to set the path to this directory.

INTERNAL USE - Curated Model Checklist - v0.1.0

All of the following should be completed for the update of curated models from the legacy version (v0.0.3) to the latest version (v0.1.0).

Tag the legacy version
- Make sure your model works with the legacy version (tagged v0.0.3-beta) of the IMPROVE lib. https://github.com/JDACS4C-IMPROVE/IMPROVE/tree/v0.0.3-beta This means that all 3 model scripts run with the csa benchmark datasets.
- Update the README.md to follow the same structure as much as possible in these examples. Make sure the install instructions refer to the v0.0.3-beta tag. Code should have setup_improve.sh and download_csa.sh.
  - https://github.com/JDACS4C-IMPROVE/GraphDRP/tree/legacy-v0.0.3-beta
  - https://github.com/JDACS4C-IMPROVE/LGBM/tree/legacy-v0.0.3-beta
- Create branch legacy-v0.0.3-beta. See examples:
  - https://github.com/JDACS4C-IMPROVE/GraphDRP/tree/legacy-v0.0.3-beta
  - https://github.com/JDACS4C-IMPROVE/LGBM/tree/legacy-v0.0.3-beta
- Create tag v0.0.3-beta with git tag v0.0.3-beta then git push origin v0.0.3-beta. See examples:
  - https://github.com/JDACS4C-IMPROVE/GraphDRP/tree/v0.0.3-beta
  - https://github.com/JDACS4C-IMPROVE/LGBM/tree/v0.0.3-beta
Change environment and code with the above instructions and confirm it runs successfully. This code should stay on the develop branch for now.
Code should not use environmental variables.
Code should not be dependent on candlelib.
In infer, use input_model_dir and input_data_dir as appropriate so the CSA workflow functions properly.
Parameters should be defined in model_params_def.py and these lists imported into the appropriate scripts (i.e. preprocess, train, infer).
Default config should be named MODELNAME_params.txt.
Update readme to include new instructions for set up of environment with pip installation of improvelib (and without candlelib).
Update setup_improve.sh to the correct improvelib branch (improve_branch="develop").
Check the documentation page for your model (Drug Response Prediction Models) and make sure it is accurate. Tell Natasha if it isn’t.
Send Natasha a list of your model-specific parameters (or a link to them).
Tell Alex the model has been updated according to this page.