v0.1.0-alpha

To ensure compatibility with the IMPROVE software release v0.1.0-alpha, please update your curated model. Follow the instructions below and refer to the checklist at the bottom of the page. The GraphDRP and LGBM models can serve as examples.

Overview

IMPROVE version 0.1.0-alpha aims to expand the user base and encourage broader adoption of the software. This version features updates to accommodate various users and contributors, both internal and external, including those involved with the development of the core IMPROVE library, application-specific modules (such as drug response prediction and drug property prediction), benchmark datasets, and model contributions. Additionally, this version provides a simplified and more user-friendly interface, as demonstrated by intuitive help outputs, comprehensive READMEs, and documentation that facilitate easy switching between versions.

This version is now available on PyPI for pip installation.

Parameters

Parameters are detailed in the IMPROVE API. Note that the parameters for each step (preprocess, train, and infer) are now defined separately.

Deprecated Parameters

  • Preprocess

    • ml_data_outdir is now output_dir

  • Train

    • train_ml_data_dir is now input_dir

    • val_ml_data_dir is now input_dir

    • model_outdir is now output_dir

    • y_data_preds_suffix, json_scores_suffix, and pred_col_name_suffix are now hard-coded.

  • Infer

    • test_ml_data_dir is now input_data_dir

    • model_dir is now input_model_dir

    • infer_outdir is now output_dir

    • y_data_preds_suffix, json_scores_suffix, and pred_col_name_suffix are now hard-coded.

    • test_batch is now infer_batch.
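
The renames above can be applied mechanically to an old set of parameters. The following is a minimal sketch, not part of improvelib: the helper migrate_params and the sample values are hypothetical, and only the name mapping itself is taken from the list above.

```python
# Deprecated v0.0.3 names mapped to their v0.1.0-alpha replacements, per script.
# The mapping is copied from the list above; the helper itself is hypothetical.
RENAMED_PARAMS = {
    "preprocess": {"ml_data_outdir": "output_dir"},
    "train": {
        "train_ml_data_dir": "input_dir",
        "val_ml_data_dir": "input_dir",
        "model_outdir": "output_dir",
    },
    "infer": {
        "test_ml_data_dir": "input_data_dir",
        "model_dir": "input_model_dir",
        "infer_outdir": "output_dir",
        "test_batch": "infer_batch",
    },
}

# These are hard-coded now, so they are dropped from train and infer settings.
HARD_CODED = {"y_data_preds_suffix", "json_scores_suffix", "pred_col_name_suffix"}


def migrate_params(stage, old_params):
    """Return a copy of old_params using the new names for the given stage."""
    renames = RENAMED_PARAMS[stage]
    new_params = {}
    for name, value in old_params.items():
        if stage in ("train", "infer") and name in HARD_CODED:
            continue  # no longer configurable
        new_name = renames.get(name, name)
        # Two old names can collapse into one new name (e.g. input_dir in train).
        if new_name in new_params and new_params[new_name] != value:
            raise ValueError(
                f"{name!r} conflicts with another value for {new_name!r}"
            )
        new_params[new_name] = value
    return new_params


# Hypothetical old infer settings, for illustration only.
old_infer = {
    "test_ml_data_dir": "./ml_data",
    "model_dir": "./model",
    "infer_outdir": "./infer_out",
    "test_batch": 64,
    "json_scores_suffix": "scores",
}
print(migrate_params("infer", old_infer))
```

Because train_ml_data_dir and val_ml_data_dir both become input_dir, the sketch raises an error when the two old values differ, which signals that the training and validation data need to be placed in one directory.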

Updating v0.0.3 curated models

Updating Environment

  • Create an environment without candlelib. Because candlelib installed many packages as dependencies, you may now need to add those packages to your environment explicitly.

  • For now, set the PYTHONPATH as usual; this will be replaced with pip install shortly. Alternatively, run this bash script with source setup_improve.sh to set up the environment. The script clones the IMPROVE repo, checks out the required branch, and sets the PYTHONPATH. It also downloads the CSA benchmark dataset if it is not already downloaded.

  • No environment variables need to be set. The directory previously given by IMPROVE_DATA_DIR is now set on the command line with --input_dir your/path/to/csa_data/raw_data or in the config.

Updating Import Statements

  • For initializing parameters, there is a different import for each of the three scripts:

    • Preprocess

    from improvelib.applications.drug_response_prediction.config import DRPPreprocessConfig
    
    • Train

    from improvelib.applications.drug_response_prediction.config import DRPTrainConfig
    
    • Infer

    from improvelib.applications.drug_response_prediction.config import DRPInferConfig
    
  • If your code uses str2bool, change the import to the following:

    from improvelib.utils import str2bool
    
  • For other framework functions (previously from improve import framework as frm) use:

    import improvelib.utils as frm
    
  • For DataLoaders in Preprocess, use the following:

    • DrugsLoader

    import improvelib.applications.drug_response_prediction.drug_utils as drugs_utils
    
    • OmicsLoader

    import improvelib.applications.drug_response_prediction.omics_utils as omics_utils
    

    In the body of the code, references to drp.OmicsLoader() and drp.DrugsLoader() should be changed to omics_utils.OmicsLoader() and drugs_utils.DrugsLoader(), respectively.

    • DrugResponseLoader

    import improvelib.applications.drug_response_prediction.drp_utils as drp
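
For larger preprocess scripts, the loader references described above can be updated with a small text substitution. This is a rough sketch using only the standard library; the function update_loader_references is a hypothetical helper, and it assumes the old code used the alias drp for all three loaders.

```python
import re

# New module alias for each loader that moved out of the drp namespace.
LOADER_MODULES = {
    "OmicsLoader": "omics_utils",
    "DrugsLoader": "drugs_utils",
}

# Match drp.OmicsLoader or drp.DrugsLoader, but not drp.DrugResponseLoader.
_LOADER_PATTERN = re.compile(r"\bdrp\.(OmicsLoader|DrugsLoader)\b")


def update_loader_references(source):
    """Rewrite old drp.<Loader> references to the new module aliases."""
    return _LOADER_PATTERN.sub(
        lambda match: f"{LOADER_MODULES[match.group(1)]}.{match.group(1)}",
        source,
    )


# Hypothetical lines from an old preprocess script.
old_source = (
    "omics_obj = drp.OmicsLoader(params)\n"
    "drugs_obj = drp.DrugsLoader(params)\n"
    "rsp = drp.DrugResponseLoader(params)\n"
)
print(update_loader_references(old_source))
```

DrugResponseLoader is left untouched because it is still accessed through the drp alias.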
    

Updating main()

  • Create the cfg object for the appropriate script:

    • Preprocess

    cfg = DRPPreprocessConfig()
    
    • Train

    cfg = DRPTrainConfig()
    
    • Infer

    cfg = DRPInferConfig()
    
  • Pass only the parameters relevant to each model script as additional_definitions. For example, in the infer script use additional_definitions = infer_params instead of additional_definitions = preprocess_params + train_params + infer_params.

  • Initialize parameters. Note that default_config now replaces default_model for specifying the default configuration file.

    params = cfg.initialize_parameters(
        pathToModelDir=filepath,
        default_config="your_configuration_file.txt",
        additional_definitions=additional_definitions
    )
    

Updating IMPROVE Functions

  • Building paths is now done automatically. This line should be removed:

    params = frm.build_paths(params)
    
  • Update the name of build_ml_data_name to build_ml_data_file_name in preprocess, train, and infer and update the arguments. Parameters are now explicitly passed. See example:

    frm.build_ml_data_file_name(data_format=params["data_format"], stage="test")
    
  • Update the arguments in build_model_path in train and infer. Parameters are now explicitly passed. Make sure model_dir is params["output_dir"] in train and params["input_model_dir"] in infer. See example for infer:

    frm.build_model_path(model_file_name=params["model_file_name"],
        model_file_format=params["model_file_format"],
        model_dir=params["input_model_dir"])
    
  • Update the arguments in save_stage_ydf in preprocess. Parameters are now explicitly passed. See example:

    frm.save_stage_ydf(ydf=rsp, stage=stage, output_dir=params["output_dir"])
    
  • Update the arguments in store_predictions_df in train and infer. Parameters are now explicitly passed. See example:

    frm.store_predictions_df(
        y_true=val_true,
        y_pred=val_pred,
        stage="val",
        y_col_name=params["y_col_name"],
        output_dir=params["output_dir"]
    )
    
  • Update the arguments in compute_performance_scores in train and infer. Note “performance” is now spelled correctly. Parameters are now explicitly passed. The parameter metric_type is set to regression by default and should not need to be changed for DRP models. See example:

    val_scores = frm.compute_performance_scores(
        y_true=val_true,
        y_pred=val_pred,
        stage="val",
        metric_type=params["metric_type"],
        output_dir=params["output_dir"]
    )
    
  • In infer, compute_performance_scores should only be called if calc_infer_scores is True. Wrap this in an if statement. See example:

    if params["calc_infer_scores"]:
        test_scores = frm.compute_performance_scores(
            y_true=test_true,
            y_pred=test_pred,
            stage="test",
            metric_type=params["metric_type"],
            output_dir=params["output_dir"]
        )
    
  • If your code uses compute_metrics (usually in train), update the arguments. See example:

    compute_metrics(train_true, train_pred, params["metric_type"])
    
  • The list metrics_list is no longer required and should be deleted. The metrics are now hard-coded in compute_metrics and selected by metric_type.

  • In infer, make sure that run() does not return test_scores, as this is now only generated if calc_infer_scores is True.

Updating References to Input and Output Directories

All scripts have a single output_dir. Preprocess and train scripts have a single input_dir. The infer script has two input directories, one for the saved model (input_model_dir) and one for the ML data for the inference split (input_data_dir). These are all set by default to the current working directory, but it is important to ensure that the correct input directories (i.e. model and data) are used in the code in the infer script so that workflows function correctly.
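
As an illustration of keeping the two infer input directories apart, the sketch below resolves one path per directory. The helper resolve_infer_paths and all file names are hypothetical; only the parameter names input_model_dir, input_data_dir, and output_dir come from the text above.

```python
from pathlib import Path


def resolve_infer_paths(params):
    """Map each infer-stage artifact to the directory it belongs in."""
    # All three directories default to the current working directory.
    model_dir = Path(params.get("input_model_dir", "."))
    data_dir = Path(params.get("input_data_dir", "."))
    output_dir = Path(params.get("output_dir", "."))
    return {
        # File names below are placeholders, not names used by improvelib.
        "model": model_dir / "example_model.pt",
        "test_data": data_dir / "example_test_data.pt",
        "predictions": output_dir / "example_predictions.csv",
    }


# Hypothetical workflow layout: train output feeds input_model_dir,
# preprocess output feeds input_data_dir.
paths = resolve_infer_paths(
    {
        "input_model_dir": "exp/train_out",
        "input_data_dir": "exp/ml_data",
        "output_dir": "exp/infer_out",
    }
)
for role, path in paths.items():
    print(f"{role}: {path}")
```

Reading the saved model from the data directory (or the reverse) works by accident when everything lives in one folder, but fails once a workflow places each stage in its own directory.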

Updating Model-specific Parameter Definitions

Model-specific parameter definitions should be in a file named model_params_def.py. This file should contain three lists, one for each script (see below). These lists should be imported into the appropriate scripts (e.g. for preprocess use from model_params_def import preprocess_params). For more information see Creating Model-Specific Parameters.

from improvelib.utils import str2bool

preprocess_params = []
train_params = []
infer_params = []
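
A filled-in version of this file might look like the sketch below. The parameter names are invented for illustration, and the dictionary keys (name, type, default, help) are assumed to follow the argparse-style convention; consult Creating Model-Specific Parameters for the authoritative format. A local stand-in for str2bool is defined only so the sketch runs without improvelib installed.

```python
def str2bool(value):
    """Local stand-in; the real file would import str2bool from improvelib.utils."""
    return str(value).strip().lower() in ("true", "t", "yes", "1")


# One list per script. Every parameter name below is hypothetical.
preprocess_params = [
    {
        "name": "example_use_scaling",
        "type": str2bool,
        "default": True,
        "help": "Whether to scale the omics features (illustrative only).",
    },
]

train_params = [
    {
        "name": "example_hidden_dim",
        "type": int,
        "default": 128,
        "help": "Width of the hidden layers (illustrative only).",
    },
]

infer_params = [
    {
        "name": "example_num_workers",
        "type": int,
        "default": 1,
        "help": "Number of data-loading workers (illustrative only).",
    },
]


def check_definitions(definitions):
    """Verify that each entry has the assumed keys and that names are unique."""
    names = [entry["name"] for entry in definitions]
    if len(names) != len(set(names)):
        raise ValueError("duplicate parameter names in one list")
    for entry in definitions:
        missing = {"name", "type", "default", "help"} - entry.keys()
        if missing:
            raise ValueError(f"{entry['name']} is missing {sorted(missing)}")
    return True


for definitions in (preprocess_params, train_params, infer_params):
    check_definitions(definitions)
```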

Updating the Default Configuration File

The improvelib API now reads only the parameters in the section relevant to the script being run. If a parameter is used in more than one script (e.g. model_file_name in both train and infer), it must be set in each relevant section of the config, in this case both [Train] and [Infer].
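
The effect of per-section reading can be sketched with the standard-library configparser. This is only an illustration of the idea, not how improvelib parses its config: the helper read_section and every value in the sample config are hypothetical, and only the [Train] and [Infer] section names and model_file_name come from the text above.

```python
import configparser

# Hypothetical default configuration; model_file_name appears in both sections.
DEFAULT_CONFIG = """
[Train]
model_file_name = model
example_epochs = 10

[Infer]
model_file_name = model
infer_batch = 256
"""


def read_section(config_text, section):
    """Return only the parameters listed under the given section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return dict(parser[section])


train_settings = read_section(DEFAULT_CONFIG, "Train")
infer_settings = read_section(DEFAULT_CONFIG, "Infer")
print(train_settings)
print(infer_settings)
```

Here the infer settings contain model_file_name only because it is repeated under [Infer]; a value set solely under [Train] would not be visible to the infer script.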

Changes to Running Code

  • The path to csa_data can be set in the config or by command line. See example:

    python graphdrp_preprocess_improve.py --input_dir /your/path/to/csa_data/raw_data
    
  • The default input and output directories are the current working directory, but they can be set in the config or on the command line. Remember that input_dir should not be used in infer; use input_data_dir and input_model_dir instead. See example:

    python graphdrp_infer_improve.py --input_data_dir /your/path/to/data --input_model_dir /your/path/to/model --output_dir /your/path/to/results
    
  • With the above changes to compute_performance_scores in infer, inference scores are no longer computed automatically. To compute them, set calc_infer_scores = True in the config or pass --calc_infer_scores True on the command line.

If Your Model Uses Supplemental Data

The model repo should include a shell script that downloads the supplemental data. Use input_supp_data_dir to set the path to the directory containing the downloaded data.

INTERNAL USE - Curated Model Checklist - v0.1.0

All of the following should be completed for the update of curated models from the legacy version (v0.0.3) to the latest version (v0.1.0).