Using Non-Benchmark Data

One of the strengths of IMPROVE is the use of benchmark datasets to compare models. However, the standardization provided by IMPROVE also makes it an excellent system to perform experiments to better understand the models and the data. There are cases where you may want to compare external data to the benchmark data and evaluate the performance. IMPROVE makes it easy to do so.

For example, say you are interested in Drug Response Prediction. The Benchmark Data for DRP includes cancer_discretized_copy_number.tsv, in which -2, -1, 0, 1, 2 indicate deep deletion, heterozygous deletion, neutral, copy number gain, copy number amplification, respectively. We might hypothesize that minor changes in copy number don’t matter and that the model would perform better if we binarize deep deletions or amplifications (-2 or 2) vs minor changes in copy number (-1, 0, 1).

We can take the cancer_discretized_copy_number.tsv file, change -2 or 2 to 1 and -1, 0, or 1 to 0. We’ll save this file in /home/my_data as cancer_binary_cnv.tsv.

Warning

All data files should be tab-separated, with no index numbers. The first column in the text file will be read as the index and must contain the IDs. The feature names (gene names in this example) should be the column headers.

We look through the Drug Response Prediction Models and see that tCNNS uses discretized copy number. We can run the model (see Running a Curated Model for details) like normal except at the preprocessing step, we’ll specify the location of the new CNV file we made like so:

python tcnns_preprocess_improve.py --input_dir ./drp_v2 --cell_cnv_file /home/my_data/cancer_binary_cnv.tsv

That’s it! You can run train and infer like normal, and compare your results.