Using Non-Benchmark Data
========================

One of the strengths of IMPROVE is the use of benchmark datasets to compare models. However, the standardization provided
by IMPROVE also makes it an excellent system to perform experiments to better understand the models and the data. There
are cases where you may want to compare external data to the benchmark data and evaluate the performance. IMPROVE makes it easy to do so.

For example, say you are interested in :doc:`app_drp`. The :doc:`app_drp_benchmark` includes cancer_discretized_copy_number.tsv, in which  
-2, -1, 0, 1, 2 indicate deep deletion, heterozygous deletion, neutral, copy number gain, copy number amplification, respectively.
We might hypothesize that minor changes in copy number don't matter and that the model would perform better if we binarize 
deep deletions or amplifications (-2 or 2) vs minor changes in copy number (-1, 0, 1).

We can take the cancer_discretized_copy_number.tsv file, change -2 or 2 to 1 and -1, 0, or 1 to 0. We'll save this file in 
:code:`/home/my_data` as :code:`cancer_binary_cnv.tsv`.

.. warning::

    All data files should be tab-separated, with no index numbers. The first column in the text file will be read as the 
    index and *must* contain the IDs. The feature names (gene names in this example) should be the column headers.

We look through the :doc:`app_drp_models` and see that :doc:`Models-tCNNS` uses discretized copy number. We can run 
the model (see :doc:`using_running` for details) like normal except at the preprocessing step, we'll specify the 
location of the new CNV file we made like so:

.. code-block:: bash

    python tcnns_preprocess_improve.py --input_dir ./drp_v2 --cell_cnv_file /home/my_data/cancer_binary_cnv.tsv

That's it! You can run *train* and *infer* like normal, and compare your results.