determine_transform
improvelib.applications.drug_response_prediction.drp_utils.determine_transform(x_data_df, x_data_name, x_transform_list, output_dir)
Determines the transformations (imputation, scaling, and/or subsetting) to apply to features, based on a list of [strategy, subtype] pairs. Saves a dictionary containing the details needed to perform the specified transformations on all sets.
The following [strategy, subtype] pairs can be specified for transformations:
impute
- zero: imputes all NaN with zeros
- mean: imputes all NaN with the mean of the whole DataFrame
- mean_col: imputes all NaN with the mean, column-wise
- median: imputes all NaN with the median of the whole DataFrame
- median_col: imputes all NaN with the median, column-wise
scale
- std or StandardScaler: scales with sklearn.preprocessing.StandardScaler()
- minmax or MinMaxScaler: scales with sklearn.preprocessing.MinMaxScaler()
- minabs or MinAbsScaler: scales with sklearn.preprocessing.MaxAbsScaler()
- robust or RobustScaler: scales with sklearn.preprocessing.RobustScaler()
subset
- L1000_SYMBOL: subsets with L1000, should be used if column names are gene symbols
- L1000_ENTREZ: subsets with L1000, should be used if column names are Entrez
- L1000_SYMBOL: subsets with LINCS, should be used if column names are gene symbols
- high_variance: subsets to columns where the variance is higher than 0.8
- <YOUR/PATH/TO/FILE>: custom subsetting - file must be a plain text list of column names, with each name on a new line
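The difference between the whole-DataFrame and column-wise imputation subtypes, and the 0.8 variance threshold used by high_variance, can be illustrated with plain pandas. This is a minimal sketch of the behaviour described above, not improvelib's implementation: the toy DataFrame and all variable names are assumptions, and whether the library uses the sample or the population variance is not stated here (pandas defaults to the sample variance).

```python
import numpy as np
import pandas as pd

# Toy feature table (hypothetical values): rows are samples, columns are features.
df = pd.DataFrame(
    {"g1": [1.0, np.nan, 3.0], "g2": [10.0, 20.0, np.nan], "g3": [5.0, 5.0, 5.0]}
)

# 'mean': a single mean over every non-NaN value fills all gaps.
global_mean = np.nanmean(df.to_numpy())
imputed_mean = df.fillna(global_mean)

# 'mean_col': each column's gaps are filled with that column's own mean.
imputed_mean_col = df.fillna(df.mean())

# 'high_variance': keep only columns whose variance exceeds 0.8.
variances = imputed_mean_col.var()  # pandas default: sample variance (ddof=1)
high_var = imputed_mean_col.loc[:, variances > 0.8]
```

With these values, the global mean fills both gaps with the same number, while the column-wise mean fills each gap with a value on that column's own scale; the constant column g3 has zero variance and is dropped by the threshold.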
Warning
Transformations should be determined using only the features in the training set. See the example below.
Used in preprocess.
Parameters:
- x_data_df : pd.DataFrame
The input DataFrame. Column names must be Entrez IDs; the index must be IDs.
- x_data_name : str
Name for the saved transformation dictionary (.json will be added).
- x_transform_list : list of lists of str
List of [strategy, subtype] pairs, e.g. [['subset', 'L1000_SYMBOL'], ['scale', 'StandardScaler']].
- output_dir : str
The output directory where the results should be saved. Should be set to params['output_dir'].
Returns:
None
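As a sketch of how x_transform_list is structured, the list below extends the example given for that parameter with an imputation step. The check_transform_list helper is hypothetical and not part of improvelib; it only verifies the [strategy, subtype] shape and the three strategy names documented here, and it does not validate subtypes because a subset subtype may be an arbitrary file path.

```python
# Strategies documented for determine_transform.
KNOWN_STRATEGIES = {"impute", "scale", "subset"}


def check_transform_list(x_transform_list):
    """Hypothetical helper: check the [[strategy, subtype], ...] shape."""
    for entry in x_transform_list:
        if len(entry) != 2:
            raise ValueError(f"expected [strategy, subtype], got {entry!r}")
        strategy, subtype = entry
        if strategy not in KNOWN_STRATEGIES:
            raise ValueError(f"unknown strategy: {strategy!r}")
        if not isinstance(subtype, str):
            raise ValueError(f"subtype must be a str, got {subtype!r}")
    return True


# Example list: subset to L1000 genes, impute column-wise, then standardize.
omics_transform = [
    ["subset", "L1000_SYMBOL"],
    ["impute", "mean_col"],
    ["scale", "StandardScaler"],
]
check_transform_list(omics_transform)
```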
Example
Before determining the transformations, restrict the data to the training set: keep only responses that have features for both the drug and the cell, and only features that appear in those responses. This can be done by calling get_response_with_features and get_features_in_response like so:
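import improvelib.applications.drug_response_prediction.drp_utils as drp

print("Find intersection of training data.")
# Keep only training responses that have both omics and drug features.
response_train = drp.get_response_with_features(response_train, omics, params['canc_col_name'])
response_train = drp.get_response_with_features(response_train, drugs, params['drug_col_name'])
# Keep only the features that appear in the filtered training responses.
omics_train = drp.get_features_in_response(omics, response_train, params['canc_col_name'])
drugs_train = drp.get_features_in_response(drugs, response_train, params['drug_col_name'])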
Once your data contains only the features necessary for the training set, transformations can be determined:
print("Determine transformations.")
drp.determine_transform(omics_train, 'omics_transform', omics_transform, params['output_dir'])
drp.determine_transform(drugs_train, 'drugs_transform', drugs_transform, params['output_dir'])
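The reason for determining transformations on the training set alone can be seen with scikit-learn directly. The following is a minimal sketch, not improvelib's implementation: the toy arrays are made up, and StandardScaler stands in for whichever scale subtype is selected. The scaler's statistics come from the training rows only and are then reused unchanged for the other sets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices: rows are samples, columns are features.
x_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_val = np.array([[4.0, 40.0]])

# Fit on the training set only, so no validation statistics leak in.
scaler = StandardScaler().fit(x_train)

# Apply the same fitted parameters to every set.
x_train_scaled = scaler.transform(x_train)
x_val_scaled = scaler.transform(x_val)
```

The scaled training columns have zero mean, while the validation row is expressed relative to the training mean and spread rather than its own.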