determine_transform

improvelib.utils.determine_transform(x_data_df, x_data_name, x_transform_list, output_dir)

Determines the transformations (imputation, scaling, and/or subsetting) to apply to the features, based on a list of [strategy, subtype] pairs such as [[strategy, subtype], ...]. Saves a dictionary containing the details needed to perform the specified transformations on all sets.

The following [strategy, subtype] can be specified for transformations:

impute
- zero: imputes all NaN with zeros
- mean: imputes all NaN with the mean of the whole DataFrame
- mean_col: imputes all NaN with the mean, column-wise
- median: imputes all NaN with the median of the whole DataFrame
- median_col: imputes all NaN with the median, column-wise
scale
- std or StandardScaler: scales with sklearn.preprocessing.StandardScaler()
- minmax or MinMaxScaler: scales with sklearn.preprocessing.MinMaxScaler()
- minabs or MinAbsScaler: scales with sklearn.preprocessing.MaxAbsScaler()
- robust or RobustScaler: scales with sklearn.preprocessing.RobustScaler()
subset
- L1000_SYMBOL: subsets with L1000, should be used if column names are gene symbols
- L1000_ENTREZ: subsets with L1000, should be used if column names are Entrez IDs
- LINCS_SYMBOL: subsets with LINCS, should be used if column names are gene symbols
- high_variance: subsets to columns where the variance is higher than 0.8
- <YOUR/PATH/TO/FILE>: custom subsetting - file must be a plain text list of column names, with each name on a new line
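
The difference between the whole-DataFrame and column-wise imputation subtypes can be illustrated with plain pandas. This is only a sketch of the behavior described in the list above, using a hypothetical toy DataFrame; it is not improvelib's implementation, which may differ in detail.

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature matrix with a single missing value in column "a".
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, 30.0]})

# "mean": one scalar computed over every non-NaN value in the DataFrame.
global_mean = np.nanmean(df.to_numpy())
imputed_mean = df.fillna(global_mean)

# "mean_col": each NaN is replaced by the mean of its own column.
imputed_mean_col = df.fillna(df.mean())

print(imputed_mean)
print(imputed_mean_col)
```

With features on very different scales, as in columns "a" and "b" here, the two subtypes fill the same gap with quite different values, which is worth keeping in mind when choosing between them.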

Warning

Determining transformations should be done on only the features in the training set. See the example below.
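
The reasoning behind this warning can be sketched with plain pandas: the scaling statistics are computed from the training features only and then reused unchanged on other sets. The data and variable names below are hypothetical, and the code illustrates the principle rather than improvelib's internals.

```python
import pandas as pd

# Hypothetical training and validation features for a single column.
train = pd.DataFrame({"g1": [1.0, 2.0, 3.0]})
val = pd.DataFrame({"g1": [4.0]})

# Statistics are determined from the training set only.
mu = train.mean()
sigma = train.std(ddof=0)  # population standard deviation

# The same statistics are then applied to every set.
train_scaled = (train - mu) / sigma
val_scaled = (val - mu) / sigma

print(train_scaled)
print(val_scaled)
```

Computing the statistics on validation or test data as well would leak information about those sets into the training features.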

Used in preprocess.

Parameters:

x_data_df (pd.DataFrame): The input DataFrame, column names must be Entrez IDs, index must be IDs.
x_data_name (str): Name for the saved transformation dictionary (.json will be added).
x_transform_list (List of Lists of str): List of lists of [[strategy, subtype]], e.g. [['subset', 'L1000_SYMBOL'], ['scale', 'StandardScaler']].
output_dir (str): The output directory where the results should be saved. Should be set to params['output_dir'].
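
As a minimal sketch of how a value for x_transform_list might be put together, the snippet below writes a custom subset file (plain text, one column name per line) and uses its path as the subtype of a subset entry. The column names, the file name, and the particular combination of strategies are hypothetical and chosen for illustration only.

```python
import tempfile
from pathlib import Path

# Hypothetical column names to keep; in practice these must match the
# column names of the feature DataFrame.
columns_to_keep = ["TP53", "EGFR", "BRCA1"]

# Plain text file with each column name on its own line.
subset_path = Path(tempfile.mkdtemp()) / "my_gene_subset.txt"
subset_path.write_text("\n".join(columns_to_keep) + "\n")

# Each inner list is [strategy, subtype]; the file path serves as the
# subtype of the custom subset entry.
x_transform_list = [
    ["subset", str(subset_path)],
    ["impute", "mean_col"],
    ["scale", "StandardScaler"],
]

print(x_transform_list)
```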

Returns:

None

Example

Before determining the transformations, restrict the data to the training set: keep only the responses for which both drug and cell features are available, and only the features that appear in those training responses. This can be done by calling get_y_data_with_features and get_features_in_y_data like so:

print("Find intersection of training data.")
response_train = drp.get_y_data_with_features(response_train, omics, params['canc_col_name'])
response_train = drp.get_y_data_with_features(response_train, drugs, params['drug_col_name'])
omics_train = drp.get_features_in_y_data(omics, response_train, params['canc_col_name'])
drugs_train = drp.get_features_in_y_data(drugs, response_train, params['drug_col_name'])

Once your data contains only the features necessary for the training set, transformations can be determined:

print("Determine transformations.")
drp.determine_transform(omics_train, 'omics_transform', omics_transform, params['output_dir'])
drp.determine_transform(drugs_train, 'drugs_transform', drugs_transform, params['output_dir'])