Utility Functions
====================

IMPROVE general utility functions
-----------------------------------------

The following functions are part of improvelib and should be used to standardize aspects of the workflow.

.. topic:: build_ml_data_file_name

   Returns the name of the ML data file in the form of ``<stage>_data.<data_format>``, e.g. ``train_data.pt``.
   Used in *preprocess*, *train*, and *infer*.

   Args:
      * data_format (str)
      * stage (str)

   Returns:
      str: data file name

.. topic:: build_model_path

   Returns the path to save/load the trained model.
   Used in *train* and *infer*.

   Args:
      * model_file_name (str): Name of model file
      * model_file_format (str): Type of file for model (e.g. '.pt')
      * model_dir (Path or str): Directory path to save the model

   Returns:
      pathlib.Path: path to model

.. topic:: save_stage_ydf

   Saves a subset of y data samples (rows of the input dataframe).
   The "subset" refers to one of the three stages involved in developing ML models: "train", "val", or "test".
   Used in *preprocess*.

   Args:
      * ydf (pd.DataFrame): Dataframe with y data samples
      * stage (str): Either 'train', 'val', or 'test'
      * output_dir (str): Directory to save to

   Returns:
      None

.. topic:: store_predictions_df

   Saves predictions together with the accompanying dataframe.
   If the corresponding dataframe is available (the output of :code:`save_stage_ydf` in *preprocess*), the original data evaluated (e.g. drug and cell pairs) can be traced, and the whole structure is stored along with the model predictions.
   If the dataframe is not available, only the ground truth and model predictions are stored.
   Used in *train* and *infer*.

   Args:
      * y_pred (np.array): Model predictions
      * y_col_name (str): Name of the column in the y_data predicted on (e.g.
        'auc', 'ic50')
      * stage (str): Specify if evaluation is with respect to val or test set ('val', or 'test')
      * output_dir (str): Directory to write results
      * y_true (Optional: np.array): Ground truth
      * input_dir (Optional: str): Directory where df with ground truth with metadata is stored
      * round_decimals (Optional: int): Number of decimals in output (default is 4)

.. topic:: compute_performance_scores

   Evaluate predictions according to specified metrics.
   Scores are stored in specified path and returned.
   Used in *train* and *infer*.

   Args:
      * y_true (np.array): Array with ground truth values
      * y_pred (np.array): Array with model predictions
      * stage (str): Specify if evaluation is with respect to val or test set ('val', or 'test')
      * metric_type (str): Either 'classification' or 'regression'
      * output_dir (str): Directory to write results

   Returns:
      dict: metrics evaluated and corresponding scores

IMPROVE Drug Response Prediction Utility Functions
-------------------------------------------------------

Drug Utility Functions (drug_utils)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. topic:: class DrugsLoader

   Class to load and manage drug data.

   Args:
      * params (Dict): IMPROVE parameters.
      * sep (str): Character separator in the loaded files (e.g., ``"\t"`` for TSV files).
      * verbose (bool): Whether to print detailed loading information.

   Attributes:
      * smiles_fname (str): Filename for SMILES data.
      * mordred_fname (str): Filename for Mordred descriptors.
      * ecfp4_512bit_fname (str): Filename for ECFP4 512-bit data.
      * known_file_names (List[str]): List of known drug data filenames.
      * params (Dict): Configuration parameters for loading data.
      * sep (str): Separator used in data files.
      * inp (List): List of input drug data files.
      * drug_col_name (str): Column name for drug identifiers.
      * x_data_path (str): Path to the directory containing drug data files.
      * dfs (Dict[str, pd.DataFrame]): Dictionary to store loaded dataframes, keyed by filename.
      * verbose (bool): Flag to control verbosity of output.
      * inp_fnames (List[str]): List of input filenames extracted from inp.

   Example:

   .. code-block::

      from improve import drug_resp_pred as drp

      # Parameters normally come from the IMPROVE configuration.
      params = {
          "x_data_drug_files": "[['drug_SMILES.tsv'], ['drug_mordred.tsv'], ['drug_ecfp4_nbits512.tsv']]",
          "drug_col_name": "DrugID",
          "x_data_path": "/path/to/drug/data"
      }
      drugs_loader = drp.DrugsLoader(params)
      print(drugs_loader)
      print(dir(drugs_loader))
      # Loaded dataframes are stored in the dfs attribute, keyed by filename.
      smi = drugs_loader.dfs["drug_SMILES.tsv"]

Omics Utility Functions (omics_utils)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. topic:: class OmicsLoader

   Class that aggregates methods to load omics data.

   Args:
      * params (Dict): IMPROVE parameters.
      * sep (str): Character separator in the loaded files (e.g., ``"\t"`` for TSV files).
      * verbose (bool): Flag for verbosity.

   Attributes:
      * copy_number_fname (str): Filename for copy number data.
      * discretized_copy_number_fname (str): Filename for discretized copy number data.
      * dna_methylation_fname (str): Filename for DNA methylation data.
      * gene_expression_fname (str): Filename for gene expression data.
      * miRNA_expression_fname (str): Filename for miRNA expression data.
      * mutation_count_fname (str): Filename for mutation count data.
      * mutation_long_format_fname (str): Filename for mutation data in long format.
      * mutation_fname (str): Filename for mutation data in parquet format.
      * rppa_fname (str): Filename for RPPA data.
      * known_file_names (List[str]): List of known omics data filenames.
      * params (Dict): Configuration parameters for loading data.
      * sep (str): Separator used in data files.
      * inp (List): List of input omics data files.
      * x_data_path (str): Path to the directory containing omics data files.
      * canc_col_name (str): Column name for indexing in the data files.
      * dfs (Dict[str, pd.DataFrame]): Dictionary to store loaded dataframes, keyed by filename.
      * verbose (bool): Flag to control verbosity of output.
      * inp_fnames (List[str]): List of input filenames extracted from inp.

   Example:

   .. code-block::

      from improve import drug_resp_pred as drp

      # Each entry pairs an omics file with the gene identifier system to use.
      params = {
          "x_data_canc_files": "[['cancer_gene_expression.tsv', 'Gene_Symbol'], "
                               "['cancer_copy_number.tsv', 'Entrez']]",
          "canc_col_name": "SampleID",
          "x_data_path": "/path/to/omics/data"
      }
      omics_loader = drp.OmicsLoader(params)
      print(omics_loader)
      print(dir(omics_loader))
      # Loaded dataframes are stored in the dfs attribute, keyed by filename.
      gene_expression_data = omics_loader.dfs["cancer_gene_expression.tsv"]

Response Utility Functions (drp_utils)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. topic:: get_common_samples

   Searches for values shared by both dataframes in a reference column and retains only the rows with those values.

   Args:
      * df1 (pd.DataFrame): First dataframe.
      * df2 (pd.DataFrame): Second dataframe.
      * ref_col (str): The reference column to find the common values.

   Returns:
      Tuple[pd.DataFrame, pd.DataFrame]: Tuple of DataFrames after filtering for common data.

.. topic:: common_elements

   Returns a list of elements that the provided lists have in common.

   Args:
      * list1 (List): One list.
      * list2 (List): Another list.
      * verbose (bool): Flag for verbosity. If True, info about computations is displayed. Default is False.

   Returns:
      List: List of common elements.

.. topic:: class DrugResponseLoader

   Class for loading monotherapy drug response data.

   Args:
      * params (Dict): IMPROVE parameters.
      * split_file (str): File name that contains the split ids (rows).
      * sep (str): Character separator in the loaded files (e.g., ``"\t"`` for TSV files).
      * verbose (bool): Flag for verbosity. Default is True.

   Attributes:
      * response_fname (str): Default response file name.
      * known_file_names (List[str]): List of known file names.
      * params (Dict): Parameters for loading data.
      * sep (str): Separator used in data files.
      * inp (List): Parsed input data files.
      * y_col_name (str): Column name for the target variable.
      * canc_col_name (str): Column name for cancer sample identifiers.
      * drug_col_name (str): Column name for drug identifiers.
      * y_data_path (str): Path to the directory containing y data files.
      * split_fpath (Path): Path to the file containing split identifiers.
      * dfs (Dict[str, pd.DataFrame]): Dictionary to store loaded dataframes.
      * verbose (bool): Verbosity flag.

   Example:

   .. code-block::

      from improve import drug_resp_pred as drp

      # params is a dict of IMPROVE parameters.
      drp_loader = drp.DrugResponseLoader(params)
      print(drp_loader)
      print(dir(drp_loader))
      # Loaded dataframes are stored in the dfs attribute.
      rsp = drp_loader.dfs["response.tsv"]

IMPROVE Synergy utility functions
-----------------------------------------

The following functions are part of improvelib and should be used to standardize aspects of the workflow.

.. topic:: get_response_data

   Gets response data for a given split file.
   Used in *preprocess*.

   Args:
      * split_file (Union[str, Path, list of str, list of Path]): Name of split file if in benchmark data, otherwise path to split file. Can be a list of str or Path.
      * benchmark_dir (Union[str, Path]): Path to benchmark data directory.
      * response_file (str): Name of response file (default: 'synergy.tsv').
      * sep (str): Separator for response file (default: ``'\t'``).

   Returns:
      pd.DataFrame: Response dataframe for given split.

.. topic:: get_all_response_data

   Gets response data for the train, val, and test split files.
   Used in *preprocess*.

   Args:
      * train_split_file (Union[str, Path, list of str, list of Path]): Name of train split file if in benchmark data, otherwise path to train split file. Can be a list of str or Path.
      * val_split_file (Union[str, Path, list of str, list of Path]): Name of val split file if in benchmark data, otherwise path to val split file. Can be a list of str or Path.
      * test_split_file (Union[str, Path, list of str, list of Path]): Name of test split file if in benchmark data, otherwise path to test split file. Can be a list of str or Path.
      * benchmark_dir (Union[str, Path]): Path to benchmark data directory.
      * response_file (str): Name of response file (default: 'synergy.tsv').
      * sep (str): Separator for response file (default: ``'\t'``).

   Returns:
      pd.DataFrame: Response dataframe for all splits, with column 'split' denoting split type ('train', 'val', or 'test').

.. topic:: get_cell_transcriptomics

   Gets cell transcriptomics data.
   Sets index to cell ID and sets dtype to float64.
   Used in *preprocess*.

   Args:
      * file (Union[str, Path]): Name of cell transcriptomics file if in benchmark data, otherwise path to cell transcriptomics file.
      * benchmark_dir (Union[str, Path]): Path to benchmark data directory.
      * cell_column_name (str): Name of ID column for cell data.
      * norm (list): Normalization to perform on this data.

   Returns:
      pd.DataFrame: Cell transcriptomics data (with normalization if specified), index set to cell ID.

.. topic:: get_cell_cnv

   Gets cell copy number variation (CNV) data.
   Sets index to cell ID and sets dtype to float64.
   Used in *preprocess*.

   Args:
      * file (Union[str, Path]): Name of cell copy number variation file if in benchmark data, otherwise path to cell copy number variation file.
      * benchmark_dir (Union[str, Path]): Path to benchmark data directory.
      * cell_column_name (str): Name of ID column for cell data.
      * norm (list): Normalization to perform on this data.

   Returns:
      pd.DataFrame: Cell copy number variation data (with normalization if specified), index set to cell ID.

.. topic:: get_cell_mutations

   Gets cell mutation data.
   Sets index to cell ID and sets dtype to float64.
   Used in *preprocess*.

   Args:
      * file (Union[str, Path]): Name of cell mutation file if in benchmark data, otherwise path to cell mutation file.
      * benchmark_dir (Union[str, Path]): Path to benchmark data directory.
      * cell_column_name (str): Name of ID column for cell data.
      * norm (list): Normalization to perform on this data.

   Returns:
      pd.DataFrame: Cell mutation data (with normalization if specified), index set to cell ID.
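
The relationship between ``get_response_data`` (one split) and ``get_all_response_data`` (all splits, tagged with a ``split`` column) can be illustrated with plain pandas. The sketch below does not call improvelib and is not its implementation. The helper names ``select_split_rows`` and ``combine_splits``, the column names, and the idea that a split is a list of row positions in the response table are all assumptions made for illustration only.

```python
import io
from typing import Dict, List

import pandas as pd

# Hypothetical stand-in for a tab-separated response file such as 'synergy.tsv'.
# The column names here are illustrative, not the benchmark's actual schema.
RESPONSE_TSV = (
    "drug_1\tdrug_2\tcell\tsynergy\n"
    "A\tB\tc1\t0.10\n"
    "A\tC\tc2\t0.25\n"
    "B\tC\tc3\t-0.40\n"
    "A\tB\tc4\t0.05\n"
    "B\tC\tc5\t0.90\n"
)


def select_split_rows(response: pd.DataFrame, split_ids: List[int]) -> pd.DataFrame:
    """Return the rows of `response` at the given row positions.

    Assumption: a split is a list of integer row positions.
    """
    return response.iloc[split_ids].reset_index(drop=True)


def combine_splits(response: pd.DataFrame, splits: Dict[str, List[int]]) -> pd.DataFrame:
    """Stack all splits into one dataframe with a 'split' column."""
    frames = [
        select_split_rows(response, ids).assign(split=name)
        for name, ids in splits.items()
    ]
    return pd.concat(frames, ignore_index=True)


# Read the in-memory table with the same separator the docs list as default.
response = pd.read_csv(io.StringIO(RESPONSE_TSV), sep="\t")

# Hypothetical split definitions (row positions per stage).
splits = {"train": [0, 1, 2], "val": [3], "test": [4]}

all_df = combine_splits(response, splits)
print(all_df)
```

Keeping the split label as a column lets a *preprocess* script filter one combined dataframe by stage instead of handling three separate ones.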