tirank.GPextractor
- class tirank.GPextractor.GenePairExtractor(savePath, analysis_mode, top_var_genes=500, top_gene_pairs=2000, p_value_threshold=None, max_cutoff=0.8, min_cutoff=0.2)[source]
Bases: object
A class to extract and filter phenotype-associated gene pairs (PGPs).
This class loads bulk and sc/st expression data, identifies genes associated with a clinical phenotype (via t-test, Cox regression, or Pearson correlation), creates all possible pairs between positive and negative-associated genes, filters these pairs based on co-occurrence and variance, and finally transforms both bulk and sc/st datasets into gene pair matrices.
- Parameters:
savePath (str) – The main project directory path.
analysis_mode (str) – The analysis mode: one of ‘Classification’, ‘Cox’, or ‘Regression’.
top_var_genes (int, optional) – The number of top variable genes to pre-filter from the sc/st data. Defaults to 500.
top_gene_pairs (int, optional) – The number of top variable gene pairs to select after filtering. Defaults to 2000.
p_value_threshold (float, optional) – P-value threshold for selecting phenotype-associated genes. Defaults to None, as in the original code. Because a None threshold may be an issue during gene selection, consider passing an explicit value.
max_cutoff (float, optional) – Maximum co-occurrence proportion for filtering gene pairs (removes highly redundant pairs). Defaults to 0.8.
min_cutoff (float, optional) – Minimum co-occurrence proportion for filtering gene pairs (removes pairs with no co-occurrence). Defaults to 0.2.
- load_data()[source]
Loads the required expression and clinical data from disk.
Loads ‘bulkExp_train.pkl’, ‘bulkClinical_train.pkl’, and ‘scAnndata.pkl’ from the ‘2_preprocessing’ directory and stores them as attributes.
- Returns:
None
- save_data()[source]
Saves the generated gene pair matrices to disk.
Saves ‘train_bulk_gene_pairs_mat.pkl’, ‘val_bulkExp_gene_pairs_mat.pkl’, and ‘sc_gene_pairs_mat.pkl’ to the ‘2_preprocessing’ directory. The validation matrix is created by transforming the validation expression data using the training gene pairs.
- Returns:
None
- run_extraction()[source]
Main orchestration function to run the full gene pair extraction pipeline.
This function performs the following steps:
1. Finds intersecting genes between bulk and sc/st data.
2. Selects top variable genes from sc/st data.
3. Subsets all expression data to these genes.
4. Identifies phenotype-associated gene sets (e.g., risk/protective) based on the specified ‘analysis_mode’.
5. Transforms the bulk expression data into a gene pair matrix.
6. Filters the gene pair matrix by co-occurrence and variance.
7. Transforms the sc/st expression data using the filtered gene pairs.
8. Saves the final matrices as attributes and plots them.
- Returns:
None
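Steps 1 to 3 of the pipeline (gene intersection, variance ranking, subsetting) can be sketched in plain Python. This is an illustration only, not the class's implementation: the helper name `top_variable_shared_genes`, the dict-of-lists data layout, and the toy values are all assumptions, and the class itself works on pandas and AnnData objects.

```python
from statistics import pvariance

# Toy expression tables (hypothetical): gene -> values, one per sample or cell.
bulk_exp = {"A": [1.0, 2.0, 3.0], "B": [5.0, 5.0, 5.0], "C": [0.0, 4.0, 8.0]}
sc_exp = {"A": [0.0, 1.0, 0.0, 1.0], "C": [0.0, 9.0, 0.0, 9.0], "D": [2.0, 2.0, 2.0, 2.0]}


def top_variable_shared_genes(bulk, sc, top_n):
    """Return up to top_n genes present in both tables, ranked by sc/st variance."""
    shared = sorted(set(bulk) & set(sc))  # step 1: intersect gene names
    # Step 2: rank the shared genes by their variance in the sc/st data.
    ranked = sorted(shared, key=lambda g: pvariance(sc[g]), reverse=True)
    return ranked[:top_n]


genes = top_variable_shared_genes(bulk_exp, sc_exp, top_n=1)

# Step 3: subset both tables to the selected genes.
bulk_sub = {g: bulk_exp[g] for g in genes}
sc_sub = {g: sc_exp[g] for g in genes}
```

Here `top_n` plays the role of `top_var_genes`; how the class actually measures variability is not stated on this page.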
- extract_candidate_genes(gene_names)[source]
Subsets the expression matrices to a list of candidate genes.
- Parameters:
gene_names (list) – A list of gene names to keep.
- Returns:
- A tuple containing:
pd.DataFrame: The subsetted bulk expression matrix.
pd.DataFrame: The subsetted single-cell expression matrix.
- Return type:
tuple
- calculate_binomial_gene_pairs()[source]
Finds phenotype-associated genes for ‘Classification’ mode.
Performs a t-test for each gene between two groups in the clinical data.
- Returns:
- A tuple containing:
list: Genes up-regulated in group 0 (t-stat > 0).
list: Genes up-regulated in group 1 (t-stat < 0).
- Return type:
tuple
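The sign convention above (t-stat > 0 means higher in group 0) can be illustrated with a small stdlib sketch. It assumes Welch's unequal-variance t statistic with group 0 first in the numerator; this page does not say which t-test variant the class uses, and the p-value filtering controlled by `p_value_threshold` is omitted. The helper `welch_t` and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic; positive when mean(x) > mean(y).

    Assumes each group has at least two values and non-zero pooled spread.
    """
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


# Toy data (hypothetical): binary group labels and per-gene expression.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "G1": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
    "G2": [1.0, 2.0, 3.0, 5.0, 6.0, 7.0],
}

up_in_0, up_in_1 = [], []
for gene, values in expr.items():
    g0 = [v for v, lab in zip(values, labels) if lab == 0]
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    t = welch_t(g0, g1)
    if t > 0:
        up_in_0.append(gene)  # higher in group 0
    elif t < 0:
        up_in_1.append(gene)  # higher in group 1
```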
- calculate_survival_gene_pairs()[source]
Finds phenotype-associated genes for ‘Cox’ survival mode.
Performs a univariate Cox proportional hazards model for each gene.
- Returns:
- A tuple containing:
list: Risk genes (Hazard Ratio > 1).
list: Protective genes (Hazard Ratio < 1).
- Return type:
tuple
- calculate_regression_gene_pairs()[source]
Finds phenotype-associated genes for ‘Regression’ mode.
Performs a Pearson correlation for each gene against the continuous clinical variable.
- Returns:
- A tuple containing:
list: Positively correlated genes.
list: Negatively correlated genes.
- Return type:
tuple
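A minimal sketch of splitting genes by the sign of their Pearson correlation with a continuous phenotype. The helper `pearson_r` and the toy values are assumptions for illustration; the p-value filtering that the class may apply through `p_value_threshold` is left out.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy data (hypothetical): one continuous clinical variable, two genes.
phenotype = [1.0, 2.0, 3.0, 4.0]
expr = {"G1": [2.0, 4.0, 6.0, 8.0], "G2": [9.0, 7.0, 4.0, 1.0]}

positive = [g for g, v in expr.items() if pearson_r(v, phenotype) > 0]
negative = [g for g, v in expr.items() if pearson_r(v, phenotype) < 0]
```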
- transform_bulk_gene_pairs(genes_r, genes_p)[source]
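Transforms the bulk expression matrix into a gene pair matrix based on relative expression ordering (REO).
Creates all possible pairs between the two gene sets (e.g., risk/protective). For each sample, a pair is coded 1 if the expression of gene_r exceeds that of gene_p, and -1 otherwise.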
- Parameters:
genes_r (list) – The list of genes for the “positive” set (e.g., risk genes).
genes_p (list) – The list of genes for the “negative” set (e.g., protective genes).
- Returns:
The transformed bulk gene pair matrix (gene pairs x samples).
- Return type:
pd.DataFrame
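The pairing rule can be sketched without pandas as follows. The function name `to_gene_pairs` and the toy data are hypothetical; the `__` separator is taken from the pair-name example given for split_gene_pairs, and ties are coded -1 because the rule above says "else -1".

```python
from itertools import product

# Toy bulk expression (hypothetical): gene -> expression per sample.
bulk_exp = {
    "R1": [5.0, 1.0, 3.0],
    "P1": [2.0, 4.0, 3.0],
    "P2": [6.0, 0.0, 1.0],
}


def to_gene_pairs(exp, genes_r, genes_p, sep="__"):
    """Build a pair -> per-sample code table: 1 if gene_r > gene_p, else -1."""
    pairs = {}
    for r, p in product(genes_r, genes_p):  # every (gene_r, gene_p) combination
        pairs[f"{r}{sep}{p}"] = [1 if a > b else -1 for a, b in zip(exp[r], exp[p])]
    return pairs


gp = to_gene_pairs(bulk_exp, genes_r=["R1"], genes_p=["P1", "P2"])
# gp["R1__P1"] is [1, -1, -1]; the third sample is a tie and is coded -1.
```

With m genes in one set and n in the other, the result has m × n rows, which is why the pairs are filtered afterwards.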
- filter_gene_pairs(bulk_GPMat)[source]
Filters the bulk gene pair matrix based on co-occurrence and variance.
- Parameters:
bulk_GPMat (pd.DataFrame) – The raw bulk gene pair matrix.
- Returns:
The filtered bulk gene pair matrix.
- Return type:
pd.DataFrame
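One plausible reading of the filter is sketched below. It assumes that "co-occurrence proportion" means the fraction of samples in which a pair is coded 1, that both cutoffs are strict, and that the surviving pairs are then ranked by variance and truncated to `top_gene_pairs`. None of these details is confirmed on this page, and the helper `filter_pairs` is hypothetical.

```python
from statistics import pvariance


def filter_pairs(gp, min_cutoff=0.2, max_cutoff=0.8, top_n=2000):
    """Keep pairs whose share of 1-codes lies strictly between the cutoffs,
    then keep the top_n pairs by variance across samples."""
    kept = {}
    for name, row in gp.items():
        freq = sum(1 for v in row if v == 1) / len(row)  # assumed definition
        if min_cutoff < freq < max_cutoff:
            kept[name] = row
    ranked = sorted(kept, key=lambda k: pvariance(kept[k]), reverse=True)
    return {k: kept[k] for k in ranked[:top_n]}


# Toy pair matrix (hypothetical): pair -> code per sample.
gp = {
    "A__B": [1, 1, 1, 1],      # proportion 1.0: constant, dropped
    "A__C": [1, -1, 1, -1],    # proportion 0.5: kept, variance 1.0
    "D__B": [-1, -1, -1, -1],  # proportion 0.0: constant, dropped
    "D__C": [1, -1, -1, -1],   # proportion 0.25: kept, variance 0.75
}

filtered = filter_pairs(gp, top_n=1)
```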
- transform_single_cell_gene_pairs(bulk_GPMat)[source]
Transforms the sc/st expression matrix into a gene pair matrix.
Uses the exact same gene pairs that were filtered from the bulk data.
- Parameters:
bulk_GPMat (pd.DataFrame) – The filtered bulk gene pair matrix. The index of this DataFrame defines the gene pairs to use.
- Returns:
The transformed sc/st gene pair matrix.
- Return type:
pd.DataFrame
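Reusing the filtered pair names on another dataset might look like the sketch below. It assumes the same 1 / -1 coding documented for the bulk transform and a `__` separator in the pair names; how the class codes ties in sparse sc/st data is not stated here. The helper `transform_with_pairs` and the toy data are hypothetical.

```python
# Toy sc/st expression (hypothetical): gene -> expression per cell.
sc_exp = {"R1": [0.0, 3.0], "P1": [1.0, 1.0], "P2": [0.0, 5.0]}

# Pair names as they would appear in the index of the filtered bulk matrix.
kept_pairs = ["R1__P1", "R1__P2"]


def transform_with_pairs(exp, pair_names, sep="__"):
    """Code each named pair per cell: 1 if first gene > second gene, else -1."""
    out = {}
    for name in pair_names:
        first, second = name.split(sep, 1)
        out[name] = [1 if a > b else -1 for a, b in zip(exp[first], exp[second])]
    return out


sc_gp = transform_with_pairs(sc_exp, kept_pairs)
```

Because both matrices share the same row names, a model trained on the bulk pair matrix can be applied to the sc/st pair matrix row for row.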
- split_gene_pairs(gene_pairs)[source]
Helper function to split gene pair names.
- Parameters:
gene_pairs (list) – A list of gene pair strings (e.g., “GENE1__GENE2”).
- Returns:
- A tuple containing:
list: The list of first genes (e.g., “GENE1”).
list: The list of second genes (e.g., “GENE2”).
- Return type:
tuple
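A minimal sketch of the splitting step, assuming the `__` separator shown in the example above. The function name `split_pairs` and the gene names are illustrative only.

```python
def split_pairs(gene_pairs, sep="__"):
    """Split 'GENE1__GENE2' strings into two parallel lists."""
    firsts, seconds = [], []
    for pair in gene_pairs:
        # Split at the first separator only, so the second name is kept whole.
        first, second = pair.split(sep, 1)
        firsts.append(first)
        seconds.append(second)
    return firsts, seconds


first_genes, second_genes = split_pairs(["TP53__MYC", "EGFR__KRAS"])
```

A gene name that itself contains the separator would be split in the wrong place, so the separator has to be a string that never occurs in gene names.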