tirank.GPextractor.GenePairExtractor

class tirank.GPextractor.GenePairExtractor(savePath, analysis_mode, top_var_genes=500, top_gene_pairs=2000, p_value_threshold=None, max_cutoff=0.8, min_cutoff=0.2)[source]

Bases: object

A class to extract and filter phenotype-associated gene pairs (PGPs).

This class loads bulk and sc/st expression data, identifies genes associated with a clinical phenotype (via t-test, Cox regression, or Pearson correlation), creates all possible pairs between positive and negative-associated genes, filters these pairs based on co-occurrence and variance, and finally transforms both bulk and sc/st datasets into gene pair matrices.

Parameters:
  • savePath (str) – The main project directory path.

  • analysis_mode (str) – The analysis mode (‘Classification’, ‘Cox’, ‘Regression’).

  • top_var_genes (int, optional) – The number of top variable genes to pre-filter from the sc/st data. Defaults to 500.

  • top_gene_pairs (int, optional) – The number of top variable gene pairs to select after filtering. Defaults to 2000.

  • p_value_threshold (float, optional) – P-value threshold for selecting phenotype-associated genes. Defaults to None, as in the code; this default may be problematic, so consider setting the threshold explicitly.

  • max_cutoff (float, optional) – Maximum co-occurrence proportion for filtering gene pairs (removes highly redundant pairs). Defaults to 0.8.

  • min_cutoff (float, optional) – Minimum co-occurrence proportion for filtering gene pairs (removes pairs with no co-occurrence). Defaults to 0.2.

load_data()[source]

Loads the required expression and clinical data from disk.

Loads ‘bulkExp_train.pkl’, ‘bulkClinical_train.pkl’, and ‘scAnndata.pkl’ from the ‘2_preprocessing’ directory and stores them as attributes.

Returns:

None

save_data()[source]

Saves the generated gene pair matrices to disk.

Saves ‘train_bulk_gene_pairs_mat.pkl’, ‘val_bulkExp_gene_pairs_mat.pkl’, and ‘sc_gene_pairs_mat.pkl’ to the ‘2_preprocessing’ directory. The validation matrix is created by transforming the validation expression data using the training gene pairs.

Returns:

None

run_extraction()[source]

Main orchestration function to run the full gene pair extraction pipeline.

This function performs the following steps:

  1. Finds intersecting genes between bulk and sc/st data.

  2. Selects top variable genes from sc/st data.

  3. Subsets all expression data to these genes.

  4. Identifies phenotype-associated gene sets (e.g., risk/protective) based on the specified ‘analysis_mode’.

  5. Transforms the bulk expression data into a gene pair matrix.

  6. Filters the gene pair matrix by co-occurrence and variance.

  7. Transforms the sc/st expression data using the filtered gene pairs.

  8. Saves the final matrices as attributes and plots them.

Returns:

None
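
Steps 1 and 2 can be illustrated with a minimal pandas sketch. It is not the TiRank implementation: the helper name select_candidate_genes is hypothetical, both matrices are assumed to be genes x samples, and plain per-gene variance stands in for whatever variability measure the library actually uses.

```python
import pandas as pd


def select_candidate_genes(bulk_expr, sc_expr, top_var_genes=500):
    """Hypothetical helper: shared genes ranked by variance in the sc/st data.

    Assumes both matrices are genes x samples (an assumption, not taken from the source).
    """
    # Step 1: genes present in both datasets
    shared = bulk_expr.index.intersection(sc_expr.index)
    # Step 2: rank shared genes by their variance across cells/spots
    variances = sc_expr.loc[shared].var(axis=1)
    return variances.sort_values(ascending=False).index[:top_var_genes].tolist()


bulk_expr = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]],
    index=["A", "B", "C", "D"],
    columns=["b1", "b2"],
)
sc_expr = pd.DataFrame(
    [[0.0, 0.0, 0.0], [0.0, 5.0, 10.0], [1.0, 2.0, 3.0], [0.0, 100.0, 0.0]],
    index=["B", "C", "D", "E"],
    columns=["c1", "c2", "c3"],
)
candidates = select_candidate_genes(bulk_expr, sc_expr, top_var_genes=2)
```

Gene E is the most variable in the toy sc/st matrix but is dropped because it is absent from the bulk data.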

extract_candidate_genes(gene_names)[source]

Subsets the expression matrices to a list of candidate genes.

Parameters:

gene_names (list) – A list of gene names to keep.

Returns:

A tuple containing:
  • pd.DataFrame: The subsetted bulk expression matrix.

  • pd.DataFrame: The subsetted single-cell expression matrix.

Return type:

tuple

calculate_binomial_gene_pairs()[source]

Finds phenotype-associated genes for ‘Classification’ mode.

Performs a t-test for each gene between two groups in the clinical data.

Returns:

A tuple containing:
  • list: Genes up-regulated in group 0 (t-stat > 0).

  • list: Genes up-regulated in group 1 (t-stat < 0).

Return type:

tuple
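
The sign-based split described above can be sketched with scipy.stats.ttest_ind. This is an illustration under stated assumptions rather than the library's code: the helper name split_by_ttest is hypothetical, the expression matrix is assumed to be genes x samples, the group labels are assumed to be 0/1, and the 0.05 threshold is chosen for the sketch only (the class default is None).

```python
import pandas as pd
from scipy import stats


def split_by_ttest(expr, groups, p_value_threshold=0.05):
    """Hypothetical helper: split significant genes by the sign of the t-statistic.

    expr: genes x samples; groups: Series of 0/1 labels indexed by sample (assumed layout).
    """
    g0 = expr.loc[:, groups[groups == 0].index].to_numpy()
    g1 = expr.loc[:, groups[groups == 1].index].to_numpy()
    # One two-sample t-test per gene (row), comparing group 0 against group 1
    t_stat, p_val = stats.ttest_ind(g0, g1, axis=1)
    significant = p_val < p_value_threshold
    up_in_0 = expr.index[significant & (t_stat > 0)].tolist()
    up_in_1 = expr.index[significant & (t_stat < 0)].tolist()
    return up_in_0, up_in_1


samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
expr = pd.DataFrame(
    [[10, 11, 12, 1, 2, 3], [1, 2, 3, 10, 11, 12], [5, 6, 5, 6, 5, 6]],
    index=["G_HIGH0", "G_HIGH1", "G_FLAT"],
    columns=samples,
    dtype=float,
)
groups = pd.Series([0, 0, 0, 1, 1, 1], index=samples)
up_in_0, up_in_1 = split_by_ttest(expr, groups)
```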

calculate_survival_gene_pairs()[source]

Finds phenotype-associated genes for ‘Cox’ survival mode.

Fits a univariate Cox proportional hazards model for each gene.

Returns:

A tuple containing:
  • list: Risk genes (Hazard Ratio > 1).

  • list: Protective genes (Hazard Ratio < 1).

Return type:

tuple

calculate_regression_gene_pairs()[source]

Finds phenotype-associated genes for ‘Regression’ mode.

Computes the Pearson correlation between each gene and the continuous clinical variable.

Returns:

A tuple containing:
  • list: Positively correlated genes.

  • list: Negatively correlated genes.

Return type:

tuple
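
A minimal sketch of the correlation-based split, using scipy.stats.pearsonr. The helper name split_by_pearson, the genes x samples layout, and the 0.05 threshold are assumptions made for illustration; they are not taken from the TiRank source.

```python
import pandas as pd
from scipy import stats


def split_by_pearson(expr, phenotype, p_value_threshold=0.05):
    """Hypothetical helper: split significant genes by the sign of Pearson's r.

    expr: genes x samples; phenotype: Series of continuous values indexed by sample.
    """
    target = phenotype.loc[expr.columns].to_numpy()
    positive, negative = [], []
    for gene, values in expr.iterrows():
        r, p = stats.pearsonr(values.to_numpy(), target)
        # Skip non-significant genes (this also skips NaN p-values from constant rows)
        if not p < p_value_threshold:
            continue
        (positive if r > 0 else negative).append(gene)
    return positive, negative


samples = ["s1", "s2", "s3", "s4", "s5"]
expr = pd.DataFrame(
    [[2, 4, 6, 8, 10.5], [10, 8, 6, 4, 1.5], [1, 3, 1, 3, 1]],
    index=["G_POS", "G_NEG", "G_NONE"],
    columns=samples,
    dtype=float,
)
phenotype = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=samples)
positive, negative = split_by_pearson(expr, phenotype)
```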

transform_bulk_gene_pairs(genes_r, genes_p)[source]

Transforms the bulk expression matrix into a gene pair matrix (REO).

Creates all possible pairs between the two gene sets (e.g., risk/protective). A pair is 1 if gene_r > gene_p, else -1.

Parameters:
  • genes_r (list) – The list of genes for the “positive” set (e.g., risk genes).

  • genes_p (list) – The list of genes for the “negative” set (e.g., protective genes).

Returns:

The transformed bulk gene pair matrix (gene pairs x samples).

Return type:

pd.DataFrame
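
The pairing rule (1 if gene_r > gene_p, else -1) can be sketched as follows. The helper name reo_transform is hypothetical, the input is assumed to be genes x samples, and the “__” separator is borrowed from the pair-name example given for split_gene_pairs.

```python
import numpy as np
import pandas as pd


def reo_transform(expr, genes_r, genes_p):
    """Hypothetical helper: build a (gene pairs x samples) matrix of +1/-1 orderings.

    Assumes expr is genes x samples.
    """
    rows = {}
    for g_r in genes_r:
        for g_p in genes_p:
            # 1 where g_r is strictly higher than g_p in that sample, otherwise -1
            rows[f"{g_r}__{g_p}"] = np.where(expr.loc[g_r] > expr.loc[g_p], 1, -1)
    return pd.DataFrame.from_dict(rows, orient="index", columns=expr.columns)


expr = pd.DataFrame(
    {"s1": [5.0, 1.0, 3.0], "s2": [2.0, 4.0, 2.0]},
    index=["A", "B", "C"],
)
pairs = reo_transform(expr, genes_r=["A"], genes_p=["B", "C"])
```

Because the comparison is strict, a tie (A and C in sample s2) is encoded as -1.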

filter_gene_pairs(bulk_GPMat)[source]

Filters the bulk gene pair matrix based on co-occurrence and variance.

Parameters:

bulk_GPMat (pd.DataFrame) – The raw bulk gene pair matrix.

Returns:

The filtered bulk gene pair matrix.

Return type:

pd.DataFrame
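
The exact filtering rule is not spelled out here, so the following is only one plausible reading, sketched under explicit assumptions: “co-occurrence proportion” is taken to mean the fraction of samples in which a pair equals 1, pairs outside (min_cutoff, max_cutoff) are dropped, and the remaining pairs are ranked by variance across samples. The helper name filter_pairs is hypothetical.

```python
import pandas as pd


def filter_pairs(gp_mat, min_cutoff=0.2, max_cutoff=0.8, top_gene_pairs=2000):
    """Hypothetical helper: filter a (gene pairs x samples) +1/-1 matrix."""
    # Fraction of samples in which each pair equals 1 (assumed meaning of co-occurrence)
    prop = (gp_mat == 1).mean(axis=1)
    kept = gp_mat.loc[(prop > min_cutoff) & (prop < max_cutoff)]
    # Rank the surviving pairs by variance across samples and keep the most variable
    top = kept.var(axis=1).sort_values(ascending=False).index[:top_gene_pairs]
    return kept.loc[top]


gp_mat = pd.DataFrame(
    [[1, 1, 1, 1], [1, -1, 1, -1], [1, -1, -1, -1], [-1, -1, -1, -1]],
    index=["A__B", "A__C", "D__B", "D__C"],
    columns=["s1", "s2", "s3", "s4"],
)
filtered = filter_pairs(gp_mat, top_gene_pairs=1)
```

In this toy matrix, A__B (always 1) and D__C (never 1) carry no information and are removed by the cutoffs before the variance ranking is applied.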

transform_single_cell_gene_pairs(bulk_GPMat)[source]

Transforms the sc/st expression matrix into a gene pair matrix.

Uses the exact same gene pairs that were filtered from the bulk data.

Parameters:

bulk_GPMat (pd.DataFrame) – The filtered bulk gene pair matrix. The index of this DataFrame defines the gene pairs to use.

Returns:

The transformed sc/st gene pair matrix.

Return type:

pd.DataFrame
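
Reusing the filtered bulk pairs on sc/st data might look like the sketch below. It assumes a genes x cells matrix, the “__” separator, and the same +1/-1 encoding as the bulk transform; the library may treat ties or zero counts differently. The helper name transform_with_pairs is hypothetical.

```python
import numpy as np
import pandas as pd


def transform_with_pairs(expr, pair_names, sep="__"):
    """Hypothetical helper: encode the given gene pairs for every cell or spot.

    Assumes expr is genes x cells and pair names look like "GENE1__GENE2".
    """
    first = [name.split(sep, 1)[0] for name in pair_names]
    second = [name.split(sep, 1)[1] for name in pair_names]
    # Row-wise comparison of the two genes of each pair across all cells
    values = np.where(expr.loc[first].to_numpy() > expr.loc[second].to_numpy(), 1, -1)
    return pd.DataFrame(values, index=list(pair_names), columns=expr.columns)


sc_expr = pd.DataFrame(
    {"c1": [0.0, 2.0, 1.0], "c2": [3.0, 0.0, 3.0]},
    index=["A", "B", "C"],
)
# In practice the pair names would come from the index of the filtered bulk matrix
sc_pairs = transform_with_pairs(sc_expr, ["A__C", "B__C"])
```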

split_gene_pairs(gene_pairs)[source]

Helper function to split gene pair names.

Parameters:

gene_pairs (list) – A list of gene pair strings (e.g., “GENE1__GENE2”).

Returns:

A tuple containing:
  • list: The list of first genes (e.g., “GENE1”).

  • list: The list of second genes (e.g., “GENE2”).

Return type:

tuple
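
A minimal sketch of the splitting behavior, assuming the “__” separator shown in the example above. The function name split_pairs is hypothetical and this is not the library's own implementation.

```python
def split_pairs(gene_pairs, sep="__"):
    """Hypothetical helper: split "GENE1__GENE2" strings into two parallel lists."""
    first, second = [], []
    for pair in gene_pairs:
        # Split only on the first separator so the remainder stays intact
        left, right = pair.split(sep, 1)
        first.append(left)
        second.append(right)
    return first, second


first, second = split_pairs(["TP53__MYC", "EGFR__KRAS"])
```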