tirank.SCSTpreprocess

tirank.SCSTpreprocess.merge_datasets(bulkClinical_1, bulkClinical_2, bulkExp_1, bulkExp_2)[source]

Merges two bulk expression and clinical datasets, finding intersecting genes.

Parameters:
  • bulkClinical_1 (pd.DataFrame) – Clinical data for the first cohort.

  • bulkClinical_2 (pd.DataFrame) – Clinical data for the second cohort.

  • bulkExp_1 (pd.DataFrame) – Expression data for the first cohort (genes x samples).

  • bulkExp_2 (pd.DataFrame) – Expression data for the second cohort (genes x samples).

Returns:

A tuple containing:
  • pd.DataFrame: The merged expression DataFrame.

  • pd.DataFrame: The merged clinical DataFrame.

If the two expression matrices share no genes, the function returns 0 instead of a tuple.

Return type:

tuple
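The behaviour described above can be sketched in plain pandas. This is an illustration only, not TiRank's implementation: the helper name `merge_on_shared_genes` is hypothetical, and the sketch assumes that merging means restricting both expression matrices to their shared genes and then concatenating samples.

```python
import pandas as pd

def merge_on_shared_genes(clin_1, clin_2, exp_1, exp_2):
    """Hypothetical stand-in illustrating the documented behaviour."""
    # Genes (row labels) present in both expression matrices.
    shared = exp_1.index.intersection(exp_2.index)
    if len(shared) == 0:
        return 0  # mirrors the documented "returns 0" case
    # Keep shared genes only, then place the cohorts' samples side by side.
    merged_exp = pd.concat([exp_1.loc[shared], exp_2.loc[shared]], axis=1)
    # Clinical tables are assumed to be samples x variables, so stack rows.
    merged_clin = pd.concat([clin_1, clin_2], axis=0)
    return merged_exp, merged_clin

exp_a = pd.DataFrame({"s1": [1.0, 2.0, 3.0]}, index=["TP53", "MYC", "EGFR"])
exp_b = pd.DataFrame({"s2": [4.0, 5.0]}, index=["MYC", "TP53"])
clin_a = pd.DataFrame({"status": [1]}, index=["s1"])
clin_b = pd.DataFrame({"status": [0]}, index=["s2"])

result = merge_on_shared_genes(clin_a, clin_b, exp_a, exp_b)
```

Because the documented failure value is 0 rather than a tuple, callers may want to check the return value (for example with `isinstance(result, tuple)`) before unpacking it.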

tirank.SCSTpreprocess.normalize_data(exp)[source]

Normalizes gene expression data using row-wise z-score normalization, so that each gene is standardized across samples.

Parameters:

exp (pd.DataFrame) – A pandas DataFrame with genes as rows and samples as columns.

Returns:

A z-score normalized DataFrame.

Return type:

pd.DataFrame
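As a rough sketch of row-wise z-scoring on a genes x samples DataFrame: the helper name `zscore_rows` is hypothetical, and the use of the population standard deviation (`ddof=0`) is an assumption, since the source does not say which variant TiRank uses.

```python
import pandas as pd

def zscore_rows(exp):
    """Hypothetical helper: z-score each gene (row) across samples."""
    mean = exp.mean(axis=1)
    # ddof=0 (population standard deviation) is an assumption made here.
    std = exp.std(axis=1, ddof=0)
    # axis=0 aligns the per-gene mean and std with the row index.
    return exp.sub(mean, axis=0).div(std, axis=0)

exp = pd.DataFrame(
    {"s1": [1.0, 10.0], "s2": [2.0, 20.0], "s3": [3.0, 30.0]},
    index=["geneA", "geneB"],
)
z = zscore_rows(exp)
```

In this sketch, a gene with identical values in every sample has zero standard deviation and therefore produces NaN; such genes may need to be dropped or filled beforehand.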

tirank.SCSTpreprocess.is_imbalanced(bulkClinical, threshold)[source]

Checks if the primary clinical variable is imbalanced.

Parameters:
  • bulkClinical (pd.DataFrame) – DataFrame with clinical data. Assumes the variable of interest is in the first column.

  • threshold (float) – The minimum proportion for a class to be considered ‘balanced’.

Returns:

True if the proportion of the minority class is below the threshold, False otherwise.

Return type:

bool
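The check can be illustrated as follows. This is a sketch under the documented assumption that the variable of interest sits in the first column; the helper name `minority_below_threshold` is hypothetical and the code is not taken from TiRank.

```python
import pandas as pd

def minority_below_threshold(clinical, threshold):
    """Hypothetical helper mirroring the documented check."""
    # Class proportions of the first column (the variable of interest).
    proportions = clinical.iloc[:, 0].value_counts(normalize=True)
    # The smallest proportion belongs to the minority class.
    return bool(proportions.min() < threshold)

# Four responders and one non-responder: the minority proportion is 0.2.
clinical = pd.DataFrame({"response": [1, 1, 1, 1, 0]})
flag = minority_below_threshold(clinical, threshold=0.3)
```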

tirank.SCSTpreprocess.perform_sampling_on_RNAseq(savePath, mode='SMOTE', threshold=0.5)[source]

Performs sampling (over- or under-sampling) on the bulk training data.

This function is used to correct for class imbalance in ‘Classification’ mode. It loads the training data, applies the specified sampling method, and overwrites the training files with the resampled data.

Parameters:
  • savePath (str) – The main project directory path.

  • mode (str, optional) – The sampling method to use. One of ‘SMOTE’, ‘downsample’ (RandomUnderSampler), ‘upsample’ (RandomOverSampler), or ‘tomeklinks’ (TomekLinks). Defaults to “SMOTE”.

  • threshold (float, optional) – The imbalance threshold. Sampling is only performed if the minority class proportion is below this value. Defaults to 0.5.

Returns:

None
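To give a feel for what the 'upsample' mode does conceptually, here is a small pandas-only sketch of random over-sampling. It stands in for RandomOverSampler rather than calling it, performs no file I/O (the file layout under `savePath` is not described here), and uses a hypothetical helper name, `random_upsample`, with samples as rows.

```python
import pandas as pd

def random_upsample(features, labels, seed=0):
    """Hypothetical helper: duplicate minority-class rows at random
    until every class has as many samples as the largest class."""
    counts = labels.value_counts()
    target = int(counts.max())
    pieces = []
    for cls, n in counts.items():
        # Sample IDs belonging to this class.
        ids = pd.Series(labels.index[(labels == cls).to_numpy()])
        pieces.append(ids)
        if n < target:
            # Draw the missing rows with replacement from this class.
            extra = ids.sample(n=target - int(n), replace=True, random_state=seed)
            pieces.append(extra)
    chosen = pd.concat(pieces).to_numpy()
    X_res = features.loc[chosen].reset_index(drop=True)
    y_res = labels.loc[chosen].reset_index(drop=True)
    return X_res, y_res

features = pd.DataFrame(
    {"geneA": [0.1, 0.2, 0.3, 0.4, 0.9]},
    index=["s1", "s2", "s3", "s4", "s5"],
)
labels = pd.Series([0, 0, 0, 0, 1], index=features.index)
X_res, y_res = random_upsample(features, labels)
```

Unlike this duplication-based sketch, SMOTE synthesizes new minority samples by interpolation, while 'downsample' and 'tomeklinks' remove samples instead of adding them.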

tirank.SCSTpreprocess.FilteringAnndata(adata, max_count=35000, min_count=5000, MT_propor=10, min_cell=10, imgPath='./')[source]

Filters an AnnData object based on QC metrics.

Filters cells/spots based on total counts and mitochondrial percentage. Filters genes based on minimum cell count. Also saves a QC violin plot.

Parameters:
  • adata (sc.AnnData) – The AnnData object to filter.

  • max_count (int, optional) – Maximum total counts per cell/spot. Defaults to 35000.

  • min_count (int, optional) – Minimum total counts per cell/spot. Defaults to 5000.

  • MT_propor (int, optional) – Maximum percentage of mitochondrial gene counts. Defaults to 10.

  • min_cell (int, optional) – Minimum number of cells/spots a gene must be expressed in. Defaults to 10.

  • imgPath (str, optional) – Path to save the QC violin plot. Defaults to “./”.

Returns:

The filtered AnnData object.

Return type:

sc.AnnData
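The filtering logic can be pictured on a plain cells x genes count matrix. The following NumPy sketch is not the function's implementation: `qc_masks` is a hypothetical helper, and it assumes that mitochondrial genes carry an "MT-" prefix, that the thresholds are inclusive, and that genes are filtered after cells.

```python
import numpy as np

def qc_masks(counts, gene_names, max_count=35000, min_count=5000,
             mt_percent=10, min_cell=10):
    """Hypothetical helper: boolean keep-masks for cells and genes.
    `counts` is a cells x genes array of raw counts."""
    total = counts.sum(axis=1)
    # Assumption: mitochondrial genes are identified by an "MT-" prefix.
    is_mt = np.array([g.upper().startswith("MT-") for g in gene_names])
    pct_mt = 100.0 * counts[:, is_mt].sum(axis=1) / np.maximum(total, 1)
    # Assumption: thresholds are inclusive on both ends.
    keep_cells = (total >= min_count) & (total <= max_count) & (pct_mt <= mt_percent)
    # Assumption: a gene is kept if detected in >= min_cell of the kept cells.
    keep_genes = (counts[keep_cells] > 0).sum(axis=0) >= min_cell
    return keep_cells, keep_genes

counts = np.array([
    [50, 40, 5],   # passes every cell filter
    [60, 0, 30],   # too much mitochondrial signal
    [2, 1, 0],     # too few total counts
    [70, 0, 4],    # passes every cell filter
])
gene_names = ["GeneA", "GeneB", "MT-CO1"]

# Toy thresholds scaled down to match the tiny example matrix.
keep_cells, keep_genes = qc_masks(
    counts, gene_names, max_count=100, min_count=10, mt_percent=10, min_cell=2
)
filtered = counts[keep_cells][:, keep_genes]
```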

tirank.SCSTpreprocess.Normalization(adata)[source]

Performs total count normalization (target_sum=1e4) on an AnnData object.

Parameters:

adata (sc.AnnData) – The AnnData object.

Returns:

The normalized AnnData object.

Return type:

sc.AnnData
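Total count normalization rescales every cell or spot so that its counts sum to the same target. A minimal NumPy sketch of that arithmetic, with a hypothetical helper name `normalize_total` and an assumed convention for all-zero cells, might look like this:

```python
import numpy as np

def normalize_total(counts, target_sum=1e4):
    """Hypothetical helper: scale each cell (row) so its counts sum to target_sum."""
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption: cells with zero total counts are left as zeros
    # rather than triggering a division by zero.
    safe_totals = np.where(totals == 0, 1, totals)
    return counts / safe_totals * target_sum

counts = np.array([[1.0, 3.0], [10.0, 30.0], [0.0, 0.0]])
norm = normalize_total(counts)
```

The first two cells differ tenfold in sequencing depth but have the same gene proportions, so they become identical after normalization.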

tirank.SCSTpreprocess.Logtransformation(adata)[source]

Performs log1p transformation on an AnnData object.

Parameters:

adata (sc.AnnData) – The AnnData object.

Returns:

The log-transformed AnnData object.

Return type:

sc.AnnData
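For reference, log1p computes log(1 + x) element-wise using the natural logarithm. The short NumPy illustration below shows the transform on a few made-up values; it does not operate on an AnnData object.

```python
import numpy as np

values = np.array([0.0, 1.0, 9999.0])
# log(1 + x): zero counts stay at zero, large counts are compressed.
logged = np.log1p(values)
# expm1 is the inverse transform, exp(x) - 1.
restored = np.expm1(logged)
```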

tirank.SCSTpreprocess.Clustering(ann_data, infer_mode, savePath)[source]

Performs standard clustering (HVGs, PCA, neighbors, UMAP, Leiden).

If neighbors are already computed, only the Leiden step is re-run; otherwise the full pipeline is run. A UMAP or spatial plot is also saved.

Parameters:
  • ann_data (sc.AnnData) – The AnnData object.

  • infer_mode (str) – The inference data type (‘SC’ or ‘ST’) for plotting.

  • savePath (str) – The main project directory path to save plots.

Returns:

The clustered AnnData object.

Return type:

sc.AnnData

tirank.SCSTpreprocess.compute_similarity(savePath, ann_data, calculate_distance=False)[source]

Extracts and saves the cell/spot similarity matrix (connectivities).

Optionally, it can also calculate a spatial distance-based adjacency matrix (6 nearest neighbors) for ST data.

Parameters:
  • savePath (str) – The main project directory path.

  • ann_data (sc.AnnData) – A clustered AnnData object (must have ann_data.obsp[‘connectivities’]).

  • calculate_distance (bool, optional) – Whether to compute the spatial distance matrix (ST only). Defaults to False.

Returns:

None
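The optional spatial adjacency can be pictured as linking each spot to its 6 nearest spots. The NumPy sketch below is an assumption-laden illustration, not the function's code: `knn_adjacency` is a hypothetical helper, distances are Euclidean, the matrix is binary, and no symmetrization is applied.

```python
import numpy as np

def knn_adjacency(coords, k=6):
    """Hypothetical helper: binary adjacency linking each spot to its k nearest spots."""
    n = coords.shape[0]
    # Pairwise Euclidean distances between spot coordinates.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a spot is not its own neighbour
    # Column indices of the k smallest distances in each row.
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=int)
    adj[np.repeat(np.arange(n), k), nearest.ravel()] = 1
    return adj

# One central spot, its six hexagonal neighbours at unit distance, and one distant spot.
angles = np.deg2rad(np.arange(0, 360, 60))
ring = np.column_stack([np.cos(angles), np.sin(angles)])
coords = np.vstack([[0.0, 0.0], ring, [10.0, 10.0]])

adj = knn_adjacency(coords, k=6)
```

A k-nearest-neighbour relation is not symmetric in general: here the distant spot lists six neighbours, but none of them list it in return.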

tirank.SCSTpreprocess.calculate_populations_meanRank(input_data, category)[source]

Calculates the mean feature values for each cell subpopulation (category).

Parameters:
  • input_data (pd.DataFrame) – Input DataFrame (samples x features).

  • category (pd.Series) – A Series indicating the category (e.g., cluster) of each sample. Must share the same index as input_data.

Returns:

A DataFrame where rows are categories and columns are the mean of features for that category.

Return type:

pd.DataFrame
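The described output has the same shape as a grouped mean in pandas. The following is a sketch of that idea with made-up data, not necessarily how the function is implemented:

```python
import pandas as pd

# Three cells (samples) x two features, with a cluster label per cell.
input_data = pd.DataFrame(
    {"f1": [1.0, 3.0, 10.0], "f2": [2.0, 4.0, 20.0]},
    index=["c1", "c2", "c3"],
)
category = pd.Series(["A", "A", "B"], index=input_data.index, name="cluster")

# Rows of the result are categories; columns are per-category feature means.
means = input_data.groupby(category).mean()
```

The grouping aligns on the index, which is why `category` must share the same index as `input_data`.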

Functions

tirank.SCSTpreprocess.Clustering

Performs standard clustering (HVGs, PCA, neighbors, UMAP, Leiden).

tirank.SCSTpreprocess.FilteringAnndata

Filters an AnnData object based on QC metrics.

tirank.SCSTpreprocess.Logtransformation

Performs log1p transformation on an AnnData object.

tirank.SCSTpreprocess.Normalization

Performs total count normalization (target_sum=1e4) on an AnnData object.

tirank.SCSTpreprocess.calculate_populations_meanRank

Calculates the mean feature values for each cell subpopulation (category).

tirank.SCSTpreprocess.compute_similarity

Extracts and saves the cell/spot similarity matrix (connectivities).

tirank.SCSTpreprocess.is_imbalanced

Checks if the primary clinical variable is imbalanced.

tirank.SCSTpreprocess.merge_datasets

Merges two bulk expression and clinical datasets, finding intersecting genes.

tirank.SCSTpreprocess.normalize_data

Normalizes gene expression data using z-score normalization (row-wise).

tirank.SCSTpreprocess.perform_sampling_on_RNAseq

Performs sampling (over- or under-sampling) on the bulk training data.