tirank.SCSTpreprocess
- tirank.SCSTpreprocess.merge_datasets(bulkClinical_1, bulkClinical_2, bulkExp_1, bulkExp_2)
Merges two bulk expression and clinical datasets, finding intersecting genes.
- Parameters:
bulkClinical_1 (pd.DataFrame) – Clinical data for the first cohort.
bulkClinical_2 (pd.DataFrame) – Clinical data for the second cohort.
bulkExp_1 (pd.DataFrame) – Expression data for the first cohort (genes x samples).
bulkExp_2 (pd.DataFrame) – Expression data for the second cohort (genes x samples).
- Returns:
A tuple containing:
pd.DataFrame: The merged expression DataFrame.
pd.DataFrame: The merged clinical DataFrame.
Returns 0 instead of a tuple if the two expression datasets share no genes.
- Return type:
tuple
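The merge described above can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: the helper name merge_on_shared_genes and the toy data are hypothetical, and it assumes genes are the row index of each expression DataFrame and samples are the row index of each clinical DataFrame.

```python
import pandas as pd

def merge_on_shared_genes(exp_1, exp_2, clin_1, clin_2):
    """Hypothetical helper: keep genes present in both cohorts, then concatenate."""
    shared = exp_1.index.intersection(exp_2.index)
    if len(shared) == 0:
        return 0  # mirrors the documented return value when no genes intersect
    # Expression is genes x samples, so cohorts are joined column-wise.
    merged_exp = pd.concat([exp_1.loc[shared], exp_2.loc[shared]], axis=1)
    # Clinical data is assumed to be samples x variables, so it is stacked row-wise.
    merged_clin = pd.concat([clin_1, clin_2], axis=0)
    return merged_exp, merged_clin

exp_a = pd.DataFrame({"s1": [1.0, 2.0, 3.0], "s2": [4.0, 5.0, 6.0]},
                     index=["TP53", "EGFR", "MYC"])
exp_b = pd.DataFrame({"s3": [7.0, 8.0]}, index=["EGFR", "KRAS"])
clin_a = pd.DataFrame({"label": [0, 1]}, index=["s1", "s2"])
clin_b = pd.DataFrame({"label": [1]}, index=["s3"])

merged_exp, merged_clin = merge_on_shared_genes(exp_a, exp_b, clin_a, clin_b)
```

Because the documented function can return 0 rather than a tuple, callers may want to check the result before unpacking it.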
- tirank.SCSTpreprocess.normalize_data(exp)
Normalizes gene expression data using row-wise z-score normalization, so each gene is standardized across samples.
- Parameters:
exp (pd.DataFrame) – A pandas DataFrame with genes as rows and samples as columns.
- Returns:
A z-score normalized DataFrame.
- Return type:
pd.DataFrame
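Row-wise z-scoring can be sketched in pandas as follows. The helper name zscore_rows is hypothetical, and the choice of the sample standard deviation (pandas default, ddof=1) is an assumption; the library may use a different convention.

```python
import pandas as pd

def zscore_rows(exp):
    """Hypothetical helper: standardize each gene (row) across samples (columns)."""
    mean = exp.mean(axis=1)
    std = exp.std(axis=1)  # assumption: sample standard deviation (ddof=1)
    return exp.sub(mean, axis=0).div(std, axis=0)

exp = pd.DataFrame(
    {"s1": [1.0, 10.0], "s2": [2.0, 20.0], "s3": [3.0, 30.0]},
    index=["geneA", "geneB"],
)
z = zscore_rows(exp)
```

After this transform both genes have the same profile across samples even though their raw scales differ by a factor of ten, which is the point of per-gene standardization.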
- tirank.SCSTpreprocess.is_imbalanced(bulkClinical, threshold)
Checks if the primary clinical variable is imbalanced.
- Parameters:
bulkClinical (pd.DataFrame) – DataFrame with clinical data. Assumes the variable of interest is in the first column.
threshold (float) – The minimum proportion for a class to be considered ‘balanced’.
- Returns:
True if the minority class is below the threshold, False otherwise.
- Return type:
bool
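The check described above amounts to comparing the smallest class proportion with the threshold. A minimal sketch, assuming the label sits in the first column as documented; the helper name minority_below_threshold and the toy data are hypothetical.

```python
import pandas as pd

def minority_below_threshold(clinical, threshold):
    """Hypothetical helper: True if the rarest class falls below `threshold`."""
    proportions = clinical.iloc[:, 0].value_counts(normalize=True)
    return bool(proportions.min() < threshold)

# Four responders and one non-responder: the minority proportion is 0.2.
clinical = pd.DataFrame({"response": [1, 1, 1, 1, 0]})

flag_strict = minority_below_threshold(clinical, 0.3)
flag_loose = minority_below_threshold(clinical, 0.1)
```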
- tirank.SCSTpreprocess.perform_sampling_on_RNAseq(savePath, mode='SMOTE', threshold=0.5)
Performs sampling (over- or under-sampling) on the bulk training data.
This function is used to correct for class imbalance in ‘Classification’ mode. It loads the training data, applies the specified sampling method, and overwrites the training files with the resampled data.
- Parameters:
savePath (str) – The main project directory path.
mode (str, optional) – The sampling method to use. One of ‘SMOTE’, ‘downsample’ (RandomUnderSampler), ‘upsample’ (RandomOverSampler), or ‘tomeklinks’ (TomekLinks). Defaults to “SMOTE”.
threshold (float, optional) – The imbalance threshold. Sampling is only performed if the minority class proportion is below this value. Defaults to 0.5.
- Returns:
None
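To make the idea of resampling concrete, the following sketch performs random oversampling with plain pandas. It is a simplified stand-in for the 'upsample' mode, not the library's code: the helper name random_upsample, the label column name, and the fixed seed are all hypothetical, and the file loading and overwriting described above are omitted.

```python
import pandas as pd

def random_upsample(clinical, label_col, seed=0):
    """Hypothetical helper: duplicate minority-class rows until classes are equal."""
    counts = clinical[label_col].value_counts()
    target = counts.max()
    parts = []
    for label, n in counts.items():
        group = clinical[clinical[label_col] == label]
        parts.append(group)
        if n < target:
            # Draw the missing rows with replacement from the same class.
            extra = group.sample(n=target - n, replace=True, random_state=seed)
            parts.append(extra)
    return pd.concat(parts)

clinical = pd.DataFrame({"label": [1, 1, 1, 1, 0]},
                        index=["s0", "s1", "s2", "s3", "s4"])
balanced = random_upsample(clinical, "label")
```

Oversampling repeats sample identifiers in the index, so any matching expression matrix would need to be resampled with the same row selection.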
- tirank.SCSTpreprocess.FilteringAnndata(adata, max_count=35000, min_count=5000, MT_propor=10, min_cell=10, imgPath='./')
Filters an AnnData object based on QC metrics.
Filters cells/spots based on total counts and mitochondrial percentage. Filters genes based on minimum cell count. Also saves a QC violin plot.
- Parameters:
adata (sc.AnnData) – The AnnData object to filter.
max_count (int, optional) – Maximum total counts per cell/spot. Defaults to 35000.
min_count (int, optional) – Minimum total counts per cell/spot. Defaults to 5000.
MT_propor (int, optional) – Maximum percentage of mitochondrial gene counts. Defaults to 10.
min_cell (int, optional) – Minimum number of cells/spots a gene must be expressed in. Defaults to 10.
imgPath (str, optional) – Path to save the QC violin plot. Defaults to “./”.
- Returns:
The filtered AnnData object.
- Return type:
sc.AnnData
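The filtering logic can be sketched on a raw count matrix with NumPy alone. This is an illustration under stated assumptions, not the library's implementation: the helper name qc_masks is hypothetical, thresholds are treated as inclusive, genes are filtered after cells, and the example uses deliberately small thresholds so the toy matrix is readable.

```python
import numpy as np

def qc_masks(counts, is_mt, min_count=5000, max_count=35000, mt_percent=10, min_cell=10):
    """Hypothetical helper.

    counts: cells x genes raw count matrix.
    is_mt:  boolean mask marking mitochondrial genes.
    Returns boolean masks over cells and over genes.
    """
    total = counts.sum(axis=1)
    # Percentage of each cell's counts that come from mitochondrial genes.
    mt_pct = 100.0 * counts[:, is_mt].sum(axis=1) / np.maximum(total, 1)
    keep_cells = (total >= min_count) & (total <= max_count) & (mt_pct <= mt_percent)
    # Assumption: gene filtering is applied to the cells that survived.
    keep_genes = (counts[keep_cells] > 0).sum(axis=0) >= min_cell
    return keep_cells, keep_genes

counts = np.array([
    [50, 40, 0, 10],   # total 100, 10% mitochondrial: kept
    [5, 3, 0, 2],      # total 10: below min_count
    [30, 0, 10, 60],   # total 100, 60% mitochondrial: too high
])
is_mt = np.array([False, False, False, True])

keep_cells, keep_genes = qc_masks(
    counts, is_mt, min_count=20, max_count=200, mt_percent=20, min_cell=1
)
filtered = counts[keep_cells][:, keep_genes]
```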
- tirank.SCSTpreprocess.Normalization(adata)
Performs total count normalization (target_sum=1e4) on an AnnData object.
- Parameters:
adata (sc.AnnData) – The AnnData object.
- Returns:
The normalized AnnData object.
- Return type:
sc.AnnData
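Total count normalization rescales every cell/spot so that its counts sum to the same target. A minimal NumPy sketch of that arithmetic, with a hypothetical helper name and a dense toy matrix standing in for the AnnData object:

```python
import numpy as np

def normalize_total(counts, target_sum=1e4):
    """Hypothetical helper: scale each row (cell/spot) to sum to `target_sum`."""
    totals = counts.sum(axis=1, keepdims=True)
    totals = np.where(totals == 0, 1, totals)  # leave all-zero rows unchanged
    return counts / totals * target_sum

counts = np.array([[1.0, 3.0], [10.0, 10.0]])
normalized = normalize_total(counts)
```

After scaling, cells with different sequencing depths become comparable because only the relative composition of each row is retained.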
- tirank.SCSTpreprocess.Logtransformation(adata)
Performs log1p transformation on an AnnData object.
- Parameters:
adata (sc.AnnData) – The AnnData object.
- Returns:
The log-transformed AnnData object.
- Return type:
sc.AnnData
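The log1p transform computes log(1 + x) element-wise. The short NumPy sketch below uses made-up values to show the two properties that make it suitable for count data: zeros stay at zero, and the transform is invertible with expm1.

```python
import numpy as np

# Toy expression values; in practice these would be the normalized counts.
values = np.array([0.0, 9.0, 99.0])

logged = np.log1p(values)      # log(1 + x), defined at x = 0
recovered = np.expm1(logged)   # inverse transform, exp(x) - 1
```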
- tirank.SCSTpreprocess.Clustering(ann_data, infer_mode, savePath)
Performs standard clustering (HVGs, PCA, neighbors, UMAP, Leiden).
If neighbors are already computed, it just re-runs Leiden. Otherwise, it runs the full pipeline. Saves a UMAP or spatial plot.
- Parameters:
ann_data (sc.AnnData) – The AnnData object.
infer_mode (str) – The inference data type (‘SC’ or ‘ST’) for plotting.
savePath (str) – The main project directory path to save plots.
- Returns:
The clustered AnnData object.
- Return type:
sc.AnnData
- tirank.SCSTpreprocess.compute_similarity(savePath, ann_data, calculate_distance=False)
Extracts and saves the cell/spot similarity matrix (connectivities).
Optionally, it can also calculate a spatial distance-based adjacency matrix (6 nearest neighbors) for ST data.
- Parameters:
savePath (str) – The main project directory path.
ann_data (sc.AnnData) – A clustered AnnData object (must have ann_data.obsp[‘connectivities’]).
calculate_distance (bool, optional) – Whether to compute the spatial distance matrix (ST only). Defaults to False.
- Returns:
None
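The optional spatial adjacency matrix can be pictured as a k-nearest-neighbor graph over spot coordinates with k = 6. The sketch below builds one with NumPy; it is an illustration only. The helper name knn_adjacency is hypothetical, Euclidean distance is an assumption, and the result is a directed (not necessarily symmetric) 0/1 matrix.

```python
import numpy as np

def knn_adjacency(coords, k=6):
    """Hypothetical helper: mark each spot's k nearest spots with a 1.

    coords: array of shape (n_spots, 2); requires n_spots > k.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a spot is never its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros(dist.shape, dtype=int)
    rows = np.repeat(np.arange(len(coords)), k)
    adj[rows, nearest.ravel()] = 1
    return adj

# A 3 x 3 grid of spots as toy coordinates.
grid_x, grid_y = np.meshgrid(np.arange(3), np.arange(3))
coords = np.column_stack([grid_x.ravel(), grid_y.ravel()]).astype(float)

adj = knn_adjacency(coords, k=6)
```

Six neighbors matches the hexagonal spot layout used by some spatial transcriptomics platforms, where each interior spot touches six others.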
- tirank.SCSTpreprocess.calculate_populations_meanRank(input_data, category)
Calculates the mean feature values for each cell subpopulation (category).
- Parameters:
input_data (pd.DataFrame) – Input DataFrame (samples x features).
category (pd.Series) – A Series indicating the category (e.g., cluster) of each sample. Must share the same index as input_data.
- Returns:
A DataFrame in which rows are categories and columns hold the mean of each feature for that category.
- Return type:
pd.DataFrame
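A per-category mean of this kind can be expressed as a pandas groupby. The sketch below uses hypothetical toy data and is not taken from the library's source; it only shows the shape of the input and output.

```python
import pandas as pd

# Samples x features, with one category label per sample on the same index.
data = pd.DataFrame(
    {"geneA": [1.0, 3.0, 10.0], "geneB": [2.0, 4.0, 20.0]},
    index=["c1", "c2", "c3"],
)
category = pd.Series(["T", "T", "B"], index=data.index)

# Rows of the result are categories; columns are per-category feature means.
means = data.groupby(category).mean()
```

Grouping by a Series aligns on the index, which is why category must share the index of input_data.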
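Functions
- Clustering(ann_data, infer_mode, savePath) – Performs standard clustering (HVGs, PCA, neighbors, UMAP, Leiden).
- FilteringAnndata(adata[, max_count, ...]) – Filters an AnnData object based on QC metrics.
- Logtransformation(adata) – Performs log1p transformation on an AnnData object.
- Normalization(adata) – Performs total count normalization (target_sum=1e4) on an AnnData object.
- calculate_populations_meanRank(input_data, category) – Calculates the mean feature values for each cell subpopulation (category).
- compute_similarity(savePath, ann_data[, calculate_distance]) – Extracts and saves the cell/spot similarity matrix (connectivities).
- is_imbalanced(bulkClinical, threshold) – Checks if the primary clinical variable is imbalanced.
- merge_datasets(bulkClinical_1, bulkClinical_2, bulkExp_1, bulkExp_2) – Merges two bulk expression and clinical datasets, finding intersecting genes.
- normalize_data(exp) – Normalizes gene expression data using z-score normalization (row-wise).
- perform_sampling_on_RNAseq(savePath[, mode, threshold]) – Performs sampling (over- or under-sampling) on the bulk training data.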