Analysis Functions
Utilities for computing metrics, gene contributions, and understanding basis performance.
compute_predictivity
- sctop.compute_predictivity(basis: DataFrame) → DataFrame
Compute predictivity matrix from basis.
The predictivity shows how each gene contributes to each cell type’s score. Formula: predictivity = inv(B^T @ B) @ B^T
- Parameters:
basis (pd.DataFrame) – Basis matrix (genes x cell_types)
- Returns:
predictivity – Predictivity matrix (cell_types x genes), showing how each gene contributes to each cell type score
- Return type:
pd.DataFrame
Purpose: Compute the predictivity matrix showing how each gene contributes to each cell type score.
Mathematical Background:
The predictivity matrix \(\eta\) is defined as:
\[\eta = (\xi^{T}\xi)^{-1}\xi^{T}\]
where \(\xi\) is the basis matrix (genes × cell types).
Each entry \(\eta_{ct,g}\) represents how much gene \(g\) contributes to the score for cell type \(ct\).
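The formula can be checked on a toy basis. The following is a minimal sketch with NumPy, not the library's implementation; the helper name predictivity_sketch and the three-gene basis are made up for illustration. It assumes the basis columns are linearly independent so that the inverse exists.

```python
import numpy as np
import pandas as pd

def predictivity_sketch(basis: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: eta = inv(xi^T xi) @ xi^T for a genes x cell_types basis."""
    xi = basis.to_numpy(dtype=float)
    eta = np.linalg.inv(xi.T @ xi) @ xi.T
    # Rows are cell types, columns are genes (transposed relative to the basis)
    return pd.DataFrame(eta, index=basis.columns, columns=basis.index)

# Toy basis: 3 genes x 2 cell types (illustrative values only)
basis = pd.DataFrame(
    [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
    index=["g1", "g2", "g3"],
    columns=["A", "B"],
)
eta = predictivity_sketch(basis)

# Applying the predictivity to the basis itself recovers the identity,
# i.e. each basis vector scores 1 for its own type and 0 for the others
identity = eta.to_numpy() @ basis.to_numpy()
print(np.round(identity, 6))
```

Because \(\eta\xi\) is the identity, a sample that equals one basis column scores exactly 1 for that cell type and 0 for every other type.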
Parameters:
basis (pd.DataFrame): Cell type basis (genes × cell types)
Returns:
pd.DataFrame: Predictivity matrix (cell types × genes)
Example:
import sctop as top
# Compute predictivity from basis
predictivity = top.compute_predictivity(basis)
# See which genes contribute most to T cell score
t_cell_pred = predictivity.loc['T cell'].sort_values(ascending=False)
print("Top 10 T cell predictor genes:")
print(t_cell_pred.head(10))
# Positive values: gene expression increases cell type score
# Negative values: gene expression decreases cell type score
Notes:
Predictivity is computed once and can be reused
Used internally by contribution analysis functions
Reveals gene importance for each cell type
compute_gene_contributions
- sctop.compute_gene_contributions(data: DataFrame | ndarray, basis: DataFrame, predictivity: DataFrame | None = None, cell_types: List[str] | None = None, process_data: bool = True) → Dict[str, DataFrame]
Compute gene-level contributions to cell type scores.
For each cell type, computes: contribution = expression * predictivity
- Parameters:
data (DataFrame or array) – Expression data (genes x samples)
basis (pd.DataFrame) – Basis matrix
predictivity (pd.DataFrame, optional) – Precomputed predictivity matrix. If None, computed from basis
cell_types (list, optional) – Cell types to compute contributions for. If None, uses all
process_data (bool) – Whether to process the data first (default: True)
- Returns:
contributions – Dictionary mapping cell_type -> contribution_matrix (genes x samples)
- Return type:
dict
Purpose: Compute gene-level contributions to cell type scores for specific samples.
How It Works:
For each cell type, contribution is computed as:
\[\text{contribution}^{ct}_{g,s} = \text{expression}_{g,s} \times \eta_{ct,g}\]
where:
\(g\) = gene
\(s\) = sample
\(ct\) = cell type
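The element-wise product can be sketched with pandas. This is an illustration rather than the library's code: contributions_sketch is a hypothetical helper, the numbers are invented, and the data is assumed to be already processed (the equivalent of process_data=False).

```python
import numpy as np
import pandas as pd

def contributions_sketch(data: pd.DataFrame, predictivity: pd.DataFrame) -> dict:
    """Hypothetical helper: contribution[ct][g, s] = data[g, s] * predictivity[ct, g]."""
    genes = predictivity.columns.intersection(data.index)
    contributions = {}
    for cell_type in predictivity.index:
        weights = predictivity.loc[cell_type, genes]
        # Scale every gene's row of the expression matrix by that gene's weight
        contributions[cell_type] = data.loc[genes].mul(weights, axis=0)
    return contributions

# Illustrative values only: 3 genes x 2 samples, 2 cell types
data = pd.DataFrame(
    [[1.0, 2.0], [0.5, 0.0], [3.0, 1.0]],
    index=["g1", "g2", "g3"],
    columns=["sample_1", "sample_2"],
)
predictivity = pd.DataFrame(
    [[0.2, -0.1, 0.4], [0.0, 0.5, -0.3]],
    index=["T cell", "B cell"],
    columns=["g1", "g2", "g3"],
)

contrib = contributions_sketch(data, predictivity)
# Summing contributions over genes gives the projection of each sample onto each type
scores = predictivity @ data
print(contrib["T cell"].sum(axis=0))
print(scores.loc["T cell"])
```

Summing a contribution matrix over genes reproduces the product of the predictivity row with the sample, which is why large positive entries point to the genes that raise a cell type's score.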
Parameters:
data (pd.DataFrame or array): Expression data (genes × samples)
basis (pd.DataFrame): Cell type basis
predictivity (pd.DataFrame, optional): Precomputed predictivity (computed if None)
cell_types (list, optional): Cell types to analyze (default: all)
process_data (bool, default=True): Whether to process raw counts first
Returns:
dict: Maps cell_type → contribution_matrix (genes × samples)
Example:
# Compute contributions for all cell types
contributions = top.compute_gene_contributions(
    data=sample_data,
    basis=basis,
    process_data=True
)
# Analyze T cell contributions
t_cell_contrib = contributions['T cell']
# Find genes driving T cell score in sample 1
sample1_contrib = t_cell_contrib['sample_1'].sort_values(ascending=False)
print("Top 20 genes driving T cell assignment:")
print(sample1_contrib.head(20))
Use Cases:
Identify marker genes: Which genes drive cell type assignments?
Validate assignments: Do expected markers contribute highly?
Compare samples: How do contributions differ across conditions?
Quality control: Are unexpected genes contributing?
find_top_contributing_genes
- sctop.find_top_contributing_genes(contributions: DataFrame, n_genes: int = 20, aggregate: str = 'mean') → Series
Find top contributing genes from contribution matrix.
- Parameters:
contributions (pd.DataFrame) – Gene contributions (genes x samples)
n_genes (int) – Number of top genes to return
aggregate (str) – How to aggregate across samples: ‘mean’, ‘median’, ‘max’
- Returns:
top_genes – Top contributing genes with their aggregated scores
- Return type:
pd.Series
Purpose: Identify the top contributing genes from a contribution matrix.
Parameters:
contributions (pd.DataFrame): Gene contributions (genes × samples)
n_genes (int, default=20): Number of top genes to return
aggregate (str, default='mean'): How to aggregate across samples
'mean': Average contribution
'median': Median contribution
'max': Maximum contribution
Returns:
pd.Series: Top genes with their aggregated contribution scores
Example:
# Get contributions for a cell type
contrib = contributions['Macrophage']
# Find top 30 genes by mean contribution
top_genes = top.find_top_contributing_genes(
    contributions=contrib,
    n_genes=30,
    aggregate='mean'
)
print("Top macrophage marker genes:")
for gene, score in top_genes.items():
    print(f"  {gene}: {score:.4f}")
Aggregation Methods:
mean: Good for consistent markers across samples
median: Robust to outliers
max: Finds genes with strongest contribution in any sample
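To see how the choice of aggregate changes the ranking, here is a small pandas sketch. It is not the library's implementation: top_genes_sketch and the gene names are hypothetical, and it ranks by signed values because the documentation does not say whether absolute values are used.

```python
import pandas as pd

def top_genes_sketch(contributions: pd.DataFrame, n_genes: int = 20,
                     aggregate: str = "mean") -> pd.Series:
    """Hypothetical helper: reduce genes x samples to one score per gene, keep the largest."""
    if aggregate not in ("mean", "median", "max"):
        raise ValueError(f"unknown aggregate: {aggregate!r}")
    # DataFrame.mean / .median / .max with axis=1 reduce across samples
    per_gene = getattr(contributions, aggregate)(axis=1)
    return per_gene.nlargest(n_genes)

# Illustrative values: gene_a spikes in one sample, gene_b is steady
contrib_demo = pd.DataFrame(
    {"s1": [0.9, 0.2, 0.0], "s2": [0.0, 0.2, 0.1], "s3": [0.0, 0.2, 0.05]},
    index=["gene_a", "gene_b", "gene_c"],
)

by_mean = top_genes_sketch(contrib_demo, n_genes=2, aggregate="mean")
by_median = top_genes_sketch(contrib_demo, n_genes=2, aggregate="median")
by_max = top_genes_sketch(contrib_demo, n_genes=2, aggregate="max")
print(by_mean.index.tolist(), by_median.index.tolist(), by_max.index.tolist())
```

In this toy matrix the single-sample spike in gene_a dominates under mean and max, while median puts the steadier gene_b first.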
perform_anova_selection
- sctop.perform_anova_selection(basis: DataFrame, adata: AnnData, training_IDs: ndarray, cell_type_column: str, n_features: int = 2000, percentile: float | None = None, standardize: bool = True) → Tuple[DataFrame, ndarray]
Perform ANOVA feature selection on the basis and optionally standardize.
- Parameters:
basis (pd.DataFrame) – The basis matrix (genes x cell types)
adata (ad.AnnData) – The AnnData object
training_IDs (np.ndarray) – Training cell IDs
cell_type_column (str) – Column name for cell types
n_features (int) – Number of top features to select (if percentile is None)
percentile (float, optional) – Percentile of features to keep (overrides n_features)
standardize (bool) – Whether to standardize the basis after selection (default: True)
- Returns:
basis_selected (pd.DataFrame) – Basis with selected features only (and standardized if requested)
selected_genes (np.ndarray) – Array of selected gene names
Purpose: Select informative genes using one-way ANOVA F-test.
Description:
IMPORTANT: The best practice is to carefully curate your basis by seriously considering what is a biologically meaningful cell type and combining similar cell types (e.g. although epithelial cells are very specialized, many stromal and immune cell types are functionally identical). Merging similar cell types, dropping cell types with very few cells, and dropping questionably-annotated cell types should ALWAYS be done first before considering feature selection. Feature selection should be a last resort to improve performance after careful curation of cell types.
ANOVA feature selection identifies genes that differ significantly across cell types. This selects genes that discriminate between cell types the most. These may or may not be biologically meaningful.
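For intuition, the one-way ANOVA F-statistic can be computed per gene by hand. The sketch below uses plain NumPy on a tiny cells × genes matrix; the helper names and data are hypothetical, and it is not a description of how perform_anova_selection is implemented internally.

```python
import numpy as np

def anova_f_scores(X, labels):
    """One-way ANOVA F-statistic per column of X (cells x genes), grouped by labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n_total, k = X.shape[0], len(groups)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for g in groups:
        Xg = X[labels == g]
        group_mean = Xg.mean(axis=0)
        ss_between += Xg.shape[0] * (group_mean - grand_mean) ** 2
        ss_within += ((Xg - group_mean) ** 2).sum(axis=0)
    # F = (between-group variance) / (within-group variance)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def select_top_genes(X, labels, gene_names, n_features):
    """Return the n_features gene names with the highest F-scores."""
    f = anova_f_scores(X, labels)
    order = np.argsort(f)[::-1][:n_features]
    return np.asarray(gene_names)[order]

# Illustrative data: g0 separates the two types, g1 and g2 do not
X = np.array([
    [10.0, 1.0, 5.0], [11.0, 2.0, 5.5], [9.0, 1.5, 4.5],   # type A
    [1.0, 1.2, 5.2], [2.0, 1.8, 4.8], [0.0, 1.5, 5.0],     # type B
])
labels = ["A", "A", "A", "B", "B", "B"]
f_scores = anova_f_scores(X, labels)
selected = select_top_genes(X, labels, ["g0", "g1", "g2"], n_features=1)
print(f_scores, selected)
```

A gene gets a high F-score when its mean differs between cell types by much more than it varies within them, which is a statement about discrimination and not about biological relevance.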
Parameters:
basis (pd.DataFrame): The basis matrix
adata (ad.AnnData): Full annotated dataset
training_IDs (np.ndarray): Training cell IDs
cell_type_column (str): Column with cell type labels
n_features (int, default=2000): Number of genes to keep
percentile (float, optional): Keep top percentile (overrides n_features)
standardize (bool, default=True): Whether to standardize basis after selection
Returns:
basis_selected (pd.DataFrame): Basis with only selected genes
selected_genes (np.ndarray): Array of selected gene names
Example:
from sctop.utils import perform_anova_selection
# Select top 5000 genes
basis_filtered, genes = perform_anova_selection(
    basis=basis,
    adata=adata,
    training_IDs=train_ids,
    cell_type_column='cell_type',
    n_features=5000
)
print(f"Reduced from {basis.shape[0]} to {basis_filtered.shape[0]} genes")
Notes:
Higher F-scores indicate better discrimination
May remove cell-type-specific markers with moderate expression
Most useful for large gene sets (>20k genes)
Standardization ensures basis vectors have unit norm
calculate_metrics
- sctop.calculate_metrics(true_labels: List, predicted_labels: List, total_cells: int, accuracies: Dict) → Dict
Calculate comprehensive metrics.
Purpose: Calculate comprehensive classification metrics from predictions.
Parameters:
true_labels (list): Ground truth cell type labels
predicted_labels (list): Predicted cell type labels
total_cells (int): Total number of cells
accuracies (dict): Dictionary with counts for top1, top3, unspecified
Returns:
dict: Metrics including:
accuracy: Top-1 accuracy (correct predictions / total)
top3_accuracy: Top-3 accuracy
unspecified_rate: Fraction below confidence threshold
f1_macro: Macro-averaged F1 score
f1_weighted: Weighted F1 score
precision_macro, precision_weighted
recall_macro, recall_weighted
total_cells
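The relationship between the inputs and a few of these metrics can be sketched in pure Python. This is an illustration only: metrics_sketch is a hypothetical helper, dividing the counts in accuracies by total_cells is an assumption based on the "correct predictions / total" description, and the F1 scores are computed by hand instead of with whatever routine the library calls.

```python
from collections import Counter

def metrics_sketch(true_labels, predicted_labels, total_cells, accuracies):
    """Hypothetical helper: rates from counts, plus macro and weighted F1."""
    classes = sorted(set(true_labels) | set(predicted_labels))
    support = Counter(true_labels)
    pairs = list(zip(true_labels, predicted_labels))
    f1 = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom else 0.0
    return {
        "accuracy": accuracies["top1"] / total_cells,
        "top3_accuracy": accuracies["top3"] / total_cells,
        "unspecified_rate": accuracies["unspecified"] / total_cells,
        # Macro: every class counts equally; weighted: classes count by support
        "f1_macro": sum(f1.values()) / len(classes),
        "f1_weighted": sum(f1[c] * support[c] for c in classes) / len(true_labels),
        "total_cells": total_cells,
    }

demo = metrics_sketch(
    true_labels=["A", "A", "B", "B"],
    predicted_labels=["A", "B", "B", "B"],
    total_cells=4,
    accuracies={"top1": 3, "top3": 4, "unspecified": 0},
)
print(demo)
```

Macro averaging weights rare and common cell types equally, so it drops sharply when a small cell type is misclassified; the weighted variant tracks overall accuracy more closely.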
Example:
from sctop.utils import calculate_metrics
metrics = calculate_metrics(
    true_labels=true_types,
    predicted_labels=pred_types,
    total_cells=len(test_ids),
    accuracies={'top1': 850, 'top3': 920, 'unspecified': 30}
)
print(f"Accuracy: {metrics['accuracy']:.3f}")
print(f"F1 (weighted): {metrics['f1_weighted']:.3f}")
calculate_per_cell_type_accuracy
- sctop.calculate_per_cell_type_accuracy(cell_accuracies: Dict) → DataFrame
Calculate per cell type accuracy.
Purpose: Compute accuracy metrics for each individual cell type.
Parameters:
cell_accuracies (dict): Per-cell accuracy information
Returns:
pd.DataFrame: Per-cell-type metrics with columns:
correct: Number of correctly classified cells
total: Total cells of this type
accuracy: Fraction correct
top3_correct: Correct within top 3 predictions
top3_accuracy: Top-3 accuracy
unspecified_count: Cells below confidence threshold
unspecified_rate: Fraction unspecified
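The first three columns can be reproduced from plain label lists. Since the structure of cell_accuracies is not documented here, this sketch deliberately starts from true and predicted labels instead; per_type_accuracy_sketch is a hypothetical helper and covers only correct, total, and accuracy.

```python
import pandas as pd

def per_type_accuracy_sketch(true_labels, predicted_labels) -> pd.DataFrame:
    """Hypothetical helper: per-cell-type correct / total / accuracy table."""
    df = pd.DataFrame({"true": true_labels, "pred": predicted_labels})
    df["is_correct"] = df["true"] == df["pred"]
    grouped = df.groupby("true")["is_correct"]
    out = pd.DataFrame({
        "correct": grouped.sum().astype(int),
        "total": grouped.count(),
    })
    out["accuracy"] = out["correct"] / out["total"]
    # Best-performing types first
    return out.sort_values("accuracy", ascending=False)

table = per_type_accuracy_sketch(
    true_labels=["T cell", "T cell", "T cell", "B cell", "B cell"],
    predicted_labels=["T cell", "T cell", "B cell", "B cell", "B cell"],
)
print(table)
```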
Example:
from sctop.utils import calculate_per_cell_type_accuracy
per_type = calculate_per_cell_type_accuracy(cell_accs)
# Find worst performers
worst = per_type.nsmallest(10, 'accuracy')
print("\nWorst performing cell types:")
print(worst[['accuracy', 'total']])
# Find types with many unspecified
high_unspec = per_type.nlargest(10, 'unspecified_rate')
Notes:
Sorted by accuracy (best to worst)
Useful for identifying problematic cell types
Consider merging types with low accuracy and high confusion
run_scoring_parallel
- sctop.run_scoring_parallel(adata: AnnData, basis: DataFrame, test_IDs: ndarray, cell_type_column: str, spec_value: float, outer_chunks: int, inner_chunk_size: int, n_jobs: int = 4) → Tuple[dict, list, list, dict]
OPTIMIZED: Parallel scoring of test cells. Uses ThreadPoolExecutor for shared-memory parallel processing.
Purpose: Score test cells against basis in parallel (internal function used by create_basis).
Key Features:
Thread-based parallelism for shared-memory efficiency
Automatic chunking of test set
Progress bar via tqdm
Returns detailed per-cell metrics
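The chunk-and-dispatch pattern behind these features can be sketched with the standard library alone. This is a generic illustration of thread-based chunked processing, not the body of run_scoring_parallel: split_into_chunks, parallel_over_chunks, and the toy worker are hypothetical, and the progress bar is omitted.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def parallel_over_chunks(items, n_chunks, n_jobs, worker):
    """Run worker on each chunk in a thread pool and flatten the results in order."""
    chunks = split_into_chunks(items, n_chunks)
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Executor.map yields results in the order the chunks were submitted
        results = list(pool.map(worker, chunks))
    return [value for chunk_result in results for value in chunk_result]

def toy_worker(chunk):
    """Stand-in for scoring one chunk of cells (hypothetical)."""
    return [x * x for x in chunk]

ids = list(range(10))
squared = parallel_over_chunks(ids, n_chunks=3, n_jobs=2, worker=toy_worker)
print(squared)
```

Threads share the parent process's memory, so large objects such as the expression matrix and the basis do not have to be copied to each worker, which is the shared-memory benefit mentioned above.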
Parameters:
adata (ad.AnnData): Full dataset
basis (pd.DataFrame): Cell type basis
test_IDs (np.ndarray): Test cell IDs to score
cell_type_column (str): Cell type column name
spec_value (float): Threshold for unspecified predictions
outer_chunks (int): Number of chunks
inner_chunk_size (int): Chunk size for internal processing
n_jobs (int, default=4): Number of parallel workers
Returns:
Tuple of:
cell_accuracies (dict): Per-cell results
true_labels (list): Ground truth labels
predicted_labels (list): Predicted labels
accuracies (dict): Aggregate counts
Example:
from sctop.utils import run_scoring_parallel
cell_accs, true_labs, pred_labs, accs = run_scoring_parallel(
    adata=adata,
    basis=basis,
    test_IDs=test_ids,
    cell_type_column='cell_type',
    spec_value=0.1,
    outer_chunks=20,
    inner_chunk_size=500,
    n_jobs=4
)
print(f"Top-1 accuracy: {accs['top1'] / len(test_ids):.3f}")
Performance Tuning:
For fast scoring:
n_jobs=8, outer_chunks=50, inner_chunk_size=1000
For memory-constrained:
n_jobs=2, outer_chunks=100, inner_chunk_size=500
plot_performance_summary
- sctop.plot_performance_summary(true_labels: List, predicted_labels: List, f1_df: DataFrame | None = None, figsize_base: int = 10)
Generates and displays a Confusion Matrix and a Per-Cell-Type F1 Score plot.
Purpose: Generate comprehensive visualization of classification performance.
Creates:
Confusion Matrix: Normalized by true labels (recall)
F1 Score Bar Plot: Per-cell-type F1 scores
Parameters:
true_labels (list): True cell type labels
predicted_labels (list): Predicted labels
figsize_base (int, default=10): Base figure size (scales with #types)
f1_df (pd.DataFrame, optional): Precomputed F1 scores
Example:
from sctop.utils import plot_performance_summary
plot_performance_summary(
    true_labels=results['true_labels'],
    predicted_labels=results['predicted_labels'],
    f1_df=results['f1_scores']
)
Notes:
Automatically called by create_basis if plot_results=True
Figure size scales with number of cell types
Confusion matrix is normalized (shows recall per type)
Useful for identifying confused cell type pairs
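What "normalized by true labels" means can be shown without any plotting. The sketch below uses pandas.crosstab with normalize="index" on made-up labels; it reproduces only the normalization, not the figure that plot_performance_summary draws.

```python
import pandas as pd

# Illustrative labels only
true_labels = ["T cell", "T cell", "B cell", "B cell", "B cell", "NK cell"]
predicted_labels = ["T cell", "NK cell", "B cell", "B cell", "T cell", "NK cell"]

# Rows are true types, columns are predicted types; each row sums to 1
cm = pd.crosstab(
    pd.Series(true_labels, name="true"),
    pd.Series(predicted_labels, name="predicted"),
    normalize="index",
)

# The diagonal entry of each row is that cell type's recall
recall = {ct: cm.loc[ct, ct] for ct in cm.index if ct in cm.columns}
print(cm.round(3))
print(recall)
```

Off-diagonal entries show where a type's cells go when they are misclassified, so a pair of types with large entries in both directions is a candidate for merging.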
print_metrics
Purpose: Pretty-print metrics dictionary.
Parameters:
metrics (dict): Metrics dictionary from calculate_metrics
Example:
from sctop.utils import print_metrics
print_metrics(results['metrics'])
Output:
Accuracy (Top-1): 0.8723
Top-3 Accuracy: 0.9541
Unspecified Rate: 0.0234
F1 Score (Macro): 0.8456
F1 Score (Weighted): 0.8701
Precision (Macro): 0.8532
Precision (Weighted): 0.8745
Recall (Macro): 0.8512
Recall (Weighted): 0.8723