sctop.utils

Functions

`perform_anova_selection`(...)	Perform ANOVA feature selection on the basis and optionally standardize.
`calculate_metrics`(→ Dict)	Calculate comprehensive metrics.
`calculate_per_cell_type_accuracy`(...)	Calculate per cell type accuracy.
`print_metrics`(metrics)	Pretty print metrics.
`create_basis_optimized`(...)	Original function - kept for backwards compatibility.
`run_scoring_parallel`(→ Tuple[dict, list, list, dict])	OPTIMIZED: Parallel scoring of test cells.
`score_chunk_optimized`(→ Tuple[dict, list, list, dict])	OPTIMIZED: Score a single chunk of cells.
`plot_performance_summary`(true_labels, predicted_labels)	Generates and displays a Confusion Matrix and a Per-Cell-Type F1 Score plot.
`compute_predictivity`(→ sctop.processing.pd.DataFrame)	Compute predictivity matrix from basis.
`compute_gene_contributions`(→ Dict[str, ...)	Compute gene-level contributions to cell type scores.
`find_top_contributing_genes`(→ sctop.processing.pd.Series)	Find top contributing genes from contribution matrix.

Module Contents

sctop.utils.perform_anova_selection(basis: sctop.processing.pd.DataFrame, adata: anndata.AnnData, training_IDs: sctop.processing.np.ndarray, cell_type_column: str, n_features: int = 2000, percentile: float | None = None, standardize: bool = True) → Tuple[sctop.processing.pd.DataFrame, sctop.processing.np.ndarray]

Perform ANOVA feature selection on the basis and optionally standardize.

Parameters:

basis (pd.DataFrame) – The basis matrix (genes x cell types)
adata (ad.AnnData) – The AnnData object
training_IDs (np.ndarray) – Training cell IDs
cell_type_column (str) – Column name for cell types
n_features (int) – Number of top features to select (if percentile is None)
percentile (float, optional) – Percentile of features to keep (overrides n_features)
standardize (bool) – Whether to standardize the basis after selection (default: True)

Returns:

basis_selected (pd.DataFrame) – Basis with selected features only (and standardized if requested)
selected_genes (np.ndarray) – Array of selected gene names
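
To illustrate what ANOVA feature selection followed by standardization can look like, the sketch below ranks genes with scikit-learn's `f_classif` on a toy cells x genes matrix and keeps the matching rows of the basis. It is not the sctop implementation: the helper `anova_select`, the toy data, and the choice to z-score each cell type column across the selected genes are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif


def anova_select(expr, labels, basis, n_features=2, standardize=True):
    """Hypothetical helper: keep the genes with the highest ANOVA F-score.

    expr   : cells x genes DataFrame of training expression values
    labels : one cell type label per row of expr
    basis  : genes x cell types DataFrame
    """
    f_scores, _ = f_classif(expr.to_numpy(), labels)
    f_scores = np.nan_to_num(f_scores, nan=0.0)  # guard against NaN scores
    top = np.argsort(f_scores)[::-1][:n_features]
    selected_genes = expr.columns[top].to_numpy()
    basis_selected = basis.loc[selected_genes]
    if standardize:
        # Assumption: z-score each cell type column across the selected genes.
        basis_selected = (
            basis_selected - basis_selected.mean()
        ) / basis_selected.std(ddof=0)
    return basis_selected, selected_genes


# Toy data: gene g0 marks type A, gene g3 marks type B, the rest are noise.
rng = np.random.default_rng(0)
genes = [f"g{i}" for i in range(5)]
expr = pd.DataFrame(rng.normal(size=(40, 5)), columns=genes)
labels = np.array(["A"] * 20 + ["B"] * 20)
expr.iloc[:20, 0] += 5.0
expr.iloc[20:, 3] += 5.0

basis = expr.groupby(labels).mean().T  # genes x cell types
basis_selected, selected_genes = anova_select(expr, labels, basis)
print(sorted(selected_genes))
```

In this toy setup the two marker genes dominate the F-scores, so they are the ones retained.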

sctop.utils.calculate_metrics(true_labels: List, predicted_labels: List, total_cells: int, accuracies: Dict) → Dict

Calculate comprehensive metrics.

sctop.utils.calculate_per_cell_type_accuracy(cell_accuracies: Dict) → sctop.processing.pd.DataFrame

Calculate per cell type accuracy.

sctop.utils.print_metrics(metrics: Dict)

Pretty print metrics.

sctop.utils.create_basis_optimized(adata: anndata.AnnData, cell_type_column: str, threshold: int, test_size: float, random_state: int, n_jobs: int = -1) → Tuple[sctop.processing.pd.DataFrame, sctop.processing.np.ndarray, sctop.processing.np.ndarray]

Original function - kept for backwards compatibility.

sctop.utils.run_scoring_parallel(adata: anndata.AnnData, basis: sctop.processing.pd.DataFrame, test_IDs: sctop.processing.np.ndarray, cell_type_column: str, spec_value: float, outer_chunks: int, inner_chunk_size: int, n_jobs: int = 4) → Tuple[dict, list, list, dict]

OPTIMIZED: Parallel scoring of test cells. Uses ThreadPoolExecutor for shared-memory parallel processing.

sctop.utils.score_chunk_optimized(adata: anndata.AnnData, basis: sctop.processing.pd.DataFrame, sample_IDs: sctop.processing.np.ndarray, cell_type_column: str, spec_value: float, inner_chunk_size: int) → Tuple[dict, list, list, dict]

OPTIMIZED: Score a single chunk of cells. Extracted for parallel processing.
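
The chunk-then-merge pattern described for the two scoring functions can be sketched as follows. This is only an illustration of using `ThreadPoolExecutor` over outer chunks of IDs: `score_chunk` and `run_parallel` are hypothetical stand-ins that return a single dictionary, not the four-element tuple the real functions return, and they do not touch an AnnData object.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def score_chunk(sample_ids):
    """Hypothetical per-chunk scorer: returns {cell id: score}."""
    return {int(i): float(i) ** 2 for i in sample_ids}


def run_parallel(test_ids, outer_chunks=4, n_jobs=4):
    """Split the IDs into outer chunks, score each in a thread, merge."""
    chunks = np.array_split(test_ids, outer_chunks)
    merged = {}
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # Threads share the parent's memory, so large inputs are not copied.
        # pool.map yields results in the same order as the input chunks.
        for partial in pool.map(score_chunk, chunks):
            merged.update(partial)
    return merged


results = run_parallel(np.arange(10))
print(len(results))
```

Threads help most when the per-chunk work releases the GIL, as many NumPy operations do; for pure-Python work a process pool may be the better fit.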

sctop.utils.plot_performance_summary(true_labels: List, predicted_labels: List, f1_df: sctop.processing.pd.DataFrame | None = None, figsize_base: int = 10)

Generates and displays a Confusion Matrix and a Per-Cell-Type F1 Score plot.

sctop.utils.compute_predictivity(basis: sctop.processing.pd.DataFrame) → sctop.processing.pd.DataFrame

Compute predictivity matrix from basis.

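The predictivity matrix shows how each gene contributes to each cell type’s score. Formula: predictivity = inv(B^T @ B) @ B^T, where B is the basis matrix (genes x cell_types). This is the left pseudoinverse of B, so it requires the columns of B to be linearly independent.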

Parameters:

basis (pd.DataFrame) – Basis matrix (genes x cell_types)

Returns:

predictivity – Predictivity matrix (cell_types x genes). Shows how each gene contributes to each cell type score.

Return type:

pd.DataFrame
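
The formula above can be checked on a small example. The sketch below applies inv(B^T @ B) @ B^T directly with NumPy; `predictivity_sketch` and the toy basis are hypothetical and only illustrate the algebra, not how sctop implements `compute_predictivity`.

```python
import numpy as np
import pandas as pd


def predictivity_sketch(basis):
    """Hypothetical helper: inv(B^T @ B) @ B^T with labels swapped."""
    B = basis.to_numpy(dtype=float)
    P = np.linalg.inv(B.T @ B) @ B.T
    # Rows become cell types and columns become genes.
    return pd.DataFrame(P, index=basis.columns, columns=basis.index)


# Toy basis: 3 genes x 2 cell types with linearly independent columns.
basis = pd.DataFrame(
    [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
    index=["g1", "g2", "g3"],
    columns=["typeA", "typeB"],
)
pred = predictivity_sketch(basis)

# Applying the predictivity to the basis itself recovers the identity,
# so a sample equal to one basis column scores 1 for that type and 0 elsewhere.
print(np.round(pred.to_numpy() @ basis.to_numpy(), 6))
```

For a basis with full column rank the result coincides with `np.linalg.pinv(B)`, which is usually the numerically safer way to compute it.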

sctop.utils.compute_gene_contributions(data: sctop.processing.pd.DataFrame | sctop.processing.np.ndarray, basis: sctop.processing.pd.DataFrame, predictivity: sctop.processing.pd.DataFrame | None = None, cell_types: List[str] | None = None, process_data: bool = True) → Dict[str, sctop.processing.pd.DataFrame]

Compute gene-level contributions to cell type scores.

For each cell type, computes: contribution = expression * predictivity

Parameters:

data (DataFrame or array) – Expression data (genes x samples)
basis (pd.DataFrame) – Basis matrix
predictivity (pd.DataFrame, optional) – Precomputed predictivity matrix. If None, computed from basis
cell_types (list, optional) – Cell types to compute contributions for. If None, uses all
process_data (bool) – Whether to process the data first (default: True)

Returns:

contributions – Dictionary mapping cell_type -> contribution_matrix (genes x samples)

Return type:

dict
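
One way to read contribution = expression * predictivity is as a per-gene elementwise product: each gene's expression in a sample is multiplied by that gene's predictivity weight for the cell type. The sketch below follows that reading, which is an assumption, and skips the data processing step entirely; `contributions_sketch` and the toy matrices are hypothetical.

```python
import numpy as np
import pandas as pd


def contributions_sketch(data, predictivity):
    """Hypothetical helper: {cell type: genes x samples contribution matrix}."""
    return {
        # Scale every gene row of `data` by that gene's weight for this type.
        cell_type: data.mul(predictivity.loc[cell_type], axis=0)
        for cell_type in predictivity.index
    }


genes = ["g1", "g2", "g3"]
predictivity = pd.DataFrame(
    [[0.5, 0.0, 0.5], [0.0, 1.0, -1.0]],
    index=["typeA", "typeB"],
    columns=genes,
)
data = pd.DataFrame(
    [[2.0, 4.0], [1.0, 0.0], [0.0, 2.0]],
    index=genes,
    columns=["s1", "s2"],
)

contributions = contributions_sketch(data, predictivity)
scores = predictivity @ data  # cell types x samples
print(contributions["typeA"])
```

Under this reading, summing a contribution matrix over genes gives back the cell type's score for each sample, which is what makes the per-gene values interpretable as a decomposition of the score.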

sctop.utils.find_top_contributing_genes(contributions: sctop.processing.pd.DataFrame, n_genes: int = 20, aggregate: str = 'mean') → sctop.processing.pd.Series

Find top contributing genes from contribution matrix.

Parameters:

contributions (pd.DataFrame) – Gene contributions (genes x samples)
n_genes (int) – Number of top genes to return
aggregate (str) – How to aggregate across samples: ‘mean’, ‘median’, ‘max’

Returns:

top_genes – Top contributing genes with their aggregated scores

Return type:

pd.Series
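
The aggregate-then-rank step can be sketched as below. The documentation does not say whether genes are ranked by signed or absolute value, so this example ranks by the signed aggregate, which is an assumption; `top_genes_sketch` and the toy contribution matrix are hypothetical.

```python
import pandas as pd


def top_genes_sketch(contributions, n_genes=20, aggregate="mean"):
    """Hypothetical helper: aggregate across samples, return the largest."""
    if aggregate not in {"mean", "median", "max"}:
        raise ValueError(f"unsupported aggregate: {aggregate!r}")
    # Collapse the samples axis so that one value remains per gene.
    per_gene = getattr(contributions, aggregate)(axis=1)
    return per_gene.sort_values(ascending=False).head(n_genes)


contributions = pd.DataFrame(
    [[1.0, 2.0, 3.0], [0.0, 0.0, 9.0], [4.0, 4.0, 4.0], [-1.0, 0.0, 1.0]],
    index=["g1", "g2", "g3", "g4"],
    columns=["s1", "s2", "s3"],
)

top_mean = top_genes_sketch(contributions, n_genes=2, aggregate="mean")
top_median = top_genes_sketch(contributions, n_genes=2, aggregate="median")
print(top_mean)
```

The choice of aggregate matters: g2 is driven by a single sample, so it ranks second by mean but drops out of the top two by median.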