API Reference

This page documents all public functions in the scTOP package.

Core Functions

The core functions handle data processing and scoring.

sctop.processing.process(df_in: DataFrame | ndarray | spmatrix, average: bool = False, chunk_size: int | None = None) → DataFrame

Process scRNA-seq data with optional chunking.

sctop.processing.rank_zscore_fast(arr: ndarray) → ndarray

Rank-based normal scores transformation.
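A rank-based normal scores transformation replaces each value with the standard-normal quantile of its rank. The sketch below illustrates the general idea using only the standard library. The name `rank_normal_scores`, the average-rank tie handling, and the `rank / (n + 1)` scaling are assumptions for illustration and may differ from what `rank_zscore_fast` does internally.

```python
from statistics import NormalDist


def rank_normal_scores(values):
    """Map values to standard-normal quantiles of their average ranks.

    Hypothetical helper for illustration; not part of scTOP.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Tied values share the average of their 1-based ranks.
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    # Dividing by n + 1 keeps every quantile strictly inside (0, 1).
    return [normal.inv_cdf(r / (n + 1)) for r in ranks]


scores = rank_normal_scores([5.0, 0.0, 2.0, 0.0, 9.0])
```

The output depends only on the ordering of the inputs, not on their magnitudes, which is why this kind of transform is often used to make samples with different sequencing depths comparable.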

sctop.processing.score(basis: DataFrame, sample: DataFrame, full_output: bool = False, chunk_size: int | None = None) → DataFrame | List

Project sample onto basis with optional chunking.

Basis Management

Functions for loading and creating reference bases.

sctop.sctop.analyze_sample_contributions(sample_data_dict: Dict[str, DataFrame | ndarray], basis: DataFrame, cell_types: List[str] | None = None, n_top_genes: int = 20, process_data: bool = True) → Dict[str, Dict]

Analyze gene contributions for multiple samples/clusters.

Parameters:

sample_data_dict (dict) – Dictionary mapping sample_name -> expression_data
basis (pd.DataFrame) – Basis matrix
cell_types (list, optional) – Cell types to analyze. If None, uses all cell types
n_top_genes (int) – Number of top genes to identify per sample
process_data (bool) – Whether to process the data first (default: True)

Returns:

results – Nested dictionary with structure:

{cell_type: {
    'contributions': {sample_name: contribution_matrix},
    'top_genes': {sample_name: [gene1, gene2, …]},
    'expressions': {sample_name: expression_matrix}
}}

Return type:

dict

sctop.sctop.create_basis(adata: AnnData, cell_type_column: str, threshold: int, test_size: float = 0.2, random_state: int = 42, n_jobs: int = -1, do_anova: bool = False, n_features: int = 20000, anova_percentile: float | None = None, spec_value: float = 0.1, outer_chunks: int = 10, inner_chunk_size: int = 1000, n_scoring_jobs: int = 4, cv_folds: int | None = None, plot_results: bool = True) → Dict

Create basis and evaluate with optional ANOVA selection and cross-validation.

Parameters:

adata (ad.AnnData) – Annotated data object
cell_type_column (str) – Column name for cell types in adata.obs
threshold (int) – Minimum number of cells per cell type
test_size (float) – Fraction of data to use for testing (if cv_folds is None)
random_state (int) – Random seed
n_jobs (int) – Number of parallel jobs for basis creation
do_anova (bool) – Whether to perform ANOVA feature selection
n_features (int) – Number of features to select with ANOVA
anova_percentile (float, optional) – Percentile of features to keep (overrides n_features)
spec_value (float) – Threshold for unspecified predictions
outer_chunks (int) – Number of chunks for parallel scoring
inner_chunk_size (int) – Chunk size for internal processing
n_scoring_jobs (int) – Number of parallel jobs for scoring
cv_folds (int, optional) – Number of cross-validation folds. If None, uses a single train-test split

Returns:

results (dict) – Dictionary containing:

'basis': final basis
'selected_genes': selected genes (if ANOVA)
'metrics': performance metrics
'cv_results': cross-validation results (if cv_folds is not None)
'confusion_matrix': confusion matrix
'per_cell_type': per cell type accuracy
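The `spec_value` parameter is described as a threshold for unspecified predictions. One plausible reading, sketched below, is that a cell is assigned its top-scoring cell type unless that top score falls below the threshold. The helper `label_cell`, the `"unspecified"` label string, and the strict `<` comparison are all assumptions for illustration, not confirmed behavior of `create_basis`.

```python
def label_cell(scores, spec_value=0.1):
    """Pick the top-scoring cell type, or "unspecified" if no score is high enough.

    Hypothetical helper; `scores` maps cell type -> projection score.
    """
    best_type = max(scores, key=scores.get)
    # Assumed rule: a top score below spec_value is too weak to call.
    if scores[best_type] < spec_value:
        return "unspecified"
    return best_type


confident = label_cell({"T cell": 0.62, "B cell": 0.08})
ambiguous = label_cell({"T cell": 0.04, "B cell": 0.07})
```

Under this reading, raising `spec_value` trades coverage for precision: more cells are left unlabeled, but the remaining labels rest on stronger projections.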

sctop.sctop.list_available_bases() → List[str]

List available premade bases that can be loaded.

Returns:

basis_keys (list) – List of available basis keys

sctop.sctop.load_basis(basis_key: str, cache_dir: str | Path | None = None, force_download: bool = False) → Tuple[DataFrame, DataFrame]

Load a basis from an h5ad file hosted online.

Parameters:

basis_key (str) – Name/key of the basis to load (e.g., "MCKO legacy")
cache_dir (str, optional) – Directory to cache downloaded files. If None, uses the system temp directory
force_download (bool) – If True, re-downloads even if a cached file exists

Returns:

basis (pd.DataFrame) – Basis matrix (genes x cell types)
metadata (pd.DataFrame) – Metadata for the basis (cell types x attributes)

Example:

>>> basis, metadata = load_basis(
...     basis_key="MCKO legacy"
... )
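The interaction between `cache_dir` and `force_download` can be pictured as a small decision: resolve the cache location, then download only when forced or when nothing is cached yet. The sketch below models that decision with the standard library. The helper `needs_download` and the `<key>.h5ad` file naming are hypothetical and do not describe how scTOP actually names or stores its cache files.

```python
import tempfile
from pathlib import Path


def needs_download(basis_key, cache_dir=None, force_download=False):
    """Return (cache_path, should_download) for a basis key.

    Hypothetical helper; the file naming scheme is an assumption.
    """
    # Fall back to the system temp directory when no cache_dir is given.
    root = Path(cache_dir) if cache_dir is not None else Path(tempfile.gettempdir())
    # Assumed naming: spaces replaced so the key makes a convenient file name.
    cache_path = root / f"{basis_key.replace(' ', '_')}.h5ad"
    should_download = force_download or not cache_path.exists()
    return cache_path, should_download


with tempfile.TemporaryDirectory() as tmp:
    path, first = needs_download("MCKO legacy", cache_dir=tmp)
    path.touch()  # pretend the file was downloaded
    _, second = needs_download("MCKO legacy", cache_dir=tmp)
    _, forced = needs_download("MCKO legacy", cache_dir=tmp, force_download=True)
```

Passing an explicit `cache_dir` is worth considering on shared or ephemeral machines, where the system temp directory may be cleared between sessions.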

Analysis Functions

Utilities for gene contribution analysis and metrics.

sctop.utils.calculate_metrics(true_labels: List, predicted_labels: List, total_cells: int, accuracies: Dict) → Dict

Calculate comprehensive metrics.

sctop.utils.calculate_per_cell_type_accuracy(cell_accuracies: Dict) → DataFrame

Calculate per cell type accuracy.

sctop.utils.compute_gene_contributions(data: DataFrame | ndarray, basis: DataFrame, predictivity: DataFrame | None = None, cell_types: List[str] | None = None, process_data: bool = True) → Dict[str, DataFrame]

Compute gene-level contributions to cell type scores.

For each cell type, computes: contribution = expression * predictivity

Parameters:

data (DataFrame or array) – Expression data (genes x samples)
basis (pd.DataFrame) – Basis matrix
predictivity (pd.DataFrame, optional) – Precomputed predictivity matrix. If None, computed from basis
cell_types (list, optional) – Cell types to compute contributions for. If None, uses all cell types
process_data (bool) – Whether to process the data first (default: True)

Returns:

contributions – Dictionary mapping cell_type -> contribution_matrix (genes x samples)

Return type:

dict
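To make `contribution = expression * predictivity` concrete, the sketch below scales each gene's expression by that gene's predictivity weight for one cell type, producing a (genes x samples) matrix per cell type. It uses toy random data and plain NumPy arrays rather than the package's DataFrames; the cell type labels are hypothetical, the expression matrix is assumed to be already processed, and the predictivity follows the formula documented for `compute_predictivity`.

```python
import numpy as np

rng = np.random.default_rng(1)
basis = rng.normal(size=(40, 3))        # genes x cell types (toy data)
expression = rng.normal(size=(40, 5))   # genes x samples, assumed already processed
cell_types = ["A", "B", "C"]            # hypothetical labels

# Predictivity from the formula inv(B^T @ B) @ B^T (cell types x genes).
predictivity = np.linalg.solve(basis.T @ basis, basis.T)

contributions = {}
for row, cell_type in enumerate(cell_types):
    # Scale each gene's expression by that gene's weight for this cell type.
    contributions[cell_type] = expression * predictivity[row][:, np.newaxis]

# Summing over genes recovers the projection score for each sample.
scores = predictivity @ expression      # cell types x samples
```

Because the per-gene contributions for a cell type sum to that cell type's score, they can be read as an additive decomposition of the score across genes.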

sctop.utils.compute_predictivity(basis: DataFrame) → DataFrame

Compute predictivity matrix from basis.

The predictivity shows how each gene contributes to each cell type’s score. Formula: predictivity = inv(B^T @ B) @ B^T

Parameters:

basis (pd.DataFrame) – Basis matrix (genes x cell_types)

Returns:

predictivity – Predictivity matrix (cell_types x genes), showing how each gene contributes to each cell type score

Return type:

pd.DataFrame
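The formula `inv(B^T @ B) @ B^T` is the left pseudoinverse of the basis, which exists when the basis columns are linearly independent. A minimal NumPy sketch on toy data follows; the function name `predictivity_from_basis` is hypothetical, and solving the linear system instead of forming the inverse explicitly is a numerical choice made here, not necessarily what scTOP does.

```python
import numpy as np


def predictivity_from_basis(basis):
    """Return inv(B^T @ B) @ B^T for a (genes x cell_types) array B.

    Hypothetical helper for illustration; not the scTOP implementation.
    """
    gram = basis.T @ basis                  # (cell_types x cell_types)
    # Solving gram @ X = B^T avoids forming the inverse explicitly.
    return np.linalg.solve(gram, basis.T)   # (cell_types x genes)


rng = np.random.default_rng(0)
B = rng.normal(size=(50, 3))                # 50 genes, 3 cell types (toy data)
P = predictivity_from_basis(B)

# P is a left inverse of B, so projecting the basis onto itself gives the identity.
identity_check = P @ B
```

The identity check means that a sample identical to one basis column scores 1 for that cell type and 0 for the others.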

sctop.utils.create_basis_optimized(adata: AnnData, cell_type_column: str, threshold: int, test_size: float, random_state: int, n_jobs: int = -1) → Tuple[DataFrame, ndarray, ndarray]

Original function, kept for backwards compatibility.

sctop.utils.find_top_contributing_genes(contributions: DataFrame, n_genes: int = 20, aggregate: str = 'mean') → Series

Find top contributing genes from contribution matrix.

Parameters:

contributions (pd.DataFrame) – Gene contributions (genes x samples)
n_genes (int) – Number of top genes to return
aggregate (str) – How to aggregate across samples: ‘mean’, ‘median’, ‘max’

Returns:

top_genes – Top contributing genes with their aggregated scores

Return type:

pd.Series
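The aggregation step can be sketched with pandas: collapse the sample axis with the chosen statistic, then keep the largest values. The function `top_genes` below is a hypothetical stand-in, and ranking by signed value (rather than, say, absolute value) is an assumption that may not match `find_top_contributing_genes`.

```python
import pandas as pd


def top_genes(contributions, n_genes=20, aggregate="mean"):
    """Hypothetical stand-in for illustration; not the scTOP implementation."""
    if aggregate not in ("mean", "median", "max"):
        raise ValueError(f"unsupported aggregate: {aggregate!r}")
    # Collapse the sample axis, leaving one score per gene.
    per_gene = getattr(contributions, aggregate)(axis=1)
    return per_gene.nlargest(n_genes)


toy = pd.DataFrame(
    {"s1": [0.9, 0.1, 0.4], "s2": [0.7, 0.3, 0.2]},
    index=["GeneA", "GeneB", "GeneC"],
)
top = top_genes(toy, n_genes=2, aggregate="mean")
```

With `aggregate="max"`, a gene that contributes strongly in only a few samples can outrank one with a consistently moderate contribution, so the choice of statistic changes what "top" means.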

sctop.utils.perform_anova_selection(basis: DataFrame, adata: AnnData, training_IDs: ndarray, cell_type_column: str, n_features: int = 2000, percentile: float | None = None, standardize: bool = True) → Tuple[DataFrame, ndarray]

Perform ANOVA feature selection on the basis and optionally standardize.

Parameters:

basis (pd.DataFrame) – The basis matrix (genes x cell types)
adata (ad.AnnData) – The AnnData object
training_IDs (np.ndarray) – Training cell IDs
cell_type_column (str) – Column name for cell types
n_features (int) – Number of top features to select (if percentile is None)
percentile (float, optional) – Percentile of features to keep (overrides n_features)
standardize (bool) – Whether to standardize the basis after selection (default: True)

Returns:

basis_selected (pd.DataFrame) – Basis with selected features only (and standardized if requested)
selected_genes (np.ndarray) – Array of selected gene names

sctop.utils.plot_performance_summary(true_labels: List, predicted_labels: List, f1_df: DataFrame | None = None, figsize_base: int = 10)

Generates and displays a confusion matrix and a per-cell-type F1 score plot.

sctop.utils.print_metrics(metrics: Dict)

Pretty-print metrics.

sctop.utils.run_scoring_parallel(adata: AnnData, basis: DataFrame, test_IDs: ndarray, cell_type_column: str, spec_value: float, outer_chunks: int, inner_chunk_size: int, n_jobs: int = 4) → Tuple[dict, list, list, dict]

Optimized parallel scoring of test cells. Uses ThreadPoolExecutor for shared-memory parallel processing.

sctop.utils.score_chunk_optimized(adata: AnnData, basis: DataFrame, sample_IDs: ndarray, cell_type_column: str, spec_value: float, inner_chunk_size: int) → Tuple[dict, list, list, dict]

Optimized scoring of a single chunk of cells. Extracted as a separate function for parallel processing.

Visualization

Plotting functions for results visualization.

sctop.visualization.create_colorbar(data, label, colormap='rocket_r', ax=None)

sctop.visualization.plot_all_contributions(results: Dict[str, Dict], sample_names: List[str], output_dir: str | None = None, highlight_genes: Dict[str, List[str]] | None = None, dpi: int = 150, **plot_kwargs) → None

Create and save contribution plots for all cell types and samples.

Parameters:

results (dict) – Results from analyze_sample_contributions
sample_names (list) – List of sample names to plot
output_dir (str, optional) – Base directory for saving plots. If None, uses current directory
highlight_genes (dict, optional) – Dictionary mapping cell_type -> [genes_to_highlight]
dpi (int) – DPI for saved images
**plot_kwargs – Additional kwargs passed to plot_gene_contribution_scatter

sctop.visualization.plot_expression_distribution(scores, n=10, ax=None, box_color='skyblue', fontsize=30, **kwargs)

Plots boxplots of expression for top genes with a fixed y-axis scale.

sctop.visualization.plot_highest(projections, n=10, ax=None, color='olive', fontsize=40, **kwargs)

Plots a horizontal bar chart of the top N projections with a fixed x-axis scale.

sctop.visualization.plot_two(projections, celltype1, celltype2, gene=None, gene_expressions=None, ax=None, **kwargs)