API Reference

This page documents all public functions in the scTOP package.

Core Functions

The core functions handle data processing and scoring.

sctop.processing.process(df_in: DataFrame | ndarray | spmatrix, average: bool = False, chunk_size: int | None = None) → DataFrame

Process scRNA-seq data with optional chunking.

sctop.processing.rank_zscore_fast(arr: ndarray) → ndarray

Rank-based normal scores transformation.
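As a rough illustration of what a rank-based normal scores transformation does, the sketch below ranks each column, converts ranks to quantiles, and maps the quantiles through the inverse standard-normal CDF. The helper name rank_normal_scores, the rank / (n + 1) quantile convention, and the ordinal tie handling are assumptions made for this example; rank_zscore_fast may differ in any of them.

```python
from statistics import NormalDist

import numpy as np


def rank_normal_scores(arr: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a rank-based normal scores transform.

    Each column (sample) is ranked independently, ranks are mapped to
    quantiles in (0, 1), and quantiles become standard-normal z-scores.
    """
    arr = np.asarray(arr, dtype=float)
    n_genes = arr.shape[0]
    # Ordinal ranks 1..n per column; ties are broken by position (assumption).
    order = arr.argsort(axis=0, kind="stable")
    ranks = order.argsort(axis=0, kind="stable") + 1
    # Dividing by n + 1 keeps every quantile strictly inside (0, 1).
    quantiles = ranks / (n_genes + 1)
    inv_cdf = np.vectorize(NormalDist().inv_cdf)
    return inv_cdf(quantiles)


expr = np.array([[5.0, 0.0], [1.0, 3.0], [9.0, 7.0]])  # genes x samples
z = rank_normal_scores(expr)
```

Because only ranks enter the result, the transform is insensitive to the scale of the raw counts.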

sctop.processing.score(basis: DataFrame, sample: DataFrame, full_output: bool = False, chunk_size: int | None = None) → DataFrame | List

Project sample onto basis with optional chunking.
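The idea of projecting a sample onto a basis can be sketched with an ordinary least-squares solve. This is an illustration only: the toy basis, the gene and cell type names, and the use of numpy.linalg.lstsq are assumptions for the example, not a description of how score is implemented.

```python
import numpy as np
import pandas as pd

# Toy basis: 4 genes x 2 cell types (hypothetical values).
basis = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]],
    index=["g1", "g2", "g3", "g4"],
    columns=["typeA", "typeB"],
)
# A sample built as an exact mixture: 0.8 * typeA + 0.2 * typeB.
sample = pd.DataFrame({"cell1": 0.8 * basis["typeA"] + 0.2 * basis["typeB"]})

# Restrict both matrices to the genes they share, in the same order.
shared = basis.index.intersection(sample.index)
coeffs, *_ = np.linalg.lstsq(
    basis.loc[shared].to_numpy(), sample.loc[shared].to_numpy(), rcond=None
)

# One row per cell type, one column per sample.
scores = pd.DataFrame(coeffs, index=basis.columns, columns=sample.columns)
```

For a sample that is an exact linear mixture of basis columns, the recovered scores are the mixing coefficients.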

Basis Management

Functions for loading and creating reference bases.

sctop.sctop.analyze_sample_contributions(sample_data_dict: Dict[str, DataFrame | ndarray], basis: DataFrame, cell_types: List[str] | None = None, n_top_genes: int = 20, process_data: bool = True) → Dict[str, Dict]

Analyze gene contributions for multiple samples/clusters.

Parameters:
  • sample_data_dict (dict) – Dictionary mapping sample_name -> expression_data

  • basis (pd.DataFrame) – Basis matrix

  • cell_types (list, optional) – Cell types to analyze. If None, uses all

  • n_top_genes (int) – Number of top genes to identify per sample

  • process_data (bool) – Whether to process the data

Returns:

results – Nested dictionary with structure:

{cell_type: {
    ‘contributions’: {sample_name: contribution_matrix},
    ‘top_genes’: {sample_name: [gene1, gene2, …]},
    ‘expressions’: {sample_name: expression_matrix}
}}

Return type:

dict
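A dictionary with the nested layout described above can be flattened into a table for inspection. The sketch below builds a small hand-made results dictionary (all names and values are hypothetical) rather than calling analyze_sample_contributions, and it assumes the contribution matrices are DataFrames indexed by gene.

```python
import pandas as pd

# Hand-made stand-in for the documented return structure (hypothetical data).
results = {
    "typeA": {
        "contributions": {
            "s1": pd.DataFrame(
                {"c1": [0.2, 0.6], "c2": [0.4, 0.8]}, index=["g1", "g2"]
            )
        },
        "top_genes": {"s1": ["g2", "g1"]},
        "expressions": {
            "s1": pd.DataFrame(
                {"c1": [1.0, 2.0], "c2": [1.5, 2.5]}, index=["g1", "g2"]
            )
        },
    }
}

rows = []
for cell_type, entry in results.items():
    for sample_name, genes in entry["top_genes"].items():
        contrib = entry["contributions"][sample_name]
        for rank, gene in enumerate(genes, start=1):
            rows.append(
                {
                    "cell_type": cell_type,
                    "sample": sample_name,
                    "rank": rank,
                    "gene": gene,
                    # Average contribution of this gene across the sample's cells.
                    "mean_contribution": contrib.loc[gene].mean(),
                }
            )

summary = pd.DataFrame(rows)
```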

sctop.sctop.create_basis(adata: AnnData, cell_type_column: str, threshold: int, test_size: float = 0.2, random_state: int = 42, n_jobs: int = -1, do_anova: bool = False, n_features: int = 20000, anova_percentile: float | None = None, spec_value: float = 0.1, outer_chunks: int = 10, inner_chunk_size: int = 1000, n_scoring_jobs: int = 4, cv_folds: int | None = None, plot_results: bool = True) → Dict

Create basis and evaluate with optional ANOVA selection and cross-validation.

Parameters:

  • adata (ad.AnnData) – Annotated data object

  • cell_type_column (str) – Column name for cell types in adata.obs

  • threshold (int) – Minimum number of cells per cell type

  • test_size (float) – Fraction of data to use for testing (if cv_folds is None)

  • random_state (int) – Random seed

  • n_jobs (int) – Number of parallel jobs for basis creation

  • do_anova (bool) – Whether to perform ANOVA feature selection

  • n_features (int) – Number of features to select with ANOVA

  • anova_percentile (float, optional) – Percentile of features to keep (overrides n_features)

  • spec_value (float) – Threshold for unspecified predictions

  • outer_chunks (int) – Number of chunks for parallel scoring

  • inner_chunk_size (int) – Chunk size for internal processing

  • n_scoring_jobs (int) – Number of parallel jobs for scoring

  • cv_folds (int, optional) – Number of cross-validation folds. If None, uses a single train-test split

Returns:

results (dict) – Dictionary containing:

  • ‘basis’: final basis

  • ‘selected_genes’: selected genes (if ANOVA)

  • ‘metrics’: performance metrics

  • ‘cv_results’: cross-validation results (if cv_folds is not None)

  • ‘confusion_matrix’: confusion matrix

  • ‘per_cell_type’: per cell type accuracy
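The spec_value parameter is described only as a threshold for unspecified predictions. One plausible reading, sketched below, is that a cell is labeled with its highest-scoring cell type unless that score falls below spec_value. This rule, the helper name label_cells, and the label string "unspecified" are assumptions for illustration and are not confirmed by this page.

```python
import pandas as pd


def label_cells(
    scores: pd.DataFrame,
    spec_value: float = 0.1,
    unspecified_label: str = "unspecified",
) -> pd.Series:
    """Assign each cell (column) its top-scoring cell type (row).

    Assumed rule: if the top score is below spec_value, the cell is
    reported as unspecified instead.
    """
    best_type = scores.idxmax(axis=0)
    best_score = scores.max(axis=0)
    return best_type.where(best_score >= spec_value, unspecified_label)


# Hypothetical scores: cell types x cells.
scores = pd.DataFrame(
    {"cell1": [0.70, 0.10], "cell2": [0.05, 0.02]},
    index=["typeA", "typeB"],
)
labels = label_cells(scores, spec_value=0.1)
```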

sctop.sctop.list_available_bases() → List[str]

List available premade bases that can be loaded.

Returns:

basis_keys (list) – List of available basis keys

sctop.sctop.load_basis(basis_key: str, cache_dir: str | Path | None = None, force_download: bool = False) → Tuple[DataFrame, DataFrame]

Load a basis from an h5ad file hosted online.

Parameters:

  • basis_key (str) – Name/key of the basis to load (e.g., “MCKO legacy”)

  • cache_dir (str or Path, optional) – Directory to cache downloaded files. If None, uses the system temp directory

  • force_download (bool) – If True, re-downloads even if a cached file exists

Returns:

  • basis (pd.DataFrame) – Basis matrix (genes x cell types)

  • metadata (pd.DataFrame) – Metadata for the basis (cell types x attributes)

Example:

>>> from sctop.sctop import load_basis
>>> basis, metadata = load_basis(
...     basis_key="MCKO legacy"
... )

Analysis Functions

Utilities for gene contribution analysis and metrics.

sctop.utils.calculate_metrics(true_labels: List, predicted_labels: List, total_cells: int, accuracies: Dict) → Dict

Calculate comprehensive metrics.

sctop.utils.calculate_per_cell_type_accuracy(cell_accuracies: Dict) → DataFrame

Calculate per cell type accuracy.

sctop.utils.compute_gene_contributions(data: DataFrame | ndarray, basis: DataFrame, predictivity: DataFrame | None = None, cell_types: List[str] | None = None, process_data: bool = True) → Dict[str, DataFrame]

Compute gene-level contributions to cell type scores.

For each cell type, computes: contribution = expression * predictivity

Parameters:
  • data (DataFrame or array) – Expression data (genes x samples)

  • basis (pd.DataFrame) – Basis matrix

  • predictivity (pd.DataFrame, optional) – Precomputed predictivity matrix. If None, computed from basis

  • cell_types (list, optional) – Cell types to compute contributions for. If None, uses all

  • process_data (bool) – Whether to process the data first (default: True)

Returns:

contributions – Dictionary mapping cell_type -> contribution_matrix (genes x samples)

Return type:

dict
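The relation contribution = expression * predictivity can be illustrated on a toy matrix: each gene's expression is multiplied element-wise by that gene's predictivity weight for one cell type. The data, names, and the use of a pseudo-inverse for the predictivity below are assumptions for the example, and any preprocessing the real function applies is omitted.

```python
import numpy as np
import pandas as pd

genes = ["g1", "g2", "g3"]
# Hypothetical basis (genes x cell types) and expression data (genes x samples).
basis = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], index=genes, columns=["typeA", "typeB"]
)
data = pd.DataFrame(
    [[2.0, 0.0], [0.0, 3.0], [2.0, 3.0]], index=genes, columns=["s1", "s2"]
)

# Predictivity (cell types x genes), here taken as the pseudo-inverse of the basis.
predictivity = pd.DataFrame(
    np.linalg.pinv(basis.to_numpy()), index=basis.columns, columns=basis.index
)

# For each cell type, scale every gene's expression by its predictivity weight.
contributions = {
    cell_type: data.mul(predictivity.loc[cell_type], axis=0)
    for cell_type in basis.columns
}

# Summing contributions over genes gives the projection score per sample.
scores = predictivity @ data
```

In this sketch the per-gene contributions for a cell type add up to that cell type's score, which is what makes them readable as a decomposition of the score.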

sctop.utils.compute_predictivity(basis: DataFrame) → DataFrame

Compute predictivity matrix from basis.

The predictivity shows how each gene contributes to each cell type’s score. Formula: predictivity = inv(B^T @ B) @ B^T

Parameters:

basis (pd.DataFrame) – Basis matrix (genes x cell_types)

Returns:

predictivity – Predictivity matrix (cell_types x genes), showing how each gene contributes to each cell type score

Return type:

pd.DataFrame
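The formula inv(B^T @ B) @ B^T is the left pseudo-inverse of the basis when its columns are linearly independent. A minimal NumPy check of that identity, using a random toy basis (the shapes and names are assumptions for the example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical basis: 6 genes x 3 cell types, full column rank.
basis = pd.DataFrame(
    rng.normal(size=(6, 3)),
    index=[f"g{i}" for i in range(6)],
    columns=["A", "B", "C"],
)

B = basis.to_numpy()
# Direct transcription of the documented formula.
P = np.linalg.inv(B.T @ B) @ B.T

# Label the result: rows are cell types, columns are genes.
predictivity = pd.DataFrame(P, index=basis.columns, columns=basis.index)

# Applying the predictivity to the basis itself recovers the identity.
identity_check = P @ B
```

For poorly conditioned bases, numpy.linalg.pinv gives the same matrix with better numerical behavior than forming the explicit inverse.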

sctop.utils.create_basis_optimized(adata: AnnData, cell_type_column: str, threshold: int, test_size: float, random_state: int, n_jobs: int = -1) → Tuple[DataFrame, ndarray, ndarray]

Original function - kept for backwards compatibility.

sctop.utils.find_top_contributing_genes(contributions: DataFrame, n_genes: int = 20, aggregate: str = 'mean') → Series

Find top contributing genes from contribution matrix.

Parameters:
  • contributions (pd.DataFrame) – Gene contributions (genes x samples)

  • n_genes (int) – Number of top genes to return

  • aggregate (str) – How to aggregate across samples: ‘mean’, ‘median’, ‘max’

Returns:

top_genes – Top contributing genes with their aggregated scores

Return type:

pd.Series
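The aggregation step can be pictured as reducing the contribution matrix across samples and keeping the largest values. The helper below is a hypothetical re-implementation for illustration; in particular, ranking by signed value rather than absolute value is an assumption, and the real function may order genes differently.

```python
import pandas as pd


def top_genes(
    contributions: pd.DataFrame, n_genes: int = 20, aggregate: str = "mean"
) -> pd.Series:
    """Aggregate a genes x samples matrix per gene and keep the top n_genes."""
    reducers = {
        "mean": contributions.mean,
        "median": contributions.median,
        "max": contributions.max,
    }
    if aggregate not in reducers:
        raise ValueError(f"aggregate must be one of {sorted(reducers)}")
    # axis=1 collapses the sample axis, leaving one value per gene.
    per_gene = reducers[aggregate](axis=1)
    return per_gene.nlargest(n_genes)


# Hypothetical contribution matrix: genes x samples.
contrib = pd.DataFrame(
    {"s1": [0.1, 0.9, -0.2, 0.5], "s2": [0.3, 0.7, 0.0, 0.5]},
    index=["g1", "g2", "g3", "g4"],
)
top_by_mean = top_genes(contrib, n_genes=2, aggregate="mean")
top_by_max = top_genes(contrib, n_genes=2, aggregate="max")
```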

sctop.utils.perform_anova_selection(basis: DataFrame, adata: AnnData, training_IDs: ndarray, cell_type_column: str, n_features: int = 2000, percentile: float | None = None, standardize: bool = True) → Tuple[DataFrame, ndarray]

Perform ANOVA feature selection on the basis and optionally standardize.

Parameters:

  • basis (pd.DataFrame) – The basis matrix (genes x cell types)

  • adata (ad.AnnData) – The AnnData object

  • training_IDs (np.ndarray) – Training cell IDs

  • cell_type_column (str) – Column name for cell types

  • n_features (int) – Number of top features to select (if percentile is None)

  • percentile (float, optional) – Percentile of features to keep (overrides n_features)

  • standardize (bool) – Whether to standardize the basis after selection (default: True)

Returns:

  • basis_selected (pd.DataFrame) – Basis with selected features only (and standardized if requested)

  • selected_genes (np.ndarray) – Array of selected gene names

sctop.utils.plot_performance_summary(true_labels: List, predicted_labels: List, f1_df: DataFrame | None = None, figsize_base: int = 10)

Generates and displays a confusion matrix and a per-cell-type F1 score plot.

sctop.utils.print_metrics(metrics: Dict)

Pretty print metrics.

sctop.utils.run_scoring_parallel(adata: AnnData, basis: DataFrame, test_IDs: ndarray, cell_type_column: str, spec_value: float, outer_chunks: int, inner_chunk_size: int, n_jobs: int = 4) → Tuple[dict, list, list, dict]

Score test cells in parallel, using ThreadPoolExecutor for shared-memory parallel processing.

sctop.utils.score_chunk_optimized(adata: AnnData, basis: DataFrame, sample_IDs: ndarray, cell_type_column: str, spec_value: float, inner_chunk_size: int) → Tuple[dict, list, list, dict]

Score a single chunk of cells. Factored out as a separate function so that chunks can be scored in parallel.

Visualization

Plotting functions for results visualization.

sctop.visualization.create_colorbar(data, label, colormap='rocket_r', ax=None)

sctop.visualization.plot_all_contributions(results: Dict[str, Dict], sample_names: List[str], output_dir: str | None = None, highlight_genes: Dict[str, List[str]] | None = None, dpi: int = 150, **plot_kwargs) → None

Create and save contribution plots for all cell types and samples.

Parameters:
  • results (dict) – Results from analyze_sample_contributions

  • sample_names (list) – List of sample names to plot

  • output_dir (str, optional) – Base directory for saving plots. If None, uses current directory

  • highlight_genes (dict, optional) – Dictionary mapping cell_type -> [genes_to_highlight]

  • dpi (int) – DPI for saved images

  • **plot_kwargs – Additional kwargs passed to plot_gene_contribution_scatter

sctop.visualization.plot_expression_distribution(scores, n=10, ax=None, box_color='skyblue', fontsize=30, **kwargs)

Plots boxplots of expression for top genes with a fixed y-axis scale.

sctop.visualization.plot_highest(projections, n=10, ax=None, color='olive', fontsize=40, **kwargs)

Plots a horizontal bar chart of the top N projections with a fixed x-axis scale.

sctop.visualization.plot_two(projections, celltype1, celltype2, gene=None, gene_expressions=None, ax=None, **kwargs)