Refrence Basis Management

Functions for loading pre-made reference bases and creating custom reference bases from annotated data.

list_available_bases

sctop.list_available_bases() List[str][source]

List available premade bases that can be loaded.

Returns:

basis_keyslist

List of available basis keys

Purpose: List all pre-computed reference bases available for download.

Returns:

  • list: Names of available basis keys

Example:

import sctop as top

available = top.list_available_bases()
print("Available bases:", available)
# ['MCKO legacy']

Notes:

  • Bases are hosted on Figshare

  • New bases may be added over time

  • Use the returned keys with load_basis()

load_basis

sctop.load_basis(basis_key: str, cache_dir: str | Path | None = None, force_download: bool = False) Tuple[DataFrame, DataFrame][source]

Load a basis from an h5ad file hosted online.

Parameters:

basis_keystr

Name/key of the basis to load (e.g., “MCKO legacy”)

cache_dirstr, optional

Directory to cache downloaded files. If None, uses system temp directory.

force_downloadbool

If True, re-downloads even if cached file exists

Returns:

basispd.DataFrame

Basis matrix (genes x cell types)

metadatapd.DataFrame

Metadata for the basis (cell types x attributes)

Example:

>>> basis, metadata = load_basis(
...     basis_key="MCKO legacy"
... )

Purpose: Download and load a pre-made reference basis from online storage.

Key Features:

  • Automatic download with progress bar

  • Local caching to avoid re-downloading

  • Returns both basis and metadata

Parameters:

  • basis_key (str): Name of the basis (from list_available_bases())

  • cache_dir (str or Path, optional): Directory to cache downloaded files (default: system temp)

  • force_download (bool, default=False): Re-download even if cached

Returns:

  • basis (pd.DataFrame): Cell type basis (genes × cell types)

  • metadata (pd.DataFrame): Cell type metadata

Example:

# Load with default cache location
basis, metadata = top.load_basis("MCKO legacy")

# Use custom cache directory
basis, metadata = top.load_basis(
    basis_key="MCKO legacy",
    cache_dir="./my_bases"
)

# Force re-download
basis, metadata = top.load_basis(
    basis_key="MCKO legacy",
    force_download=True
)

print(f"Loaded basis: {basis.shape[0]} genes × {basis.shape[1]} cell types")
print("Cell types:", list(basis.columns))

Notes:

  • First download may take time depending on connection

  • Subsequent loads are instant (cached)

  • Basis is already processed (ready for score())

  • Check metadata for cell type counts and other info

Available Bases:

MCKO legacy

  • Source: Mouse Cell Atlas (Kotton lab variant)

  • Organism: Mouse (Mus musculus)

  • Cell types: Over 100 major cell types

  • Format: Pre-processed, ready to use

create_basis

sctop.create_basis(adata: AnnData, cell_type_column: str, threshold: int, test_size: float = 0.2, random_state: int = 42, n_jobs: int = -1, do_anova: bool = False, n_features: int = 20000, anova_percentile: float | None = None, spec_value: float = 0.1, outer_chunks: int = 10, inner_chunk_size: int = 1000, n_scoring_jobs: int = 4, cv_folds: int | None = None, plot_results: bool = True) Dict[source]

Create basis and evaluate with optional ANOVA selection and cross-validation.

Parameters:

adataad.AnnData

Annotated data object

cell_type_columnstr

Column name for cell types in adata.obs

thresholdint

Minimum number of cells per cell type

test_sizefloat

Fraction of data to use for testing (if cv_folds is None)

random_stateint

Random seed

n_jobsint

Number of parallel jobs for basis creation

do_anovabool

Whether to perform ANOVA feature selection

n_featuresint

Number of features to select with ANOVA

anova_percentilefloat, optional

Percentile of features to keep (overrides n_features)

spec_valuefloat

Threshold for unspecified predictions

outer_chunksint

Number of chunks for parallel scoring

inner_chunk_sizeint

Chunk size for internal processing

n_scoring_jobsint

Number of parallel jobs for scoring

cv_foldsint, optional

Number of cross-validation folds. If None, uses single train-test split

Returns:

resultsdict

Dictionary containing: - ‘basis’: final basis - ‘selected_genes’: selected genes (if ANOVA) - ‘metrics’: performance metrics - ‘cv_results’: cross-validation results (if cv_folds is not None) - ‘confusion_matrix’: confusion matrix - ‘per_cell_type’: per cell type accuracy

Purpose: Create a custom cell type reference basis from annotated scRNA-seq data with automatic validation.

Key Features:

  • Train-test split or k-fold cross-validation

  • Optional ANOVA feature selection

  • Parallel processing for speed

  • Comprehensive performance metrics

  • Confusion matrix and per-cell-type accuracy

  • Memory-efficient chunking

  • Automatic visualization of results

Parameters:

  • adata (ad.AnnData): Annotated data object with cell type labels

  • cell_type_column (str): Column name in adata.obs containing cell types

  • threshold (int): Minimum number of cells required per cell type

  • test_size (float, default=0.2): Fraction of data for testing (if cv_folds=None)

  • random_state (int, default=42): Random seed for reproducibility

  • n_jobs (int, default=-1): Number of parallel jobs (-1 = all cores)

  • do_anova (bool, default=False): Whether to perform feature selection

  • n_features (int, default=20000): Number of features to select (if do_anova=True)

  • anova_percentile (float, optional): Keep top percentile of genes (overrides n_features)

  • spec_value (float, default=0.1): Threshold for “unspecified” predictions

  • outer_chunks (int, default=10): Number of chunks for parallel scoring

  • inner_chunk_size (int, default=1000): Chunk size for internal processing

  • n_scoring_jobs (int, default=4): Number of parallel jobs for scoring

  • cv_folds (int, optional): If specified, use k-fold cross-validation instead of train-test

  • plot_results (bool, default=True): Whether to generate performance plots

Returns:

Dictionary containing:

  • basis (pd.DataFrame): Final cell type basis (genes × cell types)

  • selected_genes (np.ndarray or None): Selected genes if ANOVA was used

  • training_IDs (np.ndarray): Cell IDs used for training

  • test_IDs (np.ndarray): Cell IDs used for testing

  • metrics (dict): Performance metrics including:

    • accuracy: Top-1 accuracy

    • top3_accuracy: Top-3 accuracy

    • unspecified_rate: Fraction of low-confidence predictions

    • f1_macro, f1_weighted: F1 scores

    • precision_macro, precision_weighted

    • recall_macro, recall_weighted

  • confusion_matrix (np.ndarray): Confusion matrix

  • confusion_matrix_labels (list): Cell type labels for confusion matrix

  • per_cell_type (pd.DataFrame): Per-cell-type accuracy metrics

  • f1_scores (pd.DataFrame): F1 scores for each cell type

  • true_labels (list): True cell type labels (test set)

  • predicted_labels (list): Predicted cell type labels (test set)

  • cv_results (list, optional): Cross-validation fold results (if cv_folds specified)

  • cv_avg_metrics (dict, optional): Average CV metrics (if cv_folds specified)

Example 1: Basic Usage:

import anndata as ad
import sctop as top

# Load annotated data
adata = ad.read_h5ad("mouse_atlas.h5ad")

# Create basis with validation
# plot_results = True will display performance metrics like accuracy and F1 scores
results = top.create_basis(
    adata=adata,
    cell_type_column='cell_type',
    threshold=100,  # At least 100 cells per type
    test_size=0.2,
    plot_results=True
)

# Get the basis
basis = results['basis']

Example 2: Memory-Efficient for Large Datasets:

results = top.create_basis(
    adata=adata,
    cell_type_column='cell_type',
    threshold=100,
    n_jobs=4,  # Limit parallel jobs
    n_scoring_jobs=2,
    inner_chunk_size=500,  # Smaller chunks
    outer_chunks=50  # More chunks
)
Example 3: With Feature Selection::

IMPORTANT: The best practice is to carefully curate your basis by seriously considering what is a biologically meaningful cell type and combining similar cell types (e.g. although epithelial cells are very specialized, many stromal and immune cell types are functionally identical). Merging similar cell types, dropping cell types with very few cells, and dropping questionably-annotated cell types should ALWAYS be done first before considering feature selection. Feature selection should be a last resort to improve performance after careful curation of cell types.

results = top.create_basis(

adata=adata, cell_type_column=’cell_type’, threshold=100, do_anova=True, n_features=5000, # Select top 5000 genes random_state=42

)

selected_genes = results[‘selected_genes’] print(f”Selected {len(selected_genes)} informative genes”)

Example 4: Cross-Validation:

results = top.create_basis(
    adata=adata,
    cell_type_column='cell_type',
    threshold=100,
    cv_folds=5,  # 5-fold cross-validation
    n_jobs=-1,
    random_state=42
)

# Check CV results
cv_metrics = results['cv_avg_metrics']
print(f"CV Accuracy: {cv_metrics['accuracy_mean']:.3f} ± {cv_metrics['accuracy_std']:.3f}")

Notes:

  • Threshold: Cell types with fewer than threshold cells are excluded

  • Unspecified: Cells with max score < spec_value are marked “unspecified”

  • Parallelization: Uses thread pools for shared-memory parallel processing

  • Memory: Adjust inner_chunk_size and n_scoring_jobs for large datasets

  • ANOVA: Reduces features but may remove important genes

  • Cross-validation: More robust but slower than single train-test split

Performance Tips:

For small datasets (<10k cells):

results = create_basis(
    adata, 'cell_type', 100,
    n_jobs=-1,  # Use all cores
    n_scoring_jobs=8
)

For large datasets (>100k cells):

results = create_basis(
    adata, 'cell_type', 100,
    n_jobs=4,  # Conservative
    n_scoring_jobs=2,
    inner_chunk_size=500,
    outer_chunks=100,
)

Troubleshooting:

  • Check per-cell-type metrics to find problematic types

  • Examine confusion matrix for similar types

  • Consider merging similar cell types

  • Increase threshold to require more cells per type

  • Try ANOVA feature selection as a last resort

analyze_sample_contributions

sctop.analyze_sample_contributions(sample_data_dict: Dict[str, DataFrame | ndarray], basis: DataFrame, cell_types: List[str] | None = None, n_top_genes: int = 20, process_data: bool = True) Dict[str, Dict][source]

Analyze gene contributions for multiple samples/clusters.

Parameters:
  • sample_data_dict (dict) – Dictionary mapping sample_name -> expression_data

  • basis (pd.DataFrame) – Basis matrix

  • cell_types (list, optional) – Cell types to analyze. If None, uses all

  • n_top_genes (int) – Number of top genes to identify per sample

  • process_data (bool) – Whether to process the data

Returns:

results – Nested dictionary with structure: {cell_type: {

’contributions’: {sample_name: contribution_matrix}, ‘top_genes’: {sample_name: [gene1, gene2, …]}, ‘expressions’: {sample_name: expression_matrix}

}}

Return type:

dict

Purpose: Analyze which genes contribute most to cell type scores across multiple samples.

Key Features:

  • Computes gene-level contributions for each cell type

  • Identifies top contributing genes per sample

  • Handles multiple samples/clusters simultaneously

  • Optional data processing

Parameters:

  • sample_data_dict (dict): Maps sample_name → expression_data (DataFrame or array)

  • basis (pd.DataFrame): Cell type basis

  • cell_types (list, optional): Cell types to analyze (default: all in basis)

  • n_top_genes (int, default=20): Number of top genes to identify

  • process_data (bool, default=True): Whether to process raw counts

Returns:

Dictionary with structure:

{
    cell_type: {
        'contributions': {sample_name: contribution_matrix},
        'top_genes': {sample_name: [gene1, gene2, ...]},
        'expressions': {sample_name: expression_matrix}
    }
}

Example:

# Prepare sample dictionary
sample_dict = {
    'cluster_1': cluster1_counts,
    'cluster_2': cluster2_counts,
    'cluster_3': cluster3_counts
}

# Analyze contributions
results = top.analyze_sample_contributions(
    sample_data_dict=sample_dict,
    basis=basis,
    cell_types=['T cell', 'B cell', 'Macrophage'],
    n_top_genes=20
)

# Get top genes for T cells in cluster 1
t_cell_genes = results['T cell']['top_genes']['cluster_1']
print("Top T cell genes in cluster 1:", t_cell_genes[:10])

# Get contribution matrix
t_cell_contrib = results['T cell']['contributions']['cluster_1']

Notes:

  • Contribution = gene expression × predictivity

  • Higher contribution = gene more responsible for that cell type score

  • Useful for identifying markers and understanding assignments