Reference Basis Management
Functions for loading pre-made reference bases and creating custom reference bases from annotated data.
list_available_bases
- sctop.list_available_bases() -> List[str]
List available premade bases that can be loaded.
Returns:
- basis_keys (list)
List of available basis keys
Purpose: List all pre-computed reference bases available for download.
Returns:
list: Names of available basis keys
Example:
import sctop as top
available = top.list_available_bases()
print("Available bases:", available)
# ['MCKO legacy']
Notes:
Bases are hosted on Figshare
New bases may be added over time
Use the returned keys with load_basis()
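Because the set of available keys may change over time, it can help to validate a key before passing it on. The sketch below is one way to do that; `check_basis_key` is a hypothetical helper (not part of sctop), and the `available` list stands in for the value returned by `list_available_bases()`.

```python
import difflib

def check_basis_key(requested, available):
    """Return `requested` if it is a known key, else raise with suggestions.

    Hypothetical helper for illustration; not part of sctop.
    """
    if requested in available:
        return requested
    # Case-insensitive fuzzy match to suggest keys the caller may have meant
    lowered = {key.lower(): key for key in available}
    close = difflib.get_close_matches(requested.lower(), list(lowered), n=3, cutoff=0.6)
    hint = ", ".join(repr(lowered[c]) for c in close) or "none"
    raise KeyError(f"Unknown basis key {requested!r}; close matches: {hint}")

# Stand-in for the list returned by top.list_available_bases()
available = ["MCKO legacy"]
print(check_basis_key("MCKO legacy", available))
```

A misspelled or differently cased key raises a `KeyError` that names the closest available keys.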
load_basis
- sctop.load_basis(basis_key: str, cache_dir: str | Path | None = None, force_download: bool = False) -> Tuple[DataFrame, DataFrame]
Load a basis from an h5ad file hosted online.
Parameters:
- basis_key (str)
Name/key of the basis to load (e.g., "MCKO legacy")
- cache_dir (str or Path, optional)
Directory to cache downloaded files. If None, uses the system temp directory.
- force_download (bool)
If True, re-downloads even if a cached file exists
Returns:
- basis (pd.DataFrame)
Basis matrix (genes x cell types)
- metadata (pd.DataFrame)
Metadata for the basis (cell types x attributes)
Example:
>>> basis, metadata = load_basis(
...     basis_key="MCKO legacy"
... )
Purpose: Download and load a pre-made reference basis from online storage.
Key Features:
Automatic download with progress bar
Local caching to avoid re-downloading
Returns both basis and metadata
Parameters:
basis_key (str): Name of the basis (from list_available_bases())
cache_dir (str or Path, optional): Directory to cache downloaded files (default: system temp)
force_download (bool, default=False): Re-download even if cached
Returns:
basis (pd.DataFrame): Cell type basis (genes × cell types)
metadata (pd.DataFrame): Cell type metadata
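The interaction between `cache_dir` and `force_download` described above can be sketched as follows. This is an illustration of the caching decision only, not sctop's implementation: `cached_file_path`, `needs_download`, and the file-naming scheme are all assumptions.

```python
import tempfile
from pathlib import Path

def cached_file_path(basis_key, cache_dir=None):
    """Hypothetical helper: where a basis file would be cached on disk."""
    # Fall back to the system temp directory when no cache_dir is given
    root = Path(cache_dir) if cache_dir is not None else Path(tempfile.gettempdir())
    # Assumed naming scheme: replace spaces so the key is a safe file name
    return root / (basis_key.replace(" ", "_") + ".h5ad")

def needs_download(path, force_download=False):
    """Download when forced, or when no cached copy exists."""
    return force_download or not Path(path).exists()

with tempfile.TemporaryDirectory() as tmp:
    path = cached_file_path("MCKO legacy", cache_dir=tmp)
    first = needs_download(path)                        # nothing cached yet
    path.write_bytes(b"placeholder")                    # stand-in for the downloaded file
    second = needs_download(path)                       # cached copy would be reused
    forced = needs_download(path, force_download=True)  # cache is bypassed

print(first, second, forced)  # True False True
```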
Example:
# Load with default cache location
basis, metadata = top.load_basis("MCKO legacy")
# Use custom cache directory
basis, metadata = top.load_basis(
basis_key="MCKO legacy",
cache_dir="./my_bases"
)
# Force re-download
basis, metadata = top.load_basis(
basis_key="MCKO legacy",
force_download=True
)
print(f"Loaded basis: {basis.shape[0]} genes × {basis.shape[1]} cell types")
print("Cell types:", list(basis.columns))
Notes:
First download may take time depending on connection
Subsequent loads are instant (cached)
Basis is already processed (ready for score())
Check metadata for cell type counts and other info
Available Bases:
MCKO legacy
Source: Mouse Cell Atlas (Kotton lab variant)
Organism: Mouse (Mus musculus)
Cell types: Over 100 major cell types
Format: Pre-processed, ready to use
create_basis
- sctop.create_basis(adata: AnnData, cell_type_column: str, threshold: int, test_size: float = 0.2, random_state: int = 42, n_jobs: int = -1, do_anova: bool = False, n_features: int = 20000, anova_percentile: float | None = None, spec_value: float = 0.1, outer_chunks: int = 10, inner_chunk_size: int = 1000, n_scoring_jobs: int = 4, cv_folds: int | None = None, plot_results: bool = True) -> Dict
Create basis and evaluate with optional ANOVA selection and cross-validation.
Parameters:
- adata (ad.AnnData)
Annotated data object
- cell_type_column (str)
Column name for cell types in adata.obs
- threshold (int)
Minimum number of cells per cell type
- test_size (float)
Fraction of data to use for testing (if cv_folds is None)
- random_state (int)
Random seed
- n_jobs (int)
Number of parallel jobs for basis creation
- do_anova (bool)
Whether to perform ANOVA feature selection
- n_features (int)
Number of features to select with ANOVA
- anova_percentile (float, optional)
Percentile of features to keep (overrides n_features)
- spec_value (float)
Threshold for unspecified predictions
- outer_chunks (int)
Number of chunks for parallel scoring
- inner_chunk_size (int)
Chunk size for internal processing
- n_scoring_jobs (int)
Number of parallel jobs for scoring
- cv_folds (int, optional)
Number of cross-validation folds. If None, uses single train-test split
Returns:
- results (dict)
Dictionary containing:
- 'basis': final basis
- 'selected_genes': selected genes (if ANOVA)
- 'metrics': performance metrics
- 'cv_results': cross-validation results (if cv_folds is not None)
- 'confusion_matrix': confusion matrix
- 'per_cell_type': per cell type accuracy
Purpose: Create a custom cell type reference basis from annotated scRNA-seq data with automatic validation.
Key Features:
Train-test split or k-fold cross-validation
Optional ANOVA feature selection
Parallel processing for speed
Comprehensive performance metrics
Confusion matrix and per-cell-type accuracy
Memory-efficient chunking
Automatic visualization of results
Parameters:
adata (ad.AnnData): Annotated data object with cell type labels
cell_type_column (str): Column name in adata.obs containing cell types
threshold (int): Minimum number of cells required per cell type
test_size (float, default=0.2): Fraction of data for testing (if cv_folds=None)
random_state (int, default=42): Random seed for reproducibility
n_jobs (int, default=-1): Number of parallel jobs (-1 = all cores)
do_anova (bool, default=False): Whether to perform feature selection
n_features (int, default=20000): Number of features to select (if do_anova=True)
anova_percentile (float, optional): Keep top percentile of genes (overrides n_features)
spec_value (float, default=0.1): Threshold for “unspecified” predictions
outer_chunks (int, default=10): Number of chunks for parallel scoring
inner_chunk_size (int, default=1000): Chunk size for internal processing
n_scoring_jobs (int, default=4): Number of parallel jobs for scoring
cv_folds (int, optional): If specified, use k-fold cross-validation instead of train-test
plot_results (bool, default=True): Whether to generate performance plots
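To make the roles of `test_size` and `random_state` concrete, here is a minimal sketch of a reproducible hold-out split over cell IDs. It is a plain shuffled split written for illustration; whether sctop stratifies by cell type or uses a different splitter is not stated here, and `split_cell_ids` is a hypothetical name.

```python
import random

def split_cell_ids(cell_ids, test_size=0.2, random_state=42):
    """Shuffle cell IDs reproducibly and hold out a fraction for testing.

    Illustrative only; not sctop's splitting logic.
    """
    ids = list(cell_ids)
    # A seeded generator makes the shuffle identical across runs
    random.Random(random_state).shuffle(ids)
    n_test = int(round(len(ids) * test_size))
    return ids[n_test:], ids[:n_test]   # (training IDs, test IDs)

cell_ids = [f"cell_{i}" for i in range(10)]
train_ids, test_ids = split_cell_ids(cell_ids, test_size=0.2, random_state=42)
print(len(train_ids), len(test_ids))  # 8 2
```

Re-running with the same `random_state` returns the same partition, which is what makes reported metrics comparable between runs.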
Returns:
Dictionary containing:
basis (pd.DataFrame): Final cell type basis (genes × cell types)
selected_genes (np.ndarray or None): Selected genes if ANOVA was used
training_IDs (np.ndarray): Cell IDs used for training
test_IDs (np.ndarray): Cell IDs used for testing
metrics (dict): Performance metrics including:
accuracy: Top-1 accuracy
top3_accuracy: Top-3 accuracy
unspecified_rate: Fraction of low-confidence predictions
f1_macro, f1_weighted: F1 scores
precision_macro, precision_weighted
recall_macro, recall_weighted
confusion_matrix (np.ndarray): Confusion matrix
confusion_matrix_labels (list): Cell type labels for confusion matrix
per_cell_type (pd.DataFrame): Per-cell-type accuracy metrics
f1_scores (pd.DataFrame): F1 scores for each cell type
true_labels (list): True cell type labels (test set)
predicted_labels (list): Predicted cell type labels (test set)
cv_results (list, optional): Cross-validation fold results (if cv_folds specified)
cv_avg_metrics (dict, optional): Average CV metrics (if cv_folds specified)
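The three headline metrics can be illustrated with a small sketch computed from per-cell score dictionaries. The scores below are made-up toy values, `summarize_predictions` is a hypothetical helper, and the choice to count an "unspecified" call as a top-1 miss is an assumption rather than documented sctop behavior.

```python
def summarize_predictions(scores_per_cell, true_labels, spec_value=0.1):
    """Compute top-1 accuracy, top-3 accuracy, and the unspecified rate.

    Illustrative only; sctop's exact metric definitions may differ.
    """
    top1 = top3 = unspecified = 0
    for scores, truth in zip(scores_per_cell, true_labels):
        # Cell types ordered from highest to lowest score
        ranked = sorted(scores, key=scores.get, reverse=True)
        if scores[ranked[0]] < spec_value:
            # Best score is below the threshold: low-confidence call
            unspecified += 1
            predicted = "unspecified"
        else:
            predicted = ranked[0]
        top1 += predicted == truth
        top3 += truth in ranked[:3]
    n = len(true_labels)
    return {
        "accuracy": top1 / n,
        "top3_accuracy": top3 / n,
        "unspecified_rate": unspecified / n,
    }

# Toy scores for four cells (made-up values)
scores_per_cell = [
    {"T cell": 0.60, "B cell": 0.20, "Macrophage": 0.10, "NK cell": 0.05},
    {"T cell": 0.30, "B cell": 0.25, "Macrophage": 0.10, "NK cell": 0.05},
    {"T cell": 0.05, "B cell": 0.04, "Macrophage": 0.03, "NK cell": 0.08},
    {"T cell": 0.10, "B cell": 0.20, "Macrophage": 0.50, "NK cell": 0.30},
]
true_labels = ["T cell", "B cell", "NK cell", "T cell"]
summary = summarize_predictions(scores_per_cell, true_labels, spec_value=0.1)
print(summary)
```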
Example 1: Basic Usage:
import anndata as ad
import sctop as top
# Load annotated data
adata = ad.read_h5ad("mouse_atlas.h5ad")
# Create basis with validation
# plot_results = True will display performance metrics like accuracy and F1 scores
results = top.create_basis(
adata=adata,
cell_type_column='cell_type',
threshold=100, # At least 100 cells per type
test_size=0.2,
plot_results=True
)
# Get the basis
basis = results['basis']
Example 2: Memory-Efficient for Large Datasets:
results = top.create_basis(
adata=adata,
cell_type_column='cell_type',
threshold=100,
n_jobs=4, # Limit parallel jobs
n_scoring_jobs=2,
inner_chunk_size=500, # Smaller chunks
outer_chunks=50 # More chunks
)
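The reasoning behind these settings is that smaller chunks bound how much data each worker holds at once, while fewer workers limit how many chunks are in memory at the same time. The sketch below shows that general pattern with a thread pool; `make_chunks` and `score_chunk` are hypothetical stand-ins, not sctop internals.

```python
from concurrent.futures import ThreadPoolExecutor

def make_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def score_chunk(chunk):
    # Placeholder for a real per-chunk scoring step
    return [value * 2 for value in chunk]

cells = list(range(10))
chunks = make_chunks(cells, chunk_size=4)

# max_workers plays the role of a job-count setting such as n_scoring_jobs
with ThreadPoolExecutor(max_workers=2) as pool:
    # Executor.map returns results in the same order as the input chunks
    results = [r for part in pool.map(score_chunk, chunks) for r in part]

print(len(chunks), results[:3])  # 3 [0, 2, 4]
```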
Example 3: With Feature Selection:
IMPORTANT: Curate the basis before considering feature selection. Decide which cell types are biologically meaningful and combine similar ones (e.g. although epithelial cells are very specialized, many stromal and immune cell types are functionally identical). Merging similar cell types, dropping cell types with very few cells, and dropping questionably annotated cell types should ALWAYS come first. Treat feature selection as a last resort for improving performance after careful curation of cell types.
results = top.create_basis(
adata=adata,
cell_type_column='cell_type',
threshold=100,
do_anova=True,
n_features=5000, # Select top 5000 genes
random_state=42
)
selected_genes = results['selected_genes']
print(f"Selected {len(selected_genes)} informative genes")
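The curation step recommended above (merging similar types and dropping questionable ones) can be sketched with a plain label mapping. The label names, `merge_map`, and `drop_labels` are hypothetical examples, not recommendations for any particular dataset.

```python
from collections import Counter

# Hypothetical mapping from fine-grained annotations to curated labels
merge_map = {
    "CD4 T cell": "T cell",
    "CD8 T cell": "T cell",
    "Naive B cell": "B cell",
    "Memory B cell": "B cell",
}
drop_labels = {"Unknown"}   # questionably annotated labels to exclude

raw_labels = [
    "CD4 T cell", "CD8 T cell", "Naive B cell",
    "Unknown", "Macrophage", "Memory B cell",
]

# Drop excluded labels, then map the rest; unmapped labels pass through unchanged
curated = [merge_map.get(label, label) for label in raw_labels if label not in drop_labels]
print(Counter(curated))
```

With an AnnData object, the same mapping could be applied to the label column in `adata.obs` (for example with pandas `Series.replace`) before calling `create_basis`.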
Example 4: Cross-Validation:
results = top.create_basis(
adata=adata,
cell_type_column='cell_type',
threshold=100,
cv_folds=5, # 5-fold cross-validation
n_jobs=-1,
random_state=42
)
# Check CV results
cv_metrics = results['cv_avg_metrics']
print(f"CV Accuracy: {cv_metrics['accuracy_mean']:.3f} ± {cv_metrics['accuracy_std']:.3f}")
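For intuition about what the averaged CV metrics summarize, the sketch below builds k folds over item indices and aggregates per-fold accuracies into a mean and standard deviation. The fold assignment scheme, the made-up accuracies, and the use of the population standard deviation are all assumptions; sctop may compute these differently.

```python
import statistics

def kfold_indices(n_items, n_folds):
    """Yield (train_idx, test_idx) pairs; fold i holds out every n_folds-th item."""
    indices = list(range(n_items))
    for fold in range(n_folds):
        test_idx = indices[fold::n_folds]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx

folds = list(kfold_indices(n_items=10, n_folds=5))

# Made-up per-fold accuracies standing in for real evaluation results
fold_accuracies = [0.90, 0.92, 0.88, 0.91, 0.89]
accuracy_mean = statistics.mean(fold_accuracies)
accuracy_std = statistics.pstdev(fold_accuracies)   # population std; an assumption
print(f"CV Accuracy: {accuracy_mean:.3f} ± {accuracy_std:.3f}")
```

Every item is held out exactly once across the folds, which is why the averaged estimate is less sensitive to a lucky or unlucky single split.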
Notes:
Threshold: Cell types with fewer than threshold cells are excluded
Unspecified: Cells with max score < spec_value are marked “unspecified”
Parallelization: Uses thread pools for shared-memory parallel processing
Memory: Adjust inner_chunk_size and n_scoring_jobs for large datasets
ANOVA: Reduces features but may remove important genes
Cross-validation: More robust but slower than single train-test split
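The threshold rule in the notes above can be previewed before building a basis by counting cells per label. This is a minimal sketch with toy labels; `types_passing_threshold` is a hypothetical helper, and in practice the labels would come from the chosen column of `adata.obs`.

```python
from collections import Counter

def types_passing_threshold(labels, threshold):
    """Return cell types with at least `threshold` cells, sorted by name."""
    counts = Counter(labels)
    # Types with fewer than `threshold` cells are excluded
    return sorted(t for t, n in counts.items() if n >= threshold)

labels = ["T cell"] * 5 + ["B cell"] * 3 + ["Mast cell"] * 1
kept = types_passing_threshold(labels, threshold=3)
print(kept)  # ['B cell', 'T cell']
```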
Performance Tips:
For small datasets (<10k cells):
results = top.create_basis(
adata, 'cell_type', 100,
n_jobs=-1, # Use all cores
n_scoring_jobs=8
)
For large datasets (>100k cells):
results = top.create_basis(
adata, 'cell_type', 100,
n_jobs=4, # Conservative
n_scoring_jobs=2,
inner_chunk_size=500,
outer_chunks=100,
)
Troubleshooting:
Check per-cell-type metrics to find problematic types
Examine confusion matrix for similar types
Consider merging similar cell types
Increase threshold to require more cells per type
Try ANOVA feature selection as a last resort
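One way to act on the confusion-matrix advice is to rank the off-diagonal entries, since large ones point at pairs of types that may be worth merging. The sketch assumes rows are true labels and columns are predicted labels (the common convention; not confirmed for sctop), uses a toy matrix, and `most_confused_pairs` is a hypothetical helper.

```python
def most_confused_pairs(matrix, labels, top_n=3):
    """Rank off-diagonal confusion-matrix entries by count.

    Assumes rows are true labels and columns are predicted labels.
    """
    pairs = []
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and count > 0:
                pairs.append((count, labels[i], labels[j]))
    pairs.sort(reverse=True)   # largest misclassification counts first
    return pairs[:top_n]

# Toy confusion matrix (made-up counts)
labels = ["T cell", "NK cell", "B cell"]
matrix = [
    [50, 12, 1],
    [9, 40, 0],
    [2, 0, 60],
]
confused = most_confused_pairs(matrix, labels)
for count, true_type, predicted_type in confused:
    print(f"{true_type} -> {predicted_type}: {count}")
```

With the real output, `matrix` and `labels` would correspond to the `confusion_matrix` and `confusion_matrix_labels` entries of the returned dictionary.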
analyze_sample_contributions
- sctop.analyze_sample_contributions(sample_data_dict: Dict[str, DataFrame | ndarray], basis: DataFrame, cell_types: List[str] | None = None, n_top_genes: int = 20, process_data: bool = True) -> Dict[str, Dict]
Analyze gene contributions for multiple samples/clusters.
- Parameters:
sample_data_dict (dict) – Dictionary mapping sample_name -> expression_data
basis (pd.DataFrame) – Basis matrix
cell_types (list, optional) – Cell types to analyze. If None, uses all
n_top_genes (int) – Number of top genes to identify per sample
process_data (bool) – Whether to process the data
- Returns:
results – Nested dictionary with structure:
{cell_type: {
'contributions': {sample_name: contribution_matrix},
'top_genes': {sample_name: [gene1, gene2, ...]},
'expressions': {sample_name: expression_matrix}
}}
- Return type:
dict
Purpose: Analyze which genes contribute most to cell type scores across multiple samples.
Key Features:
Computes gene-level contributions for each cell type
Identifies top contributing genes per sample
Handles multiple samples/clusters simultaneously
Optional data processing
Parameters:
sample_data_dict (dict): Maps sample_name → expression_data (DataFrame or array)
basis (pd.DataFrame): Cell type basis
cell_types (list, optional): Cell types to analyze (default: all in basis)
n_top_genes (int, default=20): Number of top genes to identify
process_data (bool, default=True): Whether to process raw counts
Returns:
Dictionary with structure:
{
cell_type: {
'contributions': {sample_name: contribution_matrix},
'top_genes': {sample_name: [gene1, gene2, ...]},
'expressions': {sample_name: expression_matrix}
}
}
Example:
# Prepare sample dictionary
sample_dict = {
'cluster_1': cluster1_counts,
'cluster_2': cluster2_counts,
'cluster_3': cluster3_counts
}
# Analyze contributions
results = top.analyze_sample_contributions(
sample_data_dict=sample_dict,
basis=basis,
cell_types=['T cell', 'B cell', 'Macrophage'],
n_top_genes=20
)
# Get top genes for T cells in cluster 1
t_cell_genes = results['T cell']['top_genes']['cluster_1']
print("Top T cell genes in cluster 1:", t_cell_genes[:10])
# Get contribution matrix
t_cell_contrib = results['T cell']['contributions']['cluster_1']
Notes:
Contribution = gene expression × predictivity
Higher contribution = gene more responsible for that cell type score
Useful for identifying markers and understanding assignments
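The relation "contribution = gene expression × predictivity" can be illustrated for a single sample and a single cell type as follows. The gene values are made up, `top_contributing_genes` is a hypothetical helper, and "predictivity" is treated here simply as a per-gene weight; how sctop derives it is not covered by this sketch.

```python
def top_contributing_genes(expression, predictivity, n_top_genes=20):
    """Rank genes by expression × predictivity for one sample and cell type.

    Illustrative only; not sctop's implementation.
    """
    # Only genes present in both inputs can contribute
    shared = expression.keys() & predictivity.keys()
    contributions = {g: expression[g] * predictivity[g] for g in shared}
    ranked = sorted(contributions, key=contributions.get, reverse=True)
    return ranked[:n_top_genes], contributions

# Made-up values for three genes
expression = {"Cd3e": 4.0, "Cd19": 0.5, "Lyz2": 1.0}
predictivity = {"Cd3e": 0.5, "Cd19": -0.2, "Lyz2": 0.1}
top_genes, contributions = top_contributing_genes(expression, predictivity, n_top_genes=2)
print(top_genes)  # ['Cd3e', 'Lyz2']
```

A gene with a negative weight, such as the toy `Cd19` value here, lowers the score for that cell type and therefore ranks last.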