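Quick Start

This guide walks through the core scTOP workflow: loading a pre-made reference basis, scoring samples against it, and building and validating a custom basis.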

Loading Pre-made Reference Bases

scTOP provides pre-computed reference bases from published atlases.

List Available Bases

import sctop as top

# See what's available
available = top.list_available_bases()
print(available)
# ['MCKO legacy']

Load a Basis

# Load the Mouse Cell Atlas Kotton lab basis
basis, metadata = top.load_basis(
    basis_key="MCKO legacy",
    cache_dir="./bases_cache"  # Optional: cache downloaded files
)

print(f"Loaded basis with {basis.shape[0]} genes and {basis.shape[1]} cell types")
print(f"Cell types: {list(basis.columns)}")

The basis is a pandas DataFrame with:

Index: Gene names
Columns: Cell type names
Values: Processed expression values representing each cell type
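Because the genes live in the index, a quick sanity check before scoring is to see how many basis genes your sample actually contains. The sketch below uses a made-up toy basis and a hypothetical gene list (none of these names come from scTOP) and relies only on pandas:

```python
import pandas as pd

# Toy basis with the layout described above: genes as the index, cell types as columns.
# All values and gene names here are illustrative, not taken from a real atlas.
toy_basis = pd.DataFrame(
    {"T cell": [1.2, -0.3, 0.1], "B cell": [-0.5, 1.4, 0.0]},
    index=["Cd3e", "Cd19", "Actb"],
)

# Hypothetical sample whose gene list only partly overlaps the basis.
sample_genes = pd.Index(["Cd3e", "Actb", "Gapdh"])

# Genes present in both; only these can be compared between sample and basis.
shared_genes = toy_basis.index.intersection(sample_genes)
print(f"{len(shared_genes)} of {toy_basis.shape[0]} basis genes found in the sample")
print(sorted(shared_genes))
```

A low overlap usually points to mismatched gene identifiers (for example symbols versus Ensembl IDs) rather than a biological difference.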

Scoring Samples Against a Basis

Once you have a basis, you can score your samples to identify cell types.

Process Your Data

Start with raw count data (genes × samples):

import pandas as pd

# Your raw count matrix
# Rows = genes, Columns = samples/cells
raw_counts = pd.DataFrame(...)

# Process: normalize → log → rank → z-score
processed_sample = top.process(raw_counts)
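To make the four steps named in the comment above concrete, here is a minimal sketch of a normalize → log → rank → z-score pipeline written with pandas and the standard library. It is an illustration under assumptions (per-sample sum normalization, log2 with a pseudocount of 1, average ranks for ties); the details of what top.process does internally may differ, and process_sketch is a hypothetical name.

```python
from statistics import NormalDist

import numpy as np
import pandas as pd


def process_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative normalize -> log -> rank -> z-score pipeline (genes x samples)."""
    # 1. Normalize each sample (column) to sum to 1, removing sequencing-depth effects.
    normalized = raw / raw.sum(axis=0)
    # 2. Log-transform; the +1 keeps zero counts finite.
    logged = np.log2(normalized + 1)
    # 3. Rank genes within each sample and convert ranks to percentiles in (0, 1).
    percentiles = logged.rank(axis=0, method="average") / (len(logged) + 1)
    # 4. Map percentiles to z-scores with the inverse normal CDF.
    inv_cdf = NormalDist().inv_cdf
    return percentiles.apply(lambda col: col.map(inv_cdf))


# Made-up counts for four genes in two cells.
toy_counts = pd.DataFrame(
    {"cell_1": [10, 0, 5, 85], "cell_2": [0, 20, 20, 60]},
    index=["GeneA", "GeneB", "GeneC", "GeneD"],
)
toy_processed = process_sketch(toy_counts)
print(toy_processed.round(2))
```

Rank-based z-scoring makes each sample's values follow the same distribution, so samples with very different depths become directly comparable.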

Score Against Basis

# Project onto cell type basis
projections = top.score(basis, processed_sample)

# projections is a DataFrame: cell_types × samples
# Each column shows how well that sample matches each cell type
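For intuition about what "projecting onto a basis" means, the sketch below solves a least-squares problem with NumPy: it finds the coefficients that best reconstruct each sample from the basis columns. Treat this as one plausible reading of the projection step, not as the implementation of top.score; project_sketch and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def project_sketch(basis: pd.DataFrame, samples: pd.DataFrame) -> pd.DataFrame:
    """Least-squares coefficients of each sample in terms of the basis columns."""
    # Restrict both matrices to the genes they share, in the same order.
    shared = basis.index.intersection(samples.index)
    B = basis.loc[shared].to_numpy(dtype=float)    # genes x cell types
    S = samples.loc[shared].to_numpy(dtype=float)  # genes x samples
    # Solve min ||B @ coeffs - S||; this also handles non-orthogonal cell types.
    coeffs, *_ = np.linalg.lstsq(B, S, rcond=None)
    return pd.DataFrame(coeffs, index=basis.columns, columns=samples.columns)


genes = ["GeneA", "GeneB", "GeneC", "GeneD"]
toy_basis = pd.DataFrame(
    {"T cell": [2.0, 0.0, 1.0, -1.0], "B cell": [0.0, 2.0, -1.0, 1.0]}, index=genes
)
toy_samples = pd.DataFrame(
    {
        "sample_1": toy_basis["T cell"],  # identical to the T cell profile
        "sample_2": 0.5 * toy_basis["T cell"] + 0.5 * toy_basis["B cell"],  # even mix
    }
)
toy_scores = project_sketch(toy_basis, toy_samples)
print(toy_scores.round(3))
```

In this toy case a sample identical to one reference profile scores 1 for that type and 0 for the other, while an even mixture scores 0.5 for both.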

Interpret Results

# For a single sample
sample_scores = projections['sample_1'].sort_values(ascending=False)

print("Top 5 cell type matches:")
print(sample_scores.head(5))

# Visualize top matches
top.plot_highest(sample_scores, n=10)
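When you have many samples, it is often more practical to summarize the best match for every column at once. The following sketch uses only pandas and NumPy on a made-up projections table with the same cell_types × samples layout; the margin between the best and second-best score is a simple, informal indicator of how ambiguous an assignment is.

```python
import numpy as np
import pandas as pd

# Made-up scores in the same layout as `projections`: cell types x samples.
toy_projections = pd.DataFrame(
    {
        "sample_1": [0.62, 0.10, 0.05],
        "sample_2": [0.08, 0.12, 0.71],
        "sample_3": [0.15, 0.18, 0.12],
    },
    index=["T cell", "B cell", "Macrophage"],
)

# Best-matching cell type and its score for every sample.
summary = pd.DataFrame(
    {
        "best_match": toy_projections.idxmax(axis=0),
        "best_score": toy_projections.max(axis=0),
    }
)

# Margin between the best and second-best score; small margins flag ambiguous calls.
sorted_scores = np.sort(toy_projections.to_numpy(), axis=0)
summary["margin"] = sorted_scores[-1] - sorted_scores[-2]
print(summary)
```

Here sample_3 has both a low best score and a small margin, which suggests it does not clearly match any type in the basis.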

Creating a Custom Basis

You can create your own reference basis from annotated scRNA-seq data.

Prepare Your Data

Your data should be in AnnData format with cell type annotations:

import anndata as ad

# Load your annotated data
adata = ad.read_h5ad("your_atlas.h5ad")

# Check that you have cell type annotations
print(adata.obs.columns)  # Should include your cell type column
print(adata.obs['cell_type'].value_counts())

Create Basis with Validation

results = top.create_basis(
    adata=adata,
    cell_type_column='cell_type',  # Column name in adata.obs
    threshold=100,                 # Minimum cells per type
    test_size=0.2,                 # 20% held out for testing
    random_state=42,
    do_anova=False,                # Optional: feature selection
    cv_folds=None,                 # Or set an integer to use k-fold cross-validation
    plot_results=True              # Plot performance summary
)

# Access results
basis = results['basis']
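To see what the threshold and test_size arguments control, the sketch below applies a minimum-cells filter and then holds out a fraction of each remaining cell type, using pandas and NumPy on made-up labels. Whether create_basis splits per cell type in exactly this way is an assumption; the code only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Made-up annotations: one label per cell.
toy_labels = pd.Series(["T cell"] * 5 + ["B cell"] * 4 + ["Rare"] * 1, name="cell_type")

# Drop cell types with fewer cells than the threshold.
threshold = 3
counts = toy_labels.value_counts()
kept = toy_labels[toy_labels.isin(counts[counts >= threshold].index)]

# Hold out a fraction of each remaining type (at least one cell) for testing.
test_size = 0.2
rng = np.random.default_rng(42)
test_idx = []
for cell_type, group in kept.groupby(kept):
    n_test = max(1, round(test_size * len(group)))
    test_idx.extend(rng.choice(group.index.to_numpy(), size=n_test, replace=False))

test_mask = kept.index.isin(test_idx)
train, test = kept[~test_mask], kept[test_mask]
print(f"train: {len(train)} cells, test: {len(test)} cells")
```

Splitting within each type keeps every cell type represented in both the training and the held-out set, which matters for the per-type accuracy reported later.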

Review Performance

# Per cell type accuracy
per_type = results['per_cell_type']
print("\nWorst performing cell types:")
print(per_type.nsmallest(10, 'accuracy')[['accuracy', 'total']])

# Confusion matrix
cm = results['confusion_matrix']
labels = results['confusion_matrix_labels']
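A confusion matrix and its labels are enough to recompute per-type accuracy and to find the most common misclassification. The sketch below assumes rows are true labels and columns are predicted labels (the scikit-learn convention); check which orientation your results use before relying on it. The matrix values are made up.

```python
import numpy as np
import pandas as pd

# Made-up confusion matrix; rows = true type, columns = predicted type (assumed).
toy_cm = np.array(
    [
        [45, 3, 2],
        [4, 40, 6],
        [1, 9, 20],
    ]
)
toy_labels = ["T cell", "B cell", "Macrophage"]

# Per-type accuracy: correct calls on the diagonal divided by cells of that true type.
total = toy_cm.sum(axis=1)
per_type_sketch = pd.DataFrame(
    {"accuracy": np.diag(toy_cm) / total, "total": total}, index=toy_labels
)
print(per_type_sketch.round(3))

# Most frequent confusion: the largest off-diagonal entry.
off_diag = toy_cm.copy()
np.fill_diagonal(off_diag, 0)
true_i, pred_j = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(f"Most common error: {toy_labels[true_i]} predicted as {toy_labels[pred_j]}")
```

Pairs of types that are frequently confused are often candidates for merging into a single, coarser label.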

Save Your Basis

# Save in AnnData's HDF5-based .h5ad format for reuse
import anndata as ad
import pandas as pd

# Create AnnData object from basis
adata_basis = ad.AnnData(
    X=basis.T.values,
    obs=pd.DataFrame(index=basis.columns),
    var=pd.DataFrame(index=basis.index)
)

adata_basis.write_h5ad("my_custom_basis.h5ad")

Advanced: Cross-Validation

For more robust validation, use k-fold cross-validation:

results = top.create_basis(
    adata=adata,
    cell_type_column='cell_type',
    threshold=100,
    cv_folds=5,  # 5-fold cross-validation
    n_jobs=-1,   # Use all CPU cores
    random_state=42
)

# Cross-validation results
cv_results = results['cv_results']
avg_metrics = results['cv_avg_metrics']

print(f"Average accuracy: {avg_metrics['accuracy_mean']:.3f} ± {avg_metrics['accuracy_std']:.3f}")
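For readers unfamiliar with k-fold cross-validation, the sketch below shows the two ingredients in plain NumPy: splitting cell indices into folds so that every cell is tested exactly once, and aggregating per-fold accuracies into a mean and standard deviation. The helper kfold_indices, the fold accuracies, and the use of the population standard deviation are all illustrative assumptions, not a description of how create_basis computes cv_avg_metrics.

```python
import numpy as np


def kfold_indices(n_samples: int, n_folds: int, seed: int):
    """Yield (train_idx, test_idx) pairs so every sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx


splits = list(kfold_indices(n_samples=23, n_folds=5, seed=42))
print([len(test_idx) for _, test_idx in splits])

# Hypothetical per-fold accuracies, made up to show the aggregation step.
fold_accuracies = np.array([0.91, 0.89, 0.93, 0.90, 0.92])
cv_summary = {
    "accuracy_mean": float(fold_accuracies.mean()),
    "accuracy_std": float(fold_accuracies.std()),  # population std (ddof=0)
}
print(f"Average accuracy: {cv_summary['accuracy_mean']:.3f} ± {cv_summary['accuracy_std']:.3f}")
```

A large standard deviation across folds suggests that performance depends heavily on which cells land in the training set, for example because some types have few cells.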

Advanced: Feature Selection with ANOVA

For large datasets, you can restrict the basis to the genes that vary most between cell types:

results = top.create_basis(
    adata=adata,
    cell_type_column='cell_type',
    threshold=100,
    do_anova=True,
    n_features=5000,  # Select top 5000 genes
    # OR use percentile:
    # anova_percentile=25  # Keep top 25% of genes
)

selected_genes = results['selected_genes']
print(f"Selected {len(selected_genes)} informative genes")
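The idea behind ANOVA-based selection is to keep genes whose expression differs more between cell types than within them. The sketch below computes a one-way ANOVA F-statistic per gene with pandas and keeps the top-scoring genes. It is a generic illustration on made-up data; anova_f_scores is a hypothetical helper, and the exact statistic and preprocessing used when do_anova=True are not specified here. In practice, scipy.stats.f_oneway or scikit-learn's f_classif provide tested implementations of the same statistic.

```python
import pandas as pd


def anova_f_scores(expr: pd.DataFrame, labels: pd.Series) -> pd.Series:
    """One-way ANOVA F-statistic per gene; expr is genes x cells."""
    groups = [expr.loc[:, (labels == g).to_numpy()] for g in labels.unique()]
    grand_mean = expr.mean(axis=1)
    n_total, n_groups = expr.shape[1], len(groups)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(g.shape[1] * (g.mean(axis=1) - grand_mean) ** 2 for g in groups)
    # Variation of individual cells around their own group mean.
    ss_within = sum((g.sub(g.mean(axis=1), axis=0) ** 2).sum(axis=1) for g in groups)
    return (ss_between / (n_groups - 1)) / (ss_within / (n_total - n_groups))


cells = [f"cell_{i}" for i in range(6)]
toy_expr = pd.DataFrame(
    [
        [5.0, 5.2, 4.8, 1.0, 1.1, 0.9],  # differs strongly between the two types
        [2.0, 2.1, 1.9, 2.1, 1.9, 2.0],  # essentially the same in both types
        [1.0, 3.0, 2.0, 2.5, 1.5, 2.6],  # noisy, small difference
    ],
    index=["Marker", "Flat", "Noisy"],
    columns=cells,
)
toy_types = pd.Series(["A", "A", "A", "B", "B", "B"], index=cells)

f_scores = anova_f_scores(toy_expr, toy_types)
top_two = f_scores.nlargest(2).index
print(f_scores.round(3))
print(f"Selected: {list(top_two)}")
```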

Gene Contribution Analysis

Understand which genes drive cell type assignments:

# Analyze contributions
contributions = top.compute_gene_contributions(
    data=raw_counts,
    basis=basis,
    cell_types=['T cell', 'B cell', 'Macrophage']
)

# Find top genes for each cell type
for cell_type, contrib in contributions.items():
    top_genes = top.find_top_contributing_genes(contrib, n_genes=20)
    print(f"\n{cell_type} - Top 20 genes:")
    print(top_genes)
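One way to think about a gene's contribution is as its additive share of a linear score. The sketch below assumes the score is a least-squares projection, so each cell type's score is a weighted sum over genes and the individual terms of that sum can be read off directly. This is an assumption made for illustration and may not match how compute_gene_contributions defines a contribution; all values are made up.

```python
import numpy as np
import pandas as pd

genes = ["GeneA", "GeneB", "GeneC", "GeneD"]
toy_basis = pd.DataFrame(
    {"T cell": [2.0, 0.0, 1.0, -1.0], "B cell": [0.0, 2.0, -1.0, 1.0]}, index=genes
)
toy_sample = pd.Series([1.8, 0.3, 0.9, -0.7], index=genes)

# The pseudo-inverse maps a processed sample to least-squares scores (cell types x genes).
weights = np.linalg.pinv(toy_basis.to_numpy())

# Each gene's term in the weighted sum: weight for that gene times its sample value.
toy_contrib = pd.DataFrame(
    weights * toy_sample.to_numpy(), index=toy_basis.columns, columns=genes
)

# Summing the terms over genes recovers the score for each cell type.
toy_scores = toy_contrib.sum(axis=1)
print(toy_scores.round(3))
print(toy_contrib.loc["T cell"].sort_values(ascending=False).head(2).round(3))
```

Under this view, a gene with a large positive term pushes the sample toward that cell type, and a large negative term pushes it away.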

Visualization

Basic Plots

import matplotlib.pyplot as plt

# Plot top cell type matches
top.plot_highest(projections['sample_1'], n=15)
plt.tight_layout()
plt.show()

# Plot expression distribution of top genes
top.plot_expression_distribution(processed_sample, n=10)
plt.show()

2D Projections

# Compare two cell types
top.plot_two(
    projections,
    celltype1='T cell',
    celltype2='B cell',
    alpha=0.5
)
plt.xlabel('T cell score')
plt.ylabel('B cell score')
plt.show()

Next Steps

Read the API Reference for detailed function documentation
See Tutorials for more complex workflows
Understand the Theory behind the method