API Reference
This page documents all public functions in the scTOP package.
Core Functions
The core functions handle data processing and scoring.
- sctop.processing.process(df_in: DataFrame | ndarray | spmatrix, average: bool = False, chunk_size: int | None = None) -> DataFrame
Process scRNA-seq data with optional chunking.
Basis Management
Functions for loading and creating reference bases.
- sctop.sctop.analyze_sample_contributions(sample_data_dict: Dict[str, DataFrame | ndarray], basis: DataFrame, cell_types: List[str] | None = None, n_top_genes: int = 20, process_data: bool = True) -> Dict[str, Dict]
Analyze gene contributions for multiple samples/clusters.
- Parameters:
sample_data_dict (dict) – Dictionary mapping sample_name -> expression_data
basis (pd.DataFrame) – Basis matrix
cell_types (list, optional) – Cell types to analyze. If None, uses all
n_top_genes (int) – Number of top genes to identify per sample
process_data (bool) – Whether to process the data
- Returns:
results – Nested dictionary with structure:
{cell_type: {
  'contributions': {sample_name: contribution_matrix},
  'top_genes': {sample_name: [gene1, gene2, …]},
  'expressions': {sample_name: expression_matrix}
}}
- Return type:
dict
- sctop.sctop.create_basis(adata: AnnData, cell_type_column: str, threshold: int, test_size: float = 0.2, random_state: int = 42, n_jobs: int = -1, do_anova: bool = False, n_features: int = 20000, anova_percentile: float | None = None, spec_value: float = 0.1, outer_chunks: int = 10, inner_chunk_size: int = 1000, n_scoring_jobs: int = 4, cv_folds: int | None = None, plot_results: bool = True) -> Dict
Create basis and evaluate with optional ANOVA selection and cross-validation.
- Parameters:
adata (ad.AnnData) – Annotated data object
cell_type_column (str) – Column name for cell types in adata.obs
threshold (int) – Minimum number of cells per cell type
test_size (float) – Fraction of data to use for testing (if cv_folds is None)
random_state (int) – Random seed
n_jobs (int) – Number of parallel jobs for basis creation
do_anova (bool) – Whether to perform ANOVA feature selection
n_features (int) – Number of features to select with ANOVA
anova_percentile (float, optional) – Percentile of features to keep (overrides n_features)
spec_value (float) – Threshold for unspecified predictions
outer_chunks (int) – Number of chunks for parallel scoring
inner_chunk_size (int) – Chunk size for internal processing
n_scoring_jobs (int) – Number of parallel jobs for scoring
cv_folds (int, optional) – Number of cross-validation folds. If None, uses a single train-test split
- Returns:
results – Dictionary containing:
  'basis': final basis
  'selected_genes': selected genes (if ANOVA)
  'metrics': performance metrics
  'cv_results': cross-validation results (if cv_folds is not None)
  'confusion_matrix': confusion matrix
  'per_cell_type': per cell type accuracy
- Return type:
dict
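The spec_value parameter is described only as a "threshold for unspecified predictions". One plausible reading, sketched below, is that a cell receives the label of its highest-scoring cell type unless that score falls below spec_value, in which case it is marked as unspecified. The function name assign_labels, the "Unspecified" label, and the comparison rule are assumptions made for illustration, not the package's confirmed behavior.

```python
import numpy as np

def assign_labels(scores, cell_types, spec_value=0.1, unspecified="Unspecified"):
    """Hypothetical labeling rule; scores has shape (cell_types x cells)."""
    best = scores.argmax(axis=0)          # index of the top-scoring cell type per cell
    best_score = scores.max(axis=0)       # value of that top score
    labels = np.array(cell_types, dtype=object)[best]
    # Assumption: a top score below spec_value means "no confident call".
    labels[best_score < spec_value] = unspecified
    return labels.tolist()

# Toy scores for two cell types (rows) and three cells (columns).
toy_scores = np.array([[0.80, 0.05, 0.20],
                       [0.10, 0.02, 0.60]])
toy_labels = assign_labels(toy_scores, ["T cell", "B cell"], spec_value=0.1)
```

Under this reading, raising spec_value trades coverage for confidence: more cells end up unspecified, and the remaining calls are backed by higher scores.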
- sctop.sctop.list_available_bases() -> List[str]
List available premade bases that can be loaded.
- Returns:
basis_keys – List of available basis keys
- Return type:
list
- sctop.sctop.load_basis(basis_key: str, cache_dir: str | Path | None = None, force_download: bool = False) -> Tuple[DataFrame, DataFrame]
Load a basis from an h5ad file hosted online.
- Parameters:
basis_key (str) – Name/key of the basis to load (e.g., "MCKO legacy")
cache_dir (str, optional) – Directory to cache downloaded files. If None, uses the system temp directory
force_download (bool) – If True, re-downloads even if a cached file exists
- Returns:
basis (pd.DataFrame) – Basis matrix (genes x cell types)
metadata (pd.DataFrame) – Metadata for the basis (cell types x attributes)
- Example:
>>> basis, metadata = load_basis(
...     basis_key="MCKO legacy"
... )
Analysis Functions
Utilities for gene contribution analysis and metrics.
- sctop.utils.calculate_metrics(true_labels: List, predicted_labels: List, total_cells: int, accuracies: Dict) -> Dict
Calculate comprehensive metrics.
- sctop.utils.calculate_per_cell_type_accuracy(cell_accuracies: Dict) -> DataFrame
Calculate per cell type accuracy.
- sctop.utils.compute_gene_contributions(data: DataFrame | ndarray, basis: DataFrame, predictivity: DataFrame | None = None, cell_types: List[str] | None = None, process_data: bool = True) -> Dict[str, DataFrame]
Compute gene-level contributions to cell type scores.
For each cell type, computes: contribution = expression * predictivity
- Parameters:
data (DataFrame or array) – Expression data (genes x samples)
basis (pd.DataFrame) – Basis matrix
predictivity (pd.DataFrame, optional) – Precomputed predictivity matrix. If None, computed from basis
cell_types (list, optional) – Cell types to compute contributions for. If None, uses all
process_data (bool) – Whether to process the data first (default: True)
- Returns:
contributions – Dictionary mapping cell_type -> contribution_matrix (genes x samples)
- Return type:
dict
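The formula contribution = expression * predictivity can be read as an element-wise product: for one cell type, each gene's expression in each sample is weighted by that gene's predictivity entry. The sketch below illustrates this reading on toy NumPy arrays. The helper gene_contributions and the toy values are hypothetical, and the observation that the contributions sum to the projection score assumes the score is the matrix product of predictivity and expression.

```python
import numpy as np

def gene_contributions(expression, predictivity_row):
    """Weight each gene's expression by its predictivity for one cell type.

    expression: (genes x samples); predictivity_row: (genes,).
    Returns a (genes x samples) contribution matrix.
    """
    return expression * predictivity_row[:, np.newaxis]

# Toy data: 3 genes, 2 samples, one cell type.
expression = np.array([[1.0, 2.0],
                       [0.5, 0.0],
                       [3.0, 1.0]])
predictivity_row = np.array([0.2, -0.1, 0.4])

contrib = gene_contributions(expression, predictivity_row)
# Summing over genes recovers the per-sample score for this cell type.
scores = contrib.sum(axis=0)
```

A negative predictivity entry, as for the second gene here, means that higher expression of that gene lowers the score for this cell type.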
- sctop.utils.compute_predictivity(basis: DataFrame) -> DataFrame
Compute predictivity matrix from basis.
The predictivity shows how each gene contributes to each cell type’s score. Formula: predictivity = inv(B^T @ B) @ B^T
- Parameters:
basis (pd.DataFrame) – Basis matrix (genes x cell_types)
- Returns:
predictivity – Predictivity matrix (cell_types x genes), showing how each gene contributes to each cell type score
- Return type:
pd.DataFrame
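The formula predictivity = inv(B^T @ B) @ B^T is the left pseudo-inverse of the basis, so multiplying it by the basis gives the identity. A minimal NumPy sketch on a random toy basis follows. The function name predictivity_from_basis is hypothetical, and solving the linear system instead of forming the inverse explicitly is a numerical choice made here, not necessarily what the package does.

```python
import numpy as np

def predictivity_from_basis(basis):
    """Return inv(B^T @ B) @ B^T for a (genes x cell_types) basis B."""
    gram = basis.T @ basis                 # (cell_types x cell_types)
    # Solve gram @ P = B^T instead of computing inv(gram) explicitly.
    return np.linalg.solve(gram, basis.T)  # (cell_types x genes)

rng = np.random.default_rng(0)
B = rng.normal(size=(50, 3))               # toy basis: 50 genes, 3 cell types
P = predictivity_from_basis(B)

# Each basis column scores 1 on its own cell type and 0 on the others.
identity_check = P @ B
```

The formula requires the basis columns to be linearly independent. Two near-identical cell types make B^T @ B ill-conditioned and the resulting scores unstable.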
- sctop.utils.create_basis_optimized(adata: AnnData, cell_type_column: str, threshold: int, test_size: float, random_state: int, n_jobs: int = -1) -> Tuple[DataFrame, ndarray, ndarray]
Original function - kept for backwards compatibility.
- sctop.utils.find_top_contributing_genes(contributions: DataFrame, n_genes: int = 20, aggregate: str = 'mean') -> Series
Find top contributing genes from contribution matrix.
- Parameters:
contributions (pd.DataFrame) – Gene contributions (genes x samples)
n_genes (int) – Number of top genes to return
aggregate (str) – How to aggregate across samples: ‘mean’, ‘median’, ‘max’
- Returns:
top_genes – Top contributing genes with their aggregated scores
- Return type:
pd.Series
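The aggregation step described above can be sketched with pandas: collapse the contribution matrix across samples with the chosen statistic, then keep the genes with the largest values. The helper top_genes and the toy gene names are hypothetical. Ranking by signed value rather than absolute value is also an assumption, since the documentation does not say which is used.

```python
import pandas as pd

def top_genes(contributions, n_genes=20, aggregate="mean"):
    """Aggregate a (genes x samples) frame across samples and keep the top genes."""
    if aggregate not in {"mean", "median", "max"}:
        raise ValueError(f"unsupported aggregate: {aggregate!r}")
    per_gene = contributions.agg(aggregate, axis=1)  # one value per gene
    return per_gene.nlargest(n_genes)                # sorted, largest first

# Toy contribution matrix: 3 genes, 2 samples.
toy = pd.DataFrame(
    {"s1": [0.1, 0.9, 0.4], "s2": [0.3, 0.7, 0.2]},
    index=["GeneA", "GeneB", "GeneC"],
)
top = top_genes(toy, n_genes=2, aggregate="mean")
```

The 'max' option favors genes that contribute strongly in at least one sample, whereas 'median' is less sensitive to a single outlier sample.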
- sctop.utils.perform_anova_selection(basis: DataFrame, adata: AnnData, training_IDs: ndarray, cell_type_column: str, n_features: int = 2000, percentile: float | None = None, standardize: bool = True) -> Tuple[DataFrame, ndarray]
Perform ANOVA feature selection on the basis and optionally standardize.
- Parameters:
basis (pd.DataFrame) – The basis matrix (genes x cell types)
adata (ad.AnnData) – The AnnData object
training_IDs (np.ndarray) – Training cell IDs
cell_type_column (str) – Column name for cell types
n_features (int) – Number of top features to select (if percentile is None)
percentile (float, optional) – Percentile of features to keep (overrides n_features)
standardize (bool) – Whether to standardize the basis after selection (default: True)
- Returns:
basis_selected (pd.DataFrame) – Basis with selected features only (and standardized if requested)
selected_genes (np.ndarray) – Array of selected gene names
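ANOVA feature selection ranks genes by how strongly their expression differs between cell types relative to the variation within each type. The sketch below computes the one-way ANOVA F statistic per gene with plain NumPy and keeps the top n_features genes. The function names, the (cells x genes) layout, and the toy data are assumptions for illustration. The package may rely on a library routine and may handle ties or constant genes differently.

```python
import numpy as np

def anova_f_scores(expression, labels):
    """One-way ANOVA F statistic per gene; expression is (cells x genes)."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n_cells, n_genes = expression.shape
    grand_mean = expression.mean(axis=0)
    ss_between = np.zeros(n_genes)
    ss_within = np.zeros(n_genes)
    for group in groups:
        block = expression[labels == group]
        group_mean = block.mean(axis=0)
        ss_between += block.shape[0] * (group_mean - grand_mean) ** 2
        ss_within += ((block - group_mean) ** 2).sum(axis=0)
    ms_between = ss_between / (groups.size - 1)
    ms_within = ss_within / (n_cells - groups.size)
    return ms_between / ms_within

def select_top_genes(expression, labels, gene_names, n_features):
    """Return the names of the n_features genes with the largest F statistic."""
    order = np.argsort(anova_f_scores(expression, labels))[::-1]
    return np.asarray(gene_names)[order[:n_features]]

# Toy data: 60 cells, 5 genes, 3 cell types; only gene "g2" differs by type.
rng = np.random.default_rng(1)
cell_labels = np.repeat(["a", "b", "c"], 20)
toy_expression = rng.normal(size=(60, 5))
toy_expression[:, 2] += np.repeat([0.0, 5.0, 10.0], 20)
selected = select_top_genes(toy_expression, cell_labels,
                            ["g0", "g1", "g2", "g3", "g4"], n_features=1)
```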
- sctop.utils.plot_performance_summary(true_labels: List, predicted_labels: List, f1_df: DataFrame | None = None, figsize_base: int = 10)
Generates and displays a Confusion Matrix and a Per-Cell-Type F1 Score plot.
- sctop.utils.run_scoring_parallel(adata: AnnData, basis: DataFrame, test_IDs: ndarray, cell_type_column: str, spec_value: float, outer_chunks: int, inner_chunk_size: int, n_jobs: int = 4) -> Tuple[dict, list, list, dict]
Score test cells in parallel. Uses ThreadPoolExecutor for shared-memory parallel processing.
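The chunked, thread-based pattern named above can be sketched as follows: split the cells into outer_chunks groups, score each group in a worker thread, and reassemble the results in order. The helpers score_chunk and score_in_parallel are hypothetical, and scoring is assumed here to be a matrix product of a predictivity matrix with the expression matrix. The real function also handles labels, inner chunking, and accuracy bookkeeping that this sketch omits.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def score_chunk(predictivity, expression_chunk):
    """Score one block of cells: (cell_types x genes) @ (genes x cells)."""
    return predictivity @ expression_chunk

def score_in_parallel(predictivity, expression, outer_chunks=10, n_jobs=4):
    """Score all cells by distributing column chunks over worker threads."""
    column_chunks = np.array_split(np.arange(expression.shape[1]), outer_chunks)
    column_chunks = [cols for cols in column_chunks if cols.size > 0]
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        # pool.map returns results in submission order, so columns stay aligned.
        parts = list(pool.map(
            lambda cols: score_chunk(predictivity, expression[:, cols]),
            column_chunks,
        ))
    return np.hstack(parts)

rng = np.random.default_rng(2)
toy_predictivity = rng.normal(size=(3, 50))   # 3 cell types, 50 genes
toy_cells = rng.normal(size=(50, 23))         # 50 genes, 23 cells
parallel_scores = score_in_parallel(toy_predictivity, toy_cells,
                                    outer_chunks=4, n_jobs=2)
```

Threads share the expression matrix without copying it, which is the shared-memory benefit the description refers to. NumPy's matrix routines generally release the global interpreter lock, so the chunks can run concurrently.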
Visualization
Plotting functions for results visualization.
- sctop.visualization.plot_all_contributions(results: Dict[str, Dict], sample_names: List[str], output_dir: str | None = None, highlight_genes: Dict[str, List[str]] | None = None, dpi: int = 150, **plot_kwargs) -> None
Create and save contribution plots for all cell types and samples.
- Parameters:
results (dict) – Results from analyze_sample_contributions
sample_names (list) – List of sample names to plot
output_dir (str, optional) – Base directory for saving plots. If None, uses current directory
highlight_genes (dict, optional) – Dictionary mapping cell_type -> [genes_to_highlight]
dpi (int) – DPI for saved images
**plot_kwargs – Additional kwargs passed to plot_gene_contribution_scatter
- sctop.visualization.plot_expression_distribution(scores, n=10, ax=None, box_color='skyblue', fontsize=30, **kwargs)
Plots boxplots of expression for top genes with a fixed y-axis scale.