Theory¶
This page explains the theoretical foundation and mathematical details of scTOP.
Overview¶
scTOP (Single-cell Type Order Parameters) identifies cell phenotypes by projecting single-cell expression profiles onto a reference basis of known cell types. The method uses order parameters for Hopfield networks as defined by Kanter and Sompolinsky.
Theoretical Background¶
This method is described in detail in “scTOP: physics-inspired order parameters for cellular identification and visualization” by Maria Yampolskaya, Michael J. Herriges, Laertis Ikonomou, Darrell Kotton, and Pankaj Mehta.
Key Concepts¶
Cell fate coordinates: In physics, an order parameter is a macroscopic variable that describes the phase of a system. For cell types, Hopfield-inspired order parameters measure cellular identity and can be used to define the coordinates of cell fate space.
Cell Type as Attractor State: This approach is inspired by the Waddington landscape and by treating cell types as attractors in a Hopfield network. Each cell type corresponds to a stable attractor in gene expression space.
Projection onto Basis: By creating a basis from reference cell types, we can decompose any sample’s expression profile into contributions from each cell type.
Mathematical Framework¶
Notation¶
\(\xi\): Basis matrix (genes × cell types)
\(\mathbf{s}\): Processed sample vector (genes × 1)
\(\mathbf{a}\): Cell type scores (cell types × 1)
\(G\): Number of genes
\(C\): Number of cell types
Data Processing¶
Step 1: Normalization¶
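Each sample is independently normalized by library size:
\[\begin{split}\tilde{x}_{g,i} = \frac{x_{g,i}}{\sum_{g'=1}^{G} x_{g',i}}\end{split}\]
where \(x_{g,i}\) is the raw count for gene \(g\) in sample \(i\) and \(\tilde{x}_{g,i}\) is the normalized value.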
Purpose: Remove technical variation due to sequencing depth.
Step 2: Log Transformation¶
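\[\begin{split}y_{g,i} = \log_2(\tilde{x}_{g,i} + 1)\end{split}\]
where \(\tilde{x}_{g,i}\) is the library-size-normalized value from Step 1.
Purpose: Stabilize variance and reduce the influence of highly expressed genes.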
Step 3: Rank-Based Z-Score Transform¶
For each sample \(i\):
Rank genes: Assign ranks \(r_{g,i} \in [1, G]\) to genes based on \(y_{g,i}\)
Handle ties: Use average rank for tied values
Convert to percentiles:
\[\begin{split}p_{g,i} = \frac{r_{g,i}}{G + 1}\end{split}\]
Apply inverse normal CDF (probit transform):
\[\begin{split}z_{g,i} = \Phi^{-1}(p_{g,i}) = \sqrt{2} \cdot \text{erf}^{-1}(2p_{g,i} - 1)\end{split}\]
Purpose: scRNA-seq data show substantial batch-to-batch variability. Ranking replaces absolute expression with relative ordering within each sample, which helps ameliorate batch effects and makes the processed values more stable across experiments.
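The three processing steps can be sketched for a single sample as follows. This is a minimal illustration rather than scTOP's own implementation: the function names `average_ranks` and `process_sample` are hypothetical, and the log step assumes \(\log_2(x + 1)\), which this page does not state explicitly.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Rank values from 1 to G, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda g: values[g])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # 0-based positions start..end correspond to ranks start+1..end+1.
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def process_sample(counts):
    """Apply Steps 1-3 to one sample (a list of raw counts, one per gene)."""
    total = sum(counts)
    normalized = [x / total for x in counts]          # Step 1: library size
    logged = [math.log2(x + 1) for x in normalized]   # Step 2: assumed log2(x + 1)
    n_genes = len(counts)
    percentiles = [r / (n_genes + 1) for r in average_ranks(logged)]
    probit = NormalDist().inv_cdf                     # inverse standard normal CDF
    return [probit(p) for p in percentiles]           # Step 3: rank-based z-score


z = process_sample([0, 0, 5, 10, 85])
```

Because the percentiles are \(r / (G + 1)\), they never reach 0 or 1, so the probit transform stays finite. The two zero-count genes share the average rank 1.5 and therefore receive the same z-score.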
Basis Construction¶
Creating the Reference Basis¶
Given annotated reference data, the basis is constructed by:
Group cells by type: Collect all cells \(\{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_{N_{ct}}\}\) for each cell type \(ct\)
Process each cell: Apply the full processing pipeline to cells individually, avoiding cross-sample operations
Average within type:
\[\begin{split}\mathbf{b}_{ct} = \frac{1}{N_{ct}} \sum_{i \in ct} \mathbf{s}_i\end{split}\]
Assemble basis: \(\xi = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_C]\)
Result: Each column of \(\xi\) represents the average processed expression profile for one cell type.
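The averaging and assembly steps above might look like this in NumPy. The function `build_basis`, the toy matrix, and the type names are hypothetical, and the input columns are assumed to be already processed one cell at a time.

```python
import numpy as np


def build_basis(processed, labels):
    """Average processed profiles within each cell type.

    processed: array of shape (genes, cells), one processed column per cell.
    labels:    sequence of cell type names, one per column.
    Returns the sorted type names and a (genes, cell types) basis matrix.
    """
    processed = np.asarray(processed, dtype=float)
    labels = np.asarray(labels)
    types = sorted(set(labels.tolist()))
    # One column per type: the mean over all cells annotated with that type.
    columns = [processed[:, labels == t].mean(axis=1) for t in types]
    return types, np.column_stack(columns)


# Toy data: 3 genes, 4 cells, 2 hypothetical types.
processed = np.array([
    [1.0, 0.0, -1.0, -1.0],
    [0.0, 1.0, 0.0, 0.0],
    [-1.0, -1.0, 1.0, 1.0],
])
types, xi = build_basis(processed, ["AT1", "AT1", "AT2", "AT2"])
```

Returning the type names alongside the matrix keeps the column order explicit, which matters later when scores are mapped back to cell types.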
Basis Properties¶
Non-orthogonality: The gene expression profiles of cell types are correlated, so \(\xi^T \xi \neq I\). The basis captures these relationships.
Similarity Matrix: \(A = \xi^T \xi / G\) encodes cell type similarities:
\(A_{ii} \approx 1\): Self-similarity
\(A_{ij}\): Similarity between types \(i\) and \(j\)
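As a rough illustration of the similarity matrix, the sketch below builds two correlated synthetic basis columns and computes \(A = \xi^T \xi / G\). The columns are random stand-ins with roughly zero mean and unit variance, not real expression profiles, and the correlation of 0.6 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
G = 2000  # number of genes (toy value)

# Two hypothetical, correlated basis columns.
shared = rng.standard_normal(G)
xi = np.column_stack([
    shared,
    0.6 * shared + 0.8 * rng.standard_normal(G),
])

A = xi.T @ xi / G  # (cell types x cell types) similarity matrix
```

The diagonal of `A` is close to 1 because each column has roughly unit variance, while the off-diagonal entry reflects how strongly the two profiles overlap.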
Scoring (Projection)¶
Non-Orthogonal Projection¶
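To score a sample \(\mathbf{s}\) against basis \(\xi\), we solve:
\[\begin{split}\mathbf{a}^* = \arg\min_{\mathbf{a}} \|\mathbf{s} - \xi \mathbf{a}\|^2 = (\xi^T \xi)^{-1} \xi^T \mathbf{s}\end{split}\]
This gives the coefficients \(\mathbf{a}^*\) that best reconstruct \(\mathbf{s}\), in the least-squares sense, as a linear combination of basis vectors.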
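A minimal sketch of the projection, assuming it is an ordinary least-squares fit of the sample onto the basis columns. The helper name `score` and the toy basis are hypothetical; `numpy.linalg.lstsq` is used here as one standard way to solve the problem, not necessarily what scTOP calls internally.

```python
import numpy as np


def score(xi, s):
    """Least-squares coefficients a* that minimize ||s - xi @ a||^2."""
    a, *_ = np.linalg.lstsq(xi, s, rcond=None)
    return a


# Toy non-orthogonal basis: 4 genes, 2 hypothetical cell types.
xi = np.array([
    [1.0, 0.5],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.0, -1.5],
])
s = 0.7 * xi[:, 0] + 0.2 * xi[:, 1]  # a sample that is a known mixture
a = score(xi, s)
```

Because the sample was built as an exact mixture of the two columns, the recovered coefficients match the mixing weights even though the columns are not orthogonal. A plain dot product with each column would not recover them.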
Interpretation of Scores¶
Higher score → better match to that cell type
Scores can be negative: Sample anti-correlates with that type
Scores can be very low: Because of dropout, scores for single cells may be low across all types. Try using pseudo-bulk (i.e. average across populations) for more robust scores.
Use endogenous samples as controls: To interpret scores, compare to known endogenous samples processed the same way whenever possible. This gives a baseline for expected score ranges given dropout, technical variability, and data quality.
Predictivity Matrix¶
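The matrix \(\eta\) is called the predictivity matrix:
\[\begin{split}\eta = (\xi^T \xi)^{-1} \xi^T\end{split}\]
It has shape cell types × genes, so that \(\mathbf{a} = \eta \mathbf{s}\). Each entry \(\eta_{ct,g}\) represents: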
How much does expression of gene \(g\) contribute to the score for cell type \(ct\)?
Gene Contributions¶
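For a sample \(\mathbf{s}\) and cell type \(ct\):
\[\begin{split}a_{ct} = \sum_{g=1}^{G} \eta_{ct,g} \, s_g\end{split}\]
The contribution of gene \(g\) to type \(ct\) is:
\[\begin{split}c_{g,ct} = \eta_{ct,g} \, s_g\end{split}\]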
Interpretation:
\(c_{g,ct} > 0\): High expression of gene \(g\) increases score for type \(ct\). Low expression of gene \(g\) decreases score.
\(c_{g,ct} < 0\): High expression of gene \(g\) decreases score for type \(ct\). Low expression of gene \(g\) increases score.
\(|c_{g,ct}|\) large: Gene \(g\) strongly influences assignment
Use Case: Identify which genes drive cell type assignments, validate with known markers.
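The sketch below assumes the predictivity matrix is the pseudoinverse of the basis and that the contribution of gene \(g\) to type \(ct\) is the corresponding term \(\eta_{ct,g} s_g\) of the score sum. Both are assumptions made for illustration, and the toy basis and sample values are made up.

```python
import numpy as np

# Toy basis: 4 genes, 2 hypothetical cell types.
xi = np.array([
    [1.0, 0.5],
    [0.0, 1.0],
    [-1.0, 0.0],
    [0.0, -1.5],
])
s = np.array([1.2, -0.3, -0.9, 0.1])  # processed sample (one value per gene)

# For a full-column-rank basis, pinv(xi) equals (xi^T xi)^{-1} xi^T.
eta = np.linalg.pinv(xi)              # shape: (cell types, genes)
scores = eta @ s                      # one score per cell type

# Broadcasting multiplies every row of eta by s, so
# contributions[ct, g] = eta[ct, g] * s[g].
contributions = eta * s

# Genes ordered by the size of their contribution to the first type.
ranked_genes = np.argsort(-np.abs(contributions[0]))
```

Each row of `contributions` sums to the score for that type, so the per-gene terms are an exact decomposition of the score rather than a separate statistic.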
Performance Metrics¶
Top-1 Accuracy¶
The fraction of cells assigned to the correct type.
Top-3 Accuracy¶
The fraction where true type is in top 3 predictions.
F1 Score¶
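Harmonic mean of precision and recall:
\[\begin{split}F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\end{split}\]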
Computed per cell type and aggregated (macro or weighted average).
Unspecified Rate¶
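Fraction of predictions with low confidence:
\[\begin{split}\text{Unspecified rate} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\max_{ct} a_{ct,i} < \tau\right]\end{split}\]
where \(a_{ct,i}\) is the score of cell \(i\) for type \(ct\), \(N\) is the number of cells, and \(\tau\) is the specification threshold (default: 0.1).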
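The four metrics can be sketched in plain Python on a toy score table. The function names are hypothetical, the unspecified rate assumes a cell counts as unspecified when its highest score is below \(\tau\), and returning 0 for an F1 score with no true positives is a convention chosen here.

```python
def top_k_accuracy(scores, true_labels, types, k=1):
    """Fraction of cells whose true type is among the k highest-scoring types."""
    hits = 0
    for cell_scores, truth in zip(scores, true_labels):
        ranked = sorted(range(len(types)), key=lambda j: cell_scores[j], reverse=True)
        hits += truth in [types[j] for j in ranked[:k]]
    return hits / len(true_labels)


def unspecified_rate(scores, tau=0.1):
    """Fraction of cells whose best score falls below the threshold tau."""
    return sum(max(cell_scores) < tau for cell_scores in scores) / len(scores)


def f1_score(predicted, true_labels, cell_type):
    """F1 for one cell type; 0.0 when there are no true positives."""
    pairs = list(zip(predicted, true_labels))
    tp = sum(p == cell_type and t == cell_type for p, t in pairs)
    fp = sum(p == cell_type and t != cell_type for p, t in pairs)
    fn = sum(p != cell_type and t == cell_type for p, t in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: 4 cells scored against 4 hypothetical types.
types = ["A", "B", "C", "D"]
scores = [
    [0.60, 0.20, 0.10, 0.00],
    [0.30, 0.40, 0.10, 0.00],
    [0.05, 0.02, 0.08, 0.01],
    [0.10, 0.70, 0.00, 0.20],
]
truth = ["A", "A", "D", "B"]
predicted = [types[max(range(len(types)), key=row.__getitem__)] for row in scores]

top1 = top_k_accuracy(scores, truth, types, k=1)
top3 = top_k_accuracy(scores, truth, types, k=3)
unspecified = unspecified_rate(scores, tau=0.1)
macro_f1 = sum(f1_score(predicted, truth, t) for t in types) / len(types)
```

The macro average weights every cell type equally, so rare types that are never predicted correctly pull the aggregate down as much as common ones.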
Feature Selection¶
ANOVA-Based Selection¶
scTOP can select informative genes using one-way ANOVA:
Compute F-statistic for each gene:
\[\begin{split}F_g = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}}\end{split}\]
where MS = mean square.
Select top genes: Keep top \(k\) genes or top percentile by F-score.
Standardize basis: Optionally standardize selected features to unit norm.
Disclaimer: Feature selection may discard important and relevant genes, and it can give a misleading picture of which cell types are distinct.
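The F-statistic and top-\(k\) selection can be illustrated with the textbook one-way ANOVA formulas. The helper `f_statistic`, the gene names, and the values are made up for the example, and this sketch does not reproduce scTOP's own selection code.

```python
from statistics import fmean


def f_statistic(groups):
    """One-way ANOVA F for one gene.

    groups: list of lists, the gene's processed values grouped by cell type.
    """
    n_total = sum(len(g) for g in groups)
    n_groups = len(groups)
    grand_mean = fmean(v for g in groups for v in g)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    ms_between = ss_between / (n_groups - 1)       # between-type mean square
    ms_within = ss_within / (n_total - n_groups)   # within-type mean square
    return ms_between / ms_within


# Two hypothetical genes measured in two cell types, three cells each.
genes = {
    "marker": [[1.0, 2.0, 3.0], [7.0, 8.0, 9.0]],  # differs strongly between types
    "flat": [[1.0, 2.0, 3.0], [2.0, 1.0, 3.0]],    # same mean in both types
}
f_scores = {name: f_statistic(groups) for name, groups in genes.items()}
top_genes = sorted(f_scores, key=f_scores.get, reverse=True)[:1]
```

A gene with identical means across types scores 0 and is dropped first, which is also why selection can remove genes that matter for types not yet in the basis.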
Cross-Validation¶
To robustly evaluate basis quality, scTOP supports k-fold cross-validation:
Split data into \(k\) folds
For each fold:
Train basis on \(k-1\) folds
Test on held-out fold
Compute metrics
Average metrics across folds:
\[\begin{split}\mu_{\text{metric}} = \frac{1}{k} \sum_{i=1}^k \text{metric}_i\end{split}\]
Create final basis on all data
Advantage: More reliable performance estimate than single train-test split.
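The cross-validation loop can be sketched generically. Here `fit` and `evaluate` are caller-supplied placeholders, not scTOP functions, and the folds are a plain shuffled split without stratification by cell type.

```python
import random


def k_fold_indices(n_cells, k, seed=0):
    """Shuffle cell indices and split them into k nearly equal folds."""
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_cells, k, fit, evaluate):
    """Average a metric over k train/test splits.

    fit(train_indices) returns a basis; evaluate(basis, test_indices)
    returns a float. Both are hypothetical stand-ins supplied by the caller.
    """
    folds = k_fold_indices(n_cells, k)
    metrics = []
    for i, test in enumerate(folds):
        # Train on every fold except the held-out one.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        basis = fit(train)
        metrics.append(evaluate(basis, test))
    return sum(metrics) / k, metrics


# Dummy callables that only exercise the control flow.
mean_metric, per_fold = cross_validate(
    n_cells=10,
    k=5,
    fit=lambda train: len(train),             # stand-in "basis"
    evaluate=lambda basis, test: basis / 10,  # stand-in metric
)
```

With rare cell types, a stratified split that keeps every type represented in each training set would be a safer choice than the plain shuffle used here.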
Best Practices¶
For Optimal Results¶
Use high-quality reference data: The method relies on the idea that each cell type is an attractor basin. Although scTOP can distinguish highly correlated cell types well, the quality of its output depends heavily on the quality of the reference basis, so curate the basis carefully.
* Use sufficient cells per type (at least 100, preferably many more)
* Ensure accurate and consistent annotations (make sure there are no duplicate or highly similar types, e.g. “T cell CD4+” and “T cell”)
Common gene space: Ensure query and reference share many genes
Consistent processing: Do not pre-normalize or log-transform; scTOP handles this
Validate basis: Check accuracy, confusion matrix, and F1 scores before use. It helps to build the basis iteratively, merging similar cell types and dropping suspicious ones.
For Interpretation¶
Examine contributions: Use gene contributions to validate and understand assignments
Compare to markers: Check if known markers have high contributions
Visualize cell trajectories: Plot scores in 2D or 3D to explore differentiation paths. scTOP is not just for annotating cells; it provides a set of cell fate coordinates to study the trajectories of differentiation. This is the most powerful use of scTOP.
Check confusion matrix: Understand which types are commonly confused
References¶
Kanter, I., & Sompolinsky, H. (1987). Associative recall of memory without errors. Physical Review A, 35(1), 380.
Yampolskaya, M., Herriges, M. J., Ikonomou, L., Kotton, D., & Mehta, P. (2023). scTOP: physics-inspired order parameters for cellular identification and visualization. Development, 150(21), dev201873.
Yampolskaya, M., & Mehta, P. (2025). Hopfield Networks as Models of Emergent Function in Biology. arXiv preprint arXiv:2506.13076.
Lang, A. H., Li, H., Collins, J. J., & Mehta, P. (2014). Epigenetic Landscapes Explain Partially Reprogrammed Cells and Identify Key Reprogramming Genes. PLoS Computational Biology, 10(8), e1003734.