Theory¶

This page explains the theoretical foundation and mathematical details of scTOP.

Overview¶

scTOP (Single-cell Type Order Parameters) identifies cell phenotypes by projecting single-cell expression profiles onto a reference basis of known cell types. The method uses order parameters for Hopfield networks as defined by Kanter and Sompolinsky.

Theoretical Background¶

This method is described in detail in “scTOP: physics-inspired order parameters for cellular identification and visualization” by Maria Yampolskaya, Michael J. Herriges, Laertis Ikonomou, Darrell Kotton, and Pankaj Mehta.

Key Concepts¶

Cell fate coordinates: In physics, an order parameter is a macroscopic variable that describes the phase of a system. For cell types, Hopfield-inspired order parameters measure cellular identity and can be used to define the coordinates of cell fate space.

Cell Type as Attractor State: This approach is inspired by the Waddington landscape and by treating cell types as attractors in a Hopfield network. Each cell type corresponds to a stable attractor in gene expression space.

Projection onto Basis: By creating a basis from reference cell types, we can decompose any sample’s expression profile into contributions from each cell type.

Mathematical Framework¶

Notation¶

\(\xi\): Basis matrix (genes × cell types)
\(\mathbf{s}\): Processed sample vector (genes × 1)
\(\mathbf{a}\): Cell type scores (cell types × 1)
\(G\): Number of genes
\(C\): Number of cell types

Data Processing¶

Step 1: Normalization¶

Each sample is independently normalized by library size:

\[\begin{split}x'_{g,i} = \frac{x_{g,i}}{\sum_{g'} x_{g',i}}\end{split}\]

where \(x_{g,i}\) is the raw count for gene \(g\) in sample \(i\).

Purpose: Remove technical variation due to sequencing depth.

Step 2: Log Transformation¶

\[\begin{split}y_{g,i} = \log_2(x'_{g,i} + 1)\end{split}\]

Purpose: Stabilize variance and reduce the influence of highly expressed genes.
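
Steps 1 and 2 can be sketched in NumPy as follows. This is a minimal illustration of the two formulas, not scTOP's own implementation; the helper name `normalize_and_log` and the toy counts are made up for this example.

```python
import numpy as np

def normalize_and_log(counts):
    """Library-size normalize each sample (column), then apply log2(x + 1).

    counts: array of shape (genes, samples) holding raw counts.
    Hypothetical helper for illustration only.
    """
    counts = np.asarray(counts, dtype=float)
    library_size = counts.sum(axis=0, keepdims=True)  # total counts per sample
    x_norm = counts / library_size                    # Step 1: x'_{g,i}
    return np.log2(x_norm + 1.0)                      # Step 2: y_{g,i}

# Two toy samples with the same composition but a 10x difference in depth.
raw = np.array([[10, 100],
                [30, 300],
                [60, 600]])
y = normalize_and_log(raw)
```

Both columns have the same relative composition, so they map to identical values after Step 1, which is the depth-independence the normalization is meant to provide.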

Step 3: Rank-Based Z-Score Transform¶

For each sample \(i\):

Rank genes: Assign ranks \(r_{g,i} \in [1, G]\) to genes based on \(y_{g,i}\)
Handle ties: Use average rank for tied values
Convert to percentiles:

\[\begin{split}p_{g,i} = \frac{r_{g,i}}{G + 1}\end{split}\]
Apply inverse normal CDF (probit transform):

\[\begin{split}z_{g,i} = \Phi^{-1}(p_{g,i}) = \sqrt{2} \cdot \text{erf}^{-1}(2p_{g,i} - 1)\end{split}\]

Purpose: scRNA-seq data show substantial batch-to-batch variability. Ranking replaces absolute expression with relative ordering within each sample, which helps mitigate batch effects and makes the result more stable across experiments.
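
The rank-based transform for a single sample could look like the sketch below. It assumes only NumPy and the standard library; `rank_zscore` is an illustrative name, and the average-rank tie handling is written out by hand rather than taken from scTOP's code.

```python
import numpy as np
from statistics import NormalDist

def rank_zscore(y):
    """Rank-based z-score for one sample vector y of length G (illustrative helper)."""
    y = np.asarray(y, dtype=float)
    G = y.size
    # Average ranks for ties: every tie group gets the mean of the ranks it spans.
    _, inverse, counts = np.unique(y, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)              # highest 1-based rank in each group
    avg_rank = last_rank - (counts - 1) / 2.0  # mean rank of each group
    ranks = avg_rank[inverse]
    p = ranks / (G + 1)                        # percentiles strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf             # standard normal inverse CDF
    return np.array([inv_cdf(p_g) for p_g in p])

y = np.array([0.0, 0.0, 0.5, 2.0, 1.0])
z = rank_zscore(y)
# Any strictly increasing rescaling of y leaves the ranks, and hence z, unchanged.
z_rescaled = rank_zscore(10.0 * y + 3.0)
```

The two tied zeros receive the same z-score, and the rescaled sample gives the same output as the original, which illustrates why the transform is insensitive to monotone distortions of expression values.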

Basis Construction¶

Creating the Reference Basis¶

Given annotated reference data, the basis is constructed by:

Group cells by type: Collect all cells \(\{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_{N_{ct}}\}\) for each cell type \(ct\)
Process each cell: Apply the full processing pipeline to cells individually, avoiding cross-sample operations
Average within type:

\[\begin{split}\mathbf{b}_{ct} = \frac{1}{N_{ct}} \sum_{i \in ct} \mathbf{s}_i\end{split}\]
Assemble basis: \(\xi = [\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_C]\)

Result: Each column of \(\xi\) represents the average processed expression profile for one cell type.
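
The construction steps above can be sketched as follows. The helpers `process` and `build_basis` are hypothetical names written for this example (they are not scTOP's API), and the Poisson counts are synthetic stand-ins for annotated reference data.

```python
import numpy as np
from statistics import NormalDist

def process(counts):
    """Normalize, log-transform, and rank z-score each column independently."""
    counts = np.asarray(counts, dtype=float)
    y = np.log2(counts / counts.sum(axis=0, keepdims=True) + 1.0)
    G = y.shape[0]
    inv_cdf = NormalDist().inv_cdf
    z = np.empty_like(y)
    for i in range(y.shape[1]):
        _, inverse, n = np.unique(y[:, i], return_inverse=True, return_counts=True)
        ranks = (np.cumsum(n) - (n - 1) / 2.0)[inverse]  # average rank for ties
        z[:, i] = [inv_cdf(r / (G + 1)) for r in ranks]
    return z

def build_basis(counts, labels):
    """Average the processed cells of each type into one basis column."""
    labels = np.asarray(labels)
    types = sorted(set(labels.tolist()))
    processed = process(counts)  # each cell is processed on its own
    xi = np.column_stack([processed[:, labels == t].mean(axis=1) for t in types])
    return xi, types

rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(50, 12)) + 1  # +1 keeps every library size positive
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
xi, types = build_basis(counts, labels)       # xi has shape (genes, cell types)
```

Note that averaging happens after processing, so no cell's ranks depend on any other cell.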

Basis Properties¶

Non-orthogonality: The gene expression profiles of cell types are correlated, so \(\xi^T \xi \neq I\). The basis captures these relationships.

Similarity Matrix: \(A = \xi^T \xi / G\) encodes cell type similarities:

\(A_{ii} \approx 1\): Self-similarity
\(A_{ij}\): Similarity between types \(i\) and \(j\)
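
As a small numerical illustration, the similarity matrix can be computed directly from a basis. The basis below is synthetic (unit-variance random columns, two of which share a common component) and only stands in for a real \(\xi\).

```python
import numpy as np

rng = np.random.default_rng(1)
G = 2000
shared = rng.standard_normal(G)

# Toy basis: columns 0 and 1 share a common component, column 2 is independent.
xi = np.column_stack([
    (shared + rng.standard_normal(G)) / np.sqrt(2),
    (shared + rng.standard_normal(G)) / np.sqrt(2),
    rng.standard_normal(G),
])

A = xi.T @ xi / G  # shape (cell types, cell types)
```

In this toy setup the diagonal of `A` is close to 1, the entry for the two related columns is clearly positive, and the entries involving the independent column are close to 0.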

Scoring (Projection)¶

Non-Orthogonal Projection¶

To score a sample \(\mathbf{s}\) against basis \(\xi\), we solve:

\[\begin{split}\mathbf{a}^* = (\xi^T \xi)^{-1} \xi^T \mathbf{s}\end{split}\]

This gives the coefficients \(\mathbf{a}^*\) that best reconstruct \(\mathbf{s}\) as a linear combination of basis vectors.
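
A minimal sketch of the projection, assuming a basis with linearly independent columns. The function name `score` and the random basis are illustrative; solving the normal equations with `np.linalg.solve` is one way to evaluate the formula without forming the inverse explicitly.

```python
import numpy as np

def score(xi, s):
    """Least-squares coefficients a* = (xi^T xi)^{-1} xi^T s (illustrative helper)."""
    return np.linalg.solve(xi.T @ xi, xi.T @ s)

rng = np.random.default_rng(2)
G, C = 500, 4
xi = rng.standard_normal((G, C))
xi[:, 1] += 0.8 * xi[:, 0]  # make types 0 and 1 correlated

# A basis vector scored against its own basis.
a_basis = score(xi, xi[:, 1])

# An exact mixture of two basis vectors.
mix = 0.7 * xi[:, 0] + 0.3 * xi[:, 2]
a_mix = score(xi, mix)
```

Even though types 0 and 1 are correlated, the basis vector for type 1 scores approximately 1 on type 1 and 0 elsewhere, and the mixture recovers its mixing weights. A plain correlation against each basis vector would not separate the two correlated types this cleanly.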

Interpretation of Scores¶

Higher score → better match to that cell type
Scores can be negative: Sample anti-correlates with that type
Scores can be very low: Because of dropout, scores for single cells may be low across all types. Try using pseudo-bulk (i.e. average across populations) for more robust scores.
Use endogenous samples as controls: To interpret scores, compare to known endogenous samples processed the same way whenever possible. This gives a baseline for expected score ranges given dropout, technical variability, and data quality.

Predictivity Matrix¶

The matrix \(\eta\) that maps a processed sample to its scores, \(\mathbf{a}^* = \eta \mathbf{s}\), is called the predictivity matrix:

\[\begin{split}\eta = A^{-1} \xi^T / G = (\xi^T \xi)^{-1} \xi^T\end{split}\]

Each entry \(\eta_{ct,g}\) represents:

How much does expression of gene \(g\) contribute to the score for cell type \(ct\)?

Gene Contributions¶

For a sample \(\mathbf{s}\) and cell type \(ct\):

\[\begin{split}a_{ct} = \sum_g \eta_{ct,g} \cdot s_g\end{split}\]

The contribution of gene \(g\) to type \(ct\) is:

\[\begin{split}c_{g,ct} = \eta_{ct,g} \cdot s_g\end{split}\]

Interpretation:

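\(c_{g,ct} > 0\): Gene \(g\) raises the score for type \(ct\) in this sample. This happens when \(\eta_{ct,g}\) and \(s_g\) have the same sign, e.g. a gene with positive weight that ranks high in the sample.
\(c_{g,ct} < 0\): Gene \(g\) lowers the score for type \(ct\) in this sample. This happens when \(\eta_{ct,g}\) and \(s_g\) have opposite signs, e.g. a gene with positive weight that ranks low in the sample.
\(|c_{g,ct}|\) large: Gene \(g\) strongly influences assignment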

Use Case: Identify which genes drive cell type assignments, validate with known markers.
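
The decomposition of a score into per-gene contributions can be sketched as below. As an assumption of this example, \(\eta\) is taken to be the matrix that reproduces the projection scores, computed here as the pseudo-inverse of a synthetic basis; the array `contrib` is stored as (cell types, genes), i.e. transposed relative to the \(c_{g,ct}\) index order.

```python
import numpy as np

rng = np.random.default_rng(3)
G, C = 200, 3
xi = rng.standard_normal((G, C))               # synthetic basis
s = xi[:, 0] + 0.1 * rng.standard_normal(G)    # sample resembling type 0

# For a full-column-rank basis, pinv(xi) equals (xi^T xi)^{-1} xi^T.
eta = np.linalg.pinv(xi)                       # shape (cell types, genes)
a = eta @ s                                    # scores, one per cell type

# contrib[ct, g] = eta[ct, g] * s[g]; each row sums to the score of that type.
contrib = eta * s

# Genes with the largest absolute contribution to the type-0 score.
top_genes = np.argsort(-np.abs(contrib[0]))[:5]
```

Summing `contrib` over genes recovers `a` exactly, so the contributions are an additive breakdown of each score rather than a separate statistic.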

Performance Metrics¶

Top-1 Accuracy¶

\[\begin{split}\text{Acc}_1 = \frac{\#\{\text{argmax}(\mathbf{a}) = \text{true type}\}}{N}\end{split}\]

The fraction of cells assigned to the correct type.

Top-3 Accuracy¶

\[\begin{split}\text{Acc}_3 = \frac{\#\{\text{true type} \in \text{top 3 of } \mathbf{a}\}}{N}\end{split}\]

The fraction of cells whose true type is among the three highest-scoring types.
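
Both accuracies can be computed with one small function. In this sketch the scores are arranged as a (cells, cell types) matrix, one row per cell; the function name `top_k_accuracy` and the toy scores are made up for illustration.

```python
import numpy as np

def top_k_accuracy(scores, true_idx, k=1):
    """Fraction of cells whose true type is among the k highest scores.

    scores: (cells, cell types); true_idx: (cells,) integer index of the true type.
    """
    scores = np.asarray(scores)
    true_idx = np.asarray(true_idx)
    top_k = np.argsort(-scores, axis=1)[:, :k]         # indices of the k best types
    hits = (top_k == true_idx[:, None]).any(axis=1)
    return hits.mean()

scores = np.array([
    [0.9, 0.1, 0.0, -0.1],  # true type 0: best score
    [0.2, 0.5, 0.3, 0.0],   # true type 2: second best
    [0.4, 0.3, 0.2, 0.1],   # true type 3: worst
    [0.0, 0.1, 0.2, 0.7],   # true type 3: best score
])
true_idx = np.array([0, 2, 3, 3])

acc1 = top_k_accuracy(scores, true_idx, k=1)  # 0.5
acc3 = top_k_accuracy(scores, true_idx, k=3)  # 0.75
```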

F1 Score¶

Harmonic mean of precision and recall:

\[\begin{split}F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\end{split}\]

Computed per cell type and aggregated (macro or weighted average).
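
A hand-rolled sketch of per-type F1 with macro and weighted aggregation follows. It uses the identity \(F_1 = 2\,TP / (2\,TP + FP + FN)\), which is algebraically the same as the harmonic mean above; the function name and toy labels are illustrative only.

```python
import numpy as np

def f1_per_type(y_true, y_pred, n_types):
    """Per-type F1 and support (number of true cells of each type)."""
    f1 = np.zeros(n_types)
    support = np.zeros(n_types)
    for t in range(n_types):
        tp = np.sum((y_pred == t) & (y_true == t))
        fp = np.sum((y_pred == t) & (y_true != t))
        fn = np.sum((y_pred != t) & (y_true == t))
        support[t] = tp + fn
        denom = 2 * tp + fp + fn
        f1[t] = 2 * tp / denom if denom > 0 else 0.0  # define F1 = 0 if undefined
    return f1, support

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])

f1, support = f1_per_type(y_true, y_pred, n_types=3)
macro_f1 = f1.mean()                           # every type counts equally
weighted_f1 = np.average(f1, weights=support)  # types weighted by cell count
```

Macro averaging treats rare and abundant types equally, while weighted averaging lets abundant types dominate, so the two can differ noticeably on imbalanced references.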

Unspecified Rate¶

Fraction of predictions with low confidence:

\[\begin{split}\text{Unspec} = \frac{\#\{\max(\mathbf{a}) < \tau\}}{N}\end{split}\]

where \(\tau\) is the specification threshold (default: 0.1).
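
For example, with scores arranged as a (cells, cell types) matrix, the unspecified rate is the fraction of rows whose largest score stays below the threshold. The function name and the toy scores are illustrative.

```python
import numpy as np

def unspecified_rate(scores, tau=0.1):
    """Fraction of cells whose highest score is below the threshold tau."""
    scores = np.asarray(scores)
    return np.mean(scores.max(axis=1) < tau)

scores = np.array([
    [0.60, 0.05, -0.02],  # max 0.60: specified
    [0.08, 0.03, 0.01],   # max 0.08: unspecified
    [-0.10, 0.09, 0.02],  # max 0.09: unspecified
    [0.02, 0.30, 0.25],   # max 0.30: specified
])
rate = unspecified_rate(scores)  # 0.5
```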

Feature Selection¶

ANOVA-Based Selection¶

scTOP can select informative genes using one-way ANOVA:

Compute F-statistic for each gene:

\[\begin{split}F_g = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}}\end{split}\]

where MS = mean square.
Select top genes: Keep top \(k\) genes or top percentile by F-score.
Standardize basis: Optionally standardize selected features to unit norm.

Disclaimer: Feature selection may discard important and relevant genes, and it may give a misleading picture of which cell types are distinct.
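
The F-statistic and top-\(k\) selection can be sketched in plain NumPy as below. This is the textbook one-way ANOVA computation, not necessarily how scTOP implements it; `anova_f`, the synthetic data, and the choice \(k = 5\) are assumptions of the example.

```python
import numpy as np

def anova_f(processed, labels):
    """One-way ANOVA F-statistic per gene. processed: (genes, cells)."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n_cells, n_groups = processed.shape[1], groups.size
    grand_mean = processed.mean(axis=1)
    ss_between = np.zeros(processed.shape[0])
    ss_within = np.zeros(processed.shape[0])
    for g in groups:
        block = processed[:, labels == g]
        mean_g = block.mean(axis=1)
        ss_between += block.shape[1] * (mean_g - grand_mean) ** 2
        ss_within += ((block - mean_g[:, None]) ** 2).sum(axis=1)
    ms_between = ss_between / (n_groups - 1)       # between-group mean square
    ms_within = ss_within / (n_cells - n_groups)   # within-group mean square
    return ms_between / ms_within

rng = np.random.default_rng(4)
labels = np.repeat(["A", "B", "C"], 20)
data = rng.standard_normal((30, 60))  # 30 genes, 60 cells, no structure
data[0, labels == "A"] += 3.0         # make gene 0 a marker for type A

F = anova_f(data, labels)
k = 5
selected = np.argsort(-F)[:k]         # indices of the k genes with the largest F
```

In this toy data only gene 0 differs between types, so it receives by far the largest F-statistic and is always among the selected genes.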

Cross-Validation¶

To robustly evaluate basis quality, scTOP supports k-fold cross-validation:

Split data into \(k\) folds
For each fold:
- Train basis on \(k-1\) folds
- Test on held-out fold
- Compute metrics
Average metrics across folds:

\[\begin{split}\mu_{\text{metric}} = \frac{1}{k} \sum_{i=1}^k \text{metric}_i\end{split}\]
Create final basis on all data

Advantage: More reliable performance estimate than single train-test split.
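
The cross-validation loop can be sketched as follows, using top-1 accuracy as the metric. The function `cross_validate` is a hypothetical helper, the input is assumed to be already processed, the folds are a simple random split (not stratified), and the data are synthetic.

```python
import numpy as np

def cross_validate(processed, labels, k=5, seed=0):
    """k-fold estimate of top-1 accuracy. processed: (genes, cells)."""
    labels = np.asarray(labels)
    types = np.unique(labels)
    order = np.random.default_rng(seed).permutation(labels.size)
    folds = np.array_split(order, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Basis from training cells only: average profile per type.
        xi = np.column_stack([
            processed[:, train][:, labels[train] == t].mean(axis=1) for t in types
        ])
        # Score held-out cells; rows of `a` are types, columns are cells.
        a = np.linalg.solve(xi.T @ xi, xi.T @ processed[:, test])
        pred = types[np.argmax(a, axis=0)]
        accs.append(np.mean(pred == labels[test]))
    return float(np.mean(accs)), accs

rng = np.random.default_rng(5)
G, n_per_type = 100, 30
centers = rng.standard_normal((G, 3))  # one synthetic profile per type
labels = np.repeat(np.array(["A", "B", "C"]), n_per_type)
processed = np.repeat(centers, n_per_type, axis=1) + 0.5 * rng.standard_normal((G, 90))

mean_acc, fold_accs = cross_validate(processed, labels, k=5)
```

With small or imbalanced references, a stratified split would be a safer choice so that every training fold contains every cell type. After evaluation, the final basis would be rebuilt from all cells, as in the last step above.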

Best Practices¶

For Optimal Results¶

Use high-quality reference data: The method rests on the idea that each cell type is an attractor basin. Although scTOP can distinguish highly correlated cell types very well, the quality of its output depends heavily on the quality of the reference basis, so curate the basis carefully.
- Use sufficient cells per type (at least 100, preferably many more)
- Ensure accurate and consistent annotations (make sure there are no duplicate or highly similar types, e.g. “T cell CD4+” and “T cell”)

Common gene space: Ensure query and reference share many genes
Consistent processing: Do not pre-normalize or log-transform; scTOP handles this
Validate basis: Check accuracy, confusion matrix, and F1 scores before use. It’s good to iteratively create a basis by merging similar cell types and dropping suspicious ones.

For Interpretation¶

Examine contributions: Use gene contributions to validate and understand assignments
Compare to markers: Check if known markers have high contributions
Visualize cell trajectories: Plot scores in 2D or 3D to explore differentiation paths. scTOP is not just for annotating cells; it provides a set of cell fate coordinates to study the trajectories of differentiation. This is the most powerful use of scTOP.
Check confusion matrix: Understand which types are commonly confused

References¶

Kanter, I., & Sompolinsky, H. (1987). Associative recall of memory without errors. Physical Review A, 35(1), 380.
Yampolskaya, M., Herriges, M. J., Ikonomou, L., Kotton, D., & Mehta, P. (2023). scTOP: physics-inspired order parameters for cellular identification and visualization. Development, 150(21), dev201873.
Yampolskaya, M., & Mehta, P. (2025). Hopfield Networks as Models of Emergent Function in Biology. arXiv preprint arXiv:2506.13076.
Lang, A. H., Li, H., Collins, J. J., & Mehta, P. (2014). Epigenetic Landscapes Explain Partially Reprogrammed Cells and Identify Key Reprogramming Genes. PLoS Computational Biology, 10(8), e1003734.