SToFM: a Multi-scale Foundation Model for Spatial Transcriptomics
- Conference: ICML 2025
- arXiv: 2507.11588
- Code: GitHub
- Area: Segmentation
- Keywords: Spatial Transcriptomics, Foundation Model, Multi-scale Learning, SE(2) Transformer, Tissue Domain Segmentation
TL;DR
SToFM is presented as the first multi-scale foundation model for spatial transcriptomics. It combines gene-scale domain adaptation, micro-scale subpatch partitioning, and macro-scale virtual cell injection with an SE(2) Transformer, and is pre-trained on a large-scale corpus of 88M cells. SToFM outperforms existing methods on tasks such as tissue domain semantic segmentation and cell-type annotation.
Background & Motivation
Spatial transcriptomics (ST) technologies measure gene expression while preserving cellular spatial locations, providing tissue-level information unavailable in single-cell RNA sequencing. However, ST data contains multi-scale biological information that is not adequately captured by existing models:
Macro-scale (Figure 1a): Tissue morphology and organ structural information, such as functional regions and anatomical layers.
Micro-scale (Figure 1b): Cellular microenvironments and intercellular interactions.
Gene-scale (Figure 1c): Gene expression profiles of individual cells.
Limitations of existing ST foundation models:
- Nicheformer: Uses only gene expression and ignores spatial coordinates entirely.
- CellPLM: Integrates spatial information via sinusoidal positional embeddings, but is an initial attempt and lacks fine-grained multi-scale designs.
The key challenge is how to capture and integrate information from all three scales simultaneously, across a tissue slice containing tens of thousands of cells, using an appropriate model architecture and self-supervised objectives.
Method
Overall Architecture
SToFM consists of two stages:
1. Multi-scale Information Extraction: Processes each ST slice at three scales to construct a set of subpatches containing multi-scale information.
2. SE(2) Transformer Representation Learning: Jointly models gene expression and spatial information on the subpatches.
Key Designs
Gene-scale: Domain Adaptation
The cell encoder is initialized from Geneformer (a pre-trained single-cell foundation model) and further pre-trained on ST data:
- ST data is of lower quality than scRNA-seq data (limited gene coverage, high dropout rates), so encoding it directly with the single-cell model performs poorly.
- The cell encoder \(f_{cell}\) is continually pre-trained on ST data to achieve scRNA-seq \(\rightarrow\) ST domain adaptation.
- Masked gene modeling and contrastive learning objectives are used.
Micro-scale: Subpatch Partitioning
The entire ST slice is partitioned into multiple subpatches based on spatial location, with each subpatch containing approximately 1000 cells:
- Balances computational efficiency with the preservation of local intercellular interactions.
- Emphasizes spatially localized cell-cell interactions.
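The note does not say how the spatial partition is computed, so the following is only a minimal sketch of one plausible scheme: a recursive median split along the wider axis until every subpatch holds at most `max_cells` cells. The function name `partition_subpatches` and the splitting rule are assumptions for illustration, not the authors' procedure.

```python
def partition_subpatches(coords, max_cells=1000):
    """Split cell indices into spatially compact groups of at most max_cells.

    coords: list of (x, y) tuples, one per cell.
    Hypothetical scheme: recursive median split along the wider axis.
    """
    def split(indices):
        if len(indices) <= max_cells:
            return [indices]
        xs = [coords[i][0] for i in indices]
        ys = [coords[i][1] for i in indices]
        # Cut across the axis with the larger extent to keep patches compact.
        axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
        ordered = sorted(indices, key=lambda i: coords[i][axis])
        mid = len(ordered) // 2
        return split(ordered[:mid]) + split(ordered[mid:])

    return split(list(range(len(coords))))
```

A median split keeps subpatch sizes balanced (between roughly half of `max_cells` and `max_cells`), which is convenient for batching; a grid or clustering-based partition would be an equally valid choice.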
Macro-scale: Virtual Cells
All cells in the slice are clustered using the Leiden clustering algorithm, and each cluster is aggregated into a virtual cell:
- The embedding and position of each virtual cell are the averages over all cells in its cluster.
- Preserves the primary morphology and partitioning information of the slice, serving as a compressed representation of macro-scale information.
- Virtual cells are injected into each subpatch, enabling the model to perceive macro-scale structures while learning micro-scale information.
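The averaging step can be sketched as follows. This is a minimal illustration that assumes cluster labels have already been computed (for example by a Leiden implementation); the function name `build_virtual_cells` and the plain-list data layout are hypothetical.

```python
from collections import defaultdict

def build_virtual_cells(embeddings, coords, cluster_labels):
    """Average embeddings and positions per cluster to form virtual cells.

    embeddings: list of equal-length float lists, one per cell.
    coords: list of (x, y) tuples, one per cell.
    cluster_labels: one precomputed label per cell (assumed given).
    Returns {label: (mean_embedding, mean_position)}.
    """
    members_by_label = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        members_by_label[label].append(idx)

    virtual_cells = {}
    for label, members in members_by_label.items():
        n = len(members)
        dim = len(embeddings[members[0]])
        mean_emb = [sum(embeddings[i][d] for i in members) / n for d in range(dim)]
        mean_pos = tuple(sum(coords[i][a] for i in members) / n for a in range(2))
        virtual_cells[label] = (mean_emb, mean_pos)
    return virtual_cells
```

Injecting the result into a subpatch then amounts to appending these (embedding, position) pairs to the subpatch's own cells before it is passed to the Transformer.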
SE(2) Transformer
An SE(2)-invariant Transformer architecture jointly encodes cell embeddings and spatial coordinates:
- Input: Cell embeddings \(F^{(i)}\) + distance matrix \(D^{(i)}\) (pair representations initialized via a Gaussian module).
- The distance matrix acts as an attention bias, similar to Graphormer and AlphaFold.
- Output: Cell representations \(Y_{cell}^{(i)}\) and pair representations \(Y_{pair}^{(i)}\).
- Because spatial information enters only through pairwise distances, the representations are invariant to 2D translation and rotation.
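The invariance claim rests on the fact that pairwise Euclidean distances do not change under rotation and translation. The sketch below checks this numerically and shows one simple way a distance could be expanded into Gaussian features; the fixed `centers` and `width` are assumptions for illustration (the note does not specify the Gaussian module's parameters).

```python
import math

def distance_matrix(coords):
    """Pairwise Euclidean distances between 2D points."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def gaussian_features(d, centers, width=1.0):
    """Expand a scalar distance into Gaussian basis values (assumed fixed basis)."""
    return [math.exp(-((d - c) ** 2) / (2 * width ** 2)) for c in centers]

def rigid_transform(coords, angle, shift):
    """Apply a 2D rotation by `angle` (radians) followed by a translation."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1]) for x, y in coords]

coords = [(0.0, 0.0), (3.0, 4.0), (-1.0, 2.5)]
original = distance_matrix(coords)
moved = distance_matrix(rigid_transform(coords, angle=0.7, shift=(10.0, -3.0)))

# Largest entrywise difference; zero up to floating-point rounding.
max_diff = max(
    abs(a - b) for row_a, row_b in zip(original, moved) for a, b in zip(row_a, row_b)
)
```

Any model whose only access to geometry is through `distance_matrix` (or features derived from it, such as `gaussian_features`) inherits this invariance automatically.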
Pre-training Objectives
Masked Cell Modeling (MCM): Randomly masks 10% of cell embeddings, using the output representations to predict the masked embeddings (MSE loss).
Pairwise Distance Recovery (PDR): Randomly selects 10% of cells, adds Gaussian noise to their coordinates, and reconstructs the original distance matrix using pair representations (MSE loss).
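The corruption applied before each objective can be sketched as below. This is a hedged illustration, not the authors' code: replacing masked embeddings with a zero vector, the `noise_std` value, and the helper names `corrupt_subpatch` and `mse` are all assumptions.

```python
import random

def corrupt_subpatch(embeddings, coords, mask_ratio=0.1, noise_ratio=0.1,
                     noise_std=1.0, seed=0):
    """Mask a fraction of cell embeddings and perturb a fraction of coordinates."""
    rng = random.Random(seed)
    n = len(embeddings)
    masked = set(rng.sample(range(n), max(1, round(mask_ratio * n))))
    noised = set(rng.sample(range(n), max(1, round(noise_ratio * n))))
    dim = len(embeddings[0])
    # Assumption: a zero vector stands in for the masked embedding.
    masked_emb = [[0.0] * dim if i in masked else list(e)
                  for i, e in enumerate(embeddings)]
    # Gaussian noise is added only to the selected cells' coordinates.
    noisy_coords = [
        (x + rng.gauss(0.0, noise_std), y + rng.gauss(0.0, noise_std))
        if i in noised else (x, y)
        for i, (x, y) in enumerate(coords)
    ]
    return masked_emb, noisy_coords, masked, noised

def mse(pred, target):
    """Mean squared error over two equally shaped nested lists."""
    flat_p = [v for row in pred for v in row]
    flat_t = [v for row in target for v in row]
    return sum((p - t) ** 2 for p, t in zip(flat_p, flat_t)) / len(flat_p)
```

In this sketch, MCM would apply `mse` between predicted and original embeddings at the `masked` indices, and PDR would apply `mse` between the predicted distance matrix and the one computed from the clean coordinates.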
Loss & Training
Stage-wise training: For the first 2 epochs, the cell encoder is frozen and only the SE(2) Transformer is trained; in the 3rd epoch, both are jointly fine-tuned. Training takes approximately 20 days on 4×A100 GPUs.
Key Experimental Results
Main Results 1: Tissue Domain Semantic Segmentation (F1 Score)
| Model | ST Pre-training | Embryo Avg | Embryo Cross-slice | DLPFC Avg | DLPFC Cross-slice |
|---|---|---|---|---|---|
| scGPT | No Spatial | 0.7450 | 0.3947 | 0.6178 | 0.5885 |
| Geneformer | No Spatial | 0.7467 | 0.3745 | 0.5606 | 0.5440 |
| CellPLM | Expression+Spatial | 0.7722 | 0.3985 | 0.6219 | 0.5953 |
| SToFM | Multi-scale | 0.8046 | 0.4588 | 0.6535 | 0.6437 |
Main Results 2: Cell-type Annotation (Mouse Brain)
| Model | Brain1 Acc | Brain1 F1 | Brain2 Acc | Brain2 F1 |
|---|---|---|---|---|
| CellPLM | 0.6001 | 0.4186 | 0.9256 | 0.7332 |
| SToFM | 0.6349 | 0.4951 | 0.9289 | 0.8362 |
Ablation Study
| Ablation Variant | Gene | Micro | Macro | Embryo Cross-slice F1 | Brain1 F1 |
|---|---|---|---|---|---|
| Cell encoder w/o DA | ✗ | ✗ | ✗ | 0.3745 | 0.3853 |
| Cell encoder w/ DA | ✓ | ✗ | ✗ | 0.4155 | 0.4725 |
| SToFM w/o VCs | ✓ | ✓ | ✗ | 0.4291 | 0.4893 |
| SToFM | ✓ | ✓ | ✓ | 0.4588 | 0.4951 |
Key Findings
- In cross-slice settings, SToFM shows a larger advantage (Embryo cross-slice F1: 0.4588 vs 0.3985 for CellPLM), suggesting that multi-scale information improves generalization.
- Each scale contributes measurably, in F1 percentage points on Embryo cross-slice / Brain1: domain adaptation (+4.1/+8.7), micro-scale information (+1.4/+1.7), and macro-scale information (+3.0/+0.6).
- In zero-shot clustering, the UMAP visualizations of SToFM show that different cell types form clear, compact clusters.
- SToCorpus-88M is reported as the largest high-resolution ST pre-training corpus to date, about 1.6 times the size of the one used by Nicheformer.
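The per-scale gains quoted in the findings are differences between consecutive rows of the ablation table, expressed in F1 percentage points. A short check using only the table's values (the variable and function names are illustrative):

```python
# (variant, Embryo cross-slice F1, Brain1 F1), copied from the ablation table.
ABLATION = [
    ("Cell encoder w/o DA", 0.3745, 0.3853),
    ("Cell encoder w/ DA", 0.4155, 0.4725),
    ("SToFM w/o VCs", 0.4291, 0.4893),
    ("SToFM", 0.4588, 0.4951),
]

def stepwise_gains(rows):
    """Gain of each variant over the previous one, in F1 percentage points."""
    gains = []
    for (_, embryo_prev, brain_prev), (name, embryo, brain) in zip(rows, rows[1:]):
        gains.append((
            name,
            round((embryo - embryo_prev) * 100, 1),
            round((brain - brain_prev) * 100, 1),
        ))
    return gains

for name, embryo_gain, brain_gain in stepwise_gains(ABLATION):
    print(f"{name}: +{embryo_gain} / +{brain_gain}")
```

The three steps correspond to adding domain adaptation, the micro-scale SE(2) Transformer on subpatches, and macro-scale virtual cells, in that order.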
Highlights & Insights
- Elegant Multi-scale Design: The virtual cell mechanism compresses macro-scale information and injects it into micro-scale subpatches, so that all three scales are encoded within the Transformer's input sequence without the computational cost of processing the whole slice at once.
- Well-motivated SE(2) Invariance: Computational interpretation of tissue slices should be robust to rotation and translation, and the SE(2) Transformer provides this invariance by construction.
- Data Contribution: SToCorpus-88M covers 6 ST technologies, 2000 slices, and 88M cells, representing a highly valuable contribution to the research community.
Limitations & Future Work
- Limited to Three Scales: Finer-grained scales (such as the gene regulatory network level) are not modeled, nor are image pyramid methods utilized.
- Lack of Multimodal Integration: Fails to exploit pathological images or prior biological knowledge like known ligand-receptor pairs.
- Low-resolution ST Exclusion: Only single-cell or near single-cell resolution data was selected; low-resolution technologies like 10x Visium were not included in pre-training.
- High Computational Cost: Double forward passes of the cell encoder and multi-scale processing introduce significant computational overhead (4×A100 GPUs for 20 days of training).
Related Work & Insights
- Geneformer (Theodoris et al., 2023): Source of the cell encoder initialization for SToFM; it encodes each cell as a sequence of genes ranked by relative expression level.
- CellPLM (Wen et al., 2023): The first ST foundation model to integrate spatial information, utilizing only sinusoidal positional encoding; SToFM greatly expands upon this paradigm.
- Nicheformer (Schaar et al., 2024): Currently the largest jointly pre-trained ST/scRNA-seq model, but it ignores spatial coordinates entirely.
- Uni-Mol/AlphaFold: Successful application of SE(2)/SE(3) Transformers in molecules/proteins, from which SToFM adapts concepts for spatial transcriptomics.
Rating
- Novelty: ★★★★. The multi-scale design (especially the virtual cell mechanism) and the application of the SE(2) Transformer to ST are highly original.
- Experimental Thoroughness: ★★★★★. Covers 5 types of tasks (segmentation, annotation, clustering, deconvolution, imputation) plus ablation and visualization.
- Writing Quality: ★★★★. Clear multi-scale motivation and systematic, complete methodological descriptions.
- Value: ★★★★★. Dataset, model, and code are fully open-source, a substantial contribution to the spatial transcriptomics analysis field.