# Cross-Slice Knowledge Transfer via Masked Multi-Modal Heterogeneous Graph Contrastive Learning for Spatial Gene Expression Inference
**Conference:** CVPR 2026 · **arXiv:** 2603.22821 · **Code:** https://github.com/wenwenmin/SpaHGC · **Area:** Medical Image Analysis / Spatial Transcriptomics · **Keywords:** spatial transcriptomics, heterogeneous graph learning, cross-slice knowledge transfer, contrastive learning, gene expression prediction
## TL;DR
This paper proposes SpaHGC, a multimodal heterogeneous graph framework that constructs three types of subgraphs—intra-target-slice (TS), cross-slice (CS), and intra-reference-slice (RS)—and integrates masked graph contrastive learning with a cross-node dual attention mechanism to predict spatial gene expression from H&E histopathology images, achieving PCC improvements of 7.3%–27.1% across seven datasets.
## Background & Motivation
Background: Spatial transcriptomics (ST) technology enables precise quantification of the spatial distribution of gene expression in tissues, but its high experimental cost limits large-scale application. Predicting ST gene expression from H&E histopathology images has emerged as a promising alternative.
Limitations of Prior Work: (1) ST data are sparse and noisy—gene expression at certain spots may be missing or near zero; (2) existing methods model spatial structure within a single slice only, ignoring expression patterns shared across slices; (3) inter-individual variability and disease progression introduce cross-sample heterogeneity that single-slice models struggle to capture in a generalizable manner.
Key Challenge: Tissues of the same type or disease typically share common expression patterns, yet individual differences make direct cross-slice alignment difficult—how can shared information be effectively integrated while accommodating individual heterogeneity?
Goal: To model cross-slice spatial relationships and transfer prior knowledge from multiple reference slices to improve gene expression prediction on a target slice.
Key Insight: Constructing a multimodal heterogeneous graph that connects cross-slice spots via image embeddings from a pathology foundation model (UNI), and leveraging contrastive learning to enhance the robustness of learned representations.
Core Idea: Heterogeneous graph + cross-slice knowledge transfer + masked contrastive learning = more accurate gene expression prediction.
## Method
### Overall Architecture
SpaHGC operates in four stages: (1) extracting patch embeddings using the UNI pathology foundation model; (2) constructing three types of subgraphs—intra-target-slice (TS), cross-slice (CS), and intra-reference-slice (RS); (3) applying complementary masking to generate two augmented views, which are trained via a heterogeneous graph encoder with contrastive learning; and (4) producing gene expression predictions.
### Key Designs
- **Multimodal Heterogeneous Graph Construction:**
  - Target-slice (TS) graph: connects each spot in the target slice to its \(Q\) nearest neighbors by Euclidean distance, capturing local spatial continuity.
  - Cross-slice (CS) graph: for each target patch embedding \(\mathbf{z}_t^{(i)}\), retrieves the Top-K patches from the reference slice by cosine similarity, forming cross-slice edges. Reference nodes carry joint features \(\mathbf{h}_r = [\mathbf{z}_r \| \mathbf{y}_r]\) (visual + gene expression).
  - Reference-slice (RS) graph: Top-K connections among reference nodes based on cosine similarity of their joint features, forming a global semantic scaffold.
  - Design motivation: the three edge types respectively capture local spatial semantics, cross-slice morphological similarity, and global expression relationships within the reference slice, enabling multi-level information fusion.
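The three edge constructions reduce to plain k-NN and Top-K cosine retrieval. Below is a minimal numpy sketch under that reading; the function names, the dense pairwise computation, and the edge-list format are illustrative assumptions, not taken from the paper's code:

```python
import numpy as np

def knn_edges(coords, q):
    """TS graph: connect each spot to its q nearest spatial
    neighbors within the slice (Euclidean distance)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                      # exclude self-loops
    nbrs = np.argsort(d, axis=1)[:, :q]
    return [(i, int(j)) for i in range(len(coords)) for j in nbrs[i]]

def topk_cosine_edges(z_src, z_dst, k):
    """CS/RS graphs: connect each source embedding to its Top-k most
    cosine-similar destination embeddings (dense similarity for brevity)."""
    a = z_src / np.linalg.norm(z_src, axis=1, keepdims=True)
    b = z_dst / np.linalg.norm(z_dst, axis=1, keepdims=True)
    nbrs = np.argsort(-(a @ b.T), axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(z_src)) for j in nbrs[i]]
```

For the CS graph, `z_src` would be target patch embeddings and `z_dst` the reference joint features; for the RS graph, both sides are reference nodes. At real dataset scale, an approximate nearest-neighbor index would replace the dense similarity matrix.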
- **Cross Node Dual Attention (CNDA):**
  - Function: a bidirectional attention mechanism in which target nodes attend to reference nodes to acquire visual and gene-expression knowledge, while reference nodes attend to target nodes to update their own representations.
  - Core equations: \(\mathbf{A}_{t \leftarrow r} = \text{softmax}\left(\frac{\mathbf{Q}_t \mathbf{K}_r^\top}{\sqrt{d'}}\right)\), \(\bar{\mathbf{L}}_t = \mathbf{A}_{t \leftarrow r} \mathbf{V}_r\)
  - Design motivation: selective transfer. Dynamic attention weights let the model automatically pick the most relevant morphological and expression information from the reference slice while suppressing irrelevant cross-slice noise.
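Each direction of the dual attention is ordinary scaled dot-product cross-attention. A minimal numpy sketch of one direction (projection matrices and names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(h_query, h_kv, Wq, Wk, Wv):
    """One direction of CNDA, e.g. target <- reference:
    A = softmax(Q K^T / sqrt(d')), output = A V."""
    Q, K, V = h_query @ Wq, h_kv @ Wk, h_kv @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))  # rows sum to 1
    return A @ V

# The dual mechanism would run this twice with separate weights:
# target <- reference (import reference knowledge) and
# reference <- target (update reference representations).
```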
- **Cross Node Attention Pooling (CNAP):**
  - Function: multi-head unidirectional cross-node attention that aggregates reference-node representations into target-node representations via cross-attention.
  - Design motivation: compared with simple exemplar retrieval, CNAP dynamically aggregates auxiliary information according to the contextual semantics of each target node, enabling more flexible adaptation across different tissue regions.
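The multi-head aggregation can be sketched as follows. This is a simplified illustration that splits features into heads and uses identity per-head projections for brevity; a real implementation would learn separate projection matrices per head:

```python
import numpy as np

def multihead_pool(h_t, h_r, n_heads):
    """Multi-head unidirectional cross-node attention: each target node
    aggregates reference-node features per head; heads are concatenated.
    Identity per-head projections are an illustrative simplification."""
    d = h_t.shape[1] // n_heads
    outs = []
    for h in range(n_heads):
        Q = h_t[:, h * d:(h + 1) * d]
        K = V = h_r[:, h * d:(h + 1) * d]
        logits = Q @ K.T / np.sqrt(d)
        A = np.exp(logits - logits.max(axis=1, keepdims=True))
        A /= A.sum(axis=1, keepdims=True)            # per-row softmax
        outs.append(A @ V)                           # pooled reference info
    return np.concatenate(outs, axis=1)
```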
- **Complementary Masking Contrastive Learning:**
  - Function: node-type-specific feature masking is applied separately to target and reference nodes to generate two complementary views (\(\mathbf{M}_t^{(1)} + \mathbf{M}_t^{(2)} = \mathbf{1}\)), and the model is trained with a cosine-distance contrastive loss.
  - Design motivation: simulates the feature dropout and sequencing noise present in real ST data, forcing the model to learn representations that remain consistent under such corruption.
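The complementary-mask constraint \(\mathbf{M}^{(1)} + \mathbf{M}^{(2)} = \mathbf{1}\) means every feature is visible in exactly one of the two views. A minimal sketch (random elementwise masking is an assumption about the masking granularity):

```python
import numpy as np

def complementary_views(X, mask_ratio, rng):
    """Two complementary masked views of node features X:
    M1 + M2 = 1 elementwise, so each feature survives in
    exactly one view and is zeroed in the other."""
    M1 = (rng.random(X.shape) > mask_ratio).astype(X.dtype)
    M2 = 1.0 - M1
    return X * M1, X * M2
```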
### Loss & Training
- Contrastive loss: \(\mathcal{L}_{\text{con}} = \frac{1}{N} \sum_{j} \left(2 - 2 \cdot \text{Cos}(\hat{\mathbf{L}}_j^{(1)}, \hat{\mathbf{L}}_j^{(2)})\right)\)
- A regression loss supervises the gene expression predictions.
- The contrastive design is asymmetric: gradients are stopped on one view so that it serves as a stable target.
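For unit-normalized embeddings, \(2 - 2\cos(a, b)\) equals the squared Euclidean distance \(\|a - b\|^2\), so the loss is the mean squared distance between normalized view pairs. A minimal numpy sketch (in training, gradients would be stopped on one argument):

```python
import numpy as np

def contrastive_loss(L1, L2):
    """L_con = (1/N) * sum_j (2 - 2 * cos(L1_j, L2_j)).
    With L2-normalized rows this equals the mean squared
    distance ||a_j - b_j||^2 between the two views."""
    a = L1 / np.linalg.norm(L1, axis=1, keepdims=True)
    b = L2 / np.linalg.norm(L2, axis=1, keepdims=True)
    return float(np.mean(2.0 - 2.0 * np.sum(a * b, axis=1)))
```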
## Key Experimental Results
### Main Results (7 public ST datasets)
| Method | HER2+ PCC (%) | cSCC PCC (%) | Lymph Node PCC (%) | Pancreas2 PCC (%) |
|---|---|---|---|---|
| STNet | 5.61 | 9.2 | 3.4 | 31.56 |
| HisToGene | 7.89 | 17.56 | 19.24 | 26.13 |
| mclSTExp | 23.15 | 31.88 | 21.64 | 31.61 |
| M2OST | 18.24 | 24.88 | 30.97 | 38.35 |
| SpaHGC | 27.86 | 38.79 | 35.02 | 41.36 |
### Ablation Study
Component contributions are validated by progressively removing modules from the full SpaHGC:
- Removing CNDA → notable PCC drop, confirming the critical role of cross-slice attention.
- Removing the CS graph → significant PCC drop, confirming the necessity of cross-slice connectivity.
- Removing masking → reduced robustness, confirming the contribution of contrastive learning.
- Replacing UNI with ResNet → clear PCC decrease, confirming the importance of a strong pathology foundation model.
### Key Findings
- SpaHGC achieves PCC gains of 7.3%–27.1% across all seven datasets.
- Predicted expressions show significant enrichment in multiple cancer-related pathways, validating biological relevance.
- Cross-slice knowledge transfer generalizes across different platforms (10x Visium, ST 1000, etc.), tissue types, and cancer subtypes.
## Highlights & Insights
- Cross-slice knowledge transfer: Unlike existing methods that operate on a single slice, leveraging prior knowledge from multiple reference slices represents an important paradigm shift.
- Deep integration of the pathology foundation model (UNI): Establishing cross-slice connections via strong pretrained embeddings fully exploits large-scale pretraining knowledge.
- Biological downstream validation: Beyond numerical metrics, the work includes pathway enrichment analysis and other biological validations.
## Limitations & Future Work
- Multiple reference slices are required as training data, which may be limiting for tissue types with extremely scarce samples.
- The Top-K connections in graph construction depend on the quality of UNI embeddings.
- Fine-grained incorporation of spatial positional information during heterogeneous graph construction has not been explored.
## Related Work & Insights
- The contrastive alignment ideas in BLEEP and mclSTExp are instructive, but SpaHGC's heterogeneous graph framework is more flexible.
- The exemplar retrieval concept in EGGN bears similarity to SpaHGC's CS graph, but SpaHGC adopts a more systematic graph-structured approach.
- The complementary masking strategy is generalizable to other multimodal graph learning tasks.
## Rating
- Novelty: ⭐⭐⭐⭐ The idea of modeling cross-slice relationships via heterogeneous graphs is valuable, though the core components (GraphSAGE, attention, contrastive learning) are combinations of established techniques.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Seven datasets, nine baselines, and biological downstream analysis—highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Clear structure with rich figures and tables.
- Value: ⭐⭐⭐⭐ Makes a significant contribution to the spatial transcriptomics field; cross-slice knowledge transfer is a promising direction.