Dual-Path Knowledge-Augmented Contrastive Alignment Network for Spatially Resolved Transcriptomics¶
Conference: AAAI 2026 arXiv: 2511.17685 Code: coffeeNtv/DKAN Area: Medical Imaging / Spatial Transcriptomics Keywords: Spatial Transcriptomics, Gene Expression Prediction, Contrastive Learning, Multimodal Alignment, Knowledge Augmentation, Pathology Images
TL;DR¶
This paper proposes DKAN, a Dual-path Knowledge-Augmented contrastive Alignment Network that integrates semantic information from external gene databases as a cross-modal coordinator. Combined with a unified one-stage contrastive learning paradigm and an adaptive weighting mechanism, DKAN predicts spatially resolved gene expression from H&E-stained whole slide images (WSI), achieving state-of-the-art performance across three public ST datasets.
Background & Motivation¶
Background: Spatial Transcriptomics (ST) enables measurement of gene expression profiles in tissue sections while preserving spatial context, which is critical for understanding disease etiology and tissue heterogeneity. However, ST technologies are costly and resolution-limited, motivating researchers to explore predicting spatial gene expression from low-cost H&E-stained WSIs.
Limitations of Prior Work:

- Reliance on low-level visual features: Most methods exploit only low-level features such as pixel intensity (color distribution) and cellular structure (shape and texture), failing to capture high-level semantic information such as gene function, biological pathways, and disease associations.
- Over-dependence on exemplar retrieval: Contrastive learning and exemplar-guided pipelines require constructing additional reference datasets and retrieving similar patches, introducing unnecessary complexity.
- Insufficient heterogeneous modality alignment: Existing fusion strategies directly enforce alignment between heterogeneous modalities such as images and gene expression, failing to preserve biologically relevant interaction information.
Key Challenge: Image features and gene expression features belong to entirely heterogeneous modalities, making direct alignment difficult and prone to losing biological meaning; existing contrastive learning methods require additional retrieval steps that introduce pipeline redundancy.
Goal: Achieve effective multimodal alignment between images and gene expression without relying on exemplar retrieval, while incorporating high-level biological semantics to improve prediction accuracy.
Key Insight: External gene database knowledge is introduced as a "bridge" to indirectly align the two heterogeneous modalities of image and expression—rather than directly comparing apples and oranges, a shared knowledge intermediary is used to establish their relationship.
Core Idea: Gene semantic features serve as dynamic cross-modal coordinators, interacting with image and expression features along dual paths respectively, to achieve knowledge-guided implicit modality alignment.
Method¶
Overall Architecture¶
DKAN consists of four core modules:

1. Gene Semantic Representation: Gene knowledge is retrieved from the NCBI gene database; structured gene semantic text is generated by an LLM (GPT-4o) and encoded into 1024-dimensional features using BioBERT, then processed by a Transformer to obtain semantic features \(f^{text}\).
2. Gene Expression Embedding: The \(N_p \times N_g\) gene expression matrix is encoded into \(f^{exp}\) via a linear layer + GELU + residual connection + layer normalization.
3. Multi-level Image Embedding: Image features are extracted at three levels (WSI-level, region-level with k=25 neighboring patches, and patch-level) and fused via cross-attention into \(f^{img}\).
4. Dual-Path Contrastive Alignment: Gene semantic features serve as queries that attend to image and expression features respectively, producing knowledge-augmented representations \(e^{ti}\) and \(e^{te}\) for contrastive learning.
Key Designs¶
Gene Semantic Representation:

- Functional, pathway, and disease-association information for the \(N_g\) target genes is retrieved from the NCBI database.
- A structured prompt (including role definition, task requirements, and output specification) is designed for GPT-4o to generate normalized gene semantic text.
- BioBERT (pre-trained on large-scale biomedical corpora) encodes the text into 1024-dimensional vectors.
- After linear projection for dimension alignment, a standard Transformer captures global dependencies among the gene semantic embeddings.
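The projection-plus-Transformer step above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the model width, head count, and layer count here are assumed values, and the BioBERT vectors are simulated with random tensors.

```python
import torch
import torch.nn as nn

class GeneSemanticEncoder(nn.Module):
    """Sketch of the gene semantic branch: pre-computed 1024-d BioBERT
    embeddings are linearly projected for dimension alignment, then a
    standard Transformer encoder models global dependencies among genes.
    d_model, n_heads, and n_layers are illustrative assumptions."""
    def __init__(self, bert_dim=1024, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(bert_dim, d_model)  # dimension alignment
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, bert_embeddings):
        # bert_embeddings: (batch, N_g, bert_dim), frozen BioBERT outputs
        return self.encoder(self.proj(bert_embeddings))  # f_text

enc = GeneSemanticEncoder()
f_text = enc(torch.randn(1, 250, 1024))  # 250 target genes, as in the paper
print(f_text.shape)  # torch.Size([1, 250, 256])
```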
Multi-level Image Embedding:

- WSI-level and region-level features are extracted using UNI (a histopathology pre-trained foundation model) with frozen weights, each followed by a multi-head Transformer.
- Patch-level features are extracted by a trainable ResNet18 (with the final pooling and fully connected layers removed).
- Fusion strategy: two cross-attention modules in which WSI features serve as queries attending to region-level and patch-level features respectively; the results are summed to produce the final \(f^{img}\).
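The fusion strategy can be sketched with two `nn.MultiheadAttention` modules. The feature dimension and head count below are assumed for illustration; in the paper the inputs would come from frozen UNI encoders and a trainable ResNet18.

```python
import torch
import torch.nn as nn

class MultiLevelFusion(nn.Module):
    """Sketch of the multi-level image fusion: WSI-level features act as
    queries in two cross-attention modules, one attending over region-level
    features and one over patch-level features; the two outputs are summed
    into the fused image feature f_img. Dimensions are assumptions."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn_region = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_patch = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, f_wsi, f_region, f_patch):
        # f_wsi: (B, 1, d) query; f_region: (B, k, d); f_patch: (B, m, d)
        a, _ = self.attn_region(f_wsi, f_region, f_region)
        b, _ = self.attn_patch(f_wsi, f_patch, f_patch)
        return a + b  # fused f_img, same shape as the WSI query

fuse = MultiLevelFusion()
f_img = fuse(torch.randn(2, 1, 256),   # WSI-level
             torch.randn(2, 25, 256),  # region-level, k = 25 neighbors
             torch.randn(2, 9, 256))   # patch-level
print(f_img.shape)  # torch.Size([2, 1, 256])
```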
Dual-Path Contrastive Alignment:

- Image path: Gene semantic features act as "functional query instructions," filtering morphologically relevant regions from image features via cross-attention.
- Expression path: Gene semantic features act as "distribution correction factors," constraining predicted gene expression features via cross-attention.
- Each semantic feature independently queries the image and expression features, generating \(e^{ti}\) and \(e^{te}\).
- Key advantage: Rather than directly enforcing alignment between heterogeneous modalities, implicit alignment is achieved through independent interaction with semantic knowledge.
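The dual-path design reduces to two independent cross-attention modules that share the same semantic query. This is a hedged sketch with assumed dimensions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class DualPathAlignment(nn.Module):
    """Sketch of the dual-path alignment: gene semantic features f_text
    query image features (image path) and expression features (expression
    path) through two independent cross-attention modules, yielding the
    knowledge-augmented representations e_ti and e_te."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.img_path = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.exp_path = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, f_text, f_img, f_exp):
        e_ti, _ = self.img_path(f_text, f_img, f_img)  # semantics query image
        e_te, _ = self.exp_path(f_text, f_exp, f_exp)  # semantics query expression
        return e_ti, e_te

align = DualPathAlignment()
e_ti, e_te = align(torch.randn(2, 250, 256),  # f_text: one row per gene
                   torch.randn(2, 1, 256),    # f_img
                   torch.randn(2, 250, 256))  # f_exp
print(e_ti.shape, e_te.shape)  # both (2, 250, 256)
```

Because the two paths never attend to each other directly, alignment between image and expression arises only through their shared interaction with the semantic queries, which is the "bridge" idea described above.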
Unified One-Stage Contrastive Learning:

- All modalities are used during training; only the image and semantic modalities are used during inference.
- No reference-dataset construction or exemplar retrieval is required.
- Positive pairs: \(e^{ti}\) and \(e^{te}\) of the same gene; negative pairs: representations from different genes.
Loss & Training¶
The total loss consists of contrastive loss and supervised loss with adaptive weighting:
Contrastive Loss (InfoNCE form): \(\mathcal{L}_{cont} = -\sum_i \log \frac{\exp(\mathrm{sim}(e^{ti}_i, e^{te}_i)/\tau)}{\sum_j \exp(\mathrm{sim}(e^{ti}_i, e^{te}_j)/\tau)}\)
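The InfoNCE term can be checked with a small NumPy sketch. Cosine similarity and the temperature value 0.07 are common choices assumed here; the paper's exact similarity function and temperature may differ.

```python
import numpy as np

def info_nce(e_ti, e_te, tau=0.07):
    """InfoNCE over gene representations: row i of e_ti is pulled toward
    row i of e_te (same gene) and pushed away from all other rows
    (different genes). Cosine similarity and tau=0.07 are assumptions."""
    a = e_ti / np.linalg.norm(e_ti, axis=1, keepdims=True)
    b = e_te / np.linalg.norm(e_te, axis=1, keepdims=True)
    logits = a @ b.T / tau                       # (N_g, N_g) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).sum()              # sum of -log p(positive)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
aligned = info_nce(x, x)                          # perfectly matched pairs
mismatched = info_nce(x, rng.normal(size=(8, 16)))
print(aligned < mismatched)  # matched pairs give a much smaller loss
```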
Supervised Loss (with knowledge distillation): \(\mathcal{L}_{sup} = \sum_{d \in \mathcal{D}} \mathcal{L}_d + \|\hat{Y} - Y\|^2\)
where the distillation loss for each intermediate prediction is: \(\mathcal{L}_d = \lambda\|\hat{Y}_d - \hat{Y}\|^2 + (1-\lambda)\|\hat{Y}_d - Y\|^2\)
Adaptive Weighting: Weights are dynamically adjusted as the normalized inverse of each loss value, assigning higher weights to smaller losses to prevent any single objective from dominating training.
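The distillation term and the inverse-loss weighting can be written out directly. The value \(\lambda = 0.5\) below is an assumption for illustration; the weighting follows the "normalized inverse of each loss value" rule described above.

```python
import numpy as np

def distill_loss(y_d, y_final, y_true, lam=0.5):
    """L_d = lam * ||Y_d - Y_hat||^2 + (1 - lam) * ||Y_d - Y||^2,
    mixing the final prediction (teacher) with the ground truth.
    lam=0.5 is an assumed value."""
    return (lam * np.sum((y_d - y_final) ** 2)
            + (1 - lam) * np.sum((y_d - y_true) ** 2))

def adaptive_weights(losses):
    """Normalized inverse-loss weighting: each weight is 1/L_i divided by
    the sum of inverses, so smaller losses receive larger weights and no
    single objective dominates training."""
    inv = 1.0 / np.asarray(losses, dtype=float)
    return inv / inv.sum()

w = adaptive_weights([0.2, 0.8])
print(w)  # [0.8 0.2] -- the smaller loss gets the larger weight
```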
Experiments¶
Datasets¶
| Dataset | Samples | Patients | Spots | Genes/Spot | Type |
|---|---|---|---|---|---|
| HER2+ | 36 | 8 | 13,620 | 14,873 | Breast Cancer |
| STNET | 68 | 23 | 30,612 | 26,949 | Breast Cancer |
| cSCC | 12 | 4 | 8,671 | 17,047 | Cutaneous Squamous Cell Carcinoma |
Main Results (vs. 10 SOTA Methods)¶
HER2+ Dataset:
| Method | MAE↓ | MSE↓ | PCC-ALL↑ | PCC-HPG↑ | PCC-HEG↑ | PCC-HVG↑ |
|---|---|---|---|---|---|---|
| TRIPLEX (strongest baseline) | 0.364 | 0.234 | 0.304 | 0.491 | 0.271 | 0.260 |
| DKAN (Ours) | 0.361 | 0.224 | 0.330 | 0.531 | 0.317 | 0.304 |
STNET Dataset: DKAN achieves leading performance in MAE (0.322), MSE (0.179), and all PCC metrics.
cSCC Dataset: DKAN significantly outperforms all baselines in MAE (0.383), MSE (0.239), and PCC-ALL (0.407), with the largest improvement margin (PCC-ALL from 0.363 to 0.407).
Ablation Study¶
| Ablation | PCC-ALL Change | Description |
|---|---|---|
| Remove multi-scale context | 0.219 → 0.117 | Multi-level image features are critical |
| Remove gene semantic text | 0.219 → 0.210 | Semantic information provides effective biological priors |
| Remove contrastive learning | 0.219 → 0.209 | Contrastive learning improves cross-modal alignment quality |
| Text as KV | 0.219 → 0.216 | Text as Query yields better performance |
Encoder Selection: BioBERT > BioGPT > PLIP > Conch (text encoders); UNI > Conch > ResNet18 > ResNet50 > PLIP (image encoders)
LLM Comparison: GPT-4o > DeepSeek-v3 > LLaMA2 > DeepSeek-R1
Fusion Strategy: Cross-attention > Sum+Transformer > Concat+Transformer > Sum > Concat
Key Findings¶
- The incorporation of gene semantic knowledge yields consistent improvements across all datasets, validating the importance of high-level biological priors for gene expression prediction.
- The unified one-stage contrastive learning eliminates dependence on exemplar retrieval, simplifying the pipeline while improving performance.
- The dual-path design is more effective than direct heterogeneous modality alignment—implicit alignment mediated by semantic knowledge achieves higher quality.
- Visualization of cancer biomarker genes (FN1, HSPB1) demonstrates that DKAN accurately captures spatial expression patterns.
Highlights & Insights¶
- Knowledge Augmentation Paradigm: This work is the first to systematically integrate external gene database knowledge into spatial transcriptomics prediction, providing biological priors through LLM-generated structured gene semantic text.
- Bridge-based Alignment: Rather than directly aligning heterogeneous modalities, implicit alignment is achieved through a shared semantic knowledge space—a design philosophy with broad implications for other heterogeneous multimodal tasks.
- One-Stage Paradigm Simplification: Contrastive and supervised learning are unified into end-to-end training, eliminating the redundant exemplar retrieval step.
- Adaptive Loss Balancing: Dynamic adjustment of contrastive and supervised loss weights prevents optimization imbalance caused by differing convergence rates.
Limitations & Future Work¶
- The quality of gene semantic text depends on GPT-4o and the current state of the NCBI database, which may introduce biases or incompleteness.
- Experiments are validated on only three public ST datasets (two breast cancer + one skin cancer); generalizability to other tissue types and diseases remains to be tested.
- The frozen weights of WSI-level and region-level image encoders (UNI) may limit adaptation to specific tissue types.
- The selection of 250 spatially variable genes may restrict the method's applicability to other gene sets.
- Pre-generating gene description text via GPT-4o adds preprocessing overhead.
Related Work & Insights¶
- Local methods: ST-Net (DenseNet-121), EGN/EGGN (exemplar retrieval + graph convolution), SEPAL (neighborhood graph + GNN), BLEEP (CLIP-style contrastive learning), mclSTExp
- Global methods: HisToGene (ViT + positional encoding), HE2RNA (super-tile aggregation), THItoGene (dynamic convolution + capsule module + ViT + GATv2)
- Multi-scale methods: Hist2ST, TRIPLEX (multi-view feature combination), M2OST (many-to-one prediction), ST-Align (niche-level clustering + triple-objective alignment)
Rating¶
⭐⭐⭐⭐ (4/5)
- Novelty: ⭐⭐⭐⭐ — Knowledge augmentation combined with dual-path implicit alignment is a novel contribution
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive superiority over 10 baselines on three datasets with thorough ablations
- Writing Quality: ⭐⭐⭐⭐ — Clear structure with informative figures
- Value: ⭐⭐⭐⭐ — Code is provided and the pipeline is reproducible