
EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease

Conference: CVPR 2026 · arXiv: 2602.19178 · Code: Coming soon (including grounding annotations) · Area: Medical Imaging

Keywords: Alzheimer's disease diagnosis, multimodal vision-language models, evidence alignment, reinforcement fine-tuning, 3D brain segmentation

TL;DR

This paper proposes EMAD, an end-to-end multimodal vision-language framework for AD diagnosis that generates structured reports. Hierarchical Sentence–Evidence–Anatomy (SEA) Grounding explicitly links each diagnostic statement to clinical evidence and to 3D brain anatomy, and GRPO reinforcement fine-tuning with executable, rule-based rewards enforces clinical consistency.

Background & Motivation

Clinical diagnosis of Alzheimer's disease (AD) requires integrating multimodal data including structural MRI, neuropsychological tests, APOE genotype, and cerebrospinal fluid (CSF) biomarkers. Existing AI methods suffer from three core limitations:

Black-box problem: Most models output only labels or risk scores, without explaining "why this judgment was made" or "which evidence supports it."

Insufficient multimodal integration: Many methods operate on a single modality, ignoring cross-modal dependencies.

Disconnection from clinical guidelines: Reports generated by existing MLLMs rarely (i) link generated sentences to specific clinical entries, (ii) localize statements to 3D brain anatomical structures, or (iii) enforce adherence to diagnostic frameworks such as NIA-AA.

The core motivation of EMAD is to construct a transparent, traceable, and anatomically faithful AD report generation system where every diagnostic statement is supported by an evidence chain.

Method

Overall Architecture

EMAD consists of four core components: (1) multimodal encoders, (2) projection and fusion layers, (3) a text decoder for report generation, and (4) a hierarchical SEA Grounding head.

Input: Given \(\mathcal{X}=\{x_v, x_t\}\), where \(x_v \in \mathbb{R}^{D \times H \times W}\) is a 3D sMRI scan and \(x_t\) represents structured clinical variables (demographics, genetics, cognitive tests, CSF biomarkers, etc.).

  • Visual encoder \(E_v\): A 3D Vision Transformer extracts patch-level visual embeddings \(h_v\).
  • Text encoder \(E_t\): Longformer encodes clinical text features \(h_t\).
  • Bidirectional Cross-Attention Fusion (BCA): \(h_v'\) and \(h_t'\) are mapped to the same dimensional space via linear projection, then alternately serve as Q/KV:
\[\mathbf{A}_{t \to v} = \text{Attn}(h_t', h_v', h_v'), \quad \mathbf{A}_{v \to t} = \text{Attn}(h_v', h_t', h_t')\]

Residual connections preserve modality-specific information: \(z_v = h_v' + \mathbf{A}_{v \to t}\), \(z_t = h_t' + \mathbf{A}_{t \to v}\).

  • Text decoder: LLaMA 3.2-1B with rank-8 LoRA. The fused features \((z_v, z_t)\) replace the <sMRI> and <clinical> placeholders in the prompt, and structured reports are generated autoregressively.
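To make the BCA fusion step above concrete, here is a minimal PyTorch sketch; the shared dimension, head count, and use of nn.MultiheadAttention are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttentionFusion(nn.Module):
    """Sketch of BCA: each modality queries the other, and residual
    connections preserve modality-specific information."""

    def __init__(self, dim_v: int, dim_t: int, dim: int = 512, heads: int = 8):
        super().__init__()
        # Linear projections map both modalities into a shared space.
        self.proj_v = nn.Linear(dim_v, dim)
        self.proj_t = nn.Linear(dim_t, dim)
        self.attn_t2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_v2t = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, h_v: torch.Tensor, h_t: torch.Tensor):
        # h_v: (B, N_v, dim_v) visual patch embeddings from the 3D ViT
        # h_t: (B, N_t, dim_t) clinical text features from Longformer
        hv, ht = self.proj_v(h_v), self.proj_t(h_t)
        a_t2v, _ = self.attn_t2v(ht, hv, hv)  # A_{t->v}: text queries visual
        a_v2t, _ = self.attn_v2t(hv, ht, ht)  # A_{v->t}: visual queries text
        # Residuals: z_v = h_v' + A_{v->t}, z_t = h_t' + A_{t->v}
        return hv + a_v2t, ht + a_t2v
```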

Key Designs

  1. Sentence–Evidence–Anatomy (SEA) Grounding: A hierarchical evidence alignment mechanism.

    • Sentence-to-Evidence: Each generated sentence \(\hat{s}_i\) is matched to a clinical evidence set \(\mathcal{E}=\{e_1,\ldots,e_K\}\) in a many-to-many fashion. A multi-positive InfoNCE loss is applied bidirectionally (evidence→sentence + sentence→evidence):

    \(\mathcal{L}_{\text{SE}} = \frac{1}{N}\sum_{i=1}^{N}(\ell_i^{e \to s} + \ell_i^{s \to e})\)

    • Evidence-to-Anatomy: When evidence carries anatomical pointers, an evidence-conditioned 3D segmentation network localizes the corresponding brain regions. A lightweight cross-attention block is inserted after the self-attention layer at each decoder level of Segformer3D, enabling visual tokens to attend to evidence text tokens. The output is a voxel-level probability mask \(\hat{\mathbf{M}}_i = \sigma(\text{Head}(\mathbf{Y}^{(L)}))\), trained with Dice + BCE loss.
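To make the sentence-to-evidence objective \(\mathcal{L}_{\text{SE}}\) concrete, here is a minimal PyTorch sketch of a bidirectional multi-positive InfoNCE; the temperature value, embedding normalization, and per-direction averaging scheme are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def multi_positive_infonce(s_emb: torch.Tensor,
                           e_emb: torch.Tensor,
                           pos_mask: torch.Tensor,
                           tau: float = 0.07) -> torch.Tensor:
    """Bidirectional multi-positive InfoNCE over sentence/evidence pairs.

    s_emb:    (N, d) embeddings of generated sentences
    e_emb:    (K, d) embeddings of clinical evidence entries
    pos_mask: (N, K) binary matrix, 1 where evidence k supports sentence i
    """
    pos = pos_mask.float()
    s = F.normalize(s_emb, dim=-1)
    e = F.normalize(e_emb, dim=-1)
    logits = s @ e.t() / tau  # (N, K) cosine similarities scaled by temperature

    # Sentence -> evidence: softmax over evidence entries, then average the
    # negative log-probability over each sentence's positive set.
    log_p_s2e = F.log_softmax(logits, dim=1)
    l_s2e = -(log_p_s2e * pos).sum(1) / pos.sum(1).clamp(min=1)

    # Evidence -> sentence: softmax over sentences, symmetric treatment.
    log_p_e2s = F.log_softmax(logits, dim=0)
    l_e2s = -(log_p_e2s * pos).sum(0) / pos.sum(0).clamp(min=1)

    return l_s2e.mean() + l_e2s.mean()
```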
  2. GTX-Distill (Grounding Transfer Distillation): A label-efficient grounding distillation strategy.

    • Stage 1: A Teacher Grounder \(G_T\) is trained on a small annotated subset to learn the sentence→evidence distribution \(q(e|s_i)\) and anatomical masks.
    • Stage 2: \(G_T\) is frozen; a Student Grounder \(G_\theta\) is trained on large-scale model-generated reports via temperature-scaled KL divergence distillation:

    \(\mathcal{L}^{\text{distill}} = \tau^2 \sum_i \text{KL}(q_\tau(\cdot|\hat{s}_i) \| p_{\theta,\tau}(\cdot|\hat{s}_i))\)

With only 25% of grounding annotations, the student retains 95% of the teacher's R@3 performance.
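A sketch of the temperature-scaled KL distillation step follows; the logit shapes and the placement of the \(\tau^2\) factor follow the formula above, while everything else is illustrative:

```python
import torch
import torch.nn.functional as F

def gtx_distill_loss(teacher_logits: torch.Tensor,
                     student_logits: torch.Tensor,
                     tau: float = 2.0) -> torch.Tensor:
    """Temperature-scaled KL from the frozen teacher grounder to the student.

    Both tensors are (N, K): per-sentence logits over the K evidence entries.
    The tau**2 factor keeps gradient magnitudes comparable across temperatures.
    """
    q = F.softmax(teacher_logits.detach() / tau, dim=-1)   # teacher q_tau(.|s_i)
    log_p = F.log_softmax(student_logits / tau, dim=-1)    # student log p_{theta,tau}
    kl = (q * (q.clamp(min=1e-8).log() - log_p)).sum(-1)   # KL(q || p) per sentence
    return tau ** 2 * kl.mean()
```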

  3. Executable-Rule GRPO (Reinforcement Fine-Tuning): GRPO reinforcement learning based on verifiable rewards.

The total reward aggregates three executable components: \(R = w_F R_F + w_{\text{NIA}} R_{\text{NIA-AA}} + w_C R_{\text{consistency}}\)

  • Format reward \(R_F\): checks whether the three tags Reasoning / Diagnosis / Confidence are all present.
  • NIA-AA diagnostic reward \(R_{\text{NIA-AA}}\): covers category alignment (CN/MCI/Dementia), biomarker consistency (Aβ / tTau / pTau threshold checks), and clinical feature coverage.
  • Reasoning consistency reward \(R_{\text{consistency}}\): an NLI model verifies the entailment relationship Reasoning ⇒ Diagnosis to prevent logical contradictions.
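Below is a hypothetical sketch of how such an executable reward could be composed programmatically. The tag syntax, the Aβ42 cutoff, the default weights, and the nli_entails callable are all illustrative assumptions, not the paper's implementation:

```python
import re

def executable_rule_reward(report: str, biomarkers: dict,
                           w_f: float = 0.2, w_nia: float = 0.5, w_c: float = 0.3,
                           nli_entails=None) -> float:
    """Hypothetical composition of R = w_F R_F + w_NIA R_NIA-AA + w_C R_consistency."""
    # R_F: format reward -- all three report tags must be present.
    r_f = float(all(t in report
                    for t in ("<Reasoning>", "<Diagnosis>", "<Confidence>")))

    # R_NIA-AA: simplified rule checks (category alignment + biomarker consistency).
    m = re.search(r"<Diagnosis>(.*?)</Diagnosis>", report, re.S)
    diagnosis = m.group(1).strip() if m else ""
    category_ok = diagnosis in {"CN", "MCI", "Dementia"}
    # Illustrative rule: the amyloid status claimed in the report must match the
    # CSF Abeta42 value against a placeholder cutoff (not a validated threshold).
    amyloid_positive = biomarkers.get("abeta42", float("inf")) < 977.0
    biomarker_ok = ("A+" in report) == amyloid_positive
    r_nia = (float(category_ok) + float(biomarker_ok)) / 2.0

    # R_consistency: an external NLI model must judge Reasoning => Diagnosis
    # as entailment.
    reasoning = re.search(r"<Reasoning>(.*?)</Reasoning>", report, re.S)
    r_c = 0.0
    if nli_entails is not None and reasoning is not None and diagnosis:
        r_c = float(nli_entails(reasoning.group(1), diagnosis))

    return w_f * r_f + w_nia * r_nia + w_c * r_c
```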

Loss & Training

Three-stage progressive training:

  • Stage 1 (PT): Contrastive learning + reconstruction learning to align multimodal representations.
    • \(\mathcal{L}_{\text{PT}} = \mathcal{L}_{\text{itc}} + \lambda_{\text{res}}(\mathcal{L}_{\text{res}}^v + \mathcal{L}_{\text{res}}^t)\)
  • Stage 2 (SFT + GTX-Distill): Lower encoder layers are frozen; upper layers, projection layers, and decoder LoRA are fine-tuned.
    • \(\mathcal{L}_{\text{SFT}} = \mathcal{L}_{\text{txt}} + \lambda_{\text{KL}} \mathcal{L}^{\text{distill}}\)
  • Stage 3 (RFT): GRPO reinforcement fine-tuning with group size \(G=4\), clipping \(\epsilon=0.2\), and KL coefficient \(\beta=0.1\).
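For reference, here is a minimal sketch of the GRPO pieces named in Stage 3: group-relative advantage normalization plus the clipped surrogate with a KL penalty, using the stated \(G\), \(\epsilon\), and \(\beta\) (a sketch under standard GRPO conventions, not the paper's exact implementation):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: standardize the executable reward within
    each group of G sampled reports for the same input (G = 4 in the paper)."""
    mean = rewards.mean(dim=1, keepdim=True)   # rewards: (B, G)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def grpo_policy_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     adv: torch.Tensor,
                     kl_to_ref: torch.Tensor,
                     eps: float = 0.2,
                     beta: float = 0.1) -> torch.Tensor:
    """Clipped surrogate objective plus a KL penalty toward the frozen
    reference policy, using the paper's epsilon = 0.2 and beta = 0.1."""
    ratio = (logp_new - logp_old).exp()                  # importance ratios
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.min(ratio * adv, clipped * adv)
    return -surrogate.mean() + beta * kl_to_ref.mean()
```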

Key Experimental Results

Main Results

Dataset: AD-MultiSense (based on ADNI + AIBL, 10,378 samples / 2,619 subjects)

| Task | Metric | EMAD | Best baseline | Gain |
|---|---|---|---|---|
| CN vs CI | ACC | 93.33% | 91.28% (M3D-LaMed) | +2.05 |
| CN vs CI | AUC | 91.83% | 89.16% (M3D-LaMed) | +2.67 |
| CN vs CI | BERTScore | 0.9120 | 0.8748 (M3D-LaMed) | +0.0372 |
| CN vs MCI | ACC | 92.82% | 89.47% (M3D-LaMed) | +3.35 |
| CN vs MCI | AUC | 90.09% | 88.06% (M3D-LaMed) | +2.03 |
| Three-class | ACC / Macro-F1 | 89.4% / 87.8% | 84.7% / 82.5% (Alifuse) | +4.7 / +5.3 |

Report quality metrics (CN vs CI): BLEU 0.5422, METEOR 0.6790, ROUGE 0.7781 — substantially outperforming all baselines.

Ablation Study

| Configuration | ACC (CN vs CI) | AUC | Notes |
|---|---|---|---|
| sMRI only | 71.24% | 54.76% | A single visual modality is severely insufficient |
| Clinical only | 88.83% | 82.69% | The text modality contributes substantially |
| Image + Clinical (EMAD) | 93.33% | 91.83% | Multimodal fusion performs best |
| EMAD w/o RFT | 91.28% | – | Without reinforcement fine-tuning |
| + Format reward only | 91.45% | – | Format validity: 85.3% → 97.8% |
| + Format + NIA-AA | 92.10% | – | NIA-AA consistency: 74.1% → 86.7% |
| Full EMAD | 92.82% | – | Entailment rate: 68.2% → 87.6% |

Key Findings

  • Using visual features alone yields only 71.24% ACC (sensitivity is extremely high at 95.33%, but specificity is only 12.31%), indicating that the image-only model tends to predict nearly all subjects as positive.
  • GTX-Distill retains 95% R@3 with only 25% annotations, and nearly matches the fully supervised teacher at 50% annotations.
  • Evidence-conditioned segmentation improves hippocampal Dice from 0.78 to 0.84.
  • The NIA-AA standard yields slightly better results than IWG-2 (93.33 vs. 92.93 ACC).

Highlights & Insights

  • Evidence traceability: SEA Grounding achieves hierarchical interpretability along the chain "sentence → clinical evidence → 3D anatomy," providing dual evidence support for each diagnostic statement.
  • Label efficiency: GTX-Distill substantially reduces grounding annotation requirements by transferring the teacher's alignment capability to the student via KL distillation.
  • Verifiable reward design: Executable-Rule GRPO encodes clinical guidelines as programmatically verifiable reward functions (format / NIA-AA / entailment), eliminating the need for human preference annotations.
  • Refined training strategy: The three-stage progressive training (PT→SFT→RFT) incrementally builds alignment, faithfulness, and verifiability.

Limitations & Future Work

  • The dataset is based solely on ADNI + AIBL, limiting sample diversity (predominantly European/American white populations).
  • The 3D sMRI encoder uses a ViT-based architecture, incurring substantial computational cost for high-resolution whole-brain scans.
  • The NLI-based consistency reward depends on the quality of an external model, which may introduce noise.
  • Longitudinal temporal modeling has not yet been explored.
  • Validation is limited to the AD setting; generalization to other neurodegenerative diseases remains to be demonstrated.
Related Work

  • M3D-LaMed: A pioneering work combining 3D medical images with LLMs, but lacking grounding and clinical guideline constraints.
  • GRPO (DeepSeekMath): The original proposal of group relative policy optimization; EMAD adapts it to the medical domain and designs executable rewards.
  • BLIP / CLIP: EMAD's contrastive learning and momentum encoder design are inspired by BLIP.
  • Insight: The approach of formalizing clinical diagnostic guidelines as executable reward functions is generalizable to other medical scenarios with well-defined diagnostic criteria (e.g., tumor grading, cardiovascular risk assessment).

Rating

  • Novelty: ⭐⭐⭐⭐ SEA Grounding, GTX-Distill, and Executable-Rule GRPO each carry independent value; their combination forms a complete interpretable AD diagnosis framework.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-task evaluation (binary classification + three-class + report quality + grounding + ablation), though comparisons with more 3D medical VLMs are lacking.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with complete mathematical derivations, though some symbol definitions are scattered.
  • Value: ⭐⭐⭐⭐ An important advance in interpretable medical AI; the executable reward paradigm offers useful insights for medical RLHF.