EMAD: Evidence-Centric Grounded Multimodal Diagnosis for Alzheimer's Disease

Conference: CVPR 2026
arXiv: 2602.19178
Code: Coming soon (including grounding annotations)
Area: Medical Imaging
Keywords: Alzheimer's Disease diagnosis, Multimodal Vision-Language Models, Evidence Alignment, Reinforced Fine-Tuning, 3D Brain Segmentation

TL;DR

This paper proposes EMAD, an end-to-end multimodal vision-language framework that generates structured reports for AD diagnosis. It explicitly associates each diagnostic statement with clinical evidence and 3D brain anatomy through hierarchical Sentence–Evidence–Anatomy (SEA) Grounding, and enforces clinical consistency via GRPO reinforcement fine-tuning driven by executable rules.

Background & Motivation

Clinical diagnosis of Alzheimer's Disease (AD) requires the integration of multimodal data, including structural MRI (sMRI), neuropsychological tests, APOE genotypes, and CSF biomarkers. Existing AI methods face three core challenges:

Black-box Problem: Most models only output labels or risk scores, failing to explain why a decision was made and what evidence supports it.

Single-Modality Limitation: Many approaches still operate on a single modality, ignoring cross-modal dependencies.

Clinical Guideline Disconnection: Medical reports generated by existing MLLMs rarely (i) link generated sentences to specific clinical items, (ii) ground statements in 3D brain anatomy, or (iii) enforce diagnostic frameworks like NIA-AA.

The core motivation of EMAD is to build a transparent, traceable, and anatomically faithful AD report generation system where every diagnostic statement is supported by an evidence chain.

Method

Overall Architecture

EMAD targets the black-box problem, in which AD diagnostic models output labels without evidence. It has a multimodal VLM generate structured diagnostic reports while pinning every report sentence to clinical evidence and 3D brain regions. The framework consists of four parts: a multimodal encoder, a projection and fusion layer, a text decoder (report generation), and a hierarchical SEA Grounding head.

The input is \(\mathcal{X}=\{x_v, x_t\}\), where \(x_v \in \mathbb{R}^{D \times H \times W}\) is a 3D sMRI and \(x_t\) represents structured clinical variables. The visual encoder \(E_v\) (3D ViT) extracts patch-level embeddings \(h_v\), and the text encoder \(E_t\) (Longformer) encodes the clinical text into \(h_t\). Both are projected into a shared embedding space (giving \(h_v'\) and \(h_t'\)) and fused via Bi-directional Cross-Attention (BCA), in which the two modalities alternate between the query (Q) and key/value (KV) roles:

\[\mathbf{A}_{t \to v} = \text{Attn}(h_t', h_v', h_v'), \quad \mathbf{A}_{v \to t} = \text{Attn}(h_v', h_t', h_t')\]

Residual connections are used to preserve modality-specific information: \(z_v = h_v' + \mathbf{A}_{v \to t}\) and \(z_t = h_t' + \mathbf{A}_{t \to v}\). The fused features replace the <sMRI> and <clinical> placeholders in the prompt, and reports are generated autoregressively by LLaMA 3.2-1B + rank-8 LoRA. After report generation, the SEA Grounding head pins it sentence-by-sentence to evidence and brain regions. GTX-Distill and Executable-Rule GRPO are employed during training to make alignment capabilities transferable at low cost and to ensure the output adheres to clinical rules.
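The BCA fusion and residual update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes single-head scaled dot-product attention with no learned Q/K/V projections, and the token counts and dimension are toy values chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn(q, k, v):
    """Single-head scaled dot-product attention, Attn(Q, K, V)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def bca_fuse(h_v, h_t):
    """Bi-directional cross-attention with residual connections.

    h_v: (N_v, d) projected visual tokens  h_v'
    h_t: (N_t, d) projected clinical-text tokens  h_t'
    """
    a_t2v = attn(h_t, h_v, h_v)  # text queries attend to visual keys/values
    a_v2t = attn(h_v, h_t, h_t)  # visual queries attend to text keys/values
    z_v = h_v + a_v2t            # residual keeps visual-specific information
    z_t = h_t + a_t2v            # residual keeps text-specific information
    return z_v, z_t

# Toy sizes (assumed for illustration): 6 visual tokens, 4 text tokens, d = 8.
rng = np.random.default_rng(0)
h_v = rng.normal(size=(6, 8))
h_t = rng.normal(size=(4, 8))
z_v, z_t = bca_fuse(h_v, h_t)
print(z_v.shape, z_t.shape)
```

Each attention output has the shape of its query sequence, which is why \(\mathbf{A}_{v \to t}\) is added to \(h_v'\) and \(\mathbf{A}_{t \to v}\) to \(h_t'\).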

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
    A["Input: 3D sMRI + Structured Clinical Variables"] --> B["Multimodal Encoder<br/>3D ViT (Visual) + Longformer (Text)"]
    B --> C["Projection + Bi-directional Cross-Attention BCA<br/>Resulting in Fused Features z_v, z_t"]
    C --> D["Text Decoder<br/>LLaMA 3.2-1B + LoRA<br/>Generates Structured Report"]
    D --> SEA
    subgraph SEA["1. SEA Grounding (Hierarchical Evidence Alignment)"]
        direction TB
        E["Sentence → Evidence<br/>Multi-positive InfoNCE Matching"] --> F["Evidence → Anatomy<br/>Evidence-Conditioned 3D Segmentation Mask"]
    end
    SEA --> G["Traceable Diagnostic Report<br/>Sentence → Evidence → 3D Brain Region"]
    H["2. GTX-Distill<br/>Teacher → Student KL Distillation"] -. Label-efficient training of SEA alignment head .-> SEA
    I["3. Executable-Rule GRPO<br/>Format / NIA-AA / Reasoning Consistency Rewards"] -. RFT Reinforcement Fine-tuning of Decoder .-> D

Key Designs

1. Sentence–Evidence–Anatomy (SEA) Grounding: Pinning Diagnostic Sentences to Evidence and Anatomy

To address the black-box challenge, SEA decomposes interpretability into two levels of alignment. Sentence-to-Evidence performs many-to-many matching between each generated sentence \(\hat{s}_i\) and the clinical evidence set \(\mathcal{E}=\{e_1,\ldots,e_K\}\), using a bi-directional multi-positive InfoNCE loss \(\mathcal{L}_{\text{SE}} = \frac{1}{N}\sum_{i=1}^{N}(\ell_i^{e \to s} + \ell_i^{s \to e})\) to bring sentences closer to their supporting evidence. Evidence-to-Anatomy then localizes evidence with anatomical pointers to specific brain regions. A lightweight cross-attention block is inserted after each self-attention layer in the Segformer3D decoder, allowing visual tokens to attend to evidence text tokens, outputting voxel-level probability masks \(\hat{\mathbf{M}}_i = \sigma(\text{Head}(\mathbf{Y}^{(L)}))\), trained with Dice + BCE. This forms a dual-traceable evidence chain: "Sentence → Evidence → 3D Anatomy."
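To make the Sentence-to-Evidence objective concrete, here is a hedged NumPy sketch of one common way to write a bi-directional multi-positive InfoNCE loss. The summary only gives the outer form \(\mathcal{L}_{\text{SE}} = \frac{1}{N}\sum_i(\ell_i^{e \to s} + \ell_i^{s \to e})\); the per-sentence terms below (averaging log-probabilities over each sentence's positive evidence items), the cosine similarity, and the temperature `tau=0.07` are assumptions made for illustration.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def multi_positive_infonce(sent, evid, pos, tau=0.07):
    """sent: (N, d) sentence embeddings, evid: (K, d) evidence embeddings,
    pos: (N, K) boolean, pos[i, k] is True when e_k supports sentence s_i."""
    sent = sent / np.linalg.norm(sent, axis=1, keepdims=True)
    evid = evid / np.linalg.norm(evid, axis=1, keepdims=True)
    sim = sent @ evid.T / tau                 # (N, K) scaled cosine similarity
    n_pos = pos.sum(axis=1)                   # number of positives per sentence
    assert (n_pos > 0).all(), "every sentence needs at least one positive"
    logp_s2e = log_softmax(sim, axis=1)       # each sentence scores all evidence
    logp_e2s = log_softmax(sim, axis=0)       # each evidence scores all sentences
    l_s2e = -(logp_s2e * pos).sum(axis=1) / n_pos
    l_e2s = -(logp_e2s * pos).sum(axis=1) / n_pos
    return float((l_s2e + l_e2s).mean())      # (1/N) * sum_i (l_e2s_i + l_s2e_i)

# Toy example: sentence 0 is supported by evidence 0 and 3 (many-to-many).
rng = np.random.default_rng(0)
evid = rng.normal(size=(5, 16))
sent = evid[[0, 1, 2]].copy()
sent[0] = evid[0] + evid[3]
pos = np.zeros((3, 5), dtype=bool)
pos[0, [0, 3]] = True
pos[1, 1] = True
pos[2, 2] = True
loss_good = multi_positive_infonce(sent, evid, pos)
loss_bad = multi_positive_infonce(sent, evid, np.roll(pos, 1, axis=1))
print(loss_good < loss_bad)
```

The loss is lower when the positive mask matches the evidence that each sentence embedding actually resembles, which is the behaviour the alignment objective is meant to reward.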

2. GTX-Distill (Grounding Transfer Distillation): Retaining 95% Alignment Capability with 25% Annotation

Voxel-level grounding annotations are extremely expensive. GTX-Distill bypasses the need for full annotation through two-stage distillation. Stage 1 trains a Teacher Grounder \(G_T\) on a small annotated subset to learn the \(q(e|s_i)\) distribution and anatomical masks. Stage 2 freezes \(G_T\) and trains a Student Grounder \(G_\theta\) on reports generated by the large-scale model, using temperature-scaled KL divergence distillation \(\mathcal{L}^{\text{distill}} = \tau^2 \sum_i \text{KL}(q_\tau(\cdot|\hat{s}_i) \| p_{\theta,\tau}(\cdot|\hat{s}_i))\). As a result, only 25% grounding annotation is needed to maintain 95% of the teacher's R@3, significantly reducing annotation costs.
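The Stage 2 distillation term can be sketched as follows. This is an illustrative reading of \(\mathcal{L}^{\text{distill}}\), assuming that \(q_\tau\) and \(p_{\theta,\tau}\) are softmax distributions over the evidence items computed from teacher and student logits divided by \(\tau\); the logits and `tau=2.0` are made-up values.

```python
import numpy as np

def log_softmax_t(logits, tau):
    z = logits / tau
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def distill_loss(teacher_logits, student_logits, tau=2.0):
    """tau^2 * sum_i KL(q_tau(.|s_i) || p_tau(.|s_i)) over N sentences.

    Both inputs have shape (N, K): one row of evidence logits per sentence.
    """
    log_q = log_softmax_t(teacher_logits, tau)   # frozen teacher G_T
    log_p = log_softmax_t(student_logits, tau)   # trainable student G_theta
    kl_per_sentence = (np.exp(log_q) * (log_q - log_p)).sum(axis=1)
    return float(tau ** 2 * kl_per_sentence.sum())

teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student = np.array([[1.0, 1.0, 1.0], [0.0, 0.5, 2.0]])
print(distill_loss(teacher, student))
print(distill_loss(teacher, teacher))  # identical distributions give zero loss
```

The \(\tau^2\) factor is the usual correction in temperature-scaled distillation: softening the logits shrinks the gradients by roughly \(1/\tau^2\), and the factor restores their scale.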

3. Executable-Rule GRPO: Encoding Clinical Guidelines as Programmatically Verifiable Rewards

Medical reports must adhere to diagnostic frameworks, but manual preference annotations are costly and subjective. Clinical rules are therefore encoded as executable rewards for GRPO reinforcement fine-tuning. The total reward aggregates three verifiable components: \(R = w_F R_F + w_{\text{NIA}} R_{\text{NIA-AA}} + w_C R_{\text{consistency}}\). The format reward \(R_F\) checks whether the Reasoning/Diagnosis/Confidence tags are complete. The NIA-AA diagnostic reward \(R_{\text{NIA-AA}}\) checks category alignment (CN/MCI/Dementia), biomarker consistency (Aβ/tTau/pTau thresholds), and clinical feature coverage. The reasoning consistency reward \(R_{\text{consistency}}\) uses an NLI model to verify that the Reasoning entails the Diagnosis (Reasoning ⇒ Diagnosis), preventing logical contradictions. This reward suite injects compliance, faithfulness, and self-consistency directly into the model without manual preference labels.
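A rough sketch of how such an executable reward might be assembled is shown below. Only the weighted-sum structure and the three tag names come from the summary; the XML-style tag syntax, the binary scoring of \(R_F\), and the weight values are hypothetical. The NIA-AA and consistency scores are passed in as precomputed numbers because their thresholds and the NLI model are not specified here.

```python
import re

TAGS = ("Reasoning", "Diagnosis", "Confidence")

# Placeholder weights (assumed); the summary does not state w_F, w_NIA, w_C.
WEIGHTS = {"format": 0.2, "nia_aa": 0.5, "consistency": 0.3}

def format_reward(report: str) -> float:
    """R_F: 1.0 if every tag appears as a non-empty <Tag>...</Tag> block."""
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", report, flags=re.DOTALL)
        if match is None or not match.group(1).strip():
            return 0.0
    return 1.0

def total_reward(report: str, r_nia_aa: float, r_consistency: float,
                 weights=WEIGHTS) -> float:
    """R = w_F * R_F + w_NIA * R_NIA-AA + w_C * R_consistency."""
    return (weights["format"] * format_reward(report)
            + weights["nia_aa"] * r_nia_aa
            + weights["consistency"] * r_consistency)

good = ("<Reasoning>Hippocampal atrophy with reduced memory scores.</Reasoning>"
        "<Diagnosis>MCI</Diagnosis><Confidence>0.8</Confidence>")
bad = "<Reasoning>Hippocampal atrophy.</Reasoning><Diagnosis>MCI</Diagnosis>"

print(format_reward(good), format_reward(bad))
print(total_reward(good, r_nia_aa=1.0, r_consistency=0.5))
```

Because every component is a deterministic function of the generated text (plus an NLI score), the reward can be recomputed and audited for any sampled report.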

Loss & Training

Progressive three-stage training:

  • Stage 1 (PT): Contrastive learning + reconstruction learning to align multimodal representations: \(\mathcal{L}_{\text{PT}} = \mathcal{L}_{\text{itc}} + \lambda_{\text{res}}(\mathcal{L}_{\text{res}}^v + \mathcal{L}_{\text{res}}^t)\).
  • Stage 2 (SFT + GTX-Distill): Freezing the lower layers of the encoder while fine-tuning the top layers, projection layer, and decoder LoRA: \(\mathcal{L}_{\text{SFT}} = \mathcal{L}_{\text{txt}} + \lambda_{\text{KL}} \mathcal{L}^{\text{distill}}\).
  • Stage 3 (RFT): GRPO reinforcement fine-tuning with group size \(G=4\), clipping \(\epsilon=0.2\), and KL coefficient \(\beta=0.1\).
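For Stage 3, the following sketch shows how the stated hyperparameters (\(G=4\), \(\epsilon=0.2\), \(\beta=0.1\)) would enter a GRPO-style objective. It is a simplification under stated assumptions: log-probabilities are treated at the sequence level rather than per token, the rewards are made-up numbers, and the KL term uses the non-negative estimate \(\frac{\pi_{\text{ref}}}{\pi_\theta} - \log\frac{\pi_{\text{ref}}}{\pi_\theta} - 1\) commonly paired with GRPO.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of G sampled reports."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def grpo_objective(logp_new, logp_old, logp_ref, rewards,
                   clip_eps=0.2, beta=0.1):
    """Clipped surrogate minus a KL penalty, averaged over the group."""
    logp_new, logp_old, logp_ref = map(np.asarray, (logp_new, logp_old, logp_ref))
    adv = group_advantages(rewards)
    ratio = np.exp(logp_new - logp_old)                  # pi_theta / pi_old
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    log_ref_ratio = logp_ref - logp_new                  # log(pi_ref / pi_theta)
    kl = np.exp(log_ref_ratio) - log_ref_ratio - 1.0     # always >= 0
    return float((surrogate - beta * kl).mean())

# One prompt, G = 4 sampled reports with rule-based rewards (illustrative values).
rewards = [0.9, 0.4, 0.4, 0.7]
adv = group_advantages(rewards)
logp = np.array([-12.0, -15.0, -14.0, -13.0])
objective_at_init = grpo_objective(logp, logp, logp, rewards)
print(adv)
print(objective_at_init)
```

Because advantages are computed relative to the other samples in the same group, no separate value network is needed; a report is reinforced only to the extent that its rule-based reward beats its siblings.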

Key Experimental Results

Main Results

Dataset: AD-MultiSense (based on ADNI + AIBL, 10,378 samples / 2,619 subjects)

| Task | Metric | EMAD | M3D-LaMed (best baseline) | Gain |
|---|---|---|---|---|
| CN vs CI | ACC | 93.33% | 91.28% | +2.05 |
| CN vs CI | AUC | 91.83% | 89.16% | +2.67 |
| CN vs CI | BERTScore | 0.9120 | 0.8748 | +0.037 |
| CN vs MCI | ACC | 92.82% | 89.47% | +3.35 |
| CN vs MCI | AUC | 90.09% | 88.06% | +2.03 |
| Tri-classification | ACC / Macro-F1 | 89.4% / 87.8% | 84.7% / 82.5% (Alifuse) | +4.7 / +5.3 |

Report quality metrics (CN vs CI): BLEU 0.5422, METEOR 0.6790, ROUGE 0.7781 — significantly outperforming all baselines.

Ablation Study

| Configuration | ACC (CN vs CI) | AUC | Description |
|---|---|---|---|
| sMRI only | 71.24% | 54.76% | Visual unimodality is severely insufficient |
| Clinical only | 88.83% | 82.69% | Text modality contributes significantly |
| Image + Clinical (EMAD) | 93.33% | 91.83% | Multimodal fusion is optimal |
| EMAD w/o RFT | 91.28% | | No reinforcement fine-tuning |
| + Format reward only | 91.45% | | Format validity 85.3 → 97.8% |
| + Format + NIA-AA | 92.10% | | NIA-AA consistency 74.1 → 86.7% |
| Full EMAD | 92.82% | | Entailment 68.2 → 87.6% |

Key Findings

  • Using only visual features yields an ACC of just 71.24% (SEN 95.33% but SPE only 12.31%), indicating that the image-only model predicts nearly all cases as positive.
  • GTX-Distill retains 95% R@3 with only 25% annotation and matches the fully supervised teacher with 50% annotation.
  • Evidence-conditioned segmentation improved the hippocampal Dice score from 0.78 to 0.84.
  • Supervision under NIA-AA standards performed slightly better than IWG-2 standards (93.33 vs 92.93 ACC).
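The first finding can be checked with a short piece of arithmetic. The sketch below assumes that ACC, SEN, and SPE for the sMRI-only model were measured on the same test split with CI as the positive class, so that ACC = SEN · p + SPE · (1 − p), where p is the fraction of positive cases. The value of p is not stated in the summary; it is only inferred from the three reported numbers.

```python
def accuracy(sen: float, spe: float, prevalence: float) -> float:
    """Accuracy as a prevalence-weighted mix of sensitivity and specificity."""
    return sen * prevalence + spe * (1.0 - prevalence)

def implied_prevalence(acc: float, sen: float, spe: float) -> float:
    """Solve ACC = SEN * p + SPE * (1 - p) for the positive fraction p."""
    return (acc - spe) / (sen - spe)

# Reported sMRI-only figures: ACC 71.24%, SEN 95.33%, SPE 12.31%.
p = implied_prevalence(0.7124, 0.9533, 0.1231)
always_positive_acc = accuracy(sen=1.0, spe=0.0, prevalence=p)
print(round(p, 3), round(always_positive_acc, 3))
```

Under that assumption, roughly 71% of the evaluated cases are positive, so a trivial classifier that labels every case positive would reach about the same accuracy as the sMRI-only model. This is consistent with the near-chance AUC of 54.76% in the ablation table.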

Highlights & Insights

  • Evidence Chain Traceability: SEA Grounding achieves hierarchical interpretability from "Sentence → Clinical Evidence → 3D Anatomy," providing dual evidence support for each diagnostic statement.
  • Label Efficiency: GTX-Distill significantly reduces the requirement for grounding annotations, transferring teacher alignment capabilities to the student via KL distillation.
  • Verifiable Reward Design: Executable-Rule GRPO encodes clinical guidelines into programmatically verifiable reward functions (Format/NIA-AA/Entailment), eliminating the need for manual preference labels.
  • Sophisticated Training Strategy: Successive three-stage training (PT → SFT → RFT) builds alignment, faithfulness, and verification capabilities step-by-step.

Limitations & Future Work

  • The dataset is limited to ADNI + AIBL, with restricted sample diversity (primarily Western Caucasian populations).
  • 3D sMRI encoding utilizes a ViT-based architecture, which incurs high computational overhead for high-resolution whole-brain scans.
  • NLI-based consistency rewards depend on external model quality and may introduce noise.
  • Longitudinal data modeling for temporal analysis has not yet been explored.
  • Validated only in AD scenarios; generalization to other neurodegenerative diseases remains to be tested.

Related Work & Insights

  • M3D-LaMed: Pioneering work in 3D medical images + LLMs, but lacks grounding and clinical guideline constraints.
  • GRPO (DeepSeekMath): The original proposal for group relative policy optimization; EMAD adapts it to medical scenarios with executable rewards.
  • BLIP / CLIP: The contrastive learning and momentum encoder design in EMAD is inspired by BLIP.
  • Insight: The approach of formalizing clinical diagnostic guidelines into executable reward functions can be generalized to other medical scenarios with clear diagnostic criteria (e.g., tumor grading, cardiovascular risk assessment).

Rating

  • Novelty: ⭐⭐⭐⭐ SEA Grounding, GTX-Distill, and Executable-Rule GRPO provide independent value; their combination creates a complete interpretable AD diagnostic framework.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-task evaluation (Binary/Tri-classification/Report Quality/Grounding/Ablation), though comparisons with more 3D medical VLMs are missing.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure and complete formula derivations, though some symbol definitions are scattered.
  • Value: ⭐⭐⭐⭐ Significant progress in interpretable medical AI; the executable reward concept is insightful for medical RLHF.