ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation
Conference: ICML 2026
arXiv: 2602.04279
Code: Code is available at here
Area: Medical Imaging / Multimodal VLM / Reinforcement Learning
Keywords: ECG Interpretation, Medical MLLM, Protocol-Guided, Modality Dropout, Process-Reward RL
TL;DR
ECG-R1 is presented as the first reasoning-oriented medical multimodal large language model (MLLM) for ECG interpretation. It combines four components: protocol-guided instruction data synthesis, decoupled signal/image encoding, Interleaved Modality Dropout (IMD) training, and evidence-based process-reward RL. Together they raise diagnostic accuracy from 74.7 (GEM, the previous SOTA) to 80.3 while keeping interpretations consistent when either modality is missing.
Background & Motivation
Background: Current mainstream approaches delegate ECG interpretation to two types of models: general or medical MLLMs (e.g., GPT-5.1, MedGemma), which usually view only ECG images, or specialized ECG MLLMs (e.g., PULSE, GEM) that incorporate 12-lead time-series signals for omni-perception. Both paths follow the "VLM + fine-tuning" paradigm, adopting general multimodal training methodologies.
Limitations of Prior Work: Systematic evaluations on the ECG-Grounding test set reveal two concerning issues: First, even flagship models like GPT-5.1 achieve only 31.5 diagnostic accuracy and produce "hallucinated" interpretations that look structured and professional but are clinically incorrect. Second, omni models like GEM suffer significant performance drops when one modality (signal or image) is missing at inference time, and interpretations for the same ECG under different modalities are contradictory, with a BLEU-4 of only 0.33.
Key Challenge: Existing training corpora are inherently unreliable. Datasets like ECG-Grounding are generated by prompting LLMs to derive interpretations from diagnostic labels; thus, responses rely on pre-training priors rather than actual ECG diagnostic rules, embedding numerous clinical errors. SFT merely makes models more proficient in these incorrect causal chains. Furthermore, omni architectures that stuff time-series tokens into <image> placeholders and reuse the same image-language projector assume both modalities must co-occur, making single-modality inference unnatural and constrained.
Goal: To address these three issues: (1) create interpretation corpora that truly follow clinical protocols; (2) ensure model stability and consistency when any modality is missing; (3) reward the reasoning process itself rather than just the final answer.
Key Insight: ECG has established diagnostic protocols (e.g., Chapter 23 of ECG from Basics to Essentials breaks the process into five steps). These explicit rules can "constrain LLM generation of training data," hard-coding medical priors into synthesis prompts. Additionally, as images and time-series are two renderings of the same waveform, the cross-modality divergence \(\Delta_{\text{view}}\) should theoretically be minimal, providing a natural basis for "cross-modality exchange invariance."
Core Idea: Use protocols to inject medical rules into data, use IMD to embed robustness and consistency into training objectives, and use EDER to incorporate process evidence into RL rewards. All three layers align with "verifiable clinical evidence."
Method
Overall Architecture
The input to ECG-R1 is a triplet \((x^{\text{text}}, x^I, x^T)\) representing the text instruction, the rendered ECG image, and the 12-lead time-series signal. The output is a structured interpretation \(y\), consisting of a <think> block (six-step protocol reasoning), a brief summary, and an <answer> block (final diagnosis). The pipeline includes: FeatureDB extraction → Protocol-guided training corpora generation → Decoupled dual encoders mapping image/time-series to a shared LLM space → Two-stage training (SFT + RL) with Interleaved Modality Dropout (IMD). The LLM backbone is Qwen3-VL-8B, and the time-series encoder is ECG-CoCa.
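To make the output schema concrete, here is a minimal Python sketch that splits an interpretation into its three parts. The closing tags `</think>` and `</answer>`, the name `parse_interpretation`, and the sample text are assumptions for illustration; the source only names the `<think>` and `<answer>` blocks and the summary between them.

```python
import re
from typing import NamedTuple, Optional


class Interpretation(NamedTuple):
    think: str    # protocol reasoning steps
    summary: str  # brief summary between the two blocks
    answer: str   # final diagnosis


# Assumed layout: <think>...</think> summary <answer>...</answer>
_PATTERN = re.compile(
    r"<think>(?P<think>.*?)</think>(?P<summary>.*?)<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)


def parse_interpretation(text: str) -> Optional[Interpretation]:
    """Return the three parts, or None if the text does not follow the schema."""
    m = _PATTERN.search(text)
    if m is None:
        return None
    return Interpretation(*(m.group(k).strip() for k in ("think", "summary", "answer")))


# Hypothetical model output, for illustration only.
example = (
    "<think>Step 1: regular rhythm at 72 bpm.</think>\n"
    "Normal rate and rhythm.\n"
    "<answer>Sinus rhythm</answer>"
)
parsed = parse_interpretation(example)
```

A check of this kind is one plausible basis for a format reward, although the source does not say how \(R_{\text{format}}\) is computed.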
Key Designs
- Protocol-Guided Instruction Data Generation:
    - Function: Generates 30,000 instruction-answer pairs following medical protocols from MIMIC-IV-ECG as the primary SFT corpus.
    - Mechanism: A deterministic, non-trainable FeatureDB extracts 14 types of physiological features across 12 leads (heart rate, RR, P/QRS/T amplitudes and durations, PR/QT/QTc, ST descriptors, etc.), denoted as \(\boldsymbol{x}^{fs} = \mathrm{FeatureDB}(\boldsymbol{x}^T)\). Then, a prompt \(\boldsymbol{x}^p = \mathrm{ProtocolGuider}(\boldsymbol{x}^{fs}, x^{\text{protocol}})\) based on the five-step protocol (Rate&Rhythm → Conduction&Axis → Hypertrophy → Ischemia → Electrolytes&QT, with mandatory differential exclusion) is fed to DeepSeek-V3.1-Terminus to generate interpretations. The model is forced to output according to a fixed schema: six-step <think> + summary + <answer>.
    - Design Motivation: Compared to "bare prompts" in ECG-Grounding, protocol guidance explicitly injects quantitative thresholds and differential rules into generation constraints, suppressing hallucinations and uncovering anomalies missed in original reports.
- Decoupled Modalities Encoding + Interleaved Modality Dropout (IMD):
    - Function: Enables the same LLM to provide robust and consistent interpretations across four environments: (signal+image), signal only, image only, and swapped modality order.
    - Mechanism: The architecture introduces an explicit <ecg> tag before <image>. Time-series and images are processed through independent projectors: \(z^T = \mathrm{Proj}_T(\mathrm{Encoder}_T(x^T))\) and \(z^I = \mathrm{Proj}_I(\mathrm{Encoder}_I(x^I))\). During training, transformations \(\tau \in \mathcal{T}_{\text{test}}=\{\tau_I, \tau_T, \tau_{IT}, \tau_{TI}\}\) are sampled from distribution \(q\) to minimize the mixed risk \(R_q(\theta)=\mathbb{E}_{\tau\sim q}[R_\tau(\theta)]\). Under the coverage assumption \(q(\tau)\geq\alpha\), it is proven that \(R_{\max}(\theta) \leq \alpha^{-1} R_q(\theta)\), keeping divergence \(\mathcal{F}(\theta)\) at the \(\Delta_{\text{view}}+\sqrt{\varepsilon_{\tau_I}/2}+\sqrt{\varepsilon_{\tau_T}/2}\) level.
    - Design Motivation: Existing omni methods perform ERM only under the "both modalities present + fixed order" setting. IMD integrates all target test environments into the training objective without requiring extra alignment losses or generative modules.
- EDER: Evidence-based Diagnostic Evidence Reward RL:
    - Function: Refines the reasoning chain post-SFT, ensuring each step in the <think> block is rewarded for "hitting key diagnostic evidence."
    - Mechanism: DeepSeek-V3.1-Terminus extracts key evidence phrases \(\mathcal{E}_k(y)\) from reference traces of RL samples. Step-level reward is defined as \(r^{(k)}_{\text{step}}=|\mathrm{match}(\mathcal{E}_k(y), \tilde{y}^{(k)})|/|\mathcal{E}_k(y)|\), and process reward as \(R_{\text{EDER}}=\frac{1}{K}\sum_k r^{(k)}_{\text{step}}\). Accuracy reward \(R_{\text{accuracy}}\) uses Jaccard similarity. The total reward \(R_{\text{total}} = R_{\text{format}} + R_{\text{accuracy}} + \lambda R_{\text{EDER}}\) is optimized via DAPO.
    - Design Motivation: General reasoning RL only evaluates format and final answers. EDER injects "must-mention key findings" signals into each step to suppress reasoning hallucinations.
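The EDER reward formulas can be illustrated with a small Python sketch. It assumes that `match` is a case-insensitive substring test, that steps without reference evidence score 0, and that \(\lambda\) defaults to 1.0; none of these choices are stated in the source, and the evidence phrases below are made up for illustration.

```python
from typing import List, Set


def step_reward(evidence: List[str], step_text: str) -> float:
    """Fraction of reference evidence phrases found in one generated step."""
    if not evidence:
        return 0.0  # assumed convention when a step has no reference evidence
    text = step_text.lower()
    hits = sum(1 for phrase in evidence if phrase.lower() in text)
    return hits / len(evidence)


def eder_reward(evidence_per_step: List[List[str]], generated_steps: List[str]) -> float:
    """R_EDER: mean step reward over the K reference steps."""
    k = len(evidence_per_step)
    if k == 0:
        return 0.0
    total = 0.0
    for i, evidence in enumerate(evidence_per_step):
        # A missing generated step is treated as empty text.
        step_text = generated_steps[i] if i < len(generated_steps) else ""
        total += step_reward(evidence, step_text)
    return total / k


def jaccard(pred: Set[str], ref: Set[str]) -> float:
    """Jaccard similarity between predicted and reference diagnosis sets."""
    if not pred and not ref:
        return 1.0  # assumed convention for two empty sets
    return len(pred & ref) / len(pred | ref)


def total_reward(r_format: float, r_accuracy: float, r_eder: float, lam: float = 1.0) -> float:
    """R_total = R_format + R_accuracy + lambda * R_EDER (lam value is an assumption)."""
    return r_format + r_accuracy + lam * r_eder


# Hypothetical evidence and generated steps.
evidence = [["heart rate 72", "regular rhythm"], ["PR interval 160 ms"]]
steps = ["Heart rate 72 bpm with regular rhythm.", "QRS is narrow."]
r_eder = eder_reward(evidence, steps)                         # step rewards 1.0 and 0.0
r_acc = jaccard({"sinus rhythm", "lvh"}, {"sinus rhythm"})    # 1 shared label of 2
r_total = total_reward(1.0, r_acc, r_eder, lam=0.5)
```

Because the reward only needs phrase extraction and matching, it can be computed per rollout without a learned reward model.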
Loss & Training
Training has two stages. The SFT stage runs one epoch of teacher forcing on \(\mathcal{D}_{\text{SFT}}\), minimizing \(\mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{SFT}}}[-\log\pi_\theta(y|x)]\) with IMD applied. The RL stage runs DAPO on a subset of \(|\mathcal{D}_{\text{RL}}|=3{,}948\) samples, maximizing \(J(\theta)=\mathbb{E}[\frac{1}{N}\sum_{i,t}\min(r_{i,t}\hat{A}_i,\ \tilde{r}_{i,t}\hat{A}_i)]\), where \(r_{i,t}\) is the token-level importance ratio, \(\tilde{r}_{i,t}\) its clipped version, and \(\hat{A}_i\) the group-normalized advantage of response \(i\). IMD is applied in this stage as well.
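How IMD enters a training step can be sketched as follows: for each example, one of the four views is drawn from \(q\) and only the selected modalities are kept, in the selected order. The uniform \(q\), the view labels, and the helper names `sample_view` and `apply_view` are assumptions for illustration; the source does not specify the sampling distribution or the data-loading code.

```python
import random
from typing import Dict, List, Tuple

# The four views: image only, signal only, image-then-signal, signal-then-image.
VIEWS = ("I", "T", "IT", "TI")

# Assumed uniform sampling distribution; it satisfies q(tau) >= alpha with alpha = 0.25.
UNIFORM_Q: Dict[str, float] = {v: 0.25 for v in VIEWS}


def sample_view(q: Dict[str, float], rng: random.Random) -> str:
    """Draw a view tau ~ q."""
    names = list(q)
    return rng.choices(names, weights=[q[n] for n in names], k=1)[0]


def apply_view(view: str, image: object, signal: object) -> List[Tuple[str, object]]:
    """Return the ordered (tag, payload) pairs that remain under this view."""
    payload = {"I": ("image", image), "T": ("ecg", signal)}
    return [payload[c] for c in view]


rng = random.Random(0)
drawn = [sample_view(UNIFORM_Q, rng) for _ in range(8)]
inputs = apply_view("TI", image="img_placeholder", signal="sig_placeholder")
```

Each sampled view changes only which modality tokens reach the LLM; the loss itself (SFT cross-entropy or the DAPO objective) is unchanged.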
Key Experimental Results
Main Results
Models are evaluated on the ECG-Grounding test set (2,381 cases), with outputs scored by DeepSeek-V3.1-Terminus across seven rubrics. Four cardiologists additionally perform a blind evaluation of 100 cases.
| Model Category | Model | Diagnosis Acc | Analysis Completeness | Lead Evidence Validity | Clinical Diagnostic Fidelity |
|---|---|---|---|---|---|
| Closed-source Flagship | GPT-5.1-Instant | 31.48 | 3.03 | 1.92 | 43.46 |
| Medical MLLM | MedGemma-27B | 25.23 | 3.20 | 0.81 | 39.22 |
| ECG-Specific | PULSE | 66.13 | 1.90 | 0.19 | 40.53 |
| ECG-Specific | GEM (Prev. SOTA) | 74.70 | 4.25 | 4.41 | 62.90 |
| Ours | ECG-R1 (SFT) | 79.33 | 6.36 | 5.53 | 83.51 |
| Ours | ECG-R1 (RL) | 80.29 | 6.51 | 5.81 | 84.20 |
Diagnostic accuracy improves by 5.6 absolute points over GEM (74.7 → 80.3), and clinical diagnostic fidelity increases by 21.3 points (62.9 → 84.2).
Cross-modality Consistency
| Model | BLEU-4 | ROUGE-L | SBERT |
|---|---|---|---|
| GEM | 0.33 | 0.43 | 0.92 |
| ECG-R1 | 0.69 | 0.73 | 0.97 |
BLEU-4 more than doubles (0.33 → 0.69), indicating substantially higher agreement between signal-only and image-only interpretations of the same ECG.
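As an illustration of how such a consistency score can be computed, the sketch below implements a basic ROUGE-L F1 between the signal-only and image-only outputs for one ECG. Whitespace tokenization, lowercasing, the F1 weighting, and the two sample strings are assumptions; the source does not state which scorer or settings were used.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on whitespace tokens (assumed tokenization)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical outputs for the same ECG under the two single-modality views.
signal_only = "sinus rhythm with left axis deviation"
image_only = "sinus rhythm with normal axis"
score = rouge_l_f1(signal_only, image_only)  # LCS has 4 tokens out of 6 and 5
```

Averaging such pairwise scores over the test set gives a single consistency number per model.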
Key Findings
- Under the IMD coverage assumption \(q(\tau)\geq\alpha\), the worst-case risk is bounded by \(\alpha^{-1}\) times the mixed training risk, so performance under any single missing modality is controlled by the training objective, which matters for clinical deployment.
- While the SFT-to-RL gain is roughly 1 point, EDER specifically addresses "correct diagnosis with fabricated reasoning," which is critical for medical compliance.
- The gap to GPT-5.1 (80.3 vs 31.5) suggests that protocol-specific corpora can give a small model more leverage than broad parameter scaling.
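The \(\alpha^{-1}\) bound in the first finding follows from a short argument. The sketch below assumes that every per-view risk \(R_\tau(\theta)\) is non-negative and that \(R_{\max}(\theta)\) denotes the maximum of \(R_\tau(\theta)\) over \(\tau \in \mathcal{T}_{\text{test}}\); the source states the bound but not this derivation.

```latex
\begin{aligned}
R_q(\theta)
  &= \sum_{\tau \in \mathcal{T}_{\text{test}}} q(\tau)\, R_\tau(\theta) \\
  &\geq q(\tau^\star)\, R_{\tau^\star}(\theta)
     \qquad \text{where } \tau^\star = \arg\max_{\tau} R_\tau(\theta) \\
  &\geq \alpha\, R_{\max}(\theta)
\end{aligned}
\qquad\Longrightarrow\qquad
R_{\max}(\theta) \leq \alpha^{-1} R_q(\theta).
```

For example, if \(q\) were uniform over the four views (an assumption), then \(\alpha = 1/4\) and the worst-view risk would be at most four times the mixed training risk.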
Highlights & Insights
- "Encoding domain rules into data synthesis prompts" is a highly transferable paradigm for any medical sub-field with standard-of-care documentation.
- The IMD theoretical analysis is elegant: it abstracts modality absence and order exchange into four deterministic transformations.
- The process reward \(R_{\text{EDER}}\) relies on LLM phrase extraction and string matching, which avoids the cost of training a separate process reward model (PRM).
Limitations & Future Work
- Dependency on FeatureDB limits the grounding capability; anomalies not extractable by FeatureDB cannot be covered by the protocol.
- The 30K corpus is entirely LLM-generated; while protocol-guided, it still follows the LLM-as-author paradigm and might propagate systematic errors if the protocol itself is incomplete.
- Cross-modality consistency guarantees rely on the \(\Delta_{\text{view}} \approx 0\) assumption unique to ECG, which may not hold in truly heterogeneous multimodal scenarios (e.g., RGB+Depth).
- The RL subset is relatively small (3,948 samples), potentially limiting policy diversity.
Related Work & Insights
- vs GEM (Lan et al., 2025): GEM reuses image placeholders for time-series; ours decouples them with explicit tags and independent projectors. GEM lacks IMD, causing single-modality failure.
- vs ECG-Grounding (Lan et al., 2025) Data: ECG-Grounding allows free-form LLM generation; ours constrains it within clinical rules using a five-step protocol.
- vs DeepSeek-R1 / R1-VL (Zhang et al., 2025): R1 rewards format + answer; EDER adopts step-level rewards but replaces general visual reasoning with diagnostic evidence hit rates, adapting the R1 paradigm for medical RL.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of protocol-guided data, IMD, and EDER is a first for the ECG domain with solid theoretical support for IMD.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Triple verification via rubrics, consistency metrics, and cardiologist blind tests against four baseline categories.
- Writing Quality: ⭐⭐⭐⭐ Method sections are well-structured; minor redundant notation in formulas.
- Value: ⭐⭐⭐⭐⭐ Addresses the critical medical MLLM bottlenecks of hallucination and modality absence, with a paradigm applicable to other protocol-based medical fields.