From Token to Rhythm: A Multi-Scale Approach for ECG-Language Pretraining
Conference: ICML 2025
arXiv: 2506.21803
Code: https://github.com/HKU-MedAI/MELP
Area: Medical AI / ECG Analysis
Keywords: ECG Pretraining, Multimodal Learning, Multi-scale Representation, Contrastive Learning, ECG-Text Alignment, Zero-shot Classification, Self-Supervised Learning
TL;DR
MELP proposes a multi-scale ECG-language pretraining model. By utilizing cross-modal supervisory signals at three levels (Token, Beat, and Rhythm) combined with domain-specific cardiology language model pretraining, it comprehensively outperforms existing self-supervised and multimodal ECG methods in zero-shot classification, linear probing, and transfer learning.
Background & Motivation
Electrocardiograms (ECGs) are a core tool for diagnosing cardiovascular diseases. Deep learning has significantly improved ECG analysis, but supervised models depend on large-scale manual annotation, which is costly. Self-supervised learning (SSL) has therefore emerged as a promising alternative.
Limitations of Prior Work:
- Limitations of Unimodal SSL: Contrastive (e.g., SimCLR, CLOCS) and generative (e.g., ST-MEM, HeartLang) methods only utilize ECG signals, ignoring the rich semantic information embedded in clinical texts.
- Insufficiency of Global Alignment: The few existing ECG-language alignment methods (e.g., MERL) only focus on global ECG-to-text alignment, neglecting the multi-scale structure of ECG signals.
- Omission of Hierarchical Structure: Cardiologists interpret ECGs in a hierarchical manner, from waveform components (token level) to heartbeat cycles (beat level) and finally to overall rhythms (rhythm level). Existing models fail to capture this multi-scale nature.
Core Motivation: To mimic the multi-scale interpretation process of clinicians by establishing cross-modal supervision between ECG and text across three levels: token, beat, and rhythm.
Method
Overall Architecture
MELP (Multi-scale ECG-Language Pretraining) is trained in two stages (Figure 2):
- Cardiology Language Model Pretraining (Stage 1)
- Multimodal Pretraining (Stage 2), with supervision at three levels
Key Designs
1. Cardiology Language Pretraining
Based on the MedCPT query encoder, masked language modeling (MLM) pretraining is conducted using cardiology corpora from three sources:
- PubMed cardiology-related literature
- Wikipedia cardiology articles
- Clinical reports from the MIMIC-IV-ECG training set
The objective is to equip the text encoder with rich domain knowledge in cardiology, laying a high-quality semantic representation foundation for subsequent cross-modal alignment.
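The masking step behind MLM pretraining can be illustrated with a minimal sketch. It is not the authors' pipeline: it works on whitespace-split words rather than subword tokens, replaces every selected position with a `[MASK]` placeholder, and the function name `mask_tokens` and the masking probability are assumptions chosen for illustration.

```python
import random

MASK = "[MASK]"  # placeholder symbol; real tokenizers define their own mask token


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for one masked-language-modeling example.

    labels[i] holds the original token where position i was masked and
    None elsewhere, so a loss would be computed only on masked positions.
    mask_prob is an assumed value, not one taken from the paper.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical report sentence, split into words for simplicity.
report = "sinus rhythm with first degree av block".split()
masked, labels = mask_tokens(report, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

The text encoder is then trained to predict the entries of `labels` that are not `None` from the masked sequence.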
2. Token-level: ECG Description Generation
Adopting an encoder-decoder framework, the ECG encoder generates token-level embeddings \(E \in \mathbb{R}^{L_t \times D}\), which are aggregated into \(\tilde{E} \in \mathbb{R}^{128 \times D}\) via attention pooling with 128 learnable query tokens. The text decoder autoregressively generates paired reports in a GPT-style manner:
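A standard autoregressive captioning objective consistent with this description is sketched below. The symbols \(y_t\) (the \(t\)-th report token), \(T\) (report length), \(\theta\) (decoder parameters), and the name \(\mathcal{L}_{\text{cap}}\) are assumptions introduced here and may differ from the paper's notation:

\[
\mathcal{L}_{\text{cap}} = -\sum_{t=1}^{T} \log P_\theta\left(y_t \mid y_{<t}, \tilde{E}\right)
\]

Each report token is predicted from the preceding tokens and the pooled ECG embeddings \(\tilde{E}\), so the decoder can only lower this loss by reading information out of the ECG features.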
Through the report generation task, the model is forced to capture fine-grained waveform features (e.g., absence of P-wave, QRS duration), implicitly learning the relationships between these local indicators and clinical diagnoses.
3. Beat-level: Heartbeat-Sentence Alignment
Ten learnable query tokens aggregate the token-level features into beat-level representations \(B \in \mathbb{R}^{N_B \times D}\) via attention pooling. On the text side, word-token embeddings are averaged within each sentence to produce sentence embeddings \(S \in \mathbb{R}^{N_S \times D}\), where \(N_S\) is the number of sentences in the report.
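Attention pooling with learnable queries can be sketched as follows. This is a simplified single-head version in plain Python with no learned key/value projections, so it illustrates the mechanism rather than the model's actual layer; the function names and toy sizes are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(queries, tokens):
    """Pool `tokens` (L x D) into one vector per query (n_q x D).

    Each query scores every token with a scaled dot product, and the
    pooled vector is the softmax-weighted average of the tokens.
    """
    d = len(tokens[0])
    pooled = []
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(d) for t in tokens]
        weights = softmax(scores)
        pooled.append([sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)])
    return pooled


# Toy example: 4 token embeddings of width 2 pooled by 2 queries.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
queries = [[2.0, 0.0], [0.0, 2.0]]  # would be learnable parameters in practice
beats = attention_pool(queries, tokens)
print(len(beats), len(beats[0]))
```

The number of queries fixes the number of output vectors regardless of the input length, which is why a fixed set of query tokens yields a fixed-size beat-level representation.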
Attention-weighted beat embedding:
Local contrastive loss:
Through the beat-sentence matching mechanism, the model captures correspondences between transient abnormal heartbeats and specific descriptive sentences.
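One common way to realize beat-sentence matching is sketched below. It is an assumed formulation, not the paper's exact equations: each sentence attends over the beat embeddings to form an attention-weighted beat embedding, and a contrastive loss then treats each sentence and its own attended embedding as the positive pair, with the other sentences' attended embeddings as negatives. The function names and the temperature `tau` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [a / norm for a in v]


def attended_beats(sentences, beats):
    """For each sentence, a softmax-weighted average of the beat embeddings."""
    out = []
    for s in sentences:
        weights = softmax([dot(s, b) for b in beats])
        out.append([sum(w * b[j] for w, b in zip(weights, beats)) for j in range(len(s))])
    return out


def local_contrastive_loss(sentences, beats, tau=0.1):
    """Sentence-to-attended-beat contrastive loss (assumed formulation)."""
    ctx = [normalize(c) for c in attended_beats(sentences, beats)]
    sents = [normalize(s) for s in sentences]
    loss = 0.0
    for j, s in enumerate(sents):
        probs = softmax([dot(s, c) / tau for c in ctx])
        loss += -math.log(probs[j])
    return loss / len(sents)


# Toy example: three beat embeddings, two sentences that each match one beat.
beats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sentences = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
print(local_contrastive_loss(sentences, beats))
```

When each sentence aligns with a distinct beat, the loss is close to zero; when sentences are indistinguishable, it rises toward the log of the number of sentences.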
4. Rhythm-level: Global ECG-Report Alignment
The global ECG embedding \(X_i^g\) is computed as the average of all beat embeddings, while the global text embedding \(T_i^g\) is represented by the [CLS] token. A standard InfoNCE contrastive loss is employed:
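A minimal sketch of this global alignment step follows. It assumes the symmetric (ECG-to-text and text-to-ECG) CLIP-style form of InfoNCE with cosine similarity; the temperature value and function names are assumptions, and the toy vectors stand in for real encoder outputs.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [a / norm for a in v]


def mean_pool(beat_embeddings):
    """Global ECG embedding as the average of the beat embeddings."""
    n = len(beat_embeddings)
    return [sum(col) / n for col in zip(*beat_embeddings)]


def log_softmax(xs):
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def info_nce(ecg, text, tau=0.07):
    """Symmetric InfoNCE over a batch; row i of `ecg` pairs with row i of `text`."""
    ecg = [normalize(x) for x in ecg]
    text = [normalize(t) for t in text]
    n = len(ecg)
    sim = [[dot(x, t) / tau for t in text] for x in ecg]
    ecg_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    text_to_ecg = -sum(log_softmax([sim[j][i] for j in range(n)])[i] for i in range(n)) / n
    return 0.5 * (ecg_to_text + text_to_ecg)


# Toy batch of two ECG/report pairs.
ecg_global = [mean_pool([[1.0, 0.0], [0.8, 0.2]]), mean_pool([[0.0, 1.0], [0.2, 0.8]])]
text_global = [[1.0, 0.1], [0.1, 1.0]]
print(info_nce(ecg_global, text_global))
```

Correctly paired batches give a low loss, while swapping the reports between the two ECGs gives a much higher one.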
5. Total Loss
Where \(\lambda_1 = 2\) and \(\lambda_2 = 0.2\).
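One weighted sum consistent with these values is sketched below. Which term each weight multiplies is an assumption here (\(\lambda_1\) on the token-level captioning loss and \(\lambda_2\) on the beat-level local loss, with the rhythm-level loss unweighted), and the loss names \(\mathcal{L}_{\text{rhythm}}\), \(\mathcal{L}_{\text{cap}}\), and \(\mathcal{L}_{\text{beat}}\) are introduced only for illustration:

\[
\mathcal{L} = \mathcal{L}_{\text{rhythm}} + \lambda_1 \, \mathcal{L}_{\text{cap}} + \lambda_2 \, \mathcal{L}_{\text{beat}}
\]

Under that assumption, hypothetical loss values of \(\mathcal{L}_{\text{rhythm}} = 1.0\), \(\mathcal{L}_{\text{cap}} = 2.0\), and \(\mathcal{L}_{\text{beat}} = 0.5\) would give \(\mathcal{L} = 1.0 + 2 \times 2.0 + 0.2 \times 0.5 = 5.1\).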
Key Experimental Results
Linear Probing (AUC%, 100% Training Data)
| Method | PTBXL-Rhythm | PTBXL-Super | CPSC2018 | CSN |
|---|---|---|---|---|
| SimCLR | 77.73 | 73.53 | 76.54 | 73.20 |
| Wav2Vec2+CMSC+RLM | 92.05 | 85.53 | 92.61 | 87.87 |
| MERL | 92.31 | 88.01 | 92.48 | 87.39 |
| MELP | 93.66 | 89.44 | 92.49 | 90.25 |
Linear Probing (AUC%, 1% Training Data)
| Method | PTBXL-Rhythm | PTBXL-Super | CPSC2018 | CSN |
|---|---|---|---|---|
| SimCLR | 51.41 | 63.41 | 59.78 | 59.02 |
| MERL | 84.85 | 84.46 | 81.49 | 73.23 |
| MELP | 87.72 | 87.63 | 82.05 | 77.65 |
Key Findings:
- MELP's advantage is largest in the extremely low-resource setting (1% of labels), where it outperforms MERL on all four datasets shown, by 0.56 (CPSC2018) to 4.42 (CSN) AUC points.
- With 100% of the training data, MELP achieves the best average performance across the six evaluation sets, although on CPSC2018 it trails Wav2Vec2+CMSC+RLM by 0.12 points (92.49 vs. 92.61).
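The per-dataset margins over MERL in the 1% setting can be checked directly from the table values. The short script below only restates that arithmetic; the variable names are illustrative.

```python
# AUC (%) with 1% training data, copied from the table above.
melp = {"PTBXL-Rhythm": 87.72, "PTBXL-Super": 87.63, "CPSC2018": 82.05, "CSN": 77.65}
merl = {"PTBXL-Rhythm": 84.85, "PTBXL-Super": 84.46, "CPSC2018": 81.49, "CSN": 73.23}

# Margin of MELP over MERL per dataset, in AUC points.
margins = {name: round(melp[name] - merl[name], 2) for name in melp}
mean_margin = sum(margins.values()) / len(margins)

for name, gain in margins.items():
    print(f"{name}: +{gain:.2f}")
print(f"mean: +{mean_margin:.1f}")
```

The margins are 2.87, 3.17, 0.56, and 4.42 points, so the gain is about 3 points on the two PTB-XL tasks, largest on CSN, and smallest on CPSC2018.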
Zero-Shot Classification
MELP achieves top performance in zero-shot ECG classification, demonstrating the efficacy of multi-scale supervision in enhancing the quality of cross-modal alignment.
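Zero-shot classification with an aligned ECG-text model is typically done by comparing the ECG embedding against text embeddings of class prompts. The sketch below shows that general recipe under stated assumptions: the toy vectors stand in for real encoder outputs, and the class names and function names are hypothetical rather than taken from the MELP code.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [a / norm for a in v]


def zero_shot_scores(ecg_embedding, prompt_embeddings):
    """Cosine similarity between one ECG embedding and each class prompt embedding."""
    x = normalize(ecg_embedding)
    return {label: dot(x, normalize(t)) for label, t in prompt_embeddings.items()}


def predict(ecg_embedding, prompt_embeddings):
    """Return the class whose prompt embedding is most similar to the ECG."""
    scores = zero_shot_scores(ecg_embedding, prompt_embeddings)
    return max(scores, key=scores.get)


# Toy stand-ins for text-encoder outputs of class prompts (hypothetical labels).
prompts = {
    "normal sinus rhythm": [1.0, 0.1, 0.0],
    "atrial fibrillation": [0.0, 1.0, 0.2],
    "left bundle branch block": [0.1, 0.0, 1.0],
}
ecg = [0.1, 0.9, 0.3]  # toy stand-in for a global ECG embedding
print(predict(ecg, prompts))
```

No classifier is trained in this setup, so the quality of the prediction depends entirely on how well pretraining has aligned the two embedding spaces.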
Ablation Study
- Removal of token-level supervision \(\to\) Largest performance drop
- Removal of rhythm-level supervision \(\to\) Loss of zero-shot capability
- Removal of beat-level supervision \(\to\) Moderate performance drop
- None of the three components are dispensable, verifying the complementary nature of multi-scale supervision.
Highlights & Insights
- Clinically-Inspired Design: The supervisory signals are designed by mimicking the multi-scale interpretation process of cardiologists (waveform \(\to\) heartbeat \(\to\) rhythm), which offers strong physiological plausibility.
- Cardiology-Specific Language Model: Pretraining the text encoder on domain-specific corpora prior to cross-modal alignment proves significantly superior to directly utilizing generic language models.
- Integration of Generative and Contrastive Learning: Token-level captioning (generative) and Beat/Rhythm-level contrastive learning (discriminative) complement each other, avoiding the limitations of a single paradigm.
- Flexible Downstream Adaptation: Generative pretraining allows the model to naturally extend to tasks such as ECG report generation and ECG question-answering.
Limitations & Future Work
- The beat level uses a fixed number of 10 learnable tokens, lacking adaptive adjustment based on the actual heart rate.
- The pretraining data (MIMIC-IV-ECG) originates from a single source, and cross-institutional generalization remains to be verified.
- The text decoder is initialized randomly, potentially requiring more data for sufficient training.
- The computational overhead of token-level captioning is relatively high.
Related Work & Insights
- ECG Representation Learning: SimCLR, CLOCS, Wav2Vec 2.0, ST-MEM, HeartLang
- ECG-Language Pretraining: MERL, C-MELT, ECG-Chat, ESI
- General Multimodal Contrastive Learning: CLIP, CoCa
Rating
⭐⭐⭐⭐ — The multi-scale design concept is elegant and physiologically grounded, with comprehensive experimental coverage (six datasets, multiple evaluation protocols). The introduction of cardiology language pretraining provides an important practical insight. The open-source code facilitates reproducibility and subsequent research. The fixed number of beat-level tokens and single-source pretraining are the primary limitations.