ICML 2025 Medical Imaging ECG Pretraining Multimodal Learning Multi-scale Representation Contrastive Learning ECG-Text Alignment Zero-shot Classification Self-Supervised Learning

From Token to Rhythm: A Multi-Scale Approach for ECG-Language Pretraining

Conference: ICML 2025
arXiv: 2506.21803
Code: https://github.com/HKU-MedAI/MELP
Area: Medical AI / ECG Analysis
Keywords: ECG Pretraining, Multimodal Learning, Multi-scale Representation, Contrastive Learning, ECG-Text Alignment, Zero-shot Classification, Self-Supervised Learning

TL;DR

MELP proposes a multi-scale ECG-language pretraining model. By utilizing cross-modal supervisory signals at three levels (Token, Beat, and Rhythm) combined with domain-specific cardiology language model pretraining, it comprehensively outperforms existing self-supervised and multimodal ECG methods in zero-shot classification, linear probing, and transfer learning.

Background & Motivation

Electrocardiograms (ECGs) are a core tool for diagnosing cardiovascular diseases. While deep learning has significantly improved ECG analysis, it faces the challenge of high costs associated with large-scale manual annotation. Consequently, self-supervised learning (SSL) has emerged as a promising alternative.

Limitations of Prior Work:

Limitations of Unimodal SSL: Contrastive (e.g., SimCLR, CLOCS) and generative (e.g., ST-MEM, HeartLang) methods only utilize ECG signals, ignoring the rich semantic information embedded in clinical texts.

Insufficiency of Global Alignment: The few existing ECG-language alignment methods (e.g., MERL) only focus on global ECG-to-text alignment, neglecting the multi-scale structure of ECG signals.

Omission of Hierarchical Structure: Cardiologists interpret ECGs in a hierarchical manner—from waveform components (token level) to heartbeat cycles (beat level) and finally to overall rhythms (rhythm level). Existing models fail to capture this multi-scale nature.

Core Motivation: To mimic the multi-scale interpretation process of clinicians by establishing cross-modal supervision between ECG and text across three levels: token, beat, and rhythm.

Method

Overall Architecture

MELP (Multi-scale ECG-Language Pretraining) consists of the following components (Figure 2):

Cardiology Language Model Pretraining (Stage 1)
Multimodal Pretraining (Stage 2) — Three-level Supervision

Key Designs

1. Cardiology Language Pretraining

The text encoder is based on the MedCPT query encoder and is further pretrained with masked language modeling (MLM) on cardiology corpora from three sources:

- PubMed cardiology-related literature
- Wikipedia cardiology articles
- Clinical reports from the MIMIC-IV-ECG training set

The objective is to equip the text encoder with rich domain knowledge in cardiology, laying a high-quality semantic representation foundation for subsequent cross-modal alignment.
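As a rough illustration of the masked language modeling objective, the sketch below hides a random subset of tokens and records the originals as prediction targets. The 15% masking rate and the `[MASK]` placeholder follow common BERT practice and are assumptions here; the summary does not state the masking scheme used for the cardiology corpora, and `mask_tokens` is a hypothetical helper.

```python
import random

MASK = "[MASK]"  # placeholder token, assumed (BERT-style)

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where a prediction is required
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # model must recover this token
            labels.append(tok)
        else:
            masked.append(tok)    # token stays visible, no loss here
            labels.append(None)
    return masked, labels

report = "sinus rhythm with first degree av block".split()
masked_report, targets = mask_tokens(report, mask_prob=0.3, seed=1)
```

The MLM loss would then be a cross-entropy over the positions where `targets` is not `None`.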

2. Token-level: ECG Description Generation

Adopting an encoder-decoder framework, the ECG encoder generates token-level embeddings \(E \in \mathbb{R}^{L_t \times D}\), which are aggregated into \(\tilde{E} \in \mathbb{R}^{128 \times D}\) via attention pooling with 128 learnable query tokens. The text decoder autoregressively generates paired reports in a GPT-style manner:

\[\mathcal{L}_{\mathrm{LM}}(\zeta) = -\sum_{i=1}^{N} \log p(w_i | w_{0:i-1}, \tilde{E})\]

Through the report generation task, the model is forced to capture fine-grained waveform features (e.g., absence of P-wave, QRS duration), implicitly learning the relationships between these local indicators and clinical diagnoses.
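The two ingredients of this stage can be sketched in a few lines: attention pooling with learnable queries, and the autoregressive negative log-likelihood. This is a minimal sketch, assuming single-head scaled dot-product attention without learned projections (the summary does not specify the pooling details); `attention_pool` and `lm_loss` are hypothetical names, and the toy vectors are made up.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(queries, tokens):
    """Pool L_t token embeddings into len(queries) vectors.

    queries: K x D learnable query vectors (K = 128 in the text above)
    tokens:  L_t x D token-level embeddings E
    returns: K x D pooled embeddings (the analogue of E-tilde)
    """
    d = len(tokens[0])
    pooled = []
    for q in queries:
        weights = softmax([dot(q, t) / math.sqrt(d) for t in tokens])
        pooled.append([sum(w * t[k] for w, t in zip(weights, tokens))
                       for k in range(d)])
    return pooled

def lm_loss(step_probs):
    """Negative log-likelihood of a report.

    step_probs[i] is the decoder probability p(w_i | w_{0:i-1}, E-tilde)
    assigned to the ground-truth word at step i.
    """
    return -sum(math.log(p) for p in step_probs)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy E with L_t = 3, D = 2
queries = [[0.5, -0.5], [0.0, 0.0]]             # toy K = 2 queries
pooled = attention_pool(queries, tokens)
```

A query of all zeros attends uniformly, so its pooled vector is simply the mean of the token embeddings; trained queries would instead learn to focus on informative waveform segments.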

3. Beat-level: Heartbeat-Sentence Alignment

Using 10 learnable tokens, the token-level features are aggregated into beat-level representations \(B \in \mathbb{R}^{N_B \times D}\) via attention pooling. On the text side, word-token embeddings are averaged within each sentence to produce sentence embeddings \(S \in \mathbb{R}^{N_S \times D}\), where \(N_S\) is the number of sentences in the report.

Attention-weighted beat embedding for the \(l\)-th sentence:

\[\hat{B}(l) = \sum_{j=1}^{N_B} \alpha_{l,j}\, B(j), \quad \alpha_{l,j} = \frac{\exp(\langle S(l), B(j) \rangle / \tau_1)}{\sum_{k=1}^{N_B} \exp(\langle S(l), B(k) \rangle / \tau_1)}\]

Local contrastive loss:

\[\mathcal{L}_{\mathrm{Local}} = \frac{1}{2}(\mathcal{L}_{\mathrm{Local}}^{e \to t} + \mathcal{L}_{\mathrm{Local}}^{t \to e})\]

Through the beat-sentence matching mechanism, the model captures correspondences between transient abnormal heartbeats and specific descriptive sentences.
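A minimal sketch of the beat-sentence alignment follows, reading the attention-weighted embedding as a softmax over beats for each sentence. The summary does not spell out the two directional terms of the local loss, so the InfoNCE-style form over (sentence, attended-beat) pairs within one report is an assumption, as are the temperatures `tau1` and `tau2`, the function names, and the toy vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attended_beats(S, B, tau1):
    """For each sentence S[l], return sum_j alpha[l][j] * B[j],
    where alpha[l] is a softmax over beats of <S[l], B[j]> / tau1."""
    d = len(B[0])
    out = []
    for s in S:
        alpha = softmax([dot(s, b) / tau1 for b in B])
        out.append([sum(a * b[k] for a, b in zip(alpha, B)) for k in range(d)])
    return out

def local_loss(S, B, tau1=0.1, tau2=0.1):
    """Symmetric contrastive loss between sentences and their attended beats
    (assumed form; positives are the matching indices l == m)."""
    B_hat = attended_beats(S, B, tau1)
    n = len(S)
    sim = [[dot(S[l], B_hat[m]) / tau2 for m in range(n)] for l in range(n)]
    # text -> ECG: each sentence should pick its own attended beat embedding
    t2e = -sum(math.log(softmax(sim[l])[l]) for l in range(n)) / n
    # ECG -> text: each attended beat embedding should pick its own sentence
    e2t = -sum(math.log(softmax([sim[l][m] for l in range(n)])[m])
               for m in range(n)) / n
    return 0.5 * (e2t + t2e)

S = [[1.0, 0.0], [0.0, 1.0]]                 # toy sentence embeddings (N_S = 2)
B = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]     # toy beat embeddings (N_B = 3)
loss = local_loss(S, B)
```

With a small `tau1`, each sentence attends almost exclusively to its best-matching beat, which is what lets a sentence describing a transient abnormality bind to the few beats that show it.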

4. Rhythm-level: Global ECG-Report Alignment

The global ECG embedding \(X_i^g\) is computed as the average of all beat embeddings, while the global text embedding \(T_i^g\) is represented by the [CLS] token. A standard InfoNCE contrastive loss is employed:

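\[\mathcal{L}_g^{e \to t} = -\frac{1}{B}\sum_{i=1}^{B}\log \frac{\exp(\langle X_i^g, T_i^g \rangle / \tau)}{\sum_{j=1}^{B} \exp(\langle X_i^g, T_j^g \rangle / \tau)}\]

Here \(B\) denotes the batch size, not the beat-embedding matrix of the previous subsection.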
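The ECG-to-text InfoNCE term can be sketched as below: each global ECG embedding should score highest against its own report within the batch. The temperature value `0.07`, the assumption that embeddings are L2-normalized, and the symmetric wrapper `global_loss` (averaging both directions, by analogy with the local loss) are assumptions rather than details given in this summary.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(X, T, tau=0.07):
    """One direction of InfoNCE: X[i] is the anchor, T[i] its positive,
    and every other T[j] in the batch acts as a negative."""
    n = len(X)
    total = 0.0
    for i in range(n):
        logits = [dot(X[i], T[j]) / tau for j in range(n)]
        m = max(logits)
        # log of the softmax denominator, computed stably
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += logits[i] - log_denominator
    return -total / n

def global_loss(X, T, tau=0.07):
    """Average of the ECG->text and text->ECG directions (assumed form)."""
    return 0.5 * (info_nce(X, T, tau) + info_nce(T, X, tau))

aligned = global_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
collapsed = global_loss([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
```

When all embeddings collapse to the same vector, the loss equals the log of the batch size, which is the value an uninformative model would obtain; well-aligned pairs drive it toward zero.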

5. Total Loss

\[\mathcal{L} = \mathcal{L}_g + \lambda_1 \mathcal{L}_{\mathrm{LM}} + \lambda_2 \mathcal{L}_{\mathrm{Local}}\]

Where \(\lambda_1 = 2\) and \(\lambda_2 = 0.2\).
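To make the weighting concrete, the snippet below combines the three terms with the stated coefficients. The individual loss values are made-up numbers chosen only for the arithmetic, and `total_loss` is a hypothetical helper.

```python
def total_loss(l_g, l_lm, l_local, lambda1=2.0, lambda2=0.2):
    """L = L_g + lambda1 * L_LM + lambda2 * L_Local."""
    return l_g + lambda1 * l_lm + lambda2 * l_local

# Illustrative values only: L_g = 1.0, L_LM = 2.0, L_Local = 0.5
combined = total_loss(1.0, 2.0, 0.5)  # 1.0 + 2 * 2.0 + 0.2 * 0.5
```

With these coefficients the captioning term is weighted ten times more heavily than the beat-level term, so the generation objective contributes a large share of the training signal unless its raw magnitude is much smaller.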

Key Experimental Results

Linear Probing (AUC%, 100% Training Data)

Method	PTBXL-Rhythm	PTBXL-Super	CPSC2018	CSN
SimCLR	77.73	73.53	76.54	73.20
Wav2Vec2+CMSC+RLM	92.05	85.53	92.61	87.87
MERL	92.31	88.01	92.48	87.39
MELP	93.66	89.44	92.49	90.25

Linear Probing (AUC%, 1% Training Data)

Method	PTBXL-Rhythm	PTBXL-Super	CPSC2018	CSN
SimCLR	51.41	63.41	59.78	59.02
MERL	84.85	84.46	81.49	73.23
MELP	87.72	87.63	82.05	77.65

Key Findings:

- MELP's advantage is largest in the extremely low-resource setting (1% of the training data): it exceeds MERL by 2.9 to 4.4 AUC points on PTBXL-Rhythm, PTBXL-Super, and CSN, and by 0.6 points on CPSC2018.
- With 100% of the training data, MELP achieves the best average performance across the six evaluation sets (four are shown above), although on CPSC2018 it trails Wav2Vec2+CMSC+RLM slightly (92.49 vs. 92.61).
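The margins between MELP and MERL can be checked directly from the two tables above. The numbers below are copied from those tables; the `margins` helper is only for illustration.

```python
DATASETS = ["PTBXL-Rhythm", "PTBXL-Super", "CPSC2018", "CSN"]

# AUC (%) for MERL and MELP, copied from the linear-probing tables
AUC = {
    "100%": {"MERL": [92.31, 88.01, 92.48, 87.39],
             "MELP": [93.66, 89.44, 92.49, 90.25]},
    "1%":   {"MERL": [84.85, 84.46, 81.49, 73.23],
             "MELP": [87.72, 87.63, 82.05, 77.65]},
}

def margins(setting):
    """MELP minus MERL per dataset, in AUC points, rounded to 2 decimals."""
    rows = AUC[setting]
    return {name: round(melp - merl, 2)
            for name, melp, merl in zip(DATASETS, rows["MELP"], rows["MERL"])}

low_resource = margins("1%")
full_data = margins("100%")
```

At 1% of the data the margins are 2.87, 3.17, 0.56, and 4.42 points; at 100% they shrink to 1.35, 1.43, 0.01, and 2.86, which is the pattern behind the low-resource claim.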

Zero-Shot Classification

MELP achieves top performance in zero-shot ECG classification, demonstrating the efficacy of multi-scale supervision in enhancing the quality of cross-modal alignment.
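This summary does not describe the zero-shot protocol. A common CLIP-style recipe, assumed here, embeds one text prompt per class with the text encoder and assigns the ECG to the class whose prompt embedding is most similar to the global ECG embedding. The embeddings below are toy vectors, and `zero_shot_predict` is a hypothetical helper.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def zero_shot_predict(ecg_embedding, class_embeddings):
    """Return (best_class, scores) using cosine similarity between the
    global ECG embedding and each class-prompt text embedding."""
    scores = {name: cosine(ecg_embedding, emb)
              for name, emb in class_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-ins for text-encoder outputs of class prompts
class_embeddings = {
    "atrial fibrillation": [0.9, 0.1, 0.0],
    "normal sinus rhythm": [0.0, 0.2, 0.9],
}
prediction, scores = zero_shot_predict([0.8, 0.3, 0.1], class_embeddings)
```

Because no labeled ECGs are used at inference time, this procedure depends entirely on the quality of the global ECG-text alignment learned during pretraining.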

Ablation Study

Removal of token-level supervision \(\to\) Largest performance drop
Removal of rhythm-level supervision \(\to\) Loss of zero-shot capability
Removal of beat-level supervision \(\to\) Moderate performance drop
None of the three components are dispensable, verifying the complementary nature of multi-scale supervision.

Highlights & Insights

Clinically-Inspired Design: The supervisory signals are designed by mimicking the multi-scale interpretation process of cardiologists (waveform \(\to\) heartbeat \(\to\) rhythm), which offers strong physiological plausibility.
Cardiology-Specific Language Model: Pretraining the text encoder on domain-specific corpora prior to cross-modal alignment proves significantly superior to directly utilizing generic language models.
Integration of Generative and Contrastive Learning: Token-level captioning (generative) and Beat/Rhythm-level contrastive learning (discriminative) complement each other, avoiding the limitations of a single paradigm.
Flexible Downstream Adaptation: Generative pretraining allows the model to naturally extend to tasks such as ECG report generation and ECG question-answering.

Limitations & Future Work

The beat level uses a fixed number of 10 learnable tokens, lacking adaptive adjustment based on the actual heart rate.
The pretraining data (MIMIC-IV-ECG) originates from a single source, and cross-institutional generalization remains to be verified.
The text decoder is initialized randomly, potentially requiring more data for sufficient training.
The computational overhead of token-level captioning is relatively high.

Related Work

ECG Representation Learning: SimCLR, CLOCS, Wav2Vec 2.0, ST-MEM, HeartLang
ECG-Language Pretraining: MERL, C-MELT, ECG-Chat, ESI
General Multimodal Contrastive Learning: CLIP, CoCa

Rating

⭐⭐⭐⭐ — The multi-scale design concept is elegant and physiologically grounded, with comprehensive experimental coverage (six datasets, multiple evaluation protocols). The introduction of cardiology language pretraining provides an important practical insight. The open-source code facilitates reproducibility and subsequent research. The fixed number of beat-level tokens and single-source pretraining are the primary limitations.