
Boosting Medical Visual Understanding From Multi-Granular Language Learning

Conference: ICLR 2026
arXiv: 2511.15943
Code: https://github.com/HUANGLIZI/MGLL
Area: Medical Imaging / Multimodal VLM
Keywords: Medical image pre-training, multi-label contrastive learning, multi-granular alignment, CLIP improvement, vision-language pre-training

TL;DR

This paper proposes Multi-Granular Language Learning (MGLL), a plug-and-play contrastive learning framework that jointly optimizes a soft CLIP loss, a point-wise loss, and a smooth KL divergence to align medical images with multi-label, multi-granular text descriptions. MGLL consistently surpasses state-of-the-art methods on fundus and X-ray datasets, and when used as the visual encoder for multimodal large language models, it improves diagnostic accuracy by up to 34.1 percentage points (LLaVA-Med, 24.28% → 58.37%).

Background & Motivation

Background: Contrastive learning approaches such as CLIP have achieved remarkable success in general vision by learning cross-modal aligned representations from image-text pairs. Many medical visual foundation models have adopted CLIP-style pre-training.

Limitations of Prior Work: Standard CLIP relies on a single-label, single-granularity image-text pairing strategy; however, medical images are inherently multi-label and multi-granular. For example, a single fundus image may simultaneously present diabetic macular edema and diabetic retinopathy (multi-label), each of which can be further described at coarse granularity (disease category) and fine granularity (severity, clinical description). Existing multi-label contrastive methods focus on instance-label associations but neglect cross-granularity semantics.

Key Challenge: Medical images encode more complex and hierarchical information than natural images, yet data are scarcer due to privacy constraints and annotation costs. Single-granularity, single-label supervision wastes rich hierarchical annotation information, while naively mixing multi-granular information in a single encoding causes feature interference across semantic levels.

Goal: To achieve, within a unified framework, both multi-label alignment (one image corresponding to multiple labels) and cross-granularity alignment (consistency across annotations at different levels).

Key Insight: Construct a multi-granular text description dataset and design three complementary loss functions to separately optimize multi-label alignment and cross-granularity consistency.

Core Idea: Jointly optimize a soft CLIP loss for multi-label soft alignment, a point-wise loss for fine-grained pairwise alignment, and a smooth KL divergence for cross-granularity feature consistency, achieving comprehensive vision-language alignment for medical images.

Method

Overall Architecture

MGLL consists of an image encoder (ViT-L/14) and a text encoder (BiomedicalBERT). The inputs are medical images paired with multi-granular text descriptions (e.g., disease category, clinical interpretation, examination description). MGLL introduces no additional granularity-sensitive encoders, incurs zero extra computational cost, and can be plugged into any vision-language model.

Key Designs

Soft CLIP Loss \(\mathcal{L}_{\text{sCLIP}}\):
- Function: Extends the hard single-label matching of standard CLIP to multi-label soft alignment.
- Mechanism: Allows image feature \(V_i\) to align simultaneously with multiple text labels \(\{T_{i1}, T_{i2}, \ldots, T_{iM_i}\}\). The weight \(w_{ik}\) for each image-text pair is derived by normalizing a co-occurrence matrix over the image's \(M_i\) labels: \(w_{ik} = \frac{\text{cooccurrence}(V_i, T_{ik})}{\sum_{k'=1}^{M_i} \text{cooccurrence}(V_i, T_{ik'})}\), so that \(\sum_{k} w_{ik} = 1\). Optimizing this objective drives the image feature toward the weighted centroid of its associated text features.
- Design Motivation: CLIP forces each image to align with a single label, producing biased representations in multi-label settings. The soft loss naturally handles one-to-many mappings through soft weights.
Point-wise Loss \(\mathcal{L}_P\):
- Function: Optimizes point-level image-text alignment within a given granularity level.
- Mechanism: Uses binary cross-entropy, where \(y_{ij} \in \{0, 1\}\) indicates whether image \(V_i\) and text \(T_j\) form a valid match and \(x_{ij}\) is their similarity score. A sigmoid \(\sigma\) maps the score to a probability: \(\mathcal{L}_P = -\frac{1}{N} \sum_{i,j} \left[ y_{ij} \log \sigma(x_{ij}) + (1-y_{ij}) \log(1-\sigma(x_{ij})) \right]\)
- Design Motivation: While the soft CLIP loss focuses on soft assignment among positive samples, the point-wise loss additionally and explicitly suppresses similarity for negative pairs (minimizing \(\sigma(x_{ij})\) when \(y_{ij}=0\)), complementing the former to enhance multi-label discriminability.
Smooth KL Divergence Loss \(\mathcal{L}_{\text{sKL}}\):
- Function: Ensures that text features from different granularity levels are aligned into a unified feature space.
- Mechanism: Given prediction distributions \(\{P_1, \ldots, P_m\}\) for \(m\) granularity levels, computes the mean distribution \(M = \frac{1}{m}\sum_{i=1}^{m} P_i\), then minimizes the KL divergence from each granularity distribution to the mean: \(\mathcal{L}_{\text{sKL}} = \sum_{i=1}^m D_{\text{KL}}(P_i \| M)\)
- Design Motivation: Without a cross-granularity consistency constraint, features from different granularities scatter across disjoint subspaces, preventing cross-granularity generalization. Because \(\mathcal{L}_{\text{sKL}}\) is non-negative and reaches zero only when \(P_1 = P_2 = \cdots = P_m = M\), minimizing it pulls the granularity-level distributions toward agreement.
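
The three losses above can be sketched in plain Python to make the formulas concrete. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the similarity scores are assumed to be already-scaled logits, the soft CLIP term is written as a softmax cross-entropy against the weights \(w_{ik}\) (zero for unrelated texts), and \(N\) is taken to be the number of images because the summary does not define it.

```python
import math

def cooccurrence_weights(cooc_row):
    """Normalize one image's co-occurrence counts into soft targets w_ik."""
    total = sum(cooc_row)
    return [c / total for c in cooc_row]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def soft_clip_loss(sim_row, weights):
    """Soft-target cross-entropy for one image over the texts in a batch."""
    logp = log_softmax(sim_row)
    return -sum(w * lp for w, lp in zip(weights, logp))

def softplus(x):
    """log(1 + exp(x)), computed without overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def pointwise_loss(sim, labels):
    """Sigmoid binary cross-entropy over all image-text pairs.

    Assumption: N is the number of images (rows of `sim`).
    """
    total = 0.0
    for sim_row, label_row in zip(sim, labels):
        for x, y in zip(sim_row, label_row):
            # -log(sigmoid(x)) = softplus(-x); -log(1 - sigmoid(x)) = softplus(x)
            total += y * softplus(-x) + (1 - y) * softplus(x)
    return total / len(sim)

def smooth_kl_loss(dists):
    """Sum of KL(P_i || M), where M is the mean of the m distributions."""
    m = len(dists)
    mean = [sum(p[c] for p in dists) / m for c in range(len(dists[0]))]
    return sum(
        pc * math.log(pc / mc)
        for p in dists
        for pc, mc in zip(p, mean)
        if pc > 0  # terms with P_i(c) = 0 contribute nothing
    )

# Toy batch (made-up numbers): 2 images x 3 texts; image 0 matches texts 0 and 1.
sim = [[4.0, 3.5, -2.0], [-1.0, 0.5, 3.0]]
labels = [[1, 1, 0], [0, 0, 1]]
l_sclip = sum(
    soft_clip_loss(s, cooccurrence_weights(y)) for s, y in zip(sim, labels)
) / len(sim)
l_p = pointwise_loss(sim, labels)
l_skl = smooth_kl_loss([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.3, 0.2]])
print(l_sclip, l_p, l_skl)
```

With a one-hot weight vector, `soft_clip_loss` reduces to the usual single-label cross-entropy, which is one way to see it as a relaxation of the standard CLIP objective.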

Loss & Training

The final loss is a weighted sum of the three terms: \(\mathcal{L}_{\text{MGLL}} = 0.5 \cdot \mathcal{L}_{\text{sCLIP}} + 1.0 \cdot \mathcal{L}_P + 1.0 \cdot \mathcal{L}_{\text{sKL}}\)
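
As a small sketch of the weighted sum, the helper below combines three already-computed scalar losses using the coefficients stated above. The function name and the example loss values are hypothetical, and zeroing a weight is only one assumed way to mimic an ablation setting.

```python
def mgll_loss(l_sclip, l_p, l_skl, weights=(0.5, 1.0, 1.0)):
    """Weighted sum of the three loss terms; default weights follow the text."""
    w_sclip, w_p, w_skl = weights
    return w_sclip * l_sclip + w_p * l_p + w_skl * l_skl

# Hypothetical per-batch loss values, for illustration only.
total = mgll_loss(l_sclip=2.0, l_p=1.0, l_skl=0.5)

# Zeroing a weight drops its term, e.g. the configuration without sKL.
no_skl = mgll_loss(2.0, 1.0, 0.5, weights=(0.5, 1.0, 0.0))
print(total, no_skl)
```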

Large-Scale Multi-Granular Dataset Construction

MGLL-Fundus: 246,389 fundus image-text pairs with multi-granular annotations, aggregated from 49 public datasets covering 50+ diseases. Granularity levels include: normal/abnormal labels, specific disease categories, and clinical interpretation descriptions.
MGLL-Xray: 190,882 chest X-ray images from the MIDRC database. Granularity levels include: imaging modality (CR/DX), study description, and series description.
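
One way to picture a multi-granular training record is as an image paired with a list of texts per granularity level. The sketch below is an assumption for illustration only: the class name, field names, level keys, file path, and placeholder text are hypothetical and do not reflect the released dataset format.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class MultiGranularSample:
    """One image with text descriptions grouped by granularity level."""
    image_path: str
    texts: Dict[str, List[str]] = field(default_factory=dict)

    def levels(self) -> List[str]:
        """Names of the granularity levels, in insertion order."""
        return list(self.texts)

    def all_texts(self) -> List[str]:
        """Every description across all levels, flattened."""
        return [t for level in self.texts.values() for t in level]

# Hypothetical fundus record mirroring the three levels named above.
sample = MultiGranularSample(
    image_path="fundus_0001.png",  # made-up path
    texts={
        "normal_abnormal": ["abnormal"],
        "disease_category": ["diabetic retinopathy", "diabetic macular edema"],
        "clinical_interpretation": ["<free-text clinical interpretation>"],
    },
)
print(sample.levels())
print(len(sample.all_texts()))
```

Keeping the levels separate, rather than concatenating them into one caption, reflects the motivation stated earlier: mixing granularities in a single encoding risks feature interference across semantic levels.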

Key Experimental Results

Main Results

MGLL is compared against CLIP, CheXzero, MRM, UniChest, and other state-of-the-art methods on 9 fundus downstream datasets and 3 X-ray datasets:

Method	MIDRC-XR AUC (LP/FT)	MIDRC-XR-Portable AUC (LP/FT)	ChestX-ray14 AUC (LP/FT)
CLIP	54.72 / 88.52	71.43 / 91.83	69.75 / 82.05
UniChest	59.02 / 92.51	78.49 / 95.44	76.15 / 85.84
FG-CLIP	58.31 / 93.29	80.31 / 96.93	76.62 / 85.10
MGLL	61.25 / 99.08	83.86 / 99.75	82.94 / 87.37

MGLL achieves the best results across all datasets under both linear probe and fine-tuning settings. On the multi-label dataset RFMiD, MGLL outperforms the second-best method by 16.6% in linear probe AUC and 6.7% in fine-tuning AUC.

Results when embedded into MLLMs, replacing the visual encoder of 7 MLLMs (six are listed below):

MLLM	Original Accuracy	+MGLL Accuracy	Gain (percentage points)
InstructBLIP	47.29%	61.99%	+14.7
LLaVA	72.73%	79.98%	+7.3
LLaVA-Med	24.28%	58.37%	+34.1
Med-Flamingo	26.97%	58.70%	+31.7
InternVL	77.35%	81.96%	+4.6
Janus-Pro	68.92%	79.80%	+10.9

Gains are most pronounced for medical-domain models (LLaVA-Med, Med-Flamingo), while general-purpose models (LLaVA, InternVL) also show notable improvements.
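
The Gain column can be reproduced as the absolute difference between the two accuracy columns. The short check below recomputes it from the table values and also reports the relative improvement, which is a different and much larger number for low-baseline models. It uses `Decimal` so that the 7.25 difference for LLaVA rounds half-up to 7.3 without floating-point ambiguity; the helper names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# (original accuracy, accuracy with MGLL) in percent, copied from the table.
accuracy = {
    "InstructBLIP": ("47.29", "61.99"),
    "LLaVA": ("72.73", "79.98"),
    "LLaVA-Med": ("24.28", "58.37"),
    "Med-Flamingo": ("26.97", "58.70"),
    "InternVL": ("77.35", "81.96"),
    "Janus-Pro": ("68.92", "79.80"),
}

def absolute_gain(before, after):
    """Difference in percentage points, rounded half-up to one decimal."""
    diff = Decimal(after) - Decimal(before)
    return diff.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

def relative_gain(before, after):
    """Improvement relative to the original accuracy, in percent."""
    return float((Decimal(after) - Decimal(before)) / Decimal(before) * 100)

gains = {name: absolute_gain(b, a) for name, (b, a) in accuracy.items()}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    rel = relative_gain(*accuracy[name])
    print(f"{name}: +{gain} points, {rel:.1f}% relative")
print("largest absolute gain:", best)
```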

Ablation Study

Loss function ablation on the RFMiD dataset:

Configuration	LP AUC	FT AUC	Notes
CLIP baseline	44.66	65.10	Single-label, single-granularity
\(\mathcal{L}_P\) only	70.34	88.25	Point-wise contributes most
\(\mathcal{L}_{\text{sCLIP}}\) only	67.86	85.13	Soft CLIP also yields notable gains
\(\mathcal{L}_{\text{sCLIP}} + \mathcal{L}_P\)	75.73	90.31	Complementary effect
Full MGLL	79.62	92.83	+sKL yields further improvement

Ablation on the number of granularity levels (MIDRC-XR-Portable): increasing from 1 to 2 to 3 granularities yields monotonically increasing AUC (LP: 80.54 → 82.92 → 83.86), validating the importance of preserving hierarchical information structure.

Key Findings

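The point-wise loss contributes most: used alone, it raises LP AUC on RFMiD by 25.68 points over the CLIP baseline (44.66 → 70.34), as it jointly optimizes positive and negative pairs.
Adding the smooth KL divergence as a cross-granularity constraint yields a further 3.89 points of LP AUC (75.73 → 79.62) and 2.52 points of FT AUC (90.31 → 92.83).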
Regarding encoder selection, ViT-L/14 outperforms ViT-H/14, which suggests the larger backbone may overfit (larger is not always better), and BiomedicalBERT outperforms both the CLIP text encoder and LLaMA.
MGLL substantially outperforms CLIP even under low-resolution or noisy text conditions, demonstrating strong robustness.

Highlights & Insights

Plug-and-play design: Without introducing any additional encoder parameters, MGLL achieves multi-label and multi-granular alignment purely through improved loss functions, and can directly replace the contrastive objective in any VLM.
Elegant theoretical analysis: Gradient analysis demonstrates that the soft CLIP loss drives image features toward the weighted centroid of the associated text features (Eq. 10), providing a clear and intuitive interpretation.
Engineering value of large-scale dataset construction: MGLL-Fundus (246K pairs, 49 datasets, 50+ diseases) and MGLL-Xray (about 191K images) fill a critical gap in multi-granular pre-training data for medical imaging.
Evaluation paradigm for embedding into MLLMs: Evaluating MGLL by substituting the visual encoder of 7 MLLMs is a transferable experimental design for assessing domain-specific visual encoders in other fields.
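
The weighted-centroid interpretation mentioned above can be checked on a simplified form of the soft CLIP loss. The derivation below is a sketch under assumptions: it considers a single image with an unnormalized feature \(V\), logits \(s_k = V^\top T_k / \tau\) over all texts in the batch, and soft targets \(w_k\) that sum to 1 (zero for unrelated texts). The temperature \(\tau\) and this exact form are introduced for illustration and may differ from the paper's Eq. 10.

\[
\mathcal{L} = -\sum_k w_k \log p_k, \qquad p_k = \frac{\exp(s_k)}{\sum_j \exp(s_j)}, \qquad s_k = \frac{V^\top T_k}{\tau}
\]

\[
\frac{\partial \mathcal{L}}{\partial s_k} = p_k - w_k
\quad\Longrightarrow\quad
\nabla_V \mathcal{L} = \frac{1}{\tau} \left( \sum_k p_k T_k - \sum_k w_k T_k \right)
\]

A gradient-descent step moves \(V\) along \(-\nabla_V \mathcal{L}\), that is, toward the target-weighted centroid \(\sum_k w_k T_k\) and away from the centroid \(\sum_k p_k T_k\) implied by the model's current predictions. The gradient vanishes when \(p_k = w_k\) for every \(k\).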

Limitations & Future Work

Granularity definition requires domain expertise: Granularity levels and corresponding texts must be manually designed for each medical domain, limiting generalizability.
Validation limited to classification tasks: Downstream tasks such as segmentation and detection, which are equally important in medical imaging, are not evaluated.
Dataset bias toward fundus and chest X-ray: Generalizability to other modalities such as CT, MRI, and pathology slides remains unexplored.
Coarse modeling of inter-granularity relationships: The smooth KL divergence simply pulls all granularity distributions toward the mean without explicitly modeling hierarchical or containment relationships among granularities (e.g., disease category as a hypernym of severity level).
Future directions: Exploring automatic extraction of multi-granular annotations from medical reports and encoding hierarchical (tree-structured) relationships into the loss function.

Comparison with Related Methods

vs. CLIP: CLIP performs hard single-label matching, whereas MGLL performs multi-label soft matching with cross-granularity consistency, yielding substantial gains in medical settings (LP AUC on RFMiD: 44.66 → 79.62).
vs. MedCLIP: MedCLIP addresses false negatives via semantic matching but remains single-granularity; MGLL fundamentally restructures the supervision signal.
vs. UniChest: UniChest performs domain adaptation for chest X-rays, while MGLL provides a more general multi-granular framework effective for both X-ray and fundus imaging.
vs. SupCon: Supervised contrastive learning exploits label structure but is limited to a fixed label space; MGLL enables open-vocabulary semantics through the text encoder.

Rating

Novelty: ⭐⭐⭐⭐ The combination of multi-label and multi-granular alignment is novel, though each individual loss function is not new in isolation.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluation covers 11 downstream datasets, 7 MLLMs, and comprehensive ablations.
Writing Quality: ⭐⭐⭐⭐ Theoretical analysis and experimental presentation are clear, though the related work section is somewhat crowded.
Value: ⭐⭐⭐⭐ Directly applicable to medical visual pre-training; both the dataset and the method are readily reusable.