MedCLIPSeg: Probabilistic Vision-Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation¶
Conference: CVPR2026
arXiv: 2602.20423
Code: HealthX-Lab/MedCLIPSeg
Area: Medical Imaging
Keywords: Medical Image Segmentation, CLIP Adaptation, Probabilistic Attention, Uncertainty Modeling, Cross-Modal Fusion, Data Efficiency
TL;DR¶
Built upon frozen CLIP encoders, MedCLIPSeg introduces a probabilistic cross-modal attention adapter (PVL) that enables bidirectional vision-language interaction and explicit prediction uncertainty modeling, complemented by a soft patch-level contrastive loss. The method achieves strong data efficiency, domain generalization, and interpretability across 16 medical segmentation datasets.
Background & Motivation¶
Medical image segmentation has long been constrained by three core bottlenecks: scarcity of annotated data (due to the high cost of expert labeling), ambiguous anatomical boundaries (caused by low soft-tissue contrast), and domain shift across different devices and institutions. While vision-language pre-trained models such as CLIP provide powerful cross-modal representations, existing approaches either exploit only image-level CLIP features for coarse-grained alignment or lack explicit modeling of segmentation prediction uncertainty. This leads to rapid performance degradation under limited annotations and insufficient robustness in cross-domain scenarios.
The authors identify three critical gaps: (1) most existing CLIP adaptation methods perform unidirectional text→image guidance, lacking bidirectional interaction; (2) standard deterministic attention cannot express confidence differences across patch features; (3) global contrastive losses are too coarse to encourage fine-grained semantic alignment at the patch level. MedCLIPSeg addresses all three gaps simultaneously.
Method¶
Overall Architecture¶
MedCLIPSeg is built upon frozen CLIP dual encoders (using UniMedCLIP as the default backbone). Input medical images are processed by the visual encoder to extract patch-level embeddings, while text descriptions (containing information such as organ location and imaging modality) are processed by the text encoder to extract token-level embeddings. The core innovation lies in inserting learnable Probabilistic Vision-Language (PVL) adapters at intermediate layers of the CLIP encoders to achieve bidirectional fusion between image and text tokens. The fused visual features are then fed into a lightweight segmentation decoder to produce segmentation masks. The training loss consists of a segmentation loss and a soft patch-level contrastive loss.
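To make the composition concrete, here is a minimal PyTorch sketch of the training-time wiring. The module names (`MedCLIPSegSketch`, `pvl_adapters`, `decoder`) are illustrative assumptions, and for brevity the adapters are applied to the encoder outputs rather than inside intermediate layers as the paper describes; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class MedCLIPSegSketch(nn.Module):
    """Illustrative wiring only: frozen CLIP encoders, trainable PVL adapters and decoder."""
    def __init__(self, clip_visual, clip_text, pvl_adapters, decoder):
        super().__init__()
        self.visual, self.text = clip_visual, clip_text
        for enc in (self.visual, self.text):              # CLIP encoders stay frozen
            for p in enc.parameters():
                p.requires_grad_(False)
        self.pvl_adapters = nn.ModuleList(pvl_adapters)   # trainable cross-modal adapters
        self.decoder = decoder                            # lightweight segmentation head

    def forward(self, image, text_tokens):
        img_tok = self.visual(image)        # (B, Np, D) patch-level embeddings
        txt_tok = self.text(text_tokens)    # (B, Nt, D) token-level embeddings
        for adapter in self.pvl_adapters:   # bidirectional vision-language fusion
            img_tok, txt_tok = adapter(img_tok, txt_tok)
        return self.decoder(img_tok), (img_tok, txt_tok)
```

A training step would then combine a segmentation loss on the decoder output with the soft patch-level contrastive term described below.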
Key Design 1: Probabilistic Cross-Modal Attention (PVL Adapter)¶
The PVL adapter is the central contribution of this work. Unlike standard cross-attention, PVL models Keys and Values as Gaussian distributions rather than deterministic vectors:
- Variational Keys/Values: For each token, learnable projections predict a mean \(\mu\) and variance \(\sigma^2\), yielding Key \(\sim \mathcal{N}(\mu_K, \sigma^2_K)\) and Value \(\sim \mathcal{N}(\mu_V, \sigma^2_V)\). High variance indicates semantic uncertainty in that token, while low variance indicates high confidence.
- Confidence-weighted Attention: Attention weights consider not only Query-Key similarity but are also weighted by the inverse of the Key variance (i.e., uncertainty). Tokens with high uncertainty are automatically down-weighted, allowing the model to focus on reliable features. The confidence weighting is controlled by a hyperparameter \(\beta\).
- Bidirectional Interaction: The PVL adapter operates in both directions — visual patch tokens querying text tokens (text→image) and text tokens querying visual patch tokens (image→text). A residual gating mechanism fuses the adapted features with the original CLIP features.
This probabilistic formulation enables the model to identify and suppress noisy or ambiguous features at the attention stage, which is particularly effective for the boundary ambiguity and artifacts commonly encountered in medical imaging.
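The following is a minimal sketch of one direction of this probabilistic cross-attention, showing the reparameterised Key/Value sampling, an inverse-variance confidence weighting controlled by \(\beta\), and a gated residual. The exact weighting form, shapes, and names are assumptions based on the description above, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticCrossAttention(nn.Module):
    def __init__(self, dim, beta=1.0):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k_mu, self.k_logvar = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v_mu, self.v_logvar = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.zeros(1))   # residual gate, starts closed
        self.beta = beta                           # strength of confidence weighting
        self.scale = dim ** -0.5

    def forward(self, query_tokens, context_tokens, sample=True):
        q = self.q(query_tokens)                                        # (B, Nq, D)
        k_mu, k_logvar = self.k_mu(context_tokens), self.k_logvar(context_tokens)
        v_mu, v_logvar = self.v_mu(context_tokens), self.v_logvar(context_tokens)

        # Reparameterised samples of Keys/Values; fall back to the means if not sampling.
        if sample:
            k = k_mu + torch.randn_like(k_mu) * torch.exp(0.5 * k_logvar)
            v = v_mu + torch.randn_like(v_mu) * torch.exp(0.5 * v_logvar)
        else:
            k, v = k_mu, v_mu

        # Query-Key similarity, down-weighted for high-variance (uncertain) Keys.
        # exp(-beta * mean log-variance) is one possible monotone confidence term.
        logits = torch.einsum("bqd,bkd->bqk", q, k) * self.scale
        confidence = torch.exp(-self.beta * k_logvar.mean(dim=-1))      # (B, Nk)
        attn = F.softmax(logits, dim=-1) * confidence.unsqueeze(1)
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)

        fused = torch.einsum("bqk,bkd->bqd", attn, v)
        return query_tokens + torch.tanh(self.gate) * fused             # gated residual
```

A bidirectional PVL adapter would apply two such modules in opposite directions (patches querying text tokens, and text tokens querying patches) and return both updated token sequences.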
Key Design 2: Pixel-level Uncertainty Estimation¶
Leveraging the probabilistic distribution over Values, the model performs Monte Carlo sampling at inference time to generate multiple predictions. Their mean serves as the final segmentation mask, and their entropy is computed as a per-pixel uncertainty map. The uncertainty map intuitively indicates which regions have reliable segmentation results and which are ambiguous — a capability with significant clinical decision-support value. Experiments show that uncertainty hotspots consistently concentrate at anatomical boundaries and challenging regions, with consistent behavior on both in-distribution and out-of-distribution data.
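Under the assumptions of the sketches above, the inference step could look like the following: repeated stochastic forward passes (Keys/Values resampled each time), with the mean prediction as the mask and the binary entropy of the averaged probability as the uncertainty map. The exact aggregation the paper uses may differ.

```python
import torch

@torch.no_grad()
def predict_with_uncertainty(model, image, text_tokens, n_samples=10, eps=1e-6):
    probs = []
    for _ in range(n_samples):
        logits, _ = model(image, text_tokens)   # stochastic K/V sampling happens inside
        probs.append(torch.sigmoid(logits))     # assumed (B, 1, H, W) foreground probability
    probs = torch.stack(probs, dim=0)

    mean_prob = probs.mean(dim=0)               # averaged prediction -> final soft mask
    # Binary entropy of the averaged prediction as the per-pixel uncertainty map.
    uncertainty = -(mean_prob * (mean_prob + eps).log()
                    + (1 - mean_prob) * (1 - mean_prob + eps).log())
    return (mean_prob > 0.5).float(), uncertainty
```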
Key Design 3: Soft Patch-Level Contrastive Loss¶
Conventional image-text contrastive losses operate at the image level, providing overly coarse supervision. MedCLIPSeg proposes patch-level contrastive learning with soft targets rather than hard positive/negative labels. Specifically, for each visual patch, the model computes its similarity to multiple text descriptions (paraphrases, descriptions at varying levels of detail, etc.), and the resulting softmax distribution serves as the supervision signal. This encourages the model to learn fine-grained semantic distinctions rather than simple binary match/mismatch judgments, thereby improving generalization under limited annotations.
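A sketch of such a soft patch-level contrastive term follows. Here the soft targets are built from frozen-CLIP patch-text similarities (an assumption), and the adapted features are trained to match that distribution via a KL divergence.

```python
import torch
import torch.nn.functional as F

def soft_patch_contrastive_loss(patch_feats, text_feats,
                                target_patch_feats, target_text_feats, tau=0.07):
    # patch_feats: (B, Np, D) adapted patch embeddings
    # text_feats:  (B, Nt, D) adapted embeddings of multiple text descriptions
    # target_*:    frozen-CLIP counterparts used to build the soft targets (assumption)
    def sim(a, b):
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        return torch.einsum("bpd,btd->bpt", a, b) / tau   # (B, Np, Nt)

    log_pred = F.log_softmax(sim(patch_feats, text_feats), dim=-1)
    soft_target = F.softmax(sim(target_patch_feats, target_text_feats), dim=-1)
    return F.kl_div(log_pred, soft_target, reduction="batchmean")
```

A hard-target variant would replace `soft_target` with one-hot labels; the ablation below suggests the soft version is what buys the extra generalization.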
Key Experimental Results¶
Data Efficiency Evaluation (Average DSC/NSD with 10%/25%/50%/100% training data; 25% columns omitted from the table below)¶
| Method | 10% DSC | 10% NSD | 50% DSC | 50% NSD | 100% DSC | 100% NSD |
|---|---|---|---|---|---|---|
| UNet | 60.95 | 64.43 | 71.61 | 75.14 | 78.49 | 82.07 |
| nnU-Net | 73.45 | 77.37 | 78.86 | 82.68 | 81.40 | 85.08 |
| CLIPSeg | 74.66 | 77.75 | 79.63 | 82.58 | 84.87 | 87.74 |
| CAT-Seg | 78.76 | 81.50 | 83.32 | 85.61 | 85.90 | 88.31 |
| VLSM-Adapter | 74.47 | 77.50 | 80.83 | 83.77 | 83.85 | 86.72 |
| MaPLe + Decoder | 74.81 | 77.90 | 82.81 | 85.80 | 84.94 | 87.91 |
| MedCLIPSeg | 81.10 | 83.94 | 87.18 | 89.95 | 88.66 | 91.35 |
MedCLIPSeg substantially outperforms all baselines across all data regimes. At only 10% of training data, it achieves a DSC of 81.10, surpassing the second-best method CAT-Seg (78.76) by 2.34 points; at 100% data, it reaches 88.66 DSC, leading CAT-Seg by 2.76 points.
Ablation Study (DSC, ID/OOD/Harmonic Mean)¶
| Ablation | ID | OOD | HM |
|---|---|---|---|
| MedCLIPSeg (Full) | 89.11 | 79.02 | 83.76 |
| w/o PVL Adapter | 81.23 (−7.88) | 55.23 (−23.79) | 65.75 (−18.01) |
| Deterministic (w/o probabilistic modeling) | 87.68 (−1.43) | 63.12 (−15.90) | 73.40 (−10.36) |
| w/o Visual Adaptation | 81.50 (−7.61) | 64.40 (−14.62) | 71.95 (−11.81) |
| w/o Bidirectional Interaction | 88.71 (−0.40) | 77.71 (−1.31) | 82.85 (−0.91) |
| w/o Soft Contrastive Loss | 87.24 (−1.87) | 77.08 (−1.94) | 81.84 (−1.92) |
| Hard-target Contrastive | 88.34 (−0.77) | 77.64 (−1.38) | 82.65 (−1.11) |
Two critical findings: (1) the PVL adapter is the most essential component — its removal causes a 23.79-point drop in OOD performance; (2) probabilistic modeling is crucial for domain generalization — the deterministic variant drops only 1.43 on ID but 15.90 on OOD.
Key Findings¶
- Probabilistic modeling contributes far more to domain generalization than to in-domain performance: the deterministic variant drops only 1.43 on ID but 15.90 on OOD, indicating that uncertainty modeling primarily helps the model suppress unreliable features under distribution shift.
- Visual adaptation is more critical than text adaptation: removing visual adaptation causes drops of 7.61/14.62 (ID/OOD), while removing text adaptation causes only 0.28/2.62 drops, suggesting that cross-modal enhancement on the visual side is the bottleneck for segmentation.
- Sensitivity to text prompt design: contradictory descriptions reduce HM from 83.76 to 65.79; insufficient descriptions reduce it to 56.82; over-detailed descriptions reduce it to 78.48, demonstrating significant sensitivity to prompt quality.
- Backbone selection: UniMedCLIP > BiomedCLIP (82.48) > vanilla CLIP (81.07) > PubMedCLIP (79.28), confirming that medically pre-trained CLIP backbones substantially outperform general-purpose CLIP.
- Adapter insertion depth: inserting PVL adapters at approximately layer 10 yields the best results; insertion that is too shallow provides insufficient semantic content, while insertion that is too deep disrupts the encoder's high-level abstractions.
Highlights & Insights¶
- Elegant use of probabilistic attention: modeling Keys and Values as distributions rather than vectors elegantly unifies cross-modal fusion and uncertainty estimation in a single mechanism — a design principle readily transferable to other dense prediction tasks.
- Frozen encoder + lightweight adapter: the CLIP encoders are fully frozen; only the PVL adapters and decoder are trained, making the approach parameter-efficient, preserving pre-trained knowledge, and deployment-friendly.
- Comprehensive domain generalization evaluation: 16 datasets, 5 imaging modalities (CT, MRI, ultrasound, endoscopy, dermoscopy), and 6 organ types, with both in-domain and out-of-domain splits, providing strong empirical evidence.
- Clinical value of uncertainty maps: uncertainty and segmentation quality are highly correlated, making the uncertainty map a natural automatic quality-control signal to alert clinicians about unreliable segmentation regions.
- Generalization gain from soft contrastive loss: the improvement from hard to soft targets is modest (HM +1.11) but nearly cost-free, demonstrating the value of fine-grained contrastive learning.
Limitations & Future Work¶
- Text prompts require manual design including organ location and imaging modality information; automated prompt generation could further lower the barrier to use.
- Monte Carlo sampling increases inference time; deployment requires balancing the number of samples against computational efficiency.
- The current method operates on 2D slices and has not been extended to native 3D volumetric segmentation.
- Training and test modalities in the domain generalization experiments partially overlap; generalization to entirely unseen modalities (e.g., OCT) has not been validated.
- Probabilistic modeling introduces additional hyperparameters (\(\beta\), number of samples, etc.) whose optimal settings may vary across datasets.
Related Work & Insights¶
- CLIPSeg / DenseCLIP / ZegCLIP: these methods directly apply CLIP features for segmentation but lack probabilistic modeling and fine-grained contrastive losses. MedCLIPSeg significantly outperforms them across all settings, with particularly large margins in low-data and cross-domain scenarios.
- VLSM-Adapter: similarly adapts CLIP but with only unidirectional (text→visual) interaction and deterministic attention. MedCLIPSeg's bidirectional probabilistic adaptation substantially outperforms it (HM higher by ~3.5 points).
- CAT-Seg: a strong baseline using cost aggregation for segmentation, but similarly lacks uncertainty modeling and exhibits weaker domain generalization than MedCLIPSeg.
- CausalCLIPSeg: introduces causal reasoning to improve generalization but exhibits unstable OOD performance (HM 57.54), far below MedCLIPSeg's 80.80.
- nnU-Net: represents the upper bound for pure vision-based methods, but the gap under low data (10% DSC: 73.45 vs. 81.10) is substantial.
Further connections:
- The probabilistic attention mechanism could be transferred to general CLIP segmentation methods (e.g., CAT-Seg, SAN) to explore its effectiveness in open-vocabulary natural image segmentation.
- Uncertainty maps combined with active learning could automatically select high-uncertainty samples for annotation requests, further reducing labeling costs.
- Integration with foundation models such as SAM: the probabilistic fusion module may serve as a replacement for SAM's prompt encoder.
- The soft contrastive loss is generalizable to other multi-prompt settings, such as VQA and referring segmentation.
Rating¶
- Novelty: ⭐⭐⭐⭐ (Probabilistic cross-modal attention is a meaningful design contribution, though the overall framework is a natural extension of CLIP adaptation)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (16 datasets, 5 modalities, extensive ablations — highly thorough)
- Writing Quality: ⭐⭐⭐⭐ (Clear and well-organized, with rich figures and tables)
- Value: ⭐⭐⭐⭐ (The generalization gains from probabilistic modeling are impressive, with clear clinical deployment potential)