A Semi-Supervised Framework for Breast Ultrasound Segmentation with Training-Free Pseudo-Label Generation and Label Refinement¶

Conference: CVPR2025
arXiv: 2603.06167
Code: To be confirmed
Area: Medical Imaging
Keywords: semi-supervised learning, breast ultrasound segmentation, pseudo label, vision-language model, contrastive learning

TL;DR¶

This paper proposes a semi-supervised breast ultrasound segmentation framework that combines training-free pseudo-label generation from VLMs (Grounding DINO + SAM, driven by appearance-description prompts) with dual-teacher, uncertainty-weighted pseudo-label refinement. With only 2.5% of the data labeled, it reaches 72.72% Dice on BUSI (fully supervised U-Net: 81.68%) and 75.75% Dice on UBB (fully supervised U-Net: 74.81%).

Background & Motivation¶

High Annotation Costs: Pixel-level annotation of breast ultrasound (BUS) images requires radiology experts, which is time-consuming and expensive, limiting the practicality of fully supervised methods.

Limitations of Prior Work: Pseudo-labels are error-prone in the early stages, leading to confirmation bias; under extremely limited annotations, the teacher model is under-trained, generating noisy pseudo-labels; strong-weak augmentation strategies designed for natural RGB images are unsuitable for gray-scale, speckle-noise-heavy BUS data.

Failure of Direct VLM Transfer: When medical terms (such as "tumor", "lesion") are used as prompts, general VLMs lack domain semantic knowledge, leading to unstable localization on BUS.

Impractical Fine-tuning Schemes: Fine-tuning foundation models requires bounding box annotations, large amounts of labeled data, and customized vision-text pairs, which contradicts clinical scenarios with extreme label sparsity.

Contrastive Learning Ignores Hard Regions: In existing SSL, contrastive learning samples global or "reliable" pixel features, ignoring uncertain yet informative boundary regions.

Appearance Consistency of BUS Lesions: Breast tumors usually present consistent appearance features such as "dark oval/dark round", enabling cross-domain structural transfer using simple appearance descriptions.

Method¶

Overall Architecture¶

A two-stage pipeline: (1) Appearance Prompt-driven training-free Pseudo-label Generation (APPG); (2) Dual-teacher semi-supervised pseudo-label refinement (static teacher warm-up + uncertainty-weighted fusion + adaptive reverse contrastive learning).

Step 1: APPG — Appearance Prompt-Driven Training-Free Pseudo-Label Generation¶

Prompt Design: An LLM (GPT-5) translates domain-general medical features (shape, boundary, intensity contrast) into concise appearance descriptions: "dark oval", "dark round", "dark lobulated".
Two-Stage Generation: Grounding DINO generates bounding boxes \(b_i^u\) based on the appearance prompts, which are then fed into SAM to generate segmentation masks \(\hat{y}_i^0\).
Key Insight: Appearance descriptions enable cross-domain structural transfer from natural to medical images without any training or fine-tuning.
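
The two-stage generation above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the detector and segmenter are passed in as plain callables because this summary does not specify how Grounding DINO and SAM are invoked. The names `appg_pseudo_label`, `detect_boxes`, and `segment_box` are hypothetical, and keeping only the single highest-scoring box across all prompts is an assumption.

```python
import numpy as np

APPEARANCE_PROMPTS = ["dark oval", "dark round", "dark lobulated"]


def appg_pseudo_label(image, detect_boxes, segment_box, prompts=APPEARANCE_PROMPTS):
    """Return a binary pseudo-label mask for `image`, or None if nothing is detected.

    detect_boxes(image, prompt) -> list of (box, score); stands in for Grounding DINO.
    segment_box(image, box)     -> binary mask;          stands in for SAM.
    Both callables are hypothetical wrappers, not real library APIs.
    """
    candidates = []
    for prompt in prompts:
        candidates.extend(detect_boxes(image, prompt))
    if not candidates:
        return None  # no box for any prompt: this image gets no pseudo-label
    # Assumption: keep the single highest-scoring box across all prompts.
    best_box, _ = max(candidates, key=lambda item: item[1])
    return np.asarray(segment_box(image, best_box)).astype(np.uint8)
```

Passing the models in as callables keeps the sketch independent of any particular model wrapper; images for which the function returns `None` simply have no pseudo-label.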

Step 2: Static Teacher Warm-Up Training¶

Filter invalid pseudo-labels (masks with foreground area < 1% are discarded).
Train the static teacher \(T^A\) using valid pseudo-labels generated by APPG (with BCE + Dice loss) to capture coarse-grained structural priors.
Freeze the parameters of \(T^A\) after training.
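
The filtering step can be illustrated with a short sketch. It assumes the 1% threshold is measured as the fraction of foreground pixels relative to the whole image, which this summary does not state explicitly; `filter_valid_masks` and `min_fg_fraction` are hypothetical names.

```python
import numpy as np


def filter_valid_masks(masks, min_fg_fraction=0.01):
    """Keep (index, mask) pairs whose foreground covers at least `min_fg_fraction`.

    Assumption: the fraction is foreground pixels divided by all pixels in the mask.
    Entries that are None (no pseudo-label generated) are skipped.
    """
    kept = []
    for idx, mask in enumerate(masks):
        if mask is None:
            continue
        fg_fraction = float(np.mean(np.asarray(mask) > 0))
        if fg_fraction >= min_fg_fraction:
            kept.append((idx, mask))
    return kept
```

Only the images whose indices survive this filter would be used to train the static teacher.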

Step 3: Uncertainty-Driven Semi-Supervised Learning¶

Uncertainty-Entropy Weighted Fusion (UEWF):
- The static teacher \(T^A\) (structurally reliable, non-adaptive) and the EMA dynamic teacher \(T^B\) (temporally consistent, potentially noisy) each generate soft pseudo-labels.
- Compute the Shannon entropy of each prediction, smooth it with patch-wise average pooling (\(k=14\)), and use the reciprocal of the smoothed entropy as the confidence weight.
- Weighted fusion: \(\hat{y}_i^F = \frac{w_A \cdot \hat{y}_i^A + w_B \cdot \hat{y}_i^B}{w_A + w_B + \epsilon}\)
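
A minimal NumPy sketch of the fusion rule, under several assumptions not confirmed by this summary: the teachers output a single foreground-probability map (so the entropy is the binary Shannon entropy), pooling uses non-overlapping \(k \times k\) patches whose mean is broadcast back to pixel resolution, and a small `eps` guards the reciprocal. The function names are hypothetical.

```python
import numpy as np


def binary_entropy(p, eps=1e-8):
    """Per-pixel Shannon entropy of a foreground-probability map."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def patch_average(x, k=14):
    """Average over non-overlapping k x k patches, broadcast back to x's shape."""
    h, w = x.shape
    if h % k or w % k:
        raise ValueError("height and width must be divisible by k")
    pooled = x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, k, axis=0), k, axis=1)


def uewf_fuse(p_a, p_b, k=14, eps=1e-8):
    """Fuse two soft pseudo-labels with reciprocal-entropy confidence weights."""
    w_a = 1.0 / (patch_average(binary_entropy(p_a), k) + eps)
    w_b = 1.0 / (patch_average(binary_entropy(p_b), k) + eps)
    return (w_a * p_a + w_b * p_b) / (w_a + w_b + eps)
```

With this weighting, a patch where one teacher is confident (low entropy) and the other is near 0.5 is dominated by the confident teacher.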

Adaptive Uncertainty-Guided Reverse Contrastive Learning (AURCL):
- Select low-confidence pixels using a dynamic top-K threshold (adaptive percentile + fixed lower bound \(\tau_{fix}=0.2\)).
- Perform probability reversal on low-confidence pixels: \(\tilde{p}(u,v) = 1 - p(u,v)\).
- Patch-level feature extraction \(\rightarrow\) original and reversed features at the same location form positive pairs, while those at different locations form negative pairs.
- Use InfoNCE contrastive loss to enhance feature discrimination in boundary regions.
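
The three ingredients of AURCL can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the paper's procedure: uncertainty is taken as \(1 - |2p - 1|\), the selection threshold is the larger of a percentile of that uncertainty and \(\tau_{fix}\), and the InfoNCE loss receives already-extracted patch features (how the reversed features are computed is not described in this summary). All function names, `percentile`, and `temperature` are hypothetical.

```python
import numpy as np


def low_confidence_mask(p, percentile=80.0, tau_fix=0.2):
    """Flag uncertain pixels of a foreground-probability map.

    Assumption: uncertainty = 1 - |2p - 1| (0 is confident, 1 is maximally unsure);
    the threshold is max(adaptive percentile of uncertainty, tau_fix).
    """
    uncertainty = 1.0 - np.abs(2.0 * p - 1.0)
    threshold = max(float(np.percentile(uncertainty, percentile)), tau_fix)
    return uncertainty >= threshold


def reverse_probabilities(p, mask):
    """Apply p -> 1 - p on the selected pixels only."""
    return np.where(mask, 1.0 - p, p)


def info_nce(z_orig, z_rev, temperature=0.1):
    """InfoNCE over N locations; row i of z_orig and row i of z_rev are positives,
    all other rows of z_rev act as negatives for row i."""
    z_orig = z_orig / np.linalg.norm(z_orig, axis=1, keepdims=True)
    z_rev = z_rev / np.linalg.norm(z_rev, axis=1, keepdims=True)
    logits = z_orig @ z_rev.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))
```

The fixed lower bound matters when a prediction is confident everywhere: the percentile alone would still select some pixels, whereas `tau_fix` lets the selection be empty.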

Loss & Training¶

\[\mathcal{L} = \mathcal{L}_s + \lambda_u \mathcal{L}_u + \lambda_c \mathcal{L}_c\]

Both \(\mathcal{L}_s\) and \(\mathcal{L}_u\) are BCE + Dice, with \(\lambda_u=1, \lambda_c=0.5\).
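
The objective can be written out as a small NumPy sketch. It assumes an unweighted sum of binary cross-entropy and soft Dice for the BCE + Dice term, which this summary does not spell out; `bce_dice` and `total_loss` are hypothetical names, and \(\mathcal{L}_c\) is passed in as a precomputed scalar.

```python
import numpy as np


def bce_dice(prob, target, eps=1e-6):
    """Binary cross-entropy plus soft Dice loss on a probability map.

    Assumption: the two terms are summed with equal weight.
    """
    prob = np.clip(prob, eps, 1.0 - eps)
    bce = -np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))
    intersection = np.sum(prob * target)
    dice = 1.0 - (2.0 * intersection + eps) / (np.sum(prob) + np.sum(target) + eps)
    return float(bce + dice)


def total_loss(l_s, l_u, l_c, lambda_u=1.0, lambda_c=0.5):
    """L = L_s + lambda_u * L_u + lambda_c * L_c, with the weights reported above."""
    return l_s + lambda_u * l_u + lambda_c * l_c
```

Here `l_s` would come from `bce_dice` on labeled images against ground truth, and `l_u` from `bce_dice` on unlabeled images against the fused pseudo-labels.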

Key Experimental Results¶

BUSI Dataset (2.5% Annotation = 12 Labeled Images)¶

Method	Dice(%)	IoU(%)	Acc(%)
U-Net (Fully Supervised, 100%)	81.68	73.74	96.65
BCP (CVPR'23)	58.93	49.48	93.89
CSC-PA (CVPR'25)	58.78	45.97	93.68
Text-semiseg (MICCAI'25)	56.85	45.35	93.13
Ours	72.72	63.11	95.08

UBB Cross-Dataset (2.5% Annotation = 13 Labeled Images)¶

Method	Dice(%)	IoU(%)	Acc(%)
U-Net (Fully Supervised, 100%)	74.81	65.56	97.29
Text-semiseg (MICCAI'25)	59.76	46.30	93.97
Ours	75.75	65.86	96.67

Key Findings:
- Under 2.5% annotations, Dice improves over the prior SOTA by 13.79 percentage points on BUSI (72.72% vs 58.93% for BCP) and by 15.99 points on UBB (75.75% vs 59.76% for Text-semiseg).
- On UBB, training with only 13 labeled images slightly exceeds the fully supervised U-Net in Dice (75.75% vs 74.81%) and IoU (65.86% vs 65.56%), though not in accuracy (96.67% vs 97.29%).
- Ablation study: APPG contributes the most (Dice +14.09 points), followed by the dual-teacher design (+3.83), UEWF (+0.52), and AURCL (+0.47).
- Appearance prompts (e.g., "dark oval") perform significantly better than medical term prompts (e.g., "tumor") for cross-domain localization.
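
The reported Dice gains can be re-derived from the two tables above with a few lines of Python. The dictionaries copy only the semi-supervised rows listed in this summary, and `gain_over_best_baseline` is a hypothetical helper.

```python
# Dice (%) at 2.5% annotation, copied from the tables above.
busi_dice = {"BCP": 58.93, "CSC-PA": 58.78, "Text-semiseg": 56.85, "Ours": 72.72}
ubb_dice = {"Text-semiseg": 59.76, "Ours": 75.75}


def gain_over_best_baseline(dice, ours="Ours"):
    """Difference in Dice points between `ours` and the best other listed method."""
    best_baseline = max(v for name, v in dice.items() if name != ours)
    return round(dice[ours] - best_baseline, 2)


print(gain_over_best_baseline(busi_dice))  # 13.79
print(gain_over_best_baseline(ubb_dice))   # 15.99
```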

Highlights & Insights¶

Ingenious Design of Cross-Domain Appearance Transfer: Replacing medical term prompts with simple appearance descriptions ("dark oval") achieves effective zero-shot transfer from natural to medical images, a simple and elegant strategy.
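Breakthrough Performance Under Extremely Few Annotations: With only 2.5% annotations, the method exceeds the fully supervised U-Net on UBB (75.75% vs 74.81% Dice) and narrows the gap on BUSI to about 9 Dice points (72.72% vs 81.68%), which suggests high clinical utility where labels are scarce.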
Complementary Dual-Teacher Refinement: The static teacher provides structural priors, while the dynamic teacher captures training progress. Their strengths are combined via uncertainty-weighted fusion.
Paradigm Scalability: For other imaging modalities or pathologies with a consistent visual appearance, reliable pseudo-supervision may be obtainable from just a simple global appearance description.

Limitations & Future Work¶

The appearance description strategy works well for lesions with consistent morphology (such as dark ovals in BUS) but may fail for lesions with variable or complex morphology (such as invasive tumors).
The Grounding DINO+SAM pipeline still fails on some images (requiring filtering of invalid masks), resulting in < 100% pseudo-label coverage.
The framework is built on a ResNet-34 backbone; more powerful segmentation architectures remain unexplored.
The data scale of the four BUS datasets is relatively small (e.g., 647 images for BUSI), lacking large-scale validation.
The absence of data augmentation strategies (an intentional design by the authors) may limit a fair comparison with augmentation-based methods.

Related Work¶

Semi-Supervised Segmentation: Mean Teacher, U2PL (entropy-based confidence), BCP (bidirectional copy-paste), MCF (multi-level feature consistency), PH-Net (hard region learning).
VLM-Assisted Segmentation: UniSeg (universal segmentation), SAM-MediCLIPV2 (medical domain alignment), CLIP-style SSL (which, however, requires large-scale in-domain pre-training).
BUS-Specific Methods: PGCL (pseudo-label guided contrastive learning), AAU (anatomical-aware uncertainty), Text-semiseg (CLIP-enhanced SSL).
Core difference of this work: its pseudo-label generation requires neither training nor in-domain fine-tuning of the VLMs; cross-domain transfer is achieved purely via appearance prompts.

Rating¶

Novelty: ⭐⭐⭐⭐ — The cross-domain transfer via appearance prompts is elegant and effective, and the reverse contrastive learning in AURCL is novel.
Experimental Thoroughness: ⭐⭐⭐⭐ — Employs multiple datasets, various annotation ratios, comprehensive ablation studies, and comparisons with VLM baselines.
Writing Quality: ⭐⭐⭐⭐ — The methodology is described systematically and clearly, with well-justified motivations.
Value: ⭐⭐⭐⭐⭐ — The breakthrough performance under extremely limited annotations has significant clinical implications, and the paradigm is highly extensible to other medical imaging modalities.