ProPL: Universal Semi-Supervised Ultrasound Image Segmentation via Prompt-Guided Pseudo-Labeling¶
- Conference: AAAI 2026
- arXiv: 2511.15057
- Code: https://github.com/WUTCM-Lab/ProPL
- Area: Medical Imaging / Ultrasound Segmentation
- Keywords: universal segmentation, semi-supervised learning, pseudo-labeling, prompt guidance, ultrasound imaging
TL;DR¶
ProPL is the first framework to achieve universal semi-supervised ultrasound image segmentation, combining a shared visual encoder, prompt-guided dual decoders, and uncertainty-driven pseudo-label calibration. With only 1/16 of the data labeled across 5 organs and 8 tasks, it surpasses fully supervised methods by 5.18 mDice points.
Background & Motivation¶
Background: Ultrasound image segmentation is critical for computer-aided diagnosis, yet existing methods are typically designed for specific organs or tasks, exhibiting poor generalizability.
Limitations of Prior Work:
- Fully supervised methods require large amounts of annotated data, which is particularly costly for ultrasound images because speckle noise, acoustic shadowing, and tissue artifacts blur boundaries.
- Semi-supervised methods reduce annotation requirements but remain confined to single tasks.
- Universal segmentation frameworks (e.g., DoDNet, UniSeg) support only fully supervised settings and are thus constrained by annotation availability.
Core Problem: How to construct a universal ultrasound segmentation framework capable of handling multiple organs and tasks simultaneously while requiring only minimal annotations?
Key Insight: Combining prompt learning for task adaptation with dual-decoder mutual learning to generate reliable pseudo-labels.
Method¶
Overall Architecture¶
Input ultrasound images → shared ConvNeXt-Tiny encoder → dual decoders (standard decoder \(\mathcal{G}_{sd}\) + prompt decoder \(\mathcal{G}_{pd}\)) → mutual pseudo-label supervision between decoders. Task prompts are encoded via BERT and injected into the prompt decoder.
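The overall dataflow can be sketched with toy stand-ins (plain numpy linear maps in place of the real ConvNeXt-Tiny encoder, BERT prompt encoder, and convolutional decoders; all names, shapes, and the additive prompt injection here are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for ProPL's components: the real versions are a ConvNeXt-Tiny
# encoder, a BERT prompt encoder, and two full decoders.
W_enc = rng.standard_normal((16, 32)) * 0.1  # "shared encoder"
W_sd = rng.standard_normal((32, 1)) * 0.1    # "standard decoder" head
W_pd = rng.standard_normal((32, 1)) * 0.1    # "prompt decoder" head

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, tau, alpha=0.1):
    """x: (N, 16) toy 'pixels'; tau: (32,) toy task-prompt embedding."""
    z = x @ W_enc                              # shared features, fed to both decoders
    p_sd = sigmoid(z @ W_sd)                   # standard-decoder prediction
    p_pd = sigmoid((z + alpha * tau) @ W_pd)   # prompt-conditioned prediction
    return p_sd, p_pd

x = rng.standard_normal((64, 16))
tau = rng.standard_normal(32)
p_sd, p_pd = forward(x, tau)
```

During semi-supervised training, `p_sd` and `p_pd` would then supervise each other on unlabeled images after the UPLC filtering described below.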
Key Designs¶
- Prompting-upon-Decoding (PuD):
  - Function: Injects task-specific textual prompts into the decoding process.
  - Mechanism: Task descriptions are encoded via BERT to obtain \(\bm{t}\), which is aligned in dimensionality through 1D convolution and linear projection to yield the prompt tokens \(\bm{\tau}\), and then injected into the decoding features via multi-head cross-attention: \(\bm{h}_k = \bm{z}_k' + \alpha \cdot \text{MHCA}(Q=\bm{z}_k', K=\bm{\tau}, V=\bm{\tau})\)
  - Design Motivation: Unlike one-hot encodings or learnable prompts, textual prompts carry richer semantics and generalize to unseen tasks; the learnable scalar \(\alpha\) controls the influence of the prompt.
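The injection step can be sketched as follows (a single-head simplification of the paper's multi-head cross-attention; shapes and names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prompt_injection(z, tau, alpha):
    """Single-head sketch of h = z + alpha * Attn(Q=z, K=tau, V=tau).

    z:   (N, d) decoder features at one stage
    tau: (M, d) projected prompt tokens
    """
    d = z.shape[-1]
    attn = softmax(z @ tau.T / np.sqrt(d))  # (N, M) feature-to-prompt weights
    return z + alpha * (attn @ tau)         # residual update, scaled by alpha

rng = np.random.default_rng(0)
z = rng.standard_normal((64, 32))    # decoder features
tau = rng.standard_normal((8, 32))   # prompt tokens
h = prompt_injection(z, tau, alpha=0.1)
```

Note the residual form: with \(\alpha = 0\) the prompt has no effect and the decoder reduces to its unconditioned behavior, which is what lets a learnable \(\alpha\) smoothly gate prompt influence.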
- Uncertainty-driven Pseudo-Label Calibration (UPLC):
  - Function: Filters and calibrates pseudo-labels based on prediction uncertainty.
  - Mechanism: Each decoder independently generates predictions; regions of disagreement are treated as high-uncertainty, and only high-confidence regions are used for mutual pseudo-label supervision.
  - Design Motivation: Raw pseudo-labels introduce noise; the divergence between the two decoders serves as an uncertainty signal for filtering unreliable predictions.
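One plausible reading of this filtering rule, sketched for a binary foreground mask (the paper's exact confidence measure and threshold may differ; `conf_thresh` is an assumed hyperparameter):

```python
import numpy as np

def calibrated_pseudo_labels(p_sd, p_pd, conf_thresh=0.9):
    """Build a supervision mask from two decoders' foreground probabilities.

    Pixels where the decoders' hard predictions disagree are treated as
    high-uncertainty and excluded; of the remaining pixels, only those where
    the teacher decoder is confident supervise the other decoder.
    """
    y_sd = p_sd > 0.5
    y_pd = p_pd > 0.5
    agree = y_sd == y_pd                             # low-uncertainty regions
    conf_sd = np.maximum(p_sd, 1.0 - p_sd)           # teacher confidence
    mask_for_pd = agree & (conf_sd >= conf_thresh)   # where sd supervises pd
    return y_sd.astype(np.int64), mask_for_pd

p_sd = np.array([0.95, 0.80, 0.40, 0.05])
p_pd = np.array([0.90, 0.30, 0.45, 0.10])
labels, mask = calibrated_pseudo_labels(p_sd, p_pd)
# index 1 (disagreement) and index 2 (low confidence) are filtered out
```

The symmetric mask for supervising the standard decoder would use the prompt decoder's confidence in the same way.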
- Universal Ultrasound Dataset:
  - 6,400 images spanning 5 organs (breast, fetal, cardiac, ovary, thyroid) across 8 segmentation tasks.
  - Labeled data partitions: 1/16, 1/8, and 1/4 settings.
Key Experimental Results¶
Main Results (1/16 Labeled Data)¶
| Method Type | Method | mDice% | mIoU% |
|---|---|---|---|
| Single-task supervised | U-Net | 75.17 | 64.76 |
| Single-task semi-supervised | UniMatch | 79.38 | 69.66 |
| Universal supervised | DoDNet | 62.99 | 50.04 |
| Universal supervised | CLIP-UM | 63.70 | 51.27 |
| Universal semi-supervised | ProPL | 80.35 | 70.63 |
Different Labeling Ratios¶
| Labeling Ratio | ProPL mDice | Gain over Runner-up |
|---|---|---|
| 1/16 | 80.35% | +0.97% |
| 1/8 | 82.56% | +2.2% |
| 1/4 | 83.70% | +1.32% |
Ablation Study (1/16)¶
| Configuration | mDice | mIoU |
|---|---|---|
| w/o prompt (PuD) | 60.76 | 52.57 |
| w/o UPLC | 77.85 | 67.23 |
| Full ProPL | 80.35 | 70.63 |
Key Findings¶
- Removing task prompts drops mDice by 19.59 points (80.35→60.76), underscoring the critical role of prompts in universal models.
- UPLC contributes a 2.50-point mDice improvement (77.85→80.35), confirming the effectiveness of uncertainty-based pseudo-label filtering.
- ProPL requires only 712 MB of GPU memory and dominates the performance–efficiency Pareto frontier over all baselines.
- Universal supervised methods (DoDNet, CLIP-UM) perform poorly on ultrasound data (~63% mDice), indicating that universal models benefit substantially from semi-supervised augmentation.
Highlights & Insights¶
- First formulation of "universal semi-supervised ultrasound segmentation": integrates multi-organ multi-task generality with the practical constraint of limited annotations.
- Textual prompts vs. one-hot/learnable prompts: textual prompts add only 18 s/epoch overhead but provide substantially richer semantics; their removal causes model collapse.
- Dual-decoder mutual learning: high-confidence predictions from one decoder serve as pseudo-labels for the other, with inter-decoder disagreement used to estimate uncertainty.
Limitations & Future Work¶
- The dataset covers only 2D ultrasound images and does not extend to 3D volumetric ultrasound.
- Prompt templates require manual design; automated prompt generation may yield further improvements.
- The threshold in UPLC requires manual tuning; adaptive threshold strategies warrant further exploration.
- Cross-modality generalization (e.g., whether the framework transfers to CT/MRI) remains unvalidated.
Related Work & Insights¶
- vs. UniMatch (semi-supervised): UniMatch is the single-task semi-supervised state of the art; ProPL surpasses it by 0.97 mDice points in the universal setting.
- vs. DoDNet/UniSeg (universal supervised): These methods perform poorly on ultrasound data; ProPL compensates for annotation scarcity through semi-supervised learning.
- vs. SAM-based methods: SAM-based approaches require additional interactive prompts (points, scribbles), whereas ProPL requires only textual task descriptions.
Rating¶
- Novelty: ⭐⭐⭐⭐ First definition of universal semi-supervised ultrasound segmentation with a well-motivated framework design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Eight tasks, multiple labeling ratios, detailed ablations, and efficiency analysis.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and contributions are well-articulated.
- Value: ⭐⭐⭐⭐ Both the dataset and the framework offer practical contributions to clinical ultrasound segmentation.