Uncertainty-Aware Concept and Motion Segmentation for Semi-Supervised Angiography Videos
Conference: CVPR2025
arXiv: 2603.00881
Code: GitHub
Area: Medical Imaging
Keywords: Semi-supervised learning, video segmentation, coronary angiography, SAM3, optical flow, uncertainty
TL;DR
This paper proposes the SMART framework, which utilizes a SAM3-based teacher-student architecture combined with text concept prompts, confidence-aware consistency regularization, and dual-stream temporal consistency to achieve semi-supervised vessel segmentation in X-ray coronary angiography videos.
Background & Motivation
Coronary artery disease (CAD) is the leading cause of death globally, and X-ray coronary angiography (XCA) is the clinical gold standard for diagnosing it. Accurate coronary artery segmentation underpins automated diagnosis, but annotation is very expensive because it requires frame-by-frame, pixel-level labeling.
Limitations of Prior Work:
- Difficulties in direct application of the SAM series: SAM/SAM2 rely on geometric prompts (points, bounding boxes), which limits their generalization across different clinical institutions.
- Neglect of temporal information: Static image methods fail to exploit the temporal dynamics of XCA videos.
- Unreliable pseudo-labels: Due to the low contrast and low signal-to-noise ratio of coronary images, the outputs of the teacher model are highly noisy.
Advantages of SAM3: SAM3 introduces concept prompts (semantic text descriptions), which avoids reliance on geometric priors.
Method
Phase 1: Text-Driven Segmentation Fine-Tuning
The teacher SAM3 is fine-tuned on the labeled dataset \(D_l\). The SAM3 architecture is retained, and only the parameters related to text prompts in the image encoder, text encoder, and detector are fine-tuned. The optimization is performed using a joint Dice + BCE loss.
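As a rough illustration of the joint Dice + BCE objective, the following NumPy sketch computes both terms for a single predicted probability map. The equal weighting (`bce_weight=1.0`), the smoothing constant `eps`, and the function name are assumptions made for illustration; the paper's exact formulation and weighting are not given in this summary.

```python
import numpy as np

def dice_bce_loss(prob, target, eps=1e-6, bce_weight=1.0):
    """Joint soft-Dice + binary cross-entropy loss for one mask.

    prob   : predicted foreground probabilities in [0, 1], shape (H, W)
    target : binary ground-truth mask, shape (H, W)
    bce_weight is a hypothetical balancing factor (not from the paper).
    """
    prob = np.clip(prob.astype(np.float64), eps, 1.0 - eps)
    target = target.astype(np.float64)

    # Soft Dice term: 1 - 2|P.G| / (|P| + |G|)
    intersection = np.sum(prob * target)
    dice = 1.0 - (2.0 * intersection + eps) / (prob.sum() + target.sum() + eps)

    # Pixel-wise binary cross-entropy, averaged over the image
    bce = -np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))

    return float(dice + bce_weight * bce)

# Toy check: a prediction that agrees with the mask scores lower than its inverse.
target = np.zeros((4, 4))
target[1:3, 1:3] = 1
good = np.where(target == 1, 0.9, 0.1)
bad = 1.0 - good
print(dice_bce_loss(good, target) < dice_bce_loss(bad, target))  # True
```

The Dice term counteracts the class imbalance typical of thin vessels, while BCE supplies a per-pixel gradient; that complementarity is the usual motivation for combining the two.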
Phase 2: Motion-Aware Semi-Supervised Learning
Confidence-Aware Consistency Regularization (CCR)
Addressing the core challenge of unreliable teacher outputs:
1. Inject N=8 independent Gaussian noise perturbations into each frame to obtain N teacher predictions.
2. Calculate the average prediction \(\bar{P}\) as the reliable pseudo-label.
3. Compute the uncertainty weight \(\mathcal{U}\) (variance of the N predictions).
4. The confidence-aware consistency loss exerts stronger supervision in regions with high uncertainty, pushing the model to improve predictions in uncertain areas.
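The four steps above can be sketched in NumPy as follows. Everything specific here is an assumption for illustration: `toy_teacher` is a stand-in sigmoid rather than SAM3, the noise scale `sigma` is arbitrary, and the weight `1 + U / max(U)` is just one simple way to give uncertain pixels more influence; the paper's actual loss form is not reproduced in this summary.

```python
import numpy as np

def ccr_pseudo_label(frame, teacher, n_perturb=8, sigma=0.05, seed=0):
    """Steps 1-3: mean prediction and per-pixel variance over noisy copies."""
    rng = np.random.default_rng(seed)
    preds = np.stack([
        teacher(frame + rng.normal(0.0, sigma, size=frame.shape))
        for _ in range(n_perturb)
    ])                                   # shape (N, H, W)
    return preds.mean(axis=0), preds.var(axis=0)

def ccr_loss(student_prob, pseudo_label, uncertainty, eps=1e-8):
    """Step 4: consistency loss with larger weights on uncertain pixels.

    The weighting scheme is hypothetical; "1 +" keeps confident pixels
    supervised as well instead of zeroing them out.
    """
    weight = 1.0 + uncertainty / (uncertainty.max() + eps)
    sq_err = (student_prob - pseudo_label) ** 2
    return float(np.sum(weight * sq_err) / np.sum(weight))

def toy_teacher(x):
    """Stand-in for the fine-tuned teacher: a steep sigmoid on intensity."""
    return 1.0 / (1.0 + np.exp(-10.0 * (x - 0.5)))

# Horizontal intensity ramp: the teacher is least stable near intensity 0.5.
frame = np.tile(np.linspace(0.0, 1.0, 16), (16, 1))
p_bar, unc = ccr_pseudo_label(frame, toy_teacher)
student = toy_teacher(frame)
print(ccr_loss(student, p_bar, unc))
```

In the toy example the variance concentrates around the decision boundary in the middle columns, which is where the weighted loss pushes the student hardest.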
Dual-Stream Temporal Consistency (DSTC)
Leveraging optical flow to model temporal dynamics of vessels:
1. Estimate forward and backward optical flow using pre-trained SEA-RAFT.
2. Motion consistency loss \(\mathcal{L}_{\text{opti}}\): Ensure pixel-level alignment of predictions between adjacent frames using mask warping.
3. Flow coherence loss \(\mathcal{L}_{\text{coh}}\): Penalize the deviation of boundary points from the primary motion of the vessel structure, helping to distinguish between foreground and background.
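A minimal NumPy sketch of the two DSTC terms is given below, under several stated assumptions: the flow fields are synthetic (SEA-RAFT is not called), warping uses nearest-neighbour sampling instead of the bilinear sampling a real implementation would likely use, the "primary motion" is taken to be the mean flow over foreground pixels, and boundary points are foreground pixels with at least one background 4-neighbour. None of these choices is confirmed by the paper.

```python
import numpy as np

def warp(mask, flow):
    """Backward-warp `mask` (H, W) by `flow` (H, W, 2) storing (dx, dy)."""
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x_src = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    y_src = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return mask[y_src, x_src]

def motion_consistency_loss(p_t, p_next, flow_fwd, flow_bwd):
    """Align adjacent predictions in both directions and compare (L1)."""
    # flow_fwd maps t -> t+1, so sampling p_next at x + flow_fwd aligns it to t.
    fwd_term = np.mean(np.abs(p_t - warp(p_next, flow_fwd)))
    bwd_term = np.mean(np.abs(p_next - warp(p_t, flow_bwd)))
    return float(0.5 * (fwd_term + bwd_term))

def flow_coherence_loss(prob, flow, thresh=0.5):
    """Penalize boundary-pixel flow that deviates from the mean vessel flow."""
    fg = prob > thresh
    if not fg.any():
        return 0.0
    primary = flow[fg].mean(axis=0)              # assumed "primary motion"
    padded = np.pad(fg, 1, mode="constant", constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    boundary = fg & ~interior
    return float(np.mean(np.linalg.norm(flow[boundary] - primary, axis=1)))

# Toy example: a vertical "vessel" that shifts one pixel to the right.
p_t = np.zeros((8, 8))
p_t[:, 3:5] = 1.0
p_next = np.zeros((8, 8))
p_next[:, 4:6] = 1.0
flow_fwd = np.zeros((8, 8, 2))
flow_fwd[..., 0] = 1.0                           # +1 px in x
flow_bwd = -flow_fwd
zero_flow = np.zeros((8, 8, 2))

print(motion_consistency_loss(p_t, p_next, flow_fwd, flow_bwd))   # 0.0
print(motion_consistency_loss(p_t, p_next, zero_flow, zero_flow)) # 0.25
print(flow_coherence_loss(p_t, flow_fwd))                         # 0.0
```

With the correct flow the warped masks coincide and the motion term vanishes; ignoring the motion leaves a residual of 0.25, which is the fraction of pixels where the two unaligned masks disagree.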
Total Loss
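The total-loss equation is not reproduced in this summary. Purely as an assumption based on the components described above, the objective can be read as a weighted sum of a supervised term and the three unlabeled-data terms (the CCR loss, the motion consistency loss, and the flow coherence loss):

\[
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{sup}} + \lambda_{1}\,\mathcal{L}_{\text{CCR}} + \lambda_{2}\,\mathcal{L}_{\text{opti}} + \lambda_{3}\,\mathcal{L}_{\text{coh}}
\]

Here \(\mathcal{L}_{\text{sup}}\) stands for the Dice + BCE loss on labeled frames, and the weights \(\lambda_{1}, \lambda_{2}, \lambda_{3}\) are hypothetical placeholders; the paper's actual weighting and any ramp-up schedule should be taken from the original text.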
Only the student model is used during inference.
Key Experimental Results
Evaluated on XCAV (111 videos/59 patients) and CAVSA (1061 videos/121 patients) using only 16 labeled videos:
| Method | XCAV DSC | XCAV clDice | CAVSA DSC | CAVSA clDice |
|---|---|---|---|---|
| UNet (Supervised) | 70.80 | 69.24 | 64.19 | 70.27 |
| SAM3 (Direct) | 42.73 | 34.51 | 30.82 | 30.14 |
| CPC-SAM | 77.90 | 79.15 | 77.90 | 78.28 |
| Denver | 73.30 | 70.40 | 76.53 | 79.17 |
| SMART | 84.39 | 83.01 | 91.00 | 97.73 |
Significant Improvements:
- XCAV: DSC is 6.49 percentage points higher than CPC-SAM (84.39 vs 77.90).
- CAVSA: Using only 1.5% of the videos as labeled data (16 of 1061), DSC improves by 13.1 percentage points over CPC-SAM (91.00 vs 77.90).
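The quoted gains can be rechecked directly from the DSC columns of the results table. The short script below only redoes that arithmetic; the dictionary layout and variable names are illustrative.

```python
# DSC values copied from the results table above
dsc = {
    "XCAV":  {"CPC-SAM": 77.90, "SMART": 84.39},
    "CAVSA": {"CPC-SAM": 77.90, "SMART": 91.00},
}

# Absolute gain in DSC (percentage points) of SMART over CPC-SAM
gains = {name: round(v["SMART"] - v["CPC-SAM"], 2) for name, v in dsc.items()}

# Share of CAVSA videos that carry labels: 16 of 1061
labeled_fraction = round(100 * 16 / 1061, 1)

print(gains)             # {'XCAV': 6.49, 'CAVSA': 13.1}
print(labeled_fraction)  # 1.5
```

Both differences are absolute gaps in DSC, so they are best read as percentage points, not relative improvements.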
Ablation Study (impact of key components on XCAV/CAVSA; TPT is not expanded in this summary and presumably refers to the Phase 1 text-driven fine-tuning):
| Configuration | XCAV DSC | XCAV clDice | CAVSA DSC | CAVSA clDice |
|---|---|---|---|---|
| TPT+CCR (w/o DSTC) | 82.38 | 79.84 | 78.87 | 81.17 |
| TPT+DSTC (w/o CCR) | 76.71 | 79.86 | 25.82 | 32.65 |
| CCR+DSTC (w/o TPT) | 76.24 | 78.53 | 47.77 | 50.37 |
| Full SMART | 84.39 | 83.01 | 91.00 | 97.73 |
When CCR is removed, the CAVSA DSC collapses from 91.00 to 25.82, demonstrating that regularization of unreliable teacher outputs is indispensable. Experiments on the number of noise perturbations indicate that N=8 performs best among the settings tried (XCAV DSC of 84.39 vs 83.59 with N=2).
Highlights & Insights
- Ingenious Application of SAM3 Concept Prompts: Replacing geometric prompts with textual semantic descriptions removes the reliance on shape priors that point/box prompts carry, and yields significantly better cross-institution generalization than point/box prompting schemes.
- Elegant Design of Confidence-Aware Regularization: The uncertainty weighting is counter-intuitive. It applies stronger supervision to more uncertain regions, pushing the model to improve on its weak spots instead of simply ignoring uncertain areas.
- Dual-Stream Optical Flow Consistency: Bidirectional (forward + backward) flow mitigates confirmation bias in unidirectional flow, with motion consistency and flow coherence ensuring pixel alignment and foreground/background distinction respectively.
- Strong Performance with Minimal Annotations: Achieves SOTA performance with only 16 labeled videos, each with only 1-2 annotated frames.
- Cross-Domain Generalization on CADICA: Qualitatively demonstrates robust cross-domain segmentation capabilities on an unlabeled, third-party dataset.
- Open-Source Code: The full code is released, ensuring high reproducibility.
Limitations & Future Work
- The teacher model is frozen (not updated) during semi-supervised training, preventing continuous improvement of pseudo-label quality from unlabeled data. This potentially sacrifices further performance gains compared to schemes using updatable teachers.
- Optical flow estimation relies on pre-trained SEA-RAFT. The quality of the optical flow directly affects the effectiveness of temporal consistency, and no sensitivity analysis was conducted on optical flow accuracy.
- Validated only on coronary angiography scenarios without extending to other medical video segmentation tasks (such as endoscopy or ultrasound videos).
- Only qualitative visualization is performed on the CADICA dataset, with no quantitative metrics; hence, the extent of cross-domain generalization remains unquantified.
- Inference speed and model size are not reported; SAM3 as a foundation model incurs a relatively large computational overhead.
- Training runs for only 6k iterations with a batch size of 4, which is a small scale; how performance behaves with more data or longer training remains unknown.
- Dual-stream temporal consistency assumes that the vessel topology remains invariant in XCA videos, which might not hold true in longer sequences.
Rating
- Novelty: 4/5 — The combination of SAM3 concept prompts, uncertainty awareness, and dual-stream optical flow is novel and practical.
- Experimental Thoroughness: 4/5 — Evaluated on three datasets with detailed ablation studies and comparisons against multiple baselines, yielding highly convincing results.
- Writing Quality: 3/5 — Generally clear, but some mathematical symbols are inconsistent; open-sourced code provides extra credit.
- Value: 4/5 — Strong performance under extremely limited annotations holds practical significance for clinical deployment.