
ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction

Conference: ICCV 2025 | arXiv: 2507.15803 | Code: N/A | Area: Image Segmentation | Keywords: Semi-supervised segmentation, conformal prediction, SAM/SEEM, uncertainty calibration, pseudo labels

TL;DR

This paper proposes ConformalSAM, a framework that leverages Conformal Prediction to calibrate the output uncertainty of the foundation segmentation model SEEM on target domains. Unreliable pixel labels are filtered out before serving as supervision signals for unlabeled data. Combined with a late-stage self-reliance training strategy, the framework achieves 81.21 mIoU on PASCAL VOC under the 1/16 labeled setting.

Background & Motivation

The core challenge of semi-supervised semantic segmentation (SSSS) is how to effectively exploit large amounts of unlabeled data. A natural approach is to use foundation segmentation models such as SAM/SEEM to directly generate pseudo labels for unlabeled data; however, experiments show this actually degrades performance:

  • PASCAL VOC 1/16 split: 50.65 mIoU with labels only → drops to 42.00 when SEEM pseudo labels are added
  • Reason: a domain gap exists between SEEM's pretraining data and the target domain, resulting in inconsistent prediction quality on the target domain

Core problem: How can one reliably exploit the powerful capabilities of foundation models while filtering out their unreliable predictions?

This paper adopts Conformal Prediction (CP) as the uncertainty calibration tool because: (1) CP is a black-box method that requires only a small amount of labeled data for calibration; (2) it provides theoretically guaranteed coverage; (3) it does not require modifying the foundation model.
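For reference, point (2) is the standard marginal coverage guarantee of split conformal prediction: for exchangeable calibration and test points, \(\Pr\left(y_{\text{test}} \in \mathcal{C}(x_{\text{test}})\right) \geq 1 - \alpha\), so with \(\alpha = 0.05\) the prediction set contains the true class at least 95% of the time on average.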

Method

Overall Architecture

ConformalSAM employs a two-stage training procedure:

  • Stage I: joint training with CP-calibrated SEEM pseudo labels and ground-truth labels
  • Stage II: SEEM pseudo labels are discarded and training switches to self-reliance (SR)

Key Designs

  1. CP-calibrated Foundation Model Inference (Stage I):

    • Calibration procedure: The labeled set \(D_l\) is used as the calibration set
      • SEEM generates a probability map \(P_i \in \mathbb{R}^{K \times H \times W}\) for each labeled image
      • Nonconformity scores are defined as \(\hat{P}_i^j(a,b) = 1 - P_i^j(a,b)\); during calibration they are evaluated only at each pixel's ground-truth class
      • The calibration scores are pooled over all labeled pixels across all images to compute the \((1-\alpha)\) quantile threshold \(\hat{q}_\alpha\)
    • Calibrated inference: For an unlabeled image \(x_i\), the prediction set at pixel \((a,b)\) is \(\mathcal{C}_i(a,b) = \{j: \hat{P}_i^j(a,b) \leq \hat{q}_\alpha\}\)
    • Class-conditional filtering: Since background pixels dominate, when both background and non-background classes appear in the prediction set, non-background classes are prioritized: \(M_i(a,b) = \begin{cases} \arg\min_{j \in \mathcal{C}_i(a,b)} \hat{P}_i^j(a,b), & |\mathcal{C}_i(a,b)| > 0 \,\land\, 0 \notin \mathcal{C}_i(a,b) \\ \arg\min_{j \in \mathcal{C}_i(a,b),\, j \neq 0} \hat{P}_i^j(a,b), & |\mathcal{C}_i(a,b)| > 0 \,\land\, 0 \in \mathcal{C}_i(a,b) \\ \text{NaN}, & |\mathcal{C}_i(a,b)| = 0 \end{cases}\)
    • When the prediction set is empty, the pixel label is set to NaN (ignored), effectively filtering low-confidence predictions
    • Miscoverage rate: \(\alpha = 0.05\) (see the sketch after this list for the full calibration-and-filtering pipeline)
  2. Self-Reliance Training Strategy (Stage II):

    • SEEM-generated masks are discarded; the model's own pseudo labels are used instead
    • Dynamic weight decay strategy: \(\mathcal{L} = (1 - \lambda(t)) \times \mathcal{L}_s + \lambda(t) \times \mathcal{L}_u\)
    • \(\lambda(t)\) decays exponentially, so that the model relies increasingly on ground-truth supervision in later epochs
    • PASCAL VOC: Stage I for 60 epochs, Stage II for 20 epochs
    • ADE20K: Stage I for 30 epochs, Stage II for 10 epochs
  3. Flexible Plug-in Design:

    • The Stage II self-training framework can be replaced by other methods such as AllSpark
    • ConformalSAM(AllSpark): Stage I uses CP-calibrated pseudo labels, Stage II switches to AllSpark
    • This demonstrates the generality and composability of the framework
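To make Stage I concrete, below is a minimal NumPy sketch of the calibration-and-filtering pipeline described above. It assumes SEEM softmax maps are available as arrays; the function names and the use of `ignore_index=255` in place of the paper's NaN label are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

ALPHA = 0.05  # miscoverage rate used in the paper


def calibrate_threshold(prob_maps, gt_masks, alpha=ALPHA):
    """Pixel-wise split-conformal calibration on the labeled set D_l.

    prob_maps: iterable of (K, H, W) SEEM softmax maps
    gt_masks:  iterable of (H, W) integer ground-truth masks
    Returns the (1 - alpha) quantile q_hat of the nonconformity
    scores 1 - P[y] pooled over all labeled pixels.
    """
    scores = []
    for P, y in zip(prob_maps, gt_masks):
        # probability assigned to the ground-truth class at each pixel
        p_true = np.take_along_axis(P, y[None, :, :], axis=0)[0]
        scores.append(1.0 - p_true.ravel())
    scores = np.concatenate(scores)
    n = scores.size
    # finite-sample corrected quantile level of split CP
    level = min(1.0, np.ceil((n + 1) * (1.0 - alpha)) / n)
    return float(np.quantile(scores, level))


def conformal_pseudo_label(P, q_hat, ignore_index=255):
    """CP-calibrated inference with class-conditional filtering.

    P: (K, H, W) softmax map for one unlabeled image; class 0 = background.
    Empty prediction sets are mapped to `ignore_index` (the paper's NaN).
    """
    score = 1.0 - P                        # nonconformity per class/pixel
    in_set = score <= q_hat                # prediction sets C_i(a, b)
    set_size = in_set.sum(axis=0)          # |C_i(a, b)|
    masked = np.where(in_set, score, np.inf)
    label = masked.argmin(axis=0)          # lowest-score class in the set
    # class-conditional filtering: prefer any non-background class
    # that survives the threshold over the background class
    fg = masked.copy()
    fg[0] = np.inf
    has_fg = np.isfinite(fg).any(axis=0)
    label = np.where(has_fg, fg.argmin(axis=0), label)
    return np.where(set_size > 0, label, ignore_index)
```

The class-conditional branch mirrors the case analysis of \(M_i(a,b)\): whenever a non-background class is in the prediction set it wins over background, and pixels with empty sets are ignored during training.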

Loss & Training

  • Labeled data: standard cross-entropy loss
  • Unlabeled data (Stage I): NaN pixels are ignored; CE is computed only on high-confidence pixels selected by CP
  • Stage II applies exponentially decayed weights to balance supervised and unsupervised losses (see the sketch below)
  • SegFormer-B5 is used as the segmentation backbone
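A matching sketch of the overall objective: the paper states only that \(\lambda(t)\) decays exponentially, so the schedule \(\lambda(t) = \lambda_0 e^{-kt/T}\) and the constants \(\lambda_0\), \(k\) below are illustrative assumptions.

```python
import math

import torch.nn.functional as F


def combined_loss(logits_l, target_l, logits_u, pseudo_u,
                  epoch, total_epochs, lam0=0.9, k=5.0, ignore_index=255):
    """L = (1 - lambda(t)) * L_s + lambda(t) * L_u with exponential decay.

    logits_*: (N, K, H, W) model outputs; target_l / pseudo_u: (N, H, W).
    `ignore_index` drops pixels marked NaN (e.g., CP-rejected pixels in
    Stage I), so only trusted pixels contribute to the cross-entropy.
    """
    lam = lam0 * math.exp(-k * epoch / total_epochs)  # exponential decay
    loss_s = F.cross_entropy(logits_l, target_l, ignore_index=ignore_index)
    loss_u = F.cross_entropy(logits_u, pseudo_u, ignore_index=ignore_index)
    return (1.0 - lam) * loss_s + lam * loss_u
```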

Key Experimental Results

Main Results

PASCAL VOC (mIoU):

| Method | 1/16 (92) | 1/8 (183) | 1/4 (366) | 1/2 (732) | Full |
|---|---|---|---|---|---|
| UniMatch | 75.2 | 77.2 | 78.8 | 79.9 | - |
| AllSpark | 76.07 | 78.41 | 79.77 | 80.75 | 82.12 |
| ConformalSAM (AllSpark) | 80.69 | 81.29 | 81.33 | 82.69 | 83.44 |
| ConformalSAM | 81.21 | 82.22 | 81.84 | 83.52 | 83.85 |

ADE20K (mIoU):

| Method | 1/128 (158) | 1/64 (316) | 1/32 (632) | 1/16 (1263) | 1/8 (2526) |
|---|---|---|---|---|---|
| AllSpark | 16.17 | 23.03 | 26.42 | 28.40 | 32.10 |
| ConformalSAM | 26.21 | 30.02 | 33.33 | 34.64 | 36.25 |

Ablation Study

Component ablation (mIoU):

| Configuration | SEEM | CP | SR | VOC 1/16 | VOC 1/2 |
|---|:---:|:---:|:---:|---|---|
| Semi-Baseline | | | | 52.89 | 74.22 |
| +SEEM (direct) | ✓ | | | 42.00 | 44.99 |
| +SEEM+CP | ✓ | ✓ | | 78.09 | 79.10 |
| +SEEM+CP+SR | ✓ | ✓ | ✓ | 81.21 | 83.52 |
CP variant ablation (VOC 1/16, mIoU):

| CP Variant | α=0.1 | α=0.05 | α=0.01 |
|---|---|---|---|
| Pixel-wise | 74.31 | 78.09 | 68.01 |
| Image-wise | 75.99 | 75.54 | 44.59 |
| K-Means | 69.36 | 69.13 | 44.16 |

Key Findings

  • Critical role of CP: directly applying SEEM reduces mIoU by 8.65 relative to the labels-only baseline (50.65 → 42.00); adding CP lifts it to 78.09, a 25.2-point gain over the semi-baseline (1/16 setting)
  • Class-conditional filtering is essential: compared to vanilla CP, it brings an average gain of 34.11 mIoU
  • Pixel-wise CP outperforms image-wise, K-Means, GenAnn, and other CP variants
  • \(\alpha=0.05\) is the optimal miscoverage rate for the adopted pixel-wise CP
  • The SR strategy contributes an additional average gain of 3.76 mIoU
  • On ADE20K under the 1/128 setting, the improvement reaches 10.04 mIoU (AllSpark: 16.17 → 26.21)
  • When integrated as a plug-in into AllSpark, an average gain of 2.07 mIoU is achieved

Highlights & Insights

  • First application of CP to calibrate pseudo labels from foundation segmentation models in SSSS, with a concise idea backed by strong empirical validation
  • Class-conditional filtering addresses the critical failure mode of SEEM in segmentation tasks — foreground being overwhelmed by background pixels
  • The two-stage strategy follows a clear design logic: exploit foundation model knowledge early, then avoid overfitting to SEEM noise in later training
  • As a plug-in framework, it can be freely combined with existing SSSS methods such as AllSpark

Limitations & Future Work

  • Effectiveness depends on the overlap between foundation model knowledge and the target task — gains are smaller on datasets with novel categories such as ADE20K and Cityscapes
  • CP calibration requires labeled data; calibration accuracy may be insufficient in extremely low-label scenarios (e.g., tens of images)
  • Only SEEM is evaluated as the foundation model; stronger models such as SAM2 and GLAMM remain unexplored
  • The switching point for SR training is determined empirically (60 epochs) and may require adjustment for different datasets
  • In-depth comparison with prompt-engineering-based SAM approaches is absent
Related Work & Context

  • UniMatch/AllSpark: Current SSSS state-of-the-art methods; ConformalSAM is complementary to them
  • SemiSAM/CPC-SAM: Leverage SAM via improved prompting, whereas this paper directly uses SEEM outputs with CP calibration
  • Conformal Prediction: Imported from classification/detection into segmentation as an uncertainty calibration tool
  • CP holds broad promise for calibrating outputs of other foundation models, including LLMs

Rating

  • Novelty: ⭐⭐⭐⭐ The idea of applying CP to calibrate foundation segmentation models is novel, though the two-stage training itself is relatively straightforward
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets (VOC/VOC-aug/ADE20K), plug-in validation, and comprehensive CP variant ablations
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clear and ablation design is well-conceived, though the method section is equation-heavy
  • Value: ⭐⭐⭐⭐ Demonstrates a general paradigm for safely leveraging foundation models to assist downstream training