Enhancing Image-Conditional Coverage in Segmentation: Adaptive Thresholding via Differentiable Miscoverage Loss

Conference: ICLR 2026
Code: bjbbbb/Conditional-Optimization-for-Adaptive-Thresholding
Area: segmentation
Keywords: conformal prediction, image-conditional coverage, adaptive thresholding, differentiable miscoverage loss, uncertainty quantification

TL;DR

COAT learns an image-adaptive threshold predictor end-to-end, using a sigmoid-based soft approximation of the true positive rate (TPR) as a differentiable loss. Layered on top of Conformal Risk Control (CRC) for image segmentation, it reduces the per-image Coverage Gap by roughly 27% to 32% relative to CRC in the settings quoted in this summary.

Background & Motivation

Background: Conformal Risk Control (CRC) provides marginal statistical guarantees for image segmentation by searching for a single threshold \(\tau'\) on a calibration set to control the False Negative Rate (FNR).

Limitations of Prior Work: A single global threshold is a "one-size-fits-all" approach: "easy" images are over-covered while "hard" images are severely under-covered. This leads to a high Coverage Gap (the mean difference between per-image TPR and the target coverage \(1-\alpha\)). Furthermore, the relationship between thresholds and coverage is non-monotonic and discontinuous (as shown in Figure 2), preventing direct gradient computation on coverage.

Key Challenge: Marginal guarantees (average FNR \(\le \alpha\)) \(\neq\) conditional guarantees (FNR for each image \(\le \alpha\)). While CRC solves the former, the latter is the actual requirement in high-risk scenarios such as medicine and autonomous driving.

Goal: To learn an image-adaptive threshold \(\hat{\tau}(X)\) for each image so that its per-image coverage closely matches the target \(1-\alpha\).

Core Idea: Replace hard threshold binarization with a soft mask using a sigmoid function, making TPR differentiable with respect to \(\hat{\tau}\). This allows defining an end-to-end optimizable miscoverage loss, bypassing the tedious process of pre-computing optimal thresholds.
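The gap between marginal and per-image coverage can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes the Coverage Gap is the mean absolute deviation of per-image TPR from \(1-\alpha\), and the two flattened "images", the threshold 0.5, and \(\alpha = 0.25\) are made-up values chosen so the arithmetic is exact.

```python
def tpr(probs, labels, tau):
    """Fraction of foreground pixels (label 1) whose probability is >= tau."""
    fg = [p for p, y in zip(probs, labels) if y == 1]
    return sum(p >= tau for p in fg) / len(fg)


def coverage_gap(tprs, alpha):
    """Mean absolute deviation of per-image TPR from 1 - alpha (assumed definition)."""
    target = 1.0 - alpha
    return sum(abs(t - target) for t in tprs) / len(tprs)


# Two hypothetical images, flattened to (probabilities, labels) pixel lists.
easy = ([0.9, 0.8, 0.95, 0.85, 0.1], [1, 1, 1, 1, 0])
hard = ([0.6, 0.3, 0.2, 0.7, 0.1], [1, 1, 1, 1, 0])

alpha = 0.25
tau = 0.5  # one global threshold shared by both images

tprs = [tpr(p, y, tau) for p, y in (easy, hard)]
marginal = sum(tprs) / len(tprs)
gap = coverage_gap(tprs, alpha)

print("per-image TPR:", tprs)
print("marginal coverage:", marginal)
print("coverage gap:", gap)
```

Here the average coverage equals the target while each individual image misses it by 0.25, which is the situation a per-image threshold is meant to address.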

Method

Overall Architecture

The paper proposes two progressive schemes: AT (supervised regression baseline) and COAT (end-to-end differentiable optimization). Both share the same threshold predictor \(f_D\), which takes image \(X\) and the base segmentation model's probability map \(\hat{p}(X)\) as inputs. The difference lies in the training objective: AT is supervised by pre-calculated optimal hard thresholds, while COAT directly optimizes conditional coverage using soft TPR. After training, both compute a global correction value \(t'\) on a calibration set to maintain marginal guarantees.

flowchart TD
    A["Input Image X"] --> B["Base Segmentation Model\nOutput Probability Map p̂(X)"]
    A --> C["Threshold Predictor fD(X, p̂(X))"]
    B --> C
    C --> D["Predicted Threshold τ̂(X)"]
    D --> E{"COAT Training"}
    B --> E
    E --> F["Soft Mask Msoft = σ((p̂-τ̂)/T)"]
    F --> G["Soft TPR = ΣMsoft·Y / ΣY"]
    G --> H["LCOAT = (Soft TPR - (1-α))²"]
    H --> |"Gradient Backpropagation"| C
    D --> I["Calibration Set Correction t'"]
    I --> J["Final Threshold τ'i = clip(τ̂i - t', 0, 1)"]
    J --> K["Prediction Set Ĉ(X) = {p̂(X) ≥ τ'i}"]

Key Designs

1. AT: Supervised Threshold Regression — Foundation of the Adaptive Framework

AT treats threshold prediction as supervised regression. For each image \((X_i, Y_i)\) in the training set, an "ideal threshold" \(\tau^*(X, Y)\) is pre-computed via binary search such that the TPR exactly equals \(1-\alpha\). Then, \(f_D\) is trained using MSE loss:

\[\mathcal{L}_\text{AT} = \mathbb{E}_{(X,Y)\sim D_\text{train}}\left[(\hat{\tau}(X) - \tau^*(X,Y))^2\right]\]

AT directly regresses a threshold scalar, which is simple and effective but relies on pre-computation and suffers from errors when the threshold-coverage relationship is non-monotonic.
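A minimal sketch of the AT targets, under stated assumptions: since the hard TPR is a step function of the threshold, exact equality with \(1-\alpha\) may not be attainable, so this version bisects for (approximately) the largest threshold whose hard TPR is at least \(1-\alpha\). The function names, the toy pixels, and the placeholder prediction 0.5 are hypothetical and do not come from the paper's code.

```python
def hard_tpr(probs, labels, tau):
    """Hard TPR: share of foreground pixels with probability >= tau."""
    fg = [p for p, y in zip(probs, labels) if y == 1]
    return sum(p >= tau for p in fg) / len(fg)


def ideal_threshold(probs, labels, alpha, iters=40):
    """Bisection for (approximately) the largest tau with hard TPR >= 1 - alpha."""
    target = 1.0 - alpha
    lo, hi = 0.0, 1.0  # invariant: hard_tpr(lo) >= target, since TPR(0) = 1
    for _ in range(iters):
        mid = (lo + hi) / 2
        if hard_tpr(probs, labels, mid) >= target:
            lo = mid  # still covered enough, try a stricter threshold
        else:
            hi = mid
    return lo


def at_loss(tau_hats, tau_stars):
    """Mean squared error between predicted and pre-computed thresholds."""
    return sum((a - b) ** 2 for a, b in zip(tau_hats, tau_stars)) / len(tau_hats)


# Toy image: four foreground pixels, one background pixel.
probs = [0.9, 0.7, 0.4, 0.2, 0.1]
labels = [1, 1, 1, 1, 0]

tau_star = ideal_threshold(probs, labels, alpha=0.25)
loss = at_loss([0.5], [tau_star])  # 0.5 stands in for a predictor output

print("tau* ~", round(tau_star, 6))
print("AT loss for a prediction of 0.5:", round(loss, 6))
```

With three of four foreground pixels required, the search settles just at the third-highest foreground probability, and the regression loss only measures distance to that label rather than coverage itself.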

2. COAT: Differentiable Miscoverage Loss — Direct Optimization of Conditional Coverage

The core insight of COAT is that hard threshold binarization \(\mathbf{1}[\hat{p}_j \geq \hat{\tau}]\) is non-differentiable. By replacing it with a sigmoid, a soft mask is obtained:

\[M_\text{soft}(X) = \sigma\!\left(\frac{\hat{p}(X) - \hat{\tau}(X)}{T}\right)\]

where the temperature parameter \(T > 0\) controls the steepness of the sigmoid (\(T \to 0\) approaches a hard threshold). The soft TPR is:

\[\widetilde{\text{TPR}}(X, Y, \hat{\tau}) = \frac{\sum_j M_\text{soft}(X)[j] \cdot Y[j]}{\sum_j Y[j] + \epsilon}\]

The loss function directly penalizes the difference between the soft TPR and the target coverage:

\[\mathcal{L}_\text{COAT} = \mathbb{E}\left[\left(\widetilde{\text{TPR}}(X, Y, \hat{\tau}(X)) - (1-\alpha)\right)^2\right]\]

Gradients flow from \(\mathcal{L}_\text{COAT}\) through \(M_\text{soft}\) back to the parameters of \(f_D\), requiring no intermediate supervised labels.
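The loss and its gradient can be sketched for a single image in plain Python. This is an illustration under assumptions, not the authors' implementation: the threshold is optimized directly as a scalar (standing in for the parameters of \(f_D\)), the derivative is written out by hand instead of using automatic differentiation, and the pixel values, \(T = 0.1\), learning rate, and step count are arbitrary choices.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def coat_loss_and_grad(probs, labels, tau, T, alpha, eps=1e-8):
    """Squared miscoverage of the soft TPR for one image, and d(loss)/d(tau)."""
    denom = sum(labels) + eps
    soft = [sigmoid((p - tau) / T) for p in probs]
    soft_tpr = sum(s * y for s, y in zip(soft, labels)) / denom
    # d sigmoid((p - tau) / T) / d tau = -s * (1 - s) / T
    d_soft_tpr = sum(-s * (1.0 - s) / T * y for s, y in zip(soft, labels)) / denom
    diff = soft_tpr - (1.0 - alpha)
    return diff ** 2, 2.0 * diff * d_soft_tpr


probs = [0.9, 0.7, 0.4, 0.2, 0.1]
labels = [1, 1, 1, 1, 0]
T, alpha, lr = 0.1, 0.1, 0.1

tau = 0.5
initial_loss, _ = coat_loss_and_grad(probs, labels, tau, T, alpha)
for _ in range(300):
    _, grad = coat_loss_and_grad(probs, labels, tau, T, alpha)
    tau -= lr * grad  # plain gradient descent on the threshold
final_loss, _ = coat_loss_and_grad(probs, labels, tau, T, alpha)

print("initial loss:", initial_loss)
print("final tau:", tau, "final loss:", final_loss)
```

No pre-computed threshold label appears anywhere: the only supervision is the ground-truth mask and the target \(1-\alpha\).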

3. Post-hoc Calibration Correction — Layering Marginal Guarantees on Adaptive Thresholds

COAT training only optimizes conditional coverage; the marginal guarantee is restored afterwards using the calibration set. A global correction \(t'\) is computed on \(D_\text{cal}\):

\[t' = \inf\!\left\{t \;\middle|\; R(t) \geq \frac{|D_\text{cal}|+1}{|D_\text{cal}|}(1-\alpha)\right\}\]

where \(R(t)\) is the empirical coverage after shifting all calibration image thresholds by \(-t\). The final test threshold is \(\tau'_i = \text{clip}(\hat{\tau}_i - t', 0, 1)\). This step grants AT/COAT finite-sample marginal guarantees (Theorem 1, inherited from CRC theory).
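The correction step can be sketched as a search over candidate shifts. Several things here are assumptions made for illustration: the infimum is approximated by a grid over \([-1, 1]\) with step 0.01, a conservative fallback of \(t' = 1\) is returned if no grid point reaches the required level, and the calibration images and predicted thresholds are toy values.

```python
def clip(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))


def hard_tpr(probs, labels, tau):
    fg = [p for p, y in zip(probs, labels) if y == 1]
    return sum(p >= tau for p in fg) / len(fg)


def empirical_coverage(cal, tau_hats, t):
    """R(t): mean per-image TPR after shifting every predicted threshold by -t."""
    total = sum(hard_tpr(p, y, clip(th - t)) for (p, y), th in zip(cal, tau_hats))
    return total / len(cal)


def calibrate_shift(cal, tau_hats, alpha, step=0.01):
    """Smallest grid value t with R(t) >= (n + 1) / n * (1 - alpha)."""
    n = len(cal)
    level = (n + 1) / n * (1.0 - alpha)
    for i in range(int(round(2.0 / step)) + 1):
        t = -1.0 + i * step
        if empirical_coverage(cal, tau_hats, t) >= level:
            return t
    return 1.0  # assumed fallback: push every threshold to 0


# Toy calibration set: (probabilities, labels) per image, plus predicted thresholds.
cal = [
    ([0.83, 0.615, 0.2], [1, 1, 0]),
    ([0.455, 0.325, 0.9], [1, 1, 0]),
    ([0.955, 0.745, 0.1], [1, 1, 0]),
    ([0.535, 0.285, 0.3], [1, 1, 0]),
]
tau_hats = [0.7, 0.5, 0.8, 0.6]
alpha = 0.5

t_prime = calibrate_shift(cal, tau_hats, alpha)
final_thresholds = [clip(th - t_prime) for th in tau_hats]

print("t' ~", round(t_prime, 2))
print("final thresholds:", [round(v, 2) for v in final_thresholds])
```

Because the same shift is applied to every image, the ordering of the predicted thresholds is preserved and only their overall level is adjusted.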

4. Threshold Predictor Architecture

\(f_D\) takes the concatenated tensor of image \(X\) and probability map \(\hat{p}(X)\) as input and outputs a single scalar \(\hat{\tau}(X) \in [0,1]\). This architecture is independent of the base segmentation model and can be flexibly replaced (compatible with DeepLab v3+, UNet, PSPNet, and SINet in experiments).

Key Experimental Results

Main Results

The following table compares methods on the Polyp dataset with a PSPNet base model at \(\alpha=0.1\). Values are mean (SD) over 20 random splits:

| Method | Marginal Coverage | Coverage Gap ↓ |
|---|---|---|
| CRC | 0.906 (0.019) | 0.150 (0.015) |
| AA-CRC | 0.908 (0.018) | 0.119 (0.016) |
| AT | 0.899 (0.018) | 0.119 (0.014) |
| COAT | 0.894 (0.016) | 0.110 (0.015) |

For Polyp+SINet with \(\alpha=0.1\): COAT Coverage Gap is 0.102 vs. CRC 0.149 (31% reduction). For Skin+DeepLab v3+ with \(\alpha=0.2\): COAT 0.073 vs. CRC 0.107 (32% reduction). COAT consistently achieves the best Coverage Gap across 24 experimental groups (3 datasets × 4 models × 2 \(\alpha\) values).
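The relative reductions can be recomputed from the Coverage Gap values quoted in this summary. The snippet below only restates that arithmetic; the dictionary keys are labels chosen here, and because the inputs are already rounded to three decimals, the results agree with the quoted percentages only up to rounding.

```python
def relative_reduction(baseline, improved):
    """Fractional decrease of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# (CRC Coverage Gap, COAT Coverage Gap) as quoted in this summary.
cases = {
    "Polyp + PSPNet, alpha=0.1": (0.150, 0.110),
    "Polyp + SINet, alpha=0.1": (0.149, 0.102),
    "Skin + DeepLab v3+, alpha=0.2": (0.107, 0.073),
}

reductions = {name: relative_reduction(crc, coat) for name, (crc, coat) in cases.items()}

for name, r in reductions.items():
    print(f"{name}: {100 * r:.1f}% lower Coverage Gap than CRC")
```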

Ablation Study

| Configuration | Coverage Gap (Polyp, PSPNet, α=0.1) | Description |
|---|---|---|
| CRC (Non-adaptive) | 0.150 | Global single threshold baseline |
| AT (Supervised) | 0.119 | Adaptive but relies on pre-computed hard thresholds |
| COAT (Differentiable Loss) | 0.110 | End-to-end direct optimization of conditional coverage |

Ablation of temperature \(T\) (Appendix A.5): When \(T\) is too small, the soft mask approaches a hard threshold and gradients vanish; when \(T\) is too large, the mask is over-softened and the soft TPR no longer tracks the true coverage. A medium temperature works best.
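Both failure modes can be seen numerically on a toy image. This sketch is illustrative only: the foreground probabilities and the three temperatures are arbitrary values, not the settings from the paper's ablation, and the slope is the hand-derived derivative of the soft TPR with respect to the threshold.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_tpr_and_slope(fg_probs, tau, T):
    """Soft TPR over foreground pixels and its derivative with respect to tau."""
    s = [sigmoid((p - tau) / T) for p in fg_probs]
    n = len(fg_probs)
    value = sum(s) / n
    slope = sum(-v * (1.0 - v) / T for v in s) / n
    return value, slope


fg = [0.9, 0.7, 0.4, 0.2]  # hypothetical foreground probabilities

results = {T: soft_tpr_and_slope(fg, 0.5, T) for T in (0.001, 0.1, 10.0)}
for T, (value, slope) in results.items():
    print(f"T={T}: soft TPR={value:.4f}, d(soft TPR)/d(tau)={slope:.3e}")

# With a very large T the soft TPR stays near 0.5 even at the lowest threshold,
# so a target such as 0.9 cannot be reached by moving tau within [0, 1].
max_soft_tpr_large_T, _ = soft_tpr_and_slope(fg, 0.0, 10.0)
print("largest soft TPR at T=10:", round(max_soft_tpr_large_T, 4))
```

At the smallest temperature every sigmoid is saturated and the slope is numerically zero, while at the largest temperature the slope is small and the attainable soft TPR range collapses around 0.5.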

Key Findings

  • COAT achieves the best Coverage Gap in all experimental combinations while still satisfying marginal coverage (≈ target \(1-\alpha\)); the two are not in conflict.
  • The COAT training loss converges quickly and stably near 0 across four different base segmentation models (Figure 5).
  • Qualitative visualization (Figure 3/4): While CRC shows an FNR as high as 0.613 on hard images, COAT controls almost all images near the target FNR.
  • The improvement is relatively smaller on the Fire dataset (due to lower variance in image difficulty), highlighting that the method's advantages are more prominent on highly heterogeneous datasets.

Highlights & Insights

  • Clean Differentiability: Replacing the non-differentiable indicator function with a sigmoid soft mask is simple yet enables differentiability for the entire TPR, turning "direct coverage optimization" from a theoretical possibility into an engineering reality.
  • Complete Theoretical Guarantees: COAT does not abandon marginal guarantees—it uses the post-hoc calibration correction \(t'\) to recover CRC's finite-sample theory, achieving "conditional coverage optimization + marginal guarantee layering."
  • Model Agnosticism: \(f_D\) takes the probability maps of any segmentation model as input. It requires no changes to the base model and can serve as a plug-and-play post-processing module.
  • Introduction of the Coverage Gap Metric: Measuring the quality of conditional coverage using the difference between per-image coverage and target coverage provides a finer granularity than marginal coverage and is worth adopting.

Limitations & Future Work

  • Training \(f_D\) requires an independent \(D_2\) (separated from \(D_1\) used for training the base model), increasing data partitioning complexity and data volume requirements.
  • The temperature \(T\) is a hyperparameter requiring additional tuning; its optimal value depends on the dataset and model, and an adaptive determination scheme is lacking.
  • Currently limited to binary segmentation (foreground/background); the extension of conditional coverage to multi-class semantic segmentation remains to be explored.
  • Theoretical proofs for conditional validity (Appendix A.1) rely on strong distributional assumptions, and the actual guarantee strength is weaker than marginal guarantees.

Related Work & Insights

  • vs. CRC (Angelopoulos et al., 2024): CRC is the foundation of this work and provides marginal guarantees; COAT adds image-level adaptation on top of it to bridge the gap in conditional coverage.
  • vs. AA-CRC (Blot et al., 2025): AA-CRC also attempts adaptive thresholding but does not use differentiable optimization; COAT further reduces the Coverage Gap through end-to-end training.
  • vs. SACP (Bereska et al., 2025): SACP adapts in the spatial dimension (pixel neighborhoods), while COAT adapts in the image dimension (whole-image thresholds); the two are complementary.
  • Insights for Segmentation Uncertainty Estimation: The soft-thresholding idea can be extended to other tasks requiring differentiable coverage control, such as conditional coverage control for object detection bounding boxes.

Rating

  • Novelty: ⭐⭐⭐⭐ The construction of a differentiable miscoverage loss is novel, advancing coverage optimization from "calibration post-processing" to a "training objective."
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage of 24 groups across 3 datasets, 4 models, and 2 \(\alpha\) values, with intuitive qualitative visualizations.
  • Writing Quality: ⭐⭐⭐⭐ Problem modeling is clear, the progression from AT to COAT is logical, and the algorithm pseudocode is complete.
  • Value: ⭐⭐⭐⭐ Directly applicable for uncertainty quantification in high-risk segmentation scenarios like medical imaging and autonomous driving.