# Towards Unsupervised Domain Bridging via Image Degradation in Semantic Segmentation
Conference: NeurIPS 2025 | arXiv: 2412.10339 | Code: Available | Area: Image Segmentation | Keywords: Unsupervised Domain Adaptation, Semantic Segmentation, Diffusion Process, Image Degradation, Domain Bridging
## TL;DR
This paper proposes DiDA, which formalizes image degradation operations as the forward process of diffusion models to construct a continuous intermediate domain between the source and target domains. Combined with a semantic shift compensation mechanism, DiDA serves as a plug-and-play module that consistently improves existing UDA semantic segmentation methods.
## Background & Motivation
Semantic segmentation models suffer severe performance degradation when deployed across domains. While self-training (ST) has become the dominant paradigm in UDA (e.g., DAFormer, HRDA, and the MIC series), these methods lack an explicit mechanism for learning domain-shared features.
From a causal representation learning perspective, an observation is generated as \(x = \Phi(c, e)\), where \(c\) denotes causal features determining class identity (e.g., shape) and \(e\) denotes domain-specific features (e.g., texture). Since \(e_S \neq e_T\), we have \(x_S \neq x_T\), which hinders the learning of domain-invariant features.
The core insight is drawn from the forward process of diffusion models: progressively adding noise removes attributes in order of granularity — fine-grained domain-specific attributes (texture) are lost first, while coarse-grained domain-invariant attributes (shape) are lost later. This implies that the overlapping region of intermediate domain distributions created by degradation can serve as a prior for the domain-shared distribution.
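This forward process can be sketched numerically. The snippet below implements the degradation \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon\) with the sigmoid schedule and \(T=100\) stated later in the method section; the exact endpoints of the sigmoid ramp are illustrative assumptions, since the paper does not specify them.

```python
import numpy as np

def sigmoid_alpha_bar(T=100, start=-3.0, end=3.0):
    """Cumulative signal-retention schedule ᾱ_t from a sigmoid ramp.

    The ramp endpoints (start/end) are assumed for illustration; the
    paper only states T = 100 with a sigmoid schedule.
    """
    t = np.linspace(start, end, T)
    s = 1.0 / (1.0 + np.exp(t))          # strictly decreasing in t
    return (s - s[-1]) / (s[0] - s[-1])  # normalized so ᾱ_1 = 1, ᾱ_T = 0

def q_sample(x0, t, alpha_bar, rng):
    """Forward diffusion step: x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps
```

As \(t\) grows, \(\bar{\alpha}_t \to 0\), so \(x_t\) from either domain approaches the same Gaussian; the intermediate timesteps are where fine-grained domain-specific attributes are already lost while coarse-grained shape cues survive.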
However, directly using degradation as domain bridging poses two major challenges: (1) stable feature representations must be maintained across a wide range of degradation levels; and (2) degradation inevitably damages domain-invariant features, leading to the semantic shift problem.
## Method

### Overall Architecture
DiDA is integrated into the standard self-training (ST) UDA pipeline and consists of two core modules: (1) degradation-based intermediate domain construction, which creates a continuous intermediate domain via the diffusion forward process; and (2) semantic shift compensation, which uses a diffusion encoder to disentangle and compensate for semantic information loss caused by degradation. At inference time, only the backbone segmentation network \(f_\theta = h \circ g\) is used, with no additional computational overhead.
### Key Designs
- Degradation-based Intermediate Domain Construction: The intermediate states \(X_1, X_2, \ldots, X_T\) produced by the diffusion forward process \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon\) are treated as intermediate domains. As the timestep increases, the overlapping area across domain distributions gradually expands as domain-specific attributes are eliminated. Based on a theoretical proposition (a monotonic relationship between attribute loss and timestep), the degradation operation constructs a continuous bridge from the source/target domains to a shared domain.
- Semantic Shift Compensation: A trainable diffusion encoder \(g'\), conditioned on a time embedding module, extracts the semantic shift information from the degraded image \(x_t\): \(\hat{z}_{t,i} = z'_{t,i} \odot \left(\mathrm{MLP}_s^i(\mathrm{Embed}(t)) + 1\right) + \mathrm{MLP}_b^i(\mathrm{Embed}(t))\). Features are fused via residual connections \(g + g'\) at multiple levels and supervised by a noise-reconstruction loss \(\mathcal{L}^R = \|h'((g + g')(x_t)) - \epsilon\|_2^2\), where \(h'\) is a reconstruction head used only during training. The design motivation is that time embeddings let the network disentangle the semantic loss corresponding to each degradation level, enabling targeted compensation.
- Degraded Image Consistency (DIC) Loss: \(\mathcal{L}^D = \sum_{i=1}^{N_S} \mathcal{L}_{ce}(\bar{f}_\theta(x_{i,t}^S, t), y_i^S) + \sum_{i=1}^{N_T} \mathcal{L}_{ce}(\bar{f}_\theta(x_{i,t}^T, t), p_i^T, q^T)\), where \(\bar{f}_\theta = h \circ (g + g')\), \(p_i^T\) are target pseudo-labels, and \(q^T\) is their confidence weight. The loss enforces prediction consistency between degraded and original images.
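The time-conditioned modulation in the semantic shift compensation module can be sketched as follows. The sinusoidal `time_embed` and the single-layer matrices `W_s`/`W_b` are illustrative assumptions standing in for the paper's \(\mathrm{Embed}(t)\) and \(\mathrm{MLP}_s\)/\(\mathrm{MLP}_b\), applied here to a single channel vector for brevity.

```python
import numpy as np

def time_embed(t, dim=16):
    """Sinusoidal timestep embedding (a common choice, assumed here;
    the paper does not spell out its Embed(t) design)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    ang = t * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])

def modulate(z, t, W_s, W_b):
    """ẑ = z ⊙ (MLP_s(Embed(t)) + 1) + MLP_b(Embed(t)),
    with single-layer 'MLPs' W_s, W_b (illustrative)."""
    e = time_embed(t, W_s.shape[0])
    scale = e @ W_s  # per-channel scale, shape (C,)
    shift = e @ W_b  # per-channel shift, shape (C,)
    return z * (scale + 1.0) + shift
```

Note the `+ 1` in the scale term: with zero-initialized modulation weights the operation reduces to the identity, so the residual fusion \(g + g'\) starts from an unmodified feature and learns the per-timestep correction gradually.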
### Loss & Training
The total training loss is a weighted sum of four terms: \(\mathcal{L} = \mathcal{L}^S + \mathcal{L}^T + \lambda_D \mathcal{L}^D + \lambda_R \mathcal{L}^R\)
- \(\mathcal{L}^S\): source domain supervised loss
- \(\mathcal{L}^T\): target domain pseudo-label self-training loss
- \(\lambda_D = 0.5\); \(\lambda_R\) is adjusted per architecture (DAFormer: 5, DeepLabV2: 1)
- Noise schedule: \(T=100\), sigmoid schedule
- At inference, \(g'\) and \(h'\) are fully removed, incurring zero additional overhead
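A minimal sketch of the weighted objective, using the \(\lambda\) values listed above (the per-architecture \(\lambda_R\) lookup is just an illustration of the paper's stated settings; the loss terms are passed in as scalars):

```python
def total_loss(L_S, L_T, L_D, L_R, arch="daformer"):
    """L = L^S + L^T + λ_D L^D + λ_R L^R.

    λ_D = 0.5 in all settings; λ_R depends on the architecture
    (DAFormer: 5, DeepLabV2: 1), as stated in the paper.
    """
    lam_D = 0.5
    lam_R = {"daformer": 5.0, "deeplabv2": 1.0}[arch]
    return L_S + L_T + lam_D * L_D + lam_R * L_R
```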
## Key Experimental Results

### Main Results
Consistent Gains Across Methods, Architectures, and Benchmarks (mIoU)
| Method | GTA→CS (CNN) | GTA→CS (Trans) | SYN→CS (CNN) | SYN→CS (Trans) | CS→ACDC (Trans) |
|---|---|---|---|---|---|
| DAFormer | 56.0 | 68.3 | 54.7 | 60.9 | 55.4 |
| +DiDA | 58.3 (+2.3) | 70.3 (+2.0) | 57.6 (+2.9) | 63.1 (+2.2) | 59.1 (+3.7) |
| HRDA | 63.0 | 73.8 | 61.2 | 65.8 | 68.0 |
| +DiDA | 64.3 (+1.3) | 75.4 (+1.6) | 62.6 (+1.4) | 67.8 (+2.0) | 70.7 (+2.7) |
| MIC | 64.2 | 75.5 | 62.4 | 67.3 | 69.8 |
| +DiDA | 65.0 (+0.8) | 76.8 (+1.3) | 63.5 (+1.1) | 68.6 (+1.3) | 72.1 (+2.3) |
### Ablation Study
GTA→CS (Transformer), based on DAFormer
| \(\mathcal{L}^D\) | \(\mathcal{L}^R\) | \(g_{time}\) | \(g'\) | \(h'\) | mIoU |
|---|---|---|---|---|---|
| - | - | - | - | - | 68.3 |
| ✓ | - | - | - | - | 66.5 |
| ✓ | - | ✓ | - | - | 69.5 |
| ✓ | ✓ | ✓ | - | - | 69.4 |
| ✓ | ✓ | ✓ | - | ✓ | 69.9 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 70.3 |
### Key Findings
- Plug-and-play effectiveness: DiDA consistently improves performance across all 3 UDA methods × 2 architectures × 5 settings
- Largest gains under adverse conditions: Improvements reach +3.7 mIoU on CS→ACDC (adaptation to fog, night, rain, and snow), indicating that degradation bridging is particularly effective when domain gaps are large
- Semantic shift compensation is critical: Applying the DIC loss without time embeddings actually decreases performance by 1.8 mIoU (66.5 vs. 68.3); introducing time embeddings recovers and surpasses the baseline
- Strong extensibility: The framework is compatible with arbitrary degradation operations such as blur and inpainting, all yielding improvements
## Highlights & Insights
- Elegant theoretical motivation: The paper formalizes the intuition that "degradation = domain bridging" by grounding it in a theoretical proposition about attribute loss in diffusion models
- Zero inference overhead: The degradation encoder and reconstruction head are used only during training and incur no deployment cost
- Strong generality: Compatible with both CNN and Transformer architectures, multiple UDA baselines, and various degradation operations
- The domain bridging perspective offers a new direction compared to conventional adversarial training and style transfer approaches
## Limitations & Future Work
- The choice of degradation level \(T=100\) and noise schedule is empirically determined; the theoretically optimal degradation strategy remains unclear
- The diffusion encoder \(g'\) shares the same architecture as the backbone encoder \(g\), doubling the parameter count during training (though removed at inference)
- Gains over the strong MIC baseline are relatively modest (+0.8–1.3), possibly approaching a performance ceiling
- The combination with non-self-training UDA methods (e.g., adversarial training) remains unexplored
## Related Work & Insights
- Relationship with MIC/HRDA: DiDA serves as a plug-and-play complement rather than a replacement, and is orthogonal to consistency regularization approaches
- Diffusion models in segmentation: Unlike methods that leverage diffusion for data generation or internal feature extraction, DiDA directly integrates the diffusion strategy into the UDA training pipeline
- Broader inspiration: The idea of degradation as domain bridging is generalizable to other cross-domain tasks such as detection and classification
## Rating
- Novelty: ⭐⭐⭐⭐⭐ (The degradation-as-domain-bridging perspective is novel, with a natural connection to diffusion theory)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Multiple methods, architectures, benchmarks, and detailed ablations)
- Writing Quality: ⭐⭐⭐⭐ (Motivation is clearly articulated; theory and practice are well integrated)
- Value: ⭐⭐⭐⭐⭐ (A general plug-and-play UDA enhancement strategy with high practical value)