# Towards Unsupervised Domain Bridging via Image Degradation in Semantic Segmentation
Conference: NeurIPS 2025 | arXiv: 2412.10339 | Code: Available | Area: Image Segmentation | Keywords: Unsupervised Domain Adaptation, Semantic Segmentation, Diffusion Process, Image Degradation, Domain Bridging
## TL;DR
This paper proposes DiDA, which formalizes image degradation operations as the forward process of diffusion models to construct a continuous intermediate domain between the source and target domains. Combined with a semantic shift compensation mechanism, DiDA serves as a plug-and-play module that consistently improves existing UDA semantic segmentation methods.
## Background & Motivation
Semantic segmentation models suffer severe performance degradation when deployed across domains. While self-training (ST) has become the dominant paradigm in UDA (e.g., DAFormer, HRDA, and the MIC series), these methods lack an explicit mechanism for learning domain-shared features.
From a causal representation learning perspective, an observation is generated as \(x = \Phi(c, e)\), where \(c\) denotes causal features determining class identity (e.g., shape) and \(e\) denotes domain-specific features (e.g., texture). Since \(e_S \neq e_T\), we have \(x_S \neq x_T\), which hinders the learning of domain-invariant features.
The core insight is drawn from the forward process of diffusion models: progressively adding noise removes attributes in order of granularity — fine-grained domain-specific attributes (texture) are lost first, while coarse-grained domain-invariant attributes (shape) are lost later. This implies that the overlapping region of intermediate domain distributions created by degradation can serve as a prior for the domain-shared distribution.
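This forward process can be sketched numerically. The snippet below implements the degradation \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon\) with the sigmoid schedule and \(T=100\) stated later in the method section; the exact endpoints of the sigmoid ramp are illustrative assumptions, since the paper does not specify them.

```python
import numpy as np

def sigmoid_alpha_bar(T=100, start=-3.0, end=3.0):
    """Cumulative signal-retention schedule ᾱ_t from a sigmoid ramp.

    The ramp endpoints (start/end) are assumed for illustration; the
    paper only states T = 100 with a sigmoid schedule.
    """
    t = np.linspace(start, end, T)
    s = 1.0 / (1.0 + np.exp(t))          # strictly decreasing in t
    return (s - s[-1]) / (s[0] - s[-1])  # normalized so ᾱ_1 = 1, ᾱ_T = 0

def q_sample(x0, t, alpha_bar, rng):
    """Forward diffusion step: x_t = sqrt(ᾱ_t) x_0 + sqrt(1 - ᾱ_t) ε."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps
```

As \(t\) grows, \(\bar{\alpha}_t \to 0\), so \(x_t\) from either domain approaches the same Gaussian; the intermediate timesteps are where fine-grained domain-specific attributes are already lost while coarse-grained shape cues survive.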
However, directly using degradation as domain bridging poses two major challenges: (1) stable feature representations must be maintained across a wide range of degradation levels; and (2) degradation inevitably damages domain-invariant features, leading to the semantic shift problem.
## Method

### Overall Architecture
DiDA is integrated into the standard self-training (ST) UDA pipeline and consists of two core modules: (1) degradation-based intermediate domain construction, which creates a continuous intermediate domain via the diffusion forward process; and (2) semantic shift compensation, which uses a diffusion encoder to disentangle and compensate for semantic information loss caused by degradation. At inference time, only the backbone segmentation network \(f_\theta = h \circ g\) is used, with no additional computational overhead.
### Key Designs
- Degradation-based Intermediate Domain Construction: The intermediate states \(X_1, X_2, \ldots, X_T\) produced by the diffusion forward process \(x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon\) are treated as intermediate domains. As the timestep increases, the overlapping area across domain distributions gradually expands as domain-specific attributes are eliminated. Based on a theoretical proposition (a monotonic relationship between attribute loss and timestep), the degradation operation constructs a continuous bridge from the source/target domains to a shared domain.
- Semantic Shift Compensation: A trainable diffusion encoder \(g'\), conditioned on a time embedding module, extracts the semantic shift information from the degraded image \(x_t\): \(\hat{z}_{t,i} = z'_{t,i} \odot \left(\mathrm{MLP}_s^i(\mathrm{Embed}(t)) + 1\right) + \mathrm{MLP}_b^i(\mathrm{Embed}(t))\). Features are fused via residual connections \(g + g'\) at multiple levels and supervised by a noise-reconstruction loss \(\mathcal{L}^R = \|h'((g + g')(x_t)) - \epsilon\|_2^2\), where \(h'\) is a reconstruction head used only during training. The design motivation is that time embeddings let the network disentangle the semantic loss corresponding to each degradation level, enabling targeted compensation.
- Degraded Image Consistency (DIC) Loss: \(\mathcal{L}^D = \sum_{i=1}^{N_S} \mathcal{L}_{ce}(\bar{f}_\theta(x_{i,t}^S, t), y_i^S) + \sum_{i=1}^{N_T} \mathcal{L}_{ce}(\bar{f}_\theta(x_{i,t}^T, t), p_i^T, q^T)\), where \(\bar{f}_\theta = h \circ (g + g')\), \(p_i^T\) are target pseudo-labels, and \(q^T\) is their confidence weight. The loss enforces prediction consistency between degraded and original images.
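The time-conditioned modulation in the semantic shift compensation module can be sketched as follows. The sinusoidal `time_embed` and the single-layer matrices `W_s`/`W_b` are illustrative assumptions standing in for the paper's \(\mathrm{Embed}(t)\) and \(\mathrm{MLP}_s\)/\(\mathrm{MLP}_b\), applied here to a single channel vector for brevity.

```python
import numpy as np

def time_embed(t, dim=16):
    """Sinusoidal timestep embedding (a common choice, assumed here;
    the paper does not spell out its Embed(t) design)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    ang = t * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])

def modulate(z, t, W_s, W_b):
    """ẑ = z ⊙ (MLP_s(Embed(t)) + 1) + MLP_b(Embed(t)),
    with single-layer 'MLPs' W_s, W_b (illustrative)."""
    e = time_embed(t, W_s.shape[0])
    scale = e @ W_s  # per-channel scale, shape (C,)
    shift = e @ W_b  # per-channel shift, shape (C,)
    return z * (scale + 1.0) + shift
```

Note the `+ 1` in the scale term: with zero-initialized modulation weights the operation reduces to the identity, so the residual fusion \(g + g'\) starts from an unmodified feature and learns the per-timestep correction gradually.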
### Loss & Training
The total training loss is a weighted sum of four terms: \(\mathcal{L} = \mathcal{L}^S + \mathcal{L}^T + \lambda_D \mathcal{L}^D + \lambda_R \mathcal{L}^R\)
- \(\mathcal{L}^S\): source domain supervised loss
- \(\mathcal{L}^T\): target domain pseudo-label self-training loss
- \(\lambda_D = 0.5\); \(\lambda_R\) is adjusted per architecture (DAFormer: 5, DeepLabV2: 1)
- Noise schedule: \(T=100\), sigmoid schedule
- At inference, \(g'\) and \(h'\) are fully removed, incurring zero additional overhead
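A minimal sketch of the weighted objective, using the \(\lambda\) values listed above (the per-architecture \(\lambda_R\) lookup is just an illustration of the paper's stated settings; the loss terms are passed in as scalars):

```python
def total_loss(L_S, L_T, L_D, L_R, arch="daformer"):
    """L = L^S + L^T + λ_D L^D + λ_R L^R.

    λ_D = 0.5 in all settings; λ_R depends on the architecture
    (DAFormer: 5, DeepLabV2: 1), as stated in the paper.
    """
    lam_D = 0.5
    lam_R = {"daformer": 5.0, "deeplabv2": 1.0}[arch]
    return L_S + L_T + lam_D * L_D + lam_R * L_R
```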
## Key Experimental Results

### Main Results
Consistent Gains Across Methods, Architectures, and Benchmarks (mIoU)
| Method | GTA→CS (CNN) | GTA→CS (Trans) | SYN→CS (CNN) | SYN→CS (Trans) | CS→ACDC (Trans) |
|---|---|---|---|---|---|
| DAFormer | 56.0 | 68.3 | 54.7 | 60.9 | 55.4 |
| +DiDA | 58.3 (+2.3) | 70.3 (+2.0) | 57.6 (+2.9) | 63.1 (+2.2) | 59.1 (+3.7) |
| HRDA | 63.0 | 73.8 | 61.2 | 65.8 | 68.0 |
| +DiDA | 64.3 (+1.3) | 75.4 (+1.6) | 62.6 (+1.4) | 67.8 (+2.0) | 70.7 (+2.7) |
| MIC | 64.2 | 75.5 | 62.4 | 67.3 | 69.8 |
| +DiDA | 65.0 (+0.8) | 76.8 (+1.3) | 63.5 (+1.1) | 68.6 (+1.3) | 72.1 (+2.3) |
### Ablation Study
GTA→CS (Transformer), based on DAFormer
| \(\mathcal{L}^D\) | \(\mathcal{L}^R\) | \(g_{time}\) | \(g'\) | \(h'\) | mIoU |
|---|---|---|---|---|---|
| - | - | - | - | - | 68.3 |
| ✓ | - | - | - | - | 66.5 |
| ✓ | - | ✓ | - | - | 69.5 |
| ✓ | ✓ | ✓ | - | - | 69.4 |
| ✓ | ✓ | ✓ | - | ✓ | 69.9 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 70.3 |
### Key Findings
- Plug-and-play effectiveness: DiDA consistently improves performance across all 3 UDA methods × 2 architectures × 5 settings
- Largest gains under adverse conditions: Improvements reach +3.7 mIoU on CS→ACDC (adaptation to fog, night, rain, and snow), indicating that degradation bridging is particularly effective when domain gaps are large
- Semantic shift compensation is critical: Applying the DIC loss without time embeddings actually decreases performance by 1.8 mIoU (66.5 vs. 68.3); introducing time embeddings recovers and surpasses the baseline
- Strong extensibility: The framework is compatible with arbitrary degradation operations such as blur and inpainting, all yielding improvements
## Highlights & Insights
- Elegant theoretical motivation: The paper formalizes the intuition that "degradation = domain bridging" by grounding it in a theoretical proposition about attribute loss in diffusion models
- Zero inference overhead: The degradation encoder and reconstruction head are used only during training and incur no deployment cost
- Strong generality: Compatible with both CNN and Transformer architectures, multiple UDA baselines, and various degradation operations
- The domain bridging perspective offers a new direction compared to conventional adversarial training and style transfer approaches
## Limitations & Future Work
- The choice of degradation level \(T=100\) and noise schedule is empirically determined; the theoretically optimal degradation strategy remains unclear
- The diffusion encoder \(g'\) shares the same architecture as the backbone encoder \(g\), doubling the parameter count during training (though removed at inference)
- Gains over the strong MIC baseline are relatively modest (+0.8–1.3), possibly approaching a performance ceiling
- The combination with non-self-training UDA methods (e.g., adversarial training) remains unexplored
## Related Work & Insights
- Relationship with MIC/HRDA: DiDA serves as a plug-and-play complement rather than a replacement, and is orthogonal to consistency regularization approaches
- Diffusion models in segmentation: Unlike methods that leverage diffusion for data generation or internal feature extraction, DiDA directly integrates the diffusion strategy into the UDA training pipeline
- Broader inspiration: The idea of degradation as domain bridging is generalizable to other cross-domain tasks such as detection and classification
## Rating
- Novelty: ⭐⭐⭐⭐⭐ (The degradation-as-domain-bridging perspective is novel, with a natural connection to diffusion theory)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Multiple methods, architectures, benchmarks, and detailed ablations)
- Writing Quality: ⭐⭐⭐⭐ (Motivation is clearly articulated; theory and practice are well integrated)
- Value: ⭐⭐⭐⭐⭐ (A general plug-and-play UDA enhancement strategy with high practical value)