Exploiting Domain Properties in Language-Driven Domain Generalization for Semantic Segmentation¶
Conference: ICCV 2025 | arXiv: 2512.03508 | Code: https://github.com/jone1222/DPMFormer | Area: Image Segmentation | Keywords: domain generalization, semantic segmentation, vision-language model, prompt learning, domain-aware
TL;DR¶
This paper proposes DPMFormer, a language-driven domain-generalization framework for semantic segmentation. It converts the domain-specific properties of each input image into textual context prompts via domain-aware prompt learning, and combines this with domain-robust consistency learning, thereby addressing the semantic misalignment between visual and textual contexts on unseen domains.
Background & Motivation¶
Domain Generalized Semantic Segmentation (DGSS) aims to train a model on a single source domain and generalize it to a variety of unseen target domains. Recent methods leveraging the semantic knowledge of vision-language models (VLMs, e.g., CLIP) have achieved notable progress, yet two critical issues remain overlooked:
Semantic misalignment in textual context: Fixed context prompts—whether hand-crafted templates such as "a photo of" or a single prompt learned on the source domain—lead to visual-textual semantic mismatch on target domains. For instance, prompts learned on daytime synthetic images offer limited expressiveness when encountering nighttime real-world scenes.
Lack of domain-robustness guidance: Existing methods fail to effectively guide models toward consistent predictions under domain shift.
The authors' core observation is that the semantic form of text representations should vary dynamically with the domain properties of the input image. For example, for the class "car" in a nighttime scene, a prompt such as "at night in the real-world" is more appropriate than the generic "a photo of," allowing text features to encode domain-specific information such as dark vehicle surfaces and light reflections.
Method¶
Overall Architecture¶
DPMFormer is built upon the Mask2Former architecture, comprising a CLIP-initialized image encoder \(ENC_I\), a pixel decoder \(DEC_{pix}\), a Transformer decoder \(DEC_{tr}\), and a frozen text encoder \(ENC_T\). The framework enhances generalization along two dimensions: (1) domain-awareness—exploiting domain-specific properties of input images; and (2) domain-robustness—maintaining prediction consistency under texture variations.
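The component wiring can be summarized in a short skeleton. This is a hedged sketch, not the released code: the module internals and all call signatures (`enc_i` returning backbone features plus a CLS token, `enc_t` consuming prompts and class tokens, `dec_tr` returning masks and class logits) are assumptions for illustration, and `prompt_gen` is the auxiliary network \(h_\theta\) described under Key Designs below.

```python
import torch.nn as nn

class DPMFormerSkeleton(nn.Module):
    """Illustrative wiring of DPMFormer's components (not the released code)."""
    def __init__(self, enc_i, dec_pix, dec_tr, enc_t, prompt_gen):
        super().__init__()
        self.enc_i = enc_i            # CLIP-initialized image encoder ENC_I
        self.dec_pix = dec_pix        # Mask2Former pixel decoder DEC_pix
        self.dec_tr = dec_tr          # Mask2Former Transformer decoder DEC_tr
        self.enc_t = enc_t            # frozen CLIP text encoder ENC_T
        for p in self.enc_t.parameters():
            p.requires_grad_(False)
        self.prompt_gen = prompt_gen  # auxiliary network h_theta (see below)

    def forward(self, image, class_tokens, ctx):
        feats, cls_token = self.enc_i(image)      # assumed to expose the CLS token
        pixel_feats = self.dec_pix(feats)
        pi = self.prompt_gen(cls_token)           # domain-specific embeddings pi_x
        prompts = ctx.unsqueeze(0) + pi           # p_x = p + pi_x
        text_queries = self.enc_t(prompts, class_tokens)
        masks, cls_logits = self.dec_tr(pixel_feats, text_queries)
        return masks, cls_logits
```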
Key Designs¶
- Domain-Aware Context Prompt Learning: An auxiliary network \(h_\theta(\cdot)\) generates domain-specific prompt embeddings \(\pi_x = h_\theta(\hat{F}(x))\) from the CLS token extracted by the frozen CLIP visual backbone. These embeddings are added to learnable context prompts \(p\) to obtain domain-aware prompts \(p_x = p + \pi_x\), which produce domain-aware text features \(t_{x,k} = ENC_T([p_x, \{class_k\}])\). A domain-aware contrastive loss ensures that \(h_\theta\) captures domain properties (a code sketch follows this list): $$\mathcal{L}_{contra} = -\frac{1}{2B}\sum_{i=1}^{2B}\log\frac{\sum_{j\in\mathcal{P}_i}\exp(\text{sim}(\pi_i, \pi_j)/\tau)}{\sum_{j\in\mathcal{P}_i\cup\mathcal{N}_i}\exp(\text{sim}(\pi_i, \pi_j)/\tau)}$$ where the positive set \(\mathcal{P}_i\) consists of images from the same domain and the negative set \(\mathcal{N}_i\) consists of images from different domains. Design Motivation: To enable text queries to dynamically adapt to the visual context of target domains.
- Texture Perturbation: Photometric transformations (strong color jitter, Gaussian blur, and noise injection) are applied to synthesize new-domain images \(x'\), which are paired with the original images \(x\) to form training batches. These operations preserve content structure while altering domain properties, thereby enriching the diversity of observable domains within a single-source setting (see the perturbation sketch directly after this list). Design Motivation: To obtain diverse domain properties in a single-source setting, providing positive and negative sample pairs for contrastive learning.
- Domain-Robust Consistency Learning: Consistency constraints are imposed at every layer of the Transformer decoder, encouraging consistent predictions for the original and augmented images (see the consistency sketch after the Loss & Training paragraph): $$\mathcal{L}_{cons} = \sum_{s=1}^{S}\left[\lambda_{mc}\,\mathcal{L}_{mc}(\hat{y}^{mask}_s, \hat{y}'^{mask}_s) + \lambda_{cc}\,\mathcal{L}_{cc}(\hat{c}_{q_i,s}, \hat{c}_{q'_i,s})\right]$$ where \(\mathcal{L}_{mc}\) computes mask consistency via binary cross-entropy (BCE) and \(\mathcal{L}_{cc}\) computes class consistency via Jensen-Shannon divergence (JSD). Design Motivation: Imposing constraints at each decoder layer prevents inconsistencies in early layers from propagating to subsequent layers.
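For the texture perturbation above, a minimal sketch using standard torchvision photometric transforms; the jitter strengths, blur kernel, and noise scale below are assumptions, not values from the paper.

```python
import torch
from torchvision import transforms as T

# Hypothetical magnitudes; the paper specifies strong color jitter, Gaussian
# blur, and noise injection, but the exact parameters are assumptions here.
_photometric = T.Compose([
    T.ColorJitter(brightness=0.8, contrast=0.8, saturation=0.8, hue=0.3),
    T.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0)),
])

def texture_perturb(x: torch.Tensor) -> torch.Tensor:
    """Photometric-only augmentation: alters texture/style cues while leaving
    the spatial content (and thus the segmentation labels) unchanged."""
    x_aug = _photometric(x)                         # color jitter + blur
    x_aug = x_aug + 0.05 * torch.randn_like(x_aug)  # additive Gaussian noise
    return x_aug.clamp(0.0, 1.0)
```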
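And for the domain-aware prompt learning, a minimal PyTorch sketch of \(h_\theta\) and the contrastive loss. The BatchNorm–Linear–ReLU–Linear layout follows the paper; the dimensions (`clip_dim`, `ctx_len`, `txt_dim`), the hidden width, and the domain-id convention are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainPromptGenerator(nn.Module):
    """h_theta: maps the frozen CLIP CLS token to domain-specific prompt
    embeddings pi_x, reshaped into ctx_len context-token slots."""
    def __init__(self, clip_dim=768, ctx_len=8, txt_dim=512):
        super().__init__()
        self.ctx_len, self.txt_dim = ctx_len, txt_dim
        self.net = nn.Sequential(                 # BatchNorm-Linear-ReLU-Linear
            nn.BatchNorm1d(clip_dim),
            nn.Linear(clip_dim, clip_dim),
            nn.ReLU(inplace=True),
            nn.Linear(clip_dim, ctx_len * txt_dim),
        )

    def forward(self, cls_token):                 # (B, clip_dim)
        pi = self.net(cls_token)
        return pi.view(-1, self.ctx_len, self.txt_dim)

def domain_contrastive_loss(pi, domain_ids, tau=0.1):
    """L_contra over a batch of 2B prompt embeddings (originals + perturbed).
    Positives share a domain id; all other non-self samples act as negatives."""
    z = F.normalize(pi.flatten(1), dim=1)         # (2B, D) unit vectors
    sim = z @ z.t() / tau                         # cosine similarities / tau
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos_mask = (domain_ids[:, None] == domain_ids[None, :]) & ~self_mask
    exp_sim = torch.exp(sim).masked_fill(self_mask, 0.0)
    pos = (exp_sim * pos_mask).sum(dim=1)         # numerator: same-domain pairs
    return -torch.log(pos / exp_sim.sum(dim=1)).mean()
```

Under the single-source setting, a domain-id tensor such as `torch.cat([torch.zeros(B), torch.ones(B)]).long()` would implement the original-vs-perturbed split, though the paper's exact grouping of perturbed views is an assumption here.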
Loss & Training¶
Total loss: \(\mathcal{L}_{total} = \mathcal{L}_{seg} + \lambda_{reg}\mathcal{L}_{reg} + \lambda_{contra}\mathcal{L}_{contra} + \lambda_{cons}\mathcal{L}_{cons}\)
Loss weights: \(\lambda_{reg}=1, \lambda_{contra}=1, \lambda_{cons}=10\). The AdamW optimizer is used with a learning rate of 1e-5 for synthetic datasets and 1e-4 for real datasets. Training runs for 20,000 iterations with a batch size of 8 and input crops of 512×512. A linear warm-up is applied over the first 1,500 iterations. The auxiliary network \(h_\theta\) is a lightweight architecture (BatchNorm–Linear–ReLU–Linear).
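Below is a hedged sketch of the per-layer consistency terms and the total-loss assembly, assuming each decoder layer exposes mask logits and class logits for both views. Treating the perturbed view's mask as a detached soft BCE target is an assumption (this summary does not specify the target construction), as are the per-term weights `lam_mc` and `lam_cc`.

```python
import torch
import torch.nn.functional as F

def jsd(p_logits, q_logits):
    """Jensen-Shannon divergence between two categorical predictions."""
    p, q = F.softmax(p_logits, dim=-1), F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    return 0.5 * (F.kl_div(m.log(), p, reduction="batchmean")
                  + F.kl_div(m.log(), q, reduction="batchmean"))

def consistency_loss(masks, masks_aug, cls, cls_aug, lam_mc=1.0, lam_cc=1.0):
    """L_cons: sums mask (BCE) and class (JSD) consistency over all S
    Transformer-decoder layers. Each argument is a list of per-layer tensors."""
    loss = masks[0].new_zeros(())
    for m, m_a, c, c_a in zip(masks, masks_aug, cls, cls_aug):
        soft_target = torch.sigmoid(m_a).detach()  # assumption: aug view as target
        loss = loss + lam_mc * F.binary_cross_entropy_with_logits(m, soft_target)
        loss = loss + lam_cc * jsd(c, c_a)
    return loss

# Total objective with the paper's reported weights:
# L_total = L_seg + 1.0 * L_reg + 1.0 * L_contra + 10.0 * L_cons
```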
Key Experimental Results¶
Main Results¶
Synthetic-to-Real (GTAV → Cityscapes/BDD/Mapillary), mIoU (%):
| Method | Backbone | Cityscapes | BDD | Mapillary | Mean |
|---|---|---|---|---|---|
| SHADE | ResNet-101 | 46.66 | 43.66 | 45.50 | 45.27 |
| FAMix | ResNet-101 | 49.47 | 46.40 | 51.97 | 49.28 |
| TQDM | ViT-B | 57.50 | 47.66 | 59.76 | 54.97 |
| DPMFormer | ViT-B | 59.00 | 51.80 | 63.62 | 58.14 |
| TQDM | EVA02-L | 68.88 | 59.18 | 70.10 | 66.05 |
| DPMFormer | EVA02-L | 70.08 | 60.48 | 70.66 | 67.07 |
SYNTHIA → Real World (Cityscapes/BDD/Mapillary), mIoU (%):
| Method | Backbone | Cityscapes | BDD | Mapillary | Mean |
|---|---|---|---|---|---|
| TQDM | EVA02-L | 57.99 | 52.43 | 54.87 | 55.10 |
| DPMFormer | EVA02-L | 58.92 | 54.39 | 60.08 | 57.80 |
Ablation Study¶
Component Contributions (GTAV source, ViT-B backbone, mIoU %):
| Configuration | Cityscapes | BDD | Mapillary | Mean |
|---|---|---|---|---|
| Baseline (TQDM) | 57.50 | 47.66 | 59.76 | 54.97 |
| + Texture Perturbation | 57.04 | 48.19 | 60.91 | 55.38 |
| + \(\mathcal{L}_{cons}\) | 58.22 | 49.39 | 60.84 | 56.15 |
| + \(\mathcal{L}_{contra}\) (Full) | 59.00 | 51.80 | 63.62 | 58.14 |
Comparison of Prompt Learning Methods (mIoU, %):
| Method | Cityscapes | BDD | Mapillary | Mean |
|---|---|---|---|---|
| w/o contrastive loss | 57.65 | 49.63 | 61.10 | 56.13 |
| CoCoOp | 57.84 | 49.91 | 61.33 | 56.36 |
| MaPLe | 57.87 | 50.12 | 61.04 | 56.34 |
| PromptSRC | 58.10 | 49.73 | 62.51 | 56.78 |
| Ours (contrastive on \(\pi\)) | 59.00 | 51.80 | 63.62 | 58.14 |
Key Findings¶
- Domain-aware prompt learning contributes the most (+1.99% mean mIoU, 56.15 → 58.14 in the ablation), with particularly large gains on BDD (+4.14%) and Mapillary (+3.17%), where environmental variation is substantial.
- Computing the contrastive loss on the context embedding \(\pi\) yields the best results, as it provides direct domain-level supervision to \(h_\theta\).
- CoCoOp's instance-specific prompts exhibit poor generalization, as they focus on instance-level attributes rather than domain-level properties.
- DPMFormer maintains correct segmentation under extreme artistic styles (minimalism, cubism, etc.), demonstrating its domain robustness.
Highlights & Insights¶
- Domain-Aware vs. Instance-Aware: The paper draws a clear and effective distinction between "domain-specific attributes" and "instance-specific attributes." CoCoOp's instance-level prompts lead to poor generalization, whereas DPMFormer's domain-level prompts are better suited to DGSS tasks.
- Dual Role of Texture Perturbation: Texture perturbation simultaneously enriches domain diversity to supply contrastive learning samples and serves as the augmentation source for consistency learning.
- Per-Layer Consistency Constraints: Consistency is enforced not only at the final output layer but at every layer of the Transformer decoder, preventing error accumulation.
- Lightweight Domain Prompt Generator: A shallow BN–Linear–ReLU–Linear network suffices to capture domain properties.
Limitations & Future Work¶
- In real-to-real settings (Cityscapes → BDD/Mapillary) with the EVA02-L backbone, the gains are marginal, roughly matching TQDM.
- Texture perturbations are limited to photometric transformations, leaving more complex domain shifts (e.g., weather conditions, sensor differences) unaddressed.
- The domain-aware prompt generator relies on the CLS token, which may discard spatially localized domain information.
- Combining test-time adaptation with domain-aware prompts at inference is a promising direction for future exploration.
- The definition of positive and negative samples in the contrastive loss is relatively coarse; finer-grained domain distance metrics could be introduced.
Related Work & Insights¶
- TQDM: The first framework to apply text queries to DGSS and the direct baseline of this work; it relies on fixed context prompts.
- CoCoOp: Conditional context optimization for prompt learning; this work demonstrates that domain-awareness outperforms instance-awareness.
- Mask2Former: A mask-classification segmentation architecture whose attention mechanism exhibits inherent robustness to domain shift.
- SHADE: A pioneer in style consistency loss; this work extends the concept to two-dimensional consistency (mask + class) across multiple decoder layers.
Rating¶
- Novelty: ⭐⭐⭐⭐ The domain-aware prompt learning concept is clear and effective; injecting domain properties into text representations is an elegant design.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers synthetic-to-real and real-to-real settings with in-depth comparative analysis of prompt learning methods.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; PCA visualizations and artistic-style experiments strengthen the paper's persuasiveness.
- Value: ⭐⭐⭐⭐ Achieves state-of-the-art performance across all settings and offers a new perspective on visual-language alignment for domain generalization.