# Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation

- **Conference**: NeurIPS 2025
- **arXiv**: 2506.09881
- **Code**: GitHub
- **Area**: Autonomous Driving / Semantic Segmentation
- **Keywords**: open-vocabulary segmentation, domain generalization, depth estimation, visual foundation models, semantic segmentation
## TL;DR
This paper proposes Vireo, the first single-stage framework that unifies open-vocabulary semantic segmentation (OVSS) and domain-generalized semantic segmentation (DGSS). By introducing GeoText Query to fuse depth-geometric features with linguistic cues, Vireo achieves state-of-the-art performance under both extreme environmental conditions and on unseen categories.
## Background & Motivation

- **Background**: OVSS enables recognition of arbitrary text-described categories, while DGSS maintains robustness on unseen domains; the two paradigms have complementary strengths.
- **Limitations of Prior Work**: the text-visual alignment modules in OVSS degrade significantly under out-of-domain conditions (e.g., nighttime, rain), while the domain-invariant strategies of DGSS may suppress fine-grained semantic cues, compromising precise responses to text queries.
- **Key Challenge**: how can cross-domain robustness and open-vocabulary recognition be achieved simultaneously? DGSS focuses on encoder-side feature generalization, while OVSS focuses on decoder-side open recognition; the two are naturally complementary.
- **Goal**: construct a unified OV-DGSS (Open-Vocabulary Domain-Generalized Semantic Segmentation) framework that robustly segments unseen categories under domain shift.
- **Key Insight**: exploit the domain-invariant nature of depth maps (depth and geometric cues are insensitive to illumination and texture variations), combined with the generalization capacity of frozen visual foundation models (VFMs).
- **Core Idea**: inject depth-geometric and text-semantic priors into the frozen VFM's layers via the GeoText Query, augmented by Coarse Mask Prior Embedding (CMPE) for enhanced gradient flow and the Domain-Open-Vocabulary Vector Embedding Head (DOV-VEH) for multi-modal feature fusion.
## Method

### Overall Architecture

Vireo comprises three core modules:

- **Tunable Vireo + GeoText Query**: injects and aligns geometric and textual information between layers of the frozen VFM encoder.
- **Coarse Mask Prior Embedding (CMPE)**: generates coarse prior masks to enhance gradient back-propagation.
- **Domain-Open-Vocabulary Vector Embedding Head (DOV-VEH)**: fuses visual, geometric, and semantic features to produce the final predictions.
Input images are fed simultaneously into a frozen VFM encoder and a frozen DepthAnything V2 encoder for depth estimation; text category labels are encoded into semantic embeddings via a frozen CLIP text encoder.
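A minimal sketch of this frozen-backbone setup in PyTorch; the loader names in the comment are hypothetical placeholders, not the paper's code:

```python
# Freezing helper for the three backbones. Weights receive no updates,
# but gradients can still flow *through* the encoders to the tunable
# modules injected between their layers.
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()

# Illustrative wiring (loader functions are hypothetical):
#   vfm       = freeze(load_vfm())                # e.g., DINOv2 / EVA02
#   depth_enc = freeze(load_depth_anything_v2())  # depth-geometric features
#   text_enc  = freeze(load_clip_text_encoder())  # semantic embeddings
```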
### Key Designs

- **GeoText Query**
    - **Function**: injects structural-semantic priors between layers of the frozen VFM, progressively refining features layer by layer.
    - **Design Motivation**: depth features provide domain-invariant spatial constraints that mitigate domain shift in RGB features, while text embeddings provide open-vocabulary semantic alignment.
    - **Mechanism**: at each layer, a learnable query \(P_l\) interacts with visual features \(f_l^V\), depth features \(f_l^D\), and text embeddings \(\{t_k\}\) via cross-attention: \(\mathcal{A}_l = \text{CrossAttn}(P_l, f_l^V, f_l^D, \{t_k\})\). The attention output then refines the visual representation into \(\hat{f}_l^V\) through weighted summation, MLP projection, and a residual connection (see the sketches after this list).
    - **Novelty**: unlike REIN, which performs only prompt tuning, GeoText Query simultaneously integrates depth and text as cross-modal priors.
- **Coarse Mask Prior Embedding (CMPE)**
    - **Function**: generates coarse semantic probability maps to provide denser gradient signals back through the frozen encoder.
    - **Design Motivation**: freezing the encoder leads to sparse gradients and slow convergence; CMPE injects richer gradient signals to alleviate this.
    - **Mechanism**: refined features from VFM layers 8/12/16/24 are upsampled to a unified resolution and fused via an Adaptive Attention Gate (AAG). Coarse masks are computed against the text embeddings via Einstein summation: \(\mathcal{M}(x,y,k) = \langle f^M(x,y), t_k \rangle\). Query priors are then generated via softmax normalization: \(q_j^{prior} = \sum_k \text{Softmax}(\langle q_j, e_k^{class} \rangle) \cdot e_k^{class}\).
- **DOV-VEH (Domain-Open-Vocabulary Vector Embedding Head)**
    - **Function**: fuses the multi-scale refined features to generate pixel-level segmentation masks.
    - **Mechanism**: multi-scale features are first enhanced spatially by a Pixel Decoder; a Transformer Decoder then lets the GeoText Query interact with the decoded features and text embeddings, yielding mask embeddings \(\mathcal{E}_{mask}\) and classification embeddings \(\mathcal{E}_{cls}\). The final prediction is \(\hat{\mathcal{M}}(x,y,k) = \sum_d \mathcal{E}_{mask}(x,y,d) \cdot \mathcal{E}_{cls}(k,d)\).
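The three formulas above map onto a few lines of PyTorch. First, a minimal sketch of one GeoText Query refinement step; the embedding width, query count, head count, and the exact form of the weighted-summation step are assumptions, not the authors' implementation:

```python
# Hedged sketch of one GeoText Query layer. dim, n_queries, num_heads,
# and the weighted-summation step are illustrative assumptions.
import torch
import torch.nn as nn

class GeoTextQueryLayer(nn.Module):
    def __init__(self, dim: int = 256, n_queries: int = 100):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))  # learnable P_l
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, f_v, f_d, t):
        # f_v: (B, N, D) visual tokens; f_d: (B, M, D) depth tokens;
        # t:   (B, K, D) text embeddings {t_k}
        b = f_v.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        kv = torch.cat([f_v, f_d, t], dim=1)       # joint key/value set
        a, _ = self.cross_attn(q, kv, kv)          # A_l = CrossAttn(P_l, f^V, f^D, {t_k})
        # Weighted summation of A_l back onto the visual tokens,
        # then MLP projection and a residual connection.
        w = torch.softmax(f_v @ a.transpose(1, 2), dim=-1)  # (B, N, Q)
        return f_v + self.mlp(w @ a)               # refined \hat{f}_l^V
```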
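The two CMPE formulas reduce to an einsum and a softmax-weighted sum; a sketch under the same (assumed) shape conventions:

```python
# Hedged sketch of the CMPE equations; tensor shapes are assumptions.
import torch

def coarse_mask(f_m: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """M(x, y, k) = <f^M(x, y), t_k> via Einstein summation.
    f_m: (B, D, H, W) AAG-fused features; t: (K, D) text embeddings."""
    return torch.einsum('bdhw,kd->bkhw', f_m, t)

def query_prior(q: torch.Tensor, e_class: torch.Tensor) -> torch.Tensor:
    """q_j^prior = sum_k Softmax(<q_j, e_k^class>) * e_k^class.
    q: (Q, D) queries; e_class: (K, D) class embeddings."""
    return torch.softmax(q @ e_class.T, dim=-1) @ e_class
```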
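Finally, the DOV-VEH prediction is a single contraction of mask and classification embeddings over the shared embedding dimension:

```python
# Hedged sketch of the DOV-VEH output equation; shapes are assumptions.
import torch

def final_prediction(e_mask: torch.Tensor, e_cls: torch.Tensor) -> torch.Tensor:
    """M_hat(x, y, k) = sum_d E_mask(x, y, d) * E_cls(k, d).
    e_mask: (B, D, H, W) mask embeddings; e_cls: (K, D) class embeddings."""
    return torch.einsum('bdhw,kd->bkhw', e_mask, e_cls)
```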
### Loss & Training
- Optimizer: AdamW with initial learning rate 1e-4 and weight decay 0.05.
- Polynomial learning rate decay over 40K total iterations.
- Data augmentation: multi-scale resizing, random cropping, random horizontal flipping, and photometric distortion.
- Trained on a single RTX A6000 GPU, batch size 8, approximately 14 hours.
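The bullets above translate into a few lines of PyTorch; `model` is a stand-in for Vireo's tunable modules, and the polynomial exponent `power` is an assumption (the paper does not state it):

```python
# Sketch of the stated schedule: AdamW (lr 1e-4, weight decay 0.05) with
# polynomial decay over 40K iterations. `model` and `power` are assumptions.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder for the ~3.78M tunable parameters
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # frozen params excluded
    lr=1e-4,
    weight_decay=0.05,
)
scheduler = torch.optim.lr_scheduler.PolynomialLR(
    optimizer, total_iters=40_000, power=1.0,
)

for step in range(40_000):
    # ... forward pass, loss, loss.backward(), optimizer.step() ...
    scheduler.step()
```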
## Key Experimental Results

### Main Results
Domain Generalization (Cityscapes → ACDC + BDD + Mapillary, mIoU %):
| Method | Night-ACDC | Fog-ACDC | Rain-ACDC | Snow-ACDC | BDD100k | Mapillary |
|---|---|---|---|---|---|---|
| FC-CLIP (OVSS) | 40.8 | 64.4 | 63.2 | 61.5 | 55.9 | 66.1 |
| REIN (DGSS) | 55.9 | 79.5 | 72.5 | 70.6 | 63.5 | 74.0 |
| FADA (DGSS) | 57.4 | 80.2 | 75.0 | 73.5 | 65.1 | 75.9 |
| Vireo | 60.6 | 82.3 | 76.3 | 76.2 | 66.7 | 76.0 |
Open-Vocabulary Generalization (Cityscapes → DELIVER + ADE, mIoU %):
| Method | Sun | Night | Cloud | Rain | Fog | ADE150 | ADE847 |
|---|---|---|---|---|---|---|---|
| CAT-Seg | 28.2 | 20.6 | 26.2 | 26.5 | 24.8 | 20.2 | 7.0 |
| Vireo | 35.7 | 27.5 | 32.3 | 31.8 | 32.7 | 21.4 | 7.3 |
### Ablation Study

Component Ablation (Cityscapes → ACDC + BDD + Mapillary, mIoU %):
| Configuration | Snow | Night | Fog | Rain | BDD | Map |
|---|---|---|---|---|---|---|
| REIN (baseline) | 70.6 | 55.9 | 79.5 | 72.5 | 63.5 | 74.0 |
| + DepthAnything V2 | 71.5 | 56.7 | 80.5 | 73.3 | 64.4 | 74.5 |
| + GeoText Query | 74.0 | 58.4 | 81.1 | 74.8 | 65.3 | 75.3 |
| Vireo (full) | 76.2 | 60.6 | 82.3 | 76.3 | 66.7 | 76.0 |
Multi-Backbone Generalization (GTA5 → Cityscapes + BDD + Mapillary, mIoU %):
| Backbone | REIN | FADA | Vireo | Vireo Trainable Params |
|---|---|---|---|---|
| EVA02-L | 63.6 | 64.9 | 66.0 | 3.78M |
| DINOv2-L | 64.3 | 66.1 | 67.7 | 3.78M |
### Key Findings

- GeoText Query is the most critical single component: over the depth-only variant it adds up to +2.5 mIoU in snow (71.5 → 74.0) and +1.7 mIoU at night (56.7 → 58.4).
- Depth-geometric features provide the greatest benefit under extreme weather conditions, particularly nighttime and snow.
- Vireo outperforms existing OVSS methods on both seen and unseen categories, with a more pronounced advantage on unseen categories (a margin of over +7 mIoU).
- Only 3.78M trainable parameters are required — significantly fewer than FADA (11.65M) — while achieving superior performance.
## Highlights & Insights
- Meaningful Problem Formulation: This work is the first to define the OV-DGSS problem and propose a unified framework, closely aligned with the practical requirements of autonomous driving.
- Effective Use of Depth Cues: Depth maps are inherently domain-invariant; leveraging frozen DepthAnything V2 for geometric feature extraction is a lightweight yet effective strategy.
- Complementarity Insight: DGSS optimizes encoder-side generalization while OVSS optimizes decoder-side recognition — Vireo simultaneously advances both ends.
- Parameter Efficiency: State-of-the-art performance with only 3.78M trainable parameters, making deployment practical.
## Limitations & Future Work
- Depth estimation quality depends on DepthAnything V2, which may be unreliable in extreme scenes such as dense fog or nighttime.
- Coarse masks generated by CMPE are of limited quality and may introduce noisy priors.
- Absolute performance on ADE847 remains low (7.3% mIoU); ultra-fine-grained segmentation across 847 categories remains challenging.
- Training memory requirements are high (~45 GB GPU memory), and inference efficiency warrants further optimization.
- Validation is limited to autonomous driving scenarios; other OV-DGSS applications (e.g., medical imaging) remain unexplored.
## Related Work & Insights
- REIN / FADA: VFM-based DGSS methods that insert learnable modules into frozen VFMs to improve domain generalization.
- FC-CLIP / CAT-Seg: OVSS methods that leverage CLIP alignment between vision and text for open-vocabulary recognition.
- DepthForge: Demonstrates that injecting depth queries into frozen VFMs improves domain generalization, inspiring this work's use of DepthAnything.
- Insight: The paradigm of using depth/geometry as domain-invariant anchors can be extended to video segmentation, 3D scene understanding, and related tasks.
## Rating
- Novelty: ⭐⭐⭐⭐ First to define the OV-DGSS problem and propose a unified framework; GeoText Query design is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Eight datasets, five evaluation settings, multi-backbone validation, and detailed ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Architecture diagrams are clear, experiments are well-organized, and method descriptions are complete.
- Value: ⭐⭐⭐⭐⭐ Unifying OV and DG is highly practical for autonomous driving scenarios, with strong parameter efficiency.