DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation¶

Conference: CVPR 2025
arXiv: 2505.11676
Code: None
Area: Segmentation / Multimodal VLM
Keywords: Open-vocabulary segmentation, dual prompts, cost volume, visual prompt, CLIP

TL;DR¶

DPSeg proposes leveraging both text prompts and visual prompts generated by Stable Diffusion to construct a dual-prompt cost volume for open-vocabulary semantic segmentation. Utilizing a multi-scale visual cost volume guided decoder and a two-round inference semantic refinement strategy, the method consistently outperforms existing approaches across five public datasets.

Background & Motivation¶

Background: Open-vocabulary semantic segmentation (OVSS) aims to perform pixel-level classification on images containing categories unseen during training. Current methods primarily leverage the text-image alignment capability of CLIP by computing the cost volume between image patch features and text embeddings. CAT-Seg and SED are representative works in this direction.

Limitations of Prior Work: (1) Although CLIP aligns the image and text modalities during training, a significant modality gap remains between them, which limits how precisely targets can be localized when relying solely on text embeddings; (2) Existing methods build cost volumes only from deep, text-aligned features and lack guidance from shallow features, so small objects and fine details are segmented poorly.

Key Challenge: Texts and images inherently belong to different modalities. Even after large-scale pre-training alignment, their cosine similarity in the embedding space is still significantly lower than intra-modality similarity. This modality gap limits the segmentation accuracy based on text prompts.

Goal: How to bridge the text-image modality gap to improve segmentation accuracy? How to utilize multi-scale features to provide richer spatial-semantic cues?

Key Insight: The authors experimentally found that the similarity between visual prompts (reference images generated from text descriptions using Stable Diffusion) and target images in the CLIP embedding space is significantly higher than that of text prompts (approximately 0.7 vs. 0.3). Cost volumes generated by visual prompts also exhibit clearer semantic boundaries. Therefore, combining text and visual prompts can leverage complementary advantages.

Core Idea: Using Stable Diffusion to generate visual prompts bridges the text-image modality gap; integrating them into a dual-prompt cost volume combined with multi-scale decoding achieves a new SOTA in OVSS.

Method¶

Overall Architecture¶

DPSeg consists of three core modules: (1) Dual-Prompt Cost Volume Generation: Uses text templates to generate text prompt embeddings while utilizing Stable Diffusion to generate corresponding visual prompt images to extract embeddings. The average of the two is used to compute the cost volume with image features; (2) Cost Volume-Guided Decoder: Sequentially upsamples the cost volume while fusing multi-scale features from the image encoder and multi-scale features from the visual prompts at each scale; (3) Semantic-Guided Prompt Refinement: Crops the target regions from the first-round inference segmentation results to serve as the visual prompts for the second round, replacing the Stable Diffusion generated prompts to refine the segmentation.

Key Designs¶

Dual-Prompt Cost Volume Generation:
- Function: Generates a more precise pixel-level semantic similarity map compared to a single text prompt.
- Mechanism: For each category \(C_k\), multiple text templates (e.g., "a photo of a {\(C_k\)}") are encoded into text embeddings \(\mathbf{T}\). The same templates are fed to Stable Diffusion to generate visual prompt images, from which the CLIP image encoder extracts visual embeddings \(\mathbf{V}\). The two are fused by simple averaging, \(\mathbf{R} = \text{Avg}(\mathbf{V}, \mathbf{T})\), and the pixel-wise cosine similarity between \(\mathbf{R}\) and the image features \(\mathbf{E}\) gives the cost volume \(\mathcal{F}_c\).
- Design Motivation: In the t-SNE visualization, the distance between the dual-prompt embedding and the target image features is only 0.18, compared with 0.39 for the text prompt and 0.33 for the visual prompt. Averaging is also the simplest of the fusion strategies considered, and ablation studies show that computing similarity after averaging the embeddings outperforms computing the similarities separately and then concatenating or averaging them.
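
The averaging and cosine-similarity steps described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the nested-list layout, and the choice to L2-normalize both embeddings before averaging are assumptions made for illustration, and the toy vectors stand in for real CLIP embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (small epsilon avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) + 1e-12
    return [x / norm for x in vec]

def dual_prompt_embedding(text_emb, visual_emb):
    """R = Avg(V, T): element-wise mean of the two embeddings.

    Normalizing each embedding first is an assumption of this sketch.
    """
    t = l2_normalize(text_emb)
    v = l2_normalize(visual_emb)
    return [(a + b) / 2.0 for a, b in zip(t, v)]

def cost_volume(image_feats, prompt_embs):
    """Cosine similarity between every pixel feature and every class prompt.

    image_feats: H x W grid of D-dimensional vectors (nested lists).
    prompt_embs: list of K D-dimensional class embeddings.
    Returns a K x H x W nested list.
    """
    prompts = [l2_normalize(p) for p in prompt_embs]
    volume = []
    for p in prompts:
        plane = []
        for row in image_feats:
            plane.append([sum(a * b for a, b in zip(l2_normalize(px), p))
                          for px in row])
        volume.append(plane)
    return volume

# Toy example: K = 2 classes, a 1 x 2 feature map, D = 2 dimensions.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
visual_embs = [[0.8, 0.6], [0.6, 0.8]]
dual_embs = [dual_prompt_embedding(t, v) for t, v in zip(text_embs, visual_embs)]
image_feats = [[[1.0, 0.0], [0.0, 1.0]]]
F_c = cost_volume(image_feats, dual_embs)
```

In this toy case the first pixel scores higher for class 0 and the second pixel for class 1, which is the per-pixel, per-class similarity map that the decoder then consumes.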

Cost Volume-Guided Decoder (CVGD):
- Function: Performs multi-scale fusion of the cost volume, image features, and visual prompt features, sequentially upsampling to predict the segmentation map.
- Mechanism: The decoder consists of three stages, each containing hybrid dilated convolutions (dilation rates of 1/2/4), self-attention layers, and deconvolution upsampling. The key innovation is using intermediate features from the image encoder and visual prompt encoder at each stage to compute visual cost volumes \(\mathcal{F}_v^j\), which are directly aligned with the decoded features, avoiding detail degradation caused by upsampling the initial cost volume.
- Design Motivation: Upsampling the initial cost volume in prior methods leads to loss of fine-grained information. Using multi-scale visual cost volumes provides complementary spatial-semantic cues at different levels.
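
To make the idea of per-stage visual cost volumes concrete, here is a small sketch that computes one cost volume per decoder stage directly at that stage's resolution instead of upsampling a single coarse volume. Several details are assumptions rather than facts from the paper: the stage indices 2, 3, 4 are borrowed from the ablation table with higher indices treated as coarser, the visual prompt features are assumed to be pooled to one vector per class at each stage, and all names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def visual_cost_volume(image_feats, prompt_feats):
    """K x H x W similarities between one scale's image features and K prompt vectors."""
    return [[[cosine(px, p) for px in row] for row in image_feats]
            for p in prompt_feats]

def multi_scale_cost_volumes(image_pyramid, prompt_pyramid):
    """One visual cost volume per decoder stage, keyed by the stage index j."""
    return {j: visual_cost_volume(image_pyramid[j], prompt_pyramid[j])
            for j in image_pyramid}

def constant_map(size, vec):
    """Helper for the toy data: a size x size grid filled with copies of vec."""
    return [[list(vec) for _ in range(size)] for _ in range(size)]

# Toy pyramid: stage 4 is 1 x 1, stage 3 is 2 x 2, stage 2 is 4 x 4; K = 2 classes.
image_pyramid = {4: constant_map(1, [1.0, 0.0]),
                 3: constant_map(2, [1.0, 1.0]),
                 2: constant_map(4, [0.0, 1.0])}
prompt_pyramid = {j: [[1.0, 0.0], [0.0, 1.0]] for j in image_pyramid}

volumes = multi_scale_cost_volumes(image_pyramid, prompt_pyramid)
shapes = {j: (len(v), len(v[0]), len(v[0][0])) for j, v in volumes.items()}
```

Each entry of `volumes` already has the spatial size of its stage, so a decoder could fuse it with the decoded features at that stage without interpolating the initial cost volume.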

Semantic-Guided Prompt Refinement:
- Function: Utilizes the first-round segmentation results to refine visual prompts, improving boundary accuracy.
- Mechanism: A two-round inference strategy. Inference I uses visual prompts generated by Stable Diffusion to obtain initial segmentation results. Inference II uses the initial segmentation masks to crop each detected category's region from the original image and uses the crop as the new visual prompt for that category, then re-runs inference with the updated prompts. Undetected categories keep their original Stable Diffusion prompts.
- Design Motivation: Visual prompts generated by Stable Diffusion may not align perfectly with specific instances in the input image. Using scene-adaptive cropped regions as prompts is more accurate.
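
The prompt-replacement step between the two inference rounds can be sketched as follows. This is only an illustration under stated assumptions: the source says detected regions are cropped using the first-round masks, and the sketch uses a simple bounding-box crop as a stand-in; the function names, the nested-list image, and the string placeholders for Stable Diffusion prompts are all hypothetical.

```python
def class_bounding_boxes(label_map):
    """Map each predicted class id to the (top, left, bottom, right) box of its pixels."""
    boxes = {}
    for r, row in enumerate(label_map):
        for c, cls in enumerate(row):
            if cls in boxes:
                top, left, bottom, right = boxes[cls]
                boxes[cls] = (min(top, r), min(left, c), max(bottom, r), max(right, c))
            else:
                boxes[cls] = (r, c, r, c)
    return boxes

def crop(image, box):
    """Return the inclusive rectangular region of a nested-list image."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]

def refine_prompts(image, label_map, generated_prompts):
    """Swap in image crops for detected classes; keep generated prompts otherwise."""
    refined = dict(generated_prompts)
    for cls, box in class_bounding_boxes(label_map).items():
        if cls in refined:
            refined[cls] = crop(image, box)
    return refined

# Toy data: a 3 x 4 "image" and a first-round prediction with classes 0 and 1.
image = [[10, 11, 12, 13],
         [20, 21, 22, 23],
         [30, 31, 32, 33]]
first_round = [[0, 0, 0, 0],
               [0, 1, 1, 0],
               [0, 1, 1, 0]]
generated = {0: "sd_prompt_0", 1: "sd_prompt_1", 2: "sd_prompt_2"}

second_round_prompts = refine_prompts(image, first_round, generated)
```

Class 2 never appears in the first-round prediction, so its generated prompt is carried over unchanged, while classes 0 and 1 receive crops taken from the input image itself.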

Loss & Training¶

Uses pixel-wise binary cross-entropy loss.
Trained on COCO-Stuff using the AdamW optimizer with a learning rate of \(2 \times 10^{-4}\).
Both the text encoder and visual prompt encoder are frozen; only the image encoder and decoder are trained.
Trained for 80K iterations with a batch size of 4 using two V100 GPUs.
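
For reference, a pixel-wise binary cross-entropy over per-class logits can be written as below. This is a generic, numerically stable formulation and not code from the paper; the mean reduction over all classes and pixels, the K x H x W nested-list layout, and the function name are assumptions.

```python
import math

def pixelwise_bce(logits, targets):
    """Mean binary cross-entropy over all classes and pixels.

    logits, targets: K x H x W nested lists; targets hold 0.0 or 1.0.
    Uses the stable form max(z, 0) - z * y + log(1 + exp(-|z|)),
    which equals -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))].
    """
    total, count = 0.0, 0
    for logit_plane, target_plane in zip(logits, targets):
        for logit_row, target_row in zip(logit_plane, target_plane):
            for z, y in zip(logit_row, target_row):
                total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
                count += 1
    return total / count

# Toy example: K = 2 classes on a 1 x 2 map.
targets = [[[1.0, 0.0]], [[0.0, 1.0]]]
uninformed = [[[0.0, 0.0]], [[0.0, 0.0]]]   # every probability is 0.5
confident = [[[5.0, -5.0]], [[-5.0, 5.0]]]  # agrees with the targets

loss_uninformed = pixelwise_bce(uninformed, targets)
loss_confident = pixelwise_bce(confident, targets)
```

With all logits at zero the loss equals ln 2, and it drops as the logits agree more strongly with the one-hot targets.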

Key Experimental Results¶

Main Results¶

ConvNeXt-B Configuration (mIoU, %):

Method	A-847	PC-459	A-150	PC-59	PAS-20
SED	11.4	18.6	31.6	57.3	94.4
CAT-Seg	8.4	16.6	27.2	57.5	93.7
DPSeg (Inf. II)	12.5	20.1	33.3	58.4	96.9

ConvNeXt-L Configuration (mIoU, %):

Method	A-847	PC-459	A-150	PC-59	PAS-20
SED	13.9	22.6	35.2	60.6	96.1
DPSeg (Inf. II)	15.7	24.1	37.1	62.3	98.5
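
As a quick arithmetic check on the two tables above, the following snippet computes the per-dataset gain of DPSeg (Inf. II) over SED for each backbone. The numbers are copied from the tables; the variable names are illustrative only.

```python
datasets = ["A-847", "PC-459", "A-150", "PC-59", "PAS-20"]

results = {
    "ConvNeXt-B": {
        "SED": [11.4, 18.6, 31.6, 57.3, 94.4],
        "DPSeg (Inf. II)": [12.5, 20.1, 33.3, 58.4, 96.9],
    },
    "ConvNeXt-L": {
        "SED": [13.9, 22.6, 35.2, 60.6, 96.1],
        "DPSeg (Inf. II)": [15.7, 24.1, 37.1, 62.3, 98.5],
    },
}

# Gain of DPSeg (Inf. II) over SED, rounded to one decimal like the tables.
gains = {
    backbone: {
        name: round(ours - base, 1)
        for name, base, ours in zip(datasets, rows["SED"], rows["DPSeg (Inf. II)"])
    }
    for backbone, rows in results.items()
}

for backbone, per_dataset in gains.items():
    print(backbone, per_dataset)
```

The gains over SED range from 1.1 to 2.5 points with ConvNeXt-B and from 1.5 to 2.4 points with ConvNeXt-L.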

Ablation Study¶

Ablation on Prompt Strategies:

Prompt Strategy	A-847	PC-459	A-150	PC-59	PAS-20
Text T Only	10.4	17.4	30.6	56.2	93.4
Visual V Only	11.1	18.0	31.2	56.9	94.5
Avg(T,V) (ours)	12.0	19.5	32.9	58.1	96.0

Ablation on Multi-Scale Cost Volume:

Configuration	A-847	A-150	Description
Only \(\mathcal{F}_c\)	10.6	31.6	No multi-scale guidance
\(\mathcal{F}_c + \mathcal{F}_v^{2}\)	11.4	32.1	Add 1-layer visual cost volume
\(\mathcal{F}_c + \mathcal{F}_v^{2,3,4}\) (ours)	12.0	32.9	Full 3-layer visual cost volume

Key Findings¶

Visual prompts consistently outperform text prompts, demonstrating that intra-modality alignment is superior to cross-modality alignment.
Dual-prompt fusion via simple averaging outperforms concatenation and fusion strategies at the cost volume level.
Inference II consistently improves performance by about 0.5–0.9 mIoU compared to Inference I, which supports the efficacy of scene-adaptive prompt refinement.
In the multi-scale ablation, adding the first visual cost volume raises A-847 from 10.6 to 11.4 (+0.8), and the two further scales add another 0.6 (to 12.0), indicating that while multi-scale guidance has diminishing returns, each layer remains valuable.
Upsampling the cost volume (the conventional strategy) performs significantly worse than utilizing multi-scale visual cost volumes.

Highlights & Insights¶

Bridging the Modality Gap with Stable Diffusion: Using a generative model to turn text prompts into visual prompts converts cross-modality alignment into intra-modality matching. This simple yet effective approach may transfer to other tasks that require text-image alignment.
Two-Round Inference Refinement Strategy: Cropping detected regions from the first round to serve as visual prompts in the second round follows a coarse-to-fine philosophy. It requires no additional training and boosts accuracy purely through inference pipeline design.
Multi-Scale Visual Cost Volume Guidance: Avoids information loss associated with cost volume upsampling by directly computing cost volumes with corresponding-scale visual features at each decoding layer.

Limitations & Future Work¶

Dependence on Stable Diffusion to pre-generate visual prompts increases inference preparation time and storage overhead.
The two-round inference strategy doubles runtime, making it less suitable for real-time applications.
The quality of visual prompts relies heavily on Stable Diffusion generation quality, which may yield poor-quality reference images for uncommon concepts.
Using only frozen features from the text and visual prompt encoders during training limits the model's adaptation capability.
The resolution of the cost volume is constrained by the encoder output, which may still be insufficient for segmenting extremely small objects.

Related Work & Insights¶

vs. CAT-Seg: CAT-Seg constructs cost volumes using only text prompts. By introducing visual prompts, DPSeg significantly improves A-150 from 27.2 to 33.3 (+6.1).
vs. SED: SED uses hierarchical encoders and multi-scale features but relies solely on text alignment; DPSeg builds upon SED by introducing visual prompts and multi-scale cost volumes.
Insights: The visual prompt concept can be extended to other tasks requiring cross-modal alignment (e.g., open-vocabulary detection, VQA) by using generative models as modality bridges.

Rating¶

Novelty: ⭐⭐⭐⭐ The approach of using generative models to produce visual prompts to bridge the modality gap is novel and effective.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across 5 datasets with two backbones and detailed ablation studies.
Writing Quality: ⭐⭐⭐⭐ Clear motivation analysis (t-SNE and modality gap experiments) and well-structured presentation.
Value: ⭐⭐⭐⭐ Provides a new paradigm for open-vocabulary segmentation, and the introduction of visual prompts has general applicability.