ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations¶
Conference: ICCV 2025 arXiv: 2501.14607 Code: Project Page Area: Image Segmentation Keywords: Referring Video Object Segmentation, Visual Grounding, GroundingDINO, Deformable Attention, Query Pruning
TL;DR¶
This paper proposes ReferDINO, which adapts the GroundingDINO visual grounding foundation model end-to-end to the Referring Video Object Segmentation (RVOS) task. By introducing a grounding-guided deformable mask decoder, an object-consistent temporal enhancer, and a confidence-based query pruning strategy, ReferDINO significantly surpasses state-of-the-art methods across five benchmarks (e.g., +3.9% \(\mathcal{J}\&\mathcal{F}\) on Ref-YouTube-VOS) while achieving real-time inference at 51 FPS.
Background & Motivation¶
Referring Video Object Segmentation (RVOS) segments target objects in videos based on textual descriptions, requiring three capabilities: deep vision-language understanding, pixel-level dense prediction, and spatiotemporal reasoning. Existing methods face the following core challenges:
Insufficient Vision-Language Capability: Existing RVOS models frequently confuse visually similar objects when handling complex attribute descriptions (e.g., "shape + color" combinations), owing to the limited scale of RVOS training data.
Limitations of Visual Grounding Foundation Models: Although GroundingDINO provides strong object-level VL understanding, it (a) only predicts bounding boxes without producing masks (lacking pixel-level dense prediction), and (b) cannot reason about dynamic attributes (e.g., "a cat wagging its tail").
Efficiency Bottleneck: GroundingDINO employs 900 queries, making frame-by-frame video processing computationally expensive.
Non-End-to-End Integration: Existing approaches (e.g., Grounded-SAM-2) cascade GroundingDINO and SAM2 in a non-end-to-end, non-differentiable pipeline, precluding further joint optimization.
Core Motivation: An end-to-end framework is needed to seamlessly integrate GroundingDINO's open-world grounding knowledge (region-level VL alignment) with pixel-level segmentation and spatiotemporal reasoning capabilities.
Method¶
Overall Architecture¶
ReferDINO augments GroundingDINO with three new modules:

1. Confidence Query Pruning → reduces per-frame computation
2. Object-Consistent Temporal Enhancer → enables cross-frame interaction
3. Grounding-Guided Deformable Mask Decoder → produces high-quality masks
Workflow: video frames are fed into GroundingDINO frame-by-frame → query pruning → per-frame object feature collection → temporal enhancement → mask decoding → output mask sequences.
Key Designs¶
- Grounding-Guided Deformable Mask Decoder (a minimal sampling sketch follows this design's bullets):
- Core Design: Box prediction and mask prediction are cascaded into a grounding–deformation–segmentation pipeline rather than operating in parallel.
- The predicted box center \((b_x, b_y)\) is used directly as the reference point for deformable attention, injecting grounding priors into mask prediction.
- Object features \(\tilde{\boldsymbol{o}}\) serve as Query, and FPN high-resolution features \(\boldsymbol{F}_{\text{seg}}\) serve as Memory.
- Sampling is performed via bilinear interpolation, making the process end-to-end differentiable so that segmentation gradients can back-propagate to optimize box predictions.
- A cross-modal attention layer (Query = object features, K/V = text features) is further applied to suppress background noise.
- Comparison with Dynamic Mask Head: The latter stores independent high-resolution feature maps per object, incurring substantial memory overhead; ReferDINO performs position-guided sampling over a shared feature map with zero additional memory cost.
- The decoder consists of \(L_m\) blocks, each comprising deformable cross-attention followed by cross-modal attention.
- Object-Consistent Temporal Enhancer (a tracker sketch follows this design's bullets):
- Composed of a memory-augmented tracker and a cross-modal temporal decoder.
- Memory-Augmented Tracker: Applies the Hungarian algorithm with cosine similarity to align objects across adjacent frames; memory is updated via momentum as \(\mathcal{M}^t = (1 - \alpha \cdot \boldsymbol{c}^t) \cdot \mathcal{M}^{t-1} + \alpha \cdot \boldsymbol{c}^t \cdot \hat{\mathcal{O}}^t\), where \(\boldsymbol{c}^t\) denotes the object-text similarity, preventing frames in which the target is invisible from corrupting long-term memory.
- Cross-Modal Temporal Decoder: Injects time-varying text features \(\{\boldsymbol{f}_{\text{cls}}^t\}\) as frame proxies for cross-frame interaction, rather than relying on static text embeddings; dynamic information is extracted via temporal self-attention followed by cross-attention (text as Query, objects as K/V).
- Design Motivation: (a) GroundingDINO processes each frame independently, lacking temporal consistency; (b) prior methods use static text embeddings, which cannot capture dynamic changes (e.g., action descriptions).
- Confidence Query Pruning (a pruning sketch follows this list):
- Low-confidence queries are progressively pruned at each layer of the cross-modal decoder.
- Confidence is defined by two terms: \(s_j = \frac{1}{N_l-1}\sum_{i \neq j} \boldsymbol{A}^s_{ij} + \max_k \boldsymbol{A}^c_{kj}\)
- First term: average attention received by the \(j\)-th query from all other queries (irreplaceability).
- Second term: maximum probability that the \(j\)-th object is mentioned in the text.
- The top \(1/k\) of queries, ranked by confidence, are retained at each layer, so only \(N_s \ll N_q = 900\) queries remain at the final layer.
- Decoder time complexity is reduced from \(O(L \cdot N^2 d)\) to \(O(\frac{k^2}{k^2-1} N^2 d)\), independent of depth \(L\).
- At \(k=2\), decoder computation is reduced to 24.7% of the original.
Loss & Training¶
- Hungarian matching selects the prediction sequence with minimum cost as the positive sample.
- Total loss: \(\mathcal{L}_{\text{total}} = \lambda_{\text{cls}}\mathcal{L}_{\text{cls}} + \lambda_{\text{box}}\mathcal{L}_{\text{box}} + \lambda_{\text{mask}}\mathcal{L}_{\text{mask}}\)
- \(\mathcal{L}_{\text{cls}}\): focal loss; \(\mathcal{L}_{\text{box}}\): L1 + GIoU; \(\mathcal{L}_{\text{mask}}\): Dice + binary focal + projection loss (a loss sketch follows this list).
- The backbone is frozen; the cross-modal Transformer is fine-tuned using LoRA (rank = 32).
- The model is first pre-trained on RefCOCO/+/g image data, then fine-tuned on RVOS data.
Key Experimental Results¶
Main Results¶
| Dataset (backbone) | Metric | ReferDINO | DsHmp (CVPR'24) | SOC (NeurIPS'23) | Gain |
|---|---|---|---|---|---|
| Ref-YouTube-VOS (Swin-T) | \(\mathcal{J}\&\mathcal{F}\) | 67.5 | 63.6 | 62.4 | +3.9 |
| MeViS (Swin-T) | \(\mathcal{J}\&\mathcal{F}\) | 48.0 | 46.4 | - | +1.6 |
| Ref-DAVIS17 (Swin-T) | \(\mathcal{J}\&\mathcal{F}\) | 66.7 | 64.0 | 63.5 | +2.7 |
| Ref-YouTube-VOS (Swin-B) | \(\mathcal{J}\&\mathcal{F}\) | 69.3 | 67.1 | 66.0 | +2.2 |
| MeViS (Swin-B) | \(\mathcal{J}\&\mathcal{F}\) | 49.3 | - | - | - |
Ablation Study¶
| Configuration | \(\mathcal{J}\&\mathcal{F}\) | Note |
|---|---|---|
| ReferDINO (full) | 48.0 (MeViS) | Full model |
| w/o CMA (remove cross-modal attention) | 47.6 (−0.4) | Cross-modal filtering provides an auxiliary benefit |
| w/o DCA (remove deformable cross-attention) | 45.3 (−2.7) | Grounding-guided sampling is the core component |
| w/o Tracker | 47.6 (−0.4) | The tracker contributes temporal consistency |
| w/o Temporal Decoder | 45.8 (−2.2) | The temporal decoder is a critical component |
| 50% query pruning | 67.5 vs. 67.6 full (Ref-YouTube-VOS) | Negligible performance loss, −40.6% FLOPs |
| Random 50% pruning | 38.3 (−29.3, Ref-YouTube-VOS) | Random pruning causes catastrophic degradation |
Key Findings¶
- ReferDINO with Swin-T outperforms the GroundingDINO baseline with Swin-B backbone (67.5 vs. 66.7 \(\mathcal{J}\&\mathcal{F}\)).
- The query pruning strategy achieves a 10× speedup (4.9 → 51 FPS) with negligible performance loss.
- Removing the grounding-guided deformable cross-attention (DCA) causes the largest drop (−2.7 \(\mathcal{J}\&\mathcal{F}\)), validating the importance of box priors for mask quality.
- Comparisons against G-DINO+SH/DH baselines demonstrate that naive adaptation is insufficient to fully unlock the potential of the foundation model.
Highlights & Insights¶
- End-to-End Adaptation Paradigm: The first work to adapt GroundingDINO end-to-end for RVOS, enabling differentiable joint optimization of box and mask predictions.
- Grounding-to-Mask Cascade Design: The pre-trained box predictions are elegantly leveraged as spatial priors to guide mask generation, outperforming a parallel design.
- Time-Varying Text Embeddings: The cross-modal encoder of GroundingDINO generates frame-specific text representations, capturing temporal dynamics more effectively than static text embeddings.
- Efficient Query Pruning: Confidence scores are computed by reusing decoder self-attention weights, introducing zero additional computational overhead.
- Real-Time Performance: The 51 FPS inference speed makes the method directly applicable to real-time video scenarios.
Limitations & Future Work¶
- Freezing the backbone and applying LoRA fine-tuning may limit the depth of adaptation to RVOS-specific scenarios.
- The momentum-based update in the memory-augmented tracker may suffer from drift in long videos.
- The model supports only single-target RVOS (multi-target selection on MeViS is handled via thresholding); scalability to multi-target settings has not been thoroughly validated.
- Integration with segmentation foundation models such as SAM2 remains unexplored.
Related Work & Insights¶
- Compared to Video-GroundingDINO (which merely adds temporal self-attention), ReferDINO presents a more thorough adaptation encompassing a mask decoder, temporal enhancer, and query pruning.
- The strategy of setting deformable attention reference points using box centers (replacing MLP-generated references) is a concise and effective design choice.
- The query pruning approach is generalizable to other DETR-like models for inference acceleration.
- The end-to-end foundation model adaptation paradigm offers valuable insights for other downstream tasks such as video grounding and temporal action detection.
Rating¶
- Novelty: ⭐⭐⭐⭐ The end-to-end foundation model adaptation paradigm is original, and the grounding-to-mask cascade design is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five datasets, two backbone variants, detailed ablations, and efficiency analysis — highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear structure with tightly coupled motivation and methodology; high-quality figures and tables.
- Value: ⭐⭐⭐⭐⭐ Establishes a new state of the art on RVOS while achieving real-time inference, offering strong practical utility.