ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations¶
Conference: ICCV 2025 arXiv: 2501.14607 Code: Project Page Area: Image Segmentation Keywords: Referring Video Object Segmentation, Visual Grounding, GroundingDINO, Deformable Attention, Query Pruning
TL;DR¶
This paper proposes ReferDINO, which adapts the GroundingDINO visual grounding foundation model end-to-end to the Referring Video Object Segmentation (RVOS) task. By introducing a grounding-guided deformable mask decoder, an object-consistent temporal enhancer, and a confidence-based query pruning strategy, ReferDINO significantly surpasses state-of-the-art methods across five benchmarks (e.g., +3.9% \(\mathcal{J}\&\mathcal{F}\) on Ref-YouTube-VOS) while achieving real-time inference at 51 FPS.
Background & Motivation¶
Referring Video Object Segmentation (RVOS) segments target objects in videos based on textual descriptions, requiring three capabilities: deep vision-language understanding, pixel-level dense prediction, and spatiotemporal reasoning. Existing methods face the following core challenges:
Insufficient Vision-Language Capability: Existing RVOS models frequently confuse visually similar objects when handling complex attribute descriptions (e.g., "shape + color" combinations), owing to the limited scale of RVOS training data.
Limitations of Visual Grounding Foundation Models: Although GroundingDINO provides strong object-level VL understanding, it (a) only predicts bounding boxes without producing masks (lacking pixel-level dense prediction), and (b) cannot reason about dynamic attributes (e.g., "a cat wagging its tail").
Efficiency Bottleneck: GroundingDINO employs 900 queries, making frame-by-frame video processing computationally expensive.
Non-End-to-End Integration: Existing approaches (e.g., Grounded-SAM-2) cascade GroundingDINO and SAM2 in a non-end-to-end, non-differentiable pipeline, precluding further joint optimization.
Core Motivation: An end-to-end framework is needed to seamlessly integrate GroundingDINO's open-world grounding knowledge (region-level VL alignment) with pixel-level segmentation and spatiotemporal reasoning capabilities.
Method¶
Overall Architecture¶
ReferDINO augments GroundingDINO with three new modules:

1. Confidence Query Pruning → reduces per-frame computation
2. Object-Consistent Temporal Enhancer → enables cross-frame interaction
3. Grounding-Guided Deformable Mask Decoder → produces high-quality masks
Workflow: video frames are fed into GroundingDINO frame-by-frame → query pruning → per-frame object feature collection → temporal enhancement → mask decoding → output mask sequences.
Key Designs¶
- Grounding-Guided Deformable Mask Decoder (a minimal sampling sketch follows this design's bullets):
- Core Design: Box prediction and mask prediction are cascaded into a grounding–deformation–segmentation pipeline rather than operating in parallel.
- The predicted box center \((b_x, b_y)\) is used directly as the reference point for deformable attention, injecting grounding priors into mask prediction.
- Object features \(\tilde{\boldsymbol{o}}\) serve as Query, and FPN high-resolution features \(\boldsymbol{F}_{\text{seg}}\) serve as Memory.
- Sampling is performed via bilinear interpolation, making the process end-to-end differentiable so that segmentation gradients can back-propagate to optimize box predictions.
- A cross-modal attention layer (Query = object features, K/V = text features) is further applied to suppress background noise.
- Comparison with Dynamic Mask Head: The latter stores independent high-resolution feature maps per object, incurring substantial memory overhead; ReferDINO performs position-guided sampling over a shared feature map with zero additional memory cost.
- The decoder consists of \(L_m\) blocks, each comprising deformable cross-attention followed by cross-modal attention.
- Object-Consistent Temporal Enhancer (a tracker sketch follows this design's bullets):
- Composed of a memory-augmented tracker and a cross-modal temporal decoder.
- Memory-Augmented Tracker: Applies the Hungarian algorithm with cosine similarity to align objects across adjacent frames; memory is updated via momentum as \(\mathcal{M}^t = (1 - \alpha \cdot \boldsymbol{c}^t) \cdot \mathcal{M}^{t-1} + \alpha \cdot \boldsymbol{c}^t \cdot \hat{\mathcal{O}}^t\), where \(\boldsymbol{c}^t\) denotes the object-text similarity, preventing frames in which the target is invisible from corrupting long-term memory.
- Cross-Modal Temporal Decoder: Injects time-varying text features \(\{\boldsymbol{f}_{\text{cls}}^t\}\) as frame proxies for cross-frame interaction, rather than relying on static text embeddings; dynamic information is extracted via temporal self-attention followed by cross-attention (text as Query, objects as K/V).
- Design Motivation: (a) GroundingDINO processes each frame independently, lacking temporal consistency; (b) prior methods use static text embeddings, which cannot capture dynamic changes (e.g., action descriptions).
- Confidence Query Pruning (a pruning sketch follows this list):
- Low-confidence queries are progressively pruned at each layer of the cross-modal decoder.
- Confidence is defined by two terms: \(s_j = \frac{1}{N_l-1}\sum_{i \neq j} \boldsymbol{A}^s_{ij} + \max_k \boldsymbol{A}^c_{kj}\)
- First term: average attention received by the \(j\)-th query from all other queries (irreplaceability).
- Second term: maximum probability that the \(j\)-th object is mentioned in the text.
- The top \(1/k\) of queries, ranked by confidence, are retained at each layer, so only \(N_s \ll N_q = 900\) queries remain at the final layer.
- Decoder time complexity is reduced from \(O(L \cdot N^2 d)\) to \(O(\frac{k^2}{k^2-1} N^2 d)\), independent of depth \(L\).
- At \(k=2\), decoder computation is reduced to 24.7% of the original.
Loss & Training¶
- Hungarian matching selects the prediction sequence with minimum cost as the positive sample.
- Total loss: \(\mathcal{L}_{\text{total}} = \lambda_{\text{cls}}\mathcal{L}_{\text{cls}} + \lambda_{\text{box}}\mathcal{L}_{\text{box}} + \lambda_{\text{mask}}\mathcal{L}_{\text{mask}}\)
- \(\mathcal{L}_{\text{cls}}\): focal loss; \(\mathcal{L}_{\text{box}}\): L1 + GIoU; \(\mathcal{L}_{\text{mask}}\): Dice + binary focal + projection loss (a loss sketch follows this list).
- The backbone is frozen; the cross-modal Transformer is fine-tuned using LoRA (rank = 32).
- The model is first pre-trained on RefCOCO/+/g image data, then fine-tuned on RVOS data.
Key Experimental Results¶
Main Results¶
| Dataset (backbone) | Metric | ReferDINO | DsHmp (CVPR'24) | SOC (NeurIPS'23) | Gain |
|---|---|---|---|---|---|
| Ref-YouTube-VOS (Swin-T) | \(\mathcal{J}\&\mathcal{F}\) | 67.5 | 63.6 | 62.4 | +3.9 |
| MeViS (Swin-T) | \(\mathcal{J}\&\mathcal{F}\) | 48.0 | 46.4 | - | +1.6 |
| Ref-DAVIS17 (Swin-T) | \(\mathcal{J}\&\mathcal{F}\) | 66.7 | 64.0 | 63.5 | +2.7 |
| Ref-YouTube-VOS (Swin-B) | \(\mathcal{J}\&\mathcal{F}\) | 69.3 | 67.1 | 66.0 | +2.2 |
| MeViS (Swin-B) | \(\mathcal{J}\&\mathcal{F}\) | 49.3 | - | - | - |
Ablation Study¶
| Configuration | \(\mathcal{J}\&\mathcal{F}\) | Note |
|---|---|---|
| ReferDINO (full) | 48.0 (MeViS) | Full model |
| w/o CMA (remove cross-modal attention) | 47.6 (−0.4) | Cross-modal filtering provides an auxiliary benefit |
| w/o DCA (remove deformable cross-attention) | 45.3 (−2.7) | Grounding-guided sampling is the core component |
| w/o Tracker | 47.6 (−0.4) | The tracker contributes temporal consistency |
| w/o Temporal Decoder | 45.8 (−2.2) | The temporal decoder is a critical component |
| 50% query pruning | 67.5 vs. 67.6 full (Ref-YouTube-VOS) | Negligible performance loss, −40.6% FLOPs |
| Random 50% pruning | 38.3 (−29.3, Ref-YouTube-VOS) | Random pruning causes catastrophic degradation |
Key Findings¶
- ReferDINO with Swin-T outperforms the GroundingDINO baseline with Swin-B backbone (67.5 vs. 66.7 \(\mathcal{J}\&\mathcal{F}\)).
- The query pruning strategy achieves a 10× speedup (4.9 → 51 FPS) with negligible performance loss.
- Removing the grounding-guided deformable cross-attention (DCA) causes the largest drop (−2.7 \(\mathcal{J}\&\mathcal{F}\)), validating the importance of box priors for mask quality.
- Comparisons against G-DINO+SH/DH baselines demonstrate that naive adaptation is insufficient to fully unlock the potential of the foundation model.
Highlights & Insights¶
- End-to-End Adaptation Paradigm: The first work to adapt GroundingDINO end-to-end for RVOS, enabling differentiable joint optimization of box and mask predictions.
- Grounding-to-Mask Cascade Design: The pre-trained box predictions are elegantly leveraged as spatial priors to guide mask generation, outperforming a parallel design.
- Time-Varying Text Embeddings: The cross-modal encoder of GroundingDINO generates frame-specific text representations, capturing temporal dynamics more effectively than static text embeddings.
- Efficient Query Pruning: Confidence scores are computed by reusing decoder self-attention weights, introducing zero additional computational overhead.
- Real-Time Performance: The 51 FPS inference speed makes the method directly applicable to real-time video scenarios.
Limitations & Future Work¶
- Freezing the backbone and applying LoRA fine-tuning may limit the depth of adaptation to RVOS-specific scenarios.
- The momentum-based update in the memory-augmented tracker may suffer from drift in long videos.
- The model supports only single-target RVOS (multi-target selection on MeViS is handled via thresholding); scalability to multi-target settings has not been thoroughly validated.
- Integration with segmentation foundation models such as SAM2 remains unexplored.
Related Work & Insights¶
- Compared to Video-GroundingDINO (which merely adds temporal self-attention), ReferDINO presents a more thorough adaptation encompassing a mask decoder, temporal enhancer, and query pruning.
- The strategy of setting deformable attention reference points using box centers (replacing MLP-generated references) is a concise and effective design choice.
- The query pruning approach is generalizable to other DETR-like models for inference acceleration.
- The end-to-end foundation model adaptation paradigm offers valuable insights for other downstream tasks such as video grounding and temporal action detection.
Rating¶
- Novelty: ⭐⭐⭐⭐ The end-to-end foundation model adaptation paradigm is original, and the grounding-to-mask cascade design is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five datasets, two backbone variants, detailed ablations, and efficiency analysis — highly comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear structure with tightly coupled motivation and methodology; high-quality figures and tables.
- Value: ⭐⭐⭐⭐⭐ Establishes a new state of the art on RVOS while achieving real-time inference, offering strong practical utility.