DualGround: Structured Phrase and Sentence-Level Temporal Grounding¶
Conference: NeurIPS 2025 arXiv: 2510.20244 Code: None Area: Video Understanding Keywords: Video Temporal Grounding, Phrase-Level Semantics, Dual-Branch Architecture, Attention Decoupling, Highlight Detection
TL;DR¶
This paper identifies that existing video temporal grounding (VTG) models over-rely on the global sentence semantics encoded in the [EOS] token while neglecting word-level signals. It proposes DualGround, a dual-branch architecture that explicitly decouples global and local semantics via a sentence-level path (adaptive cross-attention) and a phrase-level path (recurrent phrase generation + Slot Attention), achieving state-of-the-art performance on QVHighlights and Charades-STA.
Background & Motivation¶
- Background: VTG has advanced significantly through pretrained vision-language models (VLMs) such as CLIP and InternVideo2, yet these models treat all text tokens uniformly in cross-modal attention.
- Limitations of Prior Work: Experiments reveal that VTG models are heavily biased toward the global semantics of the [EOS] token: using only [EOS] yields performance comparable to or better than using all tokens, indicating severe underutilization of word-level cues.
- Key Challenge: The [EOS] token serves as a sentence summary computed independently of visual context, and may fail to capture visually salient textual cues (e.g., "red jacket"), leading to insufficient fine-grained grounding.
- Goal: Design an architecture that explicitly leverages word- and phrase-level semantics to balance global and local alignment.
- Key Insight: Decompose text representations into sentence-level ([EOS]) and phrase-level (word clusters) components, and model their alignment with video separately.
- Core Idea: A dual-path architecture in which the sentence path augments [EOS] attention via dummy tokens, while the phrase path clusters words into semantic phrases and computes phrase-segment context representations.
Method¶
Overall Architecture¶
Given video features \(V \in \mathbb{R}^{T \times d}\) and text features comprising [EOS] and word tokens, the model processes them through a sentence-level path and a phrase-level path in parallel, then fuses the outputs for temporal pyramid prediction.
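A shape-level sketch of the overall flow may help fix the tensor dimensions. This is a minimal NumPy stand-in, not the paper's implementation: the sentence path, phrase path, and fusion are each reduced to a one-line placeholder, and the sizes \(T\), \(d\), \(N\) are illustrative.

```python
import numpy as np

T, d, N, L_w = 75, 256, 4, 12          # clips, feature dim, phrase slots, words (illustrative)
rng = np.random.default_rng(0)

V = rng.standard_normal((T, d))        # video clip features
eos = rng.standard_normal((1, d))      # [EOS] sentence token
words = rng.standard_normal((L_w, d))  # word tokens

# Sentence path: [EOS]-conditioned video features (crude stand-in for adaptive cross-attention).
v_sent = V * (V @ eos.T)               # (T, d), relevance-gated

# Phrase path: N phrase vectors (stand-in for RPG + Slot Attention), then
# phrase-segment contexts and their aggregation.
P = words[:N]                          # (N, d) placeholder phrases
C = P[:, None, :] * V[None, :, :]      # (N, T, d) phrase-segment contexts
v_phr = C.mean(axis=0)                 # (T, d), stand-in for phrase-guided aggregation

fused = v_sent + v_phr                 # fused features fed to the temporal pyramid head
assert fused.shape == (T, d)
```

The real model replaces each placeholder with a learned module, but the interface (two \((T, d)\) streams fused before prediction) is the same.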
Key Designs¶
- Sentence-Level Path (ACA + Dummy Tokens):
- Function: Strengthen global alignment between [EOS] and video segments.
- Mechanism: \(L_d\) learnable dummy tokens are introduced as attention absorbers. Semantically irrelevant video segments attend to dummy tokens, while relevant segments attend to [EOS], producing sharper alignment signals \(\alpha_i\).
- Design Motivation: A lone [EOS] key gives softmax attention no alternative target, so every segment is forced to place high weight on it; dummy tokens supply the contrastive background that lets irrelevant segments route attention away, making \(\alpha_i\) discriminative.
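The dummy-token mechanism can be sketched as follows. This is a hedged NumPy toy (no learned projections, single head) rather than the paper's ACA module: video segments act as queries over a key set of [EOS] plus \(L_d\) dummy tokens, and the alignment signal is the attention mass that lands on [EOS].

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def eos_alignment(V, eos, dummies):
    """alpha[i] = attention mass segment i places on [EOS] vs. the dummy tokens."""
    keys = np.concatenate([eos[None, :], dummies], axis=0)   # (1 + L_d, d)
    logits = V @ keys.T / np.sqrt(V.shape[1])                # (T, 1 + L_d)
    attn = softmax(logits, axis=1)
    return attn[:, 0]                                        # column 0 is [EOS]

rng = np.random.default_rng(0)
T, d, L_d = 75, 256, 6
V = rng.standard_normal((T, d))
eos = rng.standard_normal((d,))
dummies = rng.standard_normal((L_d, d))
alpha = eos_alignment(V, eos, dummies)
assert alpha.shape == (T,) and np.all((alpha > 0) & (alpha < 1))
```

Without the dummy keys the softmax over a single [EOS] key would return 1 for every segment; the dummies are what make low \(\alpha_i\) values possible.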
- Phrase-Level Path (RPG + Slot Attention):
- Function: Construct semantically coherent phrase representations from word tokens.
- Mechanism: The Recurrent Phrase Generator (RPG) uses [EOS] and the previous phrase as guidance vectors to perform soft attention aggregation over word tokens. Slot Attention iteratively refines the phrases to disentangle overlapping semantics. Phrase-segment context embeddings are then computed as \(C = f_{ctx}(f_p(P) \odot f_v(V)) \in \mathbb{R}^{N \times T \times d}\).
- Design Motivation: Word-level semantics are context-dependent; phrase-level abstraction is better suited for alignment with visual content than individual words.
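The two phrase-path stages can be sketched in NumPy. This is a simplified stand-in under stated assumptions: the RPG step seeds each phrase with [EOS] plus the previous phrase as a guidance query, and the refinement step uses the characteristic Slot Attention normalization (softmax across slots, then per-slot renormalization) without the GRU update of the original Slot Attention; the context map \(f_{ctx}(f_p(P) \odot f_v(V))\) is reduced to a raw elementwise product.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def recurrent_phrase_init(words, eos, N):
    """Seed N phrases sequentially: each query mixes [EOS] with the previous phrase."""
    d = words.shape[1]
    phrases, prev = [], np.zeros(d)
    for _ in range(N):
        q = eos + prev                                       # guidance vector
        w = softmax(words @ q / np.sqrt(d))                  # soft attention over words
        prev = w @ words
        phrases.append(prev)
    return np.stack(phrases)                                 # (N, d)

def slot_refine(phrases, words, iters=3):
    """Slot-Attention-style update: words compete across slots, slots re-aggregate."""
    d = words.shape[1]
    for _ in range(iters):
        attn = softmax(words @ phrases.T / np.sqrt(d), axis=1)  # (L_w, N), compete over slots
        attn = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)  # normalize per slot
        phrases = attn.T @ words
    return phrases

rng = np.random.default_rng(0)
words = rng.standard_normal((12, 64))
eos = rng.standard_normal(64)
V = rng.standard_normal((75, 64))

P = slot_refine(recurrent_phrase_init(words, eos, N=4), words)  # (4, 64) phrases
C = P[:, None, :] * V[None, :, :]   # (N, T, d) phrase-segment contexts
assert C.shape == (4, 75, 64)
```

The softmax-over-slots axis is what makes slots compete for words and disentangles overlapping semantics.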
- Phrase-Guided Aggregation:
- Function: Aggregate \(N\) phrase-segment contexts into a unified representation.
- Mechanism: Importance weights are computed using the reconstructed global token \(P_{[EOS]}\) over all phrases, and contexts are aggregated accordingly: \(v_{p,t} = \sum_{n=1}^N \text{softmax}(\langle W_q P_{[EOS]}, W_k p^{(n)} \rangle / \sqrt{d}) \cdot C_{n,t}\).
- Design Motivation: Different phrases contribute differently to different temporal segments, necessitating adaptive aggregation.
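The aggregation formula above translates directly into code. A minimal NumPy sketch, with random stand-ins for the learned projections \(W_q, W_k\) and the reconstructed global token \(P_{[EOS]}\):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def phrase_guided_aggregate(C, P, p_eos, Wq, Wk):
    """v_{p,t} = sum_n softmax_n(<Wq p_eos, Wk p^(n)> / sqrt(d)) * C[n, t]."""
    d = P.shape[1]
    scores = (P @ Wk.T) @ (Wq @ p_eos) / np.sqrt(d)  # (N,) phrase scores
    w = softmax(scores)                               # importance weights over phrases
    return np.tensordot(w, C, axes=(0, 0))            # (T, d) aggregated contexts

rng = np.random.default_rng(0)
N, T, d = 4, 75, 64
C = rng.standard_normal((N, T, d))      # phrase-segment contexts
P = rng.standard_normal((N, d))         # phrase vectors
p_eos = rng.standard_normal(d)          # reconstructed global token
Wq, Wk = rng.standard_normal((d, d)), rng.standard_normal((d, d))

v_p = phrase_guided_aggregate(C, P, p_eos, Wq, Wk)
assert v_p.shape == (T, d)
```

Note that in this formulation the weights are shared across all timesteps \(t\); only the contexts \(C_{n,t}\) vary per segment.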
Loss & Training¶
- MR loss: Focal loss (classification) + L1 (regression)
- HD loss: Ranking loss + contrastive loss (applied to saliency scores and attention weights respectively)
- Phrase loss: DQA loss (orthogonalization of phrase attention) + EOS reconstruction loss (InfoNCE aligning \(P_{[EOS]}\) with \(e_{[EOS]}\))
- Total loss: \(\mathcal{L} = \lambda_{mr}\mathcal{L}_{mr} + \lambda_{hd}\mathcal{L}_{hd} + \lambda_{phrase}(\mathcal{L}_{DQA} + \mathcal{L}_{EOS})\)
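The total-loss combination, plus one plausible reading of the DQA regularizer, can be sketched as below. The DQA form (penalizing cosine overlap between phrase-attention rows) is an interpretation of "orthogonalization of phrase attention", not the paper's exact formula, and the \(\lambda\) weights are illustrative.

```python
import numpy as np

def dqa_loss(attn):
    """Interpretation of the DQA regularizer: penalize cosine overlap between
    the N phrase-attention rows so phrases cover distinct words."""
    A = attn / (np.linalg.norm(attn, axis=1, keepdims=True) + 1e-8)
    G = A @ A.T                                   # (N, N) row-similarity Gram matrix
    off = G - np.diag(np.diag(G))                 # keep only cross-phrase overlap
    n = G.shape[0]
    return float((off ** 2).sum() / (n * (n - 1)))

def total_loss(l_mr, l_hd, l_dqa, l_eos, lam_mr=1.0, lam_hd=1.0, lam_phrase=0.5):
    """L = lam_mr*L_mr + lam_hd*L_hd + lam_phrase*(L_DQA + L_EOS); weights illustrative."""
    return lam_mr * l_mr + lam_hd * l_hd + lam_phrase * (l_dqa + l_eos)

# Perfectly disjoint (one-hot) phrase attention incurs zero DQA penalty.
assert dqa_loss(np.eye(4)) == 0.0
assert abs(total_loss(1.0, 1.0, 0.2, 0.2) - 2.2) < 1e-9
```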
Key Experimental Results¶
Main Results¶
| Dataset | Metric | DualGround | FlashVTG | Gain |
|---|---|---|---|---|
| QVHighlights Test (IV2) | R1@0.7 | 56.94% | 53.96% | +2.98% |
| QVHighlights Test (IV2) | mAP | 52.73% | 52.00% | +0.73% |
| QVHighlights Val (IV2) | R1@0.7 | 58.97% | 56.06% | +2.91% |
| Charades-STA (IV2) | R1@0.7 | SOTA | - | - |
Ablation Study¶
| Configuration | R1@0.7 | mAP | Note |
|---|---|---|---|
| FlashVTG baseline | 56.13 | 52.24 | Full token sequence |
| + RPG + Slot + DQA + EOS | 58.97 | 53.26 | Full DualGround |
| − DQA loss | 56.55 | 52.53 | Insufficient phrase disentanglement |
| − EOS reconstruction | 58.02 | 53.11 | Slight drop in global consistency |
| − Slot Attention | 57.83 | 53.02 | Insufficient refinement |
Key Findings¶
- Performance gains are more pronounced for long queries (>20 words), confirming the phrase path's effectiveness on complex queries.
- Attention visualizations clearly demonstrate that baseline methods neglect visually salient words such as "red jacket."
- Highlight detection is sensitive to the phrase path—unregularized phrases may introduce noise.
- The DQA loss contributes +2.42 R1@0.7; the EOS reconstruction loss contributes +0.95 R1@0.7; Slot Attention contributes +1.14 R1@0.7.
- Compared to the FlashVTG baseline (InternVideo2 features): R1@0.7 improves from 56.13% to 58.97% (+2.84), and mAP improves from 52.24% to 53.26% (+1.02).
Highlights & Insights¶
- Controlled experiments (words-only / EOS-only / all tokens) provide a clear and rigorous demonstration of the [EOS] bias in VTG models.
- Applying Slot Attention to refine phrase clusters is an elegant design transfer.
- The sentence + phrase dual-path design is generalizable to other visual-language alignment tasks.
- The DQA loss and EOS reconstruction loss provide effective regularization.
Limitations & Future Work¶
- The phrase count \(N\) must be specified in advance; the optimal \(N\) may vary across queries.
- The dual-path design incurs additional computational overhead at inference time.
- No comparison is made against MLLM-based methods (e.g., TimeZero).
- Validation is limited to MR and HD tasks; extension to broader tasks such as video QA is not explored.
Related Work & Insights¶
- vs. FlashVTG: FlashVTG employs a multi-scale temporal pyramid but processes text uniformly; DualGround decouples sentence- and phrase-level semantics.
- vs. CG-DETR: CG-DETR introduces ACA but applies it only at the sentence level; DualGround extends it to a complete dual-path framework.
- vs. Keyword-DETR: Keyword-DETR emphasizes visually salient keywords but does not perform phrase clustering; DualGround achieves semantically coherent phrase abstraction through Slot Attention.
Rating¶
- Novelty: ⭐⭐⭐⭐ The sentence/phrase dual-path decoupling is an effective and well-motivated design.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-benchmark comparisons, detailed ablations, and visualization analyses.
- Writing Quality: ⭐⭐⭐⭐ Motivation is rigorously established; controlled experiments are persuasive.
- Value: ⭐⭐⭐⭐ Offers meaningful reference for VTG and visual-language alignment research.
Implementation Details¶
Built on the FlashVTG framework using InternVideo2/CLIP features. \(L_d\) learnable dummy tokens serve as attention absorbers.