Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding¶
- Conference: NeurIPS 2025
- arXiv: 2510.20244
- Code: None
- Area: Video Understanding / Temporal Grounding
- Keywords: Video Temporal Grounding, Dual-Path Architecture, Phrase-Level Alignment, Moment Retrieval, Attention Decoupling
TL;DR¶
DualGround identifies a critical issue in existing VTG models — over-reliance on the global semantics of the [EOS] token while neglecting word-level signals — and proposes a sentence-level + phrase-level dual-path architecture. Through an Adaptive Cross-Attention (ACA) module and a Recurrent Phrase Generator (RPG), the model captures global and local semantics respectively, achieving state-of-the-art performance on QVHighlights and Charades-STA.
Background & Motivation¶
State of the Field¶
Video Temporal Grounding (VTG) encompasses two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). Existing models typically process text features from pretrained encoders such as CLIP and InternVideo2 by treating all text tokens (word tokens + [EOS] token) uniformly.
Through controlled experiments, the authors identify a critical issue:
Limitations of Prior Work¶
- [EOS] Bias: Models rely almost exclusively on the global semantics of the [EOS] token, while word-level tokens are severely neglected.
- Evidence: Using only the [EOS] token yields performance comparable to, or even better than, using the full token sequence, indicating that word tokens contribute little.
- Attention Misallocation: Even for video segments irrelevant to the query, attention concentrates on [EOS] rather than on semantically relevant keywords (e.g., "red jacket").
- Consequence: Performance degrades in scenarios requiring fine-grained word-to-video alignment.
Method¶
Overall Architecture¶
The dual-path architecture processes sentence-level and phrase-level semantics separately:

1. Sentence-Level Path: utilizes the [EOS] token with Adaptive Cross-Attention (ACA) to capture global alignment.
2. Phrase-Level Path: word tokens → Recurrent Phrase Generator (RPG) → Slot Attention refinement → phrase-segment interaction.
3. Fusion: the outputs of both paths, \(V^s + V^p\), are summed and fed into the decoder.
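At the shape level, the fusion step is simple to sketch. The following is a minimal numpy illustration with random stand-ins for the two path outputs (all shapes and names are illustrative, not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                    # number of video segments, feature dim (illustrative)

# Stand-ins for the two path outputs; in the model these come from the
# ACA (sentence-level) and RPG + Slot Attention (phrase-level) paths.
V_s = rng.normal(size=(T, d))  # sentence-level features V^s
V_p = rng.normal(size=(T, d))  # phrase-level features V^p

fused = V_s + V_p              # element-wise fusion fed into the decoder
assert fused.shape == (T, d)
```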
Key Designs¶
- Adaptive Cross-Attention (ACA, Sentence-Level Path):
    - Function: Enables each video segment to selectively align with the global semantics of the [EOS] token.
    - Mechanism: Introduces learnable dummy tokens as "attention attractors"; video segments semantically unrelated to the query distribute attention to the dummy tokens rather than to [EOS], reducing noise.
    - Design Motivation: With only a single [EOS] key, softmax attention degenerates: every query assigns it full weight. Dummy tokens supply alternative keys, so a contrastive attention distribution can form.
    - Implementation: Dummy tokens are processed through a conditional encoder and concatenated with [EOS] as key-value pairs; video features serve as queries.
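The dummy-token mechanism can be illustrated with a minimal numpy sketch. The scaled-dot-product form, the shapes, and the way dummy attention mass is simply discarded are all simplifying assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_cross_attention(video, eos, dummies):
    """video: (T, d) segment queries; eos: (d,) global text token;
    dummies: (M, d) learnable attractors. Shapes/logic are illustrative."""
    keys = np.vstack([eos[None, :], dummies])            # (1 + M, d)
    scores = video @ keys.T / np.sqrt(video.shape[-1])   # (T, 1 + M)
    attn = softmax(scores, axis=-1)                      # each segment splits its mass
    # Only the [EOS] column carries text semantics into the output; mass
    # absorbed by dummy tokens is discarded, so query-irrelevant segments
    # receive a weak [EOS] signal instead of forced full attention.
    out = attn[:, :1] * eos[None, :]                     # (T, d)
    return out, attn

rng = np.random.default_rng(0)
T, d, M = 5, 8, 4
out, attn = adaptive_cross_attention(
    rng.normal(size=(T, d)), rng.normal(size=(d,)), rng.normal(size=(M, d)))
assert out.shape == (T, d) and attn.shape == (T, 1 + M)
```

Without the dummy rows, `keys` would contain a single vector and every row of `attn` would trivially equal 1, which is the degenerate case the design avoids.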
- Recurrent Phrase Generation + Slot Attention Refinement (Phrase-Level Path):
    - Function: Clusters word tokens into \(N\) semantically coherent phrase units and models fine-grained interactions between phrases and video segments.
    - Mechanism: The RPG module sequentially generates \(N\) phrases conditioned on the global semantics and the previous phrase; Slot Attention then iteratively refines the phrases to remove semantic overlap.
    - Design Motivation: Word-level semantics are context-dependent; phrase-level abstraction provides more coherent alignment units, and sequential generation encourages adjacent words to be grouped into the same phrase.
    - Phrase-Segment Interaction: Contextual embeddings \(C \in \mathbb{R}^{N \times T \times d}\) are computed for all phrases across all timesteps via Hadamard product.
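A minimal numpy sketch of the phrase path follows. The pooling rule inside the loop is a simplified stand-in for the RPG module (the real conditioning network is not specified here), and the Slot Attention refinement step is omitted; only the recurrent conditioning and the \(C \in \mathbb{R}^{N \times T \times d}\) Hadamard interaction match the note's description:

```python
import numpy as np

def recurrent_phrase_generator(words, eos, N):
    """words: (L, d) word tokens; eos: (d,) global [EOS] semantics.
    Generates N phrases sequentially, each conditioned on the global
    semantics and the previous phrase (simplified illustrative stand-in;
    Slot Attention refinement is omitted)."""
    prev = np.zeros_like(eos)
    phrases = []
    for _ in range(N):
        query = eos + prev                     # condition on global + previous phrase
        s = words @ query
        w = np.exp(s - s.max()); w /= w.sum()  # attention over word tokens
        prev = w @ words                       # pooled phrase embedding, shape (d,)
        phrases.append(prev)
    return np.stack(phrases)                   # (N, d)

def phrase_segment_interaction(phrases, video):
    """Contextual embeddings C[n, t] = phrases[n] * video[t] (Hadamard
    product), giving C of shape (N, T, d)."""
    return phrases[:, None, :] * video[None, :, :]

rng = np.random.default_rng(1)
L, T, d, N = 6, 5, 8, 4
P = recurrent_phrase_generator(rng.normal(size=(L, d)), rng.normal(size=(d,)), N)
C = phrase_segment_interaction(P, rng.normal(size=(T, d)))
assert P.shape == (N, d) and C.shape == (N, T, d)
```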
Loss & Training¶
- MR Loss: Focal classification loss + L1 boundary regression loss.
- HD Loss: Ranking loss + contrastive loss (applied to saliency scores and attention weights respectively).
- Phrase-Level Loss:
- DQA Loss: Regularizes the orthogonality of phrase attention distributions via \(\|AA^T - rI\|_F^2\), encouraging semantic diversity.
- Global Reconstruction Loss: Encourages \(P_{[EOS]}\) to reconstruct global semantics, ensuring consistency between the phrase-level and sentence-level paths.
- Total Loss: \(\mathcal{L} = \mathcal{L}_{mr} + \mathcal{L}_{hd} + \mathcal{L}_{phrase}\)
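The DQA regularizer is the most self-contained of these terms and can be sketched directly from its formula. Treating \(r\) as a scale hyperparameter (set to 1.0 here purely for illustration):

```python
import numpy as np

def dqa_loss(A, r=1.0):
    """A: (N, T) row-stochastic phrase attention distributions.
    Penalizes overlap between phrase attentions via ||A A^T - r I||_F^2;
    r is a scale hyperparameter (1.0 here for illustration)."""
    G = A @ A.T                                   # (N, N) Gram matrix of attentions
    return float(np.linalg.norm(G - r * np.eye(A.shape[0]), ord="fro") ** 2)

# Disjoint one-hot attentions are perfectly diverse: loss is 0.
assert dqa_loss(np.eye(4)) == 0.0
# Identical (fully overlapping) attentions are penalized.
assert dqa_loss(np.ones((2, 3)) / 3) > 0.0
```

Minimizing this term pushes the off-diagonal Gram entries toward zero, i.e. different phrases toward attending to different segments, which is the semantic-diversity effect the ablation attributes to the DQA loss.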
Key Experimental Results¶
Main Results¶
QVHighlights test split (InternVideo2 features):
| Method | MR R1@0.5 | MR R1@0.7 | MR mAP@0.5 | HD mAP | HD HIT@1 |
|---|---|---|---|---|---|
| FlashVTG | 74.7 | 60.0 | 54.8 | 40.0 | 66.2 |
| R2-Tuning | 72.5 | 57.3 | 52.1 | 39.4 | 65.0 |
| DualGround | >75 | >61 | >55 | >40 | >67 |
Charades-STA (InternVideo2 features):
| Method | R1@0.5 | R1@0.7 |
|---|---|---|
| FlashVTG | ~73 | ~55 |
| DualGround | Higher | Higher |
DualGround achieves state-of-the-art performance on both benchmarks across MR and HD tasks.
Ablation Study¶
- Sentence-level path vs. phrase-level path vs. dual-path: the dual-path configuration significantly outperforms either single-path variant.
- Number of dummy tokens: 4–8 tokens yield optimal performance.
- Number of phrases \(N\): \(N=4\) achieves the best balance.
- Slot Attention refinement: contributes approximately 0.5–1.0% mAP improvement.
- DQA Loss: removal leads to phrase homogenization and performance degradation.
Key Findings¶
- Existing VTG models exhibit significantly lower performance in the word-only configuration compared to [EOS]-only, confirming the global semantic bias.
- DualGround successfully reduces over-reliance on [EOS]: word-only performance improves substantially.
- The phrase-level path yields the largest gains for long queries requiring fine-grained alignment.
- The sentence-level path remains effective and necessary for short queries.
Highlights & Insights¶
- Precise Problem Diagnosis: Controlled experiments clearly expose the [EOS] bias issue; the experimental design is methodologically instructive.
- Principled Architecture: The necessity of the dual-path design is derived directly from the identified problem, forming a complete logical chain.
- Phrases as Intermediate Representations: Introducing a phrase-level granularity between words and sentences constitutes a well-motivated semantic abstraction.
- The dummy token design in ACA elegantly resolves the attention compatibility issue arising from a single-token key sequence.
Limitations & Future Work¶
- The number of phrases \(N\) is fixed, which limits flexibility for queries of varying length and complexity.
- Phrase clustering relies entirely on feature-space similarity, without leveraging syntactic structural information.
- Generalization to additional datasets (e.g., ActivityNet, TACoS) has not been validated.
- Inference cost may increase due to the dual-path architecture and the iterative Slot Attention refinement.
Related Work & Insights¶
- Connection to Keyword-DETR: Both approaches address the differential importance of text tokens, but DualGround explicitly separates global and local pathways.
- The phrase generation mechanism in LGI (Local-Global Interaction) inspired the RPG design.
- Transferring Slot Attention from object discovery to NLP phrase clustering represents a meaningful cross-domain application.
Rating¶
⭐⭐⭐⭐ — Solid problem identification, clear dual-path design rationale, and state-of-the-art results in the competitive VTG field.