Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding

Conference: NeurIPS 2025 arXiv: 2510.20244 Code: None Area: Video Understanding / Temporal Grounding Keywords: Video Temporal Grounding, Dual-Path Architecture, Phrase-Level Alignment, Moment Retrieval, Attention Decoupling

TL;DR

DualGround identifies a critical issue in existing VTG models — over-reliance on the global semantics of the [EOS] token while neglecting word-level signals — and proposes a sentence-level + phrase-level dual-path architecture. Through an Adaptive Cross-Attention (ACA) module and a Recurrent Phrase Generator (RPG), the model captures global and local semantics respectively, achieving state-of-the-art performance on QVHighlights and Charades-STA.

Background & Motivation

State of the Field

Video Temporal Grounding (VTG) encompasses two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). Existing models typically process text features from pretrained encoders such as CLIP and InternVideo2 by treating all text tokens (word tokens + [EOS] token) uniformly.

Through controlled experiments, the authors identify a critical issue:

Limitations of Prior Work

[EOS] Bias: Models rely almost exclusively on the global semantics of the [EOS] token, while word-level tokens are severely neglected.

Root Cause

Using only the [EOS] token yields performance comparable to, or even better than, using the full token sequence, indicating that word tokens contribute little signal in existing models.

Starting Point

Even for video segments irrelevant to the query, attention concentrates on [EOS] rather than on semantically relevant keywords (e.g., "red jacket").

Supplementary Remarks

This leads to degraded performance in scenarios requiring fine-grained word-to-video alignment.

Method

Overall Architecture

The dual-path architecture processes sentence-level and phrase-level semantics separately:

  1. Sentence-Level Path: utilizes the [EOS] token with Adaptive Cross-Attention (ACA) to capture global alignment.
  2. Phrase-Level Path: word tokens → Recurrent Phrase Generator (RPG) → Slot Attention refinement → phrase-segment interaction.
  3. Fusion: the outputs of the two paths, \(V^s\) and \(V^p\), are fused and fed into the decoder.
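
A minimal sketch of the fusion step, reading the \(V^s + V^p\) notation as element-wise addition (an assumption; the shapes below are illustrative):

```python
import torch

B, T, d = 2, 75, 256             # batch size, number of video clips, feature dim (illustrative)
v_s = torch.randn(B, T, d)       # sentence-level path output V^s
v_p = torch.randn(B, T, d)       # phrase-level path output V^p

fused = v_s + v_p                # element-wise fusion of the two paths
assert fused.shape == (B, T, d)  # fused clip features are fed to the MR/HD decoder
```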

Key Designs

  1. Adaptive Cross-Attention (ACA, Sentence-Level Path):

    • Function: Enables each video segment to selectively align with the global semantics of the [EOS] token.
    • Mechanism: Introduces learnable dummy tokens as "attention attractors"; video segments semantically unrelated to the query distribute attention to dummy tokens rather than to [EOS], reducing noise.
    • Design Motivation: With [EOS] as the only key, softmax attention trivially assigns it all the weight, so irrelevant segments have no way to down-weight it; dummy tokens give the distribution an alternative target to contrast against.
    • Implementation: Dummy tokens are processed through a conditional encoder and concatenated with [EOS] to form the key-value sequence; video features serve as queries (see the ACA sketch after this list).
  2. Recurrent Phrase Generation + Slot Attention Refinement (Phrase-Level Path):

    • Function: Clusters word tokens into \(N\) semantically coherent phrase units and models fine-grained interactions between phrases and video segments.
    • Mechanism: The RPG module sequentially generates \(N\) phrases conditioned on the global semantics and the previous phrase; Slot Attention iteratively refines the phrases to remove semantic overlap.
    • Design Motivation: Word-level semantics are context-dependent; phrase-level abstraction provides more coherent alignment units, and sequential generation encourages adjacent words to be grouped into the same phrase.
    • Phrase-Segment Interaction: Contextual embeddings \(C \in \mathbb{R}^{N \times T \times d}\) are computed for every phrase-segment pair via the Hadamard product (see the phrase-path sketch after this list).
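
A minimal PyTorch sketch of the ACA mechanism. The standard multi-head cross-attention layer, the reduction of the conditional encoder to a single linear layer, and all names here are assumptions about the concrete implementation; only the key idea (dummy tokens as attention sinks for query-irrelevant clips) follows the paper:

```python
import torch
import torch.nn as nn

class AdaptiveCrossAttention(nn.Module):
    """Video clips attend over [EOS] plus learnable dummy tokens, so clips
    unrelated to the query can shed their attention onto the dummies."""

    def __init__(self, d_model: int = 256, n_dummy: int = 4, n_heads: int = 8):
        super().__init__()
        self.dummy = nn.Parameter(torch.randn(n_dummy, d_model))  # learnable dummy tokens
        self.cond = nn.Linear(d_model, d_model)                   # stand-in for the conditional encoder
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, video: torch.Tensor, eos: torch.Tensor) -> torch.Tensor:
        # video: (B, T, d) clip features as queries; eos: (B, 1, d) global semantics
        B = video.size(0)
        dummies = self.cond(self.dummy).unsqueeze(0).expand(B, -1, -1)  # (B, n_dummy, d)
        kv = torch.cat([eos, dummies], dim=1)  # keys/values: [EOS] + dummy tokens
        out, _ = self.attn(video, kv, kv)      # each clip aligns with [EOS] or a dummy
        return out                             # sentence-level clip features V^s

aca = AdaptiveCrossAttention()
v_s = aca(torch.randn(2, 75, 256), torch.randn(2, 1, 256))  # usage sketch
```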
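
And a sketch of the phrase-level path, with a GRU cell standing in for the recurrent conditioning, soft attention pooling over word tokens as the clustering step, and the Hadamard phrase-segment interaction; Slot Attention refinement is omitted, and every concrete layer choice is an assumption:

```python
import torch
import torch.nn as nn

class RecurrentPhraseGenerator(nn.Module):
    """Generates N phrase units sequentially; each step is conditioned on the
    global sentence semantics and the previous phrase (carried in the GRU state)."""

    def __init__(self, d_model: int = 256, n_phrases: int = 4):
        super().__init__()
        self.n_phrases = n_phrases
        self.cell = nn.GRUCell(d_model, d_model)  # state carries the previous phrase
        self.scale = d_model ** -0.5

    def forward(self, words: torch.Tensor, eos: torch.Tensor) -> torch.Tensor:
        # words: (B, L, d) word token features; eos: (B, d) global sentence semantics
        phrases, h = [], eos
        for _ in range(self.n_phrases):
            h = self.cell(eos, h)  # next phrase query from [EOS] + previous phrase
            attn = torch.softmax(h.unsqueeze(1) @ words.transpose(1, 2) * self.scale, dim=-1)
            phrases.append((attn @ words).squeeze(1))  # soft cluster of word tokens
        return torch.stack(phrases, dim=1)  # (B, N, d) phrase units

# Phrase-segment interaction: contextual embeddings C (here with a batch dim) via Hadamard product
B, T, L, d = 2, 75, 12, 256
phrases = RecurrentPhraseGenerator()(torch.randn(B, L, d), torch.randn(B, d))
video = torch.randn(B, T, d)
C = phrases.unsqueeze(2) * video.unsqueeze(1)  # (B, N, T, d), one embedding per phrase-segment pair
```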

Loss & Training

  • MR Loss: Focal classification loss + L1 boundary regression loss.
  • HD Loss: Ranking loss + contrastive loss (applied to saliency scores and attention weights respectively).
  • Phrase-Level Loss:
    • DQA Loss: Regularizes the orthogonality of phrase attention distributions via \(\|AA^T - rI\|_F^2\), encouraging semantic diversity.
    • Global Reconstruction Loss: Encourages \(P_{[EOS]}\) to reconstruct global semantics, ensuring consistency between the phrase-level and sentence-level paths.
  • Total Loss: \(\mathcal{L} = \mathcal{L}_{mr} + \mathcal{L}_{hd} + \mathcal{L}_{phrase}\)
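
A small sketch of the DQA regularizer, implementing \(\|AA^T - rI\|_F^2\) directly over phrase-to-word attention maps (the shapes and the value of \(r\) are illustrative):

```python
import torch

def dqa_loss(A: torch.Tensor, r: float = 1.0) -> torch.Tensor:
    """||A A^T - r I||_F^2 where A is (B, N, L): attention of N phrases over L word tokens."""
    B, N, _ = A.shape
    gram = A @ A.transpose(1, 2)                             # (B, N, N) pairwise phrase overlap
    target = r * torch.eye(N, device=A.device).expand(B, N, N)
    return ((gram - target) ** 2).sum(dim=(1, 2)).mean()     # squared Frobenius norm, batch mean

# Off-diagonal entries of A A^T measure overlap between phrase attention maps, so
# pushing the Gram matrix toward r*I penalizes redundancy and encourages diversity.
A = torch.softmax(torch.randn(2, 4, 12), dim=-1)             # 4 phrases over 12 word tokens
print(dqa_loss(A))
```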

Key Experimental Results

Main Results

QVHighlights test split (InternVideo2 features):

Method       MR R1@0.5   MR R1@0.7   MR mAP@0.5   HD mAP   HD HIT@1
FlashVTG     74.7        60.0        54.8         40.0     66.2
R2-Tuning    72.5        57.3        52.1         39.4     65.0
DualGround   >75         >61         >55          >40      >67

Charades-STA (InternVideo2 features):

Method       R1@0.5   R1@0.7
FlashVTG     ~73      ~55
DualGround   higher   higher

DualGround achieves state-of-the-art performance on both benchmarks across MR and HD tasks.

Ablation Study

  • Sentence-level path vs. phrase-level path vs. dual-path: the dual-path configuration significantly outperforms either single-path variant.
  • Number of dummy tokens: 4–8 tokens yield optimal performance.
  • Number of phrases \(N\): \(N=4\) achieves the best balance.
  • Slot Attention refinement: contributes approximately 0.5–1.0% mAP improvement.
  • DQA Loss: removal leads to phrase homogenization and performance degradation.

Key Findings

  • Existing VTG models exhibit significantly lower performance in the word-only configuration compared to [EOS]-only, confirming the global semantic bias.
  • DualGround successfully reduces over-reliance on [EOS]: word-only performance improves substantially.
  • The phrase-level path yields the largest gains for long queries requiring fine-grained alignment.
  • The sentence-level path remains effective and necessary for short queries.

Highlights & Insights

  • Precise Problem Diagnosis: Controlled experiments clearly expose the [EOS] bias issue; the experimental design is methodologically instructive.
  • Principled Architecture: The necessity of the dual-path design is derived directly from the identified problem, forming a complete logical chain.
  • Phrases as Intermediate Representations: Introducing a phrase-level granularity between words and sentences constitutes a well-motivated semantic abstraction.
  • The dummy token design in ACA elegantly resolves the attention compatibility issue arising from a single-token key sequence.

Limitations & Future Work

  • The number of phrases \(N\) is fixed, which limits flexibility for queries of varying length and complexity.
  • Phrase clustering relies entirely on feature-space similarity, without leveraging syntactic structural information.
  • Generalization to additional datasets (e.g., ActivityNet, TACoS) has not been validated.
  • Inference latency may increase due to the dual-path architecture and the iterative Slot Attention refinement.

Related Connections

  • Connection to Keyword-DETR: Both approaches address the differential importance of text tokens, but DualGround explicitly separates the global and local pathways.
  • The phrase generation mechanism in LGI (Local-Global Interaction) inspired the RPG design.
  • Transferring Slot Attention from object discovery to NLP phrase clustering represents a meaningful cross-domain application.

Rating

⭐⭐⭐⭐ — Solid problem identification, clear dual-path design rationale, and state-of-the-art results in the competitive VTG field.