# Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
Conference: ICCV 2025 arXiv: 2510.17023 Code: Project Page Area: Multimodal VLM Keywords: Video Temporal Grounding, Multimodal LLM, Query Enrichment, Multiple Instance Learning, Temporal Detection
## TL;DR
This paper proposes ED-VTG, a two-stage framework for video temporal grounding (VTG) that first enriches the input query and then predicts temporal intervals. By leveraging the descriptive capability of multimodal LLMs to supplement query details, combined with a lightweight interval decoder and a multiple instance learning (MIL) framework, ED-VTG is the first LLM-based method to comprehensively match or surpass specialized models across multiple benchmarks.
## Background & Motivation
Video Temporal Grounding (VTG) requires localizing temporal intervals in a video based on natural language queries. Existing methods face two major challenges:
Query quality: Queries in existing datasets are often coarse, incomplete, or ambiguous (e.g., "Man starts surfing" lacks appearance details), limiting localization precision when used directly. Captioning and grounding are dual tasks — the output of captioning is precisely the input of grounding — yet this complementary relationship is rarely exploited.
Precision bottleneck of LLM-based methods: Existing LLM-based approaches either represent timestamps as text tokens (limited by tokenization precision) or introduce special tokens for frame indices (incurring large vocabulary overhead). They cannot apply detection-oriented losses such as L1/gIoU, and while they generalize well, their precision falls short of specialized models.
Core hypothesis: More detailed queries lead to more accurate grounding. For example, enriching "Man starts surfing" to "The man with a yellow surfboard slowly runs to start surfing" yields more precise temporal boundaries.
## Method
### Overall Architecture
ED-VTG is a two-stage cascaded framework:
- Enrich stage: A multimodal LLM enriches the input query into a more detailed description conditioned on video content.
- Detect stage: The LLM emits a special token <INT>, whose hidden state is passed to a lightweight interval decoder that regresses the target interval, parameterized by center and width \((c, w)\).
The key pipeline is: \((V, Q) \to \hat{Q}^{enr}\), followed by \((V, \hat{Q}^{enr}) \to \hat{I}\). The quality of the enriched query directly affects interval prediction accuracy.
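The two-stage flow can be summarized in a short sketch. This is a minimal illustration under assumed interfaces: `enrich_fn` and `decode_fn` are hypothetical stand-ins for the multimodal LLM and the interval decoder, not the authors' released API.

```python
from typing import Callable, Tuple

def enrich_and_detect(
    video_tokens,                                   # T_V: encoded video features
    query: str,                                     # original (possibly coarse) query Q
    enrich_fn: Callable[..., Tuple[str, object]],   # (T_V, Q) -> (Q_enr, h_int of <INT>); hypothetical
    decode_fn: Callable[..., Tuple[float, float]],  # (h_int, T_V) -> (c_hat, w_hat); hypothetical
) -> Tuple[str, Tuple[float, float]]:
    """(V, Q) -> Q_enr, then (V, Q_enr) -> predicted interval."""
    enriched_query, h_int = enrich_fn(video_tokens, query)  # stage 1: LLM rewrites Q and emits <INT>
    center, width = decode_fn(h_int, video_tokens)          # stage 2: decoder regresses (c, w)
    start, end = center - width / 2, center + width / 2     # convert center/width to [start, end]
    return enriched_query, (start, end)
```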
### Key Designs
- Query Enrichment (Enrich): The LLM autoregressively generates the enriched query \(\hat{Q}^{enr}\) token by token, then emits an <INT> token to trigger interval prediction. Generation follows the format "The query \(\hat{Q}^{enr}\) occurs at <INT>". During training, pseudo-labels for enriched queries are produced by an external strong captioning model (LLaVA-OneVision 72B): given the original query and the corresponding video segment, it is prompted to supplement details while preserving the original meaning.
- Lightweight Interval Decoder (Detect): The hidden state \(\mathbf{h}_{int}\) of the <INT> token is linearly projected and concatenated with the video tokens \(\mathbf{T}_V\), then fed into a two-layer Transformer + MLP to predict the interval center \(\hat{c}\) and width \(\hat{w}\) (see the sketch after this list). Design motivation: (a) it decouples precise temporal regression from the LLM's token prediction, letting the LLM focus on language generation where it excels; (b) it enables direct application of established detection losses such as L1 + gIoU. Parameterizing the interval as \((c, w)\) decouples position and scale, following common practice in object detection.
- Multiple Instance Learning (MIL) Framework: Enriched queries generated by the external captioning model may contain hallucinations and are not necessarily easier to ground than the original queries. To address this, a MIL framework is introduced: during training, each sample undergoes two forward passes, one with the original query and one with the enriched query (with teacher forcing), yielding two sets of predictions \(\hat{I}^{dir}\) and \(\hat{I}^{enr}\). The one with the smaller grounding loss is selected for backpropagation:
\[
\mathcal{L} =
\begin{cases}
\lambda_{LM}\,\mathcal{L}_{LM}^{dir} + \lambda_{grnd}\,\mathcal{L}_{grnd}^{dir} & \text{if } \mathcal{L}_{grnd}^{dir} < \mathcal{L}_{grnd}^{enr} \\
\lambda_{LM}\,\mathcal{L}_{LM}^{enr} + \lambda_{grnd}\,\mathcal{L}_{grnd}^{enr} & \text{otherwise}
\end{cases}
\]
This enables the model to autonomously determine at inference time whether query enrichment is needed, passing already-detailed queries through as-is.
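As referenced in the Detect bullet above, the sketch below shows one plausible PyTorch realization of the lightweight interval decoder: the <INT> hidden state is projected, concatenated with (projected) video tokens, passed through a two-layer Transformer, and mapped to \((\hat{c}, \hat{w})\) by an MLP head. Hidden sizes, the number of attention heads, the readout position, and the extra video-token projection are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class IntervalDecoder(nn.Module):
    """Sketch of a lightweight interval decoder (hyperparameters are illustrative)."""
    def __init__(self, llm_dim: int = 4096, vid_dim: int = 1408, d_model: int = 256):
        super().__init__()
        self.proj_int = nn.Linear(llm_dim, d_model)   # project the <INT> hidden state
        self.proj_vid = nn.Linear(vid_dim, d_model)   # assumed projection of video tokens to d_model
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Sequential(                    # MLP head -> (center, width) in [0, 1]
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 2), nn.Sigmoid()
        )

    def forward(self, h_int: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # h_int: (B, llm_dim); video_tokens: (B, T, vid_dim)
        q = self.proj_int(h_int).unsqueeze(1)         # (B, 1, d_model)
        v = self.proj_vid(video_tokens)               # (B, T, d_model)
        x = self.encoder(torch.cat([q, v], dim=1))    # concatenate <INT> token with video tokens
        return self.head(x[:, 0])                     # read out at the <INT> position -> (c_hat, w_hat)
```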
### Loss & Training
- Language modeling loss \(\mathcal{L}_{LM}\): Standard cross-entropy, supervising enriched query generation.
- Temporal grounding loss \(\mathcal{L}_{grnd}\): \(\lambda_{L1}\,(|\hat{c}-c| + |\hat{w}-w|) + \lambda_{gIoU}\,\mathcal{L}_{gIoU}\big((\hat{c},\hat{w}), (c,w)\big)\), where \(\mathcal{L}_{gIoU}\) is the generalized IoU loss over 1D intervals.
- Pre-training: 136K samples from 8 public datasets, 40 epochs, 16-node V100 cluster.
- Fine-tuning: Further training on each downstream dataset.
- Visual encoder: EVA-CLIP ViT-G/14 (frozen); LLM: Video-LLaMA-7B + LoRA (rank 32).
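For concreteness, the sketch below spells out the training objective described above: an L1 + gIoU grounding loss over \((c, w)\) intervals and the MIL-style selection between the direct and enriched branches. The exact 1D gIoU formulation and the default loss weights are assumptions based on standard detection practice, not taken from the paper.

```python
import torch

def giou_loss_1d(pred_cw: torch.Tensor, gt_cw: torch.Tensor) -> torch.Tensor:
    """Generalized IoU loss for 1D intervals given as (center, width), normalized to [0, 1]."""
    p0, p1 = pred_cw[:, 0] - pred_cw[:, 1] / 2, pred_cw[:, 0] + pred_cw[:, 1] / 2
    g0, g1 = gt_cw[:, 0] - gt_cw[:, 1] / 2, gt_cw[:, 0] + gt_cw[:, 1] / 2
    inter = (torch.min(p1, g1) - torch.max(p0, g0)).clamp(min=0)
    union = (p1 - p0) + (g1 - g0) - inter
    enclose = torch.max(p1, g1) - torch.min(p0, g0)
    giou = inter / union.clamp(min=1e-6) - (enclose - union) / enclose.clamp(min=1e-6)
    return (1 - giou).mean()

def grounding_loss(pred_cw, gt_cw, lam_l1: float = 1.0, lam_giou: float = 1.0):
    """L_grnd = lambda_L1 * (|c_hat - c| + |w_hat - w|) + lambda_gIoU * L_gIoU."""
    l1 = (pred_cw - gt_cw).abs().sum(dim=-1).mean()
    return lam_l1 * l1 + lam_giou * giou_loss_1d(pred_cw, gt_cw)

def mil_loss(lm_dir, grnd_dir, lm_enr, grnd_enr, lam_lm: float = 1.0, lam_grnd: float = 1.0):
    """Back-propagate only the branch (direct vs. enriched query) with the smaller grounding loss."""
    if grnd_dir < grnd_enr:
        return lam_lm * lm_dir + lam_grnd * grnd_dir
    return lam_lm * lm_enr + lam_grnd * grnd_enr
```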
## Key Experimental Results
### Main Results
Zero-Shot Single-Query Temporal Grounding (STG)
| Method | General Model | Training Samples | Charades R@0.5 | ANet R@0.5 | TACoS mIoU |
|---|---|---|---|---|---|
| HawkEye | ✓ | 715K | 31.4 | 29.3 | - |
| ChatVTG | ✓ | 100K | 33.0 | 22.5 | 5.5 |
| Momentor | ✓ | 10M | 26.6 | 23.0 | - |
| ED-VTG | ✓ | 136K | 39.3 | 33.1 | 12.7 |
| Δ vs HawkEye | | | +7.9 | +3.8 | - |
ED-VTG substantially outperforms all prior methods on all three zero-shot benchmarks, including Momentor, which was trained on 10M samples (+11.7 mIoU on Charades).
Fine-Tuned Video Paragraph Grounding (VPG) — Charades-CD-OOD
| Method | Type | R@0.3 | R@0.5 | mIoU |
|---|---|---|---|---|
| SiamGTR | Specialized | 59.1 | 35.5 | 38.9 |
| TimeChat (LLM) | General | 60.5 | 36.1 | 38.3 |
| ED-VTG | General | 70.7 | 47.3 | 45.0 |
| Δ vs TimeChat | | +10.2 | +11.2 | +6.7 |
ED-VTG is the first LLM-based method to report results on the VPG task, and it surpasses all specialized models.
### Ablation Study
Effect of Query Enrichment + MIL (Charades-STA, Zero-Shot)
| Training Paradigm | R@0.3 | R@0.5 | mIoU |
|---|---|---|---|
| Detect (direct grounding) | 48.1 | 30.6 | 31.0 |
| Enrich & Detect | 58.1 | 37.3 | 37.7 |
| Enrich & Detect + MIL | 59.5 | 39.3 | 40.2 |
Query enrichment alone yields a substantial gain of +6.7 mIoU, with MIL contributing an additional +2.5 mIoU.
Offline vs. Online Enrichment (FT w/o PT Setting)
| Strategy | Charades mIoU | ANet mIoU |
|---|---|---|
| Direct grounding | 33.2 | 34.0 |
| Offline enrichment then grounding | 33.4 | 33.7 |
| Online enrichment and grounding (Ours) | 38.4 | 37.8 |
Offline enrichment (pre-processing the training set but using original queries at inference) yields negligible gains, validating the core argument that enrichment at inference time is essential.
### Key Findings
- Achieves state-of-the-art zero-shot performance on NExT-GQA query-based grounding (QG) with IoU@0.3 of 39.5, demonstrating strong generalization.
- First LLM-based method to report results on the HT-Step article grounding (AG) task, surpassing all specialized models on the unseen split.
- Joint L1+gIoU training for the interval decoder is optimal; using only the LM loss (without the decoder) leads to substantial performance degradation.
- Pre-training uses only 136K samples, far fewer than HawkEye (715K) and Momenter (10M), yet achieves the best performance.
## Highlights & Insights
- Core idea of Enrich-and-Detect: Elegantly exploits the duality between captioning and grounding — rather than directly grounding an ambiguous query, the method first "translates" it into a detailed, groundable description.
- MIL framework for noisy pseudo-labels: Without relying on complex confidence estimation, the framework automatically selects the better query by simply comparing the losses of two forward passes — an elegant and efficient design.
- Design philosophy of the interval decoder: Let the LLM do what it excels at (language generation) and delegate precise regression to a dedicated module, with contextual information transferred via the hidden state of the <INT> token.
- The first LLM-based method to conduct a comprehensive evaluation across all four task types (STG, VPG, QG, and AG), with strong cross-task consistency.
## Limitations & Future Work
- Relies on an external 72B-scale captioning model to generate pseudo-labels, resulting in relatively high training data preparation costs.
- Video encoding remains constrained by a fixed frame count, potentially limiting fine-grained event localization in very long videos.
- MIL selects between only two options (original vs. enriched query); extending to multiple candidates may yield further improvements.
- Grounding small or occluded targets in long videos is the primary source of failure cases.
- More sophisticated enrichment strategies — such as decomposing abstract concepts into multiple concrete, groundable sub-queries — remain unexplored.
## Related Work & Insights
- TimeChat [2024] and VTimeLLM [2024] are the primary comparison methods, both using LLMs for temporal grounding without query enrichment.
- Moment-DETR [NeurIPS 2021] introduced the DETR paradigm for end-to-end temporal grounding; ED-VTG's interval decoder can be viewed as a lightweight variant thereof.
- LaViLa augments video-text alignment training data with a paraphraser, but only as an offline augmentation; ED-VTG demonstrates that online enrichment at inference time is more critical.
- The proposed framework is extensible to downstream tasks requiring precise temporal understanding, such as video summarization and video editing.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ (The Enrich-and-Detect paradigm is novel; the MIL selection mechanism is elegant)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (4 task types, 6+ datasets, both zero-shot and fine-tuned protocols, comprehensive ablations)
- Writing Quality: ⭐⭐⭐⭐ (Clear presentation, intuitive figures and tables)
- Value: ⭐⭐⭐⭐⭐ (First LLM-based method to comprehensively match specialized models in temporal grounding; high impact)