Moment Quantization for Video Temporal Grounding
- Conference: ICCV 2025
- arXiv: 2504.02286
- Code: None
- Area: Video Understanding
- Keywords: video temporal grounding, vector quantization, moment codebook, highlight detection, discrete representation learning
TL;DR
This paper proposes MQVTG, the first method to bring vector quantization into video temporal grounding (VTG): video clips are mapped to discrete vectors via a moment codebook with soft quantization, which sharpens foreground/background discriminability and yields state-of-the-art performance on six benchmarks.
Background & Motivation
Video temporal grounding (VTG) aims to localize relevant moments in a video given natural language descriptions. The core challenge lies in distinguishing relevant from irrelevant moments.
Limitations of existing methods:
- Weak discriminability in continuous feature space: taking TR-DETR as an example, foreground features and similar background features lie close together in the feature space, making them difficult to separate.
- Insufficient foreground feature aggregation: features from different foreground regions are scattered in the feature space without effective aggregation.
- Redundancy in video data: videos contain substantial redundant information, while the essence of VTG is to extract discriminative features that distinguish foreground from background.
Key insight: foreground moments can be described concisely in discrete natural language (e.g., "stirring curry with a spoon"). This raises the question of whether discrete vectors can describe continuous video moments to enhance discriminability. The clustering nature of vector quantization naturally aligns with the foreground/background separation requirement in VTG.
Method
Overall Architecture
MQVTG progresses from simple Clip Quantization to Moment Quantization, with three core improvements: (1) quantization is placed after temporal modeling to accommodate cross-clip characteristics; (2) soft quantization is employed to preserve visual diversity; (3) a moment codebook with prior initialization and joint projection is designed.
Key Designs
- Evolution from Clip Quantization to Moment Quantization:
- Clip Quantization, analogous to image quantization, directly quantizes individual clips and ignores two properties of video moments: cross-clip nature (an action spans multiple clips) and visual diversity (the same description has multiple visual manifestations).
- Moment Quantization performs quantization after the temporal encoder \(E_t\), so that the quantization operates on features \(z_t = E_t(z_s)\) that already encode temporal relationships.
- Soft Quantization: Rather than directly replacing continuous features with discrete codewords (hard quantization), the quantization process serves as a clustering regularization. A codebook loss \(\mathcal{L}_{cb} = \|C(z_t) - \text{sg}(E_t(z_s))\|_2^2\) and a commitment loss \(\mathcal{L}_{cmt} = \|\text{sg}(C(z_t)) - E_t(z_s)\|_2^2\) drive feature–codeword clustering, while the downstream grounding module continues to use the continuous features \(z_t\). This avoids the information loss caused by a limited-capacity codebook (see the code sketch after this list).
- Moment Codebook:
- Prior initialization: CLIP features are extracted for all training video clips, and k-means clustering is applied; the cluster centers are used to initialize the codebook, ensuring that codewords are effective from the start.
- Joint projection: A trainable projection layer \(C' = P(C)\) (linear layer) is introduced to replace direct optimization of codebook vectors, enabling the exploration of temporal semantic relationships among different codewords.
- Plug-and-play property: The quantization module can be integrated into both encoder-only and encoder-decoder (DETR) architectures. Training adds only the codebook parameters; at inference there is zero additional cost.
Loss & Training
Overall loss: \(\mathcal{L}_{overall} = \mathcal{L}_{mr} + \lambda_{hd}\mathcal{L}_{hd} + \lambda_{mq}\mathcal{L}_{mq} + \lambda_{align}\mathcal{L}_{align}\)
- \(\mathcal{L}_{mr}\): moment retrieval loss (L1 + Focal)
- \(\mathcal{L}_{hd}\): highlight detection loss (intra-video contrastive learning)
- \(\mathcal{L}_{mq} = \mathcal{L}_{cb} + \lambda_{cmt}\mathcal{L}_{cmt}\): quantization supervision
- \(\mathcal{L}_{align}\): InfoNCE video–text alignment loss
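To make the weighting concrete, here is a hedged sketch of how these terms could be combined in a training step; the weight values are placeholders, not the paper's tuned hyperparameters:

```python
# Hypothetical loss weights; the paper tunes these per benchmark.
lambda_hd, lambda_mq, lambda_align, lambda_cmt = 1.0, 0.3, 1.0, 0.25

def overall_loss(loss_mr, loss_hd, loss_cb, loss_cmt, loss_align):
    """L_overall = L_mr + lambda_hd*L_hd + lambda_mq*L_mq + lambda_align*L_align,
    with L_mq = L_cb + lambda_cmt*L_cmt as defined above."""
    loss_mq = loss_cb + lambda_cmt * loss_cmt
    return loss_mr + lambda_hd * loss_hd + lambda_mq * loss_mq + lambda_align * loss_align
```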
Key Experimental Results
Main Results
QVHighlights validation set (MR + HD):
| Method | R1@0.5 | R1@0.7 | mAP@0.5 | mAP Avg. | HD mAP | HD HIT@1 |
|---|---|---|---|---|---|---|
| TR-DETR | 67.10 | 51.48 | 66.27 | 45.09 | 40.55 | 64.77 |
| CG-DETR | 67.35 | 52.06 | 65.57 | 44.93 | 40.79 | 66.71 |
| R²-Tuning | 68.71 | 52.06 | - | 47.59 | 40.59 | 64.32 |
| MQVTG | 67.94 | 53.03 | 68.54 | 48.81 | 40.23 | 65.29 |
Moment retrieval on Charades-STA / TACoS / Ego4D-NLQ:
| Method | Charades R1@0.7 | TACoS R1@0.7 | Ego4D mIoU |
|---|---|---|---|
| R²-Tuning | 37.02 | 25.12 | 4.94 |
| MQVTG | 38.84 | 25.82 | 5.08 |
Ablation Study
Core component ablation (QVHighlights val):
| Configuration | R1@0.5 | R1@0.7 | mAP@0.5 | mAP Avg. |
|---|---|---|---|---|
| Baseline (no quantization) | 65.35 | 49.42 | 66.99 | 45.63 |
| + Quantization after temporal modeling (QATM) | 66.37 | 51.11 | 67.43 | 47.02 |
| + QATM + Soft Quantization (SQ) | 66.52 | 51.23 | 68.18 | 47.54 |
| + QATM + SQ + Moment Codebook (MC) | 67.94 | 53.03 | 68.54 | 48.81 |
Comparison of quantization strategies:
| Quantization | R1@0.7 | mAP Avg. | Note |
|---|---|---|---|
| Image Quantization | 51.03 | 46.55 | Image-level quantization |
| Clip Quantization | 51.61 | 46.93 | Clip-level quantization |
| Moment Quantization | 53.03 | 48.81 | Moment-level quantization |
| Hard Quantization | 50.90 | 47.46 | Direct replacement with discrete vectors |
Key Findings
- Plug-and-play validation: consistent improvements are observed when integrated into DETR-based models including QD-DETR, TR-DETR, and TaskWeave.
- Highlight detection surpasses the prior state of the art by 2.1% on YouTube HL and 1.4% on TVSum.
- Low codebook utilization (< 10%) is the current bottleneck, limiting performance on fine-grained scenes.
Highlights & Insights
- This work is the first to transfer vector quantization from image/audio domains to video temporal grounding, filling a gap in this research direction.
- The soft quantization strategy is elegant: the quantization process serves as a regularizer that drives feature clustering without directly using discrete codewords, balancing discriminability and information completeness.
- The k-means prior initialization is simple yet effective, addressing the cold-start problem in codebook training.
Limitations & Future Work
- Codebook utilization is low (< 10%), with a large proportion of codewords remaining inactive, limiting performance on fine-grained scenes.
- Improvements on highlight detection are less pronounced than on moment retrieval, reflecting a conflict between global and local information requirements.
- Future work may explore dynamic codebook sizes or hierarchical codebook structures.
Related Work & Insights
- The idea of using quantization as a regularization tool (rather than for compression or reconstruction purposes) is generalizable to other video tasks requiring foreground/background separation.
- The joint projection strategy of the moment codebook offers insights into establishing associations among codebook vectors.
- The plug-and-play nature allows this method to be combined with future stronger baseline models.
Rating
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐⭐ |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ |
| Value | ⭐⭐⭐⭐ |
| Writing Quality | ⭐⭐⭐⭐ |
| Overall | ⭐⭐⭐⭐ |