GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval

Conference: AAAI 2026 · arXiv: 2601.00584 · Code: N/A · Area: LLM Evaluation · Keywords: zero-shot video moment retrieval, semantic granularity alignment, query rewriting, LLM, vision-language models

TL;DR

This paper proposes GranAlign, a training-free granularity-aware alignment framework that addresses the core challenge of semantic granularity mismatch in zero-shot video moment retrieval (ZVMR). By rewriting queries into simplified and detailed variants and matching them against query-agnostic and query-aware video descriptions, respectively, GranAlign achieves a 3.23-point gain in mAP@avg on the QVHighlights test split.

Background & Motivation

Video Moment Retrieval (VMR) aims to localize temporal segments in untrimmed videos given natural language queries. Traditional supervised approaches rely on expensive annotated data, making zero-shot VMR (ZVMR)—which leverages pretrained VLMs and LLMs—an increasingly valuable paradigm.

However, ZVMR faces a fundamental challenge: Granularity Mismatch. Users may describe the same event at varying levels of abstraction, e.g., "a cute dog" vs. "a golden retriever puppy wandering around." Coarse-grained queries yield high recall but low precision (broad coverage but imprecise localization), while fine-grained queries yield high precision but low recall (semantically specific but fragile to minor discrepancies). This creates an inevitable trade-off.

Through quantitative analysis, the authors categorize queries into types (simple/detailed/erroneous/other) and demonstrate significant performance variance across categories in existing methods—providing empirical evidence of the practical impact of granularity mismatch. Prior methods either perform query rewriting at a single granularity level ("one-size-fits-all") or rely solely on query-agnostic video descriptions that lack semantic alignment with the query. This single-channel inference paradigm is identified as the core bottleneck limiting robust retrieval.

The paper's Core Idea is to abandon the single-channel design and establish dual coarse-to-fine channels on both the query side and the video side, aligning them by granularity level—simplified queries matched with generic descriptions (high recall), and detailed queries matched with query-aware descriptions (high precision)—combining the strengths of both.

Method

Overall Architecture

GranAlign is a fully training-free three-stage framework: (1) granularity-aware alignment—rewriting queries into simplified and detailed variants while generating both query-agnostic and query-aware frame-level descriptions; (2) moment proposal generation—generating candidate temporal segments based on frame-level semantic similarity scores; (3) post-processing—applying NMS to select final predictions.
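
To make the data flow concrete, here is a minimal end-to-end sketch of the three stages. All helper names (`rewrite_query`, `caption_frames`, `select_top_k_frames`, `score_frames`, `propose_moments`, `nms`) are hypothetical stand-ins for the components detailed below, not the authors' actual API:

```python
# Hypothetical sketch of the GranAlign pipeline; helper names are
# illustrative stand-ins for the components described in Key Designs.

def granalign(video_frames, query):
    # Stage 1: granularity-aware alignment.
    rewrite_pairs = rewrite_query(query)                # (Q_s, Q_d) pairs via LLM
    cap_agnostic = caption_frames(video_frames)         # query-agnostic, all frames
    key_idx = select_top_k_frames(video_frames, query)  # top-K% frames by similarity
    cap_aware = caption_frames(video_frames, query=query,
                               indices=key_idx)         # query-aware, key frames only

    # Stage 2: moment proposal generation from frame-level similarity scores.
    frame_scores = score_frames(rewrite_pairs, cap_agnostic, cap_aware)
    proposals = propose_moments(frame_scores)

    # Stage 3: post-processing.
    return nms(proposals)
```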

Key Designs

  1. Granularity-based Query Rewriting

     • Function: Rewrites the original query into two semantically complementary versions via LLaMA-3.

     • Mechanism: The simplified query \(Q_s\) replaces rare words with common ones, retains core entities and actions, and removes incidental details—providing high generalizability for retrieving broadly relevant candidates. The detailed query \(Q_d\) preserves fine-grained expressions, temporal context, and specific lexical choices—enabling more precise alignment and localization.
     • Design Motivation: Single-granularity rewriting cannot simultaneously achieve high recall and high precision. Multiple manually designed prompt pairs (yielding \(m\) rewriting pairs) are used to generate both variants; experiments show the framework is robust to the specific choice of prompt pairs.
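
A minimal sketch of the rewriting step, assuming a generic prompt-in, text-out callable `llm`; the prompt wording below is illustrative, not the paper's actual prompts:

```python
# Illustrative prompt pairs for granularity-based rewriting; the paper's
# actual prompts may differ. One (simplify, detail) instruction pair per
# rewrite; the paper uses m = 3 manually designed pairs.
PROMPT_PAIRS = [
    ("Rewrite the query using common words; keep core entities and actions, "
     "drop incidental details.",
     "Rewrite the query preserving fine-grained expressions, temporal "
     "context, and specific word choices."),
    # ... further manually designed variants in the same spirit
]

def rewrite_query(llm, query):
    """Return a list of (Q_s, Q_d) rewriting pairs, one per prompt pair."""
    return [(llm(f"{simp}\nQuery: {query}\nSimplified:"),
             llm(f"{det}\nQuery: {query}\nDetailed:"))
            for simp, det in PROMPT_PAIRS]
```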

  2. Query-Aware Captioning

     • Function: Generates two types of descriptions for video frames—generic and query-aware.

     • Mechanism: Query-agnostic descriptions \(C_{agn} \in \mathbb{R}^{L_v \times l}\) are first generated for all frames as a baseline. The top-K% frames (totaling \(L_k\) frames) with the highest query similarity are then selected, and query-aware descriptions \(C_{awr} \in \mathbb{R}^{L_k \times l}\) are generated for these frames only using Qwen2.5-VL, guided by entities and actions extracted from the query.
     • Design Motivation: Generating query-aware descriptions for all frames is computationally prohibitive. This hybrid strategy applies semantically precise descriptions to critical regions while maintaining global computational efficiency. Since query-aware descriptions may suffer from hallucinations or over-imitation of the query's linguistic structure, error-tolerant mechanisms are incorporated at the scoring stage.
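
The key-frame selection can be sketched directly from the mechanism above, assuming L2-normalized CLIP embeddings; the default `top_k_pct` value here is an assumption (the paper's K is a hyperparameter):

```python
import numpy as np

def select_top_k_frames(frame_embs: np.ndarray, query_emb: np.ndarray,
                        top_k_pct: float = 0.3) -> np.ndarray:
    """Select the top-K% frames by CLIP query similarity; only these L_k
    frames receive query-aware captions, the rest keep generic ones.
    frame_embs: (L_v, d), query_emb: (d,), both L2-normalized."""
    sims = frame_embs @ query_emb            # cosine similarity per frame
    k = max(1, int(len(sims) * top_k_pct))   # L_k frames in total
    return np.argsort(sims)[-k:]             # indices of the key frames
```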

  3. Granular Moment Scoring

     • Function: Fuses similarity scores from the two query–description pairs.

     • Mechanism: For each frame \(f\), a composite similarity score is computed as \(S_f = \frac{1}{2m}\sum_{i=1}^{m}[g(q_s^{(i)}, C_{agn,f}) + g(q_d^{(i)}, C_{awr,f})]\), where \(m\) is the number of rewriting pairs and \(g(\cdot, \cdot)\) denotes normalized cosine similarity.
     • Design Motivation: The simplified–agnostic pair \((Q_s, C_{agn})\) provides broad coverage and high recall, while the detailed–aware pair \((Q_d, C_{awr})\) delivers precise alignment but is susceptible to hallucinations. Fusing the complementary scores mitigates the biases and false positives that either pair introduces alone.
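
A direct transcription of \(S_f\), with `g` injected as a sentence-similarity function; how frames that lack a query-aware caption are scored is not pinned down here, so the fallback to the agnostic caption is an assumption:

```python
def frame_score(rewrite_pairs, cap_agn_f, cap_awr_f, g):
    """S_f = (1/2m) * sum_i [ g(q_s_i, C_agn_f) + g(q_d_i, C_awr_f) ].
    g(.,.) is normalized cosine similarity between text embeddings.
    Assumption: frames without a query-aware caption fall back to the
    query-agnostic one for the detailed channel."""
    m = len(rewrite_pairs)
    total = 0.0
    for q_s, q_d in rewrite_pairs:
        total += g(q_s, cap_agn_f)  # coarse channel: simplified-agnostic
        total += g(q_d, cap_awr_f if cap_awr_f is not None else cap_agn_f)
    return total / (2 * m)
```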

  4. Moment Proposal Generation and Post-processing

     • Function: Generates and filters candidate temporal segments from frame-level scores.

     • Mechanism: Adjacent high-scoring frames are merged into a single proposal if their gap does not exceed a threshold \(\tau\); proposals with average similarity in the bottom \(n\)% are discarded. Each candidate segment is scored as \(\text{Score}(p) = (1-\lambda)\mu_p + \lambda\rho_p\), where \(\mu_p\) is the average semantic similarity, \(\rho_p\) is a normalized length-regularization term, and \(\lambda = 0.3\). NMS is then applied to remove redundant proposals.
     • Design Motivation: Length regularization prevents both overly long low-quality proposals and overly short fragmented ones.
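
A minimal sketch of the merge-and-score step under the stated rules. The gap threshold \(\tau\) and \(\lambda = 0.3\) come from the text; `high_thresh`, the default `bottom_pct`, and the exact form of \(\rho_p\) are assumptions (NMS is applied afterward by the caller):

```python
def propose_moments(scores, high_thresh, tau, bottom_pct=0.1, lam=0.3):
    """Merge adjacent high-scoring frames (gap <= tau), drop proposals whose
    mean similarity is in the bottom n%, then score each survivor with
    Score(p) = (1 - lam) * mu_p + lam * rho_p."""
    high = [i for i, s in enumerate(scores) if s >= high_thresh]
    if not high:
        return []

    proposals, start, prev = [], high[0], high[0]
    for cur in high[1:]:
        if cur - prev > tau:            # gap exceeds tau: close the segment
            proposals.append((start, prev))
            start = cur
        prev = cur
    proposals.append((start, prev))

    # mu_p: mean frame similarity over each proposal.
    mu = {p: sum(scores[p[0]:p[1] + 1]) / (p[1] - p[0] + 1) for p in proposals}
    cut = sorted(mu.values())[int(len(proposals) * bottom_pct)]
    proposals = [p for p in proposals if mu[p] >= cut]

    max_len = max(e - s + 1 for s, e in proposals)
    # rho_p: length normalized by the longest surviving proposal (assumed form).
    return [(p, (1 - lam) * mu[p] + lam * (p[1] - p[0] + 1) / max_len)
            for p in proposals]
```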

Loss & Training

GranAlign is a fully zero-shot, training-free framework requiring no training data or fine-tuning. Query rewriting is performed with LLaMA-3-8B, caption generation with Qwen2.5-VL-7B, and initial frame filtering with CLIP ViT-B/32. Query-agnostic descriptions are generated offline; query-aware descriptions are generated online only for key frames, yielding significantly shorter inference time than Moment-GPT (6.2 s vs. 16.1 s per query).

Key Experimental Results

Main Results

| Dataset | Metric | GranAlign | Prev. SOTA (Moment-GPT) | Gain |
|---|---|---|---|---|
| QVHighlights val | R1@0.5 | 61.94 | 58.9 | +3.04 |
| QVHighlights val | mAP@avg | 39.12 | 35.9 | +3.22 |
| QVHighlights test | R1@0.5 | 59.92 | 58.3 | +1.62 |
| QVHighlights test | mAP@avg | 38.23 | 35.0 | +3.23 |
| Charades-STA | R1@0.5 | 39.6 | 38.4 | +1.2 |
| Charades-STA | mIoU | 38.0 | 36.5 | +1.5 |
| ActivityNet | R1@0.5 | 34.0 | 31.1 | +2.9 |
| ActivityNet | mIoU | 33.1 | 30.8 | +2.3 |

Ablation Study (QVHighlights val)

| Query → \(C_{agn}\) | Query → \(C_{awr}\) | R1@0.5 | mAP@avg |
|---|---|---|---|
| \(Q_r\) (original) | - | 57.94 | 31.80 |
| - | \(Q_r\) | 58.19 | 32.13 |
| \(Q_s\) (simplified) | - | 58.97 | 37.13 |
| - | \(Q_d\) (detailed) | 59.48 | 37.65 |
| \(Q_s\) | \(Q_d\) (full GranAlign) | 61.94 | 39.12 |

Key Findings

  • Granularity matching is critical: pairing the simplified query with query-aware descriptions (granularity mismatch) degrades performance, while matched-granularity pairings yield clear improvements.
  • Dual-channel fusion consistently outperforms any single-channel variant: the simplified pair contributes high recall, the detailed pair contributes high precision, and their combination benefits both.
  • On the Video Highlight Detection (VHD) task, GranAlign achieves 39.35% mAP, surpassing the fully supervised QD-DETR (39.04%), demonstrating the substantial potential of zero-shot approaches.
  • Inference is efficient: 6.2 s per query versus Moment-GPT's 16.1 s, owing to the two-stage captioning strategy.
  • The framework is robust to hyperparameter choices: performance is stable for values near the defaults (rewriting count \(m=3\), \(\lambda=0.3\)).

Highlights & Insights

  • The formulation of "granularity mismatch" is precise and well-motivated, supported by quantitative evidence through query-type categorization (Error/Simple/Detail/Else).
  • The dual-channel granularity alignment design is both intuitive and effective—one channel handles simple queries, the other handles detailed ones, and score averaging avoids complex fusion mechanisms.
  • The framework is fully zero-shot with no training cost, yet approaches or surpasses fully supervised methods on certain QVHighlights metrics, demonstrating strong practical value.
  • The two-stage captioning strategy (offline generic + online key-frame-aware) represents a sound engineering trade-off between efficiency and accuracy.

Limitations & Future Work

  • Query-aware descriptions may suffer from hallucinations—generating visual content absent from the video or over-imitating the linguistic structure of the query.
  • LLM-based query rewriting may alter the original intent, necessitating a semantic validation step.
  • The framework depends on multiple large models (LLaMA-3 + Qwen2.5-VL + CLIP + SentenceTransformer), resulting in non-trivial deployment costs.
  • For event-dense long videos, simplified queries may cover too much irrelevant content.
  • Moment-GPT, the direct predecessor, also uses LLaMA-3 for query rewriting but relies on Video-ChatGPT for scoring, remaining a single-channel, single-granularity design.
  • The granularity-aware paradigm proposed here is generalizable to other retrieval settings—such as handling varying formulations of the same question in text retrieval or abstract/concrete query pairs in image retrieval.
  • Combining the "generate multi-granularity representations then align" paradigm with RAG or multi-step reasoning may yield more powerful multimodal understanding systems.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐