GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval

Conference: AAAI 2026 · arXiv: 2601.00584 · Code: N/A · Area: LLM Evaluation · Keywords: zero-shot video moment retrieval, semantic granularity alignment, query rewriting, LLM, vision-language models

TL;DR

This paper proposes GranAlign, a training-free granularity-aware alignment framework that addresses the core challenge of semantic granularity mismatch in zero-shot video moment retrieval (ZVMR). By rewriting queries into simplified and detailed variants and matching them against query-agnostic and query-aware video descriptions, respectively, GranAlign achieves a 3.23-point gain in mAP@avg on the QVHighlights test split.

Background & Motivation

Video Moment Retrieval (VMR) aims to localize temporal segments in untrimmed videos given natural language queries. Traditional supervised approaches rely on expensive annotated data, making zero-shot VMR (ZVMR)—which leverages pretrained VLMs and LLMs—an increasingly valuable paradigm.

However, ZVMR faces a fundamental challenge: Granularity Mismatch. Users may describe the same event at varying levels of abstraction, e.g., "a cute dog" vs. "a golden retriever puppy wandering around." Coarse-grained queries yield high recall but low precision (broad coverage but imprecise localization), while fine-grained queries yield high precision but low recall (semantically specific but fragile to minor discrepancies). This creates an inevitable trade-off.

Through quantitative analysis, the authors categorize queries into types (simple/detailed/erroneous/other) and demonstrate significant performance variance across categories in existing methods—providing empirical evidence of the practical impact of granularity mismatch. Prior methods either perform query rewriting at a single granularity level ("one-size-fits-all") or rely solely on query-agnostic video descriptions that lack semantic alignment with the query. This single-channel inference paradigm is identified as the core bottleneck limiting robust retrieval.

The paper's Core Idea is to abandon the single-channel design and establish dual coarse-to-fine channels on both the query side and the video side, aligning them by granularity level—simplified queries matched with generic descriptions (high recall), and detailed queries matched with query-aware descriptions (high precision)—combining the strengths of both.

Method

Overall Architecture

GranAlign is a fully training-free three-stage framework: (1) granularity-aware alignment—rewriting queries into simplified and detailed variants while generating both query-agnostic and query-aware frame-level descriptions; (2) moment proposal generation—generating candidate temporal segments based on frame-level semantic similarity scores; (3) post-processing—applying NMS to select final predictions.
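
To make the data flow concrete, here is a minimal end-to-end sketch of the three stages. All helper names (`rewrite_query`, `caption_frames`, `select_top_k_frames`, `score_frames`, `propose_moments`, `nms`) are hypothetical stand-ins for the components detailed below, not the authors' actual API:

```python
# Hypothetical sketch of the GranAlign pipeline; helper names are
# illustrative stand-ins for the components described in Key Designs.

def granalign(video_frames, query):
    # Stage 1: granularity-aware alignment.
    rewrite_pairs = rewrite_query(query)                # (Q_s, Q_d) pairs via LLM
    cap_agnostic = caption_frames(video_frames)         # query-agnostic, all frames
    key_idx = select_top_k_frames(video_frames, query)  # top-K% frames by similarity
    cap_aware = caption_frames(video_frames, query=query,
                               indices=key_idx)         # query-aware, key frames only

    # Stage 2: moment proposal generation from frame-level similarity scores.
    frame_scores = score_frames(rewrite_pairs, cap_agnostic, cap_aware)
    proposals = propose_moments(frame_scores)

    # Stage 3: post-processing.
    return nms(proposals)
```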

Key Designs

  1. Granularity-based Query Rewriting

     • Function: Rewrites the original query into two semantically complementary versions via LLaMA-3.

     • Mechanism: The simplified query \(Q_s\) replaces rare words with common ones, retains core entities and actions, and removes incidental details—providing high generalizability for retrieving broadly relevant candidates. The detailed query \(Q_d\) preserves fine-grained expressions, temporal context, and specific lexical choices—enabling more precise alignment and localization.
     • Design Motivation: Single-granularity rewriting cannot simultaneously achieve high recall and high precision. Multiple manually designed prompt pairs (yielding \(m\) rewriting pairs) are used to generate both variants; experiments show the framework is robust to the specific choice of prompt pairs.
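
A minimal sketch of the rewriting step, assuming a generic prompt-in, text-out callable `llm`; the prompt wording below is illustrative, not the paper's actual prompts:

```python
# Illustrative prompt pairs for granularity-based rewriting; the paper's
# actual prompts may differ. One (simplify, detail) instruction pair per
# rewrite; the paper uses m = 3 manually designed pairs.
PROMPT_PAIRS = [
    ("Rewrite the query using common words; keep core entities and actions, "
     "drop incidental details.",
     "Rewrite the query preserving fine-grained expressions, temporal "
     "context, and specific word choices."),
    # ... further manually designed variants in the same spirit
]

def rewrite_query(llm, query):
    """Return a list of (Q_s, Q_d) rewriting pairs, one per prompt pair."""
    return [(llm(f"{simp}\nQuery: {query}\nSimplified:"),
             llm(f"{det}\nQuery: {query}\nDetailed:"))
            for simp, det in PROMPT_PAIRS]
```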

  2. Query-Aware Captioning

     • Function: Generates two types of descriptions for video frames—generic and query-aware.

     • Mechanism: Query-agnostic descriptions \(C_{agn} \in \mathbb{R}^{L_v \times l}\) are first generated for all frames as a baseline. The top-K% frames (totaling \(L_k\) frames) with the highest query similarity are then selected, and query-aware descriptions \(C_{awr} \in \mathbb{R}^{L_k \times l}\) are generated for these frames only using Qwen2.5-VL, guided by entities and actions extracted from the query.
     • Design Motivation: Generating query-aware descriptions for all frames is computationally prohibitive. This hybrid strategy applies semantically precise descriptions to critical regions while maintaining global computational efficiency. Since query-aware descriptions may suffer from hallucinations or over-imitation of the query's linguistic structure, error-tolerant mechanisms are incorporated at the scoring stage.
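
The key-frame selection can be sketched directly from the mechanism above, assuming L2-normalized CLIP embeddings; the default `top_k_pct` value here is an assumption (the paper's K is a hyperparameter):

```python
import numpy as np

def select_top_k_frames(frame_embs: np.ndarray, query_emb: np.ndarray,
                        top_k_pct: float = 0.3) -> np.ndarray:
    """Select the top-K% frames by CLIP query similarity; only these L_k
    frames receive query-aware captions, the rest keep generic ones.
    frame_embs: (L_v, d), query_emb: (d,), both L2-normalized."""
    sims = frame_embs @ query_emb            # cosine similarity per frame
    k = max(1, int(len(sims) * top_k_pct))   # L_k frames in total
    return np.argsort(sims)[-k:]             # indices of the key frames
```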

  3. Granular Moment Scoring

     • Function: Fuses similarity scores from the two query–description pairs.

     • Mechanism: For each frame \(f\), a composite similarity score is computed as \(S_f = \frac{1}{2m}\sum_{i=1}^{m}[g(q_s^{(i)}, C_{agn,f}) + g(q_d^{(i)}, C_{awr,f})]\), where \(m\) is the number of rewriting pairs and \(g(\cdot, \cdot)\) denotes normalized cosine similarity.
     • Design Motivation: The simplified–agnostic pair \((Q_s, C_{agn})\) provides broad coverage and high recall, while the detailed–aware pair \((Q_d, C_{awr})\) delivers precise alignment but is susceptible to hallucinations. Fusing the complementary scores mitigates the biases and false positives that either pair introduces alone.
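
A direct transcription of \(S_f\), with `g` injected as a sentence-similarity function; how frames that lack a query-aware caption are scored is not pinned down here, so the fallback to the agnostic caption is an assumption:

```python
def frame_score(rewrite_pairs, cap_agn_f, cap_awr_f, g):
    """S_f = (1/2m) * sum_i [ g(q_s_i, C_agn_f) + g(q_d_i, C_awr_f) ].
    g(.,.) is normalized cosine similarity between text embeddings.
    Assumption: frames without a query-aware caption fall back to the
    query-agnostic one for the detailed channel."""
    m = len(rewrite_pairs)
    total = 0.0
    for q_s, q_d in rewrite_pairs:
        total += g(q_s, cap_agn_f)  # coarse channel: simplified-agnostic
        total += g(q_d, cap_awr_f if cap_awr_f is not None else cap_agn_f)
    return total / (2 * m)
```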

  4. Moment Proposal Generation and Post-processing

     • Function: Generates and filters candidate temporal segments from frame-level scores.

     • Mechanism: Adjacent high-scoring frames are merged into a single proposal if their gap does not exceed a threshold \(\tau\); proposals with average similarity in the bottom \(n\)% are discarded. Each candidate segment is scored as \(\text{Score}(p) = (1-\lambda)\mu_p + \lambda\rho_p\), where \(\mu_p\) is the average semantic similarity, \(\rho_p\) is a normalized length-regularization term, and \(\lambda = 0.3\). NMS is then applied to remove redundant proposals.
     • Design Motivation: Length regularization prevents both overly long low-quality proposals and overly short fragmented ones.
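
A minimal sketch of the merge-and-score step under the stated rules. The gap threshold \(\tau\) and \(\lambda = 0.3\) come from the text; `high_thresh`, the default `bottom_pct`, and the exact form of \(\rho_p\) are assumptions (NMS is applied afterward by the caller):

```python
def propose_moments(scores, high_thresh, tau, bottom_pct=0.1, lam=0.3):
    """Merge adjacent high-scoring frames (gap <= tau), drop proposals whose
    mean similarity is in the bottom n%, then score each survivor with
    Score(p) = (1 - lam) * mu_p + lam * rho_p."""
    high = [i for i, s in enumerate(scores) if s >= high_thresh]
    if not high:
        return []

    proposals, start, prev = [], high[0], high[0]
    for cur in high[1:]:
        if cur - prev > tau:            # gap exceeds tau: close the segment
            proposals.append((start, prev))
            start = cur
        prev = cur
    proposals.append((start, prev))

    # mu_p: mean frame similarity over each proposal.
    mu = {p: sum(scores[p[0]:p[1] + 1]) / (p[1] - p[0] + 1) for p in proposals}
    cut = sorted(mu.values())[int(len(proposals) * bottom_pct)]
    proposals = [p for p in proposals if mu[p] >= cut]

    max_len = max(e - s + 1 for s, e in proposals)
    # rho_p: length normalized by the longest surviving proposal (assumed form).
    return [(p, (1 - lam) * mu[p] + lam * (p[1] - p[0] + 1) / max_len)
            for p in proposals]
```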

Loss & Training

GranAlign is a fully zero-shot, training-free framework requiring no training data or fine-tuning. Query rewriting is performed with LLaMA-3-8B, caption generation with Qwen2.5-VL-7B, and initial frame filtering with CLIP ViT-B/32. Query-agnostic descriptions are generated offline; query-aware descriptions are generated online only for key frames, yielding significantly shorter inference time than Moment-GPT (6.2 s vs. 16.1 s per query).

Key Experimental Results

Main Results

| Dataset | Metric | GranAlign | Prev. SOTA (Moment-GPT) | Gain |
|---|---|---|---|---|
| QVHighlights val | R1@0.5 | 61.94 | 58.9 | +3.04 |
| QVHighlights val | mAP@avg | 39.12 | 35.9 | +3.22 |
| QVHighlights test | R1@0.5 | 59.92 | 58.3 | +1.62 |
| QVHighlights test | mAP@avg | 38.23 | 35.0 | +3.23 |
| Charades-STA | R1@0.5 | 39.6 | 38.4 | +1.2 |
| Charades-STA | mIoU | 38.0 | 36.5 | +1.5 |
| ActivityNet | R1@0.5 | 34.0 | 31.1 | +2.9 |
| ActivityNet | mIoU | 33.1 | 30.8 | +2.3 |

Ablation Study (QVHighlights val)

| Query → \(C_{agn}\) | Query → \(C_{awr}\) | R1@0.5 | mAP@avg |
|---|---|---|---|
| \(Q_r\) (original) | - | 57.94 | 31.80 |
| - | \(Q_r\) | 58.19 | 32.13 |
| \(Q_s\) (simplified) | - | 58.97 | 37.13 |
| - | \(Q_d\) (detailed) | 59.48 | 37.65 |
| \(Q_s\) | \(Q_d\) (full GranAlign) | 61.94 | 39.12 |

Key Findings

  • Granularity matching is critical: pairing the simplified query with query-aware descriptions (granularity mismatch) degrades performance, while matched-granularity pairings yield clear improvements.
  • Dual-channel fusion consistently outperforms any single-channel variant: the simplified pair contributes high recall, the detailed pair contributes high precision, and their combination benefits both.
  • On the Video Highlight Detection (VHD) task, GranAlign achieves 39.35% mAP, surpassing the fully supervised QD-DETR (39.04%), demonstrating the substantial potential of zero-shot approaches.
  • Inference is efficient: 6.2 s per query versus Moment-GPT's 16.1 s, owing to the two-stage captioning strategy.
  • The framework is robust to hyperparameter choices: performance is stable for values near the defaults (rewriting count \(m=3\), \(\lambda=0.3\)).

Highlights & Insights

  • The formulation of "granularity mismatch" is precise and well-motivated, supported by quantitative evidence through query-type categorization (Error/Simple/Detail/Else).
  • The dual-channel granularity alignment design is both intuitive and effective—one channel handles simple queries, the other handles detailed ones, and score averaging avoids complex fusion mechanisms.
  • The framework is fully zero-shot with no training cost, yet approaches or surpasses fully supervised methods on certain QVHighlights metrics, demonstrating strong practical value.
  • The two-stage captioning strategy (offline generic + online key-frame-aware) represents a sound engineering trade-off between efficiency and accuracy.

Limitations & Future Work

  • Query-aware descriptions may suffer from hallucinations—generating visual content absent from the video or over-imitating the linguistic structure of the query.
  • LLM-based query rewriting may alter the original intent, necessitating a semantic validation step.
  • The framework depends on multiple large models (LLaMA-3 + Qwen2.5-VL + CLIP + SentenceTransformer), resulting in non-trivial deployment costs.
  • For event-dense long videos, simplified queries may cover too much irrelevant content.
  • Moment-GPT, the direct predecessor, also uses LLaMA-3 for query rewriting but relies on Video-ChatGPT for scoring, remaining a single-channel, single-granularity design.
  • The granularity-aware paradigm proposed here is generalizable to other retrieval settings—such as handling varying formulations of the same question in text retrieval or abstract/concrete query pairs in image retrieval.
  • Combining the "generate multi-granularity representations then align" paradigm with RAG or multi-step reasoning may yield more powerful multimodal understanding systems.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐