APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval¶
Conference: AAAI 2026 arXiv: 2506.04953v3 Code: No public link Area: Video Understanding / Multimodal LLM Keywords: long video understanding, training-free, keyframe retrieval, token compression, dual-granularity retrieval
TL;DR¶
This paper proposes APVR, a training-free dual-granularity visual information retrieval framework. At the frame level, it iteratively retrieves keyframes (up to 1024) via query expansion and spatiotemporal semantic confidence scoring; at the token level, it compresses visual tokens through query-aware attention-driven selection. APVR overcomes memory limitations to process hour-long videos, achieving improvements of up to 9.5%, 4.6%, and 9.7% on LongVideoBench, VideoMME, and MLVU, respectively.
Background & Motivation¶
Existing video MLLMs face three major challenges in processing long videos: (1) Uniform sampling: dilutes critical information, with a large proportion of irrelevant frames; (2) Sparse keyframe retrieval: loses temporal-semantic relationships, making temporal reasoning tasks intractable; (3) Dense frame processing: hits the memory wall. Training-based approaches (sequence parallelism, feature compression) require costly multi-stage retraining and are tightly coupled to specific architectures.
Key Challenge: The fundamental trade-off between temporal coverage and computational feasibility.
Core Problem¶
How can a model efficiently retrieve query-relevant frames and tokens from hour-long videos without retraining, while overcoming MLLM memory constraints and preserving semantic integrity?
Method¶
Overall Architecture¶
APVR = Pivot Frame Retrieval (PFR) + Pivot Token Retrieval (PTR), integrated as a plug-in module into arbitrary MLLMs.

- PFR: query expansion → dual-model scoring with CLIP + Grounding-DINO → temporal diffusion → adaptive resampling → selection of \(K\) frames
- PTR: query-aware attention scoring → dynamic chunk selection + head-wise soft voting → compressed visual tokens
Key Designs¶
- Semantic Information Expansion: An LLM expands the original query into four categories: Objects (detectable entities), Descriptions (entity descriptions/hypernyms), Relations (spatiotemporal/causal relation triples among objects), and Semantics (knowledge-graph semantics). This substantially improves recall in frame retrieval.
- Spatiotemporal Semantic Confidence Scoring: Two complementary models are employed: CLIP computes semantic similarity (cosine similarity between text and image embeddings), while Grounding-DINO detects specific objects and models spatial relationships (intra-frame co-occurrence, temporal appearance, etc.). The final score is \(\mathcal{S}_t = (1-\lambda) \cdot s_t^{CLIP} + \lambda \cdot s_t^{GD}\). Temporal diffusion propagates high scores to neighboring frames.
- Iterative Adaptive Resampling: Rather than a single scoring pass, multiple rounds of iteration are performed (default: 3), with the sampling stride reduced at each round. The candidate set comprises a high-confidence subset and a high-uncertainty subset (regions with high Shannon entropy), simultaneously exploiting existing knowledge and exploring unvisited regions.
- Query-Aware Token Selection: Within the MLLM, cross-modal attention scores from the text query to visual tokens are used to dynamically select important tokens at varying chunk granularities. Head-wise soft voting mitigates discrepancies across attention heads.
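The confidence-scoring design above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes per-frame CLIP and Grounding-DINO scores are already computed, and the Gaussian-decay diffusion kernel and `diffuse_radius` parameter are assumptions made here for concreteness.

```python
import numpy as np

def confidence_scores(clip_scores, gd_scores, lam=0.5, diffuse_radius=2):
    """Combine per-frame CLIP and Grounding-DINO scores, then propagate
    high confidence to temporal neighbors (illustrative sketch; the
    diffusion kernel is an assumption, not the paper's exact choice)."""
    clip = np.asarray(clip_scores, dtype=float)
    gd = np.asarray(gd_scores, dtype=float)
    # S_t = (1 - lambda) * s_t^CLIP + lambda * s_t^GD
    s = (1.0 - lam) * clip + lam * gd
    # Temporal diffusion: each frame also receives a decayed contribution
    # from frames within `diffuse_radius` steps, taking the maximum.
    diffused = s.copy()
    for t in range(len(s)):
        for d in range(1, diffuse_radius + 1):
            w = np.exp(-d * d / 2.0)  # assumed Gaussian decay
            if t - d >= 0:
                diffused[t] = max(diffused[t], w * s[t - d])
            if t + d < len(s):
                diffused[t] = max(diffused[t], w * s[t + d])
    return diffused
```

With `lam=0.5` the two scorers are weighted equally, matching the ablation's best setting; `diffuse_radius=0` disables diffusion entirely.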
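The iterative adaptive resampling loop can likewise be sketched. The exploit/explore split, the window around high-confidence frames, the stride-halving schedule, and the use of unvisited regions as the high-uncertainty set are all simplifying assumptions here; the paper's exact entropy-based criterion is not reproduced.

```python
import numpy as np

def iterative_resample(score_fn, num_frames, k=1024, rounds=3, seed=0):
    """Sketch of PFR-style iterative resampling. `score_fn(indices)`
    returns a confidence in [0, 1] for each sampled frame index."""
    rng = np.random.default_rng(seed)
    scores = np.full(num_frames, np.nan)  # NaN = not yet visited
    stride = max(num_frames // k, 1)
    for _ in range(rounds):
        visited = ~np.isnan(scores)
        if visited.any():
            # Exploit: densify sampling around high-confidence frames.
            top = np.argsort(np.nan_to_num(scores))[-k // 2:]
            cand = np.unique(np.clip(
                top[:, None] + np.arange(-stride, stride + 1),
                0, num_frames - 1))
        else:
            # First round: coarse uniform pass.
            cand = np.arange(0, num_frames, stride)
        # Explore: frames from unvisited (high-uncertainty) regions.
        unvisited = np.flatnonzero(~visited)
        if len(unvisited):
            explore = rng.choice(unvisited,
                                 size=min(k // 4, len(unvisited)),
                                 replace=False)
            cand = np.union1d(cand, explore)
        new = cand[np.isnan(scores[cand])]
        if len(new):
            scores[new] = score_fn(new)
        stride = max(stride // 2, 1)  # finer granularity each round
    ranked = np.argsort(np.nan_to_num(scores, nan=-1.0))[::-1]
    return np.sort(ranked[:k])
```

Each round both refines around what is already known and probes unscored regions, which is the exploit/explore balance the method description attributes to the high-confidence and high-entropy candidate subsets.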
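Finally, query-aware token selection with head-wise soft voting can be illustrated as below. The softmax normalization per head (as the "soft vote"), the temperature `tau`, and mean-pooled chunk scoring are assumptions chosen for the sketch, not the paper's exact formulation.

```python
import numpy as np

def select_visual_tokens(attn, keep_ratio=0.25, chunk=8, tau=1.0):
    """Sketch of PTR-style token selection.

    attn: (heads, q_len, v_len) cross-attention weights from text query
          tokens to visual tokens. Returns indices of kept visual tokens.
    """
    heads, q_len, v_len = attn.shape
    # Per-head importance of each visual token: mean over query tokens.
    per_head = attn.mean(axis=1)                       # (heads, v_len)
    # Head-wise soft voting: softmax-normalize each head's scores so no
    # single head dominates, then average the votes across heads.
    z = per_head / tau
    z = z - z.max(axis=1, keepdims=True)
    vote = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    token_score = vote.mean(axis=0)                    # (v_len,)
    # Chunk-level selection: keep the highest-scoring contiguous chunks
    # whole, preserving local spatial structure.
    n_chunks = v_len // chunk
    chunk_score = token_score[:n_chunks * chunk].reshape(n_chunks, chunk).mean(axis=1)
    n_keep = max(1, int(n_chunks * keep_ratio))
    keep_chunks = np.sort(np.argsort(chunk_score)[::-1][:n_keep])
    return (keep_chunks[:, None] * chunk + np.arange(chunk)).ravel()
```

Selecting whole chunks rather than isolated tokens reflects the "dynamic chunk selection" idea: tokens relevant to a query tend to cluster spatially within a frame.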
Loss & Training¶
Entirely training-free: CLIP and Grounding-DINO use off-the-shelf pretrained weights, and the LLM is invoked only for query expansion. No model parameters are updated anywhere in the pipeline.
Key Experimental Results¶
| Base Model | Method | LVB val | VideoMME Long | VideoMME Overall | MLVU dev |
|---|---|---|---|---|---|
| Qwen2-VL-7B | Vanilla | 55.6 | 53.8 | 63.3 | 66.9 |
| Qwen2-VL-7B | +APVR | 60.9 (+9.5%) | 55.1 (+2.4%) | 65.2 (+3.0%) | 73.4 (+9.7%) |
| Qwen2.5-VL-7B | Vanilla | 59.5 | 55.6 | 65.4 | 70.2 |
| Qwen2.5-VL-7B | +APVR | 64.9 (+9.1%) | 59.1 (+6.3%) | 68.4 (+4.6%) | 76.1 (+8.4%) |
| VideoLLaMA3-7B | Vanilla | 59.8 | 54.9 | 66.2 | 73.0 |
| VideoLLaMA3-7B | +APVR | 63.5 (+6.2%) | 58.7 (+6.9%) | 68.1 (+2.9%) | 77.2 (+5.5%) |

Parenthesized values are relative gains over the corresponding vanilla baseline.
APVR + Qwen2.5-VL-7B surpasses GPT-4V (59.1) and Gemini-1.5-Pro (64.0) on LVB.
Ablation Study¶
- Both PFR and PTR are indispensable: Removing PFR reverts to uniform sampling; removing PTR limits processing to 256 frames (vs. 1024).
- Contribution of query expansion: Removing expanded semantic information reduces LVB by approximately 1–2%.
- Effect of frame count \(K\): Performance increases monotonically as \(K\) grows from 32 to 1024; the ability to process 1024 frames is central to APVR's gains.
- Optimal iteration count \(P=3\): \(P=2\) is insufficiently fine-grained, while \(P=5\) overfits the sampling to early high-confidence regions without further gains.
- Optimal \(\lambda=0.5\): Balanced weighting of CLIP and Grounding-DINO scores yields the best performance.
Highlights & Insights¶
- Training-free and plug-and-play: No modification to MLLM parameters; compatible with arbitrary backbone models, making it highly practical for the rapidly evolving MLLM ecosystem.
- Complementarity of dual-granularity retrieval: Frame-level retrieval addresses which frames to observe, while token-level retrieval addresses what to attend to within each frame; together they overcome the memory wall.
- Iterative resampling with uncertainty exploration: Keyframes are progressively refined rather than selected in a single pass; high-uncertainty regions are also explored to avoid local optima.
- Query expansion substantially enhances retrieval: Expanding a simple question into objects, relations, and semantics improves the precision of CLIP/Grounding-DINO matching.
- 7B model surpasses GPT-4V: The training-free approach with a 7B model achieves 64.9% on LVB, exceeding GPT-4V at 59.1%.
Limitations & Future Work¶
- Dependency on external models: The pipeline requires three additional models — CLIP, Grounding-DINO, and an LLM — for query expansion and frame scoring, increasing system complexity.
- Latency: Processing an hour-long video takes approximately two minutes per query, which is too slow for real-time applications.
- Restricted to multiple-choice QA: All benchmarks adopt multiple-choice formats; performance on open-ended question answering remains unexplored.
- Query expansion quality depends on the LLM: Erroneous objects or relations generated during expansion can mislead frame retrieval.
- Grounding-DINO detection limitations: Retrieval of abstract concepts or actions (rather than physical objects) may be suboptimal.
Related Work & Insights¶
| Method | Type | Mechanism | Key Difference from APVR |
|---|---|---|---|
| AKS | Training-free | Keyframe selection + information pre-filtering | Frame-level only; no token-level compression |
| QuoTA | Training-free | Query-aware token allocation | Token-level only; no frame-level retrieval |
| LongVILA | Training-based | 5-stage training + sequence parallelism | Requires substantial training resources; architecture-specific |
| Video-XL-Pro | Training-based | Learned compression module | Requires retraining |
The core advantage of APVR lies in its dual-granularity design — joint optimization at both the frame and token levels — while remaining entirely training-free.
Broader Insights:

- The iterative resampling and uncertainty-guided exploration paradigm is transferable to other information retrieval tasks in agentic settings (e.g., RAG, document understanding).
- The query expansion strategy (objects + relations + semantics) offers general utility for video retrieval and video grounding tasks.
- Training-free methods serve as a compelling alternative to parameter scaling in an era of rapid model iteration.
Rating¶
- Novelty: ⭐⭐⭐⭐ The dual-granularity retrieval framework and iterative resampling strategy are noteworthy, though individual components rely on existing tools.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three benchmarks, three baseline MLLMs, comprehensive ablation analysis, and qualitative comparisons.
- Writing Quality: ⭐⭐⭐⭐ Architecture diagrams and algorithmic descriptions are clear; motivation is well articulated.
- Value: ⭐⭐⭐⭐⭐ Training-free, plug-and-play, and surpassing GPT-4V — the practical value is exceptionally high.