AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding¶
- Conference: NeurIPS 2025
- arXiv: 2506.13589
- Code: https://github.com/xzc-zju/AdaVideoRAG
- Area: Multimodal VLM / Video Understanding
- Keywords: long video understanding, retrieval-augmented generation, adaptive retrieval, knowledge graph, intent classification
TL;DR¶
AdaVideoRAG routes each query to one of three retrieval pathways (no retrieval / naive retrieval / graph retrieval) via a lightweight intent classifier and pairs this with an omni-knowledge indexing module (caption + ASR + OCR + visual + knowledge graph), achieving a favorable efficiency–accuracy trade-off in long video understanding and yielding a 39.8% relative improvement for Qwen2.5-VL-7B on MLVU.
Background & Motivation¶
MLLMs face three core challenges in long video understanding: (1) fixed context windows cause information loss on long videos; (2) static parametric knowledge cannot be updated dynamically; and (3) multi-hop reasoning capacity is insufficient. Existing VideoRAG approaches are limited by their fixed retrieval paradigms:
- Naive retrieval (VideoRAG [Luo]): vector retrieval over caption + ASR + OCR text; cannot handle multi-hop questions that require global understanding.
- Graph retrieval (VideoRAG [Ren]): constructs hierarchical knowledge graphs with high accuracy but large computational overhead (complex graph traversal), introducing unnecessary latency for simple queries.
Key insight: queries of different difficulty levels should be handled by retrieval strategies of corresponding complexity.
Core Problem¶
How to adaptively assign appropriate retrieval strategies to video understanding queries of varying complexity, saving computation on simple questions while ensuring deep reasoning on difficult ones?
Method¶
Overall Architecture¶
A four-stage pipeline: (1) query intent classification → (2) omni-knowledge index construction → (3) adaptive retrieval → (4) multimodal information integration and generation. The system is integrated with existing MLLMs as a plug-and-play API.
Key Designs¶
- Query Intent Classifier: A lightweight LLM (Qwen2.5-7B with CoT prompting) classifies queries into three levels:
- L1 (direct factual): e.g., "What object appears at second 5?" → sent directly to MLLM, no retrieval needed.
- L2 (simple reasoning): e.g., "Why did the woman cry before it rained?" → naive vector retrieval (caption/ASR/OCR + visual retrieval).
- L3 (complex reasoning): e.g., "What life lesson does this film convey?" → graph retrieval + multi-hop reasoning.
The classifier accounts for ≤5% of total inference time.
- Omni-Knowledge Indexing Module: Extracts multimodal information from the video to build the following knowledge stores (a construction sketch appears after this list):
- Caption store: 5 frames sampled every 30s, fine-grained descriptions generated by MiniCPM-V.
- ASR store: speech-to-text extracted via FastWhisper.
- OCR store: scene text extracted via EasyOCR.
- Visual store: frame-level visual features extracted by ImageBind, mapped to a unified semantic space.
- Knowledge graph: entities and relations (spatio-temporal / causal / functional) extracted from text chunks using BGE-M3.
- Adaptive Retrieval Paradigm:
- L1: direct MLLM inference.
- L2: query rewriting (separately for caption/ASR/OCR) → vector retrieval + visual grounding → filtering and ranking.
- L3: builds upon L2 with additional LightRAG-based graph retrieval, extracting entity relations and associated information to construct a query-centric reasoning graph.
- Evidence Filtering and Ranking: Deduplication → fine-grained filtering of irrelevant results with a small model (Qwen2.5-7B) → re-ordering by video temporal order to preserve causal relationships (see the filtering sketch after this list).
Loss & Training¶
No training is involved; the entire framework operates via inference-time API calls. The intent classifier is implemented through prompt engineering.
Key Experimental Results¶
| Model | MLVU Avg | Gain (rel.) | Video-MME Overall | Gain (pts) |
|---|---|---|---|---|
| Qwen2.5-VL-7B | 29.0 | - | 47.2 | - |
| + VideoRAG | - | - | 55.0 | +7.9 |
| + AdaVideoRAG | 40.5 | +39.8% | 59.9 | +12.7 |
| VideoLLaMA3-7B | 47.7 | - | 64.2 | - |
| + VideoRAG | - | - | 67.3 | +3.1 |
| + AdaVideoRAG | 53.2 | +11.6% | 68.5 | +4.3 |
| GPT-4o | 54.9 | - | 71.9 | - |
VideoLLaMA3 + AdaVideoRAG (7B) achieves performance comparable to GPT-4o (53.2 vs. 54.9 on MLVU).
On the HiVU benchmark, the Overall Winner rate on L3 (hard reasoning) reaches 77.13% vs. 22.87% for the baseline, a substantial margin.
Ablation Study¶
- Classifier choice: Qwen2.5-7B reaches a classification accuracy of 0.81, substantially outperforming the 1.5B variant (0.41), and yields the highest downstream overall score (68.5).
- Without classifier: routing all queries to L1 yields 64.2; all to L2 yields 67.5; all to L3 yields 67.1; adaptive routing yields 68.5—validating the value of need-based routing.
- Without graph retrieval: Overall Winner drops to 54.18% (vs. 69.42% for the full model), confirming that graph retrieval is critical for complex queries.
- Without text retrieval: the largest single-component impact (68.75% → 31.25%), indicating that auxiliary text is the most essential knowledge source.
- Sampling frequency: 5 frames/30s vs. 30 frames/30s differs by only ~1 point, suggesting 5 frames is sufficient.
Highlights & Insights¶
- Adaptive routing is a highly practical design—simple queries avoid unnecessary computation, while complex queries retain full reasoning depth.
- The plug-and-play architecture requires no modification to the MLLM itself and can enhance any video MLLM via API calls.
- The HiVU benchmark is proposed: the first hierarchically difficulty-stratified long video understanding evaluation set (L1/L2/L3), comprising 120 videos totaling 60 hours.
- A 7B model augmented with AdaVideoRAG can surpass 72B models and achieve performance on par with GPT-4o.
Limitations & Future Work¶
- Only three routing levels are evaluated; finer-grained difficulty granularity may be needed in practice.
- Knowledge base construction is time-consuming (approximately 412s for L3); parallelization can mitigate this but it remains a deployment bottleneck.
- The intent classifier's 0.81 accuracy introduces misclassification risk; for example, misrouting an L2 query to L1 skips retrieval entirely, leaving the MLLM without the context it needs.
- The HiVU benchmark is relatively small (120 videos), which may limit evaluation comprehensiveness.
Related Work & Insights¶
- vs. VideoRAG [Luo]: uses only naive retrieval without multi-hop reasoning support; AdaVideoRAG shows a clear advantage on long videos (+4.8 on Video-MME).
- vs. VideoRAG [Ren]: applies graph retrieval uniformly to all queries, resulting in low efficiency; AdaVideoRAG outperforms VideoRAG on HiVU L3 (57.77 vs. 42.23 Overall Winner) while being more efficient on simple queries.
- vs. Adaptive-RAG (text-only): AdaVideoRAG extends the adaptive retrieval idea from text to the video multimodal setting, adding visual grounding and knowledge-graph components.
The adaptive routing paradigm is transferable to image understanding: simple queries are answered directly by a VLM, while complex queries trigger RAG. The multimodal information extraction pipeline of the omni-knowledge indexing module (caption + ASR + OCR + vision + graph) can serve as a general-purpose video knowledge base construction paradigm. AdaVideoRAG forms an interesting complementary relationship with Balanced Token Pruning: BTP compresses input tokens, while AdaVideoRAG expands external knowledge.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The integration of adaptive routing and omni-knowledge indexing constitutes a systematic and well-motivated contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers multiple benchmarks, multiple MLLMs, and comprehensive ablation analysis.
- Writing Quality: ⭐⭐⭐⭐ — System architecture is clearly described with well-motivated design choices.
- Value: ⭐⭐⭐⭐ — Strong practical utility; the plug-and-play design is valuable for industrial deployment.