LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

Conference: CVPR2026
arXiv: 2602.20913
Code: qiujihao19/LongVideo-R1
Area: Video Understanding
Keywords: Long Video Understanding, Smart Navigation, Multimodal Agent, Hierarchical Reasoning, Reinforcement Learning, Chain-of-Thought

TL;DR

This paper proposes LongVideo-R1, a reasoning-capable multimodal agent that organizes a video into a hierarchical tree structure and navigates it intelligently for efficient long-video question answering. With an average of only 10.5 tool calls per question, it significantly outperforms exhaustive methods on the accuracy–efficiency trade-off.

Background & Motivation

  1. Computational bottleneck in long video understanding: Current MLLMs are constrained by limited context windows and cannot directly process videos of 1–2 hours, relying instead on brute-force pipelines (clip → segment-wise processing → aggregation) whose computational cost scales linearly with video duration.
  2. Inefficiency of existing methods: Methods such as Ego-R1 and VideoTree achieve reasonable accuracy but require exhaustive processing of all or most video segments (e.g., Ego-R1 generates one caption every 30 seconds, averaging 86 caption segments per video), incurring substantial latency.
  3. Constraints on real-world deployment: High computational costs severely limit the applicability of long-video MLLMs in embodied agents (which require low-latency responses) and high-throughput video chat services.
  4. Neglect of the accuracy–efficiency trade-off: Existing work optimizes almost exclusively for QA accuracy, lacking formal metrics and optimization objectives for computational budget.
  5. Inspiration from human search strategies: Humans do not watch long videos frame by frame; instead, they first obtain a global overview and then selectively "drill into" segments of interest based on the question—an active, goal-directed strategy far more efficient than exhaustive scanning.
  6. Maturity of large reasoning models: Large reasoning models (LRMs) such as Qwen3-8B and the chain-of-thought (CoT) paradigm provide the technical foundation for training agents capable of autonomously deciding when to stop and where to look next.

Method

Overall Architecture

LongVideo-R1 organizes a long video into a multi-level tree structure of depth \(D=3\), where each non-leaf node has \(K = \operatorname{round}\big(\sqrt[D]{T/16}\big)\) child nodes (\(T\) being the video duration in seconds) and each leaf node corresponds to a short clip of approximately 16 seconds. The agent is driven by a fine-tuned LRM (Qwen3-8B) and performs Chain-of-Thought-with-Tool (CoTwT) reasoning using two multimodal tools.
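
As a sketch of this sizing rule (my own illustration, not the authors' code): with depth \(D=3\) and ~16-second leaves, the branching factor follows directly from the video duration.

```python
def branching_factor(duration_s: float, depth: int = 3, leaf_s: float = 16.0) -> int:
    """Children per non-leaf node, so that K**depth leaves of ~leaf_s seconds cover the video."""
    n_leaves = duration_s / leaf_s
    return round(n_leaves ** (1.0 / depth))

# A 2-hour video: 7200 s / 16 s = 450 leaves, K = round(450 ** (1/3)) = 8
print(branching_factor(7200))  # -> 8
```

With K = 8 and depth 3, the tree offers 512 leaf slots for the 450 actual clips, so every leaf stays close to the 16-second target.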

Key Designs

Two multimodal tools:

  • video_cap(): Accepts a video segment at any level of the hierarchy and outputs a textual description (generated by Qwen2.5-VL-72B), used to acquire global or local context.
  • video_qa(): Called only at leaf nodes (executed by Qwen2.5-VL-32B) to generate the final answer to the specific question.
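
The two tool interfaces can be sketched as follows (the tool names come from the paper; the argument shapes and the `Node` type are my assumptions, with the VLM backends stood in by plain callables):

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    """One node of the video tree: a (start, end) time span; leaves have no children."""
    start: float
    end: float
    children: List["Node"] = field(default_factory=list)

def video_cap(node: Node, captioner: Callable[[float, float], str]) -> str:
    """Caption a segment at any tree level; `captioner` stands in for Qwen2.5-VL-72B."""
    return captioner(node.start, node.end)

def video_qa(node: Node, question: str, answerer: Callable[[float, float, str], str]) -> str:
    """Answer the question over a leaf clip; `answerer` stands in for Qwen2.5-VL-32B."""
    assert not node.children, "video_qa() is restricted to leaf nodes"
    return answerer(node.start, node.end, question)
```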

Reasoning procedure:

  1. Obtain a top-level caption from the root node (the entire video).
  2. The LRM reasons over the accumulated context and determines whether the information is sufficient to answer the question.
  3. If not, the LRM decides the next navigation action: drilling down into child segments, traversing sibling nodes laterally, or backtracking to a higher level for re-orientation.
  4. Call video_cap() to obtain a description of the target segment and update the dialogue history.
  5. Repeat steps 2–4 until the LRM judges the information sufficient, then call video_qa() to generate the answer, or until the maximum number of rounds is reached.

The entire reasoning process involves only text (multimodal tools are invoked as external function calls), allowing the LRM to focus exclusively on planning and reasoning.
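
Steps 1–5 can be condensed into a small control loop (an illustrative sketch; `llm_decide` stands in for the fine-tuned LRM, and the two tools are passed in as plain callables):

```python
def answer_question(root, question, llm_decide, video_cap, video_qa, max_rounds=30):
    """Hierarchical navigation loop. `llm_decide` returns an (action, node) pair:
    action is 'drill', 'sibling', or 'back' (all of which move to `node`),
    or 'answer' (which triggers video_qa on the chosen leaf)."""
    history = [(root, video_cap(root))]                        # step 1: root-level caption
    node = root
    for _ in range(max_rounds):
        action, target = llm_decide(question, history, node)   # steps 2-3: plan next move
        if action == "answer":
            return video_qa(target, question)                  # step 5: context sufficient
        node = target                                          # drill down / sideways / back up
        history.append((node, video_cap(node)))                # step 4: grow dialogue context
    return video_qa(node, question)                            # round budget exhausted
```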

Data Construction

  • Based on CG-Bench (with clue-grounded QA annotations): 800 videos and 5.6K QA pairs.
  • Video captions at each tree level are pre-extracted using Qwen2.5-VL-72B (sampling 256/128/64/32 frames).
  • GPT-5 generates CoTwT reasoning trajectories in a zero-shot manner; when generation fails, CG-Bench's clue-grounded annotations are used to provide progressive hints, ensuring correctness while minimizing information leakage.
  • The final dataset comprises 5.6K trajectories (averaging 5.8 steps), expanded into approximately 33K SFT training samples.

Loss & Training

Stage 1: SFT Cold Start — Fine-tune Qwen3-8B for 3 epochs to learn the structured reasoning format of <think>...</think> + <tool>...</tool> + <answer>...</answer>.
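
A minimal parser for that structured output format might look like this (illustrative; the exact tool-call payload schema is not specified in this summary):

```python
import re

# Matches one tagged block: <think>...</think>, <tool>...</tool>, or <answer>...</answer>
BLOCK = re.compile(r"<(think|tool|answer)>(.*?)</\1>", re.DOTALL)

def parse_turn(text: str) -> dict:
    """Return the content of each tagged block present in one model turn."""
    return {tag: body.strip() for tag, body in BLOCK.findall(text)}

turn = "<think>need more detail</think><tool>video_cap(node=3)</tool>"
print(parse_turn(turn))  # -> {'think': 'need more detail', 'tool': 'video_cap(node=3)'}
```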

Stage 2: GRPO Reinforcement Learning — 2 epochs with a composite reward function:

\[R = w_{\text{ans}} \cdot r_{\text{ans}} + w_{\text{loc}} \cdot r_{\text{loc}} + w_{\text{repeat}} \cdot r_{\text{repeat}}\]
  • \(r_{\text{ans}}\) (answer reward): 1 if the answer is correct, 0 otherwise.
  • \(r_{\text{loc}}\) (localization reward): Measures the coverage and precision of the temporal segments accessed by the model relative to the ground-truth key segments using an F1 metric, encouraging precise localization while penalizing redundant exploration.
  • \(r_{\text{repeat}}\) (repetition penalty): Penalizes repeated access to the same segment to reduce wasted computation.
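
The composite reward can be sketched as below (the weights \(w\) are placeholders — their values are not given in this summary — and representing segments as leaf indices is my assumption):

```python
def localization_f1(visited, ground_truth):
    """F1 between visited leaf indices and ground-truth key segments:
    recall rewards coverage, precision penalizes redundant exploration."""
    v, gt = set(visited), set(ground_truth)
    tp = len(v & gt)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(v), tp / len(gt)
    return 2 * precision * recall / (precision + recall)

def composite_reward(correct, visited, ground_truth,
                     w_ans=1.0, w_loc=0.5, w_repeat=0.1):
    """R = w_ans*r_ans + w_loc*r_loc + w_repeat*r_repeat (placeholder weights)."""
    r_ans = 1.0 if correct else 0.0
    r_loc = localization_f1(visited, ground_truth)
    r_repeat = -(len(visited) - len(set(visited)))  # each repeated access costs 1
    return w_ans * r_ans + w_loc * r_loc + w_repeat * r_repeat
```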

Key Experimental Results

Main Results

| Benchmark | LongVideo-R1 | LongVideo-R1 (new) | Best Competing Method |
| --- | --- | --- | --- |
| LVBench Overall | 50.0% | 60.7% | AdaReTake-72B: 53.3% |
| LVBench-TG (Temporal Grounding) | 56.4% | 62.7% | AdaReTake-72B: 45.5% |
| LVBench-KIR (Key Info Retrieval) | 56.4% | 70.1% | AdaReTake-72B: 62.2% |
| MLVU | 68.1% | 71.3% | VideoChat-Flash-7B: 74.7% |
| Video-MME-Long (w/ sub) | 64.4% | 68.6% | Ego-R1: 64.9% |

  • On LVBench, the 8B LongVideo-R1 surpasses GPT-4o (48.9%) and GLM-4V-plus (48.7%).
  • The temporal grounding (TG) sub-task reaches 56.4%, leading the second-best method by 10.9 percentage points.
  • Upgrading the caption tool to Qwen3-VL-32B-Instruct raises the overall accuracy to 60.7%.

Efficiency Comparison

| Metric | LongVideo-R1 | Ego-R1 |
| --- | --- | --- |
| Avg. caption segments per question (Video-MME) | 10.5 rounds | 86 segments |
| Avg. time per question (LVBench) | ~3 minutes | Substantially longer |

Ablation Study

| Ablation | LVBench | Video-MME/L |
| --- | --- | --- |
| SFT only (10K) | 39.1% | 57.7% |
| SFT only (full 33K) | 41.6% | 59.2% |
| + RL (10K data) | 47.4% | 60.2% |
| + RL (full data, complete model) | 50.0% | 64.4% |
| w/o \(r_{\text{loc}}\) | 45.8% | 61.4% |

  • Scaling SFT data from 10K to 33K improves LVBench by 2.5 percentage points; adding RL contributes a further 8.4 points.
  • The localization reward \(r_{\text{loc}}\) contributes +4.2 points on LVBench and +3.0 points on Video-MME.
  • Increasing the maximum number of rounds from 10 to 30: LVBench 43.0% → 50.0%, but inference time increases from 104s to 176s.

Highlights & Insights

  1. Valuable problem formulation: This work is the first to formally define the problem of long video understanding under a low computational budget, proposing a research direction toward Pareto-optimal accuracy–efficiency trade-offs.
  2. Elegant design intuition: The hierarchical video tree combined with active reasoning-guided navigation mirrors the human strategy of understanding videos from global to local.
  3. Significant efficiency advantage: An average of 10.5 rounds suffices to complete QA, requiring only ~1/8 the computation of Ego-R1, while achieving comparable or superior accuracy.
  4. Capability on ultra-long videos: The method remains effective on TV dramas spanning dozens of hours, completing QA within 10–20 rounds, a regime where linear scanning is infeasible.
  5. Clever data construction strategy: CG-Bench's grounding annotations are used to progressively prompt GPT-5, ensuring correctness while minimizing hint leakage.
  6. Fully open-source: The LRM is based on Qwen3-8B and the tools on the Qwen2.5-VL series, enabling complete local deployment.

Limitations & Future Work

  1. Uniform segmentation is suboptimal: The video tree uses equal-length splitting, so semantically related content may fall into adjacent segments, increasing localization ambiguity.
  2. Limited tool variety: Only caption and QA tools are available, lacking fine-grained tools such as instance recognition and segment detection.
  3. Underperformance on global questions: The method is inferior to uniform sampling approaches on MLVU (which includes short videos) and Video-MME (which contains global questions about video themes), since such questions do not require precise localization.
  4. Navigation can be misled: The LRM is occasionally attracted to semantically related but irrelevant segments, getting stuck in incorrect regions and requiring manual hints to recover.
  5. Single-question assumption: Each QA pair is processed independently; the scenario of sharing a video index across multiple questions to amortize the initial overhead is not considered.
  6. Dependence on caption quality: The framework's performance is highly sensitive to the quality of the video description tool; inaccurate descriptions cause reasoning errors to propagate.

Comparison with Prior Methods

| Method | Type | LVBench | Computation | Limitations |
| --- | --- | --- | --- | --- |
| VideoAgent | Agent | 29.3% | Exhaustive + GPT | Low accuracy |
| VideoTree | Agent | 28.8% | Tree-based exhaustive | Linear complexity |
| MemVid | Agent | 44.4% | Memory-augmented | Weak on some sub-tasks |
| Ego-R1 | Agent+RL | ~64.9% (VME) | Caption every 30s | High computation cost |
| AdaReTake-72B | MLLM | 53.3% | Adaptive sampling | 72B large model |
| LongVideo-R1 | Agent+RL | 50.0% | ~10 navigation rounds | Weak on global questions |

Rating

  • Novelty: ⭐⭐⭐⭐ — The problem formulation of "low-cost long video understanding" and the hierarchical active navigation framework represent clear contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Three mainstream benchmarks, ultra-long video case studies, and multi-dimensional ablations (data scale / reward components / tool size / maximum rounds).
  • Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly articulated, method description is complete, and algorithm pseudocode is well-formatted.
  • Value: ⭐⭐⭐⭐⭐ — Addresses the core efficiency bottleneck of long-video agents; fully open-source and reproducible, with direct implications for real-world deployment.