
VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

Conference: CVPR 2026 arXiv: 2603.20185 Code: https://github.com/jylins/videoseek Area: Video Understanding / Agent Keywords: Video Agent, Long Video Understanding, Tool Invocation, Logical Flow, Think-Act-Observe

TL;DR

VideoSeek proposes a long-horizon video agent that actively seeks critical evidence via video logical flow rather than exhaustively parsing all frames. Through a think-act-observe loop and a multi-granularity toolkit (overview/skim/focus), it improves over the base model GPT-5 on LVBench by 10.2 points (in the with-subtitles setting) while reducing frame usage by roughly 93%.

Background & Motivation

  1. Background: Video-language understanding has advanced rapidly alongside progress in large multimodal models (LMMs). However, mainstream approaches—including Qwen2.5-VL and GPT-4o—still adopt a single-pass inference paradigm that generates answers from a fixed number of input frames, struggling on long videos and complex reasoning tasks. Video agent methods such as DrVideo and DVD introduce iterative reasoning but rely heavily on dense video preprocessing.
  2. Limitations of Prior Work: Existing video agents densely sample frames at 0.2–2 FPS and generate detailed textual descriptions or structured memory for each frame, incurring computational costs that grow linearly with video length. For instance, DVD and MR. Video each process 8,074 frames on LVBench. Yet over 80% of LVBench questions can be answered by examining fewer than 5% of the frames, making exhaustive parsing extremely wasteful.
  3. Key Challenge: Thorough video preprocessing can improve accuracy but is prohibitively expensive and does not scale. The central challenge is achieving performance comparable to or better than dense parsing under an extremely sparse visual budget.
  4. Goal: To design an efficient video agent that answers video questions by actively locating critical evidence rather than resorting to brute-force enumeration.
  5. Key Insight: Humans do not watch videos frame by frame to answer questions—they exploit temporal and causal structure to infer where useful evidence is likely to appear, quickly establish a coarse storyline, inspect promising time intervals, and attend carefully only when fine-grained details are needed.
  6. Core Idea: Employ a ReAct-style think-act-observe loop together with a three-level granularity toolkit (global overview → coarse skim → fine focus) to actively navigate to answer-critical frames guided by video logical flow.

Method

Overall Architecture

VideoSeek formulates video-language tasks as long-horizon problems rather than single-step inference. Given a query \(\mathbf{Q}\) and video \(\mathbf{X}\), the agent produces a think-act-observe triplet \(\langle z_t, a_t, o_t \rangle\) at each reasoning step \(t\), forming a trajectory \(\tau\). The final answer is generated by \(p(\mathbf{Y}|\mathbf{X}, \mathbf{Q}, \tau)\). The key mechanism is that the trajectory exploration process \(p(\tau|\mathbf{X}, \mathbf{Q})\) focuses on a small number of highly informative observations rather than exhaustive coverage. The agent uses GPT-5 as the default thinking LLM with a maximum of \(N=20\) reasoning rounds.
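
A minimal sketch of this loop, assuming the paper's budget of \(N=20\) rounds; the helper functions (think, execute_tool, parse_answer) and the action format are illustrative placeholders, not the authors' implementation:

```python
# Illustrative sketch of the think-act-observe loop (not the official code).
# `think`, `execute_tool`, and `parse_answer` are hypothetical callables standing in
# for the GPT-5 thinking LLM, the overview/skim/focus tools, and answer extraction.

def videoseek_loop(video, question, think, execute_tool, parse_answer, max_rounds=20):
    trajectory = []  # accumulated (z_t, a_t, o_t) triplets
    for t in range(max_rounds):
        # The thinking LLM reads the question plus the full trajectory so far and
        # emits a reasoning chain z_t and a tool-invocation plan a_t.
        z_t, a_t = think(question, trajectory)
        if a_t["tool"] == "answer":
            return parse_answer(a_t)  # the <answer> action terminates the loop
        # Otherwise run the chosen tool (overview / skim / focus) on the requested
        # span and append the new observation so later rounds can condition on it.
        o_t = execute_tool(video, a_t)
        trajectory.append((z_t, a_t, o_t))
    # Round budget exhausted: request a best-effort answer from the evidence so far.
    _, final_action = think(question, trajectory, force_answer=True)
    return parse_answer(final_action)
```

Because the full trajectory is passed back at every step, the agent can revisit earlier observations or redirect its search, which is the flexibility argued for in the Think-Act-Observe design below.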

Key Designs

  1. Multi-Granularity Toolkit:

    • Function: Emulates the human video-watching behavior of "overview first, then skim, then focus."
    • Mechanism: Three tools operate at different temporal granularities—
      • <overview>: Uniformly samples \(16\alpha\) frames from the entire video (for LVBench, \(\alpha=4\), i.e., 64 frames) to produce a coarse storyline and annotate key intervals. Used primarily at the beginning to build a global cognitive map.
      • <skim>: Uniformly samples \(4\alpha\) frames from a candidate long segment (minimum \(4\alpha\) seconds) to rapidly locate query-relevant moments. Can be invoked multiple times on different segments to progressively narrow the search space.
      • <focus>: Densely inspects a short segment at 1 FPS (maximum \(4\alpha\) seconds) to capture fine-grained details such as text reading, face recognition, and object counting. Serves as the final "close-up" step to ensure answer accuracy.
    • Design Motivation: Different levels of information require different observational granularities. Global overview reveals narrative structure and logical flow; skimming localizes relevant intervals; focusing retrieves decisive evidence. The three tools are complementary and mutually indispensable (a frame-sampling sketch follows this list).
  2. Think-Act-Observe Loop:

    • Function: Enables adaptive long-horizon reasoning and evidence accumulation.
    • Mechanism: Adopts the ReAct architecture: at each step, the thinking LLM reads the full trajectory (including all prior thoughts, actions, and observations) and outputs a reasoning chain \(z_t\) together with a tool-invocation plan \(a_t\). If \(a_t\) is <answer>, the answer is parsed and the loop terminates; otherwise the tool is executed to obtain a new observation \(o_t\), which is appended to the trajectory for the next round. The key innovation is that the agent dynamically adjusts its tool-invocation strategy based on continuously accumulated observations, rather than following a predefined coarse-to-fine rule.
    • Design Motivation: Compared to pre-building video databases (DrVideo/DVD) or maintaining fixed memory buffers (VCA), reasoning directly over the full conversation history is more flexible—the agent can revisit prior observations, revise judgments, and redirect the search.
  3. Exploitation of Video Logical Flow:

    • Function: Enables the agent to infer evidence locations from the temporal-causal structure of video.
    • Mechanism: Videos possess an inherent logical flow (scene causality, temporal ordering, character relationships, etc.). When subtitles are available, the logical flow is directly exposed through a textual storyline, allowing the agent to localize key segments more efficiently. Without subtitles, <overview> constructs a coarse logical flow via visual summarization. The agent leverages logical flow to infer where the answer is likely to appear rather than searching blindly.
    • Design Motivation: LVBench experiments show that adding subtitles reduces frame usage from 92.3 to 27.2 while raising accuracy from 68.4 to 76.7, demonstrating that exposing logical flow substantially improves navigation precision and efficiency.
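
The frame budgets of the three tools follow directly from the granularity scale \(\alpha\). A rough Python sketch of the sampling schedule, assuming timestamps in seconds; frame decoding and the LMM interpretation of sampled frames are omitted, and the helper names are hypothetical:

```python
import numpy as np

ALPHA = 4  # granularity scale; the paper uses alpha = 4 for LVBench

def overview_timestamps(duration_s, alpha=ALPHA):
    """<overview>: 16*alpha frames sampled uniformly over the whole video (64 for alpha=4)."""
    return np.linspace(0.0, duration_s, num=16 * alpha, endpoint=False)

def skim_timestamps(start_s, end_s, alpha=ALPHA):
    """<skim>: 4*alpha frames sampled uniformly over a candidate segment (>= 4*alpha seconds)."""
    assert end_s - start_s >= 4 * alpha, "skim is meant for longer candidate segments"
    return np.linspace(start_s, end_s, num=4 * alpha, endpoint=False)

def focus_timestamps(start_s, end_s, alpha=ALPHA):
    """<focus>: dense 1 FPS sampling over a short segment (capped at 4*alpha seconds)."""
    end_s = min(end_s, start_s + 4 * alpha)
    return np.arange(start_s, end_s, 1.0)
```

Under these budgets, each call costs at most 64 frames (overview) or 16 frames (skim, focus) on LVBench, so total frame usage grows with the number of tool calls rather than with video length.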

Loss & Training

VideoSeek is a training-free, model-agnostic agent framework that directly leverages GPT-5's reasoning and tool-use capabilities. Visual content returned by tool calls is interpreted by GPT-5.

Key Experimental Results

Main Results

| Method | Type | LVBench (w/o sub.) | Frames | LVBench (w/ sub.) | Frames |
|---|---|---|---|---|---|
| GPT-5 (Base) | LMM | 60.1 | 384 | 66.5 | 384 |
| Gemini 1.5 Pro | LMM | 33.1 | 3600 | — | — |
| DVD | Agent | 74.2 | 8074 | 76.0 | 8074 |
| MR. Video | Agent | 60.8 | 8074 | — | — |
| VideoSeek | Agent | 68.4 | 92.3 | 76.7 | 27.2 |

VideoMME Long (w/ subtitles): VideoSeek 81.2% / 15.9 frames vs. GPT-5 78.1% / 384 frames. LongVideoBench Long: VideoSeek 73.5% / 29.6 frames vs. GPT-5 64.5% / 384 frames.

Ablation Study

Toolkit ablation (LVBench, w/o subtitles):

| Configuration | LVBench (w/o sub.) | Notes |
|---|---|---|
| Full toolkit | 68.4 | Full model |
| w/o overview | 55.1 (−13.3) | Loss of global perspective |
| w/o skim | 62.4 (−6.0) | Loss of intermediate-granularity browsing |
| w/o focus | 63.7 (−4.7) | Loss of fine-grained inspection |

Thinking LLM ablation (LVBench, w/o subtitles):

| Thinking LLM | Frames | Rounds | LVBench |
|---|---|---|---|
| GPT-5 | 92.3 | 4.42 | 68.4 |
| o4-mini | 112.6 | 5.08 | 58.5 (−9.9) |
| GPT-4.1 | 74.2 | 2.99 | 53.0 (−15.4) |

Key Findings

  • Overview is the most critical tool (removing it causes a 13.3-point drop) because it provides the global storyline and logical flow that serve as the foundation for all subsequent navigation.
  • Reasoning capability determines the upper bound: GPT-4.1 (a non-thinking model) tends to halt reasoning prematurely with overconfidence (only 2.99 rounds on average), leaving insufficient evidence; o4-mini explores more but with lower reasoning quality, so the additional computation does not translate into better performance.
  • Subtitles = explicit logical flow: Adding subtitles reduces frame usage by 70% (92.3→27.2) while raising accuracy by 8.3 points, confirming that a textual storyline greatly simplifies evidence search.
  • vs. DVD: VideoSeek with subtitles surpasses DVD (76.7 vs. 76.0) using only 0.3% of the frames (27.2 vs. 8,074).
  • On the complex reasoning benchmark Video-Holmes, VideoSeek achieves 47.3%, surpassing Gemini 2.5 Pro's 45.0% with only one-quarter of the frames.

Highlights & Insights

  • "Active seeking" vs. "exhaustive parsing"—a paradigm shift: This is the paper's most profound contribution. VideoSeek demonstrates that intelligent navigation is far more effective than brute-force dense sampling for long video understanding—achieving or exceeding dense methods with only 1% of the frames. This aligns with the principle of cognitive economy in human perception.
  • Aesthetic elegance of the tiered toolkit design: The overview/skim/focus hierarchy maps precisely onto human "glance → browse → scrutinize" behavior—intuitive yet highly effective. In particular, the global overview contributes far more than expected (13.3 points).
  • Thinking models are the core engine: Non-thinking models cannot effectively leverage this framework—genuine reasoning capability is required to judge "whether evidence is sufficient" and "where to look next." This highlights video agent systems' strong dependence on the underlying reasoning model.

Limitations & Future Work

  • Full reliance on the closed-source model GPT-5 precludes open-source reproducibility and deployment in cost-sensitive scenarios.
  • The approach may underperform on abrupt or highly localized critical moments (e.g., anomaly detection), where logical-flow-guided navigation cannot anticipate unexpected events.
  • Each tool call requires LMM interpretation of visual content, potentially incurring high API costs.
  • Distilling this agent framework into smaller open-source models remains unexplored.
  • Hyperparameters of the toolkit (\(\alpha\), maximum frame counts, etc.) require tuning for different benchmarks.

Comparison with Prior Work

  • vs. DVD Agent: DVD constructs a multi-granularity video database for retrieval, requiring 8,074 frames of preprocessing. VideoSeek explores on demand, reducing frame count by two orders of magnitude while achieving comparable (without subtitles) or superior (with subtitles) performance.
  • vs. DrVideo: DrVideo converts video into long documents at 0.2 FPS—an exhaustive paradigm. VideoSeek demonstrates that exhaustive enumeration is unnecessary.
  • vs. single-pass inference LMMs: GPT-5 in single-pass mode with 384 frames achieves 60.1%; VideoSeek using the same model within an agent framework improves this to 68.4%/76.7%, demonstrating that the agent paradigm can unlock the latent potential of the base model.

Rating

  • Novelty: ⭐⭐⭐⭐ The concept of "active seeking over exhaustive parsing" is valuable, though the ReAct + tool-calling framework itself is not entirely novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four benchmarks, with/without subtitle comparisons, thinking LLM ablations, and toolkit ablations—analysis is exceptionally thorough.
  • Writing Quality: ⭐⭐⭐⭐⭐ Narrative is fluent; Figure 1's efficiency-performance trade-off visualization is immediately clear; case studies are well-presented.
  • Value: ⭐⭐⭐⭐ Provides important insights for video agent efficiency optimization, though closed-source dependency limits broader community impact.