LensWalk: Agentic Video Understanding by Planning How You See in Videos

Conference: CVPR 2026
arXiv: 2603.24558
Code: N/A
Area: Video Understanding
Keywords: Video Agent, Active Observation, Vision-Language Models, Long Video Understanding, Tool Calling

TL;DR

This paper presents LensWalk, an agentic framework that enables an LLM reasoner to actively control the temporal scope and sampling density of video observations. Through a reason-plan-observe loop, LensWalk achieves adaptive video understanding without any fine-tuning, yielding plug-and-play performance gains exceeding 5% on long video benchmarks.

Background & Motivation

Video understanding is a core task in computer vision, yet the temporally dense nature of video poses substantial challenges for automated analysis. Existing approaches suffer from a fundamental disconnect between reasoning and perception.

Three categories of limitations are prevalent: (1) single-pass forward methods uniformly sample the video into a fixed visual context, risking either missing critical events or saturating the context with redundant information; (2) heuristic keyframe selection methods offer finer granularity but remain one-shot, static sampling schemes that cannot adapt as intermediate hypotheses evolve; (3) retrieval-based agents can acquire information dynamically but operate over pre-processed static representations (e.g., ASR transcripts, clip-level captions), which precludes generating new observations from the source video on demand.

Key Challenge: A model's reasoning process should drive what and how it observes, yet existing pipelines decouple observation from reasoning: observations are completed before reasoning begins or are constrained to fixed preprocessing artifacts.
Key Insight: This paper draws on human visual cognition strategies, wherein humans cope with information overload through purposeful information seeking, continuously alternating between macro scanning and fine-grained focusing while reflecting and verifying throughout.
Core Idea: Allow the LLM reasoner to autonomously determine the temporal scope and sampling density of observations, transforming video understanding into an active reason-plan-observe loop.

Method

Overall Architecture

LensWalk models video understanding as a multi-round iterative process. In each round, the LLM reasoner (\(M_r\)) analyzes the current question and accumulated evidence, then formulates an action plan (\(a_t\)) specifying the observation tool, guiding sub-question, temporal range, and sampling density. This plan is executed by a VLM observer (\(M_o\)) to extract visual evidence from the video. The evidence is appended to the history, forming the input for the next round of reasoning. Additionally, the system maintains timestamp anchors and a global entity memory table to ensure cross-round consistency.
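
The round structure described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `reasoner` and `observer` are hypothetical callables standing in for \(M_r\) and \(M_o\), the dictionary keys (`tool`, `question`, `intervals`, `answer`) are invented for the example, and the stopping rule (the reasoner returns an `answer`) is an assumption. The default of 20 rounds follows the tool-call limit reported for LensWalk.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical action format: a dict with "tool", "question", "intervals".
# A dict containing "answer" signals that the reasoner wants to stop.
Action = Dict[str, object]


def reason_plan_observe(
    question: str,
    reasoner: Callable[[str, List[dict]], Action],
    observer: Callable[[Action], str],
    max_rounds: int = 20,
) -> Optional[object]:
    """Alternate reasoning and observation until an answer or the budget ends."""
    history: List[dict] = []
    for _ in range(max_rounds):
        action = reasoner(question, history)   # reason over evidence, then plan
        if "answer" in action:                 # reasoner judges evidence sufficient
            return action["answer"]
        evidence = observer(action)            # execute the plan on the video
        history.append({"action": action, "evidence": evidence})
    return None  # budget exhausted without a final answer


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_reasoner(question: str, history: List[dict]) -> Action:
    if len(history) >= 2:
        return {"answer": "B"}
    tool = "scan_search" if not history else "segment_focus"
    return {"tool": tool, "question": question, "intervals": [(0.0, 60.0)]}


def toy_observer(action: Action) -> str:
    return f"observed with {action['tool']}"


result = reason_plan_observe("What happens after the goal?", toy_reasoner, toy_observer)
```

Because the full action is stored next to its evidence, the history records not only what was seen but also where and how the agent looked.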

Key Designs

Multi-Granularity Observation Tool Suite:
- Function: Provides three complementary observation tools supporting video browsing at different granularities.
- Mechanism: Scan Search discovers cues by scanning slices in parallel over a broad temporal range; Segment Focus performs high-density sampling within a single temporal segment to extract fine-grained details; Stitched Verify combines frames from multiple non-contiguous temporal segments into a single batch, enabling cross-segment comparison and causal verification.
- Design Motivation: The three tools constitute a complete cognitive chain of "discover–focus–verify," covering the full spectrum from global cue search to local detail extraction to cross-segment integration.
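
To make the difference between the three tools concrete, the sketch below computes which timestamps each tool would sample. It is an illustration only: the function names, the midpoint-based uniform sampling, and the equal split of frames across segments in `stitched_verify` are assumptions, since the summary does not specify how frames are allocated.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def uniform_timestamps(interval: Interval, n: int) -> List[float]:
    """Return n timestamps at the midpoints of n equal sub-spans."""
    start, end = interval
    step = (end - start) / n
    return [start + (i + 0.5) * step for i in range(n)]


def scan_search(interval: Interval, num_slices: int, frames_per_slice: int) -> List[List[float]]:
    """Split a broad range into slices; each slice can be observed in parallel."""
    start, end = interval
    width = (end - start) / num_slices
    slices = [(start + k * width, start + (k + 1) * width) for k in range(num_slices)]
    return [uniform_timestamps(s, frames_per_slice) for s in slices]


def segment_focus(interval: Interval, num_frames: int) -> List[float]:
    """Sample one segment densely to extract fine-grained detail."""
    return uniform_timestamps(interval, num_frames)


def stitched_verify(intervals: List[Interval], total_frames: int) -> List[float]:
    """Merge frames from non-contiguous segments into one chronological batch."""
    per_segment = max(1, total_frames // len(intervals))  # assumed equal split
    batch: List[float] = []
    for interval in sorted(intervals):
        batch.extend(uniform_timestamps(interval, per_segment))
    return batch


scan = scan_search((0.0, 600.0), num_slices=6, frames_per_slice=4)
focus = segment_focus((120.0, 130.0), num_frames=5)
stitched = stitched_verify([(300.0, 310.0), (50.0, 60.0)], total_frames=4)
```

In this toy setup, `scan` covers ten minutes with six coarse slices, `focus` spends five frames on ten seconds, and `stitched` places two distant moments side by side in one batch.
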
Reasoning-Scheduled Active Observation Mechanism:
- Function: Enables the reasoner to explicitly control the temporal range (\(\mathcal{I}_t\)) and sampling strategy at each step.
- Mechanism: Each action \(a_t = (o_t, q_t, \mathcal{I}_t, \rho_{o_t})\) encodes the tool selection \(o_t\), the guiding question \(q_t\), the temporal range \(\mathcal{I}_t\), and the tool-specific parameters \(\rho_{o_t}\), realizing an end-to-end mapping from reasoning state to observation plan.
- Design Motivation: Embedding parameterized observation plans into the history allows the agent to track its own exploration progress and provides a basis for subsequent steps to target insufficiently explored regions.
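
One way to picture the action tuple and why logging it helps is the sketch below. The class name `ObservationAction`, its field names, and the helper `unexplored_gaps` are hypothetical; the gap computation is only one plausible way a reasoner could find insufficiently explored regions from its own history.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


@dataclass(frozen=True)
class ObservationAction:
    tool: str                                             # o_t: which tool to call
    question: str                                         # q_t: guiding sub-question
    intervals: Tuple[Interval, ...]                       # I_t: temporal range(s)
    params: Dict[str, int] = field(default_factory=dict)  # rho_{o_t}: tool parameters


def unexplored_gaps(actions: List[ObservationAction], video_end: float) -> List[Interval]:
    """Return the time spans not covered by any logged action."""
    covered = sorted(iv for a in actions for iv in a.intervals)
    gaps: List[Interval] = []
    cursor = 0.0
    for start, end in covered:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < video_end:
        gaps.append((cursor, video_end))
    return gaps


history = [
    ObservationAction("scan_search", "Where does the dog appear?", ((0.0, 100.0),), {"num_slices": 4}),
    ObservationAction("stitched_verify", "Is it the same dog?", ((250.0, 300.0), (80.0, 120.0))),
]
gaps = unexplored_gaps(history, video_end=600.0)
```
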
Evidence Anchoring and Entity Memory:
- Function: Ensures temporal localization accuracy and entity consistency throughout long multi-round reasoning.
- Mechanism: Timestamp anchors insert inter-frame temporal markers during VLM observation, enabling the observer to return answers with precise temporal references; a global entity memory table is maintained independently of the reasoning history, recording entity attributes and appearance times.
- Design Motivation: Avoids the overhead of redundantly recognizing the same entities across rounds, prevents entity reference confusion in long historical contexts, and provides temporal anchors for precise re-observation in subsequent steps.
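
The two mechanisms can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `[t=...s]` marker format, the `<frame_i>` placeholders standing in for image inputs, and the `EntityMemory` class with its fields are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


def interleave_anchors(timestamps: List[float]) -> List[str]:
    """Place a textual time marker before each frame placeholder."""
    sequence: List[str] = []
    for i, t in enumerate(timestamps):
        sequence.append(f"[t={t:.1f}s]")   # timestamp anchor
        sequence.append(f"<frame_{i}>")    # stands in for the actual image
    return sequence


@dataclass
class EntityRecord:
    attributes: Set[str] = field(default_factory=set)
    appearances: List[float] = field(default_factory=list)


class EntityMemory:
    """Global table kept separately from the reasoning history."""

    def __init__(self) -> None:
        self.table: Dict[str, EntityRecord] = {}

    def update(self, name: str, attributes: Iterable[str], seen_at: Iterable[float]) -> None:
        record = self.table.setdefault(name, EntityRecord())
        record.attributes.update(attributes)
        # Keep appearance times unique and in chronological order.
        record.appearances = sorted(set(record.appearances) | set(seen_at))


prompt = interleave_anchors([12.5, 37.5])
memory = EntityMemory()
memory.update("woman in red coat", {"red coat"}, [12.5])
memory.update("woman in red coat", {"carries umbrella"}, [37.5, 12.5])
```

A later round can then look up when the entity was seen and re-observe exactly those moments instead of re-identifying it from scratch.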

Loss & Training

LensWalk is a training-free, plug-and-play framework requiring no model fine-tuning.
The agent is limited to a maximum of 20 tool calls, one per round.
Per-call frame budgets for Scan Search, Segment Focus, and Stitched Verify are 180, 32, and 128, respectively.
The reasoner also updates the entity memory table.
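
The reported limits can be written down as a small configuration. The numeric values come from the settings above; the names and the clamping behavior (silently capping an over-budget request) are assumptions, since the summary does not say how over-budget requests are handled.

```python
MAX_TOOL_CALLS = 20  # one tool call per round

# Per-call frame budgets as reported in the settings above.
FRAME_BUDGET = {
    "scan_search": 180,
    "segment_focus": 32,
    "stitched_verify": 128,
}


def clamp_frames(tool: str, requested: int) -> int:
    """Cap a requested frame count at the tool's per-call budget (assumed policy)."""
    return max(1, min(requested, FRAME_BUDGET[tool]))


# Loose upper bound on frames seen in one episode: every call is a
# full-budget Scan Search. Real episodes would typically use far fewer.
worst_case_frames = MAX_TOOL_CALLS * max(FRAME_BUDGET.values())

capped = clamp_frames("segment_focus", 64)
untouched = clamp_frames("scan_search", 90)
```
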

Key Experimental Results

Main Results

Dataset	Metric	Ours (Best Config)	Prev. SOTA	Gain
LVBench	Accuracy	68.6% (o3 self)	60.8% (MR.Video)	+7.8%
VideoMME Long	Accuracy (w/o sub)	71.4% (o3 self)	67.3% (DVD)	+4.1%
LongVideoBench	Accuracy	70.6% (o3 self)	68.6% (DVD)	+2.0%
MMVU (MC)	Accuracy	80.9% (o3/GPT-4.1)	78.9% (o3)	+2.0%
Video-MMMU	Overall	78.33% (o3 self)	75.44% (o3)	+2.89%
EgoSchema	Val	77.2% (o3/Qwen2.5-VL-72B)	76.6% (DVD)	+0.6%

Ablation Study

Configuration	Key Metric (VideoMME Long)	Note
Full LensWalk (o3/GPT-4.1)	70.0%	Baseline
w/o Scan Search	65.4%	−4.6%; cue localization is most critical
w/o Stitched Verify	66.8%	−3.2%; cross-segment integration is important
w/o Segment Focus	68.1%	−1.9%; fine-grained extraction contributes
w/o Timestamp Anchor	69.4%	−0.6%
w/o Subject Memory	69.7%	−0.3%
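
The drops in the Note column follow directly from the accuracies in the table. The short script below recomputes them and ranks the components by impact; it uses only numbers from the table, and the variable names are illustrative.

```python
full = 70.0  # Full LensWalk (o3/GPT-4.1) on VideoMME Long

ablated = {
    "Scan Search": 65.4,
    "Stitched Verify": 66.8,
    "Segment Focus": 68.1,
    "Timestamp Anchor": 69.4,
    "Subject Memory": 69.7,
}

# Accuracy lost when each component is removed, in percentage points.
drops = {name: round(full - acc, 1) for name, acc in ablated.items()}

# Components ordered from most to least harmful to remove.
ranking = sorted(drops, key=drops.get, reverse=True)
```

Removing any of the three observation tools costs more than removing either consistency mechanism, which matches the ordering of the notes in the table.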

Key Findings

When o3 serves as its own observer (reasoner and observer share the same model), the gains are large: +11.5% on LVBench and +6.7% on VideoMME Long, effectively a "free lunch."
The open-source reasoner Qwen3-235B-A22B is effective with weak observers (e.g., +4.3% for Qwen2.5-VL-7B) but offers limited benefit with strong observers (e.g., only +0.1% for GPT-4.1).
The agent exhibits six behavioral patterns: direct querying, progressive zooming, range splitting, strategic reflection, integrative verification, and static repetition.
The framework adaptively allocates the observation budget: simple questions are resolved quickly with few frames, while complex questions involve more observation rounds.

Highlights & Insights

The core design philosophy of incorporating "how to observe" into the reasoning loop is highly elegant, analogous to purposeful human visual search strategies.
The training-free, plug-and-play nature allows the framework to directly enhance existing models, offering considerable engineering value.
The emergent diversity of cognitive strategies (progressive zooming, strategic reflection, etc.) demonstrates the agent's autonomous reasoning capability.
Total token consumption is comparable to that of single-pass forward methods, while the peak token count per round is substantially lower, alleviating memory pressure from long contexts.

Limitations & Future Work

Framework effectiveness is highly dependent on the reasoner's cognitive capability—weaker reasoners may generate invalid observation plans.
A small proportion of "static repetition" behavior (repeated observation of the same region) persists, indicating that the planning mechanism remains imperfect.
Current observation tools are limited to the visual modality and do not exploit multimodal information such as audio or subtitles.
The maximum of 20 tool calls may be insufficient for extremely long videos.

vs. Deep Video Discovery (DVD): DVD pre-generates captions for an entire video to support reasoning, consuming millions of tokens; LensWalk observes on demand, with token consumption approximating single-pass forward methods while achieving higher accuracy.
vs. MR.Video: MR.Video relies on pre-processed clip retrieval, with fixed observation granularity and scope; LensWalk dynamically adjusts the temporal range and sampling density of observations.
vs. VideoAgent: VideoAgent's tools operate solely on preprocessing artifacts; LensWalk directly schedules new observations from source video.
Insights: The notion of "scalable visual cognition"—not merely scaling model size but enabling models to learn active observation—offers a compelling research direction.

Rating

Novelty: ⭐⭐⭐⭐⭐ Reframes video understanding as an active observation scheduling problem; conceptually innovative and elegantly realized.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Six benchmarks, multiple model combinations, detailed ablations, and behavioral analysis.
Writing Quality: ⭐⭐⭐⭐⭐ Fluent narrative, clear exposition of principles, and in-depth experimental analysis.
Value: ⭐⭐⭐⭐⭐ Plug-and-play framework that directly enhances existing strong models; highly practical.