LensWalk: Agentic Video Understanding by Planning How You See in Videos

Conference: CVPR 2026 · arXiv: 2603.24558 · Code: N/A · Area: Video Understanding · Keywords: Video Agent, Active Observation, Vision-Language Models, Long Video Understanding, Tool Calling

TL;DR

This paper presents LensWalk, an agentic framework that enables an LLM reasoner to actively control the temporal scope and sampling density of video observations. Through a reason-plan-observe loop, LensWalk achieves adaptive video understanding without any fine-tuning, yielding plug-and-play performance gains exceeding 5% on long video benchmarks.

Background & Motivation

Video understanding is a core task in computer vision, yet the temporally dense nature of video poses substantial challenges for automated analysis. Existing approaches suffer from a fundamental disconnect between reasoning and perception.

Three categories of limitations are prevalent: (1) single-pass forward methods uniformly sample videos into a fixed visual context, risking missed critical events or saturation with redundant information; (2) heuristic keyframe selection methods offer finer granularity but remain one-shot, static sampling schemes that cannot adapt as intermediate hypotheses evolve; (3) retrieval-based agents can acquire information dynamically but operate over pre-processed static representations (e.g., ASR transcripts, clip-level captions), precluding on-demand generation of new observations from the source video.

Key Challenge: A model's reasoning process should drive what and how it observes, yet existing pipelines decouple observation from reasoning: observations are either completed before reasoning begins or constrained to fixed preprocessing artifacts.

Key Insight: The paper draws on human visual cognition, in which people cope with information overload through purposeful information seeking, continuously alternating between macro scanning and fine-grained focusing while reflecting and verifying throughout.

Core Idea: Let the LLM reasoner autonomously determine the temporal scope and sampling density of its observations, transforming video understanding into an active reason-plan-observe loop.

Method

Overall Architecture

LensWalk models video understanding as a multi-round iterative process. In each round, the LLM reasoner (\(M_r\)) analyzes the current question and accumulated evidence, then formulates an action plan (\(a_t\)) specifying the observation tool, guiding sub-question, temporal range, and sampling density. This plan is executed by a VLM observer (\(M_o\)) to extract visual evidence from the video. The evidence is appended to the history, forming the input for the next round of reasoning. Additionally, the system maintains timestamp anchors and a global entity memory table to ensure cross-round consistency.
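
To make the loop concrete, here is a minimal Python sketch of the reason-plan-observe cycle as described above. The paper releases no code, so every name here (`reasoner`, `observer`, `plan`, `observe`, `extract_entities`) is an illustrative assumption, not the authors' API:

```python
def lenswalk_loop(question, video, reasoner, observer, max_calls=20):
    """Reason-plan-observe loop: the reasoner M_r schedules what the
    VLM observer M_o looks at, one tool call per round (hypothetical API)."""
    history = []        # accumulated (action, evidence) pairs
    entity_memory = {}  # global table: entity -> attributes, appearance times
    for _ in range(max_calls):
        # 1. Reason: analyze the question and evidence, emit a plan or an answer.
        step = reasoner.plan(question, history, entity_memory)
        if step.kind == "answer":
            return step.text          # reasoner is confident enough to stop
        # 2. Observe: the VLM executes plan a_t on the source video.
        evidence = observer.observe(video, step.action)
        # 3. Accumulate: evidence feeds the next round; memory stays in sync.
        history.append((step.action, evidence))
        entity_memory.update(reasoner.extract_entities(evidence))
    # Budget exhausted: answer from whatever evidence was gathered.
    return reasoner.answer(question, history, entity_memory)
```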

Key Designs

  1. Multi-Granularity Observation Tool Suite:

    • Function: Provides three complementary observation tools supporting video browsing at different granularities.
    • Mechanism: Scan Search discovers cues by scanning slices in parallel over a broad temporal range; Segment Focus performs high-density sampling within a single temporal segment to extract fine-grained details; Stitched Verify combines frames from multiple non-contiguous temporal segments into a single batch, enabling cross-segment comparison and causal verification.
    • Design Motivation: The three tools constitute a complete cognitive chain of "discover–focus–verify," covering the full spectrum from global cue search to local detail extraction to cross-segment integration.
  2. Reasoning-Scheduled Active Observation Mechanism:

    • Function: Enables the reasoner to explicitly control the temporal range (\(\mathcal{I}_t\)) and sampling strategy at each step.
    • Mechanism: Each action \(a_t = (o_t, q_t, \mathcal{I}_t, \rho_{o_t})\) encodes the tool selection, guiding question, temporal range, and tool-specific sampling parameters, realizing an end-to-end mapping from reasoning state to observation plan (a minimal sketch of this action tuple and the timestamp anchoring follows the list).
    • Design Motivation: Embedding parameterized observation plans into the history allows the agent to track its own exploration progress and provides a basis for subsequent steps to target insufficiently explored regions.
  3. Evidence Anchoring and Entity Memory:

    • Function: Ensures temporal localization accuracy and entity consistency throughout long multi-round reasoning.
    • Mechanism: Timestamp anchors insert inter-frame temporal markers during VLM observation, enabling the observer to return answers with precise temporal references; a global entity memory table is maintained independently of the reasoning history, recording entity attributes and appearance times.
    • Design Motivation: Avoids the overhead of redundantly recognizing the same entities across rounds, prevents entity reference confusion in long historical contexts, and provides temporal anchors for precise re-observation in subsequent steps.
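
The sketch below shows one plausible encoding of the action tuple \(a_t = (o_t, q_t, \mathcal{I}_t, \rho_{o_t})\), the three observation tools, and the timestamp-anchoring trick from design 3. All class and function names are assumptions made for illustration:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence

class Tool(Enum):
    SCAN_SEARCH = "scan_search"          # broad parallel slice scanning
    SEGMENT_FOCUS = "segment_focus"      # dense sampling of one segment
    STITCHED_VERIFY = "stitched_verify"  # batch frames from disjoint segments

@dataclass
class Action:
    """One observation plan a_t = (o_t, q_t, I_t, rho_t)."""
    tool: Tool                                # o_t: which observation tool
    sub_question: str                         # q_t: guiding question for the VLM
    intervals: Sequence[tuple[float, float]]  # I_t: temporal range(s), in seconds
    density: float                            # rho_t: tool-specific sampling density

def render_with_anchors(frames, timestamps):
    """Timestamp anchoring: interleave textual time markers between frames so
    the observer can return answers with precise temporal references."""
    message = []
    for frame, t in zip(frames, timestamps):
        message.append(f"[t={t:.1f}s]")   # anchor token precedes each frame
        message.append(frame)
    return message
```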

Loss & Training

  • LensWalk is a training-free, plug-and-play framework requiring no model fine-tuning.
  • The agent is limited to a maximum of 20 tool calls, one per round.
  • Per-call frame budgets for Scan Search, Segment Focus, and Stitched Verify are 180, 32, and 128 frames, respectively (a sampling sketch under these budgets follows the list).
  • The reasoner simultaneously serves as the updater of the entity memory table.
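
Continuing the sketch above (reusing the hypothetical `Tool` and `Action` types), the stated budgets might be enforced along these lines; the constants come from the paper, while the sampling logic is an assumption:

```python
MAX_TOOL_CALLS = 20
FRAME_BUDGET = {                 # maximum frames per single tool call
    Tool.SCAN_SEARCH: 180,
    Tool.SEGMENT_FOCUS: 32,
    Tool.STITCHED_VERIFY: 128,
}

def sample_timestamps(action: Action) -> list[float]:
    """Uniformly sample timestamps within I_t, capped by the tool's budget."""
    requested = sum(int((end - start) * action.density)
                    for start, end in action.intervals)
    n = min(FRAME_BUDGET[action.tool], max(1, requested))
    per_interval = max(1, n // len(action.intervals))
    timestamps = []
    for start, end in action.intervals:
        step = (end - start) / per_interval
        # midpoint sampling within each uniform slice of the interval
        timestamps.extend(start + step * (i + 0.5) for i in range(per_interval))
    return timestamps[:n]
```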

Key Experimental Results

Main Results

| Dataset | Metric | Ours (Best Config) | Prev. SOTA | Gain |
| --- | --- | --- | --- | --- |
| LVBench | Accuracy | 68.6% (o3 self) | 60.8% (MR.Video) | +7.8% |
| VideoMME Long | Accuracy (w/o sub) | 71.4% (o3 self) | 67.3% (DVD) | +4.1% |
| LongVideoBench | Accuracy | 70.6% (o3 self) | 68.6% (DVD) | +2.0% |
| MMVU (MC) | Accuracy | 80.9% (o3/GPT-4.1) | 78.9% (o3) | +2.0% |
| Video-MMMU | Overall | 78.33% (o3 self) | 75.44% (o3) | +2.89% |
| EgoSchema | Accuracy (val) | 77.2% (o3/Qwen2.5-VL-72B) | 76.6% (DVD) | +0.6% |

Ablation Study

| Configuration | VideoMME Long Accuracy | Note |
| --- | --- | --- |
| Full LensWalk (o3/GPT-4.1) | 70.0% | Baseline |
| w/o Scan Search | 65.4% | −4.6%; cue localization is most critical |
| w/o Stitched Verify | 66.8% | −3.2%; cross-segment integration is important |
| w/o Segment Focus | 68.1% | −1.9%; fine-grained extraction contributes |
| w/o Timestamp Anchor | 69.4% | −0.6% |
| w/o Subject Memory | 69.7% | −0.3% |

Key Findings

  • When o3 serves as its own observer (reasoner and observer share the same model), performance is exceptional (+11.5% on LVBench, +6.7% on VideoMME Long), representing a "free lunch."
  • The open-source reasoner Qwen3-235B-A22B is effective with weak observers (e.g., +4.3% for Qwen2.5-VL-7B) but offers limited benefit with strong observers (e.g., only +0.1% for GPT-4.1).
  • The agent exhibits six behavioral patterns: direct querying, progressive zooming, range splitting, strategic reflection, integrative verification, and static repetition.
  • The framework adaptively allocates the observation budget: simple questions are resolved quickly with few frames, while complex questions involve more observation rounds.

Highlights & Insights

  • The core design philosophy of incorporating "how to observe" into the reasoning loop is highly elegant, analogous to purposeful human visual search strategies.
  • The training-free, plug-and-play nature allows the framework to directly enhance existing models, offering considerable engineering value.
  • The emergent diversity of cognitive strategies (progressive zooming, strategic reflection, etc.) demonstrates the agent's autonomous reasoning capability.
  • Total token consumption is comparable to single-pass forward methods, while the peak token count per round is substantially lower, alleviating memory pressure from long contexts.

Limitations & Future Work

  • Framework effectiveness is highly dependent on the reasoner's cognitive capability—weaker reasoners may generate invalid observation plans.
  • A small proportion of "static repetition" behavior (repeated observation of the same region) persists, indicating that the planning mechanism remains imperfect.
  • Current observation tools are limited to the visual modality and do not exploit multimodal information such as audio or subtitles.
  • The maximum of 20 tool calls may be insufficient in extreme long-video scenarios.

Comparison with Related Work

  • vs. Deep Video Discovery (DVD): DVD pre-generates captions for the entire video to support reasoning, consuming millions of tokens; LensWalk observes on demand, with token consumption close to that of single-pass forward methods while achieving higher accuracy.
  • vs. MR.Video: MR.Video relies on pre-processed clip retrieval, with fixed observation granularity and scope; LensWalk dynamically adjusts the temporal range and sampling density of observations.
  • vs. VideoAgent: VideoAgent's tools operate solely on preprocessing artifacts; LensWalk directly schedules new observations from source video.
  • Insights: The notion of "scalable visual cognition"—not merely scaling model size but enabling models to learn active observation—offers a compelling research direction.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Reframes video understanding as an active observation scheduling problem; conceptually innovative and elegantly realized.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Six benchmarks, multiple model combinations, detailed ablations, and behavioral analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ Fluent narrative, clear exposition of principles, and in-depth experimental analysis.
  • Value: ⭐⭐⭐⭐⭐ Plug-and-play framework that directly enhances existing strong models; highly practical.