WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent¶

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=8jsaazdAb3
Code: https://github.com/Alibaba-NLP/DeepResearch/tree/main/WebAgent/WebWatcher
Area: Multimodal VLM / Agent
Keywords: Vision-Language Agent, Deep Research, Tool Call, GRPO, Multimodal VQA

TL;DR¶

WebWatcher is a "deep research" web agent that reasons jointly over text and images. It is trained in two stages: an SFT cold start on automatically synthesized, quality-filtered tool-calling trajectories, followed by GRPO reinforcement learning to refine decision-making. The paper also introduces BrowseComp-VL, a benchmark that requires cross-modal retrieval. WebWatcher outperforms prompt-based workflows and existing open-source multimodal agents on challenging benchmarks such as HLE, LiveVQA, and MMSearch.

Background & Motivation¶

Background: "Deep research" web agents can already plan over multiple steps and invoke search and browsing tools to solve extremely difficult information retrieval problems, demonstrating superhuman capabilities on benchmarks such as BrowseComp and Humanity's Last Exam (HLE). However, most current work is text-centric and leaves the visual information that pervades the real world as a blind spot.

Limitations of Prior Work: Many real-world scenarios (reading scientific charts, analyzing graphical data, navigating web interfaces) require joint reasoning over vision and language. Current multimodal agents follow one of two paths, and both fall short. The first is VL agents, which focus on visual perception tools such as OCR, detection boxes, cropping, and tagging. They can "see" images but do not connect visual perception with deep text understanding and cross-modal inference, so they fail on tasks like GAIA or HLE that require multi-step reasoning after viewing an image. The second is pure search agents: retrieval-augmented methods answer many knowledge questions, but they fail when the answer is implicit, requires following links interactively, or needs additional computation.

Key Challenge: The true barrier to multimodal deep research lies in the simultaneous requirement for stronger perception, logic, knowledge reasoning, and flexible orchestration of a set of tools with diverse input/output formats. Existing methods either have a limited toolset (only vision or only search) or rely on templated, fixed-scenario pipelines, lacking flexible reasoning and planning. The paper illustrates this with a GAIA case: identifying an animal in an image (an Atlantic Puffin), then searching its Wikipedia history to count revisions with the "visual edit" tag before 2020 (answer: 11). Pure vision agents over-infer on edge/texture analysis, while search agents cannot click into pages to browse; both fail.

Goal: To build a true "cross-modal deep research" agent, three sub-problems must be addressed: (1) a lack of training data featuring both high-quality visual content and complex multi-hop reasoning; (2) a lack of tool-calling trajectories that coordinate heterogeneous tools and align with real reasoning processes; (3) a lack of high-difficulty benchmarks to evaluate these capabilities.

Core Idea: Construct a complete pipeline of "data synthesis \(\rightarrow\) automatic trajectory annotation \(\rightarrow\) SFT cold-start \(\rightarrow\) GRPO reinforcement learning" that turns a standard multimodal large model into a deep research agent capable of planning, using five tools, and reasoning across modalities. Alongside it, the paper releases the BrowseComp-VL benchmark, which brings the "intentionally under-determined, human-challenging" problems of BrowseComp into the visual domain.

Method¶

Overall Architecture¶

The core of WebWatcher is not a new model architecture but a comprehensive training pipeline that instills the ability to "see, search, and reason." On the input side, large-scale multimodal VQA data (BrowseComp-VL) is synthesized from open web/Wikipedia content. GPT-4o then generates ReAct-style tool-calling trajectories on this data, which are rigorously filtered. The surviving high-quality trajectories are used for an SFT cold start, and GRPO reinforcement learning finally optimizes tool usage and decision-making. At inference time, the trained agent has five tools (Image Search, Text Search, Web Visit, Code Interpreter, and internal OCR) and solves problems step by step through a think-act-observe loop until a "Finish" action provides the answer.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
    A["Web/Wikipedia<br/>Random Walk Collection"] --> B["Data Construction<br/>Multi-hop QA + Entity Masking + QA→VQA"]
    B --> C["Trajectory Annotation & Filtering<br/>ReAct + 5 Tools + 3-Stage Selection"]
    C --> D
    subgraph D["Two-Stage Post-Training"]
        direction TB
        D1["SFT Cold-Start<br/>Predict Next Action"] --> D2["GRPO RL<br/>Intra-group Relative Advantage"]
    end
    D --> E["WebWatcher Agent<br/>think-act-observe Loop"]
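
The think-act-observe loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `run_agent`, `scripted_policy`, the `"finish"` action name, the `max_steps` cap, and the stub tool are all hypothetical stand-ins for the trained model and the real tools.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types: a tool maps a string argument to a string observation.
Tool = Callable[[str], str]
Step = Tuple[str, str, str, str]  # (thought, action, argument, observation)
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]


def run_agent(policy: Policy, tools: Dict[str, Tool], question: str,
              max_steps: int = 8) -> Tuple[Optional[str], List[Step]]:
    """Run a think-act-observe loop until the policy emits a finish action."""
    history: List[Step] = []
    for _ in range(max_steps):
        # Think + act: the policy sees the question and everything so far.
        thought, action, argument = policy(question, history)
        if action == "finish":
            return argument, history  # the argument carries the final answer
        # Observe: call the chosen tool, or report an unknown tool name.
        if action in tools:
            observation = tools[action](argument)
        else:
            observation = f"unknown tool: {action}"
        history.append((thought, action, argument, observation))
    return None, history  # step budget exhausted without an answer


def scripted_policy(question: str, history: List[Step]) -> Tuple[str, str, str]:
    """Toy policy (assumption): search once, then answer with the observation."""
    if not history:
        return ("I should look this up.", "text_search", question)
    return ("The observation answers it.", "finish", history[-1][3])


stub_tools = {"text_search": lambda q: "stub result for: " + q}
answer, trace = run_agent(scripted_policy, stub_tools, "example question")
```

In a real system the policy would be the trained model and the tool registry would hold the five tools; the control flow is the part this sketch is meant to show.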

Key Designs¶

1. BrowseComp-VL Data Construction: From Random Walks to Masked Multimodal Multi-hop Problems

The first pain point is the lack of training data that combines authentic visual content with deep reasoning: existing VQA datasets mostly involve shallow perception within two hops and lack planning complexity. WebWatcher uses a three-stage pipeline. First is QA Generation: it simulates human browsing by recursively traversing hyperlinks from Wikipedia pages, aggregates the content, and uses GPT-4o to synthesize QA pairs. Level 1 questions use explicit entities but require multi-hop reasoning. Level 2 questions follow the WebSailor approach of "entity blurring": starting from a root entity \(B_{root}\), a hyperlink tree is expanded with depth \(d=3\) and branching factor \(k=3\) (a total of \((k^{d+1}-1)/(k-1) = 40\) nodes). A subgraph with \(N\) entities is sampled to define the reasoning path to a target entity \(B\), and precise references are replaced with "partial/vague descriptions," forcing the model to rely on contextual reasoning rather than string matching. Next is QA \(\rightarrow\) VQA Conversion: entities lacking visual grounding (e.g., pure temporal references) are discarded. For each remaining entity \(\hat{B}\), Google SerpApi retrieves \(K=2\) real web images as visual anchors. The target entity in the text question \(q_t\) is then replaced with a visual reference \(r_{vis}\) such as "the entity in this image," so that one text question yields \(K\) multimodal questions. Finally comes two-stage Quality Control: a Selector discards failed conversions in which the converted question is identical to the original or leaks the entity's name or aliases, and an Examiner asks GPT-4o to answer the image query using only the image and its caption; if it fails, the visual context is judged insufficient and the sample is removed. The key is "strictly real images + masked entities + information density."
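
A minimal sketch of three pieces of this pipeline: the size of the hyperlink tree, the QA-to-VQA masking step, and the Selector check. The function names, the example question, and the example image URLs are hypothetical, and plain string replacement is an assumed simplification of whatever rewriting the authors actually perform.

```python
from typing import Dict, List


def tree_size(k: int, d: int) -> int:
    """Nodes in a full k-ary tree of depth d, with the root at depth 0."""
    return sum(k ** level for level in range(d + 1))


def to_vqa(question: str, entity: str, image_urls: List[str],
           visual_ref: str = "the entity in this image") -> List[Dict[str, str]]:
    """Replace the target entity with a visual reference, one sample per image."""
    if entity not in question:
        return []  # nothing to mask, so no multimodal sample is produced
    masked = question.replace(entity, visual_ref)
    return [{"image": url, "question": masked} for url in image_urls]


def selector_keeps(original: str, converted: str, names: List[str]) -> bool:
    """Selector check: reject unchanged questions and leaked names or aliases."""
    if converted == original:
        return False
    lowered = converted.lower()
    return not any(name.lower() in lowered for name in names)


# Illustrative values only; the entity and URLs are made up for the example.
question = "Where is Example Tower located?"
samples = to_vqa(question, "Example Tower",
                 ["https://example.com/a.jpg", "https://example.com/b.jpg"])
kept = [s for s in samples
        if selector_keeps(question, s["question"], ["Example Tower"])]
```

With \(k=3\) and \(d=3\), `tree_size(3, 3)` agrees with the closed form \((3^{4}-1)/(3-1)\). The Examiner stage is omitted because it requires a model call.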

2. Five-Tool Coordination and Automatic Trajectory Annotation: Distilling Real Tool Interactions into Learnable Demonstrations

The second pain point is that tool-calling trajectories are hard to create: the tools differ in their input/output formats and in the roles they play in reasoning, and manually templated trajectories are rigid and adapt poorly across tasks. WebWatcher equips the agent with five tools: Web Image Search (returns captions and URLs), Web Text Search, Visit (accesses and summarizes web pages), Code Interpreter (symbolic/numeric calculation), and OCR (internalized via prompting and SFT). GPT-4o automatically constructs ReAct trajectories on BrowseComp-VL instances \((I,q,a)\): each step generates a Thought (internal reasoning/planning wrapped in <think>), an Action (a tool call in <tool_call> or the final answer in <answer>), and an Observation (environment feedback in <tool_response>). A trajectory of length \(L\) is denoted as \(\tau = \{(t_0,o_0),\dots,(t_L,o_L)\}\). These trajectories are grounded in "real tool behavior." Filtering is stringent and has three stages: (1) the final answer must match the ground truth; (2) GPT-4o checks logical consistency step by step, and trajectories with hallucinations or contradictions are discarded; (3) trajectories with fewer than 3 tool calls are removed to ensure multi-step interaction.
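
The three-stage filter can be sketched as below. Only the tag names come from the description above; everything else is an assumption made for illustration: the trajectory is treated as one tagged string, answer matching is a case-insensitive exact comparison, and `is_consistent` is a hypothetical callback standing in for the GPT-4o consistency check.

```python
import re
from typing import Callable, Optional

TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def count_tool_calls(trajectory: str) -> int:
    """Count <tool_call> blocks in a serialized trajectory."""
    return len(TOOL_CALL.findall(trajectory))


def final_answer(trajectory: str) -> Optional[str]:
    """Return the content of the last <answer> block, if any."""
    matches = ANSWER.findall(trajectory)
    return matches[-1].strip() if matches else None


def keep_trajectory(trajectory: str, gold: str,
                    is_consistent: Callable[[str], bool],
                    min_calls: int = 3) -> bool:
    """Apply the three filters in order: answer, consistency, call count."""
    answer = final_answer(trajectory)
    # Stage 1 (assumed exact match): the final answer must equal the gold answer.
    if answer is None or answer.lower() != gold.strip().lower():
        return False
    # Stage 2: an external judge checks step-by-step logical consistency.
    if not is_consistent(trajectory):
        return False
    # Stage 3: drop trajectories with fewer than `min_calls` tool calls.
    return count_tool_calls(trajectory) >= min_calls


def make_trajectory(n_calls: int, answer: str) -> str:
    """Build a toy tagged trajectory with n_calls tool calls (illustrative)."""
    step = ("<think>plan</think><tool_call>search</tool_call>"
            "<tool_response>result</tool_response>")
    return step * n_calls + f"<think>done</think><answer>{answer}</answer>"


always_ok = lambda _trajectory: True  # placeholder for the model-based check
good = keep_trajectory(make_trajectory(3, "11"), "11", always_ok)
too_short = keep_trajectory(make_trajectory(2, "11"), "11", always_ok)
```

Ordering the cheap string checks around the model-based check is a design choice of this sketch; the source only lists the three criteria.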

3. Two-Stage Post-Training: SFT Cold-Start + GRPO Reinforcement Learning

To give the model both tool proficiency and the ability to optimize its own decisions, WebWatcher first uses SFT as a cold start. On \(K\) high-quality trajectories, given image \(I^{(i)}\), question \(q^{(i)}\), and prior actions/observations \((t^{(i)}_{<l}, o^{(i)}_{<l})\), it maximizes the log-likelihood of the next correct action:

\[\max_{\theta} \sum_{i=1}^{K}\sum_{l=1}^{L_i} \log P_\theta\!\left(t^{(i)}_l \mid I^{(i)}, q^{(i)}, t^{(i)}_{<l}, o^{(i)}_{<l}\right).\]
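
As written, the objective sums log-probabilities over action steps \(t_l\) only; observations \(o_l\) appear as conditioning context. A minimal sketch of that bookkeeping follows, assuming each step's log-probability is the sum of its token log-probabilities. The `(role, token_logps)` representation and the function name are hypothetical, and the numbers are illustrative rather than model outputs.

```python
from typing import List, Tuple

# Hypothetical representation: each step is (role, per-token log-probabilities),
# where role is "action" or "observation".
Step = Tuple[str, List[float]]


def sft_objective(trajectories: List[List[Step]]) -> float:
    """Sum of log P(action step) over all trajectories; observations are skipped."""
    total = 0.0
    for steps in trajectories:
        for role, token_logps in steps:
            if role == "action":  # observation tokens are context, not targets
                total += sum(token_logps)
    return total


# Illustrative log-probabilities for one trajectory with two action steps.
toy = [[("action", [-0.5, -0.25]),
        ("observation", [-3.0]),
        ("action", [-0.25])]]
objective = sft_objective(toy)
loss = -objective  # maximizing the objective equals minimizing this loss
```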

Subsequently, GRPO RL is applied: for each problem, the current policy \(\pi_\theta\) samples a group \(G=\{\tau^{(1)},\dots,\tau^{(K)}\}\) of \(K\) trajectories (here \(K\) is the group size, not the number of SFT trajectories). Each trajectory's reward is centered on the group mean to give the intra-group relative advantage \(A_{rel}(\tau^{(i)}) = R^{(i)} - \frac{1}{K}\sum_{j=1}^{K} R^{(j)}\), which removes the need for a separate value function. The policy is optimized with a clipped surrogate loss \(L_{GRPO}\) (with importance sampling ratio \(\rho^{(i)}\), clipping threshold \(\epsilon\), and KL penalty coefficient \(\beta\)). Rewards are given only at the end of the episode, as a weighted combination of a format score \(r_f\in[0,1]\) and an LLM-graded semantic accuracy score \(r_a\in[0,1]\):

\[R = w\,r_f + (1-w)\,r_a, \quad w=0.2.\]

The low format weight \(w=0.2\) prioritizes answer accuracy while still rewarding well-structured tool usage. Each group contains 16 rollouts, balancing diversity against efficiency.
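
The reward and the group-relative advantage can be sketched in a few lines. The clipped term shown is the standard PPO-style form that GRPO builds on; the summary does not spell out \(L_{GRPO}\), so `clipped_term` and its default `eps=0.2` are assumptions, and the KL penalty is left out. The scores in the example are illustrative, not results from the paper.

```python
from typing import List


def reward(r_f: float, r_a: float, w: float = 0.2) -> float:
    """R = w * r_f + (1 - w) * r_a, with format score r_f and accuracy score r_a."""
    return w * r_f + (1.0 - w) * r_a


def relative_advantages(rewards: List[float]) -> List[float]:
    """A_rel(i) = R(i) - mean of R over the group; no value function needed."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one trajectory (assumed form, no KL term)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Illustrative group of four rollouts as (format score, accuracy score) pairs.
scores = [(1.0, 1.0), (1.0, 0.0), (1.0, 0.0), (1.0, 0.5)]
rewards = [reward(r_f, r_a) for r_f, r_a in scores]
advantages = relative_advantages(rewards)
```

Because the advantages are mean-centered, they sum to zero within a group: rollouts that beat the group average are reinforced and the rest are discouraged, even when every reward in the group is low.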

Key Experimental Results¶

Main Results¶

WebWatcher was evaluated on five high-difficulty benchmarks: HLE-VL, BrowseComp-VL (BC-VL), LiveVQA, MMSearch, and SimpleVQA.

Benchmark	Metric	WebWatcher-32B	Strong Baseline	Notes
HLE-VL	Avg	13.6	o4-mini 16.0 / Gemini-2.5-Pro 15.8	32B model approaches large models; 33.8 in Biology
BC-VL	Avg	27.0	o3 24.9 / OmniSearch 16.3	Multi-page browsing + fine-grained visual grounding
LiveVQA	Avg	58.7	o3 50.0	SOTA
MMSearch	Avg	55.3	o3 54.3	SOTA
SimpleVQA	Avg	59.0	o3 70.3	Competitive, but trails o3 on pure visual reasoning

WebWatcher-32B outperforms all compared methods on BC-VL (L1 28.4 / L2 25.0, Avg 27.0), LiveVQA, and MMSearch. The 7B version is also strong (BC-VL 21.2, LiveVQA 51.2). On HLE-VL, the 32B model (13.6) trails o4-mini (16.0) and Gemini-2.5-Pro (15.8), though it is only a 32B-parameter model.

Ablation Study¶

The ablation varies the number of tool calls required of the SFT trajectories: SFT is run on 8,000 sampled trajectories at different required call counts, and each resulting model is tested on HLE.

Tool Call Count	Best Pass@1	Average@3	Best Pass@3
=1	8.79	7.98	14.24
=2	10.61	9.90	18.18
=3	10.61	9.90	19.09
≥3	12.12	10.61	19.09
=5	9.70	9.49	16.58
=6	8.79	8.33	15.76

The \(\ge3\) setting gives the best Pass@1 (12.12) and Average@3 (10.61) and ties for the best Pass@3 (19.09), which supports the filtering threshold of at least three tool calls.

Key Findings¶

Agent surpasses humans on L2: Human accuracy on L2 (blurred entities) is only 18.0%, with many participants giving up after 100 minutes. WebWatcher-32B reaches 25.0% on L2 in 0.8 minutes on average, suggesting that agents are more patient and efficient at integrating under-determined information.
Humans still lead on L1: Humans achieve 33.2% vs. 28.4% for the agent, but take 35 minutes vs. 0.3 minutes.
Search vs. Balanced Tool Use: HLE requires balanced use of search, calculation, and reasoning tools, whereas BC-VL and MMSearch are dominated by retrieval tools.
Parameter Efficiency: The 32B model rivals closed-source models, highlighting the value of the "data + trajectory + training" pipeline.

Highlights & Insights¶

Moving BrowseComp to the Visual Domain: The combination of entity masking, real web images, and high information density forces cross-modal reasoning rather than string matching, making it a "cheat-proof" data construction method.
Trajectories Grounded in Real Tool Behavior: Using GPT-4o for trajectories with triple filtering (answer match + consistency + minimum calls) directly addresses the core issue of agents "guessing" the answer.
SFT + GRPO Combo: SFT fixes tool usage during cold-start, while GRPO handles credit assignment under sparse rewards, avoiding the instability of RL from scratch.
L2 Superhuman Performance: Agents do well on the hard problems that humans tend to abandon, which supports the value of deep research agents in information-dense, vaguely defined scenarios.

Limitations & Future Work¶

Heavy reliance on GPT-4o: QA synthesis, VQA conversion, trajectory labeling, and RL grading all depend on GPT-4o, creating a quality ceiling and high construction costs.
Lagging in pure visual reasoning: On SimpleVQA (59.0 vs. o3 70.3), the advantage of tool calling diminishes, revealing weaknesses in the base perception capability.
HLE scores trail specialized reasoning models: Cross-modal deep research may not beat pure reasoning models in logic-dense disciplines that need no external information.
Limited visual anchors: \(K=2\) images provide limited visual context for tasks requiring multi-image comparison or richer visual evidence.

Related Work & Insights¶

vs. Visual VL Agents: Those agents are strong in perception tools but cannot link vision with deep text inference. WebWatcher unifies five heterogeneous tools into the ReAct loop and uses RL for orchestration.
vs. Text-based Search Agents (e.g., WebSailor): This work extends the entity blurring concept into the visual domain, adding image search and OCR to solve tasks where the answer is hidden in images or requires interactive browsing.
vs. Template-driven Pipelines: Older methods are rigid. WebWatcher creates training data that reflects real reasoning through automatic annotation and triple filtering, allowing for significantly higher flexibility via SFT+GRPO.

Rating¶

Novelty: ⭐⭐⭐⭐ Extends "Deep Research Agents" to multimodal vision-language. The pipeline is systematic, though individual components (GRPO, ReAct) are established.
Experimental Thoroughness: ⭐⭐⭐⭐ Wide coverage across five benchmarks + human baselines. Strong verification at 7B/32B scales.
Writing Quality: ⭐⭐⭐⭐ Clear motivation built on a GAIA case. The pipeline is well structured.
Value: ⭐⭐⭐⭐⭐ Provides a trainable agent, a reusable data pipeline, and the BrowseComp-VL benchmark. Open-sourcing facilitates progress in multimodal research.