WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning

Conference: ACL2026
arXiv: 2602.12852
Code: https://github.com/AQ-MedAI/AntAFu-DeepResearch
Area: llm_agent
Keywords: Web Agent, Trajectory Pruning, State Graph, Tool-call Efficiency, Agent Training

TL;DR

WebClipper models long tool-call trajectories of Web Agents as "Action-Information" state graphs and extracts the Minimum Necessary DAG to prune cyclic searches and invalid branches. This allows Deep Research agents to reduce tool rounds by approximately 21% and tokens by 19.4% on average while maintaining or even improving accuracy.

Background & Motivation

Background: Deep Research Web Agents can process complex information retrieval tasks by repeatedly searching, visiting pages, running code, and generating answers. Existing open-source agents primarily pursue final accuracy, increasing coverage through longer contexts, deeper searches, and more tool calls.

Limitations of Prior Work: This "more searching is always better" strategy is costly in real-world deployments. Tongyi-DeepResearch allows up to 100 tool calls, and MiroThinker allows up to 600. Using commercial search, web parsing, or code execution tools significantly increases latency and cost. Furthermore, long trajectories do not guarantee higher accuracy, as many errors stem from repetitive verification, deviating from the main problem, or following noisy branches.

Key Challenge: Effective information in Web Agent trajectories is often sparsely distributed. The final answer depends only on a few key actions and observations, yet training data preserves all circuitous steps. Directly shortening trajectories risks breaking the ReAct structure, causing remaining thoughts to refer to deleted observations and producing incoherent training signals.

Goal: Instead of training a new agent from scratch, the authors aim to evolve existing high-performance but low-efficiency Web Agents into versions that require fewer tool calls. The specific goal is to remove redundant searches, cyclic verifications, and invalid branches without sacrificing accuracy, using a unified metric to balance accuracy and efficiency.

Key Insight: Agent trajectories can be abstracted into state graphs: actions produce information, and subsequent actions depend on existing information. Determining "which steps are truly necessary for the final answer" then becomes a problem of mining the minimum necessary subgraph, rather than asking another LLM to prune coarsely.

Core Idea: Use graph structures to explicitly represent information dependencies in tool-call trajectories. Mine the Minimum Necessary DAG from the initial question to the final answer. After coherence-aware rewriting and continued training, the agent learns shorter, more focused search paths.

Method

WebClipper is not an inference-time pruning method but a training-data processing and agent evolution framework. It distills raw trajectories from a strong agent, converts them into state graphs to identify the necessary action set, removes redundant steps, rewrites broken thoughts, and fine-tunes the original agent on the refined trajectories.

Overall Architecture

The input consists of queries and ReAct trajectories generated by a raw Web Agent, each containing initial observations, thoughts, actions, and new observations for each round. The output is a set of shorter trajectories that still support the correct answer, along with an evolved agent trained on these trajectories.

The process involves four stages. First, initial trajectories are collected and filtered, keeping samples that challenge the original agent but are not impossible. Second, trajectories are converted into directed bipartite graphs consisting of Action nodes and Information nodes. Third, an approximate Minimum Necessary DAG is found to identify the set of essential actions. Fourth, coherence-aware thought rewriting is applied to pruned trajectories, followed by agent training using efficiency-oriented or hybrid strategies.
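The second stage can be pictured with a small sketch. This is only an illustration under stated assumptions: the `Round` class, the `uses`/`produces` fields, and the `A1`/`I1` naming scheme are hypothetical, and the dependency lists are written by hand here, whereas the paper obtains them from an LLM extractor.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Round:
    """One ReAct round. `uses` and `produces` are hand-written stand-ins
    for what an LLM extractor would return (an assumption of this sketch)."""
    index: int
    uses: List[str]                                      # information the action relies on
    produces: List[str] = field(default_factory=list)    # atomic information it yields


def build_state_graph(rounds: List[Round]) -> List[Tuple[str, str]]:
    """Return the directed edges of the Action-Information graph."""
    edges = []
    for r in rounds:
        action = f"A{r.index}"
        for info in r.uses:
            edges.append((info, action))    # I -> A: action is based on this information
        for info in r.produces:
            edges.append((action, info))    # A -> I: action produces this information
    return edges


def is_bipartite(edges: List[Tuple[str, str]]) -> bool:
    """Every edge must connect an Action node to an Information node."""
    return all(u[0] != v[0] for u, v in edges)


# "I0" stands for the initial query; round 4 is the final answer action.
rounds = [
    Round(1, uses=["I0"], produces=["I1"]),
    Round(2, uses=["I1"], produces=["I2"]),   # side branch nothing later depends on
    Round(3, uses=["I1"], produces=["I3"]),
    Round(4, uses=["I3"]),
]
edges = build_state_graph(rounds)
```

In this toy trajectory, `I2` has no outgoing edge, which is the kind of dead-end branch the later pruning stage is meant to drop.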

Key Designs

  1. Trajectory-to-State-Graph:

    • Function: Converts linear tool calls into a graph structure for dependency analysis.
    • Mechanism: Action nodes \(A_t\) represent the thought and action of round \(t\); Information nodes \(I_t\) represent atomic information returned by the environment. If an action is based on information, an edge \(I \rightarrow A\) is added; if an action produces information, \(A \rightarrow I\) is added. The graph is constructed by an LLM extractor that decomposes observations into atomic information and identifies dependencies.
    • Design Motivation: Redundancy in long trajectories involves useless loops and branches in the information dependency chain. The graph structure makes these dependencies explicit, providing a finer basis for pruning than single-round LLM judgments.
  2. MNDAG Pruning & Majority Voting:

    • Function: Identifies the minimum set of necessary actions to support the final answer.
    • Mechanism: The initial query node is the source, and the final answer action is the sink. Action nodes have a cost of 1, and information nodes have a cost of 0. A Dijkstra-style search finds a minimum-cost path from source to sink, and a backward closure from the sink collects the precursors that path depends on. To handle LLM instability, the process is repeated three times, and the action set is accepted only if at least two of the three runs agree.
    • Design Motivation: Deleting turns based on semantic similarity or subjective judgment risks removing critical evidence. MNDAG makes the pruning goal explicit: preserve the necessary information pathways from question to answer.
  3. Coherence-aware Rewriting & Hybrid Evolutionary Training:

    • Function: Converts pruned trajectories into natural ReAct data for SFT while balancing efficiency and accuracy.
    • Mechanism: If two preserved actions were not adjacent in the original trajectory, the latter thought might refer to a deleted observation. WebClipper rewrites the thought using the full context and generates candidates, selecting the version that best fits the base model's style via perplexity. Training uses two strategies: Eff (pruned trajectories only) and Hybrid (mixing pruned and unpruned hard trajectories).
    • Design Motivation: Pursuing only brevity can harm complex task capabilities. Eff is suitable for cost-sensitive deployment, while Hybrid preserves long-chain capabilities for complex searches.
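The pruning and voting steps described above can be sketched as follows. This is a simplified reading, not the authors' implementation: it assumes an action needs every piece of information it depends on, while a piece of information needs only one producer (the cheapest by Dijkstra cost). The function names and the `A*`/`I*` node naming are hypothetical.

```python
import heapq
from collections import Counter


def node_cost(node):
    # Assumed naming: action nodes start with "A" (cost 1), information nodes with "I" (cost 0).
    return 1 if node.startswith("A") else 0


def dijkstra(succ, source):
    """Cheapest node-weighted cost from the source to every reachable node."""
    dist = {source: node_cost(source)}
    heap = [(dist[source], source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in succ.get(u, ()):
            nd = d + node_cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def necessary_actions(edges, source, sink):
    """Backward closure from the sink, guided by forward Dijkstra costs."""
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
        pred.setdefault(v, []).append(u)
    dist = dijkstra(succ, source)
    if sink not in dist:
        return frozenset()  # the answer is not supported by the query
    keep, stack = set(), [sink]
    while stack:
        node = stack.pop()
        if node in keep:
            continue
        keep.add(node)
        parents = [p for p in pred.get(node, []) if p in dist]
        if not parents:
            continue
        if node.startswith("A"):
            stack.extend(parents)  # an action keeps all information it relies on
        else:
            # information keeps only its cheapest producer (ties broken by name)
            stack.append(min(parents, key=lambda p: (dist[p], p)))
    return frozenset(n for n in keep if n.startswith("A"))


def majority_vote(candidate_sets, min_agree=2):
    """Accept an action set only if at least `min_agree` runs produced it."""
    best, count = Counter(candidate_sets).most_common(1)[0]
    return best if count >= min_agree else None


# Toy graph: A2 is a dead-end branch, A4 re-verifies I3 (a cycle I3 -> A4 -> I3).
EDGES = [
    ("I0", "A1"), ("A1", "I1"),
    ("I1", "A2"), ("A2", "I2"),
    ("I1", "A3"), ("A3", "I3"),
    ("I3", "A4"), ("A4", "I3"),
    ("I3", "A5"),
]
kept = necessary_actions(EDGES, source="I0", sink="A5")
```

On this toy graph the dead-end action `A2` and the re-verification action `A4` fall outside the closure. In practice each of the three runs would use a freshly extracted graph, and `majority_vote` would be applied to the three resulting sets.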

Loss & Training

The objective is to maximize the likelihood of trajectories under the agent model \(M\). Efficiency-oriented training optimizes \(L_{eff}=-\sum_{\tilde{\tau}}\log P_M(\tilde{\tau})\), where \(\tilde{\tau}\) ranges over pruned trajectories. Hybrid training optimizes \(L_{hybrid}=-\sum_{\tau^*}\log P_M(\tau^*)\), where \(\tau^*\) ranges over a mix of pruned and unpruned hard trajectories. Evaluation uses the F-AE Score, the harmonic mean of accuracy and efficiency \(E=1-Rounds/Max\_Rounds\): \(F\text{-}AE=2\times Acc\times E/(Acc+E)\), with \(Max\_Rounds=100\).
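As a worked check of the F-AE definition, the short sketch below recomputes a few scores from the accuracy and round counts reported in the main results. The function name `f_ae` is hypothetical, and clamping efficiency at zero when rounds exceed the cap is an assumption of this sketch, not something the summary specifies.

```python
def f_ae(acc: float, rounds: float, max_rounds: float = 100.0) -> float:
    """Harmonic mean of accuracy and efficiency E = 1 - rounds / max_rounds."""
    # Assumption: efficiency is clamped at 0 if rounds exceed max_rounds.
    efficiency = max(0.0, 1.0 - rounds / max_rounds)
    if acc + efficiency == 0.0:
        return 0.0
    return 2.0 * acc * efficiency / (acc + efficiency)


# (accuracy, average rounds) pairs taken from the main results table.
xbench_base = round(f_ae(0.713, 14.26), 3)       # Tongyi-DeepResearch on xbench
xbench_eff = round(f_ae(0.713, 10.81), 3)        # WebClipper (Eff) on xbench
browsecomp_base = round(f_ae(0.410, 63.70), 3)   # Tongyi-DeepResearch on Browsecomp
```

With identical accuracy on xbench (0.713), the drop from 14.26 to 10.81 rounds alone moves F-AE from 0.779 to 0.792, which shows how the metric rewards efficiency only when accuracy is held.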

Key Experimental Results

Main Results

Using Tongyi-DeepResearch as the base agent, evaluation was conducted on xbench-deepsearch, Browsecomp, GAIA, and HLE.

Each cell reports Acc / F-AE / Rounds.

| Method | xbench | Browsecomp | GAIA | HLE |
| --- | --- | --- | --- | --- |
| Tongyi-DeepResearch | 0.713 / 0.779 / 14.26 | 0.410 / 0.385 / 63.70 | 0.682 / 0.733 / 20.56 | 0.358 / 0.487 / 23.92 |
| WebClipper (Eff) | 0.713 / 0.792 / 10.81 | 0.427 / 0.431 / 56.50 | 0.684 / 0.760 / 14.44 | 0.353 / 0.492 / 18.60 |
| WebClipper (Hybrid) | 0.733 / 0.797 / 12.57 | 0.467 / 0.428 / 60.42 | 0.695 / 0.744 / 19.92 | 0.361 / 0.495 / 21.07 |

Compared to the original model, the Eff version reduces tool calls by ~21% and tokens by 19.4% on average. Its accuracy matches or exceeds the base agent on xbench, Browsecomp, and GAIA, and dips slightly on HLE (0.358 to 0.353). The Hybrid version increases average accuracy by ~4.8% and still reduces rounds by ~7%.

Each benchmark cell reports Acc / Rounds / Token.

| Pruning/Training Method | xbench | Browsecomp | Main Conclusion |
| --- | --- | --- | --- |
| Prompt Control | 0.676 / 12.50 / 6321 | 0.373 / 62.80 / 12222 | Constraints via prompts reduce rounds minimally and hurt accuracy |
| Coarse Prune | 0.603 / 8.85 / 4774 | 0.220 / 37.10 / 8365 | Coarse pruning shortens trajectories but severely damages accuracy |
| WebClipper (Eff) | 0.713 / 10.81 / 5931 | 0.427 / 56.50 / 10599 | Maintains accuracy while significantly reducing calls |
| WebClipper (Hybrid) | 0.733 / 12.57 / 6205 | 0.467 / 60.42 / 11507 | Highest accuracy, efficiency still better than original |

Ablation Study

Ablations verified the graph pruning, PPL-based rewrite selection, and context-aware selective rewriting. Removing any component led to degradation, with context-aware rewriting being critical to avoid contradictory training signals.

| Ablation | Modification | Observed Impact | Reason |
| --- | --- | --- | --- |
| w/o GP | Replace graph pruning with coarse pruning | Performance drop | Single LLM judgment struggles with long-range dependencies |
| w/o PPL-S | No PPL-based candidate selection | Performance drop | Rewritten style mismatches base model, causing distribution shift |
| w/o CSR | Rewriting all thoughts without history | Most severe degradation | Rewriting breaks logic and introduces inconsistent references |
| Unpruned-Distill | SFT with unpruned hard trajectories | Acc may rise, but rounds increase | Amplifies base agent capabilities but reinforces inefficiency |

Key Findings

  • WebClipper (Eff) is particularly effective on GAIA, reducing rounds from 20.56 to 14.44 (~30%). Pruning prevents over-reliance on external tools for logic-based tasks.
  • The F-AE Score does not reward short trajectories in isolation; Kimi models have low rounds but low accuracy, resulting in low F-AE scores.
  • Case studies show baselines often get distracted by secondary details, while WebClipper learns to stay on the critical path.

Highlights & Insights

  • Shifting trajectory compression from "text deletion" to "preserving dependency paths" is more suitable for tool-call scenarios than standard CoT compression.
  • The MNDAG cost design (Action=1, Information=0) is intuitive: tool calls are expensive, whereas information serves as evidentiary support.
  • Majority voting is a pragmatic engineering detail to handle LLM extractor instability and ensure high-confidence pruning.
  • Coherence-aware rewriting prevents models from learning "hallucinating references" from broken logic chains.

Limitations & Future Work

  • WebClipper is bounded by the base agent's capacity; it removes redundancy but does not discover new strategies for failed tasks.
  • Evaluation focuses on search, web access, and code. Applicability to multimodal tools or enterprise APIs requires further validation.
  • Offline processing is heavy; constructing trajectories with Qwen3-235B on 8×H800 took about one day.
  • F-AE depends on the \(Max\_Rounds\) setting. Real systems may need to consider token pricing and latency explicitly.

Related Work & Insights

  • vs Deep Research / WebExplorer: Those focus on search depth and data synthesis; WebClipper focuses on efficiency evolution.
  • vs Prompt Control: Prompts provide weak constraints; training data modifications lead to more stable search patterns.
  • vs CoT Compression: CoT methods compress reasoning chains, while WebClipper handles ReAct trajectories requiring environment consistency.
  • Future Insights: State graphs could be extended to online memory graphs to prune branches during inference or used for RL-based rewards to penalize invalid calls.

Rating

  • Novelty: ⭐⭐⭐⭐☆ (Clear formalization of trajectory pruning via state graphs and MNDAG).
  • Experimental Thoroughness: ⭐⭐⭐⭐☆ (Covers 4 benchmarks and multiple baselines).
  • Writing Quality: ⭐⭐⭐⭐☆ (Complete methodology and a clear explanation of the metric).
  • Value: ⭐⭐⭐⭐⭐ (Very practical for reducing costs in long-range Web Agent deployments).