SynthAgent: Adapting Web Agents with Synthetic Supervision¶
Conference: ACL 2026
arXiv: 2511.06101
Code: GitHub
Area: LLM Agent / Web Agent Adaptation
Keywords: Synthetic Data, Web Agent, Dual Refinement, Categorized Exploration, Trajectory Quality
TL;DR¶
This paper proposes SynthAgent, a web agent adaptation framework built entirely on synthetic supervision. It employs categorized exploration to systematically cover functional regions of webpages for diverse task synthesis, followed by a dual refinement strategy—task refinement (conflict-triggered correction of hallucinations) and trajectory refinement (global-context denoising)—to improve synthetic data quality. SynthAgent significantly outperforms existing synthetic methods on WebArena and Online-Mind2Web.
Background & Motivation¶
Background: LLM-driven web agents demonstrate strong web interaction capabilities on standardized benchmarks, yet performance degrades sharply when deployed to unseen websites. Adapting to new environments requires environment-specific tasks and demonstration data, but manual annotation is costly and does not scale.
Limitations of Prior Work: (1) Self-Instruct prompts LLMs to "imagine" tasks without environmental grounding, resulting in simple and repetitive tasks; (2) OS-Genesis synthesizes tasks by back-translating from single-step observations, where insufficient context leads to substantial hallucinations (referencing non-existent elements or states); (3) Explorer continuously refines tasks during execution, but frequent intent changes (8.6 times on average) cause 68.3% of trajectories to exceed the step budget.
Key Challenge: Task synthesis requires environmental grounding to avoid hallucinations, yet over-grounding tasks during execution introduces trajectory noise—a fundamental design tension in synthetic supervision.
Goal: To design a fully synthetic supervision framework that efficiently adapts web agents to new environments without human involvement or test-set leakage.
Key Insight: Decouple task refinement and trajectory refinement into two complementary stages: task refinement ensures feasibility but introduces noise, while trajectory refinement subsequently eliminates that noise.
Core Idea: Dual refinement—during execution, tasks are refined only when explicit conflicts are detected (conflict-triggered rather than continuous); after execution, trajectories are refined using global context. This simultaneously guarantees task feasibility and trajectory quality.
Method¶
Overall Architecture¶
SynthAgent consists of four stages:
1. Categorized Exploration for Task Synthesis: webpage elements are grouped by function, interaction triples \((o_t, a_t, o_{t+1})\) are uniformly sampled, and an LLM proposes multi-step tasks grounded in real interface transitions.
2. Conflict-Triggered Task Refinement: conflicts between the task and the current observation are detected during trajectory collection, and refinement is invoked only when a conflict is triggered (2.0 times per trajectory on average).
3. Global Trajectory Refinement: noisy and misaligned actions are removed post-hoc using the complete trajectory and the final task \(\tau^{\star}\).
4. Agent Fine-Tuning: open-source models are fine-tuned via SFT on the refined synthetic data. The training objective is the standard autoregressive cross-entropy \(\mathcal{L}_{\text{SFT}} = -\sum_{t} \log p_\theta(a_t \mid \tau^{\star}, h_t)\).
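The SFT objective (cross-entropy over the agent's action tokens, conditioned on the task and recent history) can be sketched as a masked negative log-likelihood. All names and the toy inputs below are illustrative, not from the paper:

```python
# Minimal sketch of the SFT objective: average negative log-likelihood over
# supervised (action) tokens only; prompt/observation tokens are masked out.
import math

def sft_loss(token_logprobs, action_mask):
    """NLL averaged over tokens where action_mask is truthy."""
    losses = [-lp for lp, m in zip(token_logprobs, action_mask) if m]
    return sum(losses) / max(len(losses), 1)

# Toy example: three tokens, the first two belong to the action.
loss = sft_loss([math.log(0.5), math.log(0.25), math.log(0.9)], [1, 1, 0])
```

In a real fine-tuning run this masking is typically done by setting the labels of non-action tokens to an ignore index rather than filtering a Python list.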
Key Designs¶
- Categorized Exploration:
- Function: Systematically covers functional regions of webpages to improve task diversity.
- Mechanism: On each page \(o_t\), an LLM categorizes interactive elements by semantic role (e.g., "account management," "search filters"), and at most 2 unvisited elements per category are sampled for interaction. Per-category sampling budgets prevent any single dense region from dominating exploration.
- Design Motivation: Random exploration (as in OS-Genesis) repeatedly visits redundant elements while missing important functional regions. Categorized exploration reframes this as a function-aware coverage problem, discovering an average of 6.0 functional categories per page.
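The per-category sampling budget described above can be sketched as follows. The LLM-produced category labels are assumed as inputs here; element IDs and field names are illustrative:

```python
# Hedged sketch of categorized exploration: pick at most `per_category_budget`
# unvisited elements from each semantic category, so no single dense region
# (e.g. a long filter sidebar) dominates exploration.
import random

def sample_exploration_targets(elements, visited, per_category_budget=2, seed=0):
    rng = random.Random(seed)
    by_category = {}
    for elem in elements:
        if elem["id"] not in visited:
            by_category.setdefault(elem["category"], []).append(elem)
    targets = []
    for _category, members in sorted(by_category.items()):
        k = min(per_category_budget, len(members))
        targets.extend(rng.sample(members, k))
    return targets

elements = [
    {"id": 1, "category": "search filters"},
    {"id": 2, "category": "search filters"},
    {"id": 3, "category": "search filters"},
    {"id": 4, "category": "account management"},
]
# Element 4 was already visited, so only "search filters" remains,
# and at most 2 of its 3 elements are sampled.
picked = sample_exploration_targets(elements, visited={4})
```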
- Conflict-Triggered Task Refinement:
- Function: Corrects hallucinations in tasks while minimizing disruption to trajectories.
- Mechanism: A lightweight conflict predicate is defined as \(\mathcal{C}(h_t, \tau_t) = \neg\textsf{ExistsUI} \vee \textsf{MissingArgs} \vee \textsf{Stall}\), detecting absent UI elements, missing arguments, and execution stalls, respectively. LLM refinement is invoked only upon conflict, following four principles: specify missing details, align with actual observations, reduce scope, and preserve category.
- Design Motivation: Unlike Explorer's continuous per-step refinement (8.6 times/trajectory on average, 68.3% timeout rate), conflict-triggered refinement averages only 2.0 invocations/trajectory with a 6.3% timeout rate—initial tasks are sufficiently specified, and refinement focuses on correcting rather than redefining intent.
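The conflict predicate \(\mathcal{C}(h_t, \tau_t) = \neg\textsf{ExistsUI} \vee \textsf{MissingArgs} \vee \textsf{Stall}\) can be encoded directly. The three signal extractions below (element visibility, argument completeness, stall counting) are stand-ins for checks against the live observation, not the paper's implementation:

```python
# Illustrative encoding of C(h_t, tau_t) = ¬ExistsUI ∨ MissingArgs ∨ Stall.
# Refinement is invoked only when this predicate fires.

def conflict(task_elements, visible_elements, task_args, stall_steps, stall_limit=3):
    exists_ui = all(e in visible_elements for e in task_elements)  # ExistsUI
    missing_args = any(v is None for v in task_args.values())      # MissingArgs
    stall = stall_steps >= stall_limit                             # Stall
    return (not exists_ui) or missing_args or stall

fires = conflict(
    task_elements={"search_box"},
    visible_elements={"search_box", "login"},
    task_args={"query": None},   # missing argument -> conflict fires
    stall_steps=0,
)
```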
- Trajectory Refinement with Global Context:
- Function: Post-hoc removal of noisy and misaligned segments from trajectories.
- Mechanism: The complete trajectory \(h_T\) and final task \(\tau^{\star}\) are reviewed offline. Four editing operations are applied: Remove(i) deletes irrelevant/redundant steps, Reorder(i,j) swaps commutable steps, Drop(\(h_T\)) discards excessively noisy trajectories, and Keep(\(h_T\)) retains high-quality ones. The design deliberately favors precision—uncertain swaps are rejected rather than risking causal dependency violations.
- Design Motivation: Noise introduced by task refinement requires a post-hoc global perspective to clean up. Reorder accounts for only 4.1% of editing operations, yet reordered trajectories achieve higher preference win rates (42% vs. 27%).
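The four editing operations can be sketched as a small interpreter. The operation names follow the paper (Remove, Reorder, Drop, Keep), but the edit-application logic below is a stand-in: in SynthAgent an LLM selects the edits from the global trajectory context.

```python
# Sketch of the offline trajectory-editing pass.

def apply_edits(trajectory, edits):
    """Apply Remove/Reorder edits in order; Drop discards the whole
    trajectory, Keep leaves it unchanged."""
    steps = list(trajectory)
    for op in edits:
        if op[0] == "keep":
            continue
        if op[0] == "drop":
            return None  # trajectory too noisy to salvage
        if op[0] == "remove":
            steps[op[1]] = None  # mark now, delete later so indices stay stable
        elif op[0] == "reorder":
            i, j = op[1], op[2]
            steps[i], steps[j] = steps[j], steps[i]
    return [s for s in steps if s is not None]

# Remove the noisy step, then swap two commutable steps back into order.
cleaned = apply_edits(["a", "noise", "c", "b"], [("remove", 1), ("reorder", 2, 3)])
```

Marking removed steps as `None` instead of deleting in place mirrors the precision-first design: later edits still refer to the original indices, so an uncertain edit cannot silently shift its target.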
Loss & Training¶
Standard SFT paradigm with a history context window of 3 recent steps. Up to 500 task-trajectory pairs are synthesized per website, and a single model is trained on data mixed from five websites (learning rate 1e-5, batch size 32, 3 epochs).
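The reported hyperparameters can be collected into a single config sketch; fields not stated in the notes (optimizer, warmup schedule, sequence length) are deliberately omitted rather than guessed:

```python
# Hedged reconstruction of the fine-tuning setup as reported above.
SFT_CONFIG = {
    "history_window": 3,        # recent steps kept in the prompt context
    "max_pairs_per_site": 500,  # synthesized task-trajectory pairs per website
    "num_websites": 5,          # training data mixed from five websites
    "learning_rate": 1e-5,
    "batch_size": 32,
    "epochs": 3,
}
```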
Key Experimental Results¶
Main Results¶
WebArena (5 websites) — Qwen2.5-VL-7B backbone
| Method | Training Data | Shopping | CMS | Reddit | Gitlab | Maps | Overall |
|---|---|---|---|---|---|---|---|
| Base Qwen | - | 13.71 | 8.24 | 9.43 | 6.18 | 5.50 | 8.80 |
| +Self-Instruct | Synthetic | 18.18 | 8.77 | 3.85 | 12.50 | 9.38 | 11.50 |
| +OS-Genesis | Synthetic | 14.55 | 10.53 | 11.54 | 16.07 | 12.50 | 13.27 |
| +Explorer | Synthetic | 10.91 | 3.51 | 0.00 | 1.82 | 3.12 | 4.44 |
| +SynthAgent | Synthetic | 20.00 | 21.05 | 15.38 | 19.64 | 28.12 | 20.80 |
Online-Mind2Web (136 real websites)
| Method | GPT-4.1 Judge | GPT-5.1 Judge | WebJudge | Average |
|---|---|---|---|---|
| Self-Instruct | 17.67 | 13.00 | 19.67 | 16.78 |
| OS-Genesis | 19.53 | 11.00 | 19.33 | 16.62 |
| SynthAgent | 31.67 | 15.67 | 23.33 | 23.56 |
Ablation Study¶
| Configuration | Overall | Change |
|---|---|---|
| SynthAgent (full) | 20.80 | — |
| w/o Categorized Exploration | 17.26 | −3.54 |
| w/o Task Refinement | 15.93 | −4.87 |
| w/o Trajectory Refinement | 16.81 | −3.99 |
| w/o Dual Refinement | 15.93 | −4.87 |
Key Findings¶
- Explorer performs below even the base model: its continuous refinement produces excessively long, misaligned trajectories that act as negative supervision.
- Synthetic data quality: SynthAgent trajectories score 82.6, far exceeding Explorer (36.4) and OS-Genesis (52.0).
- SynthAgent achieves a trajectory completion rate of 96.5% vs. Explorer's 30.5%, at lower API cost ($0.13 vs. $0.22 per trajectory).
- Performance further improves on the stronger Qwen3 backbone (15.93→24.34), validating the model-agnostic nature of the approach.
Highlights & Insights¶
- The core design insight—that task refinement and trajectory refinement are complementary—is precisely motivated: the former ensures feasibility but introduces noise, while the latter eliminates it.
- The contrast between conflict-triggered and continuous refinement reveals a key design principle: the quality of initial tasks determines the appropriate refinement strategy.
- Categorized exploration transforms random exploration into a structured coverage problem—a simple yet effective contribution.
Limitations & Future Work¶
- Validation is limited to offline and restricted online environments; synthesis on real, actively changing websites remains unexplored.
- Task and trajectory synthesis rely entirely on GPT-4.1; more advanced LLMs or hyperparameter optimization are not investigated.
- Only standard SFT is employed; more advanced training paradigms such as DPO or online RL are not explored.
Related Work & Insights¶
- vs. OS-Genesis: OS-Genesis synthesizes tasks from single-step observations, leading to hallucinations; SynthAgent addresses this via categorized exploration and conflict-triggered refinement.
- vs. Explorer: Explorer's continuous refinement causes intent drift and excessively long trajectories; SynthAgent's conflict-triggered refinement preserves intent consistency.
- vs. AgentTrek: AgentTrek relies on offline tutorials that may be outdated; SynthAgent synthesizes data by directly interacting with the environment.
Rating¶
- Novelty: ⭐⭐⭐⭐ The synergistic design of dual refinement and the conflict-triggered mechanism constitute clear contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Two benchmarks, multiple backbones, detailed ablations, data quality analysis, and scaling experiments.
- Writing Quality: ⭐⭐⭐⭐⭐ Problem motivation is clearly articulated with in-depth analysis of the core design tension.
- Value: ⭐⭐⭐⭐ Provides a practical, high-quality synthetic data solution for unsupervised web agent adaptation.