Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Conference: NeurIPS 2025
arXiv: 2505.15277
Code: Available
Area: Agent
Keywords: process reward model, Web Agent, Checklist-based Evaluation, Step-level Reward, web navigation
TL;DR
This paper proposes Web-Shepherd, the first process reward model (PRM) specifically designed for web navigation. By decomposing task objectives into evaluable sub-goals via checklists, 3B/8B models achieve trajectory accuracy far surpassing GPT-4o (85% vs. 10%) at only 1/10 of the cost, making reinforcement learning and inference-time search for web agents practically feasible.
Background & Motivation
Background: MLLM-driven web navigation agents suffer from reliability issues, frequently falling into repetitive action loops and struggling with goal-directed planning across multiple steps.
Limitations of Prior Work: ① Binary reward signals (success/failure) are sparse and yield low learning efficiency; ② Using GPT-4o as an evaluator is prohibitively expensive ($14,000 and 40 A100 hours to evaluate 812 queries), making it unsuitable for practical deployment.
Key Challenge: While process reward models (PRMs) have proven successful in mathematical reasoning, no dedicated PRM exists for web navigation. Outcome reward models (ORMs) are also ill-suited: irreversible web actions such as a wrongly booked flight cannot simply be refunded, so errors must be caught mid-process rather than judged only at the end.
Goal: ① Construct a web navigation-specific PRM; ② Build training data and an evaluation benchmark; ③ Achieve low-cost, high-accuracy step-level reward evaluation.
Key Insight: High-level user instructions are decomposed into structured sub-goals via checklists, making step-level reward evaluation more reliable and interpretable.
Core Idea: Checklist-based sub-goal decomposition combined with NTP-trained step-level reward modeling yields a low-cost, high-accuracy PRM for web agents.
Method
Overall Architecture
A two-stage architecture: ① Checklist generation — decomposing user instructions into 3–5 sub-goals; ② Checklist-based reward modeling — evaluating the completion degree of each sub-goal at every action step, outputting a continuous reward score and textual feedback.
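The two-stage flow can be sketched as a single evaluation function. This is a hedged, minimal sketch: `generate_checklist` and `judge_item` are hypothetical stubs standing in for the LLM components described in the paper, not its actual API.

```python
# Hedged sketch of the two-stage Web-Shepherd flow; generate_checklist and
# judge_item are hypothetical stubs for the LLM components.

def evaluate_step(instruction, observation, action, generate_checklist, judge_item):
    """Stage 1: decompose the instruction into sub-goals; Stage 2: judge the
    action against each sub-goal and average into one continuous reward."""
    subgoals = generate_checklist(instruction)          # ["g_1", ..., "g_k"]
    feedback, scores = [], []
    for g in subgoals:
        fb, score = judge_item(g, observation, action)  # (text feedback, score in [0, 1])
        feedback.append(fb)
        scores.append(score)
    return sum(scores) / len(scores), feedback

# Toy usage with stub components
reward, fb = evaluate_step(
    "Book a flight", "search page", "click('Search')",
    generate_checklist=lambda instr: ["reach search page", "enter details"],
    judge_item=lambda g, o, a: ("ok", 1.0 if "search" in g else 0.5),
)
```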
Key Designs
- Checklist Generation:
  - Function: Automatically decomposes high-level user instructions into evaluable intermediate milestones \((g_1, g_2, \cdots, g_k)\)
  - Mechanism: Employs coarse-grained sub-goals (e.g., abstracting "filter A + filter B" as "filtering") to reduce website-specific bias and sensitivity to action ordering
  - Design Motivation: Fine-grained checklist items lead to poor cross-policy generalization; coarse granularity (3–5 items) matches task complexity and accommodates diverse execution paths
- Checklist-based Reward Modeling:
  - Function: Trains the model under a next-token-prediction (NTP) objective to jointly generate feedback \(F\) and judgment \(J\)
  - Mechanism: \(r_k(o,a) = \frac{1}{L}\sum_{l} \left[P(\text{"Yes"}) + 0.5 \times P(\text{"In Progress"})\right]\), with final reward \(r(o,a) = \frac{1}{K}\sum_{k} r_k(o,a)\); soft probabilities are extracted from the logits via a verbalizer
  - Design Motivation: Continuous rewards provide finer-grained learning signals than binary rewards; the generated feedback enhances interpretability
- WebPRM Collection Dataset Construction:
  - Function: Constructs a large-scale annotated dataset of 40K step-level preference pairs
  - Mechanism: Professional annotators collect tasks across three difficulty levels (easy ≈5 steps / medium ≈9 steps / hard ≈20 steps); checklists are generated by GPT-4o and verified by humans; rejected actions are sampled from 5 candidate policies
  - Design Motivation: WebRewardBench is released alongside the dataset as the first meta-evaluation benchmark for web navigation PRMs, accelerating future research
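The per-item reward \(r_k\) and its aggregation into \(r(o,a)\) can be sketched in a few lines. This is an illustrative sketch only: the verbalizer probabilities \(P(\text{"Yes"})\) and \(P(\text{"In Progress"})\) are hard-coded here, whereas in the paper they are extracted from the model's logits.

```python
# Sketch of the checklist-based reward aggregation. Each sampled judgment is a
# dict of verbalizer probabilities; the paper extracts these from logits.

def item_reward(samples):
    """r_k(o, a): average over L sampled judgments of P(Yes) + 0.5 * P(In Progress)."""
    return sum(s.get("Yes", 0.0) + 0.5 * s.get("In Progress", 0.0)
               for s in samples) / len(samples)

def step_reward(per_item_samples):
    """r(o, a): mean of r_k over the K checklist items."""
    return sum(item_reward(s) for s in per_item_samples) / len(per_item_samples)

# K = 2 checklist items, L = 2 sampled judgments each (toy numbers)
samples = [
    [{"Yes": 0.9, "In Progress": 0.1}, {"Yes": 0.8, "In Progress": 0.2}],
    [{"Yes": 0.0, "In Progress": 0.6}, {"Yes": 0.1, "In Progress": 0.5}],
]
reward = step_reward(samples)   # item rewards 0.925 and 0.325 -> 0.625
```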
Loss & Training
NTP loss \(\mathcal{L} = -\sum_t \log P_\theta(y_t|y_{<t},C,o,a)\), where \(y=[F;J]\). Base models are Qwen2.5-3B/Qwen3-8B, fine-tuned with LoRA for 3 epochs.
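The NTP objective above reduces to a sum of per-token negative log-probabilities over the concatenated target \(y = [F; J]\). A minimal sketch, assuming the per-token probabilities \(P_\theta(y_t \mid y_{<t}, C, o, a)\) have already been computed by the model:

```python
import math

# Illustrative NTP loss over the target y = [F; J] (feedback tokens followed by
# judgment tokens); token_probs holds P_theta(y_t | y_<t, C, o, a) per position.

def ntp_loss(token_probs):
    """L = -sum_t log P_theta(y_t | y_<t, C, o, a)."""
    return -sum(math.log(p) for p in token_probs)

probs = [0.9, 0.8, 0.95]   # hypothetical per-token probabilities for [F; J]
loss = ntp_loss(probs)     # smaller when the model assigns y high probability
```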
Key Experimental Results
Main Results (WebRewardBench)
| Model | Mind2Web MRR | WebArena Trajectory Acc. | Cross-domain Acc. |
|---|---|---|---|
| GPT-4o (text+image+checklist) | 62.4 | 10.0% | 6.6% |
| Qwen-2.5-VL-72B | 52.9 | 0.0% | 2.5% |
| Web-Shepherd (3B) | 87.6 | 60.0% | 47.1% |
| Web-Shepherd (8B) | 88.3 | 85.0% | 61.2% |
The 8B model achieves 85% trajectory accuracy, vastly outperforming GPT-4o's 10%.
Ablation Study (WebArena-lite Tree Search)
| Policy | Reward Model | Success Rate | Gain |
|---|---|---|---|
| GPT-4o-mini | No search | 23.64% | - |
| GPT-4o-mini | GPT-4o-mini PRM | 24.24% | +0.6% |
| GPT-4o-mini | Web-Shepherd (8B) | 34.55% | +10.9% |
| GPT-4o | No search | 31.52% | - |
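One expansion step of this reward-guided search can be sketched as follows: the policy proposes candidate actions and the PRM scores each one, with the best-scoring action executed. `propose` and `prm_score` are hypothetical stand-ins for the policy LLM and Web-Shepherd, not the paper's actual interfaces.

```python
# One expansion step of reward-guided search: score every candidate action
# with the PRM and greedily pick the best one.

def select_action(observation, propose, prm_score):
    candidates = propose(observation)                        # policy proposals
    return max(candidates, key=lambda a: prm_score(observation, a))

# Toy usage with stub components
actions = ["click('Search')", "type('SFO')", "go_back()"]
scores = {"click('Search')": 0.4, "type('SFO')": 0.8, "go_back()": 0.1}
best = select_action("flight search page",
                     propose=lambda obs: actions,
                     prm_score=lambda obs, a: scores[a])
```

Full tree search would repeat this step per node and keep several branches; the greedy variant above is the simplest instance.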
Key Findings
- Web-Shepherd elevates a weak policy (GPT-4o-mini) beyond a strong policy (GPT-4o): 34.55% > 31.52%
- Cost revolution: 10× faster and 10× cheaper than GPT-4o, making RL and tree search for web agents practically viable
- Checklists are essential: Ablations show that removing checklists substantially degrades trajectory accuracy across all models
- Counter-intuitive finding on multimodal input: Adding images sometimes introduces noise and reduces performance
- Cross-domain generalization: Still achieves +6.37% improvement on entirely out-of-domain data (WorkArena)
Highlights & Insights
- First PRM for web navigation: Fills a critical gap and makes RL in web agents practically feasible
- Structured evaluation via checklists: Transforms subjective assessment into structured sub-goal verification, substantially improving reliability and interpretability
- Complete data + benchmark ecosystem: WebPRM Collection (40K) + WebRewardBench accelerates subsequent research
Limitations & Future Work
- Checklist generation relies on GPT-4o, so the cost of obtaining high-quality checklists remains non-trivial
- Experiments are conducted primarily in simulated environments (WebArena-lite); validation on real-world, complex webpages is insufficient
- The 40K annotation scale remains modest compared to PRMs for mathematical reasoning
- The scalability of step-level evaluation over all candidate actions on very long trajectories requires further investigation
Related Work & Insights
- vs. GPT-4o-as-judge: cost $14K/40h vs. ~$1.4K/4h, trajectory accuracy 10% vs. 85%; comprehensively outperformed
- vs. mathematical reasoning PRMs: Checklist design for web navigation and multimodal observation handling distinguish this work from math-domain PRMs
- vs. ORMs: ORMs evaluate only final outcomes and are unsuitable for irreversible web actions (e.g., flight booking); PRMs enable mid-process error correction
Rating
- Novelty: ⭐⭐⭐⭐⭐ First PRM for web navigation; the unified contribution of checklist design, dataset, and benchmark fills a critical gap
- Experimental Thoroughness: ⭐⭐⭐⭐ Three-scenario coverage, tree-search validation, and complete ablations; lacks validation on real-world complex webpages
- Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, intuitive presentation of step-level reward computation, well-motivated throughout
- Value: ⭐⭐⭐⭐⭐ The 10× cost/speed advantage directly addresses deployment bottlenecks and provides a foundational contribution to web agent research