Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Conference: NeurIPS 2025
arXiv: 2505.15277
Code: Available
Area: Agent
Keywords: process reward model, Web Agent, Checklist-based Evaluation, Step-level Reward, web navigation

TL;DR

This paper proposes Web-Shepherd, the first process reward model (PRM) designed specifically for web navigation. By decomposing task objectives into evaluable sub-goals via checklists, its 3B/8B models reach trajectory accuracy far beyond GPT-4o (85% vs. 10% for the 8B model) at roughly 1/10 of the cost, making reinforcement learning and inference-time search for web agents practically feasible.

Background & Motivation

Background: MLLM-driven web navigation agents suffer from reliability issues, frequently falling into repetitive action loops and struggling with goal-directed planning across multiple steps.

Limitations of Prior Work: ① Binary reward signals (success/failure) are sparse and yield low learning efficiency; ② Using GPT-4o as an evaluator is prohibitively expensive ($14,000 and 40 A100 hours to evaluate 812 queries), making it unsuitable for practical deployment.

Key Challenge: While process reward models (PRMs) have proven successful in mathematical reasoning, no dedicated PRM exists for web navigation. Outcome reward models (ORMs) are also ill-suited: web actions can be irreversible, and a user cannot simply be "refunded" after an agent completes several failed flight bookings, so errors must be caught mid-trajectory.

Goal: ① Construct a web navigation-specific PRM; ② Build training data and an evaluation benchmark; ③ Achieve low-cost, high-accuracy step-level reward evaluation.

Key Insight: High-level user instructions are decomposed into structured sub-goals via checklists, making step-level reward evaluation more reliable and interpretable.

Core Idea: Checklist-based sub-goal decomposition combined with NTP-trained step-level reward modeling yields a low-cost, high-accuracy PRM for web agents.

Method

Overall Architecture

A two-stage architecture: ① Checklist generation — decomposing the user instruction into 3–5 sub-goals; ② Checklist-based reward modeling — evaluating the degree of completion of each sub-goal at every action step, and outputting a continuous reward score along with textual feedback.

Key Designs

  1. Checklist Generation:

    • Function: Automatically decomposes high-level user instructions into evaluable intermediate milestones \((g_1, g_2, \cdots, g_k)\)
    • Mechanism: Employs coarse-grained sub-goals (e.g., abstracting "filter A + filter B" as "filtering") to reduce website-specific bias and sensitivity to action ordering
    • Design Motivation: Fine-grained checklist items lead to poor cross-policy generalization; coarse granularity (3–5 items) corresponds to task complexity and accommodates diverse execution paths
  2. Checklist-based Reward Modeling:

    • Function: Trains the model under an NTP objective to jointly generate feedback \(F\) and judgment \(J\)
    • Mechanism: \(r_k(o,a) = \frac{1}{L}\sum_l [P(\text{"Yes"}) + 0.5 \times P(\text{"In Progress"})]\), with final reward \(r(o,a) = \frac{1}{K}\sum_k r_k(o,a)\). Soft probabilities are extracted from logits via a Verbalizer
    • Design Motivation: Continuous rewards provide finer-grained learning signals than binary rewards; generated feedback enhances interpretability
  3. WebPRM Collection Dataset Construction:

    • Function: Constructs a large-scale annotation dataset of 40K step-level preference pairs
    • Mechanism: Professional annotators collect tasks across three difficulty levels (easy ≈5 steps / medium ≈9 steps / hard ≈20 steps); checklists are generated by GPT-4o and verified by humans; rejected (negative) actions are sampled from 5 candidate policies
    • Design Motivation: a standardized preference dataset also enables WebRewardBench, released alongside as the first meta-evaluation benchmark for web navigation PRMs, accelerating future research
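
The checklist-based reward in design ② can be sketched in a few lines. This is a minimal illustration of the formula \(r_k = \frac{1}{L}\sum_l [P(\text{"Yes"}) + 0.5 \times P(\text{"In Progress"})]\) and the final average over K checklist items; the function names and the three-token verbalizer dict are assumptions for illustration, not the paper's actual implementation, which extracts these probabilities from the PRM's output logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a dict of judgment-token logits."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def item_reward(judgment_logits):
    """Soft reward for one checklist item.

    judgment_logits: L sampled judgments, each a dict mapping the
    verbalizer tokens "Yes" / "In Progress" / "No" to their logits.
    Implements r_k = (1/L) * sum_l [P("Yes") + 0.5 * P("In Progress")].
    """
    total = 0.0
    for logits in judgment_logits:
        p = softmax(logits)
        total += p.get("Yes", 0.0) + 0.5 * p.get("In Progress", 0.0)
    return total / len(judgment_logits)

def step_reward(per_item_logits):
    """Final step reward r(o, a): mean of the K checklist-item rewards."""
    return sum(item_reward(j) for j in per_item_logits) / len(per_item_logits)
```

Note how "In Progress" contributes half credit, so the reward varies continuously between 0 and 1 rather than collapsing to a binary success signal.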

Loss & Training

NTP loss \(\mathcal{L} = -\sum_t \log P_\theta(y_t|y_{<t},C,o,a)\), where \(y=[F;J]\). Base models are Qwen2.5-3B/Qwen3-8B, fine-tuned with LoRA for 3 epochs.

Key Experimental Results

Main Results (WebRewardBench)

Model                           Mind2Web MRR   WebArena Trajectory Acc.   Cross-domain Acc.
GPT-4o (text+image+checklist)   62.4           10.0%                      6.6%
Qwen-2.5-VL-72B                 52.9           0.0%                       2.5%
Web-Shepherd (3B)               87.6           60.0%                      47.1%
Web-Shepherd (8B)               88.3           85.0%                      61.2%

The 8B model achieves 85% trajectory accuracy, vastly outperforming GPT-4o's 10%.

Policy        Reward Model        Success Rate   Gain
GPT-4o-mini   No search           23.64%         -
GPT-4o-mini   GPT-4o-mini PRM     24.24%         +0.6%
GPT-4o-mini   Web-Shepherd (8B)   34.55%         +10.9%
GPT-4o        No search           31.52%         -
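
The inference-time search used in this table amounts to reward-guided action selection: the policy proposes several candidate actions, and the PRM's step reward picks the best one. A minimal sketch, where the candidate strings and scores are hypothetical stand-ins for a real PRM call:

```python
def select_action(candidates, score_fn):
    """Best-of-n selection: return the candidate action with the highest
    PRM score. score_fn plays the role of the step reward r(o, a)
    (e.g. Web-Shepherd's checklist score for the current observation)."""
    return max(candidates, key=score_fn)

# Toy usage with made-up scores (not from the paper):
scores = {"click('Search')": 0.8, "type('dates')": 0.3, "go_back()": 0.1}
best = select_action(list(scores), scores.get)
```

Because Web-Shepherd is ~10× cheaper than GPT-4o as a scorer, running this selection at every step becomes affordable.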

Key Findings

  • Web-Shepherd elevates a weak policy (GPT-4o-mini) beyond a strong policy (GPT-4o): 34.55% > 31.52%
  • Cost revolution: 10× faster and 10× cheaper than GPT-4o, making RL and tree search for web agents practically viable
  • Checklists are essential: Ablations show that removing checklists substantially degrades trajectory accuracy across all models
  • Counter-intuitive finding on multimodal input: Adding images sometimes introduces noise and reduces performance
  • Cross-domain generalization: Still achieves +6.37% improvement on entirely out-of-domain data (WorkArena)

Highlights & Insights

  • First PRM for web navigation: Fills a critical gap and makes RL in web agents practically feasible
  • Structured evaluation via checklists: Transforms subjective assessment into structured sub-goal verification, substantially improving reliability and interpretability
  • Complete data + benchmark ecosystem: WebPRM Collection (40K) + WebRewardBench accelerates subsequent research

Limitations & Future Work

  • Checklist generation relies on GPT-4o, so the cost of obtaining high-quality checklists remains non-trivial
  • Experiments are conducted primarily in simulated environments (WebArena-lite); validation on real-world, complex webpages is insufficient
  • The 40K annotation scale remains modest compared to PRMs for mathematical reasoning
  • The scalability of step-level evaluation over all candidate actions on very long trajectories requires further investigation

Comparison with Prior Approaches

  • vs. GPT-4o-as-judge: cost $14K / 40 A100-hours vs. ~$1.4K / 4 hours; trajectory accuracy 10% vs. 85%. Web-Shepherd wins on both cost and accuracy
  • vs. mathematical reasoning PRMs: Checklist design for web navigation and multimodal observation handling distinguish this work from math-domain PRMs
  • vs. ORMs: ORMs evaluate only final outcomes and are unsuitable for irreversible web actions (e.g., flight booking); PRMs enable mid-process error correction

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First PRM for web navigation; the unified contribution of checklist design, dataset, and benchmark fills a critical gap
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three-scenario coverage, tree-search validation, and complete ablations; lacks validation on real-world complex webpages
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, intuitive presentation of step-level reward computation, well-motivated throughout
  • Value: ⭐⭐⭐⭐⭐ The 10× cost/speed advantage directly addresses deployment bottlenecks and provides a foundational contribution to web agent research