Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Conference: NeurIPS 2025
arXiv: 2505.15277
Code: Available
Area: Agent
Keywords: process reward model, Web Agent, Checklist-based Evaluation, Step-level Reward, web navigation

TL;DR

This paper proposes Web-Shepherd, the first process reward model (PRM) designed specifically for web navigation. By decomposing task objectives into evaluable sub-goals via checklists, its 3B/8B models reach trajectory accuracy far beyond GPT-4o (85% vs. 10% for the 8B model) at roughly 1/10 of the cost, making reinforcement learning and inference-time search for web agents practically feasible.

Background & Motivation

Background: MLLM-driven web navigation agents suffer from reliability issues, frequently falling into repetitive action loops and struggling with goal-directed planning across multiple steps.

Limitations of Prior Work: ① Binary reward signals (success/failure) are sparse and yield low learning efficiency; ② Using GPT-4o as an evaluator is prohibitively expensive ($14,000 and 40 A100 hours to evaluate 812 queries), making it unsuitable for practical deployment.

Key Challenge: While process reward models (PRMs) have proven successful in mathematical reasoning, no dedicated PRM exists for web navigation. Outcome reward models (ORMs) are also ill-suited: web actions can be irreversible, and a user cannot simply be "refunded" after an agent completes several failed flight bookings, so errors must be caught mid-trajectory.

Goal: ① Construct a web navigation-specific PRM; ② Build training data and an evaluation benchmark; ③ Achieve low-cost, high-accuracy step-level reward evaluation.

Key Insight: High-level user instructions are decomposed into structured sub-goals via checklists, making step-level reward evaluation more reliable and interpretable.

Core Idea: Checklist-based sub-goal decomposition combined with NTP-trained step-level reward modeling yields a low-cost, high-accuracy PRM for web agents.

Method

Overall Architecture

A two-stage architecture: ① Checklist generation — decomposing the user instruction into 3–5 sub-goals; ② Checklist-based reward modeling — evaluating the degree of completion of each sub-goal at every action step, and outputting a continuous reward score along with textual feedback.

Key Designs

  1. Checklist Generation:

    • Function: Automatically decomposes high-level user instructions into evaluable intermediate milestones \((g_1, g_2, \cdots, g_k)\)
    • Mechanism: Employs coarse-grained sub-goals (e.g., abstracting "filter A + filter B" as "filtering") to reduce website-specific bias and sensitivity to action ordering
    • Design Motivation: Fine-grained checklist items lead to poor cross-policy generalization; coarse granularity (3–5 items) corresponds to task complexity and accommodates diverse execution paths
  2. Checklist-based Reward Modeling:

    • Function: Trains the model under an NTP objective to jointly generate feedback \(F\) and judgment \(J\)
    • Mechanism: \(r_k(o,a) = \frac{1}{L}\sum_l [P(\text{"Yes"}) + 0.5 \times P(\text{"In Progress"})]\), with final reward \(r(o,a) = \frac{1}{K}\sum_k r_k(o,a)\). Soft probabilities are extracted from logits via a Verbalizer
    • Design Motivation: Continuous rewards provide finer-grained learning signals than binary rewards; generated feedback enhances interpretability
  3. WebPRM Collection Dataset Construction:

    • Function: Constructs a large-scale annotation dataset of 40K step-level preference pairs
    • Mechanism: Professional annotators collect tasks across three difficulty levels (easy ≈5 steps / medium ≈9 steps / hard ≈20 steps); checklists are generated by GPT-4o and verified by humans; rejected (negative) actions are sampled from 5 candidate policies
    • Design Motivation: a standardized preference dataset also enables WebRewardBench, released alongside as the first meta-evaluation benchmark for web navigation PRMs, accelerating future research
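
The checklist-based reward in design ② can be sketched in a few lines. This is a minimal illustration of the formula \(r_k = \frac{1}{L}\sum_l [P(\text{"Yes"}) + 0.5 \times P(\text{"In Progress"})]\) and the final average over K checklist items; the function names and the three-token verbalizer dict are assumptions for illustration, not the paper's actual implementation, which extracts these probabilities from the PRM's output logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a dict of judgment-token logits."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}

def item_reward(judgment_logits):
    """Soft reward for one checklist item.

    judgment_logits: L sampled judgments, each a dict mapping the
    verbalizer tokens "Yes" / "In Progress" / "No" to their logits.
    Implements r_k = (1/L) * sum_l [P("Yes") + 0.5 * P("In Progress")].
    """
    total = 0.0
    for logits in judgment_logits:
        p = softmax(logits)
        total += p.get("Yes", 0.0) + 0.5 * p.get("In Progress", 0.0)
    return total / len(judgment_logits)

def step_reward(per_item_logits):
    """Final step reward r(o, a): mean of the K checklist-item rewards."""
    return sum(item_reward(j) for j in per_item_logits) / len(per_item_logits)
```

Note how "In Progress" contributes half credit, so the reward varies continuously between 0 and 1 rather than collapsing to a binary success signal.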

Loss & Training

NTP loss \(\mathcal{L} = -\sum_t \log P_\theta(y_t|y_{<t},C,o,a)\), where \(y=[F;J]\). Base models are Qwen2.5-3B/Qwen3-8B, fine-tuned with LoRA for 3 epochs.

Key Experimental Results

Main Results (WebRewardBench)

Model                           Mind2Web MRR   WebArena Trajectory Acc.   Cross-domain Acc.
GPT-4o (text+image+checklist)   62.4           10.0%                      6.6%
Qwen-2.5-VL-72B                 52.9           0.0%                       2.5%
Web-Shepherd (3B)               87.6           60.0%                      47.1%
Web-Shepherd (8B)               88.3           85.0%                      61.2%

The 8B model achieves 85% trajectory accuracy, vastly outperforming GPT-4o's 10%.

Policy        Reward Model        Success Rate   Gain
GPT-4o-mini   No search           23.64%         -
GPT-4o-mini   GPT-4o-mini PRM     24.24%         +0.6%
GPT-4o-mini   Web-Shepherd (8B)   34.55%         +10.9%
GPT-4o        No search           31.52%         -
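
The inference-time search used in this table amounts to reward-guided action selection: the policy proposes several candidate actions, and the PRM's step reward picks the best one. A minimal sketch, where the candidate strings and scores are hypothetical stand-ins for a real PRM call:

```python
def select_action(candidates, score_fn):
    """Best-of-n selection: return the candidate action with the highest
    PRM score. score_fn plays the role of the step reward r(o, a)
    (e.g. Web-Shepherd's checklist score for the current observation)."""
    return max(candidates, key=score_fn)

# Toy usage with made-up scores (not from the paper):
scores = {"click('Search')": 0.8, "type('dates')": 0.3, "go_back()": 0.1}
best = select_action(list(scores), scores.get)
```

Because Web-Shepherd is ~10× cheaper than GPT-4o as a scorer, running this selection at every step becomes affordable.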

Key Findings

  • Web-Shepherd elevates a weak policy (GPT-4o-mini) beyond a strong policy (GPT-4o): 34.55% > 31.52%
  • Cost revolution: 10× faster and 10× cheaper than GPT-4o, making RL and tree search for web agents practically viable
  • Checklists are essential: Ablations show that removing checklists substantially degrades trajectory accuracy across all models
  • Counter-intuitive finding on multimodal input: Adding images sometimes introduces noise and reduces performance
  • Cross-domain generalization: Still achieves +6.37% improvement on entirely out-of-domain data (WorkArena)

Highlights & Insights

  • First PRM for web navigation: Fills a critical gap and makes RL in web agents practically feasible
  • Structured evaluation via checklists: Transforms subjective assessment into structured sub-goal verification, substantially improving reliability and interpretability
  • Complete data + benchmark ecosystem: WebPRM Collection (40K) + WebRewardBench accelerates subsequent research

Limitations & Future Work

  • Checklist generation relies on GPT-4o, so the cost of obtaining high-quality checklists remains non-trivial
  • Experiments are conducted primarily in simulated environments (WebArena-lite); validation on real-world, complex webpages is insufficient
  • The 40K annotation scale remains modest compared to PRMs for mathematical reasoning
  • The scalability of step-level evaluation over all candidate actions on very long trajectories requires further investigation

Comparison with Prior Approaches

  • vs. GPT-4o-as-judge: cost $14K / 40 A100-hours vs. ~$1.4K / 4 hours; trajectory accuracy 10% vs. 85%. Web-Shepherd wins on both cost and accuracy
  • vs. mathematical reasoning PRMs: Checklist design for web navigation and multimodal observation handling distinguish this work from math-domain PRMs
  • vs. ORMs: ORMs evaluate only final outcomes and are unsuitable for irreversible web actions (e.g., flight booking); PRMs enable mid-process error correction

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First PRM for web navigation; the unified contribution of checklist design, dataset, and benchmark fills a critical gap
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three-scenario coverage, tree-search validation, and complete ablations; lacks validation on real-world complex webpages
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, intuitive presentation of step-level reward computation, well-motivated throughout
  • Value: ⭐⭐⭐⭐⭐ The 10× cost/speed advantage directly addresses deployment bottlenecks and provides a foundational contribution to web agent research