Exploring Reasoning Reward Model for Agents
Conference: ACL 2026
arXiv: 2601.22154
Code: https://github.com/kxfan2002/Reagent
Area: LLM Alignment / Reward Model / Agentic RL
Keywords: agentic RL, reasoning reward model, GRPO, critique-guided refinement, multimodal feedback
TL;DR
The authors observe that current agentic RL relies predominantly on sparse outcome rewards (evaluating only final correctness), which discards the signal carried by high-quality intermediate reasoning steps. To address this, they propose Agent-RRM, a reasoning reward model that generates structured feedback in a <think>/<critique>/<score> format. They systematically compare three integration strategies (C: pure critique refinement, R: scalar reward enhancement, U: combined critique + score GRPO) across 12 benchmarks. The final Reagent-U model, built on Qwen3-8B, achieves 43.7% on GAIA and 46.2% on WebWalkerQA. The results indicate that joint supervision with "language-level critique + numerical reward" outperforms either signal used alone.
Background & Motivation
Background: Reinforcement Learning from Verifiable Rewards (RLVR) has been proven to significantly enhance LLM reasoning capabilities in works like DeepSeek-R1. Recently, frameworks such as Search-R1, WebSailor, and Agent0 have extended this paradigm to agents (multi-turn tool calls + information retrieval), achieving substantial gains.
Limitations of Prior Work: (1) Outcome-based rewards are too sparse: most agentic RL systems evaluate only the final answer, so a trajectory that fails only at the last step is scored the same as a completely nonsensical one (zero), wasting high-quality intermediate steps. (2) Step-level reward annotation is expensive and prone to reward hacking. (3) Existing reasoning reward models focus on pair-wise preference (which is better), failing to provide actionable guidance like "exactly what went wrong and how to fix it." (4) Almost all works rely solely on scalar rewards, ignoring natural language critique as a potential dense supervision signal.
Key Challenge: Long-horizon agent tasks (e.g., GAIA Lv.3 requiring 10+ tool calls) need dense signals to learn nuanced reasoning skills, yet current reward frameworks (outcome / step / preference) are either sparse, expensive, or coarse.
Goal: (1) Design a multifaceted reward model capable of simultaneously producing a reasoning trace, textual critique, and scalar score. (2) Systematically compare three integration strategies for feeding critique and score into agentic RL. (3) Provide a training recipe that consistently outperforms SOTA across 12 benchmarks.
Key Insight: Borrowing from the generative reasoning RM concept in DeepSeek-R1 (RM-R1, R1-Reward), the authors extend it from single-turn QA to multi-turn agentic trajectories and, for the first time, utilize the critique text itself as a training signal (rather than just for inference-time refinement).
Core Idea: Enable the reward model to "reason before judging"—first generating <think> to analyze trajectory consistency, then <critique> to identify specific flaws, and finally <score> for an overall rating. The downstream agent can use the critique for in-context refinement and the score for GRPO advantage normalization. In Reagent-U, these are pooled together to achieve a "1+1>2" effect.
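The "reason before judging" output can be consumed downstream by splitting it into its three fields. The following is a minimal sketch, not the authors' code: the closing-tag form (`</think>` etc.), the `Judgment` class, and the helper names are assumptions made for illustration, as is the choice to map a missing or malformed score to 0.0.

```python
import re
from dataclasses import dataclass


@dataclass
class Judgment:
    think: str      # reasoning trace over the trajectory
    critique: str   # actionable feedback for the agent
    score: float    # overall rating, clamped to [0, 1]


def _tag(text: str, name: str) -> str:
    """Return the stripped body of the first <name>...</name> block, or ''."""
    match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


def parse_judgment(raw: str) -> Judgment:
    """Parse a reward-model response (hypothetical helper).

    Assumption: a missing or non-numeric score is treated as 0.0.
    """
    try:
        score = float(_tag(raw, "score"))
    except ValueError:
        score = 0.0
    score = min(1.0, max(0.0, score))  # keep the score inside [0, 1]
    return Judgment(_tag(raw, "think"), _tag(raw, "critique"), score)


raw = (
    "<think>Step 2 relied on a stale page.</think>\n"
    "<critique>Re-run the search before answering.</critique>\n"
    "<score>0.4</score>"
)
judgment = parse_judgment(raw)
```

In this sketch the `critique` field is what an agent would receive for in-context refinement, while `score` is the scalar that would feed GRPO.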
Method
Overall Architecture
The framework consists of two models across two stages: (a) Agent-RRM Training—SFT is performed on Reagent-RRM-SFT-28K (structured judgments annotated by GPT-OSS-120B) to learn the "
Key Designs
- Structured Output of Agent-RRM:
  - Function: Upgrades trajectory judgment from a single scalar to an interpretable chain of "Analysis → Critique → Scoring," providing dense training signals for the agent and transparency for human inspection.
  - Mechanism: During training, the model writes which steps are reasonable and which have logical flaws in <think>, specifies "exactly what to fix" in <critique>, and provides a global score \(s \in [0,1]\) in <score>. Data for Agent-RRM is sampled from various agent models (Qwen3-8B/14B, Qwen3-ARPO-DeepSearch, etc.) to maximize error coverage and annotated by GPT-OSS-120B; final calibration is done via SFT (28K) + GRPO (90K).
  - Design Motivation: A single scalar reward cannot capture fine-grained differences like "correct answer with unnecessary steps" vs. "wrong answer with mostly correct logic." Explicit reasoning by the RM also reduces reward hacking: the model must provide a consistent internal justification to assign a high score.
- Three Integration Variants: C / R / U:
  - Function: Systematically compares the value of "linguistic critique" vs. "numerical score" signals in agentic RL, both individually and combined.
  - Mechanism: (a) Reagent-C is training-free; it samples \(o^{(1)}_i \sim \pi_\theta(o|q)\), generates critique \(c_i\) via RRM, and performs in-context refinement \(o^{(2)}_i \sim \pi_\theta(o|q, o^{(1)}_i, c_i)\). (b) Reagent-R uses a weighted reward \(R_i = R_{\text{rule}}(q, o_i) + \lambda \cdot R_{\text{model}}(q, o_i)\) for GRPO training. (c) Reagent-U samples from both stages and combines them into \(\mathcal{G}_{pool} = \{o^{(k)}_i\}\) (\(k \in \{1, 2\}\)) to calculate a joint advantage \(A^{(k)}_i = (R^{(k)}_i - \text{mean}(\mathbf{R}_{pool})) / \text{std}(\mathbf{R}_{pool})\). The objective, written in the standard clipped GRPO form with \(r^{(k)}_i\) as the importance ratio between the current and old policy, is:
    \[\mathcal{J}_U(\theta) = \mathbb{E}\left[\frac{1}{2G}\sum_{k=1}^{2}\sum_{i=1}^{G}\left(\min\left(r^{(k)}_i A^{(k)}_i,\ \text{clip}\left(r^{(k)}_i, 1-\epsilon, 1+\epsilon\right) A^{(k)}_i\right) - \beta\, \mathbb{D}_{KL}^{(i,k)}\right)\right]\]
  - Design Motivation: C isolates the zero-shot value of critique; R isolates the dense value of scalar rewards; U allows the model to simultaneously learn "how to fix based on critique" and "how to rank trajectories," internalizing critique capabilities into the policy. Consequently, Reagent-U requires no RRM at inference time and adds no inference cost.
- Unified Pool Joint Advantage Normalization:
  - Function: Allows initial and refined trajectories to share a single advantage distribution to facilitate cross-stage quality comparison, guiding the model to propagate refined quality back to the initial generation.
  - Mechanism: While standard GRPO normalizes the \(G\) samples within a group, Reagent-U expands the group to \(2G\) (initial + refined). If refined trajectories are generally better, initial samples receive lower, often negative, advantages, which pushes the policy to generate initial outputs closer to the refined ones.
  - Design Motivation: Separate normalization would decouple the two stages, leading the model to learn refinement skills without improving the initial generation. The unified pool binds both stages under a single gradient signal.
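The pooled normalization described above can be illustrated in a few lines. This is a sketch under stated assumptions rather than the released implementation: the function name `pooled_advantages`, the use of the population standard deviation, and the small `eps` stabilizer are all choices made here for illustration.

```python
from statistics import fmean, pstdev


def pooled_advantages(initial_rewards, refined_rewards, eps=1e-6):
    """Normalize both stages against one shared mean and std.

    Returns (initial_advantages, refined_advantages).
    Assumptions: population std and an eps term to avoid division by zero.
    """
    pool = list(initial_rewards) + list(refined_rewards)  # 2G rewards
    mu = fmean(pool)
    sigma = pstdev(pool)

    def normalize(reward):
        return (reward - mu) / (sigma + eps)

    initial = [normalize(r) for r in initial_rewards]
    refined = [normalize(r) for r in refined_rewards]
    return initial, refined


# Toy group with G = 4: refinement helps every trajectory.
initial_adv, refined_adv = pooled_advantages(
    [0.0, 0.2, 0.4, 0.2],   # rewards of first-pass trajectories
    [0.6, 0.8, 1.0, 0.8],   # rewards after critique-guided refinement
)
```

In this toy group the pooled mean is 0.5, so every first-pass trajectory gets a negative advantage and every refined one a positive advantage. Normalizing each stage separately would instead center both stages at zero, which removes the cross-stage comparison.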
Loss & Training
Training follows the GRPO framework. The rule reward \(R_{\text{rule}}\) uses string matching on final answers; the model reward \(R_{\text{model}}\) is the <score> produced by Agent-RRM. The base model is Qwen3-8B, which is first cold-started with SFT on Reagent-SFT-55.6K and then trained with RL.
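A small sketch of how the two reward terms might be combined, following \(R = R_{\text{rule}} + \lambda \cdot R_{\text{model}}\). The answer normalization (lowercasing, stripping ASCII punctuation), the function names, and the default `lam=0.5` are assumptions for illustration; the notes above do not state the value of \(\lambda\) or the exact matching rule.

```python
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def rule_reward(prediction: str, reference: str) -> float:
    """1.0 on a normalized exact match of the final answer, else 0.0."""
    same = normalize_answer(prediction) == normalize_answer(reference)
    return 1.0 if same else 0.0


def combined_reward(prediction, reference, model_score, lam=0.5):
    """R = R_rule + lam * R_model, with R_model clamped to [0, 1].

    lam=0.5 is a placeholder, not a value taken from the paper.
    """
    model_score = min(1.0, max(0.0, model_score))
    return rule_reward(prediction, reference) + lam * model_score


correct = combined_reward("Paris.", "paris", model_score=0.8)
wrong_but_sound = combined_reward("Lyon", "paris", model_score=0.6)
wrong_and_noisy = combined_reward("Lyon", "paris", model_score=0.0)
```

Under these assumptions a wrong answer with mostly sound reasoning receives 0.3 instead of 0, so it is no longer indistinguishable from a nonsensical trajectory, which is the sparsity problem the model reward is meant to address.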
Key Experimental Results
Main Results
On four core agent benchmarks: GAIA (Lv.1/2/3), WebWalkerQA, HLE, and xbench:
| Model | Backbone | GAIA Avg | WebWalker Avg | HLE | xbench |
|---|---|---|---|---|---|
| WebThinker | Qwen3-8B | 22.3 | 13.0 | 6.6 | 13.0 |
| WebDancer | Qwen2.5-7B | 31.0 | 36.0 | – | – |
| VerlTool | Qwen3-8B | 34.0 | – | 8.4 | – |
| ARPO (≤8B) | Qwen3-8B | 38.8 | 30.5 | 8.8 | 25.0 |
| ARPO (≤32B) | Qwen3-14B | 43.7 | 36.0 | 10.0 | 32.0 |
| Search-o1 | QwQ-32B-Preview | 39.8 | 34.1 | 10.8 | 40.0 |
| DeepSeek-R1-671B | – | 25.2 | 10.0 | 8.6 | 32.0 |
| QwQ-32B | – | 18.9 | 3.8 | 6.4 | 10.0 |
| OpenAI-o3 (proprietary) | – | 70.5 | 71.7 | 20.2 | 66.0 |
| Claude-4-Sonnet (proprietary) | – | 68.3 | 61.7 | 20.2 | 64.0 |
| OpenAI DeepResearch (proprietary) | – | 67.4 | – | 26.6 | – |
| Reagent-U (Ours) | Qwen3-8B | 43.7 | 46.2 | – | – |
With an 8B backbone, Reagent-U matches ARPO 14B on GAIA (43.7) and outperforms it on WebWalker by 10.2 points (46.2 vs. 36.0). Compared with the 8B ARPO baseline (38.8 / 30.5), the absolute gains are +4.9 / +15.7 points, a substantial improvement at equal model size.
Ablation Study
Self-comparison of the three variants:
| Configuration | GAIA Avg | WebWalker Avg | Description |
|---|---|---|---|
| Reagent-SFT only | < 38.8 | < 30.5 | Cold-start only, weaker than ARPO 8B |
| Reagent-C | Medium | Medium | Inference-only critique refinement, no training |
| Reagent-R | High | High | Trained with RM scalar as dense reward |
| Reagent-U | 43.7 | 46.2 | Joint training internalizes critique, no extra inference cost |
Key Findings
- Reagent-U 8B matches or exceeds ARPO 14B: Reagent-U ties ARPO 14B on GAIA (43.7) and beats it on WebWalker (46.2 vs. 36.0). Against ARPO on the same Qwen3-8B backbone, it gains 4.9 (GAIA) and 15.7 (WebWalker) points, suggesting that a denser reward signal can offset a smaller backbone on these benchmarks.
- Gains are larger on WebWalker (+15.7 pp) than GAIA (+4.9 pp): WebWalker involves multi-turn web navigation (long horizon), which is more dependent on intermediate step quality. This is consistent with the view that longer horizons benefit more from dense critique.
- Internalization vs. Inference-time use: Reagent-U maintains high performance without RRM during inference. This significantly reduces deployment costs compared to Reagent-C, implying the value of critique lies in teaching reasoning style rather than real-time correction.
- Unified pool is key to U > R + C: Simple addition of R and C does not yield U's results. Only by placing initial and refined trajectories in the same advantage distribution can the initial generation truly align with the refined quality.
Highlights & Insights
- Structured feedback upgrades RM to "Judge + Teacher": <think> provides transparency, <critique> provides actionability, and <score> provides numerical calibration.
- Critique-as-training-signal as a new paradigm: Unlike traditional critic feedback used only at inference (self-refine), this proves that using critique as GRPO training material allows the policy to internalize these capabilities.
- Unified pool joint advantage normalization: A simple but effective trick that allows GRPO to support multi-stage trajectories, extendable to tree search or iterative refinement.
- Inference-cost-neutral: Reagent-U requires no additional RRM calls or multi-stage sampling at deployment, making it highly attractive for industrial applications.
- High-quality dataset release: The 4 datasets (SFT-55.6K, RL-709K, RRM-SFT-28K, RRM-RL-90K) provide infrastructure for math, multimodal, and tool-use scenarios.
Limitations & Future Work
- Reliability bottleneck of Agent-RRM: Signal quality is capped by the GPT-OSS-120B labels; if the RM has reasoning bugs, the policy will follow incorrect signals.
- Gap with proprietary models: 43.7 (Reagent-U) vs 70.5 (OpenAI-o3) suggests that while RM signals help, the base model capacity remains a constraint.
- Ablation granularity: Detailed per-benchmark R/C/U comparisons are not shown in full in the main text.
- Hyperparameter sensitivity: The impact of \(\lambda\) on training stability is not thoroughly discussed.
Related Work & Insights
- vs ARPO (Dong 2025): Reagent-U's reasoning RM density provides a clear advantage in long-horizon tasks over ARPO's rule-based reward.
- vs Atom-Searcher (Deng 2025): While others use scalar process rewards, Reagent simultaneously produces critique and scores.
- vs Self-Refine: Reagent-U avoids the doubled deployment costs of iterative inference by internalizing the critique capability.
Rating
- Novelty: ⭐⭐⭐⭐ Systematic application of reasoning RMs to multi-turn agentic RL with internalized critique.
- Experimental Thoroughness: ⭐⭐⭐⭐ Extensive benchmarks and baseline comparisons, though some hyperparameter analysis is missing.
- Writing Quality: ⭐⭐⭐⭐ Clear visualization of variants and rigorous mathematical formulation of the unified pool.
- Value: ⭐⭐⭐⭐⭐ High-quality datasets and models provided for the community, with immediate industrial applicability.