VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

Conference: ICLR 2026 arXiv: 2505.15801 Code: GitHub Area: Reinforcement Learning Keywords: reward model, benchmark, verification, LLM, reinforcement-learning

TL;DR

This work introduces VerifyBench and VerifyBench-Hard, two evaluation benchmarks targeting reference-based reward systems widely used in training Large Reasoning Models (LRMs). Through rigorous human annotation, the benchmarks assess the accuracy of various verification systems and reveal that even the strongest models achieve only approximately 88% accuracy on hard samples, exposing substantial room for improvement in current verification systems.

Background & Motivation

Background: Reasoning models such as OpenAI o1 and DeepSeek-R1 rely on reference-based reward systems during RL training, granting rewards based on the consistency between model outputs and ground-truth answers.

Limitations of Prior Work: Existing reward model benchmarks (e.g., RewardBench) primarily evaluate pairwise preference judgments—selecting the better of two responses—rather than determining whether a single response is correct.

Key Challenge: Reward systems used in LRM training must assess absolute correctness (i.e., whether a response agrees with the reference answer), which is fundamentally different from relative preference comparison. Existing benchmarks therefore fail to reflect real training scenarios.

Goal: Provide a standardized benchmark that quantifies these shortcomings. Rule-based methods such as math-verify, used in SimpleRL, exhibit clear deficiencies in matching mathematical expressions, yet no benchmark exists to measure them. Moreover, while models perform well on simple verification tasks (~95%), accuracy drops markedly on genuinely ambiguous hard samples (~70–88%), which motivates a dedicated hard benchmark.

Method

Overall Architecture

VerifyBench is constructed through the following pipeline:
1. Data Collection: Reasoning questions and reference answers are collected from 41 open-source datasets, spanning general, logical, and mathematical reasoning.
2. Answer Type Annotation: Answers are automatically classified into four types: numeric, algebraic expression, multiple-choice, and free-form text.
3. Response Generation: 22 open- and closed-source models generate responses, which are pre-annotated by an LLM.
4. Human Annotation: At least two annotators independently label each sample for correctness; disagreements are resolved by a meta-annotator.
5. Balanced Sampling: Sampling is controlled to ensure uniform distribution across the four answer types, with one correct and one incorrect response per question.

Key Designs

Design 1: VerifyBench Dataset

Function: Constructs 2,000 balanced quadruples of (question, reference answer, model response, correctness label).
Mechanism: 1,000 questions × 2 responses each (1 correct + 1 incorrect), uniformly distributed across 4 answer types (250 questions and 500 responses per type), so that no answer type or label dominates the score. Diverse responses are generated by 22 models, with label quality assured through human annotation.
Design Motivation: Reflects the actual scenario in LRM RL training—judging whether a single response is consistent with the reference answer, rather than comparing two responses. Balanced sampling eliminates bias from answer type distribution and correctness ratio.
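The dataset layout described above can be pictured as a list of labeled quadruples plus an answer-type tag. The sketch below is illustrative only: the `VerifySample` class, its field names, the type labels, and the `is_balanced` helper are hypothetical and are not taken from the VerifyBench release.

```python
from collections import Counter
from dataclasses import dataclass

# Assumed type labels; the released data may name them differently.
ANSWER_TYPES = ("numeric", "expression", "multi_choice", "string")


@dataclass(frozen=True)
class VerifySample:
    question: str
    reference: str    # ground-truth answer
    response: str     # model completion to be verified
    answer_type: str  # one of ANSWER_TYPES
    label: bool       # human-annotated correctness


def is_balanced(samples: list[VerifySample]) -> bool:
    """Check that every (answer type, label) cell holds the same number of samples."""
    counts = Counter((s.answer_type, s.label) for s in samples)
    cells = [counts[(t, y)] for t in ANSWER_TYPES for y in (True, False)]
    return cells[0] > 0 and len(set(cells)) == 1


# Toy data: one correct and one incorrect response per answer type.
toy = [
    VerifySample(f"q-{t}", "ref", f"resp-{y}", t, y)
    for t in ANSWER_TYPES
    for y in (True, False)
]
print(is_balanced(toy))       # every cell holds one sample
print(is_balanced(toy[:-1]))  # dropping a sample empties one cell
```

With 2,000 samples, four answer types, and two labels, a balanced split puts 250 samples in each cell, which matches the 500 responses per type stated above.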

Design 2: VerifyBench-Hard Dataset

Function: Constructs 1,000 hard verification samples focusing on contested cases where models strongly disagree.
Mechanism: 18 open-source models generate approximately 1.45 million responses; cases where 5 top-tier judge models split 2:3 are selected (i.e., two models disagree with the remaining three). Stratified sampling and human annotation produce the final dataset. Sampling is natural (not forced-balanced), with correct responses comprising only 29.1%.
Design Motivation: Models already achieve 93–95% accuracy on standard verification tasks, making it difficult to differentiate methods. Hard samples concentrate on genuinely ambiguous cases, more effectively exposing weaknesses in verification systems.
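The disagreement filter can be sketched as a vote count over the five judges. This is a minimal illustration, assuming each judge returns a boolean accept/reject verdict; the function name `is_contested` and the response ids are hypothetical.

```python
def is_contested(votes: list[bool], minority: int = 2) -> bool:
    """Return True when exactly `minority` judges disagree with the rest."""
    accept = sum(votes)
    reject = len(votes) - accept
    return min(accept, reject) == minority


# Hypothetical verdicts from five judge models for four candidate responses.
candidates = {
    "resp_a": [True, True, True, True, True],       # unanimous: skipped
    "resp_b": [True, False, True, False, True],     # 3 accept, 2 reject: kept
    "resp_c": [False, False, True, False, True],    # 2 accept, 3 reject: kept
    "resp_d": [False, True, False, False, False],   # 4:1 split: skipped
}
hard_pool = [rid for rid, votes in candidates.items() if is_contested(votes)]
print(hard_pool)
```

Only the 2:3 and 3:2 splits survive, so every retained sample was accepted by at least two judges and rejected by at least two.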

Design 3: Category-wise Evaluation Across Four Answer Types

Function: Answers are divided into Numeric, Expression, Multi-choice, and String categories and evaluated separately.
Mechanism: Different answer types pose distinct challenges—numeric comparison is relatively straightforward, expressions require mathematical equivalence judgment, multiple-choice requires understanding option semantics, and free-form text is hardest to match accurately.
Design Motivation: Fine-grained analysis reveals the specific weaknesses of verification systems across different scenarios, guiding targeted improvements.
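A per-type breakdown of this kind amounts to grouping verdicts by answer type before averaging. The following sketch assumes records of the form (answer type, predicted label, gold label); the function name and record layout are illustrative, not the authors' evaluation script.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """records: iterable of (answer_type, predicted_label, gold_label) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for answer_type, pred, gold in records:
        totals[answer_type] += 1
        hits[answer_type] += int(pred == gold)
    return {t: hits[t] / totals[t] for t in totals}


# Toy verdicts: the verifier misses one of two numeric samples.
toy_records = [
    ("numeric", True, True),
    ("numeric", False, True),
    ("string", False, False),
]
per_type = accuracy_by_type(toy_records)
print(per_type)
```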

Design 4: Multi-dimensional Evaluation Framework

Function: Simultaneously evaluates rule-based methods (math-verify) and LLM-as-judge approaches.
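Mechanism: The evaluation metric is accuracy, \(\text{Acc} = \frac{1}{|D|} \sum_{(q, g_t, r, y) \in D} \mathbb{I}[\mathcal{E}(R_\phi(q, g_t, r)) = y]\), where \(D\) is the benchmark, \(q\) the question, \(g_t\) the reference answer, \(r\) the model response, \(R_\phi(q, g_t, r)\) the verification system's output, \(\mathcal{E}\) a function that maps that output to a binary correctness judgment, and \(y\) the human-annotated correctness label.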
Design Motivation: DeepSeek-R1 employs rule-based methods to prevent reward hacking, while Seed1.5-Thinking uses model-based methods for more accurate signals. Both paradigms have distinct trade-offs and require comparison under a unified framework.
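The accuracy metric can be sketched in a few lines. The verifier below is a toy exact-match rule standing in for a real system (it is not math-verify), and `extract_judgment` plays the role of the extraction function in the formula; all names and the toy data are assumptions made for illustration.

```python
# (question, reference, response, human label)
Sample = tuple[str, str, str, bool]


def exact_match_verifier(question: str, reference: str, response: str) -> str:
    """Toy rule-based verifier: compares the last whitespace-separated token."""
    tokens = response.strip().split()
    final = tokens[-1] if tokens else ""
    return "CORRECT" if final == reference.strip() else "INCORRECT"


def extract_judgment(raw_output: str) -> bool:
    """Maps the verifier's raw output to a binary judgment."""
    return raw_output.strip().upper() == "CORRECT"


def accuracy(dataset: list[Sample], verifier, extract=extract_judgment) -> float:
    """Fraction of samples where the extracted judgment equals the human label."""
    hits = sum(extract(verifier(q, g, r)) == y for q, g, r, y in dataset)
    return hits / len(dataset)


toy = [
    ("2+2?", "4", "The answer is 4", True),
    ("2+2?", "4", "The answer is 5", False),
    ("1/2 as a decimal?", "0.5", "It equals 1/2", True),  # equivalent form
    ("Capital of France?", "Paris", "I think Lyon", False),
]
print(accuracy(toy, exact_match_verifier))
```

The third sample is a correct answer written in an equivalent form, which the exact-match rule rejects. An LLM judge would plug into the same `accuracy` function by supplying a different `verifier` and, if needed, a different `extract`.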

Loss & Training

VerifyBench is an evaluation benchmark and does not involve model training. Core quality assurance measures include:
- Dual annotation with adjudication: Each sample is labeled by at least two annotators; a meta-annotator resolves disagreements.
- Strict correctness definition: A response is labeled correct only if it agrees with the reference answer.
- Stratified sampling: Data domain and source distributions are controlled to avoid sampling bias.
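Dual annotation with adjudication can be expressed as a small decision rule. This is a sketch under the assumption that labels are booleans and that the meta-annotator is consulted only on disagreement; `resolve_label` and its signature are hypothetical.

```python
from typing import Callable, Sequence


def resolve_label(
    annotations: Sequence[bool],
    meta_annotator: Callable[[], bool],
) -> bool:
    """Return the agreed label, or defer to the meta-annotator on disagreement."""
    if len(annotations) < 2:
        raise ValueError("each sample needs at least two independent annotations")
    if all(a == annotations[0] for a in annotations):
        return annotations[0]
    return meta_annotator()


# The callable stands in for a human adjudicator's decision.
print(resolve_label([True, True], meta_annotator=lambda: False))
print(resolve_label([True, False], meta_annotator=lambda: False))
```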

Key Experimental Results

Main Results

VerifyBench Overall Accuracy (%):

Model/Method	Numeric	Expression	MC	String	AVG
math-verify (rule-based)	85.60	75.60	55.00	51.60	66.95
Llama-3.2-1B	44.40	41.00	37.60	53.60	44.15
GPT-4o	94.80	90.20	96.80	90.80	93.15
DeepSeek-V3	96.80	93.00	97.60	91.60	94.75
DeepSeek-R1	98.00	92.60	98.00	92.00	95.15
Qwen3-32B	97.60	94.00	99.00	92.60	95.80
gpt-oss-120b	98.00	94.80	99.20	91.40	95.85

VerifyBench-Hard Overall Accuracy (%):

Model/Method	Numeric	Expression	MC	String	AVG
math-verify (rule-based)	84.52	82.95	68.37	78.26	76.00
GPT-4o	71.43	65.91	75.35	71.30	72.60
DeepSeek-R1	82.14	81.82	90.93	85.22	86.60
gpt-oss-120b	84.13	80.68	92.56	86.09	87.90

Ablation Study

Difficulty Analysis by Answer Type (VB-Hard):

Answer Type	VerifyBench Best (%)	VB-Hard Best (%)	Drop (pp)
Numeric	98.00	84.52	−13.5
Expression	94.80	82.95	−11.9
Multi-choice	99.20	92.56	−6.6
String	92.60	86.09	−6.5
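The drop column is the difference between two accuracies, so it is a gap in percentage points rather than a relative change. The short check below recomputes it from the best per-type scores listed in the table; the variable names are illustrative.

```python
# (best VerifyBench accuracy, best VB-Hard accuracy), both in percent
best = {
    "Numeric": (98.00, 84.52),
    "Expression": (94.80, 82.95),
    "Multi-choice": (99.20, 92.56),
    "String": (92.60, 86.09),
}

# Difference in percentage points, kept at two decimals before any display rounding.
drops = {t: round(hard - easy, 2) for t, (easy, hard) in best.items()}
for answer_type, drop in drops.items():
    print(f"{answer_type}: {drop:+.2f} pp")
```

Numeric and Expression lose roughly twice as many points as Multi-choice and String, which is the pattern the table reports.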

Model Scale Effect (Llama Series, VB-Hard):

Model	Parameters	VB-Hard AVG
Llama-3.2-1B	1B	25.60%
Llama-3.2-3B	3B	33.90%
Llama-3.1-8B	8B	43.20%
Llama-3.3-70B	70B	54.70%
Llama-4-17B-16E	17B×16E	48.50%

Key Findings

Rule-based methods are severely insufficient: math-verify achieves only 66.95% on VerifyBench, with performance near random chance on multiple-choice (55.00%) and free-form text (51.60%), revealing significant deficiencies in rule-based rewards of the kind used by DeepSeek-R1.
Large gap between VB and VB-Hard: Top models exceed 95% on VerifyBench but reach only 87–88% on VB-Hard, confirming that hard verification cases represent a genuine bottleneck.
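Larger models are more prone to false acceptance: Correct responses constitute only 29.1% of VB-Hard. Because these samples were selected for judge disagreement, most contested cases are incorrect responses that at least two of the five judges accepted, which indicates that larger models tend to accept wrong answers. This is particularly dangerous for RL training because it introduces spurious positive rewards.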
Limited gains from scaling alone: On VB-Hard, accuracy improves from 25.6% (Llama-1B) to 54.7% (Llama-70B), yet remains far from reliable, suggesting that simply scaling model size is insufficient.
Reasoning ability benefits verification: DeepSeek-R1's reasoning capability yields a clear advantage on VB-Hard (86.60% vs. GPT-4o's 72.60%).
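One way to put the VB-Hard numbers in context is to compare them with constant verifiers that always accept or always reject. The sketch below assumes the 29.1% figure is the share of VB-Hard samples labeled correct and that VerifyBench is split 50/50; the function name is hypothetical.

```python
def constant_baseline_accuracy(positive_rate: float) -> dict[str, float]:
    """Accuracy of verifiers that ignore the response and give one fixed verdict."""
    return {
        "always_accept": positive_rate,
        "always_reject": 1.0 - positive_rate,
    }


verifybench = constant_baseline_accuracy(0.5)  # balanced by construction
vb_hard = constant_baseline_accuracy(0.291)    # correct responses are 29.1%

for name, scores in (("VerifyBench", verifybench), ("VB-Hard", vb_hard)):
    print(name, {k: f"{v:.1%}" for k, v in scores.items()})
```

Under these assumptions, rejecting everything already scores 70.9% on VB-Hard, so an accuracy such as GPT-4o's 72.60% sits only slightly above that trivial baseline, while 86.60% and 87.90% are clearly above it.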

Highlights & Insights

Filling an evaluation gap: VerifyBench is the first benchmark specifically designed to evaluate reference-based reward systems, directly corresponding to the actual scenario of LRM RL training.
Elegant construction of VerifyBench-Hard: Model disagreement is leveraged to identify hard samples, ensuring the benchmark is genuinely discriminative.
Systematic weaknesses of rule-based methods: The quantification of math-verify's performance across answer types provides empirical guidance for reward system selection in RL training.
False-acceptance bias: The finding that large models tend to accept incorrect answers carries important implications for reward hacking in RL training.
Rigorous data quality assurance: Dual annotation with meta-annotator adjudication, combined with large-scale coverage across 41 data sources and 22 models, ensures benchmark reliability.

Limitations & Future Work

Restricted to reasoning tasks: VerifyBench focuses on mathematical and logical reasoning and does not cover verification scenarios such as code generation or creative writing.
Limited answer types: Proof-based and open-ended questions are excluded, despite their importance in practical research.
Static benchmark: As model capabilities improve, VB-Hard may saturate quickly and will require continuous updates.
Narrow evaluation paradigm: Only prompt-based LLM-as-judge approaches are assessed; specially trained verification models are not explored.
Downstream effects of verification failures not examined: The paper does not empirically analyze how verification inaccuracies concretely affect RL training quality (e.g., reward hacking, training instability).

Related Work & Insights

RewardBench: A benchmark for pairwise preference judgment. VerifyBench complements it: RewardBench assesses relative preference, whereas VerifyBench assesses absolute correctness.
DeepSeek-R1: Employs rule-based rewards to prevent reward hacking, but VerifyBench reveals substantial deficiencies in rule-based verification (math-verify scores 66.95% on VerifyBench), suggesting that model-based methods should be incorporated.
Seed1.5-Thinking: Uses model-based methods to generate more precise reward signals; VerifyBench provides a standardized tool for evaluating such approaches.
Insight: Verification accuracy in RL training directly constrains the upper bound of model reasoning capability. An accuracy of ~88% on VB-Hard implies that approximately 12% of reward signals on such hard cases are erroneous, which can systematically degrade RL training effectiveness. Verification accuracy may therefore be one of the key bottlenecks in advancing reasoning model capabilities.

Rating

Novelty: ⭐⭐⭐⭐ First work to systematically disentangle reference-based reward evaluation from preference comparison; the construction methodology for VB-Hard is creative.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 20+ models, 4 answer types, and two difficulty levels with rigorous human annotation.
Writing Quality: ⭐⭐⭐⭐ Problem definition is clear, the benchmark construction pipeline is complete, and data statistics are thorough.
Value: ⭐⭐⭐⭐ Provides direct guidance for reward system design in LRM RL training, revealing the inadequacy of rule-based methods and the improvement space for model-based verification.