# VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
Conference: ICCV 2025 | arXiv: 2503.07478 | Code: https://github.com/JCruan519/VLRMBench | Area: Multimodal Evaluation | Keywords: reward model, vision-language understanding, benchmark, process reasoning, multimodal evaluation
## TL;DR
This paper proposes VLRMBench, a comprehensive and challenging benchmark for vision-language reward models (VLRMs) comprising 12,634 questions across 12 tasks, covering three dimensions: process understanding, outcome judgment, and criticism generation. Extensive experiments on 26 models reveal significant deficiencies in current VLRMs.
## Background & Motivation
Reward models (RMs) play a critical role in both the training and inference stages of large models: they are used to filter high-quality data before training, to guide preference optimization during training (e.g., RLAIF), and to enable test-time scaling (TTS) at inference. However, existing VLRM benchmarks suffer from serious limitations:
Narrow evaluation dimensions: VLRewardBench covers only pairwise comparison, which is insufficient for a comprehensive assessment of VLRM capabilities.
Absence of step-level annotations: Most benchmarks do not include labels at the level of individual reasoning steps.
Focus on language-only settings: Existing RM benchmarks (e.g., PRMBench, ProcessBench) target purely textual inputs and are not applicable to vision-language scenarios.
Insufficient challenge: Overly simple benchmarks fail to expose latent weaknesses of VLRMs.
## Method

### Overall Architecture
The construction of VLRMBench follows a three-stage pipeline: (1) data collection and filtering; (2) reasoning process generation and step segmentation; (3) task design based on three overarching themes, yielding 12 tasks in total.
### Key Designs
- Collaborative Data Filtering & Generation Pipeline (the dual-filtering logic is sketched in code after this item):
- Quality filtering: Qwen2VL-7B is prompted to answer questions without images; samples answered correctly are considered low-quality (i.e., solvable by text alone) and discarded.
- Difficulty filtering: Samples that Qwen2VL-7B answers correctly even with images are deemed too easy and discarded. This reduces the pool from 16,550 to 6,715 samples.
- Reasoning process generation: QVQ-72B-preview generates reasoning chains; only samples with correct final answers are retained.
- Step segmentation: GPT-4o performs semantic-level step segmentation.
- Human verification: Three doctoral students validate and correct errors in the generated reasoning processes.
- The final dataset retains 1,000 high-quality samples spanning mathematical reasoning, hallucination understanding, and multi-image understanding.
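The dual filter above reduces to a few lines of logic. Below is a minimal Python sketch, assuming a hypothetical `ask_model(question, image)` wrapper around a Qwen2VL-7B inference call; this is not an API from the paper's code release:

```python
def dual_filter(samples, ask_model):
    """Apply the two filters to a list of candidate samples.

    `ask_model(question, image)` is an assumed wrapper around a Qwen2VL-7B
    call that returns the model's answer string; pass image=None to query
    without the image.
    """
    kept = []
    for s in samples:
        # Quality filter: solvable without the image => text-only shortcut, drop.
        if ask_model(s["question"], image=None) == s["answer"]:
            continue
        # Difficulty filter: solved even with the image => too easy, drop.
        if ask_model(s["question"], image=s["image"]) == s["answer"]:
            continue
        kept.append(s)
    return kept
```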
- Step-based Tasks (8 tasks): Evaluate a VLRM's ability to understand reasoning processes; an illustrative construction sketch follows the task list.
- SC (Step Correctness): Detects injected errors in reasoning steps.
- RD (Redundancy Detection): Identifies redundant information in reasoning processes.
- CM (Confidence Misdirection): Inserts high-confidence expressions into erroneous steps to test robustness.
- EH (Existential Hallucination): Detects entities mentioned in reasoning that do not exist in the image.
- AH (Attribute Hallucination): Detects incorrect descriptions of entity attributes.
- DE (Detail Error): Detects fine-grained errors in numerical computations or symbols.
- SR (Spatial Relationship): Detects errors in spatial relationship descriptions.
- IC (Image Confusion): Detects incorrect image references in multi-image tasks.
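For concreteness, here is an illustrative sketch (not taken from the paper) of how a Confidence Misdirection instance could be constructed: a high-confidence phrase is injected into the erroneous step, and per-step labels record which steps remain correct:

```python
import random

CONFIDENCE_PHRASES = ["Without a doubt,", "It is definitely the case that", "Clearly,"]

def make_cm_instance(steps, wrong_idx, rng=random):
    """Toy Confidence Misdirection construction: prefix the erroneous step
    with a high-confidence phrase so a reward model must judge content, not
    tone. `steps` is a list of reasoning-step strings; `wrong_idx` marks the
    step containing the injected error. Illustrative only."""
    perturbed = list(steps)
    perturbed[wrong_idx] = f"{rng.choice(CONFIDENCE_PHRASES)} {steps[wrong_idx]}"
    labels = [i != wrong_idx for i in range(len(steps))]  # True = step is correct
    return perturbed, labels
```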
- Outcome-based Tasks (2 tasks): Evaluate a VLRM's ability to judge final outcomes.
- MJ (Multi-solution Judgment): Compares the quality of different reasoning processes for the same question.
- FF (Forecasting Future): Predicts the correctness of the final answer based on the first \(m\) reasoning steps.
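An illustrative sketch of how an FF query can be posed, assuming reasoning steps are stored as strings (the prompt wording here is hypothetical, not the paper's):

```python
def make_ff_prompt(question, steps, m):
    """Forecasting-Future prompt sketch: show only the first `m` reasoning
    steps and ask for a binary prediction about the final answer."""
    partial = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps[:m]))
    return (
        f"Question: {question}\n"
        f"Partial reasoning:\n{partial}\n"
        "Based only on these steps, will the final answer be correct? "
        "Reply Yes or No."
    )
```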
- Criticism-based Tasks (2 tasks): Evaluate a VLRM's ability to analyze and correct errors.
- ERA (Error Reason Analysis): Analyzes the cause of erroneous reasoning steps.
- EC (Error Correction): Directly corrects errors and produces a revised reasoning chain.
### Loss & Training

As a benchmarking study, no model training is required. Evaluation metrics are as follows:

- Step-based tasks: F1-Score (balancing precision and recall)
- Outcome-based tasks: Accuracy
- Criticism-based tasks: Win Rate (using GPT-4o as the judge)
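A minimal sketch of these three metric families, assuming boolean step labels, single-label outcome predictions, and pre-collected judge verdicts (all function names are illustrative):

```python
def step_f1(pred, gold):
    """F1 over per-step labels (True = step flagged as erroneous)."""
    tp = sum(p and g for p, g in zip(pred, gold))
    precision = tp / max(sum(pred), 1)
    recall = tp / max(sum(gold), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

def outcome_accuracy(preds, golds):
    """Plain accuracy for the outcome-based tasks (MJ, FF)."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def win_rate(judge_verdicts):
    """Fraction of 'win' verdicts from the GPT-4o judge (criticism tasks)."""
    return judge_verdicts.count("win") / len(judge_verdicts)
```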
## Key Experimental Results

### Main Results
Average F1-Score on step-based tasks (selected models; the SR and IC columns are omitted, and Step Avg. covers all eight step-based tasks):
| Model | SC | RD | CM | EH | AH | DE | Step Avg. |
|---|---|---|---|---|---|---|---|
| GPT-4o | 73.7 | 50.6 | 66.6 | 57.6 | 58.6 | 71.8 | 62.4 |
| Claude-3.5-Sonnet | 70.8 | 53.7 | 65.7 | 63.9 | 62.8 | 63.4 | 62.9 |
| Qwen2.5VL-72B | 72.8 | 41.7 | 70.4 | 64.6 | 59.9 | 72.4 | 62.6 |
| Qwen2.5VL-7B | 43.4 | 33.2 | 37.8 | 22.8 | 23.9 | 45.5 | 33.4 |
| InternVL2.5-8B | 36.6 | 28.4 | 31.1 | 21.9 | 21.2 | 36.5 | 28.6 |
| Ovis2-34B | 65.3 | 51.1 | 64.5 | 54.5 | 51.6 | 59.6 | 57.0 |
Performance on outcome-based and criticism-based tasks. ERA and EC entries report Win/Tie/Lose percentages against the GPT-4o reference response, with GPT-4o as the judge (hence the 0.0 entries for GPT-4o itself):

| Model | MJ (Acc) | FF (Acc) | Outcome Avg. | ERA (Win/Tie/Lose) | EC (Win/Tie/Lose) |
|---|---|---|---|---|---|
| GPT-4o | 58.4 | 76.0 | 66.3 | 0.0 | 0.0 |
| Claude-3.5-Sonnet | 82.2 | 75.1 | 79.0 | 60.6/25.5/13.9 | 21.2/53.9/24.9 |
| Qwen2.5VL-72B | 65.6 | 80.2 | 72.1 | 74.1/15.1/10.8 | 15.6/77.0/7.3 |
| Qwen2.5VL-7B | 26.0 | 70.7 | 46.0 | 37.7/22.0/40.3 | 9.3/51.9/38.8 |
### Ablation Study
Effect of model scale on step-based and outcome-based tasks:
| Group | Scale | Step Avg. | Outcome Avg. |
|---|---|---|---|
| Small (<10B) | 2B–8B | 29.8 | 45.2 |
| Medium (10–40B) | 11B–38B | 45.9 | 54.7 |
| Large (>40B) | 72B–90B | 46.1 | 56.8 |
| Closed-source | — | ≥ 62.4 | ≥ 66.3 |
### Key Findings
- Even the state-of-the-art GPT-4o achieves only 76.0% accuracy on the FF (Forecasting Future) task and an average F1 of 62.4% on step-based tasks.
- Open-source models are closing the gap with closed-source counterparts: on the criticism tasks, Qwen2.5VL-72B attains a 74.1% ERA win rate against the GPT-4o reference output.
- The CM task confirms that VLRMs are susceptible to high-confidence expressions: CM F1 scores are consistently lower than SC scores across all models.
- Redundancy Detection (RD) is the hardest step-based task for the stronger models, which all record their lowest F1 scores on RD; the smaller open-source models score even lower on the hallucination tasks (EH/AH).
- Performance improves substantially from small to medium scale (29.8→45.9), but the gain from medium to large scale is marginal (45.9→46.1).
## Highlights & Insights
- Comprehensive 12-task design: The three-dimensional evaluation framework—covering process, outcome, and criticism—substantially surpasses existing benchmarks that rely solely on pairwise comparison.
- Dual filtering mechanism: Answering correctly without images indicates low quality; answering correctly with images indicates insufficient difficulty. This ensures that only high-quality, high-difficulty samples are retained.
- Innovation of the Confidence Misdirection task: Tests whether models are misled by confidence-laden expressions such as "definitely" and "without a doubt."
- Practical value of Forecasting Future: The ability to predict reasoning correctness early could significantly accelerate TTS inference.
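As a hypothetical illustration of that last insight: in a best-of-N test-time-scaling loop, an FF-style reward model could prune trajectories after the first few steps, avoiding the cost of completing unpromising chains. `predict_final_correctness` is an assumed VLRM scoring call, not part of any released code:

```python
def prune_with_ff(trajectories, predict_final_correctness, m=3, keep=2):
    """Early-exit sketch for test-time scaling: score each partial reasoning
    chain on its first `m` steps with an FF-style VLRM and keep only the
    most promising candidates. All names here are illustrative."""
    scored = [(predict_final_correctness(t["steps"][:m]), t) for t in trajectories]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest score first
    return [t for _, t in scored[:keep]]
```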
## Limitations & Future Work
- Reasoning processes are generated by QVQ-72B-preview, which may introduce biases inherent to that model.
- Mathematical reasoning samples dominate the dataset; hallucination and multi-image samples are comparatively scarce.
- Using GPT-4o as the judge for criticism-based tasks introduces a risk of evaluator bias.
- The benchmark currently evaluates only generative, text-output reward models and does not cover scalar-valued RMs.
- The absence of a dynamic update mechanism means models may overfit to a fixed test set over time.
## Related Work & Insights
- PRMBench: Fine-grained evaluation of process reward models in the text-only domain.
- VLRewardBench: The first vision-language RM benchmark, but limited to a single task type.
- ProcessBench: Mathematical reasoning evaluation for step-level error detection.
- Insight: Evaluating the capabilities of reward models requires a multi-dimensional framework such as VLRMBench; single-task evaluation (e.g., pairwise comparison) is far from sufficient.
## Rating
- Novelty: ⭐⭐⭐⭐ First comprehensive VLRM benchmark with a uniquely designed set of 12 tasks.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extensive evaluation across 26 models, analyzed in four groups (small/medium/large/closed-source).
- Writing Quality: ⭐⭐⭐⭐ Task designs are clearly articulated, with rich tables and figures.
- Value: ⭐⭐⭐⭐ Fills the gap in comprehensive VLRM evaluation and exposes critical weaknesses of current models.