
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

Conference: ICLR 2026 · arXiv: 2505.17018 · Code: GitHub · Area: Multimodal Reasoning / RL Alignment · Keywords: Thinking Reward, MLLM Reasoning, Trust-GRPO, Annealing Strategy, Process Supervision

TL;DR

This paper proposes SophiaVL-R1, which introduces a holistic-level thinking process reward into rule-based RL training of MLLMs. A Thinking Reward Model (TRM) is trained to evaluate reasoning quality along five dimensions (including logical soundness and redundancy). Trust-GRPO is proposed to compute a reliability weight \(\gamma\) from the contrast of thinking rewards between correct and incorrect answer groups, mitigating reward hacking. A time-based annealing strategy \(e^{-\text{steps}/T}\) gradually reduces the thinking reward contribution so that the model relies more on accurate rule-based rewards in later training. The resulting 7B model comprehensively outperforms LLaVA-OneVision-72B on multiple benchmarks, including MathVista (71.3%) and MMMU (61.3%).

Background & Motivation

Background: DeepSeek-R1-style rule-based RL (GRPO + outcome reward) has successfully elicited reasoning capabilities in LLMs and MLLMs, with representative works including R1-OneVision, OpenVLThinker, and Video-R1, all centered on using rule functions to produce accurate outcome reward signals.

Core Problem: Relying solely on outcome rewards cannot guarantee the quality of the reasoning process — models may arrive at correct answers via flawed reasoning paths ("right answer, wrong reasoning"), and GRPO equally encourages such responses, leading to suboptimal or erroneous reasoning strategies and poor generalization.

Limitations of Prior Work: Traditional process reward models (PRMs) impose step-wise constraints, which (1) are overly rigid, limiting flexibility and generalizability; (2) make step-level correctness assessment inherently difficult; and (3) are susceptible to exploitation via repeating valid steps or inserting meaningless ones.

Reward Hacking Risk: Thinking rewards generated by models are unreliable on certain samples (Ye et al., 2024; Li et al., 2025a); naively incorporating them into GRPO may cause reward hacking, where the model learns to satisfy the reward model rather than genuinely improve reasoning.

Temporal Issue of Thinking Rewards: Maintaining a constant thinking reward intensity throughout training is not necessarily optimal — it aids strategy discovery in early stages but may accumulate errors from imperfect reward signals in later stages.

Goal: Design a reliable method to incorporate thinking process rewards into GRPO training, guiding the model to develop stronger and more generalizable reasoning capabilities without incurring additional computational overhead.

Method

Key Design 1: Holistic-Level Thinking Reward Model (TRM)

  • Function: A 3B-parameter Thinking Reward Model is trained to score the overall thinking process of MLLMs on a scale of 0–1, focusing solely on reasoning quality independent of final answer correctness.
  • Mechanism: A total of 470,331 (question, response) pairs are collected from GRPO training trajectories and scored by Qwen2.5-VL-72B along five dimensions — Logical Soundness, Correct Reasoning, Error Identification, Language Consistency, and Redundancy. After rule-based filtering and uniform sampling, 156,703 high-quality samples are retained to fine-tune Qwen2.5-VL-3B-Instruct via SFT.
  • Design Motivation: Holistic-level evaluation is more flexible than step-level assessment, avoiding the rigidity and exploitation issues of PRMs. Collecting data from GRPO training trajectories ensures coverage of realistic reasoning error patterns.
  • Formulation: Given a question \(q\) and thinking process \(t\), the TRM outputs \(R^t = f_{\phi}(q, t) \in [0, 1]\).
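
To make the interface concrete, below is a minimal sketch of how \(f_{\phi}\) might be queried, assuming a text-generation wrapper around the fine-tuned 3B model. The prompt template, the `generate` callable, and the score-parsing logic are illustrative assumptions, not the paper's actual implementation.

```python
import re

# Hypothetical prompt template; the paper's actual TRM prompt is not reproduced here.
TRM_PROMPT = (
    "Evaluate the following thinking process for the question along five dimensions "
    "(logical soundness, correct reasoning, error identification, language "
    "consistency, redundancy) and output a single overall score in [0, 1].\n\n"
    "Question: {question}\n\nThinking: {thinking}\n\nScore:"
)

def thinking_reward(generate, question: str, thinking: str) -> float:
    """R^t = f_phi(q, t): query the fine-tuned 3B TRM through any text-generation
    callable `generate` (e.g., a transformers pipeline) and parse its score."""
    text = generate(TRM_PROMPT.format(question=question, thinking=thinking))
    match = re.search(r"\d*\.?\d+", text)
    score = float(match.group()) if match else 0.0
    return min(max(score, 0.0), 1.0)  # clamp to the documented [0, 1] range
```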

Key Design 2: Trust-GRPO

  • Function: When integrating thinking rewards and rule-based rewards in GRPO, Trust-GRPO assigns a reliability weight \(\gamma\) to the thinking reward, adaptively reducing the influence of unreliable signals.
  • Mechanism: For each question \(q\) with \(N\) sampled responses, responses are divided into a correct group \(G_{\text{correct}}\) and an incorrect group \(G_{\text{wrong}}\) based on outcome rewards. Group-level mean thinking rewards are computed as:
\[\mu_c = \frac{1}{|G_{\text{correct}}|}\sum_{i \in G_{\text{correct}}} R_i^t, \quad \mu_w = \frac{1}{|G_{\text{wrong}}|}\sum_{i \in G_{\text{wrong}}} R_i^t\]

The reliability weight is defined as:

\[\gamma = \begin{cases} 1, & \mu_c \geq \mu_w \\ e^{\mu_c - \mu_w}, & \mu_c < \mu_w \end{cases}\]

When the incorrect group receives higher average thinking rewards (\(\mu_c < \mu_w\)), \(\gamma\) decays exponentially, reducing the weight of the thinking reward. The final reward is: \(R_i = R_i^o + \gamma \alpha \cdot R_i^t\).

  • Design Motivation: This approach leverages the group sampling already available in GRPO to estimate thinking reward reliability at zero additional computational cost, making it substantially more efficient than methods such as MC Dropout that require multiple additional forward passes.
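
A minimal NumPy sketch of the reliability weight, assuming binary outcome rewards (so thresholding at zero recovers the correct/incorrect split); the fallback when one group is empty is our assumption, as this summary does not cover that edge case.

```python
import numpy as np

def trust_weight(outcome_rewards: np.ndarray, thinking_rewards: np.ndarray) -> float:
    """Compute gamma from the correct/incorrect thinking-reward contrast."""
    correct = outcome_rewards > 0       # assumes binary outcome rewards
    wrong = ~correct
    if not correct.any() or not wrong.any():
        return 1.0                      # assumption: full trust when no contrast exists
    mu_c = float(thinking_rewards[correct].mean())
    mu_w = float(thinking_rewards[wrong].mean())
    # gamma = 1 if mu_c >= mu_w, else exp(mu_c - mu_w) < 1
    return 1.0 if mu_c >= mu_w else float(np.exp(mu_c - mu_w))
```

Because \(\mu_c - \mu_w < 0\) in the penalized branch, \(\gamma \in (0, 1)\): the thinking reward is attenuated but never sign-flipped.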

Key Design 3: Time-based Annealing

  • Function: The influence of the thinking reward is progressively reduced throughout training so that the model increasingly relies on accurate rule-based outcome rewards in later stages.
  • Mechanism: An exponential decay factor is introduced, yielding the final reward:
\[R_i = R_i^o + \gamma \alpha e^{-\text{steps}/T} \cdot R_i^t\]

where \(\text{steps}\) denotes the current global training step and \(T\) the total number of training steps. As training progresses, \(e^{-\text{steps}/T}\) monotonically decreases, naturally attenuating the contribution of the thinking reward.

  • Design Motivation: In early training, the thinking reward aids discovery of effective reasoning strategies (exploration phase); in later stages, when the model already possesses basic reasoning capabilities, imperfect thinking reward signals may introduce noise. Annealing directs the model back toward the more reliable rule-based reward, preventing the accumulation of reward hacking.
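
Combining all three designs, here is a sketch of the per-response reward assembly; the coefficient `alpha` is the thinking-reward weight, whose value is not reported in this summary.

```python
import numpy as np

def final_rewards(outcome_rewards: np.ndarray, thinking_rewards: np.ndarray,
                  gamma: float, alpha: float, step: int, total_steps: int) -> np.ndarray:
    """R_i = R_i^o + gamma * alpha * exp(-step / T) * R_i^t, applied per response."""
    decay = np.exp(-step / total_steps)  # e^{-steps/T}: 1.0 at step 0, ~0.37 at step T
    return outcome_rewards + gamma * alpha * decay * thinking_rewards
```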

Rule-based Outcome Reward

  • Numerical tasks: exact match → binary reward
  • Multiple choice: option matching → binary reward
  • OCR tasks: negative Word Error Rate (WER)
  • Free-form text: average of ROUGE-1/2/L
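
A hedged sketch of the four reward rules; the task labels, matching heuristics, and the use of the `rouge-score` package are illustrative assumptions, and the paper's exact matching logic may differ.

```python
# Illustrative dispatch over the four rule-based outcome rewards described above.
from rouge_score import rouge_scorer  # pip install rouge-score

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate via word-level edit distance."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def outcome_reward(task: str, prediction: str, reference: str) -> float:
    if task == "numerical":        # exact match -> binary reward
        return float(prediction.strip() == reference.strip())
    if task == "multiple_choice":  # option matching -> binary reward
        return float(prediction.strip().upper() == reference.strip().upper())
    if task == "ocr":              # negative Word Error Rate
        return -wer(reference, prediction)
    if task == "free_form":        # mean of ROUGE-1/2/L F-scores
        scores = _scorer.score(reference, prediction)
        return sum(s.fmeasure for s in scores.values()) / 3.0
    raise ValueError(f"unknown task type: {task}")
```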

Key Experimental Results

Table 1: Mathematical Reasoning Benchmarks (MathVista & MathVerse)

| Model | MathVista | MathVerse | Params |
|---|---|---|---|
| LLaVA-OneVision-72B | 68.4 | 27.2 | 72B |
| URSA-8B | 59.8 | 45.7 | 8B |
| R1-OneVision-7B | 64.1 | 46.4 | 7B |
| Qwen2.5-VL-7B+GRPO | 69.9 | 45.3 | 7B |
| Qwen2.5-VL-7B+SFT+GRPO | 66.8 | 43.1 | 7B |
| SophiaVL-R1-7B | 71.3 | 48.8 | 7B |

Table 2: General Multimodal Benchmarks

| Model | MMMU | MME | ChartQA | MMBench | MMStar |
|---|---|---|---|---|---|
| LLaVA-OneVision-72B | 56.8 | 2261 | 83.7 | - | 66.1 |
| URSA-8B | 43.1 | 1606 | 44.4 | 55.5 | 42.3 |
| Qwen2.5-VL-7B+GRPO | 58.0 | 2298 | 87.2 | 83.4 | 65.6 |
| SophiaVL-R1-7B | 61.3 | 2404 | 88.5 | 85.4 | 66.7 |

Table 3: Ablation Study

| Variant | MathVista | MathVerse | MMMU |
|---|---|---|---|
| Qwen2.5-VL-7B+GRPO (baseline) | 69.9 | 45.3 | 58.0 |
| w/o trained TRM (SophiaVL-R1-wo-trained-TRM) | 68.4 | 47.9 | 57.0 |
| w/o Trust, w/o Annealing | 67.4 | 46.3 | 56.7 |
| w/o Trust (annealing only) | 70.2 | 47.8 | 60.0 |
| Full SophiaVL-R1 | 71.3 | 48.8 | 61.3 |

Key Findings

  1. Holistic-level thinking reward outperforms step-level PRM: Compared with VisualPRM (InternVL2.5-8B), SophiaVL-R1 achieves an 18.1-point gain on MathVerse (48.8 vs. 30.7) and leads across all sub-tasks, demonstrating that holistic-level evaluation is more flexible and robust.

  2. Trust weight \(\gamma\) effectively prevents reward hacking: Ablations show that removing the Trust weight causes MMMU to drop from 61.3 to 60.0 and MathVista from 71.3 to 70.2, confirming that \(\gamma\) successfully identifies and downweights unreliable signals via the correct/incorrect group thinking reward contrast.

  3. Annealing strategy is indispensable: Removing annealing (SophiaVL-R1-wo-trust-and-annealing) leads to across-the-board performance degradation (MathVista 67.4 vs. 71.3; MMMU 56.7 vs. 61.3), indicating that sustained application of potentially imperfect thinking rewards induces optimization bias, while annealing directs the model back to reliable rule-based rewards in later training.

  4. An untrained TRM is nearly useless: Replacing the trained TRM with an unmodified Qwen2.5-VL-3B yields performance comparable to the pure GRPO baseline, validating the importance of the dedicated training pipeline and the SophiaVL-R1-Thinking-156k dataset.

  5. Training dynamics: SophiaVL-R1's outcome reward rises fastest and reaches the highest level during training, indicating that Trust-GRPO accelerates strategy exploration.

Highlights & Insights

  • "Evaluating not just correctness but quality of reasoning": Analogous to a teacher grading not only final answers but also solution processes, SophiaVL-R1 achieves process supervision in MLLM training through thinking rewards.
  • Zero-cost self-calibration in Trust-GRPO: The method cleverly exploits the group sampling already present in GRPO — contrasting thinking rewards between correct and incorrect groups — to estimate reliability without any additional sampling, providing a computationally friendly defense against reward hacking.
  • Reversing a 10× parameter gap: The 7B model surpasses the 72B model on MMMU by 4.5 points (61.3 vs. 56.8), suggesting that reasoning quality matters more than parameter scale — a significant implication for efficient reasoning model research.
  • Engineering wisdom of annealing: Thinking rewards in early stages promote exploration; rule-based rewards in later stages enable stable refinement. This "diverge-then-converge" paradigm has broad applicability.

Limitations & Future Work

  1. TRM training relies on large-model annotation: Training data is scored by Qwen2.5-VL-72B, so annotation quality is bounded by that model's capabilities, potentially introducing systematic biases, with non-trivial annotation costs.
  2. Completeness of the five evaluation dimensions is not fully validated: The five dimensions — logical soundness, correct reasoning, etc. — are derived from error patterns observed during training and may not cover all types of reasoning deficiencies, potentially requiring extension for new tasks or domains.
  3. Validation limited to the Qwen2.5-VL series: The method has not been tested on other architectures such as InternVL or LLaVA, leaving its generalizability to be further established.

vs. VisualPRM (Wang et al., 2025b)

VisualPRM employs a step-level process reward model, whereas SophiaVL-R1 adopts a holistic-level thinking reward. The latter achieves an 18.1-point improvement on MathVerse (48.8 vs. 30.7), demonstrating that holistic-level evaluation is more effective for MLLM reasoning training while circumventing the rigidity and exploitability of step-level constraints.

vs. Video-R1 / R1-OneVision

These works rely solely on rule-based outcome rewards without supervising the reasoning process. SophiaVL-R1 additionally incorporates thinking rewards, Trust-GRPO, and annealing, yielding more comprehensive reward signals and better generalization (MathVista 71.3 vs. R1-OneVision's 64.1).

Rating

  • Novelty: ⭐⭐⭐⭐ The triple combination of holistic-level thinking reward, Trust-GRPO reliability mechanism, and annealing strategy is novel, though each individual component is relatively straightforward.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Seven benchmarks, detailed ablations, training curves, and VLRewardBench validation of the TRM are provided; experiments across different base models are lacking.
  • Writing Quality: ⭐⭐⭐⭐⭐ The logical chain from problem definition to analysis to solution is clear and complete, with concise and elegant mathematical derivations.
  • Value: ⭐⭐⭐⭐⭐ The work provides a practical and effective path toward process supervision in RL-based MLLM reasoning training, and the Trust-GRPO reliability estimation paradigm has broad applicability.