SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

Conference: NeurIPS 2025 | arXiv: 2506.01713 | Code: https://srpo.pages.dev/ | Area: Multimodal VLM / LLM Reasoning | Keywords: Multimodal Reasoning, Self-Reflection, Reinforcement Learning, GRPO, Reflection-Aware RL

TL;DR

This paper proposes SRPO (Self-Reflection enhanced reasoning with Group Relative Policy Optimization), a two-stage reflection-aware RL framework. Stage 1 constructs reflection data via large model distillation for SFT cold-start; Stage 2 designs a reflection-aware reward function within GRPO to reinforce concise and effective self-reflection. SRPO achieves state-of-the-art results at the 7B/32B scale on multimodal reasoning benchmarks including MathVista, MathVision, and MMMU-Pro.

Background & Motivation

Background: Multimodal large language models (MLLMs) have demonstrated strong potential on reasoning tasks. Following DeepSeek-R1's success with RL-based reasoning in the text domain, subsequent work has extended the approach to multimodal settings. However, existing methods (MM-Eureka, Vision-R1, VL-Rethinker, etc.) still struggle to match closed-source model reasoning performance at the 7B scale.

Limitations of Prior Work: (1) MLLM generation follows a token-level Markov process that relies on local dependencies, which leads to redundant, repetitive, or erroneous reasoning steps; (2) even strong reasoning models underperform on multimodal math: GPT-o1 achieves only 73.9% on MathVista, below Qwen2.5-VL-72B (74.8%), a gap the paper attributes to erroneous and redundant reasoning steps.

Key Challenge: Self-reflection is an effective remedy for redundant or erroneous reasoning, yet pretraining largely fixes the upper bound of a model's reasoning capability. RL can only activate existing decision structures rather than acquire new knowledge—surpassing this bound requires external intervention, such as injecting high-quality reflection experience.

Goal: To enable MLLMs to learn effective self-reflection and self-correction, thereby breaking through the reasoning capability ceiling established during pretraining.

Key Insight: Inspired by cognitive science—human robust reasoning involves active self-reflection and iterative error correction—the paper explicitly integrates reflection mechanisms into both the SFT and RL training stages.

Core Idea: A two-stage training pipeline: first inject reflection capability via SFT on large-model-distilled reflection data, then reinforce concise and effective reflection behavior in GRPO using a reflection-aware reward function.

Method

Overall Architecture

SRPO consists of two stages:

  • Stage 1 (Reflection-oriented Cold-start SFT): Construct a high-quality reflection dataset and train the policy model to acquire basic reflection capability.
  • Stage 2 (Reflection-aware RL): Introduce a dedicated reflection reward function within the GRPO framework to further reinforce reflection behavior.

Generation format: first solution → <reflection>...</reflection> → second refined solution
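
A minimal sketch (the helper name is illustrative, not from the paper) of splitting a generated trajectory in this format into its three segments:

```python
import re

def split_trajectory(text: str):
    """Split a response of the form
    <first solution><reflection>...</reflection><second solution>
    into its three parts; return None if the reflection tags are missing."""
    match = re.search(r"<reflection>(.*?)</reflection>", text, flags=re.DOTALL)
    if match is None:
        return None
    first_solution = text[: match.start()].strip()
    reflection = match.group(1).strip()
    second_solution = text[match.end():].strip()
    return first_solution, reflection, second_solution
```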

Key Designs

  1. Self-Reflection SFT Data Construction:

    • Curate \(N = 10{,}000\) multimodal reasoning samples from LLaVA-CoT (100K), Mulberry (260K), and MathV360K.
    • Use the policy model (e.g., Qwen-2.5-VL-7B) to generate initial responses.
    • Use a large model (GPT-o4-mini) to generate reflection processes conditioned on ground truth.
    • Two complementary strategies: simplification of correct CoT (removing redundancy) + correction of erroneous CoT (error fixing).
    • Approximately 30% of initial responses are correct and 70% contain errors, highlighting the necessity of reflection.
    • "Less is More"—only 10K samples suffice to effectively inject reflection capability.
  2. Cold-start SFT Training: The objective is \(\mathcal{L} = -\mathbb{E}[\log \pi(a_1, \texttt{<reflection>...</reflection>}, a_2 \mid q)]\), where \(a_1\) is the policy model's initial response, the reflection is generated by the large model, and \(a_2\) is the ground truth. The model jointly learns: (1) to revise \(a_1\) to \(a_2\) through reflection; (2) to leverage the reasoning knowledge in \(a_2\) to guide future predictions. (A code sketch of this objective follows the list below.)

  3. Reflection-Aware Reward (Core of SRPO; see the reward sketch after this list):

    • Total Reward: \(R_{\text{total}} = R_{\text{task}} + R_{\text{reflection}}\)
    • Task Reward: \(R_{\text{format}}\) (0.5 if format is correct) + \(R_{\text{accuracy}}\) (0.5 if the first solution is correct)
    • Reflection Reward \(= I_{\text{eff}} + I_{\text{ref}} + \alpha \cdot f_{\text{len}}(L)\)
      • \(I_{\text{eff}}\) (Effectiveness Indicator): reflection corrects an erroneous answer \(+0.5\); maintains a correct answer \(+0.25\); fails to correct \(0\); converts a correct answer to incorrect \(-0.25\)
      • \(I_{\text{ref}}\): correct reflection format \(+0.25\)
      • \(f_{\text{len}}\): length reward, encouraging concise output via \(f_{\text{len}}(L) = \exp\!\left(-\left(\frac{L - T_{\text{target}}}{T_{\text{max}} - T_{\text{target}}}\right)^{2}\right)\)
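
Below are two small sketches referenced in the list above. First, a minimal sketch of the cold-start SFT objective, assuming a Hugging Face-style causal LM whose forward pass returns .logits; the tokenized segments are assumed to be 1-D tensors, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def cold_start_sft_loss(model, prompt_ids, a1_ids, reflection_ids, a2_ids):
    # Supervised target = initial response a1 + distilled <reflection>...</reflection>
    # + ground-truth a2; the prompt (question + image tokens) is masked out,
    # so the loss is -log pi(a1, reflection, a2 | q).
    target_ids = torch.cat([a1_ids, reflection_ids, a2_ids], dim=-1)
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1).unsqueeze(0)
    labels = input_ids.clone()
    labels[:, : prompt_ids.numel()] = -100          # ignore prompt tokens
    logits = model(input_ids).logits                # assumes a causal-LM interface
    # standard next-token cross-entropy over the supervised span
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           labels[:, 1:].reshape(-1), ignore_index=-100)
```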
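
Second, a minimal sketch of the reflection-aware reward. Correctness of the two solutions is assumed to come from an external answer checker, and \(T_{\text{target}}\), \(T_{\text{max}}\), and \(\alpha\) are placeholder hyperparameters, not values reported in the paper:

```python
import math

def length_reward(L: int, T_target: int = 256, T_max: int = 1024) -> float:
    # Bonus peaking when the reflection length L equals T_target and decaying
    # smoothly as L approaches T_max (placeholder thresholds).
    return math.exp(-((L - T_target) / (T_max - T_target)) ** 2)

def reflection_reward(first_correct: bool, second_correct: bool,
                      reflection_format_ok: bool, reflection_len: int,
                      alpha: float = 0.1) -> float:
    # I_eff: four-tier effectiveness indicator.
    if not first_correct and second_correct:
        i_eff = 0.5     # corrected an erroneous answer
    elif first_correct and second_correct:
        i_eff = 0.25    # maintained a correct answer
    elif first_correct and not second_correct:
        i_eff = -0.25   # corrupted a correct answer
    else:
        i_eff = 0.0     # failed to correct the error
    i_ref = 0.25 if reflection_format_ok else 0.0   # reflection-format bonus
    return i_eff + i_ref + alpha * length_reward(reflection_len)

def total_reward(output_format_ok: bool, reflection_format_ok: bool,
                 first_correct: bool, second_correct: bool,
                 reflection_len: int) -> float:
    # R_total = R_task + R_reflection, with R_task = R_format + R_accuracy.
    r_task = (0.5 if output_format_ok else 0.0) + (0.5 if first_correct else 0.0)
    return r_task + reflection_reward(first_correct, second_correct,
                                      reflection_format_ok, reflection_len)
```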

Loss & Training

  • The RL stage is optimized with GRPO; advantages are computed via within-group normalization: \(A_i = (r_i - \text{mean}(r)) / \text{std}(r)\) (see the sketch after this list)
  • Key improvement: reward signals address not only accuracy but also provide fine-grained rewards for the effectiveness, conciseness, and format of reflection behavior.
  • RL training data is aggregated from multiple sources: ScienceQA, Geometric Math QA, ChartQA, DVQA, AI2D, MATH, Virgo, R1-OneVision, MMK12, and PhyX.
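
A minimal sketch of the within-group advantage normalization above (the small epsilon is an implementation detail for numerical stability, not from the paper):

```python
import numpy as np

def group_advantages(rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    # GRPO normalizes each rollout's total reward against its own group:
    # A_i = (r_i - mean(r)) / std(r)
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. total rewards of G = 4 rollouts sampled for one query:
# group_advantages([1.6, 0.75, 1.25, 0.5])
```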

Key Experimental Results

Main Results

7B Model Comparison

Model MathVista MathVerse MathVision MMMU-Pro EMMA
Qwen-2.5-VL-7B 68.2 46.3 25.1 36.9 21.5
VL-Rethinker-7B 74.9 54.2 32.3 41.7 29.7
Vision-R1-7B 73.5 52.4 27.2 37.7 22.4
MM-Eureka-7B 73.0 50.3 26.9 37.6 23.5
SRPO-7B 75.8 55.8 32.9 42.3 29.6

32B Model Comparison

Model MathVista MathVerse MathVision MMMU-Pro EMMA
Qwen-2.5-VL-32B 74.7 48.5 38.4 49.5 31.1
MM-Eureka-32B 74.8 56.5 34.4 50.4 34.5
SRPO-32B 78.5 58.9 39.6 51.3 38.2

SRPO-7B outperforms open-source reasoning models of comparable scale on nearly every benchmark, trailing only VL-Rethinker-7B on EMMA (29.6 vs. 29.7). SRPO-32B achieves 51.3 on MMMU-Pro, approaching the closed-source Claude 3.7 Sonnet (51.5).

Ablation Study

  • Removing \(I_{\text{eff}}\) (effectiveness reward): significant performance drop; the model generates hollow reflections.
  • Removing \(f_{\text{len}}\) (length reward): increased output redundancy; reflection segments become excessively long.
  • SFT only, no RL: performance falls below full SRPO, validating the necessity of the RL stage.
  • Standard GRPO without the reflection reward: the model does not generate meaningful reflection behavior.

Key Findings

  1. Both training stages are indispensable: SFT injects reflection capability, RL reinforces reflection quality.
  2. The \(I_{\text{eff}}\) reward design is central—distinguishing "maintaining correctness" from "correcting errors" with differentiated rewards prevents the model from corrupting correct answers.
  3. The length reward effectively prevents reward hacking via output inflation.
  4. Effective reflection capability injection is achievable with as few as 10K SFT samples.

Highlights & Insights

  1. Refined reflection-aware reward design: The four-tier \(I_{\text{eff}}\) reward (\(+0.5 / +0.25 / 0 / -0.25\)) is directly tied to whether reflection genuinely improves outcomes, avoiding the degenerate "reflection for reflection's sake" behavior.
  2. Novel SFT data construction strategy: The "Less is More" approach—10K samples covering both correct simplification and error correction scenarios—is more efficient than large-scale CoT distillation used in prior work.
  3. Cross-scale consistency: Significant gains at both 7B and 32B scales demonstrate the generality of the method.
  4. Addresses a gap in multimodal reflection: Prior work (VL-Rethinker, Vision-R1) does not explicitly reinforce reflection in both the SFT and RL stages simultaneously.

Limitations & Future Work

  1. Reflection data construction relies on a closed-source large model (GPT-o4-mini), incurring distillation costs.
  2. The reflection format is fixed to the <reflection> tag structure; more flexible reflection forms remain unexplored.
  3. Validation is currently limited to mathematical and general scientific reasoning; open-ended visual reasoning is not covered.
  4. The four-tier thresholds of \(I_{\text{eff}}\) are manually defined; finer-grained or adaptive reward schemes are not explored.
  5. No in-depth comparison with RL-based methods such as DAPO or Dr.GRPO is provided.

Related Work & Comparison

  • VL-Rethinker: Improves reasoning via selective sample replay and textual rethinking triggers, but does not inject reflection knowledge at the SFT stage.
  • MM-Eureka: Proposes the MMK12 dataset and a two-stage RL pipeline but lacks an explicit reflection mechanism.
  • Vision-R1: Augments CoT data with DeepSeek-R1 and applies progressive thinking suppression, improving only the SFT side.
  • DeepSeek-R1: Validates the effectiveness of RL for text-based reasoning; SRPO extends this to multimodal settings with explicit reflection.
  • Insight: The reflection mechanism is potentially transferable to visual generation, embodied intelligence, and other multi-step error-correction scenarios.

Rating

  • Novelty: ⭐⭐⭐⭐ — The reflection-aware reward function design within the GRPO framework is a first; the dual-stage SFT+RL reflection injection paradigm is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Eight benchmarks, two model scales (7B/32B), detailed ablations and comparisons.
  • Writing Quality: ⭐⭐⭐⭐ — Clear motivation, complete formulations, rich figures and tables.
  • Value: ⭐⭐⭐⭐ — Provides a reproducible training paradigm for self-reflection in multimodal reasoning.