SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning¶
Conference: NeurIPS 2025 arXiv: 2506.01713 Code: https://srpo.pages.dev/ Area: Multimodal VLM / LLM Reasoning Keywords: Multimodal Reasoning, Self-Reflection, Reinforcement Learning, GRPO, Reflection-Aware RL
TL;DR¶
This paper proposes SRPO (Self-Reflection enhanced reasoning with Group Relative Policy Optimization), a two-stage reflection-aware RL framework. Stage 1 constructs reflection data via large model distillation for SFT cold-start; Stage 2 designs a reflection-aware reward function within GRPO to reinforce concise and effective self-reflection. SRPO achieves state-of-the-art results at the 7B/32B scale on multimodal reasoning benchmarks including MathVista, MathVision, and MMMU-Pro.
Background & Motivation¶
Background: Multimodal large language models (MLLMs) have demonstrated strong potential on reasoning tasks. Works such as DeepSeek-R1 have extended RL-based reasoning from text to multimodal settings. However, existing methods (MM-Eureka, Vision-R1, VL-Rethinker, etc.) still struggle to match closed-source model reasoning performance at the 7B scale.
Limitations of Prior Work: (1) MLLM generation follows a token-level Markov process that relies on local dependencies, which leads to redundant, repetitive, or erroneous reasoning steps; (2) even GPT-o1 achieves only 73.9% on MathVista, underperforming Qwen2.5-VL-72B (74.8%), which the paper attributes to erroneous and redundant reasoning steps degrading final performance.
Key Challenge: Self-reflection is an effective remedy for redundant or erroneous reasoning, yet pretraining largely fixes the upper bound of a model's reasoning capability. RL can only activate existing decision structures rather than acquire new knowledge—surpassing this bound requires external intervention, such as injecting high-quality reflection experience.
Goal: To enable MLLMs to learn effective self-reflection and self-correction, thereby breaking through the reasoning capability ceiling established during pretraining.
Key Insight: Inspired by cognitive science—human robust reasoning involves active self-reflection and iterative error correction—the paper explicitly integrates reflection mechanisms into both the SFT and RL training stages.
Core Idea: A two-stage training pipeline: first inject reflection capability via SFT on large-model-distilled reflection data, then reinforce concise and effective reflection behavior in GRPO using a reflection-aware reward function.
Method¶
Overall Architecture¶
SRPO consists of two stages:
- Stage 1 (Reflection-oriented Cold-start SFT): Construct a high-quality reflection dataset and train the policy model to acquire basic reflection capability.
- Stage 2 (Reflection-aware RL): Introduce a dedicated reflection reward function within the GRPO framework to further reinforce reflection behavior.
Generation format: first solution → <reflection>...</reflection> → second refined solution
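To make this layout concrete, below is a purely illustrative Python template for the response structure; the placeholder field names and the toy example content are hypothetical, and only the <reflection> tag follows the format described above.

```python
# Illustrative template for the generation format described above.
# Placeholder names (first_solution, critique, refined_solution) are hypothetical.
RESPONSE_TEMPLATE = (
    "{first_solution}\n"
    "<reflection>{critique}</reflection>\n"
    "{refined_solution}"
)

# Toy example of a filled-in response.
example = RESPONSE_TEMPLATE.format(
    first_solution="Area = 6 * 4 = 24, so the answer is 24.",
    critique="The figure shows a triangle, so the area should be (6 * 4) / 2.",
    refined_solution="Area = (6 * 4) / 2 = 12, so the answer is 12.",
)
```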
Key Designs¶
- Self-Reflection SFT Data Construction (see the data-construction sketch after this list):
- Curate \(N = 10{,}000\) multimodal reasoning samples from LLaVA-CoT (100K), Mulberry (260K), and MathV360K.
- Use the policy model (e.g., Qwen-2.5-VL-7B) to generate initial responses.
- Use a large model (GPT-o4-mini) to generate reflection processes conditioned on ground truth.
- Two complementary strategies: simplification of correct CoT (removing redundancy) + correction of erroneous CoT (error fixing).
- Approximately 30% of initial responses are correct and 70% contain errors, highlighting the necessity of reflection.
- "Less is More"—only 10K samples suffice to effectively inject reflection capability.
- Cold-start SFT Training: The objective is \(\mathcal{L} = -\mathbb{E}[\log \pi(a_1, \texttt{<reflection>...</reflection>}, a_2 \mid q)]\), where \(a_1\) is the policy model's initial response, the reflection is generated by the large model, and \(a_2\) is the ground truth. The model jointly learns: (1) to revise \(a_1\) to \(a_2\) through reflection; (2) to leverage the reasoning knowledge in \(a_2\) to guide future predictions.
- Reflection-Aware Reward (Core of SRPO; a minimal reward sketch follows this list):
- Total Reward: \(R_{\text{total}} = R_{\text{task}} + R_{\text{reflection}}\)
- Task Reward: \(R_{\text{format}}\) (0.5 if format is correct) + \(R_{\text{accuracy}}\) (0.5 if the first solution is correct)
- Reflection Reward: \(R_{\text{reflection}} = I_{\text{eff}} + I_{\text{ref}} + \alpha \cdot f_{\text{len}}(L)\)
- \(I_{\text{eff}}\) (Effectiveness Indicator): reflection corrects an erroneous answer \(+0.5\); maintains a correct answer \(+0.25\); fails to correct \(0\); converts a correct answer to incorrect \(-0.25\)
- \(I_{\text{ref}}\): correct reflection format \(+0.25\)
- \(f_{\text{len}}\): length reward, encouraging concise output via \(f_{\text{len}}(L) = \exp\!\left(-\left(\frac{L - T_{\text{target}}}{T_{\text{max}} - T_{\text{target}}}\right)^{2}\right)\)
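As referenced in the first item above, the following is a minimal Python sketch of the reflection-data construction loop. The callables `policy_generate`, `distill_reflection`, and `is_correct` are hypothetical placeholders (standing in for the policy model, the teacher model, and an answer checker), not the authors' released code, and teacher prompting details are omitted.

```python
def build_reflection_sft_set(samples, policy_generate, distill_reflection, is_correct):
    """Build cold-start SFT targets of the form: a1 -> <reflection>...</reflection> -> a2.

    `policy_generate`, `distill_reflection`, and `is_correct` are caller-supplied
    callables; their names and signatures are illustrative placeholders.
    """
    dataset = []
    for question, image, ground_truth in samples:      # ~10K curated samples
        a1 = policy_generate(question, image)           # policy model's initial response

        # Correct responses get simplified; erroneous ones get corrected.
        mode = "simplify" if is_correct(a1, ground_truth) else "correct"
        reflection, a2 = distill_reflection(question, a1, ground_truth, mode)

        dataset.append({
            "question": question,
            "image": image,
            # SFT target: initial response + distilled reflection + refined solution.
            "target": f"{a1}\n<reflection>{reflection}</reflection>\n{a2}",
        })
    return dataset
```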
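A complementary Python sketch of the reflection-aware reward from the last item above. The constants follow the bullet values; `alpha`, `t_target`, and `t_max` are hyperparameters whose values here are illustrative placeholders, and the squared-exponential form of the length reward is an assumption based on the formula above.

```python
import math

def reflection_aware_reward(first_correct, final_correct, format_ok,
                            reflection_format_ok, length,
                            t_target=512, t_max=1024, alpha=0.1):
    """R_total = R_task + R_reflection, following the bullets above.

    t_target, t_max, and alpha are illustrative placeholder values,
    not the paper's reported settings.
    """
    # Task reward: output format + accuracy of the first solution.
    r_task = (0.5 if format_ok else 0.0) + (0.5 if first_correct else 0.0)

    # Effectiveness indicator I_eff: did the reflection actually help?
    if not first_correct and final_correct:
        i_eff = 0.5      # corrected an erroneous answer
    elif first_correct and final_correct:
        i_eff = 0.25     # maintained a correct answer
    elif first_correct and not final_correct:
        i_eff = -0.25    # corrupted a correct answer
    else:
        i_eff = 0.0      # failed to correct an error

    # Reflection-format indicator I_ref.
    i_ref = 0.25 if reflection_format_ok else 0.0

    # Length reward: peaks at t_target and decays smoothly toward t_max.
    f_len = math.exp(-((length - t_target) / (t_max - t_target)) ** 2)

    return r_task + i_eff + i_ref + alpha * f_len
```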
Loss & Training¶
- The RL stage is optimized with GRPO; advantages are computed via within-group normalization (see the sketch after this list): \(A_i = (r_i - \text{mean}(r)) / \text{std}(r)\)
- Key improvement: reward signals address not only accuracy but also provide fine-grained rewards for the effectiveness, conciseness, and format of reflection behavior.
- SFT data is aggregated from multiple sources: ScienceQA, Geometric Math QA, ChartQA, DVQA, AI2D, MATH, Virgo, R1-OneVision, MMK12, and PhyX.
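A minimal sketch of the within-group advantage normalization from the first bullet above, assuming a group of G sampled responses per prompt scored with the total reward; the epsilon term is a standard numerical safeguard and not stated in the summary.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization within one group: A_i = (r_i - mean(r)) / std(r)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: G = 4 rollouts for the same prompt, scored with R_total.
print(group_relative_advantages([1.25, 0.75, 0.5, 1.0]))
```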
Key Experimental Results¶
Main Results¶
7B Model Comparison¶
| Model | MathVista | MathVerse | MathVision | MMMU-Pro | EMMA |
|---|---|---|---|---|---|
| Qwen-2.5-VL-7B | 68.2 | 46.3 | 25.1 | 36.9 | 21.5 |
| VL-Rethinker-7B | 74.9 | 54.2 | 32.3 | 41.7 | 29.7 |
| Vision-R1-7B | 73.5 | 52.4 | 27.2 | 37.7 | 22.4 |
| MM-Eureka-7B | 73.0 | 50.3 | 26.9 | 37.6 | 23.5 |
| SRPO-7B | 75.8 | 55.8 | 32.9 | 42.3 | 29.6 |
32B Model Comparison¶
| Model | MathVista | MathVerse | MathVision | MMMU-Pro | EMMA |
|---|---|---|---|---|---|
| Qwen-2.5-VL-32B | 74.7 | 48.5 | 38.4 | 49.5 | 31.1 |
| MM-Eureka-32B | 74.8 | 56.5 | 34.4 | 50.4 | 34.5 |
| SRPO-32B | 78.5 | 58.9 | 39.6 | 51.3 | 38.2 |
SRPO-7B outperforms comparable open-source reasoning models on nearly all benchmarks (on EMMA it is essentially on par with VL-Rethinker-7B, 29.6 vs. 29.7). SRPO-32B achieves 51.3 on MMMU-Pro, approaching the closed-source Claude 3.7 Sonnet (51.5).
Ablation Study¶
| Ablation Dimension | Key Findings |
|---|---|
| Remove \(I_{\text{eff}}\) (effectiveness reward) | Significant performance drop; model generates hollow reflections |
| Remove \(f_{\text{len}}\) (length reward) | Increased output redundancy; reflection segments become excessively long |
| SFT only, no RL | Performance below full SRPO, validating the necessity of the RL stage |
| Standard GRPO without reflection reward | Model does not generate meaningful reflection behavior |
Key Findings¶
- Both training stages are indispensable: SFT injects reflection capability, RL reinforces reflection quality.
- The \(I_{\text{eff}}\) reward design is central—distinguishing "maintaining correctness" from "correcting errors" with differentiated rewards prevents the model from corrupting correct answers.
- The length reward effectively prevents reward hacking via output inflation.
- Effective reflection capability injection is achievable with as few as 10K SFT samples.
Highlights & Insights¶
- Refined reflection-aware reward design: The four-tier \(I_{\text{eff}}\) reward (\(+0.5 / +0.25 / 0 / -0.25\)) is directly tied to whether reflection genuinely improves outcomes, avoiding the degenerate "reflection for reflection's sake" behavior.
- Novel SFT data construction strategy: The "Less is More" approach—10K samples covering both correct simplification and error correction scenarios—is more efficient than large-scale CoT distillation used in prior work.
- Cross-scale consistency: Significant gains at both 7B and 32B scales demonstrate the generality of the method.
- Addresses a gap in multimodal reflection: Prior work (VL-Rethinker, Vision-R1) does not explicitly reinforce reflection in both the SFT and RL stages simultaneously.
Limitations & Future Work¶
- Reflection data construction relies on a closed-source large model (GPT-o4-mini), incurring distillation costs.
- The reflection format is fixed to the <reflection> tag structure; more flexible reflection forms remain unexplored.
- Validation is currently limited to mathematical and general scientific reasoning; open-ended visual reasoning is not covered.
- The four-tier thresholds of \(I_{\text{eff}}\) are manually defined; finer-grained or adaptive reward schemes are not explored.
- No in-depth comparison with RL-based methods such as DAPO or Dr.GRPO is provided.
Related Work & Insights¶
- VL-Rethinker: Improves reasoning via selective sample replay and textual rethinking triggers, but does not inject reflection knowledge at the SFT stage.
- MM-Eureka: Proposes the MMK12 dataset and a two-stage RL pipeline but lacks an explicit reflection mechanism.
- Vision-R1: Augments CoT data with DeepSeek-R1 and applies progressive thinking suppression, improving only the SFT side.
- DeepSeek-R1: Validates the effectiveness of RL for text-based reasoning; SRPO extends this to multimodal settings with explicit reflection.
- Insight: The reflection mechanism is potentially transferable to visual generation, embodied intelligence, and other multi-step error-correction scenarios.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The reflection-aware reward function design within the GRPO framework is a first; the dual-stage SFT+RL reflection injection paradigm is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Eight benchmarks, two model scales (7B/32B), detailed ablations and comparisons.
- Writing Quality: ⭐⭐⭐⭐ — Clear motivation, complete formulations, rich figures and tables.
- Value: ⭐⭐⭐⭐ — Provides a reproducible training paradigm for self-reflection in multimodal reasoning.