Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models¶
Conference: ICLR 2026 arXiv: 2503.06749 Code: GitHub Area: Multimodal VLM Keywords: Multimodal Reasoning, Reinforcement Learning, Chain-of-Thought, GRPO, Cold-Start Initialization
TL;DR¶
This paper proposes Vision-R1, which constructs 200K high-quality multimodal CoT data via Modality Bridging for cold-start initialization, followed by Progressive Thinking Suppression Training (PTST) combined with GRPO reinforcement learning. At the 7B parameter scale, Vision-R1 achieves multimodal mathematical reasoning performance approaching OpenAI O1.
Background & Motivation¶
DeepSeek-R1 successfully demonstrated that pure RL can elicit complex reasoning capabilities in LLMs (e.g., self-reflection, self-questioning). However, whether this success can be transferred to Multimodal LLMs (MLLMs) remains an open question.
The authors first attempted to train MLLMs directly with RL (termed Vision-R1-Zero), identifying the following key challenges:
Direct RL training fails to elicit complex reasoning: Due to the lack of large-scale, high-quality multimodal reasoning data, the model cannot generate complex CoT.
Insufficient quality of existing multimodal CoT data: Existing data lacks human cognitive processes such as self-reflection and self-questioning, amounting to formatted "pseudo-CoT."
Overthinking after cold-start initialization: After SFT on CoT data, the model generates excessively long reasoning chains, while correct reasoning is concentrated in shorter chains, making RL optimization difficult.
Method¶
Overall Architecture¶
A two-stage pipeline: (1) construct the Vision-R1-cold dataset → cold-start SFT to obtain Vision-R1-CI; (2) Progressive Thinking Suppression Training (PTST) + GRPO reinforcement learning → final Vision-R1.
Key Designs¶
- Modality Bridging for Multimodal CoT Data Construction: DeepSeek-R1 can generate human-like complex reasoning but, as a text-only model, cannot process images directly. The solution proceeds in three steps:
  1. Input image-text pairs into an MLLM to generate a "pseudo-CoT" (containing image descriptions and reasoning processes), exposing richer visual details.
  2. Feed the "pseudo-CoT" along with the original image-text pair back into the MLLM to obtain a detailed description, achieving modality bridging.
  3. Pass the pure-text descriptions into DeepSeek-R1 to obtain high-quality complex CoT.
After rule-based filtering, this yields the 200K Vision-R1-cold dataset, in which "Wait" appears 585K times (vs. 2.3K in LLaVA-CoT), demonstrating significantly richer self-reflection characteristics.
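The three bridging steps can be sketched as a small pipeline. The helpers `mllm_generate` and `deepseek_r1_generate` below are hypothetical placeholders for the actual model endpoints, not APIs from the paper:

```python
# Sketch of the three-step Modality Bridging pipeline. The model-calling
# helpers are hypothetical stand-ins for whatever inference API is used.

def mllm_generate(prompt: str, image=None) -> str:
    """Placeholder for an MLLM call (image + text in, text out)."""
    return f"[MLLM output for: {prompt[:40]}...]"

def deepseek_r1_generate(prompt: str) -> str:
    """Placeholder for a text-only DeepSeek-R1 call."""
    return f"[R1 CoT for: {prompt[:40]}...]"

def modality_bridging(image, question: str) -> str:
    # Step 1: the MLLM produces a "pseudo-CoT" exposing visual details.
    pseudo_cot = mllm_generate(
        f"Describe the image and reason step by step: {question}", image
    )
    # Step 2: the pseudo-CoT goes back in with the image to yield a
    # detailed text-only description of the visuals (the modality bridge).
    description = mllm_generate(
        f"Given this draft reasoning:\n{pseudo_cot}\n"
        f"write a complete textual description of the image for: {question}",
        image,
    )
    # Step 3: the text-only reasoner generates the final complex CoT.
    return deepseek_r1_generate(f"{description}\n\nQuestion: {question}")

cot = modality_bridging(image=None, question="What is the area of the shaded region?")
```

In the actual pipeline, rule-based filtering is then applied to the generated CoT before it enters Vision-R1-cold.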
- Overthinking Optimization Problem: After cold-start initialization, the model tends to generate extremely long reasoning chains for all problems, while correct answers are predominantly found in shorter chains. Directly applying RL with a 16K context length causes the model to generate longer but incorrect reasoning, degrading performance.
- Progressive Thinking Suppression Training (PTST): Training is divided into stages. In early stages, reasoning length is strictly constrained (e.g., 4K tokens), forcing the model to learn correct reasoning within a limited budget. As training progresses, the constraint is gradually relaxed (e.g., to 8K), allowing the model to autonomously apply more complex reasoning to harder problems. Specifically, Stage 1 uses 4K×16 (max length × number of samples) and Stage 2 uses 8K×8, keeping the product of length and sample count constant across stages. A Hard Format Result Reward Function (HFRRF) is employed: a reward of 1 is given only when both the format and the answer are correct; otherwise the reward is 0.
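A minimal sketch of the HFRRF reward and the PTST stage schedule described above. The `<think>`/`<answer>` tag format is an assumption for illustration, since the exact output template is not reproduced here:

```python
import re

# Hard Format Result Reward Function (HFRRF) sketch: reward 1 only when the
# output both follows the required format and contains the correct answer.
# The <think>/<answer> tag template is an assumption, not taken from the paper.
FORMAT_RE = re.compile(r"^<think>.*</think>\s*<answer>.*</answer>$", re.DOTALL)

def hfrrf(output: str, ground_truth: str) -> float:
    if not FORMAT_RE.match(output.strip()):
        return 0.0  # malformed output: no reward regardless of content
    answer = output.rsplit("<answer>", 1)[1].split("</answer>", 1)[0].strip()
    return 1.0 if answer == ground_truth.strip() else 0.0

# PTST schedule: (max reasoning length, samples per prompt) per stage.
# The product length x samples stays constant across stages (4K*16 == 8K*8).
PTST_STAGES = [(4096, 16), (8192, 8)]
assert all(l * n == 4096 * 16 for l, n in PTST_STAGES)

r = hfrrf("<think>short chain</think><answer>42</answer>", "42")
```

The all-or-nothing reward pairs naturally with the length cap: within each stage, only outputs that finish a well-formed, correct answer inside the budget receive credit.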
Loss & Training¶
GRPO objective with PTST:

\[
\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{q,\,\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\Big(\min\big(\rho_i A_i,\ \mathrm{clip}(\rho_i,\,1-\varepsilon,\,1+\varepsilon)\,A_i\big)-\beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big)\Big)\right],\qquad \rho_i=\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}
\]

where \(\varepsilon=0.2\), \(\beta=10^{-2}\), and the advantage estimate is \(A_i = \frac{r_i - \text{mean}(\{r_j\})}{\text{std}(\{r_j\})}\). Under PTST, sampling in each stage is capped at that stage's reasoning-length budget.
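The group-normalized advantage and the clipped surrogate term can be illustrated in a few lines. This is a scalar sketch with the stated \(\varepsilon=0.2\), not the actual training code (the KL penalty is omitted):

```python
import math

# Scalar sketch of GRPO's group-normalized advantage and clipped surrogate.
# Illustration only: real training operates on token-level log-probs and
# includes the KL penalty term omitted here.

def grpo_advantages(rewards):
    """A_i = (r_i - mean({r_j})) / std({r_j}) over one sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n) or 1e-8
    return [(r - mean) / std for r in rewards]

def clipped_surrogate(ratio, advantage, eps=0.2):
    """min(rho * A, clip(rho, 1-eps, 1+eps) * A) for a single sample."""
    clipped = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# With the binary HFRRF reward, a group of two correct and two wrong
# rollouts yields advantages of +1 and -1 respectively.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because HFRRF rewards are binary, the group statistics make correct rollouts positively advantaged and incorrect ones negatively advantaged, which is what drives length-constrained chains toward correctness under PTST.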
The cold-start stage applies standard SFT on Vision-R1-cold to fine-tune the base model (Qwen2.5-VL).
Key Experimental Results¶
Main Results¶
| Model | Params | MathVista | MathVerse | MM-Math | DynaMath | Avg. |
|---|---|---|---|---|---|---|
| OpenAI O1 | - | 73.9 | - | - | - | - |
| GPT-4o | - | 63.8 | 37.6 | 31.8 | 64.9 | - |
| Qwen2.5-VL-7B | 7B | 68.1 | 46.7 | 34.1 | 50.7 | 49.9 |
| Qwen2.5-VL-72B | 72B | 73.5 | 51.3 | 45.6 | 61.2 | 57.9 |
| Vision-R1-7B | 7B | 73.5 | 52.4 | 40.2 | 56.3 | 55.6 |
| Vision-R1-32B | 32B | 76.4 | 62.1 | 55.3 | 65.6 | 64.9 |
| Vision-R1-72B | 72B | 78.2 | 63.2 | 59.3 | 66.4 | 66.8 |
Vision-R1-7B vs. base Qwen2.5-VL-7B: GEO +13.4, ALG +10.3, GPS +16.4, MathVista overall +5.4.
Ablation Study¶
| Method | Cold Start | GRPO | PTST | Avg. Reasoning Length | Avg. Score (MathVista/MathVerse/MM-Math) |
|---|---|---|---|---|---|
| Vision-R1-Zero | ✗ | ✓ | ✗ | 1285 | 50.7 |
| Vision-R1-CI | ✓ | ✗ | ✗ | 3566 | 44.5 |
| Vision-R1-Long | ✓ | ✓ | ✗ | 3107 | 47.7 |
| Vision-R1 | ✓ | ✓ | ✓ | 2057 | 55.4 |
| PTST Config | Stage 1 | Stage 2 | MathVista | Avg. | Note |
|---|---|---|---|---|---|
| Fixed 16K | 16K×4 | 16K×4 | 70.3 | 47.7 | Severe overthinking with no early constraint |
| Fixed 4K | 4K×16 | 4K×16 | 72.6 | 54.3 | Effective but limits complex reasoning |
| PTST 2-stage | 4K×16 | 8K×8 | 73.5 | 55.4 | Optimal; progressive relaxation |
| PTST 3-stage | 4K×16 | 6K×12 → 8K×8 | 73.0 | 55.1 | Additional stage yields no significant gain |
Key Findings¶
- 7B rivals 72B: Vision-R1-7B achieves 73.5% on MathVista, only 0.4 points below OpenAI O1 and matching Qwen2.5-VL-72B (73.5).
- Direct RL training is insufficient: Vision-R1-Zero achieves only 50.7 average score and fails to elicit effective reasoning.
- Cold-start is necessary but not sufficient: The CI model scores 44.5 (severe overthinking) and must be combined with PTST.
- PTST is simple yet effective: The two-stage setup (4K→8K) reaches optimum; additional stages provide no benefit, demonstrating robustness.
- Data quality is critical: "Wait" appears 585K times in Vision-R1-cold vs. only 2.3K in LLaVA-CoT, a self-reflection token frequency roughly two orders of magnitude higher.
- Cross-model generalization is verified on Llama-3.2-11B-V: Vision-R1-cold SFT outperforms LLaVA-CoT and Mulberry across all benchmarks.
Highlights & Insights¶
- This work is the first systematic exploration of R1-style RL in MLLMs, clearly delineating the individual contributions of direct RL, cold-start initialization, and PTST.
- Modality Bridging elegantly resolves the limitation that DeepSeek-R1 cannot process images.
- PTST offers a deep insight: the model should first learn to "reason correctly" before learning to "reason complexly," analogous to human learning progressions.
- Using only 10K data for RL yields approximately 6% average improvement, demonstrating exceptional data efficiency.
- The "Aha moment" is observed in MLLMs for the first time (e.g., self-correction and self-reflection).
Limitations & Future Work¶
- RL training relies solely on mathematical data; generalization to general-purpose reasoning tasks remains to be validated.
- The number of stages and length settings in PTST are currently determined empirically, lacking theoretical grounding.
- Modality Bridging risks information loss during the visual-to-text conversion process.
- The 32B and 72B variants use additional data, making direct comparison with the 7B model not fully controlled.
- The scale of cold-start data (200K) may be a bottleneck; the benefits of larger-scale data remain to be explored.
Related Work & Insights¶
- Vision-R1 serves as the multimodal counterpart to DeepSeek-R1, pointing toward a viable path for enhancing reasoning in MLLMs.
- The PTST methodology is applicable to other RL scenarios requiring control over generation length.
- The Modality Bridging approach can be generalized to other settings where text-only LLMs need to process multimodal data.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First successful transfer of the R1-style reasoning paradigm to MLLMs; PTST strategy is original.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multiple benchmarks (MathVista/MathVerse/MM-Math/DynaMath), multiple scales (7B/32B/72B), and extensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Fluent exposition with a well-structured problem-driven narrative; notation is occasionally dense.
- Value: ⭐⭐⭐⭐⭐ Achieves O1-level multimodal reasoning at 7B parameters; significant inspiration for the research community.