R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization¶
Conference: ICCV 2025 | arXiv: 2503.10615 | Code: None | Area: Reinforcement Learning / Multimodal Reasoning | Keywords: multimodal reasoning, cross-modal formalization, reinforcement learning, vision-language model, reasoning benchmark
TL;DR¶
This paper proposes R1-Onevision, a framework that converts images into formalized textual representations via a cross-modal reasoning pipeline and applies a two-stage post-training strategy of SFT followed by rule-based reinforcement learning (GRPO). The resulting models substantially improve multimodal reasoning in vision-language models, surpassing GPT-4o on several mathematical reasoning benchmarks.
Background & Motivation¶
While large language models have achieved remarkable progress in textual reasoning (e.g., DeepSeek-R1), multimodal reasoning remains a significant challenge. Existing vision-language models exhibit the following limitations when handling complex reasoning tasks:
Perception errors: DeepSeek-R1, for instance, relies on incomplete image descriptions from GPT-4o, leading to flawed reasoning foundations.
Insufficient reasoning depth: Qwen2.5-VL, despite strong multimodal capabilities, lacks deep reasoning and ultimately fails to solve problems.
Limitations of templated reasoning: Methods such as LLaVA-CoT employ predefined reasoning structures, constraining flexibility and creativity.
Generalization issues from direct imitation: Methods such as MAmmoTH-VL directly imitate ground-truth answers, lacking a trial-and-error process.
Furthermore, existing multimodal reasoning benchmarks (e.g., MathVision, MathVista) primarily focus on mathematical problems and lack comprehensive evaluation covering multiple disciplines and difficulty levels.
Method¶
Overall Architecture¶
The R1-Onevision framework consists of three components: (1) a cross-modal reasoning pipeline for dataset construction; (2) a two-stage post-training strategy of SFT + RL; and (3) R1-Onevision-Bench, a comprehensive reasoning benchmark.
Key Designs¶
- Cross-Modal Reasoning Pipeline: Image content is converted into formalized textual representations, enabling language reasoning models to process visual information precisely. Differentiated strategies are applied according to image type (see the routing sketch after this list):
- Diagrams/flowcharts: GPT-4o generates structured representations (SPICE circuits, PlantUML flowcharts, HTML layouts, CSV/JSON tables).
- Natural scenes: Grounding DINO extracts bounding boxes + GPT-4o generates descriptive captions.
- Text-dominant images: EasyOCR extracts text and positions + GPT-4o reconstructs the document.
- Mathematical images: GPT-4o provides reasoning strategies.
- Reasoning process generation: A role-playing strategy is adopted to iteratively review images and refine understanding; DeepSeek-R1 is applied to LLaVA-OneVision data to generate reasoning chains.
- Quality assurance: GPT-4o filters inaccurate or inconsistent CoT steps.
- R1-Onevision Dataset: A total of 155K carefully curated samples spanning science, mathematics, charts, and general scenarios, each annotated with detailed step-by-step reasoning.
- Two-Stage Post-Training Strategy:
  - SFT stage: Qwen2.5-VL is fine-tuned on the R1-Onevision dataset to cultivate coherent reasoning patterns and a standardized output format (the `<think>...</think>` structure).
  - RL stage: GRPO (Group Relative Policy Optimization) is applied on the CLEVR dataset with two rule-based rewards (sketched in code after this list):
    - Accuracy reward: The final answer is extracted via regular expressions and compared against the ground truth.
    - Format reward: Ensures the reasoning process is correctly enclosed within `<think>` tags.
  - GRPO loss: \(\mathcal{L}_{\text{GRPO}}(\theta) = -\mathbb{E}\left[\min\left(r_t \hat{A}_t,\ \operatorname{clip}(r_t,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t\right) - \beta\, \mathbb{D}_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)\right]\), where \(r_t = \pi_\theta(y_t \mid x) / \pi_{\theta_{\text{old}}}(y_t \mid x)\) is the importance ratio and \(\hat{A}_t\) is the group-relative advantage (an implementation sketch follows the training details below).
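A minimal sketch of the per-type routing in the cross-modal pipeline. The function names and placeholder outputs are illustrative stand-ins; the paper's actual tools are GPT-4o, Grounding DINO, and EasyOCR.

```python
def formalize_diagram(image_path: str) -> str:
    # Paper: GPT-4o emits structured text (SPICE circuits, PlantUML
    # flowcharts, HTML layouts, CSV/JSON tables).
    return f"[structured representation of {image_path}]"

def formalize_natural_scene(image_path: str) -> str:
    # Paper: Grounding DINO bounding boxes + GPT-4o descriptive captions.
    return f"[boxed objects and captions for {image_path}]"

def formalize_text_image(image_path: str) -> str:
    # Paper: EasyOCR extracts text and positions; GPT-4o reconstructs
    # the document.
    return f"[reconstructed document from {image_path}]"

def formalize_math_image(image_path: str) -> str:
    # Paper: GPT-4o provides a reasoning strategy for the figure.
    return f"[reasoning strategy for {image_path}]"

FORMALIZERS = {
    "diagram": formalize_diagram,
    "natural": formalize_natural_scene,
    "text": formalize_text_image,
    "math": formalize_math_image,
}

def formalize(image_path: str, image_type: str) -> str:
    """Route an image to the formalization strategy for its type."""
    try:
        return FORMALIZERS[image_type](image_path)
    except KeyError:
        raise ValueError(f"unknown image type: {image_type}") from None
```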
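And a minimal sketch of the two rule-based rewards. The paper specifies regex extraction of the final answer but not the exact pattern, so the `\boxed{...}` convention in `ANSWER_RE` is an assumption, as is the unweighted sum of the two rewards.

```python
import re

# Reasoning must be wrapped in <think>...</think>, followed by an answer.
THINK_RE = re.compile(r"^<think>.+?</think>\s*\S", re.DOTALL)
# Hypothetical final-answer pattern; the paper only says "regular expressions".
ANSWER_RE = re.compile(r"\\boxed\{([^}]*)\}")

def format_reward(completion: str) -> float:
    """1.0 if the reasoning is correctly enclosed in <think> tags."""
    return 1.0 if THINK_RE.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Extract the last matched answer and compare it to the ground truth."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == ground_truth.strip() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    """Combined rule-based reward for GRPO (equal weighting is an assumption)."""
    return accuracy_reward(completion, ground_truth) + format_reward(completion)
```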
Loss & Training¶
- SFT: batch size 128, learning rate 1e-5, trained for 1 epoch.
- RL: trained for 1 epoch on a 10K subset of CLEVR.
- Base models: Qwen2.5-VL-7B and Qwen2.5-VL-3B.
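For concreteness, a sketch of the GRPO loss defined above, computed at the sequence level for one group of sampled responses. The sequence-level aggregation, the k3 KL estimator, and the `clip_eps`/`beta` defaults are implementation assumptions, not values reported in the paper.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, beta=0.04):
    """Sketch of the GRPO objective for one group of G sampled responses.

    logp_new/logp_old/logp_ref: summed log-probs of each response under the
    current, behavior, and frozen reference policies, shape (G,).
    rewards: scalar rule-based rewards (accuracy + format), shape (G,).
    """
    # Group-relative advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped importance-weighted policy objective (PPO-style).
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_obj = torch.minimum(ratio * adv, clipped * adv)

    # KL penalty toward the reference policy (k3 estimator).
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1

    # Negate: maximize the clipped objective minus the KL penalty.
    return -(policy_obj - beta * kl).mean()
```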
Key Experimental Results¶
Main Results¶
Performance on mathematical reasoning benchmarks:
| Model | MathVision | MathVerse (All) | MathVerse (Vision Only) | MathVista | WeMath |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B (base) | 25.4 | 43.6 | 38.2 | 63.7 | 61.0 |
| GPT-4o | 30.6 | 41.2 | 34.5 | 60.0 | 69.0 |
| InternVL2.5-8B | 17.1 | 35.6 | 22.8 | 64.5 | 53.8 |
| LLaVA-CoT-11B | - | - | 22.6 | 52.5 | - |
| R1-Onevision-7B | 29.9 | 46.4 | 40.0 | 64.1 | 61.8 |
R1-Onevision-Bench results (partial):
| Model | Avg. | Middle | High | College | Social | Math | Physics | Chemistry | Biology | Deduction |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 49.6 | 51.3 | 56.2 | 45.3 | 26.5 | 41.3 | 52.5 | 71.4 | 63.4 | 26.5 |
| Gemini-2.0-Flash | 59.1 | 56.0 | 65.9 | 61.2 | 39.8 | 52.3 | 64.4 | 74.3 | 67.2 | 39.8 |
| Qwen2.5-VL-7B | 32.1 | 33.8 | 37.1 | 25.3 | 19.4 | 31.5 | 27.3 | 39.0 | 47.0 | 19.4 |
| R1-Onevision-7B | 36.2 | 40.1 | 39.5 | 27.6 | 26.5 | 33.0 | 30.2 | 49.5 | 53.0 | 26.5 |
| Qwen2.5-VL-72B | 52.0 | 54.3 | 56.7 | 54.1 | 23.5 | 48.9 | 55.8 | 63.8 | 63.4 | 23.5 |
Ablation Study¶
Training strategy ablation (based on Qwen2.5-VL-7B):
| Strategy | MathVision | MathVerse | MathVerse (Vision Only) |
|---|---|---|---|
| Base | 25.4 | 43.6 | 38.2 |
| +SFT | 26.3 | 43.4 | 39.7 |
| +SFT+RL | 29.9 | 46.4 | 40.0 |
| RL only (w/o SFT) | 28.0 | - | - |
Model scale ablation (Qwen2.5-VL-3B):
| Model | MathVision | MathVerse | MathVerse (Vision Only) |
|---|---|---|---|
| Qwen2.5-VL-3B | 21.7 | 34.7 | 31.2 |
| R1-Onevision-3B | 23.7 | 38.6 | 35.5 |
Key Findings¶
- R1-Onevision-7B surpasses GPT-4o by 5.2 and 4.1 points on MathVerse and MathVista, respectively.
- SFT serves as an essential foundation for RL: SFT+RL outperforms RL-only by 1.9 points on MathVision.
- All models perform poorly on deduction-type problems, with no model exceeding 40%.
- The 7B model after post-training substantially narrows the gap with closed-source large models.
- The method is effective at both 3B and 7B scales, validating scalability.
Highlights & Insights¶
- Core idea of cross-modal formalization: Converting images into structured textual representations (e.g., SPICE, PlantUML) enables language reasoning models to effectively "perceive" visual content, elegantly transforming visual reasoning into textual reasoning (a toy example follows this list).
- Role-playing reasoning strategy: Iteratively reviewing images to simulate the human comprehension process yields more accurate results than single-pass description.
- Complementarity of SFT and RL: SFT establishes reasoning format and foundational capability, while RL further enhances generalization.
- Education-level design of R1-Onevision-Bench: Graded from middle school → high school → college → social examinations, providing an intuitive dimension for capability assessment.
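As a toy illustration of what formalization buys, here is a hypothetical chart encoded as a JSON-style Python dict. The schema is invented for illustration; the paper uses SPICE for circuits, PlantUML for flowcharts, and CSV/JSON for tables.

```python
# A bar chart rendered as structured text that a text-only reasoner can consume.
chart_as_text = {
    "type": "bar_chart",
    "title": "Monthly sales",
    "x_axis": ["Jan", "Feb", "Mar"],
    "y_axis_unit": "units sold",
    "values": {"Jan": 120, "Feb": 95, "Mar": 140},
}
# A language model can now answer "Which month sold the most?" by reasoning
# over the dictionary instead of over pixels.
```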
Limitations & Future Work¶
- The reasoning process generation relies on closed-source models such as GPT-4o and DeepSeek R1, resulting in high data construction costs.
- The RL stage is trained only on a 10K subset of CLEVR, limiting its scale.
- All models perform poorly on deduction-type problems, indicating that logical reasoning remains a bottleneck.
- 83.1% of benchmark questions are multiple-choice, providing insufficient evaluation of open-ended reasoning.
- Formalized descriptions depend on the accuracy of OCR and detection models, which may introduce errors.
Related Work & Insights¶
- DeepSeek-R1: Demonstrates the powerful effect of RL on enhancing textual reasoning capability.
- LLaVA-CoT/LlamaV-o1: Pioneers of predefined reasoning structures.
- MAmmoTH-VL: Large-scale multimodal reasoning data construction.
- Insight: The key to visual reasoning may lie not in better visual encoders, but in transforming visual information into forms that language models can reason over efficiently.
Rating¶
- Novelty: ⭐⭐⭐⭐ The cross-modal formalization pipeline and education-level benchmark design are novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-benchmark evaluation, training strategy ablation, and model scale ablation.
- Writing Quality: ⭐⭐⭐⭐ Clear framework with rich illustrations.
- Value: ⭐⭐⭐⭐ Unified contribution of dataset, model, and benchmark.