MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction¶
Conference: CVPR 2026 | arXiv: 2604.01600 | Code: https://zitiantang.github.io/MM-ReCoder | Area: Multimodal VLM / Code Generation | Keywords: Chart-to-Code, Reinforcement Learning, Self-Correction, Multi-Turn Dialogue, GRPO
TL;DR¶
This paper proposes MM-ReCoder, the first multimodal LLM with genuine self-correction capability for chart-to-code generation. Through a two-stage multi-turn GRPO reinforcement learning framework (first optimizing correction ability via shared-first-turn training, then optimizing coding ability via full-trajectory training), MM-ReCoder achieves an 86.5% low-level score on ChartMimic with only 7B parameters, rivaling Qwen3-VL-235B.
Background & Motivation¶
- Background: The Chart2Code task requires generating executable Python plotting code from chart images. Existing approaches primarily rely on SFT (e.g., ChartCoder trained on 160k chart–code pairs), with only a few works (e.g., ChartMaster) beginning to explore RL.
- Limitations of Prior Work: SFT methods do not interact with code execution environments, and thus cannot guarantee the executability or visual fidelity of generated code. Existing RL methods (e.g., ChartMaster) perform only single-pass generation without iterative self-correction.
- Key Challenge: Human programming is inherently iterative (write → execute → inspect → revise), yet existing MLLMs generate code in a single pass. Experiments reveal that even large models such as Qwen3-VL-235B exhibit degraded code quality during self-correction on already-executable code (−0.26%).
- Goal: Enable MLLMs to acquire genuine self-correction ability—not merely fixing execution errors, but also improving the visual quality of already-executable code.
- Key Insight: The authors find that the apparent "improvement" of existing models stems solely from making crashed code executable, rather than from genuine quality enhancement. Two-stage RL training is thus designed to address each aspect separately.
- Core Idea: Employ a two-stage GRPO training pipeline—shared-first-turn optimization followed by full-trajectory optimization—to first acquire correction ability and then improve overall coding ability.
Method¶
Overall Architecture¶
The training pipeline consists of two major stages: (1) Cold Start: SFT on Chart2Code-160k, followed by fine-tuning on 7k curated two-turn correction samples; (2) Two-Stage Multi-Turn GRPO RL: Stage 1 fixes the first-turn output and trains correction ability (shared-first-turn), while Stage 2 jointly optimizes both turns (full-trajectory). At inference time, the model can perform an arbitrary number of self-correction rounds, receiving the rendered chart or the error message as feedback at each turn.
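As a concrete illustration, here is a minimal sketch of what this write → execute → inspect → revise loop could look like at inference time. This is not the authors' released code: `model.generate` is a hypothetical wrapper around the MLLM, and we assume the prompt instructs the model to save its figure to `output.png`.

```python
import subprocess
import tempfile
from pathlib import Path

def self_correct(model, chart_image, max_turns=4):
    """Multi-turn self-correction loop (illustrative sketch)."""
    code, feedback = None, None
    for _ in range(max_turns):
        # Turn 1 conditions on the target chart only; later turns also see
        # the previous code and its execution feedback.
        code = model.generate(chart_image, prev_code=code, feedback=feedback)
        workdir = Path(tempfile.mkdtemp())
        script = workdir / "attempt.py"
        script.write_text(code)
        proc = subprocess.run(
            ["python", str(script)],
            cwd=workdir, capture_output=True, text=True, timeout=60,
        )
        if proc.returncode != 0:
            # Crashed code: feed the traceback back as textual feedback.
            feedback = {"error": proc.stderr}
        else:
            # Executable code: feed the rendered chart back as visual feedback.
            feedback = {"rendered": workdir / "output.png"}
    return code
```

Per the multi-turn analysis in Key Findings, most of the gain is realized within four turns.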
Key Designs¶
- Shared-First-Turn Optimization:
- Function: Dedicated training of the model's self-correction ability.
- Mechanism: For each input chart, a shared first-turn output \(o^{(1)}\) is sampled; \(G\) diverse second-turn correction candidates \(o_i^{(2)}\) are then generated conditioned on it. RL optimization is applied only to the second turn, with the first turn serving as shared context. The objective is \(\mathcal{J}^{(\text{shared})} = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \frac{\pi_\theta(o_i^{(2)}\mid q,\,o^{(1)},\,f^{(1)})}{\mathrm{SG}\!\left[\pi_\theta(o_i^{(2)}\mid q,\,o^{(1)},\,f^{(1)})\right]}\, A_i\right]\), where \(q\) is the input query, \(f^{(1)}\) is the first-turn execution feedback, \(\mathrm{SG}[\cdot]\) denotes stop-gradient, and \(A_i\) is the group-relative advantage.
- Design Motivation: Under direct full-trajectory optimization, the model simply repeats the first-turn code in 46.9% of cases (since the expected reward of modifying already-executable code is negative), or deliberately generates a poor first turn to exploit the reward mechanism. Fixing the first turn forces the model to learn to improve a given piece of code.
- Reward Design (Rule-based + Model-based):
- Function: Comprehensive evaluation of generated chart quality.
- Mechanism: The rule-based reward hooks Matplotlib functions to extract chart elements (type, text, color, layout, etc.) and computes F1 scores / CIE Lab color distances against the ground truth (range [0,1]). The model-based reward uses Qwen2.5-VL-72B to score outputs across six dimensions (out of 100, normalized to [0,1]). The final reward is \((1-\alpha-\beta)\times\text{Format} + \alpha\times\text{Rule-based} + \beta\times\text{Model-based}\); a minimal sketch of this combination follows the list below.
- Design Motivation: Rule-based rewards are precise but have blind spots (e.g., overlapping text still receives full score), while model-based rewards capture visual quality but are noisy. The two are complementary.
- Two-Stage Training Strategy:
- Function: Learn correction first, then coding.
- Mechanism: RL Stage 1 uses the shared-first-turn strategy to cultivate correction ability (encouraging diverse correction solutions). Stage 2 switches to full-trajectory optimization, jointly training both turns to improve overall coding competence. Rewards are computed on the final turn only (a first-turn reward weight of \(\gamma=0\) is optimal).
- Design Motivation: Ablations show that full-trajectory-only optimization leads to code repetition and reward exploitation, while shared-first-turn-only training improves correction ability but yields weaker overall coding performance. The two-stage combination is mutually complementary.
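As referenced in the reward-design bullet above, here is a minimal sketch of how the three reward terms could be combined. The weights `alpha`/`beta` are illustrative (their exact values are not given in this note), `judge_score` stands in for the Qwen2.5-VL-72B rating, and the CIE Lab color comparison is omitted for brevity:

```python
import re

def format_reward(response: str) -> float:
    """1.0 iff the response is <think>...</think> followed by a fenced python block."""
    pattern = r"<think>.*?</think>.*?```python.*?```"
    return 1.0 if re.search(pattern, response, re.DOTALL) else 0.0

def rule_based_score(pred_elems: dict, gt_elems: dict) -> float:
    """Toy per-category F1 over extracted chart elements, averaged into [0, 1]."""
    f1s = []
    for key, gt_vals in gt_elems.items():
        pred, gt = set(pred_elems.get(key, [])), set(gt_vals)
        tp = len(pred & gt)
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gt) if gt else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0

def total_reward(response, pred_elems, gt_elems, judge_score,
                 alpha=0.6, beta=0.3):
    """(1 - alpha - beta) * Format + alpha * Rule-based + beta * Model-based."""
    return ((1 - alpha - beta) * format_reward(response)
            + alpha * rule_based_score(pred_elems, gt_elems)
            + beta * judge_score / 100.0)  # judge score is out of 100
```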
Loss & Training¶
- Cold Start: SFT on Chart2Code-160k for 1 epoch + correction data for 2 epochs; batch=128, lr=\(10^{-5}\).
- Correction Data Construction: Two-turn dialogues generated by Qwen3-VL-235B; samples where the second-turn low-level score exceeds the first by more than 0.02 are retained (~7k samples).
- GRPO: Group size \(G=8\); 1 epoch per stage; batch=128; lr=\(10^{-6}\); maximum response length 4096 tokens. A toy sketch of the group-relative advantage follows this list.
- Format reward: encourages outputs of the form `<think>...</think>` followed by a fenced Python code block (`` ```python ... ``` ``).
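For intuition, a toy sketch of the group-relative advantage at the core of GRPO with \(G=8\): the second-turn candidates sampled from the same shared first turn are scored, and each reward is standardized against its group, so only corrections that beat their siblings get reinforced.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """GRPO-style advantages: standardize rewards within each group.

    rewards: (num_prompts, G) tensor, one row per shared first-turn context,
    one column per sampled second-turn correction.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One group of G=8 corrections of the same first-turn code (made-up rewards).
rewards = torch.tensor([[0.71, 0.84, 0.63, 0.88, 0.90, 0.55, 0.84, 0.77]])
adv = group_relative_advantages(rewards)
# Candidates above the group mean get positive advantage and are reinforced;
# those below it (e.g., a near-verbatim repeat of weak code) are suppressed.
```

In the shared-first-turn objective above, these \(A_i\) weight each second-turn candidate while the first turn stays fixed as shared context.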
Key Experimental Results¶
Main Results¶
| Model | Params | Turns | ChartMimic Low↑ | ChartMimic High↑ | Plot2Code Text-Match↑ |
|---|---|---|---|---|---|
| ChartCoder | 7B | 1 | 77.4 | 74.0 | 54.5 |
| Qwen2.5-VL-7B | 7B | 1 | 56.2 | 49.6 | 47.8 |
| Qwen3-VL-235B | 235B | 1 | 80.9 | 85.9 | 60.9 |
| GPT-4o | — | 1 | 81.8 | 83.7 | 59.8 |
| MM-ReCoder | 7B | 1 | 83.5 | 81.2 | 63.2 |
| MM-ReCoder | 7B | 4 | 86.5 | 84.9 | 62.7 |
Ablation Study¶
RL strategy comparison (ChartMimic):
| Strategy | Turn-1 Low | Turn-2 Low | Avg Improvement | Improve Rate | Code Repeat Rate |
|---|---|---|---|---|---|
| Full-traj (γ=0,η=0) | 81.8 | 83.9 | +0.21 | 3.4% | 46.9% |
| Full-traj (γ=0,η=0.1) | 66.7 | 84.3 | +10.11 | 87.5% | 0.3% |
| Shared-first + Full-traj | 83.7 | 86.0 | +0.55 | 12.1% | 21.6% |
Stage-wise ablation:
| Stage | Low-level | Avg Improvement | Repeat Rate |
|---|---|---|---|
| Qwen2.5-VL-7B base | 56.2 | −0.36 | 10.9% |
| + Single-turn cold start | 79.1 | −0.10 | 81.5% |
| + Multi-turn cold start | 75.2 | −0.54 | 2.3% |
| + RL (full) | 83.5 (turn 1) → 84.8 (turn 2) | +0.30 | 2.2% |
Key Findings¶
- Existing large models cannot genuinely self-correct: Qwen3-VL-8B/235B show negative quality improvement on already-executable code (−1.03%/−0.26%); gains come solely from fixing crashed code.
- Using only the correction bonus (\(\eta=0.1\)) causes the model to "cheat"—deliberately generating a poor first turn to obtain a high improvement reward (Turn-1 Low drops to 66.7).
- Multi-turn correction exhibits diminishing returns: Turn 1→2 improves by 1.3%, Turn 2→3 by 0.5%, with no further gain beyond Turn 4→5.
- Code repetition is the core bottleneck: After single-turn SFT, 81.5% of second-turn outputs are simple copies of the first turn.
Highlights & Insights¶
- Diagnosis before treatment: The authors first demonstrate the "pseudo self-correction" problem in existing models (gains arising only from fixing crashed code), then design targeted solutions accordingly. This problem-driven research methodology is exemplary.
- Shared-first-turn strategy: Decoupling correction training from overall coding training avoids reward exploitation in RL. This strategy is transferable to other multi-round code generation scenarios (e.g., webpage generation, SVG generation).
- Hybrid rule-based and model-based reward: The idea of hooking Matplotlib functions to precisely extract chart elements is elegant, effectively resolving the challenge of automated chart-quality assessment (an illustrative sketch of such a hook follows below).
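To make the hooking idea concrete, here is an illustrative monkey-patch of a few Matplotlib `Axes` methods. The paper's actual hook set and extraction schema are not detailed in this note, so the wrapped methods and recorded fields below are assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render without a display
import matplotlib.axes

RECORDED = {"texts": [], "bars": [], "lines": []}  # reset between runs

def hook(cls, name, bucket, extract):
    """Wrap a plotting method so every call logs its key arguments."""
    original = getattr(cls, name)
    def wrapper(self, *args, **kwargs):
        RECORDED[bucket].append(extract(args, kwargs))
        return original(self, *args, **kwargs)
    setattr(cls, name, wrapper)

# Axes.text(x, y, s, ...): record the drawn string.
hook(matplotlib.axes.Axes, "text", "texts",
     lambda a, k: a[2] if len(a) > 2 else k.get("s"))
# Axes.bar(x, height, ...): record bar positions and color.
hook(matplotlib.axes.Axes, "bar", "bars",
     lambda a, k: {"x": a[0], "color": k.get("color")})
# Axes.plot(*args, ...): record line color and style.
hook(matplotlib.axes.Axes, "plot", "lines",
     lambda a, k: {"color": k.get("color"), "linestyle": k.get("linestyle")})

# exec() the generated code under these hooks, then compare RECORDED against
# a ground-truth recording: F1 for discrete elements, CIE Lab for colors.
```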
Limitations & Future Work¶
- RL training is explored only for single-round correction; multi-round RL training may yield further gains.
- Model-based reward relies on a 72B model, incurring high training costs (4×8 H200 GPUs required for the reward model).
- Richer feedback from the code execution environment (e.g., diff information between rendered outputs) remains unexplored.
- Validation is limited to chart generation; the framework is extensible to more visual coding tasks such as webpages, UI, and SVG.
Related Work & Insights¶
- vs. ChartMaster: ChartMaster first introduced GRPO to Chart2Code but only performs single-pass generation. MM-ReCoder extends this with a multi-turn self-correction dimension.
- vs. SCoRe (Kumar et al.): SCoRe pioneered two-stage RL for self-correction in text-only LLMs. MM-ReCoder extends this paradigm to multimodal coding tasks and adapts it for GRPO (SCoRe uses REINFORCE/PPO).
- vs. ChartCoder: ChartCoder relies solely on SFT; applying RL on top of its training data (Chart2Code-160k) raises the low-level score from 77.4 to 86.5 (+9.1 points), demonstrating the substantial value of RL.
Rating¶
- Novelty: ⭐⭐⭐⭐ First to achieve reliable self-correction in multimodal coding; the shared-first-turn strategy is elegantly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three benchmarks, extensive strategy ablations, multi-turn correction analysis, and detailed comparisons with large models.
- Writing Quality: ⭐⭐⭐⭐ Problem motivation is clear and experimental analysis is thorough; notation is dense and requires careful tracking.
- Value: ⭐⭐⭐⭐ High practical value—7B model rivaling 235B; the self-correction paradigm is generalizable to broader tasks.