Visual-ERM: Reward Modeling for Visual Equivalence

Conference: CVPR2025
arXiv: 2603.13224
Code: GitHub
Area: Image Generation
Keywords: reward model, vision-to-code, reinforcement learning, visual equivalence, chart-to-code, test-time scaling

TL;DR

Proposes Visual-ERM, a multimodal generative reward model that evaluates the rendered output of vision-to-code tasks directly in the visual space, providing fine-grained, interpretable, and task-agnostic reward signals for RL training and test-time scaling.

Background & Motivation

Importance of vision-to-code: Transforming structured visual inputs (charts, tables, SVGs) into executable code or markup languages is a fundamental primitive for downstream applications such as AI-assisted development and scientific paper parsing.

Limitations of SFT: Supervised Fine-Tuning (SFT) requires abundant labeled data and suffers from weak cross-domain generalization.

Mismatch in RL Reward Signals: Existing rewards either rely on text rules (such as edit distance, TEDS), which ignore visual cues, or leverage coarse-grained visual embedding similarities (like DINO), which are insensitive to fine-grained differences.

Reward Hacking: Outputs with a DINO similarity score of 0.99 can still contain numerous parsing errors, and textual metrics fail to capture visual layout and spacing discrepancies.

Lack of Unified Evaluation Benchmark: Current reward benchmarks focus primarily on vision-language alignment, lacking a benchmark for fine-grained image-to-image difference discrimination.

Need for Cross-Modal Evaluation: An ideal reward model must simultaneously perceive visual details, read embedded text, and reason about structural fidelity.

Method

Overall Architecture

Visual-ERM operates in the visual space: given the ground-truth image \(I^\star\) and the image rendered from the predicted code, \(\hat{I} = \mathcal{R}_m(y)\) (where \(y\) is the predicted code and \(\mathcal{R}_m\) the corresponding renderer), the model outputs fine-grained difference descriptions and severity scores that serve as reward signals.

Key Designs

1. Reward Data Generation (Controlled Corruption + Sampling)
   - Edit Mode: Uses a strong LVLM to inject predefined error types into the ground-truth text, generating controlled corrupted samples.
   - Infer Mode: Uses a weaker LVLM for direct inference and prediction, sampling naturally occurring errors that are closer to the actual distribution.
   - Renders predictions as images and pairs them with ground truths to form training data.

2. Fine-grained Annotation Distillation
   - Open-source models (including Qwen3-VL-235B) still suffer from significant gaps in difference localization.
   - Employs GPT-5-mini for high-quality difference annotation, transferring capabilities to more efficient models via distillation.
   - Annotations include: error types, locations, descriptions, and severity levels.
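One annotation, as described in item 2, could be represented roughly as follows. This is a minimal sketch: the field names (`error_type`, `location`, `description`, `severity`), the JSON layout, the integer severity scale, and the example values are all assumptions made for illustration, not the schema used by the authors.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Difference:
    """One annotated discrepancy between the reference and the rendered image."""
    error_type: str   # hypothetical label, e.g. "wrong_value"
    location: str     # free-text description of where the difference is
    description: str  # what differs
    severity: int     # assumed integer scale; larger means worse


def parse_annotation(raw: str) -> List[Difference]:
    """Parse a JSON list of difference records into typed objects."""
    return [Difference(**item) for item in json.loads(raw)]


# Made-up example of a model output, serialized as JSON.
example = json.dumps([
    {"error_type": "wrong_value", "location": "row 2, column 3",
     "description": "cell shows 41.2 instead of 14.2", "severity": 3},
    {"error_type": "missing_element", "location": "legend",
     "description": "entry for the second series is absent", "severity": 2},
])

diffs = parse_annotation(example)
total_severity = sum(d.severity for d in diffs)
```

Keeping the output structured in this way is what would let the same record feed both a scalar reward (via the severities) and textual feedback (via the descriptions).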

3. Visual-ERM Training
   - Trained on top of Qwen3-VL-8B-Instruct.
   - Takes the image pair \((I^\star, \hat{I})\) as input and outputs a fine-grained difference analysis sequence \(a\).
   - Standard NLL objective, with \(x\) denoting the ERM input (the image pair): \(\mathcal{L}(\theta_{\text{ERM}}) = \mathbb{E}\left[-\sum_t \log f_{\theta_{\text{ERM}}}(a_t \mid x, a_{<t})\right]\)
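The NLL objective in item 3 is the usual token-level negative log-likelihood of the annotation sequence. The toy sketch below computes it for a single sequence; the dictionaries stand in for a model's per-step softmax outputs, and the tokens and probabilities are invented for illustration.

```python
import math
from typing import Dict, List, Sequence


def sequence_nll(step_probs: Sequence[Dict[str, float]], target: List[str]) -> float:
    """Return the sum over t of -log p(a_t | x, a_<t) for one target sequence."""
    if len(step_probs) != len(target):
        raise ValueError("need exactly one distribution per target token")
    return -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, target))


# Toy per-step distributions (assumed values, not model outputs).
step_probs = [
    {"missing": 0.5, "wrong": 0.5},
    {"legend": 0.25, "axis": 0.75},
]
target = ["missing", "legend"]

loss = sequence_nll(step_probs, target)  # equals -(log 0.5 + log 0.25) = log 8
```

In training, this quantity would be averaged over the dataset of image pairs and annotations, which corresponds to the expectation in the formula.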

4. RL Integration
   - Sum of difference severity: \(S_{\text{verm}} = \sum_{k=1}^K s_k\), where \(s_k\) is the severity of the \(k\)-th reported difference; the sum is normalized to \([0,1]\).
   - Final reward: \(r = r_{\text{rsr}} + r_{\text{verm}}\) (rendering success reward + visual equivalence reward).
   - Optimizes the policy model using the GRPO algorithm.
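The reward aggregation in item 4 can be sketched as below. The summary does not spell out how the summed severity is normalized, so the clipping at a hypothetical cap `s_max`, the `1 - S / s_max` mapping (lower severity gives higher reward), and the choice of a 0/1 rendering-success term are all assumptions of this sketch.

```python
from typing import Optional, Sequence


def visual_equivalence_reward(severities: Sequence[float], s_max: float = 10.0) -> float:
    """Map the summed severity to [0, 1]; 1.0 means no differences were reported.

    Assumption: clip the sum at `s_max`, then use 1 - S / s_max.
    """
    total = sum(severities)
    return 1.0 - min(total, s_max) / s_max


def total_reward(severities: Optional[Sequence[float]], s_max: float = 10.0) -> float:
    """Compute r = r_rsr + r_verm; `severities` is None when rendering failed."""
    if severities is None:
        return 0.0  # assumed: without a rendered image, neither term is earned
    r_rsr = 1.0     # assumed: rendering success is a binary 0/1 reward
    return r_rsr + visual_equivalence_reward(severities, s_max)


perfect = total_reward([])      # rendered, no differences
flawed = total_reward([3, 2])   # rendered, summed severity 5
failed = total_reward(None)     # code did not render
```

Under these assumptions a perfect rendering scores 2.0, a rendering with summed severity 5 scores 1.5, and unrenderable code scores 0.0, so rendering failures are always ranked below any rendered output.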

5. Test-Time Scaling
   - Explainable feedback generated by Visual-ERM can directly guide iterative self-correction.
   - The model refines its predictions based on the previous output and feedback: \(y^{(1)} \sim \pi_\theta(\cdot | x, y^{(0)}, f^{(0)})\).
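The refinement step in item 5 can be repeated as a loop. The sketch below is illustrative only: `refine`, `toy_generate`, and `toy_critique` are hypothetical names, the stopping rule and round limit are assumptions, and the toy critic compares comma-separated strings instead of rendered images so that the example runs without any model.

```python
from typing import Callable, List, Optional, Tuple

Feedback = List[str]


def refine(
    x: str,
    generate: Callable[[str, Optional[str], Optional[Feedback]], str],
    critique: Callable[[str, str], Feedback],
    max_rounds: int = 3,
) -> Tuple[str, Feedback]:
    """Regenerate y conditioned on the previous output and its feedback."""
    y = generate(x, None, None)       # y^(0)
    feedback = critique(x, y)         # f^(0)
    for _ in range(max_rounds):
        if not feedback:              # no differences reported: stop early
            break
        y = generate(x, y, feedback)  # y^(t+1) ~ pi(. | x, y^(t), f^(t))
        feedback = critique(x, y)
    return y, feedback


def toy_critique(x: str, y: str) -> Feedback:
    """Stand-in for Visual-ERM: list positions where y differs from reference x."""
    pairs = zip(x.split(","), y.split(","))
    return [f"{i}:{want}" for i, (want, got) in enumerate(pairs) if want != got]


def toy_generate(x: str, prev: Optional[str], feedback: Optional[Feedback]) -> str:
    """Stand-in for the policy: flawed first draft, then fix one issue per call."""
    if prev is None:
        return "a,x,y"
    cells = prev.split(",")
    idx, want = feedback[0].split(":")
    cells[int(idx)] = want
    return ",".join(cells)


final, remaining = refine("a,b,c", toy_generate, toy_critique)
```

In this toy run the draft `a,x,y` is repaired one cell per round and the loop exits once the critic reports no differences. A scalar-only reward could not drive such a loop, because the policy would have no indication of what to change.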

Loss & Training

Visual-ERM training: Standard sequence-generation NLL loss.
RL policy optimization: GRPO objective with KL regularization; reward = rendering success + visual equivalence score.
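GRPO replaces a learned value baseline with statistics of a group of rollouts sampled for the same input. The sketch below shows only that group-relative advantage step, applied to combined rewards like those described above; the clipped policy-ratio objective and the KL term are omitted, and the use of the population standard deviation and the `eps` constant are implementation assumptions.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rollouts for one input image, scored with r = r_rsr + r_verm.
adv = group_relative_advantages([2.0, 1.5, 1.0, 0.0])
```

Rollouts scoring above the group mean receive positive advantages and the rest negative ones, so the policy is pushed toward outputs that Visual-ERM judges closer to the reference than their siblings.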

Key Experimental Results

Main Results: Chart-to-Code (ChartMimic)

Model	Direct Overall	Customized Overall	Avg
Qwen3-VL-8B-Instruct	67.7	71.6	69.6
+ RL (DINO-based)	76.5	75.8	76.1
+ RL (Visual-ERM)	79.5	76.5	78.0
Δ vs. base	+11.8	+4.9	+8.4

Main Results: Table-to-Markdown

Model	OmniDocBench TEDS↑	Edit-Dist↓	olmOCR TA↑	Avg↑
Qwen3-VL-8B + RL (DINO)	62.2	37.0	71.7	65.3
Qwen3-VL-8B + RL (TEDS)	79.2	31.6	78.6	74.8
Qwen3-VL-8B + RL (Visual-ERM)	81.4	20.7	78.1	79.5
Δ vs. base	+2.5	+2.5	+2.8	+2.7

DINO-based RL degrades severely on table tasks (Avg drops to 65.3), whereas Visual-ERM achieves consistently robust improvements.

VC-RewardBench Evaluation

Visual-ERM (8B) outperforms Qwen3-VL-235B-Instruct in fine-grained image difference discrimination and approaches the performance of leading closed-source models.

Ablation Study & Key Findings

DINO rewards are highly prone to reward hacking: their bias toward global semantics lets errors in textual content go unpenalized.
Textual metrics are insensitive to errors in visual layout and spacing.
The explainable feedback of Visual-ERM enables additional performance gains through test-time scaling (reflexion + rectification).
SVG-to-Code task: Visual-ERM yields a +4.1 average improvement.

Highlights & Insights

Precise problem diagnosis: Systematically analyzes the failure modes of existing rewards (text rules vs. DINO), which gives the method a clear motivation.
Three major attributes: Fine-grained (captures subtle visual differences), interpretable (generates diagnostic feedback), and task-agnostic (a single RM covers charts, tables, and SVGs).
Advantage of generative RMs: Compared to scalar rewards, generative rewards provide structured feedback necessary for Test-Time Scaling (TTS).
Practical benchmark VC-RewardBench: Bridges the gap in benchmarks for fine-grained image difference discrimination.
8B model outperforms 235B: For fine-grained difference discrimination, specialized reward training proves more effective than scaling a general-purpose model.

Limitations & Future Work

Dependency on rendering: Requires environments capable of properly rendering code outputs, which increases pipeline complexity.
Training of Visual-ERM relies on GPT-5-mini annotations, which raises cost and limits reproducibility.
Simplistic reward aggregation (sum of severities + normalization) may lose structural information.
Only validated on chart, table, and SVG tasks, leaving broader vision-to-code scenarios (e.g., UI, web pages) to be examined.

Related Work & Insights

vs. DINO-based reward: DINO captures coarse-grained semantic similarity and ignores text and precise layout, which leaves it susceptible to reward hacking.
vs. TEDS-based reward: TEDS operates only in the textual space, lacking visual perception.
vs. Bradley-Terry RM: Discriminative scalar RMs fail to provide explainable feedback, thereby failing to support TTS.
Insights: The results suggest that evaluating directly in the visual space is key to effective RL for vision-to-code tasks; generative RMs provide a natural interface for TTS.

Rating

Novelty: ⭐⭐⭐⭐ — The concept of a visual equivalence reward model is novel, systematically filling the reward design gap in vision-to-code RL.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensively validated across three tasks, compared against multiple reward baselines, including TTS experiments and VC-RewardBench.
Writing Quality: ⭐⭐⭐⭐ — Deep problem analysis, rigorous experimental design, and clear illustrations.
Value: ⭐⭐⭐⭐ — Provides a practical reward solution for vision-to-code RL, with VC-RewardBench possessing long-term value.