Visual-ERM: Reward Modeling for Visual Equivalence
Conference: CVPR2025
arXiv: 2603.13224
Code: GitHub
Area: Image Generation
Keywords: reward model, vision-to-code, reinforcement learning, visual equivalence, chart-to-code, test-time scaling
TL;DR
Proposes Visual-ERM, a multimodal generative reward model that directly evaluates rendering quality of vision-to-code tasks in the visual space, providing fine-grained, interpretable, and task-agnostic reward signals for RL training and test-time scaling.
Background & Motivation
Importance of vision-to-code: Transforming structured visual inputs (charts, tables, SVGs) into executable code or markup languages is a fundamental primitive for downstream applications such as AI-assisted development and scientific paper parsing.
Limitations of SFT: Supervised Fine-Tuning (SFT) requires abundant labeled data and suffers from weak cross-domain generalization.
Mismatch in RL Reward Signals: Existing rewards either rely on text rules (such as edit distance, TEDS), which ignore visual cues, or leverage coarse-grained visual embedding similarities (like DINO), which are insensitive to fine-grained differences.
Reward Hacking: Outputs with a DINO similarity score of 0.99 can still contain numerous parsing errors, and textual metrics fail to capture visual layout and spacing discrepancies.
Lack of Unified Evaluation Benchmark: Current reward benchmarks focus primarily on vision-language alignment, lacking a benchmark for fine-grained image-to-image difference discrimination.
Need for Cross-Modal Evaluation: The ideal reward model must simultaneously perceive visual details, read embedded texts, and reason about structural fidelity.
Method
Overall Architecture
Visual-ERM operates in the visual space: given the ground-truth image \(I^\star\) and the image rendered from the predicted code \(\hat{I} = \mathcal{R}_m(y)\), the model outputs fine-grained difference descriptions and severity scores as reward signals.
Key Designs
1. Reward Data Generation (Controlled Corruption + Sampling)
- Edit Mode: Uses a strong LVLM to inject predefined error types into the ground-truth text, generating controlled corrupted samples.
- Infer Mode: Uses a weaker LVLM for direct inference, sampling naturally occurring errors that are closer to the actual error distribution.
- Predictions are rendered as images and paired with the ground-truth images to form the training data.
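The Edit Mode idea can be illustrated with a small sketch. Note the substitution: the paper uses a strong LVLM to inject errors, whereas this example uses a rule-based stand-in on a Markdown table, purely to show what a controlled corrupted sample might look like. The error-type names (`drop_row`, `swap_cells`), the `CorruptedSample` record, and the helper functions are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class CorruptedSample:
    ground_truth: str  # original markup
    corrupted: str     # markup after the injected error
    error_type: str    # which predefined error was injected


def _rows(table: str) -> List[List[str]]:
    """Split a Markdown table into rows of stripped cell strings."""
    return [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table.strip().splitlines()
    ]


def _join(rows: List[List[str]]) -> str:
    """Serialize rows back into Markdown table lines."""
    return "\n".join("| " + " | ".join(row) + " |" for row in rows)


def corrupt_table(table: str, error_type: str, rng: random.Random) -> CorruptedSample:
    """Rule-based stand-in for LVLM error injection (an assumption, not the paper's method)."""
    rows = _rows(table)
    body = list(range(2, len(rows)))  # skip the header and separator rows
    if error_type == "drop_row":
        del rows[rng.choice(body)]
    elif error_type == "swap_cells":
        row = rows[rng.choice(body)]
        row[0], row[-1] = row[-1], row[0]  # swap first and last cell
    else:
        raise ValueError(f"unknown error type: {error_type}")
    return CorruptedSample(table, _join(rows), error_type)


table = "| Model | Score |\n|---|---|\n| A | 1 |\n| B | 2 |"
sample = corrupt_table(table, "drop_row", random.Random(0))
```

In the full pipeline both `ground_truth` and `corrupted` would then be rendered to images, and the image pair would become one training example.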
2. Fine-grained Annotation Distillation
- Open-source models (including Qwen3-VL-235B) still fall notably short at localizing differences.
- Employs GPT-5-mini for high-quality difference annotation, then transfers this capability to a more efficient model via distillation.
- Annotations include: error types, locations, descriptions, and severity levels.
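To make the annotation contents concrete, here is one possible representation of a single annotation. The four fields mirror the ones listed above, but the JSON layout, the field names, the integer severity scale, and the example values are assumptions made for illustration; the paper's actual output format may differ.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Difference:
    error_type: str   # e.g. "text", "layout" (hypothetical labels)
    location: str     # where in the image the difference occurs
    description: str  # free-text explanation of the difference
    severity: int     # hypothetical integer scale, higher = worse


def parse_annotation(raw: str) -> List[Difference]:
    """Parse a JSON list of difference records into typed objects."""
    return [Difference(**item) for item in json.loads(raw)]


example = """[
  {"error_type": "text", "location": "row 2, column 3",
   "description": "value 4.1 rendered as 41", "severity": 3},
  {"error_type": "layout", "location": "legend",
   "description": "legend moved from top right to bottom", "severity": 1}
]"""
diffs = parse_annotation(example)
```

A structured format like this keeps the per-difference severities machine-readable for reward computation while the descriptions remain usable as natural-language feedback.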
3. Visual-ERM Training
- Trained on top of Qwen3-VL-8B-Instruct.
- Takes the image pair \((I^\star, \hat{I})\) as the input \(x\) and outputs a fine-grained difference analysis sequence \(a\).
- Standard NLL objective: \(\mathcal{L}(\theta_{\text{ERM}}) = \mathbb{E}\left[-\sum_t \log f_{\theta_{\text{ERM}}}(a_t \mid x, a_{<t})\right]\)
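As a small worked illustration of the NLL objective, the sketch below computes the loss from per-token probabilities that a model assigns to the reference annotation tokens. The function names are hypothetical, and a real implementation would work on logits inside a deep learning framework rather than on plain Python lists.

```python
import math
from typing import List


def sequence_nll(token_probs: List[float]) -> float:
    """Negative log-likelihood of one target sequence.

    token_probs[t] is the model probability of the reference token a_t
    given the input x and the preceding tokens a_<t.
    """
    return -sum(math.log(p) for p in token_probs)


def batch_nll(batch: List[List[float]]) -> float:
    """Monte Carlo estimate of the expectation: mean sequence NLL over a batch."""
    return sum(sequence_nll(seq) for seq in batch) / len(batch)


# Two tokens with probabilities 0.5 and 0.25 give -log(0.5 * 0.25) = log(8).
loss = sequence_nll([0.5, 0.25])
```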
4. RL Integration
- Visual equivalence reward: the severity scores \(s_k\) of the \(K\) detected differences are summed, \(S_{\text{verm}} = \sum_{k=1}^K s_k\), and the sum is normalized to \([0,1]\) to obtain \(r_{\text{verm}}\).
- Final reward: \(r = r_{\text{rsr}} + r_{\text{verm}}\) (rendering success reward + visual equivalence reward).
- Optimizes the policy model using the GRPO algorithm.
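The reward aggregation can be sketched as follows. Several details here are assumptions not stated in this summary: the normalization uses a fixed cap `max_total`, the normalized sum is inverted so that fewer or milder differences give a higher reward, the rendering success reward is binary, and the visual reward is zero when rendering fails.

```python
from typing import Sequence


def visual_equivalence_reward(severities: Sequence[float], max_total: float = 10.0) -> float:
    """Map the summed severities S_verm to [0, 1].

    Assumption: 1.0 means no detected difference, and 0.0 means the
    summed severity reached the (hypothetical) cap max_total.
    """
    total = sum(severities)
    return 1.0 - min(total, max_total) / max_total


def total_reward(rendered_ok: bool, severities: Sequence[float], max_total: float = 10.0) -> float:
    """r = r_rsr + r_verm, with a binary rendering success reward (assumption)."""
    if not rendered_ok:
        return 0.0  # nothing to compare when the code fails to render
    r_rsr = 1.0
    r_verm = visual_equivalence_reward(severities, max_total)
    return r_rsr + r_verm


reward = total_reward(True, [3, 2])  # summed severity 5 out of a cap of 10
```

Under these assumptions a perfect rendering scores 2.0, a rendering that accumulates the maximum severity scores 1.0, and a failed rendering scores 0.0, so rendering failures are always ranked below any successful render.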
5. Test-Time Scaling
- Explainable feedback generated by Visual-ERM can directly guide iterative self-correction.
- The model refines its prediction based on the previous output \(y^{(0)}\) and the feedback \(f^{(0)}\) on it: \(y^{(1)} \sim \pi_\theta(\cdot | x, y^{(0)}, f^{(0)})\).
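The refinement step can be pictured as a loop. This is a minimal sketch under stated assumptions: `generate`, `render`, and `critique` are hypothetical callables standing in for the policy model, the renderer, and Visual-ERM, and the multi-round loop with early stopping is an illustrative choice rather than the paper's documented procedure. The toy stand-ins treat the "image" as a plain string.

```python
from typing import Callable, List, Optional

Feedback = List[str]


def self_correct(
    x: str,
    generate: Callable[[str, Optional[str], Optional[Feedback]], str],
    render: Callable[[str], str],
    critique: Callable[[str, str], Feedback],
    max_rounds: int = 3,
) -> str:
    """Refine a prediction using reward-model feedback.

    x is the input (ground-truth) image; each round draws the next
    prediction conditioned on x, the previous prediction, and its feedback.
    """
    y = generate(x, None, None)  # initial prediction
    for _ in range(max_rounds):
        feedback = critique(x, render(y))
        if not feedback:         # no differences reported: stop early
            break
        y = generate(x, y, feedback)
    return y


# Toy stand-ins: the "image" is a string and rendering is the identity.
def toy_generate(x, prev, feedback):
    return x if feedback else x[:-1] + "?"


def toy_critique(x, rendered):
    return [] if rendered == x else ["last character differs"]


result = self_correct("abc", toy_generate, lambda code: code, toy_critique)
```

The loop only works because the reward model returns structured feedback; a scalar score alone would give `generate` nothing to condition on.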
Loss & Training
- Visual-ERM training: Standard sequence-generation NLL loss.
- RL policy optimization: GRPO objective with KL regularization; reward = rendering success + visual equivalence score.
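For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt, which removes the need for a learned value function. The sketch below shows only this group-relative advantage computation; the clipped policy objective and KL term are omitted. The function name, the `eps` stabilizer, and the use of the population standard deviation are illustrative choices.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled outputs for one image, scored by rendering success + visual equivalence.
advantages = group_relative_advantages([2.0, 1.0, 0.0, 1.0])
```

Responses with above-average reward get positive advantages and are reinforced, while below-average ones are discouraged. If every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason fine-grained rewards matter.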
Key Experimental Results
Main Results: Chart-to-Code (ChartMimic)
| Model | Direct Overall | Customized Overall | Avg |
|---|---|---|---|
| Qwen3-VL-8B-Instruct | 67.7 | 71.6 | 69.6 |
| + RL (DINO-based) | 76.5 | 75.8 | 76.1 |
| + RL (Visual-ERM) | 79.5 | 76.5 | 78.0 |
| Δ vs. base | +11.8 | +4.9 | +8.4 |
Main Results: Table-to-Markdown
| Model | OmniDocBench TEDS↑ | Edit-Dist↓ | olmOCR TA↑ | Avg↑ |
|---|---|---|---|---|
| Qwen3-VL-8B + RL (DINO) | 62.2 | 37.0 | 71.7 | 65.3 |
| Qwen3-VL-8B + RL (TEDS) | 79.2 | 31.6 | 78.6 | 74.8 |
| Qwen3-VL-8B + RL (Visual-ERM) | 81.4 | 20.7 | 78.1 | 79.5 |
| Δ vs. base | +2.5 | +2.5 | +2.8 | +2.7 |
DINO-based RL degrades severely on table tasks (Avg drops to 65.3), whereas Visual-ERM achieves consistently robust improvements.
VC-RewardBench Evaluation
- Visual-ERM (8B) outperforms Qwen3-VL-235B-Instruct in fine-grained image difference discrimination.
- Approaches the performance level of leading closed-source models.
Ablation Study & Key Findings
- DINO rewards carry a severe risk of reward hacking, because DINO's bias toward semantic similarity leads it to neglect textual content.
- Textual metrics are insensitive to errors in visual layout and spacing.
- The explainable feedback of Visual-ERM enables additional performance gains through test-time scaling (reflexion + rectification).
- SVG-to-Code task: Visual-ERM yields a +4.1 average improvement.
Highlights & Insights
- Precise problem diagnosis: Systematically analyzes the failure modes of existing rewards (text rules vs. DINO), which clearly motivates the proposed approach.
- Three major attributes: Fine-grained (captures subtle visual differences), interpretable (generates diagnostic feedback), and task-agnostic (a single RM covers charts, tables, and SVGs).
- Advantage of generative RMs: Compared to scalar rewards, generative rewards provide structured feedback necessary for Test-Time Scaling (TTS).
- Practical benchmark VC-RewardBench: Bridges the gap in benchmarks for fine-grained image difference discrimination.
- 8B model outperforms 235B: On fine-grained difference discrimination, specialized reward training proves more effective than scaling up a general-purpose model.
Limitations & Future Work
- Dependency on rendering: Requires environments capable of properly rendering code outputs, which increases pipeline complexity.
- Training of Visual-ERM relies on GPT-5-mini annotations, which raises cost and limits reproducibility.
- Simplistic reward aggregation (sum of severities + normalization) may lose structural information.
- Only validated on chart, table, and SVG tasks, leaving broader vision-to-code scenarios (e.g., UI, web pages) to be examined.
Related Work & Insights
- vs. DINO-based reward: DINO models coarse-grained semantic similarity and ignores text and precise layout, making it susceptible to reward hacking.
- vs. TEDS-based reward: TEDS operates only in the textual space, lacking visual perception.
- vs. Bradley-Terry RM: Discriminative scalar RMs fail to provide explainable feedback, thereby failing to support TTS.
- Insights: The results suggest that evaluating directly in the visual space is key to effective RL in vision-to-code tasks, and that generative RMs provide a natural interface for TTS.
Rating
- Novelty: ⭐⭐⭐⭐ — The concept of a visual equivalence reward model is novel, systematically filling the reward design gap in vision-to-code RL.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensively validated across three tasks, compared against multiple reward baselines, including TTS experiments and VC-RewardBench.
- Writing Quality: ⭐⭐⭐⭐ — Deep problem analysis, rigorous experimental design, and clear illustrations.
- Value: ⭐⭐⭐⭐ — Provides a practical reward solution for vision-to-code RL, with VC-RewardBench possessing long-term value.