Visual-ERM: Reward Modeling for Visual Equivalence

Conference: CVPR2025
arXiv: 2603.13224
Code: GitHub
Area: Image Generation
Keywords: reward model, vision-to-code, reinforcement learning, visual equivalence, chart-to-code, test-time scaling

TL;DR

Proposes Visual-ERM, a multimodal generative reward model that directly evaluates rendering quality of vision-to-code tasks in the visual space, providing fine-grained, interpretable, and task-agnostic reward signals for RL training and test-time scaling.

Background & Motivation

Importance of vision-to-code: Transforming structured visual inputs (charts, tables, SVGs) into executable code or markup languages is a fundamental primitive for downstream applications such as AI-assisted development and scientific paper parsing.

Limitations of SFT: Supervised Fine-Tuning (SFT) requires abundant labeled data and suffers from weak cross-domain generalization.

Mismatch in RL Reward Signals: Existing rewards either rely on text rules (such as edit distance, TEDS), which ignore visual cues, or leverage coarse-grained visual embedding similarities (like DINO), which are insensitive to fine-grained differences.

Reward Hacking: Outputs with a DINO similarity score of 0.99 can still contain numerous parsing errors, and textual metrics fail to capture visual layout and spacing discrepancies.

Lack of Unified Evaluation Benchmark: Current reward benchmarks focus primarily on vision-language alignment, lacking a benchmark for fine-grained image-to-image difference discrimination.

Need for Cross-Modal Evaluation: The ideal reward model must simultaneously perceive visual details, read embedded texts, and reason about structural fidelity.

Method

Overall Architecture

Visual-ERM operates in the visual space: given the ground-truth image \(I^\star\) and the image rendered from the predicted code \(\hat{I} = \mathcal{R}_m(y)\), the model outputs fine-grained difference descriptions and severity scores as reward signals.

Key Designs

1. Reward Data Generation (Controlled Corruption + Sampling)
  • Edit Mode: Uses a strong LVLM to inject predefined error types into the ground-truth text, generating controlled corrupted samples.
  • Infer Mode: Uses a weaker LVLM for direct inference and prediction, sampling naturally occurring errors that are closer to the actual distribution.
  • Renders predictions as images and pairs them with ground truths to form training data.
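The idea behind Edit Mode (one known error injected into a ground-truth text, with a record of what was changed) can be illustrated with a small rule-based stand-in. This is only a sketch: the source uses a strong LVLM to inject errors, whereas the code below swaps in hand-written rules, and the names `corrupt_table`, `to_markdown`, and the error types `cell_value` and `missing_row` are hypothetical. Rendering the two Markdown strings to images is left out.

```python
import copy
import random

def corrupt_table(rows, error_type, rng):
    """Return a corrupted copy of a table plus a record of the injected error.

    `rows` is a list of rows (lists of strings); row 0 is the header.
    The error types here are illustrative, not the paper's taxonomy.
    """
    rows = copy.deepcopy(rows)  # never mutate the ground truth
    r = rng.randrange(1, len(rows))  # pick a body row, skipping the header
    if error_type == "cell_value":
        c = rng.randrange(len(rows[r]))
        old = rows[r][c]
        rows[r][c] = old + "9"  # visibly alter one cell
        record = {"type": error_type, "location": (r, c), "old": old, "new": rows[r][c]}
    elif error_type == "missing_row":
        removed = rows.pop(r)
        record = {"type": error_type, "location": (r, None), "old": removed, "new": None}
    else:
        raise ValueError(f"unknown error type: {error_type}")
    return rows, record

def to_markdown(rows):
    """Serialize a table to Markdown so it could be passed to a renderer."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

ground_truth = [["Model", "Score"], ["A", "1.5"], ["B", "2.0"]]
corrupted, record = corrupt_table(ground_truth, "cell_value", random.Random(0))
pair = (to_markdown(ground_truth), to_markdown(corrupted))  # render both to get an image pair
```

Because the injected error is recorded at corruption time, each pair comes with a known error type and location, which is what makes this kind of sample "controlled".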

2. Fine-grained Annotation Distillation
  • Open-source models (including Qwen3-VL-235B) still suffer from significant gaps in difference localization.
  • Employs GPT-5-mini for high-quality difference annotation, transferring capabilities to more efficient models via distillation.
  • Annotations include: error types, locations, descriptions, and severity levels.

3. Visual-ERM Training
  • Trained on top of Qwen3-VL-8B-Instruct.
  • Takes the image pair \((I^\star, \hat{I})\) as input and outputs a fine-grained difference analysis sequence \(a\).
  • Standard NLL objective: \(\mathcal{L}(\theta_{\text{ERM}}) = \mathbb{E}\left[-\sum_t \log f_{\theta_{\text{ERM}}}(a_t \mid x, a_{<t})\right]\), where \(x\) denotes the model input containing the image pair.
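For a single annotated sequence, the NLL objective reduces to summing the negative log-probability of each target token. The following is a minimal sketch in plain Python, assuming the per-step logits have already been produced by the model; `sequence_nll` and `log_softmax` are illustrative helper names, not part of any released code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sequence_nll(step_logits, target_ids):
    """Sum over t of -log p(a_t | x, a_<t).

    step_logits[t] holds the logits predicted at step t (already conditioned
    on the input and on the previous target tokens); target_ids[t] is the
    index of the reference token a_t.
    """
    total = 0.0
    for logits, token in zip(step_logits, target_ids):
        total -= log_softmax(logits)[token]
    return total

# With uniform logits over a vocabulary of 4, each token costs log(4) nats.
loss = sequence_nll([[0.0] * 4] * 3, [0, 1, 2])
```

Averaging this quantity over the training pairs gives an empirical estimate of the expectation in the objective.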

4. RL Integration
  • Sum of difference severity: \(S_{\text{verm}} = \sum_{k=1}^K s_k\), normalized to \([0,1]\).
  • Final reward: \(r = r_{\text{rsr}} + r_{\text{verm}}\) (rendering success reward + visual equivalence reward).
  • Optimizes the policy model using the GRPO algorithm.
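One way to turn the severity list into the final reward is sketched below. The summary does not specify how \(S_{\text{verm}}\) is normalized, so several details here are assumptions: the sum is clipped at a hypothetical cap `max_total`, the result is inverted so that fewer or milder differences give a higher reward, and \(r_{\text{verm}}\) is set to 0 when rendering fails because there is no image to compare.

```python
def visual_equivalence_reward(severities, max_total):
    """Map the summed severity S_verm to [0, 1]; 1.0 means no differences found.

    Assumption: clip at `max_total` and invert. The paper's exact
    normalization may differ.
    """
    if max_total <= 0:
        raise ValueError("max_total must be positive")
    s_verm = sum(severities)
    return 1.0 - min(s_verm, max_total) / max_total

def total_reward(render_ok, severities, max_total=10.0):
    """r = r_rsr + r_verm: rendering success plus visual equivalence."""
    r_rsr = 1.0 if render_ok else 0.0
    # Assumption: a failed render yields no image to score, so r_verm is 0.
    r_verm = visual_equivalence_reward(severities, max_total) if render_ok else 0.0
    return r_rsr + r_verm

# A prediction that renders and has two differences of severity 2 and 3.
reward = total_reward(True, [2, 3])
```

Under these assumptions the reward ranges from 0 (code fails to render) to 2 (renders with no detected differences).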

5. Test-Time Scaling
  • Explainable feedback generated by Visual-ERM can directly guide iterative self-correction.
  • The model refines its predictions based on the previous output and feedback: \(y^{(1)} \sim \pi_\theta(\cdot | x, y^{(0)}, f^{(0)})\).
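The refinement step can be read as a loop that alternates between rendering, critiquing, and regenerating. The sketch below assumes three caller-supplied callables whose names and signatures are hypothetical: `generate(x, prev, feedback)` plays the role of the policy, `render(y)` turns code into an image, and `critique(x, image)` stands in for Visual-ERM and returns feedback together with a total severity. Stopping early at zero severity and capping the number of rounds are design choices made here, not details taken from the source.

```python
def refine(x, generate, render, critique, max_rounds=3):
    """Iterative self-correction: y(t+1) ~ pi(. | x, y(t), f(t))."""
    y = generate(x, None, None)  # initial prediction y(0), no feedback yet
    for _ in range(max_rounds):
        image = render(y)
        feedback, severity = critique(x, image)
        if severity == 0:  # reward model reports no remaining differences
            break
        y = generate(x, y, feedback)  # condition on previous output and feedback
    return y
```

Because the feedback is passed back as text rather than a scalar, the policy can target the specific differences that were reported.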

Loss & Training

  • Visual-ERM training: Standard sequence-generation NLL loss.
  • RL policy optimization: GRPO objective with KL regularization; reward = rendering success + visual equivalence score.
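GRPO replaces a learned value baseline with statistics computed over a group of responses sampled for the same input: each response's reward is standardized against the group mean and standard deviation. The sketch below covers only that advantage computation, not the clipped policy-gradient objective or the KL term. Using the population standard deviation and the value of `eps` are assumptions; implementations vary on both.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    `rewards` holds r = r_rsr + r_verm for each response to the same input.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses for one chart: better-than-average ones get
# positive advantages, worse-than-average ones negative.
advantages = group_relative_advantages([2.0, 1.0, 0.0, 1.0])
```

If every response in a group receives the same reward, all advantages are zero and that group contributes no gradient signal, which is one reason a fine-grained reward is useful here.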

Key Experimental Results

Main Results: Chart-to-Code (ChartMimic)

| Model | Direct Overall | Customized Overall | Avg |
| --- | --- | --- | --- |
| Qwen3-VL-8B-Instruct | 67.7 | 71.6 | 69.6 |
| + RL (DINO-based) | 76.5 | 75.8 | 76.1 |
| + RL (Visual-ERM) | 79.5 | 76.5 | 78.0 |
| Δ vs. base | +11.8 | +4.9 | +8.4 |

Main Results: Table-to-Markdown

| Model | OmniDocBench TEDS↑ | OmniDocBench Edit-Dist↓ | olmOCR TA↑ | Avg↑ |
| --- | --- | --- | --- | --- |
| Qwen3-VL-8B + RL (DINO) | 62.2 | 37.0 | 71.7 | 65.3 |
| Qwen3-VL-8B + RL (TEDS) | 79.2 | 31.6 | 78.6 | 74.8 |
| Qwen3-VL-8B + RL (Visual-ERM) | 81.4 | 20.7 | 78.1 | 79.5 |
| Δ vs. base | +2.5 | +2.5 | +2.8 | +2.7 |

DINO-based RL degrades severely on table tasks (Avg drops to 65.3), whereas Visual-ERM achieves consistently robust improvements.

VC-RewardBench Evaluation

  • Visual-ERM (8B) outperforms Qwen3-VL-235B-Instruct in fine-grained image difference discrimination.
  • Approaches the performance level of leading closed-source models.

Ablation Study & Key Findings

  • DINO rewards carry a severe risk of reward hacking, because DINO's bias toward semantic similarity causes it to neglect textual content.
  • Textual metrics are insensitive to errors in visual layout and spacing.
  • The explainable feedback of Visual-ERM enables additional performance gains through test-time scaling (reflexion + rectification).
  • SVG-to-Code task: Visual-ERM yields a +4.1 average improvement.

Highlights & Insights

  1. Precise problem diagnosis: Systematically analyzes the failure modes of existing rewards (text rules vs. DINO), which gives the approach a well-grounded motivation.
  2. Three major attributes: Fine-grained (captures subtle visual differences), interpretable (generates diagnostic feedback), and task-agnostic (a single RM covers charts, tables, and SVGs).
  3. Advantage of generative RMs: Compared to scalar rewards, generative rewards provide structured feedback necessary for Test-Time Scaling (TTS).
  4. Practical benchmark VC-RewardBench: Bridges the gap in benchmarks for fine-grained image difference discrimination.
  5. 8B model outperforms 235B: On VC-RewardBench, specialized reward training proves more effective than scaling a general-purpose model.

Limitations & Future Work

  1. Dependency on rendering: Requires environments capable of properly rendering code outputs, which increases pipeline complexity.
  2. Training of Visual-ERM relies on GPT-5-mini annotations, which adds cost and limits reproducibility.
  3. Simplistic reward aggregation (sum of severities + normalization) may lose structural information.
  4. Only validated on chart, table, and SVG tasks, leaving broader vision-to-code scenarios (e.g., UI, web pages) to be examined.

Related Work & Insights

  • vs. DINO-based reward: DINO models coarse-grained semantic similarity and ignores text and precise layout, which makes it susceptible to reward hacking.
  • vs. TEDS-based reward: TEDS operates only in the textual space, lacking visual perception.
  • vs. Bradley-Terry RM: Discriminative scalar RMs fail to provide explainable feedback, thereby failing to support TTS.
  • Insights: The results suggest that direct evaluation in the visual space is central to effective RL in vision-to-code tasks, and that generative RMs provide a natural interface for TTS.

Rating

  • Novelty: ⭐⭐⭐⭐ — The concept of a visual equivalence reward model is novel, systematically filling the reward design gap in vision-to-code RL.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensively validated across three tasks, compared against multiple reward baselines, including TTS experiments and VC-RewardBench.
  • Writing Quality: ⭐⭐⭐⭐ — Deep problem analysis, rigorous experimental design, and clear illustrations.
  • Value: ⭐⭐⭐⭐ — Provides a practical reward solution for vision-to-code RL, with VC-RewardBench possessing long-term value.