EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

Conference: ICLR 2026 | arXiv: 2509.26346 | Code: GitHub | Area: Image Editing / Reward Model | Keywords: Image Editing, Reward Model, Human Preference, Data Filtering, VLM

TL;DR

This paper constructs EditReward-Data, a high-quality dataset of 200K expert-annotated preference pairs, and trains the EditReward reward model, which achieves state-of-the-art human alignment across multiple image editing evaluation benchmarks. The model is further validated as a data filter that substantially improves downstream editing model performance.

Background & Motivation

Instruction-guided image editing has advanced considerably in recent years. Closed-source models such as GPT-Image-1 and Seedream deliver strong results, yet open-source models still lag well behind. The core bottleneck is the absence of reliable reward models for filtering and scaling high-quality training data.

Existing evaluation and reward approaches suffer from three fundamental problems:

Perceptual scores (e.g., LPIPS): unable to capture semantic alignment with the editing instruction.

Feature-based scores (e.g., CLIP): unable to understand editing semantics.

VLM-as-judge (e.g., VIEScore): general-purpose VLMs are not optimized for editing tasks.

Prior fine-tuned reward models either rely on noisy crowdsourced annotations (low inter-annotator agreement) or pseudo-labels generated by closed-source models (introducing bias). The root cause is that training a reliable reward model requires large-scale, high-quality human-annotated preference data — a resource that has been lacking.

Key Insight: Construct a large-scale, high-quality, multi-dimensional expert-annotated preference dataset and train a reward model specifically tailored to instruction-guided image editing.

Method

Overall Architecture

EditReward consists of three core components:

  1. EditReward-Data: a dataset of 200K expert-annotated preference pairs.
  2. EditReward Model: a VLM-based, multi-dimensional, uncertainty-aware ranking reward model.
  3. EditReward-Bench: a multi-way preference-ranking evaluation benchmark.

Key Designs

  1. EditReward-Data Construction:

    • 9,557 instruction–image pairs are collected from six editing benchmarks (GEdit-Bench, ImgEdit-Bench, MagicBrush, etc.).
    • Six state-of-the-art editing models (Step1X-Edit, Flux-Kontext, Qwen-Image-Edit, etc.) each generate multiple outputs per input.
    • Key: Trained annotators follow a strict annotation protocol, rating outputs on a 4-level Likert scale across two dimensions:
      • Instruction Following (IF): semantic accuracy, completeness, and absence of unintended modifications.
      • Visual Quality (VQ): plausibility, absence of artifacts, and aesthetics.
    • Krippendorff's \(\alpha\) reaches 0.668 for IF and 0.597 for VQ, indicating markedly higher agreement than typical crowdsourced preference annotation.
  2. Multi-Dimensional Uncertainty-Aware Ranking:

    • Inspired by HPSv3, scores are modeled as Gaussian distributions \(s_{i,d} \sim \mathcal{N}(\mu_{i,d}, \sigma_{i,d}^2)\), where \(d \in \{1,2\}\) corresponds to the IF and VQ dimensions respectively.
    • Multi-task learning (MTL) is adopted, with independent reward heads predicting Gaussian parameters for each dimension.
    • Three aggregation strategies are explored: pessimistic minimum, balanced mean, and direct summation.
    • The preference probability follows from the difference of the two aggregated Gaussians: assuming independent scores, \(P(I_h \succ I_l) = \Phi\big((\mu_h - \mu_l)/\sqrt{\sigma_h^2 + \sigma_l^2}\big)\), and the model is trained with the ranking loss \(\mathcal{L}_{\text{rank}} = -\log P(I_h \succ I_l)\) (see the PyTorch sketch after the Loss & Training list below).
  3. Disentangling Ties via Dimensional Preference:

    • Core Idea: pairs rated as overall ties often exhibit complementary strengths across dimensions (e.g., image A excels in IF while image B excels in VQ).
    • Tied pairs \((I_A, I_B)_{\text{tie}}\) are decomposed into two training samples, labeled \(I_A \succ I_B\) and \(I_B \succ I_A\) along their respective preferred dimensions.
    • This encourages the model to learn finer-grained cross-dimensional trade-offs and yields smoother training curves (a data-preparation sketch follows below).
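
To make the tie decomposition concrete, here is a minimal data-preparation sketch. The dictionary schema (`img_a`, `img_b`, `overall`, per-dimension keys) is hypothetical and only illustrates the decomposition rule, not the paper's actual data format:

```python
# Hypothetical schema: each pair records an overall preference plus
# per-dimension preferences, each one of "A", "B", or "tie".
def decompose_ties(pairs):
    """Expand overall-tie pairs into dimension-specific ranking samples."""
    samples = []
    for p in pairs:
        if p["overall"] in ("A", "B"):
            win, lose = (p["img_a"], p["img_b"]) if p["overall"] == "A" else (p["img_b"], p["img_a"])
            samples.append({"win": win, "lose": lose, "dims": ["IF", "VQ"]})
            continue
        # Overall tie: emit one sample per dimension with a strict winner,
        # e.g. A wins on IF and B wins on VQ yields two opposite samples.
        for dim in ("IF", "VQ"):
            if p[dim] == "A":
                samples.append({"win": p["img_a"], "lose": p["img_b"], "dims": [dim]})
            elif p[dim] == "B":
                samples.append({"win": p["img_b"], "lose": p["img_a"], "dims": [dim]})
    return samples
```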

Loss & Training

  • Backbone: Qwen2.5-VL-7B or MiMo-VL-7B, with full parameter fine-tuning.
  • 2 epochs, 8×A800 GPUs, learning rate 2e-6, cosine schedule.
  • Images are resized to fit within 448×448 while preserving aspect ratio.
  • The total loss is the ranking loss \(\mathcal{L}_{\text{rank}} = -\log(P(I_h \succ I_l))\).
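
As a concrete reference, below is a minimal PyTorch sketch of the ranking loss under the Gaussian-score formulation above. It assumes independent per-dimension scores, so the "mean" and "sum" aggregations have closed forms; the pessimistic-minimum variant is omitted because the minimum of Gaussians is not itself Gaussian, and the paper's exact treatment is not reproduced here. All names are illustrative:

```python
import torch

def gaussian_ranking_loss(mu_w, sigma_w, mu_l, sigma_l, aggregate="mean"):
    """Pairwise ranking loss over per-dimension Gaussian scores.

    mu_*, sigma_*: shape (batch, 2), one (mean, std) per dimension (IF, VQ),
    for the human-preferred image (w) and the less-preferred image (l).
    """
    if aggregate == "sum":
        mu_w, var_w = mu_w.sum(1), sigma_w.pow(2).sum(1)
        mu_l, var_l = mu_l.sum(1), sigma_l.pow(2).sum(1)
    elif aggregate == "mean":
        d = mu_w.shape[1]
        mu_w, var_w = mu_w.mean(1), sigma_w.pow(2).sum(1) / d**2
        mu_l, var_l = mu_l.mean(1), sigma_l.pow(2).sum(1) / d**2
    else:
        raise ValueError(f"unsupported aggregation: {aggregate}")

    # For independent Gaussians, P(s_w > s_l) = Phi of the standardized gap.
    z = (mu_w - mu_l) / torch.sqrt(var_w + var_l + 1e-8)
    p_win = torch.distributions.Normal(0.0, 1.0).cdf(z).clamp_min(1e-8)
    return -torch.log(p_win).mean()  # L_rank = -log P(I_h > I_l)
```

With each head predicting a per-dimension \((\mu, \sigma)\), this single scalar loss trains both dimensions jointly; swapping `aggregate` reproduces the mean-vs-sum comparison in the ablation study.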

Key Experimental Results

Main Results

| Method | GenAI-Bench | AURORA-Bench | ImagenHub | EditReward-Bench |
|---|---|---|---|---|
| GPT-4o | 53.54 | 50.81 | 38.21 | 28.31 |
| GPT-5 | 59.61 | 47.27 | 40.85 | 37.81 |
| Gemini-2.5-Flash | 57.01 | 47.63 | 41.62 | 38.02 |
| Qwen2.5-VL-7B-Inst | 40.48 | 38.62 | 18.59 | 29.75 |
| EditReward (Qwen) | 63.97 | 59.50 | 36.18 | 36.78 |
| EditReward (MiMo) | 65.72 | 63.62 | 35.20 | 38.42 |

On GenAI-Bench and AURORA-Bench, EditReward clearly outperforms closed-source models including GPT-5 and Gemini-2.5-Flash; on EditReward-Bench the MiMo variant edges ahead of Gemini-2.5-Flash, while ImagenHub remains the one benchmark where closed-source judges keep a lead.

Data Filtering Application

EditReward is used to filter a high-quality subset from ShareGPT-4o-Image (46K) for fine-tuning Step1X-Edit:

| Training Data | GEdit-EN \(G_O\) | GEdit-CN \(G_O\) |
|---|---|---|
| Step1X-Edit (original) | 6.444 | 6.779 |
| + Full ShareGPT-4o (46K) | 6.780 | 6.583 |
| + Top 10K (EditReward filtered) | 6.938 | 7.000 |
| + Top 20K (EditReward filtered) | 7.086 | 7.074 |
| + Top 30K (EditReward filtered) | 6.962 | 6.938 |
| Doubao-Edit | 6.983 | 6.942 |

Top 20K is the optimal trade-off point, lifting the open-source Step1X-Edit past Doubao-Edit on both GEdit-EN (7.086 vs. 6.983) and GEdit-CN (7.074 vs. 6.942); Top 30K already regresses, suggesting the additional samples are of lower quality.
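
The filtering recipe itself is a straightforward score-and-rank loop. The sketch below illustrates it; `reward_model.score(...)` is a hypothetical interface standing in for whatever scoring entry point the released EditReward model exposes:

```python
def filter_top_k(dataset, reward_model, k=20_000):
    """Score each (source, instruction, edited) triple with the reward
    model, then keep the k highest-scoring samples for fine-tuning."""
    scored = [
        (reward_model.score(ex["source"], ex["instruction"], ex["edited"]), ex)
        for ex in dataset
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [ex for _, ex in scored[:k]]

# e.g. top_20k = filter_top_k(sharegpt_4o_image, edit_reward, k=20_000)
```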

Ablation Study

| Variant | Loss Type | Head Type | Aggregation | GenAI-Bench |
|---|---|---|---|---|
| I | Pointwise regression | N/A | N/A | 49.62 |
| II | Pairwise ranking | Shared | Mean | 60.17 |
| V (final) | Pairwise ranking | Multi-independent | Mean | 63.97 |
  • Pairwise ranking ≫ pointwise regression (+10.55 from Variant I to II; +14.35 overall to the final Variant V).
  • Multi-independent heads ≫ shared head (+3.80).
  • Mean aggregation yields the best overall performance.

Key Findings

  • After training, Qwen2.5-VL-7B improves by more than 23 points on GenAI-Bench (40.48→63.97), demonstrating the potency of the proposed framework.
  • EditReward remains ahead of GPT-4o on OOD tasks (Text/Style categories): 46.80 vs. 41.69.
  • Data quality outweighs quantity: Top 20K outperforms the full 46K set.

Highlights & Insights

  • The 200K expert-annotated preference dataset achieves high quality (Krippendorff's \(\alpha > 0.59\)), far surpassing crowdsourced annotations.
  • The two-dimensional (IF + VQ) decoupled design is empirically supported: inter-annotator agreement is notably higher for IF (0.668) than for VQ (0.597), suggesting the two dimensions capture distinct judgments that warrant separate modeling.
  • Tie decomposition is a simple yet effective technique that fully exploits the information embedded in tied annotation pairs.
  • The data filtering application is direct and quantifiable: scoring 46K samples requires only 2.61 GPU hours.

Limitations & Future Work

  • The annotation framework covers only two dimensions (IF and VQ), potentially missing aspects such as spatial consistency and style preservation.
  • Experiments are conducted primarily on 7B-scale VLMs; the effectiveness at larger or smaller scales remains unknown.
  • Data filtering experiments are validated on a single downstream model (Step1X-Edit); generalizability requires further investigation.
  • Multi-way preference accuracy on EditReward-Bench remains low (~11% at K=4), indicating that the task remains highly challenging.
Comparison with Prior Work

  • HPSv3: pioneered uncertainty-aware ranking, but is limited to a single dimension.
  • ImageRewardDB: an early preference dataset, but noisy and single-dimensional.
  • ADIEE: trained on model-generated labels, which introduces bias.
  • Key insight: high-quality human annotation combined with multi-dimensional decoupling is the critical path toward reliable reward models.

Rating

  • Novelty: ⭐⭐⭐⭐ Multi-dimensional uncertainty-aware ranking and tie decomposition are notable contributions, though the overall framework is relatively standard.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluation on 4 benchmarks, data filtering application, extensive ablations, and OOD testing.
  • Writing Quality: ⭐⭐⭐⭐ Well-structured and data-rich, though notation is somewhat dense in places.
  • Value: ⭐⭐⭐⭐⭐ Both the dataset and model will be open-sourced, providing significant contributions to the image editing community.