# EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

Conference: ICLR 2026 · arXiv: 2509.26346 · Code: GitHub · Area: Image Editing / Reward Model · Keywords: Image Editing, Reward Model, Human Preference, Data Filtering, VLM
## TL;DR
This paper constructs EditReward-Data, a high-quality dataset of 200K expert-annotated preference pairs, and trains the EditReward reward model, which achieves state-of-the-art human alignment across multiple image editing evaluation benchmarks. The model is further validated as a data filter that substantially improves downstream editing model performance.
## Background & Motivation
Instruction-guided image editing has advanced considerably in recent years. Closed-source models such as GPT-Image-1 and Seedream deliver strong results, yet open-source models still exhibit a significant gap. The core bottleneck lies in the absence of reliable reward models for filtering and scaling high-quality training data.
Existing evaluation and reward approaches suffer from three fundamental problems:
- Perceptual scores (e.g., LPIPS): capture low-level visual similarity but not semantic alignment with the editing instruction.
- Feature-based scores (e.g., CLIP): measure global image–text similarity but cannot understand editing semantics.
- VLM-as-judge (e.g., VIEScore): general-purpose VLMs are not optimized for editing tasks.
Prior fine-tuned reward models either rely on noisy crowdsourced annotations (low inter-annotator agreement) or pseudo-labels generated by closed-source models (introducing bias). The root cause is that training a reliable reward model requires large-scale, high-quality human-annotated preference data — a resource that has been lacking.
Key Insight: Construct a large-scale, high-quality, multi-dimensional expert-annotated preference dataset and train a reward model specifically tailored to instruction-guided image editing.
## Method

### Overall Architecture

EditReward consists of three core components:

1. EditReward-Data: a dataset of 200K expert-annotated preference pairs.
2. EditReward Model: a VLM-based, multi-dimensional, uncertainty-aware ranking reward model.
3. EditReward-Bench: a multi-way preference-ranking evaluation benchmark.
### Key Designs
- EditReward-Data Construction:
- 9,557 instruction–image pairs are collected from six editing benchmarks (GEdit-Bench, ImgEdit-Bench, MagicBrush, etc.).
- Six state-of-the-art editing models (Step1X-Edit, Flux-Kontext, Qwen-Image-Edit, etc.) each generate multiple outputs per input.
- Key: Trained annotators follow a strict annotation protocol, rating outputs on a 4-level Likert scale across two dimensions:
- Instruction Following (IF): semantic accuracy, completeness, and absence of unintended modifications.
- Visual Quality (VQ): plausibility, absence of artifacts, and aesthetics.
- Krippendorff's \(\alpha\) reaches 0.668 for IF and 0.597 for VQ, indicating annotation quality well above typical crowdsourced preference data.
- Multi-Dimensional Uncertainty-Aware Ranking:
- Inspired by HPSv3, scores are modeled as Gaussian distributions \(s_{i,d} \sim \mathcal{N}(\mu_{i,d}, \sigma_{i,d}^2)\), where \(d \in \{1,2\}\) corresponds to the IF and VQ dimensions respectively.
- Multi-task learning (MTL) is adopted, with independent reward heads predicting Gaussian parameters for each dimension.
- Three aggregation strategies are explored: pessimistic minimum, balanced mean, and direct summation.
- The final preference probability is computed from the two aggregated score distributions; for independent Gaussians this integral has the closed probit form \(P(I_h \succ I_l) = \Phi\big((\mu_h - \mu_l)/\sqrt{\sigma_h^2 + \sigma_l^2}\big)\), and the training objective is \(\mathcal{L}_{\text{rank}} = -\log P(I_h \succ I_l)\) (a PyTorch sketch follows the Loss & Training list below).
- Disentangling Ties via Dimensional Preference (a sketch follows this list):
- Core Idea: pairs rated as overall ties often exhibit complementary strengths across dimensions (e.g., image A excels in IF while image B excels in VQ).
- Tied pairs \((I_A, I_B)_{\text{tie}}\) are decomposed into two training samples, labeled \(I_A \succ I_B\) and \(I_B \succ I_A\) along their respective preferred dimensions.
- This encourages the model to learn finer-grained cross-dimensional trade-offs and yields smoother training curves.
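
A minimal sketch of the tie-decomposition rule, assuming per-dimension 4-level Likert ratings are available for each image; the function and field names here are illustrative, not taken from the paper's released code:

```python
# Illustrative tie decomposition: an overall-tie pair contributes one ranked
# training sample per dimension in which one image is strictly preferred.
from dataclasses import dataclass

@dataclass
class RankedSample:
    winner: str     # id of the preferred image
    loser: str      # id of the dispreferred image
    dimension: str  # "IF" (instruction following) or "VQ" (visual quality)

def decompose_tie(image_a, image_b, ratings_a, ratings_b):
    """ratings_a / ratings_b map dimension name -> Likert rating (1-4).

    A pair tied on both dimensions yields no sample; a pair with
    complementary strengths yields two samples with opposite winners.
    """
    samples = []
    for dim in ("IF", "VQ"):
        if ratings_a[dim] > ratings_b[dim]:
            samples.append(RankedSample(image_a, image_b, dim))
        elif ratings_b[dim] > ratings_a[dim]:
            samples.append(RankedSample(image_b, image_a, dim))
    return samples

# Example: A wins on IF, B wins on VQ -> two complementary ranked samples.
pairs = decompose_tie("edit_a.png", "edit_b.png",
                      {"IF": 4, "VQ": 2}, {"IF": 3, "VQ": 4})
```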
### Loss & Training
- Backbone: Qwen2.5-VL-7B or MiMo-VL-7B, with full-parameter fine-tuning.
- Training: 2 epochs on 8×A800 GPUs, learning rate 2e-6, cosine schedule.
- Images are preprocessed to 448×448 with aspect ratio preserved.
- The total loss is the ranking loss \(\mathcal{L}_{\text{rank}} = -\log(P(I_h \succ I_l))\).
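
A minimal PyTorch sketch of this objective under the probit form above; the tensor shapes, variance bookkeeping, and pessimistic-minimum approximation are assumptions made for illustration, not the authors' exact implementation:

```python
# Uncertainty-aware pairwise ranking loss with per-dimension Gaussian scores.
import torch

def aggregate(mu, var, mode="mean"):
    """Aggregate the per-dimension (IF, VQ) Gaussians into one distribution.

    mu, var: (batch, 2) tensors of per-dimension means and variances.
    Assuming independent dimensions, the mean of two Gaussians is Gaussian
    with variance (var_IF + var_VQ) / 4.
    """
    if mode == "mean":
        return mu.mean(dim=-1), var.sum(dim=-1) / 4.0
    if mode == "sum":
        return mu.sum(dim=-1), var.sum(dim=-1)
    if mode == "min":  # pessimistic: keep the weaker dimension's Gaussian
        idx = mu.argmin(dim=-1, keepdim=True)
        return mu.gather(-1, idx).squeeze(-1), var.gather(-1, idx).squeeze(-1)
    raise ValueError(mode)

def ranking_loss(mu_h, var_h, mu_l, var_l, mode="mean"):
    """L_rank = -log P(I_h > I_l), with P = Phi((mu_h - mu_l) / sqrt(var_h + var_l))."""
    m_h, v_h = aggregate(mu_h, var_h, mode)
    m_l, v_l = aggregate(mu_l, var_l, mode)
    z = (m_h - m_l) / torch.sqrt(v_h + v_l + 1e-8)
    p = torch.distributions.Normal(0.0, 1.0).cdf(z).clamp_min(1e-12)
    return -torch.log(p).mean()

# Usage: the per-dimension reward heads predict (mu, var) for each image.
mu_h, var_h = torch.randn(4, 2), torch.rand(4, 2) + 0.1
mu_l, var_l = torch.randn(4, 2), torch.rand(4, 2) + 0.1
loss = ranking_loss(mu_h, var_h, mu_l, var_l)
```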
## Key Experimental Results

### Main Results
| Method | GenAI-Bench | AURORA-Bench | ImagenHub | EditReward-Bench |
|---|---|---|---|---|
| GPT-4o | 53.54 | 50.81 | 38.21 | 28.31 |
| GPT-5 | 59.61 | 47.27 | 40.85 | 37.81 |
| Gemini-2.5-Flash | 57.01 | 47.63 | 41.62 | 38.02 |
| Qwen2.5-VL-7B-Inst | 40.48 | 38.62 | 18.59 | 29.75 |
| EditReward (Qwen) | 63.97 | 59.50 | 36.18 | 36.78 |
| EditReward (MiMo) | 65.72 | 63.62 | 35.20 | 38.42 |
EditReward outperforms closed-source judges such as GPT-5 and Gemini-2.5-Flash on GenAI-Bench and AURORA-Bench, and the MiMo variant also leads on EditReward-Bench; ImagenHub is the one benchmark where closed-source models retain an edge.
### Data Filtering Application
EditReward is used to filter a high-quality subset from ShareGPT-4o-Image (46K) for fine-tuning Step1X-Edit:
| Training Data | GEdit-EN \(G_O\) | GEdit-CN \(G_O\) |
|---|---|---|
| Step1X-Edit (original) | 6.444 | 6.779 |
| + Full ShareGPT-4o | 6.780 | 6.583 |
| + Top 10K (EditReward filtered) | 6.938 | 7.000 |
| + Top 20K (EditReward filtered) | 7.086 | 7.074 |
| + Top 30K (EditReward filtered) | 6.962 | 6.938 |
| Doubao-Edit | 6.983 | 6.942 |
Top 20K is the best quality–quantity trade-off: it lifts the open-source Step1X-Edit to 7.086 / 7.074, matching and even slightly surpassing Doubao-Edit (6.983 / 6.942) on this metric.
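
Operationally, the filtering recipe reduces to scoring every candidate and keeping the top K. A sketch assuming a hypothetical `reward_model.score(instruction, source, edited)` wrapper around the trained model (not the released API):

```python
# Hedged sketch of reward-based data filtering via a streaming top-k heap.
import heapq

def filter_top_k(dataset, reward_model, k=20_000):
    """Keep the k samples with the highest aggregated reward.

    dataset: iterable of (instruction, source_image, edited_image) triples.
    A heap keeps memory at O(k); scoring dominates the cost anyway
    (about 2.61 GPU hours for the 46K ShareGPT-4o-Image samples).
    """
    heap = []
    for i, sample in enumerate(dataset):
        s = reward_model.score(*sample)  # hypothetical aggregated IF+VQ score
        item = (s, i, sample)            # index breaks ties; samples never compared
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif s > heap[0][0]:
            heapq.heapreplace(heap, item)
    return [sample for _, _, sample in sorted(heap, reverse=True)]
```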
### Ablation Study
| Variant | Loss Type | Head Type | Aggregation | GenAI-Bench |
|---|---|---|---|---|
| I | Pointwise regression | N/A | N/A | 49.62 |
| II | Pairwise ranking | Shared | Mean | 60.17 |
| V (final) | Pairwise ranking | Multi-independent | Mean | 63.97 |
- Pairwise ranking ≫ pointwise regression (+14.35).
- Multi-independent heads ≫ shared head (+3.80).
- Mean aggregation yields the best overall performance.
### Key Findings
- After training, Qwen2.5-VL-7B improves by more than 23 points on GenAI-Bench (40.48→63.97), demonstrating the potency of the proposed framework.
- On OOD tasks (Text/Style categories), EditReward generalizes well, scoring 46.80 vs. GPT-4o's 41.69.
- Data quality outweighs quantity: Top 20K outperforms the full 46K set.
## Highlights & Insights
- The 200K expert-annotated preference dataset achieves high quality (Krippendorff's \(\alpha > 0.59\)), far surpassing crowdsourced annotations.
- The two-dimensional (IF + VQ) decoupled design is empirically supported: inter-annotator agreement is higher for IF (0.668) than for VQ (0.597), indicating the two dimensions capture distinct judgments and motivating dimension-wise modeling.
- Tie decomposition is a simple yet effective technique that fully exploits the information embedded in tied annotation pairs.
- The data filtering application is direct and quantifiable: scoring all 46K samples takes only 2.61 GPU hours (about 46,000 / 2.61 ≈ 17,600 samples per GPU hour).
## Limitations & Future Work
- The annotation framework covers only two dimensions (IF and VQ), potentially missing aspects such as spatial consistency and style preservation.
- Experiments are conducted primarily on 7B-scale VLMs; the effectiveness at larger or smaller scales remains unknown.
- Data filtering experiments are validated on a single downstream model (Step1X-Edit); generalizability requires further investigation.
- Multi-way preference accuracy on EditReward-Bench remains low (~11% at K=4), indicating that the task remains highly challenging.
## Related Work & Insights
- HPSv3: A pioneer in uncertainty-aware ranking, but limited to a single dimension.
- ImageRewardDB: An early preference dataset, but noisy and single-dimensional.
- ADIEE: Trained on model-generated labels, introducing bias.
- Key insight: high-quality human annotation combined with multi-dimensional decoupling constitutes the critical path toward reliable reward models.
## Rating
- Novelty: ⭐⭐⭐⭐ Multi-dimensional uncertainty-aware ranking and tie decomposition are notable contributions, though the overall framework is relatively standard.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluation on 4 benchmarks, data filtering application, extensive ablations, and OOD testing.
- Writing Quality: ⭐⭐⭐⭐ Well-structured and data-rich, though notation is somewhat dense in places.
- Value: ⭐⭐⭐⭐⭐ Both the dataset and model will be open-sourced, providing significant contributions to the image editing community.