EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing

Conference: ICLR 2026
arXiv: 2509.26346
Code: GitHub
Area: Image Editing / Reward Model
Keywords: Image Editing, Reward Model, Human Preference, Data Filtering, VLM

TL;DR

This paper constructs EditReward-Data, a high-quality dataset of 200K expert-annotated preference pairs, and trains the EditReward reward model, which achieves state-of-the-art human alignment across multiple image editing evaluation benchmarks. The model is further validated as a data filter that substantially improves downstream editing model performance.

Background & Motivation

Instruction-guided image editing has advanced considerably in recent years. Closed-source models such as GPT-Image-1 and Seedream deliver strong results, yet open-source models still exhibit a significant gap. The core bottleneck lies in the absence of reliable reward models for filtering and scaling high-quality training data.

Existing evaluation and reward approaches suffer from three fundamental problems:

Perceptual scores (e.g., LPIPS): unable to capture semantic alignment with the editing instruction.

Feature-based scores (e.g., CLIP): unable to understand editing semantics.

VLM-as-judge (e.g., VIEScore): general-purpose VLMs are not optimized for editing tasks.

Prior fine-tuned reward models either rely on noisy crowdsourced annotations (low inter-annotator agreement) or pseudo-labels generated by closed-source models (introducing bias). The root cause is that training a reliable reward model requires large-scale, high-quality human-annotated preference data — a resource that has been lacking.

Key Insight: Construct a large-scale, high-quality, multi-dimensional expert-annotated preference dataset and train a reward model specifically tailored to instruction-guided image editing.

Method

Overall Architecture

EditReward consists of three core components:

1. EditReward-Data: A dataset of 200K expert-annotated preference pairs.
2. EditReward Model: A VLM-based multi-dimensional uncertainty-aware ranking reward model.
3. EditReward-Bench: A multi-way preference ranking evaluation benchmark.

Key Designs

EditReward-Data Construction:
- 9,557 instruction–image pairs are collected from six editing benchmarks (GEdit-Bench, ImgEdit-Bench, MagicBrush, etc.).
- Six state-of-the-art editing models (Step1X-Edit, Flux-Kontext, Qwen-Image-Edit, etc.) each generate multiple outputs per input.
- Key: Trained annotators follow a strict annotation protocol, rating outputs on a 4-level Likert scale across two dimensions:
  - Instruction Following (IF): semantic accuracy, completeness, and absence of unintended modifications.
  - Visual Quality (VQ): plausibility, absence of artifacts, and aesthetics.
- Krippendorff's \(\alpha\) reaches IF = 0.668 and VQ = 0.597, confirming high annotation quality.
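
To make the agreement figures above more concrete, here is a minimal sketch of how Krippendorff's \(\alpha\) can be computed from per-item ratings. It assumes the interval distance \((c - k)^2\); the summary does not say which distance metric the authors used for the 4-level Likert scale, so the function name `krippendorff_alpha_interval` and the choice of metric are illustrative only.

```python
from itertools import permutations
from typing import Sequence


def krippendorff_alpha_interval(units: Sequence[Sequence[float]]) -> float:
    """Krippendorff's alpha with the interval distance (c - k)^2.

    `units` holds one list of ratings per annotated item. Items with fewer
    than two ratings cannot be paired and are skipped.
    The interval metric is an assumption, not taken from the paper.
    """
    pairable = [list(u) for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")

    # Observed disagreement: ordered pairs within each item,
    # each item weighted by 1 / (m_u - 1).
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: ordered pairs across all pooled ratings.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e
```

An \(\alpha\) of 1 means perfect agreement, 0 means agreement no better than chance, and negative values indicate systematic disagreement. The pooled pair loop is quadratic in the number of ratings, so this form suits small samples only.
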

Multi-Dimensional Uncertainty-Aware Ranking:
- Inspired by HPSv3, scores are modeled as Gaussian distributions \(s_{i,d} \sim \mathcal{N}(\mu_{i,d}, \sigma_{i,d}^2)\), where \(d \in \{1,2\}\) corresponds to the IF and VQ dimensions respectively.
- Multi-task learning (MTL) is adopted, with independent reward heads predicting Gaussian parameters for each dimension.
- Three aggregation strategies are explored: pessimistic minimum, balanced mean, and direct summation.
- The final preference probability is computed by integrating over the two aggregated distributions: \(\mathcal{L}_{\text{rank}} = -\log(P(I_h \succ I_l))\)
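
The aggregation and ranking steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-dimension scores are treated as independent Gaussians, the preference probability is taken as \(P(s_h > s_l) = \Phi\big((\mu_h - \mu_l)/\sqrt{\sigma_h^2 + \sigma_l^2}\big)\), and the "min" strategy is approximated by keeping the dimension with the lower mean. The paper may integrate a different link function over the two distributions. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Gaussian = Tuple[float, float]  # (mean, variance)


def aggregate(dims: List[Gaussian], strategy: str = "mean") -> Gaussian:
    """Collapse per-dimension Gaussians (e.g. IF and VQ) into one distribution.

    Assumes the dimensions are independent, so variances add.
    """
    mus = [m for m, _ in dims]
    variances = [v for _, v in dims]
    n = len(dims)
    if strategy == "sum":
        return sum(mus), sum(variances)
    if strategy == "mean":
        return sum(mus) / n, sum(variances) / (n * n)
    if strategy == "min":
        # Heuristic stand-in: the minimum of Gaussians is not Gaussian,
        # so simply keep the dimension with the lowest mean.
        return min(dims, key=lambda g: g[0])
    raise ValueError(f"unknown strategy: {strategy}")


def preference_probability(high: Gaussian, low: Gaussian) -> float:
    """P(s_high > s_low) for two independent Gaussian scores."""
    mu_h, var_h = high
    mu_l, var_l = low
    std = math.sqrt(max(var_h + var_l, 1e-12))  # guard against zero variance
    z = (mu_h - mu_l) / std
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rank_loss(high_dims: List[Gaussian], low_dims: List[Gaussian],
              strategy: str = "mean") -> float:
    """Negative log-likelihood that the preferred image wins."""
    p = preference_probability(aggregate(high_dims, strategy),
                               aggregate(low_dims, strategy))
    return -math.log(max(p, 1e-12))
```

Under this sketch, two images with identical aggregated distributions get a preference probability of 0.5, and the loss shrinks as the preferred image's mean pulls ahead relative to the combined uncertainty.
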

Disentangling Ties via Dimensional Preference:
- Core Idea: pairs rated as overall ties often exhibit complementary strengths across dimensions (e.g., image A excels in IF while image B excels in VQ).
- Tied pairs \((I_A, I_B)_{\text{tie}}\) are decomposed into two training samples, labeled \(I_A \succ I_B\) and \(I_B \succ I_A\) along their respective preferred dimensions.
- This encourages the model to learn finer-grained cross-dimensional trade-offs and yields smoother training curves.
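
A minimal sketch of the tie decomposition described above, assuming each image carries per-dimension annotator scores keyed by `"IF"` and `"VQ"`. The `PairSample` type and `decompose_tie` function are hypothetical names for illustration; how the paper decides that a pair is an overall tie is not specified in this summary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PairSample:
    winner: str
    loser: str
    dimension: str  # dimension along which `winner` is preferred


def decompose_tie(id_a: str, id_b: str,
                  scores_a: Dict[str, float],
                  scores_b: Dict[str, float]) -> List[PairSample]:
    """Split an overall tie into per-dimension preference samples.

    A dimension on which both images score equally yields no sample.
    """
    samples = []
    for dim in ("IF", "VQ"):
        if scores_a[dim] > scores_b[dim]:
            samples.append(PairSample(id_a, id_b, dim))
        elif scores_b[dim] > scores_a[dim]:
            samples.append(PairSample(id_b, id_a, dim))
    return samples
```

For example, if image A scores higher on IF and image B scores higher on VQ, the tied pair yields one sample preferring A on IF and one preferring B on VQ, so each reward head receives a clear training signal.
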

Loss & Training

Backbone: Qwen2.5-VL-7B or MiMo-VL-7B, with full parameter fine-tuning.
2 epochs, 8×A800 GPUs, learning rate 2e-6, cosine schedule.
Images are preprocessed to 448×448 with aspect ratio preserved.
The total loss is the ranking loss \(\mathcal{L}_{\text{rank}} = -\log(P(I_h \succ I_l))\).
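
For reference, a cosine learning-rate schedule with the peak value listed above could look like the sketch below. It assumes no warmup and decay to zero over the whole run; neither detail is stated in this summary, and `cosine_lr` is a hypothetical helper rather than the authors' training code.

```python
import math


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = 2e-6, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate at `step` (no warmup assumed)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that out-of-range steps stay on the schedule endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at `peak_lr`, reaches the midpoint between `peak_lr` and `min_lr` halfway through training, and ends at `min_lr`.
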

Key Experimental Results

Main Results

| Method | GenAI-Bench | AURORA-Bench | ImagenHub | EditReward-Bench |
| --- | --- | --- | --- | --- |
| GPT-4o | 53.54 | 50.81 | 38.21 | 28.31 |
| GPT-5 | 59.61 | 47.27 | 40.85 | 37.81 |
| Gemini-2.5-Flash | 57.01 | 47.63 | 41.62 | 38.02 |
| Qwen2.5-VL-7B-Inst | 40.48 | 38.62 | 18.59 | 29.75 |
| EditReward (Qwen) | 63.97 | 59.50 | 36.18 | 36.78 |
| EditReward (MiMo) | 65.72 | 63.62 | 35.20 | 38.42 |

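EditReward outperforms closed-source models including GPT-5 and Gemini-2.5-Flash on GenAI-Bench and AURORA-Bench, and the MiMo variant narrowly leads on EditReward-Bench (38.42 vs. 38.02 for Gemini-2.5-Flash). On ImagenHub, however, both EditReward variants trail GPT-4o, GPT-5, and Gemini-2.5-Flash.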

Data Filtering Application

EditReward is used to filter a high-quality subset from ShareGPT-4o-Image (46K) for fine-tuning Step1X-Edit:

| Training Data | GEdit-EN \(G_O\) | GEdit-CN \(G_O\) |
| --- | --- | --- |
| Step1X-Edit (original) | 6.444 | 6.779 |
| + Full ShareGPT-4o | 6.780 | 6.583 |
| + Top 10K (EditReward filtered) | 6.938 | 7.000 |
| + Top 20K (EditReward filtered) | 7.086 | 7.074 |
| + Top 30K (EditReward filtered) | 6.962 | 6.938 |
| Doubao-Edit | 6.983 | 6.942 |

Top 20K is the best-performing subset: it lifts the open-source Step1X-Edit slightly above Doubao-Edit on both GEdit-EN (7.086 vs. 6.983) and GEdit-CN (7.074 vs. 6.942), while Top 10K and Top 30K score lower.
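
The filtering step amounts to scoring every candidate sample with the reward model and keeping the highest-scoring ones. A minimal sketch, assuming a hypothetical `score_fn` that wraps the reward model and returns one scalar per sample (the actual EditReward inference API is not described in this summary):

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def filter_top_k(samples: Sequence[T],
                 score_fn: Callable[[T], float],
                 k: int) -> List[T]:
    """Return the k samples with the highest reward score, best first."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # sorted() calls score_fn once per sample, so each item is scored once.
    ranked = sorted(samples, key=score_fn, reverse=True)
    return ranked[:k]
```

With the dataset above, `k` would be set to 10,000, 20,000, or 30,000 over the 46K candidates. The results in the table suggest that the cutoff matters, since the 20K subset scores higher than both the smaller and the larger one.
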

Ablation Study

| Variant | Loss Type | Head Type | Aggregation | GenAI-Bench |
| --- | --- | --- | --- | --- |
| I | Pointwise regression | N/A | N/A | 49.62 |
| II | Pairwise ranking | Shared | Mean | 60.17 |
| V (final) | Pairwise ranking | Multi-independent | Mean | 63.97 |

Pairwise ranking ≫ pointwise regression: +10.55 with a shared head (I → II) and +14.35 for the final variant (I → V).
Multi-independent heads ≫ shared head (+3.80).
Mean aggregation yields the best overall performance.

Key Findings

After training, Qwen2.5-VL-7B improves by more than 23 points on GenAI-Bench (40.48→63.97), demonstrating the potency of the proposed framework.
EditReward performs comparably to GPT-4o on OOD tasks (Text/Style categories): 46.80 vs. 41.69.
Data quality outweighs quantity: Top 20K outperforms the full 46K set.

Highlights & Insights

The 200K expert-annotated preference dataset achieves high quality (Krippendorff's \(\alpha > 0.59\)), far surpassing crowdsourced annotations.
The two-dimensional (IF + VQ) decoupled design is empirically supported: inter-annotator agreement is indeed higher for IF than VQ, validating the necessity of dimension-wise modeling.
Tie decomposition is a simple yet effective technique that fully exploits the information embedded in tied annotation pairs.
The data filtering application is direct and quantifiable: scoring 46K samples requires only 2.61 GPU hours.

Limitations & Future Work

The annotation framework covers only two dimensions (IF and VQ), potentially missing aspects such as spatial consistency and style preservation.
Experiments are conducted primarily on 7B-scale VLMs; the effectiveness at larger or smaller scales remains unknown.
Data filtering experiments are validated on a single downstream model (Step1X-Edit); generalizability requires further investigation.
Multi-way preference accuracy on EditReward-Bench remains low (~11% at K=4), indicating that the task remains highly challenging.

Related Work & Insights

HPSv3: A pioneer in uncertainty-aware ranking, but limited to a single dimension.
ImageRewardDB: An early preference dataset, but noisy and single-dimensional.
ADIEE: Trained on model-generated labels, introducing bias.
Key insight: high-quality human annotation combined with multi-dimensional decoupling constitutes the critical path toward reliable reward models.

Rating

Novelty: ⭐⭐⭐⭐ Multi-dimensional uncertainty-aware ranking and tie decomposition are notable contributions, though the overall framework is relatively standard.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluation on 4 benchmarks, data filtering application, extensive ablations, and OOD testing.
Writing Quality: ⭐⭐⭐⭐ Well-structured and data-rich, though notation is somewhat dense in places.
Value: ⭐⭐⭐⭐⭐ Both the dataset and model will be open-sourced, providing significant contributions to the image editing community.