
SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing

Conference: ICCV 2025 arXiv: 2505.02370 Area: Diffusion Models · Image Editing Keywords: instruction-based editing, supervision rectification, contrastive learning, diffusion prior, VLM, triplet loss

TL;DR

SuperEdit tackles the noisy-supervision problem in instruction-based image editing: it leverages diffusion generation priors to guide a VLM in rectifying editing instructions so they match the actual image pairs, and it constructs contrastive supervision signals (positive/negative instructions + a triplet loss). The result surpasses SmartEdit by 9.19% with less data and a smaller model.

Background & Motivation

Training data for instruction-based image editing is typically generated by automated pipelines (LLM rewrites captions → diffusion model generates edited images), but diffusion models cannot precisely follow text instructions, leading to:

  • Mismatches between edited images and editing instructions
  • Unintended modifications to regions that should remain unchanged
  • Noisy supervision signals

Limitations of existing approaches:

  • Scaling data (InstructPix2Pix): the noisy-supervision problem remains unresolved
  • Introducing large VLMs (SmartEdit, MGIE): prohibitive computational cost (up to 14.1B parameters)
  • Pre-training on recognition tasks (InstructDiffusion): indirect mitigation that does not address the root cause

SuperEdit's key insight: the problem lies in the supervision signal itself, not the model architecture. Rectifying instructions is more direct and effective than scaling up models.

Method

1. Diffusion Generation Prior

Core finding: at different inference stages, editing models generate fixed attributes regardless of the text prompt:

  • Early stage: global layout
  • Middle stage: local object attributes
  • Late stage: image details
  • Style changes: span all stages

This prior provides a unified foundation for instruction rectification.

2. Rectifying Supervision

GPT-4o is used to regenerate accurate editing instructions from original→edited image pairs according to the four generation attributes (global/local/detail/style).

Procedure:

  1. Feed the original and edited images to GPT-4o
  2. Guide the VLM to describe the differences according to the four change categories defined by the diffusion prior
  3. Aggregate the descriptions into a rectified instruction that precisely matches the image pair
  4. Ensure the instruction stays within the 77-token limit of the CLIP text encoder
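A minimal sketch of this pipeline, assuming the OpenAI Python SDK and the Hugging Face CLIP tokenizer; the prompt text and helper names (`encode_image`, `rectify_instruction`) are illustrative, not the paper's actual prompt or code:

```python
import base64
from openai import OpenAI
from transformers import CLIPTokenizer

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

RECTIFY_PROMPT = (
    "Compare the two images and describe their differences in four categories: "
    "global layout, local object attributes, image details, and style. Then merge "
    "the descriptions into one concise instruction that edits the first image "
    "into the second."
)

def encode_image(path: str) -> str:
    """Base64-encode a local image for the vision API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def rectify_instruction(original_path: str, edited_path: str) -> str:
    """Ask GPT-4o for an instruction that actually matches the image pair."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": RECTIFY_PROMPT},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{encode_image(original_path)}"}},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{encode_image(edited_path)}"}},
            ],
        }],
    )
    instruction = response.choices[0].message.content.strip()
    # Flag instructions that exceed CLIP's 77-token context.
    n_tokens = len(clip_tokenizer(instruction)["input_ids"])
    if n_tokens > 77:
        print(f"warning: {n_tokens} CLIP tokens (>77); re-prompt for a shorter one")
    return instruction
```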

3. Facilitating Supervision (Contrastive Supervision)

Even after instruction rectification, the editing model still struggles to distinguish semantically similar instructions (e.g., "add a cat on the left" vs. "add two cats on the right").

Constructing contrastive instructions: GPT-4o substitutes a single attribute (quantity/position/object) in the rectified instruction to generate a negative instruction \(c^T_{neg}\); the rectified instruction itself serves as the positive \(c^T_{pos}\).
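A hedged sketch of this step, reusing the `client` from the snippet above; the prompt wording is illustrative:

```python
NEGATIVE_PROMPT_TEMPLATE = (
    "Rewrite the following image-editing instruction by changing exactly one "
    "attribute (object, quantity, or position) so it no longer matches the edit, "
    "keeping everything else identical.\n\nInstruction: {instruction}"
)

def make_negative_instruction(rectified: str) -> str:
    """Produce c^T_neg by swapping a single attribute in the rectified instruction."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": NEGATIVE_PROMPT_TEMPLATE.format(instruction=rectified)}],
    )
    return response.choices[0].message.content.strip()

# e.g. "add a cat on the left"  ->  "add two cats on the left"
```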

Triplet Loss:

\[\mathcal{L}_{\text{triplet}} = \max\{d(\epsilon_t, \epsilon_{pos}) - d(\epsilon_t, \epsilon_{neg}) + m, 0\}\]

where

\[\epsilon_{pos} = \epsilon_\theta(\text{concat}(x_t, c^I), t, c^T_{pos}), \qquad \epsilon_{neg} = \epsilon_\theta(\text{concat}(x_t, c^I), t, c^T_{neg}),\]

\(\epsilon_t\) is the ground-truth noise, \(d(\cdot, \cdot)\) measures the distance between noise predictions, and \(m\) is the margin.

Total Loss:

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{train}} + \lambda \cdot \mathcal{L}_{\text{triplet}}\]

where \(\mathcal{L}_{\text{train}} = d(\epsilon_t, \epsilon_{pos})\) is the standard diffusion loss.
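A minimal PyTorch sketch of the combined objective, assuming a diffusers-style InstructPix2Pix UNet whose input is the noisy latent concatenated channel-wise with the image-condition latent; `d` is taken to be MSE (consistent with the standard diffusion loss), and `margin`/`lam` are hypothetical hyperparameter names:

```python
import torch
import torch.nn.functional as F

def superedit_loss(unet, x_t, t, noise, c_img, emb_pos, emb_neg,
                   margin: float = 0.5, lam: float = 1.0) -> torch.Tensor:
    """Standard diffusion loss plus the SuperEdit triplet term.

    x_t:     noisy latent at timestep t
    noise:   ground-truth noise epsilon_t
    c_img:   encoded original image c^I (channel-wise condition)
    emb_pos: text embedding of the rectified (positive) instruction c^T_pos
    emb_neg: text embedding of the contrastive (negative) instruction c^T_neg
    """
    inp = torch.cat([x_t, c_img], dim=1)  # InstructPix2Pix-style conditioning
    eps_pos = unet(inp, t, encoder_hidden_states=emb_pos).sample
    eps_neg = unet(inp, t, encoder_hidden_states=emb_neg).sample

    d_pos = F.mse_loss(noise, eps_pos)    # L_train = d(eps_t, eps_pos)
    d_neg = F.mse_loss(noise, eps_neg)

    triplet = torch.clamp(d_pos - d_neg + margin, min=0.0)
    return d_pos + lam * triplet          # L_total = L_train + lambda * L_triplet
```

Note that the extra forward pass for `eps_neg` is exactly the per-step overhead mentioned under Limitations below.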

Key Experimental Results

Comparison on the Real-Edit Benchmark

| Method | Pre-training Data Size | Model Size | Overall↑ |
|---|---|---|---|
| SmartEdit | 1.2M | 14.1B | 3.59 |
| MGIE | 1.0M | 8.1B | 2.86 |
| InstructPix2Pix | 300K | 1.1B | 3.31 |
| SuperEdit | 40K | 1.1B | 3.92 |

SuperEdit surpasses SmartEdit by 9.19% using 30× less data and a 13× smaller model.

GPT-4o Automatic Evaluation

| Method | Following Acc↑ | Preserving Acc↑ | Quality Acc↑ | Overall Acc↑ |
|---|---|---|---|---|
| SmartEdit | 64% | 66% | 45% | 58.3% |
| SuperEdit | 75% | 72% | 55% | 67.3% |

SuperEdit achieves comprehensive improvements across instruction following, content preservation, and image quality.

Ablation Study

| Configuration | Following↑ | Preserving↑ | Quality↑ | Overall↑ |
|---|---|---|---|---|
| Original instructions (baseline) | 52% | 53% | 50% | 51.7% |
| + Rectified instructions | 70% | 68% | 52% | 63.3% |
| + Rectified instructions + contrastive loss | 75% | 72% | 55% | 67.3% |

Instruction rectification contributes roughly 11.6 points of overall accuracy (51.7% → 63.3%); the contrastive loss adds a further 4 points (63.3% → 67.3%).

Highlights & Insights

  1. Data-centric rather than model-centric: improving supervision signals rather than scaling up models yields greater gains at minimal cost
  2. Generality of the diffusion prior: editing models and T2I models share the same stage-wise generation attributes
  3. Elegant application of contrastive learning: modifying only a single attribute in the instruction ensures small embedding distance but large semantic difference between positive and negative samples
  4. Fully open-sourced (data + model) with strong reproducibility

Limitations & Future Work

  • Relies on GPT-4o for instruction rectification and contrastive instruction generation, incurring API costs
  • Inherent limitations of the InstructPix2Pix architecture (e.g., resolution constraints of SD 1.5)
  • The contrastive loss requires an additional UNet forward pass per step
  • Integration with stronger base models (SDXL, Flux) remains unexplored
Related Work

  • Instruction-based editing: InstructPix2Pix, MagicBrush, SmartEdit
  • Editing data construction: Prompt-to-Prompt, EditBench
  • Diffusion model alignment: DPO for diffusion, ReFL

Rating

| Dimension | Score (1–5) |
|---|---|
| Novelty | 4 |
| Technical Depth | 4 |
| Experimental Thoroughness | 5 |
| Writing Quality | 4 |
| Overall | 4.2 |