
SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing

Conference: ICCV 2025 arXiv: 2505.02370 Area: Diffusion Models · Image Editing Keywords: instruction-based editing, supervision rectification, contrastive learning, diffusion prior, VLM, triplet loss

TL;DR

SuperEdit tackles the noisy-supervision problem in instruction-based image editing: it leverages diffusion generation priors to guide a VLM in rectifying editing instructions so they match the actual image pairs, and it constructs contrastive supervision signals (positive/negative instructions + a triplet loss). The result surpasses SmartEdit by 9.19% with less data and a smaller model.

Background & Motivation

Training data for instruction-based image editing is typically generated by automated pipelines (LLM rewrites captions → diffusion model generates edited images), but diffusion models cannot precisely follow text instructions, leading to:

  • Mismatches between edited images and editing instructions
  • Unintended modifications to regions that should remain unchanged
  • Noisy supervision signals

Limitations of existing approaches:

  • Scaling data (InstructPix2Pix): the noisy-supervision problem remains unresolved
  • Introducing large VLMs (SmartEdit, MGIE): prohibitive computational cost (up to 14.1B parameters)
  • Pre-training on recognition tasks (InstructDiffusion): indirect mitigation that does not address the root cause

SuperEdit's key insight: the problem lies in the supervision signal itself, not the model architecture. Rectifying instructions is more direct and effective than scaling up models.

Method

1. Diffusion Generation Prior

Core finding: at different inference stages, editing models generate fixed attributes regardless of the text prompt:

  • Early stage: global layout
  • Middle stage: local object attributes
  • Late stage: image details
  • Style changes: span all stages

This prior provides a unified foundation for instruction rectification.

2. Rectifying Supervision

GPT-4o is used to regenerate accurate editing instructions from original→edited image pairs according to the four generation attributes (global/local/detail/style).

Procedure:

  1. Feed the original and edited images to GPT-4o
  2. Guide the VLM to describe the differences according to the four change categories defined by the diffusion prior
  3. Aggregate the descriptions into a rectified instruction that precisely matches the image pair
  4. Ensure the instruction stays within the 77-token limit of the CLIP text encoder
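A minimal sketch of this pipeline, assuming the OpenAI Python SDK and the Hugging Face CLIP tokenizer; the prompt text and helper names (`encode_image`, `rectify_instruction`) are illustrative, not the paper's actual prompt or code:

```python
import base64
from openai import OpenAI
from transformers import CLIPTokenizer

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
clip_tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

RECTIFY_PROMPT = (
    "Compare the two images and describe their differences in four categories: "
    "global layout, local object attributes, image details, and style. Then merge "
    "the descriptions into one concise instruction that edits the first image "
    "into the second."
)

def encode_image(path: str) -> str:
    """Base64-encode a local image for the vision API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def rectify_instruction(original_path: str, edited_path: str) -> str:
    """Ask GPT-4o for an instruction that actually matches the image pair."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": RECTIFY_PROMPT},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{encode_image(original_path)}"}},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{encode_image(edited_path)}"}},
            ],
        }],
    )
    instruction = response.choices[0].message.content.strip()
    # Flag instructions that exceed CLIP's 77-token context.
    n_tokens = len(clip_tokenizer(instruction)["input_ids"])
    if n_tokens > 77:
        print(f"warning: {n_tokens} CLIP tokens (>77); re-prompt for a shorter one")
    return instruction
```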

3. Facilitating Supervision (Contrastive Supervision)

Even after instruction rectification, the editing model still struggles to distinguish semantically similar instructions (e.g., "add a cat on the left" vs. "add two cats on the right").

Constructing contrastive instructions: GPT-4o substitutes a single attribute (quantity/position/object) in the rectified instruction to generate a negative instruction \(c^T_{neg}\); the rectified instruction itself serves as the positive \(c^T_{pos}\).
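A hedged sketch of this step, reusing the `client` from the snippet above; the prompt wording is illustrative:

```python
NEGATIVE_PROMPT_TEMPLATE = (
    "Rewrite the following image-editing instruction by changing exactly one "
    "attribute (object, quantity, or position) so it no longer matches the edit, "
    "keeping everything else identical.\n\nInstruction: {instruction}"
)

def make_negative_instruction(rectified: str) -> str:
    """Produce c^T_neg by swapping a single attribute in the rectified instruction."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": NEGATIVE_PROMPT_TEMPLATE.format(instruction=rectified)}],
    )
    return response.choices[0].message.content.strip()

# e.g. "add a cat on the left"  ->  "add two cats on the left"
```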

Triplet Loss:

\[\mathcal{L}_{\text{triplet}} = \max\{d(\epsilon_t, \epsilon_{pos}) - d(\epsilon_t, \epsilon_{neg}) + m, 0\}\]

where

\[\epsilon_{pos} = \epsilon_\theta(\text{concat}(x_t, c^I), t, c^T_{pos}), \qquad \epsilon_{neg} = \epsilon_\theta(\text{concat}(x_t, c^I), t, c^T_{neg}),\]

\(\epsilon_t\) is the ground-truth noise, \(d(\cdot, \cdot)\) measures the distance between noise predictions, and \(m\) is the margin.

Total Loss:

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{train}} + \lambda \cdot \mathcal{L}_{\text{triplet}}\]

where \(\mathcal{L}_{\text{train}} = d(\epsilon_t, \epsilon_{pos})\) is the standard diffusion loss.
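A minimal PyTorch sketch of the combined objective, assuming a diffusers-style InstructPix2Pix UNet whose input is the noisy latent concatenated channel-wise with the image-condition latent; `d` is taken to be MSE (consistent with the standard diffusion loss), and `margin`/`lam` are hypothetical hyperparameter names:

```python
import torch
import torch.nn.functional as F

def superedit_loss(unet, x_t, t, noise, c_img, emb_pos, emb_neg,
                   margin: float = 0.5, lam: float = 1.0) -> torch.Tensor:
    """Standard diffusion loss plus the SuperEdit triplet term.

    x_t:     noisy latent at timestep t
    noise:   ground-truth noise epsilon_t
    c_img:   encoded original image c^I (channel-wise condition)
    emb_pos: text embedding of the rectified (positive) instruction c^T_pos
    emb_neg: text embedding of the contrastive (negative) instruction c^T_neg
    """
    inp = torch.cat([x_t, c_img], dim=1)  # InstructPix2Pix-style conditioning
    eps_pos = unet(inp, t, encoder_hidden_states=emb_pos).sample
    eps_neg = unet(inp, t, encoder_hidden_states=emb_neg).sample

    d_pos = F.mse_loss(noise, eps_pos)    # L_train = d(eps_t, eps_pos)
    d_neg = F.mse_loss(noise, eps_neg)

    triplet = torch.clamp(d_pos - d_neg + margin, min=0.0)
    return d_pos + lam * triplet          # L_total = L_train + lambda * L_triplet
```

Note that the extra forward pass for `eps_neg` is exactly the per-step overhead mentioned under Limitations below.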

Key Experimental Results

Comparison on the Real-Edit Benchmark

| Method | Pre-training Data Size | Model Size | Overall↑ |
|---|---|---|---|
| SmartEdit | 1.2M | 14.1B | 3.59 |
| MGIE | 1.0M | 8.1B | 2.86 |
| InstructPix2Pix | 300K | 1.1B | 3.31 |
| SuperEdit | 40K | 1.1B | 3.92 |

SuperEdit surpasses SmartEdit by 9.19% using 30× less data and a 13× smaller model.

GPT-4o Automatic Evaluation

| Method | Following Acc↑ | Preserving Acc↑ | Quality Acc↑ | Overall Acc↑ |
|---|---|---|---|---|
| SmartEdit | 64% | 66% | 45% | 58.3% |
| SuperEdit | 75% | 72% | 55% | 67.3% |

SuperEdit achieves comprehensive improvements across instruction following, content preservation, and image quality.

Ablation Study

| Configuration | Following↑ | Preserving↑ | Quality↑ | Overall↑ |
|---|---|---|---|---|
| Original instructions (baseline) | 52% | 53% | 50% | 51.7% |
| + Rectified instructions | 70% | 68% | 52% | 63.3% |
| + Rectified instructions + contrastive loss | 75% | 72% | 55% | 67.3% |

Instruction rectification contributes roughly 11.6 points of overall accuracy (51.7% → 63.3%); the contrastive loss adds a further 4 points (63.3% → 67.3%).

Highlights & Insights

  1. Data-centric rather than model-centric: improving supervision signals rather than scaling up models yields greater gains at minimal cost
  2. Generality of the diffusion prior: editing models and T2I models share the same stage-wise generation attributes
  3. Elegant application of contrastive learning: modifying only a single attribute in the instruction ensures small embedding distance but large semantic difference between positive and negative samples
  4. Fully open-sourced (data + model) with strong reproducibility

Limitations & Future Work

  • Relies on GPT-4o for instruction rectification and contrastive instruction generation, incurring API costs
  • Inherent limitations of the InstructPix2Pix architecture (e.g., resolution constraints of SD 1.5)
  • The contrastive loss requires an additional UNet forward pass per step
  • Integration with stronger base models (SDXL, Flux) remains unexplored
Related Work

  • Instruction-based editing: InstructPix2Pix, MagicBrush, SmartEdit
  • Editing data construction: Prompt-to-Prompt, EditBench
  • Diffusion model alignment: DPO for diffusion, ReFL

Rating

| Dimension | Score (1–5) |
|---|---|
| Novelty | 4 |
| Technical Depth | 4 |
| Experimental Thoroughness | 5 |
| Writing Quality | 4 |
| Overall | 4.2 |