SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing¶
Conference: ICCV 2025 arXiv: 2505.02370 Area: Diffusion Models · Image Editing Keywords: instruction-based editing, supervision rectification, contrastive learning, diffusion prior, VLM, triplet loss
TL;DR¶
SuperEdit tackles the noisy-supervision problem in instruction-based image editing: it leverages diffusion generation priors to guide a VLM in rectifying editing instructions, and it constructs contrastive supervision signals (positive/negative instruction pairs + a triplet loss). The result surpasses SmartEdit by 9.19% on the Real-Edit benchmark with far less data and a smaller model.
Background & Motivation¶
Training data for instruction-based image editing is typically generated by automated pipelines (an LLM rewrites captions → a diffusion model generates the edited images), but diffusion models cannot precisely follow text instructions, leading to:
- Mismatches between edited images and editing instructions
- Unintended modifications to regions that should remain unchanged
- Noisy supervision signals
Limitations of existing approaches:
- Scaling data (InstructPix2Pix): the noisy supervision problem remains unresolved
- Introducing large VLMs (SmartEdit, MGIE): prohibitive computational cost (up to 14.1B parameters)
- Pre-training on recognition tasks (InstructDiffusion): indirect mitigation that does not address the root cause
SuperEdit's key insight: the problem lies in the supervision signal itself, not the model architecture. Rectifying instructions is more direct and effective than scaling up models.
Method¶
1. Diffusion Generation Prior¶
Core finding: at different inference stages, editing models generate fixed image attributes, largely independent of the text:
- Early stage: global layout
- Middle stage: local object attributes
- Late stage: image details
- Style changes: span all stages
This prior provides a unified foundation for instruction rectification.
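To inspect this prior yourself, you can decode intermediate latents at a few denoising steps and check which attributes have already stabilized. A minimal sketch with diffusers; the model ID, step indices, and prompt are illustrative, not from the paper:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

snapshots = {}  # step index -> decoded PIL image

def grab_latents(pipe, step, timestep, callback_kwargs):
    # Decode intermediate latents at a few steps to compare
    # early (layout), middle (object attributes), late (details).
    if step in (5, 25, 45):  # illustrative step choices
        latents = callback_kwargs["latents"]
        with torch.no_grad():
            image = pipe.vae.decode(
                latents / pipe.vae.config.scaling_factor
            ).sample
        snapshots[step] = pipe.image_processor.postprocess(image)[0]
    return callback_kwargs

pipe("a red car parked on a snowy street",
     num_inference_steps=50,
     callback_on_step_end=grab_latents)
```

Comparing the three snapshots side by side makes the stage-wise specialization visible without any model modification.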
2. Rectifying Supervision¶
GPT-4o is used to regenerate accurate editing instructions from original→edited image pairs according to the four generation attributes (global/local/detail/style).
Procedure:
1. Feed the original and edited images to GPT-4o
2. Guide the VLM to describe the differences according to the four change categories defined by the diffusion prior
3. Aggregate the descriptions into a rectified instruction that precisely matches the image pair
4. Ensure the instruction does not exceed the 77-token limit of the CLIP text encoder
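A minimal sketch of this rectification step, assuming the OpenAI Python client and base64-encoded images; the prompt wording is paraphrased, not the paper's exact prompt:

```python
import base64
from openai import OpenAI

client = OpenAI()

def encode(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def rectify_instruction(original_path: str, edited_path: str) -> str:
    # Paraphrased prompt: describe differences along the four attribute
    # categories from the diffusion prior, then compress them into one
    # concise instruction under the 77-token CLIP limit.
    prompt = (
        "Compare the two images. Describe their differences in terms of "
        "(1) global layout, (2) local object attributes, (3) fine details, "
        "and (4) style. Then summarize the differences as a single concise "
        "editing instruction of at most 77 tokens."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode(original_path)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode(edited_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```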
3. Facilitating Supervision (Contrastive Supervision)¶
Even after instruction rectification, the editing model still struggles to distinguish semantically similar instructions (e.g., "add a cat on the left" vs. "add two cats on the right").
Constructing contrastive instructions: GPT-4o substitutes a single attribute (quantity/position/object) in the rectified instruction to generate a negative instruction \(c^T_{neg}\).
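Continuing the previous sketch, the contrastive step can reuse the same client with a text-only call; the prompt wording is again paraphrased, not the paper's exact prompt:

```python
def make_negative_instruction(rectified: str) -> str:
    # Ask GPT-4o to change exactly one attribute (quantity, position,
    # or object), so the negative stays close in embedding space but
    # differs clearly in meaning.
    prompt = (
        "Rewrite the following editing instruction, changing exactly one "
        "attribute (quantity, position, or object) and nothing else:\n"
        f"{rectified}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```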
Triplet Loss:

\[
\mathcal{L}_{\text{tri}} = \max\big(d(\epsilon_t, \epsilon_{pos}) - d(\epsilon_t, \epsilon_{neg}) + m,\ 0\big)
\]

where \(m\) is the margin, \(\epsilon_t\) is the ground-truth noise, and:

\(\epsilon_{pos} = \epsilon_\theta(\text{concat}(x_t, c^I), t, c^T_{pos})\)

\(\epsilon_{neg} = \epsilon_\theta(\text{concat}(x_t, c^I), t, c^T_{neg})\)
Total Loss:

\[
\mathcal{L} = \mathcal{L}_{\text{train}} + \lambda\,\mathcal{L}_{\text{tri}}
\]

where \(\mathcal{L}_{\text{train}} = d(\epsilon_t, \epsilon_{pos})\) is the standard diffusion loss and \(\lambda\) balances the triplet term.
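Putting the two equations together, a schematic PyTorch training-step sketch; the margin \(m\) and weight \(\lambda\) are placeholder hyperparameters, not the paper's values:

```python
import torch
import torch.nn.functional as F

def superedit_loss(unet, x_t, t, cond_latents, pos_emb, neg_emb,
                   target_noise, margin=1.0, lam=0.5):
    # InstructPix2Pix-style conditioning: concatenate noisy latents
    # with the encoded source image along the channel dimension.
    unet_input = torch.cat([x_t, cond_latents], dim=1)

    # Two UNet forward passes, one per instruction (hence the extra
    # cost noted under Limitations below).
    eps_pos = unet(unet_input, t, encoder_hidden_states=pos_emb).sample
    eps_neg = unet(unet_input, t, encoder_hidden_states=neg_emb).sample

    d_pos = F.mse_loss(eps_pos, target_noise)  # L_train = d(eps_t, eps_pos)
    d_neg = F.mse_loss(eps_neg, target_noise)  # d(eps_t, eps_neg)
    l_tri = F.relu(d_pos - d_neg + margin)     # triplet loss with margin
    return d_pos + lam * l_tri                 # total loss
```

Minimizing the triplet term pulls \(\epsilon_{pos}\) toward the ground-truth noise while pushing \(\epsilon_{neg}\) away from it, teaching the model to separate the two instructions.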
Key Experimental Results¶
Comparison on the Real-Edit Benchmark¶
| Method | Extra Module | Pre-training | Data Size | Model Size | Overall↑ |
|---|---|---|---|---|---|
| SmartEdit | ✓ | ✓ | 1.2M | 14.1B | 3.59 |
| MGIE | ✓ | ✓ | 1.0M | 8.1B | 2.86 |
| InstructPix2Pix | ✗ | ✗ | 300K | 1.1B | 3.31 |
| SuperEdit | ✗ | ✗ | 40K | 1.1B | 3.92 |
SuperEdit surpasses SmartEdit by 9.19% on the Overall metric with 30× less training data and a 13× smaller model.
GPT-4o Automatic Evaluation¶
| Method | Following Acc↑ | Preserving Acc↑ | Quality Acc↑ | Overall Acc↑ |
|---|---|---|---|---|
| SmartEdit | 64% | 66% | 45% | 58.3% |
| SuperEdit | 75% | 72% | 55% | 67.3% |
SuperEdit achieves comprehensive improvements across instruction following, content preservation, and image quality.
Ablation Study¶
| Configuration | Following↑ | Preserving↑ | Quality↑ | Overall↑ |
|---|---|---|---|---|
| Original instructions (baseline) | 52% | 53% | 50% | 51.7% |
| + Rectified instructions | 70% | 68% | 52% | 63.3% |
| + Rectified instructions + contrastive loss | 75% | 72% | 55% | 67.3% |
Instruction rectification contributes roughly 11.6 percentage points of improvement; the contrastive loss adds a further 4 points.
Highlights & Insights¶
- Data-centric rather than model-centric: improving supervision signals rather than scaling up models yields greater gains at minimal cost
- Generality of the diffusion prior: editing models and T2I models share the same stage-wise generation attributes
- Elegant application of contrastive learning: modifying only a single attribute in the instruction keeps the embedding distance between positive and negative samples small while their semantic difference stays large (see the sketch after this list)
- Fully open-sourced (data + model) with strong reproducibility
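To see the "small embedding distance, large semantic difference" point concretely, one can compare CLIP text embeddings of a positive/negative pair; a minimal sketch where the model choice and example pair are illustrative:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

pos = "add a cat on the left"
neg = "add a cat on the right"  # single attribute (position) changed

inputs = processor(text=[pos, neg], return_tensors="pt", padding=True)
with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)

# Cosine similarity is typically very high despite the opposite meaning,
# which is exactly why the triplet loss is needed.
print(f"cosine similarity: {(emb[0] @ emb[1]).item():.3f}")
```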
Limitations & Future Work¶
- Relies on GPT-4o for instruction rectification and contrastive instruction generation, incurring API costs
- Inherent limitations of the InstructPix2Pix architecture (e.g., resolution constraints of SD 1.5)
- The contrastive loss requires an additional UNet forward pass per step
- Integration with stronger base models (SDXL, Flux) remains unexplored
Related Work & Insights¶
- Instruction-based editing: InstructPix2Pix, MagicBrush, SmartEdit
- Editing data construction: Prompt-to-Prompt, EditBench
- Diffusion model alignment: DPO for diffusion, ReFL
Rating¶
| Dimension | Score (1–5) |
|---|---|
| Novelty | 4 |
| Technical Depth | 4 |
| Experimental Thoroughness | 5 |
| Writing Quality | 4 |
| Overall | 4.2 |