X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning¶
Conference: AAAI 2026 | arXiv: 2508.07607 | Code: GitHub | Area: Image Generation | Keywords: image editing, MoE-LoRA, contrastive learning, dataset construction, FLUX, task-aware
TL;DR¶
A 3.7M high-quality editing dataset covering 14 task categories is constructed, and a lightweight (0.9B parameter) plug-and-play editing module based on Task-Aware MoE-LoRA and Contrastive Learning is proposed, achieving performance comparable to 12B fully fine-tuned models.
Background & Motivation¶
State of the Field¶
Background: Open-source image editing models still lag behind closed-source counterparts (e.g., GPT-4o), with high-quality editing datasets remaining a critical bottleneck.
Limitations of Prior Work¶
Existing datasets suffer from three major issues: (1) complex construction pipelines requiring independent design per task category; (2) low editing precision and class imbalance; (3) severe data scarcity for complex tasks (reasoning, camera movement, style transfer).
Root Cause¶
Key Challenge: On the model side, fully fine-tuned models (Step1X-Edit 12B, Kontext 12B) deliver strong performance but at high cost, while lightweight alternatives (ICEdit 0.2B) reduce cost but sacrifice quality.
Paper Goals¶
Goal: achieve high-quality arbitrary-instruction image editing across 14 task categories with only a small parameter budget (≈8% of a 12B fully fine-tuned model).
Method¶
Overall Architecture¶
The method builds on the FLUX.1 DiT architecture, inserting Task-Aware MoE-LoRA modules and a contrastive-learning regularizer. During training, only AlignNet, the task embedding matrix, and the MoE-LoRA parameters are updated; the FLUX.1 backbone stays frozen.
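As a rough illustration of this training setup (not the authors' code), a PyTorch-style sketch that freezes the backbone and trains only the inserted components might look like the following; the substrings `moe_lora`, `task_emb`, and `align_net` are hypothetical parameter-name markers:

```python
import torch

def mark_trainable(model: torch.nn.Module,
                   trainable_keys=("moe_lora", "task_emb", "align_net")):
    """Freeze the backbone; leave only the inserted editing modules trainable.

    `trainable_keys` are assumed substrings identifying the MoE-LoRA experts,
    the task embedding matrix, and AlignNet in the parameter names.
    """
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keys)
    n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {n_train / 1e9:.2f}B")  # ~0.9B reported in the paper
```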
Key Designs¶
1. X2Edit Dataset (3.7M)
    - Four-stage pipeline: source image sampling → VLM-based editing instruction generation → task-specific workflow for edited image generation → comprehensive scoring and filtering
    - Qwen2.5-VL-7B generates instructions directly from images (avoiding caption information loss), with self-reflection verification
    - Step1X-Edit, GPT-4o, BAGEL, and Kontext are leveraged according to task-specific characteristics for data generation
    - Filtering: multi-dimensional evaluation using aesthetic score + LIQE + CLIPIQA + ImgEdit-Judge + Qwen2.5-VL-72B (a filtering sketch appears after this list)
2. Task-Aware MoE-LoRA (a gating sketch appears after this list)
    - A task embedding matrix \(t_{emb} \in \mathbb{R}^{N_t \times c}\) is learned and injected into the gating network to guide expert selection: \(s_i = \text{Softmax}_i(\text{Gate}(\text{Concat}(h^l, t_{emb}^h)))\)
    - Top-K experts are selected and aggregated with a shared expert: \(x_{moe}^l = \sum_{i=1}^{N_e} g_i \cdot \text{Expert}_x^i(h^l) + \text{SharedExpert}_x(h^l)\)
    - Configuration: 12 experts, Top-2 activation, LoRA rank = 64, for a total parameter count of only 0.9B
3. Task-Aware Contrastive Learning (a loss sketch appears after this list)
    - Task labels define positive and negative pairs (same task = positive, cross-task = negative), and an InfoNCE loss is applied to intermediate MMDiT representations: \(\mathcal{L}_{task} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\sum_j \exp(-D_{ij}/\tau) \cdot M_{ij}}{\sum_k \exp(-D_{ik}/\tau)}\)
    - Final objective: \(\mathcal{L} = \mathcal{L}_{task} + \lambda \mathcal{L}_{diff}\), with \(\lambda = 0.2\) and \(\tau = 0.5\)
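For the filtering stage in design 1, a minimal sketch of multi-dimensional score-and-threshold filtering, assuming each metric (aesthetic, LIQE, CLIPIQA, ImgEdit-Judge, Qwen2.5-VL-72B) is wrapped as a callable; the scorer interface and thresholds are illustrative placeholders, not the paper's exact setup:

```python
from typing import Callable, Dict

# Hypothetical scorer interface: (source image, edited image, instruction) -> scalar score.
Scorer = Callable[[str, str, str], float]

def keep_sample(src_path: str, edit_path: str, instruction: str,
                scorers: Dict[str, Scorer],
                thresholds: Dict[str, float]) -> bool:
    """Keep an editing triple only if every quality metric clears its threshold."""
    return all(
        scorer(src_path, edit_path, instruction) >= thresholds[name]
        for name, scorer in scorers.items()
    )

# Example wiring with made-up threshold values:
# scorers = {"aesthetic": aesthetic_score, "liqe": liqe_score, "clipiqa": clipiqa_score,
#            "imgedit_judge": judge_score, "qwen2_5_vl_72b": vlm_consistency_score}
# thresholds = {"aesthetic": 5.0, "liqe": 3.5, "clipiqa": 0.6,
#               "imgedit_judge": 7.0, "qwen2_5_vl_72b": 7.0}
```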
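For design 2, a minimal PyTorch sketch of the task-aware gating and Top-K LoRA-expert aggregation described above; class names, the task-embedding dimension, and the handling of gate weights after Top-K selection are assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAExpert(nn.Module):
    """Low-rank adapter: project down to rank r, then back up to the model dim."""
    def __init__(self, dim: int, rank: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapters start as a zero update

    def forward(self, h):
        return self.up(self.down(h))

class TaskAwareMoELoRA(nn.Module):
    def __init__(self, dim: int, n_tasks: int = 14, task_dim: int = 64,
                 n_experts: int = 12, top_k: int = 2, rank: int = 64):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, task_dim)    # learned task embedding matrix
        self.gate = nn.Linear(dim + task_dim, n_experts)   # gating network
        self.experts = nn.ModuleList([LoRAExpert(dim, rank) for _ in range(n_experts)])
        self.shared_expert = LoRAExpert(dim, rank)
        self.top_k = top_k

    def forward(self, h, task_id):
        # h: (B, L, dim) hidden states of one DiT block; task_id: (B,) task indices
        t = self.task_emb(task_id).unsqueeze(1).expand(-1, h.size(1), -1)
        s = F.softmax(self.gate(torch.cat([h, t], dim=-1)), dim=-1)  # expert scores s_i
        g, idx = s.topk(self.top_k, dim=-1)                          # Top-K routing
        out = self.shared_expert(h)
        for i, expert in enumerate(self.experts):
            g_i = (g * (idx == i).float()).sum(-1, keepdim=True)     # 0 if expert i not selected
            out = out + g_i * expert(h)
        return out
```

The loop visits all 12 experts for clarity; an efficient implementation would dispatch each token only to its two selected experts.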
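For design 3, a sketch of the task-aware InfoNCE term and the combined objective with the reported \(\tau = 0.5\) and \(\lambda = 0.2\); the feature pooling, the Euclidean distance \(D_{ij}\), and the handling of samples with no in-batch positive are assumptions:

```python
import torch

def task_contrastive_loss(feats: torch.Tensor, task_ids: torch.Tensor,
                          tau: float = 0.5) -> torch.Tensor:
    """InfoNCE-style loss that pulls same-task features together.

    feats:    (N, d) pooled intermediate MMDiT representations for a batch
    task_ids: (N,)   task labels defining positive (same-task) pairs
    """
    dist = torch.cdist(feats, feats, p=2)                   # pairwise distances D_ij
    sim = torch.exp(-dist / tau)
    pos = (task_ids[:, None] == task_ids[None, :]).float()  # mask M_ij
    pos.fill_diagonal_(0)                                   # assume self-pairs are not positives
    num = (sim * pos).sum(dim=1)
    denom = sim.sum(dim=1)                                  # sum over all pairs, as in the formula
    valid = pos.sum(dim=1) > 0                              # skip samples with no in-batch positive
    return -torch.log(num[valid] / denom[valid]).mean()

def total_loss(l_diff: torch.Tensor, feats: torch.Tensor, task_ids: torch.Tensor,
               lam: float = 0.2) -> torch.Tensor:
    """Combined objective L = L_task + lambda * L_diff, as stated above."""
    return task_contrastive_loss(feats, task_ids) + lam * l_diff
```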
Key Experimental Results¶
Main Results¶
| Method | Params | GEdit-Bench++ (EN) IJ | G_VIE | ImgEdit-Bench IJ |
|---|---|---|---|---|
| GPT-4o | - | 9.003 | 7.848 | 8.202 |
| Kontext | 12B | 8.408 | 5.712 | 8.149 |
| Bagel | 7B+7B | 8.326 | 5.722 | 7.925 |
| Step1X-Edit | 12B | 8.017 | 5.108 | 7.653 |
| ICEdit | 0.2B | 7.203 | 4.109 | 7.615 |
| X2Edit | 0.9B | 8.334 | 5.550 | 8.025 |
- DreamBench subject-driven: DINO 0.822 (tied best with Kontext), CLIP-T 0.326
- Plug-and-play: seamlessly compatible with various FLUX.1 community variants and LoRAs (Krea-dev, PixelWave, Ghibli, etc.)
- User study (4 annotators, 1.3k pairs): overall score of 2.432, placing it in the upper-middle tier
- Ablation: Task-Aware MoE shows significant improvement over vanilla MoE; applying contrastive loss across all MMDiT layers yields the best results
Highlights & Insights¶
- The data construction pipeline is unified and reproducible: VLM-generated instructions + multi-model division of labor + multi-dimensional filtering, covering 14 categories at 3.7M scale
- First application of contrastive learning in arbitrary-instruction image editing, promoting inter-task representation disentanglement
- Exceptional parameter efficiency: 0.9B parameters match 12B fully fine-tuned models while supporting plug-and-play deployment
- "Narrow-yet-numerous" expert strategy (12 experts, rank=64) outperforms configurations with fewer experts and larger ranks
Limitations & Future Work¶
- Weak performance on non-English text editing tasks (constrained by the FLUX.1 base model)
- The user study involves only 4 annotators, so its statistical reliability is limited
- Complex reasoning and camera movement tasks rely on GPT-4o-generated data, limiting open-source reproducibility
- Notable performance gap compared to Kontext and Bagel on KontextBench
- No sensitivity analysis is provided for the contrastive temperature \(\tau\) or the loss weight \(\lambda\)
Related Work & Insights¶
- vs ICEdit (0.2B): Both adopt a FLUX LoRA approach, but X2Edit introduces task-aware routing and contrastive learning, achieving comprehensive improvements
- vs Kontext/Bagel (12–14B): Fully fine-tuned methods with slightly superior performance but tens of times higher training cost; X2Edit achieves comparable performance with 8% of the parameters
- vs AnyEdit: X2Edit substantially outperforms in both data quality and model performance (AnyEdit VIE score of only 2.2 vs. X2Edit's 5.5)
- vs Step1X-Edit (12B): Full DiT fine-tuning approach; X2Edit matches or exceeds it on most metrics
Broader Insights¶
- The task embedding + MoE gating design is generalizable to other multi-task generation scenarios (video editing, 3D generation)
- Applying contrastive learning in the diffusion hidden space is a promising new direction worth further exploration
- The data construction pipeline of "VLM-generated instructions + multi-model image generation + multi-dimensional filtering" is broadly applicable
- The plug-and-play property makes it particularly suitable for community ecosystems, offering significant commercial value
Rating¶
- Novelty: ⭐⭐⭐⭐ — First application of task-aware contrastive learning in image editing
- Experimental Thoroughness: ⭐⭐⭐⭐ — 4 benchmarks + DreamBench + plug-and-play + ablation, though user study is limited
- Writing Quality: ⭐⭐⭐½ — Comprehensive content but slightly verbose structure
- Value: ⭐⭐⭐⭐ — Both dataset and model are open-sourced, representing a significant community contribution