Visual Autoregressive Modeling for Instruction-Guided Image Editing¶
Conference: ICLR 2026 arXiv: 2508.15772 Code: https://github.com/HiDream-ai/VAREdit Area: Human Understanding Keywords: Visual Autoregressive, Image Editing, Multi-Scale Prediction, Scale-Aligned Reference, Instruction-Guided
TL;DR¶
VAREdit reformulates instruction-guided image editing as a next-scale prediction problem. It proposes the Scale-Aligned Reference (SAR) module to resolve the scale mismatch between finest-scale conditioning and coarse target features. On EMU-Edit and PIE-Bench, the GPT-Balance score surpasses the strongest diffusion baseline by 64.9% and 45.3%, respectively, with 512×512 editing completed in only 1.2 seconds.
Background & Motivation¶
Background: Instruction-guided image editing is dominated by diffusion-based methods (InstructPix2Pix, UltraEdit, etc.), which train by channel-concatenating source and target images. AnySD and OmniGen extend this further, but all remain constrained by inherent limitations of the diffusion paradigm.
Limitations of Prior Work: (1) The global denoising process of diffusion models entangles edited regions with the full image context, causing spurious modifications; (2) Multi-step denoising is computationally expensive, limiting real-time applications; (3) Early AR-based editing attempts (training-free) lack task-specific knowledge and fall significantly behind diffusion methods.
Key Challenge: The causal compositional mechanism of AR models is naturally suited for editing (preserving unchanged regions while precisely modifying edited regions), yet adapting VAR's multi-scale generation to editing tasks introduces scale mismatch challenges.
Goal: How to effectively apply the VAR paradigm to instruction-guided image editing, and how to resolve the scale mismatch in source image conditioning?
Key Insight: Through systematic analysis of attention patterns in a full-scale model, the authors find that the first self-attention layer requires scale-aligned references, while subsequent layers only need the finest-scale condition.
Core Idea: Scale-aligned source references (generated by downsampling the finest-scale features) are injected only into the first self-attention layer; all remaining layers use the finest-scale condition, thereby balancing global layout and local detail.
Method¶
Overall Architecture¶
VAREdit is built upon the pretrained VAR model Infinity. The source image is processed by a shared VQ encoder to obtain multi-scale residual features, the text instruction is mapped to token embeddings by a text encoder, and the model autoregressively generates the target residuals \(\mathbf{R}_{1:K}^{(tgt)}\) scale by scale.
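In equation form, the next-scale factorization reads as follows (a sketch using the notation above; the conditioning on source features \(\mathbf{F}^{(src)}\) and instruction embedding \(\mathbf{c}\) is written generically here):

\[
p\left(\mathbf{R}_{1:K}^{(tgt)} \mid \mathbf{F}^{(src)}, \mathbf{c}\right) = \prod_{k=1}^{K} p\left(\mathbf{R}_k^{(tgt)} \mid \mathbf{R}_{1:k-1}^{(tgt)}, \mathbf{F}^{(src)}, \mathbf{c}\right)
\]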
Key Designs¶
- Analysis of Source Image Conditioning Strategies:
- Full-scale conditioning: prepend source tokens at all scales → quadratic complexity \(O(n^2)\), potentially introducing redundancy.
- Finest-scale conditioning: use only \(\mathbf{F}_K^{(src)}\) → computationally efficient but suffers from scale mismatch (fine-grained information cannot guide coarse-scale predictions).
- Scale Dependency Analysis (Key Findings):
- Self-attention heatmaps are analyzed on a model trained with full-scale conditioning.
- First layer: Attention is broadly distributed, focusing on the corresponding and all coarser source scales → responsible for global layout.
- Subsequent layers: Attention is highly localized, exhibiting a diagonal structure → responsible for local refinement.
- Conclusion: The first layer requires scale-aligned references; subsequent layers only need the finest scale.
- Scale-Aligned Reference (SAR) Module:
- Scale-specific references are generated by downsampling the finest-scale features: \(\mathbf{F}_k^{(ref)} = \text{Down}(\mathbf{F}_K^{(src)}, (h_k, w_k))\)
- Only in the first self-attention layer is the corresponding scale reference concatenated when predicting scale \(k\) (see the sketch after this list).
- All remaining layers continue to use finest-scale conditioning combined with causal target history.
- Text Conditioning:
- Text is encoded into token embeddings; the pooled representation serves as \(\tilde{\mathbf{F}}_0^{(tgt)}\) (the start token).
- Text token embeddings are used as key/value matrices in cross-attention.
- Source tokens are distinguished from target tokens via a 2D-RoPE positional offset \(\Delta=(64,64)\).
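A minimal PyTorch-style sketch of the SAR idea described above. The interpolation mode, the block interface, and the helper names are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def scale_aligned_reference(finest_src: torch.Tensor, h_k: int, w_k: int) -> torch.Tensor:
    """Build F_k^(ref) = Down(F_K^(src), (h_k, w_k)) as a token sequence.

    finest_src: (B, C, H_K, W_K) finest-scale source features.
    Returns: (B, h_k * w_k, C) scale-aligned reference tokens.
    """
    ref = F.interpolate(finest_src, size=(h_k, w_k), mode="area")
    return ref.flatten(2).transpose(1, 2)

def predict_scale_k(blocks, finest_src, target_history, h_k, w_k):
    """Autoregressive step for scale k.

    Only the first transformer block attends to the scale-aligned reference;
    all later blocks attend to the finest-scale condition plus the causal
    target history. (The 2D-RoPE offset of (64, 64) on source tokens is omitted.)
    """
    ref_k = scale_aligned_reference(finest_src, h_k, w_k)      # scale-aligned tokens
    finest_tokens = finest_src.flatten(2).transpose(1, 2)      # finest-scale condition

    x = target_history                                         # (B, N_hist, C)
    for i, block in enumerate(blocks):
        context = ref_k if i == 0 else finest_tokens           # SAR: first layer only
        n_ctx = context.shape[1]
        x = block(torch.cat([context, x], dim=1))[:, n_ctx:]   # keep target positions
    return x
```

The point of the sketch is the branch on `i == 0`: coarse target scales receive a layout-aligned reference in the first layer, while all later layers keep the finest-scale condition for detail refinement.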
Loss & Training¶
- Initialized from Infinity pretrained weights.
- VAREdit-2B: two-stage training at 256×256 (8k iterations) → 512×512 (7k iterations).
- VAREdit-8B: trained directly at 512×512 for 60k iterations.
- Optimizes a bitwise classifier loss.
- Inference: CFG \(\eta=4\), logits temperature \(\tau=0.5\).
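As a hedged illustration of the reported inference settings, a generic classifier-free-guidance step with temperature sampling might look like the following. The guidance formula and softmax-over-vocabulary sampling are common-practice assumptions; Infinity's actual bit-level prediction head differs:

```python
import torch

def guided_sample(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
                  eta: float = 4.0, tau: float = 0.5) -> torch.Tensor:
    """Classifier-free guidance on logits followed by temperature sampling.

    cond_logits / uncond_logits: (B, N, V) logits with and without the instruction.
    eta: guidance scale (the paper reports eta = 4).
    tau: logits temperature (the paper reports tau = 0.5).
    """
    # Common CFG form: push conditional logits away from the unconditional ones.
    logits = uncond_logits + eta * (cond_logits - uncond_logits)
    probs = torch.softmax(logits / tau, dim=-1)
    tokens = torch.multinomial(probs.view(-1, probs.size(-1)), num_samples=1)
    return tokens.view(probs.shape[:-1])
```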
Key Experimental Results¶
EMU-Edit and PIE-Bench¶
| Method | Size | EMU-Edit GPT-Bal. | PIE-Bench GPT-Bal. | Time |
|---|---|---|---|---|
| InstructPix2Pix | 1.1B | 2.923 | 4.034 | 3.5s |
| UltraEdit | 7.7B | 4.541 | 5.580 | 2.6s |
| OmniGen | 3.8B | 4.666 | 3.498 | 16.5s |
| ICEdit | — | 4.786 | — | — |
| VAREdit-2B | 2B | 7.074 | 7.609 | 0.7s |
| VAREdit-8B | 8B | 7.892 | 8.105 | 1.2s |
Comparison with State-of-the-Art Methods¶
| Method | Size | EMU-Edit GPT-Bal. | PIE-Bench GPT-Bal. |
|---|---|---|---|
| GPT-4o-Image | — | 8.549 | 8.616 |
| Step1X-Edit | — | 7.378 | 7.488 |
| Qwen-Image-Edit | 20B | 8.087 | 8.272 |
| VAREdit-8B | 8B | 7.892 | 8.105 |
Key Findings¶
- On GPT-Balance, VAREdit-8B surpasses the strongest prior method on each benchmark: ICEdit by 64.9% on EMU-Edit and UltraEdit by 45.3% on PIE-Bench (see the check after this list).
- VAREdit-2B completes editing in only 0.7 seconds, 3.7× faster than UltraEdit.
- On the high-quality editing subset where GPT-Suc.≥9, VAREdit also achieves the highest GPT-Over., demonstrating genuinely accurate editing with strong preservation.
- SAR ablation: removing SAR leads to a significant drop in GPT-Bal., validating the necessity of scale alignment.
- Among open-source methods, VAREdit-8B ranks second only to Qwen-Image-Edit (20B), which has a model 2.5× larger.
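The relative gains quoted above can be checked directly from the benchmark tables:

\[
\frac{7.892 - 4.786}{4.786} \approx 64.9\% \ \text{(EMU-Edit, vs. ICEdit)}, \qquad
\frac{8.105 - 5.580}{5.580} \approx 45.3\% \ \text{(PIE-Bench, vs. UltraEdit)},
\]

and \(2.6\,\mathrm{s} / 0.7\,\mathrm{s} \approx 3.7\times\) for the speedup of VAREdit-2B over UltraEdit.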
Highlights & Insights¶
- Paradigm Shift: This is the first successful application of VAR's next-scale prediction to image editing, demonstrating that AR has a fundamental advantage over diffusion in editing tasks.
- Attention-Analysis-Driven Design: SAR is not designed intuitively but derived from systematic analysis of attention patterns in the full-scale model — the methodology is worth adopting broadly.
- Significant Efficiency Advantage: The single-pass generation of AR is inherently faster than diffusion's multi-step denoising, achieving high-quality editing in 1.2 seconds.
- Importance of the GPT-Balance Metric: Reveals that methods like OmniGen achieve high GPT-Over. through a "no-edit" strategy; GPT-Bal. provides a more comprehensive evaluation.
Limitations & Future Work¶
- Quality is bounded by the VQ tokenizer's reconstruction fidelity; fine-grained textures may suffer degradation.
- The current maximum resolution is 512×512; extension to higher resolutions remains to be explored.
- The downsampling operation in SAR may discard important spatial information.
- Generalizing SAR to spatiotemporal multi-scale scenarios for video editing is a promising future direction.
Related Work & Insights¶
- vs. InstructPix2Pix paradigm: The fundamental issue with channel-concatenation + diffusion is entanglement caused by global denoising; VAREdit's causal mechanism naturally avoids this.
- vs. EditAR: EditAR follows vanilla next-token prediction, which risks structural degradation; VAREdit uses next-scale prediction instead.
- vs. Infinity: VAREdit inherits Infinity's multi-scale residual quantizer and bit-level classifier head, adapting them for editing tasks.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First application of VAR to editing, with SAR design derived from deep analysis — highly innovative.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four benchmarks, diverse comparisons, ablations, and efficiency analysis are comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ Logic flows tightly from motivation to analysis to design; attention heatmap visualizations are excellent.
- Value: ⭐⭐⭐⭐⭐ Opens a new direction for VAR-based editing with dual advantages in performance and efficiency.