Visual Autoregressive Modeling for Instruction-Guided Image Editing

Conference: ICLR 2026
arXiv: 2508.15772
Code: GitHub
Area: Image Generation
Keywords: Image Editing, Visual Autoregressive, Multi-Scale Prediction, Instruction-Guided, Scale Alignment

TL;DR

This paper proposes VAREdit, which reformulates instruction-guided image editing as a multi-scale prediction problem. By introducing a Scale-Aligned Reference module to address the scale mismatch in finest-scale conditioning, VAREdit significantly outperforms diffusion-based methods in both editing fidelity and inference efficiency.

Background & Motivation

Background: Instruction-guided image editing has been dominated by diffusion models (e.g., InstructPix2Pix), which concatenate source and target images channel-wise for joint denoising.
Limitations of Prior Work: The global denoising process of diffusion models inherently couples edited regions with the entire image, leading to: (1) spurious modifications in non-edited regions ("bleeding"); (2) insufficient adherence to editing instructions; and (3) high computational cost due to multi-step iterative denoising.
Key Challenge: The strength of diffusion models—global consistency modeling—is precisely their weakness for editing tasks, which require precise local modification while globally preserving unedited regions. The causal and compositional nature of autoregressive models is naturally suited for editing, yet the VAR paradigm has not been explored in this context.
Goal: To introduce the multi-scale prediction paradigm of Visual Autoregressive (VAR) modeling into instruction-guided image editing.
Key Insight: The core challenge in VAR-based editing lies in the source image conditioning strategy—full-scale conditioning is expensive (\(O(n^2)\)), while finest-scale conditioning is efficient but suffers from scale mismatch. Analysis of attention heatmaps reveals that the mismatch affects only the first self-attention layer.
Core Idea: Inject scale-aligned reference features exclusively into the first self-attention layer, while relying on finest-scale conditioning for all subsequent layers, thereby balancing efficiency and editing quality.

Method

Overall Architecture

VAREdit is built upon the pretrained Infinity model and formulates image editing as conditional multi-scale prediction: given a source image \(\mathbf{I}^{(src)}\) and a text instruction \(\mathbf{t}\), the model autoregressively generates the K-level residual maps \(\mathbf{R}_{1:K}^{(tgt)}\) of the target image. The finest-scale source feature \(\mathbf{F}_K^{(src)}\) serves as the primary condition, with scale-aligned reference features additionally injected into the first self-attention layer. The model is trained on 3.92M paired samples.

Key Designs

1. Scale-Aligned Reference (SAR) Module

Function: Resolves scale mismatch while preserving the efficiency of finest-scale conditioning.
Mechanism: Scale-aligned reference features \(\mathbf{F}_k^{(ref)} = \text{Down}(\mathbf{F}_K^{(src)}, (h_k, w_k))\) are generated by downsampling the finest-scale features to match the spatial dimensions of each target scale. In the first self-attention layer, queries at target scale \(k\) attend to both the scale-aligned reference features and the previously generated target history. SAR is applied only in the first layer; subsequent layers retain the finest-scale condition exclusively.
Design Motivation: Attention heatmap analysis reveals that the first layer is responsible for establishing global layout and long-range dependencies (requiring scale-matched conditioning), while deeper layers handle local refinement (for which finest-scale conditioning suffices).
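
The SAR mechanism can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `Down` is taken to be average pooling with integer factors, attention is single-head with identity projections, and plain Python lists stand in for tensors so the block has no dependencies. The names `downsample_area`, `attend`, and `sar_first_layer` are hypothetical.

```python
import math

def downsample_area(feat, out_hw):
    """Average-pool an H x W grid of C-dim vectors to h x w (assumes H % h == 0, W % w == 0)."""
    H, W, C = len(feat), len(feat[0]), len(feat[0][0])
    h, w = out_hw
    if H % h or W % w:
        raise ValueError("this sketch only handles integer pooling factors")
    fh, fw = H // h, W // w
    pooled = []
    for i in range(h):
        row = []
        for j in range(w):
            block = [feat[i * fh + a][j * fw + b] for a in range(fh) for b in range(fw)]
            row.append([sum(v[c] for v in block) / len(block) for c in range(C)])
        pooled.append(row)
    return pooled

def attend(queries, keys_values):
    """Single-head softmax attention with identity projections (keys == values)."""
    scale = math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, kv)) / scale for kv in keys_values]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        outputs.append([sum(wi * kv[c] for wi, kv in zip(w, keys_values))
                        for c in range(len(q))])
    return outputs, weights

def sar_first_layer(q_k, f_src_finest, tgt_history, hw_k):
    """Scale-k queries attend to [F_k^(ref) ; previously generated target tokens]."""
    f_ref = [vec for row in downsample_area(f_src_finest, hw_k) for vec in row]
    return attend(q_k, f_ref + tgt_history)

# Toy check: a 4x4 single-channel "finest-scale" map, queried at a 2x2 target scale.
toy = [[[float(4 * r + c)] for c in range(4)] for r in range(4)]
pooled = downsample_area(toy, (2, 2))
queries = [[1.0], [0.0]]
history = [[3.0]]  # stand-in for one earlier target token
out, attn = sar_first_layer(queries, toy, history, (2, 2))
print(pooled)
print(len(attn[0]))  # 4 reference tokens + 1 history token
```

In deeper layers the same queries would instead see the full-resolution source tokens, which is why the reference tokens here appear only in the first-layer call.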

2. Finest-Scale Conditioning Strategy

Function: Substantially reduces computational overhead.
Mechanism: Only the finest-scale (highest-resolution) features \(\mathbf{F}_K^{(src)}\) are prepended to the target sequence, rather than features from all K scales. This shortens the sequence relative to full-scale conditioning, which matters because the cost of self-attention grows as \(O(n^2)\) in the sequence length \(n\).
Design Motivation: The finest scale contains the richest high-frequency detail information and is most critical for editing guidance.
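
A rough token count shows where the saving comes from. The scale schedule below (1, 2, 4, ..., 32 tokens per side) is a hypothetical example chosen for easy arithmetic, not the schedule used by Infinity or VAREdit, and the squared ratio is only a proxy for attention cost.

```python
def token_counts(scales):
    """Number of tokens at each scale, given a list of (h, w) sizes."""
    return [h * w for h, w in scales]

def sequence_lengths(scales):
    """Return (full-scale length, finest-scale length) for source + target tokens."""
    per_scale = token_counts(scales)
    target = sum(per_scale)               # target tokens across all K scales
    full = sum(per_scale) + target        # source features from all K scales prepended
    finest = per_scale[-1] + target       # only F_K^(src) prepended
    return full, finest

# Hypothetical schedule: square maps with 1, 2, 4, 8, 16, 32 tokens per side.
scales = [(s, s) for s in (1, 2, 4, 8, 16, 32)]
full, finest = sequence_lengths(scales)
attention_cost_ratio = (full / finest) ** 2   # quadratic-cost proxy

print(full, finest)  # 2730 2389
print(round(attention_cost_ratio, 2))
```

The gap widens with denser schedules, because every extra intermediate scale adds source tokens under full-scale conditioning but none under finest-scale conditioning.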

3. Autoregressive Multi-Scale Editing Formulation

Function: Leverages the causal compositionality of VAR for precise editing.
Mechanism: Editing is decomposed into residual predictions across K scales: \(p(\mathbf{R}_{1:K}^{(tgt)}|\mathbf{I}^{(src)}, \mathbf{t}) = \prod_{k=1}^K p(\mathbf{R}_k^{(tgt)}|\mathbf{F}_{1:k-1}^{(tgt)}, \mathbf{F}_K^{(src)}, \mathbf{t})\). Text instructions are incorporated via cross-attention. 2D-RoPE is used to distinguish source and target tokens.
Design Motivation: Autoregressive generation naturally supports the decoupling of "preserving unchanged regions" and "precisely modifying edited regions."
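
To make the factorization concrete, here it is written out for a hypothetical \(K = 3\). The first factor has no target history, so it conditions only on the source features and the instruction.

\[
p(\mathbf{R}_{1:3}^{(tgt)} \mid \mathbf{I}^{(src)}, \mathbf{t}) = p(\mathbf{R}_1^{(tgt)} \mid \mathbf{F}_3^{(src)}, \mathbf{t}) \; p(\mathbf{R}_2^{(tgt)} \mid \mathbf{F}_1^{(tgt)}, \mathbf{F}_3^{(src)}, \mathbf{t}) \; p(\mathbf{R}_3^{(tgt)} \mid \mathbf{F}_{1:2}^{(tgt)}, \mathbf{F}_3^{(src)}, \mathbf{t})
\]

Taking the negative logarithm of the general product gives a sum of per-scale terms. This is the generic likelihood objective implied by the factorization and is not necessarily the exact form of the bitwise loss used in training.

\[
-\log p(\mathbf{R}_{1:K}^{(tgt)} \mid \mathbf{I}^{(src)}, \mathbf{t}) = -\sum_{k=1}^{K} \log p(\mathbf{R}_k^{(tgt)} \mid \mathbf{F}_{1:k-1}^{(tgt)}, \mathbf{F}_K^{(src)}, \mathbf{t})
\]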

Loss & Training

A bitwise classifier loss is used to optimize index prediction of target residual tokens, following the Infinity training scheme. VAREdit-2B undergoes two-stage training: 8k steps at \(256^2\) resolution followed by 7k steps at \(512^2\). VAREdit-8B is trained directly at \(512^2\) for 60k steps. Inference uses CFG strength \(\eta=4\) and logits temperature \(\tau=0.5\).
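
The inference settings can be illustrated with a small sampler. This is a sketch under assumptions rather than the released code: each bit is treated as a two-class prediction with logits `(logit_0, logit_1)`, guidance is assumed to take the common form `uncond + eta * (cond - uncond)` (other implementations use a different convention), and the function names are hypothetical.

```python
import math
import random

def guided_bit_prob(cond_logits, uncond_logits, eta=4.0, tau=0.5):
    """P(bit = 1) for one bit after guidance and temperature scaling."""
    # Assumed CFG form: push the conditional logits away from the unconditional ones.
    guided = [u + eta * (c - u) for c, u in zip(cond_logits, uncond_logits)]
    scaled = [g / tau for g in guided]          # tau < 1 sharpens the distribution
    top = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    return exps[1] / sum(exps)

def sample_bits(cond, uncond, rng, eta=4.0, tau=0.5):
    """Sample each bit independently from its guided, temperature-scaled distribution."""
    return [int(rng.random() < guided_bit_prob(c, u, eta, tau))
            for c, u in zip(cond, uncond)]

rng = random.Random(0)
cond = [(0.0, 1.0), (0.5, -0.5), (0.0, 0.0)]    # toy conditional logits for three bits
uncond = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]   # toy unconditional logits
bits = sample_bits(cond, uncond, rng)
print(bits)
```

With \(\eta=4\) and \(\tau=0.5\), a conditional logit gap of 1 becomes an effective gap of 8, so sampled bits follow the conditional prediction almost deterministically.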

Key Experimental Results

Main Results

Quantitative comparison on EMU-Edit and PIE-Bench:

Method	Params	GPT-Balance (EMU)↑	GPT-Balance (PIE)↑	Time
InstructPix2Pix	1.1B	2.923	4.034	3.5s
UltraEdit	7.7B	4.541	5.580	2.6s
ICEdit	17B	4.785	4.933	8.4s
VAREdit-2B	2.2B	5.662	6.996	0.7s
VAREdit-8B	8.4B	7.892	8.105	1.2s
Step1X-Edit	21B	7.081	7.351	12.8s

For \(512\times512\) editing, VAREdit-8B requires only 1.2 seconds, about 2.2× faster than the similarly sized UltraEdit (2.6 s).

Ablation Study

Conditioning Strategy	CLIP-Out.↑	GPT-Suc.↑	GPT-Over.↑	GPT-Bal.↑
Full-scale conditioning	0.275	5.781	7.087	5.346
Finest-scale conditioning	0.264	4.926	7.077	4.584
Finest-scale + SAR (layer 1)	0.271	6.210	7.055	5.662
SAR in all layers	0.269	5.884	7.036	5.352
SAR in first 3 layers	0.269	5.894	7.048	5.297

Key Findings

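Using the main results table, VAREdit-8B's GPT-Balance score is 64.9% higher than ICEdit's on EMU-Edit (7.892 vs. 4.785) and 45.3% higher than UltraEdit's on PIE-Bench (8.105 vs. 5.580); these are the best-scoring baselines among InstructPix2Pix, UltraEdit, and ICEdit on each benchmark.
Applying SAR exclusively to the first layer yields the best GPT-Bal. (5.662). Applying it to all layers or to the first three layers lowers GPT-Bal. to 5.352 and 5.297 respectively, which is consistent with the attention analysis.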
On the subset of most successful edits (GPT-Suc. ≥ 9), VAREdit's region preservation score even surpasses OmniGen, demonstrating genuine preservation rather than conservative inaction.
Among open-source models, VAREdit-8B outperforms Step1X-Edit (21B) and FLUX.1 Kontext (12B) on GPT-Balance despite having fewer parameters.
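
The relative gains and the speedup quoted in this section can be reproduced from the main results table. The short check below uses only numbers from that table; `relative_gain` is a hypothetical helper. Note that the 45.3% PIE-Bench figure corresponds to UltraEdit as the baseline, while the gain over ICEdit on PIE-Bench is larger.

```python
def relative_gain(ours, baseline):
    """Percentage by which `ours` exceeds `baseline`."""
    return 100.0 * (ours - baseline) / baseline

# GPT-Balance values and runtimes copied from the main results table.
emu_vs_icedit = relative_gain(7.892, 4.785)
pie_vs_ultraedit = relative_gain(8.105, 5.580)
pie_vs_icedit = relative_gain(8.105, 4.933)
speedup_vs_ultraedit = 2.6 / 1.2   # seconds per 512x512 edit

print(f"EMU-Edit vs ICEdit:     {emu_vs_icedit:.1f}%")
print(f"PIE-Bench vs UltraEdit: {pie_vs_ultraedit:.1f}%")
print(f"PIE-Bench vs ICEdit:    {pie_vs_icedit:.1f}%")
print(f"Speedup vs UltraEdit:   {speedup_vs_ultraedit:.1f}x")
```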

Highlights & Insights

Paradigm Innovation: This is the first work to successfully apply VAR's multi-scale prediction to instruction-guided editing, breaking the diffusion model dominance in this field.
Analysis-Driven Design: The SAR module is motivated entirely by systematic analysis of attention heatmaps rather than intuition.
Efficiency-Quality Trade-off: The 2.2B model completes \(512^2\) editing in 0.7 seconds while surpassing ICEdit (17B) in quality.
Natural Advantage of Autoregressive Modeling for Editing: The causal generation mechanism inherently supports region-selective modification.

Limitations & Future Work

Reliance on a discrete visual tokenizer limits editing quality to the tokenizer's reconstruction capacity.
The largest model is currently 8B; the effects of further scaling remain to be investigated.
Interactive and multi-turn conversational editing are not yet supported.
Integration with mask-guided approaches could further improve region control precision.

Related Work & Insights

VAR (Tian et al., 2024) and Infinity establish the foundation for multi-scale autoregressive generation; VAREdit extends this to editing.
InstructPix2Pix establishes the standard paradigm for instruction-guided editing, but its limitations motivate this work's exploration of the autoregressive approach.
Key insight: Rather than patching the shortcomings of diffusion models, transitioning to an autoregressive paradigm may represent a more fundamental solution.

Rating

Novelty: ⭐⭐⭐⭐⭐ First application of the VAR paradigm to image editing, with data-driven support for the SAR design.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four benchmarks, multiple metrics, comparisons with frontier models.
Writing Quality: ⭐⭐⭐⭐ Clear analysis with a complete logical chain from problem to solution.
Value: ⭐⭐⭐⭐⭐ Substantially surpasses state-of-the-art and establishes a new editing paradigm.