Visual Autoregressive Modeling for Instruction-Guided Image Editing

Conference: ICLR 2026 arXiv: 2508.15772 Code: https://github.com/HiDream-ai/VAREdit Area: Human Understanding Keywords: Visual Autoregressive, Image Editing, Multi-Scale Prediction, Scale-Aligned Reference, Instruction-Guided

TL;DR

VAREdit reformulates instruction-guided image editing as a next-scale prediction problem. It proposes the Scale-Aligned Reference (SAR) module to resolve the scale mismatch between finest-scale conditioning and coarse target features. On EMU-Edit and PIE-Bench, the GPT-Balance score surpasses the strongest diffusion baseline by 64.9% and 45.3%, respectively, with 512×512 editing completed in only 1.2 seconds.

Background & Motivation

Background: Instruction-guided image editing is dominated by diffusion-based methods (InstructPix2Pix, UltraEdit, etc.), which condition on the source image by concatenating it with the noisy target latent along the channel dimension. AnySD and OmniGen extend this line further, but all remain constrained by inherent limitations of the diffusion paradigm.

Limitations of Prior Work: (1) The global denoising process of diffusion models entangles edited regions with the full image context, causing spurious modifications; (2) Multi-step denoising is computationally expensive, limiting real-time applications; (3) Early AR-based editing attempts (training-free) lack task-specific knowledge and fall significantly behind diffusion methods.

Key Challenge: The causal compositional mechanism of AR models is naturally suited for editing (preserving unchanged regions while precisely modifying edited regions), yet adapting VAR's multi-scale generation to editing tasks introduces scale mismatch challenges.

Goal: How to effectively apply the VAR paradigm to instruction-guided image editing, and how to resolve the scale mismatch in source image conditioning?

Key Insight: Through systematic analysis of attention patterns in a full-scale model, the authors find that the first self-attention layer requires scale-aligned references, while subsequent layers only need the finest-scale condition.

Core Idea: Scale-aligned source references (generated by downsampling the finest-scale features) are injected only into the first self-attention layer; all remaining layers use the finest-scale condition, thereby balancing global layout and local detail.

Method

Overall Architecture

Built upon a pretrained VAR model (Infinity), the source image is processed through a shared VQ encoder to obtain multi-scale residuals, and the text instruction is mapped via an encoder. The model autoregressively generates target residuals \(\mathbf{R}_{1:K}^{(tgt)}\):

\[p(\mathbf{R}_{1:K}^{(tgt)} | \mathbf{I}^{(src)}, \mathbf{t}) = \prod_{k=1}^K p(\mathbf{R}_k^{(tgt)} | \mathbf{F}_{1:k-1}^{(tgt)}, \mathbf{F}_K^{(src)}, \mathbf{t})\]
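
To make the factorization concrete, here is a minimal Python sketch of the next-scale generation loop; `predict_residual`, the tensor shapes, and the bilinear accumulation are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of the next-scale factorization above (illustrative, not the authors' code).
# `predict_residual` stands in for the VAR transformer; quantization/decoding are omitted.
import torch
import torch.nn.functional as F

def generate_target(predict_residual, src_feat_finest, text_emb, scales):
    """Autoregressively predict target residuals R_1..R_K over scales [(h_1, w_1), ..., (h_K, w_K)]."""
    b, c, h_K, w_K = src_feat_finest.shape
    acc = torch.zeros(b, c, h_K, w_K)   # running target feature map at the finest resolution
    history = []                        # F_{1:k-1}^{(tgt)}, the causal target context
    for (h_k, w_k) in scales:
        # One forward pass per scale: p(R_k | F_{1:k-1}^{(tgt)}, F_K^{(src)}, t)
        r_k = predict_residual(history, src_feat_finest, text_emb, size=(h_k, w_k))
        history.append(r_k)
        # Accumulate the residual at the finest resolution (VAR-style multi-scale residuals)
        acc = acc + F.interpolate(r_k, size=(h_K, w_K), mode="bilinear", align_corners=False)
    return acc  # the VQ decoder (omitted) maps this back to the edited image
```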

Key Designs

  1. Analysis of Source Image Conditioning Strategies:

    • Full-scale conditioning: prepend source tokens at all scales → roughly doubles the sequence length, inflating the quadratic \(O(n^2)\) self-attention cost and potentially introducing redundant coarse-scale information.
    • Finest-scale conditioning: use only \(\mathbf{F}_K^{(src)}\) → computationally efficient but suffers from scale mismatch (fine-grained information cannot guide coarse-scale predictions).
  2. Scale Dependency Analysis (Key Findings):

    • Self-attention heatmaps are analyzed on a full-scale trained model.
    • First layer: Attention spreads across the matching-scale and all coarser source scales → responsible for global layout.
    • Subsequent layers: Attention is highly localized, exhibiting a diagonal structure → responsible for local refinement.
    • Conclusion: The first layer requires scale-aligned references; subsequent layers only need the finest scale.
  3. Scale-Aligned Reference (SAR) Module:

    • Scale-specific references are generated by downsampling the finest-scale features: \(\mathbf{F}_k^{(ref)} = \text{Down}(\mathbf{F}_K^{(src)}, (h_k, w_k))\)
    • Only in the first self-attention layer, the corresponding scale reference is concatenated when predicting scale \(k\):
\[\hat{\mathbf{O}}_k^{(tgt)} = \text{Softmax}\left(\frac{\mathbf{Q}_k^{(tgt)} [\mathbf{K}_k^{(ref)\top}, \mathbf{K}_{1:k}^{(tgt)\top}]}{\sqrt{d}}\right) \cdot [\mathbf{V}_k^{(ref)\top}, \mathbf{V}_{1:k}^{(tgt)\top}]^\top\]
    • All remaining layers continue to use finest-scale conditioning combined with the causal target history (a sketch of the SAR attention follows this list).
  4. Text Conditioning:

    • Text is encoded into token embeddings; the pooled representation serves as \(\tilde{\mathbf{F}}_0^{(tgt)}\) (the start token).
    • Text token embeddings are used as key/value matrices in cross-attention.
    • Source tokens are distinguished from target tokens via a 2D-RoPE positional offset \(\Delta=(64,64)\).
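
The SAR mechanism in item 3 can be sketched as follows; the pooling-based downsampler, the projection layers `to_k`/`to_v`, and all tensor shapes are assumptions made for illustration, and causal masking over the target tokens is omitted.

```python
# Rough sketch of Scale-Aligned Reference attention in the first self-attention layer
# (illustrative shapes and ops; not the released VAREdit code).
import torch
import torch.nn.functional as F

def scale_aligned_reference(src_feat_finest, h_k, w_k):
    """F_k^(ref) = Down(F_K^(src), (h_k, w_k)): downsample finest-scale source features to scale k."""
    return F.adaptive_avg_pool2d(src_feat_finest, (h_k, w_k))   # (B, C, h_k, w_k)

def first_layer_sar_attention(q_tgt, k_tgt, v_tgt, ref_feat, to_k, to_v):
    """Attend over [reference tokens at scale k, causal target tokens up to scale k]."""
    ref_tokens = ref_feat.flatten(2).transpose(1, 2)             # (B, h_k*w_k, C) token sequence
    k = torch.cat([to_k(ref_tokens), k_tgt], dim=1)              # prepend reference keys
    v = torch.cat([to_v(ref_tokens), v_tgt], dim=1)              # prepend reference values
    attn = torch.softmax(q_tgt @ k.transpose(1, 2) / q_tgt.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                              # \hat{O}_k^{(tgt)} for scale k
```

Only the first layer pays this extra reference cost; deeper layers attend to the finest-scale source tokens and the causal target history, as described above.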

Loss & Training

  • Initialized from Infinity pretrained weights.
  • VAREdit-2B: two-stage training at 256×256 (8k iterations) → 512×512 (7k iterations).
  • VAREdit-8B: trained directly at 512×512 for 60k iterations.
  • Optimizes a bitwise classifier loss.
  • Inference: CFG \(\eta=4\), logits temperature \(\tau=0.5\).
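
As a rough illustration of the inference settings, classifier-free guidance with scale \(\eta\) followed by a logits temperature \(\tau\) might look like the sketch below; the exact guidance formulation applied to Infinity's bitwise logits is an assumption here.

```python
import torch

def guided_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
                  eta: float = 4.0, tau: float = 0.5) -> torch.Tensor:
    """Classifier-free guidance with scale eta, then a logits temperature tau."""
    logits = uncond_logits + eta * (cond_logits - uncond_logits)  # push toward the conditioned prediction
    return logits / tau                                           # tau < 1 sharpens the sampling distribution
```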

Key Experimental Results

EMU-Edit and PIE-Bench

| Method | Size | EMU-Edit GPT-Bal. | PIE-Bench GPT-Bal. | Time |
|---|---|---|---|---|
| InstructPix2Pix | 1.1B | 2.923 | 4.034 | 3.5s |
| UltraEdit | 7.7B | 4.541 | 5.580 | 2.6s |
| OmniGen | 3.8B | 4.666 | 3.498 | 16.5s |
| ICEdit | – | 4.786 | – | – |
| VAREdit-2B | 2B | 7.074 | 7.609 | 0.7s |
| VAREdit-8B | 8B | 7.892 | 8.105 | 1.2s |

Comparison with State-of-the-Art Methods

| Method | Size | EMU-Edit GPT-Bal. | PIE-Bench GPT-Bal. |
|---|---|---|---|
| GPT-4o-Image | – | 8.549 | 8.616 |
| Step1X-Edit | – | 7.378 | 7.488 |
| Qwen-Image-Edit | 20B | 8.087 | 8.272 |
| VAREdit-8B | 8B | 7.892 | 8.105 |

Key Findings

  • GPT-Balance surpasses the strongest baseline on each benchmark: ICEdit by 64.9% on EMU-Edit and UltraEdit by 45.3% on PIE-Bench (the arithmetic is spelled out after this list).
  • VAREdit-2B completes editing in only 0.7 seconds, 3.7× faster than UltraEdit.
  • On the high-quality editing subset where GPT-Suc.≥9, VAREdit also achieves the highest GPT-Over., demonstrating genuinely accurate editing with strong preservation.
  • SAR ablation: removing SAR leads to a significant drop in GPT-Bal., validating the necessity of scale alignment.
  • Among open-source methods, VAREdit-8B ranks second only to Qwen-Image-Edit (20B), which has a model 2.5× larger.
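
These relative gains follow directly from the VAREdit-8B scores in the tables above:

\[\frac{7.892 - 4.786}{4.786} \approx 64.9\% \;\; (\text{EMU-Edit, vs. ICEdit}), \qquad \frac{8.105 - 5.580}{5.580} \approx 45.3\% \;\; (\text{PIE-Bench, vs. UltraEdit})\]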

Highlights & Insights

  • Paradigm Shift: This is the first successful application of VAR's next-scale prediction to image editing, demonstrating that AR has a fundamental advantage over diffusion in editing tasks.
  • Attention-Analysis-Driven Design: SAR is not designed intuitively but derived from systematic analysis of attention patterns in the full-scale model — the methodology is worth adopting broadly.
  • Significant Efficiency Advantage: The single-pass generation of AR is inherently faster than diffusion's multi-step denoising, achieving high-quality editing in 1.2 seconds.
  • Importance of the GPT-Balance Metric: Reveals that methods like OmniGen achieve high GPT-Over. through a "no-edit" strategy; GPT-Bal. provides a more comprehensive evaluation.

Limitations & Future Work

  • Quality is bounded by the VQ tokenizer's reconstruction fidelity; fine-grained textures may suffer degradation.
  • The current maximum resolution is 512×512; extension to higher resolutions remains to be explored.
  • The downsampling operation in SAR may discard important spatial information.
  • Generalizing SAR to spatiotemporal multi-scale scenarios for video editing is a promising future direction.

Comparisons with Related Methods

  • vs. InstructPix2Pix paradigm: The fundamental issue with channel-concatenation + diffusion is the entanglement caused by global denoising; VAREdit's causal mechanism naturally avoids this.
  • vs. EditAR: EditAR follows vanilla next-token prediction, which risks structural degradation; VAREdit uses next-scale prediction instead.
  • vs. Infinity: VAREdit inherits Infinity's multi-scale residual quantizer and bit-level classifier, adapting them to the editing task.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First application of VAR to editing, with SAR design derived from deep analysis — highly innovative.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Four benchmarks, diverse comparisons, ablations, and efficiency analysis are comprehensive.
  • Writing Quality: ⭐⭐⭐⭐⭐ Logic flows tightly from motivation to analysis to design; attention heatmap visualizations are excellent.
  • Value: ⭐⭐⭐⭐⭐ Opens a new direction for VAR-based editing with dual advantages in performance and efficiency.