Autoregressive Image Generation with Randomized Parallel Decoding

Conference: ICLR 2026
arXiv: 2503.10568
Code: https://github.com/hp-l33/ARPG
Area: Image Generation
Keywords: autoregressive image generation, random-order modeling, parallel decoding, KV cache, controllable generation

TL;DR

This paper proposes ARPG, a visual autoregressive model built upon a "guided decoding" framework that decouples positional guidance (query) from content representation (key-value), enabling fully randomized-order training and generation with efficient parallel decoding. On ImageNet-1K 256×256, ARPG achieves 1.94 FID in 64 steps, and the authors report over 20× higher throughput and over 75% lower memory use than comparable autoregressive baselines.

Background & Motivation

Autoregressive (AR) models have achieved remarkable success in large language models, and this paradigm has been extended to visual generation (e.g., VQGAN, LlamaGen). However, applying next-token prediction to image generation faces two core challenges:

Fixed-order constraint: Images have a 2D spatial structure, but AR models require flattening them into a 1D sequence (e.g., raster scan order), making it difficult for the model to handle zero-shot generalization tasks that require non-causal dependencies (e.g., inpainting, outpainting).

Inference inefficiency: Token-by-token generation is highly inefficient at high resolutions, where a 256×256 image requires generating hundreds of tokens.

Existing alternatives each have shortcomings: MaskGIT achieves random-order generation via masked modeling but relies on bidirectional attention and cannot use KV caching; RandAR enables random ordering via positional instruction tokens but doubles the sequence length, incurring substantial computational and memory overhead.

Core Idea: Embed positional information of the prediction target as queries in the attention mechanism, fully decoupling content representation (KV) from positional guidance (Q), thereby supporting random-order modeling and parallel decoding while preserving causality.

Method

Overall Architecture

ARPG adopts a 2-Pass Decoder architecture. The first pass applies standard causal self-attention over known tokens to obtain contextualized representations (serving as global key-value pairs). The second pass applies cross-attention, using target-aware queries (position-embedded [MASK] tokens) to predict tokens at arbitrary positions. The input consists of a class label and an image token sequence; the output is the predicted token at the corresponding position.
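
The second pass can be illustrated with a minimal sketch, assuming single-head attention without learned projections. The function names (`softmax`, `guided_cross_attention`) and the toy vectors are illustrative and are not taken from the ARPG code base. The keys and values stand in for the first-pass outputs over known tokens, and the queries stand in for position-embedded [MASK] tokens.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def guided_cross_attention(queries, keys, values):
    """Single-head cross-attention: every query attends to all (key, value) pairs.

    queries: position-only vectors (stand-ins for position-embedded [MASK] tokens)
    keys, values: content vectors (stand-ins for first-pass outputs on known tokens)
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy context of three known tokens (keys reused as values for simplicity).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = keys
# Two target queries standing in for two different target positions.
q_a, q_b = [2.0, 0.0], [0.0, 2.0]

both = guided_cross_attention([q_a, q_b], keys, values)
alone = guided_cross_attention([q_a], keys, values)
```

In this sketch the output for `q_a` is the same whether or not `q_b` is processed alongside it, because queries only read from the shared keys and values and never from each other.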

Key Designs

Three Core Insights:
- Insight 1: Breaking order-specific constraints in AR models requires explicit positional guidance so that the model knows where the next token to predict is located.
- Insight 2: In masked sequence modeling, queries corresponding to unmasked tokens receive no gradients from the loss function and thus play no role during training, which means the queries can be entirely data-independent.
- Insight 3: [MASK] tokens encode only positional information and contribute nothing to the contextual representation. They also break causality, so they should be removed from the key-value pairs.

Guided Decoding Framework: Based on the above insights, ARPG redefines the probability distribution for permutation autoregressive modeling. Each query \(q_{\tau_i}\) is obtained by applying 2D RoPE positional encoding to a data-independent [MASK] token, while the key-value pairs are composed entirely of data-dependent known tokens. Through causal cross-attention, each target-aware query independently attends to the contextual key-value pairs, guiding the model to predict the token at a specific position.

Parallel Decoding: Because the target queries do not attend to one another, the tokens predicted in a single step are independent of each other given the shared context, so ARPG naturally supports parallel decoding. Multiple queries can be processed simultaneously, sharing a single KV cache. Unlike conventional cross-attention, ARPG swaps the roles of input and condition (known tokens serve as KV, and target positions serve as Q), thereby avoiding attention conflicts among multiple generation targets.

2-Pass Decoder Architecture: The first pass (self-attention decoder) processes input tokens to obtain global contextual representations; the second pass (cross-attention decoder) uses guided decoding to predict target tokens. Experiments show that a symmetric configuration (e.g., 12+12 layers) achieves the best balance between efficiency and quality.
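
The parallel decoding loop described above can be sketched as follows. This is an illustration under stated assumptions: the uniform number of tokens per step is an assumption (the schedule actually used is not given here), `parallel_decode` and `dummy_predict` are hypothetical names, and the predictor is a placeholder rather than a model. The 256-token length follows from a 256×256 image with the 16× downsampling tokenizer.

```python
import random

def parallel_decode(num_tokens, num_steps, predict_fn, seed=0):
    """Generate tokens in a random order, several positions per step.

    predict_fn(context, positions) returns one token id per target position.
    context is the list of (position, token) pairs decoded so far and plays
    the role of the shared KV cache.
    """
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)                       # random generation order

    context = []                             # grows like a KV cache
    base, extra = divmod(num_tokens, num_steps)
    start = 0
    for step in range(num_steps):
        size = base + (1 if step < extra else 0)   # uniform split (assumption)
        targets = order[start:start + size]
        start += size
        tokens = predict_fn(context, targets)      # all targets in one call
        context.extend(zip(targets, tokens))       # cached once, reused later
    return context

def dummy_predict(context, positions):
    """Placeholder for the model: returns position-derived token ids."""
    return [p % 7 for p in positions]

decoded = parallel_decode(num_tokens=256, num_steps=64, predict_fn=dummy_predict)
```

With 256 tokens and 64 steps, each call to the predictor handles four target positions, and earlier predictions are appended to the context only once.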

Loss & Training

Training uses standard teacher-forcing on randomly permuted sequences.
Sequences within each batch are independently shuffled, with the class token placed at the start.
RoPE frequencies are expanded along the batch dimension and shuffled accordingly to maintain alignment.
AdamW optimizer (\(\beta_1=0.9\), \(\beta_2=0.95\)), with an initial learning rate of 1e-4 per 256 batch size.
Training runs for 400 epochs in total: 100 warmup epochs followed by cosine decay of the learning rate to 1e-5.
Class embeddings are dropped with probability 0.1 to enable classifier-free guidance (CFG).
LlamaGen tokenizer (16× downsampling, codebook size 16384).
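
Two parts of this recipe can be sketched in plain Python. Several details are assumptions rather than facts from the paper: the warmup is taken to be linear, "1e-4 per 256 batch size" is read as linear scaling of the peak rate with batch size, and the final rate of 1e-5 is left unscaled. The names `learning_rate` and `shuffle_batch` are hypothetical.

```python
import math
import random

def learning_rate(epoch, batch_size, total_epochs=400, warmup_epochs=100,
                  base_lr=1e-4, final_lr=1e-5):
    """Linear warmup (assumption), then cosine decay down to final_lr."""
    peak = base_lr * batch_size / 256        # linear scaling rule (assumption)
    if epoch < warmup_epochs:
        return peak * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak - final_lr) * (1 + math.cos(math.pi * progress))

def shuffle_batch(token_batch, class_ids, rng):
    """Permute each sequence independently and keep positions aligned.

    Returns the shuffled sequences (class token first) and, per sequence,
    the original position of every shuffled image token. The positions are
    what a positional encoding such as RoPE would be indexed with.
    """
    shuffled_tokens, shuffled_positions = [], []
    for tokens, class_id in zip(token_batch, class_ids):
        order = list(range(len(tokens)))
        rng.shuffle(order)                   # a different permutation per sample
        shuffled_tokens.append([class_id] + [tokens[i] for i in order])
        shuffled_positions.append(order)
    return shuffled_tokens, shuffled_positions
```

Returning the permutation alongside the tokens is the design point here: the positional information has to be reordered with exactly the same permutation as the tokens, otherwise each token would be paired with the wrong position.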

Key Experimental Results

Main Results

Model	Params	Steps	Throughput	Memory	FID↓	IS↑
LlamaGen-XXL	1.4B	576	1.58 it/s	26.22 GB	2.62	244.1
VAR-d24	1.0B	10	48.90 it/s	22.43 GB	2.09	312.9
RandAR-XXL	1.4B	88	10.46 it/s	21.77 GB	2.15	322.0
RAR-XL	955M	256	8.00 it/s	10.55 GB	1.50	306.9
ARPG-L	320M	64	62.12 it/s	2.43 GB	2.44	287.1
ARPG-XL	719M	64	35.89 it/s	4.48 GB	2.10	331.0
ARPG-XXL	1.3B	64	25.39 it/s	7.31 GB	1.94	339.7

Ablation Study

Configuration	Steps	Throughput	Memory	FID
ARPG-L (12+12) baseline	64	62.12 it/s	2.43 GB	2.44
Fewer Guided (18+6)	64	50.72 it/s	3.19 GB	3.82
More Guided (6+18)	64	66.11 it/s	1.67 GB	3.51
w/o Guided (24+0)	256	11.70 it/s	4.96 GB	90
Guided Only (0+24)	64	72.26 it/s	0.91 GB	4.57
w/o Shared KV	64	48.02 it/s	3.83 GB	2.37
Random order	64	62.12 it/s	2.43 GB	2.44
Raster order	256	-	-	2.49

Key Findings

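ARPG-XXL achieves 1.94 FID within 64 steps. Against LlamaGen-XXL (1.58 it/s), throughput is about 16× higher for ARPG-XXL (25.39 it/s) and about 23× higher for ARPG-XL (35.89 it/s).
Memory use is far lower than VAR-d24 (22.43 GB): ARPG-XXL needs 7.31 GB (about 67% less) and ARPG-XL needs 4.48 GB (about 80% less), although VAR-d24 retains the higher throughput (48.90 it/s).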
Reducing sampling steps (e.g., from 64 to 32) does not significantly degrade quality (ARPG-XXL: FID=2.08 at 32 steps vs. FID=1.94 at 64 steps).
Random-order generation, despite the increased modeling difficulty (\(n!\) possible permutations), slightly outperforms raster-order generation in the ablation (FID 2.44 vs. 2.49) while using 64 rather than 256 steps.
Removing the guided decoder degrades the model to a standard AR model (FID spikes to 90), completely losing random-order generation capability.
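
The throughput and memory comparisons can be recomputed directly from the main results table. The short script below is only arithmetic on those reported numbers; the helper names `speedup` and `memory_reduction` are illustrative.

```python
# Throughput (it/s) and memory (GB) copied from the main results table.
results = {
    "LlamaGen-XXL": (1.58, 26.22),
    "VAR-d24": (48.90, 22.43),
    "ARPG-XL": (35.89, 4.48),
    "ARPG-XXL": (25.39, 7.31),
}

def speedup(model, baseline):
    """Throughput of `model` divided by throughput of `baseline`."""
    return results[model][0] / results[baseline][0]

def memory_reduction(model, baseline):
    """Fraction of the baseline's memory that `model` saves."""
    return 1.0 - results[model][1] / results[baseline][1]

for model in ("ARPG-XL", "ARPG-XXL"):
    for baseline in ("LlamaGen-XXL", "VAR-d24"):
        print(f"{model} vs {baseline}: "
              f"{speedup(model, baseline):.1f}x throughput, "
              f"{memory_reduction(model, baseline):.0%} less memory")
```

The ratios depend on which pair of models is compared, so a headline figure such as "20×" or "75%" is best read together with the specific baseline it refers to.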

Highlights & Insights

Theoretical clarity: Starting from a comparison between masked and autoregressive modeling, the method is derived through three rigorous insights, forming a complete logical chain.
Efficiency and quality: The approach substantially improves inference efficiency while maintaining competitive generation quality, which is highly valuable for practical deployment.
Zero-shot generalization: Random-order modeling enables the model to naturally support inpainting, outpainting, and resolution extrapolation without additional training.
Controllable generation: Simply replacing [MASK] queries with condition tokens (e.g., Canny edges, depth maps) enables controllable generation, with results reported to surpass prior methods such as ControlVAR and ControlAR.
Minimal design: The method does not rely on additional techniques such as QK normalization, AdaLN, or linear attention.
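
As a rough illustration of why random-order modeling helps with inpainting, the sketch below splits a token grid into known positions, which would serve as context, and masked positions, which are then generated in a random order. This is an assumption about how such a task could be set up, not the authors' procedure; `inpainting_order` is a hypothetical helper and the 16×16 grid follows from the 16× downsampling tokenizer at 256×256.

```python
import random

def inpainting_order(grid_size, masked_positions, seed=0):
    """Split token positions into a known prefix and a random generation order."""
    masked = set(masked_positions)
    all_positions = range(grid_size * grid_size)
    known = [p for p in all_positions if p not in masked]   # context to prefill
    to_generate = sorted(masked)
    random.Random(seed).shuffle(to_generate)                 # order is arbitrary
    return known, to_generate

# Mask a 4x4 square in the middle of a 16x16 token grid (row-major indices).
hole = [r * 16 + c for r in range(6, 10) for c in range(6, 10)]
known, to_generate = inpainting_order(16, hole)
```

A raster-order model cannot condition on known tokens that come after the hole in scan order, whereas a model trained on arbitrary orders can treat all known tokens as the prefix.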

Limitations & Future Work

Due to computational constraints, the method has not been extended to text-to-image generation.
The 512×512 resolution experiment involves only 50 epochs of fine-tuning rather than training from scratch, leaving high-resolution performance insufficiently validated.
The 2-pass decoder introduces additional architectural complexity, though the authors partially mitigate the overhead through shared KV caching.
Random-order training may require more training epochs to achieve the same convergence quality.
The FID still trails the strongest reported results: RAR-XL reaches 1.50 in the main table, although ARPG-XXL (1.94) is already ahead of the DiT-XL/2 diffusion baseline (2.27).

Related Work & Insights

Causal sequence modeling: Raster-order AR models such as VQGAN and LlamaGen, whose efficiency is limited by token-by-token generation.
Masked sequence modeling: The MaskGIT family achieves parallel generation via bidirectional attention but cannot utilize KV caching.
RandAR: Achieves random ordering via positional instruction tokens, but doubling the sequence length introduces significant overhead.
RAR: Specifies the next token position via target-aware positional embeddings, yet still performs best under raster order.
Insight: Redefining the roles of Q, K, and V in the attention mechanism — where Q encodes position and KV encodes content — is an elegant design principle that may inspire other sequence modeling tasks.

Rating

Novelty: ⭐⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐⭐
Value: ⭐⭐⭐⭐⭐