Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
Conference: ICLR 2026 · arXiv: 2507.01957 · Code: GitHub · Area: Autoregressive Image Generation · Keywords: parallel decoding, autoregressive modeling, spatial locality, positional query, efficient inference
TL;DR
This paper proposes Locality-aware Parallel Decoding (LPD), which pairs a flexible parallel autoregressive modeling architecture with a locality-aware generation order schedule, reducing the number of generation steps for 256×256 images from 256 to 20 and cutting latency by at least 3.4×.
Background & Motivation
- Next-patch prediction in autoregressive image generation is a memory-bound operation whose latency scales linearly with the number of steps.
- Next-scale prediction (e.g., VAR) requires fewer steps but relies on multi-scale token representations, making it incompatible with flat visual perception models (CLIP, DINO).
- Existing parallelization methods (PAR, RandAR) achieve only limited parallelism: PAR fixes the parallel order, while RandAR generates parallel tokens without mutual visibility.
- There is a clear need for efficient inference that preserves the generality and compatibility of flat token representations.
Method
Overall Architecture
LPD consists of two core components: a flexible parallel autoregressive modeling architecture (supporting arbitrary generation orders and parallelism degrees) and a locality-aware generation order schedule (maximizing contextual support while minimizing intra-group dependencies).
Key Designs
- Flexible Parallel Autoregressive Modeling: The approach decouples context representation from token generation: previously generated tokens supply context (KV cache), while learnable positional query tokens drive the parallel generation of target positions. A dedicated attention mask is employed (a mask-construction sketch follows this list):
- Context Attention: subsequent tokens causally attend to context tokens.
- Query Attention: positional query tokens within the same step attend to each other, but subsequent tokens are not permitted to attend to query tokens.
During inference, encoding and decoding are fused into a single-step operation, and only the KV cache of generated tokens is stored.
- Locality Analysis: Attention patterns are analyzed on LlamaGen-1.4B, revealing strong spatial locality: the attention of decoded tokens concentrates on spatially nearby tokens. Per-Token Attention (PTA) at grid distance \(s\) is defined as \(PTA_s = \frac{1}{N}\sum_{i=1}^N \frac{\sum_j \text{Attention}(T_i,T_j) \cdot \mathbb{I}[d(T_i,T_j)=s]}{\sum_j \mathbb{I}[d(T_i,T_j)=s]}\). PTA decays sharply with distance, validating two design principles: parallel tokens should be close to already-generated tokens (strong conditioning) and far from tokens within the same group (low dependency). A PTA computation sketch also follows this list.
- Locality-aware Generation Order Schedule: At each step \(k\) (a selection sketch follows this list):
- The Euclidean distance on the 2D token grid between each unselected token and the already-generated tokens is computed as a proximity measure.
- Tokens are ranked by proximity, and a threshold \(\tau\) is used to filter a high-proximity candidate set \(c_1\).
- Tokens are greedily selected from \(c_1\); after each selection, nearby tokens within a repulsion threshold \(\rho\) are filtered out.
- If the group is not yet full, farthest-point sampling from the remaining set \(c_2\) supplies additional tokens.
Group sizes typically increase over steps via a cosine schedule, and, because the selection depends only on token positions, the full generation order can be precomputed; both appear in the sketch below.
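As a concrete illustration of the attention rules above, here is a minimal sketch that builds the boolean mask for a single decoding step; the layout (context tokens first, this step's query tokens last) and the function name are assumptions made for this note, not the paper's exact implementation.

```python
import torch

def lpd_step_attention_mask(num_context: int, num_queries: int) -> torch.Tensor:
    """Boolean attention mask for one LPD decoding step (True = may attend).

    Assumed layout: the first `num_context` positions are already-generated
    tokens (served from the KV cache), the last `num_queries` positions are the
    learnable positional query tokens of the current group.
    """
    total = num_context + num_queries
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Context attention: context tokens attend causally among themselves.
    mask[:num_context, :num_context] = torch.tril(
        torch.ones(num_context, num_context, dtype=torch.bool)
    )

    # Query attention: every query token sees all context tokens ...
    mask[num_context:, :num_context] = True
    # ... and the other query tokens of the same step (mutual visibility).
    mask[num_context:, num_context:] = True
    return mask
```

Because the KV entries of query tokens are never written into the cache, later steps simply cannot attend to them, which matches the "only generated tokens are cached" behavior described above.

The PTA definition translates almost directly into code; a small sketch, assuming `attn` is a head- and sample-averaged attention matrix of shape (N, N) over a square token grid (rounding real-valued grid distances to integer bins is an illustrative choice):

```python
import numpy as np

def per_token_attention(attn: np.ndarray, grid: int) -> dict:
    """PTA_s: average attention a token pays to tokens at grid distance s."""
    n = grid * grid
    ys, xs = np.divmod(np.arange(n), grid)          # token i sits at (ys[i], xs[i])
    dist = np.sqrt((ys[:, None] - ys[None, :]) ** 2 + (xs[:, None] - xs[None, :]) ** 2)
    bins = np.round(dist).astype(int)                # integer distance bins

    pta = {}
    for s in np.unique(bins):
        sel = bins == s                              # indicator 1[d(T_i, T_j) = s]
        counts = sel.sum(axis=1)
        valid = counts > 0                           # skip tokens with no neighbor at s
        per_token = (attn[valid] * sel[valid]).sum(axis=1) / counts[valid]
        pta[int(s)] = float(per_token.mean())        # average over tokens i
    return pta
```

Below is a simplified sketch of the scheduling loop; \(\tau\), \(\rho\), and the exact cosine formula are illustrative placeholders rather than the paper's tuned values, and the helper names are invented for this note.

```python
import math
import numpy as np

def cosine_group_sizes(num_tokens: int, num_steps: int) -> list:
    """Split num_tokens into num_steps groups whose sizes grow over the steps."""
    cum = [1.0 - math.cos(math.pi * k / (2 * num_steps)) for k in range(num_steps + 1)]
    targets = [round(c * num_tokens) for c in cum]
    sizes = [max(1, targets[k + 1] - targets[k]) for k in range(num_steps)]
    sizes[-1] += num_tokens - sum(sizes)             # absorb rounding drift
    return sizes

def select_next_group(selected, remaining, group_size, grid, tau=2.0, rho=1.5):
    """Pick `group_size` positions (flat indices on a grid x grid map) to decode in parallel."""
    pos = lambda i: np.array((i // grid, i % grid), dtype=float)

    def dist_to_set(i, s):
        return min(np.linalg.norm(pos(i) - pos(j)) for j in s) if s else 0.0

    # 1) Proximity of every unselected token to the already-generated set.
    proximity = {i: dist_to_set(i, selected) for i in remaining}
    c1 = sorted((i for i in remaining if proximity[i] <= tau), key=proximity.get)
    c2 = [i for i in remaining if proximity[i] > tau]

    group = []
    # 2) Greedy pick from high-proximity candidates, repelling picks closer than rho.
    for i in c1:
        if len(group) == group_size:
            break
        if all(np.linalg.norm(pos(i) - pos(j)) > rho for j in group):
            group.append(i)

    # 3) If still short, fill by farthest-point sampling from c2 ...
    pool = list(c2)
    while len(group) < group_size and pool:
        far = max(pool, key=lambda i: dist_to_set(i, group + selected))
        group.append(far)
        pool.remove(far)
    # ... and, as a last resort, relax the repulsion constraint.
    for i in c1:
        if len(group) == group_size:
            break
        if i not in group:
            group.append(i)
    return group

# Usage sketch: precompute the full order for a 16x16 token map in 20 steps.
order, selected, remaining = [], [], list(range(16 * 16))
for g in cosine_group_sizes(256, 20):
    group = select_next_group(selected, remaining, g, grid=16)
    order.append(group)
    selected += group
    remaining = [i for i in remaining if i not in group]
```

Since the loop uses only token positions, never sampled values, the resulting order can be cached once and reused at inference time.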
Loss & Training
A grouped autoregressive training objective is adopted: \(p(x_1,...,x_N;c) = \prod_{g=1}^G p(X_g|X_{<g};c)\), where \(X_g\) denotes the set of tokens generated at step \(g\) and \(c\) is the condition. Cross-entropy loss is used, with a dedicated attention mask enabling teacher forcing and parallel prediction during training (a loss sketch follows).
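A compact sketch of how the grouped objective is optimized in one pass, assuming the model emits one logit vector per positional query arranged in the precomputed generation order (shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def grouped_ar_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over all groups in a single forward pass.

    logits:  (B, N, V) predictions from the positional query tokens,
             ordered by the precomputed generation schedule.
    targets: (B, N) ground-truth token ids in the same order.
    With the training attention mask described above, each query only sees
    earlier groups, so all N positions are supervised in parallel
    (teacher forcing), realizing prod_g p(X_g | X_<g; c).
    """
    b, n, v = logits.shape
    return F.cross_entropy(logits.reshape(b * n, v), targets.reshape(b * n))
```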
Key Experimental Results
Main Results (ImageNet 256×256)
| Type | Model | Params | FID↓ | IS↑ | #Steps | Latency (s) | Throughput (img/s) |
|---|---|---|---|---|---|---|---|
| AR | LlamaGen-XXL | 1.4B | 2.34 | 253.9 | 576 | 24.40 | 0.72 |
| AR | RAR-XXL | 1.5B | 1.48 | 326.0 | 256 | 6.59 | 6.72 |
| Par.AR | PAR-XXL-4× | 1.4B | 2.35 | 263.2 | 147 | 6.26 | 2.33 |
| Par.AR | RandAR-L | 343M | 2.55 | 288.8 | 88 | 1.97 | 28.59 |
| Par.AR | LPD-L | 343M | 2.31 | 284.9 | 20 | 0.40 | 92.42 |
| Par.AR | LPD-XL | 775M | 1.97 | 304.0 | 20 | 0.57 | 60.27 |
ImageNet 512×512
| Model | Params | FID↓ | #Steps | Latency (s) | Throughput (img/s) |
|---|---|---|---|---|---|
| LlamaGen-XXL | 1.4B | 2.59 | 1024 | - | - |
| LPD-XXL | 1.4B | 2.25 | 48 | 2.78 | 6.56 |
Key Findings
- LPD-L generates 256×256 images in only 20 steps, achieving FID=2.31, outperforming LlamaGen-XXL (2.34) which requires 576 steps.
- Throughput of 92.42 img/s substantially exceeds RandAR (28.59) and PAR (6.83).
- 512×512 generation requires only 48 steps (vs. 1024), with FID improving from 2.59 to 2.25.
- The locality-aware schedule significantly outperforms raster, random, and Halton orderings.
- Zero-shot image editing (class-conditional editing, inpainting, outpainting) is naturally supported.
Highlights & Insights
- The decoupling design via positional query tokens elegantly resolves the flexibility limitations of standard decoder-only models.
- Query Attention ensures that tokens generated within the same step are mutually visible, preventing inconsistencies arising from independent sampling.
- The locality analysis provides an empirical foundation for designing parallelization strategies — PTA analysis is transferable to other visual autoregressive models.
- Unlike VAR's multi-scale tokens, LPD preserves flat token representations, maintaining compatibility with visual backbones such as CLIP and DINO.
Limitations & Future Work
- Validation is currently limited to ImageNet class-conditional generation; extension to text-guided generation has not been explored.
- The positional query tokens introduce additional parameters and attention computation overhead.
- Hyperparameters of the generation order schedule (\(\tau\), \(\rho\), group size schedule) require tuning.
- A gap in FID remains relative to the best MAR/VAR methods, though throughput is substantially superior.
Related Work & Insights
- The limitations of parallel autoregressive methods such as PAR, RandAR, and SAR motivate this work.
- MaskGIT's masked prediction inspires the design of progressively increasing group sizes.
- The spatial locality observation provides insights into understanding attention mechanisms in visual autoregressive models.
- This work offers an efficient solution for the image generation component in unified multimodal generation (text + image) systems.
Technical Details
- Group sizes increase via a cosine schedule: fewer tokens are generated in early steps when context is scarce, with more generated in later steps.
- Positional query tokens = shared learnable embeddings + positional encodings of target positions.
- During inference, KV representations of query tokens are not stored; only those of generated tokens are cached.
- 256×256 generation completes in 20 steps; 512×512 in 48 steps.
- Zero-shot image editing is supported (class-conditional editing, inpainting, outpainting).
- LPD-L with 343M parameters achieves FID=2.31, surpassing LlamaGen-XXL with 1.4B parameters.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The combination of positional query decoupling and locality-aware scheduling is novel and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐ Systematic comparisons are thorough, but text-to-image and multimodal experiments are absent.
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation is clearly articulated, with thorough comparative analysis against alternative methods.
- Value: ⭐⭐⭐⭐⭐ Substantially reduces latency in autoregressive image generation, with significant implications for unified multimodal systems.