Next Visual Granularity Generation¶
Conference: ICLR 2026 | arXiv: 2508.12811 | Code: Project Page | Area: Image Generation / Visual Autoregression | Keywords: Visual Granularity, Autoregressive Generation, Structured Sequence, Coarse-to-Fine Generation, ImageNet
TL;DR¶
This paper proposes the Next Visual Granularity (NVG) generation framework, which decomposes images into structured sequences at different granularity levels and generates from global layout to fine-grained details progressively, achieving consistent FID improvements over the VAR family.
Background & Motivation¶
- Limitations of Prior Work:
- Token serialization methods neglect rich 2D spatial structures and suffer from exposure bias
- In VAR's visual pyramid, tokens at early stages represent large, semantically diverse regions, causing representational ambiguity
- Diffusion models lack explicit structural control and require additional modules
- Mechanism: Images are represented using varying numbers of unique tokens at the same spatial resolution, forming a granularity hierarchy
Method¶
Overall Architecture¶
NVG represents an image as a structured sequence \(\mathcal{T} = \{(\boldsymbol{c}_i, \boldsymbol{s}_i)\}_{i=0}^K\), where:

- \(\boldsymbol{c}_i\): content tokens at stage \(i\) (\(|\boldsymbol{c}_i| = n_i\) unique tokens drawn from a shared codebook \(\mathcal{V}\))
- \(\boldsymbol{s}_i\): structure map (an \(h \times w\) matrix indicating the token index assigned to each position)
1. Visual Granularity Sequence Construction¶
Structure Construction (bottom-up clustering):

- Starts from the finest granularity (one unique token per position)
- Greedy strategy: computes pairwise \(\ell_2\) distances and merges the top-\(k\) most similar tokens into a cluster
- With \(k=2\), the token count halves at each stage, yielding the sequence \(\{2^i\}_{i=0}^8\) (for a \(16^2\) latent space)
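The bottom-up clustering can be sketched as follows. This is my reading of the described procedure, not the authors' code: each stage greedily pairs the two closest remaining clusters (the \(k=2\) case), halving the cluster count, and the composed assignments give one structure map per granularity level.

```python
import numpy as np

def halve_clusters(centroids):
    """One granularity stage: greedily merge the globally closest pair of
    centroids until all are paired, halving the cluster count (k=2 case)."""
    n = len(centroids)
    dist = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    unpaired = set(range(n))
    parent = np.empty(n, dtype=int)       # old cluster id -> new cluster id
    new_centroids = []
    while unpaired:
        idx = sorted(unpaired)
        sub = dist[np.ix_(idx, idx)]
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = idx[i], idx[j]
        parent[a] = parent[b] = len(new_centroids)
        new_centroids.append((centroids[a] + centroids[b]) / 2)
        unpaired -= {a, b}
    return parent, np.stack(new_centroids)

def build_hierarchy(tokens, stages):
    """tokens: (N, D) latent vectors, one per spatial position. Returns a
    list of per-position cluster assignments, finest to coarsest."""
    assign = np.arange(len(tokens))       # finest: one cluster per position
    centroids = tokens.copy()
    maps = [assign.copy()]
    for _ in range(stages):
        parent, centroids = halve_clusters(centroids)
        assign = parent[assign]           # compose with earlier assignments
        maps.append(assign.copy())
    return maps
```

The quadratic re-scan of pairwise distances keeps the sketch simple; a real implementation would maintain a heap of candidate pairs.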
Content Construction (residual manner): Similar to VAR's visual pyramid, but compression is guided by structure maps rather than spatial downscaling
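A minimal sketch of structure-guided residual content extraction, under the assumption (mine, not stated verbatim in the paper) that each stage pools the current residual within each cluster of the structure map and quantizes the pooled vector against the shared codebook:

```python
import numpy as np

def stage_content_tokens(target, recon, structure, codebook):
    """target, recon: (N, D) latents; structure: (N,) cluster ids for this
    stage; codebook: (V, D). Returns the stage's token ids (one per cluster)
    and the updated running reconstruction."""
    residual = target - recon
    new_recon = recon.copy()
    token_ids = {}
    for c in np.unique(structure):
        mask = structure == c
        pooled = residual[mask].mean(axis=0)          # structure-guided pooling
        tid = int(np.argmin(np.linalg.norm(codebook - pooled, axis=1)))
        token_ids[c] = tid
        new_recon[mask] += codebook[tid]              # add quantized residual back
    return token_ids, new_recon
```

Unlike VAR's spatial downscaling, the compression here follows the cluster boundaries of the structure map, so one token never has to summarize semantically unrelated regions.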
Structure Embedding: A \(K\)-dimensional vector encodes the hierarchical relationships across all stages; one digit (0 or 2) is appended per stage, with 1 used as padding
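The exact construction of this vector is not fully spelled out in these notes; one plausible reading for \(k=2\) merging is that each stage contributes a digit recording which of the two merged children a token descends from, with finer, not-yet-reached stages padded with 1. A hypothetical sketch under that assumption:

```python
def structure_embedding(path, K):
    """path: branch choices (0 or 2) from the coarsest stage down to the
    token's own stage; remaining entries are padded with 1 to length K.
    This encoding is an assumption, not the paper's exact scheme."""
    assert all(b in (0, 2) for b in path) and len(path) <= K
    return list(path) + [1] * (K - len(path))
```

Using {0, 2} for branches and 1 for padding keeps padding numerically between the two branch values, matching the "(0 or 2), with 1 used as padding" description.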
2. Generation Pipeline¶
Each stage generates structure first, then content:

- Structure Generator: a lightweight rectified flow model using v-prediction with Gumbel-top-\(k\) sampling
  - Input: \(\boldsymbol{z}_s(t) = t \cdot \boldsymbol{\varepsilon} + (1-t) \cdot \boldsymbol{s}_e\); known portions are replaced with ground-truth values
- Content Generator: predicts the final canvas \(f_c(\boldsymbol{x}_i) \rightarrow \boldsymbol{x}\), obtaining current-stage tokens via residuals
Content generator training loss:
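The structure generator's noising and known-portion replacement can be sketched directly from the interpolation formula above, together with a standard Gumbel-top-\(k\) sampler (a schematic, not the authors' implementation):

```python
import numpy as np

def noised_structure_input(s_e, t, known_mask, eps=None, rng=None):
    """z_s(t) = t * eps + (1 - t) * s_e, then overwrite already-generated
    ('known') positions with their ground-truth embeddings.
    s_e: (N, D) structure embeddings; known_mask: (N,) boolean."""
    rng = rng or np.random.default_rng()
    if eps is None:
        eps = rng.normal(size=s_e.shape)
    z = t * eps + (1 - t) * s_e
    z[known_mask] = s_e[known_mask]       # inpainting-style conditioning
    return z

def gumbel_top_k(logits, k, rng):
    """Sample k distinct indices by perturbing logits with Gumbel noise."""
    g = rng.gumbel(size=logits.shape)
    return np.argsort(logits + g)[-k:]
```

At \(t = 1\) the input is pure noise and at \(t = 0\) it recovers \(\boldsymbol{s}_e\), matching the rectified-flow convention used in the formula.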
3. Structure-Aware RoPE¶
The 64-dimensional attention features are partitioned as:

- [8]: text/image identifier
- [2]×8: structure encoding (2 dims per hierarchy stage)
- [20]×2: spatial position (20 dims each for row and column)
Tokens within the same cluster share structure positions, while tokens across clusters have distinct ones.
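The 8 + 16 + 40 = 64 split above can be made concrete by assembling the per-token rotary position indices (a schematic of the partition, not the authors' code; the function name and argument layout are my own):

```python
import numpy as np

def rope_positions(modality, struct_digits, row, col):
    """Per-token RoPE position indices for a 64-dim head, following the
    [8 | 2x8 | 20x2] split: identifier, structure encoding, spatial position.
    Tokens in the same cluster share struct_digits, hence those rotary
    phases; the spatial dims still distinguish them."""
    pos = np.empty(64)
    pos[:8] = modality                        # text/image identifier
    pos[8:24] = np.repeat(struct_digits, 2)   # 2 dims per hierarchy stage
    pos[24:44] = row                          # 20 dims for row index
    pos[44:64] = col                          # 20 dims for column index
    return pos
```

Two tokens in the same cluster but at different spatial locations then agree on dims 8–23 and differ only in the spatial block, which is exactly the sharing behavior described above.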
Key Experimental Results¶
ImageNet 256×256 Class-Conditional Generation¶
| Type | Model | FID(↓) | IS(↑) | Prec(↑) | Rec(↑) | Params |
|---|---|---|---|---|---|---|
| X-AR | VAR-d16 | 3.30 | 274.4 | 0.84 | 0.51 | 310M |
| X-AR | VAR-d20 | 2.57 | 302.6 | 0.83 | 0.56 | 600M |
| X-AR | VAR-d24 | 2.09 | 312.9 | 0.82 | 0.59 | 1.0B |
| X-AR | NVG-d16 | 3.03 | 291.6 | - | - | - |
| X-AR | NVG-d20 | 2.44 | 305.0 | - | - | - |
| X-AR | NVG-d24 | 2.06 | 323.0 | - | - | - |
| Mask | MAR-H | 1.55 | 303.7 | 0.81 | 0.62 | 943M |
| Diff | SiT-X | 2.06 | 270.3 | 0.82 | 0.59 | 675M |
Ablation Study: Granularity Decomposition vs. Spatial Decomposition¶
| Method | rFID(↓) | IS(↑) | Note |
|---|---|---|---|
| NVG (granularity decomposition) | better | better | Semantically cleaner per-token representation |
| VAR (spatial decomposition) | baseline | baseline | Early tokens mix diverse semantics |
Key Findings¶
- NVG consistently outperforms VAR across all model scales (FID: 3.30→3.03, 2.57→2.44, 2.09→2.06)
- Clear scaling behavior: larger models yield consistent performance gains
- Generated images align closely with structure maps, validating the effectiveness of structural control
- Structure maps from reference images can be reused to enable structure transfer across content
Highlights & Insights¶
- Elegant problem reformulation: Shifts autoregressive generation from "next token" to "next granularity level"
- Resolves VAR's representational ambiguity: Granularity-based decomposition yields semantically cleaner per-token representations
- Explicit structural control: Built directly into the generation process without requiring additional conditioning modules
- Mitigates exposure bias: Residual modeling combined with progressive canvas refinement avoids error accumulation in autoregressive generation
Limitations & Future Work¶
- The greedy clustering strategy may not yield optimal structure construction
- The dual-model design (structure + content) increases overall system complexity
- Validation is currently limited to class-conditional generation; text-to-image generation remains unexplored
- The "cold start" problem of the structure generator requires unified cross-stage training to mitigate
Related Work & Insights¶
- Visual Autoregression: VAR, LlamaGen, Open-MAGVIT2
- Diffusion Models: DiT, SiT, LDM
- Masked Models: MaskGIT, MAR, TiTok
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The concept of visual granularity sequences is distinctive and intuitive
- Technical Depth: ⭐⭐⭐⭐ — Structure embeddings and Structure-Aware RoPE are elegantly designed
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive comparisons with clear scaling analysis
- Value: ⭐⭐⭐⭐ — Introduces a novel image generation paradigm with built-in structural control