SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation¶
Conference: ICLR 2026 arXiv: 2602.05534 Code: GitHub Area: Visual Autoregressive Models / Image Generation / Inference-Time Guidance Keywords: VAR, next-scale prediction, information bottleneck, frequency-domain guidance, training-free
TL;DR¶
This paper proposes Scaled Spatial Guidance (SSG), a training-free inference-time guidance method that enhances the coarse-to-fine hierarchical generation quality of visual autoregressive models through frequency-domain prior construction and semantic residual amplification.
Background & Motivation¶
Visual autoregressive (VAR) models generate images via next-scale prediction, naturally achieving coarse-to-fine hierarchical synthesis. However:
- Training–inference discrepancy: limited model capacity and accumulated errors cause the model to deviate from its coarse-to-fine nature at inference time, leading to redundant prediction of low-frequency information.
- Limitations of existing improvements:
  - Auxiliary refinement modules (CoDe, HMAR) require retraining.
  - Flow-matching integration introduces additional overhead.
  - Self-correction mechanisms require architectural modifications.
Core Problem: How can one guide each generation step to produce novel, scale-specific high-frequency information without modifying model parameters?
Method¶
1. Derivation from an Information-Theoretic Perspective¶
Starting from the Information Bottleneck (IB) principle, the stepwise generation of VAR is reformulated as a variational optimization problem:
- Target information term: Maximizes mutual information with high-frequency details.
- State redundancy term: Minimizes redundancy with already-established coarse structures.
2. SSG Formulation¶
The optimization objective is converted into a MAP-style surrogate function, yielding a closed-form solution:

\[
\tilde{\ell}_k \;=\; \ell_{\text{prior}} + \beta_k \,\Delta_k, \qquad \Delta_k = \ell_k - \ell_{\text{prior}}
\]

where:
- \(\ell_k\): residual logits at step \(k\)
- \(\ell_{\text{prior}}\): coarse-grained prior constructed from the previous step
- \(\Delta_k\): semantic residual carrying the high-frequency details
- \(\beta_k\): step-wise scaling factor amplifying the residual
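The SSG update itself is a single line of tensor arithmetic. A minimal sketch (variable names are illustrative, not the authors' code):

```python
import numpy as np

def ssg_logits(logits_k: np.ndarray, logits_prior: np.ndarray, beta_k: float) -> np.ndarray:
    """Amplify the semantic residual Delta_k = logits_k - logits_prior by beta_k.

    beta_k > 1 pushes the prediction toward novel, scale-specific
    high-frequency content; beta_k = 1 recovers the original logits.
    """
    delta_k = logits_k - logits_prior       # semantic residual (high-frequency details)
    return logits_prior + beta_k * delta_k  # closed-form SSG update
```

With \(\beta_k = 1\) the update is the identity, so the baseline model is recovered exactly.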
3. Discrete Space Enhancement (DSE)¶
Frequency-domain prior construction procedure:
1. Spatially interpolate the previous-step logits \(\ell_{k-1}\) to the current resolution, obtaining \(\ell'_{\text{interp}}\).
2. Apply the DCT to both.
3. Fuse the low-frequency coefficients of \(\ell_{k-1}\) with the high-frequency coefficients of \(\ell'_{\text{interp}}\).
4. Apply the IDCT to recover the prior \(\ell_{\text{prior}}\).
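The four steps above can be sketched with `scipy.fft`'s multidimensional DCT (an illustrative reconstruction under assumed tensor shapes, not the authors' implementation):

```python
import numpy as np
from scipy.fft import dctn, idctn

def dse_prior(prev_logits: np.ndarray, interp_logits: np.ndarray) -> np.ndarray:
    """Build the frequency-domain prior by DCT band fusion (illustrative sketch).

    prev_logits:   (h, w, V) logits from the previous, coarser scale
    interp_logits: (H, W, V) the same logits spatially interpolated to the
                   current scale (H >= h, W >= w)
    """
    h, w, _ = prev_logits.shape
    H, W, _ = interp_logits.shape
    coarse = dctn(prev_logits, axes=(0, 1), norm="ortho")    # (h, w, V) spectrum
    fine = dctn(interp_logits, axes=(0, 1), norm="ortho")    # (H, W, V) spectrum
    # Rescale so the embedded coarse block carries the same energy under the
    # larger orthonormal basis, then overwrite the low-frequency (top-left)
    # block; the interpolated spectrum supplies the high frequencies.
    fused = fine.copy()
    fused[:h, :w, :] = coarse * np.sqrt((H * W) / (h * w))
    return idctn(fused, axes=(0, 1), norm="ortho")
```

Because the DCT is orthonormal, the copied low-frequency block contributes exactly the coarse scale's energy, which is the band-separation property simple interpolation lacks.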
Advantages over simple interpolation:
- Linear interpolation over-smooths and attenuates the prior.
- Nearest-neighbor interpolation introduces blocky discontinuities and spurious high frequencies.
- DCT frequency-domain fusion preserves energy and achieves precise band separation.
4. Efficient Implementation¶
- No additional forward passes required (cached logits are reused).
- Implemented in only a few lines of code.
- Negligible computational and memory overhead.
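To make the "cached logits, no extra forward passes" point concrete, here is a hypothetical next-scale sampling loop with SSG slotted in (`predict_logits` is a stand-in for the frozen model's per-scale forward pass, and nearest-neighbor upsampling replaces the DSE prior for brevity):

```python
import numpy as np

def nearest_upsample(x: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbor upsampling of (h, h, V) logits to (size, size, V)."""
    idx = np.arange(size) * x.shape[0] // size
    return x[idx][:, idx]

def sample_with_ssg(predict_logits, scales, betas):
    cached = None                      # logits cached from the previous scale
    for k, (s, beta_k) in enumerate(zip(scales, betas)):
        logits = predict_logits(k, s)  # the one ordinary forward pass per scale
        if cached is not None:         # SSG is inactive at the coarsest scale
            prior = nearest_upsample(cached, s)   # stand-in for the DSE prior
            logits = prior + beta_k * (logits - prior)
        cached = logits                # reuse next step: no extra forward pass
    return cached
```

The guidance touches only already-computed logits, which is why the overhead is a few element-wise operations per scale.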
Key Experimental Results¶
ImageNet 256×256 Class-Conditional Generation¶
| Model | FID↓ | sFID↓ | IS↑ | Pre↑ | Rec↑ |
|---|---|---|---|---|---|
| VAR-d16 | 3.42 | 8.70 | 275.6 | 0.84 | 0.51 |
| +SSG | 3.27 | 8.39 | 285.3 | 0.85 | 0.50 |
| VAR-d20 | 2.67 | 7.97 | 299.8 | 0.83 | 0.55 |
| +SSG | 2.49 | 7.60 | 305.2 | 0.83 | 0.56 |
| VAR-d24 | 2.39 | 8.18 | 314.7 | 0.82 | 0.58 |
| +SSG | 2.20 | 6.95 | 324.0 | 0.83 | 0.59 |
| VAR-d30 | 2.02 | 8.52 | 302.9 | 0.82 | 0.60 |
| +SSG | 1.68 | 8.50 | 313.2 | 0.81 | 0.62 |
Cross-Model Generalization¶
SSG proves effective across different tokenization schemes:
- Standard VAR (Tian et al.)
- HART (hybrid tokens)
- Infinity (bitwise tokens)
Comparison with Other Generative Models¶
VAR-d30 + SSG (FID 1.68) is competitive with diffusion and masked generative models while retaining VAR's low-latency advantage (10-step inference).
Ablation Study¶
| Component | FID↓ | IS↑ |
|---|---|---|
| w/o SSG (baseline) | 2.02 | 302.9 |
| SSG + linear interpolation prior | limited improvement | — |
| SSG + nearest-neighbor prior | possible degradation | — |
| SSG + DSE (frequency-domain fusion) | 1.68 | 313.2 |
Highlights & Insights¶
- Elegant information-theoretic design: SSG's closed-form solution is rigorously derived from the IB principle.
- Completely training-free: No modification of model weights, no additional data, and no fine-tuning required.
- Theoretically grounded frequency-domain prior (DSE): Exploits DCT orthogonality to achieve lossless energy-preserving band fusion.
- Strong consistency: Effective across VAR models of varying scales and tokenization designs.
- Minimal implementation: Integrable in just a few lines of code.
Limitations & Future Work¶
- The effectiveness of SSG depends on a well-designed \(\beta_k\) schedule, requiring per-model tuning.
- SSG has no effect at the first step (coarsest scale) due to the absence of a prior.
- SSG is fundamentally a posterior correction and cannot compensate for information loss inherent to the tokenizer.
- Applicable only to VAR models that operate on discrete visual tokens.
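The paper leaves the \(\beta_k\) schedule as a tuning knob; as one hypothetical choice (an assumption, not prescribed by the authors), the amplification could decay linearly toward the finest scale, with \(\beta = 1\) (no guidance) at the coarsest step where no prior exists:

```python
import numpy as np

def beta_schedule(num_steps: int, beta_max: float = 1.5, beta_min: float = 1.0) -> np.ndarray:
    """Hypothetical linearly-decaying step-wise scaling factors.

    beta_k = 1 leaves the logits unchanged, so the first (coarsest) step,
    which has no prior, simply uses beta_0 = 1.
    """
    betas = np.linspace(beta_max, beta_min, num_steps)
    betas[0] = 1.0  # no prior at the coarsest scale -> no guidance
    return betas
```

In practice both the range and the shape of the decay would need per-model tuning, as the limitations above note.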
Related Work & Insights¶
- VAR models: VAR (Tian 2024), HART (Tang 2025), Infinity (Han 2025)
- Visual guidance: CFG, SAG, PAG, STG — none designed specifically for VAR
- Training–inference discrepancy mitigation: CoDe, HMAR — both require retraining
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Elegant bridging from information theory to practice
- Practicality: ⭐⭐⭐⭐⭐ — Zero-cost integration, plug-and-play
- Experimental Thoroughness: ⭐⭐⭐⭐ — Validated across multiple models and settings
- Writing Quality: ⭐⭐⭐⭐⭐ — Theoretical derivations are clear and intuitions are well-explained