SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation¶

Conference: ICLR 2026
arXiv: 2602.05534
Code: GitHub
Area: Visual Autoregressive Models / Image Generation / Inference-Time Guidance
Keywords: VAR, next-scale prediction, information bottleneck, frequency-domain guidance, training-free

TL;DR¶

This paper proposes Scaled Spatial Guidance (SSG), a training-free inference-time guidance method for visual autoregressive models. At each scale, SSG builds a coarse prior in the frequency domain and amplifies the semantic residual between the current logits and that prior, which strengthens the model's coarse-to-fine generation.

Background & Motivation¶

Visual autoregressive (VAR) models generate images via next-scale prediction, naturally achieving coarse-to-fine hierarchical synthesis. However:

Training–inference discrepancy: Limited model capacity and accumulated errors cause the model to deviate from its coarse-to-fine nature at inference time, leading to redundant prediction of low-frequency information.

Limitations of existing improvements:

- Auxiliary refinement modules (CoDe, HMAR) require retraining.
- Flow matching integration introduces additional overhead.
- Self-correction mechanisms require architectural modifications.

Core Problem: How can one guide each generation step to produce novel, scale-specific high-frequency information without modifying model parameters?

Method¶

1. Derivation from an Information-Theoretic Perspective¶

Starting from the Information Bottleneck (IB) principle, the stepwise generation of VAR is reformulated as a variational optimization problem:

\[\mathcal{L}_{\text{VAR-IB}} = \max_{z_k} \beta I(z_k; H(\hat{f}_K)) - I(z_k; L(\hat{f}_K))\]

Target information term: Maximizes mutual information with high-frequency details.
State redundancy term: Minimizes redundancy with already-established coarse structures.

2. SSG Formulation¶

The optimization objective is converted into a MAP-style surrogate function, yielding a closed-form solution:

\[\ell_k^{\text{SSG}} = \ell_k + \beta_k \Delta_k = \ell_k + \beta_k (\ell_k - \ell_{\text{prior}})\]

where:

- \(\ell_k\): residual logits at step \(k\)
- \(\ell_{\text{prior}}\): coarse-grained prior constructed from the previous step
- \(\Delta_k = \ell_k - \ell_{\text{prior}}\): semantic residual (high-frequency details)
- \(\beta_k\): step-wise scaling factor
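The update is a plain extrapolation in logit space, which a few lines of NumPy can illustrate. This is a minimal sketch rather than the authors' code: the function name `ssg_logits` and the toy values are hypothetical, and the prior is taken as given.

```python
import numpy as np

def ssg_logits(logits, prior, beta):
    """Apply l_k + beta_k * (l_k - l_prior) element-wise (hypothetical helper)."""
    logits = np.asarray(logits, dtype=float)
    prior = np.asarray(prior, dtype=float)
    delta = logits - prior          # semantic residual Delta_k
    return logits + beta * delta    # beta = 0 leaves the logits unchanged

# Toy example: a 3-token vocabulary at a single spatial position.
step_logits = np.array([2.0, 1.0, 0.0])
prior_logits = np.array([1.5, 1.0, 0.5])
guided = ssg_logits(step_logits, prior_logits, beta=0.5)
# guided holds the values 2.25, 1.0, -0.25
```

Where the current logits agree with the prior, the value is unchanged; where they differ, the difference from the prior is amplified by a factor of \(1 + \beta_k\).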

3. Discrete Space Enhancement (DSE)¶

Frequency-domain prior construction procedure:

1. Spatially interpolate the previous-step logits \(\ell_{k-1}\) to obtain \(\ell'_{\text{interp}}\).
2. Apply DCT to both.
3. Fuse the low-frequency coefficients of \(\ell_{k-1}\) with the high-frequency coefficients of \(\ell'_{\text{interp}}\).
4. Apply IDCT to recover the prior \(\ell_{\text{prior}}\).
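The four steps can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: logits are assumed to have shape `(h, w, V)`, the interpolation is nearest-neighbor, the low band is taken to be the full `h × w` coarse spectrum, and the coarse coefficients are rescaled by \(\sqrt{HW/(hw)}\) so that amplitudes match at the finer resolution. All function names are hypothetical, and the paper's interpolation mode, cutoff, and normalization may differ.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C of shape (n, n), so that C @ C.T == I."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def dct2(x):
    """2-D orthonormal DCT over the two spatial axes of x with shape (H, W, V)."""
    ch, cw = dct_matrix(x.shape[0]), dct_matrix(x.shape[1])
    return np.einsum("ah,hwv,bw->abv", ch, x, cw)

def idct2(y):
    """Inverse of dct2 (the transpose, because the transform is orthonormal)."""
    ch, cw = dct_matrix(y.shape[0]), dct_matrix(y.shape[1])
    return np.einsum("ah,abv,bw->hwv", ch, y, cw)

def upsample_nearest(x, H, W):
    """Step 1 (assumed mode): nearest-neighbor resize from (h, w, V) to (H, W, V)."""
    rows = np.arange(H) * x.shape[0] // H
    cols = np.arange(W) * x.shape[1] // W
    return x[rows][:, cols]

def dse_prior(prev_logits, H, W):
    """Build a prior at resolution (H, W) from previous-step logits (h, w, V)."""
    h, w, _ = prev_logits.shape
    interp = upsample_nearest(prev_logits, H, W)       # step 1
    coarse_spec = dct2(prev_logits)                    # step 2, coarse grid
    fused = dct2(interp)                               # step 2, fine grid
    # Step 3: overwrite the low band with the coarse spectrum.
    # The rescaling is an assumption that keeps the mean value unchanged.
    fused[:h, :w] = coarse_spec * np.sqrt((H * W) / (h * w))
    return idct2(fused)                                # step 4

rng = np.random.default_rng(0)
prev = rng.normal(size=(2, 2, 3))
prior = dse_prior(prev, 3, 3)   # non-integer scale ratio, as in multi-scale schedules
```

With this construction a spatially constant input maps to the same constant at the finer scale, which is a quick sanity check that the low band is carried over without attenuation.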

Advantages over simple interpolation:

- Linear interpolation over-smooths and attenuates the prior.
- Nearest-neighbor interpolation introduces blocky discontinuities and spurious high frequencies.
- DCT frequency-domain fusion conserves energy and separates the frequency bands precisely.
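The energy and band-separation properties follow from the orthonormality of the DCT and can be checked numerically. The sketch below is a generic property check on random data, not code from the paper; the 8×8 size and the 4×4 low band are arbitrary choices.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C of shape (n, n), so that C @ C.T == I."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
C = dct_matrix(8)

spec = C @ x @ C.T                 # 2-D DCT
low = np.zeros_like(spec)
low[:4, :4] = spec[:4, :4]         # low-frequency band (arbitrary cutoff)
high = spec - low                  # remaining high-frequency band

x_low = C.T @ low @ C              # inverse DCT of each band
x_high = C.T @ high @ C

energy_in = np.sum(x ** 2)
energy_spec = np.sum(spec ** 2)            # equal to energy_in (Parseval)
cross_term = np.sum(x_low * x_high)        # zero up to rounding: bands are orthogonal
```

Because the two bands are orthogonal, their spatial-domain energies add up to the total, so swapping one band for another source changes only that band's contribution.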

4. Efficient Implementation¶

No additional forward passes required (cached logits are reused).
Implemented in only a few lines of code.
Negligible computational and memory overhead.
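A sketch of how the cached logits might be reused is given below. It is illustrative only: `fake_predict_logits` stands in for the real model, token sampling and feature accumulation between scales are omitted, the prior is a plain nearest-neighbor upsample rather than DSE, and caching the raw (unguided) logits is an assumption.

```python
import numpy as np

def upsample_nearest(x, size):
    """Nearest-neighbor resize of (h, w, V) logits to (size, size, V)."""
    rows = np.arange(size) * x.shape[0] // size
    cols = np.arange(size) * x.shape[1] // size
    return x[rows][:, cols]

def run_scales(predict_logits, scales, betas, build_prior=upsample_nearest):
    """Apply the SSG update at every scale using one model call per scale."""
    cached = None                     # logits kept from the previous scale
    guided_per_scale = []
    for k, size in enumerate(scales):
        logits = predict_logits(k, size)          # the only forward pass at step k
        if cached is None:
            guided = logits                       # coarsest scale: no prior exists
        else:
            prior = build_prior(cached, size)     # built from the cache, no model call
            guided = logits + betas[k] * (logits - prior)
        guided_per_scale.append(guided)
        cached = logits                           # assumption: cache raw logits
    return guided_per_scale

# Hypothetical stand-in for the model that records how often it is called.
calls = []
rng = np.random.default_rng(0)

def fake_predict_logits(k, size, vocab=5):
    calls.append(k)
    return rng.normal(size=(size, size, vocab))

scales = [1, 2, 3, 4]
betas = [0.0, 0.5, 0.5, 0.5]          # placeholder values, not a tuned schedule
outputs = run_scales(fake_predict_logits, scales, betas)
```

The number of model calls equals the number of scales, so the guidance adds only element-wise arithmetic and one small resize per step.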

Key Experimental Results¶

ImageNet 256×256 Class-Conditional Generation¶

Model	FID↓	sFID↓	IS↑	Precision↑	Recall↑
VAR-d16	3.42	8.70	275.6	0.84	0.51
+SSG	3.27	8.39	285.3	0.85	0.50
VAR-d20	2.67	7.97	299.8	0.83	0.55
+SSG	2.49	7.60	305.2	0.83	0.56
VAR-d24	2.39	8.18	314.7	0.82	0.58
+SSG	2.20	6.95	324.0	0.83	0.59
VAR-d30	2.02	8.52	302.9	0.82	0.60
+SSG	1.68	8.50	313.2	0.81	0.62
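To make the FID column easier to compare across model sizes, the relative reduction can be computed directly from the table. The snippet uses only the FID values listed above.

```python
# (baseline FID, FID with SSG) taken from the table above.
fid = {
    "VAR-d16": (3.42, 3.27),
    "VAR-d20": (2.67, 2.49),
    "VAR-d24": (2.39, 2.20),
    "VAR-d30": (2.02, 1.68),
}

relative_reduction = {
    name: 100.0 * (base - ssg) / base for name, (base, ssg) in fid.items()
}

for name, pct in relative_reduction.items():
    print(f"{name}: {pct:.1f}% lower FID with SSG")
```

The relative gain grows with model depth, from roughly 4% for VAR-d16 to roughly 17% for VAR-d30.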

Cross-Model Generalization¶

SSG proves effective across different tokenization schemes:

- Standard VAR (Tian et al.)
- HART (hybrid tokens)
- Infinity (bitwise tokens)

Comparison with Other Generative Models¶

VAR-d30 + SSG (FID 1.68) is competitive with diffusion and masked generative models while retaining VAR's low-latency advantage (10-step inference).

Ablation Study¶

Component	FID	IS
w/o SSG (baseline)	2.02	302.9
SSG + linear interpolation prior	limited improvement	—
SSG + nearest-neighbor prior	possible degradation	—
SSG + DSE (frequency-domain fusion)	1.68	313.2

Highlights & Insights¶

Elegant information-theoretic design: SSG's closed-form update is derived from the IB principle via a MAP-style surrogate objective.
Completely training-free: No modification of model weights, no additional data, and no fine-tuning required.
Theoretically grounded frequency-domain prior (DSE): Exploits DCT orthogonality to achieve lossless energy-preserving band fusion.
Strong consistency: Effective across VAR models of varying scales and tokenization designs.
Minimal implementation: Integrable in just a few lines of code.

Limitations & Future Work¶

The effectiveness of SSG depends on a well-designed \(\beta_k\) schedule, requiring per-model tuning.
SSG has no effect at the first step (coarsest scale) due to the absence of a prior.
SSG is fundamentally a posterior correction and cannot compensate for information loss inherent to the tokenizer.
Applicable only to VAR models that operate on discrete visual tokens.

Related Work¶

VAR models: VAR (Tian 2024), HART (Tang 2025), Infinity (Han 2025)
Visual guidance: CFG, SAG, PAG, STG — none designed specifically for VAR
Training–inference discrepancy mitigation: CoDe, HMAR — both require retraining

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — Elegant bridging from information theory to practice
Practicality: ⭐⭐⭐⭐⭐ — Zero-cost integration, plug-and-play
Experimental Thoroughness: ⭐⭐⭐⭐ — Validated across multiple models and settings
Writing Quality: ⭐⭐⭐⭐⭐ — Theoretical derivations are clear and intuitions are well-explained