# PTQ4ARVG: Post-Training Quantization for AutoRegressive Visual Generation Models
Conference: ICLR 2026 | arXiv: 2601.21238 | Code: GitHub | Area: Model Compression | Keywords: Visual Generation, Autoregressive Models, Post-Training Quantization, Activation Quantization, Outlier Suppression
## TL;DR
PTQ4ARVG is proposed as the first systematic PTQ framework for autoregressive visual generation (ARVG) models. It addresses three ARVG-specific quantization challenges via Gain-Projected Scaling (GPS), Static Token-Wise Quantization (STWQ), and Distribution-Guided Calibration (DGC).
## Background & Motivation
### Limitations of Prior Work
Background: Autoregressive visual generation models (VAR, RAR, PAR, MAR) have surpassed diffusion models in image generation quality, yet suffer from large model sizes (2–3B parameters) and slow inference (PAR-3B takes >3 seconds per image). Quantization is an effective acceleration technique, but applying existing methods to ARVG introduces three unique challenges:
1. Channel-wise severe outliers: Activations modulated by AdaLN exhibit extreme inter-channel range disparities.
2. Token-wise highly dynamic activations: Positional encodings cause drastic distribution shifts along the token dimension, and conditional tokens behave as sink tokens.
3. Sample-level distribution mismatch: Network activations are highly similar across samples (especially unconditional ones), so a naively collected calibration set is redundant and drifts away from the true data distribution.
## Method
### Overall Architecture
PTQ4ARVG comprises three targeted components addressing channel-level, token-level, and sample-level quantization challenges respectively, all in a training-free manner.
### Key Designs
- Gain-Projected Scaling (GPS), sketched in code after this list:
    - Applies a Taylor expansion to the quantization loss, quantifying the activation loss and the weight loss separately.
    - Defines the scaling gain as \(g(s_2) = g_{\bm{x}} - g_{\bm{W}_{2,:}}\): the reduction in activation loss minus the increase in weight loss.
    - Derives a closed-form optimal scaling factor by differentiation: \(s_2 = s_1 \frac{\sqrt{\sum_i |\Delta W_{2,i}\, x_2|}}{\sqrt{\sum_i |W_{2,i}\, \Delta x_2|}}\)
    - Represents the first mathematically optimized quantization scaling strategy, outperforming empirically designed alternatives.
- Static Token-Wise Quantization (STWQ), sketched after this list:
    - Exploits two distinctive properties of ARVG: the fixed token sequence length and position-invariant cross-sample distributions.
    - Assigns static quantization parameters along the token sequence for AdaLN modules.
    - Handles sink tokens and regular tokens separately for linear layers.
    - Sets all quantization parameters offline with no online calibration overhead, remaining compatible with standard CUDA kernels.
- Distribution-Guided Calibration (DGC), sketched after this list:
    - Measures each sample's distributional entropy via the Mahalanobis distance: \(\rho(x) = \sqrt{(x-\mu)^T S^{-1} (x-\mu)}\)
    - Selects the top 50% of samples with the highest distributional entropy as the calibration set.
    - Eliminates redundant samples so that the calibration distribution matches the true data distribution.
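To ground the GPS closed form, here is a minimal PyTorch sketch. It assumes a symmetric round-to-nearest quantizer for \(\Delta W\) and \(\Delta x\) and averages per-channel activation magnitudes over calibration tokens; the names `rtn_error` and `gps_scales` are ours, not the paper's implementation.

```python
import torch

def rtn_error(t: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Perturbation of symmetric round-to-nearest quantization (assumed quantizer)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = t.abs().amax() / qmax
    return t - (t / scale).round().clamp(-qmax, qmax) * scale

def gps_scales(W: torch.Tensor, x: torch.Tensor, s1: float = 1.0) -> torch.Tensor:
    """Closed-form GPS factor, one per input channel k:
        s_k = s1 * sqrt(sum_i |dW_{k,i} * x_k|) / sqrt(sum_i |W_{k,i} * dx_k|)

    W: (in_channels, out_features) weight; x: (tokens, in_channels) calibration
    activations. dW and dx are the quantization perturbations of W and x.
    """
    dW = rtn_error(W)                      # Delta W: weight quantization error
    dx = rtn_error(x)                      # Delta x: activation quantization error
    x_mag = x.abs().mean(dim=0)            # |x_k|, averaged over calibration tokens
    dx_mag = dx.abs().mean(dim=0)          # |Delta x_k|, averaged likewise
    num = (dW.abs().sum(dim=1) * x_mag).sqrt()
    den = (W.abs().sum(dim=1) * dx_mag).sqrt().clamp_min(1e-12)
    return s1 * num / den
```

Channel \(k\)'s activations would then be divided by \(s_k\) and the corresponding weight row multiplied by \(s_k\), in the usual SmoothQuant-style reparameterization.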
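A corresponding sketch of the STWQ idea: because every ARVG sample has the same sequence length, one scale per token position can be fit offline and reused at inference with zero online-calibration cost. The `(samples, tokens, channels)` layout and the 99.9th-percentile clip are illustrative assumptions, not the paper's exact setting.

```python
import torch

def calibrate_stwq(acts: torch.Tensor, pct: float = 99.9, n_bits: int = 8) -> torch.Tensor:
    """Fit one static scale per token position from calibration activations.

    acts: (samples, tokens, channels). The fixed sequence length means position t
    sees the same distribution in every sample, so static per-position scales work.
    """
    qmax = 2 ** (n_bits - 1) - 1
    per_pos = acts.transpose(0, 1).reshape(acts.shape[1], -1).abs()
    clip = torch.quantile(per_pos.float(), pct / 100.0, dim=1)  # percentile, not min-max
    return clip / qmax                                          # (tokens,) scales

def quantize_stwq(x: torch.Tensor, scales: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Apply the precomputed per-position scales at inference (no online statistics)."""
    qmax = 2 ** (n_bits - 1) - 1
    s = scales.view(1, -1, 1)              # broadcast over batch and channels
    return (x / s).round().clamp(-qmax, qmax) * s
```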
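Finally, a sketch of DGC's selection rule: score every calibration candidate by its Mahalanobis distance \(\rho(x)\) to the sample mean and keep the highest-scoring half. Representing each sample by a pooled feature vector is our assumption.

```python
import torch

def dgc_select(features: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the top-`keep_ratio` samples by Mahalanobis distance rho(x).

    features: (num_samples, dim) per-sample statistics (e.g., mean-pooled
    activations; an assumption). Returns indices of the retained samples.
    """
    mu = features.mean(dim=0)
    diff = features - mu
    cov = diff.T @ diff / (features.shape[0] - 1)      # sample covariance S
    cov_inv = torch.linalg.pinv(cov)                   # pseudo-inverse for stability
    rho = ((diff @ cov_inv) * diff).sum(dim=1).clamp_min(0).sqrt()
    k = int(features.shape[0] * keep_ratio)
    return torch.topk(rho, k).indices                  # highest "distributional entropy"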
### Loss & Training
- Entirely training-free PTQ.
- GPS derivation is based on Taylor expansion combined with convex optimization.
- STWQ employs percentile-based calibration rather than min-max to preserve precision under outliers (see the toy comparison after this list).
- Evaluation is conducted by generating 50K images on ImageNet, measuring FID, sFID, IS, and Precision.
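A toy comparison of min-max versus percentile calibration under an outlier (synthetic numbers, not from the paper): percentile clipping sacrifices the rare outlier to keep a fine quantization step for everything else.

```python
import torch

torch.manual_seed(0)
x = torch.randn(10_000)
x[0] = 50.0                                           # one extreme outlier
qmax = 127                                            # 8-bit symmetric integer range

scales = {
    "min-max": x.abs().max() / qmax,                  # outlier inflates the step size
    "p99.9": torch.quantile(x.abs(), 0.999) / qmax,   # percentile clips the outlier
}
for name, s in scales.items():
    xq = (x / s).round().clamp(-qmax, qmax) * s
    inlier_mse = (x[1:] - xq[1:]).pow(2).mean()       # error on the 9,999 normal values
    print(f"{name}: scale={s.item():.5f}, inlier MSE={inlier_mse.item():.2e}")
```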
## Key Experimental Results
### Main Results (VAR-d16 / VAR-d24, W8A8 Quantization)
| Method | VAR-d16 FID ↓ | VAR-d16 IS ↑ | VAR-d24 FID ↓ | VAR-d24 IS ↑ |
|---|---|---|---|---|
| FP | 3.60 | 283.21 | 2.33 | 317.16 |
| SmoothQuant | 4.29 | 229.87 | 4.42 | 246.68 |
| OS+ | 4.11 | 230.41 | 4.14 | 250.61 |
| OmniQuant | 4.19 | 226.92 | - | - |
| PTQ4ARVG | 3.82 | 268.19 | 2.69 | 304.82 |
### 6-bit Quantization Results (VAR-d24)
| Method | FID ↓ | IS ↑ | Precision ↑ |
|---|---|---|---|
| SmoothQuant W6A6 | >10 | <200 | Severe degradation |
| PTQ4ARVG W6A6 | ~4.5 | ~280 | Competitive |
### Key Findings
- PTQ4ARVG substantially outperforms existing PTQ methods under both 8-bit and 6-bit settings.
- The mathematically optimized scaling of GPS consistently surpasses empirical approaches (SmoothQuant, RepQ-ViT).
- STWQ handles token-level variance with no additional inference overhead, whereas dynamic alternatives roughly halve inference speed (about 0.5× throughput).
- DGC significantly improves calibration quality by removing redundant samples.
- The framework is effective across all four ARVG model families: VAR, RAR, PAR, and MAR.
## Highlights & Insights
- Precise problem formulation: the paper is the first to systematically identify three distinct quantization challenges in ARVG, providing a dedicated solution for each.
- GPS is the first scaling strategy grounded in rigorous mathematical derivation, establishing a theoretical foundation for quantization scaling.
- STWQ cleverly exploits the fixed token length of ARVG — a property unavailable in LLMs due to variable-length sequences.
- Experiments cover four ARVG architectures (VAR/RAR/PAR/MAR), demonstrating strong generalizability.
## Limitations & Future Work
- 4-bit quantization results are not presented, likely due to severe accuracy degradation of ARVG models at 4-bit.
- Remark 1 in the GPS derivation is based on empirical observation rather than formal proof.
- Comparisons with recent methods such as SVDQuant are absent, though the latter relies on custom CUDA kernels.
- Because ARVG models are smaller than LLMs, the practical demand for quantization-based compression may be less urgent.
## Related Work & Insights
- vs. SmoothQuant: GPS replaces empirical scaling with mathematically optimized scaling.
- vs. Dynamic quantization in LLMs: STWQ leverages ARVG's fixed token length to achieve overhead-free static quantization.
- vs. Diffusion model PTQ: ARVG lacks timestep conditioning but exhibits token-level dynamics, necessitating a fundamentally different approach.
- Insight: Architecture-specific properties can be fully exploited to design more effective quantization methods.
## Rating
- Novelty: ⭐⭐⭐⭐ — First ARVG PTQ framework; GPS theoretical derivation is original.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive validation across four models with deployment verification.
- Writing Quality: ⭐⭐⭐⭐ — Problem analysis is thorough, though formula derivations occupy substantial space.
- Value: ⭐⭐⭐⭐ — Establishes a foundation for efficient deployment of ARVG models.