
ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization

Metadata

  • Conference: ICLR 2026
  • arXiv: 2602.22948
  • Code: Coming soon
  • Area: Others
  • Keywords: VAR, attention entropy, token pruning, model acceleration, tri-dimensional sparsity optimization

TL;DR

ToProVAR is a framework that employs attention entropy to uniformly analyze sparsity across three dimensions — token, layer, and scale — in VAR models, achieving up to 3.4× speedup with negligible image quality degradation, significantly outperforming FastVAR and SkipVAR.

Background & Motivation

Visual Autoregressive (VAR) models reformulate image generation from next-token prediction to next-resolution prediction (coarse-to-fine), enabling GPT-style AR models to surpass diffusion models in image quality for the first time. However, the core bottleneck is that token counts grow exponentially with resolution, making later-stage computation extremely inefficient.

Limitations of existing acceleration methods:

  • FastVAR: retains a fixed proportion of high-frequency tokens in the token dimension → low-frequency but semantically critical tokens are pruned → semantic loss
  • SkipVAR: skips certain scales or replaces unconditional branches in the scale dimension → detail collapse
  • Both rely on single-dimensional sparsity analysis, failing to capture complex relative relationships among tokens

Core challenges: (1) fine-grained sparsity analysis is required to prevent information loss; (2) multi-dimensional representations are needed to assess token importance; (3) the analysis itself must be efficient and introduce minimal overhead.

Method

Overall Architecture

ToProVAR employs attention entropy as a unified metric to analyze semantics and sparsity across three dimensions:

\[\mathcal{H}(q_i) = -\sum_{j=1}^{N} \alpha_{i,j} \log \alpha_{i,j}\]

Low entropy = attention concentrated on a few targets → strong semantic selectivity; High entropy = attention distributed uniformly → weak semantic focus.
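As a point of reference, here is a minimal PyTorch sketch of the per-query entropy in its naive form, materializing the full attention matrix (the very cost the Flash Attention Entropy section below avoids); tensor names and shapes are illustrative:

```python
import torch

def attention_entropy(q, k):
    """Per-query attention entropy H(q_i) = -sum_j a_ij * log(a_ij).

    q, k: (N, d) queries and keys. Naive O(N^2) reference; the paper's
    kernel computes the same quantity online (see Flash Attention Entropy).
    """
    attn = torch.softmax(q @ k.T * q.shape[-1] ** -0.5, dim=-1)  # (N, N), rows sum to 1
    # clamp_min avoids log(0) when an attention weight underflows to zero
    return -(attn * attn.clamp_min(1e-12).log()).sum(dim=-1)    # (N,) entropies
```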

1. Scale-Level Optimization — Semantic Fineness Analysis

Different images require different generation depths: complex subjects (e.g., "cyber fox") need deeper scales to render details, while simple subjects (e.g., the letter "W") stabilize at shallow scales.

Define the low-entropy ratio:

\[\rho_s = \frac{|\{i \mid H_i^s < \bar{H}^s\}|}{N_s}\]

Pruning onset scale: \(D = \min\{s \mid \rho_s \geq \tau\}\)

The threshold \(\tau\) is calibrated via pre-sampling experiments; \(\rho_s\) stabilizes when generation converges.
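A minimal sketch of this scale selection, assuming per-token entropies have already been collected for each scale (function and argument names are illustrative, not the paper's API):

```python
def pruning_onset_scale(entropies_per_scale, tau):
    """Return D = min{s : rho_s >= tau}, where rho_s is the fraction of
    tokens at scale s whose entropy falls below that scale's mean."""
    for s, H in enumerate(entropies_per_scale):    # H: (N_s,) token entropies
        rho_s = (H < H.mean()).float().mean().item()
        if rho_s >= tau:
            return s                               # first sufficiently converged scale
    return len(entropies_per_scale)                # threshold never reached: no pruning
```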

2. Layer-Level Optimization — Semantic Scope Analysis

Attention entropy is extended to the full-layer token distribution. Two layer types are identified:

  • Global Layer: uniform grid-like attention distribution with prominent principal components, capturing global spatial relationships
  • Detail Layer: semantically driven local attention with non-prominent principal components, refining local textures

Differentiation method: SVD is applied to the entropy map, and the principal component ratio is computed:

\[\varrho^{(l,s)} = \sigma_1^{(l,s)} / \sigma_2^{(l,s)}\]

Layer representation score: \(\mathcal{R}^{(l,s)} = \exp(-\beta(\varrho^{(l,s)}-1))\)

  • \(\mathcal{R} \to 1\): Detail Layer (prunable)
  • \(\mathcal{R} \to 0\): Global Layer (not prunable)

Key finding: compressing Global Layers beyond 50% severely degrades quality, whereas Detail Layers maintain high fidelity even at 90% compression.
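A sketch of the scoring step, assuming a layer's per-token entropies have been reshaped into their 2-D spatial grid (`beta` and all names are illustrative):

```python
import torch

def layer_representation_score(entropy_map, beta=1.0):
    """R^{(l,s)} = exp(-beta * (sigma1/sigma2 - 1)), from the top two
    singular values of the layer's 2-D entropy map."""
    sigma = torch.linalg.svdvals(entropy_map)      # singular values, descending
    varrho = sigma[0] / sigma[1].clamp_min(1e-12)  # principal component ratio >= 1
    return torch.exp(-beta * (varrho - 1.0))       # -> 1: Detail Layer, -> 0: Global Layer
```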

3. Token-Level Optimization — Fine-Grained Semantic Saliency Analysis

After normalizing token entropy, tri-dimensional information is integrated to define a unified pruning tendency:

\[q_i^{(s,l)} = \phi(s) \cdot \mathcal{R}^{(l,s)} \cdot \hat{H}_i^{(s,l)}\]

where \(\phi(s) = s / S_{\max}\) is a monotonic scale factor. The retention probability is:

\[P_{\text{keep}}(i|s,l) = \begin{cases} 1, & s < D \\ 1 - \text{clip}(\alpha_{\min} + (\alpha_{\max}-\alpha_{\min})q_i^{(s,l)}, 0, 1), & \text{otherwise} \end{cases}\]
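Putting the two formulas together, a minimal sketch of the retention rule (all names illustrative; `H_hat` holds the normalized per-token entropies \(\hat{H}_i^{(s,l)}\)):

```python
import torch

def keep_probability(H_hat, R_ls, s, S_max, D, alpha_min, alpha_max):
    """P_keep(i | s, l) per the piecewise rule above."""
    if s < D:                                      # shallow scales: keep every token
        return torch.ones_like(H_hat)
    phi = s / S_max                                # monotonic scale factor phi(s)
    q = phi * R_ls * H_hat                         # unified pruning tendency q_i^{(s,l)}
    p_prune = (alpha_min + (alpha_max - alpha_min) * q).clamp(0.0, 1.0)
    return 1.0 - p_prune
```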

Flash Attention Entropy

Directly computing attention entropy requires materializing an explicit \(N \times N\) attention matrix, which is incompatible with FlashAttention. By leveraging the algebraic identity \(kx\log(kx) = kx\log x + (\log k)\,kx\), entropy computation is decomposed into accumulable statistics and computed online within the FlashAttention kernel, introducing only approximately 0.17ms of overhead. Concretely, substituting \(k = 1/Z\) (the softmax normalizer) and \(x = e^{s_j}\) (the unnormalized attention weight) yields \(\mathcal{H} = \log Z - \frac{1}{Z}\sum_j e^{s_j} s_j\), and both \(Z\) and \(\sum_j e^{s_j} s_j\) can be accumulated block by block alongside the usual running softmax statistics.
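As an illustration (a standalone PyTorch sketch of the accumulation scheme, not the paper's fused kernel), the running statistics can be maintained per query with the standard online-softmax running-max rescaling; all names are ours:

```python
import torch

def streaming_attention_entropy(score_blocks):
    """Per-query entropy from tiles of attention logits, without ever
    materializing the full N x N matrix. Maintains, per query, a running
    max m, Z' = sum e^{s-m}, and T' = sum e^{s-m} * s, giving
    H = m + log Z' - T'/Z'  (i.e., log Z - (1/Z) * sum e^s * s)."""
    m = z = t = None
    for s_blk in score_blocks:                     # s_blk: (N, B) logit tile
        blk_max = s_blk.max(dim=-1).values
        if m is None:
            m = blk_max
            z = torch.zeros_like(m)
            t = torch.zeros_like(m)
        new_m = torch.maximum(m, blk_max)
        corr = torch.exp(m - new_m)                # rescale old statistics to the new max
        e = torch.exp(s_blk - new_m[:, None])
        z = z * corr + e.sum(dim=-1)
        t = t * corr + (e * s_blk).sum(dim=-1)
        m = new_m
    return m + z.log() - t / z                     # (N,) entropies
```

On a full logit matrix split into tiles, this agrees with the naive `-(p * p.log()).sum(-1)` computation up to floating-point error.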

Experiments

Main Results (GenEval + DPG)

| Method | GenEval Overall ↑ | DPG Overall ↑ | Latency (s) ↓ | Speedup |
|---|---|---|---|---|
| Infinity-2B | 0.69 | 83.41 | 2.10 | 1.0× |
| +FastVAR | 0.68 | 83.39 | 0.80 | 2.6× |
| +SkipVAR | 0.67 | 82.94 | 1.10 | 2.0× |
| +ToProVAR | 0.69 | 83.07 | 0.61 | 3.4× |
| Infinity-8B | 0.83 | 86.68 | 4.86 | 1.0× |
| +FastVAR | 0.81 | 86.50 | 2.01 | 2.4× |
| +SkipVAR | 0.82 | 86.44 | 2.11 | 2.3× |
| +ToProVAR | 0.83 | 86.70 | 1.78 | 2.7× |

Human Preference Benchmark (HPSv2 + ImageReward)

On Infinity-8B, ToProVAR reduces latency by 67% while maintaining the ImageReward score (1.04 vs. 1.04) and incurring only a 0.41-point drop in HPSv2.

MJHQ30K Perceptual Quality

FID on the People category even improves from 58.91 to 58.84 (simultaneous acceleration and quality gain), while FID on Landscape and Food categories remains virtually unchanged.

Ablation Study

| Configuration | Latency (s) | Speedup | GenEval ↑ |
|---|---|---|---|
| Scale Depth only | 0.47 | 4.5× | 0.477 |
| + Layer Repr. | 0.57 | 3.7× | 0.679 |
| + Token Pruning (full) | 0.61 | 3.4× | 0.690 |
  • Scale depth alone yields the most aggressive speedup but causes significant quality degradation
  • Progressively incorporating layer-level and token-level optimization gradually recovers quality
  • Flash Attention Entropy is critical for efficiency: without FAE, latency is 1.10s vs. 0.61s with FAE

Computational Overhead Analysis

  • FAE introduces only 0.17ms at scale=10, versus 12.06ms for naive computation (a roughly 70× reduction)
  • Layer-level SVD analysis totals 49.84ms, accounting for less than 3% of end-to-end latency

Highlights & Insights

  • Attention entropy serves as a unified metric that elegantly connects sparsity analysis across three dimensions
  • Flash Attention Entropy is a notable engineering contribution, making online entropy computation practically feasible
  • Achieves 3.4× speedup on Infinity-2B with GenEval unchanged (DPG drops by only 0.34 points), and 2.7× speedup on Infinity-8B with DPG marginally improved
  • Qualitative comparisons clearly demonstrate resolution of semantic loss, structural distortion, and detail collapse issues

Limitations & Future Work

  • Validated only on Infinity-2B/8B (VAR architecture); generalization to other VAR variants remains untested
  • Threshold \(\tau\) and hyperparameters \(\alpha_{\min}, \alpha_{\max}\) require pre-sampling calibration
  • Despite its efficiency, the tri-dimensional analysis still introduces approximately 3% additional overhead
  • Joint optimization of training-time and inference-time strategies is unexplored
  • The method is limited to image generation and has not been extended to video or multimodal generation

Related Work

  • VAR Models: Tian et al. (VAR), Infinity (Han et al.) — next-scale prediction paradigm
  • VAR Acceleration: FastVAR (frequency pruning), SkipVAR (scale skipping), SparseVAR (token sparsity), CoDe (collaborative decoding)
  • Diffusion Model Acceleration: Distillation, quantization, pruning, feature caching — not directly applicable to VAR
  • KV Cache Optimization: HACK, ScaleKV — complementary directions

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The tri-dimensional attention entropy analysis framework is entirely novel
  • Technical Depth: ⭐⭐⭐⭐⭐ — Theoretical analysis and engineering implementation (FAE) are both rigorous
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple benchmarks and metrics, with thorough ablations
  • Value: ⭐⭐⭐⭐⭐ — 3.4× speedup with lossless quality, ready for practical deployment