
ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization

Metadata

  • Conference: ICLR 2026
  • arXiv: 2602.22948
  • Code: Coming soon
  • Area: Others
  • Keywords: VAR, attention entropy, token pruning, model acceleration, tri-dimensional sparsity optimization

TL;DR

ToProVAR is a framework that employs attention entropy to uniformly analyze sparsity across three dimensions — token, layer, and scale — in VAR models, achieving up to 3.4× speedup with negligible image quality degradation, significantly outperforming FastVAR and SkipVAR.

Background & Motivation

Visual Autoregressive (VAR) models reformulate image generation from next-token prediction to next-resolution prediction (coarse-to-fine), enabling GPT-style AR models to surpass diffusion models in image quality for the first time. However, the core bottleneck is that token counts grow exponentially with resolution, making later-stage computation extremely inefficient.

Limitations of existing acceleration methods:

  • FastVAR: retains a fixed proportion of high-frequency tokens in the token dimension → low-frequency but semantically critical tokens are pruned → semantic loss
  • SkipVAR: skips certain scales or replaces unconditional branches in the scale dimension → detail collapse
  • Both rely on single-dimensional sparsity analysis, failing to capture complex relative relationships among tokens

Core challenges: (1) fine-grained sparsity analysis is required to prevent information loss; (2) multi-dimensional representations are needed to assess token importance; (3) the analysis itself must be efficient and introduce minimal overhead.

Method

Overall Architecture

ToProVAR employs attention entropy as a unified metric to analyze semantics and sparsity across three dimensions:

\[\mathcal{H}(q_i) = -\sum_{j=1}^{N} \alpha_{i,j} \log \alpha_{i,j}\]

Low entropy = attention concentrated on a few targets → strong semantic selectivity; High entropy = attention distributed uniformly → weak semantic focus.
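As a point of reference, here is a minimal PyTorch sketch of the per-query entropy in its naive form, materializing the full attention matrix (the very cost the Flash Attention Entropy section below avoids); tensor names and shapes are illustrative:

```python
import torch

def attention_entropy(q, k):
    """Per-query attention entropy H(q_i) = -sum_j a_ij * log(a_ij).

    q, k: (N, d) queries and keys. Naive O(N^2) reference; the paper's
    kernel computes the same quantity online (see Flash Attention Entropy).
    """
    attn = torch.softmax(q @ k.T * q.shape[-1] ** -0.5, dim=-1)  # (N, N), rows sum to 1
    # clamp_min avoids log(0) when an attention weight underflows to zero
    return -(attn * attn.clamp_min(1e-12).log()).sum(dim=-1)    # (N,) entropies
```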

1. Scale-Level Optimization — Semantic Fineness Analysis

Different images require different generation depths: complex subjects (e.g., "cyber fox") need deeper scales to render details, while simple subjects (e.g., the letter "W") stabilize at shallow scales.

Define the low-entropy ratio:

\[\rho_s = \frac{|\{i \mid H_i^s < \bar{H}^s\}|}{N_s}\]

Pruning onset scale: \(D = \min\{s \mid \rho_s \geq \tau\}\)

The threshold \(\tau\) is calibrated via pre-sampling experiments; \(\rho_s\) stabilizes when generation converges.
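A minimal sketch of this scale selection, assuming per-token entropies have already been collected for each scale (function and argument names are illustrative, not the paper's API):

```python
def pruning_onset_scale(entropies_per_scale, tau):
    """Return D = min{s : rho_s >= tau}, where rho_s is the fraction of
    tokens at scale s whose entropy falls below that scale's mean."""
    for s, H in enumerate(entropies_per_scale):    # H: (N_s,) token entropies
        rho_s = (H < H.mean()).float().mean().item()
        if rho_s >= tau:
            return s                               # first sufficiently converged scale
    return len(entropies_per_scale)                # threshold never reached: no pruning
```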

2. Layer-Level Optimization — Semantic Scope Analysis

Attention entropy is extended to the full-layer token distribution. Two layer types are identified:

  • Global Layer: uniform grid-like attention distribution with prominent principal components, capturing global spatial relationships
  • Detail Layer: semantically driven local attention with non-prominent principal components, refining local textures

Differentiation method: SVD is applied to the entropy map, and the principal component ratio is computed:

\[\varrho^{(l,s)} = \sigma_1^{(l,s)} / \sigma_2^{(l,s)}\]

Layer representation score: \(\mathcal{R}^{(l,s)} = \exp(-\beta(\varrho^{(l,s)}-1))\)

  • \(\mathcal{R} \to 1\): Detail Layer (prunable)
  • \(\mathcal{R} \to 0\): Global Layer (not prunable)

Key finding: compressing Global Layers beyond 50% severely degrades quality, whereas Detail Layers maintain high fidelity even at 90% compression.
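A sketch of the scoring step, assuming a layer's per-token entropies have been reshaped into their 2-D spatial grid (`beta` and all names are illustrative):

```python
import torch

def layer_representation_score(entropy_map, beta=1.0):
    """R^{(l,s)} = exp(-beta * (sigma1/sigma2 - 1)), from the top two
    singular values of the layer's 2-D entropy map."""
    sigma = torch.linalg.svdvals(entropy_map)      # singular values, descending
    varrho = sigma[0] / sigma[1].clamp_min(1e-12)  # principal component ratio >= 1
    return torch.exp(-beta * (varrho - 1.0))       # -> 1: Detail Layer, -> 0: Global Layer
```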

3. Token-Level Optimization — Fine-Grained Semantic Saliency Analysis

After normalizing token entropy, tri-dimensional information is integrated to define a unified pruning tendency:

\[q_i^{(s,l)} = \phi(s) \cdot \mathcal{R}^{(l,s)} \cdot \hat{H}_i^{(s,l)}\]

where \(\phi(s) = s / S_{\max}\) is a monotonic scale factor. The retention probability is:

\[P_{\text{keep}}(i|s,l) = \begin{cases} 1, & s < D \\ 1 - \text{clip}(\alpha_{\min} + (\alpha_{\max}-\alpha_{\min})q_i^{(s,l)}, 0, 1), & \text{otherwise} \end{cases}\]
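Putting the two formulas together, a minimal sketch of the retention rule (all names illustrative; `H_hat` holds the normalized per-token entropies \(\hat{H}_i^{(s,l)}\)):

```python
import torch

def keep_probability(H_hat, R_ls, s, S_max, D, alpha_min, alpha_max):
    """P_keep(i | s, l) per the piecewise rule above."""
    if s < D:                                      # shallow scales: keep every token
        return torch.ones_like(H_hat)
    phi = s / S_max                                # monotonic scale factor phi(s)
    q = phi * R_ls * H_hat                         # unified pruning tendency q_i^{(s,l)}
    p_prune = (alpha_min + (alpha_max - alpha_min) * q).clamp(0.0, 1.0)
    return 1.0 - p_prune
```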

Flash Attention Entropy

Directly computing attention entropy requires materializing an explicit \(N \times N\) attention matrix, which is incompatible with FlashAttention. By leveraging the algebraic identity \(kx\log(kx) = kx\log x + (\log k)\,kx\), entropy computation is decomposed into accumulable statistics and computed online within the FlashAttention kernel, introducing only approximately 0.17ms of overhead. Concretely, substituting \(k = 1/Z\) (the softmax normalizer) and \(x = e^{s_j}\) (the unnormalized attention weight) yields \(\mathcal{H} = \log Z - \frac{1}{Z}\sum_j e^{s_j} s_j\), and both \(Z\) and \(\sum_j e^{s_j} s_j\) can be accumulated block by block alongside the usual running softmax statistics.
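As an illustration (a standalone PyTorch sketch of the accumulation scheme, not the paper's fused kernel), the running statistics can be maintained per query with the standard online-softmax running-max rescaling; all names are ours:

```python
import torch

def streaming_attention_entropy(score_blocks):
    """Per-query entropy from tiles of attention logits, without ever
    materializing the full N x N matrix. Maintains, per query, a running
    max m, Z' = sum e^{s-m}, and T' = sum e^{s-m} * s, giving
    H = m + log Z' - T'/Z'  (i.e., log Z - (1/Z) * sum e^s * s)."""
    m = z = t = None
    for s_blk in score_blocks:                     # s_blk: (N, B) logit tile
        blk_max = s_blk.max(dim=-1).values
        if m is None:
            m = blk_max
            z = torch.zeros_like(m)
            t = torch.zeros_like(m)
        new_m = torch.maximum(m, blk_max)
        corr = torch.exp(m - new_m)                # rescale old statistics to the new max
        e = torch.exp(s_blk - new_m[:, None])
        z = z * corr + e.sum(dim=-1)
        t = t * corr + (e * s_blk).sum(dim=-1)
        m = new_m
    return m + z.log() - t / z                     # (N,) entropies
```

On a full logit matrix split into tiles, this agrees with the naive `-(p * p.log()).sum(-1)` computation up to floating-point error.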

Experiments

Main Results (GenEval + DPG)

| Method | GenEval Overall ↑ | DPG Overall ↑ | Latency (s) ↓ | Speedup |
|---|---|---|---|---|
| Infinity-2B | 0.69 | 83.41 | 2.10 | 1.0× |
| +FastVAR | 0.68 | 83.39 | 0.80 | 2.6× |
| +SkipVAR | 0.67 | 82.94 | 1.10 | 2.0× |
| +ToProVAR | 0.69 | 83.07 | 0.61 | 3.4× |
| Infinity-8B | 0.83 | 86.68 | 4.86 | 1.0× |
| +FastVAR | 0.81 | 86.50 | 2.01 | 2.4× |
| +SkipVAR | 0.82 | 86.44 | 2.11 | 2.3× |
| +ToProVAR | 0.83 | 86.70 | 1.78 | 2.7× |

Human Preference Benchmark (HPSv2 + ImageReward)

On Infinity-8B, ToProVAR reduces latency by 67% while maintaining the ImageReward score (1.04 vs. 1.04) and incurring only a 0.41-point drop in HPSv2.

MJHQ30K Perceptual Quality

FID on the People category even improves from 58.91 to 58.84 (simultaneous acceleration and quality gain), while FID on Landscape and Food categories remains virtually unchanged.

Ablation Study

| Configuration | Latency (s) | Speedup | GenEval ↑ |
|---|---|---|---|
| Scale Depth only | 0.47 | 4.5× | 0.477 |
| + Layer Repr. | 0.57 | 3.7× | 0.679 |
| + Token Pruning (full) | 0.61 | 3.4× | 0.690 |
  • Scale depth alone yields the most aggressive speedup but causes significant quality degradation
  • Progressively incorporating layer-level and token-level optimization gradually recovers quality
  • Flash Attention Entropy is critical for efficiency: without FAE, latency is 1.10s vs. 0.61s with FAE

Computational Overhead Analysis

  • FAE introduces only 0.17ms at scale=10, versus 12.06ms for naive computation (a roughly 70× reduction)
  • Layer-level SVD analysis totals 49.84ms, accounting for less than 3% of end-to-end latency

Highlights & Insights

  • Attention entropy serves as a unified metric that elegantly connects sparsity analysis across three dimensions
  • Flash Attention Entropy is a notable engineering contribution, making online entropy computation practically feasible
  • Achieves 3.4× speedup on Infinity-2B with GenEval unchanged (DPG drops by only 0.34 points), and 2.7× speedup on Infinity-8B with DPG marginally improved
  • Qualitative comparisons clearly demonstrate resolution of semantic loss, structural distortion, and detail collapse issues

Limitations & Future Work

  • Validated only on Infinity-2B/8B (VAR architecture); generalization to other VAR variants remains untested
  • Threshold \(\tau\) and hyperparameters \(\alpha_{\min}, \alpha_{\max}\) require pre-sampling calibration
  • Despite its efficiency, the tri-dimensional analysis still introduces approximately 3% additional overhead
  • Joint optimization of training-time and inference-time strategies is unexplored
  • The method is limited to image generation and has not been extended to video or multimodal generation

Related Work

  • VAR Models: Tian et al. (VAR), Infinity (Han et al.) — next-scale prediction paradigm
  • VAR Acceleration: FastVAR (frequency pruning), SkipVAR (scale skipping), SparseVAR (token sparsity), CoDe (collaborative decoding)
  • Diffusion Model Acceleration: Distillation, quantization, pruning, feature caching — not directly applicable to VAR
  • KV Cache Optimization: HACK, ScaleKV — complementary directions

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The tri-dimensional attention entropy analysis framework is entirely novel
  • Technical Depth: ⭐⭐⭐⭐⭐ — Theoretical analysis and engineering implementation (FAE) are both rigorous
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple benchmarks and metrics, with thorough ablations
  • Value: ⭐⭐⭐⭐⭐ — 3.4× speedup with lossless quality, ready for practical deployment