Progressive Supernet Training for Efficient Visual Autoregressive Modeling

Conference: CVPR 2026
Paper: CVF Open Access
Code: TBD
Area: Model Compression / Efficient Inference / Visual Autoregressive Generation
Keywords: Visual Autoregressive, supernet, elastic depth, KV cache, progressive training

TL;DR

VARiant identifies a "scale-depth asymmetric dependence" in Visual Autoregressive (VAR) models: early low-resolution scales are highly dependent on network depth, while later high-resolution scales are robust to depth reductions. Based on this, a 30-layer VAR is trained as a weight-sharing elastic depth supernet (early scales use the full network; late scales use 2–16 layer subnets). A three-stage dynamic-ratio progressive training schedule breaks the fixed-ratio Pareto frontier: the d16/d8 subnets achieve near-lossless performance on ImageNet (FID 2.05/2.15 vs. 1.95) while saving 40–65% GPU memory.

Background & Motivation

Background: VAR transforms image generation from "next-token" to "next-scale" prediction, generating multi-scale token maps $R=(r_1,\dots,r_K)$ from coarse to fine, with all tokens of a scale predicted in parallel. This reduces generation to ~10 steps, far fewer than diffusion (50 steps) or traditional AR (100–384 steps), while achieving a better FID than both in the comparison reported here.

Limitations of Prior Work: The next-scale paradigm has a critical memory problem: generating finer scales requires retaining the tokens of all previous scales, so the KV cache grows quadratically with resolution and becomes a deployment bottleneck. Existing mitigations each carry a cost. Distilled Decoding reduces generation to 1–2 steps but degrades quality. Token/cache compression (FastVAR, HACK) can save 50–70% but requires fine-grained operations and a complex implementation. Multi-model collaboration (CoDe) assigns different scales to a small and a large model, but both models must be deployed simultaneously, which increases system complexity and memory footprint.
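To make the cache growth concrete, the sketch below counts how many tokens have accumulated after each scale. The side lengths in `PATCH_SIDES` are an assumption taken from the public VAR reference configuration for 256×256 images, not from this summary, and both helper names are hypothetical.

```python
# Side lengths of the K = 10 token maps for 256x256 generation.
# ASSUMPTION: values follow the public VAR reference configuration.
PATCH_SIDES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def cached_tokens_per_step(sides):
    """Return the cumulative number of tokens after each scale."""
    totals = []
    running = 0
    for side in sides:
        running += side * side  # a scale with side s contributes s * s tokens
        totals.append(running)
    return totals


def late_scale_token_share(sides, first_late_scale):
    """Fraction of all tokens produced from scale `first_late_scale` (1-indexed) onward."""
    counts = [s * s for s in sides]
    return sum(counts[first_late_scale - 1:]) / sum(counts)


totals = cached_tokens_per_step(PATCH_SIDES)
share = late_scale_token_share(PATCH_SIDES, 7)
print("cumulative tokens:", totals)
print("token share of r7-r10:", round(share, 3))
```

Under this assumed schedule, most tokens belong to the last few scales, which is why the late scales dominate both cache size and compute.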

Key Challenge: Reducing memory requires cutting computation (depth or tokens), but VAR scales do not have equal computational requirements. Uniformly reducing depth severely degrades quality at certain scales. Current solutions for scale-differentiated depth allocation rely on multi-model deployment, trading flexibility for system complexity.

Goal: Achieve scale-level elastic depth adjustment within a single model, so that resources can be allocated differently per scale to save memory without multi-model complexity, while both the full network and the subnets reach their respective optima.

Key Insight: Empirical evidence (Sec 3.2.1) on how network depth affects generation quality across scales reveals a strong scale-depth asymmetric dependence. Applying a 50% depth subnet to the low-res scales $r_1$–$r_3$ causes FID to jump from 1.95 to 12.91 (+10.96), losing global semantics. Conversely, applying it only to the high-res scales $r_7$–$r_{10}$ yields an FID of 5.42 (+3.47), even though these scales account for 87% of inference latency.

Core Idea: Low-resolution scales handle global layout/semantics and require deep networks; high-resolution scales refine local textures and are robust to depth reduction. Thus, VAR is trained as a weight-sharing supernet where early scales use the full network and late scales use shallow subnets, enabling zero-cost runtime depth switching.

Method

Overall Architecture

VARiant trains a $D=30$ layer VAR as a supernet supporting multiple depths. During inference, the $K$ scales are split into two zones based on the asymmetric dependence: the Bridge Zone ($r_1$–$r_N$) always uses the full $D$ layers to preserve global semantics, while the Flexible Zone ($r_{N+1}$–$r_K$) runs a subnet of depth $d$ chosen from a discrete set (e.g., 16/8/4/2 layers), whose layer indices are denoted $I_d$. Subnets share weights with the full network, making depth a hyperparameter that can be adjusted at runtime. Training employs a three-stage dynamic ratio progressive strategy so that all configurations converge well.
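The zone split can be pictured as a per-scale depth schedule. This is a minimal sketch, not the authors' code: `depth_schedule` is a hypothetical helper, and the boundary $N=6$ is an assumption read off the "6+4" step split listed for VARiant-d16 in the results table.

```python
def depth_schedule(num_scales, bridge_scales, full_depth, subnet_depth):
    """Return the number of active layers used at each scale r_1..r_K."""
    if not 0 <= bridge_scales <= num_scales:
        raise ValueError("bridge_scales must lie in [0, num_scales]")
    if not 1 <= subnet_depth <= full_depth:
        raise ValueError("subnet_depth must lie in [1, full_depth]")
    bridge = [full_depth] * bridge_scales                     # Bridge Zone: all D layers
    flexible = [subnet_depth] * (num_scales - bridge_scales)  # Flexible Zone: d layers
    return bridge + flexible


# K = 10 scales, D = 30 layers, N = 6 (assumed from the "6+4" split).
for d in (16, 8, 4, 2):
    print(f"d{d}:", depth_schedule(10, 6, 30, d))
```

Because only the second half of the list changes with `subnet_depth`, switching between efficiency targets amounts to picking a different value of $d$ at inference time.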

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Multi-scale Generation<br/>r1 … rK"] --> B["Scale-Depth Asymmetric Dependence<br/>Early sensitive / Late robust"]
    B --> C["Weight-sharing Supernet<br/>Equidistant sampling + Bridge/Flexible Zones"]
    D["Dynamic Ratio Progressive Training<br/>2:8 → Linear → 10:0 Stages"]
    C --> D
    D --> E["Single Model, Multi-Depth<br/>Zero-cost runtime switching"]

Key Designs

1. Scale-Depth Asymmetric Dependence: Locating depth reduction targets

This empirical foundation identifies which VAR scales tolerate depth reduction. Applying a 50% depth subnet to different scale intervals on ImageNet-256 gives: $r_1$–$r_3$ (early), FID 12.91 (semantic collapse); $r_4$–$r_6$ (mid), FID 8.5; $r_7$–$r_{10}$ (late), FID 5.42, despite covering 87% of latency and reducing FLOPs by 46.7%. Conclusion: early scales require the representation capacity of deep networks for global structure; late scales are inherently robust to depth reduction.
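The size of the asymmetry is easiest to see as FID increases over the full-depth baseline. The snippet below only restates the numbers quoted above; the dictionary keys are illustrative labels.

```python
BASELINE_FID = 1.95  # full-depth VAR-d30

# FID when a 50% depth subnet is applied only to the listed scales.
fid_by_interval = {
    "r1-r3 (early)": 12.91,
    "r4-r6 (mid)": 8.5,
    "r7-r10 (late)": 5.42,
}

# Increase over the baseline, rounded to two decimals.
fid_increase = {
    name: round(fid - BASELINE_FID, 2) for name, fid in fid_by_interval.items()
}

for name, delta in fid_increase.items():
    print(f"{name}: +{delta:.2f}")
```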

2. Weight-sharing Supernet: Equidistant sampling + Cross-scale depth allocation

To avoid multi-model deployment, depth is made an intra-model hyperparameter.

Equidistant Layer Sampling: For total depth $D$ and target depth $d$, layers are selected via $I_d=\{\lfloor i\cdot(D-1)/(d-1)\rfloor\mid i=0,\dots,d-1\}$, designed so that subnets are nested ($I_{0.25D}\subset I_{0.5D}\subset\{0,\dots,D-1\}$).

Cross-scale Depth Allocation: The active layers at step $k$ are $I_k=\{0,\dots,D-1\}$ if $k\le N$ (Bridge Zone) and $I_k=I_d$ if $k>N$ (Flexible Zone). This enables implicit knowledge transfer between the subnets and the full network, and the Bridge Zone provides cross-scale gradient propagation to the layers that the subnets skip.
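The two rules above can be sketched directly from their formulas. Function names are hypothetical. Because of the floor, whether two index sets are strictly nested depends on the particular $D$ and $d$, so the sketch includes `is_nested`, a small helper for checking a chosen configuration.

```python
def layer_indices(total_depth, target_depth):
    """Equidistant layer sampling: floor(i * (D - 1) / (d - 1)) for i = 0..d-1."""
    if not 2 <= target_depth <= total_depth:
        raise ValueError("target_depth must lie in [2, total_depth]")
    # Integer floor division keeps the result exact (no float rounding).
    return [(i * (total_depth - 1)) // (target_depth - 1) for i in range(target_depth)]


def active_layers(scale_index, bridge_scales, total_depth, target_depth):
    """Layers executed at scale k (1-indexed): all layers if k <= N, else I_d."""
    if scale_index <= bridge_scales:
        return list(range(total_depth))
    return layer_indices(total_depth, target_depth)


def is_nested(smaller_depth, larger_depth, total_depth):
    """Check whether I_smaller is a subset of I_larger for the given D."""
    small = set(layer_indices(total_depth, smaller_depth))
    large = set(layer_indices(total_depth, larger_depth))
    return small <= large


print("I_8  for D=30:", layer_indices(30, 8))
print("I_16 for D=30:", layer_indices(30, 16))
print("I_4 within I_16:", is_nested(4, 16, 30))
print("I_8 within I_16:", is_nested(8, 16, 30))
```

The first and last layers are always included, since $i=0$ maps to layer 0 and $i=d-1$ maps to layer $D-1$.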

3. Dynamic Ratio Progressive Training: Breaking the fixed-ratio Pareto frontier

Weight sharing causes optimization conflicts: training only subnets degrades the full network, while training only the full network leaves subnets under-optimized. Let $p$ denote the subnet share of the sampling ratio (so a Subnet:Full ratio of 1:9 corresponds to $p=0.1$). Any fixed $p$ forces a trade-off: $p=0.1$ gives the best full-net FID (1.96) but a poor subnet FID (2.68), while $p=1.0$ yields a subnet FID of 2.15 but degrades the full network to 2.32. To resolve this, a three-stage dynamic ratio $\rho = \text{Subnet}:\text{Full Network}$ is used:

$$\text{Phase 1 (Joint, }\rho=2{:}8\text{)}\;\to\;\text{Phase 2 (Prog. Transition, }p(ep)=0.2+0.8\cdot\tfrac{ep-E_1}{E_2-E_1}\text{)}\;\to\;\text{Phase 3 (Subnet Tuning, }\rho=10{:}0\text{)}$$

Here $ep$ is the current epoch, and $E_1$ and $E_2$ are the epochs at which Phase 2 begins and ends, so $p$ rises linearly from 0.2 to 1.0. Phase 1 builds a shared foundation for all layers. Phase 2 smoothly shifts gradient contributions toward the subnets. Phase 3 specializes the subnets while the Bridge Zone maintains full-network quality.
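The three-phase schedule can be written as one piecewise function of the epoch. This is a sketch under stated assumptions: `subnet_probability` and `sample_depth` are hypothetical names, the boundaries $E_1=5$ and $E_2=20$ are inferred from the stage lengths given under Loss & Training, and drawing the subnet depth uniformly is an assumption, not something the summary specifies.

```python
import random


def subnet_probability(epoch, e1, e2, p_start=0.2, p_end=1.0):
    """Subnet sampling share p: constant, then linear ramp, then constant."""
    if e2 <= e1:
        raise ValueError("e2 must be greater than e1")
    if epoch < e1:       # Phase 1: joint training at rho = 2:8
        return p_start
    if epoch >= e2:      # Phase 3: subnet tuning at rho = 10:0
        return p_end
    # Phase 2: p(ep) = 0.2 + 0.8 * (ep - E1) / (E2 - E1)
    return p_start + (p_end - p_start) * (epoch - e1) / (e2 - e1)


def sample_depth(epoch, e1, e2, subnet_depths, full_depth, rng):
    """Pick the Flexible Zone depth for one training step."""
    if rng.random() < subnet_probability(epoch, e1, e2):
        return rng.choice(subnet_depths)  # ASSUMPTION: uniform over subnet depths
    return full_depth


rng = random.Random(0)
for epoch in (0, 5, 12.5, 20, 30):
    print(epoch, round(subnet_probability(epoch, 5, 20), 3))
print("depth sampled in Phase 3:", sample_depth(25, 5, 20, [16, 8, 4, 2], 30, rng))
```

In Phase 3 the function returns 1.0, so every step trains a subnet in the Flexible Zone while the Bridge Zone scales still run all layers.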

Loss & Training

The objective is scale-wise cross-entropy: $L=\sum_{k=1}^{K}\mathrm{CE}(p_\theta(r_k\mid r_{<k},I_k),r^*_k)$, where $r^*_k$ is the target token map at scale $k$ and $I_k$ is the set of layers active at that scale. The supernet is initialized from a pre-trained VAR-d30 and provides 2/4/8/16/30-layer configurations. Training runs in three stages: Stage 1 (5 epochs), Stage 2 (15 epochs), and Stage 3 (5–15 epochs). Optimizer: AdamW, LR $1\times10^{-6}$, batch size 1024, on 8×H100.
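A dependency-free sketch of the scale-wise objective is given below. It assumes the per-token logits have already been produced by a model running the active layers $I_k$, and it averages cross-entropy over the tokens within each scale before summing over scales; that reduction is an assumption, since the formula does not state it. Both function names are hypothetical.

```python
import math


def token_cross_entropy(logits, target):
    """Cross-entropy of one token, via a numerically stable log-sum-exp."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def scale_wise_loss(logits_per_scale, targets_per_scale):
    """Sum over scales of the mean token cross-entropy within each scale."""
    total = 0.0
    for scale_logits, scale_targets in zip(logits_per_scale, targets_per_scale):
        losses = [token_cross_entropy(l, t) for l, t in zip(scale_logits, scale_targets)]
        total += sum(losses) / len(losses)  # ASSUMPTION: mean over tokens in a scale
    return total


# Toy example: two scales (1 token and 4 tokens), vocabulary of 4, uniform logits.
toy_logits = [[[0.0] * 4], [[0.0] * 4 for _ in range(4)]]
toy_targets = [[2], [0, 1, 2, 3]]
loss = scale_wise_loss(toy_logits, toy_targets)
print(loss)
```

With uniform logits each token contributes $\ln 4$, so the toy loss equals $2\ln 4$, one term per scale.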

Key Experimental Results

Main Results

ImageNet 256×256 class-conditional generation using a single 2.0B model. Efficiency measured on a single L20 (batch 64):

Method	Steps	Speedup↑	Latency↓	Memory↓	KV cache↓	Params	FID↓	IS↑
DiT-XL/2	50	–	19.20s	–	–	675M	2.26	239
LlamaGen-XXL	384	–	74.27s	–	–	1.4B	2.34	254
VAR-d30 (Base)	10	1.0×	3.62s	39265MB	28677MB	2.0B	1.95	301
VAR-CoDe (Dual)	6+4	2.9×	1.27s	19943MB	8156MB	2.0+0.3B	2.27	297
VARiant-d16	6+4	1.7×	2.12s	28644MB	16092MB	2.0B	2.05	314

Note: Further results for d8 (2.6× speedup, 65% memory saving, FID 2.15) and d2 (3.5× speedup, 80% memory saving, FID 2.67) are provided in the text.
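The relative savings for VARiant-d16 can be recomputed from the table rows. The snippet uses only the latency, memory, and KV cache figures listed above; the helper names are hypothetical.

```python
# Rows copied from the table above (latency in seconds, sizes in MB).
BASELINE = {"latency_s": 3.62, "memory_mb": 39265, "kv_cache_mb": 28677}
VARIANT_D16 = {"latency_s": 2.12, "memory_mb": 28644, "kv_cache_mb": 16092}


def relative_saving(baseline, value):
    """Fraction saved relative to the baseline (0.25 means 25% less)."""
    return 1.0 - value / baseline


def speedup(baseline_latency, latency):
    """How many times faster than the baseline."""
    return baseline_latency / latency


d16_speedup = speedup(BASELINE["latency_s"], VARIANT_D16["latency_s"])
d16_memory_saving = relative_saving(BASELINE["memory_mb"], VARIANT_D16["memory_mb"])
d16_kv_saving = relative_saving(BASELINE["kv_cache_mb"], VARIANT_D16["kv_cache_mb"])

print(f"speedup:         {d16_speedup:.2f}x")
print(f"memory saving:   {d16_memory_saving:.1%}")
print(f"KV cache saving: {d16_kv_saving:.1%}")
```

Keeping total memory and KV cache as separate quantities matters here, because the two columns shrink by different proportions.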

Ablation Study (Fixed vs. Progressive)

Training Ratio (Sub:Full)	Full-Net FID	Subnet FID	Note
1:9 ($p=0.1$)	1.96	2.68	Subnet gradient starvation
10:0 ($p=1.0$)	2.32	2.15	Full-net stagnation
Progressive (Ours)	≈1.95	≈2.05	Breaks the Pareto frontier
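"Breaks the Pareto frontier" can be checked as a dominance relation over the (full-net FID, subnet FID) pairs in the table. The sketch treats the approximate progressive values as exact, which is a simplification, and `dominates` is a hypothetical helper.

```python
def dominates(a, b):
    """True if a is no worse than b on every metric and better on at least one.

    Metrics are FID values, so lower is better.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


# (full-net FID, subnet FID) from the ablation table.
results = {
    "fixed 1:9": (1.96, 2.68),
    "fixed 10:0": (2.32, 2.15),
    "progressive": (1.95, 2.05),  # reported as approximate values
}

for name in ("fixed 1:9", "fixed 10:0"):
    print(name, "dominated by progressive:", dominates(results["progressive"], results[name]))
print("fixed ratios dominate each other:",
      dominates(results["fixed 1:9"], results["fixed 10:0"])
      or dominates(results["fixed 10:0"], results["fixed 1:9"]))
```

The two fixed ratios trade one metric for the other, while the progressive schedule is at least as good as both on both metrics.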

Key Findings

Late scales are the primary pruning targets: Reducing depth only at the high-resolution scales targets the 87% of inference latency they account for, with the smallest quality loss of the tested scale intervals.
Fixed ratios are suboptimal: Any constant $p$ forces a compromise; dynamic ratios allow both configurations to reach peak performance.
Gradient Bridge is essential: The Bridge Zone provides consistent gradients to all layers, ensuring full-network stability during Phase 3.
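The role of the Bridge Zone can be illustrated by asking which layers take part in at least one scale of a training step when a subnet is sampled. This is a simplified proxy for "receives gradient", not a measurement; the function names are hypothetical, and the equidistant sampling rule is restated so the snippet runs on its own.

```python
def layer_indices(total_depth, target_depth):
    """Equidistant sampling rule: floor(i * (D - 1) / (d - 1))."""
    return [(i * (total_depth - 1)) // (target_depth - 1) for i in range(target_depth)]


def layers_used(num_scales, bridge_scales, total_depth, subnet_depth):
    """Union of the layers executed across all scales in one step."""
    used = set()
    for k in range(1, num_scales + 1):
        if k <= bridge_scales:
            used.update(range(total_depth))  # Bridge Zone runs every layer
        else:
            used.update(layer_indices(total_depth, subnet_depth))
    return used


def starved_layers(num_scales, bridge_scales, total_depth, subnet_depth):
    """Layers that no scale executes, and so get no training signal in this step."""
    used = layers_used(num_scales, bridge_scales, total_depth, subnet_depth)
    return sorted(set(range(total_depth)) - used)


with_bridge = starved_layers(10, 6, 30, 16)     # N = 6 is an assumed boundary
without_bridge = starved_layers(10, 0, 30, 16)  # hypothetical: no Bridge Zone
print("starved with bridge:", with_bridge)
print("starved without bridge:", len(without_bridge), "layers")
```

With any non-empty Bridge Zone every layer is exercised in every step, which is consistent with the finding that the full network stays stable even when only subnets are sampled.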

Highlights & Insights

Scale-depth asymmetry is a clean, actionable observation: It transforms the "where to prune" question into a quantifiable scale-level rule.
NAS principles applied to the scale axis: Unlike traditional elastic depth (per sample/layer), this is per generation scale, allowing a single model to serve multiple efficiency-quality targets with zero runtime switching cost.
Gradient re-distribution over time: The three-stage schedule recognizes that training needs shift from broad foundation-building to specialized sub-path optimization.

Limitations & Future Work

Primarily validated on ImageNet class-conditional generation with VAR-d30; generalization to T2I, higher resolutions, or video variants (Infinity) is unproven.
Hyperparameters like the Bridge/Flexible boundary $N$ and Stage 3 duration are empirically set; optimal values might vary across datasets.
Quality trade-offs in d2: The extreme efficiency mode shows noticeable quality degradation (FID 2.67).
Parameter count is unchanged: The method reduces memory and latency but not the total parameter count (the model file is still 2.0B parameters).

vs. CoDe: VARiant achieves comparable or better quality using a single model (d4/d8) versus CoDe’s dual-model (2.0+0.3B) setup, reducing system complexity.
vs. Distilled Decoding: Distillation sacrifices quality for steps; VARiant sacrifices late-stage depth, retaining better visual fidelity.
vs. Token Compression (FastVAR, HACK): These methods are orthogonal; VARiant’s depth-based approach is simpler to deploy and can likely be combined with token-level pruning.

Rating

Novelty: ⭐⭐⭐⭐ Asymmetric dependence observation + scale-axis elastic supernet.
Experimental Thoroughness: ⭐⭐⭐⭐ Strong ablations on schedules and scale zones, though limited to ImageNet.
Writing Quality: ⭐⭐⭐⭐ Clear progression from observation to architecture.
Value: ⭐⭐⭐⭐⭐ Highly practical for VAR deployment, offering 40–80% memory savings.
