HierAmp: Coarse-to-Fine Autoregressive Amplification for Generative Dataset Distillation¶
Conference: CVPR 2026 · arXiv: 2603.06932 · Code: Oshikaka/HIERAMP · Area: Model Compression / Dataset Distillation · Keywords: dataset distillation, visual autoregressive model, hierarchical semantic amplification, coarse-to-fine generation, codebook token diversity
TL;DR¶
This paper proposes HierAmp, which injects learnable class tokens at each scale of a Visual AutoRegressive (VAR) model's coarse-to-fine generation process to identify semantically salient regions, and amplifies attention to those regions via positive logit biasing. As a result, the distilled data acquires richer, more diverse layouts at coarse scales while focusing on class-relevant details at fine scales, achieving state-of-the-art performance across multiple dataset distillation benchmarks.
Background & Motivation¶
- Limitations of dataset distillation: Existing methods primarily optimize global distribution proximity (gradient matching, trajectory matching, distribution matching) without directly targeting the discriminative semantic information required for downstream classification.
- Neglect of hierarchical semantics: Object semantics naturally possess a hierarchical structure—global layout constrains local structure, which in turn constrains textural details—yet existing distillation methods model this in a single latent space without accounting for such hierarchy.
- Poor visual quality of traditional methods: Images generated by optimization-based distillation methods lack visual fidelity and resemble feature abstractions rather than natural images.
- Limited diversity in GAN-based methods: Early GAN-based distillation improves visual quality but suffers from limited generation diversity.
- High cost of diffusion models: Diffusion models yield high quality but incur large computational overhead due to long denoising chains.
- Natural alignment of VAR: The coarse-to-fine generation paradigm of visual autoregressive models aligns naturally with the hierarchical structure of object semantics—early scales generate global structure while subsequent scales refine details—providing an ideal framework for hierarchical semantic amplification.
Method¶
Overall Architecture¶
HierAmp is built upon a pretrained VAR model comprising 10 hierarchical scales (scales 0–9). The core idea is to inject a learnable class token at each scale, train it to capture scale-specific semantic information via a classification objective, and then leverage the class token's attention map to identify salient regions and amplify their attention.
Scale-Restricted Class Token Attention¶
- A learnable class token \([c]_n\) is concatenated at each scale \(n\).
- A scale-restricted attention mask constrains \([c]_n\) to attend only to image tokens at the current scale, blocking cross-scale connections.
- After aggregating multi-head attention, a semantic saliency map \(\mathbf{M}_n \in \mathbb{R}^{h_n \times w_n}\) is obtained.
- Class tokens are trained with a classification loss: \(\mathcal{L}_{cls} = \frac{1}{N}\sum_{n=1}^{N}(-\log p_n(\mathbf{c}_n^e))\)
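The masking and saliency extraction described above can be summarized in a minimal PyTorch sketch. This is illustrative only: the token layout (one class token appended after each scale's image tokens) and the helper names `build_scale_restricted_mask` / `saliency_map` are my assumptions, not the authors' code.

```python
# Minimal sketch of scale-restricted class-token attention (illustrative, not the authors' code).
import torch

def build_scale_restricted_mask(scale_sizes):
    """Boolean mask letting each scale's class token [c]_n attend only to the image
    tokens of its own scale. scale_sizes: list of (h_n, w_n). Only the class-token
    rows are set here; image-token rows keep VAR's usual attention pattern."""
    lengths = [h * w + 1 for h, w in scale_sizes]    # +1 for the class token of each scale
    total = sum(lengths)
    allowed = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for length in lengths:
        cls_idx = start + length - 1
        allowed[cls_idx, start:cls_idx] = True       # class token -> own-scale image tokens
        allowed[cls_idx, cls_idx] = True             # and itself
        start += length
    return allowed

def saliency_map(attn, scale_start, h_n, w_n):
    """attn: (heads, T, T) attention weights for one sample. Returns the head-averaged
    class-token attention over its scale, i.e. M_n of shape (h_n, w_n)."""
    cls_idx = scale_start + h_n * w_n                # class token sits after the image tokens
    m = attn[:, cls_idx, scale_start:cls_idx].mean(dim=0)
    return m.reshape(h_n, w_n)
```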
Coarse-to-Fine Autoregressive Amplification¶
- The top-\(\rho_n\%\) positions of the saliency map \(\mathbf{M}_n\) are selected to form the salient set \(\mathcal{S}_n\).
- A binary indicator vector \(\mathbf{a}_n\) is constructed, and a positive logit bias \(\beta_n\) is added at salient positions: \(\tilde{\mathbf{L}}_n^{(h)} = \mathbf{L}_n^{(h)} + \beta_n \mathbf{a}_n\) (see the sketch after this list).
- The modified attention \(\tilde{\boldsymbol{\alpha}}_n^{(h)} = \text{softmax}(\tilde{\mathbf{L}}_n^{(h)})\) increases probability mass over semantically relevant regions.
- A three-stage schedule is adopted: Coarse (scales 1–3), Mid (scales 4–6), and Fine (scales 7–9), each with independent \(\rho\) parameters.
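The amplification step itself is simple enough to sketch. The following is a hedged PyTorch illustration of the top-\(\rho\) selection and positive logit biasing; the function name and tensor shapes are assumptions, and only the mechanism follows the description above.

```python
# Hedged illustration of top-rho selection and positive logit biasing.
import torch

def amplify_attention(logits, saliency, rho, beta):
    """logits: (heads, Q, K) pre-softmax attention scores at scale n.
    saliency: (K,) flattened saliency map M_n over the scale's K image tokens.
    Selects the top-rho fraction as the salient set S_n, adds a positive bias
    beta at those positions, and re-normalizes with softmax."""
    k = max(1, int(rho * saliency.numel()))
    salient_idx = saliency.topk(k).indices
    a = torch.zeros_like(saliency)           # binary indicator vector a_n
    a[salient_idx] = 1.0
    biased = logits + beta * a               # broadcasts over heads and queries
    return torch.softmax(biased, dim=-1)     # modified attention \tilde{alpha}_n
```

In practice this routine would be invoked with stage-specific values, e.g. one (\(\rho\), \(\beta\)) pair for the coarse scales (1–3), one for the mid scales (4–6), and one for the fine scales (7–9).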
Loss & Training¶
- The original token-level cross-entropy loss of VAR over all scales (teacher forcing).
- Classification loss \(\mathcal{L}_{cls}\) for the class tokens.
- Class token training requires only 5 epochs of fine-tuning, with negligible additional inference overhead.
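A compact sketch of how the two objectives could be combined during the 5-epoch fine-tuning is given below; the equal weighting of the two terms is an assumption, as the summary does not state a loss weight.

```python
# Illustrative combination of the VAR token loss and the class-token loss (assumed 1:1 weighting).
import torch.nn.functional as F

def hieramp_loss(token_logits, target_tokens, cls_logits_per_scale, labels):
    """token_logits: (B, T, V) next-scale token predictions under teacher forcing.
    target_tokens: (B, T) ground-truth codebook indices.
    cls_logits_per_scale: list of N tensors of shape (B, num_classes), one per scale.
    labels: (B,) image-level class labels."""
    var_loss = F.cross_entropy(token_logits.flatten(0, 1), target_tokens.flatten())
    cls_loss = sum(F.cross_entropy(l, labels) for l in cls_logits_per_scale)
    cls_loss = cls_loss / len(cls_logits_per_scale)   # the 1/N average in L_cls
    return var_loss + cls_loss
```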
Key Experimental Results¶
Main Results: Comparison with SOTA Methods (Table 1)¶
| Dataset | IPC | HierAmp (ResNet-18) | Competing Methods |
|---|---|---|---|
| CIFAR-10 | 10 | 44.3% | D3HR 41.3%, RDED 37.1% |
| CIFAR-100 | 10 | 52.0% | D3HR 49.4%, RDED 42.6% |
| ImageNet-Woof | 10 | 45.8% | CaO2 45.6%, RDED 38.5% |
| ImageNet-100 | 50 | 68.1% | CaO2 68.0%, RDED 61.6% |
| ImageNet-1K | 10 | 47.6% | CaO2 46.1%, D3HR 44.3% |
| ImageNet-1K | 50 | 60.8% | CaO2 60.0%, D3HR 59.4% |
| ImageNet-1K | 100 | 62.7% | D3HR 62.5% |
HierAmp achieves the highest accuracy in nearly all dataset and IPC settings, with a notably large margin of 1.5 percentage points over the second-best method CaO2 on ImageNet-1K at IPC=10.
Cross-Architecture Generalization (Table 2, ImageNet-1K IPC=10)¶
| Teacher → Student | HierAmp | RDED | D3HR |
|---|---|---|---|
| MobileNet-V2 → ResNet-18 | 46.2% | 34.4% | 43.4% |
| ResNet-18 → EfficientNet-B0 | 38.7% | 36.6% | 38.3% |
| EfficientNet-B0 → EfficientNet-B0 | 28.7% | 23.5% | 28.1% |
Across all tested teacher–student combinations, HierAmp's distilled data consistently surpasses both RDED and D3HR.
Ablation Study (Table 3, ImageNet-1K IPC=10)¶
- No amplification baseline: 45.6%
- Coarse-only amplification (\(\beta\)=5, \(\rho\)=50%): 47.6% (largest gain)
- Mid-only amplification: 46.9%
- Fine-only amplification: 46.5%
- Full-scale amplification: 47.6%
Key finding: coarse-scale amplification contributes the most, since it establishes the global structure and thereby shapes the semantic richness of all subsequent scales.
Token Distribution Analysis¶
- Coarse-scale amplification → increased token entropy and coverage (more diverse layout combinations).
- Fine-scale amplification → concentrated token usage (focus on class-relevant textural details).
- This symmetric effect explains why hierarchical amplification outperforms single-scale amplification.
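The entropy and coverage statistics behind this analysis are straightforward to compute over the VQ codebook indices of the distilled set; below is a hedged sketch using the standard definitions, not the paper's exact code.

```python
# Sketch of token entropy / coverage statistics over codebook usage (standard definitions).
import torch

def token_entropy_and_coverage(token_ids, vocab_size):
    """token_ids: 1-D LongTensor of codebook indices emitted at one scale across the
    distilled set. Returns (entropy in nats, fraction of codebook entries used)."""
    counts = torch.bincount(token_ids, minlength=vocab_size).float()
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    entropy = -(nonzero * nonzero.log()).sum()   # higher = more diverse token usage
    coverage = (counts > 0).float().mean()       # share of codebook entries hit at least once
    return entropy.item(), coverage.item()
```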
Highlights & Insights¶
- Novel perspective: The first work to analyze dataset distillation through the lens of hierarchical semantic amplification, revealing a symmetric effect between coarse-scale diversity and fine-scale focus.
- Elegant design: Only lightweight class tokens and positive logit biasing are required, with no external segmentation tools and minimal inference overhead.
- Strong interpretability: Token entropy/coverage analysis and attention visualization provide clear mechanistic explanations.
- Consistent SOTA: Comprehensively surpasses prior methods on CIFAR-10/100 and ImageNet-Woof/100/1K.
- Cross-architecture generalization: Distilled data performs stably across diverse teacher–student architecture combinations.
Limitations & Future Work¶
- The method depends on a pretrained VAR model and cannot be directly transferred to other generative frameworks (diffusion models, GANs, etc.).
- The stage-wise scheduling of \(\rho\) and \(\beta\) requires manual configuration, lacking an adaptive mechanism.
- Validation is limited to classification tasks; distillation for downstream tasks such as detection and segmentation remains unexplored.
- Class token training requires additional classification labels, making the approach incompatible with unsupervised distillation scenarios.
- Some ResNet-101 results in Table 1 (e.g., ImageNet-1K IPC=10) do not surpass D3HR.
Related Work & Insights¶
| Method | Base Model | Core Strategy | Limitation |
|---|---|---|---|
| RDED | No generative model | Crop informative regions from real images | Constrained by original data quality |
| D3HR | DDIM | Inversion + distribution matching | High computation cost at high resolution |
| CaO2 | Diffusion | Probabilistic sampling + latent code optimization | Long inference chain |
| Minimax | Diffusion | Minimax optimization | Limited scalability |
| HierAmp | VAR | Hierarchical semantic amplification | Relies on VAR pretraining |
Rating¶
- Novelty: ⭐⭐⭐⭐ — The hierarchical semantic amplification perspective is novel; the class token + logit bias design is clean and elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive evaluation across multiple datasets, IPC settings, architectures, ablations, and analyses.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure; the analysis section (token entropy/coverage) provides good interpretability.
- Value: ⭐⭐⭐⭐ — Offers a new hierarchical semantic understanding of dataset distillation with strong practical utility.