HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers¶
Conference: CVPR 2026 arXiv: 2603.12222 Code: N/A Area: Model Compression / Vision Transformer Pruning / Neural Architecture Search Keywords: Vision Transformer pruning, multi-granular structured pruning, Gumbel-Sigmoid gating, end-to-end subnet search, edge deployment
TL;DR¶
This paper proposes HiAP, a hierarchical Gumbel-Sigmoid gating framework that unifies macro-level (entire attention heads / FFN blocks) and micro-level (intra-head dimensions / FFN neurons) pruning decisions. Through a single end-to-end training pass, HiAP automatically discovers efficient ViT subnetworks satisfying a given compute budget, eliminating the need for manual importance ranking or multi-stage pipelines.
Background & Motivation¶
Root Cause¶
Background: ViTs incur substantial computational and memory overhead, making structured pruning the dominant compression approach.

Limitations of Prior Work: Existing methods suffer from two fundamental shortcomings. (1) They typically operate at a single granularity: purely micro-level pruning (e.g., ViT-Slim pruning intra-head dimensions) reduces FLOPs but still requires loading all weight matrices, leaving the memory-bandwidth bottleneck unresolved, while purely macro-level pruning (e.g., UPDP dropping entire blocks) sacrifices significant representational capacity. (2) They rely on complex multi-stage pipelines: pruning masks are first determined via hand-crafted heuristics (e.g., Taylor importance scores, graph ranking), followed by separate fine-tuning to recover accuracy, a cumbersome process requiring expert knowledge for hyperparameter tuning.
Starting Point¶
Goal: Can a network learn, within a single training run, what to prune and what to retain across multiple granularities — without requiring manually specified per-layer pruning ratios or importance metrics?
Method¶
Overall Architecture¶
HiAP introduces two-level learnable Gumbel-Sigmoid stochastic gates within each Transformer block of a ViT. During training, gate logits and network weights are jointly optimized; temperature annealing gradually drives the gates from soft continuous values toward hard binary decisions. Upon training completion, the physical subnetwork is directly extracted without any secondary fine-tuning.
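The stochastic gate and temperature schedule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the exact exponential-decay formula is an assumption beyond the stated endpoints \(\tau_0=2.0\) and \(\tau_{min}=0.5\).

```python
import math
import random

def gumbel_sigmoid(logit, tau, rng=None):
    """Sample a soft gate in (0, 1) via the Gumbel-Sigmoid reparameterization."""
    rng = rng or random.Random(0)
    u = rng.uniform(1e-9, 1.0 - 1e-9)
    # Logistic noise equals the difference of two independent Gumbel samples.
    noise = math.log(u) - math.log(1.0 - u)
    return 1.0 / (1.0 + math.exp(-(logit + noise) / tau))

def anneal_tau(epoch, total_epochs=200, tau0=2.0, tau_min=0.5):
    """Exponentially decay the temperature from tau0 to tau_min over training."""
    return tau0 * (tau_min / tau0) ** (epoch / max(1, total_epochs - 1))
```

As `tau` shrinks toward `tau_min`, samples for a fixed logit concentrate near 0 or 1, so the soft gates approach the hard binary decisions extracted at the end of training.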
Key Designs¶
- Hierarchical Gating Mechanism: Macro-level gates \(g_{l,h}\) and \(b_l\) control the retention or removal of entire attention heads and FFN blocks; micro-level gates \(d_{l,h,j}\) and \(c_{l,k}\) selectively prune intra-head dimensions and FFN neurons within surviving macro-level structures. Micro-level gates are conditioned on macro-level gates — when a macro gate is closed, all corresponding micro gates are automatically deactivated.
- Differentiable Compute Modeling: The MAC cost of the network is linearly decomposed as \(\mathbb{E}[C(\mathcal{G})] = \sum_l \sum_h (C_1 \cdot \mathbb{E}[g_{l,h}] + C_2 \sum_j \mathbb{E}[g_{l,h} \cdot d_{l,h,j}]) + \sum_l \sum_k C_3 \cdot \mathbb{E}[b_l \cdot c_{l,k}]\), where \(C_1\) denotes macro-level attention cost and \(C_2\), \(C_3\) denote per-dimension and per-neuron micro-level costs, enabling the optimizer to precisely attribute hardware cost to each structural component.
- Decoupled Cost Penalties: Macro-level (\(\mathcal{L}_{macro}\)) and micro-level (\(\mathcal{L}_{micro}\)) compute penalties are separated and controlled by independent hyperparameters, allowing explicit management of the coarse-to-fine sparsity trade-off.
- Structural Feasibility Constraints: \(\mathcal{L}_{feasibility}\) is introduced to prevent layer collapse — each layer is required to retain a minimum number of attention heads and a minimum proportion of FFN neurons, enforced via ReLU-based threshold penalties.
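The expected-MAC decomposition above can be sketched as follows. For illustration the joint expectations \(\mathbb{E}[g \cdot d]\) are taken as products of marginal keep-probabilities, though Proposition 1 shows the decomposition does not require gate independence; the cost constants and argument names are hypothetical.

```python
def expected_macs(head_p, head_dim_p, ffn_block_p, neuron_p, C1, C2, C3):
    """Expected MAC cost under the hierarchical gate distribution.

    head_p[l][h]        : P(attention head h in layer l is kept)  (macro gate g)
    head_dim_p[l][h][j] : P(intra-head dimension j is kept)       (micro gate d)
    ffn_block_p[l]      : P(FFN block in layer l is kept)          (macro gate b)
    neuron_p[l][k]      : P(FFN neuron k is kept)                  (micro gate c)
    """
    total = 0.0
    for l, heads in enumerate(head_p):
        for h, g in enumerate(heads):
            # Macro head cost plus expected cost of surviving intra-head dims.
            total += C1 * g + C2 * sum(g * d for d in head_dim_p[l][h])
    for l, b in enumerate(ffn_block_p):
        # Expected cost of FFN neurons gated by the block-level macro gate.
        total += C3 * sum(b * c for c in neuron_p[l])
    return total
```

Because the cost is linear in the gate expectations, its gradient attributes each unit of compute to an individual head, dimension, block, or neuron, which is what lets the optimizer trade structures against the budget directly.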
Loss & Training¶
\[
\mathcal{L}_{total} = \mathcal{L}_{task} + \lambda_{macro}\mathcal{L}_{macro} + \lambda_{micro}\mathcal{L}_{micro} + \mathcal{L}_{feasibility}
\]
- \(\mathcal{L}_{task}\): cross-entropy loss combined with knowledge distillation (using pretrained DeiT-Small as teacher, \(\alpha_{KD}=0.7\), \(T=4.0\))
- Gumbel-Sigmoid temperature is exponentially annealed from \(\tau_0=2.0\) to \(\tau_{min}=0.5\) over 200 training epochs, using the AdamW optimizer with lr = 5e-5
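A sketch of how the decoupled penalties and the feasibility constraint assemble into the total objective. The minimum-structure thresholds and \(\lambda\) defaults below are placeholder values, not the paper's settings.

```python
def feasibility_penalty(head_probs, neuron_probs, min_heads=1.0, min_neuron_frac=0.1):
    """ReLU-style threshold penalty preventing layer collapse.

    Penalizes a layer whose expected retained heads fall below min_heads,
    or whose expected retained-neuron fraction falls below min_neuron_frac.
    """
    head_deficit = max(0.0, min_heads - sum(head_probs))
    neuron_deficit = max(0.0, min_neuron_frac - sum(neuron_probs) / len(neuron_probs))
    return head_deficit + neuron_deficit

def total_loss(task, macro_cost, micro_cost, feas, lam_macro=2.0, lam_micro=1.0):
    """Decoupled objective: lambda values control the coarse-to-fine trade-off."""
    return task + lam_macro * macro_cost + lam_micro * micro_cost + feas
```

Keeping `lam_macro` and `lam_micro` as independent knobs is what allows the macro/micro sparsity balance explored in the ablation (e.g., the 2:1 ratio) to be set explicitly.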
Key Experimental Results¶
| Method | Params (M) | MACs (G) | Top-1 Acc (%) | Δ Acc |
|---|---|---|---|---|
| DeiT-Small (dense) | - | 4.6 | 79.85 | - |
| ViT-Slim | 15.6 | 3.1 | 79.90 | +0.05 |
| GOHSP | 14.4 | 3.0 | 79.98 | +0.13 |
| S2ViT | 15.3 | 3.1 | 79.22 | −0.63 |
| WDPruning | 15.0 | 3.1 | 78.55 | −1.30 |
| HiAP | 15.0 | 3.1 | 79.10 | −0.75 |
| HiAP (aggressive) | 12.3 | 2.5 | 77.95 | −1.90 |
- HiAP achieves 79.1% accuracy at 3.1G MACs (~33% compression) via single-stage training without any multi-stage pipeline.
- On CIFAR-10: 87.56% under 33% compression, outperforming Uniform-Ratio (86.63%) and \(\ell_1\)-Structured (87.15%).
- Measured inference latency is reduced from 5.57 ms to 3.86 ms (1.44× speedup), confirming that the extracted physical subnetwork is immediately deployable.
Ablation Study¶
- Ablation over macro-to-micro penalty ratios (2:1, 5:1, 1.5:1, macro-only, micro-only): 2:1 yields the best Pareto balance for DeiT-Small.
- A macro-only penalty causes a large number of heads to be removed while FFN neurons remain largely untouched, resulting in imbalanced sparsity.
- A micro-only penalty inadvertently leads to the elimination of entire MLP blocks — an unintended depth-pruning effect.
- The network autonomously identifies the last FFN block as entirely redundant (\(b_{12}=0\)) without any manual specification.
Highlights & Insights¶
- The framework design is notably clean: hierarchical gating + differentiable budget constraint + feasibility regularization together constitute a principled solution to automated pruning.
- Single-stage training simplicity is the central contribution — compared to methods such as GOHSP and NViT that require graph ranking and multi-round evaluation, engineering complexity is substantially reduced.
- The decoupled cost penalty design renders pruning behavior controllable and interpretable; visualizations of structural evolution under different penalty ratios are highly informative.
- Proposition 1 proves that the linear decomposition of the differentiable budget constraint does not require a gate independence assumption, providing theoretical rigor.
Limitations & Future Work¶
- The objective optimizes only MACs and does not model actual latency or energy consumption; the paper itself acknowledges the MACs-to-latency gap as the primary limitation.
- Accuracy on ImageNet remains below GOHSP and ViT-Slim (79.1% vs. 79.98%), with the trade-off being a simplified pipeline.
- Validation is limited to classification tasks; applicability to dense prediction tasks such as detection and segmentation remains unexplored.
- The temperature annealing schedule (\(\tau_0\), \(\tau_{min}\)) requires tuning, and sensitivity to different model/task configurations has not been thoroughly analyzed.
Related Work & Insights¶
- ViT-Slim: Micro-level pruning only (intra-head dimensions + FFN), requiring sparsity ranking and threshold determination. HiAP generates decisions end-to-end within a unified multi-granular framework.
- GOHSP: Uses graph ranking to determine head importance and optimizes pruning combinations through a complex pipeline requiring expert tuning. HiAP allows the network to learn these decisions autonomously via Gumbel gating.
- UPDP: Operates solely at the macro level (FFN block granularity) using a genetic algorithm for search, in contrast to HiAP's differentiable search paradigm.
Transferable Insights¶
- The Gumbel-Sigmoid gating + temperature annealing paradigm is broadly reusable in any setting that requires learning discrete structural selections.
- The idea of decoupling macro- and micro-level costs can be generalized to independent control of different resource dimensions in NAS.
Rating¶
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐