HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers

Conference: CVPR 2026
arXiv: 2603.12222
Code: N/A
Area: Model Compression / Vision Transformer Pruning / Neural Architecture Search
Keywords: Vision Transformer pruning, multi-granular structured pruning, Gumbel-Sigmoid gating, end-to-end subnet search, edge deployment

TL;DR

This paper proposes HiAP, a hierarchical Gumbel-Sigmoid gating framework that unifies macro-level (entire attention heads / FFN blocks) and micro-level (intra-head dimensions / FFN neurons) pruning decisions. Through a single end-to-end training pass, HiAP automatically discovers efficient ViT subnetworks satisfying a given compute budget, eliminating the need for manual importance ranking or multi-stage pipelines.

Background & Motivation

Root Cause

Background: ViTs incur substantial computational and memory overhead, making structured pruning the dominant compression approach.

Limitations of Prior Work: Existing methods suffer from two fundamental shortcomings:

  1. They typically operate at a single granularity. Purely micro-level pruning (e.g., ViT-Slim pruning intra-head dimensions) reduces FLOPs but still requires loading all weight matrices, leaving the memory-bandwidth bottleneck unresolved; purely macro-level pruning (e.g., UPDP dropping entire blocks) leads to significant loss of representational capacity.
  2. They rely on complex multi-stage pipelines: pruning masks are first determined via hand-crafted heuristics (e.g., Taylor importance scores, graph ranking), followed by separate fine-tuning to recover accuracy, a cumbersome process requiring expert knowledge for hyperparameter tuning.

Starting Point

Goal: Can a network learn, within a single training run, what to prune and what to retain across multiple granularities — without requiring manually specified per-layer pruning ratios or importance metrics?

Method

Overall Architecture

HiAP introduces two-level learnable Gumbel-Sigmoid stochastic gates within each Transformer block of a ViT. During training, gate logits and network weights are jointly optimized; temperature annealing gradually drives the gates from soft continuous values toward hard binary decisions. Upon training completion, the physical subnetwork is directly extracted without any secondary fine-tuning.
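
A minimal PyTorch sketch of one such stochastic gate is given below; the class and argument names (GumbelSigmoidGate, init_logit, tau) are illustrative rather than taken from the paper, whose code is not released.

```python
import torch
import torch.nn as nn

class GumbelSigmoidGate(nn.Module):
    """Learnable stochastic binary gate (illustrative sketch, not the official code).

    Training uses a Gumbel-Sigmoid relaxation so gate decisions stay differentiable;
    as the temperature tau is annealed down, samples concentrate near 0/1 and the
    surviving structure can be read off directly after training.
    """

    def __init__(self, num_gates: int, init_logit: float = 2.0):
        super().__init__()
        # One logit per prunable structure (attention head, head dimension, FFN neuron, ...).
        self.logits = nn.Parameter(torch.full((num_gates,), init_logit))

    def forward(self, tau: float, hard: bool = False) -> torch.Tensor:
        if self.training:
            # Logistic noise (difference of two Gumbel samples) injects stochasticity.
            u = torch.rand_like(self.logits).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log1p(-u)
            soft = torch.sigmoid((self.logits + noise) / tau)
        else:
            soft = torch.sigmoid(self.logits / tau)
        if hard:
            # Straight-through estimator: hard 0/1 mask forward, soft gradient backward.
            return (soft > 0.5).float() + soft - soft.detach()
        return soft
```

In this sketch, a per-layer instance such as GumbelSigmoidGate(num_heads) would produce the macro head mask \(g_{l,h}\), while analogous instances over intra-head dimensions and FFN neurons would produce the micro masks.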

Key Designs

  1. Hierarchical Gating Mechanism: Macro-level gates \(g_{l,h}\) and \(b_l\) control the retention or removal of entire attention heads and FFN blocks; micro-level gates \(d_{l,h,j}\) and \(c_{l,k}\) selectively prune intra-head dimensions and FFN neurons within surviving macro-level structures. Micro-level gates are conditioned on macro-level gates — when a macro gate is closed, all corresponding micro gates are automatically deactivated.
  2. Differentiable Compute Modeling: The MAC cost of the network is linearly decomposed as \(\mathbb{E}[C(\mathcal{G})] = \sum_l \sum_h (C_1 \cdot \mathbb{E}[g_{l,h}] + C_2 \sum_j \mathbb{E}[g_{l,h} \cdot d_{l,h,j}]) + \sum_l \sum_k C_3 \cdot \mathbb{E}[b_l \cdot c_{l,k}]\), where \(C_1\) denotes macro-level attention cost and \(C_2\), \(C_3\) denote per-dimension and per-neuron micro-level costs, enabling the optimizer to precisely attribute hardware cost to each structural component (see the sketch after this list).
  3. Decoupled Cost Penalties: Macro-level (\(\mathcal{L}_{macro}\)) and micro-level (\(\mathcal{L}_{micro}\)) compute penalties are separated and controlled by independent hyperparameters, allowing explicit management of the coarse-to-fine sparsity trade-off.
  4. Structural Feasibility Constraints: \(\mathcal{L}_{feasibility}\) is introduced to prevent layer collapse — each layer is required to retain a minimum number of attention heads and a minimum proportion of FFN neurons, enforced via ReLU-based threshold penalties.
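
The sketch below illustrates designs 2 and 4: the expected-MAC budget term and the ReLU-based feasibility penalty. It assumes gate-probability tensors shaped [layers, heads], [layers, heads, head_dim], [layers], and [layers, ffn_dim], and approximates \(\mathbb{E}[g \cdot d]\) by the product of marginal gate probabilities for simplicity (the paper's Proposition 1 covers the general case); all function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def expected_mac_cost(g, d, b, c, C1: float, C2: float, C3: float) -> torch.Tensor:
    """Differentiable expected MAC cost E[C(G)] (illustrative sketch).

    g: [L, H]      macro gate probabilities for attention heads
    d: [L, H, Dh]  micro gate probabilities for intra-head dimensions
    b: [L]         macro gate probabilities for FFN blocks
    c: [L, Df]     micro gate probabilities for FFN neurons
    C1 / C2 / C3: per-head, per-dimension, and per-neuron MAC constants.
    Micro gates are coupled to their macro gate via the products g*d and b*c,
    mirroring the hierarchical conditioning described above.
    """
    attn_cost = (C1 * g).sum() + (C2 * g.unsqueeze(-1) * d).sum()
    ffn_cost = (C3 * b.unsqueeze(-1) * c).sum()
    return attn_cost + ffn_cost

def feasibility_penalty(g, c, min_heads: int = 1, min_neuron_ratio: float = 0.1) -> torch.Tensor:
    """ReLU-based threshold penalties that keep every layer above a structural floor."""
    heads_kept = g.sum(dim=1)        # expected number of surviving heads per layer
    neuron_ratio = c.mean(dim=1)     # expected fraction of surviving FFN neurons per layer
    head_pen = F.relu(min_heads - heads_kept).sum()
    neuron_pen = F.relu(min_neuron_ratio - neuron_ratio).sum()
    return head_pen + neuron_pen
```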

Loss & Training

\(\mathcal{L}_{total} = \mathcal{L}_{task} + \lambda_{macro}\mathcal{L}_{macro} + \lambda_{micro}\mathcal{L}_{micro} + \mathcal{L}_{feasibility}\)

  • \(\mathcal{L}_{task}\): cross-entropy loss combined with knowledge distillation (using pretrained DeiT-Small as teacher, \(\alpha_{KD}=0.7\), \(T=4.0\)).
  • The Gumbel-Sigmoid temperature is exponentially annealed from \(\tau_0=2.0\) to \(\tau_{min}=0.5\) over 200 training epochs, using the AdamW optimizer with lr = 5e-5.
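
A compact sketch of the loss assembly and temperature schedule follows; the exact exponential interpolation and the way \(\alpha_{KD}\) mixes the cross-entropy and distillation terms are plausible assumptions for illustration rather than details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def annealed_tau(epoch: int, total_epochs: int = 200,
                 tau0: float = 2.0, tau_min: float = 0.5) -> float:
    """Exponential interpolation of the Gumbel-Sigmoid temperature from tau0 down to tau_min."""
    ratio = min(epoch / total_epochs, 1.0)
    return tau0 * (tau_min / tau0) ** ratio

def total_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
               targets: torch.Tensor, macro_cost: torch.Tensor,
               micro_cost: torch.Tensor, feas_pen: torch.Tensor,
               lambda_macro: float, lambda_micro: float,
               alpha_kd: float = 0.7, T: float = 4.0) -> torch.Tensor:
    """L_total = L_task + lambda_macro*L_macro + lambda_micro*L_micro + L_feasibility."""
    ce = F.cross_entropy(student_logits, targets)
    # Hinton-style distillation term, scaled by T^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    task = (1.0 - alpha_kd) * ce + alpha_kd * kd
    return task + lambda_macro * macro_cost + lambda_micro * micro_cost + feas_pen
```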

Key Experimental Results

| Method | Params (M) | MACs (G) | Top-1 Acc (%) | Δ Acc |
| --- | --- | --- | --- | --- |
| DeiT-Small (dense) | - | 4.6 | 79.85 | - |
| ViT-Slim | 15.6 | 3.1 | 79.90 | +0.05 |
| GOHSP | 14.4 | 3.0 | 79.98 | +0.13 |
| S2ViT | 15.3 | 3.1 | 79.22 | −0.63 |
| WDPruning | 15.0 | 3.1 | 78.55 | −1.30 |
| HiAP | 15.0 | 3.1 | 79.10 | −0.75 |
| HiAP (aggressive) | 12.3 | 2.5 | 77.95 | −1.90 |
  • HiAP achieves 79.1% accuracy at 3.1G MACs (~33% compression) via single-stage training without any multi-stage pipeline.
  • On CIFAR-10: 87.56% under 33% compression, outperforming Uniform-Ratio (86.63%) and \(\ell_1\)-Structured (87.15%).
  • Measured inference latency is reduced from 5.57 ms to 3.86 ms (1.44× speedup), confirming that the extracted physical subnetwork is immediately deployable.

Ablation Study

  • Ablation over macro-to-micro penalty ratios (2:1, 5:1, 1.5:1, macro-only, micro-only): 2:1 yields the best Pareto balance for DeiT-Small.
  • A macro-only penalty causes a large number of heads to be removed while FFN neurons remain largely untouched, resulting in imbalanced sparsity.
  • A micro-only penalty inadvertently leads to the elimination of entire MLP blocks — an unintended depth-pruning effect.
  • The network autonomously identifies the last FFN block as entirely redundant (\(b_{12}=0\)) without any manual specification.

Highlights & Insights

  • The framework design is notably clean: hierarchical gating + differentiable budget constraint + feasibility regularization together constitute a principled solution to automated pruning.
  • The simplicity of single-stage training is the central contribution: compared to methods such as GOHSP and NViT, which require graph ranking and multi-round evaluation, HiAP substantially reduces engineering complexity.
  • The decoupled cost penalty design renders pruning behavior controllable and interpretable; visualizations of structural evolution under different penalty ratios are highly informative.
  • Proposition 1 proves that the linear decomposition of the differentiable budget constraint does not require a gate independence assumption, providing theoretical rigor.

Limitations & Future Work

  • The objective optimizes only MACs and does not model actual latency or energy consumption; the paper itself acknowledges the MACs-to-latency gap as the primary limitation.
  • Accuracy on ImageNet remains below GOHSP (79.98%) and ViT-Slim (79.90%) at 79.10%, with the trade-off being a simplified pipeline.
  • Validation is limited to classification tasks; applicability to dense prediction tasks such as detection and segmentation remains unexplored.
  • The temperature annealing schedule (\(\tau_0\), \(\tau_{min}\)) requires tuning, and sensitivity to different model/task configurations has not been thoroughly analyzed.

Comparison with Related Methods

  • ViT-Slim: Micro-level pruning only (intra-head dimensions + FFN), requiring sparsity ranking and threshold determination. HiAP generates decisions end-to-end within a unified multi-granular framework.
  • GOHSP: Uses graph ranking to determine head importance and optimizes pruning combinations through a complex pipeline requiring expert tuning. HiAP allows the network to learn these decisions autonomously via Gumbel gating.
  • UPDP: Operates solely at the macro level (FFN block granularity) using a genetic algorithm for search, in contrast to HiAP's differentiable search paradigm.

Transferable Ideas

  • The Gumbel-Sigmoid gating + temperature annealing paradigm is broadly reusable in any setting that requires learning discrete structural selections.
  • The idea of decoupling macro- and micro-level costs can be generalized to independent control of different resource dimensions in NAS.

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐
  • Value: ⭐⭐⭐