HiAP: A Multi-Granular Stochastic Auto-Pruning Framework for Vision Transformers¶
Conference: CVPR 2026 arXiv: 2603.12222 Code: N/A Area: Model Compression / Vision Transformer Pruning / Neural Architecture Search Keywords: Vision Transformer pruning, multi-granular structured pruning, Gumbel-Sigmoid gating, end-to-end subnet search, edge deployment
TL;DR¶
This paper proposes HiAP, a hierarchical Gumbel-Sigmoid gating framework that unifies macro-level (entire attention heads / FFN blocks) and micro-level (intra-head dimensions / FFN neurons) pruning decisions. Through a single end-to-end training pass, HiAP automatically discovers efficient ViT subnetworks satisfying a given compute budget, eliminating the need for manual importance ranking or multi-stage pipelines.
Background & Motivation¶
Root Cause¶
Background: ViTs incur substantial computational and memory overhead, making structured pruning the dominant compression approach.

Limitations of Prior Work: Existing methods suffer from two fundamental shortcomings. (1) They typically operate at a single granularity: purely micro-level pruning (e.g., ViT-Slim pruning intra-head dimensions) reduces FLOPs but still requires loading all weight matrices, leaving the memory-bandwidth bottleneck unresolved, while purely macro-level pruning (e.g., UPDP dropping entire blocks) sacrifices significant representational capacity. (2) They rely on complex multi-stage pipelines: pruning masks are first determined via hand-crafted heuristics (e.g., Taylor importance scores, graph ranking), followed by separate fine-tuning to recover accuracy, a cumbersome process requiring expert knowledge for hyperparameter tuning.
Starting Point¶
Goal: Can a network learn, within a single training run, what to prune and what to retain across multiple granularities — without requiring manually specified per-layer pruning ratios or importance metrics?
Method¶
Overall Architecture¶
HiAP introduces two-level learnable Gumbel-Sigmoid stochastic gates within each Transformer block of a ViT. During training, gate logits and network weights are jointly optimized; temperature annealing gradually drives the gates from soft continuous values toward hard binary decisions. Upon training completion, the physical subnetwork is directly extracted without any secondary fine-tuning.
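The stochastic gate and temperature schedule described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the exact exponential-decay formula is an assumption beyond the stated endpoints \(\tau_0=2.0\) and \(\tau_{min}=0.5\).

```python
import math
import random

def gumbel_sigmoid(logit, tau, rng=None):
    """Sample a soft gate in (0, 1) via the Gumbel-Sigmoid reparameterization."""
    rng = rng or random.Random(0)
    u = rng.uniform(1e-9, 1.0 - 1e-9)
    # Logistic noise equals the difference of two independent Gumbel samples.
    noise = math.log(u) - math.log(1.0 - u)
    return 1.0 / (1.0 + math.exp(-(logit + noise) / tau))

def anneal_tau(epoch, total_epochs=200, tau0=2.0, tau_min=0.5):
    """Exponentially decay the temperature from tau0 to tau_min over training."""
    return tau0 * (tau_min / tau0) ** (epoch / max(1, total_epochs - 1))
```

As `tau` shrinks toward `tau_min`, samples for a fixed logit concentrate near 0 or 1, so the soft gates approach the hard binary decisions extracted at the end of training.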
Key Designs¶
- Hierarchical Gating Mechanism: Macro-level gates \(g_{l,h}\) and \(b_l\) control the retention or removal of entire attention heads and FFN blocks; micro-level gates \(d_{l,h,j}\) and \(c_{l,k}\) selectively prune intra-head dimensions and FFN neurons within surviving macro-level structures. Micro-level gates are conditioned on macro-level gates — when a macro gate is closed, all corresponding micro gates are automatically deactivated.
- Differentiable Compute Modeling: The MAC cost of the network is linearly decomposed as \(\mathbb{E}[C(\mathcal{G})] = \sum_l \sum_h (C_1 \cdot \mathbb{E}[g_{l,h}] + C_2 \sum_j \mathbb{E}[g_{l,h} \cdot d_{l,h,j}]) + \sum_l \sum_k C_3 \cdot \mathbb{E}[b_l \cdot c_{l,k}]\), where \(C_1\) denotes macro-level attention cost and \(C_2\), \(C_3\) denote per-dimension and per-neuron micro-level costs, enabling the optimizer to precisely attribute hardware cost to each structural component.
- Decoupled Cost Penalties: Macro-level (\(\mathcal{L}_{macro}\)) and micro-level (\(\mathcal{L}_{micro}\)) compute penalties are separated and controlled by independent hyperparameters, allowing explicit management of the coarse-to-fine sparsity trade-off.
- Structural Feasibility Constraints: \(\mathcal{L}_{feasibility}\) is introduced to prevent layer collapse — each layer is required to retain a minimum number of attention heads and a minimum proportion of FFN neurons, enforced via ReLU-based threshold penalties.
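The expected-MAC decomposition above can be sketched as follows. For illustration the joint expectations \(\mathbb{E}[g \cdot d]\) are taken as products of marginal keep-probabilities, though Proposition 1 shows the decomposition does not require gate independence; the cost constants and argument names are hypothetical.

```python
def expected_macs(head_p, head_dim_p, ffn_block_p, neuron_p, C1, C2, C3):
    """Expected MAC cost under the hierarchical gate distribution.

    head_p[l][h]        : P(attention head h in layer l is kept)  (macro gate g)
    head_dim_p[l][h][j] : P(intra-head dimension j is kept)       (micro gate d)
    ffn_block_p[l]      : P(FFN block in layer l is kept)          (macro gate b)
    neuron_p[l][k]      : P(FFN neuron k is kept)                  (micro gate c)
    """
    total = 0.0
    for l, heads in enumerate(head_p):
        for h, g in enumerate(heads):
            # Macro head cost plus expected cost of surviving intra-head dims.
            total += C1 * g + C2 * sum(g * d for d in head_dim_p[l][h])
    for l, b in enumerate(ffn_block_p):
        # Expected cost of FFN neurons gated by the block-level macro gate.
        total += C3 * sum(b * c for c in neuron_p[l])
    return total
```

Because the cost is linear in the gate expectations, its gradient attributes each unit of compute to an individual head, dimension, block, or neuron, which is what lets the optimizer trade structures against the budget directly.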
Loss & Training¶
\[
\mathcal{L}_{total} = \mathcal{L}_{task} + \lambda_{macro}\mathcal{L}_{macro} + \lambda_{micro}\mathcal{L}_{micro} + \mathcal{L}_{feasibility}
\]
- \(\mathcal{L}_{task}\): cross-entropy loss combined with knowledge distillation (using pretrained DeiT-Small as teacher, \(\alpha_{KD}=0.7\), \(T=4.0\))
- Gumbel-Sigmoid temperature is exponentially annealed from \(\tau_0=2.0\) to \(\tau_{min}=0.5\) over 200 training epochs, using the AdamW optimizer with lr = 5e-5
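A sketch of how the decoupled penalties and the feasibility constraint assemble into the total objective. The minimum-structure thresholds and \(\lambda\) defaults below are placeholder values, not the paper's settings.

```python
def feasibility_penalty(head_probs, neuron_probs, min_heads=1.0, min_neuron_frac=0.1):
    """ReLU-style threshold penalty preventing layer collapse.

    Penalizes a layer whose expected retained heads fall below min_heads,
    or whose expected retained-neuron fraction falls below min_neuron_frac.
    """
    head_deficit = max(0.0, min_heads - sum(head_probs))
    neuron_deficit = max(0.0, min_neuron_frac - sum(neuron_probs) / len(neuron_probs))
    return head_deficit + neuron_deficit

def total_loss(task, macro_cost, micro_cost, feas, lam_macro=2.0, lam_micro=1.0):
    """Decoupled objective: lambda values control the coarse-to-fine trade-off."""
    return task + lam_macro * macro_cost + lam_micro * micro_cost + feas
```

Keeping `lam_macro` and `lam_micro` as independent knobs is what allows the macro/micro sparsity balance explored in the ablation (e.g., the 2:1 ratio) to be set explicitly.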
Key Experimental Results¶
| Method | Params (M) | MACs (G) | Top-1 Acc (%) | Δ Acc |
|---|---|---|---|---|
| DeiT-Small (dense) | - | 4.6 | 79.85 | - |
| ViT-Slim | 15.6 | 3.1 | 79.90 | +0.05 |
| GOHSP | 14.4 | 3.0 | 79.98 | +0.13 |
| S2ViT | 15.3 | 3.1 | 79.22 | −0.63 |
| WDPruning | 15.0 | 3.1 | 78.55 | −1.30 |
| HiAP | 15.0 | 3.1 | 79.10 | −0.75 |
| HiAP (aggressive) | 12.3 | 2.5 | 77.95 | −1.90 |
- HiAP achieves 79.1% accuracy at 3.1G MACs (~33% compression) via single-stage training without any multi-stage pipeline.
- On CIFAR-10: 87.56% under 33% compression, outperforming Uniform-Ratio (86.63%) and \(\ell_1\)-Structured (87.15%).
- Measured inference latency is reduced from 5.57 ms to 3.86 ms (1.44× speedup), confirming that the extracted physical subnetwork is immediately deployable.
Ablation Study¶
- Ablation over macro-to-micro penalty ratios (2:1, 5:1, 1.5:1, macro-only, micro-only): 2:1 yields the best Pareto balance for DeiT-Small.
- A macro-only penalty causes a large number of heads to be removed while FFN neurons remain largely untouched, resulting in imbalanced sparsity.
- A micro-only penalty inadvertently leads to the elimination of entire MLP blocks — an unintended depth-pruning effect.
- The network autonomously identifies the last FFN block as entirely redundant (\(b_{12}=0\)) without any manual specification.
Highlights & Insights¶
- The framework design is notably clean: hierarchical gating + differentiable budget constraint + feasibility regularization together constitute a principled solution to automated pruning.
- Single-stage training simplicity is the central contribution — compared to methods such as GOHSP and NViT that require graph ranking and multi-round evaluation, engineering complexity is substantially reduced.
- The decoupled cost penalty design renders pruning behavior controllable and interpretable; visualizations of structural evolution under different penalty ratios are highly informative.
- Proposition 1 proves that the linear decomposition of the differentiable budget constraint does not require a gate independence assumption, providing theoretical rigor.
Limitations & Future Work¶
- The objective optimizes only MACs and does not model actual latency or energy consumption; the paper itself acknowledges the MACs-to-latency gap as the primary limitation.
- Accuracy on ImageNet remains below GOHSP and ViT-Slim (79.1% vs. 79.98%), with the trade-off being a simplified pipeline.
- Validation is limited to classification tasks; applicability to dense prediction tasks such as detection and segmentation remains unexplored.
- The temperature annealing schedule (\(\tau_0\), \(\tau_{min}\)) requires tuning, and sensitivity to different model/task configurations has not been thoroughly analyzed.
Related Work & Insights¶
- ViT-Slim: Micro-level pruning only (intra-head dimensions + FFN), requiring sparsity ranking and threshold determination. HiAP generates decisions end-to-end within a unified multi-granular framework.
- GOHSP: Uses graph ranking to determine head importance and optimizes pruning combinations through a complex pipeline requiring expert tuning. HiAP allows the network to learn these decisions autonomously via Gumbel gating.
- UPDP: Operates solely at the macro level (FFN block granularity) using a genetic algorithm for search, in contrast to HiAP's differentiable search paradigm.
Transferable Insights¶
- The Gumbel-Sigmoid gating + temperature annealing paradigm is broadly reusable in any setting that requires learning discrete structural selections.
- The idea of decoupling macro- and micro-level costs can be generalized to independent control of different resource dimensions in NAS.
Rating¶
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐