GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models¶
Conference: ACL 2026 · arXiv: 2604.19398 · Code: GitHub · Area: Efficient NLP / Model Compression · Keywords: Structured Pruning, Global Budget, Gating Learning, KV Head Pruning, Projected STE
TL;DR¶
GRASPrune is a globally budget-constrained structured pruning framework. It enforces a hard mask-budget constraint at every training step via a Projected Straight-Through Estimator (Projected STE) and jointly prunes FFN channels and KV head groups. The method reaches 12.18 Wiki PPL at 50% parameter retention on LLaMA-2-7B after only 6 minutes of training on a single A100.
Method¶
Key Designs¶
- **Global Budget Joint Pruning**: FFN channels and KV head groups compete under a single global budget with heterogeneous unit costs: an FFN channel costs \(c_i = 1\), while a KV head group costs \(c_i = \alpha\), where \(\alpha = \frac{(2G+2)d_h}{3}\).
- **Projected STE**: At every step, the continuous gate probabilities \(\mathbf{p}\) are projected onto a budget-feasible hard mask by greedily ranking units by \(p_i\) (not \(p_i/c_i\)). The forward pass uses the hard mask \(m_i\); the backward pass flows gradients through the soft probability \(p_i\) via the straight-through estimator.
- **Budget-Preserving Scale Calibration**: After pruning, scalar multipliers \(\gamma_i\) are calibrated for the retained units and folded into the sliced weights, so there is zero inference overhead.
Key Experimental Results¶
On LLaMA-2-7B:

| Retention | Method | Wiki PPL ↓ |
|---|---|---|
| 50% | LLM-Pruner | ~18 |
| 50% | GRASPrune | 12.18 |
Highlights & Insights¶
- Frames pruning as "learning under constraints" rather than "learning, then constraining", addressing a widely overlooked issue in structured pruning
- Extremely low training cost (6 minutes on a single GPU) makes the method highly practical
Rating¶
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐