GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models¶
Conference: ACL 2026 · arXiv: 2604.19398 · Code: GitHub · Area: Efficient NLP / Model Compression · Keywords: Structured Pruning, Global Budget, Gating Learning, KV Head Pruning, Projected STE
TL;DR¶
GRASPrune is a globally budget-constrained structured pruning framework. It enforces a hard mask-budget constraint at every training step via a Projected Straight-Through Estimator (Projected STE) and jointly prunes FFN channels and KV head groups. The method reaches 12.18 Wiki PPL at 50% parameter retention on LLaMA-2-7B after only 6 minutes of training on a single A100.
Method¶
Key Designs¶
- **Global Budget Joint Pruning**: FFN channels and KV head groups compete under a single global budget with heterogeneous unit costs: an FFN channel costs \(c_i = 1\), while a KV head group costs \(c_i = \alpha\), where \(\alpha = \frac{(2G+2)d_h}{3}\).
- **Projected STE**: At every step, the continuous gate probabilities \(\mathbf{p}\) are projected onto a budget-feasible hard mask by greedily ranking units by \(p_i\) (not \(p_i/c_i\)). The forward pass uses the hard mask \(m_i\); the backward pass flows gradients through the soft probability \(p_i\) via the straight-through estimator.
- **Budget-Preserving Scale Calibration**: After pruning, scalar multipliers \(\gamma_i\) are calibrated for the retained units and folded into the sliced weights, so there is zero inference overhead.
Key Experimental Results¶
On LLaMA-2-7B:

| Retention | Method | Wiki PPL ↓ |
|---|---|---|
| 50% | LLM-Pruner | ~18 |
| 50% | GRASPrune | 12.18 |
Highlights & Insights¶
- Frames pruning as "learning under constraints" rather than "learning, then constraining", addressing a widely overlooked issue in structured pruning
- Extremely low training cost (6 minutes on a single GPU) makes the method highly practical
Rating¶
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐