
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models

Conference: ACL 2026 · arXiv: 2604.19398 · Code: GitHub · Area: Model Compression & Efficient LLMs · Keywords: Structured Pruning, Global Budget, Gating Learning, KV Head Pruning, Projected STE

TL;DR

GRASPrune is a globally budget-constrained structured pruning framework that enforces a hard mask budget at every training step via a Projected Straight-Through Estimator (Projected STE) and jointly prunes FFN channels and KV head groups. On LLaMA-2-7B it reaches 12.18 Wiki PPL at 50% parameter retention after only 6 minutes of training on a single A100.

Method

Key Designs

  1. Global Budget Joint Pruning: FFN channels and KV head groups compete under a single global budget with heterogeneous unit costs: an FFN channel costs \(c_i = 1\), while a KV head group costs \(c_i = \alpha\) with \(\alpha = \frac{(2G+2)d_h}{3}\), where \(G\) is the number of query heads per KV group and \(d_h\) the head dimension (the K/V heads plus their \(G\) attached query heads, normalized by the 3 weight vectors an FFN channel occupies).

  2. Projected STE: At every step, projects the continuous gate probabilities \(\mathbf{p}\) onto the budget-feasible set of hard masks by greedily keeping units ranked by \(p_i\) (not \(p_i/c_i\)) until the budget is exhausted. The forward pass uses the hard mask \(m_i\); the backward pass routes gradients through the soft probability \(p_i\) via the straight-through estimator (see the first sketch after this list).

  3. Budget-Preserving Scale Calibration: After pruning, scalar multipliers \(\gamma_i\) recalibrate the retained units and are folded into the sliced weights, so inference incurs zero overhead (second sketch below).
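A minimal PyTorch sketch of the Projected STE step under heterogeneous unit costs. The function names, tensor shapes, and the `alpha_cost` helper are my assumptions for illustration; only the ranking rule (by \(p_i\), not \(p_i/c_i\)), the hard-forward/soft-backward behavior, and the cost formula come from the description above.

```python
import torch

def alpha_cost(G: int, d_h: int) -> float:
    """Relative cost of one KV head group vs. one FFN channel:
    alpha = (2G + 2) * d_h / 3 (K/V heads plus G query heads,
    over the 3 weight vectors an FFN channel occupies)."""
    return (2 * G + 2) * d_h / 3.0

def project_to_budget(p: torch.Tensor, costs: torch.Tensor,
                      budget: float) -> torch.Tensor:
    """Greedy projection: keep units in descending order of p_i
    (not p_i / c_i) while the cumulative cost fits the budget;
    returns a hard 0/1 mask with the same shape as p."""
    order = torch.argsort(p, descending=True)
    cum_cost = torch.cumsum(costs[order], dim=0)
    mask = torch.zeros_like(p)
    mask[order[cum_cost <= budget]] = 1.0
    return mask

def projected_ste(p: torch.Tensor, costs: torch.Tensor,
                  budget: float) -> torch.Tensor:
    """Forward: budget-feasible hard mask m. Backward: gradients
    flow through the soft probabilities p (straight-through)."""
    m = project_to_budget(p.detach(), costs, budget)
    return m + p - p.detach()

# Example: 11008 FFN channels (cost 1 each) and 8 KV head groups
# (cost alpha each) competing under one global budget.
p = torch.rand(11008 + 8, requires_grad=True)
costs = torch.cat([torch.ones(11008),
                   torch.full((8,), alpha_cost(G=4, d_h=128))])
m = projected_ste(p, costs, budget=0.5 * costs.sum().item())
```

Because the projection runs at every optimizer step, the gates never leave the feasible set, which is exactly the "learning under constraints" property highlighted below.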
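And a hedged sketch of the budget-preserving scale calibration fold, assuming a standard gated FFN layout. How the \(\gamma_i\) are fitted is not specified here, so the sketch only shows the zero-overhead folding of the scales into the sliced weights.

```python
import torch

@torch.no_grad()
def fold_scales_into_ffn(w_gate: torch.Tensor,   # [d_ffn, d_model]
                         w_up: torch.Tensor,     # [d_ffn, d_model]
                         w_down: torch.Tensor,   # [d_model, d_ffn]
                         keep: torch.Tensor,     # bool, [d_ffn]
                         gamma: torch.Tensor):   # [keep.sum()]
    """Slice away pruned FFN channels and absorb the per-channel
    calibration scales gamma_i into the down-projection columns,
    so the pruned model runs plain matmuls with zero extra ops."""
    w_gate_k = w_gate[keep]                # kept intermediate rows
    w_up_k = w_up[keep]
    w_down_k = w_down[:, keep] * gamma     # fold gamma_i per column
    return w_gate_k, w_up_k, w_down_k
```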

Key Experimental Results

| Retention | Method     | Wiki PPL ↓ |
|-----------|------------|------------|
| 50%       | LLM-Pruner | ~18        |
| 50%       | GRASPrune  | 12.18      |

Highlights & Insights

  • "Learning under constraints" vs "learning then constraining" — a deep insight addressing a widely overlooked issue in structured pruning
  • The extremely low training cost (6 minutes on a single A100) makes the method highly practical

Rating

  • Novelty: ⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐