Wanda++: Pruning Large Language Models via Regional Gradients
Conference: ACL 2025
arXiv: 2503.04992
Code: None
Area: Model Compression
TL;DR
Wanda++ is a lightweight LLM pruning framework built on decoder-block-level regional gradients. It improves the pruning criterion with a Regional Gradient Score (RGS) and minimizes the output discrepancy between dense and sparse blocks via Regional Optimization (RO). Under 2:4 sparsity, it reduces WikiText perplexity by up to 32% relative to Wanda, and it prunes a 7B model in under 10 minutes on a single H100 GPU.
Background & Motivation
- LLM Inference Bottleneck: Large language models have very large parameter counts (e.g., the FP16 weights of LLaMA-2 70B alone occupy about 140 GB of VRAM), which leads to high inference latency and makes model compression a pressing need.
- High Accuracy Loss of Existing Pruning Methods: Post-training pruning methods such as SparseGPT and Wanda are efficient, but they degrade severely under 2:4 semi-structured sparsity and lag far behind quantization methods such as AWQ, which achieves nearly lossless 4x compression.
- Gradient Information is Valuable but Expensive to Obtain: GBLM and Pruner-Zero have demonstrated that gradient information can significantly improve pruning results. However, they rely on full-model backpropagation, which demands unacceptable GPU time and memory (GBLM takes 5800 seconds to prune a 7B model, compared to 55 seconds for Wanda).
- Flaws in the Linear Assumption of Layer-by-Layer Pruning: Wanda evaluates weight importance independently at each layer, ignoring the cumulative error propagation effects across layers.
- Core Problem: Can gradient information be utilized effectively while remaining lightweight?
Method
Overall Architecture
Wanda++ processes the decoder blocks one at a time. Within each block it alternates two stages for K rounds: pruning with the Regional Gradient Score (RGS), then weight restoration with Regional Optimization (RO).
1. Regional Gradient Score (RGS)
Core Idea: Replace full-model gradients with decoder block-level "regional gradients" to substantially reduce computational overhead.
Regional Loss Definition (without labels): \[\mathcal{L}_{RGS}^l(\mathbf{X}_n^l) = \|f^l(\mathbf{X}_n^l)\|_2\]
i.e., the L2 norm of the output of the \(l\)-th decoder block. A single backpropagation pass on this loss yields the gradients of all weights within that block.
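To make the label-free loss concrete, here is a minimal sketch in plain Python. It assumes a toy "block" that is only a linear map y = W x, which is a stand-in and not a real decoder block (a real block would rely on an autograd framework). For this toy case the gradient of \(\|Wx\|_2\) with respect to W has the closed form \((y / \|y\|_2)\, x^\top\). The function names `regional_loss` and `regional_grad` are hypothetical.

```python
import math

def matvec(W, x):
    """Toy stand-in for a decoder block: y = W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def regional_loss(W, x):
    """Label-free regional loss: L2 norm of the block output."""
    return math.sqrt(sum(v * v for v in matvec(W, x)))

def regional_grad(W, x):
    """Analytic gradient of ||W x||_2 w.r.t. W: outer(y / ||y||, x)."""
    y = matvec(W, x)
    norm = math.sqrt(sum(v * v for v in y))
    return [[(yi / norm) * xj for xj in x] for yi in y]

# Assumed toy values, chosen only for illustration.
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
x = [1.0, 2.0, -1.0]
G = regional_grad(W, x)  # one "backward pass" gives a gradient for every weight
```

No labels enter the computation: the loss depends only on the block's own output, which is why the gradient can be obtained without running the rest of the model.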
RGS Pruning Criterion: \[S_{ij} = (\alpha \cdot G_{ij} + \|\mathbf{X}_j\|_2) \cdot |W_{ij}|\]
- \(G_{ij}\): Regional gradient magnitude (RMS over N samples)
- \(\|\mathbf{X}_j\|_2\): Original input activation norm from Wanda
- \(\alpha = 100\): Scaling factor balancing gradient and activation
- The regional gradient is computed only once per block, and is fused with the layer-wise updated activation norm, balancing efficiency and accuracy.
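The criterion above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the helper names (`rms_over_samples`, `rgs_scores`, `mask_2_4`) are hypothetical, the RMS is taken elementwise over per-sample gradient matrices, and the 2:4 groups are assumed to run along each row (input dimension) with a row length divisible by 4.

```python
import math

def rms_over_samples(per_sample_grads):
    """Elementwise RMS of per-sample gradient matrices (N x rows x cols)."""
    n = len(per_sample_grads)
    rows, cols = len(per_sample_grads[0]), len(per_sample_grads[0][0])
    return [[math.sqrt(sum(g[i][j] ** 2 for g in per_sample_grads) / n)
             for j in range(cols)] for i in range(rows)]

def rgs_scores(W, G, x_norm, alpha=100.0):
    """S_ij = (alpha * G_ij + ||X_j||_2) * |W_ij|."""
    return [[(alpha * G[i][j] + x_norm[j]) * abs(W[i][j])
             for j in range(len(W[0]))] for i in range(len(W))]

def mask_2_4(scores):
    """Keep the 2 highest-scoring weights in every group of 4 along each row."""
    mask = []
    for row in scores:
        keep = [0] * len(row)
        for start in range(0, len(row), 4):
            group = range(start, min(start + 4, len(row)))
            for j in sorted(group, key=lambda k: row[k], reverse=True)[:2]:
                keep[j] = 1
        mask.append(keep)
    return mask

# Assumed toy values: one output row, eight input features, two samples.
W = [[0.4, -0.1, 0.3, 0.2, 1.0, -2.0, 0.05, 0.5]]
grads = [[[0.0, 0.03, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]],
         [[0.0, 0.04, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]]
x_norm = [1.0] * 8

G = rms_over_samples(grads)
rgs_mask = mask_2_4(rgs_scores(W, G, x_norm))
wanda_mask = mask_2_4(rgs_scores(W, G, x_norm, alpha=0.0))  # activation term only
```

In this toy example the gradient term changes the decision in the first group: with `alpha=0.0` (the Wanda score) the third weight is kept, while with `alpha=100` the second weight, which has a small magnitude but a non-zero regional gradient, is kept instead.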
2. Regional Optimization (RO)
After each round of RGS pruning, weights within the block are fine-tuned to repair the errors introduced by pruning:
- MSE loss between dense block output and pruned block output
- 32 samples are randomly selected from 128 calibration samples for RO
- Uses RMSprop optimizer, learning rate of 3e-7
- Iterated for K=5 rounds per block (RGS pruning → RO optimization → re-pruning to restore sparsity)
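A single RO step can be illustrated with a toy example in plain Python. The sketch below is an assumption-laden simplification: the "block" is one linear unit, the RMSprop update is written out by hand with assumed smoothing `decay=0.99` and `eps=1e-8`, and the names `rmsprop_step` and `mse_and_grad` are hypothetical. Only the learning rate (3e-7), the MSE objective, and the 32-of-128 subsampling come from the description above.

```python
import math
import random

def rmsprop_step(w, grad, sq_avg, lr=3e-7, decay=0.99, eps=1e-8):
    """One RMSprop update on flat lists; returns (new_w, new_sq_avg)."""
    new_sq = [decay * s + (1 - decay) * g * g for s, g in zip(sq_avg, grad)]
    new_w = [wi - lr * g / (math.sqrt(s) + eps)
             for wi, g, s in zip(w, grad, new_sq)]
    return new_w, new_sq

def mse_and_grad(w, xs, dense_out):
    """MSE between the toy pruned block (y = w . x) and the dense outputs, plus its gradient."""
    n = len(xs)
    loss, grad = 0.0, [0.0] * len(w)
    for x, target in zip(xs, dense_out):
        err = sum(wi * xi for wi, xi in zip(w, x)) - target
        loss += err * err / n
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / n
    return loss, grad

random.seed(0)
# Shown only to illustrate the subsampling: 32 of the 128 calibration samples.
batch_ids = random.sample(range(128), 32)

# Assumed toy data: three inputs and the dense block's outputs on them.
xs = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 1.0, 2.0], [2.0, 0.0, 1.0, 1.0]]
w_dense = [0.4, -0.1, 0.3, 0.2]
dense_out = [sum(wi * xi for wi, xi in zip(w_dense, x)) for x in xs]

w = [0.4, 0.0, 0.3, 0.0]  # weights after 2:4 pruning
loss_before, grad = mse_and_grad(w, xs, dense_out)
w, sq_avg = rmsprop_step(w, grad, [0.0] * len(w))
loss_after, _ = mse_and_grad(w, xs, dense_out)
```

With such a small learning rate each step moves the weights only slightly, so the loss decreases by a tiny amount per step. Note that this toy update also nudges the pruned weights away from zero, which is consistent with the re-pruning step that follows RO in each round.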
3. Algorithm Pipeline
The L decoder blocks are processed sequentially. For each block:
1. Compute the regional gradient G (single backpropagation)
2. Run K iterations of: RGS pruning → RO weight update
3. Apply a final RGS pruning to restore the sparsity constraints
4. Update the input hidden states of the next block
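The control flow of the pipeline can be sketched as a small driver function. Everything here is a hypothetical scaffold: `prune_blocks` and its hook parameters (`regional_grad`, `rgs_prune`, `ro_update`, `forward`) are illustrative names, and the stub hooks only record the call order rather than touching any weights.

```python
def prune_blocks(blocks, hidden, regional_grad, rgs_prune, ro_update, forward, K=5):
    """Sequentially prune decoder blocks; every callable is a hypothetical hook."""
    for block in blocks:
        grad = regional_grad(block, hidden)   # 1. single backward pass per block
        for _ in range(K):                    # 2. K rounds of prune -> optimize
            rgs_prune(block, grad, hidden)
            ro_update(block, hidden)
        rgs_prune(block, grad, hidden)        # 3. final prune restores sparsity
        hidden = forward(block, hidden)       # 4. hidden states for the next block
    return hidden

# Dry run with stub hooks that only record the call order.
calls = []

def stub_grad(block, hidden):
    calls.append("grad")
    return None

def stub_prune(block, grad, hidden):
    calls.append("prune")

def stub_ro(block, hidden):
    calls.append("ro")

def stub_forward(block, hidden):
    calls.append("forward")
    return hidden

prune_blocks(["block0"], hidden=[], regional_grad=stub_grad, rgs_prune=stub_prune,
             ro_update=stub_ro, forward=stub_forward, K=2)
```

The gradient is computed once per block and reused in every pruning call, while the pruning hook runs K + 1 times per block because of the final sparsity-restoring pass.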
Key Experimental Results
Table 1: WikiText Perplexity (↓ lower is better)
| Method | LLaMA-1 7B (2:4) | LLaMA-1 13B (2:4) | OpenLLaMA 3B (2:4) | LLaMA-3.1 8B (2:4) |
|---|---|---|---|---|
| Dense Baseline | 5.68 | 5.09 | 7.27 | 6.39 |
| Wanda | 11.53 | 9.58 | 28.04 | 24.83 |
| GBLM | 11.33 | 9.16 | 24.75 | 24.34 |
| SparseGPT | 11.00 | 9.11 | 15.91 | - |
| Wanda++ | 9.43 (-19%) | 7.75 (-20%) | 19.03 (-32%) | 18.32 (-26%) |
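The relative improvement over Wanda is largest on OpenLLaMA 3B (-32%) and LLaMA-3.1 8B (-26%), and smaller on LLaMA-1 7B and 13B (about -20%). 2:4 sparsity benefits more than unstructured and 4:8 sparsity (results for those patterns are not shown in this table).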
Table 2: Zero-Shot Downstream Task Accuracy (LLaMA-1 7B, 2:4 Sparsity)
| Method | MRPC | HellaSwag | ARC-e | RTE | MMLU |
|---|---|---|---|---|---|
| Dense | 69.12 | 56.96 | 75.29 | 66.43 | 35.10 |
| Wanda | 46.81 | 41.66 | 59.34 | 49.82 | 25.85 |
| Wanda++ | 68.38 (+46%) | 45.31 (+8%) | 63.72 (+7%) | 62.09 (+24%) | 27.52 (+6%) |
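MRPC recovers to within 1 point of the dense baseline (68.38 vs. 69.12) and RTE to within about 4 points (62.09 vs. 66.43). HellaSwag, ARC-e, and MMLU improve over Wanda but remain well below the dense model.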
Pruning Efficiency (LLaMA-1 7B)
| Method | Time | VRAM |
|---|---|---|
| GBLM | 5801s | 26 GB |
| SparseGPT | 322s | 23 GB |
| Wanda | 55s | 22 GB |
| Wanda++ | 290s | 25 GB |
Wanda++ uses about 3 GB more VRAM than Wanda (25 GB vs. 22 GB), and its execution time is roughly 1/20 of GBLM's (290 s vs. 5801 s).
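The ratios quoted for the efficiency table can be reproduced with a few lines of Python. The numbers are copied from the table above; the variable names are illustrative.

```python
# Pruning time (seconds) and VRAM (GB) for LLaMA-1 7B, copied from the table.
timings_s = {"GBLM": 5801, "SparseGPT": 322, "Wanda": 55, "Wanda++": 290}
vram_gb = {"GBLM": 26, "SparseGPT": 23, "Wanda": 22, "Wanda++": 25}

speedup_vs_gblm = timings_s["GBLM"] / timings_s["Wanda++"]     # about 20x faster
slowdown_vs_wanda = timings_s["Wanda++"] / timings_s["Wanda"]  # about 5.3x slower
extra_vram_gb = vram_gb["Wanda++"] - vram_gb["Wanda"]          # 3 GB more

print(f"{speedup_vs_gblm:.1f}x faster than GBLM, "
      f"{slowdown_vs_wanda:.1f}x slower than Wanda, "
      f"+{extra_vram_gb} GB VRAM")
```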
Highlights & Insights
- Ingenious Concept of Regional Gradients: Avoids full-model backpropagation (BP) through block-level BP, reducing gradient computation complexity from O(full model) to O(single block), with VRAM footprint independent of the total number of layers.
- Complementary Two-Stage Design (RGS + RO): RGS improves pruning decisions while RO repairs pruning errors. The combined effect exceeds either component applied individually.
- Orthogonality to LoRA Fine-Tuning: Running LoRA fine-tuning after Wanda++ pruning yields further improvements, showing that the two techniques are complementary.
- Scalability to Large Models: Theoretical analysis suggests that single-block optimization on a 530B model requires only ~40GB VRAM, making it feasible on a single GPU.
Limitations & Future Work
- Notable Accuracy Loss in 2:4 Sparsity: Even with Wanda++, the perplexity of LLaMA-1 7B under 2:4 sparsity increases from 5.68 to 9.43, which is far from the "near-lossless" level achieved by quantization.
- Sensitivity to Calibration Data Volume: The effect of RO is unstable when using small calibration sets (<64 samples).
- Increased Time due to RO Iterations: Takes 290s compared to Wanda's 55s (approx. 5x slower), although still 20x faster than GBLM.
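- Evaluated Only on LLaMA-Style Models: The models in the results above (LLaMA-1, OpenLLaMA, LLaMA-3.1) all follow the LLaMA architecture; generality has not been validated on other architectures such as Qwen or Mistral.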
Related Work & Insights
| Dimension | Wanda++ | Wanda | GBLM | SparseGPT |
|---|---|---|---|---|
| Gradient Information | Regional gradients (block-level) | None | Full-model gradients | Second-order Hessian approximation |
| Weight Update | RO (block-level MSE) | None | None | Column-wise OBS update |
| 7B Pruning Time | ~5 min | ~1 min | ~97 min | ~5 min |
| VRAM Requirement (7B) | 25 GB (comparable to Wanda) | 22 GB (lowest) | 26 GB (full model loading) | 23 GB |
| 2:4 Perplexity Improvement | Best (-19 to -32%) | Baseline | Minor | Moderate |
Rating
- ⭐⭐⭐⭐ Novelty: Replacing full-model gradients with regional gradients is simple yet effective, and block-level MSE optimization reduces error propagation.
- ⭐⭐⭐⭐ Practicality: Prunes a 7B model in under 10 minutes on a single GPU, is orthogonal to LoRA, scales to very large models, and is straightforward to deploy.
- ⭐⭐⭐⭐ Experimental Thoroughness: Covers multiple LLaMA-family models, 3 sparsity patterns, perplexity, downstream tasks, efficiency, latency, and ablation studies.
- ⭐⭐⭐ Writing Quality: The method description is clear, but some experimental figures and tables are not intuitive, and the performance drop in certain zero-shot tasks is not fully explained.