Wanda++: Pruning Large Language Models via Regional Gradients¶

Conference: ACL 2025
arXiv: 2503.04992
Code: None
Area: Model Compression

TL;DR¶

Wanda++ is a lightweight LLM pruning framework built on decoder-block-level regional gradients. It improves the pruning criterion with a Regional Gradient Score (RGS) and minimizes the output discrepancy between dense and sparse blocks via Regional Optimization (RO). Under 2:4 sparsity, it reduces WikiText perplexity by up to 32% relative to Wanda, and it prunes a 7B model in under 10 minutes on a single H100 GPU.

Background & Motivation¶

LLM Inference Bottleneck: Large language models have enormous parameter counts (e.g., LLaMA-2 70B needs about 140 GB of VRAM for its FP16 weights alone), which leads to high inference latency and makes model compression a pressing need.
High Accuracy Loss of Existing Pruning Methods: Post-training pruning methods (SparseGPT, Wanda) are efficient, but they degrade severely under 2:4 semi-structured sparsity and lag far behind quantization methods (such as AWQ, which achieves nearly lossless 4x compression).
Gradient Information is Valuable but Expensive to Obtain: GBLM and Pruner-Zero have shown that gradient information can significantly improve pruning results. However, they rely on full-model backpropagation, which is costly in GPU time and memory (GBLM takes about 5,800 seconds to prune a 7B model, compared to 55 seconds for Wanda).
Flaws in the Linear Assumption of Layer-by-Layer Pruning: Wanda evaluates weight importance independently at each layer, ignoring how pruning errors accumulate and propagate across layers.
Core Problem: Can gradient information be utilized effectively while remaining lightweight?

Method¶

Overall Architecture¶

Wanda++ processes the decoder blocks sequentially. Each block goes through two stages, Regional Gradient Score (RGS) pruning followed by Regional Optimization (RO) weight restoration, and the two stages are repeated for K rounds.

1. Regional Gradient Score (RGS)¶

Core Idea: Replace full-model gradients with decoder block-level "regional gradients" to substantially reduce computational overhead.

Regional Loss Definition (without labels): $$\mathcal{L}_{RGS}^l(\mathbf{X}_n^l) = \|f^l(\mathbf{X}_n^l)\|_2$$

i.e., the L2 norm of the output of the $l$-th decoder block. A single backpropagation pass on this loss yields the gradients of all weights within that block.
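To make the label-free regional loss concrete, here is a minimal sketch in plain Python. It assumes a toy "block" that is a single linear map $Y = XW^\top$, for which the gradient of $\|Y\|_2$ has a closed form. A real decoder block would use automatic differentiation instead. The function names (`block_forward`, `regional_gradient`, `rms_over_samples`) are hypothetical and chosen only for illustration. The RMS aggregation follows the description of $G_{ij}$ in this summary.

```python
import math

def block_forward(X, W):
    """Toy stand-in for a decoder block f^l: a single linear map Y = X W^T."""
    return [[sum(xj * wj for xj, wj in zip(x, w)) for w in W] for x in X]

def regional_gradient(X, W):
    """Gradient of ||f(X)||_2 with respect to W for the toy linear block.

    For Y = X W^T: d||Y||_2 / dW[i][j] = sum_t Y[t][i] * X[t][j] / ||Y||_2.
    Assumes the block output is not all zeros (the norm must be nonzero).
    """
    Y = block_forward(X, W)
    norm = math.sqrt(sum(y * y for row in Y for y in row))
    return [[sum(Y[t][i] * X[t][j] for t in range(len(X))) / norm
             for j in range(len(W[0]))]
            for i in range(len(W))]

def rms_over_samples(samples, W):
    """Root-mean-square of the per-sample gradients, element by element."""
    grads = [regional_gradient(X, W) for X in samples]
    return [[math.sqrt(sum(g[i][j] ** 2 for g in grads) / len(grads))
             for j in range(len(W[0]))]
            for i in range(len(W))]

# One output row, two input features, two calibration samples of one token each.
W = [[1.0, 0.0]]
samples = [[[3.0, 4.0]], [[1.0, 0.0]]]
G = rms_over_samples(samples, W)
```

No labels appear anywhere in the computation: the loss depends only on the block's own output, which is why a single block can be processed in isolation.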

RGS Pruning Criterion: $$S_{ij} = (\alpha \cdot G_{ij} + \|\mathbf{X}_j\|_2) \cdot |W_{ij}|$$

$G_{ij}$: Regional gradient magnitude (RMS over N samples)
$\|\mathbf{X}_j\|_2$: Original input activation norm from Wanda
$\alpha = 100$: Scaling factor balancing gradient and activation
The regional gradient is computed only once per block, and is fused with the layer-wise updated activation norm, balancing efficiency and accuracy.
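The criterion above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the helper names `rgs_scores` and `prune_n_of_m` are hypothetical, and the 2:4 pattern is assumed to mean that 2 of every 4 consecutive input weights in a row are zeroed. Setting `alpha=0.0` reduces the score to the Wanda criterion, which makes the effect of the gradient term easy to see.

```python
def rgs_scores(W, G, x_norm, alpha=100.0):
    """S[i][j] = (alpha * G[i][j] + x_norm[j]) * |W[i][j]|."""
    return [[(alpha * G[i][j] + x_norm[j]) * abs(W[i][j])
             for j in range(len(W[i]))]
            for i in range(len(W))]

def prune_n_of_m(W, S, n_zero=2, m=4):
    """Zero the n_zero lowest-scoring weights in each group of m consecutive inputs."""
    pruned = [row[:] for row in W]
    for i, scores in enumerate(S):
        for start in range(0, len(scores), m):
            group = range(start, min(start + m, len(scores)))
            for j in sorted(group, key=lambda idx: scores[idx])[:n_zero]:
                pruned[i][j] = 0.0
    return pruned

# Toy example: one output row with four input weights (made-up values).
W = [[0.5, -0.1, 0.3, 0.05]]
x_norm = [1.0, 1.0, 1.0, 1.0]
G = [[0.0, 0.05, 0.0, 0.0]]  # only the second weight has a large regional gradient

wanda_like = prune_n_of_m(W, rgs_scores(W, G, x_norm, alpha=0.0))
with_rgs = prune_n_of_m(W, rgs_scores(W, G, x_norm, alpha=100.0))
print(wanda_like)  # [[0.5, 0.0, 0.3, 0.0]]
print(with_rgs)    # [[0.5, -0.1, 0.0, 0.0]]
```

In this toy case the small weight `-0.1` is dropped by the activation-only score but kept once its regional gradient is taken into account.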

2. Regional Optimization (RO)¶

After each round of RGS pruning, weights within the block are fine-tuned to repair the errors introduced by pruning:

$$\mathcal{L}_{ro}^{l,k}(\hat{\mathbf{X}}_m^l) = \left(f^l(\hat{\mathbf{X}}_m^l) - \hat{f}_k^l(\hat{\mathbf{X}}_m^l)\right)^2$$

MSE loss between the dense block output $f^l$ and the pruned block output $\hat{f}_k^l$ at round $k$
32 samples are randomly selected from 128 calibration samples for RO
Uses RMSprop optimizer, learning rate of 3e-7
Iterated for K=5 rounds per block (RGS pruning → RO optimization → re-pruning to restore sparsity)
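One RO update can be sketched as follows, again in plain Python and under loud assumptions: the block is a toy linear map, the RMSprop rule uses a decay of 0.99 and an epsilon of 1e-8 (common defaults, not values stated in this summary), and the learning rate is set to 1e-2 so that the toy converges visibly, rather than the 3e-7 listed above. The names `forward`, `mse`, and `ro_step` are hypothetical.

```python
import math

def forward(X, W):
    """Toy linear block: Y = X W^T."""
    return [[sum(xj * wj for xj, wj in zip(x, w)) for w in W] for x in X]

def mse(A, B):
    """Mean squared error over all output elements."""
    count = sum(len(row) for row in A)
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)) / count

def ro_step(X, W_dense, W_sparse, state, lr=1e-2, rho=0.99, eps=1e-8):
    """One RMSprop step on the MSE between dense and pruned block outputs.

    W_sparse and state are updated in place; the new loss is returned.
    """
    target = forward(X, W_dense)
    out = forward(X, W_sparse)  # computed once, before any weight changes
    count = sum(len(row) for row in out)
    for i in range(len(W_sparse)):
        for j in range(len(W_sparse[i])):
            g = sum(2.0 * (out[t][i] - target[t][i]) * X[t][j]
                    for t in range(len(X))) / count
            state[i][j] = rho * state[i][j] + (1.0 - rho) * g * g
            W_sparse[i][j] -= lr * g / (math.sqrt(state[i][j]) + eps)
    return mse(forward(X, W_sparse), target)

X = [[1.0, 1.0], [1.0, -1.0]]
W_dense = [[1.0, 2.0]]
W_sparse = [[1.0, 0.0]]          # second weight was pruned
state = [[0.0, 0.0]]             # RMSprop running average of squared gradients
loss_before = mse(forward(X, W_sparse), forward(X, W_dense))
losses = [ro_step(X, W_dense, W_sparse, state) for _ in range(10)]
```

Note that the update is dense: previously pruned weights may become nonzero again, which is why each RO round is followed by re-pruning to restore the sparsity pattern.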

3. Algorithm Pipeline¶

The L decoder blocks are processed sequentially:
1. Compute regional gradient G (single backpropagation)
2. K iterations of: RGS pruning → RO weight update
3. Final RGS pruning to restore sparsity constraints
4. Update the input hidden states of the next block
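The schedule described here can be written as a short control-flow skeleton. This is only a sketch of the ordering of steps: `wanda_pp_schedule` and its four callables (`compute_grad`, `rgs_prune`, `ro_update`, `forward`) are hypothetical placeholders, and how the hidden states are propagated (for example, through the pruned block) is left to the `forward` callable because this summary does not specify it.

```python
def wanda_pp_schedule(blocks, hidden, compute_grad, rgs_prune, ro_update, forward, K=5):
    """Run the per-block Wanda++ schedule; all callables are user-supplied."""
    for block in blocks:
        G = compute_grad(block, hidden)      # 1. regional gradient, computed once
        for _ in range(K):                   # 2. K rounds of prune, then optimize
            rgs_prune(block, G, hidden)
            ro_update(block, hidden)
        rgs_prune(block, G, hidden)          # 3. final pruning restores sparsity
        hidden = forward(block, hidden)      # 4. inputs for the next block
    return hidden

# Dry run with stub callables that only record the order of operations.
log = []
final_hidden = wanda_pp_schedule(
    blocks=["block0", "block1"],
    hidden=0,
    compute_grad=lambda b, h: log.append(("grad", b)) or "G",
    rgs_prune=lambda b, G, h: log.append(("prune", b)),
    ro_update=lambda b, h: log.append(("ro", b)),
    forward=lambda b, h: h + 1,
    K=2,
)
```

Because only one block is active at a time, the working set is one block's weights, gradients, and optimizer state plus the calibration hidden states.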

Key Experimental Results¶

Table 1: WikiText Perplexity (↓ lower is better)¶

Method	LLaMA-1 7B (2:4)	LLaMA-1 13B (2:4)	OpenLLaMA 3B (2:4)	LLaMA-3.1 8B (2:4)
Dense Baseline	5.68	5.09	7.27	6.39
Wanda	11.53	9.58	28.04	24.83
GBLM	11.33	9.16	24.75	24.34
SparseGPT	11.00	9.11	15.91	-
Wanda++	9.43 (-19%)	7.75 (-20%)	19.03 (-32%)	18.32 (-26%)

In this table, the relative improvement over Wanda is largest on OpenLLaMA 3B (-32%) and LLaMA-3.1 8B (-26%). The gain under 2:4 sparsity is larger than under unstructured or 4:8 sparsity.

Table 2: Zero-Shot Downstream Task Accuracy (LLaMA-1 7B, 2:4 Sparsity)¶

Method	MRPC	HellaSwag	ARC-e	RTE	MMLU
Dense	69.12	56.96	75.29	66.43	35.10
Wanda	46.81	41.66	59.34	49.82	25.85
Wanda++	68.38 (+46%)	45.31 (+8%)	63.72 (+7%)	62.09 (+24%)	27.52 (+6%)

MRPC is restored almost to the dense baseline (68.38 vs. 69.12), and RTE recovers most of the gap (62.09 vs. 66.43).

Pruning Efficiency (LLaMA-1 7B)¶

Method	Time	VRAM
GBLM	5801s	26 GB
SparseGPT	322s	23 GB
Wanda	55s	22 GB
Wanda++	290s	25 GB

VRAM usage is close to Wanda's (25 GB vs. 22 GB), while the runtime is about 1/20 of GBLM's (290s vs. 5801s).
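The ratios quoted for this table follow directly from the listed timings, as this short check shows (the dictionary simply restates the table values):

```python
# Pruning times in seconds for LLaMA-1 7B, copied from the table above.
timings_s = {"GBLM": 5801, "SparseGPT": 322, "Wanda": 55, "Wanda++": 290}

slowdown_vs_wanda = timings_s["Wanda++"] / timings_s["Wanda"]
speedup_vs_gblm = timings_s["GBLM"] / timings_s["Wanda++"]

print(f"{slowdown_vs_wanda:.1f}x slower than Wanda")  # 5.3x slower than Wanda
print(f"{speedup_vs_gblm:.1f}x faster than GBLM")     # 20.0x faster than GBLM
```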

Highlights & Insights¶

Ingenious Concept of Regional Gradients: Avoids full-model backpropagation (BP) through block-level BP, reducing gradient computation complexity from O(full model) to O(single block), with VRAM footprint independent of the total number of layers.
Complementary Two-Stage Design (RGS + RO): RGS improves pruning decisions while RO repairs pruning errors. The combined effect exceeds either component applied individually.
Orthogonality to LoRA Fine-Tuning: Running LoRA fine-tuning after Wanda++ pruning yields further improvements, showing that the two techniques are complementary.
Scalability to Large Models: Theoretical analysis suggests that single-block optimization on a 530B model requires only ~40GB VRAM, making it feasible on a single GPU.

Limitations & Future Work¶

Notable Accuracy Loss in 2:4 Sparsity: Even with Wanda++, the perplexity of LLaMA-1 7B under 2:4 sparsity increases from 5.68 to 9.43, which is far from the "near-lossless" level achieved by quantization.
Sensitivity to Calibration Data Volume: The effect of RO is unstable when using small calibration sets (<64 samples).
Increased Time due to RO Iterations: Wanda++ takes 290s compared to Wanda's 55s (about 5x slower), although it is still about 20x faster than GBLM.
Evaluated Only on LLaMA-Style Models: The reported results cover LLaMA-1, OpenLLaMA, and LLaMA-3.1. Generality has not been validated on other architectures such as Qwen or Mistral.

Dimension	Wanda++	Wanda	GBLM	SparseGPT
Gradient Information	Regional gradients (block-level)	None	Full-model gradients	Second-order Hessian approximation
Weight Update	RO (block-level MSE)	None	None	Column-wise OBS update
7B Pruning Time	~5 min	~1 min	~97 min	~5 min
VRAM Requirement (7B)	25 GB (comparable to Wanda)	22 GB (lowest)	26 GB (full model loading)	23 GB (medium)
2:4 Perplexity Improvement	Best (-19 to -32%)	Baseline	Minor	Moderate

Rating¶

⭐⭐⭐⭐ Novelty: Replacing full-model gradients with regional gradients is simple yet effective, and block-level MSE optimization reduces error propagation.
⭐⭐⭐⭐ Practicality: Prunes a 7B model in under 10 minutes on a single GPU, is orthogonal to LoRA, scales to ultra-large models, and is engineering-friendly to deploy.
⭐⭐⭐⭐ Experimental Thoroughness: Comprehensive coverage of 4 model families, 3 sparsity patterns, perplexity, downstream tasks, efficiency, latency, and ablation studies.
⭐⭐⭐ Writing Quality: The method description is clear, but some experimental figures and tables are not intuitive, and the performance drop in certain zero-shot tasks is not fully explained.