
Olica: Efficient Structured Pruning of Large Language Models without Retraining¶

Conference: ICML2025
arXiv: 2506.08436
Code: BetterTMrR/LLM-Olica
Area: Model Compression
Keywords: Structured Pruning, LLM Compression, Orthogonal Decomposition, Linear Calibration, Retraining-free, PCA, SVD

TL;DR¶

This paper proposes Olica, a retraining-free framework for structured LLM pruning. By performing orthogonal decomposition (PCA/SVD) on the matrix products of MHA layers and linear calibration (ridge regression closed-form solution + low-rank approximation) on FFN layers, Olica prunes LLaMA-7B in just 7 minutes using 256 samples and 3GB VRAM, outperforming existing methods that require retraining.

Background & Motivation¶

Existing structured pruning methods for LLMs (e.g., LLM-Pruner, SlimGPT, LoRAP) heavily rely on extensive data and computational resources for LoRA retraining to recover disrupted inter-layer relationships:

DISP-LLM requires 4× A100 80GB GPUs to prune LLaMA-13B.
LLM-Pruner / SlimGPT require 50K annotated samples (Alpaca) for retraining.
Pruning in domain-specific scenarios incurs an extremely high cost for annotating large instruction datasets.

Key Observation: The computation of the MHA layer fundamentally relies on two matrix products: \(W_q W_k^\top\) and \(W_v W_o^\top\). Each product can be treated as a single entity and decomposed directly with PCA to extract its most important components, which removes the need for retraining.

Method¶

Olica consists of three core components: Orthogonal Neuron Decomposition (OND), Fast OND, and Linear Calibration (LC).

1. Orthogonal Neuron Decomposition (OND) — MHA Layer Compression¶

The MHA layer relies on matrix products \(W_{qk} = W_q W_k^\top\) and \(W_{vo} = W_v W_o^\top\). Treating these products as unified entities, SVD is applied to \(W_{vo}\):

\[W_{vo} = U \Sigma V^\top\]

Let \(\hat{W}_v \leftarrow U\Sigma\) and \(\hat{W}_o \leftarrow V\), which yields \(\hat{W}_v \hat{W}_o^\top = W_v W_o^\top\). Because the columns of \(U\) and \(V\) are orthonormal, the decomposed neurons carry non-redundant information, so keeping a given number of them retains as much information as possible.
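The decomposition above can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the sizes `d` and `d_head` and the random weights are hypothetical stand-ins, and the matrices are stored so that the per-head product is `W_v @ W_o.T`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_head = 64, 16  # hypothetical hidden size and per-head dimension

# Random stand-ins for one head's value and output projections.
W_v = rng.standard_normal((d, d_head))
W_o = rng.standard_normal((d, d_head))

# Treat the product as a single entity and decompose it.
W_vo = W_v @ W_o.T                      # shape (d, d), rank at most d_head
U, S, Vt = np.linalg.svd(W_vo, full_matrices=False)

# Only the first d_head singular values are non-trivial; drop the rest.
U, S, Vt = U[:, :d_head], S[:d_head], Vt[:d_head, :]

W_v_hat = U * S      # equals U @ diag(S): orthogonal columns scaled by S
W_o_hat = Vt.T       # equals V: orthonormal columns

reconstruction_error = np.linalg.norm(W_v_hat @ W_o_hat.T - W_vo)
gram_o = W_o_hat.T @ W_o_hat            # identity if columns are orthonormal
gram_v = W_v_hat.T @ W_v_hat            # diagonal, with S**2 on the diagonal
```

The product is reproduced up to floating-point error, while the new factors have mutually orthogonal columns, which is the property the pruning step relies on.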

Pruning Strategy: Rather than simply truncating by singular value magnitude, Olica ranks the decomposed neurons with an importance score based on weight-activation magnitude:

\[\mathcal{I}(\text{neuron}_j) = \sum_i [\mathcal{I}(\hat{W}_{v_{ij}}) + \mathcal{I}(\hat{W}_{o_{ij}})]\]

where \(\mathcal{I}(W_{ij}) = \|x^{(i)}\|_2 \cdot |W_{ij}|\) is the Wanda score. The neurons with the lowest importance scores are pruned.
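A small sketch of this scoring rule follows. The helper names `neuron_importance` and `prune_neurons` are hypothetical, and using the same activation norms for both the \(\hat{W}_v\) and \(\hat{W}_o\) terms is an assumption made for illustration; the summary does not say which activations pair with \(\hat{W}_o\).

```python
import numpy as np

def neuron_importance(W_v_hat, W_o_hat, norm_v, norm_o):
    """Score neuron j as sum_i norm_v[i]*|W_v_hat[i, j]| + norm_o[i]*|W_o_hat[i, j]|."""
    score_v = (norm_v[:, None] * np.abs(W_v_hat)).sum(axis=0)
    score_o = (norm_o[:, None] * np.abs(W_o_hat)).sum(axis=0)
    return score_v + score_o

def prune_neurons(W_v_hat, W_o_hat, scores, keep):
    """Keep the `keep` highest-scoring columns (neurons) of both matrices."""
    kept = np.sort(np.argsort(scores)[-keep:])
    return W_v_hat[:, kept], W_o_hat[:, kept], kept

rng = np.random.default_rng(0)
d, d_head, n_tokens = 32, 8, 128
X = rng.standard_normal((n_tokens, d))        # stand-in calibration activations
W_v_hat = rng.standard_normal((d, d_head))
W_o_hat = rng.standard_normal((d, d_head))
W_v_hat[:, 3] *= 1e-3                         # make neuron 3 clearly unimportant
W_o_hat[:, 3] *= 1e-3

norm_x = np.linalg.norm(X, axis=0)            # ||x^(i)||_2 for each input feature
# Assumption: the same norms are reused for the W_o term.
scores = neuron_importance(W_v_hat, W_o_hat, norm_x, norm_x)
W_v_pruned, W_o_pruned, kept = prune_neurons(W_v_hat, W_o_hat, scores, keep=6)
```

Because both matrices are pruned along the same column indices, the product of the pruned factors stays well defined.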

2. Fast OND — Lowering SVD Complexity¶

Performing SVD on \(W_{vo} \in \mathbb{R}^{d \times d}\) has a complexity of \(O(d^3)\), totaling \(O(h d^3)\) across \(h\) attention heads, which takes approximately 1 hour for a 7B model.

Key Discovery: The singular value distributions of \(W_v\) and \(W_o\) are highly similar (as are those of \(W_q\) and \(W_k\)). Consequently, performing SVD on only one of these matrices suffices to approximate the redundancy of the joint entity. For example, performing SVD on \(W_v \in \mathbb{R}^{d \times d/h}\) costs \(O(d \cdot (d/h)^2) = O(d^3/h^2)\) per head and \(O(d^3/h)\) across \(h\) heads, which reduces the SVD cost by a factor of \(h^2\).
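As a worked instance of these counts, assume the LLaMA-7B attention shape \(d = 4096 = 2^{12}\) and \(h = 32 = 2^{5}\) (values not stated in this summary), and ignore constant factors:

\[h d^3 = 2^{5} \cdot 2^{36} = 2^{41} \approx 2.2 \times 10^{12}, \qquad \frac{d^3}{h} = \frac{2^{36}}{2^{5}} = 2^{31} \approx 2.1 \times 10^{9}, \qquad \frac{h d^3}{d^3 / h} = h^2 = 1024\]

This ratio covers the SVD operation count only, so it should be read as an asymptotic estimate rather than a prediction of wall-clock speedup.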

3. Linear Calibration (LC) — FFN Error Compensation¶

Pruning changes the outputs of FFN layers, and the resulting errors accumulate and are amplified across layers. Olica fits the pruning residual with a closed-form ridge regression:

\[\hat{W} = \arg\min_{W} \|E - XW\|_F^2 + \lambda \|W\|_F^2\]

where \(E = f(X) - \hat{f}(X)\) is the residual between the FFN outputs before and after pruning. The closed-form solution is \(\hat{W} = (X^\top X + \lambda I)^{-1} X^\top E\), which requires no iterative training.

After calibration, the forward pass becomes: \(X_{l+1} = \hat{f}_l(X_l) + X_l \hat{W}_l\)
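The calibration step can be sketched with the standard ridge solution \((X^\top X + \lambda I)^{-1} X^\top E\). This is an illustrative toy, not the paper's code: `linear_calibration` is a hypothetical helper, and the "dense" and "pruned" FFNs are synthetic stand-ins whose difference is linear in the input by construction.

```python
import numpy as np

def linear_calibration(X, E, lam):
    """Closed-form ridge regression: solve (X^T X + lam*I) W = X^T E for W."""
    d = X.shape[1]
    gram = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(gram, X.T @ E)

rng = np.random.default_rng(0)
n_tokens, d = 512, 16
X = rng.standard_normal((n_tokens, d))

# Synthetic stand-ins: the pruned layer loses a linear component of the output.
W_lost = 0.1 * rng.standard_normal((d, d))
f_dense = np.tanh(X) + X @ W_lost
f_pruned = np.tanh(X)

# Residual observed on calibration data, with a little noise added.
E = f_dense - f_pruned + 0.01 * rng.standard_normal((n_tokens, d))

W_hat = linear_calibration(X, E, lam=1e-2)

# Calibrated forward pass: pruned output plus the linear correction X @ W_hat.
err_before = np.linalg.norm(f_dense - f_pruned)
err_after = np.linalg.norm(f_dense - (f_pruned + X @ W_hat))
```

Using `np.linalg.solve` instead of forming the explicit inverse is a numerical-stability choice of this sketch. In the toy setup the residual is linear, so almost all of the error is recovered; a real FFN residual need not behave this way.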

Low-Rank Approximation: Applying SVD to \(\hat{W} \in \mathbb{R}^{d \times d}\) keeps the top \(r\) principal components (\(r/d = 0.03\)), reducing parameter size from \(d^2\) to \(2dr\), which constitutes only about 1% of the FFN layer parameters.
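The truncation and the parameter count can be illustrated as follows. The helper `low_rank_factors` and the toy matrix are hypothetical; the LLaMA-7B shapes used for the final ratio (hidden size 4096, FFN intermediate size 11008, three FFN weight matrices) are assumptions not stated in this summary.

```python
import numpy as np

def low_rank_factors(W_hat, r):
    """Return A (d x r) and B (r x d) such that A @ B is the best rank-r fit to W_hat."""
    U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
    A = U[:, :r] * S[:r]
    B = Vt[:r, :]
    return A, B

rng = np.random.default_rng(0)
d = 200
r = round(0.03 * d)                       # r/d = 0.03 as in the text, so r = 6

# Toy calibration matrix that is close to rank 4.
W_hat = rng.standard_normal((d, 4)) @ rng.standard_normal((4, d))
W_hat += 1e-3 * rng.standard_normal((d, d))

A, B = low_rank_factors(W_hat, r)
rel_err = np.linalg.norm(W_hat - A @ B) / np.linalg.norm(W_hat)
dense_params, low_rank_params = d * d, A.size + B.size   # d^2 versus 2*d*r

# Assumed LLaMA-7B shapes, used only to sanity-check the "about 1%" figure.
d7, inter7 = 4096, 11008
r7 = round(0.03 * d7)
ffn_fraction = (2 * d7 * r7) / (3 * d7 * inter7)
```

With the factors stored separately, the correction term is applied as `(X @ A) @ B`, which costs \(2dr\) multiply-accumulates per token instead of \(d^2\). Under the assumed shapes, `ffn_fraction` comes out a little under 1%, in line with the figure quoted above.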

Layer Selection Criteria: FFN layers amenable to linear recovery are selected for calibration based on the Multiple Correlation Coefficient \(R_{XE}\):

\[R_{XE} = \frac{1}{d} \sum_{i=1}^{d} R_i\]

where \(R_i\) is the Pearson correlation coefficient between the predicted and true residuals. Experiments indicate that FFN residuals in shallower blocks are more suited for linear calibration.
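A sketch of this criterion on synthetic data: it fits a ridge model, then averages the per-dimension Pearson correlation between the predicted and true residuals. The helpers `ridge_fit` and `mean_correlation`, and the two synthetic residuals, are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, E, lam=1e-2):
    """Closed-form ridge regression weights."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ E)

def mean_correlation(E_true, E_pred):
    """Average over output dimensions of the column-wise Pearson correlation."""
    t = E_true - E_true.mean(axis=0)
    p = E_pred - E_pred.mean(axis=0)
    num = (t * p).sum(axis=0)
    den = np.sqrt((t ** 2).sum(axis=0) * (p ** 2).sum(axis=0))
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
n_tokens, d = 2000, 8
X = rng.standard_normal((n_tokens, d))

# A residual that is linear in X (plus noise) and one that is purely quadratic.
E_linear = X @ rng.standard_normal((d, d)) + 0.05 * rng.standard_normal((n_tokens, d))
E_nonlinear = X ** 2 - 1.0

r_linear = mean_correlation(E_linear, X @ ridge_fit(X, E_linear))
r_nonlinear = mean_correlation(E_nonlinear, X @ ridge_fit(X, E_nonlinear))
```

A layer whose residual behaves like `E_linear` scores close to 1 and is a good candidate for calibration, while one that behaves like `E_nonlinear` scores close to 0 and would gain little from a linear correction.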

4. RoPE Compatibility¶

Since LLaMA employs RoPE positional embeddings, there is no direct matrix multiplication between \(W_q\) and \(W_k\). The authors resolve this by performing Weighted SVD on \(W_q\) and \(W_k\) separately with the objective:

\[\arg\min_{W_1, W_2} \|(W - W_1 W_2) D\|_2^2\]

where \(D = \text{diag}(\|x^{(1)}\|_2, \ldots, \|x^{(d)}\|_2)\) weights each input dimension of \(W\) by the magnitude of the corresponding input feature.
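One standard way to solve an objective of this form, when every diagonal entry of \(D\) is positive, is to take the truncated SVD of \(WD\) and then undo \(D\) on the right factor. The sketch below follows that route under the Frobenius norm; it assumes \(W\) is stored as (output, input) so that \(D\) scales its columns, and `weighted_svd` is a hypothetical helper rather than the authors' code.

```python
import numpy as np

def weighted_svd(W, feat_norm, r):
    """Rank-r factors W1, W2 minimizing ||(W - W1 @ W2) @ diag(feat_norm)||_F."""
    # W * feat_norm scales column i by feat_norm[i], i.e. it equals W @ D.
    U, S, Vt = np.linalg.svd(W * feat_norm, full_matrices=False)
    W1 = U[:, :r] * S[:r]
    W2 = Vt[:r, :] / feat_norm          # right-multiply by D^{-1}
    return W1, W2

rng = np.random.default_rng(0)
out_dim, d, r, n_tokens = 16, 32, 4, 256
W = rng.standard_normal((out_dim, d))

# Calibration activations with deliberately uneven feature magnitudes.
X = rng.standard_normal((n_tokens, d)) * np.linspace(0.1, 5.0, d)
feat_norm = np.linalg.norm(X, axis=0)

W1, W2 = weighted_svd(W, feat_norm, r)

# Baseline: plain truncated SVD of W, which ignores the activations.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_plain = (U[:, :r] * S[:r]) @ Vt[:r, :]

err_weighted = np.linalg.norm((W - W1 @ W2) * feat_norm)
err_plain = np.linalg.norm((W - W_plain) * feat_norm)
```

By the Eckart-Young theorem applied to \(WD\), the weighted error is never larger than that of the plain truncation at the same rank, which is what the comparison of `err_weighted` and `err_plain` shows.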

Key Experimental Results¶

Resource Consumption Comparison (LLaMA-7B)¶

Method	Data	Time	VRAM	PPL↓ (25%)	Acc↑ (25%)	PPL↓ (33%)	Acc↑ (33%)
LLM-Pruner	50K	3h	30GB	20.57	58.67	24.50	55.39
SlimGPT	50K	1h	20GB	18.45	62.45	22.43	61.41
Olica	256	7min	3GB	16.69	63.53	19.83	61.21

Main Results for LLaMA-1 Series¶

Model	Sparsity	Method	Retraining	PPL↓	Avg. Acc (7 tasks)↑
LLaMA-7B	20%	Olica	✗	15.35	64.54
LLaMA-7B	25%	Olica	✗	16.69	63.54
LLaMA-7B	25%	SlimGPT	✗	19.11	62.47
LLaMA-7B	25%	LoRAP	✗	17.40	62.57
LLaMA-7B	33%	Olica	✗	19.83	61.21
LLaMA-7B	33%	SlimGPT	✗	24.55	60.37
LLaMA-13B	20%	Olica	✗	13.68	67.67
LLaMA-13B	20%	LoRAP	✗	13.84	66.84

LLaMA-2 / Vicuna Results¶

Model	Sparsity	Method	PPL↓	Avg. Acc↑
LLaMA-2-7B	30%	Olica	18.54	61.14
LLaMA-2-7B	30%	LoRAP	19.42	58.89
Vicuna-7B	20%	Olica	20.23	64.88
Vicuna-7B	20%	LoRAP	20.74	64.39

Inference Efficiency (LLaMA-7B, RTX 4090)¶

Sparsity	Params	MACs	VRAM	Inference Latency
0%	6.74B	424.02G	12884 MiB	46.95s
20%	5.39B	373.23G	10464 MiB	40.62s
33%	4.52B	339.53G	8718 MiB	35.78s
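The relative savings implied by the table can be derived with a few lines of arithmetic. The numbers below are copied from the rows above; the dictionary layout and key names are only for illustration.

```python
# Rows of the inference-efficiency table, keyed by sparsity in percent.
rows = {
    0:  {"params_B": 6.74, "macs_G": 424.02, "vram_MiB": 12884, "latency_s": 46.95},
    20: {"params_B": 5.39, "macs_G": 373.23, "vram_MiB": 10464, "latency_s": 40.62},
    33: {"params_B": 4.52, "macs_G": 339.53, "vram_MiB": 8718,  "latency_s": 35.78},
}

base = rows[0]
summary = {}
for sparsity, row in rows.items():
    summary[sparsity] = {
        # Fractional reduction relative to the unpruned model.
        "param_reduction": 1 - row["params_B"] / base["params_B"],
        "mac_reduction": 1 - row["macs_G"] / base["macs_G"],
        "vram_reduction": 1 - row["vram_MiB"] / base["vram_MiB"],
        # Latency ratio: values above 1 mean the pruned model is faster.
        "speedup": base["latency_s"] / row["latency_s"],
    }

for sparsity, stats in summary.items():
    print(sparsity, {name: round(value, 3) for name, value in stats.items()})
```

At 33% sparsity this gives roughly a 20% reduction in MACs and a latency ratio of about 1.31, so compute and latency shrink more slowly than the parameter count in these measurements.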

Ablation Study (LLaMA-7B, 33% Sparsity)¶

MHA Method	Linear Calibration	PPL↓	Acc↑
SVD	✗	71.01	47.62
Wanda	✗	20.94	59.82
Fast-OND	✗	20.34	60.68
Fast-OND	✓	19.83	61.21

Runtime Comparison (LLaMA-7B, 20%)¶

Method	Runtime	PPL	Acc
OND (Standard SVD)	2910s	15.17	64.32
Fast-OND	413s	15.35	64.54

Fast-OND delivers an approximate 7× speedup with almost no loss in performance.

Highlights & Insights¶

Treating matrix products as unified entities is the core insight—operating directly in the parameter space avoids the need for data-driven retraining.
The discovery of singular value distribution symmetry (\(W_q \sim W_k\) and \(W_v \sim W_o\)) enables Fast-OND, reducing SVD complexity by a factor of \(h^2\).
The closed-form solution for linear calibration elegantly leverages ridge regression and SVD low-rank approximation, adding only ~1% extra parameters.
Requiring only 256 samples, the framework is robust to both sample size and sequence length (with PPL fluctuating by only 2.4 as sequence length varies from 8 to 2048).
The extremely low resource requirement (3GB VRAM / 7 minutes) makes it viable to execute LLM pruning even on edge devices.

Limitations & Future Work¶

Severe performance drop at high sparsity: At 50% sparsity, all methods experience a substantial degradation (Olica on LLaMA-7B yields only 50.68% accuracy), demonstrating that retraining-free methods have a lower performance ceiling in this regime.
Evaluation limited to the LLaMA family: The method has not been tested on non-LLaMA architectures like Mistral, Qwen, and Phi; thus, generalizability remains to be verified.
RoPE handling is a workaround: Because RoPE disrupts the direct product structure of \(W_q W_k^\top\), \(W_q\) and \(W_k\) can only undergo separate weighted SVDs rather than a joint decomposition, which compromises the theoretical elegance of MHA compression.
Limitations of the linear calibration assumption: Reconstructing residuals solely with a linear model provides limited capacity to recover from non-linear error patterns.
Lack of joint experiments with quantization: Combining pruning with quantization is a common choice in industry, but the paper does not evaluate it.
Insufficient theoretical support for the layer selection criterion: The MC² threshold must be manually chosen (from {6, 12, 16}) and lacks an adaptive selection mechanism.

Related Work & Insights¶

LLM-Pruner (NeurIPS'23): Pioneering work introducing a two-stage process (pruning + LoRA retraining).
SliceGPT (ICLR'24): Reduces hidden representations using PCA, but operates in the activation space rather than the parameter space.
LoRAP (ICML'24): Performs low-rank approximation on QKV, but still requires retraining.
SlimGPT (NeurIPS'24): Extends OBS to structured pruning but requires extensive data.
Wanda (ICLR'24): Proposes weight-by-activation-magnitude importance scores, which Olica directly adopts.

The core inspiration of Olica: Orthogonal decomposition in the parameter space can bypass the need for data-dependent retraining. This approach could potentially be extended to other model compression scenarios (e.g., ViT, diffusion models).

Rating¶

Novelty: ⭐⭐⭐⭐ — The perspective of integrating matrix products as unified entities and applying PCA is highly novel, and the observation of singular value symmetry in Fast-OND is elegant.
Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive ablation studies and evaluations across multiple models and sparsities are provided, though evaluations on non-LLaMA architectures and joint quantization experiments are missing.
Writing Quality: ⭐⭐⭐⭐ — The paper is clearly presented, mathematically rigorous, and features highly informative plots and tables.
Value: ⭐⭐⭐⭐⭐ — Extremely practical. The remarkably low overhead of 3GB VRAM and 7 minutes runtime makes it a prime implementation for resource-constrained scenarios.