SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression¶
Conference: ICML 2025
arXiv: 2410.09615
Code: GitHub
Area: Interpretability
Keywords: one-shot compression, quantization, sparsity, low-rank adapter, SLiM-Quant, SLiM-LoRA
TL;DR¶
Proposes SLiM, a one-shot compression framework that combines hardware-friendly uniform quantization, semi-structured sparsity, and saliency-based low-rank adapters, improving accuracy by up to 5.66% over prior compression methods under 4-bit quantization with 2:4 sparsity.
Background & Motivation¶
- While pruning or quantization alone can effectively reduce inference costs, their joint application leads to error accumulation, resulting in severe performance degradation.
- Existing one-shot compression methods struggle to recover the accuracy of dense models under 4-bit + structured sparsity scenarios.
- Low-rank adapters can mitigate compression losses, but typically require expensive retraining.
- Existing low-rank methods (such as L2QER) are designed solely for quantization and perform poorly when combined with sparsity.
Method¶
Three-Step Pipeline¶
Step 1: SLiM-Quant (Probabilistic Uniform Quantization)¶
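Reformulates the quantization problem from a non-convex optimization into a probabilistic one: the expected error over the weight distribution is minimized with respect to the quantization range parameter \(\alpha\).
This expected error decomposes into two terms, a quantization (rounding) error for weights inside the range and a clipping error for weights outside it.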
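A hedged reconstruction of this decomposition, under assumed notation: \(f_{\mathrm{abs}}\) is the density of the absolute weight values and \(Q_{\alpha}\) is the uniform quantizer with range \(\alpha\). These symbols are illustrative assumptions and may differ from the paper's.

\[
\alpha^{*} = \arg\min_{\alpha}\; E(\alpha), \qquad
E(\alpha) = \underbrace{\int_{0}^{\alpha} f_{\mathrm{abs}}(w)\,\bigl(w - Q_{\alpha}(w)\bigr)^{2}\,dw}_{\text{quantization error}}
\;+\; \underbrace{\int_{\alpha}^{\infty} f_{\mathrm{abs}}(w)\,\bigl(w - \alpha\bigr)^{2}\,dw}_{\text{clipping error}}
\]

A smaller \(\alpha\) shrinks the rounding step (first term) but clips more weights (second term), which is why the optimum lies strictly inside the weight range when outliers are present.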
Because the actual weight distributions fit no standard family (Gaussian, Laplace, and others are ruled out), the optimal \(\alpha\) is found by numerical integration over the weight histogram combined with a multi-grid search: a coarse pass over 10 uniformly spaced candidates, followed by refinement within the best region.
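The coarse-then-refine search can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the histogram resolution, the number of refinement rounds, and the symmetric signed quantizer are all assumptions.

```python
import numpy as np

def abs_histogram(weights, bins=2048):
    """Histogram of |w|: bin centers and empirical probabilities (bins is an assumed setting)."""
    abs_w = np.abs(np.asarray(weights, dtype=np.float64)).ravel()
    counts, edges = np.histogram(abs_w, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, counts / counts.sum()

def expected_error(alpha, centers, probs, bits=4):
    """Numerical estimate of E[(|w| - Q_alpha(|w|))^2]; covers rounding and clipping error."""
    levels = 2 ** (bits - 1) - 1            # positive levels of a symmetric quantizer
    step = alpha / levels
    clipped = np.minimum(centers, alpha)    # values above alpha are clipped to alpha
    quantized = np.round(clipped / step) * step
    return float(np.sum(probs * (centers - quantized) ** 2))

def search_alpha(weights, bits=4, coarse_points=10, fine_points=10, rounds=3):
    """Coarse grid over (0, max|w|], then repeated refinement around the best candidate."""
    centers, probs = abs_histogram(weights)
    w_max = float(np.abs(weights).max())    # assumes at least one nonzero weight
    lo, hi, n = w_max / coarse_points, w_max, coarse_points
    best_alpha, best_err = w_max, expected_error(w_max, centers, probs, bits)
    for _ in range(rounds + 1):
        grid = np.linspace(lo, hi, n)
        errs = [expected_error(a, centers, probs, bits) for a in grid]
        i = int(np.argmin(errs))
        if errs[i] < best_err:              # keep the best candidate seen so far
            best_alpha, best_err = float(grid[i]), errs[i]
        # Next grid spans the two cells adjacent to the current best point.
        lo, hi, n = grid[max(i - 1, 0)], grid[min(i + 1, n - 1)], fine_points
    return best_alpha
```

Because the search starts from the AbsMax range and only accepts improvements, the returned value never has a higher estimated error than plain AbsMax scaling on the same histogram.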
Activation-Aware Extension (SLiM-Quant^O): Defines channel saliency as \(|diag(\mathbf{x})\mathcal{W}|\). For roughly the top 1% most salient channels, the weights are amplified and the corresponding activations are scaled down by the same factor, so the salient weights suffer less relative quantization error.
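A rough sketch of this channel scaling, assuming the layer computes `x @ w` with `w` of shape `(d_in, d_out)`. The function name, the sum over outputs used to aggregate saliency per channel, and the scale factor of 2.0 are hypothetical choices made for illustration.

```python
import numpy as np

def scale_salient_channels(x_abs_mean, w, top_frac=0.01, scale=2.0):
    """Amplify the weight rows of the most salient input channels.

    x_abs_mean: (d_in,) average |activation| per input channel (calibration statistic).
    w:          (d_in, d_out) weight matrix.
    Returns the scaled weights and the per-channel factors s; activations must be
    divided by s at inference so that (x / s) @ (w * s[:, None]) equals x @ w.
    """
    saliency = np.abs(x_abs_mean[:, None] * w).sum(axis=1)   # assumed aggregation over outputs
    k = max(1, int(round(top_frac * w.shape[0])))             # about the top 1% of channels
    top = np.argsort(saliency)[-k:]
    s = np.ones(w.shape[0])
    s[top] = scale                                            # hypothetical amplification factor
    return w * s[:, None], s
```

The product is unchanged before quantization; the benefit only appears once the amplified rows are quantized, since they then occupy a larger share of the quantization range.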
Step 2: Sparsification¶
Applies Wanda on the quantized weights to perform semi-structured (2:4) or unstructured sparsity.
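For reference, a minimal sketch of 2:4 pruning with the Wanda score (weight magnitude times the L2 norm of the matching input feature). The `(d_in, d_out)` layout and the function name are assumptions; in the pipeline above, `w` would be the already-quantized weights.

```python
import numpy as np

def wanda_2_4_mask(w, x_norm):
    """Boolean keep-mask with 2 of every 4 consecutive input weights kept per output.

    w:      (d_in, d_out) weights, d_in divisible by 4.
    x_norm: (d_in,) L2 norm of each input feature over the calibration data.
    """
    d_in, d_out = w.shape
    assert d_in % 4 == 0, "2:4 sparsity needs the input dimension to be a multiple of 4"
    score = np.abs(w) * x_norm[:, None]                  # Wanda importance score
    groups = score.T.reshape(d_out, d_in // 4, 4)        # groups of 4 along the input dim
    order = np.argsort(groups, axis=-1)                  # ascending importance
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, order[..., :2], False, axis=-1)   # drop the 2 lowest per group
    return mask.reshape(d_out, d_in).T

# Usage: w_sparse = w * wanda_2_4_mask(w, x_norm)
```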
Step 3: SLiM-LoRA (Saliency-Based Low-Rank Adapter)¶
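Core Innovation: Designs a saliency function \(F(\mathcal{W}) = diag(\mathbf{x})\mathcal{W}\) that is both invertible, \(F^{-1}(\mathcal{Y}) = diag(\mathbf{x})^{-1}\mathcal{Y}\) for nonzero entries of \(\mathbf{x}\), and additive, \(F(\mathcal{A} + \mathcal{B}) = F(\mathcal{A}) + F(\mathcal{B})\).
Leveraging these two properties, the adapter values are solved directly, without iterative optimization, via a truncated SVD of the saliency-weighted compression error \(F(E_Q + E_S)\) mapped back through \(F^{-1}\), where \(E_Q\) and \(E_S\) represent the quantization and sparsity errors, respectively.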
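A minimal sketch of this closed-form solve, under stated assumptions: the total error is taken as \(E_Q + E_S = \mathcal{W}^C - \mathcal{W}\) (compressed minus dense), \(\mathbf{x}\) is a strictly positive per-channel activation statistic, and the function name is hypothetical.

```python
import numpy as np

def slim_lora(w, w_c, x_abs_mean, rank):
    """Low-rank factors (left, right) such that w_c + left @ right approximates w
    in the saliency-weighted sense.

    w, w_c:      (d_in, d_out) dense and compressed weights.
    x_abs_mean:  (d_in,) strictly positive activation statistic (assumed calibration input).
    """
    x = np.asarray(x_abs_mean, dtype=np.float64)
    err = w_c - w                                   # assumed to equal E_Q + E_S
    target = -(x[:, None] * err)                    # F(-(E_Q + E_S)), using additivity
    u, s, vt = np.linalg.svd(target, full_matrices=False)
    left_sal = u[:, :rank] * s[:rank]               # best rank-r factor in saliency space
    left = left_sal / x[:, None]                    # F^{-1}: undo the diag(x) scaling
    right = vt[:rank]
    return left, right
```

By the Eckart-Young theorem, the truncated SVD is the best rank-`rank` approximation of the saliency-weighted error in Frobenius norm, which is why no iterative optimization is needed in this sketch.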
Adapter Quantization¶
The low-rank adapters are themselves quantized to 4 bits (AbsMax group quantization with a group size of 128), reducing their storage overhead by roughly 4× compared with 16-bit adapters.
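AbsMax group quantization can be sketched as below. The row-major flattening into groups and the symmetric signed 4-bit range of [-7, 7] are assumptions of this sketch; codes are held in `int8` for simplicity rather than packed two per byte.

```python
import numpy as np

def absmax_group_quantize(a, bits=4, group_size=128):
    """Quantize each group of `group_size` consecutive values with its own AbsMax scale."""
    flat = np.asarray(a, dtype=np.float32).ravel()
    assert flat.size % group_size == 0, "pad the tensor so it splits into whole groups"
    groups = flat.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1                              # 7 for signed 4-bit
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                               # all-zero groups: any scale works
    codes = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return codes, scales

def dequantize(codes, scales, shape):
    """Recover a float tensor of the original shape from codes and per-group scales."""
    return (codes.astype(np.float32) * scales).reshape(shape)
```

Each group stores one extra scale, so the effective footprint is slightly above 4 bits per value; with a group size of 128 and 16-bit scales that is 4 + 16/128 = 4.125 bits.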
Optional Post-Compression Fine-Tuning¶
Freezes the sparse-quantized weights and only fine-tunes the low-rank adapters, utilizing STE (Straight-Through Estimator) to achieve quantization-aware fine-tuning.
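The idea of quantization-aware adapter fine-tuning with a straight-through estimator can be illustrated with hand-written gradients in NumPy. This is a toy sketch under several assumptions: a squared-error loss against dense-layer outputs, per-tensor fake quantization instead of group quantization, plain gradient descent, and hypothetical function names. A real setup would use an autograd framework and the task loss.

```python
import numpy as np

def fake_quant(a, bits=4):
    """Quantize-dequantize with a per-tensor AbsMax scale (simplified, no grouping)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(a).max() / qmax
    if scale == 0:
        return a.copy()
    return np.round(a / scale) * scale

def finetune_adapters(x, y_target, w_c, left, right, lr=1e-2, steps=50):
    """Update only the adapters; the compressed weights w_c stay frozen."""
    left, right = left.copy(), right.copy()              # latent full-precision adapters
    for _ in range(steps):
        lq, rq = fake_quant(left), fake_quant(right)      # forward pass uses quantized adapters
        resid = x @ (w_c + lq @ rq) - y_target            # (n, d_out) output error
        grad_prod = x.T @ resid / x.shape[0]              # gradient w.r.t. the product lq @ rq
        grad_l = grad_prod @ rq.T
        grad_r = lq.T @ grad_prod
        # STE: treat d fake_quant(a) / d a as identity, so the gradients computed at
        # the quantized values are applied directly to the latent adapters.
        left -= lr * grad_l
        right -= lr * grad_r
    return left, right
```

After training, the latent adapters are quantized once more for deployment, so the forward pass seen during fine-tuning matches what runs at inference.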
Key Experimental Results¶
Main Results: Zero-Shot Accuracy (4-bit + 2:4 Sparsity)¶
| Method | LLaMA-2-7B | LLaMA-2-13B | OPT-13B |
|---|---|---|---|
| Dense | 56.6 | 60.8 | 48.7 |
| Wanda + Group AbsMax | 40.6 | 49.6 | 37.7 |
| SparseGPT + OPTQ | 42.6 | 53.3 | 43.0 |
| JSQ | 44.3 | 53.7 | 42.0 |
| SLiM | 46.3 | 57.2 | 43.6 |
| SLiM + PEFT | 47.0 | 58.9 | - |
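- Up to 5.66% improvement in average zero-shot accuracy on LLaMA-2-7B over prior one-shot methods
- Up to 3.89% improvement in average zero-shot accuracy on LLaMA-2-13B over prior one-shot methods
- In certain configurations (not shown in the table above), SLiM even outperforms the dense model by 0.6%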
Hardware Acceleration¶
| Metric | Value |
|---|---|
| Layer-wise speedup (RTX 3060) | 4.3× |
| Layer-wise speedup (A100) | 3.8× |
| Memory reduction | 0.23× |
Ablation Study: Contribution of Each Component¶
| Configuration | PPL (WikiText2) |
|---|---|
| SLiM-Quant only | 6.89 |
| + Wanda | 8.12 |
| + SLiM-LoRA | 7.45 |
| + Adapter Quantization | 7.51 |
| + PEFT | 7.32 |
Highlights & Insights¶
- The invertible and additive design of the saliency function is highly elegant, enabling closed-form solutions for the low-rank adapters.
- Uniform quantization with a probabilistically optimized range avoids the per-group scale overhead that group quantization incurs.
- Seamless integration of the three compression technologies, with clear mathematical motivation for each step.
- End-to-end one-shot scheme, requiring no large-scale retraining.
- Maintains high accuracy even under extreme compression conditions (4-bit + 2:4 sparsity).
Limitations & Future Work¶
- The numerical integration of SLiM-Quant relies on the weight histogram and may not be robust to outlier distributions.
- SLiM-LoRA has limited options for the saliency function (only \(diag(\mathbf{x})\mathcal{W}\)), which may not be optimal.
- Evaluated only on LLaMA-2 and OPT, lacking experiments on newer models such as LLaMA-3 and Mistral.
- The rank selection for the adapters lacks an adaptive mechanism.
- Although the optional PEFT step increases accuracy, it also adds to the complexity of the pipeline.
Rating¶
⭐⭐⭐⭐: Rigorous theoretical derivation and complete engineering implementation. The combination of the three compression techniques achieves a new SOTA in one-shot compression.