SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression

Conference: ICML 2025
arXiv: 2410.09615
Code: GitHub
Area: Interpretability
Keywords: one-shot compression, quantization, sparsity, low-rank adapter, SLiM-Quant, SLiM-LoRA

TL;DR

Proposes SLiM, a one-shot compression framework that combines hardware-friendly uniform quantization, semi-structured sparsity, and saliency-based low-rank adapters, improving accuracy by up to 5.66% over existing compression baselines under 4-bit quantization with 2:4 sparsity.

Background & Motivation

  • While pruning or quantization alone can effectively reduce inference costs, their joint application leads to error accumulation, resulting in severe performance degradation.
  • Existing one-shot compression methods struggle to recover the accuracy of dense models under 4-bit + structured sparsity scenarios.
  • Low-rank adapters can mitigate compression losses, but typically require expensive retraining.
  • Existing low-rank methods (such as L2QER) are designed solely for quantization and perform poorly when combined with sparsity.

Method

Three-Step Pipeline

Step 1: SLiM-Quant (Probabilistic Uniform Quantization)

Reformulates the quantization problem from non-convex optimization into a probabilistic framework:

\[\alpha^* = \arg\min_\alpha \int_{-\infty}^{\infty} f(x)\,|Q^{-1}(Q(x)) - x|^2 \, dx\]

Decomposed into quantization error + clipping error:

\[E_Q(\alpha) = E_{quant}(\alpha) + E_{clip}(\alpha)\]
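
One way to read this split, assuming symmetric clipping at \(\pm\alpha\) so that every \(|x| > \alpha\) is reconstructed as \(\pm\alpha\) (the exact form of the two terms is an assumption here, not taken from the paper):

\[E_{quant}(\alpha) = \int_{-\alpha}^{\alpha} f(x)\,|Q^{-1}(Q(x)) - x|^2 \, dx, \qquad E_{clip}(\alpha) = \int_{|x| > \alpha} f(x)\,(|x| - \alpha)^2 \, dx\]

Under this reading, a larger \(\alpha\) shrinks \(E_{clip}\) but widens the quantization step and so grows \(E_{quant}\), which is why an interior optimum exists.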

Because real weight distributions fit no standard distribution (Gaussian, Laplace, and others are all ruled out), the optimal \(\alpha\) is found by numerical integration with a multi-grid search: a coarse pass over 10 uniformly sampled candidates, followed by a finer search around the best one.
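
The coarse-to-fine search can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the histogram stands in for \(f(x)\), the quantizer is assumed to be symmetric with \(2^{b-1}-1\) positive levels, and the names `quant_error`, `search_alpha`, `fine_points`, and `bins` are hypothetical.

```python
import numpy as np

def quant_error(alpha, centers, probs, bits=4):
    """Expected squared error of symmetric uniform quantization clipped at +/- alpha."""
    levels = 2 ** (bits - 1) - 1                    # 7 positive levels for 4 bits (assumption)
    step = alpha / levels
    q = np.clip(np.round(centers / step), -levels, levels)   # Q(x)
    deq = q * step                                             # Q^{-1}(Q(x))
    # Discrete version of the integral: sum over histogram bins of f(x) * error^2
    return float(np.sum(probs * (deq - centers) ** 2))

def search_alpha(weights, bits=4, coarse_points=10, fine_points=51, bins=2048):
    """Coarse search over uniformly spaced alphas, then refine around the best one."""
    hist, edges = np.histogram(weights, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    probs = hist / hist.sum()
    w_max = float(np.abs(weights).max())

    # Coarse pass: 10 uniformly spaced candidates in (0, w_max]
    coarse = np.linspace(w_max / coarse_points, w_max, coarse_points)
    errs = [quant_error(a, centers, probs, bits) for a in coarse]
    best = int(np.argmin(errs))

    # Fine pass: denser grid between the neighbours of the coarse optimum
    lo = coarse[max(best - 1, 0)]
    hi = coarse[min(best + 1, coarse_points - 1)]
    fine = np.linspace(lo, hi, fine_points)
    errs = [quant_error(a, centers, probs, bits) for a in fine]
    return float(fine[int(np.argmin(errs))])
```

For a bell-shaped weight tensor, the returned \(\alpha\) typically lands well below the maximum absolute weight, trading a little clipping error for a finer step.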

Activation-Aware Extension (SLiM-Quant^O): Defines channel saliency as \(|diag(\mathbf{x}) \times \mathcal{W}|\), amplifying weights and scaling down activations for approximately the top 1% most salient channels.
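
The amplify-and-rescale idea can be illustrated with a small sketch. Several details are assumptions: saliency is aggregated per input channel by summing \(|diag(\mathbf{x})\mathcal{W}|\) over outputs, the weight layout is `(d_in, d_out)`, and the function name and the scale factor of 2 are hypothetical.

```python
import numpy as np

def scale_salient_channels(W, x_mean, scale=2.0, frac=0.01):
    """Amplify the most salient input channels of W; activations must be divided by s.

    W:      (d_in, d_out) weight matrix (layout is an assumption)
    x_mean: (d_in,) average activation magnitude per input channel
    """
    # Channel saliency: |diag(x) W| summed over outputs (aggregation is an assumption)
    saliency = np.abs(x_mean[:, None] * W).sum(axis=1)
    k = max(1, int(round(frac * W.shape[0])))
    top = np.argsort(saliency)[-k:]          # indices of the top ~1% channels
    s = np.ones(W.shape[0])
    s[top] = scale                           # hypothetical amplification factor
    return W * s[:, None], s
```

Because each weight row is multiplied by \(s_j\) and the matching activation is divided by \(s_j\), the layer output is unchanged before quantization; only the relative quantization error on the salient channels shrinks.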

Step 2: Sparsification

Applies Wanda on the quantized weights to perform semi-structured (2:4) or unstructured sparsity.
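
A rough sketch of Wanda-style 2:4 pruning, assuming the usual Wanda score (weight magnitude times the L2 norm of the matching input activation) and groups of four consecutive input channels per output; the `(d_in, d_out)` layout and the name `wanda_2_4` are illustrative choices, not the paper's code.

```python
import numpy as np

def wanda_2_4(W, x_norm):
    """Keep the 2 highest-scoring weights in every group of 4 input channels.

    W:      (d_in, d_out) weights, d_in divisible by 4 (layout is an assumption)
    x_norm: (d_in,) L2 norm of each input channel over calibration data
    """
    d_in, d_out = W.shape
    assert d_in % 4 == 0, "2:4 sparsity needs the input dimension divisible by 4"
    score = np.abs(W) * x_norm[:, None]              # Wanda importance score
    groups = score.T.reshape(d_out, d_in // 4, 4)    # groups of 4 along the input dim
    drop = np.argsort(groups, axis=-1)[..., :2]      # two lowest scores per group
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    mask = mask.reshape(d_out, d_in).T               # back to (d_in, d_out)
    return W * mask, mask
```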

Step 3: SLiM-LoRA (Saliency-Based Low-Rank Adapter)

Core Innovation: Designs a saliency function \(F(\mathcal{W}) = diag(\mathbf{x})\mathcal{W}\) that satisfies invertibility and additivity:

\[F(A+B) = F(A) + F(B) \quad \text{(additivity)}, \qquad F^{-1}(F(A)) = A \quad \text{(invertibility)}\]

Leveraging these two properties, the adapter values are solved directly via SVD without iterative optimization:

\[diag(\mathbf{x})\mathcal{L}, \mathcal{R} = -SVD(diag(\mathbf{x})(E_Q + E_S))\]

where \(E_Q\) and \(E_S\) represent the quantization and sparsity errors, respectively.
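
The closed-form solve can be sketched as below. This is a minimal illustration under stated assumptions: weights use a `(d_in, d_out)` layout, `x` holds strictly positive per-channel activation magnitudes (so \(diag(\mathbf{x})\) is invertible), the total error is taken as `W_c - W`, and `slim_lora` is a hypothetical name.

```python
import numpy as np

def slim_lora(W, W_c, x, rank):
    """Rank-r adapter (L, R) such that W_c + L @ R approximates W in saliency space.

    W:   (d_in, d_out) original weights
    W_c: (d_in, d_out) quantized + sparsified weights
    x:   (d_in,) positive activation statistics (nonzero, so diag(x) is invertible)
    """
    E = W_c - W                                   # E_Q + E_S, the total compression error
    # SVD of the negated saliency-weighted error: -diag(x) (E_Q + E_S)
    U, S, Vt = np.linalg.svd(-(x[:, None] * E), full_matrices=False)
    # diag(x) L = U_r S_r, so undo the saliency scaling to recover L
    L = (U[:, :rank] * S[:rank]) / x[:, None]
    R = Vt[:rank]
    return L, R
```

By the Eckart-Young theorem, the truncated SVD gives the best rank-\(r\) approximation in Frobenius norm, so the residual \(diag(\mathbf{x})(\mathcal{W}_c + \mathcal{L}\mathcal{R} - \mathcal{W})\) is as small as any rank-\(r\) adapter can make it.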

Adapter Quantization

The low-rank adapters are themselves quantized to 4 bits (AbsMax group quantization with a group size of 128), reducing their memory overhead by about 4× relative to 16-bit storage.
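
A minimal sketch of AbsMax group quantization, assuming symmetric signed 4-bit levels in \([-7, 7]\) and groups formed from 128 consecutive elements of the flattened adapter; how the paper lays out groups is not stated here, and both function names are hypothetical.

```python
import numpy as np

def absmax_group_quant(A, bits=4, group_size=128):
    """Quantize A in groups, each scaled by its own max absolute value."""
    assert A.size % group_size == 0, "size must be a multiple of the group size"
    flat = A.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1                         # 7 for 4 bits
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)           # avoid dividing by zero
    q = np.clip(np.round(flat / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def absmax_group_dequant(q, scale, shape):
    """Map integer codes back to floats and restore the original shape."""
    return (q * scale).reshape(shape)
```

Each group stores one scale plus 128 four-bit codes, so the per-element rounding error is bounded by half that group's step size.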

Optional Post-Compression Fine-Tuning

Freezes the sparse-quantized weights and only fine-tunes the low-rank adapters, utilizing STE (Straight-Through Estimator) to achieve quantization-aware fine-tuning.
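
In its generic form, the straight-through estimator uses the quantized adapter in the forward pass and treats the quantizer as the identity in the backward pass. Writing \(\ell\) for the fine-tuning loss (a symbol introduced here for illustration), for an adapter factor \(\mathcal{L}\):

\[\hat{\mathcal{L}} = Q(\mathcal{L}) \quad \text{(forward)}, \qquad \frac{\partial \ell}{\partial \mathcal{L}} \approx \frac{\partial \ell}{\partial \hat{\mathcal{L}}} \quad \text{(backward)}\]

The same rule would apply to \(\mathcal{R}\). This is the textbook STE; the paper's exact variant may differ.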

Key Experimental Results

Main Results: Zero-Shot Accuracy (4-bit + 2:4 Sparsity)

| Method | LLaMA-2-7B | LLaMA-2-13B | OPT-13B |
|---|---|---|---|
| Dense | 56.6 | 60.8 | 48.7 |
| Wanda + Group AbsMax | 40.6 | 49.6 | 37.7 |
| SparseGPT + OPTQ | 42.6 | 53.3 | 43.0 |
| JSQ | 44.3 | 53.7 | 42.0 |
| SLiM | 46.3 | 57.2 | 43.6 |
| SLiM + PEFT | 47.0 | 58.9 | - |

  • Average improvement of 5.66% on LLaMA-2-7B
  • Average improvement of 3.89% on LLaMA-2-13B
  • SLiM even outperforms the dense model by 0.6% in certain configurations

Hardware Acceleration

| GPU | Layer-wise Speedup |
|---|---|
| RTX 3060 | 4.3× |
| A100 | 3.8× |

Memory Reduction: 0.23×

Ablation Study: Contribution of Each Component

| Configuration | PPL (WikiText2) |
|---|---|
| SLiM-Quant only | 6.89 |
| + Wanda | 8.12 |
| + SLiM-LoRA | 7.45 |
| + Adapter Quantization | 7.51 |
| + PEFT | 7.32 |

Highlights & Insights

  • The invertible and additive design of the saliency function is highly elegant, enabling closed-form solutions for the low-rank adapters.
  • Uniform quantization + probabilistic optimization eliminates the extra overhead of group quantization.
  • Seamless integration of the three compression technologies, with clear mathematical motivation for each step.
  • End-to-end one-shot scheme, requiring no large-scale retraining.
  • Maintains high accuracy even under extreme compression conditions (4-bit + 2:4 sparsity).

Limitations & Future Work

  • The numerical integration of SLiM-Quant relies on the weight histogram and may not be robust to outlier distributions.
  • SLiM-LoRA has limited options for the saliency function (only \(diag(\mathbf{x})\mathcal{W}\)), which may not be optimal.
  • Evaluated only on LLaMA-2 and OPT, lacking experiments on newer models such as LLaMA-3 and Mistral.
  • The rank selection for the adapters lacks an adaptive mechanism.
  • Although the optional PEFT step increases accuracy, it also adds to the complexity of the pipeline.

Rating

⭐⭐⭐⭐ — Rigorous theoretical derivation and complete engineering implementation. The joint optimization of the three compression technologies achieves a new SOTA in one-shot compression.