GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance¶

Conference: ICML 2025
arXiv: 2505.07004
Code: snu-mllab/GuidedQuant
Area: Model Compression
Keywords: Post-training quantization, Fisher Information Matrix, Non-uniform scalar quantization, End-to-end loss guidance, LLM compression

TL;DR¶

GuidedQuant incorporates end-to-end loss gradient information into layer-wise quantization objectives while preserving weight interactions within output channels. As a plug-and-play module, it improves existing SOTA PTQ methods across scalar, vector, and weight-activation quantization. The paper also proposes LNQ, an algorithm for non-uniform scalar quantization; combined with GuidedQuant, it reduces the 2-bit Llama-2-7B perplexity from 39.58 (SqueezeLLM) to 8.83.

Background & Motivation¶

Existing LLM post-training quantization (PTQ) methods mainly rely on layer-wise output reconstruction error as a proxy objective, i.e., minimizing the MSE of each layer's output before and after quantization: $\|\mathbf{X}\mathbf{W} - \mathbf{X}\hat{\mathbf{W}}\|_F^2$. This objective has a key limitation: it treats all hidden features as equally important, overlooking their varying impacts on the final loss.

Another line of work, such as SqueezeLLM, leverages gradient information to compute a saliency score for each weight, measuring the impact of weight error on the end loss via the diagonal approximation of the Fisher Information Matrix. However, the diagonal approximation is too coarse—it ignores the interaction between weights, whereas the Fisher matrix actually possesses a prominent block-diagonal structure (which the authors verified via visualization on Llama-2-7B).

Key Challenge: Existing objectives either ignore differences in feature importance (layer-wise reconstruction) or ignore weight interactions (diagonal Fisher approximation); neither approximates the end loss accurately enough.

Method¶

Overall Architecture¶

The core idea of GuidedQuant is to weight the output error using the gradients of the end-to-end loss with respect to the layer outputs, while preserving the interaction and dependency among weights within output channels.

Specifically, for the $l$-th layer, the following quantization objective is proposed:

\[\left\|\frac{\partial \ell}{\partial \mathbf{Z}^{(l)}} \odot (\mathbf{X}^{(l)}\mathbf{W}^{(l)} - \mathbf{X}^{(l)}\hat{\mathbf{W}}^{(l)})\right\|_F^2\]

where $\frac{\partial \ell}{\partial \mathbf{Z}^{(l)}}$ represents the gradient of the loss with respect to the output of that layer, and $\odot$ denotes the element-wise product. Intuitively, output dimensions with larger gradients have a greater impact on the final loss, and thus their quantization errors should be assigned higher weights.
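The objective above can be written in a few lines of NumPy. This is a minimal sketch on synthetic data, not the authors' implementation: the function names, the toy rounding quantizer, and the random matrix `G` standing in for $\partial \ell / \partial \mathbf{Z}^{(l)}$ are all assumptions made for illustration.

```python
import numpy as np

def guided_layer_objective(X, W, W_hat, G):
    """|| G * (X W - X W_hat) ||_F^2, with G playing the role of dl/dZ."""
    err = X @ (W - W_hat)              # (n, d_out) output error of the layer
    return float(np.sum((G * err) ** 2))

def plain_layer_objective(X, W, W_hat):
    """Unweighted layer-wise reconstruction error || X W - X W_hat ||_F^2."""
    return float(np.sum((X @ (W - W_hat)) ** 2))

rng = np.random.default_rng(0)
n, d_in, d_out = 32, 8, 4              # toy sizes, chosen arbitrarily
X = rng.normal(size=(n, d_in))         # layer inputs
W = rng.normal(size=(d_in, d_out))     # full-precision weights
W_hat = np.round(W * 2) / 2            # toy quantizer: round to nearest 0.5
G = rng.normal(size=(n, d_out))        # stand-in for the output gradients

print(guided_layer_objective(X, W, W_hat, G))
print(plain_layer_objective(X, W, W_hat))
```

With `G` set to all ones the weighted objective reduces to the plain reconstruction error, which makes explicit that the gradients only re-weight individual output entries.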

This objective is equivalent to a finer second-order Taylor approximation, where the Hessian adopts a block-diagonal approximation of the Fisher information matrix, preserving the $d_{in} \times d_{in}$ blocks $\mathbf{F}_j^{(l)}$ within each output channel $j$ while discarding cross-channel and cross-layer interactions:

\[n \sum_{l=1}^{L} \sum_{j=1}^{d_{out}^{(l)}} (\mathbf{w}_j^{(l)} - \hat{\mathbf{w}}_j^{(l)})^\top \mathbf{F}_j^{(l)} (\mathbf{w}_j^{(l)} - \hat{\mathbf{w}}_j^{(l)})\]
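For a single layer, the equivalence between the gradient-weighted output error and the sum of per-channel quadratic forms can be checked numerically. The sketch below assumes $n\mathbf{F}_j = \mathbf{X}^\top \text{Diag}\big((\partial \ell / \partial \mathbf{z}_j)^2\big) \mathbf{X}$, which matches the Hessian form used in the averaging approximation; all data are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 16, 6, 3                      # toy sizes, chosen arbitrarily
X = rng.normal(size=(n, d_in))
W = rng.normal(size=(d_in, d_out))
W_hat = W + 0.05 * rng.normal(size=W.shape)    # perturbed weights as a stand-in for quantization
G = rng.normal(size=(n, d_out))                # stand-in for dl/dZ

# Left-hand side: gradient-weighted output error.
lhs = np.sum((G * (X @ (W - W_hat))) ** 2)

# Right-hand side: one quadratic form per output channel j.
rhs = 0.0
for j in range(d_out):
    H_j = X.T @ (G[:, j:j + 1] ** 2 * X)       # X^T Diag(g_j^2) X, shape (d_in, d_in)
    delta = W[:, j] - W_hat[:, j]
    rhs += delta @ H_j @ delta

print(np.isclose(lhs, rhs))
```

The two quantities agree because each output channel contributes $\sum_i g_{ij}^2 (\mathbf{x}_i^\top \boldsymbol{\delta}_j)^2$, which is exactly the quadratic form in $\boldsymbol{\delta}_j = \mathbf{w}_j - \hat{\mathbf{w}}_j$.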

Key Designs¶

Averaging Approximation

Directly computing the Hessian $\mathbf{H}_j^{(l)} = n\mathbf{F}_j^{(l)}$ for each output channel $j$ requires $O(d_{in}^2 \cdot d_{out})$ storage—which is prohibitive for modern LLMs ($d_{in}, d_{out} > 10^3$).

Core Approximation: Partitioning the $d_{out}$ output channels into $g$ groups ($g \ll d_{out}$) and averaging the Hessians within each group:

\[\bar{\mathbf{H}}_k^{(l)} = \frac{1}{|J_k|} \sum_{j \in J_k} \mathbf{H}_j^{(l)}\]

By the chain rule, this is equivalent to averaging the squared gradients:

\[\bar{\mathbf{H}}_k^{(l)} = \mathbf{X}^{(l)\top} \text{Diag}\left(\frac{1}{|J_k|}\sum_{j \in J_k} \left(\frac{\partial \ell}{\partial \mathbf{z}_j^{(l)}}\right)^2 \right) \mathbf{X}^{(l)}\]

Consequently, only $g$ matrices of size $d_{in} \times d_{in}$ need to be stored per layer, reducing storage to $O(d_{in}^2 \cdot g)$. The grouping strategy simply groups every $d_{out}/g$ consecutive channels. In practice, gradients are multiplied by a large constant of $10^3$ to prevent underflow.
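A minimal sketch of the group-averaged Hessian, assuming consecutive-channel grouping and $d_{out}$ divisible by $g$. The function name `group_averaged_hessians` and the random inputs are hypothetical, and the $10^3$ gradient scaling mentioned above is omitted for simplicity.

```python
import numpy as np

def group_averaged_hessians(X, G, g):
    """Return g matrices of shape (d_in, d_in), one per group of consecutive channels."""
    n, d_out = G.shape
    assert d_out % g == 0, "this sketch assumes d_out is divisible by g"
    size = d_out // g
    hessians = []
    for k in range(g):
        # Average the squared gradients over the channels in group k: shape (n,).
        s = np.mean(G[:, k * size:(k + 1) * size] ** 2, axis=1)
        # X^T Diag(s) X without materializing the n x n diagonal matrix.
        hessians.append(X.T @ (s[:, None] * X))
    return np.stack(hessians)

rng = np.random.default_rng(0)
n, d_in, d_out = 16, 6, 4                      # toy sizes, chosen arbitrarily
X = rng.normal(size=(n, d_in))
G = rng.normal(size=(n, d_out))                # stand-in for dl/dZ
H_bar = group_averaged_hessians(X, G, g=2)
print(H_bar.shape)                             # (2, 6, 6)
```

Averaging the squared gradients first means the per-channel Hessians are never formed, so only $g$ matrices per layer are ever held in memory.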

Plug-and-play Capability: GuidedQuant can enhance any layer-wise output-oriented PTQ method by simply replacing the original Hessian $\mathbf{H}^{(l)}$ with the group-averaged Hessian $\bar{\mathbf{H}}_k^{(l)}$.

LNQ: Layer-wise Non-uniform Quantization

The authors also propose the LNQ algorithm to improve the optimization process in non-uniform scalar quantization. Existing SOTA (GPTVQ 1D) uses gradient descent to optimize the codebook and GPTQ to optimize assignments, both of which are sub-optimal.

LNQ employs an alternating minimization strategy:

Codebook Optimization (fixing assignment $\mathbf{P}$): A standard least-squares problem, which has a closed-form solution: $\mathbf{c}^{(j)*} = (\mathbf{P}^{(j)} \mathbf{H} \mathbf{P}^{(j)\top})^{-1} \mathbf{P}^{(j)} \mathbf{H} \mathbf{w}_j$
Assignment Optimization (fixing codebook $\mathbf{c}$): Utilizing Cyclic Coordinate Descent (Cyclic CD) instead of GPTQ. CD minimizes the objective coordinate-by-coordinate, where each step has a closed-form solution: quantizing to the nearest codebook value and compensating for the errors of other coordinates.

Theoretical Guarantee: The objective value of LNQ is monotonically non-increasing and convergent (Proposition 4.1).
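The alternating scheme can be sketched for a single output channel as follows. This is an illustrative NumPy version under stated assumptions, not the authors' GPU implementation: the uniform-grid initialization, the synthetic Hessian, and all function names are hypothetical, `np.linalg.lstsq` replaces the explicit inverse so that unused codebook entries do not break the solve, and none of the precomputation or lazy batch-update optimizations are included.

```python
import numpy as np

def objective(w, w_hat, H):
    d = w - w_hat
    return float(d @ H @ d)

def update_codebook(w, assign, H, m):
    """Closed-form least-squares codebook for a fixed assignment."""
    P = np.zeros((m, w.size))
    P[assign, np.arange(w.size)] = 1.0         # P[q, i] = 1 iff weight i uses code q
    A = P @ H @ P.T
    b = P @ H @ w
    return np.linalg.lstsq(A, b, rcond=None)[0]

def update_assignment(w, assign, c, H, cycles):
    """Cyclic coordinate descent over assignments for a fixed codebook."""
    assign = assign.copy()
    delta = w - c[assign]                      # current error vector
    for _ in range(cycles):
        for i in range(w.size):
            # Error contributed by the other coordinates, seen through row i of H.
            cross = H[i] @ delta - H[i, i] * delta[i]
            target = w[i] + cross / H[i, i]    # unconstrained optimum for coordinate i
            assign[i] = int(np.argmin(np.abs(c - target)))
            delta[i] = w[i] - c[assign[i]]
    return assign

def lnq_channel(w, H, m=4, T=2, K=4):
    c = np.linspace(w.min(), w.max(), m)       # assumed initialization: uniform grid
    assign = np.argmin(np.abs(w[:, None] - c[None, :]), axis=1)
    history = [objective(w, c[assign], H)]
    for _ in range(T):
        c = update_codebook(w, assign, H, m)
        history.append(objective(w, c[assign], H))
        assign = update_assignment(w, assign, c, H, K)
        history.append(objective(w, c[assign], H))
    return c, assign, history

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
H = X.T @ X + 1e-3 * np.eye(16)                # synthetic positive-definite Hessian
w = rng.normal(size=16)
c, assign, history = lnq_channel(w, H)
print(history)                                 # objective after each half-step
```

Each half-step minimizes the same objective over one block of variables with the other held fixed, and the current value always remains feasible, so the recorded objective values should not increase. This mirrors the monotonicity statement above.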

Implementation Optimization: CD is accelerated by $4\times$ on GPUs via three techniques: closed-form coordinate updates, precomputation, and lazy batch-updates.

Loss & Training¶

Quantization Objective: Group-weighted layer-wise output reconstruction error (Eq. 7).
Gradient weights: A single backpropagation pass over the calibration set suffices to compute the gradients, with a storage overhead of $O(ngL)$.
Memory and parallelism: Total memory is $O(Lg(d_{in}^2 + n))$, and different groups and layers can be quantized fully in parallel.
GuidedQuant hyperparameters: $g=4$ (for 7B/13B), $g=2$ (for 70B), $g=1$ (for weight-activation quantization).
LNQ hyperparameters: alternating iterations $T=2$, CD cycles $K=4$ (for 7B/13B); $T=1, K=4$ (for 70B).
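To give a feel for these storage terms, the following back-of-the-envelope calculation compares per-channel Hessians with group-averaged ones for a single layer. The concrete values ($d_{in} = d_{out} = 4096$, $2^{20}$ calibration tokens, 4-byte floats) are illustrative assumptions and are not taken from the paper.

```python
GiB = 2 ** 30
MiB = 2 ** 20

def hessian_cache_bytes(num_layers, d_in, hessians_per_layer, bytes_per_value=4):
    """Storage for hessians_per_layer matrices of size d_in x d_in in each layer."""
    return num_layers * hessians_per_layer * d_in * d_in * bytes_per_value

def grad_stat_bytes(num_layers, n_tokens, g, bytes_per_value=4):
    """Storage for g averaged squared-gradient vectors of length n_tokens per layer."""
    return num_layers * g * n_tokens * bytes_per_value

# Assumed sizes for one layer: d_in = d_out = 4096, g = 4, 2**20 calibration tokens.
per_channel = hessian_cache_bytes(1, 4096, 4096)   # one Hessian per output channel
grouped = hessian_cache_bytes(1, 4096, 4)          # g = 4 group-averaged Hessians
grads = grad_stat_bytes(1, 2 ** 20, 4)

print(per_channel / GiB)   # 256.0 (GiB)
print(grouped / MiB)       # 256.0 (MiB)
print(grads / MiB)         # 16.0 (MiB)
```

Under these assumptions the grouped cache is smaller by a factor of $d_{out}/g = 1024$, which is what makes caching Hessians for every layer feasible.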

Key Experimental Results¶

Main Results¶

Non-uniform Scalar Quantization (Llama-2 family, WikiText2 and C4 perplexity, context=4096):

Model	Method	Bits	Wiki2 ↓	C4 ↓
Llama-2-7B	SqueezeLLM	2.01	39.58	44.05
Llama-2-7B	LNQ	2.01	23.31	26.71
Llama-2-7B	LNQ + GQuant	2.01	8.83	11.15
Llama-2-13B	SqueezeLLM	2.01	16.24	19.20
Llama-2-13B	LNQ + GQuant	2.01	7.26	9.17
Llama-2-70B	SqueezeLLM	2.01	9.17	13.03
Llama-2-70B	LNQ + GQuant	2.01	5.04	7.04

Vector Quantization (QTIP + GuidedQuant):

Model	Method	Bits	Wiki2 ↓	C4 ↓
Llama-2-7B	QTIP	2.00	6.82	8.96
Llama-2-7B	QTIP + GQuant	2.00	6.11	7.99
Llama-2-70B	QTIP	2.00	3.87	5.69
Llama-2-70B	QTIP + GQuant	2.00	3.80	5.61

Weight-Activation Quantization (SpinQuant + GuidedQuant, W4A4KV4, Wiki2-2K):

Model	Method	Wiki2 ↓
Llama-2-7B	SpinQuant	5.95
Llama-2-7B	SpinQuant + GQuant	5.89
Llama-2-13B	SpinQuant	5.24
Llama-2-13B	SpinQuant + GQuant	5.19

Ablation Study¶

Configuration	Wiki2 (2-bit)	Description
LNQ (g=0, w/o gradient guidance)	23.31	Layer-wise reconstruction objective only
LNQ + GQuant (g=1)	9.00	A single-group average already yields a significant improvement
LNQ + GQuant (g=2)	8.82	Further improvement over g=1
LNQ + GQuant (g=4)	8.83	On par with g=2 (default for 7B/13B); results are less sensitive to g at higher bit-widths
LNQ + GQuant (assignment via GPTQ)	9.65	CD outperforms GPTQ for assignment
LNQ + GQuant (assignment via CD)	8.83	Validates the correct choice of CD

Key Findings¶

Gains are largest at ultra-low bit-widths (2-bit): LNQ + GuidedQuant reduces the Llama-2-7B WikiText2 PPL from SqueezeLLM's 39.58 to 8.83, a roughly 78% reduction.
The Fisher matrix indeed exhibits a strong block-diagonal structure—visualization shows that weight interactions within the same output channel are significantly stronger than those across channels.
LNQ's closed-form codebook + CD assignment outperforms GPTVQ's gradient descent + GPTQ—using LNQ alone exceeds GPTVQ 1D.
Inference throughput is unaffected—GuidedQuant only optimizes the codebook and assignment values, reusing existing CUDA kernels.
Equally effective on Llama-3—Llama-3-8B 2-bit PPL drops from SqueezeLLM's 163k+ to 30.80 with LNQ+GQuant.
Reasonable quantization cost: the entire workflow for Llama-2-7B (including Hessian caching) takes about 1-2 hours, and the cached Hessians can be reused across different configurations.

Highlights & Insights¶

Theoretical Elegance: The gradient-weighted output error is derived from a second-order Taylor expansion and shown to equal a block-diagonal Fisher approximation, which is more accurate than both the diagonal approximation (SqueezeLLM) and gradient-free layer-wise reconstruction (GPTQ).
Engineering Ingenuity: The averaging approximation reduces the intractable $O(d_{in}^2 d_{out})$ storage to $O(d_{in}^2 g)$, striking a practical trade-off between accuracy and efficiency.
High Versatility: Acts as a plug-and-play module applicable to three quantization formats: scalar, vector, and weight-activation.
Superior LNQ Design: The combination of CD with a closed-form codebook consistently outperforms existing alternating minimization methods, backed by convergence guarantees.

Limitations & Future Work¶

Cross-layer and cross-channel interactions are still ignored—the current block-diagonal approximation discards this information, which may lead to accuracy degradation as quantization errors accumulate across layers.
Simple grouping strategy: Channels are grouped solely by index contiguity (consecutive channels form a group). More intelligent clustering (e.g., based on gradient similarity) could yield further improvements.
A full backpropagation pass is required: For a 70B model, an A100-class GPU is still needed for gradient caching.
Inference latency of non-uniform scalar quantization: While comparable to uniform quantization, lookup table decoding might still incur overhead on certain hardware architectures.
Evaluation primarily restricted to the Llama family: Lacks verification on other architectures such as Mistral or Qwen.

Related Work & Insights¶

SqueezeLLM (Kim et al., ICML 2024): Diagonal Fisher approximation + weighted k-means; the direct baseline improved by GuidedQuant.
GPTQ (Frantar et al., ICLR 2023): Layer-wise output reconstruction + OBQ-style quantization while ignoring gradient information.
QTIP (Tseng et al., NeurIPS 2024): SOTA vector quantization method, which can be directly integrated with GuidedQuant.
SpinQuant (Liu et al., 2024): Uniform quantization after rotation matrices reduce activation outliers; GuidedQuant enhances its weight quantization.
WoodFisher (Singh & Alistarh, NeurIPS 2020): Block-diagonal Fisher used for CNN pruning, but failed to scale to LLM size.
Insights: Gradient information serves as a cheap yet highly effective saliency signal. The key is to manage computational and storage overheads while preserving sufficient interaction details.

Rating¶

Novelty: ⭐⭐⭐⭐ — The integration of block-diagonal Fisher and averaging approximation is novel, though individual components have precedents in prior work.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers three quantization formats, three model scales, and multiple baselines with extensive ablations, alongside inference throughput and downstream task evaluations.
Writing Quality: ⭐⭐⭐⭐⭐ — Clear theoretical derivations, with a coherent logical flow from motivation and methods to experiments.
Value: ⭐⭐⭐⭐⭐ — Enhances several SOTA approaches in a plug-and-play fashion, bringing immense improvements to 2-bit quantization with high practical utility.