YAQA: End-to-End KL Minimization LLM Adaptive Weight Quantization
Conference: ICML 2026
arXiv: 2505.22988
Code: Not yet released
Area: Model Compression / LLM Quantization
Keywords: Quantization, Adaptive Rounding, End-to-end KL, Hessian Sketching, Kronecker Decomposition
TL;DR
YAQA changes the proxy objective of LLM weight quantization from layer-wise activation error to the KL divergence of the end-to-end model output. Using a Kronecker-factored sketch of the Hessian, it provides the first end-to-end error bound. It reduces KL divergence by approximately 30% relative to GPTQ/LDLQ, achieves higher accuracy than Quantization-Aware Training (QAT), and leaves inference speed unchanged.
Background & Motivation
Background: LLM quantization follows two paths. QAT learns low-precision representations by modifying the training process, which gives high quality at immense cost. PTQ maps full-precision weights to a discrete codebook after training (e.g., GPTQ/LDLQ) and is popular because it is cheap. GPTQ uses the Hessian of the current layer's activation error, \(H_1 = \mathbb{E}[x^\top x]\), as a proxy for end-to-end error.
Limitations of Prior Work: \(H_1\) only captures the input distribution of the current layer and ignores how subsequent layers amplify or cancel rounding errors. Consequently, a layer-wise optimum is not a model-level optimum, and the resulting KL divergence is often unnecessarily high. GuidedQuant/SqueezeLLM use block-diagonal approximations of the Empirical Fisher, but these derive from the cross-entropy task loss rather than the true KL Hessian. Their block structures are heuristic and lack theoretical guarantees; empirically, increasing the number of blocks gives inconsistent results.
Key Challenge: Adaptive rounding directly against \(\nabla^2 L(W^*) \in \mathbb{R}^{mn \times mn}\) (the true KL Hessian with respect to one layer's weights) is intractable at LLM scale, because the matrix has \((mn)^2\) entries. A structured approximation is needed to keep rounding tractable, and its quality should be provable. Existing structured approximations either lack bounds or approximate poorly.
Goal: Find a structured Hessian sketch that lets LDLQ-style iterative rounding finish within \(O(m+n)\) steps while controlling end-to-end KL through the sketch's cosine similarity with the true Hessian.
Key Insight: The authors introduce the Structural Nilpotence Degree (SND), a combinatorial quantity that characterizes the number of steps LDLQ needs to converge. They prove that for a Kronecker product \(L_O \otimes L_I\), \(\mathrm{snd}(L_O \otimes L_I) = \mathrm{snd}(L_O) + \mathrm{snd}(L_I) \le m+n-1\), which ties tractable computation directly to Kronecker structure.
Core Idea: Use the Kronecker product \(\tilde{H} = H_O \otimes H_I\) as an approximation of \(\nabla^2 L(W^*)\). Obtain near-optimal \(H_O, H_I\) via power iteration on the true Fisher. The rounding algorithm adds a symmetric output-side feedback term to LDLQ, which significantly lowers KL at \(\approx 2\times\) the runtime of LDLQ.
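To make the layer-wise proxy from the Background paragraph concrete, the following NumPy sketch checks that the mean squared output error of one linear layer equals a quadratic form in \(H_1\). The sizes, the random data, and the helper name `layerwise_proxy` are illustrative assumptions; inputs are treated as row vectors so that \(x^\top x\) is an \(n \times n\) matrix.

```python
import numpy as np

def layerwise_proxy(delta, X):
    """Return tr(delta @ H1 @ delta.T), with H1 = E[x^T x] estimated from the rows of X."""
    H1 = X.T @ X / X.shape[0]              # (n, n) input second-moment matrix
    return float(np.trace(delta @ H1 @ delta.T)), H1

rng = np.random.default_rng(0)
m, n, num_tokens = 3, 5, 1000              # illustrative sizes (assumption)
X = rng.standard_normal((num_tokens, n))   # calibration inputs, one row per token
W_star = rng.standard_normal((m, n))
delta = W_star - np.round(W_star)          # error of naive round-to-nearest

proxy, H1 = layerwise_proxy(delta, X)
# Direct measurement: mean squared change in this layer's output.
direct = float(np.mean(np.sum((X @ delta.T) ** 2, axis=1)))
print(proxy, direct)                       # the two numbers agree up to round-off
```

The quantity involves only this layer's inputs, which is why it cannot see how later layers amplify or cancel the error.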
Method
Overall Architecture
YAQA consists of two components. (1) Rounding algorithm: generalizes LDLQ to a Kronecker-factored Hessian sketch, giving the fixed-point equation \(W = Q(W^* + L_O'^{\top} \Delta L_I' + L_O'^{\top} \Delta + \Delta L_I')\), where \(L_O', L_I'\) are the triangular LDL factors of \(H_O, H_I\) minus the identity matrix, \(\Delta = W^* - W\), and \(Q\) is the quantizer. (2) Hessian sketching: constructs \(H_O, H_I\) via power iteration to maximize the alignment between \(\tilde{H}\) and the true Hessian under the Frobenius inner product. Both are applied independently to each linear layer and do not change the inference-time architecture, so inference speed is determined by the codebook (e.g., E8P) and is independent of YAQA.
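The fixed-point equation above can be illustrated with a minimal NumPy sketch; it is not the authors' implementation. It assumes a scalar round-to-nearest quantizer for \(Q\), the lower-triangular convention \(H = L D L^\top\) with \(L' = L - I\), and plain repeated substitution as the iteration scheme. The function names `ldl_unit_lower` and `kron_feedback_round` are hypothetical.

```python
import numpy as np

def ldl_unit_lower(H):
    """Factor symmetric positive definite H as L @ diag(d) @ L.T, L unit lower triangular."""
    C = np.linalg.cholesky(H)      # H = C @ C.T with C lower triangular
    c = np.diag(C)
    return C / c, c ** 2           # divide column j by C[j, j]

def kron_feedback_round(W_star, H_O, H_I, quantize=np.round):
    """Iterate W <- Q(W* + L_O'^T D L_I' + L_O'^T D + D L_I') with D = W* - W."""
    m, n = W_star.shape
    L_O, d_O = ldl_unit_lower(H_O)
    L_I, d_I = ldl_unit_lower(H_I)
    Lo = L_O - np.eye(m)           # strictly lower triangular output factor
    Li = L_I - np.eye(n)           # strictly lower triangular input factor

    def target(W):
        D = W_star - W
        return W_star + Lo.T @ D @ Li + Lo.T @ D + D @ Li

    W = quantize(W_star)
    steps = 0
    for steps in range(1, m + n):  # at most m + n - 1 passes
        W_next = quantize(target(W))
        if np.array_equal(W_next, W):
            break                  # W no longer changes: fixed point reached
        W = W_next
    eta = W - target(W)            # rounding residual at the returned W
    return W, steps, eta, d_O, d_I

rng = np.random.default_rng(0)
m, n = 4, 6                        # illustrative sizes (assumption)
A = rng.standard_normal((m, m)); H_O = A @ A.T + m * np.eye(m)
B = rng.standard_normal((n, n)); H_I = B @ B.T + n * np.eye(n)
W_star = 3.0 * rng.standard_normal((m, n))

W, steps, eta, d_O, d_I = kron_feedback_round(W_star, H_O, H_I)
D = W_star - W
proxy = float(np.trace(D.T @ H_O @ D @ H_I))
weighted = float(np.sum(np.outer(d_O, d_I) * eta ** 2))
print(steps, proxy, weighted)
```

Under these assumptions the feedback for entry \((i, j)\) only involves entries with row index \(\ge i\) and column index \(\ge j\), so repeated substitution stops changing after at most \(m+n-1\) passes. At that point the proxy loss \(\mathrm{tr}(\Delta^\top H_O \Delta H_I)\) equals \(\sum_{ij} (D_O)_{ii} (D_I)_{jj} \eta_{ij}^2\), where \(\eta\) is the final rounding residual and \(D_O, D_I\) are the LDL diagonals.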
Key Designs
- SND + Kronecker Hessian for End-to-End LDLQ:
  - Function: Characterizes the Hessian structures that admit tractable LDLQ and selects a good one among them.
  - Mechanism: Defines \(\mathrm{snd}(L)\) as the nilpotence degree of a binary nilpotent matrix with the same support as \(L - I\), and proves that LDLQ converges in \(\le \mathrm{snd}(L)\) steps. For the Kronecker product \(L_O \otimes L_I\), the SND is the sum of the SNDs of the two factors. Thus \(\tilde{H} = H_O \otimes H_I\) permits symmetric input and output feedback while requiring only \(O(m+n)\) small matrix multiplications. In this framework, GuidedQuant is equivalent to running LDLQ on a block-diagonal approximation without output feedback, which explains why it saturates beyond 4 blocks.
  - Design Motivation: Previously one had to choose between layer-wise \(H_1\) PTQ (no output feedback) and QAT (global but expensive). The Kronecker structure is the first to be both end-to-end and tractable.
- End-to-End KL Error Bound + Cosine Similarity Objective:
  - Function: Bounds the model-level KL in terms of the relationship between the Hessian sketch and the true Hessian, which guides the choice of \(H_O, H_I\).
  - Mechanism: Theorem 3.4 proves \(\mathrm{vec}(\Delta) H \, \mathrm{vec}(\Delta)^\top \le \|H\|_F (\|\Delta\|_F^2 \sqrt{2 - 2c} + \text{(incoherence/trace term)})\), where \(c = \langle H, H_O \otimes H_I \rangle / (\|H\|_F \|H_O\|_F \|H_I\|_F)\) is the cosine similarity. The closer the sketch is in direction to the true Hessian, the tighter the end-to-end KL bound. At the same time, \(H_O, H_I\) must have low incoherence and low rank.
  - Design Motivation: This is the first time a quantization algorithm has obtained an end-to-end error bound, which turns Hessian sketch selection from an empirical heuristic into an optimizable objective. Maximizing cosine similarity with a Kronecker product is a problem that power iteration can approximate.
- Two Scalable Power Iteration Hessian Sketches:
  - Function: Computes \(H_O, H_I\) at LLM scale without materializing the \(mn \times mn\) true Hessian.
  - Mechanism: Sketch A assumes tokens within a sequence are independent, approximating \(H \approx \mathbb{E}[x^\top x \otimes (\nabla_y \ell)^\top (\nabla_y \ell)]\). Starting from \((H_I)_0 = H_1, (H_O)_0 = I\), it converges in \(\approx 3\) power iteration steps (approximately 20 GPU-hours for a 10B model). Sketch B runs one round of power iteration on the true Fisher (starting from \(I, I\)) using sequence-level gradients, requiring approximately 30 GPU-hours for a 10B model. Both use a modified backward pass for distributed power iteration, similar to Shampoo's preconditioning but with the true Fisher.
  - Design Motivation: Direct Monte-Carlo estimation of the true Hessian has high variance. Sketch A accepts bias in exchange for lower variance, which suits low-data regimes. Sketch B tolerates higher variance by running a single power iteration round and yields better quality when more data is available.
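The cosine similarity objective and the power iteration from the list above can be illustrated on a small, explicitly stored matrix. This is a generic nearest-Kronecker-product iteration written for illustration, not the paper's distributed implementation, which never forms \(H\). The identity initialization, the sizes, and the function names are assumptions.

```python
import numpy as np

def kron_cosine(H, H_O, H_I):
    """Cosine similarity <H, H_O (x) H_I> / (||H||_F ||H_O||_F ||H_I||_F)."""
    m, n = H_O.shape[0], H_I.shape[0]
    H4 = H.reshape(m, n, m, n)     # H[i*n + j, k*n + l] -> H4[i, j, k, l]
    inner = np.einsum("ijkl,ik,jl->", H4, H_O, H_I)
    return inner / (np.linalg.norm(H) * np.linalg.norm(H_O) * np.linalg.norm(H_I))

def kron_power_iteration(H, m, n, rounds=3):
    """Alternately update H_O and H_I to raise the cosine similarity with H."""
    H4 = H.reshape(m, n, m, n)
    H_O, H_I = np.eye(m), np.eye(n)               # identity start (assumption)
    history = [kron_cosine(H, H_O, H_I)]
    for _ in range(rounds):
        H_O = np.einsum("ijkl,jl->ik", H4, H_I)   # contract H against H_I
        H_O /= np.linalg.norm(H_O)
        H_I = np.einsum("ijkl,ik->jl", H4, H_O)   # contract H against H_O
        H_I /= np.linalg.norm(H_I)
        history.append(kron_cosine(H, H_O, H_I))
    return H_O, H_I, history

rng = np.random.default_rng(0)
m, n = 3, 4                                       # illustrative sizes (assumption)
A = rng.standard_normal((m, m)); A0 = A @ A.T + np.eye(m)
B = rng.standard_normal((n, n)); B0 = B @ B.T + np.eye(n)
N = rng.standard_normal((m * n, m * n))
H_exact = np.kron(A0, B0)                         # exactly Kronecker-structured
H_noisy = H_exact + 0.5 * (N + N.T)               # symmetric perturbation

_, _, hist_exact = kron_power_iteration(H_exact, m, n, rounds=1)
H_O, H_I, hist_noisy = kron_power_iteration(H_noisy, m, n, rounds=3)
print(hist_exact[-1], hist_noisy)
```

Each half-update is the best choice of one factor given the other, so the cosine similarity never decreases from one round to the next. For normalized matrices the Frobenius distance to the sketch is \(\sqrt{2 - 2c}\), which is the factor that appears in the bound.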
Loss & Training
YAQA is a PTQ method, so it has no training loss. The objective it implicitly minimizes is \(\mathrm{tr}(\Delta^\top H_O \Delta H_I)\), the proxy loss under the Kronecker sketch. The rounding algorithm uses fixed-point iteration (Equation 5/6) and is compatible with scalar or vector quantizers. YAQA is complementary to QuIP#'s randomized Hadamard transform: the transform makes \(W\) near-Gaussian and lowers incoherence, while YAQA supplies the more accurate Hessian used for rounding.
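As a small numerical check of why the trace form is the Kronecker-sketch proxy, the sketch below compares \(\mathrm{tr}(\Delta^\top H_O \Delta H_I)\) with the quadratic form of the explicit \(mn \times mn\) matrix \(H_O \otimes H_I\). It assumes row-major vectorization of \(\Delta\) and symmetric factors; the sizes and function names are illustrative.

```python
import numpy as np

def proxy_trace(delta, H_O, H_I):
    """Proxy loss without ever building the (mn x mn) matrix."""
    return float(np.trace(delta.T @ H_O @ delta @ H_I))

def proxy_kron(delta, H_O, H_I):
    """Same quantity through the explicit Kronecker product (small sizes only)."""
    v = delta.reshape(-1)                  # row-major vec of the (m, n) error
    return float(v @ np.kron(H_O, H_I) @ v)

rng = np.random.default_rng(0)
m, n = 3, 4                                # illustrative sizes (assumption)
A = rng.standard_normal((m, m)); H_O = A @ A.T
B = rng.standard_normal((n, n)); H_I = B @ B.T
delta = rng.standard_normal((m, n))

print(proxy_trace(delta, H_O, H_I), proxy_kron(delta, H_O, H_I))
# With H_O = I the proxy reduces to the layer-wise form tr(delta @ H_I @ delta.T).
print(proxy_trace(delta, np.eye(m), H_I), float(np.trace(delta @ H_I @ delta.T)))
```

The trace form costs a few small matrix products, whereas the explicit Kronecker product has \((mn)^2\) entries.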
Key Experimental Results
Main Results: LLM Quantization Quality
| Model / Setting | Method | KL ↓ (vs FP) | Downstream Benchmarks (acc%) ↑ |
|---|---|---|---|
| Llama 3.1 8B Inst, W2 | LDLQ (GPTQ) | Baseline | Baseline |
| Llama 3.1 8B Inst, W2 | GuidedQuant | Slightly better | Slightly better |
| Llama 3.1 8B Inst, W2 | YAQA Sketch A | \(\approx -30\%\) vs LDLQ | Significantly ahead |
| Llama 3.1 8B Inst, W2 | YAQA Sketch B | Lowest | Highest |
| Llama 3.1 8B Inst, W2 | QAT | Higher than YAQA | Lower than YAQA |
(Data summarized from abstract and tables; Sketch B establishes a new PTQ SOTA on multiple chat/reasoning tasks.)
Ablation Study
| Setting | KL ↓ | Description |
|---|---|---|
| LDLQ (\(H_O = I, H_I = H_1\)) | Baseline | YAQA degenerate case |
| Sketch A, 1 step | Medium | Initialized with \(H_1\) |
| Sketch A, 3 steps | Excellent | Empirical convergence |
| Sketch B, 2K sequences | Excellent | SOTA in 1 GPU-hour |
| Sketch B, 64K sequences | Best | 30 GPU-hours |
| GuidedQuant, >4 blocks | No improvement | Lack of output feedback |
Key Findings
- Empirically, \(H_O\) is near low-rank, matching the theoretical condition where the YAQA bound is strictly tighter than the LDLQ bound, explaining the success of the Kronecker sketch.
- Sketch B with one power iteration round outperforms Sketch A, showing that the variance of the true Fisher is manageable with sequence-level estimates; power iteration does not need to converge strictly.
- YAQA achieves SOTA with little data (2K sequences, 1 GPU-hour), which matters for the practicality of PTQ.
- A KL lower than QAT's is counter-intuitive but has a plausible explanation: QAT uses first-order descent and may get stuck in local optima, whereas YAQA performs "one-shot optimal rounding within the Hessian ellipsoid," bypassing QAT's optimization difficulties.
Highlights & Insights
- First End-to-End KL Bound: Converts Hessian sketch selection into a clear mathematical problem of maximizing cosine similarity and controlling incoherence/rank, replacing the heuristic structures used previously.
- SND Framework Unification: GPTQ, LDLQ, and GuidedQuant are all unified under the SND/Kronecker perspective, clarifying the presence or absence of output feedback.
- Tractability and Optimality: Low SND ensures speed, cosine similarity ensures quality, and power iteration is the classic tool for the best Kronecker approximation in Frobenius norm, so the pieces fit together naturally.
- True Fisher vs. Empirical Fisher: YAQA highlights that for the KL objective, the true Fisher (labels sampled by Monte Carlo from the model's own output distribution) must be used instead of the Empirical Fisher (gradients of the task loss on ground-truth labels) to avoid directional bias.
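The distinction in the last bullet can be seen at the level of a single softmax output. The sketch below is an illustration with made-up logits and a made-up label, and it says nothing about how the paper estimates the Fisher with respect to the weights. It relies on the standard fact that the Hessian of \(\mathrm{KL}(p \,\|\, q)\) with respect to the logits of \(q\), evaluated at \(q = p\), is \(\mathrm{diag}(p) - p p^\top\).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def true_fisher_logits(p):
    """E_{y ~ p}[(e_y - p)(e_y - p)^T], with labels drawn from the model itself."""
    eye = np.eye(p.size)
    return sum(p[y] * np.outer(eye[y] - p, eye[y] - p) for y in range(p.size))

def empirical_fisher_logits(p, label):
    """Outer product of the cross-entropy gradient at one ground-truth label."""
    g = np.eye(p.size)[label] - p
    return np.outer(g, g)

logits = np.array([2.0, 0.5, -1.0, 0.0])   # made-up logits (assumption)
p = softmax(logits)
F_true = true_fisher_logits(p)
F_emp = empirical_fisher_logits(p, label=2)   # made-up ground-truth label

kl_hessian = np.diag(p) - np.outer(p, p)
print(np.allclose(F_true, kl_hessian))     # the true Fisher matches the KL Hessian
print(np.linalg.matrix_rank(F_emp))        # one labelled sample gives a rank-one matrix
```

The empirical version depends on which label happens to be observed, so it generally points in a different direction from the KL Hessian.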
Limitations & Future Work
- Focuses only on weight-only PTQ; transferability to activation and KV-cache quantization is not detailed.
- Sketch B's cost (approximately 30 GPU-hours for a 10B model) becomes heavy for 70B+ models; aggressive sparsification or low-rank approximations may be needed to reduce it.
- The cosine similarity bound still contains incoherence/trace terms, and rank is not fully controlled; tightening the bound by controlling the effective rank of \(H_O, H_I\) is a future direction.
- Global Hessian behavior for non-linear layers (e.g., Attention Softmax) and cross-layer coupling has not been deeply analyzed.
Related Work & Insights
- vs GPTQ / LDLQ: Equivalent to the degenerate case of YAQA where \(H_O = I, H_I = H_1\); YAQA's bound is theoretically tighter when \(H_O\) is low-rank.
- vs GuidedQuant / SqueezeLLM: Both attempt to go beyond \(H_1\), but use Empirical Fisher + block-diagonal approximations, lacking output feedback and end-to-end bounds.
- vs QAT / DiscQuant / PV-Tuning: QAT requires long training; YAQA shows empirically that PTQ with a single power iteration round can be more accurate than QAT, which strengthens the case for the PTQ route.
- vs Shampoo / KFAC: Shares the Kronecker + Power Iteration sketching philosophy, but applies it to the rounding direction of PTQ rather than optimizer preconditioning.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First to derive an end-to-end KL bound for a quantization algorithm and to close the loop between theory and algorithm via SND/Kronecker structure.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Tested across Llama/Gemma scales and multiple bit configurations, outperforming LDLQ, GuidedQuant, and QAT, and including ablations on data requirements.
- Writing Quality: ⭐⭐⭐⭐ Clear theoretical derivations, but the SND/Kronecker arguments are dense and have a high entry barrier; the appendix is extensive.
- Value: ⭐⭐⭐⭐⭐ Directly useful for LLM deployment: pushes quality to or beyond QAT with no added inference cost, a significant advance for PTQ.