Skip to content

GlowQ: Group-Shared Low-Rank Approximation for Quantized LLMs

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=kVojSLUcvS
Code: https://github.com/ahnselim/GlowQ
Area: Model Compression / Quantization Error Compensation
Keywords: Low-rank compensation, Post-training quantization, Shared right factor, Covariance alignment, Selective recovery

TL;DR

GlowQ transforms the paradigm of "independent low-rank error correction for every layer" into "sharing a right factor \(B\) per input group, caching the projection \(BX\) once for reuse across modules," while selectively recovering specific groups/layers based on benefit. This significantly reduces first-token latency and increases throughput for quantized LLMs with negligible accuracy loss.

Background & Motivation

Background: 4-bit post-training quantization (GPTQ/AWQ/BitsAndBytes) is the standard for LLM deployment, but low-bitwidths suffer from accuracy degradation. "Low-rank error compensation" methods (LQER, QERA, ASER, etc.) address this by approximating the quantization error \(W-W_q\approx AB\) with high-precision low-rank terms, restoring quality by adding \(A(BX)\) during inference.

Limitations of Prior Work: Existing methods almost exclusively attach independent \((A_\ell,B_\ell)\) modules to every layer and every projection, leading to repeated computation of high-precision projections \(B_\ell X\) across the network. This results in triple waste: (i) modules sharing the same input tensor (e.g., Q/K/V sharing the attention input) repeat the same expensive projection; (ii) materializing multiple copies of \(BX\) increases memory bandwidth traffic; (iii) subspace selection objectives often ignore the strong anisotropy of real activations, allocating limited rank to directions that are rarely used. Consequently, the accuracy-efficiency tradeoff under strict latency budgets is worse than theoretically achievable.

Key Challenge: While low-rank compensation recovers accuracy, the "layer-wise independent + layer-wise recomputation" deployment model imposes unnecessary costs in latency and memory—the algorithmic gains of the compensation modules are offset by engineering overhead.

Goal: Retain the expressiveness of layer-wise correction while eliminating redundant computation and memory usage, making low-rank compensation truly cost-effective under real deployment budgets.

Core Idea: [Group-Sharing + Single Projection Caching] Modules sharing the same input are treated as a group, learning a single shared right factor \(B_{shared}\) while maintaining individual left factors \(A_i\). During inference, \(R=B_{shared}X\) is computed and cached once per group, followed by lightweight \(A_i R\) operations. This is augmented by [Covariance Alignment] to align the limited rank with high-frequency data directions and [Selective Recovery] to enable correction only for the most beneficial groups/layers.

Method

Overall Architecture

GlowQ consists of three components: first, error matrices of modules with the same input are vertically stacked for joint low-rank fitting (theoretically equivalent to truncated SVD of the stacked matrix); second, a covariance alignment objective directs the right subspace toward high-frequency data directions, solved efficiently via QR-reduction and randomized SVD without materializing large matrices; finally, during inference, \(R=B_{shared}X\) is cached and reused once per group, with only the top-k units selectively recovered based on importance scores.

flowchart LR
    A[Error E_i per module] --> B[Vertical stack E_cat per input group]
    B --> C[Covariance Alignment<br/>Whiten E_cat·Σx^1/2]
    C --> D[QR reduction to d×d kernel<br/>+ Randomized SVD]
    D --> E[Shared Right Factor B_shared<br/>Module Left Factors A_i]
    E --> F[Inference: Compute & cache<br/>R=B_shared·X once per group]
    F --> G[Selective Recovery<br/>Enable top-k groups/layers]
    G --> H[y_i = W_q·X + A_i·R]

Key Designs

1. Group-Shared Right Factor: One B for the whole group, proven optimal For \(m\) modules sharing the same input dimension \(d\), their error matrices are vertically concatenated into \(E_{cat}=[E_1^T\cdots E_m^T]^T\) to solve \(\min_{A,B}\|E_{cat}-AB\|_F^2\). The paper presents Proposition 1: joint fitting with a single right factor \(B\) is equivalent to a low-rank fit on the stacked matrix \(E_{cat}\). By the Eckart-Young-Mirsky theorem, the optimal \(B\) spans the top \(r\) right singular subspaces of \(E_{cat}\). Allowing each module an independent \(B_i\) does not increase expressiveness (differences can be absorbed into an invertible reparameterization of \(A_i\)). This mathematically proves that "layer-wise independent \(B_\ell\)" is redundant—a shared \(B\) is both sufficient and optimal, legally collapsing multiple large \(BX\) matrix multiplications into one projection plus several inexpensive \(A_iR\) products.

2. Covariance Alignment: Aligning limited rank with data directions Pure stacked SVD only considers the geometry of the error matrix, whereas real activations are highly anisotropic. Empirical measurements show input covariance spectra follow a power-law decay \(\lambda_r\propto r^{-\alpha}\) (MLP \(\alpha\approx0.77\), QKV \(\alpha\approx1.19\)), meaning representation space usage is highly concentrated. Without weighting, the selected subspace may mismatch data preferences, wasting rank. Ours uses a "usage-weighted" objective: expected loss \(\mathbb{E}\|Mx\|_2^2=\|M\Sigma_x^{1/2}\|_F^2\), equivalent to low-rank fitting after right-multiplying a whitening matrix: \(\min_{A,B}\|(E_{cat}-AB)\Sigma_x^{1/2}\|_F^2\). Propositions 2/3 prove that minimizing usage-weighted risk is exactly equal to minimizing right-weighted reconstruction error, with the global optimum given by the rank-\(r\) SVD of the whitened error \(\tilde E=E_{cat}\Sigma_x^{1/2}\). In practice, the whitened version captures cumulative energy significantly faster under the same rank.

3. QR-Reduction Randomized SVD: Scalable solver without high-dimensional materialization Directly performing SVD on the large whitened matrix is computationally expensive and numerically unstable. Ours uses a three-step approach: first, a thin QR decomposition on \(E_{cat}\) reduces the "tall \(m\)" problem into a \(d\times d\) kernel \(M=R_e\Sigma_x^{1/2}\) (leveraging left-orthogonal invariance of the Frobenius norm); second, randomized SVD (with \(p\) oversampling columns and \(q\) power iterations) is performed on the kernel to extract the dominant right subspace, reducing complexity from \(O(d^3)\) to \(O(d^2(r+p)+qd^2(r+p))\); third, balanced recovery \(\hat A^\star=U_r\Sigma_r^{1/2}\), \(\hat B^\star=\Sigma_r^{1/2}V_r^T\) improves stability before lifting factors back to original variables. This solving process occurs once offline.

4. Caching + Selective Recovery: Converting algorithmic savings into latency/throughput gains The shared structure allows each group to materialize \(R_\ell=XB_{\ell,shared}^T\) once. Module \(i\) simply performs a small correction \(y_i=W_i^{(q)}X+A_{\ell,i}R_\ell\). An anchor strategy ensures \(R_\ell\) is computed exactly once and consumed a fixed number of times (e.g., Q as the anchor for K/V in attention; gate as the anchor for up in MLP). Solo modules (o_proj, down_proj) are computed on-the-fly. Under a latency budget, components are activated based on a top-k ranking using two metrics: the GSVD energy capture score \(g_{ec}(u)=\sum_{j=1}^r\sigma_j(M_u)^2/\|M_u\|_F^2\) and the normalized error ratio \(g_{ner}(u)=\|E_u\|_F^2/\|W_u\|_F^2\). This yields two versions: full GlowQ and lightweight GlowQ-S.

Key Experimental Results

Evaluations cover 11+ variants including LLaMA 2/3, Qwen 2.5/3, OPT, Mistral, and Qwen1.5-MoE. Setup: W4A16 (int4 weights g128, fp16 activations), rank fixed at 64, 64 calibration sequences of length 2048, no fine-tuning. Baselines include PTQ (BnB/AWQ/GPTQ) and error correction methods (L2QER/ZeroQuant-V2/QERA).

Main Results (WikiText-2 PPL, Lower is Better, Selected)

Method LLaMA3.2-3B LLaMA3.1-8B Qwen2.5-7B Qwen3-8B Mistral-7B
FP16 7.81 6.24 6.86 9.73 5.32
AWQ 8.24 6.64 7.11 10.19 5.51
QERA 8.22 6.64 8.09 10.07 5.48
L2QER 8.30 6.75 8.14 10.07 5.46
GlowQ 8.16 6.59 7.07 9.90 5.42
GlowQ-S 8.22 6.62 7.09 9.97 5.45

GlowQ achieved the best or tied-best results in 9 out of 11 variants.

Downstream Accuracy + C4 (Table 2, Selected)

Method LLaMA3.2-3B Acc↑ LLaMA3.1-8B Acc↑ Qwen3-8B Acc↑ Qwen3-14B Acc↑
FP16 67.14 73.29 71.48 74.10
QERA 65.48 72.86 69.86 73.14
L2QER 66.19 72.43 69.52 73.24
GlowQ 66.90 73.33 70.71 73.84

Efficiency & Selective Recovery

  • Compared to strong baselines, GlowQ reduces TTFB by 5.6% and increases throughput by 9.6% on average, while decreasing WikiText-2 PPL by 0.17% and increasing downstream accuracy by 0.42 percentage points.
  • The selective version GlowQ-S reduces TTFB by 23.4% and increases throughput by 37.4%, with an average accuracy loss of less than 0.2 percentage points (typically recovering ~50% of units).

Key Findings

  • Layer-wise independent B is redundant: Shared right factors do not compromise expressiveness and save massive amounts of repeated projections and compensation parameters.
  • Covariance alignment significantly accelerates energy capture: Under heavy-tailed anisotropic activations, whitening allows the recovery of more error energy for the same rank.
  • Selective recovery sweet spot is around 50%: Recovering approximately 50% of groups/layers captures most of the accuracy while yielding the largest latency/throughput gains.

Highlights & Insights

  • Attributes the "slowness of low-rank compensation" to the engineering perspective of repeated projections caused by shared inputs, then eliminates it via a clean algebraic proposition (optimality of shared B). Theoretical and system gains are perfectly aligned.
  • Covariance alignment is not an arbitrary weighting but derived from the equivalence of "usage-weighted risk = right-weighted reconstruction error," supported by power-law spectra.
  • The anchor/consumer cache scheduling maps "single projection reuse" onto concrete attention/MLP structures, making it directly reproducible in engineering.

Limitations & Future Work

  • Primarily targets W4A16 weight quantization error, with gains narrowing at W4A4; activation quantization remains a more difficult challenge.
  • Importance scoring rules for GlowQ-S vary by model family (requiring family-specific selection of GSVD/NER/Layer order), lacking a unified cross-family adaptive strategy.
  • Absolute precision gains (PPL -0.17%, Acc +0.42pt) are modest; the primary value lies in latency/throughput at equivalent accuracy rather than accuracy alone.
  • Covariance/SVD steps require offline computation on high-end GPUs, and end-to-end costs for resource-constrained scenarios were not fully discussed.
  • Low-Rank Error Compensation: LQER/ZeroQuant-V2(LoRC)/QERA/ASER established the \(W\approx W_q+AB\) compensation paradigm. GlowQ builds upon this to solve deployment inefficiencies of "layer-wise independence + recomputation."
  • Joint/Collective Matrix Factorization: The idea of shared right subspaces originates from collective matrix factorization (Singh & Gordon 2008) and shared singular subspace recovery (Ma & Ma 2024), which Ours systematically migrates to LLM input-sharing modules.
  • Pruning-style Saliency Selection: Selective recovery borrows saliency/importance scoring from pruning (Molchanov, Nagel, Banner, etc.), framing "whether to compensate a layer" as a top-k selection under budget constraints.
  • Inspiration: When an enhancement module is "attached to every layer," check if inputs are shared and if computation can be collapsed. This "group-sharing + single calculation caching + selective activation" approach could extend to LoRA, adapters, and KV cache correction.

Rating

  • Novelty: ⭐⭐⭐⭐ — Reframes low-rank compensation from "layer-wise independent" to "group-shared right factors + cached projection," with algebraic proofs of optimality.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 11+ model variants, multiple baselines, PPL/accuracy/latency/throughput, and ablation of selective recovery.
  • Writing Quality: ⭐⭐⭐⭐ — Coherent chain of motivation-theory-implementation-deployment.
  • Value: ⭐⭐⭐⭐ — Real-world latency reduction and throughput increases at equivalent accuracy, though small absolute accuracy gains limit the ceiling.