Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment¶
Conference: NeurIPS 2025
arXiv: 2509.20214
Code: GitHub
Area: Model Compression
Keywords: Model Quantization, LLM Inference, Fractional-Bit Quantization, Rate-Distortion, CUDA Optimization
TL;DR¶
Starting from an information-theoretic analysis of optimal bit allocation for Gaussianized weights, this paper proposes Q-Palette, a collection of fractional-bit quantizers, together with a mixed-scheme quantization framework, achieving near-optimal quantization quality and practical inference speedups in LLM deployment.
Background & Motivation¶
Weight-only post-training quantization (PTQ) is a key technique for reducing LLM inference latency and memory footprint, particularly on memory-constrained edge devices. However, heavy-tailed outliers in LLM weights complicate quantization.
Recent progress and remaining challenges:
Rotation methods: Hadamard rotation transforms weights into near-Gaussian distributions, reducing the impact of outliers (a minimal rotation sketch follows after this list).
Integer-bit constraint: Existing quantizers support only integer bit-widths (2/3/4-bit), preventing fine-grained bit allocation.
Optimality gap: A substantial gap remains between practical quantizers and the information-theoretic lower bound (Gaussian rate-distortion bound).
Mixed-precision limitations: Existing mixed-precision methods select from a limited set of options without considering the mixing of quantizer types.
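As a minimal sketch of the randomized Hadamard rotation idea referenced above (not the paper's implementation): sign-randomized orthonormal Hadamard mixing spreads heavy-tailed outliers across coordinates, so the rotated weights look near-Gaussian. The Laplace-distributed weights and matrix sizes below are made-up for illustration.

```python
import numpy as np
from scipy.linalg import hadamard

def random_hadamard_rotate(W, seed=0):
    """Mix the rows of W with a sign-randomized orthonormal Hadamard matrix.
    Heavy-tailed entries are averaged into near-Gaussian ones (illustrative sketch)."""
    n = W.shape[0]                       # assume n is a power of two for simplicity
    rng = np.random.default_rng(seed)
    s = rng.choice([-1.0, 1.0], size=n)  # random sign flips randomize the rotation
    H = hadamard(n) / np.sqrt(n)         # orthonormal Hadamard matrix
    return H @ (s[:, None] * W)          # orthogonal, hence exactly invertible

# Hypothetical heavy-tailed weights: Laplace-distributed with occasional large outliers.
rng = np.random.default_rng(1)
W = rng.laplace(size=(256, 256))
W_rot = random_hadamard_rotate(W)
print(np.abs(W).max(), np.abs(W_rot).max())  # the largest outlier shrinks after rotation
```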
Method¶
Overall Architecture¶
Q-Palette consists of two main components: (1) a collection of fractional-bit quantizers spanning different bit-widths and distortion levels; and (2) a mixed-scheme framework that jointly optimizes quantizer selection and layer-fusion decisions.
Key Designs¶
1. Information-Theoretically Optimal Bit Allocation
- The Rate-Distortion function is derived for Gaussianized (post-rotation) weights.
- Optimal allocation requires quantizers of arbitrary precision, not merely integer bit-widths.
- The optimal per-layer bit-width takes a reverse water-filling form: \(b_l^* = \frac{1}{2}\log_2\frac{\sigma_l^2}{\lambda}\), where \(\sigma_l^2\) is the layer's (post-rotation) weight variance and \(\lambda\) is a water level set by the total bit budget.
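Below is a minimal numerical sketch of this allocation (not the authors' code): bisect on \(\lambda\) until the average of the closed-form bit-widths matches the bit budget. The per-layer variances and the budget are hypothetical illustration values.

```python
import numpy as np

def optimal_bit_allocation(variances, avg_bits, tol=1e-9):
    """Reverse water-filling: b_l = max(0, 0.5 * log2(sigma_l^2 / lam)),
    with lam chosen by bisection so the average bit-width equals avg_bits.
    (Illustrative sketch; variances and budget are hypothetical.)"""
    variances = np.asarray(variances, dtype=np.float64)
    lo, hi = 1e-12, variances.max()          # bracketing interval for lam
    for _ in range(200):
        lam = 0.5 * (lo + hi)
        bits = np.maximum(0.0, 0.5 * np.log2(variances / lam))
        if bits.mean() > avg_bits:           # spending too many bits -> raise lam
            lo = lam
        else:
            hi = lam
        if hi - lo < tol:
            break
    return bits

# Hypothetical per-layer variances of rotated (near-Gaussian) weights.
sigma2 = [1.0, 0.25, 4.0, 0.5]
print(optimal_bit_allocation(sigma2, avg_bits=3.0))
# Layers with larger variance receive more (generally fractional) bits.
```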
2. Fractional-Bit Quantizer Collection
Q-Palette provides a range of quantizers spanning near-optimal distortion to fast inference:
- Trellis-coded quantizers (TCQ): Implemented via finite-state machines, achieving distortion close to the Gaussian rate-distortion bound.
- Vector quantizers (VQ): Exploit multi-dimensional coding to achieve fractional bit-widths (e.g., 2.5-bit).
- Scalar quantizers: Simple and efficient, suited for latency-sensitive scenarios.
- All quantizers are equipped with optimized CUDA kernels.
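As a toy illustration of how vector quantization reaches fractional bit-widths (a sketch under simplified assumptions, not Q-Palette's actual codebooks): grouping weights into 2-D vectors and indexing a 32-entry codebook costs \(\log_2 32 / 2 = 2.5\) bits per weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D codebook with 32 entries -> log2(32) / 2 = 2.5 bits per weight.
# (Random Gaussian codebook for illustration only.)
dim, n_codes = 2, 32
codebook = rng.standard_normal((n_codes, dim))

def vq_encode(weights):
    """Quantize weights by nearest-codeword search over 2-D groups."""
    groups = weights.reshape(-1, dim)                            # (num_groups, 2)
    dists = ((groups[:, None, :] - codebook[None]) ** 2).sum(-1)
    return dists.argmin(axis=1)                                  # one 5-bit index per group

def vq_decode(idx):
    return codebook[idx].reshape(-1)

w = rng.standard_normal(1024)        # stand-in for rotated, near-Gaussian layer weights
w_hat = vq_decode(vq_encode(w))
bits_per_weight = np.log2(n_codes) / dim
print(bits_per_weight, np.mean((w - w_hat) ** 2))                # 2.5 bits, MSE distortion
```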
3. Mixed-Scheme Quantization Framework
- Jointly optimizes two decisions: (a) which quantizer to assign to each layer, and (b) whether to fuse adjacent layers.
- Dynamic programming is used to find the optimal scheme under given resource constraints.
- Objective: \(\min \sum_l D_l(q_l) \quad \text{s.t.} \quad \sum_l R_l(q_l) \leq B\)
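A simplified sketch of the selection step: the paper solves it with dynamic programming, but a Lagrangian relaxation, where each layer independently picks the quantizer minimizing \(D_l + \lambda R_l\) and \(\lambda\) is bisected to meet the budget, exposes the same rate–distortion trade-off. The palette entries, layer names, and (rate, distortion) numbers below are hypothetical.

```python
# Hypothetical per-layer (rate, distortion) table for a few quantizer options;
# rates in bits/weight, distortions in arbitrary MSE units.
palette = {"scalar-2": (2.0, 0.30), "vq-2.5": (2.5, 0.18),
           "vq-3": (3.0, 0.11), "tcq-3.5": (3.5, 0.06)}

layers = ["q_proj", "k_proj", "v_proj", "o_proj"]
# Per-layer sensitivity scaling of distortion (hypothetical calibration statistics).
sensitivity = {"q_proj": 1.0, "k_proj": 0.6, "v_proj": 1.4, "o_proj": 0.9}

def select(lam):
    """For fixed lambda, each layer independently minimizes D + lambda * R."""
    return {l: min(palette, key=lambda q: sensitivity[l] * palette[q][1]
                                          + lam * palette[q][0])
            for l in layers}

def solve(budget_bits_per_weight, iters=60):
    lo, hi = 0.0, 10.0                         # bisection interval for lambda
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        avg_rate = sum(palette[q][0] for q in select(lam).values()) / len(layers)
        if avg_rate > budget_bits_per_weight:  # over budget -> penalize rate more
            lo = lam
        else:
            hi = lam
    return select(0.5 * (lo + hi))

print(solve(budget_bits_per_weight=3.0))       # per-layer quantizer assignment
```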
Loss & Training¶
- Training-free: purely post-training quantization (PTQ).
- Calibration: requires few or zero calibration samples.
- Allocation optimization: solved iteratively via Lagrangian multiplier methods.
Key Experimental Results¶
Main Results¶
Perplexity (PPL, WikiText-2) of LLaMA-2-7B at various average bit-widths:
| Method | 2-bit | 2.5-bit | 3-bit | 4-bit |
|---|---|---|---|---|
| GPTQ | diverges | 12.85 | 8.32 | 6.09 |
| QuIP# | 9.15 | 7.85 | 6.83 | 5.98 |
| AQLM | 8.78 | 7.52 | 6.71 | 5.95 |
| Q-Palette (Ours) | 8.21 | 7.18 | 6.52 | 5.92 |
LLaMA-2-13B (PPL, WikiText-2):
| Method | 2-bit | 2.5-bit | 3-bit | 4-bit |
|---|---|---|---|---|
| GPTQ | diverges | 9.15 | 6.85 | 5.42 |
| QuIP# | 7.52 | 6.58 | 5.92 | 5.35 |
| AQLM | 7.28 | 6.35 | 5.85 | 5.32 |
| Q-Palette (Ours) | 6.85 | 6.12 | 5.73 | 5.30 |
Ablation Study¶
Distortion–speed trade-off across quantizer types (LLaMA-2-7B, 3-bit):
| Quantizer | PPL | Decoding Latency (ms/token) | Gap to Gaussian R-D Bound |
|---|---|---|---|
| Scalar (Uniform) | 7.15 | 2.1 | +1.23 dB |
| Vector (Group-VQ) | 6.78 | 3.5 | +0.52 dB |
| TCQ | 6.55 | 5.2 | +0.08 dB |
| Gaussian Rate-Distortion Bound | — | — | 0 dB |
Key Findings¶
- Fractional-bit quantization yields the greatest advantage in low-bit regimes (< 3-bit), narrowing the gap to the theoretical optimum.
- The TCQ quantizer achieves distortion only 0.08 dB above the Gaussian rate-distortion bound, approaching the information-theoretic limit.
- The mixed-scheme framework further reduces PPL by 0.3–0.5 compared to uniformly applying a single quantizer type.
- Optimized CUDA kernels keep TCQ inference overhead within a practical range.
Highlights & Insights¶
- Information-theoretic perspective: Grounding the work in Rate-Distortion theory provides a well-defined theoretical optimality target for quantization.
- Engineering completeness: The paper delivers not only theory but also full CUDA kernel implementations ready for deployment.
- Flexible composition: Q-Palette allows different layers to adopt different quantization strategies, maximizing overall efficiency.
Limitations & Future Work¶
- TCQ encoding and decoding complexity is relatively high, limiting practical inference speedup.
- The current work supports weight-only quantization and does not extend to KV cache quantization.
- The search space for the mixed-scheme framework grows with model scale.
- Performance degradation remains significant in ultra-low-bit regimes (< 2-bit).
Related Work & Insights¶
- QuIP# (Tseng et al.): Incoherence processing via randomized Hadamard rotations combined with lattice codebooks.
- AQLM (Egiazarian et al.): Additive multi-codebook quantization for extreme LLM compression.
- Rate-Distortion Theory: Shannon information theory provides the lower bound for quantization.
Rating¶
- ⭐ Novelty: 8/10 — Applying information-theoretic tools to LLM quantization offers a fresh perspective.
- ⭐ Practicality: 9/10 — Open-source code with CUDA kernels enables direct deployment.
- ⭐ Writing Quality: 8/10 — The transition from theory to engineering is presented smoothly.