Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment¶
Conference: NeurIPS 2025
arXiv: 2509.20214
Code: GitHub
Area: Model Compression
Keywords: Model Quantization, LLM Inference, Fractional-Bit Quantization, Rate-Distortion, CUDA Optimization
TL;DR¶
Starting from an information-theoretic analysis of optimal bit allocation for Gaussianized weights, this paper proposes Q-Palette, a collection of fractional-bit quantizers, together with a mixed-scheme quantization framework, achieving near-optimal quantization quality and practical inference speedups in LLM deployment.
Background & Motivation¶
Weight-only post-training quantization (PTQ) is a key technique for reducing LLM inference latency and memory footprint, particularly on memory-constrained edge devices. However, heavy-tailed outliers in LLM weights complicate quantization.
Recent progress and remaining challenges:
Rotation methods: Hadamard rotation transforms weights into near-Gaussian distributions, reducing the impact of outliers (a minimal rotation sketch follows after this list).
Integer-bit constraint: Existing quantizers support only integer bit-widths (2/3/4-bit), preventing fine-grained bit allocation.
Optimality gap: A substantial gap remains between practical quantizers and the information-theoretic lower bound (Gaussian rate-distortion bound).
Mixed-precision limitations: Existing mixed-precision methods select from a limited set of options without considering the mixing of quantizer types.
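As a minimal sketch of the randomized Hadamard rotation idea referenced above (not the paper's implementation): sign-randomized orthonormal Hadamard mixing spreads heavy-tailed outliers across coordinates, so the rotated weights look near-Gaussian. The Laplace-distributed weights and matrix sizes below are made-up for illustration.

```python
import numpy as np
from scipy.linalg import hadamard

def random_hadamard_rotate(W, seed=0):
    """Mix the rows of W with a sign-randomized orthonormal Hadamard matrix.
    Heavy-tailed entries are averaged into near-Gaussian ones (illustrative sketch)."""
    n = W.shape[0]                       # assume n is a power of two for simplicity
    rng = np.random.default_rng(seed)
    s = rng.choice([-1.0, 1.0], size=n)  # random sign flips randomize the rotation
    H = hadamard(n) / np.sqrt(n)         # orthonormal Hadamard matrix
    return H @ (s[:, None] * W)          # orthogonal, hence exactly invertible

# Hypothetical heavy-tailed weights: Laplace-distributed with occasional large outliers.
rng = np.random.default_rng(1)
W = rng.laplace(size=(256, 256))
W_rot = random_hadamard_rotate(W)
print(np.abs(W).max(), np.abs(W_rot).max())  # the largest outlier shrinks after rotation
```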
Method¶
Overall Architecture¶
Q-Palette consists of two main components: (1) a collection of fractional-bit quantizers spanning different bit-widths and distortion levels; and (2) a mixed-scheme framework that jointly optimizes quantizer selection and layer-fusion decisions.
Key Designs¶
1. Information-Theoretically Optimal Bit Allocation
- The Rate-Distortion function is derived for Gaussianized (post-rotation) weights.
- Optimal allocation requires quantizers of arbitrary precision, not merely integer bit-widths.
- The optimal per-layer bit-width takes a reverse water-filling form: \(b_l^* = \frac{1}{2}\log_2\frac{\sigma_l^2}{\lambda}\), where \(\sigma_l^2\) is the layer's (post-rotation) weight variance and \(\lambda\) is a water level set by the total bit budget.
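Below is a minimal numerical sketch of this allocation (not the authors' code): bisect on \(\lambda\) until the average of the closed-form bit-widths matches the bit budget. The per-layer variances and the budget are hypothetical illustration values.

```python
import numpy as np

def optimal_bit_allocation(variances, avg_bits, tol=1e-9):
    """Reverse water-filling: b_l = max(0, 0.5 * log2(sigma_l^2 / lam)),
    with lam chosen by bisection so the average bit-width equals avg_bits.
    (Illustrative sketch; variances and budget are hypothetical.)"""
    variances = np.asarray(variances, dtype=np.float64)
    lo, hi = 1e-12, variances.max()          # bracketing interval for lam
    for _ in range(200):
        lam = 0.5 * (lo + hi)
        bits = np.maximum(0.0, 0.5 * np.log2(variances / lam))
        if bits.mean() > avg_bits:           # spending too many bits -> raise lam
            lo = lam
        else:
            hi = lam
        if hi - lo < tol:
            break
    return bits

# Hypothetical per-layer variances of rotated (near-Gaussian) weights.
sigma2 = [1.0, 0.25, 4.0, 0.5]
print(optimal_bit_allocation(sigma2, avg_bits=3.0))
# Layers with larger variance receive more (generally fractional) bits.
```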
2. Fractional-Bit Quantizer Collection
Q-Palette provides a range of quantizers spanning near-optimal distortion to fast inference:
- Trellis-coded quantizers (TCQ): Implemented via finite-state machines, achieving distortion close to the Gaussian rate-distortion bound.
- Vector quantizers (VQ): Exploit multi-dimensional coding to achieve fractional bit-widths (e.g., 2.5-bit).
- Scalar quantizers: Simple and efficient, suited for latency-sensitive scenarios.
- All quantizers are equipped with optimized CUDA kernels.
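As a toy illustration of how vector quantization reaches fractional bit-widths (a sketch under simplified assumptions, not Q-Palette's actual codebooks): grouping weights into 2-D vectors and indexing a 32-entry codebook costs \(\log_2 32 / 2 = 2.5\) bits per weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D codebook with 32 entries -> log2(32) / 2 = 2.5 bits per weight.
# (Random Gaussian codebook for illustration only.)
dim, n_codes = 2, 32
codebook = rng.standard_normal((n_codes, dim))

def vq_encode(weights):
    """Quantize weights by nearest-codeword search over 2-D groups."""
    groups = weights.reshape(-1, dim)                            # (num_groups, 2)
    dists = ((groups[:, None, :] - codebook[None]) ** 2).sum(-1)
    return dists.argmin(axis=1)                                  # one 5-bit index per group

def vq_decode(idx):
    return codebook[idx].reshape(-1)

w = rng.standard_normal(1024)        # stand-in for rotated, near-Gaussian layer weights
w_hat = vq_decode(vq_encode(w))
bits_per_weight = np.log2(n_codes) / dim
print(bits_per_weight, np.mean((w - w_hat) ** 2))                # 2.5 bits, MSE distortion
```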
3. Mixed-Scheme Quantization Framework
- Jointly optimizes two decisions: (a) which quantizer to assign to each layer, and (b) whether to fuse adjacent layers.
- Dynamic programming is used to find the optimal scheme under given resource constraints.
- Objective: \(\min \sum_l D_l(q_l) \quad \text{s.t.} \quad \sum_l R_l(q_l) \leq B\)
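A simplified sketch of the selection step: the paper solves it with dynamic programming, but a Lagrangian relaxation, where each layer independently picks the quantizer minimizing \(D_l + \lambda R_l\) and \(\lambda\) is bisected to meet the budget, exposes the same rate–distortion trade-off. The palette entries, layer names, and (rate, distortion) numbers below are hypothetical.

```python
# Hypothetical per-layer (rate, distortion) table for a few quantizer options;
# rates in bits/weight, distortions in arbitrary MSE units.
palette = {"scalar-2": (2.0, 0.30), "vq-2.5": (2.5, 0.18),
           "vq-3": (3.0, 0.11), "tcq-3.5": (3.5, 0.06)}

layers = ["q_proj", "k_proj", "v_proj", "o_proj"]
# Per-layer sensitivity scaling of distortion (hypothetical calibration statistics).
sensitivity = {"q_proj": 1.0, "k_proj": 0.6, "v_proj": 1.4, "o_proj": 0.9}

def select(lam):
    """For fixed lambda, each layer independently minimizes D + lambda * R."""
    return {l: min(palette, key=lambda q: sensitivity[l] * palette[q][1]
                                          + lam * palette[q][0])
            for l in layers}

def solve(budget_bits_per_weight, iters=60):
    lo, hi = 0.0, 10.0                         # bisection interval for lambda
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        avg_rate = sum(palette[q][0] for q in select(lam).values()) / len(layers)
        if avg_rate > budget_bits_per_weight:  # over budget -> penalize rate more
            lo = lam
        else:
            hi = lam
    return select(0.5 * (lo + hi))

print(solve(budget_bits_per_weight=3.0))       # per-layer quantizer assignment
```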
Loss & Training¶
- Training-free: purely post-training quantization (PTQ).
- Calibration: requires few or zero calibration samples.
- Allocation optimization: solved iteratively via Lagrangian multiplier methods.
Key Experimental Results¶
Main Results¶
Perplexity (PPL, WikiText-2) of LLaMA-2-7B at various average bit-widths:
| Method | 2-bit | 2.5-bit | 3-bit | 4-bit |
|---|---|---|---|---|
| GPTQ | diverges | 12.85 | 8.32 | 6.09 |
| QuIP# | 9.15 | 7.85 | 6.83 | 5.98 |
| AQLM | 8.78 | 7.52 | 6.71 | 5.95 |
| Q-Palette (Ours) | 8.21 | 7.18 | 6.52 | 5.92 |
LLaMA-2-13B (PPL, WikiText-2):
| Method | 2-bit | 2.5-bit | 3-bit | 4-bit |
|---|---|---|---|---|
| GPTQ | diverges | 9.15 | 6.85 | 5.42 |
| QuIP# | 7.52 | 6.58 | 5.92 | 5.35 |
| AQLM | 7.28 | 6.35 | 5.85 | 5.32 |
| Q-Palette (Ours) | 6.85 | 6.12 | 5.73 | 5.30 |
Ablation Study¶
Distortion–speed trade-off across quantizer types (LLaMA-2-7B, 3-bit):
| Quantizer | PPL | Decoding Latency (ms/token) | Gap to Gaussian R-D Bound |
|---|---|---|---|
| Scalar (Uniform) | 7.15 | 2.1 | +1.23 dB |
| Vector (Group-VQ) | 6.78 | 3.5 | +0.52 dB |
| TCQ | 6.55 | 5.2 | +0.08 dB |
| Gaussian Rate-Distortion Bound | — | — | 0 dB |
Key Findings¶
- Fractional-bit quantization yields the greatest advantage in low-bit regimes (< 3-bit), narrowing the gap to the theoretical optimum.
- The TCQ quantizer achieves distortion only 0.08 dB above the Gaussian rate-distortion bound, approaching the information-theoretic limit.
- The mixed-scheme framework further reduces PPL by 0.3–0.5 compared to uniformly applying a single quantizer type.
- Optimized CUDA kernels keep TCQ inference overhead within a practical range.
Highlights & Insights¶
- Information-theoretic perspective: Grounding the work in Rate-Distortion theory provides a well-defined theoretical optimality target for quantization.
- Engineering completeness: The paper delivers not only theory but also full CUDA kernel implementations ready for deployment.
- Flexible composition: Q-Palette allows different layers to adopt different quantization strategies, maximizing overall efficiency.
Limitations & Future Work¶
- TCQ encoding and decoding complexity is relatively high, limiting practical inference speedup.
- The current work supports weight-only quantization and does not extend to KV cache quantization.
- The search space for the mixed-scheme framework grows with model scale.
- Performance degradation remains significant in ultra-low-bit regimes (< 2-bit).
Related Work & Insights¶
- QuIP# (Tseng et al.): Incoherence processing via randomized Hadamard rotations combined with lattice codebooks.
- AQLM (Egiazarian et al.): Additive multi-codebook quantization for extreme LLM compression.
- Rate-Distortion Theory: Shannon information theory provides the lower bound for quantization.
Rating¶
- ⭐ Novelty: 8/10 — Applying information-theoretic tools to LLM quantization offers a fresh perspective.
- ⭐ Practicality: 9/10 — Open-source code with CUDA kernels enables direct deployment.
- ⭐ Writing Quality: 8/10 — The transition from theory to engineering is presented smoothly.