ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization¶
Conference: NeurIPS 2025 · arXiv: 2502.02631 · Code: Not available · Area: Model Compression · Keywords: Extremely low-bit quantization, scaling laws, quantization-aware training, 2-bit quantization, Pareto optimality
TL;DR¶
This paper proposes ParetoQ, the first unified framework supporting 1/1.58/2/3/4-bit quantization, which systematically studies training strategies (how to split the budget between full-precision pretraining and QAT) and quantization function design (introducing the Stretched Elastic Quant, SEQ, quantizer). The work demonstrates that 2-bit and 1.58-bit quantization outperform conventional 4-bit in the accuracy–model-size trade-off, and achieves state-of-the-art results across all bit-widths.
Background & Motivation¶
A central debate in LLM quantization concerns the optimal bit-width:
- One camp (Dettmers & Zettlemoyer 2023) argues that 4-bit or 6-bit is Pareto optimal.
- Another camp (Ma et al. 2024; Kaushal et al. 2024) claims that 1.58-bit suffices to match full-precision performance.
Why do conclusions conflict? The absence of a unified framework means different works use different training recipes, quantization functions, and baselines, rendering their conclusions incomparable.
Key observation: Prior scaling-law studies reduce the search space to \(\mathcal{L}(\mathcal{N}, \mathcal{D}, \mathcal{P})\) (model size, data volume, precision), neglecting two critical factors: training strategy \(\mathcal{S}_{\text{train}}\) and bit-specific quantization function \(\mathcal{F}\). The correct search space is \(\mathcal{L}(\mathcal{N}, \mathcal{D}, \mathcal{P}, \mathcal{S}_{\text{train}}, \mathcal{F})\) — a five-dimensional space.
Core finding: A significant behavioral shift exists between 2-bit and 3-bit quantization — at 3-bit and above, QAT acts as compensation (weight changes of 10–20%), whereas at 2-bit and below, it performs reconstruction (weight changes of ~40%).
Method¶
Overall Architecture¶
ParetoQ's methodology proceeds in three steps:
- Fix the quantization function, find the optimal training strategy: \(\mathcal{L}(\mathcal{N}, \mathcal{D}, \mathcal{S}_{\text{train}} | \mathcal{P}, \mathcal{F})\)
- Fix the optimal training strategy, find the optimal quantization function: \(\mathcal{L}(\mathcal{N}, \mathcal{F} | \mathcal{P}, \mathcal{D}^*, \mathcal{S}_{\text{train}}^*)\)
- Fix optimal training and quantization, compare across bit-widths: \(\mathcal{L}(\mathcal{N}, \mathcal{P} | \mathcal{F}^*, \mathcal{D}^*, \mathcal{S}_{\text{train}}^*)\)
Key Designs¶
- Training budget allocation strategy: Given a fixed total training budget \(\mathcal{B}_{\text{train}} = \mathcal{B}_{\text{FP}} + \mathcal{B}_{\text{QAT}}\), the paper investigates the optimal split between full-precision pretraining and QAT fine-tuning. Experiments on MobileLLM-125M reveal:
  - ~90% for full-precision pretraining + ~10% for QAT is the optimal allocation, holding consistently across nearly all bit-widths.
  - Training from scratch with QAT (allocating the entire budget to quantized training) consistently underperforms pretrain-then-finetune.
  - QAT saturates at ~10B tokens for 3-bit/4-bit, but only at ~30B tokens for 1-bit/1.58-bit/2-bit.
  Design motivation: at ≤2-bit, QAT must reconstruct the weights rather than merely compensate for quantization error, so it searches a larger space and needs correspondingly more training tokens.
- Stretched Elastic Quant (SEQ) quantizer: a key innovation for 1.58-bit and 2-bit quantization. The problem: 2-bit quantization has only 4 quantization levels. A grid that includes zero (e.g., \(\{-2,-1,0,1\}\)) leaves just one positive level and is imbalanced; a grid that excludes zero (e.g., \(\{-1.5,-0.5,0.5,1.5\}\)) restores balance but cannot represent zero. SEQ resolves this via:
\(\mathbf{W}_Q^i = \alpha \left(\lfloor \text{Clip}\left(\frac{\mathbf{W}_R^i}{\alpha}, -1, 1\right) \times \frac{k}{2} - 0.5 \rceil + 0.5 \right) / k \times 2\)
Here \(k\) controls the number of quantization levels (e.g., four levels at 2-bit). The construction simultaneously achieves balanced quantization levels and uniform coverage of the full-precision weight range. For 3-bit/4-bit, LSQ (which benefits from including zero) is retained.
- Unified ParetoQ quantization formula:
\(\mathbf{W}_Q^i = \begin{cases} \alpha \cdot \text{Sign}(\mathbf{W}_R^i), & N_{\text{bit}} = 1 \\ \alpha(\lfloor \text{Clip}(\frac{\mathbf{W}_R^i}{\alpha}, -1, 1) \times k/2 - 0.5 \rceil + 0.5)/k \times 2, & N_{\text{bit}} = 1.58, 2 \\ \alpha \lfloor \text{Clip}(\frac{\mathbf{W}_R^i}{\alpha}, n, p) \rceil, & N_{\text{bit}} = 3, 4 \end{cases}\)
Backpropagation uses the Straight-Through Estimator (STE), with separate gradient definitions for the weights and for the scaling factor \(\alpha\). \(\alpha\) is initialized to the \(\ell_1\) mean of the weights for 1-bit, and to the maximum absolute weight value for all other bit-widths.
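The unified formula above can be sketched in NumPy. This is an illustrative re-implementation, not the authors' code: the interpretation of \(k\) as the number of quantization levels, the extra clip that guards the rounding edge at \(\pm 1\), and all function names are assumptions.

```python
import numpy as np

def seq_quantize(w, alpha, k):
    """Stretched Elastic Quant (SEQ) forward pass, following the paper's
    formula. `k` is taken to be the number of quantization levels
    (an assumption; e.g. k = 4 for 2-bit)."""
    clipped = np.clip(w / alpha, -1.0, 1.0)
    q = np.round(clipped * (k / 2) - 0.5) + 0.5      # half-integer grid
    q = np.clip(q, -(k / 2) + 0.5, k / 2 - 0.5)      # guard the clipped = +/-1 edge
    return alpha * q / k * 2

def sign_quantize(w, alpha):
    """1-bit case: alpha * Sign(w)."""
    return alpha * np.where(w >= 0, 1.0, -1.0)

def lsq_quantize(w, alpha, n_bits):
    """3/4-bit case: LSQ-style round-to-nearest on an integer grid [n, p]
    that includes zero."""
    n, p = -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1
    return alpha * np.round(np.clip(w / alpha, n, p))

def ste_weight_grad(w, alpha, grad_out):
    """Straight-Through Estimator for the weight gradient: pass the upstream
    gradient through the rounding, zeroed where the clip saturates."""
    return grad_out * (np.abs(w / alpha) <= 1.0)
```

With \(k = 4\) and \(\alpha = 1\), the SEQ output grid is \(\{-0.75, -0.25, 0.25, 0.75\}\): four balanced levels spanning the clip range, matching the trade-off described above.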
Loss & Training¶
- AdamW optimizer, zero weight decay, 16 GPUs, batch size 8 per GPU.
- 1/1.58/2-bit: 120K steps, learning rate \(2 \times 10^{-5}\), cosine decay.
- 3/4-bit: 40K steps, learning rate \(1 \times 10^{-5}\), cosine decay.
- All weights quantized except embeddings and the output layer.
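For quick reference, the recipe above can be collected into a plain config sketch. The dictionary layout, key names, and the sequence length used in the token estimate are assumptions not stated in the summary:

```python
# Training recipe from the section, as a config sketch (key names assumed).
QAT_CONFIGS = {
    "low_bit":  {"bits": (1, 1.58, 2), "steps": 120_000, "lr": 2e-5,
                 "schedule": "cosine", "weight_decay": 0.0},
    "high_bit": {"bits": (3, 4),       "steps": 40_000,  "lr": 1e-5,
                 "schedule": "cosine", "weight_decay": 0.0},
}

def tokens_seen(steps, gpus=16, batch_per_gpu=8, seq_len=2048):
    """Rough token count for a run; seq_len is an assumption, as the
    summary does not state it."""
    return steps * gpus * batch_per_gpu * seq_len
```

Under the assumed 2048-token sequence length, the low-bit run (120K steps × 16 GPUs × 8 per GPU) works out to roughly 31B tokens; the actual figure depends on the true sequence length.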
Key Experimental Results¶
Main Results: LLaMA-3 8B Across Bit-Widths¶
| Method | Bits | ARC-e | ARC-c | PIQA | HellaS | WinoG | Avg. | Wiki2 |
|---|---|---|---|---|---|---|---|---|
| Full Precision | 16 | 81.0 | 57.7 | 81.0 | 79.5 | 73.9 | 74.6 | 6.15 |
| EfficientQAT | 2 | 69.3 | 46.8 | 76.4 | 69.0 | 66.3 | 65.5 | 9.6 |
| ParetoQ | 2 | 78.5 | 54.5 | 79.2 | 73.8 | 70.0 | 71.2 | 8.0 |
| 1-bit Era | 1.58 | 72.8 | 45.4 | 81.0 | 70.6 | 58.0 | 65.6 | 11.7 |
| ParetoQ | 1.58 | 76.3 | 51.4 | 77.7 | 71.9 | 67.7 | 69.0 | 8.6 |
| BiLLM | 1 | 33.2 | 25.6 | 54.6 | 32.7 | 50.5 | 39.3 | 38.5 |
| ParetoQ | 1 | 75.5 | 51.9 | 76.6 | 69.4 | 65.6 | 67.8 | 9.5 |
Ablation Study: Impact of Quantization Function Choice¶
| Quantizer | 1.58-bit Acc. | 2-bit Acc. | 3-bit Acc. | 4-bit Acc. |
|---|---|---|---|---|
| Min-Max (stats) | Poor | Collapse | Usable | Good |
| Range clipping (stats) | Good | Good | Poor | Poor |
| LSQ (learnable) | Medium | Medium | Best | Best |
| SEQ (learnable) | Best | Best | Slightly lower | Slightly lower |
Key Findings¶
- Pareto curves challenge conventional wisdom: 1.58-bit, 2-bit, and 3-bit all outperform 4-bit in the accuracy–model-size trade-off.
- ParetoQ 1.58-bit 8B closes the gap to full precision by 37.8% relative to 1-bit Era, using only 30% of the training tokens.
- A ParetoQ 600M ternary model surpasses the previous SOTA 3B ternary model — achieving equivalent performance with 1/5 the parameters.
- A behavioral transition exists between 2-bit and 3-bit: ≥3-bit is compensation (small weight adjustments); ≤2-bit is reconstruction (large weight changes).
- 2-bit benefits from CPU kernel acceleration, yielding a better accuracy–speed trade-off than 4-bit.
- 1.58-bit and 3-bit are less hardware-friendly than 2-bit (1.58-bit has complex storage requirements; 3-bit suffers from alignment difficulties).
Highlights & Insights¶
- The paper's greatest methodological contribution is formalizing the fragmented low-bit quantization landscape as a five-dimensional search problem, enabling rigorous apples-to-apples comparisons for the first time.
- The SEQ quantizer embodies the insight that, in low-bit settings, balanced quantization levels are more important than including zero.
- The compensation vs. reconstruction dichotomy provides an intuitive explanation for QAT behavior across different bit-widths.
- 2-bit quantization holds practical deployment value as a potential successor to 4-bit — native INT2 hardware support represents an important direction for the community to pursue.
Limitations & Future Work¶
- Experiments cover only the MobileLLM and LLaMA-3 families (up to 8B); larger models (70B+) remain unvalidated.
- The 2-bit kernel implementation is CPU-only; GPU-side native INT2 support is lacking.
- Only weight quantization is addressed; activation quantization is not considered.
- Training costs remain substantial (120K steps × 16 GPUs), posing challenges for resource-constrained researchers.
- Comparisons with mixed-precision quantization methods are absent.
Related Work & Insights¶
- The comparison with 1-bit Era (Ma et al. 2024) is the most compelling: under the unified framework, ParetoQ substantially outperforms it with fewer training tokens and simpler optimization.
- Relative to the 4-bit conclusion of Dettmers & Zettlemoyer (2023), ParetoQ demonstrates that better quantization function design can shift the Pareto frontier toward lower bit-widths.
- Takeaway: quantization exhibits a "no free lunch" property — each bit-width requires a tailored quantization function, and the value of a unified framework lies in providing a fair benchmark.
Rating¶
- Novelty: ⭐⭐⭐⭐ The unified framework and SEQ quantizer are substantive contributions, though the work leans more toward systematic engineering research.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Eight models × five bit-widths, covering PTQ/QAT/VQ baselines — exceptionally comprehensive.
- Writing Quality: ⭐⭐⭐⭐ Clear logic and rich figures; the unfolding of the five-dimensional search space is occasionally verbose.
- Value: ⭐⭐⭐⭐⭐ Establishes an authoritative benchmark for the low-bit quantization field; the discovery of 2-bit's potential carries significant practical implications.