EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

Conference: ACL 2025
arXiv: 2407.11062
Code: https://github.com/OpenGVLab/EfficientQAT
Area: Model Compression / LLM Efficiency
Keywords: quantization-aware training, LLM compression, block-wise training, low-bit quantization, step size

TL;DR

EfficientQAT proposes a two-stage QAT framework: Block-wise All-Parameter training (Block-AP) first provides a good initialization, and End-to-End Quantization-Parameter fine-tuning (E2E-QP) then captures cross-block interactions. It achieves 2-bit quantization of Llama-2-70B in 41 hours on a single A100 GPU, with an accuracy degradation of roughly 3 points relative to the FP16 model.

Background & Motivation

Background: LLM quantization methods fall into three categories: (1) PTQ (GPTQ/AWQ/OmniQuant) performs fast quantization via block-wise reconstruction, but suffers from significant accuracy loss at low-bit settings; (2) QAT (BitNet b1.58) trains all parameters end-to-end, achieving the best accuracy but requiring extremely high resources (training from scratch); (3) Q-PEFT (QLoRA/PEQA) freezes quantized weights and trains only a small number of parameters, failing to recover quantization loss at low-bit settings.

Limitations of Prior Work: (1) PTQ restricts the optimization space by training only rounding/clipping parameters and ignoring cross-block interactions; (2) Vanilla QAT requires complete training data and multi-GPU setups, which is infeasible for 70B models; (3) Q-PEFT contains too few trainable parameters (step size accounts for only ~1.6%), failing to recover quantization errors in ultra-low bit scenarios.

Key Challenge: Two trade-offs must be resolved at once. All-parameter training gives good accuracy at high cost, while limited-parameter training is efficient but less accurate. Block-wise training is memory-friendly but ignores cross-block interactions, while end-to-end training captures those interactions at a prohibitive memory cost.

Goal: To achieve quantization accuracy close to vanilla QAT on a single GPU, especially in ultra-low-bit (2-bit/3-bit) scenarios.

Key Insight: Decomposing QAT into two complementary stages: block-wise all-parameter training to provide a good initialization, and end-to-end training of only step sizes to capture cross-block interactions.

Core Idea: Splitting QAT into two stages, block-wise all-parameter training (Block-AP) followed by end-to-end quantization-parameter fine-tuning (E2E-QP), gives the method both a large optimization space and high memory efficiency.

Method

Overall Architecture

Stage 1 (Block-AP): Transformer blocks are processed in order. Within each block, all parameters (weights \(W\), step sizes \(s\), and zero points \(z\)) are trained with a reconstruction loss, yielding a quantized model with \(W_q, s, z\).
Stage 2 (E2E-QP): The quantized weights \(W_q\) are frozen and only the step sizes \(s\) are trained end-to-end, so gradients are needed for only ~1.6% of the parameters.

Key Designs

Block-AP (Block-wise All-Parameter Training):
- Function: Simultaneously trains weights, step sizes, and zero points within each transformer block.
- Mechanism: Standard uniform quantization is defined as \(W_{int} = \text{clamp}(\lfloor W/s \rceil + z, 0, 2^N-1)\), and dequantization is \(\hat{W} = (W_{int} - z) \cdot s\). By embedding quantization/dequantization into the computation graph, STE (Straight-Through Estimator) is used to optimize all parameters via gradient descent.
- Design Motivation: Prior block-wise methods (OmniQuant/BRECQ/AutoRound) train only a subset of parameters (clipping/rounding/LoRA), which limits the optimization space. Block-AP trains all parameters directly and needs no additional design; its advantage is most pronounced in 2-bit scenarios.
- Difference from Prior Works: It is the first method to train all parameters within a block-wise reconstruction paradigm; prior methods restricted the update range to \((-1, +1)\) to prevent overfitting, whereas Block-AP imposes no such restriction.
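
The quantization and dequantization formulas in the Mechanism bullet can be sketched in plain Python for a single weight group. This is a minimal illustration, not the authors' implementation: the helper names (`round_half_up`, `quantize`, `dequantize`, `ste_grads`), the toy weights, and the min-max initialization of \(s\) and \(z\) are all assumptions. `ste_grads` uses the usual STE convention of treating rounding as the identity in the backward pass, so gradients pass through for in-range values and are blocked with respect to \(W\) once a value is clamped.

```python
import math

def round_half_up(x):
    """Round to the nearest integer, ties upward (avoids Python's banker's rounding)."""
    return math.floor(x + 0.5)

def quantize(w, s, z, n_bits):
    """W_int = clamp(round(W / s) + z, 0, 2^N - 1), elementwise over one group."""
    q_max = 2 ** n_bits - 1
    return [min(max(round_half_up(x / s) + z, 0), q_max) for x in w]

def dequantize(w_int, s, z):
    """W_hat = (W_int - z) * s."""
    return [(q - z) * s for q in w_int]

def ste_grads(w, s, z, n_bits):
    """Per-element (dW_hat/dW, dW_hat/ds) with rounding treated as identity (STE)."""
    q_max = 2 ** n_bits - 1
    grads = []
    for x in w:
        q = round_half_up(x / s) + z
        if 0 <= q <= q_max:
            # In range: W_hat = round(W/s) * s, so dW_hat/ds = round(W/s) - W/s.
            grads.append((1.0, (q - z) - x / s))
        else:
            # Clamped: W_hat = (q_clamped - z) * s no longer depends on W.
            q_clamped = min(max(q, 0), q_max)
            grads.append((0.0, float(q_clamped - z)))
    return grads

# Toy 2-bit group; the values and the min-max initialization are illustrative assumptions.
w = [-0.6, -0.2, 0.1, 0.9]
n_bits = 2
s = (max(w) - min(w)) / (2 ** n_bits - 1)
z = round_half_up(-min(w) / s)
w_int = quantize(w, s, z, n_bits)    # integer codes in [0, 3]
w_hat = dequantize(w_int, s, z)      # reconstructed weights
grads = ste_grads(w, s, z, n_bits)   # gradients a Block-AP style update would use
```

In a Block-AP style update, these per-element gradients would be chained with the block's reconstruction loss so that \(W\), \(s\), and \(z\) are all updated by gradient descent.
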
E2E-QP (End-to-End Quantization Parameter Training):
- Function: Freezes the quantized weights \(W_q\) output by Block-AP and trains only the step sizes \(s\) end-to-end.
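- Mechanism: Only dequantization, \(\hat{W} = (W_{int} - z) \cdot s\), is performed; the quantization step (rounding and clamping) is skipped because the integer weights are already fixed. The gradient with respect to the step size, \(\partial\hat{w}/\partial s = w_q - z\) (where \(w_q\) is the frozen integer weight), is simple to compute. Trainable parameters account for only ~1.6% of the total (group size = 64), so memory requirements are very low.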
- Design Motivation: While Block-AP ignores cross-block interactions, E2E-QP allows step sizes of all blocks to co-optimize through an end-to-end objective function. Meanwhile, memory usage is extremely low—2-bit E2E-QP for a 70B model requires only 34.2GB.
- Flexibility: Can be directly trained on target datasets (for continual pre-training or instruction tuning).
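
The E2E-QP update can be sketched on a toy problem: the integer weights and zero point stay frozen, and only the step size \(s\) is moved by gradient descent using \(\partial\hat{w}/\partial s = w_q - z\). Everything here is an illustrative assumption rather than the paper's training setup: a single weight group, a squared-error loss on one linear output, and the names `e2e_qp_step` and `step_size_fraction`. The last helper assumes one step size per group of weights, which gives 1/64 ≈ 1.56% for group size 64, consistent with the ~1.6% figure above.

```python
def dequantize(w_int, s, z):
    """W_hat = (W_int - z) * s; no rounding or clamping is needed here."""
    return [(q - z) * s for q in w_int]

def e2e_qp_step(w_int, s, z, x, y_target, lr):
    """One gradient-descent step on s only, for the toy loss (W_hat . x - y)^2."""
    w_hat = dequantize(w_int, s, z)
    err = sum(wh * xi for wh, xi in zip(w_hat, x)) - y_target
    # d(w_hat_i)/ds = w_int_i - z, so dL/ds = 2 * err * sum((w_int_i - z) * x_i).
    grad_s = 2.0 * err * sum((q - z) * xi for q, xi in zip(w_int, x))
    return s - lr * grad_s, err * err

def step_size_fraction(group_size):
    """Share of trainable parameters, assuming one step size per weight group."""
    return 1.0 / group_size

# Frozen integer weights and zero point (toy values); only s is trained.
w_int, z = [0, 1, 1, 3], 1
x, y_target = [1.0, 2.0, -1.0, 1.5], 1.2
s = 0.5
losses = []
for _ in range(20):
    s, loss = e2e_qp_step(w_int, s, z, x, y_target, lr=0.05)
    losses.append(loss)

fraction = step_size_fraction(64)  # 0.015625, i.e. about 1.6%
```

In this toy setup the loss depends on \(s\) only through the dequantized weights, so the optimizer state and gradients are needed for a single scalar per group, which is the source of the memory savings described above.
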
Complementarity of the Two Stages:
- Block-AP provides high-quality initialization (large optimization space \(\rightarrow\) low quantization error) but does not consider cross-block interactions.
- E2E-QP performs lightweight fine-tuning on top of this initialization (few trainable parameters \(\rightarrow\) low risk of overfitting) to capture cross-block interactions.
- Combining the two stages targets accuracy close to QAT at a cost closer to PTQ.

Key Experimental Results

Main Results: Llama-2 Zero-Shot Inference (Average Accuracy over 5 Tasks)

Method	Bit	Llama-2-7B	Llama-2-13B	Llama-2-70B
FP16	16	64.86	67.81	72.41
GPTQ	3	62.48	66.18	71.47
AWQ	3	62.82	66.14	71.41
OmniQuant	3	62.42	66.18	71.07
EfficientQAT	3	64.02	67.28	71.76
OmniQuant	2	46.98	53.56	54.87
AutoRound	2	54.50	60.72	67.70
EfficientQAT	2	59.50	63.88	68.93
AQLM (VQ)	2	57.61	62.22	69.85
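
The per-model gaps implied by the table can be recomputed with a few lines of Python. The numbers are copied from the table above; the `results` dictionary layout and the `gap` helper are illustrative assumptions.

```python
# Average zero-shot accuracy from the table above, ordered as (7B, 13B, 70B).
results = {
    ("FP16", 16): [64.86, 67.81, 72.41],
    ("EfficientQAT", 3): [64.02, 67.28, 71.76],
    ("OmniQuant", 2): [46.98, 53.56, 54.87],
    ("AutoRound", 2): [54.50, 60.72, 67.70],
    ("EfficientQAT", 2): [59.50, 63.88, 68.93],
    ("AQLM (VQ)", 2): [57.61, 62.22, 69.85],
}

def gap(a, b):
    """Accuracy of entry a minus entry b for each model size, rounded to 2 decimals."""
    return [round(x - y, 2) for x, y in zip(results[a], results[b])]

vs_omniquant = gap(("EfficientQAT", 2), ("OmniQuant", 2))
vs_autoround = gap(("EfficientQAT", 2), ("AutoRound", 2))
drop_3bit = gap(("FP16", 16), ("EfficientQAT", 3))
drop_2bit = gap(("FP16", 16), ("EfficientQAT", 2))

for name, values in [("vs OmniQuant (2-bit)", vs_omniquant),
                     ("vs AutoRound (2-bit)", vs_autoround),
                     ("FP16 drop at 3-bit", drop_3bit),
                     ("FP16 drop at 2-bit", drop_2bit)]:
    print(name, values)
```

At 2-bit, the margin over OmniQuant ranges from 10.32 to 14.06 points and the margin over AutoRound from 1.23 to 5.00 points, depending on model size.
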

Ablation Study

Configuration	Performance	Description
Block-AP + E2E-QP	Best	Full EfficientQAT
Block-AP only	Good	Lacks cross-block interactions
E2E-QP only (RTN Initialization)	Poor	RTN initialization is too poor to recover
Block-AP training rounding only	Suboptimal	Restricts optimization space
Block-AP training clipping only	Suboptimal	Same as above
E2E training s + z	≈ E2E training s	Converting z to full-precision incurs extra overhead

Key Findings

Outstanding Advantage in 2-bit Scenarios: Under 2-bit settings, EfficientQAT outperforms OmniQuant by roughly 10 to 14 points and AutoRound by roughly 1 to 5 points, depending on model size.
Almost Lossless at 3-bit: Llama-2-70B at 3-bit obtains 71.76 vs. FP16 at 72.41, with only a 0.65-point drop.
Extremely High Training Efficiency: 2-bit quantization for a 70B model requires only a single A100, taking 41 hours and 34.2GB memory.
All-Parameter Training > Sub-Parameter Training in Block-AP: Simply and directly training all parameters is more effective than carefully designing rounding/clipping parameters, challenging the conventional belief that "restricting the optimization space is necessary to prevent overfitting."
Cross-Modal Universality: It is also effective on instruction-tuned LLMs and multimodal LLMs (LLaVA).
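
As a rough plausibility check on the single-GPU efficiency finding, the static storage of a 2-bit 70B model can be estimated with simple arithmetic. This is a back-of-the-envelope sketch under stated assumptions, not a figure from the paper: a nominal 70e9 parameters that are all quantized, densely packed 2-bit weights, one FP16 step size per group of 64 weights, and zero points ignored. The function names are hypothetical.

```python
def packed_weight_gb(n_params, n_bits):
    """Storage for densely packed n-bit weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * n_bits / 8 / 1e9

def step_size_gb(n_params, group_size, bytes_per_step=2):
    """Storage for one step size per weight group (FP16 assumed), in GB."""
    return n_params / group_size * bytes_per_step / 1e9

n_params = 70e9                               # nominal size, an assumption
weights_gb = packed_weight_gb(n_params, 2)    # 17.5
steps_gb = step_size_gb(n_params, 64)         # 2.1875
static_gb = weights_gb + steps_gb             # 19.6875
```

Under these assumptions the static footprint is about 19.7 GB, which leaves headroom within the reported 34.2GB for activations and the training state of the step sizes.
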

Highlights & Insights

"Simplicity is Key" Design Philosophy: The main innovation of Block-AP is, in effect, "not designing anything": it trains all parameters directly instead of hand-designing rounding/clipping parameters as previous works do. This suggests that the size of the optimization space matters more than regularization in LLM quantization.
Elegance of the Two-Stage Decomposition: The "all-parameter + end-to-end" nature of QAT is decomposed into "all-parameter + block-wise" and "few-parameters + end-to-end", which keeps each stage efficient. The combination approaches the performance of vanilla QAT.
Conciseness of E2E-QP Training Only the Step Size: Although step sizes account for only ~1.6% of the parameters, training them end-to-end captures cross-block interactions effectively. This suggests that end-to-end adjustment of quantization parameters offers high leverage per trainable parameter.
Transferability to Q-PEFT Scenarios: The E2E-QP stage of EfficientQAT can be directly trained on instruction fine-tuning data, unifying compression and fine-tuning.

Limitations & Future Work

Only uniform quantization was explored; it has not been combined with vector quantization (e.g., AQLM/QuIP#). AQLM remains competitive at 2-bit (69.85 vs. 68.93 on Llama-2-70B in the main results table).
Block-AP still requires block-wise full-precision forward passes, meaning memory might still be insufficient for ultra-large models (700B+).
Weight-activation quantization (WAQ) has not been explored; the work quantizes weights only.
Sensitivity analysis on the choice of calibration data and training hyperparameters is insufficient.

Comparison with Related Work

vs OmniQuant (Shao et al., 2023): OmniQuant trains clipping parameters block-wise, which restricts its optimization space. EfficientQAT's Block-AP trains all parameters and outperforms it by roughly 10 to 14 points at 2-bit.
vs BitNet b1.58 (Ma et al., 2024): BitNet belongs to vanilla QAT trained from scratch, whereas EfficientQAT is an efficient QAT on pre-existing models, making it more widely applicable.
vs PEQA (Kim et al., 2023): PEQA only trains step sizes end-to-end (lacks Block-AP initialization), making recovery difficult when starting from RTN. EfficientQAT's Block-AP provides a critical, high-quality initialization.

Rating

Novelty: ⭐⭐⭐⭐ The two-stage decomposition logic is clear. Although Block-AP's "all-parameter training" is simple, it is proposed and proven effective for the first time.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across model scales from 7B to 70B and bit-widths (2/3/4-bit), covering base, instruction-following, and multimodal scenarios, with a comprehensive ablation study.
Writing Quality: ⭐⭐⭐⭐ Clear and precise, with detailed method descriptions.
Value: ⭐⭐⭐⭐⭐ Highly practical: achieving 2-bit QAT for a 70B model on a single GPU provides a genuinely deployable solution.