
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Conference: ACL 2025
arXiv: 2411.02355
Code: None
Area: Model Compression / LLM Efficiency
Keywords: LLM quantization, FP8, INT8, INT4, inference benchmark, vLLM

TL;DR

This is the most comprehensive empirical study of LLM quantization to date, with over 500k evaluations of FP8, INT8, and INT4 formats across the entire Llama-3.1 family (8B/70B/405B). It finds that FP8 is nearly lossless, INT8 incurs only a 1-3% drop, and INT4 is surprisingly competitive, and it gives recommendations for choosing a quantization format in different deployment scenarios.

Background & Motivation

Background: Quantization has become a dominant technique for accelerating LLM inference. The main formats are W8A8-FP (FP8 weights and activations, Hopper GPUs), W8A8-INT (INT8 weights and activations, Ampere GPUs), and W4A16-INT (4-bit integer weights with 16-bit activations). However, the accuracy-performance trade-offs across these formats have not been benchmarked systematically.

Limitations of Prior Work: (1) Prior research (e.g., Lee et al., 2024b) claimed that W8A8-INT performs significantly worse than FP8, leading to community misunderstandings (such as skepticism toward quantized 405B models); (2) most evaluations use only academic benchmarks, which do not reflect real-world deployment scenarios; (3) suboptimal hyperparameters in earlier studies lead to misleading conclusions (e.g., that AWQ outperforms GPTQ); (4) no comprehensive analysis combines accuracy with inference performance (latency/throughput).

Key Challenge: The community exhibits a "BF16 or nothing" bias: because it is unclear whether quantization is good enough, many deployments stay in BF16 and waste hardware resources.

Goal: To provide a data-driven guide for selecting quantization formats: how much accuracy is actually lost under each format, and which format is optimal across different hardware/scenarios?

Key Insight: Build a comprehensive evaluation framework that covers academic, real-world, and textual-similarity benchmarks, and measure inference performance on three GPU types (A6000/A100/H100).

Core Idea: Answer "how much does quantization actually lose" through over 500k evaluations. The answer is: far less than generally expected.

Method

Overall Architecture

Rather than proposing a new methodology, this paper is a systematic empirical study. The framework covers: three quantization formats (W8A8-FP/W8A8-INT/W4A16-INT) \(\times\) the entire Llama-3.1 family (8B/70B/405B) \(\times\) multi-dimensional evaluations (academic benchmarks/real-world tasks/textual similarity/inference performance) \(\times\) three GPU architectures.

Key Designs

  1. Evaluation Framework:

    • Academic Benchmarks: Open LLM Leaderboard V1 (GSM/MMLU/ARC/Winogrande/HellaSwag/TruthfulQA) + V2 (MMLU-Pro/GPQA/BBH/MuSR/MATH/IFEval)
    • Real-world Benchmarks: Arena-Hard-Auto (500 complex prompts), HumanEval/HumanEval+ (code generation), RULER (long context 4k-128k)
    • Textual Similarity: ROUGE-1/ROUGE-L/BERTScore/STS to analyze the semantic and structural consistency between the outputs of the quantized model and the original model under the same prompts.
  2. Quantization Algorithm Optimization:

    • W8A8-FP: Dynamic per-token activation quantization + symmetric weight quantization, requiring no calibration data.
    • W8A8-INT: GPTQ symmetric weight quantization + dynamic per-token activation quantization + SmoothQuant (necessary for 70B).
    • W4A16-INT: GPTQ + MSE optimal clipping + group size of 128 + high-quality calibration data (OpenPlatypus).
    • Correcting prior AWQ vs GPTQ misunderstandings: GPTQ, when paired with MSE clipping and high-quality calibration data, outperforms AWQ on real-world tasks.
  3. Inference Performance Evaluation:

    • 7 deployment scenarios: synchronous/asynchronous, varying concurrency levels.
    • vLLM framework tested across 3 GPU types: A6000/A100/H100.
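
The W8A8-INT recipe above combines static per-channel weight scales with activation scales computed at run time for each token. The following is a minimal NumPy sketch of that idea, not the paper's implementation: it uses plain round-to-nearest for the weights instead of GPTQ, omits SmoothQuant, and the function names `quantize_rows_symmetric` and `w8a8_int_linear` are hypothetical.

```python
import numpy as np

def quantize_rows_symmetric(t: np.ndarray, n_bits: int = 8):
    """Symmetric round-to-nearest quantization with one scale per row."""
    qmax = 2 ** (n_bits - 1) - 1                      # 127 for INT8
    absmax = np.abs(t).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / qmax, 1.0)  # guard all-zero rows
    q = np.clip(np.round(t / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def w8a8_int_linear(x: np.ndarray, q_w: np.ndarray, w_scale: np.ndarray):
    """Linear layer y = x @ w.T with INT8 operands and INT32 accumulation."""
    # Dynamic per-token quantization: scales come from the live activations,
    # so no calibration data is needed for this step.
    q_x, x_scale = quantize_rows_symmetric(x)
    acc = q_x.astype(np.int32) @ q_w.astype(np.int32).T
    # Rescale: one factor per token (row) times one factor per output channel.
    return acc * x_scale * w_scale.T

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # 4 tokens, 64 features
w = rng.standard_normal((16, 64)).astype(np.float32)  # 16 output channels

q_w, w_scale = quantize_rows_symmetric(w)             # done once, offline
y_ref = x @ w.T
y_q = w8a8_int_linear(x, q_w, w_scale)
rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
print(f"relative error of the INT8 path: {rel_err:.4f}")
```

The weight rows are quantized once, while each call re-derives the activation scales, which is why this style of activation quantization adds no calibration requirement of its own.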

Key Experimental Results

Main Results: Accuracy on Academic and Real-world Benchmarks

| Quantization Format | Llama-3.1-8B Leaderboard V1 / V2 | Arena-Hard | HumanEval |
|---|---|---|---|
| BF16 | 74.06 / 27.62 | 25.8 | 67.3 |
| W8A8-FP | ≈BF16 (within error margin) | ≈BF16 | ≈BF16 |
| W8A8-INT | 73.x / 27.x (1-3%↓) | 24.x | 65.x |
| W4A16-INT (GPTQ) | 73.11 / 26.53 | 24.0 | 67.1 |
| W4A16-INT (AWQ) | 72.69 / 27.40 | 22.3 | 63.0 |
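
One way to read these rows is as recovery, that is, the quantized score expressed as a percentage of the BF16 score. The sketch below applies that arithmetic to the Llama-3.1-8B numbers listed above; the `recovery` helper and the dictionary keys are illustrative names, and the W8A8-INT row is left out because its exact values are not given here.

```python
# Scores for Llama-3.1-8B copied from the rows above.
bf16 = {"leaderboard_v1": 74.06, "leaderboard_v2": 27.62,
        "arena_hard": 25.8, "humaneval": 67.3}
quantized = {
    "W4A16-INT (GPTQ)": {"leaderboard_v1": 73.11, "leaderboard_v2": 26.53,
                         "arena_hard": 24.0, "humaneval": 67.1},
    "W4A16-INT (AWQ)": {"leaderboard_v1": 72.69, "leaderboard_v2": 27.40,
                        "arena_hard": 22.3, "humaneval": 63.0},
}

def recovery(quantized_score: float, baseline_score: float) -> float:
    """Quantized score as a percentage of the unquantized baseline."""
    return 100.0 * quantized_score / baseline_score

for name, scores in quantized.items():
    row = ", ".join(
        f"{bench}: {recovery(scores[bench], bf16[bench]):.1f}%" for bench in bf16
    )
    print(f"{name} -> {row}")
```

Recovery makes benchmarks with very different absolute scales (for example Leaderboard V1 versus V2) easier to compare, but on low-scoring benchmarks a small absolute change produces a large relative change.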

Ablation Study: GPTQ vs AWQ

| Dimension | GPTQ | AWQ | Notes |
|---|---|---|---|
| Academic Benchmark Average | ≈AWQ | ≈GPTQ | Roughly on par |
| Arena-Hard | +1.7 | Baseline | GPTQ is better on real-world tasks |
| HumanEval | +4.1 | Baseline | GPTQ is significantly better for code generation |
| MBPP | +3.0 | Baseline | Re-verified on code generation |

Key Findings

  • FP8 is nearly lossless: Across all model scales and benchmarks, FP8 results are equivalent to BF16 within the evaluation error margin. It requires only round-to-nearest (RTN) quantization and no calibration data.
  • INT8 is far better than previously reported: The previously claimed drop of 10+% is actually only 1-3%; the key lies in the correct usage of SmoothQuant and high-quality calibration data.
  • INT4 is surprisingly good: W4A16-INT is comparable to or even better than W8A8-INT in accuracy, while needing only half the weight memory of an 8-bit format.
  • GPTQ > AWQ on real-world tasks: MSE optimal clipping + high-quality calibration are crucial, overturning the community assumption that AWQ is superior.
  • Textual Similarity: The textual output of large models post-quantization remains almost identical to the original (BERTScore >0.95); smaller models exhibit moderate structural variation but preserve semantics.
  • Deployment Recommendations: Use W4A16 for synchronous scenarios (lowest latency) and W8A8 for asynchronous high-throughput scenarios (maximum throughput); for hybrid scenarios, choose based on the ratio between the two workloads.

Highlights & Insights

  • The lesson of "hyperparameters matter": Many previous studies' conclusions regarding quantization formats were actually due to poorly tuned hyperparameters (e.g., using default absmax instead of MSE clipping in GPTQ, or using C4 instead of high-quality data). This serves as a reminder to ensure strong baselines during comparative experiments.
  • Empirically-driven deployment guide: Derived from over 500k actual evaluations rather than theoretical analysis, directly guiding industrial deployments.
  • Dismantling the "BF16 or nothing" myth: The paper presents extensive evidence that quantization costs little accuracy (FP8 is effectively lossless and INT8 loses only 1-3%), which lowers the psychological barrier to adopting quantization in the community.
  • Textual similarity analysis: Going beyond traditional accuracy metrics, analyzing the structural and semantic consistency of generated text offers a new dimension for evaluating the impact of quantization.
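
The difference between absmax scaling and MSE clipping mentioned above can be illustrated on a single weight group. This is a simplified sketch rather than the GPTQ procedure used in the paper: it quantizes one group of 128 values with round-to-nearest onto a symmetric 4-bit grid of [-7, 7] (an assumption made for simplicity) and grid-searches a shrink ratio for the scale. The function names are hypothetical.

```python
import numpy as np

QMAX = 7  # symmetric 4-bit grid [-7, 7]; assumed here for simplicity

def fake_quantize(w: np.ndarray, scale: float) -> np.ndarray:
    """Quantize to the integer grid and map back to floats."""
    return np.clip(np.round(w / scale), -QMAX, QMAX) * scale

def absmax_scale(w: np.ndarray) -> float:
    """Scale chosen so that the largest magnitude is exactly representable."""
    m = float(np.abs(w).max())
    return m / QMAX if m > 0 else 1.0

def mse_clipped_scale(w: np.ndarray, n_grid: int = 80, min_ratio: float = 0.2) -> float:
    """Search shrunken scales and keep the one with the lowest squared error."""
    base = absmax_scale(w)
    best_scale = base
    best_err = float(np.mean((fake_quantize(w, base) - w) ** 2))
    for ratio in np.linspace(min_ratio, 1.0, n_grid, endpoint=False):
        scale = float(ratio) * base
        err = float(np.mean((fake_quantize(w, scale) - w) ** 2))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

rng = np.random.default_rng(0)
group = rng.standard_normal(128)   # one group of 128 weights
group[0] = 8.0                     # a single outlier stretches the absmax range

err_absmax = float(np.mean((fake_quantize(group, absmax_scale(group)) - group) ** 2))
err_mse = float(np.mean((fake_quantize(group, mse_clipped_scale(group)) - group) ** 2))
print(f"absmax MSE: {err_absmax:.4f}, clipped MSE: {err_mse:.4f}")
```

Because the search starts from the absmax scale, the clipped result can never be worse in squared error on the group itself; how much it helps depends on how heavy the outliers are.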

Limitations & Future Work

  • Only the Llama-3.1 family was evaluated; findings might differ for other architectures (e.g., Qwen, Mistral).
  • The FP8 conclusions apply only to GPUs with native FP8 support (Hopper in this study); older architectures such as Ampere cannot use this format.
  • More extreme low-bit (2-bit/3-bit) quantization was not evaluated.
  • Inference performance tests were based on a specific version of vLLM (0.6.4); framework optimizations might impact the conclusions.
  • Fine-tuning scenarios for quantized models were not considered.

Comparison with Related Work

  • vs Lee et al. (2024b): The closest previous study, which claimed that W8A8-INT was significantly inferior to FP8. This paper narrows the gap from 10+ points to 0.7 points through correct hyperparameter tuning.
  • vs AWQ (Lin et al., 2024): AWQ is widely recommended in academic papers, but this study shows that GPTQ + MSE clipping + high-quality calibration data is superior on real-world tasks.
  • vs SmoothQuant (Xiao et al., 2022): SmoothQuant addresses the problem of outliers in activation quantization, and this paper confirms that it is necessary for large models (e.g., 70B) in W8A8-INT.
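
The outlier handling referred to above can be sketched with SmoothQuant's per-channel smoothing factor, s_j = max|X_j|^α / max|W_j|^(1-α), which moves part of the activation range into the weights without changing the layer output. The NumPy example below is illustrative only: the `smooth` function, the synthetic data, and α = 0.5 are assumptions, not the paper's configuration.

```python
import numpy as np

def smooth(x: np.ndarray, w: np.ndarray, alpha: float = 0.5, eps: float = 1e-8):
    """Rescale input channels so activations get easier to quantize.

    x: (tokens, in_features) calibration activations
    w: (out_features, in_features) weights of the following linear layer
    """
    act_max = np.maximum(np.abs(x).max(axis=0), eps)  # per input channel
    w_max = np.maximum(np.abs(w).max(axis=0), eps)    # per input channel
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    # Dividing activations and multiplying weights by s leaves x @ w.T unchanged.
    return x / s, w * s, s

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))
x[:, 3] *= 50.0                      # one channel with large activation outliers
w = rng.standard_normal((8, 32))

x_s, w_s, s = smooth(x, w)
print("output preserved:", np.allclose(x_s @ w_s.T, x @ w.T))
print("activation absmax before/after:", np.abs(x).max(), np.abs(x_s).max())
```

In a real model the division by `s` is typically folded into the preceding layer so that no extra work is done at inference time; here it is applied explicitly to keep the sketch self-contained.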

Rating

  • Novelty: ⭐⭐⭐ Not a methodological innovation, but rather an empirical contribution with unprecedented comprehensiveness.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 500k evaluations, spanning academic, real-world, textual similarity, and inference performance dimensions across three GPU architectures.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, explicit findings, and practical deployment recommendations.
  • Value: ⭐⭐⭐⭐⭐ Direct guiding value for industrial quantization deployment decisions.