# Task-Stratified Knowledge Scaling Laws for Post-Training Quantized LLMs

Conference: ACL 2026 · arXiv: 2508.18609 · Code: None · Area: Model Compression / Quantization · Keywords: Post-training quantization, scaling laws, knowledge stratification, memorization-application-reasoning, fine-grained quantization factors
## TL;DR
This paper establishes the first task-stratified knowledge scaling laws for post-training quantization (PTQ), decomposing LLM capabilities into three levels (memorization, application, and reasoning) and jointly modeling four factors: model size, bit-width, group size, and calibration set size. The laws are validated across 293 PTQ configurations and reveal differentiated patterns: reasoning is the most sensitive to precision, application benefits most from model scale, and memorization is the most sensitive to calibration data.
## Background & Motivation
Background: PTQ has become the dominant strategy for LLM compression (~70% of quantization research focuses on PTQ). Existing scaling laws (e.g., Chinchilla) primarily characterize full-precision models, and the few quantization-aware scaling laws consider only model size and bit-width.
Limitations of Prior Work: (1) The systematic effects of fine-grained PTQ parameters such as group size and calibration set size are ignored; (2) aggregating performance across all tasks obscures the differentiated impact of quantization on memorization, application, and reasoning capabilities.
Key Challenge: Existing scaling laws cannot guide practical decisions such as "how to configure group size and calibration set size under low-bit quantization to preserve specific capabilities."
Goal: Establish a unified four-factor power-law framework and fit separate scaling laws for each of the three knowledge capability levels.
Key Insight: Following Bloom's Taxonomy, LLM capabilities are stratified into memorization (exact factual recall), application (flexible knowledge use), and reasoning (multi-step logic), with 14 benchmarks spanning the three levels.
Core Idea: \(-\ln(\text{Acc}_{\text{adj}}) = A \cdot N^{\alpha} \cdot (\log_2 B)^{\beta} \cdot (\log_2 C_b)^{\gamma} \cdot G^{\delta}\), where the exponents \(\alpha, \beta, \gamma, \delta\) are task-level-specific, quantifying the sensitivity of each capability to each factor.
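To make the functional form concrete, here is a minimal sketch of evaluating the law for one configuration. This is not the authors' code: the coefficient \(A\) is not reported in this summary, so an arbitrary value is used, and the exponents are taken from the General row of the results table below.

```python
import numpy as np

def predicted_loss(N, B, C_b, G, A, alpha, beta, gamma, delta):
    """Evaluate -ln(Acc_adj) under the four-factor power law.

    N: model size (units are absorbed into A), B: bit-width,
    C_b: calibration set size, G: group size.
    """
    return A * N**alpha * np.log2(B)**beta * np.log2(C_b)**gamma * G**delta

def predicted_acc_adj(*args, **kwargs):
    """Map the power-law 'loss' back to baseline-adjusted accuracy."""
    return np.exp(-predicted_loss(*args, **kwargs))

# Illustrative only: A = 1.0 is a placeholder; exponents are the
# 'General' row of the results table below.
acc = predicted_acc_adj(N=8, B=4, C_b=128, G=128,
                        A=1.0, alpha=-0.359, beta=-1.067,
                        gamma=-0.032, delta=0.073)
print(f"predicted Acc_adj: {acc:.3f}")
```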
## Method

### Overall Architecture
A systematic PTQ configuration sweep is conducted on the Qwen3 series (0.6B–14B) and Llama-3 series, covering bit-widths of 3/4/8, group sizes of 32/64/128/1024, and calibration set sizes of 8/32/128/1024, yielding 293 configurations in total. GPTQ is adopted as the unified quantization method. After evaluation on 14 benchmarks, OLS regression is applied to fit the log-transformed power law.
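A minimal sketch of this fitting step, assuming each configuration is recorded as a `(N, B, C_b, G, acc_adj)` tuple. The paper releases no code, so the names and data structure here are hypothetical:

```python
import numpy as np

def fit_power_law(configs):
    """OLS fit of the log-transformed law:
    ln(-ln(Acc_adj)) = ln A + a*ln N + b*ln(log2 B) + g*ln(log2 C_b) + d*ln G
    """
    N, B, C_b, G, acc = map(np.asarray, zip(*configs))
    y = np.log(-np.log(acc))        # map accuracy to unbounded "loss" space
    X = np.column_stack([
        np.ones_like(y),            # intercept -> ln A
        np.log(N),
        np.log(np.log2(B)),
        np.log(np.log2(C_b)),
        np.log(G),
    ])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    ln_A, alpha, beta, gamma, delta = coef
    return np.exp(ln_A), alpha, beta, gamma, delta
```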
### Key Designs
- Four-Factor Unified Power-Law Framework:
  - Function: Jointly models the combined effect of model size \(N\), bit-width \(B\), calibration set size \(C_b\), and group size \(G\) on quantized performance.
  - Mechanism: Logarithms are applied to \(B\) and \(C_b\) to model diminishing marginal returns; \(-\ln(\text{Acc}_{\text{adj}})\) maps bounded accuracy to an unbounded "loss" space; OLS is applied to the log-transformed equation. The baseline-adjusted accuracy \(\text{Acc}_{\text{adj}} = \frac{\text{Acc} - \text{Acc}_{\text{random}}}{1 - \text{Acc}_{\text{random}}}\) removes random-chance baseline differences across tasks.
  - Design Motivation: The exponents are interpretable as elasticity coefficients measuring the sensitivity of performance to relative changes in each factor; the log transformation restores the monotone convexity required for fitting bounded accuracy metrics.
- Task-Stratified Knowledge Taxonomy:
  - Function: Disentangles the differentiated impact of quantization on cognitive capabilities at distinct levels.
  - Mechanism: L1 Memorization (exact factual recall: TriviaQA/NQ/LAMA, etc.), L2 Application (flexible knowledge use: MMLU/HellaSwag, etc.), and L3 Reasoning (multi-step logic: GSM8K/ARC-C, etc.) each receive an independently fitted scaling law.
  - Design Motivation: Fitting only aggregate performance conceals critical differences: reasoning may have already collapsed while application remains intact.
- Critical Role of Fine-Grained Factors in Low-Bit Regimes:
  - Function: Demonstrates that at 2–3-bit quantization, group size and calibration set size are no longer optional parameters but necessary conditions for preventing capability collapse.
  - Mechanism: Ablations show that \(f(N,B)\) alone achieves \(R^2 = 0.91\), while adding \(G\) raises it to 0.95, i.e., group size accounts for roughly 4% of additional variance, concentrated precisely in the low-bit regime (see the ablation sketch after this list).
  - Design Motivation: Practitioners often use default group sizes and calibration sets in low-bit quantization, which can lead to unnecessary performance collapse.
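A sketch of that ablation, reusing the design matrix from the fitting sketch above, with adjusted R² computed in the standard way. Again a hypothetical reconstruction, not the authors' code:

```python
import numpy as np

def adjusted_r2(y, y_hat, n_params):
    """Adjusted R^2, penalizing the number of regressors (incl. intercept)."""
    n = len(y)
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
    return 1 - (1 - r2) * (n - 1) / (n - n_params)

def compare_factor_subsets(X, y):
    """X columns: [1, ln N, ln(log2 B), ln(log2 C_b), ln G];
    y = ln(-ln(Acc_adj))."""
    for cols, label in [((0, 1, 2), "two-factor (N, B)"),
                        ((0, 1, 2, 3, 4), "four-factor (N, B, C_b, G)")]:
        Xs = X[:, cols]
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        print(f"{label}: Adj R^2 = {adjusted_r2(y, Xs @ coef, len(cols)):.4f}")
```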
### Loss & Training
No additional training is involved: GPTQ minimizes the layer-wise quantization reconstruction error using second-order (Hessian) information from the calibration set.
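For reference, the standard layer-wise objective GPTQ solves (the textbook formulation from the GPTQ paper, restated here) is

```latex
\arg\min_{\widehat{W}} \;\bigl\lVert W X - \widehat{W} X \bigr\rVert_2^2,
\qquad H = 2 X X^{\top},
```

where \(W\) is a layer's weight matrix, \(X\) the calibration inputs, and \(\widehat{W}\) the quantized weights. The Hessian \(H\) of this quadratic objective guides the quantization order and the compensation of not-yet-quantized weights; note that the calibration set size \(C_b\) enters the scaling law precisely through \(X\).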
## Key Experimental Results

### Main Results
Scaling Exponents Across Knowledge Capability Levels
| Capability Level | α (N) | β (B) | γ (C_b) | δ (G) | Adj. R² |
|---|---|---|---|---|---|
| General | -0.359 | -1.067 | -0.032 | 0.073 | 0.9475 |
| L1 Memorization | -0.315 | -0.964 | -0.040 | 0.064 | 0.9350 |
| L2 Application | -0.400 | -1.100 | -0.030 | 0.075 | 0.9500 |
| L3 Reasoning | -0.320 | -1.200 | -0.025 | 0.085 | 0.9300 |
### Key Findings
- Reasoning precision bottleneck: \(\beta_{\text{KR}} = -1.200\) (largest absolute value), indicating that reasoning is the most sensitive capability to bit-width.
- Application scale responsiveness: \(\alpha_{\text{KA}} = -0.400\) (largest absolute value), indicating that application capability benefits most from increased model scale.
- Memorization calibration sensitivity: \(\gamma_{\text{KM}} = -0.040\) (largest absolute value), indicating that exact factual recall is the most sensitive capability to calibration data volume.
- The four-factor model improves adjusted R² by 3.5% over the two-factor baseline \((N, B)\), and the fitted law extrapolates successfully to Qwen3-32B, outside the 0.6B–14B fitting range.
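Because the law is multiplicative, the intercept \(A\) cancels when comparing two configurations, so the fitted exponents alone can answer trade-off questions. A rough illustration using the General row of the table (my arithmetic, not a result from the paper):

```python
import numpy as np

def loss_ratio(old, new, exps):
    """Ratio of -ln(Acc_adj) between two configs; the intercept A cancels.
    old/new: dicts with keys N, B, C_b, G; exps: (alpha, beta, gamma, delta)."""
    a, b, g, d = exps
    return ((new["N"] / old["N"]) ** a
            * (np.log2(new["B"]) / np.log2(old["B"])) ** b
            * (np.log2(new["C_b"]) / np.log2(old["C_b"])) ** g
            * (new["G"] / old["G"]) ** d)

general = (-0.359, -1.067, -0.032, 0.073)
base = dict(N=8, B=8, C_b=128, G=128)
# Dropping 8-bit -> 4-bit multiplies the loss term by (2/3)^-1.067 ~ 1.54:
print(loss_ratio(base, {**base, "B": 4}, general))
# Halving group size 128 -> 64 trims it by (1/2)^0.073 ~ 0.95:
print(loss_ratio(base, {**base, "G": 64}, general))
```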
## Highlights & Insights
- The task-stratified scaling law framework is highly practical—it guides practitioners toward informed quantization configuration decisions under resource constraints.
- The critical role of fine-grained factors in low-bit regimes is an important finding: default configurations at 3-bit may cause capability collapse.
- Consistent behavior across architectures (Qwen3 → Llama-3) confirms the generalizability of the derived laws.
## Limitations & Future Work
- Only GPTQ is employed as the quantization method.
- 2-bit data are excluded from fitting due to performance collapse.
- QAT and mixed-precision scenarios are not considered.
- Generative tasks are not evaluated.
## Related Work & Insights
- vs. Chinchilla Laws: Chinchilla targets full-precision models; this work extends the framework to quantized settings and incorporates group size and calibration set as additional factors.
- vs. QiD Laws: QiD models only aggregate performance degradation; this work stratifies the analysis across three distinct knowledge capability levels.
## Rating
- Novelty: ⭐⭐⭐⭐ First task-stratified PTQ scaling laws; unified four-factor framework.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 293 configurations + 14 benchmarks + cross-architecture validation + extrapolation testing.
- Writing Quality: ⭐⭐⭐⭐ Clear mathematical derivations; convincing figures and tables.
- Value: ⭐⭐⭐⭐⭐ Directly actionable for LLM quantization practice.