TurboBoA: Faster and Exact Attention-aware Quantization without Backpropagation¶
- Conference: ICLR 2026
- arXiv: 2602.04929
- Code: GitHub
- Area: Model Compression / Quantization / LLM
- Keywords: post-training quantization, attention-aware, backpropagation-free, low-bit quantization, LLM compression
TL;DR¶
TurboBoA proposes a backpropagation-free post-training quantization method for LLMs that achieves over 3× speedup over BoA while retaining its accuracy advantages, through three innovations: joint multi-output-channel quantization, preceding-layer error compensation, and adaptive grid selection.
Background & Motivation¶
The rapid growth of LLM scale has made post-training quantization (PTQ) a critical technique for reducing memory and computational costs. Backpropagation-free methods guided by Hessian-based error compensation (e.g., GPTQ) have attracted wide attention due to their efficiency.
A fundamental trade-off exists between two families of methods:

- GPTQ: Assumes inter-layer independence, leading to significant accuracy degradation at low bit-widths (e.g., INT2).
- BoA: Improves the Hessian approximation by exploiting cross-layer dependencies within attention modules, substantially boosting accuracy, but requires sequential per-output-channel quantization, making it far less efficient than GPTQ.
Core Problem: Can BoA's accuracy be preserved while dramatically improving its efficiency?
Method¶
Overall Architecture¶
TurboBoA introduces three key innovations:

1. Joint multi-output-channel quantization (Feature 1): eliminates the sequential per-channel bottleneck.
2. Preceding-layer quantization error compensation (Feature 2): mitigates error accumulation across layers.
3. Adaptive grid selection + coordinate descent refinement (Feature 3): keeps the quantization grid aligned with the updated weights.
Key Design 1: Joint Multi-Output-Channel Quantization¶
BoA quantizes output channels one by one (e.g., 128 sequential operations). TurboBoA simultaneously quantizes \(N\) output channels.
The error compensation problem is formulated over a block of output channels: the \(N\) selected channels are quantized together, and the remaining, not-yet-quantized channels are updated to absorb the resulting error.

Proposition 3.1 (Closed-form solution): the optimal compensation update is available in closed form, expressed in terms of the block \(B = \{0, \ldots, N-1\}\) and the Cholesky factor \(\mathbf{U}_{out} = \text{Chol}(\mathbf{H}_{out}^{-1})^T\).
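As a rough illustration (an assumed form, not necessarily the paper's exact statement): take the attention-aware loss in its trace form \(\operatorname{tr}(\mathbf{H}_{out}\,\Delta\mathbf{W}\,\mathbf{H}_{in}\,\Delta\mathbf{W}^T)\), let \(\mathbf{E}_B = \mathbf{W}_B - \hat{\mathbf{W}}_B\) be the error of the \(N\) jointly quantized rows, and minimize over the remaining rows \(B^c\). The resulting GPTQ-style block update is

\[
\Delta\mathbf{W}_{B^c}
= \arg\min_{\Delta}\;
\operatorname{tr}\!\left(\mathbf{H}_{out}
\begin{bmatrix}\mathbf{E}_{B}\\ \Delta\end{bmatrix}
\mathbf{H}_{in}
\begin{bmatrix}\mathbf{E}_{B}\\ \Delta\end{bmatrix}^{T}\right)
= -\,[\mathbf{H}_{out}]_{B^c B^c}^{-1}\,[\mathbf{H}_{out}]_{B^c B}\,\mathbf{E}_{B},
\]

where \(\mathbf{H}_{in}\) drops out of the minimizer. In GPTQ-style implementations such block quantities are typically read off a single Cholesky factorization of \(\mathbf{H}_{out}^{-1}\), which is where \(\mathbf{U}_{out}\) enters.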
At \(N=16\), this reduces 128 sequential quantization steps to 8 block steps, yielding over a 3× speedup over BoA with negligible accuracy loss.
Key Design 2: Preceding-Layer Quantization Error Compensation¶
Quantization errors in preceding layers propagate to subsequent layers, a factor BoA does not account for. TurboBoA explicitly models the input deviation \(\Delta\mathbf{X} = \mathbf{X} - \tilde{\mathbf{X}}\):
Proposition 3.2 (Closed-form solution with preceding-layer error): the compensated update remains available in closed form, with the preceding-layer error entering through the correlation term \(\mathbf{R} = \Delta\mathbf{X}\mathbf{X}^T\). In contrast to GPTAQ's vector-level optimization, TurboBoA handles the general dense \(\mathbf{H}_{out}\).
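A minimal sketch of where \(\mathbf{R}\) enters, assuming a GPTAQ-style asymmetric objective in which the quantized layer sees the quantized input \(\tilde{\mathbf{X}}\) while the reference output uses the full-precision input \(\mathbf{X}\) (the \(\mathbf{H}_{out}\) weighting is omitted here for brevity):

\[
\big\|\hat{\mathbf{W}}\tilde{\mathbf{X}} - \mathbf{W}\mathbf{X}\big\|_F^2
= \operatorname{tr}\!\big(\Delta\mathbf{W}\,\tilde{\mathbf{X}}\tilde{\mathbf{X}}^{T}\Delta\mathbf{W}^{T}\big)
- 2\operatorname{tr}\!\big(\Delta\mathbf{W}\,\tilde{\mathbf{X}}\Delta\mathbf{X}^{T}\mathbf{W}^{T}\big) + \text{const},
\qquad \Delta\mathbf{W} = \hat{\mathbf{W}} - \mathbf{W}.
\]

The extra linear term is what carries the input-error correlation (note \(\tilde{\mathbf{X}}\Delta\mathbf{X}^{T} = \mathbf{R}^{T} - \Delta\mathbf{X}\Delta\mathbf{X}^{T}\)); setting \(\Delta\mathbf{X} = \mathbf{0}\) recovers the standard GPTQ/BoA objective.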
Key Design 3: Adaptive Grid + Coordinate Descent Refinement¶
- Adaptive grid: The quantization grid is computed on-the-fly before each quantization step to ensure alignment with updated weights.
- Coordinate descent refinement: The integer weights \(\mathbf{W}_{int}\) are frozen, and only the scales are optimized by coordinate descent.

Proposition 3.3 (CD update rule): each scale admits a closed-form update while the integer weights and the remaining scales are held fixed.
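As a concrete, hypothetical sketch of this refinement step, the snippet below runs coordinate descent over per-group scales with frozen integer weights, under simplifying assumptions: a plain \(\mathbf{H}_{in} = \mathbf{X}\mathbf{X}^T\) proxy loss, no \(\mathbf{H}_{out}\) coupling, and one scale per contiguous group of input columns. Names such as `refine_scales` are illustrative, not the paper's API.

```python
import numpy as np

def refine_scales(W, W_int, H_in, group_size=128, iters=3):
    """Coordinate-descent refinement of per-group scales with frozen integer weights.

    W          : (d_out, d_in) original float weights
    W_int      : (d_out, d_in) frozen integer weights
    H_in       : (d_in, d_in)  input Hessian proxy, e.g. X @ X.T from calibration data
    group_size : number of input columns sharing one scale

    Per row, minimizes (w_hat - w) H_in (w_hat - w)^T over the group scales,
    where w_hat on group g equals scales[g] * the frozen integer weights of group g.
    Groups couple through H_in, so we cycle over them coordinate-descent style.
    """
    d_out, d_in = W.shape
    n_groups = d_in // group_size
    scales = np.ones((d_out, n_groups))
    for i in range(d_out):
        w, q = W[i], W_int[i].astype(float)
        for _ in range(iters):
            for g in range(n_groups):
                sl = slice(g * group_size, (g + 1) * group_size)
                # Reconstruction from all *other* groups at their current scales.
                w_hat = np.repeat(scales[i], group_size) * q
                w_hat[sl] = 0.0
                r = w - w_hat                          # residual this group must explain
                num = q[sl] @ H_in[sl, :] @ r          # <q_g, r> under H_in
                den = q[sl] @ H_in[sl, sl] @ q[sl]     # <q_g, q_g> under H_in
                if den > 0:
                    scales[i, g] = num / den           # closed-form 1-D minimizer
    return scales
```

Because the groups of a row interact through the off-diagonal blocks of \(\mathbf{H}_{in}\), cycling over them a few times is what coordinate descent buys over a single independent least-squares fit per group.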
Loss & Training¶
The standard attention reconstruction error is employed, based on the Kronecker-structured Hessian \(\mathbf{H} = \mathbf{H}_{in} \otimes \mathbf{H}_{out}\).
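For reference (a standard identity, assuming column-stacking \(\mathrm{vec}(\cdot)\) and \(\Delta\mathbf{W}\in\mathbb{R}^{d_{out}\times d_{in}}\)), the Kronecker structure collapses the quadratic loss into a trace form in which cross-output-channel coupling enters only through \(\mathbf{H}_{out}\):

\[
\mathrm{vec}(\Delta\mathbf{W})^{T}\,(\mathbf{H}_{in}\otimes\mathbf{H}_{out})\,\mathrm{vec}(\Delta\mathbf{W})
= \operatorname{tr}\!\big(\mathbf{H}_{out}\,\Delta\mathbf{W}\,\mathbf{H}_{in}\,\Delta\mathbf{W}^{T}\big).
\]

This is the form used in the sketches above, and it makes explicit why compensating across output channels requires a dense \(\mathbf{H}_{out}\) rather than GPTQ's implicit independent-row assumption (effectively \(\mathbf{H}_{out}=\mathbf{I}\)).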
Key Experimental Results¶
Main Results: INT2 Quantization Speed¶
| Method | N | Quantization Time (Llama3-8B) | Wiki2 PPL ↓ |
|---|---|---|---|
| BoA | 1 | 94.75 min | 15.20 |
| TurboBoA | 4 | 39.46 min | 15.27 |
| TurboBoA | 8 | 30.55 min | 15.30 |
| TurboBoA | 16 | 25.30 min | 15.41 |
| TurboBoA | 32 | 22.95 min | 15.22 |
For 70B models, BoA requires 17 hours while TurboBoA (\(N=16\)) requires only 5.6 hours, saving approximately 11 hours.
Ablation Study: Three Features¶
| Method | F2 | F3 | Llama3-8B Wiki2↓ | C4↓ |
|---|---|---|---|---|
| BoA | - | - | 15.20 | 36.95 |
| TurboBoA (F1 only) | ✗ | ✗ | 15.41 | — |
| TurboBoA (F1+F2) | ✓ | ✗ | Improved | — |
| TurboBoA (All) | ✓ | ✓ | Best | Best |
SOTA Results¶
When combined with outlier suppression techniques such as QuaRot:

- Weight-only quantization: comprehensively outperforms GPTQ, BoA, and other methods at INT2.
- Weight-activation quantization: also achieves state-of-the-art performance.
Key Findings¶
- Accuracy degradation remains negligible as \(N\) increases to 64, indicating that the remaining output channels provide sufficient error compensation capacity.
- Speedup gains diminish for \(N > 16\), making \(N=16\) the optimal trade-off point.
- Preceding-layer error compensation and grid refinement each contribute independently and complementarily.
Highlights & Insights¶
- All three Propositions provide closed-form solutions, resulting in theoretically elegant formulations.
- Over 3× speedup is achieved with accuracy on par with or better than BoA.
- The method is not tied to a specific Hessian formulation and can directly accommodate more advanced Hessian estimates.
- Over 11 hours of quantization time are saved for 70B models, demonstrating substantial practical value.
Limitations & Future Work¶
- Validation is limited to the Llama model family; other architectures (e.g., Mixtral, Qwen) are not evaluated.
- Although the choice of \(N\) is empirically robust, a theoretical error bound analysis is lacking.
- The stabilization coefficient \(\alpha\) requires manual tuning from the set \(\{0.05, 0.125, 0.25\}\).
- The method focuses exclusively on attention layer quantization; FFN layers rely on standard GPTQ.
Related Work & Insights¶
- Backpropagation-free quantization: GPTQ (Frantar et al., 2023), BoA (Kim et al., 2025), GPTAQ (Li et al., 2025)
- Transformation-based methods: SmoothQuant (Xiao et al., 2023), QuaRot (Ashkboos et al., 2024)
- Early PTQ works: AdaRound (Nagel et al., 2020), BRECQ (Li et al., 2021)
Rating¶
- Novelty: ⭐⭐⭐⭐ — The closed-form solution for joint quantization is the core contribution.
- Theoretical Depth: ⭐⭐⭐⭐⭐ — Three Propositions are complete and rigorous.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-scale model evaluation with comprehensive ablations.
- Value: ⭐⭐⭐⭐⭐ — Directly addresses the efficiency bottleneck of BoA.
- Writing Quality: ⭐⭐⭐⭐ — Notation is clear and mathematical derivations are detailed.