BoA: Attention-aware Post-training Quantization without Backpropagation
Conference: ICML 2025
arXiv: 2406.13474
Code: https://github.com/SamsungLabs/BoA
Area: Model Compression
Keywords: Post-training quantization, attention-aware, Hessian optimization, LLM quantization, cross-layer dependency
TL;DR
This paper proposes BoA, the first backpropagation-free post-training quantization (PTQ) algorithm that accounts for cross-layer dependencies. By constructing an attention-aware Hessian matrix, it captures inter-layer interactions within the attention module and significantly outperforms existing PTQ methods at ultra-low bit-widths (INT2).
Background & Motivation
Background: PTQ is a key technology for LLM deployment. Methods like GPTQ use Hessian information for layer-wise optimization of quantized weights but assume independence between layers.
Limitations of Prior Work: Layer-wise independent optimization ignores the interactions between the Q/K/V/O projection layers in the attention module: quantization error in one layer propagates through attention and changes the optimal quantization of the other layers.
Key Challenge: Accounting for cross-layer dependencies requires a much larger Hessian matrix, resulting in massive computational and memory overhead.
Goal: Efficiently incorporate the cross-layer dependencies of the attention module within a backpropagation-free framework.
Key Insight: Extend the Hessian from layer-wise reconstruction error to attention-module-level reconstruction error.
Core Idea: Attention-aware Hessian + Hessian relaxation + head-wise joint quantization, balancing accuracy and efficiency.
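The limitation described above can be illustrated with a toy single-head attention in NumPy. This is a minimal sketch, not the paper's code: the shapes, the random weights, and the helper names (`attention_head`, `layerwise_error`, `attention_error`) are all assumptions made for illustration. It compares the layer-wise error \(\|\Delta W_Q X\|_F^2\) with the error measured at the attention output for the same perturbation of the Q projection.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilise the exponent
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """Single-head attention output for inputs X of shape (d_model, n_tokens)."""
    Q, K, V = Wq @ X, Wk @ X, Wv @ X           # each (d_head, n_tokens)
    scores = (Q.T @ K) / np.sqrt(Q.shape[0])   # (n_tokens, n_tokens)
    return softmax(scores, axis=-1) @ V.T      # (n_tokens, d_head)

def layerwise_error(dW, X):
    """Layer-wise objective: squared Frobenius norm of dW @ X."""
    return float(np.linalg.norm(dW @ X, "fro") ** 2)

def attention_error(X, Wq, Wk, Wv, dWq):
    """Error at the attention output when only the Q projection is perturbed."""
    ref = attention_head(X, Wq, Wk, Wv)
    out = attention_head(X, Wq + dWq, Wk, Wv)
    return float(np.linalg.norm(out - ref, "fro") ** 2)

# Hypothetical toy sizes and random data, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 16, 4, 32
X = rng.standard_normal((d_model, n_tokens))
Wq, Wk, Wv = (rng.standard_normal((d_head, d_model)) / np.sqrt(d_model)
              for _ in range(3))
dWq = 0.05 * rng.standard_normal((d_head, d_model))

print("layer-wise error:      ", layerwise_error(dWq, X))
print("attention-output error:", attention_error(X, Wq, Wk, Wv, dWq))
```

The layer-wise error depends only on the perturbation and the inputs, while the attention-output error also depends on the K and V projections. In the extreme case where `Wk` is all zeros, the same perturbation of `Wq` has no effect on the output at all, which a layer-wise objective cannot see.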
Method
Overall Architecture
- Construct an attention-module-level reconstruction error objective (rather than layer-wise).
- Derive the attention-aware Hessian matrix.
- Reduce overhead through Hessian relaxation and efficient matrix inversion.
- Coordinate joint quantization of Q/K/V projections head-by-head.
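To make the first two steps concrete, the sketch below shows how the Hessian changes once the error is measured after a downstream linear map rather than directly at the layer output. It is a simplified illustration under stated assumptions, not the paper's derivation: `M` is a hypothetical fixed matrix standing in for whatever follows the quantized layer, and the real attention module is nonlinear because of the softmax. For the objective \(\|M \, \Delta W X\|_F^2\), the Hessian with respect to the column-major vectorisation of \(\Delta W\) is \(2\,(XX^\top) \otimes (M^\top M)\); the layer-wise case is recovered with \(M = I\).

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n, d_down = 3, 4, 10, 5
X = rng.standard_normal((d_in, n))         # calibration inputs
M = rng.standard_normal((d_down, d_out))   # hypothetical downstream linear map
dW = rng.standard_normal((d_out, d_in))    # weight perturbation

def vec(A):
    """Column-major vectorisation, matching the Kronecker identity used below."""
    return A.reshape(-1, order="F")

# Hessian of ||dW X||_F^2 (layer-wise) and of ||M dW X||_F^2 (downstream-aware).
H_layer = 2.0 * np.kron(X @ X.T, np.eye(d_out))
H_aware = 2.0 * np.kron(X @ X.T, M.T @ M)

w = vec(dW)
err_layer = np.linalg.norm(dW @ X, "fro") ** 2
err_aware = np.linalg.norm(M @ dW @ X, "fro") ** 2
quad_layer = 0.5 * w @ H_layer @ w
quad_aware = 0.5 * w @ H_aware @ w

print(np.isclose(quad_layer, err_layer), np.isclose(quad_aware, err_aware))
```

In the layer-wise case the second Kronecker factor is the identity, so every output row of the weight matrix is treated independently. A non-trivial second factor couples the rows, which is the kind of extra structure an attention-aware Hessian has to carry.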
Key Designs
- Attention-aware Hessian:
    - Function: Extends the quantization objective from \(\|\Delta W \cdot X\|_F^2\) to the overall reconstruction error of the attention module output.
    - Mechanism: The Hessian contains cross-layer information among the Q/K/V layers, capturing how quantization error in the Q layer propagates to the output through the attention weights.
    - Design Motivation: The softmax in attention tightly couples Q/K/V, making layer-wise independent quantization sub-optimal.
- Hessian Relaxation and Efficient Computation:
    - Function: Reduces computation through block-diagonal approximation and Cholesky decomposition.
    - Mechanism: Preserves cross-layer interactions within the same attention head while ignoring interactions between different heads.
    - Design Motivation: Q/K/V interactions are strongest within the same head, while inter-head interactions are relatively weak.
- Head-wise Joint Quantization:
    - Function: Quantizes the Q/K/V projections of an attention head simultaneously.
    - Mechanism: Leverages the block structure of the attention-aware Hessian for parallel optimization.
    - Design Motivation: Makes better use of inter-layer dependency information than serial, layer-wise quantization.
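The block-diagonal relaxation described above can be sketched as follows. This is an illustrative assumption about the layout, not the paper's implementation: it supposes the Hessian is indexed so that each head owns a contiguous block, and the helper names `relax_to_head_blocks` and `invert_blockwise` are hypothetical. The point is that dropping inter-head blocks lets each head be inverted separately with a small Cholesky factorisation.

```python
import numpy as np

def relax_to_head_blocks(H, n_heads):
    """Keep only the diagonal blocks of H, one per head; zero everything else."""
    dim = H.shape[0]
    assert H.shape == (dim, dim) and dim % n_heads == 0
    block = dim // n_heads
    H_relaxed = np.zeros_like(H)
    for h in range(n_heads):
        s = slice(h * block, (h + 1) * block)
        H_relaxed[s, s] = H[s, s]
    return H_relaxed

def invert_blockwise(H_relaxed, n_heads):
    """Invert each head block separately via a Cholesky factorisation."""
    block = H_relaxed.shape[0] // n_heads
    H_inv = np.zeros_like(H_relaxed)
    for h in range(n_heads):
        s = slice(h * block, (h + 1) * block)
        L = np.linalg.cholesky(H_relaxed[s, s])   # block = L @ L.T
        L_inv = np.linalg.solve(L, np.eye(block))
        H_inv[s, s] = L_inv.T @ L_inv             # inverse of the block
    return H_inv

# Toy symmetric positive-definite Hessian with two hypothetical heads.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 32))
H = A @ A.T + 1e-2 * np.eye(8)
H_relaxed = relax_to_head_blocks(H, n_heads=2)
H_inv = invert_blockwise(H_relaxed, n_heads=2)
print(np.allclose(H_inv @ H_relaxed, np.eye(8)))
```

With `n_heads` blocks of size `b`, storage drops from `(n_heads * b)**2` to `n_heads * b**2` entries and each inversion works on a `b × b` matrix, which is why the relaxed variant is far cheaper than the full one.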
Loss & Training
- Backpropagation-free: builds on the Hessian-based Optimal Brain Surgeon (OBS) framework, as GPTQ does.
- Compatible with activation outlier suppression methods like SmoothQuant/QuaRot.
- Computational overhead is comparable to GPTQ.
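For readers unfamiliar with the OBS framework, the sketch below shows the generic layer-wise procedure that GPTQ-style methods use: quantize one input column at a time and push the resulting error onto the columns not yet quantized, using the inverse Hessian. It is a simplified illustration and not BoA itself; the uniform symmetric grid, the fixed `scale`, the damping factor, and the function names are assumptions. An attention-aware variant would swap in a different Hessian.

```python
import numpy as np

def quantize_rtn(w, scale, bits):
    """Round-to-nearest onto a symmetric uniform grid."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), qmin, qmax) * scale

def obs_quantize(W, H, scale, bits, damp=0.01):
    """Quantize W (d_out, d_in) column by column with OBS error compensation."""
    W = W.astype(np.float64).copy()
    d_in = W.shape[1]
    H = H + damp * np.mean(np.diag(H)) * np.eye(d_in)  # damping for stability
    H_inv = np.linalg.inv(H)
    Q = np.zeros_like(W)
    for i in range(d_in):
        Q[:, i] = quantize_rtn(W[:, i], scale, bits)
        err = (W[:, i] - Q[:, i]) / H_inv[i, i]
        # Push the error onto the columns that are not yet quantized.
        W[:, i + 1:] -= np.outer(err, H_inv[i, i + 1:])
        # Remove column i from the inverse Hessian (rank-one update).
        H_inv = H_inv - np.outer(H_inv[:, i], H_inv[i, :]) / H_inv[i, i]
    return Q

# Toy data with two correlated input features (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 256))
X[1] = 0.9 * X[0] + 0.1 * X[1]
W = rng.standard_normal((4, 8))
H = 2.0 * X @ X.T                      # layer-wise Hessian of ||dW X||_F^2

Q_obs = obs_quantize(W, H, scale=0.25, bits=4)
Q_rtn = quantize_rtn(W, scale=0.25, bits=4)
print("RTN error:", np.linalg.norm((W - Q_rtn) @ X) ** 2)
print("OBS error:", np.linalg.norm((W - Q_obs) @ X) ** 2)
```

No gradients are computed anywhere: the only inputs are the weights and second-order statistics gathered from forward passes on calibration data, which is what "backpropagation-free" refers to. When the Hessian is diagonal the compensation term vanishes and the procedure reduces to plain round-to-nearest.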
Key Experimental Results
Main Results
Perplexity of Llama-2-7B under W2A16 quantization (2-bit weights, 16-bit activations):
| Method | WikiText PPL ↓ | C4 PPL ↓ |
|---|---|---|
| GPTQ | 107.8 | 89.2 |
| QuIP# | 12.7 | 14.8 |
| BoA | 10.2 | 12.1 |
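As a quick reading aid, the relative perplexity reductions implied by the table can be computed directly. The snippet only restates the numbers above; the helper name `relative_reduction` is an assumption for illustration.

```python
# Perplexities copied from the table above (Llama-2-7B, W2A16).
ppl = {
    "GPTQ":  {"WikiText": 107.8, "C4": 89.2},
    "QuIP#": {"WikiText": 12.7,  "C4": 14.8},
    "BoA":   {"WikiText": 10.2,  "C4": 12.1},
}

def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

for method in ("GPTQ", "QuIP#"):
    for dataset in ("WikiText", "C4"):
        r = relative_reduction(ppl[method][dataset], ppl["BoA"][dataset])
        print(f"BoA vs {method} on {dataset}: {r:.1f}% lower perplexity")
```

Against QuIP#, BoA's perplexity is roughly 20% lower on WikiText and 18% lower on C4; against GPTQ the reduction is around 90% on WikiText and 86% on C4.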
Ablation Study
| Configuration | WikiText PPL ↓ | Description |
|---|---|---|
| Layer-wise Hessian (GPTQ) | 107.8 | Ignores cross-layer dependencies |
| Attention-aware Hessian (Full) | 10.0 | High memory overhead |
| Attention-aware Hessian (Relaxed) | 10.2 | Negligible accuracy loss, manageable memory |
| BoA + QuaRot | 8.1 | Optimal combination with rotation-based methods |
Key Findings
- The advantage is most pronounced at ultra-low bit-widths (W2): GPTQ reaches a WikiText PPL of 107.8 versus 10.2 for BoA.
- Combines well with QuaRot and SmoothQuant; BoA + QuaRot reaches 8.1 PPL on W2A16.
- W4A4 weight-activation quantization also achieves SOTA performance.
Highlights & Insights
- Modeling cross-layer dependencies in attention is the key contribution: gains are moderate at common bit-widths (W4) but large under extreme compression (W2).
- The block-diagonal approximation for Hessian relaxation retains the most crucial intra-head interactions, representing an elegant engineering decision.
- BoA is orthogonal to pre-processing methods such as SmoothQuant and QuaRot, so it can be combined with them.
Limitations & Future Work
- Only considers cross-layer dependencies within the attention module, while the FFN part remains layer-wise independent.
- Hessian computation still requires a forward pass on calibration data.
- The performance improvement is limited under W4 precision.
Related Work & Insights
- vs GPTQ: Layer-wise Hessian, ignoring cross-layer dependencies.
- vs QuIP#: Uses lattice codebooks but still processes layers independently.
- vs any4: any4 improves codebook design, whereas BoA improves the quantization optimization strategy; the two are orthogonal.
Rating
- Novelty: ⭐⭐⭐⭐ Attention-aware Hessian is a novel technical direction.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across multiple models, bit-widths, and combined with various methods.
- Writing Quality: ⭐⭐⭐⭐ Technically rigorous with clear derivations.
- Value: ⭐⭐⭐⭐ Significant breakthrough in ultra-low bit-width quantization.