BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers¶
Conference: CVPR 2026 arXiv: 2603.09582 Code: EdwardChasel/BinaryAttention Area: Model Compression Keywords: attention quantization, binary quantization, vision transformer, diffusion transformer, 1-bit attention, FlashAttention
TL;DR¶
This paper proposes BinaryAttention, which quantizes Query and Key in Transformer attention to 1-bit binary representations and replaces floating-point dot products with XNOR + popcount bitwise operations, achieving over 2× speedup over FlashAttention2 on A100 GPUs while matching or surpassing full-precision attention across vision classification, detection, segmentation, and diffusion generation tasks.
Background & Motivation¶
Attention computation as a bottleneck: Standard Transformer attention scales quadratically with sequence length, making it the primary inference efficiency bottleneck in high-resolution vision tasks.
Existing quantization limited to 8-bit/4-bit: The SageAttention series quantizes QK to INT8/INT4/FP4, but pushing further to sub-4-bit — especially binary (1-bit) — causes severe information loss, training instability, and sharp performance degradation.
Cost of architectural alternatives: Linear Attention, Sparse Attention, and SSMs (e.g., Mamba) reduce complexity but often sacrifice the expressive power of standard attention across diverse tasks.
Hardware-native support for binary operations: NVIDIA A100 Tensor Cores deliver up to 4992 TOPS for binary (INT1) operations — 16× the FP16 Tensor Core throughput — providing a hardware foundation for ultra-low-bit attention.
Theoretical feasibility: The authors demonstrate from two perspectives — distance metrics (Hamming vs. Euclidean distance) and directional similarity (cosine similarity preservation) — that the core similarity relationships in attention can be preserved after binarization.
Practical acceleration demand: Orthogonal to architecture-changing approaches, attention quantization offers a plug-and-play acceleration method that preserves the original architecture, enabling broader generality and practicality.
Method¶
Overall Architecture¶
BinaryAttention consists of three core components: (1) Scaled Binary Representations — quantizing Q and K to 1-bit while retaining scaling factors; (2) Bias Enhancement — introducing learnable biases to compensate for information loss from binarization; (3) Hybrid Quantization — applying 8-bit quantization to attention scores and V for end-to-end acceleration. Training employs QAT combined with a self-distillation strategy. The overall scheme is implemented atop the tiled attention framework of FlashAttention2 for hardware acceleration.
Key Design 1: Scaled Binary Representations¶
- Function: Query \(\mathbf{q}_i\) and Key \(\mathbf{k}_j\) are quantized via the sign function to \(\{-1, +1\}^d\), yielding \(\mathbf{s}_i = \mu_q \cdot \text{sign}(\mathbf{q}_i)\) and \(\mathbf{t}_j = \mu_k \cdot \text{sign}(\mathbf{k}_j)\).
- Mechanism: The dot-product similarity \(\mu_q \mu_k \mathbf{s}_i^T \mathbf{t}_j\) can be computed efficiently via XNOR + popcount bitwise operations, theoretically achieving 16× speedup for the \(\mathbf{QK}^T\) portion.
- Design Motivation: Theorem 1 proves that the outer product of binary Q/K is a consistent estimator of the original covariance matrix, providing statistical guarantees for the expressiveness of binary attention. The scaling factors \(\mu_q, \mu_k\) preserve the magnitude information of original tokens, reducing quantization error.
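The scaled-binary idea can be sketched numerically: quantize Gaussian queries/keys with the sign function, attach a scalar scale, and check that the binary dot products track the full-precision ones. The mean-absolute-value scale below is an illustrative choice, not necessarily the paper's exact definition, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 500  # head dimension, number of query/key pairs

# Synthetic full-precision queries/keys (Gaussian, matching the
# zero-mean assumption of Theorem 1).
Q = rng.normal(size=(n, d)).astype(np.float32)
K = rng.normal(size=(n, d)).astype(np.float32)

# Scaled binary representations: sign(x) in {-1, +1} plus a per-tensor
# scale (mean absolute value here; the paper's scale may be defined
# differently).
mu_q = np.abs(Q).mean()
mu_k = np.abs(K).mean()
S = np.where(Q >= 0, 1.0, -1.0)
T = np.where(K >= 0, 1.0, -1.0)

exact = np.einsum("nd,nd->n", Q, K)                  # q_i . k_j in FP32
approx = mu_q * mu_k * np.einsum("nd,nd->n", S, T)   # binary estimate

# For Gaussian data the two should be strongly correlated, i.e. the
# similarity ordering that softmax cares about is largely preserved.
corr = np.corrcoef(exact, approx)[0, 1]
print(f"correlation between exact and binary dot products: {corr:.3f}")
```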
Key Design 2: Bias Enhancement¶
- Function: A bias term is added to the binary dot product: \(S_{ij} = \mu_q \mu_k \mathbf{s}_i^T \mathbf{t}_j / \sqrt{d} + b_{ij}\).
- Mechanism: The bias can be a dense learnable matrix, a relative positional bias, or a context-aware bias, increasing the rank of the attention score matrix and preventing the softmax distribution from collapsing to a uniform distribution.
- Design Motivation: 1-bit quantization discards magnitude information, causing attention scores to tend toward uniformity (the "flattened effect"), losing the ability to distinguish salient features. The bias term re-injects contextual and spatial structural information, restoring the discriminative capacity of attention. Ablation studies show the bias is especially effective for small models (DeiT-T: +0.44%).
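A toy sketch of the bias-enhanced score \(S_{ij}\): binary QK products take only coarse, discrete values, and a relative positional bias (one hypothetical parameterization among the dense/positional/context-aware variants the paper allows) reshapes the softmax rows. The scale `mu` and all tensors below are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 32  # sequence length, head dimension

# Binary scores: s_i^T t_j is an even integer in [-d, d], so the raw
# logits are coarsely quantized, which tends to flatten the softmax.
S = np.where(rng.normal(size=(n, d)) >= 0, 1.0, -1.0)
T = np.where(rng.normal(size=(n, d)) >= 0, 1.0, -1.0)
mu = 0.8  # stand-in for the product of scale factors mu_q * mu_k
scores = mu * (S @ T.T) / np.sqrt(d)

# Relative positional bias: one learnable value per offset j - i,
# indexed via a (2n - 1)-entry table.
rel_bias = rng.normal(scale=0.5, size=2 * n - 1)
offsets = np.arange(n)[None, :] - np.arange(n)[:, None] + (n - 1)
biased = scores + rel_bias[offsets]  # S_ij = mu * s^T t / sqrt(d) + b_ij

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def row_entropy(p):
    # Mean entropy of attention rows; lower = more peaked (less flat).
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

print(f"mean row entropy without bias: {row_entropy(softmax(scores)):.3f}")
print(f"mean row entropy with bias:    {row_entropy(softmax(biased)):.3f}")
```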
Key Design 3: Hybrid Quantization¶
- Function: Post-softmax attention scores \(P_{ij}\) are quantized using unsigned 8-bit static quantization (scale = 1/255); Values \(\mathbf{v}_j\) are quantized using channel-wise 8-bit quantization.
- Mechanism: The \(\mathbf{PV}\) multiplication uses the INT8 Tensor Core instruction `mma.s32.u8.s8.s32`, achieving 2× speedup for this stage.
- Design Motivation: Quantizing only QK is insufficient for end-to-end acceleration, since PV multiplication is also a computational bottleneck; 8-bit precision is adequate for attention scores (naturally in [0, 1]) and for Values.
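The hybrid stage can be mimicked in NumPy: unsigned 8-bit scores with the fixed 1/255 scale, per-channel signed 8-bit Values, and an integer matmul standing in for the Tensor Core instruction. This is a sketch of the arithmetic only, not the kernel; the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 16, 64

# Post-softmax scores lie in [0, 1]: static unsigned 8-bit, scale 1/255.
P = rng.dirichlet(np.ones(n), size=n).astype(np.float32)  # rows sum to 1
P_u8 = np.clip(np.round(P * 255), 0, 255).astype(np.uint8)

# Values: per-channel (column-wise) symmetric signed 8-bit quantization.
V = rng.normal(size=(n, d)).astype(np.float32)
v_scale = np.abs(V).max(axis=0) / 127.0               # one scale per channel
V_s8 = np.clip(np.round(V / v_scale), -127, 127).astype(np.int8)

# Integer matmul (the arithmetic mma.s32.u8.s8.s32 performs on Tensor
# Cores), then dequantize with the combined scales.
acc = P_u8.astype(np.int32) @ V_s8.astype(np.int32)
out = acc.astype(np.float32) * (1.0 / 255.0) * v_scale

err = np.abs(out - P @ V).max()
print(f"max abs error of hybrid INT8 PV vs FP32: {err:.4f}")
```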
Key Design 4: QAT + Self-Distillation Training Strategy¶
- Function: Quantization-Aware Training (QAT) simulates quantization effects during training/fine-tuning; a full-precision model serves as teacher for self-distillation.
- Mechanism: The Straight-Through Estimator (STE) enables backpropagation through the sign function; the distillation loss guides binary representations to align in similarity with their full-precision counterparts.
- Design Motivation: 1-bit quantization induces distribution shift and approximation errors that PTQ alone cannot compensate. Ablation results show self-distillation yields +0.66% for the larger DeiT-B, demonstrating its effectiveness in countering quantization-induced distribution shift.
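A minimal numerical sketch of the STE used in QAT: the forward pass applies the hard sign, while the backward pass substitutes an identity gradient, clipped where the input saturates — a common convention that may differ in detail from the paper's exact variant.

```python
import numpy as np

def ste_sign_forward(x):
    # Hard binarization to {-1, +1}, as in the quantized forward pass.
    return np.where(x >= 0, 1.0, -1.0)

def ste_sign_backward(x, grad_output):
    # d(sign)/dx is zero almost everywhere; the STE replaces it with the
    # identity gradient, masked where |x| > 1 (the clipped variant).
    return grad_output * (np.abs(x) <= 1.0)

x = np.array([-1.5, -0.3, 0.2, 2.0])
g = np.ones_like(x)  # pretend the upstream gradient is all ones
print(ste_sign_forward(x))      # hard binary values
print(ste_sign_backward(x, g))  # gradient flows only where |x| <= 1
```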
Loss & Training¶
- QAT Training: Sign quantization is applied to Q/K during the forward pass; gradients are approximated via STE during backpropagation.
- Self-Distillation: A full-precision pretrained model acts as teacher; the distillation loss encourages sign-aligned similarity between binary and full-precision attention.
- Hardware Implementation: Built on the FlashAttention2 framework; QK multiplication uses the `mma.s32.b1.b1.s32` PTX instruction and PV multiplication uses `mma.s32.u8.s8.s32`.
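The bitwise QK kernel can be emulated in software: pack ±1 vectors into bits and recover the dot product from a popcount, which is the arithmetic the b1 Tensor Core instruction performs. The identity used is dot(s, t) = d − 2·popcount(s XOR t), equivalent to the XNOR + popcount form 2·popcount(XNOR) − d; `binary_dot` is an illustrative helper, not the paper's kernel.

```python
import numpy as np

def binary_dot(s, t):
    # Map {-1, +1} -> {0, 1} bits and pack into bytes; packbits pads
    # with zero bits, which XOR to zero and so do not affect the count.
    d = s.size
    bits_s = np.packbits((s > 0).astype(np.uint8))
    bits_t = np.packbits((t > 0).astype(np.uint8))
    # Hamming distance = number of disagreeing dimensions.
    hamming = int(np.unpackbits(np.bitwise_xor(bits_s, bits_t)).sum())
    # Agreements minus disagreements: (d - h) - h = d - 2h.
    return d - 2 * hamming

rng = np.random.default_rng(3)
d = 64
s = np.where(rng.normal(size=d) >= 0, 1, -1)
t = np.where(rng.normal(size=d) >= 0, 1, -1)
assert binary_dot(s, t) == int(s @ t)
print("XOR/popcount dot matches the +-1 dot product")
```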
Key Experimental Results¶
Table 1: ImageNet-1K Image Classification (Top-1 Accuracy)¶
| Method | Size | Resolution | OPs | Top-1 (%) |
|---|---|---|---|---|
| DeiT-T (FlashAttention2) | 6M | 224² | 1.2G | 72.2 |
| SageAttention-T | 6M | 224² | 1.2G | 72.11 |
| BinaryAttention-T | 6M | 224² | 1.1G | 72.88 |
| DeiT-S | 22M | 224² | 4.6G | 79.8 |
| SageAttention-S | 22M | 224² | 4.5G | 79.82 |
| BinaryAttention-S | 22M | 224² | 4.3G | 80.24 |
| DeiT-B | 87M | 384² | 55.4G | 83.1 |
| SageAttention-B | 87M | 384² | 53.2G | 82.89 |
| BinaryAttention-B | 87M | 384² | 50.2G | 83.64 |
Table 2: ADE20K Semantic Segmentation (mIoU)¶
| Backbone | OPs | mIoU (SS) | mIoU (MS) |
|---|---|---|---|
| DeiT-B | 2654G | 46.86 | 47.74 |
| SageAttention-B | 2539G | 46.86 | 47.74 |
| BinaryAttention-B | 2384G | 47.76 | 48.37 |
Table 3: DiT-XL/2 Image Generation (ImageNet 256×256, cfg=1.50)¶
| Method | OPs | Training Steps | FID↓ | IS↑ |
|---|---|---|---|---|
| FlashAttention2 | 118.6G | 7000K | 2.27 | 278.24 |
| SageAttention | 117.1G | 7000K | 2.27 | 278.03 |
| BinaryAttention | 115.0G | 4000K | 2.19 | 278.03 |
Table 4: Ablation Study (ImageNet-1K Top-1)¶
| Scale | Bias | Distill | DeiT-T | DeiT-S | DeiT-B |
|---|---|---|---|---|---|
| ✗ | ✗ | ✗ | 71.95 | 79.59 | 81.10 |
| ✓ | ✗ | ✗ | 72.42 | 79.81 | 81.33 |
| ✓ | ✗ | ✓ | 72.44 | 79.97 | 81.99 |
| ✓ | ✓ | ✓ | 72.88 | 80.24 | 82.04 |
Highlights & Insights¶
- Theory meets practice: Theorem 1 provides theoretical guarantees under a Gaussian assumption that binary attention preserves the covariance structure, in contrast to the purely empirical approach of most quantization work.
- Surpassing full precision: BinaryAttention outperforms full-precision FlashAttention2 across multiple tasks and model scales, suggesting that QAT + distillation renders binarization a form of regularization.
- Significant practical speedup: Achieves 2× kernel-level speedup over FlashAttention2 and 1.5× end-to-end speedup at 1024² input resolution, with seamless composability with existing linear layer quantization methods (e.g., PTQ4ViT).
- Effective for generative tasks: Achieves comparable or superior FID on DiT/SiT diffusion models with fewer training steps, demonstrating the viability of binary attention in generative models.
- Elegant bias design: Simple relative positional biases effectively counteract the distribution collapse from binarization, with more pronounced benefits for smaller models — a clear and well-motivated insight.
Limitations & Future Work¶
- Requires QAT fine-tuning: This is not a PTQ solution; fine-tuning from a full-precision model is required, increasing deployment cost.
- Hardware dependency: Binary Tensor Core instructions (`mma.b1`) are currently supported only on NVIDIA GPUs; portability to other hardware platforms is unexplored.
- Theoretical assumption limitations: Theorem 1 relies on a zero-mean Gaussian assumption; actual Q/K distributions may deviate, limiting the strictness of the theoretical guarantees.
- Insufficient validation on large models: Experiments only reach the DeiT-B / DiT-XL scale (~87M parameters); applicability to ViT-L/H or multimodal large models (e.g., LLaVA) remains unknown.
- Value not quantized to ultra-low bit: V is retained at 8-bit, limiting speedup gains for PV (only 2×); further compression of V could yield greater benefits.
Related Work & Insights¶
- SageAttention series [Zhang et al.]: A progressive attention quantization roadmap from INT8 → INT4 → FP4; BinaryAttention pushes this to the 1-bit extreme.
- FlashAttention [Dao et al.]: The IO-aware tiled attention hardware optimization framework upon which BinaryAttention is directly built — the two are complementary.
- Binary Neural Networks (e.g., BiT [Liu et al.], BiBERT [Qin et al.]): Prior binarization work primarily targets linear layer weights/activations; this paper is the first to successfully apply binarization to attention QK computation.
- DiT / SiT: Representative diffusion Transformer architectures; this paper validates binary attention for generative models, opening a new direction for efficient diffusion models.
- Insights: The binarization + bias compensation paradigm is transferable to other scenarios requiring efficient attention, such as video understanding (long sequences) and point cloud processing (large-scale point sets); combining with KV cache compression could further reduce LLM inference latency.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First successful quantization of attention QK to 1-bit without performance degradation, with theoretically grounded analysis.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers four tasks (classification, detection, segmentation, generation) with detailed ablations and evaluation of both kernel-level and end-to-end efficiency.
- Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear, experiments are well-organized, and the motivation for the bias term is intuitively explained.
- Value: ⭐⭐⭐⭐ — Significant practical speedup with plug-and-play applicability, orthogonal and complementary to existing quantization and acceleration methods.