BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers

Conference: CVPR 2026 arXiv: 2603.09582 Code: EdwardChasel/BinaryAttention Area: Model Compression Keywords: attention quantization, binary quantization, vision transformer, diffusion transformer, 1-bit attention, FlashAttention

TL;DR

This paper proposes BinaryAttention, which quantizes Query and Key in Transformer attention to 1-bit binary representations and replaces floating-point dot products with XNOR + popcount bitwise operations, achieving more than a 2× speedup over FlashAttention2 on A100 GPUs while matching or surpassing full-precision attention across vision classification, detection, segmentation, and diffusion generation tasks.

Background & Motivation

Attention computation as a bottleneck: Standard Transformer attention scales quadratically with sequence length, making it the primary inference efficiency bottleneck in high-resolution vision tasks.

Existing quantization limited to 8-bit/4-bit: The SageAttention series quantizes QK to INT8/INT4/FP4, but pushing further to sub-4-bit — especially binary (1-bit) — causes severe information loss, training instability, and sharp performance degradation.

Cost of architectural alternatives: Linear Attention, Sparse Attention, and SSMs (e.g., Mamba) reduce complexity but often sacrifice the expressive power of standard attention across diverse tasks.

Hardware native support for binary operations: NVIDIA A100 Tensor Cores deliver up to 4992 TOPS for binary operations (16× the FP16 Tensor Core throughput), providing a hardware foundation for ultra-low-bit attention.

Theoretical feasibility: The authors demonstrate from two perspectives — distance metrics (Hamming vs. Euclidean distance) and directional similarity (cosine similarity preservation) — that the core similarity relationships in attention can be preserved after binarization.
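For intuition, a standard identity (not specific to this paper) connects the two views: for vectors in \(\{-1, +1\}^d\), Hamming distance and the dot product are affinely related, so the nearest neighbors under Hamming distance are exactly the most cosine-similar binary vectors:

\[
\mathbf{s}^T \mathbf{t} = d - 2\, d_H(\mathbf{s}, \mathbf{t}), \qquad \cos(\mathbf{s}, \mathbf{t}) = \frac{\mathbf{s}^T \mathbf{t}}{d} = 1 - \frac{2\, d_H(\mathbf{s}, \mathbf{t})}{d}.
\]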

Practical acceleration demand: Orthogonal to architecture-changing approaches, attention quantization offers a plug-and-play acceleration method that preserves the original architecture, enabling broader generality and practicality.

Method

Overall Architecture

BinaryAttention consists of three core components: (1) Scaled Binary Representations — quantizing Q and K to 1-bit while retaining scaling factors; (2) Bias Enhancement — introducing learnable biases to compensate for information loss from binarization; (3) Hybrid Quantization — applying 8-bit quantization to attention scores and V for end-to-end acceleration. Training employs quantization-aware training (QAT) combined with a self-distillation strategy. The overall scheme is implemented atop the tiled attention framework of FlashAttention2 for hardware acceleration.

Key Design 1: Scaled Binary Representations

  • Function: Query \(\mathbf{q}_i\) and Key \(\mathbf{k}_j\) are quantized via the sign function to \(\{-1, +1\}^d\), yielding \(\mathbf{s}_i = \text{sign}(\mathbf{q}_i)\) and \(\mathbf{t}_j = \text{sign}(\mathbf{k}_j)\), with scaling factors \(\mu_q\) and \(\mu_k\) retained separately.
  • Mechanism: The dot-product similarity \(\mu_q \mu_k \mathbf{s}_i^T \mathbf{t}_j\) can be computed efficiently via XNOR + popcount bitwise operations, theoretically achieving 16× speedup for the \(\mathbf{QK}^T\) portion.
  • Design Motivation: Theorem 1 proves that the outer product of binary Q/K is a consistent estimator of the original covariance matrix, providing statistical guarantees for the expressiveness of binary attention. The scaling factors \(\mu_q, \mu_k\) preserve the magnitude information of original tokens, reducing quantization error.
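
Below is a minimal NumPy sketch of the idea (illustrative only, not the authors' CUDA kernel; the mean-absolute-value scaling factor is an assumed choice): sign-quantize a query and a key, then recover their dot product from XNOR + popcount over the packed sign bits.

```python
import numpy as np

def scaled_binarize(x):
    """Binarize to {-1,+1} int8 and keep a scalar scale (mean |x| here; the paper's exact choice may differ)."""
    scale = float(np.mean(np.abs(x)))
    bits = np.where(x >= 0, 1, -1).astype(np.int8)
    return scale, bits

def binary_dot_xnor_popcount(s, t):
    """Dot product of two {-1,+1} vectors via XNOR + popcount on packed bits.
    Encoding +1 -> 1, -1 -> 0 gives  s.t = 2 * (#matching bits) - d."""
    d = s.size
    s_packed = np.packbits(s == 1)                 # pack signs into bytes
    t_packed = np.packbits(t == 1)
    xnor = ~(s_packed ^ t_packed)                  # XNOR per byte (uint8 complement)
    matches = int(np.unpackbits(xnor)[:d].sum())   # popcount over the first d bits
    return 2 * matches - d

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
mu_q, s = scaled_binarize(q)
mu_k, t = scaled_binarize(k)
print(mu_q * mu_k * binary_dot_xnor_popcount(s, t), q @ k)  # binary estimate vs. exact dot product
```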

Key Design 2: Bias Enhancement

  • Function: A bias term is added to the binary dot product: \(S_{ij} = \mu_q \mu_k \mathbf{s}_i^T \mathbf{t}_j / \sqrt{d} + b_{ij}\).
  • Mechanism: The bias can be a dense learnable matrix, a relative positional bias, or a context-aware bias, increasing the rank of the attention score matrix and preventing the softmax distribution from collapsing to a uniform distribution.
  • Design Motivation: 1-bit quantization discards magnitude information, causing attention scores to tend toward uniformity (the "flattened effect"), losing the ability to distinguish salient features. The bias term re-injects contextual and spatial structural information, restoring the discriminative capacity of attention. Ablation studies show the bias is especially effective for small models (DeiT-T: +0.44%).
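
A toy PyTorch sketch of the score computation with a learnable relative positional bias (one of the bias forms mentioned above); the module name, per-token scale granularity, and 1-D relative-position indexing are assumptions for illustration, and the binary dot product is emulated in floating point.

```python
import torch
import torch.nn as nn

class BiasEnhancedBinaryScores(nn.Module):
    """Sketch: binary QK scores plus a learnable relative positional bias,
    i.e. S_ij = mu_q * mu_k * s_i^T t_j / sqrt(d) + b_ij (float emulation)."""
    def __init__(self, num_heads, seq_len):
        super().__init__()
        # One learnable bias per head and per relative offset (2*seq_len - 1 offsets in 1-D).
        self.rel_bias = nn.Parameter(torch.zeros(num_heads, 2 * seq_len - 1))
        idx = torch.arange(seq_len)
        self.register_buffer("rel_idx", idx[:, None] - idx[None, :] + seq_len - 1)

    def forward(self, q, k):
        # q, k: (batch, heads, seq_len, dim), full precision.
        d = q.shape[-1]
        mu_q = q.abs().mean(dim=-1, keepdim=True)   # per-token scale (assumed granularity)
        mu_k = k.abs().mean(dim=-1, keepdim=True)
        s, t = torch.sign(q), torch.sign(k)         # {-1, +1} up to measure-zero ties
        scores = (mu_q * s) @ (mu_k * t).transpose(-2, -1) / d ** 0.5
        return scores + self.rel_bias[:, self.rel_idx]  # bias broadcasts over the batch dim
```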

Key Design 3: Hybrid Quantization

  • Function: Post-softmax attention scores \(P_{ij}\) are quantized using unsigned 8-bit static quantization (scale = 1/255); Values \(\mathbf{v}_j\) are quantized using channel-wise 8-bit quantization.
  • Mechanism: The \(\mathbf{PV}\) multiplication uses INT8 Tensor Core instructions mma.s32.u8.s8.s32, achieving 2× speedup for this stage.
  • Design Motivation: Quantizing only QK is insufficient for end-to-end acceleration, as PV multiplication is also a computational bottleneck. 8-bit precision is adequate for attention scores (naturally in [0,1]) and Values.
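
A rough PyTorch emulation of the hybrid 8-bit step (the per-channel granularity for V and the dequantization order are assumptions; the real implementation runs inside the fused FlashAttention2-style kernel on Tensor Cores):

```python
import torch

def hybrid_quant_pv(p, v):
    """Emulate the hybrid 8-bit PV stage: P (post-softmax, in [0,1]) -> uint8 with a
    fixed 1/255 scale; V -> per-channel int8 with a dynamic scale."""
    p_q = (p * 255).round().clamp(0, 255).to(torch.uint8)                  # static unsigned quantization
    v_scale = v.abs().amax(dim=-2, keepdim=True).clamp(min=1e-8) / 127.0   # one scale per channel
    v_q = (v / v_scale).round().clamp(-128, 127).to(torch.int8)
    # Emulate the INT8 matmul in float here; on GPU this maps to mma.s32.u8.s8.s32 (int32 accumulation).
    out = p_q.to(torch.float32) @ v_q.to(torch.float32)
    return out.to(v.dtype) * v_scale / 255.0                               # dequantize the accumulator

p = torch.softmax(torch.randn(2, 4, 16, 16), dim=-1)   # (batch, heads, seq, seq)
v = torch.randn(2, 4, 16, 64)                           # (batch, heads, seq, dim)
print((hybrid_quant_pv(p, v) - p @ v).abs().max())      # small quantization error
```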

Key Design 4: QAT + Self-Distillation Training Strategy

  • Function: Quantization-Aware Training (QAT) simulates quantization effects during training/fine-tuning; a full-precision model serves as teacher for self-distillation.
  • Mechanism: The Straight-Through Estimator (STE) enables backpropagation through the sign function; the distillation loss guides binary representations to align in similarity with their full-precision counterparts.
  • Design Motivation: 1-bit quantization induces distribution shift and approximation errors that post-training quantization (PTQ) alone cannot compensate for. Ablation results show self-distillation yields +0.66% for the larger DeiT-B, demonstrating its effectiveness in countering quantization-induced distribution shift.
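
A minimal sketch of the STE used during QAT (the clipped backward and the scale granularity are assumptions, not confirmed details of the paper):

```python
import torch

class STESign(torch.autograd.Function):
    """Sign with a straight-through estimator. The clipped-STE backward is a common
    QAT choice and an assumption here; the paper may use a different gradient surrogate."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass gradients through unchanged inside [-1, 1], zero them outside (clipped STE).
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

def fake_binarize_qk(q, k):
    """QAT-style fake quantization of Q and K: sign in the forward pass, STE in the backward pass."""
    mu_q = q.abs().mean(dim=-1, keepdim=True)   # assumed per-token scaling factor
    mu_k = k.abs().mean(dim=-1, keepdim=True)
    return mu_q * STESign.apply(q), mu_k * STESign.apply(k)
```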

Loss & Training

  • QAT Training: Sign quantization is applied to Q/K during the forward pass; gradients are approximated via STE during backpropagation.
  • Self-Distillation: A full-precision pretrained model acts as teacher; the distillation loss encourages sign-aligned similarity between binary and full-precision attention.
  • Hardware Implementation: Built on the FlashAttention2 framework; QK multiplication uses the mma.s32.b1.b1.s32 PTX instruction and PV multiplication uses mma.s32.u8.s8.s32.
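
The summary does not specify the exact distillation objective; a common choice consistent with the description ("align in similarity with the full-precision counterparts") is a KL term between student and teacher attention distributions, sketched here as an assumption:

```python
import torch.nn.functional as F

def attention_distill_loss(student_scores, teacher_scores, tau=1.0):
    """Hypothetical self-distillation term: pull the binary model's attention distribution
    toward the full-precision teacher's (teacher scores are detached / frozen)."""
    s = F.log_softmax(student_scores / tau, dim=-1)
    t = F.softmax(teacher_scores.detach() / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau * tau
```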

Key Experimental Results

Table 1: ImageNet-1K Image Classification (Top-1 Accuracy)

Method Size Resolution OPs Top-1 (%)
DeiT-T (FlashAttention2) 6M 224² 1.2G 72.2
SageAttention-T 6M 224² 1.2G 72.11
BinaryAttention-T 6M 224² 1.1G 72.88
DeiT-S 22M 224² 4.6G 79.8
SageAttention-S 22M 224² 4.5G 79.82
BinaryAttention-S 22M 224² 4.3G 80.24
DeiT-B 87M 384² 55.4G 83.1
SageAttention-B 87M 384² 53.2G 82.89
BinaryAttention-B 87M 384² 50.2G 83.64

Table 2: ADE20K Semantic Segmentation (mIoU)

Backbone OPs mIoU (SS) mIoU (MS)
DeiT-B 2654G 46.86 47.74
SageAttention-B 2539G 46.86 47.74
BinaryAttention-B 2384G 47.76 48.37

Table 3: DiT-XL/2 Image Generation (ImageNet 256×256, cfg=1.50)

Method OPs Training Steps FID↓ IS↑
FlashAttention2 118.6G 7000K 2.27 278.24
SageAttention 117.1G 7000K 2.27 278.03
BinaryAttention 115.0G 4000K 2.19 278.03

Table 4: Ablation Study (ImageNet-1K Top-1)

Scale Bias Distill DeiT-T DeiT-S DeiT-B
✗ ✗ ✗ 71.95 79.59 81.10
✓ ✗ ✗ 72.42 79.81 81.33
✓ ✗ ✓ 72.44 79.97 81.99
✓ ✓ ✓ 72.88 80.24 82.04

Highlights & Insights

  1. Theory meets practice: Theorem 1 provides theoretical guarantees under a Gaussian assumption that binary attention preserves the covariance structure, in contrast to the purely empirical approach of most quantization work.
  2. Surpassing full precision: BinaryAttention outperforms full-precision FlashAttention2 across multiple tasks and model scales, suggesting that QAT + distillation renders binarization a form of regularization.
  3. Significant practical speedup: Achieves 2× kernel-level speedup over FlashAttention2 and 1.5× end-to-end speedup at 1024² input resolution, with seamless composability with existing linear layer quantization methods (e.g., PTQ4ViT).
  4. Effective for generative tasks: Achieves comparable or superior FID on DiT/SiT diffusion models with fewer training steps, demonstrating the viability of binary attention in generative models.
  5. Elegant bias design: Simple relative positional biases effectively counteract the distribution collapse from binarization, with more pronounced benefits for smaller models — a clear and well-motivated insight.

Limitations & Future Work

  1. Requires QAT fine-tuning: This is not a PTQ solution; fine-tuning from a full-precision model is required, increasing deployment cost.
  2. Hardware dependency: Binary Tensor Core instructions (mma.b1) are currently supported only on NVIDIA GPUs; portability to other hardware platforms is unexplored.
  3. Theoretical assumption limitations: Theorem 1 relies on a zero-mean Gaussian assumption; actual Q/K distributions may deviate, limiting the strictness of the theoretical guarantees.
  4. Insufficient validation on large models: Experiments only reach the DeiT-B (~87M parameters) and DiT-XL scales; applicability to ViT-L/H or multimodal large models (e.g., LLaVA) remains unknown.
  5. Value not quantized to ultra-low bit: V is retained at 8-bit, limiting speedup gains for PV (only 2×); further compression of V could yield greater benefits.

Related Work & Connections

  • SageAttention series [Zhang et al.]: A progressive attention quantization roadmap from INT8 → INT4 → FP4; BinaryAttention pushes this to the 1-bit extreme.
  • FlashAttention [Dao et al.]: The IO-aware tiled attention hardware optimization framework upon which BinaryAttention is directly built — the two are complementary.
  • Binary Neural Networks (e.g., BiT [Liu et al.], BiBERT [Qin et al.]): Prior binarization work primarily targets linear layer weights/activations; this paper is the first to successfully apply binarization to attention QK computation.
  • DiT / SiT: Representative diffusion Transformer architectures; this paper validates binary attention for generative models, opening a new direction for efficient diffusion models.
  • Insights: The binarization + bias compensation paradigm is transferable to other scenarios requiring efficient attention, such as video understanding (long sequences) and point cloud processing (large-scale point sets); combining with KV cache compression could further reduce LLM inference latency.

Rating

  • Novelty: ⭐⭐⭐⭐ — First successful quantization of attention QK to 1-bit without performance degradation, with theoretically grounded analysis.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers four tasks (classification, detection, segmentation, generation) with detailed ablations and evaluation of both kernel-level and end-to-end efficiency.
  • Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear, experiments are well-organized, and the motivation for the bias term is intuitively explained.
  • Value: ⭐⭐⭐⭐ — Significant practical speedup with plug-and-play applicability, orthogonal and complementary to existing quantization and acceleration methods.