RACE Attention: A Strictly Linear-Time Attention for Long-Sequence Training¶
- Conference: ICLR 2026
- arXiv: 2510.04008
- Code: https://github.com/sahiljoshi515/RACE_Attention
- Area: LLM Efficiency / Attention Mechanisms
- Keywords: Linear Attention, LSH, Angular Kernel, Long-Sequence Training, Attention Approximation
TL;DR¶
This paper proposes RACE Attention, which replaces softmax with a power angular kernel and approximates attention outputs via differentiable LSH (RACE) sketches. This yields strictly linear time complexity, supports up to 12M tokens on a single GPU and 75M tokens on a single CPU, and matches or surpasses softmax accuracy across diverse tasks.
Background & Motivation¶
Background: The \(O(N^2 d)\) complexity of softmax attention is a fundamental bottleneck for long-context training. Even with FlashAttention-2/3 optimizations, a single attention layer on a GH200 cannot process more than ~4M tokens.
Limitations of Prior Work: Kernelized linear attention methods (e.g., Linear Attention, Performer) suffer from accuracy degradation; low-rank approximations (Linformer) do not support autoregressive inference; YOSO employs hard LSH, which lacks theoretical guarantees and does not support causal language modeling.
Key Challenge: Existing approximation methods lack a rigorous mathematical framework for characterizing the efficiency–accuracy trade-off, leading to ad hoc design choices that are unstable across tasks.
Goal: Design a strictly linear-time attention mechanism with theoretical guarantees, supporting both causal and non-causal settings and scaling to tens of millions of tokens.
Key Insight: The LSH collision probability of angular kernels equals the angular similarity exactly, and RACE sketches enable unbiased estimation of kernel density sums in linear time.
Core Idea: Replace softmax with a power angular kernel and approximate attention in \(O(N)\) via differentiable RACE sketches.
Method¶
Overall Architecture¶
Softmax is replaced by the angular kernel \((1 - \frac{\arccos(\hat{q}_i \cdot \hat{k}_j)}{\pi})^\gamma\). Rather than constructing the \(N \times N\) attention matrix, keys and values are hashed into \(S = L \times R\) buckets, and each query reads the aggregated per-bucket statistics instead of attending to individual keys.
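To make the kernel concrete, here is a naive quadratic-time PyTorch reference for the power angular kernel before any sketching; the \(\gamma\) value and the row normalization below are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def angular_kernel_weights(q, k, gamma=4.0, eps=1e-6):
    """Naive O(N^2) reference for the power angular kernel (no sketching);
    gamma and the row normalization are illustrative assumptions."""
    q_hat = torch.nn.functional.normalize(q, dim=-1)   # unit-normalize queries
    k_hat = torch.nn.functional.normalize(k, dim=-1)   # unit-normalize keys
    cos = (q_hat @ k_hat.transpose(-1, -2)).clamp(-1 + eps, 1 - eps)
    sim = 1.0 - torch.arccos(cos) / torch.pi           # angular similarity in [0, 1]
    w = sim ** gamma                                   # larger gamma -> sharper weights
    return w / w.sum(dim=-1, keepdim=True)             # normalize rows like softmax

# usage: out = angular_kernel_weights(q, k) @ v   with q, k, v of shape (N, d)
```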
Key Designs¶
- Power Angular Kernel: Angular similarity raised to the \(\gamma\)-th power replaces softmax; larger \(\gamma\) yields sharper attention. Since the LSH collision probability equals angular similarity, RACE sketch theory applies directly.
- Differentiable RACE Sketch: Hard SimHash is replaced by sigmoid-based soft assignment, preserving approximation quality while enabling gradient-based training. Averaging over \(L\) independent hash tables reduces variance (a minimal code sketch follows this list).
- Causal RACE: Causal bucket counters are maintained via prefix-sum streaming, supporting autoregressive language modeling.
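Below is a minimal non-causal sketch of the soft RACE estimator in PyTorch, reflecting my reading of the bullets above rather than the authors' implementation: keys and values are softly assigned to \(R = 2^p\) buckets in each of \(L\) tables, and each query reads the aggregated bucket statistics. Concatenating \(p\) hyperplanes per table raises the single-plane collision probability to the \(p\)-th power, which (as I understand it) is where the power kernel comes from. All hyperparameter values are illustrative.

```python
import torch

def soft_race_attention(q, k, v, L=8, p=4, tau=0.1, seed=0):
    """Minimal non-causal sketch of differentiable RACE attention (an illustrative
    reading, not the authors' implementation). L hash tables, R = 2**p buckets per
    table, sigmoid temperature tau; all defaults are assumptions."""
    N, d = k.shape
    g = torch.Generator().manual_seed(seed)
    planes = torch.randn(L, p, d, generator=g)                       # SimHash hyperplanes
    # Soft bit probabilities for keys and queries: shape (L, N, p)
    kb = torch.sigmoid(torch.einsum('lpd,nd->lnp', planes, k) / tau)
    qb = torch.sigmoid(torch.einsum('lpd,nd->lnp', planes, q) / tau)

    # Soft assignment over R = 2**p buckets per table: shape (L, N, R)
    codes = ((torch.arange(2 ** p)[:, None] >> torch.arange(p)) & 1).float()  # (R, p)
    def bucket_probs(bits):
        return torch.exp(
            torch.einsum('lnp,rp->lnr', torch.log(bits + 1e-9), codes)
            + torch.einsum('lnp,rp->lnr', torch.log(1 - bits + 1e-9), 1 - codes))

    k_assign, q_assign = bucket_probs(kb), bucket_probs(qb)
    # Per-bucket value sums and counts (the RACE sketch): O(N * L * R) work
    bucket_v = torch.einsum('lnr,nd->lrd', k_assign, v)
    bucket_c = k_assign.sum(dim=1)                                   # (L, R)
    # Each query reads its buckets; averaging over L tables reduces variance
    num = torch.einsum('lnr,lrd->nd', q_assign, bucket_v) / L
    den = torch.einsum('lnr,lr->n', q_assign, bucket_c) / L
    return num / den.clamp_min(1e-9).unsqueeze(-1)

# usage: out = soft_race_attention(q, k, v)   with q, k, v of shape (N, d)
```

The causal variant from the last bullet would replace the full-sequence sums `bucket_v` and `bucket_c` with prefix sums over positions \(j \le i\), so each query only sees statistics of earlier tokens.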
Loss & Training¶
RACE Attention serves as a drop-in replacement for softmax attention and is trained with standard cross-entropy loss. \(L\) (number of hash tables) and \(R\) (buckets per table) govern the variance–accuracy trade-off.
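As a toy sanity check (not an experiment from the paper), the snippet below illustrates why more hash tables help: each random SimHash hyperplane gives an unbiased 0/1 estimate of the angular similarity, and averaging over \(L\) independent hyperplanes shrinks the squared error roughly as \(1/L\).

```python
import torch

# Toy check (not from the paper): collision frequency of L random hyperplanes
# estimates the angular similarity, with squared error dropping roughly as 1/L.
torch.manual_seed(0)
d, trials = 64, 2000
q, k = torch.randn(d), torch.randn(d)
true_sim = 1 - torch.arccos(torch.nn.functional.cosine_similarity(q, k, dim=0)) / torch.pi

for L in (1, 8, 64):
    sq_err = []
    for _ in range(trials):
        planes = torch.randn(L, d)                                # L random hyperplanes
        collide = ((planes @ q > 0) == (planes @ k > 0)).float()  # 1 if q, k on same side
        sq_err.append((collide.mean() - true_sim) ** 2)
    print(L, float(torch.stack(sq_err).mean()))                   # MSE drops ~ 1/L
```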
Key Experimental Results¶
Main Results¶
| Method | Complexity | 64K Support | Accuracy |
|---|---|---|---|
| Softmax (FA2) | \(O(N^2)\) | OOM | Baseline |
| Linear Attn | \(O(N)\) | ✓ | Poor |
| Performer | \(O(Nd^2)\) | Partial | Poor |
| RACE | \(O(N)\) | ✓ | ≈ Baseline |
Scalability¶
| Hardware | Max Tokens (Softmax) | Max Tokens (RACE) |
|---|---|---|
| GH200 (96GB) | ~4M | 12M |
| CPU | N/A | 75M |
Key Findings¶
- RACE achieves lower wall-clock time than FlashAttention-2 at 64K with comparable accuracy.
- RACE outperforms Linformer in accuracy while using 13% fewer parameters.
- The \(\gamma\) parameter controls sharpness; excessively large values increase variance, requiring more hash tables to compensate.
- CPU training support opens new possibilities for long-context training without GPUs.
Highlights & Insights¶
- Elegant theoretical chain: angular kernel → LSH collision probability → RACE sketch → linear-time attention, with theoretical guarantees at every step.
- Truly linear time with small constants: each query interacts with \(S = L \times R\) buckets rather than \(N\) keys, yielding substantial practical speedups (a back-of-the-envelope comparison follows this list).
- Training on 75M tokens on CPU is a distinctive contribution, decoupling long-context research from GPU availability.
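To make the small-constants claim concrete, here is a back-of-the-envelope comparison using assumed (not reported) values of \(N\), \(d\), \(L\), and \(R\):

```python
# Illustrative arithmetic only; N, d, L, R below are assumptions, not the paper's settings.
N = 64_000          # sequence length
d = 64              # head dimension
L, R = 16, 64       # hash tables and buckets per table

softmax_scores = N * N            # entries in a full attention score matrix
race_state = L * R * (d + 1)      # per-head bucket value sums plus counts

print(f"softmax score entries: {softmax_scores:,}")  # 4,096,000,000
print(f"RACE bucket state:     {race_state:,}")      # 66,560
```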
Limitations & Future Work¶
- Validated only on ~120M-parameter models; effectiveness at larger scales remains unknown.
- \(\gamma\), \(L\), and \(R\) require manual tuning; no automatic selection strategy is currently available.
- Integration with sparse attention is a promising future direction.
Related Work & Insights¶
- vs. FlashAttention: FA optimizes constant factors but retains \(O(N^2)\) complexity; RACE achieves genuine \(O(N)\).
- vs. YOSO: Both employ angular kernels with LSH, but RACE uses soft LSH with theoretical guarantees and supports causal modeling.
- vs. Performer: Performer is quadratic in embedding dimension and exhibits inferior accuracy.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The combination of angular kernel and RACE sketch constitutes a wholly novel approach.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-task evaluation, scalability analysis, and theoretical support; large-model validation is lacking.
- Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear and well-organized.
- Value: ⭐⭐⭐⭐⭐ — Significant practical value for ultra-long-context training.