MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention¶
Conference: NeurIPS 2025 arXiv: 2505.18698 Code: GitHub Area: LLM/NLP Keywords: efficient attention, Monarch matrices, sub-quadratic attention, structured matrices, hardware-aware, zero-shot conversion
TL;DR¶
This paper proposes MonarchAttention, which leverages the structured properties of Monarch matrices and employs alternating optimization over a variational form of softmax to approximate attention at \(\Theta(N\sqrt{N}d)\) complexity. The method enables zero-shot replacement of attention layers in pretrained Transformers without any additional training, while achieving 1.4×–8.2× speedups over FlashAttention-2 on GPU.
Background & Motivation¶
The core attention mechanism in Transformers has \(\Theta(N^2 d)\) quadratic time complexity, which is a critical bottleneck for long-sequence training and inference. Existing sub-quadratic attention methods fall into several categories:
Low-rank methods (linear attention, Performer, etc.): hardware-friendly, but unsuitable as direct drop-in replacements for pretrained models, since pretrained attention matrices often exhibit strong diagonal structure that low-rank approximations capture poorly.
Sparse methods (LSH, fixed sparse masks): fixed patterns are data-independent, while data-dependent sparsity patterns are difficult to implement efficiently on GPU.
Low-rank + sparse: combined approaches improve accuracy but incur significant overhead.
These methods either require training from scratch or fine-tuning (limiting transferability), or fail to achieve practical GPU speedups (large gap between theoretical complexity and actual performance). The core motivation of MonarchAttention is to simultaneously achieve transferability (zero-shot replacement) and hardware efficiency (exploiting GPU tensor cores).
Method¶
Overall Architecture¶
MonarchAttention aims to find a Monarch matrix \(\mathbf{M} \in \mathbb{R}^{N \times N}\) such that \(\mathbf{M} \approx \text{softmax}(\mathbf{Q}\mathbf{K}^\top)\), and then efficiently compute \(\mathbf{O} = \mathbf{M}\mathbf{V}\).
Monarch matrix definition: Given \(N = m \times b\) (typically \(m = b = \sqrt{N}\)), a Monarch matrix is defined as \(\mathbf{M} = \mathbf{P}^\top \mathbf{B}\), where \(\mathbf{P}\) is a transpose permutation matrix and \(\mathbf{B}\) is a block matrix whose blocks are rank-one, \(\mathbf{B}_{jk} = \mathbf{L}_{jk}\mathbf{R}_{kj}^\top\). The whole matrix therefore requires only \(\Theta(N\sqrt{N})\) storage, and multiplying it by \(\mathbf{V} \in \mathbb{R}^{N \times d}\) requires only \(\Theta(N\sqrt{N}d)\) operations.
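For intuition, here is a minimal sketch (assumed tensor layout, names, and stride-permutation convention; not the paper's kernel) of how a Monarch-structured approximation is applied to \(\mathbf{V}\) using only batched small matrix products, which is what keeps the cost at \(\Theta(N\sqrt{N}d)\):

```python
# Illustrative sketch: O = Pᵀ B V for a Monarch matrix with rank-one blocks
# B_{jk} = L_blocks[j, k] ⊗ R_blocks[k, j]. The block layout and the
# reshape-transpose convention for the permutation P are assumptions.
import torch

def monarch_matmul(L_blocks, R_blocks, V, m, b):
    N, d = V.shape
    assert N == m * b
    V_blk = V.reshape(m, b, d)                          # rows of V split into m blocks of size b
    # Contract each V block with the right factors: S[k, j] = R_blocks[k, j] · V_blk[k]
    S = torch.einsum('kji,kid->kjd', R_blocks, V_blk)   # Θ(m·m·b·d) = Θ(N√N·d)
    # Expand with the left factors: (B V) block j = Σ_k L_blocks[j, k] ⊗ S[k, j]
    BV = torch.einsum('jkp,kjd->jpd', L_blocks, S)      # Θ(m·m·b·d)
    # Apply Pᵀ (assumed reshape-transpose permutation of the rows)
    return BV.transpose(0, 1).reshape(N, d)

# Toy usage: N = 16 tokens, m = b = 4 blocks, head dimension d = 8
m = b = 4; N = m * b; d = 8
L_blocks = torch.softmax(torch.randn(m, m, b), dim=1)   # rows sum to 1 across blocks
R_blocks = torch.softmax(torch.randn(m, m, b), dim=-1)  # each right factor on the simplex
O = monarch_matmul(L_blocks, R_blocks, torch.randn(N, d), m, b)
print(O.shape)   # torch.Size([16, 8])
```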
Key Designs¶
Variational form of softmax: The method exploits the variational characterization of softmax,
\[\operatorname{softmax}(\mathbf{u}) = \arg\max_{\mathbf{a} \ge 0,\ \mathbf{1}^\top\mathbf{a} = 1} \ \langle \mathbf{a}, \mathbf{u} \rangle + H(\mathbf{a}),\]
where \(H(\mathbf{a}) = -\sum_i \mathbf{a}_i \log \mathbf{a}_i\) is the Shannon entropy. Computing the attention matrix is thereby reformulated as the row-wise constrained optimization problem
\[\operatorname{softmax}(\mathbf{Q}\mathbf{K}^\top) = \arg\max_{\mathbf{A} \ge 0,\ \mathbf{A}\mathbf{1} = \mathbf{1}} \ \langle \mathbf{A}, \mathbf{Q}\mathbf{K}^\top \rangle + \sum_i H(\mathbf{A}_{i,:}).\]
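As a quick sanity check of this characterization (an illustrative snippet, not part of the method), one can verify numerically that softmax attains the maximum and that the optimal value is the log-partition function:

```python
# softmax(u) maximizes <a, u> + H(a) over the probability simplex,
# and the maximum value equals logsumexp(u).
import torch

def objective(a, u, eps=1e-12):
    return (a * u).sum() - (a * (a + eps).log()).sum()   # <a, u> + H(a)

torch.manual_seed(0)
u = torch.randn(8)
a_star = torch.softmax(u, dim=0)
dirichlet = torch.distributions.Dirichlet(torch.ones(8))       # random points on the simplex
best_random = max(objective(dirichlet.sample(), u) for _ in range(10_000))
print(bool(objective(a_star, u) >= best_random))                         # True
print(torch.allclose(objective(a_star, u), torch.logsumexp(u, dim=0)))   # True
```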
Exploiting low-dimensional structure: When \(\mathbf{A}\) is restricted to have Monarch structure, this objective decomposes into multiple independent small-scale subproblems over the block factors \(\mathbf{L}\) and \(\mathbf{R}\). Due to the block rank-one structure, the entropy term is also separable, and each subproblem requires only \(\Theta((m+b)d)\) operations.
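To make the separability concrete, consider a single row of a rank-one block, which has the form \(\ell_i \mathbf{r}\) (adapted notation; assuming the right factor \(\mathbf{r}\) is normalized to sum to one):

\[
H(\ell_i \mathbf{r}) = -\sum_{p} \ell_i r_p \log(\ell_i r_p)
= -\ell_i \log \ell_i \sum_{p} r_p - \ell_i \sum_{p} r_p \log r_p
= -\ell_i \log \ell_i + \ell_i\, H(\mathbf{r}).
\]

Each \(H(\mathbf{r})\) is shared by all rows of its block and can be precomputed, so the per-row cost of the entropy term drops from \(\Theta(mb)\) to \(\Theta(m+b)\).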
Alternating maximization: The objective is concave in \(\mathbf{R}\) for fixed \(\mathbf{L}\) (and vice versa), so each subproblem admits a closed-form update derived from the KKT conditions. With the identity-like initialization \(\mathbf{L}_{jkl}^{(0)} = \delta_{kl}\), \(T\) steps of alternating optimization yield \(\mathbf{M}^{(T)} \approx \sigma(\mathbf{Q}\mathbf{K}^\top)\); a sketch of the updates follows below.
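Below is a didactic sketch of these alternating updates. It is a re-derivation under an assumed block layout and permutation convention, so the exact formulas should be read as illustrative rather than as the paper's implementation; the actual fused kernel computes the updates without materializing any of these tensors in HBM.

```python
# Alternating maximization of the variational objective over the Monarch factors.
# Shapes: Q, K are (N, d) with N = m·b; L is (m, m, b) with L[j, k] the left vector of
# block (j, k); R is (m, m, b) with R[k, j] the right (simplex) vector of block (j, k).
import torch

def monarch_attention_factors(Q, K, m, b, T=1):
    N, d = Q.shape
    Qb = Q.reshape(m, b, d)          # assumed block layout of queries
    Kb = K.reshape(m, b, d)          # assumed block layout of keys
    # Initialization L[j, k, l] = δ_{kl}: each query row starts attending to "its own" block.
    L = torch.eye(m, b).expand(m, m, b).clone()
    R = None
    for _ in range(T):
        # R-update: r_{kj} = softmax( K_k Qb_jᵀ ℓ_{jk} / Σ_l ℓ_{jk,l} ), the closed-form
        # maximizer of the block subproblem  <g, r> + c·H(r)  over the simplex.
        g = torch.einsum('jkl,jld,kpd->kjp', L, Qb, Kb)
        c = L.sum(dim=-1).transpose(0, 1).clamp_min(1e-12)
        R = torch.softmax(g / c[..., None], dim=-1)
        # L-update: ℓ row i of block-row j = softmax over blocks k of
        #           Qb_{j,i}·(K_kᵀ r_{kj}) + H(r_{kj}).
        ctx = torch.einsum('kjp,kpd->kjd', R, Kb)                  # K_kᵀ r_{kj}
        ent = -(R * R.clamp_min(1e-12).log()).sum(dim=-1)          # H(r_{kj})
        scores = torch.einsum('jld,kjd->jkl', Qb, ctx) + ent.transpose(0, 1)[..., None]
        L = torch.softmax(scores, dim=1)                           # normalize across blocks k
    return L, R   # plug into the Monarch multiply above: O ≈ softmax(QKᵀ) V
```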
Loss & Training¶
MonarchAttention is an inference-time optimization method that involves no training loss. Its core procedure relies on alternating maximization of a variational objective to approximate the attention matrix.
IO-optimized implementation: \(\mathbf{L}\) and \(\mathbf{R}\) are never materialized in HBM; only the state variables \(\alpha_R, \alpha_L, c_R, c_L\) are maintained (an additional \(\Theta(Nd)\) memory). All intermediate values live only in on-chip SRAM, giving IO savings analogous to FlashAttention but with an effective sequence length of \(\sqrt{N}\), which eliminates tiling along the sequence dimension. The resulting IO complexity is \(\Theta(Nd)\), improving upon FlashAttention's worst case of \(O(N^2 d^2 / S)\), where \(S\) is the SRAM size.
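As a rough sense of scale (illustrative arithmetic only; the SRAM size \(S\), counted in elements, is an assumed value):

```python
# Back-of-envelope comparison of the IO bounds quoted above.
N, d, S = 16_384, 64, 128 * 1024    # sequence length, head dim, assumed on-chip elements
flash_io = N**2 * d**2 / S          # FlashAttention worst case: O(N² d² / S)
monarch_io = N * d                  # MonarchAttention: Θ(N d)
print(flash_io / monarch_io)        # = N·d / S = 8.0 here; the gap grows linearly with N
```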
Key Experimental Results¶
Main Results¶
| Task | Model | Method | FLOPs Reduction | Performance Drop |
|---|---|---|---|---|
| Image Classification (ImageNet) | ViT-B (87M) | MonarchAttention | 80% | Top-5 accuracy drop of only 5% |
| Image Classification (ImageNet) | ViT-B | MonarchAttention | 50% | On par with baseline |
| QA (SQuAD) | RoBERTa-B (125M) | MonarchAttention | 60% | F1 drop of only 10 points |
| QA (SQuAD) | RoBERTa-B | MonarchAttention | 35% | On par with baseline |
| Summarization (BookSum) | BART-B (139M) | MonarchAttention (N=8192) | Similar FLOPs as softmax N=2048 | ROUGE-1 +0.75, ROUGE-L +0.5 |
Image Generation (DiT-XL 675M, ImageNet):
| Replaced Layers | Method | FLOPs (×10⁹) | FID ↓ | sFID ↓ |
|---|---|---|---|---|
| All | Nyströmformer | 3.30 | 5.97 | 13.47 |
| All | MonarchAttention | 3.44 | 2.82 | 5.09 |
| First half | Nyströmformer | 5.88 | 8.17 | 19.01 |
| First half | MonarchAttention | 5.95 | 0.39 | 0.66 |
| Second half | Nyströmformer | 5.88 | 6.76 | 13.58 |
| Second half | MonarchAttention | 5.95 | 1.98 | 3.36 |
Ablation Study¶
Speed benchmarks (NVIDIA A40 GPU):
| Sequence Length N | MonarchAttention vs. FlashAttention-2 Speedup |
|---|---|
| 256 | 1.4× |
| 4096 | 4.5× |
| 16384 | 8.2× |
Effect of iteration count \(T\) on ViT accuracy: larger \(T \in \{1, 2, 3\}\) yields higher accuracy at greater FLOPs cost; \(T=1\) already achieves a favorable accuracy–efficiency trade-off.
Key Findings¶
- MonarchAttention substantially outperforms low-rank baselines (Performer, Nyströmformer) across all tasks, with particularly large margins on image generation.
- On the summarization task, MonarchAttention achieves a better ROUGE vs. FLOPs trade-off than softmax attention by efficiently processing longer sequences.
- Replacing only the first half of DiT layers yields FID of 0.39, suggesting that attention in earlier layers is more amenable to Monarch matrix approximation.
- Even at short sequences (N=256), a 1.4× speedup is achieved thanks to a fully fused Triton kernel implementation.
Highlights & Insights¶
- Elegant use of the variational form: Reformulating softmax approximation as constrained optimization avoids direct computation of the \(N \times N\) attention matrix.
- Entropy separability of block rank-one structure: The block rank-one property of Monarch matrices enables separable computation of the entropy term, reducing complexity from \(\Theta(mb)\) to \(\Theta(m+b)\)—a key theoretical contribution.
- Hardware-aware design: Monarch matrices exploit GPU tensor cores via batched dense matrix multiplications, avoiding the memory-unfriendly access patterns of FFT-based algorithms.
- IO complexity strictly superior to FlashAttention: \(\Theta(Nd)\) in the worst case vs. \(O(N^2 d^2 / S)\).
Limitations & Future Work¶
- Not applicable to autoregressive generation: It cannot be used for token-by-token decoding, since each decoding step produces only a single new query row, so there is no complete attention matrix to approximate with a Monarch structure.
- Difficulty integrating causal masks: How to efficiently incorporate causal masking into MonarchAttention remains unclear.
- Uniform block size allocation: Monarch matrices assign equal capacity to each block, yet different regions of the attention matrix may vary in complexity.
- Scalability: For extremely long sequences where \(\sqrt{N}d > S\) (SRAM size), tiling along the sequence dimension is still required.
Related Work & Insights¶
- FlashAttention: MonarchAttention can be viewed as a block-sparse generalization of FlashAttention, where each optimization step is expressible as a FlashAttention-style computation.
- Monarch Mixer: Although it also uses Monarch matrices for token mixing, it is not data-dependent; MonarchAttention preserves the data-dependent nature of attention.
- \(\alpha\)-entmax: Produces sparser attention matrices than softmax and may be better suited for block rank-one approximation, representing a promising future direction.
- The method remains applicable to diffusion language models (non-autoregressive) and to accelerating the prefill stage.
Rating¶
- ⭐⭐⭐⭐ Novelty: The combination of the variational form of softmax with Monarch structured matrices for sub-quadratic attention approximation is highly elegant.
- ⭐⭐⭐⭐ Experimental Thoroughness: Covers diverse architectures and tasks across vision (ViT, DiT) and language (RoBERTa, BART), providing strong empirical support.
- ⭐⭐⭐⭐⭐ Value: Zero-shot replacement + practical GPU speedups + open-source code yield extremely high engineering value.
- ⭐⭐⭐ Limitations & Future Work: Inapplicability to autoregressive LLM inference—the most dominant current use case—limits broader impact.
Overall: ⭐⭐⭐⭐ (4/5) — A high-quality work with elegant theory, solid experiments, and strong engineering practicality. The central limitation is incompatibility with autoregressive decoding, though the method remains highly valuable for prefill, encoder-only models, and diffusion models.