Native Hybrid Attention for Efficient Sequence Modeling
- Conference: ACL 2026
- arXiv: 2510.07019
- Code: GitHub
- Area: LLM Efficiency / Attention Mechanism
- Keywords: Hybrid Attention, Linear Attention, Sliding Window, Long-Short Memory Fusion, Efficient Sequence Modeling
TL;DR
Native Hybrid Attention (NHA) concatenates long-term memory slots from a gated linear RNN with the precise recent tokens of a sliding window and processes both through a single softmax attention. This yields native intra-layer and inter-layer hybridization: long- and short-term attention weights are allocated dynamically, without extra fusion parameters, and NHA outperforms Transformer and other hybrid baselines on recall-intensive and commonsense reasoning tasks.
Method
Key Designs
- Intra-Layer Hybrid (Unified Softmax Fusion): long-term memory slots produced by a gated linear RNN are concatenated with the sliding-window KV cache and processed by a single softmax. The resulting weights depend on query-key similarity, giving per-token, per-head context-aware weighting with zero extra fusion parameters (see the sketch after this list).
- Inter-Layer Hybrid (Window Size Tuning): all NHA layers share the same architecture; only the window size \(w\) controls behavior (\(w=0\) gives a pure linear-RNN layer, \(w=N\) gives full attention). This enables zero-cost inference-time architecture search.
- Chunkwise Parallel Computation: an efficient GPU implementation via Triton kernels keeps complexity near linear (a toy chunkwise recurrence is sketched below).
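To make the intra-layer fusion concrete, here is a minimal single-head sketch (not the paper's code): `nha_attention`, `mem_k`/`mem_v`, and all shapes are illustrative assumptions, and the memory slots are taken as given rather than computed by the actual gated linear RNN.

```python
# Minimal single-head sketch of NHA-style unified softmax fusion (illustrative only).
import torch
import torch.nn.functional as F

def nha_attention(q, k_win, v_win, mem_k, mem_v, scale=None):
    """One query attends over [memory slots ; sliding-window tokens] with a single softmax.

    q:      (d,)     current query
    k_win:  (w, d)   keys of the last w tokens (w = window size)
    v_win:  (w, d)   values of the last w tokens
    mem_k:  (m, d)   long-term memory slot keys from the gated linear RNN
    mem_v:  (m, d)   long-term memory slot values
    """
    scale = scale or q.shape[-1] ** -0.5
    # Concatenate long-term slots and short-term exact tokens into one KV set.
    k = torch.cat([mem_k, k_win], dim=0)       # (m + w, d)
    v = torch.cat([mem_v, v_win], dim=0)       # (m + w, d)
    # A single softmax allocates weight between long- and short-term context
    # purely from query-key similarity: no extra fusion parameters.
    attn = F.softmax((k @ q) * scale, dim=0)   # (m + w,)
    return attn @ v                            # (d,)
```

Because the softmax normalizes over the concatenated set, the long/short weighting is re-decided for every query and head, which is exactly why no learned fusion gate is needed.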
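And a toy version of the chunkwise memory recurrence, assuming a single scalar forget gate per token and one \((d_k, d_v)\) state; the real implementation uses Triton kernels, multiple memory slots, and per-token decay inside each chunk, so treat this purely as a shape-level illustration.

```python
# Toy chunkwise recurrence for a gated linear-RNN memory state (illustrative only).
import torch

def chunkwise_memory(k, v, gates, chunk=64):
    """Maintain a (d_k, d_v) memory state S chunk by chunk: S <- g * S + K_c^T V_c.

    k, v:  (T, d_k), (T, d_v)  keys/values of the full sequence
    gates: (T,)                per-token forget gates in (0, 1)
    Returns the memory state after each chunk, (num_chunks, d_k, d_v).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v)
    states = []
    for s in range(0, T, chunk):
        k_c, v_c, g_c = k[s:s + chunk], v[s:s + chunk], gates[s:s + chunk]
        # Decay the carried-over state by the product of this chunk's gates, then
        # absorb the chunk's keys/values with one dense matmul. (A faithful version
        # would also decay each k_i v_i^T term by the gates of later tokens in the chunk.)
        S = g_c.prod() * S + k_c.t() @ v_c
        states.append(S.clone())
    return torch.stack(states)   # per-chunk memory snapshots
```

The point of chunking is that the work inside a chunk is one dense matmul, so the sequential recurrence only runs at chunk granularity, which is what keeps the overall cost near linear on GPUs.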
Key Experimental Results
| Model | Commonsense Avg↑ | Recall-Intensive Avg↑ |
|---|---|---|
| Trans++ | 50.71 | 37.31 |
| GSA-H | 50.76 | 44.99 |
| NHA | 52.89 | 46.43 |
Highlights & Insights
- Unified softmax fusion is the core innovation: it shifts long-short fusion from explicitly learned mixing parameters to implicit allocation inside the softmax
- "Architecture duality" is highly practical — same model can zero-cost switch between different efficiency-accuracy configurations at inference time
Rating
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐