Native Hybrid Attention for Efficient Sequence Modeling

Conference: ACL 2026
arXiv: 2510.07019
Code: GitHub
Area: LLM Efficiency / Attention Mechanism
Keywords: Hybrid Attention, Linear Attention, Sliding Window, Long-Short Memory Fusion, Efficient Sequence Modeling

TL;DR

Native Hybrid Attention (NHA) concatenates the long-term memory slots of a linear RNN with the exact short-term tokens of a sliding window and processes both through a single softmax attention. This gives native intra-layer and inter-layer hybridization: attention weight is allocated dynamically between long-term and short-term context without extra fusion parameters. NHA outperforms Transformer and other hybrid baselines on recall-intensive and commonsense reasoning tasks.

Method

Key Designs

Intra-Layer Hybrid (Unified Softmax Fusion): Long-term memory slots maintained by a gated linear RNN are concatenated with the sliding-window KV cache, and a single softmax is applied over the combined set. The weights depend only on query-key similarity, so the long-short weighting is context-aware per token and per head, with zero extra parameters.
Inter-Layer Hybrid (Window Size Tuning): All NHA layers share the same architecture; only the window size \(w\) controls behavior. Setting \(w=0\) gives a pure linear RNN, and \(w=N\) (with \(N\) the sequence length) gives full attention. This supports zero-cost architecture search at inference time.
Chunkwise Parallel Computation: An efficient GPU implementation built on Triton kernels that maintains near-linear complexity.
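
As a minimal sketch of the unified softmax fusion described above, the pure-Python function below scores one query against the memory slots and the window tokens in a single softmax. The names `nha_attend`, `mem_keys`, `mem_values`, and `window` are illustrative assumptions. The memory slots are taken as given rather than produced by the gated linear RNN, and the scaled dot-product scoring is the standard attention form, not a detail confirmed by this summary. This is not the paper's Triton implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nha_attend(query, mem_keys, mem_values, keys, values, window):
    """Attend over memory slots plus the last `window` exact tokens.

    Returns (output, long_term_mass, short_term_mass). All names and the
    slot layout are assumptions made for illustration.
    """
    # Short-term part: the most recent `window` tokens (none when window == 0).
    win_keys = keys[-window:] if window > 0 else []
    win_values = values[-window:] if window > 0 else []
    # Concatenate long-term slots and short-term tokens into one key/value set.
    all_keys = list(mem_keys) + win_keys
    all_values = list(mem_values) + win_values
    if not all_keys:
        raise ValueError("need at least one memory slot or window token")
    # A single softmax decides the long/short split; no extra fusion weights.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in all_keys])
    dim = len(all_values[0])
    output = [sum(w * v[i] for w, v in zip(weights, all_values)) for i in range(dim)]
    long_mass = sum(weights[:len(mem_keys)])
    return output, long_mass, 1.0 - long_mass

# Toy data: one memory slot and two exact tokens, all 2-dimensional.
query = [1.0, 0.0]
mem_keys, mem_values = [[0.0, 1.0]], [[10.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[0.0, 1.0], [0.0, 2.0]]
for w in (0, 2):
    out, long_mass, short_mass = nha_attend(query, mem_keys, mem_values, keys, values, w)
    print(f"w={w}: long={long_mass:.3f}, short={short_mass:.3f}")
```

With `window=0` only the memory slots are attended to, which corresponds to the pure linear RNN end of the window-size setting. A larger window lets exact recent tokens compete for attention mass inside the same softmax.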

Key Experimental Results

Model	Commonsense Avg↑	Recall-Dense Avg↑
Trans++	50.71	37.31
GSA-H	50.76	44.99
NHA	52.89	46.43
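
The margins implied by the table can be checked with a few lines of arithmetic. The dictionary below only copies the reported averages; the helper name `margin` is an assumption for illustration.

```python
# Averages copied from the table above: (commonsense, recall-dense).
results = {
    "Trans++": (50.71, 37.31),
    "GSA-H": (50.76, 44.99),
    "NHA": (52.89, 46.43),
}

def margin(model, baseline):
    """Return the gains of `model` over `baseline`, rounded to 2 decimals."""
    return tuple(round(m - b, 2) for m, b in zip(results[model], results[baseline]))

for baseline in ("Trans++", "GSA-H"):
    commonsense, recall = margin("NHA", baseline)
    print(f"NHA vs {baseline}: commonsense +{commonsense}, recall-dense +{recall}")
```

NHA's gain over Trans++ is 2.18 points on commonsense and 9.12 points on recall-dense; over GSA-H it is 2.13 and 1.44 points respectively.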

Highlights & Insights

Unified softmax fusion is the core innovation: fusion moves from explicitly learned parameters to implicit allocation by the softmax.
The "architecture duality" is highly practical: the same model can switch between different efficiency-accuracy configurations at inference time at zero cost.

Rating

Novelty: ⭐⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐⭐
Value: ⭐⭐⭐⭐⭐