Native Hybrid Attention for Efficient Sequence Modeling

Conference: ACL 2026 | arXiv: 2510.07019 | Code: GitHub
Area: LLM Efficiency / Attention Mechanism
Keywords: Hybrid Attention, Linear Attention, Sliding Window, Long-Short Memory Fusion, Efficient Sequence Modeling

TL;DR

Native Hybrid Attention (NHA) concatenates long-term memory slots from a linear RNN with the precise recent tokens of a sliding window and processes them through a single softmax attention, achieving native intra-layer and inter-layer hybridization. Long-short attention weights are allocated dynamically without any extra fusion parameters, and NHA outperforms Transformer and other hybrid baselines on recall-intensive and commonsense reasoning tasks.

Method

Key Designs

  1. Intra-Layer Hybrid – Unified Softmax Fusion: Long-term memory from a gated linear RNN is concatenated with the sliding-window KV cache and processed by a single softmax. The fusion weights depend on query-key similarity, giving per-token, per-head, context-aware weighting with zero extra parameters (a minimal sketch follows this list).

  2. Inter-Layer Hybrid – Window Size Tuning: All NHA layers share the same architecture; only the window size \(w\) controls behavior (\(w=0\) yields a pure linear RNN, \(w=N\) full attention). This supports zero-cost inference-time architecture search.

  3. Chunkwise Parallel Computation: Efficient GPU implementation via Triton kernels maintains near-linear complexity (see the chunkwise sketch below).
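To make the unified softmax fusion of items 1 and 2 concrete, here is a minimal single-head, single-query-step PyTorch sketch. The slot count `S`, window size `w`, the dimensions, and the random tensors standing in for the gated linear-RNN state are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of NHA-style intra-layer fusion (single head, one query step).
# Shapes and placeholder tensors are assumptions for illustration only.
import torch
import torch.nn.functional as F

d_k, d_v = 64, 64   # key / value dimensions
S = 16              # number of long-term memory slots (linear-RNN states)
w = 128             # sliding-window size: w = 0 -> pure linear RNN,
                    # w = sequence length -> full softmax attention

# Long-term memory: S key/value slots maintained by a gated linear RNN
# (random placeholders here; in NHA these are recurrent states).
mem_k, mem_v = torch.randn(S, d_k), torch.randn(S, d_v)

# Short-term memory: exact keys/values of the last w tokens.
win_k, win_v = torch.randn(w, d_k), torch.randn(w, d_v)

q = torch.randn(d_k)                           # current query

# One softmax over the concatenation of memory slots and window tokens:
# the long/short allocation falls out of the attention weights themselves,
# so no extra fusion parameters are needed.
keys = torch.cat([mem_k, win_k], dim=0)        # (S + w, d_k)
values = torch.cat([mem_v, win_v], dim=0)      # (S + w, d_v)
attn = F.softmax(keys @ q / d_k ** 0.5, dim=0)
out = attn @ values                            # (d_v,) fused long-short output
```

Setting `w = 0` drops the window branch and leaves a pure linear-RNN readout, while letting `w` cover the whole sequence recovers full softmax attention, which is exactly the inter-layer knob of item 2.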
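For item 3, the sketch below shows the generic (ungated) chunkwise-parallel linear attention recurrence that underlies this style of near-linear computation. The paper's implementation uses gated updates and fused Triton kernels, so the function `chunkwise_linear_attention` and its arguments here are an assumed simplification, not the paper's kernel.

```python
# Generic (ungated) chunkwise-parallel causal linear attention, for illustration.
import torch

def chunkwise_linear_attention(q, k, v, chunk=64):
    """q, k: (T, d_k), v: (T, d_v); causal linear attention in O(T * d_k * d_v)."""
    T, d_k = q.shape
    d_v = v.shape[-1]
    state = torch.zeros(d_k, d_v)          # running sum of k^T v over past chunks
    outputs = []
    for s in range(0, T, chunk):
        qc, kc, vc = q[s:s + chunk], k[s:s + chunk], v[s:s + chunk]
        inter = qc @ state                  # contribution of all previous chunks
        mask = torch.tril(torch.ones(len(qc), len(qc)))
        intra = (qc @ kc.T * mask) @ vc     # causal contribution within this chunk
        outputs.append(inter + intra)
        state = state + kc.T @ vc           # fold this chunk into the running state
    return torch.cat(outputs, dim=0)        # (T, d_v)

out = chunkwise_linear_attention(torch.randn(256, 64), torch.randn(256, 64),
                                 torch.randn(256, 64))
```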

Key Experimental Results

| Model   | Commonsense Avg ↑ | Recall-Intensive Avg ↑ |
|---------|-------------------|------------------------|
| Trans++ | 50.71             | 37.31                  |
| GSA-H   | 50.76             | 44.99                  |
| NHA     | 52.89             | 46.43                  |

Highlights & Insights

  • Unified softmax fusion is the core innovation: fusion shifts from explicitly learned parameters to implicit allocation by the softmax itself
  • The "architecture duality" is highly practical: the same model can switch between efficiency-accuracy configurations at inference time at zero cost

Rating

  • Novelty: ⭐⭐⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐
  • Writing Quality: ⭐⭐⭐⭐⭐
  • Value: ⭐⭐⭐⭐⭐