Star Attention: Efficient LLM Inference over Long Sequences
Conference: ICML2025
arXiv: 2411.17116
Code: GitHub - Star-Attention
Area: LLM/NLP
Keywords: Long-sequence inference, sparse attention, distributed inference, Anchor Block, KV cache
TL;DR
Proposes Star Attention, a two-phase block-sparse attention mechanism. In Phase 1, the context is partitioned into blocks that are encoded with local attention in parallel across multiple hosts. In Phase 2, query and response tokens attend to all cached context tokens through global attention. It works with existing LLMs trained with global attention without fine-tuning, and reports up to an 11x inference speedup while retaining 97-100% of baseline accuracy.
Background & Motivation
Key Challenge
Contexts spanning up to millions of tokens are increasingly common (e.g., code analysis, multi-document summarization), but the quadratic complexity of self-attention in sequence length makes inference extremely expensive.
Goal
Overcome the limitations of existing long-context inference approaches:
- FlashAttention: accelerates computation but does not reduce the quadratic complexity.
- Ring Attention: distributes computation across hosts but requires ring communication between them.
- Chunk-based encoding methods: require fine-tuning or extra components.
Core Insights of Star Attention
Inference typically consists of two phases: (1) encoding a long context, and (2) processing a short query and generating the response. Context tokens can be encoded with local attention alone, while query and response tokens need global attention over the whole context.
Method
Phase 1: Local Context Encoding + Anchor Block
- The context is split into contiguous blocks and distributed across different hosts.
- Every block except the first is prefixed with an "anchor block" (a copy of the first block).
- Each host performs self-attention independently without communication.
- Only the key-value (KV) states of the non-anchor tokens are cached; the anchor's KV states are discarded.
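The Phase 1 partitioning can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_phase1_inputs`, the dictionary keys, and the choice of an anchor equal in size to one block are assumptions made for the example.

```python
def build_phase1_inputs(tokens, block_size):
    """Split `tokens` into contiguous blocks, one per host.

    Every block except the first is prefixed with the anchor (a copy of
    block 0). `cache_from` marks where the non-anchor tokens start, i.e.
    the part whose KV states a host would keep.
    (Hypothetical helper; anchor size == block size is an assumption.)
    """
    if not tokens:
        return []
    blocks = [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
    anchor = blocks[0]
    host_inputs = []
    for idx, block in enumerate(blocks):
        if idx == 0:
            # The first block has no anchor; all of its KV states are cached.
            host_inputs.append({"input": list(block), "cache_from": 0})
        else:
            host_inputs.append(
                {"input": list(anchor) + list(block), "cache_from": len(anchor)}
            )
    return host_inputs


# Usage: 10 tokens, blocks of 4 -> three hosts.
hosts = build_phase1_inputs(list(range(10)), block_size=4)
cached = [t for h in hosts for t in h["input"][h["cache_from"]:]]
```

Concatenating the cached (non-anchor) parts across hosts recovers the original context exactly once, which is why discarding the anchor's KV states loses no context tokens.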
Phase 2: Global Query Encoding
- Queries are replicated to all hosts.
- Each host computes local attention using its local KV cache.
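- The "query host" aggregates the per-host softmax statistics to reconstruct the global attention output over all cached tokens.
- Minimal communication overhead: for each query token, each host transmits only one vector (its local attention output) and one scalar (its softmax normalizer).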
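The aggregation step can be illustrated with a small pure-Python sketch for a single query vector and a single attention head. The function names `local_attention` and `aggregate` are hypothetical, and the use of the log-sum-exp as the transmitted scalar is an assumption chosen for numerical stability; the point is that a weighted combination of per-host outputs reproduces softmax attention over the full KV set.

```python
import math


def local_attention(q, keys, values):
    """Softmax attention of one query over one host's KV shard.

    Returns (output_vector, log_sum_exp): the one vector and one scalar
    that a host would send to the query host.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max before exponentiating for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]
    return out, m + math.log(total)


def aggregate(partials):
    """Combine per-host (output, log_sum_exp) pairs into the global output."""
    m = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - m) for _, lse in partials]
    norm = sum(coeffs)
    dim = len(partials[0][0])
    return [
        sum(c * out[j] for c, (out, _) in zip(coeffs, partials)) / norm
        for j in range(dim)
    ]


# Usage: two hosts, each holding half of the keys and values.
q = [1.0, 0.5]
keys = [[0.1, 0.2], [0.3, -0.4], [1.0, 0.0], [-0.5, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
merged = aggregate([
    local_attention(q, keys[:2], values[:2]),
    local_attention(q, keys[2:], values[2:]),
])
reference, _ = local_attention(q, keys, values)  # attention over everything at once
```

Here `merged` and `reference` agree up to floating-point error, because each host's output is reweighted by its share of the global softmax denominator.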
Characteristics
- The context length scales horizontally (linearly) with the number of hosts.
- Requires no model fine-tuning.
- Fully compatible and stackable with FlashAttention.
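A rough back-of-the-envelope comparison shows where the Phase 1 savings come from. It is a sketch under simplifying assumptions that are not taken from the paper: context length $L$ split evenly over $h$ hosts, block size $b = L/h$, an anchor of the same size $b$, and cost measured only by the number of attention-score entries (ignoring MLP layers and the constant factor from causal masking).

$$
C_{\text{global}} \propto L^2, \qquad
C_{\text{host}} \propto (2b)^2 = \frac{4L^2}{h^2}, \qquad
\sum_{\text{hosts}} C_{\text{host}} \le h \cdot \frac{4L^2}{h^2} = \frac{4L^2}{h}
$$

If the block size $b$ is held fixed and hosts are added as the context grows, the per-host cost $(2b)^2$ stays constant, which is consistent with the claim that context length scales linearly with the number of hosts.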
Key Experimental Results
Inference Acceleration
Main Results
| Context Length | Number of Hosts | Speedup | Accuracy Retention |
|---|---|---|---|
| 128K | 4 | 4x | 99% |
| 256K | 8 | 7x | 98% |
| 512K | 16 | 11x | 97% |
Accuracy Comparison (Llama3.1-8B/70B)
Ablation Study
| Benchmark | Standard Attention | Star Attention |
|---|---|---|
| RULER | 100% | 97-100% |
| LongBench | Baseline | Close to Baseline |
| Needle-in-Haystack | 100% | 98%+ |
Key Findings
- The anchor block is critical; accuracy drops significantly without it.
- Local attention in Phase 1 is sufficient for most context understanding tasks.
- Global aggregation in Phase 2 requires minimal communication.
- Demonstrates effectiveness on both Llama3.1-8B and 70B.
Highlights & Insights
- The two-phase design of "local context + global query" is highly intuitive and natural.
- The introduction of the anchor block elegantly maintains global consistency.
- Zero fine-tuning required: plug-and-play with LLMs trained with global attention.
- Extremely low communication overhead (only 1 vector + 1 scalar per token from each host).
- Can be combined and stacked with FlashAttention.
Limitations & Future Work
- Accuracy may suffer a 2–3% drop in tasks with extremely strong position dependence.
- The choice of anchor block size affects performance and requires tuning.
- May not be optimal for tasks requiring global interaction across the entire context (e.g., full-text summarization).
- The single query host in Phase 2 could potentially become a bottleneck.
- The integration with KV cache compression methods remains unexplored.
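Related Work & Insights
- Difference from Ring Attention: Ring Attention passes KV blocks between hosts in a ring while computing attention, whereas Star Attention needs no inter-host communication in Phase 1 and only one aggregation step at the query host in Phase 2.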
- Difference from Longformer: Longformer requires fine-tuning, whereas Star Attention is applicable zero-shot.
- Insight: The two-phase design can be generalized to other long-sequence tasks.
Rating
- Novelty: 4.5/5 — Two-phase + anchor block design
- Experimental Thoroughness: 4.5/5 — Multi-model and multi-benchmark
- Writing Quality: 4.5/5
- Value: 5.0/5 — 11x speedup + plug-and-play
Supplementary Technical Details
Role of the Anchor Block
A copy of the first block is prepended to every subsequent block, so that each host can "see" the beginning of the global context. This design is motivated by the "attention sink" phenomenon, in which models assign disproportionate attention to the first tokens of a sequence.
Communication Overhead Analysis in Phase 2
Each context host only needs to transmit its local attention output and its softmax normalization statistic (one vector + one scalar per query token), so the per-host communication volume is independent of context length.
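To make the order of magnitude concrete, the following sketch counts the bytes sent to the query host for one query token. The helper `phase2_bytes_per_query_token` is hypothetical, and it assumes that the vector-plus-scalar pair is exchanged once per attention head and per layer and that both are sent at the same precision; the configuration values in the usage line are illustrative only.

```python
def phase2_bytes_per_query_token(num_context_hosts, num_layers, num_heads,
                                 head_dim, bytes_per_value=2):
    """Bytes received by the query host for a single query token.

    Assumption: each host sends, per layer and per head, one output vector
    of `head_dim` values plus one scalar normalizer. Context length does
    not appear anywhere in the formula.
    """
    values_per_head = head_dim + 1  # one vector + one scalar
    return (num_context_hosts * num_layers * num_heads
            * values_per_head * bytes_per_value)


# Illustrative configuration: 4 hosts, 32 layers, 32 heads, head_dim 128, 16-bit values.
total_bytes = phase2_bytes_per_query_token(4, 32, 32, 128)
print(f"{total_bytes / 2**20:.2f} MiB per query token")
```

Under these assumptions the volume grows with the number of hosts, layers, and heads, but not with the number of context tokens each host has cached.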
Typical Application Scenarios
Best suited for the "long context + short query + short answer" paradigm, such as RAG, document QA, and code analysis.