Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis¶
- Conference: ICLR 2026
- arXiv: 2601.21709
- Code: GitHub
- Area: Model Compression / Attention Mechanism Analysis / LLM Inference Acceleration
- Keywords: attention patterns, temporal analysis, RoPE, query self-similarity, KV cache compression, LLM pruning
TL;DR¶
This paper proposes TAPPA, a framework that offers a unified, temporal-continuity explanation for how the various attention patterns in LLMs form (attention sinks, diagonals, periodic patterns, etc.), and leverages query self-similarity (q-similarity) as a metric to guide KV cache compression and model pruning.
Background & Motivation¶
Attention heads in LLMs exhibit diverse structured patterns:

- Attention sinks: the first token receives anomalously high attention.
- Diagonal patterns: attention is focused on neighboring tokens.
- Retrieval heads: global scanning of the context.
- Periodic patterns: attention recurs at regular intervals.
Prior work typically analyzes individual patterns in isolation, lacking a unified explanation. The core question is: given the same attention formulation, what factors determine which attention pattern a given head adopts?
Method¶
Overall Architecture: TAPPA¶
TAPPA (Temporal Attention Pattern Predictability Analysis) categorizes attention patterns into two classes:

- Predictable patterns: exhibit temporal continuity, with attention metrics evolving smoothly across decoding steps.
- Unpredictable patterns: exhibit irregular jumps and lack temporal consistency (e.g., retrieval heads).
The key discriminating factor is query self-similarity (q-similarity).
Key Design 1: Predictable vs. Unpredictable Patterns¶
Proposition 4.1: If the difference between consecutive queries \(\|q_{t+1} - q_t\|\) is large and not orthogonal to the rotated keys, then the resulting difference in attention logits must also be large.
That is, low q-similarity leads to random patterns, while high q-similarity is a necessary condition for predictable patterns.
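A minimal sketch of the q-similarity metric, assuming it is the mean cosine similarity between consecutive per-step query vectors (the exact definition in the paper may differ in normalization or aggregation):

```python
import numpy as np

def q_similarity(queries: np.ndarray) -> float:
    """Mean cosine similarity between consecutive query vectors.

    queries: array of shape (T, d), one query vector per decoding step.
    """
    a, b = queries[:-1], queries[1:]
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    )
    return float(cos.mean())

rng = np.random.default_rng(0)
base = rng.normal(size=64)
# A "predictable" head: queries drift only slightly between steps.
smooth = np.stack([base + 0.01 * rng.normal(size=64) for _ in range(32)])
# An "unpredictable" (retrieval-like) head: queries jump randomly.
jumpy = rng.normal(size=(32, 64))
print(q_similarity(smooth) > q_similarity(jumpy))  # True
```

Under this definition, high q-similarity (near 1) flags a head as temporally continuous, hence a candidate for a predictable pattern.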
Key Design 2: Re-access Patterns (Attention Sink)¶
Theorem 5.1 (Vertical Stability of Attention): When queries are highly self-similar and a dominant low-frequency RoPE channel exists, the attention logits toward early tokens remain vertically stable as decoding proceeds.
When the angle \(\phi_{t,i}^{(m)}\) between \(q\) and \(k_i\) is small, the cosine term approaches 1, which explains the attention sink phenomenon.
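A toy numerical illustration of why a low-frequency channel yields vertical stability: on a single hypothetical 2D RoPE channel with a tiny rotary frequency, the logit toward the first token barely moves across thousands of decoding steps (the frequency and vectors below are illustrative assumptions, not values from the paper):

```python
import numpy as np

def rot(angle: float) -> np.ndarray:
    """2D rotation matrix, as applied by one RoPE channel."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta_low = 1e-5           # dominant low-frequency channel (illustrative)
q = np.array([1.0, 0.2])   # fixed query (high q-similarity limit)
k0 = np.array([0.9, 0.1])  # key of the first token (position 0)

# Logit toward token 0 at decoding step t: q^T R(theta * t) k0.
logits = [q @ rot(theta_low * t) @ k0 for t in range(0, 4096, 512)]
spread = max(logits) - min(logits)
print(spread)  # small: the first-token logit is vertically stable
```

With a higher-frequency channel (say `theta = 0.5`), the same logit oscillates strongly, which is why the theorem requires the dominant channel to be low-frequency.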
Key Design 3: Sequential Patterns (Diagonal)¶
Theorem 5.2: When both queries and keys exhibit high self-similarity (\(\|q_{t+1} - q_t\| \leq \varepsilon\) and \(\|k_{i+1} - k_i\| \leq \varepsilon\)), the attention logits are approximately constant along diagonals of the attention map.
RoPE's relative positional encoding preserves query–key interactions under synchronized position shifts, giving rise to diagonal patterns.
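This shift invariance can be checked numerically: with RoPE, the logit between a query and a key depends only on their relative offset, so shifting both positions by the same amount leaves it unchanged (a single-channel toy, with illustrative vectors):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, theta: float = 1.0) -> np.ndarray:
    """Apply one 2D RoPE rotation channel at position `pos`."""
    c, s = np.cos(theta * pos), np.sin(theta * pos)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

q = np.array([0.3, 1.1])
k = np.array([0.7, -0.4])

# The logit depends only on the relative offset (here 3), not on the
# absolute positions, so a synchronized shift of +100 changes nothing:
l1 = rope(q, 10) @ rope(k, 7)
l2 = rope(q, 110) @ rope(k, 107)
print(np.isclose(l1, l2))  # True
```

If queries and keys themselves barely change from step to step, every cell on a given diagonal is such a synchronized shift of its neighbor, which is exactly the diagonal pattern.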
Key Design 4: Periodic Sequential Patterns¶
Theorem 5.3: When a dominant RoPE channel \(m^\star\) exists, the diagonal spacing is \(T = 2\pi / \theta_{m^\star}\), where \(\theta_{m^\star}\) is that channel's rotary frequency.
This is verified experimentally: relocating the dominant channel to a low index (high-frequency) position induces the theoretically predicted periodic diagonals, and adjusting the RoPE base \(c\) controls the spacing accordingly.
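The predicted spacing is easy to compute under the standard RoPE parameterization \(\theta_m = c^{-2m/d}\) (the head dimension `d=128` below is an assumption for illustration):

```python
import math

def rope_freq(m: int, d: int = 128, base: float = 1_000_000) -> float:
    """Frequency of RoPE channel m under the standard parameterization."""
    return base ** (-2 * m / d)

def diagonal_period(m: int, d: int = 128, base: float = 1_000_000) -> float:
    """Predicted diagonal spacing T = 2*pi / theta_m from Theorem 5.3."""
    return 2 * math.pi / rope_freq(m, d, base)

# A low channel index means a high frequency and a short, visible period;
# a high index (e.g. 124) gives an astronomically long period, i.e. no
# observable periodicity within any realistic context length.
print(diagonal_period(2))
print(diagonal_period(124))

# Reducing the RoPE base also shortens the spacing, matching the paper's
# base-adjustment experiment:
print(diagonal_period(2, base=1_000_000) > diagonal_period(2, base=100_000))  # True
```

This is the arithmetic behind both validation experiments: channel relocation changes \(m^\star\), base adjustment changes \(c\), and either moves \(T\) exactly as the formula predicts.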
Key Design 5: Seasonal Patterns¶
Theorem 5.4: When queries and keys are approximately periodic with period \(L\) and resonate with the dominant RoPE frequency, the attention logits inherit this periodicity.
This produces seasonal attention patterns with period \(L\).
Downstream Applications¶
Q-similarity is used as a simple metric to guide:

- KV cache compression: heads with high q-similarity can be safely compressed.
- LLM pruning: identifying redundant heads suitable for pruning.
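A hedged sketch of how the metric could drive both applications: score each head by q-similarity and process the highest-scoring (most predictable) heads first. The scoring and ordering below are an illustrative heuristic consistent with the paper's description, not its exact algorithm:

```python
import numpy as np

def q_sim(Q: np.ndarray) -> float:
    """Mean cosine similarity of consecutive query vectors, shape (T, d)."""
    a, b = Q[:-1], Q[1:]
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return float((num / den).mean())

def compression_order(per_head_queries: dict) -> list:
    """Heads sorted by descending q-similarity: compress/prune the top first."""
    return sorted(per_head_queries,
                  key=lambda h: q_sim(per_head_queries[h]),
                  reverse=True)

rng = np.random.default_rng(1)
heads = {
    # Slowly drifting queries -> high q-similarity -> safe to compress.
    "predictable": 1.0 + np.cumsum(0.01 * rng.normal(size=(16, 32)), axis=0),
    # Independent random queries -> low q-similarity -> keep full cache.
    "retrieval": rng.normal(size=(16, 32)),
}
print(compression_order(heads)[0])  # the high-q-similarity head ranks first
```

The appeal noted in the paper is that this ranking needs only the query stream already produced during decoding, with no extra forward passes.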
Key Experimental Results¶
KV Cache Compression (LongBench)¶
| Method | Avg. Score (budget = 512) |
|---|---|
| StreamingLLM | 41.75 |
| H2O | 44.39 |
| SnapKV | 46.92 |
| CAKE | 47.19 |
| TAPPA | 47.55 |
| Full cache | 49.06 |
TAPPA's simple q-similarity-based metric outperforms all compression baselines at this budget, trailing the full cache by only about 1.5 points.
LLM Pruning¶
On Llama-3.1-8B and Qwen-2.5-7B:

- Q-similarity-guided pruning outperforms uniform pruning without guidance.
- Pruning heads with high q-similarity yields smaller performance degradation.
Theoretical Validation Experiments¶
- Dominant channel relocation: Relocating the dominant channel at index 124 in Qwen2.5 to indices 2/3/5 successfully induces the theoretically predicted periodic diagonal patterns.
- RoPE base adjustment: Reducing \(c\) from \(1{,}000{,}000\) to \(100{,}000\) shortens the diagonal spacing, consistent with the theoretical prediction \(T = 2\pi / \theta_{m^\star}\).
- Q-similarity distribution: Analysis across layers, heads, models, and datasets confirms the ubiquitous presence of both high- and low-continuity heads.
Key Findings¶
- Q-similarity is the key factor distinguishing predictable from unpredictable attention patterns.
- Re-access patterns require high q-similarity combined with a dominant low-frequency RoPE channel.
- Sequential patterns require high q-similarity combined with high k-similarity.
- The spacing of periodic diagonals is determined by the frequency of the dominant RoPE channel.
- Q-similarity serves as a simple yet consistently effective metric for downstream tasks.
Highlights & Insights¶
- First work to provide a unified explanation of diverse attention patterns from a temporal continuity perspective.
- Four theorems offer rigorous mathematical analysis.
- The q-similarity metric is extremely simple yet consistently effective.
- Controlled experiments (channel relocation and RoPE base adjustment) precisely validate the theoretical claims.
Limitations & Future Work¶
- The theoretical analysis treats query/key self-similarity as a stable, measurable property of a head, whereas in practice these quantities vary with context.
- Analysis of unpredictable patterns (e.g., retrieval heads) remains relatively limited.
- Seasonal patterns require RoPE resonance conditions, which may have limited applicability in practice.
- Downstream task improvements, while consistent, are modest in magnitude (~0.5–1 point).
Related Work & Insights¶
- Attention Patterns: Attention sinks by Xiao et al. (2023); retrieval heads by Wu et al. (2024).
- RoPE Analysis: Barbero et al. (2025) attribute diagonal patterns to high-frequency RoPE components.
- KV Cache Compression: H2O, SnapKV, PyramidKV, MInference.
- Input Dynamics: AttentionPredictor (Yang et al., 2025); Lee et al. (2024).
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The unified theoretical framework is a significant contribution.
- Theoretical Depth: ⭐⭐⭐⭐⭐ — Four theorems with rigorous derivations.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Theoretical validation is impressive; downstream tasks are somewhat limited.
- Value: ⭐⭐⭐⭐ — Q-similarity is simple and practical.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear and elegant, with outstanding visualizations.