Attention (as Discrete-Time Markov) Chains¶
Conference: NeurIPS 2025 arXiv: 2507.17657 Code: https://yoterel.github.io/attention_chains/ Area: Attention Analysis / Visualization Keywords: Attention Markov Chains, TokenRank, PageRank, Multi-Bounce Attention, Image Segmentation
TL;DR¶
This work reinterprets the softmax-normalized attention matrix as the transition probability matrix of a Discrete-Time Markov Chain (DTMC), and proposes Multi-Bounce Attention and TokenRank (stationary distribution, analogous to PageRank) to capture indirect attention paths and global token importance. The approach achieves 94.29% mAP on ImageNet segmentation and enhances image generation quality in Self-Attention Guidance.
Background & Motivation¶
Background: Attention analysis relies on direct operations — row selection (which tokens a token attends to), column selection (which tokens attend to a given token), and summation (global aggregation). These capture only first-order/direct attention effects.
Limitations of Prior Work: The situation parallels PageRank's insight about web hyperlinks: counting direct links is a weaker measure of importance than propagating importance through the link graph. Attention likewise contains indirect influence paths: token A attends to B and B attends to C, yet direct operations cannot reveal the indirect relationship A→C.
Key Challenge: Row and column operations on the attention matrix are first-order and neglect higher-order indirect paths. A mathematical framework is needed to unify direct and indirect attention effects.
Goal: To provide a Markov chain-based framework for attention analysis that captures multi-order indirect dependencies and global token importance.
Key Insight: The softmax-normalized attention matrix naturally satisfies the definition of a Markov chain transition probability matrix (non-negative entries, rows summing to one), enabling direct reuse of Markov chain and PageRank theoretical tools.
Core Idea: Attention matrix = DTMC transition matrix → Multi-hop propagation = \(k\)-th matrix power → Stationary vector = TokenRank (analogous to PageRank) → \(\lambda_2\)-weighted mixture of multiple heads.
Method¶
Overall Architecture¶
Attention matrix \(A\) (softmax output) → Multi-Bounce: \(\mathbf{v}_{i,n+1}^T = \mathbf{v}_{i,n}^T A\) (\(n=1\) reduces to standard row selection) → TokenRank: compute the stationary vector \(\pi\) of \(A\) satisfying \(\pi^T A = \pi^T\) → \(\lambda_2\) weighting: weight the contribution of each attention head by the magnitude of the second-largest eigenvalue → Apply to segmentation / generation / token masking.
Core insight: The softmax-normalized attention matrix naturally satisfies the Markov chain transition probability definition (non-negative, rows sum to one), allowing all theoretical tools from PageRank to be directly transferred.
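A minimal NumPy sketch of this reinterpretation (my illustration, not code from the paper; the toy 6-token matrix and the helper name are assumptions): row-wise softmax yields a row-stochastic matrix, i.e. a valid DTMC transition matrix, and \(k\)-hop attention is simply its \(k\)-th power.

```python
import numpy as np

def attention_as_transition_matrix(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax turns raw attention scores into a row-stochastic
    matrix, i.e. a valid DTMC transition matrix."""
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
A = attention_as_transition_matrix(rng.normal(size=(6, 6)))  # toy 6-token example

# DTMC properties: non-negative entries, every row sums to one.
assert np.all(A >= 0) and np.allclose(A.sum(axis=-1), 1.0)

# k-hop (indirect) attention is the k-th matrix power of A.
A2 = A @ A              # second-order paths, e.g. i -> j -> k
print(A2.sum(axis=-1))  # still row-stochastic
```

A product of row-stochastic matrices is itself row-stochastic, which is why any number of bounces still yields a valid probability distribution over tokens.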
Key Designs¶
- Multi-Bounce Attention (a combined code sketch for all three designs follows this list):
- Function: Captures indirect attention paths via matrix power propagation.
- Mechanism: For token \(i\), initialize a one-hot vector \(\mathbf{v}_{i,0} = \mathbf{e}_i\) and iterate \(\mathbf{v}_{i,n+1}^T = \mathbf{v}_{i,n}^T A\). \(n=1\) corresponds to standard row selection (direct attention); \(n=2\) incorporates second-order indirect paths; \(n \to \infty\) converges to the stationary distribution.
- Design Motivation: In image segmentation, a pixel may not directly attend to the center of its object, but can be indirectly associated through intermediate pixels — Multi-Bounce captures such indirect relationships.
- TokenRank (Stationary Distribution):
- Function: Computes the global importance of each token (analogous to PageRank).
- Mechanism: The stationary vector \(\pi\) (satisfying \(\pi^T A = \pi^T\)) is computed by power iteration; the PageRank correction \(A' = \alpha A + (1-\alpha)\frac{1}{N}\mathbf{e}\mathbf{e}^T\) makes the chain ergodic and aperiodic, guaranteeing a unique \(\pi\). A higher \(\pi(i)\) indicates that token \(i\) is more "important" in attention propagation.
- Design Motivation: To assess global token influence rather than local relationships — "Which tokens are most central in the overall attention graph?"
- \(\lambda_2\)-Weighted Multi-Head Aggregation:
- Function: Weights the contribution of each attention head by the second-largest eigenvalue \(\lambda_2\).
- Mechanism: A larger \(\lambda_2\) indicates slower convergence of the Markov chain (more metastable states), corresponding to attention heads with richer multi-scale structure. These heads are assigned greater weight via \(\lambda_2\).
- Design Motivation: Not all attention heads are equally important for downstream tasks — \(\lambda_2\) provides an unsupervised estimate of head importance.
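Putting the three designs together, here is a hedged NumPy sketch (the function names, the damping factor \(\alpha = 0.85\), the iteration count, and the normalization of the head weights are my assumptions; only the formulas mirror the descriptions above): Multi-Bounce as repeated vector-matrix products, TokenRank as power iteration on the PageRank-corrected chain, and \(\lambda_2\)-based head weighting.

```python
import numpy as np

def multi_bounce(A: np.ndarray, i: int, n_hops: int = 2) -> np.ndarray:
    """Multi-Bounce Attention: start from the one-hot state e_i and take
    n_hops steps of the chain, v_{n+1}^T = v_n^T A.  n_hops=1 is standard
    row selection; larger n_hops folds in indirect paths."""
    v = np.zeros(A.shape[0])
    v[i] = 1.0
    for _ in range(n_hops):
        v = v @ A
    return v

def token_rank(A: np.ndarray, alpha: float = 0.85, iters: int = 20) -> np.ndarray:
    """TokenRank: stationary vector pi with pi^T A' = pi^T, where
    A' = alpha*A + (1-alpha)/N is the PageRank-style correction that makes
    the chain ergodic and aperiodic, so pi is unique.  alpha=0.85 and
    20 power iterations are illustrative defaults."""
    N = A.shape[0]
    A_damped = alpha * A + (1.0 - alpha) / N
    pi = np.full(N, 1.0 / N)
    for _ in range(iters):
        pi = pi @ A_damped
        pi /= pi.sum()
    return pi

def lambda2_head_weights(heads: np.ndarray) -> np.ndarray:
    """Weight each head by |lambda_2| of its transition matrix: heads whose
    chains mix slowly (larger |lambda_2|) get more weight.  Normalizing the
    weights to sum to one is an assumption of this sketch."""
    weights = []
    for A in heads:                      # heads: (H, N, N), each row-stochastic
        mags = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
        weights.append(mags[1])          # second-largest magnitude (lambda_1 = 1)
    w = np.array(weights)
    return w / w.sum()
```

With these helpers, a \(\lambda_2\)-weighted TokenRank over all heads could be assembled as `sum(w[h] * token_rank(heads[h]) for h in range(len(heads)))`, though the paper's exact aggregation may differ.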
Loss & Training¶
- A purely analytical method; no training is required.
- TokenRank is computed via 10–20 power iterations.
Key Experimental Results¶
Main Results¶
| Task | Method | Accuracy | mIoU | mAP |
|---|---|---|---|---|
| ImageNet Segmentation | Concept Attention | 83.07% | 71.04% | — |
| ImageNet Segmentation | Ours (FLUX DiT) | 84.12% | 70.20% | 94.29% |
| SAG Image Generation | SD1.5 base | IS 16.32 | — | — |
| SAG Image Generation | TokenRank | IS 18.37 | — | — |
| DiffSeg | Uniform sampling | 72.50 (mACC) | 43.60 | — |
| DiffSeg | TokenRank grid | 84.97 (mACC) | 44.87 | — |
Ablation Study¶
| Configuration | Result |
|---|---|
| \(\lambda_2\) weighting vs. uniform | Statistically significant improvement (test E.1) |
| \(n=1\) (standard) vs. \(n=2\) (two-hop) | \(n=2\) is optimal for segmentation |
| Structured features (DINOv2) vs. unstructured (ViT) | DINOv2 + TokenRank yields greater gains |
| Token Masking | Removing TokenRank tokens → AUC 0.26–0.64 (accuracy drops faster than baseline 0.27–0.79), confirming that TokenRank identifies genuinely important tokens |
Key Findings¶
- Multi-Bounce attention (\(n=2\)) outperforms direct attention (\(n=1\)) on segmentation tasks — indirect paths do carry useful information.
- TokenRank significantly improves generation quality over random seed sampling in SAG — using "important tokens" as guidance is more effective.
- Structured attention (DINOv2 + registers) benefits more from TokenRank — suggesting that Markov chain analysis requires a certain degree of structural regularity in the attention.
- \(\lambda_2\) weighting yields modest but statistically significant improvements — automated head selection is valuable.
Highlights & Insights¶
- The PageRank-to-TokenRank analogy is highly natural: The row-stochastic property of the softmax-normalized attention matrix exactly satisfies the Markov chain definition, allowing all theoretical tools from PageRank to be directly transferred.
- Multi-Bounce attention reveals "attention of attention": A token's true influence encompasses not only what it directly attends to, but also what it reaches indirectly through other tokens — this carries deep implications for understanding information flow in Transformers.
- A unified framework spanning multiple applications: Segmentation (Multi-Bounce), generation guidance (TokenRank + SAG), and token importance analysis (masking) — one theoretical framework, multiple applications.
Limitations & Future Work¶
- Applicable only to square matrices (self-attention / mixed attention); cross-attention introduces unreachable states, requiring theoretical extension.
- Computing \(\lambda_2\) is expensive for large matrices, though Multi-Bounce and TokenRank are efficient (10–20 iterations).
- Limited gains on unstructured attention (e.g., standard ViT without registers), suggesting that Markov chain analysis requires some structural prior in the attention.
- Automatic selection of the optimal number of hops for Multi-Bounce is unexplored; \(n=2\) is currently set manually.
- Cross-layer Markov chain properties of attention are not analyzed; the current work operates within a single layer.
- The stationary vector may not be unique (periodic chains or disconnected attention graphs); the PageRank correction is required to guarantee uniqueness. A toy illustration follows this list.
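A quick toy illustration of that last point (my example, not from the paper): for a periodic two-state chain, plain power iteration never settles, while the PageRank-damped chain converges to a unique stationary vector.

```python
import numpy as np

# A periodic 2-state chain: plain power iteration oscillates and never settles.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
v = np.array([1.0, 0.0])
print(v @ A, v @ A @ A)          # [0, 1] then [1, 0]: no convergence

# PageRank-style damping makes the chain ergodic and aperiodic,
# so the stationary vector is unique (here: uniform).
alpha = 0.85
A_damped = alpha * A + (1 - alpha) / 2
pi = np.array([1.0, 0.0])
for _ in range(50):
    pi = pi @ A_damped
print(pi)                        # approximately [0.5, 0.5]
```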
Related Work & Insights¶
- vs. Attention Rollout: Rollout performs cross-layer attention propagation by accumulating along layer depth but does not conduct Markov chain analysis.
- vs. GradCAM: GradCAM requires gradients, whereas TokenRank is entirely gradient-free.
- vs. Concept Attention: Concept Attention performs attention analysis for specific concepts; TokenRank provides a general global importance assessment.
- Transferability: TokenRank can be applied to any Transformer — not limited to vision; text and audio Transformers are equally applicable.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The Markov chain perspective is entirely novel; the PageRank→TokenRank transfer is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three applications (segmentation + generation + masking) with ablation studies.
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are clear and analogies are intuitive.
- Value: ⭐⭐⭐⭐ Provides a new mathematical framework for attention analysis with deep implications for understanding information flow in Transformers.