# Normalization in Attention Dynamics

- Conference: NeurIPS 2025
- arXiv: 2510.22026
- Code: None
- Area: Deep Learning Theory, Transformer Architecture
- Keywords: Layer Normalization, Attention Dynamics, Representation Collapse, Interacting Particle Systems, Velocity Modulation

## TL;DR
This paper unifies various normalization schemes (Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT, sqrt-scaling) under a single framework of velocity modulation in an interacting particle system on the sphere. It theoretically characterizes how each scheme affects token clustering dynamics and representation collapse, identifying Peri-LN as the theoretically optimal choice.
## Background & Motivation
Layer normalization (LayerNorm) in Transformers is a critical component that governs training stability and representation quality in deep networks. Several normalization schemes have been proposed:
- Post-LN (original Transformer): normalization applied after the residual connection
- Pre-LN (default in GPT, LLaMA): normalization applied before the attention layer, yielding more stable training
- Mix-LN: Post-LN for early layers, Pre-LN for later layers
- Peri-LN (used in Gemma-3): normalization applied both before and after attention
- nGPT: introduces a learnable parameter \(\alpha_t\) to control update magnitude
- sqrt-scaling: scales residuals by the square root of depth
In practice, the "curse of depth" causes deep layers to degenerate into near-identity transformations that can be pruned without performance loss. Simultaneously, "representation collapse" limits the scalability of model depth. This paper provides a unified theoretical analysis of how different normalization schemes contribute to these phenomena from a dynamical systems perspective.
## Method

### Overall Architecture
The core idea is to model token representations across Transformer layers as an interacting particle system on the sphere \(\mathcal{S}^{d-1}\). Each token is decomposed as \(x_k = r_k \cdot \theta_k\), where \(\theta_k\) is the directional unit vector and \(r_k\) is the magnitude. Since a normalization step typically precedes the final decoding layer, the paper focuses on the evolution of directions \(\theta_k\).
The unified dynamical equation is formulated as Normalized Attention (NA) dynamics, which, in the simplified setting \(Q=K=V=I_d\) analyzed in the theorems below, takes the form

\[
\dot{\theta}_j(t) \;=\; \frac{1}{s_j(t)}\, \mathbf{P}_{\theta_j(t)}\big(A_j^t\big),
\qquad
A_j^t \;=\; \sum_{k=1}^{n} \frac{e^{\beta \langle \theta_j(t),\, \theta_k(t) \rangle}}{\sum_{l=1}^{n} e^{\beta \langle \theta_j(t),\, \theta_l(t) \rangle}}\, \theta_k(t),
\]

where \(s_j(t)\) is a velocity modulation factor determined by the normalization scheme, \(A_j^t\) is the softmax-attention output of token \(j\), \(\beta\) is the attention temperature, and \(\mathbf{P}_\theta\) denotes projection onto the tangent space of the sphere at \(\theta\).
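To make the dynamics concrete, here is a minimal simulation sketch (my own illustration, not the authors' code; the explicit Euler discretization, step size, and names such as `na_step` are assumptions) of the NA dynamics above with \(Q=K=V=I_d\) and Post-LN modulation \(s_j \equiv 1\):

```python
import numpy as np

def attention_outputs(theta, beta):
    """Softmax-attention outputs A_j = sum_k softmax_k(beta <theta_j, theta_k>) theta_k
    for unit token directions theta of shape (n, d), assuming Q = K = V = I_d."""
    logits = beta * theta @ theta.T                      # (n, n) pairwise inner products
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)        # row-wise softmax
    return weights @ theta                               # (n, d) outputs A_j

def na_step(theta, s, beta, dt=1e-2):
    """One explicit-Euler step of the NA dynamics:
    theta_j <- normalize(theta_j + dt * (1 / s_j) * P_{theta_j}(A_j))."""
    A = attention_outputs(theta, beta)
    radial = np.sum(A * theta, axis=1, keepdims=True)    # <A_j, theta_j>
    tangent = A - radial * theta                         # tangent-space projection
    theta_new = theta + dt * tangent / s[:, None]
    return theta_new / np.linalg.norm(theta_new, axis=1, keepdims=True)

# Toy run: n = 8 tokens on S^{d-1}, Post-LN modulation s_j = 1 (constant speed).
rng = np.random.default_rng(0)
theta = rng.standard_normal((8, 16))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)
for _ in range(1000):
    theta = na_step(theta, s=np.ones(8), beta=1.0)
print(np.round(theta @ theta.T, 3))   # pairwise cosines drift toward 1 (clustering)
```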
### Key Designs
- Unified Velocity Modulation Perspective: all normalization schemes can be characterized by different choices of \(s_j(t)\) and \(\dot{r}_j(t)\). Post-LN sets \(s_j = 1\) (constant speed); Pre-LN sets \(s_j = r_j(t)\) (deceleration as the magnitude grows); Peri-LN sets \(s_j = r_j(t)\|A_j^t\|\) (double deceleration); nGPT sets \(s_j = \alpha_t^{-1}\|A_j^t\|\) (learnable control). This unified perspective constitutes the central contribution of the paper; a minimal code sketch of these factors follows this list.
- Asymptotic Clustering Theorem (Theorem 3.1): under the simplified setting \(Q=K=V=I_d\), the paper proves that token directions under Post-LN, nGPT, and sqrt-scaling almost surely converge to a single synchronized cluster, while Pre-LN, Mix-LN, and Peri-LN also cluster once magnitude growth ceases. Convergence is established via a generalization of the Łojasiewicz inequality.
- Initial and Terminal Velocity Analysis (Theorems 4.1–4.3):
  - Initial velocity: Peri-LN and nGPT produce \(O(1)\) angular displacement in early layers, whereas Post-LN and Pre-LN yield only \(O(\log n / d)\), a gap of \(\Omega(\min(d/\log n, \sqrt{n/\log n}))\).
  - Terminal velocity: Post-LN exhibits exponential clustering decay \(Ce^{-2t}\); Pre-LN, Peri-LN, and Mix-LN exhibit polynomial decay \(C/t^3\); nGPT's rate depends on the choice of \(\alpha_t\).
  - Polynomial decay implies that tokens continue to evolve meaningfully in deep layers, enabling better utilization of intermediate representations and resistance to collapse.
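The modulation factors quoted in the first bullet above can be collected into a small lookup. This is a hypothetical helper of my own, not code from the paper, and it covers only the \(s_j\) choices stated here (sqrt-scaling is omitted because its factor is not spelled out in this summary):

```python
def velocity_modulation(scheme, r=None, A_norm=None, alpha=None):
    """Velocity modulation factor s_j(t); larger s_j means slower angular motion.
    r = magnitude r_j(t), A_norm = ||A_j^t||, alpha = nGPT's learnable alpha_t."""
    if scheme == "post-ln":
        return 1.0                 # constant speed
    if scheme == "pre-ln":
        return r                   # decelerates as the magnitude grows
    if scheme == "peri-ln":
        return r * A_norm          # double deceleration
    if scheme == "ngpt":
        return A_norm / alpha      # speed steered by the learnable alpha_t
    raise ValueError(f"unknown scheme: {scheme}")

# With r_j = 5 and ||A_j^t|| = 0.8, Pre-LN divides the angular velocity by 5 and
# Peri-LN by 4, while Post-LN leaves it unchanged.
print(velocity_modulation("pre-ln", r=5.0),
      velocity_modulation("peri-ln", r=5.0, A_norm=0.8))
```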
### Theoretical Tools
- Riemannian gradient flow theory on the sphere
- Symmetric orthogonal initialization for simplified ODE analysis
- Local cone initialization for terminal behavior analysis
- Tracking decay rate of within-cluster variance \(\text{Var}(t)\)
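As an illustration of the last tool, one simple way to instrument a simulation (my own diagnostic, not necessarily the paper's exact definition of \(\text{Var}(t)\)) is to track the spread of the token directions around their mean, together with the mean pairwise cosine similarity:

```python
import numpy as np

def cluster_statistics(theta):
    """Clustering diagnostics for unit token directions theta of shape (n, d):
    var      = mean squared distance of the tokens to their mean direction,
    mean_cos = mean pairwise cosine similarity (self-pairs excluded)."""
    n = theta.shape[0]
    mean_dir = theta.mean(axis=0)
    var = np.mean(np.sum((theta - mean_dir) ** 2, axis=1))
    cos = theta @ theta.T
    mean_cos = (cos.sum() - n) / (n * (n - 1))   # drop the diagonal of ones
    return var, mean_cos

# Typical usage inside the simulation loop sketched earlier:
#   var, mean_cos = cluster_statistics(theta)
# Exponentially decaying var across layers signals Post-LN-style collapse;
# polynomial decay matches the Pre-/Peri-LN predictions.
```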
## Key Experimental Results

### Cosine Similarity Evolution under Symmetric Initialization (Theoretical ODE, \(\beta=5, n=256\))
| Normalization | Initial Velocity \(\dot{\gamma}(0)\) | Terminal Velocity Decay | Clustering Behavior |
|---|---|---|---|
| Post-LN | \(\frac{2}{e^\beta + n - 1}\) | \(Ce^{-2t}\) (exponential decay) | Fastest clustering → deep collapse risk |
| Pre-LN | \(\frac{2}{r_0(e^\beta + n-1)}\) | \(C/t^3\) (polynomial decay) | Slow clustering → collapse-resistant |
| Peri-LN | \(\frac{2}{r_0\sqrt{e^{2\beta}+n-1}}\) | \(C/t^3\) (polynomial decay) | Fast early + slow terminal |
| nGPT | \(\frac{2\alpha_0}{\sqrt{e^{2\beta}+n-1}}\) | Depends on \(\alpha_t\) | Controllable |
| sqrt-scaling | \(\frac{2}{e^\beta+n-1}\) | \(Ce^{-4\sqrt{t}}/\sqrt{t}\) | Between exponential and polynomial |
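To read the initial-velocity column numerically, here is a quick evaluation of the closed-form expressions at \(\beta = 5\), \(n = 256\); the values \(r_0 = 1\) and \(\alpha_0 = 1\) are assumptions, since the review does not fix them:

```python
import numpy as np

beta, n = 5.0, 256
r0, alpha0 = 1.0, 1.0   # assumed values; r_0 and alpha_0 are not fixed in the review

initial_velocity = {
    "Post-LN":      2.0 / (np.exp(beta) + n - 1),
    "Pre-LN":       2.0 / (r0 * (np.exp(beta) + n - 1)),
    "Peri-LN":      2.0 / (r0 * np.sqrt(np.exp(2 * beta) + n - 1)),
    "nGPT":         2.0 * alpha0 / np.sqrt(np.exp(2 * beta) + n - 1),
    "sqrt-scaling": 2.0 / (np.exp(beta) + n - 1),
}
for name, v in initial_velocity.items():
    print(f"{name:>12}: {v:.2e}")
# Post-LN, Pre-LN (with r0 = 1), and sqrt-scaling start around 5.0e-03, whereas
# Peri-LN and nGPT start around 1.3e-02; the gap widens as beta or d grows.
```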
### Validation with Random Initialization (\(d=512, n_{\text{heads}}=1, \beta=\sqrt{d}\))
| Normalization | Token Movement in Early Layers | Collapse Rate in Deep Layers | Overall Assessment |
|---|---|---|---|
| Post-LN | Moderate | Fast (exponential) | Deep layers nearly redundant |
| Pre-LN | Slow | Slow (polynomial) | Collapse-resistant but underutilizes early layers |
| Peri-LN | Fast | Slow (polynomial) | Best: strong at both ends |
| nGPT (\(\alpha_t \equiv 1\)) | Fast | Fast (exponential) | Requires careful \(\alpha_t\) tuning |
| Mix-LN | Moderate | Slow (polynomial) | Transitional solution |
## Key Findings
- Peri-LN is theoretically optimal: it produces large angular displacements in early layers (effective utilization of shallow layers) while maintaining polynomial clustering decay in deep layers (resistance to representation collapse), achieving the best of both regimes.
- The curse of depth in Post-LN has a theoretical root: exponential clustering causes tokens to nearly cease moving in deep layers, consistent with empirical observations that deep layers are prunable.
- Pre-LN's advantage stems from magnitude growth: the linear growth \(r_j(t) \sim t\) slows clustering from exponential to polynomial, but early layers remain underutilized.
- nGPT offers fine-grained control: the \(\alpha_t\) parameter allows per-layer behavioral tuning, but requires careful adjustment.
- Temperature \(\beta\) exponentially suppresses initial velocity: smaller QK magnitudes are recommended in early layers.
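A tiny numerical check of the last point, using the Post-LN initial-velocity formula from the table above (the specific \(\beta\) values are arbitrary):

```python
import numpy as np

n = 256
for beta in (1.0, 2.0, 5.0, 10.0):
    v0 = 2.0 / (np.exp(beta) + n - 1)   # Post-LN initial angular velocity
    print(f"beta = {beta:4.1f}  ->  initial velocity ~ {v0:.2e}")
# Once e^beta dominates n, the initial velocity falls off roughly like 2 * exp(-beta),
# which is the sense in which large QK products freeze early-layer token motion.
```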
## Highlights & Insights
- Six normalization schemes are unified within a single interacting-particle ODE framework, distinguished elegantly by a velocity modulation factor.
- The paper provides both rigorous asymptotic convergence proofs and quantitative characterizations of initial and terminal velocities, combining theoretical depth with practical insight.
- Experiments with randomly (Kaiming) initialized weights confirm that the theoretical predictions hold qualitatively.
- The theoretical superiority of Peri-LN resonates with its practical adoption in Gemma-3.
## Limitations & Future Work
- The theoretical analysis relies on strong assumptions such as \(Q=K=V=I_d\), which do not reflect the parameter diversity encountered in actual training.
- The FFN layers are omitted; only pure attention dynamics are analyzed.
- End-to-end validation through training realistic models is absent (no concrete architecture is trained and compared).
- The theoretically predicted linear magnitude growth under Pre-LN is inconsistent with the empirically observed \(\sqrt{t}\) growth, due to differences between weight-tied and randomly initialized settings.
- Gradient propagation analysis is deferred to future work.
## Related Work & Insights
- This work extends the interacting particle system modeling tradition for Transformer attention dynamics established by Geshkovski et al.
- It provides a theoretical explanation for the empirical findings on the "curse of depth" (Sun et al.) and "deep layer pruning" (Gromov et al.).
- Peri-LN (Kim et al.) and nGPT (Loshchilov et al.) represent the most recent schemes analyzed within this framework.
- The paper lays groundwork for future "gradient flow analysis" and "complete analysis including MLP layers."
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — The unified velocity modulation perspective is elegant, with solid theoretical contributions.
- Experimental Thoroughness: ⭐⭐⭐ — Primarily theoretical, supplemented by simplified experimental validation; large-scale training comparisons are absent.
- Writing Quality: ⭐⭐⭐⭐⭐ — Mathematical derivations are rigorous; tables and figures are clear; argumentation proceeds in well-structured layers.
- Value: ⭐⭐⭐⭐ — Provides systematic theoretical guidance for normalization scheme selection in Transformers.