# Normalization in Attention Dynamics

- Conference: NeurIPS 2025
- arXiv: 2510.22026
- Code: None
- Area: Deep Learning Theory, Transformer Architecture
- Keywords: Layer Normalization, Attention Dynamics, Representation Collapse, Interacting Particle Systems, Velocity Modulation

## TL;DR
This paper unifies various normalization schemes (Post-LN, Pre-LN, Mix-LN, Peri-LN, nGPT, sqrt-scaling) under a single framework of velocity modulation in an interacting particle system on the sphere. It theoretically characterizes how each scheme affects token clustering dynamics and representation collapse, identifying Peri-LN as the theoretically optimal choice.
## Background & Motivation
Layer normalization (LayerNorm) in Transformers is a critical component that governs training stability and representation quality in deep networks. Several normalization schemes have been proposed:
- Post-LN (original Transformer): normalization applied after the residual connection
- Pre-LN (default in GPT, LLaMA): normalization applied before the attention layer, yielding more stable training
- Mix-LN: Post-LN for early layers, Pre-LN for later layers
- Peri-LN (used in Gemma-3): normalization applied both before and after attention
- nGPT: introduces a learnable parameter \(\alpha_t\) to control update magnitude
- sqrt-scaling: scales residuals by the square root of depth
In practice, the "curse of depth" causes deep layers to degenerate into near-identity transformations that can be pruned without performance loss. Simultaneously, "representation collapse" limits the scalability of model depth. This paper provides a unified theoretical analysis of how different normalization schemes contribute to these phenomena from a dynamical systems perspective.
## Method

### Overall Architecture
The core idea is to model token representations across Transformer layers as an interacting particle system on the sphere \(\mathcal{S}^{d-1}\). Each token is decomposed as \(x_k = r_k \cdot \theta_k\), where \(\theta_k\) is the directional unit vector and \(r_k\) is the magnitude. Since a normalization step typically precedes the final decoding layer, the paper focuses on the evolution of directions \(\theta_k\).
The unified dynamical equation is formulated as Normalized Attention (NA) dynamics, which, in the simplified setting \(Q=K=V=I_d\) analyzed in the theorems below, takes the form

\[
\dot{\theta}_j(t) \;=\; \frac{1}{s_j(t)}\, \mathbf{P}_{\theta_j(t)}\big(A_j^t\big),
\qquad
A_j^t \;=\; \sum_{k=1}^{n} \frac{e^{\beta \langle \theta_j(t),\, \theta_k(t) \rangle}}{\sum_{l=1}^{n} e^{\beta \langle \theta_j(t),\, \theta_l(t) \rangle}}\, \theta_k(t),
\]

where \(s_j(t)\) is a velocity modulation factor determined by the normalization scheme, \(A_j^t\) is the softmax-attention output of token \(j\), \(\beta\) is the attention temperature, and \(\mathbf{P}_\theta\) denotes projection onto the tangent space of the sphere at \(\theta\).
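To make the dynamics concrete, here is a minimal simulation sketch (my own illustration, not the authors' code; the explicit Euler discretization, step size, and names such as `na_step` are assumptions) of the NA dynamics above with \(Q=K=V=I_d\) and Post-LN modulation \(s_j \equiv 1\):

```python
import numpy as np

def attention_outputs(theta, beta):
    """Softmax-attention outputs A_j = sum_k softmax_k(beta <theta_j, theta_k>) theta_k
    for unit token directions theta of shape (n, d), assuming Q = K = V = I_d."""
    logits = beta * theta @ theta.T                      # (n, n) pairwise inner products
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)        # row-wise softmax
    return weights @ theta                               # (n, d) outputs A_j

def na_step(theta, s, beta, dt=1e-2):
    """One explicit-Euler step of the NA dynamics:
    theta_j <- normalize(theta_j + dt * (1 / s_j) * P_{theta_j}(A_j))."""
    A = attention_outputs(theta, beta)
    radial = np.sum(A * theta, axis=1, keepdims=True)    # <A_j, theta_j>
    tangent = A - radial * theta                         # tangent-space projection
    theta_new = theta + dt * tangent / s[:, None]
    return theta_new / np.linalg.norm(theta_new, axis=1, keepdims=True)

# Toy run: n = 8 tokens on S^{d-1}, Post-LN modulation s_j = 1 (constant speed).
rng = np.random.default_rng(0)
theta = rng.standard_normal((8, 16))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)
for _ in range(1000):
    theta = na_step(theta, s=np.ones(8), beta=1.0)
print(np.round(theta @ theta.T, 3))   # pairwise cosines drift toward 1 (clustering)
```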
### Key Designs
- Unified Velocity Modulation Perspective: all normalization schemes can be characterized by different choices of \(s_j(t)\) and \(\dot{r}_j(t)\). Post-LN sets \(s_j = 1\) (constant speed); Pre-LN sets \(s_j = r_j(t)\) (deceleration as the magnitude grows); Peri-LN sets \(s_j = r_j(t)\|A_j^t\|\) (double deceleration); nGPT sets \(s_j = \alpha_t^{-1}\|A_j^t\|\) (learnable control). This unified perspective constitutes the central contribution of the paper; a minimal code sketch of these factors follows this list.
- Asymptotic Clustering Theorem (Theorem 3.1): under the simplified setting \(Q=K=V=I_d\), the paper proves that token directions under Post-LN, nGPT, and sqrt-scaling almost surely converge to a single synchronized cluster, while Pre-LN, Mix-LN, and Peri-LN also cluster once magnitude growth ceases. Convergence is established via a generalization of the Łojasiewicz inequality.
- Initial and Terminal Velocity Analysis (Theorems 4.1–4.3):
  - Initial velocity: Peri-LN and nGPT produce \(O(1)\) angular displacement in early layers, whereas Post-LN and Pre-LN yield only \(O(\log n / d)\), a gap of \(\Omega(\min(d/\log n, \sqrt{n/\log n}))\).
  - Terminal velocity: Post-LN exhibits exponential clustering decay \(Ce^{-2t}\); Pre-LN, Peri-LN, and Mix-LN exhibit polynomial decay \(C/t^3\); nGPT's rate depends on the choice of \(\alpha_t\).
  - Polynomial decay implies that tokens continue to evolve meaningfully in deep layers, enabling better utilization of intermediate representations and resistance to collapse.
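The modulation factors quoted in the first bullet above can be collected into a small lookup. This is a hypothetical helper of my own, not code from the paper, and it covers only the \(s_j\) choices stated here (sqrt-scaling is omitted because its factor is not spelled out in this summary):

```python
def velocity_modulation(scheme, r=None, A_norm=None, alpha=None):
    """Velocity modulation factor s_j(t); larger s_j means slower angular motion.
    r = magnitude r_j(t), A_norm = ||A_j^t||, alpha = nGPT's learnable alpha_t."""
    if scheme == "post-ln":
        return 1.0                 # constant speed
    if scheme == "pre-ln":
        return r                   # decelerates as the magnitude grows
    if scheme == "peri-ln":
        return r * A_norm          # double deceleration
    if scheme == "ngpt":
        return A_norm / alpha      # speed steered by the learnable alpha_t
    raise ValueError(f"unknown scheme: {scheme}")

# With r_j = 5 and ||A_j^t|| = 0.8, Pre-LN divides the angular velocity by 5 and
# Peri-LN by 4, while Post-LN leaves it unchanged.
print(velocity_modulation("pre-ln", r=5.0),
      velocity_modulation("peri-ln", r=5.0, A_norm=0.8))
```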
### Theoretical Tools
- Riemannian gradient flow theory on the sphere
- Symmetric orthogonal initialization for simplified ODE analysis
- Local cone initialization for terminal behavior analysis
- Tracking decay rate of within-cluster variance \(\text{Var}(t)\)
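As an illustration of the last tool, one simple way to instrument a simulation (my own diagnostic, not necessarily the paper's exact definition of \(\text{Var}(t)\)) is to track the spread of the token directions around their mean, together with the mean pairwise cosine similarity:

```python
import numpy as np

def cluster_statistics(theta):
    """Clustering diagnostics for unit token directions theta of shape (n, d):
    var      = mean squared distance of the tokens to their mean direction,
    mean_cos = mean pairwise cosine similarity (self-pairs excluded)."""
    n = theta.shape[0]
    mean_dir = theta.mean(axis=0)
    var = np.mean(np.sum((theta - mean_dir) ** 2, axis=1))
    cos = theta @ theta.T
    mean_cos = (cos.sum() - n) / (n * (n - 1))   # drop the diagonal of ones
    return var, mean_cos

# Typical usage inside the simulation loop sketched earlier:
#   var, mean_cos = cluster_statistics(theta)
# Exponentially decaying var across layers signals Post-LN-style collapse;
# polynomial decay matches the Pre-/Peri-LN predictions.
```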
## Key Experimental Results

### Cosine Similarity Evolution under Symmetric Initialization (Theoretical ODE, \(\beta=5, n=256\))
| Normalization | Initial Velocity \(\dot{\gamma}(0)\) | Terminal Velocity Decay | Clustering Behavior |
|---|---|---|---|
| Post-LN | \(\frac{2}{e^\beta + n - 1}\) | \(Ce^{-2t}\) (exponential decay) | Fastest clustering → deep collapse risk |
| Pre-LN | \(\frac{2}{r_0(e^\beta + n-1)}\) | \(C/t^3\) (polynomial decay) | Slow clustering → collapse-resistant |
| Peri-LN | \(\frac{2}{r_0\sqrt{e^{2\beta}+n-1}}\) | \(C/t^3\) (polynomial decay) | Fast early + slow terminal |
| nGPT | \(\frac{2\alpha_0}{\sqrt{e^{2\beta}+n-1}}\) | Depends on \(\alpha_t\) | Controllable |
| sqrt-scaling | \(\frac{2}{e^\beta+n-1}\) | \(Ce^{-4\sqrt{t}}/\sqrt{t}\) | Between exponential and polynomial |
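To read the initial-velocity column numerically, here is a quick evaluation of the closed-form expressions at \(\beta = 5\), \(n = 256\); the values \(r_0 = 1\) and \(\alpha_0 = 1\) are assumptions, since the review does not fix them:

```python
import numpy as np

beta, n = 5.0, 256
r0, alpha0 = 1.0, 1.0   # assumed values; r_0 and alpha_0 are not fixed in the review

initial_velocity = {
    "Post-LN":      2.0 / (np.exp(beta) + n - 1),
    "Pre-LN":       2.0 / (r0 * (np.exp(beta) + n - 1)),
    "Peri-LN":      2.0 / (r0 * np.sqrt(np.exp(2 * beta) + n - 1)),
    "nGPT":         2.0 * alpha0 / np.sqrt(np.exp(2 * beta) + n - 1),
    "sqrt-scaling": 2.0 / (np.exp(beta) + n - 1),
}
for name, v in initial_velocity.items():
    print(f"{name:>12}: {v:.2e}")
# Post-LN, Pre-LN (with r0 = 1), and sqrt-scaling start around 5.0e-03, whereas
# Peri-LN and nGPT start around 1.3e-02; the gap widens as beta or d grows.
```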
### Validation with Random Initialization (\(d=512, n_{\text{heads}}=1, \beta=\sqrt{d}\))
| Normalization | Token Movement in Early Layers | Collapse Rate in Deep Layers | Overall Assessment |
|---|---|---|---|
| Post-LN | Moderate | Fast (exponential) | Deep layers nearly redundant |
| Pre-LN | Slow | Slow (polynomial) | Collapse-resistant but underutilizes early layers |
| Peri-LN | Fast | Slow (polynomial) | Best: strong at both ends |
| nGPT (\(\alpha_t \equiv 1\)) | Fast | Fast (exponential) | Requires careful \(\alpha_t\) tuning |
| Mix-LN | Moderate | Slow (polynomial) | Transitional solution |
## Key Findings
- Peri-LN is theoretically optimal: it produces large angular displacements in early layers (effective utilization of shallow layers) while maintaining polynomial clustering decay in deep layers (resistance to representation collapse), achieving the best of both regimes.
- The curse of depth in Post-LN has a theoretical root: exponential clustering causes tokens to nearly cease moving in deep layers, consistent with empirical observations that deep layers are prunable.
- Pre-LN's advantage stems from magnitude growth: the linear growth \(r_j(t) \sim t\) slows clustering from exponential to polynomial, but early layers remain underutilized.
- nGPT offers fine-grained control: the \(\alpha_t\) parameter allows per-layer behavioral tuning, but requires careful adjustment.
- Temperature \(\beta\) exponentially suppresses initial velocity: smaller QK magnitudes are recommended in early layers.
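A tiny numerical check of the last point, using the Post-LN initial-velocity formula from the table above (the specific \(\beta\) values are arbitrary):

```python
import numpy as np

n = 256
for beta in (1.0, 2.0, 5.0, 10.0):
    v0 = 2.0 / (np.exp(beta) + n - 1)   # Post-LN initial angular velocity
    print(f"beta = {beta:4.1f}  ->  initial velocity ~ {v0:.2e}")
# Once e^beta dominates n, the initial velocity falls off roughly like 2 * exp(-beta),
# which is the sense in which large QK products freeze early-layer token motion.
```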
## Highlights & Insights
- Six normalization schemes are unified within a single interacting-particle ODE framework, distinguished elegantly by a velocity modulation factor.
- The paper provides both rigorous asymptotic convergence proofs and quantitative characterizations of initial and terminal velocities, combining theoretical depth with practical insight.
- Experiments with randomly (Kaiming) initialized weights confirm that the theoretical predictions hold qualitatively.
- The theoretical superiority of Peri-LN resonates with its practical adoption in Gemma-3.
## Limitations & Future Work
- The theoretical analysis relies on strong assumptions such as \(Q=K=V=I_d\), which do not reflect the parameter diversity encountered in actual training.
- The FFN layers are omitted; only pure attention dynamics are analyzed.
- End-to-end validation through training realistic models is absent (no concrete architecture is trained and compared).
- The theoretically predicted linear magnitude growth under Pre-LN is inconsistent with the empirically observed \(\sqrt{t}\) growth, due to differences between weight-tied and randomly initialized settings.
- Gradient propagation analysis is deferred to future work.
## Related Work & Insights
- This work extends the interacting particle system modeling tradition for Transformer attention dynamics established by Geshkovski et al.
- It provides a theoretical explanation for the empirical findings on the "curse of depth" (Sun et al.) and "deep layer pruning" (Gromov et al.).
- Peri-LN (Kim et al.) and nGPT (Loshchilov et al.) represent the most recent schemes analyzed within this framework.
- The paper lays groundwork for future "gradient flow analysis" and "complete analysis including MLP layers."
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — The unified velocity modulation perspective is elegant, with solid theoretical contributions.
- Experimental Thoroughness: ⭐⭐⭐ — Primarily theoretical, supplemented by simplified experimental validation; large-scale training comparisons are absent.
- Writing Quality: ⭐⭐⭐⭐⭐ — Mathematical derivations are rigorous; tables and figures are clear; argumentation proceeds in well-structured layers.
- Value: ⭐⭐⭐⭐ — Provides systematic theoretical guidance for normalization scheme selection in Transformers.