Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena¶

Conference: ICML 2025
arXiv: 2502.21009
Code: None (theoretical/position paper)
Area: Theory (deep learning theory, optimization dynamics)
Keywords: layerwise linear models, dynamical feedback principle, neural collapse, emergence, lazy/rich regime, grokking

TL;DR¶

Proposes the Dynamical Feedback Principle and demonstrates that layerwise linear models are sufficient to provide a unified explanation for four major deep learning dynamical phenomena: neural collapse, emergence, lazy/rich regimes, and grokking. The authors advocate prioritizing the study of layerwise structures over non-linear activations.

Background & Motivation¶

Deep Neural Networks (DNNs) are complex non-linear dynamical systems that are extremely difficult to analyze directly. In physics, complex systems are often simplified into solvable minimal models (e.g., modeling a cow as a sphere, linearizing pendulum motion). Analogously, layerwise linear models (such as linear neural networks), despite discarding non-linear activations, possess gradient flow dynamics that are inherently non-linear.

In recent years, several hard-to-explain dynamical phenomena have emerged in DNNs:

Neural Collapse: The last-layer features of classification networks collapse into a low-rank simplex structure.
Emergence: Large Language Models (LLMs) abruptly acquire new capabilities as scale increases.
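Lazy/Rich Regime: Depending on initialization, networks train either with nearly fixed features (lazy, effectively linear dynamics) or with substantially evolving features (rich, non-linear dynamics).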
Grokking: Generalization is severely delayed despite training accuracy already reaching perfection.

While seemingly unrelated, this work demonstrates that they all stem from the dynamical feedback generated by the layerwise parameter multiplication structure.

Method¶

Core: Dynamical Feedback Principle¶

Taking a diagonal linear network \(f(x) = \sum_i x_i a_i b_i\) as an example, the gradient flow equations are:

\[\frac{da_i}{dt} = -b_i \mathbf{E}[x_i^2](a_ib_i - S_i), \quad \frac{db_i}{dt} = -a_i \mathbf{E}[x_i^2](a_ib_i - S_i)\]

Key observation: The magnitude of \(a_i\) controls the rate of change of \(b_i\), and vice versa, establishing a dynamical feedback. In contrast, in a shallow linear model without hidden layers, \(\frac{d\theta_i}{dt} = -\mathbf{E}[x_i^2](\theta_i - S_i)\), the parameters evolve independently weight-by-weight without feedback effects.
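The contrast between the two flows can be checked numerically. The sketch below Euler-integrates both sets of equations for a single mode; the step size, the initial values, and \(S_i = \mathbf{E}[x_i^2] = 1\) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def simulate(a0, b0, theta0, S=1.0, Ex2=1.0, dt=1e-3, steps=20000):
    """Euler-integrate both gradient flows from the text for one mode i.

    (a, b) follow the diagonal linear network equations; theta follows the
    shallow linear model. All default values are illustrative assumptions.
    """
    a, b, theta = a0, b0, theta0
    prod_hist, theta_hist = [], []
    for _ in range(steps):
        err = a * b - S
        da = -b * Ex2 * err            # rate of a is scaled by b
        db = -a * Ex2 * err            # rate of b is scaled by a
        dtheta = -Ex2 * (theta - S)    # no such coupling in the shallow model
        a, b, theta = a + dt * da, b + dt * db, theta + dt * dtheta
        prod_hist.append(a * b)
        theta_hist.append(theta)
    return np.array(prod_hist), np.array(theta_hist)

# Same small initial function value (1e-4) for both models.
prod, theta = simulate(a0=0.01, b0=0.01, theta0=1e-4)

k = 999  # index of time t = 1.0 for dt = 1e-3
print(f"t=1: layerwise a*b = {prod[k]:.4f}, shallow theta = {theta[k]:.4f}")
print(f"end: layerwise a*b = {prod[-1]:.4f}, shallow theta = {theta[-1]:.4f}")
```

At \(t = 1\) the shallow model has already covered most of the distance to the target, while the layerwise product is still close to its initial value because small \(a_i\) and \(b_i\) slow each other down. Both reach the target by the end of the run.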

Conserved Quantities¶

From the symmetry of the gradient equations, we derive the conserved quantity \(a_i^2 - b_i^2 = \mathcal{C}_i\), which remains constant during training. In matrix form, for a two-layer linear network \(f = x^T W_1 W_2\), the quantity \(W_2 W_2^T - W_1^T W_1\) is conserved.
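For the diagonal case, the conservation law can be verified in one line by substituting the gradient flow equations for \(a_i\) and \(b_i\) given earlier; the two terms cancel exactly:

\[\frac{d}{dt}\left(a_i^2 - b_i^2\right) = 2a_i\frac{da_i}{dt} - 2b_i\frac{db_i}{dt} = -2a_ib_i\mathbf{E}[x_i^2](a_ib_i - S_i) + 2a_ib_i\mathbf{E}[x_i^2](a_ib_i - S_i) = 0\]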

Phenomenon 1: Emergence ← Amplification Dynamics + Sigmoid Saturation¶

Under small initialization (\(a_i(0) = b_i(0) \ll 1\)), each mode follows a sigmoid saturation curve:

\[a_i(t)b_i(t)/S_i = \frac{1}{1 + \left(\frac{S_i}{a_i(0)b_i(0)} - 1\right) e^{-2S_i \mathbf{E}[x_i^2] t}}\]

Unlike the exponential saturation of linear models, \(\theta_i(t)/S_i = 1 - e^{-\mathbf{E}[x_i^2]t}\) (for \(\theta_i(0) = 0\)), sigmoid dynamics lead to staged training: modes with larger \(S_i \mathbf{E}[x_i^2]\) saturate earlier, so different modes are learned sequentially, resulting in the abrupt emergence of capabilities.
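Staged training can be read directly off the closed-form solution. The sketch below evaluates the sigmoid formula for three modes; the target strengths `S_values` and the initial product `p0` are illustrative assumptions, and `half_time` is a helper introduced here, obtained by solving the formula for the time at which a mode reaches half its target.

```python
import numpy as np

def sigmoid_mode(t, S, p0, Ex2=1.0):
    """Closed-form a_i(t) * b_i(t) from the text, with p0 = a_i(0) * b_i(0)."""
    return S / (1.0 + (S / p0 - 1.0) * np.exp(-2.0 * S * Ex2 * t))

def half_time(S, p0, Ex2=1.0):
    """Time at which a mode reaches S / 2 (solve sigmoid_mode(t) = S / 2 for t)."""
    return np.log(S / p0 - 1.0) / (2.0 * S * Ex2)

S_values = np.array([4.0, 2.0, 1.0])   # illustrative target strengths
p0 = 1e-6                              # illustrative small initialization

times = half_time(S_values, p0)

# Fraction of each target learned at the moment the strongest mode is half done.
fractions = sigmoid_mode(times[0], S_values, p0) / S_values

for S, t, f in zip(S_values, times, fractions):
    print(f"S = {S}: half-learned at t = {t:.2f}, "
          f"fraction learned at t = {times[0]:.2f}: {f:.2e}")
```

When the strongest mode is half learned, the weaker modes are still orders of magnitude away from their targets, which is the sequential pattern described above.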

Phenomenon 2: Neural Collapse ← Greedy Low-Rank Dynamics¶

The dynamics of a linear neural network \(f = x^T W_1 W_2\) decouple into \(c\) independent modes (where \(c\) is the output dimension), with each mode similarly following sigmoid saturation. The network learns the modes most relevant to the target first (those with the largest singular values), which yields a bias toward minimal-rank solutions. The rank of the final-layer feature matrix \(XW_1\) collapses to \(c\), forming a simplex ETF structure.
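The greedy ordering and the rank collapse can be illustrated with a small simulation. The sketch below assumes whitened inputs, so the loss reduces to \(\tfrac{1}{2}\lVert W_1 W_2 - M\rVert_F^2\) for a target map \(M\); the dimensions, the singular values of \(M\), the initialization scale, and the rank threshold are all hypothetical choices, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, c = 5, 5, 3                      # input, hidden, output dims (illustrative)

# Hypothetical target map M = U diag(s) V^T with well-separated singular values.
U, _ = np.linalg.qr(rng.standard_normal((d, c)))
V, _ = np.linalg.qr(rng.standard_normal((c, c)))
s = np.array([4.0, 2.0, 1.0])
M = U @ np.diag(s) @ V.T

# Small random initialization of both layers.
W1 = 1e-3 * rng.standard_normal((d, h))
W2 = 1e-3 * rng.standard_normal((h, c))

dt, steps = 1e-3, 15000
half_time = np.full(c, np.nan)         # when each mode first exceeds s_k / 2
for k in range(steps):
    E = W1 @ W2 - M                    # residual under whitened inputs
    # Euler step of gradient flow; both updates use the old W1 and W2.
    W1, W2 = W1 - dt * E @ W2.T, W2 - dt * W1.T @ E
    strength = np.diag(U.T @ W1 @ W2 @ V)   # learned strength along each target mode
    newly = np.isnan(half_time) & (strength > s / 2)
    half_time[newly] = (k + 1) * dt

# Number of hidden-feature directions that grew beyond the initialization scale.
feature_rank = int(np.sum(np.linalg.svd(W1, compute_uv=False) > 0.05))

print("half-learning times per mode:", half_time)
print("effective rank of W1:", feature_rank, "with hidden width", h)
```

With full-rank whitened inputs the rank of \(XW_1\) equals the rank of \(W_1\), so the count of non-negligible singular values of `W1` stands in for the feature rank here.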

Phenomenon 3: Lazy/Rich Regime ← Inter-layer Imbalance¶

Introducing the \(\lambda\)-balanced condition \(W_2 W_2^T - W_1^T W_1 = \lambda I\):

\(|\lambda| \approx 0\) (balanced layers) \(\to\) non-linear greedy dynamics \(\to\) Rich regime
\(|\lambda| \gg 0\) (unbalanced layers) \(\to\) only the "light" layer gets trained, leading to linear dynamics \(\to\) Lazy regime
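A scalar two-layer model \(f(x) = abx\) is enough to see both regimes. In the sketch below, the imbalance is set at initialization via \(b^2 - a^2 = \lambda\); the target \(S = 1\), \(\mathbf{E}[x^2] = 1\), the initial value of \(a\), and the two values of \(\lambda\) are illustrative assumptions.

```python
import numpy as np

def train(lam, a0=0.01, S=1.0, dt=1e-3, steps=20000):
    """Euler-integrate gradient flow for f(x) = a*b*x with E[x^2] = 1.

    The initial imbalance is b^2 - a^2 = lam. All defaults are illustrative.
    """
    a, b = a0, np.sqrt(a0**2 + lam)
    a_init, b_init = a, b
    for _ in range(steps):
        err = a * b - S
        a, b = a - dt * b * err, b - dt * a * err
    return {
        "output": a * b,
        "rel_change_a": abs(a - a_init) / a_init,   # the "light" layer
        "rel_change_b": abs(b - b_init) / b_init,   # the "heavy" layer when lam > 0
        "imbalance": b**2 - a**2,                   # should stay close to lam
    }

rich = train(lam=0.0)     # balanced layers
lazy = train(lam=100.0)   # strongly unbalanced layers

for name, r in [("balanced (rich)", rich), ("unbalanced (lazy)", lazy)]:
    print(f"{name}: output = {r['output']:.4f}, "
          f"relative change a = {r['rel_change_a']:.3g}, b = {r['rel_change_b']:.3g}")
```

Both runs fit the target, but in the unbalanced run the heavy weight `b` barely moves and only the light weight `a` is trained, whereas in the balanced run both weights change by a large factor.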

Phenomenon 4: Grokking ← Weight-to-Target Ratio¶

Define the weight-to-target ratio \(\Sigma_0 / S\), where \(S\) is the target scale, \(Z\) is an output scaling factor, and \(\Sigma_0 = \sum_i \frac{a_i(0)^2 + b_i(0)^2}{2Z}\) represents the initial weight scale. The key quantity is:

\[\gamma_+ = \frac{S + \sqrt{\Sigma_0^2 - \mathcal{S}_0^2 + S^2}}{\Sigma_0 + \mathcal{S}_0}\]

\(\gamma_+ \gg 1\) \(\to\) features change substantially during training \(\to\) Rich regime \(\to\) fast generalization
\(\gamma_+ \approx 1\) \(\to\) almost unchanged features \(\to\) Lazy regime \(\to\) Grokking (delayed generalization)

Methods that decrease \(\Sigma_0/S\) (weight downscaling, target amplification, input downscaling, output downscaling) can all eliminate grokking.
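The effect of the ratio on \(\gamma_+\) can be checked by evaluating the formula directly. In the sketch below, `gamma_plus` is a helper name introduced here; \(\mathcal{S}_0\) is passed in as a plain argument and set to 0 purely as an illustrative assumption, and the values of \(\Sigma_0\) and \(S\) are hypothetical.

```python
import math

def gamma_plus(sigma0, s0, target):
    """Evaluate the gamma_+ formula from the text.

    sigma0: initial weight scale (Sigma_0)
    s0:     the quantity written as calligraphic S_0 in the formula (assumed input)
    target: target scale S
    """
    return (target + math.sqrt(sigma0**2 - s0**2 + target**2)) / (sigma0 + s0)

s0 = 0.0  # illustrative assumption

large_init = gamma_plus(sigma0=10.0, s0=s0, target=1.0)     # Sigma_0 / S = 10
small_init = gamma_plus(sigma0=0.1, s0=s0, target=1.0)      # weight downscaling
big_target = gamma_plus(sigma0=10.0, s0=s0, target=100.0)   # target amplification

print(f"Sigma_0/S = 10  : gamma_+ = {large_init:.3f}")
print(f"Sigma_0/S = 0.1 : gamma_+ = {small_init:.3f}")
print(f"Sigma_0/S = 0.1 (via larger S): gamma_+ = {big_target:.3f}")
```

Under these assumptions, a large ratio gives \(\gamma_+\) close to 1 (the lazy, grokking-prone case), while lowering the ratio by either shrinking \(\Sigma_0\) or enlarging \(S\) pushes \(\gamma_+\) well above 1.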

Key Experimental Results¶

Phenomenon	Simplified Model	Core Mechanism	Practical Validation
Emergence	Diagonal linear network + pre-built skill functions	Sigmoid saturation + staged training	2-layer ReLU network on multi-task sparse parity problems
Neural Collapse	Linear neural network (UFM)	Greedy low-rank dynamics	ResNet18 on CIFAR10; features collapse to a 9-simplex ETF
Lazy/Rich	\(\lambda\)-balanced linear network	Inter-layer imbalance control	CNN upstream initialization improving feature learning and interpretability
Grokking	Scalar input-output linear network	Weight-to-target ratio \(\Sigma_0/S\)	4-layer tanh MLP on MNIST, Transformer on modular arithmetic

Experimental results for eliminating grokking (4-layer tanh MLP, 1000 MNIST samples):

Method	Modification	Effect
Default setup	Large weight initialization	Pronounced grokking occurs
Weight downscaling	Lowering \(\Sigma_0\)	Eradicates generalization delay
Target amplification	Increasing \(S\)	Eradicates generalization delay
Input downscaling	Equivalently increasing \(S/\Sigma_0\)	Eradicates generalization delay
Output downscaling	Increasing \(Z\) to decrease \(\Sigma_0\)	Eradicates generalization delay

Highlights & Insights¶

High Unification: Integrates four seemingly unrelated phenomena using a single 'dynamical feedback principle', presenting elegant physical intuition.
Solvability: Under appropriate assumptions, layerwise linear models admit exact analytical solutions, which avoids the potentially misleading conclusions that approximations can introduce.
Practical Guidance: The theory directly delivers operational solutions for eliminating grokking (e.g., downscaling weights, amplifying targets).
Scaling Laws Prediction: Successfully predicts the scaling laws of 2-layer networks via sigmoid dynamics and power-law skill distributions.
Depth Effects: Deeper networks drive the sigmoid curve closer to a step function, intensifying greedy dynamics and offering an explanation for the Lottery Ticket Hypothesis.

Limitations & Future Work¶

Expressivity Gap: Layerwise linear models can only represent linear input-output maps, so they cannot fit most real-world data, and the transferability of their dynamical conclusions to non-linear DNNs lacks rigorous guarantees.
Strong Assumptions: Most conclusions rely on idealized conditions such as small initialization, whitened inputs, and specific conserved structures.
Position Paper Nature: Direct validation via large-scale experiments (e.g., LLM training) is lacking; the case rests largely on arguments by analogy.
Underestimated Non-linear Effects: Certain phenomena (such as double descent, or feature selection in ReLU networks) might inherently require non-linearities.
Difficulty in Multi-layer Generalization: Most exact solutions are restricted to 2 layers; the analytical treatment of deeper networks remains open.

Related Work¶

Saxe et al. (2014): Foundational work on the exact dynamical solutions of linear networks.
Papyan et al. (2020): Empirical discovery of Neural Collapse.
Nam et al. (2024): Explaining emergence and scaling laws using diagonal linear networks with pre-built skill functions.
Kunin et al. (2024): Controlling grokking via inter-layer imbalance; deeper networks biasing toward \(L_1\) norm solutions.
Dominé et al. (2025): Analysis of lazy/rich regimes in \(\lambda\)-balanced models.
Mixon et al. (2020) & Fang et al. (2021): Analyzing neural collapse under the Unconstrained Feature Model (UFM).

Rating¶

Novelty: ⭐⭐⭐⭐ — The dynamical feedback principle as a unifying framework is a novel contribution, although many building blocks come from prior work.
Experimental Thoroughness: ⭐⭐⭐ — Due to its nature as a position paper, experiments are primarily illustrative and lack large-scale validation.
Writing Quality: ⭐⭐⭐⭐⭐ — Clear physical intuition; the narrative logic spanning simple models to complex phenomena is beautifully structured.
Value: ⭐⭐⭐⭐ — Offers a unified perspective on understanding DNN dynamics, providing high guidance value for theoretical research.