Emergence and Scaling Laws in SGD Learning of Shallow Neural Networks¶
Conference: NeurIPS 2025 | arXiv: 2504.19983 | Code: None | Area: Optimization Theory / Neural Network Learning Theory | Keywords: scaling laws, emergence, SGD, shallow neural networks, multi-index model
TL;DR¶
This paper provides a precise analysis of online SGD learning of additive models (sums of single-index functions) with shallow neural networks. It proves that the learning of each teacher neuron exhibits a sharp phase transition (emergence), and that the superposition of many such transition curves across different timescales naturally produces a smooth power-law scaling law.
Background & Motivation¶
Background: A growing body of theoretical work has studied gradient-based training of shallow networks on low-dimensional target functions, particularly the sample complexity of SGD for single-index and multi-index models. Empirically, large-scale model training exhibits predictable power-law scaling laws, where the loss decreases smoothly as a function of compute or data.
Limitations of Prior Work: (a) SGD learning of a single skill or direction exhibits emergent behavior (a long "search phase" followed by a sudden drop), which appears to contradict smooth scaling laws; (b) existing analyses of multi-index models are largely restricted to a constant number of teacher directions \(P = O(1)\) or to uniform signal strengths, neither of which produces the timescale separation needed to explain power-law decay; (c) prior work (e.g., OSSW24) analyzes hierarchical training (first optimizing directions, then the second-layer weights), which requires student width \(m \gtrsim P^{\Omega(1/a_{\min})}\) and hence becomes computationally infeasible when the smallest coefficient \(a_{\min}\) is small.
Key Challenge: Emergence (discrete jumps) and scaling laws (smooth power laws) appear contradictory. The key challenge is to analyze the difficult regime of large width, large condition number, and single-phase training within a unified framework.
Key Insight: The target function is modeled as \(f_*(x) = \sum_{p=1}^P a_p \sigma(\langle x, v_p^* \rangle)\), where \(a_p \asymp p^{-\beta}\) follows a power-law decay. The paper exploits an "automatic cancellation" mechanism arising from 2-homogeneous parameterization to show that learning different directions can be approximately decoupled.
Core Idea: In single-phase SGD training, each teacher direction undergoes a sharp phase transition (emergent transition) at time \(T_p \propto a_p^{-1}\). The superposition of \(P \gg 1\) emergence curves across distinct timescales naturally produces a power-law scaling law \(\mathcal{L}(t) \sim t^{(1-2\beta)/\beta}\).
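A minimal numerical sketch (not from the paper) of this superposition argument, assuming each direction \(p\) contributes on the order of \(a_p^2\) to the loss until its transition time \(T_p \propto a_p^{-1}\); the sharpness of each individual curve and the constants are illustrative:

```python
# Superposing P sharp emergence curves with power-law signal strengths
# a_p ~ p^{-beta} yields a smooth power law with exponent (1 - 2*beta)/beta.
import numpy as np

beta = 0.8                       # power-law exponent of the teacher coefficients
P = 10_000                       # number of teacher directions
p = np.arange(1, P + 1)
a = p ** (-beta)                 # a_p ~ p^{-beta}
T = 1.0 / a                      # assumed transition time T_p proportional to a_p^{-1}

# each direction contributes ~ a_p^2 to the loss until it is learned; model the
# individual learning curve as a sharp sigmoid-like drop around T_p
t = np.logspace(0.5, np.log10(T[P // 10]), 200)   # stay well below the finite-P cutoff
loss = np.array([(a**2 / (1.0 + (ti / T) ** 8)).sum() for ti in t])

# the fitted log-log slope should roughly agree with the predicted exponent
slope = np.polyfit(np.log(t), np.log(loss), 1)[0]
print(f"fitted slope : {slope:.3f}")
print(f"predicted    : {(1 - 2 * beta) / beta:.3f}")   # -0.75 for beta = 0.8
```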
Method¶
Overall Architecture¶
- Teacher model: \(f_*(x) = \sum_{p=1}^P a_p \sigma(v_p^* \cdot x)\), \(x \sim \mathcal{N}(0, I_d)\), \(\{v_p^*\}\) orthonormal, \(\sigma\) an even function with information exponent \(k_* > 2\)
- Student model: \(f(x) = \sum_{k=1}^m \|v_k\|^2 \sigma(\bar{v}_k \cdot x)\), using 2-homogeneous parameterization (second-layer weight = squared norm of first-layer weight)
- Training algorithm: Online SGD with a fresh sample at each step, updating both layers simultaneously (a minimal sketch of this setup follows this list)
- Objective: Prove polynomial sample complexity and precisely characterize the recovery time for each teacher neuron
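The following is a minimal NumPy sketch of this teacher-student setup (the paper releases no code, so this is illustrative rather than the authors' implementation). The teacher directions are taken as coordinate vectors, \(a_p = p^{-\beta}\), and \(\sigma = h_4\) as in the paper's experiments; dimensions, step size, step count, initialization scale, and the crude gradient clipping are assumptions of this sketch, not the theorem's prescriptions, so full recovery of all directions at these scales is not guaranteed.

```python
# Online SGD on the 2-homogeneous student f(x) = sum_k ||v_k||^2 sigma(v_k.x/||v_k||)
# against the teacher f_*(x) = sum_p a_p sigma(<x, e_p>).
import numpy as np

rng = np.random.default_rng(0)
d, P, m, beta = 32, 8, 16, 0.8
eta, steps = 1e-3, 200_000

def sigma(z):                     # normalized 4th Hermite polynomial h_4
    return (z**4 - 6 * z**2 + 3) / np.sqrt(24.0)

def dsigma(z):
    return (4 * z**3 - 12 * z) / np.sqrt(24.0)

a = np.arange(1, P + 1) ** (-beta)            # teacher coefficients a_p ~ p^{-beta}
V_star = np.eye(d)[:P]                        # orthonormal teacher directions (rows)
V = rng.normal(size=(m, d))
V *= 0.2 / np.linalg.norm(V, axis=1, keepdims=True)   # small init, so ||v_k||^2 starts small

for _ in range(steps):
    x = rng.normal(size=d)                    # fresh Gaussian sample each step
    norms = np.linalg.norm(V, axis=1)
    vbar = V / norms[:, None]
    z_s, z_t = vbar @ x, V_star @ x
    err = np.sum(a * sigma(z_t)) - np.sum(norms**2 * sigma(z_s))   # f_*(x) - f(x)
    # gradient of f(x) w.r.t. each row v_k (product rule + chain rule)
    grad = (2 * sigma(z_s))[:, None] * V \
         + (norms * dsigma(z_s))[:, None] * (x[None, :] - z_s[:, None] * vbar)
    update = np.clip(eta * err * grad, -0.1, 0.1)   # clip rare heavy-tailed samples (stability of this sketch only)
    V += update                               # SGD step on the squared loss, both layers at once

# how well is each teacher direction recovered by its best-matching student neuron?
vbar = V / np.linalg.norm(V, axis=1, keepdims=True)
print("max |overlap| per teacher direction:", np.abs(vbar @ V_star.T).max(axis=0).round(2))
print("second-layer weights ||v_k||^2 (sorted):", np.sort(np.linalg.norm(V, axis=1)**2)[::-1].round(2))
```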
Key Designs¶
- 2-Homogeneous Parameterization and Automatic Cancellation
- Function: The second-layer weights of the student network are set to \(\|v_k\|^2\), coupling directional recovery with norm growth.
- Mechanism: Once \(\bar{v}_p\) converges to \(v_{\pi(p)}^*\), \(\|v_p\|^2\) automatically grows to \(a_{\pi(p)}\), effectively canceling the corresponding teacher direction from the loss — analogous to automatic deflation.
- Design Motivation: This avoids the drawback of hierarchical training, in which optimizing directions via a correlation loss introduces exponential dependence on the condition number \(\kappa = a_{\max}/a_{\min}\). Single-phase MSE training circumvents this issue through automatic cancellation.
- Greedy Maximum Selection
- Function: Establishes a mapping \(\pi\) from student neurons to teacher neurons, determining the learning order.
- Mechanism: Directions are ordered by \(a_{\pi(p)} \cdot \bar{v}_{p,\pi(p)}^{2I-2}(0)\); directions with larger signal strength and larger initial overlap are learned first.
- Key Property: Three gap conditions — row gap, column gap, and threshold gap — guarantee that irrelevant coordinates remain small throughout training.
- Approximate Decoupled Dynamics
- Function: Proves that the learning processes for different teacher directions can be analyzed approximately independently.
- Mechanism: The evolution of the aligned coordinate \(\bar{v}_{p,\pi(p)}^2\) is approximated by the ODE \(\frac{d}{dt} \bar{v}^2 \approx 8a_{\pi(p)} \bar{v}^4\), with solution \(\bar{v}^2(t) = (1/\bar{v}^2(0) - 8 a_{\pi(p)} t)^{-1}\), exhibiting a sharp phase transition at \(T_p \simeq (8 a_{\pi(p)} \bar{v}_{p,\pi(p)}^2(0))^{-1}\) (a small numerical illustration of this solution follows this list).
- Control of Irrelevant Coordinates: Using the information exponent condition \(k_* > 2\) (i.e., \(2I > 2\)), irrelevant coordinates \(\bar{v}_{p,\pi(q)}\) grow more slowly than aligned ones and remain at the \(O(d^{-0.9})\) level throughout training.
- Discretization from Gradient Flow to Online SGD
- Function: Translates the continuous-time gradient flow analysis into a rigorous proof for discrete SGD.
- Mechanism: A martingale-plus-drift argument is employed, controlling stochastic terms via Doob's inequality. The learning rate \(\eta \propto a_{\min} \Delta^2 d^{-I}\) is chosen so that the SGD escape time deviates from the gradient flow by at most a factor of \((1 \pm \Delta)\).
- Unstable Discretization Trick: When only the recovery of the top \(P_*\) directions is of interest, the learning rate can be set to \(\eta \propto a_{P_*}\) (rather than \(a_{\min}\)), yielding improved compute–sample scaling.
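To make the emergence picture of the third design concrete, the snippet below simply evaluates the closed-form solution \(\bar{v}^2(t) = (1/\bar{v}^2(0) - 8at)^{-1}\) for a few directions (with an illustrative initialization \(\bar{v}^2(0) = 10^{-3}\) and \(\beta = 0.8\)): the alignment stays near its initialization for almost the entire interval and only becomes large just before \(T_p\).

```python
# Closed-form solution of d/dt vbar^2 = 8 a vbar^4: a long search plateau
# followed by a sharp transition at T = 1 / (8 a vbar^2(0)).
import numpy as np

v0_sq = 1e-3                                  # initial alignment vbar^2(0), roughly 1/d
for p in (1, 4, 16):
    a = p ** (-0.8)                           # a_p = p^{-beta}, beta = 0.8
    T = 1.0 / (8 * a * v0_sq)                 # predicted emergence time of direction p
    t = np.linspace(0.0, 0.99 * T, 5)
    v_sq = 1.0 / (1.0 / v0_sq - 8 * a * t)    # closed-form solution of the ODE
    print(f"p={p:2d}  T_p={T:8.1f}  vbar^2 on [0, 0.99*T_p]: {np.round(v_sq, 4)}")
```

For each direction the first four values remain of order \(\bar{v}^2(0)\); only the last point, at \(0.99\,T_p\), is markedly larger, which is the staircase shape referred to in the Key Findings below.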
Loss & Training¶
- MSE loss: \(\ell(x) = \frac{1}{2}(f_*(x) - f(x))^2\)
- Via Hermite expansion, the population MSE can be expressed as a tensor decomposition loss (written out schematically after this list).
- Online SGD uses an independent fresh sample at each step; the learning rate \(\eta\) must satisfy precise conditions to guarantee convergence.
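For reference, here is a schematic form of that reduction (a sketch assuming \(\sigma\) admits the normalized Hermite expansion \(\sigma(z) = \sum_{j \ge 0} c_j h_j(z)\); normalization and constants may differ from the paper). Using \(\mathbb{E}_{x \sim \mathcal{N}(0, I_d)}[h_j(\langle u, x \rangle)\, h_{j'}(\langle w, x \rangle)] = \delta_{jj'} \langle u, w \rangle^j\) for unit vectors \(u, w\),

\[
\mathcal{L} = \mathbb{E}_x\Big[\tfrac{1}{2}\big(f_*(x) - f(x)\big)^2\Big]
= \frac{1}{2} \sum_{j \ge 0} c_j^2 \,\Big\| \sum_{p=1}^{P} a_p \,(v_p^*)^{\otimes j} - \sum_{k=1}^{m} \|v_k\|^2 \, \bar{v}_k^{\otimes j} \Big\|_F^2 ,
\]

so the population loss decomposes into a weighted sum of tensor decomposition losses, one per Hermite order \(j\); since \(\sigma\) is even, only even orders contribute.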
Key Experimental Results¶
Main Results (Theory vs. Experiment)¶
| Setting | Theoretical Scaling | Observed | Notes |
|---|---|---|---|
| Fixed learning rate, \(\beta=0.8\) | \(\mathcal{L} \sim (mt)^{(1-2\beta)/(1+\beta)} = (mt)^{-1/3}\) | Slope \(\approx -1/3\) | Compute-optimal frontier matches |
| Sample scaling | \(\mathcal{L} \sim n^{(1-2\beta)/\beta}\) | Consistent | Matches minimax optimal rate |
| Width scaling | Approximation error \(\sim m^{1-2\beta}\) | Consistent | Student width \(m\) determines how many directions can be learned |
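A quick heuristic check of how the compute exponent in the first row relates to the width and time exponents (a balancing argument used here only as a plausibility check, not the paper's derivation): trading off the width-limited error \(m^{1-2\beta}\) against the time-limited error \(t^{(1-2\beta)/\beta}\) under a compute budget \(C = mt\) recovers \((1-2\beta)/(1+\beta)\).

```python
# Balancing m^(1-2*beta) against t^((1-2*beta)/beta) under C = m*t.
import numpy as np

beta = 0.8
C = np.logspace(3, 9, 50)                        # compute budgets C = m * t
m_opt = C ** (1.0 / (1.0 + beta))                # balance point: m ~ C^{1/(1+beta)}
loss = m_opt ** (1 - 2 * beta) + (C / m_opt) ** ((1 - 2 * beta) / beta)
slope = np.polyfit(np.log(C), np.log(loss), 1)[0]
print(f"fitted compute-optimal slope : {slope:.3f}")                       # ~ -0.333
print(f"predicted (1-2*beta)/(1+beta): {(1 - 2 * beta) / (1 + beta):.3f}")
```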
Ablation Study¶
| Parameter | Effect | Notes |
|---|---|---|
| Information exponent \(k_*\) | Sample complexity \(\propto d^{k_*-1}\) | Larger \(k_*\) requires more samples to complete the search phase |
| Power-law exponent \(\beta\) | Loss decays as \(n^{-(2\beta-1)/\beta}\) | Power-law scaling requires \(\beta > 1/2\), so that the coefficients \(a_p\) are square-summable |
| Condition number \(\kappa\) | Ours: polynomial dependence vs. prior: exponential dependence | Core improvement |
| Student width \(m\) | \(m = \tilde{\Theta}(P_*)\) suffices | Only logarithmic overparameterization required |
Key Findings¶
- Individual learning curves are staircase-shaped (emergence), but the superposition of \(P \gg 1\) staircases yields a smooth power law.
- For \(d = 2048\), \(P = 1024\), \(\sigma = h_4\), the theoretical and empirical compute-optimal frontier slopes are in close agreement.
- The unstable discretization scheme yields sample scaling exponents consistent with the minimax optimal rate for Gaussian sequence models.
Highlights & Insights¶
- A clean theoretical explanation of Emergence → Scaling Law: Prior theories of scaling laws either assume linear models or fully decoupled tasks. This work is the first to rigorously establish this connection in a nonlinear feature-learning setting.
- Automatic cancellation mechanism: The 2-homogeneous parameterization causes already-learned directions to be automatically removed from the loss, elegantly circumventing the exponential dependence on the condition number that afflicts hierarchical training.
- Single-phase training outperforms hierarchical training: Counter-intuitively, simultaneously updating both layers is more efficient than first optimizing directions and then the second-layer weights, reducing the dependence on the condition number \(\kappa\) from exponential to polynomial.
- Unstable discretization: Choosing a learning rate that is "too large" for weak-signal directions — sacrificing accurate tracking of those directions in exchange for faster convergence on strong-signal directions — is a transferable idea for practical adaptive learning rate design.
Limitations & Future Work¶
- Even activation functions: The analysis assumes \(\sigma\) is even, excluding common activations such as ReLU (whose information exponent is 1, violating \(k_* > 2\)).
- Orthogonal teacher directions: The framework requires \(\{v_p^*\}\) to be orthogonal, whereas feature directions in practical models may be highly correlated.
- Theory–practice gap: The analysis holds asymptotically as \(d \to \infty\); at finite dimension, the empirical scaling law slope may deviate from theoretical predictions.
- Online SGD only: Mini-batch SGD and practically used optimizers such as Adam are not analyzed.
Related Work & Insights¶
- vs. OSSW24 (hierarchical training): That work requires width \(m \gtrsim P^{\Omega(1/a_{\min})}\), whereas this paper requires only \(m = \tilde{O}(P)\). The key distinction is the automatic cancellation mechanism enabled by single-phase training.
- vs. MLGT24, NFLL24 (additive model intuition): These works propose the intuition that "superposition of multiple skills produces scaling laws," but assume fully independent tasks. This paper is the first to rigorously prove approximate decoupling in a nonlinearly coupled setting.
- vs. BAP24, LWK+24 (linear model scaling laws): Those works analyze scaling laws in linear models or the kernel regime; this paper extends the analysis to nonlinear feature learning. While the functional form of the scaling exponents is consistent, the underlying mechanisms differ.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First theoretical result establishing the emergence → scaling law connection in a nonlinear feature-learning setting.
- Experimental Thoroughness: ⭐⭐⭐ Primarily theoretical; numerical experiments only validate the scaling slope, with no large-scale empirical evaluation.
- Writing Quality: ⭐⭐⭐⭐ Mathematically rigorous; proof sketches in the main text effectively convey the core ideas.
- Value: ⭐⭐⭐⭐⭐ Provides a new theoretical perspective on scaling laws with significant implications for the theory community.
Additional Remarks¶
- The theoretical framework and technical tools developed in this paper offer insights for adjacent research areas.
- The core contribution lies in providing a deep theoretical understanding that lays the groundwork for subsequent practical optimization advances.
- The paper is methodologically complementary to other NeurIPS 2025 papers published concurrently.
- The exposition of problem motivation and technical approach is exemplary and worth studying.
- Readers are encouraged to consult the appendix for complete experimental details and full proofs.