
A Unified Stability Analysis of SAM vs SGD: Role of Data Coherence and Emergence of Simplicity Bias

Conference: NeurIPS 2025
arXiv: 2511.17378
Code: https://github.com/changwk1001/Stability_Analysis_and_Simplicity-Bias.git
Area: Optimization Theory
Keywords: SAM, SGD, linear stability, data coherence, simplicity bias

TL;DR

Through a linear stability analysis framework, this paper demonstrates that "flat minima ⇒ better generalization" and "SGD prefers simple functions" are two sides of the same coin — data coherence simultaneously governs both phenomena, and SAM amplifies the simplicity bias further by imposing stricter stability conditions.

Background & Motivation

Background: Two central generalization hypotheses in deep learning: (1) the flat minima hypothesis — SGD/SAM preferentially converges to wide, flat basins of the loss landscape, and flatness correlates positively with generalization; (2) the simplicity bias hypothesis — overparameterized networks tend to learn solutions that rely on a small number of shared features.

Limitations of Prior Work: Both hypotheses are supported by extensive empirical evidence, yet a unified theoretical framework explaining their intrinsic connection is lacking. In particular: why does SGD prefer flat minima? Why does SAM select better solutions even among minima of equal flatness? How does data geometry influence optimization bias?

Key Challenge: SAM is designed to "seek flat minima," yet experiments show it exhibits selectivity beyond flatness — it still prefers certain solutions among minima of equal flatness. This suggests flatness alone is insufficient to explain generalization.

Key Insight: Linear stability analysis — linearizing the optimization dynamics near a minimum and characterizing stability via the spectral properties of the iteration matrix.

Core Idea: Data coherence (the degree of alignment among per-sample Hessians) simultaneously governs both the flatness preference and the simplicity bias; SAM amplifies this selectivity through an additional curvature penalty term.

Core Problem

Can a unified theoretical framework be established to simultaneously explain the selective preferences of SGD and SAM over minima, and reveal how the geometric structure of training data — particularly the gradient alignment across training samples — determines which solutions serve as stable attractors? Specifically: (1) Does noise injection alter the set of stable minima? (2) What additional selectivity does SAM introduce relative to SGD? (3) How do these selectivities manifest as a preference for simple solutions in concrete network architectures?

Method

Overall Architecture

The analysis centers on linear stability analysis: the update dynamics are Taylor-expanded near a minimum \(w^*\), and the spectral properties of the resulting iteration matrix determine whether the minimum is an attractor. The core physical intuition is that if \(\mathbb{E}[\|w_k - w^*\|^2]\) diverges during iteration, the solution is unstable (SGD will escape); if it converges, the solution is stable (SGD will remain).
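
As a concrete illustration of this intuition, here is a minimal numerical sketch (an assumption-laden toy, not the paper's setup): per-sample losses are taken to be quadratics with random rank-one Hessians, mini-batch SGD then becomes a random linear iteration near \(w^*\), and stability can be read off from whether the average squared distance to the minimum grows or shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic model: per-sample losses l_i(w) = 0.5 * (w - w*)^T H_i (w - w*),
# so the per-sample gradient at w is H_i (w - w*).  (Assumed setup, for illustration only.)
n, d, B, eta = 20, 5, 4, 0.1
w_star = np.zeros(d)
A = rng.normal(size=(n, d))
H = np.einsum("ni,nj->nij", A, A)                 # random rank-one per-sample Hessians, shape (n, d, d)

def mean_squared_gap(num_steps, num_trials=200):
    """Estimate E[||w_k - w*||^2] after num_steps of linearized mini-batch SGD."""
    gaps = []
    for _ in range(num_trials):
        w = w_star + 0.01 * rng.normal(size=d)    # start close to the minimum
        for _ in range(num_steps):
            batch = rng.choice(n, size=B, replace=False)
            H_batch = H[batch].mean(axis=0)       # mini-batch Hessian
            w = w - eta * H_batch @ (w - w_star)  # linearized SGD step
        gaps.append(np.sum((w - w_star) ** 2))
    return np.mean(gaps)

# If this quantity grows with num_steps, the minimum is linearly unstable at this (eta, B)
# and SGD escapes; if it shrinks, the minimum is a stable attractor.
print(mean_squared_gap(50), mean_squared_gap(200))
```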

Key Designs

  1. Coherence Measure: The coherence matrix is defined as \(S_{ij} = \text{Tr}(H_i H_j)\), where \(H_i\) is the per-example Hessian of sample \(i\). The coherence measure is \(\sigma = \lambda_{\max}(S)/\max_i \lambda_{\max}(H_i)\). High coherence implies that the curvature directions of different training samples are highly aligned — i.e., the model fits multiple samples using a small number of shared features. The paper proves that high-coherence solutions are more stable (permitting larger learning rates), because gradients along shared directions produce stronger restoring forces. (A small numerical sketch of these quantities follows this list.)

  2. Stability Conditions for SGD, Random Perturbations, and SAM:

     • SGD divergence condition (known): \(\lambda_{\max}(H) \geq \frac{\sigma}{\eta}\left(\frac{n}{B}-1\right)^{-1/2}\)

     • Random perturbations (Theorem 3.1): The divergence condition is identical to that of SGD (noise injection does not alter which minima are stable), but the escape rate is faster by a constant factor; when stable, convergence is inexact and the iterates oscillate near the minimum.

     • SAM divergence condition (Theorem 3.2): \(\lambda_{\max}(H) \geq \frac{\sigma}{\eta}\left(\frac{n}{B}-1\right)^{-1/2}\left(1+\frac{\rho}{\alpha}\lambda_{\min}(H)\right)^{-1}\). The additional curvature factor makes SAM strictly more demanding — minima that are marginally stable under SGD may become unstable under SAM. A matching lower bound (Theorem 3.3) establishes the tightness of this condition.

  3. Realization of Simplicity Bias in Two-Layer ReLU Networks:

     • Memorization vs. generalization solutions (Theorem 3.4): Memorization solutions (each sample activates independent neurons) yield a diagonal coherence matrix (\(S_{ij}=0,\ i\neq j\)), hence minimal coherence and maximal instability; generalization solutions (shared neurons) yield nonzero off-diagonal entries, hence higher coherence and greater stability.

     • \((C,r)\)-generalization solutions (Theorem 3.5): For fixed \(r\) (equal flatness), \(\lambda_{\max}(S) = O(n/2^C \cdot (d+1)^{1/2})\). Smaller \(C\) (fewer features used) implies higher coherence and faster convergence, directly proving that SGD prefers simpler solutions even under equal-flatness constraints.

     • SAM amplifies simplicity bias (Theorem 3.6): SAM's effective coherence matrix contains additional \(\rho/\alpha\)-dependent terms that amplify the stability gap between solutions of different \(C\) values.
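
To make the quantities above concrete, here is a small numerical sketch for a linear least-squares toy model, where the per-example Hessian has the closed form \(H_i = x_i x_i^\top\) and therefore \(\text{Tr}(H_i H_j) = (x_i^\top x_j)^2\). The data, the scaling constant \(\alpha\), and the model itself are assumptions made purely for illustration — the paper's theorems concern two-layer ReLU networks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Linear least-squares toy: l_i(w) = 0.5 * (x_i^T w - y_i)^2, so H_i = x_i x_i^T.
# (Illustrative assumption; the paper instantiates the theory on two-layer ReLU networks.)
n, d = 50, 10
X = rng.normal(size=(n, d))

H_per_sample = np.einsum("ni,nj->nij", X, X)                # H_i = x_i x_i^T
S = np.einsum("nij,mji->nm", H_per_sample, H_per_sample)    # S_ij = Tr(H_i H_j) = (x_i . x_j)^2
H_full = H_per_sample.mean(axis=0)                          # full-batch Hessian

lam_S = np.linalg.eigvalsh(S).max()
lam_Hi_max = max(np.linalg.eigvalsh(Hi).max() for Hi in H_per_sample)
sigma = lam_S / lam_Hi_max                                  # coherence measure

# Divergence thresholds on lambda_max(H), plugging in the conditions stated above.
eta, B, rho, alpha = 0.01, 10, 0.05, 1.0                    # alpha is an assumed scaling constant
lam_H = np.linalg.eigvalsh(H_full)
sgd_threshold = (sigma / eta) * (n / B - 1) ** (-0.5)
sam_threshold = sgd_threshold / (1.0 + (rho / alpha) * lam_H.min())

print(f"coherence sigma          = {sigma:.3f}")
print(f"lambda_max(H)            = {lam_H.max():.3f}")
print(f"SGD divergence threshold = {sgd_threshold:.3f}")
print(f"SAM divergence threshold = {sam_threshold:.3f}  (strictly smaller -> stricter)")
```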

Loss & Training

The theoretical analysis is conducted under a quadratic loss approximation. Experiments employ two-layer ReLU networks with MSE loss, 100 hidden units, batch size 10, \(\eta=0.01\), and \(\rho \in \{0.01, 0.05, 0.1, 0.2\}\). Additional validation is performed on CIFAR-10 with ResNet-18.
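
For reference, a minimal sketch of the SAM update underlying these experiments (the standard two-step form with a normalized ascent direction; the quadratic toy loss in the usage example is a hypothetical stand-in, not the paper's network):

```python
import numpy as np

def sam_step(w, loss_grad, eta=0.01, rho=0.05):
    """One SAM update in its standard two-step form (normalized ascent direction assumed).

    loss_grad(w) should return the mini-batch gradient of the training loss at parameters w.
    """
    g = loss_grad(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # step toward the approximate worst point in a rho-ball
    g_sam = loss_grad(w + eps)                   # re-evaluate the gradient at the perturbed point
    return w - eta * g_sam                       # descend using the perturbed gradient

# Toy usage on a quadratic loss 0.5 * w^T H w (hypothetical stand-in for a mini-batch loss).
H = np.diag([1.0, 10.0])
w = np.array([1.0, 1.0])
for _ in range(100):
    w = sam_step(w, lambda v: H @ v, eta=0.01, rho=0.05)
print(w)
```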

Key Experimental Results

Main Results: Coherence Measure vs. SAM Perturbation Radius

Metric | SGD | SAM (ρ=0.05) | SAM (ρ=0.1) | SAM (ρ=0.2)
--- | --- | --- | --- | ---
\(\lambda_{\max}(S)\) | 133.9 | 121.5 | 90.3 | 65.7
\(\max_i \lambda_{\max}(H_i)\) | 12740 | 10103 | 6422 | 3446

Also reported: Hessian max eigenvalue 6.776 vs. 3.834 and effective rank (PCA 90%) 94.39 vs. 29.14, for SGD and SAM respectively.

Setup: two-layer ReLU network, \(n=100\), variable \(d\), \(x \in \{-1,1\}^d\), \(y = x[0] \cdot x[1]\) (ensuring both simple and complex solutions coexist).
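
A minimal sketch of how such a dataset can be generated (uniform sampling of the inputs is an assumption):

```python
import numpy as np

def make_pairwise_parity_dataset(n=100, d=20, seed=0):
    """Inputs x uniform in {-1, 1}^d; the label depends only on the first two coordinates,
    y = x[0] * x[1], so a 'simple' solution needs two shared features while a 'memorizing'
    solution can fit each point with its own neurons."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1.0, 1.0], size=(n, d))
    y = X[:, 0] * X[:, 1]
    return X, y

X, y = make_pairwise_parity_dataset()
print(X.shape, y[:5])
```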

Ablation Study

Configuration | Description
--- | ---
Stability boundary in \((B,\sigma)\) space | SGD and random-perturbation boundaries largely coincide (validating Thm. 3.1); the SAM boundary is strictly tighter
Varying \(\rho/\alpha\) | Increasing the SAM perturbation radius further shrinks the stable region
Fixed \(r\), varying \(C\) | SAM converges faster to low-\(C\) solutions (validating Thms. 3.5 and 3.6)
Training dynamics | Coherence is a dynamic quantity; SAM more aggressively reduces \(\max_i \lambda_{\max}(H_i)\)
CIFAR-10 / ResNet-18 | SAM reduces feature effective rank; the approximate coherence measure decreases monotonically with SAM \(\rho\)

Key Findings

  • SAM's effect extends beyond "seeking flat minima" — it still selects high-coherence (simple) solutions among minima of equal flatness.
  • Coherence is a dynamic quantity: \(\lambda_{\max}(S)\) evolves throughout training, and SAM reduces it more effectively than SGD.
  • Random perturbations (noise injection) do not alter which minima are stable; they only accelerate escape from unstable minima.

Highlights & Insights

  • Theoretical elegance of unification: The coherence measure \(\sigma\) simultaneously explains both the flat-minima preference and the simplicity bias, revealing them as two sides of the same coin.
  • Theorem 3.4 (memorization solutions ⟺ diagonal coherence matrix) is particularly elegant: It bridges the abstract notion of coherence with concrete neuron activation patterns.
  • SAM transcends flatness search: SAM not only prefers flat solutions but also prefers solutions that exploit shared features — it retains discriminative power within equal-flatness solution manifolds.
  • Matching lower bound (Theorem 3.3): Establishes the tightness of the SAM divergence condition, with upper and lower bounds differing by at most a constant factor.
  • Practical implication: The simplicity bias of SGD/SAM may partially explain why simple models or prompts outperform more complex alternatives in certain settings.

Limitations & Future Work

  • The core analysis relies on a linear approximation near minima (quadratic loss); non-local dynamics are not covered.
  • Only SGD and SAM are analyzed; practical optimizers such as momentum SGD and Adam are not addressed.
  • The construction of \((C,r)\)-generalization solutions, while principled, is restrictive; solution structures in real networks are more complex.
  • The coherence measure is computationally expensive (requiring per-sample Hessians) and is currently impractical for use during actual training.
  • The CIFAR-10 experiments employ an approximate coherence measure, introducing a gap relative to the theoretical definition.
  • Experiments are conducted primarily on synthetic binary data; behavior on real-world datasets (e.g., ImageNet) may differ.
  • vs. Dexter et al. (2024): That work analyzes the linear stability of SGD only; this paper extends the framework to random perturbations and SAM, and instantiates the theory on two-layer ReLU networks.
  • vs. Foret et al. (2021, original SAM): The original SAM paper interprets the method as seeking flat minima; this paper proves that SAM's bias extends beyond flatness — it additionally favors high-coherence (simple) solutions.
  • vs. Andriushchenko et al. (2023): That empirical work observes that SAM learns low-rank representations; this paper provides a theoretical explanation.
  • The coherence measure may inspire new optimizer designs — e.g., adaptive learning rates based on gradient alignment across mini-batches.
  • The framework connects to the Neural Collapse phenomenon: the inter-class feature alignment observed in late-stage training can be understood as a high-coherence state.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Unifying the flatness and simplicity hypotheses via a coherence measure constitutes a genuinely insightful theoretical contribution.
  • Experimental Thoroughness: ⭐⭐⭐ — Synthetic data validation is thorough, but real-data validation is limited (only approximate CIFAR-10 experiments).
  • Writing Quality: ⭐⭐⭐⭐ — The theoretical narrative is clear; the theorem → discussion → empirical evidence structure is well-organized, though notation is heavy.
  • Value: ⭐⭐⭐⭐ — Contributes deep insight into generalization mechanisms in deep learning, though practical utility is limited by the computational intractability of the coherence measure.