Generalization Below the Edge of Stability: The Role of Data Geometry¶
Conference: ICLR 2026 arXiv: 2510.18120 Code: None Area: Learning Theory / Optimization Keywords: Generalization Theory, Edge of Stability, Data Geometry, ReLU Networks, Implicit Regularization
TL;DR¶
This paper introduces the principle of data shatterability to provide a unified explanation of how data geometry governs the strength of implicit regularization induced by gradient descent near the Edge of Stability (EoS). For the Beta(α) radial distribution family, the authors derive a spectrum of generalization upper and lower bounds that depend on α. For mixture distributions supported on low-dimensional subspaces, they prove that the generalization rate adapts to the intrinsic dimension \(m\) rather than the ambient dimension \(d\).
Background & Motivation¶
Background: Overparameterized neural networks generalize well even without explicit regularization (e.g., weight decay), a phenomenon that classical statistical learning theory fails to explain. The discovery of the Edge of Stability (EoS)—where large-step GD training drives the Hessian's largest eigenvalue to \(\lambda_{\max}(\nabla^2\mathcal{L}) \approx 2/\eta\)—offers a new lens for understanding implicit regularization.
Limitations of Prior Work:
- Existing work has shown that the EoS condition is equivalent to a data-dependent weighted path norm constraint, but generalization bounds derived for the uniform spherical distribution suffer from the curse of dimensionality—contradicting the empirical success of deep learning.
- No unified theoretical framework exists to determine which data geometries lead to generalization and which lead to memorization.
- Existing generalization bounds are distribution-agnostic and cannot distinguish the effects of different data geometries.
Key Challenge: The data-dependent regularization induced by EoS varies dramatically across distributions—networks trained on data lying on a sphere can memorize without penalty, while data inside the ball is subject to strong regularization constraints. A unifying principle is needed to explain this discrepancy.
Goal: The paper introduces the concept of data shatterability—the difficulty of shattering a data distribution using ReLU half-spaces—as the core geometric quantity governing generalization behavior.
Method¶
Overall Architecture¶
The theoretical analysis is built on the BEoS (Below Edge of Stability) condition for two-layer ReLU networks:
The BEoS condition \(\lambda_{\max}(\nabla^2_{\boldsymbol{\theta}}\mathcal{L}) \leq 2/\eta\) is equivalent to an upper bound on the data-dependent weighted path norm. The core technical pipeline is:
Half-space depth stratification → Good/bad region decomposition → Generalization upper bound → Instantiation with data geometry
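To make the BEoS condition concrete, here is a minimal numerical sketch (the paper releases no code; the toy network, data, and sizes below are purely illustrative): it builds a tiny bias-free two-layer ReLU network, forms the loss Hessian by central finite differences, and compares \(\lambda_{\max}\) against \(2/\eta\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, n = 2, 3, 8          # toy sizes, chosen only for illustration
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def loss(theta):
    """MSE loss of a bias-free two-layer ReLU net f(x) = a . relu(W x)."""
    W = theta[: width * d].reshape(width, d)
    a = theta[width * d:]
    return 0.5 * np.mean((np.maximum(X @ W.T, 0.0) @ a - y) ** 2)

theta = 0.5 * rng.normal(size=width * d + width)
p, eps = theta.size, 1e-4

# Central-difference Hessian of the loss (adequate for a toy problem).
H = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        def L(di, dj, i=i, j=j):
            t = theta.copy()
            t[i] += di
            t[j] += dj
            return loss(t)
        H[i, j] = (L(eps, eps) - L(eps, -eps)
                   - L(-eps, eps) + L(-eps, -eps)) / (4 * eps ** 2)

eta = 0.4                      # learning rate from the paper's experiments
lam_max = np.linalg.eigvalsh((H + H.T) / 2).max()
print(f"lambda_max = {lam_max:.3f}, 2/eta = {2 / eta:.3f}, "
      f"BEoS satisfied: {lam_max <= 2 / eta}")
```

In practice one would use Hessian-vector products with autodiff rather than explicit finite differences; the point here is only the check \(\lambda_{\max} \leq 2/\eta\) itself.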
Key Design 1: Half-Space Depth for Partition Quantification¶
The paper introduces the Tukey half-space depth to stratify the input space:
For the \(T\)-deep region \(\Omega_T\), any ReLU activation boundary passing through this region must retain at least a \(T\)-fraction of the data on each side, which gives a positive lower bound on the weight function \(g(\boldsymbol{u}, t)\) that defines the weighted path norm. This in turn controls the (unweighted) path norm of neurons in this region by \(O(1/g_{\min}(T))\).
Splitting the error into a deep ("good") region, where the path norm is tightly controlled, and a shallow ("bad") region, whose contribution is bounded by its probability mass, yields the paper's key generalization decomposition.
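The Tukey half-space depth itself is easy to approximate numerically. The sketch below (illustrative, not from the paper) estimates the depth of a point as the minimum, over random unit directions, of the fraction of data falling in the closed half-space on one side of the point:

```python
import numpy as np

def halfspace_depth(x, data, n_dirs=2000, seed=None):
    """Monte Carlo approximation of the Tukey half-space depth of x:
    min over unit directions u of the fraction of data points z with
    <u, z - x> >= 0."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    u = rng.normal(size=(n_dirs, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)   # random unit directions
    # For each direction, fraction of points on the non-negative side of x.
    frac = ((data - x) @ u.T >= 0).mean(axis=0)
    return frac.min()

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 2))
center_depth = halfspace_depth(data.mean(axis=0), data, seed=2)   # deep point
extreme = data[np.argmax(np.linalg.norm(data, axis=1))]
edge_depth = halfspace_depth(extreme, data, seed=2)               # shallow point
print(center_depth, edge_depth)
```

As expected, the sample mean lies deep (depth near 1/2) while the most extreme point is shallow (depth near 0), which is exactly the stratification the \(T\)-deep regions \(\Omega_T\) formalize.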
Key Design 2: Generalization Spectrum for Isotropic Beta(α) Radial Distributions¶
Define \(\boldsymbol{X} = h(R)\boldsymbol{U}\), where \(h(r) = 1 - (1-r)^{1/\alpha}\), \(R \sim \text{Uniform}[0,1]\), \(\boldsymbol{U} \sim \text{Uniform}(\mathbb{S}^{d-1})\).
- Large \(\alpha\) → mass concentrated near the origin → small shallow-region probability → good generalization.
- Small \(\alpha\) → mass concentrated near the sphere → many disjoint spherical caps can be packed → easy memorization.
- \(\alpha \to 0\) (spherical limit) → a network of width \(\leq n\) can perfectly interpolate under the BEoS condition with \(\lambda_{\max} \leq 1 + (D^2+2)/n\).
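The Beta(α) radial family is straightforward to simulate. A short sketch (illustrative; the paper provides no code) samples \(\boldsymbol{X} = h(R)\boldsymbol{U}\) and confirms the radial concentration: the mean radius is \(1/(1+\alpha)\), so mass moves from the sphere toward the origin as \(\alpha\) grows.

```python
import numpy as np

def sample_beta_radial(n, d, alpha, seed=None):
    """Sample X = h(R) * U with h(r) = 1 - (1 - r)**(1/alpha),
    R ~ Uniform[0, 1] and U uniform on the unit sphere S^{d-1}."""
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # uniform directions
    R = rng.uniform(size=(n, 1))
    return (1.0 - (1.0 - R) ** (1.0 / alpha)) * U

for alpha in (0.1, 1.0, 5.0):
    radii = np.linalg.norm(sample_beta_radial(5000, 5, alpha, seed=0), axis=1)
    # E[h(R)] = 1/(1 + alpha): ~0.91 for alpha=0.1, ~0.17 for alpha=5.
    print(f"alpha={alpha}: mean radius {radii.mean():.3f}")
```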
Both the upper bound (Theorem 3.4) and the lower bound (Theorem 3.5) depend on \(\alpha\), with rates \(n^{-\alpha(d+3)/(2(d^2+4\alpha d+3\alpha))}\) and \(n^{-2\alpha/(d-1+2\alpha)}\), respectively.
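Plugging numbers into these exponents shows how both bounds tighten with \(\alpha\). A quick sketch, using the rates exactly as stated above (the helper name is ours, not the paper's):

```python
import numpy as np

def rate_exponents(alpha, d):
    """Exponents of n in the Theorem 3.4 / 3.5 rates as quoted above
    (larger exponent = faster decay of generalization error)."""
    upper = alpha * (d + 3) / (2 * (d ** 2 + 4 * alpha * d + 3 * alpha))
    lower = 2 * alpha / (d - 1 + 2 * alpha)
    return upper, lower

d = 5   # dimension used in the paper's experiments
for alpha in (0.1, 0.3, 1.5, 5.0):
    up, lo = rate_exponents(alpha, d)
    print(f"alpha={alpha}: upper-bound exponent {up:.3f}, "
          f"lower-bound exponent {lo:.3f}")
```

Both exponents increase monotonically in \(\alpha\), matching the qualitative spectrum from memorization (small \(\alpha\)) to strong generalization (large \(\alpha\)).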
Key Design 3: Adaptation to Low-Dimensional Structure¶
For a mixture distribution \(\mathcal{P}_X = \sum_{j=1}^J \pi_j \mathcal{P}_{X,j}\), where each component is a uniform ball distribution on an \(m\)-dimensional affine subspace (\(m < d\)), the paper proves a generalization rate governed by the intrinsic dimension \(m\) rather than the ambient dimension \(d\).
The core mechanism: when the network is restricted to subspace \(V_j\), neuron activations are determined solely by \(\text{proj}_{V_j} \boldsymbol{w}_k\), so high-dimensional hyperplanes degenerate into low-dimensional "knots," dramatically reducing shatterability.
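This mechanism can be verified in a few lines: for data confined to a subspace through the origin, replacing a weight vector by its projection onto that subspace leaves every ReLU activation pattern unchanged (illustrative sketch, not the paper's code; an affine offset would be absorbed by the bias):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 100, 1, 50
V, _ = np.linalg.qr(rng.normal(size=(d, m)))   # orthonormal basis of V_j
Z = rng.normal(size=(n, m))
X = Z @ V.T                                    # data confined to the subspace

w = rng.normal(size=d)
w_proj = V @ (V.T @ w)                         # proj_{V_j} w

# Pre-activations agree because X @ w = Z @ (V.T @ w) = X @ w_proj,
# so the ReLU activation pattern depends only on the projected weight.
acts_full = X @ w > 0
acts_proj = X @ w_proj > 0
print("patterns identical:", np.array_equal(acts_full, acts_proj))
```

Effectively only the \(m\) projected coordinates of each weight vector matter, which is why the attainable partitions (and hence shatterability) collapse to the intrinsic dimension.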
Key Experimental Results¶
Main Results: Generalization Rate Verification on Isotropic Distributions¶
In \(d=5\) dimensional space, two-layer ReLU networks of width 1000 are trained for 20,000 epochs with learning rate 0.4 on Beta(α) radial distributions for \(\alpha \in \{0.1, 0.3, 1.5, 5.0\}\).
| Distribution parameter \(\alpha\) | Log-log slope (observed) | Theoretical trend |
|---|---|---|
| 0.1 | ≈ −0.05 (nearly no generalization) | Mass near sphere → memorization |
| 0.3 | ≈ −0.12 | Weak generalization |
| 1.5 | ≈ −0.25 | Moderate generalization |
| 5.0 | ≈ −0.38 (steepest) | Mass near origin → strong generalization |
Larger \(\alpha\) yields a steeper log-log slope, indicating faster generalization, consistent with theory.
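The reported slopes are least-squares fits of \(\log(\text{test error})\) against \(\log n\). A minimal sketch of that fit, applied to a hypothetical error curve decaying like \(n^{-0.38}\) (the \(\alpha = 5.0\) regime); the data here is synthetic, not the paper's:

```python
import numpy as np

def loglog_slope(sample_sizes, test_errors):
    """Least-squares slope of log(error) vs log(n); the tables report this."""
    slope, _ = np.polyfit(np.log(sample_sizes), np.log(test_errors), 1)
    return slope

# Hypothetical error curve with a known n^{-0.38} decay:
ns = np.array([100, 200, 400, 800, 1600])
errs = 2.0 * ns ** -0.38
print(round(loglog_slope(ns, errs), 2))   # -> -0.38
```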
Ablation Study: Intrinsic Dimension Adaptation¶
20 one-dimensional lines embedded in \(\mathbb{R}^d\) for \(d \in \{10, 50, 100, 500\}\):
| Ambient dimension \(d\) | Log-log slope | Change vs. \(d=10\) |
|---|---|---|
| 10 | ≈ −0.22 | Baseline |
| 50 | ≈ −0.21 | +0.01 |
| 100 | ≈ −0.21 | +0.01 |
| 500 | ≈ −0.20 | +0.02 |
The slope remains nearly constant, confirming that generalization adapts to the intrinsic dimension (\(m=1\)) and is essentially unaffected by the ambient dimension. As a control, the uniform ball distribution (\(\alpha=1\)) exhibits significantly degraded generalization as \(d\) increases.
Validation on MNIST¶
| Data type | Clean MSE after 20,000 steps | Behavior |
|---|---|---|
| \(\mathcal{N}(0, I_{784})\) | ≈ 1.0 (noise level) | Rapid memorization |
| MNIST images | ≈ 0.2 | Resists overfitting for 10,000+ steps |
Gaussian data concentrates on a thin spherical shell (high shatterability), so it is rapidly memorized, whereas MNIST approximately lies on a low-dimensional structure and resists overfitting. MNIST points with smaller half-space depth exhibit larger prediction errors, consistent with the theoretical predictions.
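The "thin shell" claim for Gaussian data is easy to check numerically: in \(d = 784\), norms concentrate tightly around \(\sqrt{d}\) with \(O(1)\) spread (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 784, 5000
norms = np.linalg.norm(rng.normal(size=(n, d)), axis=1)

# Norms cluster around sqrt(d) ~= 28 with standard deviation ~1/sqrt(2),
# so the relative spread shrinks like 1/sqrt(d): a thin spherical shell.
print(f"sqrt(d) = {np.sqrt(d):.2f}, mean = {norms.mean():.2f}, "
      f"std = {norms.std():.3f}")
```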
Highlights & Insights¶
Strengths¶
- Theoretical depth: The paper unifies EoS implicit regularization, data geometry, and generalization within a single "shatterability" framework, providing both upper and lower bounds across the \(\alpha\) spectrum.
- Breakthrough insight: It explains why real data (low-dimensional manifolds) is harder to overfit than random Gaussian data—a theoretical response to Zhang et al. (2017) "Rethinking Generalization."
- Technical innovation: Half-space depth stratification avoids the explosion of global metric entropy, breaking through the bottleneck of distribution-agnostic bounds.
- Practical implications: The framework provides theoretical justification for Mixup data augmentation and activation-frequency-based pruning.
Limitations & Future Work¶
- The analysis is restricted to two-layer ReLU networks; extension to deeper networks faces theoretical challenges regarding the propagation of EoS regularization.
- The depth-quantization concentration exponent \(\mathsf{S}_{\text{DQ}}\) admits precise characterization only for isotropic distributions; quantifying shatterability for non-isotropic data remains heuristic.
- Experiments are limited to simple synthetic data and MNIST; the predictive power of the theory on more complex datasets such as CIFAR/ImageNet has not been validated.
Rating¶
⭐⭐⭐⭐⭐
This is a theoretically rigorous and broadly applicable work that, for the first time, establishes a quantitative connection between EoS implicit regularization and data geometry. The concept of "data shatterability" elegantly unifies previously disparate empirical observations—real data is harder to overfit than random data, low-dimensional data generalizes better, data on a sphere is easily memorized—and provides a solid theoretical foundation for understanding why deep learning can escape the curse of dimensionality in practice.