SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures¶
- Conference: NeurIPS 2025
- arXiv: 2505.09572
- Code: GitHub
- Area: Deep Learning Theory / Optimization Theory
- Keywords: gradient flow, o-minimal structures, divergence phenomenon, SAD activation functions, asymptotic optimality
TL;DR¶
Using mathematical tools from o-minimal structures, this paper establishes a dichotomy for gradient flows in fully connected networks with common smooth activation functions (sigmoid, tanh, softplus, GELU, etc.): the flow either converges to a critical point or diverges to infinity with the loss converging to an asymptotic critical value. In particular, for polynomial target functions, the paper proves that the loss cannot be exactly zero but can be made arbitrarily close to zero, which necessarily causes parameter divergence.
Background & Motivation¶
In deep learning practice, gradient-based optimization methods often achieve near-zero training loss, yet the theoretical understanding of convergence on non-convex loss landscapes remains incomplete. The central question is:
Under what conditions does gradient flow converge to a good local minimum?
Existing theory largely relies on strong assumptions: convexity, overparameterization, specific initialization, Łojasiewicz inequalities, etc., and almost all results implicitly or explicitly assume bounded parameter trajectories. However, a growing body of work shows this assumption does not always hold; the simplest example is logistic regression on linearly separable data, whose parameters diverge to infinity.
The motivation of this paper is to bridge this gap by dropping the boundedness assumption and directly analyzing gradient flow behavior in the unbounded regime. The authors identify a key mathematical structure: all functions arising in neural network training can be "defined" within an o-minimal structure, which endows them with strong finiteness and rigidity properties.
Method¶
Overall Architecture¶
The paper presents results at two levels:

1. General dichotomy theorem (Theorem 2.8): applies to gradient flows with any \(C^1\) definable activation function.
2. Divergence theorem for polynomial targets (Corollary 3.6): a specialization to SAD activation functions and polynomial target functions.
Key Designs¶
- Application of o-minimal structures: O-minimal structures are a concept from mathematical model theory. Intuitively, they collect the sets and functions that can be defined by first-order logic from "well-behaved" operations (addition, subtraction, multiplication, division, exponential, logarithm, derivatives, antiderivatives). The defining property is that every definable subset of \(\mathbb{R}\) is a finite union of points and intervals, which provides extremely strong finiteness guarantees.
Common activation functions are Pfaffian functions (whose partial derivatives are compositions of themselves and polynomials), and Wilkie (1999) proved that the structure generated by all Pfaffian functions is o-minimal. This covers sigmoid, tanh, softplus, swish, GELU, Mish, ELU, softsign, and others.
- Dichotomy theorem (Theorem 2.8): For \(C^1\) definable activation and loss functions, the gradient flow \(\Theta'(t) = -\nabla\mathcal{L}(\Theta(t))\) admits a unique global solution satisfying exactly one of the following:
  - (a) \(\lim_{t\to\infty}\Theta(t)\) exists and is a critical point.
  - (b) \(\lim_{t\to\infty}\|\Theta(t)\|=\infty\) and \(\lim_{t\to\infty}\mathcal{L}(\Theta(t))\) is an asymptotic critical value.
A key corollary: there exists \(\varepsilon > 0\) such that any gradient flow whose initial loss lies within \(\varepsilon\) of the optimal value has its loss converge to that optimal value. This "attraction threshold" appears to have gone unnoticed in the deep learning literature.
- SAD (Sublinear Analytic Definable) activation function class: Three properties are defined:
  - (S) Sublinearity: \(\limsup_{t\to\infty}\|f(tx)\|/t < \infty\)
  - (A) Analyticity: \(f\) is an analytic function
  - (D) Definability: \(f\) is definable in some o-minimal structure
Sigmoid, tanh, softplus, swish, GELU, and Mish are all SAD. ReLU satisfies (S) and (D) but not (A). The SAD class is closed under composition: neural networks built from SAD activations remain SAD.
- Non-exact representability for polynomial targets (Theorem 3.4): For any polynomial target with \(\deg(f)\geq 2\), SAD activations, and sufficiently large architectures/datasets, it is proved that \(\mathcal{L}(\theta) > 0\) for all \(\theta\). The core argument is:
  - Sublinearity of SAD networks prevents them from globally matching polynomials of degree \(\geq 2\).
  - Analyticity and definability provide enough "rigidity" to detect this mismatch already on finite data.
  - At the same time, analyticity guarantees sufficiently non-trivial Taylor coefficients to allow arbitrarily good approximation (Theorem 3.5: \(\inf_\theta \mathcal{L}(\theta) = 0\)).
Combined with the dichotomy theorem: the loss can never reach zero yet can be made arbitrarily close to zero, so the optimum is attained only asymptotically and the gradient flow must diverge (a toy numerical illustration of this mechanism follows below).
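A minimal NumPy sketch of this mechanism (illustrative only, not the authors' code; the width, target, sample grid, and step size are assumptions) Euler-discretizes the gradient flow \(\Theta'(t) = -\nabla\mathcal{L}(\Theta(t))\) for a one-hidden-layer tanh network fitted by MSE to the quadratic target \(f(x) = x^2\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: samples of the degree-2 target f(x) = x^2 on [-1, 1],
# fitted in mean squared error by a one-hidden-layer tanh network with H units.
X = np.linspace(-1.0, 1.0, 20)
Y = X ** 2
H = 8
N = len(X)

w = 0.1 * rng.standard_normal(H)   # input weights
b = 0.1 * rng.standard_normal(H)   # hidden biases
a = 0.1 * rng.standard_normal(H)   # output weights
c = 0.0                            # output bias

dt = 1e-2                          # explicit Euler step for Theta'(t) = -grad L(Theta(t))
for step in range(50_001):
    h = np.tanh(np.outer(X, w) + b)        # (N, H) hidden activations
    r = h @ a + c - Y                      # residuals of the current network
    if step % 10_000 == 0:
        norm = np.sqrt((w ** 2).sum() + (b ** 2).sum() + (a ** 2).sum() + c ** 2)
        print(f"t={step * dt:7.1f}  loss={np.mean(r ** 2):.6f}  ||theta||={norm:.3f}")
    s = (1.0 - h ** 2) * r[:, None]        # backprop through tanh
    dw = (2.0 / N) * (X[:, None] * s).sum(axis=0) * a
    db = (2.0 / N) * s.sum(axis=0) * a
    da = (2.0 / N) * h.T @ r
    dc = (2.0 / N) * r.sum()
    w, b, a, c = w - dt * dw, b - dt * db, a - dt * da, c - dt * dc
```

If the theory applies as stated, the printed loss should keep decreasing toward (but never reach) zero while \(\|\Theta\|\) keeps growing, i.e. case (b) of the dichotomy; with plain gradient flow the norm growth is expected to be very slow.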
Loss Functions / Optimization Setting¶
The theory applies to both empirical loss (finite datasets) and expected loss over continuous distributions (compactly supported density functions). Covered loss functions include mean squared error, binary cross-entropy, Huber loss, and all other \(C^1\) definable losses.
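As a concrete instance of this setting (notation assumed here, with squared error as the per-sample loss), the two objectives read

\[
\mathcal{L}_{\mathrm{emp}}(\Theta)=\frac{1}{N}\sum_{i=1}^{N}\bigl(f_\Theta(x_i)-f(x_i)\bigr)^2,
\qquad
\mathcal{L}_{\mathrm{exp}}(\Theta)=\int_{\mathbb{R}^d}\bigl(f_\Theta(x)-f(x)\bigr)^2\,p(x)\,\mathrm{d}x,
\]

where \(f_\Theta\) denotes the network, \(f\) the target function, and \(p\) a compactly supported density.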
Key Experimental Results¶
Polynomial Target Experiments¶
| Dimension | Activation | Optimizer | Loss→0 | Param. norm→∞ |
|---|---|---|---|---|
| 1D | sigmoid/tanh/softplus/swish/GELU | GD | ✓ | ✓ (slow) |
| 2D | same | GD | ✓ | ✓ |
| 4D | same | GD | ✓ | ✓ |
| 1D | same | Adam | ✓ | ✓ (faster) |
| 2D | same | Adam | ✓ | ✓ |
| 4D | same | Adam | ✓ | ✓ |
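The 1D rows can be reproduced qualitatively with a short PyTorch script (a sketch under assumed hyperparameters and architecture, not the authors' released code): train a small GELU network on \(f(x)=x^2\) once with full-batch gradient descent and once with Adam, then report the final loss and parameter norm.

```python
import torch

torch.manual_seed(0)
X = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)
Y = X ** 2                                     # degree-2 polynomial target

def run(opt_name, steps=20_000, lr=1e-2):
    net = torch.nn.Sequential(
        torch.nn.Linear(1, 32), torch.nn.GELU(),
        torch.nn.Linear(32, 1),
    )
    opt_cls = {"GD": torch.optim.SGD, "Adam": torch.optim.Adam}[opt_name]
    opt = opt_cls(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((net(X) - Y) ** 2)   # full-batch MSE
        loss.backward()
        opt.step()
    norm = torch.sqrt(sum(p.pow(2).sum() for p in net.parameters()))
    print(f"{opt_name:>4}: final loss={loss.item():.2e}  ||theta||={norm.item():.2f}")

run("GD")    # expected: loss near zero, slow parameter-norm growth
run("Adam")  # expected: faster loss decay and a noticeably larger parameter norm
```

Swapping in an optimizer with weight decay (e.g. torch.optim.AdamW) should keep the norm bounded, consistent with the remark in the limitations that regularization prevents the divergence.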
Extended Experiments on Complex Tasks¶
| Task | Activation | Loss decrease | Param. growth | Notes |
|---|---|---|---|---|
| Heat PDE | GELU | ✓ | ✓ | Deep Kolmogorov method |
| Black-Scholes PDE | GELU | ✓ | ✓ | same |
| MNIST classification | GELU | ✓ | ✓ | cross-entropy loss |
Ablation Observations¶
| Comparison | Param. growth rate | Loss convergence rate |
|---|---|---|
| GD vs Adam | Adam grows much faster | Adam converges much faster |
| Theoretical prediction \(O(\sqrt{t})\) vs. observation | Observed growth closer to logarithmic | Consistent with Lyu & Li's logarithmic results |
Key Findings¶
- Parameter growth under gradient descent is very slow (close to logarithmic), which explains why parameters in practice typically appear bounded.
- Adam, due to its adaptive step sizes, sustains larger update magnitudes and exhibits more pronounced parameter growth.
- Divergence is not limited to polynomial targets; it is also observed in PDE solving and MNIST classification.
Highlights & Insights¶
- Introducing mathematical logic (o-minimal theory) into deep learning: This is a highly unusual cross-disciplinary connection. The "finiteness theorem" and "uniform finiteness theorem" from o-minimal structures serve as key technical tools.
- Elegant and practical definition of the SAD function class: The three properties—sublinearity, analyticity, and definability—are well-chosen, enjoy good closure properties, and provide a convenient framework for future theoretical study of smooth networks.
- Existence of an "attraction threshold" \(\varepsilon\) (Theorem 2.8(v)): This implies that sufficiently good initialization necessarily leads to loss convergence to the optimal value without additional convergence conditions—a previously overlooked result.
- The special status of ReLU: The theory explicitly shows that the divergence conclusion does not apply to ReLU; shallow ReLU networks admit global minima for polynomial targets (for example, they can interpolate finite data exactly). This explains the fundamental behavioral differences between activation functions (a minimal interpolation construction is sketched below).
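To make the contrast concrete, the following sketch uses the classical piecewise-linear interpolation construction (not taken from the paper; names are illustrative) to build a one-hidden-layer ReLU network that fits finite samples of a quadratic target exactly, i.e. attains zero empirical loss, which Theorem 3.4 rules out for SAD activations:

```python
import numpy as np

def relu_interpolant(xs, ys):
    """Weights of a shallow ReLU net  g(x) = c0 + sum_k a[k] * relu(x - knots[k])
    that passes exactly through the points (xs[k], ys[k]) (xs assumed distinct)."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    order = np.argsort(xs)
    xs, ys = xs[order], ys[order]
    slopes = np.diff(ys) / np.diff(xs)                   # piecewise-linear slopes
    a = np.concatenate([[slopes[0]], np.diff(slopes)])   # slope change at each knot
    return xs[:-1], a, ys[0]

def evaluate(knots, a, c0, x):
    return c0 + np.maximum(np.subtract.outer(x, knots), 0.0) @ a

# Finite samples of a degree-2 polynomial: zero training error is attainable with ReLU.
xs = np.linspace(-1.0, 1.0, 9)
ys = xs ** 2
knots, a, c0 = relu_interpolant(xs, ys)
print(np.max(np.abs(evaluate(knots, a, c0, xs) - ys)))   # ~0 up to float round-off
```

Off the sample points the ReLU network is only piecewise linear, so it still does not equal the polynomial globally; the zero-loss global minimum exists only for the finite-data objective.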
Limitations & Future Work¶
- Applies only to gradient flow (continuous time); the approach does not directly extend to (stochastic) gradient descent, with the key obstacle being the lack of control over the Hessian.
- Numerical experiments are relatively limited, serving primarily as proof of concept.
- Activation functions must be at least \(C^1\) (some results require analyticity), excluding ReLU.
- L2 regularization or weight decay (e.g., AdamW) prevents divergence—suggesting this phenomenon may be less common in practice.
- Directional convergence of diverging parameters remains an open problem within exponential-function-based o-minimal structures.
Related Work & Insights¶
- The divergence results of Lyu & Li (2020) and Vardi et al. (2022) for homogeneous networks are special cases; this paper provides a unified perspective.
- The Kurdyka–Łojasiewicz inequality, which guarantees convergence of bounded gradient trajectories for definable functions, is a classical result; this paper extends the analysis to the unbounded regime.
- The work provides theoretical support for studies on parameter norm behavior (implicit bias, weight norm growth).
- Insight: weight decay in practice has a principled justification—it prevents the theoretically inevitable parameter divergence.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The systematic application of o-minimal structures in deep learning is highly original.
- Experimental Thoroughness: ⭐⭐⭐ Experiments are primarily proof-of-concept, with limited scale and diversity.
- Writing Quality: ⭐⭐⭐⭐ Mathematically rigorous, though the barrier to entry is high for readers unfamiliar with model theory.
- Value: ⭐⭐⭐⭐ Provides a new theoretical perspective for understanding gradient dynamics; the SAD framework has value for follow-up research.