Nonlinearly Preconditioned Gradient Methods: Momentum and Stochastic Analysis

Conference: NeurIPS 2025 · arXiv: 2510.11312 · Code: GitHub · Area: Others · Keywords: nonlinear preconditioning, gradient clipping, anisotropic smoothness, heavy ball momentum, stochastic optimization

TL;DR

Under the anisotropic descent inequality framework, this paper introduces heavy ball momentum into nonlinearly preconditioned gradient methods and analyzes the convergence properties of their stochastic variants under multiple noise assumptions, thereby unifying the theoretical analysis of gradient clipping and normalized gradient methods.

Background & Motivation

Many cost functions in modern machine learning do not satisfy the classical Lipschitz gradient condition. For instance, loss functions arising in language model training exhibit more general \((L_0, L_1)\)-smoothness: \(\|\nabla^2 f(x)\| \leq L_0 + L_1 \|\nabla f(x)\|\). Gradient clipping and normalization are standard practical techniques for addressing such problems, yet theoretical analyses are typically restricted to specific algorithmic forms.
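To make the condition concrete, consider the scalar quartic \(f(x) = x^4\): its second derivative \(12x^2\) is unbounded, so classical \(L\)-smoothness fails, yet \(|f''(x)| \leq L_0 + L_1|f'(x)|\) holds with, e.g., \(L_0 = 12\) and \(L_1 = 3\). A minimal numerical check (our illustrative example, not from the paper):

```python
import numpy as np

# f(x) = x**4: check |f''(x)| <= L0 + L1 * |f'(x)| on a wide grid.
# L0 = 12, L1 = 3 are illustrative constants, not values from the paper.
x = np.linspace(-100.0, 100.0, 200_001)
grad = 4 * x**3    # f'(x)
hess = 12 * x**2   # f''(x)
assert np.all(np.abs(hess) <= 12 + 3 * np.abs(grad))
```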

A core observation is that gradient clipping and normalization methods can be written in the unified form of nonlinearly preconditioned gradient descent:

\[x^{k+1} = x^k - \gamma \nabla\phi^*(\nabla f(x^k))\]

where \(\phi\) is a reference function and \(\nabla\phi^*\) is the preconditioning map. Different choices of \(\phi\) yield different algorithms:

  • \(\phi(x) = \varepsilon(-\|x\| - \ln(1-\|x\|))\) → normalized gradient descent (NGD)
  • \(\phi(x) = \cosh(\|x\|) - 1\) → inverse hyperbolic gradient descent (iHGD), the focus of this paper

The preconditioning map of iHGD is \(\nabla\phi^*(y) = \text{arsinh}(\|y\|) \frac{y}{\|y\|}\), which, unlike hard-threshold clipping, adaptively scales gradients: large gradients are compressed but retain information, while small gradients are left nearly unchanged.
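To make this concrete, here is a minimal NumPy sketch of one preconditioned iHGD step (function names and interface are ours; the paper's repository contains the reference implementation):

```python
import numpy as np

def ihgd_step(x, grad_f, gamma):
    """One step x^{k+1} = x^k - gamma * dphi*(grad f(x^k)) with the iHGD
    preconditioner dphi*(y) = arsinh(||y||) * y / ||y||."""
    g = grad_f(x)
    norm = np.linalg.norm(g)
    if norm == 0.0:  # already stationary
        return x
    # arsinh(t)/t -> 1 as t -> 0, so small gradients pass nearly unchanged;
    # arsinh grows logarithmically, so large gradients are compressed, not cut.
    return x - gamma * np.arcsinh(norm) * (g / norm)
```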

Gaps in existing theory:

  1. Nonlinearly preconditioned methods with momentum under general anisotropic smoothness conditions have never been analyzed
  2. The convergence properties of stochastic variants beyond bounded variance assumptions remain unclear

Method

Overall Architecture

Two core algorithms are proposed:

  • m-NPGM (Algorithm 1): nonlinearly preconditioned gradient method with momentum
  • Stochastic NPGM: momentum-free stochastic variant

Key Designs

  1. Momentum update rule (m-NPGM): Unlike standard approaches, momentum is applied to the preconditioned gradient rather than to the raw gradient before preconditioning:

    \(m^k = \beta m^{k-1} + (1-\beta)\,\nabla\phi^*(\nabla f(x^k)), \qquad x^{k+1} = x^k - \gamma m^k\)

The equivalent form is \(x^{k+1} = x^k - (1-\beta)\gamma \nabla\phi^*(\nabla f(x^k)) + \beta(x^k - x^{k-1})\), i.e., the standard heavy ball method applied to the map \(\nabla\phi^* \circ \nabla f\) (see the sketch after this list). This design is more natural here, since the analysis is grounded in the anisotropic descent inequality.

  2. Anisotropic descent inequality: The core condition for analysis is \(f(x) \leq f(\bar{x}) + \left(\tfrac{1}{L} \star \phi\right)(x - \bar{y}) - \left(\tfrac{1}{L} \star \phi\right)(\bar{x} - \bar{y})\), where \(\bar{y} = \bar{x} - \frac{1}{L}\nabla\phi^*(\nabla f(\bar{x}))\) and \(\star\) denotes epi-scaling, \((\tfrac{1}{L} \star \phi)(x) = \tfrac{1}{L}\phi(Lx)\). For \(\phi = \frac{1}{2}\|\cdot\|^2\) this reduces to the classical descent lemma; in general it is strictly more general than \((L_0,L_1)\)-smoothness (Remark 1.4).

  3. Convergence analysis (Theorem 2.2): Under \(\beta \in [0, 0.5)\) and \(\gamma = \alpha/L\): \(\min_{0 \leq k \leq K} \phi(\nabla\phi^*(\nabla f(x^k))) \leq \frac{L(f(x^0) - f_\star)}{\alpha(K+1)(1-2\beta)}\). Key technical challenges in the proof are: (a) the absence of a global gradient difference upper bound; (b) the range of the preconditioning map may cover the entire space, making distance control difficult.

  4. Preconditioned Lipschitz continuity (Assumption 2.5): A new condition is introduced, \(\|\nabla\phi^*(\nabla f(x)) - \nabla\phi^*(\nabla f(\bar{x}))\| \leq L\|x - \bar{x}\|\), under which the range of \(\beta\) is extended to \((0, 1)\) (Theorem 2.7). This condition is shown to hold naturally for \((L_0,L_1)\)-smooth functions (Proposition 2.6).

  5. Stochastic variant analysis:

    • Theorem 3.1: Under a new noise condition \(E[\phi(\nabla\phi^*(\nabla f(x)) - \nabla\phi^*(g(x)))] \leq \sigma^2\), approximate convergence to a \(\sigma^2\)-neighborhood is established
    • Proposition 3.2: For \(\phi = \cosh - 1\), the new noise condition is weaker than bounded variance (Example 3.3 provides a function satisfying the new condition but not bounded variance)
    • Theorem 3.4: Under standard bounded variance and unbiasedness assumptions, exact convergence is achieved with mini-batch size \(K\)
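Putting the momentum design together with the iHGD preconditioner, here is a minimal sketch of the m-NPGM iteration referenced in item 1 above (our reading of Algorithm 1; hyperparameter values are placeholders, with \(\beta < 0.5\) matching the general regime of Theorem 2.2):

```python
import numpy as np

def ihgd_precond(g):
    """iHGD preconditioning map dphi*(y) = arsinh(||y||) * y / ||y||."""
    n = np.linalg.norm(g)
    return g if n == 0.0 else np.arcsinh(n) * (g / n)

def m_npgm(x0, grad_f, precond=ihgd_precond, gamma=0.1, beta=0.4, iters=1000):
    """Sketch of m-NPGM: heavy ball momentum applied to the *preconditioned*
    gradient. gamma = alpha/L must be tuned per problem; the values here are
    placeholders, not recommendations from the paper."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(iters):
        m = beta * m + (1 - beta) * precond(grad_f(x))
        x = x - gamma * m
    return x
```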

Loss & Training

  • The stationarity measure is \(\phi(\nabla\phi^*(\nabla f(x)))\) rather than \(\|\nabla f(x)\|\)
  • For iHGD: \(\phi(\nabla\phi^*(y)) = \sqrt{1 + \|y\|^2} - 1\)
  • Linear convergence under a generalized PL condition (Theorem 2.4), using Lyapunov function \(V_k = \gamma\phi(m^{k-1}) + f(x^k) - f_\star\)
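The closed form of the iHGD stationarity measure follows from \(\|\nabla\phi^*(y)\| = \text{arsinh}(\|y\|)\) together with the identity \(\cosh(\text{arsinh}(t)) = \sqrt{1+t^2}\); a quick numerical sanity check (ours):

```python
import numpy as np

# Verify phi(dphi*(y)) = sqrt(1 + ||y||^2) - 1 for phi = cosh(||.||) - 1.
rng = np.random.default_rng(0)
for _ in range(100):
    y = rng.normal(size=16)
    t = np.linalg.norm(y)                 # ||dphi*(y)|| = arsinh(t)
    lhs = np.cosh(np.arcsinh(t)) - 1.0    # phi evaluated at dphi*(y)
    rhs = np.sqrt(1.0 + t**2) - 1.0
    assert np.isclose(lhs, rhs)
```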

Key Experimental Results

Neural Network Training

| Task | Method | Training Loss | Test Accuracy | Notes |
| --- | --- | --- | --- | --- |
| MNIST MLP | iHGD | Lowest | Highest | Significantly outperforms SGD and Adam |
| MNIST MLP | SGD | Moderate | Moderate | Baseline |
| MNIST MLP | Adam | Faster convergence | Close to iHGD | Standard baseline |
| CIFAR10 ResNet-18 (no momentum) | sHGD | Comparable to SGD | Comparable to SGD | Validation experiment |
| CIFAR10 ResNet-18 (with momentum) | iHGDm/sHGDm | Comparable to SGDm | Comparable to SGDm | \(\beta=0.9\) |

Matrix Factorization Experiments

| Method | \(r=10\) Convergence | \(r=20\) Convergence | \(r=30\) Convergence |
| --- | --- | --- | --- |
| iHGDm (Ours) | Fastest | Fastest | Fastest |
| AdGD-accel | Moderate | Moderate | Moderate |
| GDm | Slow | Slow | Slow |
| GD | Slowest | Slowest | Slowest |

Ablation Study

| Configuration | Key Metric | Notes |
| --- | --- | --- |
| \(\beta \in [0, 0.5)\) vs \(\beta \in (0, 1)\) | Enlarged admissible range | Requires additional preconditioned Lipschitz assumption |
| Isotropic (i) vs separable (s) reference function | Comparable performance | Separable version provides coordinate-wise adaptive step sizes |
| \(\lambda=100\) scaled iHGD | Significant speedup in matrix factorization | Dual step size strategy is critical |

Key Findings

  • iHGDm significantly outperforms all competing methods in matrix factorization (a quartic polynomial that is not Lipschitz smooth)
  • Performance is comparable to SGD/Adam in standard neural network training, achieving the same results under a different theoretical framework
  • The new noise assumption is strictly weaker than bounded variance and arises naturally from anisotropic smoothness analysis

Highlights & Insights

  • The power of a unified framework: gradient clipping, normalized gradient descent, and hyperbolic gradient descent are all special cases of the same framework
  • The design choice of "applying momentum after preconditioning" appears minor but is significant: it allows the analysis to align naturally with the anisotropic descent inequality
  • Proposition 2.6 provides a new characterization of \((L_0,L_1)\)-smooth functions via Lipschitz continuity of the preconditioned gradient
  • The elegant properties of \(\cosh - 1\) as a reference function: strong convexity, super-polynomial growth, global definition, and generating sigmoid-type preconditioning

Limitations & Future Work

  • The momentum parameter is currently restricted to \(\beta < 0.5\) in the general case, or requires additional assumptions to extend to \(\beta < 1\)
  • The stochastic variant has not yet been combined with momentum, which remains an important open problem
  • Proximal gradient extensions (non-smooth terms / constrained problems) are not considered
  • Systematic guidance on the optimal choice of \(\phi\) is lacking
Significance & Context

  • This work extends anisotropic gradient descent from the deterministic nonconvex setting to momentum and stochastic settings
  • It complements the \((L_0,L_1)\)-smoothness literature by providing a more general analytical framework
  • An open question for inspiration: can modern deep learning optimizers (e.g., Adam) be reinterpreted through the lens of anisotropic smoothness?

Rating

  • Novelty: ⭐⭐⭐⭐ The framework itself is prior work, but the momentum and stochastic analyses are new contributions
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers neural networks and matrix factorization, though more large-scale experiments could be included
  • Writing Quality: ⭐⭐⭐⭐ Theoretically rigorous, but the notation system is relatively heavy
  • Value: ⭐⭐⭐⭐ Provides theoretical foundations for optimization under generalized smoothness