Second-Order Optimization Under Heavy-Tailed Noise: Hessian Clipping and Sample Complexity¶
Conference: NeurIPS 2025 · arXiv: 2510.10690 · Code: None · Area: Optimization · Keywords: second-order optimization, heavy-tailed noise, Hessian clipping, sample complexity, high-probability convergence
TL;DR¶
This paper provides the first systematic theoretical treatment of second-order stochastic optimization under heavy-tailed noise. It establishes tight minimax sample complexity lower bounds, proposes a normalized SGD algorithm with gradient and Hessian clipping (Clip NSGDHess), and proves that the proposed algorithm nearly achieves the information-theoretic limit.
Background & Motivation¶
Heavy-tailed noise is prevalent in modern machine learning, arising from data heterogeneity, outliers, and non-stationary environments. Existing theoretical and algorithmic frameworks suffer from fundamental deficiencies:
First-order methods are mature: SGD and its variants already enjoy optimal convergence guarantees and complete complexity theory under the \(p\)-BCM (bounded \(p\)-th central moment) noise model, with gradient clipping established as a standard tool.
Second-order methods lag significantly: (1) existing second-order methods (stochastic cubic regularization (SCN), trust-region methods, etc.) require bounded-noise or bounded-variance assumptions and offer no guarantees under heavy-tailed conditions; (2) no existing second-order method operates reliably when the noise has unbounded variance; (3) Hessian estimation is more sensitive to noise than gradient estimation, yet the corresponding robustness tools (e.g., Hessian clipping) have never been studied.
Core Problem: Can one design second-order algorithms that retain high-probability convergence guarantees when both gradients and Hessians are corrupted by heavy-tailed noise?
Method¶
Overall Architecture¶
The paper proceeds in three steps: (1) establishing a minimax sample complexity lower bound for second-order methods under \(p\)-BCM noise; (2) designing NSGDHess, whose in-expectation complexity matches this lower bound; (3) adding Hessian clipping (Clip NSGDHess) to obtain high-probability convergence guarantees.
Key Designs¶
- Minimax Lower Bound (Theorem 1): For any \(q\)-th-order zero-respecting algorithm, the minimum number of oracle queries required to find an \(\varepsilon\)-stationary point under \(p\)-BCM noise (\(p \in (1,2]\)) is
\[\Omega\left(\frac{\Delta \sigma_h}{\varepsilon^2} \left(\frac{\sigma}{\varepsilon}\right)^{\frac{1}{p-1}}\right)\]
This improves upon the optimal complexity of first-order methods by a factor of \(1/\varepsilon\) (uniformly for all \(p \in (1,2]\)), demonstrating that second-order information provides provable acceleration even under heavy-tailed noise. The key technique is a worst-case function construction exploiting the "zero-chain" property, extending the \(p\)-BCM noise model to higher-order derivatives.
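To make the \(1/\varepsilon\) gap concrete, one can write both rates as pure powers of \(\varepsilon\) (a plug-in computation from the bound above and the first-order rate tabulated in the ablation section, not a statement from the paper):
\[\varepsilon^{-2}\left(\frac{\sigma}{\varepsilon}\right)^{\frac{1}{p-1}} = \sigma^{\frac{1}{p-1}}\,\varepsilon^{-\frac{2p-1}{p-1}}, \qquad \varepsilon^{-3}\left(\frac{\sigma}{\varepsilon}\right)^{\frac{1}{p-1}} = \sigma^{\frac{1}{p-1}}\,\varepsilon^{-\frac{3p-2}{p-1}},\]
so the second-order exponent is smaller by exactly one for every \(p \in (1,2]\); at \(p = 2\) this is \(\varepsilon^{-3}\) versus the familiar first-order \(\varepsilon^{-4}\).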
- Normalized SGD with Hessian Correction (NSGDHess, Algorithm 1): The core update rule consists of three steps (a code sketch follows below):
  - Uniformly random interpolation between consecutive iterates: \(\hat{x}_t = q_t x_t + (1-q_t) x_{t-1}\), with \(q_t \sim \mathrm{Uniform}[0,1]\)
  - Recursive momentum with Hessian correction: \(g_t = (1-\alpha)(g_{t-1} + \nabla^2 f(\hat{x}_t, \hat{\xi}_t)(x_t - x_{t-1})) + \alpha \nabla f(x_t, \xi_t)\)
  - Normalized step: \(x_{t+1} = x_t - \gamma \frac{g_t}{\|g_t\|}\)

  Design Motivation: Averaging over the random interpolation point ensures \(\mathbb{E}[\nabla^2 f(\hat{x}_t, \hat{\xi}_t)(x_t - x_{t-1})] = \nabla F(x_t) - \nabla F(x_{t-1})\), the exact gradient difference, which is the theoretical basis for the effectiveness of Hessian-corrected momentum. The normalized step guarantees a bounded update magnitude (\(\|x_{t+1} - x_t\| = \gamma\)), which is critical under heavy-tailed noise.
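A minimal runnable sketch of this update loop, assuming hypothetical stochastic oracles `grad_oracle(x)` and `hvp_oracle(x, v)` (these names and the momentum initialization are our choices, not the paper's):

```python
import numpy as np

def nsgd_hess(grad_oracle, hvp_oracle, x0, gamma=1e-2, alpha=0.1, T=1000, seed=0):
    """Sketch of NSGDHess (Algorithm 1): normalized SGD with
    Hessian-corrected recursive momentum."""
    rng = np.random.default_rng(seed)
    x_prev = np.asarray(x0, dtype=float)
    g = grad_oracle(x_prev)                      # initialize momentum with a plain gradient
    x = x_prev - gamma * g / np.linalg.norm(g)   # first normalized step
    for _ in range(T - 1):
        q = rng.uniform()                        # q_t ~ Uniform[0, 1]
        x_hat = q * x + (1 - q) * x_prev         # random interpolation point
        correction = hvp_oracle(x_hat, x - x_prev)  # stochastic Hessian-vector product
        g = (1 - alpha) * (g + correction) + alpha * grad_oracle(x)
        x_prev, x = x, x - gamma * g / np.linalg.norm(g)  # normalized step, ||step|| = gamma
    return x
```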
- Hessian Clipping (Clip NSGDHess, Algorithm 2): Algorithm 1 is augmented with dual clipping of gradients and Hessian-vector products:
\[g_t = (1-\alpha)\left(g_{t-1} + \gamma \cdot \text{clip}\left(\gamma^{-1}\nabla^2 f(\hat{x}_t, \hat{\xi}_t)(x_t - x_{t-1}), \bar{\lambda}_h\right)\right) + \alpha \cdot \text{clip}\left(\nabla f(x_t, \xi_t), \lambda\right)\]
where \(\text{clip}(v, \lambda) = \min\{1, \lambda/\|v\|\} \cdot v\). Key Insight: Clipping the Hessian matrix directly would require estimating its operator norm at \(O(d^2)\) cost, whereas clipping the Hessian-vector product costs only \(O(d)\) and remains compatible with backpropagation (a code sketch follows below).
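A sketch of the clipped momentum update under the same assumed oracle interface; `clipped_momentum` and its argument names are illustrative, not the paper's API:

```python
import numpy as np

def clip(v, lam):
    """clip(v, lambda) = min{1, lambda / ||v||} * v."""
    norm = np.linalg.norm(v)
    return v if norm <= lam else (lam / norm) * v

def clipped_momentum(g_prev, grad, hvp, gamma, alpha, lam, lam_h):
    """One Clip NSGDHess momentum update (Algorithm 2). `grad` is a
    stochastic gradient at x_t; `hvp` is a stochastic Hessian-vector
    product with v = x_t - x_{t-1}, whose norm is gamma by construction."""
    # Rescaling by 1/gamma before clipping turns the clipped argument into
    # a Hessian-(unit-)vector product, so the threshold lam_h does not
    # depend on the step size; the outer gamma undoes the rescaling.
    corrected = g_prev + gamma * clip(hvp / gamma, lam_h)
    return (1 - alpha) * corrected + alpha * clip(grad, lam)
```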
Loss & Training¶
- Convergence target: Finding an \(\varepsilon\)-first-order stationary point, i.e., \(\|\nabla F(x)\| \leq \varepsilon\)
- NSGDHess sample complexity (Theorem 2): \(O\left(\frac{\Delta(L+\sigma_h)}{\varepsilon^2}\left(\frac{\sigma}{\varepsilon}\right)^{\frac{1}{p-1}}\right)\), exactly matching the lower bound when \(L \leq \sigma_h\)
- Clip NSGDHess high-probability guarantee (Theorem 3): Achieves near-optimal complexity with probability \(\geq 1-\delta\), with only \(\text{polylog}(T/\delta)\) additional overhead
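The matching claim in Theorem 2 is a one-line comparison with Theorem 1: when \(L \leq \sigma_h\),
\[\frac{\Delta(L+\sigma_h)}{\varepsilon^2}\left(\frac{\sigma}{\varepsilon}\right)^{\frac{1}{p-1}} \leq \frac{2\Delta\sigma_h}{\varepsilon^2}\left(\frac{\sigma}{\varepsilon}\right)^{\frac{1}{p-1}},\]
so the upper bound coincides with the \(\Omega\big(\Delta\sigma_h\varepsilon^{-2}(\sigma/\varepsilon)^{1/(p-1)}\big)\) lower bound up to an absolute constant.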
Key Experimental Results¶
Main Results¶
Synthetic experiment: \(F(x) = 0.5\|x\|^2\), \(d=10\), with heavy-tailed noise drawn from a two-sided Pareto distribution (tail index \(p=1.1\)); a noise-generation sketch follows the results table below.
| Algorithm | Noise Robustness | Convergence | Clipping Required |
|---|---|---|---|
| NSGDM (first-order normalized momentum) | Poor | Severe oscillation | No |
| NSGDHess (without clipping) | Moderate | Improved but unstable | No |
| Clip NSGDHess | Strong | Stable and fast | Yes |
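A minimal sketch of this noise model (our reconstruction; the exact coupling between gradient and Hessian noise in the paper's experiment may differ), which can be plugged into the `nsgd_hess` sketch from the Method section:

```python
import numpy as np

rng = np.random.default_rng(0)

def pareto_noise(d, tail_index=1.1):
    """Two-sided Pareto noise: for tail index 1.1, moments of order
    >= 1.1 are infinite, so the variance is unbounded."""
    signs = rng.choice([-1.0, 1.0], size=d)
    return signs * (1.0 + rng.pareto(tail_index, size=d))  # classical Pareto, x_min = 1

# Noisy oracles for F(x) = 0.5 * ||x||^2 (true gradient x, true Hessian I).
grad_oracle = lambda x: x + pareto_noise(x.size)
hvp_oracle = lambda x, v: v + pareto_noise(v.size)  # crude additive Hessian noise

x_final = nsgd_hess(grad_oracle, hvp_oracle, x0=np.ones(10), T=5000)  # uses the sketch above
```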
Ablation Study¶
| Configuration | Complexity (in \(\varepsilon\)) | High-Probability? | Remarks |
|---|---|---|---|
| First-order FOSO (optimal) | \(O(\varepsilon^{-3} \cdot (\sigma/\varepsilon)^{1/(p-1)})\) | Yes | Known lower bound |
| Second-order NSGDHess (expected) | \(O(\varepsilon^{-2} \cdot (\sigma/\varepsilon)^{1/(p-1)})\) | No | This paper, Thm 2 |
| Second-order Clip NSGDHess | \(O(\varepsilon^{-2} \cdot (\sigma/\varepsilon)^{1/(p-1)})\) | Yes | This paper, Thm 3 |
| SCN (Tripuraneni 2018) | \(O(\varepsilon^{-7/2})\) | No | Requires bounded noise |
| SGDHess (Tran 2021) | \(O(\varepsilon^{-3})\) | No | Requires bounded variance |
Key Findings¶
- Second-order methods retain a \(1/\varepsilon\) sample complexity advantage over first-order methods even under heavy-tailed noise
- The unclipped variant suffers noticeably from noise in synthetic experiments, validating the necessity of Hessian clipping
- The complexity result does not depend on the second-order smoothness constant \(L_h\) (which may be infinite), an important practical feature
- For the special case \(p=2\) (bounded variance), the results are also novel and unify previously scattered conclusions
Highlights & Insights¶
- First complete theory of second-order optimization under heavy-tailed noise: Provides a full minimax characterization from lower to matching upper bounds
- Hessian clipping is an entirely new concept: Generalizes the success of gradient clipping to the Hessian, maintaining computational feasibility via Hessian-vector product clipping
- Theoretically proves that the advantage of second-order methods does not vanish under heavy-tailed noise: The \(1/\varepsilon\) improvement is fundamental
- The Hessian-vector product clipping formula \(\gamma \cdot \text{clip}(\gamma^{-1} \nabla^2 f \cdot v, \bar{\lambda}_h)\) is designed so that the clipping thresholds \(\lambda\) and \(\bar{\lambda}_h\) live on the same scale, simplifying practical hyperparameter tuning (see the derivation after this list)
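One way to see the scale claim in the last bullet (our derivation from the update rules above; the paper may argue differently): the normalized step gives \(x_t - x_{t-1} = -\gamma\, g_{t-1}/\|g_{t-1}\|\), hence
\[\gamma^{-1} \nabla^2 f(\hat{x}_t, \hat{\xi}_t)(x_t - x_{t-1}) = -\nabla^2 f(\hat{x}_t, \hat{\xi}_t)\,\frac{g_{t-1}}{\|g_{t-1}\|},\]
a Hessian-vector product with a unit vector. Its magnitude is therefore independent of \(\gamma\), so \(\bar{\lambda}_h\), like \(\lambda\), can be tuned without reference to the step size.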
Limitations & Future Work¶
- Experiments are conducted only on synthetic problems; empirical validation on real machine learning tasks is absent
- The algorithm requires multiple hyperparameters (step size \(\gamma\), momentum \(\alpha\), two clipping thresholds), and further simplification is needed for practical deployment
- The convergence target is limited to first-order stationary points; extending to second-order stationary points (saddle point escape) remains an important open problem
- It remains open whether normalized steps are necessary in the clipped variant and whether pure clipping can replace normalization
Related Work & Insights¶
- Relation to gradient clipping: This paper is a natural extension of clipping techniques from first- to second-order methods, continuing the line of work by Gorbunov et al. (2020), Sadiev et al. (2023), and others
- Connection to reinforcement learning: NSGDHess is inspired by the SHARP/HARPG algorithms from RL (Fatkhullin et al., 2023), reflecting the deepening intersection between optimization theory and RL
- Insights: The concept of Hessian clipping may be valuable for robustness design in distributed optimization and federated learning
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (First complete theory of second-order optimization under heavy-tailed noise; Hessian clipping is entirely new)
- Experimental Thoroughness: ⭐⭐⭐ (Strong theoretical contributions but limited to synthetic experiments)
- Writing Quality: ⭐⭐⭐⭐⭐ (Clear logic, well-structured presentation, highly informative figures and tables)
- Value: ⭐⭐⭐⭐ (Provides a solid theoretical foundation for optimization under heavy-tailed noise)