# Adaptive Algorithms with Sharp Convergence Rates for Stochastic Hierarchical Optimization
- **Conference:** NeurIPS 2025
- **arXiv:** 2509.15399
- **Code:** available (submitted with the paper appendix)
- **Area:** Optimization Theory / Hierarchical Optimization
- **Keywords:** minimax optimization, bilevel optimization, adaptive algorithm, momentum normalization, convergence rate
## TL;DR
This paper proposes Ada-Minimax and Ada-BiO, two adaptive algorithms that combine momentum normalization with a novel online noise estimation strategy to achieve, for the first time, sharp convergence rates of \(\tilde{O}(1/\sqrt{T} + \sqrt{\bar{\sigma}}/T^{1/4})\) for nonconvex-strongly-concave minimax and nonconvex-strongly-convex bilevel optimization without requiring prior knowledge of the gradient noise level.
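To unpack the rate: with no prior knowledge of \(\bar{\sigma}\), the bound automatically interpolates between the deterministic and noise-dominated regimes (the "automatic rate interpolation" highlighted later in these notes):

\[
\tilde{O}\!\left(\frac{1}{\sqrt{T}} + \frac{\sqrt{\bar{\sigma}}}{T^{1/4}}\right)
\;\longrightarrow\;
\begin{cases}
\tilde{O}\big(1/\sqrt{T}\big), & \bar{\sigma} = 0 \ \text{(matches the deterministic rate)},\\[2pt]
\tilde{O}\big(\sqrt{\bar{\sigma}}/T^{1/4}\big), & \bar{\sigma} > 0 \ \text{(the noise term dominates for large } T\text{)}.
\end{cases}
\]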
## Background & Motivation
**Background:** Hierarchical optimization (minimax and bilevel) is a central problem formulation in machine learning: adversarial learning and AUC maximization are modeled as minimax problems, while meta-learning and hyperparameter optimization are modeled as bilevel problems. Many algorithms with established convergence guarantees have been proposed.

**Limitations of Prior Work:** All existing methods require prior knowledge of the stochastic gradient noise magnitude \(\bar{\sigma}\) to set hyperparameters such as step sizes. When the noise level is unknown or varies during training, these methods cannot adapt automatically: conservative step sizes in low-noise regimes lead to slow convergence, while overly large step sizes in high-noise regimes cause divergence.

**Key Challenge:** AdaGrad-style adaptive methods have already resolved noise adaptivity in single-level optimization, but applying them directly to hierarchical optimization introduces complicated stochastic dependencies: the AdaGrad step sizes for the upper- and lower-level variables are mutually coupled, making the analysis intractable.

**Key Insight:** Instead of standard AdaGrad, the paper employs normalized SGD with momentum (NSGDM) for the upper-level variable and AdaGrad-Norm for the lower-level variable, with the key innovation that the momentum parameter itself is adjusted adaptively.
## Method

### Overall Architecture
Ada-Minimax (Algorithm 1) and Ada-BiO (Algorithm 2) share the same algorithmic framework: the upper-level variable is updated via normalized SGD with momentum (\(x_{t+1} = x_t - \eta_{x,t} \cdot m_t / \|m_t\|\)), while the lower-level variable is updated via AdaGrad-Norm gradient descent/ascent. The core distinction between the two algorithms lies solely in the gradient estimation scheme—Ada-Minimax computes stochastic gradients directly, whereas Ada-BiO approximates the hypergradient using a Neumann series.
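As a concrete reading of this skeleton, below is a minimal Python sketch of one iteration. The function and variable names are ours, not the paper's, and the adaptive coefficients \(\alpha_t\), \(\eta_{x,t}\), \(\eta_{y,t}\) are taken as given here (their schedules are sketched under Key Designs).

```python
import numpy as np

def hierarchical_step(x, y, m, grad_x, grad_y,
                      alpha_t, eta_x_t, eta_y_t, minimax=True):
    """One iteration of the shared Ada-Minimax / Ada-BiO skeleton (sketch).

    grad_x: stochastic upper-level (hyper)gradient estimate; a direct
        gradient for Ada-Minimax, a Neumann-series estimate for Ada-BiO.
    grad_y: stochastic lower-level gradient estimate.
    """
    # Upper level: normalized SGD with momentum (alpha_t = 1 - beta_t).
    m = (1.0 - alpha_t) * m + alpha_t * grad_x
    x = x - eta_x_t * m / (np.linalg.norm(m) + 1e-12)
    # Lower level: AdaGrad-Norm step (eta_y_t already contains the
    # AdaGrad-Norm denominator); ascent on y for the minimax max player,
    # descent on y for the bilevel inner minimization.
    y = y + eta_y_t * grad_y if minimax else y - eta_y_t * grad_y
    return x, y, m
```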
### Key Designs
1. **Adaptive Momentum Normalization (Adaptive NSGDM)**
   - **Function:** Updates the upper-level variable while adapting automatically to the noise level.
   - **Mechanism:** Momentum update \(m_t = \beta_t \cdot m_{t-1} + (1-\beta_t) \cdot g_{x,t}\), followed by the normalized step \(x_{t+1} = x_t - \eta_{x,t} \cdot m_t / \|m_t\|\). The key is that the momentum parameter \(\alpha_t = 1 - \beta_t\) is itself set adaptively, \(\alpha_t = \alpha / \sqrt{\alpha^2 + \sum_{k \le t} \|g_{x,k} - \tilde{g}_{x,k}\|^2}\), where the squared differences in the denominator provide an online estimate of the gradient variance (all three mechanisms are sketched in code after this list).
   - **Design Motivation:** Gradient normalization decouples the step size from the gradient magnitude, while the momentum parameter tracks the estimated noise level: smaller \(\alpha_t\) (more momentum smoothing) under high noise, larger \(\alpha_t\) (faster response) under low noise.
2. **Coordinated Adaptive Step Sizes for the Upper and Lower Levels**
   - **Function:** Sets adaptive learning rates for the upper- and lower-level variables simultaneously.
   - **Mechanism:** The upper-level step size is \(\eta_{x,t} = \eta_x \cdot \alpha'_t / t\), where \(\alpha'_t = \alpha / \sqrt{\alpha^2 + \sum_{k \le t} \left( \|g_{x,k} - \tilde{g}_{x,k}\|^2 + \|g_{y,k}\|^2 \right)}\) incorporates gradient information from both levels; the lower-level step size \(\eta_{y,t} = \eta_y / \sqrt{\gamma^2 + \sum_{k \le t} \|g_{y,k}\|^2}\) follows the standard AdaGrad-Norm form.
   - **Design Motivation:** \(\alpha'_t\) jointly encodes both levels' gradient information in order to control the ratio \(\eta_{x,t}/\eta_{y,t}\) and keep the two levels' update speeds coordinated (the key to Lemma 5.7).
3. **Double-Sampling Noise Estimation**
   - **Function:** Estimates the gradient noise level online.
   - **Mechanism:** Two independent stochastic gradients \(g_{x,t}\) and \(\tilde{g}_{x,t}\) are drawn at each step, and \(\|g_{x,t} - \tilde{g}_{x,t}\|^2\) serves as an unbiased proxy for the variance (for independent samples its expectation is twice the per-sample variance). The cumulative sum \(\sum_{k \le t} \|g_{x,k} - \tilde{g}_{x,k}\|^2\) is shown to lie in \([\bar{\sigma}^2 t,\, 4\bar{\sigma}^2 t]\) with high probability (Lemma 5.5).
   - **Design Motivation:** Direct variance estimation would require knowing the true gradient, whereas the double-sampling difference needs only one extra gradient query per step.
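Putting the three designs together, the coefficient schedules can be sketched as a small stateful helper. The class and its interface are ours for illustration; the formulas follow the notation above.

```python
import numpy as np

class AdaptiveCoefficients:
    """Online schedules for alpha_t, eta_{x,t}, eta_{y,t} (sketch)."""

    def __init__(self, alpha=1.0, gamma=1.0, eta_x=0.1, eta_y=0.01):
        self.alpha, self.gamma = alpha, gamma
        self.eta_x, self.eta_y = eta_x, eta_y
        self.var_sum = 0.0  # sum_k ||g_{x,k} - g~_{x,k}||^2 (design 3)
        self.gy_sum = 0.0   # sum_k ||g_{y,k}||^2

    def update(self, g_x, g_x_tilde, g_y, t):
        # Double-sampling variance proxy (design 3).
        self.var_sum += float(np.linalg.norm(g_x - g_x_tilde) ** 2)
        self.gy_sum += float(np.linalg.norm(g_y) ** 2)
        # Adaptive momentum weight (design 1).
        alpha_t = self.alpha / np.sqrt(self.alpha**2 + self.var_sum)
        # Coupled weight feeding the upper-level step size (design 2).
        alpha_coupled = self.alpha / np.sqrt(
            self.alpha**2 + self.var_sum + self.gy_sum)
        eta_x_t = self.eta_x * alpha_coupled / t
        eta_y_t = self.eta_y / np.sqrt(self.gamma**2 + self.gy_sum)
        return alpha_t, eta_x_t, eta_y_t
```

Each `update` call consumes one extra gradient sample for `g_x_tilde`; the practical variant noted under Highlights & Insights replaces it with the previous iterate's gradient.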
### Loss & Training
Ada-Minimax targets \(\min_x \max_y f(x, y)\) (with \(f\) strongly concave in \(y\)); Ada-BiO targets \(\min_x f(x, y^*(x))\) s.t. \(y^*(x) = \operatorname{argmin}_y g(x, y)\) (with \(g\) strongly convex in \(y\)). Ada-BiO approximates the hypergradient \(\nabla \Phi(x)\) via a Neumann series, requiring additional second-order information queries.
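To make the Neumann-series hypergradient concrete: for strongly convex inner \(g(x, \cdot)\), the implicit-function theorem gives \(\nabla \Phi(x) = \nabla_x f - \nabla^2_{xy} g \, [\nabla^2_{yy} g]^{-1} \nabla_y f\), and the inverse is approximated by a truncated Neumann series. The dense-matrix sketch below is illustrative only (the paper's estimator is stochastic, and real implementations would use Hessian-vector products); the names and truncation depth `K` are ours.

```python
import numpy as np

def neumann_hypergrad(grad_x_f, grad_y_f, hess_xy_g, hess_yy_g,
                      K=10, lr=0.1):
    """Hypergradient via a truncated Neumann series (sketch).

    Uses [H_yy]^{-1} v ~ lr * sum_{k=0}^{K} (I - lr * H_yy)^k v,
    valid when lr < 1/L for an L-smooth, strongly convex inner g.
    Shapes: hess_xy_g is (dim_x, dim_y), hess_yy_g is (dim_y, dim_y).
    """
    v = grad_y_f.copy()  # (I - lr * H_yy)^k grad_y_f, starting at k = 0
    s = grad_y_f.copy()  # running Neumann sum
    for _ in range(K):
        v = v - lr * (hess_yy_g @ v)
        s = s + v
    inv_hvp = lr * s     # ~ [H_yy]^{-1} grad_y_f
    return grad_x_f - hess_xy_g @ inv_hvp
```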
## Key Experimental Results

### Main Results
| Task | Dataset | Metric | Ada-Minimax / Ada-BiO (proposed) | TiAda / TFBO | SGDA / PDSM |
|---|---|---|---|---|---|
| Synthetic minimax (\(\sigma=0\)) | 1-D function | \(\|\nabla\Phi\|\) | Fastest convergence | Comparable | Slower |
| Synthetic minimax (\(\sigma=100\)) | 1-D function | \(\|\nabla\Phi\|\) | Converges | Diverges | Diverges |
| Deep AUC maximization | Sentiment140 | Train AUC | +20% over PDSM | Low | Suboptimal |
| Deep AUC maximization | Sentiment140 | Test AUC | +2% over PDSM | Low | Suboptimal |
| Hyperparameter optimization | TREC | Train/Test ACC | Fast convergence | TFBO diverges | — |
### Ablation Study
| Configuration | Key Metric | Remarks |
|---|---|---|
| \(\alpha \in [0.1, 2.0]\) | AUC nearly unchanged | Robust to initial momentum parameter |
| \(\eta_y \in [0.001, 0.05]\) | Early-stage variation; consistent final performance | Robust lower-level learning rate |
| \(\eta_x \in [0.001, 0.01]\) | Affects early training dynamics | Final performance degrades at \(\eta_x=0.05\) |
| \(\gamma \in [0.01, 2.0]\) | High consistency | Robust initialization parameter |
### Key Findings
- Ada-Minimax converges under high noise (\(\sigma=100\)), whereas TiAda fails to converge even with extensive tuning.
- On Deep AUC maximization, Ada-Minimax achieves the fastest convergence both in terms of epochs and wall-clock time.
- Strong hyperparameter robustness: varying the four main hyperparameters (\(\alpha\), \(\eta_x\), \(\eta_y\), \(\gamma\)) over wide ranges has minimal impact on final performance.
- Experiments validate the noise lower-bound assumption: empirical noise on the AUC task spans \([0.21, 210.71]\).
## Highlights & Insights
- Theoretical novelty: This is the first work to provide adaptive and sharp convergence guarantees for stochastic hierarchical optimization, achieving automatic rate interpolation without prior noise knowledge.
- General analysis framework: The paper first establishes an analysis framework for single-level adaptive NSGDM (Theorem 5.6), then extends it to minimax and bilevel settings—a clear and transferable proof strategy.
- Hyperparameter robustness: The four main hyperparameters exhibit minimal performance sensitivity over orders-of-magnitude variation, substantially reducing tuning burden.
- Practical variant design: Replacing the theoretical double-sampling difference with adjacent-iterate gradient differences (cheaper, since it saves one gradient query per step) does not degrade practical performance; see the sketch below.
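A minimal sketch of that practical variant (our naming): the second independent sample \(\tilde{g}_{x,t}\) is replaced by the gradient cached from the previous iterate, so the proxy is no longer exactly unbiased (consecutive iterates differ), but one gradient query per step is saved.

```python
import numpy as np

def variance_proxy_update(var_sum, g_x, g_x_prev):
    """Adjacent-iterate variant of the double-sampling proxy (sketch)."""
    # Replaces ||g_{x,t} - g~_{x,t}||^2 with ||g_{x,t} - g_{x,t-1}||^2.
    var_sum += float(np.linalg.norm(g_x - g_x_prev) ** 2)
    return var_sum, g_x  # updated accumulator, gradient to cache
```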
## Limitations & Future Work
- Requires the assumption that stochastic gradient noise has a nonzero lower bound (Assumption 3.2(ii)); although the noiseless case is technically covered, this assumption is unnatural in certain settings.
- Only the nonconvex-strongly-concave/strongly-convex setting is considered; the more general nonconvex-nonconvex case remains to be explored.
- The contribution is primarily theoretical; deep learning experiments are limited in scale (single dataset with simple models).
- Each step requires two independent gradient samples (though the practical variant approximates this with adjacent-iterate gradients).
## Related Work & Insights
- TiAda: The only prior adaptive minimax method, but its convergence rate is not sharp with respect to noise (no explicit dependence on \(\bar{\sigma}\)).
- AdaGrad-Norm: The adaptive foundation for single-level optimization; this paper extends it to the lower level of hierarchical optimization.
- Cutkosky & Mehta (2020): The original NSGDM method, but with a fixed momentum parameter; the key innovation in this paper is making the momentum parameter also adaptive.
## Rating
- Novelty: ⭐⭐⭐⭐ — The dual guarantee of adaptivity and sharpness in hierarchical optimization is a first; the combination of momentum normalization with adaptive parameters is original.
- Experimental Thoroughness: ⭐⭐⭐ — Synthetic experiments clearly validate the theory, but deep learning experiments cover only one dataset at limited scale.
- Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are rigorous and complete; the proof structure is well-organized and hierarchical.
- Value: ⭐⭐⭐⭐ — An important contribution to the optimization theory community; the hyperparameter robustness of the practical variant is a valuable reference for practitioners.