Exploring Landscapes for Better Minima along Valleys¶
Conference: NeurIPS 2025 | arXiv: 2510.27153 | Code: PyPI | Area: Optimization / Large-Batch Training | Keywords: Loss landscape exploration, valley tracking, large-batch training, ALTO optimizer, gradient difference
TL;DR¶
This paper proposes an optimizer adapter "E" that incorporates an exponential moving average of gradient differences \(\mathbf{a}_k = \text{EMA}(\mathbf{g}_k - \mathbf{g}_{k-1})\) into the gradient update, enabling optimizers to continue exploring "valleys" in the loss landscape for lower and flatter minima after reaching a local minimum. The resulting ALTO optimizer achieves an average improvement of 2.5% in test accuracy under large-batch training.
Background & Motivation¶
Background: Nearly all gradient-based optimizers (SGD, Adam, etc.) cease searching once a local minimum is reached. Relying solely on local information provides no guarantee that the found minimum is the lowest or the most generalizable.
Limitations of Prior Work: (a) Traditional optimizers become "trapped" at local minima and cannot continue exploring the valley structure of the loss landscape. (b) Large-batch training is the most direct way to fully utilize GPU parallelism, but it suffers from the "fewer update steps" problem: it must match the test accuracy of small-batch training with far fewer parameter updates. (c) Existing learning-rate scaling rules (linear scaling \(\eta_k \propto |\mathcal{Z}_k|\), square-root scaling) break down beyond task-dependent critical thresholds at very large batch sizes.
Key Challenge: In the loss landscape, optimizers must macroscopically be "captured" by large-scale valleys (walking along them) while microscopically escaping small-scale sharp minima—a duality that traditional optimizers address only partially.
Goal: To design an optimizer that continues exploring along valleys after reaching a local minimum, thereby finding lower and flatter minima.
Key Insight: The observation that \(-\nabla\|\nabla f(\theta_k)\|^2 = -2\mathbf{H}_k \bar{\mathbf{g}}_k \approx \bar{\mathbf{g}}_k - \bar{\mathbf{g}}_{k-1}\) reveals that the gradient difference direction naturally possesses the property of repelling sharp minima and being attracted to flat ones.
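To make the approximation explicit, here is a minimal derivation sketch, assuming plain gradient descent with step size \(\eta\) and a first-order Taylor expansion of the full-batch gradient (the paper's own argument may differ in detail):

```latex
% GD step: \theta_k - \theta_{k-1} = -\eta \, \bar{\mathbf{g}}_{k-1}
% First-order expansion of the full-batch gradient around \theta_{k-1}:
\bar{\mathbf{g}}_k - \bar{\mathbf{g}}_{k-1}
  \;\approx\; \mathbf{H}_{k-1}\,(\theta_k - \theta_{k-1})
  \;=\; -\eta\,\mathbf{H}_{k-1}\,\bar{\mathbf{g}}_{k-1}
  \;=\; \tfrac{\eta}{2}\,\bigl(-\nabla\|\nabla f(\theta_{k-1})\|^2\bigr)
```

So the gradient difference is aligned with \(-\nabla\|\nabla f\|^2\) up to the positive factor \(\eta/2\), which is why it inherits the repel-sharp-minima / settle-into-flat-minima behavior.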
Core Idea: Augmenting the optimizer's gradient with the term \(\alpha \cdot \text{EMA}(\mathbf{g}_k - \mathbf{g}_{k-1})\) to microscopically repel sharp minima and macroscopically track valleys.
Method¶
Overall Architecture¶
The E-adaptor is a plug-in module compatible with any gradient-based optimizer. The core modification replaces the gradient \(\mathbf{g}_k\) with \(\mathbf{g}_k + \alpha \mathbf{a}_k\), where \(\mathbf{a}_k = \beta_1 \mathbf{a}_{k-1} + (1-\beta_1)(\mathbf{g}_k - \mathbf{g}_{k-1})\) is the EMA of gradient differences. The standard Adam/Lamb update then proceeds as usual.
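A minimal PyTorch sketch of this plug-in view is given below. The class name `EAdaptor`, its constructor arguments, and the default hyperparameters are illustrative assumptions, not the released PyPI API: it maintains the per-parameter EMA \(\mathbf{a}_k\) and hands \(\mathbf{g}_k + \alpha \mathbf{a}_k\) to whatever base optimizer it wraps.

```python
import torch

class EAdaptor:
    """Illustrative sketch of the E-adaptor: maintain a_k = EMA(g_k - g_{k-1})
    per parameter and pass g_k + alpha * a_k to any wrapped base optimizer.
    Names and defaults are assumptions for illustration, not the paper's API."""

    def __init__(self, base_optimizer, alpha=-5.0, beta1=0.99):
        # Stability constraint from the paper: |alpha| < 1 / (1 - beta1)
        assert abs(alpha) < 1.0 / (1.0 - beta1)
        self.base, self.alpha, self.beta1 = base_optimizer, alpha, beta1
        self.state = {}  # per-parameter: previous gradient and EMA of differences

    @torch.no_grad()
    def step(self):
        for group in self.base.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                st = self.state.setdefault(
                    p, {"prev_g": g.clone(), "a": torch.zeros_like(g)})
                # a_k = beta1 * a_{k-1} + (1 - beta1) * (g_k - g_{k-1})
                st["a"].mul_(self.beta1).add_(g - st["prev_g"], alpha=1 - self.beta1)
                st["prev_g"].copy_(g)
                # Replace the gradient with g_k + alpha * a_k, then step as usual.
                p.grad = g + self.alpha * st["a"]
        self.base.step()

    def zero_grad(self, set_to_none=True):
        self.base.zero_grad(set_to_none=set_to_none)
```

Usage (hypothetical): `opt = EAdaptor(torch.optim.Adam(model.parameters(), lr=1e-3), alpha=0.5, beta1=0.01)`; the training loop then calls `opt.zero_grad()` and `opt.step()` exactly as before.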
Key Designs¶
- Directional Analysis of Gradient Differences:
  - Function: Proves that \(\mathbf{g}_k - \mathbf{g}_{k-1}\) is an ideal direction for escaping sharp minima.
  - Mechanism: Four stages of optimizer behavior near a minimum (①② descending → ②③ crossing → ③④ ascending → ④⑤ decelerating) are considered. The sign of the inner product \(\langle \theta_k - \theta_{k-1}, \mathbf{g}_k - \mathbf{g}_{k-1} \rangle\) is analyzed:
    - In stages ②③ and ③④ (crossing the minimum): positive inner product → accelerates escape.
    - In stages ①② and ④⑤: negative inner product → decelerates (captured by valley).
  - Comparison with \(-\nabla\|\nabla f\|^2\): the latter yields a negative inner product in stage ②③ (counterproductively decelerating), making it inferior to gradient differences.
  - Design Motivation: \(-\nabla\|\nabla f\|^2 = -2\mathbf{H}_k \bar{\mathbf{g}}_k\); a large Hessian \(\mathbf{H}_k\) indicates a sharp minimum → large steps facilitate escape; a small \(\mathbf{H}_k\) indicates a flat minimum → small steps lead to capture.
- ALTO Algorithm (Adapted Lamb with Exploration); a code sketch is given after this list:
  - Function: Embeds the E-adaptor into the Lamb optimizer.
  - Core iterations:
    - \(\mathbf{a}_k = \beta_1 \mathbf{a}_{k-1} + (1-\beta_1)(\mathbf{g}_k - \mathbf{g}_{k-1})\) (EMA of acceleration term)
    - \(\mathbf{m}_k = \beta_2 \mathbf{m}_{k-1} + (1-\beta_2)(\mathbf{g}_k + \alpha \mathbf{a}_k)\) (first moment)
    - \(\mathbf{v}_k = \beta_3 \mathbf{v}_{k-1} + (1-\beta_3)[\mathbf{g}_k + \alpha \mathbf{a}_k]^2\) (second moment)
    - Bias correction → \(\hat{\mathbf{m}}_k, \hat{\mathbf{v}}_k\)
    - \(\mathbf{r}_k = \hat{\mathbf{m}}_k / (\sqrt{\hat{\mathbf{v}}_k} + \varepsilon_1) + \lambda_k \theta_k\)
    - Layer-wise regularized update: \(\theta_{k+1}^{(i)} = \theta_k^{(i)} - \eta_k \mathbf{r}_k^{(i)} \phi(\|\theta_k^{(i)}\|) / (\|\mathbf{r}_k^{(i)}\| + \varepsilon_2 \phi(\|\theta_k^{(i)}\|))\)
  - Key constraint: \(|\alpha| < 1/(1-\beta_1)\), derived from the stability condition of the continuous-time ODE.
- Why \(\mathbf{a}_k\) is Added to \(\mathbf{g}_k\) Rather Than \(\mathbf{m}_k\):
  - Adding to \(\mathbf{m}_k\) implies that gradient differences are on the same order of magnitude as gradients → severe oscillation or ineffectiveness.
  - Adding to \(\mathbf{g}_k\) means the resulting momentum contains EMA₂ (double exponential moving average) of gradient differences with weights \((1-\beta)^2 \beta^{k-i} \binom{k-i+1}{1}\) → smoother and more stable.
  - EMA₂ accumulates the most informative directions during the early stages of training (when gradients decay rapidly) and serves as a "navigation signal" in later stages.
- Effect of the Sign of \(\alpha\):
  - \(\alpha > 0\): faster convergence but less exploration; suitable for small-batch training.
  - \(\alpha < 0\): more exploration, finds flatter minima, but slower convergence; suitable for large-batch training.
  - Recommendation: small-batch \(\alpha = 0.5, \beta_1 = 0.01\); large-batch \(\alpha = -5, \beta_1 = 0.99\).
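Putting the core iterations and the sign-of-\(\alpha\) recommendations together, below is a minimal single-layer NumPy sketch of one ALTO step. It is a reading aid under stated assumptions, not the authors' released implementation: \(\phi\) is taken as the identity (in Lamb-style updates it typically clips the weight norm), \(\lambda_k\) is a constant weight-decay coefficient, and the values of \(\beta_2, \beta_3\), the learning rate, and \(\varepsilon_{1,2}\) are illustrative. Note that the recommended settings satisfy the key constraint: for \(\beta_1 = 0.99\), \(1/(1-\beta_1) = 100 > |{-5}|\); for \(\beta_1 = 0.01\), \(1/(1-\beta_1) \approx 1.01 > 0.5\).

```python
import numpy as np

def alto_step(theta, g, state, lr=1e-2, alpha=-5.0, beta1=0.99, beta2=0.9,
              beta3=0.999, weight_decay=1e-2, eps1=1e-6, eps2=1e-6,
              phi=lambda x: x):
    """One ALTO update for a single layer theta (illustrative defaults;
    phi is the layer-wise scaling function, taken here as the identity)."""
    assert abs(alpha) < 1.0 / (1.0 - beta1)        # stability constraint

    k = state.get("k", 0) + 1
    prev_g = state.get("prev_g", g)                # g_{k-1}; first difference is zero
    a = state.get("a", np.zeros_like(g))
    m = state.get("m", np.zeros_like(g))
    v = state.get("v", np.zeros_like(g))

    a = beta1 * a + (1 - beta1) * (g - prev_g)     # EMA of gradient differences
    g_eff = g + alpha * a                          # explored gradient g_k + alpha * a_k
    m = beta2 * m + (1 - beta2) * g_eff            # first moment
    v = beta3 * v + (1 - beta3) * g_eff ** 2       # second moment
    m_hat = m / (1 - beta2 ** k)                   # bias correction
    v_hat = v / (1 - beta3 ** k)
    r = m_hat / (np.sqrt(v_hat) + eps1) + weight_decay * theta

    # Layer-wise regularized update (Lamb-style trust ratio with phi).
    w_norm, r_norm = np.linalg.norm(theta), np.linalg.norm(r)
    theta = theta - lr * r * phi(w_norm) / (r_norm + eps2 * phi(w_norm))

    state.update(k=k, prev_g=g.copy(), a=a, m=m, v=v)
    return theta, state
```

A toy loop would initialize `state = {}` per layer and call `theta, state = alto_step(theta, grad, state)` once per iteration.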
Convergence Analysis¶
- Theorem 1 (Non-convex): Under assumptions of \(L\)-smoothness and unbiased gradient with bounded variance, for \(T \geq O(G_\infty^{1.5} \epsilon^{-2})\), \(\frac{1}{T+1}\sum_{k=0}^T \mathbb{E}\|\nabla f_k(\theta_k)\|^2 \leq 4\epsilon^2\). This improves upon Lamb's \(O(\epsilon^{-4})\) rate.
- Theorem 2 (Convex): \(R(T) \leq O(\sqrt{T})\), on par with Adam, but under more relaxed conditions (\(\beta_2^2/\beta_3 < 1\) rather than \(\beta_{1,k} = \beta_1 \lambda^k\)).
Key Experimental Results¶
Large-Batch ImageNet Training (ResNet-50, 90 epochs, top-1 test accuracy %)¶
| Batch Size | Adam | AdamW | AdaBelief | Lamb | ALTO |
|---|---|---|---|---|---|
| 1K | 73.08 | 75.65 | 73.32 | 77.06 | 77.22 |
| 2K | 73.08 | 74.93 | 73.48 | 77.11 | 77.25 |
| 4K | 73.32 | 74.65 | 73.41 | 76.92 | 77.35 |
| 8K | 73.11 | 74.40 | 73.14 | 76.89 | 77.10 |
| 16K | 73.09 | 74.10 | 73.00 | 76.66 | 76.87 |
| 32K | 72.50 | 73.57 | 72.89 | 76.42 | 76.70 |
CIFAR-10/100 + ImageNet (ResNet-20/34, test accuracy %)¶
| Dataset | Batch | SGD | Adam | Lamb | ALTO |
|---|---|---|---|---|---|
| CIFAR-10 | 128 | 91.85 | 89.88 | 90.89 | 91.24 |
| CIFAR-10 | 16384 | 80.86 | 87.34 | 83.56 | 88.83 |
| CIFAR-100 | 128 | 64.93 | 64.35 | 61.29 | 65.74 |
| CIFAR-100 | 16384 | 44.20 | 54.91 | 56.06 | 57.78 |
| ImageNet | 256 | 70.64 | 65.06 | 69.17 | 69.95 |
| ImageNet | 4086 | 49.35 | 54.96 | 70.34 | 70.83 |
Training Time Comparison (VGG-16, CIFAR-100, batch=16384)¶
| Target Accuracy | ALTO (s) | Lamb (s) | Speedup |
|---|---|---|---|
| 20% | 137 | 196 | 1.43× |
| 40% | 334 | 409 | 1.23× |
| 60% | 608 | 865 | 1.42× |
Key Findings¶
- ALTO outperforms the SOTA baseline (Lamb) across all 17 CV+NLP experiments.
- The advantage is more pronounced at large batch sizes: at batch=16384, ALTO surpasses Lamb by 5.27% on CIFAR-10.
- ALTO at large-batch ImageNet (batch=4086, 70.83%) exceeds SGD at small-batch (batch=256, 70.64%).
- In GPT-2 training, ALTO achieves a test perplexity of 78.37, substantially better than Lamb's 83.13.
- To reach the same accuracy, ALTO can save 29.68% of computation time.
Highlights & Insights¶
- Constraint derivation from ODE perspective: The discrete optimizer is cast as a continuous-time ODE, and the stability condition on eigenvalue real parts yields the constraint \(|\alpha| < 1/(1-\beta_1)\), bridging theoretical stability and practical hyperparameter selection.
- EMA₂ = memory of early directions: The EMA of EMA allows the optimizer to leverage informative directions accumulated during early training—when gradients decay rapidly—as a "navigation guide" in later stages, which is particularly beneficial for large-batch training.
- Large batch = more reliable gradient differences: Larger batches produce more accurate gradient estimates → more reliable gradient differences \(\mathbf{g}_k - \mathbf{g}_{k-1}\) → greater advantage for ALTO, which explains why improvements are most pronounced in the large-batch regime.
Limitations & Future Work¶
- Five additional hyperparameters are introduced (\(\alpha, \beta_1, \beta_2, \beta_3, \varepsilon\)); although the authors claim that only \(\beta_1\) and \(\eta\) typically require tuning, the overall tuning burden is still increased.
- Each step incurs a minor additional cost for maintaining the EMA of gradient differences, resulting in slightly longer epoch times compared to Lamb.
- The non-convex convergence analysis relies on relatively strong assumptions (Assumptions 3.3–3.5), particularly the monotonicity assumption, which may not hold in practice.
- Experiments are conducted on a single node with 4×A100 GPUs; communication bottlenecks in multi-node distributed settings remain unevaluated.
- The relationship between the sign of \(\alpha\) and batch size is empirically motivated, lacking theoretical justification.
Related Work & Insights¶
- vs. Lamb [You et al., 2020]: Lamb serves as the foundation of ALTO, with layer-wise regularization inherited from it. ALTO with the exploration term \(\mathbf{a}_k\) outperforms Lamb at all batch sizes.
- vs. AdaBelief [Zhuang et al., 2020]: AdaBelief replaces \(g_k^2\) with \((g_k - m_{k-1})^2\) in the adaptive learning rate, focusing on gradient prediction accuracy. ALTO instead focuses on the directional information contained in gradient differences.
- vs. SAM/Sharpness-Aware methods: SAM explicitly minimizes loss landscape sharpness. ALTO implicitly favors flat minima via gradient differences, without requiring additional forward passes.
- vs. LR warmup/cosine schedules: Scheduling strategies operate in the temporal dimension. ALTO's exploration operates in the geometric/directional dimension; the two are orthogonal and can be combined.
Rating¶
- Novelty: ⭐⭐⭐⭐ The insight of using gradient differences as a valley-tracking direction is original, and the EMA₂ design is theoretically grounded.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage across CV/NLP/RL tasks, multiple models, and multiple batch sizes, with 17 consistently positive results.
- Writing Quality: ⭐⭐⭐⭐ The directional analysis diagrams and tables (Table 1, Fig. 2) are highly intuitive.
- Value: ⭐⭐⭐⭐ Directly applicable to large-batch training; ALTO can serve as a drop-in replacement for Lamb.