In Search of Adam's Secret Sauce¶
Conference: NeurIPS 2025 | arXiv: 2505.21829 | Code: GitHub | Area: Optimization | Keywords: Adam, Signum, implicit bias, variational inference, signal-to-noise ratio, language modeling
TL;DR¶
Through large-scale experiments training 1500+ language models, this paper establishes: (1) Signum closes 96% of the SGD–Adam gap yet remains 25% slower than Adam; (2) setting \(\beta_1 = \beta_2\) is a near-optimal simplification of Adam; (3) under \(\beta_1 = \beta_2 = \beta\), Adam can be reinterpreted as a signal-to-noise ratio–adaptive Signum that estimates the gradient mean and variance via online Gaussian variational inference.
Background & Motivation¶
The indispensability of Adam: Although recent optimizers such as Muon, Scion, and SOAP outperform Adam in certain settings, they still rely on Adam to update embedding layers, LM heads, and normalization parameters. The core advantage of Adam remains incompletely understood.
Adam ≈ Signum?: Recent work has highlighted a close connection between Adam and SignSGD with momentum (Signum). Nevertheless, at the 160M parameter scale, a carefully tuned Signum still incurs a 25% effective slowdown—requiring 25% more training budget to reach the same perplexity.
Core Problem: What is the "secret sauce" that distinguishes Adam from simplified variants such as Signum, RMSprop, and SGD?
Methodology: Approximately 10,000 A100 GPU hours are invested to systematically ablate all hyperparameters—including independently tuning momentum parameters for each learning rate—providing a comprehensive and reproducible benchmark.
Method¶
Large-Scale Benchmark Experiments (§3)¶
Basic setup: 160M-parameter Transformer LM, SlimPajama dataset, Chinchilla-optimal training.
Run counts:

- SGD: 131 runs (sweeping weight decay, gradient clipping, momentum, learning rate)
- RMSprop: 48 runs
- Signum: 70 runs (sweeping clipping, momentum, learning rate)
- Adam: 200 runs (sweeping \(\beta_1\), \(\beta_2\), learning rate)
Takeaway 1: Signum Closes the Gap but Falls Short¶
| Optimizer | Validation Perplexity |
|---|---|
| Adam | 21.86 ± 0.21 |
| Signum | 23.23 ± 0.16 |
| RMSprop | 27.04 ± 0.34 |
| SGD+Cclip | 33.40 ± 0.39 |
| SignSGD | 36.78 ± 0.57 |
| SGD+Gclip | 37.76 ± 0.61 |
| SGD | 53.62 ± 5.14 |
Signum closes 96% of the SGD–Adam gap, yet a 1.37 perplexity-point difference remains, corresponding to a 25% training efficiency loss.
Takeaway 2: \(\beta_1 = \beta_2\) Is a Near-Optimal Choice¶
Among 200 Adam configurations, the optimal \(\beta_2\) is consistently close to \(\beta_1\) for every value of \(\beta_1\). Restricting to \(\beta_1 = \beta_2\) incurs a performance degradation of no more than 0.3 perplexity points—far smaller than the 1.37-point gap with Signum.
Theoretical Interpretation (§4): A Variational Inference Perspective¶
Setting \(\beta_1 = \beta_2 = \beta\) and \(\epsilon = 0\) (experiments confirm that \(\epsilon\) has no significant effect over the range \([10^{-15}, 10^{-6}]\)):
Proposition 1: The Adam update direction can be rewritten as

\[
\frac{m_k}{\sqrt{v_k}} = \frac{m_k}{\sqrt{m_k^2 + \beta \cdot \text{EMA}_\beta\left[(m_{k-1} - g_k)^2\right]}}.
\]

The term \(\beta \cdot \text{EMA}_\beta[(m_{k-1} - g_k)^2]\) in the denominator, denoted \(\sigma_k^2\) below, is precisely an online estimate of the gradient variance.
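A minimal numerical check of the identity behind Proposition 1 (an illustrative sketch, not the authors' code; the synthetic gradient stream and \(\beta = 0.95\) are arbitrary choices): with zero-initialized buffers and \(\beta_1 = \beta_2 = \beta\), Adam's second-moment buffer decomposes exactly into the squared first moment plus the variance term above.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.95
grads = rng.normal(loc=1.0, scale=2.0, size=1000)  # synthetic gradient stream (illustrative)

m = v = s = 0.0  # Adam's first/second-moment buffers and the variance EMA, zero-initialized
for g in grads:
    m_prev = m
    m = beta * m + (1 - beta) * g                         # Adam's m_k
    v = beta * v + (1 - beta) * g**2                      # Adam's v_k
    s = beta * s + beta * (1 - beta) * (m_prev - g) ** 2  # beta * EMA_beta[(m_{k-1} - g_k)^2]

print(np.isclose(v, m**2 + s))  # True: v_k = m_k^2 + variance term, so m_k/sqrt(v_k) matches Prop. 1
```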
Theorem 4.1 (Variational Inference Interpretation): The two momentum buffers of Adam are the closed-form solution of a KL-regularized maximum-likelihood problem: fit a Gaussian \(\mathcal{N}(\mu, \sigma^2)\) to the newly observed gradient \(g_{k+1}\), while penalizing (with weight \(\lambda\)) the KL divergence from the previous estimate \(\mathcal{N}(m_k, \sigma_k^2)\). With \(\beta = 1/(1+\lambda)\), the solutions are:

- \(m_{k+1} = \beta m_k + (1-\beta) g_{k+1} = \text{EMA}_\beta[g_{k+1}]\)
- \(\sigma_{k+1}^2 = \beta \sigma_k^2 + \beta(1-\beta)(m_k - g_{k+1})^2 = \beta \cdot \text{EMA}_\beta[(m_k - g_{k+1})^2]\)
Signal-to-Noise Ratio (SNR) Adaptive Trust Region: Adam can be viewed as Signum whose per-coordinate step size is scaled by the factor \(1/\sqrt{1 + \sigma_k^2 / m_k^2}\), i.e. it adapts to the SNR \(m_k^2/\sigma_k^2\) (see the numeric sketch after this list):
- High SNR (\(\sigma_k^2 / m_k^2\) small) → step size ≈ 1 → approaches Signum
- Low SNR → step size is shrunk → conservative update
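The equivalence is easy to verify numerically. A small sketch (the mean and variance values below are illustrative, not taken from the paper) showing that the \(\beta_1 = \beta_2\) Adam direction equals a sign step scaled by \(1/\sqrt{1 + \sigma_k^2/m_k^2}\), shrinking as the SNR drops:

```python
import numpy as np

m = np.array([1.0, 1.0, 1.0])           # gradient-mean estimates (illustrative values)
sigma2 = np.array([0.01, 1.0, 100.0])   # variance estimates: high, medium, low SNR

adam_dir = m / np.sqrt(m**2 + sigma2)                 # beta1 = beta2 Adam direction (eps = 0)
snr_signum = np.sign(m) / np.sqrt(1 + sigma2 / m**2)  # sign step scaled by the SNR factor

print(adam_dir)                           # ~[0.995, 0.707, 0.0995]: shrinks as SNR drops
print(np.allclose(adam_dir, snr_signum))  # True: the two expressions coincide
```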
Uniqueness (Proposition, §C.2): The Adam update direction admits the representation \(m_k / \sqrt{m_k^2 + \gamma \cdot \text{EMA}_\tau[(a m_{k-1} - b g_k)^2]}\) if and only if \(\beta_1 = \beta_2\). That is, only when the two beta parameters are equal does the denominator admit an exact variance-estimation interpretation.
Ruling Out Confounders (§3.3)¶
| Confounder | Conclusion |
|---|---|
| \(\epsilon\) value | No significant difference over \([10^{-15}, 10^{-6}]\); adding an \(\epsilon\) mollifier to Signum yields no improvement |
| Momentum initialization | Zero initialization vs. gradient initialization makes no meaningful difference |
| Bias correction | Final validation perplexity is nearly unchanged |
Key Experimental Results¶
Robustness of \(\beta_1 = \beta_2\) Across Settings¶
| Ablation Dimension | Performance of \(\beta_1 = \beta_2\) |
|---|---|
| Different batch sizes (128, 256, 512) | Consistently near-optimal |
| Different sequence lengths (512, 2048) | Consistently near-optimal |
| Different data (SlimPajama, Fineweb) | Consistently near-optimal |
| No weight decay | Consistently near-optimal |
| 2× token budget | Consistently near-optimal |
| 410M parameter model (44 runs) | Consistently near-optimal |
Standard \((\beta_1, \beta_2) = (0.9, 0.95)\) vs. Equal Beta¶
| Model Scale | \((0.9, 0.95)\) | Equal Beta Optimum |
|---|---|---|
| 160M | Competitive | Comparable or better |
| 410M | Suboptimal (Figure 5) | Better |
Quadratic Function Validation (§5)¶
| Setting | SGD | Signum | Adam (\(\beta_1=\beta_2\)) |
|---|---|---|---|
| Homogeneous Hessian | Slow | Fast | Fast (≈ Signum) |
| Heterogeneous Hessian (Transformer-like) | Very slow | Moderate | Fastest |
Key insight: On heterogeneous loss landscapes, the variance term \(\sigma_k^2\) takes on different magnitudes across parameter blocks, enabling block-level adaptivity that a fixed mollifier cannot replicate.
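A toy illustration of this point, assuming a small diagonal quadratic with one high-curvature and one low-curvature block (the curvatures, learning rate, and step count below are arbitrary choices, not the paper's §5 setup): the fixed sign step uses the same magnitude in every coordinate, while the variance-adaptive step shrinks where gradients change quickly and stays near 1 elsewhere, so one would expect the \(\beta_1 = \beta_2\) update to reach a lower final loss.

```python
import numpy as np

# Toy diagonal quadratic f(x) = 0.5 * sum_i h_i * x_i^2 with two curvature blocks.
h = np.array([100.0, 100.0, 0.1, 0.1])   # illustrative curvatures
x0 = np.ones_like(h)

def run(direction, lr=0.02, beta=0.95, steps=500):
    x, m, s = x0.copy(), np.zeros_like(h), np.zeros_like(h)
    for _ in range(steps):
        g = h * x
        m_prev = m.copy()
        m = beta * m + (1 - beta) * g                         # gradient-mean EMA
        s = beta * s + beta * (1 - beta) * (m_prev - g) ** 2  # online variance estimate
        x = x - lr * direction(m, s)
    return 0.5 * np.sum(h * x**2)

signum = lambda m, s: np.sign(m)                      # fixed unit step in every coordinate
adam_eq = lambda m, s: m / np.sqrt(m**2 + s + 1e-30)  # SNR-adaptive step (beta1 = beta2, eps ~ 0)

print("Signum        :", run(signum))
print("Adam (b1 = b2):", run(adam_eq))
```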
Highlights & Insights¶
- Adam = SNR-Adaptive Signum: This result precisely refines the observation of Balles & Hennig (2018)—who could not prove that the denominator estimates variance because they assumed \(\beta_1 \neq \beta_2\). The \(\beta_1 = \beta_2\) simplification makes this connection exact.
- Unexpected Elegance of the Variational Inference Perspective: Adam's two EMA buffers are exactly the online solution to Gaussian variational inference, and the regularization strength \(\lambda\) maps exactly onto the EMA coefficient through \(\beta = 1/(1+\lambda)\).
- A Fixed \(\epsilon\) Cannot Replace Adaptive Variance: Both experiments and the quadratic example demonstrate that Signum with a constant mollifier cannot match Adam's performance; a data-driven adaptive variance term is necessary.
- \(\beta_1 = \beta_2 = 0.95\) as a Default for LM Training: This setting has been independently adopted by Zhao et al. and Shah et al., among others.
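For reference, a minimal sketch of how this default drops into a standard PyTorch configuration (the model, learning rate, and weight decay are placeholders; only the betas reflect the paper's recommendation):

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for a Transformer LM

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,                 # placeholder; tune per model and batch size
    betas=(0.95, 0.95),      # beta1 = beta2: a single momentum parameter to tune
    weight_decay=0.1,        # placeholder
)
```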
Limitations & Future Work¶
- Limited to 160M and 410M scales: Despite the remarkable number of runs (1500+), the findings are not validated on models with 1B+ parameters.
- Fixed hyperparameter grids: Although the best settings lie in the interior of the grids rather than on their boundaries, different grid choices could in principle yield different conclusions.
- \(\beta_1 = \beta_2\) may shift slightly at small batch sizes: Figure 3 hints at this for batch size 128.
- Theorem 4.1 explains Adam's structure but not why that structure works: Why should the mean and variance be arranged as a ratio?
- The effect of weight decay on the theory is not addressed: The interaction between weight decay and the Adam update in AdamW is not covered theoretically.
Rating¶
- Novelty: ⭐⭐⭐⭐ The \(\beta_1 = \beta_2\) finding appears simple yet carries far-reaching implications; the variational inference interpretation is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 1500+ runs, ~10K A100 GPU hours, covering a wide range of ablation dimensions.
- Theoretical Depth: ⭐⭐⭐⭐ Proposition 1 and Theorem 4.1 are rigorously derived; the uniqueness proof is compelling.
- Writing Quality: ⭐⭐⭐⭐⭐ The narrative is clear (empirical findings → simplification → theoretical explanation → validation), and figures are carefully designed.
- Practical Value: ⭐⭐⭐⭐⭐ Directly actionable tuning advice: \(\beta_1 = \beta_2\) reduces Adam to a single-momentum-parameter optimizer.