Natural Gradient VI: Guarantees for Non-Conjugate Models¶
**Conference:** NeurIPS 2025 · **arXiv:** 2510.19163 · **Code:** None · **Area:** Optimization / Variational Inference · **Keywords:** natural gradient, variational inference, non-conjugate models, mirror descent, relative smoothness, convergence guarantees
TL;DR¶
Under mean-field parameterization, this paper establishes three key theoretical results for natural gradient variational inference (NGVI) in non-conjugate models: a relative smoothness condition on the variational loss, a global convergence-to-stationary-point guarantee for a modified NGVI with non-Euclidean projections, and, under additional structural assumptions, hidden convexity and fast global convergence guarantees.
Background & Motivation¶
Background: Stochastic NGVI is one of the most widely used methods for approximating posterior distributions. It has been shown to be a special case of stochastic mirror descent and is deeply connected to information geometry.
Limitations of Prior Work:

- For conjugate models (where the prior is conjugate to the likelihood), recent work has established convergence guarantees via relative smoothness and strong convexity.
- However, these results do not apply to non-conjugate models, where the variational loss becomes non-convex and considerably harder to analyze.
- Non-conjugate settings encompass a broad class of practical models, including logistic regression and neural networks.
Core Problem:

- Why does NGVI perform well empirically on non-conjugate models?
- Can rigorous convergence guarantees be provided?
- Does the non-conjugate variational loss possess some form of hidden favorable structure?
Key Insight: The paper focuses on mean-field parameterization and advances the theory along three axes: smoothness conditions, convergence to stationary points, and hidden convexity.
Method¶
Overall Architecture¶
Variational Inference Basics¶
Given a model \(p(x, y) = p(y|x)\, p(x)\), the goal is to approximate the posterior \(p(x|y)\) with a variational distribution \(q_\lambda(x)\). The variational loss (negative ELBO) is:

\[
\mathcal{L}(\lambda) \;=\; \mathbb{E}_{q_\lambda}\bigl[-\log p(y|x)\bigr] + \mathrm{KL}\bigl(q_\lambda(x) \,\|\, p(x)\bigr).
\]
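As a concrete illustration, here is a minimal Monte Carlo estimator of this loss for a mean-field Gaussian via the reparameterization trick; the callables `log_lik` and `log_prior` are assumed stand-ins for the model at hand, not part of the paper.

```python
import numpy as np

def neg_elbo(mu, sigma2, log_lik, log_prior, n_samples=100, seed=0):
    """Monte Carlo estimate of the variational loss (negative ELBO) for a
    mean-field Gaussian q_lambda(x) = prod_i N(x_i; mu_i, sigma2_i):
    L(lambda) = E_q[-log p(y|x) - log p(x)] - H[q_lambda].

    log_lik and log_prior take an (n_samples, d) array of samples and
    return a length-n_samples array of log-densities."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    x = mu + np.sqrt(sigma2) * eps  # reparameterized samples x = mu + sigma * eps
    # The entropy of a diagonal Gaussian is available in closed form.
    entropy = 0.5 * np.sum(np.log(2.0 * np.pi * np.e * sigma2))
    return -np.mean(log_lik(x) + log_prior(x)) - entropy
```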
NGVI Algorithm¶
NGVI updates the parameters by preconditioning the gradient of the loss with the inverse Fisher information matrix \(F(\lambda)\) of the variational family:

\[
\lambda_{t+1} \;=\; \lambda_t - \eta_t\, F(\lambda_t)^{-1} \nabla_\lambda \mathcal{L}(\lambda_t).
\]
Equivalence: For exponential family distributions, NGVI is equivalent to mirror descent in the natural parameter space with the negative entropy as the mirror map.
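To make the equivalence concrete, here is a minimal sketch of one NGVI step for the mean-field Gaussian family, using the standard exponential-family identity \(F(\lambda)^{-1}\nabla_\lambda \mathcal{L} = \nabla_m \mathcal{L}\), where \(m\) denotes the expectation parameters; the function names are illustrative, not from the paper.

```python
import numpy as np

def ngvi_step(lam1, lam2, grad_m1, grad_m2, eta):
    """One natural-gradient VI step for a mean-field Gaussian.

    Natural parameters:     lam1 = mu / sigma2,  lam2 = -1 / (2 * sigma2).
    Expectation parameters: m1 = E[x] = mu,      m2 = E[x^2] = mu^2 + sigma2.
    For exponential families F(lam)^{-1} grad_lam L = grad_m L, so the
    natural-gradient update is a plain gradient step in natural-parameter
    space with gradients taken w.r.t. the expectation parameters."""
    return lam1 - eta * grad_m1, lam2 - eta * grad_m2

def natural_to_moment(lam1, lam2):
    """Recover (mu, sigma2) from natural parameters; requires lam2 < 0."""
    sigma2 = -1.0 / (2.0 * lam2)
    return lam1 * sigma2, sigma2
```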
Key Designs¶
Contribution 1: Relative Smoothness Condition¶
Problem: Standard (Euclidean) smoothness does not hold for the variational loss in non-conjugate models.
Theorem 1 (informal): For mean-field Gaussian variational distributions \(q_\lambda(x) = \prod_i \mathcal{N}(x_i; \mu_i, \sigma_i^2)\), if the log-likelihood \(\log p(y|x)\) satisfies:

- Bounded Hessian: \(\|\nabla^2 \log p(y|x)\|\) is bounded,
- Bounded third-order derivatives,
then the variational loss \(\mathcal{L}(\lambda)\) satisfies relative smoothness with respect to the mirror map \(\phi\): there exists a constant \(L > 0\) such that

\[
\mathcal{L}(\lambda') \;\le\; \mathcal{L}(\lambda) + \langle \nabla \mathcal{L}(\lambda),\, \lambda' - \lambda \rangle + L\, D_\phi(\lambda', \lambda),
\]

where \(D_\phi\) is the Bregman divergence induced by \(\phi\).
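This inequality can be probed numerically. The sketch below takes the mirror map, loss, and their gradients as assumed callables and checks the relative-smoothness bound on sampled parameter pairs; it is illustrative only, not the paper's verification procedure.

```python
import numpy as np

def bregman(phi, grad_phi, a, b):
    """Bregman divergence D_phi(a, b) = phi(a) - phi(b) - <grad phi(b), a - b>."""
    return phi(a) - phi(b) - np.dot(grad_phi(b), a - b)

def check_relative_smoothness(loss, grad_loss, phi, grad_phi, pairs, L):
    """Fraction of parameter pairs (a, b) on which the relative-smoothness
    inequality L(a) <= L(b) + <grad L(b), a - b> + L * D_phi(a, b) holds."""
    hits = 0
    for a, b in pairs:
        rhs = loss(b) + np.dot(grad_loss(b), a - b) + L * bregman(phi, grad_phi, a, b)
        hits += loss(a) <= rhs + 1e-10  # small tolerance for round-off
    return hits / len(pairs)
```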
Contribution 2: Modified NGVI and Convergence to Stationary Points¶
Building on relative smoothness, the paper proposes a modified NGVI algorithm that:

- performs mirror descent in the natural parameter space,
- adds a non-Euclidean projection step to ensure parameters remain in the feasible domain (see the sketch below), and
- sets the step size based on the relative smoothness constant \(L\).
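A minimal sketch of the modified step for the diagonal-Gaussian family follows. The paper's projection is a non-Euclidean (Bregman) projection onto the feasible domain; the componentwise safeguard below only illustrates the feasibility constraint \(\lambda_2 < 0\) (positive variance) and is not the paper's exact operator.

```python
import numpy as np

def projected_ngvi_step(lam1, lam2, grad_m1, grad_m2, eta, delta=1e-6):
    """Mirror-descent update followed by a projection back into the
    feasible natural-parameter domain {lam2 < 0} of the mean-field
    Gaussian family (lam2 = -1/(2 sigma2) must stay negative)."""
    lam1_new = lam1 - eta * grad_m1
    lam2_new = lam2 - eta * grad_m2
    # Feasibility safeguard: keep lam2 strictly negative. The paper uses
    # a Bregman projection here; clipping is a simple stand-in.
    lam2_new = np.minimum(lam2_new, -delta)
    return lam1_new, lam2_new
```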
Theorem 2 (informal): The modified NGVI achieves a global non-asymptotic convergence guarantee: with the step size set from \(L\), a Bregman measure of stationarity \(\mathcal{G}(\lambda_t)\) of the iterates satisfies

\[
\min_{0 \le t < T} \mathcal{G}(\lambda_t) \;=\; O\!\left(\frac{\mathcal{L}(\lambda_0) - \mathcal{L}^\star}{T}\right),
\]

that is, convergence to a stationary point at rate \(O(1/T)\).
Contribution 3: Hidden Convexity and Fast Global Convergence¶
Key Finding: When the likelihood satisfies additional structural assumptions (e.g., log-concavity), the variational loss, though non-convex in Euclidean space, becomes convex after an appropriate transformation via the mirror map in the natural parameter space.
Theorem 3 (informal): When the likelihood is log-concave, the variational loss satisfies relative strong convexity (with constant \(\mu\)) with respect to the mirror map, and NGVI converges to the global optimum \(\lambda^\star\) at a linear rate:

\[
\mathcal{L}(\lambda_T) - \mathcal{L}(\lambda^\star) \;\le\; \left(1 - \frac{\mu}{L}\right)^{T} \bigl(\mathcal{L}(\lambda_0) - \mathcal{L}(\lambda^\star)\bigr).
\]
Applicable Settings¶
| Model Type | Likelihood | Applicable Theory |
|---|---|---|
| Conjugate models | Exponential family | Prior results (relative strong convexity) |
| Logistic regression | Sigmoid | Theorem 2 (stationary point) + Theorem 3 (global convergence) |
| Probit regression | Gaussian CDF | Theorem 2 + Theorem 3 |
| Poisson regression | Exp link function | Theorem 2 (stationary point) |
| Neural networks | Arbitrary smooth | Theorem 2 (stationary point, under boundedness conditions) |
Key Experimental Results¶
Main Results¶
NGVI Convergence on Logistic Regression (Synthetic Data)¶
| Method | Iterations (to \(\epsilon=10^{-4}\)) | Final KL Divergence | Runtime (s) |
|---|---|---|---|
| Standard VI (Adam) | 8,500 | 2.3e-4 | 12.4 |
| NGVI (standard) | 3,200 | 1.8e-4 | 5.1 |
| NGVI + non-Euclidean projection (Ours) | 2,800 | 1.5e-4 | 4.7 |
| NGVI + projection + adaptive step size | 2,100 | 9.2e-5 | 3.8 |
Convergence Comparison Across Non-Conjugate Models¶
| Model | NGVI Convergence Rate (empirical) | Theoretical Prediction | Agreement |
|---|---|---|---|
| Logistic regression (\(d=10\)) | \(O(e^{-0.12t})\) | Linear convergence | ✓ |
| Logistic regression (\(d=100\)) | \(O(e^{-0.03t})\) | Linear convergence | ✓ |
| Probit regression (\(d=10\)) | \(O(e^{-0.15t})\) | Linear convergence | ✓ |
| Poisson regression (\(d=10\)) | \(O(1/t)\) | Sublinear | ✓ |
| NN (1 layer, \(d=20\)) | \(O(1/t^{0.8})\) | Sublinear | ≈ |
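As a note on methodology, empirical rates of this form are typically obtained by regressing the log optimality gap against the iteration index: log-linear for a linear (geometric) rate, log-log for a power law. A hedged sketch of such a fit, not the paper's actual script:

```python
import numpy as np

def fit_linear_rate(gaps):
    """Fit gap_t ~ C * exp(-c * t) by least squares on log(gap_t);
    returns the estimated decay constant c. gaps must be positive."""
    t = np.arange(len(gaps))
    slope, _ = np.polyfit(t, np.log(gaps), 1)
    return -slope

def fit_power_rate(gaps):
    """Fit gap_t ~ C / t^p on log-log axes; returns the exponent p."""
    t = np.arange(1, len(gaps) + 1)
    slope, _ = np.polyfit(np.log(t), np.log(gaps), 1)
    return -slope
```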
Ablation Study¶
Effect of Dimensionality on Convergence Rate (Logistic Regression)¶
| Dimension \(d\) | Convergence Constant \(\mu\) | Iterations to Convergence | Relative Smoothness Constant \(L\) | \(L/\mu\) Ratio |
|---|---|---|---|---|
| 5 | 0.21 | 950 | 1.8 | 8.6 |
| 10 | 0.12 | 1,800 | 2.3 | 19.2 |
| 50 | 0.04 | 6,200 | 4.1 | 102.5 |
| 100 | 0.03 | 12,500 | 5.7 | 190.0 |
| 500 | 0.008 | 48,000 | 12.3 | 1537.5 |
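Under the linear rate of Theorem 3, the iteration count needed to reach accuracy \(\epsilon\) scales as \(T \approx (L/\mu)\log(\Delta_0/\epsilon)\), which matches the qualitative trend above: iterations grow with the condition number \(L/\mu\). A small helper to compute this prediction (the initial gap \(\Delta_0\) is an assumed input, not reported in the table):

```python
import numpy as np

def predicted_iters(L, mu, gap0=1.0, eps=1e-4):
    """Iterations T such that (1 - mu/L)^T * gap0 <= eps under a linear rate."""
    return int(np.ceil(np.log(gap0 / eps) / -np.log(1.0 - mu / L)))
```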
Role of Non-Euclidean Projection¶
| Method | Logistic Regression (steps) | Poisson Regression (steps) | Divergence Observed |
|---|---|---|---|
| NGVI without projection | 3,200 | Diverges (15% of runs) | Yes |
| NGVI + Euclidean projection | 3,000 | 7,500 | Occasional |
| NGVI + non-Euclidean projection | 2,800 | 5,800 | No |
Key Findings¶
- Relative smoothness holds in practice: Experiments validate the theoretically predicted relative smoothness condition; empirically measured smoothness constants are consistent with theoretical estimates.
- Hidden convexity confirmed: Linear convergence rates are observed on log-concave likelihood models (logistic and probit regression), matching theoretical predictions.
- Non-Euclidean projection is critical: Standard NGVI risks divergence on models such as Poisson regression; the non-Euclidean projection effectively resolves this stability issue.
- Dimension dependence: The condition number \(L/\mu\) grows with dimension, slowing convergence, yet the theoretically guaranteed convergence rates are preserved.
- Practical guidance: For log-concave likelihoods, NGVI can be applied with confidence and fast convergence is expected; for general non-convex likelihoods, convergence to a stationary point is at least guaranteed.
Highlights & Insights¶
- Closing a theoretical gap: This work provides the first rigorous convergence guarantees for NGVI on non-conjugate models, bridging the gap between empirical success and theoretical understanding.
- Discovery of hidden convexity: The paper reveals that the seemingly non-convex variational loss exhibits convex structure under an appropriate parameterization—a deep geometric insight.
- Practical algorithmic improvement: The non-Euclidean projection is not merely a theoretical device; it also improves the numerical stability of NGVI in practice.
- Unified perspective: The analysis of NGVI for both conjugate and non-conjugate models is unified within the framework of relative smoothness and relative strong convexity.
Limitations & Future Work¶
- Mean-field assumption: Only the simplest mean-field parameterization (diagonal Gaussian) is analyzed; full-covariance or more expressive variational families are not addressed.
- Boundedness conditions may be too restrictive: The boundedness conditions required by the theory are not necessarily satisfied in practice for neural network likelihoods.
- Dimension dependence: The condition number grows polynomially with dimension, potentially making convergence guarantees overly conservative in high-dimensional settings.
- Insufficient analysis of stochastic noise: Although stochastic NGVI is considered, the analysis of noise effects is largely confined to bounded-variance assumptions.
- Mixture distributions not covered: Multimodal posteriors require mixture variational distributions, which lie beyond the current mean-field framework.
- Computing the Fisher information matrix: Exact computation of the Fisher matrix remains a challenge in large-scale models.
Related Work & Insights¶
- Khan & Rue (2021): Established the equivalence between NGVI and mirror descent.
- Lin et al. (2024): Provided convergence guarantees for NGVI in conjugate models based on relative strong convexity.
- Variational inference theory: Blei et al. (2017) survey and subsequent convergence analysis work.
- Natural gradient methods: The information-geometric framework introduced by Amari (1998).
- Mirror descent: The classical optimization method of Nemirovsky & Yudin (1983).
Rating¶
- Novelty: ★★★★☆ (significant theoretical advance within an established framework)
- Experimental Thoroughness: ★★★★☆ (experimental design closely aligned with theoretical validation)
- Value: ★★★★☆ (provides clear practical guidance for VI practitioners)
- Writing Quality: ★★★★★ (exemplary theoretical paper with clear structure and well-motivated contributions)