Natural Gradient Descent for Improving Variational Inference Based Classification of Radio Galaxies
- Conference: NeurIPS 2025
- arXiv: 2511.13224
- Code: Available
- Area: Bayesian Deep Learning / Radio Astronomy
- Keywords: Natural Gradient Descent, Variational Inference, Bayesian Neural Networks, Uncertainty Calibration, Radio Galaxy Classification
TL;DR
This work replaces standard SGD with the natural gradient descent optimizer iVON for optimizing BNN parameters under variational inference, achieving better uncertainty calibration in radio galaxy classification while maintaining predictive performance comparable to HMC and BBB-VI.
Background & Motivation
Future radio astronomy surveys are expected to produce exabyte-scale data, demanding statistically robust ML models. Bayesian Neural Networks (BNNs) offer a principled approach to uncertainty modeling.
Prior benchmark work found that:
- HMC (Hamiltonian Monte Carlo) achieves the best overall performance in predictive accuracy, calibration, and out-of-distribution detection, but at prohibitive computational cost (7 days of training)
- BBB-VI (Bayes by Backprop) performs adequately but suffers from initialization sensitivity, slow convergence, and a cold posterior effect
Core Problem: Standard VI optimizes variational parameters via SGD, yet the VI parameter space is a statistical manifold (Riemannian manifold) where each point corresponds to a probability distribution. SGD assumes Euclidean geometry and may not identify the most efficient optimization direction in distribution space.
Natural Gradient Descent (NGD) preconditions gradients with the inverse Fisher Information Matrix (FIM), accounting for the geometry of the statistical manifold and providing a more direct path through distribution space.
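For reference, a generic form of the NGD update on the variational parameters \(\boldsymbol{\lambda}\) (our notation, not taken from the paper), where \(\mathcal{L}\) is the VI objective (negative ELBO) and \(\mathbf{F}\) is the Fisher information matrix of \(q_{\boldsymbol{\lambda}}\):

\[
\boldsymbol{\lambda}_{t+1} = \boldsymbol{\lambda}_t - \alpha\, \mathbf{F}(\boldsymbol{\lambda}_t)^{-1} \nabla_{\boldsymbol{\lambda}} \mathcal{L}(\boldsymbol{\lambda}_t),
\qquad
\mathbf{F}(\boldsymbol{\lambda}) = \mathbb{E}_{q_{\boldsymbol{\lambda}}}\!\left[\nabla_{\boldsymbol{\lambda}} \log q_{\boldsymbol{\lambda}}(\boldsymbol{\theta})\, \nabla_{\boldsymbol{\lambda}} \log q_{\boldsymbol{\lambda}}(\boldsymbol{\theta})^{\top}\right]
\]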
Method
Overall Architecture
The iVON (Improved Variational Online Newton) algorithm replaces SGD for optimizing variational parameters of the BNN. iVON is grounded in the Bayesian Learning Rule (BLR) framework, which unifies variational inference and natural gradient descent.
Key Designs
- Natural Gradient Updates under the BLR Framework:
  - Variational distribution: multivariate Gaussian \(q(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}|\mathbf{m}, \mathbf{S}^{-1})\), with mean \(\mathbf{m}\) and precision matrix \(\mathbf{S}\)
  - Natural gradient update on the natural parameters: \(\boldsymbol{\lambda}_{t+1} \leftarrow \boldsymbol{\lambda}_t - \alpha \nabla_{\boldsymbol{\mu}}\{\mathbb{E}_{q}[l(\boldsymbol{\theta})] - \mathcal{H}(q)\}\), where \(\boldsymbol{\mu}\) are the expectation parameters of \(q\) and \(\mathcal{H}(q)\) its entropy; under the BLR, the gradient with respect to \(\boldsymbol{\mu}\) is exactly the natural gradient with respect to \(\boldsymbol{\lambda}\)
  - Equivalent to a Newton-like update that requires both gradient and Hessian information
- Scalable Approximations in iVON (see the sketch after this list):
  - Diagonal Hessian approximation: reduces complexity from \(O(d^2)\) to \(O(d)\)
  - Reparameterization trick for second-order estimation: \(\hat{\mathbf{h}} = \hat{\mathbf{g}} \cdot (\boldsymbol{\theta} - \mathbf{m}) / \boldsymbol{\sigma}^2\), which approximates curvature by measuring the gradient response to random parameter perturbations
  - Geometric correction term: keeps the precision matrix positive definite throughout training
  - Mean and standard deviation updates: \(\mathbf{m} \leftarrow \mathbf{m} - \alpha \frac{\hat{\mathbf{g}} + \delta\mathbf{m}}{\mathbf{h} + \delta}\), \(\boldsymbol{\sigma} \leftarrow \frac{1}{\sqrt{\text{ess}\,(\mathbf{h} + \delta)}}\)
- Key Differences from BBB-VI:
  - BBB-VI computes separate Euclidean gradients for the mean and variance parameters, e.g. \(\mathbf{m} \leftarrow \mathbf{m} - \alpha \nabla_\mathbf{m} \mathcal{L}\)
  - iVON employs natural gradients, with the curvature estimate \(\mathbf{h} + \delta\) in the denominator providing adaptive, per-parameter step sizes
  - Like BBB-VI, iVON requires only a single MC sample per update
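A minimal NumPy sketch of a single iVON-style update, following the update rules listed above. This is our illustration of the algorithm's structure, not the authors' implementation; the momentum and geometric-correction details, and names such as `grad_fn`, `lr`, `delta`, `ess`, `beta1`, and `beta2`, follow the published IVON algorithm and are assumptions here.

```python
import numpy as np

def ivon_step(m, h, g_bar, grad_fn, lr=0.2, delta=1e-4, ess=5840, beta1=0.9, beta2=0.99999):
    """One iVON-style diagonal natural-gradient VI update (illustrative sketch).

    m, h, g_bar : per-parameter mean, Hessian estimate, and gradient momentum (1-D arrays)
    grad_fn     : callable returning the minibatch gradient of the negative log-likelihood
                  evaluated at a sampled weight vector
    """
    sigma = 1.0 / np.sqrt(ess * (h + delta))         # posterior std from the precision estimate
    theta = m + sigma * np.random.randn(m.size)      # single MC sample of the weights
    g = grad_fn(theta)                               # gradient at the sampled weights
    h_hat = g * (theta - m) / sigma**2               # reparameterization-based curvature estimate
    g_bar = beta1 * g_bar + (1.0 - beta1) * g        # momentum on the gradient
    h = (beta2 * h + (1.0 - beta2) * h_hat
         + 0.5 * (1.0 - beta2) ** 2 * (h - h_hat) ** 2 / (h + delta))  # geometric correction term
    m = m - lr * (g_bar + delta * m) / (h + delta)   # natural-gradient mean update
    return m, h, g_bar
```

The curvature estimate \(\mathbf{h} + \delta\) in the denominator of the mean update is what provides the adaptive, per-parameter step size noted above.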
Loss & Training
- Dataset: MiraBest Confident (FRI/FRII radio galaxy binary classification; 584 train / 145 validation / 104 test)
- Architecture: LeNet-like network
- Learning rate 0.2 with cosine annealing and 5-epoch warmup
- Hessian initialization \(h_0 = 0.5\) (selected from \(\{0.01, 0.1, 0.5, 1, 5\}\))
- Effective Sample Size (ESS): \(\text{ess} = 10N\) (with \(N\) the number of training examples) yields the best results, corresponding to a cold posterior
- Weight decay \(\delta = 10^{-4}\), trained for 1000 epochs across 10 random seeds
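For readability, the training setup above can be summarized as a short Python dictionary; the field names are our own, and the values are taken directly from the list above (assuming \(N\) refers to the 584 training images).

```python
# Hyperparameters for the iVON run, as reported above (field names are illustrative)
ivon_config = {
    "dataset": "MiraBest Confident",  # FRI/FRII binary classification
    "splits": (584, 145, 104),        # train / validation / test
    "architecture": "LeNet-like CNN",
    "lr": 0.2,                        # cosine annealing with 5-epoch warmup
    "h0": 0.5,                        # Hessian initialization, chosen from {0.01, 0.1, 0.5, 1, 5}
    "ess": 10 * 584,                  # effective sample size = 10N (cold posterior)
    "weight_decay": 1e-4,             # delta
    "epochs": 1000,
    "num_seeds": 10,
}
```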
Key Experimental Results
Main Results: Predictive Performance and Calibration (Table 1)
| Inference Method | Test Error (%) ↓ | UCE ↓ | Training Time |
|---|---|---|---|
| HMC | 4.16 ± 0.45 | 14.76 ± 0.95 | 7 days |
| BBB-VI | 3.94 ± 0.01 | 12.77 ± 6.11 | 40 min |
| iVON (ess=10N) | 3.07 ± 1.47 | 8.37 ± 4.12 | 25 min |
| iVON (ess=100N) | 3.36 ± 1.23 | 12.19 ± 6.57 | 25 min |
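UCE is the uncertainty calibration error. Assuming the standard binned definition (Laves et al. 2020), predictions are grouped into \(M\) bins by predictive uncertainty, and UCE measures the average gap between the error rate and the mean uncertainty in each bin \(B_m\), over \(n\) test samples:

\[
\mathrm{UCE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \,\bigl|\, \mathrm{err}(B_m) - \mathrm{uncert}(B_m) \,\bigr|
\]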
Out-of-Distribution Detection (Energy Score Analysis)
| Dataset | HMC | BBB-VI | iVON |
|---|---|---|---|
| MiraBest (iD) | Low energy ✓ | Low energy ✓ | Low energy ✓ |
| GalaxyMNIST (far OoD, optical) | High energy ✓ | High energy ✓ | High energy ✓ |
| MIGHTEE (near OoD, different radio telescope) | High energy ✓ | Moderate separation ✓ | Unreliable detection ✗ |
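Assuming the standard energy-based OoD score (Liu et al. 2020), the energy of an input \(\mathbf{x}\) is the negative log-sum-exp of the logits \(f_k(\mathbf{x})\) at temperature \(T\); in-distribution inputs should receive low energy and OoD inputs high energy:

\[
E(\mathbf{x}) = -T \log \sum_{k=1}^{K} \exp\!\bigl(f_k(\mathbf{x})/T\bigr)
\]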
Key Findings
- iVON achieves the best UCE of all methods (8.37, vs. 12.77 for BBB-VI and 14.76 for HMC), demonstrating substantially improved uncertainty calibration
- Predictive performance is competitive with the best methods (3.07% test error), while training is 37.5% faster than BBB-VI and over 400× faster than HMC
- Cold posterior effect persists: ess=10N (cold posterior) outperforms ess=N (standard posterior), indicating that even improved optimizers do not eliminate this effect
- Near-OoD detection degrades: iVON fails to reliably detect MIGHTEE data (radio galaxies from telescopes with different resolution/sensitivity), though far-OoD detection (optical galaxies) remains effective
- Different optimizers converge to qualitatively distinct solutions (for iVON, better calibration but weaker near-OoD detection), highlighting that the inductive bias introduced by the optimizer is non-negligible
Highlights & Insights
- Optimizer as inductive bias: The paper reveals an important observation — the choice of optimizer not only affects convergence speed but also determines the type of representation learned (distributed/redundant vs. compressed/local), with downstream consequences for different evaluation criteria
- Natural gradient descent exploits the Riemannian geometry of parameter space, providing a more "natural" optimization direction for VI
- Though small in scale, the experiments are analytically thorough, with each finding grounded in concrete implications for physical applications
Limitations & Future Work
- Validation is limited to a small LeNet architecture and a small dataset; confirmation on larger architectures and datasets is needed
- Degraded near-OoD detection is a significant limitation, restricting applicability in scenarios requiring detection of distributional shift across different telescopes
- The cold posterior effect remains unresolved, pointing to deeper issues of model misspecification
- The ESS hyperparameter requires domain knowledge to set appropriately, with no automated selection strategy proposed
- The diagonal Hessian approximation may discard important inter-parameter correlations
Related Work & Insights
- iVON (Lin et al.) and BLR (Khan & Rue) provide an elegant theoretical framework unifying diverse learning algorithms under the Bayesian Learning Rule
- Noisy Natural Gradient (Zhang et al.) represents an earlier attempt at natural gradient VI
- The findings offer direct guidance for researchers applying BNNs in astronomy and other scientific domains: optimizer choice must be considered in light of its impact across multiple evaluation dimensions
- This work motivates future research into formally characterizing the inductive biases of different optimizers and their effects on posterior approximation quality
Rating
- Novelty: ⭐⭐⭐ — Applies an existing optimizer to a new domain; theoretical contribution is limited, but the observations are valuable
- Experimental Thoroughness: ⭐⭐⭐ — Small-scale but carefully analyzed, with faithful reproduction of prior benchmark results
- Writing Quality: ⭐⭐⭐⭐ — Background is clearly explained and mathematical derivations are complete
- Value: ⭐⭐⭐ — Directly useful to the radio astronomy BNN community; insights are broadly inspiring for the wider BNN community