Federated ADMM from Bayesian Duality¶

Conference: ICLR 2026 arXiv: 2506.13150 Code: Available Area: Others Keywords: ADMM, Variational Bayes, Natural Gradient, Federated Learning, Bayesian Duality

TL;DR¶

This paper derives a Bayesian dual structure for ADMM from a variational Bayes (VB) perspective, proving that classical ADMM is a special case of VB over isotropic Gaussian families. Two novel extensions are introduced: a Newton-like variant (one-round convergence on quadratic objectives) and an Adam-like variant (IVON-ADMM, achieving +7% accuracy in heterogeneous deep learning settings).

Background & Motivation¶

State of the Field¶

Background: ADMM, introduced in the 1970s, has become a core algorithmic backbone for federated learning, and its form has remained largely unchanged since. That stability naturally raises the question of whether a more general formulation exists.

Limitations of Prior Work: Accelerated variants of ADMM (over-relaxation, momentum, scaled norms, etc.) introduce additional variables without altering the algorithmic form. Swaroop et al. observed line-by-line similarities between VB and ADMM but could not establish an exact correspondence.

Key Challenge: The deterministic optimization framework underlying ADMM does not extend naturally to heterogeneous deep learning scenarios. A more general framework is needed to unify and generalize ADMM.

Key Insight: Solutions to the VB objective exhibit a dual structure that not only resembles the fixed-point structure of ADMM but also naturally generalizes it. The missing link is the natural gradient.

Core Idea: The duality between natural parameters and expectation parameters in exponential family distributions establishes a "Bayesian duality" structure, of which ADMM is a special case under isotropic Gaussians.
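The duality referred to here is a standard property of minimal exponential families, stated below as a reminder rather than as the paper's own derivation. The symbols \(T\), \(A\), \(h\), \(F\), \(m\), and \(\sigma^2\) are notation assumed for this illustration:

\[
q_\lambda(\theta) = h(\theta)\exp\big(\langle \lambda, T(\theta)\rangle - A(\lambda)\big),
\qquad
\mu = \mathbb{E}_{q_\lambda}[T(\theta)] = \nabla_\lambda A(\lambda),
\qquad
F(\lambda)^{-1}\nabla_\lambda \mathcal{L} = \nabla_\mu \mathcal{L}.
\]

For an isotropic Gaussian \(\mathcal{N}(m, \sigma^2 I)\) with fixed \(\sigma^2\), we have \(T(\theta) = \theta\), \(\lambda = m/\sigma^2\), and \(\mu = m\), so \(F = \sigma^2 I\) and the natural gradient \(F^{-1}\nabla_\lambda \mathcal{L} = \nabla_m \mathcal{L}\) is simply the ordinary gradient with respect to the mean.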

Method¶

Overall Architecture¶

Classical ADMM operates with a primal-dual structure over \((\theta_g^*, \theta_k^*, \mathbf{v}_k^*, \mathbf{v}_g^*)\); Bayesian ADMM adopts an expectation-natural parameter dual structure over \((\mu_g^*, \mu_k^*, \eta_k^*, \lambda_g^*)\). The core distinction is: gradient \(\to\) natural gradient, parameters \(\to\) distributions.

Key Designs¶

Bayesian Dual Structure:
- VB fixed-point condition: \(\lambda_g^* = -\sum_{k=0}^{K} \nabla \mathcal{L}_k(\mu_g^*)\)
- Introducing local distributions \(q_k^*\) and dual variables \(\eta_k^*\) yields a four-condition structure analogous to ADMM
- When \(q\) is chosen as an isotropic Gaussian, the natural gradient reduces to the ordinary gradient, recovering classical ADMM
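As a point of reference for the isotropic case, here is a minimal NumPy sketch of textbook consensus ADMM on quadratic client losses \(f_k(x) = \tfrac{1}{2} x^\top A_k x - b_k^\top x\). It is not the paper's code: the function name `consensus_admm`, the penalty `rho`, the round count, and the random test problem are all assumptions made for illustration.

```python
import numpy as np

def consensus_admm(A_list, b_list, rho=5.0, rounds=300):
    """Textbook consensus ADMM for sum_k 0.5 * x^T A_k x - b_k^T x."""
    dim = b_list[0].shape[0]
    z = np.zeros(dim)                        # server (global) variable
    v = [np.zeros(dim) for _ in A_list]      # one dual variable per client
    eye = np.eye(dim)
    for _ in range(rounds):
        # Client step: minimize f_k(x) + v_k^T (x - z) + (rho / 2) * ||x - z||^2.
        x = [np.linalg.solve(A + rho * eye, b - vk + rho * z)
             for A, b, vk in zip(A_list, b_list, v)]
        # Server step: average the dual-corrected client solutions.
        z = np.mean([xk + vk / rho for xk, vk in zip(x, v)], axis=0)
        # Dual step: gradient ascent on the consensus constraint x_k = z.
        v = [vk + rho * (xk - z) for xk, vk in zip(x, v)]
    return z

# Hypothetical test problem: three clients with random positive-definite A_k.
rng = np.random.default_rng(0)
dim, num_clients = 3, 3
A_list, b_list = [], []
for _ in range(num_clients):
    M = rng.normal(size=(dim, dim))
    A_list.append(M @ M.T + dim * np.eye(dim))
    b_list.append(rng.normal(size=dim))

x_star = np.linalg.solve(sum(A_list), sum(b_list))   # exact global minimizer
z_admm = consensus_admm(A_list, b_list)
print(np.max(np.abs(z_admm - x_star)))               # small after many rounds
```

Even on this quadratic problem the iterate only approaches the minimizer over many rounds, which is the behaviour the Newton-like extension is meant to improve on.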
Newton-like Extension (Full-Covariance Gaussian):
- Function: \(q\) is parameterized as a full-covariance Gaussian distribution
- Mechanism: The natural gradient incorporates the inverse Fisher information matrix, which is equivalent to Newton's method on quadratic objectives, enabling one-round communication convergence
- Design Motivation: Classical ADMM requires multiple rounds of iteration even on quadratic objectives
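To see why one round can suffice on quadratics, note that for \(f_k(x) = \tfrac{1}{2} x^\top A_k x - b_k^\top x\) the function \(\exp(-f_k)\) is an unnormalized Gaussian with precision \(A_k\) and precision-times-mean \(b_k\). Summing these quantities across clients gives the exact global minimizer, which coincides with a single Newton step from any starting point. The sketch below illustrates this intuition only; it is not the paper's algorithm, and the test problem and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, num_clients = 4, 3

# Hypothetical quadratic client losses f_k(x) = 0.5 * x^T A_k x - b_k^T x.
A_list, b_list = [], []
for _ in range(num_clients):
    M = rng.normal(size=(dim, dim))
    A_list.append(M @ M.T + dim * np.eye(dim))
    b_list.append(rng.normal(size=dim))

# One communication round: every client reports (A_k, b_k), the server adds them.
precision = sum(A_list)
linear = sum(b_list)
mean_one_round = np.linalg.solve(precision, linear)

# The gradient of the global objective vanishes at the aggregated mean.
global_grad = sum(A @ mean_one_round - b for A, b in zip(A_list, b_list))

# A single Newton step from an arbitrary point reaches the same solution.
x0 = rng.normal(size=dim)
newton_step = x0 - np.linalg.solve(precision, precision @ x0 - linear)
```

The cost of this exactness is that each client communicates a full `dim x dim` matrix, which motivates the diagonal variant for deep networks.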
Adam-like Extension (Diagonal Gaussian, IVON-ADMM):
- Function: \(q\) is parameterized as a diagonal-covariance Gaussian, implemented efficiently via the IVON method
- Mechanism: Diagonal Fisher approximation produces adaptive learning rates analogous to Adam
- Design Motivation: Full Fisher information is computationally prohibitive; diagonal approximation is more practical for deep learning
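The effect of per-coordinate curvature scaling can be illustrated on a toy, badly scaled quadratic. The sketch below is not IVON and not the paper's update rule: it uses the exact diagonal Hessian where a practical method would estimate it from stochastic gradients, and the helper `run`, the step sizes, and the damping constant are assumptions chosen for the demonstration.

```python
import numpy as np

curvature = np.array([100.0, 1.0, 0.01])   # diagonal Hessian of a toy quadratic

def loss(x):
    return 0.5 * np.sum(curvature * x ** 2)

def grad(x):
    return curvature * x

def run(steps, lr, scale):
    """Iterate x <- x - lr * grad(x) / scale from the all-ones start."""
    x = np.ones(3)
    for _ in range(steps):
        x = x - lr * grad(x) / scale
    return x

# Plain gradient descent: the step size is limited by the largest curvature.
x_plain = run(steps=50, lr=0.01, scale=np.ones(3))

# Diagonal scaling: each coordinate is divided by its own (damped) curvature.
x_diag = run(steps=50, lr=0.5, scale=curvature + 1e-8)

print(loss(x_plain), loss(x_diag))
```

With plain gradient descent the low-curvature coordinates barely move in 50 steps, whereas the scaled update shrinks every coordinate at the same rate.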

Loss & Training¶

Client: minimizes local loss plus KL regularization (Bayesian formulation)
Server: aggregates natural gradient parameters rather than raw gradients
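In the isotropic special case, the link between the KL regularizer and ADMM's quadratic penalty can be checked directly. For two Gaussians sharing a fixed variance \(\sigma^2\) (with \(m\), \(m_0\), \(\sigma^2\), and \(\rho\) being notation assumed for this illustration),

\[
\mathrm{KL}\big(\mathcal{N}(m, \sigma^2 I)\,\|\,\mathcal{N}(m_0, \sigma^2 I)\big) = \frac{1}{2\sigma^2}\,\|m - m_0\|^2 = \frac{\rho}{2}\,\|m - m_0\|^2
\quad\text{with } \rho = \frac{1}{\sigma^2}.
\]

This is the familiar proximal term of ADMM. The paper's full client objective may contain further terms (for example, those involving the dual variables), which this identity does not cover.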

Key Experimental Results¶

Main Results¶

Deep heterogeneous federated learning:

Method	Accuracy	Runtime	Notes
FedADMM	Baseline	Baseline	Classical ADMM
FedAvg	Baseline-level	Baseline-level	Standard federated
IVON-ADMM	+7%	Comparable	Adam-like extension

Theoretical Validation (Quadratic Objectives)¶

Method	Rounds to Converge	Notes
Classical ADMM	Multiple rounds	Linear convergence
Newton-like ADMM	1 round	One-step convergence

Key Findings¶

- IVON-ADMM achieves +7% accuracy in deep heterogeneous settings (non-IID data) without additional communication or computational overhead
- The Newton-like variant achieves one-round convergence on quadratic objectives, confirming the theoretical prediction
- The natural gradient is the key link connecting VB and ADMM, precisely the missing element identified in Swaroop et al.

Highlights & Insights¶

- Mathematical Elegance: Classical ADMM turns out to be a special case of a Bayesian method under the simplest distribution family. This connection is aesthetically appealing and also opens a new avenue for generalizing optimization algorithms via families of probability distributions.
- The Central Role of Natural Gradients: Prior work using ordinary gradients failed to establish an exact correspondence; substituting natural gradients resolves this, underscoring the role of information geometry in algorithm design.
- A Free Lunch: IVON-ADMM leverages IVON's efficient diagonal Fisher implementation, keeping runtime comparable to the baselines while substantially improving performance in heterogeneous settings.

Limitations & Future Work¶

- Deep learning experiments are conducted at a relatively small scale (7-layer CNN); performance on larger models (e.g., LLMs) remains unknown
- The diagonal Fisher approximation may be insufficiently accurate for certain model architectures
- The framework requires choosing an exponential-family distribution up front; guidelines for making that choice are not clearly specified
- The communication efficiency analysis is relatively simple and could be developed further

- vs. FedADMM: Classical ADMM is recovered as a special case; Bayesian ADMM provides a rigorous generalization
- vs. FedAvg: IVON-ADMM demonstrates significant advantages in heterogeneous settings
- vs. PVI (Swaroop 2025): This work resolves the missing exact correspondence between PVI and ADMM

Rating¶

- Novelty: ⭐⭐⭐⭐⭐ The Bayesian dual structure is original and elegant, unifying two major paradigms
- Experimental Thoroughness: ⭐⭐⭐ Theoretical validation is rigorous, but deep learning experiments are limited in scale
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are clear and the duality diagrams are intuitive
- Value: ⭐⭐⭐⭐ Provides a new theoretical foundation and practical extensions for federated optimization