Accelerated Multiple Wasserstein Gradient Flows for Multi-objective Distributional Optimization

Conference: ICML 2026
arXiv: 2601.19220
Code: No public code / unconfirmed
Area: Optimization
Keywords: Wasserstein gradient flow, multi-objective optimization, distributional optimization, Nesterov acceleration, particle sampling

TL;DR

This paper lifts Multiple Wasserstein Gradient Descent (MWGraD) to a continuous-time gradient flow and adds Nesterov-style momentum to obtain A-MWGraD. Theoretically, the convergence rate toward weak Pareto optimality improves from \(O(1/t)\) to \(O(1/t^2)\) in the geodesically convex case. Empirically, A-MWGraD converges faster in multi-target sampling and Bayesian multi-task learning.

Background & Motivation

Background: Multi-objective distributional optimization simultaneously optimizes several objective functionals over the space of probability distributions. A typical application is multi-target sampling, where particles should approach several target distributions at once and each objective is the KL divergence from the current distribution to one target distribution.

Limitations of Prior Work: MWGraD uses Wasserstein geometry to combine the Wasserstein gradients of multiple objectives into a common descent direction. However, like standard gradient descent (GD), it converges slowly. Nesterov acceleration provably outperforms GD in Euclidean space, but a systematic acceleration theory for the multi-objective counterpart in probability space is still missing.

Key Challenge: Probability distribution spaces are not simple vector spaces, and multiple objectives add complexity. Ensuring that the combined direction maintains multi-objective descent and Pareto stationarity while incorporating momentum for acceleration is more difficult than single-objective Euclidean Nesterov Accelerated Gradient (NAG).

Goal: The authors aim to construct a continuous-time flow for MWGraD and incorporate damped Hamiltonian / Nesterov-style momentum within the Wasserstein space to develop A-MWGraD with provably faster convergence.

Key Insight: The paper interprets discrete MWGraD as an Euler discretization of a specific Wasserstein flow. By drawing on accelerated information gradient flows, it introduces a momentum potential function \(\Phi_t\) into multi-objective distributional optimization.

Core Idea: The approach utilizes "projection onto the convex hull of the first variations of multiple objectives" to preserve a common descent direction, and then applies momentum to this specific Wasserstein flow.

Method

The method is structured into a theoretical flow layer and a particle implementation layer. The theoretical layer defines MWGraD and A-MWGraD flows in the space of probability distributions and analyzes convergence rates using a merit function. The implementation layer converts distribution flows into particle dynamics, using SVGD or Blob kernels to approximate Wasserstein gradients.

Overall Architecture

Given multiple objective functionals \(F_1,\dots,F_K\), the goal is to find a weakly Pareto optimal distribution in \(\mathcal{P}_2(\mathcal{X})\). MWGraD considers the convex hull \(\mathcal{C}(\rho) = \text{conv}\{\delta_\rho F_k(\rho)\}_{k=1}^K\) of the first variations of all objectives at a distribution \(\rho\). Projecting \(0\) onto this hull, i.e. taking its minimum-norm element, yields a single compromise potential whose gradient serves as a common descent direction for the objectives.
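On a finite set of particles, projecting \(0\) onto the hull reduces to a small quadratic program over simplex weights. The following is a minimal sketch of that reduction, assuming the norm is the empirical \(L^2(\rho)\) norm over particles and that per-objective gradients at the particles are already available; the function names and the projected-gradient solver are illustrative choices, not the paper's implementation.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]                      # entries in descending order
    shifted = np.cumsum(u) - 1.0
    ranks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - shifted / ranks > 0)[0][-1]
    theta = shifted[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def min_norm_weights(grads, n_iter=500):
    """grads has shape (K, N, d): K objectives, N particles, d dimensions.

    Returns simplex weights w minimising the particle average of
    ||sum_k w_k * grads[k, i]||^2, using projected gradient descent."""
    num_objectives, num_particles = grads.shape[0], grads.shape[1]
    flat = grads.reshape(num_objectives, -1)
    gram = flat @ flat.T / num_particles      # empirical L2(rho) Gram matrix
    step = 1.0 / (2.0 * np.linalg.eigvalsh(gram)[-1] + 1e-12)
    w = np.full(num_objectives, 1.0 / num_objectives)
    for _ in range(n_iter):
        # Gradient of w^T G w is 2 G w; step then project back to the simplex.
        w = project_to_simplex(w - step * 2.0 * gram @ w)
    return w

# Toy case: one particle in 2-D, two objectives with orthogonal gradients.
grads = np.array([[[2.0, 0.0]], [[0.0, 1.0]]])
weights = min_norm_weights(grads)
common_direction = np.tensordot(weights, grads, axes=1)   # shape (N, d)
```

In the toy case the combined vector has a positive inner product with both individual gradients, so stepping against it decreases both objectives to first order.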

The continuous-time MWGraD flow is described by the continuity equation \(\dot{\rho}_t+\nabla\cdot(\rho_t\nabla\Phi_t)=0\), in which the velocity field is the gradient \(\nabla\Phi_t\) of the potential. A-MWGraD adds \(\dot{\Phi}_t\), a damping term \(\alpha_t\Phi_t\), and a kinetic term to the potential equation, forming second-order dynamics similar to the Wasserstein accelerated information gradient (W-AIG) flow.

Key Designs

1. MWGraD flow and merit function: Upgrading discrete algorithms to continuous flows with a unified metric

Taking the limit \(\eta \to 0\) of discrete MWGraD, the authors derive the MWGraD flow \(\dot{\rho}_t+\nabla\cdot(\rho_t\nabla\Phi_t)=0\), where the potential is determined by \(\Phi_t+\text{proj}_{\mathcal{C}(\rho_t),\rho_t}[0]=0\). The projection keeps the velocity field a common descent direction for all objectives. To quantify convergence, the authors introduce the merit function \(\mathcal{M}(\rho)=\sup_q \min_k\{F_k(\rho)-F_k(q)\}\), which is non-negative and equals zero if and only if \(\rho\) is weakly Pareto optimal, so it provides a single scale for measuring Pareto optimality. Under geodesic convexity, using \(\tfrac12 \mathcal{W}_2^2(\rho_t,q)\) as a Lyapunov functional yields \(\mathcal{M}(\rho_t) \le R/(2t)=O(1/t)\) for a constant \(R\) independent of \(t\), establishing the first rigorous continuous-time convergence rate for MWGraD.
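Two of the properties stated for the merit function follow directly from its definition and can be checked in one line each (a worked sanity check, not reproduced from the paper). Choosing \(q=\rho\) inside the supremum gives non-negativity, and for a single objective the merit function reduces to the usual optimality gap:

\[
\mathcal{M}(\rho) \;\ge\; \min_k\{F_k(\rho)-F_k(\rho)\} = 0,
\qquad
K=1:\quad \mathcal{M}(\rho) = \sup_q\{F_1(\rho)-F_1(q)\} = F_1(\rho)-\inf_q F_1(q).
\]

Likewise, \(\mathcal{M}(\rho)>0\) holds exactly when some \(q\) satisfies \(F_k(q)<F_k(\rho)\) for every \(k\), that is, when \(\rho\) is not weakly Pareto optimal.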

2. A-MWGraD accelerated flow: Injecting momentum along the common descent direction to achieve \(O(1/t^2)\)

The \(O(1/t)\) rate of the MWGraD flow is no faster than that of a standard gradient flow. To close this gap in probability space, the authors take a damped Hamiltonian perspective and add momentum to the evolution of the potential, which now satisfies \(\dot{\Phi}_t+\alpha_t\Phi_t+\frac{1}{2}\|\nabla\Phi_t\|^2+\text{proj}_{\mathcal{C}(\rho_t),\rho_t}[0]=0\). Here, \(\alpha_t\Phi_t\) is the damping term and \(\frac{1}{2}\|\nabla\Phi_t\|^2\) is the kinetic term. When \(K=1\), this recovers the Wasserstein accelerated information gradient (W-AIG) flow. Crucially, momentum accumulates along the projected common descent direction rather than per objective, so acceleration does not break multi-objective descent. The convergence rate improves to \(O(1/t^2)\) in the geodesically convex case and to \(O(e^{-\sqrt{\beta}t})\) in the \(\beta\)-strongly geodesically convex case.
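The gap between the two rates can be illustrated numerically in a much simpler setting. The sketch below is a Euclidean, single-objective analog only (a diagonal convex quadratic, no Wasserstein geometry and no projection step); the momentum schedule \((n-1)/(n+2)\), the step size, and the velocity-first update order are assumptions made for this illustration.

```python
import numpy as np

# Ill-conditioned convex quadratic f(x) = 0.5 * sum_i c_i * x_i^2.
curvatures = np.logspace(-3, 0, 20)

def objective(x):
    return 0.5 * np.sum(curvatures * x**2)

eta, n_steps = 1.0, 300                       # eta = 1 / (largest curvature)
x_gd = np.ones(20)
x_acc, v = np.ones(20), np.zeros(20)
for n in range(1, n_steps + 1):
    # Plain gradient descent step.
    x_gd = x_gd - eta * curvatures * x_gd
    # Momentum step: damp the velocity, add the gradient, then move.
    alpha = (n - 1) / (n + 2)
    v = alpha * v - np.sqrt(eta) * curvatures * x_acc
    x_acc = x_acc + np.sqrt(eta) * v

gap_gd, gap_acc = objective(x_gd), objective(x_acc)
```

Since the minimum value is zero, `gap_gd` and `gap_acc` are the optimality gaps after the same number of gradient evaluations; on this problem the momentum iterate ends with the smaller gap.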

3. Particle implementation and gradient approximation: Translating distribution PDEs to runnable particle systems

To make the distribution-level PDEs implementable, the authors derive particle dynamics in which positions and velocities satisfy \(\dot{x}_t=v_t\) and \(\dot{v}_t+\alpha_t v_t+\sum_k w_{t,k}\nabla\delta_\rho F_k(\rho_t)(x_t)=0\). After discretization with step size \(\eta\), the updates are \(x_i^{n+1}=x_i^n+\sqrt{\eta}v_i^n\) and \(v_i^{n+1}=\alpha_n v_i^n-\sqrt{\eta}\sum_k w_{n,k}\bar{\Delta}_k^n(x_i^n)\), where \(\bar{\Delta}_k^n\) denotes the approximated Wasserstein gradient of \(F_k\) at iteration \(n\) and the weights \(w_n\) solve a quadratic program on the simplex. In the geodesically convex case, the momentum coefficient is \(\alpha_n=(n-1)/(n+2)\). Because the Wasserstein gradient contains \(\nabla\log\rho\), which cannot be evaluated directly for an empirical measure, SVGD or Blob kernels are used to approximate it. The multi-objective structure is thus preserved while practical multi-target sampling becomes possible.
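The particle scheme can be sketched end to end for a toy problem. The code below is an illustrative sketch under several assumptions that are not from the source: two unit-covariance Gaussian targets (so each score is `mean - x`), an RBF kernel with bandwidth 1 in the standard SVGD form, a closed-form weight solve valid only for two objectives, and hypothetical function names. One deliberate deviation: the velocity is updated first and the particles then move with the updated velocity (a semi-implicit ordering chosen here for numerical stability), whereas the update quoted above uses \(v_i^n\) in the position step.

```python
import numpy as np

def svgd_gradient(x, score, bandwidth=1.0):
    """Kernel estimate of the Wasserstein gradient of KL(rho || pi) at particles.

    x: (N, d) particles; score: (N, d) values of grad log pi at the particles."""
    diff = x[:, None, :] - x[None, :, :]                  # diff[j, i] = x_j - x_i
    kern = np.exp(-np.sum(diff**2, axis=-1) / (2.0 * bandwidth**2))
    grad_kern = -diff / bandwidth**2 * kern[:, :, None]   # d/dx_j of k(x_j, x_i)
    # SVGD direction: phi(x_i) = mean_j [k(x_j, x_i) score_j + grad_kern[j, i]]
    phi = (kern.T @ score + grad_kern.sum(axis=0)) / x.shape[0]
    return -phi                                           # phi is a descent direction

def two_objective_weights(g1, g2):
    """Closed-form minimiser of ||w g1 + (1 - w) g2||^2 over w in [0, 1]."""
    denom = np.sum((g1 - g2) ** 2)
    if denom < 1e-12:
        return np.array([0.5, 0.5])
    w1 = np.clip(np.sum((g2 - g1) * g2) / denom, 0.0, 1.0)
    return np.array([w1, 1.0 - w1])

def a_mwgrad_svgd(x0, target_means, eta=0.01, n_steps=300):
    x, v = x0.copy(), np.zeros_like(x0)
    grad_norms = []
    for n in range(1, n_steps + 1):
        # Score of a unit-covariance Gaussian target N(mean, I) is (mean - x).
        g1, g2 = (svgd_gradient(x, mean - x) for mean in target_means)
        w = two_objective_weights(g1, g2)
        direction = w[0] * g1 + w[1] * g2                 # combined gradient estimate
        grad_norms.append(np.sqrt(np.mean(np.sum(direction**2, axis=1))))
        alpha = (n - 1) / (n + 2)                         # momentum coefficient
        v = alpha * v - np.sqrt(eta) * direction          # velocity first (assumption)
        x = x + np.sqrt(eta) * v                          # then move the particles
    return x, np.array(grad_norms)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(50, 2)) + np.array([4.0, 3.0])      # start far from both targets
targets = np.array([[-1.0, 0.0], [1.0, 0.0]])
x_final, grad_norms = a_mwgrad_svgd(x0, targets)
```

The recorded `grad_norms` play the role of a GradNorm-style diagnostic: they track the size of the combined direction, which shrinks as the particle cloud approaches a Pareto-stationary configuration. For more than two objectives the closed-form weight solve would need to be replaced by a general simplex QP solver.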

Key Experimental Results

Main Results

In Bayesian multi-task learning experiments on Multi-Fashion+MNIST, Multi-MNIST, and Multi-Fashion, the paper compares MOO-SVGD, MWGraD, and the accelerated A-MWGraD variants. The table reports ensemble accuracy (%) after 40,000 iterations; #1 and #2 denote the two tasks of each dataset.

Dataset	Task	MOO-SVGD	MWGraD-SVGD	MWGraD-Blob	A-MWGraD-SVGD	A-MWGraD-Blob
Multi-Fashion+MNIST	#1	94.8±0.4	94.7±0.3	94.1±0.5	96.4±0.4	96.1±0.5
Multi-Fashion+MNIST	#2	85.6±0.2	88.9±0.6	90.5±0.4	90.3±0.3	90.7±0.4
Multi-MNIST	#1	93.1±0.3	95.3±0.7	94.9±0.2	95.3±0.5	95.6±0.4
Multi-MNIST	#2	91.2±0.2	92.9±0.5	93.6±0.5	93.4±0.4	94.2±0.4
Multi-Fashion	#1	83.8±0.8	85.9±0.6	85.8±0.3	85.1±0.4	86.3±0.5
Multi-Fashion	#2	83.1±0.3	85.6±0.5	86.3±0.5	87.4±0.6	86.5±0.7

Ablation Study

Configuration	Key Metric	Description
Theory: MWGraD flow	\(\mathcal{M}(\rho_t)=O(1/t)\)	Base convergence rate under geodesic convexity
Theory: A-MWGraD flow	\(\mathcal{M}(\rho_t)=O(1/t^2)\)	Nesterov-style acceleration in convex scenarios
Theory: strongly convex	\(\mathcal{M}(\rho_t)=O(e^{-\sqrt{\beta}t})\)	Exponential convergence under strong geodesic convexity
Toy mixture sampling	A-MWGraD-SVGD/Blob	Faster reduction of GradNorm vs non-accelerated versions
Kernel bandwidth	\(\sigma=1\) or 10 stable	Bandwidths too narrow (0.1) or wide (100) degrade gradient approximation
Particle count	5 particles optimal	Performance drops with 2 particles; gains plateau beyond 5
Objective count overhead	QP time at \(K=20\)	Weight solving accounts for nearly 79% of compute with many objectives

Key Findings

A-MWGraD converges faster than the non-accelerated versions; in toy sampling, particles concentrate earlier in the shared high-density regions of the multiple targets.
Both Blob and SVGD approximations benefit from acceleration, indicating the method is compatible with different kernel gradient estimators.
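An accelerated variant attains the highest accuracy in all six dataset-task rows, although neither variant is best everywhere: for example, A-MWGraD-SVGD (85.1) trails MWGraD-SVGD (85.9) on Multi-Fashion task #1, and several margins fall within the reported ± ranges.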
The computational cost of solving for weights \(w\) in the simplex increases rapidly with the number of objectives, posing a bottleneck for large-scale multi-objective scenarios.

Highlights & Insights

Viewing discrete MWGraD as a continuous flow allows for the use of Lyapunov and Hamiltonian tools to analyze acceleration.
The choice of merit function resolves the ambiguity of "where to converge" in multi-objective distributional optimization and is a more appropriate metric than the individual objective values.
The particle implementation of A-MWGraD preserves the joint weight optimization step, meaning momentum is applied to the collective descent direction rather than independently.

Limitations & Future Work

Convergence rates are primarily established for continuous time; a rigorous discrete-time convergence rate for A-MWGraD is still needed.
The theoretical analysis assumes exact Wasserstein gradients, but practical implementations use kernel approximations, and the effect of the approximation error on the accelerated rates requires further study.
The quadratic programming cost for weights \(w\) may limit scalability as the number of objectives increases significantly.
Future work could extend A-MWGraD from sampling and multi-task learning to generative model alignment, multi-objective RL, or distributionally robust optimization.

Related Work & Insights

vs MWGraD: While MWGraD identifies multi-objective Wasserstein descent directions, A-MWGraD provides a continuous flow interpretation and a theoretically faster accelerated version.
vs MOO-SVGD / MT-SGD: A-MWGraD incorporates momentum in the probability space while maintaining particle diversity.
vs Nesterov acceleration: This work migrates the damped Hamiltonian interpretation of NAG from Euclidean single-target optimization to multi-objective Wasserstein spaces.
Insight: Many distribution-level optimization problems benefit from deriving a continuous-time flow before particle discretization, providing better interpretability than heuristic momentum.

Rating

Novelty: ⭐⭐⭐⭐⭐ The acceleration theory for multi-objective Wasserstein gradient flows is a substantial contribution to optimization.
Experimental Thoroughness: ⭐⭐⭐⭐☆ Validated with toy and multi-task datasets, though the application scope could be broader.
Writing Quality: ⭐⭐⭐⭐☆ Clear theoretical structure, albeit notationally dense for readers without an optimal transport background.
Value: ⭐⭐⭐⭐☆ Highly valuable for multi-objective sampling; practical utility depends on gradient approximation and weight solving efficiency.