Skip to content

Accelerated Multiple Wasserstein Gradient Flows for Multi-objective Distributional Optimization

Conference: ICML 2026
arXiv: 2601.19220
Code: No public code / unconfirmed
Area: Optimization
Keywords: Wasserstein gradient flow, multi-objective optimization, distributional optimization, Nesterov acceleration, particle sampling

TL;DR

This paper generalizes Multiple Wasserstein Gradient Descent (MWGraD) into continuous-time gradient flows and introduces Nesterov-style momentum acceleration to develop A-MWGraD. Theoretically, it improves the convergence rate in geodesically convex scenarios from \(O(1/t)\) to \(O(1/t^2)\), while empirically accelerating convergence in multi-target sampling and Bayesian multi-task learning.

Background & Motivation

Background: Multi-objective Distributional Optimization studies the simultaneous optimization of multiple objective functionals in the space of probability distributions. A typical example is multi-target sampling, where a set of particles aims to approximate multiple target distributions simultaneously, with each objective formulated as the KL divergence from the current distribution to a target distribution.

Limitations of Prior Work: Existing MWGraD can utilize the geometry of Wasserstein space to combine multiple Wasserstein gradients into a common descent direction. However, similar to standard gradient descent, its convergence speed remains limited. In Euclidean optimization, Nesterov acceleration is proven to significantly outperform standard GD; a systematic acceleration theory for the multi-objective version in probability spaces is still missing.

Key Challenge: The space of probability distributions is not a simple vector space, and the presence of multiple objectives adds complexity. It is necessary to ensure that the combined direction does not violate multi-objective descent or Pareto stationarity while incorporating momentum for acceleration, which is more complex than single-objective Euclidean NAG.

Goal: The authors aim to construct a continuous-time flow for MWGraD and incorporate damped Hamiltonian / Nesterov-style momentum in Wasserstein space to obtain A-MWGraD with provably faster convergence.

Key Insight: The paper first interprets discrete MWGraD as an Euler discretization of a specific Wasserstein flow. Then, drawing inspiration from accelerated information gradient flow, it introduces a momentum potential function \(\Phi_t\) into multi-objective distributional optimization.

Core Idea: The common descent direction for multiple objectives is preserved by "projecting 0 onto the convex hull of the first variations of multiple objectives," followed by adding momentum to this Wasserstein flow.

Method

The proposed method consists of two layers: theoretical flows and particle implementation. The theoretical layer defines MWGraD and A-MWGraD flows in the space of probability distributions and analyzes the convergence rate to a weak Pareto optimum using a merit function. The implementation layer transforms the distributional flows into particle dynamics, approximating the Wasserstein gradient using SVGD or Blob kernels.

Overall Architecture

Given multiple functionals \(F_1, \dots, F_K\), the goal is to find a weakly Pareto optimal distribution in \(\mathcal{P}_2(\mathcal{X})\). MWGraD considers the convex hull \(\mathcal{C}(\rho)\) of all objective first variations at each distribution \(\rho\), and obtains a compromise descent potential by projecting 0 onto this hull.

The continuous-time MWGraD flow is described by the continuity equation \(\dot{\rho}_t + \nabla \cdot (\rho_t \nabla \Phi_t) = 0\), which dictates how particles are pushed by the velocity field. A-MWGraD adds \(\dot{\Phi}_t\), a damping term \(\alpha_t \Phi_t\), and a kinetic term to form second-order dynamics similar to Wasserstein accelerated information gradients.

Key Designs

  1. MWGraD flow and merit function:

    • Function: Provides a continuous-time interpretation for the original MWGraD and enables analysis of the distance to the weak Pareto optimum.
    • Mechanism: Defines \(\mathcal{M}(\rho) = \sup_q \min_k \{F_k(\rho) - F_k(q)\}\), which is non-negative and zero if and only if \(\rho\) is weakly Pareto optimal. Under geodesically convex conditions, it is proven that the MWGraD flow satisfies \(\mathcal{M}(\rho_t) \le R/(2t)\).
    • Design Motivation: Multi-objective optimization cannot measure convergence using a single function value difference; the merit function provides a unified metric.
  2. A-MWGraD accelerated flow:

    • Function: Transfers Nesterov-style acceleration to multi-objective Wasserstein distributional optimization.
    • Mechanism: Beyond the continuity equation, momentum potential evolution is introduced: \(\dot{\Phi}_t + \alpha_t \Phi_t + \frac{1}{2} \|\nabla \Phi_t\|^2 + \text{proj}_{\mathcal{C}(\rho_t), \rho_t}[0] = 0\). When \(K=1\), this reduces to the Wasserstein accelerated information gradient.
    • Design Motivation: This preserves the geometric structure of the probability space while allowing momentum to accumulate along the common multi-objective descent direction.
  3. Particle implementation and gradient approximation:

    • Function: Transforms theoretical flows into executable sampling algorithms.
    • Mechanism: Particle positions and velocities satisfy \(\dot{x}_t = v_t\) and \(\dot{v}_t + \alpha_t v_t + \sum_k w_{t,k} \nabla \delta_\rho F_k(\rho_t)(x_t) = 0\). Discrete updates are performed via \(x_i^{n+1} = x_i^n + \sqrt{\eta} v_i^n\) and \(v_i^{n+1} = \alpha_n v_i^n - \sqrt{\eta} \sum_k w_{n,k} \bar{\Delta}_k^n(x_i^n)\).
    • Design Motivation: Since the true \(\nabla \log \rho\) is not directly computable, SVGD and Blob kernel approximations allow the particle system to be implemented.

Loss & Training

For multi-objective sampling, the paper typically sets \(F_k(\rho) = \text{KL}(\rho || \pi_k)\). Each step requires solving a quadratic optimization on a simplex to find weights \(w_n\) that minimize the combined Wasserstein gradient norm. Geodesically convex scenarios use \(\alpha_n = (n-1)/(n+2)\), while strongly geodesically convex scenarios use momentum coefficients related to \(\sqrt{\beta \eta}\). The experiments implement A-MWGraD-SVGD and A-MWGraD-Blob.

Key Experimental Results

Main Results

Real-world experiments involve Bayesian multi-task learning on Multi-Fashion+MNIST, Multi-MNIST, and Multi-Fashion datasets, comparing MOO-SVGD, MWGraD, and their accelerated versions. The table shows ensemble accuracy after 40,000 iterations.

Dataset Task MOO-SVGD MWGraD-SVGD MWGraD-Blob A-MWGraD-SVGD A-MWGraD-Blob
Multi-Fashion+MNIST #1 94.8±0.4 94.7±0.3 94.1±0.5 96.4±0.4 96.1±0.5
Multi-Fashion+MNIST #2 85.6±0.2 88.9±0.6 90.5±0.4 90.3±0.3 90.7±0.4
Multi-MNIST #1 93.1±0.3 95.3±0.7 94.9±0.2 95.3±0.5 95.6±0.4
Multi-MNIST #2 91.2±0.2 92.9±0.5 93.6±0.5 93.4±0.4 94.2±0.4
Multi-Fashion #1 83.8±0.8 85.9±0.6 85.8±0.3 85.1±0.4 86.3±0.5
Multi-Fashion #2 83.1±0.3 85.6±0.5 86.3±0.5 87.4±0.6 86.5±0.7

Ablation Study

Configuration Key Metric Description
Theory: MWGraD flow \(\mathcal{M}(\rho_t) = O(1/t)\) Base convergence rate under geodesic convexity
Theory: A-MWGraD flow \(\mathcal{M}(\rho_t) = O(1/t^2)\) Nesterov-style acceleration achieved in convex scenarios
Theory: strongly convex \(\mathcal{M}(\rho_t) = O(e^{-\sqrt{\beta}t})\) Exponential convergence under strong geodesic convexity
Toy mixture sampling A-MWGraD-SVGD/Blob Faster reduction in GradNorm across various step sizes
Kernel bandwidth \(\sigma=1\) or \(10\) stable Extremes (0.1 or 100) degrade gradient approximation quality
Particle count \(K=5\) particles Good balance between precision and computational cost
Objective count overhead \(K=20\) objectives Simplex QP accounts for ~0.79 of time, becoming a bottleneck

Key Findings

  • The advantage of A-MWGraD is reflected not only in final accuracy but also in faster convergence curves; in toy sampling, particles concentrate earlier in the shared high-density regions of multiple targets.
  • Both Blob and SVGD approximations benefit from acceleration, indicating the method is not tied to a specific kernel gradient estimator.
  • Acceleration does not guarantee the highest performance in every single metric, but A-MWGraD variants overall achieve competitive or best performance across multiple tasks.
  • The cost of solving the combined multi-objective weights \(w\) rises rapidly with the number of objectives, which is the most practical bottleneck when scaling from small multi-task problems to many objectives.

Highlights & Insights

  • The transition from discrete MWGraD to a flow perspective is significant: once continuous-time dynamics are established, Lyapunov and Hamiltonian tools can be used to analyze acceleration.
  • The choice of the merit function solves the problem of how to describe "where to converge" in multi-objective distributional optimization, which is more suitable for the Pareto context than individual objective values.
  • The particle implementation of A-MWGraD retains the multi-objective weight solver step; thus, it does not simply add momentum to each objective independently but accelerates the common descent direction.

Limitations & Future Work

  • The convergence rates in the paper are primarily continuous-time results; rigorous discrete-time convergence rates for A-MWGraD have yet to be established.
  • Theoretical analysis assumes exact Wasserstein gradients are available, but in practice, SVGD/Blob approximations must be used. How approximation errors affect acceleration requires further study.
  • As the number of objectives increases, the cost of solving the quadratic program for combined weights \(w\) rises significantly, potentially limiting large-scale multi-objective scenarios.
  • Experiments focus on sampling and Bayesian multi-task learning; future work could extend to generative model alignment, multi-objective reinforcement learning, or distributionally robust optimization.
  • vs MWGraD: MWGraD provides a multi-objective Wasserstein descent direction; A-MWGraD further provides a continuous flow interpretation and an accelerated version with stronger theoretical rates.
  • vs MOO-SVGD / MT-SGD: These methods handle multiple objectives in particle sampling. A-MWGraD maintains particle diversity while introducing momentum in the probability space.
  • vs Nesterov acceleration: Standard NAG accelerates single objectives in Euclidean space; this paper transfers its damped Hamiltonian interpretation to multi-objective Wasserstein space.
  • Insight: Many distribution-level optimization problems can be approached by finding a continuous-time flow followed by particle discretization; this method yields more interpretable convergence properties than stacking heuristic momentum.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The acceleration theory for multi-objective Wasserstein gradient flows is solid and represents a substantial contribution to the optimization field.
  • Experimental Thoroughness: ⭐⭐⭐⭐☆ Validated with toy and real multi-task data, including kernel/particle analysis; the scope of application could be broader.
  • Writing Quality: ⭐⭐⭐⭐☆ The theoretical structure is clear, but the dense notation may present a barrier for readers without an optimal transport background.
  • Value: ⭐⭐⭐⭐☆ Highly valuable for research in multi-objective sampling and distributional optimization; practical deployment depends on the efficiency of gradient approximation and weight solving.