# Multi-Objective Reinforcement Learning with Max-Min Criterion: A Game-Theoretic Approach
Conference: NeurIPS 2025 | arXiv: 2510.20235 | Code: GitHub | Area: Reinforcement Learning | Keywords: Multi-Objective Reinforcement Learning, Max-Min Fairness, Zero-Sum Game, Mirror Descent, Last-Iterate Convergence
## TL;DR
This paper reformulates max-min multi-objective reinforcement learning as a two-player zero-sum regularized continuous game, proposes the ERAM/ARAM algorithms, and leverages mirror descent to achieve a concise closed-form weight update. The approach guarantees global last-iterate convergence and significantly outperforms existing methods on tasks such as traffic signal control.
## Background & Motivation
Background: Multi-objective reinforcement learning (MORL) addresses MDPs with vector-valued rewards, where utility-function-based methods—most commonly weighted sums—are the dominant paradigm. Max-min MORL maximizes the worst-performing objective dimension and is naturally suited to fairness-oriented resource allocation scenarios.
Limitations of Prior Work: (i) GGF-DQN/GGF-PPO optimize a surrogate objective \(\mathbb{E}_\pi[\min_k \sum_t \gamma^t r_k]\) (a Jensen lower bound) rather than the true objective \(\min_k \mathbb{E}_\pi[\sum_t \gamma^t r_k]\); (ii) the double-loop method of Park et al. directly optimizes the true objective but incurs substantial memory and computational overhead (20 Q-networks) and guarantees only average-iterate convergence; (iii) GGF-PPO updates only the worst-performing dimension per batch, effectively switching \(w\) among one-hot vectors, yielding only average-iterate convergence.
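A minimal illustration of the gap between the two objectives (my own example, not from the paper): suppose a policy induces two equally likely trajectories with vector returns \(G = (1, 0)\) and \(G = (0, 1)\), where \(G_k = \sum_t \gamma^t r_k\). Then

$$\min_k \mathbb{E}_\pi[G_k] = \min(0.5,\, 0.5) = 0.5, \qquad \mathbb{E}_\pi\big[\min_k G_k\big] = \mathbb{E}_\pi[0] = 0,$$

so the surrogate can sit far below the true max-min objective.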
Key Challenge: The nonlinearity of the \(\min\) operator prevents max-min MORL from being solved directly by standard RL methods, while practical applications still demand numerically efficient solvers.
Goal: Design a computationally and memory-efficient max-min MORL algorithm with last-iterate convergence guarantees.
Key Insight: Exploit a special equivalence between max-min and min-max under entropy regularization to transform the problem into finding a Nash equilibrium of a two-player zero-sum game.
Core Idea: The Learner updates the policy via NPG/PPO, while the Adversary updates the weight vector \(w\) using the closed-form softmax formula derived from entropy-regularized mirror descent—two players jointly achieving max-min fairness.
## Method
### Overall Architecture
Chain of Equivalences:

$$\max_\pi \min_{k} V_{k,\tau}^\pi = \max_\pi \min_{w \in \Delta^K} \langle w, \mathbf{V}_\tau^\pi \rangle = \min_{w \in \Delta^K} \max_\pi \langle w, \mathbf{V}_\tau^\pi \rangle$$

The first equality extends the discrete \(\min_k\) to \(\min_w\) over the simplex \(\Delta^K\): the inner objective is linear in \(w\), so its minimum over the simplex is attained at a vertex, i.e., at a one-hot \(w\). The second equality is the special max-min = min-max equivalence (Theorem 3.1). This defines a two-player zero-sum game:
- Learner: policy parameter \(\theta\), maximizing \(\langle w, \mathbf{V}_\tau^{\pi_\theta} \rangle\)
- Adversary: weight vector \(w \in \Delta^K\), minimizing \(\langle w, \mathbf{V}_\tau^{\pi_\theta} \rangle\)
### Key Designs
Module 1: Regularized Game \(\mathcal{RG}\)
- Function: adds individual regularization terms to the utility function of the original game.
- Learner utility: \(u_{Learner}^{\mathcal{RG}} = \langle w, \mathbf{V}^{\pi_\theta} \rangle + \tau \tilde{H}(\pi_\theta) - \tau_w H(w)\).
- \(\tau \tilde{H}(\pi_\theta)\): avoids the indeterminacy issue in max-min MORL (some MOMDPs admit only stochastic optimal policies).
- \(-\tau_w H(w)\): prevents \(w\) from degenerating to a one-hot vector, encouraging simultaneous consideration of multiple objective dimensions.
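To make the utility concrete, here is a minimal NumPy sketch (my own illustration; names and hyperparameter values are assumptions, and the policy-entropy term \(\tilde{H}\) is approximated by a single-state entropy):

```python
import numpy as np

def learner_utility(w, V, pi, tau=0.1, tau_w=0.1):
    """Regularized Learner utility  <w, V> + tau * H(pi) - tau_w * H(w).

    w  : (K,) weight vector on the simplex
    V  : (K,) estimated objective values V_k^{pi}
    pi : (A,) action distribution at one state (toy stand-in for H~(pi))
    """
    H_pi = -np.sum(pi * np.log(pi + 1e-12))   # policy entropy (single-state stand-in)
    H_w = -np.sum(w * np.log(w + 1e-12))      # weight entropy
    return w @ V + tau * H_pi - tau_w * H_w

# Toy call: two objectives, four actions, uniform distributions.
print(learner_utility(np.full(2, 0.5), np.array([1.0, 0.2]), np.full(4, 0.25)))
```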
Module 2: ERAM — Learner Update (NPG/PPO)
- Function: updates the policy via a natural policy gradient step.
- Tabular closed form: \(\pi_{\theta_{t+1}}(a|s) = \frac{1}{Z_\pi} (\pi_{\theta_t}(a|s))^\alpha \exp\left(\frac{1-\alpha}{\tau} Q_{w_t,\tau}^{\pi_{\theta_t}}(s,a)\right)\), where \(\alpha = 1 - \frac{\eta\tau}{1-\gamma}\).
- Deep RL: the NPG step is replaced by a PPO update.
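A minimal tabular sketch of this closed-form step (my own code; the shapes, names, and hyperparameter values are assumptions):

```python
import numpy as np

def npg_policy_update(pi, Q, eta=0.05, tau=0.1, gamma=0.99):
    """One tabular Learner step following the closed form above.

    pi : (S, A) current policy pi_{theta_t}(a|s)
    Q  : (S, A) regularized Q-values Q_{w_t, tau}^{pi_{theta_t}}(s, a)
    """
    alpha = 1.0 - eta * tau / (1.0 - gamma)
    logits = alpha * np.log(pi + 1e-12) + (1.0 - alpha) / tau * Q
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)  # per-state normalization Z_pi
```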
Module 3: ERAM — Adversary Update (Closed-Form Mirror Descent)
- Function: updates the weight vector \(w\) via a modified mirror descent step.
- Core formula: \(w_{t+1} = \text{softmax}\left(-\frac{1-\beta}{\tau_w} \mathbf{V}^{\pi_{\theta_t}} + \beta \log w_t\right)\), where \(\beta = \frac{1}{\lambda \tau_w + 1} \in (0,1)\) is always within the valid range.
- Design motivation: choosing negative-entropy regularization (rather than the \(\ell_2\) norm) yields an analytic mirror descent update, eliminating the need for projected gradient descent.
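And a matching sketch of the projection-free weight step (again my own illustration; hyperparameter values are assumptions):

```python
import numpy as np

def adversary_weight_update(w, V, lam=0.01, tau_w=0.1):
    """Closed-form mirror descent step for the Adversary, following the core formula above.

    w : (K,) current weight vector on the simplex
    V : (K,) current objective values V_k^{pi_{theta_t}}
    """
    beta = 1.0 / (lam * tau_w + 1.0)                    # always in (0, 1)
    logits = -(1.0 - beta) / tau_w * V + beta * np.log(w + 1e-12)
    logits -= logits.max()                              # numerical stability
    new_w = np.exp(logits)
    return new_w / new_w.sum()                          # softmax over the K dimensions
```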
Module 4: ARAM — Adaptive Regularization Extension
- Function: replaces the Adversary's regularizer \(H(w)\) (equivalent, up to sign and an additive constant, to the KL divergence from the uniform distribution) with \(-D_{KL}(w \| c)\).
- The reference vector \(c\) is computed dynamically: \(c_i = \text{softmax}(\mathbb{E}_{s,a}[r_i(s,a) r_{i'}(s,a)])\), where \(i'\) is the worst-performing dimension in the previous batch.
- Design motivation: unlike ERAM, which treats all dimensions equally, ARAM assigns greater weight to underperforming dimensions (without exclusively focusing on the worst), accelerating convergence.
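A minimal sketch of how \(c\) could be computed from the previous batch (my own reading of the formula; array shapes and names are assumptions):

```python
import numpy as np

def aram_reference_vector(rewards, worst_dim):
    """Dynamic reference vector c for ARAM's regularizer -D_KL(w || c).

    rewards   : (N, K) per-sample vector rewards r(s, a) from the previous batch
    worst_dim : index i' of the worst-performing objective in that batch
    """
    corr = (rewards * rewards[:, [worst_dim]]).mean(axis=0)  # E_{s,a}[r_i(s,a) * r_{i'}(s,a)]
    corr -= corr.max()                                        # numerical stability
    c = np.exp(corr)
    return c / c.sum()                                        # softmax over the K dimensions
```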
### Loss & Training
- Learner: PPO objective (deep RL setting).
- Adversary: closed-form softmax update (projection-free).
- Step-size relationship: \(\eta = O(\epsilon)\) and \(\lambda = O(\epsilon^2)\), so \(\eta \gg \lambda\) for small \(\epsilon\) (the policy updates faster than the weights).
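Putting the pieces together, here is a self-contained toy (single-state, bandit-style) instance of the alternating scheme using the closed-form updates from Modules 2 and 3; the problem instance, hyperparameters, and the bandit simplification are my own, and in the deep RL setting the Learner step would be a PPO update instead:

```python
import numpy as np

# Toy single-state (bandit) instance: A actions, K objective dimensions.
rng = np.random.default_rng(0)
A, K = 4, 3
R = rng.uniform(0.0, 1.0, size=(A, K))     # R[a, k] = k-th reward of action a

eps, tau, tau_w, gamma = 0.1, 0.05, 0.05, 0.0
eta, lam = eps, eps ** 2                   # eta = O(eps)  >>  lam = O(eps^2)
alpha = 1.0 - eta * tau / (1.0 - gamma)
beta = 1.0 / (lam * tau_w + 1.0)

pi = np.full(A, 1.0 / A)                   # Learner: policy over actions
w = np.full(K, 1.0 / K)                    # Adversary: weights over objectives

for t in range(2000):
    # Learner step (closed-form NPG on the scalarized values; PPO in deep RL).
    Q = R @ w                                              # Q_w(a) = <w, R[a, :]>, entropy bonus omitted in this toy
    logits = alpha * np.log(pi) + (1.0 - alpha) / tau * Q
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    # Adversary step (closed-form softmax mirror descent on the objective values).
    V = pi @ R                                             # V_k = E_{a ~ pi}[R[a, k]]
    logits = -(1.0 - beta) / tau_w * V + beta * np.log(w)
    w = np.exp(logits - logits.max())
    w /= w.sum()

print("worst-dimension value:", (pi @ R).min(), "final weights:", w)
```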
## Key Experimental Results
### Main Results (Table 1, Traffic Signal Control; higher is better)
| Environment | ARAM | ERAM | Park et al. | GGF-PPO | GGF-DQN | Avg-DQN |
|---|---|---|---|---|---|---|
| Base-4 | -1160 | -1387 | -1681 | -1731 | -1838 | -2774 |
| Asym-4 | -2696 | -2732 | -3510 | -3501 | -3053 | -4245 |
| Asym-16 | -15043 | -17334 | -23663 | -21663 | -17792 | -27499 |
### Ablation Study (Multi-Environment Results in Appendix)
| Experiment | Finding |
|---|---|
| Species conservation environment | ARAM/ERAM significantly outperform baselines |
| MO-Reacher environment | Same as above |
| Four-Room environment | Same as above |
| \(\beta\) ablation | Performance is stable when \(\beta\) is not close to 0 (ERAM) or 1 (ARAM) |
### Efficiency Comparison (Table 2, Training Time)
| Environment | ERAM | ARAM | Park et al. | GGF-PPO |
|---|---|---|---|---|
| Base-4 | 111 min | 120 min | 346 min | 122 min |
| Asym-4 | 87 min | 87 min | 241 min | 85 min |
| Asym-16 | 356 min | 365 min | 1125 min | 394 min |
Memory: ERAM/ARAM use approximately 13,704 parameters for the weight update, versus 274,084 for Park et al. (a ~95% reduction).
### Key Findings
- ARAM consistently achieves the best performance across all traffic signal control environments; ERAM ranks second.
- Compared to Park et al., training time is reduced by approximately 65–68% and memory usage by approximately 95%.
- ERAM exhibits last-iterate convergence (Figure 3), whereas Park et al. displays oscillatory behavior.
- Adaptive regularization (ARAM vs. ERAM) yields significant improvements at negligible additional cost.
## Highlights & Insights
- Exploiting the max-min = min-max equivalence: In RL, value functions are non-concave in the policy, precluding direct application of the minimax theorem; however, equivalence holds in the regularized setting via a specialized construction.
- Closed-form weight update: Negative-entropy regularization combined with a KL-divergence Bregman term yields a closed-form softmax solution, eliminating the iterative overhead of projected gradient descent (see the derivation sketch after this list).
- GGF-PPO as a special case: When \(\tau_w = 0\) and \(D_\psi(w, w_t)\) is removed, the proposed method reduces to GGF-PPO (with \(w\) becoming a one-hot vector).
- Last-iterate convergence: More practically useful than average-iterate convergence—the final policy can be deployed directly without maintaining a historical average.
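A sketch of why the entropy regularizer plus a KL Bregman term gives the softmax form of Module 3 (my own derivation, stated here only as a consistency check against the update formula above): the Adversary's proximal step is

$$w_{t+1} = \arg\min_{w \in \Delta^K} \left\{ \langle w, \mathbf{V}^{\pi_{\theta_t}} \rangle + \tau_w \sum_i w_i \log w_i + \frac{1}{\lambda} D_{KL}(w \,\|\, w_t) \right\}.$$

Setting the coordinate-wise gradient of the Lagrangian to zero gives \(V_i + \tau_w(\log w_i + 1) + \frac{1}{\lambda}(\log w_i - \log w_{t,i} + 1) + \mu = 0\), hence \(\log w_i = -\frac{\lambda}{\lambda\tau_w + 1} V_i + \frac{1}{\lambda\tau_w + 1} \log w_{t,i} + \text{const}\). With \(\beta = \frac{1}{\lambda\tau_w + 1}\) and \(\frac{\lambda}{\lambda\tau_w + 1} = \frac{1-\beta}{\tau_w}\), normalizing over the simplex recovers \(w_{t+1} = \text{softmax}\left(-\frac{1-\beta}{\tau_w} \mathbf{V}^{\pi_{\theta_t}} + \beta \log w_t\right)\).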
## Limitations & Future Work
- Theoretical analysis of ARAM is left for future work; current convergence guarantees apply only to ERAM in the tabular setting.
- The regularization coefficients \(\tau\) and \(\tau_w\) introduce approximation error; the optimal solution deviates from the unregularized case by a term linear in \(\tau\) and \(\tau_w\).
- Theoretical analysis in the tabular setting does not directly extend to deep RL (though empirical PPO substitution proves effective).
- For large \(K\) (Asym-16: \(K=16\)), performance gaps narrow; scaling efficiency to even larger \(K\) remains to be verified.
- The sample complexity \(\tilde{O}(K / ((1-\gamma)^3 \epsilon^6 \epsilon_{acc}^2))\) has a strong dependence on \(\epsilon\).
## Related Work & Insights
- vs. Park et al. (2024): The latter employs a double loop with zeroth-order gradient estimation and projected gradient descent, guaranteeing only average-iterate convergence. The proposed method uses a single loop with closed-form updates and last-iterate convergence, cutting memory by roughly 95% and training time by roughly two-thirds.
- vs. GGF-PPO (Siddique et al., 2021): GGF-PPO optimizes only the worst dimension per batch (one-hot \(w\)); this work simultaneously considers multiple dimensions via entropy regularization.
- vs. APMD (Abe et al., 2024): APMD assumes smooth monotone games, which does not apply to RL where value gradients are non-monotone.
- Insight: The game-theoretic perspective connects MORL to equilibrium computation; the regularization technique (negative entropy → closed-form solution) is also worth borrowing for other constrained RL problems.
## Rating
⭐⭐⭐⭐⭐ (5/5)
The paper tightly integrates theory and practice: the game-theoretic reformulation is elegant, the closed-form weight update is practical, and the last-iterate convergence guarantee represents a meaningful advance. Experiments on real-world traffic signal control substantially outperform baselines, with a 95% memory reduction and 65% training time reduction offering strong practical value. The adaptive regularization in ARAM further enhances performance. Code is publicly available.