# Multi-Objective Reinforcement Learning with Max-Min Criterion: A Game-Theoretic Approach
Conference: NeurIPS 2025 | arXiv: 2510.20235 | Code: GitHub | Area: Reinforcement Learning | Keywords: Multi-Objective Reinforcement Learning, Max-Min Fairness, Zero-Sum Game, Mirror Descent, Last-Iterate Convergence
## TL;DR
This paper reformulates max-min multi-objective reinforcement learning as a two-player zero-sum regularized continuous game, proposes the ERAM/ARAM algorithms, and leverages mirror descent to achieve a concise closed-form weight update. The approach guarantees global last-iterate convergence and significantly outperforms existing methods on tasks such as traffic signal control.
## Background & Motivation
Background: Multi-objective reinforcement learning (MORL) addresses MDPs with vector-valued rewards, where utility-function-based methods—most commonly weighted sums—are the dominant paradigm. Max-min MORL maximizes the worst-performing objective dimension and is naturally suited to fairness-oriented resource allocation scenarios.
Limitations of Prior Work: (i) GGF-DQN/GGF-PPO optimize a surrogate objective \(\mathbb{E}_\pi[\min_k \sum_t \gamma^t r_k]\) (a Jensen lower bound) rather than the true objective \(\min_k \mathbb{E}_\pi[\sum_t \gamma^t r_k]\); (ii) the double-loop method of Park et al. directly optimizes the true objective but incurs substantial memory and computational overhead (20 Q-networks) and guarantees only average-iterate convergence; (iii) GGF-PPO updates only the worst-performing dimension per batch, effectively switching \(w\) among one-hot vectors, yielding only average-iterate convergence.
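A minimal illustration of the gap between the two objectives (my own example, not from the paper): suppose a policy induces two equally likely trajectories with vector returns \(G = (1, 0)\) and \(G = (0, 1)\), where \(G_k = \sum_t \gamma^t r_k\). Then

$$\min_k \mathbb{E}_\pi[G_k] = \min(0.5,\, 0.5) = 0.5, \qquad \mathbb{E}_\pi\big[\min_k G_k\big] = \mathbb{E}_\pi[0] = 0,$$

so the surrogate can sit far below the true max-min objective.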
Key Challenge: The nonlinearity of the \(\min\) operator prevents max-min MORL from being solved directly by standard RL methods, while practical applications still demand numerically efficient solvers.
Goal: Design a computationally and memory-efficient max-min MORL algorithm with last-iterate convergence guarantees.
Key Insight: Exploit a special equivalence between max-min and min-max under entropy regularization to transform the problem into finding a Nash equilibrium of a two-player zero-sum game.
Core Idea: The Learner updates the policy via NPG/PPO, while the Adversary updates the weight vector \(w\) using the closed-form softmax formula derived from entropy-regularized mirror descent—two players jointly achieving max-min fairness.
## Method
### Overall Architecture
Chain of Equivalences:

$$\max_\pi \min_{k} V_{k,\tau}^\pi = \max_\pi \min_{w \in \Delta^K} \langle w, \mathbf{V}_\tau^\pi \rangle = \min_{w \in \Delta^K} \max_\pi \langle w, \mathbf{V}_\tau^\pi \rangle$$

The first equality extends the discrete \(\min_k\) to \(\min_w\) over the simplex \(\Delta^K\): the inner objective is linear in \(w\), so its minimum over the simplex is attained at a vertex, i.e., at a one-hot \(w\). The second equality is the special max-min = min-max equivalence (Theorem 3.1). This defines a two-player zero-sum game:
- Learner: policy parameter \(\theta\), maximizing \(\langle w, \mathbf{V}_\tau^{\pi_\theta} \rangle\)
- Adversary: weight vector \(w \in \Delta^K\), minimizing \(\langle w, \mathbf{V}_\tau^{\pi_\theta} \rangle\)
### Key Designs
Module 1: Regularized Game \(\mathcal{RG}\)
- Function: adds individual regularization terms to the utility function of the original game.
- Learner utility: \(u_{Learner}^{\mathcal{RG}} = \langle w, \mathbf{V}^{\pi_\theta} \rangle + \tau \tilde{H}(\pi_\theta) - \tau_w H(w)\).
- \(\tau \tilde{H}(\pi_\theta)\): avoids the indeterminacy issue in max-min MORL (some MOMDPs admit only stochastic optimal policies).
- \(-\tau_w H(w)\): prevents \(w\) from degenerating to a one-hot vector, encouraging simultaneous consideration of multiple objective dimensions.
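To make the utility concrete, here is a minimal NumPy sketch (my own illustration; names and hyperparameter values are assumptions, and the policy-entropy term \(\tilde{H}\) is approximated by a single-state entropy):

```python
import numpy as np

def learner_utility(w, V, pi, tau=0.1, tau_w=0.1):
    """Regularized Learner utility  <w, V> + tau * H(pi) - tau_w * H(w).

    w  : (K,) weight vector on the simplex
    V  : (K,) estimated objective values V_k^{pi}
    pi : (A,) action distribution at one state (toy stand-in for H~(pi))
    """
    H_pi = -np.sum(pi * np.log(pi + 1e-12))   # policy entropy (single-state stand-in)
    H_w = -np.sum(w * np.log(w + 1e-12))      # weight entropy
    return w @ V + tau * H_pi - tau_w * H_w

# Toy call: two objectives, four actions, uniform distributions.
print(learner_utility(np.full(2, 0.5), np.array([1.0, 0.2]), np.full(4, 0.25)))
```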
Module 2: ERAM — Learner Update (NPG/PPO)
- Function: updates the policy via a natural policy gradient step.
- Tabular closed form: \(\pi_{\theta_{t+1}}(a|s) = \frac{1}{Z_\pi} (\pi_{\theta_t}(a|s))^\alpha \exp\left(\frac{1-\alpha}{\tau} Q_{w_t,\tau}^{\pi_{\theta_t}}(s,a)\right)\), where \(\alpha = 1 - \frac{\eta\tau}{1-\gamma}\).
- Deep RL: the NPG step is replaced by a PPO update.
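A minimal tabular sketch of this closed-form step (my own code; the shapes, names, and hyperparameter values are assumptions):

```python
import numpy as np

def npg_policy_update(pi, Q, eta=0.05, tau=0.1, gamma=0.99):
    """One tabular Learner step following the closed form above.

    pi : (S, A) current policy pi_{theta_t}(a|s)
    Q  : (S, A) regularized Q-values Q_{w_t, tau}^{pi_{theta_t}}(s, a)
    """
    alpha = 1.0 - eta * tau / (1.0 - gamma)
    logits = alpha * np.log(pi + 1e-12) + (1.0 - alpha) / tau * Q
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)  # per-state normalization Z_pi
```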
Module 3: ERAM — Adversary Update (Closed-Form Mirror Descent)
- Function: updates the weight vector \(w\) via a modified mirror descent step.
- Core formula: \(w_{t+1} = \text{softmax}\left(-\frac{1-\beta}{\tau_w} \mathbf{V}^{\pi_{\theta_t}} + \beta \log w_t\right)\), where \(\beta = \frac{1}{\lambda \tau_w + 1} \in (0,1)\) is always within the valid range.
- Design motivation: choosing negative-entropy regularization (rather than the \(\ell_2\) norm) yields an analytic mirror descent update, eliminating the need for projected gradient descent.
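And a matching sketch of the projection-free weight step (again my own illustration; hyperparameter values are assumptions):

```python
import numpy as np

def adversary_weight_update(w, V, lam=0.01, tau_w=0.1):
    """Closed-form mirror descent step for the Adversary, following the core formula above.

    w : (K,) current weight vector on the simplex
    V : (K,) current objective values V_k^{pi_{theta_t}}
    """
    beta = 1.0 / (lam * tau_w + 1.0)                    # always in (0, 1)
    logits = -(1.0 - beta) / tau_w * V + beta * np.log(w + 1e-12)
    logits -= logits.max()                              # numerical stability
    new_w = np.exp(logits)
    return new_w / new_w.sum()                          # softmax over the K dimensions
```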
Module 4: ARAM — Adaptive Regularization Extension
- Function: replaces the Adversary's regularizer \(H(w)\) (equivalent, up to sign and an additive constant, to the KL divergence from the uniform distribution) with \(-D_{KL}(w \| c)\).
- The reference vector \(c\) is computed dynamically: \(c_i = \text{softmax}(\mathbb{E}_{s,a}[r_i(s,a) r_{i'}(s,a)])\), where \(i'\) is the worst-performing dimension in the previous batch.
- Design motivation: unlike ERAM, which treats all dimensions equally, ARAM assigns greater weight to underperforming dimensions (without exclusively focusing on the worst), accelerating convergence.
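A minimal sketch of how \(c\) could be computed from the previous batch (my own reading of the formula; array shapes and names are assumptions):

```python
import numpy as np

def aram_reference_vector(rewards, worst_dim):
    """Dynamic reference vector c for ARAM's regularizer -D_KL(w || c).

    rewards   : (N, K) per-sample vector rewards r(s, a) from the previous batch
    worst_dim : index i' of the worst-performing objective in that batch
    """
    corr = (rewards * rewards[:, [worst_dim]]).mean(axis=0)  # E_{s,a}[r_i(s,a) * r_{i'}(s,a)]
    corr -= corr.max()                                        # numerical stability
    c = np.exp(corr)
    return c / c.sum()                                        # softmax over the K dimensions
```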
### Loss & Training
- Learner: PPO objective (deep RL setting).
- Adversary: closed-form softmax update (projection-free).
- Step-size relationship: \(\eta = O(\epsilon)\) and \(\lambda = O(\epsilon^2)\), so \(\eta \gg \lambda\) for small \(\epsilon\) (the policy updates faster than the weights).
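Putting the pieces together, here is a self-contained toy (single-state, bandit-style) instance of the alternating scheme using the closed-form updates from Modules 2 and 3; the problem instance, hyperparameters, and the bandit simplification are my own, and in the deep RL setting the Learner step would be a PPO update instead:

```python
import numpy as np

# Toy single-state (bandit) instance: A actions, K objective dimensions.
rng = np.random.default_rng(0)
A, K = 4, 3
R = rng.uniform(0.0, 1.0, size=(A, K))     # R[a, k] = k-th reward of action a

eps, tau, tau_w, gamma = 0.1, 0.05, 0.05, 0.0
eta, lam = eps, eps ** 2                   # eta = O(eps)  >>  lam = O(eps^2)
alpha = 1.0 - eta * tau / (1.0 - gamma)
beta = 1.0 / (lam * tau_w + 1.0)

pi = np.full(A, 1.0 / A)                   # Learner: policy over actions
w = np.full(K, 1.0 / K)                    # Adversary: weights over objectives

for t in range(2000):
    # Learner step (closed-form NPG on the scalarized values; PPO in deep RL).
    Q = R @ w                                              # Q_w(a) = <w, R[a, :]>, entropy bonus omitted in this toy
    logits = alpha * np.log(pi) + (1.0 - alpha) / tau * Q
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    # Adversary step (closed-form softmax mirror descent on the objective values).
    V = pi @ R                                             # V_k = E_{a ~ pi}[R[a, k]]
    logits = -(1.0 - beta) / tau_w * V + beta * np.log(w)
    w = np.exp(logits - logits.max())
    w /= w.sum()

print("worst-dimension value:", (pi @ R).min(), "final weights:", w)
```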
## Key Experimental Results
### Main Results (Table 1, Traffic Signal Control; higher is better)
| Environment | ARAM | ERAM | Park et al. | GGF-PPO | GGF-DQN | Avg-DQN |
|---|---|---|---|---|---|---|
| Base-4 | -1160 | -1387 | -1681 | -1731 | -1838 | -2774 |
| Asym-4 | -2696 | -2732 | -3510 | -3501 | -3053 | -4245 |
| Asym-16 | -15043 | -17334 | -23663 | -21663 | -17792 | -27499 |
### Ablation Study (Multi-Environment Results in Appendix)
| Experiment | Finding |
|---|---|
| Species conservation environment | ARAM/ERAM significantly outperform baselines |
| MO-Reacher environment | Same as above |
| Four-Room environment | Same as above |
| \(\beta\) ablation | Performance is stable when \(\beta\) is not close to 0 (ERAM) or 1 (ARAM) |
### Efficiency Comparison (Table 2, Training Time)
| Environment | ERAM | ARAM | Park et al. | GGF-PPO |
|---|---|---|---|---|
| Base-4 | 111 min | 120 min | 346 min | 122 min |
| Asym-4 | 87 min | 87 min | 241 min | 85 min |
| Asym-16 | 356 min | 365 min | 1125 min | 394 min |
Memory: ERAM/ARAM use approximately 13,704 parameters for the weight update, versus 274,084 for Park et al. (a ~95% reduction).
### Key Findings
- ARAM consistently achieves the best performance across all traffic signal control environments; ERAM ranks second.
- Compared to Park et al., training time is reduced by approximately 65–68% and memory usage by approximately 95%.
- ERAM exhibits last-iterate convergence (Figure 3), whereas Park et al. displays oscillatory behavior.
- Adaptive regularization (ARAM vs. ERAM) yields significant improvements at negligible additional cost.
## Highlights & Insights
- Exploiting the max-min = min-max equivalence: In RL, value functions are non-concave in the policy, precluding direct application of the minimax theorem; however, equivalence holds in the regularized setting via a specialized construction.
- Closed-form weight update: Negative-entropy regularization combined with a KL-divergence Bregman term yields a closed-form softmax solution, eliminating the iterative overhead of projected gradient descent (see the derivation sketch after this list).
- GGF-PPO as a special case: When \(\tau_w = 0\) and \(D_\psi(w, w_t)\) is removed, the proposed method reduces to GGF-PPO (with \(w\) becoming a one-hot vector).
- Last-iterate convergence: More practically useful than average-iterate convergence—the final policy can be deployed directly without maintaining a historical average.
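A sketch of why the entropy regularizer plus a KL Bregman term gives the softmax form of Module 3 (my own derivation, stated here only as a consistency check against the update formula above): the Adversary's proximal step is

$$w_{t+1} = \arg\min_{w \in \Delta^K} \left\{ \langle w, \mathbf{V}^{\pi_{\theta_t}} \rangle + \tau_w \sum_i w_i \log w_i + \frac{1}{\lambda} D_{KL}(w \,\|\, w_t) \right\}.$$

Setting the coordinate-wise gradient of the Lagrangian to zero gives \(V_i + \tau_w(\log w_i + 1) + \frac{1}{\lambda}(\log w_i - \log w_{t,i} + 1) + \mu = 0\), hence \(\log w_i = -\frac{\lambda}{\lambda\tau_w + 1} V_i + \frac{1}{\lambda\tau_w + 1} \log w_{t,i} + \text{const}\). With \(\beta = \frac{1}{\lambda\tau_w + 1}\) and \(\frac{\lambda}{\lambda\tau_w + 1} = \frac{1-\beta}{\tau_w}\), normalizing over the simplex recovers \(w_{t+1} = \text{softmax}\left(-\frac{1-\beta}{\tau_w} \mathbf{V}^{\pi_{\theta_t}} + \beta \log w_t\right)\).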
## Limitations & Future Work
- Theoretical analysis of ARAM is left for future work; current convergence guarantees apply only to ERAM in the tabular setting.
- The regularization coefficients \(\tau\) and \(\tau_w\) introduce approximation error; the optimal solution deviates from the unregularized case by a term linear in \(\tau\) and \(\tau_w\).
- Theoretical analysis in the tabular setting does not directly extend to deep RL (though empirical PPO substitution proves effective).
- For large \(K\) (Asym-16: \(K=16\)), performance gaps narrow; scaling efficiency to even larger \(K\) remains to be verified.
- The sample complexity \(\tilde{O}(K / ((1-\gamma)^3 \epsilon^6 \epsilon_{acc}^2))\) has a strong dependence on \(\epsilon\).
## Related Work & Insights
- vs. Park et al. (2024): The latter employs a double loop with zeroth-order gradient estimation and projected gradient descent, guaranteeing only average-iterate convergence. The proposed method uses a single loop with closed-form updates and last-iterate convergence, cutting memory by roughly 95% and training time by roughly two-thirds.
- vs. GGF-PPO (Siddique et al., 2021): GGF-PPO optimizes only the worst dimension per batch (one-hot \(w\)); this work simultaneously considers multiple dimensions via entropy regularization.
- vs. APMD (Abe et al., 2024): APMD assumes smooth monotone games, which does not apply to RL where value gradients are non-monotone.
- Insight: The game-theoretic perspective connects MORL to equilibrium computation; the regularization technique (negative entropy → closed-form solution) is also worth borrowing for other constrained RL problems.
## Rating
⭐⭐⭐⭐⭐ (5/5)
The paper tightly integrates theory and practice: the game-theoretic reformulation is elegant, the closed-form weight update is practical, and the last-iterate convergence guarantee represents a meaningful advance. Experiments on real-world traffic signal control substantially outperform baselines, with a 95% memory reduction and 65% training time reduction offering strong practical value. The adaptive regularization in ARAM further enhances performance. Code is publicly available.