Constrained Multi-Objective Reinforcement Learning with Max-Min Criterion¶
Conference: ICML 2026
arXiv: 2605.31388
Code: https://github.com/Giseung-Park/Constrained-Maxmin-MORL
Area: Reinforcement Learning / Multi-Objective RL / Constrained Optimization
Keywords: max-min fairness, constrained MORL, occupancy measure, dual convex optimization, projected gradient descent
TL;DR¶
This paper unifies "max-min multi-objective fairness" and "hard constraint satisfaction" into a single MORL framework. By reformulating the problem as a convex program using occupancy measures and deriving a dual convex optimization problem over weights \((u,w)\), the authors implement a projected gradient descent algorithm that simultaneously achieves fairness and constraint feasibility with theoretical guarantees of geometric convergence rates.
Background & Motivation¶
Background: The mainstream approach in Multi-Objective Reinforcement Learning (MORL) is to use a scalarization function \(f(J_1(\pi),\ldots,J_K(\pi))\) to combine multiple returns into a single scalar for single-objective RL. When \(f\) is linear, it becomes a weighted sum \(\sum_k w_k J_k(\pi)\), which is simple but fails to reflect "fairness." For instance, in traffic signal control, a scheduler may prioritize "minimizing the maximum waiting time" (max-min fairness, \(f=\min\)) rather than minimizing the total average wait time across all directions.
Limitations of Prior Work: Existing max-min MORL methods assume an unconstrained environment. However, real-world systems often involve hard constraints: a scheduler must maximize throughput fairness within a power budget, or a traffic light must minimize the longest wait time within a greenhouse gas emission limit. Integrating constraints into max-min MORL is not a trivial task—current max-min MORL algorithms either optimize imprecise lower bounds (Fan et al. 2023; Peng et al. 2025) or rely on biased gradients through Gaussian smoothing (Park et al. 2024). Conversely, established constrained RL methods (CPO, RCPO, Lagrangian) focus on \(K=1\) scalar rewards and are ill-equipped to handle the non-differentiability introduced by \(f=\min\).
Key Challenge: The max-min objective \(\min_k J_k(\pi)\) is non-convex and non-differentiable in the policy space, causing standard Lagrangian dual analysis to fail. To incorporate constraints, one must identify a mathematical structure capable of simultaneously handling "non-differentiable max-min" and "inequality constraints."
Goal: To establish a unified MORL framework capable of optimizing \(\min_k J_k(\pi)\) while satisfying \(J_{K+l}(\pi)\ge C^{(l)}\), accompanied by a provably convergent algorithm.
Key Insight: The authors observe that while max-min is non-convex in the policy space, it is convex in the occupancy measure \(\rho(s,a)\) space, a classic result from Puterman 1994. By rewriting policy optimization as a convex program regarding \(\rho\), constraints naturally become linear inequalities and the max-min objective becomes a min-of-linear form, transforming the entire problem into a standard convex program with a Lagrangian dual.
Core Idea: Convexification of the primal problem via occupancy measures → derivation of its entropy-regularized dual → formulation of a convex optimization problem involving only "constraint multipliers \(u\) + objective weights \(w\)" → use of projected gradient descent to learn both sets of weights, achieving max-min fairness and constraint satisfaction in a joint iteration.
Method¶
Overall Architecture¶
The input is a MOMDP \(\langle\mathcal{S},\mathcal{A},T,\mu_0,r,\gamma\rangle\), where the first \(K\) dimensions of the reward \(r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^{K+L}\) are objectives for "max-min fairness," and the subsequent \(L\) dimensions are constraints requiring \(J\ge C\). The algorithm outputs a softmax policy \(\pi(\cdot|s)=\mathrm{softmax}\{Q(s,\cdot)/\beta\}\) that maximizes the minimum objective return while satisfying all constraints.
The workflow follows a bilevel iteration:
- Inner Loop: With fixed weights \((u,w)\), the Q-function is trained to \(Q^*_{u,w}\) via soft Bellman iterations (corresponding to an entropy-regularized "scalarized RL subproblem" with reward \(\sum_l u_l c^{(l)}+\sum_k w_k r^{(k)}\)).
- Outer Loop: Based on the \(\pi^*_{u,w}\) obtained from the inner loop, \(\nabla_{(u,w)}\mathcal{L}(u,w)\) is estimated. A projected gradient descent step is performed on \((u,w)\) to project them onto the constraint multiplier space \(\mathbb{R}_+^L\) and the objective weight simplex \(\Delta^K\).
Mathematically, the primal problem is transformed into a convex program through occupancy measures. The authors further prove that the dual can be written as:
where \(v^*_{u,w}\) is the fixed point of the entropy-regularized Bellman operator \(\mathcal{T}_{u,w}\).
Key Designs¶
-
Occupancy Measure Convexification + Dual Convexification (Double Convex Structure):
- Function: Converts non-convex and non-differentiable max-min MORL in the policy space into a finite-dimensional convex optimization over \((u,w)\).
- Mechanism: The policy \(\pi\) is replaced by the occupancy measure \(\rho(s,a)\), making the primal max-min problem a convex program—the objective is \(\max_\rho \min_k \sum_{s,a} r^{(k)}(s,a)\rho(s,a)\) with Bellman flow and linear return constraints. Solving its dual yields a convex loss \(\mathcal{L}(u,w)\) over multipliers \(u\in\mathbb{R}_+^L\) and weights \(w\in\Delta^K\). The non-differentiability of \(f=\min\) disappears in the dual, as max-min is equivalent to maximizing a linear function over the simplex \(\Delta^K\).
- Design Motivation: To avoid dealing with subgradients of \(\min\) directly on policy parameters. While Lagrangian duality fails in the original max-min form due to non-differentiability, the double convexification via occupancy measures allows max-min and inequality constraints to be unified into a single convex loss.
-
Unified Gradient Formula (Theorem 3.3):
- Function: Uses the same entropy-regularized policy \(\pi^*_{u,w}\) to simultaneously calculate \(\nabla_u \mathcal{L}\) and \(\nabla_w \mathcal{L}\).
- Mechanism: The authors prove \(\nabla_u v^*_{u,w}(s) = v_c^{\pi^*_{u,w}}(s)\) and \(\nabla_w v^*_{u,w}(s) = v_r^{\pi^*_{u,w}}(s)\), where \(v_c\) and \(v_r\) are value functions evaluating constrained rewards \(\{c^{(l)}\}\) and objective rewards \(\{r^{(k)}\}\), respectively.
- Design Motivation: This compresses "constraint updates" and "max-min weight updates" into a single value function evaluation. The gradient for \(w\) automatically amplifies "bottleneck" objectives (dimensions with small returns), while the gradient for \(u\) pushes multipliers of violated constraints higher to enforce feasibility.
-
Entropy Regularization + Projected Gradient Descent (with Smoothness Guarantee):
- Function: Ensures solvability and numerical stability while providing a geometric convergence rate.
- Mechanism: Adding an entropy term \(\beta \sum_s \mathcal{H}_\rho(s)\rho(s)\) introduces only an \(O(\beta\log|\mathcal{A}|/(1-\gamma))\) approximation error. Entropy regularization keeps \(\pi^*_{u,w}(a|s)>0\) strictly positive, making the Hessian \(H[\mathcal{L}]\) positive definite under Slater's condition. This renders the dual \(\alpha\)-smooth and strongly convex.
- Design Motivation: Entropy regularization acts as both a "theoretical crutch" (guaranteeing positivity and smoothness) and an "algorithmic crutch" (enabling soft Bellman Q-updates). PGD is more stable than standard Lagrangian methods as \(w\) remains on the simplex without specialized normalization tricks.
Loss & Training Strategy¶
The outer objective is the convex loss \(\mathcal{L}(u,w) = \sum_s \mu_0(s) v^*_{u,w}(s) - \sum_l u_l C^{(l)}\). The inner loop uses entropy-regularized soft Bellman updates: \(Q(s,a) \leftarrow [u;w]^\top [c;r] + \gamma \sum_{s'} T(s'|s,a)\beta\log\sum_{a'}\exp(Q(s',a')/\beta)\). In continuous state spaces, a gradient network \(g_\theta(s)\in\mathbb{R}^{L+K}\) parameterizes the estimate of \(\nabla v^*_{u,w}(s)\), paired with SAC-style Q-networks. The hyperparameter \(\beta\) was tuned within \(\{0.1, 0.03, 0.01, 0.003, 0.001\}\), with \(\beta=0.03\) yielding the lowest error.
Key Experimental Results¶
Main Results¶
Evaluations were conducted on tabular settings and three real-world environments. For the tabular setup, an LP was used to find the ground truth.
| Setting | Algorithm | Metric | Value | Comparison |
|---|---|---|---|---|
| Tabular MOMDP | Constrained max-min (ours) | Optimality Error ↓ | 0.004 | unconstr. max-min: 0.325 / constr. max-avg: 0.657 / unconstr. max-avg: 1.008 |
| Building (HVAC, \(C_{th}=180\)) | Ours | Energy / Min Comfort ↑ | 178.7 / 639.8 | MA-SAC-L: 171.4 / 620.9 (feat/poor fairness); Max-min GS: 202.1 / 653.6 (vioat); ARAM: 276.9 / 664.3 (severe violat) |
| MoAnt-v5 (\(C_{th}=50\)) | Ours | Control Cost / Min Return ↑ | 28.3 / 92.2 | MA-SAC-L: 47.8 / 83.0; MA-SAC: 275.3 / 98.8 (violat); ARAM: 620.7 / 101.3 (severe violat) |
| Traffic 16-lane (\(C_{th}=70{,}000\)) | Ours | CO₂ / Min Lane Return ↑ | 69,147 / −25,229 | MA-CPGO: 67,887 / −27,830; Max-min GS: 73,162 / −21,527 (violat); ARAM: 88,748 / −19,700 (severe violat) |
Notably, only the proposed method and the "max-average + Lagrangian" baseline strictly satisfied constraints. Among feasible methods, ours achieved higher "worst-case returns," validating the effectiveness of max-min fairness.
Ablation Study¶
| Configuration | Energy (\(C_{th}=180\)) | Worst-case Return | Description |
|---|---|---|---|
| Full model | 178.7 | 639.8 | Complete algorithm (joint \((u,w)\) updates) |
| w/o \(w\) update | 178.1 | 626.7 | No max-min weight learning → worst return drops by 13 |
| w/o \(u\) update | 200.7 | 653.5 | No multiplier learning → constraint violation (>180) |
| w/o \((u,w)\) update | 222.0 | 646.9 | No weight learning → violates and unfair |
Key Findings¶
- Complementary \(u\) and \(w\): Ablations show that removing \(w\) updates hurts fairness, while removing \(u\) updates leads to constraint violations. Simultaneous updates are essential.
- Sensitivity of \(\beta\): \(\beta=0.1\) introduces high approximation error, while \(\beta<0.01\) decreases the convergence coefficient \(\lambda/\alpha\). The sweet spot is \(\beta\in[0.01, 0.03]\).
- Failure of existing max-min methods: Max-min GS and ARAM fail to satisfy constraints in all real-world scenarios, suggesting that constraints cannot be simply tacked onto max-min MORL.
- MA-SAC-L Limitation: This method optimizes for max-average while satisfying constraints, failing to address fairness between objectives, resulting in lower "worst-case" performance.
Highlights & Insights¶
- Occupancy measure convexification is an underutilized tool: Non-convex objectives in the policy space often become convex in the occupancy measure space. This avoids the necessity of subgradient methods for \(\min\).
- Elegant Unified Gradient (Theorem 3.3): It is surprising that gradients for both multipliers \(u\) and weights \(w\) are simply value functions of different reward channels under the same policy.
- The Triple Role of Entropy Regularization: (1) Smoothing the hard-min; (2) ensuring the dual Hessian is positive definite; (3) enabling a closed-form soft Bellman update.
- Transferable Design: The joint PGD framework for objective weights and constraint multipliers can be applied to other fair-constrained tasks like fair federated learning or fair-RLHF.
Limitations & Future Work¶
- Tabular Theory vs. Neural Practice: Geometric convergence via Theorem 3.6 relies on assumptions that may not strictly hold during neural network approximation.
- Entropy Parameter \(\beta\): \(\beta\) is a hyperparameter coupled with the convergence rate. A potential future direction includes developing an "adaptive \(\beta\)" mechanism.
- Scaling to Multiple Constraints (\(L \ge 2\)): Real-world experiments focused on \(L=1\). Framework support exists, but performance under complex multiple constraints is unexplored.
- High-dimensional Objectives: Traffic \((K=16)\) was the scale limit in the study; stability for very large \(K\) (e.g., \(K=100\)) requires further validation.
Related Work & Insights¶
- vs. Park et al. 2024 (Gaussian Smoothing): Their method requires multiple network copies and results in biased gradients. This work provides exact gradients with a single network and supports constraints.
- vs. Byeon et al. 2025 (ARAM): ARAM uses game-theoretic ideas but lacks constraint support. Experiments show ARAM severely violates constraints in practical settings.
- vs. CPO / Lagrangian RL: These assume scalar rewards (\(K=1\)). This work unifies objective weights and multipliers into a single dual framework, offering a cleaner structural solution for MORL.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐