
Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents

Conference: NeurIPS 2025 arXiv: 2510.20199 Code: Available Area: Reinforcement Learning / Risk Aversion Keywords: Constrained Reinforcement Learning, Risk Aversion, Optimized Certainty Equivalent (OCE), CVaR, Partial Lagrangian Relaxation

TL;DR

This paper proposes a reward-based risk-aware constrained RL framework that applies Optimized Certainty Equivalent (OCE) risk measures to both objectives and constraints, establishes parametric strong duality, and delivers a modular algorithm that wraps standard RL solvers (e.g., PPO) as a black box.

Background & Motivation

Constrained RL is a common framework for handling conflicting objectives (e.g., reaching a goal while avoiding walls in maze navigation). Standard constrained RL expresses both objectives and constraints as expected cumulative rewards, which is inadequate for high-stakes applications:

Limitations of Expectation: Expected values fail to capture catastrophic tail events in return distributions. For instance, maximizing average returns in portfolio management may overlook the risk of large losses.

Existing risk-averse RL paradigms each have shortcomings:

Return-based: Applies a risk measure in place of the expectation to the discounted cumulative return, \(\rho\left[-\sum_\tau \gamma^\tau r_\tau\right]\). This captures risk only over the overall return distribution and is insensitive to risk at individual time steps.

Recursive Risk: Recursively evaluates risk measures at each decision stage, generalizing the tower property of expectations. Computationally demanding and difficult to scale.

This paper proposes a third, reward-based paradigm: apply the risk measure to the per-step reward under the state-action occupancy measure, i.e.:

\[R^* = \sup_{\nu^\pi} \; -\frac{1}{1-\gamma}\,\rho_{\nu^\pi}\big(-r(s,a)\big)\]

Key Advantage: This paradigm provides stepwise robustness, capturing risk simultaneously across reward values and time steps. As the CVaR level \(\beta \to 0\), the objective reduces to the essential infimum of the reward over all state-action pairs visited at any time step, a strictly stronger safety guarantee than the return-based approach offers.
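A minimal numerical sketch of the distinction, assuming both paradigms use CVaR at level \(\beta\); the function names and the resampling scheme are illustrative, not taken from the paper's code:

```python
import numpy as np

def cvar_lower(samples, beta):
    """Average of the worst beta-fraction of samples (lower-tail CVaR)."""
    x = np.sort(np.asarray(samples, dtype=float))
    k = max(1, int(np.ceil(beta * len(x))))
    return x[:k].mean()

def return_based(trajectories, gamma, beta):
    """Risk taken over the distribution of discounted returns (one value per trajectory)."""
    returns = [sum(gamma**t * r for t, r in enumerate(traj)) for traj in trajectories]
    return cvar_lower(returns, beta)

def reward_based(trajectories, gamma, beta, n_resample=10_000, seed=0):
    """Risk taken over per-step rewards drawn from the discounted occupancy measure."""
    rewards, weights = [], []
    for traj in trajectories:
        for t, r in enumerate(traj):
            rewards.append(r)
            weights.append(gamma ** t)
    p = np.asarray(weights) / np.sum(weights)
    samples = np.random.default_rng(seed).choice(rewards, size=n_resample, p=p)
    return cvar_lower(samples, beta) / (1 - gamma)
```

As \(\beta\) shrinks, `reward_based` approaches the worst observed per-step reward (scaled by \(1/(1-\gamma)\)), whereas `return_based` only guards against bad whole-trajectory returns.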

Core Challenges:

  • How to incorporate reward-based risk measures into constrained RL?
  • How to establish duality guarantees (making the Lagrangian relaxation exact)?
  • How to design practical algorithms compatible with existing RL methods?

Method

Overall Architecture

For general OCE risk measures (including CVaR, mean semi-variance, etc.), the constrained problem is written as:

\[\sup_{\pi, t_0} \mathbb{E}\Big[\sum_\tau \gamma^\tau r_0'(s_\tau, a_\tau, t_0)\Big] \quad \text{s.t.} \quad \sup_{t_i} \mathbb{E}\Big[\sum_\tau \gamma^\tau r_i'(s_\tau, a_\tau, t_i)\Big] \geq c_i\]

where the modified reward \(r_i'(s,a,t) = t - \frac{1}{\beta}(t - r_i(s,a))_+\) (for CVaR), and the auxiliary variable \(t\) controls the degree of risk aversion.
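As a concrete sketch of this modified reward on samples (function names are illustrative; only the CVaR instance shown above is implemented):

```python
import numpy as np

def cvar_modified_reward(r, t, beta):
    """r'(s, a, t) = t - (1/beta) * (t - r)_+, the CVaR instance of the OCE."""
    return t - np.maximum(t - r, 0.0) / beta

def oce_value(reward_samples, beta, t_grid):
    """OCE/CVaR value on samples: maximize the mean modified reward over the auxiliary t."""
    return max(np.mean(cvar_modified_reward(reward_samples, t, beta)) for t in t_grid)
```

For CVaR, the maximizing \(t\) sits at a \(\beta\)-quantile of the reward distribution, which foreshadows the quantile-matching check in the experiments.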

Key Designs

1. Parametric Strong Duality and Partial Lagrangian Relaxation

Function: Proves that the constrained problem can be solved exactly via partial Lagrangian relaxation.

Mechanism:

  • With \(t \in \mathcal{T}\) fixed, the problem reduces to standard constrained RL, so strong duality holds under Slater's condition (Proposition 3.3).
  • A constraint qualification is introduced (Assumption 3.4): there exists a convex compact set \(\mathcal{I} \subset \mathcal{T}\) such that Slater's condition holds for all \(t \in \mathcal{I}\).
  • Under this qualification, the resulting partial dual problem is exactly equivalent to the original constrained problem.

\[D_\theta^* = \sup_{t \in \mathcal{T}} \inf_{\lambda \in \Lambda} \underbrace{\sup_\theta \mathcal{L}(\pi_\theta, t, \lambda)}_{\text{black-box RL}}\]

Design Motivation: To make the dual relaxation exact rather than approximate — a first in the risk-averse constrained RL literature.
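For reference, a plausible form of the partial Lagrangian optimized by the black-box RL step (an assumed normalization following the standard constrained-RL recipe; the paper's exact scaling may differ):

\[\mathcal{L}(\pi_\theta, t, \lambda) = \mathbb{E}_{\pi_\theta}\Big[\sum_\tau \gamma^\tau r_0'(s_\tau, a_\tau, t_0)\Big] + \sum_i \lambda_i \left(\mathbb{E}_{\pi_\theta}\Big[\sum_\tau \gamma^\tau r_i'(s_\tau, a_\tau, t_i)\Big] - c_i\right)\]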

2. Modular Algorithm Design (Algorithm 1)

Function: Decomposes the problem into an inner RL subproblem and an outer \((t, \lambda)\) update.

For fixed \((t, \lambda)\), the inner problem is standard RL (with a modified reward function) and can be solved by any RL algorithm (e.g., PPO). The outer loop updates \((t, \lambda)\) via stochastic gradient descent-ascent (SGDA):

\[\lambda^{(j+1)} \leftarrow \Pi_\Lambda\big(\lambda^{(j)} - \eta_\lambda \hat{\nabla}_\lambda \hat{\mathcal{L}}\big), \qquad t^{(j+1)} \leftarrow \Pi_{\mathcal{T}}\big(t^{(j)} + \eta_t \hat{\nabla}_t \hat{\mathcal{L}}\big)\]

Design Motivation: The modular structure allows users to flexibly choose any combination of risk-neutral/risk-averse objectives and/or constraints, and to plug in any existing RL algorithm as a black-box subproblem solver.
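A minimal sketch of this loop under placeholder interfaces: `solve_rl` stands in for any black-box RL solver (e.g., PPO) run on the Lagrangian-weighted modified reward, and `grad_estimator` abstracts the stochastic gradient estimates of the Lagrangian with respect to \((t, \lambda)\); none of these names come from the paper's code.

```python
import numpy as np

def lagrangian_reward(rewards, t, lam, beta):
    """Scalarized reward r0'(s,a,t0) + sum_i lam_i * ri'(s,a,ti), CVaR-type modification."""
    mod = [ti - np.maximum(ti - ri, 0.0) / beta for ri, ti in zip(rewards, t)]
    return mod[0] + sum(l * m for l, m in zip(lam, mod[1:]))

def outer_loop(solve_rl, grad_estimator, t, lam, eta_t, eta_lam,
               t_bounds, lam_max, beta=0.1, n_iters=100):
    """Projected SGDA over (t, lambda); each iteration calls the black-box RL solver once."""
    t, lam = np.asarray(t, dtype=float), np.asarray(lam, dtype=float)
    policy = None
    for _ in range(n_iters):
        # Inner subproblem: standard RL on the current Lagrangian-weighted reward.
        policy = solve_rl(lambda rewards: lagrangian_reward(rewards, t, lam, beta))
        # Stochastic gradients of the Lagrangian (a single trajectory suffices in theory).
        g_t, g_lam = grad_estimator(policy, t, lam)
        t = np.clip(t + eta_t * g_t, t_bounds[0], t_bounds[1])    # gradient ascent in t
        lam = np.clip(lam - eta_lam * g_lam, 0.0, lam_max)        # gradient descent in lambda
    return policy, t, lam
```

The inner call can be swapped for any existing implementation; the outer loop only needs the reward modification and the two gradient estimates.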

3. Approximate Optimality Guarantee (Theorem 3.5)

Under an \(\epsilon\)-universal policy parameterization:

\[P^*(t^*) \geq \sup_t \inf_\lambda \sup_\theta \mathcal{L}(\pi_\theta, t, \lambda) \geq P^*(t^*) - \mathcal{O}\left(\frac{\epsilon}{1-\gamma}\right)\]

The gap between the parametric partial dual and the true primal problem depends only on the policy parameterization error.

Convergence Analysis

Theorem 3.12: Under assumptions of Lipschitz smoothness, unbiased gradient oracles, and an approximate policy solver, the iteration complexity to recover an \(\epsilon\)-stationary point is:

\[\mathcal{O}\left(\frac{\ell^3(C^2+\sigma^2+\delta^2)(\text{diam}(\Lambda))^2 \hat{\Delta}_\Phi}{\epsilon^6}\right)\]

Key feature: only a single trajectory (\(n=1\)) is required, enabling online operation. When the inexact solver bias satisfies \(\delta = \mathcal{O}(\epsilon^2)\), exact \(\epsilon\)-stationary points are recovered.

Key Experimental Results

Main Results: Safe Navigation (Safety-Gymnasium)

Point agent at Level 1 difficulty, 5–10M training steps:

| Environment | PPO Cumulative Cost ↓ | MARS Cumulative Cost ↓ | PPO Reward ↑ | MARS Reward ↑ |
|---|---|---|---|---|
| Button | 150.76 | 0.0 | 24.29 | 2.58 |
| Circle | 206.74 | 0.0 | 60.18 | 39.19 |
| Goal | 45.09 | 0.0 | 21.89 | 13.56 |
| Push | 38.48 | 0.0 | 0.93 | 2.42 |

MARS achieves zero constraint violations across all environments and is the only PPO-based method to do so. In the Push environment, the constraint even helps the agent achieve a higher reward.

Ablation Study: Safe Velocity Constraint (MuJoCo-v4)

| Agent | Velocity Threshold \(c\) | \(\beta\)-upper Quantile | Converged \(t\) |
|---|---|---|---|
| HalfCheetah | 1.450 | 1.419 | 1.417 |
| Hopper | 0.373 | 0.370 | 0.370 |
| Swimmer | 0.228 | 0.248 | 0.207 |
| Walker2d | 1.171 | 1.133 | 1.122 |
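A hedged sketch of the check behind this table; `velocities` and `t_converged` are placeholders for logged evaluation quantities, and \(\beta = 0.1\) is an assumed level, not necessarily the paper's setting:

```python
import numpy as np

def beta_upper_quantile(velocities, beta=0.1):
    """Empirical beta-upper quantile of the evaluation velocity distribution."""
    return np.quantile(np.asarray(velocities, dtype=float), 1.0 - beta)

def check_t_match(velocities, t_converged, beta=0.1):
    """Gap between the converged auxiliary variable t and the beta-upper quantile."""
    q = beta_upper_quantile(velocities, beta)
    return q, t_converged, abs(q - t_converged)
```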

Key Findings

  1. \(t\) Aligns with CVaR: The converged \(t\) closely matches the \(\beta\)-upper quantile of the trained velocity distribution, validating correct operation of the CVaR constraint.
  2. Stable \(\lambda\) Convergence: The dual variable \(\lambda\) stabilizes and oscillates around a consistent value after sufficient training.
  3. Interpretable Policies: Constrained agents move more cautiously (more stable velocity), whereas unconstrained PPO agents exhibit high velocity variance.
  4. Stabilizing Evaluation Reward: Variance in evaluation reward decreases over training, demonstrating the effect of risk management.
  5. Only Zero-Violation Method: In navigation tasks, MARS is the only PPO-based method in the literature to achieve strictly zero constraint violations.

Highlights & Insights

  1. Reward-based vs. Return-based: The paper clearly argues how reward-based risk measures provide robustness simultaneously in value and time dimensions — a stronger safety guarantee than return-based approaches.
  2. Exact Duality: Unlike prior risk-constrained RL work (e.g., Chow et al.'s CVaR methods) that only yield approximate relaxations, this paper establishes exact equivalence under the constraint qualification — a significant theoretical contribution.
  3. High Practicality: The black-box wrapper design enables direct use of PPO/SAC/TD3 or any other RL algorithm, lowering the implementation barrier.
  4. Flexibility: Risk-neutral objectives can be freely combined with risk-averse constraints (as in the experiments), closely matching practical requirements.

Limitations & Future Work

  1. Full Strong Duality Unproven: Whether Assumption 3.4 holds unconditionally remains an open problem.
  2. High Computational Cost: Each \((t, \lambda)\) update requires approximately solving a policy optimization subproblem, making the method slower than risk-neutral approaches.
  3. Step-Size Sensitivity: Convergence is sensitive to the step sizes \(\eta_\lambda\) and \(\eta_t\), which require careful tuning.
  4. Artificial Action Noise: Experiments simulate risk by injecting 5% Gaussian noise into actions; real-world uncertainty may be more complex.
  5. CVaR Only Validated: Although the theory covers general OCE measures, experiments validate only CVaR.
Context and future directions:

  • Extends Bonetti et al.'s unconstrained reward-based risk-averse RL to the constrained setting.
  • Compared to Chow et al.'s return-based CVaR constrained RL: the partial Lagrangian relaxation proposed here is exact under the constraint qualification.
  • The convergence analysis framework builds on Lin et al.'s minimax optimization theory.
  • Potential extensions to multi-agent settings or hierarchical RL are worth exploring.

Rating

  • Novelty: ⭐⭐⭐⭐ (Reward-based constrained RL with exact duality is a significant theoretical contribution)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (Validated on navigation and velocity scenarios with detailed convergence curves)
  • Writing Quality: ⭐⭐⭐⭐⭐ (Clear theoretical exposition; Table 1's formulation comparison is immediately informative)
  • Value: ⭐⭐⭐⭐ (Strong in both theory and practice; modular design is highly practical)