
Risk-Sensitive Exponential Actor Critic

Conference: AAAI 2026 · arXiv: 2602.07202 · Area: Reinforcement Learning · Keywords: risk-sensitive RL, entropic risk measure, policy gradient, actor-critic, numerical stability

TL;DR

To address the high variance and numerical instability of policy gradients under the entropic risk measure, this paper derives a complete set of on/off-policy risk-sensitive policy gradient theorems and proposes the rsEAC algorithm, which achieves stable risk-sensitive continuous control via log-domain critic parameterization and gradient normalization-clipping mechanisms.

Background & Motivation

Standard RL optimizes expected return, but risk-aware decision-making is required in domains such as autonomous driving, robotics, and finance. The entropic risk measure is a widely used risk metric:

\[J^\beta(\pi_\theta) = \frac{1}{\beta} \log \mathbb{E}_{p_\pi(\tau)} \left[ e^{\beta \sum_t r_t} \right]\]
  • \(\beta > 0\): risk-seeking
  • \(\beta < 0\): risk-averse
  • \(\beta \to 0\): reduces to the standard risk-neutral objective
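For intuition, this objective can be estimated from a batch of sampled returns; below is a minimal NumPy/SciPy sketch (the helper name `entropic_risk` and the sample values are illustrative, not from the paper), using a log-sum-exp so the exponentiation stays numerically safe:

```python
import numpy as np
from scipy.special import logsumexp

def entropic_risk(returns, beta):
    """Monte Carlo estimate of (1/beta) * log E[exp(beta * G)] from sampled returns G."""
    returns = np.asarray(returns, dtype=np.float64)
    # logsumexp avoids overflow of exp(beta * G) for large |beta * G|
    return (logsumexp(beta * returns) - np.log(len(returns))) / beta

returns = np.array([10.0, 10.0, 10.0, -50.0])   # mostly good, one catastrophic outcome
print(entropic_risk(returns, beta=-0.5))        # risk-averse: dominated by the bad outcome (~ -47)
print(entropic_risk(returns, beta=+0.5))        # risk-seeking: close to the best outcomes (~ 9.4)
print(returns.mean())                           # risk-neutral baseline (beta -> 0): -5
```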

Core difficulties of existing methods:

  • Gradient estimation via the likelihood ratio trick requires complete trajectories and is scaled by the exponentiated return, resulting in high variance and numerical instability
  • The exponential value function \(Z^\beta(s,a) = e^{\beta Q(s,a)}\) is highly susceptible to overflow/underflow under function approximation
  • Existing model-free methods (e.g., R-AC) are limited to simple tasks and tabular settings

Method

Risk-Sensitive Policy Gradient Theorems

Theorem 1 (Stochastic Policy):

\[\nabla_\theta J^\beta = \frac{1}{\beta} \int_S \rho_\pi^*(s) \int_A \nabla_\theta \pi_\theta(a|s) \cdot e^{\beta(Q^\beta(s,a) - V^\beta(s))} \, da \, ds\]

where \(\rho_\pi^*\) is the state distribution under the exponentially twisted dynamics. The key distinction from the standard policy gradient theorem is that the Q-value weighting is replaced by the exponentiated advantage \(e^{\beta(Q^\beta(s,a) - V^\beta(s))}\), which introduces numerical risk.
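As a sanity check (this expansion is my own, not reproduced from the paper), a first-order expansion in \(\beta\) shows how the theorem recovers the standard policy gradient as \(\beta \to 0\):

\[\frac{1}{\beta} e^{\beta(Q^\beta(s,a) - V^\beta(s))} = \frac{1}{\beta} + \big(Q^\beta(s,a) - V^\beta(s)\big) + O(\beta)\]

Since \(\int_A \nabla_\theta \pi_\theta(a|s) \, da = \nabla_\theta \int_A \pi_\theta(a|s) \, da = 0\), the \(1/\beta\) term vanishes under the action integral, leaving the familiar advantage-weighted form \(\int_S \rho_\pi(s) \int_A \nabla_\theta \pi_\theta(a|s)\, \big(Q(s,a) - V(s)\big)\, da \, ds\) in the limit.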

Theorem 2 (Deterministic Policy):

\[\nabla_\theta J^\beta = \int_S \rho_\mu^*(s) \nabla_\theta \mu_\theta(s) \nabla_a Q^\beta_{\mu_\theta}(s,a) \big|_{a=\mu_\theta(s)} \, ds\]

This formulation avoids exponential terms and action integration, making it more suitable for practical use. Deterministic policy improvement under off-policy approximation is also established (Theorem 3).

Log-Domain Critic Parameterization

Directly learning \(Z_\psi(s,a) = e^{\beta Q(s,a)}\) causes numerical overflow/underflow. The key improvement (sketched in code after the list) is:

  • Parameterize as \(Z_\psi(s,a) = e^{Q_\psi(s,a)}\), where \(Q_\psi\) is a neural network
  • \(\frac{1}{\beta} Q_\psi(s,a)\) approximates the soft-value function
  • Gradients are computed in the log domain, yielding inherently greater stability
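A minimal PyTorch sketch of such a log-domain critic (the class name, architecture, and hidden sizes are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class LogDomainCritic(nn.Module):
    """Critic that outputs Q_psi(s, a) directly; the exponential value
    Z_psi(s, a) = exp(Q_psi(s, a)) is never materialized during learning."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))  # log-domain value Q_psi

# the risk-sensitive value estimate is recovered as Q(s, a) ~= Q_psi(s, a) / beta,
# so downstream quantities can stay in the log domain throughout.
```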

Gradient Stabilization Mechanisms

Gradients of the exponential TD loss take the form \(e^x(e^x - e^y)\), requiring stabilization (a code sketch follows the list below):

  1. Helper function: \(f(x,y) = (1-e^{y-x})\) or \((e^{x-y}-1)\), selected according to the sign of \(x-y\) so that this factor remains bounded in \([-1,1]\), ensuring stability
  2. Batch normalization: the exponential prefactor is normalized by subtracting the in-batch mean \(z\)
  3. Gradient clipping: the range of exponential arguments is restricted to prevent gradient explosion/vanishing
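A possible reconstruction of the stabilized gradient factor combining the three mechanisms (an illustrative sketch under my own reading, not the authors' exact implementation):

```python
import torch

def stabilized_exp_td_grad_factor(x, y, clip=10.0):
    """Stabilized form of the exponential TD gradient e^x * (e^x - e^y).
    x: log-domain prediction Q_psi(s, a); y: log-domain target (per batch element).
    Illustrative reconstruction of the three mechanisms described above."""
    cond = x >= y
    # 1) helper function: factor out the larger exponent so the remaining term is
    #    bounded: e^x(e^x - e^y) = e^{2x}(1 - e^{y-x}) = e^{x+y}(e^{x-y} - 1)
    bounded = torch.where(
        cond,
        1.0 - torch.exp(torch.clamp(y - x, max=0.0)),   # in [0, 1) when x >= y
        torch.exp(torch.clamp(x - y, max=0.0)) - 1.0,   # in (-1, 0) when x <  y
    )
    exponent = torch.where(cond, 2.0 * x, x + y)         # matching exponential prefactor

    # 2) batch normalization: subtract the in-batch mean so the prefactor's
    #    scale stays comparable across samples
    exponent = exponent - exponent.mean().detach()

    # 3) clipping of the exponential argument to prevent explosion / vanishing
    exponent = torch.clamp(exponent, min=-clip, max=clip)

    return torch.exp(exponent) * bounded
```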

rsEAC Algorithm

Built on the TD3 framework:

  • Twin critics: two \(Q_\psi\) networks; the minimum is taken when \(\beta > 0\) and the maximum when \(\beta < 0\) (to control the direction of overestimation)
  • Actor: deterministic policy \(\mu_\theta\), updated via the off-policy deterministic gradient
  • Exploration: Gaussian noise is added
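A minimal sketch of the \(\beta\)-dependent twin-critic selection and a DPG-style actor loss in the log domain (illustrative; the exact losses in the paper may differ):

```python
import torch

def twin_critic_log_value(q1, q2, beta):
    """Select the conservative log-domain value from the two target critics.
    Since Q_psi is approximately beta * Q, the ordering of Q_psi flips with the
    sign of beta: the minimum (beta > 0) or maximum (beta < 0) of Q_psi always
    corresponds to the smaller underlying return estimate Q."""
    return torch.minimum(q1, q2) if beta > 0 else torch.maximum(q1, q2)

# Actor update (Theorem 2 contains no exponential term), e.g. a DPG-style loss
# on the recovered value Q ~= Q_psi / beta:
#   actor_loss = -(critic(state, actor(state)) / beta).mean()
```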

Key Experimental Results

GridWorld (Tabular)

Validates the risk-modulation effect of \(\beta\):

  • \(\beta = -1\): the risk-averse policy detours around cliff regions
  • \(\beta = 1\): the risk-seeking policy takes the shortest path along the cliff edge
  • Large \(|\beta|\) causes exponential value function overflow, confirming the necessity of stabilization

Inverted Pendulum

  • rsEAC learns high-return policies under both \(\beta=1\) and \(\beta=-1\)
  • R-AC yields poor policy quality and lacks risk sensitivity due to numerical instability

MuJoCo Risk Variants (Swimmer / HalfCheetah / Ant)

Stochastic noise risk regions are introduced (\(\mathcal{N}(0,10^2)\) or \(\mathcal{N}(0,7^2)\)):

| Method | Risk Behavior | Mean Return Performance |
| --- | --- | --- |
| rsEAC | Low risk-region visitation | Comparable to MVPI |
| R-AC | Numerically unstable | Worst across all tasks |
| MVPI | Low risk-region visitation | Weaker on Ant |
| MG (PPO) | Moderate risk aversion | Moderate |
  • rsEAC outperforms R-AC on all tasks, validating the critical role of the stabilization mechanism
  • rsEAC outperforms MVPI on the high-dimensional Ant task

Stability Comparison (CartPole)

  • Direct learning of \(Z_\psi\): value function estimates overflow/underflow; only 1 out of 4 \(\beta\) settings learns the optimal policy
  • \(Q_\psi\) (proposed) + gradient normalization-clipping: optimal policies are learned under all \(\beta\) settings

Highlights & Insights

  • Complete theoretical framework: the first work to derive all four policy gradient theorems for the entropic risk measure across on/off-policy × stochastic/deterministic settings
  • Practical stabilization scheme: log-domain parameterization + batch normalization + clipping addresses the long-standing challenge of learning exponential value functions
  • Tunable risk: a single parameter \(\beta\) controls the degree of risk-seeking or risk-aversion
  • First model-free risk-sensitive actor-critic capable of handling complex continuous control tasks

Limitations & Future Work

  • The inherent instability of the exponential function cannot be fully eliminated; failures may still occur under extreme \(\beta\) values
  • The risk parameter \(\beta\) requires task-specific tuning, with no adaptive mechanism provided
  • Validation is limited to MuJoCo continuous control tasks; safety-critical scenarios remain untested

Rating

  • Novelty: ⭐⭐⭐⭐ — Rigorous derivation of policy gradient theorems with clear engineering contributions in stabilization
  • Experimental Thoroughness: ⭐⭐⭐ — Reasonable coverage of tabular and continuous control settings, though task diversity is somewhat limited
  • Writing Quality: ⭐⭐⭐⭐ — Theory is presented clearly, and numerical issues are intuitively visualized
  • Value: ⭐⭐⭐⭐ — Paves the way for practical deployment of risk-sensitive RL