Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning

Metadata

Conference: ICLR 2026
arXiv: 2602.15817
Code: https://oswinso.xyz/fge
Area: Reinforcement Learning
Keywords: safe RL, Hamilton-Jacobi, robust optimization, feasibility, curriculum learning, MuJoCo

TL;DR

This paper proposes Feasibility-Guided Exploration (FGE), which simultaneously identifies the feasible parameter subset and learns a safe policy over that subset, addressing parameter-robust avoid problems with unknown feasibility. FGE covers more than 50% additional safe parameters compared to the best existing methods on MuJoCo tasks.

Background & Motivation

Hamilton-Jacobi (HJ) safety control is a powerful tool for computing the maximal safe initial state set, but classical methods are limited by the curse of dimensionality.
Deep RL for HJ approximation: RL can be used to learn approximate optimal control policies, but a fundamental mismatch exists between RL's expected-return objective and worst-case safety requirements — performance may be poor on low-probability yet safety-critical states.
Robust optimization approaches (e.g., RARL) apply worst-case optimization over the set of initial conditions, but presuppose that this set is feasible (i.e., a safe policy exists). Including infeasible parameters causes all policies to perform equally poorly, leading to degenerate solutions.
Core difficulty: The feasible parameter set is unknown a priori; determining it is itself the goal of HJ analysis.
Illustrative example: In autonomous driving, blizzard conditions combined with high speed may be inherently unsafe regardless of the policy; training on such impossible scenarios prevents the model from learning well under normal conditions.

Method

Overall Architecture

FGE simultaneously accomplishes two objectives:
1. Identifying the maximal feasible parameter subset $\Theta^* \subseteq \Theta$ (i.e., the parameters for which a safe policy exists).
2. Learning a robust policy that is safe over $\Theta^*$.

Key Design 1: Feasibility Classifier

The core challenge is label asymmetry: observing a safe rollout for a parameter $\theta$ confirms that $\theta$ is feasible, whereas observing a violation does not confirm infeasibility (the current policy may simply be poor).

A mixed distribution is constructed to train the classifier:

\[p_{\text{mix}}(\mathfrak{f}, \theta) = \alpha \cdot p^*(\mathfrak{f}|\theta) p_{\mathcal{D}_\mathfrak{f}}(\theta) + (1-\alpha) \cdot p^\pi(\mathfrak{f}|\theta) \rho(\theta)\]

First term: parameters confirmed safe (reliable positive labels).
Second term: online samples under the current policy (may contain false negatives).
Variational inference is used to fit $q_\psi(\mathfrak{f}=1|\theta)$, which is thresholded to obtain the classifier.

Theoretical guarantees:
- Zero false positives: infeasible parameters are never labeled as feasible.
- Controllable false negative rate: regulated via $\alpha$ and $\rho$.
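
The mixture construction can be illustrated with a minimal sketch. It assumes a one-dimensional parameter, a toy "current policy" that is only safe for small $\theta$, and plain maximum-likelihood logistic regression as a stand-in for the variational fit of $q_\psi$. All function names, the toy data, and the choice $\alpha = 0.5$ are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_mixture(rng, safe_buffer, policy_thetas, policy_labels, alpha, n):
    """Draw n (theta, label) pairs from the alpha-weighted mixture."""
    from_buffer = rng.random(n) < alpha
    n_buf = int(from_buffer.sum())
    thetas = np.empty(n)
    labels = np.empty(n)
    # First term: parameters already confirmed safe, so the label is always 1.
    thetas[from_buffer] = rng.choice(safe_buffer, size=n_buf)
    labels[from_buffer] = 1.0
    # Second term: on-policy outcomes; a 0 here may be a false negative.
    idx = rng.integers(0, len(policy_thetas), size=n - n_buf)
    thetas[~from_buffer] = policy_thetas[idx]
    labels[~from_buffer] = policy_labels[idx]
    return thetas, labels

def fit_logistic(thetas, labels, steps=2000, lr=0.5):
    """Fit q(f=1 | theta) = sigmoid(w * theta + b) by gradient descent on cross-entropy."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * thetas + b)))
        err = p - labels
        w -= lr * np.mean(err * thetas)
        b -= lr * np.mean(err)
    return w, b

def q_feasible(theta, w, b):
    """Estimated probability that theta is feasible."""
    return 1.0 / (1.0 + np.exp(-(w * theta + b)))

rng = np.random.default_rng(0)
safe_buffer = rng.uniform(0.0, 0.5, size=500)        # toy: parameters ever observed safe
policy_thetas = rng.uniform(0.0, 1.0, size=500)      # toy: samples from rho
policy_labels = (policy_thetas < 0.3).astype(float)  # toy: current policy is safe only here
thetas, labels = sample_mixture(
    rng, safe_buffer, policy_thetas, policy_labels, alpha=0.5, n=4000
)
w, b = fit_logistic(thetas, labels)
print(q_feasible(0.1, w, b), q_feasible(0.9, w, b))
```

In this toy setup the buffer term keeps parameters in $[0.3, 0.5]$ from being written off even though the current policy fails there, which is the role the first mixture term plays. The sketch does not demonstrate the paper's zero-false-positive guarantee.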

Key Design 2: Saddle-Point Optimization

Robust safety control is formulated as a maximin problem. An online-learning saddle-point approach is adopted in place of RARL's adversarial policy:

\[\begin{aligned}
\pi_{t+1} &= \arg\max_\pi \mathbb{E}_{\theta \sim \mathcal{D}_{\theta,t}}[J(\pi, \theta)] \\
\theta_{t+1} &= \arg\min_{\theta} J(\pi_t, \theta), \quad \theta \sim p(\cdot \mid \theta \in \Theta^*)
\end{aligned}\]

Follow-the-Regularized-Leader (FTRL) is combined with PPO for policy updates, and a rehearsal buffer stores historical worst-case parameters.
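
The alternating update can be sketched on a toy problem. The sketch below assumes a finite matrix game in place of PPO: `J[i, j]` is the return of (hypothetical) policy `i` under parameter `j`, the policy player runs entropy-regularized FTRL over mixed policies, and the parameter player picks the worst feasible parameter, which is appended to a rehearsal buffer. The payoff matrix, step size, and function name are illustrative assumptions.

```python
import numpy as np

def ftrl_vs_best_response(J, feasible, steps=500, eta=0.1):
    """Entropy-regularized FTRL for the policy player against a worst-case parameter player.

    J[i, j]: return of policy i under parameter j (0 = safe, -1 = violation).
    feasible: boolean mask selecting the parameters the adversary may pick.
    """
    n_pi = J.shape[0]
    cols = np.flatnonzero(feasible)
    cum = np.zeros(n_pi)   # cumulative return against the buffered parameters
    buffer = []            # historical worst-case parameter indices
    avg = np.zeros(n_pi)
    for _ in range(steps):
        # FTRL with an entropy regularizer: softmax of cumulative returns.
        logits = eta * cum
        p = np.exp(logits - logits.max())
        p /= p.sum()
        # Parameter player: worst feasible parameter for the current mixed policy.
        j = int(cols[np.argmin(p @ J[:, cols])])
        buffer.append(j)
        cum += J[:, j]
        avg += p
    return avg / steps, buffer

# Toy game: parameters 0 and 1 are feasible, parameter 2 is unsafe for every policy.
J = np.array([[0.0, -1.0, -1.0],
              [-1.0, 0.0, -1.0]])

p_masked, buf_masked = ftrl_vs_best_response(J, np.array([True, True, False]))
p_all, buf_all = ftrl_vs_best_response(J, np.array([True, True, True]))
print(p_masked, sorted(set(buf_masked)), sorted(set(buf_all)))
```

When the infeasible parameter is left in the adversary's set, it is selected at every step and both policies receive the same return, so the buffer carries no signal that distinguishes them. This is the degenerate behaviour that restricting the adversary to $\Theta^*$ is meant to avoid.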

Key Design 3: Exploration Distribution Expansion

FGE decomposes the sampling distribution into three components (Figure 1):
- Base distribution: standard parameter sampling.
- Exploration distribution: upweights parameters not yet observed to be safe.
- Rehearsal distribution: samples previously solved parameters that may have regressed (approximate best response).

The three components are combined to balance maximizing safety rate gains and minimizing safety rate losses.
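
One way to picture the combination is as a weighted mixture over a discretized parameter grid. The sketch below is an illustration only: the mixture weights, the boolean bookkeeping arrays, and the function names are hypothetical, and the paper's actual component definitions may differ.

```python
import numpy as np

def normalise(mask):
    """Uniform distribution over the True entries, or all zeros if there are none."""
    mask = np.asarray(mask, dtype=float)
    total = mask.sum()
    return mask / total if total > 0 else mask

def sampling_distribution(observed_safe, regressed,
                          w_base=0.4, w_explore=0.4, w_rehearse=0.2):
    """Mix base, exploration, and rehearsal components over a parameter grid."""
    observed_safe = np.asarray(observed_safe, dtype=bool)
    regressed = np.asarray(regressed, dtype=bool)
    n = len(observed_safe)
    base = np.full(n, 1.0 / n)            # standard sampling
    explore = normalise(~observed_safe)   # not yet observed to be safe
    rehearse = normalise(regressed)       # solved before, failing again
    mix = w_base * base + w_explore * explore + w_rehearse * rehearse
    return mix / mix.sum()                # renormalise if a component was empty

observed_safe = [True, True, False, False, False]
regressed = [False, True, False, False, False]
probs = sampling_distribution(observed_safe, regressed)
rng = np.random.default_rng(0)
batch = rng.choice(len(probs), size=8, p=probs)  # parameter indices for the next rollouts
print(probs, batch)
```

Under these assumed weights, a parameter that is solved and stable (index 0) is sampled least often, while unexplored and regressed parameters receive more of the rollout budget.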

Loss & Training

Policy training uses a standard RL objective (PPO) with a negative indicator reward:

\[r_k = -\mathbb{1}\{h_\theta(\boldsymbol{s}_k) > 0\}\]

A safe step yields a reward of 0; entering an unsafe state (where $h_\theta > 0$) yields a reward of −1; episodes terminate upon the first constraint violation.
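
A small sketch of this reward and termination rule, assuming an undiscounted return and a precomputed sequence of constraint values (the function name and inputs are hypothetical):

```python
def rollout_return(h_values):
    """Sum the indicator reward along a rollout, stopping at the first violation.

    h_values: sequence of constraint values h_theta(s_k); a value > 0 means unsafe.
    Returns (undiscounted return, index of the violating step or None).
    """
    total = 0.0
    for k, h in enumerate(h_values):
        unsafe = h > 0
        total += -1.0 if unsafe else 0.0
        if unsafe:
            return total, k  # episode terminates on the first violation
    return total, None

safe_result = rollout_return([-1.0, -2.0])
unsafe_result = rollout_return([-1.0, -0.5, 0.2, 3.0])
print(safe_result, unsafe_result)
```

Because the episode ends at the first violation, the undiscounted return is either 0 or −1, so under this assumption the expected return for a given $\theta$ equals the negative probability of a violation.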

Key Experimental Results

Main Results: MuJoCo Safety Coverage

Environment	Domain Randomization	RARL	FGE (Ours)	Relative Gain vs. RARL
Ant (avoid)	~40%	~45%	~70%	+56%
Humanoid (avoid)	~35%	~40%	~65%	+63%
HalfCheetah	~50%	~55%	~78%	+42%

In relative terms, FGE covers more than 50% more safe parameters than the best baseline (RARL) on Ant and Humanoid, and roughly 42% more on HalfCheetah.
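
The gain figures in the table are consistent with a relative improvement over the RARL column. The short check below recomputes them from the approximate coverage values listed above; the helper name is hypothetical, and since the inputs are approximate the results are indicative only.

```python
# Approximate safety coverage (percent) taken from the table: (RARL, FGE).
coverage = {
    "Ant (avoid)": (45, 70),
    "Humanoid (avoid)": (40, 65),
    "HalfCheetah": (55, 78),
}

def relative_gain_percent(baseline, ours):
    """Relative gain in percent, rounded half up using integer arithmetic."""
    return (200 * (ours - baseline) + baseline) // (2 * baseline)

gains = {env: relative_gain_percent(b, o) for env, (b, o) in coverage.items()}
print(gains)  # {'Ant (avoid)': 56, 'Humanoid (avoid)': 63, 'HalfCheetah': 42}
```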

Ablation Study: Component Contributions

Ablation Setting	Safety Coverage	Note
FGE (full)	~70%	Reference
w/o feasibility classifier	~50%	Infeasible parameters interfere with training
w/o exploration distribution	~55%	Insufficient exploration
w/o rehearsal distribution	~60%	Learned skills regress
Density model replacing classifier	~58%	Inferior to mixed-distribution classifier

Key Findings

Standard domain randomization and RARL are severely limited when parameter feasibility is unknown.
The zero-false-positive guarantee of the feasibility classifier is critical for training stability.
FTRL converges more stably on saddle-point problems than gradient descent-ascent (GDA), the method that RARL approximates.
Balancing the exploration and rehearsal distributions is indispensable for continuously expanding the safe set.

Highlights & Insights

Novel problem formulation: Parameter-robust avoidance with unknown feasibility fills an important gap between safe RL and HJ analysis.
Positive-only label learning: The asymmetric label problem (only positive labels are reliable) is handled elegantly with theoretical guarantees of zero false positives.
Online learning perspective: Replacing unstable adversarial RL with a saddle-point method yields stronger theoretical guarantees.
Practical three-distribution sampling: The base + explore + rehearse combination draws inspiration from curriculum learning and online learning.

Limitations & Future Work

Theoretical convergence guarantees rely on assumptions such as convex-concavity and exact best responses, which are not fully satisfied in practice.
The accuracy of the feasibility classifier in high-dimensional parameter spaces requires further validation.
Only deterministic dynamics are considered; extensions to stochastic systems are not discussed.
MuJoCo environments still exhibit a gap relative to real robotic systems.

Related Work

HJ safety control: DeepReach (Bansal et al., 2021), ISAACS (Hsu et al., 2023), So et al. (2024)
Robust RL: RARL (Pinto et al., 2017), WCSAC (Yang et al., 2021)
Unsupervised environment design (UED): PAIRED (Dennis et al., 2020), PLR (Jiang et al., 2021)
Safe RL: Constrained MDP (Altman, 1999), SauteRL (Sootla et al., 2022)

Rating

Novelty: ⭐⭐⭐⭐⭐ — Novel problem formulation; robust safety control with unknown feasibility.
Theoretical Depth: ⭐⭐⭐⭐ — Classifier guarantees and saddle-point convergence analysis.
Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple MuJoCo environments with detailed ablations.
Writing Quality: ⭐⭐⭐⭐ — Direct relevance to safe robot deployment.