# Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning
## Metadata
- Conference: ICLR 2026
- arXiv: 2602.15817
- Code: https://oswinso.xyz/fge
- Area: Reinforcement Learning
- Keywords: safe RL, Hamilton-Jacobi, robust optimization, feasibility, curriculum learning, MuJoCo
## TL;DR
This paper proposes Feasibility-Guided Exploration (FGE), which simultaneously identifies the feasible parameter subset and learns a safe policy over that subset, addressing parameter-robust avoid problems with unknown feasibility. FGE covers more than 50% additional safe parameters compared to the best existing methods on MuJoCo tasks.
## Background & Motivation
- Hamilton-Jacobi (HJ) safety control is a powerful tool for computing the maximal safe initial state set, but classical methods are limited by the curse of dimensionality.
- Deep RL for HJ approximation: RL can be used to learn approximate optimal control policies, but a fundamental mismatch exists between RL's expected-return objective and worst-case safety requirements — performance may be poor on low-probability yet safety-critical states.
- Robust optimization approaches (e.g., RARL) apply worst-case optimization over the set of initial conditions, but presuppose that this set is feasible (i.e., a safe policy exists). Including infeasible parameters causes all policies to perform equally poorly, leading to degenerate solutions.
- Core difficulty: Determining the feasible parameter set is itself the goal of HJ analysis — it is unknown a priori.
- Illustrative example: In autonomous driving, blizzard conditions combined with high speed may be inherently unsafe regardless of the policy; training on such impossible scenarios prevents the model from learning well under normal conditions.
## Method
### Overall Architecture
FGE pursues two objectives simultaneously:

1. Identifying the maximal feasible parameter subset \(\Theta^* \subseteq \Theta\), i.e., the parameters for which a safe policy exists.
2. Learning a robust policy that is safe over all of \(\Theta^*\).
### Key Design 1: Feasibility Classifier
The core challenge is label asymmetry: observing safety confirms feasibility; observing unsafety does not confirm infeasibility (it may simply reflect a poor policy).
The classifier is trained on a two-component mixture distribution:

- Parameters confirmed safe by observed rollouts, which provide reliable positive labels.
- Online samples under the current policy, whose unsafe outcomes may be false negatives (a violation may reflect a poor policy rather than true infeasibility).

Variational inference is used to fit \(q_\psi(\mathfrak{f}=1\mid\theta)\), which is thresholded to obtain the classifier.
Theoretical guarantees:

- Zero false positives: infeasible parameters are never labeled as feasible.
- Controllable false negative rate: regulated via \(\alpha\) and \(\rho\).
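The label asymmetry above can be illustrated with a simplified stand-in for the paper's variational fit: a weighted logistic classifier trained on the two-component mixture, with confirmed-safe parameters as reliable positives and online samples as noisy negatives. The function names, the weighting scheme, and the default values of `alpha` and `rho` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_feasibility_classifier(theta_safe, theta_online, alpha=0.5,
                                 lr=0.1, steps=500):
    """Fit a logistic model q_psi(f=1 | theta) on a two-component mixture:
    confirmed-safe parameters (reliable positive labels, weight alpha) and
    online samples under the current policy (weight 1 - alpha)."""
    X = np.concatenate([theta_safe, theta_online])
    # Confirmed-safe parameters get label 1. Online rollouts that violated
    # safety get label 0 even though some may be false negatives: the
    # violation may reflect a poor policy, not true infeasibility.
    y = np.concatenate([np.ones(len(theta_safe)), np.zeros(len(theta_online))])
    w = np.concatenate([np.full(len(theta_safe), alpha),
                        np.full(len(theta_online), 1.0 - alpha)])
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append bias column
    psi = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = sigmoid(Xb @ psi)
        grad = Xb.T @ (w * (p - y)) / len(X)    # weighted log-loss gradient
        psi -= lr * grad
    return psi

def is_feasible(theta, psi, rho=0.5):
    """Threshold the classifier at rho to get the feasible-set estimate."""
    return sigmoid(np.append(theta, 1.0) @ psi) >= rho
```

On a 1-D toy problem where parameters below zero are feasible, the thresholded classifier recovers the feasible half-line; `rho` trades off how aggressively uncertain parameters are declared feasible.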
### Key Design 2: Saddle-Point Optimization
Robust safety control is formulated as a maximin problem over policies and environment parameters. In place of RARL's learned adversarial policy, an online-learning saddle-point approach is adopted: Follow-the-Regularized-Leader (FTRL) drives the adversary's parameter choice, PPO performs the policy updates, and a rehearsal buffer stores historical worst-case parameters.
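As a minimal sketch of the adversary's side of the saddle point: over a finite grid of environment parameters, FTRL with an entropy regularizer reduces to an exponential-weights update, so the adversary's sampling distribution is a softmax of cumulative policy losses and concentrates on parameters where the current policy is still unsafe. The class name, the discrete-grid simplification, and the learning rate `eta` are assumptions for illustration; the paper operates in a continuous parameter space.

```python
import numpy as np

class FTRLAdversary:
    """Follow-the-Regularized-Leader over a finite grid of environment
    parameters. With an entropy regularizer, FTRL reduces to exponential
    weights: the sampling distribution is a softmax of cumulative losses,
    so it concentrates on parameters where the policy keeps failing."""

    def __init__(self, n_params, eta=1.0):
        self.cum_loss = np.zeros(n_params)  # cumulative unsafety per parameter
        self.eta = eta

    def distribution(self):
        z = self.eta * self.cum_loss
        z -= z.max()                        # numerical stability
        p = np.exp(z)
        return p / p.sum()

    def sample(self, rng):
        return rng.choice(len(self.cum_loss), p=self.distribution())

    def update(self, param_idx, unsafe):
        # `unsafe` is 1 if the rollout violated safety, 0 otherwise.
        self.cum_loss[param_idx] += unsafe
```

Unlike gradient descent-ascent, this no-regret update averages over the adversary's history rather than chasing the single current worst case, which is the source of the stability advantage over RARL.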
### Key Design 3: Exploration Distribution Expansion
FGE decomposes the sampling distribution into three components (Figure 1):

- Base distribution: standard parameter sampling.
- Exploration distribution: upweights parameters not yet observed to be safe.
- Rehearsal distribution: revisits previously solved parameters that may have regressed (approximate best response).
The three components are combined to balance maximizing safety rate gains and minimizing safety rate losses.
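The three-component mixture can be sketched as a single sampling routine. The function name, the mixture weights, and the `frontier`/`solved` bookkeeping are illustrative assumptions; the paper's actual weights and distributions differ.

```python
import numpy as np

def sample_training_params(rng, base_sampler, frontier, solved,
                           w_base=0.5, w_explore=0.3, w_rehearse=0.2):
    """Draw one environment parameter from a three-component mixture:
    - base: the nominal parameter distribution,
    - explore: frontier parameters not yet observed to be safe,
    - rehearse: previously solved parameters, replayed to catch regressions.
    Mixture weights are illustrative, not the paper's values."""
    u = rng.random()
    if u < w_base or (not frontier and not solved):
        return base_sampler(rng)
    if u < w_base + w_explore and frontier:
        return frontier[rng.integers(len(frontier))]
    if solved:
        return solved[rng.integers(len(solved))]
    return base_sampler(rng)
```

The explore component drives safety-rate gains by pushing into unproven parameters, while the rehearse component bounds safety-rate losses by detecting regressions on parameters already certified safe.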
### Loss & Training
Policy training uses a standard RL objective (PPO) with a negative indicator reward:
Safety yields a reward of 0; entering an unsafe state yields a reward of −1; episodes terminate upon the first constraint violation.
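This reward structure can be expressed as a thin environment wrapper. The Gymnasium-style `step()` signature and the `is_unsafe(state)` predicate are assumed interfaces for illustration, not the paper's code.

```python
class AvoidRewardWrapper:
    """Wraps an environment so the return matches the avoid objective:
    reward 0 while safe, -1 on entering the unsafe set, and termination
    at the first constraint violation. Assumes `env` exposes a
    Gymnasium-style step() and an `is_unsafe(state)` predicate."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        state, _, terminated, truncated, info = self.env.step(action)
        if self.env.is_unsafe(state):
            # First constraint violation: reward -1, episode ends.
            return state, -1.0, True, truncated, info
        return state, 0.0, terminated, truncated, info
```

Under this reward, the undiscounted return is 0 if and only if the trajectory stays safe, so maximizing expected return over the worst-case parameter directly targets the avoid objective.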
## Key Experimental Results
### Main Results: MuJoCo Safety Coverage
| Environment | Domain Rand. | RARL | FGE (Ours) | Gain |
|---|---|---|---|---|
| Ant (avoid) | ~40% | ~45% | ~70% | +56% |
| Humanoid (avoid) | ~35% | ~40% | ~65% | +63% |
| HalfCheetah | ~50% | ~55% | ~78% | +42% |
FGE covers over 50% more safe parameters than the best baseline on the hardest avoid tasks (Ant, Humanoid), and over 40% more on HalfCheetah.
### Ablation Study: Component Contributions
| Ablation Setting | Safety Coverage | Note |
|---|---|---|
| FGE (full) | ~70% | Reference |
| w/o feasibility classifier | ~50% | Infeasible parameters interfere with training |
| w/o exploration distribution | ~55% | Insufficient exploration |
| w/o rehearsal distribution | ~60% | Learned skills regress |
| Density model replacing classifier | ~58% | Inferior to mixed-distribution classifier |
### Key Findings
- Standard domain randomization and RARL are severely limited when parameter feasibility is unknown.
- The zero-false-positive guarantee of the feasibility classifier is critical for training stability.
- FTRL converges more stably than gradient descent-ascent (GDA, the update that RARL approximates) on saddle-point problems.
- Balancing the exploration and rehearsal distributions is indispensable for continuously expanding the safe set.
## Highlights & Insights
- Novel problem formulation: Parameter-robust avoidance with unknown feasibility fills an important gap between safe RL and HJ analysis.
- Positive-only label learning: The asymmetric label problem (only positive labels are reliable) is handled elegantly with theoretical guarantees of zero false positives.
- Online learning perspective: Replacing unstable adversarial RL with a saddle-point method yields stronger theoretical guarantees.
- Practical three-distribution sampling: The base + explore + rehearse combination draws inspiration from curriculum learning and online learning.
## Limitations & Future Work
- Theoretical convergence guarantees rely on assumptions such as convex-concavity and exact best responses, which are not fully satisfied in practice.
- The accuracy of the feasibility classifier in high-dimensional parameter spaces requires further validation.
- Only deterministic dynamics are considered; extensions to stochastic systems are not discussed.
- MuJoCo environments still exhibit a gap relative to real robotic systems.
## Related Work & Insights
- HJ safety control: DeepReach (Bansal et al., 2021), ISAACS (Hsu et al., 2023), So et al. (2024)
- Robust RL: RARL (Pinto et al., 2017), WCSAC (Yang et al., 2021)
- Unsupervised environment design (UED): PAIRED (Dennis et al., 2020), PLR (Jiang et al., 2021)
- Safe RL: Constrained MDP (Altman, 1999), SauteRL (Sootla et al., 2022)
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — Novel problem formulation; robust safety control with unknown feasibility.
- Theoretical Depth: ⭐⭐⭐⭐ — Classifier guarantees and saddle-point convergence analysis.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple MuJoCo environments with detailed ablations.
- Practical Value: ⭐⭐⭐⭐ — Direct relevance to safe robot deployment.