Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm¶

Conference: NeurIPS 2025
arXiv: 2509.23135
Code: Available
Area: Reinforcement Learning
Keywords: Inverse Reinforcement Learning, Trust Region, Reward Learning, Non-Adversarial IRL, Monotonic Improvement

TL;DR¶

This paper proposes the TRRO theoretical framework and the PIRO practical algorithm, which guarantee monotonic improvement of reward function updates in IRL via a Minorization-Maximization procedure, achieving stability guarantees analogous to those of TRPO/PPO in forward RL.

Background & Motivation¶

Inverse reinforcement learning (IRL) learns reward functions from expert demonstrations. Modern IRL methods fall into two main paradigms:

Adversarial IRL (e.g., GAIL, AIRL): models reward learning as a minimax game, alternating between reward and policy optimization. Theoretically elegant but practically unstable and sensitive to hyperparameters.

Non-adversarial IRL (e.g., SQIL, IQ-Learn, ML-IRL): couples reward and policy via energy-based models and performs joint updates. Empirically more stable, but lacks principled control over reward updates — with no guarantee that each update step moves in the correct direction.

The paper's key observation is that existing non-adversarial IRL methods all essentially maximize the likelihood of expert behavior (equivalently, minimize the imitation gap). This unified perspective motivates the core idea: if every update step is guaranteed to increase the likelihood, stable IRL training becomes achievable.

This is a perfect symmetric counterpart to TRPO in forward RL:
- TRPO guarantees monotonic policy improvement under a fixed reward.
- TRRO guarantees monotonic reward improvement given expert behavior.

The paper claims to fill the "right half of this symmetric picture."

Method¶

Overall Architecture¶

TRRO/PIRO follows the non-adversarial, explicit reward (ER) learning paradigm:
1. Unified perspective: proves that SQIL, IQ-Learn, f-IRL, and ML-IRL all optimize expert behavior likelihood.
2. Theoretical contribution: the TRRO framework guarantees monotonic improvement in inverse reward optimization via the MM algorithm.
3. Practical algorithm: PIRO realizes TRRO through adaptive regularization and approximate policy optimization.

Key Designs¶

Equivalent form of the likelihood objective (Proposition 1):
- The log-likelihood of ML-IRL: \(\ell(\boldsymbol{\theta}) = \mathbb{E}_{\rho^{\pi_E}}[\log \pi_{\boldsymbol{\theta}}(\mathbf{a}|\mathbf{s})]\)
- Equivalent to the negative imitation gap, so maximizing the likelihood minimizes the gap: \(\ell(\boldsymbol{\theta}) = J(\pi_E, r_{\boldsymbol{\theta}}) - J(\pi_{\boldsymbol{\theta}}, r_{\boldsymbol{\theta}})\)
- The gradient is the difference of reward gradients under two occupancy measures: \(\nabla_{\boldsymbol{\theta}} \ell = \mathbb{E}_{\rho^{\pi_E}}[\nabla r_{\boldsymbol{\theta}}] - \mathbb{E}_{\rho^{\pi_{\boldsymbol{\theta}}}}[\nabla r_{\boldsymbol{\theta}}]\)
- This bypasses the inner RL loop, reducing nested optimization to a single-loop procedure.
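As a rough illustration of the gradient in Proposition 1, the sketch below assumes a linear reward \(r_{\boldsymbol{\theta}}(s,a) = \boldsymbol{\theta}^\top \phi(s,a)\), so that \(\nabla r_{\boldsymbol{\theta}} = \phi(s,a)\) and the gradient reduces to a difference of mean feature vectors. The linear parameterization, the function names, and the toy feature batches are assumptions made for this example and are not taken from the paper.

```python
def mean_features(batch):
    """Average a batch of feature vectors phi(s, a), given as lists of floats."""
    dim = len(batch[0])
    return [sum(phi[i] for phi in batch) / len(batch) for i in range(dim)]


def likelihood_gradient(expert_features, policy_features):
    """Estimate E_expert[grad r] - E_policy[grad r] for r_theta = theta . phi."""
    mean_expert = mean_features(expert_features)
    mean_policy = mean_features(policy_features)
    return [e - p for e, p in zip(mean_expert, mean_policy)]


# Toy 2-D feature batches (made-up numbers, not data from the paper).
expert_features = [[1.0, 0.0], [1.0, 1.0]]
policy_features = [[0.0, 0.0], [0.0, 1.0]]
grad = likelihood_gradient(expert_features, policy_features)
print(grad)
```

Here the expert visits the first feature more often than the current policy does, so the gradient pushes the reward weight on that feature upward and leaves the second weight unchanged.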
Trust Region Reward Optimization (TRRO, Theorem 3):
- Introduces a surrogate function \(\ell_{\boldsymbol{\theta}_{\text{old}}}(\boldsymbol{\theta})\): the imitation gap computed using the old policy \(\pi_{\text{old}}\) in place of the new policy.
- Proposition 2 shows that the surrogate matches the original objective to first order at \(\boldsymbol{\theta}_{\text{old}}\).
- Theorem 3 establishes a lower bound: \(\ell(\boldsymbol{\theta}_{\text{new}}) \geq \ell_{\boldsymbol{\theta}_{\text{old}}}(\boldsymbol{\theta}_{\text{new}}) - C\epsilon_{\boldsymbol{\theta}_{\text{old}}}(\boldsymbol{\theta}_{\text{new}})\)
- where \(\epsilon = \max_{s,a} |r_{\boldsymbol{\theta}_{\text{new}}} - r_{\boldsymbol{\theta}_{\text{old}}}|\) measures the reward change.
- Maximizing this lower bound guarantees that \(\ell\) is monotonically non-decreasing (Corollary 4).
- This constitutes an MM algorithm: the surrogate minorizes the original objective and is tangent to it at \(\boldsymbol{\theta}_{\text{old}}\).
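The monotonicity claim can be sketched as a standard MM chain. The symbol \(M_{\boldsymbol{\theta}_{\text{old}}}\) for the penalized surrogate is notation introduced here for illustration, and the last equality assumes that the surrogate agrees in value with \(\ell\) at \(\boldsymbol{\theta}_{\text{old}}\), where the reward change \(\epsilon\) is zero:

\[
M_{\boldsymbol{\theta}_{\text{old}}}(\boldsymbol{\theta}) = \ell_{\boldsymbol{\theta}_{\text{old}}}(\boldsymbol{\theta}) - C\,\epsilon_{\boldsymbol{\theta}_{\text{old}}}(\boldsymbol{\theta}), \qquad \boldsymbol{\theta}_{\text{new}} \in \arg\max_{\boldsymbol{\theta}} M_{\boldsymbol{\theta}_{\text{old}}}(\boldsymbol{\theta})
\]

\[
\ell(\boldsymbol{\theta}_{\text{new}}) \;\geq\; M_{\boldsymbol{\theta}_{\text{old}}}(\boldsymbol{\theta}_{\text{new}}) \;\geq\; M_{\boldsymbol{\theta}_{\text{old}}}(\boldsymbol{\theta}_{\text{old}}) \;=\; \ell_{\boldsymbol{\theta}_{\text{old}}}(\boldsymbol{\theta}_{\text{old}}) - C \cdot 0 \;=\; \ell(\boldsymbol{\theta}_{\text{old}})
\]

The first inequality is the lower bound of Theorem 3, and the second holds because \(\boldsymbol{\theta}_{\text{new}}\) maximizes \(M_{\boldsymbol{\theta}_{\text{old}}}\).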
Proximal Inverse Reward Optimization (PIRO):
- The theoretical constant \(C\) is too large for direct use; it is replaced by a tunable coefficient \(\mu > 0\).
- The \(\ell^\infty\) norm in \(\epsilon\) is non-differentiable; it is approximated by an \(L^2\) norm estimated over expert data and policy rollouts.
- Objective: \(L_{\boldsymbol{\theta}_{\text{old}}}(\boldsymbol{\theta}) = \ell_{\boldsymbol{\theta}_{\text{old}}}(\boldsymbol{\theta}) - \mu \bar{\epsilon}_{\boldsymbol{\theta}_{\text{old}}}(\boldsymbol{\theta})\)
- \(\mu\) is adapted: if \(\bar{\epsilon} > \bar{\epsilon}^{\text{target}} \times x\), then \(\mu \leftarrow \mu \times y\); conversely, when \(\bar{\epsilon}\) falls well below the target, \(\mu\) is decreased.
- The policy is approximately optimized via a fixed number of SAC iterations rather than exact solving.
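The adaptive rule for \(\mu\) can be sketched as follows. The increase branch follows the rule stated above; the decrease branch (dividing by \(y\) once \(\bar{\epsilon}\) drops below the target divided by \(x\)) is an assumption modeled on the adaptive KL penalty used in PPO, and the function name and default values are hypothetical.

```python
def adapt_mu(mu, eps_bar, eps_target, x=1.5, y=1.5):
    """Return an updated penalty coefficient mu.

    eps_bar is the measured reward change and eps_target the desired level.
    The lower threshold eps_target / x and the division by y are assumptions
    of this sketch; the source only states the increase rule explicitly.
    """
    if eps_bar > eps_target * x:
        return mu * y   # reward moved too far: tighten the penalty
    if eps_bar < eps_target / x:
        return mu / y   # reward barely moved: relax the penalty
    return mu           # within the tolerated band: leave mu unchanged


mu_up = adapt_mu(3.0, eps_bar=2.0, eps_target=1.0)
mu_down = adapt_mu(3.0, eps_bar=0.5, eps_target=1.0)
mu_same = adapt_mu(3.0, eps_bar=1.0, eps_target=1.0)
```

Keeping a dead band between the two thresholds avoids changing \(\mu\) on every iteration when \(\bar{\epsilon}\) hovers near the target.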

Loss & Training¶

PIRO alternates between:
- Policy update: \(k\) rounds of SAC iteration based on the current reward \(r_{\boldsymbol{\theta}_{\text{old}}}\).
- Reward update: \(n\) gradient ascent steps with gradient \(\nabla_{\boldsymbol{\theta}} L = \mathbb{E}_{\hat{D}_E}[\nabla r_{\boldsymbol{\theta}}] - \mathbb{E}_{D_S}[\nabla r_{\boldsymbol{\theta}}] - \mu \nabla \bar{\epsilon}\)
- Setting \(k=n=1, \mu=0\) recovers standard non-adversarial IRL.
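The reward update can be illustrated with a small sketch. It assumes a linear reward \(r_{\boldsymbol{\theta}} = \boldsymbol{\theta}^\top \phi\) and takes \(\bar{\epsilon}\) to be the mean squared reward change over expert and policy samples; both choices, along with the function names and toy data, are assumptions for illustration rather than the paper's exact estimator. The SAC policy update is omitted, and the policy samples are treated as fixed during the \(n\) reward steps.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def reward_update(theta_old, expert_phi, policy_phi, mu, lr, n):
    """Run n ascent steps on L = l_old(theta) - mu * eps_bar(theta)."""
    theta = list(theta_old)
    # With a linear reward and fixed samples, this term does not depend on theta.
    grad_l = [e - p for e, p in zip(mean_vec(expert_phi), mean_vec(policy_phi))]
    samples = expert_phi + policy_phi
    for _ in range(n):
        delta = [t - o for t, o in zip(theta, theta_old)]
        # eps_bar = mean((delta . phi)^2), so grad eps_bar = mean(2 (delta . phi) phi).
        grad_eps = mean_vec([[2.0 * dot(delta, phi) * c for c in phi]
                             for phi in samples])
        theta = [t + lr * (g - mu * ge)
                 for t, g, ge in zip(theta, grad_l, grad_eps)]
    return theta


# Toy feature batches (made-up numbers).
expert_phi = [[1.0, 0.0], [1.0, 1.0]]
policy_phi = [[0.0, 0.0], [0.0, 1.0]]
theta_pen = reward_update([0.0, 0.0], expert_phi, policy_phi, mu=1.0, lr=0.5, n=2)
theta_free = reward_update([0.0, 0.0], expert_phi, policy_phi, mu=0.0, lr=0.5, n=2)
```

In this toy run the penalized update moves the first weight less far from \(\boldsymbol{\theta}_{\text{old}}\) than the unpenalized one, which is the proximal effect the \(\mu\) term is meant to provide.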

Key Experimental Results¶

Main Results: MuJoCo and Gym Robotics¶

Task	Expert	GAIL	AIRL	HyPE	IQ-Learn	ML-IRL	f-IRL	PIRO	Gain (PIRO vs. best baseline)
Ant-v4	5926	997	991	2801	3590	5383	980	5967	+585
Humanoid-v4	5501	508	281	718	1848	5573	470	5955	+382
Walker2d-v4	5525	4158	73	1479	3023	4795	244	5644	+849
AntMaze-UMaze	35.6	5.2	4.5	11.9	3.9	4.2	3.6	25.7	+13.8
AntMaze-Large	11.5	0.9	3.4	1.5	0.8	0.3	0.9	8.8	+5.4

Ablation Study¶

Analysis Dimension	Result
Training stability	PIRO yields the smoothest learning curves; baselines exhibit large variance or performance collapse
Sample efficiency	PIRO converges at a rate comparable to the fastest baseline while achieving higher final performance
State-only reward recovery	Recovered rewards in a \(7\times7\) grid world closely match ground truth
Reward transfer	Rewards learned on LunarLander remain effective for training policies under added wind perturbations
Hyperparameter sensitivity	Robust within \(x, y \in (1, 2)\) and \(\bar{\epsilon}^{\text{target}} \in (0.1, 1)\)

Key Findings¶

PIRO outperforms or matches SOTA on nearly all tasks, with particularly notable advantages on challenging tasks (Humanoid, AntMaze, AdroitHand).
Training stability is the most prominent advantage — baselines such as ML-IRL frequently suffer performance collapse on complex tasks.
Although the per-step computational cost is slightly higher, stable convergence means that total computation does not increase.
The only task where PIRO underperforms a baseline is Hopper-v4 (−173.7), suggesting that the proximal constraint may be overly conservative on simple tasks.

Highlights & Insights¶

The symmetry with TRPO is elegant: a trust region in forward RL guarantees policy improvement ↔ a trust region in inverse RL guarantees reward improvement.
Unifying multiple non-adversarial IRL methods under a likelihood maximization framework constitutes a significant theoretical contribution.
PIRO's implementation is concise: only a few additional reward gradient steps are needed on top of SAC, making it engineering-friendly.
The reward transfer experiment demonstrates the advantage of explicit reward learning over implicit methods — the learned reward is decoupled from environment dynamics.

Limitations & Future Work¶

Theoretical guarantees assume exact policy optimization; in practice, finite-step SAC introduces a gap between theory and implementation.
Reliance on on-policy sampling may limit scalability to tasks where environment interactions are costly.
The theoretical constant \(C\) is too large for direct use, necessitating adaptive \(\mu\) to relax the constraint in practice.
The framework is potentially extensible to RLHF settings, given the natural connection between reward model learning from human feedback and IRL.

The instability of adversarial methods such as GAIL and AIRL is a longstanding challenge in IRL; PIRO offers a principled alternative with formal guarantees.
PIRO is closely related to ML-IRL and can be viewed as ML-IRL augmented with a trust region constraint.
The TRPO→PPO simplification pathway inspired the TRRO→PIRO design trajectory.
The framework has potential applications to reward modeling in LLM alignment, as reward learning in RLHF is fundamentally an IRL problem.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — First formal stability guarantee for IRL; the inverse-symmetric perspective on TRPO is elegant.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 9 tasks, 13 baselines, and comprehensive analysis of stability, efficiency, transfer, and sensitivity.
Writing Quality: ⭐⭐⭐⭐ — Rigorous theoretical derivations and a concise practical algorithm.
Value: ⭐⭐⭐⭐⭐ — Significant contributions to both the theory and practice of IRL.