ICML2025 Reinforcement Learning imitation learning world model reward-free inverse soft-Q learning model predictive control latent dynamics

Reward-free World Models for Online Imitation Learning¶

Conference: ICML2025
arXiv: 2410.14081
Code: To be confirmed
Area: Imitation Learning / World Models / Model Predictive Control
Keywords: imitation learning, world model, reward-free, inverse soft-Q learning, model predictive control, latent dynamics

TL;DR¶

Proposes IQ-MPC, an online imitation learning method built on a reward-free world model. It jointly learns the latent dynamics model and the Q-function via inverse soft-Q learning and plans with MPPI, achieving stable expert-level imitation on tasks with high-dimensional observations and complex dynamics.

Background & Motivation¶

Limitations of Offline Imitation Learning: Behavior Cloning (BC) methods (e.g., Diffusion Policy, Implicit BC) rely on massive expert data, struggle with out-of-distribution (OOD) states, and suffer from error accumulation and performance degradation.
Limitations of Prior Work: Existing online IL methods (e.g., GAIL, IQ-Learn, CFIL) perform poorly in high-dimensional observation/action spaces and complex dynamics tasks; IRL-based min-max optimization is unstable in the reward-policy space.
Potential of World Models: Decoder-free world models like the TD-MPC series demonstrate remarkable sample efficiency and planning capabilities in RL, but have not been effectively applied to reward-free imitation learning scenarios.
Goal: Can the dynamics modeling of world models improve online imitation learning performance while completely eliminating the reliance on explicit reward models?

Method¶

Overall Architecture: IQ-MPC¶

IQ-MPC consists of four core components, all operating in the latent space without reconstructing original observations:

Encoder \(h\): \(\mathbf{z} = h(\mathbf{s})\), mapping states to latent representations.
Latent Dynamics Model \(d\): \(\mathbf{z}' = d(\mathbf{z}, \mathbf{a})\), predicting the next latent state.
Q-function \(Q\): \(\hat{q} = Q(\mathbf{z}, \mathbf{a})\), estimating state-action values.
Policy Prior \(\pi\): \(\hat{\mathbf{a}} = \pi(\mathbf{z})\), guiding MPPI planning.

The system maintains two independent replay buffers: the expert buffer \(\mathcal{B}_E\) and the behavior buffer \(\mathcal{B}_\pi\).

Core Idea: Reward-free Optimization in the Q-Policy Space¶

Key Insight: The inverse Bellman operator \(\mathcal{T}^\pi\) establishes a bijective mapping between the Q-space and the reward space:

\[r(\mathbf{s}, \mathbf{a}) = Q(\mathbf{s}, \mathbf{a}) - \gamma \mathbb{E}_{\mathbf{s}' \sim \mathcal{P}(\cdot|\mathbf{s},\mathbf{a})} V^\pi(\mathbf{s}')\]

Consequently, there is no need to train a separate reward model; rewards can be directly decoded from Q-values and the policy prior. The optimization is shifted from the reward-policy space to the Q-policy space, bypassing the instability of min-max optimization.
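The bijection can be checked numerically on a small tabular problem. The sketch below is illustrative only: the two-state MDP, the fixed policy and the reward table are made-up values, and the soft value \(V^\pi(\mathbf{s}) = \mathbb{E}_{\mathbf{a} \sim \pi}[Q(\mathbf{s},\mathbf{a}) - \log \pi(\mathbf{a}|\mathbf{s})]\) is assumed to follow the IQ-Learn definition, since the summary above does not spell it out. It computes Q from a known reward, then decodes the reward back from Q.

```python
import math

GAMMA = 0.9
# Hypothetical toy MDP: 2 states, 2 actions, transition probabilities P[s][a][s'].
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
]
PI = [[0.6, 0.4], [0.25, 0.75]]   # fixed stochastic policy pi[s][a]
R = [[1.0, 0.0], [-0.5, 2.0]]     # ground-truth reward r[s][a]


def soft_value(q, pi):
    """V^pi(s) = E_{a~pi}[Q(s,a) - log pi(a|s)] for every state (assumed form)."""
    return [
        sum(p * (q_sa - math.log(p)) for p, q_sa in zip(pi_s, q_s))
        for pi_s, q_s in zip(pi, q)
    ]


def expected_next_value(v, s, a):
    """E_{s'~P(.|s,a)}[V(s')]."""
    return sum(p * v_next for p, v_next in zip(P[s][a], v))


def soft_q_from_reward(r, pi, iters=500):
    """Forward direction: iterate Q <- r + gamma * E[V^pi(s')] to its fixed point."""
    q = [[0.0, 0.0] for _ in r]
    for _ in range(iters):
        v = soft_value(q, pi)
        q = [[r[s][a] + GAMMA * expected_next_value(v, s, a) for a in range(2)]
             for s in range(len(r))]
    return q


def reward_from_q(q, pi):
    """Inverse direction: r = Q - gamma * E[V^pi(s')]."""
    v = soft_value(q, pi)
    return [[q[s][a] - GAMMA * expected_next_value(v, s, a) for a in range(2)]
            for s in range(len(q))]


Q = soft_q_from_reward(R, PI)
R_DECODED = reward_from_q(Q, PI)   # matches R up to numerical error
```

Because the forward map is a contraction for \(\gamma < 1\), each reward table yields exactly one Q table under a fixed policy, and the inverse operator recovers the reward without any learned reward model.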

Joint Training Loss¶

The joint training objective for the encoder, dynamics model, and Q-function is:

\[\mathcal{L} = \sum_{t=0}^{H} \lambda^t \left( \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}'_t) \sim \mathcal{B}} \| \mathbf{z}_{t+1} - \text{sg}(h(\mathbf{s}'_t)) \|_2^2 \right) + \mathcal{L}_{iq}\]

The first term is the consistency loss, which ensures that the latent states predicted by the dynamics model, \(\mathbf{z}_{t+1} = d(\mathbf{z}_t, \mathbf{a}_t)\), match the encoder's embedding of the actual next states (sg denotes stop-gradient).
The second term is the inverse soft-Q loss \(\mathcal{L}_{iq}\) with \(\chi^2\) regularization. It consists of three parts: Q-estimates on expert data, initial-state value function terms, and a regularization penalty on Q-value magnitudes.
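As a rough illustration of the consistency term only, the following sketch evaluates it for a single trajectory. The linear `encode` and `dynamics` functions and the value of `LAMBDA` are hypothetical stand-ins, and the recurrent rollout from \(\mathbf{z}_0 = h(\mathbf{s}_0)\) is an assumption borrowed from the TD-MPC family rather than something stated above. Plain Python has no gradients, so the stop-gradient is represented simply by treating the target as a constant.

```python
LAMBDA = 0.5   # temporal weight lambda (illustrative value)


def encode(s):
    """Hypothetical encoder h: a fixed linear map from a 2-D state to a 2-D latent."""
    return [0.5 * s[0] + 0.1 * s[1], -0.2 * s[0] + 0.8 * s[1]]


def dynamics(z, a):
    """Hypothetical latent dynamics d for a 2-D latent z and a scalar action a."""
    return [0.9 * z[0] + 0.1 * a, 0.9 * z[1] - 0.1 * a]


def squared_l2(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def consistency_loss(states, actions):
    """sum_t lambda^t * ||z_{t+1} - sg(h(s'_t))||^2 for one trajectory.

    states holds one more entry than actions: s_0, s_1, ..., s_T.
    """
    z = encode(states[0])
    loss = 0.0
    for t, a in enumerate(actions):
        z = dynamics(z, a)                # predicted latent z_{t+1}
        target = encode(states[t + 1])    # sg(h(s'_t)): a constant target
        loss += (LAMBDA ** t) * squared_l2(z, target)
    return loss


states = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
actions = [0.0, 0.5]
LOSS = consistency_loss(states, actions)
```

In a real implementation the expectation over the buffer would be a batch mean, and the gradient would flow through the prediction but not through the target.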

Policy Prior Learning¶

The policy is learned via a maximum entropy RL objective:

\[\mathcal{L}_\pi = \sum_{t=0}^{H} \lambda^t \left[ \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \mathcal{B}} \left[ -Q(\mathbf{z}_t, \pi(\mathbf{z}_t)) + \beta \log(\pi(\cdot|\mathbf{z}_t)) \right] \right]\]

where \(\beta\) is a fixed entropy coefficient. Policy learning utilizes a mixture of data from both the expert and behavior buffers.
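To make the two terms of this objective concrete, here is a minimal Monte Carlo sketch for a one-dimensional Gaussian policy. Everything in it is an assumption made for illustration: the quadratic critic `q_value`, the policy mean, the fixed standard deviation `SIGMA`, and the values of `BETA` and `LAMBDA`. It is not the paper's parameterization.

```python
import math
import random

BETA = 0.1     # fixed entropy coefficient (illustrative value)
LAMBDA = 0.5   # temporal weight lambda (illustrative value)
SIGMA = 0.5    # fixed policy standard deviation (assumption)


def q_value(z, a):
    """Hypothetical critic: prefers actions close to the latent z."""
    return -(a - z) ** 2


def policy_mean(z):
    """Hypothetical policy prior mean."""
    return z + 0.3


def log_prob(a, mean, sigma):
    """Log-density of a 1-D Gaussian N(mean, sigma^2) evaluated at a."""
    return -0.5 * ((a - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def policy_loss(latents, n_samples=20000, seed=0):
    """sum_t lambda^t * E_{a~pi}[-Q(z_t, a) + beta * log pi(a|z_t)], by sampling."""
    rng = random.Random(seed)
    total = 0.0
    for t, z in enumerate(latents):
        mean = policy_mean(z)
        acc = 0.0
        for _ in range(n_samples):
            a = rng.gauss(mean, SIGMA)
            acc += -q_value(z, a) + BETA * log_prob(a, mean, SIGMA)
        total += (LAMBDA ** t) * acc / n_samples
    return total
```

Minimizing the first term pushes actions toward high Q-values, while the \(\beta \log \pi\) term rewards entropy and keeps the policy from collapsing to a point.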

Gradient Penalty for Stable Training¶

To address the issue where policy learning fails due to an overly dominant critic, a Wasserstein-1 gradient penalty is introduced:

\[\mathcal{L}_{pen} = \sum_{t=0}^{H} \lambda^t \left[ \mathbb{E}_{(\hat{\mathbf{s}}_t, \hat{\mathbf{a}}_t) \sim \mathcal{B}} \left( \| \nabla Q(\hat{\mathbf{z}}_t, \hat{\mathbf{a}}_t) \|_2 - 1 \right)^2 \right]\]

The gradient penalty points are generated through linear interpolation between expert and behavior samples, enforcing the Lipschitz condition on the Q-function.
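The interpolation and penalty can be sketched as follows. This is a simplified illustration under stated assumptions: the critic `q_value` is a made-up function, each sample is a flat list `[z0, z1, a]`, the gradient is taken with respect to both the latent and the action, and finite differences stand in for automatic differentiation.

```python
import math
import random


def q_value(z, a):
    """Hypothetical critic over a 2-D latent z and a scalar action a."""
    return math.tanh(z[0] - 0.5 * z[1]) + 2.0 * a


def grad_q(z, a, eps=1e-5):
    """Central finite-difference gradient of Q with respect to (z, a)."""
    x = list(z) + [a]
    grad = []
    for i in range(len(x)):
        hi = x[:]
        lo = x[:]
        hi[i] += eps
        lo[i] -= eps
        grad.append((q_value(hi[:-1], hi[-1]) - q_value(lo[:-1], lo[-1])) / (2 * eps))
    return grad


def interpolate(expert, behavior, alpha):
    """Point on the segment between an expert sample and a behavior sample."""
    return [alpha * e + (1.0 - alpha) * b for e, b in zip(expert, behavior)]


def gradient_penalty(expert_batch, behavior_batch, rng):
    """Mean of (||grad Q||_2 - 1)^2 at random interpolates of paired samples."""
    total = 0.0
    for expert, behavior in zip(expert_batch, behavior_batch):
        point = interpolate(expert, behavior, rng.random())
        grad = grad_q(point[:-1], point[-1])
        norm = math.sqrt(sum(g * g for g in grad))
        total += (norm - 1.0) ** 2
    return total / len(expert_batch)


rng = random.Random(0)
expert_batch = [[0.2, 0.1, 0.5], [0.0, -0.3, 0.1]]
behavior_batch = [[-0.4, 0.6, -0.2], [0.9, 0.2, 0.0]]
PENALTY = gradient_penalty(expert_batch, behavior_batch, rng)
```

The penalty is zero only where the gradient norm equals 1, so adding it to the critic loss discourages the steep Q-landscapes that let the critic dominate the policy.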

MPPI Planning (Inference Phase)¶

During inference, gradient-free planning is performed using MPPI (Model Predictive Path Integral):

Encode the current state \(\mathbf{z}_t = h(\mathbf{s}_t)\).
Sample \(N\) and \(N_\pi\) action trajectories from the Gaussian distribution and policy prior, respectively.
Roll out through the dynamics model, and decode rewards using the inverse Bellman operator: \(r(\mathbf{z},\mathbf{a}) = Q(\mathbf{z},\mathbf{a}) - \gamma V^\pi(\mathbf{z}')\).
Accumulate soft returns and add the terminal value estimation \(\gamma^H V^\pi(\mathbf{z}_H)\).
Iteratively update Gaussian parameters \((\mu, \sigma)\) and execute the first action.
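The planning steps above can be sketched on a one-dimensional toy problem. The code is a simplified illustration, not the paper's implementation: the dynamics, critic and policy prior are hypothetical closed-form functions, \(V^\pi(\mathbf{z})\) is approximated by \(Q(\mathbf{z}, \pi(\mathbf{z}))\) with the entropy term dropped, policy-prior trajectories are sampled once and reused across iterations, and the planner returns the first element of the final mean. All constants are illustrative.

```python
import math
import random

GAMMA = 0.9
HORIZON = 3
N_GAUSS = 64        # N: trajectories sampled from the Gaussian
N_PRIOR = 8         # N_pi: trajectories sampled from the policy prior
N_ELITE = 8
ITERATIONS = 6
TEMPERATURE = 2.0


def clip(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))


def dynamics(z, a):     # hypothetical latent dynamics d
    return z + a


def q_value(z, a):      # hypothetical critic: best when the next latent is 0
    return -(z + a) ** 2


def policy(z):          # hypothetical policy prior
    return clip(-z)


def value(z):           # V^pi(z) approximated by Q(z, pi(z)) (assumption)
    return q_value(z, policy(z))


def score(z, actions):
    """Discounted sum of decoded rewards plus the terminal value estimate."""
    total = 0.0
    for t, a in enumerate(actions):
        z_next = dynamics(z, a)
        reward = q_value(z, a) - GAMMA * value(z_next)   # inverse Bellman decode
        total += (GAMMA ** t) * reward
        z = z_next
    return total + (GAMMA ** len(actions)) * value(z)


def prior_rollout(z, rng, noise=0.05):
    """Action sequence obtained by following the noisy policy prior in latent space."""
    actions = []
    for _ in range(HORIZON):
        a = clip(policy(z) + rng.gauss(0.0, noise))
        actions.append(a)
        z = dynamics(z, a)
    return actions


def plan(z0, seed=0):
    rng = random.Random(seed)
    mean = [0.0] * HORIZON
    std = [1.0] * HORIZON
    prior = [prior_rollout(z0, rng) for _ in range(N_PRIOR)]
    for _ in range(ITERATIONS):
        gauss = [[clip(rng.gauss(m, s)) for m, s in zip(mean, std)]
                 for _ in range(N_GAUSS)]
        ranked = sorted(gauss + prior, key=lambda acts: score(z0, acts), reverse=True)
        elites = ranked[:N_ELITE]
        scores = [score(z0, acts) for acts in elites]
        # Softmax weights over elites, shifted by the best score for stability.
        weights = [math.exp(TEMPERATURE * (s - scores[0])) for s in scores]
        norm = sum(weights)
        mean = [sum(w * acts[t] for w, acts in zip(weights, elites)) / norm
                for t in range(HORIZON)]
        std = [max(0.05, math.sqrt(sum(w * (acts[t] - mean[t]) ** 2
                                       for w, acts in zip(weights, elites)) / norm))
               for t in range(HORIZON)]
    return clip(mean[0])


ACTION = plan(0.8)   # first action of the refined plan
```

Starting from a latent of 0.8, the toy critic favors moving toward 0, so the planned first action is negative. No reward model appears anywhere in the loop: every per-step reward is decoded from Q and the value estimate.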

Key Experimental Results¶

Locomotion Tasks (DMControl, State Input)¶

Task	IQL+SAC	CFIL+SAC	HyPE	IQ-MPC
Hopper Hop	Unstable	Unstable	Moderate	Stable Expert-level
Walker Run	Moderate	Low	Moderate	Stable Expert-level
Humanoid Walk	Low	Low	Moderate	Stable Expert-level
Dog Walk	Low	Low	Moderate	Best

Low-dimensional tasks use 100 expert trajectories, Humanoid uses 500, and Dog uses 1000.

Dexterous Manipulation Tasks (MyoSuite, Success Rate)¶

Task	IQL+SAC	CFIL+SAC	HyPE	IQ-MPC
Key Turn	0.72±0.04	0.65±0.08	0.55±0.09	0.87±0.03
Object Hold	0.00±0.00	0.01±0.01	0.13±0.10	0.96±0.03
Pen Twirl	0.00±0.00	0.00±0.00	0.00±0.00	0.73±0.05

Only uses 100 expert trajectories (each of 100 steps).
Baselines all achieve a 0% success rate in the Pen Twirl task, while IQ-MPC reaches 73%.

Visual Input Experiments (DMControl, Image Observations)¶

Only replaces the encoder with a shallow convolutional network, leaving the rest of the model unchanged.
Outperforms the visual version of IQL+SAC significantly on Cheetah Run and Walker Run.
Matches baseline performance on Walker Walk.

Ablation Study on Number of Expert Trajectories¶

Hopper Hop: Reaches expert-level performance with only 10 expert trajectories (100 trajectories converge faster).
Object Hold: Reaches expert-level performance with only 5 expert trajectories.
In Hopper Hop, instability arises with only 5 trajectories.

Highlights & Insights¶

Reward-free World Model: Decodes rewards directly from Q-values using the inverse Bellman operator, completely eliminating reward model training and lowering system complexity.
Q-Policy Space Optimization: Reformulates the traditional min-max problem in the reward-policy space as an optimization in the Q-policy space, which avoids adversarial instability and trains stably in the reported experiments.
Theoretical Guarantees: Proves that the training objective minimizes an upper bound on the policy return discrepancy, which comprises two terms (T1: distribution matching, T2: dynamics consistency).
Breakthrough in Dexterous Manipulation: On high-dimensional musculoskeletal control tasks in MyoSuite, IQ-MPC achieves a 73-96% success rate, while the baselines reach at most 13% on Object Hold and 0% on Pen Twirl (they remain competitive only on Key Turn, at 55-72%).
Data Efficiency: Reaches expert-level performance with as few as 10 expert trajectories on Hopper Hop and 5 on Object Hold, so few demonstrations are needed.
Modality-agnostic Design: Highly flexible architecture requiring only an encoder replacement when transitioning from state to visual inputs.

Limitations & Future Work¶

Expert Data Acquisition: Still requires sampling expert trajectories from a pre-trained TD-MPC2 model; obtaining high-quality expert demonstrations in real-world scenarios can be difficult.
Computational Overhead: MPPI planning requires multiple iterations of sampling and rollouts, incurring higher inference costs than policy-only methods.
Deterministic Environment Assumptions: Most experiment environments feature deterministic dynamics; performance in high-stochasticity environments remains to be verified.
Single-task Design: Trains an independent world model for each task, lacking cross-task generalization capabilities.
Lack of Real-robot Validation: All experiments are conducted in simulated environments; sim-to-real transfer is not discussed.

Related Work¶

IQ-Learn (Garg et al., 2021): Serves as the theoretical foundation for this work, introducing inverse soft-Q learning to shift IRL optimization from the reward space to the Q space.
TD-MPC / TD-MPC2 (Hansen et al., 2022/2023): Serves as the architectural blueprint for this work, offering a decoder-free world model + MPPI planning framework.
GAIL (Ho & Ermon, 2016): Classic adversarial imitation learning; this work avoids the instability of its min-max optimization by optimizing in the Q-policy space.
CMIL (Kolev et al., 2024): Conservative world model for imitation learning in visual manipulation; its bounded suboptimality lemma is referenced in the theoretical analysis.

Rating¶

Novelty: ⭐⭐⭐⭐ - Pairing a decoder-free world model with inverse soft-Q learning for online imitation learning is new; the reward-free planning design is simple and elegant.
Experimental Thoroughness: ⭐⭐⭐⭐ - Covers three categories of tasks (locomotion, manipulation, and vision) with thorough ablation studies, though it lacks real-robot validation.
Writing Quality: ⭐⭐⭐⭐ - The theoretical derivations are clear, experimental descriptions are comprehensive, and notations are consistent and standardized.
Value: ⭐⭐⭐⭐ - The large margin in dexterous manipulation (baselines at 0-13% success on Object Hold and Pen Twirl vs. 96% and 73% for IQ-MPC) demonstrates the practical value of the method.