Offline Reinforcement Learning with Generative Trajectory Policies
Conference: ICLR 2026 | arXiv: 2510.11499 | Code: None (planned open-source) | Area: Reinforcement Learning | Keywords: Offline Reinforcement Learning, Generative Policy, ODE Trajectory, Consistency Models, Flow Matching
TL;DR
This paper proposes Generative Trajectory Policies (GTP), built on a unified perspective that treats diffusion models, flow matching, and consistency models as special cases of learning the solution mapping of a continuous-time ODE. GTP learns the complete continuous-time trajectory solution mapping and introduces two adaptation techniques, efficient score approximation and advantage weighting, achieving state-of-the-art performance on the D4RL benchmark.
Background & Motivation
- Background: In offline RL, generative models have emerged as powerful policy classes thanks to their ability to capture complex, multimodal behavior distributions. Diffusion-based policies excel in expressiveness, while consistency-based policies offer inference efficiency.
- Limitations of Prior Work: Diffusion policies require iterative denoising, resulting in high inference cost. Consistency policies enable 1–2 step generation but often suffer from degraded performance. The two approaches thus present a fundamental trade-off between expressiveness and efficiency.
- Key Challenge: Existing methods have developed independently, lacking a unified perspective from which to understand and transcend their respective limitations. Whether a policy class can be both highly expressive and computationally efficient remains an open question.
- Goal: Break the expressiveness-efficiency trade-off in generative policies by designing a flexible multi-step generative policy that achieves high performance even with a small number of sampling steps.
- Key Insight: Diffusion models, flow matching, and consistency models can all be viewed as learning solution mappings \(\Phi(\bm{x}_t, t, s)\) of a continuous-time ODE \(\frac{d\bm{x}_t}{dt} = f(\bm{x}_t, t)\); the authors exploit this unified perspective to design a novel policy class (see the schematic after this list).
- Core Idea: Represent the policy as an ODE solution mapping (flow map) and combine it with an efficient score approximation and advantage-weighted objectives, yielding an offline RL policy that is both expressive and efficient.
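As a schematic of this unified view (my paraphrase of the paper's framing, using its flow-map notation):

```latex
\Phi(\bm{x}_t, t, s) \;=\; \bm{x}_t + \int_t^{s} f(\bm{x}_\tau, \tau)\,d\tau,
\qquad
\begin{cases}
\text{diffusion / flow matching:} & \text{learn the local field } f(\bm{x}_t, t) \text{ and integrate it at inference} \\
\text{consistency models:} & \text{learn only the endpoint map } \Phi(\bm{x}_t, t, 0) \\
\text{GTP:} & \text{learn the full two-time map } \Phi(\bm{x}_t, t, s)
\end{cases}
```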
Method
Overall Architecture
GTP is an actor-critic framework. The actor is a generative trajectory policy \(\Phi_\theta(s, a_t, t, \tau)\) (here \(s\) denotes the state and \(\tau\) the target time) that learns a solution mapping from noisy actions to clean actions. The critic is a standard double-Q network \(Q_\varphi\). The actor is optimized with two complementary objectives (an instantaneous flow loss plus a trajectory consistency loss), with advantage weighting driving policy improvement. At inference time, actions are produced by iteratively stepping from Gaussian noise, as sketched below.
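To make inference concrete, here is a minimal sketch of the iterative sampler this description implies; the `flow_map(state, a_t, t, s)` interface and the uniform time grid are illustrative assumptions, not the paper's exact implementation.

```python
import torch

@torch.no_grad()
def sample_action(flow_map, state, action_dim, n_steps=5, device="cpu"):
    """Draw an action by iteratively stepping the learned flow map
    Phi_theta(state, a_t, t, s) from pure noise (t = 1) toward data (t = 0).

    The `flow_map` interface and uniform time grid are illustrative
    assumptions based on the summary, not the paper's exact sampler.
    """
    a = torch.randn(action_dim, device=device)     # a_1 ~ N(0, I)
    times = torch.linspace(1.0, 0.0, n_steps + 1)  # t = 1 is noise, t = 0 is data
    for t, s in zip(times[:-1], times[1:]):
        # Each call jumps directly from time t to time s along the ODE
        # trajectory; with n_steps = 1 this reduces to a consistency-style
        # single jump, while larger n_steps trades compute for accuracy.
        a = flow_map(state, a, t.item(), s.item())
    return a
```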
Key Designs
1. Unified ODE Trajectory Framework
   - Function: Provides the theoretical foundation for a unified understanding of diffusion, flow matching, and consistency models.
   - Mechanism: Defines the ideal flow map \(\Phi(\bm{x}_t, t, s) = \bm{x}_t + \int_t^s f(\bm{x}_\tau,\tau)d\tau\) and its reparameterized form \(\phi(\bm{x}_t, t, s)\). Training employs two complementary objectives: an instantaneous flow loss (local correctness, corresponding to diffusion denoisers and flow-matching velocity fields) and a trajectory consistency loss (global coherence, \(\Phi(\bm{x}_t,t,s) \approx \Phi(\Phi(\bm{x}_t,t,u),u,s)\) for an intermediate time \(u\)).
   - Design Motivation: Existing models are each special cases of ODE learning; the unified perspective reveals a clearer policy design space.
2. Efficient and Stable Score Approximation
   - Function: Addresses the computational cost of ODE integration and the training instability caused by self-referential supervision (see the first sketch after this list).
   - Mechanism: Replaces the true score \(f^*(\bm{x}_t,t)\), which requires multi-step integration, with a closed-form surrogate \(\tilde{f}(\bm{x}_t,t) = (\bm{x}_t - \bm{x})/t\). Theorem 1 proves that the resulting objective error is \(O(h^p)\) (where \(h\) is the step size and \(p\) the solver order) and vanishes as the step size approaches zero.
   - Design Motivation: Using inaccurate early estimates as supervision creates a vicious cycle analogous to bootstrapping in TD learning. Anchoring to an analytic signal derived from offline data eliminates error propagation.
3. Advantage-Weighted Value-Guided Policy Improvement
   - Function: Elevates the generative model from behavior cloning to genuine policy improvement (see the second sketch after this list).
   - Mechanism: Theorem 2 proves that the optimal solution to KL-regularized policy optimization satisfies \(\pi^*(a|s) \propto \pi_{BC}(a|s)\exp(\eta A(s,a))\). An exponential advantage weight \(w(s,a) = \exp(\eta \cdot \max(0, A(s,a))/(\text{std}(A)+\epsilon))\) is incorporated into the generative loss to prioritize imitation of high-advantage actions.
   - Design Motivation: A purely generative objective reduces to behavior cloning and cannot achieve policy improvement. Advantage weighting provides a theoretically principled form of value guidance.
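A minimal sketch of the Design-2 surrogate target, assuming the straight-line corruption path \(a_t = (1-t)\,a + t\,\epsilon\); this path choice and the numerical guard near \(t = 0\) are my assumptions for illustration.

```python
import torch

def surrogate_velocity(a_t: torch.Tensor, a: torch.Tensor, t: torch.Tensor,
                       eps: float = 1e-5) -> torch.Tensor:
    """Closed-form surrogate f~(a_t, t) = (a_t - a) / t from Design 2.

    `a` is the clean dataset action and `a_t` its corrupted version at time t.
    Under the (assumed) straight-line path a_t = (1 - t) * a + t * noise this
    equals (noise - a) exactly, so no ODE integration is needed and the target
    never references the network's own early, inaccurate estimates.
    """
    # The clamp guards the t -> 0 limit; an implementation detail, not from the paper.
    return (a_t - a) / torch.clamp(t, min=eps)
```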
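And a minimal sketch of the Design-3 weight; the double-Q advantage estimate \(A \approx \min(Q_1, Q_2) - V\), the batch-level normalization, and the clipping cap are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

def advantage_weight(q1, q2, v, eta=3.0, eps=1e-6, max_weight=100.0):
    """Exponential weight w(s,a) = exp(eta * max(0, A) / (std(A) + eps)).

    The advantage is estimated as min(Q1, Q2) - V in common double-Q style;
    this estimator, eta = 3.0, and the clipping cap are assumptions for
    illustration only.
    """
    adv = torch.min(q1, q2) - v                               # per-sample advantage
    w = torch.exp(eta * torch.relu(adv) / (adv.std() + eps))  # only positive advantages up-weight
    return torch.clamp(w, max=max_weight)                     # guard against exploding weights
```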
Loss & Training
Total actor loss: \(\mathcal{L}_{\text{actor}} = \mathcal{L}_{\text{Consistency}} + \lambda_{\text{Flow}} \cdot \mathcal{L}_{\text{Flow}}\)
- Trajectory consistency loss: \(\mathcal{L}_{\text{Consistency}} = \mathbb{E}[w(s,a)\|\Phi_\theta(s,a_t,t,\tau) - \Phi_{\theta^-}(s,\tilde{a}_u,u,\tau)\|_2^2]\)
- Instantaneous flow loss: \(\mathcal{L}_{\text{Flow}} = \mathbb{E}[w(s,a)\|a - \phi_\theta^{\text{inst}}(s,a_t,t)\|_2^2]\)
The critic uses a standard double-Q network trained with TD error loss, with target networks updated via EMA.
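Putting the pieces together, a minimal sketch of the total actor loss under the same straight-line path assumption as above; the network interfaces, the time-triple sampling, and the construction of \(\tilde{a}_u\) are illustrative assumptions based on the loss definitions, not the paper's exact training loop.

```python
import torch

def actor_loss(flow_map, flow_map_ema, inst_head, state, a, w, lam_flow=1.0):
    """L_actor = L_Consistency + lam_flow * L_Flow, both advantage-weighted.

    `flow_map(state, a_t, t, tau)` is the trainable Phi_theta, `flow_map_ema`
    its EMA copy Phi_theta^-, `inst_head` the instantaneous prediction
    phi_theta^inst, and `w` the per-sample advantage weights w(s, a).
    """
    B = a.shape[0]
    # Sample an ordered time triple tau < u < t for each trajectory.
    t = torch.rand(B, 1)
    u = t * torch.rand(B, 1)
    tau = u * torch.rand(B, 1)
    noise = torch.randn_like(a)
    a_t = (1 - t) * a + t * noise  # straight-line corruption (assumed path)
    # Under this path, one Euler step from a_t with the Design-2 surrogate
    # lands exactly on the interpolant at u, so it serves as a~_u here.
    a_u = (1 - u) * a + u * noise

    # Trajectory consistency: jumping t -> tau must agree with first stepping
    # to u and then jumping u -> tau under the frozen EMA teacher.
    target = flow_map_ema(state, a_u, u, tau).detach()
    l_cons = (w * (flow_map(state, a_t, t, tau) - target).pow(2).sum(-1)).mean()

    # Instantaneous flow: the one-step prediction must recover the clean action.
    l_flow = (w * (a - inst_head(state, a_t, t)).pow(2).sum(-1)).mean()
    return l_cons + lam_flow * l_flow
```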
Key Experimental Results
Main Results
D4RL behavior cloning (BC) performance comparison; GTP-BC uses 5-step sampling:
| Task | Diffusion-BC | Consistency-BC | GTP-BC (Ours) |
|---|---|---|---|
| Gym Average | 76.3 | 69.7 | 82.3 |
| AntMaze Average | 41.7 | 44.1 | 66.3 |
| halfcheetah-mr | 41.7 | 34.4 | 46.3 |
| hopper-mr | 67.3 | 99.7 | 100.5 |
D4RL offline RL performance comparison (full actor-critic framework):
| Task | IDQL | DIPO | D-QL | C-QL | GTP (Ours) |
|---|---|---|---|---|---|
| AntMaze-large-diverse | 47.5 | — | 47.3 | 51.0 | 100.0 |
| AntMaze-medium-diverse | — | — | — | — | 100.0 |
Ablation Study
| Configuration | Gym Avg. | AntMaze Avg. | Notes |
|---|---|---|---|
| Full GTP | Best | 100.0 | Both losses + advantage weighting + score approximation |
| w/o trajectory consistency loss | Drops | Drops | Global consistency critical for long-horizon tasks |
| w/o instantaneous flow loss | Drops | Drops | Local dynamics anchoring is indispensable |
| True score (ODE integration) | Unstable | Poor | Validates necessity of score approximation |
| w/o advantage weighting | Degrades to BC | Degrades | No policy improvement capability |
Key Findings
- GTP achieves perfect scores (100.0) for the first time on several challenging AntMaze tasks, significantly outperforming all prior methods.
- Under the BC setting, GTP-BC already substantially outperforms Diffusion-BC and Consistency-BC, demonstrating the intrinsic expressive power of the trajectory policy.
- Score approximation accelerates training and improves stability; the theoretical error bound \(O(h^p)\) is empirically validated.
- Strong performance is achieved with as few as 5 sampling steps, demonstrating a favorable balance between efficiency and quality.
Highlights & Insights
- Theoretical contribution of the unified perspective: Subsuming diffusion, flow matching, and consistency models into a single ODE framework provides a clear design space for policy development.
- Complementary dual-objective design: The instantaneous flow loss ensures local accuracy while the trajectory consistency loss ensures global coherence.
- Elegance of score approximation: A simple closed-form expression replaces complex ODE integration, with theoretical guarantees and superior empirical performance.
- Perfect AntMaze scores: A landmark result with significant implications for the offline RL community.
Limitations & Future Work
- Evaluation is primarily conducted on standard D4RL benchmarks; more complex real-world tasks remain untested.
- Despite the elegance of the unified ODE framework, theoretical guidance for selecting the optimal number of sampling steps is lacking.
- Extending GTP to online RL and model-based RL settings is a promising direction.
- Integration with recent token-level generative policy methods warrants exploration.
Related Work & Insights
- Consistency Trajectory Models (CTM) provide the foundation for learning ODE solution mappings; GTP extends this to the RL domain.
- The advantage weighting scheme is conceptually aligned with AWR/AWAC, but admits a novel theoretical interpretation within the generative model framework.
- Insight: A unified theoretical perspective across different generative models can yield approaches that surpass each individual method.
Rating
- Novelty: ⭐⭐⭐⭐⭐ Unified ODE trajectory perspective combined with two theoretically principled adaptations.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive D4RL evaluation with thorough ablations; perfect AntMaze scores are compelling.
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorous theoretical derivations and clear architectural diagrams.
- Value: ⭐⭐⭐⭐⭐ Substantial and lasting impact on generative policy research in offline RL.