Intention-Conditioned Flow Occupancy Models

TL;DR

This paper proposes InFOM, which leverages flow matching to construct an intention-conditioned occupancy model. By applying variational inference to infer latent intentions from unannotated data, InFOM enables RL pre-training without labeled datasets, achieving a 1.8× improvement in median return and a 36% gain in success rate across 36 state-based tasks and 4 visual tasks.

Background & Motivation

The large-scale pre-training–fine-tuning paradigm has achieved remarkable success in NLP and CV, yet remains an open problem in reinforcement learning (RL). The core challenges in RL include:

Temporal Reasoning: Agents must reason about the long-term consequences of current actions, while world models are limited by compounding errors that restrict long-horizon reasoning.

Intention Reasoning: Large-scale offline datasets are typically collected by multiple users performing different tasks, and these implicit "intentions" are not explicitly annotated.

Limitations of Prior Work: Behavioral cloning (BC) imitates actions without capturing intentions; discriminative occupancy models are difficult to train; successor feature methods generally ignore user intentions.

This paper proposes InFOM (Intention-conditioned Flow Occupancy Models), which jointly learns a probabilistic model to capture both temporal and intention information, enabling the pre-trained model to be aware of the behavioral goals of different users and thereby facilitating more efficient policy learning during downstream fine-tuning.

Method

InFOM consists of two stages: pre-training and fine-tuning.

Pre-training Stage

1. Variational Intention Inference

Given an unannotated dataset \(D=\{(s,a,s',a')\}\), latent intentions \(z\) are inferred via variational inference.
An intention encoder \(p_e(z|s',a')\) infers the intention from the next transition \((s',a')\), based on a consistency assumption that consecutive transitions share the same intention.
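The ELBO is maximized: \(\mathbb{E}_{z \sim p_e(z|s',a')}[\log q_d(s_f|s,a,z)] - \lambda D_{KL}(p_e(z|s',a') \| p(z))\), where \(s_f\) denotes a future state and \(q_d\) is the occupancy model that predicts it.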
The prior is \(p(z) = \mathcal{N}(0,I)\), and \(\lambda\) controls the KL regularization strength.
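
As a rough illustration of the objective above, the sketch below computes the KL term in closed form and assembles the negative ELBO for a single sample. It assumes a diagonal Gaussian encoder parameterized by a mean and a log-variance, which this summary does not state. The function names and the `log_likelihood` argument (a stand-in for \(\log q_d(s_f|s,a,z)\)) are hypothetical.

```python
import math
import random

def kl_to_standard_normal(mean, log_var):
    """Closed-form KL( N(mean, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))

def sample_intention(mean, log_var, rng):
    """Reparameterized sample z = mean + std * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

def negative_elbo(log_likelihood, mean, log_var, lam):
    """Loss to minimize: -log q_d(s_f | s, a, z) + lam * KL(p_e || p)."""
    return -log_likelihood + lam * kl_to_standard_normal(mean, log_var)

# Toy usage with made-up encoder outputs for a 2-dimensional intention.
rng = random.Random(0)
mean, log_var = [0.0, 1.0], [0.0, -2.0]
z = sample_intention(mean, log_var, rng)
loss = negative_elbo(log_likelihood=-1.5, mean=mean, log_var=log_var, lam=0.1)
```

Setting `lam` larger pulls the encoder toward the prior, which matters later because fine-tuning samples intentions from \(p(z)\).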

2. SARSA Flow Occupancy Models

Flow matching is used to learn a generative occupancy model \(q_d(s_f|s,a,z)\) that predicts discounted state occupancy measures.
Temporal difference (TD) reasoning is incorporated into the flow matching loss to enable dynamic programming and compositional generalization.
The loss is decomposed into two components: a current flow loss \((1-\gamma)\mathcal{L}_{\text{current}}\) and a future flow loss \(\gamma \mathcal{L}_{\text{future}}\).
The SARSA variant is simpler and more stable than the Q-learning variant, exhibiting better performance on large datasets.
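
The weighting of the two loss terms can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a standard conditional flow matching regression along straight-line paths for both terms, and it assumes the future target is a sample already drawn from a target model at the next pair \((s',a')\). The `velocity` callable and all function names are hypothetical.

```python
def cfm_loss(velocity, x0, x1, t, cond):
    """Flow matching regression on the straight path x_t = (1 - t) x0 + t x1.

    The regression target is the path's constant velocity x1 - x0.
    """
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    pred = velocity(t, xt, cond)
    target = [b - a for a, b in zip(x0, x1)]
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def td_flow_loss(velocity, noise, current_target, future_target, t, cond, gamma):
    """(1 - gamma) * L_current + gamma * L_future, as in the decomposition above."""
    current = cfm_loss(velocity, noise, current_target, t, cond)
    future = cfm_loss(velocity, noise, future_target, t, cond)
    return (1.0 - gamma) * current + gamma * future

# Toy usage: a velocity model that always predicts zero, in one dimension.
zero_velocity = lambda t, x, cond: [0.0 for _ in x]
loss = td_flow_loss(zero_velocity, noise=[0.0], current_target=[1.0],
                    future_target=[2.0], t=0.5, cond=None, gamma=0.5)
```

With \(\gamma\) close to 1, almost all of the weight falls on the bootstrapped future term, which is where the TD-style reasoning enters.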

Fine-tuning Stage

3. Generative Value Estimation

The pre-trained occupancy model is frozen; \(N=16\) future states \(s_f^{(i)} \sim q_d(s_f|s,a,z)\) are sampled.
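The intention-conditioned Q-function is estimated via Monte Carlo: \(Q_z(s,a) = \frac{1}{(1-\gamma)N}\sum_{i=1}^{N} r(s_f^{(i)})\), where \(r\) is the downstream reward function and the factor \(1/(1-\gamma)\) rescales the normalized discounted occupancy into a return.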
Intentions \(z\) are sampled from the prior \(p(z)\) rather than the posterior.
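
The estimator above can be sketched in a few lines. The example assumes future states are drawn by Euler integration of a learned velocity field from Gaussian noise, which is one common way to sample from a flow model; the solver, step count, and all names (`velocity`, `reward_fn`, `cond`) are illustrative assumptions rather than details taken from the paper.

```python
import random

def euler_sample(velocity, noise, cond, n_steps=10):
    """Integrate dx/dt = velocity(t, x, cond) from t = 0 to t = 1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        v = velocity(k * dt, x, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def mc_q_estimate(velocity, reward_fn, cond, gamma, dim, rng, n_samples=16):
    """Q_z(s, a) ~= 1 / ((1 - gamma) N) * sum_i r(s_f^(i)); cond bundles (s, a, z)."""
    total = 0.0
    for _ in range(n_samples):
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += reward_fn(euler_sample(velocity, noise, cond))
    return total / ((1.0 - gamma) * n_samples)

# Toy usage: constant velocity field and a reward of 1 for every future state.
rng = random.Random(0)
toy_velocity = lambda t, x, cond: [1.0 for _ in x]
q = mc_q_estimate(toy_velocity, lambda s_f: 1.0, cond=None,
                  gamma=0.9, dim=2, rng=rng)
```

Because the estimate averages only \(N\) sampled rewards, its variance shrinks as \(N\) grows, at the cost of more ODE solves per \((s,a,z)\).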

4. Implicit Generalized Policy Improvement (GPI)

The explicit max over a finite set of intentions in naive GPI is replaced by an upper expectile loss.
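Multiple \(Q_z\) estimates are distilled into a single scalar Q-function: \(\mathcal{L}(Q) = \mathbb{E}[L_2^\mu(Q_z(s,a) - Q(s,a))]\), where \(L_2^\mu(x) = |\mu - \mathbb{1}(x < 0)|\,x^2\) is the expectile loss and \(\mu > 0.5\) penalizes underestimates of \(Q_z\) more heavily, pushing \(Q\) toward the maximum over intentions.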
This avoids backpropagating gradients through the ODE solver, yielding more stable training.
A behavioral cloning regularizer is added to the policy objective to suppress out-of-distribution actions.
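
To see why an upper expectile can stand in for the explicit max, the toy sketch below fits a single scalar to a handful of \(Q_z\) values by gradient descent on the expectile loss. The learning rate, step count, and function names are arbitrary choices for illustration; the paper trains a Q-network rather than one scalar.

```python
def expectile_loss(diff, mu):
    """L_2^mu(diff) = |mu - 1(diff < 0)| * diff^2."""
    weight = mu if diff >= 0 else 1.0 - mu
    return weight * diff * diff

def fit_expectile(q_z_values, mu, lr=0.1, steps=2000):
    """Minimize the mean of L_2^mu(Q_z - q) over a scalar q by gradient descent."""
    q = sum(q_z_values) / len(q_z_values)
    for _ in range(steps):
        grad = 0.0
        for v in q_z_values:
            diff = v - q
            weight = mu if diff >= 0 else 1.0 - mu
            grad += -2.0 * weight * diff  # d/dq of weight * (v - q)^2
        q -= lr * grad / len(q_z_values)
    return q

# Toy usage: Q_z estimates under four sampled intentions.
q_z_values = [0.0, 0.2, 0.5, 1.0]
naive_gpi = max(q_z_values)                  # explicit max over a finite set
mean_like = fit_expectile(q_z_values, 0.5)   # mu = 0.5 recovers the mean
upper = fit_expectile(q_z_values, 0.95)      # mu near 1 moves toward the max
```

The fitted value lies between the mean and the max of the inputs and moves toward the max as \(\mu\) approaches 1, without any gradient passing through the sampler that produced the \(Q_z\) values.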

Key Experimental Results

Experiment 1: ExORL and OGBench Benchmarks

Comparison against 8 baseline methods across 36 state-based tasks and 4 visual tasks:

Domain	InFOM	Best Baseline	Gain
walker (4 tasks avg)	380.9	327.6 (MBPO+ReBRAC)	~16%
jaco (4 tasks avg)	727.4	67.7 (IQL)	~20×
cube single (5 tasks)	92.5	77.8 (MBPO+ReBRAC)	~19%
visual tasks (4 tasks)	—	—	+31% over best

Matches or surpasses all baselines on 7 out of 9 domains.
The most substantial improvement occurs in the jaco domain (~20×), attributed to the high-dimensional state space and sparse rewards.
Outperforms the strongest baseline by 31% on image-based tasks.
Overall median return improves by 1.8× and success rate by 36%.

Experiment 2: Ablation Study on Implicit GPI

Policy Extraction Method	quadruped jump Return	scene task 1 Success Rate
InFOM (implicit GPI)	Highest	Highest
InFOM + GPI (naive max)	44% lower	Lower, 8× higher variance
FOM + one-step PI	Significantly lower	Significantly lower

Implicit GPI outperforms naive GPI by 44% and reduces variance by 8×.
Removing the intention encoder (FOM + one-step PI) causes a substantial performance drop, validating the importance of intention inference.

Highlights & Insights

Unified Framework: InFOM is the first to integrate intention inference with flow matching occupancy models, capturing both temporal and intention information within a single framework.
Implicit GPI: Replacing the explicit max operation with an expectile loss avoids instability from ODE backpropagation and the limitations of finite intention sets.
Strong Empirical Performance: Comprehensively outperforms 8 baselines across 36+4 tasks, with a ~20× improvement in the jaco domain.
Intention Visualization: t-SNE visualizations demonstrate that InFOM discovers clustering structures aligned with ground-truth intentions, whereas representations from FB and HILP are entangled.

Limitations & Future Work

Inferring intentions from consecutive state-action pairs is a simplification that may fail to accurately capture the original intention at the full trajectory level.
Monte Carlo Q-estimation introduces variance, as evidenced by relatively large cross-seed standard deviations on some tasks.
Joint pre-training of the encoder and flow model incurs higher computational cost compared to pure BC methods.
The consistency assumption—that consecutive transitions share the same intention—may not hold in complex real-world scenarios.

Related Work

Offline Unsupervised RL: FB (Touati & Ollivier, 2021) and HILP (Park et al., 2024) learn skills or representations but do not simultaneously model occupancy measures.
Occupancy Models / Successor Representations: Dayan (1993) and Janner et al. (2020) model occupancy measures, and TD flows (Farebrother et al., 2025) does so with flow matching, but none of them model intentions.
Generative RL: Decision Transformer, Diffuser, and related methods use generative models to model trajectories or policies but do not explicitly predict long-horizon state distributions.
Representation Learning: Contrastive learning, MAE, and similar approaches learn general-purpose representations but do not guarantee transferability to policy adaptation.
InFOM's Novelty: Relative to the most closely related work, TD flows, InFOM introduces variational latent variables for intention modeling and replaces explicit GPI over finite sets with implicit GPI.

Rating

⭐⭐⭐⭐ (4/5)

Theoretical motivation is clear, with variational inference and flow matching occupancy models organically integrated.
Experimental coverage is broad and baselines are comprehensive: 36+4 tasks × 8 baselines × 8 seeds.
Implicit GPI constitutes an elegant engineering and theoretical contribution.
Deductions: the intention consistency assumption is strong, and the variance issue in Monte Carlo estimation is not fully resolved.