Flow Actor-Critic for Offline Reinforcement Learning (FAC)¶
Conference: ICLR 2026
arXiv: 2602.18015
Code: None
Area: Reinforcement Learning
Keywords: Offline RL, Flow Matching, Actor-Critic, OOD Detection, Continuous Normalizing Flows
TL;DR¶
FAC is the first method to use a single continuous normalizing flow model both as an expressive actor policy and as the basis of a critic penalty built on exact density estimation. By identifying OOD regions for selective conservative Q-value estimation, FAC achieves an average score of 60.3 across 55 OGBench tasks, substantially outperforming the previous best of 43.6.
Background & Motivation¶
Background: Offline RL datasets typically exhibit complex multimodal behavioral distributions. Simple Gaussian policies lack the expressiveness to capture them, while diffusion-based policies, though highly expressive, rely on multi-step sampling that destabilizes policy optimization.
Limitations of Prior Work: (a) CQL applies a global conservative penalty uniformly to all OOD actions, leading to excessive conservatism; (b) SVR identifies OOD actions via importance sampling ratios, but these ratios can explode when the behavioral policy is poorly approximated by a Gaussian model; (c) existing methods treat actor design and critic penalty independently, lacking co-design between the two components.
Key Challenge: How can strong policy expressiveness be maintained while precisely identifying OOD regions for conservative estimation?
Key Insight: Flow models offer both a highly expressive policy (actor) and exact density estimation (critic OOD penalty), addressing both challenges simultaneously.
Core Idea: A single flow model resolves both actor expressiveness and critic OOD detection.
Method¶
Overall Architecture¶
Two-stage training: (1) a behavioral proxy model \(\hat{\beta}_\psi\) is trained via flow matching to provide exact density estimation; (2) the density estimates are used to construct a weight function that penalizes Q-values in OOD regions, while a one-step flow actor optimizes the policy.
Key Designs¶
- Flow Behavior Proxy:
- Function: A continuous normalizing flow model is trained via flow matching to model the behavioral distribution \(\hat{\beta}_\psi(a|s)\).
- Mechanism: The flow matching objective is \(\min_\psi \mathbb{E}[\|v_\psi(\tilde{a}_u; s, u) - (a-z)\|^2]\), where \(\tilde{a}_u = (1-u)z + ua\) interpolates between base noise \(z \sim \mathcal{N}(0, I)\) and a dataset action \(a\). A key advantage is that the flow model provides exact density estimates \(\log\hat{\beta}_\psi(a|s)\) via ODE integration, as opposed to the ELBO lower bound of VAEs or the approximate likelihoods of diffusion models.
- Design Motivation: Exact density estimation is a prerequisite for reliable OOD detection — approximate densities from VAEs or diffusion models lead to misclassification.
- Flow Critic Penalty:
- Function: Density estimates are used to construct a weight function that applies penalties exclusively to Q-values in OOD regions.
- Mechanism: The weight function \(w^{\hat{\beta}}(s,a) = \max(0, 1 - \hat{\beta}(a|s)/\epsilon)\) is zero within the data support (\(\hat{\beta} \geq \epsilon\)) and increases linearly in OOD regions. The critic loss incorporates an additional term \(\alpha \cdot \mathbb{E}_{a \sim \pi}[w \cdot Q(s,a)]\).
- Proposition 1 guarantees that the Bellman operator remains unbiased within the data distribution while strongly suppressing Q-values in OOD regions.
- One-Step Flow Actor:
- Function: A simplified one-step flow (direct mapping \(z \mapsto a\)) serves as the policy, avoiding the instability of multi-step sampling.
- Mechanism: The actor objective is \(\max_\theta \mathbb{E}[Q(s, a_\theta(s,z))] - \lambda \cdot \|a_\theta(s,z) - a_\psi(s,z)\|^2\) (Q-value maximization plus behavioral regularization toward the flow proxy). Unlike multi-step diffusion or flow policies, the one-step mapping keeps gradient computation stable.
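As a concrete illustration of the flow-model roles above, the following minimal numpy sketch computes the flow matching loss and an exact log-density via 10-step Euler integration of the change-of-variables ODE. This is a toy under stated assumptions — a fixed random MLP stands in for the trained velocity field, and a finite-difference divergence replaces an autograd Jacobian trace; none of the names here are from the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_velocity_net(state_dim, action_dim, hidden=32):
    """Tiny random MLP standing in for the velocity field v_psi(a_u; s, u).
    Weights are fixed here for illustration; flow matching would fit them."""
    W1 = rng.normal(size=(state_dim + action_dim + 1, hidden)) * 0.5
    W2 = rng.normal(size=(hidden, action_dim)) * 0.5
    def v(a_u, s, u):
        x = np.concatenate([a_u, s, u], axis=-1)
        return np.tanh(x @ W1) @ W2
    return v

def flow_matching_loss(v, s, a):
    """min_psi E[ || v(a_u; s, u) - (a - z) ||^2 ] with a_u = (1-u) z + u a."""
    z = rng.normal(size=a.shape)            # base Gaussian noise
    u = rng.uniform(size=(a.shape[0], 1))   # interpolation time in [0, 1]
    a_u = (1 - u) * z + u * a               # linear interpolant
    return float(np.mean(np.sum((v(a_u, s, u) - (a - z)) ** 2, axis=-1)))

def log_density(v, s, a, steps=10, h=1e-4):
    """log beta_hat(a|s) via the instantaneous change-of-variables formula:
    integrate the ODE backward from u=1 (action) to u=0 (base Gaussian) with
    `steps` Euler steps, accumulating the divergence of v along the path
    (finite differences; feasible in low action dimensions)."""
    n, d = a.shape
    dt = 1.0 / steps
    a_u = a.copy()
    int_div = np.zeros(n)
    for k in range(steps):
        u = np.full((n, 1), 1.0 - k * dt)
        vel = v(a_u, s, u)
        div = np.zeros(n)
        for i in range(d):                  # trace of the Jacobian dv/da
            e = np.zeros(d); e[i] = h
            div += (v(a_u + e, s, u) - v(a_u - e, s, u))[:, i] / (2 * h)
        a_u -= dt * vel                     # Euler step back toward z
        int_div += dt * div
    # log N(z; 0, I) at the recovered base point, minus accumulated divergence
    log_base = -0.5 * np.sum(a_u ** 2, axis=-1) - 0.5 * d * np.log(2 * np.pi)
    return log_base - int_div
```

Exactness of this density (up to discretization error of the 10-step Euler integrator the paper also uses) is what distinguishes the flow proxy from VAE/diffusion alternatives, whose likelihoods are only bounded or approximated.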
Loss & Training¶
- Stage 1: Flow matching pre-training of the behavioral proxy.
- Stage 2: Alternating updates of the critic (TD loss + flow density-weighted penalty) and the actor (Q-value maximization + behavioral regularization).
Key Experimental Results¶
Main Results¶
OGBench (55 tasks, the most challenging offline RL benchmark):
| Method | State-based Avg. Score ↑ | Category |
|---|---|---|
| ReBRAC (Gaussian) | 31.0 | Gaussian Policy |
| FQL (Flow) | 43.6 | Flow Policy (actor only) |
| FAC | 60.3 | Flow Policy (actor+critic) |
Highlights: puzzle-3x3-play 100.0 (vs. FQL 29.6, +238%); antmaze-large 92.6 (vs. FQL 78.6).
D4RL AntMaze (6 tasks): average 90.5 (new SOTA, previous best FQL 83.5).
Ablation Study¶
| Configuration | OGBench Avg. ↑ | Notes |
|---|---|---|
| FAC (full) | 60.3 | Actor regularization + critic penalty |
| Actor regularization only (= FQL) | 43.6 | No critic penalty |
| Critic penalty only | ~48 | No actor regularization |
| VAE density replacing flow density | Large drop | Inaccurate density estimation causes OOD detection failure |
Key Findings¶
- Flow-based density estimation substantially outperforms VAE/diffusion models for OOD detection (illustrated in synthetic experiments, Fig. 1).
- Both actor regularization and critic penalty are indispensable; the joint FAC approach improves over FQL (actor only) by +16.7.
- The one-step flow actor is more stable than multi-step flow policies (FAWAC, FBRAC).
- On D4RL MuJoCo, Gaussian methods remain competitive after extensive tuning, but the performance gap is substantial on the complex tasks of OGBench.
Highlights & Insights¶
- The "one model, two uses" design is elegant: the flow model simultaneously provides an expressive policy and accurate OOD density estimation, which is more efficient and self-consistent than designing the two components separately.
- The density-threshold penalty (vs. CQL's global penalty) represents an important advancement — conservatism is applied only in genuinely OOD regions, maintaining unbiasedness within the data distribution and avoiding CQL's over-conservatism.
- The substantial improvement of 60.3 vs. 43.6 on OGBench suggests that, in complex multimodal tasks, precise OOD handling is more critical than policy expressiveness alone.
Limitations & Future Work¶
- Two-stage training (pre-training the flow model before actor-critic training) increases overall training complexity.
- Density evaluation requires numerical ODE integration (10-step Euler), introducing additional inference overhead.
- The threshold \(\epsilon\) introduces an additional design choice.
- Performance gains over baselines are less pronounced on D4RL MuJoCo than on OGBench.
Related Work & Insights¶
- vs. FQL: FQL uses flow only for the actor; FAC additionally leverages it for critic penalty — achieving +38% improvement on OGBench.
- vs. CQL: CQL's uniform penalty over all OOD actions causes excessive conservatism; FAC uses exact density estimates to apply penalties only at genuinely OOD actions.
- vs. DiffQL/IDQL: Multi-step sampling in diffusion policies destabilizes policy optimization; FAC's one-step flow is more stable.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First to jointly leverage a flow model for both actor and critic; conceptually clean yet highly effective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ OGBench (55 tasks) + D4RL (15 tasks) + pixel observations; comprehensive coverage.
- Writing Quality: ⭐⭐⭐⭐ Method motivation is clear; theoretical guarantee (Proposition 1) is concise.
- Value: ⭐⭐⭐⭐⭐ Achieves a major breakthrough on the most challenging offline RL benchmark.