Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning
Conference: ICLR 2026
arXiv: 2602.18117
Code: GitHub
Area: Flow Matching/Reinforcement Learning
Keywords: Flow Matching, Offline-to-Online RL, Noise Injection, Exploration-Exploitation Balance, Entropy-Guided Sampling
TL;DR
The paper significantly improves the sample efficiency of offline-to-online RL under limited interaction budgets by injecting controllable noise during flow matching training to expand policy coverage and employing an entropy-guided sampling mechanism to dynamically balance exploration and exploitation during online fine-tuning.
Background & Motivation
Background: Generative models (diffusion/flow matching) perform excellently as policy representations in offline RL due to their ability to model multi-modal distributions. Flow Q-Learning (FQL) has demonstrated the effectiveness of flow matching policies in offline settings.
Limitations of Prior Work: Offline pre-trained policies are often overly constrained to the dataset distribution, leading to insufficient exploration capabilities during the online fine-tuning stage. Existing methods (like FQL) treat online fine-tuning simply as a continuation of offline pre-training without specifically designed exploration mechanisms. In the antmaze-giant task, FQL agents almost exclusively follow the upper path existing in the dataset to reach the goal, completely ignoring other viable routes.
Key Challenge: Offline RL requires conservative constraints (avoiding out-of-distribution actions), while the online phase requires broad exploration (going beyond data coverage). The requirements for policy distribution in these two stages are essentially contradictory.
Goal: Enable the policy to learn broader action coverage than the dataset during offline pre-training, without increasing the dataset size, and to exploit this diversity effectively during online fine-tuning.
Key Insight: It is observed that the conditional probability path of flow matching collapses the distribution to a single data point when \(\sigma_{\min}=0\), limiting coverage. By injecting controlled noise into flow matching, the variance of the conditional probability path can be expanded.
Core Idea: Injecting noise into the flow matching objective to expand the policy support set, combined with entropy-guided sampling to adaptively balance exploration and exploitation during the online phase.
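The collapse described in the Key Insight can be made concrete with a short worked equation. This is a sketch that assumes the linear Gaussian conditional path commonly used in flow matching (base sample \(x_0 \sim \mathcal{N}(0, I)\), data point \(x_1\)); the paper's exact parameterization may differ.

```latex
p_t(x \mid x_1) = \mathcal{N}\!\big(x;\ t\,x_1,\ (1-(1-\sigma_{\min})\,t)^2 I\big),
\qquad
p_1(x \mid x_1) = \mathcal{N}\!\big(x;\ x_1,\ \sigma_{\min}^2 I\big)
\ \xrightarrow{\ \sigma_{\min}\to 0\ }\ \delta(x - x_1).
```

Under this assumption the terminal standard deviation is exactly \(\sigma_{\min}\), so setting it to zero leaves no spread around each dataset action.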
Method
Overall Architecture
FINO (Flow matching with Injected Noise for Offline-to-online RL) targets a specific problem: flow matching policies pre-trained offline cover too narrow a region of the action space, which makes exploration difficult during online fine-tuning. It intervenes at two levels, the training objective and the sampling strategy. FINO keeps the FQL skeleton of two policies plus a Q-network (a flow policy that models the multi-modal action distribution, and a one-step policy distilled from it for environment interaction) and makes two modifications. In the offline phase, the standard flow matching objective is replaced with a noise-injected version so that the policy learns a wider action support. In the online phase, fixed greedy or fixed-temperature sampling is replaced by a softmax whose temperature adapts to the policy entropy, switching dynamically between exploration and exploitation. The input is an offline dataset of state-action pairs, and the output is a policy that explores efficiently under a limited online interaction budget.
flowchart TD
D["Offline Dataset<br/>State-Action Pairs"] --> A["Noise-Injected Flow Matching<br/>Inject ε_t along the path"]
A --> P["Broad Coverage Policy<br/>flow + one-step + Q networks"]
P --> S["Sample Candidate Actions Online"]
S --> G["Entropy via GMM<br/>Estimate policy entropy H"]
G --> E["Entropy-Guided Sampling<br/>Adjust temperature ξ based on H"]
E --> ENV["Environment Interaction<br/>Collect data & update networks"]
ENV -->|Online Fine-tuning Loop| S
Key Designs
1. Noise-Injected Flow Matching: Integrating Coverage into the Training Objective
The standard flow matching conditional probability path collapses the distribution to a single data point when \(\sigma_{\min}=0\), which is the root cause of narrow FQL coverage. FINO injects time-dependent Gaussian noise \(\epsilon_t \sim \mathcal{N}(0, \alpha_t^2 I)\) along the interpolation path, rewriting the training objective as:
The noise magnitude is set by \(\alpha_t^2 = (\eta^2 - 2\eta)t^2 + 2\eta t\), so \(\alpha_0 = 0\) and \(\alpha_1 = \eta\). \(\eta \in [0,1]\) is the sole control knob; when \(\eta=0\), the objective reverts to standard flow matching. The modification is backed by three theoretical results: Theorem 1 shows that the noise-injected path still defines a valid continuous normalizing flow; Theorem 2 proves that the variance of the marginal probability path induced by FINO is no smaller than that of standard FM, i.e., \(\text{Var}(X_t^{\text{FINO}}) \geq \text{Var}(X_t^{\text{FM}})\), so the policy covers a broader region of the action space, which is precisely the diversity needed for online exploration; and Proposition 1 establishes that the resulting path forms a legitimate distribution.
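The schedule can be checked numerically with a minimal NumPy sketch. The interpolant form \(x_t = (1-t)x_0 + t x_1 + \epsilon_t\) and the helper names `alpha_sq` and `noisy_interpolant` are assumptions made for illustration, not the paper's code. Under that assumption, the conditional variance \((1-t)^2 + \alpha_t^2\) equals \((1-(1-\eta)t)^2\), so \(\eta\) acts like a non-zero terminal standard deviation.

```python
import numpy as np

def alpha_sq(t, eta):
    """Noise variance schedule from the text: (eta^2 - 2*eta) * t^2 + 2*eta*t."""
    return (eta**2 - 2.0 * eta) * t**2 + 2.0 * eta * t

def noisy_interpolant(x0, x1, t, eta, rng):
    """ASSUMED form: linear interpolation plus eps_t ~ N(0, alpha_t^2 I)."""
    eps = rng.normal(size=x1.shape) * np.sqrt(alpha_sq(t, eta))
    return (1.0 - t) * x0 + t * x1 + eps

rng = np.random.default_rng(0)
eta = 0.1                              # value reported in the text
x1 = np.array([0.3, -0.5])             # a dataset action (illustrative values)
x0 = rng.normal(size=x1.shape)         # base Gaussian sample
x_t = noisy_interpolant(x0, x1, t=0.5, eta=eta, rng=rng)

# Conditional variance of x_t given x1 under the assumed interpolant:
# (1 - t)^2 from the base sample plus alpha_t^2 from the injected noise.
ts = np.linspace(0.0, 1.0, 11)
cond_var = (1.0 - ts) ** 2 + alpha_sq(ts, eta)
closed_form = (1.0 - (1.0 - eta) * ts) ** 2
print(np.allclose(cond_var, closed_form))
```

With `eta = 0` the injected noise vanishes for every `t`, which matches the statement that the objective reverts to standard flow matching.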
2. Entropy-Guided Sampling: Self-Adaptive Temperature
Expanding coverage offline is insufficient; one must utilize this diversity during online fine-tuning. During the online phase, FINO samples multiple candidate actions and constructs a softmax sampling distribution based on Q-values:
Crucially, the temperature \(\xi\) is not fixed but updated from the policy entropy \(\mathcal{H}\) relative to a reference level \(\bar{\mathcal{H}}\): \(\xi_{\text{new}} = \xi - \alpha_\xi[\mathcal{H} - \bar{\mathcal{H}}]\). A fixed temperature cannot track the learning dynamics: the policy is highly stochastic early in training and becomes concentrated as it converges. With entropy as the signal, when the policy entropy is above the reference (candidate actions are already diverse), the update lowers \(\xi\), sharpening the softmax toward high-Q actions and favoring exploitation; when the entropy is below the reference (actions are concentrated), the update raises \(\xi\), flattening the softmax and favoring exploration. This balances the trade-off automatically, without manual temperature tuning.
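The mechanism can be sketched in a few lines of NumPy. The softmax form (Q-values divided by \(\xi\)), the floor `xi_min`, and the function names are assumptions made for illustration; only the update rule \(\xi - \alpha_\xi[\mathcal{H} - \bar{\mathcal{H}}]\) is taken from the text.

```python
import numpy as np

def softmax_probs(q_values, xi):
    """Softmax over candidate Q-values. ASSUMES xi divides Q (larger xi = flatter)."""
    z = (q_values - np.max(q_values)) / xi   # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def update_temperature(xi, entropy, ref_entropy, lr, xi_min=1e-3):
    """xi_new = xi - lr * (H - H_bar) as in the text; xi_min is an added safeguard."""
    return max(xi - lr * (entropy - ref_entropy), xi_min)

rng = np.random.default_rng(0)
q = np.array([1.0, 2.0, 0.5, 1.8])           # Q-values of four candidate actions
p_flat = softmax_probs(q, xi=5.0)            # high temperature: near-uniform
p_sharp = softmax_probs(q, xi=0.1)           # low temperature: near-greedy
idx = rng.choice(len(q), p=p_sharp)          # index of the action to execute

# Entropy above the reference lowers xi; entropy below it raises xi.
xi_after_high_entropy = update_temperature(1.0, entropy=2.0, ref_entropy=1.0, lr=0.1)
xi_after_low_entropy = update_temperature(1.0, entropy=0.2, ref_entropy=1.0, lr=0.1)
```

Under the assumed form, a lower `xi` concentrates probability on the highest-Q candidate, while a higher `xi` spreads it across candidates.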
3. Entropy Estimation via GMM: Addressing One-step Policy Non-differentiability
Entropy-guided sampling requires the policy entropy. However, FINO uses a one-step policy (obtained via distillation and Q-optimization) for online inference, and this policy has no analytical density, so its entropy cannot be computed in closed form. FINO instead samples multiple actions from the same state, fits a Gaussian Mixture Model (GMM) to these samples, and estimates the entropy from the fitted mixture, which sidesteps the need for an analytical form. Two engineering constants are fixed here: the noise magnitude \(\eta=0.1\) (chosen with respect to the action range \([-1,1]\)) and the number of candidate actions, set to half the action dimension.
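A minimal sketch of sample-based entropy estimation follows. It assumes a diagonal-covariance GMM fitted by a few EM iterations and a Monte Carlo estimate of \(-\mathbb{E}[\log p(a)]\); the component count, sample sizes, and all function names are illustrative choices rather than the paper's settings.

```python
import numpy as np

def component_log_joint(x, weights, means, variances):
    """log(w_k) + log N(x | mean_k, diag(var_k)) for every sample/component pair."""
    d = x.shape[1]
    diff_sq = (x[:, None, :] - means[None, :, :]) ** 2                 # (n, k, d)
    log_norm = -0.5 * (d * np.log(2.0 * np.pi) + np.log(variances).sum(axis=1))
    return np.log(weights) + log_norm - 0.5 * (diff_sq / variances[None]).sum(axis=2)

def logsumexp_rows(a):
    """Numerically stable log-sum-exp over the component axis."""
    top = a.max(axis=1, keepdims=True)
    return (top + np.log(np.exp(a - top).sum(axis=1, keepdims=True)))[:, 0]

def fit_gmm_diag(x, k, iters=50, seed=0, floor=1e-6):
    """Fit a k-component diagonal GMM with plain EM (illustrative, not tuned)."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    means = x[rng.choice(n, size=k, replace=False)].copy()
    variances = np.tile(x.var(axis=0) + floor, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        joint = component_log_joint(x, weights, means, variances)
        resp = np.exp(joint - logsumexp_rows(joint)[:, None])          # E-step
        nk = resp.sum(axis=0) + 1e-12                                  # M-step
        weights = nk / nk.sum()
        means = (resp.T @ x) / nk[:, None]
        variances = np.maximum((resp.T @ x**2) / nk[:, None] - means**2, floor)
    return weights, means, variances

def gmm_entropy(weights, means, variances, n_samples=4000, seed=0):
    """Monte Carlo estimate of H = -E[log p(a)] under the fitted mixture."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(weights), size=n_samples, p=weights)
    noise = rng.normal(size=(n_samples, means.shape[1]))
    samples = means[ks] + noise * np.sqrt(variances[ks])
    log_p = logsumexp_rows(component_log_joint(samples, weights, means, variances))
    return -log_p.mean()

rng = np.random.default_rng(1)
tight = 0.05 * rng.normal(size=(512, 2))     # stand-in for a concentrated policy
wide = 0.5 * rng.normal(size=(512, 2))       # stand-in for a diverse policy
h_tight = gmm_entropy(*fit_gmm_diag(tight, k=2))
h_wide = gmm_entropy(*fit_gmm_diag(wide, k=2))
```

This is a differential entropy, so it can be negative for concentrated policies; only its value relative to the reference level matters for the temperature update.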
Loss & Training
- Offline Stage: Simultaneously train the flow policy (Eq. 7 of the paper), the one-step policy (Eq. 5 of the paper), and the Q-network (TD loss).
- Online Stage: Continue optimizing the three networks, updating temperature \(\xi\) every \(N_\xi\) steps.
- Inference: Deterministically select the action with the highest Q-value.
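The schedule in the list above can be outlined as a small skeleton. This is a sketch under stated assumptions: `run_online`, `greedy_action`, the `entropy_fn` callback, and the floor `xi_min` are hypothetical names and safeguards, and the network updates are elided.

```python
import numpy as np

def greedy_action(candidates, q_values):
    """Inference rule from the text: pick the candidate with the highest Q-value."""
    return candidates[int(np.argmax(q_values))]

def run_online(num_steps, n_xi, xi, lr_xi, ref_entropy, entropy_fn, xi_min=1e-3):
    """Skeleton of the online stage; xi is refreshed every n_xi steps."""
    history = []
    for step in range(1, num_steps + 1):
        # ... update flow policy, one-step policy, and Q-network here ...
        if step % n_xi == 0:
            xi = max(xi - lr_xi * (entropy_fn(step) - ref_entropy), xi_min)
            history.append((step, xi))
    return xi, history

candidates = np.array([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.0]])
q_values = np.array([0.2, 1.5, -0.3])
best = greedy_action(candidates, q_values)   # second candidate has the highest Q

# Constant entropy of 2.0 against a reference of 1.0 lowers xi at steps 5 and 10.
final_xi, history = run_online(10, 5, 1.0, 0.1, 1.0, entropy_fn=lambda step: 2.0)
```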
Key Experimental Results
Main Results
Evaluated across 45 tasks in OGBench and D4RL, averaged over 10 random seeds:
| Task Category | Metric | Ours (FINO) | Prev. SOTA (FQL) | Cal-QL | Gain |
|---|---|---|---|---|---|
| OGBench humanoidmaze-medium | Score after Online FT | Best | 3±3 | 0±0 | Significant |
| D4RL antmaze (Avg. 6 tasks) | Score after Online FT | Best | Second | - | Stable |
| D4RL adroit (Avg. 4 tasks) | Score after Online FT | Best | Second | - | Stable |
| OGBench (Avg. 5 tasks) | Score after Online FT | Best | Second | - | Significant |
Ablation Study
| Configuration | Key Performance | Note |
|---|---|---|
| Full FINO | Best | Noise Injection + Entropy-Guided Sampling |
| w/o Noise Injection (η=0) | Significant Drop | Training objective reverts to standard flow matching, as in FQL |
| w/o Entropy-Guided | Drop | Fixed temperature cannot adapt |
| Noise Injection Only | Moderate | Lacks exploration-exploitation balance online |
Key Findings
- In the antmaze-giant task, FINO discovers multiple routes to the goal (including the lower path), whereas FQL only takes the upper path.
- Noise injection causes the policy to cover a visibly wider area around data points in toy experiments while remaining centered on the data.
- Advantages are more pronounced in high-dimensional action space tasks (e.g., humanoidmaze) because the exploration provided by noise injection is more critical in higher dimensions.
Highlights & Insights
- Theoretical Guarantees for Noise Injection: Three results (two theorems and one proposition) rigorously establish the validity of noise injection: it maintains a valid flow (Theorem 1), expands coverage (Theorem 2), and forms a legitimate distribution (Proposition 1).
- Offline-for-Online Design Philosophy: Rather than treating offline training independently, the method prepares for subsequent online fine-tuning from the start (retaining diversity via noise). This proactive design is noteworthy.
- Self-Adaptive Temperature: The scheme is simple and elegant, using policy entropy as a closed-loop signal to adjust the exploration-exploitation trade-off without manual parameter tuning.
Limitations & Future Work
- \(\eta\) is globally set to 0.1; the optimal noise magnitude for different tasks or states might vary.
- Entropy estimation depends on GMM fitting; GMM accuracy might decrease in extremely high-dimensional action spaces.
- The heuristic of setting the number of candidate actions to half the action dimension lacks theoretical basis.
- Experiments focus mainly on locomotion and navigation; beyond the D4RL adroit tasks, performance on fine-grained manipulation remains unverified.
Related Work & Insights
- vs FQL: FQL uses standard flow matching for the policy. FINO injects noise into the training objective to expand coverage. FINO significantly outperforms FQL after online fine-tuning, especially in tasks requiring exploration.
- vs Cal-QL/RLPD: These methods do not use generative models as policies. FINO leverages the representational power of flow matching to be more competitive in complex tasks.
- Transferable Idea: The concept of noise injection to expand distribution coverage can be transferred to other generative model scenarios, such as diffusion model fine-tuning or diversity enhancement in conditional generation.
Rating
- Novelty: ⭐⭐⭐⭐ The idea of noise-injected flow matching is novel with solid theoretical analysis.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 45 tasks, 10 seeds, multiple baselines, and detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ Clear motivation, complete theoretical derivation, supported by extensive appendices.
- Value: ⭐⭐⭐⭐ Provides a practical and theoretically grounded solution for offline-to-online RL.