Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning

Conference: ICLR 2026
arXiv: 2602.18117
Code: GitHub
Area: Flow Matching / Reinforcement Learning
Keywords: Flow Matching, Offline-to-Online RL, Noise Injection, Exploration-Exploitation Balance, Entropy-Guided Sampling

TL;DR

FINO injects controllable noise into flow matching training to broaden policy coverage and uses an entropy-guided sampling mechanism to balance exploration and exploitation dynamically during online fine-tuning. Together, these significantly improve sample efficiency in offline-to-online RL under limited interaction budgets.

Background & Motivation

Background: Generative models (diffusion / flow matching) excel as policy representations in offline RL due to their ability to model multimodal distributions. Flow Q-Learning (FQL) has demonstrated the effectiveness of flow-matching policies in offline settings.

Limitations of Prior Work: Policies pre-trained offline are over-constrained to the dataset distribution, resulting in insufficient exploration during online fine-tuning. Existing methods such as FQL treat online fine-tuning as a straightforward continuation of offline pre-training without dedicated exploration mechanisms. In the antmaze-giant task, the FQL agent almost exclusively follows the upper path present in the dataset to reach the goal, completely ignoring other viable routes.

Key Challenge: Offline RL requires conservative constraints to avoid out-of-distribution actions, whereas the online phase demands broad exploration beyond data coverage. These two phases impose fundamentally opposing requirements on the policy distribution.

Goal: How can a policy learn action coverage broader than the dataset during offline pre-training—without augmenting the dataset—and effectively leverage this diversity during online fine-tuning?

Key Insight: Standard flow matching with \(\sigma_{\min}=0\) collapses the conditional probability path onto individual data points at \(t=1\), which limits coverage. Injecting controlled noise into flow matching expands the variance of the conditional probability path.
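
To make the collapse concrete, the worked step below uses the standard Gaussian conditional path of flow matching with a general \(\sigma_{\min}\). Writing the path in this form is an assumption about the setup rather than a quotation from the paper; \(x_1\) denotes a data point.

\[
\begin{aligned}
p_t(x \mid x_1) &= \mathcal{N}\big(x;\; t\,x_1,\; (1-(1-\sigma_{\min})t)^2 I\big), \\
(1-(1-\sigma_{\min})t)^2 - (1-t)^2 &= (\sigma_{\min}^2 - 2\sigma_{\min})\,t^2 + 2\sigma_{\min}\,t .
\end{aligned}
\]

At \(t=1\) the conditional standard deviation equals \(\sigma_{\min}\), so \(\sigma_{\min}=0\) leaves a point mass at \(x_1\). The second line is the extra variance gained from a nonzero \(\sigma_{\min}\); it has the same form as the noise variance \(\alpha_t^2\) in the Method section with \(\eta\) in place of \(\sigma_{\min}\).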

Core Idea: Inject noise into the flow matching objective to enlarge the policy support set, and employ entropy-guided sampling to adaptively balance exploration and exploitation during the online phase.

Method

Overall Architecture

FINO (Flow matching with Injected Noise for Offline-to-online RL) builds on the FQL framework and comprises two core components: (1) noise-injected flow matching training during offline pre-training, and (2) an entropy-guided sampling mechanism during online fine-tuning. The inputs are state-action pair datasets; the output is a policy capable of efficient exploration in online interaction.

Key Designs

Noise Injection for Flow Matching:
- Function: Injects time-dependent Gaussian noise \(\epsilon_t \sim \mathcal{N}(0, \alpha_t^2 I)\) along the interpolation path during flow matching training.
- Mechanism: The training objective is modified to \(\mathcal{L}_{\text{FINO}}(\theta) = \mathbb{E}[\|v_\theta(t, s, x_t + \epsilon_t) - (x_1 - (1-\eta)x_0)\|^2_2]\), where \(x_t = (1-t)x_0 + t x_1\) interpolates between a base noise sample \(x_0\) and a dataset action \(x_1\), \(\alpha_t^2 = (\eta^2 - 2\eta)t^2 + 2\eta t\), and \(\eta \in [0,1]\) controls the noise magnitude. Setting \(\eta=0\) gives \(\alpha_t = 0\) and recovers standard flow matching.
- Design Motivation: Theorem 2 proves that the marginal probability path variance induced by FINO is no less than that of standard FM, i.e., \(\text{Var}(X_t^{\text{FINO}}) \geq \text{Var}(X_t^{\text{FM}})\), enabling the policy to cover a broader action space. Theorem 1 guarantees that the noise-injected model still constitutes a valid continuous normalizing flow.
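
The noise-injected objective above can be sketched for a single sample as follows. This is a minimal illustration in plain Python, not the authors' implementation: it assumes the linear interpolation \(x_t = (1-t)x_0 + t x_1\), and `velocity_fn`, `fino_training_pair`, and `fino_loss` are hypothetical names introduced here.

```python
import math


def alpha_sq(t, eta):
    """Variance of the injected noise at time t: (eta^2 - 2*eta) * t^2 + 2*eta*t."""
    return (eta ** 2 - 2.0 * eta) * t ** 2 + 2.0 * eta * t


def fino_training_pair(x0, x1, t, eta, rng):
    """Build (noisy_input, regression_target) for one training sample.

    x0: base noise sample, x1: dataset action (lists of floats).
    rng: a random.Random instance used to draw the injected noise.
    Assumes the linear interpolation x_t = (1 - t) * x0 + t * x1.
    """
    std = math.sqrt(max(alpha_sq(t, eta), 0.0))  # clamp guards against rounding
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    noisy = [x + rng.gauss(0.0, std) for x in x_t]
    target = [b - (1.0 - eta) * a for a, b in zip(x0, x1)]
    return noisy, target


def fino_loss(velocity_fn, state, x0, x1, t, eta, rng):
    """Squared error between the predicted velocity and the modified target.

    velocity_fn(t, state, x) is a hypothetical stand-in for v_theta.
    """
    noisy, target = fino_training_pair(x0, x1, t, eta, rng)
    pred = velocity_fn(t, state, noisy)
    return sum((p - g) ** 2 for p, g in zip(pred, target))
```

With `eta = 0` the injected noise vanishes and the target becomes `x1 - x0`, the usual flow matching regression target. Because the injected noise is drawn independently of \(x_t\), it adds \(\alpha_t^2\) to the variance of the network input, which gives an intuition for the variance inequality stated in Theorem 2.
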
Entropy-Guided Sampling:
- Function: Samples multiple candidate actions from the policy during online fine-tuning, constructs a softmax sampling distribution based on Q-values, and adaptively adjusts the temperature parameter.
- Mechanism: The sampling probability is \(p(i) = \frac{\exp(\xi \cdot Q_\phi(s, a_i))}{\sum_j \exp(\xi \cdot Q_\phi(s, a_j))}\), where \(\xi\) acts as an inverse temperature: a larger \(\xi\) concentrates probability on high-Q candidates. It is updated according to the policy entropy \(\mathcal{H}\) relative to a target \(\bar{\mathcal{H}}\): \(\xi_{\text{new}} = \xi + \alpha_\xi[\mathcal{H} - \bar{\mathcal{H}}]\).
- Design Motivation: A fixed temperature cannot adapt as the policy changes during learning. When policy entropy is above the target, \(\xi\) is increased to favor exploitation; when it is below the target, \(\xi\) is decreased to encourage exploration, so the balance is maintained automatically.
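
The sampling rule and the adaptive temperature can be sketched as below. This is a hedged illustration in plain Python, not the paper's code. The sign of the update is chosen so that \(\xi\) rises when the entropy estimate exceeds the target, matching the behaviour described in the design motivation. The lower clamp `xi_min` and all function names are assumptions introduced here.

```python
import math


def softmax_probs(q_values, xi):
    """Softmax over xi * Q, computed stably by subtracting the maximum logit."""
    logits = [xi * q for q in q_values]
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(candidates, q_values, xi, rng):
    """Draw one candidate action with probability proportional to exp(xi * Q).

    rng is a random.Random instance.
    """
    probs = softmax_probs(q_values, xi)
    idx = rng.choices(range(len(candidates)), weights=probs, k=1)[0]
    return candidates[idx]


def update_xi(xi, entropy, target_entropy, lr, xi_min=0.0):
    """Raise xi when entropy exceeds the target, lower it otherwise.

    The clamp at xi_min is an assumption added so that xi stays non-negative.
    """
    return max(xi_min, xi + lr * (entropy - target_entropy))
```

With `xi = 0` every candidate is equally likely (pure exploration among the candidates), and as `xi` grows the choice approaches the greedy argmax over Q-values.
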
Implementation Details:
- Function: Addresses the intractable entropy estimation problem for one-step policy distributions.
- Mechanism: Multiple actions are sampled for the same state and a Gaussian Mixture Model (GMM) is fitted to them to estimate entropy. The noise magnitude is set to \(\eta=0.1\) (calibrated to the action range \([-1,1]\)), and the number of candidate actions is set to half the action dimensionality.
- Design Motivation: The one-step policy is obtained via distillation and Q-value optimization, so its distribution cannot be used to compute entropy directly.
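
The entropy of a GMM has no closed form, so one common approach is a Monte Carlo estimate: average the negative log-density of the sampled actions under the fitted mixture. The sketch below shows only this estimation step and assumes the mixture has already been fitted (for example by EM). The diagonal-covariance parameterization and all names are assumptions for illustration; the paper's exact estimator may differ.

```python
import math


def log_normal_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (value - m) ** 2 / v)
        for value, m, v in zip(x, mean, var)
    )


def gmm_log_density(x, weights, means, variances):
    """Log-density of the mixture at x, using log-sum-exp for stability."""
    terms = [
        math.log(w) + log_normal_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def gmm_entropy_estimate(samples, weights, means, variances):
    """Monte Carlo entropy estimate: -mean(log p(a_i)) over sampled actions."""
    total = sum(gmm_log_density(x, weights, means, variances) for x in samples)
    return -total / len(samples)
```

For a single standard normal component in one dimension, the estimate approaches \(\tfrac{1}{2}\log(2\pi e)\) as the number of samples grows, which is a convenient sanity check.
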

Loss & Training

Offline phase: Jointly trains the flow policy (Eq. 7), the one-step policy (Eq. 5), and the Q-network (TD loss).
Online phase: Continues optimizing all three networks; temperature \(\xi\) is updated every \(N_\xi\) steps.
Inference: Deterministically selects the candidate action with the highest Q-value.
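
The difference between online interaction and inference comes down to how one of the candidate actions is chosen. The snippet below sketches the greedy inference rule and the periodic temperature schedule. Both helper names are hypothetical, and skipping the update at step 0 is an assumption made here.

```python
def greedy_action(candidates, q_values):
    """Inference rule: return the candidate with the highest Q-value."""
    best = max(range(len(candidates)), key=lambda i: q_values[i])
    return candidates[best]


def should_update_xi(step, n_xi):
    """True on every n_xi-th online step (step 0 is skipped by assumption)."""
    return step > 0 and step % n_xi == 0
```

For example, `greedy_action(["a", "b", "c"], [0.1, 0.9, 0.3])` picks `"b"`, whereas online fine-tuning would draw from the softmax over the same Q-values.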

Key Experimental Results

Main Results

Evaluated on 45 tasks across OGBench and D4RL, averaged over 10 random seeds:

Task Category	Metric	FINO	FQL	Cal-QL	Gain
OGBench humanoidmaze-medium	Score after online fine-tuning	Best	3±3	0±0	Significant
D4RL antmaze (avg. 6 tasks)	Score after online fine-tuning	Best	2nd	—	Consistent
D4RL adroit (avg. 4 tasks)	Score after online fine-tuning	Best	2nd	—	Consistent
OGBench (avg. 5 tasks)	Score after online fine-tuning	Best	2nd	—	Significant

Ablation Study

Configuration	Key Performance	Note
Full FINO	Best	Noise injection + entropy-guided sampling
w/o noise injection (\(\eta=0\))	Notable drop	Flow training reduces to standard flow matching, as in FQL
w/o entropy guidance	Drop	Fixed temperature cannot adapt
Noise injection only	Moderate	Lacks exploration-exploitation balance in online phase

Key Findings

In the antmaze-giant task, FINO discovers multiple routes to the goal (including the lower path), whereas FQL exclusively follows the upper path.
In toy experiments, noise injection makes the policy cover a substantially broader region around the data points while remaining centered on them.
The advantage is more pronounced in high-dimensional action space tasks (e.g., humanoidmaze), where the exploration provided by noise injection is more critical.

Highlights & Insights

Theoretical guarantees for noise injection: Two theorems and one proposition jointly justify the noise injection design by showing that the model remains a valid flow (Theorem 1), that coverage expands (Theorem 2), and that the result is a proper distribution (Proposition 1). The theory is rigorous, and the experiments bear it out.
Offline-serves-online design philosophy: Rather than treating offline training in isolation, the method prepares for subsequent online fine-tuning from the outset by injecting noise to retain diversity. This forward-looking design philosophy is worth emulating.
Adaptive temperature scheme: Concise and elegant, it uses policy entropy as a closed-loop signal to regulate the exploration-exploitation trade-off, so the temperature itself needs no manual schedule.

Limitations & Future Work

\(\eta\) is uniformly set to 0.1, without accounting for potentially different optimal noise magnitudes across tasks or states.
Entropy estimation relies on GMM fitting, whose accuracy may degrade in extremely high-dimensional action spaces.
Setting the number of candidate actions to half the action dimensionality is a heuristic rule lacking theoretical justification.
Experiments are conducted primarily on locomotion and navigation tasks; manipulation is represented mainly by the D4RL adroit tasks, so effectiveness on a broader range of fine-grained manipulation tasks remains unverified.

Related Work & Insights

vs. FQL: FQL uses standard flow matching directly as the policy; FINO injects noise into the training objective to expand coverage. FINO significantly outperforms FQL after online fine-tuning, particularly on tasks requiring exploration.
vs. Cal-QL / RLPD: These methods do not employ generative models as policies. FINO leverages the expressive power of flow matching to achieve greater competitiveness on complex tasks.
Transferable idea: The concept of enlarging distributional coverage via noise injection is transferable to other generative modeling scenarios, such as diffusion model fine-tuning and diversity enhancement in conditional generation.

Rating

Novelty: ⭐⭐⭐⭐ — The idea of noise-injected flow matching is original and the theoretical analysis is rigorous.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 45 tasks, 10 seeds, multiple baselines, detailed ablations.
Writing Quality: ⭐⭐⭐⭐ — Motivation is clear, theoretical derivations are complete, and extensive appendices provide additional support.
Value: ⭐⭐⭐⭐ — Provides a practical, theoretically grounded solution for offline-to-online RL.