
Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning

Conference: ICLR 2026 arXiv: 2602.18117 Code: GitHub Area: Flow Matching / Reinforcement Learning Keywords: Flow Matching, Offline-to-Online RL, Noise Injection, Exploration-Exploitation Balance, Entropy-Guided Sampling

TL;DR

By injecting controllable noise into flow matching training to broaden policy coverage, and combining an entropy-guided sampling mechanism to dynamically balance exploration and exploitation during online fine-tuning, FINO significantly improves sample efficiency in offline-to-online RL under limited interaction budgets.

Background & Motivation

Background: Generative models (diffusion / flow matching) excel as policy representations in offline RL due to their ability to model multimodal distributions. Flow Q-Learning (FQL) has demonstrated the effectiveness of flow-matching policies in offline settings.

Limitations of Prior Work: Policies pre-trained offline are over-constrained to the dataset distribution, resulting in insufficient exploration during online fine-tuning. Existing methods such as FQL treat online fine-tuning as a straightforward continuation of offline pre-training without dedicated exploration mechanisms. In the antmaze-giant task, the FQL agent almost exclusively follows the upper path present in the dataset to reach the goal, completely ignoring other viable routes.

Key Challenge: Offline RL requires conservative constraints to avoid out-of-distribution actions, whereas the online phase demands broad exploration beyond data coverage. These two phases impose fundamentally opposing requirements on the policy distribution.

Goal: How can a policy learn action coverage broader than the dataset during offline pre-training—without augmenting the dataset—and effectively leverage this diversity during online fine-tuning?

Key Insight: Standard flow matching with \(\sigma_{\min}=0\) collapses the conditional probability path onto individual data points, limiting coverage. Injecting controlled noise into flow matching can expand the variance of the conditional probability path.

Core Idea: Inject noise into the flow matching objective to enlarge the policy support set, and employ entropy-guided sampling to adaptively balance exploration and exploitation during the online phase.

Method

Overall Architecture

FINO (Flow matching with Injected Noise for Offline-to-online RL) builds on the FQL framework and comprises two core components: (1) noise-injected flow matching training during offline pre-training, and (2) an entropy-guided sampling mechanism during online fine-tuning. The input is an offline dataset of state-action pairs; the output is a policy capable of efficient exploration during online interaction.

Key Designs

  1. Noise Injection for Flow Matching:

    • Function: Injects time-dependent Gaussian noise \(\epsilon_t \sim \mathcal{N}(0, \alpha_t^2 I)\) along the interpolation path during flow matching training.
    • Mechanism: The training objective is modified to \(\mathcal{L}_{\text{FINO}}(\theta) = \mathbb{E}[\|v_\theta(t, s, x_t + \epsilon_t) - (x_1 - (1-\eta)x_0)\|^2_2]\), where \(\alpha_t^2 = (\eta^2 - 2\eta)t^2 + 2\eta t\) and \(\eta \in [0,1]\) controls noise magnitude. Setting \(\eta=0\) recovers standard flow matching.
    • Design Motivation: Theorem 2 proves that the marginal probability path variance induced by FINO is no less than that of standard FM, i.e., \(\text{Var}(X_t^{\text{FINO}}) \geq \text{Var}(X_t^{\text{FM}})\), enabling the policy to cover a broader action space. Theorem 1 guarantees that the noise-injected model still constitutes a valid continuous normalizing flow.
  2. Entropy-Guided Sampling:

    • Function: Samples multiple candidate actions from the policy during online fine-tuning, constructs a softmax sampling distribution based on Q-values, and adaptively adjusts the temperature parameter.
    • Mechanism: The sampling probability is \(p(i) = \frac{\exp(\xi \cdot Q_\phi(s, a_i))}{\sum_j \exp(\xi \cdot Q_\phi(s, a_j))}\), with the inverse temperature \(\xi\) updated adaptively according to policy entropy: \(\xi_{\text{new}} = \xi + \alpha_\xi[\mathcal{H} - \bar{\mathcal{H}}]\), so that entropy above the target \(\bar{\mathcal{H}}\) raises \(\xi\) and entropy below it lowers \(\xi\).
    • Design Motivation: A fixed temperature cannot adapt to the dynamic changes during learning. When policy entropy is high, \(\xi\) is increased to favor exploitation; when entropy is low, \(\xi\) is decreased to encourage exploration, achieving automatic balance.
  3. Implementation Details:

    • Function: Addresses the intractable entropy estimation problem for one-step policy distributions.
    • Mechanism: Multiple actions are sampled from the same state and a Gaussian Mixture Model (GMM) is fitted to estimate entropy. \(\eta=0.1\) (calibrated to the action range \([-1,1]\)); the number of candidate actions is set to half the action dimensionality.
    • Design Motivation: The one-step policy is obtained via distillation and Q-value optimization, so its distribution cannot be used to compute entropy directly.
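The noise-injected objective above can be sketched in a few lines. Below is a hypothetical NumPy illustration of one training step: it builds the interpolant \(x_t\), draws \(\epsilon_t\) with the paper's variance schedule \(\alpha_t^2 = (\eta^2 - 2\eta)t^2 + 2\eta t\), and regresses the velocity model onto the modified target \(x_1 - (1-\eta)x_0\). The function `v` is a stand-in for the paper's network, and all names are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def fino_loss(v, s, x1, eta=0.1):
    """One noise-injected flow-matching step (hypothetical sketch, not the paper's code).

    v   -- velocity model, v(t, s, x) -> array shaped like x
    s   -- batch of states, shape (B, state_dim)
    x1  -- batch of dataset actions, shape (B, act_dim)
    eta -- noise-injection coefficient in [0, 1]; eta = 0 recovers standard FM
    """
    B, d = x1.shape
    t = rng.uniform(size=(B, 1))                  # flow time ~ U[0, 1]
    x0 = rng.standard_normal((B, d))              # source noise sample
    xt = (1 - t) * x0 + t * x1                    # linear interpolation path
    # Injected-noise variance: alpha_t^2 = (eta^2 - 2*eta) t^2 + 2*eta*t
    # (zero at t = 0, eta^2 at t = 1, and nonnegative on [0, 1]).
    alpha2 = (eta**2 - 2 * eta) * t**2 + 2 * eta * t
    eps = np.sqrt(alpha2) * rng.standard_normal((B, d))
    target = x1 - (1 - eta) * x0                  # modified regression target
    pred = v(t, s, xt + eps)                      # velocity at the noised point
    return np.mean(np.sum((pred - target) ** 2, axis=1))
```

Setting `eta=0` zeroes `alpha2`, so `eps` vanishes and the loss reduces to the standard flow-matching objective, matching the recovery claim above.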

Loss & Training

  • Offline phase: Jointly trains the flow policy (Eq. 7), the one-step policy (Eq. 5), and the Q-network (TD loss).
  • Online phase: Continues optimizing all three networks; temperature \(\xi\) is updated every \(N_\xi\) steps.
  • Inference: Deterministically selects the action with the highest Q-value.
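The entropy-guided sampling step and the temperature update can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: `entropy_guided_action` applies the softmax over Q-values of the K candidates, and `update_xi` follows the described behavior (entropy above the target pushes \(\xi\) up toward exploitation, entropy below pushes it down toward exploration). The GMM-based entropy estimate is assumed to be supplied externally.

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy_guided_action(q_values, xi):
    """Select one of K candidate actions via a softmax over their Q-values.

    q_values -- Q(s, a_i) for each candidate, shape (K,)
    xi       -- inverse temperature; larger xi is greedier (exploitation)
    Returns (chosen index, sampling probabilities).
    """
    logits = xi * (q_values - q_values.max())     # shift for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(q_values), p=p), p

def update_xi(xi, entropy, target_entropy, lr=1e-3):
    """Adaptive temperature update (illustrative sign convention):
    high policy entropy raises xi, low entropy lowers it."""
    return xi + lr * (entropy - target_entropy)
```

In a fine-tuning loop, `update_xi` would be called every \(N_\xi\) steps with the GMM-estimated policy entropy, closing the loop between measured entropy and greediness of the softmax.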

Key Experimental Results

Main Results

Evaluated on 45 tasks across OGBench and D4RL, averaged over 10 random seeds:

| Task Category | Metric | FINO | FQL | Cal-QL | Gain |
|---|---|---|---|---|---|
| OGBench humanoidmaze-medium | Score after online fine-tuning | Best | 3±3 | 0±0 | Significant |
| D4RL antmaze (avg. 6 tasks) | Score after online fine-tuning | Best | 2nd | — | Consistent |
| D4RL adroit (avg. 4 tasks) | Score after online fine-tuning | Best | 2nd | — | Consistent |
| OGBench (avg. 5 tasks) | Score after online fine-tuning | Best | 2nd | — | Significant |

Ablation Study

| Configuration | Key Performance | Note |
|---|---|---|
| Full FINO | Best | Noise injection + entropy-guided sampling |
| w/o noise injection (\(\eta=0\)) | Notable drop | Degenerates to standard FQL |
| w/o entropy guidance | Drop | Fixed temperature cannot adapt |
| Noise injection only | Moderate | Lacks exploration-exploitation balance in the online phase |

Key Findings

  • In the antmaze-giant task, FINO discovers multiple routes to the goal (including the lower path), whereas FQL exclusively follows the upper path.
  • Noise injection causes the policy to cover a substantially broader region around data points in toy experiments, while remaining data-centered.
  • The advantage is more pronounced in high-dimensional action space tasks (e.g., humanoidmaze), where the exploration provided by noise injection is more critical.

Highlights & Insights

  • Theoretical guarantees for noise injection: Three theoretical results collectively justify the noise-injection design: the noised model remains a valid flow (Theorem 1), coverage is expanded (Theorem 2), and the result is a proper probability distribution (Proposition 1). The theoretical foundation is rigorous and empirically effective.
  • Offline-serves-online design philosophy: Rather than treating offline training in isolation, the method prepares for subsequent online fine-tuning from the outset by injecting noise to retain diversity. This forward-looking design philosophy is worth emulating.
  • Adaptive temperature scheme is concise and elegant, using policy entropy as a closed-loop signal to regulate the exploration-exploitation trade-off without manual tuning.

Limitations & Future Work

  • \(\eta\) is uniformly set to 0.1, without accounting for potentially different optimal noise magnitudes across tasks or states.
  • Entropy estimation relies on GMM fitting, whose accuracy may degrade in extremely high-dimensional action spaces.
  • Setting the number of candidate actions to half the action dimensionality is a heuristic rule lacking theoretical justification.
  • Experiments are conducted primarily on locomotion and navigation tasks; effectiveness on fine-grained manipulation tasks remains unverified.
Comparison with Related Methods

  • vs. FQL: FQL uses standard flow matching directly as the policy; FINO injects noise into the training objective to expand coverage. FINO significantly outperforms FQL after online fine-tuning, particularly on tasks requiring exploration.
  • vs. Cal-QL / RLPD: These methods do not employ generative models as policies. FINO leverages the expressive power of flow matching to achieve greater competitiveness on complex tasks.
  • Transferable idea: The concept of enlarging distributional coverage via noise injection is transferable to other generative modeling scenarios, such as diffusion model fine-tuning and diversity enhancement in conditional generation.

Rating

  • Novelty: ⭐⭐⭐⭐ — The idea of noise-injected flow matching is original and the theoretical analysis is rigorous.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 45 tasks, 10 seeds, multiple baselines, detailed ablations.
  • Writing Quality: ⭐⭐⭐⭐ — Motivation is clear, theoretical derivations are complete, and extensive appendices provide additional support.
  • Value: ⭐⭐⭐⭐ — Provides a practical, theoretically grounded solution for offline-to-online RL.