ICML2025 Reinforcement Learning Generative Flow Networks Ergodicity Diffeomorphism Flow-matching Imitation Learning Normalizing Flows

Ergodic Generative Flows¶

Conference: ICML2025
arXiv: 2505.03561
Code: To be confirmed
Area: Generative Flows / Reinforcement Learning / Imitation Learning
Keywords: Generative Flow Networks, Ergodicity, Diffeomorphism, Flow-matching, Imitation Learning, Normalizing Flows

TL;DR¶

This paper proposes Ergodic Generative Flows (EGFs), which construct generative flows via a finite set of global diffeomorphisms. By leveraging ergodicity, EGFs guarantee universality. A novel KL-weakFM loss is designed to enable imitation learning without requiring an independent reward model. EGFs outperform baselines on NASA Earth science datasets with a model 30 times smaller.

Background & Motivation¶

Generative Flow Networks (GFNs) were originally proposed on directed acyclic graphs for sampling from unnormalized distributions. Although subsequent works have extended them to continuous state spaces and non-acyclic structures, GFNs still face four key challenges:

Imitation learning requires an independent reward model: In IL scenarios, the density of the target distribution is unknown. Existing methods first train a separate reward model and then solve the resulting problem with RL techniques, which increases training and computational costs.

FM loss is intractable in continuous settings: Plain forward policies (such as the noise-adding kernels of diffusion models) require computing a high-dimensional integral for the star inflow \(f^*_\leftarrow\), which makes the FM loss computationally infeasible.

Acyclicity constraint: Enforcing acyclicity requires manually constructing additional structure, whereas cycles occur naturally in naive implementations and in RL environments.

0-flow instability: Divergence-based loss functions can be unstable in the presence of ergodic measures. This was a theoretical prediction that had not been verified experimentally before this work.

The proposed EGFs address these four issues in a unified manner.

Method¶

Core Definition: Ergodic Generative Flow¶

The forward policy of an EGF is composed of a finite set of diffeomorphisms \(\{\Phi_i\}_{i=1}^p\):

\[\pi^*_\rightarrow(s) = \sum_{i=1}^p \alpha^i_\rightarrow(s) \delta_{\Phi_i(s)}\]

where \(\alpha_\rightarrow: \mathcal{S} \to [0,1]^p\) is the policy network (with a softmax head), and the group of diffeomorphisms generated by \(\Phi_i\) is required to satisfy topological ergodicity: for any \(x, y \in \mathcal{S}\) and any neighborhood \(\mathcal{U}\) of \(y\), there exists a sequence of transformations such that \(x\Phi_{i_1}\Phi_{i_2}\cdots\Phi_{i_t} \in \mathcal{U}\).
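To make the definition concrete, here is a minimal sketch of one step of such a forward policy on the torus \(\mathbb{T}^2\). The four maps, the `toy_logits` stand-in for the policy network, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math
import random

# Hypothetical generators: two integer matrices of determinant 1 acting on
# the torus T^2 = [0, 1)^2, together with their inverses (p = 4).
MAPS = [
    ((1, 1), (0, 1)),
    ((1, 0), (1, 1)),
    ((1, -1), (0, 1)),   # inverse of the first map
    ((1, 0), (-1, 1)),   # inverse of the second map
]

def apply_map(m, s):
    """Apply the linear map m to the point s and wrap back onto the torus."""
    x = (m[0][0] * s[0] + m[0][1] * s[1]) % 1.0
    y = (m[1][0] * s[0] + m[1][1] * s[1]) % 1.0
    return (x, y)

def softmax(logits):
    """Numerically stable softmax, playing the role of the softmax head."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(s):
    """Stand-in for the policy network: any function S -> R^p would do."""
    return [math.sin(2 * math.pi * s[0]), math.cos(2 * math.pi * s[1]), 0.0, 0.5]

def forward_step(s, rng):
    """Draw index i with probability alpha^i(s), then move to Phi_i(s)."""
    alpha = softmax(toy_logits(s))
    i = rng.choices(range(len(MAPS)), weights=alpha, k=1)[0]
    return apply_map(MAPS[i], s), i

rng = random.Random(0)
state = (0.3, 0.7)
trajectory = [state]
for _ in range(5):
    state, _ = forward_step(state, rng)
    trajectory.append(state)
```

Each step is a deterministic diffeomorphism chosen at random, so the only stochasticity in a trajectory comes from the categorical draw over the \(p\) indices.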

Tractability of Star Inflow¶

Since only a finite number of diffeomorphisms are used, the star inflow has a closed-form formula:

\[f^*_\leftarrow(s) = \sum_{i=1}^p (\alpha^i_\rightarrow f^*_\rightarrow) \circ \Phi_i^{-1}(s) \cdot |\det J_s \Phi_i^{-1}|\]

When the number of transformations \(p\) is small, the FM loss \(\mathcal{L}^{\text{stable}}_{\text{FM}}\) is fully tractable.
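The closed form above can be evaluated directly as a sum over the \(p\) preimages. The sketch below does this on \(\mathbb{T}^2\) under assumed toy choices: the four integer matrices, the `alpha` weights, and the outflow density `f_out` are all hypothetical. For integer matrices of determinant \(\pm 1\) the Jacobian factor equals 1, but it is kept in the code to mirror the formula.

```python
import math

# Hypothetical generators: integer matrices with determinant +-1 on T^2.
MAPS = [((1, 1), (0, 1)), ((1, 0), (1, 1)), ((1, -1), (0, 1)), ((1, 0), (-1, 1))]

def inverse_on_torus(m):
    """Return (inverse matrix, |det J Phi^{-1}|) for an integer matrix with det +-1."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det not in (1, -1):
        raise ValueError("expected an integer matrix with determinant +-1")
    # For det = +-1 we have 1/det = det, so the inverse stays integer valued.
    inv = ((d * det, -b * det), (-c * det, a * det))
    return inv, 1.0 / abs(det)

def apply_map(m, s):
    """Apply the linear map m to s and wrap back onto the torus."""
    return ((m[0][0] * s[0] + m[0][1] * s[1]) % 1.0,
            (m[1][0] * s[0] + m[1][1] * s[1]) % 1.0)

def star_inflow(s, alpha, f_out):
    """Sum over i of (alpha^i * f_out)(Phi_i^{-1}(s)) * |det J Phi_i^{-1}|."""
    total = 0.0
    for i, m in enumerate(MAPS):
        inv, jac = inverse_on_torus(m)
        pre = apply_map(inv, s)
        total += alpha(pre)[i] * f_out(pre) * jac
    return total

def alpha(s):
    """Toy state-dependent weights that sum to 1 (stand-in for the policy)."""
    w = [1.0 + s[0], 1.0 + s[1], 2.0 - s[0], 2.0 - s[1]]
    z = sum(w)
    return [v / z for v in w]

def f_out(s):
    """Toy outflow density on the torus."""
    return 1.0 + 0.5 * math.cos(2 * math.pi * s[0])

# Sanity check: the maps permute the points of a regular grid, so the total
# inflow over the grid must equal the total outflow over the grid.
grid = [(i / 8, j / 8) for i in range(8) for j in range(8)]
mass_out = sum(f_out(s) for s in grid) / len(grid)
mass_in = sum(star_inflow(s, alpha, f_out) for s in grid) / len(grid)
```

The cost of one evaluation is \(p\) calls to the policy and outflow networks, which is why a small \(p\) keeps the FM loss cheap.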

Universality Theorem¶

Master Universality Theorem (Thm 3.4): If a parameterized family of EGFs contains a policy \(\pi^*_\rightarrow\) that is summably \(L^2\)-mixing, and the parameterized outflows \(f^*_\rightarrow\) are dense in \(L^2(\mathcal{S}, \lambda)\), then the family is universal.

Concrete examples:

- Torus \(\mathbb{T}^d\): Affine torus family, using two generators of \(\text{SL}_d(\mathbb{Z})\) and their inverses, i.e., \(p=4\) is sufficient to achieve universality.
- Sphere \(\mathbb{S}^d\): Isometric sphere family, using two generators of \(\text{SO}_{d+1}(\mathbb{R})\) and their inverses, also achieving universality with \(p=4\).

Quantitative Sampling Theorem (Thm 3.8)¶

For any generative flow, the sampling error satisfies:

\[\text{TV}(s_\tau \| \kappa/\kappa(\mathcal{S})) \leq \frac{\delta}{1+\delta} + \text{TV}(\hat{\kappa}/\hat{\kappa}(\mathcal{S}) \| \kappa/\kappa(\mathcal{S}))\]

where \(\delta = (F_\text{init} + F^*_\leftarrow - F^*_\rightarrow)^-(\mathcal{S})\) measures the negative part of the flow-matching defect.
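As a quick numerical reading of the bound, the sketch below evaluates its right-hand side for given values of \(\delta\) and of the total variation between the normalized virtual terminal flow and the target. The function name and the example values are illustrative assumptions; the cap at 1 only reflects the fact that a total variation distance never exceeds 1.

```python
def sampling_tv_bound(delta: float, tv_virtual: float) -> float:
    """Right-hand side of the bound: delta / (1 + delta) + TV(virtual, target)."""
    if delta < 0.0:
        raise ValueError("delta is the mass of a negative part, so it must be >= 0")
    if not 0.0 <= tv_virtual <= 1.0:
        raise ValueError("a total variation distance lies in [0, 1]")
    return min(1.0, delta / (1.0 + delta) + tv_virtual)

# Hypothetical values: a small negative FM defect and a small virtual mismatch.
bound_small = sampling_tv_bound(delta=0.05, tv_virtual=0.02)
# With no negative defect, the bound reduces to the virtual mismatch alone.
bound_exact_fm = sampling_tv_bound(delta=0.0, tv_virtual=0.02)
```

The first term grows monotonically with \(\delta\) and stays below 1, so driving the negative part of the defect to zero is enough to make the sampling error depend only on how well the virtual terminal flow matches the target.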

KL-weakFM Loss¶

The core loss function designed for IL without an independent reward model:

\[\mathcal{L}_{KL\text{-}wFM}(\theta) = b \cdot \mathbb{E}_{s \sim \nu_\text{train}} \delta f_\text{init}(s) - \mathbb{E}_{s \sim \kappa} \log \hat{f}_\text{term}(s)\]

First term (weak-FM term): Controls only the negative part of the FM defect, ensuring \(\hat{f}_\text{term}\) is non-negative.
Second term (cross-entropy term): Controls the KL divergence between the virtual terminal distribution \(\hat{F}_\text{term}\) and the target \(\kappa\).
The weak-FM term simultaneously controls the normalizing factor (Eq. 18), allowing the two terms to work synergistically.
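The two terms can be sketched on a finite batch as follows. This is a reading of the loss under stated assumptions, not the authors' implementation: the virtual terminal density is taken to be \(\hat{f}_\text{term} = f_\text{init} + f^*_\leftarrow - f^*_\rightarrow\), the weak-FM term is taken to be the mean negative part of that quantity over training states, and the clamp `eps` inside the logarithm is added only for numerical safety. All function and argument names are hypothetical.

```python
import math

def kl_weak_fm_loss(train_states, target_states, f_init, f_in, f_out,
                    b=1.0, eps=1e-12):
    """Batch estimate of b * E_train[negative part] - E_target[log f_term_hat]."""

    def virtual_term(s):
        # Assumed definition: the terminal flow that would make FM hold exactly.
        return f_init(s) + f_in(s) - f_out(s)

    # Weak-FM term: penalize only where the virtual terminal density is negative.
    weak_fm = sum(max(0.0, -virtual_term(s)) for s in train_states) / len(train_states)

    # Cross-entropy term: log-likelihood of target samples under the virtual
    # terminal density (clamped so the logarithm is always defined).
    cross_entropy = -sum(
        math.log(max(virtual_term(s), eps)) for s in target_states
    ) / len(target_states)

    return b * weak_fm + cross_entropy

# Toy usage with constant densities on arbitrary one-dimensional states.
states = [0.1, 0.4, 0.9]
loss_matched = kl_weak_fm_loss(states, states,
                               f_init=lambda s: 1.0,
                               f_in=lambda s: 0.5,
                               f_out=lambda s: 0.5)
loss_negative = kl_weak_fm_loss(states, states,
                                f_init=lambda s: 1.0,
                                f_in=lambda s: 0.5,
                                f_out=lambda s: 2.0)
```

In the first call the virtual terminal density is identically 1, so both terms vanish; in the second it is negative everywhere, so both terms penalize the flow.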

FM Loss in RL Settings¶

Stable FM loss (for RL):

\[\mathcal{L}^{\text{stable}}_{\text{FM},q} = \mathbb{E}_{s \sim \nu_\text{train}} [(f_\text{init} + f^*_\leftarrow - f_\text{term} - f^*_\rightarrow)^q(s)]\]

Unstable divergence FM loss (used in comparative experiments):

\[\mathcal{L}^{\text{div}}_{\text{FM}} = \mathbb{E}_{\underline{s}} \sum_{t=1}^\tau \log\left(\frac{f^*_\leftarrow + f_\text{init}}{f^*_\rightarrow + f_\text{term}}\right)^2(\underline{s}_t)\]

The regularization term \(\mathcal{R} = \mathbb{E}_{\underline{s}} \sum_t (f^*_\rightarrow)^2(\underline{s}_t)\) is used to help stabilize training.
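For comparison, the three quantities above can be computed on precomputed flow values as in the sketch below. The `Flows` container and the function names are hypothetical, the expectations are replaced by plain batch averages, and \(\log(\cdot)^2\) is read as the squared log-ratio.

```python
import math
from typing import NamedTuple, Sequence

class Flows(NamedTuple):
    """Flow densities evaluated at one state (hypothetical container)."""
    f_init: float
    f_in: float    # star inflow
    f_out: float   # star outflow
    f_term: float

def stable_fm_loss(states: Sequence[Flows], q: int = 2) -> float:
    """Mean of (f_init + f_in - f_term - f_out)^q over sampled states."""
    return sum((s.f_init + s.f_in - s.f_term - s.f_out) ** q for s in states) / len(states)

def divergence_fm_loss(trajectories: Sequence[Sequence[Flows]]) -> float:
    """Mean over trajectories of the summed squared log-ratio of in/out flows."""
    total = 0.0
    for traj in trajectories:
        total += sum(
            math.log((s.f_in + s.f_init) / (s.f_out + s.f_term)) ** 2 for s in traj
        )
    return total / len(trajectories)

def outflow_regularizer(trajectories: Sequence[Sequence[Flows]]) -> float:
    """Mean over trajectories of the summed squared star outflow."""
    return sum(sum(s.f_out ** 2 for s in traj) for traj in trajectories) / len(trajectories)

balanced = Flows(f_init=0.25, f_in=0.75, f_out=0.5, f_term=0.5)
unbalanced = Flows(f_init=0.25, f_in=0.75, f_out=1.0, f_term=0.5)

stable_ok = stable_fm_loss([balanced])
stable_bad = stable_fm_loss([unbalanced])
div_ok = divergence_fm_loss([[balanced]])
div_bad = divergence_fm_loss([[unbalanced]])
reg = outflow_regularizer([[balanced, unbalanced]])
```

Both losses vanish when the flow-matching constraint holds at every sampled state; they differ in how they penalize a violation (additive defect versus log-ratio).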

Key Experimental Results¶

RL Experiments (Checkerboard Distribution, \(\mathbb{T}^2\))¶

Architecture: 16 transformations (8 translations + 2 \(\text{SL}_d(\mathbb{Z})\) elements and their inverses), 5-layer MLP with width 32.
Validated the tractability and expressiveness of EGF.
Confirmed the explosion behavior of the unstable divergence loss \(\mathcal{L}^{\text{div}}_{\text{FM}}\) (divergence of flow size and sampling time \(\tau\)), which can be mitigated by regularization.
The stable loss \(\mathcal{L}^{\text{stable}}_{\text{FM}}\) converged well.

IL Experiments: Toy Distribution on \(\mathbb{T}^2\)¶

EGF: Minimum of 4 affine transformations, 3-layer MLP with width 32.
Compared with Moser Flow (same 32x3 architecture).
Results: Moser Flow failed to train under such a small model size, whereas EGF still reproduced the target distribution with high fidelity.

IL Experiments: NASA Dataset on \(\mathbb{S}^2\)¶

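Test negative log-likelihood (lower is better):

| Method | Volcano ↓ | Earthquake ↓ | Flood ↓ |
| --- | --- | --- | --- |
| Mixture vMF | -0.31 | 0.59 | 1.09 |
| Stereographic | -0.64 | 0.43 | 0.99 |
| Riemannian | -0.97 | 0.19 | 0.90 |
| Moser Flow | -2.02 | -0.09 | 0.62 |
| EGFN | -2.31 | -0.12 | 0.56 |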

EGF used 6 rotations (rotations of angle \(\pi/4\) about 3 axes, together with their inverses) and a 256x5 MLP.
The Moser Flow baseline used a 512x6 MLP; the EGF model is reported as approximately 30 times smaller.
Training was 10 times faster than for Moser Flow.
Learning rate: 1e-3, exponentially decaying to 1e-5, running for 3000 epochs × 25 steps.
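The setup above can be sketched in a few lines. Two points are assumptions rather than details from the paper: the three axes are taken to be the coordinate axes, and the learning rate is decayed once per optimizer step so that it reaches 1e-5 on the last of the 3000 × 25 steps.

```python
import math

def rotation(axis: str, angle: float):
    """3x3 rotation matrix about a coordinate axis (axes are an assumption)."""
    c, s = math.cos(angle), math.sin(angle)
    if axis == "x":
        return ((1.0, 0.0, 0.0), (0.0, c, -s), (0.0, s, c))
    if axis == "y":
        return ((c, 0.0, s), (0.0, 1.0, 0.0), (-s, 0.0, c))
    if axis == "z":
        return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))
    raise ValueError(f"unknown axis: {axis}")

def apply_rotation(m, v):
    """Matrix-vector product; maps a point of the sphere to another point of it."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))

# Six isometries of S^2: angle pi/4 about each axis, plus the inverse rotation.
# Order: x+, x-, y+, y-, z+, z-.
ROTATIONS = [rotation(a, sign * math.pi / 4) for a in "xyz" for sign in (1, -1)]

TOTAL_STEPS = 3000 * 25  # epochs x steps per epoch

def learning_rate(step: int, lr_start: float = 1e-3, lr_end: float = 1e-5) -> float:
    """Exponential decay from lr_start at step 0 to lr_end at the final step."""
    gamma = (lr_end / lr_start) ** (1.0 / (TOTAL_STEPS - 1))
    return lr_start * gamma ** step

north_pole = (0.0, 0.0, 1.0)
moved = apply_rotation(ROTATIONS[0], north_pole)
restored = apply_rotation(ROTATIONS[1], moved)
```

Because rotations are isometries, the Jacobian factor in the star inflow equals 1 for every map, so the inflow reduces to a sum of six network evaluations at rotated points.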

Highlights & Insights¶

Theory-Practice Unification: The combination of ergodicity theory and universality guarantees is elegant; only 4 simple transformations are needed to achieve universality in any dimension.
Tractable FM Loss: Finite diffeomorphisms yield a closed-form formula for the star inflow, avoiding the intractable integration issues in continuous GFNs.
IL Without Reward Models: The KL-weakFM loss realizes, for the first time, imitation learning within the GFN framework without requiring an independent reward model.
Extreme Parameter Efficiency: Outperforming Moser Flow with a 30x smaller model demonstrates the expressiveness gains brought by ergodicity.
Bridge to NF: EGF can be viewed as a stochastic sampler of Normalizing Flows (NFs), where each sampled trajectory corresponds to a random NF, establishing a deep connection between GFNs and NFs.
Quantitative Sampling Theorem: This work derives the first quantitative sampling error bound for non-acyclic generative flows.

Limitations & Future Work¶

Limited to Low-Dimensional Experiments: Experiments were only conducted on \(\mathbb{T}^2\) and \(\mathbb{S}^2\); the actual performance in high-dimensional scenarios remains unknown.
Difficulty in Verifying the \(L^2\)-mixing Summability Condition: The technical conditions required for universality are harder to satisfy in high dimensions, necessitating further theoretical development.
Theoretical Lower Bounds on the Number of Transformations: Sufficiency is only proved for two generators in affine tori and isometric sphere families; the minimum number of transformations in the general case remains unknown.
Insufficient Hyperparameter Tuning: The authors acknowledge that the highly modular nature of EGF was not fully exploited, as complex transformations, replay buffers, or advanced architectures were not attempted.
Lack of Direct Comparison with Diffusion Models: As a generative model, it was not compared against mainstream methods like DDPM on standard image generation benchmarks.
Background Noise Issues: The KL-weakFM loss tends to make \(\hat{f}_\text{term}\) positive everywhere in the space, leading to outliers that require additional threshold filtering.

Related Work¶

GFlowNet Theory (Bengio et al., 2021; 2023): EGF advances GFNs in continuous, non-acyclic settings.
Moser Flow (Rozen et al., 2021): The main baseline for IL; EGF significantly outperforms it with a small parameter footprint.
Non-acyclic GFN Theory (Brunswic et al., 2024): A prior work by the first author of this paper; EGF validates its prediction regarding the instability of divergence-based losses.
CINF (Caterini et al., 2021): A similar approach which aggregates multiple NFs via expectation.
Ergodic Theory (Walters, 2000; Bourgain & Gamburd, 2012): Provides the mathematical foundation for EGF's universality.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — Integrating ergodic theory into generative flows is highly original.
Experimental Thoroughness: ⭐⭐⭐ — Limited to low-dimensional experiments, lacking high-dimensional and mainstream benchmarks.
Writing Quality: ⭐⭐⭐⭐ — Theoretically rigorous and well-structured, though academically dense.
Value: ⭐⭐⭐⭐ — The theoretical contribution is significant, though practical utility in high dimensions remains to be validated.