Reinforced Sequential Monte Carlo for Amortised Sampling
Conference: ICML 2026
arXiv: 2510.11711
Code: https://github.com/hyeok9855/ReinforcedSMC (Available)
Area: Reinforcement Learning / Probabilistic Inference / Diffusion Models / Neural Samplers
Keywords: Amortized Sampling, SMC, MaxEnt RL, GFlowNet, Importance Weighted Replay
TL;DR
This paper unifies hierarchical variational inference (HVI), MaxEnt reinforcement learning, and Sequential Monte Carlo (SMC) / Annealed Importance Sampling (AIS) in a single framework. The learned policy and flow function serve as the SMC proposal kernel and twisting target, respectively. Conversely, SMC acts as an off-policy behavior policy: the near-target samples it generates are used to train the neural sampler. Combined with adaptive weight tempering and importance-weighted experience replay, this approach improves both mode coverage and training stability on multi-modal targets and the alanine dipeptide Boltzmann distribution.
Background & Motivation
Background: Sampling from a target distribution given only an unnormalized density \(R(x)=\pi(x)\cdot Z\) is fundamental to Bayesian inference and molecular conformation sampling. One main branch is classical Monte Carlo, which includes MCMC (HMC, Langevin), adaptive importance sampling, and SMC. These methods have an "anytime" property: using more particles brings the estimate closer to the true distribution. The other branch is "amortized sampling" with neural networks (autoregressive models, diffusion models), where the target is fitted into a network during training so that a single forward pass suffices at inference time.
Limitations of Prior Work: Amortized samplers often use the reverse-KL objective during training, which has a strong mode-seeking tendency, leading to mode collapse on multi-modal targets. If trained on on-policy data, they may never learn modes they haven't encountered. While classical SMC can "unbiasedly" approach the target, inference is slow and particle degeneracy (where a few particles consume most of the weight) causes the effective sample size \(\widehat{\mathrm{ESS}}\) to decay rapidly. These two categories have complementary strengths and weaknesses but lack a unified interface to leverage each other.
Key Challenge: Amortized samplers cannot learn modes they have not seen, yet their learned parts provide good proposals. SMC can explore new regions but requires efficient proposal kernels to be effective. Merely stacking them (training a sampler, then refining with MCMC) does not form a training loop, because learning still occurs only on the sampler's own trajectories.
Goal: (i) Mathematically express HVI, MaxEnt RL, and SMC/AIS under a unified notation, such that the neural sampler's policy \(\overrightarrow p_\theta\) and flow function \(F^\phi_n\) naturally correspond to the SMC proposal kernel and intermediate targets; (ii) design a complementary loop that uses the learned sampler for SMC proposals and uses weighted SMC samples as off-policy data to train the sampler; (iii) address the variance and stability issues of joint training.
Key Insight: The trajectory balance (TB) and subtrajectory balance (SubTB) losses of GFlowNets allow training with any full-support behavior policy without importance correction, providing the critical interface to ingest "SMC output as training data."
Core Idea: The paper rewrites the TB/SubTB losses and shows that they equal the second moment of the AIS log-weights (after subtracting \(\log Z_\theta\)). At optimality, the policy equals the proposal kernel and the flow equals the intermediate target, satisfying detailed balance. This turns "training a sampler" and "performing SMC" into two sides of the same coin.
Method
Overall Architecture
The sampling of target \(\pi(x)=R(x)/Z\) is formulated as an \(N\)-step hierarchical model \(\overrightarrow p_\theta(x_{0:N})=\overrightarrow p_0(x_0)\prod_{n}\overrightarrow p_\theta(x_{n+1}\mid x_n)\), with a fixed backward kernel \(\overleftarrow p(x_n\mid x_{n+1})\). This is treated as a deterministic MDP: state \((n,x_n)\), action as the next variable value, reward \(r((n,x_n),x_{n+1})=\log \overleftarrow p(x_n\mid x_{n+1})\), and terminal reward \(\log R(x)\). The MaxEnt objective is equivalent to minimizing \(\mathrm{KL}(\overrightarrow p_\theta(x_{0:N})\|\pi(x)\overleftarrow p(x_{0:N-1}\mid x))\).
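
A short worked identity may help connect this KL objective to importance weights. The symbol \(w(x_{0:N})\) and the explicit index range \(n=0,\dots,N-1\) are notation introduced here for illustration (with \(x=x_N\)); they are assumptions consistent with the definitions above rather than quotations from the paper.

\[
w(x_{0:N})=\frac{R(x_N)\prod_{n=0}^{N-1}\overleftarrow p(x_n\mid x_{n+1})}{\overrightarrow p_0(x_0)\prod_{n=0}^{N-1}\overrightarrow p_\theta(x_{n+1}\mid x_n)},
\qquad
\mathrm{KL}\big(\overrightarrow p_\theta(x_{0:N})\,\|\,\pi(x_N)\overleftarrow p(x_{0:N-1}\mid x_N)\big)
=\log Z-\mathbb E_{\overrightarrow p_\theta}\big[\log w(x_{0:N})\big]\ge 0 .
\]

Under these assumptions, minimizing the KL is the same as maximizing the expected log-weight, which is bounded above by \(\log Z\).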
The training loop (Fig. 1) consists of four objects and two data streams:
- Policy/Proposal \(\overrightarrow p_\theta\) and Flow/Twisting Target \(F^\phi_n\), where ideally \(F^\phi_N(x)=R(x)\).
- On-policy stream: Roll out trajectories directly from \(\overrightarrow p_\theta\) to train the policy using TB.
- Off-policy stream: Run SMC with the current \(\overrightarrow p_\theta\) and \(F^\phi_n\) to obtain \((x_N,w_N)\), then sample backward from \(\overleftarrow p(\cdot\mid x_N)\) to reconstruct full trajectories, which enter the IW-Buffer to train the flow using SubTB.
- Importance Weighted Experience Replay: Historical SMC outputs from multiple batches are combined using batch-level normalization constant estimates \(\widehat Z_m\) and within-batch self-normalized weights \(W^{m,k}_N\) as sampling probabilities.
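
The replay weighting in the last bullet can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`logsumexp`, `on_policy_log_z_hat`, `buffer_probs`, `sample_from_buffer`), the array layout `(M, K, d)`, and the random placeholder data are all assumptions made here. It covers only on-policy batches, for which \(\widehat Z_m\) is the mean of the unnormalized weights.

```python
import numpy as np

def logsumexp(a, axis=None, keepdims=False):
    """Numerically stable log(sum(exp(a))) along `axis`."""
    a_max = np.max(a, axis=axis, keepdims=True)
    out = a_max + np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True))
    return out if keepdims else np.squeeze(out, axis=axis)

def on_policy_log_z_hat(log_w):
    """log Z_hat_m = log((1/K) * sum_k w^{m,k}) per batch; log_w has shape (M, K)."""
    return logsumexp(log_w, axis=1) - np.log(log_w.shape[1])

def buffer_probs(log_z_hat, log_w):
    """Sampling probabilities proportional to Z_hat_m * W^{m,k}, shape (M, K)."""
    # Self-normalize the weights within each batch: W^{m,k} = w^{m,k} / sum_k w^{m,k}.
    log_W = log_w - logsumexp(log_w, axis=1, keepdims=True)
    log_p = log_z_hat[:, None] + log_W
    # Normalize over the whole buffer so the probabilities sum to one.
    return np.exp(log_p - logsumexp(log_p))

def sample_from_buffer(x_buffer, probs, n, rng):
    """Draw n terminal samples x_N; x_buffer has shape (M, K, d)."""
    flat = rng.choice(probs.size, size=n, p=probs.ravel())
    m, k = np.unravel_index(flat, probs.shape)
    return x_buffer[m, k]

rng = np.random.default_rng(0)
M, K, d = 3, 4, 2
log_w = rng.normal(size=(M, K))        # placeholder log-weights
x_buffer = rng.normal(size=(M, K, d))  # placeholder terminal samples
probs = buffer_probs(on_policy_log_z_hat(log_w), log_w)
batch = sample_from_buffer(x_buffer, probs, n=5, rng=rng)
print(probs.sum(), batch.shape)
```

For on-policy batches the product \(\widehat Z_m W^{m,k}_N\) reduces to \(w^{m,k}_N/K\), so in this special case the buffer samples in proportion to the raw importance weights pooled across batches.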
Key Designs
- TB/SubTB Loss = Second Moment of AIS Log-Weights (Unified Interface):
  - Function: Bridges "training neural samplers" and "running SMC" under the same loss, making the flow the twisting target and the policy the proposal kernel.
  - Mechanism: The trajectory balance loss is defined as \(\mathcal L^{\theta,\phi}_{\mathrm{TB}}(x_{0:N})=\big[\log\frac{F^\phi_0(x_0)\prod_n \overrightarrow p_\theta(x_{n+1}\mid x_n)}{R(x_N)\prod_n \overleftarrow p(x_n\mid x_{n+1})}\big]^2\). Up to sign, the bracketed term is the AIS log-importance weight minus \(\log Z_\theta\); the sign is irrelevant because the term is squared. Subtrajectory balance applies the same second moment over sub-segments \([m,n]\). When SubTB reaches zero on length-1 segments, detailed balance \(\pi_n(x_n)\overrightarrow p(x_{n+1}\mid x_n)=\pi_{n+1}(x_{n+1})\overleftarrow p(x_n\mid x_{n+1})\) is automatically satisfied, and SMC weights remain uniform across all steps, eliminating the need for resampling.
  - Design Motivation: The authors experimentally confirmed that the most stable division of labor is updating the policy \(\theta\) via TB only and the flow \(\phi\) via SubTB only (see G.2); this avoids interference between targets while simultaneously binding geometric annealing, twisting targets, and proposal kernels to neural network parameters.
- SMC as Behavior Policy + Importance Weighted Experience Replay (IW-Buffer):
  - Function: Uses near-target samples from SMC as off-policy data to train \(\overrightarrow p_\theta\) and utilizes past exploration through historical replay.
  - Mechanism: The behavior policy follows a mixture of on-policy (\(\overrightarrow p_\theta\)) and off-policy (backward-reconstructed trajectories from SMC outputs \(x_N\)). For the \(m\)-th batch of historical samples \(\{x^{m,k}\}\), each sample is assigned weight \(\widehat Z_m\cdot W^{m,k}_N\). The batch weight \(\widehat Z_m\) is the particle estimate of the normalization constant \(Z\): for on-policy batches, \(\widehat Z_m=\frac{1}{K}\sum_k w^{m,k}_N\); for SMC batches with resampling, \(\widehat Z_m=\prod_j\big(\sum_k W^{m,k}_{r_{j-1}}\prod_i \widetilde w^{m,k}_i\big)\). Samples \(x_N\) are drawn from the buffer proportional to \(\widehat Z_m W^{m,k}_N\), and trajectories are completed via \(\overleftarrow p(x_{1:N-1}\mid x_N)\) for training.
  - Design Motivation: Traditional prioritized experience replay uses TD-error, but here the core difficulty is that "samples come from different proposal distributions." Using particle estimates of normalization constants provides a statistically grounded "priority," and the buffer weakly converges to the target as \(MK\to\infty\). Unlike Langevin local search, which depends on target gradients, SMC works in gradient-free settings and utilizes the learned flow for twisting.
- Adaptive Importance Weight Tempering:
  - Function: Mitigates instability in early training caused by high weight concentration (where a few samples dominate the gradient).
  - Mechanism: Weights are transformed as \(w\mapsto w^\lambda\) before normalization, with \(\lambda\in[0,1]\). Since a fixed \(\lambda\) introduces bias, an adaptive scheme is used to select the largest \(\lambda\) such that \(\widehat{\mathrm{ESS}}(w^\lambda_{1:K})\ge \gamma K\), i.e., \(\lambda^\ast=\max\{\lambda\in[0,1]:\widehat{\mathrm{ESS}}(w^\lambda_{1:K})\ge\gamma K\}\). Since \(\widehat{\mathrm{ESS}}(w^\lambda)\) is monotonically decreasing with \(\lambda\), this is solved via binary search at near-zero cost. This is used alongside adaptive resampling triggered when \(\widehat{\mathrm{ESS}}\le\kappa\).
  - Design Motivation: In early training, \(\overrightarrow p_\theta\) is far from detailed balance, leading to exploding weight variance. Hard clipping is common but introduces uncontrollable bias. Letting \(\lambda\) automatically increase from 0 to 1 as training progresses (ideally returning to unbiased AIS) aligns with the intuition that later models are more trustworthy.
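
The adaptive tempering rule described above can be sketched as a bisection over \(\lambda\). This is a minimal sketch under stated assumptions: the function names `ess` and `adaptive_lambda`, the tolerance `tol`, and the synthetic heavy-tailed log-weights are hypothetical choices made here, and the ESS estimate is taken to be the usual \((\sum_k w_k)^2/\sum_k w_k^2\). Raising weights to the power \(\lambda\) corresponds to multiplying log-weights by \(\lambda\).

```python
import numpy as np

def ess(log_w):
    """Effective sample size (sum w)^2 / sum(w^2), computed from log-weights."""
    w = np.exp(log_w - np.max(log_w))  # shift for numerical stability
    return w.sum() ** 2 / np.sum(w ** 2)

def adaptive_lambda(log_w, gamma, tol=1e-6):
    """Largest lambda in [0, 1] (up to `tol`) with ESS(w^lambda) >= gamma * K."""
    K = len(log_w)
    target = gamma * K
    if ess(log_w) >= target:
        return 1.0  # no tempering needed
    # Invariant: ESS at `lo` meets the target (ESS = K at lambda = 0),
    # ESS at `hi` does not. This relies on ESS being non-increasing in lambda.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess(mid * log_w) >= target:
            lo = mid
        else:
            hi = mid
    return lo

rng = np.random.default_rng(0)
log_w = 5.0 * rng.normal(size=256)  # synthetic, highly concentrated weights
lam = adaptive_lambda(log_w, gamma=0.5)
print(lam, ess(lam * log_w))
```

Returning the lower end of the bracket keeps the ESS constraint satisfied exactly, at the cost of being at most `tol` below the true \(\lambda^\ast\).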
Loss & Training
The policy only uses TB (Eq. 8), while the flow only uses SubTB (Eq. 7); other combinations (double TB, double SubTB, or mixed) were found unstable in G.2. Diffusion samplers use Langevin parameterization and temperature annealing-correction (Appendix E). Each training step involves: (a) running on-policy trajectories from \(\overrightarrow p_\theta\) for TB; (b) drawing \(x_N\) from IW-Buffer to calculate TB+SubTB; (c) if SMC behavior policy is enabled, running SMC to add new \((x_N,w_N)\) to the buffer.
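
To make the TB objective and its link to AIS weights concrete, the sketch below evaluates the squared TB residual on a batch of trajectories and checks it against the squared, centered AIS log-weight. It is illustrative only: the function names `ais_log_weight` and `tb_loss`, the random placeholder log-probabilities, the value of `log_z`, and the parameterization \(\log F^\phi_0(x_0)=\log Z_\theta+\log\overrightarrow p_0(x_0)\) are assumptions made here, not details confirmed by the paper.

```python
import numpy as np

def ais_log_weight(log_p0, log_pf, log_pb, log_reward):
    """Per-trajectory log w = log R(x_N) + sum log p_bwd - log p_0(x_0) - sum log p_fwd."""
    return log_reward + log_pb.sum(axis=1) - log_p0 - log_pf.sum(axis=1)

def tb_loss(log_f0, log_pf, log_pb, log_reward):
    """Mean squared TB residual over K trajectories.

    log_f0:     shape (K,)   log F_0(x_0)
    log_pf:     shape (K, N) forward transition log-probabilities
    log_pb:     shape (K, N) backward transition log-probabilities
    log_reward: shape (K,)   log R(x_N)
    """
    residual = log_f0 + log_pf.sum(axis=1) - log_reward - log_pb.sum(axis=1)
    return np.mean(residual ** 2)

rng = np.random.default_rng(0)
K, N = 8, 5
# Placeholder numbers standing in for real log-probabilities.
log_p0 = rng.normal(size=K)
log_pf = rng.normal(size=(K, N))
log_pb = rng.normal(size=(K, N))
log_reward = rng.normal(size=K)
log_z = 0.3  # hypothetical current estimate of log Z_theta

# Assumption: log F_0(x_0) = log Z_theta + log p_0(x_0).
loss = tb_loss(log_z + log_p0, log_pf, log_pb, log_reward)
log_w = ais_log_weight(log_p0, log_pf, log_pb, log_reward)
print(np.isclose(loss, np.mean((log_w - log_z) ** 2)))
```

Under this parameterization the residual is \(\log Z_\theta-\log w\), so the TB loss is the batch average of \((\log w-\log Z_\theta)^2\). In actual training the arrays would come from differentiable network outputs in an autodiff framework rather than NumPy.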
Key Experimental Results
Experiments used diffusion samplers across gradient-free and gradient-based settings, including alanine dipeptide Boltzmann distribution and discrete spaces (Appendix). Results report mean ± standard deviation over 5 runs.
Main Results
| Target | Metric | TB (on-policy baseline) | + IW-Buf | TB/SubTB + SMC | + SMC + IW-Buf |
|---|---|---|---|---|---|
| GMM40 (\(d=2\)) | EUBO ↓ | 273.10 | 0.88 | 1.06 | 0.89 |
| GMM40 (\(d=2\)) | Sinkhorn ↓ | 607.31 | 6.50 | 39.99 | 6.46 |
| GMM40 (\(d=5\)) | EUBO ↓ | 3156.7 | 1183.3 | 30.1 | 2.3 |
| GMM40 (\(d=5\)) | Sinkhorn ↓ | 3110.2 | 2813.9 | 330.9 | 83.3 |
| Funnel (\(d=10\), grad-free) | EUBO ↓ | 8.33 | 1.53 | 41.54 | 3.64 |
| ManyWell (\(d=32\)) | Sinkhorn ↓ | 29.57 | 22.97 | 21.91 | 22.97 |
| Robot4 (\(d=10\), grad-based) | Sinkhorn ↓ | 1.72 | 1.27 | 64.48 | 0.39 |
| GMM40 (\(d=50\)) | Sinkhorn ↓ | 3903.95 | 4284.49 | × | 3579.17 |
| ManyWell (\(d=64\)) | MMD ↓ | 0.243 | 0.058 | 0.138 | 0.043 |
Ablation Study
| Configuration | Performance on Multimodal Targets | Description |
|---|---|---|
| TB only (on-policy) | Sinkhorn \(607\), EUBO \(273\) (GMM40 d=2) | Reverse-KL mode-seeking; severe mode collapse |
| + IW-Buf | Sinkhorn \(6.50\), EUBO \(0.88\) | Historical samples help recover missed modes |
| TB/SubTB + SMC (no buffer) | Sinkhorn \(39.99\) | Strong SMC exploration but samples wasted per step |
| + IW-Buf | Sinkhorn \(6.46\) | Combines SMC exploration with replay reuse |
| TB for flow / SubTB for policy (G.2) | Unstable or worse | Confirms optimal separation: TB for \(\theta\), SubTB for \(\phi\) |
| Fixed \(\lambda\) vs Adaptive \(\lambda^\ast\) | Fixed: high bias or high variance | Adaptive tempering balances bias/variance via \(\widehat{\mathrm{ESS}}\ge\gamma K\) |
| Robot4 (\(d=10\)): DDS / TB | DDS failed, TB \(1.72\) | Shows fragility of on-policy methods on complex control targets |
Key Findings
- Synergy of SMC and IW-Buffer is critical: Adding SMC behavior policy alone (e.g., Robot4 Sinkhorn \(64.48\)) can destroy training due to high variance; adding IW-Buffer reaches the best metrics (Sinkhorn \(0.39\)). Exploration and sample reuse must occur simultaneously.
- On-policy training → mode collapse: Pure on-policy DDS, LV, and TB on GMM40-\(d=2\) have EUBOs in the hundreds, indicating most modes are missed, consistent with mode-seeking theoretical predictions for reverse-KL gradients.
- Importance in gradient-free settings: On targets where only \(\log R(x)\) is available without gradients (common in molecular simulations), traditional Langevin search fails. Ours still reduces Sinkhorn from \(29.57\) to \(22.97\) on ManyWell (\(d=32\)).
- Dimensional scalability: The advantage over baselines persists at higher dimensionality, e.g., ManyWell-\(d=64\) MMD dropped from \(0.243\) to \(0.043\) (roughly a \(5.7\times\) reduction).
Highlights & Insights
- Unification of three domains: By expressing HVI, MaxEnt RL, and SMC/AIS with a unified \(\overrightarrow p_\theta/\overleftarrow p/F^\phi_n\) notation, the TB loss naturally becomes the second moment of AIS log-weights. This structural discovery treats "training samplers" and "running SMC" as different aspects of the same problem.
- Statistical priority for behavior policies: Using normalization constant estimates \(\widehat Z_m\) instead of TD-error for buffer priority is a non-trivial upgrade to prioritized replay, connecting directly to Layered IS literature.
- Automatic bias-variance tuning via adaptive \(\lambda^\ast\): Transforming the ESS threshold into a training knob allows aggressive tempering early on (trading high bias for low variance) and automatically returns to unbiased AIS, matching the training trajectory of the sampler.
Limitations & Future Work
- Training cost: Each step involves SMC, buffer resampling, and on-policy rollout, resulting in \(2\text{--}3\times\) higher wall-clock time than pure TB. The paper lacks detailed wall-clock tables.
- Discrete space results are primarily in the appendix; coverage of scenarios like NLP (prepend/append models) is relatively thin and requires larger-scale validation.
- Thresholds \(\gamma, \kappa\) for adaptive \(\lambda^\ast\) and resampling are manually set; making them learnable parameters is a natural direction.
- Compatibility with few-step diffusion work (e.g., Berner et al. 2026) is only mentioned in footnotes; combining with 1–4 step diffusion could significantly improve inference efficiency.
Related Work & Insights
- vs SCLD (Chen et al. 2025): SCLD also combines Langevin/MCMC with controlled sampling but does not use SMC outputs as off-policy training data. Ours explicitly proves TB equals the second moment of AIS log-weights.
- vs Sendera et al. 2024 (Langevin-guided GFlowNet): They use Langevin local search as an off-policy source; ours uses SMC, which works in gradient-free settings and leverages learned flow functions for twisting.
- vs Wu et al. 2025 (Concurrent): Wu et al. treat untrained diffusion samplers as SMC proposals but do not feed SMC samples back into training; ours is a bidirectional loop.
- vs DDS/PIS/LV: These pure on-policy diffusion methods collapse on multimodal targets. Ours, using the same diffusion backbone with TB + SMC + IW-Buf, reduces GMM40-\(d=5\) EUBO from thousands to \(2.3\).
Rating
- Novelty: ⭐⭐⭐⭐⭐ Unifies HVI/MaxEnt RL/SMC under one loss and designs a bidirectional SMC ↔ Sampler loop.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers gradient-free/based, continuous/discrete, synthetic/molecular targets, and various dimensions with strong baselines.
- Writing Quality: ⭐⭐⭐⭐ Unified notation and clear comparisons; explains the interface between domains well.
- Value: ⭐⭐⭐⭐⭐ Provides a unified interface for amortized sampling to continuously absorb MC algorithms (annealing, adaptive IS).