Progressive Tempering Sampler with Diffusion¶

Conference: ICML 2025
arXiv: 2506.05231
Code: None
Area: Image Generation
Keywords: diffusion-model, sampling, MCMC, parallel-tempering, neural-sampler

TL;DR¶

This paper proposes the Progressive Tempering Sampler with Diffusion (PTSD). PTSD combines the temperature-swapping mechanism of Parallel Tempering (PT) with a diffusion-based neural sampler: "temperature guidance" extrapolates from diffusion models trained at high temperatures to generate approximate samples at a lower temperature. The result is a sampler that needs orders of magnitude fewer target density evaluations than other diffusion-based neural samplers.

Background & Motivation¶

Sampling from unnormalized density functions is a fundamental problem in Bayesian inference, statistical physics, and molecular simulations. Currently, two major paradigms each have their limitations:

Parallel Tempering (PT): The state-of-the-art (SOTA) in Markov Chain Monte Carlo (MCMC). It achieves efficient mixing by running parallel Markov chains across multiple temperature levels and swapping samples. However, a full propagation process must be rerun every time new independent samples are needed, making it computationally expensive and capable of generating only correlated samples.
Diffusion-based Neural Samplers (e.g., DDS, iDEM, BNEM, CMCD): These can amortize the sampling process to generate uncorrelated samples. However, their efficiency in target density evaluation is far inferior to PT. This inefficiency stems from the need to estimate the target function via importance sampling or heavily query the target density during trajectory simulation.

Key Insight: These two categories of methods lie at opposite ends of the methodological spectrum—one completely avoids data-driven training, while the other fully relies on post-hoc fitting of data generated by PT. PTSD is positioned in the middle ground, fusing the advantages of both.

Method¶

Overall Architecture¶

PTSD defines a decreasing temperature sequence \([T_K, T_{K-1}, \dots, T_1]\), where \(T_1\) is the target temperature. The algorithm consists of four steps:

1. High-temperature Initialization: Run PT at the two highest temperatures \(T_K\) and \(T_{K-1}\) to collect samples into a buffer.
2. Fitting Initial Diffusion Models: Train diffusion models \(\theta_K\) and \(\theta_{K-1}\) at the two high temperatures, respectively.
3. Temperature-Guided Extrapolation: Generate approximate samples for \(T_{K-2}\) from the high-temperature models using the temperature guidance mechanism.
4. Fine-tuning & Iteration: Initialize \(\theta_{K-2} \leftarrow \theta_{K-1}\) and fine-tune the model; repeat steps 3–4 until the target-temperature model \(\theta_1\) is obtained.
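
The four steps above can be sketched as a control-flow skeleton. This is only an illustration under stated assumptions: `run_pt`, `fit`, `guided_sample`, and `finetune` are hypothetical callables supplied by the caller (they are not names from the paper), and the sketch assumes each extrapolation step uses the two most recently trained models.

```python
from typing import Any, Callable, Dict, List, Sequence


def ptsd_outer_loop(
    temperatures: Sequence[float],
    run_pt: Callable[[List[float]], Dict[float, Any]],
    fit: Callable[[Any], Any],
    guided_sample: Callable[[Any, Any, float, float, float], Any],
    finetune: Callable[[Any, Any], Any],
) -> Dict[float, Any]:
    """Control flow only. `temperatures` is decreasing: [T_K, ..., T_1]."""
    if len(temperatures) < 2:
        raise ValueError("need at least two temperature levels")
    t_high, t_low = temperatures[0], temperatures[1]

    # Step 1: PT at the two highest temperatures fills the buffers.
    buffers = run_pt([t_high, t_low])
    # Step 2: one diffusion model per high-temperature buffer.
    models = {t_high: fit(buffers[t_high]), t_low: fit(buffers[t_low])}

    for t_next in temperatures[2:]:
        # Step 3: extrapolate to the next lower temperature
        # (assumption: the two most recent models are used).
        samples = guided_sample(
            models[t_low], models[t_high], t_low, t_high, t_next
        )
        # Step 4: warm-start from the colder model and fine-tune.
        models[t_next] = finetune(models[t_low], samples)
        t_high, t_low = t_low, t_next
    return models
```

Importance resampling and PT refinement of the buffer would sit between steps 3 and 4; they are omitted here to keep the skeleton short.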

Key Design 1: Temperature Guidance¶

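The core idea is to estimate the score function at a lower temperature \(T\) using a Taylor expansion and a finite-difference approximation, leveraging two pre-trained diffusion models at higher temperatures \(T_1 < T_2\), with \(T < T_1\). In this subsection \(T_1\) and \(T_2\) denote the two trained levels, not the target temperature of the schedule above.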

Performing a first-order Taylor expansion of the score function around \(T_1\) and approximating the derivative with finite differences yields:

\(\text{score}(x_t, T) \approx (1+w) \cdot \text{score}(x_t, T_1) - w \cdot \text{score}(x_t, T_2)\)

where the weight is \(w = (T_1 - T) / (T_2 - T_1)\), which is positive when \(T < T_1 < T_2\). The form closely resembles classifier-free guidance (CFG): guidance is obtained by contrasting a "better" low-temperature model with a "worse" high-temperature model. The approximation becomes imprecise as \(t \to 0\), but the accuracy of the score at small diffusion times has limited impact on final sample quality.
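
As a small numerical illustration of the extrapolation formula, the sketch below applies it to a toy case where the exact score is known. The toy setup is an assumption made here, not an experiment from the paper: a unit-variance Gaussian target tempered to variance \(T\) and then noised with variance `sigma_t**2`, so that the score is \(-x / (T + \sigma_t^2)\). The function names are illustrative.

```python
import numpy as np


def guidance_weight(t_target: float, t1: float, t2: float) -> float:
    """w = (T_1 - T) / (T_2 - T_1), for T <= T_1 < T_2."""
    if not (t_target <= t1 < t2):
        raise ValueError("expected t_target <= t1 < t2")
    return (t1 - t_target) / (t2 - t1)


def guided_score(score_t1, score_t2, t_target, t1, t2):
    """(1 + w) * score(x, T_1) - w * score(x, T_2)."""
    w = guidance_weight(t_target, t1, t2)
    return (1.0 + w) * score_t1 - w * score_t2


def gaussian_score(x, temperature, sigma_t=1.0):
    """Exact score of the toy tempered-and-noised Gaussian (assumed setup)."""
    return -x / (temperature + sigma_t ** 2)


x = np.linspace(-2.0, 2.0, 5)
exact = gaussian_score(x, 1.0)
approx = guided_score(
    gaussian_score(x, 1.5), gaussian_score(x, 2.0), 1.0, 1.5, 2.0
)
# Baseline: reuse the T_1 score without any guidance.
unguided_error = np.max(np.abs(gaussian_score(x, 1.5) - exact))
guided_error = np.max(np.abs(approx - exact))
```

In this toy case the guided score is closer to the exact low-temperature score than the unguided \(T_1\) score, but it is not exact, because the score is not linear in \(T\).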

Key Design 2: Truncated Importance Resampling¶

Since temperature guidance generates approximate samples, errors can accumulate across temperature levels. By using PF-ODE sampling, both samples and their corresponding densities can be obtained simultaneously to compute the self-normalized importance weights:

\(w_n = [\tilde{p}(x_n)^{1/T_k} / q(x_n)] / \sum_{n'} [\tilde{p}(x_{n'})^{1/T_k} / q(x_{n'})]\)

To prevent instability caused by the variance introduced by the Hutchinson trace estimator, a truncated importance sampling strategy is adopted—clipping the weights to a maximum value defined by a preset quantile \(\tau\), and then performing categorical resampling according to the weights to fill the buffer.
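
A minimal sketch of truncated importance resampling follows, working in log space for numerical stability. Several details are assumptions rather than facts from the paper: the function name, the default `tau`, and the choice to clip the unnormalized weights before normalizing them.

```python
import numpy as np


def truncated_importance_resample(log_p_tilde, log_q, temperature,
                                  tau=0.9, rng=None):
    """Return resampled indices and the truncated, normalized weights.

    log_p_tilde: log of the unnormalized target density at each sample
    log_q:       log model density at each sample (e.g. from the PF-ODE)
    """
    rng = np.random.default_rng() if rng is None else rng
    log_w = np.asarray(log_p_tilde) / temperature - np.asarray(log_q)
    # Subtract the max before exponentiating to avoid overflow.
    w = np.exp(log_w - log_w.max())
    # Truncation: clip weights at the tau-quantile (assumed order: clip,
    # then normalize).
    w = np.minimum(w, np.quantile(w, tau))
    w = w / w.sum()
    # Categorical resampling with replacement to refill the buffer.
    idx = rng.choice(len(w), size=len(w), replace=True, p=w)
    return idx, w
```

Clipping caps the influence of any single sample, which trades a small bias for lower variance when a few noisy density estimates would otherwise dominate the buffer.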

Following importance resampling, several steps of MCMC refinement are performed on the buffered samples. Parallel PT chains are run between buffers of adjacent temperatures, enhancing the sample quality of both buffers through sample swaps. Optionally, one can run PT chains only on a subset of the importance sampled (IS) samples to optimize the utilization of energy evaluations. This instantiates the design philosophy of "diffusion models as the workforce, MCMC for refinement."
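
For reference, the standard PT swap acceptance rule between two temperature levels can be written as below. This is the textbook Metropolis swap criterion, not a detail taken from the paper, and it assumes the convention \(E(x) = -\log \tilde{p}(x)\), so that the tempered density is proportional to \(\exp(-E(x)/T)\).

```python
import math


def swap_acceptance(energy_i: float, energy_j: float,
                    temp_i: float, temp_j: float) -> float:
    """Probability of swapping state i (at temp_i) with state j (at temp_j).

    Acceptance ratio: exp((E_i - E_j) * (1/T_i - 1/T_j)), capped at 1.
    """
    log_ratio = (energy_i - energy_j) * (1.0 / temp_i - 1.0 / temp_j)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


# The colder chain holds the higher-energy state: the swap is always accepted.
always = swap_acceptance(energy_i=2.0, energy_j=1.0, temp_i=1.0, temp_j=2.0)
# The colder chain already holds the lower-energy state: accepted sometimes.
sometimes = swap_acceptance(energy_i=0.0, energy_j=1.0, temp_i=1.0, temp_j=2.0)
```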

Key Experimental Results¶

Main Results (Table 1)¶

Comparison of the \(W_2\) distance (lower is better) against various neural samplers on three benchmark tasks:

Method	GMM-40 \(W_2\)	MW-32 \(W_2\)	LJ-55 \(W_2\)
iDEM	-	-	-
BNEM	2.16	-	1.76
PT+DM	-	-	-
PTSD	1.93	4.99	1.81

PTSD achieves SOTA sample quality on GMM-40 and MW-32. On LJ-55, it is slightly inferior to BNEM (since BNEM explicitly regresses the target energy to handle suppressed regions), but it requires significantly fewer target evaluations than BNEM.

Target Density Evaluation Efficiency (Table 2)¶

Per the counts below, PTSD uses roughly three orders of magnitude fewer target evaluations than DDS/CMCD, and between one and three orders of magnitude fewer than iDEM/BNEM:

Method	GMM-40 Evaluations	MW-32 Evaluations
CMCD	~4.4e9	~1.6e9
DDS	~2.6e9	~8.2e8
iDEM	~5e8	~1.8e7
BNEM	~7.5e7	~1.8e7
PT+DM	~1e6	~1e6
PTSD	~1e6	~1e6
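
The gap in orders of magnitude can be read off the table with a short calculation. The numbers below are the approximate counts from the table above; the dictionary layout is only for illustration.

```python
import math

# Approximate target evaluations: (GMM-40, MW-32), copied from the table.
evals = {
    "CMCD": (4.4e9, 1.6e9),
    "DDS": (2.6e9, 8.2e8),
    "iDEM": (5e8, 1.8e7),
    "BNEM": (7.5e7, 1.8e7),
    "PT+DM": (1e6, 1e6),
    "PTSD": (1e6, 1e6),
}

ptsd = evals["PTSD"]
# log10 of the ratio baseline / PTSD, per task.
orders = {
    name: tuple(math.log10(b / p) for b, p in zip(counts, ptsd))
    for name, counts in evals.items()
    if name != "PTSD"
}

for name, (gmm, mw) in orders.items():
    print(f"{name}: GMM-40 {gmm:.1f}, MW-32 {mw:.1f} orders of magnitude")
```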

PTSD uses a number of target evaluations comparable to PT+DM; however, in the detailed comparison of Fig. 6, PTSD yields better sample quality under the same evaluation budget. Temperature guidance acts as a more informative "swap" mechanism between temperature levels.

Ablation Study (Table 4, MW-32)¶

Configuration	TVD	\(W_2\)
PTSD w/o temp-guide	0.34	24.59
PTSD w/o IS	0.23	5.84
PTSD (Full)	0.14	4.99

Both temperature guidance and truncated IS are critical components. Even with IS removed, PTSD still outperforms most baselines in Table 1.

Alanine Dipeptide Verification (Table 3)¶

Metric	PT+DM	PTSD
Mean log-likelihood	Lower	213.32
KL divergence	6.9e-2	3.2e-2

Under an energy evaluation budget of 2.6e7, PTSD achieves a higher mean log-likelihood and a lower KL divergence.

Key Findings¶

Effective Extrapolation of Temperature Guidance: On LJ-55, when extrapolating to a temperature of 1.0 from models trained at temperatures 2.0 and 1.5, the distribution generated by temperature guidance closely overlaps the true distribution and outperforms alternatives such as Model Extrapolation (ME), Automatic Differentiation (AD), and Score Rescaling (RS).
Cross-temperature Information Transfer Outperforms Sample Swaps: Serving as a "functional representation" of the target density, the diffusion model achieves more efficient cross-temperature information transfer than traditional sample swaps through weight sharing.
PTSD Consistently Leads on the Pareto Front: On the Pareto front of log-likelihood versus number of energy evaluations (Fig. 10), PTSD consistently occupies the optimal position across all datasets.

Highlights & Insights¶

Precise Methodological Positioning: The integration of PT and neural samplers is cleverly positioned in the middle ground of the methodological spectrum, rather than pursuing either extreme.
Analogy Between Temperature Guidance and CFG: The temperature guidance formula, \((1+w) D_1 - w D_2\), where \(D_1\) and \(D_2\) denote the outputs of the lower- and higher-temperature models, has the same form as classifier-free guidance: it contrasts a good version with a poor version. The cross-domain transfer of this idea is elegant.
Progressive Bootstrapping: Unlike the self-bootstrapping in traditional neural samplers (which may get trapped in inefficient loops), PTSD's cross-temperature bootstrapping naturally provides a curriculum learning effect from easy to hard.
Dual Utilization of PF-ODE: PF-ODE is used for both sampling and density estimation (via the instantaneous change-of-variables formula), so the proposal density \(q(x_n)\) in the importance weights comes from the model itself. The only target evaluation needed per sample is \(\tilde{p}(x_n)\).
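
The density estimate along the PF-ODE requires the divergence of the ODE drift, which the method estimates with the Hutchinson trace estimator. A minimal sketch of that estimator is given below. It assumes Rademacher probe vectors and takes a generic `matvec` callable; in a real sampler `matvec` would be a Jacobian-vector product of the drift network, which is not shown here.

```python
import numpy as np


def hutchinson_trace(matvec, dim, num_probes=16, rng=None):
    """Estimate tr(A) as the mean of v^T A v over random sign vectors v.

    matvec: callable returning A @ v for a vector v of length `dim`
    """
    rng = np.random.default_rng() if rng is None else rng
    estimates = []
    for _ in range(num_probes):
        # Rademacher probe: entries are +1 or -1 with equal probability.
        v = rng.choice([-1.0, 1.0], size=dim)
        estimates.append(v @ matvec(v))
    return float(np.mean(estimates))


# Toy check with an explicit matrix standing in for the drift Jacobian.
A = np.array([[1.0, 0.5], [0.2, 3.0]])
estimate = hutchinson_trace(lambda v: A @ v, dim=2, num_probes=256,
                            rng=np.random.default_rng(0))
```

The estimator is unbiased, but its variance depends on the off-diagonal entries of the matrix, which is why the importance weights built on it are noisy and benefit from truncation.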

Limitations & Future Work¶

Wall-clock Time Disadvantage: Although the number of target evaluations is significantly reduced, the diffusion model must be fine-tuned at each temperature level. The actual wall-clock time in current experiments remains slower than pure PT.
Inability to Parallelize: Temperatures must be processed stepwise in a decreasing sequence, unlike PT which can run chains at different temperatures in parallel across multiple devices.
Sensitivity to Temperature Schedules and Hyperparameters: Performance can become fragile when temperature intervals are too large or the target distribution is overly complex; the choice of network, learning rate, and truncation threshold all impact the final results.
Evaluation Limitations: Currently, verification is limited to synthetic multimodal distributions and small molecular systems; tests have not yet been conducted on high-dimensional, real-world problems.

Related Work & Insights¶

FAB (Midgley et al., 2023): Trains normalizing flows using the \(\alpha\)-divergence with \(\alpha = 2\) and AIS, introducing a replay buffer.
iDEM / BNEM: Direct estimation of score functions but relies heavily on importance sampling, incurring large target evaluation overheads.
DDS / CMCD: Matches path measures between sampling and target processes, where Langevin preconditioning leads to a large number of target evaluations.
Generalized PT (Zhang et al., 2025): Transports samples between adjacent temperatures using neural networks to improve the PT swap rate—presenting a complementary perspective to PTSD.
CFG (Ho & Salimans, 2021; Karras et al., 2024): PTSD's temperature guidance is a direct analogue of CFG's paradigm of contrasting a good model with a poor version of it.

Insight: The role of diffusion models in sampling problems can shift from "replacing MCMC" to "enhancing MCMC," indicating that the fusion of both is likely a more pragmatic direction.

Rating¶

Dimension	Score
Novelty	★★★★☆
Theoretical Depth	★★★★☆
Experimental Thoroughness	★★★★☆
Practical Value	★★★☆☆
Writing Quality	★★★★★
Overall Rating	★★★★☆

Although the derivation of temperature guidance rests on a simple Taylor expansion and finite differences, its connection to CFG is elegant. Experiments on synthetic and molecular tasks support the claimed reduction in target evaluations. The main deduction is for practical value: wall-clock time is still not competitive with pure PT, the method is sensitive to the temperature schedule, and it has not yet been scaled to large problems.