Tortoise and Hare Guidance: Accelerating Diffusion Model Inference with Multirate Integration¶
Conference: NeurIPS 2025 arXiv: 2511.04117 Code: Available (https://github.com/yhlee-add/THG) Area: Image Generation / Diffusion Models Keywords: Diffusion model acceleration, Classifier-Free Guidance, multirate integration, NFE reduction, training-free
TL;DR¶
This paper proposes Tortoise and Hare Guidance (THG), a training-free acceleration strategy for diffusion sampling that reformulates the classifier-free guidance (CFG) ODE as a multirate ODE system. The noise estimation term is integrated with fine-grained steps (tortoise equation), while the additional guidance term is integrated with coarse-grained steps (hare equation), reducing the number of function evaluations (NFE) by up to 30% with negligible degradation in generation quality.
Background & Motivation¶
Inference Bottleneck of Diffusion Models¶
Diffusion models have achieved remarkable success in image generation, yet slow inference remains a primary bottleneck. Each generation requires multiple denoising steps, each involving one or more neural network forward passes (Function Evaluations, NFE).
Computational Redundancy in Classifier-Free Guidance¶
CFG is the dominant approach for conditional generation, formulated as:

$$\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t) + s \cdot [\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t)]$$

- First term \(\epsilon_\theta(x_t)\): unconditional noise estimate
- Second term \(s \cdot [\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t)]\): additional guidance term
Each CFG step requires two network forward passes (conditional and unconditional), constituting the primary computational bottleneck.
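As a concrete sketch of where the cost comes from (the `eps_uncond`/`eps_cond` functions and their linear forms are toy stand-ins, not the paper's model):

```python
import numpy as np

# Toy stand-ins for the two network forward passes (assumptions, not the paper's code);
# in practice each is one full U-Net forward.
def eps_uncond(x_t):
    return 0.1 * x_t

def eps_cond(x_t, c):
    return 0.1 * x_t + 0.05 * c

def cfg_noise(x_t, c, s):
    """Standard CFG: every denoising step pays for two forward passes."""
    e_u = eps_uncond(x_t)          # pass 1: unconditional estimate
    e_c = eps_cond(x_t, c)         # pass 2: conditional estimate
    return e_u + s * (e_c - e_u)   # guided noise estimate

out = cfg_noise(np.ones(4), np.ones(4), s=7.5)
```

With 50 denoising steps this sampler costs 100 NFE, which is exactly the baseline budget in the tables below.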
Key Observation¶
The additional guidance term is far less sensitive to numerical error than the noise estimation term. Conventional uniform-step solvers fail to exploit this asymmetry, leading to substantial redundant computation.
Method¶
Overall Architecture¶
The CFG ODE is decomposed into two subsystems operating on different timescales:
CFG ODE: dx/dt = f(x,t) + g(x,t)
├── Tortoise Equation: dx/dt = f(x,t) — noise estimation, fine-grained steps
└── Hare Equation: dx/dt = g(x,t) — additional guidance, coarse-grained steps
Key Designs¶
1. Multirate ODE Decomposition¶
The CFG solve is decomposed from a single ODE into a multirate system:
- Tortoise equation (slow, fine): integrates the noise-estimation term \(\epsilon_\theta(x_t)\) at the original timestep resolution, so each fine step still runs a network forward pass
- Hare equation (fast, coarse): the additional guidance term \(g(x_t) = s \cdot [\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t)]\) is evaluated only on a coarse grid, so the second forward pass is needed only at coarse grid points; interpolation or extrapolation fills in between
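A minimal sketch of the multirate loop under a toy Euler discretization; the network stand-ins, the 3:1 ratio, and the zero-order hold between coarse points are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Toy stand-ins for the two network forward passes (assumptions, not the paper's model).
def eps_uncond(x):
    return 0.1 * x

def eps_cond(x, c):
    return 0.1 * x + 0.05 * c

def thg_sample_sketch(x, c, s=7.5, n_steps=12, hare_every=3, dt=0.01):
    """Toy multirate Euler loop: the noise-estimation term (tortoise) is
    refreshed at every fine step, while the guidance term (hare) is
    recomputed only every `hare_every` steps and held constant in between."""
    g_cached = 0.0
    nfe = 0
    for i in range(n_steps):
        e_u = eps_uncond(x)            # tortoise: one pass per fine step
        nfe += 1
        if i % hare_every == 0:        # hare: second pass only on the coarse grid
            e_c = eps_cond(x, c)
            nfe += 1
            g_cached = s * (e_c - e_u)
        x = x - dt * (e_u + g_cached)  # Euler update on the decomposed CFG ODE
    return x, nfe

x_final, nfe = thg_sample_sketch(np.ones(4), c=np.ones(4))
# 12 tortoise passes + 4 hare passes = 16 NFE, vs 24 for plain CFG at the same resolution
```

The NFE saving comes entirely from skipping the second forward pass on fine steps; the paper uses interpolation/extrapolation rather than this simple hold.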
2. Error Bound Analysis¶
A rigorous error-bound analysis establishes that:
- The noise-estimation term has a larger Lipschitz constant, requiring fine step sizes for error control
- The guidance term has a smaller Lipschitz constant, tolerating larger step sizes
- Quantitatively, the error bound of the guidance term is smaller than that of the noise-estimation term by a factor of \(O(s)\)
3. Error-Bound-Aware Timestep Sampler¶
Adaptively selects step sizes:
- Applies finer steps in regions where the noise changes rapidly (e.g., intermediate timesteps)
- Permits larger steps in regions of slow variation (e.g., near the terminal time)
- Dynamically adjusts the tortoise/hare step-size ratio based on local error estimates
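One common way to realize such a sampler is inverse-CDF placement of step boundaries against an error density; this sketch assumes a hypothetical bell-shaped density peaking at intermediate timesteps, not the paper's actual bound:

```python
import numpy as np

def error_aware_timesteps(n_steps, err_density, t0=0.0, t1=1.0, n_grid=1000):
    """Toy error-bound-aware sampler: place step boundaries so that each step
    carries roughly equal integrated error (inverse CDF of the density)."""
    t = np.linspace(t0, t1, n_grid)
    w = err_density(t)                      # pointwise error-density estimate
    cdf = np.cumsum(w)
    cdf = cdf / cdf[-1]                     # normalize to [0, 1]
    targets = np.linspace(0.0, 1.0, n_steps + 1)
    return np.interp(targets, cdf, t)       # equal-mass boundaries in t

# Hypothetical density: noise changes fastest at intermediate timesteps.
ts = error_aware_timesteps(10, lambda t: 0.1 + np.exp(-((t - 0.5) ** 2) / 0.02))
# Step boundaries cluster near t = 0.5, where the assumed density peaks.
```

Regions where the density is high receive small steps, matching the bullet points above.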
4. Guidance-Scale Scheduler¶
When the hare equation spans large time intervals, naive extrapolation may become unstable. A scheduler is introduced to:
- Moderately reduce the guidance scale \(s\) over large intervals
- Ensure stability of the extrapolation
- Preserve final generation quality
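A minimal sketch of such a scheduler; the linear damping rule, the reference interval, and the floor are hypothetical choices for illustration, not the paper's schedule:

```python
def scheduled_guidance_scale(s, dt_coarse, dt_ref, floor=0.5):
    """Toy guidance-scale scheduler: shrink s as the hare interval grows,
    so that coarse-grid extrapolation stays stable (hypothetical linear
    damping with a lower floor)."""
    damp = max(floor, min(1.0, dt_ref / dt_coarse))
    return s * damp

# A hare interval 3x wider than the reference gets its guidance scale halved
# (the damping bottoms out at the 0.5 floor).
s_eff = scheduled_guidance_scale(7.5, dt_coarse=0.03, dt_ref=0.01)
```

At the reference interval the scale is untouched, so the scheduler only intervenes where the coarse grid is genuinely wide.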
Loss & Training¶
THG is a completely training-free method:
- No modification or retraining of the diffusion model is required
- Only the ODE-solving strategy at inference time is altered
- Plug-and-play compatible with any CFG-based diffusion model
Key Experimental Results¶
Main Results¶
Performance on Stable Diffusion and SDXL:
| Method | NFE ↓ | FID ↓ | CLIP Score ↑ | ImageReward ↑ | ΔImageReward |
|---|---|---|---|---|---|
| DDIM (50 steps) | 100 | 15.2 | 0.312 | 0.876 | baseline |
| DPM-Solver++ (25 steps) | 50 | 15.8 | 0.310 | 0.871 | -0.005 |
| PNDM (25 steps) | 50 | 16.1 | 0.308 | 0.865 | -0.011 |
| PAB | 70 | 15.5 | 0.311 | 0.872 | -0.004 |
| DeepCache | 60 | 16.4 | 0.307 | 0.858 | -0.018 |
| THG (ours) | 70 | 15.3 | 0.311 | 0.873 | -0.003 |
| THG (ours, aggressive) | 50 | 15.9 | 0.309 | 0.844 | -0.032 |
Comparison under equal NFE budgets:
| Method | NFE=50 FID ↓ | NFE=50 ImageReward ↑ | NFE=70 FID ↓ | NFE=70 ImageReward ↑ |
|---|---|---|---|---|
| DPM-Solver++ | 15.8 | 0.871 | 15.4 | 0.874 |
| DeepCache | 17.2 | 0.845 | 16.4 | 0.858 |
| PAB | 16.5 | 0.860 | 15.5 | 0.872 |
| THG | 15.5 | 0.878 | 15.3 | 0.873 |
Ablation Study¶
| Configuration | NFE | FID | ImageReward | Note |
|---|---|---|---|---|
| Full THG | 70 | 15.3 | 0.873 | Complete method |
| w/o adaptive step size | 70 | 15.8 | 0.865 | Adaptive steps matter |
| w/o guidance-scale scheduler | 70 | 16.2 | 0.852 | Scheduler stabilizes coarse extrapolation |
| Coarse ratio = 2:1 | 85 | 15.4 | 0.872 | Conservative setting |
| Coarse ratio = 4:1 | 55 | 16.5 | 0.838 | Too aggressive |
| Coarse ratio = 3:1 (default) | 70 | 15.3 | 0.873 | Optimal balance |
Key Findings¶
- 30% NFE reduction with near-lossless quality: THG cuts NFE from 100 to 70 (a 30% reduction) at a \(\Delta\)ImageReward of only \(-0.003\); even the aggressive setting (NFE 50) stays within \(|\Delta\text{ImageReward}| \leq 0.032\).
- Outperforms alternatives under equal budgets: At the same NFE, THG achieves better FID and ImageReward than DeepCache and PAB.
- Adaptive step size contributes significantly: Compared to a fixed step ratio, the adaptive timestep sampler yields +0.008 ImageReward improvement.
- Guidance-scale scheduler is essential for stability: Removing the scheduler increases FID by 0.9, primarily affecting high-guidance-scale scenarios.
- 3:1 step ratio is optimal: A tortoise-to-hare ratio of 3:1 achieves the best efficiency–quality trade-off.
Highlights & Insights¶
- Mathematically grounded motivation: The robustness of the guidance term is derived from rigorous ODE error analysis rather than empirical observation.
- Training-free design: No additional training cost; fully plug-and-play.
- Vivid naming: The tortoise-and-hare metaphor intuitively conveys the multirate concept.
- Strong practical value: Directly integrable with existing diffusion models for inference acceleration.
- Open-source code: Facilitates community reproduction and extension.
Limitations & Future Work¶
- CFG-only applicability: Not directly applicable to methods that do not use CFG (e.g., flow matching).
- Limited extrapolation accuracy: When the guidance scale \(s\) is large, coarse-grid extrapolation may introduce perceptible artifacts.
- Step ratio requires tuning: The optimal step ratio may depend on the specific model and task.
- Compatibility with other acceleration techniques: Joint use with distillation-based methods remains unexplored.
- Extension to video generation: Applicability to video diffusion models has yet to be validated.
Related Work & Insights¶
- DPM-Solver++: High-order ODE solver for accelerated diffusion sampling.
- DeepCache: Caches intermediate features to reduce redundant computation.
- PAB (Pyramid Attention Broadcast): Progressive attention broadcasting for acceleration.
- Multirate integration: A classical technique in numerical analysis, applied here to diffusion models for the first time.
- Future directions: Exploring finer-grained, component-level multirate decomposition (e.g., assigning different step sizes to different layers).
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First application of multirate integration to diffusion sampling acceleration
- Theoretical Depth: ⭐⭐⭐⭐ — Supported by rigorous error bound analysis
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple models, metrics, and thorough ablations
- Practical Impact: ⭐⭐⭐⭐⭐ — Training-free, plug-and-play, directly reduces inference cost
- Writing Quality: ⭐⭐⭐⭐ — Clear and vivid, with clever nomenclature