Accelerating Diffusion Sampling via Exploiting Local Transition Coherence¶
Conference: ICCV 2025 arXiv: 2503.09675 Code: Project Page Area: Image Generation Keywords: Diffusion model acceleration, training-free acceleration, sampling step compression, text-to-image, text-to-video
TL;DR¶
This paper proposes LTC-Accel, a training-free diffusion sampling acceleration method based on the phenomenon of Local Transition Coherence (LTC). By exploiting the strong correlation between transition operators of adjacent denoising steps, the method approximates the current step's computation using the previous step's transition operator. It achieves 1.67× speedup on Stable Diffusion v2, and combined with distilled models, reaches 10× acceleration in video generation.
Background & Motivation¶
Diffusion models have achieved remarkable breakthroughs in text-guided image and video generation, yet the lengthy denoising sampling process remains a primary bottleneck for practical deployment. For instance, generating a 5-second, 8 FPS, 720P video using Wan2.1-14B on a single H20 GPU requires approximately 6935 seconds.
Existing acceleration methods fall into two categories:
Training-based methods: Improve efficiency via distillation or architectural modifications, but require additional training resources and time.
Training-free methods: Include efficient solvers such as DDIM and DPM-Solver, as well as caching-based methods like DeepCache that reuse intermediate features.
However, existing training-free methods exhibit notable limitations:
- Some ignore the statistical relationship between adjacent steps (e.g., direct step skipping).
- Others rely on attention or feature similarity (e.g., DeepCache), which are tightly coupled to specific network architectures and require redesigned caching strategies when the architecture changes.
Core Finding: The authors identify a novel statistical phenomenon — Local Transition Coherence (LTC) — whereby the transition operators \(\Delta\mathbf{x}_{t+1,t}\) between adjacent steps exhibit highly similar directions and magnitudes during certain phases of the diffusion process. This coherence is independent of any specific network architecture, conferring strong generalizability.
Method¶
Overall Architecture¶
The core idea of LTC-Accel is straightforward: within intervals where transition operators are highly coherent, the previous step's transition operator is used to approximate the current step's, thereby skipping the network inference computation for that step.
The complete pipeline consists of three stages:
- Identify acceleration intervals: Measure the angle between consecutive transition operators and locate continuous intervals \([a, b]\) where the angle is below a threshold \(\tau\).
- Apply approximate substitution: Within the acceleration interval, approximate every \(r\)-th step (those satisfying \(t \bmod r = r - 1\)) using the previous step's transition operator, where \(r\) is the acceleration period.
- (Optional) Refine \(w_g\): Further improve approximation quality via end-to-end search.
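The three-stage pipeline can be sketched as a plain sampling loop. This is a minimal illustration, not the paper's implementation: `model_step` (one full network call) and the precomputed `wg`/`gamma` lists are hypothetical names.

```python
def ltc_accel_sample(x, timesteps, model_step, a, b, r, wg, gamma):
    """Sketch of an LTC-Accel sampling loop.

    Inside the acceleration interval [a, b], every r-th step reuses the
    previous transition operator (scaled by wg[i] * gamma[i]) instead of
    calling the network.
    """
    prev_delta = None
    for i, t in enumerate(timesteps):
        in_interval = a <= i <= b
        if in_interval and prev_delta is not None and i % r == r - 1:
            # Skip the network call: approximate with the previous transition.
            delta = wg[i] * gamma[i] * prev_delta
        else:
            x_next = model_step(x, t)  # full denoising step (network inference)
            delta = x_next - x
        x = x + delta
        prev_delta = delta
    return x
```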
Key Design 1: Quantifying Local Transition Coherence¶
The transition operator is defined as \(\Delta\mathbf{x}_{t+1,t} = \mathbf{x}_t - \mathbf{x}_{t+1}\), i.e., the update between adjacent steps.
Coherence is measured by the angle between two consecutive transition operators:

\[\theta_t = \arccos\frac{\langle \Delta\mathbf{x}_{t+1,t},\, \Delta\mathbf{x}_{t+2,t+1}\rangle}{\|\Delta\mathbf{x}_{t+1,t}\|_2\,\|\Delta\mathbf{x}_{t+2,t+1}\|_2}\]
Empirical observations show that during 40-step DDIM sampling with Stable Diffusion v2, the angles between steps 12 and 38 are very small (approximately 0.1–0.2 radians), indicating that the update trajectories within this interval are highly similar.
Acceleration interval: All consecutive steps satisfying \(\theta_t < \tau\) (typically \(\tau = 0.1\)) constitute the interval \([a, b]\).
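The angle measurement above can be sketched directly from three consecutive latents (a minimal numpy sketch; the function name is an assumption):

```python
import numpy as np

def transition_angle(x_t, x_t1, x_t2):
    """Angle between consecutive transition operators (Key Design 1).

    d1 = x_t  - x_t1   corresponds to Delta x_{t+1,t}
    d2 = x_t1 - x_t2   corresponds to Delta x_{t+2,t+1}
    """
    d1 = np.ravel(x_t - x_t1)
    d2 = np.ravel(x_t1 - x_t2)
    cos = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
    return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards float round-off
```

Steps whose angle stays below the threshold \(\tau\) are grouped into the acceleration interval.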
Key Design 2: Transition Operator Approximation Formula¶
Within the acceleration interval, for steps satisfying \(t \bmod r = r - 1\), the following approximation is applied:

\[\Delta\mathbf{x}_{t+1,t} \approx w_g\,\gamma\,\Delta\mathbf{x}_{t+2,t+1}, \qquad \mathbf{x}_t^* = \mathbf{x}_{t+1} + w_g\,\gamma\,\Delta\mathbf{x}_{t+2,t+1}\]
where:
- \(\gamma\): An inter-step progress scaling factor defined as \(\gamma = \frac{\phi(t) - \phi(t+1)}{\phi(t+1) - \phi(t+2)}\), where \(\phi(t) = \sqrt{\text{SNR}_t}\) reflects denoising progress.
- \(w_g\): An amplitude scaling factor obtained by minimizing the approximation error \(\|\Delta\mathbf{x}_{t+1,t} - w_g \gamma \Delta\mathbf{x}_{t+2,t+1}\|_2\), yielding the least-squares closed-form solution:

\[w_g = \frac{\langle \Delta\mathbf{x}_{t+1,t},\, \gamma\Delta\mathbf{x}_{t+2,t+1}\rangle}{\|\gamma\Delta\mathbf{x}_{t+2,t+1}\|_2^2}\]
Although \(w_g\) theoretically depends on the unknown target \(\mathbf{x}_t\), the authors demonstrate that it is a convergent quantity depending only on the step index \(t\) — its value at the same step converges consistently across different initial noises and prompts.
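Both scaling factors have short closed forms, sketched below. This is an illustrative reading of the definitions above; the function names and argument conventions are assumptions, not the paper's code.

```python
import numpy as np

def gamma_factor(phi_t, phi_t1, phi_t2):
    """Progress scaling gamma = (phi(t) - phi(t+1)) / (phi(t+1) - phi(t+2)),
    where phi(t) = sqrt(SNR_t) reflects denoising progress."""
    return (phi_t - phi_t1) / (phi_t1 - phi_t2)

def optimal_wg(delta_target, delta_prev, gamma):
    """Least-squares amplitude: argmin_w ||delta_target - w * gamma * delta_prev||_2.

    delta_target corresponds to Delta x_{t+1,t},
    delta_prev   corresponds to Delta x_{t+2,t+1}.
    """
    a = np.ravel(delta_target)
    b = gamma * np.ravel(delta_prev)
    return float(np.dot(a, b) / np.dot(b, b))
```

During real sampling `delta_target` is unknown, which is why the convergence of \(w_g\) across inputs (and hence its precomputability) matters.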
Key Design 3: Estimation and Refinement of \(w_g\)¶
Since \(\mathbf{x}_t\) is unavailable during actual sampling (it is precisely the quantity being approximated), the authors propose a two-stage algorithm:
- Algorithm 2: Computes \(w_g\) incrementally, uses the approximated \(\mathbf{x}_t^*\) as input for the next step, and performs local search over the entire acceleration interval to minimize accumulated error.
- Algorithm 3 (optional): Introduces a global bias to adjust all \(w_g\) values, performing end-to-end binary search evaluated by PSNR, which improves PSNR from 37.5 to 39.
Error Analysis¶
The relative error of a single-step approximation has a strict upper bound: with the least-squares-optimal \(w_g\),

\[\frac{\|\Delta\mathbf{x}_{t+1,t} - w_g\gamma\Delta\mathbf{x}_{t+2,t+1}\|_2}{\|\Delta\mathbf{x}_{t+1,t}\|_2} = \sin\theta_t \le \sin\tau.\]
For \(\tau \in [0.1, 0.2]\), the error is negligible. In practice, even when 32.5% of steps are approximated, the accumulated error is only 6.0%, with PSNR reaching 36.6 dB.
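One way to see why small angles keep the error small: the best scalar multiple of one vector approximating another leaves a relative residual of exactly \(\sin\theta\), a standard least-squares identity, checked numerically below on random stand-ins for the transition operators.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(512)  # stands in for Delta x_{t+1,t}
b = rng.standard_normal(512)  # stands in for gamma * Delta x_{t+2,t+1}

w = np.dot(a, b) / np.dot(b, b)  # optimal scaling, i.e. the w_g role
rel_err = np.linalg.norm(a - w * b) / np.linalg.norm(a)
theta = np.arccos(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
# The residual of the optimal projection equals sin(theta) exactly.
```

For \(\theta \le 0.1\) this caps the per-step relative error at roughly 10%, and in practice the measured accumulated error is far smaller.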
Key Experimental Results¶
Main Results: Text-to-Image¶
Evaluated on 1,000 prompts from MS-COCO 2017, with ImageReward and PickScore as metrics.
| Model | Sampler | Original Steps | Accelerated Steps | Speedup | ImageReward Change |
|---|---|---|---|---|---|
| SD v2 | DDIM | 50 | 30 | 1.67× | 0.4209→0.4183 |
| SD v2 | DDIM | 100 | 60 | 1.67× | 0.4451→0.4467 |
| SD v3.5 | DPM-Solver++ | 60 | 40 | 1.50× | Nearly lossless |
| SD v3.5 | EDM | 60 | 39 | 1.54× | Nearly lossless |
Main Results: Text-to-Video¶
Evaluated on 100 prompts from MS-COCO 2017, measuring Frame Consistency and Textual Faithfulness.
| Model | Original Steps | Accelerated Steps | Speedup | Text↑ | Smooth↑ |
|---|---|---|---|---|---|
| AnimateDiff | 30 | 19 | 1.58× | 0.3462→0.3465 | 0.9676→0.9681 |
| CogVideoX 2B | 30 | 19 | 1.58× | 0.2302→0.2320 | 0.9464→0.9435 |
| CogVideoX 2B | 40 | 26 | 1.54× | 0.3918→0.3775 | 0.9514→0.9511 |
Combination with Other Methods¶
| Combination | Model | Individual Speedup | Combined Speedup | Quality Impact |
|---|---|---|---|---|
| + DeepCache | SD v2 (50 steps) | 1.66× | 2.34× | ImageReward improves from 0.4039 to 0.4096 |
| + Align Your Steps | SD v1.5 (10 steps) | — | 1.25× | ImageReward drops only 0.0023 |
| + AnimateDiff-Lightning | Distilled 4-step | 7.5× | 10× | Minimum 3-step generation, nearly lossless quality |
| + INT8 Quantization | CogVideoX 2B | — | 1.54× | Compatible, no significant quality degradation |
Ablation Study: LTC-Accel vs. Direct Step Skipping¶
| Sampler | Steps | Step-skip ImageReward | LTC-Accel ImageReward |
|---|---|---|---|
| DDIM | 7 | 0.0537 | 0.1472 |
| DDIM | 10 | 0.2003 | 0.2442 |
| DDIM | 13 | 0.2812 | 0.3129 |
| EDM | 7 | 0.0158 | 0.2018 |
| EDM | 10 | 0.2003 | 0.3171 |
| EDM | 13 | 0.2582 | 0.3335 |
LTC-Accel consistently and substantially outperforms the naive step-skipping strategy across all settings, with the advantage being most pronounced at lower step counts.
Highlights & Insights¶
- Discovery of a novel statistical phenomenon: Local Transition Coherence reveals the intrinsic consistency of transition operators between adjacent steps in diffusion sampling. This finding differs from prior observations based on attention similarity or feature caching, representing a more fundamental and universal regularity.
- Complete decoupling from network architecture: LTC-Accel focuses solely on relationships between network outputs, making no assumptions about internal network structure. It thus seamlessly adapts to diverse architectures including U-Net (SD v2), DiT (SD v3.5), and video models (CogVideoX).
- Orthogonality to nearly all existing acceleration methods: LTC-Accel can be freely combined with DeepCache, Align Your Steps, distilled models, and INT8 quantization for additional speedup. This composability is of substantial practical value.
- Convergence of \(w_g\) underpins the method's feasibility: Although \(w_g\) theoretically depends on the unknown \(\mathbf{x}_t\), its convergence across different inputs enables precomputation, providing the mathematical foundation for practical application.
- Potential for real-time video generation: Combined with distilled models, the method achieves 10× acceleration and 16+ FPS real-time video generation, carrying significant practical implications for the deployment of video diffusion models.
Limitations & Future Work¶
- Dependence on the existence of LTC: When the number of sampling steps is extremely small (e.g., fewer than 3), local transition coherence weakens and the method breaks down, limiting its combination with aggressively compressed distilled models.
- Hyperparameter tuning required: The acceleration interval \([a, b]\), period parameter \(r\), and threshold \(\tau\) require manual adjustment for different models and samplers. While \(r=2\) generalizes well across most scenarios, the optimal acceleration interval may vary considerably across different diffusion processes.
- Additional overhead for \(w_g\) precomputation: Although \(w_g\) converges and can be precomputed, determining the \(w_g\) sequence and acceleration interval for a new model or sampler requires running one complete sampling pass.
- Ceiling on speedup ratio: The typical speedup is approximately 1.5–1.67×; while higher ratios are achievable in combination with other methods, the standalone acceleration is limited.
Related Work & Insights¶
- DeepCache: Achieves acceleration by caching and reusing high-level features, but is coupled to network architecture. LTC-Accel can be combined with DeepCache to provide an additional 1.41× speedup on top.
- Align Your Steps: Optimizes the sampling schedule and is orthogonal to LTC-Accel. Their combination achieves the quality of the original 10-step AYS using only 8 steps.
- DPM-Solver++: A high-order ODE solver that reduces sampling steps. Experiments show LTC-Accel significantly outperforms DPM-Solver++ under equivalent step counts.
- Distillation methods (Progressive Distillation, Consistency Models, etc.): Training-based methods on top of which LTC-Accel can provide further acceleration.
- Insight: The paradigm of "exploiting intrinsic process redundancy rather than model structural redundancy" is highly generalizable and may inspire further architecture-agnostic acceleration methods.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Discovering the LTC phenomenon and designing an acceleration method around it is a genuinely novel perspective.
- Theoretical Depth: ⭐⭐⭐⭐ — Supported by complete mathematical derivations, error analysis, and convergence analysis.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers multiple models, samplers, and combination strategies with thorough ablation studies.
- Value: ⭐⭐⭐⭐⭐ — Training-free, architecture-agnostic, composable, and highly engineering-friendly.
- Writing Quality: ⭐⭐⭐⭐ — Clear logic with rich figures and tables.