Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache
Conference: CVPR 2026
arXiv: 2602.22654
Code: GitHub
Area: Diffusion Model Acceleration
Keywords: Diffusion Model Acceleration, Caching, Dynamic Programming, Path Planning, Training-Free
TL;DR
Diffusion model sampling acceleration is formulated as a global path-planning problem. By constructing a Path-Aware Cost Tensor (PACT) to quantify the path dependency of skipping errors and using dynamic programming to select the optimal sequence of key steps, DPCache reaches up to a 4.87× speedup on FLUX and, at 3.54×, even surpasses the full-step baseline by +0.028 ImageReward.
Background & Motivation
Inference overhead for diffusion models (especially DiT architectures) is high. Training-free caching methods are favored but suffer from fundamental flaws:
- Fixed scheduling (e.g., DeepCache): uniform skipping patterns ignore local feature dynamics, leading to large deviations in critical transition regions.
- Local adaptive scheduling (e.g., TeaCache/AdaCache): greedy, short-sighted decisions may skip crucial steps, causing irreversible drift and error accumulation.
- Both strategies overlook the global structure of the denoising trajectory.
Key Challenge: Optimal acceleration requires a global perspective—considering the entire denoising trajectory and planning sampling steps holistically.
Method
Overall Architecture
DPCache consists of three stages: (1) Calibration phase: Run the full denoising process on a few samples to construct a 3D Path-Aware Cost Tensor; (2) Optimal schedule selection: Use dynamic programming to select the optimal key step sequence given a step budget \(K\); (3) Inference phase: Execute full computation only at key steps and use cached features to predict remaining steps.
Key Designs
- Path-Aware Cost Tensor (PACT): Traditional 2D cost matrices encode only the skipping error between two steps, whereas cached-prediction features also depend on the previously cached state (path dependency). PACT is a 3D tensor \(\mathcal{C} \in \mathbb{R}^{(T+1)\times(T+1)\times(T+1)}\), where each element \(\mathcal{C}[i,j,k]\) (\(i>j>k\)) is the cumulative prediction error from step \(j\) to step \(k\), conditioned on the predecessor key step being \(i\):
  $$\mathcal{C}[i,j,k] = \sum_{\tau=k}^{j-1} \|\mathbf{h}_\tau^L - \mathbf{h}_{pred,\tau}^L(i,j)\|_1$$
  Using cumulative rather than single-point error prevents large strides that look locally optimal but are globally unstable. Only last-layer features are used for construction, to save memory, and a sentinel step \(t=0\) is introduced for unified boundary handling (see the sketch after this list).
- Dynamic Programming for Optimal Schedule Selection: Given the cost tensor and a target step count \(K\), the total path cost over the key-step sequence \(\mathcal{T}\) is minimized:
  $$\arg\min_{\mathcal{T}} \sum_{m=2}^{K} \mathcal{C}[t_{m-1}, t_m, t_{m+1}]$$
  A DP table \(D[m,k]\) (the minimum cumulative cost to reach step \(k\) using \(m\) key steps) and a path table \(P[m,k]\) (for backtracking) are maintained, with the recurrence
  $$D[m,k] = \min_{j>k} \left( D[m-1,j] + \mathcal{C}[P[m-1,j], j, k] \right)$$
  The first \(M=3\) steps are fixed as mandatory to preserve early denoising dynamics. The time complexity is \(O(KT^2)\), a negligible overhead for \(K < T = 50\).
- Generalization via Calibration Set: The shape of the denoising trajectory depends primarily on the diffusion model rather than on the generated content, so a universal optimal schedule can be obtained from only ~10 calibration samples, eliminating per-input computation. During inference, key steps execute full forward passes and cache their features, while the remaining steps are predicted via Taylor expansion:
  $$\mathcal{F}_{pred,m}^l(\mathbf{x}_{t-k}) = \mathbf{h}_t^l + \sum_{i=1}^{m} \frac{\Delta^i \mathbf{h}_t^l}{i! \cdot \mathcal{N}^i}(-k)^i$$
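To make the PACT construction and the DP recurrence concrete, here is a minimal NumPy sketch. It assumes a calibration run has already recorded the last-layer features `feats[t]` for every step and exposes a `predict(i, j, tau)` hook standing in for the cached-feature predictor; all names are illustrative rather than the released implementation, and boundary handling (the sentinel step, the \(M=3\) mandatory early steps) is simplified.

```python
import numpy as np

def build_pact(feats, predict):
    """Path-Aware Cost Tensor C[i, j, k] for i > j > k.

    feats[t]          : last-layer feature h_t^L recorded at step t (arrays).
    predict(i, j, tau): hypothetical hook returning the cached-feature
                        prediction of h_tau^L given key steps i and j.
    """
    T = len(feats) - 1
    C = np.full((T + 1, T + 1, T + 1), np.inf)
    for i in range(2, T + 1):          # predecessor key step
        for j in range(1, i):          # current key step
            for k in range(j):         # candidate next key step
                # cumulative L1 error over every step skipped between j and k
                C[i, j, k] = sum(
                    float(np.abs(feats[tau] - predict(i, j, tau)).sum())
                    for tau in range(k, j)
                )
    return C

def dp_schedule(C, K):
    """Select K key steps (T = t_1 > ... > t_K = 0) minimizing total path cost."""
    T = C.shape[0] - 1
    D = np.full((K + 1, T + 1), np.inf)         # D[m, k]: best cost ending at k
    P = np.full((K + 1, T + 1), -1, dtype=int)  # P[m, k]: predecessor (backtracking)
    D[2, :T] = 0.0                              # first key step is t_1 = T; no full
    P[2, :T] = T                                # (i, j, k) triple exists before m = 3
    for m in range(3, K + 1):
        for k in range(T):
            for j in range(k + 1, T + 1):
                if not np.isfinite(D[m - 1, j]):
                    continue
                cand = D[m - 1, j] + C[P[m - 1, j], j, k]
                if cand < D[m, k]:
                    D[m, k], P[m, k] = cand, j
    path, k = [0], 0                   # assume the sentinel step 0 ends the path
    for m in range(K, 1, -1):
        path.append(P[m, k])
        k = P[m, k]
    return path[::-1]                  # descending key steps, e.g. [T, ..., 0]
```

In practice the tensor would be averaged over the ~10 calibration samples before `dp_schedule` is run once per model; `build_pact` dominates the offline cost, while the DP itself is exactly the \(O(KT^2)\) loop above.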
Loss & Training
- Completely training-free method; no fine-tuning or distillation required.
- The calibration phase uses ~10 samples, and the prediction order is set to a 2nd-order Taylor expansion (sketched below).
- Decoupled from specific predictors—seamlessly integrates with future improved prediction methods.
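As a companion sketch of that 2nd-order setting, the following shows one way the Taylor-expansion predictor above can be realized with finite differences over the most recently cached features. It assumes uniformly spaced key steps (interval \(\mathcal{N}\)); the function name and sign conventions are illustrative, not TaylorSeer's exact implementation.

```python
import numpy as np
from math import comb, factorial

def taylor_predict(cached, interval, k, order=2):
    """Predict the feature k steps past the newest key step via Taylor expansion.

    cached  : features from the most recent key steps, newest first
              (needs order + 1 entries), e.g. [h_t, h_{t+N}, h_{t+2N}].
    interval: spacing N between cached key steps (assumed uniform here).
    k       : how many steps beyond the newest key step t to predict (t - k).
    """
    pred = cached[0].copy()                    # zeroth-order term h_t
    for i in range(1, order + 1):
        # i-th finite difference over cached features, oriented so that the
        # (-k)^i term extrapolates toward the next (smaller) timestep
        diff = sum((-1) ** (i - m) * comb(i, m) * cached[m] for m in range(i + 1))
        pred = pred + diff * (-k) ** i / (factorial(i) * interval ** i)
    return pred

# toy usage: three cached feature maps from key steps t, t+N, t+2N (N = 5)
rng = np.random.default_rng(0)
cached = [rng.standard_normal((4, 8)) for _ in range(3)]
h_pred = taylor_predict(cached, interval=5, k=3)   # estimate the feature at t - 3
```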
Key Experimental Results
Main Results (FLUX.1-dev, DrawBench)
| Method | Speedup | ImageReward↑ | CLIP Score↑ | PSNR↑ | SSIM↑ |
|---|---|---|---|---|---|
| Full 50-step Baseline | 1.00× | 0.979 | 17.40 | — | — |
| TeaCache | 3.42× | 0.934 | 17.17 | 16.31 | 0.6812 |
| TaylorSeer | 3.51× | 0.939 | 17.31 | 16.95 | 0.6922 |
| SpeCa | 3.62× | 0.975 | 17.27 | 18.35 | 0.6773 |
| Ours (K=13) | 3.54× | 1.007 | 17.34 | 21.65 | 0.8106 |
| Ours (K=9) | 4.87× | 0.958 | 17.33 | 18.77 | 0.7117 |
Ours (K=13) exceeds the full-step baseline by +0.028 ImageReward at a 3.54× speedup!
Ablation Study (HunyuanVideo, VBench)
| Method | Speedup | VBench↑ | PSNR↑ | SSIM↑ |
|---|---|---|---|---|
| Full 50-step | 1.00× | 80.93 | — | — |
| TaylorSeer | 3.87× | 80.33 | 18.53 | 0.6025 |
| SpeCa | 4.05× | 80.26 | 20.09 | 0.6370 |
| Ours (K=11) | 4.05× | 80.35 | 23.11 | 0.7462 |
The method significantly outperforms others in video generation, with particularly notable improvements in PSNR.
Key Findings
- The core advantage of global optimal scheduling lies in trajectory fidelity (PSNR/SSIM/LPIPS leading by a large margin), indicating that path planning keeps the sampling trajectory closer to the original denoising path.
- Advantages become more pronounced at higher acceleration ratios—surpassing the baseline at 3.54× and remaining far ahead of others at 4.87×.
- Modeling path dependency (3D PACT vs. 2D cost matrix) is critical; 2D matrices fail to capture the influence of predecessor steps on current skipping errors.
- 10 calibration samples are sufficient, confirming that denoising trajectory structures are content-independent.
Highlights & Insights
- Modeling sampling acceleration as a path planning problem is highly novel—dynamic programming is perfectly suited for this scenario.
- The 3D PACT design accurately captures path dependency, representing a fundamental advancement over existing 2D analyses.
- That the accelerated model beats the full-step baseline on FLUX suggests that skipping certain steps may not only preserve quality but actually enhance it, by preventing cumulative error.
- The design's decoupling from the predictor makes it a universal acceleration framework.
Limitations & Future Work
- The calibration phase requires full denoising runs on the calibration samples once per model, adding preprocessing time.
- Memory overhead for PACT construction is \(O(T^3)\), which may become a bottleneck for large step counts \(T\): at \(T=50\) the tensor holds only \(51^3 \approx 1.3\times10^5\) entries (roughly 0.5 MB in float32), but \(T=1000\) would already require about 4 GB.
- Currently assumes all samples share the same optimal schedule, which may be suboptimal for out-of-distribution samples.
- Integrating with adaptive scheduling is worth exploring: PACT could provide the global framework while local fine-tuning adapts to specific samples.
Related Work & Insights
- DeepCache discovered cross-step similarity in high-level features; DPCache optimizes when to cache versus compute based on this.
- TaylorSeer’s Taylor expansion predictor is complementary to DPCache’s path planning framework.
- Unlike step-reduction methods (DDIM/DPM-Solver/Distillation), DPCache belongs to the per-step computation optimization category.
- The path planning perspective could inspire computational optimization in other sequential decision-making problems.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The combination of path planning, PACT, and DP is highly original.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers DiT, FLUX, and HunyuanVideo across image and video domains.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear problem definition, rigorous formalization, and intuitive illustrations.
- Value: ⭐⭐⭐⭐⭐ Training-free, universal, and high-performance, offering immense practical value.