# SODA: Sensitivity-Oriented Dynamic Acceleration for Diffusion Transformer
Conference: CVPR 2026
arXiv: 2603.07057
Code: leaves162/SODA
Area: Model Compression
Keywords: Diffusion Transformer, training-free acceleration, caching, pruning, sensitivity modeling, dynamic programming
## TL;DR
SODA achieves controllable-speedup, high-fidelity generation for Diffusion Transformers without any training, combining offline fine-grained sensitivity modeling, dynamic-programming-based cache schedule optimization, and a unified adaptive pruning strategy.
## Background & Motivation
Background: Diffusion Transformers excel at image and video generation, but repeated sampling steps and Transformer blocks result in low inference efficiency, severely hindering deployment.
Limitations of Prior Work: Training-based acceleration methods such as distillation and fine-tuning incur high computational costs and limited generalizability, motivating the pursuit of training-free acceleration. Caching offers high efficiency at the cost of fidelity, while pruning is flexible but less efficient; combining both is desirable. Existing methods such as ToCa and DuCa rely on fixed or heuristic configurations for cache intervals and pruning rates, capturing only coarse-grained sensitivity trends and failing to account for fine-grained variation across timesteps, layers, and module dimensions.
Key Challenge: Heuristic schemes inevitably skip computations that are highly sensitive to acceleration, leading to degraded generation fidelity and poor cross-model generalizability.
Goal: To design a training-free acceleration framework that models sensitivity at fine granularity and optimizes caching and pruning decisions jointly and adaptively.
## Method
### Overall Architecture
SODA consists of three modules: (1) OFS — offline fine-grained sensitivity modeling; (2) DCS — dynamic cache schedule optimization; (3) UAS — unified adaptive strategy formulation. At inference time, the pipeline is: compute offline sensitivity priors → solve for the optimal cache-interval combination via dynamic programming → adaptively decide pruning timing and rate during pruning/cache reuse.
### Key Designs
#### 1. Offline Fine-Grained Sensitivity Modeling (OFS)
- Cache sensitivity error \(\mathcal{E}_c(t,l,m,n) = 1 - \text{Cos}(\mathcal{D}_{t+n,l,m}(x), \mathcal{D}_{t,l,m}(x))\): measures the cosine distance error of reusing the cache from step \(t+n\) at timestep \(t\), covering four dimensions: timestep, layer, module, and cache interval.
- Pruning sensitivity error \(\mathcal{E}_p(t,l,m,\alpha)\): analogously measures feature error under different pruning rates \(\alpha\).
- Offline strategy: Sensitivity errors are averaged over 100 random generations (10 for video models), stored as model-specific priors, and reused permanently after a single offline pass. The offline overhead is minimal (approximately 160 s for DiT-XL/2, with only 0.56 GB additional memory). A minimal sketch of the prior collection follows.
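The sketch below shows one way the cache-sensitivity prior \(\mathcal{E}_c\) could be collected from module outputs recorded during a full (uncached) sampling run; the names `module_outputs` and `collect_cache_prior` and the loop structure are illustrative assumptions, not the paper's actual code:

```python
import torch
import torch.nn.functional as F

def cosine_error(a: torch.Tensor, b: torch.Tensor) -> float:
    """E = 1 - cos(a, b), computed over the flattened features."""
    return 1.0 - F.cosine_similarity(a.flatten(), b.flatten(), dim=0).item()

def collect_cache_prior(module_outputs, T, layers, modules, intervals):
    """Build the cache-sensitivity prior E_c[(t, l, m, n)].

    module_outputs[(t, l, m)]: output tensor of module m in layer l at
    timestep t, recorded during one full (uncached) sampling run; in the
    paper these errors are additionally averaged over many generations.
    """
    E_c = {}
    for t in range(1, T + 1):
        for l in layers:
            for m in modules:
                for n in intervals:
                    if t + n > T:
                        continue
                    # Error incurred by reusing the step-(t+n) output at step t.
                    E_c[(t, l, m, n)] = cosine_error(
                        module_outputs[(t + n, l, m)],
                        module_outputs[(t, l, m)],
                    )
    return E_c
```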
#### 2. Dynamic Cache Schedule Optimization (DCS)
- The cache interval combination problem is formulated as a dynamic programming problem with optimal substructure: \(dp[t][i+1] = \min_{n \in \mathcal{N}} \{\mathcal{E}_{dp}(t,n) + dp[t+n][i]\}\).
- Given an acceleration budget (number of cached steps \(N_s\)), the algorithm minimizes cumulative sensitivity error from \(T\) to 1 and recovers optimal cache timesteps and intervals via backtracking.
- The algorithm operates entirely on offline sensitivity errors, incurring no additional inference overhead; a sketch of the solve and backtracking is given below.
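A runnable sketch of the dynamic program, assuming `dp[t][i]` is the minimal cumulative error for covering steps t..T with i cache decisions remaining, and that choosing interval n at step t means one full computation followed by n − 1 cache reuses (so n = 1 can encode a plain full step with zero error); the paper's exact indexing and budget accounting may differ:

```python
def optimal_cache_schedule(E_dp, T, N, intervals):
    """Solve dp[t][i] = min_n { E_dp(t, n) + dp[t+n][i-1] } and backtrack.

    E_dp(t, n): aggregated offline sensitivity error of caching with
                interval n starting at timestep t.
    T:          total number of sampling steps.
    N:          number of cache decisions in the budget.
    intervals:  candidate cache intervals (the set N in the recurrence).
    Returns (min_error, schedule), schedule being (timestep, interval) pairs.
    """
    INF = float("inf")
    dp = [[INF] * (N + 1) for _ in range(T + 2)]
    choice = [[None] * (N + 1) for _ in range(T + 2)]
    dp[T + 1][0] = 0.0  # all steps consumed, budget exactly exhausted
    for t in range(T, 0, -1):
        for i in range(1, N + 1):
            for n in intervals:
                if t + n > T + 1 or dp[t + n][i - 1] == INF:
                    continue
                cand = E_dp(t, n) + dp[t + n][i - 1]
                if cand < dp[t][i]:
                    dp[t][i] = cand
                    choice[t][i] = n
    # Backtrack from (1, N) to recover the optimal cache timesteps/intervals.
    schedule, t, i = [], 1, N
    while choice[t][i] is not None:
        n = choice[t][i]
        schedule.append((t, n))
        t, i = t + n, i - 1
    return dp[1][N], schedule
```

Because everything here is a table lookup into the offline priors, the solve runs once before sampling and adds nothing to per-step inference cost.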
#### 3. Unified Adaptive Strategy Formulation (UAS)
- Adaptive pruning timing: Pruning is applied at step \(t+1\) only when the pruning error \(\delta_{t+1,l,m} = \mathcal{E}_p(t+1,l,m,\alpha)\) is smaller than the cache error \(\mathcal{E}_c(t,l,m,n)\); otherwise, the cache is directly reused, ensuring that pruning always reduces overall error.
- Adaptive pruning rate: \(\alpha_{t+1,l,m} = \lambda \cdot \mathcal{E}_c(t,l,m,n) + \beta\), where \(\lambda\) is a scaling coefficient and \(\beta\) is a base pruning rate (adjusted adaptively according to the acceleration budget), enabling pruning to be aware of both the global budget and local sensitivity.
- Token importance metric: the feature mean is used for TopK selection, avoiding compatibility issues arising from the unavailability of attention weights under Flash Attention. Both UAS decisions are sketched below.
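A minimal sketch of the two UAS decisions; `E_c` and `E_p` stand for lookups into the offline priors (in practice \(\mathcal{E}_p\) would be queried at the nearest sampled pruning rate), and the function names are assumptions:

```python
import torch

def uas_decide(E_c, E_p, t, l, m, n, lam, beta):
    """Unified adaptive strategy: prune at step t+1 only if it beats reuse.

    E_c(t, l, m, n) / E_p(t, l, m, alpha) are lookups into the offline
    priors (E_p queried at the nearest sampled pruning rate).
    Returns (use_pruning, alpha).
    """
    cache_err = E_c(t, l, m, n)
    # Adaptive pruning rate: local cache sensitivity scaled by lambda,
    # plus a base rate beta set from the global acceleration budget.
    alpha = lam * cache_err + beta
    # Prune only when the pruning error is below the cache-reuse error,
    # so every pruning decision reduces overall error.
    return E_p(t + 1, l, m, alpha) < cache_err, alpha

def tokens_to_recompute(x: torch.Tensor, alpha: float) -> torch.Tensor:
    """Feature-mean TopK selection (no attention weights required).

    x: (batch, tokens, dim) features. Keeps the (1 - alpha) fraction of
    tokens with the largest mean activation for recomputation; the
    remaining tokens reuse cached features.
    """
    scores = x.mean(dim=-1)                            # (batch, tokens)
    k = max(1, int(round((1.0 - alpha) * x.size(1))))
    return scores.topk(k, dim=1).indices               # per-batch token indices
```

Scoring by feature mean keeps the importance metric independent of attention weights, which Flash Attention does not expose.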
### Loss & Training
No additional training loss is introduced. The core optimization objective is minimizing the cumulative sensitivity error \(dp[1][N_s]\) via dynamic programming.
## Key Experimental Results
### Main Results
DiT-XL/2 (class-conditional image generation):

| Model / Setting | Speedup | FID↓ | sFID↓ | IS↑ | Notes |
|---|---|---|---|---|---|
| DiT-XL/2 DDPM baseline | 1.00× | 2.23 | 4.57 | 275.65 | — |
| ToCa (DDPM) | 2.75× | 2.58 | 5.74 | 256.26 | — |
| DuCa (DDPM) | 2.73× | 2.59 | 5.68 | 256.36 | — |
| SODA (DDPM, \(N_s\)=72) | 2.73× | 2.47 | 5.09 | 262.30 | sFID −0.65, IS +6.0 vs. ToCa |
| DiT-XL/2 DDIM baseline | 1.00× | 2.25 | 4.33 | 239.97 | — |
| DuCa (DDIM) | 2.48× | 3.05 | 4.66 | 233.21 | — |
| SODA (DDIM, \(N_s\)=18) | 2.49× | 2.75 | 4.56 | 235.65 | FID −0.30 vs. DuCa |

PixArt-α (text-to-image):

| Model | Speedup | FID-30K↓ | CLIP↑ | Notes |
|---|---|---|---|---|
| PixArt-α baseline | 1.00× | 28.10 | 16.29 | — |
| DuCa | 1.87× | 28.05 | 16.42 | — |
| SODA (\(N_s\)=8) | 1.88× | 27.33 | 16.42 | FID −0.72 vs. DuCa |
| SODA (\(N_s\)=7) | 2.21× | 27.72 | 16.44 | Outperforms baseline even at higher speedup |
### Ablation Study
| Configuration | FID↓ | sFID↓ | IS↑ |
|---|---|---|---|
| Vanilla (fixed cache) | 3.83 | 5.24 | 213.12 |
| OFS + DCS | 2.78 | 4.63 | 234.78 |
| OFS + UAS | 2.89 | 4.75 | 235.01 |
| SODA (full) | 2.75 | 4.56 | 235.65 |
- Relative to the fixed-cache vanilla setting, OFS + DCS improves FID by 1.05 and IS by 21.66; OFS + UAS improves FID by 0.94 and IS by 21.89; combining both yields the best performance.
- Cosine distance outperforms L1/L2 as the sensitivity metric.
### Key Findings
- Low speedup can improve over baseline: At 1.55×, FID decreases from 2.23 to 2.21 (DDPM), possibly because skipping redundant computations leads to more stable denoising.
- Offline modeling exhibits high consistency: Offline sensitivity distributions closely match those observed during actual inference (Fig. 6), with stable variance, confirming that sensitivity is an intrinsic model property.
- Robustness to hyperparameters: Across the tested ranges of \(\lambda\) and \(\beta\), ΔFID ≤ 0.02 and ΔIS ≤ 2.2.
- Effective on video tasks: At 2.50× speedup on OpenSora, VBench drops by only 0.64%, outperforming ToCa, DuCa, PAB, and FORA.
## Highlights & Insights
- SODA is the first to push sensitivity modeling to the granularity of timestep × layer × module, and to solve for the globally optimal cache schedule via dynamic programming with theoretical guarantees.
- Pruning and caching decisions are unified through sensitivity error, enabling adaptive retention of high-sensitivity tokens without manual design, while maintaining generalizability.
- Offline modeling is required only once per model with negligible cost (< 1 hour); at runtime, only a 0.16 MB prior is loaded, incurring zero additional inference overhead.
- Cross-task generalization: the same framework applies without modification to class-conditional image generation, text-to-image, and text-to-video tasks.
## Limitations & Future Work
- As a training-free method, the achievable acceleration still falls short of training-based approaches such as distillation.
- Integration with training-based techniques such as distillation remains unexplored.
- Offline modeling requires one execution per new model (though with low overhead), so the method is not fully plug-and-play.
- The pruning position selection relies on a feature-mean heuristic, which may not be the optimal token importance metric.
- The dynamic programming search space grows with the size of the cache-interval candidate set, so solver efficiency may become a concern in scenarios with very many sampling steps.
## Related Work & Insights
- Cache-based acceleration: FORA, FasterDiffusion, ToCa (ICLR 2025), DuCa — exploit inter-timestep similarity to reuse intermediate features, but rely on fixed or heuristic cache schedules.
- Token pruning: ToMe, AT-EDM — exploit token redundancy for pruning; flexible but less efficient than caching.
- Combined caching and pruning: ToCa, DuCa — perform full computation at anchor steps and cache, then apply pruning with cache reuse at intermediate steps, but with manually designed strategies.
- Sensitivity analysis inspiration: This work is inspired by sensitivity analysis in LLM quantization (e.g., SqueezeLLM) and extends it to multi-dimensional acceleration decisions for diffusion models.
## Rating
- Novelty: ⭐⭐⭐⭐ — Combining sensitivity modeling with dynamic programming for cache optimization is novel; the unified pruning/caching decision mechanism is conceptually clear.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three models, three task types, comprehensive ablations, offline analysis, hyperparameter sensitivity, and qualitative comparisons; very thorough.
- Writing Quality: ⭐⭐⭐⭐ — Well-structured, problem-driven, with an effective motivation figure (Fig. 1).
- Value: ⭐⭐⭐⭐ — Practical and generalizable method with open-source code for reproducibility; directly contributes to the DiT acceleration community.