
LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation

Conference: NeurIPS 2025
arXiv: 2511.00090
Code: GitHub
Area: Video Generation
Keywords: diffusion acceleration, video generation, caching, DAG optimization, lexicographic minimax

TL;DR

This paper proposes LeMiCa, a training-free acceleration framework for diffusion-based video generation that formulates cache scheduling as a lexicographic minimax path optimization problem on a directed acyclic graph (DAG), achieving simultaneous gains in speed and quality (2.9× speedup on Latte; LPIPS as low as 0.05 on Open-Sora) via global error control.

Background & Motivation

  • High inference cost of diffusion models: Although DiT architectures substantially improve video quality, multi-step denoising incurs heavy computation and high latency, constraining interactive applications.
  • Limitations of prior work — local greedy strategies: State-of-the-art methods such as TeaCache decide whether to cache based on local relative differences between adjacent timestep outputs (\(L1_{rel}\)) and a fixed threshold. However, the diffusion denoising process exhibits temporal heterogeneity — early steps shape global structure while later steps refine fine details — causing semantic inconsistency under a uniform threshold.
  • Global degradation from accumulated local errors: Minimizing adjacent-step differences (local-greedy error) ignores the cumulative effect of small errors along the denoising trajectory, ultimately degrading both video quality and content consistency.
  • Prior acceleration methods require training or engineering effort: Distillation, pruning, and quantization demand large datasets and retraining; caching methods hold a natural advantage as training-free alternatives.

Method

Revisiting Caching in Diffusion Sampling

Local-greedy error (prior methods): Decisions are made based on the relative L1 distance between adjacent timestep outputs:

\[L1_{rel}(O, t) = \frac{\|O_t - O_{t+1}\|_1}{\|O_{t+1}\|_1}\]

Global result-aware error (proposed): Measures the impact of each cache segment on the final output:

\[L1_{glob}(i \to j) = \frac{1}{N} \|x_0^{\text{cache}(i \to j)} - x_0^{\text{original}}\|_1\]
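For concreteness, here is a minimal PyTorch sketch of the two error metrics; the tensor arguments and the hypothetical `run_denoising` helper mentioned in the usage comment are illustrative, not the paper's actual interface.

```python
import torch

def local_rel_l1(o_t: torch.Tensor, o_t_plus_1: torch.Tensor) -> float:
    """Local-greedy error: relative L1 distance between adjacent timestep outputs."""
    return (o_t - o_t_plus_1).abs().sum().item() / o_t_plus_1.abs().sum().item()

def global_result_aware_error(x0_cached: torch.Tensor, x0_original: torch.Tensor) -> float:
    """Global result-aware error: mean absolute deviation of the final output
    obtained with cache segment (i -> j) active from the fully computed output."""
    return (x0_cached - x0_original).abs().sum().item() / x0_original.numel()

# Hypothetical usage: run the sampler once without caching for the reference,
# then once per candidate segment with that segment's steps served from cache.
#   x0_ref = run_denoising(prompt, seed, cache_segment=None)
#   w_ij   = global_result_aware_error(
#       run_denoising(prompt, seed, cache_segment=(i, j)), x0_ref)
```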

Key findings: ① Global error propagation is non-uniform and time-dependent — downstream errors caused by early caching are amplified exponentially; ② The temporal position of a cache segment matters more than its length.

Lexicographic Minimax Path Caching (LeMiCa)

Graph construction: A DAG is constructed where each edge represents a candidate cache segment with an edge weight equal to the global result-aware error. A maximum skip length is imposed to control complexity. The static graph is built offline using multiple prompts and random seeds (70 prompts × 10 seeds, averaged).
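A minimal sketch of this construction, assuming the per-segment global errors have already been measured and averaged over prompts and seeds; the `avg_error` callable and the convention that a length-1 edge (no step skipped) carries zero error are assumptions, not the official implementation.

```python
from collections import defaultdict

def build_cache_dag(num_steps: int, max_skip: int, avg_error):
    """Build the cache-scheduling DAG over denoising steps 0..num_steps.

    Nodes are timestep indices.  An edge (i, j) with j > i is a candidate
    cache segment: the model is evaluated at step i and its output is reused
    until step j.  The edge weight is the averaged global result-aware error
    avg_error(i, j); the segment length j - i is capped at max_skip.
    """
    edges = defaultdict(list)  # node -> list of (next_node, weight)
    for i in range(num_steps):
        for j in range(i + 1, min(i + max_skip, num_steps) + 1):
            w = 0.0 if j == i + 1 else avg_error(i, j)  # adjacent edge = no caching
            edges[i].append((j, w))
    return edges
```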

Graph optimization: Given a fixed inference budget \(B\) (number of model forward passes), the method seeks the optimal path from source to sink under the lexicographic minimax criterion:

\[\min_{P \in \mathcal{P}_{s \to t}^{(B)}} \text{LexMax}\big(\text{sort\_desc}(\{w(e) \mid e \in P_{\text{cache}}\})\big)\]

That is, among all feasible paths, the approach first minimizes the maximum cache error; ties are broken by the second-largest error, and so on. This is more robust than shortest-path strategies because cache errors do not accumulate independently — extreme errors cause disproportionate damage to the final output.

Why not shortest path: Early caching errors are amplified exponentially during denoising, and a conventional shortest-path formulation minimizes the total accumulated cost without bounding the peak per-segment error. Lexicographic minimax explicitly constrains worst-case degradation.
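The search itself can be written as a small dynamic program over states (timestep node, forward passes used), where the cost of a partial path is the descending-sorted tuple of its cache-edge errors and Python's native tuple ordering supplies the lexicographic comparison. The sketch below builds on the graph sketch above and assumes each edge consumes exactly one model forward pass; it is not the authors' implementation.

```python
def lexicographic_minimax_path(edges, num_steps: int, budget: int):
    """Find a path from node 0 to node num_steps with exactly `budget` edges
    whose descending-sorted vector of cache-edge errors is lexicographically
    minimal.  `edges` maps node -> list of (next_node, weight).
    """
    INF = (float("inf"),)
    # best[(node, passes_used)] = (cost_tuple, predecessor_state)
    best = {(0, 0): ((), None)}
    for k in range(budget):
        for (node, used), (cost, _) in list(best.items()):
            if used != k:
                continue
            for nxt, w in edges.get(node, []):
                # Only caching edges (w > 0) contribute to the error vector.
                new_cost = tuple(sorted(cost + ((w,) if w > 0 else ()), reverse=True))
                state = (nxt, k + 1)
                if new_cost < best.get(state, (INF, None))[0]:
                    best[state] = (new_cost, (node, k))
    sink = (num_steps, budget)
    if sink not in best:
        return None  # no feasible schedule under this budget
    path, state = [], sink
    while state is not None:
        path.append(state[0])
        state = best[state][1]
    return path[::-1]  # node sequence of the selected cache schedule
```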

Key Properties

  • Training-free: Cache schedules are precomputed offline with no runtime overhead; applying them is a simple lookup at inference time (see the sketch after this list).
  • Model-agnostic: Applicable to diverse architectures including Open-Sora, Latte, and CogVideoX.
  • Robust static cache schedule: Only 1–20 samples are needed to construct a high-quality DAG.
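To make the lookup concrete, here is a minimal sketch of how a precomputed schedule could be consumed during sampling; the `compute_steps` set, the `scheduler_step` callable, and the argument names are illustrative assumptions, not the paper's actual interface.

```python
def denoise_with_schedule(model, scheduler_step, latents, timesteps, compute_steps):
    """Run sampling under a precomputed cache schedule (illustrative sketch).

    compute_steps: indices of timesteps on the chosen minimax path, i.e. where
    the denoiser is actually evaluated; every other step reuses the most
    recently cached model output, so the only runtime cost of the schedule
    is this set-membership check.
    """
    cached_out = None
    for idx, t in enumerate(timesteps):
        if idx in compute_steps or cached_out is None:
            cached_out = model(latents, t)        # full forward pass
        # otherwise: reuse cached_out from the last computed step
        latents = scheduler_step(latents, cached_out, t)
    return latents
```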

Key Experimental Results

Table 1: Inference Efficiency vs. Visual Quality (Single GPU)

| Method | Model | Speedup↑ | VBench↑ | LPIPS↓ | SSIM↑ | PSNR↑ |
|---|---|---|---|---|---|---|
| TeaCache-slow | Open-Sora | 1.50× | 79.20% | 0.134 | 0.837 | 23.50 |
| LeMiCa-slow | Open-Sora | 1.52× | 79.26% | 0.050 | 0.923 | 31.32 |
| TeaCache-fast | Open-Sora | 2.10× | 78.24% | 0.252 | 0.743 | 19.03 |
| LeMiCa-fast | Open-Sora | 2.44× | 78.34% | 0.187 | 0.798 | 21.76 |
| TeaCache-slow | Latte | 1.65× | 77.40% | 0.195 | 0.775 | 21.52 |
| LeMiCa-slow | Latte | 1.69× | 77.45% | 0.091 | 0.865 | 27.65 |
| TeaCache-fast | Latte | 2.60× | 76.09% | 0.318 | 0.674 | 18.04 |
| LeMiCa-fast | Latte | 2.93× | 76.75% | 0.273 | 0.700 | 19.43 |
| TeaCache-slow | CogVideoX | 1.70× | 76.79% | 0.053 | 0.928 | 31.07 |
| LeMiCa-slow | CogVideoX | 1.72× | 76.89% | 0.023 | 0.958 | 35.93 |

LeMiCa consistently outperforms TeaCache across all models. LeMiCa-slow reduces LPIPS from 0.134 to 0.050 on Open-Sora (2.7× improvement) and achieves LPIPS of only 0.023 on CogVideoX. LeMiCa-fast reaches 2.93× speedup on Latte (up from TeaCache's 2.60×).

Table 2: Number of Samples Required for DAG Construction

| # Samples | VBench↑ | LPIPS↓ | SSIM↑ | PSNR↑ |
|---|---|---|---|---|
| 1 | 78.58 | 0.164 | 0.838 | 24.51 |
| 5 | 78.70 | 0.161 | 0.843 | 24.57 |
| 10 | 78.95 | 0.158 | 0.844 | 24.56 |
| 350 | 79.27 | 0.143 | 0.851 | 24.67 |

Strong results are obtained with as few as 1 sample (PSNR 24.51 vs. 24.67 with 350 samples), demonstrating the robustness of the static cache schedule.

Table 3: Ablation on Path Strategy

| Path Strategy | VBench↑ | LPIPS↓ | SSIM↑ | PSNR↑ |
|---|---|---|---|---|
| Shortest Path | 76.04 | 0.203 | 0.809 | 22.90 |
| Minimax Path | 79.27 | 0.143 | 0.851 | 24.67 |

The minimax path comprehensively outperforms the shortest path, validating the core argument that controlling peak error is more important than minimizing accumulated error.

Trajectory Robustness

When the scale parameter of the sampling schedule is varied between 0.5 and 1.5, LeMiCa consistently achieves lower LPIPS than TeaCache across all resulting trajectories, demonstrating effectiveness under diverse denoising paths.

Highlights & Insights

  • Novel theoretical perspective: Formalizing cache scheduling as a lexicographic minimax path problem on a DAG introduces a new optimization paradigm for diffusion acceleration.
  • Global error control: The global result-aware error eliminates the influence of temporal heterogeneity, while lexicographic minimax explicitly constrains worst-case degradation.
  • Minimal perceptual degradation: LPIPS of 0.05 on Open-Sora and 0.023 on CogVideoX represent near-lossless acceleration.
  • Zero additional runtime overhead: Cache schedules are fully precomputed offline and applied via direct lookup at inference time.

Limitations & Future Work

  • Offline DAG construction cost: Building the graph requires full denoising trajectories evaluated across multiple prompts, incurring a non-trivial one-time computational cost.
  • Fixed sampling path assumption: The assumption that denoising trajectories are stable for a trained model may not hold under certain adaptive schedulers.
  • Evaluated only on text-to-video: The method has not been validated on image-to-video generation, video editing, or other related tasks.
  • Maximum skip length requires manual tuning: The setting is heuristic-based, and no automatic determination procedure is provided.

Related Work

  • Diffusion acceleration (DDIM [Song+ 2020], consistency distillation [Song+ 2023], adversarial distillation [Sauer+ 2024]): These methods require training; LeMiCa is training-free.
  • Diffusion caching (DeepCache [Ma+ 2023], PAB [Zhao+ 2024], TeaCache [Liu+ 2024]): Based on local greedy strategies; LeMiCa proposes a global optimization alternative.
  • Δ-DiT [Chen+ 2024], T-GATE [Zhang+ 2024]: Image generation acceleration methods with limited effectiveness when extended to video.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Lexicographic minimax path optimization represents a genuinely new perspective in the diffusion caching literature.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Three models, multiple ablations, sample efficiency analysis, and trajectory robustness evaluation.
  • Writing Quality: ⭐⭐⭐⭐ — The analysis of local vs. global error is thorough and the figures are clear.
  • Value: ⭐⭐⭐⭐⭐ — Provides a strong training-free acceleration baseline for diffusion-based video generation.