VORTA: Efficient Video Diffusion via Routing Sparse Attention¶
Conference: NeurIPS 2025 arXiv: 2505.18809 Code: GitHub Area: Video Generation Keywords: Video diffusion model acceleration, sparse attention, routing mechanism, coreset selection, video generation
TL;DR¶
This paper proposes VORTA, a framework that achieves end-to-end 1.76× acceleration of video diffusion Transformers without quality degradation, through bucketed coreset attention (for modeling long-range dependencies) and a signal-aware routing mechanism (for adaptively selecting sparse attention branches). Combined with caching and distillation methods, it achieves up to 14.41× acceleration.
Background & Motivation¶
Efficiency Bottleneck of Video Diffusion Transformers¶
Video Diffusion Transformers (VDiT) have achieved remarkable progress in high-quality video generation, but at an extremely high computational cost. For instance, HunyuanVideo requires approximately 1,000 seconds (about 500 PFLOPs of compute) to generate a 5-second 720p video, with attention operations accounting for over 75% of the total computation.
The complexity of 3D self-attention is \(\mathcal{O}(L^2 d)\), where the sequence length \(L = F \times H \times W\). Even after VAE compression and patchification, HunyuanVideo still operates on sequences of approximately 100K tokens.
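For concreteness, a rough back-of-the-envelope calculation of the sequence length is sketched below; the 129-frame setting, the 4× temporal / 8× spatial VAE compression, and the 2×2 patchification are assumptions for illustration, not figures stated above.

```python
# Back-of-the-envelope token count for a ~5-second 720p HunyuanVideo clip.
# The compression factors below are assumptions for illustration.
frames, height, width = 129, 720, 1280

latent_f = (frames - 1) // 4 + 1                  # 4x temporal VAE compression -> 33
latent_h, latent_w = height // 8, width // 8      # 8x spatial VAE compression
tokens_per_frame = (latent_h // 2) * (latent_w // 2)  # 2x2 spatial patchification

L = latent_f * tokens_per_frame
print(f"sequence length L ~= {L:,}")              # ~118,800 tokens, i.e. ~100K

# Quadratic attention cost: halving L cuts attention FLOPs to a quarter.
print(f"relative cost at 0.5*L: {(0.5 * L) ** 2 / L ** 2:.2f}")
```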
Limitations of Prior Work¶
Local sparse methods (e.g., STA): These exploit the concentration of attention scores within local neighborhoods to restrict the interaction range, but they perform poorly on long-range attention heads: although the nearest 4% of keys contribute over 80% of the attention weight, the remaining 96% still matter during early sampling steps.
Online analysis methods (e.g., ARnR): These dynamically detect sparsity patterns but introduce \(\mathcal{O}(L^2)\) similarity computation overhead and require re-tuning when sampling configurations change.
Key Challenge: VDiT simultaneously exhibits both local and long-range attention patterns, which dynamically switch throughout the sampling process, making simple static strategies insufficient.
Three Categories of Attention in VDiT¶
The authors classify attention in VDiT into three types:

- Local attention: Focuses on short-range interactions, responsible for fine-grained details.
- Long-range attention: Distributed across the entire sequence, capturing high-level semantics (layout and motion), with high tolerance for minor perturbations (e.g., merging similar tokens).
- Critical attention: Maintains both global awareness and local detail simultaneously, and is highly sensitive to perturbations.
Key finding: Long-range attention heads exhibit high intra-sequence redundancy — tokens are highly similar, and a small number of representative tokens suffice to summarize their information.
Method¶
Overall Architecture¶
VORTA consists of two core components: (1) sparse attention variants tailored to different attention types; and (2) a signal-to-noise ratio-based routing mechanism that adaptively selects the optimal sparse strategy for each attention head.
Key Designs¶
1. Sliding Window Attention (for Local Attention)¶
A 3D Sliding Tile Attention is adopted, converting jagged attention masks into block-dense masks to improve GPU hardware efficiency. Each query at position \(i\) attends only to keys within a fixed window centered at the tile center \(\tau(i) = \lfloor i/t \rfloor \cdot t + \lceil t/2 \rceil\), where \(t\) is the tile size. The window size is set to \((18, 27, 24)\).
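A minimal 1D sketch of the tile-centered windowing idea is given below; the exact mask condition and the per-axis decomposition are simplifications for illustration, not the actual 3D STA kernel.

```python
import torch

def tile_window_mask_1d(n: int, tile: int, window: int) -> torch.Tensor:
    """Boolean (n x n) mask: query i may attend to key j iff j falls inside a
    window of size `window` centered on the query's tile center tau(i).
    Simplified 1D illustration; the actual STA kernel operates on 3D
    (t, h, w) tiles for hardware efficiency."""
    idx = torch.arange(n)
    tau = (idx // tile) * tile + (tile + 1) // 2   # tau(i) = floor(i/t)*t + ceil(t/2)
    dist = (tau[:, None] - idx[None, :]).abs()     # |tau(i) - j|
    return dist <= window // 2

# 1D example; a full 3D mask would be the logical AND of per-axis masks,
# with window sizes (18, 27, 24) along (t, h, w).
print(tile_window_mask_1d(8, tile=2, window=4).int())
```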
2. Bucketed Coreset Attention (for Long-Range Attention)¶
Core Idea: For long-range attention heads, redundant tokens are first pruned via coreset selection, followed by attention computation on the compressed sequence.
Bucketed Coreset Selection (BCS): Tokens are partitioned into buckets of size \((t, h, w) = (2, 3, 2)\). Within each bucket, the similarity between the center token and its neighbors is computed, and the top-\(k\) most similar tokens are pruned (coreset ratio \(r_{\text{core}} = 0.5\)).
Complexity advantage: BCS requires only \(\mathcal{O}(L)\) complexity (\(\mathcal{O}(thw)\) per bucket × \(L/(thw)\) buckets), in contrast to \(\mathcal{O}(L^2)\) for global pairwise methods. Retaining 50% of tokens reduces attention computation to 25% of the original (due to the quadratic relationship).
Compared to standard average pooling, BCS offers a critical advantage in selectivity: when neighboring tokens differ significantly, simple averaging causes information loss leading to mosaic artifacts or blurring, whereas BCS preserves diversity by pruning the most similar tokens.
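A minimal sketch of the bucketed selection idea follows, using 1D buckets and cosine similarity for illustration; the real method operates on 3D \((2, 3, 2)\) buckets inside the attention layers, and the handling of the center token here is simplified.

```python
import torch

def bucketed_coreset_select(x: torch.Tensor, bucket: int = 12,
                            r_core: float = 0.5) -> torch.Tensor:
    """Simplified 1D sketch of bucketed coreset selection (BCS).
    x: (L, d) token features with L divisible by `bucket`.
    Within each bucket, the tokens most similar to the bucket-center token
    are pruned; the remaining r_core fraction is kept, preserving diversity."""
    L, d = x.shape
    xb = x.view(L // bucket, bucket, d)                       # (B, bucket, d)
    center = xb[:, bucket // 2, :]                            # bucket centers
    sim = torch.cosine_similarity(xb, center[:, None, :], dim=-1)  # (B, bucket)
    keep = int(bucket * r_core)
    # Keep the `keep` tokens LEAST similar to the center, i.e. prune the
    # most redundant ones (the center itself is not treated specially here).
    idx = sim.argsort(dim=-1)[:, :keep]                       # (B, keep)
    coreset = torch.gather(xb, 1, idx[..., None].expand(-1, -1, d))
    return coreset.reshape(-1, d)                             # (L * r_core, d)

x = torch.randn(120, 64)
print(bucketed_coreset_select(x).shape)   # torch.Size([60, 64])
```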
3. Signal-Aware Attention Routing¶
Design Motivation: Attention behavior is strongly correlated with the signal-to-noise ratio of input features — long-range attention dominates in early steps (constructing high-level semantics), while local attention prevails in later steps (refining details).
The router consists of one linear layer per attention layer, which takes the diffusion timestep embedding \(\mathbf{T}\) as input and produces a gating value for each sparse attention branch.
During inference, hard selection is applied: each attention head uses the branch with the highest gating value.
This adds only 0.1% additional parameters and introduces no inference-time overhead. In practice, the full attention branch is selected in only approximately 0.2% of cases.
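A minimal sketch of such a router is shown below; the softmax gating and the per-head branch layout are assumptions, and only the "linear layer on the timestep embedding, argmax at inference" behavior is taken from the description above.

```python
import torch
import torch.nn as nn

class AttentionRouter(nn.Module):
    """Sketch of a signal-aware router: one linear layer per attention layer
    mapping the timestep embedding to a gating value per branch
    (e.g. sliding-window, coreset, full attention)."""

    def __init__(self, embed_dim: int, num_heads: int, num_branches: int = 3):
        super().__init__()
        self.proj = nn.Linear(embed_dim, num_heads * num_branches)
        self.num_heads, self.num_branches = num_heads, num_branches

    def forward(self, t_emb: torch.Tensor) -> torch.Tensor:
        logits = self.proj(t_emb).view(-1, self.num_heads, self.num_branches)
        return logits.softmax(dim=-1)          # soft gates (used during training)

    @torch.no_grad()
    def route(self, t_emb: torch.Tensor) -> torch.Tensor:
        # Hard selection at inference: pick the branch with the largest gate.
        return self.forward(t_emb).argmax(dim=-1)   # (batch, num_heads)

router = AttentionRouter(embed_dim=256, num_heads=24)
t_emb = torch.randn(1, 256)                    # diffusion timestep embedding
print(router.route(t_emb))                     # branch index per attention head
```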
Loss & Training¶
The router is trained using a self-supervised distillation strategy, with the original VDiT parameters frozen and only the router weights updated. The objective combines a distillation term, weighted by \(\lambda_{\text{distill}} = 20\), with an \(L_2\) regularization term, weighted by \(\lambda_{\text{reg}} = 0.02\), that encourages sparse selection. Training requires only 100 steps on the Mixkit dataset and completes in approximately one day on 2× H100 GPUs.
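A heavily hedged sketch of what the training objective could look like: the MSE distillation target (the frozen full-attention output) and the exact quantity the \(L_2\) penalty acts on are assumptions here; only the weights 20 and 0.02 come from the description above.

```python
import torch
import torch.nn.functional as F

def router_training_loss(routed_out, full_out, gates,
                         lambda_distill: float = 20.0,
                         lambda_reg: float = 0.02):
    """Hedged sketch of the router objective (VDiT frozen, router updated).
    Assumptions: the distillation term is an MSE between the soft-routed
    output and the frozen full-attention output, and the L2 term penalizes
    the gate of the expensive full-attention branch so that cheaper sparse
    branches are preferred."""
    distill = F.mse_loss(routed_out, full_out)
    full_branch_gate = gates[..., -1]          # assume last branch = full attention
    reg = full_branch_gate.pow(2).mean()       # L2 penalty on that gate
    return lambda_distill * distill + lambda_reg * reg

loss = router_training_loss(torch.randn(2, 1024, 128),
                            torch.randn(2, 1024, 128),
                            torch.rand(2, 24, 3))
print(float(loss))
```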
Key Experimental Results¶
Main Results on HunyuanVideo¶
| Method | Type | VBench↑ | LPIPS↓ | Latency (s) | Speedup | Memory (GB) |
|---|---|---|---|---|---|---|
| HunyuanVideo | - | 82.26 | - | 1043.85 | 1.00× | 47.64 |
| + ARnR | Sparse | 82.39 | 0.211 | 790.55 | 1.32× | 78.15 |
| + STA | Sparse | 82.33 | 0.201 | 676.39 | 1.54× | 51.79 |
| + PAB | Cache | 82.40 | 0.186 | 815.51 | 1.28× | >80 |
| + VORTA | Sparse | 82.59 | 0.185 | 594.23 | 1.76× | 51.15 |
| + VORTA & PAB | Combined | 82.56 | 0.195 | 444.19 | 2.35× | >80 |
| + PCD | Distill | 81.17 | 0.564 | 125.98 | 8.29× | 47.64 |
| + VORTA & PCD | Combined | 81.49 | 0.575 | 72.46 | 14.41× | 51.15 |
Ablation Study (Wan 2.1 1.3B, 480p)¶
| Configuration | VBench↑ | Latency (s) | Speedup |
|---|---|---|---|
| Wan 2.1 baseline | 81.20 | 73.24 | 1.00× |
| w/o sliding attention | 80.25 | 65.14 | 1.12× |
| w/o coreset attention | 79.89 | 66.10 | 1.11× |
| w/o full attention (fallback branch removed) | 77.14 | 59.34 | 1.23× |
| w/o timestep conditioning | 81.03 | 65.00 | 1.13× |
| Average pooling AP(2,1,1) | 77.08 | 57.53 | 1.27× |
| Average pooling AP(1,2,1) | 76.01 | 57.64 | 1.27× |
| VORTA (full) | 81.06 | 58.42 | 1.25× |
Key Findings¶
- VORTA slightly outperforms the original model on VBench (82.59 vs. 82.26), possibly due to a regularization effect from pruning redundant attention.
- Removing the full attention branch causes a 4-point VBench drop with no additional speedup, confirming the existence of critical attention heads.
- Removing timestep conditioning causes the router to uniformly select the same branch across all steps, resulting in degraded performance and reduced speedup.
- BCS significantly outperforms simple average pooling (81.06 vs. 75.94–77.08), validating the necessity of selective pruning.
- Routing patterns exhibit clear temporal regularity: coreset attention dominates in early steps, transitioning to sliding attention in later steps.
Highlights & Insights¶
- Theoretically grounded design: The three-category attention taxonomy provides a clear theoretical basis for assigning distinct sparse strategies.
- Linear complexity of BCS: The bucketed strategy reduces coreset selection from \(\mathcal{O}(L^2)\) to \(\mathcal{O}(L)\), which is key to practical acceleration.
- Strong composability: VORTA is orthogonal to caching (PAB) and distillation (PCD) methods, achieving 14.41× acceleration when combined.
- Scheduler generalization: Adapts to different ODE solvers and step counts without re-analysis.
- Backbone generalization: Validated on both MMDiT (HunyuanVideo) and DiT (Wan 2.1) architectures.
Limitations & Future Work¶
- Primarily targets attention acceleration; limited speedup for short sequences (images or low-resolution video).
- Only supports bidirectional generation paradigms; autoregressive video generation requires substantial adaptation.
- When the pretrained model itself produces artifacts (e.g., distortions or physics violations), VORTA inherits and may amplify these issues.
- Router training requires approximately one day, which, while far less than pretraining, is still non-trivial.
- The coreset ratio is fixed at 50%; adaptive ratios could further optimize the efficiency–quality trade-off.
Related Work & Insights¶
- Sparse attention: STA (Sliding Tile Attention), ARnR (online analysis), SVG
- Video diffusion acceleration: PAB (feature caching), PCD (consistency distillation)
- Conditional computation: MoE (Mixture of Experts), BlockDrop
- Video generation: HunyuanVideo, Wan 2.1, CogVideoX
- Insights: The routing mechanism design can be generalized to other conditional computation scenarios; BCS's linear-complexity coreset selection is applicable to long-context LLM inference.
Rating¶
- Novelty: ⭐⭐⭐⭐☆ — The combination of coreset attention and signal-aware routing is a novel design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across multiple backbones, schedulers, and combinations, with detailed runtime analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ — Taxonomically clear with excellent figure design.
- Value: ⭐⭐⭐⭐⭐ — Significant practical speedup with open-sourced code and weights.