VSA: Faster Video Diffusion with Trainable Sparse Attention¶
Conference: NeurIPS 2025 arXiv: 2505.13389 Code: https://github.com/hao-ai-lab/FastVideo Area: Video Generation / Attention Acceleration Keywords: Sparse Attention, Video Diffusion Transformer, End-to-End Training, Key Token Prediction, Hardware Alignment
TL;DR¶
This paper proposes VSA (Video Sparse Attention), an end-to-end trainable, hardware-aligned sparse attention mechanism with a hierarchical coarse-fine design: a coarse-grained stage predicts key token positions via cube pooling, and a fine-grained stage performs token-level attention within the predicted block-sparse regions. VSA accelerates both training and inference of video DiTs: pretraining from scratch achieves a 2.53× reduction in training FLOPs without quality loss, while adapting Wan2.1-1.3B yields a 6× attention speedup and cuts end-to-end inference time from 31s to 18s.
Background & Motivation¶
Video Diffusion Transformers (DiTs) face severe computational bottlenecks: a mere 5-second 720p video unfolds to over 100K tokens in latent space, and the quadratic-complexity 3D attention dominates the computation. Although prior work has shown that DiT attention matrices are naturally sparse (most entries near zero), existing methods almost universally treat sparsity as a post-hoc inference-time optimization—training with full attention and then substituting fixed or profiled sparse masks at inference.
This "dense-then-sparse" paradigm suffers from two fundamental problems:
Unchanged training cost: The vast majority of training computation cannot benefit from sparsity.
Train-test mismatch: Model parameters are learned under dense context but evaluated under sparse context—the quality ceiling is locked to the dense model, and quality degrades noticeably as sparsity increases.
The core challenge is a chicken-and-egg problem: accurately identifying key tokens requires computing the full attention matrix, which defeats the purpose of sparse savings. Conversely, cheap heuristics may miss high-weight regions. Furthermore, any sparse implementation must conform to GPU block-sparse computation constraints; otherwise, theoretical savings cannot translate into actual speedups.
The core idea of VSA is: use a learnable lightweight coarse-grained attention to predict key token positions, then execute fine-grained token-level attention within the predicted block-sparse regions, with both stages trained jointly end-to-end.
Method¶
Overall Architecture¶
VSA adopts a hierarchical Coarse-Fine two-stage attention design. Video latents are partitioned into cubes of size \((4,4,4)\), with each cube mapped to one tile on a GPU SM (block size = 64). The coarse-grained stage performs low-cost full attention at the cube level to predict key positions; the fine-grained stage executes token-level block-sparse attention only over the selected cubes. The outputs of both stages are fused via learnable gating vectors to produce the final output.
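To make the tile mapping concrete, here is a minimal PyTorch sketch (not the authors' code; the `cube_partition` helper and shapes are illustrative) of how a video latent is rearranged into \((4,4,4)\) cubes, so that each cube supplies one 64-token tile and one mean-pooled coarse token:

```python
import torch

def cube_partition(x: torch.Tensor, cube: int = 4) -> torch.Tensor:
    """Rearrange a (B, T, H, W, C) latent into (B, num_cubes, cube**3, C)."""
    B, T, H, W, C = x.shape
    assert T % cube == 0 and H % cube == 0 and W % cube == 0
    x = x.view(B, T // cube, cube, H // cube, cube, W // cube, cube, C)
    # Bring the three intra-cube axes together: (B, nT, nH, nW, c, c, c, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    return x.view(B, -1, cube ** 3, C)  # 64 tokens per cube

latent = torch.randn(1, 8, 16, 16, 64)   # (B, T, H, W, C) video latent
cubes = cube_partition(latent)           # (1, 32, 64, 64): 32 cubes of 64 tokens
coarse_tokens = cubes.mean(dim=2)        # cube-level tokens for the coarse stage
```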
Key Designs¶
- Coarse Stage: Tokens within each \((4,4,4)\) cube are mean-pooled to yield cube-level representations \(Q_c, K_c, V_c\), compressing the sequence by 64×. Dense attention is computed over this short sequence to obtain cube-level attention scores \(A_c\) and output \(O_c\). A Top-K selection (default \(K=32\)) is then applied row-wise over \(A_c\) to generate the block-sparse mask \(M\). Since the coarse stage operates on a 64× compressed sequence, its FLOPs account for less than 0.2% of total attention. The key innovation is that the coarse-stage output \(O_c\) not only guides sparse selection but directly participates in the final output; experiments confirm this is critical for preserving global context (see the sketch after this list).
- Fine Stage: Using the block-sparse mask \(M\) generated by the coarse stage, sparse attention is computed over the original token-level \(Q, K, V\), restricted to blocks corresponding to selected cubes. Because the mask is inherently block-structured (derived from cube-level operations), it aligns perfectly with FlashAttention's block-sparse computation format without requiring additional mask-to-index conversion.
- Gating Fusion: Gating vectors \(G_c\) and \(G_f\) are obtained by projecting the input hidden states through linear layers. The final output is \(O = O_c \odot G_c + O_f \odot G_f\), allowing the model to dynamically balance global overview and local detail across different heads and layers.
- Tile Size Trade-off: Small tiles (e.g., 16) allow finer sparse granularity but yield low GPU throughput; large tiles (e.g., 256) have higher arithmetic intensity but coarser sparsity patterns. Systematic experiments show that tile=64 (i.e., cube=\((4,4,4)\)) achieves the best balance between quality and efficiency.
- Sparse Adaptation and Distillation: Converting a pretrained full-attention model to VSA employs a progressive annealing strategy: initially the \(G_c\) weights are set to zero and \(G_f\) is removed (equivalent to full attention), then \(K\) is gradually reduced to the target sparsity level. The paper also presents the first demonstration of sparse attention's compatibility with distillation (Sparse-Distill), where the student model serves simultaneously as a few-step and sparse generator, achieving a 50.9× speedup.
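The following is a minimal, single-head PyTorch sketch of the coarse-to-fine pipeline under stated assumptions: the block-sparse fine stage is emulated with a dense mask (the paper's implementation uses a ThunderKittens block-sparse kernel), the coarse output is broadcast back to the 64 tokens of its cube before fusion, and the gates `g_c`, `g_f` are assumed to be precomputed linear projections of the hidden states:

```python
import torch

def vsa_attention(q, k, v, g_c, g_f, cube=64, topk=32):
    """Single-head VSA sketch. q, k, v, g_c, g_f: (L, D), L divisible by `cube`."""
    L, D = q.shape
    n = L // cube

    # Coarse stage: mean-pool each 64-token cube and attend at cube level.
    qc = q.view(n, cube, D).mean(1)
    kc = k.view(n, cube, D).mean(1)
    vc = v.view(n, cube, D).mean(1)
    a_c = torch.softmax(qc @ kc.T / D ** 0.5, dim=-1)   # (n, n) cube-level scores
    o_c = a_c @ vc                                      # coarse output O_c

    # Row-wise Top-K over cube scores -> block-sparse mask M.
    idx = a_c.topk(min(topk, n), dim=-1).indices
    block_mask = torch.zeros(n, n, dtype=torch.bool, device=q.device)
    block_mask.scatter_(1, idx, True)

    # Fine stage: token-level attention restricted to selected cubes
    # (dense emulation of the block-sparse kernel).
    token_mask = block_mask.repeat_interleave(cube, 0).repeat_interleave(cube, 1)
    scores = (q @ k.T / D ** 0.5).masked_fill(~token_mask, float("-inf"))
    o_f = torch.softmax(scores, dim=-1) @ v             # fine output O_f

    # Gated fusion: O = O_c * G_c + O_f * G_f (coarse output broadcast per cube).
    return o_c.repeat_interleave(cube, 0) * g_c + o_f * g_f
```

Every query row keeps at least \(K\) cubes, so the masked softmax is always well defined; block-level sparsity is \(1 - K/n\), which gives 87.5% for \(K=32\) over the 256 cubes of a 16K-token sequence.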
Loss & Training¶
VSA is trained end-to-end with the standard Flow Matching loss. On the kernel side, the fine-grained stage uses a block-sparse attention kernel implemented in ThunderKittens (reaching 85% of FlashAttention-3's MFU at tile=64), while the coarse-grained stage fuses softmax, Top-K selection, and mask-to-index conversion into a single kernel. Sparse adaptation requires approximately 4,000 fine-tuning steps (learning rate 1e-5). Pretraining experiments total approximately 90k H200 GPU hours.
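Since VSA plugs into an otherwise standard diffusion training loop, the objective itself is unchanged. A minimal sketch of one flow-matching training step, assuming a common rectified-flow parameterization (the timestep distribution, velocity-target convention, and `model` signature here are assumptions, not details from the paper):

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, cond):
    """x0: clean video latents (B, L, D); `model` predicts the velocity field."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, 1, 1)  # uniform timesteps in [0, 1]
    noise = torch.randn_like(x0)
    xt = (1 - t) * x0 + t * noise                      # linear noise-data interpolation
    target = noise - x0                                # velocity along the path
    pred = model(xt, t.flatten(), cond)                # DiT with VSA attention inside
    return F.mse_loss(pred, target)
```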
Key Experimental Results¶
Main Results¶
Scaling experiments for pretraining from scratch (60M–1.4B parameters, up to \(4\times10^{21}\) FLOPs):
| Model Scale | Sequence Length | Method | Attention FLOPs Reduction | Training FLOPs Reduction | Loss Difference |
|---|---|---|---|---|---|
| 120M | 16K | Full Attention | — | — | baseline |
| 120M | 16K | VSA (87.5% sparse) | ~8× | 2.53× | negligible |
| 410M | 16K | Full Attention | — | — | baseline |
| 410M | 16K | VSA (87.5% sparse) | ~8× | 2.53× | negligible |
| 60M–1.4B | 16K | VSA vs Full | — | 2.53× | consistently Pareto-superior |
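Taken at face value, the table's two headline numbers pin down how attention-heavy these runs are: 87.5% sparsity cuts attention FLOPs by \(1/(1-0.875)=8\times\), and for that to compress total training FLOPs by 2.53×, attention must occupy a fraction \(f\) of dense compute satisfying (a back-of-envelope reading of the table, not a figure stated in the paper):

\[
\frac{1}{(1-f)+f/8} = 2.53 \;\Rightarrow\; f \approx 0.69,
\]

i.e., roughly 70% of training FLOPs at 16K tokens go to attention, which is why sparsifying it translates so directly into end-to-end savings.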
Wan2.1 adaptation experiments:
| Model | Method | VBench Quality↑ | VBench Semantic↑ | VBench Total↑ | Inference Time |
|---|---|---|---|---|---|
| Wan-1.3B | Original (full attn) | 83.71% | 77.98% | 82.56% | 31s |
| Wan-1.3B | Full finetuned | 84.07% | 81.85% | 83.63% | 31s |
| Wan-1.3B | VSA finetuned | 83.60% | 79.47% | 82.77% | 18s |
| Wan-14B | Original | — | — | baseline | 1274s |
| Wan-14B | VSA | — | — | on par (human eval) | 576s |
Ablation Study¶
| Experiment | Configuration | Loss | Notes |
|---|---|---|---|
| Exp 1 | Compress KV | 0.14282 | KV pooling only; too coarse |
| Exp 2 | Spatial-Temporal | 0.13034 | Classic alternating attention; underperforms Full after extended training |
| Exp 5 | Full Attention | 0.12703 | Baseline |
| Exp 6 | VSA | 0.12687 | Surpasses full attention |
| Exp 7 | Fixed Local Pattern | 0.13330 | Fixed patterns inferior to data-dependent selection |
| Exp 8 | Fine only (no \(O_c\)) | 0.13296 | Missing coarse-stage global information |
| Exp 10 | Coarse + Fine | 0.13162 | Minimal design achieves best result |
| Exp 11 | C + F + Local | 0.13194 | Additional local attention yields no gain |
| Exp 17 | tile=64 | 0.13162 | Best efficiency–quality balance |
| Exp 18 | tile=16 | 0.13155 | Marginally better quality but 2.26× slower |
Key Findings¶
- Trainable sparsity can surpass full attention: After extended training, VSA achieves a strictly better Pareto frontier than full attention under the same FLOPs budget—the first rigorous scaling-law verification of this property on DiTs.
- Data-driven dynamic sparsity is far superior to fixed patterns: Fixed patterns such as Spatial-Temporal and sliding window appear competitive under a compute-optimal budget but are overtaken by full attention after extended training; only VSA consistently maintains its advantage.
- Global information is necessary; local priors are not: The coarse-stage output \(O_c\) is critical to performance, while additional local window attention provides no further benefit.
- Key token prediction accuracy is high: Top-32 selection covers 60%–90% of total attention weights across most layers and timesteps, versus a random baseline of only 8% (a measurement sketch follows this list).
- Attention patterns are highly dynamic: Visualizations reveal that different heads in the same layer exhibit drastically different patterns—some global, some local, some mixed—further confirming the infeasibility of fixed patterns.
- The optimal Top-K depends on sequence length and training budget—more training compute requires a higher \(K\), suggesting a "sparse-dimension scaling law" worthy of further investigation.
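For the coverage finding above, here is a hedged sketch of how such a recall number can be computed, assuming a dense attention map is available for diagnostics and reusing the `block_mask` shape from the earlier sketch (an illustration of the metric, not the authors' evaluation code):

```python
import torch

def topk_coverage(attn: torch.Tensor, block_mask: torch.Tensor, cube: int = 64) -> float:
    """attn: dense token-level attention (L, L), rows sum to 1;
    block_mask: (n, n) boolean Top-K cube selection, with L = n * cube."""
    token_mask = block_mask.repeat_interleave(cube, 0).repeat_interleave(cube, 1)
    covered = (attn * token_mask).sum(dim=-1)  # attention mass inside selected cubes
    return covered.mean().item()               # average recall over query tokens
```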
Highlights & Insights¶
- The first work to demonstrate via rigorous scaling experiments on DiTs that trainable sparse attention outperforms full attention—a significant result that may reshape the default design of video DiTs.
- The coarse-predict-then-fine-execute two-stage design resolves the chicken-and-egg problem: key tokens can be identified without computing the full attention matrix.
- The coarse-stage output directly contributing to the final output is a critical design choice: this distinguishes VSA from methods such as MoBA and BiFormer, which use the coarse stage only as an index guide.
- Pioneer of Sparse-Distill: the first demonstration that sparse attention is compatible with distillation, simultaneously enabling few-step and sparse acceleration to achieve a 50.9× speedup.
- Thorough ablation design: 90k H200 GPU hours of systematic experiments cover all design parameters including tile size, pooling strategy, local priors, and sparsity level.
Limitations & Future Work¶
- The cube size is fixed at \((4,4,4)\), requiring each spatial-temporal dimension of the video latent to be divisible by 4, which constrains the set of compatible resolutions.
- Determining the optimal sparsity level (Top-K value) remains an open problem and may require treating sparsity as an additional dimension in scaling laws.
- Although the coarse-stage FLOPs are negligible, its latency still accounts for 14% of total time on short sequences, leaving room for further kernel optimization.
- Long-sequence scaling experiments are limited—pretraining was conducted on sequences of at most 16K tokens, and behavior at longer sequences remains unexplored.
- Layer-wise and timestep-wise adaptive Top-K selection is an explicit direction for future improvement.
Related Work & Insights¶
- NSA and MoBA pioneered trainable sparse attention in LLMs; VSA adapts this paradigm to bidirectional 3D video attention.
- STA (Sliding Tile Attention) and Sparge Attention are inference-time post-processing methods; VSA demonstrates the fundamental advantage of introducing sparsity during training.
- DSV also explores training-time sparsity but uses a multi-stage profiler design; VSA's end-to-end training approach is more straightforward.
- Core insight: the attention bottleneck in video DiTs is more severe than in LLMs (due to longer sequences and dense computation throughout both training and inference), making native sparse design even more urgent.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐