LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
Conference: CVPR 2026
arXiv: 2510.08318
Code: None (but based on the open-source Wan models)
Area: Video Generation / Efficient Inference / Linear Attention
Keywords: Video Diffusion Models, Linear Attention, Post-Training, Selective Transfer, Distribution Matching
TL;DR
LinVideo is the first data-free post-training framework that automatically identifies which layers are most amenable to linear attention substitution via selective transfer, and recovers model performance through an Arbitrary-timestep Distribution Matching (ADM) objective. It achieves 1.43–1.71× lossless speedup on Wan 1.3B/14B, and up to 15.9–20.9× speedup when combined with 4-step distillation.
Background & Motivation
The inference bottleneck of video diffusion models (Wan, CogVideoX, Sora) lies in the \(O(n^2)\) complexity of self-attention — a 10-second video typically produces >50K tokens. Attention sparsification helps but generally reduces computation by less than 50%. Linear attention reduces complexity to \(O(n)\), but full replacement requires expensive pretraining (e.g., SANA-Video). The root cause is that linear attention lacks the expressiveness to capture complex spatiotemporal dynamics in video; naively replacing all layers and fine-tuning yields poor results.
Core Problem
Can one efficiently replace as many softmax attention layers as possible with linear attention via data-free post-training, while preserving video generation quality?
Method
Overall Architecture
Two core technical contributions: (1) Selective Transfer: a learnable binary classifier automatically determines whether each layer uses softmax or linear attention; (2) Arbitrary-timestep Distribution Matching (ADM): aligns the sample distributions of the original and linearized models at every timestep along the sampling trajectory, rather than only at the final output. The entire process is data-free — training data is generated by sampling from the original model itself.
Key Designs
- Selective Transfer: Different layers exhibit vastly different substitutability (shallow layers are generally easier to replace, but certain critical layers such as the first layer cannot be replaced). A learnable scalar \(r \in [0,1]\) is introduced per layer to interpolate attention outputs: \(o_i = r \cdot \text{softmax-attn} + (1-r) \cdot \text{linear-attn}\). A constraint loss \(\mathcal{L}_{con}\) enforces that the number of replaced layers equals the target, while a regularization loss \(\mathcal{L}_{reg} = \sum(1-|2r-1|^\alpha)\) drives \(r\) toward 0 or 1 to avoid rounding errors. \(\alpha\) is annealed from large to small, allowing free exploration early in training and enforcing binarization later (a PyTorch sketch follows this list).
- Arbitrary-timestep Distribution Matching (ADM): Naive MSE matching causes temporal jitter, and few-step distillation methods (e.g., DMD) only match the distribution at \(t=0\), ignoring intermediate timesteps. ADM minimizes \(KL(q_t \| p_t)\) at every timestep \(t\) along the sampling trajectory. The key innovation is that the model being trained, \(\hat{u}_\theta\), estimates its own score function \(\hat{s}_t\) analytically (since it is itself a multi-step flow model), eliminating the need to train a separate score model and reducing training cost by ~4.4× compared to DMD. The score difference admits a clean closed form, \(s_t - \hat{s}_t = -\frac{1-t}{t}(u_\theta - \hat{u}_\theta)\) (derivation sketched after this list).
- Hedgehog Linear Attention Kernel: A softmax-mimicking kernel \(\phi(q) = \text{softmax}(qW) \oplus \text{softmax}(-qW)\), with \(\oplus\) denoting concatenation, is adopted. It preserves the peaked weight distribution and dot-product monotonicity of softmax, outperforming the ReLU kernel by over 2 points and the Taylor-expansion kernel by a smaller margin on VBench (a kernel sketch also follows this list).
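A minimal PyTorch sketch of selective transfer, assuming a sigmoid-parameterized \(r\) per layer and treating the two attention outputs as given; all names (`SelectiveAttn`, `selection_losses`, `target_linear`) are illustrative, since the official code is unreleased.

```python
import torch
import torch.nn as nn

class SelectiveAttn(nn.Module):
    """Interpolate softmax and linear attention with a learnable scalar r."""
    def __init__(self):
        super().__init__()
        self.r_logit = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5

    @property
    def r(self):
        return torch.sigmoid(self.r_logit)  # keeps r in (0, 1)

    def forward(self, softmax_out, linear_out):
        # o_i = r * softmax-attn + (1 - r) * linear-attn
        return self.r * softmax_out + (1.0 - self.r) * linear_out

def selection_losses(rs, target_linear, alpha):
    """rs: per-layer r values in (0, 1), shape (num_layers,).
    L_con drives the soft count of replaced (linear) layers to the target;
    L_reg = sum(1 - |2r - 1|^alpha) pushes every r toward 0 or 1."""
    l_con = ((1.0 - rs).sum() - target_linear) ** 2
    l_reg = (1.0 - (2.0 * rs - 1.0).abs() ** alpha).sum()
    return l_con, l_reg
```

With a large \(\alpha\), the regularizer's gradient is nearly flat away from the extremes, so layers explore freely; as \(\alpha\) shrinks, the gradient around \(r=0.5\) steepens and pushes each \(r\) to a hard 0/1 decision, so the final rounding costs almost nothing.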
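The score-difference identity follows in a few lines from the flow parameterization. The sketch below assumes the rectified-flow convention \(x_t = (1-t)x_0 + t\epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\) and velocity target \(u = \epsilon - x_0\); the paper's time convention may differ.

```latex
% Conditionally, x_t | x_0 ~ N((1-t) x_0, t^2 I), so the marginal score is
s_t(x_t) = -\frac{\mathbb{E}[\epsilon \mid x_t]}{t}.
% Eliminating x_0 from u = \epsilon - x_0 via x_0 = (x_t - t\epsilon)/(1-t):
u = \frac{\epsilon - x_t}{1-t}
\quad\Longrightarrow\quad
\epsilon = x_t + (1-t)\,u.
% Hence, for the teacher's u_\theta and the student's \hat{u}_\theta:
s_t = -\frac{x_t + (1-t)\,u_\theta}{t}, \qquad
\hat{s}_t = -\frac{x_t + (1-t)\,\hat{u}_\theta}{t},
% and the x_t terms cancel in the difference:
s_t - \hat{s}_t = -\frac{1-t}{t}\,\bigl(u_\theta - \hat{u}_\theta\bigr).
```

This is why no auxiliary score network is needed: both scores are affine functions of velocity predictions the two models already produce.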
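A minimal sketch of a Hedgehog-style kernel and the resulting \(O(n)\) attention, assuming a single projection `W` shared between the two softmax branches; shapes are simplified to one head and the class name is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HedgehogLinearAttn(nn.Module):
    """Linear attention with feature map phi(x) = softmax(xW) ++ softmax(-xW)."""
    def __init__(self, dim, feat_dim):
        super().__init__()
        self.W = nn.Linear(dim, feat_dim, bias=False)

    def phi(self, x):
        # The paired softmaxes keep attention weights peaked and monotone
        # in the query-key dot product, mimicking softmax attention.
        h = self.W(x)
        return torch.cat([F.softmax(h, dim=-1), F.softmax(-h, dim=-1)], dim=-1)

    def forward(self, q, k, v):
        # q, k: (B, N, dim); v: (B, N, d_v). Cost is linear in N because
        # K^T V is aggregated once instead of forming the N x N matrix.
        q, k = self.phi(q), self.phi(k)                    # (B, N, 2*feat_dim)
        kv = torch.einsum('bnf,bnd->bfd', k, v)            # (B, 2*feat_dim, d_v)
        z = torch.einsum('bnf,bf->bn', q, k.sum(dim=1))    # per-token normalizer
        return torch.einsum('bnf,bfd->bnd', q, kv) / (z.unsqueeze(-1) + 1e-6)
```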
Loss & Training
\(\mathcal{L}_{total} = \mathcal{L}_{ADM} + \lambda(\mathcal{L}_{con} + \mathcal{L}_{reg})\). Data-free: 50K input–output pairs are sampled from the original model. Wan 1.3B is trained for 3K steps on 8×H100; 14B for 3K steps on 32×H100. Targets: 16/30 layers replaced for 1.3B; 22/40 for 14B.
Key Experimental Results
VBench 8-dimension (Wan 1.3B / 14B):
| Method | Latency (s, 1.3B / 14B) | Speedup (1.3B / 14B) | VBench score (approx., 1.3B / 14B) |
|---|---|---|---|
| FA2 (baseline) | 97.3 / 1931 | 1× | 67.6 / 67.9 |
| SVG (sparse) | 74.5 / 1203 | 1.31/1.61× | 67.2 / 67.0 |
| SVG2 | 84.9 / 1364 | 1.15/1.42× | 67.5 / 67.3 |
| LinVideo | 68.3 / 1127 | 1.43/1.71× | 67.6 / 67.7 |
| LinVideo+DMD2 (4-step) | 6.1 / 92.6 | 15.9/20.9× | 66.7 / 66.8 |
VBench-2.0 total score: LinVideo (56.74) = FA2 (56.74) >> SVG2 (55.81); the 4-step variant incurs only ~3% degradation.
CogVideoX-2B: Lossless 1.40× speedup (41.35→29.64s) with on-par VBench scores.
Ablation Study
- Target selection: Performance degrades slowly and smoothly up to a target of 18 replaced layers, then drops sharply beyond that.
- Selective transfer >> manual / heuristic: LinVideo's automatically selected layer combination outperforms manual selection by about 5% and heuristic selection by about 7%.
- ADM >> MSE >> DMD: Imaging Quality scores — ADM: 66.07, MSE: 61.56, DMD: 57.44.
- ADM without extra score model: Self-estimated score (66.07) ≈ separately trained score model (65.61), at 4.4× lower training cost.
- \(\mathcal{L}_{reg}\) is essential: Without regularization, \(r\) remains near 0.5; after rounding, performance collapses (IQ: 18.62).
- Hedgehog kernel is optimal: VBench 67.61 vs. Taylor 67.24 vs. ReLU 65.48.
Highlights & Insights
- Data-free post-training — requires no video dataset; training data is sampled from the model itself, avoiding data privacy and copyright concerns.
- Selective transfer reformulates layer selection as a differentiable continuous optimization problem, outperforming manual or heuristic search without human intervention.
- Using the model itself for score estimation in ADM is the key innovation, avoiding the expensive auxiliary model training required by DMD.
- The extreme 15.9–20.9× speedup (with 4-step distillation) demonstrates the synergistic power of combining linear attention with distillation.
- Validation on CogVideoX confirms that the framework is architecture-agnostic.
Limitations & Future Work
- Dedicated CUDA kernels are not yet used — the practical speedup of linear attention is constrained by generic PyTorch implementations.
- Combination with SLA (intra-layer mixed attention) is left unexplored; the two approaches are orthogonal and could be combined for further acceleration.
- Target selection still requires some trial and error — while performance is insensitive within a reasonable range, extreme values pose risks.
- Visual quality after 4-step distillation still degrades by ~1% — improved distillation methods may address this.
- Only Wan and CogVideoX are evaluated — performance on additional architectures (HunyuanVideo, Kling) remains to be verified.
Related Work & Insights
- vs. SVG/SVG2 (attention sparsification): Sparse methods skip only a subset of attention computations (typically retaining >50%), yielding limited speedup (1.31×). LinVideo directly reduces \(O(n^2)\) to \(O(n)\), achieving greater speedup (1.43–1.71×) with better quality.
- vs. LinGen/SANA-Video (pretrained linear attention): These methods require full pretraining at high cost. LinVideo requires only 3K post-training steps with no data.
- vs. SLA (intra-layer mixed attention): SLA mixes softmax and linear attention within each layer and requires specialized GPU kernels (RTX 5090 only). LinVideo operates at the inter-layer level with generic implementations; the two are orthogonal and composable.
- vs. DMD/DMD2 (distillation): DMD requires training an additional score model (5–10× cost). ADM uses the model's own score estimates, substantially reducing training cost.
- The "automatically select which layers to linearize" paradigm in LinVideo mirrors the "automatically select which experts to skip" idea in MoDES — both convert discrete selection into learnable continuous optimization.
- The ADM principle of "matching not only the final distribution but the full trajectory" generalizes to other generative model distillation settings, such as image diffusion model distillation.
Rating
- Novelty: ⭐⭐⭐⭐⭐ — Selective transfer, ADM, and data-free training are all novel contributions; their combination yields remarkable results.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Two model scales, VBench + VBench-2.0, multiple attention kernels, comprehensive ablations, and 4-step distillation combinations.
- Writing Quality: ⭐⭐⭐⭐⭐ — Mathematical derivations are clear (especially the score difference derivation in ADM); the motivation→observation→method→validation chain is complete.
- Value: ⭐⭐⭐⭐⭐ — Inference cost is the primary deployment bottleneck for video generation; a 15–20× speedup carries enormous industrial significance.