LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation

Conference: CVPR 2026 · arXiv: 2510.08318 · Code: none released (built on the open-source Wan models) · Area: Video Generation / Efficient Inference / Linear Attention · Keywords: Video Diffusion Models, Linear Attention, Post-Training, Selective Transfer, Distribution Matching

TL;DR

LinVideo is the first data-free post-training framework that automatically identifies which layers are most amenable to linear-attention substitution via selective transfer, and recovers model performance through an Arbitrary-timestep Distribution Matching (ADM) objective. It achieves a lossless 1.43–1.71× speedup on Wan 1.3B/14B, rising to 15.9–20.9× when combined with 4-step distillation.

Background & Motivation

The inference bottleneck of video diffusion models (Wan, CogVideoX, Sora) lies in the \(O(n^2)\) complexity of self-attention: a 10-second video typically produces >50K tokens. Attention sparsification helps but generally reduces computation by less than 50%. Linear attention reduces complexity to \(O(n)\), but replacing every layer requires expensive pretraining (e.g., SANA-Video), because linear attention lacks the expressiveness to capture complex spatiotemporal dynamics in video; naively replacing all layers and fine-tuning yields poor results.

Core Problem

Can one efficiently replace as many softmax attention layers as possible with linear attention via data-free post-training, while preserving video generation quality?

Method

Overall Architecture

Two core technical contributions: (1) Selective Transfer: a learnable binary classifier automatically determines whether each layer uses softmax or linear attention; (2) Arbitrary-timestep Distribution Matching (ADM): aligns the sample distributions of the original and linearized models at every timestep along the sampling trajectory, rather than only at the final output. The entire process is data-free — training data is generated by sampling from the original model itself.

Key Designs

  1. Selective Transfer: Different layers exhibit vastly different substitutability (shallow layers are generally easier to replace, but certain critical layers, such as the first, cannot be). A learnable scalar \(r \in [0,1]\) is introduced per layer to interpolate attention outputs: \(o_i = r \cdot \text{softmax-attn} + (1-r) \cdot \text{linear-attn}\). A constraint loss \(\mathcal{L}_{con}\) enforces that the number of replaced layers matches the target, while a regularization loss \(\mathcal{L}_{reg} = \sum_i \big(1-|2r_i-1|^\alpha\big)\) drives each \(r\) toward 0 or 1 to avoid rounding errors. \(\alpha\) is annealed from large to small, allowing free exploration early in training and enforcing binarization later (see the first sketch after this list).

  2. Arbitrary-timestep Distribution Matching (ADM): Naive MSE matching causes temporal jitter, and few-step distillation methods (e.g., DMD) only match the distribution at \(t=0\), ignoring intermediate timesteps. ADM instead minimizes \(KL(q_t \| p_t)\) at every timestep \(t\) along the sampling trajectory. The key innovation is that the model being trained, \(\hat{u}_\theta\), estimates its own score function \(\hat{s}_t\) analytically (since it is itself a multi-step flow model), eliminating the need to train a separate score model and reducing training cost by ~4.4× compared to DMD. The score difference reduces to \(s_t - \hat{s}_t = -\frac{1-t}{t}(u_\theta - \hat{u}_\theta)\) (rederived in the second sketch after this list).

  3. Hedgehog Linear Attention Kernel: A softmax-mimicking kernel \(\phi(q) = \text{softmax}(qW) \oplus \text{softmax}(-qW)\) is adopted. It preserves the peaked weight distribution and dot-product monotonicity of softmax, and it outperforms ReLU and Taylor-expansion kernels on VBench (67.61 vs. 65.48 and 67.24; see the ablations and the third sketch after this list).
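
To make the selective-transfer mechanism concrete, here is a minimal PyTorch-style sketch of the per-layer interpolation and the two auxiliary losses. The class name, the sigmoid parameterization of \(r\), and the squared-error form of \(\mathcal{L}_{con}\) are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class SelectiveAttention(nn.Module):
    """Interpolates softmax and linear attention with a learnable scalar r."""

    def __init__(self, softmax_attn, linear_attn):
        super().__init__()
        self.softmax_attn = softmax_attn
        self.linear_attn = linear_attn
        # r = sigmoid(logit) keeps r in [0, 1]; starts at 0.5 (undecided).
        self.logit = nn.Parameter(torch.zeros(()))

    def forward(self, x):
        r = torch.sigmoid(self.logit)
        return r * self.softmax_attn(x) + (1.0 - r) * self.linear_attn(x)

def auxiliary_losses(logits, target_replaced, alpha):
    """L_con pins the number of linearized layers (r -> 0) to the target;
    L_reg pushes every r toward {0, 1} so final rounding is lossless."""
    r = torch.sigmoid(torch.stack(logits))
    l_con = (torch.sum(1.0 - r) - target_replaced) ** 2
    l_reg = torch.sum(1.0 - torch.abs(2.0 * r - 1.0) ** alpha)
    return l_con, l_reg
```

During training, `alpha` would be annealed from large to small per the schedule above; after convergence, each \(r\) is rounded and the losing attention branch is dropped entirely at inference.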
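
The score-difference identity can be checked in a few lines. Assume, purely for illustration, the rectified-flow convention \(x_t = (1-t)x_0 + t\epsilon\) with velocity target \(u = \epsilon - x_0\) (the paper's exact notation may differ):

```latex
\begin{aligned}
p(x_t \mid x_0) &= \mathcal{N}\!\big((1-t)\,x_0,\; t^2 I\big)
  \;\Rightarrow\; s_t(x_t) = -\tfrac{1}{t}\,\mathbb{E}[\epsilon \mid x_t],\\[2pt]
u_\theta(x_t) &= \mathbb{E}[\epsilon \mid x_t] - \mathbb{E}[x_0 \mid x_t]
  = \tfrac{1}{1-t}\big(\mathbb{E}[\epsilon \mid x_t] - x_t\big)
  \;\Rightarrow\; \mathbb{E}[\epsilon \mid x_t] = (1-t)\,u_\theta + x_t,\\[2pt]
s_t - \hat{s}_t
  &= -\tfrac{1}{t}\big[(1-t)\,u_\theta + x_t\big]
   + \tfrac{1}{t}\big[(1-t)\,\hat{u}_\theta + x_t\big]
  = -\tfrac{1-t}{t}\,\big(u_\theta - \hat{u}_\theta\big).
\end{aligned}
```

The \(x_t\) terms cancel, so the score gap needed for distribution matching falls out of the two velocity predictions directly; no auxiliary score network is required.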
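
Finally, a sketch of the Hedgehog feature map and the resulting \(O(n)\) attention, assuming \(\oplus\) denotes concatenation along the feature dimension and the bidirectional (non-causal) attention used in video DiTs; shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def hedgehog_feature_map(x, W):
    """phi(x) = softmax(xW) concat softmax(-xW): non-negative, peaked like softmax."""
    proj = x @ W  # (..., n, d) -> (..., n, d)
    return torch.cat([F.softmax(proj, dim=-1), F.softmax(-proj, dim=-1)], dim=-1)

def hedgehog_linear_attention(q, k, v, W, eps=1e-6):
    """O(n) attention: associate phi(k)^T v first, instead of forming q k^T."""
    phi_q = hedgehog_feature_map(q, W)                     # (..., n, 2d)
    phi_k = hedgehog_feature_map(k, W)                     # (..., n, 2d)
    kv = torch.einsum('...nd,...ne->...de', phi_k, v)      # (..., 2d, e), O(n) in seq len
    normalizer = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-1, -2)
    return (phi_q @ kv) / (normalizer + eps)               # (..., n, e)
```

Because the \(\phi(k)^\top v\) summary has a fixed size independent of sequence length, cost grows linearly with token count instead of quadratically.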

Loss & Training

\(\mathcal{L}_{total} = \mathcal{L}_{ADM} + \lambda(\mathcal{L}_{con} + \mathcal{L}_{reg})\). Data-free: 50K input–output pairs are sampled from the original model. Wan 1.3B is trained for 3K steps on 8×H100; 14B for 3K steps on 32×H100. Targets: 16/30 layers replaced for 1.3B; 22/40 for 14B.
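
A hedged sketch of how one ADM step might use the identity above, with a DMD-style surrogate loss standing in for the paper's exact objective (`u_teacher`, `u_student`, and the surrogate form are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def adm_loss(u_teacher, u_student, x_t, t):
    """Distribution-matching loss at an arbitrary timestep t on the trajectory.

    x_t was produced by the student being trained, so gradients flow into it.
    Both scores come analytically from velocity predictions via
        s_t - s_hat_t = -((1 - t) / t) * (u_theta - u_hat_theta),
    so, unlike DMD, no auxiliary score network is trained.
    """
    with torch.no_grad():
        u_real = u_teacher(x_t, t)   # frozen original softmax-attention model
        u_fake = u_student(x_t, t)   # the student, reused as its own score model
        # Gradient of KL(q_t || p_t) w.r.t. x_t is s_hat_t - s_t:
        grad_kl = ((1.0 - t) / t) * (u_real - u_fake)
    # DMD-style surrogate: nudges x_t down the KL gradient.
    target = (x_t - grad_kl).detach()
    return 0.5 * F.mse_loss(x_t, target, reduction='mean')
```

Since training pairs are sampled from the original model, such a step can be applied at any \(t\) along the trajectory, which is what distinguishes ADM from objectives that match only the final output.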

Key Experimental Results

VBench results across 8 dimensions (each cell reports Wan 1.3B / Wan 14B):

| Method | Latency (s) | Speedup | VBench score (approx.) |
| --- | --- | --- | --- |
| FA2 (baseline) | 97.3 / 1931 | 1.00× / 1.00× | 67.6 / 67.9 |
| SVG (sparse) | 74.5 / 1203 | 1.31× / 1.61× | 67.2 / 67.0 |
| SVG2 | 84.9 / 1364 | 1.15× / 1.42× | 67.5 / 67.3 |
| LinVideo | 68.3 / 1127 | 1.43× / 1.71× | 67.6 / 67.7 |
| LinVideo + DMD2 (4-step) | 6.1 / 92.6 | 15.9× / 20.9× | 66.7 / 66.8 |

VBench-2.0 total score: LinVideo (56.74) matches FA2 (56.74) and clearly exceeds SVG2 (55.81); the 4-step variant incurs only ~3% degradation.

CogVideoX-2B: lossless 1.40× speedup (41.35 s → 29.64 s) with on-par VBench scores.

Ablation Study

  • Target selection: Performance degrades slowly and stably when target ≤ 18; degrades sharply beyond 18.
  • Selective transfer >> manual / heuristic: LinVideo's automatically selected layer combination outperforms manual and heuristic selection by roughly 5% and 7%, respectively.
  • ADM >> MSE >> DMD: Imaging Quality scores — ADM: 66.07, MSE: 61.56, DMD: 57.44.
  • ADM without extra score model: Self-estimated score (66.07) ≈ separately trained score model (65.61), at 4.4× lower training cost.
  • \(\mathcal{L}_{reg}\) is essential: Without regularization, \(r\) remains near 0.5; after rounding, performance collapses (IQ: 18.62).
  • Hedgehog kernel is optimal: VBench 67.61 vs. Taylor 67.24 vs. ReLU 65.48.

Highlights & Insights

  • Data-free post-training — requires no video dataset; training data is sampled from the model itself, avoiding data privacy and copyright concerns.
  • Selective transfer reformulates layer selection as a differentiable continuous optimization problem, outperforming manual or heuristic search without human intervention.
  • Using the model itself for score estimation in ADM is the key innovation, avoiding the expensive auxiliary model training required by DMD.
  • The extreme 15.9–20.9× speedup (with 4-step distillation) demonstrates the synergistic power of combining linear attention with distillation.
  • Validation on CogVideoX confirms that the framework is architecture-agnostic.

Limitations & Future Work

  • Dedicated CUDA kernels are not yet used — the practical speedup of linear attention is constrained by generic PyTorch implementations.
  • Orthogonal to SLA (intra-layer mixed attention) — the two approaches can be combined for further acceleration.
  • Target selection still requires some trial and error — while performance is insensitive within a reasonable range, extreme values pose risks.
  • Visual quality after 4-step distillation still degrades by ~1% — improved distillation methods may address this.
  • Only Wan and CogVideoX are evaluated — performance on additional architectures (HunyuanVideo, Kling) remains to be verified.

Comparison with Related Work

  • vs. SVG/SVG2 (attention sparsification): Sparse methods skip only a subset of attention computations (typically retaining >50%), yielding limited speedup (1.31×). LinVideo reduces \(O(n^2)\) to \(O(n)\) outright, achieving greater speedup (1.43–1.71×) with better quality.
  • vs. LinGen/SANA-Video (pretrained linear attention): These methods require full pretraining at high cost. LinVideo requires only 3K post-training steps with no data.
  • vs. SLA (intra-layer mixed attention): SLA mixes softmax and linear attention within each layer and requires specialized GPU kernels (RTX 5090 only). LinVideo operates at the inter-layer level with generic implementations; the two are orthogonal and composable.
  • vs. DMD/DMD2 (distillation): DMD requires training an additional score model (5–10× cost). ADM uses the model's own score estimates, substantially reducing training cost.
  • The "automatically select which layers to linearize" paradigm in LinVideo mirrors the "automatically select which experts to skip" idea in MoDES — both convert discrete selection into learnable continuous optimization.
  • The ADM principle of "matching not only the final distribution but the full trajectory" generalizes to other generative model distillation settings, such as image diffusion model distillation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Selective transfer, ADM, and data-free training are all novel contributions; their combination yields remarkable results.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Two model scales, VBench + VBench-2.0, multiple attention kernels, comprehensive ablations, and 4-step distillation combinations.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Mathematical derivations are clear (especially the score difference derivation in ADM); the motivation→observation→method→validation chain is complete.
  • Value: ⭐⭐⭐⭐⭐ — Inference cost is the primary deployment bottleneck for video generation; a 15–20× speedup carries enormous industrial significance.