LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation

Conference: CVPR 2026 · arXiv: 2510.08318 · Code: none released (built on the open-source Wan models) · Area: Video Generation / Efficient Inference / Linear Attention · Keywords: Video Diffusion Models, Linear Attention, Post-Training, Selective Transfer, Distribution Matching

TL;DR

LinVideo is the first data-free post-training framework that automatically identifies which layers are most amenable to linear-attention substitution via selective transfer, and recovers model performance through an Arbitrary-timestep Distribution Matching (ADM) objective. It achieves a lossless 1.43–1.71× speedup on Wan 1.3B/14B, rising to 15.9–20.9× when combined with 4-step distillation.

Background & Motivation

The inference bottleneck of video diffusion models (Wan, CogVideoX, Sora) lies in the \(O(n^2)\) complexity of self-attention: a 10-second video typically produces >50K tokens. Attention sparsification helps but generally reduces computation by less than 50%. Linear attention reduces complexity to \(O(n)\), but replacing every layer requires expensive pretraining (e.g., SANA-Video), because linear attention lacks the expressiveness to capture complex spatiotemporal dynamics in video; naively replacing all layers and fine-tuning yields poor results.

Core Problem

Can one efficiently replace as many softmax attention layers as possible with linear attention via data-free post-training, while preserving video generation quality?

Method

Overall Architecture

Two core technical contributions: (1) Selective Transfer: a learnable binary classifier automatically determines whether each layer uses softmax or linear attention; (2) Arbitrary-timestep Distribution Matching (ADM): aligns the sample distributions of the original and linearized models at every timestep along the sampling trajectory, rather than only at the final output. The entire process is data-free — training data is generated by sampling from the original model itself.

Key Designs

  1. Selective Transfer: Different layers exhibit vastly different substitutability (shallow layers are generally easier to replace, but certain critical layers, such as the first, cannot be). A learnable scalar \(r \in [0,1]\) is introduced per layer to interpolate attention outputs: \(o_i = r \cdot \text{softmax-attn} + (1-r) \cdot \text{linear-attn}\). A constraint loss \(\mathcal{L}_{con}\) enforces that the number of replaced layers matches the target, while a regularization loss \(\mathcal{L}_{reg} = \sum_i \big(1-|2r_i-1|^\alpha\big)\) drives each \(r\) toward 0 or 1 to avoid rounding errors. \(\alpha\) is annealed from large to small, allowing free exploration early in training and enforcing binarization later (see the first sketch after this list).

  2. Arbitrary-timestep Distribution Matching (ADM): Naive MSE matching causes temporal jitter, and few-step distillation methods (e.g., DMD) only match the distribution at \(t=0\), ignoring intermediate timesteps. ADM instead minimizes \(KL(q_t \| p_t)\) at every timestep \(t\) along the sampling trajectory. The key innovation is that the model being trained, \(\hat{u}_\theta\), estimates its own score function \(\hat{s}_t\) analytically (since it is itself a multi-step flow model), eliminating the need to train a separate score model and reducing training cost by ~4.4× compared to DMD. The score difference reduces to \(s_t - \hat{s}_t = -\frac{1-t}{t}(u_\theta - \hat{u}_\theta)\) (rederived in the second sketch after this list).

  3. Hedgehog Linear Attention Kernel: A softmax-mimicking kernel \(\phi(q) = \text{softmax}(qW) \oplus \text{softmax}(-qW)\) is adopted. It preserves the peaked weight distribution and dot-product monotonicity of softmax, and it outperforms ReLU and Taylor-expansion kernels on VBench (67.61 vs. 65.48 and 67.24; see the ablations and the third sketch after this list).
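
To make the selective-transfer mechanism concrete, here is a minimal PyTorch-style sketch of the per-layer interpolation and the two auxiliary losses. The class name, the sigmoid parameterization of \(r\), and the squared-error form of \(\mathcal{L}_{con}\) are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

class SelectiveAttention(nn.Module):
    """Interpolates softmax and linear attention with a learnable scalar r."""

    def __init__(self, softmax_attn, linear_attn):
        super().__init__()
        self.softmax_attn = softmax_attn
        self.linear_attn = linear_attn
        # r = sigmoid(logit) keeps r in [0, 1]; starts at 0.5 (undecided).
        self.logit = nn.Parameter(torch.zeros(()))

    def forward(self, x):
        r = torch.sigmoid(self.logit)
        return r * self.softmax_attn(x) + (1.0 - r) * self.linear_attn(x)

def auxiliary_losses(logits, target_replaced, alpha):
    """L_con pins the number of linearized layers (r -> 0) to the target;
    L_reg pushes every r toward {0, 1} so final rounding is lossless."""
    r = torch.sigmoid(torch.stack(logits))
    l_con = (torch.sum(1.0 - r) - target_replaced) ** 2
    l_reg = torch.sum(1.0 - torch.abs(2.0 * r - 1.0) ** alpha)
    return l_con, l_reg
```

During training, `alpha` would be annealed from large to small per the schedule above; after convergence, each \(r\) is rounded and the losing attention branch is dropped entirely at inference.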
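
The score-difference identity can be checked in a few lines. Assume, purely for illustration, the rectified-flow convention \(x_t = (1-t)x_0 + t\epsilon\) with velocity target \(u = \epsilon - x_0\) (the paper's exact notation may differ):

```latex
\begin{aligned}
p(x_t \mid x_0) &= \mathcal{N}\!\big((1-t)\,x_0,\; t^2 I\big)
  \;\Rightarrow\; s_t(x_t) = -\tfrac{1}{t}\,\mathbb{E}[\epsilon \mid x_t],\\[2pt]
u_\theta(x_t) &= \mathbb{E}[\epsilon \mid x_t] - \mathbb{E}[x_0 \mid x_t]
  = \tfrac{1}{1-t}\big(\mathbb{E}[\epsilon \mid x_t] - x_t\big)
  \;\Rightarrow\; \mathbb{E}[\epsilon \mid x_t] = (1-t)\,u_\theta + x_t,\\[2pt]
s_t - \hat{s}_t
  &= -\tfrac{1}{t}\big[(1-t)\,u_\theta + x_t\big]
   + \tfrac{1}{t}\big[(1-t)\,\hat{u}_\theta + x_t\big]
  = -\tfrac{1-t}{t}\,\big(u_\theta - \hat{u}_\theta\big).
\end{aligned}
```

The \(x_t\) terms cancel, so the score gap needed for distribution matching falls out of the two velocity predictions directly; no auxiliary score network is required.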
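
Finally, a sketch of the Hedgehog feature map and the resulting \(O(n)\) attention, assuming \(\oplus\) denotes concatenation along the feature dimension and the bidirectional (non-causal) attention used in video DiTs; shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def hedgehog_feature_map(x, W):
    """phi(x) = softmax(xW) concat softmax(-xW): non-negative, peaked like softmax."""
    proj = x @ W  # (..., n, d) -> (..., n, d)
    return torch.cat([F.softmax(proj, dim=-1), F.softmax(-proj, dim=-1)], dim=-1)

def hedgehog_linear_attention(q, k, v, W, eps=1e-6):
    """O(n) attention: associate phi(k)^T v first, instead of forming q k^T."""
    phi_q = hedgehog_feature_map(q, W)                     # (..., n, 2d)
    phi_k = hedgehog_feature_map(k, W)                     # (..., n, 2d)
    kv = torch.einsum('...nd,...ne->...de', phi_k, v)      # (..., 2d, e), O(n) in seq len
    normalizer = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-1, -2)
    return (phi_q @ kv) / (normalizer + eps)               # (..., n, e)
```

Because the \(\phi(k)^\top v\) summary has a fixed size independent of sequence length, cost grows linearly with token count instead of quadratically.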

Loss & Training

\(\mathcal{L}_{total} = \mathcal{L}_{ADM} + \lambda(\mathcal{L}_{con} + \mathcal{L}_{reg})\). Data-free: 50K input–output pairs are sampled from the original model. Wan 1.3B is trained for 3K steps on 8×H100; 14B for 3K steps on 32×H100. Targets: 16/30 layers replaced for 1.3B; 22/40 for 14B.
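
A hedged sketch of how one ADM step might use the identity above, with a DMD-style surrogate loss standing in for the paper's exact objective (`u_teacher`, `u_student`, and the surrogate form are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def adm_loss(u_teacher, u_student, x_t, t):
    """Distribution-matching loss at an arbitrary timestep t on the trajectory.

    x_t was produced by the student being trained, so gradients flow into it.
    Both scores come analytically from velocity predictions via
        s_t - s_hat_t = -((1 - t) / t) * (u_theta - u_hat_theta),
    so, unlike DMD, no auxiliary score network is trained.
    """
    with torch.no_grad():
        u_real = u_teacher(x_t, t)   # frozen original softmax-attention model
        u_fake = u_student(x_t, t)   # the student, reused as its own score model
        # Gradient of KL(q_t || p_t) w.r.t. x_t is s_hat_t - s_t:
        grad_kl = ((1.0 - t) / t) * (u_real - u_fake)
    # DMD-style surrogate: nudges x_t down the KL gradient.
    target = (x_t - grad_kl).detach()
    return 0.5 * F.mse_loss(x_t, target, reduction='mean')
```

Since training pairs are sampled from the original model, such a step can be applied at any \(t\) along the trajectory, which is what distinguishes ADM from objectives that match only the final output.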

Key Experimental Results

VBench results across 8 dimensions (each cell reports Wan 1.3B / Wan 14B):

| Method | Latency (s) | Speedup | VBench score (approx.) |
| --- | --- | --- | --- |
| FA2 (baseline) | 97.3 / 1931 | 1.00× / 1.00× | 67.6 / 67.9 |
| SVG (sparse) | 74.5 / 1203 | 1.31× / 1.61× | 67.2 / 67.0 |
| SVG2 | 84.9 / 1364 | 1.15× / 1.42× | 67.5 / 67.3 |
| LinVideo | 68.3 / 1127 | 1.43× / 1.71× | 67.6 / 67.7 |
| LinVideo + DMD2 (4-step) | 6.1 / 92.6 | 15.9× / 20.9× | 66.7 / 66.8 |

VBench-2.0 total score: LinVideo (56.74) matches FA2 (56.74) and clearly exceeds SVG2 (55.81); the 4-step variant incurs only ~3% degradation.

CogVideoX-2B: lossless 1.40× speedup (41.35 s → 29.64 s) with on-par VBench scores.

Ablation Study

  • Target selection: Performance degrades slowly and stably when target ≤ 18; degrades sharply beyond 18.
  • Selective transfer >> manual / heuristic: LinVideo's automatically selected layer combination outperforms manual and heuristic selection by roughly 5% and 7%, respectively.
  • ADM >> MSE >> DMD: Imaging Quality scores — ADM: 66.07, MSE: 61.56, DMD: 57.44.
  • ADM without extra score model: Self-estimated score (66.07) ≈ separately trained score model (65.61), at 4.4× lower training cost.
  • \(\mathcal{L}_{reg}\) is essential: Without regularization, \(r\) remains near 0.5; after rounding, performance collapses (IQ: 18.62).
  • Hedgehog kernel is optimal: VBench 67.61 vs. Taylor 67.24 vs. ReLU 65.48.

Highlights & Insights

  • Data-free post-training — requires no video dataset; training data is sampled from the model itself, avoiding data privacy and copyright concerns.
  • Selective transfer reformulates layer selection as a differentiable continuous optimization problem, outperforming manual or heuristic search without human intervention.
  • Using the model itself for score estimation in ADM is the key innovation, avoiding the expensive auxiliary model training required by DMD.
  • The extreme 15.9–20.9× speedup (with 4-step distillation) demonstrates the synergistic power of combining linear attention with distillation.
  • Validation on CogVideoX confirms that the framework is architecture-agnostic.

Limitations & Future Work

  • Dedicated CUDA kernels are not yet used — the practical speedup of linear attention is constrained by generic PyTorch implementations.
  • Orthogonal to SLA (intra-layer mixed attention) — the two approaches can be combined for further acceleration.
  • Target selection still requires some trial and error — while performance is insensitive within a reasonable range, extreme values pose risks.
  • Visual quality after 4-step distillation still degrades by ~1% — improved distillation methods may address this.
  • Only Wan and CogVideoX are evaluated — performance on additional architectures (HunyuanVideo, Kling) remains to be verified.

Comparison with Related Work

  • vs. SVG/SVG2 (attention sparsification): Sparse methods skip only a subset of attention computations (typically retaining >50%), yielding limited speedup (1.31×). LinVideo reduces \(O(n^2)\) to \(O(n)\) outright, achieving greater speedup (1.43–1.71×) with better quality.
  • vs. LinGen/SANA-Video (pretrained linear attention): These methods require full pretraining at high cost. LinVideo requires only 3K post-training steps with no data.
  • vs. SLA (intra-layer mixed attention): SLA mixes softmax and linear attention within each layer and requires specialized GPU kernels (RTX 5090 only). LinVideo operates at the inter-layer level with generic implementations; the two are orthogonal and composable.
  • vs. DMD/DMD2 (distillation): DMD requires training an additional score model (5–10× cost). ADM uses the model's own score estimates, substantially reducing training cost.
  • The "automatically select which layers to linearize" paradigm in LinVideo mirrors the "automatically select which experts to skip" idea in MoDES — both convert discrete selection into learnable continuous optimization.
  • The ADM principle of "matching not only the final distribution but the full trajectory" generalizes to other generative model distillation settings, such as image diffusion model distillation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Selective transfer, ADM, and data-free training are all novel contributions; their combination yields remarkable results.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Two model scales, VBench + VBench-2.0, multiple attention kernels, comprehensive ablations, and 4-step distillation combinations.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Mathematical derivations are clear (especially the score difference derivation in ADM); the motivation→observation→method→validation chain is complete.
  • Value: ⭐⭐⭐⭐⭐ — Inference cost is the primary deployment bottleneck for video generation; a 15–20× speedup carries enormous industrial significance.