🎬 Video Generation

🧪 ICML2026 · 6 paper notes

Attention Sparsity is Input-Stable: Training-Free Sparse Attention for Video Generation via Offline Sparsity Profiling and Online QK Co-Clustering

SVOO observes that the attention sparsity of each layer in a video DiT is an intrinsic property: it is "input-invariant within layers and significantly heterogeneous across layers." Building on this, it first calibrates per-layer sparsity offline, then performs bidirectional QK co-clustering online. Without retraining, it achieves up to 1.93× acceleration across 7 models, including Wan and HunyuanVideo, while maintaining a PSNR of 29 dB.
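
To make the idea of an offline per-layer sparsity profile concrete, here is a minimal sketch. It is not SVOO's calibration procedure: the helper names `keep_ratio` and `calibrate`, the 0.95 attention-mass threshold, and the toy attention rows are all assumptions made for illustration. The sketch measures, for each layer, the smallest fraction of keys needed to cover a target share of attention mass, averaged over calibration rows.

```python
def keep_ratio(attn_row, mass=0.95):
    """Smallest fraction of keys whose softmax weights sum to at least `mass`.

    `mass` is a hypothetical threshold chosen for this sketch.
    """
    total = 0.0
    for kept, weight in enumerate(sorted(attn_row, reverse=True), start=1):
        total += weight
        if total >= mass:
            return kept / len(attn_row)
    return 1.0  # the row never reached the target mass: keep every key


def calibrate(layers, mass=0.95):
    """Average the keep ratio per layer over its calibration rows.

    layers: {layer_name: [attention rows collected from calibration prompts]}
    """
    return {
        name: sum(keep_ratio(row, mass) for row in rows) / len(rows)
        for name, rows in layers.items()
    }


# Toy data (made up): one peaked layer and one nearly uniform layer.
profile = calibrate({
    "layer_0": [[0.97, 0.01, 0.01, 0.01], [0.96, 0.02, 0.01, 0.01]],
    "layer_1": [[0.25, 0.25, 0.25, 0.25], [0.3, 0.3, 0.2, 0.2]],
})
print(profile)
```

In this toy profile the peaked layer needs a quarter of its keys while the flat layer needs all of them, which is the kind of cross-layer heterogeneity a fixed, layer-specific budget could exploit at inference time.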

Exploring Data-Free LoRA Transferability for Video Diffusion Models

This work is the first to analyze the weight space of full fine-tuning (FFT) and LoRA for video diffusion models (VDMs), finding that both "preserve the singular spectrum and only rotate the singular subspace," but that their routing directions conflict on head clusters. Based on this, the authors propose CASA, a data-free "cluster-wise spectral arbitration" LoRA transfer method that directly migrates a LoRA trained on the base Wan2.1 to distilled variants such as FastWan, without any user data or retraining.

Lightning Unified Video Editing via In-Context Sparse Attention

To address the quadratic attention bottleneck in video editing under the In-Context Learning (ICL) paradigm, the authors design In-context Sparse Attention (ISA) based on two insights: "context tokens are significantly less salient than source tokens" and "query sharpness is proportional to Taylor approximation error." They train LIVEditor, which accelerates inference by ~60% and also surpasses SOTA full-attention models on multiple benchmarks.

MiVE: Multiscale Vision-language features for reference-guided video Editing

MiVE extracts both the first-layer and last-layer hidden states from Qwen3-VL as multiscale condition tokens, concatenates them with VAE visual latents into one long sequence, and performs reference-guided video editing using unified self-attention in a DiT. On a 60-clip 720p benchmark, it ranks first in human preference and on six VLM-based automatic metrics, outperforming the open-source Wan-Animate and the commercial Kling O1.

Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization

QVG is a training- and finetuning-free KV-cache quantization framework for autoregressive video diffusion. It performs token smoothing via semantic-aware clustering and progressively compresses residuals in multiple stages. On LongCat-Video/HY-WorldPlay/Self-Forcing, it reduces KV memory to 1/7 of the original, with end-to-end latency overhead <4%. At 2 bits, it significantly outperforms LLM quantization baselines such as KIVI/QuaRot in quality.
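
As a rough illustration of multi-stage residual quantization at 2 bits, the sketch below applies plain uniform min-max quantization to one group of values and then quantizes the leftover error again. This is an assumption-laden toy, not QVG's method: it omits the semantic-aware clustering and token smoothing entirely, and the function names, the group size of 64, and the Gaussian test data are hypothetical.

```python
import random


def quantize_2bit(values):
    """Uniform 2-bit quantization: 4 levels spread between min and max."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # 3 steps between 4 levels; guard constant groups
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Map 2-bit codes back to approximate real values."""
    return [lo + c * scale for c in codes]


def residual_quantize(values, stages):
    """Quantize, then repeatedly quantize what the earlier stages missed."""
    approx = [0.0] * len(values)
    for _ in range(stages):
        residual = [v - a for v, a in zip(values, approx)]
        codes, lo, scale = quantize_2bit(residual)
        approx = [a + d for a, d in zip(approx, dequantize(codes, lo, scale))]
    return approx


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
group = [random.gauss(0.0, 1.0) for _ in range(64)]  # stand-in for one KV group
err_1 = mean_abs_error(group, residual_quantize(group, 1))
err_2 = mean_abs_error(group, residual_quantize(group, 2))
print(err_1, err_2)
```

Each extra stage in this toy stores another set of 2-bit codes plus a scale and offset, so the error reduction is paid for in memory; how QVG balances that trade-off is not described in the summary above.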

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

VAnim models open-domain text-to-SVG animation as "sparse state updates on a persistent DOM tree" + "Identification-First motion planning" + "GRPO rendering-aware reinforcement learning," achieving a 9.86× sequence length compression while preserving topology, and significantly surpassing GPT-5.2, Gemini 3 Pro, and LiveSketch.