🎬 Video Generation

🧪 ICML2026 · 32 paper notes

📌 Same area in other venues: 📷 CVPR2026 (182) · 🔬 ICLR2026 (98) · 💬 ACL2026 (4) · 🤖 AAAI2026 (11) · 🧠 NeurIPS2025 (23) · 📹 ICCV2025 (49)

🔥 Top topics: Video Generation ×15 · Diffusion Models ×6 · Model Compression ×2

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation: AAD-1 utilizes asymmetric adversarial distillation featuring a "causal generator + bidirectional video-level discriminator" alongside DMD warmup to compress autoregressive image-to-video generation into a single sampling step per chunk, effectively mitigating motion collapse and long-range drift.
Attention Sparsity is Input-Stable: Training-Free Sparse Attention for Video Generation via Offline Sparsity Profiling and Online QK Co-Clustering: SVOO discovers that the attention sparsity of each layer in video DiT is an intrinsic property that is "input-independent within layers and significantly heterogeneous between layers." Based on this, it performs offline per-layer sparsity calibration followed by online QK bidirectional co-clustering for block partitioning. It achieves up to 1.93× speedup while maintaining a PSNR of 29 dB across 7 models (e.g., Wan, HunyuanVideo) without any training.
Bridging Creative Intent and Visual Quality: Creator-Driven Recurrent Video Generation with Agentic Feedback Loops: CHIEF places the creator at the center of the video generation iterative loop. It utilizes "anthropomorphic multi-modal LLM audience agents" to automatically generate subjective film reviews for generated videos, which are then structured into actionable prompt modifications by a translator. This allows even middle school students without filming experience to scale from 1-minute clips to 10-minute short films with complete narratives.
CamGeo: Sparse Camera-Conditioned Image-to-Video Generation with 3D Geometry Prior: CamGeo distills 3D geometric knowledge from a pre-trained 3D video model (VGGT) into the diffusion model, using VGGT as a supervision signal during training only. The diffusion model then generates high-quality videos with geometric consistency and smooth motion under sparse camera inputs, and VGGT is removed entirely at inference to maintain efficiency.
DFSAttn: Dynamic Fine-Grained Sparse Attention for Efficient Video Generation: DFSAttn achieves 2.1× end-to-end acceleration with quality comparable to full attention through 3D Hilbert curve reordering + hierarchical block scoring + adaptive mask caching. It addresses the core issue of quality degradation in block-sparse attention at high sparsity ratios (>80%).
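The general idea of scoring blocks before attending to them can be illustrated with a minimal sketch. This is not DFSAttn's algorithm: it assumes plain mean-pooled block summaries and a fixed `keep_ratio`, and it omits the Hilbert reordering, hierarchical scoring, and mask caching described above. The names `mean_pool`, `block_mask`, and `keep_ratio` are hypothetical.

```python
def mean_pool(rows, block):
    """Average consecutive groups of `block` vectors into one summary vector each."""
    pooled = []
    for start in range(0, len(rows), block):
        chunk = rows[start:start + block]
        dim = len(chunk[0])
        pooled.append([sum(v[d] for v in chunk) / len(chunk) for d in range(dim)])
    return pooled


def block_mask(queries, keys, block, keep_ratio):
    """For each query block, return the sorted indices of the key blocks to keep.

    Blocks are scored by the dot product of their mean-pooled summaries
    (an illustrative choice, not the scoring rule of any specific paper).
    """
    q_blocks = mean_pool(queries, block)
    k_blocks = mean_pool(keys, block)
    keep = max(1, round(keep_ratio * len(k_blocks)))
    mask = []
    for qb in q_blocks:
        scores = [sum(a * b for a, b in zip(qb, kb)) for kb in k_blocks]
        top = sorted(range(len(k_blocks)), key=lambda j: scores[j], reverse=True)[:keep]
        mask.append(sorted(top))
    return mask


# Toy example: two blocks of two tokens each, pointing in orthogonal directions.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mask = block_mask(tokens, tokens, block=2, keep_ratio=0.5)
```

With `keep_ratio=0.2`, each query block would attend to one fifth of the key blocks, which corresponds to 80% sparsity.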
Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos: MIGA enables base video models to generate infinitely long and highly temporally consistent videos without training through two core mechanisms: Two-Stage Training-Inference Alignment (TTA) and Dual Consistency Enhancement (DCE: Self-Reflection + Long-Range Frame Guidance). It achieves a 2.8-point improvement in VBench composite score over FIFO-Diffusion (97.82 vs 95.02).
EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance: EPiC utilizes a "first-frame visibility mask" approach to construct pixel-aligned anchor videos directly from arbitrary in-the-wild videos. By pairing this with Anchor-ControlNet, which has only 26M parameters (<1% of the backbone) and operates exclusively on visible regions, EPiC achieves SOTA I2V camera control precision and zero-shot generalization to V2V. This is accomplished while freezing the CogVideoX-5B-I2V backbone, using only 5K videos and 500 training steps.
Explainable Forensics of Manipulated Segments in Untrimmed Long Videos: This paper proposes the task of temporal localization and explainable analysis of AI-generated segments in long videos, introducing the large-scale TASLE dataset and the two-stage MSLoc baseline method—achieving precise localization and explainable reasoning of manipulated segments in mixed real-fake videos through boundary-aware proposal generation and MLLM refinement.
Exploring Data-Free LoRA Transferability for Video Diffusion Models: This paper presents the first weight-space analysis of Full Fine-Tuning (FFT) and LoRA for Video Diffusion Models (VDMs). It discovers that both "preserve the singular spectrum and only rotate the singular subspaces," but exhibit conflicting routing directions on head clusters. Based on this, the authors propose CASA—a data-free "spectral arbitration by clustering" LoRA transfer method that allows LoRA trained on base models like Wan2.1 to be directly transferred to distilled variants like FastWan without requiring user data or retraining.
iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance: iTryOn defines the "Interactive Video Virtual Try-On" task for the first time—enabling individuals in videos to actively manipulate garments (zipping, lifting corners, stretching) rather than just passive display. By resolving spatial ambiguity through 3D hand priors, strictly aligning timestamped action titles with corresponding frames using Action-aware RoPE (A-RoPE), and amplifying learning signals in sparse interaction frames via Action-aware Constraint Loss (AC Loss), it improves the ISR (Interaction Success Rate) on the self-built VVT-Interact from a baseline of 0.397 to 0.610 (+54%).
Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention: Light Forcing is the first sparse attention scheme customized for autoregressive (AR) video diffusion models. Chunk-Aware Growth (CAG) quantifies the cumulative error contribution of each generated chunk to dynamically allocate sparsity, while Hierarchical Sparse Attention (HSA) flexibly captures historical dependencies through a two-level mask selection, first at the frame level and then at the chunk level. It achieves a 1.30× end-to-end speedup and a 3.79× attention speedup on Self Forcing, with a VBench total score of 84.5, exceeding the dense baseline's 84.1.
Lightning Unified Video Editing via In-Context Sparse Attention: To address the quadratic attention bottleneck in video editing under the In-Context Learning paradigm, the authors designed In-context Sparse Attention (ISA) based on two insights: "context token saliency is lower than source tokens" and "Query sharpness is proportional to Taylor approximation error." They trained LIVEditor, which achieves a ~60% speedup while surpassing SOTA full-attention models across several benchmarks.
LocoT2V-Bench: Benchmarking Long-form and Complex Text-to-Video Generation: LocoT2V-Bench is a professional benchmark for long-video generation in complex scenes, comprising 234 real video clips across 18 themes, with prompts averaging 249 words. Accompanied by the LoCoT2V-Eval framework (5 dimensions, 17 sub-dimensions, including hierarchical VQA + conditional gating + Auditor-Evaluator dual-agent HERD), it systematically evaluates 17 long-video generation models. The results reveal a universal bottleneck: "strong perceptual quality but weak fine-grained alignment and poor character consistency."
LuVe: Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts: LuVe redefines UHR video generation from "passive detail enhancement" to "active content completion." Through a three-stage cascade (Low-Resolution Motion → Latent Space Upsampling → High-Resolution Refinement) and frequency domain analysis-driven Dual Frequency Experts (Low-Frequency Expert for global semantic consistency, High-Frequency Expert for texture refinement), it achieves a total score of 84.03 on VBench 4K, surpassing UltraWan-4K's 83.75.
MiVE: Multiscale Vision-language features for reference-guided video Editing: MiVE extracts both the first and last layer hidden states of Qwen3-VL as multi-scale condition tokens. These are concatenated with VAE visual latents into a long sequence for reference-guided video editing within a unified self-attention DiT. It ranks first in human preference and across 6 VLM automated metrics on a 60-video 720P benchmark, surpassing the open-source Wan-Animate and commercial Kling O1.
MotiMotion: Motion-Controlled Video Generation with Visual Reasoning: MotiMotion transforms sparse and imprecise user trajectories and text prompts into physically plausible and causally consistent motion trajectories and text descriptions using VLM reasoning. It then employs a confidence-weighted control strategy to guide the diffusion model in generating natural videos aligned with world knowledge and physical principles—achieving a physical realism score of 0.302 on MotiBench, significantly surpassing Wan-Move's 0.218 (+38%).
OLAF-World: Orienting Latent Actions for Video World Modeling: OLAF-World learns transferable latent actions through Sequence-level Control-Effect Alignment (Seq∆-REPA)—turning unlabeled videos into action-controllable video world models and achieving zero-shot action transfer across contexts. With only 1 minute of annotated data, it achieves performance comparable to AdaWorld with 2 hours of data (rotation control accuracy 0.4680 vs 0.6420).
Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them: This paper discovers that in Image-to-Video (I2V) diffusion models, "2-step inference is physically more reliable than 50-step inference." The root cause is identified as the erosion of the phase spectrum during the denoising process. Consequently, the authors propose PhaseLock, a training-free framework that extracts motion priors from 2-step inference and injects them into the high-fidelity denoising trajectory using Latent Delta Guidance. This approach improves physical consistency by an average of 6.2 points with negligible overhead (1.06× time, 1.02× VRAM).
Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization: QVG is a training-free KV-cache quantization framework for autoregressive video diffusion. By employing semantic-aware clustering for token smoothing and progressive residual multi-stage compression, it reduces KV memory footprint to 1/7 of the original on LongCat-Video/HY-WorldPlay/Self-Forcing with <4% end-to-end latency overhead. At 2-bit, its quality significantly outperforms LLM quantization baselines like KIVI and QuaRot.
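To make "2-bit quantization with progressive residual compression" concrete, here is a minimal sketch under simplifying assumptions. It uses plain min-max quantization on one group of values and re-quantizes the leftover error in a second stage; the semantic-aware clustering step is not modeled, and `quantize_2bit`, `dequantize`, and `residual_quantize` are hypothetical helper names rather than QVG's API.

```python
def quantize_2bit(values):
    """Min-max quantize a group of floats to 4 levels; return (codes, lo, step)."""
    lo, hi = min(values), max(values)
    # 4 levels span 3 steps; fall back to 1.0 so a constant group does not divide by zero.
    step = (hi - lo) / 3 or 1.0
    codes = [min(3, max(0, round((v - lo) / step))) for v in values]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Map 2-bit codes back to floats."""
    return [lo + c * step for c in codes]


def residual_quantize(values, stages=2):
    """Quantize, then repeatedly quantize whatever error remains; return the reconstruction."""
    recon = [0.0] * len(values)
    for _ in range(stages):
        residual = [v - r for v, r in zip(values, recon)]
        recon = [r + d for r, d in zip(recon, dequantize(*quantize_2bit(residual)))]
    return recon


def max_error(values, recon):
    return max(abs(v - r) for v, r in zip(values, recon))


group = [0.0, 0.1, 0.45, 0.8, 1.0]
one_stage = max_error(group, residual_quantize(group, stages=1))
two_stage = max_error(group, residual_quantize(group, stages=2))
```

In this toy setting each extra stage costs another 2 bits per value plus its own offset and step, so the number of stages trades memory for reconstruction error.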
Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Generation: This paper identifies that KV cache quantization in chunked autoregressive video diffusion models causes a systematic shift in attention weights ("Quantized Keys Steal Attention"). By deriving a per-score correction term based on Jensen's inequality, it restores video quality to near-BF16 levels (VBench 78.02 vs 78.27) under aggressive INT2 quantization, saving 50% memory.
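One way to see why zero-mean quantization noise can still shift attention is Jensen's inequality: because `exp` is convex, the expected value of `exp(score + noise)` exceeds `exp(score)`. The sketch below checks this numerically under stated assumptions: key noise is modeled as independent and uniform within one quantization step, and the noise on the score is approximated as Gaussian, which gives a log-domain correction of half its variance. This is an illustrative model, not the paper's derivation, and all function names are hypothetical.

```python
import math
import random


def score_noise_variance(q, step):
    """Variance of q·e when each entry of e is uniform on [-step/2, step/2]."""
    return sum(x * x for x in q) * step * step / 12.0


def jensen_correction(q, step):
    """Log-domain bias of exp(q·(k + e)) under a Gaussian approximation of q·e."""
    return 0.5 * score_noise_variance(q, step)


def mean_inflation(q, step, trials=20000, seed=0):
    """Monte Carlo estimate of E[exp(q·e)]: how much a quantized key's weight grows on average."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noise = sum(x * rng.uniform(-step / 2, step / 2) for x in q)
        total += math.exp(noise)
    return total / trials


query = [0.25] * 64                    # toy query with squared norm 4
inflation = mean_inflation(query, step=0.5)
corrected = inflation * math.exp(-jensen_correction(query, step=0.5))
```

Subtracting `jensen_correction(query, step)` from the score of each quantized key before the softmax brings its expected weight back to roughly the unquantized value in this toy model.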
Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories: The method packs per-pixel camera rays ("origin + direction") into a 3-channel "raxel" map with the same shape as RGB, allowing a pre-trained video VAE to function directly as a camera encoder. By using Decoupled Self-Cross Attention to jointly denoise raxel and video frames within a single Flow Matching DiT, it is the first to support pose estimation, camera-controllable video generation, and joint "video + trajectory" generation using a single set of weights.
Self-Refining Video Sampling: The pretrained flow matching video generator is reinterpreted as a "denoising autoencoder." During inference, a Predict-and-Perturb inner loop iteratively corrects latent deviations within the same noise level. An uncertainty mask derived from model self-consistency is applied to refine only dynamic regions. This approach significantly enhances motion coherence and physical plausibility without any external verifier or additional training, achieving a human preference rate exceeding 70%.
SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion: SGMD introduces a stable teacher stop-gradient Fisher objective and a dual potential (NR/RC) mechanism to solve the high cost of fake score tracking (5 updates per round in DMD2) and motion suppression issues in few-step video diffusion distillation. It achieves ~3× training acceleration and improves motion quality from 0.65 to 0.78 (VideoAlign) under 4-step distillation.
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation: T2AV-Compass is the first comprehensive evaluation benchmark for Text-to-Audio-Video (T2AV) generation. It features 500 complex prompts and a dual-level evaluation framework combining low-level signal metrics with high-level MLLM diagnostics. By evaluating 15 cutting-edge T2AV systems, it quantitatively reveals an "audio realism bottleneck," where even top-tier models achieve over 85% realism in video but only approximately 50% in audio.
V2V-Bench: A Comprehensive Benchmark for Video-to-Video Generation Evaluation: Video-to-Video (V2V) editing must follow instructions while maintaining frame-level alignment with the source video, a requirement that existing T2V/I2V metrics fail to capture. To address this, the paper proposes V2V-Bench, a benchmark with 11 decoupled dimensions across 5 categories (6 of which are V2V-exclusive) and a four-stage pipeline that checks compliance before detailed evaluation. It achieves a Spearman correlation of 0.905 with human judgment across 6 core V2V dimensions.
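For readers unfamiliar with how agreement with human judgment is measured, the following is a minimal sketch of Spearman rank correlation: rank both score lists (ties share an average rank) and compute the Pearson correlation of the ranks. The example scores are made up for illustration and are not data from V2V-Bench.

```python
def ranks(values):
    """Return 1-based average ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


metric_scores = [0.2, 0.5, 0.9, 0.4]   # hypothetical automatic scores
human_scores = [1, 3, 4, 2]            # hypothetical human ratings
rho = spearman(metric_scores, human_scores)
```

Because only the ordering matters, a metric can reach a high Spearman correlation even when its raw values sit on a different scale from the human ratings.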
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation: VAnim models open-domain text-to-SVG animation as "sparse state updates on a persistent DOM tree" + "Identification-First motion planning" + "GRPO rendering-aware reinforcement learning." This approach compresses sequence lengths by 9.86× while maintaining topological consistency, significantly outperforming GPT-5.2, Gemini 3 Pro, and LiveSketch.
VEDA: Scalable Video Diffusion via Distilled Sparse Attention: VEDA reformulates the sparse attention problem in video DiT as "explicit distillation of the full attention structure." By combining statistic-aware tile scoring, head-aware grouping search, and hardware-efficient kernels, it maintains generation quality at extreme 90-95% sparsity. It achieves a 5.1× end-to-end speedup and 10.5× attention acceleration for Waver-12B generating 720P 10-second videos.
VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation: VideoGPA utilizes a Geometric Foundation Model (GFM) to reconstruct generated videos into 3D point clouds and project them back into the original frames. It uses "reprojection error" as a self-supervised geometric consistency reward to automatically construct preference pairs. By applying DPO (fine-tuning ~1% parameters via LoRA with only ~2500 preference samples), it aligns pre-trained video diffusion models to a 3D-consistent manifold, significantly mitigating object deformation and spatial drift without compromising image quality.
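The pipeline above (score samples by reprojection error, form preference pairs, optimize with DPO) can be sketched as follows. This is a minimal illustration assuming the standard DPO objective on scalar log-probabilities; video diffusion models usually substitute a denoising-loss surrogate for these log-probabilities, and the exact formulation used by VideoGPA may differ. `build_pair` and `dpo_loss` are hypothetical names.

```python
import math


def build_pair(samples):
    """Pick a (winner, loser) pair from samples generated for the same prompt.

    samples: list of (video_id, reprojection_error); lower error is preferred.
    """
    ordered = sorted(samples, key=lambda s: s[1])
    return ordered[0][0], ordered[-1][0]


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy favors the winner over the
    loser than the frozen reference model does.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return math.log1p(math.exp(-beta * margin))


winner, loser = build_pair([("a", 0.8), ("b", 0.2), ("c", 0.5)])
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)        # policy identical to the reference
improved = dpo_loss(2.0, -2.0, 0.0, 0.0)      # policy now prefers the winner
```

The loss starts at `log 2` when the policy matches the reference and falls as the policy assigns relatively more probability to the lower-error sample.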
Where Concept Erasure Should Occur: Concept-Layer Alignment in Text-to-Video Diffusion Models: This paper discovers that target concepts in text-to-video (T2V) diffusion models are most separable only at specific depths. It proposes CLEAR, which utilizes Gumbel-Softmax to learn "where to erase" and Sparse Autoencoders (SAE) to learn "which concept direction to erase," enabling precise suppression of target concepts while preserving video quality without modifying diffusion model weights.
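As background on how Gumbel-Softmax can make a discrete choice such as "which layer" learnable, here is a minimal sampling sketch. It assumes one logit per candidate layer and returns soft selection weights; how CLEAR parameterizes and trains these logits is not specified here, and `gumbel_softmax` with its arguments is a hypothetical helper.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Sample relaxed one-hot weights: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logarithms are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)                       # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
layer_logits = [0.0, 2.0, 0.5]              # hypothetical scores for three candidate layers
weights = gumbel_softmax(layer_logits, temperature=0.5, rng=rng)
```

Lower temperatures push the weights toward a one-hot choice, while higher temperatures spread them more evenly across layers.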
WIND: Weather Inverse Diffusion for Zero-Shot Atmospheric Modeling: WIND models the global atmospheric sequence as an unconditional video diffusion prior. During inference, it formulates forecasting, downscaling, sparse reconstruction, mass conservation, and warming scenarios as differentiable inverse problems, solving multiple classes of weather and climate tasks zero-shot using a single frozen model.
World-R1: Reinforcing 3D Constraints for Text-to-Video Generation: World-R1 transforms the 3D consistency problem of text-to-video models into reinforcement learning (RL) post-training. By using implicit camera conditioning and 3D-aware rewards to perform Flow-GRPO alignment on video foundation models like Wan 2.1, it significantly reduces geometric hallucinations without altering model architecture or inference pipelines, while maintaining general video generation quality.
WorldCache: Accelerating World Models for Free via Heterogeneous Token Caching: WorldCache addresses the issue of non-uniform evolution of multimodal tokens (such as RGB and depth) in diffusion world models. By categorizing tokens into stable, linear, and chaotic types based on curvature and adaptively triggering full forward passes, it achieves end-to-end speedups of 3.65× to 3.7× on models like HunyuanVoyager and Aether, while largely preserving the quality of world generation and 3D reconstruction.
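The stable / linear / chaotic split can be illustrated with finite differences over a token's recent features. The sketch below is a guess at the general shape of such a rule, not WorldCache's criterion: it treats the first difference as velocity and the second difference as curvature, and the thresholds `stable_tol` and `curve_tol` are hypothetical.

```python
def classify_token(prev2, prev1, current, stable_tol=1e-3, curve_tol=1e-3):
    """Label one token from its feature vectors at three consecutive denoising steps."""
    velocity = [c - p for c, p in zip(current, prev1)]
    prev_velocity = [p - q for p, q in zip(prev1, prev2)]
    curvature = [v - pv for v, pv in zip(velocity, prev_velocity)]
    speed = max(abs(v) for v in velocity)
    bend = max(abs(c) for c in curvature)
    if speed <= stable_tol and bend <= curve_tol:
        return "stable"     # barely changing: a cached value could be reused
    if bend <= curve_tol:
        return "linear"     # changing at a constant rate: extrapolation may suffice
    return "chaotic"        # changing unpredictably: recompute with a full forward pass


labels = [
    classify_token([1, 1], [1, 1], [1, 1]),
    classify_token([0, 0], [1, 2], [2, 4]),
    classify_token([0, 0], [1, 0], [0, 3]),
]
```

The mapping from labels to actions in the comments is an assumption for illustration; the source only states that full forward passes are triggered adaptively.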