🎬 Video Generation

🧪 ICML2025 · 7 paper notes

📌 Same area in other venues: 📷 CVPR2026 (182) · 🔬 ICLR2026 (98) · 💬 ACL2026 (4) · 🧪 ICML2026 (32) · 🤖 AAAI2026 (11) · 🧠 NeurIPS2025 (23)

🔥 Top topics: Diffusion Models ×4 · Video Generation ×3

AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration: Proposes AsymRnR, a training-free acceleration method for video DiTs. Based on the observation that redundancy varies across attention components (Q/K/V), layers, and denoising steps, it reduces tokens asymmetrically before attention and restores the full sequence afterward, accelerating inference without loss of quality.
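To make the reduce-then-restore idea concrete, here is a minimal pure-Python sketch, not AsymRnR's actual implementation. It assumes ToMe-style bipartite matching to pick redundant tokens and simply drops them (no averaging). The helper names (`reduce_tokens`, `asym_attention`) and the separate `q_ratio`/`kv_ratio` knobs are hypothetical stand-ins for the paper's per-component, per-layer, per-step schedule.

```python
import math

def cosine(a, b):
    """Cosine similarity between two token vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)

def reduce_tokens(tokens, ratio):
    """ToMe-style bipartite matching (assumed): even-indexed tokens (set A) may be
    dropped onto their most similar odd-indexed token (set B), which is always kept."""
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    match = {i: max(b_idx, key=lambda j: cosine(tokens[i], tokens[j])) for i in a_idx}
    n_drop = min(int(len(tokens) * ratio), len(a_idx))
    # drop the A tokens that are most redundant with their B partner
    ranked = sorted(a_idx, key=lambda i: cosine(tokens[i], tokens[match[i]]), reverse=True)
    dropped = {i: match[i] for i in ranked[:n_drop]}
    kept = [i for i in range(len(tokens)) if i not in dropped]
    return kept, dropped

def attention(q, k, v):
    """Plain softmax(q k^T / sqrt(d)) v over lists of row vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        logits = [scale * sum(x * y for x, y in zip(qi, kj)) for kj in k]
        m = max(logits)
        w = [math.exp(l - m) for l in logits]
        s = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / s for c in range(len(v[0]))])
    return out

def asym_attention(q, k, v, q_ratio, kv_ratio):
    """Reduce queries and key/value pairs with different ratios, attend, then restore."""
    q_kept, q_dropped = reduce_tokens(q, q_ratio)
    kv_kept, _ = reduce_tokens(k, kv_ratio)  # the same indices are removed from K and V
    reduced = attention([q[i] for i in q_kept], [k[i] for i in kv_kept], [v[i] for i in kv_kept])
    out = [None] * len(q)
    for i, row in zip(q_kept, reduced):
        out[i] = row
    # restoration: a dropped query reuses the output of the kept query it was matched to
    for i, j in q_dropped.items():
        out[i] = list(out[j])
    return out
```

When the dropped queries duplicate kept ones, restoration reproduces full attention exactly; dropping keys/values instead changes the softmax weights, which is one plausible reason to treat Q and K/V with different ratios.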
Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing: Proposes Ca2-VDM, which eliminates redundant computation on conditional frames in autoregressive video diffusion models through two key designs. Causal generation uses unidirectional feature computation, so the cache of conditional frames can be computed in earlier autoregression steps and reused later; cache sharing uses one cache across all denoising steps to avoid large storage costs. Together they reduce computational complexity from quadratic to linear, generating 80-frame videos 2.5 times faster than the baseline while maintaining state-of-the-art generation quality.
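The quadratic-versus-linear claim can be illustrated with a toy cost model (an assumption, not Ca2-VDM's code) that counts conditional-frame feature computations when each new chunk is conditioned on every frame generated so far. `conditional_frame_cost` is a hypothetical helper; 16-frame chunks and 50 denoising steps are arbitrary.

```python
def conditional_frame_cost(num_chunks, chunk_len, denoise_steps, use_cache):
    """Toy count of conditional-frame feature computations in autoregressive generation.

    Assumes every previously generated frame conditions the next chunk (growing context).
    """
    total = 0
    for step in range(num_chunks):
        context_frames = step * chunk_len
        if use_cache:
            # causal attention: earlier frames' features never change, so only the newest
            # conditional chunk is encoded, once, and shared by all denoising steps
            total += chunk_len if step > 0 else 0
        else:
            # bidirectional attention: conditional-frame features depend on the noisy frames,
            # so all of them are recomputed at every denoising step
            total += context_frames * denoise_steps
    return total

for n in (5, 10, 20):
    print(n, conditional_frame_cost(n, 16, 50, use_cache=False),
          conditional_frame_cost(n, 16, 50, use_cache=True))
# 5 8000 64
# 10 36000 144
# 20 152000 304
```

Doubling the number of chunks roughly quadruples the uncached cost but only doubles the cached one.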
Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development: This work proposes Data-Juicer Sandbox, a feedback-driven suite that uses a "Probe-Analyze-Refine" workflow to systematically study how data processing operators (OPs) affect model performance through low-cost, small-scale experiments. The resulting data recipes transfer to large-scale training and reached first place on the VBench leaderboard.
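A Probe-Analyze-Refine loop might look roughly like the sketch below. Everything here is hypothetical: `evaluate` stands in for a cheap small-scale training run scored on some benchmark, the OP names and gains are invented, and greedy refinement is one simple choice rather than the sandbox's actual strategy.

```python
from typing import Callable, Dict, List, Tuple

# hypothetical: trains a small proxy model on data processed by a recipe, returns a score
Evaluate = Callable[[List[str]], float]

def probe(ops: List[str], evaluate: Evaluate) -> Dict[str, float]:
    """Probe: score each OP in isolation with a cheap small-scale run."""
    return {op: evaluate([op]) for op in ops}

def analyze(scores: Dict[str, float], baseline: float, top_k: int) -> List[str]:
    """Analyze: keep the OPs that beat the unprocessed baseline, best first."""
    helpful = [op for op, s in scores.items() if s > baseline]
    return sorted(helpful, key=lambda op: scores[op], reverse=True)[:top_k]

def refine(candidates: List[str], evaluate: Evaluate) -> Tuple[List[str], float]:
    """Refine: greedily add candidate OPs while the combined recipe keeps improving."""
    recipe, best = [], evaluate([])
    for op in candidates:
        score = evaluate(recipe + [op])
        if score > best:
            recipe, best = recipe + [op], score
    return recipe, best

# Toy proxy with made-up gains; OP names are illustrative only.
GAINS = {"op_aesthetics": 12, "op_motion": 9, "op_text_len": 4, "op_dedup": -1}

def toy_evaluate(recipe: List[str]) -> float:
    # additive gains, minus a penalty once more than two filters shrink the data too much
    return 500 + sum(GAINS[op] for op in recipe) - 20 * max(0, len(recipe) - 2)

scores = probe(list(GAINS), toy_evaluate)
candidates = analyze(scores, baseline=toy_evaluate([]), top_k=3)
recipe, best = refine(candidates, toy_evaluate)
```

Here all three helpful OPs survive the analysis step, but refinement keeps only two because the third combination scores worse, which is the kind of interaction effect that single-OP probing alone would miss.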
Diffusion Adversarial Post-Training for One-Step Video Generation: This paper proposes the Adversarial Post-Training (APT) framework, which adds an adversarial training phase after diffusion pre-training. The resulting model, Seaweed-APT, generates high-quality 2-second, 1280×720, 24 fps videos in a single step.
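Because generation is one-step, a single generator forward pass must produce all 2 s × 24 fps = 48 frames. The sketch below shows a generic non-saturating GAN objective applied to such a one-step generator; it is an assumption for illustration, not APT's published loss (which may add regularizers), and `generator`/`discriminator` are placeholder callables returning a sample and a real-valued logit.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def discriminator_loss(real_logit: float, fake_logit: float) -> float:
    # -log sigmoid(D(real)) - log(1 - sigmoid(D(fake)))
    return softplus(-real_logit) + softplus(fake_logit)

def generator_loss(fake_logit: float) -> float:
    # non-saturating generator loss: -log sigmoid(D(G(z)))
    return softplus(-fake_logit)

def adversarial_step(generator, discriminator, noise, real_video):
    """Hypothetical single step: both losses for one batch, no optimizer shown."""
    fake = generator(noise)  # one forward pass, no iterative denoising
    d_loss = discriminator_loss(discriminator(real_video), discriminator(fake))
    g_loss = generator_loss(discriminator(fake))
    return d_loss, g_loss
```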
How Far is Video Generation from World Model: A Physical Law Perspective: This work systematically evaluates whether video generation models can discover physical laws from purely visual data, using a 2D physics-simulation video dataset strictly governed by classical mechanics. It finds that current models reproduce patterns seen within the training distribution but fail to generalize to novel physical conditions outside it.
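To see what reproducing training patterns without generalizing means in practice, the toy below simulates uniform linear motion and replaces the video model with a nearest-neighbor lookup over training velocities. This is a hypothetical illustration of the failure mode, not the paper's evaluation code; the velocity grid, `dt`, and frame count are arbitrary.

```python
def trajectory(x0, velocity, num_frames, dt=0.1):
    """Ground-truth positions of a ball in uniform linear motion."""
    return [x0 + velocity * dt * t for t in range(num_frames)]

def estimate_velocity(positions, dt=0.1):
    """Recover velocity from the first and last frame, as a tracker might."""
    return (positions[-1] - positions[0]) / ((len(positions) - 1) * dt)

def memorizing_model(x0, velocity, num_frames, train_velocities):
    """Toy stand-in for a video model: replays the closest training velocity."""
    nearest = min(train_velocities, key=lambda u: abs(u - velocity))
    return trajectory(x0, nearest, num_frames)

TRAIN_V = [1.0 + 0.1 * i for i in range(21)]  # hypothetical training velocities 1.0..3.0

def velocity_error(v):
    generated = memorizing_model(0.0, v, 16, TRAIN_V)
    return abs(estimate_velocity(generated) - v)
```

Inside the training range the lookup lands close to the true velocity and the error stays small; outside it, the toy model can only replay the fastest motion it has seen, so `velocity_error(4.5)` is about 1.5.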
MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance: Built on Stable Video Diffusion, this pose-guided human video generation framework encodes pose-estimation confidence into the guidance signal and amplifies the training loss on high-confidence hand regions, reaching an FID-VID of 9.3 on the TikTok dataset (previous best: 12.4). Position-aware progressive latent fusion also lets it natively generate smooth videos of arbitrary length.
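Long videos can be assembled from overlapping segments whose shared frames are blended according to their position. The sketch assumes a linear weight ramp and applies the blend once to finished per-frame latents; MimicMotion's actual fusion schedule may differ, and `fuse_segments` is a hypothetical helper.

```python
from typing import List

Latent = List[float]  # one latent vector per frame (toy: small float lists)

def fuse_segments(segments: List[List[Latent]], overlap: int) -> List[Latent]:
    """Concatenate segments whose last/first `overlap` frames coincide,
    blending the shared frames with position-dependent weights (linear ramp assumed)."""
    fused = [list(f) for f in segments[0]]
    for seg in segments[1:]:
        start = len(fused) - overlap
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # later overlap positions trust the new segment more
            fused[start + i] = [(1 - w) * a + w * b for a, b in zip(fused[start + i], seg[i])]
        fused.extend(list(f) for f in seg[overlap:])
    return fused
```

The ramp avoids a hard cut at the segment boundary: with two frames of overlap, the shared frames move one third and then two thirds of the way toward the new segment.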
RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers: By systematically analyzing the roles of different frequency components in RoPE, this paper identifies an "intrinsic frequency" that dominates temporal repetition during extrapolation. It proposes RIFLEx, a minimal intervention that lowers only this frequency so that the extrapolated video stays within a single period of it, enabling high-quality, training-free 2× temporal extrapolation on CogVideoX-5B and HunyuanVideo.
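The intervention can be sketched on a simplified 1D temporal RoPE: find the component whose period best matches the observed repetition interval, then lower only that frequency so one period covers the extrapolated length. The `margin` factor, the 60-frame repetition, and the 120-frame target are illustrative assumptions, and real video DiTs use 3D RoPE with a separate temporal band.

```python
import math
from typing import List

def rope_frequencies(dim: int, base: float = 10000.0) -> List[float]:
    """Standard 1D RoPE frequencies theta_j = base^(-2j/dim) for j = 0..dim/2-1."""
    return [base ** (-2 * j / dim) for j in range(dim // 2)]

def intrinsic_index(freqs: List[float], repeat_frames: float) -> int:
    """Component whose period 2*pi/theta_j is closest to the observed repetition interval."""
    return min(range(len(freqs)), key=lambda j: abs(2 * math.pi / freqs[j] - repeat_frames))

def riflex(freqs: List[float], k: int, target_frames: int, margin: float = 0.9) -> List[float]:
    """Lower only theta_k so that its period (2*pi/theta_k) exceeds the extrapolated length."""
    out = list(freqs)
    out[k] = min(freqs[k], margin * 2 * math.pi / target_frames)
    return out

freqs = rope_frequencies(16)
k = intrinsic_index(freqs, repeat_frames=60)  # repetition observed near frame 60 (hypothetical)
new = riflex(freqs, k, target_frames=120)     # 2x a hypothetical 60-frame training length
```

Taking the `min` means the frequency is only ever scaled down, and every other component is left untouched, which matches the "minimal intervention" framing.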