🎬 Video Generation

🧪 ICML2025 · 7 paper notes

🔥 Top topics: Diffusion Models ×4 · Video Generation ×3

AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration

Proposes AsymRnR, a training-free acceleration method for video DiTs. Based on the observation that redundancy varies across attention components (Q/K/V), layers, and denoising steps, it reduces tokens asymmetrically and then restores them, achieving lossless acceleration.
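The idea that redundancy differs per attention component can be illustrated with a toy redundancy score. This is only a sketch under stated assumptions, not the paper's procedure: the score (mean nearest-neighbour cosine similarity), the function names `redundancy` and `components_to_reduce`, and the `threshold` value are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def redundancy(tokens):
    """Hypothetical score: average similarity of each token to its closest other token."""
    total = 0.0
    for i, tok in enumerate(tokens):
        total += max(cosine(tok, other) for j, other in enumerate(tokens) if j != i)
    return total / len(tokens)

def components_to_reduce(features, threshold=0.9):
    """Select only the components whose tokens look redundant (threshold is an assumption)."""
    return [name for name, toks in features.items() if redundancy(toks) >= threshold]

# Toy features: queries are spread out, keys are near-duplicates.
features = {
    "q": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    "k": [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01]],
    "v": [[1.0, 0.0], [0.0, 1.0]],
}
selected = components_to_reduce(features)
```

In this toy setting only the key tokens would be reduced, while queries and values stay at full length; a real method would also make the decision per layer and per denoising step.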

Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing

Proposes Ca2-VDM, which eliminates redundant computation on conditional frames in autoregressive video diffusion models through two key designs: Causal Generation and Cache Sharing. It reduces computational complexity from quadratic to linear and generates 80-frame videos 2.5 times faster than the baseline while maintaining state-of-the-art generation quality.
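A toy cost model may help show where the quadratic-to-linear reduction comes from. It only counts how many frames are pushed through the network per autoregressive step and ignores the cost of reading cached features, so it is a rough sketch rather than the paper's analysis; the chunk size of 8 and both function names are assumptions.

```python
def frames_processed_without_cache(total_frames, chunk):
    """Each autoregressive step recomputes all previous (conditional) frames plus the new chunk."""
    processed = 0
    context = 0
    while context < total_frames:
        new = min(chunk, total_frames - context)
        processed += context + new  # conditional frames are recomputed every step
        context += new
    return processed

def frames_processed_with_cache(total_frames, chunk):
    """Each autoregressive step computes only the new chunk; earlier frames come from a cache."""
    processed = 0
    context = 0
    while context < total_frames:
        new = min(chunk, total_frames - context)
        processed += new  # cached conditional frames are reused, not recomputed
        context += new
    return processed

baseline = frames_processed_without_cache(80, 8)  # grows quadratically with video length
cached = frames_processed_with_cache(80, 8)       # grows linearly with video length
```

Under these assumptions the uncached loop touches 440 frames for an 80-frame video and the cached loop touches 80. The measured 2.5 times speedup in the summary is an end-to-end figure and is not derived from this count.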

Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development

This work proposes Data-Juicer Sandbox, a feedback-driven suite that uses a "Probe-Analyze-Refine" workflow to study, in low-cost small-scale experiments, how data processing operators (OPs) affect model performance. The resulting data recipes transfer to large-scale scenarios and achieve first place on the VBench leaderboard.

Diffusion Adversarial Post-Training for One-Step Video Generation

This paper proposes the Adversarial Post-Training (APT) framework, which introduces an adversarial training phase after diffusion model pre-training to achieve high-quality one-step video generation (2 seconds, 1280×720, 24fps) with a model named Seaweed-APT.

How Far is Video Generation from World Model: A Physical Law Perspective

This work systematically evaluates whether video generation models can discover physical laws from purely visual data by constructing a 2D physical simulation video dataset that strictly adheres to classical mechanics. It finds that current models merely memorize patterns within the training distribution rather than generalizing to novel physical conditions.

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

Built upon Stable Video Diffusion, this pose-guided human video generation framework achieves an FID-VID of 9.3 (previous best: 12.4) on the TikTok dataset by encoding pose estimation confidence into the guidance signal, amplifying the training loss for high-confidence hand regions, and employing position-aware progressive latent fusion. It also natively supports generating smooth videos of arbitrary length.
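The loss amplification for high-confidence hand regions could look roughly like the weighted mean squared error below. This is a minimal sketch on flat lists, not the authors' implementation: the `threshold` and `amplify` values, the per-element mask layout, and the function name are all assumptions.

```python
def region_weighted_mse(pred, target, hand_mask, hand_confidence,
                        threshold=0.8, amplify=10.0):
    """Mean squared error with extra weight on hand elements whose pose confidence is high.

    pred, target: flat lists of floats of equal length.
    hand_mask: flat list of bools marking elements inside a hand region.
    hand_confidence: confidence of the hand pose estimate for this frame (0..1).
    threshold, amplify: hypothetical hyperparameters chosen for illustration.
    """
    boost = hand_confidence >= threshold
    total = 0.0
    for p, t, in_hand in zip(pred, target, hand_mask):
        weight = amplify if (in_hand and boost) else 1.0
        total += weight * (p - t) ** 2
    return total / len(pred)

# With a confident hand estimate the hand element dominates the loss.
confident = region_weighted_mse([1.0, 1.0], [0.0, 0.0], [True, False], 0.9)
# With a low-confidence estimate the loss falls back to a plain MSE.
uncertain = region_weighted_mse([1.0, 1.0], [0.0, 0.0], [True, False], 0.5)
```

Gating the amplification on confidence keeps unreliable hand detections from steering training, which matches the motivation given in the summary.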

RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers

By systematically analyzing the roles of different frequency components in RoPE positional encoding, this paper identifies an "intrinsic frequency" that dominates temporal repetition during extrapolation. It proposes RIFLEx, a minimal intervention scheme that scales down only this frequency to keep it within a single period after extrapolation, achieving high-quality training-free 2× video extrapolation on CogVideoX-5B and HunyuanVideo.
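The frequency adjustment described above can be sketched on standard RoPE frequencies. The selection rule used here (pick the component whose period is closest to the training length) and the values `dim=64`, `base=10000`, and 49 training frames are assumptions for illustration, not details confirmed by this summary.

```python
import math

def rope_frequencies(dim, base=10000.0):
    """Standard RoPE frequencies: theta_j = base^(-2j/dim) for j in [0, dim/2)."""
    return [base ** (-2.0 * j / dim) for j in range(dim // 2)]

def lower_intrinsic_frequency(freqs, train_len, target_len):
    """Scale down one frequency so its period covers the extrapolated length.

    Assumption: the "intrinsic" component is the one whose period 2*pi/theta
    is closest to the training length. All other frequencies are left untouched.
    """
    periods = [2.0 * math.pi / f for f in freqs]
    k = min(range(len(freqs)), key=lambda j: abs(periods[j] - train_len))
    adjusted = list(freqs)
    # Cap the frequency so that target_len positions fit inside a single period.
    adjusted[k] = min(freqs[k], 2.0 * math.pi / target_len)
    return k, adjusted

freqs = rope_frequencies(64)
# Hypothetical setting: trained on 49 frames, extrapolating 2x to 98 frames.
k, adjusted = lower_intrinsic_frequency(freqs, train_len=49, target_len=98)
```

Because only one entry of the frequency list changes, the intervention needs no retraining and leaves the higher and lower frequency components as they were.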