
🎬 Video Generation

🧠 NeurIPS 2025 · 23 paper notes

Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation

This paper proposes AAPT (Autoregressive Adversarial Post-Training), which converts a pretrained video diffusion model into an autoregressive real-time video generator via adversarial training. The model requires only one forward pass per frame (1 NFE), employs student-forcing training to reduce error accumulation, and achieves real-time streaming generation at 736×416 resolution and 24 fps on a single H100 GPU, supporting videos up to one minute long (1440 frames).
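As a rough illustration of the 1 NFE streaming loop described above, the sketch below rolls a toy generator forward one frame at a time, conditioning each step on the model's own previous outputs (the essence of student forcing). The `generator` function, context length, and shapes are placeholders, not the paper's architecture.

```python
# Minimal sketch of 1-NFE autoregressive rollout with student forcing.
# `generator` is a toy stand-in for one forward pass of the post-trained model.
import numpy as np

rng = np.random.default_rng(0)

def generator(context, noise):
    # Placeholder: a decayed average of the context frames plus noise.
    return 0.9 * context.mean(axis=0) + 0.1 * noise

def rollout(first_frame, num_frames, context_len=4):
    frames = [first_frame]
    for _ in range(num_frames - 1):
        context = np.stack(frames[-context_len:])   # condition on own outputs
        noise = rng.standard_normal(first_frame.shape)
        frames.append(generator(context, noise))    # exactly one call per frame
    return np.stack(frames)

video = rollout(first_frame=rng.standard_normal((416, 736, 3)), num_frames=24)
print(video.shape)  # (24, 416, 736, 3): one second at 24 fps
```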

DisMo: Disentangled Motion Representations for Open-World Motion Transfer

DisMo learns abstract motion representations that are agnostic to appearance, pose, and category from raw videos via a dual-stream architecture (motion extractor + frame generator) and an image-space reconstruction objective. It enables open-world motion transfer across categories and viewpoints, and significantly outperforms video representation models such as V-JEPA on zero-shot action classification.

Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals

This paper proposes Force Prompting, which uses physical forces (local point forces and global wind forces) as control signals for video generation models. Using only ~15K synthetic training videos (Blender flags and rolling balls) and a single day of training on 4×A100 GPUs, the method achieves remarkable generalization across diverse real-world scenes with varying objects, materials, and geometries, including preliminary mass understanding capabilities.

Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation

This paper proposes Foresight, a training-free adaptive layer reuse framework that establishes per-layer MSE thresholds during a warmup phase and dynamically decides at inference time whether to reuse cached features or recompute each layer. Evaluated on 5 video generation models, Foresight achieves superior quality and speed trade-offs compared to static methods, with up to 2.23× acceleration.
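A minimal sketch of the reuse decision, assuming per-layer MSE thresholds `tau` were already calibrated during a warmup phase; the names and the cache layout are illustrative, not the paper's API.

```python
# Toy adaptive layer reuse: reuse a layer's cached output if its current
# input is close (in MSE) to the input that produced the cached output.
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def run_layers(x, layers, cache, tau):
    """cache[l] holds (last_input, last_output) for layer l across denoising steps."""
    for l, layer in enumerate(layers):
        if l in cache and mse(x, cache[l][0]) < tau[l]:
            x = cache[l][1]              # input barely changed: reuse cached output
        else:
            y = layer(x)                 # input drifted too far: recompute
            cache[l] = (x, y)
            x = y
    return x

# Toy usage: two "layers", loose threshold for the first, strict for the second.
layers = [lambda x: x * 0.5, lambda x: x + 1.0]
cache, tau = {}, [1e-1, 1e-6]
out1 = run_layers(np.ones(4), layers, cache, tau)          # both layers computed
out2 = run_layers(np.ones(4) * 1.01, layers, cache, tau)   # layer 0 (and hence layer 1) reused
```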

LeMiCa: Lexicographic Minimax Path Caching for Efficient Diffusion-Based Video Generation

This paper proposes LeMiCa, a training-free acceleration framework for diffusion-based video generation that formulates cache scheduling as a lexicographic minimax path optimization problem on a directed acyclic graph (DAG), achieving simultaneous gains in speed and quality (2.9× speedup on Latte; LPIPS as low as 0.05 on Open-Sora) via global error control.
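To make the formulation concrete, here is a toy bottleneck-path solver over a DAG of denoising steps, where each edge carries the estimated error of reusing the cache across that span. It only minimizes the single largest error, whereas the paper's objective is fully lexicographic (ordering the second-largest error and so on), which this toy omits.

```python
# Toy minimax (bottleneck) path over a denoising-step DAG:
# edge (i, j) = estimated error of reusing the cache from step i through step j.
import heapq

def minimax_path(edges, src, dst):
    """edges: dict (i, j) -> error. Returns (largest edge error on path, path)."""
    adj = {}
    for (i, j), w in edges.items():
        adj.setdefault(i, []).append((j, w))
    best, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        bottleneck, u = heapq.heappop(heap)
        if u == dst:
            path = [u]
            while u in prev:
                u = prev[u]
                path.append(u)
            return bottleneck, path[::-1]
        for v, w in adj.get(u, []):
            cand = max(bottleneck, w)
            if cand < best.get(v, float("inf")):
                best[v], prev[v] = cand, u
                heapq.heappush(heap, (cand, v))
    return float("inf"), []

# Example: 5 steps; skipping a longer span at once costs a larger error.
edges = {(0, 1): 0.01, (1, 2): 0.01, (2, 3): 0.01, (3, 4): 0.01,
         (0, 2): 0.05, (1, 3): 0.04, (2, 4): 0.06, (0, 3): 0.20}
print(minimax_path(edges, 0, 4))
```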

MagCache: Fast Video Generation with Magnitude-Aware Cache

This paper observes that, in video diffusion models, the magnitude ratio of residual outputs between adjacent timesteps decreases monotonically across models and prompts, a pattern the authors term the "Unified Magnitude Law". Building on this, it proposes MagCache, which models skip-step error accumulation via these magnitude ratios and adaptively skips redundant timesteps by reusing cached outputs, requiring only a single calibration sample. MagCache achieves 2.10–2.68× speedup on Open-Sora, CogVideoX, Wan 2.1, and HunyuanVideo while outperforming TeaCache and other existing methods on LPIPS, SSIM, and PSNR.
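A rough sketch of how such a skip schedule could be derived from a calibrated magnitude-ratio curve; the threshold, the consecutive-skip cap, and the error proxy are illustrative assumptions rather than the paper's exact rule.

```python
# Plan which timesteps to skip from a calibrated magnitude-ratio curve r[t]
# (residual magnitude at step t relative to step t-1).
import numpy as np

def plan_skips(ratios, err_thresh=0.1, max_consecutive=2):
    """Return a boolean list: True = skip this step and reuse the cached output."""
    skip, acc_err, streak = [], 0.0, 0
    for r in ratios:
        err = abs(1.0 - r)                     # deviation from "residual unchanged"
        if acc_err + err < err_thresh and streak < max_consecutive:
            skip.append(True)
            acc_err += err
            streak += 1
        else:
            skip.append(False)                 # recompute; reset accumulated error
            acc_err, streak = 0.0, 0
    return skip

ratios = np.linspace(1.0, 0.6, 20)             # monotonically decreasing, per the law
print(plan_skips(ratios))
```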

Photography Perspective Composition: Towards Aesthetic Perspective Recommendation

This paper proposes a novel "Photography Perspective Composition" (PPC) paradigm that goes beyond traditional cropping-based approaches. It constructs a perspective transformation dataset via 3D reconstruction, generates recommended viewpoints through Image-to-Video generation, aligns with human preferences via RLHF, and evaluates perspective quality using a PQA model.

PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation

PhysCtrl employs diffusion models to learn the physical dynamics distribution of four material types (elastic, sand, plasticine, and rigid bodies), representing dynamics as 3D point trajectories. A diffusion model incorporating spatiotemporal attention and physics constraints is trained on 550K synthetic animations; the generated trajectories drive a pretrained video model to achieve high-fidelity physics video generation controllable by force and material parameters.

PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis

This paper proposes PoseCrafter, a training-free framework for extreme pose estimation. It synthesizes high-fidelity intermediate frames via Hybrid Video Generation (HVG, a two-stage pipeline combining DynamiCrafter and ViewCrafter) to address pose estimation for image pairs with minimal or no overlap, and employs a Feature Matching Selector (FMS) to efficiently identify the most informative intermediate frames. The method achieves significant improvements in extreme pose estimation accuracy across four datasets.

Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation

Radial Attention identifies a "spatiotemporal energy decay" phenomenon in video diffusion models, wherein attention scores decay exponentially with spatiotemporal distance. Based on this finding, the authors design a static sparse attention mask with O(n log n) complexity, achieving up to 3.7× inference speedup on models such as HunyuanVideo and Wan2.1, and enabling 4× longer video generation via LoRA fine-tuning.
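The sketch below builds a toy static mask in this spirit: the band of keys a query may attend to narrows as the temporal distance between frames grows (roughly halving each time the distance doubles, which keeps the total work near O(n log n)). The exact schedule and block layout of the paper's mask will differ.

```python
# Toy radial sparsity mask: attention bandwidth decays with temporal distance.
import numpy as np

def radial_mask(num_frames, tokens_per_frame):
    n = num_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    for fi in range(num_frames):
        for fj in range(num_frames):
            dist = abs(fi - fj)
            # Bandwidth halves each time the temporal distance doubles.
            band = max(1, tokens_per_frame >> max(0, dist.bit_length() - 1))
            qi, kj = fi * tokens_per_frame, fj * tokens_per_frame
            for q in range(tokens_per_frame):
                lo = max(0, q - band // 2)
                hi = min(tokens_per_frame, q + band // 2 + 1)
                mask[qi + q, kj + lo:kj + hi] = True
    return mask

m = radial_mask(num_frames=8, tokens_per_frame=16)
print(m.shape, m.mean())   # density well below 1.0
```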

RLGF: Reinforcement Learning with Geometric Feedback for Autonomous Driving Video Generation

This paper is the first to systematically quantify geometric distortions in autonomous driving video generation. It proposes the RLGF framework, which combines hierarchical geometric rewards (vanishing point → lane lines → depth → occupancy) with a latent-space sliding-window optimization strategy, improving 3D object detection mAP from 25.75 to 31.42 and substantially closing the performance gap between synthetic and real data.

S²Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

To address the high calibration variance and optimization difficulty caused by extremely long token sequences in video diffusion Transformers, this paper proposes the S²Q-VDiT framework. By combining Hessian-aware salient data selection and attention-guided sparse token distillation, it achieves lossless quantization under W4A6 settings for the first time, yielding 3.9× model compression and 1.3× inference speedup.

Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking

Safe-Sora is the first method to embed graphical watermarks (e.g., logo images) directly into the video generation pipeline. It employs a coarse-to-fine adaptive matching strategy to assign watermark patches to visually similar frames and regions, and designs a 3D wavelet transform-enhanced Mamba architecture for spatiotemporal fusion. The method substantially outperforms all baselines in both video quality (FVD 3.77 vs. the second-best 154.35) and watermark fidelity.

Scaling RL to Long Videos

This paper proposes LongVILA-R1, a full-stack framework that extends VLM reasoning to long videos (up to 8192 frames) via a 104K long-video reasoning dataset, a two-stage CoT-SFT + RL training pipeline, and the MR-SP multimodal reinforcement sequence parallelism system, achieving 65.1%/71.1% on VideoMME (without/with subtitles).

Seeing the Wind from a Falling Leaf

This paper proposes an end-to-end differentiable inverse-graphics framework that jointly models object geometry and physical properties, force-field representations, and the underlying physical process, recovering invisible force fields (e.g., wind fields) from video via backpropagation while also supporting physics-based video generation and editing.
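A heavily simplified, self-contained analogue of the idea, assuming only a point mass, linear drag, and a constant wind: a differentiable Euler integrator lets gradient descent recover the wind from an observed trajectory. The paper's framework handles full geometry, materials, and spatially varying force fields, none of which appear here.

```python
# Recover a constant 2D wind force by backpropagating through a toy simulator.
import torch

def simulate(wind, steps=50, dt=0.05, mass=0.01, drag=0.1):
    pos, vel = torch.zeros(2), torch.zeros(2)
    gravity = torch.tensor([0.0, -9.8])
    traj = []
    for _ in range(steps):
        acc = gravity + (wind - drag * vel) / mass   # Newton's second law
        vel = vel + dt * acc
        pos = pos + dt * vel
        traj.append(pos)
    return torch.stack(traj)

true_wind = torch.tensor([0.02, 0.005])
observed = simulate(true_wind).detach()              # stands in for the video observation

wind = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([wind], lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = torch.mean((simulate(wind) - observed) ** 2)
    loss.backward()                                  # gradients flow through the integrator
    opt.step()
print(wind.detach())                                 # should land close to true_wind
```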

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

This paper proposes the Self Forcing training paradigm, which eliminates the exposure bias caused by train-inference distribution mismatch in Teacher Forcing and Diffusion Forcing by performing autoregressive self-rollout during training and applying a holistic video-level distribution matching loss (DMD/SiD/GAN). Built on Wan2.1-1.3B, it achieves real-time streaming video generation at 17 FPS on a single GPU while matching or surpassing the quality of bidirectional diffusion models that are orders of magnitude slower.
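Schematically, the training step looks like the sketch below: the generator rolls out a full clip by conditioning on its own samples, and the loss is applied to the whole video rather than per frame. `G`, `D`, and the loss are placeholders for the paper's distribution-matching objectives (DMD/SiD/GAN), not its actual models.

```python
# Schematic self-forcing training step (placeholder modules, not Wan2.1).
import torch

def self_forcing_step(G, D, first_frame, num_frames, optimizer):
    frames = [first_frame]
    for _ in range(num_frames - 1):
        context = torch.stack(frames, dim=0)       # condition on own samples,
        noise = torch.randn_like(first_frame)      # exactly as at inference time
        frames.append(G(context, noise))
    video = torch.stack(frames, dim=0)
    loss = -D(video).mean()                        # holistic, video-level objective
    optimizer.zero_grad()
    loss.backward()                                # gradients flow through the rollout
    optimizer.step()
    return loss.item()
```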

Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation

This paper proposes SCINE (Stable Cinemetrics), the first structured evaluation framework targeting professional video production. It defines a hierarchical taxonomy with 76 fine-grained cinematic control nodes, accompanied by large-scale professional annotation (80+ film practitioners, 20K+ videos, 248K annotations), revealing significant deficiencies of current state-of-the-art T2V models in professional cinematic control.

Training-Free Efficient Video Generation via Dynamic Token Carving

This paper proposes Jenga, a training-free inference acceleration framework for video DiTs that achieves 8.83× speedup on HunyuanVideo with only a 0.01% drop in VBench score. The framework combines dynamic block attention carving (sparse KV block selection after token reordering via 3D space-filling curves) and a progressive resolution strategy (coarse-to-fine denoising), which operate orthogonally.
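As a flavor of the token-reordering step, the snippet below sorts (t, y, x) video-token coordinates along a 3D Morton (Z-order) curve so that spatiotemporally neighboring tokens fall into contiguous blocks; whether Jenga uses this particular curve and block size is an assumption here.

```python
# Reorder video tokens along a 3D Z-order (Morton) curve so that each
# contiguous run of tokens forms a spatiotemporally compact block.
import numpy as np

def part_bits(v, bits=8):
    """Spread the low `bits` bits of v with two zero bits between each."""
    out = 0
    for i in range(bits):
        out |= ((v >> i) & 1) << (3 * i)
    return out

def morton3d(t, y, x):
    return part_bits(t) | (part_bits(y) << 1) | (part_bits(x) << 2)

T, H, W = 4, 8, 8
coords = [(t, y, x) for t in range(T) for y in range(H) for x in range(W)]
order = np.argsort([morton3d(t, y, x) for t, y, x in coords])
# Each run of, say, 64 entries of `order` indexes one compact block of tokens
# that can be selected or dropped as a unit during sparse KV attention.
print(order[:8])
```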

Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision

This paper reveals that pretrained video diffusion models naturally learn motion representations suitable for tracking during high-noise denoising stages, and proposes the TED framework that fuses motion and appearance features, achieving up to 10 percentage points improvement over existing self-supervised methods on tracking similar-looking objects.

Video Killed the Energy Budget: Characterizing the Latency and Power Regimes of Open Text-to-Video Models

This paper presents a systematic analysis of latency and energy consumption for open-source text-to-video (T2V) models. It establishes a FLOP-based analytical model that predicts scaling behaviour for Wan2.1 (quadratic along the spatial and temporal dimensions, linear in the number of denoising steps) and provides a cross-model energy benchmark across 7 T2V models.
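The scaling law can be turned into a back-of-the-envelope cost model, sketched below under assumed reference settings (832×480, 81 frames, 50 steps) and an assumed patch size; the constants are placeholders, and only the quadratic-in-tokens, linear-in-steps shape mirrors the paper's analysis.

```python
def num_tokens(height, width, frames, patch=(1, 16, 16)):
    """Spatiotemporal token count under an assumed patchification scheme."""
    return (frames // patch[0]) * (height // patch[1]) * (width // patch[2])

def relative_cost(height, width, frames, steps, ref=(480, 832, 81, 50)):
    """Attention-dominated cost relative to the reference configuration:
    linear in denoising steps, quadratic in the number of tokens."""
    n = num_tokens(height, width, frames)
    n_ref = num_tokens(ref[0], ref[1], ref[2])
    return (steps / ref[3]) * (n / n_ref) ** 2

print(relative_cost(960, 1664, 81, 50))    # ~16x: doubling H and W quadruples tokens
print(relative_cost(480, 832, 81, 100))    # 2x: twice the denoising steps
```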

VMDT: Decoding the Trustworthiness of Video Foundation Models

This paper introduces VMDT (Video-Modal DecodingTrust), the first unified benchmark for evaluating the trustworthiness of T2V and V2T video foundation models across five dimensions (safety, hallucination, fairness, privacy, and adversarial robustness). It covers large-scale assessments of 7 T2V and 19 V2T models and reveals the complex relationship between model scale and trustworthiness.

VORTA: Efficient Video Diffusion via Routing Sparse Attention

This paper proposes VORTA, a framework that achieves end-to-end 1.76× acceleration of video diffusion Transformers without quality degradation, through bucketed coreset attention (for modeling long-range dependencies) and a signal-aware routing mechanism (for adaptively selecting sparse attention branches). Combined with caching and distillation methods, it achieves up to 14.41× acceleration.

VSA: Faster Video Diffusion with Trainable Sparse Attention

This paper proposes VSA (Video Sparse Attention), an end-to-end trainable, hardware-aligned sparse attention mechanism with a hierarchical coarse-fine design: a coarse-grained stage predicts key token positions via cube pooling, and a fine-grained stage performs token-level attention within the predicted block-sparse regions. VSA accelerates both training and inference of video DiTs simultaneously: pretraining from scratch achieves a 2.53× reduction in training FLOPs without quality loss, while adapting to Wan2.1-1.3B yields a 6× attention speedup and reduces end-to-end inference time from 31s to 18s.
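To illustrate the coarse-to-fine idea, here is a tiny numpy version: pooled key "cubes" are scored per query block, and full attention is then computed only over the top-scoring cubes. Pooling by mean, the scoring rule, and the top-k value are all simplifying assumptions; VSA learns these stages end to end with hardware-aligned block sizes.

```python
# Toy two-stage block-sparse attention: coarse cube scoring, then fine attention
# restricted to the selected cubes.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coarse_to_fine_attention(q, k, v, cube=16, topk=4):
    n, d = q.shape
    nb = n // cube
    k_pool = k.reshape(nb, cube, d).mean(axis=1)        # coarse key cubes
    out = np.zeros_like(q)
    for b in range(nb):
        qb = q[b * cube:(b + 1) * cube]
        scores = qb.mean(axis=0) @ k_pool.T              # coarse scoring per query block
        keep = np.argsort(scores)[-topk:]                # predicted key cubes
        idx = np.concatenate([np.arange(c * cube, (c + 1) * cube) for c in keep])
        attn = softmax(qb @ k[idx].T / np.sqrt(d))       # fine, token-level attention
        out[b * cube:(b + 1) * cube] = attn @ v[idx]
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
print(coarse_to_fine_attention(q, k, v).shape)           # (256, 64)
```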