🎬 Video Generation
🔬 ICLR2026 · 98 paper notes
📌 Same area in other venues: 📷 CVPR2026 (152) · 💬 ACL2026 (4) · 🧪 ICML2026 (32) · 🤖 AAAI2026 (11) · 🧠 NeurIPS2025 (23) · 📹 ICCV2025 (49)
🔥 Top topics: Video Generation ×44 · Diffusion Models ×32 · Dynamic Scenes ×4 · Alignment/RLHF ×4 · Robotics ×4
- 3D Scene Prompting for Scene-Consistent Camera-Controllable Video Generation
-
This paper proposes 3DScenePrompt, which uses dual spatio-temporal conditions ("temporally adjacent frames + projected views from a static 3D point cloud") to extend an input video of arbitrary length into the future, maintaining scene consistency with the entire history while achieving precise camera control.
- AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes
-
By treating a pre-trained text-to-video (T2V) diffusion model as a "virtual cinematographer," this work implements a two-stage paradigm—first generating videos with implicit professional camera movements based on 4D human actions, and then explicitly extracting the viewpoint via a camera extrinsic diffusion branch—achieving automatic camera trajectory planning in 4D scenes with open-domain generalization and text controllability significantly exceeding specialized models.
- Anchor Frame Bridging for Coherent First-Last Frame Video Generation
-
To address semantic decay and visual collapse in the intermediate frames of First-Last Frame Video Generation (FLF2V), this paper proposes a training-free Anchor Frame Bridging (AFB) method. By adaptively inserting an "anchor frame" at the point of most severe temporal rupture to "relay" semantics from start to end, AFB achieves a 16.58% improvement in FVD and 10.21% in PSNR on Wan2.1-I2V.
- Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models
-
To be added after deeper reading.
- Any-to-Bokeh: Arbitrary-Subject Video Refocusing with Video Diffusion Model
-
Any-to-Bokeh models video refocusing/bokeh rendering as a single-step video diffusion process guided by a focal-plane adaptive MPI geometric prior. It allows users to freely specify the focal plane and blur intensity for any input video, addressing temporal flickering via three-stage progressive training and weighted overlapping inference, outperforming previous image/MPI bokeh methods on both synthetic and real-world data.
- Arbitrary Generative Video Interpolation
-
ArbInterp proposes a generative video frame interpolation framework that supports arbitrary timestamps and lengths. It achieves precise temporal control via Timestamp-aware Rotary Positional Encoding (TaRoPE) and enables seamless splicing of long sequences through an appearance-motion decoupled conditioning strategy.
- Astraea: A Token-wise Acceleration Framework for Video Diffusion Transformers
-
Addressing the inference bottlenecks of Video Diffusion Transformers, Astraea proposes a framework comprising token-wise selection, GPU-friendly sparse attention, and evolutionary token budget search, achieving up to 2.4× acceleration on a single GPU and up to 13.2× in multi-GPU scenarios while maintaining generation quality.
- AUHead: Realistic Emotional Talking Head Generation via Action Units Control
-
AUHead decomposes the "audio \(\to\) emotional video" generation problem into two stages: first, an audio language model (ALM) "perceives emotion" from speech and infers a discrete sequence of Facial Action Units (AUs); then, an AU-driven controllable diffusion model renders these AUs into talking head videos that are both synchronized with the audio and nuanced in expression. It outperforms existing methods in emotional realism and lip-sync accuracy on MEAD/CREMA.
- Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy
-
This paper proposes DirectAnimator, which discards intermediate representations like skeletons or pose estimation. Instead, it animates a reference portrait directly using the raw pixels of driving videos. The method extracts a "Driving Cue Triplet" (Pose/Face/Location) from the original video and injects these cues into the denoising process via a CueFusion DiT Block. Coupled with a Same2X training strategy that aligns cross-ID features to a same-ID model, the system achieves SOTA performance on TikTok and Unseen datasets with 6.7× faster convergence and lower computational costs.
- BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration
-
BindWeave replaces traditional shallow fusion mechanisms with a Multimodal Large Language Model (MLLM) to parse complex text instructions involving multiple subjects. It generates subject-aware hidden states as conditioning signals for DiT, combined with CLIP semantic features and VAE fine-grained appearance features, enabling high-fidelity and subject-consistent video generation.
- BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation
-
BLADE integrates "dynamic block-sparse attention" and "few-step distillation" into a unified data-free joint training framework for collaborative optimization. It achieves 14.10× end-to-end acceleration on Wan2.1-1.3B and 8.89× on CogVideoX-5B, with VBench-2.0 quality scores surpassing the original 50-step model.
- BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching
-
BWCache identifies that features of individual blocks in video DiTs exhibit a U-shaped similarity curve across adjacent timesteps (highly redundant in intermediate steps). Consequently, it caches and reuses features at block granularity, using a lightweight similarity metric to decide dynamically when to reuse them. This achieves training-free, plug-and-play acceleration of up to 2.6× with almost no drop in visual quality.
- Captain Cinema: Towards Short Movie Generation
-
Captain Cinema decomposes the task of "generating a short movie" into two steps: top-down planning of a full set of keyframe storyboards, followed by bottom-up synthesis of video motion between these keyframes. It utilizes Golden Ratio Memory Compression (GoldenMem) to fit historical frames from thousands of seconds across dozens of shots into a fixed token budget, maintaining character and scene consistency over long sequences.
- ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask
-
ConsisDrive utilizes "instance masks" within a diffusion-based driving world model to constrain both attention and loss to individual objects. This ensures each visual token interacts only with its own instance's identity and trajectory tokens (preventing a bus from transforming into a truck or a red car into a black one) while shifting supervision focus toward the foreground. Consequently, it achieves a record FVD of 37.23 and FID of 3.88 on nuScenes, significantly enhancing downstream perception and tracking metrics.
- Consistent Noisy Latent Rewards for Trajectory Preference Optimization in Diffusion Models
-
This paper proposes SLRM + TAPO: first, a score-based latent reward model that preserves diffusion score capabilities is used to stably evaluate intermediate sampling states; then, multi-timestep SDE exploration and filtering are employed to construct more consistent win-lose trajectory pairs, thereby improving preference alignment for text-to-image and text-to-video diffusion models.
- Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning
-
This paper reinterprets the space-time mask in pre-trained Image-to-Video (I2V) models—originally used only for "preserving the first frame and generating subsequent frames"—as a spatially-varying "keep/regenerate" instruction. By combining this with LoRA fine-tuning on a single input video, the model learns the motion of the source video while capturing the target appearance from reference frames. This enables controllable propagation of "edit the first frame only" changes to the entire video, significantly outperforming AnyV2V, I2VEdit, and Go-with-the-Flow in first-frame-guided editing.
- Controllable Video Generation with Provable Disentanglement
-
This paper proposes CoVoGAN, which models static content variables and time-varying dynamic style variables separately. It provides identifiability guarantees through the principle of minimal change, sufficient change properties, and temporal conditional independence constraints, enabling more independent control over factors such as head movement, blinking, and camera displacement in video generation.
- DanceTogether: Generating Interactive Multi-Person Video without Identity Drifting
-
DanceTogether generates long-duration multi-person interactive videos using a single reference image and individual pose-mask sequences per actor. The core mechanism continuously binds "who the person is" with "how the person moves" during the diffusion denoising process, significantly mitigating identity drifting during character swaps, occlusions, and physical contact.
- DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing
-
DreamSwapV redefines "video subject swapping" as a mask-guided video inpainting task. Given a source video, a mask designating the object to be replaced, and a reference image of the target subject, the model performs end-to-end swapping of any subject with any new target. This is achieved through a conditional fusion module and an adaptive masking strategy for fine-grained control and natural subject-environment interaction. It outperforms VACE, HunyuanCustom, and commercial models like Kling 1.6 on the newly established DreamSwapV-Benchmark.
- DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving
-
DrivingGen introduces the first comprehensive benchmark for autonomous driving video world models. It features a diverse evaluation dataset across various weather, geographies, times, and complex scenarios, along with a four-dimensional evaluation system (Distribution, Quality, Temporal Consistency, and Trajectory Alignment). Benchmarking 14 SOTA models reveals the core trade-offs between general-purpose and driving-specialized models.
- DSA: Efficient Inference For Video Generation Models via Distributed Sparse Attention
-
DSA intertwines "sparse attention" and "sequence parallelism," two previously independent acceleration paths. By matching spatial and temporal sparse attention patterns in video diffusion models with partial-ring and Ulysses parallelism respectively, and hiding communication within computation via dynamic scheduling, it achieves a \(10.79 \times\) speedup for 720p/5s video generation on 8× H100 compared to single-GPU dense attention (\(1.43 \times\) faster than USP) with negligible quality loss.
- Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation
-
Dual-IPO continuously improves the quality and human preference alignment of text-to-video generation through multi-round bidirectional iterative optimization between a reward model and a video generation model, enabling a 2B model to surpass a 5B model without massive human annotations.
- EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer
-
EchoMotion moves beyond treating human video generation as a pure pixel regression problem by employing a dual-branch DiT to explicitly model the joint distribution \(p(x, m \mid y)\) of "video appearance + SMPL parametric motion." Combined with temporally synchronized MVS-RoPE and a two-stage training strategy, it significantly improves anatomical plausibility and motion coherence in complex human videos, while inherently enabling bidirectional cross-modal video-to-motion and motion-to-video generation.
- EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning
-
EditVerse unifies text, images, and videos into a single interleaved token sequence. By employing full self-attention for in-context learning, a single 2B model supports both generation and editing across image and video domains. A self-constructed 232K video editing data pipeline transfers editing knowledge from the image domain to the data-scarce video domain. Experimental results on the EditVerseBench show that this method outperforms open-source baselines and even surpasses the commercial model Runway Aleph in editing fidelity.
- EffiVMT: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning
-
EffiVMT addresses the dual challenges of "motion inconsistency" and "slow finetuning" in DiT-based video motion transfer. It proposes a three-stage spatial-temporal decoupled finetuning process (head classification → spatial LoRA → temporal LoRA) combined with sparse motion sampling and adaptive RoPE, achieving significant speedups while maintaining higher motion fidelity and temporal consistency.
- EgoTwin: Dreaming Body and View in First Person
-
EgoTwin jointly models "first-person video generation" and "human motion generation" within a single Diffusion Transformer. By utilizing a head-centric motion representation and cross-modal attention with causal constraints, the model ensures that the generated video perspective trajectories and human movements are synchronized in time and aligned geometrically.
- FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge
-
Addressing the phenomenon where the decoding phase of auto-regressive (AR) video generation is dominated by MLP modules and adjacent frames exhibit highly similar MLP outputs, FastCar utilizes the "Temporal Attention Score (TAS)" to determine when to directly reuse cached MLP outputs from the previous frame to skip computations. Accompanied by an FPGA accelerator with dynamic resource scheduling, it achieves over 2.1× decoding acceleration on edge devices with negligible loss in visual quality.
- FastVMT: Eliminating Redundancy in Video Motion Transfer
-
By identifying and eliminating two types of redundancy in training-free video motion transfer pipelines—"motion redundancy" in attention and "gradient redundancy" in optimization—FastVMT utilizes sliding window motion extraction and skip-step gradient optimization to achieve an average 3.43× speedup (up to 14.91×) with almost no loss in fidelity or temporal consistency.
- FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation
-
FilMaster is an end-to-end system for automatically generating editable films from text and character/scene reference images. It explicitly introduces cinematic language and professional post-production workflows from real films into the generation pipeline, significantly outperforming Anim-Director, MovieAgent, and LTX-Studio in both camera language and cinematic rhythm.
- Flow Caching for Autoregressive Video Generation
-
FlowCache identifies that different chunks in autoregressive video generation are in heterogeneous denoising states at the same timestep. Consequently, it replaces "uniform whole-frame caching" with an independent chunkwise adaptive caching strategy, complemented by a joint importance-redundancy KV cache compression. This achieves 2.38× and 6.7× speedups on MAGI-1 and SkyReels-V2, respectively, with near-lossless visual quality.
- Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
-
Frame Guidance is a training-free frame-level guidance method that achieves various controllable video generation tasks, such as keyframe guidance, stylization, and looping videos, without model modification. It utilizes two core components: latent slicing (reducing VRAM by 60×) and Video Latent Optimization (VLO).
- FreeViS: Training-free Video Stylization with Inconsistent References
-
FreeViS incorporates multiple "mutually inconsistent" stylized reference frames into a pre-trained I2V diffusion model. Using a trio of isolated attention, high-frequency compensation, and optical flow guidance, it solves propagation errors found in single-reference methods under completely training-free conditions, producing video stylization with rich stylistic details and strong temporal consistency.
- Generative View Stitching
-
GVS applies "diffusion stitching from robot planning" to video generation: using a training-free parallel sampling algorithm, it enables any Diffusion Forcing video model to generate long videos along pre-defined camera trajectories. By allowing the current frame to "see the future," it avoids collisions, maintains consistency, and enables loop closure.
- Geometry-aware 4D Video Generation for Robot Manipulation
-
This paper proposes a geometry-aware 4D video generation framework that trains video diffusion models via cross-view pointmap alignment supervision. By jointly predicting RGB and pointmaps, the model achieves spatio-temporally consistent multi-view RGB-D videos. It generates consistent videos from new perspectives without requiring camera pose input and recovers robot end-effector trajectories using off-the-shelf 6DoF pose trackers.
- Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling
-
By aligning the intermediate features of a video diffusion model to the geometric representations of the 3D foundation model VGGT (using decoupled angular and scale alignment objectives), the diffusion model trained on pure video data "internalizes" 3D structures. This significantly improves geometric and temporal consistency in long-term video generation and enables the extraction of explicit 3D geometry during inference.
- Improving Autoregressive Video Modeling with History Understanding
-
This paper identifies the "quality of internal representations of historical frames" as an overlooked key variable in diffusion-based autoregressive video generation (VideoAR). It proposes MiMo (Masked History Modeling), which performs masked reconstruction of clean historical frames alongside the diffusion denoising objective. This approach learns stronger history representations in a self-supervised manner, significantly improving convergence speed and generation quality without relying on Visual Foundation Models (VFM).
- IVEBench: Modern Benchmark Suite for Instruction-Guided Video Editing Assessment
-
IVEBench constructs a modern evaluation suite specifically for instruction-guided video editing (IVE), utilizing 600 high-quality source videos, 35 subcategories across 8 major editing instruction types, and a three-dimensional metric system (Video Quality, Instruction Compliance, and Video Fidelity) to systematically expose the weaknesses of existing models in complex instruction following and high-fidelity editing.
- JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization
-
This paper proposes JavisDiT, a Joint Audio-Video Diffusion Transformer that achieves fine-grained audio-visual spatio-temporal alignment through a Hierarchical Spatio-Temporal Prior Synchronizer (HiST-Sypo). It also introduces a new benchmark, JavisBench (comprising 10K complex scene samples), and a new evaluation metric, JavisScore.
- JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
-
JavisDiT++ is a concise and unified framework for Joint Audio-Video Generation (JAVG). It enhances generation quality via modality-specific MoE, achieves frame-level synchronization through time-aligned RoPE, and aligns with human preferences via audio-video DPO. Based on Wan2.1-1.3B, it achieves SOTA performance using only approximately 1M public data samples.
- Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control
-
RoboMaster decomposes the robot-object interaction process into three phases (pre-interaction, during-interaction, and post-interaction) via collaborative trajectories. Combined with appearance- and shape-aware object embeddings, it achieves high-quality video generation for robotic manipulation.
- Light-X: Generative 4D Video Rendering with Camera and Illumination Control
-
Light-X unifies two previously separate research paths of controllable video generation—"camera viewpoint" and "scene illumination"—into a single diffusion model for the first time. By projecting geometry/motion and illumination into two sets of point clouds as fine-grained conditions, it achieves decoupling. Furthermore, it introduces "Light-Syn," a "degradation + inverse mapping" data synthesis pipeline to create "multi-view × multi-light" paired training data, which is virtually impossible to collect in the real world.
- LightCtrl: Training-free Controllable Video Relighting
-
LightCtrl extends the training-free paradigm of "per-frame image relighting + video diffusion prior for temporal consistency" into the first controllable video relighting method supporting user-defined light trajectories. By utilizing two modules, Light Map Injection and Geometry-Aware Relighting, it enables the generated lighting to follow user-drawn paths while suppressing interference from the original illumination in the source video.
- LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference
-
LikePhys uses the denoising loss of diffusion models as a proxy for ELBO likelihood to compare "physically plausible vs. implausible" synthetic video pairs. This enables training-free quantification of intuitive physics understanding in video diffusion models and yields the PPE evaluation metric, which aligns closely with human preferences.
- LongLive: Real-time Interactive Long Video Generation
-
LongLive utilizes a frame-level causal autoregressive framework combined with a trio of features—KV-recache, streaming long training (train-long-test-long), and short-window attention with frame-level attention sinks. It fine-tunes a 1.3B short-video model within 32 GPU-days into an interactive long-video generator capable of real-time generation (20.7 FPS) on a single H100, supporting real-time prompt switching and video lengths up to 240 seconds.
- Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective
-
This paper proposes Lumos-1, a unified video generation model based on the LLM architecture. It addresses visual spatio-temporal encoding issues via MM-RoPE (Distributed Multimodal RoPE) and solves inter-frame loss imbalance via AR-DF (Autoregressive Discrete Diffusion Forcing). Trained on only 48 GPUs, it achieves competitive results on GenEval, VBench-I2V, and VBench-T2V.
- LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
-
LumosX introduces "Relational Self-Attention" and "Relational Cross-Attention" into the Wan2.1 video DiT. By utilizing Relational Rotary Positional Encoding (R2PE), Causal Self-Attention Mask (CSAM), and Multi-level Cross-Attention Mask (MCAM), it explicitly binds each face with its attributes (clothing, accessories, hairstyle) into independent subject groups. Combined with a data pipeline featuring face-attribute dependency annotations, it addresses the long-standing "attribute entanglement" issue in personalized multi-subject video generation.
- MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement
-
MAGREF utilizes "region-aware masking + pixel-level channel concatenation" to inject an arbitrary number and category of reference subjects into a pre-trained I2V backbone. By employing "subject disentanglement" to inject semantic values of individual text tokens into corresponding visual regions, it achieves high-fidelity and controllable any-reference video generation without modifying the underlying architecture.
- MATRIX: Mask Track Alignment for Interaction-aware Video Generation
-
MATRIX discovers that the relationships between subjects, objects, and actions in Video DiTs are primarily encoded within a few interaction-dominant attention layers. It employs multi-instance mask tracks to regularize the grounding and propagation attention of these layers, significantly enhancing interaction fidelity and temporal consistency in text-to-video generation.
- MIMIC: Mask-Injected Manipulation Video Generation with Interaction Control
-
MIMIC decomposes "generating manipulation videos" into two stages: first, an Interaction-Motion-Aware (IMA) attention mechanism learns a sequence of semantic masks from a reference video to serve as motion trajectories; second, Pair Prompt Control renders these masks into frames, generating high-fidelity and controllable manipulation videos while preserving contact-rich interaction semantics.
- Mixture of Contexts for Long Video Generation
-
The authors reframe long video generation as an "internal information retrieval" problem and propose Mixture of Contexts (MoC) — a parameter-free yet trainable sparse attention routing module. It allows each query to dynamically select a few relevant chunks plus mandatory anchors (text + local window) while using causal masking to avoid feedback loops. This maintains or even improves identity/motion/scene consistency in minute-long videos while pruning 85% of token pairs and reducing attention FLOPs by 7×.
- MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models
-
MoAlign distills a low-dimensional motion-only subspace from a frozen video encoder (enforced via optical flow supervision) and aligns the middle-layer features of a video diffusion model to this subspace using soft relationship alignment. This allows the model to generate physically more plausible videos without requiring any inference-time conditions or simulations.
- MoCa: Modeling Object Consistency for 3D Camera Control in Video Generation
-
MoCa avoids explicit 3D reconstruction by decomposing the observation that "smooth camera motion maintains object consistency in viewpoint, appearance, and motion" into three types of consistency constraints. Using a dual-branch diffusion framework, it simultaneously manages camera trajectories, appearance stability, and motion disentanglement to implicitly learn the 3D relationship between the camera and the scene.
- Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model
-
This paper proposes the ANSE framework and its core scoring function, BANSA, which migrates "Bayesian Active Learning by Disagreement (BALD)" from classification tasks to the attention space of diffusion models. By measuring the entropy divergence of attention maps under multiple random perturbations, the method quantifies the model's "certainty" regarding a specific initial noise seed. This allows for the selection of superior initial noise seeds using only a subset of attention layers from the first denoising step, without retraining or running the full denoising process.
- MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
-
MoGA employs a lightweight learnable token router to group tokens semantically and perform full attention within these groups. This eliminates the "coarse block score estimation" step found in block-sparse attention, enabling the end-to-end generation of minute-level, multi-shot, 480p/24fps long videos with a context length of approximately 580,000 tokens.
- MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling
-
The authors propose the MoSA framework, which decouples human video generation into "structure generation" (pre-generating physically plausible motion skeletons via a 3D Transformer) and "appearance generation" (synthesizing videos via DiT guided by skeletons). A Human-Aware Dynamic Control (HADC) module is designed to expand sparse skeleton signals into the entire motion region. Together with dense tracking loss and contact constraints, MoSA outperforms SOTA models like HunyuanVideo and Wan 2.1 across metrics including FVD and CLIPSIM.
- MotionStream: Real-Time Video Generation with Interactive Motion Controls
-
The authors propose MotionStream, the first real-time streaming video generation system with motion control. The method trains a bidirectional motion-controlled teacher with a lightweight track head, then distills it into a causal student via Self Forcing and DMD. By introducing attention sinks and a rolling KV cache, the system fully matches the training and inference distributions. It reaches 17 FPS (29 FPS with Tiny VAE) at 480P on a single H100 GPU, supporting infinite-length generation at constant speed.
- MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation
-
MotionWeaver extends character image animation from single-person to multi-humanoid (robots, anthropomorphic animals, game characters) scenes. By "extracting identity-agnostic unified motion representations + fusing motion and video latents in a shared 4D space + hierarchical 4D supervision," it effectively addresses identity confusion and occlusion in multi-character interactions.
- MTVCraft: Tokenizing 4D Motion for Arbitrary Character Animation
-
MTVCraft quantizes 3D joint coordinate sequences (4D motion) from driving videos directly into discrete tokens. Combined with a motion-aware DiT featuring 4D positional encoding, it bypasses the pixel-alignment constraints of traditional 2D rendered pose maps, achieving high-quality pose-guided animation for arbitrary characters (including non-human objects).
- NarrLV: Towards a Comprehensive Narrative-Centric Evaluation for Long Video Generation
-
NarrLV introduces "Temporal Narrative Atoms (TNA)" as the fundamental unit for quantifying narrative richness. It features a prompt suite with an arbitrarily extendable number of TNAs and a three-stage progressive evaluation metric based on MLLM Q&A. This work systematically measures the "storytelling" capability of long video generation models for the first time, revealing that current models can only reliably express approximately 2 narrative units.
- Neodragon: Mobile Video Generation Using Diffusion Transformer
-
Neodragon compresses a video DiT (based on Pyramidal-Flow) into the Qualcomm Hexagon NPU of smartphones/laptops through four major operations: text encoder distillation, asymmetric decoder distillation, MMDiT block pruning, and step distillation extended to pyramidal flow matching (DMD). It generates 49 frames of \(640 \times 1024\) video in ~6.7 seconds, achieving a VBench total score of 81.61 and setting a new SOTA for on-device video generation.
- NeRV-Diffusion: Diffuse Implicit Neural Representation for Video Synthesis
-
This work compresses a video into the "weights of a small convolutional network" (i.e., NeRV, an Implicit Neural Representation). A diffusion Transformer then performs denoising directly on these Gaussian-distributed weight tokens to generate new videos. This approach bypasses the frame-wise feature maps and cross-frame attention of traditional video tokenizers, resulting in a more compact framework with faster decoding and sub-linear growth in resolution/duration overhead.
- NewtonGen: Physics-consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics
-
NewtonGen integrates a learnable "Neural Newtonian Dynamics (NND)" module into the text-to-video pipeline. It first utilizes a Neural ODE to learn the latent dynamics of various Newtonian motions from a minimal amount of physics-clean data, then converts predicted future physical states into structured optical flow to guide video generators, achieving physics-consistent and parameter-controllable video generation.
- Phantom-Data: Towards a General Subject-Consistent Video Generation Dataset
-
Addressing the prevalent "copy-paste" issue in subject-to-video (S2V) generation, this paper constructs Phantom-Data, the first general cross-pair subject-consistent dataset. It contains approximately 1 million identity-consistent pairs. Through a three-stage pipeline (S2V Detection → Contextually Diverse Retrieval → Prior-Based Identity Verification), the method finds reference images in different scenarios for each subject from 53 million videos and 3 billion images, significantly improving text-following capabilities and image quality while maintaining identity consistency.
- PhyWorldBench: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models
-
PhyWorldBench constructs a large-scale benchmark covering 50 physical sub-phenomena, 1,050 prompts, and 12 mainstream text-to-video models. Using human evaluation and a context-aware MLLM evaluator system, it reveals significant shortcomings of current video generation models in realistic physics, complex interactions, and anti-physics instruction following.
- Pixel-Perfect Puppetry: Precision-Guided Enhancement for Face Image and Video Editing
-
FlowGuide explicitly extracts semantic directions induced by editing conditions in the diffusion UNet bottleneck as orthogonal bases, then uses the geometric alignment between the reconstruction and editing paths to dynamically correct the denoising noise. This enables more precise attribute modification in facial images and videos while preserving identity, background, and temporal consistency.
- PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation
-
PreciseCache is a plug-and-play acceleration framework that detects and skips truly redundant computations in video generation. It consists of LFCache (step-level, based on the Low-Frequency Difference (LFD) metric) and BlockCache (block-level, based on input-output difference metrics), achieving an average 2.6× speedup on mainstream models like Wan2.1-14B without significant quality loss.
- Pusa V1.0: Unlocking Temporal Control in Pretrained Video Diffusion Models via Vectorized Timestep Adaptation
-
Pusa V1.0 replaces the single scalar timestep in pretrained video diffusion models with a frame-wise timestep vector. Through non-destructive Vectorized Timestep Adaptation and minimal LoRA fine-tuning, Wan-T2V gains zero-shot capabilities for image-to-video (I2V), start-end frame control, and video extension while preserving its text-to-video (T2V) quality, achieving performance on VBench-I2V comparable to Wan-I2V.
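The difference between a scalar timestep and a frame-wise timestep vector can be sketched in a few lines of NumPy. This is only an illustration: it assumes a linear flow-matching style interpolation \(x_t = (1 - t)\,x_0 + t\,\epsilon\), which the summary does not confirm, and the function name `noise_per_frame` and the `(frames, channels, height, width)` layout are hypothetical.

```python
import numpy as np

def noise_per_frame(x0, eps, t):
    """Noise a video latent with one timestep per frame.

    x0, eps: arrays of shape (frames, channels, height, width).
    t: array of shape (frames,), each entry in [0, 1].
    Assumes linear interpolation between data and noise (illustrative choice).
    """
    t = np.asarray(t, dtype=x0.dtype).reshape(-1, 1, 1, 1)  # broadcast over C, H, W
    return (1.0 - t) * x0 + t * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 2, 3, 3))
eps = rng.standard_normal((4, 2, 3, 3))

# Scalar-timestep behaviour: every frame shares the same noise level.
t_scalar = np.full(4, 0.7)
# Image-to-video style conditioning: keep frame 0 clean, noise the rest.
t_i2v = np.array([0.0, 0.7, 0.7, 0.7])

x_scalar = noise_per_frame(x0, eps, t_scalar)
x_i2v = noise_per_frame(x0, eps, t_i2v)
```

With the vector form, conditioning on a start frame (or on start and end frames) amounts to choosing which entries of `t` are set to zero, which is one plausible reading of how the same backbone can cover I2V, start-end control, and extension.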
- QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification
-
This paper proposes the QuantSparse framework, which for the first time synergistically integrates model quantization and attention sparsification for the compression of video diffusion Transformers. By addressing the "amplified attention shift" caused by the naive combination of these two techniques through Multi-Scale Salient Attention Distillation (MSAD) and Second-order Sparse Attention Reparameterization (SSAR), it achieves 3.68× storage compression and 1.88× inference speedup on HunyuanVideo-13B with W4A8 quantization and 15% attention density, while maintaining near-lossless generation quality.
- ReactID: Synchronizing Realistic Actions and Identity in Personalized Video Generation
-
ReactID employs a three-pronged approach—high-precision data construction, difficulty-aware curriculum learning, and timeline-structured conditioning (incorporating subject-aware cross-attention and time-adaptive RoPE)—to simultaneously enhance subject identity consistency and action realism in personalized video generation, mitigating the long-standing trade-off between the two.
- Real-Time Motion-Controllable Autoregressive Video Diffusion
-
This paper proposes AR-Drag—the first few-step autoregressive image-to-video (I2V) diffusion model enhanced by Reinforcement Learning. By using Self-Rollout to maintain Markovian properties, compressing ultra-long decision horizons with selective stochastic sampling, and introducing trajectory-based rewards for GRPO, it achieves a first-frame latency of 0.44s with 1.3B parameters, outperforming existing bidirectional motion-controllable models in both visual quality and motion control.
- Realtime Video Frame Interpolation Using One-Step Diffusion Sampling
-
RDVFI transforms video frame interpolation from "directly drawing intermediate frames with diffusion" into "using one-step diffusion to generate sparse latent keyframes and fitting high-order continuous pixel trajectories to warp input pixels." It achieves real-time speeds of 17 FPS at 1024×576 (approx. 44× faster than SOTA) while minimizing ghosting and deformation in large-motion scenes.
- Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
-
Rolling Forcing transforms frame-by-frame autoregressive video diffusion into a rolling multi-frame joint denoising process, utilizing initial frame attention sinks to anchor global appearance. This achieves near 16 FPS real-time generation of multi-minute long videos on a single GPU while significantly suppressing long-term error accumulation.
- SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
-
SANA-Video replaces the full attention in video DiTs with linear attention to reduce complexity from \(O(N^2)\) to \(O(N)\). By leveraging the additive property of linear attention, a "constant memory" block autoregressive KV cache is designed. This allows a 2B model to be trained on 64 H100s in 12 days (only 1% of MovieGen's cost), producing 720×1280 minute-long videos that match Wan2.1-14B on VBench while being 16× faster during inference.
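The "additive property" mentioned above can be illustrated with a small NumPy sketch of block-causal linear attention. It is not SANA-Video's implementation: the feature map `phi` (elu + 1), the block-causal masking, and all function names are assumptions chosen for illustration. The point is that the running state has a fixed size regardless of how many blocks have been processed.

```python
import numpy as np

def phi(x):
    # Positive feature map (elu(x) + 1); an assumed choice for illustration.
    return np.where(x > 0, x + 1.0, np.exp(x))

def block_linear_attention(q, k, v, block):
    """Block-causal linear attention with a fixed-size running state."""
    n, d = q.shape
    state = np.zeros((d, v.shape[1]))  # running sum of phi(k)^T v
    norm = np.zeros(d)                 # running sum of phi(k)
    out = np.empty_like(v)
    for start in range(0, n, block):
        sl = slice(start, start + block)
        fk, fq = phi(k[sl]), phi(q[sl])
        state += fk.T @ v[sl]          # adding a block never grows the state
        norm += fk.sum(axis=0)
        out[sl] = (fq @ state) / (fq @ norm)[:, None]
    return out

def block_quadratic_reference(q, k, v, block):
    """Same computation via an explicit N x N score matrix, for checking."""
    n = q.shape[0]
    scores = phi(q) @ phi(k).T
    blk = np.arange(n) // block
    scores = scores * (blk[:, None] >= blk[None, :])  # block-causal mask
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
fast = block_linear_attention(q, k, v, block=2)
slow = block_quadratic_reference(q, k, v, block=2)
```

Here `state` and `norm` play the role of the cache: their shapes depend only on the feature dimensions, while the quadratic reference needs the full score matrix.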
- Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
-
Self-Forcing++ utilizes a short-video bidirectional diffusion teacher as a "short-window error corrector." It performs Extended DMD training by randomly sampling degraded segments from long-video trajectories generated by the student, combined with a rolling KV cache and optical flow rewards. This extends a 1.3B autoregressive video model from 5-second to 100-second and even 4-minute generation while significantly mitigating overexposure, darkening, stagnation, and error accumulation.
- SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion
-
SIGMark proposes the first blind-extraction in-generation watermarking framework for modern video diffusion models. It achieves constant-time blind extraction via Global Frame-level Pseudo-Random Coding (GF-PRC) and enhances temporal robustness under Causal 3D VAE via a Segmented Group Ordering (SGO) module. It reaches 90%+ bit accuracy with a capacity of 512×16 bits on HunyuanVideo and Wan-2.2.
- SimpleGVR: A Simple Baseline for Latent-Cascaded Generative Video Super-Resolution
-
SimpleGVR moves the super-resolution (VSR) stage of cascaded text-to-video generation entirely into the latent space. By using a "latent upsampler" to eliminate redundant decoding/re-encoding and employing two AIGC-aligned degradation strategies along with three training optimizations, this lightweight diffusion VSR outperforms existing methods on AIGC100. Furthermore, the "512p + SR" cascaded scheme surpasses end-to-end 1080p generation in both quality and speed.
- Stable Video Infinity: Achieving Infinite-Length Video Generation via "Error Recycling"
-
Addressing the fundamental gap in autoregressive long video generation—where training assumes clean inputs but testing is conditioned on error-prone self-generated frames—this paper proposes Error-Recycling Fine-Tuning. By collecting errors made by the DiT itself into a memory bank and re-injecting them into clean inputs to simulate degradation trajectories, the model is forced to actively correct errors. This enables extending video length from seconds to "infinite" with zero additional inference overhead, achieving SOTA results across consistency, creativity, and conditional benchmarks.
- SteinsGate: Injecting Causality into Diffusion Models with Path Integral for Long Video Generation
-
This paper proposes the InstructVC framework and its inference-time instance, SteinsGate. It uses an MLLM to decompose long prompts into "action-duration" sequences for fine-grained temporal control and introduces a novel Video Path Integral to transform pre-trained TI2V diffusion models into "history-aware" autoregressive continuation models at inference time, generating coherent long videos with natural transitions across multiple actions.
- Streaming Autoregressive Video Generation via Diagonal Distillation
-
Diagonal Distillation (DiagDistill) achieves a 277.3× acceleration in streaming autoregressive video generation and reaches real-time generation at 31 FPS. It does so through a diagonal denoising strategy (more steps for early stages, fewer for later stages) and a flow distribution matching loss.
- Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!
-
This paper introduces the REVEL task—allowing users to "drag anything, anytime" during the streaming generation of autoregressive video diffusion models—and proposes DragStream, a training-free method that suppresses latent space drift caused by dragging accumulation via "Adaptive Distribution Self-Rectification" and mitigates context frame interference via "Spatio-Frequency Selective Optimization."
- Syncphony: Audio-to-Video Generation with Synchronized Visual Dynamics using Diffusion Transformers
-
Syncphony inserts audio cross-attention into a pre-trained DiT video backbone, utilizing a "Motion-aware Loss" to strengthen supervision in high-motion regions and "Audio Sync Guidance" to amplify audio influence during sampling. It generates 380×640, 24fps videos precisely synchronized with audio and proposes CycleSync, a synchronization metric based on back-inferring audio from video.
- Target-Aware Video Diffusion Models
-
A target-aware video diffusion model is proposed that generates videos of actors interacting with a specified target using only an input image and a segmentation mask of the target object. The core innovation involves introducing a special [TGT] token and designing a selective cross-attention loss to focus the model on the target's spatial location, outperforming baselines in both target alignment and video quality.
- The Quest for Generalizable Motion Generation: Data, Model, and Evaluation
-
This paper addresses "generalizable 3D human motion generation" by simultaneously augmenting data, refining the model, and redesigning evaluation. It expands the long-tail motion coverage of motion generation (MoGen) using open-world semantic priors from video generation (ViGen), converts these priors into usable text-to-motion capabilities via a dual-branch gated DiT and a distilled version (ViMoGen-light), and uses MBench to validate generalization, alignment, and motion quality more precisely.
- Time-to-Move: Training-Free Motion-Controlled Video Generation via Dual-Clock Denoising
-
Time-to-Move treats rough animations obtained via dragging or depth re-projection as motion sketches. By anchoring appearance using the first frame and employing different noise clocks for controlled and uncontrolled regions during sampling, it achieves precise motion and pixel-level appearance control without training or modifying the backbone.
- ToonComposer: Streamlining Cartoon Production with Generative Post-Keyframing
-
ToonComposer merges the traditionally fragmented "inbetweening" and "colorization" stages of cartoon production into a unified generative "post-keyframing" stage. With only one colored reference frame and a minimal set of keyframe sketches, it leverages a DiT video foundation model to directly generate high-quality cartoon videos, surpassing prior two-stage methods in quality, motion consistency, and efficiency.
- Towards One-Step Causal Video Generation via Adversarial Self-Distillation
-
Addressing the quality collapse of causal video diffusion models in few-step (1-2 step) generation, this paper proposes Adversarial Self-Distillation (ASD) within the DMD distillation framework. By using a discriminator to align the distributions of \(n\)-step and \(n+1\)-step outputs from the student model, and combining this with a First Frame Enhancement (FFE) strategy during inference, a single distilled model maintains high quality across 1/2/4-step settings, surpassing the previous state of the art on VBench.
- TPDiff: Temporal Pyramid Video Diffusion Model
-
TPDiff divides the denoising process of video diffusion into multiple stages, progressively doubling the framerate along the denoising path (with full framerate only in the final stage). Combined with a "phased diffusion" training method that uniformly supports DDIM and flow matching, it reduces training costs by approximately 50% and accelerates inference by 1.5× without degrading generation quality.
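The stage-wise framerate schedule can be made concrete with a short sketch. The helper names are hypothetical, and the cost model (equal step counts per stage, cost linear in frame count) is a deliberate simplification that is not meant to reproduce the reported training or inference figures.

```python
def frames_per_stage(full_frames, num_stages):
    """Frame count for each denoising stage, doubling until the final stage."""
    if full_frames % (2 ** (num_stages - 1)) != 0:
        raise ValueError("full_frames must be divisible by 2**(num_stages - 1)")
    return [full_frames // (2 ** (num_stages - 1 - s)) for s in range(num_stages)]

def relative_cost(full_frames, num_stages):
    """Frames processed relative to running every stage at full framerate.

    Assumes equal step counts per stage and cost linear in frame count,
    both simplifications made only for this illustration.
    """
    counts = frames_per_stage(full_frames, num_stages)
    return sum(counts) / (full_frames * num_stages)

schedule = frames_per_stage(16, 3)   # early stages see fewer frames
ratio = relative_cost(16, 3)         # fraction of the full-framerate frame budget
```

Under these assumptions only the last stage touches all 16 frames, so most denoising steps operate on a shorter sequence.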
- TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation
-
TS-Attn proposes a training-free temporal-wise separable cross-attention mechanism that redistributes attention between motion regions and event-specific words during the early denoising stage of pre-trained video generation models. For a single complex prompt, this simultaneously improves multi-event completion, temporal order, and video consistency in one inference pass.
- TTOM: Test-Time Optimization and Memorization for Compositional Video Generation
-
The TTOM framework is proposed to align the attention of video generation models with LLM-generated spatio-temporal layouts during inference by optimizing newly added parameters. A parameter memorization mechanism is utilized to store historical optimization contexts for reuse, achieving relative improvements of 34% (CogVideoX) and 14% (Wan2.1) on T2V-CompBench.
- UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
-
This paper identifies that two failure modes—"periodic repetition" and "general quality degradation"—occurring in Video Diffusion Transformers during out-of-distribution length generation both stem from a single mechanism: attention dissipation (out-of-window tokens dilute the attention distribution learned within the training window). Based on this, it proposes UltraViCo, a training-free and plug-and-play method: it applies a constant decay factor to the attention logits of out-of-window tokens, pushing the extrapolation limit from 2× to 4× (at 4×, dynamic degree and imaging quality are 233% and 40.5% higher than previous state-of-the-art methods, respectively).
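A toy NumPy sketch can show how decaying out-of-window logits shifts attention mass back toward in-window tokens. Several details here are assumptions, not taken from the paper: the window is defined by token distance `|i - j| >= train_len`, the decay is applied as a multiplicative factor `alpha` on the logits, and the function names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def decayed_attention(logits, train_len, alpha):
    """Scale logits of keys that lie outside the assumed training window.

    Note: scaling by alpha < 1 lowers a logit only when it is positive;
    this toy example uses positive logits throughout.
    """
    n = logits.shape[-1]
    idx = np.arange(n)
    out_of_window = np.abs(idx[:, None] - idx[None, :]) >= train_len
    scaled = np.where(out_of_window, alpha * logits, logits)
    return softmax(scaled), out_of_window

n, train_len = 8, 4
logits = np.ones((n, n))  # uniform positive logits: attention is fully diluted
baseline = softmax(logits)
decayed, out_mask = decayed_attention(logits, train_len, alpha=0.5)

# Attention mass that query 0 places on in-window keys, before and after.
mass_before = baseline[0][~out_mask[0]].sum()
mass_after = decayed[0][~out_mask[0]].sum()
```

In this example query 0 spreads half of its attention over out-of-window keys at baseline; after the decay a larger share stays inside the window, which is the dilution effect the summary describes being counteracted.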
- Unified In-Context Video Editing
-
UNIC represents the source video, multimodal editing conditions, and target video noise latents as a single token sequence. This allows the video DiT to perform ID insertion/replacement/deletion, stylization, first-frame propagation, and re-camera control using native full attention within the context, while mitigating multi-task confusion through Task-aware RoPE and Condition Bias.
- Uniform Discrete Diffusion with Metric Path for Video Generation
-
URSA reformulates image and video generation as a global iterative refinement process on discrete visual tokens. By utilizing a linearized metric path based on token embedding distances, resolution-dependent timestep shifting, and frame-wise asynchronous noise scheduling, it enables discrete diffusion to approach or even match the performance of continuous diffusion models in text-to-video, image-to-video, and high-resolution image generation.
- UniVideo: Unified Understanding, Generation, and Editing for Videos
-
UniVideo utilizes a frozen MLLM for multimodal understanding and instruction parsing, and an MMDiT for high-fidelity image/video generation. It unifies video understanding, text-to-video, image-to-video, in-context video generation, and mask-free video editing into a single natural language instruction framework, achieving performance comparable to or better than specialized models across multiple video generation and editing tasks.
- Vid2World: Crafting Video Diffusion Models to Interactive World Models
-
This paper proposes Vid2World, which systematically transforms a full-sequence, non-causal video diffusion model pre-trained on internet-scale videos into an interactive world model capable of autoregressive rollout and frame-by-frame action control through "causalization modification + causal action guidance." It outperforms existing transfer methods and specialized world models in robot manipulation, 3D game simulation, and open-world navigation.
- Video-As-Prompt: Unified Semantic Control for Video Generation
-
This paper reformulates "semantically controllable video generation" as in-context generation: directly utilizing a reference video containing target semantics as a "video prompt." This is achieved through a plug-and-play Mixture-of-Transformers (MoT) expert running in parallel with a frozen backbone, combined with a time-biased RoPE to eliminate spurious pixel alignment priors. A single unified model can thus handle four semantic control types (concept, style, motion, and camera) and transfer zero-shot to unseen semantics, achieving a 38.7% human preference rate that approaches commercial closed-source models.
- Video-GPT via Next Clip Diffusion
-
By treating a "clip in a video" as the analogue of a "word in language," this paper proposes the "next clip diffusion" pre-training paradigm, which uses parallel diffusion denoising within clips and autoregressive conditioning between clips. This allows a vanilla Transformer to perform self-supervised pre-training on 70 million unlabeled videos. It scores 34.97 on the Physics-IQ world modeling benchmark, significantly outperforming Kling (23.64) and Wan (20.89), and transfers to 6 downstream video generation and understanding tasks.
- VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation
-
VideoPhy-2 utilizes 3,940 multi-event prompts derived from 197 real-world actions. Generated videos from modern text-to-video models are scored by humans across three axes: Semantic Adherence, Physical Commonsense, and Physical Rules. The results reveal that even the strongest model, Wan2.2-27B-A14B, achieves only 47.7% joint performance on the hard subset. Furthermore, a 7B VideoPhy-2-AutoEval evaluator was trained to reduce human evaluation costs.
- VMoBA: Mixture-of-Block Attention for Video Diffusion Models
-
To address the quadratic complexity bottleneck of full attention in Video Diffusion Models (VDMs), VMoBA transforms the text-oriented MoBA block attention into a sparse attention mechanism tailored for video spatiotemporal characteristics. By employing "inter-layer cyclic 1D-2D-3D partitioning + global block selection + threshold-based dynamic block count," it achieves \(2.92\times\) FLOPs reduction and \(1.48\times\) training acceleration on long sequences (\(93\times 576\times 1024\)), while maintaining or even improving generation quality compared to full attention.