🎬 Video Generation
📷 CVPR2026 · 152 paper notes
📌 Same area in other venues: 🔬 ICLR2026 (98) · 💬 ACL2026 (4) · 🧪 ICML2026 (32) · 🤖 AAAI2026 (11) · 🧠 NeurIPS2025 (23) · 📹 ICCV2025 (49)
🔥 Top topics: Video Generation ×66 · Diffusion Models ×24 · Multimodal/VLM ×7 · Compression ×6 · Alignment/RLHF ×6
- 3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation
-
3DiMo shifts human motion control from "relying on external SMPL reconstruction" to "jointly learning a set of view-invariant implicit motion tokens end-to-end with the video generator." By leveraging cross-attention semantic injection and supervision from rich multi-view data, the model recovers genuine 3D motion from 2D driving frames. This allows for faithful action reproduction while supporting free camera viewpoint control via text, with results significantly exceeding 2D pose and SMPL baselines in motion fidelity and image quality.
- A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
-
The paper introduces DeltaTok, which compresses the VFM feature differences between consecutive frames into a single delta token. Combined with Best-of-Many training, DeltaWorld efficiently generates diverse future predictions in a single forward pass. With only 1/35 the parameters and 1/2000 the FLOPs of Cosmos, it outperforms existing models in dense prediction tasks.
- Accelerating Autoregressive Video Diffusion via History-Guided Cache and Residual Correction
-
To address the critical issue in Autoregressive Video Diffusion Models (ARDMs) where "cache approximation errors accumulate and amplify over time" during segment-by-segment generation, this paper proposes the training-free ARCache. It uses History-Guided Cache to schedule caching based on changes in history tokens (suppressing intra-segment errors) and Enhanced Residual Correction to calibrate subsequent segments using the clean residual trajectory of the first segment (preventing inter-segment drift). It achieves up to \(3.13\times\) acceleration across three ARDMs with nearly lossless image quality.
- Accelerating Diffusion-based Video Editing via Heterogeneous Caching: Beyond Full Computing at Sampled Denoising Timestep
-
For masked video editing (MV2V) tasks, this paper proposes HetCache, a training-free framework: it categorizes denoising steps into "full, partial, or reuse" based on cumulative change across timesteps, and partitions tokens into "context, margin, or generation" based on mask spatial priors within a single step. By performing attention only on the most semantically representative context tokens, it achieves a 2.67× speedup on Wan2.1-VACE with almost no drop in visual quality.
- ActivityForensics: A Comprehensive Benchmark for Localizing Manipulated Activity in Videos
-
This paper introduces the activity-level video forgery localization task and the ActivityForensics large-scale benchmark (6K+ forged clips). It utilizes a grounding-assisted automated data construction pipeline to create highly realistic activity manipulations and proposes the Temporal Artifact Diffuser (TADiff) baseline, which amplifies forgery clues through diffusion-based feature regularization.
- AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation
-
AdaCluster is a training-free sparse attention framework that designs specific strategies for the distinct roles of queries and keys in video DiTs. It uses "angular clustering" to compress queries and "layer-wise adaptive multi-stage K-means" to cluster keys. Combined with TensorQuest, which identifies key clusters via Tensor Cores, it achieves 1.67×–4.31× end-to-end acceleration on CogVideoX-2B / HunyuanVideo / Wan-2.1 with nearly lossless visual quality (up to 30.99 PSNR).
- AdapTok: Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
-
AdapTok encodes video into a temporally causal 1D discrete token sequence. During training, it learns "variable-length" representations by randomly dropping tail tokens in blocks. During inference, a scorer predicts "reconstruction quality for \(N\) tokens," and Integer Linear Programming (ILP) dynamically allocates tokens to different frames or samples under a fixed total budget. This achieves rFVD=28 reconstruction on UCF-101 with fewer tokens and significantly improves autoregressive video generation quality.
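The budgeted allocation step in AdapTok can be illustrated with a toy sketch. The summary says an ILP is used. The code below instead solves the same kind of small integer problem exactly with dynamic programming. The function name `allocate_tokens`, the candidate token counts, and the score tables are hypothetical and are not taken from the paper.

```python
def allocate_tokens(predicted_quality, budget):
    """Pick one token count per block to maximize total predicted quality.

    predicted_quality: list of dicts {token_count: predicted_quality},
                       one dict per block (hypothetical scorer outputs).
    budget: maximum total number of tokens across all blocks.
    Returns (total_quality, [token_count per block]).
    """
    # best[used] = (total quality, allocation) over the blocks seen so far
    best = {0: (0.0, [])}
    for options in predicted_quality:
        nxt = {}
        for used, (quality, alloc) in best.items():
            for n_tokens, score in options.items():
                total = used + n_tokens
                if total > budget:
                    continue  # this choice would exceed the budget
                cand = (quality + score, alloc + [n_tokens])
                if total not in nxt or cand[0] > nxt[total][0]:
                    nxt[total] = cand
        best = nxt
    if not best:
        raise ValueError("budget too small for any allocation")
    return max(best.values(), key=lambda item: item[0])


# Block 0 is nearly static (extra tokens barely help);
# block 1 is dynamic (extra tokens help a lot).
scores = [
    {1: 0.90, 2: 0.92, 4: 0.93},
    {1: 0.50, 2: 0.70, 4: 0.90},
]
total_quality, allocation = allocate_tokens(scores, budget=5)
```

With these made-up scores the budget goes mostly to the dynamic block, which is the behaviour the summary describes for allocation under a fixed total budget.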
- Anti-I2V: Safeguarding your photos from malicious image-to-video generation
-
Anti-I2V proposes a defense method against malicious image-to-video generation by optimizing perturbations in the L*a*b* and frequency dual-spaces and designing Internal Representation Collapse (IRC) and Anchoring (IRA) losses to disrupt semantic feature propagation in denoising networks, achieving SOTA protection across CogVideoX, DynamiCrafter, and Open-Sora.
- Archon: A Unified Multimodal Model for Holistic Digital Human Generation
-
Archon discretizes seven modalities involved in digital humans (description, text script, speech, 3DMM animation, semantic video, image, and video) into tokens. A single autoregressive large model is pre-trained on 72 tasks to achieve any-to-any modality generation, understanding, and editing. It addresses token explosion in high-frame-rate talking videos using a \(4\times\) semantic video token compression and semantic-driven diffusion decoding. Furthermore, it stabilizes quality for high-ambiguity tasks like speech-to-video through "Thinking in Modality," which decomposes the process into modality-by-modality intermediate steps.
- Are Image-to-Video Models Good Zero-Shot Image Editors?
-
This paper proposes IF-Edit, a training-free framework that directly utilizes pre-trained Image-to-Video (I2V) diffusion models as zero-shot image editors. By rewriting static editing instructions into "evolution over time" descriptions using Chain-of-Thought (CoT) prompting, employing Temporal Latent Dropout (TLD) to prune redundant frames for denoising acceleration, and using Self-Consistent Post-Refinement (SCPR) to select and regenerate a "static video" for clarity, IF-Edit demonstrates strong performance in non-rigid deformation and reasoning-based editing tasks.
- Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer
-
The authors replace the expensive softmax self-attention in pretrained Video Diffusion Transformers (Wan2.1) with a hybrid attention mechanism consisting of "a few softmax anchor tokens + majority linear tokens." By employing a "surgical" pipeline comprising layer-wise distillation, knapsack-based block-rate selection, and lightweight fine-tuning, the model is linearized to near-original quality in less than 0.4k GPU hours. This achieves approximately 6× faster inference for single attention blocks on long videos.
- BulletTime: Decoupled Control of Time and Camera Pose for Video Generation
-
To address the issue where video diffusion models couple "scene dynamics" and "camera movement" on the same video-time axis, BulletTime decomposes it into two orthogonal conditional paths: "world time \(\tau_{world}\)" and "camera pose \(c\)". It uses Time-RoPE + AdaLN to inject continuous time and 4D-RoPE + Camera-AdaLN to inject viewpoints. A synthetic dataset with independent time/camera variations is utilized to supervise decoupling. This supports flexible 4D control such as "Bullet Time" (moving camera, frozen time), with control precision surpassing two-stage baselines that combine camera methods with temporal remapping on both synthetic and real videos.
- Captain Safari: A World Engine with Pose-Aligned 3D Memory
-
Captain Safari is a "world engine" that maintains an implicit 3D geometric memory. Given an arbitrary camera trajectory, it retrieves world tokens aligned with the target pose to condition a DiT-based video generator. This ensures both precise trajectory following and long-term 3D consistency under aggressive 6-DoF motion. It is accompanied by the OpenSafari wild FPV drone dataset.
- Causality in Video Diffusers is Separable from Denoising
-
The authors discover through probing experiments that "temporal causal reasoning" and "step-wise denoising" in autoregressive video diffusion models are separable. Shallow layers exhibit high redundancy across denoising steps, while deep layers primarily perform intra-frame rendering. Based on this, the SCD architecture is proposed: a causal Transformer encoder performing temporal reasoning once per frame, and a lightweight frame-wise diffusion decoder for multi-step rendering. This reduces per-frame latency by 2–4\(\times\) while maintaining generation quality.
- Chain of Event-Centric Causal Thought for Physically Plausible Video Generation
-
This paper models physically plausible video generation (PPVG) as a sequence of causally connected events. It decomposes complex physical phenomena into ordered event chains driven by physical formulas through physics-driven event chain reasoning, then generates semantic-visual dual conditions through transition-aware cross-modal prompting to guide video diffusion models in generating videos that follow causal evolution.
- CI-VID: A Coherent Interleaved Text-Video Dataset
-
CI-VID constructs an "interleaved text-video" dataset of 340,000 samples—each sample is a semantically coherent multi-shot video sequence paired with interleaved captions that describe both individual shots and the "continuation/change" between adjacent shots. This allows models to transition from "isolated text → video" to "text + preceding video → subsequent video," enabling the generation of multi-shot videos with storytelling, smooth transitions, and consistent characters and styles.
- CineBrain: A Large-Scale Multi-Modal Audiovisual Brain Dataset for Brain-Conditioned Video Generation
-
This paper constructs CineBrain, the first large-scale brain signal dataset with synchronized fMRI and EEG recorded under natural audiovisual conditions (watching The Big Bang Theory). It proposes the CineSync framework, which utilizes a dual-Transformer fusion encoder to align brain signals with visual/textual semantics, followed by a LoRA-finetuned video diffusion model to decode brain signals into dynamic videos. The method achieves SOTA performance in dynamic video reconstruction and reveals that auditory cortex activation enhances visual decoding precision.
- CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation
-
Given a set of static scene images, a text prompt, and a user-specified camera trajectory, CineScene injects "implicit 3D features" extracted by VGGT as context conditions into a pre-trained T2V diffusion model. This enables the generation of scene-consistent cinematic videos with novel dynamic subjects under significant viewpoint changes, achieving SOTA in both scene consistency and camera precision.
- Composing Concepts from Images and Videos via Concept-prompt Binding
-
The authors propose Bind & Compose (BiCo), a one-shot method that binds visual concepts to prompt tokens via a hierarchical binder structure and achieves flexible image-video concept composition through token-level assembly. It significantly outperforms previous methods in concept consistency, prompt fidelity, and motion quality.
- Compressed-Domain-Aware Online Video Super-Resolution
-
CDA-VSR leverages compressed-domain video information (motion vectors, residual maps, and frame types) to guide three key stages of online video super-resolution: motion-vector-guided deformable alignment for efficient and precise registration, residual-map-gated fusion to suppress misaligned regions, and frame-type-aware reconstruction for adaptive allocation of computation. It achieves the best PSNR on REDS4 at 93 FPS (more than 2× the speed of the prior SOTA).
- ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
-
Addressing appearance drift and geometric distortion of rigid objects under viewpoint changes in Image-to-Video (I2V) generation, ConsID-Gen intervenes at both data and model levels: constructing a large-scale object-centric dataset ConsIDVid and a multi-view consistency benchmark ConsIDVid-Bench, and proposing a "view-assisted" framework. By supplementing the first frame with two unposed auxiliary views, using a dual-stream encoding of 2D semantics + VGGT geometry, and pre-aligning vision and text before diffusion, it surpasses Wan2.1/Wan2.2/HunyuanVideo in identity fidelity and geometric stability.
- Content-Aware Dynamic Patchification for Efficient Video Diffusion
-
DynaPatch employs a lightweight router within the 3D VAE latent space to adaptively select patch sizes for each spatio-temporal region (fine patches for detailed areas and coarse patches for static areas). By jointly training the router with the diffusion model end-to-end, it eliminates redundant computation during the token creation stage. On VBench, it achieves a total score of 83.42 with a 30% token reduction, realizing a 1.3–1.8× speedup with near-lossless visual quality.
- CoT-Edit: Let CoT Guide Instruction Video Editing
-
Addressing the issues of inaccurate target localization and physically implausible object additions in complex scenarios for text-only instruction video editing, this paper proposes a three-stage Plan–Guide–Edit framework. A Multimodal Large Language Model (MLLM) with Chain-of-Thought (CoT) first "translates" instructions into a sequence of keyframe bounding boxes and enhanced instructions. A box-constrained mask branch then converts spatial priors into temporally consistent masks. Finally, a diffusion editor integrates the masks, enhanced instructions, and video features to complete the editing, significantly outperforming existing open-source baselines in physical plausibility and spatial relations.
- CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video
-
This paper proposes CubeComposer, which decomposes 360° video into a six-face cubemap representation and generates it in a spatio-temporal autoregressive manner. It achieves native 4K (3840×1920) 360° panoramic video generation from perspective video for the first time, eliminating the need for post-processing super-resolution.
- D2Cache: Second-Order Delta Caching for Higher Video Diffusion Acceleration
-
D2Cache is a training-free, plug-and-play video diffusion cache acceleration method. It identifies that the "second-order difference" (residual of the first-order residuals) between outputs of adjacent timesteps is significantly smoother than the first-order residual. By adding a second-order correction term to the first-order residual reuse, the method reduces cache prediction error from \(O((\Delta t)^2)\) to \(O((\Delta t)^3)\). Furthermore, it utilizes scaling factors estimated from timestep embeddings to adapt to non-uniform skipping. At the same acceleration ratio, its VBench score is 0.4%–2.5% higher than the SOTA method, TeaCache.
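The error-order claim for D2Cache can be checked on a toy trajectory. The sketch below is not the paper's implementation: it extrapolates a hypothetical two-channel "model output" from cached values at uniformly spaced timesteps, and it omits the timestep-embedding scaling factors used for non-uniform skipping. All function names are illustrative.

```python
def first_order_predict(y_prev, y_curr):
    """Reuse the last first-order difference: y_next ~ y_curr + (y_curr - y_prev)."""
    return [c + (c - p) for p, c in zip(y_prev, y_curr)]


def second_order_predict(y_prev2, y_prev, y_curr):
    """Add the difference of consecutive first-order differences as a correction."""
    out = []
    for p2, p, c in zip(y_prev2, y_prev, y_curr):
        d1 = c - p          # latest first-order difference
        d2 = d1 - (p - p2)  # second-order difference
        out.append(c + d1 + d2)
    return out


def trajectory(t):
    # Hypothetical smooth output with two feature channels (quadratic in t).
    return [t * t, 3.0 * t * t - t]


h = 0.5  # uniform step size
y0, y1, y2 = trajectory(0.0), trajectory(h), trajectory(2 * h)
target = trajectory(3 * h)

err_first = max(abs(a - b) for a, b in zip(first_order_predict(y1, y2), target))
err_second = max(abs(a - b) for a, b in zip(second_order_predict(y0, y1, y2), target))
```

On this quadratic trajectory the first-order reuse leaves an error proportional to \((\Delta t)^2\), while the second-order correction removes it entirely. For a general smooth trajectory the remaining error would scale as \((\Delta t)^3\), matching the orders quoted in the summary.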
- Diff4Splat: Repurposing Video Diffusion Models for Dynamic Scene Generation
-
Diff4Splat is a feed-forward framework that unifies video diffusion models and deformable 3D Gaussian fields into an end-to-end trainable model. It directly generates dynamic 4D scene representations from a single image in approximately 30 seconds, which is 60 times faster than optimization-based methods.
- DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
-
DisCa for the first time unifies learnable feature caching and step distillation into a compatible framework, replacing manual caching strategies with a lightweight neural predictor (<4% model parameters). Accompanied by Restricted MeanFlow to stabilize the distillation of large-scale video DiTs, it achieves 11.8× near-lossless acceleration on HunyuanVideo.
- Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization
-
To address the diversity collapse issue where Text-to-Video (T2V) models produce highly similar results for the same prompt, this paper models "generating a set of diverse videos" as a set-level policy optimization. It uses the marginal gain of a Determinantal Point Process (DPP) to provide a "diminishing returns" diversity reward for each new sample, combined with a relevance reward. A prompt-rewriting policy model (rather than the video generator itself) is trained using GRPO, providing a plug-and-play enhancement for Wan2.1 / CogVideoX / Veo3 that significantly improves diversity in camera movement, scenes, and motion without sacrificing fidelity.
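The "diminishing returns" reward can be illustrated with the standard log-determinant form of a DPP marginal gain. This is a minimal sketch under stated assumptions: the cosine-similarity kernel, the ridge term, the 2D toy embeddings, and the function names are all hypothetical, and the exact reward used in the paper may differ.

```python
import math


def _det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
    return det


def kernel(vectors, ridge=1e-6):
    """Cosine-similarity kernel; the small ridge keeps the determinant positive."""
    normed = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        normed.append([x / norm for x in v])
    n = len(normed)
    return [
        [sum(p * q for p, q in zip(normed[i], normed[j])) + (ridge if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


def marginal_gain(selected, candidate):
    """log det(L_{S + candidate}) - log det(L_S): diversity added by the candidate."""
    return math.log(_det(kernel(selected + [candidate]))) - math.log(_det(kernel(selected)))


already_generated = [[1.0, 0.0]]  # embedding of a video already in the set
gain_near_duplicate = marginal_gain(already_generated, [0.9, 0.1])
gain_different = marginal_gain(already_generated, [0.0, 1.0])
```

A candidate that nearly duplicates an existing sample shrinks the determinant and receives a strongly negative gain, while an orthogonal candidate does not, so a set-level policy is pushed toward samples that differ from what has already been generated.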
- DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior
-
DreamShot leverages the spatio-temporal priors of video diffusion models to generate multi-shot storyboards with consistent characters and coherent scenes. It addresses multi-character confusion via a Role-Attention Consistency Loss and provides unified support for Text-to-Shot, Reference-to-Shot, and Shot-to-Shot modes.
- DreamStyle: A Unified Framework for Video Stylization
-
DreamStyle unifies three style conditions—text, style images, and stylized first frames—into a video stylization model based on Wan14B-I2V. It addresses the lack of paired data through a data construction pipeline that "first stylizes the first frame, then generates paired videos via I2V," and utilizes token-specific LoRA to eliminate interference between different condition tokens, outperforming specialized models across three types of stylization tasks.
- DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
-
DriveLaW is a driving world model that unifies video generation and motion planning via a shared latent space. By directly injecting intermediate latent features from the video generator into a diffusion planner, it achieves SOTA performance simultaneously on nuScenes video prediction and NAVSIM planning benchmarks.
- Dual-Granularity Memory for Efficient Video Generation
-
Tackling the "chunk isolation" issue in linear recurrent video generators caused by chunk-wise parallelism, this paper stacks two complementary memories on the GSTPN backbone: intra-chunk Context Memory (sink columns + boundary buffers, adding only +150K parameters) and cross-segment LCaM (latent memory bank + content retrieval + cross-attention). This achieves a 1.54× inference speedup while maintaining visual quality comparable to full attention.
- DynamicsBoost: Dynamic Plausible Video Generation via Annotation-Free Continuation Preference Optimization
-
This work treats "video continuation" as a natural preference signal—where more conditional frames lead to less generated content and higher overall quality. This enables the automatic construction of structurally matched preference pairs without manual or VLM labeling. By applying Asymmetrical DPO, calculated only in the generated regions, to align text-to-video models, the approach significantly enhances dynamic plausibility and semantic consistency.
- Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos
-
Addressing the new task of "generating intermediate frames to smoothly transition an object from an initial to a final state given an initial frame, a target frame, and an action instruction" (EIVST), EgoIn first uses a fine-tuned TransitionVLM to reason through the number of steps and their respective time intervals. These conditions are then injected frame-by-frame into a diffusion interpolation model, supplemented by object positioning auxiliary supervision to maintain object appearance consistency. It significantly leads in FVD and other metrics across four egocentric and robotic manipulation datasets.
- EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses
-
EgoControl utilizes a compact representation of "relative head pose + pelvis-rooted joint poses" on the pretrained video diffusion model Cosmos. By injecting control signals through a twin-pathway of AdaLN modulation and pose token cross-attention, it achieves precise future frame prediction driven by the 3D full-body pose of an egocentric wearer, aligning both camera perspective and visible limb movements with the control pose.
- EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing
-
Addressing egocentric video editing scenarios in Augmented Reality (AR) characterized by "first-person perspective, frequent hand-object interaction, and large ego-motion," the authors introduce a complete ecosystem: data (EgoEditData, 93.6k editing pairs), a model (EgoEdit, a channel-concatenation editor + EgoEdit-RT, a real-time streaming version distilled in two stages achieving 38.1 fps and 855 ms first-frame latency on a single H100), and a benchmark (EgoEditBench, 1700 samples across 15 task categories). It significantly outperforms existing methods in egocentric editing while maintaining competitive performance on general editing tasks.
- EgoX: Egocentric Video Generation from a Single Exocentric Video
-
Given a single exocentric video and a target egocentric camera trajectory, EgoX first performs 3D lifting to render an "egocentric prior video." It then employs width/channel-wise bidirectional concatenation combined with geometry-guided self-attention, leveraging a pre-trained video diffusion model (Wan 2.1 14B + LoRA) to generate geometrically consistent, high-fidelity egocentric videos. It significantly outperforms baselines such as Exo2Ego-V on the Ego-Exo4D dataset.
- Endless World: Real-Time 3D-Aware Long Video Generation
-
Endless World combines "conditional autoregression (truncated conditional frame gradients) + fusing VGGT-extracted 3D features into text embeddings + attention sinks" into a 1.3B distilled video diffusion model. It generates infinitely extendable, geometrically consistent videos in real time (17 FPS) on a single GPU without quality degradation over time, reaching a VBench total score of 84.54 at 30 seconds and surpassing SOTA methods such as LongLive.
- EvoID: Reinforced Evolution for Identity-Preserving Video Generation
-
EvoID reformulates "identity-preserving video generation" from imitation learning into a reinforcement learning-driven self-evolution process. By employing a dual-path reward system (objective metrics + MLLM global preference) as an internal evaluator and anchoring exploration with a frozen T2V teacher, the generative model actively balances identity fidelity, motion naturalness, and temporal consistency. It achieves a Total Score of 0.704 on the OpenS2V-Eval person domain, surpassing both the open-source VACE-14B (0.658) and the commercial Hailuo (0.653).
- ExPose: Reinforcing Video Generation Models for Extreme Pose Estimation
-
Direct relative pose estimation often fails when two images have extreme viewpoint differences and minimal overlap; ExPose fine-tunes a video generation model using GRPO reinforcement learning into a "pose-reward-driven" generator. This allows it to interpolate geometrically consistent intermediate frames between the two views, which are then processed by 3D foundation models like VGGT or MapAnything to significantly improve pose estimation accuracy (e.g., DL3DV AUC 48.1 \(\to\) 53.6).
- FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
-
The FaceCam system is proposed to address camera control in monocular portrait videos by using facial landmarks as a scale-aware camera representation. This approach avoids the scale ambiguity inherent in traditional extrinsic camera representations. Additionally, two data augmentation strategies—synthetic camera motion and multi-clip stitching—are designed to support continuous camera trajectory inference.
- FastLightGen: Fast and Light Video Generation with Fewer Steps and Parameters
-
FastLightGen proposes a three-stage distillation algorithm that achieves joint distillation of sampling steps and model size for the first time. By identifying redundant layers, employing dynamic probabilistic pruning, and using well-guided teacher guidance distribution matching, it compresses HunyuanVideo/WanX into a lightweight generator with 4 steps and 30% parameter pruning, achieving approximately 35x speedup while outperforming the teacher model.
- FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
-
Addressing the limitation that "First-Frame Propagation (FFP) video editing relies on run-time guidance," this work first constructs FFP-300K, a high-fidelity dataset of 290,000 pairs of 720p, 81-frame edited videos via a dual-track pipeline. It then introduces FreeProp, a guidance-free framework that dynamically decouples "first-frame appearance" from "source motion" using AST-RoPE and employs self-distillation, treating the model’s own ideal representation of the source video as regularization. This approach outperforms all methods on EditVerseBench, including the commercial model Aleph.
- FlashLips: 100-FPS Mask-Free Latent Lip-Sync using Reconstruction Instead of Diffusion or GANs
-
FlashLips reformulates lip-sync as a "deterministic image editing" problem rather than a generation problem. By replacing diffusion/GANs with a single-step latent editor trained purely on reconstruction and driving it via a Flow Matching "audio-to-lip-pose" Transformer, the U-Net version achieves 109 FPS on a single H100. Simultaneously, it outperforms larger and slower diffusion baselines in FID/FVD and lip-sync precision.
- FlashPortrait: 6× Faster Infinite Portrait Animation with Adaptive Latent Prediction
-
FlashPortrait utilizes a training-free inference mechanism consisting of "Weighted Sliding Window + Adaptive Latent Extrapolation." This significantly compresses denoising steps for long portrait animations, achieving up to 6× inference acceleration while generating videos exceeding 1800 frames without identity (ID) drift.
- FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing
-
FlowDirector models text-driven video editing as a "direct evolution" driven by ODEs in the data space, completely bypassing the traditional inversion step. By employing three training-free flow steering strategies (Direction-Aware, Motion-Appearance Decoupled, and Differential Average Guidance), it effectively manages "thorough modification," "motion preservation," and "trajectory stability," simultaneously achieving SOTA performance in instruction following, temporal consistency, and background preservation.
- FlowMotion: Training-Free Flow Guidance for Video Motion Transfer
-
FlowMotion is proposed as a training-free video motion transfer framework that constructs motion guidance signals directly using the latent prediction from flow-based T2V models. By avoiding gradient backpropagation through internal model layers, it achieves high motion fidelity while significantly reducing inference time and GPU memory overhead.
- FlowPortal: Residual-Corrected Flow for Training-Free Video Relighting and Background Replacement
-
FlowPortal requires no model training. Instead, it utilizes "Residual-Corrected Flow" to transform off-the-shelf video diffusion flow models into editing models. By enforcing perfect reconstruction when source and target conditions are identical and specifically altering the lighting direction otherwise—supplemented by decoupled conditions, high-frequency transfer, and foreground masks—it achieves temporally coherent, structurally faithful, and naturally relit video editing and background replacement within 3–5 minutes.
- Free-Lunch Long Video Generation via Layer-Adaptive O.O.D Correction
-
FreeLOC introduces a training-free, layer-adaptive framework that identifies the varying sensitivities of different layers in Video DiTs to "frame-level relative position OOD" and "context length OOD." By selectively applying Multi-granularity Positional Recoding (VRPR) and Tiered Sparse Attention (TSA) to sensitive layers, it achieves SOTA long video generation quality without additional training costs.
- From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning
-
This paper proposes the Co-Settle framework, which trains a lightweight linear projection layer on a frozen image pre-trained encoder. By leveraging temporal cycle consistency loss and semantic separability constraints, it consistently enhances multi-granularity video downstream task performance across 8 image foundation models with only 5 epochs of self-supervised training.
- Generative Neural Video Compression via Video Diffusion Prior
-
This paper proposes GNVC-VD, the first DiT-based generative neural video compression framework. By utilizing a video diffusion transformer as a video-native generative prior, it achieves spatio-temporal latent compression and sequence-level generative refinement within a unified codec. At ultra-low bitrates (<0.03 bpp), it significantly outperforms traditional and learned codecs in perceptual quality and substantially reduces flickering artifacts common in previous generative methods.
- Generative Video Motion Editing with 3D Point Tracks
-
This paper proposes Edit-by-Track: a V2V video diffusion model conditioned on a "source video + a pair of source/target 3D point tracks." By using 3D tracks to establish sparse correspondences between source and target, it enables simultaneous editing of camera perspective and object motion (including occlusion, depth ordering, and non-rigid deformation), outperforming existing I2V/inpaint-based methods on DyCheck and in-the-wild videos.
- GenHOI: Towards Object-Consistent Hand-Object Interaction with Temporally Balanced and Spatially Selective Object Injection
-
GenHOI attaches a lightweight module of only 157M parameters (approx. 0.95%) to a pre-trained large video generation model (Wan-14B-I2V). By employing Head-Sliding RoPE (balancing the influence of reference object tokens across frames temporally) and Spatial Attention Gating (focusing object-conditioned attention on interaction zones spatially), it achieves natural hand-object interaction videos with consistent object appearance across frames in in-the-wild scenarios without damaging the base model's generalization. It significantly outperforms SOTA models like VACE and HOI-Swap in both self-re-enactment and cross-re-enactment metrics.
- Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context
-
The Geometry-as-Context (GaC) framework is proposed to replace non-differentiable operators (3D reconstruction and rendering) in reconstruction-based scene video generation with a unified autoregressive video generation model. By embedding geometry information (depth maps) as interleaved contexts within the generative sequence, the method achieves end-to-end training and mitigates accumulated errors.
- GT-SVJ: Generative-Transformer-Based Self-Supervised Video Judge For Efficient Video Reward Modeling
-
This paper repurposes an off-the-shelf video generation model (CogVideoX) into a video reward model. It trains a "real/degraded" video discriminator as an Energy-Based Model (EBM) with contrastive learning, then applies two-step preference alignment. Using only 30K human annotations, the method outperforms VLM-based reward models trained on millions of samples on GenAI-Bench and MonteBench.
- HandWorld: Hand-Centric Unified Video Action Generation
-
HandWorld utilizes a shared cross-domain conditioning network to bind "hand action" and "egocentric video" domains together, followed by decoupled Diffusion Transformers for each. Combined with MANO-rendered hands as an intermediate bridge and flexible multi-task training, it enables simultaneous action-conditioned video generation and future hand action prediction within a single framework, outperforming existing specialized baselines in both tasks.
- HarmoVid: Relightful Video Portrait Harmonization
-
HarmoVid adopts a two-stage data and model scheme consisting of "frame-wise harmonization → deflickering → dual-path training." In the absence of real paired data, it harmonizes the lighting, shadows, and tones of foreground portrait videos to match target backgrounds, achieving temporal stability, clean boundaries, and expressive relighting performance.
- HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives
-
Based on DiT video diffusion models like Wan2.2, HoloCine employs "Window Cross-Attention" to align each shot with its storyboard text and "Sparse Inter-Shot Self-Attention" to reduce the quadratic complexity of full-sequence self-attention to near-linear. This enables the holistic, one-pass generation of minute-long, character-consistent cinematic narratives with precise transition control.
- HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
-
HVG-3D connects a ControlNet, which accepts 3D point cloud sequences and 3D tracking signals, to an image-to-video diffusion model (CogVideoX-5B-I2V). Equipped with a hybrid data pipeline that constructs conditions from both real videos and simulators, the model generates geometrically accurate and temporally coherent hand-object interaction (HOI) videos driven by simulation data, achieving SOTA performance on TASTE-Rob.
- Improving Motion in Image-to-Video Models via Adaptive Low-Pass Guidance
-
The authors observe that videos generated by I2V models are "stiffer" than those from T2V models of the same family. The root cause is that high-frequency details from the reference image "lock" the generation trajectory into a static shortcut during the very early stages of denoising. Consequently, they propose Adaptive Low-Pass Guidance (ALG), a training-free method that applies low-pass filtering to the conditional image only during early sampling and reverts to the original image later. This improves the average dynamic degree on VBench by 33% with virtually no loss in image quality.
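The ALG idea summarized above (blur the conditioning image early, restore it later) can be illustrated with a minimal sketch. The summary does not specify the filter or the schedule, so the box filter, the linear decay, and the names `blur_radius`, `early_frac`, and `max_radius` are all assumptions made for illustration only.

```python
def blur_radius(step, total_steps, early_frac=0.1, max_radius=3):
    """Hypothetical schedule: radius decays linearly during the first
    `early_frac` of the sampling steps and is 0 (no filtering) afterwards."""
    cutoff = early_frac * total_steps
    if step >= cutoff:
        return 0
    return max(1, round(max_radius * (1 - step / cutoff)))


def box_blur(image, radius):
    """Mean filter over a (2*radius+1)^2 window, truncated at the borders.
    A stand-in low-pass filter; radius 0 returns the pixel values unchanged."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [image[j][i] for j in ys for i in xs]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def conditioning_image(image, step, total_steps):
    """Image handed to the denoiser at `step`: blurred early, original later."""
    return box_blur(image, blur_radius(step, total_steps))
```

With 50 sampling steps and the assumed defaults, only steps 0 to 4 see a filtered image; from step 5 onward the original reference image is used.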
- Inference-time Physics Alignment of Video Generative Models with Latent World Models
-
This work utilizes the "surprise" score from a pre-trained latent world model (VJEPA-2) as a reward to search and guide the denoising trajectories of video diffusion models during inference. This approach aligns generated videos with physical laws, achieving a first-place score of 62.64% on the PhysicsIQ challenge, surpassing the previous SOTA by 7.42%.
- Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout
-
∞-RoPE is proposed as a training-free inference-time framework. Through three components—Block-Relativistic RoPE, KV Flush, and RoPE Cut—it extends an autoregressive video diffusion model trained only on 5-second videos into a system capable of infinite-duration generation, fine-grained action control, and cinematic scene transitions.
- I'm a Map! Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
-
IMAP (Interpretable Motion-Attentive Maps) is proposed as a training-free framework consisting of two modules: GramCol for spatial localization and Motion Head Selection for temporal localization. It extracts spatio-temporal saliency maps of motion concepts from Video DiTs, outperforming existing methods in motion localization and zero-shot video semantic segmentation.
- LAMP: Language-Assisted Motion Planning for Controllable Video Generation
-
The LAMP framework models motion control as a language-to-program synthesis problem. By designing a cinematography-inspired motion DSL, the authors train an LLM to transform natural language descriptions into structured motion programs. These programs are deterministically mapped to 3D object and camera trajectories to condition video generation, enabling the simultaneous generation of object and camera motion from natural language for the first time.
- Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation
-
When adding continuous control over physical camera parameters such as shutter speed, aperture, and color temperature to pre-trained text-to-video models (WAN 2.1), this paper finds that fine-tuning with sparse, low-fidelity synthetic data performs better than using photorealistic data. This is because photorealistic data destroys the backbone's pre-trained priors, leading to "content collapse," whereas simple synthetic data merely "coaxes out" existing priors. High-fidelity controllable generation is achieved through a design incorporating "decoupled cross-attention + joint LoRA training + inference-time pruning."
- Let Your Image Move with Your Motion! – Implicit Multi-Object Multi-Motion Transfer
-
This paper proposes FlexiMMT, the first I2V framework supporting implicit multi-object multi-motion transfer. By utilizing a Motion Decoupled Masked Attention (MDMA) mechanism to constrain motion/text tokens to affect only corresponding target regions and a Differentiated Mask Extraction Mechanism (DMEM) to derive target masks from diffusion attention for progressive propagation, it achieves precise compositional multi-object motion transfer.
- LightMover: Generative Light Movement with Color and Intensity Controls
-
LightMover leverages video diffusion priors to model light source editing as a sequence-to-sequence prediction problem. By utilizing a unified control token representation for precise manipulation of light position, color, and intensity, and introducing an adaptive token pruning mechanism that reduces control sequence length by 41%, the method outperforms existing approaches in both light movement and object movement tasks.
- LinVideo: A Post-Training Framework towards O(n) Attention in Efficient Video Generation
-
LinVideo is proposed as a data-free post-training framework that selectively replaces quadratic attention with linear attention in video diffusion models. It achieves a 1.43–1.71× speedup (up to 15.9–20.9× when combined with distillation) while maintaining generation quality.
- LoL: Longer than Longer, Scaling Video Generation to Hour
-
Addressing the "sink-collapse" phenomenon in autoregressive ultra-long video generation—where the video suddenly reverts to the first few frames—this paper identifies its root cause as "multi-dimensional phase synchronization + multi-head attention homogenization" induced by RoPE periodicity. The authors propose Multi-Head RoPE Jitter, a training-free method that perturbs the RoPE base frequency per head to break such synchronization. Combined with causal VAE sliding window decoding, this work achieves real-time, streaming, and nearly quality-lossless infinite video generation for the first time (demonstrated up to 12 hours).
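As a rough illustration of per-head base-frequency perturbation, the sketch below computes standard RoPE frequencies (`base ** (-2i/d)`) with a slightly different base for each attention head. The uniform jitter, its magnitude, and the function names are assumptions; the summary above does not state how the perturbation is drawn.

```python
import random


def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE rotation frequencies: one per pair of channels."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def jittered_bases(num_heads, base=10000.0, jitter=0.1, seed=0):
    """Hypothetical jitter: scale the base by a per-head factor drawn
    uniformly from [1 - jitter, 1 + jitter]."""
    rng = random.Random(seed)
    return [base * (1.0 + rng.uniform(-jitter, jitter)) for _ in range(num_heads)]


def per_head_frequencies(num_heads, head_dim, base=10000.0, jitter=0.1, seed=0):
    """One frequency table per head, so heads no longer share the same
    rotation periods along the time axis."""
    bases = jittered_bases(num_heads, base, jitter, seed)
    return [rope_frequencies(head_dim, b) for b in bases]
```

Because each head rotates with slightly different periods, positions at which all heads would realign at once become much rarer, which is the intuition behind breaking the synchronization described above.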
- LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
-
This paper proposes LottieGPT, the first autoregressive generation framework for vector animations. It designs a Lottie tokenizer to encode hierarchical geometries, transformations, and keyframe motions into compact token sequences. By constructing a 660K animation dataset and fine-tuning Qwen-VL, it achieves direct generation of editable vector animations from text or images.
- M4V: Multimodal Mamba for Efficient Text-to-Video Generation
-
M4V replaces the quadratic complexity attention blocks in text-to-video diffusion models with linear complexity Mamba blocks (MM-DiM). Utilizing a "multimodal token re-composition" scheme, it enables unidirectional scanning SSMs to perform text-conditional fusion and spatio-temporal modeling. It reduces mixed-layer FLOPs by approximately 45% on 768×1280 long videos, maintaining quality comparable to the PyramidFlow baseline and even surpassing the original model when transferred to Wan2.1.
- LocalDPO: Direct Localized Detail Preference Optimization for Video Diffusion Models
-
LocalDPO generates negative samples by locally corrupting real high-quality videos using random spatio-temporal Bézier masks (single inference, no external ranking). Combined with a region-aware DPO loss for preference alignment at the local detail level, it consistently surpasses traditional DPO and SFT in video quality on Wan2.1 and CogVideoX.
- MoCha: End-to-End Video Character Replacement without Structural Guidance
-
MoCha shifts video character replacement from the "frame-by-frame mask + skeleton/depth-guided reconstruction paradigm" to an end-to-end non-reconstruction paradigm. By providing only a single arbitrary frame mask and no structural guidance, the model leverages the inherent tracking capabilities of video diffusion models to transfer the source character's motion and expressions to a reference identity. Utilizing condition-aware RoPE for multi-modal condition fusion and RL post-training for enhanced facial consistency, it outperforms VACE, HunyuanCustom, and Wan-Animate across both synthetic and real-world benchmarks.
- MultiAnimate: Pose-Guided Image Animation Made Extensible
-
MultiAnimate introduces a pair of modules, "Identifier Assigner + Identifier Adapter," into the Wan2.1 DiT video generation framework. By encoding tracking masks of each person into structured labels injected into the DiT and employing a training strategy of "randomly sampling identities from a learnable label bank," the model—trained only on dual-person data—consistently generates dance animations for 3 to 7 people with distinct identities and reasonable occlusions.
- MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
-
MultiShotMaster adapts a pre-trained ~1B parameter single-shot T2V model by implementing two types of RoPE (narrative phase shift + spatiotemporal positioning) and an attention mask. It achieves multi-shot video generation with variable shot counts/durations, independent per-shot text, specified subject positioning/motion, and customizable backgrounds without additional adapters. It significantly outperforms CineTrans / EchoShot / VACE / Phantom in text alignment, cross-shot consistency, transition accuracy, and narrative coherence.
- MusicInfuser: Making Video Diffusion Listen and Dance
-
Instead of training an audio-video model from scratch, MusicInfuser injects zero-initialized music-video modules into a pre-trained text-to-video diffusion model (Mochi). It utilizes a "layer adaptability" criterion to select only a few DiT layers for cross-attention adaptation, enabling the video diffusion model to "dance to music" within one day on a single GPU while preserving the text control and image quality priors of the original model.
- NS-Diff: Fluid Navier-Stokes Guided Video Diffusion via Reinforcement Learning
-
NS-Diff reformulates the denoising trajectory of video diffusion as a "physically constrained Markov Decision Process." It detects rigid/fluid regions in the latent space of a DiT, injects velocity fields and deformation gradients, and fine-tunes the denoising policy using PPO with a reward based on "rigid body minimum jerk + fluid simplified Navier-Stokes." This ensures motion adheres to physical laws (reducing jerk error by 43%, fluid divergence by 33%, and improving FVD by 22.7%) without relying on physical simulators or manual annotations.
- OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
-
OmniLottie proposes a Lottie Tokenizer that converts Lottie JSON files into structured command-parameter sequences. This enables pre-trained VLMs to generate high-quality vector animations based on multi-modal cross-instructions. The work also constructs the MMLottie-2M large-scale dataset to support training.
- One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer
-
Addressing the long-standing challenge of "spatial misalignment" between reference images and driving videos, this paper reformulates character animation training as a self-supervised outpainting task. By combining a specialized reference feature extractor, identity-skeleton decoupled pose control, and a TokenReplace long-video strategy, the model enables a single reference image of arbitrary layout to drive cross-scale video animation and image pose transfer, outperforming SOTA models of similar scale.
- OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
-
OneStory reformulates Multi-Shot video generation (MSV) as a "shot-by-shot autoregressive next-shot generation" task. It employs a Frame Selection module to select semantically relevant frames from the entire historical shot sequence and an Adaptive Conditioner to compress these non-contiguous frames into compact context tokens based on importance. These tokens are directly fed into the DiT, maintaining character/environment consistency and complex plot following in minute-long, ten-shot narratives. SOTA results are achieved in both T2MSV and I2MSV settings.
- Open-world Hand-Object Interaction Video Generation Based on Structure and Contact-aware Representation
-
SCAR proposes a "structure + contact-aware" 2D HOI representation (contact-enhanced hand-object contours + depth maps) and utilizes a "joint generation" paradigm. This approach allows a Diffusion Transformer to simultaneously denoise RGB videos and this representation, learning physically constrained hand-object interactions without relying on 3D annotations, while generalizing to open-world scenarios.
- Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
-
Pantheon360 utilizes explicit 3D point clouds ("3D Cache") reconstructed from sparse 360° inputs to render "geometry-only, texture-less" panoramic videos along user-specified camera trajectories. A fine-tuned SVD diffusion model then "paints" realistic textures onto this geometric skeleton. This achieves precise trajectory control and globally consistent digital twin video generation for in-the-wild panoramic scenes, outperforming perspective baselines like GEN3C in metrics such as PSNR and MET3R.
- PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing
-
PerformRecast proposes a GAN-based portrait video editing method utilizing an improved 3DMM keypoint transformation formula. By applying expression deformation before head rotation (consistent with the FLAME model), precise disentanglement of expression and head pose is achieved. A Boundary Alignment Module is introduced to resolve stitching misalignments between facial and non-facial regions. The method significantly outperforms existing approaches in both expression replacement and enhancement modes.
- PersonaLive! Expressive Portrait Image Animation for Live Streaming
-
PersonaLive introduces a three-stage pipeline—"Hybrid Motion Control + Few-step Appearance Distillation + Micro-chunk Autoregressive Streaming Generation"—to compress diffusion-based portrait animation from offline models requiring 20+ denoising steps and second-level latency to a real-time system capable of 4-step denoising with 15.82 FPS and 0.253s latency. This represents a 7–22× speedup over previous diffusion methods while delivering superior temporal stability for long sequences.
- Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
-
Phantom adds a physical dynamics branch to a pretrained video diffusion model (Wan2.2-TI2V). By utilizing physics-aware embeddings extracted by V-JEPA2 as latent physical states, it jointly models visual content and the evolution of physical dynamics via bidirectional cross-attention. It significantly outperforms baselines on physical consistency benchmarks (50.4% improvement on VideoPhy PC) while maintaining visual quality.
- Physical Object Understanding with a Physically Controllable World Model
-
This paper reformulates the "world model" as a Probabilistic Graphical Model (PGM) capable of querying conditional distributions of arbitrary visual variables. Utilizing GPT-style next-token prediction, the authors efficiently train a 7-billion-parameter physically controllable world model, PSI, which describes scenes using RGB, optical flow, and camera tokens. Once trained, without any task-specific heads, PSI achieves zero-shot movable object segmentation (SpelkeBench SOTA), articulated part discovery, 3D object manipulation, and physical reasoning tasks like Visual Jenga by simply "virtually poking" pixels to observe collective motion.
- Physical Simulator In-the-Loop Video Generation
-
PSIVG is proposed as the first training-free inference-time framework that embeds a physical simulator into the video diffusion generation loop. It reconstructs 4D scenes and object meshes from a template video, generates physically consistent trajectories in an MPM simulator, guides video generation with optical flow, and ensures texture consistency of moving objects via Test-Time Consistency Optimization (TTCO), achieving a user preference rate of 82.3%.
- PhysVid: Physics Aware Local Conditioning for Generative Video
-
PhysVid proposes a physics-aware local conditioning scheme that divides videos into temporal chunks. A VLM annotates physical phenomenon descriptions for each chunk, which are then injected into the generative model via chunk-level cross-attention. At inference, "negative physics prompts" (counterfactual guidance) are introduced to guide generation away from physical violations, improving the physical common sense score on VideoPhy by approximately 33%.
- PLACID: Identity-Preserving Multi-Object Compositing via Video Diffusion with Synthetic Trajectories
-
PLACID reformulates multi-object "staging" compositing as an Image-to-Video (I2V) task: multiple objects scattered randomly are made to "move" along synthetic trajectories to a final layout. By using the last frame of the video diffusion model as the composite result, the method leverages temporal priors to preserve object identity, background, and color while significantly reducing object omissions or duplications.
- Plenoptic Video Generation
-
PlenopticDreamer formulates "re-rendering input videos along arbitrary camera trajectories" as an autoregressive, multi-in-single-out diffusion model. During the generation of each new viewpoint, the model retrieves the most relevant historical video segments from a memory bank based on 3D frustum visibility to serve as conditions. Combined with progressive context expansion and self-conditioning training, it ensures spatiotemporal consistency in "hallucinated" occluded regions across different trajectories, significantly outperforming single-view methods like ReCamMaster on view synchronization metrics in the Basic and Agibot benchmarks.
- PoseAnything: General Pose-guided Video Generation with Part-aware Temporal Coherence
-
PoseAnything enables pose-guided video generation to break free from "human-only" constraints for the first time. Given an initial frame and an arbitrary subject's skeleton sequence, it generates corresponding motion videos. It relies on a "Part-aware Temporal Coherence Module" to refine appearance consistency down to local body parts and a "Subject-Camera Motion Decoupled CFG" to achieve independent camera control. The authors also release XPose, a dataset of 50,000 non-human pose-video pairs. The method outperforms SOTA on the TikTok (human) benchmark and on custom non-human benchmarks.
- PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
-
PropFly utilizes a frozen pre-trained Video Diffusion Model (VDM) as its own "supervision source": it performs one-step denoising estimates on the same noisy latent using two different CFG scales (low and high) to obtain structure-aligned but semantically distinct "source/target" video pairs. Subsequently, a new GMFM loss is used to train an adapter to learn to propagate the "edited first frame" to the entire video sequence—requiring no paired (original, edited) datasets while significantly surpassing SOTA across multiple video editing benchmarks.
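The "two CFG scales on the same noisy latent" step can be sketched with the standard classifier-free guidance combination, `eps = eps_uncond + s * (eps_cond - eps_uncond)`. The sketch below stops at the guided noise predictions; the scale values and the function names are illustrative assumptions, and the one-step denoising estimate and the GMFM loss are not modeled.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def source_target_predictions(eps_uncond, eps_cond, low_scale=1.0, high_scale=7.5):
    """Two guided predictions from the same pair of model outputs (hence the
    same noisy latent). The low scale plays the role of the 'source' and the
    high scale the 'target'; both scale values are hypothetical."""
    source = cfg_combine(eps_uncond, eps_cond, low_scale)
    target = cfg_combine(eps_uncond, eps_cond, high_scale)
    return source, target
```

Since both predictions start from the same latent and differ only in guidance strength, they share coarse structure while differing in how strongly the text condition is expressed, which matches the "structure-aligned but semantically distinct" pairing described above.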
- ProPhy: Progressive Physical Alignment for Dynamic World Simulation
-
ProPhy attaches a "physical branch" to video diffusion models, utilizing a two-stage Mixture of Physical Experts (video-level semantic experts + token-level refinement experts) to progressively inject physical priors from text into specific spatial regions. By distilling fine-grained alignment targets from VLM attention maps, it ensures generated videos adhere better to physical laws in complex dynamic scenes such as combustion, collision, and fluids.
- RAPID: Reusing Attention Sparsity with Inter-step Adaptation for Efficient Video Diffusion
-
RAPID observes two empirical laws in video diffusion: "temporal stability" and "gradual density decay" of attention sparsity patterns. It eliminates the overhead of recomputing sparse masks at every step by performing high-fidelity importance scoring only once during the early denoising stage, caching the masks and scores for subsequent reuse. By re-thresholding cached scores in later stages for more aggressive pruning, RAPID surpasses the strongest baseline by +3.2 PSNR at equivalent density on Wan2.1-14B, while pushing speedup to 1.79× (2.01× for HunyuanVideo) in Turbo mode.
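The "score once, cache, then re-threshold" pattern can be sketched as follows. This is a toy illustration under stated assumptions: the scores are a flat list rather than attention-block scores, the linear density schedule and its endpoints are invented, and `MaskCache` is a hypothetical name.

```python
def top_density_mask(scores, density):
    """Keep the highest-scoring fraction `density` of entries (at least one)."""
    k = max(1, round(density * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(scores))]


def density_schedule(step, total_steps, start=0.5, end=0.2):
    """Hypothetical linear decay of the kept density over the denoising steps."""
    frac = step / max(1, total_steps - 1)
    return start + (end - start) * frac


class MaskCache:
    """Stores importance scores computed once at an early step and derives
    later masks by re-thresholding them, without rescoring."""

    def __init__(self, scores):
        self.scores = list(scores)

    def mask_at(self, step, total_steps):
        return top_density_mask(self.scores, density_schedule(step, total_steps))
```

Because every mask is a top-k cut of the same cached scores, later (sparser) masks are always subsets of earlier ones, so pruning only becomes more aggressive over time.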
- Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs
-
The Microsoft team proposes a real-time, streamable framework for audio-driven talking portrait video generation. A "Reference-Guided + Causal Residual" Deep Compression VAE compresses video by 768× into a compact latent space, and a block-wise autoregressive Rectified Flow Transformer then generates latents block by block. This achieves 42 FPS (over 25× faster than existing diffusion methods) with image quality comparable to or better than large models.
- Reasoning Diffusion for Unpaired Test Time Out-of-distribution Text-Image to Video Generation
-
Real-world inputs are often unpaired: text and image semantics are misaligned, and the image is not necessarily the first frame. To handle this, the paper uses an MLLM (VisionNarrator) to reason over the seemingly unrelated conditions and produce a frame-by-frame script. An AlignFormer then converts the reasoning results into frame-wise latents injected into the Wan2.1 diffusion model to generate videos that are both visually and semantically consistent.
- RecEdit-Drive: 3D Reconstruction-Guided Spatiotemporal Video Editing for Autonomous Driving Scenes
-
RecEdit-Drive integrates a 3D reconstruction model (SV3D multi-view synthesis) into a video diffusion editing pipeline. It utilizes "Spatial Feature Warping" to construct foreground object views from multiple relevant novel perspectives and "Spatiotemporal Collaborative Modeling" with Gaussian cross-frame attention to blend edited foregrounds into backgrounds. Coupled with an inference-time background noise replacement strategy, it achieves SOTA results on the nuScenes dataset for four types of editing: deletion, replacement, insertion, and repositioning, while effectively serving as data augmentation for downstream 3D detection.
- Ref4D-VideoBench: Four-Dimensional Reference-Based Evaluation of Text-to-Video Generative Models
-
Existing text-to-video (T2V) evaluations follow a "no-reference, prompt-only" paradigm, so sample-level failures cannot be attributed to a specific cause. To address this, the paper proposes Ref4D-VideoBench. Using 600 real reference videos as structured spatio-temporal evidence, it designs 12 interpretable atomic metrics across four dimensions: semantic alignment, motion consistency, event temporal order, and world knowledge. It achieves significantly higher correlation with human scores across 8 T2V models compared to no-reference baselines (e.g., world knowledge SRCC 0.847 vs. baseline ≤0.42).
- Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation
-
PoCo (Position Embedding as Context Controller) is proposed to address "reference confusion" in multi-reference multi-shot video generation—where models fail to correctly associate shots with references when reference images have highly similar appearances. By encoding additional SideInfo axes in RoPE to represent reference entity information, the method achieves SOTA cross-shot consistency on the VACE-Wan2.1-14B framework (CrossShot-FaceSim 89.35, CrossShot-DINO 92.66).
- Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
-
Reward Forcing distills bidirectional video diffusion models into few-step autoregressive student models. It employs EMA-Sink to compress historical context, preventing "frame copying," and utilizes Re-DMD to bias distribution matching gradients toward high-dynamic samples based on motion quality rewards. It achieves high-quality real-time streaming video generation at 23.1 FPS on a single H100, outperforming all same-scale baselines in VBench total scores.
- RFDM: Residual Flow Diffusion Models for Video Editing
-
RFDM transforms a 2D image-to-image (I2I) diffusion model into a frame-by-frame autoregressive video editor. By "shifting" the noise mean of the current frame relative to the previous frame's prediction, the model learns inter-frame residuals rather than full frames. This achieves temporal consistency and editing fidelity comparable to 3D spatio-temporal models with zero additional computational cost and support for arbitrary video lengths.
- SeeU: Seeing the Unseen World via 4D Dynamics-aware Generation
-
SeeU is proposed as a 2D→4D→2D learning framework: it reconstructs a 4D world representation from sparse monocular 2D frames, learns continuous and physically consistent 4D dynamics on a low-rank representation (via B-spline parameterization and physical constraints), and finally re-projects the 4D world back to 2D. A spatiotemporal context-aware video generator completes unseen regions, enabling the generation of unseen visual content across time and space.
- SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance
-
SemVideo first employs a Multi-modal Large Language Model (MLLM) to decompose video stimuli into three semantic levels: "anchor description, motion narrative, and holistic summary." It then decodes these semantics hierarchically from fMRI signals and reconstructs motion latents via tri-path attention. Finally, a text-to-video diffusion model generates the video guided by these hierarchical semantics, significantly addressing the persistent issues of "appearance mismatch" and "motion incoherence" in brain-to-video reconstruction.
- ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions
-
ShotDirector treats "how transitions should be edited" as a controllable signal, injecting parameter-level camera poses (dual-branch Plücker + extrinsic) and hierarchical editing mode-aware prompts (shot-aware mask) into a video diffusion model. This allows for generating professional multi-shot videos with cinematographic transitions such as cut-in, cut-out, shot-reverse-shot, and multi-angle scenes based on directorial intent.
- SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
-
This paper proposes the SLVMEval meta-evaluation benchmark, which tests the ability of existing T2V evaluation systems to identify quality differences in long videos (up to ~3 hours). By synthesizing controlled "high-quality vs. low-quality" video pairs from dense video captioning datasets, it reveals that humans achieve 84.7%-96.8% accuracy across 10 dimensions, while existing automated systems lag behind in 9/10 dimensions.
- SMRABooth: Subject and Motion Representation Alignment for Customized Video Generation
-
SMRABooth utilizes a self-supervised visual encoder (DINOv2) and an optical flow encoder (SEA-RAFT) to provide object-level alignment targets for "subject appearance" and "object motion," respectively. A "cross-layer + cross-timestep" sparse LoRA injection strategy is then employed to decouple the two, achieving simultaneous subject fidelity and motion consistency in DiT video diffusion models.
- SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
-
SoliReward systematically reformulates video generation reward models across "data annotation + training loss + model architecture": it uses single-item binary annotation (Pass/Fail) with cross-prompt pairing to reduce annotation noise, employs a Bradley-Terry loss with Wide Ties (BT-WT) to compress positive samples into a compact interval to suppress reward hacking, and integrates Hierarchical Progressive Query Attention (HPQA) to aggregate multi-layer VLM features. It outperforms existing baselines in RM accuracy and downstream GRPO post-training.
- Soul: Breathe Life into Digital Human for High-fidelity Long-term Multimodal Animation
-
Soul employs a single portrait, text, and audio to drive high-fidelity digital human animation. Built upon the Wan2.2-5B diffusion video backbone, it integrates audio attention and a "three-piece suite" (pivotal frames + clip overlap + threshold-aware codebook replacement) to suppress long-term drift. By utilizing step/CFG distillation and a lightweight eVAE, it achieves an 11.4× speedup. Supported by the self-constructed million-scale Soul-1M dataset and Soul-Bench, it produces identity-consistent 1080P animations up to four minutes long.
- Spatia: Video Generation with Updatable Spatial Memory
-
Spatia equips video generation models with an "updatable spatial memory" by explicitly maintaining the scene as a 3D point cloud. After each video segment is generated, the point cloud is updated using visual SLAM and then projected back to constrain the next stage of generation. This allows the model to "remember" visited locations over long sequences while enabling clean separation of static scenes and dynamic objects, explicit camera control, and 3D interactive editing.
- STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative
-
STAGE reformulates "keyframe-based multi-shot video generation" into a storyboard-anchored problem by "predicting a pair of start/end frames for each shot." By using the STEP2 model (multi-shot memory pack + dual encoding + two-stage training) to iteratively generate these pairs and delegating the completion to off-the-shelf I2V models, it significantly outperforms existing end-to-end and keyframe methods in cross-shot consistency and cinematic transitions.
- Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
-
Stand-In introduces a "conditional image branch" to pretrained video Diffusion Transformers (DiT). By utilizing Restricted Self-Attention and Conditional Position Mapping, it injects the identity of a reference face into generated videos. While training only ~1% extra parameters on 2,000 video pairs, it outperforms full-parameter fine-tuning methods in face similarity. Since it preserves the backbone, it remains plug-and-play for tasks like stylization, face swapping, and pose-guided generation.
- STARFlow-V: End-to-End Video Generative Modeling with Autoregressive Normalizing Flows
-
STARFlow-V introduces Normalizing Flows (NF) to the field of video generation. By employing a "global-local" invertible architecture for end-to-end maximum likelihood training and causal autoregressive inference, combined with a lightweight causal denoiser (flow-score matching) and video-aware Jacobi parallel solving, it demonstrates for the first time that NF can achieve quality comparable to causal diffusion baselines on 480p video while naturally unifying T2V, I2V, and V2V tasks.
- StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation
-
StereoWorld directly "converts" a pre-trained monocular video diffusion model into a stereo video generator. It uses a minimalist conditioning approach that concatenates left and right views along the frame dimension, and forces the model to learn authentic 3D structure through dual geometry-aware regularization (disparity + depth). Combined with spatio-temporal tiling for high-resolution long videos and the first 11-million-frame stereo video dataset aligned with human Interpupillary Distance (IPD), it generates geometrically consistent right-eye views from arbitrary monocular videos end-to-end (PSNR 25.98 vs. StereoCrafter 23.04).
- StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
-
StoryTailor is a zero-shot visual storytelling pipeline. It uses Gaussian Central Attention (GCA) to mitigate subject overlap and background leakage, Action-Boosted Singular Value Reweighting (AB-SVR) to amplify action semantics, and Selective Forgetting Cache (SFC) to maintain inter-frame background continuity. It generates action-rich, multi-subject image narratives on a single RTX 4090 and improves CLIP-T by 10-15% over baselines.
- SURF: Signature-Retained Fast Video Generation
-
SURF decomposes high-resolution video generation into two stages: "low-resolution preview from a pre-trained large model + lightweight Refiner upsampling." By using training-free noise reshifting, it enables large models to maintain their layout/semantic/motion "signatures" even at low resolutions. For Wan 2.1, it achieves a 12.5× speedup for 720p video generation with almost no loss in quality.
- SWIFT: Sliding Window Reconstruction for Few-Shot Training-Free Generated Video Attribution
-
SWIFT defines the "few-shot training-free generated video attribution" task for the first time. By leveraging the "multi-frame pixel \(\leftrightarrow\) single-frame latent" temporal mapping in 3D VAEs, it performs normal and corrupted reconstructions via fixed-length sliding windows. The ratio of reconstruction losses on overlapping frames serves as the attribution signal. It achieves over 90% average attribution accuracy with only 20 samples, and 94% across five models.
- SwitchCraft: Training-Free Multi-Event Video Generation with Attention Controls
-
SwitchCraft is a training-free multi-event video generation framework that achieves clear temporal transitions and scene consistency without modifying model weights. It introduces Event-Aligned Query Steering (EAQS) to align frame-level attention with corresponding event prompts and an Auto-Balance Strength Solver (ABSS) to adaptively balance guidance intensity.
- SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation
-
SymphoMotion is a unified motion control framework that simultaneously and precisely controls camera motion and 3D object trajectories in videos through Camera Trajectory Control (CTC) and Object Dynamic Control (ODC) mechanisms. The authors also construct RealCOD-25K, a large-scale, real-world, jointly annotated dataset of 25K samples.
- SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation
-
SynMotion performs "motion customized video generation" by adapting at both the semantic level (decoupling text embeddings into subject/motion paths with learnable residues) and the visual level (inserting lightweight motion LoRA adapters into MM-DiT). Combined with an alternating optimization strategy for subject and motion embeddings, it enables motions learned from a few example videos to be transferred to arbitrary subjects, such as a "crocodile doing a handstand" or "Marilyn Monroe punching," outperforming SOTA in both T2V and I2V settings.
- Tea-Adapter: Teacher Adapter for Efficient Conditional Generation
-
Tea-Adapter is a plug-and-play adapter that employs "inverse distillation" to transfer control knowledge from a small, efficiently fine-tuned teacher video diffusion model with multi-condition control capabilities into a frozen large student video diffusion model. It utilizes a "Mixture of Condition Experts (MCE)" layer for dynamic routing of multiple conditions within a unified architecture and a "Feature Propagation Module" to ensure cross-frame temporal consistency, achieving high-fidelity, composable multi-condition controllable video generation with low VRAM requirements.
- TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models
-
TEAR is the first automated red-teaming framework targeting temporal vulnerabilities in T2V models. Using a temporal-aware test generator optimized in two stages and an iterative refinement model, it generates prompts that are textually harmless but trigger harmful videos through temporal dynamics, achieving attack success rates above 80% on both open-source and commercial T2V models.
- TempoControl: Temporal Attention Guidance for Text-to-Video Models
-
TempoControl performs gradient optimization on cross-attention maps during the denoising process of T2V diffusion models. By using a "Correlation + Magnitude + Entropy" loss to align the temporal attention of specific keywords with user-provided masks, it achieves fine-grained temporal control (e.g., "making an object appear at a specific second") without re-training or annotated data.
- TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction
-
TempoMaster reformulates long video generation as "next-frame-rate prediction": it first generates a low-frame-rate global blueprint via bidirectional attention, then hierarchically raises the frame rate to fill in details. Since segments within each level can be generated in parallel, it achieves both long-range temporal consistency and inference efficiency, reaching SOTA on VBench-Long and in human evaluations.
- TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
-
TGT associates each point trajectory in text-to-video generation with a segment of local text. It utilizes a plug-and-play "Location-Aware Cross-Attention (LACA)" to align "which object, appearance, and motion" to the trajectory neighborhood. Combined with a Dual CFG strategy for global/local guidance control, it reduces trajectory error (EPE) by nearly half compared to the strongest baseline while maintaining the visual quality of the foundation model.
- The Devil is in the Details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
-
KeyTailor uses a keyframe-driven detail injection strategy (garment dynamic enhancement + collaborative background optimization) to significantly improve garment fidelity and background integrity in video virtual try-on without altering the DiT architecture. The study also introduces ViT-HD, a high-definition dataset containing 15K samples.
- Thermal Diffusion Matters: Infrared Spatial-Temporal Video Super-Resolution through Heat Conduction Priors
-
THERIS treats pixel-wise grayscale sequences of infrared videos as temperature fields satisfying the heat conduction equation. It utilizes frequency-domain thermal diffusion kernels for frame interpolation (TDIM), Mamba modules modulated by "thermal prompts" for spatial-temporal detail recovery (TSSM), and a loss function (TFM Loss) that enforces discrete heat equations to achieve SOTA in infrared spatial-temporal video super-resolution.
- Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
-
This paper introduces "Thinking with Video," a new multimodal reasoning paradigm in which video generation models such as Sora-2 depict the reasoning process within video frames. The authors construct VideoThinkBench, a five-level capability hierarchy covering "Geometric Intuition → Visual Induction → Abstract Rules → Spatial Planning → Language Reasoning." Evaluation reveals that Sora-2 outperforms GPT-5 by ~10% on "eyeballing" geometry puzzles and achieves 92% accuracy on MATH via audio output, demonstrating that video generation models can serve as unified reasoning vehicles for understanding and generation.
- Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers
-
LDF-VFI transforms Video Frame Interpolation (VFI) from "independent triplet processing" to "unified holistic modeling." By using an auto-regressive Diffusion Transformer, it synthesizes all frames within a temporal block simultaneously. Coupled with skip-concatenate sampling to suppress auto-regressive error accumulation, and sparse attention with tiled VAE for training-free 4K generalization, it achieves SOTA in long-video temporal consistency.
- TV2TV: A Unified Framework for Interleaved Language and Video Generation
-
TV2TV utilizes a unified Transfusion-style model to decompose video generation into an interleaved process: first "thinking" about what will happen next in text, then "acting" it out in pixels. This allows the language tower to handle semantic decisions while the video tower manages rendering. It simultaneously surpasses baselines such as "direct T2V" and "think-then-act" in both visual quality (91% win rate in human evaluation) and fine-grained controllability (+19 points in instruction following accuracy).
- U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
-
U-Mind is the first unified, real-time, full-stack multimodal interaction system supporting high-level reasoning dialogue and instruction following. It jointly generates text, speech, and motion within a single interaction loop and renders them into realistic videos, balancing reasoning retention and cross-modal alignment through rehearsal-driven learning and text-first decoding strategies.
- UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
-
UniAVGen proposes a joint audio-video generation framework based on a symmetric dual-branch DiT. By leveraging an asymmetric cross-modal interaction mechanism and a face-aware modulation module, it achieves precise spatio-temporal synchronization. With only 1.3M training samples, it comprehensively outperforms competitors using 30M data in terms of lip-sync, timbre consistency, and emotional consistency.
- Unified Camera Positional Encoding for Controlled Video Generation
-
This paper proposes UCPE, which unifies the complete camera geometry (6-DoF pose + intrinsics + lens distortion) into Transformer attention. It leverages "Relative Ray Encoding" to lower the positional encoding from the camera level to the ray level to accommodate non-linear lenses like fisheye and wide-angle. Additionally, "Absolute Orientation Encoding" is introduced to provide global references for pitch and roll. Using a Spatial Attention Adapter with <1% parameters to inject these into pre-trained video DiTs, UCPE achieves state-of-the-art results in both controllability and image quality for camera-controlled text-to-video generation.
- UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
-
UnityVideo integrates three types of tasks (text-to-video, controllable generation, modality estimation) and five auxiliary modalities (depth, optical flow, DensePose, skeleton, segmentation) into a single 10B Diffusion Transformer. By unifying tasks via dynamic noise scheduling and modalities via a Modality-Aware AdaLN table and In-Context Learner, the model achieves faster convergence and significant zero-shot generalization after joint training on 1.3M multi-modal samples, matching or surpassing specialized SOTA models across multiple tasks.
- V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
-
V-RGBX first inverse-renders a video into intrinsic channels such as albedo, normal, material, and irradiance. It then utilizes a video DiT with interleaved conditional injection to re-synthesize these into RGB. This allows users to modify a single intrinsic property (e.g., changing material or relighting) on sparse keyframes, which the model then stably propagates as a physically consistent edit throughout the entire video.
- VABench: A Comprehensive Benchmark for Audio-Video Generation
-
VABench is a comprehensive benchmark for "synchronized audio-video generation," covering three tasks: Text-to-Audio-Video (T2AV), Image-to-Audio-Video (I2AV), and Stereo Generation across seven content categories. It employs a dual-track evaluation system involving 15 fine-grained metrics from "Expert Models + Multimodal Large Language Models (MLLMs)" — plus 9 stereo acoustic metrics — to perform reference-free evaluation on end-to-end models like Veo3 / Sora2 / Wan2.5 and decoupled "Video Generator + V2A" pipelines. User studies verify a high correlation between these scores and human preferences.
- Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision
-
Vanast proposes a unified framework that simultaneously performs garment transfer and human animation generation within a single stage. Utilizing a Dual Module architecture (HAM + GTM) and a three-stage synthetic data construction pipeline, it achieves a PSNR of 17.95dB (+5.5dB vs. the best two-stage solution) and an LPIPS of 0.237 on Internet datasets.
- Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure
-
Addressing the issue where VLMs often produce "chaotic" motion when directly animating SVGs, Vector Prism first employs multiple rendering views to obtain weak labels for each primitive from a VLM. It then utilizes Dawid-Skene statistical inference to aggregate these noisy labels into reliable semantic groupings and reconstructs an "animatable" SVG hierarchy. This allows the VLM to generate animations at a meaningful part granularity, outperforming AniClipart, GPT-5, and even commercial video generation models like Sora 2 in terms of instruction alignment and visual quality.
- VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
-
This paper proposes VerseCrafter, a video world model based on a 4D geometric control representation (static background point cloud + per-object 3D Gaussian trajectories). By injecting 4D control signals into a frozen Wan2.1-14B video diffusion model via a lightweight GeoAdapter, it achieves precise and decoupled control over camera and multi-object motions. Additionally, a large-scale real-world dataset, VerseControl4D, containing 35K samples, is constructed.
- VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
-
VGA-Bench expands text-to-video (T2V) evaluation from "realism" to "aesthetics." It utilizes a three-dimensional framework (Aesthetic Quality / Aesthetic Tag / Generation Quality) consisting of 52 fine-grained sub-dimensions, 1016 dimension-aligned prompts, and 60,000 videos generated by 12 models. By training three dedicated networks—VAQA-Net, VTag-Net, and VGQA-Net—for end-to-end automatic scoring, it eliminates dependency on external models and provides cross-model evaluations aligned with human judgment.
- Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
-
This work upgrades "Next-Event Prediction" from text to video: a VLM first reasons about what should happen next, and a video diffusion model (VDM) then visualizes it. It proposes Joint-GRPO, a two-stage reinforcement learning framework that coordinates the independent reasoning and generation models through a shared reward, achieving SOTA results in both text prediction and video generation on procedural and predictive benchmarks.
- Video Generation with Stable Transparency via Shiftable RGB-A Distribution Learner
-
Addressing the issues of poor quality and unstable transparency caused by the entanglement of RGB and alpha distributions in transparent video (RGB-A) generation, this paper proposes the "Shiftable RGB-A Distribution Learner." It uses a transparency-aware bidirectional diffusion loss in the latent space to push the alpha distribution away while preserving the RGB distribution and employs a Gaussian elliptical mask in the noise space to shift the noise mean for transparency guidance and controllability. Combined with a self-constructed high-quality dataset, it leads in visual quality, transparency rendering, and inference speed (15x faster than SOTA).
- VideoRealBench: A Chain-of-Thought Realism Evaluation Benchmark for Generated Human-Centric Videos
-
Addressing the issue where existing evaluators fail to reliably score "realism" in generated videos, the authors manually re-annotated a dataset of 3,297 human-centric generated videos, VideoRealDataset (including three-step chain-of-thought rationales). This dataset was used to LoRA-finetune an evaluator, VideoRealEval, which significantly outperforms general large models like Gemini-2.5-pro and InternVL3.5-241B, as well as prior specialized evaluators, in correlation with human preferences (PLCC \(57.07\%\) / SROCC \(56.78\%\)).
- VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
-
VideoWeaver extends single-view video-to-video (V2V) style transfer to multiple synchronized cameras. By injecting 4D point cloud coordinates predicted by Pi3 into the latent space of a flow model, it unifies the appearance across views. Coupled with "heterogeneous timestep" training, the model learns both joint and conditional distributions, enabling consistent batch re-rendering of multi-view embodied demonstration videos while preserving the robot's action trajectories.
- VidTAG: Temporally Aligned Video to GPS Geolocalization with Denoising Sequence Prediction at a Global Scale
-
VidTAG reformulates "video geolocalization" as a frame-to-GPS coordinate retrieval problem. By using dual encoders (CLIP+DINOv2) for frame features, TempGeo for inter-frame temporal alignment, and GeoRefiner for trajectory denoising, it generates temporally coherent GPS trajectories on a global scale, improving on GeoCLIP by approximately 20% at the 1km threshold.
- VISTA: A Test-Time Self-Improving Video Generation Agent
-
VISTA is a multi-agent system that iteratively improves text-to-video quality at test time through a "refine-critique" loop without updating model weights. It decomposes user intent into structured temporal scripts, selects the best video via a pairwise tournament, identifies deficiencies using a "jury" of visual/audio/context agents, and rewrites prompts using a reasoning agent. It achieves up to 60% pairwise win rate against SOTA models like Veo 3, with a 66.4% human preference.
- VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
-
VIVA utilizes a VLM "instructor" to encode instructions, initial frames, and optional reference images into visually grounded multimodal conditions for a video DiT. It employs "Edit-GRPO" post-training (featuring triple rewards for instruction following, source fidelity, and human preference) for alignment. Combined with a self-constructed dataset of 1.5 million synthetic pairs, VIVA outperforms open-source SOTA and approaches commercial models like Runway Gen-4 Aleph on VIE-Bench in terms of instruction following and editing quality.
- VMonarch: Efficient Video Diffusion Transformers with Structured Attention
-
VMonarch identifies that attention maps in Video DiTs naturally exhibit a high-rank, block-diagonal sparse structure, which can be approximated using Monarch structured matrices. By aligning the spatio-temporal dimensions with Monarch factors to achieve sub-quadratic complexity and incorporating first-frame recalculation alongside a fused Online-Entropy FlashAttention kernel, it reduces attention FLOPs by 17.5× and achieves over 5× speedup for long videos on VBench with virtually no performance drop.
- VSRELL: A Simple Baseline for Video Super-Resolution and Enhancement in Low-Light Environment
-
VSRELL jointly solves "Low-Light Enhancement (LLE)" and "Video Super-Resolution (VSR)" tasks, which are traditionally decoupled, using a synchronous decoupling approach within a single CNN framework. It simultaneously models illumination and noise within a temporal window using an INCO module and injects illumination priors into deformable alignment while applying dynamic decay to memory features via an ISFP module. Ultimately, with only 6.3M parameters, it improves the average PSNR on REDS4 from ~20.6 dB (achieved by cascaded/all-in-one methods) to 25.94 dB.
- What Are You Doing? A Closer Look at Controllable Human Video Generation
-
The authors observe that existing benchmarks for controllable human video generation (TikTok, TED-Talks, HumanVid) are overly small and narrow. They construct the WYD benchmark consisting of 1,544 meticulously annotated clips (across 9 major and 56 sub-categories) and introduce two human-centric metrics, pICD and pAPE. Systematic evaluation of 8 SOTA open-source models quantitatively exposes systematic performance gaps in multi-person scenarios, human-object interactions, complex environments, and intense motions.
- When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models
-
The core idea of NUMINA is to avoid retraining video diffusion models by extracting a "countable instance layout" from DiT attention during inference. It detects inconsistencies between the numeral prompt and the current layout, applies conservative layout modifications (additions or deletions), and uses the corrected layout to guide re-generation, significantly improving the adherence of text-to-video models to numerical constraints like "two apples" or "eight ducks."
- YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal
-
YOSE is a plug-and-play fine-tuning framework: it transforms DiT-based video object removal (e.g., MiniMax Remover) from "dense computation on all spatio-temporal tokens" to "processing only tokens within masked regions while using a lightweight module to simulate the external influence on self-attention." This allows inference time to decrease approximately linearly with mask area, achieving \(2.5\times\) speedup in 70% of real-world scenarios with almost no loss in image quality.
- Yume1.5: A Text-Controlled Interactive World Generation Model
-
Yume1.5 transforms a single image or text prompt into an infinite world video that can be freely explored via keyboard. It leverages "Spatio-Temporal and Channel Joint Compression" to save VRAM and "Self-Forcing Distillation" to compress inference to 4 steps (8 seconds). Furthermore, it allows for text-triggered events in the world, improving the Instruction Following score from 0.657 (previous work) to 0.836.