
🎬 Video Generation

📷 CVPR2025 · 85 paper notes

📌 Same area in other venues: 📷 CVPR2026 (182) · 🔬 ICLR2026 (98) · 💬 ACL2026 (4) · 🧪 ICML2026 (32) · 🤖 AAAI2026 (11) · 🧠 NeurIPS2025 (23)

🔥 Top topics: Video Generation ×35 · Diffusion Models ×26 · Layout & Composition ×4 · Super-Resolution ×3 · Compression ×3

4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion: This work proposes 4Real-Video, a 4D video generation framework based on a two-stream architecture. By splitting video tokens into parallel time and view streams and introducing hard/soft synchronization layers to harmonize information between them, it generates high-quality \(8 \times 8\) spatio-temporal video grids in approximately 1 minute, outperforming existing methods in visual quality and multi-view consistency.
AnimateAnything: Consistent and Controllable Animation for Video Generation: A two-stage controllable video generation framework is proposed. The first stage unifies different control signals (camera trajectories, user drag-and-drop annotations, reference videos) into a frame-by-frame optical flow representation. The second stage uses the unified optical flow to guide a DiT-based video diffusion model to generate the final video, introducing a frequency-domain stabilization module to suppress flickering under large motions.
Articulated Kinematics Distillation from Video Diffusion Models: This paper proposes the AKD framework, which reduces the degrees of freedom of 3D asset motion from full space to a small number of joint angles through skeletal joint parameterization, then distills text-aligned joint motion sequences using SDS gradients from a video diffusion model (CogVideoX), and further ensures physical plausibility via physical simulation.
BF-STVSR: B-Splines and Fourier—Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution: The BF-STVSR framework is proposed to model temporal motion interpolation using a B-spline Mapper and capture spatial high-frequency details using a Fourier Mapper, achieving SOTA performance in continuous spatial-temporal video super-resolution without relying on external optical flow networks.
Can Text-to-Video Generation Help Video-Language Alignment?: Proposes the SynViTA framework to explore whether synthetic videos generated by text-to-video (T2V) models can improve video-language alignment (VLA). By addressing semantic inconsistency and appearance bias in synthetic videos through alignment quality-based sample weighting and semantic consistency regularization, it achieves an improvement of over 4 percentage points on temporally challenging tasks.
ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer: ConMo proposes a zero-shot motion transfer framework. By disentangling the composite motion in a reference video into independent subject motions and background (camera) motion, and then controllably recomposing these motions during target video generation, it enables various applications such as multi-subject motion transfer, semantic/shape transformation, subject removal, and camera motion simulation. It significantly outperforms existing methods in motion fidelity and text alignment.
Dynamic Camera Poses and Where to Find Them: Proposes DynPose-100K—a large-scale dataset containing 100K dynamic internet videos and their camera pose annotations, achieved through a video filtering pipeline combining specialist models with a VLM, and a pose estimation pipeline integrating state-of-the-art point tracking, dynamic masking, and global BA.
DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes: DynamicScaler is a training-free unified framework that synthesizes panoramic dynamic scenes at arbitrary resolutions and aspect ratios via an Offset Shifting Denoiser (OSD) and Global Motion Guidance (GMG). It supports both conventional panorama and 360° field-of-view (FoV) video generation, including long-duration and loopable videos, while maintaining a constant VRAM footprint.
Exploring Temporally-Aware Features for Point Tracking: This work proposes Chrono, a temporally-aware feature backbone designed for point tracking. By inserting temporal adapters (2D convolutional downsampling + 1D local temporal attention + 2D convolutional upsampling) between the Transformer blocks of DINOv2, Chrono achieves state-of-the-art performance in a refiner-free setting using only simple feature matching (soft-argmax).
FADE: Frequency-Aware Diffusion Model Factorization for Video Editing: Proposes FADE, a training-free video editing method. By analyzing the frequency roles (sketching vs. sharpening) of each transformer block in T2V models, it leverages spectrum-guided modulation to separate preserved and edited content in the frequency domain, achieving high-quality appearance and motion editing.
FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance: FlashMotion proposes a three-stage training framework to distill trajectory-controllable video generation from multi-step denoising to few-step inference (4-8 steps). By first training a trajectory adapter, then distilling the generator, and finally fine-tuning the adapter with a hybrid diffusion and adversarial objective, this strategy significantly accelerates inference while preserving video quality and trajectory accuracy.
From Slow Bidirectional to Fast Autoregressive Video Diffusion Models: CausVid distills a pre-trained bidirectional video diffusion Transformer into an autoregressive 4-step causal generator through asymmetric distillation. Combined with ODE initialization and KV caching, it achieves streaming video generation at 9.4 FPS (160× faster than CogVideoX) and ranks first on the VBench-Long benchmark with a score of 84.27.
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control: GEN3C proposes a video generation framework guided by a 3D cache (point cloud cache). By predicting depth for seed images and unprojecting them into 3D point clouds, it renders the 3D cache into 2D condition maps according to user-specified camera trajectories when generating subsequent frames, thereby achieving precise camera control and cross-frame 3D consistency.
Generative Inbetweening through Frame-wise Conditions-Driven Video Generation: This work proposes FCVG, which extracts matched line segments from the two keyframes and linearly interpolates them frame by frame to obtain frame-wise conditions. Injecting these conditions into the SVD video generation model substantially reduces the ambiguity between the forward and backward paths in generative inbetweening, yielding temporally stable video interpolation.
Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency: This paper proposes a geometry-guided online video view synthesis method. It constructs view- and temporally-consistent depth representations through progressive depth map optimization and Truncated Signed Distance Field (TSDF) accumulation, and subsequently uses this depth to guide a pre-trained image blending network, achieving highly efficient and consistent novel-view video synthesis.
HOIGen-1M: A Large-Scale Dataset for Human-Object Interaction Video Generation: HOIGen-1M is the first million-scale high-quality dataset designed for Human-Object Interaction (HOI) video generation. It addresses HOI video data scarcity and description hallucination through an efficient data filtering pipeline and a Mixture-of-Multimodal-Experts (MoME) captioning strategy, while introducing two evaluation metrics, CoarseHOIScore and FineHOIScore, to quantify the quality of interaction in generated videos.
HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation: HunyuanPortrait proposes the first implicit condition portrait animation framework based on Stable Video Diffusion, achieving high-fidelity control of fine facial dynamics and robust identity consistency through an intensity-aware motion encoder and an ID-aware multi-scale adapter.
HyperNVD: Accelerating Neural Video Decomposition via Hypernetworks: HyperNVD proposes using a hypernetwork to dynamically generate the parameters of Implicit Neural Representations (INR) based on video embeddings encoded by VideoMAE. This establishes a universal video decomposition model across videos, which reaches the same PSNR more than 30 minutes sooner than training from scratch on new videos, while improving the final performance by 0.8 dB on average.
Identity-Preserving Text-to-Video Generation by Frequency Decomposition: ConsisID proposes a frequency-decomposition-based DiT control scheme. It decouples facial features into low-frequency global information and high-frequency intrinsic identity information, injecting them into different positions of the DiT. This achieves tuning-free, identity-preserving text-to-video generation, significantly outperforming existing methods in identity preservation, text correlation, and visual quality.
IDOL: Instant Photorealistic 3D Human Creation from a Single Image: IDOL achieves instant (<1s) high-fidelity animatable 3D human reconstruction from a single image input by constructing HuGe100K, a large-scale multi-view dataset of 100k human subjects, and training a Transformer-based feed-forward model, significantly outperforming existing methods in quality and generalization.
Improved Video VAE for Latent Video Diffusion Model: This paper proposes IV-VAE to resolve the issues in existing video VAEs where image weight initialization suppresses the learning of temporal compression, and causal convolution leads to unbalanced inter-frame performance. By introducing a Keyframe Temporal Compression (KTC) architecture and Grouped Causal Convolution (GCConv), it achieves SOTA video reconstruction and generation quality on multiple benchmarks.
InterDyn: Controllable Interactive Dynamics with Video Diffusion Models: InterDyn proposes utilizing video diffusion models as implicit physics engines. By introducing an interactive control branch (ControlNet-like) on top of Stable Video Diffusion, the method generates physically plausible interactive dynamics videos from a single image and driving motion signals, outperforming the baseline CosHand by 77% in terms of the FVD metric on the Something-Something-v2 dataset.
Learning from Streaming Video with Orthogonal Gradients: To address the gradient redundancy and model collapse caused by highly correlated consecutive frames in streaming video learning, this work proposes an Orthogonal Optimizer. It decorrelates updates by keeping only the component of the current gradient that is orthogonal to historical gradients, and it can be integrated into SGD/AdamW. It significantly reduces the performance loss incurred when moving from shuffled training to sequential training across three scenarios: DoRA, VideoMAE, and future prediction.
Learning Temporally Consistent Video Depth from Video Diffusion Priors: This work proposes ChronoDepth, a video depth estimation method based on Stable Video Diffusion (SVD). By independently sampling noise levels per frame during training and using noise-free preceding frames as context during inference (Consistent Context-Aware Strategy), the method achieves state-of-the-art (SOTA) temporal consistency while maintaining spatial accuracy, ranking first on average for the MFC metric.
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis: LeviTor introduces 3D object trajectory control into image-to-video synthesis for the first time. By clustering object masks into a small number of representative points via K-means and incorporating depth information as control signals injected into the SVD model, it achieves precise control over complex 3D motions such as occlusion, forward/backward movement, and orbiting, reaching FID/FVD of 25.41/190.44 on DAVIS.
Presto: Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation: Presto proposes a Segmented Cross-Attention (SCA) strategy, which segments latent states along the temporal dimension and performs cross-attention with corresponding sub-captions respectively. Combined with a meticulously curated 261K high-quality long video dataset, LongTake-HD, it enables the generation of 15-second long-range coherent videos with rich content, achieving a Semantic Score of 78.5% and a Dynamic Degree of 100% on VBench.
LongDiff: Training-Free Long Video Generation in One Go: Through theoretical analysis, LongDiff reveals two key challenges when short-video models generate long videos: temporal position blurring and information dilution. It proposes two simple temporal attention modification strategies, Position Mapping (GROUP+SHIFT) and Informative Frame Selection (IFS), enabling short-video models to generate high-quality long videos in one go without training.
Mimir: Improving Video Diffusion Models for Precise Text Understanding: Mimir proposes an end-to-end training framework that losslessly fuses the strong text understanding capabilities of a decoder-only LLM (Phi-3.5) with the stable features of a traditional text encoder (T5) through a meticulously designed Token Fuser. This significantly improves the text understanding accuracy of video diffusion models, achieving a substantial lead over existing methods, especially in multi-object, spatial relation, and temporal understanding.
MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling: MIMO proposes a character video synthesis framework based on spatial decomposed modeling, which decomposes 2D videos into three spatial components (human, scene, and occluder) along 3D depth. Through decoupled encoding and compositional decoding, it achieves flexible control over character identity, 3D motion, and interactive scenes, significantly outperforming prior methods on complex motions and scene interactions.
Mind the Time: Temporally-Controlled Multi-Event Video Generation: This work proposes MinT, the first multi-event video generator that supports event temporal control. By utilizing Rescaled RoPE (ReRoPE) position embedding to bind event descriptions to specific time periods, MinT achieves smooth and coherent multi-event video synthesis through fine-tuning on a pre-trained video DiT.
MotiF: Making Text Count in Image Animation with Motion Focal Loss: This paper proposes Motion Focal Loss (MotiF), which spatially weights the diffusion loss using motion heatmaps generated from optical flow. This guides the model to focus on high-motion regions, significantly enhancing text-following and motion quality in Text-Image-to-Video generation, and constructs the TI2V-Bench evaluation benchmark.
Motion Modes: What Could Happen Next?: This work proposes Motion Modes, a training-free method that explores the latent distribution of a pretrained image-to-video generator using four guidance energy functions. It discovers multiple plausible and diverse motion modes of objects from a single image while decoupling object motion from camera motion.
Motion Prompting: Controlling Video Generation with Motion Trajectories: By training ControlNet with spatio-temporally sparse/dense point trajectories as "motion prompts," a single model achieves diverse motion control capabilities—including object control, camera control, motion transfer, and drag-and-drop editing—while demonstrating the emergence of realistic physical behaviors.
MotionPro: A Precise Motion Controller for Image-to-Video Generation: MotionPro is proposed to achieve fine-grained, controllable image-to-video generation that distinguishes between object and camera motions, utilizing dual signals of region-wise trajectories and motion masks.
MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation: This paper proposes MotionStone, which decouples video motion into object motion and camera motion dimensions by training an independent motion intensity estimator. This decoupled motion is then injected into a Diffusion Transformer to achieve fine-grained, motion-intensity-controllable I2V generation.
MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation: This paper introduces MovieBench, the first hierarchical dataset designed for movie-level long video generation. It provides a three-level annotation structure (movie-scene-shot) that includes character portraits, subtitles, and audio. Based on this, four benchmark tasks are defined (text-to-keyframe, identity-customized long video, keyframe-conditioned video, and audio-driven speaking head generation), which reveal significant challenges for existing models in multi-scene narrative consistency.
Multi-subject Open-set Personalization in Video Generation: This work proposes Video Alchemist, which integrates multi-subject, open-set video personalization directly into the Diffusion Transformer architecture, supporting foreground object and background customization without requiring test-time optimization.
Navigation World Models: This paper proposes Navigation World Model (NWM), a 1-billion-parameter Conditional Diffusion Transformer (CDiT) jointly trained on multiple robotic navigation datasets and unlabeled Ego4D videos. By predicting future visual observations given specific actions, NWM simulates navigation trajectories, which can be used for MPC planning or ranking trajectories from external policies (such as NoMaD). It significantly outperforms existing navigation policies on the RECON dataset, achieving an ATE of 1.13 and an RPE of 0.35.
NeuS-V: Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification: Proposes NeuS-V, the first framework evaluating the temporal consistency of text-to-video (T2V) models using formal verification (temporal logic + probabilistic model checking). It translates text prompts into temporal logic specifications, scores atomic propositions using VLMs, constructs video automata, and formally verifies the satisfaction probability. It achieves a Pearson correlation of 0.71 with human annotations on Gen-3 (compared to only 0.47 for VBench).
One-Minute Video Generation with Test-Time Training: This paper introduces Test-Time Training (TTT) layers into a pretrained Diffusion Transformer. By capitalizing on the high expressiveness of TTT layers, which employ neural networks as hidden states, the proposed method achieves the capability of generating coherent one-minute long videos from text storyboards, outperforming baselines like Mamba 2 and Gated DeltaNet by 34 Elo points in human evaluations.
Optical-Flow Guided Prompt Optimization for Coherent Video Generation: This paper proposes MotionPrompt, a training-free inference-time guidance method for video diffusion models. By optimizing learnable token embeddings in combination with an optical flow discriminator, it enhances the temporal consistency and motion smoothness of video generation.
OSV: One Step is Enough for High-Quality Image to Video Generation: Proposes a two-stage training framework, OSV, which combines GAN adversarial training and consistency distillation to achieve high-quality single-step image-to-video generation, alongside a novel video discriminator that bypasses decoding.
Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models: StEvo-Bench proposes a benchmark to evaluate the capabilities of video world models in "unobserved state evolution"—testing whether world models can continue to correctly reason about state changes when physical processes are unobserved (due to camera movement, occlusion, or turning off lights). The results reveal a severe "out of sight, out of mind" deficiency, with all current frontier models (e.g., Veo 3, Sora 2 Pro) achieving task success rates of less than 10%.
Parallelized Autoregressive Visual Generation: PAR (Parallelized Autoregressive) analyzes visual token dependencies and generates weakly dependent, spatially distant tokens in parallel while keeping sequential generation for strongly dependent local tokens, achieving a 3.6-9.5x speedup with almost no loss in quality.
PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-Wise Video Super-Resolution: PatchVSR is the first to employ a pre-trained video diffusion model (T2V) for patch-wise video super-resolution. By leveraging a dual-branch adapter (local patch branch + global context branch) and a training-free multi-patch joint modulation scheme, it achieves high-fidelity 4K video super-resolution based on a 512×512 resolution base model while significantly improving computational efficiency.
Pathways on the Image Manifold: Image Editing via Video Generation: Frame2Frame (F2F) reformulates image editing as a video generation task. It leverages an image-to-video model to generate a smooth temporal pathway on the image manifold from the source image to the target edit. By using a VLM to generate temporal editing captions and automatically select frames, F2F achieves a SOTA balance between editing precision and image fidelity.
PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation: PhyT2V leverages the Chain-of-Thought (CoT) and step-back reasoning capabilities of LLMs to iteratively analyze discrepancies between generated videos and physical rules, thereby optimizing text prompts. This improves physical rule adherence in existing T2V models by up to 2.3 times without requiring retraining.
PoseTraj: Pose-Aware Trajectory Control in Video Diffusion: This work proposes PoseTraj, a pose-aware trajectory-controlled video diffusion model. By leveraging a two-stage pose-aware pre-training (utilizing the synthetic dataset PoseTraj-10K and 3D bounding box intermediate supervision) and camera motion decoupling fine-tuning, PoseTraj achieves 3D-aligned rotational motion video generation from 2D trajectories.
ReCapture: Generative Video Camera Controls for User-Provided Videos Using Masked Video Fine-Tuning: ReCapture enables camera trajectory control for user-provided videos through a two-stage approach. It first generates a rough anchor video with the new camera trajectory using depth point cloud rendering or a multi-view image diffusion model, and then repairs and completes it using masked video fine-tuning (spatiotemporal LoRA). This approach maintains the original scene motion while enabling the video to be viewed from completely new perspectives.
SAW: Toward a Surgical Action World Model via Controllable and Scalable Video Generation: Proposes SAW (Surgical Action World), which drives a video diffusion model using four lightweight conditioning signals (language prompts, reference frames, tissue functional maps, and tool trajectories) to achieve controllable and scalable surgical action video generation for rare action augmentation and surgical simulation.
Semantic Satellite Communications for Synchronized Audiovisual Reconstruction: This paper proposes an adaptive multimodal semantic transmission system tailored for satellite communication scenarios. By employing a dual-stream generative architecture (Video-to-Audio / Audio-to-Video) that flexibly switches transmission pathways, combined with a dynamic knowledge base update mechanism and an LLM agent decision-making module, high-fidelity synchronized audiovisual reconstruction is achieved under extremely limited satellite bandwidth constraints.
ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models: ShotAdapter proposes a lightweight framework that converts a pretrained single-shot T2V model into a generator supporting text-to-multi-shot video generation (T2MSV) with only about 5000 fine-tuning iterations. This is achieved by introducing learnable "transition tokens" and a local attention masking strategy, enabling multi-shot video generation with consistent character identities and independently controllable shots.
SketchVideo: Sketch-Based Video Generation and Editing: Based on the DiT video generation architecture, this work proposes a memory-efficient sketch conditioning network and an inter-frame attention mechanism. It achieves precise spatial layout and geometric detail control via 1-2 sketch keyframes, while supporting sketch-based local video editing.
Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling: STG (Spatiotemporal Skip Guidance) constructs an implicit weak model, a degraded version of the original model, by selectively skipping spatiotemporal layers of the Transformer, and uses it for self-perturbed guidance. This improves the generation quality of video diffusion models without additional training while maintaining sample diversity and motion dynamics, overcoming a fundamental flaw of CFG, which causes a drop in diversity and dynamics in video generation.
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text: Proposes StreamingT2V, an autoregressive text-to-long video generation method, which achieves seamless, highly dynamic video generation of over 2 minutes (\(1200+\) frames) through a Conditional Attention Module (CAM) for short-term memory and an Appearance Preservation Module (APM) for long-term memory.
StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models: This paper proposes StreetCrafter, which utilizes LiDAR point cloud rendering as a pixel-level condition to control a video diffusion model, achieving precise camera-controlled novel view synthesis for street views. The learned generative priors are then distilled into dynamic 3DGS representations to enable real-time rendering.
Taming Teacher Forcing for Masked Autoregressive Video Generation: MAGI proposes the Complete Teacher Forcing (CTF) paradigm, which conditions on fully observed frames rather than masked frames during training. This eliminates the training-inference gap, improves FVD by 23%, and enables the generation of over 100 coherent video frames while being trained on only 16 frames.
Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation: This work proposes Teller, the first real-time streaming audio-driven portrait animation framework based on an autoregressive Transformer. By leveraging RVQ to discretize facial motion into tokens and combining it with an efficient temporal module to refine body details, Teller achieves 25 FPS real-time generation speed (requiring only 0.92s to generate a 1s video, compared to 20.93s with Hallo) while delivering animation quality comparable to diffusion models.
The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation: RAPO proposes a retrieval-augmented prompt optimization framework. By retrieving relevant modifiers from a relation graph constructed from training data, fine-tuning an LLM to reconstruct sentence structures, and utilizing a discriminator to select the optimal prompt, it converts short user prompts into optimized prompts aligned with the training data distribution. This improves multi-object generation on VBench from 37.71% to 64.86%.
Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation: This paper proposes Through-The-Mask (TTM), a two-stage compositional I2V framework. By utilizing mask-based motion trajectories as an intermediate representation, it decomposes the image-to-video generation process into "motion generation" and "video generation" stages, achieving SOTA performance in complex multi-object motion scenarios.
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model: This paper proposes TeaCache, a training-free caching acceleration method for video diffusion models. It estimates the output differences of the model between adjacent timesteps by leveraging the timestep-embedding-modulated noise inputs, calibrated via polynomial fitting to adaptively decide when to cache or reuse outputs. It achieves a \(4.41\times\) speedup on Open-Sora-Plan with virtually lossless visual quality (VBench drops by only 0.07%).
TokenMotion: Decoupled Motion Control via Token Disentanglement for Human-centric Video Generation: TokenMotion proposes the first DiT-based video diffusion framework that represents camera trajectories and human poses as spatiotemporal tokens. By leveraging a "decouple-and-fuse" strategy in conjunction with a human-aware dynamic mask, it achieves fine-grained and joint control over both camera and human motion, outperforming existing state-of-the-art (SOTA) methods in both text-to-video (T2V) and image-to-video (I2V) paradigms.
Tora: Trajectory-Oriented Diffusion Transformer for Video Generation: Proposes Tora, the first trajectory-oriented Diffusion Transformer (DiT) framework for video generation. By employing a trajectory extractor (3D VAE encoding motion trajectories into spatiotemporal patches) and a motion-guidance fuser (adaptive normalization injecting into DiT blocks), it achieves scalable trajectory-controlled video generation supporting multiple resolutions, durations, and aspect ratios. In 128-frame tests, it achieves trajectory control accuracy 3 to 5 times higher than UNet-based methods.
Towards Precise Scaling Laws for Video Diffusion Transformers: This paper systematically verifies the existence of scaling laws in Video Diffusion Transformers (Video DiT) for the first time. It is discovered that video models are more sensitive to learning rate and batch size than language models. Subsequently, the paper proposes a precise scaling law formula that simultaneously predicts optimal hyperparameters, optimal model size, and validation loss, reducing inference cost by 40.1% or model size by 39.9% under the same compute budget.
Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better: Tracktention proposes a novel point-tracking-based attention layer. By injecting pre-extracted point trajectory information into Vision Transformers, it achieves motion-aware temporal feature aggregation. This layer upgrades image-only models to SOTA video models, significantly improving temporal consistency in video depth estimation and video colorization tasks.
TransPixeler: Advancing Text-to-Video Generation with Transparency: TransPixeler proposes introducing alpha channel tokens into pre-trained DiT video generation models. Through shared positional encoding, domain embedding, partial LoRA fine-tuning, and attention mask design, it achieves high-quality joint generation of RGB and alpha channels under extremely scarce RGBA training data.
Unified Dense Prediction of Video Diffusion: This paper proposes UDPDiff, which, for the first time, achieves joint generation of RGB videos, entity segmentation, and depth estimation within video diffusion models. It enhances video quality and consistency through the Pixelplanes unified representation and learnable task embeddings.
VEU-Bench: Towards Comprehensive Understanding of Video Editing: Proposes VEU-Bench, the first benchmark to comprehensively evaluate Video-LLMs' understanding of video editing elements, spanning 10 editing dimensions and 3 evaluation levels (Recognition/Reasoning/Judgment) across 19 fine-grained tasks. The authors also train an expert model, Oscars, which outperforms the open-source SOTA by 28.3%.
Video-Bench: Human-Aligned Video Generation Benchmark: This paper proposes Video-Bench, a comprehensive benchmark for video generation evaluation, which systematically leverages Multimodal Large Language Models (MLLMs) to automatically evaluate generated videos through two techniques, Chain-of-Query and Few-Shot Scoring, achieving the highest alignment with human preferences across all evaluation dimensions.
Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval: This paper proposes Video-ColBERT, which adapts ColBERT's late interaction from text retrieval to text-to-video retrieval (T2VR). By performing MeanMaxSim interaction at both the frame and video levels and employing a dual Sigmoid loss to train independent yet compatible multi-granularity representations, Video-ColBERT outperforms existing dual-encoder methods on multiple T2VR benchmarks.
DiTFlow: Video Motion Transfer with Diffusion Transformers: DiTFlow proposes the first motion transfer method designed specifically for Diffusion Transformers (DiTs). By analyzing cross-frame attention maps, it extracts Attention Motion Flow (AMF) as patch-wise motion signals and uses training-free optimization to guide the generation of new videos that replicate the motion patterns of reference videos.
VideoDirector: Precise Video Editing via Text-to-Video Models: VideoDirector proposes Spatiotemporal Decoupled Guidance (STDG), multi-frame Null-Text optimization, and self-attention control strategies. It is the first to successfully apply the classic "inversion-editing" paradigm to text-to-video (T2V) models (AnimateDiff), achieving precise video editing with high fidelity, temporal consistency, and natural motion.
VideoDPO: Omni-Preference Alignment for Video Diffusion Generation: VideoDPO is the first to adapt DPO (Direct Preference Optimization) to video diffusion models. It proposes the OmniScore comprehensive scoring system to simultaneously measure visual quality and semantic alignment, combined with an automatic preference data generation pipeline and a score-difference-based data reweighting strategy. This approach achieves significant improvements in preference alignment across VideoCrafter2, T2V-Turbo, and CogVideoX.
VideoGigaGAN: Towards Detail-rich Video Super-Resolution: This paper proposes VideoGigaGAN, the first large-scale GAN-based video super-resolution model. By incorporating flow-guided feature propagation, an anti-aliasing module, and a high-frequency shuttle mechanism, it generates rich high-frequency details while maintaining temporal consistency, supporting \(8\times\) super-resolution.
VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide: VideoGuide proposes a training-free framework to enhance video diffusion models. By leveraging any pre-trained video diffusion model (or the sampling model itself) as a teacher during the early stages of reverse diffusion sampling, it interpolates the teacher model's denoised samples with those of the student sampling model, significantly improving temporal consistency without compromising image quality.
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step: VideoScene proposes a 3D-aware Leap Flow Distillation strategy that distills a video diffusion model into a one-step generator. It generates 3D-consistent videos from two sparse-view images. Coordinated with a Dynamic Denoising Policy Network (DDPNet) that adaptively selects the optimal starting noise level, it compresses the generation time from 2 minutes to 3 seconds while maintaining high quality.
VidTwin: Video VAE with Decoupled Structure and Dynamics: This paper proposes VidTwin, which decouples videos into two independent latent spaces: Structure Latent (global content and coarse motion) and Dynamics Latent (fine-grained details and fast motion), achieving high-quality reconstruction with a PSNR of 28.14 at an extremely high compression ratio of 0.20%.
VIRES: Video Instance Repainting via Sketch and Text Guided Generation: Based on abstract: We introduce VIRES, a video instance repainting method with sketch and text guidance, enabling video instance repainting, replacement, generation, and removal. Existing approaches struggle with temporal consistency and accurate alignment with the provided sketch sequence. VIRES leverages the generat
Visual Prompting for One-Shot Controllable Video Editing Without Inversion: This work tackles the One-Shot Controllable Video Editing (OCVE) problem from a novel perspective of Visual Prompting. By leveraging an image inpainting diffusion model to perform editing propagation, and introducing Content Consistency Sampling (CCS) and Temporal-Content Consistency Sampling (TCS), the method achieves high-quality controllable video editing without DDIM inversion.
Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation: This work proposes Wav2Sem, a plug-and-play audio semantic decoupling module. By extracting global semantic features from complete audio sequences and fusing them with existing self-supervised audio models (HuBERT/Wav2Vec 2.0), it addresses the coupling issue of near-homophonic syllables in the feature space. This significantly mitigates the "averaging effect" in lip-shape generation, achieving consistent performance improvements across six facial animation models with different architectures.
When to Lock Attention: Training-Free KV Control in Video Diffusion: Proposes KV-Lock, a training-free video editing framework based on diffusion hallucination detection. It dynamically schedules the KV cache fusion ratio and the CFG guidance scale to preserve background consistency while enhancing foreground generation quality.
World-Consistent Video Diffusion with Explicit 3D Modeling: This paper proposes WVD (World-consistent Video Diffusion), which trains a diffusion model to jointly model RGB and XYZ images (the latter encoding global 3D coordinates). This design achieves multi-view consistent video generation under explicit 3D constraints, and unifies various downstream tasks, such as single-image 3D reconstruction, multi-view stereo, and camera-controlled generation, through a flexible inpainting strategy.
World2Act: Latent Action Post-Training via Skill-Compositional World Models: World2Act proposes a VLA post-training method based on latent space alignment: it aligns the latent video dynamic representations of a World Model with the action representations of a VLA via contrastive learning (instead of supervision in pixel space). It also introduces an LLM-driven skill decomposition pipeline to enable arbitrary-length video generation, achieving SOTA on RoboCasa and LIBERO with only 50 synthetic trajectories and a 6.7% improvement in real-world environments.
Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion: Zero-1-to-A is proposed to generate high-fidelity animatable 4D head avatars from a single image using a pretrained video diffusion model via Symbiotic Generation (SymGEN) and a progressive learning strategy, effectively addressing the spatio-temporal inconsistency issue in video diffusion.
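As a rough illustration of the MeanMaxSim late interaction mentioned in the Video-ColBERT entry above, the sketch below scores a text query against a video at two granularities. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names (`mean_max_sim`, `two_level_score`), the use of cosine similarity, and the choice to add the frame-level and video-level scores are all hypothetical choices made here for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def mean_max_sim(query_tokens, visual_tokens):
    """MeanMaxSim: for every query token take the best-matching visual
    token (max), then average those maxima over the query tokens (mean).

    query_tokens:  (num_query_tokens, dim)
    visual_tokens: (num_visual_tokens, dim)
    """
    q = l2_normalize(query_tokens)
    v = l2_normalize(visual_tokens)
    sim = q @ v.T  # cosine similarities, shape (num_query, num_visual)
    return float(sim.max(axis=1).mean())


def two_level_score(query_tokens, frame_feats, video_feats):
    """Assumption: combine the two granularities by simple addition.

    frame_feats: per-frame features, encoded independently.
    video_feats: temporally contextualized features for the same video.
    """
    return mean_max_sim(query_tokens, frame_feats) + mean_max_sim(
        query_tokens, video_feats
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    query = rng.normal(size=(6, 32))    # 6 query tokens
    frames = rng.normal(size=(12, 32))  # 12 frame-level features
    video = rng.normal(size=(12, 32))   # 12 video-level features
    print(two_level_score(query, frames, video))
```

Because each query token only needs its single best visual match, video features can be precomputed and indexed independently of the query, which is the usual motivation for late interaction over full cross-attention.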