🎬 Video Generation
📹 ICCV2025 · 48 paper notes
- Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis
-
This paper proposes Adversarial Distribution Matching (ADM), which replaces the predefined KL divergence in DMD with an implicit, data-driven measure of distributional discrepancy: a diffusion-based discriminator adversarially aligns the latent predictions of the real and fake score estimators along the PF-ODE. Combined with Adversarial Distillation Pre-training (ADP), the resulting DMDX pipeline surpasses DMD2 on one-step SDXL generation and sets new multi-step distillation benchmarks on SD3 and CogVideoX.
- Aligning Moments in Time using Video Queries
-
This paper proposes MATR (Moment Alignment TRansformer), which conditions target video representations on query video features via dual-stage sequence alignment (soft-DTW), enabling video-to-video moment retrieval (Vid2VidMR). A self-supervised pretraining strategy is designed accordingly, achieving +13.1% R@1 and +8.1% mIoU on ActivityNet-VRL.
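Soft-DTW, which MATR uses for sequence alignment, is a differentiable relaxation of dynamic time warping. Below is a minimal NumPy sketch of the standard soft-DTW recursion on a pairwise frame-cost matrix; the squared-Euclidean cost and the toy inputs are illustrative assumptions, not details from the paper.

```python
import numpy as np

def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma)))."""
    v = np.asarray(values) / -gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))

def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW alignment cost between two feature sequences.

    x: (n, d) query-clip features, y: (m, d) target-video features.
    """
    n, m = len(x), len(y)
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # pairwise frame costs
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma)
    return r[n, m]

# Toy usage: align an 8-frame query against a 20-frame target.
rng = np.random.default_rng(0)
print(soft_dtw(rng.normal(size=(8, 16)), rng.normal(size=(20, 16))))
```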
- BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation
-
BadVideo is the first backdoor attack framework targeting text-to-video (T2V) generation models. It exploits inherent static and dynamic redundancy in video (e.g., unspecified background elements, motion trajectories) through two strategies—spatio-temporal composition and dynamic element transition—to covertly embed malicious content. The framework achieves up to 93.5% human-evaluated attack success rate on LaVie and Open-Sora while effectively evading existing content moderation systems.
- Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis
-
This paper proposes Causal-VidSyn, a diffusion model that achieves causal entity localization via an Accident-Reason Answering (ArA) module and a gaze-conditioned visual token selection mechanism. The authors also construct the Drive-Gaze dataset comprising 1.54 million frames of gaze data. The method outperforms state-of-the-art approaches across three tasks: accident video editing, normal-to-accident video diffusion, and text-to-video generation.
- D3: Training-Free AI-Generated Video Detection Using Second-Order Features
-
Drawing from second-order control systems in Newtonian mechanics, this paper identifies a fundamental distinction between real and AI-generated videos in their second-order temporal features ("acceleration"): real videos exhibit high fluctuation while generated videos remain flat. Based on this insight, the authors propose D3, a fully training-free AI-generated video detection method that classifies videos solely by computing the standard deviation of second-order differences of inter-frame features, achieving state-of-the-art performance across 40 test subsets.
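Since the decision statistic is spelled out above, a rough sketch is easy to give: take per-frame features from any off-the-shelf encoder, compute second-order temporal differences, and look at their standard deviation. The feature source and the threshold below are placeholders, not the authors' implementation.

```python
import numpy as np

def second_order_flatness(frame_features: np.ndarray) -> float:
    """Std of the second-order differences ("acceleration") of inter-frame features.

    frame_features: (T, D) array with one feature vector per frame,
    e.g., from any off-the-shelf image encoder.
    """
    velocity = np.diff(frame_features, n=1, axis=0)  # first-order change, (T-1, D)
    acceleration = np.diff(velocity, n=1, axis=0)    # second-order change, (T-2, D)
    return float(acceleration.std())

def looks_generated(frame_features, threshold=0.05):
    """Flat 'acceleration' suggests a generated video; the threshold is a placeholder."""
    return second_order_flatness(frame_features) < threshold

# Toy usage with random "features" for a 32-frame clip.
feats = np.random.default_rng(0).normal(size=(32, 512))
print(second_order_flatness(feats), looks_generated(feats))
```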
- DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images
-
This paper proposes DACoN, which fuses semantic features from the DINOv2 foundation model with high-resolution spatial features from a U-Net to enable automatic anime line art colorization with an arbitrary number of reference images, surpassing existing methods on both key-frame and sequential-frame colorization tasks.
- Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer
-
To address the difficulty of decoupling motion from appearance in DiT models with 3D full-attention, this paper proposes Shared Temporal Kernels and a Dense Point Tracking Loss, along with a comprehensive motion transfer benchmark MTBench and a hybrid motion fidelity metric.
- DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation
-
This paper introduces DH-FaceVid-1K, a large-scale high-quality face video dataset comprising 1,200+ hours, 270,043 video clips, and 20,000+ unique identities. It specifically addresses the severe underrepresentation of Asian faces in existing datasets and empirically validates scaling laws with respect to data volume and model parameter count through systematic experiments.
- Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning
-
This paper proposes DisWM, a framework that pre-trains disentangled representations from "distracting videos" offline, then transfers semantic knowledge to downstream world models via offline-to-online latent space distillation, improving sample efficiency and robustness of visual reinforcement learning under environmental variations.
- DIVE: Taming DINO for Subject-Driven Video Editing
-
This paper proposes DIVE, a framework that leverages semantic features from the pretrained DINOv2 model as implicit correspondences to guide subject-driven video editing. DINO features are used for temporal motion modeling and target subject identity registration, enabling high-quality subject replacement while preserving motion consistency.
- DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization
-
This paper proposes DOLLAR, which combines variational score distillation (VSD) and consistency distillation (CD) to achieve few-step video generation, and introduces a latent reward model fine-tuning strategy to further optimize specific quality dimensions. The 4-step student model achieves a VBench score of 82.57, surpassing the teacher model and commercial baselines such as Gen-3 and Kling, while 1-step distillation yields a 278.6× sampling speedup.
- DreamRelation: Relation-Centric Video Customization
-
DreamRelation is proposed as the first relation-centric video customization method. Through a Relation LoRA Triplet combined with Hybrid Mask Training, it achieves disentanglement between relation and appearance, and enhances relational dynamics learning via a spatiotemporal relational contrastive loss, enabling animals to imitate human interactions.
- Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
-
This paper analyzes the optimization conflict between high- and low-noise levels in consistency model distillation, and proposes a parameter-efficient Dual-Expert Consistency Model (DCM). A semantic expert handles layout and motion while a detail expert handles fine-grained details, complemented by a temporal coherence loss and a GAN loss with feature matching. On HunyuanVideo (13B), DCM achieves 4-step sampling quality approaching the 50-step baseline.
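A minimal sketch of the routing idea behind a dual-expert sampler: steps at high noise levels go to a "semantic" expert, steps at low noise levels to a "detail" expert. The switch point, step count, and expert/solver interfaces below are assumptions for illustration, not DCM's actual configuration.

```python
import torch

def dual_expert_sample(semantic_expert, detail_expert, step_fn, x_T, timesteps,
                       switch_t=500):
    """Few-step sampling that routes each denoising step to one of two experts.

    semantic_expert / detail_expert: callables (x, t) -> model prediction.
    step_fn: scheduler update (x, pred, t) -> next x (any solver).
    timesteps: descending integer noise levels; switch_t: noise level below
    which the detail expert takes over (placeholder value).
    """
    x = x_T
    for t in timesteps:
        expert = semantic_expert if t >= switch_t else detail_expert
        x = step_fn(x, expert(x, t), t)
    return x

# Toy usage with dummy experts and a trivial update rule.
dummy = lambda x, t: torch.zeros_like(x)
euler_like = lambda x, pred, t: x - 0.25 * (x - pred)
out = dual_expert_sample(dummy, dummy, euler_like,
                         torch.randn(1, 4, 8, 8), [999, 750, 500, 250])
print(out.shape)
```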
- DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization
-
DualReal is the first framework to propose adaptive joint training for identity and motion, achieving lossless fusion along both dimensions via Dual-aware Adaptation and a StageBlender Controller, with average gains of 21.7% and 31.8% on CLIP-I and DINO-I metrics.
- EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models
-
This paper proposes EfficientMT, an efficient end-to-end video motion transfer framework that reuses a pretrained T2V model backbone to extract temporal motion features, combines a scaler module with a temporal integration mechanism, and achieves zero-shot motion transfer using only a small amount of synthetic paired data. Inference is more than 10× faster than with optimization-based methods.
- ETVA: Evaluation of Text-to-Video Alignment via Fine-Grained Question Generation and Answering
-
This paper proposes ETVA, a text-to-video alignment evaluation method based on fine-grained question generation and answering. It employs a multi-agent scene graph traversal to generate atomic questions and a knowledge-augmented multi-stage reasoning pipeline to answer them. ETVA substantially outperforms existing metrics in correlation with human judgments (Spearman's ρ 58.47 vs. 31.0) and introduces an evaluation benchmark containing 2k prompts and 12k questions.
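To make "atomic questions from a scene graph" concrete, here is a toy, single-agent sketch that walks a hypothetical scene graph and emits yes/no questions about objects, attributes, and relations; ETVA's multi-agent traversal and question templates are more elaborate, so treat this purely as an illustration.

```python
def atomic_questions(scene_graph):
    """Emit yes/no questions about objects, attributes, and relations."""
    questions = []
    for obj, attrs in scene_graph["objects"].items():
        questions.append(f"Is there a {obj} in the video?")
        questions += [f"Is the {obj} {attr}?" for attr in attrs]
    for subj, relation, obj in scene_graph["relations"]:
        questions.append(f"Does the {subj} {relation} the {obj}?")
    return questions

# Hypothetical scene graph parsed from a prompt like "a black cat plays with a red ball".
graph = {
    "objects": {"cat": ["black"], "ball": ["red"]},
    "relations": [("cat", "play with", "ball")],
}
print(atomic_questions(graph))
```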
- Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation
-
This paper proposes SynFMC, a synthetic dataset (the first video dataset with complete 6D pose annotations for both camera and objects) and the FMC method, enabling independent or simultaneous 6D pose control of camera and objects in text-to-video generation. The approach produces high-fidelity videos across diverse scenarios and is compatible with multiple personalized T2I models.
- FuXi-RTM: A Physics-Guided Prediction Framework with Radiative Transfer Modeling
-
This paper proposes FuXi-RTM, the first hybrid physics-guided weather forecasting framework that integrates a deep learning radiative transfer model (DLRTM) as a differentiable physical regularizer, outperforming the unconstrained baseline on 88.51% of variable–lead-time combinations.
- FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation
-
This paper proposes FVGen, a framework that distills a multi-step video diffusion model (VDM) into a student model requiring only 4 sampling steps. Through GAN-based student initialization and softened reverse KL divergence optimization, FVGen reduces sampling time by over 90% while maintaining or even surpassing the visual quality of the teacher model.
- Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks
-
This paper proposes Video Interface Networks (VINs), an abstraction module analogous to "fast thinking," which encodes long videos into fixed-size global tokens at each diffusion step to guide a DiT in generating multiple video chunks in parallel, enabling efficient and temporally consistent long video generation.
- Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction
-
This paper adapts a pretrained video diffusion model (DynamiCrafter) into a monocular 4D dynamic scene reconstructor that simultaneously predicts three complementary geometric modalities — point maps, disparity maps, and ray maps. Through a multi-modal alignment and fusion algorithm combined with sliding-window inference, the model generalizes zero-shot to real videos despite being trained exclusively on synthetic data, substantially outperforming current state-of-the-art video depth estimation methods.
- LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models
-
LeanVAE is proposed as an ultra-efficient video VAE built upon non-overlapping patch operations, a Neighborhood-Aware Feedforward (NAF) module, wavelet transforms, and compressed sensing. With only 40M parameters, it achieves a 50× reduction in FLOPs and a 44× speedup in inference while maintaining competitive reconstruction quality.
- Long Context Tuning for Video Generation
-
This paper proposes Long Context Tuning (LCT), which extends the context window of pretrained single-shot video diffusion models to the scene level. By introducing interleaved 3D positional embeddings and an asynchronous noise strategy, LCT achieves cross-shot visual and temporal consistency without additional parameters, supporting both joint and autoregressive multi-shot generation, and exhibits emergent capabilities such as compositional generation.
- MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
-
MagicDrive-V2 proposes a multi-view driving video generation framework based on DiT + 3D VAE. Through a spatial-temporal condition encoding module and a progressive training strategy, it generates long videos at 848×1600 resolution across 6 views and 241 frames, significantly surpassing existing methods in both resolution and frame count.
- MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers
-
MagicMirror is the first framework to achieve zero-shot identity-preserving video generation on a Video Diffusion Transformer (CogVideoX). It employs dual-branch facial feature extraction, Conditioned Adaptive Normalization (CAN), and a two-stage training strategy (image pre-training followed by video fine-tuning) to generate high-quality dynamic videos while maintaining consistent facial identity.
- MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent
-
This paper proposes MotionAgent, which leverages a Motion Field Agent to parse motion descriptions from text into object trajectories and camera extrinsics, then unifies them into optical flow maps via an analytical flow synthesis module, enabling fine-grained and precise control over both object motion and camera motion in I2V generation using only text input.
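As a loose illustration of turning a parsed object trajectory into flow maps (not the paper's analytical flow synthesis module), the sketch below paints the per-frame displacement of a translating object mask into dense optical-flow frames; the shapes and the rigid-translation assumption are made up for the example.

```python
import numpy as np

def trajectory_to_flow(traj, mask, height, width):
    """Paint per-frame displacements of a translating object into dense flow maps.

    traj: (T, 2) object-center (x, y) per frame; mask: (H, W) bool object region
    in the first frame. Frame t's flow is the displacement toward frame t+1,
    written over the (rigidly translated) object mask; background stays zero.
    """
    T = len(traj)
    flows = np.zeros((T - 1, height, width, 2), dtype=np.float32)
    ys, xs = np.nonzero(mask)
    for t in range(T - 1):
        dx, dy = traj[t + 1] - traj[t]
        off_x, off_y = (traj[t] - traj[0]).astype(int)
        cur_xs = np.clip(xs + off_x, 0, width - 1)
        cur_ys = np.clip(ys + off_y, 0, height - 1)
        flows[t, cur_ys, cur_xs] = (dx, dy)
    return flows

# Toy usage: a 5-frame rightward motion of a 10x10 object on a 64x64 canvas.
traj = np.stack([np.linspace(10, 30, 5), np.full(5, 20.0)], axis=1)
mask = np.zeros((64, 64), dtype=bool)
mask[15:25, 5:15] = True
print(trajectory_to_flow(traj, mask, 64, 64).shape)  # (4, 64, 64, 2)
```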
- MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation
-
This paper proposes MotionShot, a training-free motion transfer framework that achieves high-fidelity motion transfer between arbitrary reference–target object pairs with significant appearance and structural differences, via a two-level motion alignment strategy combining high-level semantic alignment and low-level morphological alignment.
- Multi-identity Human Image Animation with Structural Video Diffusion
-
This paper proposes the Structural Video Diffusion framework, which maintains multi-person appearance consistency via mask-guided identity-specific embeddings, jointly learns RGB/depth/normal tri-modal geometric structure to model human–object interactions, and introduces the Multi-HumanVid dataset of 25K multi-person interaction videos to enable multi-identity human video generation.
- NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors
-
NormalCrafter proposes a video normal estimation method built upon Stable Video Diffusion (SVD). By incorporating Semantic Feature Regularization (SFR) and a two-stage training strategy, the method generates normal sequences with fine-grained details and temporal consistency, substantially outperforming existing per-frame methods on video benchmarks.
- OCK: Unsupervised Dynamic Video Prediction with Object-Centric Kinematics
-
This paper proposes OCK (Object-Centric Kinematics), which augments object-centric video prediction by introducing explicit kinematic attributes (position, velocity, acceleration) as complements to slot representations. Two Transformer variants — Joint-OCK and Cross-OCK — are designed to fuse appearance and motion information, achieving significant improvements in dynamic video prediction quality across complex synthetic and real-world scenarios.
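The kinematic attributes named above can be illustrated with simple finite differences over per-slot positions; the slot-position input and the zero-padding choice below are assumptions for the sketch, not OCK's exact feature construction.

```python
import torch

def kinematic_attributes(slot_positions: torch.Tensor) -> torch.Tensor:
    """Build explicit kinematics from per-slot object positions.

    slot_positions: (T, K, 2) per-frame, per-slot 2D positions.
    Returns (T, K, 6): position, velocity, acceleration concatenated per slot.
    Velocity and acceleration are forward differences, zero-padded at the first
    frame so every timestep keeps the same feature size.
    """
    zeros = torch.zeros_like(slot_positions[:1])
    velocity = torch.cat([zeros, slot_positions[1:] - slot_positions[:-1]], dim=0)
    acceleration = torch.cat([zeros, velocity[1:] - velocity[:-1]], dim=0)
    return torch.cat([slot_positions, velocity, acceleration], dim=-1)

# Toy usage: 16 frames, 7 slots.
print(kinematic_attributes(torch.randn(16, 7, 2)).shape)  # torch.Size([16, 7, 6])
```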
- OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
-
This paper proposes OmniHuman, a multi-condition human animation generation framework based on Diffusion Transformer. Through an omni-conditions training strategy that mixes motion-related conditions including text, audio, and pose, the framework enables effective data scaling. It is the first single model to support audio-driven human video generation with arbitrary body proportions and aspect ratios, achieving state-of-the-art performance on both portrait and half-body animation tasks.
- Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM
-
Prompt-A-Video is proposed, which automatically constructs training data via a reward-guided prompt evolution pipeline and optimizes an LLM through two-stage SFT and DPO training to generate enhanced prompts aligned with the preferences of specific video diffusion models.
- Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization
-
This paper proposes UMIVR, a framework that explicitly quantifies three types of uncertainty in text-to-video retrieval—textual ambiguity (semantic entropy), mapping uncertainty (JS divergence), and frame uncertainty (temporal quality frame sampling)—and adaptively generates clarification questions based on the quantified uncertainty to iteratively refine queries, achieving 69.2% R@1 on MSR-VTT-1k after 10 interaction rounds.
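As a generic illustration of two of the quantities named above, the sketch below computes the entropy of a distribution over candidate query interpretations and the Jensen-Shannon divergence between a text-side and a video-side distribution; how UMIVR actually forms these distributions is more involved, so the inputs here are placeholders.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution (higher = more ambiguous query)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p + eps)).sum())

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy usage: a confident vs. an ambiguous interpretation distribution,
# then the divergence between a text-side and a video-side distribution.
print(entropy([0.9, 0.05, 0.05]), entropy([0.34, 0.33, 0.33]))
print(js_divergence([0.7, 0.2, 0.1], [0.1, 0.2, 0.7]))
```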
- RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
-
RealCam-I2V integrates monocular metric depth estimation to construct 3D scenes for metric-scale-aligned training, provides an interactive 3D scene trajectory drawing interface, and introduces a scene-constrained noise shaping mechanism, addressing the scale inconsistency and real-world usability issues inherent in existing trajectory-guided I2V methods.
- Reangle-A-Video: 4D Video Generation as Video-to-Video Translation
-
Reangle-A-Video reformulates multi-view video generation as a video-to-video translation problem. It learns view-invariant motion via self-supervised fine-tuning of a video diffusion model, and combines DUSt3R-guided multi-view consistent inpainting to generate synchronized multi-view videos from a monocular input video.
- ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
-
This paper proposes ReCamMaster, which achieves camera-trajectory-controlled video re-generation from a single input video via a frame-dimension conditioning mechanism and a multi-camera synchronized dataset synthesized in UE5, significantly outperforming existing methods.
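At the tensor level, frame-dimension conditioning can be pictured as concatenating the source-video latents with the noisy target latents along the frame axis, so both clips are processed as one longer sequence by the video DiT. The shapes below and the shared-positional-scheme assumption are illustrative, not the paper's exact recipe.

```python
import torch

def frame_dim_condition(source_latents: torch.Tensor,
                        noisy_target_latents: torch.Tensor) -> torch.Tensor:
    """Concatenate condition-video and target-video latents along the frame axis.

    Both tensors: (B, F, C, H, W). The (B, 2F, C, H, W) result is fed to the
    video DiT so target frames can attend to the corresponding source frames.
    """
    assert source_latents.shape == noisy_target_latents.shape
    return torch.cat([source_latents, noisy_target_latents], dim=1)

# Toy usage: 16-frame latents for a batch of 2.
src = torch.randn(2, 16, 4, 30, 45)
tgt = torch.randn(2, 16, 4, 30, 45)
print(frame_dim_condition(src, tgt).shape)  # torch.Size([2, 32, 4, 30, 45])
```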
- SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering
-
SteerX proposes a zero-shot inference-time guidance method that integrates scene reconstruction into the video generation process. By designing geometric reward functions using camera-free feed-forward reconstruction models, SteerX steers the generation distribution toward geometrically consistent samples, enabling high-quality camera-free 3D/4D scene generation.
- STIV: Scalable Text and Image Conditioned Video Generation
-
This paper proposes STIV, a unified text-image conditioned video generation framework based on Diffusion Transformer. It integrates image conditioning via a frame replacement strategy and introduces joint image-text classifier-free guidance, enabling both T2V and TI2V generation within a single model. The 8.7B-parameter model achieves state-of-the-art scores of 83.1 and 90.1 on VBench T2V and I2V, respectively.
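A hedged sketch of the two mechanisms named above: frame replacement swaps the noisy latent of the first frame for the clean latent of the conditioning image, and a single-scale joint classifier-free guidance pushes the fully unconditional prediction toward the jointly text-and-image-conditioned one. STIV's exact formulation may differ; this is a generic two-branch CFG for illustration.

```python
import torch

def frame_replacement(noisy_latents, image_latent):
    """Swap the first frame of the noisy video latents for the clean image latent.

    noisy_latents: (B, F, C, H, W); image_latent: (B, C, H, W).
    """
    out = noisy_latents.clone()
    out[:, 0] = image_latent
    return out

def joint_cfg(model, x, t, text_emb, null_emb, image_latent, scale=7.5):
    """Single-scale joint image-text classifier-free guidance.

    model(x, t, emb) -> noise prediction; image conditioning enters through
    frame replacement on the input rather than a separate embedding.
    """
    eps_uncond = model(x, t, null_emb)                                 # no text, no image
    eps_cond = model(frame_replacement(x, image_latent), t, text_emb)  # text + image
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy usage with a dummy model.
dummy = lambda x, t, emb: torch.zeros_like(x)
x = torch.randn(1, 8, 4, 16, 16)
print(joint_cfg(dummy, x, 500, "text emb", None, torch.randn(1, 4, 16, 16)).shape)
```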
- SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization
-
This paper proposes SweetTok, a video tokenizer that decouples spatial and temporal information compression via a Decoupled Query AutoEncoder (DQAE), and assigns codewords by part-of-speech through a Motion-enhanced Language Codebook (MLC). Using only 25% of the token count, SweetTok achieves 42.8% improvement in rFVD and 15.1% improvement in gFVD, attaining an optimal balance between compression ratio and reconstruction fidelity.
- TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
-
This paper constructs TIP-I2V, the first million-scale real-user text and image prompt dataset for image-to-video (I2V) generation (1,701,935 unique prompt pairs), accompanied by generated videos from five state-of-the-art I2V models. Built upon this dataset, the paper introduces TIP-Eval, a large-scale evaluation benchmark, alongside studies on user preference analysis and AI-generated video detection.
- VACE: All-in-One Video Creation and Editing
-
This paper proposes VACE, an all-in-one video creation and editing framework built on Diffusion Transformer. A unified Video Condition Unit (VCU) consolidates text, image, video, and mask inputs into a single conditional representation, and a pluggable Context Adapter injects task concepts into the DiT, allowing one model to cover 12+ video tasks, including reference-guided generation, video editing, mask-based editing, and their arbitrary combinations, with performance on par with task-specific models.
- Versatile Transition Generation with Image-to-Video Diffusion
-
This paper proposes VTG, a unified video transition generation framework built upon an image-to-video diffusion model. VTG achieves smooth, high-fidelity transitions across four task categories — object morphing, motion prediction, concept blending, and scene transition — via interpolation-based initialization (noise SLERP + LoRA interpolation + text SLERP), bidirectional motion fine-tuning, and DINOv2 representation alignment regularization.
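Spherical linear interpolation (SLERP) is the building block behind the noise SLERP and text SLERP initialization mentioned above; below is a standard, generic implementation on flattened tensors, not code from the paper.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical linear interpolation between two same-shaped tensors.

    Used, e.g., to blend the initial noise (or text embeddings) of the start
    and end frames of a transition: t=0 returns a, t=1 returns b.
    """
    a_flat, b_flat = a.flatten(), b.flatten()
    cos_theta = torch.clamp(
        torch.dot(a_flat, b_flat) / (a_flat.norm() * b_flat.norm() + eps), -1.0, 1.0)
    theta = torch.acos(cos_theta)
    if theta.abs() < eps:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * a + t * b
    sin_theta = torch.sin(theta)
    return (torch.sin((1 - t) * theta) / sin_theta) * a + (torch.sin(t * theta) / sin_theta) * b

# Toy usage: interpolate halfway between two noise tensors.
x0, x1 = torch.randn(4, 16, 16), torch.randn(4, 16, 16)
print(slerp(x0, x1, 0.5).shape)
```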
- V.I.P.: Iterative Online Preference Distillation for Efficient Video Diffusion Models
-
This paper proposes the ReDPO loss function and the V.I.P. iterative online preference distillation framework, which combines preference learning (DPO) with SFT regularization for distilling pruned video diffusion models. The approach matches or surpasses the performance of the full model while reducing parameters by 36.2%–67.5%.
- VMBench: A Benchmark for Perception-Aligned Video Motion Generation
-
This paper proposes VMBench — the first comprehensive benchmark for video motion quality evaluation, featuring five-dimensional perception-aligned motion metrics (PMM) and a meta-information-guided motion prompt generation framework (MMPG). VMBench covers 969 motion categories and achieves an average improvement of 35.3% in Spearman correlation over existing methods.
- VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
-
This paper proposes the VPO framework, which systematically optimizes text prompts for video generation based on three core principles (Harmless, Accurate, Helpful). Through principle-guided SFT and multi-feedback preference optimization, VPO significantly improves the safety, alignment, and quality of generated videos.
- VSRM: A Robust Mamba-Based Framework for Video Super-Resolution
-
This work is the first to introduce Mamba into video super-resolution (VSR), proposing the VSRM framework. It achieves efficient spatiotemporal modeling via the Dual Aggregation Mamba Block, combined with Deformable Cross-Mamba Alignment and a frequency-domain loss, achieving state-of-the-art performance on multiple benchmarks.
- WorldScore: A Unified Evaluation Benchmark for World Generation
-
This paper proposes WorldScore — the first unified evaluation benchmark for world generation. It decomposes world generation into a series of next-scene generation tasks, enabling unified evaluation of 3D, 4D, I2V, and T2V models across 3,000 test samples and 10 evaluation metrics.
- X-Dancer: Expressive Music to Human Dance Video Generation
-
X-Dancer proposes a unified Transformer–diffusion framework that takes a single static image and a music sequence as input, autoregressively generates 2D whole-body dance pose token sequences synchronized with musical beats via a Transformer, and then synthesizes high-fidelity dance videos from these tokens using a diffusion model, surpassing existing methods in diversity, expressiveness, and video quality.