🎬 Video Generation

🎞️ ECCV2024 · 14 paper notes

🔥 Top topics: Diffusion Models ×7 · Video Generation ×3 · Super-Resolution ×2

BlazeBVD: Make Scale-Time Equalization Great Again for Blind Video Deflickering

BlazeBVD leverages classical Scale-Time Equalization (STE) in the illumination histogram space to extract deflickering priors: filtered illumination maps, exposure maps, and flickering frame indices. These priors reduce complex video space-time learning to frame-by-frame processing with a 2D spatial network, coupled with a lightweight 3D temporal consistency network. The method achieves SOTA quality on blind video deflickering while running inference more than 10 times faster than the baselines.
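As a rough illustration of how temporal filtering of illumination can yield flickering frame indices, the sketch below smooths a per-frame mean illumination value over time and flags frames that deviate strongly from the smoothed curve. This is a deliberately simplified stand-in, not the paper's STE: it works on one scalar per frame instead of full histograms, and the names `smooth_over_time`, `flicker_indices`, `radius`, and `threshold` are assumptions made for this example.

```python
def smooth_over_time(values, radius=2):
    """Moving-average filter along the time axis (window clipped at the ends)."""
    smoothed = []
    for t in range(len(values)):
        lo = max(0, t - radius)
        hi = min(len(values), t + radius + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def flicker_indices(frame_means, radius=2, threshold=0.2):
    """Return indices of frames whose mean illumination departs from the
    temporally smoothed value by more than `threshold` (hypothetical value)."""
    smoothed = smooth_over_time(frame_means, radius)
    return [
        t
        for t, (raw, ref) in enumerate(zip(frame_means, smoothed))
        if abs(raw - ref) > threshold
    ]


# One over-exposed frame in an otherwise stable clip (illumination in [0, 1]).
frame_means = [0.5, 0.5, 0.5, 0.9, 0.5, 0.5, 0.5]
flagged = flicker_indices(frame_means)
```

In this toy clip only the fourth frame is flagged; a histogram-based version would apply the same temporal smoothing to every histogram bin or quantile rather than to a single mean.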

DragAnything: Motion Control for Anything using Entity Representation

This paper proposes DragAnything, which uses the latent-space features of diffusion models as Entity Representations to achieve entity-level motion control. Existing trajectory-driven methods only drag pixels and cannot precisely control the motion of a target object; DragAnything addresses this limitation. It achieves state-of-the-art (SOTA) FVD/FID metrics on VIPSeg and outperforms DragNUWA by 26% in motion control votes in a user study.

DreamMotion: Space-Time Self-Similar Score Distillation for Zero-Shot Video Editing

This paper proposes DreamMotion, a zero-shot video editing framework based on Score Distillation. Using space-time self-similarity regularization, it injects the target appearance while preserving the structure and motion of the original video. It applies to both cascaded and non-cascaded video diffusion models.
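To give a feel for what a self-similarity regularizer measures, the sketch below builds a pairwise cosine-similarity matrix over feature vectors (one vector per space-time location) and penalizes the difference between the source and edited matrices. It is a generic illustration under assumed names (`self_similarity`, `self_similarity_loss`), not the paper's exact loss; where the features come from is left open.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def self_similarity(features):
    """Pairwise similarity between all space-time locations."""
    return [[cosine(u, v) for v in features] for u in features]


def self_similarity_loss(src_features, edit_features):
    """Mean squared difference between the two self-similarity matrices."""
    src = self_similarity(src_features)
    edit = self_similarity(edit_features)
    n = len(src)
    total = sum(
        (src[i][j] - edit[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return total / (n * n)


source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rescaled = [[3.0 * x for x in f] for f in source]    # same layout, new magnitude
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]     # relations between locations changed

loss_rescaled = self_similarity_loss(source, rescaled)
loss_collapsed = self_similarity_loss(source, collapsed)
```

Because cosine similarity ignores magnitude, the rescaled features incur essentially zero loss, while the collapsed features are penalized for changing how locations relate to each other. That is the sense in which such a term can constrain structure without pinning down appearance.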

Evaluating Text-to-Visual Generation with Image-to-Text Generation

The authors propose VQAScore, which uses Visual Question Answering (VQA) models instead of CLIP to evaluate text-to-visual generation quality. VQAScore significantly outperforms CLIPScore on complex compositional prompts. The authors also release the GenAI-Bench benchmark.

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

This paper is the first to explore the visual features of pre-trained text-to-video (T2V) diffusion models for video understanding tasks. It proposes the VD-IT framework, which extracts visual features with superior temporal-semantic consistency from a frozen T2V diffusion model using two key designs: text-guided image projection and video-specific noise prediction. VD-IT outperforms state-of-the-art methods using discriminatively pre-trained video backbones (such as Video Swin Transformer) across four major R-VOS benchmarks.

FreeInit: Bridging Initialization Gap in Video Diffusion Models

This work identifies a training-inference initialization discrepancy in video diffusion models: low-frequency information leaking during training makes the initial noise temporally correlated, whereas inference starts from uncorrelated Gaussian noise. It proposes FreeInit, which bridges this gap by iteratively refining the spatiotemporal low-frequency components of the initial noise, significantly improving the temporal consistency of generated videos.
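The core operation implied here, keeping the low-frequency part of one signal and taking the high-frequency part from another, can be sketched on a 1D temporal signal. The example uses a naive DFT from the standard library so that it stays self-contained; the function names and the `cutoff` parameter are assumptions for illustration. The full method operates on spatiotemporal latents and repeats the step across sampling iterations, which this sketch does not attempt to reproduce.

```python
import cmath
import random


def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for short signals)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def mix_frequencies(low_src, high_src, cutoff):
    """Take frequencies at or below `cutoff` from `low_src`, the rest from `high_src`."""
    n = len(low_src)
    low_spec, high_spec = dft(low_src), dft(high_src)
    mixed = []
    for k in range(n):
        freq = min(k, n - k)  # bins k and n - k carry the same frequency
        mixed.append(low_spec[k] if freq <= cutoff else high_spec[k])
    # The mask is symmetric, so for real inputs the result is real up to rounding.
    return [value.real for value in idft(mixed)]


rng = random.Random(0)
correlated = [rng.gauss(0.0, 1.0) for _ in range(8)]   # stand-in for noise with low-frequency structure
fresh = [rng.gauss(0.0, 1.0) for _ in range(8)]        # stand-in for new Gaussian noise
refined = mix_frequencies(correlated, fresh, cutoff=1)
```

With `cutoff=1`, `refined` inherits its mean and slowest temporal variation from `correlated` while its fine-grained detail comes from `fresh`.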

Kalman-Inspired Feature Propagation for Video Face Super-Resolution

This paper proposes the KEEP framework, which leverages Kalman filtering principles to recursively fuse prior information from previous frames with observations of the current frame in the latent space. This achieves high-fidelity reconstruction of facial details and ensures temporal consistency in video face super-resolution, outperforming the previous state-of-the-art method by 0.8 dB in PSNR on the VFHQ dataset.
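The recursive fusion of a prior from previous frames with the current observation follows the pattern of a Kalman filter update. The scalar sketch below shows that textbook pattern only. KEEP applies the idea to latent features, and how it obtains the gain is not described in this note, so the function names and the variance parameters here are assumptions for illustration.

```python
def kalman_step(prior_mean, prior_var, observation, obs_var, process_var=0.0):
    """One predict-and-update step of a scalar Kalman filter."""
    pred_var = prior_var + process_var            # uncertainty grows between frames
    gain = pred_var / (pred_var + obs_var)        # how much to trust the new observation
    mean = prior_mean + gain * (observation - prior_mean)
    var = (1.0 - gain) * pred_var
    return mean, var, gain


def fuse_sequence(observations, obs_var, init_mean, init_var, process_var=0.0):
    """Recursively fuse per-frame observations into a running estimate."""
    mean, var = init_mean, init_var
    estimates = []
    for obs in observations:
        mean, var, _ = kalman_step(mean, var, obs, obs_var, process_var)
        estimates.append(mean)
    return estimates


# Equal confidence in prior and observation gives a gain of 0.5.
mean, var, gain = kalman_step(prior_mean=0.0, prior_var=1.0, observation=2.0, obs_var=1.0)
```

Here `mean` is 1.0, `var` is 0.5, and `gain` is 0.5: the estimate lands halfway between prior and observation, and its variance shrinks. A high gain follows the current frame closely, while a low gain leans on the accumulated history, which is the trade-off between per-frame detail and temporal consistency.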

MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing

This paper proposes MagDiff, the first multi-alignment diffusion model that unifies video generation and editing. Through three mechanisms—subject-driven alignment, adaptive prompts alignment, and high-fidelity alignment—MagDiff achieves high-quality video generation and editing simultaneously within a single, tuning-free framework.

MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

This paper proposes MOFA-Video, which equips a frozen image-to-video diffusion model (SVD) with controllable motion capabilities by designing multiple domain-specific motion field adapters (MOFA-Adapters). It supports various control signals and their combinations, such as hand-drawn trajectories and facial landmarks, to achieve open-domain controllable image animation.

PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation

PhysDreamer estimates a spatially varying Young's modulus material field for static 3D Gaussian objects by leveraging the physical dynamics priors implicit in video generation models, enabling physically plausible, interactive 3D dynamics synthesis.

RealViformer: Investigating Attention for Real-World Video Super-Resolution

This paper systematically investigates how spatial and channel attention behave differently in real-world video super-resolution (RWVSR). It finds that channel attention is more robust to degradation artifacts but leads to feature redundancy. Based on this finding, the authors propose RealViformer, with Improved Channel Attention (ICA) and Channel Attention Fusion (CAF) modules, which achieves SOTA performance with fewer parameters and faster inference.

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

SV3D adapts image-to-video diffusion models for multi-view synthesis and 3D generation, leveraging the generalization capability and multi-view consistency of video models while introducing explicit camera control.

VFusion3D: Learning Scalable 3D Generative Models from Video Diffusion Models

This work proposes using a pre-trained video diffusion model (EMU Video) as a multi-view data engine. By fine-tuning it to generate 3D-consistent multi-view videos, the authors construct approximately 3 million synthetic data points to train a feedforward 3D generative model, VFusion3D. The model generates 3D assets from a single image in seconds and achieves a user preference rate of over 90%.

Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion

This paper proposes Videoshop, a training-free method for localized semantic video editing. Users modify the first frame of a video with any image editing tool, and the system automatically propagates the edits to all subsequent frames using noise-extrapolated diffusion inversion and latent space normalization. It maintains semantic, spatial, and temporal consistency and outperforms six baseline methods across ten evaluation metrics.