🎬 Video Generation

🔬 ICLR2026 · 19 paper notes

Arbitrary Generative Video Interpolation

ArbInterp proposes a generative video frame interpolation framework supporting arbitrary timestamps and arbitrary sequence lengths. It achieves precise temporal control via Timestamp-aware Rotary Position Embedding (TaRoPE) and enables seamless long-sequence stitching through an appearance-motion decoupled conditioning strategy.
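
As a rough illustration of the core idea, the sketch below drives rotary position embeddings with continuous timestamps instead of integer frame indices, which is what allows an interpolated frame to be requested at any point between two inputs. The function names and the [0, 1] timestamp normalization are assumptions, not the paper's exact TaRoPE formulation.

```python
import torch

def rotary_angles(timestamps: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles driven by continuous timestamps.

    timestamps: (num_tokens,) float tensor, e.g. normalized times in [0, 1]
    returns:    (num_tokens, dim // 2) angles
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))  # (dim/2,)
    return timestamps[:, None] * inv_freq[None, :]                      # (tokens, dim/2)

def apply_timestamp_rope(x: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
    """Rotate query/key features by timestamp-dependent angles (x: (tokens, dim), dim even)."""
    angles = rotary_angles(timestamps, x.shape[-1])
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Standard pairwise complex rotation over feature channels.
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Arbitrary-timestamp interpolation: target frames can sit anywhere between 0 and 1.
q = torch.randn(5, 64)
t = torch.tensor([0.0, 0.23, 0.5, 0.77, 1.0])
q_rot = apply_timestamp_rope(q, t)
```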

BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

BindWeave replaces conventional shallow fusion mechanisms with a Multimodal Large Language Model (MLLM) to parse complex multi-subject textual instructions, generating subject-aware hidden states as conditioning signals for a DiT. Combined with CLIP semantic features and VAE fine-grained appearance features, it achieves high-fidelity, subject-consistent video generation.

Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models

This paper proposes Frame Guidance, a training-free frame-level guidance method that enables controllable video generation tasks such as keyframe guidance, stylization, and looping video without modifying the model. It rests on two core components: latent slicing, which reduces memory by 60×, and Video Latent Optimization (VLO).
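
A minimal sketch of what such training-free frame-level guidance can look like: decode only a small temporal slice of the latent (standing in for latent slicing), score it with a frame-level objective, and nudge the latent along the gradient. Function names and the update rule are assumptions, not the paper's exact VLO procedure.

```python
import torch

def guided_update(latent, decode_slice, frame_loss, step_size=0.1, frame_idx=0):
    """One training-free guidance step on the current video latent.

    latent:       current noisy latent, shape (C, T, H, W) in latent space
    decode_slice: callable decoding only the frames around `frame_idx`
                  (a stand-in for latent slicing to keep memory low)
    frame_loss:   callable mapping decoded frames -> scalar guidance loss,
                  e.g. distance to a user-provided keyframe or style target
    """
    latent = latent.detach().requires_grad_(True)
    frames = decode_slice(latent, frame_idx)     # decode a small temporal slice only
    loss = frame_loss(frames)
    grad, = torch.autograd.grad(loss, latent)
    # Move the latent against the guidance gradient; the diffusion sampler
    # then continues from the updated latent (no model weights are touched).
    return (latent - step_size * grad).detach()
```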

Geometry-aware 4D Video Generation for Robot Manipulation

This paper proposes a geometry-aware 4D video generation framework that trains a video diffusion model via cross-view pointmap alignment supervision, jointly predicting RGB and pointmap sequences to achieve spatiotemporally consistent multi-view RGB-D videos. Without requiring camera pose inputs at inference, the framework generates consistent videos from novel viewpoints and recovers robot end-effector trajectories using an off-the-shelf 6DoF pose tracker.
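
As a hedged illustration of cross-view pointmap alignment supervision, the sketch below penalizes disagreement between two views' predicted pointmaps at corresponding pixels, assuming both pointmaps are expressed in a shared reference frame and correspondences are given; the paper's actual supervision may be formulated differently.

```python
import torch

def cross_view_alignment_loss(pointmap_a, pointmap_b, matches_a, matches_b):
    """Penalize disagreement between two views' pointmaps at corresponding pixels.

    pointmap_a / pointmap_b: (H, W, 3) 3D points predicted in a shared reference frame
    matches_a / matches_b:   (N, 2) integer pixel coordinates (row, col) of N correspondences
    """
    pts_a = pointmap_a[matches_a[:, 0], matches_a[:, 1]]  # (N, 3)
    pts_b = pointmap_b[matches_b[:, 0], matches_b[:, 1]]  # (N, 3)
    # Corresponding pixels should map to (nearly) the same 3D point.
    return (pts_a - pts_b).norm(dim=-1).mean()
```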

JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

This paper proposes JavisDiT, a joint audio-video generation model built on the DiT architecture. It achieves fine-grained spatio-temporal audio-video alignment via a Hierarchical Spatio-Temporal Synchronization Prior Estimator (HiST-Sypo). The work also introduces a new benchmark, JavisBench (10K complex-scene samples), and a new evaluation metric, JavisScore.

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

This paper proposes JavisDiT++, a clean and unified framework for joint audio-video generation (JAVG). It improves generation quality via modality-specific MoE, achieves frame-level synchronization through temporally aligned RoPE, and aligns outputs with human preferences via audio-video DPO. Built on Wan2.1-1.3B and trained on only ~1M public samples, it achieves state-of-the-art performance.

Language-guided Open-world Video Anomaly Detection under Weak Supervision

This paper proposes LaGoVAD, a language-guided open-world video anomaly detection paradigm that models anomaly definitions as random variables provided in natural language. Combined with dynamic video synthesis and contrastive learning regularization, it achieves zero-shot state-of-the-art performance across seven datasets.

Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

This paper proposes RoboMaster, a framework that decomposes the robot–object interaction process into three temporal stages—pre-interaction, in-interaction, and post-interaction—via a collaborative trajectory representation, combined with appearance- and shape-aware object embeddings, to achieve high-quality video generation for robotic manipulation.

LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning

This paper proposes LoRA-Edit, which leverages spatiotemporal masks to guide LoRA fine-tuning of a pretrained I2V model, enabling controllable first-frame-guided video editing. The mask simultaneously serves as an instruction for the editing region and a guidance signal for LoRA learning, supporting motion inheritance and appearance control.
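
One way to read "the mask guides LoRA learning" is as a mask-weighted training objective. The sketch below weights the denoising loss by the spatiotemporal mask so the adapter focuses on the editing region; the weights and exact form are assumptions, not the paper's objective.

```python
import torch

def mask_aware_diffusion_loss(pred_noise, target_noise, mask, inside_w=1.0, outside_w=0.1):
    """Spatiotemporal-mask-weighted denoising loss for LoRA fine-tuning.

    pred_noise / target_noise: (B, C, T, H, W) model prediction and diffusion target
    mask: (B, 1, T, H, W) in [0, 1]; 1 marks the region to be edited
    The inside/outside weights are illustrative: the mask both selects the
    editing region and steers where the LoRA adapter learns.
    """
    per_pixel = (pred_noise - target_noise).pow(2)
    weights = outside_w + (inside_w - outside_w) * mask
    return (weights * per_pixel).mean()
```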

Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective

This paper proposes Lumos-1, a unified video generation model built on a standard LLM architecture. It addresses visual spatiotemporal encoding via MM-RoPE (distributed multi-modal RoPE) and inter-frame loss imbalance via AR-DF (autoregressive discrete diffusion forcing). Trained with only 48 GPUs, Lumos-1 achieves competitive performance on GenEval, VBench-I2V, and VBench-T2V.
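
MM-RoPE distributes the rotary position budget across the axes of video tokens. The sketch below shows the generic pattern of splitting a head's channels among (time, height, width) and concatenating per-axis angles; the specific channel split is an assumption, and Lumos-1's exact frequency allocation may differ.

```python
import torch

def axial_angles(pos: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles for one axis; pos: (N,), returns (N, dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return pos[:, None].float() * inv_freq[None, :]

def mm_rope_angles(t_pos, h_pos, w_pos, head_dim, split=(16, 24, 24)):
    """Concatenate per-axis rotary angles for (time, height, width) tokens.

    t_pos/h_pos/w_pos: (N,) integer coordinates of each video token
    split: how many channels of `head_dim` each axis receives (an illustrative
           allocation, not the paper's balanced scheme)
    """
    assert sum(split) == head_dim
    return torch.cat([
        axial_angles(t_pos, split[0]),
        axial_angles(h_pos, split[1]),
        axial_angles(w_pos, split[2]),
    ], dim=-1)  # (N, head_dim // 2)
```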

MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling

MoSA decomposes human video generation into a structure generation stage (a 3D Transformer generates physically plausible motion skeletons) and an appearance generation stage (a DiT synthesizes video conditioned on the skeletons). A Human-Aware Dynamic Control (HADC) module propagates sparse skeleton signals across the entire motion region. Combined with a dense tracking loss and contact constraints, MoSA comprehensively outperforms SOTA models such as HunyuanVideo and Wan 2.1 on FVD, CLIPSIM, and other metrics.

MotionStream: Real-Time Video Generation with Interactive Motion Controls

MotionStream is the first real-time streaming video generation system with interactive motion control. It first trains a bidirectional motion-control teacher with a lightweight track head on Wan DiT, then distills it into a causal student via Self Forcing + DMD. An attention sink and a rolling KV cache are introduced to achieve full train-inference distribution matching, enabling infinite-length generation at constant speed, reaching 17 FPS at 480P on a single H100 GPU (29 FPS with Tiny VAE).
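
A minimal sketch of the attention-sink-plus-rolling-KV-cache idea, assuming a simple flat cache of shape (tokens, heads, dim); the sink and window sizes are illustrative, not the paper's settings.

```python
import torch

class RollingKVCache:
    """KV cache that keeps a few 'attention sink' tokens plus a rolling window.

    Keeping the earliest tokens (the sink) alongside the most recent window
    keeps the attention distribution close to what the model saw in training,
    which is what enables constant-speed, unbounded streaming generation.
    """

    def __init__(self, sink_len: int = 4, window_len: int = 256):
        self.sink_len = sink_len
        self.window_len = window_len
        self.keys, self.values = None, None

    def append(self, k: torch.Tensor, v: torch.Tensor):
        """k, v: (num_new_tokens, num_heads, head_dim)."""
        self.keys = k if self.keys is None else torch.cat([self.keys, k], dim=0)
        self.values = v if self.values is None else torch.cat([self.values, v], dim=0)
        if self.keys.shape[0] > self.sink_len + self.window_len:
            # Evict the middle: keep the sink prefix and the freshest window.
            device = self.keys.device
            keep = torch.cat([
                torch.arange(self.sink_len, device=device),
                torch.arange(self.keys.shape[0] - self.window_len, self.keys.shape[0], device=device),
            ])
            self.keys, self.values = self.keys[keep], self.values[keep]
        return self.keys, self.values
```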

PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation

This paper proposes PreciseCache — a plug-and-play acceleration framework that precisely detects and skips genuinely redundant computations in video generation. It consists of LFCache (step-level, based on a Low-Frequency Difference (LFD) metric) and BlockCache (block-level, based on an input-output difference metric), achieving an average 2.6× speedup with negligible quality degradation on mainstream models such as Wan2.1-14B.
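
As an illustration of a step-level cache decision driven by a low-frequency difference, the sketch below low-passes consecutive step features with an FFT and reuses the cached output when the relative change is small; the filter and threshold are stand-ins for the paper's LFD metric, not its definition.

```python
import torch

def low_freq_difference(curr: torch.Tensor, prev: torch.Tensor, keep_ratio: float = 0.25) -> float:
    """Relative change of the low-frequency component between two denoising steps.

    curr / prev: feature maps of shape (C, H, W) from consecutive steps.
    """
    def low_pass(x):
        f = torch.fft.rfft2(x)
        h = max(1, int(f.shape[-2] * keep_ratio))
        w = max(1, int(f.shape[-1] * keep_ratio))
        mask = torch.zeros_like(f)
        mask[..., :h, :w] = 1   # positive low frequencies
        mask[..., -h:, :w] = 1  # negative low frequencies along the first spatial dim
        return torch.fft.irfft2(f * mask, s=x.shape[-2:])

    lc, lp = low_pass(curr), low_pass(prev)
    return ((lc - lp).norm() / (lp.norm() + 1e-8)).item()

def should_reuse_cache(curr, prev, threshold: float = 0.05) -> bool:
    """Skip recomputation when the low-frequency content barely changed."""
    return low_freq_difference(curr, prev) < threshold
```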

QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification

This paper proposes QuantSparse, the first framework to jointly integrate model quantization and attention sparsification for video diffusion Transformer compression. By introducing Multi-Scale Salient Attention Distillation (MSAD) and Second-Order Sparse Attention Reparameterization (SSAR), QuantSparse addresses the "amplified attention shift" problem caused by naive combination of the two techniques. On HunyuanVideo-13B with W4A8 and 15% attention density, it achieves 3.68× storage compression and 1.88× inference speedup with nearly lossless generation quality.

SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion

SIGMark proposes the first blind watermarking framework for modern video diffusion models. It achieves scalable blind extraction with constant retrieval cost via Global Frame-level Pseudorandom Coding (GF-PRC) and handles temporal perturbations under a causal 3D VAE through a Segmented Group Ordering (SGO) module, attaining high bit accuracy and strong robustness on HunyuanVideo and Wan-2.2.

Streaming Autoregressive Video Generation via Diagonal Distillation

This paper proposes Diagonal Distillation (DiagDistill), which achieves 277.3× acceleration and 31 FPS real-time streaming autoregressive video generation via a diagonal denoising strategy (more steps for early chunks, fewer for later chunks) and a flow distribution matching loss.
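
The diagonal allocation can be pictured as a per-chunk step budget that shrinks over time. The linear ramp below only illustrates that shape and is not the paper's exact schedule.

```python
def diagonal_step_schedule(num_chunks: int, max_steps: int = 8, min_steps: int = 1):
    """Denoising steps per chunk: many for early chunks, few for later ones.

    Early chunks get the most refinement; later chunks can lean on the
    already-denoised context that precedes them.
    """
    if num_chunks == 1:
        return [max_steps]
    steps = []
    for i in range(num_chunks):
        frac = i / (num_chunks - 1)
        steps.append(round(max_steps - frac * (max_steps - min_steps)))
    return steps

# e.g. 6 chunks -> [8, 7, 5, 4, 2, 1]
print(diagonal_step_schedule(6))
```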

Target-Aware Video Diffusion Models

This paper proposes a target-aware video diffusion model that generates videos of an actor interacting with a specified target object, given only a single input image and a segmentation mask of the target. The core innovations are the introduction of a special [TGT] token and a selective cross-attention loss that guides the model to attend to the spatial location of the target, achieving comprehensive improvements over baselines in both target alignment and video quality.
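
A hedged sketch of what a selective cross-attention loss can look like: take the [TGT] token's attention map over spatial positions and penalize the mass that falls outside the target segmentation mask. The exact loss in the paper may differ.

```python
import torch

def selective_cross_attention_loss(attn_map: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
    """Encourage the [TGT] token's cross-attention to land on the target region.

    attn_map:    (H, W) cross-attention weights of the [TGT] token over image
                 positions (assumed already normalized to sum to 1)
    target_mask: (H, W) binary segmentation mask of the target object
    """
    outside_mass = (attn_map * (1.0 - target_mask)).sum()
    return outside_mass
```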

Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator

This paper proposes VIST3A, a framework that seamlessly connects the latent space of a pretrained video generator to a feed-forward 3D reconstruction model (e.g., AnySplat/MVDUSt3R/VGGT) via model stitching, and employs direct reward finetuning to align the generative model with the stitched 3D decoder. The approach enables high-quality end-to-end text-to-3DGS and text-to-pointmap generation, achieving state-of-the-art results on T3Bench, SceneBench, and DPG-Bench.
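
Model stitching here means learning a small connector from the video generator's latent space into the input space of a frozen feed-forward 3D reconstruction network. The sketch below is a minimal placeholder for such a connector; the dimensions and the choice of a single linear map are assumptions.

```python
import torch
import torch.nn as nn

class StitchingLayer(nn.Module):
    """Learned adapter mapping video-generator latents into the feature space
    expected by a frozen 3D reconstruction model.

    Only this connector (and, later, the generator via reward finetuning) is
    trained, while the 3D decoder stays fixed.
    """

    def __init__(self, latent_dim: int = 16, recon_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(latent_dim, recon_dim)

    def forward(self, video_latents: torch.Tensor) -> torch.Tensor:
        # video_latents: (B, T, H, W, latent_dim) -> (B, T, H, W, recon_dim)
        return self.proj(video_latents)
```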

TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

This paper proposes TTOM, a framework that aligns attention maps of video generation models with LLM-generated spatiotemporal layouts by optimizing newly introduced parameters at inference time, while a parameter memorization mechanism stores historical optimization contexts for reuse. TTOM achieves relative improvements of 34% (CogVideoX) and 14% (Wan2.1) on T2V-CompBench.
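
As a hedged sketch of the test-time objective, the function below measures how much of each object token's attention mass falls outside its LLM-provided spatiotemporal layout; in TTOM this kind of loss would drive a few gradient steps on the newly introduced parameters, which the memorization mechanism then caches for reuse. The names and exact loss form are assumptions.

```python
import torch

def layout_alignment_loss(attn_maps, layout_masks):
    """Encourage each object token's attention to stay inside its layout region.

    attn_maps:    dict token_id -> (T, H, W) attention map over video positions
    layout_masks: dict token_id -> (T, H, W) binary spatiotemporal layout
                  (e.g. rasterized from LLM-generated bounding boxes)
    """
    loss = 0.0
    for tok, attn in attn_maps.items():
        mask = layout_masks[tok]
        inside = (attn * mask).sum(dim=(-1, -2))
        total = attn.sum(dim=(-1, -2)) + 1e-8
        loss = loss + (1.0 - inside / total).mean()  # fraction of mass outside the layout
    return loss
```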