🎬 Video Generation

🔬 ICLR2026 · 19 paper notes

Arbitrary Generative Video Interpolation: ArbInterp proposes a generative video frame interpolation framework supporting arbitrary timestamps and arbitrary sequence lengths. It achieves precise temporal control via Timestamp-aware Rotary Position Embedding (TaRoPE) and enables seamless long-sequence stitching through an appearance-motion decoupled conditioning strategy.
BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration: BindWeave replaces conventional shallow fusion mechanisms with a Multimodal Large Language Model (MLLM) to parse complex multi-subject textual instructions, generating subject-aware hidden states as conditioning signals for a DiT. Combined with CLIP semantic features and VAE fine-grained appearance features, it achieves high-fidelity, subject-consistent video generation.
Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models: This paper proposes Frame Guidance, a training-free method for frame-level control that supports tasks such as keyframe guidance, stylization, and looping video without modifying the model. It has two core components: latent slicing (which reduces memory by 60×) and Video Latent Optimization (VLO).
Geometry-aware 4D Video Generation for Robot Manipulation: This paper proposes a geometry-aware 4D video generation framework that trains a video diffusion model via cross-view pointmap alignment supervision, jointly predicting RGB and pointmap sequences to achieve spatiotemporally consistent multi-view RGB-D videos. Without requiring camera pose inputs at inference, the framework generates consistent videos from novel viewpoints and recovers robot end-effector trajectories using an off-the-shelf 6DoF pose tracker.
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization: This paper proposes JavisDiT, a joint audio-video generation model built on the DiT architecture. It achieves fine-grained spatio-temporal audio-video alignment via a Hierarchical Spatio-Temporal Synchronization Prior Estimator (HiST-Sypo). The work also introduces a new benchmark, JavisBench (10K complex-scene samples), and a new evaluation metric, JavisScore.
JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation: This paper proposes JavisDiT++, a clean and unified framework for joint audio-video generation (JAVG). It improves generation quality via modality-specific MoE, achieves frame-level synchronization through temporally aligned RoPE, and aligns outputs with human preferences via audio-video DPO. Built on Wan2.1-1.3B and trained on only ~1M public data samples, it achieves state-of-the-art performance.
Language-guided Open-world Video Anomaly Detection under Weak Supervision: This paper proposes LaGoVAD, a language-guided open-world video anomaly detection paradigm that models anomaly definitions as random variables provided in natural language. Combined with dynamic video synthesis and contrastive learning regularization, it achieves zero-shot state-of-the-art performance across seven datasets.
Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control: This paper proposes RoboMaster, a framework that uses a collaborative trajectory representation to decompose the robot–object interaction process into three temporal stages (pre-interaction, in-interaction, and post-interaction). Combined with appearance- and shape-aware object embeddings, it achieves high-quality video generation for robotic manipulation.
LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning: This paper proposes LoRA-Edit, which leverages spatiotemporal masks to guide LoRA fine-tuning of a pretrained I2V model, enabling controllable first-frame-guided video editing. The mask simultaneously serves as an instruction for the editing region and a guidance signal for LoRA learning, supporting motion inheritance and appearance control.
Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective: This paper proposes Lumos-1, a unified video generation model built on a standard LLM architecture. It addresses visual spatiotemporal encoding via MM-RoPE (distributed multi-modal RoPE) and inter-frame loss imbalance via AR-DF (autoregressive discrete diffusion forcing). Trained with only 48 GPUs, Lumos-1 achieves competitive performance on GenEval, VBench-I2V, and VBench-T2V.
MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling: MoSA decomposes human video generation into a structure generation stage (a 3D Transformer generates physically plausible motion skeletons) and an appearance generation stage (a DiT synthesizes video conditioned on the skeletons). A Human-Aware Dynamic Control (HADC) module propagates sparse skeleton signals across the entire motion region. Combined with a dense tracking loss and contact constraints, MoSA comprehensively outperforms SOTA models such as HunyuanVideo and Wan 2.1 on FVD, CLIPSIM, and other metrics.
MotionStream: Real-Time Video Generation with Interactive Motion Controls: MotionStream is presented as the first real-time streaming video generation system with motion control. It first trains a bidirectional motion-control teacher with a lightweight track head on Wan DiT, then distills it into a causal student via Self Forcing + DMD. Attention sinks and a rolling KV cache are introduced so that the training distribution fully matches inference, enabling infinite-length generation at constant speed: 17 FPS, or 29 FPS with a Tiny VAE, at 480P on a single H100 GPU.
PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation: This paper proposes PreciseCache, a plug-and-play acceleration framework that precisely detects and skips genuinely redundant computations in video generation. It consists of LFCache (step-level, based on a Low-Frequency Difference (LFD) metric) and BlockCache (block-level, based on an input-output difference metric), achieving an average 2.6× speedup with negligible quality degradation on mainstream models such as Wan2.1-14B.
QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification: This paper proposes QuantSparse, the first framework to jointly integrate model quantization and attention sparsification for video diffusion Transformer compression. By introducing Multi-Scale Salient Attention Distillation (MSAD) and Second-Order Sparse Attention Reparameterization (SSAR), QuantSparse addresses the "amplified attention shift" problem caused by naive combination of the two techniques. On HunyuanVideo-13B with W4A8 and 15% attention density, it achieves 3.68× storage compression and 1.88× inference speedup with nearly lossless generation quality.
SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion: SIGMark is proposed as the first blind watermarking framework for modern video diffusion models. It achieves scalable blind extraction with constant retrieval cost via Global Frame-level Pseudorandom Coding (GF-PRC) and handles temporal perturbations under causal 3D VAEs through a Segmented Group Ordering (SGO) module, attaining high bit accuracy and strong robustness on HunyuanVideo and Wan-2.2.
Streaming Autoregressive Video Generation via Diagonal Distillation: This paper proposes Diagonal Distillation (DiagDistill), which achieves 277.3× acceleration and 31 FPS real-time streaming autoregressive video generation via a diagonal denoising strategy (more steps for early chunks, fewer for later chunks) and a flow distribution matching loss.
Target-Aware Video Diffusion Models: This paper proposes a target-aware video diffusion model that generates videos of an actor interacting with a specified target object, given only a single input image and a segmentation mask of the target. The core innovations are the introduction of a special [TGT] token and a selective cross-attention loss that guides the model to attend to the spatial location of the target, achieving comprehensive improvements over baselines in both target alignment and video quality.
Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator: This paper proposes VIST3A, a framework that seamlessly connects the latent space of a pretrained video generator to a feed-forward 3D reconstruction model (e.g., AnySplat/MVDUSt3R/VGGT) via model stitching, and employs direct reward finetuning to align the generative model with the stitched 3D decoder. The approach enables high-quality end-to-end text-to-3DGS and text-to-pointmap generation, achieving state-of-the-art results on T3Bench, SceneBench, and DPG-Bench.
TTOM: Test-Time Optimization and Memorization for Compositional Video Generation: This paper proposes TTOM, a framework that aligns attention maps of video generation models with LLM-generated spatiotemporal layouts by optimizing newly introduced parameters at inference time, while a parameter memorization mechanism stores historical optimization contexts for reuse. TTOM achieves relative improvements of 34% (CogVideoX) and 14% (Wan2.1) on T2V-CompBench.
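
The MotionStream note above mentions attention sinks combined with a rolling KV cache as the mechanism behind constant-speed, unbounded-length generation. The following is a minimal sketch of that general idea only, not the paper's implementation: the class name `RollingKVCache`, its parameters `num_sink` and `window`, and the use of plain integers in place of key/value tensors are all assumptions made for illustration. The first few entries (the sinks) are kept permanently, while everything else lives in a fixed-size window that evicts its oldest entry.

```python
from collections import deque


class RollingKVCache:
    """Hypothetical cache: permanent sink entries plus a fixed-size rolling window."""

    def __init__(self, num_sink: int, window: int):
        if num_sink < 0 or window <= 0:
            raise ValueError("num_sink must be >= 0 and window must be > 0")
        self.num_sink = num_sink
        self.sink = []                      # attention-sink entries, never evicted
        self.recent = deque(maxlen=window)  # rolling window, oldest entry dropped first

    def append(self, kv):
        # Fill the sink slots first; afterwards every new entry goes to the window.
        if len(self.sink) < self.num_sink:
            self.sink.append(kv)
        else:
            self.recent.append(kv)

    def contents(self):
        # Entries a new chunk would attend to: sinks followed by the recent window.
        return self.sink + list(self.recent)

    def __len__(self):
        return len(self.sink) + len(self.recent)


cache = RollingKVCache(num_sink=2, window=3)
sizes = []
for chunk_id in range(10):
    cache.append(chunk_id)  # an integer stands in for one chunk's key/value tensors
    sizes.append(len(cache))

print(cache.contents())
print(sizes)
```

Because the cache never holds more than `num_sink + window` entries, the attention cost per generated chunk stays bounded regardless of how long generation runs, which is the property the note refers to as constant speed.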