🎬 Video Generation
🤖 AAAI2026 · 11 paper notes
- 3D4D: An Interactive Editable 4D World Model via 3D Video Generation

    This paper proposes 3D4D, an interactive 4D visualization framework that integrates WebGL and Supersplat rendering. A four-module backend pipeline converts static images and text prompts into editable 4D scenes, while a VLM-guided foveated rendering strategy enables 60 fps real-time interaction, achieving state-of-the-art results on both CLIP Consistency and CLIP Score.
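    The note does not spell out the foveated strategy, so the sketch below only illustrates the general idea: splats near a focus point (e.g., a region a VLM flags as salient) keep full detail, while peripheral splats are stochastically culled. All names and constants (`foveated_budget`, the radius and falloff) are hypothetical, not from the paper.

    ```python
    import numpy as np

    def foveated_budget(splat_xy, focus_xy, foveal_radius=0.15, falloff=2.0):
        """Toy detail weight: 1.0 inside the foveal radius around the focus
        point, decaying linearly outside it. splat_xy holds screen-space
        splat centers in [0, 1]^2."""
        d = np.linalg.norm(splat_xy - focus_xy, axis=-1)
        return np.clip(1.0 - (d - foveal_radius) * falloff, 0.0, 1.0)

    rng = np.random.default_rng(0)
    splats = rng.random((10_000, 2))          # screen-space splat centers
    focus = np.array([0.5, 0.4])              # VLM-predicted salient point
    w = foveated_budget(splats, focus)
    kept = splats[w > rng.random(10_000)]     # stochastic level-of-detail culling
    print(f"rendering {len(kept)} of {len(splats)} splats")
    ```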
- DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation

    This paper proposes DreamRunner, a framework that achieves fine-grained, controllable multi-character, multi-event story-to-video generation via LLM-based dual-level planning, retrieval-augmented motion prior learning, and a spatial-temporal region-based 3D attention injection module (SR3AI).
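    SR3AI's binding of subjects to spatio-temporal regions can be pictured as masked cross-attention: each image token only attends to text tokens whose region contains it. The sketch below is a minimal stand-in under that reading, not the paper's module, and the toy mask layout is invented.

    ```python
    import torch

    def region_masked_cross_attention(q, k, v, region_mask):
        """Cross-attention where each image token may only attend to text
        tokens whose subject region contains it. q: (L_img, d); k, v:
        (L_txt, d); region_mask: (L_img, L_txt) bool, True = allowed."""
        scores = (q @ k.T) / q.shape[-1] ** 0.5
        scores = scores.masked_fill(~region_mask, torch.finfo(scores.dtype).min)
        return torch.softmax(scores, dim=-1) @ v

    L_img, L_txt, d = 64, 8, 32
    q, k, v = torch.randn(L_img, d), torch.randn(L_txt, d), torch.randn(L_txt, d)
    mask = torch.zeros(L_img, L_txt, dtype=torch.bool)
    mask[:32, :4] = True   # left half of the frame bound to subject-A tokens
    mask[32:, 4:] = True   # right half bound to subject-B tokens
    out = region_masked_cross_attention(q, k, v, mask)
    ```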
- FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion

    This paper proposes FilmWeaver, a framework that guides autoregressive diffusion models with a dual-level cache (a Shot Cache plus a Temporal Cache) to generate multi-shot videos of arbitrary length while preserving cross-shot consistency.
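    One way to picture the dual-level cache: a shot-level store of persistent subject appearance plus a rolling buffer of recent frames, both fed back as conditioning on every autoregressive step. The class and model interface below are guesses for illustration, not FilmWeaver's API.

    ```python
    from collections import deque

    class DualLevelCache:
        """Toy dual-level cache: shot_cache keeps subject appearance features
        that persist across shots; temporal_cache keeps the tail frames of the
        last generated chunk to condition the next one."""

        def __init__(self, temporal_len=8):
            self.shot_cache = {}                       # subject id -> feature
            self.temporal_cache = deque(maxlen=temporal_len)

        def condition(self):
            return list(self.shot_cache.values()), list(self.temporal_cache)

        def update(self, frames, subject_feats):
            self.temporal_cache.extend(frames)
            self.shot_cache.update(subject_feats)

    def generate_film(shot_prompts, model, cache):
        shots = []
        for prompt in shot_prompts:
            appearance, recent = cache.condition()
            chunk = model(prompt, appearance=appearance, recent_frames=recent)
            cache.update(chunk["frames"], chunk["subjects"])
            shots.append(chunk["frames"])
        return shots

    # Stand-in "model" so the loop runs end to end.
    dummy = lambda p, appearance, recent_frames: {"frames": [f"frame<{p}>"], "subjects": {}}
    film = generate_film(["shot 1: alley", "shot 2: rooftop"], dummy, DualLevelCache())
    ```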
- GenVidBench: A 6-Million Benchmark for AI-Generated Video Detection

    This paper introduces GenVidBench, the first large-scale AI-generated video detection dataset with 6.78 million videos. It features cross-source and cross-generator properties, covers 11 state-of-the-art video generators, and provides rich semantic annotations.
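    The point of the cross-generator property is an evaluation protocol in which a detector is tested on generators it never trained on. A minimal sketch of such a split, with placeholder generator names:

    ```python
    def cross_generator_split(samples, held_out=frozenset({"GenX", "GenY"})):
        """samples: (video_path, generator_name) pairs. Videos from held-out
        generators go to the test set, so the detector is scored on
        generators it never saw during training."""
        train, test = [], []
        for path, gen in samples:
            (test if gen in held_out else train).append((path, gen))
        return train, test

    train, test = cross_generator_split(
        [("a.mp4", "GenX"), ("b.mp4", "GenZ"), ("c.mp4", "GenY")])
    print(len(train), len(test))   # 1 2
    ```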
- Mask2IV: Interaction-Centric Video Generation via Mask Trajectories

    This paper proposes Mask2IV, a two-stage decoupled framework that first predicts the mask motion trajectories of the interactor and the object, then generates video conditioned on those trajectories. The approach enables controllable, interaction-centric video generation without dense mask annotations, supporting both human-object interaction and robot manipulation scenarios.
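    The two-stage decoupling reduces to a simple interface: stage 1 maps image + prompt to per-frame masks, stage 2 maps image + prompt + masks to frames. The stand-in models and tensor shapes below are assumptions used only to show that interface.

    ```python
    import numpy as np

    T, H, W = 16, 64, 64

    def dummy_traj_model(image, prompt):
        # Stage 1 stand-in: per-frame binary masks for interactor and object.
        return np.zeros((T, 2, H, W), dtype=bool)

    def dummy_video_model(image, prompt, mask_traj):
        # Stage 2 stand-in: RGB frames conditioned on image + mask trajectories.
        return np.broadcast_to(image, (T, 3, H, W)).copy()

    image = np.random.rand(3, H, W)
    masks = dummy_traj_model(image, "pick up the mug")
    video = dummy_video_model(image, "pick up the mug", masks)
    print(masks.shape, video.shape)   # (16, 2, 64, 64) (16, 3, 64, 64)
    ```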
- MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation

    This paper proposes MoFu, which addresses two fundamental challenges in multi-subject video generation, scale inconsistency and permutation sensitivity, through two core modules: Scale-Aware Modulation (SMO), an LLM-guided modulation mechanism, and Fourier Fusion, an FFT-based permutation-invariant feature fusion strategy. The work additionally introduces the MoFu-1M training dataset and the MoFu-Bench evaluation benchmark.
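    The permutation-invariance claim is easy to verify in isolation: if per-subject spectra are combined with a symmetric reduction, reordering the subjects cannot change the result. Note the FFT is linear, so a plain spectral mean equals a pixel-space mean; the paper's fusion is presumably a richer frequency-domain mix, and this demo only isolates the invariance property.

    ```python
    import numpy as np

    def fourier_fuse(subject_feats):
        """Transform each subject's feature map to the frequency domain and
        combine the spectra with a symmetric reduction (here a mean), so
        permuting the subject list cannot change the fused output."""
        spectra = [np.fft.fft2(f) for f in subject_feats]
        return np.fft.ifft2(np.mean(spectra, axis=0)).real

    a, b = np.random.rand(32, 32), np.random.rand(32, 32)
    assert np.allclose(fourier_fuse([a, b]), fourier_fuse([b, a]))
    ```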
- MotionCharacter: Fine-Grained Motion Controllable Human Video Generation

    This paper proposes the MotionCharacter framework, which decouples motion into two independently controllable dimensions, action type and motion intensity, to achieve fine-grained motion control and identity consistency in high-fidelity human video generation.
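    One way such decoupling can be realized is a conditioning signal built from two independent inputs: an embedding that selects "what motion" and a scalar that modulates "how strongly". The module below is a toy sketch of that idea with arbitrary dimensions, not the paper's architecture.

    ```python
    import torch
    import torch.nn as nn

    class MotionCondition(nn.Module):
        """Toy decoupled conditioning: an embedding picks the action type and
        a scalar intensity in [0, 1] modulates its magnitude, so the two
        dimensions are steered independently."""

        def __init__(self, num_actions=10, dim=128):
            super().__init__()
            self.action_emb = nn.Embedding(num_actions, dim)
            self.intensity_proj = nn.Linear(1, dim)

        def forward(self, action_id, intensity):
            a = self.action_emb(action_id)                    # action type
            s = self.intensity_proj(intensity.unsqueeze(-1))  # motion intensity
            return a * torch.sigmoid(s)

    cond = MotionCondition()(torch.tensor([3]), torch.tensor([0.8]))
    print(cond.shape)   # torch.Size([1, 128])
    ```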
- OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding

    This paper proposes OmniVDiff, a unified controllable video diffusion framework that jointly models multiple visual modalities (RGB, depth, segmentation, Canny) in color space and introduces an Adaptive Modality Control Strategy (AMCS). Within a single diffusion model, OmniVDiff simultaneously supports three task types (text-conditioned generation, X-conditioned generation, and video understanding), achieving state-of-the-art performance on VBench.
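    A plausible reading of adaptive modality control: per training sample, each modality latent is randomly assigned a role, either clean conditioning input or a noised target to denoise. The sketch below shows that role-switching mechanic with a toy linear noising in place of a real scheduler; the role names and probabilities are assumptions.

    ```python
    import torch

    def adaptive_modality_noising(latents, t, roles=None):
        """Each modality (all encoded as 3-channel 'color' videos) gets a
        random role: 'cond' stays clean as an input signal, 'gen' is noised
        and must be denoised by the shared diffusion model."""
        roles = roles or {m: "cond" if torch.rand(()) < 0.5 else "gen"
                          for m in latents}
        out = {}
        for m, x in latents.items():
            out[m] = (1 - t) * x + t * torch.randn_like(x) if roles[m] == "gen" else x
        return out, roles

    latents = {m: torch.randn(2, 3, 8, 32, 32)           # (B, C, T, H, W)
               for m in ("rgb", "depth", "seg", "canny")}
    noisy, roles = adaptive_modality_noising(latents, t=0.7)
    print(roles)
    ```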
- Phased One-Step Adversarial Equilibrium for Video Diffusion Models

    This paper proposes V-PAE (Video Phased Adversarial Equilibrium), a two-phase distillation framework consisting of stability priming followed by unified adversarial equilibrium. It compresses large-scale video diffusion models (e.g., Wan2.1-I2V-14B) to single-step generation, achieving a 100× speedup and surpassing existing acceleration methods by 5.8% in average quality on VBench-I2V.
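    The two-phase structure can be sketched as a single training step whose loss changes by phase: first pure regression onto the multi-step teacher for stability, then an added adversarial term. The losses, weighting, and stand-in modules below are illustrative assumptions, not V-PAE's actual objectives.

    ```python
    import torch
    import torch.nn.functional as F

    def distill_step(student, teacher, disc, z, cond, phase):
        """Phase 1 ("stability priming"): regress the one-step student onto
        the multi-step teacher's sample. Phase 2 ("adversarial equilibrium"):
        add a non-saturating GAN term from a discriminator."""
        x_student = student(z, cond)                 # a single forward pass
        with torch.no_grad():
            x_teacher = teacher(z, cond)             # many solver steps inside
        loss = F.mse_loss(x_student, x_teacher)
        if phase == 2:
            loss = loss + F.softplus(-disc(x_student)).mean()
        return loss

    # Stand-ins so the step runs; real student/teacher are video diffusion nets.
    student = teacher = lambda z, c: 0.9 * z
    disc = lambda x: x.mean(dim=(1, 2, 3, 4))
    z = torch.randn(2, 3, 8, 16, 16)                 # (B, C, T, H, W) latents
    print(distill_step(student, teacher, disc, z, None, phase=2))
    ```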
- Seeing the Unseen: Zooming in the Dark with Event Cameras

    This paper proposes RetinexEVSR, the first event-driven low-light video super-resolution (LVSR) framework. Its Retinex-inspired bidirectional fusion strategy (RBF) first uses illumination maps to guide event-feature denoising (IEE), then leverages the enhanced event features to recover reflectance details (ERE); the method achieves a 2.95 dB gain on the SDSD benchmark while reducing runtime by 65%.
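    The Retinex premise behind the pipeline is the classic split I = R · L of a frame into reflectance and illumination. A minimal sketch of that decomposition follows, with a box filter as a toy illumination estimator (the paper's estimator is not detailed in this note):

    ```python
    import numpy as np
    from scipy.ndimage import uniform_filter

    def retinex_decompose(frame, eps=1e-6):
        """Retinex split I = R * L: estimate illumination L as a smoothed
        channel-max of the frame, then recover reflectance R = I / L."""
        illum = uniform_filter(frame.max(axis=-1), size=9)
        reflect = frame / (illum[..., None] + eps)
        return reflect, illum

    frame = 0.2 * np.random.rand(64, 64, 3)          # a dark frame
    R, L = retinex_decompose(frame)
    # In RetinexEVSR, L would gate event-feature denoising (IEE) and the
    # cleaned event features would then sharpen R (ERE); only the split
    # itself is shown here.
    ```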
- SphereDiff: Tuning-free Omnidirectional Panoramic Image and Video Generation via Spherical Latent Representation

    This paper proposes SphereDiff, which defines a spherical latent representation (distributed near-uniformly via a Fibonacci lattice) to replace conventional equirectangular projection (ERP), combined with a dynamic sampling algorithm and distortion-aware weighted averaging. Without any fine-tuning, SphereDiff leverages pretrained diffusion models such as SANA and LTX Video to generate seamless, low-distortion 360° panoramic images and videos.
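    The Fibonacci lattice the note mentions is a standard construction: golden-angle longitudes paired with equal-area latitudes give near-uniform coverage of the sphere, avoiding ERP's polar oversampling. A minimal sketch (how SphereDiff maps these points to latents is not covered here):

    ```python
    import numpy as np

    def fibonacci_sphere(n):
        """Near-uniform points on the unit sphere via the Fibonacci lattice."""
        i = np.arange(n)
        golden = (1 + 5 ** 0.5) / 2
        theta = 2 * np.pi * i / golden            # golden-angle longitude
        z = 1 - (2 * i + 1) / n                   # equal-area latitude
        r = np.sqrt(1 - z ** 2)
        return np.stack([r * np.cos(theta), r * np.sin(theta), z], axis=-1)

    pts = fibonacci_sphere(4096)                  # one latent per lattice point
    assert np.allclose(np.linalg.norm(pts, axis=-1), 1.0)
    ```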