# OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding
- Conference: AAAI 2026
- arXiv: 2504.10825
- Code: https://tele-ai.github.io/OmniVDiff/ (project page)
- Area: Video Generation
- Keywords: Video diffusion models, multimodal generation, controllable video generation, video understanding, unified framework
## TL;DR
This paper proposes OmniVDiff, a unified controllable video diffusion framework that jointly models multiple visual modalities (RGB, depth, segmentation, Canny) in color space and introduces an Adaptive Modality Control Strategy (AMCS). Within a single diffusion model, OmniVDiff simultaneously supports three task types—text-conditioned generation, X-conditioned generation, and video understanding—achieving state-of-the-art performance on VBench.
## Background & Motivation
Video diffusion models have achieved remarkable progress in text-to-video generation, yet controllable video generation faces two core bottlenecks:
Task-specific fine-tuning: Introducing each new control signal (depth, segmentation, Canny, etc.) requires dedicated fine-tuning of large-scale diffusion architectures, incurring prohibitive computational costs and poor scalability.
Reliance on external expert models: Most methods depend on standalone expert models (depth estimators, segmentation models, etc.) to extract conditioning signals before passing them to a separate diffusion model, forming a multi-step, non-end-to-end pipeline.
Limitations of prior work:

- VideoJAM: Jointly models only RGB and optical flow; does not support conditional generation or understanding.
- UDPDiff: Supports joint generation of RGB + depth or RGB + segmentation, but cannot synthesize all modalities simultaneously.
- Aether: Unifies RGB + depth + camera pose, but primarily targets geometric world modeling.
Core Idea: Treat all visual modalities as parallel signals in color space, concatenate them along the channel dimension, and feed them into a unified diffusion Transformer. By dynamically assigning each modality's role (generation vs. conditioning), the framework flexibly supports diverse downstream tasks.
## Method
### Overall Architecture
OmniVDiff is built upon the pretrained CogVideoX text-to-video model and comprises three core components:

1. Multimodal video encoding: A shared 3D-VAE encoder encodes each of the four modalities (RGB, depth, segmentation, and Canny) into latent space independently.
2. OmniVDiff diffusion network: Noisy latent representations of all modalities are concatenated along the channel dimension and jointly denoised by the diffusion Transformer.
3. Multimodal video decoding: Modality-Specific Projection Heads (MSPH) disentangle the denoised output into per-modality representations, which are then reconstructed by the 3D-VAE decoder.
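To make the data flow concrete, below is a minimal PyTorch-style sketch of this three-step pipeline (shared 3D-VAE encoding, channel-wise concatenation with joint denoising, and MSPH-based decoding). All module names, latent shapes, and the stand-in backbone are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

B, C_LAT, T, H, W = 1, 16, 4, 30, 45           # assumed latent shape per modality
MODALITIES = ["rgb", "depth", "seg", "canny"]  # the four modalities handled by OmniVDiff


class OmniVDiffSketch(nn.Module):
    def __init__(self, vae_encode, vae_decode, backbone):
        super().__init__()
        self.vae_encode = vae_encode    # shared 3D-VAE encoder, applied per modality
        self.vae_decode = vae_decode    # shared 3D-VAE decoder
        self.backbone = backbone        # stand-in for the diffusion Transformer
        # Modality-Specific Projection Heads (MSPH): replicated per modality because
        # depth / segmentation / Canny statistics differ from RGB.
        self.heads = nn.ModuleDict(
            {m: nn.Conv3d(len(MODALITIES) * C_LAT, C_LAT, kernel_size=1) for m in MODALITIES}
        )

    def forward(self, videos, t):
        # 1) Encode each modality independently with the shared 3D-VAE encoder.
        latents = {m: self.vae_encode(videos[m]) for m in MODALITIES}
        # 2) Concatenate along the channel dimension and denoise jointly.
        x = torch.cat([latents[m] for m in MODALITIES], dim=1)   # (B, 4*C_LAT, T, H, W)
        feat = self.backbone(x, t)
        # 3) Disentangle per-modality latents (MSPH) and decode each one.
        return {m: self.vae_decode(self.heads[m](feat)) for m in MODALITIES}


# Toy stand-ins so the sketch runs end-to-end; the real model uses CogVideoX's 3D-VAE and DiT.
backbone = nn.Conv3d(4 * C_LAT, 4 * C_LAT, kernel_size=1)
model = OmniVDiffSketch(nn.Identity(), nn.Identity(), lambda x, t: backbone(x))
videos = {m: torch.randn(B, C_LAT, T, H, W) for m in MODALITIES}
outputs = model(videos, t=torch.tensor([500]))
print({m: tuple(v.shape) for m, v in outputs.items()})
```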
### Key Designs
- Multimodal Video Diffusion Architecture:
  - Function: Extends CogVideoX's input space to accommodate multiple modalities and equips the output with independent projection heads per modality.
  - Mechanism: Each modality is encoded into a latent representation \(x_m\) via the 3D-VAE, mixed with noise, and concatenated to form the unified input \(x_i = \text{Concat}(x_r^t, x_d^t, x_s^t, x_c^t)\).
  - Design Motivation: Projection heads are replicated rather than shared, as different modalities (depth vs. segmentation vs. Canny) exhibit fundamentally distinct distributional characteristics.
- Adaptive Modality Control Strategy (AMCS):
  - Function: Dynamically determines whether each modality acts as a "generation modality" or a "conditioning modality."
  - Mechanism: Generation modalities are mixed with noise before input; conditioning modalities are concatenated directly as clean signals. Learnable modality embeddings \(e_g\) / \(e_c\) further differentiate the roles (see the sketch after this list).
  - Design Motivation: Eliminates the need for separate fine-tuning per conditioning task; a single unified architecture can flexibly switch between tasks.
- Two-Stage Training Strategy:
  - Function: Stage one learns joint multimodal video generation; stage two incorporates conditional generation and video understanding tasks.
  - Mechanism: Each stage runs for 20K steps with independent denoising losses; conditioning modalities are excluded from loss computation.
  - Design Motivation: Progressive training avoids instability that would arise from introducing all tasks simultaneously.
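Below is a minimal sketch of how the AMCS input construction might look: generation modalities go through the forward diffusion process (mixed with noise), conditioning modalities are kept clean, and a learnable role embedding (\(e_g\) or \(e_c\)) tags each modality. The noise schedule, the way embeddings are injected, and all names are assumptions for illustration only.

```python
import torch

MODALITIES = ["rgb", "depth", "seg", "canny"]
C_LAT = 16
# Learnable role embeddings: one for "generation", one for "conditioning".
e_g = torch.nn.Parameter(torch.randn(C_LAT))
e_c = torch.nn.Parameter(torch.randn(C_LAT))


def build_model_input(latents, cond_modalities, t, alpha_bar):
    """latents: dict of clean per-modality latents, each of shape (B, C, T, H, W)."""
    parts = []
    for m in MODALITIES:
        x = latents[m]
        if m in cond_modalities:
            # Conditioning modality: keep the clean latent and tag it with e_c.
            role = e_c
        else:
            # Generation modality: apply the forward diffusion process (mix with noise).
            noise = torch.randn_like(x)
            x = alpha_bar[t].sqrt() * x + (1 - alpha_bar[t]).sqrt() * noise
            role = e_g
        # Broadcast the role embedding over space/time and add it to the latent.
        parts.append(x + role.view(1, -1, 1, 1, 1))
    # Channel-wise concatenation yields the unified input for the diffusion Transformer.
    return torch.cat(parts, dim=1)


# Example: depth-conditioned generation -> depth stays clean, the other modalities are noised.
latents = {m: torch.randn(1, C_LAT, 4, 30, 45) for m in MODALITIES}
alpha_bar = torch.linspace(0.9999, 0.01, 1000)   # toy noise schedule
x_in = build_model_input(latents, cond_modalities={"depth"}, t=500, alpha_bar=alpha_bar)
print(x_in.shape)   # torch.Size([1, 64, 4, 30, 45])
```

In this toy example, switching `cond_modalities` from `{"depth"}` to the empty set turns the same call into pure text-conditioned generation, which is exactly the task-switching flexibility AMCS is designed to provide.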
### Loss & Training
Multimodal diffusion loss: \(\mathcal{L} = \sum_{m \notin \text{Cond}} \mathbb{E}_{x_m, t, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_m^t, t, e_m) \|^2 \right]\)
- Denoising loss is computed independently for each generation modality.
- Conditioning modalities are excluded from loss computation and serve only as guidance (see the sketch after this list).
- Training data: 400K videos sampled from Koala-36M; pseudo-labels generated using Video Depth Anything and Semantic-SAM + SAM2.
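As a companion to the loss above, here is a minimal sketch (with hypothetical names) showing that the denoising MSE is accumulated only over generation modalities, while conditioning modalities contribute no gradient signal. It assumes the model's noise prediction has already been split back into per-modality tensors.

```python
import torch
import torch.nn.functional as F

MODALITIES = ["rgb", "depth", "seg", "canny"]


def omnivdiff_loss(eps_pred, eps_true, cond_modalities):
    """eps_pred / eps_true: dicts mapping modality -> noise tensor of shape (B, C, T, H, W)."""
    loss = 0.0
    for m in MODALITIES:
        if m in cond_modalities:
            continue                      # conditioning modality: guidance only, no loss term
        loss = loss + F.mse_loss(eps_pred[m], eps_true[m])
    return loss


# Example: depth acts as the condition, so only rgb / seg / canny are supervised.
eps_true = {m: torch.randn(1, 16, 4, 30, 45) for m in MODALITIES}
eps_pred = {m: eps_true[m] + 0.1 * torch.randn_like(eps_true[m]) for m in MODALITIES}
print(omnivdiff_loss(eps_pred, eps_true, cond_modalities={"depth"}))
```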
## Key Experimental Results
### Main Results
Text-conditioned video generation (VBench):
| Method | Subject Consistency | Background Consistency | Motion Smoothness | Dynamic Degree | Weighted Avg. |
|---|---|---|---|---|---|
| CogVideoX | 95.68 | 96.00 | 98.21 | 53.98 | 72.25 |
| OmniVDiff | 97.78 | 96.26 | 99.21 | 49.69 | 72.78 |
Depth-conditioned video generation (VBench):
| Method | Subject Consistency | Dynamic Degree | Weighted Avg. |
|---|---|---|---|
| Make-your-video | 90.04 | 51.95 | 70.17 |
| VideoX-Fun | 96.25 | 50.43 | 72.85 |
| OmniVDiff | 97.96 | 53.32 | 73.45 |
Zero-shot video depth estimation (ScanNet):
| Method | AbsRel ↓ | δ1 ↑ |
|---|---|---|
| DepthCrafter | 0.169 | 0.730 |
| VDA-S (teacher) | 0.110 | 0.876 |
| OmniVDiff | 0.125 | 0.852 |
| OmniVDiff-Syn | 0.100 | 0.894 |
### Ablation Study
| Configuration | Subject Consistency | Dynamic Degree | Weighted Avg. | Note |
|---|---|---|---|---|
| w/o modality embedding | 97.11 | 41.80 | 71.54 | No role differentiation across modalities |
| w/o AMCS | 97.31 | 33.28 | 71.21 | No adaptive control; dynamic degree drops sharply |
| w/o MSPH | 96.76 | 41.41 | 71.35 | Shared projection head; modality features entangled |
| Full OmniVDiff | 97.78 | 49.69 | 72.78 | All components working in concert |
### Key Findings
- Trained solely on pseudo-labels, OmniVDiff achieves video depth estimation performance approaching or even surpassing the expert teacher model VDA-S.
- With a small amount (10K) of high-quality synthetic data, OmniVDiff-Syn outperforms the teacher model on AbsRel (0.100 vs. 0.110).
- AMCS has the largest impact on dynamic degree (removal causes a drop from 49.69 to 33.28), demonstrating that adaptive control is critical for motion dynamics modeling.
- In terms of inference efficiency, OmniVDiff adds only 11.8M parameters and 3 seconds of latency while simultaneously outputting RGB + depth + segmentation + Canny.
## Highlights & Insights
- Elegant unified framework design: Three concise designs—channel concatenation, modality embeddings, and adaptive control—unify generation and understanding within a single model.
- Effectiveness of pseudo-label training: Demonstrates that pseudo-labels generated by expert models are sufficient to train a unified model that approaches or even surpasses the experts themselves.
- Flexible task adaptability: New modalities or tasks (e.g., super-resolution) can be adapted with only 2K fine-tuning steps, showcasing strong extensibility.
- Elimination of external expert dependencies: The end-to-end pipeline reduces the complexity and inconsistency of multi-model deployment.
## Limitations & Future Work
- Dynamic degree in text-conditioned generation is slightly lower than CogVideoX (49.69 vs. 53.98), suggesting that joint multimodal training may mildly compromise motion dynamics.
- Currently limited to four modalities; extension to additional modalities (optical flow, surface normals, semantic labels, etc.) remains to be validated.
- Segmentation quality depends on Semantic-SAM and SAM2 pseudo-labels; annotation noise may constrain the performance ceiling.
- Validation on higher resolutions or longer videos has not been conducted.
## Related Work & Insights
- The choice of CogVideoX as the base model is critical; its spatiotemporal compression capability in the 3D-VAE makes multimodal concatenation feasible.
- Analogous approaches in the image domain (OneDiff, UniReal) unify tasks via a "multi-view" formulation; the proposed channel concatenation approach elegantly avoids the token explosion problem in video settings.
- The design of AMCS is generalizable to other multimodal generation tasks (e.g., audio-visual generation).
## Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐