# OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding
- Conference: AAAI 2026
- arXiv: 2504.10825
- Code: https://tele-ai.github.io/OmniVDiff/ (project page)
- Area: Video Generation
- Keywords: Video diffusion models, multimodal generation, controllable video generation, video understanding, unified framework
## TL;DR
This paper proposes OmniVDiff, a unified controllable video diffusion framework that jointly models multiple visual modalities (RGB, depth, segmentation, Canny) in color space and introduces an Adaptive Modality Control Strategy (AMCS). Within a single diffusion model, OmniVDiff simultaneously supports three task types—text-conditioned generation, X-conditioned generation, and video understanding—achieving state-of-the-art performance on VBench.
## Background & Motivation
Video diffusion models have achieved remarkable progress in text-to-video generation, yet controllable video generation faces two core bottlenecks:
Task-specific fine-tuning: Introducing each new control signal (depth, segmentation, Canny, etc.) requires dedicated fine-tuning of large-scale diffusion architectures, incurring prohibitive computational costs and poor scalability.
Reliance on external expert models: Most methods depend on standalone expert models (depth estimators, segmentation models, etc.) to extract conditioning signals before passing them to a separate diffusion model, forming a multi-step, non-end-to-end pipeline.
Limitations of prior work:

- VideoJAM: Jointly models only RGB and optical flow; does not support conditional generation or understanding.
- UDPDiff: Supports joint generation of RGB + depth or RGB + segmentation, but cannot synthesize all modalities simultaneously.
- Aether: Unifies RGB + depth + camera pose, but primarily targets geometric world modeling.
Core Idea: Treat all visual modalities as parallel signals in color space, concatenate them along the channel dimension, and feed them into a unified diffusion Transformer. By dynamically assigning each modality's role (generation vs. conditioning), the framework flexibly supports diverse downstream tasks.
## Method
### Overall Architecture
OmniVDiff is built upon the pretrained CogVideoX text-to-video model and comprises three core components:

1. Multimodal video encoding: A shared 3D-VAE encoder encodes each of the four modalities (RGB, depth, segmentation, and Canny) into latent space independently.
2. OmniVDiff diffusion network: Noisy latent representations of all modalities are concatenated along the channel dimension and jointly denoised by the diffusion Transformer.
3. Multimodal video decoding: Modality-Specific Projection Heads (MSPH) disentangle the denoised output into per-modality representations, which are then reconstructed by the 3D-VAE decoder.
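To make the data flow concrete, below is a minimal PyTorch-style sketch of this three-step pipeline (shared 3D-VAE encoding, channel-wise concatenation with joint denoising, and MSPH-based decoding). All module names, latent shapes, and the stand-in backbone are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

B, C_LAT, T, H, W = 1, 16, 4, 30, 45           # assumed latent shape per modality
MODALITIES = ["rgb", "depth", "seg", "canny"]  # the four modalities handled by OmniVDiff


class OmniVDiffSketch(nn.Module):
    def __init__(self, vae_encode, vae_decode, backbone):
        super().__init__()
        self.vae_encode = vae_encode    # shared 3D-VAE encoder, applied per modality
        self.vae_decode = vae_decode    # shared 3D-VAE decoder
        self.backbone = backbone        # stand-in for the diffusion Transformer
        # Modality-Specific Projection Heads (MSPH): replicated per modality because
        # depth / segmentation / Canny statistics differ from RGB.
        self.heads = nn.ModuleDict(
            {m: nn.Conv3d(len(MODALITIES) * C_LAT, C_LAT, kernel_size=1) for m in MODALITIES}
        )

    def forward(self, videos, t):
        # 1) Encode each modality independently with the shared 3D-VAE encoder.
        latents = {m: self.vae_encode(videos[m]) for m in MODALITIES}
        # 2) Concatenate along the channel dimension and denoise jointly.
        x = torch.cat([latents[m] for m in MODALITIES], dim=1)   # (B, 4*C_LAT, T, H, W)
        feat = self.backbone(x, t)
        # 3) Disentangle per-modality latents (MSPH) and decode each one.
        return {m: self.vae_decode(self.heads[m](feat)) for m in MODALITIES}


# Toy stand-ins so the sketch runs end-to-end; the real model uses CogVideoX's 3D-VAE and DiT.
backbone = nn.Conv3d(4 * C_LAT, 4 * C_LAT, kernel_size=1)
model = OmniVDiffSketch(nn.Identity(), nn.Identity(), lambda x, t: backbone(x))
videos = {m: torch.randn(B, C_LAT, T, H, W) for m in MODALITIES}
outputs = model(videos, t=torch.tensor([500]))
print({m: tuple(v.shape) for m, v in outputs.items()})
```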
### Key Designs
- Multimodal Video Diffusion Architecture:
  - Function: Extends CogVideoX's input space to accommodate multiple modalities and equips the output with independent projection heads per modality.
  - Mechanism: Each modality is encoded into a latent representation \(x_m\) via the 3D-VAE, mixed with noise, and concatenated to form the unified input \(x_i = \text{Concat}(x_r^t, x_d^t, x_s^t, x_c^t)\).
  - Design Motivation: Projection heads are replicated rather than shared, as different modalities (depth vs. segmentation vs. Canny) exhibit fundamentally distinct distributional characteristics.
- Adaptive Modality Control Strategy (AMCS):
  - Function: Dynamically determines whether each modality acts as a "generation modality" or a "conditioning modality."
  - Mechanism: Generation modalities are mixed with noise before input; conditioning modalities are concatenated directly as clean signals. Learnable modality embeddings \(e_g\) / \(e_c\) further differentiate the roles (see the sketch after this list).
  - Design Motivation: Eliminates the need for separate fine-tuning per conditioning task; a single unified architecture can flexibly switch between tasks.
- Two-Stage Training Strategy:
  - Function: Stage one learns joint multimodal video generation; stage two incorporates conditional generation and video understanding tasks.
  - Mechanism: Each stage runs for 20K steps with independent denoising losses; conditioning modalities are excluded from loss computation.
  - Design Motivation: Progressive training avoids instability that would arise from introducing all tasks simultaneously.
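Below is a minimal sketch of how the AMCS input construction might look: generation modalities go through the forward diffusion process (mixed with noise), conditioning modalities are kept clean, and a learnable role embedding (\(e_g\) or \(e_c\)) tags each modality. The noise schedule, the way embeddings are injected, and all names are assumptions for illustration only.

```python
import torch

MODALITIES = ["rgb", "depth", "seg", "canny"]
C_LAT = 16
# Learnable role embeddings: one for "generation", one for "conditioning".
e_g = torch.nn.Parameter(torch.randn(C_LAT))
e_c = torch.nn.Parameter(torch.randn(C_LAT))


def build_model_input(latents, cond_modalities, t, alpha_bar):
    """latents: dict of clean per-modality latents, each of shape (B, C, T, H, W)."""
    parts = []
    for m in MODALITIES:
        x = latents[m]
        if m in cond_modalities:
            # Conditioning modality: keep the clean latent and tag it with e_c.
            role = e_c
        else:
            # Generation modality: apply the forward diffusion process (mix with noise).
            noise = torch.randn_like(x)
            x = alpha_bar[t].sqrt() * x + (1 - alpha_bar[t]).sqrt() * noise
            role = e_g
        # Broadcast the role embedding over space/time and add it to the latent.
        parts.append(x + role.view(1, -1, 1, 1, 1))
    # Channel-wise concatenation yields the unified input for the diffusion Transformer.
    return torch.cat(parts, dim=1)


# Example: depth-conditioned generation -> depth stays clean, the other modalities are noised.
latents = {m: torch.randn(1, C_LAT, 4, 30, 45) for m in MODALITIES}
alpha_bar = torch.linspace(0.9999, 0.01, 1000)   # toy noise schedule
x_in = build_model_input(latents, cond_modalities={"depth"}, t=500, alpha_bar=alpha_bar)
print(x_in.shape)   # torch.Size([1, 64, 4, 30, 45])
```

In this toy example, switching `cond_modalities` from `{"depth"}` to the empty set turns the same call into pure text-conditioned generation, which is exactly the task-switching flexibility AMCS is designed to provide.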
### Loss & Training
Multimodal diffusion loss: \(\mathcal{L} = \sum_{m \notin \text{Cond}} \mathbb{E}_{x_m, t, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_m^t, t, e_m) \|^2 \right]\)
- Denoising loss is computed independently for each generation modality.
- Conditioning modalities are excluded from loss computation and serve only as guidance (see the sketch after this list).
- Training data: 400K videos sampled from Koala-36M; pseudo-labels generated using Video Depth Anything and Semantic-SAM + SAM2.
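As a companion to the loss above, here is a minimal sketch (with hypothetical names) showing that the denoising MSE is accumulated only over generation modalities, while conditioning modalities contribute no gradient signal. It assumes the model's noise prediction has already been split back into per-modality tensors.

```python
import torch
import torch.nn.functional as F

MODALITIES = ["rgb", "depth", "seg", "canny"]


def omnivdiff_loss(eps_pred, eps_true, cond_modalities):
    """eps_pred / eps_true: dicts mapping modality -> noise tensor of shape (B, C, T, H, W)."""
    loss = 0.0
    for m in MODALITIES:
        if m in cond_modalities:
            continue                      # conditioning modality: guidance only, no loss term
        loss = loss + F.mse_loss(eps_pred[m], eps_true[m])
    return loss


# Example: depth acts as the condition, so only rgb / seg / canny are supervised.
eps_true = {m: torch.randn(1, 16, 4, 30, 45) for m in MODALITIES}
eps_pred = {m: eps_true[m] + 0.1 * torch.randn_like(eps_true[m]) for m in MODALITIES}
print(omnivdiff_loss(eps_pred, eps_true, cond_modalities={"depth"}))
```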
## Key Experimental Results
### Main Results
Text-conditioned video generation (VBench):
| Method | Subject Consistency | Background Consistency | Motion Smoothness | Dynamic Degree | Weighted Avg. |
|---|---|---|---|---|---|
| CogVideoX | 95.68 | 96.00 | 98.21 | 53.98 | 72.25 |
| OmniVDiff | 97.78 | 96.26 | 99.21 | 49.69 | 72.78 |
Depth-conditioned video generation (VBench):
| Method | Subject Consistency | Dynamic Degree | Weighted Avg. |
|---|---|---|---|
| Make-your-video | 90.04 | 51.95 | 70.17 |
| VideoX-Fun | 96.25 | 50.43 | 72.85 |
| OmniVDiff | 97.96 | 53.32 | 73.45 |
Zero-shot video depth estimation (ScanNet):
| Method | AbsRel ↓ | δ1 ↑ |
|---|---|---|
| DepthCrafter | 0.169 | 0.730 |
| VDA-S (teacher) | 0.110 | 0.876 |
| OmniVDiff | 0.125 | 0.852 |
| OmniVDiff-Syn | 0.100 | 0.894 |
### Ablation Study
| Configuration | Subject Consistency | Dynamic Degree | Weighted Avg. | Note |
|---|---|---|---|---|
| w/o modality embedding | 97.11 | 41.80 | 71.54 | No role differentiation across modalities |
| w/o AMCS | 97.31 | 33.28 | 71.21 | No adaptive control; dynamic degree drops sharply |
| w/o MSPH | 96.76 | 41.41 | 71.35 | Shared projection head; modality features entangled |
| Full OmniVDiff | 97.78 | 49.69 | 72.78 | All components working in concert |
### Key Findings
- Trained solely on pseudo-labels, OmniVDiff achieves video depth estimation performance approaching or even surpassing the expert teacher model VDA-S.
- With a small amount (10K) of high-quality synthetic data, OmniVDiff-Syn outperforms the teacher model on AbsRel (0.100 vs. 0.110).
- AMCS has the largest impact on dynamic degree (removal causes a drop from 49.69 to 33.28), demonstrating that adaptive control is critical for motion dynamics modeling.
- In terms of inference efficiency, OmniVDiff adds only 11.8M parameters and 3 seconds of latency while simultaneously outputting RGB + depth + segmentation + Canny.
## Highlights & Insights
- Elegant unified framework design: Three concise designs—channel concatenation, modality embeddings, and adaptive control—unify generation and understanding within a single model.
- Effectiveness of pseudo-label training: Demonstrates that pseudo-labels generated by expert models are sufficient to train a unified model that approaches or even surpasses the experts themselves.
- Flexible task adaptability: New modalities or tasks (e.g., super-resolution) can be adapted with only 2K fine-tuning steps, showcasing strong extensibility.
- Elimination of external expert dependencies: The end-to-end pipeline reduces the complexity and inconsistency of multi-model deployment.
## Limitations & Future Work
- Dynamic degree in text-conditioned generation is slightly lower than CogVideoX (49.69 vs. 53.98), suggesting that joint multimodal training may mildly compromise motion dynamics.
- Currently limited to four modalities; extension to additional modalities (optical flow, surface normals, semantic labels, etc.) remains to be validated.
- Segmentation quality depends on Semantic-SAM and SAM2 pseudo-labels; annotation noise may constrain the performance ceiling.
- Validation on higher resolutions or longer videos has not been conducted.
## Related Work & Insights
- The choice of CogVideoX as the base model is critical; its spatiotemporal compression capability in the 3D-VAE makes multimodal concatenation feasible.
- Analogous approaches in the image domain (OneDiff, UniReal) unify tasks via a "multi-view" formulation; the proposed channel concatenation approach elegantly avoids the token explosion problem in video settings.
- The design of AMCS is generalizable to other multimodal generation tasks (e.g., audio-visual generation).
## Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐