MoVieDrive: Urban Scene Synthesis with Multi-Modal Multi-View Video Diffusion Transformer¶
Conference: CVPR 2026 · arXiv: 2508.14327 · Code: Unavailable · Area: Autonomous Driving / Video Generation · Keywords: Multi-modal multi-view video generation, Diffusion Transformer, urban scene synthesis, conditional control, CogVideoX
TL;DR¶
The first method to simultaneously generate RGB + depth + semantic tri-modal multi-view driving scene videos within a unified DiT framework. Through a decomposed design of modal-shared layers (temporal + multi-view spatiotemporal attention) and modal-specific layers (cross-modal interaction + projection heads), a unified layout encoder, and diverse conditioning, the method achieves FVD 46.8 on nuScenes (a 22% improvement over CogVideoX+SyntheOcc), depth AbsRel 0.110, and semantic mIoU 37.5, outperforming pipelines that generate RGB with one model and estimate the remaining modalities with separate models.
Background & Motivation¶
Background: Autonomous driving scene video generation has advanced rapidly. Methods such as MagicDrive, DriveDreamer, and MaskGWM leverage diffusion models to achieve promising multi-view RGB video generation. However, these methods focus exclusively on the RGB modality.
Limitations of Prior Work: Autonomous driving requires multi-modal data (RGB + depth + semantics) for comprehensive scene understanding. Although multiple independent models can generate different modalities separately (e.g., generating RGB first and then estimating depth with Depth-Anything-V2), this increases deployment complexity, fails to exploit complementary inter-modal information, and results in poor cross-modal consistency.
Key Challenge: How can multi-modal multi-view driving videos be generated simultaneously within a unified framework? The key challenges are: (1) different modalities exhibit large content variation yet share underlying scene structure, requiring a distinction between shared and modality-specific knowledge; (2) both multi-view spatiotemporal consistency and cross-modal consistency must be ensured simultaneously; (3) complex driving scenes require fine-grained conditional control.
Goal: To build a unified multi-modal multi-view video DiT model that simultaneously generates 6-view, 49-frame videos across three modalities while guaranteeing spatiotemporal and cross-modal consistency.
Key Insight: Based on the observation that CogVideoX's shared 3D VAE can process videos of different modalities, the authors hypothesize that different modalities share a common latent space and require only a small number of modality-specific parameters to differentiate them. This motivates the modal-shared + modal-specific decomposition design.
Core Idea: Modal-shared layers in the unified DiT learn common spatiotemporal structure; modal-specific layers capture modality differences; diverse condition encodings control scene generation.
Method¶
Overall Architecture¶
Built upon CogVideoX (v1.1-2B). Three types of conditions (text, contextual reference frames, and layout) are processed through unified encoders to extract embeddings, which are concatenated with noisy latents and fed into a DiT composed of modal-shared and modal-specific layers. A shared 3D VAE encodes and decodes all modalities. DDPM noise scheduling is used during training, and DDIM with classifier-free guidance is applied at inference. The default configuration is 6 cameras × 49 frames × 512×256 resolution.
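For a sense of scale, here is a small bookkeeping sketch of this default setting (hypothetical names; the 3D VAE compression factors are assumed to be CogVideoX's defaults of 4× temporal and 8× spatial, which this note does not state):

```python
from dataclasses import dataclass

@dataclass
class GenConfig:
    views: int = 6
    frames: int = 49
    width: int = 512
    height: int = 256
    modalities: tuple = ("rgb", "depth", "semantic")
    vae_t_down: int = 4   # assumed CogVideoX temporal compression
    vae_s_down: int = 8   # assumed CogVideoX spatial compression

    def latent_tokens_per_modality(self) -> int:
        # Assumes a causal 3D VAE that keeps the first frame uncompressed in time.
        k = 1 + (self.frames - 1) // self.vae_t_down
        h, w = self.height // self.vae_s_down, self.width // self.vae_s_down
        return self.views * k * h * w

cfg = GenConfig()
print(cfg.latent_tokens_per_modality())  # tokens the DiT attends over, per modality
```

Under these assumed factors, a 49-frame clip maps to 13 latent frames and roughly 160K DiT tokens per modality.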
Key Designs¶
- Diverse Condition Encoding
  - Function: Encodes text, layout constraints, and reference frames into unified conditional embeddings that control scene generation.
  - Mechanism: (a) Text conditioning: camera intrinsics and extrinsics are encoded via Fourier encoding + an MLP encoder \(E^\text{cam}\); video descriptions are encoded by a frozen T5 encoder \(E^\text{text}\); the concatenated embeddings are injected via cross-attention in the DiT. (b) Layout conditioning: 3D bounding box projection maps \(c^b\), road structure maps \(c^r\), and 3D occupancy sparse semantic maps \(c^o\) are fused through a unified layout encoder (an independent causal ResNet per condition followed by a shared causal ResNet): \(f^\text{layout} = E_s^l(E_b^l(c^b) \otimes E_r^l(c^r) \otimes E_o^l(c^o))\). (c) Contextual reference: the first frame is encoded by the 3D VAE to condition future prediction.
  - Design Motivation: The unified layout encoder implicitly aligns the condition embedding spaces, proving more effective than multiple independent encoders.
- Modal-Shared Components (Temporal + Multi-View Spatiotemporal Blocks)
  - Function: Learn the temporal consistency and multi-view spatial structure shared across all modalities.
  - Mechanism: (a) Temporal attention layer \(D^\text{tem}\): CogVideoX's 3D full attention learns inter-frame consistency, with text injected via cross-attention; it operates on tensors reshaped to \(\mathbb{R}^{V \times (KHW) \times C}\). (b) Multi-view spatiotemporal block \(D^\text{st}\): inserted every \(\alpha_1\) layers; it contains 3D spatial attention (\(\mathbb{R}^{K \times (VHW) \times C}\), for cross-view structure), hash-grid 3D spatial embeddings, and full spatiotemporal attention (\(\mathbb{R}^{(VKHW) \times C}\), for global context). A shape-level sketch of these attention layouts follows this list.
  - Design Motivation: Temporal attention alone cannot guarantee multi-view consistency (FVD degrades from 46.8 to 153.7 without the spatiotemporal blocks); the multi-view spatiotemporal block explicitly models cross-view spatial relationships.
- Modal-Specific Components (Cross-Modal Interaction + Projection Heads)
  - Function: Learn modality-specific content on top of the shared representations while maintaining cross-modal alignment.
  - Mechanism: Cross-modal interaction layers are inserted every \(\alpha_2\) layers and comprise self-attention, cross-modal cross-attention (query = the current modality's latent; key/value = the concatenated latents of the other modalities), and an FFN. Modality-specific projection heads (a linear layer + adaptive normalization) independently predict the noise for each modality: \(h'_m = D_m^\text{cm}(h, h_m^\text{modal}, t)\).
  - Design Motivation: Cross-modal cross-attention lets the modalities exchange complementary information; unified generation yields higher quality than generating each modality independently with external models.
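To make the shared/specific decomposition concrete, below is a shape-level sketch in PyTorch + einops (illustrative tensor sizes and hypothetical variable names; not the authors' implementation) of how the multi-modal multi-view latents are assumed to be rearranged for each attention type, plus the cross-modal cross-attention in which one modality queries the concatenated latents of the others. Hash-grid spatial embeddings, text cross-attention, and the projection heads are omitted.

```python
import torch
from einops import rearrange

# Illustrative sizes only: M modalities, V views, K latent frames, HxW latent grid, C channels.
M, V, K, H, W, C = 3, 6, 2, 4, 8, 64
x = torch.randn(M, V, K, H, W, C)  # multi-modal multi-view latent sequence

# (1) Modal-shared temporal attention (per view): sequences of length K*H*W,
#     batched over modalities and views -> R^{V x (KHW) x C} per modality.
x_tem = rearrange(x, "m v k h w c -> (m v) (k h w) c")

# (2) Multi-view 3D spatial attention (per frame): sequences of length V*H*W,
#     batched over modalities and frames -> R^{K x (VHW) x C} per modality.
x_view = rearrange(x, "m v k h w c -> (m k) (v h w) c")

# (3) Full spatiotemporal attention: one global sequence of length V*K*H*W
#     per modality -> R^{(VKHW) x C}.
x_full = rearrange(x, "m v k h w c -> m (v k h w) c")

# (4) Modal-specific cross-modal interaction: the current modality (here RGB, index 0)
#     queries the concatenated tokens of the remaining modalities.
cross_attn = torch.nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)
rgb = x_full[:1]                                       # (1, V*K*H*W, C)
others = rearrange(x_full[1:], "m n c -> 1 (m n) c")   # (1, (M-1)*V*K*H*W, C)
rgb_updated, _ = cross_attn(query=rgb, key=others, value=others)

print(x_tem.shape, x_view.shape, x_full.shape, rgb_updated.shape)
```

In the full model these operations live in different layers of the DiT: the temporal layers throughout, the multi-view spatiotemporal blocks every \(\alpha_1\) layers, and the cross-modal interaction layers every \(\alpha_2\) layers.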
Loss & Training¶
- \(\mathcal{L} = \sum_m \lambda_m \mathbb{E}_{x_{0,m}, t_m, \epsilon_m, C} \|\epsilon_m - \epsilon_{\theta,m}(x_{t,m}, t_m, C)\|^2\), with per-modality weighting \(\lambda_m\) (a minimal sketch of this objective follows the list).
- AdamW, lr=2e-4; 3D VAE and T5 are frozen; conditioning dropout is applied for improved generalization.
- Depth ground truth is generated by Depth-Anything-V2; semantic ground truth is generated by Mask2Former (not real annotations).
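For reference, a minimal sketch of this objective (PyTorch; `model`, its call signature, and the dict-based data layout are hypothetical, and the real training loop, noise schedule, and conditioning dropout are not reproduced here):

```python
import torch

def multimodal_diffusion_loss(model, x0, cond, alphas_cumprod, weights):
    """x0: dict modality -> clean latents (B, ...); weights: dict modality -> lambda_m."""
    losses = {}
    for m, x0_m in x0.items():
        B = x0_m.shape[0]
        # Each modality samples its own timestep t_m and noise eps_m.
        t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=x0_m.device)
        eps = torch.randn_like(x0_m)
        a_bar = alphas_cumprod[t].view(B, *([1] * (x0_m.dim() - 1)))
        # Standard DDPM forward process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps.
        x_t = a_bar.sqrt() * x0_m + (1.0 - a_bar).sqrt() * eps
        # Modality-specific noise prediction through the shared DiT + projection head.
        eps_pred = model(x_t, t, cond, modality=m)
        losses[m] = weights[m] * torch.mean((eps - eps_pred) ** 2)
    return sum(losses.values()), losses
```

Each modality draws its own timestep and noise, matching the per-modality expectation in the loss above.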
Key Experimental Results¶
Main Results — nuScenes¶
| Method | Conference | FVD↓ | mAP↑ | mIoU↑ | AbsRel↓ | Sem mIoU↑ |
|---|---|---|---|---|---|---|
| MagicDrive | ICLR24 | 236.2 | 9.7 | 15.6 | 0.255 | 23.5 |
| MagicDrive-V2 | ICCV25 | 112.7 | 11.5 | 17.4 | 0.280 | 22.4 |
| DriveDreamer-2 | AAAI25 | 55.7 | — | — | — | — |
| CogVideoX+SyntheOcc | — | 60.4 | 15.9 | 28.2 | 0.124 | 32.4 |
| MoVieDrive | — | 46.8 | 22.7 | 35.8 | 0.110 | 37.5 |
On Waymo: MoVieDrive FVD 61.6 vs. CogVideoX+SyntheOcc 82.3 (25% improvement).
Ablation Study — Multi-Modal Generation¶
| Configuration | FVD↓ | AbsRel↓ | Sem mIoU↑ | Note |
|---|---|---|---|---|
| RGB only + external model estimation | 42.0 | 0.121 | 36.4 | Best RGB quality but poor multi-modal quality |
| RGB+Depth unified + external semantics | 43.4 | 0.111 | 36.0 | Depth quality improves |
| RGB+Depth+Semantics fully unified | 46.8 | 0.110 | 37.5 | Best overall multi-modal quality |
Ablation Study — DiT Components¶
| Configuration | FVD↓ | Note |
|---|---|---|
| L1 (temporal layers only) | 153.7 | No multi-view consistency |
| L1 + L3 (temporal + modal-specific) | 78.8 | No cross-view spatial learning |
| L1 + L2 + L3 (full model) | 46.8 | All components present |
| CogVideoX + cross-view attention | 118.4 | Simple modification is insufficient |
Key Findings¶
- The multi-view spatiotemporal block is critical: removing it causes FVD to degrade from 46.8 to 153.7 (3.3× worse).
- Unified multi-modal generation achieves better depth (AbsRel 0.110) and semantics (mIoU 37.5) than RGB + external model estimation (0.121 / 36.4), at the cost of a marginal increase in RGB FVD (42.0 → 46.8), indicating slight inter-modal interference.
- The unified layout encoder outperforms independent encoders, attributed to implicit alignment of condition embedding spaces.
- Simply adding cross-view attention to CogVideoX still yields FVD 118.4, far inferior to MoVieDrive's 46.8.
Highlights & Insights¶
- First unified multi-modal multi-view generation framework — filling a gap in autonomous driving scene generation. The modal-shared + modal-specific decomposition exploits the shared latent space hypothesis of the 3D VAE and is parameter-efficient.
- The cross-modal interaction layers enable different modalities to exchange complementary information — unified generation not only reduces the number of models required but also genuinely improves depth and semantic quality compared to independent generation.
- The unified layout encoder design outperforms multiple independent encoders through implicit embedding space alignment when fusing diverse layout conditions, and is generalizable to other multi-condition controlled generation tasks.
- Good scalability: supports long video generation (without reference frames) and scene editing across different weather/time conditions via text.
Limitations & Future Work¶
- Generated long videos still exhibit noise in distant regions, and temporal consistency degrades over longer sequences.
- Depth and semantic ground truth are derived from pretrained model estimates (Depth-Anything-V2 / Mask2Former) rather than real annotations, imposing a ceiling on training signal quality.
- Multi-modal generation slightly increases RGB FVD (42.0 → 46.8), calling for improved inter-modal disentanglement strategies.
- The framework has not been extended to 3D modalities such as LiDAR point clouds.
- Integration with closed-loop simulators has not been explored, and downstream task benefits remain unquantified.
- Training cost is high (6 views × 49 frames × multiple modalities), demanding substantial computational resources.
Related Work & Insights¶
- vs. MagicDrive / MagicDrive-V2: These methods generate RGB only and require additional models for depth and semantics. MoVieDrive substantially outperforms both on FVD (46.8 vs. 112.7 / 236.2) and controllability (mAP 22.7 vs. 11.5 / 9.7).
- vs. UniScene (CVPR25): Uses separate models for RGB and LiDAR generation, remaining a non-unified approach. MoVieDrive achieves true single-model multi-modal generation.
- vs. CogVideoX+SyntheOcc: The most direct competitor. MoVieDrive consistently leads on all metrics (FVD 46.8 vs. 60.4), demonstrating the necessity of dedicated multi-modal multi-view architectural design.
Rating¶
⭐⭐⭐⭐
- Novelty ⭐⭐⭐⭐: First unified multi-modal multi-view driving video generation; the decomposition design is principled.
- Experimental Thoroughness ⭐⭐⭐⭐: Evaluated on nuScenes and Waymo with comprehensive ablations and detailed supplementary material.
- Writing Quality ⭐⭐⭐⭐: Clear structure, detailed methodology, and rich figures and tables.
- Value ⭐⭐⭐⭐: Provides a more complete solution for autonomous driving scene generation.