X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability¶
Conference: NeurIPS 2025
arXiv: 2506.13558
Code: https://x-scene.github.io/
Area: Autonomous Driving / Scene Generation
Keywords: large-scale scene generation, multi-granularity control, occupancy generation, 3DGS reconstruction, autonomous driving simulation
TL;DR¶
This paper presents X-Scene, a unified large-scale driving scene generation framework that supports multi-granularity control, from high-level text prompts down to low-level BEV layouts. By jointly generating 3D semantic occupancy, multi-view images, and videos, and by leveraging consistency-aware extrapolation for large-scale scene expansion, X-Scene outperforms existing methods both in generation quality (FID 11.29) and in downstream task utility.
Background & Motivation¶
Diffusion models have demonstrated remarkable success in autonomous driving data synthesis and simulation. However, existing works primarily focus on temporally consistent video generation (e.g., MagicDrive, DriveDreamer), while spatially consistent large-scale 3D scene generation remains an underexplored direction.
Core limitations of prior work:
SemCity: Can generate city-level 3D occupancy grids but lacks appearance details, making it unsuitable for realistic simulation.
UniScene / InfiniCube: Jointly generate occupancy and images but require manually designed large-scale layouts as input, resulting in complex pipelines with limited flexibility.
General large-scale urban generation methods (InfiniCity, CityDreamer): Not tailored for driving scenarios, lacking precise road layouts and dynamic objects.
Large-scale driving scene generation faces three core challenges: flexible controllability, high-fidelity geometry and appearance, and large-scale consistency.
Core Idea: Construct a unified cascaded generation pipeline from text/layout to occupancy–image–video, enable large-scale scene expansion via consistency-aware extrapolation, and reconstruct scenes as 3DGS to support downstream applications.
Method¶
Overall Architecture¶
X-Scene consists of three core modules: (1) multi-granularity controllability—combining high-level text and low-level layout conditions; (2) joint occupancy–image–video generation—ensuring cross-modal alignment and temporal consistency; (3) large-scale scene extrapolation and 3DGS reconstruction.
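To make the cascade concrete, below is a minimal interface sketch of the three-stage pipeline. All names (`SceneLayout`, `generate_scene`, and the stub functions) are illustrative placeholders, not the paper's released API.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SceneLayout:
    """Low-level condition: a semantic BEV map plus 3D object boxes."""
    bev_map: List[List[int]]   # H x W grid of semantic classes
    boxes_3d: List[dict]       # 3D bounding boxes for dynamic objects

def generate_scene(text: Optional[str] = None,
                   layout: Optional[SceneLayout] = None):
    """Illustrative cascade: (text -> layout) -> occupancy -> images -> video."""
    assert text is not None or layout is not None, "need at least one condition"
    if layout is None:
        layout = text_to_layout(text)          # high-level path: LLM + scene graph + layout diffusion
    occupancy = occupancy_from_layout(layout)  # stage 1: 3D semantic occupancy
    images = images_from_occupancy(occupancy)  # stage 2: geometry-conditioned multi-view images
    video = video_from_images(images)          # stage 3: autoregressive video
    return occupancy, images, video

# Stubs standing in for the LLM path and the three diffusion models.
def text_to_layout(text: str) -> SceneLayout:
    return SceneLayout(bev_map=[], boxes_3d=[])

def occupancy_from_layout(layout: SceneLayout):
    return "semantic_occupancy"

def images_from_occupancy(occupancy):
    return [f"view_{i}" for i in range(6)]

def video_from_images(images):
    return [images for _ in range(8)]
```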
Key Designs¶
- Multi-Granularity Controllability:
  - Function: Supports multi-level scene control ranging from coarse text prompts to fine-grained geometric layouts.
  - Mechanism: The high-level path employs a RAG-augmented LLM to generate detailed scene descriptions and construct scene graphs, followed by graph convolution and conditional diffusion for layout generation. The low-level path directly consumes BEV layouts and 3D bounding boxes.
  - Design Motivation: High-level control suits rapid prototyping, while low-level control suits precise simulation; the two paths are complementary.
- Joint Occupancy–Image–Video Generation:
  - Function: Generates aligned occupancy fields, multi-view images, and sequential videos in a 3D-to-2D hierarchical order.
  - Occupancy Generation: Uses a triplane representation with a proposed triplane deformable attention mechanism to mitigate the information loss caused by downsampling.
  - Image Generation: Occupancy voxels are converted to 3D Gaussians that render semantic and depth maps, which are fused into geometric embeddings to condition the image diffusion model.
  - Video Generation: Preceding images serve as reference frames; only the temporal attention layers are fine-tuned, enabling autoregressive streaming generation.
  - Design Motivation: Generating 3D geometry first and 2D appearance second ensures geometry–appearance consistency.
- Large-Scale Scene Extrapolation and 3DGS Reconstruction:
  - Function: Extends local generation to large-scale coherent environments.
  - Occupancy Extrapolation: The triplane is decomposed into three 2D planes that are extrapolated outward, with overlapping masks enabling synchronized denoising (see the sketch after this list).
  - Image Extrapolation: The diffusion model is fine-tuned to condition on reference images and camera embeddings for novel-view synthesis.
  - Design Motivation: Consistency-aware extrapolation ensures structural coherence in overlapping regions.
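The overlap-synchronized denoising can be illustrated with a short sketch. This is a minimal RePaint-style masked-denoising loop under a DDIM (η = 0) sampler, assuming a generic ε-prediction `denoiser`; the paper's actual scheduler, masking strategy, and triplane handling may differ.

```python
import torch

@torch.no_grad()
def outpaint_plane(denoiser, known, mask, num_steps=50):
    """Consistency-aware extrapolation of one 2D triplane slice (sketch).

    `known` holds already-generated content; `mask` is True on the overlap
    with a neighboring tile. At every step the overlap is re-noised to the
    current noise level and written back, so the new region is denoised in
    sync with the existing scene.
    """
    betas = torch.linspace(1e-4, 2e-2, num_steps)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    x = torch.randn_like(known)
    for i in reversed(range(num_steps)):
        a = alphas_bar[i]
        # Synchronize: inject the known overlap at the current noise level.
        noised_known = a.sqrt() * known + (1 - a).sqrt() * torch.randn_like(known)
        x = torch.where(mask, noised_known, x)
        t = torch.full((x.shape[0],), i, dtype=torch.long)
        eps = denoiser(x, t)                             # predict the noise
        x0_hat = (x - (1 - a).sqrt() * eps) / a.sqrt()   # estimate clean sample
        if i > 0:
            a_prev = alphas_bar[i - 1]
            x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps  # DDIM (eta=0) step
        else:
            x = x0_hat
    return torch.where(mask, known, x)
```

In this reading, repeating such a loop per plane and per tile is what lets a locally trained model grow a coherent larger map.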
Loss & Training¶
The three diffusion models are trained independently, all using a noise prediction objective. For video diffusion, only the temporal attention layers are fine-tuned.
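For concreteness, here is a minimal sketch of the noise-prediction objective and of restricting video fine-tuning to temporal attention. `cond` stands for whatever conditioning each model uses, and the `"temporal"` parameter-name filter is a hypothetical convention, not the paper's code.

```python
import torch
import torch.nn.functional as F

def noise_prediction_loss(model, x0, cond, alphas_bar):
    """Standard DDPM epsilon-prediction objective (sketch)."""
    alphas_bar = alphas_bar.to(x0.device)
    b = x0.shape[0]
    t = torch.randint(0, alphas_bar.shape[0], (b,), device=x0.device)
    eps = torch.randn_like(x0)
    a = alphas_bar[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps    # forward diffusion q(x_t | x_0)
    return F.mse_loss(model(x_t, t, cond), eps)   # predict the injected noise

def freeze_all_but_temporal(video_model: torch.nn.Module):
    """Fine-tune only temporal attention parameters of the video model."""
    for name, p in video_model.named_parameters():
        p.requires_grad = "temporal" in name  # hypothetical naming convention
```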
Key Experimental Results¶
Main Results¶
Occupancy generation:
| Method | FID3D ↓ | F3D ↑ | P3D ↑ | R3D ↑ |
|---|---|---|---|---|
| UniScene | 529.6 | 0.396 | 0.382 | 0.412 |
| X-Scene | 258.8 | 0.778 | 0.769 | 0.787 |
Multi-view image generation:
| Method | FID ↓ | Road mIoU ↑ | Veh. mIoU ↑ | mAP ↑ | NDS ↑ |
|---|---|---|---|---|---|
| MagicDrive | 16.20 | 61.05 | 27.01 | 12.30 | 23.32 |
| DreamForge | 14.61 | 65.27 | 28.36 | 13.01 | 22.16 |
| X-Scene (224×400) | 11.29 | 66.48 | 29.76 | 16.28 | 26.26 |
| X-Scene (448×800) | 12.77 | 69.06 | 33.27 | 27.65 | 34.48 |
Data augmentation effect:
| Data | 3D mAP ↑ | BEV Road mIoU ↑ | BEV Veh. mIoU ↑ |
|---|---|---|---|
| Real only | 34.5 | 74.30 | 36.00 |
| +UniScene | 36.5 | 81.69 | 41.62 |
| +X-Scene | 39.9 | 83.37 | 43.05 |
Ablation Study¶
| Configuration | IoU ↑ | mIoU ↑ | FID3D ↓ | F3D ↑ | Notes |
|---|---|---|---|---|---|
| Full | 85.6 | 92.4 | 258.8 | 0.778 | — |
| w/o Deform Attn (50×50) | 64.7 | 74.2 | 462.4 | 0.510 | Severe downsampling loss |
| w/ Deform Attn (50×50) | 66.6 | 76.6 | 436.1 | 0.522 | Deformable attention adds +2.4 mIoU points |
| w/o Layout Cond | 85.6 | 92.4 | 1584 | 0.237 | FID3D degrades ~6× |
Key Findings¶
- Triplane-VAE achieves reconstruction mIoU of 92.4%, substantially surpassing UniScene's 73.7%.
- FID3D is roughly halved relative to UniScene (258.8 vs. 529.6, a 51.1% reduction).
- Data augmentation improves 3D detection mAP by 5.4 points (34.5 → 39.9).
- Training with 7 frames outperforms the 16-frame baseline (FVD 179.7 vs. 217.9), validating the efficiency of autoregressive temporal modeling.
Highlights & Insights¶
- First end-to-end driving scene generation framework: text → scene graph → layout → occupancy → image → video → 3DGS.
- The dual-path multi-granularity control design is highly practical.
- Triplane deformable attention substantially improves reconstruction accuracy while maintaining encoding efficiency.
- Consistency-aware extrapolation elegantly extends local generation to large-scale scenes.
Limitations & Future Work¶
- Progressive extrapolation may accumulate errors in extremely large-scale scenes.
- The scene-graph-to-layout diffusion generation depends on the diversity of training data.
- Spatiotemporally consistent generation of dynamic objects remains unexplored.
Related Work & Insights¶
- Complementary to MagicDrive and UniScene: X-Scene emphasizes spatial scalability and controllability.
- The RAG + scene graph + diffusion pipeline for text-to-layout generation is transferable to domains such as indoor scene generation.
- The occupancy-first generation paradigm offers a new framework for 3D perception data augmentation.
Rating¶
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐