# Decoupled Diffusion Sparks Adaptive Scene Generation
Conference: ICCV 2025 | arXiv: 2504.10485 | Code: opendrivelab.com/Nexus | Area: Autonomous Driving | Keywords: Scene Generation, Decoupled Diffusion, Autonomous Driving Simulation, Safety-Critical Scenes, Traffic Layout
## TL;DR
This paper proposes Nexus, an adaptive driving scene generation framework based on decoupled diffusion. By assigning independent noise states to individual tokens, Nexus unifies goal-oriented and reactive generation, reduces displacement error by 40%, and introduces Nexus-Data, a dataset comprising 540 hours of safety-critical driving scenarios.
## Background & Motivation
- Diversity is essential in autonomous driving datasets, yet safety-critical long-tail scenarios remain extremely scarce.
- Existing scene generation methods face a fundamental trade-off:
    - Full-sequence diffusion (e.g., SceneDiffuser): supports goal-oriented generation via inpainting, but cannot respond promptly to environmental changes.
    - Autoregressive prediction (e.g., GUMP): enables real-time environmental feedback, but lacks awareness of goal states for precise control.
- No existing method provides both reactivity and goal orientation.
- Publicly available datasets predominantly capture safe driving behaviors, with insufficient coverage of high-risk scenarios.
## Method
### Core Idea: Noise as Soft Masks
- Key insight: different noise levels are interpreted as masks of varying intensity.
- Low-noise tokens → confirmed goal/historical states (analogous to hard masks).
- High-noise tokens → future states to be generated.
- This formulation unifies diffusion models (noise-axis masking) and autoregressive prediction (time-axis masking) into a tri-axis masked modeling framework.
- Each token is assigned an independent noise level, forming a noise matrix \(\mathbf{k} \in (0,1]^{A \times \mathcal{T}}\) (see the sketch after this list).
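To make the soft-mask view concrete, here is a minimal PyTorch sketch of per-token perturbation, assuming a simple cosine variance-preserving schedule; the function name and the history-frame example are illustrative, not from the paper.

```python
import torch

def soft_mask_perturb(x0: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Perturb each agent-time token with its own noise level.

    x0: clean agent tensor, shape (A, T, D)
    k:  per-token noise levels in (0, 1], shape (A, T);
        k -> 0 keeps a token almost clean (hard-mask-like condition),
        k -> 1 replaces it with nearly pure noise (to be generated).
    """
    alpha_bar = torch.cos(k * torch.pi / 2) ** 2   # cosine schedule (assumed)
    a = alpha_bar[..., None]                       # broadcast over feature dim D
    return a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)

A, T, D = 8, 20, 6
x0 = torch.randn(A, T, D)                          # toy agent tensor
k = torch.rand(A, T).clamp(min=1e-3)               # independent noise per token
k[:, :4] = 1e-3                                    # history frames: near-zero noise
x_soft_masked = soft_mask_perturb(x0, k)
```

Because each token carries its own \(k\), the same forward pass covers both inpainting-style goal conditioning and autoregressive-style history conditioning.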
### Scene Encoding and Representation
- Agent Tensor: \(\mathbf{x} \in \mathbb{R}^{A \times \mathcal{T} \times D}\), encoding coordinates, heading, velocity, and size.
- Map Tensor: \(\mathbf{c} \in \mathbb{R}^{L \times N \times D'}\), encoding road topology.
- Perceiver IO is used to compress map information into fixed-length tokens.
- Agent tokens are perturbed with independently sampled noise and encoded with 2D rotary positional embeddings (physical time + denoising step); a sketch of this embedding follows.
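One plausible reading of the 2D rotary embedding, sketched below: half of the channels rotate with the physical timestep and the other half with the denoising step. The even channel split and the frequency base are assumptions; validity masking and Perceiver IO compression are omitted.

```python
import torch

def rotary_angles(pos: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotary frequencies for one axis: pos (...,) -> angles (..., dim // 2)."""
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return pos.float()[..., None] * freqs

def apply_rotary(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive feature pairs of x (..., dim) by angles (..., dim // 2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, t_phys: torch.Tensor, t_denoise: torch.Tensor) -> torch.Tensor:
    """2D rotary embedding: first half of channels indexed by physical time,
    second half by denoising step (the split is an assumption)."""
    d = x.shape[-1] // 2
    return torch.cat([apply_rotary(x[..., :d], rotary_angles(t_phys, d)),
                      apply_rotary(x[..., d:], rotary_angles(t_denoise, d))], dim=-1)
```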
### Noise-Masked Training (Goal Orientation)
- During training, noise levels are sampled independently per token rather than uniformly across the full sequence.
- The model learns to reconstruct complete sequences from partially soft-masked inputs, conditioned on low-noise guidance tokens.
- Optimization objective: \(\min_\theta \mathbb{E}\,\big\|\epsilon - \epsilon_\theta\big(g(\mathbf{x}^0, \mathbf{k});\, \mathbf{c}, \mathbf{k}\big)\big\|_2^2\) for all \(\mathbf{k} \in (0,1]^{A \times \mathcal{T}}\), where \(g\) applies the per-token noise \(\mathbf{k}\) to the clean sequence \(\mathbf{x}^0\).
- At inference, history and goal tokens are assigned low noise while the remaining tokens receive high noise, enabling conditional generation (a training-step sketch follows).
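The objective translates almost line-for-line into a training step; below is a sketch assuming \(\epsilon\)-prediction and the cosine schedule from the earlier snippet, with `model` standing in for the Nexus denoiser.

```python
import torch

def noise_masked_training_step(model, x0: torch.Tensor, map_tokens: torch.Tensor) -> torch.Tensor:
    """One step of the objective above: sample k per token, soft-mask, predict eps."""
    A, T, _ = x0.shape
    k = torch.rand(A, T).clamp(min=1e-3)           # independent noise level per token
    eps = torch.randn_like(x0)
    a = (torch.cos(k * torch.pi / 2) ** 2)[..., None]
    x_k = a.sqrt() * x0 + (1 - a).sqrt() * eps     # g(x0, k): soft-masked input
    eps_hat = model(x_k, map_tokens, k)            # eps_theta(. ; c, k)
    return ((eps - eps_hat) ** 2).mean()           # || eps - eps_theta ||_2^2
```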
### Diffusion Transformer Architecture
- Built upon DiT with multiple attention interaction types:
    - Map cross-attention: agents query map tokens to model agent–map interactions.
    - Temporal attention: captures trajectory continuity.
    - Spatial attention: models spatial interactions (e.g., car-following, yielding).
- Validity masks exclude invalid or skipped tokens.
- AdaLN is used to condition the transformer blocks; a block sketch follows.
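A rough sketch of one such block, assuming pre-attention AdaLN modulation driven by a per-token noise embedding; the ordering of the three attention types, the residual placement, and the omission of validity masks and MLPs are all simplifications.

```python
import torch
import torch.nn as nn

class NexusStyleBlock(nn.Module):
    """Sketch: map cross-attn -> temporal attn -> spatial attn, AdaLN-modulated."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.map_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.ada = nn.Linear(dim, 2 * dim)          # AdaLN scale/shift from noise embedding

    def modulate(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.ada(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift

    def forward(self, x, map_tokens, noise_emb):
        # x: (A, T, dim) agent tokens; map_tokens: (L, dim); noise_emb: (A, T, dim)
        A, T, D = x.shape
        # Map cross-attention: every agent-time token queries the map tokens.
        q = self.modulate(x, noise_emb).reshape(A * T, 1, D)
        m = map_tokens.unsqueeze(0).expand(A * T, -1, -1)
        x = x + self.map_attn(q, m, m, need_weights=False)[0].reshape(A, T, D)
        # Temporal attention: attend over T within each agent (trajectory continuity).
        h = self.modulate(x, noise_emb)
        x = x + self.time_attn(h, h, h, need_weights=False)[0]
        # Spatial attention: attend over A within each frame (following, yielding).
        h = self.modulate(x, noise_emb).transpose(0, 1)
        x = x + self.space_attn(h, h, h, need_weights=False)[0].transpose(0, 1)
        return x
```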
### Noise-Aware Scheduling (Reactivity)
- A scheduling matrix \(\mathcal{K} = [\mathbf{k}_1, \dots, \mathbf{k}_M]\), a sequence of \(M\) per-token noise matrices, encodes the noise-level progression of each agent at each timestep.
- Chunk mechanism:
    - Each chunk contains historical frames, frames to be denoised, and optional goal tokens.
    - After each denoising step, the lowest-noise token is popped (generation complete) and a high-noise frame is pushed in.
    - Environmental changes can directly overwrite agent states with reduced noise levels.
- Scheduling strategies (see the pyramid sketch after this list):
    - Pyramid scheduling: tokens enter and exit from one end of the chunk, enabling sequential generation.
    - Trapezoid scheduling: tokens enter and exit from both ends, supporting bidirectional goal conditioning.
- Both strategies achieve a response latency of only 0.16 seconds (vs. 4.96 seconds for full-sequence methods).
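A toy construction of a pyramid scheduling matrix, as referenced in the list above: each frame enters the chunk fully noisy, sheds a fixed amount of noise per step, and is popped once clean. The chunk size and the linear decrement are illustrative.

```python
import numpy as np

def pyramid_schedule(T: int, chunk: int) -> np.ndarray:
    """Sketch of a pyramid scheduling matrix.

    K[m, t] is the noise level of frame t at scheduler step m. Each frame
    enters the chunk fully noisy (1.0), is denoised a little at every step,
    and is popped once it reaches 0; one new frame is pushed per step.
    """
    num_steps = T + chunk
    m = np.arange(num_steps)[:, None]     # scheduler step
    t = np.arange(T)[None, :]             # physical frame index
    return np.clip(1.0 - (m - t) / chunk, 0.0, 1.0)

K = pyramid_schedule(T=8, chunk=3)
print(np.round(K, 2))
# Each row shows staggered noise levels inside the active chunk (e.g. 0.33 /
# 0.67 / 1.0); reacting to the environment only requires overwriting the
# newest, still-noisy frames, which is what keeps the response latency low.
```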
### Behavior-Aligned Classifier Guidance
- A correction function is applied at each denoising step (see the sketch after this list):
    - Separates overlapping agents along the centerline in the reverse direction (collision avoidance).
    - Smooths trajectories.
    - Attracts agents toward the nearest lane (road-keeping).
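A heavily simplified version of such a correction on denoised xy positions is sketched below; note that the paper separates agents along the lane centerline, while this toy version uses a radial push, and every weight and threshold here is invented for illustration.

```python
import torch

def behavior_guidance(traj: torch.Tensor, lanes: torch.Tensor, min_gap: float = 3.0,
                      w_sep: float = 0.2, w_smooth: float = 0.1, w_lane: float = 0.05):
    """Toy correction on (A, T, 2) positions; lanes: (P, 2) lane-center samples."""
    traj = traj.clone()
    A, T, _ = traj.shape
    # 1) Collision avoidance: push apart agent pairs closer than min_gap per frame.
    #    (Radial push; the paper separates along the centerline instead.)
    for t in range(T):
        p = traj[:, t]
        diff = p[:, None] - p[None, :]                    # (A, A, 2) pairwise offsets
        dist = diff.norm(dim=-1).clamp(min=1e-6)
        too_close = (dist < min_gap) & ~torch.eye(A, dtype=torch.bool)
        traj[:, t] += w_sep * (diff / dist[..., None] * too_close[..., None]).sum(dim=1)
    # 2) Smoothing: pull interior points toward the midpoint of their neighbors.
    mid = (traj[:, :-2] + traj[:, 2:]) / 2
    traj[:, 1:-1] += w_smooth * (mid - traj[:, 1:-1])
    # 3) Road-keeping: attract every point toward its nearest lane-center sample.
    d = torch.cdist(traj.reshape(-1, 2), lanes)           # (A*T, P) distances
    traj += w_lane * (lanes[d.argmin(dim=1)].reshape(A, T, 2) - traj)
    return traj
```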
### Nexus-Data: Safety-Critical Scene Dataset
- Generated using the MetaDrive simulator with ScenarioNet-formatted scenes.
- High-risk interactions (cut-ins, hard braking, collisions) are produced via CAT adversarial learning.
- Automated filtering keeps only the 36.9% of scenarios that contain valid collisions, further excluding off-road cases and invalid trajectories (a filter sketch follows).
- The final dataset comprises 540 hours of high-quality safety-critical driving scenarios.
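For flavor, a self-contained sketch of what such an automated filter could look like over raw rollout arrays; the inputs and the 10 m-per-frame teleport threshold are assumptions, not the paper's actual criteria.

```python
import numpy as np

def keep_scenario(traj: np.ndarray, on_road: np.ndarray, collided: bool) -> bool:
    """Sketch of an automated filter for adversarial (CAT) rollouts.

    traj:     (A, T, 2) agent positions from a MetaDrive rollout
    on_road:  (A, T) bool flags for staying on the drivable area
    collided: whether the adversarial episode produced a valid collision
    """
    finite = bool(np.isfinite(traj).all())                   # no NaN / broken states
    step = np.linalg.norm(np.diff(traj, axis=1), axis=-1)    # per-frame displacement
    plausible = bool((step < 10.0).all())                    # no teleporting agents
    return collided and bool(on_road.all()) and finite and plausible
```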
## Key Experimental Results
### Main Generation Performance (nuPlan Dataset, 8-second Prediction)
| Method | ADE (m)↓ | Off-road Rate (%)↓ | Collision Rate (%)↓ | Instability↓ | Time (s)↓ |
|---|---|---|---|---|---|
| IDM | 10.52 | 9.85 | 10.17 | 6.30 | 2.16 |
| Diffusion Policy | 7.80 | 13.9 | 14.92 | 12.71 | 6.59 |
| SceneDiffuser | 5.99 | 8.53 | 11.78 | 9.64 | 5.34 |
| GUMP | 1.93 | 7.73 | 7.85 | 16.18 | 5.59 |
| Nexus | 1.28 | 6.89 | 1.62 | 4.63 | 2.79 |
| Nexus-Full | 1.12 | 6.25 | 1.56 | 3.17 | 2.93 |
### Scheduling Strategy Comparison
| Scheduling Strategy | ADE (m)↓ | Response Time (s)↓ | Total Time (s)↓ |
|---|---|---|---|
| Autoregressive | 1.48 | 4.96 | 79.36 |
| Full-sequence | 1.28 | 4.96 | 4.96 |
| Pyramid | 1.53 | 0.16 | 7.68 |
| Trapezoid | 1.39 | 0.16 | 6.20 |
| Trapezoid + Feedback | 1.17 | 0.16 | 6.20 |
### Ablation Study
| Component | ADE (m)↓ (Conditional Generation) |
|---|---|
| Baseline (Diffusion Policy) | 7.53 |
| + Noise-masked training | 3.42 |
| + Positional encoding | 1.44 |
| + Nexus-Data | 1.32 |
| + Classifier guidance | 1.25 |
### Closed-Loop Driving World Generator Evaluation
| Method | Reactive Score↑ | Collision Score↑ | Progress Score↑ |
|---|---|---|---|
| Oracle (GT) | 82.8 | 89.5 | 97.0 |
| Diffusion Policy | 61.6 | 81.9 | 90.2 |
| SceneDiffuser | 57.2 | 74.7 | 91.6 |
| Nexus | 73.0 | 84.9 | 95.0 |
### Data Augmentation Effect
Augmenting planning model training with synthetic data generated by Nexus improves the closed-loop score from 48.11 to 57.86 (+20%).
## Highlights & Insights
Highlights:

- First framework to unify goal-oriented generation and real-time reactivity in scene generation.
- Noise-masked training elegantly unifies diffusion and autoregressive prediction under a single formulation.
- Response latency of only 0.16 seconds, suitable for online closed-loop simulation.
- Collision rate of only 1.56%, substantially lower than all baselines.
- Nexus-Data provides large-scale safety-critical scenario data for model training.
Limitations & Future Work:

- Generates only structured traffic layouts; does not directly synthesize video.
- Closed-loop training of end-to-end driving models still requires video synthesis support.
- Insufficient synthetic data volume (3×) degrades performance; adequate scale (30×+) is necessary.
## Personal Reflections
- Treating noise states as soft masks is an elegant unifying framework that fundamentally addresses the question of whether diffusion and autoregressive models can be reconciled.
- The chunk sliding window design endows diffusion models with genuine online generation capability for the first time.
- The construction methodology of Nexus-Data (adversarial learning + automated filtering) offers a paradigm for large-scale acquisition of safety-critical scenarios.
- The data augmentation experiment (60× synthetic data → +20% closed-loop score) compellingly demonstrates the practical value of scene generation.
- Integration with NeRF illustrates a complete pipeline from layout generation to visual rendering, representing an important component of world models.
## Rating
- Novelty: TBD
- Experimental Thoroughness: TBD
- Writing Quality: TBD
- Value: TBD