OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving
Conference: ECCV 2024
arXiv: 2404.15014
Code: https://occgen-ad.github.io/
Area: Autonomous Driving
Keywords: 3D Occupancy Prediction, Diffusion Model, Multi-modal Fusion, Generative Perception, Coarse-to-Fine Refinement
TL;DR
OccGen reformulates 3D semantic occupancy prediction as a generative "noise-to-occupancy" task. A conditional encoder extracts multi-modal features, and a progressive refinement decoder applies diffusion denoising to generate the occupancy map step by step, from coarse to fine. On nuScenes-Occupancy, it improves mIoU by a relative 9.5%, 6.3%, and 13.3% in the multi-modal, LiDAR-only, and camera-only settings, respectively.
Background & Motivation
Background: 3D semantic occupancy prediction is a core perception task in autonomous driving, aiming to assign a semantic label to each voxel within the perception range, which retains vertical dimension details better than BEV representations.
Limitations of Prior Work: Existing methods (LiDAR-based, vision-based, multi-modal) formulate occupancy prediction as a one-shot voxel segmentation problem, completing prediction with a single forward pass. However, these discriminative methods suffer from two limitations: (1) they only learn the mapping from input to output, neglecting the distribution modeling of occupancy maps; (2) a single forward pass does not suffice to complete fine-grained structures.
Key Challenge: Discriminative methods lack step-by-step refinement and scene imagination, making them unable to refine their perception of the entire scene through persistent observation like humans do.
Goal: How to introduce a coarse-to-fine step-by-step refinement paradigm into occupancy prediction while supporting uncertainty estimation.
Key Insight: The denoising process of diffusion models naturally models the coarse-to-fine refinement of occupancy maps, progressively generating accurate occupancy predictions from random Gaussian noise.
Core Idea: OccGen is proposed, adopting a generative "noise-to-occupancy" paradigm where the conditional encoder runs only once, and the decoder progressively refines the prediction over multiple iterations, achieving inference latency comparable to one-shot methods.
Method
Overall Architecture
OccGen is a generative perception model based on an encoder-decoder structure. The conditional encoder extracts conditional features by processing multi-modal inputs (running only once), while the progressive refinement decoder uses these features as conditions to step-by-step refine and generate occupancy predictions from a 3D Gaussian noise map through diffusion denoising. During training, Gaussian noise is added to the ground-truth (GT) occupancy map to learn denoising; during inference, the model starts from pure Gaussian noise and undergoes progressive denoising to generate the final occupancy map.
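The "encode once, refine many times" control flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `encode_once` and `refine_step` are hypothetical stand-ins for the conditional encoder and for \(f_\theta\), the voxel grid is flattened to a short list, and the \(\oplus\) update is assumed to be plain addition.

```python
import random

def encode_once(lidar_feats, camera_feats):
    # Hypothetical stand-in for the conditional encoder:
    # average the two modalities per voxel.
    return [0.5 * (p + c) for p, c in zip(lidar_feats, camera_feats)]

def refine_step(cond, t, y_next):
    # Hypothetical stand-in for f_theta(x, t, Y_{t+1}).
    # This toy rule moves the map halfway toward the condition; t is unused.
    return [0.5 * (c - y) for c, y in zip(cond, y_next)]

def generate(lidar_feats, camera_feats, num_steps=3, seed=0):
    rng = random.Random(seed)
    cond = encode_once(lidar_feats, camera_feats)   # encoder runs exactly once
    y = [rng.gauss(0.0, 1.0) for _ in cond]         # Y_T: pure Gaussian noise
    for t in reversed(range(num_steps)):            # decoder runs num_steps times
        delta = refine_step(cond, t, y)
        y = [a + d for a, d in zip(y, delta)]       # Y_t = Y_{t+1} (+) delta (assumed additive)
    return cond, y

def max_error(cond, y):
    # Largest per-voxel distance between the refined map and the toy target.
    return max(abs(c - v) for c, v in zip(cond, y))

lidar = [1.0, 0.0, 2.0, 1.0]
camera = [0.0, 0.0, 2.0, 3.0]
cond, y1 = generate(lidar, camera, num_steps=1)
_, y3 = generate(lidar, camera, num_steps=3)
print(max_error(cond, y1), max_error(cond, y3))
```

In this toy setup the number of decoder steps is chosen at call time, which mirrors how a multi-step sampler can trade computation for accuracy without retraining. The real decoder is a learned network, so its per-step improvement is not guaranteed to follow this simple pattern.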
Key Designs
- Noise-to-Occupancy Generative Paradigm:
    - Function: Formulates occupancy prediction as a progressive generation process from noise to occupancy.
    - Mechanism: Refines the occupancy map through a \(T\)-step diffusion process \(Y_T \xrightarrow{f_\theta} Y_{T-1} \xrightarrow{f_\theta} \ldots \rightarrow Y_0\), where each step predicts and removes noise using the model \(f_\theta\). The forward diffusion process is defined as \(z_t = \sqrt{\alpha_t} z_0 + \sqrt{1-\alpha_t} Z\), where \(Z \sim \mathcal{N}(0, I)\), \(z_0\) is the clean occupancy map, and \(\alpha_t\) is set by the noise schedule. Each refinement step computes \(\Delta Y_t = f_\theta(x, t, Y_{t+1})\) and \(Y_t = Y_{t+1} \oplus \Delta Y_t\), where \(x\) is the multi-modal input.
    - Design Motivation: The diffusion denoising process naturally models the coarse-to-fine refinement of 3D occupancy maps, in contrast to one-shot generation. It also supports flexible trade-offs between accuracy and computation, and enables uncertainty estimation.
- Conditional Encoder:
    - Function: Fuses multi-modal features from LiDAR and camera inputs to generate condition information for the decoder.
    - Mechanism: Employs a dual-stream architecture. The LiDAR stream extracts voxel features \(F_p\) using VoxelNet + 3D sparse convolutions. The camera stream extracts multi-view image features using a 2D backbone + FPN, followed by a Hard 2D-to-3D view transformer to obtain camera voxel features \(F_c\). The fusion module adopts adaptive weighting: \(W = \mathcal{G}_C([\mathcal{G}_C(F_p), \mathcal{G}_C(F_c)])\) and \(F_m = \sigma(W) \odot F_p + (1-\sigma(W)) \odot F_c\).
    - Design Motivation: (1) The Hard 2D-to-3D view transformation replaces the traditional softmax depth estimation with Gumbel-Softmax, a differentiable one-hot encoding intended to give sharper, more precise depth predictions; (2) the Geometry Mask uses LiDAR voxel features to generate a mask that constrains the spatial distribution of camera features, compensating for their depth ambiguity.
- Progressive Refinement Decoder:
    - Function: Performs multiple iterations of denoising refinement on the noisy occupancy map, guided by the conditional features.
    - Mechanism: The decoder consists of multiple Refinement Layers and an occupancy head. Each layer contains three core operations: (1) 3D Deformable Cross-Attention (DCA) to learn features from conditional inputs: \(\text{DCA}_{3D}(Y_t^i, F_m) = \sum_{n \in F_m} \text{DA}_{3D}(q, \text{proj}(q,n), F_m)\); (2) 3D Deformable Self-Attention (DSA) to enhance self-completion capability: \(\text{DSA}_{3D}(Y_t^i, Y_t^i)\); (3) a Temporal Diffusion Module, which applies a scale-and-shift operation to the noisy map using the embedding of the step index \(t\): \(Y_t^{i} := \text{Diff}(Y_t^i, \text{ToEmbed}(t))\). For efficiency, the noisy map is first downsampled to multiple scales \(Y_t^i \in \mathbb{R}^{D/2^i \times H/2^i \times W/2^i \times C_i}\).
    - Design Motivation: Running the encoder only once while iterating the decoder for multiple steps avoids significant computational overhead, keeping the inference latency comparable to one-shot forward methods.
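Two ingredients of the conditional encoder described above, the adaptive fusion gate and the hard (one-hot) depth selection, can be illustrated with a small standard-library sketch. It makes several assumptions that are not from the paper: the convolutions \(\mathcal{G}_C\) are replaced by hypothetical scalar weights `w_p`, `w_c`, and `bias`, features are flattened to lists, and the Gumbel-Softmax is written without automatic differentiation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def adaptive_fusion(f_p, f_c, w_p=1.0, w_c=-1.0, bias=0.0):
    # F_m = sigma(W) * F_p + (1 - sigma(W)) * F_c, evaluated per voxel.
    # w_p, w_c, bias are hypothetical scalars standing in for the G_C convolutions.
    fused = []
    for p, c in zip(f_p, f_c):
        gate = sigmoid(w_p * p + w_c * c + bias)  # sigma(W), always in (0, 1)
        fused.append(gate * p + (1.0 - gate) * c)
    return fused

def gumbel_hard_one_hot(logits, rng, tau=1.0):
    # Perturb each depth-bin logit with Gumbel noise g = -log(-log(u)).
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed logits.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    # Hard one-hot choice of the winning depth bin. In an autodiff framework the
    # hard vector would typically be used in the forward pass while gradients
    # flow through `soft` (straight-through estimator).
    k = soft.index(max(soft))
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard

fused = adaptive_fusion([2.0, 0.0, 1.0], [0.0, 4.0, 1.0])
soft, hard = gumbel_hard_one_hot([0.1, 2.0, -1.0], random.Random(0))
print(fused, soft, hard)
```

Because the gate lies strictly between 0 and 1, each fused value is a convex combination of the LiDAR and camera features for that voxel. The hard vector assigns each pixel to exactly one depth bin, whereas a plain softmax spreads it across bins.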
Loss & Training
During training, a cosine noise schedule is used to add Gaussian noise to the GT occupancy map (empirically found to outperform the linear schedule). The total loss function is:
\[
\mathcal{L} = \mathcal{L}_{ce} + \mathcal{L}_{ls} + \mathcal{L}_{scal}^{geo} + \mathcal{L}_{scal}^{sem} + \mathcal{L}_d
\]
where \(\mathcal{L}_{ce}\) is the cross-entropy loss, \(\mathcal{L}_{ls}\) is the Lovász-softmax loss, \(\mathcal{L}_{scal}^{geo}\) and \(\mathcal{L}_{scal}^{sem}\) are the scene-wise and class-wise affinity losses, and \(\mathcal{L}_d\) is the depth loss. During inference, the DDIM sampling strategy and asymmetric time steps (\(td=1\)) are adopted.
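The training-time noising step can be made concrete with a short sketch. It assumes the widely used cosine schedule with offset `s = 0.008` from the improved-DDPM literature; whether OccGen uses exactly this form, and how the GT occupancy map is encoded as real values before noising, are assumptions here rather than details from the summary.

```python
import math
import random

def alpha_cosine(t, num_steps, s=0.008):
    # Cumulative signal level alpha_t under a cosine schedule (assumed form).
    # It equals 1 at t = 0 and decays smoothly to about 0 at t = num_steps.
    def f(step):
        return math.cos((step / num_steps + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0)

def add_noise(z0, t, num_steps, rng):
    # Forward diffusion: z_t = sqrt(alpha_t) * z_0 + sqrt(1 - alpha_t) * Z.
    a = alpha_cosine(t, num_steps)
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in z0]

rng = random.Random(0)
z0 = [1.0, -1.0, 1.0, 1.0]          # toy "occupancy" values, hypothetical encoding
for t in (0, 250, 500, 750, 1000):
    print(t, round(alpha_cosine(t, 1000), 4), add_noise(z0, t, 1000, rng))
```

At `t = 0` the map is returned unchanged, and at the final step almost no signal remains, so the model sees the full range from clean to nearly pure noise during training.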
Key Experimental Results
Main Results
Results on nuScenes-Occupancy validation set:
| Setting | Method | IoU | mIoU | Gain |
|---|---|---|---|---|
| Multi-modal | CONet | 29.5 | 20.1 | - |
| Multi-modal | OccGen | 30.3 | 22.0 | +9.5% |
| LiDAR-only | L-CONet | 30.9 | 15.8 | - |
| LiDAR-only | L-OccGen | 31.6 | 16.8 | +6.3% |
| Camera-only | C-CONet | 20.1 | 12.8 | - |
| Camera-only | C-OccGen | 23.4 | 14.5 | +13.3% |
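The "Gain" column is a relative improvement in mIoU over the corresponding CONet baseline. The short check below recomputes it from the table values, along with the absolute mIoU gaps between the multi-modal model and the single-modality variants. The helper name `relative_gain` is only illustrative.

```python
def relative_gain(baseline, ours):
    # Relative improvement in percent: 100 * (ours - baseline) / baseline.
    return 100.0 * (ours - baseline) / baseline

# (baseline mIoU, OccGen mIoU) taken from the table above.
miou = {
    "Multi-modal": (20.1, 22.0),
    "LiDAR-only": (15.8, 16.8),
    "Camera-only": (12.8, 14.5),
}

gains = {name: round(relative_gain(b, o), 1) for name, (b, o) in miou.items()}
print(gains)  # {'Multi-modal': 9.5, 'LiDAR-only': 6.3, 'Camera-only': 13.3}

# Absolute mIoU gaps between multi-modal OccGen and the single-modality variants.
multi = miou["Multi-modal"][1]
gap_lidar = round(multi - miou["LiDAR-only"][1], 1)
gap_camera = round(multi - miou["Camera-only"][1], 1)
print(gap_lidar, gap_camera)  # 5.2 7.5
```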
Results on SemanticKITTI validation set:
| Method | IoU | mIoU |
|---|---|---|
| OccFormer | 36.50 | 13.46 |
| Symphonize | 41.44 | 13.44 |
| OccGen | 36.87 | 13.74 |
Ablation Study
| Configuration | IoU | mIoU | Description |
|---|---|---|---|
| Baseline | 28.1 | 20.4 | No encoder improvement + No decoder improvement |
| + Proposed Encoder | 28.6 | 20.7 | Hard LSS + Geo Mask |
| + Proposed Decoder | 30.1 | 21.6 | Progressive Refinement Decoder |
| Complete OccGen | 30.3 | 22.0 | Encoder + Decoder |
| w/o DSA | 30.1 | 21.4 | Remove self-attention |
| w/o Diffusion | 29.3 | 21.7 | Remove diffusion denoising |
Key Findings
- The decoder (progressive refinement) contributes more than the encoder improvements (+1.2 vs. +0.3 mIoU over the baseline), indicating that the generative paradigm itself is the main driver of the performance gain.
- Cross-attention is reported to have a larger impact than self-attention, as learning representations from conditional inputs is more critical; the table above lists only the self-attention ablation, where removing DSA costs 0.6 mIoU.
- Removing the diffusion module lowers mIoU from 22.0% to 21.7% and IoU from 30.3% to 29.3%, indicating that the temporal diffusion process makes a modest but measurable contribution.
- Multi-step inference consistently boosts performance: 1 step (21.7%) → 3 steps (22.0%), allowing a trade-off adjustment between computation and accuracy without retraining.
- Multi-modal OccGen outperforms the camera-only variant by 7.5 mIoU points and the LiDAR-only variant by 5.2 mIoU points (22.0 vs. 14.5 and 16.8).
Highlights & Insights
- First to Introduce Diffusion Models to 3D Occupancy Prediction: Redefines the traditional discriminative task as a generative task, empowering the model with step-by-step refinement and uncertainty estimation capabilities.
- Efficient Inference Design: The encoder only runs once while the decoder iterates through multiple steps, achieving an inference latency comparable to one-shot methods (approx. 294ms vs. 286ms for CONet).
- Uncertainty Estimation: The stochastic sampling process naturally allows voxel-level prediction uncertainty to be computed, which single-pass discriminative methods do not provide directly.
- Hard 2D-to-3D View Transformation: Implements Gumbel-Softmax for differentiable one-hot depth encoding, which is more accurate than traditional softmax.
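One way to turn stochastic sampling into a voxel-level uncertainty map is to run the sampler several times from different noise seeds and measure how much the predicted class varies per voxel. The sketch below uses the entropy of the per-voxel class votes. This measure is a common choice assumed here for illustration, and the summary does not state which measure the paper uses. The input `samples` is hypothetical toy data.

```python
import math
from collections import Counter

def voxel_uncertainty(samples):
    # samples: list of N predictions, each a list of class ids (one per voxel).
    # Returns the majority label and the vote entropy (in nats) for every voxel.
    n = len(samples)
    labels, entropies = [], []
    for v in range(len(samples[0])):
        counts = Counter(sample[v] for sample in samples)
        labels.append(counts.most_common(1)[0][0])
        probs = [c / n for c in counts.values()]
        entropies.append(sum(-p * math.log(p) for p in probs))
    return labels, entropies

# Four hypothetical sampling runs over three voxels.
samples = [
    [0, 1, 2],
    [0, 1, 1],
    [0, 2, 1],
    [0, 1, 1],
]
labels, entropies = voxel_uncertainty(samples)
print(labels)     # [0, 1, 1]
print(entropies)
```

Voxels on which all runs agree get zero entropy, while voxels where the runs disagree get a higher value and can be flagged as uncertain.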
Limitations & Future Work
- The accuracy boost from multi-step inference is limited (3 steps improve mIoU by only 0.3 over a single step), and performance saturates with more steps.
- The current generation process operates directly in voxel space; exploring latent space diffusion could be more efficient.
- The comparison with the latest vision-only methods (e.g., FlashOcc, SparseOcc) is insufficient.
- The multi-modal fusion strategy is relatively simple (adaptive weighting); more complex interaction mechanisms could be explored.
Related Work & Insights
- vs. CONet: OccGen replaces the discriminative approach with a generative one, comprehensively outperforming it across all three modalities and demonstrating the advantage of the diffusion paradigm for dense prediction.
- vs. DiffusionDet / DDP: Adopts the diffusion "noise-to-X" paradigm but extends it to the sparse voxel space of 3D occupancy, addressing the high-resolution computation challenge in 3D scenes.
- vs. MonoScene / VoxFormer: Unlike one-shot forward methods, OccGen can flexibly trade off computation and accuracy while providing uncertainty information.
Rating
- Novelty: ⭐⭐⭐⭐ The first study to introduce diffusion models to 3D occupancy prediction. While the generative paradigm is novel, applying diffusion to perception tasks is not entirely unprecedented.
- Experimental Thoroughness: ⭐⭐⭐⭐ Evaluations on two benchmarks with detailed ablation studies; however, it lacks comparisons with a broader range of the latest methods.
- Writing Quality: ⭐⭐⭐⭐ Extremely clear, with detailed method descriptions and complete mathematical derivations.
- Value: ⭐⭐⭐⭐ Provides a fresh generative perspective for occupancy prediction; both uncertainty estimation and flexible inference hold practical value.