OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving

Conference: ECCV 2024
arXiv: 2404.15014
Code: https://occgen-ad.github.io/
Area: Autonomous Driving
Keywords: 3D Occupancy Prediction, Diffusion Model, Multi-modal Fusion, Generative Perception, Coarse-to-Fine Refinement

TL;DR

OccGen reformulates 3D semantic occupancy prediction as a generative "noise-to-occupancy" task. A conditional encoder extracts multi-modal features, and a progressive refinement decoder performs diffusion denoising to generate the occupancy map step by step, from coarse to fine. On nuScenes-Occupancy, it achieves relative mIoU improvements of 9.5%, 6.3%, and 13.3% over the corresponding CONet baselines in the multi-modal, LiDAR-only, and camera-only settings, respectively.

Background & Motivation

Background: 3D semantic occupancy prediction is a core perception task in autonomous driving. It assigns a semantic label to each voxel within the perception range and, unlike BEV representations, preserves detail along the vertical dimension.

Limitations of Prior Work: Existing methods (LiDAR-based, vision-based, multi-modal) formulate occupancy prediction as a one-shot voxel segmentation problem, completing prediction with a single forward pass. However, these discriminative methods suffer from two limitations: (1) they only learn the mapping from input to output, neglecting the distribution modeling of occupancy maps; (2) a single forward pass does not suffice to complete fine-grained structures.

Key Challenge: Discriminative methods lack step-by-step refinement and scene imagination, making them unable to refine their perception of the entire scene through persistent observation like humans do.

Goal: Introduce a coarse-to-fine, step-by-step refinement paradigm into occupancy prediction while supporting uncertainty estimation.

Key Insight: The denoising process of diffusion models naturally models the coarse-to-fine refinement of occupancy maps, progressively generating accurate occupancy predictions from random Gaussian noise.

Core Idea: OccGen adopts a generative "noise-to-occupancy" paradigm in which the conditional encoder runs only once and the decoder progressively refines the prediction over multiple iterations, keeping inference latency comparable to one-shot methods.

Method

Overall Architecture

OccGen is a generative perception model based on an encoder-decoder structure. The conditional encoder extracts conditional features by processing multi-modal inputs (running only once), while the progressive refinement decoder uses these features as conditions to step-by-step refine and generate occupancy predictions from a 3D Gaussian noise map through diffusion denoising. During training, Gaussian noise is added to the ground-truth (GT) occupancy map to learn denoising; during inference, the model starts from pure Gaussian noise and undergoes progressive denoising to generate the final occupancy map.

Key Designs

Noise-to-Occupancy Generative Paradigm:
- Function: Formulates occupancy prediction as a progressive generation process from noise to occupancy.
- Mechanism: Refines the prediction through a \(T\)-step diffusion process \(Y_T \xrightarrow{f_\theta} Y_{T-1} \xrightarrow{f_\theta} \ldots \xrightarrow{f_\theta} Y_0\), where each step uses the model \(f_\theta\) to predict and remove noise. The forward diffusion process is defined as \(z_t = \sqrt{\alpha_t} z_0 + \sqrt{1-\alpha_t} Z\), where \(Z \sim \mathcal{N}(0, I)\), \(z_0\) is the clean (GT) occupancy map, and \(\alpha_t\) is the noise-schedule coefficient at step \(t\). Each refinement step computes \(\Delta Y_t = f_\theta(x, t, Y_{t+1})\) and \(Y_t = Y_{t+1} \oplus \Delta Y_t\), where \(x\) denotes the multi-modal input.
- Design Motivation: The diffusion denoising process naturally models the coarse-to-fine refinement of 3D occupancy maps instead of one-shot generation. It also naturally supports flexible tradeoffs between accuracy and computation, and enables uncertainty estimation.
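
The forward noising formula and the refinement recursion above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `toy_f_theta` is a hypothetical stand-in for the learned network \(f_\theta\), the combine operator \(\oplus\) is taken to be plain addition, and the tensor shapes are toy values rather than anything from the paper.

```python
import numpy as np

def q_sample(z0, alpha_t, rng):
    """Forward diffusion: z_t = sqrt(alpha_t) * z0 + sqrt(1 - alpha_t) * Z."""
    noise = rng.standard_normal(z0.shape)  # Z ~ N(0, I)
    return np.sqrt(alpha_t) * z0 + np.sqrt(1.0 - alpha_t) * noise

def refine(y_T, f_theta, x, num_steps):
    """Iterate Y_t = Y_{t+1} (+) f_theta(x, t, Y_{t+1}) from t = T-1 down to 0."""
    y = y_T
    for t in reversed(range(num_steps)):
        delta = f_theta(x, t, y)
        y = y + delta  # assumption: the combine operator is element-wise addition
    return y

def toy_f_theta(x, t, y):
    """Hypothetical stand-in for the network: step halfway toward the condition x."""
    return 0.5 * (x - y)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))   # toy clean map with shape (D, H, W)

z_clean = q_sample(z0, 1.0, rng)      # alpha_t = 1 keeps the signal untouched
y_T = rng.standard_normal(z0.shape)   # inference starts from pure Gaussian noise
err_start = np.abs(y_T - z0).mean()
y_0 = refine(y_T, toy_f_theta, x=z0, num_steps=3)
err_end = np.abs(y_0 - z0).mean()     # each toy step halves the remaining error
```

With this toy update the error shrinks by a factor of two per step, which mirrors the coarse-to-fine behaviour in spirit only; the real decoder learns its update from data.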
Conditional Encoder:
- Function: Fuses multi-modal features from LiDAR and camera inputs to generate condition information for the decoder.
- Mechanism: Employs a dual-stream architecture. The LiDAR stream extracts voxel features \(F_p\) using VoxelNet + 3D sparse convolutions. The camera stream extracts multi-view image features using a 2D backbone + FPN, followed by a Hard 2D-to-3D view transformer to obtain camera voxel features \(F_c\). The fusion module adopts adaptive weighting: \(W = \mathcal{G}_C([\mathcal{G}_C(F_p), \mathcal{G}_C(F_c)])\) and \(F_m = \sigma(W) \odot F_p + (1-\sigma(W)) \odot F_c\).
- Design Motivation: (1) The hard 2D-to-3D view transformation replaces the conventional softmax over depth bins with Gumbel-Softmax, yielding a differentiable one-hot depth encoding and therefore sharper depth predictions; (2) the geometry mask uses LiDAR voxel features to generate a mask that constrains the spatial distribution of camera features, compensating for their depth ambiguity.
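
A minimal NumPy sketch of the two encoder ideas above, the gated fusion \(F_m = \sigma(W) \odot F_p + (1-\sigma(W)) \odot F_c\) and the hard Gumbel-Softmax depth selection. The assumptions are loud ones: the convolutions \(\mathcal{G}_C\) are replaced by per-voxel linear maps (equivalent to 1x1x1 convolutions), all weights and shapes are random toy values, and only the forward pass is shown because NumPy has no autograd (in training, gradients would flow through the soft sample via a straight-through estimator).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(f_p, f_c, w_p, w_c, w_out):
    """Gated fusion of LiDAR (f_p) and camera (f_c) voxel features, shape (D, H, W, C).

    The weight matrices are hypothetical stand-ins for the convolutions G_C.
    """
    g_p = f_p @ w_p                                   # G_C(F_p)
    g_c = f_c @ w_c                                   # G_C(F_c)
    w = np.concatenate([g_p, g_c], axis=-1) @ w_out   # G_C([., .])
    gate = sigmoid(w)
    return gate * f_p + (1.0 - gate) * f_c

def hard_gumbel_softmax(logits, tau, rng):
    """Forward pass of a hard Gumbel-Softmax sample over the last axis."""
    u = np.clip(rng.random(logits.shape), 1e-9, 1.0 - 1e-9)
    gumbel = -np.log(-np.log(u))                      # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)             # numerical stability
    soft = np.exp(y) / np.exp(y).sum(axis=-1, keepdims=True)
    hard = np.eye(logits.shape[-1])[soft.argmax(axis=-1)]  # one-hot depth bin
    return soft, hard

rng = np.random.default_rng(0)
C = 4
f_p = rng.standard_normal((2, 3, 3, C))
f_c = rng.standard_normal((2, 3, 3, C))
w_p = rng.standard_normal((C, C))
w_c = rng.standard_normal((C, C))
w_out = rng.standard_normal((2 * C, C))
f_m = adaptive_fusion(f_p, f_c, w_p, w_c, w_out)

depth_logits = rng.standard_normal((5, 8))  # 5 pixels, 8 depth bins (toy sizes)
soft, hard = hard_gumbel_softmax(depth_logits, tau=1.0, rng=rng)
```

When the gate logits are zero, the fusion reduces to a plain average of the two streams, which is a handy sanity check for any real implementation.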
Progressive Refinement Decoder:
- Function: Performs multiple iterations of denoising refinement on the noisy occupancy map guided by conditional features.
- Mechanism: The decoder consists of multiple Refinement Layers and an occupancy head. Each layer contains three core operations: (1) 3D Deformable Cross-Attention (DCA), which learns features from the conditional inputs: \(\text{DCA}_{3D}(Y_t^i, F_m) = \sum_{n \in F_m} \text{DA}_{3D}(q, \text{proj}(q,n), F_m)\); (2) 3D Deformable Self-Attention (DSA), which enhances self-completion capability: \(\text{DSA}_{3D}(Y_t^i, Y_t^i)\); (3) a Temporal Diffusion Module, which applies a scale-and-shift operation to the noisy map using the embedding of the step index \(t\): \(Y_t^{i} := \text{Diff}(Y_t^i, \text{ToEmbed}(t))\). For efficiency, the noisy map is first downsampled to multiple scales, \(Y_t^i \in \mathbb{R}^{D/2^i \times H/2^i \times W/2^i \times C_i}\).
- Design Motivation: Running the encoder only once while iterating through the decoder for multiple steps prevents significant computational overhead, keeping the inference latency comparable to one-shot forward methods.
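
The scale-and-shift conditioning on the step index and the multi-scale shapes can be illustrated as follows. This sketch assumes a Transformer-style sinusoidal embedding for `ToEmbed`, a `y * (1 + scale) + shift` form for `Diff`, linear projections with random toy weights, and a scale index \(i\) starting at 0; none of these details are confirmed by the summary above.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding of the step index t (assumed form of ToEmbed)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def scale_and_shift(y, t_emb, w_scale, w_shift):
    """Modulate a noisy map y of shape (D, H, W, C) with per-channel scale and shift."""
    scale = t_emb @ w_scale   # shape (C,)
    shift = t_emb @ w_shift   # shape (C,)
    return y * (1.0 + scale) + shift

def multiscale_shapes(d, h, w, channels):
    """Shapes (D/2^i, H/2^i, W/2^i, C_i) for i = 0, 1, ..."""
    return [(d // 2**i, h // 2**i, w // 2**i, c) for i, c in enumerate(channels)]

rng = np.random.default_rng(0)
C, E = 8, 16                                   # toy channel and embedding sizes
y = rng.standard_normal((4, 6, 6, C))
t_emb = timestep_embedding(t=2, dim=E)
w_scale = 0.1 * rng.standard_normal((E, C))    # hypothetical learned projections
w_shift = 0.1 * rng.standard_normal((E, C))
y_mod = scale_and_shift(y, t_emb, w_scale, w_shift)

shapes = multiscale_shapes(16, 64, 64, channels=[32, 64, 128])
```

Because the modulation is per channel, it adds almost no cost per refinement step, which is consistent with the efficiency argument in the design motivation.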

Loss & Training

During training, a cosine noise schedule is used to add Gaussian noise to the GT occupancy map (empirically found to outperform the linear schedule). The total loss function is:

\[\mathcal{L}_{total} = \mathcal{L}_{ce} + \mathcal{L}_{ls} + \mathcal{L}_{scal}^{geo} + \mathcal{L}_{scal}^{sem} + \mathcal{L}_d\]

where \(\mathcal{L}_{ce}\) is the cross-entropy loss, \(\mathcal{L}_{ls}\) is the Lovász-softmax loss, \(\mathcal{L}_{scal}^{geo}\) and \(\mathcal{L}_{scal}^{sem}\) are the geometric and semantic scene-class affinity losses, and \(\mathcal{L}_d\) is the depth loss. During inference, DDIM sampling with asymmetric time steps (\(td=1\)) is adopted.
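
To make the schedule and sampler concrete, here is a sketch of a cosine noise schedule next to a linear one, plus a deterministic DDIM update. The exact schedule form used by OccGen is not given above, so the widely used cosine form with offset `s = 0.008` and the linear range `1e-4` to `0.02` are assumptions, as is the choice to have the network predict the clean map `x0_pred`. The asymmetric time-step offset \(td\) is not modeled here.

```python
import numpy as np

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level at step t under a cosine schedule (assumed form)."""
    f = lambda u: np.cos((u / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f(t) / f(0)

def linear_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level for a linear beta schedule, for comparison."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def ddim_step(x_t, x0_pred, a_t, a_next):
    """Deterministic DDIM update (eta = 0) from signal level a_t to a_next."""
    eps = (x_t - np.sqrt(a_t) * x0_pred) / np.sqrt(1.0 - a_t)
    return np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps

T = 1000
steps = np.arange(T + 1)
cos_bar = cosine_alpha_bar(steps, T)   # starts at 1, decays to about 0
lin_bar = linear_alpha_bar(T)

# Sanity check of the DDIM update with a perfect x0 prediction.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 4, 4))
noise = rng.standard_normal(x0.shape)
a_t, a_next = cosine_alpha_bar(800, T), cosine_alpha_bar(400, T)
x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * noise
x_next = ddim_step(x_t, x0, a_t, a_next)
expected = np.sqrt(a_next) * x0 + np.sqrt(1.0 - a_next) * noise
```

Under these assumed settings the cosine schedule retains far more signal at mid-range steps than the linear one, which is one plausible reason for the empirical preference noted above.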

Key Experimental Results

Main Results

Results on nuScenes-Occupancy validation set:

Setting	Method	IoU (%)	mIoU (%)	Relative mIoU gain
Multi-modal	CONet	29.5	20.1	-
Multi-modal	OccGen	30.3	22.0	+9.5%
LiDAR-only	L-CONet	30.9	15.8	-
LiDAR-only	L-OccGen	31.6	16.8	+6.3%
Camera-only	C-CONet	20.1	12.8	-
Camera-only	C-OccGen	23.4	14.5	+13.3%
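
The gain column is a relative improvement in mIoU over the matching CONet baseline, not an absolute difference. A short check, using only the numbers from the table above, reproduces it:

```python
# (baseline mIoU, OccGen mIoU) per setting, copied from the table above
results = {
    "Multi-modal": (20.1, 22.0),
    "LiDAR-only": (15.8, 16.8),
    "Camera-only": (12.8, 14.5),
}

def relative_gain(baseline, ours):
    """Relative improvement in percent: 100 * (ours - baseline) / baseline."""
    return 100.0 * (ours - baseline) / baseline

gains = {k: round(relative_gain(*v), 1) for k, v in results.items()}
print(gains)  # {'Multi-modal': 9.5, 'LiDAR-only': 6.3, 'Camera-only': 13.3}
```

In absolute terms the improvements are 1.9, 1.0, and 1.7 mIoU points, so the camera-only setting shows the largest relative gain partly because its baseline is the lowest.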

Results on SemanticKITTI validation set:

Method	IoU (%)	mIoU (%)
OccFormer	36.50	13.46
Symphonize	41.44	13.44
OccGen	36.87	13.74

Ablation Study

Configuration	IoU (%)	mIoU (%)	Description
Baseline	28.1	20.4	No encoder improvement + No decoder improvement
+ Proposed Encoder	28.6	20.7	Hard LSS + Geo Mask
+ Proposed Decoder	30.1	21.6	Progressive Refinement Decoder
Complete OccGen	30.3	22.0	Encoder + Decoder
w/o DSA	30.1	21.4	Remove self-attention
w/o Diffusion	29.3	21.7	Remove diffusion denoising

Key Findings

The decoder (progressive refinement) contributes more than the encoder improvements (+1.2 vs. +0.3 mIoU over the baseline), suggesting that the generative paradigm itself is the main driver of the performance gain.
Cross-attention is reported to have a larger impact than self-attention (removing DSA alone costs 0.6 mIoU, 22.0 → 21.4), as learning representations from the conditional inputs is more critical.
Removing the diffusion module lowers mIoU from 22.0% to 21.7% and IoU from 30.3% to 29.3%, indicating that the temporal diffusion process contributes to the final accuracy.
Multi-step inference improves performance, from 21.7% mIoU at 1 step to 22.0% at 3 steps, so computation can be traded for accuracy without retraining.
Multi-modal OccGen outperforms camera-only OccGen by 7.5 mIoU points (22.0 vs. 14.5) and LiDAR-only OccGen by 5.2 mIoU points (22.0 vs. 16.8).

Highlights & Insights

First to Introduce Diffusion Models to 3D Occupancy Prediction: Redefines the traditional discriminative task as a generative task, empowering the model with step-by-step refinement and uncertainty estimation capabilities.
Efficient Inference Design: The encoder runs only once while the decoder iterates through multiple steps, achieving an inference latency comparable to one-shot methods (approx. 294 ms vs. 286 ms for CONet).
Uncertainty Estimation: The stochastic sampling process naturally allows voxel-level prediction uncertainty to be computed, which one-shot discriminative methods do not provide directly.
Hard 2D-to-3D View Transformation: Uses Gumbel-Softmax for a differentiable one-hot depth encoding, which the paper reports to be more accurate than conventional softmax depth estimation.
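
One way the voxel-level uncertainty mentioned above could be computed is to run the stochastic sampler several times and measure how much the predicted labels disagree per voxel. The sketch below is an assumption about the procedure, not the paper's stated method: it uses the entropy of per-voxel class frequencies across samples, with hand-made toy labels standing in for real model outputs.

```python
import numpy as np

def voxel_uncertainty(samples, num_classes):
    """Majority-vote label and entropy per voxel.

    samples: integer labels of shape (S, D, H, W) from S stochastic runs.
    """
    one_hot = np.eye(num_classes)[samples]   # (S, D, H, W, K)
    freq = one_hot.mean(axis=0)              # class frequencies per voxel
    entropy = -(freq * np.log(np.clip(freq, 1e-12, 1.0))).sum(axis=-1)
    return freq.argmax(axis=-1), entropy

# Toy example: 4 runs over a 1 x 1 x 2 grid with 3 classes.
# Voxel 0 is always class 1; voxel 1 is split between classes 0 and 2.
samples = np.array([
    [[[1, 0]]],
    [[[1, 0]]],
    [[[1, 2]]],
    [[[1, 2]]],
])
labels, entropy = voxel_uncertainty(samples, num_classes=3)
```

Voxels where all runs agree get zero entropy, while the evenly split voxel gets entropy log 2, so the map highlights regions where the sampler is undecided.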

Limitations & Future Work

The accuracy gain from multi-step inference is limited (3 steps improve mIoU by only 0.3 points over 1 step), and performance saturates with more steps.
The current generation process operates directly in voxel space; exploring latent space diffusion could be more efficient.
The comparison with the latest vision-only methods (e.g., FlashOcc, SparseOcc) is insufficient.
The multi-modal fusion strategy is relatively simple (adaptive weighting); more complex interaction mechanisms could be explored.

vs. CONet: OccGen replaces the discriminative approach with a generative one, comprehensively outperforming it across all three modalities and demonstrating the advantage of the diffusion paradigm for dense prediction.
vs. DiffusionDet / DDP: Adopts the diffusion "noise-to-X" paradigm but extends it to the sparse voxel space of 3D occupancy, addressing the high-resolution computation challenge in 3D scenes.
vs. MonoScene / VoxFormer: Unlike one-shot forward methods, OccGen can flexibly trade off computation and accuracy while providing uncertainty information.

Rating

Novelty: ⭐⭐⭐⭐ The first study to introduce diffusion models to 3D occupancy prediction. While the generative paradigm is novel, applying diffusion to perception tasks is not entirely unprecedented.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluations on two benchmarks with detailed ablation studies; however, it lacks comparisons with a broader range of the latest methods.
Writing Quality: ⭐⭐⭐⭐ Extremely clear, with detailed method descriptions and complete mathematical derivations.
Value: ⭐⭐⭐⭐ Provides a fresh generative perspective for occupancy prediction; both uncertainty estimation and flexible inference hold practical value.