SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation and Understanding¶
Conference: NeurIPS 2025 arXiv: 2505.22643 Code: GitHub Area: Autonomous Driving / LiDAR Generation Keywords: LiDAR Generation, Diffusion Models, Semantic Segmentation, Range-View, Closed-Loop Inference
TL;DR¶
SPIRAL proposes a semantic-aware range-view LiDAR diffusion model that jointly generates depth maps, reflectance images, and semantic segmentation maps. By introducing progressive semantic prediction and a closed-loop inference mechanism to enhance cross-modal consistency, the model achieves state-of-the-art performance with a minimal parameter count of 61M.
Background & Motivation¶
Large-scale acquisition and annotation of LiDAR data is prohibitively expensive. Leveraging diffusion models to synthesize LiDAR scenes has emerged as a promising direction to alleviate this data bottleneck. Existing generation approaches fall into two categories: voxel-based methods (e.g., XCube, DynamicCity), which can simultaneously generate geometric structures and semantic labels but incur high memory and computational costs, and range-view methods (e.g., LiDARGen, R2DM), which are computationally efficient but produce only unannotated depth and reflectance images.
Limitations of Prior Work: When semantic labels are required, existing range-view methods must resort to a two-stage pipeline: first generating an unannotated scene, then predicting semantic maps with a pretrained segmentation model (e.g., RangeNet++). This approach has two critical drawbacks: 1. The generative and segmentation models are trained independently without shared representations, resulting in low training efficiency. 2. Semantic maps are predicted post hoc and cannot guide the generation of depth and reflectance during the diffusion process, leading to poor cross-modal consistency.
Key Insight: The powerful feature learning capacity of diffusion models can be exploited to simultaneously predict semantic labels during denoising, and a closed-loop mechanism can allow semantic predictions to inversely guide geometric generation.
Method¶
Overall Architecture¶
SPIRAL employs a 4-level Efficient U-Net as the backbone, built upon a continuous-time DDPM framework. The input consists of noisy depth and reflectance images \(x_t\) along with a semantic map \(y\) (encoded as an RGB image). Two independent branches output the diffusion residual \(\hat{\epsilon}_t\) and semantic labels \(\hat{y}_t\), respectively. The model alternates between unconditional steps and conditional steps, controlled by two mutually exclusive switches \(\mathcal{A}\) and \(\mathcal{B}\).
Key Designs¶
- Complete Semantic Awareness:
- Unconditional step: the model jointly predicts the semantic map \(\hat{y}_t\) and noise \(\hat{\epsilon}_t\); the loss is MSE + cross-entropy.
- Conditional step: conditioned on a given semantic map \(y\), only the denoising residual \(\hat{\epsilon}_t\) is predicted; the loss is MSE.
- During training, the two step types are switched with 50% probability via \(\psi \sim \mathcal{U}(0,1)\), giving a unified loss: \(\mathcal{L} = \mathcal{L}_c \cdot \mathbb{I}(\psi \leq 0.5) + \mathcal{L}_u \cdot \mathbb{I}(\psi > 0.5)\)
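The 50% switching rule can be sketched as a minimal loss selector; this is pure Python, and the scalar `mse` and `ce` arguments are assumed to stand in for the precomputed denoising and segmentation loss terms:

```python
def spiral_loss(mse, ce, psi):
    """Unified training objective with the random switch psi ~ U(0, 1):
    psi <= 0.5 -> conditional step, loss L_c = MSE on the residual only;
    psi  > 0.5 -> unconditional step, loss L_u = MSE + cross-entropy on ŷ_t."""
    loss_c = mse       # L_c: denoising residual only
    loss_u = mse + ce  # L_u: residual + semantic prediction
    return loss_c if psi <= 0.5 else loss_u
```

In an actual training loop, `psi` would be redrawn per batch so both branches of the network receive gradient signal.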
- Progressive Semantic Predictions:
- At inference, each unconditional denoising step produces an intermediate semantic map \(\hat{y}_t\).
- Exponential moving average (EMA) smoothing is applied: \(\bar{y}_t = \alpha \cdot \hat{y}_t + (1-\alpha) \cdot \bar{y}_{t+1}\)
- This suppresses stochastic fluctuations during diffusion and yields stable per-pixel confidence scores.
- The final \(\bar{y}_0\) serves as the semantic output.
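The EMA recursion above admits a direct sketch; the smoothing factor `alpha=0.9` is an assumed value, not one reported in the paper:

```python
import numpy as np

def progressive_semantic_ema(y_hats, alpha=0.9):
    """EMA over intermediate semantic predictions along the reverse
    trajectory (ordered t = T ... 0): ȳ_t = α·ŷ_t + (1-α)·ȳ_{t+1}.
    y_hats: list of per-step prediction arrays ŷ_T, ..., ŷ_0."""
    y_bar = np.asarray(y_hats[0], dtype=float)  # initialize ȳ_T = ŷ_T
    for y_hat in y_hats[1:]:
        y_bar = alpha * np.asarray(y_hat, dtype=float) + (1 - alpha) * y_bar
    return y_bar  # ȳ_0, the final smoothed semantic output
```

Because later (less noisy) steps get weight α while earlier steps decay geometrically, the running estimate stabilizes as denoising progresses.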
- Closed-Loop Inference:
- Inference begins in open-loop mode, executing unconditional steps.
- Once the fraction of pixels in \(\bar{y}_t\) whose confidence exceeds the threshold \(\delta\) itself surpasses \(\delta\) (default 0.8), the model switches to closed-loop mode.
- In closed-loop mode, unconditional and conditional steps alternate: unconditional steps predict semantics and noise, while conditional steps use the current semantic map to guide depth/reflectance generation.
- This achieves joint optimization of semantics and geometry, enhancing cross-modal consistency.
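The mode-switching rule can be sketched as follows; this is a simplified illustration in which the switch is assumed to be permanent once triggered, and `conf_maps` stands for the per-step EMA-smoothed confidence maps:

```python
import numpy as np

def inference_modes(conf_maps, delta=0.8):
    """Return the open/closed-loop mode per reverse-diffusion step.
    Inference starts open-loop; it switches to closed-loop once the
    fraction of pixels whose confidence exceeds delta itself exceeds
    delta (default 0.8)."""
    modes, closed = [], False
    for conf in conf_maps:
        if not closed and (np.asarray(conf) > delta).mean() > delta:
            closed = True  # assumed permanent once triggered
        modes.append("closed" if closed else "open")
    return modes
```

In closed-loop mode the sampler would then alternate unconditional steps (predict \(\hat{y}_t\), \(\hat{\epsilon}_t\)) with conditional steps that feed \(\bar{y}_t\) back in as the semantic condition.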
Semantic-Aware Evaluation Metrics¶
The paper also introduces a new semantic-aware evaluation framework:
- Learned features: features extracted by a RangeNet++ encoder and a LiDM semantic encoder are concatenated to compute S-FRD, S-FPD, and S-MMD.
- Rule-based features: per-class BEV 2D histograms are computed and aggregated into \(h^s \in \mathbb{R}^{C \times B \times B}\) to compute S-JSD and S-MMD.
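The rule-based branch can be sketched as below. The bin count (`bins=32`) and BEV extent (`extent=51.2` m) are assumed defaults, not values from the paper, and `points` is taken to hold x/y coordinates in its first two columns:

```python
import numpy as np

def semantic_bev_hist(points, labels, num_classes, bins=32, extent=51.2):
    """Per-class BEV 2D histogram h^s of shape (C, B, B), binning each
    class's points over x, y in [-extent, extent]."""
    edges = np.linspace(-extent, extent, bins + 1)
    h = np.zeros((num_classes, bins, bins))
    for c in range(num_classes):
        pts = points[labels == c]
        if len(pts):
            h[c], _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=(edges, edges))
    return h

def s_jsd(h_real, h_gen, eps=1e-12):
    """Jensen-Shannon divergence (natural log, bounded by ln 2) between
    the two normalized semantic histogram stacks."""
    p = h_real.ravel() / max(h_real.sum(), eps)
    q = h_gen.ravel() / max(h_gen.sum(), eps)
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a[a > 0] * np.log(a[a > 0] / b[a > 0])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Normalizing the whole stack (rather than per class) lets the divergence also penalize mismatched class frequencies, not just mismatched spatial layouts.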
Key Experimental Results¶
Main Results (SemanticKITTI)¶
| Method | Params | S-FRD↓ | S-FPD↓ | S-JSD↓ |
|---|---|---|---|---|
| LiDARGen + RangeNet++ | 80M | 1216.61 | 710.79 | 28.65 |
| LiDM + RangeNet++ | 325M | — | 458.33 | 16.69 |
| R2DM + RangeNet++ | 81M | 559.26 | 363.16 | 18.13 |
| R2DM + SPVCNN++ | 128M | 555.09 | 351.73 | 18.67 |
| SPIRAL (Ours) | 61M | 382.87 | 153.61 | 9.16 |
Ablation Study¶
| Configuration | S-FRD↓ | S-FPD↓ | Notes |
|---|---|---|---|
| w/o closed-loop inference | Higher | Higher | Closed-loop mechanism significantly improves cross-modal consistency |
| w/o EMA smoothing | Higher | Higher | EMA suppresses stochasticity during denoising |
| threshold δ=0.8 | Optimal | Optimal | Too low introduces noise contamination; too high delays closed-loop activation |
Key Findings¶
- SPIRAL surpasses all two-stage methods (80–372M parameters) with only 61M parameters, achieving 31% improvement in S-FRD, 56% in S-FPD, and 50% in S-JSD.
- Larger segmentation models (SPVCNN++) perform worse than RangeNet++ on generated data, suggesting larger models are more sensitive to distributional noise.
- SPIRAL-generated data effectively augments downstream segmentation training, reducing annotation costs.
- The model generalizes well, achieving state-of-the-art results on the nuScenes dataset as well.
Highlights & Insights¶
- The closed-loop inference mechanism is particularly elegant: feeding intermediate diffusion predictions back as conditional inputs enables mutual reinforcement between semantics and geometry—a design principle transferable to other multi-modal generation tasks.
- EMA-based progressive semantic prediction naturally exploits the iterative nature of the denoising process for gradual confidence accumulation.
- Unified training rather than two-stage pipelines eliminates the representational gap between generative and segmentation models, substantially reducing the total parameter count.
- The proposed semantic-aware evaluation metrics fill a gap in quality assessment for labeled LiDAR scene generation.
Limitations & Future Work¶
- Range-view representations may lose fine-grained details of distant objects at high resolutions.
- Closed-loop inference increases the number of inference steps and overall runtime due to alternating between two step types.
- Semantic prediction relies on the diffusion model's feature learning capacity and may be insufficiently precise for rare object categories.
- Text-conditioned generation and 4D dynamic scene generation remain unexplored.
Related Work & Insights¶
- vs. R2DM: R2DM generates only depth and reflectance and requires an external segmentation model; SPIRAL unifies the generation of all three modalities.
- vs. LiDM: LiDM supports semantic-conditioned generation but requires a semantic map as prior input; SPIRAL autonomously predicts semantics during generation.
- vs. voxel-based methods (XCube, DynamicCity): Voxel-based methods carry large parameter counts and high computational overhead; SPIRAL's range-view representation is substantially more efficient.
Rating¶
- Novelty: ⭐⭐⭐⭐ Closed-loop inference and progressive semantic prediction are meaningful contributions, though the core diffusion framework is standard.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Two standard benchmarks + new evaluation metrics + extensive ablations + downstream application validation.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, professional figures, and coherent motivation.
- Value: ⭐⭐⭐⭐ Practically valuable for autonomous driving data generation, though impact is primarily confined to the LiDAR domain.