# Towards Foundational LiDAR World Models with Efficient Latent Flow Matching
Conference: NeurIPS 2025 arXiv: 2506.23434 Code: To be confirmed Area: Autonomous Driving Keywords: LiDAR world model, flow matching, 4D semantic occupancy prediction, domain transfer, VAE compression, foundation model
## TL;DR
This paper proposes the first transferable LiDAR world model. A Swin Transformer VAE achieves a 192× compression ratio with state-of-the-art reconstruction accuracy; Conditional Flow Matching (CFM) replaces diffusion models and reaches state-of-the-art semantic occupancy prediction using only 4.38% of prior work's FLOPs; and across three domain transfer tasks, fine-tuning with only 5% labeled data surpasses OccWorld trained on full annotations.
## Background & Motivation
- Geometric limitations of RGB world models: RGB-based world models (GAIA-1/2, Cosmos) exhibit strong generative capabilities but lack explicit depth and semantic structural information, making them unsuitable for planning and control in autonomous driving.
- Domain constraints of LiDAR world models: Existing LiDAR world models (Copilot4D, BEVWorld) are trained and evaluated solely under specific dataset/sensor configurations, lacking cross-domain transfer capability — they fail when the sensor or environment changes.
- High cost of semantic annotation: 4D semantic occupancy prediction tasks (OccWorld, DOME, etc.) rely heavily on expensive manual semantic annotations, limiting model scalability.
- Insufficient compression efficiency: Prior methods directly repurpose the SD3 encoder-decoder architecture for LiDAR data, yielding low compression ratios (16×–64×), which leads to redundant parameters in the dynamics model and slow training.
- Low training efficiency of diffusion models: The DDPM training + DDIM sampling paradigm requires thousands of epochs to converge for LiDAR prediction, incurring prohibitive computational costs that hinder transferability research.
- Transferability of dynamic knowledge remains unexplored: The causal laws governing object motion (dynamics priors) should be shared across environments, yet no prior work has systematically studied a pretrain-finetune paradigm for LiDAR world models.
## Method
### Overall Architecture: Pretrain-Finetune LiDAR World Model
- Function: Pretrain a general-purpose world model on large-scale unannotated LiDAR data, then fine-tune it for diverse downstream tasks (varying beam counts, indoor scenes, semantic occupancy prediction).
- Design Motivation: Dynamic knowledge (how objects move) is shared across domains; pretraining can learn generalizable 3D dynamics priors, reducing the annotation requirements for downstream tasks.
- Mechanism:
- Train a VAE + CFM model on unannotated nuScenes LiDAR data.
- Fine-tune separately for three downstream tasks: (i) sparse-to-dense beam adaptation (KITTI360); (ii) outdoor-to-indoor transfer (self-collected Jackal data); (iii) non-semantic to semantic transfer (nuScenes Occ3D).
- Apply representation alignment (VAE fine-tuning + cosine similarity loss) to prevent latent space drift during fine-tuning.
### Key Design 1: Swin Transformer VAE for Efficient Compression
- Function: Design a LiDAR-oriented VAE architecture achieving a far superior compression ratio (192×) over prior work while maintaining state-of-the-art reconstruction accuracy.
- Design Motivation: Prior methods directly reuse image VAEs (SD3 architecture) without exploiting the sparse characteristics of LiDAR BEV representations; low compression ratios (16×–64×) result in excessively high latent dimensionality, significantly increasing the parameter count of the dynamics model.
- Mechanism:
- Encoding: Height embedding + category embedding → 2D BEV feature map → Swin Transformer encoder (convolutional layers replace Patch Merging downsampling) → lightweight neck compressed to 16 channels → reparameterization sampling.
- Decoding: Symmetric 2D structure (3D blocks removed; experiments show they are detrimental) → differentiable ray rendering for point cloud recovery or occupancy map recovery via class embedding similarity.
- Continuous encoding: Discrete codebooks are abandoned (to avoid codebook collapse and low compression efficiency); continuous Gaussian latent variables are adopted instead.
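As a rough illustration of where a 192× figure can come from, the sketch below computes the element-count compression between a BEV grid and its latent. The 512×512 grid, 48 embedding channels, and 8× spatial downsampling are hypothetical values chosen for illustration; only the 16 latent channels are stated above.

```python
import numpy as np

def compression_ratio(input_shape, latent_shape):
    """Element-count ratio between the raw BEV feature grid and its latent."""
    return int(np.prod(input_shape) // np.prod(latent_shape))

# Hypothetical shapes: a 512x512 BEV grid with 48 embedding channels,
# downsampled 8x spatially into a 64x64 latent with the stated 16 channels.
ratio = compression_ratio((512, 512, 48), (64, 64, 16))
print(ratio)  # 192
```

Under these assumed shapes, the ratio works out to s² · C_in / C_latent = 64 · 48 / 16 = 192; a coarser latent grid or fewer input channels would shift the trade-off shown in Table 2.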
### Key Design 2: Conditional Flow Matching (CFM) Prediction Model
- Function: Replace DDPM/DDIM with Rectified Flow as the generative prediction framework.
- Design Motivation: Diffusion models require many sampling steps (1000 for DDPM / 50 for DDIM), whereas flow matching regresses the velocity field along straight-line trajectories and generates high-quality samples in far fewer steps, requiring only 4.38%–28.91% of prior work's FLOPs.
- Mechanism:
- Linear interpolation path: \(\mathbf{x}_t = (1-t)\epsilon + t\sigma \mathbf{z}_{s}^{t_1:t_2}\), defining a straight-line path from standard Gaussian noise to the (scaled) target distribution.
- Spatiotemporal DiT architecture improvements: 3D convolutional layers are inserted after spatial DiT blocks to enlarge the temporal receptive field (addressing weak temporal dependencies in per-frame latent representations); a UNet-style multi-scale structure replaces the single-stride DiT backbone.
- Training objective: Rectified Flow loss \(\mathcal{L}(\theta) = \mathbb{E}\|\mu_t^\theta(\mathbf{x}_t) - (\sigma\mathbf{z}_s - \epsilon)\|^2\), i.e., regression onto the constant velocity of the straight-line path, with timesteps sampled from \(\mathrm{sigmoid}(\mathcal{N}(0,1))\).
- Conditional input: Historical latent frames and noisy future frames are concatenated along the temporal dimension; optional conditioning on future trajectories is supported.
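The interpolation and training target above can be sketched as a minimal NumPy mock-up. This omits the DiT backbone and conditioning, sets the scale σ to 1, and uses illustrative names; it only shows how a rectified-flow training pair is formed.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_timestep(batch_size):
    # t ~ sigmoid(N(0, 1)): concentrates training timesteps toward mid-trajectory
    return 1.0 / (1.0 + np.exp(-rng.standard_normal(batch_size)))

def cfm_training_pair(z_target):
    """One rectified-flow training example: interpolated state x_t and velocity target."""
    eps = rng.standard_normal(z_target.shape)        # Gaussian source sample
    t = sample_timestep(z_target.shape[0])[:, None]  # one timestep per sample
    x_t = (1 - t) * eps + t * z_target               # straight-line interpolation
    v_target = z_target - eps                        # constant velocity of that line
    return x_t, t, v_target

z = rng.standard_normal((4, 16))   # toy batch of 16-dim latent frames
x_t, t, v = cfm_training_pair(z)
# A network mu_theta(x_t, t) would be regressed onto v with an MSE loss.
```

Because the path is a straight line, x_t + (1 − t)·v recovers the target exactly, which is why few integration steps suffice at sampling time.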
### Key Design 3: Representation Alignment Fine-Tuning Strategy
- Function: Maintain alignment between the new-domain latent space and the pretrained domain during fine-tuning.
- Design Motivation: Directly using the pretrained VAE or training a new VAE from scratch both result in feature space mismatch, preventing effective utilization of the pretrained CFM.
- Mechanism:
- For beam adaptation and indoor tasks: fine-tune all VAE parameters directly.
- For semantic occupancy prediction (where embedding layer dimensions differ): a cosine similarity alignment term \(\kappa \mathcal{L}_{\cos}(\mathbf{z}_s, \mathbf{d}_s)\) is added to the semantic VAE training loss, guiding the semantic latent space towards the dense occupancy latent space.
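A minimal sketch of such an alignment term, assuming it takes the form κ · mean(1 − cosine similarity) between the semantic latent and the pretrained dense latent; the exact formulation in the paper may differ.

```python
import numpy as np

def cosine_alignment_loss(z_sem, z_dense, kappa=1.0):
    """Alignment penalty: kappa * mean(1 - cos(z_sem, z_dense)) over the batch."""
    num = np.sum(z_sem * z_dense, axis=-1)
    den = np.linalg.norm(z_sem, axis=-1) * np.linalg.norm(z_dense, axis=-1) + 1e-8
    return kappa * float(np.mean(1.0 - num / den))

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 16))
print(cosine_alignment_loss(z, z))  # ~0: identical latents are already aligned
```

Added to the semantic VAE's reconstruction loss, a term like this pulls the new latent space toward the pretrained one without requiring the embedding dimensions to match at the input layer.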
## Key Experimental Results
### Table 1: SOTA Comparison on nuScenes Semantic Occupancy Prediction
| Method | 1s mIoU | 2s mIoU | 3s mIoU | Avg mIoU | GFLOPs/frame | FPS |
|---|---|---|---|---|---|---|
| OccWorld | 25.75 | 15.14 | 10.51 | 17.13 | 1347.09 | 16.97 |
| DynamicCity | 26.18 | 16.94 | — | — | 774.44 | 19.30 |
| Ours | 33.17 | 21.09 | 15.64 | 23.33 | 389.46 | 22.22 |
| DOME† | 29.39 | 20.98 | 16.17 | 22.18 | 8891.98 | 5.48 |
| Ours† | 36.42 | 27.39 | 21.66 | 28.49 | 389.46 | 21.43 |

† with future trajectory conditioning.
### Table 2: VAE Reconstruction Accuracy vs. Compression Ratio
| Method | Compression Ratio | mIoU | IoU |
|---|---|---|---|
| UniScene | 32× | 92.1 | 87.0 |
| DOME | 64× | 83.1 | 77.3 |
| Ours | 32× | 99.2 | 97.9 |
| Ours | 192× | 92.8 | 85.8 |
| Ours | 768× | 80.0 | 69.3 |
Key Findings:
- Without future trajectory conditioning, the 1s mIoU reaches 33.17%, surpassing RenderWorld (28.69%) by 4.48 absolute percentage points.
- With future trajectory conditioning, the proposed method outperforms DOME by at least 5.5% mIoU while using only 4.38% of DOME's FLOPs and running 3.9× faster.
- In domain transfer experiments, fine-tuning with only 5% of the labeled data (i.e., 5% of the annotations used by OccWorld) surpasses OccWorld trained on full annotations, with relative mIoU improvements of 82.6%/80.7%/69.7% at 1s/2s/3s.
- In 30 out of 36 comparison points, the pretrained model outperforms training from scratch, with a maximum absolute improvement of 11.17%.
- The FVD temporal consistency score is 7.68, significantly better than OccWorld (18.68) and DOME (9.79).
## Highlights & Insights
- First systematic transferability study of a LiDAR foundation world model, validating three domain transfer scenarios (beam adaptation, indoor/outdoor, non-semantic to semantic).
- The Swin Transformer VAE achieving 192× compression with SOTA reconstruction represents a major engineering contribution to the field.
- Surpassing full-annotation training with only 5% labeled data substantially reduces annotation costs for semantic occupancy prediction, offering significant practical value.
- In-depth analysis of representation alignment (via CKA/CKNNA metrics) reveals that the key to successful fine-tuning lies in preserving latent space structure rather than reconstruction accuracy.
## Limitations & Future Work
- Limited pretraining data: Only 27K frames from nuScenes are used for pretraining, far smaller in scale than RGB foundation models; incorporating additional datasets (Waymo, ONCE, etc.) may further improve transfer capability.
- Limited indoor transfer performance: when sufficient indoor data is available (>25%), training from scratch surpasses the pretrained model, indicating a large outdoor-to-indoor domain gap; indoor data would likely need to be included during pretraining.
- Restricted to ground vehicles: Transferability to non-ground platforms such as drones and underwater robots has not been validated.
- Limited semantic categories: Constrained by the 16-class annotations of Occ3D; open-vocabulary semantic prediction with finer granularity has not been explored.
- Incomplete evaluation metrics for non-deterministic outputs: mIoU/IoU cannot adequately assess the diversity of stochastic models; although NLL/FID/KID/FVD are reported, systematic evaluation of diversity and coverage is lacking.
## Related Work & Insights
### vs. Copilot4D (Zhang et al., 2023)
Copilot4D achieves state-of-the-art LiDAR prediction using a MaskGiT latent diffusion model, but is trained entirely within a single nuScenes domain with no transferability study. The proposed method surpasses Copilot4D on non-semantic occupancy prediction (IoU Avg 30.91 vs. 21.09) and, through a pretrain-finetune paradigm, demonstrates effective cross-domain transfer for the first time.
### vs. OccWorld (Zheng et al., 2024) / DOME (Gu et al., 2024)
OccWorld employs an autoregressive Transformer for semantic occupancy prediction (mIoU 17.13), while DOME introduces conditional trajectory inputs to achieve 22.18 mIoU but requires 444M parameters and 8892 GFLOPs/frame. The proposed method achieves 23.33/28.49 mIoU (without/with trajectory) using 30M parameters and 389 GFLOPs/frame (1/23 of DOME), substantially outperforming prior work in both accuracy and efficiency. Notably, the proposed method requires only 5% of labeled data to surpass OccWorld.
### vs. Cosmos (Agarwal et al., 2025)
Cosmos is an RGB foundation world model targeting both indoor and outdoor scenes but cannot provide explicit geometric information. The proposed work represents the first attempt at a foundation model-like approach in the LiDAR modality, offering directly usable 3D semantic prediction for autonomous driving.
## Rating
- Novelty: ⭐⭐⭐⭐ (First LiDAR world model transferability study + CFM replacing diffusion + VAE design; the combination of contributions is highly significant)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Three transfer tasks, multiple data ratio settings, ablation studies, comprehensive evaluation with FVD/FID/KID/NLL)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure, rigorous experiments; some equations involve dense notation requiring repeated cross-referencing)
- Value: ⭐⭐⭐⭐⭐ (Substantially reduces dependence on semantic annotations, improves efficiency by an order of magnitude, with direct implications for real-world autonomous driving deployment)