LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences

Conference: AAAI 2026 arXiv: 2508.03692 Code: https://github.com/worldbench/lidarcrafter Area: Autonomous Driving Keywords: LiDAR generation, 4D world model, diffusion model, scene graph, autonomous driving

TL;DR

This paper proposes LiDARCrafter, the first 4D generative world model targeting LiDAR. Through a pipeline of text → scene graph → three-branch layout diffusion → range-image diffusion generation → autoregressive temporal extension, it achieves controllable 4D LiDAR sequence generation and editing, comprehensively surpassing existing methods on nuScenes.

Background & Motivation

Generative world models have emerged as critical data engines for autonomous driving, yet four fundamental challenges remain unresolved:

  1. LiDAR is overlooked: Existing works primarily focus on video (GAIA-1, DreamForge) or occupancy grids (OccWorld, OccSora), while LiDAR is neglected due to its sparse, unordered, and irregular nature.
  2. Insufficient controllability: Text prompts lack spatial precision, whereas precise inputs such as 3D bounding boxes and HD maps require expensive annotation.
  3. Lack of temporal consistency: Single-frame generation fails to reveal occlusion patterns and object kinematics, and conventional cross-frame attention neglects the geometric continuity of point clouds.
  4. Absence of standardized evaluation: While video world models benefit from mature benchmarks, no unified evaluation protocol exists for LiDAR.

Core Idea: Exploit an explicit object-centric 4D layout (geometry + motion) as an intermediate representation to bridge the usability of natural language and the geometric precision of LiDAR.

Method

Overall Architecture

LiDARCrafter adopts a three-stage pipeline:

  1. Text2Layout: An LLM parses text instructions into an ego-centric scene graph, and a three-branch diffusion network generates object bounding boxes, trajectories, and shape priors.
  2. Layout2Scene: A range-image diffusion model translates layout conditions into high-fidelity single-frame LiDAR scans.
  3. Scene2Seq: An autoregressive module warps historical point clouds using motion priors to generate temporally consistent 4D sequences.

Key Designs

Design 1: Three-Branch 4D Layout Diffusion Generation (Text2Layout)

Text prompts are converted into structured 4D layout tuples \(\mathcal{O}_i=(\mathbf{b}_i, \boldsymbol{\delta}_i, \mathbf{p}_i)\):

  • \(\mathbf{b}_i=(x_i,y_i,z_i,w_i,l_i,h_i,\psi_i)\): 3D bounding box (center, dimensions, heading angle)
  • \(\boldsymbol{\delta}_i=\{(\Delta x_i^t, \Delta y_i^t)\}_{t=1}^T\): future trajectory displacements over \(T\) frames
  • \(\mathbf{p}_i \in \mathbb{R}^{N \times 3}\): \(N\) canonicalized foreground points (coarse shape prior)

Scene Graph Construction: The LLM extracts an ego-centric graph \(\mathcal{G}=(\mathcal{V},\mathcal{E})\), where node \(v_i\) is annotated with semantic category \(c_i\) and motion state \(s_i\), and directed edge \(e_{i \to j}\) encodes spatial relationships.
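To make the layout representation concrete, below is a minimal sketch of the 4D layout tuple and the ego-centric scene graph as plain data structures. Field names, sizes, and the example values are illustrative assumptions, not the paper's schema.

```python
# A minimal sketch of the 4D layout tuple O_i = (b_i, delta_i, p_i) and the
# ego-centric scene graph, written as plain dataclasses. Field names, sizes,
# and the example values are illustrative assumptions, not the paper's schema.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class LayoutObject:
    box: np.ndarray          # b_i: (7,) = (x, y, z, w, l, h, yaw)
    trajectory: np.ndarray   # delta_i: (T, 2) per-frame (dx, dy) displacements
    shape_prior: np.ndarray  # p_i: (N, 3) canonicalized foreground points
    category: str            # c_i, e.g. "car"
    motion_state: str        # s_i, e.g. "moving forward"

@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)    # LayoutObject instances
    edges: list = field(default_factory=list)    # (src_idx, dst_idx, relation)

# Example: a car 10 m ahead of the ego, driving straight for T = 4 frames,
# and a static pedestrian standing to its right.
car = LayoutObject(
    box=np.array([10.0, 0.0, -1.5, 1.9, 4.5, 1.6, 0.0]),
    trajectory=np.tile([1.0, 0.0], (4, 1)),          # ~1 m forward per frame
    shape_prior=np.random.randn(512, 3) * 0.5,
    category="car", motion_state="moving forward",
)
ped = LayoutObject(
    box=np.array([12.0, -3.0, -1.6, 0.7, 0.7, 1.7, 0.0]),
    trajectory=np.zeros((4, 2)),                     # static object
    shape_prior=np.random.randn(256, 3) * 0.3,
    category="pedestrian", motion_state="standing",
)
graph = SceneGraph(nodes=[car, ped], edges=[(1, 0, "to the right of")])
```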

Graph-Fusion Encoder: An \(L\)-layer TripletGCN processes the scene graph, with nodes and edges initialized using a frozen CLIP encoder:

\[\mathbf{h}_{v_i}^{(0)}=\text{concat}(\text{CLIP}(c_i), \text{CLIP}(s_i), \boldsymbol{\omega}_i)\]

At each layer, node features are updated via edge reasoning \(\Phi_{\text{edge}}\) and neighborhood aggregation \(\Phi_{\text{agg}}\).
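A rough sketch of one such layer is given below: each directed triplet \((\mathbf{h}_{v_i}, \mathbf{h}_{e}, \mathbf{h}_{v_j})\) is processed by an edge MLP standing in for \(\Phi_{\text{edge}}\), and the resulting messages are mean-aggregated per node before an update MLP standing in for \(\Phi_{\text{agg}}\). This is one plausible reading of a TripletGCN layer, not the authors' implementation; all dimensions are arbitrary.

```python
# One TripletGCN-style message-passing layer (illustrative sketch, PyTorch).
import torch
import torch.nn as nn

class TripletGCNLayer(nn.Module):
    def __init__(self, node_dim: int, edge_dim: int):
        super().__init__()
        # Phi_edge: maps a (subject, relation, object) triplet to messages for
        # the two endpoint nodes plus an updated edge feature.
        self.phi_edge = nn.Sequential(
            nn.Linear(2 * node_dim + edge_dim, 2 * node_dim + edge_dim),
            nn.ReLU(),
        )
        # Phi_agg: fuses a node's current feature with its aggregated messages.
        self.phi_agg = nn.Sequential(
            nn.Linear(2 * node_dim, node_dim),
            nn.ReLU(),
        )
        self.node_dim, self.edge_dim = node_dim, edge_dim

    def forward(self, h_v, h_e, edge_index):
        # h_v: (V, node_dim) node features; h_e: (E, edge_dim) edge features
        # edge_index: (2, E) long tensor of (source, target) node indices.
        src, dst = edge_index
        triplet = torch.cat([h_v[src], h_e, h_v[dst]], dim=-1)
        out = self.phi_edge(triplet)
        msg_src, new_e, msg_dst = torch.split(
            out, [self.node_dim, self.edge_dim, self.node_dim], dim=-1)

        # Mean-aggregate the messages from every incident edge onto its endpoints.
        agg = torch.zeros_like(h_v)
        cnt = torch.zeros(h_v.size(0), 1)
        agg.index_add_(0, src, msg_src)
        cnt.index_add_(0, src, torch.ones(src.size(0), 1))
        agg.index_add_(0, dst, msg_dst)
        cnt.index_add_(0, dst, torch.ones(dst.size(0), 1))
        agg = agg / cnt.clamp(min=1)

        return self.phi_agg(torch.cat([h_v, agg], dim=-1)), new_e
```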

Three-Branch Diffusion Decoder: Each branch minimizes:

\[\mathcal{L}^o=\mathbb{E}_{\tau,\mathbf{d}^o,\varepsilon}\|\varepsilon-\varepsilon_\theta^o(\mathbf{d}_\tau^o, \tau, c^o)\|_2^2\]

Bounding boxes and trajectories are denoised with lightweight 1D U-Nets, while object shapes are generated using a point cloud U-Net.
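This is the standard DDPM \(\varepsilon\)-prediction objective. A minimal training-step sketch is shown below; the denoiser is a placeholder for the respective 1D or point-cloud U-Net, and the linear noise schedule is an assumption (only the 1024-step count comes from the training details later in the paper).

```python
# A minimal sketch of the per-branch epsilon-prediction objective
# (a vanilla DDPM training step). `denoiser` is any callable eps_theta(d, tau, c).
import torch
import torch.nn.functional as F

def diffusion_loss(denoiser, d0, cond, num_steps: int = 1024):
    """d0: clean layout data of one branch, shape (B, ...); cond: (B, C) condition."""
    B = d0.size(0)
    betas = torch.linspace(1e-4, 2e-2, num_steps)          # assumed linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)           # cumulative alpha-bar

    tau = torch.randint(0, num_steps, (B,))                 # random timestep per sample
    a = alpha_bar[tau].view(B, *([1] * (d0.dim() - 1)))     # broadcast to d0's shape
    eps = torch.randn_like(d0)                               # target noise
    d_tau = a.sqrt() * d0 + (1.0 - a).sqrt() * eps           # forward-noised sample

    eps_pred = denoiser(d_tau, tau, cond)                    # eps_theta(d_tau, tau, c)
    return F.mse_loss(eps_pred, eps)                         # ||eps - eps_theta||^2
```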

Design 2: Sparse Object-Conditioned Range-Image Diffusion (Layout2Scene)

To address the challenge that distant or small objects occupy only tens of pixels in range images, sparse object conditioning is proposed:

\[\hat{\mathbf{h}}_{v_i}=\Phi_{\text{pos}}(\pi(\mathbf{b}_i))+\Phi_{\text{cls}}(c_i)+\Phi_{\text{box}}(\mathbf{b}_i)\]

Global conditioning vector: \(\mathbf{h}_{\text{cond}}=\mathbf{h}_{\text{ego}}+\Phi_{\text{time}}(\tau)+\text{CLIP}(s_0)\)
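A minimal sketch of the per-object conditioning token is given below. The spherical projection \(\pi(\cdot)\) of the box center, the embedding widths, and the field-of-view limits are my own assumptions for illustration.

```python
# Sparse object conditioning token: positional embedding of the projected box
# centre, plus class and box embeddings (illustrative sketch, PyTorch).
import math
import torch
import torch.nn as nn

class SparseObjectEmbedding(nn.Module):
    def __init__(self, num_classes: int, dim: int = 256,
                 v_fov=(-30.0, 10.0), w_res: int = 1024, h_res: int = 32):
        super().__init__()
        self.phi_pos = nn.Linear(2, dim)               # Phi_pos on projected (row, col)
        self.phi_cls = nn.Embedding(num_classes, dim)  # Phi_cls on the category label
        self.phi_box = nn.Linear(7, dim)               # Phi_box on raw box parameters
        self.v_fov, self.w_res, self.h_res = v_fov, w_res, h_res

    def project(self, box):
        # pi(b): spherical projection of the box centre onto the range image.
        x, y, z = box[:, 0], box[:, 1], box[:, 2]
        azimuth = torch.atan2(y, x)                                  # [-pi, pi]
        elev = torch.atan2(z, torch.sqrt(x * x + y * y))
        col = (0.5 * (1.0 - azimuth / math.pi)) * self.w_res
        lo, hi = (math.radians(a) for a in self.v_fov)
        row = (1.0 - (elev - lo) / (hi - lo)) * self.h_res
        return torch.stack([row, col], dim=-1)

    def forward(self, box, cls_id):
        # box: (M, 7) = (x, y, z, w, l, h, yaw); cls_id: (M,) integer labels.
        return (self.phi_pos(self.project(box))
                + self.phi_cls(cls_id)
                + self.phi_box(box))
```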

Layout-driven scene editing is realized via mask blending:

\[\mathbf{d}_{\tau-1}=(1-\mathbf{m})\odot\tilde{\mathbf{d}}_{\tau-1}+\mathbf{m}\odot\hat{\mathbf{d}}_{\tau-1}\]
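Numerically this is a per-pixel linear mix applied at every denoising step; a one-function sketch follows. Which operand holds the re-noised original scene and which the model's denoised edit follows the usual inpainting convention and is my reading, not something the excerpt above spells out.

```python
# Mask-blending step for layout-driven editing (illustrative sketch).
import torch

def blend_step(d_tilde: torch.Tensor, d_hat: torch.Tensor,
               mask: torch.Tensor) -> torch.Tensor:
    """d_{tau-1} = (1 - m) * d_tilde + m * d_hat.
    Under the common inpainting convention, the term kept outside the edit
    mask is the original scene diffused to the same noise level, so unedited
    regions are preserved while the masked region is regenerated."""
    return (1.0 - mask) * d_tilde + mask * d_hat
```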

Design 3: Motion-Prior-Driven Autoregressive 4D Generation (Scene2Seq)

The core insight is that, aside from the ego vehicle and annotated objects, the majority of the scene in a LiDAR sequence is static. Warping is therefore used to provide a strong prior:

  • Static scene warp: Background points are transformed using the ego-pose matrix \(\Delta\mathbf{G}_0^t\) as \(\mathbf{B}^t=\Delta\mathbf{G}_0^t \mathbf{B}^{t-1}\).
  • Dynamic object warp: Each object's position is updated according to its own trajectory offset and then transformed into the current ego coordinate frame.

At each timestep, a conditional range map is constructed:

\[I_{\text{cond}}^t=\Pi(\mathbf{B}^{0 \to t} \cup \mathbf{B}^{t-1 \to t} \cup \{\mathbf{F}_i^{t-1 \to t}\}_{i=1}^M)\]

The first-frame background warp \(\mathbf{B}^{0 \to t}\) is included to eliminate accumulated drift.
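A compact numpy sketch of this warp-and-project conditioning step is shown below, covering the rigid transform of background points, the per-object trajectory offset, and the spherical projection \(\Pi(\cdot)\) with nearest-return z-buffering. The \(32 \times 1024\) resolution follows the training setting reported later; the function names, vertical field of view, and data layout are assumptions for illustration.

```python
# Warp history into the current ego frame and render a conditional range map.
import numpy as np

def transform(points: np.ndarray, pose: np.ndarray) -> np.ndarray:
    """Apply a 4x4 rigid transform to (N, 3) points."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    return (homo @ pose.T)[:, :3]

def to_range_image(points, h_res=32, w_res=1024, v_fov=(-30.0, 10.0)):
    """Spherical projection Pi(.) keeping the nearest return per pixel."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)
    col = ((0.5 * (1.0 - np.arctan2(y, x) / np.pi)) * w_res).astype(int) % w_res
    elev = np.degrees(np.arcsin(z / np.maximum(depth, 1e-6)))
    row = ((1.0 - (elev - v_fov[0]) / (v_fov[1] - v_fov[0])) * h_res).astype(int)
    img = np.zeros((h_res, w_res))
    valid = (row >= 0) & (row < h_res)
    for r, c, d in sorted(zip(row[valid], col[valid], depth[valid]),
                          key=lambda t: -t[2]):      # far first, near overwrites
        img[r, c] = d
    return img

def conditional_range_map(bg_first, bg_prev, objects, ego_pose_0t, ego_pose_prev_t):
    """I_cond^t from the first-frame and previous-frame backgrounds plus objects.
    `objects` is a list of (points, (dx, dy)) pairs in the previous ego frame,
    with offsets taken from the generated trajectories."""
    warped = [transform(bg_first, ego_pose_0t),       # B^{0 -> t}
              transform(bg_prev, ego_pose_prev_t)]    # B^{t-1 -> t}
    for pts, (dx, dy) in objects:                     # F_i^{t-1 -> t}
        moved = pts + np.array([dx, dy, 0.0])
        warped.append(transform(moved, ego_pose_prev_t))
    return to_range_image(np.concatenate(warped, axis=0))
```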

Loss & Training

  • Three-branch layout diffuser: 1M steps, batch size 64
  • Range-image diffusion model: 500K steps, batch size 32, resolution \(32 \times 1024\)
  • Training with 1024 denoising steps; inference with 256 steps
  • Trained on 6 NVIDIA A40 GPUs

Key Experimental Results

Main Results

Scene-level fidelity (nuScenes, lower is better):

| Method | Conference | FRD↓ | FPD↓ | BEV-JSD↓ | BEV-MMD↓ |
|---|---|---|---|---|---|
| LiDARGen | ECCV'22 | 759.65 | 159.35 | 5.74 | 2.39 |
| LiDM | CVPR'24 | 495.54 | 210.20 | 5.86 | 0.73 |
| R2DM | ICRA'24 | 243.35 | 33.97 | 3.51 | 0.71 |
| UniScene | CVPR'25 | - | 976.47 | 31.55 | 13.61 |
| OpenDWM-DiT | CVPR'25 | - | 381.91 | 19.90 | 5.73 |
| LiDARCrafter | Ours | 194.37 | 8.64 | 3.11 | 0.42 |

Foreground object detection confidence (FDC↑):

| Method | Car | Ped | Truck | Bus | #Box |
|---|---|---|---|---|---|
| OpenDWM-DiT | 0.78 | 0.32 | 0.56 | 0.51 | 0.64 |
| LiDARCrafter | 0.83 | 0.34 | 0.55 | 0.54 | 1.84 |

Ablation Study

Ablation on foreground conditioning mechanisms:

| ID | Variant | FRD↓ | FPD↓ | Object FPD↓ | CFCA↑ | CFSC↑ |
|---|---|---|---|---|---|---|
| 1 | Baseline (no foreground) | 243.35 | 33.97 | 1.40 | - | - |
| 2 | + 2D mask | 237.17 | 33.21 | 1.35 | 61.22 | 0.24 |
| 3 | + Obj mask | 217.83 | 24.02 | 1.20 | 64.54 | 0.27 |
| 4 | + Sparse position embedding | 205.27 | 15.97 | 1.08 | 72.46 | 0.40 |
| 6 | + All (full model) | 194.37 | 8.64 | 1.03 | 73.45 | 0.42 |

Ablation on 4D generation paradigms:

| ID | Paradigm | TTCE(3f)↓ | CTC(3f)↓ | FRD↓ | FPD↓ |
|---|---|---|---|---|---|
| 1 | End-to-end | 3.21 | 5.68 | 477.21 | 182.36 |
| 2 | Autoregressive (no prior) | 3.31 | 4.31 | 311.27 | 90.10 |
| 5 | Autoregressive + depth prior | 2.65 | 3.02 | 194.37 | 8.64 |

Temporal consistency (TTCE↓ / CTC↓):

| Method | TTCE(3f) | TTCE(4f) | CTC(1f) | CTC(3f) |
|---|---|---|---|---|
| UniScene | 2.74 | 3.69 | 0.90 | 3.64 |
| OpenDWM-DiT | 2.71 | 3.66 | 0.89 | 3.06 |
| LiDARCrafter | 2.65 | 3.56 | 1.12 | 3.02 |

Key Findings

  • FRD is reduced by 20% over R2DM (194.37 vs. 243.35) and FPD by 75% (8.64 vs. 33.97).
  • Foreground detection AP (CDA) is comprehensively superior: BEV R11 AP 23.21 vs. OpenDWM-DiT's 16.37; 3D R40 AP 8.26 vs. 1.89.
  • Depth prior is more critical than intensity prior for temporal consistency: removing the depth prior increases FRD by 109.88.
  • Autoregressive generation is more suitable than end-to-end generation for LiDAR sequences, consistent with the predominantly static nature of LiDAR data.

Highlights & Insights

  • The first 4D world model dedicated to LiDAR, filling an important methodological gap.
  • The scene graph serves as an intermediate representation bridging text and layout, elegantly balancing controllability and usability.
  • The warp-and-inpaint autoregressive strategy leverages the static nature of LiDAR sequences through motion priors.
  • The comprehensive EvalSuite spanning scene-level, object-level, and temporal-level metrics establishes an evaluation standard for subsequent work.
  • Supports fine-grained scene editing operations including insertion, deletion, and dragging, enabling generation of safety-critical corner cases.

Limitations & Future Work

  • Validation is currently limited to nuScenes (32-beam LiDAR); generalizability to higher-resolution LiDAR (e.g., 128-beam) remains unexplored.
  • Scene graphs are generated by an LLM, which may introduce parsing errors in complex scenes.
  • Autoregressive generation incurs slight cumulative error; CTC at short intervals (1 frame) is inferior to OpenDWM-DiT.
  • The effects of adverse weather conditions (rain, snow, fog) on LiDAR point clouds are not considered.

Comparison with Existing Methods

  • vs. LiDARGen/R2DM: These methods perform only single-frame unconditional generation, whereas LiDARCrafter supports conditioned 4D sequence generation.
  • vs. UniScene/OpenDWM: Voxel/BEV-based methods exhibit poor LiDAR fidelity and low foreground quality (UniScene FPD 976 vs. LiDARCrafter 8.64).
  • vs. Video World Models (GAIA-1, etc.): Video pixel textures vary substantially across frames, while LiDAR sequences are predominantly static — LiDARCrafter's warp strategy explicitly exploits this distinction.

Rating

  • Novelty: ⭐⭐⭐⭐ First LiDAR 4D world model with a complete Text2Layout→Layout2Scene→Scene2Seq pipeline design.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multi-dimensional evaluation (scene/object/temporal), detailed ablations, and corner-case generation demonstrations.
  • Writing Quality: ⭐⭐⭐⭐ Highly systematic, with clear method descriptions and well-coordinated equations and figures.
  • Value: ⭐⭐⭐⭐ Directly applicable to autonomous driving data augmentation and simulation; the EvalSuite is reusable by the community.