StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models¶

Conference: CVPR 2025
arXiv: 2412.13188
Code: GitHub
Area: Video Generation
Keywords: Street View Synthesis, Video Diffusion Models, LiDAR Conditioning, Novel View Synthesis, 3DGS Distillation

TL;DR¶

This paper proposes StreetCrafter, which utilizes LiDAR point cloud rendering as a pixel-level condition to control a video diffusion model, achieving precise camera-controlled novel view synthesis for street views. The learned generative priors are then distilled into dynamic 3DGS representations to enable real-time rendering.

Background & Motivation¶

Autonomous driving simulators require high-quality novel view synthesis capabilities. While 3DGS-based methods generate high-quality images near the training trajectory, they suffer from severe artifacts when the camera viewpoint significantly deviates. This stems from insufficient observations of out-of-distribution regions in the training data and the limited extrapolation capabilities of reconstruction methods.

Video diffusion models possess strong generative priors that can synthesize realistic viewpoints from sparse input images. However, existing approaches rely on text prompts as control signals, which are high-level instructions that lack fine-grained control and are thus unsuitable for autonomous driving scenarios.

The key observation is that LiDAR point cloud rendering provides precise geometric information. Although incomplete and noisy, it can serve as an exact camera pose representation. By projecting the LiDAR point cloud as a pixel-aligned condition for the video diffusion model, generative priors can be leveraged while maintaining fine-grained camera control. Experiments demonstrate that even when trained solely on single-lane driving data, the model can generalize to generate high-quality multi-lane views during inference.

Method¶

Overall Architecture¶

Built upon Vista (a driving world model fine-tuned from SVD), StreetCrafter comprises three core components: (1) LiDAR condition construction, which aggregates colored LiDAR points from neighboring frames into a global point cloud and renders it to the target viewpoint; (2) Controllable video diffusion model, which injects the rendered LiDAR image as a pixel-level condition into the UNet; and (3) Dynamic 3DGS distillation, which uses the novel-view images generated by the diffusion model as extra supervision.

Key Design 1: LiDAR Pixel-Level Condition Construction¶

Function: Provides precise pixel-level pose control signals for the video diffusion model.

Mechanism: LiDAR points are projected onto the calibrated image planes to obtain per-point colors, and object tracking is used to separate dynamic objects from the static background. Given a target camera pose \(\mathbf{C}_i\), the LiDAR points within a time window \(l\) (\(\pm 1\) s) are aggregated into a single point cloud \(\mathbf{P}\). Points belonging to dynamic objects are transformed into the world coordinate system using the tracking pose \(\mathbf{T}_o^{t_i}\). The aggregated points are then rasterized from the target viewpoint with a fixed point radius to produce the condition image \(\mathbf{I}^c_i\).
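
The projection and fixed-radius rasterization described above can be illustrated with a minimal NumPy sketch. It is not the authors' renderer: the function names, the pinhole camera model, the square point footprint, and the far-to-near painting order are all assumptions made for illustration.

```python
import numpy as np

def object_points_to_world(points_obj, obj_to_world):
    """Move object-frame points into the world frame with a 4x4 tracking pose."""
    homo = np.concatenate([points_obj, np.ones((len(points_obj), 1))], axis=1)
    return (obj_to_world @ homo.T).T[:, :3]

def render_lidar_condition(points_world, colors, world_to_cam, K, height, width, radius=1):
    """Rasterize colored world-space points into a condition image.

    points_world: (N, 3) aggregated static and dynamic points
    colors:       (N, 3) RGB values in [0, 1]
    world_to_cam: (4, 4) extrinsics of the target camera (assumed convention)
    K:            (3, 3) pinhole intrinsics (assumed camera model)
    """
    # Transform the points into the target camera frame.
    homo = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    cam = (world_to_cam @ homo.T).T[:, :3]

    # Discard points behind (or extremely close to) the camera.
    in_front = cam[:, 2] > 1e-6
    cam, colors = cam[in_front], colors[in_front]

    # Pinhole projection to integer pixel coordinates.
    uvw = (K @ cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    image = np.zeros((height, width, 3), dtype=np.float32)

    # Paint far-to-near so that nearer points overwrite farther ones.
    for i in np.argsort(-cam[:, 2]):
        u0, u1 = max(u[i] - radius, 0), min(u[i] + radius + 1, width)
        v0, v1 = max(v[i] - radius, 0), min(v[i] + radius + 1, height)
        if u0 < u1 and v0 < v1:
            image[v0:v1, u0:u1] = colors[i]
    return image
```

Pixels that no point covers stay black, which mirrors the incompleteness of the LiDAR condition that the diffusion model is expected to fill in.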

Design Motivation: Compared to abstract control signals such as camera parameter embeddings, the rendered LiDAR image establishes a pixel-aligned connection between the novel trajectory and the input images. The network only needs to refine the incomplete, noisy condition images into clean frames rather than learn a complex mapping from abstract camera parameters to video frames.

Key Design 2: Zero-Convolution Condition Injection and Training¶

Function: Efficiently injects LiDAR conditions into the video diffusion model.

Mechanism: A pre-trained VAE encoder maps the input images and the LiDAR condition images to latent representations \(\{\mathbf{z}_i\}\) and \(\{\mathbf{z}^c_i\}\). The LiDAR latents pass through a trainable zero-convolution layer \(\mathcal{Z}\) with parameters \(\Theta_z\) and are added element-wise to the noisy latents \(\mathbf{z}_{i,t}\):

\[\hat{\mathbf{z}}_{i,t} = \mathbf{z}_{i,t} + \mathcal{Z}(\mathbf{z}^c_i; \Theta_z)\]

The training loss follows the standard diffusion denoising objective: \(\mathcal{L} = \mathbb{E}[\|\mathbf{z}_i - \mathcal{F}_\theta(\hat{\mathbf{z}}_{i,t}, t, \mathbf{c}_{\text{ref}}, \mathbf{c}_p)\|_2^2]\).
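
As a rough illustration of the injection equation and the denoising objective, the sketch below implements a zero-initialized convolution in NumPy. The 1x1 kernel size, the class and function names, and the use of a mean (rather than summed) squared error are assumptions; the actual model is a video UNet trained in a deep learning framework.

```python
import numpy as np

class ZeroConv1x1:
    """1x1 convolution whose weight and bias start at zero (illustrative only)."""

    def __init__(self, channels):
        self.weight = np.zeros((channels, channels), dtype=np.float32)
        self.bias = np.zeros(channels, dtype=np.float32)

    def __call__(self, z):
        # z has shape (frames, channels, height, width); mix channels per pixel.
        out = np.einsum("oc,fchw->fohw", self.weight, z)
        return out + self.bias[None, :, None, None]

def inject_condition(z_noisy, z_cond, zero_conv):
    """Compute z_hat = z_noisy + Z(z_cond; Theta_z)."""
    return z_noisy + zero_conv(z_cond)

def denoising_loss(z_clean, z_pred):
    """Mean squared error between clean latents and the network prediction.

    This equals the squared L2 norm in the objective up to a constant factor.
    """
    return float(np.mean((z_clean - z_pred) ** 2))
```

With freshly initialized weights, `inject_condition` returns the noisy latents unchanged, so the conditioned model starts out behaving exactly like the unconditioned one.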

Design Motivation: Because the zero-convolution layer is initialized to zero, the model's output at the start of training is identical to that of the original diffusion model, and the LiDAR guidance is learned gradually with only a small modification to the network and little additional computational overhead. During training, the reference images and the LiDAR conditions are each dropped independently with 15% probability to support classifier-free guidance.
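
The independent condition dropout can be sketched as follows. How a condition is "dropped" is not specified here, so replacing it with zeros is an assumption, as are the function names and the guidance scale argument. The guidance function is the standard classifier-free guidance combination, shown only to indicate why the dropout is needed.

```python
import numpy as np

def drop_conditions(ref_latent, lidar_latent, rng, p_drop=0.15):
    """Independently blank each condition with probability p_drop.

    Blanking is modeled as zeroing the latent, which is an assumption.
    """
    if rng.random() < p_drop:
        ref_latent = np.zeros_like(ref_latent)
    if rng.random() < p_drop:
        lidar_latent = np.zeros_like(lidar_latent)
    return ref_latent, lidar_latent

def classifier_free_guidance(pred_cond, pred_uncond, scale):
    """Standard CFG: extrapolate from the unconditional toward the conditional prediction."""
    return pred_uncond + scale * (pred_cond - pred_uncond)
```

Because the two draws are independent, some training samples see only the reference image, some see only the LiDAR condition, and a small fraction see neither.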

Key Design 3: Progressive 3DGS Distillation¶

Function: Distills diffusion model priors into 3DGS representations to achieve real-time rendering.

Mechanism: Novel-view images rendered by the 3DGS model, which contain artifacts, are encoded into latents, perturbed with noise, and denoised by the diffusion model to produce high-quality novel-view images that serve as extra supervision. The optimization is progressive: the noise scale \(s\) is gradually reduced, so early training relies on the diffusion prior to remove artifacts while later training refines fine-grained details. An LPIPS loss is applied to the novel views to emphasize semantic-level consistency:

\[\mathcal{L}_{\text{novel}} = \lambda_{\text{novel}} \mathcal{L}_{\text{lpips}}\]

Design Motivation: Starting from noisy rendered results (instead of pure noise) helps maintain the overall scene structure and reduces the number of denoising steps. Combining the consistency of 3D representations with the generative capability of diffusion models yields optimal performance.
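
A minimal sketch of the progressive schedule is given below. The linear decay, the endpoint values, the additive form of the noise, and the `denoise_fn` callable (a stand-in for the diffusion model) are all hypothetical; only the idea of starting from a noised render and shrinking \(s\) over time comes from the description above.

```python
import numpy as np

def noise_scale(step, total_steps, s_start=0.7, s_end=0.3):
    """Linearly decay the noise scale s over optimization (placeholder endpoints)."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return s_start + frac * (s_end - s_start)

def perturb_rendered_latent(z_render, s, rng):
    """Add Gaussian noise of scale s to a latent encoded from a 3DGS render."""
    return z_render + s * rng.standard_normal(z_render.shape)

def novel_view_target(z_render, step, total_steps, denoise_fn, rng):
    """Produce a supervision latent for a novel view at the given training step.

    denoise_fn(z_noisy, s) is a hypothetical wrapper around the diffusion model.
    """
    s = noise_scale(step, total_steps)
    z_noisy = perturb_rendered_latent(z_render, s, rng)
    return denoise_fn(z_noisy, s)
```

A larger \(s\) early on lets the diffusion model overwrite more of the render, while a smaller \(s\) later keeps the output close to the current 3DGS result.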

Loss & Training¶

For input views: \(\mathcal{L}_{\text{input}} = \lambda_1 \mathcal{L}_1 + \lambda_{\text{ssim}} \mathcal{L}_{\text{ssim}} + \lambda_{\text{lpips}} \mathcal{L}_{\text{lpips}} + \mathcal{L}_g\), where \(\mathcal{L}_g\) includes the LiDAR depth loss, the sky mask loss, and the dynamic object regularization. For novel views: \(\mathcal{L}_{\text{novel}} = \lambda_{\text{novel}} \mathcal{L}_{\text{lpips}}\) with \(\lambda_{\text{novel}} = 0.1\).
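
The two objectives can be written as plain weighted sums, as in the sketch below. Only the novel-view weight of 0.1 is stated above; the input-view weights are left as required arguments because their values are not given here, and the function names are illustrative.

```python
def input_view_loss(l1, ssim_term, lpips, geom, lambda_1, lambda_ssim, lambda_lpips):
    """Weighted photometric terms plus the geometric/regularization term L_g."""
    return lambda_1 * l1 + lambda_ssim * ssim_term + lambda_lpips * lpips + geom

def novel_view_loss(lpips, lambda_novel=0.1):
    """LPIPS-only supervision on diffusion-generated novel views."""
    return lambda_novel * lpips
```

Using only a perceptual term on novel views avoids penalizing the pixel-level differences that generated images inevitably have relative to the true scene.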

Key Experimental Results¶

Main Results: Novel View Synthesis on Waymo Open Dataset¶

Method	PSNR↑	SSIM↑	LPIPS↓	Rendering speed
3DGS + LiDAR	26.87	0.851	0.182	Real-time
Street Gaussians	27.52	0.862	0.171	Real-time
EmerNeRF	26.21	0.843	0.195	-
NeuRAD	27.14	0.858	0.178	-
StreetCrafter-V	28.34	0.874	0.148	0.2 FPS
StreetCrafter-G	28.91	0.883	0.139	Real-time

Novel View Extrapolations (lateral shift 3m)¶

Method	PSNR↑	LPIPS↓
Street Gaussians	22.71	0.312
NeuRAD	23.45	0.287
StreetCrafter-G	25.83	0.198

Key Findings¶

In the viewpoint extrapolation setting (3 m lateral shift), StreetCrafter-G improves PSNR by 3.12 dB over Street Gaussians (25.83 vs. 22.71) and lowers LPIPS from 0.312 to 0.198, highlighting the advantage of diffusion priors for out-of-distribution views.
Training only on single-lane data generalizes successfully to multi-lane testing views.
LiDAR conditioning can also be applied to scene editing (object removal, replacement, and translation) without requiring per-scene optimization.
The distilled 3DGS version (StreetCrafter-G) outperforms the pure diffusion version (StreetCrafter-V) on all three image-quality metrics (28.91 vs. 28.34 PSNR, 0.883 vs. 0.874 SSIM, 0.139 vs. 0.148 LPIPS) while rendering in real time.
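
The extrapolation gains can be checked directly from the 3 m lateral-shift table. The short script below only recomputes differences from the reported numbers; the dictionary layout and function name are illustrative.

```python
# Reported results for the 3 m lateral-shift setting.
results = {
    "Street Gaussians": {"psnr": 22.71, "lpips": 0.312},
    "NeuRAD": {"psnr": 23.45, "lpips": 0.287},
    "StreetCrafter-G": {"psnr": 25.83, "lpips": 0.198},
}

def gains_over(baseline, ours="StreetCrafter-G"):
    """Return (PSNR gain in dB, relative LPIPS reduction in percent)."""
    b, o = results[baseline], results[ours]
    psnr_gain = round(o["psnr"] - b["psnr"], 2)
    lpips_drop = round(100 * (b["lpips"] - o["lpips"]) / b["lpips"], 1)
    return psnr_gain, lpips_drop

for name in ("Street Gaussians", "NeuRAD"):
    print(name, gains_over(name))
```

Against Street Gaussians this gives a 3.12 dB PSNR gain and a 36.5% relative LPIPS reduction; against NeuRAD, 2.38 dB and 31.0%.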

Highlights & Insights¶

LiDAR as a Pixel-Level Pose Condition: Rendered LiDAR images offer a much stronger pixel-aligned guidance signal compared to camera pose embeddings, significantly easing the learning task for the network.
Complementarity of Diffusion and Reconstruction: This work successfully merges the generalization ability of diffusion models with the consistency/real-time speed of 3DGS. The progressive distillation strategy allows both paradigms to complement each other.
Zero-Cost Scene Editing: The controllability of the LiDAR condition enables scene editing tasks (e.g., object removal/replacement) without requiring per-scene optimization.

Limitations & Future Work¶

Diffusion inference is slow (only 0.2 FPS at 576×1024 resolution), so practical use depends heavily on distillation.
High-quality LiDAR data and accurate object tracking are required as inputs.
Generalization capability under long-range camera shifts or extreme weather conditions remains to be fully validated.
Future work could explore integration with faster diffusion models (e.g., consistency models).

Related Work¶

Vista: A foundational driving world model. StreetCrafter builds on Vista by adding LiDAR conditioning.
Street Gaussians: A baseline framework for dynamic 3DGS modeling, which StreetCrafter employs during the distillation stage.
ReconFusion / CAT3D: These share similar ideas of leveraging diffusion models to assist 3D reconstruction, but are restricted to static scenes and lack precise camera control.

Rating¶

⭐⭐⭐⭐ — Distinct and clear method design. The usage of LiDAR as a pixel-aligned condition is both novel and highly practical. The viewpoint extrapolation results on Waymo are impressive. The joint diffusion-3DGS distillation framework holds good engineering value and provides solid practical advancement for autonomous driving simulation.