Bolt3D: Generating 3D Scenes in Seconds¶
- Conference: ICCV 2025
- arXiv: 2503.14445
- Code: szymanowiczs.github.io/bolt3d (project page; code not released)
- Area: 3D Vision / 3D Scene Generation / Novel View Synthesis
- Keywords: Latent Diffusion Model, 3D Gaussian Representation, Feed-Forward Generation, Geometry VAE, Splatter Image
- Authors: Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Hołyński, Ricardo Martin-Brualla, Jonathan T. Barron, Philipp Henzler (Google Research / Oxford)
TL;DR¶
A feed-forward 3D scene generation method built on latent diffusion models: it represents a 3D scene as multiple sets of Splatter Images, employs a dedicated geometry VAE, and generates a complete scene on a single GPU in about 7 seconds, reducing inference cost by roughly 300× relative to the optimization-based CAT3D.
Background & Motivation¶
- 2D generative models cannot directly output 3D scenes: Existing image/video generative models produce 2D content, which is unsuitable for interactive visualization and editing.
- Severe scarcity of 3D data: Compared to the abundance of 2D image data, ground-truth 3D scene data is extremely limited, making direct training of 3D generative models challenging.
- Inefficiency of existing 3D generation methods:
- Multi-view diffusion + per-scene optimization methods (e.g., CAT3D): generate 800 images → optimize 3DGS/NeRF, requiring minutes to hours.
- Feed-forward regression methods (Flash3D, DepthSplat): fast but unable to handle ambiguity (blurry unseen regions).
- Feed-forward generative methods (LatentSplat, Wonderland): either limited to single categories or reliant on slow video models (~5 min/scene).
- Key Challenge: How can inference be brought down to seconds while maintaining high-quality 3D generation?
Core Problem¶
How to design a diffusion model that directly outputs a renderable 3D scene representation while: (1) leveraging mature 2D diffusion architectures for scalability and generalization; (2) accurately modeling 3D geometry rather than merely generating 2D images; and (3) handling ambiguity in unobserved regions?
Method¶
Overall Architecture¶
Bolt3D adopts a two-stage feed-forward pipeline (a pseudocode sketch of the data flow follows the list):

1. Multi-view latent diffusion model: Takes 1–4 posed input images + target camera poses → jointly denoises to produce appearance latent codes + geometry latent codes for 16 viewpoints.
2. VAE decoding: Appearance latent codes are decoded into RGB images using a pretrained image VAE; geometry latent codes are decoded into pointmaps (per-pixel 3D coordinates) using a dedicated geometry VAE.
3. Gaussian Head: Receives decoded images, pointmaps, and camera poses → feed-forward prediction of per-pixel 3D Gaussian opacity, covariance, and refined color → forms Splatter Images.
4. All Splatter Images are merged into a complete 3D Gaussian scene.
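To make the data flow concrete, here is a minimal PyTorch-style sketch of the pipeline. All module names (`diffusion`, `image_vae`, `geometry_vae`, `gaussian_head`) and tensor shapes are hypothetical stand-ins inferred from the description above, not the authors' implementation (the code has not been released).

```python
import torch

def bolt3d_generate(cond_images, cond_poses, target_poses,
                    diffusion, image_vae, geometry_vae, gaussian_head):
    """Feed-forward generation: diffusion over appearance + geometry
    latents, VAE decoding, then a Gaussian head. Shapes are illustrative
    (16 target views, 64x64 latents, 512x512 decoded resolution)."""
    # 1. Jointly denoise appearance + geometry latents for all 16 views,
    #    conditioned on 1-4 posed input images.
    app_lat, geo_lat = diffusion.sample(
        cond_images, cond_poses, target_poses)  # each (16, 8, 64, 64)

    # 2. Decode: pretrained image VAE -> RGB images; dedicated geometry
    #    VAE -> pointmaps (per-pixel 3D coordinates).
    images = image_vae.decode(app_lat)        # (16, 3, 512, 512)
    pointmaps = geometry_vae.decode(geo_lat)  # (16, 3, 512, 512)

    # 3. Predict the remaining per-pixel Gaussian parameters feed-forward.
    splatter_images = gaussian_head(images, pointmaps,
                                    target_poses)  # (16, H, W, 14)

    # 4. The union of all per-view Gaussians forms the 3D scene.
    return splatter_images.reshape(-1, splatter_images.shape[-1])
```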
Key Designs¶
- Factorized Sampling (a toy parameterization sketch follows this list):
- Decomposes 3D Gaussian parameters into two parts: color + 3D position are generated by the diffusion model (directly supervised via images + SfM), while opacity + covariance are predicted deterministically by the Gaussian Head (supervised via rendering loss).
- Design Motivation: Color and 3D position can be assigned pseudo ground truth from dense SfM, whereas direct supervision for opacity and covariance is difficult to obtain; however, given color and position, the ambiguity of the latter is substantially reduced.
- Geometry Variational Autoencoder (Geometry VAE):
- Encoder: Convolutional architecture; inputs are a pointmap (3D coordinate map) + camera raymap (6D ray map), encoded into a \(64\times64\times8\) latent code.
- Decoder: Employs a Transformer architecture (ViT-B, 12 layers, 768-dim, patch size 2, sequence length 1024) rather than a conventional convolutional decoder.
- Key Finding: Pretrained image VAEs are entirely unsuitable for geometry data—they fail catastrophically on unbounded real scenes (AbsRel increases from 0.67% to 15–18%).
- Training losses: reconstruction loss (with distance weighting) + KL divergence + gradient loss (for sharper boundaries).
- Resolution: trained at \(256\times256\) for 3M iterations → fine-tuned at \(512\times512\) for 250k iterations.
- Multi-view Geometry Latent Diffusion Model (a shape-level input sketch follows this list):
- Initialized from a pretrained multi-view image diffusion model (CAT3D); input channels are extended to accept geometry latent codes (8D image latent + 8D geometry latent + 6D camera raymap + 1D conditioning mask = 23D input).
- Uses v-parameterization and v-prediction loss.
- U-Net architecture with full 3D attention on feature maps of resolution \(32\times32\) and below.
- First trained on 8 viewpoints for 700k iterations, then fine-tuned on 16 viewpoints for 70k iterations.
- Gaussian Head:
- Multi-view design: Uses a U-ViT architecture; 8 viewpoints serve as input, with cross-attention enabling cross-view information exchange (determining visibility → modulating opacity).
- \(4\times\) patchification → Transformer blocks (3 layers, 128-dim, 8 heads) → unpatchify.
- Outputs: 3-channel color + 3-channel scale + 4-channel rotation + 1-channel opacity.
- Training: L2 photometric loss + LPIPS perceptual loss (weight 0.05).
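As referenced in the Factorized Sampling and Gaussian Head items above, here is a toy sketch of the resulting per-pixel Splatter-Image parameterization and the head's training objective. The channel counts and the 0.05 LPIPS weight come from the text; the tensor names, activations, and the `lpips_fn` stand-in are assumptions.

```python
import torch
import torch.nn.functional as F

H = W = 4  # toy resolution; the real model works per pixel at 512x512

# Sampled by the diffusion model (directly supervised via images + SfM):
color = torch.rand(H, W, 3)   # stand-in for a decoded appearance sample
xyz = torch.randn(H, W, 3)    # stand-in for a decoded pointmap sample

# Regressed deterministically by the Gaussian head (rendering loss);
# random stand-ins here, with the activations one would expect:
opacity = torch.sigmoid(torch.randn(H, W, 1))
scale = torch.exp(torch.randn(H, W, 3))
rotation = F.normalize(torch.randn(H, W, 4), dim=-1)  # unit quaternion

splatter_image = torch.cat([color, xyz, opacity, scale, rotation], dim=-1)
print(splatter_image.shape)  # torch.Size([4, 4, 14]): 3+3+1+3+4 channels

# Gaussian head training objective as described: L2 photometric loss plus
# LPIPS perceptual loss weighted by 0.05. `lpips_fn` is a stand-in for a
# perceptual metric such as the `lpips` package's LPIPS(net='vgg').
def head_loss(rendered, target, lpips_fn):
    return F.mse_loss(rendered, target) + 0.05 * lpips_fn(rendered, target).mean()
```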
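And a shape-level sketch of the 23-channel diffusion input referenced in the Multi-view Geometry Latent Diffusion Model item, together with the standard v-parameterization target \(v = \alpha_t \epsilon - \sigma_t x_0\) (Salimans & Ho). Only the channel counts come from the text; the layout and ordering are assumed.

```python
import torch

B, V, H, W = 1, 16, 64, 64  # batch, views, latent resolution

image_latent = torch.randn(B, V, 8, H, W)  # appearance latent (8D)
geo_latent = torch.randn(B, V, 8, H, W)    # geometry latent (8D)
raymap = torch.randn(B, V, 6, H, W)        # camera rays: origins + directions
cond_mask = torch.zeros(B, V, 1, H, W)     # 1 marks conditioning views

x_in = torch.cat([image_latent, geo_latent, raymap, cond_mask], dim=2)
print(x_in.shape)  # torch.Size([1, 16, 23, 64, 64])

# v-prediction target in its standard form, computed over the 16 denoised
# latent channels (8 appearance + 8 geometry); raymap and mask are
# conditioning only.
x0 = torch.cat([image_latent, geo_latent], dim=2)
eps = torch.randn_like(x0)
alpha_t, sigma_t = 0.8, 0.6  # toy noise-schedule values
v_target = alpha_t * eps - sigma_t * x0
```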
Loss & Training¶
Three-stage training:
| Stage | Content | Training Budget |
|---|---|---|
| Stage 1: Geometry VAE | \(256\times256\) training → \(512\times512\) fine-tuning | 3M + 250k iter |
| Stage 2: Gaussian Head | Given GT color and auto-encoded geometry → rendering loss | 100k iter |
| Stage 3: Latent Diffusion Model | Initialized from CAT3D → 8 views → 16 views | 700k + 70k iter |
Data:

- MASt3R is run on CO3D, MVImgNet, RealEstate10K, and DL3DV-7K (~300k scenes in total) to obtain dense 3D pseudo ground truth.
- Synthetic data (Objaverse + an internal object dataset) is additionally used; synthetic:real ratio = 1:2.
- Geometry VAE loss weighting: points farther from the scene center receive lower weights (\(w = 1/\max(1, d^2)\), where \(d\) is the distance to the scene center).
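A minimal sketch of the distance-based loss weighting just described. The exact functional form in the paper may differ; an inverse clamped-square weight is one consistent reading of "farther points receive lower weights".

```python
import torch

def weighted_pointmap_loss(pred, gt):
    """Down-weight reconstruction error for distant points:
    w = 1 / max(1, d^2), with d the distance to the scene center,
    so unbounded background geometry does not dominate training."""
    d2 = gt.pow(2).sum(dim=-1)      # squared distance per pixel
    w = 1.0 / d2.clamp(min=1.0)     # w = 1 / max(1, d^2)
    return (w * (pred - gt).pow(2).sum(dim=-1)).mean()

pred = torch.randn(2, 256, 256, 3)
gt = 5.0 * torch.randn(2, 256, 256, 3)  # unbounded scene: some far points
print(weighted_pointmap_loss(pred, gt))
```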
Key Experimental Results¶
Comparison with Feed-Forward Regression Methods¶
| Dataset | Setting | Method | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ |
|---|---|---|---|---|---|---|
| RE10K | 1-view | Flash3D | 17.40 | 0.699 | 0.419 | 96.9 |
| RE10K | 1-view | Bolt3D | 21.03 | 0.805 | 0.257 | 55.5 |
| CO3D | 1-view | Flash3D | 14.43 | 0.552 | 0.608 | 174.8 |
| CO3D | 1-view | Bolt3D | 16.78 | 0.562 | 0.505 | 97.5 |
| DL3DV | 2-view | DepthSplat | 16.25 | 0.515 | 0.465 | 95.9 |
| DL3DV | 2-view | Bolt3D | 17.75 | 0.551 | 0.392 | 64.5 |
| DL3DV | 4-view | DepthSplat | 19.48 | 0.638 | 0.327 | 58.8 |
| DL3DV | 4-view | Bolt3D | 20.64 | 0.653 | 0.310 | 48.2 |
→ Largest gains are observed in the 1-view setting (PSNR +3.63 dB), validating the advantage of generative methods in modeling ambiguity.
Comparison with Feed-Forward 3D Generative Methods¶
| Dataset | Setting | Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|---|
| RE10K | 1-view | Wonderland | 17.15 | 0.550 | 0.292 |
| RE10K | 1-view | Bolt3D | 21.54 | 0.747 | 0.234 |
| RE10K | 2-view | LatentSplat | 22.62 | 0.777 | 0.196 |
| RE10K | 2-view | Bolt3D | 23.13 | 0.806 | 0.166 |
→ Wonderland requires ~5 min/scene via a video model; Bolt3D requires only 6 seconds.
Comparison with Optimization-Based Methods (Speed–Quality Trade-off)¶
| Dataset | Method | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | gpu-min↓ |
|---|---|---|---|---|---|---|
| RE10K | CAT3D | 29.56 | 0.937 | 0.134 | 13.75 | 77.28 |
| RE10K | Bolt3D | 27.00 | 0.905 | 0.154 | 27.40 | 0.25 |
| LLFF | CAT3D | 22.06 | 0.745 | 0.194 | 37.54 | 80.00 |
| LLFF | Bolt3D | 18.75 | 0.562 | 0.341 | 96.61 | 0.25 |
| DTU | CAT3D | 19.97 | 0.809 | 0.202 | 41.76 | 72.00 |
| DTU | Bolt3D | 18.59 | 0.738 | 0.312 | 67.49 | 0.25 |
→ Quality is slightly below CAT3D (PSNR gap of 1–3 dB), but inference cost is reduced by ~300× (0.25 vs. 72–80 gpu-min).
Ablation Study¶
- Geometry VAE ablation:
- Removing encoder training → AbsRel increases from 0.67% to 1.63%.
- Removing distance weighting → \(\delta_{1.01}\) drops from 81.5% to 56.7%.
- Removing gradient loss → reprojection error increases to 2.96.
- Gaussian Head ablation:
- Removing cross-attention → PSNR drops from 24.88 to 23.80.
- Removing the Gaussian Head entirely → PSNR drops to 21.94.
- Learning XYZ from rendering loss → PSNR drops to 21.88 (inferior to explicit geometry supervision).
- Removing ray clipping → PSNR drops to 20.78.
- Image VAE vs. Geometry VAE:
- Pretrained image VAE + mean depth scaling → AbsRel = 17.9% on real data.
- Dedicated geometry VAE (proposed) → AbsRel = 0.67% on real data (26× improvement).
- Key finding: Image VAE is marginally acceptable for bounded synthetic data but fails completely on unbounded real scenes.
Highlights & Insights¶
- Extreme speed–quality trade-off: Generates a complete 3D scene in 6.25 seconds on an H100 (15 seconds on an A100), 300× faster than CAT3D with acceptable quality loss.
- Pioneering analysis of Geometry VAE: The first systematic study of VAE design for 3D geometry data, finding that Transformer decoders outperform convolutional decoders for geometry (convolutions produce curved artifacts) and that image-pretrained VAEs do not transfer to unbounded geometry.
- Elegant factorized sampling design: Treating "directly supervisable quantities" (color + position) and "rendering-loss-supervised quantities" (opacity + covariance) separately leverages large-scale SfM data while circumventing the annotation difficulty for opacity and covariance.
- Reuse of 2D diffusion architectures: The 3D generation problem is reformulated as joint generation of multiple 2D Splatter Images, reusing mature 2D diffusion architectures and pretrained weights.
- Large-scale pseudo ground-truth construction: Running MASt3R on 300k scenes to build a large-scale geometry dataset addresses the bottleneck of 3D data scarcity.
Limitations & Future Work¶
- Difficulty with thin structures: Structures narrower than 8 pixels (due to the 8× downsampling in the geometry VAE) are difficult to reconstruct.
- Failure on transparent/specular surfaces: SfM geometry reconstruction is unreliable for non-Lambertian surfaces, degrading training data quality.
- Sensitivity to camera path: The method is sensitive to the up-direction of target cameras and scene scale, requiring better data augmentation.
- Limited number of viewpoints: Only 16 Splatter Images are generated (vs. 800 in CAT3D), resulting in incomplete scene coverage and a quality bottleneck.
- Static scenes only: Future work could incorporate multi-view video diffusion models to generate dynamic 3D content.
- Depth vs. pointmap: This work compresses pointmaps, but concurrent work suggests depth maps may be superior—worth exploring.
- Weaker FID: FID on RE10K is 27.40 (vs. 13.75 for CAT3D), indicating a remaining gap in distribution-level visual quality.
Related Work & Insights¶
| Method | Type | 3D Repr. | Inference Time | Input | Quality |
|---|---|---|---|---|---|
| Flash3D | Feed-forward regression | 3DGS | ~seconds | 1-view | Blurry unseen regions |
| DepthSplat | Feed-forward regression | 3DGS | ~seconds | 2–4 view | Requires feature matching |
| LatentSplat | Feed-forward generative (VAE-GAN) | 3DGS | ~seconds | 2-view | Limited to single category / low resolution |
| Wonderland | Video model + 3DGS | 3DGS | ~5 min | 1-view | No explicit geometry model |
| CAT3D | Multi-view diffusion + optimization | 3DGS/NeRF | ~5 min (16 GPU) | 1–3 view | High quality but very slow |
| Bolt3D | Latent diffusion (feed-forward) | 3DGS | ~7 sec (1 GPU) | 1–4 view | Approaching CAT3D |
Core distinction: Bolt3D is the first to realize an end-to-end feed-forward paradigm in which a diffusion model directly outputs 3DGS without any subsequent optimization.
Broader implications:

- General value of the Geometry VAE: This work demonstrates the necessity of a dedicated geometry VAE (particularly with a Transformer decoder) for unbounded scenes. This component can transfer to other tasks requiring latent-space geometry encoding (e.g., 3D completion, 4D generation, scene reconstruction in robotic manipulation).
- Generalizability of factorized sampling: The idea of separately modeling "directly supervisable quantities" and "rendering-supervised quantities" applies broadly to any 3D generation task that must learn from imperfect supervision.
- Data perspective: Constructing a geometry dataset of 300k scenes by running SfM on existing multi-view datasets exemplifies a "compute-for-data" strategy with significant implications for the 3D community.
Rating¶
- Novelty: ⭐⭐⭐⭐ First to combine latent diffusion with a dedicated geometry VAE for second-level 3D scene generation; factorized sampling is elegant. However, core components (Splatter Image, multi-view diffusion) build on prior work.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive comparison across three method categories (regression / feed-forward generative / optimization-based), five datasets, detailed VAE and Gaussian Head ablations, and fair low-resolution comparisons.
- Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, well-motivated contributions, rich figures (interactive viewer), and thorough supplementary material.
- Value: ⭐⭐⭐⭐⭐ Advances 3D scene generation from minute-level to second-level inference; the 300× speedup carries significant practical value, and the geometry VAE analysis provides important guidance for future work.