
Bolt3D: Generating 3D Scenes in Seconds

Conference: ICCV 2025
arXiv: 2503.14445
Code: szymanowiczs.github.io/bolt3d (project page; code not released)
Area: 3D Vision / 3D Scene Generation / Novel View Synthesis
Keywords: Latent Diffusion Model, 3D Gaussian Representation, Feed-Forward Generation, Geometry VAE, Splatter Image
Authors: Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Hołyński, Ricardo Martin-Brualla, Jonathan T. Barron, Philipp Henzler (Google Research / Oxford)

TL;DR

A feed-forward 3D scene generation method based on latent diffusion models that represents 3D scenes as multiple sets of Splatter Images and employs a dedicated geometry VAE, generating a complete 3D scene on a single GPU in 7 seconds—reducing inference cost by 300× compared to optimization-based methods (CAT3D).

Background & Motivation

  1. 2D generative models cannot directly output 3D scenes: Existing image/video generative models produce 2D content, which is unsuitable for interactive visualization and editing.
  2. Severe scarcity of 3D data: Compared to the abundance of 2D image data, ground-truth 3D scene data is extremely limited, making direct training of 3D generative models challenging.
  3. Inefficiency of existing 3D generation methods:
    • Multi-view diffusion + per-scene optimization methods (e.g., CAT3D): generate 800 images → optimize 3DGS/NeRF, requiring minutes to hours.
    • Feed-forward regression methods (Flash3D, DepthSplat): fast but unable to handle ambiguity (blurry unseen regions).
    • Feed-forward generative methods (LatentSplat, Wonderland): either limited to single categories or reliant on slow video models (~5 min/scene).
  4. Key Challenge: How to achieve second-level inference while maintaining high-quality 3D generation?

Core Problem

How to design a diffusion model that directly outputs a renderable 3D scene representation while: (1) leveraging mature 2D diffusion architectures for scalability and generalization; (2) accurately modeling 3D geometry rather than merely generating 2D images; and (3) handling ambiguity in unobserved regions?

Method

Overall Architecture

Bolt3D adopts a two-stage feed-forward pipeline (a shape-level sketch follows):

  1. Multi-view latent diffusion model: takes 1–4 posed input images + target camera poses → jointly denoises appearance latent codes + geometry latent codes for 16 viewpoints.
  2. VAE decoding: appearance latent codes are decoded into RGB images using a pretrained image VAE; geometry latent codes are decoded into pointmaps (per-pixel 3D coordinates) using a dedicated geometry VAE.
  3. Gaussian Head: receives the decoded images, pointmaps, and camera poses → feed-forward prediction of per-pixel 3D Gaussian opacity, covariance, and refined color → forms Splatter Images.
  4. All Splatter Images are merged into a complete 3D Gaussian scene.
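
To make the step order concrete, here is a minimal shape-level sketch of the pipeline in Python. The function bodies are placeholders, and all names and signatures are illustrative assumptions rather than the (unreleased) implementation; only the stated quantities (16 views, \(64\times64\times8\) latents, 11-channel Gaussian Head output) come from the summary above.

```python
# Minimal sketch of the Bolt3D inference pipeline (stage order and tensor shapes only).
import numpy as np

N_VIEWS, H, W = 16, 512, 512           # 16 generated viewpoints; 512x512 decode resolution
LAT_H, LAT_W = 64, 64                  # 8x-downsampled latent grid

def multiview_latent_diffusion(cond_images, cond_poses, target_poses):
    """Jointly denoise appearance + geometry latents for all 16 views (placeholder)."""
    appearance = np.zeros((N_VIEWS, LAT_H, LAT_W, 8), dtype=np.float32)  # 8D image latent
    geometry   = np.zeros((N_VIEWS, LAT_H, LAT_W, 8), dtype=np.float32)  # 8D geometry latent
    return appearance, geometry

def decode_image_vae(appearance):      # pretrained image VAE (placeholder)
    return np.zeros((N_VIEWS, H, W, 3), dtype=np.float32)   # RGB per view

def decode_geometry_vae(geometry):     # dedicated geometry VAE with ViT decoder (placeholder)
    return np.zeros((N_VIEWS, H, W, 3), dtype=np.float32)   # pointmap: per-pixel XYZ

def gaussian_head(images, pointmaps, poses):
    """Feed-forward prediction of opacity, covariance, and refined color (placeholder)."""
    return np.zeros((N_VIEWS, H, W, 11), dtype=np.float32)  # 3 color + 3 scale + 4 rot + 1 opacity

def bolt3d(cond_images, cond_poses, target_poses):
    app_lat, geo_lat = multiview_latent_diffusion(cond_images, cond_poses, target_poses)
    images    = decode_image_vae(app_lat)
    pointmaps = decode_geometry_vae(geo_lat)
    splatter  = gaussian_head(images, pointmaps, target_poses)
    # Merge all per-view Splatter Images into one 3D Gaussian scene:
    # every pixel contributes a Gaussian centered at its pointmap position.
    means  = pointmaps.reshape(-1, 3)
    params = splatter.reshape(-1, 11)
    return means, params
```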

Key Designs

  1. Factorized Sampling:

    • Decomposes 3D Gaussian parameters into two parts: color + 3D position are generated by the diffusion model (directly supervised via images + SfM), while opacity + covariance are predicted deterministically by the Gaussian Head (supervised via rendering loss).
    • Design Motivation: Color and 3D position can be assigned pseudo ground truth from dense SfM, whereas direct supervision for opacity and covariance is difficult to obtain; however, given color and position, the ambiguity of the latter is substantially reduced.
  2. Geometry Variational Autoencoder (Geometry VAE):

    • Encoder: Convolutional architecture; inputs are a pointmap (3D coordinate map) + camera raymap (6D ray map), encoded into a \(64\times64\times8\) latent code.
    • Decoder: Employs a Transformer architecture (ViT-B, 12 layers, 768-dim, patch size 2, sequence length 1024) rather than a conventional convolutional decoder.
    • Key Finding: Pretrained image VAEs are entirely unsuitable for geometry data—they fail catastrophically on unbounded real scenes (AbsRel increases from 0.67% to 15–18%).
    • Training losses: reconstruction loss (with distance weighting) + KL divergence + gradient loss (for sharper boundaries).
    • Resolution: trained at \(256\times256\) for 3M iterations → fine-tuned at \(512\times512\) for 250k iterations.
  3. Multi-view Geometry Latent Diffusion Model:

    • Initialized from a pretrained multi-view image diffusion model (CAT3D); input channels are extended to accept geometry latent codes (8D image latent + 8D geometry latent + 6D camera raymap + 1D conditioning mask = 23 channels per view; see the channel-layout sketch after this list).
    • Uses v-parameterization and v-prediction loss.
    • U-Net architecture with full 3D attention on feature maps of resolution \(32\times32\) and below.
    • First trained on 8 viewpoints for 700k iterations, then fine-tuned on 16 viewpoints for 70k iterations.
  4. Gaussian Head:

    • Multi-view design: Uses a U-ViT architecture; 8 viewpoints serve as input, with cross-attention enabling cross-view information exchange (determining visibility → modulating opacity).
    • \(4\times\) patchification → Transformer blocks (3 layers, 128-dim, 8 heads) → unpatchify.
    • Outputs: 3-channel color + 3-channel scale + 4-channel rotation + 1-channel opacity (see the parameter-split sketch after this list).
    • Training: L2 photometric loss + LPIPS perceptual loss (weight 0.05).
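
As a concrete illustration of the 23-channel diffusion input described in item 3, the sketch below concatenates the per-view tensors at latent resolution. Variable names and the choice to encode the raymap at latent resolution are assumptions for illustration only.

```python
# Sketch of the 23-channel per-view input to the multi-view latent diffusion U-Net:
# 8 image-latent + 8 geometry-latent + 6 camera-raymap + 1 conditioning-mask channels.
import numpy as np

LAT_H, LAT_W = 64, 64
image_latent    = np.zeros((LAT_H, LAT_W, 8))
geometry_latent = np.zeros((LAT_H, LAT_W, 8))
camera_raymap   = np.zeros((LAT_H, LAT_W, 6))   # per-pixel ray encoding (6D)
cond_mask       = np.ones((LAT_H, LAT_W, 1))    # 1 = conditioning view, 0 = target view

unet_input = np.concatenate(
    [image_latent, geometry_latent, camera_raymap, cond_mask], axis=-1
)
assert unet_input.shape[-1] == 23
```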
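
The next sketch shows one plausible way to split the Gaussian Head's 11-channel output into per-pixel 3D Gaussian parameters, with positions taken from the decoded pointmap as in the factorized-sampling design. The activation choices (exp for scale, sigmoid for opacity, quaternion normalization) are common 3DGS conventions assumed here, not details confirmed by the summary above.

```python
# Splitting the 11-channel Gaussian Head output into per-pixel Gaussian parameters.
import numpy as np

def split_gaussian_params(head_out, pointmap):
    """head_out: (H, W, 11); pointmap: (H, W, 3) -> dict of per-pixel Gaussians."""
    color   = head_out[..., 0:3]                           # refined RGB
    scale   = np.exp(head_out[..., 3:6])                   # positive scales (assumed exp)
    rot     = head_out[..., 6:10]
    rot     = rot / (np.linalg.norm(rot, axis=-1, keepdims=True) + 1e-8)  # unit quaternion
    opacity = 1.0 / (1.0 + np.exp(-head_out[..., 10:11]))  # sigmoid (assumed)
    return {
        "mean": pointmap,      # position comes from the diffusion-generated pointmap
        "color": color,
        "scale": scale,
        "rotation": rot,
        "opacity": opacity,
    }
```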

Loss & Training

Three-stage training:

| Stage | Content | Training Budget |
|---|---|---|
| Stage 1: Geometry VAE | \(256\times256\) training → \(512\times512\) fine-tuning | 3M + 250k iter |
| Stage 2: Gaussian Head | Given GT color and auto-encoded geometry → rendering loss | 100k iter |
| Stage 3: Latent Diffusion Model | Initialized from CAT3D → 8 views → 16 views | 700k + 70k iter |

Data:

  • MASt3R is run on CO3D, MVImg, RealEstate10K, and DL3DV-7K (~300k scenes in total) to obtain dense 3D pseudo ground truth.
  • Synthetic data (Objaverse + an internal object dataset) is additionally used; the synthetic:real ratio is 1:2.
  • Geometry VAE loss weighting: points farther from the scene center receive lower weight, with the per-point loss divided by \(\max(1, d^2)\) for distance \(d\) from the scene center (see the sketch below).
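
A minimal sketch of the distance-weighted pointmap reconstruction loss, under the assumption that "lower weights for farther points" means dividing the per-pixel error by \(\max(1, d^2)\); the exact form used in the paper may differ, and the function name is illustrative.

```python
# Distance-weighted pointmap reconstruction loss (down-weights far-away points).
import numpy as np

def weighted_pointmap_loss(pred, gt, scene_center):
    """pred, gt: (H, W, 3) pointmaps; scene_center: (3,)."""
    d2 = np.sum((gt - scene_center) ** 2, axis=-1)     # squared distance per pixel
    weight = 1.0 / np.maximum(1.0, d2)                 # lower weight for distant points
    per_pixel = np.sum((pred - gt) ** 2, axis=-1)      # L2 error per pixel
    return np.mean(weight * per_pixel)
```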

Key Experimental Results

Comparison with Feed-Forward Regression Methods

| Dataset | Setting | Method | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ |
|---|---|---|---|---|---|---|
| RE10K | 1-view | Flash3D | 17.40 | 0.699 | 0.419 | 96.9 |
| RE10K | 1-view | Bolt3D | 21.03 | 0.805 | 0.257 | 55.5 |
| CO3D | 1-view | Flash3D | 14.43 | 0.552 | 0.608 | 174.8 |
| CO3D | 1-view | Bolt3D | 16.78 | 0.562 | 0.505 | 97.5 |
| DL3DV | 2-view | DepthSplat | 16.25 | 0.515 | 0.465 | 95.9 |
| DL3DV | 2-view | Bolt3D | 17.75 | 0.551 | 0.392 | 64.5 |
| DL3DV | 4-view | DepthSplat | 19.48 | 0.638 | 0.327 | 58.8 |
| DL3DV | 4-view | Bolt3D | 20.64 | 0.653 | 0.310 | 48.2 |

→ Largest gains are observed in the 1-view setting (PSNR +3.63 dB), validating the advantage of generative methods in modeling ambiguity.

Comparison with Feed-Forward 3D Generative Methods

| Dataset | Setting | Method | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|---|
| RE10K | 1-view | Wonderland | 17.15 | 0.550 | 0.292 |
| RE10K | 1-view | Bolt3D | 21.54 | 0.747 | 0.234 |
| RE10K | 2-view | LatentSplat | 22.62 | 0.777 | 0.196 |
| RE10K | 2-view | Bolt3D | 23.13 | 0.806 | 0.166 |

→ Wonderland requires ~5 min/scene via a video model; Bolt3D requires only 6 seconds.

Comparison with Optimization-Based Methods (Speed–Quality Trade-off)

| Dataset | Method | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | gpu-min↓ |
|---|---|---|---|---|---|---|
| RE10K | CAT3D | 29.56 | 0.937 | 0.134 | 13.75 | 77.28 |
| RE10K | Bolt3D | 27.00 | 0.905 | 0.154 | 27.40 | 0.25 |
| LLFF | CAT3D | 22.06 | 0.745 | 0.194 | 37.54 | 80.00 |
| LLFF | Bolt3D | 18.75 | 0.562 | 0.341 | 96.61 | 0.25 |
| DTU | CAT3D | 19.97 | 0.809 | 0.202 | 41.76 | 72.00 |
| DTU | Bolt3D | 18.59 | 0.738 | 0.312 | 67.49 | 0.25 |

→ Quality is slightly below CAT3D (PSNR gap of 1–3 dB), but inference cost is reduced by ~300× (0.25 vs. 72–80 gpu-min).

Ablation Study

  1. Geometry VAE ablation:

    • Removing encoder training → AbsRel increases from 0.67% to 1.63%.
    • Removing distance weighting → \(\delta_{1.01}\) drops from 81.5% to 56.7%.
    • Removing gradient loss → reprojection error increases to 2.96.
  2. Gaussian Head ablation:

    • Removing cross-attention → PSNR drops from 24.88 to 23.80.
    • Removing the Gaussian Head entirely → PSNR drops to 21.94.
    • Learning XYZ from rendering loss → PSNR drops to 21.88 (inferior to explicit geometry supervision).
    • Removing ray clipping → PSNR drops to 20.78.
  3. Image VAE vs. Geometry VAE:

    • Pretrained image VAE + mean depth scaling → AbsRel = 17.9% on real data.
    • Dedicated geometry VAE (proposed) → AbsRel = 0.67% on real data (26× improvement).
    • Key finding: Image VAE is marginally acceptable for bounded synthetic data but fails completely on unbounded real scenes.

Highlights & Insights

  1. Extreme speed–quality trade-off: Generates a complete 3D scene in 6.25 seconds on an H100 (15 seconds on an A100), 300× faster than CAT3D with acceptable quality loss.
  2. Pioneering analysis of Geometry VAE: The first systematic study of VAE design for 3D geometry data, finding that Transformer decoders outperform convolutional decoders for geometry (convolutions produce curved artifacts) and that image-pretrained VAEs do not transfer to unbounded geometry.
  3. Elegant factorized sampling design: Treating "directly supervisable quantities" (color + position) and "rendering-loss-supervised quantities" (opacity + covariance) separately leverages large-scale SfM data while circumventing the annotation difficulty for opacity and covariance.
  4. Reuse of 2D diffusion architectures: The 3D generation problem is reformulated as joint generation of multiple 2D Splatter Images, reusing mature 2D diffusion architectures and pretrained weights.
  5. Large-scale pseudo ground-truth construction: Running MASt3R on 300k scenes to build a large-scale geometry dataset addresses the bottleneck of 3D data scarcity.

Limitations & Future Work

  1. Difficulty with thin structures: Structures narrower than 8 pixels (due to the 8× downsampling in the geometry VAE) are difficult to reconstruct.
  2. Failure on transparent/specular surfaces: SfM geometry reconstruction is unreliable for non-Lambertian surfaces, degrading training data quality.
  3. Sensitivity to camera path: The method is sensitive to the up-direction of target cameras and scene scale, requiring better data augmentation.
  4. Limited number of viewpoints: Only 16 Splatter Images are generated (vs. 800 in CAT3D), resulting in incomplete scene coverage and a quality bottleneck.
  5. Static scenes only: Future work could incorporate multi-view video diffusion models to generate dynamic 3D content.
  6. Depth vs. pointmap: This work compresses pointmaps, but concurrent work suggests depth maps may be superior—worth exploring.
  7. Weaker FID: FID on RE10K is 27.40 (vs. 13.75 for CAT3D), indicating a remaining gap in distribution-level visual quality.

Comparison with Related Methods

| Method | Type | 3D Repr. | Inference Time | Input | Notes |
|---|---|---|---|---|---|
| Flash3D | Feed-forward regression | 3DGS | ~seconds | 1-view | Blurry unseen regions |
| DepthSplat | Feed-forward regression | 3DGS | ~seconds | 2–4 view | Requires feature matching |
| LatentSplat | Feed-forward generative (VAE-GAN) | 3DGS | ~seconds | 2-view | Limited to single category / low resolution |
| Wonderland | Video model + 3DGS | 3DGS | ~5 min | 1-view | No explicit geometry model |
| CAT3D | Multi-view diffusion + optimization | 3DGS/NeRF | ~5 min (16 GPUs) | 1–3 view | High quality but very slow |
| Bolt3D | Latent diffusion (feed-forward) | 3DGS | ~7 sec (1 GPU) | 1–4 view | Approaching CAT3D quality |

Core distinction: Bolt3D is the first to realize an end-to-end feed-forward paradigm in which a diffusion model directly outputs 3DGS without any subsequent optimization.

Broader implications:

  • General value of the Geometry VAE: This work demonstrates the necessity of a dedicated geometry VAE (particularly with a Transformer decoder) for unbounded scenes. This component can transfer to other tasks requiring latent-space geometry encoding (e.g., 3D completion, 4D generation, scene reconstruction in robotic manipulation).
  • Generalizability of factorized sampling: The idea of separately modeling "directly supervisable quantities" and "rendering-supervised quantities" applies broadly to any 3D generation task that must learn from imperfect supervision.
  • Data perspective: Constructing a geometry dataset of 300k scenes by running SfM on existing multi-view datasets exemplifies a "compute-for-data" strategy with significant implications for the 3D community.

Rating

  • Novelty: ⭐⭐⭐⭐ First to combine latent diffusion with a dedicated geometry VAE for second-level 3D scene generation; factorized sampling is elegant. However, core components (Splatter Image, multi-view diffusion) build on prior work.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive comparison across three method categories (regression / feed-forward generative / optimization-based), five datasets, detailed VAE and Gaussian Head ablations, and fair low-resolution comparisons.
  • Writing Quality: ⭐⭐⭐⭐⭐ Clear structure, well-motivated contributions, rich figures (interactive viewer), and thorough supplementary material.
  • Value: ⭐⭐⭐⭐⭐ Advances 3D scene generation from minute-level to second-level inference; the 300× speedup carries significant practical value, and the geometry VAE analysis provides important guidance for future work.