
LT3SD: Latent Trees for 3D Scene Diffusion

Conference: CVPR 2025
arXiv: 2409.08215
Code: https://quan-meng.github.io/projects/lt3sd
Area: 3D Vision
Keywords: 3D scene generation, latent diffusion, latent tree, TUDF, coarse-to-fine, patch-based, unconditional generation

TL;DR

LT3SD progressively decomposes 3D scenes into latent trees, where each layer holds a geometry volume and a high-frequency latent feature volume. A patch-based diffusion model trained on this representation generates scenes coarse-to-fine and patch by patch, which supports arbitrarily large 3D scenes and lowers FID by about 70% relative to the previous best method (13.39 vs. 45.55).

Background & Motivation

Background: Diffusion models have achieved breakthroughs in 2D image generation, while 3D diffusion models mainly focus on object-level generation. Due to high geometric complexity, scarce data, and indefinite spatial scales, generating 3D scenes is far more challenging than generating objects.

Limitations of Prior Work: (1) Object-level 3D diffusion (PVD, NFD) assumes shapes are in a normalized space and uses compact representations like global latent codes or tri-planes, which cannot scale to unstructured scenes; (2) The three planes in tri-plane representations (BlockFusion) are highly coupled, requiring complex synchronization for scene extrapolation; (3) Existing scene generation methods (SSG, SemCity) are limited to low-resolution or semantic scenes, lacking geometric details.

Key Challenge: The 3D scene signal is highly non-uniform (with large empty areas and details concentrated near surfaces), requiring a representation that can efficiently encode multi-scale information from global structures to local details.

Key Insight: Design a hierarchical latent tree representation that decomposes the scene into complementary geometric (low-frequency) and latent feature (high-frequency) encodings at multiple resolution levels. This decomposition naturally supports coarse-to-fine, patch-based generation.

Method

Overall Architecture

The method has two stages:
1. Latent Tree Encoding (Stage 1): An encoder/decoder pair is learned to progressively decompose the 3D scene TUDF (truncated unsigned distance field) grid into a multi-layer latent tree.
2. Patch-based Latent Diffusion (Stage 2): A diffusion model is trained on each resolution layer of the latent tree, unconditional at the coarsest layer and conditioned on the geometry volume at finer layers.

Key Designs

1. Latent Tree Representation
- Function: Progressively decomposes a high-resolution 3D scene TUDF grid into an $N$-layer tree, where each layer $i$ contains a geometry volume $L_i^s$ (TUDF) and a latent feature volume $H_i^s$.
- Mechanism: A 3D CNN encoder decomposes a high-resolution patch $L_{i+1}$ into a low-resolution TUDF $L_i$ (obtained by average-pooling downsampling) and a latent feature $H_i$ (predicted by the CNN):
$$\mathcal{E}_{i+1}(L_{i+1}) \Rightarrow [L_i, H_i]$$
A 3D CNN decoder reconstructs the high-resolution TUDF from $L_i$ and $H_i$:
$$\mathcal{D}_{i+1}([L_i, H_i]) \Rightarrow L_{i+1}$$
- Comparison with Alternatives: Compared to a cascaded latent model, the latent tree needs less storage (×0.80), trains faster (×0.87), and reconstructs more accurately (error 3.20 vs. 4.91 ×10⁻⁴), because the cascaded model encodes redundant information independently at each layer.
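The complementary split into $[L_i, H_i]$ can be illustrated with a minimal sketch. It is not the paper's learned encoder/decoder: as an assumption for illustration, the latent feature $H_i$ is replaced by the eight hand-computed residuals that 2× average pooling discards per coarse voxel, and `toy_tudf` builds a hypothetical wall plane rather than a real scene. The point is only that the coarse geometry plus the high-frequency remainder suffices to rebuild the finer layer.

```python
from itertools import product

# Offsets of the 8 fine voxels that share one coarse voxel.
OFFSETS = list(product((0, 1), repeat=3))


def encode(fine):
    """Toy stand-in for E_{i+1}: split a cubic grid into (L_i, H_i).

    L_i is the 2x average-pooled grid; H_i stores, per coarse voxel, the 8
    residuals that pooling discarded (the paper learns H_i with a 3D CNN).
    """
    n = len(fine) // 2
    coarse = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    latent = [[[None] * n for _ in range(n)] for _ in range(n)]
    for x, y, z in product(range(n), repeat=3):
        block = [fine[2 * x + dx][2 * y + dy][2 * z + dz] for dx, dy, dz in OFFSETS]
        mean = sum(block) / len(block)
        coarse[x][y][z] = mean
        latent[x][y][z] = [v - mean for v in block]
    return coarse, latent


def decode(coarse, latent):
    """Toy stand-in for D_{i+1}: rebuild L_{i+1} from [L_i, H_i]."""
    n = len(coarse)
    fine = [[[0.0] * (2 * n) for _ in range(2 * n)] for _ in range(2 * n)]
    for x, y, z in product(range(n), repeat=3):
        for (dx, dy, dz), residual in zip(OFFSETS, latent[x][y][z]):
            fine[2 * x + dx][2 * y + dy][2 * z + dz] = coarse[x][y][z] + residual
    return fine


def toy_tudf(n, wall_x=2.5, trunc=2.0):
    """Truncated unsigned distance to a hypothetical wall plane at x = wall_x."""
    return [[[min(abs(x - wall_x), trunc) for _ in range(n)] for _ in range(n)]
            for x in range(n)]


def max_abs_diff(a, b):
    """Largest per-voxel difference between two cubic grids of equal size."""
    n = len(a)
    return max(abs(a[x][y][z] - b[x][y][z]) for x, y, z in product(range(n), repeat=3))


# Build a 3-layer tree from an 8^3 grid: L3 (8^3) -> L2 (4^3) -> L1 (2^3).
L3 = toy_tudf(8)
L2, H2 = encode(L3)
L1, H1 = encode(L2)
# Decoding coarse-to-fine recovers the finest grid up to rounding error.
L3_rec = decode(decode(L1, H1), H2)
print(max_abs_diff(L3, L3_rec))
```

In this toy version the residuals are stored losslessly, so nothing is compressed; the learned $H_i$ in the paper is what makes the representation compact.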

2. Patch-based Conditional Diffusion
- Function: Trains a 3D UNet diffusion model at each layer of the latent tree to predict latent features $H_i$ from geometric volumes $L_i$.
- Mechanism: Conditional generation ($z=H_i$, $c=L_i$) is performed for layers $i>1$, while unconditional generation is performed for the coarsest layer $i=1$ (generating both $L_1$ and $H_1$). Training is conducted on randomly cropped patches:
$$\mathcal{L}_\text{diff} = \mathbb{E}_{z,c,\epsilon,t}[\|\epsilon - \mathcal{G}_i(z_t, t, c)\|_2^2]$$
- Design Motivation: Patch-level training shifts the focus from complex, unaligned full scenes to local regions with more shared structure, and it also serves as data augmentation.
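A single evaluation of $\mathcal{L}_\text{diff}$ on a randomly cropped patch might look like the sketch below. Several pieces are assumptions rather than details from the paper: the linear beta schedule and its constants, the toy volumes `L_i` and `H_i`, the helper names, and `zero_denoiser`, which is a hypothetical placeholder for the 3D UNet $\mathcal{G}_i$. The sketch only shows how $z$ and $c$ are cropped at the same location, how $z_t$ is formed, and what the $\epsilon$-prediction error measures.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal levels for a linear beta schedule (an assumed choice)."""
    alpha_bars, running = [], 1.0
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def crop_flat(volume, origin, size):
    """Cut a size^3 patch out of a nested-list volume and flatten it."""
    x0, y0, z0 = origin
    return [volume[x0 + x][y0 + y][z0 + z]
            for x in range(size) for y in range(size) for z in range(size)]


def add_noise(z, alpha_bar, rng):
    """Forward process: z_t = sqrt(alpha_bar) * z + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z]
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * v + b * e for v, e in zip(z, eps)], eps


def diffusion_loss(denoiser, z, c, t, alpha_bars, rng):
    """Squared L2 error between the true noise and denoiser(z_t, t, c)."""
    z_t, eps = add_noise(z, alpha_bars[t], rng)
    eps_hat = denoiser(z_t, t, c)
    return sum((e - p) ** 2 for e, p in zip(eps, eps_hat))


def zero_denoiser(z_t, t, c):
    """Hypothetical placeholder for the 3D UNet G_i; always predicts zero noise."""
    return [0.0] * len(z_t)


rng = random.Random(0)
alpha_bars = make_alpha_bars()
# Toy 8^3 volumes standing in for one layer's geometry L_i and latent H_i.
L_i = [[[float(x + y + z) for z in range(8)] for y in range(8)] for x in range(8)]
H_i = [[[float(x * y - z) for z in range(8)] for y in range(8)] for x in range(8)]
# Crop z and c at the same random location so the condition stays aligned.
origin = tuple(rng.randrange(8 - 4 + 1) for _ in range(3))
z, c = crop_flat(H_i, origin, 4), crop_flat(L_i, origin, 4)
t = rng.randrange(len(alpha_bars))
print(diffusion_loss(zero_denoiser, z, c, t, alpha_bars, rng))
```

Because each call draws a fresh crop origin, timestep, and noise sample, repeated calls over the same scene see different local patches, which is the sense in which patch-level training doubles as data augmentation.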

3. Patch-Stitched Generation of Large-Scale Scenes
- Function: Generates scenes of arbitrary size at inference time by combining patch stitching with coarse-to-fine hierarchical reconstruction.
- Mechanism:
  - The coarsest layer uses a Stable Inpainting scheme to generate patches autoregressively (known regions are kept fixed while unknown regions are diffused).
  - Higher-resolution layers use a MultiDiffusion scheme that denoises all patches in parallel and averages the predictions in overlapping regions.
- Design Motivation: Coarse layers contain few patches, so autoregressive generation is affordable and helps global consistency; fine layers contain many patches and benefit from parallel denoising (about 2 hours for 170 rooms, compared with BlockFusion's 3 hours for 7 rooms).
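The overlap-averaging idea behind the MultiDiffusion-style step can be sketched in one dimension. This is a simplified illustration, not the paper's implementation: the scene is a 1D list rather than a 3D volume, the function names are made up for this example, and `smooth` is a hypothetical stand-in for one denoising step of the per-layer diffusion model.

```python
def patch_origins(length, patch, stride):
    """Start indices of overlapping windows that together cover [0, length)."""
    origins = list(range(0, length - patch + 1, stride))
    if origins[-1] != length - patch:
        origins.append(length - patch)  # make sure the tail is covered
    return origins


def blend_patches(length, patch, origins, patch_values):
    """Average per-patch predictions wherever windows overlap."""
    total = [0.0] * length
    count = [0] * length
    for origin, values in zip(origins, patch_values):
        for k in range(patch):
            total[origin + k] += values[k]
            count[origin + k] += 1
    return [s / n for s, n in zip(total, count)]


def multidiffusion_step(scene, patch, stride, denoise_patch):
    """Denoise every window independently, then blend into one scene update."""
    origins = patch_origins(len(scene), patch, stride)
    # Each window only reads the current scene, so these calls could run in parallel.
    updates = [denoise_patch(scene[o:o + patch]) for o in origins]
    return blend_patches(len(scene), patch, origins, updates)


# Hypothetical denoiser: replaces each window by its mean value.
smooth = lambda window: [sum(window) / len(window)] * len(window)

scene = [float(i % 3) for i in range(10)]
print(multidiffusion_step(scene, patch=4, stride=2, denoise_patch=smooth))
```

Since every window is processed from the same input and only merged afterwards, the per-window work is independent, which is what makes this scheme attractive for the fine layers where the number of patches is large.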

Loss & Training

Latent Tree Training: $\mathcal{L}_\text{latent} = \|L_{i+1} - \mathcal{D}_{i+1}(\mathcal{E}_{i+1}(L_{i+1}))\|_2^2$
Diffusion Training: Standard denoising $\epsilon$-prediction loss

Key Experimental Results

Main Results: 3D-FRONT Unconditional Scene Generation

Method	COV↑ (CD)	MMD↓ (CD)	1-NNA↓ (CD)	FID↓
PVD	43.82	3.69	70.83	237.85
NFD	44.65	3.65	62.86	266.27
BlockFusion	24.32	5.10	89.01	45.55
XCube	48.60	3.35	56.45	55.35
LT3SD (3 layers)	53.10	3.51	53.22	13.39

LT3SD's FID of 13.39 is roughly 70% lower than the second-best result of 45.55 (BlockFusion). It also achieves the best COV and 1-NNA, while its MMD (3.51) is slightly behind XCube (3.35).
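As a quick check of the quoted improvement, the relative FID reduction against BlockFusion follows directly from the table values:

$$\frac{45.55 - 13.39}{45.55} = \frac{32.16}{45.55} \approx 0.706$$

That is, a reduction of about 70.6%.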

Ablation Study: Number of Latent Tree Layers

Configuration	FID↓	COV↑ (CD)
Single layer (17.6-2.2)	59.23	28.61
Single layer (8.8-2.2)	50.07	40.87
Three layers (17.6-8.8-2.2)	13.39	53.10

Multi-layer hierarchical modeling is key to high-quality generation.

Ablation Study on Latent Representation

Representation	Training Time	Storage	Reconstruction Error ($\ell_2$)
Cascaded Model	×1.00	×1.00	4.91×10⁻⁴
Latent Tree	×0.87	×0.80	3.20×10⁻⁴
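Expressed as a relative change, the reconstruction error in the table drops by

$$\frac{4.91 - 3.20}{4.91} = \frac{1.71}{4.91} \approx 0.348$$

so the latent tree reaches about 35% lower reconstruction error while using 20% less storage and 13% less training time than the cascaded model.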

Key Findings

Hierarchical > Single-layer: The three-layer latent tree reaches an FID of 13.39, far better than either single-layer configuration (50.07 and 59.23).
Complementary Decomposition > Independent Cascade: Decomposing each layer into geometry + latent features is more efficient and accurate than modeling each layer independently.
Ability to Generate Novel Scenes: Generated patches differ significantly in structure from their nearest-neighbor training patches, which indicates that the model is not merely memorizing training data.

Highlights & Insights

Well-Designed Latent Tree Representation: Geometry (TUDF, which can be directly downsampled) encodes low-frequency information, while latent features encode high-frequency residuals. This complementary decomposition is more efficient than redundant cascades.
Unified Patch-level Training and Inference: Patches are randomly cropped during training (serving as data augmentation and preventing overfitting), and stitched during inference (supporting arbitrary sizes).
Significant Speed Advantage: A large-scale scene of 45m × 90m (~170 rooms) can be generated in 2 hours on a single GPU.
Probabilistic Completion: Starting from partial observations, multiple plausible complete scenes can be sampled, demonstrating the diversity of the generative model.

Limitations & Future Work

Only validated on the 3D-FRONT indoor dataset; applicability to outdoor scenes (larger scale and sparser) remains unknown.
Generation is unconditional: the method cannot control the generated content via text, layouts, or similar inputs.
The TUDF representation assumes simple topology, which may have limitations for transparent or thin-shell objects.
Hyperparameters of the three-layer latent tree (patch size, overlap ratio, number of feature channels) may need adjustment for different scene domains.
The autoregressive patch generation order (breadth-first traversal) may introduce directional bias.

Related Work & Insights

BlockFusion: Based on tri-planes + layout conditioning → good local surface quality but poor global structure; LT3SD's hierarchical coarse-to-fine strategy resolves the global consistency issue.
XCube: Based on sparse voxel latents → single-step generation limits detail fidelity; LT3SD's hierarchical conditional generation progressively adds details.
MultiDiffusion (Bar-Tal et al.): Patch-parallel denoising for 2D images → LT3SD generalizes this to 3D, employing it at high-resolution layers to accelerate inference.
Insight: "Complementary decomposed geometry + features" can be generalized to other 3D representations (e.g., trying a similar decomposition learning on NeRF's multi-resolution hash grids).

Rating

⭐⭐⭐⭐: A cleverly designed representation that significantly outperforms prior methods quantitatively and offers practical support for arbitrarily large scene generation, but it is validated only on an indoor dataset and lacks conditional control.