LT3SD: Latent Trees for 3D Scene Diffusion
Conference: CVPR 2025
arXiv: 2409.08215
Code: https://quan-meng.github.io/projects/lt3sd
Area: 3D Vision
Keywords: 3D scene generation, latent diffusion, latent tree, TUDF, coarse-to-fine, patch-based, unconditional generation
TL;DR
LT3SD progressively decomposes 3D scenes into latent trees, where each layer holds a geometry volume and a high-frequency latent feature volume. A patch-based diffusion model trained on this representation generates scenes coarse-to-fine and patch by patch, supports arbitrarily large ("infinite") scenes, and improves FID by about 70% over the previous SOTA.
Background & Motivation
Background: Diffusion models have achieved breakthroughs in 2D image generation, while 3D diffusion models mainly focus on object-level generation. Due to high geometric complexity, scarce data, and indefinite spatial scales, generating 3D scenes is far more challenging than generating objects.
Limitations of Prior Work: (1) Object-level 3D diffusion (PVD, NFD) assumes shapes are in a normalized space and uses compact representations like global latent codes or tri-planes, which cannot scale to unstructured scenes; (2) The three planes in tri-plane representations (BlockFusion) are highly coupled, requiring complex synchronization for scene extrapolation; (3) Existing scene generation methods (SSG, SemCity) are limited to low-resolution or semantic scenes, lacking geometric details.
Key Challenge: The 3D scene signal is highly non-uniform (with large empty areas and details concentrated near surfaces), requiring a representation that can efficiently encode multi-scale information from global structures to local details.
Key Insight: A hierarchical latent tree decomposes the scene, at multiple resolution levels, into complementary geometry (low-frequency) and latent feature (high-frequency) encodings, which naturally supports coarse-to-fine, patch-based generation.
Method
Overall Architecture
LT3SD has two stages:
1. Latent Tree Encoding (Stage 1): An encoder/decoder is learned to progressively decompose the 3D scene, stored as a truncated unsigned distance field (TUDF) grid, into a multi-layer latent tree.
2. Patch-based Latent Diffusion (Stage 2): Diffusion models are trained on each resolution layer of the latent tree, conditional at every layer except the coarsest.
Key Designs
1. Latent Tree Representation
- Function: Progressively decomposes a high-resolution 3D scene TUDF grid into an \(N\)-layer tree, where each layer \(i\) contains a geometry volume \(L_i^s\) (TUDF) and a latent feature volume \(H_i^s\).
- Mechanism: A 3D CNN encoder decomposes a high-resolution patch \(L_{i+1}\) into a low-resolution TUDF \(L_i\) (obtained by average-pooling downsampling) and a latent feature \(H_i\) (predicted by the CNN):
\[\mathcal{E}_{i+1}(L_{i+1}) \Rightarrow [L_i, H_i]\]
A 3D CNN decoder reconstructs the high-resolution TUDF from \(L_i\) and \(H_i\):
\[\mathcal{D}_{i+1}([L_i, H_i]) \Rightarrow L_{i+1}\]
- Comparison with Alternatives: Compared with a cascaded latent model, the latent tree needs less storage (×0.80), trains faster (×0.87), and reconstructs more accurately (\(\ell_2\) error of 3.20 vs 4.91 ×10⁻⁴), because the cascaded model encodes each layer independently and therefore stores redundant information.
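The decomposition above can be illustrated with a minimal NumPy sketch. It is not the paper's learned 3D CNN: as a labeled assumption, the latent feature \(H_i\) is replaced by a plain residual stored at the fine resolution, and the decoder is nearest-neighbour upsampling plus that residual. The function names (`avg_pool3d`, `encode`, `decode`, `build_tree`, `reconstruct`) are hypothetical and exist only for this illustration.

```python
import numpy as np

def avg_pool3d(vol, f=2):
    """Downsample a (D, H, W) volume by average pooling with factor f."""
    d, h, w = vol.shape
    return vol.reshape(d // f, f, h // f, f, w // f, f).mean(axis=(1, 3, 5))

def upsample3d(vol, f=2):
    """Nearest-neighbour upsampling by factor f along every axis."""
    return vol.repeat(f, axis=0).repeat(f, axis=1).repeat(f, axis=2)

def encode(fine, f=2):
    """Stand-in for E_{i+1}: split L_{i+1} into coarse geometry L_i and detail H_i."""
    coarse = avg_pool3d(fine, f)              # low-frequency geometry
    residual = fine - upsample3d(coarse, f)   # assumption: H_i is a raw residual
    return coarse, residual

def decode(coarse, residual, f=2):
    """Stand-in for D_{i+1}: rebuild L_{i+1} from [L_i, H_i]."""
    return upsample3d(coarse, f) + residual

def build_tree(volume, num_levels=3, f=2):
    """Apply encode repeatedly; return [(L_i, H_i)] ordered coarsest first."""
    levels, current = [], volume
    for _ in range(num_levels):
        coarse, residual = encode(current, f)
        levels.append((coarse, residual))
        current = coarse
    return levels[::-1]

def reconstruct(levels, f=2):
    """Coarse-to-fine decoding, starting from the coarsest geometry volume."""
    current = levels[0][0]
    for _, residual in levels:
        current = decode(current, residual, f)
    return current
```

In this toy version the residual makes reconstruction exact, so it only shows the complementary split into low-frequency geometry and high-frequency detail; the learned encoder in the paper instead compresses the detail into a compact latent feature volume.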
2. Patch-based Conditional Diffusion
- Function: Trains a 3D UNet diffusion model at each layer of the latent tree to predict the latent features \(H_i\) from the geometry volume \(L_i\).
- Mechanism: For layers \(i>1\), generation is conditional (\(z=H_i\), \(c=L_i\)); for the coarsest layer \(i=1\), it is unconditional and produces both \(L_1\) and \(H_1\). Training uses randomly cropped patches:
\[\mathcal{L}_\text{diff} = \mathbb{E}_{z,c,\epsilon,t}\left[\|\epsilon - \mathcal{G}_i(z_t, t, c)\|_2^2\right]\]
- Design Motivation: Patch-level training shifts the focus from complex, unaligned full scenes to local regions that share more structure, and it also acts as data augmentation.
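A minimal sketch of one Monte Carlo sample of \(\mathcal{L}_\text{diff}\) on a random patch follows. Several details are assumptions made for illustration and are not taken from the paper: the linear beta schedule and its constants, the channels-first layout `(C, D, H, W)`, the premise that \(H_i\) and \(L_i\) share the same spatial grid, and the `denoiser` callable standing in for the 3D UNet \(\mathcal{G}_i\). The squared norm is also averaged over elements, which only rescales the loss.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def crop_pair(z_vol, c_vol, size, rng):
    """Crop the same random cubic patch from two (C, D, H, W) volumes."""
    _, d, h, w = z_vol.shape
    k, j, i = (int(rng.integers(0, n - size + 1)) for n in (d, h, w))
    window = (slice(None), slice(k, k + size),
              slice(j, j + size), slice(i, i + size))
    return z_vol[window], c_vol[window]

def diffusion_loss(denoiser, z0, cond, alpha_bar, rng):
    """One sample of the epsilon-prediction loss for a patch (z0, cond)."""
    t = int(rng.integers(0, len(alpha_bar)))            # random timestep
    eps = rng.standard_normal(z0.shape)                 # target noise
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    pred = denoiser(z_t, t, cond)                       # placeholder for G_i
    return float(np.mean((eps - pred) ** 2))            # mean instead of sum
```

For the coarsest layer, the same function could be called with `z0` holding the concatenated \(L_1\) and \(H_1\) channels and a denoiser that ignores `cond`.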
3. Patch-Stitched Generation of Large-Scale Scenes
- Function: Generates scenes of arbitrary size at inference time by combining patch stitching with coarse-to-fine hierarchical reconstruction.
- Mechanism:
  - The coarsest layer uses a Stable Inpainting scheme to generate patches autoregressively, keeping known regions fixed while diffusing the unknown regions.
  - Higher-resolution layers use a MultiDiffusion scheme to denoise all patches in parallel, averaging the predictions where patches overlap.
- Design Motivation: Coarse layers contain few patches, so autoregressive generation is affordable and enforces global consistency; fine layers contain many patches and benefit from parallel denoising (2 hours to generate 170 rooms, compared with BlockFusion's 3 hours for 7 rooms).
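The overlap averaging used at the higher-resolution layers can be sketched as follows. This is an illustration of the MultiDiffusion-style blending idea only, under stated assumptions: `step_fn` is a hypothetical placeholder for one denoising step of the layer's diffusion model, the patch size and stride are arbitrary, and patches are processed in a Python loop rather than as a parallel batch.

```python
import numpy as np

def patch_starts(length, patch, stride):
    """Start offsets covering [0, length); the last patch is flush with the end."""
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)
    return starts

def blended_step(volume, step_fn, patch=8, stride=4):
    """Apply step_fn to every overlapping patch and average the overlaps."""
    acc = np.zeros_like(volume, dtype=float)     # sum of patch predictions
    count = np.zeros_like(volume, dtype=float)   # how many patches hit each voxel
    d, h, w = volume.shape
    for z in patch_starts(d, patch, stride):
        for y in patch_starts(h, patch, stride):
            for x in patch_starts(w, patch, stride):
                window = (slice(z, z + patch),
                          slice(y, y + patch),
                          slice(x, x + patch))
                acc[window] += step_fn(volume[window])
                count[window] += 1.0
    return acc / count
```

Calling `blended_step` once per diffusion timestep keeps neighbouring patches consistent, because every voxel in an overlap receives the mean of all patch predictions that cover it. The volume must be at least one patch wide along every axis.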
Loss & Training
- Latent Tree Training: \(\mathcal{L}_\text{latent} = (L_{i+1} - \mathcal{D}_{i+1}(\mathcal{E}_{i+1}(L_{i+1})))^2\)
- Diffusion Training: Standard denoising \(\epsilon\)-prediction loss
Key Experimental Results
Main Results: 3D-FRONT Unconditional Scene Generation
| Method | COV↑ (CD) | MMD↓ (CD) | 1-NNA↓ (CD) | FID↓ |
|---|---|---|---|---|
| PVD | 43.82 | 3.69 | 70.83 | 237.85 |
| NFD | 44.65 | 3.65 | 62.86 | 266.27 |
| BlockFusion | 24.32 | 5.10 | 89.01 | 45.55 |
| XCube | 48.60 | 3.35 | 56.45 | 55.35 |
| LT3SD (3 layers) | 53.10 | 3.51 | 53.22 | 13.39 |
LT3SD's FID of 13.39 is about 70% lower than the second-best value of 45.55 (BlockFusion), and it also achieves the best COV and 1-NNA. On MMD it ranks second (3.51), behind XCube (3.35).
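The relative FID improvement can be checked directly from the table. The snippet below is a small sanity check using only the FID values listed above; the helper name `relative_reduction` is made up for this example.

```python
def relative_reduction(baseline, ours):
    """Fraction by which a lower-is-better metric drops relative to a baseline."""
    return (baseline - ours) / baseline

# FID values copied from the main results table.
fid = {"PVD": 237.85, "NFD": 266.27, "BlockFusion": 45.55,
       "XCube": 55.35, "LT3SD": 13.39}

baselines = {name: value for name, value in fid.items() if name != "LT3SD"}
best_name = min(baselines, key=baselines.get)        # strongest baseline by FID
improvement = relative_reduction(baselines[best_name], fid["LT3SD"])

print(f"{best_name}: {improvement:.1%}")             # BlockFusion: 70.6%
```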
Ablation Study: Number of Latent Tree Layers
| Configuration | FID↓ | COV↑ (CD) |
|---|---|---|
| Single layer (17.6-2.2) | 59.23 | 28.61 |
| Single layer (8.8-2.2) | 50.07 | 40.87 |
| Three layers (17.6-8.8-2.2) | 13.39 | 53.10 |
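Moving from the best single-layer configuration to three layers lowers FID from 50.07 to 13.39 and raises COV from 40.87 to 53.10, so multi-layer hierarchical modeling is key to high-quality generation.
Ablation Study on Latent Representation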
| Representation | Training Time | Storage | Reconstruction Error (\(\ell_2\)) |
|---|---|---|---|
| Cascaded Model | ×1.00 | ×1.00 | 4.91×10⁻⁴ |
| Latent Tree | ×0.87 | ×0.80 | 3.20×10⁻⁴ |
Key Findings
- Hierarchical > Single-layer: The FID (13.39) of the three-layer latent tree is far superior to any single-layer configuration (50+).
- Complementary Decomposition > Independent Cascade: Decomposing each layer into geometry + latent features is more efficient and accurate than modeling each layer independently.
- Ability to Generate Novel Scenes: Generated patches differ substantially in structure from their nearest-neighbor training patches, indicating that the model is not merely memorizing the training data.
Highlights & Insights
- Well-Designed Latent Tree Representation: Geometry (TUDF, which can be directly downsampled) encodes low-frequency information, while latent features encode high-frequency residuals. This complementary decomposition is more efficient than a redundant cascade.
- Unified Patch-level Training and Inference: Patches are randomly cropped during training (serving as data augmentation and preventing overfitting), and stitched during inference (supporting arbitrary sizes).
- Significant Speed Advantage: A large-scale scene of 45m × 90m (~170 rooms) can be completed in 2 hours on a single GPU.
- Probabilistic Completion: Starting from partial observations, multiple plausible complete scenes can be sampled, demonstrating the diversity of the generative model.
Limitations & Future Work
- Only validated on the 3D-FRONT indoor dataset; applicability to outdoor scenes (larger scale and sparser) remains unknown.
- Generation is unconditional: the content cannot be controlled via text, layouts, etc.
- The TUDF representation assumes simple topology, which may have limitations for transparent or thin-shell objects.
- Hyperparameters of the three-layer latent tree (patch size, overlap ratio, number of feature channels) may need adjustment for different scene domains.
- The autoregressive patch generation order (breadth-first traversal) may introduce directional bias.
Related Work & Insights
- BlockFusion: Based on tri-planes + layout conditioning → good local surface quality but poor global structure; LT3SD's hierarchical coarse-to-fine strategy resolves the global consistency issue.
- XCube: Based on sparse voxel latents → single-step generation limits detail fidelity; LT3SD's hierarchical conditional generation progressively adds details.
- MultiDiffusion (Bar-Tal et al.): Patch-parallel denoising for 2D images → LT3SD generalizes this to 3D, employing it at high-resolution layers to accelerate inference.
- Insight: "Complementary decomposed geometry + features" can be generalized to other 3D representations (e.g., trying a similar decomposition learning on NeRF's multi-resolution hash grids).
Rating
⭐⭐⭐⭐: A cleverly designed representation that significantly outperforms the SOTA quantitatively and offers practical support for infinite scene generation, but it is validated only on an indoor dataset and lacks conditional control.