DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting
Conference: ECCV 2024
arXiv: 2404.06903
Code: https://github.com/dreamscene360/dreamscene360
Area: 3D Vision
Keywords: Text-to-3D Generation, Panorama Images, 3D Gaussian Splatting, Scene Generation, Diffusion Models
TL;DR
Proposes DreamScene360, which utilizes panoramic images as an intermediate representation, combined with a GPT-4V self-refinement mechanism and panoramic 3D Gaussian Splatting, to achieve rapid generation of immersive 360° 3D scenes from text.
Background & Motivation
Background: Text-to-3D scene generation primarily follows two technical pathways: (a) Score Distillation Sampling (SDS)-based methods (e.g., DreamFusion), which optimize NeRF/3DGS representations by distilling priors from 2D diffusion models; and (b) progressive methods based on explicit representations (e.g., LucidDreamer, Text2Room), which gradually expand 3D representations to cover a wider field of view.
Limitations of Prior Work:
- SDS-based methods struggle with low rendering quality, are limited by the multi-view inconsistency of 2D models, and cannot easily scale to scene-level 3D structures.
- Progressive methods perform poorly when filling in large missing areas, leading to severe distortions and incoherent structures in 360° scenes.
- The challenge of prompt engineering in text-to-image is even more pronounced in 3D generation, requiring extensive trial and error.
Key Challenge: Existing methods lack a globally consistent 2D scene representation, making it impossible to maintain semantic and geometric consistency across a full 360° range.
Goal: Generate globally consistent, immersive 360° 3D scenes from arbitrary text prompts.
Key Insight: Adopting panoramic images as an intermediate representation ensures global consistency while enabling automatic prompt optimization using GPT-4V.
Core Idea: Panoramic images provide a globally consistent 2D representation of complete 360° scenes. Combined with monocular depth initialization and semantic/geometric regularization, they can be efficiently elevated to 3D Gaussian Splatting representations.
Method
Overall Architecture
DreamScene360 consists of three stages: (1) generating 360° panoramas using a diffusion model, and iteratively optimizing them through GPT-4V self-refinement; (2) performing 2D-to-3D initialization of the panorama using monocular depth estimation and a learnable geometric field; and (3) optimizing panoramic 3D Gaussians via semantic and geometric regularization to fill in invisible regions from the single-view input.
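Stage (2) lifts panorama pixels into 3D once a depth value is available for each of them. The following is a minimal sketch of that unprojection for an equirectangular image, not the authors' implementation. The axis convention (y up, z forward), the half-pixel offset, and the function names `equirect_pixel_to_ray` and `lift_panorama` are assumptions made for illustration.

```python
import math


def equirect_pixel_to_ray(u, v, width, height):
    """Map pixel (u, v) of an equirectangular image to a unit ray direction.

    Assumed convention: longitude runs from -pi (left) to +pi (right),
    latitude from +pi/2 (top) to -pi/2 (bottom), y is up, z is forward.
    """
    lon = ((u + 0.5) / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - ((v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


def lift_panorama(depth):
    """Turn a height x width grid of ray distances into a list of 3D points."""
    height, width = len(depth), len(depth[0])
    points = []
    for v in range(height):
        for u in range(width):
            dx, dy, dz = equirect_pixel_to_ray(u, v, width, height)
            d = depth[v][u]
            points.append((d * dx, d * dy, d * dz))
    return points


# Toy usage: a 4 x 8 panorama whose depth is 2.0 everywhere yields
# 32 points that all lie on a sphere of radius 2.
points = lift_panorama([[2.0] * 8 for _ in range(4)])
```

With a constant depth the points form a sphere around the panorama centre, which is a quick sanity check before plugging in aligned depth values.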
Key Designs
Text-to-360° Panorama Generation + Self-Refinement Mechanism:
- Function: Generate high-quality, globally consistent 360° panoramic images from text.
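- Mechanism: Building on the MultiDiffusion sliding-window process, StitchDiffusion is employed to ensure left-right boundary continuity. It generates a panorama of resolution \(H \times 2H\), where the per-window updates at each denoising step are merged through a weighted average: \(\Phi(I_{t-1}) = \sum_{i=1}^{n} \frac{P_i^{-1}(W_i)}{\sum_{j=1}^{n} P_j^{-1}(W_j)} \otimes P_i^{-1}(\Phi(P_i(I_t)))\) Here, following MultiDiffusion, \(P_i\) crops the \(i\)-th window from the panorama, \(P_i^{-1}\) maps it back to panorama coordinates, \(W_i\) is the window's per-pixel weight, and \(\otimes\) denotes element-wise multiplication.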
- In each denoising timestep, diffusion is performed not only on the original resolution but also on the concatenated leftmost and rightmost regions to ensure boundary consistency.
- Self-Refinement: Integrates GPT-4V for multi-round self-improvement. Starting from a simple user prompt, GPT-4V scores each generated image from 0 to 10 on aspects such as object counts, attributes, relationships, and appearance, then suggests improvements that are used to refine the prompt. The highest-scoring panorama is selected in the end.
- Design Motivation: The panorama provides a globally consistent 2D representation for subsequent 3D generation, while GPT-4V eliminates the need for manual prompt engineering.
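The weighted-average merge of overlapping window updates can be illustrated on a one-dimensional circular signal standing in for one row of the panorama. This is a simplified sketch under stated assumptions: the weights are uniform, windows wrap around the left-right seam by index arithmetic (rather than by denoising a concatenated boundary region as StitchDiffusion does), and `denoise_patch` is a hypothetical stand-in for a diffusion denoising step.

```python
def blend_windows(latent, window, stride, denoise_patch):
    """Merge overlapping, wrap-around window updates by weighted averaging.

    latent: list of floats (one circular row of the panorama latent).
    denoise_patch: hypothetical callable mapping a window to its update.
    """
    width = len(latent)
    if window < stride:
        raise ValueError("window must be >= stride so every column is covered")
    numer = [0.0] * width  # sum of weight * update per column
    denom = [0.0] * width  # sum of weights per column
    for start in range(0, width, stride):
        # Crop the window, wrapping past the right edge back to the left.
        idx = [(start + k) % width for k in range(window)]
        updated = denoise_patch([latent[j] for j in idx])
        weights = [1.0] * window  # uniform weights (assumption)
        # Scatter the update back to panorama coordinates.
        for k, j in enumerate(idx):
            numer[j] += weights[k] * updated[k]
            denom[j] += weights[k]
    return [n / d for n, d in zip(numer, denom)]


# Toy usage: replacing every window by its mean smooths the row,
# including across the wrap-around seam between columns 7 and 0.
row = [float(i) for i in range(8)]
smoothed = blend_windows(row, window=4, stride=2,
                         denoise_patch=lambda p: [sum(p) / len(p)] * len(p))
```

Because the last window covers columns 6, 7, 0, and 1, column 0 receives contributions from both ends of the row, which is what keeps the seam continuous in this toy setting.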
Panoramic Geometric Field Initialization:
- Function: Elevate the 2D panorama into a consistent 3D point cloud as Gaussian initialization.
- Mechanism: Project the panorama into N=20 overlapping perspective tangent images and obtain the depth map \(D_i^{\text{Mono}}\) for each view using a DPT monocular depth estimator. To address the affine ambiguity inherent in monocular depth, a learnable global geometric field (MLP) and per-view scale/shift parameters are introduced for global alignment: \(\min_{\alpha,\beta,\Theta} \left\{ \|\alpha \cdot D^{\text{Mono}} + \beta - \text{MLPs}(v;\Theta)\|_2^2 + \lambda_{\text{TV}} \mathcal{L}_{\text{TV}}(\beta) + \lambda_\alpha \|\gamma(\alpha) - 1\|^2 \right\}\)
- Here \(\alpha\) and \(\beta\) collect the per-view scale parameters \(\alpha_i\) and the pixel-wise shift parameters \(\beta_i\), \(\Theta\) denotes the MLP parameters, and \(\gamma(\cdot)\) is the softplus function.
- The TV loss ensures the spatial smoothness of the shift parameters.
- Design Motivation: Outdoor scenes lack structured layout priors, necessitating deformable alignment to achieve scale consistency across different views.
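To make the three terms of the alignment objective concrete, the sketch below evaluates it for a single view on small nested lists. It is an illustration only: the TV term is taken as the mean absolute difference between neighbouring shift values, the weights `lam_tv` and `lam_alpha` are placeholder values, and `d_field` stands in for the geometric-field MLP output; none of these choices is confirmed by the summary above.

```python
import math


def softplus(x):
    """Softplus, written gamma(.) in the objective."""
    return math.log1p(math.exp(x))


def tv_loss(grid):
    """Mean absolute difference between horizontal and vertical neighbours."""
    h, w = len(grid), len(grid[0])
    total, count = 0.0, 0
    for r in range(h):
        for c in range(w):
            if c + 1 < w:
                total += abs(grid[r][c + 1] - grid[r][c])
                count += 1
            if r + 1 < h:
                total += abs(grid[r + 1][c] - grid[r][c])
                count += 1
    return total / count if count else 0.0


def alignment_objective(alpha, beta, d_mono, d_field, lam_tv=0.1, lam_alpha=0.1):
    """Data term + TV smoothness on beta + scale regulariser, for one view.

    alpha: scalar scale, beta: per-pixel shift grid, d_mono: monocular depth,
    d_field: depth predicted by the global geometric field (placeholder).
    """
    h, w = len(d_mono), len(d_mono[0])
    data = sum(
        (alpha * d_mono[r][c] + beta[r][c] - d_field[r][c]) ** 2
        for r in range(h)
        for c in range(w)
    )
    return data + lam_tv * tv_loss(beta) + lam_alpha * (softplus(alpha) - 1.0) ** 2


# Toy usage: the field equals 2 * d_mono + 0.5, so the data and TV terms
# vanish and only the scale regulariser contributes.
d_mono = [[1.0, 2.0], [3.0, 4.0]]
d_field = [[2.0 * d + 0.5 for d in row] for row in d_mono]
beta = [[0.5, 0.5], [0.5, 0.5]]
value = alignment_objective(2.0, beta, d_mono, d_field)
```

In an actual optimisation the scale, shifts, and field parameters would be updated jointly by gradient descent over all views; the sketch only shows how one evaluation of the objective is assembled.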
Parallax via Virtual Camera Synthesis + Semantic/Geometric Regularization:
- Function: Solve the lack of parallax information in single-view panoramas and fill in invisible areas.
- Mechanism: Synthesize virtual cameras by introducing progressive perturbations to the panoramic viewpoint coordinates: \((x', y', z') = (x, y, z) + \delta(d_x, d_y, d_z)\) The perturbation range is \([-0.05, +0.05] \times \gamma\), where \(\gamma \in \{1, 2, 4\}\) denotes the three-stage progressive perturbation.
- Semantic Regularization: Uses the [CLS] feature of DINOv2 to constrain semantic consistency between the training view and the virtual view: \(\mathcal{L}_{\text{sem}} = 1 - \text{Cos}([\text{CLS}](I_i), [\text{CLS}](I_i'))\)
- Geometric Regularization: Uses DPT to estimate the depth of the rendered image, regularizing the relative relationships of the rendered depth via Pearson correlation: \(\mathcal{L}_{\text{geo}}(I_i, D_i) = 1 - \frac{\text{Cov}(D_i, \text{DPT}(I_i))}{\sqrt{\text{Var}(D_i) \cdot \text{Var}(\text{DPT}(I_i))}}\)
- Design Motivation: Single-view panoramas lack parallax information (cannot perceive depth via binocular disparity), which needs to be compensated for using virtual views paired with 2D model priors.
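The perturbation and the two regularizers above reduce to a few lines of arithmetic. The sketch below is a plain-Python illustration: uniform sampling within the stated range is an assumption, and the feature vectors and depth lists are placeholders for DINOv2 [CLS] features and DPT depth maps, which are not computed here.

```python
import math
import random


def perturb_camera(position, stage_scale, base=0.05, rng=random):
    """Offset each coordinate within [-base, +base] * stage_scale.

    stage_scale is the progressive factor (1, 2, or 4); sampling the
    offset uniformly is an assumption of this sketch.
    """
    limit = base * stage_scale
    return tuple(p + rng.uniform(-limit, limit) for p in position)


def semantic_loss(feat_a, feat_b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(feat_a, feat_b))
    norm_a = math.sqrt(sum(a * a for a in feat_a))
    norm_b = math.sqrt(sum(b * b for b in feat_b))
    return 1.0 - dot / (norm_a * norm_b)


def geometric_loss(rendered_depth, estimated_depth):
    """1 - Pearson correlation between two flattened depth maps."""
    n = len(rendered_depth)
    mean_r = sum(rendered_depth) / n
    mean_e = sum(estimated_depth) / n
    cov = sum((r - mean_r) * (e - mean_e)
              for r, e in zip(rendered_depth, estimated_depth)) / n
    var_r = sum((r - mean_r) ** 2 for r in rendered_depth) / n
    var_e = sum((e - mean_e) ** 2 for e in estimated_depth) / n
    return 1.0 - cov / math.sqrt(var_r * var_e)


# Toy usage: a stage-4 virtual camera near the origin, and a depth map
# that differs from the estimate only by scale and shift.
camera = perturb_camera((0.0, 0.0, 0.0), stage_scale=4, rng=random.Random(0))
depth_penalty = geometric_loss([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

Because Pearson correlation is invariant to positive scale and shift, the geometric term penalizes only errors in the relative depth ordering, which matches the affine ambiguity of monocular depth.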
Loss & Training
Overall loss function: \(\mathcal{L} = \mathcal{L}_{\text{RGB}} + \lambda_1 \cdot \mathcal{L}_{\text{sem}} + \lambda_2 \cdot \mathcal{L}_{\text{geo}}\)
- \(\mathcal{L}_{\text{RGB}}\): Photometric loss, including L1 and D-SSIM terms.
- \(\lambda_1 = \lambda_2 = 0.05\)
- Input panorama resolution: \(1024 \times 2048\)
- Disable 3DGS adaptive density control (densification) since high-quality point cloud initialization is already available.
- Set FoV to 80° for geometric field optimization.
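As a worked example of how the terms combine, the sketch below evaluates the overall loss for made-up term values. The weights \(\lambda_1 = \lambda_2 = 0.05\) come from the list above. The mix of L1 and D-SSIM inside the photometric term is not specified here, so the 0.8/0.2 split used by the original 3D Gaussian Splatting implementation is assumed purely for illustration.

```python
def photometric_loss(l1, d_ssim, ssim_weight=0.2):
    """L1 and D-SSIM mix; ssim_weight=0.2 is the original 3DGS default (assumed here)."""
    return (1.0 - ssim_weight) * l1 + ssim_weight * d_ssim


def total_loss(l_rgb, l_sem, l_geo, lambda_1=0.05, lambda_2=0.05):
    """Overall objective: photometric term plus weighted regularizers."""
    return l_rgb + lambda_1 * l_sem + lambda_2 * l_geo


# Made-up term values, used only to show the arithmetic.
l_rgb = photometric_loss(l1=0.10, d_ssim=0.20)      # 0.8 * 0.10 + 0.2 * 0.20
loss = total_loss(l_rgb, l_sem=0.40, l_geo=0.20)    # l_rgb + 0.05 * 0.40 + 0.05 * 0.20
```

With these weights the regularizers act as small corrections to the photometric term rather than competing with it.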
Key Experimental Results
Main Results
| Metric | DreamScene360 | LucidDreamer | Description |
|---|---|---|---|
| CLIP Distance ↓ | 0.8732 | 0.8900 | Text-Image Alignment |
| Q-Align ↑ | 3.1094 | 3.0566 | SOTA Perceptual Quality Assessment |
| NIQE ↓ | 4.9165 | 6.2305 | No-Reference Image Quality |
| BRISQUE ↓ | 38.3911 | 51.9764 | No-Reference Image Quality |
| Runtime | 7min 20sec | 6min 15sec | Slightly slower but acceptable |
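The relative size of these gaps is easier to judge as percentages. The snippet below recomputes them from the table values only; the helper name `relative_gain` is introduced for illustration.

```python
# (DreamScene360, LucidDreamer, which direction is better), copied from the table.
results = {
    "CLIP Distance": (0.8732, 0.8900, "lower"),
    "Q-Align": (3.1094, 3.0566, "higher"),
    "NIQE": (4.9165, 6.2305, "lower"),
    "BRISQUE": (38.3911, 51.9764, "lower"),
}


def relative_gain(ours, baseline, better):
    """Improvement over the baseline as a fraction of the baseline value."""
    diff = baseline - ours if better == "lower" else ours - baseline
    return diff / baseline


gains = {name: relative_gain(*row) for name, row in results.items()}

# Runtime: 7 min 20 s versus 6 min 15 s, as a fraction of the baseline.
ours_s, baseline_s = 7 * 60 + 20, 6 * 60 + 15
runtime_overhead = (ours_s - baseline_s) / baseline_s

for name, gain in gains.items():
    print(f"{name}: {gain:.1%} better than LucidDreamer")
print(f"Runtime: {runtime_overhead:.1%} slower")
```

The no-reference quality metrics (NIQE, BRISQUE) show gaps above 20%, whereas CLIP Distance and Q-Align differ by under 2%, at a runtime cost of roughly 17%.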
Ablation Study
| Configuration | Effect | Description |
|---|---|---|
| Photometric Loss Only | Artifacts in virtual views | Lack of constraints in occluded regions |
| + Geometric Regularization | Reduced artifacts | Improved depth consistency |
| + Semantic Regularization | Reduced artifacts | High-level semantic complement |
| Full Model | Optimal visual quality | Mutual complement of both regularizations |
| Random Initialization | Blurry results | Lack of geometric priors |
| Monocular Depth + Alignment | Clear and consistent | Key to a good initialization |
Key Findings
- LucidDreamer's progressive inpainting tends to generate repetitive content in complex scenes (e.g., repeating a bedroom multiple times).
- GPT-4V self-refinement significantly improves the visual quality and detail richness of the panorama.
- Disabling 3DGS densification actually helps improve quality and accelerate convergence.
Highlights & Insights
- Using a panorama as an intermediate representation is an excellent design choice: it naturally addresses the global consistency issue of 360° scenes and makes GPT-4V quality evaluation possible (prior methods without a global 2D representation could not perform such evaluations).
- The entire pipeline enables "one-click" 3D scene generation in about 7 minutes, offering a user experience significantly superior to SDS-based methods that require laborious parameter tuning.
- Formulating the affine ambiguity of monocular depth as a learnable geometric field optimization is a highly practical and elegant solution.
Limitations & Future Work
- The generation resolution is limited by the default resolution of the pre-trained panoramic diffusion model (\(512 \times 1024\)).
- Future work can explore higher-resolution generation and extension to 4D dynamic scenes.
- The geometric accuracy of the scene remains bounded by the quality of monocular depth estimation.
Related Work & Insights
- vs. LucidDreamer: LucidDreamer extends views via progressive inpainting but cannot guarantee 360° global consistency. DreamScene360 solves this fundamental issue through the panoramic intermediate representation.
- vs. DreamFusion: DreamFusion relies on SDS distillation, which makes generation slow and low in quality. DreamScene360 avoids this time-consuming score distillation process.
- vs. Text2Room: Text2Room uses a mesh representation to progressively construct indoor scenes, whereas DreamScene360 supports unconstrained scenes, both indoor and outdoor.
Rating
- Novelty: ⭐⭐⭐⭐ The design of using a panorama as an intermediate representation + GPT-4V self-refinement is simple yet effective.
- Experimental Thoroughness: ⭐⭐⭐ Compared only with LucidDreamer; lacks additional baselines and a quantitative ablation.
- Writing Quality: ⭐⭐⭐⭐ Clear structure, well-explained motivation, high-quality illustrations.
- Value: ⭐⭐⭐⭐ Provides a practical end-to-end solution for 360° scene generation, indicating clear industrial application value.