DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting¶

Conference: ECCV 2024
arXiv: 2404.06903
Code: https://github.com/dreamscene360/dreamscene360
Area: 3D Vision
Keywords: Text-to-3D Generation, Panorama Images, 3D Gaussian Splatting, Scene Generation, Diffusion Models

TL;DR¶

Proposes DreamScene360, which utilizes panoramic images as an intermediate representation, combined with a GPT-4V self-refinement mechanism and panoramic 3D Gaussian Splatting, to achieve rapid generation of immersive 360° 3D scenes from text.

Background & Motivation¶

Background: Text-to-3D scene generation primarily follows two technical pathways: (a) Score Distillation Sampling (SDS)-based methods (e.g., DreamFusion), which optimize NeRF/3DGS representations by distilling priors from 2D diffusion models; and (b) progressive methods based on explicit representations (e.g., LucidDreamer, Text2Room), which gradually expand 3D representations to cover a wider field of view.

Limitations of Prior Work:
- SDS-based methods struggle with low rendering quality, are limited by multi-view inconsistency of 2D models, and cannot easily scale to scene-level 3D structures.
- Progressive methods perform poorly when filling in large missing areas, leading to severe distortions and incoherent structures in 360° scenes.
- The challenge of prompt engineering in text-to-image is even more pronounced in 3D generation, requiring extensive trial and error.

Key Challenge: Existing methods lack a globally consistent 2D scene representation, making it impossible to maintain semantic and geometric consistency across a full 360° range.

Goal: Generate globally consistent, immersive 360° 3D scenes from arbitrary text prompts.

Key Insight: Adopting panoramic images as an intermediate representation ensures global consistency while enabling automatic prompt optimization using GPT-4V.

Core Idea: Panoramic images provide a globally consistent 2D representation of complete 360° scenes. Combined with monocular depth initialization and semantic/geometric regularization, they can be efficiently elevated to 3D Gaussian Splatting representations.

Method¶

Overall Architecture¶

DreamScene360 consists of three stages: (1) generating 360° panoramas using a diffusion model, and iteratively optimizing them through GPT-4V self-refinement; (2) performing 2D-to-3D initialization of the panorama using monocular depth estimation and a learnable geometric field; and (3) optimizing panoramic 3D Gaussians via semantic and geometric regularization to fill in invisible regions from the single-view input.
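The self-refinement loop in stage (1) can be sketched as follows. This is a minimal illustration, not the authors' code: `generate` and `critique` are hypothetical callables standing in for the panorama diffusion model and the GPT-4V evaluator, and the number of rounds is an assumed parameter.

```python
from typing import Any, Callable, Tuple


def self_refine(
    user_prompt: str,
    generate: Callable[[str], Any],                      # hypothetical: prompt -> panorama
    critique: Callable[[Any, str], Tuple[float, str]],   # hypothetical: (image, prompt) -> (score, revised prompt)
    rounds: int = 3,                                     # assumed value, not taken from the paper
) -> Tuple[float, str, Any]:
    """Return (score, prompt, image) of the highest-scoring round."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    prompt = user_prompt
    best = None
    for _ in range(rounds):
        image = generate(prompt)
        # The critic scores the image and proposes a revised prompt.
        score, revised_prompt = critique(image, prompt)
        if best is None or score > best[0]:
            best = (score, prompt, image)
        prompt = revised_prompt
    return best
```

Keeping the best round rather than the last one matters because a revised prompt is not guaranteed to score higher than its predecessor.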

Key Designs¶

Text-to-360° Panorama Generation + Self-Refinement Mechanism:
- Function: Generate high-quality, globally consistent 360° panoramic images from text.
- Mechanism: Based on the MultiDiffusion sliding window process, StitchDiffusion is employed to ensure left-right boundary continuity. It generates a panorama of resolution $H \times 2H$, where each patch's update during the denoising process is merged through a weighted average: $\Phi(I_{t-1}) = \sum_{i=1}^{n} \frac{P_i^{-1}(W_i)}{\sum_{j=1}^{n} P_j^{-1}(W_j)} \otimes P_i^{-1}(\Phi(P_i(I_t)))$
- In each denoising timestep, diffusion is performed not only on the original resolution but also on the concatenated leftmost and rightmost regions to ensure boundary consistency.
- Self-Refinement: Integrates GPT-4V for multi-round self-improvement. Starting from a simple user prompt, GPT-4V scores each generated image from 0 to 10 on aspects such as object counts, attributes, relationships, and appearance, then suggests improvements that are used to refine the prompt. The highest-scoring panorama across rounds is selected.
- Design Motivation: The panorama provides a globally consistent 2D representation for subsequent 3D generation, while GPT-4V eliminates the need for manual prompt engineering.
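The weighted-average merge of overlapping window updates can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions: the function name `merge_windows` is hypothetical, the weights default to uniform, and windows that wrap past the right edge stand in for the concatenated left/right regions that StitchDiffusion denoises.

```python
import numpy as np


def merge_windows(patches, offsets, width, weights=None):
    """Blend per-window updates into one panorama by weighted average.

    patches: list of arrays of shape (H, w, C), one update per window.
    offsets: left column of each window; columns wrap modulo `width`,
             so a window may straddle the left/right seam.
    weights: optional list of (H, w, 1) per-pixel weights (uniform if None).
    """
    H, w, C = patches[0].shape
    num = np.zeros((H, width, C))
    den = np.zeros((H, width, 1))
    for i, (patch, x0) in enumerate(zip(patches, offsets)):
        W = np.ones((H, w, 1)) if weights is None else weights[i]
        cols = (x0 + np.arange(w)) % width  # wrap-around column indices
        num[:, cols] += W * patch           # scatter the weighted update back
        den[:, cols] += W                   # accumulate weights for normalization
    return num / np.maximum(den, 1e-8)
```

With two width-6 windows at offsets 0 and 4 on a width-8 panorama, the second window wraps onto columns 0 and 1, so the seam columns are averaged in the same way as any interior overlap.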
Panoramic Geometric Field Initialization:
- Function: Elevate the 2D panorama into a consistent 3D point cloud as Gaussian initialization.
- Mechanism: Project the panorama into N=20 overlapping perspective tangent images and obtain the depth map $D_i^{\text{Mono}}$ for each view using a DPT monocular depth estimator. To address the affine ambiguity inherent in monocular depth, a learnable global geometric field (MLP) and per-view scale/shift parameters are introduced for global alignment: $\min_{\alpha,\beta,\Theta} \left\{ \|\alpha \cdot D^{\text{Mono}} + \beta - \text{MLPs}(v;\Theta)\|_2^2 + \lambda_{\text{TV}} \mathcal{L}_{\text{TV}}(\beta) + \lambda_\alpha \|\gamma(\alpha) - 1\|^2 \right\}$
- Where $\alpha_i$ is the scale parameter for each view, $\beta_i$ is the pixel-wise shift parameter, $\Theta$ denotes the MLP parameters, and $\gamma(\cdot)$ is the softplus function.
- The TV loss ensures the spatial smoothness of the shift parameters.
- Design Motivation: Outdoor scenes lack structured layout priors, necessitating deformable alignment to achieve scale consistency across different views.
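To make the affine ambiguity concrete, the sketch below fits a single scale and shift that align one view's monocular depth to a reference depth in the least-squares sense. It is deliberately simpler than the method described above: the reference array stands in for the geometric field's output, the shift is a scalar rather than pixel-wise, and the TV and scale regularizers are omitted. The function name `fit_scale_shift` is hypothetical.

```python
import numpy as np


def fit_scale_shift(d_mono, d_ref):
    """Least-squares (alpha, beta) minimizing ||alpha * d_mono + beta - d_ref||^2."""
    # Design matrix with one column for the scale and one for the shift.
    A = np.stack([d_mono.ravel(), np.ones(d_mono.size)], axis=1)
    (alpha, beta), *_ = np.linalg.lstsq(A, d_ref.ravel(), rcond=None)
    return float(alpha), float(beta)
```

In the full formulation the reference is itself learnable, so the scale, shift, and field parameters are optimized jointly rather than solved in closed form.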
Virtual Camera Synthesis for Parallax + Semantic/Geometric Regularization:
- Function: Solve the lack of parallax information in single-view panoramas and fill in invisible areas.
- Mechanism: Synthesize virtual cameras by introducing progressive perturbations to the panoramic viewpoint coordinates: $(x', y', z') = (x, y, z) + \delta(d_x, d_y, d_z)$. The perturbation range is $[-0.05, +0.05] \times \gamma$, where $\gamma \in \{1, 2, 4\}$ is the multiplier for the three progressive stages (not to be confused with the softplus $\gamma(\cdot)$ used in the alignment objective).
- Semantic Regularization: Uses the [CLS] feature of DINOv2 to constrain semantic consistency between the training view and the virtual view: $\mathcal{L}_{\text{sem}} = 1 - \text{Cos}([\text{CLS}](I_i), [\text{CLS}](I_i'))$
- Geometric Regularization: Uses DPT to estimate the depth of the rendered image, regularizing the relative relationships of the rendered depth via Pearson correlation: $\mathcal{L}_{\text{geo}}(I_i, D_i) = 1 - \frac{\text{Cov}(D_i, \text{DPT}(I_i))}{\sqrt{\text{Var}(D_i) \cdot \text{Var}(\text{DPT}(I_i))}}$
- Design Motivation: Single-view panoramas lack parallax information (cannot perceive depth via binocular disparity), which needs to be compensated for using virtual views paired with 2D model priors.
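The perturbation and the two regularizers above can be written compactly in NumPy. This is an illustrative sketch only: uniform sampling within the stated range is an assumption, the function names are hypothetical, and the feature vectors and depth maps are plain arrays here rather than outputs of DINOv2 or DPT.

```python
import numpy as np


def perturb_camera(center, stage, rng, base=0.05):
    """Sample a virtual camera center near `center`.

    stage is 0, 1 or 2 and selects the multiplier 1, 2 or 4; each axis offset
    is drawn uniformly (an assumption) from [-base * m, +base * m].
    """
    m = (1, 2, 4)[stage]
    offset = rng.uniform(-base * m, base * m, size=3)
    return np.asarray(center, dtype=float) + offset


def semantic_loss(cls_a, cls_b):
    """1 - cosine similarity between two [CLS] feature vectors."""
    cos = np.dot(cls_a, cls_b) / (np.linalg.norm(cls_a) * np.linalg.norm(cls_b))
    return 1.0 - cos


def geometric_loss(rendered_depth, estimated_depth):
    """1 - Pearson correlation between rendered and estimated depth maps."""
    x = rendered_depth.ravel() - rendered_depth.mean()
    y = estimated_depth.ravel() - estimated_depth.mean()
    return 1.0 - (x @ y) / np.sqrt((x @ x) * (y @ y))
```

Because Pearson correlation is invariant to positive scale and shift, the geometric term constrains only the relative depth structure, which fits a monocular estimator whose output has no absolute scale.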

Loss & Training¶

Overall loss function: $$\mathcal{L} = \mathcal{L}_{\text{RGB}} + \lambda_1 \cdot \mathcal{L}_{\text{sem}} + \lambda_2 \cdot \mathcal{L}_{\text{geo}}$$

- $\mathcal{L}_{\text{RGB}}$: Photometric loss, including L1 and D-SSIM terms.
- $\lambda_1 = \lambda_2 = 0.05$
- Input panorama resolution: $1024 \times 2048$
- Disable 3DGS adaptive density control (densification) since high-quality point cloud initialization is already available.
- Set FoV to 80° for geometric field optimization.

Key Experimental Results¶

Main Results¶

| Metric | DreamScene360 | LucidDreamer | Description |
| --- | --- | --- | --- |
| CLIP Distance ↓ | 0.8732 | 0.8900 | Text-Image Alignment |
| Q-Align ↑ | 3.1094 | 3.0566 | SOTA Perceptual Quality Assessment |
| NIQE ↓ | 4.9165 | 6.2305 | No-Reference Image Quality |
| BRISQUE ↓ | 38.3911 | 51.9764 | No-Reference Image Quality |
| Runtime | 7 min 20 sec | 6 min 15 sec | Slightly slower but acceptable |

Ablation Study¶

| Configuration | Effect | Description |
| --- | --- | --- |
| Photometric Loss Only | Artifacts in virtual views | Lack of constraints in occluded regions |
| + Geometric Regularization | Reduced artifacts | Improved depth consistency |
| + Semantic Regularization | Reduced artifacts | High-level semantic complement |
| Full Model | Optimal visual quality | Mutual complement of both regularizations |
| Random Initialization | Blurry results | Lack of geometric priors |
| Monocular Depth + Alignment | Clear and consistent | Key to a good initialization |

Key Findings¶

- LucidDreamer's progressive inpainting tends to generate repetitive content in complex scenes (e.g., repeating a bedroom multiple times).
- GPT-4V self-refinement significantly improves the visual quality and detail richness of the panorama.
- Disabling 3DGS densification actually helps improve quality and accelerate convergence.

Highlights & Insights¶

- Using a panorama as an intermediate representation is an excellent design choice: it naturally addresses the global consistency issue of 360° scenes and also makes GPT-4V quality evaluation possible (prior methods without a global 2D representation could not perform such evaluations).
- The entire pipeline enables "one-click" 3D scene generation in about 7 minutes, offering a user experience significantly superior to SDS-based methods that require laborious parameter tuning.
- Formulating the affine ambiguity of monocular depth as a learnable geometric field optimization is a highly practical and elegant solution.

Limitations & Future Work¶

- The generation resolution is limited by the default resolution of the pre-trained panoramic diffusion model ($512 \times 1024$).
- Future work can explore higher-resolution generation and extension to 4D dynamic scenes.
- The geometric accuracy of the scene remains bounded by the quality of monocular depth estimation.

- vs. LucidDreamer: LucidDreamer extends views via progressive inpainting but cannot guarantee 360° global consistency. DreamScene360 solves this fundamental issue through the panoramic intermediate representation.
- vs. DreamFusion: DreamFusion relies on SDS distillation, which makes generation slow and the results low-quality. DreamScene360 avoids this time-consuming score distillation process.
- vs. Text2Room: Text2Room uses a mesh representation to progressively construct indoor scenes, whereas DreamScene360 supports both indoor and outdoor scenes without such constraints.

Rating¶

Novelty: ⭐⭐⭐⭐ The design of using a panorama as an intermediate representation + GPT-4V self-refinement is simple yet effective.
Experimental Thoroughness: ⭐⭐⭐ Only compared with LucidDreamer; lacks more baselines and quantitative ablation.
Writing Quality: ⭐⭐⭐⭐ Clear structure, well-explained motivation, high-quality illustrations.
Value: ⭐⭐⭐⭐ Provides a practical end-to-end solution for 360° scene generation, indicating clear industrial application value.