GVGEN: Text-to-3D Generation with Volumetric Representation¶
Conference: ECCV 2024
arXiv: 2403.12957
Code: https://gvgen.github.io/
Area: 3D Vision
Keywords: Text-to-3D, 3D Gaussian, Volumetric Representation, Feed-forward Generation, Diffusion Model
TL;DR¶
Proposes GVGEN, the first framework to directly generate 3D Gaussians from text in a feed-forward manner. Unordered Gaussians are organized into a structured volumetric representation (GaussianVolume), and a coarse-to-fine pipeline first generates a geometric volume and then predicts the Gaussian attributes, completing text-to-3D generation in approximately 7 seconds.
Background & Motivation¶
Background: Text-to-3D generation is a popular direction in computer graphics, and existing methods fall into two main groups:
- Optimization-based methods (e.g., DreamFusion/SDS): good quality, but generation takes hours and results can exhibit the Janus problem.
- Feed-forward methods (e.g., Shap-E): fast but limited in quality, often adopting an indirect text → 2D → 3D route.
Limitations of Prior Work:
- Optimization methods have excessively high time costs (hour-level) and suffer from multi-face (Janus) and over-saturation issues.
- Among feed-forward methods, directly generating 3D Gaussians is almost unexplored because Gaussian points are unordered and high-dimensional.
- Point-cloud diffusion/generation methods are difficult to scale to high-dimensional Gaussian attributes.
- Methods that rely on intermediate multi-view steps suffer from resolution loss and lack sufficient semantic understanding for complex prompts.
Key Challenge: 3D Gaussian points are inherently unstructured, making them difficult to handle directly with existing generative networks; meanwhile, Gaussian attributes are high-dimensional, making direct distribution learning difficult.
Goal: Design a method that can organize unordered 3D Gaussians into structured representations, and achieve direct feed-forward 3D Gaussian generation from text.
Key Insight: Organize Gaussians into a fixed number of volumetric grids (GaussianVolume), simplifying generation via a two-stage (geometry → attribute) coarse-to-fine pipeline.
Core Idea: Structuring unordered Gaussians into volumetric grids + Candidate Pool Strategy (CPS) to maintain a fixed count + diffusing geometry first, then predicting attributes with 3D U-Net.
Method¶
Overall Architecture¶
GVGEN consists of two main stages:
Stage 1 — GaussianVolume Fitting (Data Preparation): Represents the unordered Gaussian points of each 3D object as a structured volume with a fixed resolution \(N^3\) (default \(N=32\), i.e., 32,768 Gaussian points), while extracting a Gaussian Distance Field (GDF) as a coarse geometric representation.
Stage 2 — Text-to-3D Generation (Generation):
1. A diffusion model generates the GDF (coarse geometric volume) conditioned on the text.
2. A 3D U-Net predicts the complete GaussianVolume attributes from the GDF and the text.
Key Designs¶
GaussianVolume (Structured Gaussian Volume):
- Function: Organizes unordered 3D Gaussian points into a structured volume \(V \in \mathbb{R}^{C \times N \times N \times N}\).
- Mechanism: A Gaussian point is placed at each node of the volumetric grid, and a position offset \(\Delta\mu\) represents the small displacement from the grid node to the actual Gaussian center: \(\mu = p + \Delta\mu\), where \(p\) is the fixed grid node position and \(\Delta\mu\) is a learnable offset. During fitting, position gradients are backpropagated only to the offsets, which allows fine-grained position adjustment while preserving the structured form. Each Gaussian stores a position offset \(\Delta\mu \in \mathbb{R}^3\), scale \(s \in \mathbb{R}^3\), rotation quaternion \(q \in \mathbb{R}^4\), color \(c \in \mathbb{R}^3\), and opacity \(\alpha \in \mathbb{R}\), giving \(C = 3 + 3 + 4 + 3 + 1 = 14\) channels.
- Design Motivation: Unstructured Gaussian points are a poor fit for 3D neural networks, and prior attempts to generate high-dimensional Gaussians with point-cloud diffusion worked poorly. The volumetric format interfaces directly with existing volumetric generative networks.
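As a rough illustration of the layout described above, the following sketch builds a small grid of nodes and recovers Gaussian centers via \(\mu = p + \Delta\mu\). The grid bounds (\([-1, 1]^3\)), the cell-centered node placement, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import itertools

N = 4  # tiny grid for illustration; the summary's default is N = 32

# Per-Gaussian attribute sizes as listed above (3 + 3 + 4 + 3 + 1 = 14 channels).
CHANNELS = {"offset": 3, "scale": 3, "rotation": 4, "color": 3, "opacity": 1}
C = sum(CHANNELS.values())


def grid_nodes(n):
    """Node positions p on a regular n^3 grid inside [-1, 1]^3 (assumed bounds)."""
    step = 2.0 / n
    axis = [-1.0 + step * (i + 0.5) for i in range(n)]  # cell centers along one axis
    return list(itertools.product(axis, repeat=3))


def gaussian_centers(nodes, offsets):
    """Apply mu = p + delta_mu to every grid node."""
    return [
        tuple(p_k + d_k for p_k, d_k in zip(p, d))
        for p, d in zip(nodes, offsets)
    ]


nodes = grid_nodes(N)
offsets = [(0.0, 0.0, 0.0)] * len(nodes)  # zero offset: center sits on its node
offsets[0] = (0.01, -0.02, 0.0)           # a small learned displacement (made up)
centers = gaussian_centers(nodes, offsets)
```

Because the node positions are fixed, only the offsets need to be stored in the volume, so the number of Gaussians stays at \(N^3\) regardless of how the centers move.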
Candidate Pool Strategy (CPS):
- Function: Enables effective pruning and densification under the constraint of a fixed number of Gaussians.
- Mechanism:
  - Initialize the candidate pool \(P = \emptyset\).
  - Pruning: Determine the points to be pruned \(G_p\) using a pruning threshold \(\tau_p\), "deactivate" them, and put them into the candidate pool (deactivated points do not participate in forward and backward passes).
  - Densification: Determine the points to be densified \(G_d\) using a threshold \(\tau_d\), find the nearest deactivated point \(G_{new}\) in the candidate pool, and activate it near \(G_d\).
  - Optimization End: Release all points in the pool so that they participate in optimization again.
- Design Motivation: The unconstrained pruning/densification of the original 3DGS changes the total number of Gaussians, which is incompatible with a fixed volumetric resolution. Without CPS, the movement range of Gaussian centers is limited and fitting quality drops (PSNR falls by about 0.45 dB in the ablation).
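The bookkeeping behind the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `CandidatePool`, the use of plain Euclidean distance between stored positions, and the omission of the actual threshold tests (the caller passes in the indices already selected by \(\tau_p\) and \(\tau_d\)) are all hypothetical simplifications.

```python
import math


class CandidatePool:
    """Fixed-size set of Gaussians with activate/deactivate bookkeeping."""

    def __init__(self, positions):
        self.positions = list(positions)
        self.active = [True] * len(self.positions)
        self.pool = set()  # indices of deactivated points (P starts empty)

    def prune(self, indices):
        """Deactivate the selected points and move them into the pool."""
        for i in indices:
            if self.active[i]:
                self.active[i] = False
                self.pool.add(i)

    def densify(self, indices):
        """For each point to densify, reactivate its nearest pooled point."""
        activated = []
        for i in indices:
            if not self.pool:
                break  # nothing left to reuse
            j = min(
                sorted(self.pool),  # sorted for deterministic tie-breaking
                key=lambda k: math.dist(self.positions[k], self.positions[i]),
            )
            self.pool.remove(j)
            self.active[j] = True
            activated.append(j)
        return activated

    def release(self):
        """At the end of optimization, reactivate everything left in the pool."""
        for j in self.pool:
            self.active[j] = True
        self.pool.clear()


cps = CandidatePool([(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)])
cps.prune([1, 3])              # points 1 and 3 are deactivated
activated = cps.densify([0])   # nearest pooled point to point 0 is point 1
pool_before_release = set(cps.pool)
cps.release()
```

The total number of stored points never changes; only the `active` flags do, which is what keeps the representation compatible with a fixed \(N^3\) volume.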
Gaussian Distance Field (GDF):
- Function: Serves as a coarse geometric representation, storing the distance from each grid node to the nearest Gaussian center.
- Mechanism: \(F \in (\mathbb{R}_0^{+})^{1 \times N \times N \times N}\), similar to an Unsigned Distance Field (UDF), is extracted from the fitted GaussianVolume via sorting algorithms. A text-conditioned diffusion model is then trained on noisy GDFs with an MSE objective, summarized here as \(\mathcal{L}_{3D} = \text{MSE}(\text{GDF}_{gt}, \text{GDF}_{pred})\).
- Design Motivation: Directly generating the full GaussianVolume is too high-dimensional (14 channels × 32³), making it difficult for the diffusion model to converge. Decomposing the problem into first generating a simple 1-channel geometric volume and then conditionally predicting the other attributes reduces complexity.
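To make the GDF definition concrete, the sketch below computes the distance from every grid node to its nearest Gaussian center by brute force. Note that the summary says the field is extracted via sorting algorithms; the exhaustive search here is a swapped-in simplification, and the grid construction and function names are assumptions for illustration.

```python
import itertools
import math


def grid_nodes(n):
    """Cell-centered node positions on an n^3 grid in [-1, 1]^3 (assumed bounds)."""
    step = 2.0 / n
    axis = [-1.0 + step * (i + 0.5) for i in range(n)]
    return list(itertools.product(axis, repeat=3))


def gaussian_distance_field(nodes, centers):
    """Unsigned distance from each node to the nearest Gaussian center."""
    return [min(math.dist(p, c) for c in centers) for p in nodes]


nodes = grid_nodes(2)                 # 8 nodes at (+/-0.5, +/-0.5, +/-0.5)
centers = [(-0.5, -0.5, -0.5)]        # a single made-up Gaussian center
gdf = gaussian_distance_field(nodes, centers)
```

Every value is non-negative by construction, and the result has one scalar per node, matching the 1-channel \(N \times N \times N\) volume that the diffusion model generates.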
3D U-Net Attribute Predictor:
- Function: Predicts the complete GaussianVolume attributes in a single step based on the generated GDF and text conditions.
- Mechanism: Based on a modified 3D U-Net from SDFusion, it takes the GDF and text conditions as inputs and outputs all Gaussian attributes. It is trained with a multimodal loss: \(\mathcal{L} = \lambda_{3D}\mathcal{L}_{3D} + \lambda_{2D}\mathcal{L}_{2D}\), where \(\mathcal{L}_{2D} = \lambda\mathcal{L}_1 + (1-\lambda)\mathcal{L}_{SSIM}\).
- Design Motivation: Experiments found that the single-step reconstruction model and the diffusion model performed comparably in attribute prediction, but the former was faster. The multimodal loss (3D semantics + 2D rendering) balances global consistency and local details.
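The way the two loss terms combine can be sketched as below. The weight values, the flat-list data layout, and the `ssim_loss` callable are placeholders: the summary does not give the weights, and a real SSIM term would come from a rendering and image-quality library rather than the stand-in used in this example.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def l1(a, b):
    """Mean absolute error between two equally sized flat sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def attribute_loss(pred_vol, gt_vol, pred_img, gt_img, ssim_loss,
                   lam_3d=1.0, lam_2d=1.0, lam=0.8):
    """L = lam_3d * L_3D + lam_2d * L_2D, with L_2D = lam * L_1 + (1 - lam) * L_SSIM.

    All weights are illustrative defaults, not values from the paper.
    """
    loss_3d = mse(pred_vol, gt_vol)  # 3D term on volume attributes
    loss_2d = lam * l1(pred_img, gt_img) + (1.0 - lam) * ssim_loss(pred_img, gt_img)
    return lam_3d * loss_3d + lam_2d * loss_2d


# Toy inputs; the constant SSIM term is a stand-in for a real SSIM-based loss.
total = attribute_loss(
    pred_vol=[0.0, 1.0], gt_vol=[0.0, 0.0],
    pred_img=[1.0, 1.0], gt_img=[0.0, 1.0],
    ssim_loss=lambda pred, gt: 0.2,
)
```

Setting `lam_2d=0.0` or `lam_3d=0.0` in this sketch corresponds to the "w/o \(\mathcal{L}_{2D}\)" and "w/o \(\mathcal{L}_{3D}\)" ablation settings, respectively.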
Loss & Training¶
- GaussianVolume Fitting Loss: \(\mathcal{L}_{fitting} = \lambda_1\mathcal{L}_1 + \lambda_2\mathcal{L}_{SSIM} + \lambda_3\mathcal{L}_{offsets}\)
- Offset regularization: \(\mathcal{L}_{offsets} = \text{Mean}(\text{ReLU}(|\Delta\mu| - \epsilon_{offsets}))\), which penalizes only offsets whose magnitude exceeds \(\epsilon_{offsets}\) and so keeps Gaussian centers from deviating too far from the grid nodes.
- GDF Diffusion Loss: MSE noise prediction loss.
- Attribute Prediction Loss: Combination of 3D MSE + 2D rendering loss.
- Training Data: Objaverse-LVIS (~46,000 3D models across 1,156 categories), with text descriptions from Cap3D (BLIP-2 + GPT-4).
- Volumetric resolution \(N=32\), i.e., 32,768 Gaussian points.
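The offset regularizer can be illustrated with a short sketch. It reads the term as a hinge on the offset magnitude, \(\text{ReLU}(|\Delta\mu| - \epsilon_{offsets})\), applied per coordinate and averaged; that reading, the per-coordinate treatment, and the sample values are assumptions.

```python
def offset_regularizer(offsets, eps):
    """Mean of ReLU(|delta_mu| - eps) over every offset coordinate."""
    terms = [max(abs(d) - eps, 0.0) for offset in offsets for d in offset]
    return sum(terms) / len(terms)


# Only the coordinate -0.3 exceeds eps = 0.1 in magnitude, so it alone is penalized.
penalty = offset_regularizer([(0.0, 0.05, -0.3), (0.1, 0.0, 0.0)], eps=0.1)

# Offsets that stay within eps of their grid node incur no penalty at all.
no_penalty = offset_regularizer([(0.05, 0.0, 0.0)], eps=0.1)
```

The hinge leaves small displacements free, so Gaussians can still move for fine detail, while large displacements that would break the grid structure are discouraged.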
Key Experimental Results¶
Main Results — Text-to-3D (CLIP Score + Speed)¶
| Method | CLIP Score↑ | Generation Time↓ | Method Type |
|---|---|---|---|
| DreamGaussian | 23.60 | ~3 min | Optimization |
| VolumeDiffusion | 25.09 | 7 sec | Feed-forward |
| Shap-E | 28.48 | 11 sec | Feed-forward |
| GVGEN (Ours) | 28.53 | 7 sec | Feed-forward |
Ablation Study — GaussianVolume Fitting Strategy¶
| Configuration | PSNR↑ | SSIM↑ | LPIPS↓ | Description |
|---|---|---|---|---|
| Full (CPS + offsets) | 30.122 | 0.963 | 0.038 | Best |
| w/o CPS | 29.677 | 0.958 | 0.049 | No candidate pool strategy, limited physical movement range |
| w/o offsets | 27.140 | 0.936 | 0.084 | Fixed Gaussian positions, sharp degradation in quality |
Ablation Study — Attribute Prediction Loss¶
| Configuration | PSNR↑ | SSIM↑ | LPIPS↓ | Description |
|---|---|---|---|---|
| Full (\(\mathcal{L}_{3D}\) + \(\mathcal{L}_{2D}\)) | 35.03 | 0.9872 | 0.0236 | Best SSIM and LPIPS; the two losses are complementary |
| w/o \(\mathcal{L}_{3D}\) | 35.21 | 0.9846 | 0.0268 | PSNR slightly higher but details (LPIPS) degraded |
| w/o \(\mathcal{L}_{2D}\) | 29.55 | 0.9654 | 0.0444 | 3D loss only, sharp degradation in quality |
Key Findings¶
- GVGEN is on par with Shap-E in terms of CLIP Score (28.53 vs 28.48) but runs faster (7s vs 11s).
- Removing position offsets (w/o offsets) results in a PSNR drop of about 3 dB, indicating that fine position adjustments are crucial for detail recovery.
- CPS contributes a PSNR gain of about 0.45 dB (29.677 → 30.122) and lowers LPIPS from 0.049 to 0.038.
- The 2D rendering loss is far more important than the 3D MSE loss (removing 2D loss drops PSNR by 5.5 dB).
- GVGEN can generate diverse results for the same text prompt, distinguishing itself from deterministic reconstruction methods.
- The image-conditioned version of GVGEN, compared to reconstruction methods like OpenLRM/TGS, generates more reasonable shapes and textures in unobserved regions.
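The PSNR gaps quoted in the findings above follow directly from the two ablation tables. The snippet below simply redoes that arithmetic; the dictionary keys are shorthand labels chosen here.

```python
# PSNR values copied from the two ablation tables.
fitting_psnr = {"full": 30.122, "w/o CPS": 29.677, "w/o offsets": 27.140}
attribute_psnr = {"full": 35.03, "w/o L3D": 35.21, "w/o L2D": 29.55}


def psnr_drop(table, key):
    """PSNR lost (in dB) relative to the full configuration."""
    return round(table["full"] - table[key], 3)


drop_cps = psnr_drop(fitting_psnr, "w/o CPS")          # about 0.45 dB
drop_offsets = psnr_drop(fitting_psnr, "w/o offsets")  # about 3 dB
drop_l2d = psnr_drop(attribute_psnr, "w/o L2D")        # about 5.5 dB
drop_l3d = psnr_drop(attribute_psnr, "w/o L3D")        # negative: PSNR rises slightly
```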
Highlights & Insights¶
- First work to directly generate 3D Gaussians in a feed-forward manner: Opens a new direction of text-to-3D Gaussian generation without relying on intermediate 2D multi-view steps.
- Structured GaussianVolume representation: Recasts the unstructured point-cloud problem as a regular volume problem, elegantly bridging Gaussians and neural networks.
- Candidate Pool Strategy: The densification/pruning strategy under the fixed-number constraint is clever and practical, resolving the conflict between structuring and adaptability.
- Coarse-to-fine decomposition strategy: Decomposes the high-dimensional generation problem into simple geometry (1-channel GDF) + conditional attribute prediction, effectively reducing learning difficulty for the diffusion model.
- Generation diversity: Capable of generating 3D objects with different appearances for the same text prompt, which is unattainable by deterministic reconstruction methods.
Limitations & Future Work¶
- Training data domain limitation: Performance degrades when input text deviates from the training domain.
- The volumetric resolution \(N=32\) is low, which limits the detail that can be rendered for complex textures (higher resolution requires more computational resources).
- GaussianVolume fitting is time-consuming, which makes scaling to million-scale training data difficult.
- The CLIP Score is only slightly higher than Shap-E, indicating remaining challenges in complex semantic understanding.
- Compared to reconstruction methods like GRM, there is still a gap in absolute generation quality metrics.
- Higher-order spherical harmonics are not used (SH order = 0), which limits view-dependent effects.
Related Work & Insights¶
- Comparison with 3D VADER/VolumeDiffusion: These methods also adopt volumetric representation + diffusion but use NeRF/implicit volumes; GVGEN uses 3D Gaussian volumes, providing explicit manipulability and faster rendering speeds.
- Comparison with GRM/LGM: GRM/LGM are reconstruction methods requiring multi-view inputs; GVGEN is a generative method generating directly from text, offering diversity.
- Comparison with DreamGaussian: DreamGaussian optimizes Gaussians using SDS, which takes minutes and suffers from over-saturation; GVGEN achieves feed-forward generation in just 7 seconds.
- Insights: Organizing unstructured representations (such as point clouds, Gaussians) into structured volumes serves as an effective bridge connecting classical representations with generative networks; the coarse-to-fine decomposition is a general strategy to mitigate high-dimensional generative complexity.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First feed-forward text-to-3D Gaussian framework, with highly novel GaussianVolume and CPS designs.
- Experimental Thoroughness: ⭐⭐⭐⭐ Qualitative and quantitative comparisons are sufficient, and ablation studies are clean, but quantitative metrics are relatively sparse (only CLIP Score).
- Writing Quality: ⭐⭐⭐⭐ Structure is clear, with standard pseudocode for the CPS algorithm, although some details require referring to the supplementary material.
- Value: ⭐⭐⭐⭐ Significant in opening a new direction, although there is still a quality gap compared to optimization-based methods; practical applications require higher resolution.