GVGEN: Text-to-3D Generation with Volumetric Representation

Conference: ECCV 2024
arXiv: 2403.12957
Code: https://gvgen.github.io/
Area: 3D Vision
Keywords: Text-to-3D, 3D Gaussian, Volumetric Representation, Feed-forward Generation, Diffusion Model

TL;DR

Proposes GVGEN, the first framework to directly generate 3D Gaussians from text in a feed-forward manner. By organizing unordered Gaussians into a structured volumetric representation (GaussianVolume) and designing a coarse-to-fine generation pipeline (generating geometric volumes first and then predicting Gaussian attributes), text-to-3D generation is completed in approximately 7 seconds.

Background & Motivation

Background: Text-to-3D generation is a popular direction in computer graphics, and existing methods fall into two main groups:
- Optimization-based methods (e.g., DreamFusion/SDS): good quality, but generation takes hours and exhibits the Janus problem.
- Feed-forward methods (e.g., Shap-E): fast but limited in quality, often adopting an indirect text → 2D → 3D route.

Limitations of Prior Work:
- Optimization methods have excessively high time costs (hour-level) and suffer from multi-face (Janus) and over-saturation issues.
- Among feed-forward methods, directly generating 3D Gaussians is almost unexplored because Gaussian points are unordered and high-dimensional.
- Point-cloud diffusion/generation methods are difficult to scale to high-dimensional Gaussian attributes.
- Methods using multi-view intermediate steps suffer from resolution loss and lack sufficient semantic understanding for complex prompts.

Key Challenge: 3D Gaussian points are inherently unstructured, making them difficult to handle directly with existing generative networks; meanwhile, Gaussian attributes are high-dimensional, making direct distribution learning difficult.

Goal: Design a method that can organize unordered 3D Gaussians into structured representations, and achieve direct feed-forward 3D Gaussian generation from text.

Key Insight: Organize a fixed number of Gaussians on a regular volumetric grid (GaussianVolume), and simplify generation via a two-stage (geometry → attribute) coarse-to-fine pipeline.

Core Idea: Structuring unordered Gaussians into volumetric grids + Candidate Pool Strategy (CPS) to maintain a fixed count + diffusing geometry first, then predicting attributes with 3D U-Net.

Method

Overall Architecture

GVGEN consists of two main stages:

Stage 1 — GaussianVolume Fitting (Data Preparation): Represents the unordered Gaussian points of each 3D object as a structured volume with a fixed resolution \(N^3\) (default \(N=32\), i.e., 32,768 Gaussian points), while extracting a Gaussian Distance Field (GDF) as a coarse geometric representation.

Stage 2 — Text-to-3D Generation (Generation): 1. A diffusion model generates the GDF (coarse geometric volume) according to the text. 2. A 3D U-Net predicts the complete GaussianVolume attributes based on the GDF and text.

Key Designs

GaussianVolume (Structured Gaussian Volume):
- Function: Organizes unordered 3D Gaussian points into a structured volume \(V \in \mathbb{R}^{C \times N \times N \times N}\)
- Mechanism: Place one Gaussian at each node of the volumetric grid, and use a position offset \(\Delta\mu\) to represent the small displacement from the grid node to the actual Gaussian center: \(\mu = p + \Delta\mu\), where \(p\) is the grid node position and \(\Delta\mu\) is a learnable offset. During training, position gradients are backpropagated only to the offsets (the grid nodes stay fixed), allowing fine-grained position adjustment while maintaining a structured form. Each Gaussian includes: position offset \(\Delta\mu \in \mathbb{R}^3\), scale \(s \in \mathbb{R}^3\), rotation quaternion \(q \in \mathbb{R}^4\), color \(c \in \mathbb{R}^3\), and opacity \(\alpha \in \mathbb{R}\), for \(C = 14\) channels in total.
- Design Motivation: Unstructured Gaussian points are unfriendly to 3D neural networks; prior attempts to use point cloud diffusion to generate high-dimensional Gaussians worked poorly; the volumetric format can seamlessly interface with existing volumetric generative networks.
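
As a minimal sketch of this layout (plain Python, not the authors' code), the snippet below builds the grid nodes, applies \(\mu = p + \Delta\mu\), and counts the per-Gaussian channels. The \([-1, 1]^3\) grid range, the helper names `grid_nodes` and `gaussian_centers`, and the small grid size are assumptions made for illustration.

```python
from itertools import product

# Per-Gaussian attribute widths as listed above
# (offset, scale, rotation quaternion, color, opacity).
CHANNELS = {"offset": 3, "scale": 3, "rotation": 4, "color": 3, "opacity": 1}


def grid_nodes(n):
    """Return the n**3 node positions of a regular grid (assumed to span [-1, 1]^3)."""
    axis = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return list(product(axis, repeat=3))


def gaussian_centers(nodes, offsets):
    """Apply mu = p + delta_mu node by node; the nodes themselves never move."""
    return [tuple(p_k + d_k for p_k, d_k in zip(p, d)) for p, d in zip(nodes, offsets)]


n = 4  # small grid for illustration; the source uses N = 32
nodes = grid_nodes(n)
offsets = [(0.0, 0.0, 0.0)] * len(nodes)  # zero offsets: centers sit on the nodes
offsets[0] = (0.05, 0.0, -0.02)           # nudge one Gaussian off its node
centers = gaussian_centers(nodes, offsets)

print(len(nodes))              # number of Gaussians, one per node (n**3)
print(sum(CHANNELS.values()))  # channel count C
print(nodes[0], "->", centers[0])
```

Because every Gaussian is tied to one node, the whole object fits in a dense \(C \times N \times N \times N\) array, which is what lets a standard 3D network consume it.
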
Candidate Pool Strategy (CPS):
- Function: Achieves effective pruning and densification under the constraint of a fixed number of Gaussians.
- Mechanism:
  - Initialize the candidate pool \(P = \emptyset\).
  - Pruning: Determine the points to be pruned \(G_p\) according to a gradient threshold \(\tau_p\), "deactivate" them and put them into the candidate pool (they do not participate in forward and backward passes).
  - Densification: Determine the points to be densified \(G_d\) according to a threshold \(\tau_d\), find the nearest deactivated point \(G_{new}\) from the candidate pool, and activate it near \(G_d\).
  - Optimization End: Release all points in the pool to participate in optimization again.
- Design Motivation: The original 3DGS free pruning/densification changes the total number of Gaussians, which is inapplicable to a fixed volumetric resolution. Without CPS, the movement range of Gaussian centers is limited, and the geometric quality drops (PSNR drops by 0.45 dB).
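
The steps above can be sketched as follows. This is an illustrative reading of CPS in plain Python, not the paper's pseudocode: the function names, the per-point scores, the direction of the threshold comparisons, and the small `jitter` used to place a re-activated point are all assumptions.

```python
import math


def cps_step(positions, active, pool, prune_score, densify_score,
             tau_p, tau_d, jitter=0.01):
    """One pruning + densification round that keeps len(positions) fixed.

    positions: list of [x, y, z] lists (mutated in place)
    active:    list of bools, one per point (mutated in place)
    pool:      set of indices of deactivated points (mutated in place)
    """
    # Pruning: deactivate selected points and move them into the candidate pool.
    for i, score in enumerate(prune_score):
        if active[i] and score < tau_p:  # assumed criterion
            active[i] = False
            pool.add(i)

    # Densification: pick targets first so freshly re-activated points are not reused.
    targets = [i for i, s in enumerate(densify_score) if active[i] and s > tau_d]
    for i in targets:
        if not pool:
            break  # nothing left to re-activate
        # Nearest deactivated point in the pool, measured to the target point.
        j = min(pool, key=lambda k: math.dist(positions[k], positions[i]))
        pool.remove(j)
        positions[j] = [c + jitter for c in positions[i]]  # place it near the target
        active[j] = True


def release_pool(active, pool):
    """At the end of optimization, let every pooled point participate again."""
    for i in pool:
        active[i] = True
    pool.clear()


positions = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
active = [True] * 4
pool = set()
cps_step(positions, active, pool,
         prune_score=[1.0, 0.0, 1.0, 0.0],
         densify_score=[0.0, 0.0, 5.0, 0.0],
         tau_p=0.5, tau_d=1.0)
print(len(positions), active, pool)
release_pool(active, pool)
```

The total number of slots never changes; only the active mask and the positions do, which is what keeps the representation compatible with a fixed \(N^3\) volume.
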
Gaussian Distance Field (GDF):
- Function: Serves as a coarse geometric representation, storing the distance from each grid node to the nearest Gaussian center.
- Mechanism: \(F \in (\mathbb{R}_0^{+})^{1 \times N \times N \times N}\), similar to an Unsigned Distance Field (UDF), extracted from the fitted GaussianVolume via sorting algorithms. A diffusion model is trained to predict the noise added to the GDF: \(\mathcal{L}_{3D} = \text{MSE}(\text{GDF}_{gt}, \text{GDF}_{pred})\)
- Design Motivation: Directly generating the full GaussianVolume is too high-dimensional (14 channels × 32³), making it difficult for the diffusion model to converge. Decomposing the problem into first generating a simple 1-channel geometric volume and then conditionally predicting the other attributes reduces complexity.
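
One way to picture the GDF is as a nearest-center query from every grid node. The sketch below does exactly that in plain Python; it swaps the sorting-based extraction mentioned above for a brute-force minimum, and the \([-1, 1]^3\) grid range and the function name are assumptions.

```python
import math


def gaussian_distance_field(centers, n):
    """Return an n x n x n nested list of distances to the nearest Gaussian center.

    Grid nodes are assumed to span [-1, 1]^3. Brute force is used for clarity,
    so this is only practical for small n or few centers.
    """
    axis = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return [
        [[min(math.dist((x, y, z), c) for c in centers) for z in axis] for y in axis]
        for x in axis
    ]


# Two Gaussian centers at opposite corners of the cube.
centers = [(-1.0, -1.0, -1.0), (1.0, 1.0, 1.0)]
gdf = gaussian_distance_field(centers, n=3)

print(gdf[0][0][0])  # node coincides with a center
print(gdf[1][1][1])  # cube midpoint, equally far from both centers
```

At \(N = 32\) the GDF holds 32,768 values, against 14 × 32,768 = 458,752 for the full GaussianVolume, which is the dimensionality gap the two-stage design exploits.
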
3D U-Net Attribute Predictor:
- Function: Predicts the complete GaussianVolume attributes in a single step based on the generated GDF and text conditions.
- Mechanism: Based on a modified 3D U-Net from SDFusion, taking the GDF and text conditions as inputs and outputting all Gaussian attributes. It is trained with a multimodal loss \(\mathcal{L} = \lambda_{3D}\mathcal{L}_{3D} + \lambda_{2D}\mathcal{L}_{2D}\), where \(\mathcal{L}_{2D} = \lambda\mathcal{L}_1 + (1-\lambda)\mathcal{L}_{SSIM}\).
- Design Motivation: Experiments found that the single-step reconstruction model and the diffusion model performed comparably in attribute prediction, but the former was faster. The multimodal loss (3D semantics + 2D rendering) balances global consistency and local details.
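
To make the weighting concrete, here is a small sketch of how the two terms combine. Flat lists stand in for volumes and rendered images, the default weights are placeholders rather than values from the paper, and `ssim_loss` is a hypothetical callable supplied by the caller (a stub is used in the example).

```python
def mse_loss(pred, target):
    """Mean squared error over two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1_loss(pred, target):
    """Mean absolute error over two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def attribute_loss(vol_pred, vol_gt, img_pred, img_gt, ssim_loss,
                   lam_3d=1.0, lam_2d=1.0, lam=0.8):
    """L = lam_3d * L_3D + lam_2d * L_2D, with L_2D = lam * L_1 + (1 - lam) * L_SSIM."""
    l_3d = mse_loss(vol_pred, vol_gt)  # 3D term on the predicted volume
    l_2d = lam * l1_loss(img_pred, img_gt) + (1.0 - lam) * ssim_loss(img_pred, img_gt)
    return lam_3d * l_3d + lam_2d * l_2d


def stub_ssim_loss(pred, target):
    """Stand-in for a real SSIM-based loss; returns a fixed value for the demo."""
    return 0.2


total = attribute_loss(
    vol_pred=[0.0, 1.0], vol_gt=[0.0, 0.0],
    img_pred=[0.5, 0.5], img_gt=[0.0, 1.0],
    ssim_loss=stub_ssim_loss,
)
print(total)
```

Setting `lam_2d=0` or `lam_3d=0` reproduces the two ablation settings in which one term is removed.
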

Loss & Training

GaussianVolume Fitting Loss: \(\mathcal{L}_{fitting} = \lambda_1\mathcal{L}_1 + \lambda_2\mathcal{L}_{SSIM} + \lambda_3\mathcal{L}_{offsets}\)
- Offset regularization: \(\mathcal{L}_{offsets} = \text{Mean}(\text{ReLU}(|\Delta\mu| - \epsilon_{offsets}))\), which penalizes only offsets whose magnitude exceeds \(\epsilon_{offsets}\) and thus restricts Gaussian centers from deviating too far from the grid nodes.
GDF Diffusion Loss: MSE noise prediction loss.
Attribute Prediction Loss: Combination of 3D MSE + 2D rendering loss.
Training Data: Objaverse-LVIS (~46,000 3D models across 1,156 categories), with text descriptions from Cap3D (BLIP-2 + GPT-4).
Volumetric resolution \(N=32\), i.e., 32,768 Gaussian points.
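
The offset regularizer is easy to check numerically. The sketch below assumes the hinge form \(\text{ReLU}(|\Delta\mu| - \epsilon_{offsets})\) applied per component, so offsets inside the tolerance cost nothing; the function name and the sample values are illustrative only.

```python
def offset_regularizer(offsets, eps):
    """Mean over all offset components of ReLU(|delta_mu| - eps)."""
    terms = [max(abs(d) - eps, 0.0) for offset in offsets for d in offset]
    return sum(terms) / len(terms)


offsets = [
    (0.0, 0.1, -0.3),  # only the -0.3 component exceeds eps
    (0.05, 0.0, 0.0),  # entirely within tolerance
]
print(offset_regularizer(offsets, eps=0.1))
print(offset_regularizer([(0.05, -0.05, 0.0)], eps=0.1))  # all components within eps
```

Only the single component that leaves the tolerance band contributes, so the penalty grows with how far a center drifts from its node rather than with how many Gaussians move slightly.
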

Key Experimental Results

Main Results: Text-to-3D (CLIP Score + Speed)

Method	CLIP Score↑	Generation Time↓	Method Type
DreamGaussian	23.60	~3 min	Optimization
VolumeDiffusion	25.09	7 sec	Feed-forward
Shap-E	28.48	11 sec	Feed-forward
GVGEN (Ours)	28.53	7 sec	Feed-forward

Ablation Study: GaussianVolume Fitting Strategy

Configuration	PSNR↑	SSIM↑	LPIPS↓	Description
Full (CPS + offsets)	30.122	0.963	0.038	Best
w/o CPS	29.677	0.958	0.049	No candidate pool strategy, limited physical movement range
w/o offsets	27.140	0.936	0.084	Fixed Gaussian positions, sharp degradation in quality

Ablation Study: Attribute Prediction Loss

Configuration	PSNR↑	SSIM↑	LPIPS↓	Description
Full (\(\mathcal{L}_{3D}\) + \(\mathcal{L}_{2D}\))	35.03	0.9872	0.0236	Best SSIM and LPIPS; the two losses are complementary
w/o \(\mathcal{L}_{3D}\)	35.21	0.9846	0.0268	PSNR slightly higher but details (LPIPS) degraded
w/o \(\mathcal{L}_{2D}\)	29.55	0.9654	0.0444	3D loss only, sharp degradation in quality

Key Findings

GVGEN is on par with Shap-E in terms of CLIP Score (28.53 vs 28.48) but runs faster (7s vs 11s).
Removing position offsets (w/o offsets) results in a PSNR drop of about 3 dB, indicating that fine position adjustments are crucial for detail recovery.
The CPS strategy contributes to a PSNR gain of ~0.45 dB and a significant improvement in LPIPS.
The 2D rendering loss is far more important than the 3D MSE loss (removing 2D loss drops PSNR by 5.5 dB).
GVGEN can generate diverse results for the same text prompt, distinguishing itself from deterministic reconstruction methods.
The image-conditioned version of GVGEN, compared to reconstruction methods like OpenLRM/TGS, generates more reasonable shapes and textures in unobserved regions.
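
The PSNR differences quoted in these findings follow directly from the two ablation tables. The short check below recomputes them; the dictionary keys are just labels chosen here.

```python
# PSNR values copied from the two ablation tables above.
fitting_psnr = {"full": 30.122, "w/o CPS": 29.677, "w/o offsets": 27.140}
attribute_psnr = {"full": 35.03, "w/o L3D": 35.21, "w/o L2D": 29.55}


def psnr_drop(table, key):
    """PSNR lost (in dB) relative to the full configuration; negative means a gain."""
    return round(table["full"] - table[key], 3)


for name, table in (("fitting", fitting_psnr), ("attribute", attribute_psnr)):
    for key in table:
        if key != "full":
            print(f"{name:9s} {key:12s} {psnr_drop(table, key):+.3f} dB")
```

The one negative entry (removing \(\mathcal{L}_{3D}\)) reflects the slightly higher PSNR of that variant, which the table offsets with worse SSIM and LPIPS.
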

Highlights & Insights

First work to directly generate 3D Gaussians in a feed-forward manner: Opens a new direction of text-to-3D Gaussian generation without relying on intermediate 2D multi-view steps.
Structuring idea behind GaussianVolume: Reformulates the unstructured point-cloud problem as a regular volume problem, elegantly bridging Gaussians and neural networks.
Candidate Pool Strategy: The densification/pruning strategy under the fixed-number constraint is clever and practical, resolving the conflict between structuring and adaptability.
Coarse-to-fine decomposition strategy: Decomposes the high-dimensional generation problem into simple geometry (1-channel GDF) + conditional attribute prediction, effectively reducing learning difficulty for the diffusion model.
Generation diversity: Capable of generating 3D objects with different appearances for the same text prompt, which is unattainable by deterministic reconstruction methods.

Limitations & Future Work

Training data domain limitation: Performance degrades when input text deviates from the training domain.
The volumetric resolution \(N=32\) is low, which limits the detail that can be rendered for complex textures (higher resolution requires more computational resources).
Fitting GaussianVolumes is time-consuming, which would make scaling to million-object training sets costly (the current training set has ~46,000 models).
The CLIP Score is only slightly higher than Shap-E, indicating remaining challenges in complex semantic understanding.
Compared to reconstruction methods like GRM, there is still a gap in absolute generation quality metrics.
High-order spherical harmonics are not used (SH order=0), limiting view-dependent effects.

Comparison with 3D VADER/VolumeDiffusion: These methods also adopt volumetric representation + diffusion but use NeRF/implicit volumes; GVGEN uses 3D Gaussian volumes, providing explicit manipulability and faster rendering speeds.
Comparison with GRM/LGM: GRM/LGM are reconstruction methods requiring multi-view inputs; GVGEN is a generative method generating directly from text, offering diversity.
Comparison with DreamGaussian: DreamGaussian optimizes Gaussians using SDS, which takes minutes and suffers from over-saturation; GVGEN achieves feed-forward generation in just 7 seconds.
Insights: Organizing unstructured representations (such as point clouds, Gaussians) into structured volumes serves as an effective bridge connecting classical representations with generative networks; the coarse-to-fine decomposition is a general strategy to mitigate high-dimensional generative complexity.

Rating

Novelty: ⭐⭐⭐⭐⭐ First feed-forward text-to-3D Gaussian framework, with highly novel GaussianVolume and CPS designs.
Experimental Thoroughness: ⭐⭐⭐⭐ Qualitative and quantitative comparisons are sufficient, and ablation studies are clean, but quantitative metrics are relatively sparse (only CLIP Score).
Writing Quality: ⭐⭐⭐⭐ Structure is clear, with standard pseudocode for the CPS algorithm, although some details require referring to the supplementary material.
Value: ⭐⭐⭐⭐ Significant in opening a new direction, although there is still a quality gap compared to optimization-based methods; practical applications require higher resolution.