GVGEN: Text-to-3D Generation with Volumetric Representation

Conference: ECCV 2024
arXiv: 2403.12957
Code: https://gvgen.github.io/
Area: 3D Vision
Keywords: Text-to-3D, 3D Gaussian, Volumetric Representation, Feed-forward Generation, Diffusion Model

TL;DR

Proposes GVGEN, the first framework to directly generate 3D Gaussians from text in a feed-forward manner. By organizing unordered Gaussians into a structured volumetric representation (GaussianVolume) and designing a coarse-to-fine generation pipeline (generating geometric volumes first and then predicting Gaussian attributes), text-to-3D generation is completed in approximately 7 seconds.

Background & Motivation

Background: Text-to-3D generation is a popular direction in computer graphics, and existing methods fall into two main groups:

  • Optimization-based methods (e.g., DreamFusion/SDS): good quality, but generation takes hours and exhibits the Janus problem.
  • Feed-forward methods (e.g., Shap-E): fast but limited in quality, often adopting an indirect text → 2D → 3D route.

Limitations of Prior Work:

  • Optimization-based methods have excessively high time costs (on the order of hours) and suffer from multi-face (Janus) and over-saturation issues.
  • Among feed-forward methods, directly generating 3D Gaussians is almost unexplored because Gaussian points are unordered and high-dimensional.
  • Point-cloud diffusion/generation methods are difficult to scale to high-dimensional Gaussian attributes.
  • Methods that use multi-view intermediate steps suffer from resolution loss and lack sufficient semantic understanding for complex prompts.

Key Challenge: 3D Gaussian points are inherently unstructured, making them difficult to handle directly with existing generative networks; meanwhile, Gaussian attributes are high-dimensional, making direct distribution learning difficult.

Goal: Design a method that can organize unordered 3D Gaussians into structured representations, and achieve direct feed-forward 3D Gaussian generation from text.

Key Insight: Organize Gaussians into a volumetric grid that holds a fixed number of Gaussians (GaussianVolume), and simplify generation via a two-stage (geometry → attribute) coarse-to-fine pipeline.

Core Idea: Structure unordered Gaussians into a volumetric grid, use the Candidate Pool Strategy (CPS) to keep the Gaussian count fixed, and generate geometry with a diffusion model before predicting attributes with a 3D U-Net.

Method

Overall Architecture

GVGEN consists of two main stages:

Stage 1 — GaussianVolume Fitting (Data Preparation): Represents the unordered Gaussian points of each 3D object as a structured volume with a fixed resolution \(N^3\) (default \(N=32\), i.e., 32,768 Gaussian points), while extracting a Gaussian Distance Field (GDF) as a coarse geometric representation.

Stage 2 — Text-to-3D Generation (Generation): 1. A diffusion model generates the GDF (coarse geometric volume) according to the text. 2. A 3D U-Net predicts the complete GaussianVolume attributes based on the GDF and text.
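The two volumes named in the stages above can be made concrete with a small NumPy sketch. It assumes one Gaussian per grid node whose center is the node position plus a small offset, and it computes the distance field by brute-force nearest-neighbour search, which is a stand-in chosen for clarity rather than the extraction procedure used by the authors. The grid extent `[-1, 1]^3`, the channel ordering, and the names `grid_nodes`, `gaussian_centers`, and `distance_field` are hypothetical.

```python
import numpy as np

# Per-Gaussian channels: offset, scale, rotation quaternion, color, opacity.
# 3 + 3 + 4 + 3 + 1 = 14 channels per grid node.
CHANNELS = {"offset": 3, "scale": 3, "rotation": 4, "color": 3, "opacity": 1}


def grid_nodes(n: int) -> np.ndarray:
    """Node positions p of an n^3 grid, shape (n, n, n, 3).

    The [-1, 1]^3 extent is an assumption made for this sketch.
    """
    axis = np.linspace(-1.0, 1.0, n)
    return np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)


def gaussian_centers(nodes: np.ndarray, offsets: np.ndarray) -> np.ndarray:
    """Gaussian centers mu = p + delta_mu, one per grid node."""
    return nodes + offsets


def distance_field(nodes: np.ndarray, centers: np.ndarray, chunk: int = 512) -> np.ndarray:
    """Distance from every grid node to its nearest Gaussian center.

    Brute-force search, processed in chunks to bound memory.
    Returns a non-negative array of shape (1, n, n, n).
    """
    n = nodes.shape[0]
    q = nodes.reshape(-1, 3)
    c = centers.reshape(-1, 3)
    c_sq = (c ** 2).sum(axis=1)
    out = np.empty(len(q))
    for s in range(0, len(q), chunk):
        block = q[s:s + chunk]
        # Squared distances via |q|^2 + |c|^2 - 2 q.c, clipped against round-off.
        d2 = (block ** 2).sum(axis=1)[:, None] + c_sq[None, :] - 2.0 * block @ c.T
        out[s:s + chunk] = np.sqrt(np.clip(d2, 0.0, None)).min(axis=1)
    return out.reshape(1, n, n, n)


n = 8  # small grid for a quick check; the text uses N = 32
nodes = grid_nodes(n)
rng = np.random.default_rng(0)
offsets = rng.uniform(-0.05, 0.05, size=nodes.shape)

# A 14-channel volume; placing the offset channels first is an assumption.
volume = np.zeros((sum(CHANNELS.values()), n, n, n))
volume[:3] = np.moveaxis(offsets, -1, 0)

gdf = distance_field(nodes, gaussian_centers(nodes, offsets))
print(volume.shape, gdf.shape)
```

At \(N=32\) the same functions produce a 14-channel attribute volume and a 1-channel distance volume over 32,768 nodes, which is the size gap that motivates generating the geometry first.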

Key Designs

  1. GaussianVolume (Structured Gaussian Volume):

    • Function: Organizes unordered 3D Gaussian points into a structured volume \(V \in \mathbb{R}^{C \times N \times N \times N}\)
    • Mechanism: Place a Gaussian point at each node of the volumetric grid, and use a position offset \(\Delta\mu\) to represent the small displacement from the grid node to the actual Gaussian center: \(\mu = p + \Delta\mu\), where \(p\) is the grid node position and \(\Delta\mu\) is a learnable offset. During training, the grid node positions stay fixed and position gradients are backpropagated only to the offsets, allowing fine-grained position adjustment while maintaining the structured form. Each Gaussian includes: position offset \(\Delta\mu \in \mathbb{R}^3\), scale \(s \in \mathbb{R}^3\), rotation quaternion \(q \in \mathbb{R}^4\), color \(c \in \mathbb{R}^3\), and opacity \(\alpha \in \mathbb{R}\), for \(3+3+4+3+1 = 14\) channels in total.
    • Design Motivation: Unstructured Gaussian points are unfriendly to 3D neural networks; prior attempts to use point cloud diffusion to generate high-dimensional Gaussians worked poorly; the volumetric format can seamlessly interface with existing volumetric generative networks.
  2. Candidate Pool Strategy (CPS):

    • Function: Achieves effective pruning and densification under the constraint of a fixed number of Gaussians.
    • Mechanism:
      • Initialize the candidate pool \(P = \emptyset\).
      • Pruning: Determine the points to be pruned \(G_p\) according to a gradient threshold \(\tau_p\), "deactivate" them and put them into the candidate pool (they do not participate in forward and backward passes).
      • Densification: Determine the points to be densified \(G_d\) according to a threshold \(\tau_d\), find the nearest deactivated point \(G_{new}\) from the candidate pool, and activate it near \(G_d\).
      • Optimization End: Release all points in the pool to participate in optimization again.
    • Design Motivation: The unconstrained pruning/densification of the original 3DGS changes the total number of Gaussians, which is incompatible with a fixed volumetric resolution. Without CPS, the movement range of Gaussian centers is limited and the fitting quality drops (PSNR falls by 0.45 dB).
  3. Gaussian Distance Field (GDF):

    • Function: Serves as a coarse geometric representation, storing the distance from each grid node to the nearest Gaussian center.
    • Mechanism: \(F \in \mathbb{R}_{\geq 0}^{1 \times N \times N \times N}\), similar to the Unsigned Distance Field (UDF), extracted from the fitted GaussianVolume via sorting algorithms. Train a diffusion model to predict the noise of the noisy GDF: \(\mathcal{L}_{3D} = \text{MSE}(\text{GDF}_{gt}, \text{GDF}_{pred})\)
    • Design Motivation: Directly generating the full GaussianVolume is too high-dimensional (14 channels × 32³), making it difficult for the diffusion model to converge. Decomposing the problem into first generating a simple 1-channel geometric volume and then conditionally predicting the other attributes reduces the complexity.
  4. 3D U-Net Attribute Predictor:

    • Function: Predicts the complete GaussianVolume attributes in a single step based on the generated GDF and text conditions.
    • Mechanism: Based on a modified 3D U-Net from SDFusion, taking the GDF and text conditions as inputs and outputting all Gaussian attributes. It uses a multimodal loss \(\mathcal{L} = \lambda_{3D}\mathcal{L}_{3D} + \lambda_{2D}\mathcal{L}_{2D}\), where the 2D rendering term is \(\mathcal{L}_{2D} = \lambda\mathcal{L}_1 + (1-\lambda)\mathcal{L}_{SSIM}\).
    • Design Motivation: Experiments found that the single-step reconstruction model and the diffusion model performed comparably in attribute prediction, but the former was faster. The multimodal loss (3D semantics + 2D rendering) balances global consistency and local details.
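The Candidate Pool Strategy steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the caller supplies boolean `prune_mask` and `densify_mask` arrays (standing in for whatever criteria the thresholds \(\tau_p\) and \(\tau_d\) are applied to), "nearest" is measured from the pooled point's grid node to the center of the point being densified, and the reactivated point is moved toward that center only as far as a hypothetical `max_offset` bound allows.

```python
import numpy as np


def cps_step(nodes, offsets, active, prune_mask, densify_mask, max_offset=0.25):
    """One pruning + densification step that keeps the Gaussian count fixed.

    nodes:        (M, 3) fixed grid-node positions p
    offsets:      (M, 3) learnable offsets, center = p + offset
    active:       (M,) bool; inactive points form the candidate pool
    prune_mask:   (M,) bool; points selected for pruning
    densify_mask: (M,) bool; points selected for densification
    Returns updated copies of (offsets, active).
    """
    offsets = offsets.copy()
    # Pruning: deactivate the selected points so they join the pool.
    active = active & ~prune_mask

    # Densification: for each selected active point, wake the nearest pooled point.
    for i in np.flatnonzero(densify_mask & active):
        pool = np.flatnonzero(~active)
        if pool.size == 0:
            break  # pool exhausted, nothing left to reactivate
        target_center = nodes[i] + offsets[i]
        dist = np.linalg.norm(nodes[pool] - target_center, axis=1)
        j = pool[np.argmin(dist)]
        # Place the reactivated point near the target, within the offset bound.
        offsets[j] = np.clip(target_center - nodes[j], -max_offset, max_offset)
        active[j] = True
    return offsets, active


def release_pool(active):
    """End of optimization: every point takes part again."""
    return np.ones_like(active, dtype=bool)


# Four nodes on a line: prune nodes 1 and 3, densify around node 0.
nodes = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0], [3.0, 0, 0]])
offsets = np.zeros_like(nodes)
active = np.ones(4, dtype=bool)
prune = np.array([False, True, False, True])
densify = np.array([True, False, False, False])
offsets, active = cps_step(nodes, offsets, active, prune, densify)
print(active, offsets[1])
```

The array length never changes: pruning and densification only flip entries of `active`, which is what keeps the representation compatible with a fixed \(N^3\) volume.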

Loss & Training

  • GaussianVolume Fitting Loss: \(\mathcal{L}_{fitting} = \lambda_1\mathcal{L}_1 + \lambda_2\mathcal{L}_{SSIM} + \lambda_3\mathcal{L}_{offsets}\)
    • Offset regularization: \(\mathcal{L}_{offsets} = \text{Mean}(\text{ReLU}(|\Delta\mu| - \epsilon_{offsets}))\), which penalizes only offsets whose magnitude exceeds the tolerance \(\epsilon_{offsets}\), restricting Gaussian centers from deviating too far from the grid nodes.
  • GDF Diffusion Loss: MSE noise prediction loss.
  • Attribute Prediction Loss: Combination of 3D MSE + 2D rendering loss.
  • Training Data: Objaverse-LVIS (~46,000 3D models across 1,156 categories), with text descriptions from Cap3D (BLIP-2 + GPT-4).
  • Volumetric resolution \(N=32\), i.e., 32,768 Gaussian points.
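The loss terms above combine by simple weighting, as the following sketch shows. It reads the offset regularizer as a hinge on the offset magnitude, \(\text{ReLU}(|\Delta\mu| - \epsilon_{offsets})\), which is an assumption about the intended form. The \(\mathcal{L}_1\), SSIM, and 3D terms are taken as precomputed scalars, and every weight and tolerance value used here is a placeholder, not a setting reported in the source.

```python
import numpy as np


def offsets_loss(offsets: np.ndarray, eps: float) -> float:
    """Mean(ReLU(|delta_mu| - eps)): zero within the tolerance, linear beyond it."""
    return float(np.mean(np.maximum(np.abs(offsets) - eps, 0.0)))


def fitting_loss(l1: float, ssim_term: float, l_offsets: float,
                 lam1: float, lam2: float, lam3: float) -> float:
    """L_fitting = lam1 * L1 + lam2 * L_SSIM + lam3 * L_offsets."""
    return lam1 * l1 + lam2 * ssim_term + lam3 * l_offsets


def rendering_loss(l1: float, ssim_term: float, lam: float) -> float:
    """L_2D = lam * L1 + (1 - lam) * L_SSIM."""
    return lam * l1 + (1.0 - lam) * ssim_term


def attribute_loss(l3d: float, l2d: float, lam3d: float, lam2d: float) -> float:
    """L = lam3d * L_3D + lam2d * L_2D."""
    return lam3d * l3d + lam2d * l2d


# Placeholder values only: one offset component exceeds the tolerance.
offsets = np.array([[0.1, -0.5, 0.0]])
reg = offsets_loss(offsets, eps=0.2)
print(reg)
print(fitting_loss(l1=0.05, ssim_term=0.1, l_offsets=reg, lam1=0.8, lam2=0.2, lam3=10.0))
```

Because the hinge is zero inside the tolerance, small offsets are free to move and only centers that drift far from their grid node are pulled back.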

Key Experimental Results

Main Results — Text-to-3D (CLIP Score + Speed)

| Method | CLIP Score ↑ | Generation Time ↓ | Method Type |
|---|---|---|---|
| DreamGaussian | 23.60 | ~3 min | Optimization |
| VolumeDiffusion | 25.09 | 7 sec | Feed-forward |
| Shap-E | 28.48 | 11 sec | Feed-forward |
| GVGEN (Ours) | 28.53 | 7 sec | Feed-forward |

Ablation Study — GaussianVolume Fitting Strategy

| Configuration | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Description |
|---|---|---|---|---|
| Full (CPS + offsets) | 30.122 | 0.963 | 0.038 | Best |
| w/o CPS | 29.677 | 0.958 | 0.049 | No candidate pool strategy, limited movement range of Gaussian centers |
| w/o offsets | 27.140 | 0.936 | 0.084 | Fixed Gaussian positions, sharp degradation in quality |

Ablation Study — Attribute Prediction Loss

| Configuration | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Description |
|---|---|---|---|---|
| Full (\(\mathcal{L}_{3D}\) + \(\mathcal{L}_{2D}\)) | 35.03 | 0.9872 | 0.0236 | Best SSIM and LPIPS; the two losses complement each other |
| w/o \(\mathcal{L}_{3D}\) | 35.21 | 0.9846 | 0.0268 | PSNR slightly higher, but details (LPIPS) degraded |
| w/o \(\mathcal{L}_{2D}\) | 29.55 | 0.9654 | 0.0444 | 3D loss only, sharp degradation in quality |

Key Findings

  • GVGEN is on par with Shap-E in terms of CLIP Score (28.53 vs 28.48) but runs faster (7s vs 11s).
  • Removing position offsets (w/o offsets) results in a PSNR drop of about 3 dB, indicating that fine position adjustments are crucial for detail recovery.
  • The CPS strategy contributes to a PSNR gain of ~0.45 dB and a significant improvement in LPIPS.
  • The 2D rendering loss is far more important than the 3D MSE loss (removing 2D loss drops PSNR by 5.5 dB).
  • GVGEN can generate diverse results for the same text prompt, distinguishing itself from deterministic reconstruction methods.
  • The image-conditioned version of GVGEN, compared to reconstruction methods like OpenLRM/TGS, generates more reasonable shapes and textures in unobserved regions.
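The gaps quoted in the findings above follow directly from the ablation tables. The short check below recomputes them from the reported PSNR values. The dictionary keys are just labels chosen for this sketch.

```python
# PSNR values copied from the two ablation tables.
fitting_psnr = {"full": 30.122, "w/o CPS": 29.677, "w/o offsets": 27.140}
attribute_psnr = {"full": 35.03, "w/o L3D": 35.21, "w/o L2D": 29.55}


def psnr_drop(table: dict, variant: str) -> float:
    """PSNR lost relative to the full configuration (negative means a gain)."""
    return table["full"] - table[variant]


cps_gap = psnr_drop(fitting_psnr, "w/o CPS")
offsets_gap = psnr_drop(fitting_psnr, "w/o offsets")
l2d_gap = psnr_drop(attribute_psnr, "w/o L2D")
l3d_gap = psnr_drop(attribute_psnr, "w/o L3D")

for name, gap in [("CPS", cps_gap), ("offsets", offsets_gap),
                  ("L2D", l2d_gap), ("L3D", l3d_gap)]:
    print(f"removing {name}: PSNR change of {-gap:+.3f} dB")
```

The negative drop for the \(\mathcal{L}_{3D}\) row reflects that PSNR is marginally higher without the 3D loss, while SSIM and LPIPS are better with it.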

Highlights & Insights

  • First work to directly generate 3D Gaussians in a feed-forward manner: Opens a new direction of text-to-3D Gaussian generation without relying on intermediate 2D multi-view steps.
  • Structured representation via GaussianVolume: Reformulates the unstructured point cloud problem as a regular volume problem, elegantly bridging Gaussians and neural networks.
  • Candidate Pool Strategy: The densification/pruning strategy under the fixed-number constraint is clever and practical, resolving the conflict between structuring and adaptability.
  • Coarse-to-fine decomposition strategy: Decomposes the high-dimensional generation problem into simple geometry (1-channel GDF) + conditional attribute prediction, effectively reducing learning difficulty for the diffusion model.
  • Generation diversity: Capable of generating 3D objects with different appearances for the same text prompt, which is unattainable by deterministic reconstruction methods.

Limitations & Future Work

  • Training data domain limitation: Performance degrades when input text deviates from the training domain.
  • The volumetric resolution \(N=32\) is low, which limits the detail that can be rendered for complex textures (higher resolution requires more computational resources).
  • Fitting GaussianVolume for million-scale training data is time-consuming.
  • The CLIP Score is only slightly higher than Shap-E, indicating remaining challenges in complex semantic understanding.
  • Compared to reconstruction methods like GRM, there is still a gap in absolute generation quality metrics.
  • High-order spherical harmonics are not used (SH order=0), limiting view-dependent effects.

Related Work & Insights

  • Comparison with 3D VADER/VolumeDiffusion: These methods also adopt volumetric representation + diffusion but use NeRF/implicit volumes; GVGEN uses 3D Gaussian volumes, providing explicit manipulability and faster rendering speeds.
  • Comparison with GRM/LGM: GRM/LGM are reconstruction methods requiring multi-view inputs; GVGEN is a generative method generating directly from text, offering diversity.
  • Comparison with DreamGaussian: DreamGaussian optimizes Gaussians using SDS, which takes minutes and suffers from over-saturation; GVGEN achieves feed-forward generation in just 7 seconds.
  • Insights: Organizing unstructured representations (such as point clouds, Gaussians) into structured volumes serves as an effective bridge connecting classical representations with generative networks; the coarse-to-fine decomposition is a general strategy to mitigate high-dimensional generative complexity.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First feed-forward text-to-3D Gaussian framework, with highly novel GaussianVolume and CPS designs.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Qualitative and quantitative comparisons are sufficient, and ablation studies are clean, but quantitative metrics are relatively sparse (only CLIP Score).
  • Writing Quality: ⭐⭐⭐⭐ Structure is clear, with standard pseudocode for the CPS algorithm, although some details require referring to the supplementary material.
  • Value: ⭐⭐⭐⭐ Significant in opening a new direction, although there is still a quality gap compared to optimization-based methods; practical applications require higher resolution.