Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation¶
Metadata¶
- Conference: ICCV 2025
- arXiv: 2503.15877
- Code: Project Page
- Area: 3D Generation · Diffusion Models
- Keywords: 3D Gaussian, 2D diffusion model transfer, Gaussian Atlas, large-scale dataset, text-to-3D
TL;DR¶
This paper proposes the Gaussian Atlas representation, which maps unordered 3D Gaussians onto a sphere via optimal transport and then flattens them into a structured 2D grid, enabling direct fine-tuning of pretrained 2D Latent Diffusion models for high-quality text-to-3D generation.
Background & Motivation¶
The development of 3D diffusion models is constrained by the scarcity of high-quality 3D data, leaving their performance far behind that of 2D counterparts. Existing methods attempt to leverage 2D diffusion model priors for 3D generation, but face the following limitations:
- Data bottleneck: Creating and annotating high-quality 3D models is extremely costly, and the resulting data scale is far smaller than that of 2D images (billions of samples).
- Indirect utilization: Methods such as Score Distillation Sampling (SDS) use frozen 2D model weights only indirectly, resulting in low efficiency and limited quality.
- Representation gap: The unordered point-set structure of 3D Gaussians cannot be directly fed into 2D networks; a structured 2D representation is required to enable transfer learning.
Mechanism: The paper designs a method (Gaussian Atlas) that maps 3D Gaussians into a structured 2D grid, allowing pretrained 2D diffusion models to be directly fine-tuned for 3D generation.
Method¶
Overall Architecture¶
The pipeline consists of two stages:
1. 3DGS pre-fitting stage: Pre-fitting high-quality 3D Gaussians for a large collection of 3D objects to construct the GaussianVerse dataset.
2. Diffusion model training stage: Converting 3D Gaussians into the Gaussian Atlas 2D representation and fine-tuning the UNet of a pretrained Latent Diffusion model.
GaussianVerse Dataset¶
A large-scale dataset containing 205,737 high-quality 3DGS fits is constructed based on Scaffold-GS. Key improvements include:
- Visibility-ranked pruning strategy: Rather than enforcing a fixed count constraint, an upper bound of \(\tau=36,864\) (i.e., \(192 \times 192\)) is set; Gaussians are ranked by their opacity under random camera viewpoints, and those with the lowest visibility are pruned.
- Perceptual loss augmentation: LPIPS perceptual loss is incorporated into the fitting objective to achieve higher rendering fidelity with fewer Gaussians.
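The visibility-ranked pruning above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name and the assumption that per-Gaussian visibility scores have already been accumulated over random viewpoints are both hypothetical.

```python
import numpy as np

def visibility_prune(visibility, tau=36_864):
    """Hedged sketch of visibility-ranked pruning: keep at most tau
    Gaussians, dropping those with the lowest aggregate visibility.

    visibility: (N,) opacity-weighted visibility per Gaussian, assumed
                precomputed over random camera viewpoints.
    Returns the (sorted) indices of the Gaussians to keep.
    """
    n = len(visibility)
    if n <= tau:
        return np.arange(n)                  # under the bound: keep everything
    keep = np.argsort(visibility)[::-1][:tau]  # the tau most visible Gaussians
    return np.sort(keep)

# toy usage: 5 Gaussians, budget of 3 -> the 2 least visible are pruned
vis = np.array([0.9, 0.1, 0.5, 0.05, 0.7])
print(visibility_prune(vis, tau=3))  # -> [0 2 4]
```

Unlike a hard fixed-count constraint, this only prunes when the fit exceeds the budget, so simple objects keep their natural Gaussian count.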
The fitting loss augments the standard Scaffold-GS rendering objective with the perceptual term:
\[
\mathcal{L}_{fit} = \mathcal{L}_{render} + \lambda_{lpips}\,\mathcal{L}_{lpips}
\]
Each object requires approximately 10 minutes of fitting time on an A100 GPU, totaling over 3.8 A100 GPU-years.
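The quoted compute budget is easy to sanity-check from the per-object fitting time:

```python
# Sanity check on the fitting cost: 205,737 objects at ~10 A100 minutes
# each, converted to GPU-years.
objects = 205_737
minutes_each = 10
gpu_minutes = objects * minutes_each
gpu_years = gpu_minutes / (60 * 24 * 365)
print(round(gpu_years, 2))  # ~3.91, consistent with "over 3.8 GPU-years"
```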
Gaussian Atlas: Mapping from 3D to 2D¶
The conversion of unordered 3D Gaussians into a structured 2D grid proceeds in three steps:
Step 1: Sphere Offsetting
A unit sphere \(\mathcal{S}\) is assumed, with \(N\) points \(\{s_i \in \mathbb{R}^3\}\) uniformly distributed on its surface. Optimal Transport (OT) is used to map the positions of 3D Gaussians onto the sphere surface. Unlike GaussianCube, this method maps to a spherical surface rather than the interior of a volumetric grid.
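The sphere-offsetting step can be sketched as a discrete assignment problem. This is a simplified stand-in for the paper's OT formulation: the Fibonacci-sphere sampling and the use of `linear_sum_assignment` (a balanced assignment solver) are assumptions, since the paper does not pin down either choice here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fibonacci_sphere(n):
    """Near-uniform points on the unit sphere (a common construction;
    the paper's exact sampling scheme is not specified)."""
    i = np.arange(n)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i           # golden-angle increment
    z = 1.0 - 2.0 * (i + 0.5) / n
    r = np.sqrt(np.maximum(0.0, 1.0 - z * z))
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def sphere_offsetting(centers, n):
    """Assign each Gaussian center to one sphere anchor s_i by minimizing
    total squared distance, and record the residual offset x_i - s_i."""
    anchors = fibonacci_sphere(n)
    cost = ((centers[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)
    return anchors[cols], centers[rows] - anchors[cols]

rng = np.random.default_rng(0)
centers = rng.normal(size=(8, 3)) * 0.5
anchors, offsets = sphere_offsetting(centers, 8)
assert np.allclose((anchors ** 2).sum(axis=1), 1.0)  # anchors lie on the sphere
```

Storing the offset \(\mathbf{x}-\mathbf{s}\) rather than the raw position keeps the per-channel values small and anchored to a shared spherical layout.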
Step 2: Equirectangular Projection
The 3D Gaussians on the sphere are flattened onto a 2D plane via the equirectangular projection \(\mathcal{M}\), yielding 2D coordinates \(\{p_i \in \mathbb{R}^2\}\). Since \(\mathcal{M}\) is a deterministic function, the projection is consistent across all objects.
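A minimal sketch of the projection \(\mathcal{M}\), using standard longitude/latitude conventions (the paper's exact parameterization is an assumption):

```python
import numpy as np

def equirectangular(points):
    """Deterministic equirectangular projection M: unit-sphere points ->
    2D (longitude, latitude) coordinates."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    lon = np.arctan2(y, x)                   # in [-pi, pi]
    lat = np.arcsin(np.clip(z, -1.0, 1.0))   # in [-pi/2, pi/2]
    return np.stack([lon, lat], axis=1)

# the north pole maps to latitude pi/2; a point on the equator to latitude 0
p = equirectangular(np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]))
print(p[:, 1])  # -> [pi/2, 0]
```

Because this map is a fixed function of position, every object's Gaussians land on the plane in the same way, which is what makes the next OT step reusable.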
Step 3: Plane Offsetting
OT is applied again to map the flattened 2D coordinates to the vertices of a \(\sqrt{N} \times \sqrt{N}\) regular grid \(\{q_i \in \mathbb{R}^2\}\). Due to the deterministic nature of the projection, this OT step needs to be computed only once and its index can be reused.
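The second OT step can likewise be sketched as a one-shot assignment onto grid vertices. As in the sphere step, `linear_sum_assignment` stands in for the paper's OT solver; the key point is that the returned index is computed once and cached for all objects.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def plane_offsetting(coords2d, side):
    """Map the flattened 2D coordinates onto a side x side regular grid.
    coords2d: (side*side, 2). Returns cols, where cols[i] is the grid
    vertex assigned to point i (a bijection, reusable across objects)."""
    u, v = np.meshgrid(np.linspace(-1, 1, side), np.linspace(-1, 1, side))
    grid = np.stack([u.ravel(), v.ravel()], axis=1)          # (side*side, 2)
    cost = ((coords2d[:, None, :] - grid[None, :, :]) ** 2).sum(-1)
    _, cols = linear_sum_assignment(cost)
    return cols

rng = np.random.default_rng(1)
pts = rng.uniform(-1, 1, size=(16, 2))
idx = plane_offsetting(pts, 4)
assert sorted(idx.tolist()) == list(range(16))  # bijection onto the 4x4 grid
```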
The resulting Gaussian Atlas has shape \(\sqrt{N} \times \sqrt{N} \times C\), where the channel dimension \(C = \dim(\mathbf{x}-\mathbf{s}) + \dim(\mathbf{c}) + \dim(\mathbf{o}) + \dim(\mathbf{s}) + \dim(\mathbf{r})\) concatenates all Gaussian attributes: positional offset, color, opacity, scale, and rotation.
Fine-tuning the Latent Diffusion Model¶
- VAE bypassed: The distribution of Gaussian attributes differs too greatly from that of natural images to be directly processed by a VAE.
- Normalization alignment: Per-pixel mean and standard deviation computed over the full GaussianVerse dataset are used to normalize the Atlas, aligning its distribution with VAE-encoded image latents.
- Channel adaptation: Each 3-channel attribute is concatenated with the 1-channel opacity to form a 4-channel group; the pretrained UNet input layer is replicated four times along the channel axis to accommodate the \(128 \times 128 \times 16\) input.
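The normalization and channel-adaptation steps above can be sketched together. All names and shapes here are assumptions, and dividing the tiled kernel by the repeat count is a common convention (e.g. in Marigold-style adaptations) rather than a detail confirmed by the paper.

```python
import numpy as np

def fit_atlas_normalizer(atlases):
    """Per-pixel (and per-channel) mean/std over the whole dataset,
    used to whiten the Gaussian Atlas so its statistics resemble
    VAE-encoded image latents. atlases: (B, H, W, C)."""
    mean = atlases.mean(axis=0)
    std = atlases.std(axis=0) + 1e-6   # avoid divide-by-zero
    return mean, std

def expand_input_conv(weight, repeats=4):
    """Adapt a pretrained 4-channel input convolution to 4*repeats
    channels by tiling along the input-channel axis; scaling by 1/repeats
    keeps the output magnitude comparable. weight: (out_ch, 4, k, k)."""
    return np.tile(weight, (1, repeats, 1, 1)) / repeats

# normalization: each pixel/channel becomes ~zero-mean after whitening
rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=2.0, size=(64, 8, 8, 4))
mean, std = fit_atlas_normalizer(data)
normed = (data - mean) / std

# channel adaptation: a 4-channel conv becomes a 16-channel conv, and an
# all-ones 16-channel input yields the same response as an all-ones
# 4-channel input through the original kernel
w = np.ones((8, 4, 3, 3))
w16 = expand_input_conv(w)
print(w16.shape)  # (8, 16, 3, 3)
```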
The training loss is:
\[
\mathcal{L} = \mathcal{L}_{diff} + \lambda_{rgb}\,\mathcal{L}_{rgb} + \lambda_{mask}\,\mathcal{L}_{mask} + \lambda_{lpips}\,\mathcal{L}_{lpips}
\]
where \(\mathcal{L}_{diff}\) is the v-parameterized diffusion loss, \(\mathcal{L}_{rgb}\) and \(\mathcal{L}_{mask}\) are L1 rendering losses on the RGB image and alpha mask, and \(\mathcal{L}_{lpips}\) is the perceptual loss.
Inference¶
Starting from random 2D noise, reverse diffusion is performed with the DPM-Solver++ sampler at a guidance scale of 3.5. Generating and rendering a single 3DGS sample takes less than 5 seconds.
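The guidance scale enters through standard classifier-free guidance at each reverse step. The sketch below shows only that combination rule (the DPM-Solver++ update itself is not reimplemented here):

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, scale=3.5):
    """Standard classifier-free guidance: push the prediction away from
    the unconditional branch toward the text-conditioned one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# toy example: where the two branches agree, guidance changes nothing;
# where they differ, the difference is amplified by the scale
e_u = np.array([0.0, 1.0])
e_c = np.array([1.0, 1.0])
print(classifier_free_guidance(e_u, e_c))  # -> [3.5 1. ]
```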
Key Experimental Results¶
Main Results: Text-to-3D Generation Comparison¶
| Method | CLIP score ↑ | VQA score ↑ | # Gaussians ↓ |
|---|---|---|---|
| DreamGaussian | 20.52 | 0.37 | 40K |
| LGM | 20.28 | 0.35 | 66K |
| TriplaneGaussian | 21.10 | 0.46 | 16K |
| GaussianCube | 22.31 | 0.52 | 33K |
| GaussianAtlas (Ours) | 23.20 | 0.61 | 16K |
The proposed method outperforms GaussianCube by 0.9 in CLIP score and by 0.09 (about 17% relative) in VQA score, while using only half as many Gaussians (16K vs. 33K) and half the training steps.
Ablation Study: Pretrained 2D Model vs. Training from Scratch¶
| Method | Training Steps | CLIP score ↑ | VQA score ↑ |
|---|---|---|---|
| From scratch | 500K | 19.33 | 0.23 |
| Transfer 2D LD | 500K | 21.61 | 0.49 |
| From scratch | 1M | 20.85 | 0.40 |
| Transfer 2D LD | 1M | 23.20 | 0.61 |
The pretrained 2D model significantly outperforms training from scratch under equivalent training steps, validating the transferability of 2D knowledge.
User Study¶
Based on 2,500+ valid responses:
- vs. GaussianCube: 65% of users prefer the proposed method.
- vs. TriplaneGaussian: 88% of users prefer the proposed method.
Key Findings¶
- Deterministic mapping is critical: Optimization-based flattening approaches (learning per-object UV mappings independently) produce inconsistent visual patterns, resulting in noise-only outputs after fine-tuning.
- Minimal weight deviation: The deviation between fine-tuned and pretrained UNet weights is extremely small (even the most-changed layer deviates 8× less than random initialization), indicating that pretrained weights serve as an excellent initialization for 3D generation.
- VAE not required: Training the UNet directly in the normalized Atlas space avoids the mismatch introduced by applying a VAE designed for natural images.
Highlights & Insights¶
- Unifying 2D and 3D generation: This work is the first to demonstrate that pretrained text-to-image diffusion models can be directly fine-tuned for 3D Gaussian generation without complex intermediate steps.
- Elegant 2D representation design: The three-step pipeline—spherical OT → equirectangular projection → planar OT—preserves 3D topological continuity while ensuring consistent cross-object mappings.
- Large-scale dataset contribution: GaussianVerse (205K high-quality fits) provides important infrastructure for the research community.
- Fast inference: Generation and rendering in under 5 seconds, far superior to SDS-based optimization methods (which require minutes).
Limitations & Future Work¶
- Generation resolution is constrained by the Atlas grid size (\(128 \times 128 = 16\text{K}\) Gaussians), leading to insufficient detail for complex objects.
- Dataset construction cost is extremely high (3.8+ A100 GPU-years).
- Equirectangular projection introduces inherent area distortion near the poles.
- Only single-object generation is supported; extension to scene-level generation has not been explored.
- Direct comparison with recent methods such as TRELLIS is absent.
Related Work & Insights¶
- 2D diffusion model transfer: Marigold (depth prediction) and GeoWizard (geometry prediction) inspired the transfer to 3D tasks.
- 3D Gaussian generation: GaussianCube (3D grid OT), GVGen (voxel offsetting), DiffGS (triplane mapping).
- 2D representations of 3D: Triplane-based methods (NFD, CRM, InstantMesh), Omages (UV mapping), DiffSplat (multi-view latents).
- Datasets: ShapeSplat, Objaverse series.
Insight: A deterministic and consistent 3D→2D mapping is the key to successful transfer—learned mappings offer flexibility but destroy cross-sample pattern consistency.
Rating¶
- Novelty: ★★★★★ — Introduces the Gaussian Atlas representation and pioneers the direct fine-tuning of 2D models for 3D generation.
- Technical Depth: ★★★★☆ — The OT mapping combined with projection is clean and elegant, though the diffusion model training itself is relatively standard.
- Practicality: ★★★★☆ — Inference is extremely fast, but dataset construction requires substantial computational resources.