UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation

Conference: ECCV 2024
arXiv: 2312.08754
Code: Project Page
Area: 3D Vision

TL;DR

Proposes UniDream, which achieves relightable text-to-3D generation with clean albedo textures and PBR materials by training an albedo-normal aligned multi-view diffusion model (AN-MVM), integrated with a Transformer reconstruction model and stage-wise SDS optimization.

Background & Motivation

Background

Existing text-to-3D methods (e.g., DreamFusion, Magic3D) utilize RGB diffusion models, which "bake" illumination and shadows into the textures of the generated 3D objects.

Limitations of Prior Work

Such baked-in lighting limits the realism and reusability of 3D objects under diverse illumination conditions.

Key Challenge

While Fantasia3D attempts to disentangle lighting and texture, it often mixes albedo and specular reflections.

Core Problem: How to generate 3D objects with clean PBR materials (albedo, normal, roughness, metallic) from text that can be re-rendered under arbitrary lighting.

Method

Overall Architecture

Three-stage pipeline:
1. AN-MVM generates multi-view albedo and normal maps.
2. The Transformer Reconstruction Model (TRM) reconstructs a coarse 3D model from the albedo maps, which is then refined using SDS via AN-MVM.
3. With albedo and normals fixed, Stable Diffusion is utilized to optimize roughness and metallic properties to generate PBR materials.

Key Designs

AN-MVM (Albedo-Normal Aligned Multi-View Diffusion):
- Jointly trains both albedo and normal domains on top of Stable Diffusion.
- Multi-view self-attention: Concatenates multi-view data before the self-attention layer of the UNet to enforce cross-view constraints.
- Multi-domain self-attention: Applies self-attention between corresponding views of albedo and normal domains to ensure cross-domain consistency.
- Uses class label \(L\) to distinguish the normal domain; jointly trained on 70% 3D data + 30% LAION-Aesthetics 2D data to preserve semantic generalization.
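The two attention patterns above differ mainly in which axes are merged into the token sequence. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the tensor layout `(objects, domains, views, tokens, dim)`, the function names, and the single-head attention with identity projections are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (illustration only).

    tokens: array of shape (batch, seq, dim).
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ tokens

def multi_view_attention(x):
    """Attend across all views of the same object and domain.

    x: (objects, domains, views, tokens, dim); views are concatenated
    into one sequence before attention, then split again.
    """
    o, m, v, n, d = x.shape
    merged = x.reshape(o * m, v * n, d)
    return self_attention(merged).reshape(o, m, v, n, d)

def multi_domain_attention(x):
    """Attend between albedo and normal tokens of the same view."""
    o, m, v, n, d = x.shape
    swapped = x.transpose(0, 2, 1, 3, 4)          # (objects, views, domains, tokens, dim)
    merged = swapped.reshape(o * v, m * n, d)     # domains concatenated per view
    out = self_attention(merged).reshape(o, v, m, n, d)
    return out.transpose(0, 2, 1, 3, 4)           # back to the input layout

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 2, 4, 16, 8))         # 2 objects, 2 domains, 4 views
y = multi_domain_attention(multi_view_attention(x))
```

In this sketch, information flows between views only inside `multi_view_attention` and between domains only inside `multi_domain_attention`; different objects never interact.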

TRM (Transformer-Based Reconstruction):
- Extracts four-view albedo image features using DINO-v2, and encodes camera parameters via learnable camera modulation MLPs.
- A Transformer decoder performs cross-attention between learnable tokens and image features to output triplane representations.
- Trains the reconstruction model using albedo instead of RGB to avoid the negative impact of lighting and shadows on triplane-NeRF reconstruction.
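The decoder step can be pictured as one set of learnable query tokens reading from the image features and being reshaped into three feature planes. The sketch below illustrates only that data flow under stated assumptions: random arrays stand in for DINO-v2 features, the sizes `dim`, `res`, and `n_patches` are arbitrary, projections are omitted, and camera modulation is left out entirely.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head cross-attention without learned projections (illustration only).

    queries: (num_queries, dim), context: (num_context, dim).
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores) @ context

rng = np.random.default_rng(0)
dim, res = 32, 8               # hypothetical feature width and triplane resolution
n_views, n_patches = 4, 49     # four albedo views; the patch count is illustrative

# Stand-in for per-patch features of the four albedo views.
image_feats = rng.standard_normal((n_views * n_patches, dim))

# One learnable token per cell of the three planes (XY, XZ, YZ).
learnable_tokens = rng.standard_normal((3 * res * res, dim))

decoded = cross_attention(learnable_tokens, image_feats)
triplane = decoded.reshape(3, res, res, dim)
```

A real decoder would stack several such layers with learned projections and upsample the planes; the reshape at the end is the part that turns a token sequence into a triplane.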

PBR Material Generation:
- After fixing albedo and normals, an additional hash grid and MLP are introduced to predict roughness and metallic properties.
- Uses SDS supervision from Stable Diffusion, allowing simultaneous optimization of environmental illumination (restricted to a single channel to avoid color shifting).
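To see why a single-channel light cannot shift colors, consider a diffuse-only toy case: if the environment light carries one intensity per direction and that value is copied to R, G, and B, the shaded color keeps the channel ratios of the albedo. The sketch below assumes a simple Lambertian term and made-up values; it is not the paper's renderer, which also handles roughness and metallic.

```python
import numpy as np

rng = np.random.default_rng(0)
albedo = np.array([0.8, 0.4, 0.2])   # fixed albedo of one surface point (illustrative)

# Hypothetical learnable environment light: one intensity per direction, no color.
env_intensity = rng.uniform(0.1, 2.0, size=(16, 1))
env_rgb = np.repeat(env_intensity, 3, axis=1)      # same value in R, G and B

# Stand-in for the clamped cosine between the normal and each light direction.
cos_weights = rng.uniform(0.0, 1.0, size=(16, 1))

irradiance = (env_rgb * cos_weights).mean(axis=0)  # shape (3,), equal across channels
shaded = albedo * irradiance
```

Whatever values the optimizer assigns to `env_intensity`, `shaded` is a scalar multiple of `albedo`, so the light can only brighten or darken the surface in this toy setting.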

Loss & Training

The SDS loss combines AN-MVM's noise-prediction residuals on the albedo and normal domains, weighted 0.8 and 0.2, respectively. TRM is trained with a combination of LPIPS, L2, and normal supervision.
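The weighted combination can be sketched as follows. This is a hedged illustration rather than the paper's code: it uses the common surrogate form in which the residual between predicted and injected noise is treated as a constant, and the function names and the timestep weight `w_t` are assumptions.

```python
import numpy as np

W_ALBEDO, W_NORMAL = 0.8, 0.2   # domain weights stated in the text

def sds_surrogate(rendered, eps_pred, eps, w_t=1.0):
    """Scalar whose gradient w.r.t. `rendered` is w_t * (eps_pred - eps).

    The residual is treated as a constant (no gradient through the
    diffusion model), which is the usual SDS convention.
    """
    residual = w_t * (eps_pred - eps)
    return float(np.sum(residual * rendered))

def combined_sds(albedo_terms, normal_terms, w_t=1.0):
    """Weighted sum of the albedo-domain and normal-domain SDS terms.

    Each *_terms argument is a tuple (rendered, eps_pred, eps).
    """
    loss_albedo = sds_surrogate(*albedo_terms, w_t=w_t)
    loss_normal = sds_surrogate(*normal_terms, w_t=w_t)
    return W_ALBEDO * loss_albedo + W_NORMAL * loss_normal

# Toy inputs: 2x2 "renders" with a residual of exactly 1 everywhere.
ones = np.ones((2, 2))
zeros = np.zeros((2, 2))
loss = combined_sds((ones, ones, zeros), (ones, ones, zeros))
```

In an autodiff framework, `rendered` would be the differentiable albedo or normal render, and the two noise predictions would come from the AN-MVM branches for the corresponding domain.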

Key Experimental Results

Main Results

Method	User Study(%)↑	CLIP Score↑	R@1(%)↑	R@5(%)↑	R@10(%)↑
DreamFusion	7.1	71.0	54.2	82.2	91.5
Magic3D	10.5	75.1	75.9	93.5	96.6
MVDream	32.1	75.7	76.8	94.3	96.9
UniDream	50.3	77.9	80.3	97.4	98.5

Ablation Study

Comparison of multi-view diffusion models: The 2D images output by UniDream's AN-MVM successfully achieve lighting-texture disentanglement, and the generated normal maps show better cross-view consistency than MVDream's RGB outputs.

The progressive process of TRM reconstruction \(\rightarrow\) SDS refinement \(\rightarrow\) PBR material demonstrates step-by-step quality improvement: coarse reconstruction maintains clear texture and geometric boundaries, SDS refinement achieves high-quality 3D models, and the PBR stage adds realistic material properties.

PBR comparison: While Fantasia3D mixes lighting and shadow information into the albedo, UniDream effectively achieves disentanglement, allowing relighting under different environmental illuminations.

Key Findings

UniDream's 50.3% preference rate in the user study is 18.2 percentage points above MVDream's 32.1%, supporting the practical value of the relighting capability.
Normal supervision significantly accelerates geometric convergence, yielding cleaner and smoother surfaces.
The 3D prior provided by TRM effectively avoids the "Janus problem" in SDS optimization.
The single-channel environmental illumination constraint effectively prevents color shifting introduced by Stable Diffusion.

Highlights & Insights

The stage-wise strategy of "albedo first, then PBR" breaks down the complex problem progressively, which is more stable than direct joint optimization.
Consistency in the normal domain is naturally easier to guarantee (as world-space normals of the same 3D point remain consistent across different views), which helps drive the convergence of multi-view consistency in the albedo domain.
The diffusion model trained jointly on albedo and normal domains is a key technical innovation for understanding and disentangling the appearance of 3D objects.
3D prior from the reconstruction model + details from the diffusion model = complementary strengths.
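The point about normal-domain consistency can be checked numerically. In the toy example below, one surface normal is expressed in the camera frames of four views placed around the vertical axis: the camera-space vectors differ from view to view, while mapping each back to world space recovers the same vector. The camera layout and the `rot_y` helper are assumptions chosen for illustration.

```python
import numpy as np

def rot_y(deg):
    """Rotation about the vertical (y) axis, mapping camera to world coordinates."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

n_world = np.array([0.0, 0.0, 1.0])      # world-space normal of one surface point
view_angles = [0, 90, 180, 270]          # four orbit views (illustrative)

# Camera-space normals: apply the world-to-camera rotation (the transpose).
n_cam = [rot_y(a).T @ n_world for a in view_angles]

# Mapping back with each camera's own rotation gives one shared world-space normal.
n_back = [rot_y(a) @ n for a, n in zip(view_angles, n_cam)]
```

A world-space normal map therefore gives every view the same target value for a given 3D point, whereas a camera-space encoding would change with the viewpoint.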

Training and Generation Details

- AN-MVM training: 32 A800 GPUs, 256×256 resolution, with per-GPU batch_size=128 (16 objects × 2 domains × 4 views), taking about 19 hours for 50K iterations.
- TRM training: 32 A800 GPUs, batch_size=96, taking about 3 days for 70K steps.
- SDS refinement: 5000 steps for the NeRF stage, and 2000 steps for the DMTet stage.
- PBR material stage: 512×512 rendering, 2000 iterations.

The 3D dataset uses approximately 300K Objaverse objects after strict filtering (excluding untextured, non-single objects, low-quality, and caption-less models) to render multi-view albedo and normal data for training. AN-MVM is jointly trained using 70% 3D data and 30% LAION-Aesthetics 2D data, with ", 3D asset" appended to 3D captions to distinguish them.
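One way to picture the 70/30 mixture and the caption suffix is a per-sample draw, sketched below. Whether the actual training code mixes per sample or per batch is not stated here, so the sampling granularity, the function name, and the example captions are assumptions; only the ratio and the ", 3D asset" suffix come from the text.

```python
import random

P_3D = 0.7                  # share of 3D data in the mixture
SUFFIX_3D = ", 3D asset"    # appended to captions of 3D renders

def sample_training_item(captions_3d, captions_2d, rng):
    """Draw one caption, from the 3D pool with probability P_3D, else from the 2D pool."""
    if rng.random() < P_3D:
        return {"source": "3d", "caption": rng.choice(captions_3d) + SUFFIX_3D}
    return {"source": "2d", "caption": rng.choice(captions_2d)}

rng = random.Random(0)
captions_3d = ["a wooden chair", "a ceramic teapot"]      # placeholder captions
captions_2d = ["a photo of a beach at sunset"]            # placeholder captions

items = [sample_training_item(captions_3d, captions_2d, rng) for _ in range(10000)]
frac_3d = sum(item["source"] == "3d" for item in items) / len(items)
```

The suffix lets a single text encoder input signal which distribution a caption belongs to, so 2D captions stay untouched while 3D captions are marked.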

Limitations & Future Work

The training data is limited to approximately 300K Objaverse objects, which limits semantic and material generalization.
Special material types, such as transparent materials, are not supported.
The rendering pipeline is not integrated with path tracing, resulting in simplified relighting effects.
The multi-stage pipeline is relatively time-consuming.

Rating

Novelty: ⭐⭐⭐⭐⭐ — The first relightable text-to-3D framework
Effectiveness: ⭐⭐⭐⭐ — Leads significantly in both quantitative and user studies
Practicality: ⭐⭐⭐⭐ — PBR outputs are directly usable in games and AR/VR
Recommendation: ⭐⭐⭐⭐⭐