UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation

Conference: ECCV 2024
arXiv: 2312.08754
Code: Project Page
Area: 3D Vision

TL;DR

Proposes UniDream, which achieves relightable text-to-3D generation with clean albedo textures and PBR materials by training an albedo-normal aligned multi-view diffusion model (AN-MVM), integrated with a Transformer reconstruction model and stage-wise SDS optimization.

Background & Motivation

Background

Existing text-to-3D methods (e.g., DreamFusion, Magic3D) rely on RGB diffusion models, which "bake" illumination and shadows into the textures of the generated 3D objects.

Limitations of Prior Work

Baked-in lighting limits the realism and reusability of 3D objects under diverse illumination conditions.

Key Challenge

Fantasia3D attempts to disentangle lighting from texture, but its albedo often still contains specular reflections.

Core Problem: How to generate 3D objects with clean PBR materials (albedo, normal, roughness, metallic) from text that can be re-rendered under arbitrary lighting.

Method

Overall Architecture

Three-stage pipeline:

1. AN-MVM generates multi-view albedo and normal maps.
2. The Transformer Reconstruction Model (TRM) reconstructs a coarse 3D model from the albedo maps, which is then refined with SDS guided by AN-MVM.
3. With albedo and normals fixed, Stable Diffusion is used to optimize roughness and metallic properties, yielding PBR materials.

Key Designs

AN-MVM (Albedo-Normal Aligned Multi-View Diffusion):

  • Jointly trains both albedo and normal domains on top of Stable Diffusion.
  • Multi-view self-attention: Concatenates multi-view data before the self-attention layer of the UNet to enforce cross-view constraints.
  • Multi-domain self-attention: Applies self-attention between corresponding views of the albedo and normal domains to ensure cross-domain consistency.
  • Uses class label \(L\) to distinguish the normal domain.
  • Jointly trained on 70% 3D data + 30% LAION-Aesthetics 2D data to preserve semantic generalization.
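The two attention variants above differ mainly in which token sequences are concatenated before self-attention runs. The following is a minimal sketch of that grouping only, using string labels instead of real UNet features; the function names and the toy shapes (1 object, 4 views, 3 tokens per view) are hypothetical and not taken from the paper's code.

```python
from typing import List, Sequence

Tokens = List[str]  # one token sequence (stand-in for a flattened feature map)

def group_for_multiview_attention(seqs: Sequence[Tokens], num_views: int) -> List[Tokens]:
    """Concatenate the token sequences of all views of one object (same domain),
    so that a single self-attention call can relate tokens across views."""
    assert len(seqs) % num_views == 0, "expected a whole number of objects"
    groups = []
    for start in range(0, len(seqs), num_views):
        merged: Tokens = []
        for view_tokens in seqs[start:start + num_views]:
            merged.extend(view_tokens)
        groups.append(merged)
    return groups

def group_for_multidomain_attention(albedo: Sequence[Tokens],
                                    normal: Sequence[Tokens]) -> List[Tokens]:
    """Concatenate albedo and normal tokens of the SAME view, so that
    self-attention can relate the two domains view by view."""
    assert len(albedo) == len(normal), "domains must have matching views"
    return [a + n for a, n in zip(albedo, normal)]

# Toy data: 1 object, 4 views, 3 tokens per view, in each domain.
num_views = 4
albedo = [[f"a{v}_{i}" for i in range(3)] for v in range(num_views)]
normal = [[f"n{v}_{i}" for i in range(3)] for v in range(num_views)]

mv_albedo = group_for_multiview_attention(albedo, num_views)
md_pairs = group_for_multidomain_attention(albedo, normal)

print(len(mv_albedo), len(mv_albedo[0]))  # one sequence holding all 4 views
print(md_pairs[0])                        # view 0: albedo tokens then normal tokens
```

In a real model the same regrouping would be a tensor reshape around the attention layer; the attention computation itself is unchanged.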

TRM (Transformer-Based Reconstruction):

  • Extracts four-view albedo image features using DINO-v2, and encodes camera parameters via learnable camera modulation MLPs.
  • A Transformer decoder performs cross-attention between learnable tokens and image features to output triplane representations.
  • Trains the reconstruction model using albedo instead of RGB to avoid the negative impact of lighting and shadows on triplane-NeRF reconstruction.
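The cross-attention step between learnable tokens and image features can be illustrated with a stripped-down, single-head version. This sketch assumes identity query/key/value projections and tiny hand-written vectors; it omits the DINO-v2 encoder, camera modulation, multi-head attention, and the triplane decoding, and all names in it are hypothetical.

```python
import math
from typing import List

Vector = List[float]

def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries: List[Vector], features: List[Vector]) -> List[Vector]:
    """Each query attends over all image features (used as both keys and values)
    and returns a weighted average of them."""
    dim = len(features[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in features]
        weights = softmax(scores)
        outputs.append([sum(w * f[d] for w, f in zip(weights, features))
                        for d in range(dim)])
    return outputs

# Two "learnable tokens" (queries) and three image feature vectors.
learnable_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_features = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

updated_tokens = cross_attention(learnable_tokens, image_features)
for token in updated_tokens:
    print([round(x, 3) for x in token])
```

Each output token is a convex combination of the image features, which is how the learnable tokens absorb image content before being decoded into a triplane.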

PBR Material Generation:

  • After fixing albedo and normals, an additional hash grid and MLP are introduced to predict roughness and metallic properties.
  • Uses SDS supervision from Stable Diffusion, allowing simultaneous optimization of environmental illumination (restricted to a single channel to avoid color shifting).
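The reason a single-channel environment light avoids color shifting can be seen in a toy shading model. The sketch below assumes plain per-channel multiplication of albedo and light, which is far simpler than the actual PBR renderer; the function names and numbers are illustrative only.

```python
from typing import Tuple

RGB = Tuple[float, float, float]

def expand_single_channel(intensity: float) -> RGB:
    """Broadcast one scalar light intensity to all three color channels."""
    return (intensity, intensity, intensity)

def shade_diffuse(albedo: RGB, light: RGB) -> RGB:
    """Toy diffuse shading: per-channel product of albedo and incoming light."""
    return (albedo[0] * light[0], albedo[1] * light[1], albedo[2] * light[2])

def chromaticity(rgb: RGB) -> RGB:
    """Normalize a color by its channel sum to compare hue independent of brightness."""
    total = sum(rgb)
    return (rgb[0] / total, rgb[1] / total, rgb[2] / total)

albedo: RGB = (0.8, 0.4, 0.2)

# Single-channel light: brightness changes, hue does not.
gray_lit = shade_diffuse(albedo, expand_single_channel(0.5))

# Free RGB light: the optimizer could push color into the light, tinting the result.
tinted_lit = shade_diffuse(albedo, (0.5, 0.5, 1.0))

print(chromaticity(albedo))
print(chromaticity(gray_lit))    # matches the albedo chromaticity
print(chromaticity(tinted_lit))  # shifted toward blue
```

With the light constrained to one channel, any color in the render has to come from the material, so the optimization cannot hide a color bias in the illumination.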

Loss & Training

The SDS loss uses AN-MVM's noise-prediction residuals in both the albedo and normal domains, weighted 0.8 and 0.2, respectively. TRM is trained with a combination of LPIPS, L2, and normal supervision.
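As a sketch of how the two domain weights might enter, the expression below adapts the standard SDS gradient from DreamFusion. The linear combination and the symbols \(\lambda_a\), \(\lambda_n\), \(x^a\), \(x^n\), \(y\), \(c\), and \(w(t)\) are notation assumed here; the paper's exact formulation may differ.

\[
\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\left[ w(t) \left( \lambda_a \big(\hat{\epsilon}_\phi(x^a_t; y, c, t) - \epsilon^a\big) \frac{\partial x^a}{\partial \theta} + \lambda_n \big(\hat{\epsilon}_\phi(x^n_t; y, c, t) - \epsilon^n\big) \frac{\partial x^n}{\partial \theta} \right) \right], \quad \lambda_a = 0.8,\ \lambda_n = 0.2
\]

Here \(x^a\) and \(x^n\) are the rendered albedo and normal images of the 3D representation with parameters \(\theta\), \(x^a_t\) and \(x^n_t\) are their noised versions at timestep \(t\), \(\epsilon^a\) and \(\epsilon^n\) are the injected noise, \(y\) is the text prompt, \(c\) is the camera condition, and \(\hat{\epsilon}_\phi\) is the AN-MVM noise prediction.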

Key Experimental Results

Main Results

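| Method | User Study (%) ↑ | CLIP Score ↑ | R@1 (%) ↑ | R@5 (%) ↑ | R@10 (%) ↑ |
|---|---|---|---|---|---|
| DreamFusion | 7.1 | 71.0 | 54.2 | 82.2 | 91.5 |
| Magic3D | 10.5 | 75.1 | 75.9 | 93.5 | 96.6 |
| MVDream | 32.1 | 75.7 | 76.8 | 94.3 | 96.9 |
| UniDream | 50.3 | 77.9 | 80.3 | 97.4 | 98.5 |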
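The R@k columns are presumably retrieval-style precision scores: for each rendered result, all candidate prompts are ranked by image-text similarity, and R@k counts how often the true prompt lands in the top k. The sketch below assumes that definition and uses a made-up 3×3 similarity matrix; it does not reproduce the paper's evaluation protocol or use a real CLIP model.

```python
from typing import List

def recall_at_k(similarity: List[List[float]], k: int) -> float:
    """similarity[i][j] is the score between render i and prompt j;
    the correct prompt for render i is assumed to be prompt i."""
    hits = 0
    for i, row in enumerate(similarity):
        # Prompt indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Made-up scores: renders 0 and 2 retrieve their own prompt first,
# render 1 ranks its own prompt second.
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.6, 0.7],
]

r_at_1 = recall_at_k(toy_similarity, 1)
r_at_2 = recall_at_k(toy_similarity, 2)
print(r_at_1, r_at_2)
```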

Ablation Study

Comparison of multi-view diffusion models: The 2D images output by UniDream's AN-MVM successfully achieve lighting-texture disentanglement, and the generated normal maps show better cross-view consistency than MVDream's RGB outputs.

The progression from TRM reconstruction \(\rightarrow\) SDS refinement \(\rightarrow\) PBR material generation shows step-by-step quality improvement: the coarse reconstruction already has clear textures and geometric boundaries, SDS refinement produces a high-quality 3D model, and the PBR stage adds realistic material properties.

PBR comparison: While Fantasia3D mixes lighting and shadow information into the albedo, UniDream effectively achieves disentanglement, allowing relighting under different environmental illuminations.

Key Findings

  • The 50.3% preference rate in the user study is 18.2 percentage points above MVDream's 32.1%, supporting the practical value of the relighting capability.
  • Normal supervision significantly accelerates geometric convergence, yielding cleaner and smoother surfaces.
  • The 3D prior provided by TRM effectively avoids the "Janus problem" in SDS optimization.
  • The single-channel environmental illumination constraint effectively prevents color shifting introduced by Stable Diffusion.

Highlights & Insights

  • The stage-wise strategy of "albedo first, then PBR" breaks down the complex problem progressively, which is more stable than direct joint optimization.
  • Consistency in the normal domain is naturally easier to guarantee (as world-space normals of the same 3D point remain consistent across different views), which helps drive the convergence of multi-view consistency in the albedo domain.
  • The diffusion model trained jointly on albedo and normal domains is a key technical innovation for understanding and disentangling the appearance of 3D objects.
  • 3D prior from the reconstruction model + details from the diffusion model = complementary strengths.
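The point about world-space normals can be checked numerically: the same surface normal looks different in each camera's coordinate frame, but is identical across views once expressed in world space. This sketch assumes cameras that differ only by a rotation about the vertical axis; the helper names are hypothetical.

```python
import math
from typing import List

Vec3 = List[float]
Mat3 = List[List[float]]

def rot_y(degrees: float) -> Mat3:
    """Rotation about the y axis, used here as an assumed world-to-camera rotation."""
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def matvec(m: Mat3, v: Vec3) -> Vec3:
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(m: Mat3) -> Mat3:
    return [[m[j][i] for j in range(3)] for i in range(3)]

normal_world: Vec3 = [0.0, 0.0, 1.0]   # one surface point's normal
view_angles = [0.0, 90.0, 180.0, 270.0]  # four cameras around the object

# Camera-space normals: a different vector in every view.
camera_normals = [matvec(rot_y(a), normal_world) for a in view_angles]

# Mapping back with the inverse rotation (the transpose) gives one shared vector.
world_normals = [matvec(transpose(rot_y(a)), n)
                 for a, n in zip(view_angles, camera_normals)]

for cam, world in zip(camera_normals, world_normals):
    print([round(x, 3) for x in cam], [round(x, 3) for x in world])
```

Because the world-space target is the same for every view, a multi-view model gets an unambiguous consistency signal in the normal domain.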

Training and Generation Details

  • AN-MVM training: 32 A800 GPUs, 256×256 resolution, with per-GPU batch_size=128 (16 objects × 2 domains × 4 views), taking about 19 hours for 50K iterations.
  • TRM training: 32 A800 GPUs, batch_size=96, taking about 3 days for 70K steps.
  • SDS refinement: 5000 steps for the NeRF stage and 2000 steps for the DMTet stage.
  • PBR material stage: 512×512 rendering, 2000 iterations.

The 3D dataset uses approximately 300K Objaverse objects after strict filtering (excluding untextured, non-single objects, low-quality, and caption-less models) to render multi-view albedo and normal data for training. AN-MVM is jointly trained using 70% 3D data and 30% LAION-Aesthetics 2D data, with ", 3D asset" appended to 3D captions to distinguish them.
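The 70/30 data mix and the caption suffix can be sketched as a simple sampler. This assumes the ratio is applied per training sample (the notes do not say whether mixing happens per sample or per batch); the function name and example captions are hypothetical.

```python
import random
from typing import List, Tuple

SUFFIX_3D = ", 3D asset"  # suffix appended to captions of 3D-rendered data

def sample_training_caption(rng: random.Random,
                            captions_3d: List[str],
                            captions_2d: List[str],
                            p_3d: float = 0.7) -> Tuple[str, str]:
    """Pick a 3D sample with probability p_3d, otherwise a 2D sample.
    Returns (source, caption); only 3D captions receive the suffix."""
    if rng.random() < p_3d:
        return "3d", rng.choice(captions_3d) + SUFFIX_3D
    return "2d", rng.choice(captions_2d)

rng = random.Random(0)  # fixed seed for a repeatable demo
captions_3d = ["a wooden chair", "a ceramic teapot"]
captions_2d = ["a photo of a cat on a sofa"]

samples = [sample_training_caption(rng, captions_3d, captions_2d)
           for _ in range(10_000)]
fraction_3d = sum(1 for source, _ in samples if source == "3d") / len(samples)

print(round(fraction_3d, 3))  # close to 0.7
print(samples[0])
```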

Limitations & Future Work

  • The training data is limited to approximately 300K Objaverse objects, which limits semantic and material generalization.
  • Special material types, such as transparent materials, are not supported.
  • The rendering pipeline is not integrated with path tracing, resulting in simplified relighting effects.
  • The multi-stage pipeline is relatively time-consuming.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The first relightable text-to-3D framework
  • Effectiveness: ⭐⭐⭐⭐ — Leads significantly in both quantitative and user studies
  • Practicality: ⭐⭐⭐⭐ — PBR outputs are directly usable in games and AR/VR
  • Recommendation: ⭐⭐⭐⭐⭐