
Collaborative Control for Geometry-Conditioned PBR Image Generation

Conference: ECCV 2024
arXiv: 2402.05919
Code: https://unity-research.github.io/holo-gen (Project Page)
Area: Diffusion Models / Image Generation
Keywords: PBR Material Generation, Multimodal Diffusion, Cross-network Control, Geometry-Conditioned Generation, Physically-Based Rendering

TL;DR

Proposes the Collaborative Control paradigm, which freezes a pre-trained RGB diffusion model and trains a parallel PBR model. By utilizing bi-directional cross-network communication layers to jointly model the RGB and PBR image distributions, it achieves high-quality geometry-conditioned PBR material image generation under limited data conditions.

Background & Motivation

Background: Diffusion models have achieved immense success in the field of RGB image generation, and Text-to-3D as well as Text-to-Texture methods have successfully extended them to 3D content generation. However, downstream 3D workflows (e.g., game engines) require Physically-Based Rendering (PBR) materials, rather than simple RGB images.

Limitations of Prior Work:

Inherent Flaws of Inverse Rendering: Current methods first generate an RGB image and then extract PBR attributes via inverse rendering. However, the RGB images generated by diffusion models often exhibit non-physical lighting (as models prefer idealized and artistic appearances), leading to severe ambiguities in inverse rendering.

Data Scarcity: The largest PBR dataset, Objaverse, contains only about 800,000 objects, nearly four orders of magnitude fewer than the roughly 5 billion images in LAION-5B. Training generative models from scratch on it results in insufficient generalization capability.

High Dimensionality Dilemma: PBR images consist of Albedo (3 channels), Metallic (1 channel), Roughness (1 channel), and Bump Map (3 channels), totaling 8 channels, which cannot be compressed effectively into the low-dimensional latent space of existing RGB VAEs.

Catastrophic Forgetting from Fine-tuning: Fine-tuning pre-trained RGB models on limited PBR data results in a loss of generalizability.

Key Challenge: How to directly model the joint distribution of PBR images by leveraging the rich prior knowledge of pre-trained RGB models under extreme data scarcity?

Key Insight: Keep the pre-trained RGB model completely frozen and train a parallel PBR model, tightly coupling the two models via a bi-directional cross-network communication mechanism. This allows the PBR model to extract semantic information from the RGB model while simultaneously guiding the RGB model to generate rendered images aligned with the PBR outputs.

Core Idea: Decompose the joint reverse process into two coupled processes: the RGB model generates the rendered image and provides rich internal representations, while the PBR model utilizes these representations to generate corresponding PBR materials.

Method

Overall Architecture

The system comprises two parallel diffusion models running side by side:

  • Left Branch: A frozen pre-trained RGB diffusion model \(\mathcal{D}_{rgb}\) that generates the rendered RGB image.
  • Right Branch: A newly-trained PBR diffusion model \(\mathcal{D}_{pbr}\) that generates PBR material maps.

The two models are connected through cross-network communication layers after each self-attention layer, enabling bi-directional information exchange. The input to the PBR model is also concatenated with the screen-space geometric normals as a condition.
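The cross-network communication layer can be illustrated with a minimal NumPy sketch: the hidden states of both branches are concatenated per pixel, passed through one linear map, and the result is added back residually to each branch. The class name `CommLayer`, the channel-wise split of the linear output between the two branches, and the zero initialization are assumptions made for illustration, not details confirmed by this summary.

```python
import numpy as np


class CommLayer:
    """Illustrative pixel-wise linear communication layer between two branches."""

    def __init__(self, c_rgb, c_pbr, zero_init=True, seed=0):
        c_in = c_rgb + c_pbr
        if zero_init:
            # Assumption: zero init leaves both branches unperturbed at the start.
            self.weight = np.zeros((c_in, c_in))
        else:
            rng = np.random.default_rng(seed)
            self.weight = rng.normal(scale=0.02, size=(c_in, c_in))
        self.bias = np.zeros(c_in)
        self.c_rgb = c_rgb

    def __call__(self, h_rgb, h_pbr):
        # h_rgb: (H, W, C_rgb), h_pbr: (H, W, C_pbr), same spatial size.
        h = np.concatenate([h_rgb, h_pbr], axis=-1)
        # The matrix product acts on the channel axis only, so every pixel
        # is processed independently of its neighbours.
        out = h @ self.weight + self.bias
        # Assumption: the output is split channel-wise, one slice per branch.
        d_rgb = out[..., : self.c_rgb]
        d_pbr = out[..., self.c_rgb :]
        return h_rgb + d_rgb, h_pbr + d_pbr


# Usage: hidden states taken after a self-attention layer in each branch.
layer = CommLayer(c_rgb=4, c_pbr=6, zero_init=False)
h_rgb = np.ones((2, 3, 4))
h_pbr = np.ones((2, 3, 6))
new_rgb, new_pbr = layer(h_rgb, h_pbr)
```

Because the layer only mixes channels at the same spatial location, it preserves the pixel correspondence between the RGB rendering and the PBR maps.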

Key Designs

  1. Collaborative Control Bi-directional Communication Mechanism:

    • Function: Inserts connection layers after each self-attention module of both models to realize a bi-directional information flow.
    • Mechanism: Concatenates the latent states of both models, processes them using a simple pixel-wise linear layer, and then distributes the outputs residually back to both models:
      \[h_{rgb}' = h_{rgb} + \text{Linear}([h_{rgb}; h_{pbr}])\]
      \[h_{pbr}' = h_{pbr} + \text{Linear}([h_{rgb}; h_{pbr}])\]
    • Design Motivation: The PBR branch needs to extract relevant information from the RGB model while guiding the RGB output toward the rendered image domain \(\text{Im}(f)\). One-way communication (e.g., ControlNet) cannot align the RGB model to the conditional distribution, while clockwise communication (e.g., AnimateAnyone) prevents the PBR model from obtaining \(z'_{rgb,t-1}\) in the encoder stage. The communication-paradigm ablation supports this: bi-directional communication scores best on every reported metric.
  2. PBR-Specific VAE:

    • Function: Trains a VAE specifically for PBR image compression, with the latent space dimension set to 14 channels.
    • Mechanism: Adopts the VAE architecture of StableDiffusion v1.5, but expands the number of latent space channels from 4 to 14 to balance the quality and compression ratio of PBR images (8 channels).
    • Design Motivation: The distribution of PBR images differs significantly from RGB. Directly using an RGB VAE to encode trios of PBR channels causes severe distribution mismatch, with experiments showing the CMMD metric deteriorating drastically from 6.30 to 84.66.
  3. Geometry Tangent Space Bump Map Representation:

    • Function: Defines the bump map in a geometric-only tangent space rather than the traditional UV tangent space.
    • Mechanism: For point \(\bm{p}\) and geometric normal \(\bm{n}\), constructs a local tangent vector \(\bm{t} = \bm{n} \times ([-p_y, p_x, 0]^T \times \bm{n})\).
    • Design Motivation: The UV tangent space depends on arbitrary UV unwrapping, which causes similar surface bumps in world space to appear wildly different in UV space. Decoupling textures from UV mapping facilitates model learning.
  4. Disabling Text Cross-Attention in the PBR Branch:

    • Function: Turns off cross-attention to prompts in the PBR model; all text guidance flows solely through the frozen RGB model.
    • Design Motivation: On limited data, the text cross-attention layers of the PBR model are prone to overfitting, and the effect worsens as the data becomes scarcer. Routing all text conditioning through the frozen RGB model avoids this.
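The geometry tangent space from design 3 can be sketched in a few lines of NumPy. The tangent follows the formula \(\bm{t} = \bm{n} \times ([-p_y, p_x, 0]^T \times \bm{n})\), which for a unit normal is the reference direction \([-p_y, p_x, 0]^T\) projected onto the tangent plane. The function name, the normalization of the result, the bitangent \(\bm{b} = \bm{n} \times \bm{t}\), and the handling of the degenerate case are illustrative assumptions.

```python
import numpy as np


def geometry_tangent_frame(p, n, eps=1e-8):
    """Return unit vectors (t, b, n) for surface point p and geometric normal n."""
    p = np.asarray(p, dtype=float)
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)

    # Reference direction from the formula: depends only on the position,
    # never on the UV unwrapping of the mesh.
    v = np.array([-p[1], p[0], 0.0])

    # t = n x (v x n); for unit n this equals v - (n . v) n.
    t = np.cross(n, np.cross(v, n))
    norm = np.linalg.norm(t)
    if norm < eps:
        # Happens when p lies on the z-axis or v is parallel to n.
        raise ValueError("tangent is undefined for this point/normal pair")
    t = t / norm

    # Assumption: the bitangent completes a right-handed frame.
    b = np.cross(n, t)
    return t, b, n


# Usage: a point on the +x side of an object, with the normal pointing along +x.
t, b, n = geometry_tangent_frame(p=[1.0, 0.0, 0.0], n=[1.0, 0.0, 0.0])
```

Two surface points with the same position and normal always get the same frame, regardless of how the mesh was unwrapped, which is the property the design motivation relies on.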

Loss & Training

  • Jointly optimizes the training loss for both RGB and PBR denoising, updating only the weights of the PBR model and the cross-network communication layers.
  • The RGB model renders with a fixed environment map and a fixed camera setup to simplify the alignment problem.
  • Training Data: Approximately 300,000 objects filtered from Objaverse, each rendered from 16 viewpoints.
  • Training Configuration: 512×512 resolution, 200K steps, batch size of 12, lr = 3e-5, takes about 2 days on a single A100.
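
One way to write the joint objective described above, as a sketch rather than the paper's exact formulation: it assumes the standard noise-prediction loss, equal weighting of the two terms, and a timestep \(t\) shared by both branches (all three are assumptions not stated in this summary). The symbols \(\theta_{pbr}\) and \(\theta_{comm}\) are illustrative names for the trainable PBR weights and communication-layer weights.

\[
\min_{\theta_{pbr},\,\theta_{comm}} \; \mathbb{E}_{t,\,\epsilon_{rgb},\,\epsilon_{pbr}} \left[ \lVert \epsilon_{rgb} - \hat{\epsilon}_{rgb} \rVert_2^2 + \lVert \epsilon_{pbr} - \hat{\epsilon}_{pbr} \rVert_2^2 \right]
\]

Here \(\hat{\epsilon}_{rgb}\) and \(\hat{\epsilon}_{pbr}\) are the noise predictions of the two coupled branches. The weights of \(\mathcal{D}_{rgb}\) receive no updates, but the RGB term still produces gradients for the communication layers, because those layers modify the hidden states of the RGB branch.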

Key Experimental Results

Main Results

| Communication Paradigm | CMMD(PBR)↓ | CMMD(Relit)↓ | FID(PBR)↓ | CLIPScore(Albedo)↑ | CLIPScore(Relit)↑ |
|---|---|---|---|---|---|
| One-way | 16.44 | 13.38 | 20.90 | 23.08 | 23.40 |
| Clockwise | 6.78 | 2.76 | 12.21 | 26.45 | 24.53 |
| Bi-directional | 6.30 | 1.79 | 11.65 | 26.76 | 25.41 |

Ablation Study

| Configuration | CMMD(PBR)↓ | FID(PBR)↓ | CLIPScore↑ | Description |
|---|---|---|---|---|
| PBR VAE | 6.30 | 11.65 | 26.76 | Dedicated VAE (Baseline) |
| RGB VAE Triplets | 84.66 | 25.81 | 25.27 | Severe distribution mismatch |
| Fine-tuning (with RGB) | 13.40 | 14.42 | 25.04 | Degradation in OOD performance |
| Fine-tuning (without RGB) | 5.25 | 11.41 | 25.66 | Severe OOD overfitting |
| Pixel-wise MLP | 5.43 | 11.43 | 27.15 | Slightly better but more complex |
| Global Attention | 7.60 | 13.61 | 24.50 | Lacks pixel correspondence |
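
For a sense of scale in the VAE comparison, the arithmetic below contrasts the compression ratio of the standard StableDiffusion v1.5 RGB VAE (3 image channels to 4 latent channels at 8× spatial downsampling) with that of the 14-channel PBR VAE on 8-channel input. It assumes the PBR VAE keeps the same 8× spatial downsampling, which this summary implies but does not state; the helper name is illustrative.

```python
def compression_ratio(height, width, in_channels, latent_channels, downsample=8):
    """Ratio of input values to latent values for a convolutional VAE."""
    pixels = height * width * in_channels
    latents = (height // downsample) * (width // downsample) * latent_channels
    return pixels / latents


# RGB VAE of StableDiffusion v1.5 at 512x512: 3 channels -> 4 latent channels.
rgb_ratio = compression_ratio(512, 512, in_channels=3, latent_channels=4)

# PBR VAE at 512x512: 8 channels -> 14 latent channels (assumed 8x downsampling).
pbr_ratio = compression_ratio(512, 512, in_channels=8, latent_channels=14)

print(f"RGB VAE: {rgb_ratio:.1f}x, PBR VAE: {pbr_ratio:.1f}x")
# prints "RGB VAE: 48.0x, PBR VAE: 36.6x"
```

Under these assumptions the PBR VAE compresses somewhat less aggressively than the RGB VAE, consistent with the stated goal of trading compression ratio for reconstruction quality.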

Data Efficiency Experiment (without PBR prompt attention):

| Training Data Ratio | CMMD(PBR)↓ | FID(PBR)↓ | Description |
|---|---|---|---|
| 1% (~60k images) | 6.25 | 11.87 | Still works with only a few thousand objects |
| 5% | 5.77 | 11.49 | Near-full data performance |
| 98% | 6.30 | 11.65 | Full dataset |

Key Findings

  • Bi-directional communication is crucial for PBR generation; the one-way communication scheme cannot even align object positions.
  • A simple pixel-wise linear layer is sufficient as the communication layer; MLP and attention mechanisms yield no significant advantages.
  • Disabling prompt attention in the PBR branch is essential for OOD (Out-of-Distribution) generalization, especially on small datasets.
  • The method is highly data-efficient, capable of generating reasonable PBR materials with only 1% of the training data.
  • It is compatible with IP-Adapter because the RGB model is completely frozen.

Highlights & Insights

  • The frozen + parallel design paradigm is remarkably elegant: it leverages pre-trained model priors without corrupting their weights, while maintaining compatibility with third-party control technologies.
  • The theoretical motivation is clear: the joint reverse process is decomposed using a Bayesian formulation.
  • The geometry-tangent-space bump map design accounts for the arbitrariness of UV mapping in practical applications.

Limitations & Future Work

  • The most common failure modes involve roughness, metallic, or bump maps being generated as flat constant maps (lacking details).
  • The training data only comes from Objaverse, limiting generalization to real-world scenes.
  • The simplifications of using a fixed environment map and camera setups might restrict certain application scenarios.
  • Validation was conducted only on StableDiffusion 1.5/2.1, leaving extensions to larger models unexplored.

Related Work & Insights

  • ControlNet/ControlNet-XS: One-way and semi-bi-directional variants of the control paradigm; this paper demonstrates that fully bi-directional communication is indispensable for the PBR task.
  • AnimateAnyone: Only one-way communication (control model \(\rightarrow\) generation), rendering it unsuitable for scenarios requiring bi-directional information flow.
  • Wonder3D/UniDream: Cross-domain self-attention schemes, which scale poorly as the number of modalities increases.
  • Insight: For tasks requiring generative training for new modalities on limited data, a frozen foundational model + a parallel branch + bi-directional communication constitutes a highly effective, generic paradigm.

Rating

  • Novelty: ⭐⭐⭐⭐ The Collaborative Control paradigm is highly novel, addressing practical pain points in PBR generation.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Highly comprehensive ablation study, covering communication paradigms, layer types, VAE configurations, data scale, resolution, training budget, etc.
  • Writing Quality: ⭐⭐⭐⭐ Clear theoretical derivations with well-organized motivation and experiments.
  • Value: ⭐⭐⭐⭐ Offers direct practical value for 3D content generation pipelines, with a paradigm adaptable to other multimodal generation tasks.