Material Anything: Generating Materials for Any 3D Object via Diffusion

Conference: CVPR 2025
arXiv: 2411.15138
Code: Not open-sourced
Area: 3D Vision / Material Generation
Keywords: PBR material generation, diffusion models, confidence mask, triple-head U-Net, UV space refinement

TL;DR

This paper proposes Material Anything, a fully automated, unified diffusion framework that adapts a pre-trained image diffusion model to generate PBR materials (albedo, roughness, metallic, bump) using a triple-head U-Net architecture, a confidence mask, and a rendering loss. Combined with a confidence-guided progressive multi-view generation strategy and a UV-space refinement model, it produces high-quality material maps for 3D objects across four input scenarios with different lighting (untextured, albedo-only, scanned, generated) using a single model.

Background & Motivation

Background: PBR (physically-based rendering) materials are key to maintaining a consistent appearance of 3D objects under different lighting. Handcrafting materials is time-consuming and labor-intensive, and existing 3D texture generation methods (TEXTure, Paint3D, SyncMVD) generate textures containing baked lighting information (highlights/shadows), lacking true material modeling.

Limitations of Prior Work: (1) Optimization-based methods (NvDiffRec, DreamMat) require a separate optimization for each object, which is time-consuming and difficult to automate; (2) Retrieval-based methods (Make-it-Real) rely on SAM+GPT segmentation and material matching, which makes the system fragile and unable to handle fine-grained regions; (3) Existing methods are sensitive to lighting conditions: real-world scans carry complex lighting, generated textures carry unrealistic lighting, and albedo-only textures carry none, and no unified solution covers all three cases.

Key Challenge: How to automatically handle material generation under various lighting conditions using a unified model while ensuring multi-view consistency?

Key Insight: Recast 3D material generation as an image-based material estimation task that leverages the strong priors of pre-trained image diffusion models. A confidence mask indicates how reliable the lighting in the input is: in high-confidence regions the model estimates materials from lighting cues, and in low-confidence regions it generates materials from semantics.

Method

Overall Architecture

Two-stage pipeline: (1) Image-space material diffusion. For each rendered view of the 3D object, a triple-head U-Net conditioned on the rendered image, normal map, and confidence mask generates three material maps: albedo, roughness-metallic, and bump. (2) UV-space material refinement. After the multi-view materials are projected into UV space, a refinement diffusion model fills occluded regions and removes seams. For untextured objects, a coarse texture is first generated with an image diffusion model before this pipeline runs.

Key Designs

  1. Triple-Head Diffusion U-Net (Triple-Head Architecture):

    • Splits the initial convolutional layer + first DownBlock and the last UpBlock + output convolutions of the U-Net into three independent branches.
    • The three branches respectively output albedo (RGB), roughness-metallic (R=1, G=roughness, B=metallic), and bump (RGB).
    • Intermediate layers are shared to maintain consistency among materials, while the branch heads prevent coupling interference between materials.
    • Design Motivation: A vanilla U-Net that directly outputs 12-channel latents suffers from coupling effects (e.g., the bump map is colored by the albedo), whereas training separate material VAEs is infeasible on limited PBR data.
  2. Confidence Mask:

    • Sets different confidence values for different lighting conditions to guide the model behavior:
      • Real scans (reliable lighting) \(\rightarrow\) confidence=1, the model estimates materials using lighting cues.
      • No lighting / pure albedo \(\rightarrow\) confidence=0, the model generates materials based on semantics.
      • Generated textures (unreliable lighting) \(\rightarrow\) confidence=1 for known regions and confidence=0 for other regions, allowing adaptive switching.
    • During training, various lighting conditions are simulated via data augmentation (random light types + image stitching + degradation), with the confidence mask marking degraded regions.
    • Design Motivation: Unifying the two tasks of "material estimation" and "material generation" into a single model, where the confidence mask acts as the switch.
  3. Confidence-Guided Progressive Multi-View Generation:

    • Generates materials view by view, initializing the noise latent of each new view with the known materials from previous views: \(\hat{z}_t = \hat{z}_t \cdot (1-\hat{m}) + z_t \cdot \hat{m}\)
    • For views with generated lighting, the confidence mask is additionally used to mark known regions (confidence=1) and regions to be generated (confidence=0).
    • Finally, all multi-view materials are baked into the UV space, and a UV refinement diffusion model (inputting coarse materials + CCM coordinate maps) is used to fill holes and remove seams.
    • Design Motivation: Progressive generation avoids the high-resolution/multi-channel bottleneck of multi-view diffusion, and the confidence mask simultaneously serves lighting adaptation and multi-view consistency.
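
The confidence mask rules and the latent initialization above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scenario labels and the function names `confidence_mask` and `blend_known_latents` are hypothetical, the blending formula is read as "keep the noised known materials where the mask is 1 and fresh noise elsewhere", and the mask is assumed to already be at latent resolution.

```python
import numpy as np

def confidence_mask(scenario, height, width, known_region=None):
    """Build a per-pixel confidence mask for one rendered view.

    The scenario labels are hypothetical names for the three cases
    listed in the text; they are not identifiers from the paper.
    """
    if scenario == "scanned":        # reliable lighting: estimate from cues
        return np.ones((height, width), dtype=np.float32)
    if scenario == "albedo_only":    # no lighting: generate from semantics
        return np.zeros((height, width), dtype=np.float32)
    if scenario == "generated":      # unreliable lighting: 1 only where known
        if known_region is None:
            raise ValueError("generated views need a known-region mask")
        return known_region.astype(np.float32)
    raise ValueError(f"unknown scenario: {scenario}")

def blend_known_latents(noise_latent, known_latent, known_mask):
    """Initialize a new view: known materials where mask == 1, noise elsewhere.

    noise_latent, known_latent: arrays of shape (C, H, W).
    known_mask: array of shape (H, W) with values in [0, 1].
    """
    m = known_mask[None, :, :]  # broadcast the mask over latent channels
    return noise_latent * (1.0 - m) + known_latent * m

# Toy usage: the left half of a 4x4 view was covered by earlier views.
rng = np.random.default_rng(0)
known = np.zeros((4, 4), dtype=bool)
known[:, :2] = True
mask = confidence_mask("generated", 4, 4, known_region=known)
noise = rng.standard_normal((4, 4, 4)).astype(np.float32)
previous = rng.standard_normal((4, 4, 4)).astype(np.float32)
z_init = blend_known_latents(noise, previous, mask)
```

In this toy setup the same array serves as both the known-region mask for blending and the confidence mask for the generated-lighting case, which mirrors the dual role the text describes.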

Loss & Training

  • v-prediction loss: \(\mathcal{L}_v = \|\hat{V}_\theta(z_t; c, y) - v_t\|_2^2\) (computed separately for each of the three heads)
  • Rendering loss: Decodes material maps \(\rightarrow\) differentiable rendering under random lighting \(\rightarrow\) perceptual loss \(\mathcal{L}_p = \sum_l \|\phi_l(\hat{r}) - \phi_l(r)\|_2^2\) (VGG multi-layer feature matching)
  • L2 material reconstruction loss: L2 loss for each material channel
  • The rendering loss is key to stable training—bridging the domain gap between natural images and material maps.
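
For reference, the target \(v_t\) in the v-prediction loss can be written out using the standard v-parameterization of diffusion models. This summary does not state the noise schedule or how the three losses are weighted, so \(\alpha_t\), \(\sigma_t\), the weights \(\lambda_p\) and \(\lambda_m\), and the name \(\mathcal{L}_{\text{mat}}\) for the L2 material reconstruction loss are assumptions introduced here for illustration:

\[
z_t = \alpha_t z_0 + \sigma_t \epsilon, \qquad v_t = \alpha_t \epsilon - \sigma_t z_0, \qquad \epsilon \sim \mathcal{N}(0, I)
\]

\[
\mathcal{L} = \mathcal{L}_v + \lambda_p \mathcal{L}_p + \lambda_m \mathcal{L}_{\text{mat}}
\]

Here \(z_0\) is the clean material latent. A plain weighted sum is only one plausible way to combine the three terms.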

Key Experimental Results

Main Results

Quantitative comparison (FID/CLIP):

| Method | Type | FID↓ | CLIP Score↑ |
| --- | --- | --- | --- |
| Text2Tex | Learning/Untextured | 116.41 | 30.33 |
| SyncMVD | Learning/Untextured | 118.46 | 30.66 |
| NvDiffRec | Optimization/Untextured | 103.81 | 30.14 |
| DreamMat | Optimization/Untextured | 113.34 | 30.64 |
| Ours | Learning/Untextured | 100.63 | 31.06 |
| Make-it-Real | Retrieval/Textured | 104.38 | 88.62 |
| Ours | Learning/Textured | 101.19 | 89.70 |

  • Achieves the best FID/CLIP under both untextured and textured settings.
  • Produces results comparable to Rodin Gen-1 and Tripo3D (which use massive-scale data).

Ablation Study

Triple-head U-Net + Rendering Loss Ablation (Material RMSE ↓):

| Configuration | Albedo | Roughness | Metallic | Bump |
| --- | --- | --- | --- | --- |
| W/O Triple-head | 0.0800 | 0.1196 | 0.1584 | 0.0824 |
| W/O Rendering Loss | 0.1442 | 0.1943 | 0.2594 | 0.0716 |
| Full | 0.0604 | 0.0877 | 0.1193 | 0.0313 |

Confidence Mask Ablation (Mean RMSE ↓):

| Configuration | Unlit | Real Lit | Unrealistic Lit | Mean |
| --- | --- | --- | --- | --- |
| W/O Confidence | 0.1521 | 0.1074 | 0.1111 | 0.1235 |
| Full | 0.1102 | 0.0747 | 0.0847 | 0.0899 |

Key Findings

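  • Rendering loss is the most critical component for albedo, roughness, and metallic: removing it raises every material RMSE by more than \(2\times\), with metallic showing the largest absolute increase (0.1193 to 0.2594, about \(2.2\times\)). For bump, removing the triple-head design hurts slightly more (0.0824 vs. 0.0716).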
  • The triple-head architecture effectively decouples materials—with a vanilla U-Net, the bump map gets colored by the albedo.
  • The confidence mask improves material quality under all three lighting conditions (reducing average RMSE by 27%).
  • Progressive generation + confidence mask eliminates obvious inconsistencies in multi-view materials under generated lighting scenarios.
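
The ratios quoted in these findings can be rechecked from the two ablation tables. The short script below only redoes that arithmetic; the dictionary and variable names are illustrative, and the numbers are copied from the tables.

```python
# RMSE values copied from the triple-head / rendering-loss ablation table.
full = {"albedo": 0.0604, "roughness": 0.0877, "metallic": 0.1193, "bump": 0.0313}
no_render = {"albedo": 0.1442, "roughness": 0.1943, "metallic": 0.2594, "bump": 0.0716}

# How many times worse each material gets without the rendering loss.
ratios = {name: no_render[name] / full[name] for name in full}

# Mean RMSE values copied from the confidence mask ablation table.
mean_without_conf = 0.1235
mean_full = 0.0899
relative_reduction = (mean_without_conf - mean_full) / mean_without_conf

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x worse without rendering loss")
print(f"confidence mask lowers mean RMSE by {relative_reduction:.1%}")
```

Every ratio comes out above 2, and the relative reduction from the confidence mask rounds to 27%, matching the figure quoted above.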

Highlights & Insights

  • Unified framework: A single model handles four input scenarios (untextured, albedo-only, scanned, generated) with different lighting conditions, replacing earlier pipelines that treated each case in isolation.
  • Dual-purpose confidence mask: It acts as a "switch" that lets the model move between estimation and generation modes, and it is reused to enforce multi-view consistency.
  • Key role of rendering loss: Differentiable rendering maps the predicted materials back to image space, where they are compared against reference renderings; this closed loop stabilizes cross-domain training (natural images \(\rightarrow\) material maps).
  • Material3D Dataset: 80K high-quality PBR objects rendered under various lighting conditions, providing the community with a training foundation for material generation.

Limitations & Future Work

  • Material resolution is limited by the resolution of latent diffusion (fine details are inferior to high-resolution handcrafted materials).
  • Limited ability to model extreme reflective/transparent materials.
  • Texture seam issues still exist in progressive multi-view generation (although mitigated by UV refinement).
  • The physical accuracy of generated materials has not been validated, e.g., by comparison against measured real-world BRDFs.
  • Relies on the category coverage of the pre-trained diffusion model.

Related Work & Insights

  • From texture generation to material generation: The output of texture generation methods (Paint3D, TEXTure) contains baked lighting. Material Anything proves that directly generating decoupled PBR materials is feasible and more practical.
  • Generality of the confidence mechanism: The design concept of the confidence mask can be extended to other generative tasks that require unifying supervised/unsupervised signals.
  • Collaboration with 3D generation pipelines: Can serve as a downstream module for 3D generation pipelines (such as LRM / InstantMesh) to add realistic materials to generated 3D models.

Rating

⭐⭐⭐⭐ — The unified framework is elegantly designed (the confidence mask solves both lighting adaptation and multi-view consistency), performing better than specialized methods across all four scenarios, which holds high engineering value. The designs of the triple-head architecture and rendering loss, though not entirely novel, are effectively combined. There is room for improvement in material resolution and physical accuracy.