Material Anything: Generating Materials for Any 3D Object via Diffusion
Conference: CVPR 2025
arXiv: 2411.15138
Code: Not open-sourced
Area: 3D Vision / Material Generation
Keywords: PBR material generation, diffusion models, confidence mask, triple-head U-Net, UV space refinement
TL;DR
This paper proposes Material Anything, a fully automated, unified diffusion framework that adapts a pre-trained image diffusion model to generate PBR materials (albedo, roughness, metallic, bump) using a triple-head U-Net, a confidence mask, and a rendering loss. Combined with confidence-guided progressive multi-view generation and a UV-space refinement model, it produces high-quality material maps for 3D objects across four input scenarios (untextured, albedo-only, scanned, generated) with a single model.
Background & Motivation
Background: PBR (physically-based rendering) materials are key to maintaining a consistent appearance of 3D objects under different lighting. Handcrafting materials is time-consuming and labor-intensive, and existing 3D texture generation methods (TEXTure, Paint3D, SyncMVD) generate textures containing baked lighting information (highlights/shadows), lacking true material modeling.
Limitations of Prior Work: (1) Optimization-based methods (NvDiffRec, DreamMat) optimize each object independently, which is slow and hard to automate; (2) Retrieval-based methods (Make-it-Real) depend on SAM+GPT segmentation and material matching, which makes the system fragile and unable to handle fine regions; (3) Existing methods are sensitive to input lighting: real-world scans carry complex lighting, generated textures carry unrealistic lighting, and albedo-only textures carry none, yet no unified solution covers all three cases.
Key Challenge: How to automatically handle material generation under various lighting conditions using a unified model while ensuring multi-view consistency?
Key Insight: Recast 3D material generation as image-based material estimation. Building on the strong priors of pre-trained image diffusion models, a confidence mask indicates how reliable the lighting is in each region: in high-confidence regions the model estimates materials from lighting cues, and in low-confidence regions it generates materials from semantics.
Method
Overall Architecture
Two-stage pipeline: (1) Image-space material diffusion: for each rendered view of the 3D object, a triple-head U-Net conditioned on the rendered image, normal map, and confidence mask generates three material maps (albedo, roughness-metallic, and bump). (2) UV-space material refinement: after the multi-view materials are projected into UV space, a refinement diffusion model fills occluded regions and removes seams. For untextured objects, a coarse texture is first generated with an image diffusion model before this pipeline is applied.
Key Designs
- Triple-Head Diffusion U-Net (Triple-Head Architecture):
  - Splits the initial convolutional layer + first DownBlock and the last UpBlock + output convolutions of the U-Net into three independent branches.
  - The three branches output albedo (RGB), roughness-metallic (R=1, G=roughness, B=metallic), and bump (RGB), respectively.
  - Intermediate layers are shared to keep the materials consistent, while the separate branch heads prevent coupling between materials.
  - Design Motivation: A vanilla U-Net that directly outputs 12-channel latents suffers from coupling effects (e.g., the bump map is colored by the albedo), whereas training a separate VAE per material is infeasible with limited PBR data.
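The branch layout described above can be illustrated with a toy NumPy sketch. It is not the paper's network: 1×1 convolutions (per-pixel linear maps) stand in for the real convolutional blocks, the class name `TripleHeadToy` and all channel widths are hypothetical, and the way branch features are merged before the shared layers is not stated in this summary, so the sketch simply averages them.

```python
import numpy as np


class TripleHeadToy:
    """Toy layout: three private stems, one shared trunk, three private heads."""

    NAMES = ("albedo", "roughness_metallic", "bump")

    def __init__(self, latent_ch=4, hidden_ch=16, seed=0):
        rng = np.random.default_rng(seed)
        # One private input stem and one private output head per material.
        self.stems = {n: rng.normal(0, 0.1, (hidden_ch, latent_ch)) for n in self.NAMES}
        self.heads = {n: rng.normal(0, 0.1, (latent_ch, hidden_ch)) for n in self.NAMES}
        # Shared intermediate layers, reduced here to a single weight matrix.
        self.trunk = rng.normal(0, 0.1, (hidden_ch, hidden_ch))

    @staticmethod
    def _conv1x1(weight, x):
        # x: (C_in, H, W) -> (C_out, H, W)
        return np.einsum("oc,chw->ohw", weight, x)

    def __call__(self, latents):
        # latents: dict mapping material name -> (latent_ch, H, W) array
        stem_feats = [np.tanh(self._conv1x1(self.stems[n], latents[n])) for n in self.NAMES]
        # ASSUMPTION: branch features are merged by averaging.
        shared = np.tanh(self._conv1x1(self.trunk, np.mean(stem_feats, axis=0)))
        # Each head reads the same shared features but has its own weights.
        return {n: self._conv1x1(self.heads[n], shared) for n in self.NAMES}


rng = np.random.default_rng(1)
noisy = {n: rng.normal(size=(4, 8, 8)) for n in TripleHeadToy.NAMES}
out = TripleHeadToy()(noisy)
```

With three 4-channel latents the model handles 12 latent channels in total, matching the 12-channel figure quoted above, while each material keeps its own input and output weights.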
- Confidence Mask:
  - Sets different confidence values for different lighting conditions to guide the model behavior:
    - Real scans (reliable lighting) \(\rightarrow\) confidence=1, the model estimates materials using lighting cues.
    - No lighting / pure albedo \(\rightarrow\) confidence=0, the model generates materials based on semantics.
    - Generated textures (unreliable lighting) \(\rightarrow\) confidence=1 for known regions and confidence=0 for other regions, allowing adaptive switching.
  - During training, various lighting conditions are simulated via data augmentation (random light types + image stitching + degradation), with the confidence mask marking degraded regions.
  - Design Motivation: Unifying the two tasks of "material estimation" and "material generation" into a single model, where the confidence mask acts as the switch.
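The three confidence rules listed above can be written as a small helper. This is a sketch under assumptions: the function name `build_confidence_mask`, the string labels for input types, and the `known_region` argument are all hypothetical, and the mask is taken to be a per-pixel map with the same spatial size as the rendered view.

```python
import numpy as np


def build_confidence_mask(input_type, height, width, known_region=None):
    """Return an (H, W) float mask: 1 = trust lighting cues, 0 = rely on semantics."""
    if input_type == "scanned":
        # Real scans: lighting is treated as reliable everywhere.
        return np.ones((height, width), dtype=np.float32)
    if input_type == "albedo_only":
        # No lighting present: nothing to estimate from, so generate from semantics.
        return np.zeros((height, width), dtype=np.float32)
    if input_type == "generated":
        # Generated textures: 1 on known regions, 0 on everything else.
        if known_region is None:
            return np.zeros((height, width), dtype=np.float32)
        region = np.asarray(known_region, dtype=bool)
        if region.shape != (height, width):
            raise ValueError("known_region must have shape (height, width)")
        return region.astype(np.float32)
    raise ValueError(f"unknown input_type: {input_type!r}")


# Example: left half of a generated view is already known from earlier views.
known = np.zeros((4, 4), dtype=bool)
known[:, :2] = True
mask_generated = build_confidence_mask("generated", 4, 4, known)
```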
- Confidence-Guided Progressive Multi-View Generation:
  - Generates materials view by view, initializing the noise latent of each new view with the known materials from previous views: \(\hat{z}_t = \hat{z}_t \cdot (1-\hat{m}) + z_t \cdot \hat{m}\)
  - For views with generated lighting, the confidence mask is additionally used to mark known regions (confidence=1) and regions to be generated (confidence=0).
  - Finally, all multi-view materials are baked into the UV space, and a UV refinement diffusion model (inputting coarse materials + CCM coordinate maps) is used to fill holes and remove seams.
  - Design Motivation: Progressive generation avoids the high-resolution/multi-channel bottleneck of multi-view diffusion, and the confidence mask simultaneously serves lighting adaptation and multi-view consistency.
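The latent blending formula in the progressive generation step can be sketched in NumPy. The reading of the symbols is an assumption, since this summary does not define them: \(\hat{z}_t\) is taken to be the new view's noisy latent, \(z_t\) the latent of already-generated materials projected into that view and noised to step \(t\), and \(\hat{m}\) the known-region mask. The function name `blend_known_latent` is hypothetical.

```python
import numpy as np


def blend_known_latent(current_latent, known_latent, known_mask):
    """Compute z_hat * (1 - m) + z_known * m, broadcasting the mask over channels."""
    m = np.asarray(known_mask, dtype=current_latent.dtype)[None, :, :]
    return current_latent * (1.0 - m) + known_latent * m


rng = np.random.default_rng(0)
z_new = rng.normal(size=(4, 8, 8))    # noisy latent of the view being generated
z_known = rng.normal(size=(4, 8, 8))  # noised latent projected from earlier views
m_known = np.zeros((8, 8))
m_known[:, :4] = 1.0                  # left half is covered by earlier views

z_blend = blend_known_latent(z_new, z_known, m_known)
```

Where the mask is 1 the result copies the known latent, and where it is 0 the new view's latent is left untouched, so only unseen regions are free to be generated.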
Loss & Training
- v-prediction loss: \(\mathcal{L}_v = \|\hat{V}_\theta(z_t; c, y) - v_t\|_2^2\) (predicted respectively by the three heads)
- Rendering loss: Decodes material maps \(\rightarrow\) differentiable rendering under random lighting \(\rightarrow\) perceptual loss \(\mathcal{L}_p = \sum_l \|\phi_l(\hat{r}) - \phi_l(r)\|_2^2\) (VGG multi-layer feature matching)
- L2 material reconstruction loss: L2 loss for each material channel
- The rendering loss is key to stable training because it bridges the domain gap between natural images and material maps.
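The v-prediction loss above does not spell out the target \(v_t\). A minimal sketch, assuming the standard v-parameterization (\(z_t = \alpha_t z_0 + \sigma_t \epsilon\), \(v_t = \alpha_t \epsilon - \sigma_t z_0\), with \(\alpha_t^2 + \sigma_t^2 = 1\)), is given below. The function names are hypothetical, and the loss is averaged rather than summed, which differs from the squared norm only by a constant factor.

```python
import numpy as np


def v_target(z0, eps, alpha_t, sigma_t):
    """Standard v-parameterization target: v_t = alpha_t * eps - sigma_t * z0."""
    return alpha_t * eps - sigma_t * z0


def v_prediction_loss(v_pred, z0, eps, alpha_t, sigma_t):
    """Mean squared error between a predicted v and the target v_t."""
    return float(np.mean((v_pred - v_target(z0, eps, alpha_t, sigma_t)) ** 2))


def z0_from_v(z_t, v, alpha_t, sigma_t):
    """Recover the clean latent from z_t and v (valid when alpha^2 + sigma^2 = 1)."""
    return alpha_t * z_t - sigma_t * v


rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 8, 8))       # clean latent of one material map
eps = rng.normal(size=(4, 8, 8))      # Gaussian noise
alpha_t, sigma_t = np.cos(0.3), np.sin(0.3)

z_t = alpha_t * z0 + sigma_t * eps
v_t = v_target(z0, eps, alpha_t, sigma_t)
z0_rec = z0_from_v(z_t, v_t, alpha_t, sigma_t)
perfect_loss = v_prediction_loss(v_t, z0, eps, alpha_t, sigma_t)
```

Being able to recover \(z_0\) from a predicted \(v\) is what makes a rendering loss possible in principle: the recovered latent can be decoded to material maps and rendered, although the exact procedure used in the paper is not described in this summary.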
Key Experimental Results
Main Results
Quantitative comparison (FID/CLIP):
| Method | Type | FID↓ | CLIP Score↑ |
|---|---|---|---|
| Text2Tex | Learning/Untextured | 116.41 | 30.33 |
| SyncMVD | Learning/Untextured | 118.46 | 30.66 |
| NvDiffRec | Optimization/Untextured | 103.81 | 30.14 |
| DreamMat | Optimization/Untextured | 113.34 | 30.64 |
| Ours | Learning/Untextured | 100.63 | 31.06 |
| Make-it-Real | Retrieval/Textured | 104.38 | 88.62 |
| Ours | Learning/Textured | 101.19 | 89.70 |
- Achieves the best FID/CLIP under both untextured and textured settings.
- Comparable results to Rodin Gen-1 and Tripo3D (which use massive-scale data).
Ablation Study
Triple-head U-Net + Rendering Loss Ablation (Material RMSE ↓):
| Configuration | Albedo | Roughness | Metallic | Bump |
|---|---|---|---|---|
| W/O Triple-head | 0.0800 | 0.1196 | 0.1584 | 0.0824 |
| W/O Rendering Loss | 0.1442 | 0.1943 | 0.2594 | 0.0716 |
| Full | 0.0604 | 0.0877 | 0.1193 | 0.0313 |
Confidence Mask Ablation (Mean RMSE ↓):
| Configuration | Unlit | Real Lit | Unrealistic Lit | Mean |
|---|---|---|---|---|
| W/O Confidence | 0.1521 | 0.1074 | 0.1111 | 0.1235 |
| Full | 0.1102 | 0.0747 | 0.0847 | 0.0899 |
Key Findings
- Rendering loss is the most critical component: removing it more than doubles every material RMSE, with metallic showing the largest absolute increase (0.1193 to 0.2594, about \(2.2\times\)).
- The triple-head architecture effectively decouples materials: with a vanilla U-Net, the bump map is colored by the albedo.
- The confidence mask improves material quality under all three lighting conditions, reducing mean RMSE from 0.1235 to 0.0899 (about 27%).
- Progressive generation + confidence mask eliminates obvious inconsistencies in multi-view materials under generated lighting scenarios.
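The ratios quoted in these findings can be re-derived from the two ablation tables. The short script below uses only the table values; the variable names are arbitrary.

```python
# RMSE values copied from the triple-head / rendering-loss ablation table.
full = {"albedo": 0.0604, "roughness": 0.0877, "metallic": 0.1193, "bump": 0.0313}
no_render = {"albedo": 0.1442, "roughness": 0.1943, "metallic": 0.2594, "bump": 0.0716}

# How much worse each material gets without the rendering loss.
ratios = {k: no_render[k] / full[k] for k in full}
increases = {k: no_render[k] - full[k] for k in full}

# Per-condition RMSE from the confidence-mask ablation (unlit, real lit, unrealistic lit).
no_conf = [0.1521, 0.1074, 0.1111]
with_conf = [0.1102, 0.0747, 0.0847]
mean_no_conf = sum(no_conf) / len(no_conf)
mean_with_conf = sum(with_conf) / len(with_conf)
relative_reduction = 1.0 - mean_with_conf / mean_no_conf

for name in full:
    print(f"{name}: x{ratios[name]:.2f} (+{increases[name]:.4f})")
print(f"mean RMSE reduction from the confidence mask: {relative_reduction:.1%}")
```

Every ratio comes out above 2. Metallic has the largest absolute increase, while albedo has the largest relative one, and the mean RMSE reduction from the confidence mask is roughly 27%.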
Highlights & Insights
- Unified framework: A single model handles four input scenarios (untextured, albedo-only, scanned, generated), replacing the separate, scenario-specific pipelines used previously.
- Dual-purpose confidence mask: It acts as a switch that lets the model move between estimation and generation modes, and it is reused to enforce multi-view consistency.
- Key role of rendering loss: Re-rendering the predicted materials with a differentiable renderer and comparing the result in image space stabilizes cross-domain training (natural images \(\rightarrow\) material maps).
- Material3D Dataset: 80K high-quality PBR objects rendered under various lighting conditions, providing the community with a training foundation for material generation.
Limitations & Future Work
- Material resolution is limited by the resolution of latent diffusion (fine details are inferior to high-resolution handcrafted materials).
- Limited ability to model extreme reflective/transparent materials.
- Texture seam issues still exist in progressive multi-view generation (although mitigated by UV refinement).
- The physical accuracy of the generated materials has not been validated, for example by comparison with measured real-world BRDFs.
- Relies on the category coverage of the pre-trained diffusion model.
Related Work & Insights
- From texture generation to material generation: The output of texture generation methods (Paint3D, TEXTure) contains baked lighting. Material Anything shows that directly generating decoupled PBR materials is feasible and more practical.
- Generality of the confidence mechanism: The design concept of the confidence mask can be extended to other generative tasks that require unifying supervised/unsupervised signals.
- Integration with 3D generation pipelines: It can serve as a downstream module for 3D generation pipelines (such as LRM / InstantMesh) to add realistic materials to generated 3D models.
Rating
⭐⭐⭐⭐ The unified framework is well designed (the confidence mask addresses both lighting adaptation and multi-view consistency) and outperforms the specialized baselines in both the untextured and textured comparisons, which gives it high engineering value. The triple-head architecture and rendering loss are not entirely novel, but they are combined effectively. Material resolution and physical accuracy leave room for improvement.