# MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion

- Conference: ICCV 2025
- arXiv: 2503.10289
- Code: GitHub
- Area: 3D Vision / PBR Material Generation
- Keywords: PBR texture, multi-view diffusion, illumination invariance, material generation, dual-channel
## TL;DR

MaterialMVP is an end-to-end multi-view PBR texture generation model. It decouples illumination via consistency-regularized training and aligns albedo and metallic-roughness maps with a dual-channel material generation framework (MCAA + Learnable Material Embeddings), enabling single-pass generation of high-quality, illumination-invariant, multi-view-consistent PBR materials from a 3D mesh and an image prompt.
## Background & Motivation
Background: PBR texture generation is a core task in 3D asset creation. Existing approaches fall into two categories: (1) SDS-based optimization methods (Text2Tex, Paint-it, etc.), which yield high quality but require minutes of inference; and (2) generative methods (SuperMat, RGB↔X), which are fast but limited to single-view inputs or lack precise cross-channel alignment.
Limitations of Prior Work:

- Optimization-based methods such as TextureDreamer and HyperDreamer are computationally expensive and unsuitable for large-scale production.
- Single-view generation methods cannot guarantee multi-view consistency and are prone to seams and the Janus effect.
- CLAY uses IP-Adapter to incorporate reference images, but the alignment between generated textures and the input remains insufficiently precise.
- Diffusion model outputs tend to "bake" the illumination information from reference images into the textures, producing non-physical artifacts.
Core Idea: Construct an end-to-end multi-view PBR diffusion framework that simultaneously addresses illumination disentanglement, multi-channel alignment, and reference fidelity through three key designs: consistency-regularized training, dual-channel material generation, and reference attention.
## Method

### Overall Architecture
Input: a 3D mesh (normal and position maps encoded into latent space and concatenated with the noise latents) plus a reference image → Reference Attention extracts reference information → dual-channel U-Net performs parallel denoising → consistency-regularized training enforces illumination invariance → Output: 6-view PBR materials (albedo / metallic / roughness). The model is initialized from the SD 2.1 ZSNR checkpoint and optimized with AdamW (lr = \(5 \times 10^{-5}\), 2000-step warmup); training costs roughly 180 GPU-days.
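As a rough illustration of how this geometry conditioning could be wired, the sketch below encodes normal and position maps with a diffusers-style VAE and concatenates them channel-wise with the noised latents; the function name, tensor shapes, and the widened `conv_in` are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch of the geometry conditioning; assumes a diffusers-style
# AutoencoderKL and is not the authors' implementation.
import torch

def build_unet_input(vae, normal_maps, position_maps, noisy_latents):
    """Encode per-view normal/position renders and concatenate with the noised latents.

    normal_maps, position_maps: (B*V, 3, H, W) renders of the mesh for V views.
    noisy_latents:              (B*V, 4, h, w) noised PBR latents at the current timestep.
    """
    with torch.no_grad():
        normal_lat = vae.encode(normal_maps).latent_dist.mode() * vae.config.scaling_factor
        position_lat = vae.encode(position_maps).latent_dist.mode() * vae.config.scaling_factor
    # Channel-wise concatenation; the U-Net's first conv would need to be widened
    # (here hypothetically to 4 + 4 + 4 = 12 input channels) to accept this tensor.
    return torch.cat([noisy_latents, normal_lat, position_lat], dim=1)
```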
### Key Designs
- Consistency-Regularized Training
- Design Motivation: Addresses view sensitivity and illumination entanglement — small camera pose changes lead to drastically different material outputs, and reference image illumination is "baked" into the output.
- Core Idea: Each training step uses a pair of reference images \((I_1, I_2)\) that differ slightly in viewpoint/illumination but are required to produce identical network outputs.
- Reference pair selection: Image pairs at adjacent azimuths (\(\pm 15°\)) are sampled from 312 renderings (4 elevations × 24 azimuths × multiple lighting conditions).
- Training loss: \(\mathcal{L} = (1-\lambda)\mathcal{L}_{pbr} + \lambda\mathcal{L}_{cons}\), where \(\mathcal{L}_{cons} = \mathbb{E}_t[\|\epsilon_t^1 - \epsilon_t^2\|_2^2]\), \(\lambda=0.1\).
- Effect: Forces the model to learn illumination-invariant representations, eliminating the influence of input illumination on the output materials.
- Dual-Channel Material Generation + MCAA
- Multi-Channel Aligned Attention (MCAA): The albedo channel retains standard cross-attention \(\text{Attn}_{albedo} = \text{Softmax}(Q_{albedo}K_{ref}^T/\sqrt{d}) \cdot V_{ref}\); the MR channel is not conditioned directly on the reference image (due to the large distribution gap) but instead inherits spatial information from the albedo channel via a residual connection: \(z_{MR}^{new} = z_{MR} + \text{Attn}_{albedo}\) (see the sketch after this list).
- Learnable Material Embeddings: Separate \(16 \times 1024\) learnable embeddings are introduced for albedo and MR respectively, injected into each channel via cross-attention to capture the distinct distributions of the two texture types.
- Design advantage: MCAA itself introduces no additional trainable parameters (it only reuses the albedo channel's cross-attention output), and it sidesteps the difficulty of semantically aligning reference images with MR maps.
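A minimal sketch of the MCAA update described above, operating on flattened spatial tokens; the function signature and tensor names are illustrative assumptions, not taken from the paper's code.

```python
# Illustrative sketch of Multi-Channel Aligned Attention (MCAA); not the authors' code.
import torch

def mcaa(q_albedo, k_ref, v_ref, z_mr):
    """Albedo attends to the reference image; MR reuses that output via a residual.

    q_albedo:     (B, N, d) queries from the albedo channel.
    k_ref, v_ref: (B, M, d) keys/values projected from the reference image features.
    z_mr:         (B, N, d) metallic-roughness latents (never attend to the reference).
    """
    d = q_albedo.shape[-1]
    attn_albedo = torch.softmax(q_albedo @ k_ref.transpose(-2, -1) / d**0.5, dim=-1) @ v_ref
    z_mr_new = z_mr + attn_albedo  # residual injection of the albedo channel's spatial cues
    return attn_albedo, z_mr_new

# The separate learnable material embeddings (e.g. one nn.Parameter of shape 16 x 1024
# per channel) would additionally be attended to by each channel; omitted here for brevity.
```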
### Loss & Training
- PBR loss: \(\mathcal{L}_{pbr} = \mathbb{E}_{\epsilon, t}[\|\epsilon - \epsilon_t^1\|_2^2]\)
- Consistency loss: \(\mathcal{L}_{cons} = \mathbb{E}_t[\|\epsilon_t^1 - \epsilon_t^2\|_2^2]\), \(\lambda = 0.1\)
- Training data: 70,000 Objaverse/Objaverse-XL 3D assets, each rendered at 4 elevations × 24 azimuths, 512×512 resolution.
- Each step samples 6 PBR image groups at the same elevation + 2 reference images.
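Under the definitions above, the combined objective could look like the sketch below; `unet`, its keyword arguments, and the reference-pair handling are placeholder assumptions rather than the released training code.

```python
# Illustrative sketch of the consistency-regularized objective; not the authors' code.
import torch
import torch.nn.functional as F

LAMBDA = 0.1  # weight of the consistency term reported in the paper

def training_loss(unet, noisy_latents, timesteps, noise, geometry_cond, ref_1, ref_2):
    """L = (1 - lambda) * L_pbr + lambda * L_cons.

    ref_1, ref_2: features of two reference images of the same object, rendered at
    adjacent azimuths (+-15 degrees) and/or under different lighting.
    """
    eps_1 = unet(noisy_latents, timesteps, geometry_cond, reference=ref_1)
    eps_2 = unet(noisy_latents, timesteps, geometry_cond, reference=ref_2)

    l_pbr = F.mse_loss(eps_1, noise)   # standard denoising loss against the true noise
    l_cons = F.mse_loss(eps_1, eps_2)  # both references must yield the same prediction
    return (1 - LAMBDA) * l_pbr + LAMBDA * l_cons
```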
## Key Experimental Results

### Main Results
Quantitative comparison on 176 Objaverse evaluation objects:
| Method | Condition | CLIP-FID↓ | FID↓ | CMMD↓ | CLIP-I↑ | LPIPS↓ |
|---|---|---|---|---|---|---|
| Text2Tex | Text | 31.83 | 187.7 | 2.738 | - | 0.1448 |
| SyncMVD | Text | 29.93 | 189.2 | 2.584 | - | 0.1411 |
| Paint-it | Text | 33.54 | 179.1 | 2.629 | - | 0.1538 |
| Paint3D (text) | Text | 30.17 | 185.7 | 2.755 | - | 0.1388 |
| Paint3D (image) | Image | 26.86 | 176.9 | 2.400 | 0.8871 | 0.1261 |
| TexGen | Text+Image | 28.23 | 178.6 | 2.447 | 0.8818 | 0.1331 |
| MaterialMVP | Image | 24.78 | 168.5 | 2.191 | 0.9207 | 0.1211 |
### Ablation Study
Qualitative ablation validating the contribution of each component:
| Ablation Setting | Effect |
|---|---|
| Two-stage method (RGB first, then material estimation) | Glass/metal surfaces appear plastic-like; error accumulation is severe |
| Without consistency loss (\(\lambda=0\)) | Metallic channel frequently over-predicts; multiple material types are erroneously assigned metallic appearance |
| Without MCAA (standard weight sharing) | Spatial misalignment between albedo and MR; texture details are blurred in fine-grained regions |
### Key Findings
- MaterialMVP outperforms all baselines across all five metrics; relative to Paint3D (image), FID is reduced by 8.4 and CLIP-I improves from 0.8871 to 0.9207.
- The consistency loss is critical for eliminating illumination artifacts — removing it causes severe metallic over-prediction.
- End-to-end generation substantially outperforms two-stage approaches due to error accumulation in the latter.
- MCAA avoids semantic alignment difficulties in the MR channel by using residual connections rather than direct reference conditioning.
## Highlights & Insights
- The dual-reference consistency regularization is an elegant design: reference pairs with subtle differences compel the model to "ignore" illumination variation, effectively implementing invariance learning driven by data augmentation.
- MCAA avoids forcing cross-attention between albedo and MR channels where the distribution gap is large, instead using residual connections for implicit alignment — a practical design that introduces no additional parameters.
- Single-pass end-to-end generation of a complete PBR material set (including metallic/roughness) offers substantially greater practical utility than SDS-based methods requiring minutes of optimization.
## Limitations & Future Work
- Quantitative evaluation covers only 176 objects, limiting the scale of assessment; ablation experiments are purely qualitative.
- Training cost is high (~180 GPU-days); inference time is not reported.
- Generalization to out-of-distribution 3D assets (e.g., scanned data) is not evaluated.
- The MR channel relies entirely on spatial information from the albedo channel, which may propagate errors when albedo estimation is inaccurate.
## Related Work & Insights
- vs. CLAY: CLAY employs IP-Adapter, resulting in insufficient alignment precision; MaterialMVP achieves more accurate pixel-level alignment via Reference Attention + MCAA.
- vs. Paint3D: Paint3D is a single-view method; multi-view generation naturally eliminates seams and inconsistencies.
- vs. SuperMat: SuperMat's two-stage pipeline introduces error accumulation that degrades material estimation accuracy.
- Insight: Learnable Material Embeddings are inspired by IC-Light and merit broader application in other multi-channel generation tasks.
## Rating
- Novelty: ⭐⭐⭐⭐ The consistency-regularized training and MCAA dual-channel design are genuinely novel.
- Experimental Thoroughness: ⭐⭐⭐ Quantitative results are comprehensively superior, but ablation studies are qualitative only.
- Writing Quality: ⭐⭐⭐⭐ Structure is clear and visualizations are compelling.
- Value: ⭐⭐⭐⭐ A practical end-to-end PBR generation solution with direct applicability to 3D asset production pipelines.