SuperMat: Physically Consistent PBR Material Estimation at Interactive Rates

Metadata

  • Conference: ICCV 2025
  • arXiv: 2411.17515
  • Code: Project Page
  • Area: 3D Vision
  • Keywords: PBR material decomposition, single-step inference, end-to-end training, re-render loss, 3D material estimation

TL;DR

This paper proposes SuperMat, a single-step inference framework for PBR material decomposition. Through structured expert branches and scheduler correction, it enables end-to-end training and introduces a re-render loss to enforce physical consistency, accelerating inference from seconds to milliseconds.

Background & Motivation

Decomposing PBR materials (albedo, metallic, roughness) from images is a core challenge in 3D asset creation. Existing methods suffer from three major bottlenecks:

Model redundancy: Each material property requires a separate diffusion model, doubling training and inference costs.

Slow inference: DDIM requires 30–50 denoising steps, making it unsuitable for interactive applications.

Insufficient decomposition: Noise-prediction-based training objectives cannot directly supervise final material outputs, hindering the application of advanced techniques such as perceptual loss and re-render loss.

Method

Overall Architecture

SuperMat is built upon fine-tuned Stable Diffusion and incorporates three core designs:

1. Structured Expert Branches

The last UpBlock of the UNet is duplicated into two expert branches:

  • Albedo Branch: dedicated to predicting the diffuse albedo map.
  • RM Branch: dedicated to predicting the joint roughness-metallic map.

Shared modules extract general features, while the expert branches capture material-specific representations. The duplicated parameters (19.3M) account for only 2.23% of the UNet total, enabling multi-material output from a single model.
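The shared-trunk/expert-branch split can be sketched in miniature. The toy below stands in one linear layer per "block" and is not the actual UNet; all names, weights, and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def block(x, w):
    """Stand-in for a UNet block: one linear layer + ReLU."""
    return np.maximum(x @ w, 0)

# One shared trunk, plus the final block duplicated once per material expert
w_shared = rng.standard_normal((16, 16))
w_albedo = rng.standard_normal((16, 3))   # expert head: diffuse albedo
w_rm     = rng.standard_normal((16, 2))   # expert head: roughness-metallic

def decompose(latent):
    feats = block(latent, w_shared)       # shared feature extraction
    return block(feats, w_albedo), block(feats, w_rm)

albedo, rm = decompose(rng.standard_normal((4, 16)))
print(albedo.shape, rm.shape)  # (4, 3) (4, 2)
```

Both outputs come from one forward pass through the shared trunk, which is why the overhead is only the duplicated final block rather than a second full model.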

2. Single-Step Inference and End-to-End Training

Scheduler correction: The default leading timestep setting in DDIM is found to be flawed — during single-step prediction, the timestep presented to the model (\(t=1\), implying nearly noise-free input) is inconsistent with the actual input (pure noise). Correcting this to a trailing setting (\(t=T\)) enables genuine single-step inference.
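The two spacing conventions can be reproduced in a few lines. This is a toy version loosely following the `leading`/`trailing` `timestep_spacing` options in diffusers' DDIMScheduler, with the offset handling simplified:

```python
T = 1000  # training timesteps

def leading_timesteps(num_steps, steps_offset=1):
    """'Leading' spacing: steps counted up from t=0 (plus a small offset)."""
    ratio = T // num_steps
    return [i * ratio + steps_offset for i in range(num_steps)][::-1]

def trailing_timesteps(num_steps):
    """'Trailing' spacing: steps counted down from t=T."""
    ratio = T / num_steps
    return [round(T - i * ratio) - 1 for i in range(num_steps)]

# Single-step inference:
print(leading_timesteps(1))   # [1]   -> model is told its input is nearly clean
print(trailing_timesteps(1))  # [999] -> model is told its input is pure noise
```

With one inference step, the leading schedule hands the model \(t=1\) even though the input is pure noise; the trailing schedule hands it \(t=999\), which matches the actual noise level and makes single-step prediction well-posed.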

End-to-end training: Single-step inference makes backpropagation tractable, allowing losses to be computed directly on the final predicted materials:

\[\mathcal{L}_m = \mathcal{L}_p(\hat{k}_d, k_d) + \mathcal{L}_p(\hat{k}_{rm}, k_{rm})\]

where \(\mathcal{L}_p\) denotes a VGG-16-based perceptual loss.
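A minimal numpy sketch of this loss follows. The feature extractor here is a toy pooling stand-in for VGG-16 activations, and all names are illustrative:

```python
import numpy as np

def phi(x):
    """Stand-in feature extractor (the paper uses VGG-16 activations;
    this toy version just 2x2-average-pools the input)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def perceptual_loss(pred, gt):
    """MSE in (stand-in) feature space."""
    return float(np.mean((phi(pred) - phi(gt)) ** 2))

def material_loss(albedo_hat, albedo, rm_hat, rm):
    # L_m = L_p(k_d_hat, k_d) + L_p(k_rm_hat, k_rm)
    return perceptual_loss(albedo_hat, albedo) + perceptual_loss(rm_hat, rm)

rng = np.random.default_rng(0)
kd, krm = rng.random((64, 64, 3)), rng.random((64, 64, 2))
print(material_loss(kd, kd, krm, krm))  # 0.0 for perfect predictions
```

The key point is that the loss is computed on the final predicted maps, which is only tractable because single-step inference keeps the full computation graph short enough to backpropagate through.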

3. Re-render Loss

The predicted materials are used to render images under novel lighting conditions, which are then compared against ground-truth renderings:

\[\mathcal{L}_{re} = \mathcal{L}_p(\mathcal{R}_{\hat{k}_n, \hat{k}_p, \hat{c}}(\hat{k}_d, \hat{k}_m, \hat{k}_r), \mathcal{R}_{\hat{k}_n, \hat{k}_p, \hat{c}}(k_d, k_m, k_r))\]

This enforces physical consistency across different material attributes — even when individual material maps appear close to the ground truth, the composite rendering result may still be incorrect without such a constraint.
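A toy numpy illustration of the idea: the shading model below is a greatly simplified stand-in for the paper's renderer, and plain L2 replaces the perceptual loss to keep the sketch short:

```python
import numpy as np

def toy_render(albedo, metallic, roughness, normals, light_dir):
    """Greatly simplified shading stand-in: Lambertian diffuse
    plus a roughness-controlled specular lobe."""
    ndotl = np.clip((normals * light_dir).sum(-1, keepdims=True), 0, 1)
    diffuse = albedo * (1 - metallic) * ndotl
    shininess = 2.0 / np.maximum(roughness, 1e-3) ** 2
    spec = metallic * ndotl ** shininess
    return diffuse + spec

def rerender_loss(pred, gt, normals, lights):
    """Average L2 between renders of predicted vs. GT materials
    over several novel lighting directions."""
    losses = []
    for l in lights:
        losses.append(np.mean((toy_render(*pred, normals, l)
                               - toy_render(*gt, normals, l)) ** 2))
    return float(np.mean(losses))

h = w = 8
normals = np.zeros((h, w, 3)); normals[..., 2] = 1.0          # flat surface
lights = [np.array([0.0, 0.0, 1.0]), np.array([0.6, 0.0, 0.8])]
gt   = (np.full((h, w, 3), 0.5), np.full((h, w, 1), 0.2), np.full((h, w, 1), 0.4))
pred = (np.full((h, w, 3), 0.5), np.full((h, w, 1), 0.8), np.full((h, w, 1), 0.4))
print(rerender_loss(gt, gt, normals, lights))       # 0.0
print(rerender_loss(pred, gt, normals, lights) > 0) # True
```

Here the predicted albedo and roughness match ground truth exactly, yet the wrong metallic value changes the diffuse/specular balance of the composite image, which is precisely the cross-attribute error the rendering comparison penalizes.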

Multi-view Extension (SuperMatMV)

Built upon the MVDream architecture, 3D self-attention and camera extrinsic conditioning are incorporated to enable simultaneous decomposition across 6 views, ensuring cross-view consistency.
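The cross-view attention at the heart of this extension can be sketched as follows. This toy version flattens the view axis into the token axis so tokens from all views attend to one another (the core idea behind MVDream-style 3D self-attention), omitting the Q/K/V projections and camera conditioning:

```python
import numpy as np

def cross_view_attention(x):
    """Toy dense "3D" self-attention over (batch, views, tokens, channels):
    merge views into one token set so every view attends to every other."""
    b, v, n, c = x.shape
    t = x.reshape(b, v * n, c)                      # (B, V*N, C)
    logits = t @ t.transpose(0, 2, 1) / np.sqrt(c)  # all-pairs similarity
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                   # softmax over all tokens
    return (w @ t).reshape(b, v, n, c)

x = np.random.default_rng(0).standard_normal((1, 6, 8, 4))
print(cross_view_attention(x).shape)  # (1, 6, 8, 4)
```

Because every token can see tokens from all 6 views, the decomposed material maps stay consistent across views rather than being predicted independently per image.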

UV Refinement Network (3D Extension)

Multi-view decomposition results from SuperMatMV are back-projected into UV space, where a UV refinement network completes uncovered regions and improves overall quality. The complete 3D pipeline requires approximately 3 seconds.

Experiments

Main Results: Image-Space Material Decomposition

| Method | Albedo PSNR↑ | Metallic PSNR↑ | Roughness PSNR↑ | Relighting PSNR↑ | Time (s)↓ |
|---|---|---|---|---|---|
| IIR | 21.94 | 17.95 | 19.73 | 20.98 | 0.04 |
| RGB→X | 22.30 | 15.36 | 20.40 | 21.51 | 3.32 |
| StableMaterial | 23.44 | 20.29 | 21.01 | 22.56 | 0.53 |
| SuperMat w/o e2e | 24.26 | 20.79 | 20.81 | 23.90 | 3.09 |
| SuperMat w/o re-render | 26.70 | 24.54 | 23.52 | 26.41 | 0.07 |
| SuperMat | 27.68 | 25.48 | 24.25 | 27.66 | 0.07 |
| SuperMatMV | 27.56 | 26.11 | 24.84 | 27.64 | 0.09 |

Ablation Study

| Configuration | Albedo PSNR↑ | Relighting PSNR↑ | Time↓ |
|---|---|---|---|
| w/o scheduler correction (multi-step inference) | 24.26 | 23.90 | 3.09 s |
| w/ scheduler correction, w/o re-render | 26.70 | 26.41 | 0.07 s |
| Full SuperMat | 27.68 | 27.66 | 0.07 s |

Key Findings

  1. Scheduler correction is the central contribution: it accelerates inference by roughly 44× (3.09 s → 0.07 s) while improving PSNR by 2–4 dB across the material maps.
  2. Re-render loss contributes substantially: Relighting PSNR improves by 1.25 dB, validating the importance of cross-attribute interaction supervision.
  3. Single model vs. multiple models: The structured expert branches achieve dual-model equivalent functionality with only a 2.23% increase in parameters.

Highlights & Insights

  1. Identification and correction of a DDIM scheduler flaw: This simple fix unlocks the substantial potential of single-step diffusion models and has broad implications.
  2. Re-render loss enables cross-attribute physical constraints: This is the first introduction of rendering consistency constraints in diffusion-based material decomposition.
  3. Millisecond-level inference: SuperMat advances material decomposition from academic research toward practical deployment.

Limitations & Future Work

  • Performance depends on the diversity of lighting conditions and material coverage in the training data.
  • The generalization of the UV refinement network to unseen complex geometries remains to be validated.
  • Single-step inference may sacrifice fine details under extreme lighting conditions.

Related Work

  • Diffusion-based material decomposition: RGB→X, IntrinsicAnything, StableMaterial
  • Traditional methods: Derender3D, IIR
  • Single-step diffusion: DMD, InstaFlow

Rating

  • Novelty: ⭐⭐⭐⭐ (combined innovation of scheduler correction and re-render loss)
  • Technical Depth: ⭐⭐⭐⭐ (systematically addresses three major bottlenecks)
  • Experimental Quality: ⭐⭐⭐⭐⭐ (comprehensive ablation with substantial SOTA improvements)
  • Practical Value: ⭐⭐⭐⭐⭐ (millisecond inference, highly applicable)