ARM: Appearance Reconstruction Model for Relightable 3D Generation¶

Conference: CVPR 2025
arXiv: 2411.10825
Code: https://arm-aigc.github.io
Area: 3D Vision / 3D Generation
Keywords: 3D Reconstruction, Appearance Decomposition, PBR Materials, UV Texture Space, Relighting

TL;DR¶

This paper proposes ARM, a framework that decouples geometry generation from appearance generation. It reconstructs high-quality textures in UV texture space using back-projection and networks with a global receptive field, and it introduces a material prior to resolve the ambiguity between material and illumination under sparse views. Trained on 8 H100 GPUs, it outperforms MeshFormer, InstantMesh, and SF3D on GSO and OmniObject3D.

Background & Motivation¶

Generating high-quality 3D models with realistic appearance from 2D images is a core task in computer vision and graphics. While existing methods have achieved significant progress in geometric reconstruction, their appearance quality remains insufficient. LRM-based methods use triplane representations, which are limited by resolution and the blurriness of MLP decoding, leaving reconstructed textures lacking in fine details. Additionally, most methods only output baked-in vertex colors without physical properties, failing to support relighting under dynamic illumination.

The Key Challenge is that spatial variations in the triplane do not directly correspond to texture variations on the object surface, and separating materials from illumination under sparse views is inherently an ill-posed inverse problem.

The Key Insight of this paper is to move appearance processing into the UV texture space, learning textures directly on the object surface to bypass the triplane resolution bottleneck, and to introduce material priors that help separate illumination from materials.

Method¶

ARM splits 3D reconstruction into a geometry stage and an appearance stage. GeoRM generates the geometry, while appearance is handled by two separate models: InstantAlbedo for diffuse albedo and GlossyRM for roughness and metalness.
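To make the three appearance outputs concrete, the sketch below shows one common way a renderer consumes albedo and metalness: the metallic-roughness convention used by glTF 2.0. The source does not state which shading model ARM assumes, so the function name `split_pbr` and the 0.04 dielectric reflectance are illustrative assumptions, not the authors' implementation. Roughness is not shown because it only controls the width of the specular lobe and is passed to the shader unchanged.

```python
from typing import List, Tuple

DIELECTRIC_F0 = 0.04  # assumed default specular reflectance for non-metals


def split_pbr(albedo: List[float], metalness: float) -> Tuple[List[float], List[float]]:
    """Split an RGB base color into diffuse color and specular F0.

    Follows the metallic-roughness convention: metals have no diffuse
    lobe and tint their specular reflection with the base color.
    """
    m = min(max(metalness, 0.0), 1.0)  # clamp to the valid range
    diffuse = [a * (1.0 - m) for a in albedo]
    f0 = [DIELECTRIC_F0 * (1.0 - m) + a * m for a in albedo]
    return diffuse, f0
```

Because the split depends on per-texel metalness, a method that predicts spatially varying metalness and roughness can respond differently to new lighting across a single object, which is what makes the output relightable.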

Overall Architecture¶

The input consists of sparse multi-view images (6 views) generated by a diffusion model. GeoRM's transformer-based triplane synthesizer predicts a density field, from which a differentiable Marching Cubes algorithm extracts the mesh. The mesh is then unwrapped into UV space. InstantAlbedo back-projects the multi-view images into the UV texture space, extracts view-wise features with a U-Net, fuses them via max-pooling, fills unobserved regions with FFCNet, and outputs the baked colors and the decomposed diffuse albedo. GlossyRM queries its triplane at the mesh vertices to predict roughness and metalness.
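The vertex query used by GlossyRM can be illustrated with a minimal triplane lookup. This is a sketch under stated assumptions: the plane ordering (XY, XZ, YZ), the concatenation of the three sampled features, and the names `bilinear` and `query_triplane` are illustrative choices not taken from the source, and the decoder that would map the feature to roughness and metalness is omitted.

```python
from typing import List, Sequence

Plane = List[List[List[float]]]  # feature grid indexed as [row][col][channel]


def bilinear(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly sample a grid at normalized coordinates u (cols), v (rows) in [0, 1]."""
    h, w = len(plane), len(plane[0])
    x = min(max(u, 0.0), 1.0) * (w - 1)
    y = min(max(v, 0.0), 1.0) * (h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    channels = len(plane[0][0])
    return [
        (1 - fx) * (1 - fy) * plane[y0][x0][c]
        + fx * (1 - fy) * plane[y0][x1][c]
        + (1 - fx) * fy * plane[y1][x0][c]
        + fx * fy * plane[y1][x1][c]
        for c in range(channels)
    ]


def query_triplane(planes: Sequence[Plane], point: Sequence[float]) -> List[float]:
    """Return the concatenated XY, XZ, YZ features for a point in [-1, 1]^3."""
    x, y, z = [(coord + 1.0) / 2.0 for coord in point]  # map to [0, 1]
    xy, xz, yz = planes
    return bilinear(xy, x, y) + bilinear(xz, x, z) + bilinear(yz, y, z)
```

Sampling only at mesh vertices keeps the query on the surface extracted by GeoRM, so no feature capacity is spent on empty space.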

Key Designs¶

Geometry-Appearance Decoupling (GeoRM + GlossyRM):
- Function: Separates the generation of geometry and appearance, processing each with a dedicated network.
- Mechanism: GeoRM focuses on density prediction (supervised by mask, depth, and normal) and freezes its weights after training. GlossyRM is conditioned on the mesh from GeoRM and queries its own triplane to predict per-vertex roughness \(\rho\) and metalness \(m\). Both share the LRM architecture but are trained independently.
- Design Motivation: Predicting all targets (density, color, and material) within a single LRM causes significant quality degradation, particularly because material parameters are harder to infer. Decoupling provides more capacity for each model and allows scaling the triplane resolution up to \(256 \times 256\).
UV Texture Space Appearance Decomposition (InstantAlbedo):
- Function: Reconstructs high-quality diffuse albedo textures in the UV space.
- Mechanism: Back-projects images from 6 views, auxiliary data (mask, position, texture coordinates, view direction, normal), and material encodings into the UV texture space to obtain 6 groups of UV-space input maps. A U-Net extracts view-wise features, which are fused via max-pooling. Then, FFCNet (with a global receptive field) is used to fill unobserved regions and refine the output, yielding the baked color and decomposed albedo.
- Design Motivation: Color variations in the triplane space do not directly correspond to surface texture variations, leading to blurry MLP decoding. Directly representing surface color variations in the UV space bypasses resolution and interpolation mismatch issues. The global receptive field of FFCNet is crucial for completing unobserved regions from only 6 views.
Material Prior:
- Function: Resolves the inherent ambiguity between materials and illumination under sparse views.
- Mechanism: Utilizes a DINO ViT-8x8 image encoder, pre-trained on a semantic material dataset, and integrates it into the back-projection pipeline of InstantAlbedo. It converts input images into material-aware feature maps, which are back-projected to the UV space alongside other auxiliary information to help the network distinguish between lighting effects and material properties.
- Design Motivation: Performing inverse rendering solely based on rendering loss is bound to fail under sparse views, as lighting effects will be baked into the albedo. The material prior provides semantic-level information on "what looks like a certain material," enabling proper decomposition even under strong illumination.
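The max-pooling fusion step in InstantAlbedo can be sketched as follows. This is a minimal illustration, assuming each view has already been back-projected into a UV-space feature map with a visibility mask; the masking rule, the zero fill for unobserved texels, and the name `fuse_views_max` are assumptions for illustration. The U-Net that produces the features and the FFCNet that later fills unobserved regions are not modeled.

```python
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # UV-space features indexed as [row][col][channel]
Mask = List[List[bool]]               # True where a view observes the texel


def fuse_views_max(
    features: List[FeatureMap], masks: List[Mask]
) -> Tuple[FeatureMap, Mask]:
    """Fuse per-view UV feature maps by channel-wise max over the visible views.

    Returns the fused map and a mask of texels seen by at least one view.
    Texels seen by no view keep a zero feature and are flagged as unobserved.
    """
    h, w, c = len(features[0]), len(features[0][0]), len(features[0][0][0])
    fused = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    observed = [[False] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            visible = [f[i][j] for f, m in zip(features, masks) if m[i][j]]
            if visible:
                observed[i][j] = True
                fused[i][j] = [max(v[k] for v in visible) for k in range(c)]
    return fused, observed
```

Max-pooling is order-invariant, so the fused map does not depend on how the 6 views are ordered. The returned `observed` mask marks the texels that a completion network with a global receptive field would need to fill.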

Loss & Training¶

GeoRM: \(\mathcal{L}_{geo} = \lambda_z |z^{gt} - \hat{z}| + \lambda_M \mathcal{L}_{mse}(M^{gt}, \hat{M}) + \lambda_n \mathcal{L}_{lpips}(\mathbf{n}^{gt}, \hat{\mathbf{n}})\), where \(z\) is depth, \(M\) is the mask, \(\mathbf{n}\) is the normal, and hats denote predictions.
GlossyRM: \(\mathcal{L}_{glossy} = \mathcal{L}_0(\rho^{gt}, \hat{\rho}) + \mathcal{L}_0(m^{gt}, \hat{m})\), where \(\mathcal{L}_0 = \lambda_1 \mathcal{L}_{mse} + \lambda_2 \mathcal{L}_{lpips} + \lambda_3 \mathcal{L}_{ssim}\)
InstantAlbedo: \(\mathcal{L}_{albedo} = \mathcal{L}_0(\mathbf{c}^{gt}, \hat{\mathbf{c}}) + \mathcal{L}_0(\mathbf{c}_d^{gt}, \hat{\mathbf{c}}_d)\), where \(\mathbf{c}\) is the baked color and \(\mathbf{c}_d\) is the diffuse albedo. This loss fits ground-truth materials directly instead of using a rendering loss.
Training takes about 5 days on 8 H100 GPUs: 2 days for GeoRM, 2 days for GlossyRM, and 1 day for InstantAlbedo (which can be trained in parallel with GlossyRM).
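The shared term \(\mathcal{L}_0\) can be sketched in plain Python to show how the three parts combine. Several simplifications are assumed here and are not from the source: SSIM is computed over the whole signal as a single window rather than with local windows, the SSIM loss is taken as 1 - SSIM, LPIPS is replaced by a caller-supplied `perceptual` function because it needs a pretrained network, and the unit weights stand in for the unspecified \(\lambda_1, \lambda_2, \lambda_3\).

```python
from typing import Callable, Optional, Sequence, Tuple


def mse(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean squared error between two flattened images."""
    return sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)


def global_ssim(pred: Sequence[float], gt: Sequence[float], data_range: float = 1.0) -> float:
    """Single-window SSIM over the whole signal (a simplification of windowed SSIM)."""
    n = len(pred)
    mu_p, mu_g = sum(pred) / n, sum(gt) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_g = sum((g - mu_g) ** 2 for g in gt) / n
    cov = sum((p - mu_p) * (g - mu_g) for p, g in zip(pred, gt)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2)
    )


def combined_loss(
    pred: Sequence[float],
    gt: Sequence[float],
    perceptual: Optional[Callable[[Sequence[float], Sequence[float]], float]] = None,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # placeholder lambdas
) -> float:
    """L_0 = l1 * MSE + l2 * perceptual + l3 * (1 - SSIM)."""
    l1, l2, l3 = weights
    perceptual_term = perceptual(pred, gt) if perceptual is not None else 0.0
    return l1 * mse(pred, gt) + l2 * perceptual_term + l3 * (1.0 - global_ssim(pred, gt))
```

Under this reading, \(\mathcal{L}_{glossy}\) would apply `combined_loss` once to the roughness maps and once to the metalness maps and add the results; \(\mathcal{L}_{albedo}\) would do the same for baked color and diffuse albedo.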

Key Experimental Results¶

Main Results¶

Dataset	Metric	ARM (Ours)	MeshFormer	InstantMesh	SF3D	Gain vs. best baseline
GSO (1030 shapes)	F-Score↑	0.968	0.966	0.938	0.888	+0.002
GSO (1030 shapes)	PSNR↑	21.692	20.500	19.744	18.540	+1.19 dB
GSO (1030 shapes)	LPIPS↓	0.137	0.141	0.146	0.175	-0.004
OmniObject3D (1038 shapes)	F-Score↑	0.936	0.927	0.877	0.857	+0.009
OmniObject3D (1038 shapes)	PSNR↑	20.874	19.402	19.193	18.529	+1.47 dB
Relighting Dataset	PSNR-A↑	21.750	-	-	18.592	+3.16 dB
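The margins in the table are differences between ARM and the strongest reported baseline for each metric. The short sketch below reproduces that arithmetic from the table values; the dictionary layout and the helper name `gain_over_best_baseline` are illustrative, and `None` marks results that the table does not report.

```python
from typing import Dict, Optional, Tuple

Scores = Dict[str, Optional[float]]

# (dataset, metric) -> (higher_is_better, scores copied from the table)
RESULTS: Dict[Tuple[str, str], Tuple[bool, Scores]] = {
    ("GSO", "F-Score"): (True, {"ARM": 0.968, "MeshFormer": 0.966, "InstantMesh": 0.938, "SF3D": 0.888}),
    ("GSO", "PSNR"): (True, {"ARM": 21.692, "MeshFormer": 20.500, "InstantMesh": 19.744, "SF3D": 18.540}),
    ("GSO", "LPIPS"): (False, {"ARM": 0.137, "MeshFormer": 0.141, "InstantMesh": 0.146, "SF3D": 0.175}),
    ("OmniObject3D", "F-Score"): (True, {"ARM": 0.936, "MeshFormer": 0.927, "InstantMesh": 0.877, "SF3D": 0.857}),
    ("OmniObject3D", "PSNR"): (True, {"ARM": 20.874, "MeshFormer": 19.402, "InstantMesh": 19.193, "SF3D": 18.529}),
    ("Relighting", "PSNR-A"): (True, {"ARM": 21.750, "MeshFormer": None, "InstantMesh": None, "SF3D": 18.592}),
}


def gain_over_best_baseline(scores: Scores, higher_is_better: bool, ours: str = "ARM") -> float:
    """Return ours minus the best reported baseline (negative is better for lower-is-better metrics)."""
    baselines = [v for name, v in scores.items() if name != ours and v is not None]
    best = max(baselines) if higher_is_better else min(baselines)
    return scores[ours] - best


GAINS = {
    key: round(gain_over_best_baseline(scores, higher_is_better), 3)
    for key, (higher_is_better, scores) in RESULTS.items()
}
```

Read this way, the geometry margin over MeshFormer is small (F-Score differences in the third decimal place), while the texture and relighting margins are between roughly 1.2 and 3.2 dB.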

Ablation Study¶

Configuration	PSNR-A↑	LPIPS-A↓	PSNR-D↑	LPIPS-D↓	Description
Full Method	25.074	0.096	24.116	0.098	Baseline
W/o back-projection measurement	24.780	0.104	23.398	0.114	Direct image information is important
W/o material prior	24.471	0.108	22.687	0.121	Albedo decomposition quality significantly degrades
W/o FFCNet	24.612	0.110	23.360	0.123	Inpainting capability for unobserved regions degrades
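To compare the ablations directly, the drop of each variant relative to the full method can be computed from the two PSNR columns. The sketch below uses the values from the table; the structure and the names `psnr_drops` and `WORST_FOR_PSNR_D` are illustrative.

```python
from typing import Dict, Tuple

# configuration -> (PSNR-A, PSNR-D), copied from the ablation table
ABLATION: Dict[str, Tuple[float, float]] = {
    "Full Method": (25.074, 24.116),
    "W/o back-projection measurement": (24.780, 23.398),
    "W/o material prior": (24.471, 22.687),
    "W/o FFCNet": (24.612, 23.360),
}


def psnr_drops(
    table: Dict[str, Tuple[float, float]], reference: str = "Full Method"
) -> Dict[str, Tuple[float, float]]:
    """Return (PSNR-A drop, PSNR-D drop) in dB for every non-reference configuration."""
    ref_a, ref_d = table[reference]
    return {
        name: (round(ref_a - a, 3), round(ref_d - d, 3))
        for name, (a, d) in table.items()
        if name != reference
    }


DROPS = psnr_drops(ABLATION)
# Configuration whose removal hurts PSNR-D the most
WORST_FOR_PSNR_D = max(DROPS, key=lambda name: DROPS[name][1])
```

On these numbers, removing the material prior costs about 1.43 dB of PSNR-D, roughly twice the drop from removing either back-projection (0.72 dB) or FFCNet (0.76 dB).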

Key Findings¶

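ARM leads on every reported metric. The geometry margin over MeshFormer is small (F-Score +0.002 on GSO and +0.009 on OmniObject3D), while texture PSNR improves by about 1.2 dB on GSO, 1.5 dB on OmniObject3D, and 3.2 dB on the relighting dataset (PSNR-A, versus SF3D).
Removing the material prior causes the largest drop in albedo quality (PSNR-D falls by 1.43 dB, from 24.116 to 22.687), consistent with the argument that image evidence alone is not enough to separate lighting from material under sparse views.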
FFCNet is critical for completing unobserved regions; replacing it with a local-receptive-field U-Net introduces artifacts.
SF3D generates constant roughness/metalness, whereas ARM reconstructs spatially varying material properties.

Highlights & Insights¶

The UV space is well suited to appearance modeling: operating directly on the surface avoids the indirect mapping issues of triplanes.
The design of the material prior is clever: instead of directly predicting materials, it provides semantic information on "what looks like metal/wood" to aid decomposition.
The strategy of decoupling geometry and appearance is simple yet effective, allowing each sub-model to focus on a smaller task.
Training needs only 8 H100 GPUs for about 5 days, a modest resource demand compared to many 3D generation methods.

Limitations & Future Work¶

Inconsistent viewpoints generated by the upstream multi-view diffusion model can lead to texture artifacts.
The UV unwrapping process is time-consuming and cannot be performed during online training (requires preprocessing the dataset).
Open-world lighting conditions are not considered (all training data uses specific environmental lighting).
Exploring weighting input views based on user preferences could be a way to resolve conflicting viewpoints.

Comparison with SF3D: This is the most direct comparison, since both methods perform PBR decomposition. ARM's UV-space operation combined with the material prior yields clearly better decomposition (PSNR-A of 21.750 versus 18.592 on the relighting dataset).
Relationship with MeshFormer: Geometric quality is similar, but ARM has a significant advantage in texture.
Insights: The approach of UV-space operations combined with staged training can be extended to other tasks requiring high-quality appearance.

Supplementary Analysis¶

Training Data and Generalization¶

GeoRM and GlossyRM are trained on a 150K subset of Objaverse, while InstantAlbedo uses 55K generated shapes from it.
Evaluation is performed on GSO, OmniObject3D, and a custom relighting dataset, where all objects are unseen during training.
For each evaluated object, 144 images are generated (24 views \(\times\) 6 environmental lights), so relighting is evaluated across both viewpoint and illumination changes.
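The evaluation grid and the PSNR metric can be sketched as follows. PSNR uses its standard definition; the enumeration order, the averaging over all view and light pairs, and the names `evaluation_pairs` and `mean_psnr` are assumptions for illustration, since the source does not describe how per-image scores are aggregated.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple


def psnr(pred: Sequence[float], gt: Sequence[float], max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for two flattened images."""
    mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def evaluation_pairs(num_views: int = 24, num_lights: int = 6) -> List[Tuple[int, int]]:
    """Enumerate every (view index, light index) combination for one object."""
    return list(product(range(num_views), range(num_lights)))


def mean_psnr(preds: Sequence[Sequence[float]], gts: Sequence[Sequence[float]]) -> float:
    """Average PSNR over a set of rendered and reference images (assumed aggregation)."""
    scores = [psnr(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)
```

With the default arguments, `evaluation_pairs()` yields the 144 combinations per object described above.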

Limitations and Advantages of UV-Space Operations¶

Advantages: Pixels directly correspond to surface colors, avoiding the indirect mapping and resolution bottlenecks of triplanes.
Advantages: U-Net and FFCNet can operate directly on the 2D texture map, leveraging mature 2D network architectures.
Limitations: UV unwrapping itself is a non-trivial operation, and the unwrapping quality varies for meshes with different topologies.

Rating¶

Novelty: ⭐⭐⭐⭐ The combination of UV-space appearance decomposition and material priors is innovative, though individual component ideas are relatively intuitive.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Highly comprehensive across three datasets, multiple metrics, ablation studies, and qualitative comparisons.
Writing Quality: ⭐⭐⭐⭐ Clear pipeline, beautiful illustrations, and well-articulated motivation.
Value: ⭐⭐⭐⭐⭐ Relightable 3D generation is in high demand in practical applications (e.g., games, metaverse), and this method's quality is clearly ahead of the compared baselines.