WildCap: Facial Albedo Capture in the Wild via Hybrid Inverse Rendering¶
Conference: CVPR 2026 · arXiv: 2512.11237 · Code: Released (as declared in the paper) · Area: Human Understanding · Keywords: facial albedo capture, inverse rendering, diffusion prior, texel grid lighting, in-the-wild
TL;DR¶
This paper proposes WildCap, a hybrid inverse rendering framework that reconstructs high-quality 4K facial diffuse albedo maps from casual in-the-wild smartphone videos. The approach combines data-driven relighting (SwitchLight), model-based texel grid lighting optimization, and diffusion prior sampling, substantially closing the quality gap between in-the-wild capture and controlled-illumination methods.
Background & Motivation¶
- Facial albedo capture is central to digital human creation: Cloning a real person into the digital world requires high-quality facial reflectance maps, a problem studied for over two decades.
- High-quality methods rely on controlled illumination: From Light Stage equipment to smartphone flash lighting, existing approaches assume known scene illumination, increasing capture cost and limiting accessibility.
- Model-based inverse rendering is unstable under complex illumination: Jointly optimizing illumination and reflectance to match observed images becomes highly ill-posed and unstable in the presence of complex light transport effects such as shadows.
- Data-driven methods are robust but suffer from baking artifacts: Networks such as SwitchLight can directly predict reflectance components but inevitably bake illumination effects (e.g., shadows) into their predictions.
- The two paradigms are complementary: Model-based methods yield physically plausible decompositions but lack robustness; data-driven methods are robust but imperfect. Combining them is a natural direction.
- In-the-wild capture has significant practical value: Enabling high-quality facial capture from casually recorded smartphone videos would substantially lower the barrier to digital human production.
Method¶
Overall Architecture: Hybrid Inverse Rendering¶
The pipeline consists of three stages:

1. Data preprocessing: Uniformly sample 300 frames (960×720) from a smartphone capture session, calibrate camera parameters via COLMAP, reconstruct a fine mesh with 2DGS, register the ICT template using Wrap3D, and select \(V=16\) frames for reflectance estimation.
2. Data-driven relighting: Apply SwitchLight to predict per-frame diffuse albedo images \(\{I^i\}\), converting complex in-the-wild illumination into a more constrained condition.
3. Model-based optimization: Interpret SwitchLight's baking artifacts as illumination effects in UV space, jointly optimizing a texel grid lighting model and sampling a diffusion prior to obtain a clean albedo map \(A\).
Key Design 1: Texel Grid Lighting Model¶
SwitchLight predictions are not produced by physical light sources; conventional spherical harmonics (SH) environment lighting cannot explain the non-physical shadow baking artifacts in these predictions.
- Design motivation: Assign local SH illumination to facial regions exhibiting baking artifacts, so that artifacts can be interpreted as "clean albedo under dark local lighting."
- Specific structure:
- Global SH illumination \(\gamma^g \in \mathbb{R}^{N_c}\) models the base illumination over the entire face.
- A 2D UV-space grid \(V \in \mathbb{R}^{\frac{H}{g} \times \frac{W}{g} \times N_c}\) stores local SH parameters.
- Modulated via a binary mask \(M\): \(\gamma = \gamma^g + \gamma^V \cdot M[u][v]\).
- Grid cell size \(g=96\); second-order SH with \(N_c=27\) coefficients (9 per RGB channel), queried via bilinear interpolation.
- Mask acquisition: Supports both manual annotation (Photoshop polygon lasso) and automatic detection (DiFaReli shadow detection lifted into UV space).
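The texel grid lighting query above can be sketched numerically. This is an illustrative NumPy sketch, not the authors' code: the 1K UV resolution is assumed from the paper's 1K-then-4K pipeline, and all names (`query_sh`, `gamma_global`, etc.) are hypothetical.

```python
import numpy as np

H = W = 1024    # UV map resolution (assumed)
g = 96          # grid cell size in texels
N_c = 27        # second-order SH: 9 coefficients per RGB channel

gamma_global = np.zeros(N_c)                 # global SH illumination gamma^g
gh, gw = H // g + 2, W // g + 2              # grid resolution (+2 so bilinear lookup never overflows)
grid = np.zeros((gh, gw, N_c))               # local SH grid V
mask = np.zeros((H, W))                      # binary artifact mask M

def query_sh(u, v):
    """Bilinearly interpolate the local SH grid at texel (u, v), modulate it
    by the binary mask, and add the global term: gamma = gamma^g + gamma^V * M[u][v]."""
    gu, gv = u / g, v / g
    i0, j0 = int(gu), int(gv)
    fu, fv = gu - i0, gv - j0
    local = ((1 - fu) * (1 - fv) * grid[i0, j0]
             + fu * (1 - fv) * grid[i0 + 1, j0]
             + (1 - fu) * fv * grid[i0, j0 + 1]
             + fu * fv * grid[i0 + 1, j0 + 1])
    return gamma_global + local * mask[u, v]
```

Outside the mask the query reduces to the global SH, so the extra expressiveness is confined to the regions flagged as baking artifacts.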
Key Design 2: Diffusion Prior and Posterior Sampling Optimization¶
Increasing the expressiveness of the lighting model exacerbates ill-posedness (scale ambiguity between illumination and albedo), necessitating prior constraints:
- Patch-level diffusion prior training: A patch-level diffusion model at 64×64 resolution is trained on 48 Light Stage scans, modeling a 7-channel signal (3ch diffuse albedo + 3ch normal + 1ch specular albedo).
- Initialization strategy: The scan with the closest skin tone \(x_0^{ref}\) is selected from the training set; \(T_{init}=0.6T\) steps of noise are added before sampling begins (rather than starting from pure noise), reducing the number of sampling steps required.
- Joint optimization: At each diffusion timestep, both the reflectance map \(x_t\) (diffusion denoising + photometric gradient guidance) and the lighting parameters \(\theta_t\) (gradient descent + regularization) are updated simultaneously.
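The joint update loop can be illustrated with a deliberately tiny numerical sketch. Everything here is a toy stand-in for the paper's components: the "denoiser" simply pulls the sample toward a clean reference, the renderer is a scalar shading multiply, and all constants are invented, but the structure matches the description above (noised-reference initialization at \(T_{init}=0.6T\), then per-step denoising, photometric gradient guidance on \(x_t\), and gradient descent on the lighting \(\theta_t\)).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
T_init = int(0.6 * T)                        # start sampling at 0.6T, not from pure noise
alpha_bar = np.linspace(1.0, 1e-3, T + 1)    # toy noise schedule

x0_ref = np.full(16, 0.6)                    # reference patch (closest skin tone)
I_uv = 0.2 * np.ones(16)                     # "observed" UV image = albedo * shading

# Initialization: diffuse the reference forward to timestep T_init
x = (np.sqrt(alpha_bar[T_init]) * x0_ref
     + np.sqrt(1.0 - alpha_bar[T_init]) * rng.standard_normal(16))
theta = np.array([1.0])                      # toy scalar "lighting" parameter

loss_init = np.abs(theta[0] * x - I_uv).mean()

for t in range(T_init, 0, -1):
    # (1) toy "denoising" step: pull x toward the clean manifold (here just x0_ref)
    x += 0.1 * (x0_ref - x)
    # (2) photometric gradient guidance on x: gradient of ||theta*x - I_uv||^2
    residual = theta[0] * x - I_uv
    x -= 0.1 * theta[0] * residual
    # (3) lighting update: gradient descent on the same photometric loss
    theta -= 0.1 * np.mean(x * residual)

loss_final = np.abs(theta[0] * x - I_uv).mean()
```

The point of the sketch is the interplay: the prior keeps \(x\) on the albedo manifold while the photometric term is free to push the scale ambiguity into \(\theta\), which is exactly why the joint problem becomes tractable.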
Loss & Training¶
- Photometric loss: \(\mathcal{L}_{pho} = \|I_{UV} - \Gamma_\theta(A, N)\|_2^2\), where \(\Gamma_\theta\) shades the albedo map \(A\) and UV-space normal map \(N\) under the texel grid lighting \(\theta\), and \(I_{UV}\) is the SwitchLight prediction lifted into UV space.
- Lighting regularization: \(\mathcal{L}_{reg} = 0.1 \cdot \mathcal{L}_{TV} + \mathcal{L}_{neg}\)
- TV regularization enforces spatial smoothness of the lighting field.
- Negative shading regularization \(\mathcal{L}_{neg}\) ensures local lighting produces dark shading (to explain shadow baking).
- Texture map construction: Minimizes LPIPS + gradient-space L1 loss.
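The two lighting regularizers can be sketched as follows. The TV term is standard; the exact form of \(\mathcal{L}_{neg}\) is not spelled out above, so the hinge penalty below (punishing any positive local shading so the local lighting can only darken) is one plausible reading, not the paper's definition.

```python
import numpy as np

def tv_loss(grid):
    """Total variation over a 2D grid of SH coefficients (H_g, W_g, N_c):
    sum of absolute differences between neighbouring cells."""
    dh = np.abs(np.diff(grid, axis=0)).sum()
    dw = np.abs(np.diff(grid, axis=1)).sum()
    return dh + dw

def neg_shading_loss(local_shading):
    """Hinge penalty max(s, 0)^2 on the local shading contribution, so the
    local lighting is only allowed to darken (explain shadow baking)."""
    return np.square(np.maximum(local_shading, 0.0)).mean()

grid = np.zeros((8, 8, 27))                  # smooth (constant) lighting field
shading = -0.1 * np.ones((64, 64))           # all-dark local shading
reg = 0.1 * tv_loss(grid) + neg_shading_loss(shading)   # L_reg = 0.1*L_TV + L_neg
```

A constant lighting field with purely darkening local shading incurs zero regularization cost, which is the configuration the optimization is steered toward.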
Post-processing: 4K Super-Resolution¶
An RCAN super-resolution network upsamples the 1K reflectance map to 4K. Compared to DoRA, which requires 508 minutes to directly sample a 4K map, WildCap requires only 8 minutes (24 GB RTX 4090).
Key Experimental Results¶
Main Results (Facial Relighting, Average over 6 Subjects)¶
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| DeFace* | 22.20 | 0.9279 | 0.1192 |
| FLARE* | 27.81 | 0.9411 | 0.0929 |
| WildCap (Ours) | 28.79 | 0.9520 | 0.0610 |
Main Results (Synthetic Data — Digital Emily, Albedo Reconstruction)¶
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| DeFace* | 28.43 | 0.9791 | 0.0826 |
| FLARE* | 22.48 | 0.9742 | 0.0571 |
| WildCap (Ours) | 28.71 | 0.9802 | 0.0388 |
Ablation Study¶
- w/o Hybrid (optimizing directly on raw images): Fails to effectively separate specular highlights and shadows under complex illumination.
- w/o TGL (global SH only): Cannot explain non-physical baking artifacts; visible shadow residuals remain.
- w/o Prior (no diffusion prior; direct Adam optimization per texel): Produces severe artifacts and fails to converge to a plausible reflectance map.
- Grid size ablation: \(g=24\) is too fine-grained, letting the overly expressive lighting field absorb albedo detail; \(g=384\) is overly smooth and cannot capture localized artifacts; \(g=96\) achieves the best balance.
Highlights & Insights¶
- Elegant hybrid inverse rendering framework: The approach organically combines the robustness of data-driven methods with the physical plausibility of model-based methods in a concise and well-motivated design.
- Novel and effective Texel Grid Lighting Model: Transcends the limitations of physical illumination models by using a non-physical yet more expressive local SH grid to account for baking artifacts in network predictions.
- Diffusion prior elegantly resolves scale ambiguity: Sampling albedo from a reasonable distribution while jointly optimizing illumination converts an ill-posed problem into a well-posed one.
- High efficiency: Requires only 8 minutes versus 508 minutes for DoRA, while achieving comparable quality to controlled-illumination methods.
- Thorough experimentation: Includes comprehensive ablations, quantitative evaluation on synthetic data, cross-setting comparison with DoRA, diverse scene demonstrations, and failure case analysis.
Limitations & Future Work¶
- Dependency on SwitchLight preprocessing: SwitchLight is a closed-source commercial model accessible only via API, limiting the reproducibility and extensibility of the method.
- Automatic shadow detection relies on DiFaReli: Iterative diffusion sampling is slow and may miss ambient occlusion effects.
- Continuity constraints on lighting representation: When SwitchLight predictions contain sharp shadow boundaries (e.g., under harsh noon sunlight), the continuous grid representation cannot fully remove them.
- Limited training data scale: The diffusion prior is trained on only 48 Light Stage scans, with limited racial and skin-tone diversity (33 Caucasian / 9 African / 6 Asian subjects).
- Requires a target skin tone reference: Although obtainable manually or automatically, this introduces an additional step in the pipeline.
Related Work & Insights¶
- vs. DeFace: DeFace partitions the face into a limited number of regions (5–10), each corresponding to a trainable network, offering limited expressiveness; WildCap's texel grid provides finer granularity.
- vs. FLARE: FLARE employs a split-sum approximation to model illumination; the physical model cannot account for non-physical baking artifacts.
- vs. DoRA (controlled-illumination method): WildCap achieves quality comparable to DoRA under the more challenging in-the-wild setting, better preserves personal features (e.g., moles), and is approximately 63× faster.
- vs. Rainer et al.: Using a small MLP to model shading is difficult to optimize within a diffusion posterior sampling framework; WildCap's grid representation is more amenable to optimization.
- vs. test scenarios in Xu et al. / Rainer et al.: Prior methods are evaluated only on mildly shadowed scenes; WildCap addresses the more challenging case of strong cast shadows.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The hybrid inverse rendering framework and the texel grid lighting model are conceptually novel; the joint diffusion prior optimization is technically sophisticated.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive ablations, thorough quantitative and qualitative comparisons, synthetic data evaluation, and failure case analysis.
- Writing Quality: ⭐⭐⭐⭐ — Overall clear presentation with well-motivated problem setup and detailed supplementary material.
- Value: ⭐⭐⭐⭐ — Substantially lowers the barrier to facial appearance capture with practical implications for digital human production.