# Monocular Facial Appearance Capture in the Wild
- Conference: ICCV 2025
- arXiv: 2412.12765
- Code: N/A
- Area: Human Understanding
- Keywords: facial appearance capture, inverse rendering, occlusion-aware, monocular video, split-sum approximation
## TL;DR
This paper proposes a method for reconstructing facial appearance attributes (diffuse albedo, specular intensity, specular roughness) from monocular head-rotation videos. By introducing an occlusion-aware split-sum approximation shading model, the method achieves studio-grade facial appearance capture quality without imposing any simplifying assumptions on the illumination environment.
## Background & Motivation
High-quality 3D facial scans are critical for film, gaming, telepresence, and related applications. Traditional approaches rely on multi-camera, controlled-lighting studio setups, which yield accurate appearance maps (diffuse albedo, specular intensity, specular roughness) but at prohibitive cost. While lightweight facial reconstruction has advanced in recent years, existing in-the-wild methods are subject to the following limitations:
- SunStage assumes a single point light source (the sun) in the scene, restricting applicable scenarios.
- CoRA requires a smartphone flash in a dark room.
- FLARE employs the standard split-sum approximation, ignoring self-occlusion and causing illumination to be baked into the albedo.
- NextFace relies on statistical priors, limiting its expressive capacity.
The core problem is that existing methods either assume specific lighting conditions or disregard facial self-occlusion effects (e.g., the shadow cast by the nose onto the cheek), leading to inaccurate appearance decomposition.
## Method

### Overall Architecture
The input is a monocular head-rotation video. In the preprocessing stage, keypoint-based monocular tracking is used to obtain an initial 3DMM mesh, a fixed camera pose, and per-frame head poses. An inverse rendering optimization via differentiable rendering then jointly solves for geometry, appearance parameters (diffuse albedo \(\rho\), specular intensity, specular roughness), and environmental illumination.
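To make the single-stage structure concrete, here is a toy, runnable PyTorch sketch of one optimizer jointly updating geometry, appearance, and lighting. This is not the authors' code: the `shaded` line is a trivial stand-in for the differentiable rasterizer and the occlusion-aware shading model, and all tensor sizes are illustrative.

```python
import torch

# Toy sketch (not the authors' code): one optimizer jointly updates
# geometry, appearance, and lighting.  The "shaded" line is a trivial
# stand-in for the differentiable rasterizer + shading model.
verts = torch.randn(5023, 3, requires_grad=True)         # vertices (3DMM-sized)
albedo = torch.full((5023, 3), 0.5, requires_grad=True)  # diffuse albedo rho
spec = torch.zeros(5023, 2, requires_grad=True)          # specular intensity, roughness
env = torch.zeros(16, 32, 3, requires_grad=True)         # low-res environment map

opt = torch.optim.Adam([verts, albedo, spec, env], lr=1e-2)
target = torch.rand(5023, 3)                             # stand-in for observed colors

for _ in range(200):
    ambient = env.exp().mean(dim=(0, 1))                 # fake prefiltered lighting
    shaded = albedo * ambient + spec[:, :1] * verts.tanh().mean(1, keepdim=True)
    loss = (shaded - target).abs().mean()                # L1 image loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```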
### Key Designs
- Geometry Optimization (Laplacian Preconditioning): Vertex positions are optimized directly rather than through 3DMM blend weights. Following a framework similar to Nicolet et al., gradient steps are biased toward smooth solutions by preconditioning with the inverse matrix \((I + \lambda_{geo} L)^{-2}\): \(v \leftarrow v - \eta(I + \lambda_{geo} L)^{-2} \frac{\partial \mathcal{L}}{\partial v}\). With \(\lambda_{geo}=19\), large learning rates can be employed while maintaining a smooth, self-intersection-free mesh. Design Motivation: This enables geometry and texture to be optimized simultaneously, eliminating the need for conventional two-stage pipelines (see the first sketch after this list).
- Occlusion-Aware Shading Model (Visibility-Modulated Split-Sum): The standard split-sum approximation factorizes the rendering equation into a BRDF integral and a pre-filtered environment-map term, but neglects self-occlusion. This work introduces a visibility term \(V(\mathbf{x}, \omega_i)\), incorporating per-point ray visibility into the pre-filtered lighting integral. For the specular (low-roughness) component, visibility is softened via the Monte Carlo estimate \(\tilde{V}(\mathbf{x}, \omega_r) \approx \frac{1}{K}\sum_{k=1}^{K} \frac{V(\mathbf{x}, \omega_k)}{D(\mathbf{n}, \omega_k, \omega_r, r)}\) (see the second sketch after this list). For the diffuse component, OptiX ray tracing with multiple importance sampling is employed. Design Motivation: Correctly modeling self-occlusion is essential to prevent shadows from being baked into the albedo.
- Diffuse Regularization: A weak regularization term \(\mathcal{L}_{diffuse} = \|I_{diffuse}\|_2^2\) encourages the diffuse rendering to be as small as possible. Design Motivation: This prevents specular signals from being excessively baked into the diffuse component, enabling better diffuse/specular separation.
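The preconditioned vertex step amounts to two sparse linear solves per iteration. Below is a minimal NumPy/SciPy sketch under the assumption of a uniform graph Laplacian built from an edge list; the paper follows Nicolet et al., whose exact Laplacian construction and solver may differ.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def laplacian_preconditioned_step(verts, grad, edges, lam=19.0, lr=1e-2):
    """One smoothness-biased update: v <- v - lr * (I + lam*L)^{-2} grad.

    verts, grad: (n, 3) float64 arrays; edges: (e, 2) int array.
    Uses a uniform graph Laplacian L = D - A for illustration.
    """
    n = verts.shape[0]
    i, j = edges[:, 0], edges[:, 1]
    rows, cols = np.concatenate([i, j]), np.concatenate([j, i])
    A = sp.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n)).tocsr()
    L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A
    lu = splu((sp.identity(n) + lam * L).tocsc())
    # Applying (I + lam*L)^{-2} == solving the same symmetric system twice.
    return verts - lr * lu.solve(lu.solve(grad))
```

For the softened specular visibility, the sketch below (reusing the NumPy import above) shows one way to realize the idea: trace K shadow rays over the hemisphere and weight each binary hit by the GGX lobe around the reflected direction \(\omega_r\). The paper's exact estimator divides by the NDF under its particular sampling scheme, so treat this weighting, and the hypothetical `trace_visibility` callback, as illustrative assumptions.

```python
def ggx_ndf(cos_theta, roughness):
    """GGX normal distribution, with the common alpha = roughness^2 remapping."""
    a2 = roughness ** 4
    denom = cos_theta ** 2 * (a2 - 1.0) + 1.0
    return a2 / (np.pi * denom ** 2 + 1e-8)

def softened_visibility(x, n, w_r, roughness, trace_visibility, K=32, rng=None):
    """NDF-weighted Monte Carlo estimate of specular visibility at point x."""
    rng = np.random.default_rng() if rng is None else rng
    d = rng.normal(size=(K, 3))                       # random directions ...
    d /= np.linalg.norm(d, axis=1, keepdims=True)     # ... on the unit sphere
    d[(d @ n) < 0.0] *= -1.0                          # fold into upper hemisphere
    w = ggx_ndf(np.clip(d @ w_r, 0.0, 1.0), roughness)   # lobe weight around w_r
    v = np.array([trace_visibility(x, di) for di in d])  # binary shadow rays
    return float((w * v).sum() / (w.sum() + 1e-8))
```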
### Loss & Training
The total loss is: \[\mathcal{L} = \mathcal{L}_{img} + \lambda_{mask}\mathcal{L}_{mask} + \lambda_{Lap}\mathcal{L}_{Lap} + \lambda_{light}\mathcal{L}_{light} + \lambda_{rough}\mathcal{L}_{rough} + \lambda_{diffuse}\mathcal{L}_{diffuse}\]
- \(\mathcal{L}_{img}\): L1 image reconstruction loss
- \(\mathcal{L}_{mask}\): L1 mask loss (using MODNet segmentation)
- \(\mathcal{L}_{Lap}\): Laplacian regularization to keep the optimized mesh close to the initial 3DMM
- \(\mathcal{L}_{light}\): white light regularization
- \(\mathcal{L}_{rough}\): total variation regularization on the roughness texture
- Geometry and texture are optimized simultaneously (no two-stage pipeline)
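A hedged sketch of how these terms could be assembled in PyTorch follows; the weights, the white-light penalty (pulling each environment texel toward its channel mean), and the exact form of the Laplacian term are illustrative assumptions rather than the paper's choices.

```python
import torch

def total_loss(render, image, mask_pred, mask_gt, verts, verts_3dmm,
               laplacian, env, roughness_tex, diffuse_render, lam):
    """lam: dict of illustrative weights, e.g. dict(mask=1.0, lap=0.1,
    light=0.01, rough=0.01, diffuse=1e-3).  laplacian: dense (V, V) tensor."""
    l_img = (render - image).abs().mean()                     # L1 reconstruction
    l_mask = (mask_pred - mask_gt).abs().mean()               # L1 on MODNet mask
    l_lap = (laplacian @ (verts - verts_3dmm)).pow(2).mean()  # stay near init 3DMM
    l_light = (env - env.mean(-1, keepdim=True)).pow(2).mean()  # white-light reg
    l_rough = ((roughness_tex[:, 1:] - roughness_tex[:, :-1]).abs().mean()
               + (roughness_tex[1:, :] - roughness_tex[:-1, :]).abs().mean())  # TV
    l_diffuse = diffuse_render.pow(2).mean()                  # keep diffuse small
    return (l_img + lam['mask'] * l_mask + lam['lap'] * l_lap
            + lam['light'] * l_light + lam['rough'] * l_rough
            + lam['diffuse'] * l_diffuse)
```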
## Key Experimental Results

### Main Results (Reconstruction Error, Computed on Skin Regions)
| Method | PSNR ↑ | MAE ↓ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|
| NextFace | 25.30 | 10.63 | 0.78 | 0.31 |
| SunStage | 29.47 | 5.28 | 0.88 | 0.14 |
| FLARE | 30.40 | 2.01 | 0.94 | 0.15 |
| Ours (w/o vis) | 34.55 | 1.79 | 0.96 | 0.10 |
| Ours | 38.09 | 1.18 | 0.97 | 0.10 |
The proposed method outperforms all baselines across every metric, achieving a PSNR gain of 7.69 dB over FLARE and reducing MAE by 41%.
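For reference, a small sketch of how skin-region metrics of this kind are typically computed, assuming images in [0, 1], a boolean skin mask, and MAE reported in 8-bit units (consistent with the table's magnitudes):

```python
import numpy as np

def masked_psnr_mae(pred, gt, skin_mask):
    """PSNR (dB) and MAE over skin pixels only; pred/gt are (H, W, 3) in [0, 1]."""
    p, g = pred[skin_mask], gt[skin_mask]   # (N, 3) skin pixels
    mse = np.mean((p - g) ** 2)
    psnr = 10.0 * np.log10(1.0 / mse)
    mae = np.mean(np.abs(p - g)) * 255.0    # report in 8-bit units
    return psnr, mae
```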
### Ablation Study (Ray Visibility & Component Analysis)
| Configuration | Effect |
|---|---|
| w/o visibility (standard split-sum) | Shadows baked into albedo; incorrect nostril-region shadows under relighting; PSNR 34.55 vs. 38.09 |
| w/o \(\mathcal{L}_{diffuse}\) regularization | Specular signals excessively baked into the diffuse component |
| Optimizing 3DMM blend weights vs. direct vertex optimization | Direct vertex optimization recovers correct nose geometry from shading with lower reconstruction error |
| Ray tracing vs. modified split-sum (specular) | Ray tracing causes flickering and high-frequency artifacts in inverse rendering; modified split-sum is more stable |
Evaluation on a synthetic dataset further confirms the substantial advantage of occlusion-aware modeling: diffuse albedo error is markedly reduced, and shape recovery in self-occluded regions (e.g., lips) is more accurate.
## Key Findings
- The method operates across diverse indoor and outdoor scenes without any assumptions about illumination (no sun or flash required).
- Correct visibility modeling is the key to high-quality diffuse/specular separation.
- Direct vertex position optimization with preconditioning is more flexible than 3DMM parameterization.
- The modified split-sum for the specular component is more stable than direct ray tracing.
## Highlights & Insights
- The technical contribution is solid: incorporating the visibility term into the split-sum approximation is an elegant and efficient solution, and employing distinct strategies for specular and diffuse components (approximate visibility vs. ray tracing) reflects sound engineering judgment.
- The end-to-end framework requires no two-stage training; the ability to simultaneously optimize geometry and texture stems from the Laplacian preconditioning technique.
- Results are convincing: qualitative comparisons demonstrate clearly superior diffuse/specular separation over FLARE, with relighting quality approaching studio-grade.
- The method has high practical value: a simple head-rotation video suffices to produce facial assets compatible with VFX pipelines.
## Limitations & Future Work
- The method depends on the accuracy of head pose estimation; imprecise poses severely degrade reconstruction quality.
- Appearance cannot be recovered when the face remains in extreme shadow across all frames.
- The current facial template does not include an eye model.
- Correct skin color recovery is not guaranteed due to the inherent illumination–appearance ambiguity.
- The assumption of static expression limits applicability to videos with speech or rich facial motion.
## Related Work & Insights
- The core distinction from FLARE lies in visibility modeling: FLARE's standard split-sum ignores self-occlusion and incurs substantial baking artifacts, whereas this work addresses the problem at its source through a modified formulation.
- The distinction from SunStage lies in the lighting assumption: SunStage assumes a known solar position as a single point light source.
- Munkberg et al.'s differentiable split-sum serves as the foundation for the shading model proposed here.
- This work offers important insights for the facial scanning field: the gap between lightweight capture and studio-grade quality is steadily narrowing.
## Rating
- Novelty: ⭐⭐⭐⭐ The occlusion-aware split-sum approximation is the central innovation.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive validation on both synthetic and real data.
- Writing Quality: ⭐⭐⭐⭐⭐ Mathematical derivations are clear and figures are of high quality.
- Value: ⭐⭐⭐⭐ Substantially narrows the gap between lightweight capture and studio-grade quality.