# BecomingLit: Relightable Gaussian Avatars with Hybrid Neural Shading
Conference: NeurIPS 2025 · arXiv: 2506.06271 · Code: jonathsch.github.io/becominglit · Area: 3D Vision · Keywords: relightable avatar, 3D Gaussian Splatting, hybrid neural shading, light stage, BRDF, face reconstruction
## TL;DR
This paper proposes BecomingLit, a method that reconstructs high-fidelity, relightable, and real-time renderable head avatars from low-cost light stage multi-view sequences using 3D Gaussian primitives and hybrid neural shading (neural diffuse BRDF + analytic Cook-Torrance specular). A new publicly available OLAT facial dataset is also released.
## Background & Motivation
- Strong industrial demand: The need for relightable, photorealistic head avatars in VR/metaverse applications is growing rapidly, yet most existing methods bake the training illumination into the appearance, making relighting impossible.
- High cost of traditional light stages: Existing methods (e.g., RGCA, Deep Appearance Models) rely on room-scale setups with hundreds of lights and cameras, within reach of only a few institutions.
- Scarcity of public datasets: Controlled multi-view datasets for facial appearance modeling are extremely limited (Goliath covers only 4 subjects at relatively low resolution), hindering broad academic research.
- Analytic BRDFs are insufficient for skin: Skin exhibits significant subsurface scattering and other global illumination effects that purely analytic models (e.g., Lambertian + Cook-Torrance) cannot reproduce accurately, particularly in the diffuse component.
- Poor generalization of existing neural methods: RGCA learns precomputed radiance transfer (PRT), which generalizes poorly to unseen illumination, and it relies on per-identity VAE expression spaces, preventing cross-identity reenactment.
- Goal: Match state-of-the-art quality at roughly one-tenth the cost, using a low-cost light stage (16 cameras, 40 LEDs) together with the FLAME parametric model and a hybrid neural BRDF, while supporting monocular video-driven animation.
## Method
### Overall Architecture
Input FLAME expression/pose parameters → geometry module \(\mathcal{F}_g\) predicts 3D Gaussian attributes → hybrid neural shading (diffuse \(\mathcal{F}_d\) + analytic specular) → Gaussian Splatting rendering → joint optimization via L1 + SSIM photometric loss + regularization.
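A minimal, runnable PyTorch sketch of this forward pass follows; the module sizes, tensor shapes, and the simplified (non-UV) geometry decoder are my assumptions for illustration, not the authors' implementation, and the final rasterization step is only stubbed out.

```python
# Hypothetical skeleton of the BecomingLit forward pass (not the authors' code).
import torch
import torch.nn as nn

# 49 = (6+1)^2 SH coefficients for order-6 lighting; the exact band convention is assumed.
N_PRIM, EXPR_DIM, FEAT_DIM, SH_DIM = 1024, 109, 32, 49

class GeometryModule(nn.Module):
    """Stand-in for F_g; the paper uses a transposed CNN decoding a 512^2 UV map."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(EXPR_DIM, N_PRIM * (3 + FEAT_DIM))
    def forward(self, expr):
        out = self.net(expr).view(N_PRIM, 3 + FEAT_DIM)
        return out[:, :3], out[:, 3:]  # dynamic offsets delta_mu, expression features f_expr

class DiffuseMLP(nn.Module):
    """Stand-in for F_d: (SH lighting, f_expr) -> scalar reflectance."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(SH_DIM + FEAT_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, sh, f_expr):
        return torch.sigmoid(self.net(torch.cat([sh.expand(N_PRIM, -1), f_expr], -1)))

geometry, diffuse = GeometryModule(), DiffuseMLP()
albedo = nn.Parameter(torch.rand(N_PRIM, 3))          # learned static albedo a_k
static_offset = nn.Parameter(torch.zeros(N_PRIM, 3))  # carries most of the displacement

expr = torch.randn(EXPR_DIM)             # FLAME expression coefficients
mesh_pos = torch.randn(N_PRIM, 3)        # interpolated from the posed FLAME mesh
tbn = torch.eye(3).expand(N_PRIM, 3, 3)  # per-primitive tangent-bitangent-normal frames
sh_light = torch.randn(SH_DIM)           # SH coefficients of the incident light

delta_mu, f_expr = geometry(expr)
# Centers = mesh position + TBN-rotated static offset + dynamic offset.
centers = mesh_pos + torch.einsum("nij,nj->ni", tbn, static_offset) + delta_mu
# Diffuse color; the analytic specular term is sketched further below.
rgb = diffuse(sh_light, f_expr) * albedo
# A real pipeline would now splat (centers, rotation, scale, opacity, rgb) with a
# 3D Gaussian Splatting rasterizer and apply the photometric losses.
print(centers.shape, rgb.shape)
```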
### Key Designs
1. Expression-Dependent Geometry Module \(\mathcal{F}_g\) (UV-Space CNN)
- A fixed number of anisotropic Gaussian primitives are defined on the FLAME UV map (512² UV → ~202k primitives).
- \(\mathcal{F}_g\) is a transposed convolutional network that takes FLAME expression coefficients (109-dim) as input and outputs per-texel position offsets \(\delta\mu\), rotation \(q\), scale \(s\), opacity \(\sigma\), and expression features \(f^{expr}\) (32-dim).
- Gaussian centers = FLAME mesh interpolated positions + TBN-rotated static local offsets + dynamic offsets \(\delta\mu\); static offsets carry most of the displacement, while dynamic offsets are regularized to remain small, improving generalization to novel expressions.
2. Hybrid Neural Shading (a runnable sketch follows this list)
- Diffuse \(\mathcal{F}_d\): A 3-layer MLP (hidden size 64) that takes 6th-order spherical harmonic coefficients of the incident light and \(f^{expr}\) as input, and outputs a scalar reflectance multiplied by a learned static albedo \(a_k\) to produce the diffuse color. Parameterized as a monochromatic BRDF, it is trained under white light and evaluated per RGB channel at inference to support colored illumination. Subsurface scattering and self-occlusion are learned implicitly.
- Specular \(f_s\): Based on the Cook-Torrance model. The NDF uses a 2-lobe Blinn-Phong mixture (roughness \(r\) interpolated linearly); the Fresnel term uses the Schlick approximation; the masking-shadowing term is derived from the NDF. A small CNN \(\mathcal{F}_v\) conditioned on \((f^{expr}, \omega_o)\) predicts specular intensity \(k_s\) and normal perturbation \(\delta n\); the shading normal is computed by normalizing the sum of the mesh normal and \(\delta n\).
- Environment map relighting uses the split-sum approximation (pre-integrated mipmap + 2D BRDF LUT).
3. Low-Cost OLAT Dataset
- 16 industrial cameras (2200×3208, 72 fps, PTP sub-microsecond synchronization) + 40 high-CRI LEDs covering the front hemisphere.
- 10 subjects, each with ~150s of multi-sequence data (predefined expressions + reading + free expressions); OLAT frames and full-illumination tracking frames alternate at a 2:1 ratio.
- Setup cost is approximately one-tenth that of Goliath (144 views / 460 lights).
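Below is a hedged, runnable sketch of the analytic specular path under a single point light. For brevity it collapses the paper's two-lobe Blinn-Phong NDF into a single lobe, uses the classic Cook-Torrance min-form masking-shadowing term instead of the paper's NDF-derived one, and assumes a common roughness-to-exponent mapping and a dielectric \(F_0 \approx 0.04\); in the full method, \(k_s\) and the normal perturbation \(\delta n\) come from \(\mathcal{F}_v\), and environment maps go through the split-sum approximation rather than explicit light loops.

```python
# Simplified Cook-Torrance specular term (single lobe, single point light).
import torch
import torch.nn.functional as F

def schlick_fresnel(cos_theta, f0=0.04):
    """Schlick approximation; f0 = 0.04 is a common dielectric default (assumed)."""
    return f0 + (1.0 - f0) * (1.0 - cos_theta).clamp(min=0.0) ** 5

def blinn_phong_ndf(n_dot_h, roughness):
    """One Blinn-Phong lobe; the paper mixes two lobes with linearly interpolated roughness."""
    shininess = 2.0 / roughness.clamp(min=1e-4) ** 2 - 2.0  # common roughness-to-exponent mapping
    return (shininess + 2.0) / (2.0 * torch.pi) * n_dot_h.clamp(min=0.0) ** shininess

def cook_torrance_specular(n, omega_i, omega_o, roughness, k_s):
    h = F.normalize(omega_i + omega_o, dim=-1)  # half vector
    n_dot_h = (n * h).sum(-1, keepdim=True).clamp(min=0.0)
    n_dot_i = (n * omega_i).sum(-1, keepdim=True).clamp(min=1e-4)
    n_dot_o = (n * omega_o).sum(-1, keepdim=True).clamp(min=1e-4)
    o_dot_h = (omega_o * h).sum(-1, keepdim=True).clamp(min=1e-4)
    d = blinn_phong_ndf(n_dot_h, roughness)
    f = schlick_fresnel(o_dot_h)
    # Classic Cook-Torrance masking-shadowing; the paper derives G from its NDF instead.
    g = torch.minimum(torch.ones_like(n_dot_h),
                      2 * n_dot_h * torch.minimum(n_dot_o, n_dot_i) / o_dot_h)
    return k_s * d * f * g / (4.0 * n_dot_i * n_dot_o)

# Toy usage for one primitive; the method shades with n = normalize(n_mesh + delta_n).
n = torch.tensor([[0.0, 0.0, 1.0]])
omega_i = F.normalize(torch.tensor([[0.3, 0.0, 1.0]]), dim=-1)   # light direction
omega_o = F.normalize(torch.tensor([[-0.2, 0.1, 1.0]]), dim=-1)  # view direction
print(cook_torrance_specular(n, omega_i, omega_o, torch.tensor([[0.3]]), k_s=0.5))
```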
### Loss & Training
- Loss weights: \(\lambda_{\text{L1}}=1.0\), \(\lambda_{\text{SSIM}}=0.2\), \(\lambda_{\alpha}=\lambda_{\text{scale}}=2\times10^{-2}\), \(\lambda_{\text{pos}}=1\times10^{-5}\) (see the objective sketch after this list).
- The alpha loss (L2 between the rendered alpha and the foreground mask) is a critical regularizer: it prevents transparency artifacts during environment map relighting, and it is what makes the front-hemisphere-only light setup viable without back-side artifacts.
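A compact sketch of the resulting objective using the weights above; the exact forms of the scale and position regularizers are assumptions, and `ssim_fn` stands in for any differentiable SSIM implementation.

```python
# Training objective sketch; regularizer forms are assumed, weights are from the paper.
def total_loss(render, gt, alpha, mask, ssim_fn, scales, delta_mu):
    l_l1 = (render - gt).abs().mean()      # photometric L1
    l_ssim = 1.0 - ssim_fn(render, gt)     # structural similarity term
    l_alpha = ((alpha - mask) ** 2).mean() # rendered alpha vs. foreground mask
    l_scale = scales.mean()                # discourages oversized Gaussians (assumed form)
    l_pos = (delta_mu ** 2).mean()         # keeps dynamic offsets small (assumed form)
    return 1.0 * l_l1 + 0.2 * l_ssim + 2e-2 * (l_alpha + l_scale) + 1e-5 * l_pos

# Toy check with random tensors and a placeholder SSIM:
import torch
r, g = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
a, m = torch.rand(1, 64, 64), torch.rand(1, 64, 64)
ssim_stub = lambda x, y: 1.0 - (x - y).abs().mean()  # NOT a real SSIM; placeholder only
print(total_loss(r, g, a, m, ssim_stub, torch.rand(10, 3), torch.rand(10, 3)))
```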
## Key Experimental Results
### Main Results (Table 2, 4 subjects, 15 training views / 1 test view, 4 hold-out illuminations)
| Method | Relighting PSNR↑ | Relighting SSIM↑ | Relighting LPIPS↓ | Relight+Reenact PSNR↑ | Relight+Reenact SSIM↑ | Relight+Reenact LPIPS↓ |
|---|---|---|---|---|---|---|
| RGCA | 29.21 | 0.8462 | 0.1659 | 26.31 | 0.8206 | 0.1917 |
| RGCA_FLAME | 29.78 | 0.8464 | 0.1444 | 26.91 | 0.8282 | 0.1667 |
| BecomingLit | 31.38 | 0.8956 | 0.1040 | 28.08 | 0.8730 | 0.1317 |
- Outperforms RGCA by approximately +2.2 dB in relighting PSNR, with a 37% reduction in LPIPS (arithmetic checked below).
- Replacing RGCA's per-identity VAE with a FLAME expression space (RGCA_FLAME) consistently improves performance, confirming that a shared expression space is superior for reenactment.
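For transparency, the quoted deltas follow directly from the Table 2 numbers:

```python
# Quick arithmetic check of the Table 2 deltas quoted above.
print(f"Relighting PSNR gain over RGCA: {31.38 - 29.21:.2f} dB")  # 2.17 ~ +2.2 dB
print(f"LPIPS reduction: {(0.1659 - 0.1040) / 0.1659:.1%}")       # 37.3%
```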
### Ablation Study (Table 3)
| Variant | Relight PSNR↑ | Reenact PSNR↑ | Key Findings |
|---|---|---|---|
| w/ PBR shading | 29.42 | 26.31 | Purely analytic BRDF cannot model subsurface scattering; produces plastic appearance |
| w/ PRT diffuse | 29.23 | 25.47 | PRT generalizes worst to unseen illuminations (−2.2 dB relight / −2.6 dB reenact PSNR vs. the full model) |
| w/ SG specular | 31.55 | 28.09 | Slightly better under point lights but lacks pore-level specular detail |
| w/o alpha loss | 31.34 | 28.07 | Transparency artifacts appear during environment map relighting |
| w/o expr features | 31.23 | 28.13 | Degraded recovery of pore details and specular highlights |
| Full model | 31.38 | 28.08 | Best overall quality; the SG variant's marginally higher PSNR comes at the cost of pore-level specular detail |
### Runtime (Table 4, RTX A6000, 202k primitives, 1100×1604)
| Method | CNN | Diffuse | Specular | Splatting | Total |
|---|---|---|---|---|---|
| RGCA | 9ms | 1ms | 1ms | 9ms | 20ms |
| Ours | 4ms | 3ms | 1ms | 9ms | 17ms (~59 FPS) |
### Key Findings
- The core advantage of hybrid neural shading lies in the diffuse component: implicit learning of subsurface scattering substantially outperforms PRT and analytic PBR models in generalization.
- Analytic Cook-Torrance specular produces more natural pore-level reflections under environment lighting than Spherical Gaussians.
- The alpha regularizer enables front-hemisphere-only setups to support full-environment relighting, roughly halving setup complexity, cost, and training computation.
- Expression-dependent features \(f^{expr}\) jointly influence diffuse and specular quality, serving as a bridge between geometry and appearance.
## Highlights & Insights
- Novel hybrid shading paradigm: The diffuse term is learned implicitly by a neural network while the specular term uses a classical analytic model, achieving both physical interpretability and expressive power.
- Practical low-cost setup: 16 cameras + 40 LEDs suffice to reach state-of-the-art quality, at roughly one order of magnitude lower cost than Goliath (144 cameras + 460 lights).
- First public high-resolution OLAT facial dataset: 10 subjects, 72 fps, 2200×3208, filling a significant gap in community resources.
- End-to-end pipeline: FLAME-driven → 3DGS geometry → hybrid shading → splatting rendering, with 17ms real-time inference.
- Monocular video drivable: After training, only FLAME parameters are required for animation, making the method naturally compatible with monocular trackers such as VHAP (see the hypothetical driving loop below).
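A purely hypothetical driving loop, to make this concrete; `track_flame` stands in for a monocular tracker such as VHAP and `avatar.render` for the trained model's interface (neither API is from the paper):

```python
# Hypothetical animation loop; none of these interfaces are from the paper or VHAP.
import torch

def animate(avatar, video_frames, track_flame, camera):
    """Render the trained avatar from per-frame FLAME parameters tracked off monocular video."""
    renders = []
    for frame in video_frames:
        expr, pose = track_flame(frame)  # assumed signature: frame -> (expression, head pose)
        with torch.no_grad():            # inference only; ~17 ms/frame per Table 4
            renders.append(avatar.render(expr, pose, camera))
    return renders
```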
## Limitations & Future Work
- Training still requires thousands of light stage frames with diverse expressions; reconstruction from casual smartphone footage remains out of reach, leaving a gap to consumer-level applications.
- Geometry is constrained by FLAME: the oral interior is not modeled, the method is sensitive to gaze direction tracking, and fine-grained geometric detail is limited.
- The diffuse network \(\mathcal{F}_d\) is trained from scratch without leveraging facial appearance priors, which may lead to instability under limited data.
- The dataset covers only 10 subjects with limited diversity in ethnicity and age.
- Ethical risk: photorealistic avatars drivable from monocular video pose a deepfake threat, despite requiring an initial capture session.
## Related Work & Insights
| Direction | Representative Methods | Relation to This Work |
|---|---|---|
| 3DGS head modeling | GaussianAvatars, NPGA, RGCA | This work adds relightability on top of the 3DGS foundation |
| Light stage facial capture | Debevec 2000, Guo 2019, Goliath | This work proposes a lower-cost setup and a new dataset |
| Precomputed radiance transfer | RGCA (PRT SH coefficients) | This work replaces PRT with neural diffuse, improving generalization |
| Neural shading | LitNeRF, ReNeRF, Neural BRDF | This work is the first to combine neural diffuse + analytic specular on dynamic 3DGS avatars |
| Drivable avatars | FLAME, VHAP, Codec Avatars | This work reuses the shared FLAME expression space, eliminating per-identity encoders |
## Rating
- Novelty: ⭐⭐⭐⭐ — The hybrid neural shading paradigm (neural diffuse + analytic specular), the low-cost light stage, and the public dataset are all novel contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Quantitative comparison on 4 subjects, 5 ablation variants, runtime analysis, and demonstrations of environment relighting and monocular reenactment; subject count is somewhat limited.
- Writing Quality: ⭐⭐⭐⭐ — Clear motivation, complete method figures, rigorous mathematical derivations, and detailed appendix.
- Value: ⭐⭐⭐⭐ — The public dataset and low-cost pipeline are likely to stimulate follow-up work; real-time performance and flexible driving make the method suitable for VR/AR deployment.