HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting¶
Conference: ECCV 2024
arXiv: 2312.02902
Code: No public code
Area: 3D Vision
Keywords: 3D Gaussian Splatting, Head Animation, Animatable Head Modeling, Expression Transfer, Real-Time Rendering
TL;DR¶
Proposes HeadGaS, which equips each 3D Gaussian primitive with a learnable latent feature base, linearly blends features using expression parameters, and predicts expression-dependent color and opacity via an MLP. This design achieves real-time (250+ fps) and high-quality animatable head reconstruction, outperforming baselines in PSNR by approximately 2 dB.
Background & Motivation¶
Background: Animatable 3D head reconstruction is a core technology for digital humans, AR/VR, and teleconferencing. NeRF-based methods (e.g., INSTA, NeRFBlendShape) suffer from a trade-off between quality and speed, reaching interactive frame rates of only 10-15 fps.
Limitations of Prior Work: NeRF methods are limited in rendering speed due to the dense sampling of volume rendering; explicit methods (meshes, point clouds) provide stronger geometric constraints but struggle to maintain photorealism (e.g., INSTA exhibits triangular mesh artifacts).
Key Challenge: The original design of 3DGS is a static scene representation, which does not support expression-driven dynamic appearance changes. Intuitively, one should move Gaussian positions to model dynamics, but this complicates optimization.
Goal: Extend 3DGS to an animatable head representation while maintaining real-time rendering speed and high fidelity.
Key Insight: Inspired by traditional blendshape models, rather than moving Gaussian primitives, facial dynamics are represented by changing their opacity and color—effectively achieving motion through "over-representation."
Core Idea: Each Gaussian carries a learnable feature base, and expression parameters serve as blending weights for linear combination in the latent space, which are then passed through a lightweight MLP to output expression-dependent color and opacity.
Method¶
Overall Architecture¶
Given a monocular video, head poses and expression parameters (compatible with both FLAME and FaceWarehouse) are obtained via pre-processing. The core of the model is a set of 3D Gaussians in which each primitive keeps its usual geometric parameters (such as position \(\boldsymbol{\mu}\) and rotation \(R\)) and is augmented with a learnable feature base \(\mathbf{F} \in \mathbb{R}^{B \times f_{dim}}\) (\(B\) = expression dimension, \(f_{dim}\) = feature dimension). During rendering, the expression parameters \(\mathbf{e}_i\) of frame \(i\) are blended with the feature base to generate frame-specific features, which are fed into an MLP to predict color and opacity; the Gaussians are then rendered into images via standard tile-based rasterization.
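The data flow described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the weights are random, the positional encoding is a generic sinusoidal encoding with an assumed number of frequencies, the bias is assumed to be stored per Gaussian, the "two-layer MLP" is read as one 64-channel hidden layer plus an output layer with ReLU, and all function names are hypothetical. Only \(B=52\), \(f_{dim}=32\), the 64 hidden channels, the SH degree \(k=3\), and the 2500 initial Gaussians are taken from this summary.

```python
import numpy as np

B, F_DIM, HIDDEN, SH_DEGREE = 52, 32, 64, 3    # values quoted in this summary
N_FREQS = 4                                    # assumption: encoding frequencies
N_SH = 3 * (SH_DEGREE + 1) ** 2                # 48 SH color coefficients


def positional_encoding(mu, n_freqs=N_FREQS):
    """Generic sinusoidal encoding of centers: (N, 3) -> (N, 6 * n_freqs)."""
    freqs = 2.0 ** np.arange(n_freqs)                       # (n_freqs,)
    angles = mu[:, :, None] * freqs                         # (N, 3, n_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(mu.shape[0], -1)


def blend_features(feature_base, f0, expr):
    """Per-Gaussian linear blend f_i = F^T e_i + f_0.

    feature_base: (N, B, F_DIM), f0: (N, F_DIM), expr: (B,) -> (N, F_DIM)
    """
    return np.einsum("b,nbf->nf", expr, feature_base) + f0


def decode(features, mu, params):
    """Small MLP mapping blended features + encoded position to SH color and opacity."""
    x = np.concatenate([features, positional_encoding(mu)], axis=1)
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)    # ReLU is an assumption
    out = h @ params["w2"] + params["b2"]
    sh = out[:, :N_SH]                                      # (N, 48) SH coefficients
    opacity = 1.0 / (1.0 + np.exp(-out[:, N_SH]))           # sigmoid keeps it in (0, 1)
    return sh, opacity


rng = np.random.default_rng(0)
n = 2500                                                    # initial Gaussian count
in_dim = F_DIM + 6 * N_FREQS
params = {
    "w1": rng.normal(0.0, 0.1, (in_dim, HIDDEN)), "b1": np.zeros(HIDDEN),
    "w2": rng.normal(0.0, 0.1, (HIDDEN, N_SH + 1)), "b2": np.zeros(N_SH + 1),
}
# Random values for illustration only; training starts from a zero feature base.
feature_base = rng.normal(0.0, 0.1, (n, B, F_DIM))
f0 = np.zeros((n, F_DIM))
mu = rng.uniform(-1.0, 1.0, (n, 3))
expr = rng.uniform(0.0, 1.0, B)                             # one frame's expression weights

features = blend_features(feature_base, f0, expr)
sh, opacity = decode(features, mu, params)
print(sh.shape, opacity.shape)
```

The resulting per-Gaussian SH coefficients and opacities would then be handed to an unmodified 3DGS rasterizer, which is why the per-frame cost stays close to that of static 3DGS.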
Key Designs¶
- Feature Blending: Each Gaussian possesses a feature base \(\mathbf{F} \in \mathbb{R}^{B \times f_{dim}}\). Given the expression weights \(\mathbf{e}_i \in \mathbb{R}^B\) of the \(i\)-th frame, the frame-specific feature is obtained via linear blending:
\[\mathbf{f}_i = \mathbf{F}^\top \mathbf{e}_i + \mathbf{f}_0\]
where \(\mathbf{f}_0 \in \mathbb{R}^{f_{dim}}\) is a bias term. This is similar to traditional blendshapes, but the blending happens in the latent feature space rather than the geometric space. Here, \(B=52\) (the first 52 expression coefficients of FLAME) and \(f_{dim}=32\). Design Motivation: Allowing each Gaussian to independently learn dynamic features is more expressive than requiring a single MLP to learn the global dynamics of all Gaussians.
- Lightweight MLP for Color and Opacity Prediction: The frame-specific feature \(\mathbf{f}_i\) and the positional encoding \(\psi(\boldsymbol{\mu})\) of the Gaussian center are fed into a two-layer MLP (with 64-channel hidden layers):
\[(\mathbf{c}_i, \alpha_i) = \mathrm{MLP}\big(\mathbf{f}_i, \psi(\boldsymbol{\mu})\big)\]
The output consists of color \(\mathbf{c}_i \in \mathbb{R}^{3(k+1)^2}\) (SH coefficients, \(k=3\), i.e. 48 values) and opacity \(\alpha_i \in [0,1]\) (sigmoid constrained). The MLP is kept very small so that it does not compromise real-time rendering.
- Modeling Motion via Opacity Variation: Key observation (Fig. 2): dynamic areas of the face contain overlapping Gaussian primitives, which alternately become visible or invisible under different expressions. For example, when the mouth is closed, the opacity of the lip Gaussians is high; when the mouth opens, these Gaussians become transparent, and another set of Gaussians in the jaw area emerges. This "over-representation" strategy avoids the optimization difficulties introduced by explicit 3D displacements.
- Validation of Other Non-viable Alternatives:
  - Direct blending of explicit parameters (without MLP): Severe artifacts appear in dynamic regions
  - Predicting position/rotation offsets \(\Delta(\mu, R)\): Introducing 3D motion dimensions into the heuristic optimization of 3DGS leads to geometric inconsistencies
  - Expression vectors as MLP conditioning only (without blending): A single MLP struggles to learn the global dynamics of all Gaussians
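The opacity-based motion idea in the list above can be illustrated with a toy front-to-back alpha-compositing example along a single ray. This is only a sketch: the two colors and the opacity values are hand-picked, hypothetical numbers, and a real 3DGS rasterizer additionally weights each primitive by its projected 2D Gaussian footprint.

```python
def composite(colors, opacities):
    """Front-to-back alpha compositing along one ray.

    Implements C = sum_i c_i * a_i * prod_{j<i} (1 - a_j).
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for color, alpha in zip(colors, opacities):
        weight = alpha * transmittance
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel


LIP = (0.8, 0.3, 0.3)         # hypothetical color of a lip Gaussian
INTERIOR = (0.1, 0.0, 0.0)    # hypothetical color of a Gaussian revealed when the mouth opens

# Neither primitive moves; only the expression-dependent opacities change.
closed = composite([LIP, INTERIOR], [0.95, 0.05])
opened = composite([LIP, INTERIOR], [0.05, 0.95])
print("closed mouth pixel:", closed)
print("open mouth pixel:  ", opened)
```

With the mouth "closed" the pixel is dominated by the lip color, and with the mouth "open" by the second primitive, even though both Gaussians stay at fixed positions.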
Loss & Training¶
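- Total loss: \(\mathcal{L} = \lambda_1 \mathcal{L}_1 + \lambda_s \mathcal{L}_s + \lambda_p \mathcal{L}_p\), combining an L1 photometric term, a structural similarity (SSIM) term, and a perceptual term, with weights \(\lambda_1=0.8\), \(\lambda_s=0.2\), \(\lambda_p=0.1\)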
- The perceptual loss \(\mathcal{L}_p\) is based on the VGG network, activated after 10K iterations, and computed only within the head bbox region
- Initialization: 2500 Gaussian center points (subsampled from 3DMM mesh vertices), with the feature base \(\mathbf{F}\) initialized to zero
- Densification is executed between 500 and 15K iterations, with 50K total training iterations, taking about 1 hour on a single V100 GPU
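The schedule listed above can be written down as a small helper. This is a hedged sketch rather than the authors' training code: the function names are hypothetical, the individual loss terms are passed in as precomputed scalars, \(\lambda_s\) is assumed to weight an SSIM-based term, and the exact boundary handling (strictly after 10K, inclusive densification window) is an assumption.

```python
LAMBDA_1, LAMBDA_S, LAMBDA_P = 0.8, 0.2, 0.1   # loss weights from the list above
PERCEPTUAL_START = 10_000                      # perceptual loss activated after 10K iterations
DENSIFY_START, DENSIFY_END = 500, 15_000       # densification window
TOTAL_ITERS = 50_000                           # total training iterations


def total_loss(l1_term, ssim_term, perceptual_term, iteration):
    """Weighted sum of the loss terms; the perceptual term is switched on late."""
    loss = LAMBDA_1 * l1_term + LAMBDA_S * ssim_term
    if iteration > PERCEPTUAL_START:           # assumption: strictly after 10K
        loss += LAMBDA_P * perceptual_term
    return loss


def densification_active(iteration):
    """True while Gaussians may still be densified (inclusive bounds assumed)."""
    return DENSIFY_START <= iteration <= DENSIFY_END


for it in (100, 5_000, 12_000, 30_000):
    print(it, densification_active(it), total_loss(1.0, 1.0, 1.0, it))
```

In this reading, iterations 10K to 15K are the only phase in which densification and the perceptual loss are active at the same time.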
Key Experimental Results¶
Main Results¶
| Method | Dataset | PSNR↑ | SSIM↑ | LPIPS↓ | Rendering Time (s)↓ |
|---|---|---|---|---|---|
| NHA | INSTA | 26.99 | 0.942 | 0.043 | 0.63 |
| INSTA | INSTA | 28.61 | 0.944 | 0.047 | 0.05 |
| NeRFBlendShape | INSTA | 30.52 | 0.955 | 0.056 | 0.10 |
| PointAvatar | INSTA | 30.68 | 0.952 | 0.058 | 0.1-1.5 |
| HeadGaS | INSTA | 32.50 | 0.971 | 0.033 | 0.004 |
| NeRFBlendShape | NBS | 34.34 | 0.970 | 0.031 | 0.10 |
| HeadGaS | NBS | 36.66 | 0.976 | 0.026 | 0.004 |
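HeadGaS leads the strongest baseline by about 2 dB in PSNR on both datasets (32.50 vs. 30.68 on INSTA, 36.66 vs. 34.34 on NBS) and renders roughly 12-25× faster than INSTA and NeRFBlendShape (0.004 s vs. 0.05-0.10 s per frame).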
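The frame rates and margins implied by the table can be checked with a few lines of arithmetic. The snippet only reuses the numbers from the table; PointAvatar is left out because its rendering time is given as a range.

```python
# Per-frame rendering times in seconds, copied from the table above.
render_time_s = {"NHA": 0.63, "INSTA": 0.05, "NeRFBlendShape": 0.10, "HeadGaS": 0.004}

fps = {name: 1.0 / t for name, t in render_time_s.items()}
speedup = {
    name: t / render_time_s["HeadGaS"]
    for name, t in render_time_s.items()
    if name != "HeadGaS"
}

# PSNR margin over the best baseline on each dataset.
psnr_margin_insta = 32.50 - 30.68   # vs. PointAvatar
psnr_margin_nbs = 36.66 - 34.34     # vs. NeRFBlendShape

for name, value in fps.items():
    print(f"{name}: {value:.1f} fps")
for name, value in speedup.items():
    print(f"HeadGaS vs. {name}: {value:.1f}x faster")
print(f"PSNR margin: {psnr_margin_insta:.2f} dB (INSTA), {psnr_margin_nbs:.2f} dB (NBS)")
```

A rendering time of 0.004 s corresponds to 250 fps, which is 12.5× faster than INSTA, 25× faster than NeRFBlendShape, and about 157× faster than NHA; the PSNR margins are 1.82 dB and 2.32 dB.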
Ablation Study¶
| Variant | L2↓ | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|---|
| Ours w/o blending | 0.0012 | 30.28 | 0.955 | 0.041 |
| Ours w/ \(\Delta(\mu,R)\) | 0.0014 | 29.83 | 0.953 | 0.045 |
| Ours change all | 0.0014 | 29.65 | 0.951 | 0.041 |
| Ours w/o MLP | 0.0009 | 32.08 | 0.968 | 0.033 |
| Ours w/o \(\mathcal{L}_p\) | 0.0008 | 32.11 | 0.969 | 0.046 |
| Ours (Full) | 0.0008 | 32.50 | 0.971 | 0.033 |
Key Findings¶
- Modeling dynamics via opacity > Positional deformation: Changing color/opacity (PSNR 32.50) is far superior to moving Gaussian positions (29.83), as spatial motion introduces geometric inconsistencies under the heuristic optimization mechanism of 3DGS.
- Blending in feature space > Blending explicit parameters: Latent feature blending + MLP decoding avoids the representational limits of directly controlling color/opacity using expression weights.
- Role of perceptual loss: Removing \(\mathcal{L}_p\) degrades LPIPS from 0.033 to 0.046, with the quality loss mainly occurring in fine textures.
Highlights & Insights¶
- Counter-intuitive yet effective dynamic modeling: Instead of moving Gaussians, letting Gaussians "appear/disappear" to represent motion is simple yet highly effective.
- 3DMM-decoupled design: Not tied to a specific parametric model; validated compatibility with two different 3DMMs (FLAME and FaceWarehouse).
- Extreme speed: About 250 fps at 512² resolution (0.004 s per frame), roughly 12-25× faster than INSTA and NeRFBlendShape according to the rendering times in the main results table.
- Cross-subject expression transfer: Allows driving one person's model with another person's expression parameters without additional training.
Limitations & Future Work¶
- Memory consumption: Storing a feature base of \(B \times f_{dim}\) floats for each Gaussian leads to significant memory overhead when the expression dimension \(B\) is large.
- Dependence on head tracking quality: Pose/expression estimation errors from the tracker propagate directly to the rendering results.
- Training data coverage limitations: If certain expressions are only observed from the front, rendering those expressions from side angles will fail.
- Compression of the feature base remains unexplored; it could be combined with HAC-like methods to further reduce model size.
- The approach could be extended to other complex dynamic regions, such as hands or the full body.
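The memory concern in the first bullet above can be made concrete with a back-of-the-envelope estimate. The sketch assumes float32 storage and ignores the bias term and optimizer state; the 100,000-Gaussian count is a hypothetical example, since the final number of Gaussians after densification is not stated in this summary.

```python
B, F_DIM = 52, 32            # expression and feature dimensions from the method section
BYTES_PER_FLOAT32 = 4        # assumption: features stored as float32

# A standard 3DGS primitive with degree-3 SH stores
# 3 (position) + 4 (rotation quaternion) + 3 (scale) + 1 (opacity) + 48 (SH) floats.
STANDARD_3DGS_FLOATS = 3 + 4 + 3 + 1 + 48


def feature_base_bytes(num_gaussians, b=B, f_dim=F_DIM):
    """Bytes needed to store the B x f_dim feature base of every Gaussian."""
    return num_gaussians * b * f_dim * BYTES_PER_FLOAT32


floats_per_gaussian = B * F_DIM
print("feature floats per Gaussian:", floats_per_gaussian)
print("ratio to standard 3DGS floats:", floats_per_gaussian / STANDARD_3DGS_FLOATS)
for count in (2_500, 100_000):   # initial count from the text, plus a hypothetical larger model
    print(count, "Gaussians:", feature_base_bytes(count) / 1e6, "MB")
```

Each Gaussian carries 1,664 feature floats (6,656 bytes), about 28 times the 59 floats of a standard 3DGS primitive, so the feature base dominates the model size as the Gaussian count grows.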
Related Work & Insights¶
- NeRFBlendShape: Also uses expression parameter blending, but blends multi-level hash grid fields instead of individual Gaussian features.
- PointAvatar: Deformable point cloud representation, with stronger geometric constraints but lacking photorealism.
- INSTA: Based on FLAME mesh deformation + InstantNGP, reaches near-real-time speed but exhibits mesh artifacts.
- 4D-GS / Dynamic 3DGS: General dynamic scene methods, whereas HeadGaS features a more streamlined and specialized design for heads.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The insight of "modeling facial dynamics via opacity variation instead of geometric deformation" is highly original and counter-intuitive.
- Experimental Thoroughness: ⭐⭐⭐⭐ — 3 datasets, comparisons with 8 baselines, comprehensive ablations, and cross-subject transfer experiments.
- Writing Quality: ⭐⭐⭐⭐ — Clear motivation, and the visualization of opacity changes (Fig 2) is highly intuitive.
- Value: ⭐⭐⭐⭐⭐ — Enables real-time driving of digital avatars, directly applicable to AR/VR.