HeadGaS: Real-Time Animatable Head Avatars via 3D Gaussian Splatting¶

Conference: ECCV 2024
arXiv: 2312.02902
Code: No public code
Area: 3D Vision
Keywords: 3D Gaussian Splatting, Head Animation, Animatable Head Modeling, Expression Transfer, Real-Time Rendering

TL;DR¶

Proposes HeadGaS, which equips each 3D Gaussian primitive with a learnable latent feature base, linearly blends features using expression parameters, and predicts expression-dependent color and opacity via an MLP. This design achieves real-time (250+ fps) and high-quality animatable head reconstruction, outperforming baselines in PSNR by approximately 2 dB.

Background & Motivation¶

Background: Animatable 3D head reconstruction is a core technology for digital humans, AR/VR, and teleconferencing. NeRF-based methods (e.g., INSTA, NeRFBlendShape) trade quality against speed and reach only interactive frame rates of roughly 10-20 fps (0.05-0.10 s per frame in the results below).

Limitations of Prior Work: NeRF methods are limited in rendering speed due to the dense sampling of volume rendering; explicit methods (meshes, point clouds) provide stronger geometric constraints but struggle to maintain photorealism (e.g., INSTA exhibits triangular mesh artifacts).

Key Challenge: The original design of 3DGS is a static scene representation, which does not support expression-driven dynamic appearance changes. Intuitively, one should move Gaussian positions to model dynamics, but this complicates optimization.

Goal: Extend 3DGS to an animatable head representation while maintaining real-time rendering speed and high fidelity.

Key Insight: Inspired by traditional blendshape models, facial dynamics are represented by changing the opacity and color of Gaussian primitives rather than by moving them. Motion is achieved through "over-representation": more Gaussians are kept than any single expression needs, and each expression makes a different subset visible.

Core Idea: Each Gaussian carries a learnable feature base, and expression parameters serve as blending weights for linear combination in the latent space, which are then passed through a lightweight MLP to output expression-dependent color and opacity.

Method¶

Overall Architecture¶

Given a monocular video, head poses and expression parameters (compatible with both FLAME and FaceWarehouse) are obtained via pre-processing. The core of the model is a feature-augmented set of 3D Gaussians:

\[\mathcal{G}_a = (\boldsymbol{\Sigma}, \boldsymbol{\mu}, \mathbf{F})\]

where \(\mathbf{F} \in \mathbb{R}^{B \times f_{dim}}\) is the learnable feature base (\(B\)=expression dimension, \(f_{dim}\)=feature dimension). During rendering, expression parameters \(\mathbf{e}_i\) are blended with the feature base to generate frame-specific features, which are fed into an MLP to predict color and opacity, and then rendered into images via standard tile-based rasterization.

Key Designs¶

Feature Blending: Each Gaussian possesses a feature base \(\mathbf{F} \in \mathbb{R}^{B \times f_{dim}}\). Given the expression weights \(\mathbf{e}_i \in \mathbb{R}^B\) of the \(i\)-th frame, the frame-specific feature is obtained via linear blending:

\[\mathbf{f}_i = \mathbf{F}^T \mathbf{e}_i + \mathbf{f}_0\]

where \(\mathbf{f}_0\) is a bias term. This is similar to traditional blendshapes but blended in the latent feature space rather than the geometric space. Here, \(B=52\) (the first 52 expression coefficients of FLAME) and \(f_{dim}=32\). Design Motivation: Allowing each Gaussian to independently learn dynamic features is more expressive than requiring a single MLP to learn the global dynamics of all Gaussians.
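The blending step can be illustrated with a minimal NumPy sketch. It assumes one feature base and one bias vector per Gaussian, uses random stand-ins for trained values, and picks an arbitrary Gaussian count `N`; the names `feature_base`, `feature_bias`, and `blend_features` are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

N, B, F_DIM = 1000, 52, 32   # N is an arbitrary illustrative Gaussian count

# Per-Gaussian learnable quantities (random stand-ins for trained values).
feature_base = rng.normal(size=(N, B, F_DIM)).astype(np.float32)  # F, one per Gaussian
feature_bias = rng.normal(size=(N, F_DIM)).astype(np.float32)     # f_0 (assumed per Gaussian)


def blend_features(feature_base, feature_bias, expression):
    """Return f_i = F^T e_i + f_0 for every Gaussian at once."""
    # Contract the expression axis b: (N, B, F_DIM) with (B,) gives (N, F_DIM).
    return np.einsum("nbf,b->nf", feature_base, expression) + feature_bias


expression = rng.uniform(0.0, 1.0, size=B).astype(np.float32)     # e_i for one frame
frame_features = blend_features(feature_base, feature_bias, expression)
print(frame_features.shape)
```

Because the expression vector only enters as blending weights, driving the same Gaussians with a different expression vector requires nothing more than calling `blend_features` again.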

Lightweight MLP for Color and Opacity Prediction: The frame-specific feature \(\mathbf{f}_i\) and the positional encoding \(\psi(\boldsymbol{\mu})\) are fed into a two-layer MLP (with 64-channel hidden layers):

\[\mathbf{c}_i, \alpha_i = \phi(\mathbf{f}_i, \psi(\boldsymbol{\mu}))\]

The output consists of color \(\mathbf{c}_i \in \mathbb{R}^{3(k+1)^2}\) (SH coefficients, \(k=3\)) and opacity \(\alpha_i \in [0,1]\) (sigmoid constrained). The MLP is extremely small so as not to affect real-time rendering.
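A rough sketch of such a decoder is shown below. Several details are assumptions made for illustration rather than facts from the paper: the sinusoidal form and the number of frequencies of the positional encoding, the ReLU activation, and the reading of "two-layer" as one 64-channel hidden layer followed by an output layer. Weights are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

F_DIM, HIDDEN, SH_DEGREE = 32, 64, 3
NUM_FREQS = 4                           # assumed number of encoding frequencies
SH_DIM = 3 * (SH_DEGREE + 1) ** 2       # 48 SH color coefficients for k = 3
PE_DIM = 3 * 2 * NUM_FREQS              # sin and cos per axis per frequency


def positional_encoding(centers, num_freqs=NUM_FREQS):
    """Map (N, 3) Gaussian centers to (N, 3 * 2 * num_freqs) sin/cos features."""
    freqs = 2.0 ** np.arange(num_freqs)                   # 1, 2, 4, ...
    angles = centers[:, :, None] * freqs[None, None, :]   # (N, 3, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(centers.shape[0], -1)


# Random stand-ins for the trained weights of the small MLP.
w1 = rng.normal(scale=0.1, size=(F_DIM + PE_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
w2 = rng.normal(scale=0.1, size=(HIDDEN, SH_DIM + 1))
b2 = np.zeros(SH_DIM + 1)


def predict_color_opacity(frame_features, centers):
    """Return SH color coefficients (N, 48) and opacity (N,) inside (0, 1)."""
    x = np.concatenate([frame_features, positional_encoding(centers)], axis=1)
    hidden = np.maximum(x @ w1 + b1, 0.0)                 # ReLU (assumed activation)
    out = hidden @ w2 + b2
    sh_coeffs = out[:, :SH_DIM]
    opacity = 1.0 / (1.0 + np.exp(-out[:, SH_DIM]))       # sigmoid bounds the opacity
    return sh_coeffs, opacity


features = rng.normal(size=(5, F_DIM))
centers = rng.uniform(-1.0, 1.0, size=(5, 3))
sh_coeffs, opacity = predict_color_opacity(features, centers)
print(sh_coeffs.shape, opacity.shape)
```

With these sizes the network has only a few thousand weights, which is consistent with the claim that it does not hurt real-time rendering.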

Modeling Motion via Opacity Variation: The key observation (Fig. 2) is that dynamic areas of the face contain overlapping Gaussian primitives that alternately become visible and invisible under different expressions. For example, when the mouth is closed, the lip Gaussians have high opacity; when it opens, they become transparent and another set of Gaussians in the jaw area emerges. This "over-representation" strategy avoids the optimization difficulties introduced by explicit 3D displacements.

Alternatives that were tested and found non-viable:
- Direct blending of explicit parameters (without MLP): Severe artifacts appear in dynamic regions
- Predicting position/rotation offsets \(\Delta(\mu, R)\): Introducing 3D motion dimensions into the heuristic optimization of 3DGS leads to geometric inconsistencies
- Expression vectors as MLP conditioning only (without blending): A single MLP struggles to learn the global dynamics of all Gaussians
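The opacity-switching idea can be illustrated with a toy example of front-to-back alpha compositing, the standard blending rule used when splatting depth-sorted Gaussians. The two samples, their colors, and their opacities are invented for illustration; the point is only that the pixel changes although neither sample moves.

```python
def composite(colors, opacities):
    """Front-to-back alpha blending of depth-sorted samples along one ray."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0                     # fraction of light not yet absorbed
    for color, alpha in zip(colors, opacities):
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel


# Two overlapping samples at fixed positions (hypothetical colors).
LIP = (0.8, 0.2, 0.2)
MOUTH_INTERIOR = (0.1, 0.0, 0.0)
colors = [LIP, MOUTH_INTERIOR]              # the lip sample is in front

# Only the opacities differ between the two expressions.
closed_mouth = composite(colors, [0.95, 0.05])
open_mouth = composite(colors, [0.05, 0.95])
print(closed_mouth, open_mouth)
```

The closed-mouth pixel is dominated by the lip color and the open-mouth pixel by the interior color, which mimics apparent motion without any displacement.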

Loss & Training¶

\[\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_1(I_r, I_{gt}) + \lambda_s \mathcal{L}_{\text{SSIM}}(I_r, I_{gt}) + \lambda_p \mathcal{L}_p(I_r, I_{gt})\]

- Loss weights: \(\lambda_1=0.8\), \(\lambda_s=0.2\), \(\lambda_p=0.1\).
- The perceptual loss \(\mathcal{L}_p\) is based on the VGG network, activated after 10K iterations, and computed only within the head bbox region.
- Initialization: 2500 Gaussian center points (subsampled from 3DMM mesh vertices), with the feature base \(\mathbf{F}\) initialized to zero.
- Densification runs between iterations 500 and 15K; training lasts 50K iterations in total and takes about 1 hour on a single V100 GPU.
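The weighting and schedule described above can be summarized in a short sketch. The individual loss terms are passed in as plain numbers, since computing SSIM or VGG features is out of scope here. The function names and the exact boundary handling (whether iteration 10,000 itself already uses the perceptual term, and whether the densification window is inclusive) are assumptions.

```python
LAMBDA_L1, LAMBDA_SSIM, LAMBDA_PERC = 0.8, 0.2, 0.1
PERCEPTUAL_START = 10_000                 # perceptual loss switched on from here (assumed inclusive)
DENSIFY_START, DENSIFY_END = 500, 15_000  # densification window (assumed inclusive)
TOTAL_ITERS = 50_000


def total_loss(l1, ssim_loss, perceptual, iteration):
    """Weighted sum of the three terms; the perceptual term is gated by iteration."""
    loss = LAMBDA_L1 * l1 + LAMBDA_SSIM * ssim_loss
    if iteration >= PERCEPTUAL_START:
        loss += LAMBDA_PERC * perceptual
    return loss


def should_densify(iteration):
    """True while the Gaussian set may still be densified."""
    return DENSIFY_START <= iteration <= DENSIFY_END


for it in (0, 500, 12_000, 20_000):
    print(it, total_loss(1.0, 1.0, 1.0, it), should_densify(it))
```

Note that between iterations 10K and 15K both the perceptual loss and densification are active under this reading of the schedule.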

Key Experimental Results¶

Main Results¶

Method	Dataset	PSNR↑	SSIM↑	LPIPS↓	Rendering Time (s)↓
NHA	INSTA	26.99	0.942	0.043	0.63
INSTA	INSTA	28.61	0.944	0.047	0.05
NeRFBlendShape	INSTA	30.52	0.955	0.056	0.10
PointAvatar	INSTA	30.68	0.952	0.058	0.1-1.5
HeadGaS	INSTA	32.50	0.971	0.033	0.004
NeRFBlendShape	NBS	34.34	0.970	0.031	0.10
HeadGaS	NBS	36.66	0.976	0.026	0.004

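HeadGaS leads the strongest baseline by about 2 dB in PSNR on both datasets (32.50 vs. 30.68 on INSTA, 36.66 vs. 34.34 on NBS). At 0.004 s per frame it renders 12.5× faster than INSTA and 25× faster than NeRFBlendShape.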
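For reference, the frame rates and speed-up factors implied by the rendering times in the table can be recomputed directly. PointAvatar is left out because its time is reported as a range.

```python
# Rendering time per frame in seconds, copied from the table above.
render_time_s = {
    "NHA": 0.63,
    "INSTA": 0.05,
    "NeRFBlendShape": 0.10,
    "HeadGaS": 0.004,
}

# Frames per second is the reciprocal of the per-frame time.
fps = {name: 1.0 / t for name, t in render_time_s.items()}

# Speed-up of HeadGaS relative to each baseline.
speedup = {
    name: t / render_time_s["HeadGaS"]
    for name, t in render_time_s.items()
    if name != "HeadGaS"
}

for name in render_time_s:
    print(f"{name:15s} {fps[name]:7.1f} fps", end="")
    print(f"  ({speedup[name]:.1f}x slower than HeadGaS)" if name in speedup else "")
```

The NeRF-based baselines INSTA and NeRFBlendShape come out at 12.5× and 25× the HeadGaS rendering time, and the mesh-based NHA at well over 100×.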

Ablation Study¶

Variant	L2↓	PSNR↑	SSIM↑	LPIPS↓
Ours w/o blending	0.0012	30.28	0.955	0.041
Ours w/ \(\Delta(\mu,R)\)	0.0014	29.83	0.953	0.045
Ours change all	0.0014	29.65	0.951	0.041
Ours w/o MLP	0.0009	32.08	0.968	0.033
Ours w/o \(\mathcal{L}_p\)	0.0008	32.11	0.969	0.046
Ours (Full)	0.0008	32.50	0.971	0.033

Key Findings¶

Modeling dynamics via opacity > Positional deformation: Changing color/opacity (PSNR 32.50) is far superior to moving Gaussian positions (29.83), as spatial motion introduces geometric inconsistencies under the heuristic optimization mechanism of 3DGS.
Blending in feature space > Blending explicit parameters: Blending latent features and decoding them with an MLP (PSNR 32.50) outperforms blending color and opacity directly with the expression weights (32.08 for the w/o MLP variant), which is limited in what it can represent.
Role of perceptual loss: Removing \(\mathcal{L}_p\) degrades LPIPS from 0.033 to 0.046, with the quality loss mainly occurring in fine textures.

Highlights & Insights¶

Counter-intuitive yet effective dynamic modeling: Instead of moving Gaussians, letting Gaussians "appear/disappear" to represent motion is simple yet highly effective.
3DMM-decoupled design: Not tied to a specific parametric model; validated compatibility with two different 3DMMs (FLAME and FaceWarehouse).
Extreme speed: 250 fps at 512² resolution, which by the reported rendering times is about 12-25× faster than the NeRF-based baselines (INSTA, NeRFBlendShape) and over 150× faster than NHA.
Cross-subject expression transfer: Allows driving one person's model with another person's expression parameters without additional training.

Limitations & Future Work¶

Memory consumption: Storing a feature base of \(B \times f_{dim}\) floats for each Gaussian leads to significant memory overhead when the expression dimension \(B\) is large.
Dependence on head tracking quality: Pose/expression estimation errors from the tracker propagate directly to the rendering results.
Training data coverage limitations: If certain expressions are only observed from the front, rendering those expressions from side angles will fail.
Compression of the feature base remains unexplored; it could be combined with HAC-like methods to further reduce model size.
Extension to other regions: The approach could be extended to more complex dynamic regions such as hands or the full body.
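The memory concern about the feature base can be made concrete with a back-of-the-envelope estimate. The sketch assumes 32-bit floats and a hypothetical Gaussian count; the actual number of Gaussians after densification is not stated in this summary.

```python
def feature_base_megabytes(num_gaussians, num_expressions=52, feature_dim=32,
                           bytes_per_float=4):
    """Storage for the per-Gaussian feature bases alone, in megabytes (10^6 bytes)."""
    return num_gaussians * num_expressions * feature_dim * bytes_per_float / 1e6


floats_per_gaussian = 52 * 32             # B * f_dim = 1664 floats per Gaussian
hypothetical_count = 100_000              # illustrative only, not from the paper
print(floats_per_gaussian, feature_base_megabytes(hypothetical_count))
```

Since the cost scales linearly in both \(B\) and the number of Gaussians, halving either one halves the storage needed for the feature base.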

Related Work¶

NeRFBlendShape: Also uses expression parameter blending, but blends multi-level hash grid fields instead of individual Gaussian features.
PointAvatar: Deformable point cloud representation, with stronger geometric constraints but lacking photorealism.
INSTA: Based on FLAME mesh deformation + InstantNGP, reaches near-real-time speed but exhibits mesh artifacts.
4D-GS / Dynamic 3DGS: General dynamic scene methods, whereas HeadGaS features a more streamlined and specialized design for heads.

Rating¶

Novelty: ⭐⭐⭐⭐ — The insight of "modeling facial dynamics via opacity variation instead of geometric deformation" is highly original and counter-intuitive.
Experimental Thoroughness: ⭐⭐⭐⭐ — 3 datasets, comparisons with 8 baselines, comprehensive ablations, and cross-subject transfer experiments.
Writing Quality: ⭐⭐⭐⭐ — Clear motivation, and the visualization of opacity changes (Fig 2) is highly intuitive.
Value: ⭐⭐⭐⭐⭐ — Enables real-time driving of digital avatars, directly applicable to AR/VR.