Gaussian Eigen Models for Human Heads
Conference: CVPR 2025
arXiv: 2407.04545
Code: https://zielon.github.io/gem/ (Project page, code will be released)
Area: 3D Vision
Keywords: Gaussian head modeling, eigenbasis distillation, linear morphable models, cross-identity reenactment, real-time rendering
TL;DR
Proposes Gaussian Eigen Models (GEM), which distill a high-quality CNN-based Gaussian avatar into a lightweight linear eigenbasis via PCA. Facial animation is generated by linearly combining the basis vectors with low-dimensional coefficients, giving animatable avatars that are high quality, compact (from 7MB), and fast (200+ fps), with real-time cross-identity expression reenactment from monocular video.
Background & Motivation
- Background: 3D Gaussian Avatar methods can generate realistic animatable avatars. High-quality methods (e.g., Animatable Gaussians) rely on complex CNN architectures to generate expression-dependent appearance variations (such as wrinkles and self-shadowing), but require substantial computational resources.
- Limitations of Prior Work: (a) CNN-based methods often have tens of millions of parameters with checkpoints exceeding 500MB, making them unsuitable for consumer-grade devices; (b) CNN-free methods (e.g., Gaussian Avatars) have limited quality and fail to capture dynamic details like wrinkles; (c) most existing methods rely on FLAME 3DMM tracking and do not support direct image-driven control.
- Key Challenge: High-quality rendering requires heavy CNNs, while lightweight models lack expressiveness. Is it possible to completely remove the CNN at inference time while maintaining CNN-level quality?
- Goal: Distill high-quality CNN-based Gaussian Avatars into lightweight linear models that require no neural networks during inference, while maintaining high quality and controllability.
- Key Insight: The linear PCA bases of 3D Morphable Models (3DMMs) suggest that if mesh vertices can be represented by a linear basis, then 3D Gaussian attributes (position, rotation, scale, opacity) should be representable the same way.
- Core Idea: Perform per-modality PCA (on position, rotation, scale, and opacity) on multi-frame Gaussian point clouds to construct an eigenbasis. New expressions can be generated via a simple linear combination of these basis vectors, completely eliminating the need for neural network inference.
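The core idea above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the data is synthetic, the helper names (`fit_pca`, `project`, `reconstruct`) are hypothetical, and the per-Gaussian dimensions (3 for position, 4 for rotation, 3 for scale, 1 for opacity) are the usual 3D Gaussian Splatting layout, assumed here. It fits one PCA model per attribute and rebuilds a frame from its coefficients alone.

```python
import numpy as np

def fit_pca(frames, num_components):
    """Fit PCA to an (N, D) array holding one flattened attribute map per frame."""
    mean = frames.mean(axis=0)
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    basis = vt[:num_components].T  # shape (D, M)
    return mean, basis

def project(frame, mean, basis):
    """Coefficients k of one frame in the eigenbasis."""
    return basis.T @ (frame - mean)

def reconstruct(coeffs, mean, basis):
    """Linear model: attribute = mean + basis @ k."""
    return mean + basis @ coeffs

rng = np.random.default_rng(0)
num_frames, num_gaussians, rank = 40, 64, 5
# Assumed per-Gaussian dimensions (standard 3DGS layout, not taken from the paper).
dims = {"position": 3, "rotation": 4, "scale": 3, "opacity": 1}

models = {}
max_error = 0.0
for name, d in dims.items():
    # Synthetic stand-in for the CNN output: low-rank variation around an offset.
    latent = rng.normal(size=(num_frames, rank))
    mixing = rng.normal(size=(rank, num_gaussians * d))
    offset = rng.normal(size=num_gaussians * d)
    frames = latent @ mixing + offset

    mean, basis = fit_pca(frames, rank)
    models[name] = (mean, basis)

    k = project(frames[0], mean, basis)
    error = np.abs(reconstruct(k, mean, basis) - frames[0]).max()
    max_error = max(max_error, error)

print(f"largest reconstruction error across attributes: {max_error:.2e}")
```

Because the synthetic frames have exactly `rank` degrees of freedom, `rank` components reconstruct them up to floating-point error; real Gaussian sequences are only approximately low-rank, which is why the number of components trades quality against size.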
Method
Overall Architecture
The method consists of three key steps: (1) Constructing a high-quality Gaussian dataset: Train an improved version of Animatable Gaussians (CNN) on multi-view video to generate canonical-space Gaussian point clouds \(\{G_0,...,G_{N-1}\}\) for each frame. (2) Distillation to GEM: Perform per-modality PCA on these Gaussians to obtain four sets of eigenbases for position, rotation, scale, and opacity, and refine the basis vectors via photometric loss. (3) Image-driven animation: Train a lightweight regressor to predict GEM coefficients from a single RGB image, achieving real-time driving.
Key Designs
- Per-modality PCA Distillation:
  - Function: Distill the high-quality Gaussian sequence generated by CNNs into a linear eigenbasis.
  - Mechanism: Given \(N\) frames of Gaussian data \(\{G_0,...,G_{N-1}\}\), perform PCA separately on the four attributes: position \(\phi\), rotation \(\theta\), scale \(\sigma\), and opacity \(\alpha\), to obtain four sets of eigenbases \(B_\phi, B_\theta, B_\sigma, B_\alpha\) and their means. The color \(c\) is optimized individually as a global parameter and excluded from PCA (to maintain Gaussian semantic consistency). A new expression is generated via linear combination: \(G=\{\mu_i + B_i k_i \mid i \in \{\theta,\phi,\alpha,\sigma\}, \vec{c}\}\), where \(k_i\) is the coefficient vector for attribute \(i\). High quality is achieved using \(M=50\) principal components. Crucially, after PCA, the basis vectors are refined using photometric loss on training images for 30K steps, with QR decomposition performed every 1K steps to maintain orthogonality, boosting PSNR from 34.75 to 36.85.
  - Design Motivation: Color is excluded from PCA because its variation can cause semantic drift in the Gaussians (e.g., the same Gaussian representing lips in one frame and teeth in another), violating the consistency assumption required for PCA. Per-modality separation allows independent control over the compression rate of each attribute.
- Gaussian Map Generator (CNN):
  - Function: Generate high-quality frame-by-frame Gaussian point clouds for PCA distillation.
  - Mechanism: Improve the Animatable Gaussians architecture by merging the dual StyleUNet into a single network, reducing the number of convolutional layers, and operating in the FLAME UV space. Deformation gradients are used to handle the transformation from the canonical space to the deformed space. Gaussians are organized as 2D maps (where each pixel represents a Gaussian); a resolution of \(256^2\) yields approximately 60,000 active Gaussians. This CNN captures expression-dependent appearance variations (wrinkles, self-shadows), setting the upper bound of distillation quality.
  - Design Motivation: Organizing the Gaussians in UV space ensures temporal consistency and point-to-point correspondence across frames (Gaussians at the same UV coordinates share the same semantics), which is a prerequisite for PCA.
- Image-based Regressor:
  - Function: Predict GEM coefficients directly from a single RGB image, bypassing traditional 3DMM tracking.
  - Mechanism: Utilize the intermediate features of a pre-trained EMOCA network (expression features \(f_{expr}\) and shape features \(f_{shape}\)) and discard the final layer to obtain \(\mathbf{f} \in \mathbb{R}^{2 \times 1024}\). Perform PCA dimensionality reduction on the feature representations of the training frames and select the top 50 components to obtain coefficients \(\kappa\). A small MLP (3 layers, 256 hidden units) then maps this to the GEM coefficients: \(\mathbf{k}=3 \cdot \sigma_k \cdot \tanh(\text{MLP}(\kappa))\), where the tanh activation constrains the output within \([-3\sigma_k, 3\sigma_k]\) to prevent out-of-bounds errors. Relative features (obtained by subtracting the neutral expression reference frame) are used to accomplish cross-identity expression transfer.
  - Design Motivation: GEM does not rely on FLAME, thus traditional 3DMM tracking cannot be used to obtain the driving parameters. Directly regressing the coefficients from the image bypasses the tracking step, which is both faster (real-time) and prevents 3DMM tracking error propagation.
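The regressor's bounded output, \(\mathbf{k}=3 \cdot \sigma_k \cdot \tanh(\text{MLP}(\kappa))\), can be sketched as follows. Several details are assumptions rather than facts from the paper: the MLP is read as three linear layers with ReLU activations, \(\sigma_k\) is treated as a per-coefficient standard deviation, the weights are random, and the function names are hypothetical.

```python
import numpy as np

def tiny_mlp(kappa, layers):
    """Plain MLP forward pass; ReLU between layers is an assumption."""
    h = kappa
    for weight, bias in layers[:-1]:
        h = np.maximum(h @ weight + bias, 0.0)
    weight, bias = layers[-1]
    return h @ weight + bias

def bounded_coefficients(mlp_output, sigma_k):
    """k = 3 * sigma_k * tanh(x), so every k stays inside [-3*sigma_k, 3*sigma_k]."""
    return 3.0 * sigma_k * np.tanh(mlp_output)

rng = np.random.default_rng(0)
# 50 input PCA features -> 256 -> 256 -> 50 GEM coefficients (layer layout assumed).
sizes = [50, 256, 256, 50]
layers = [(rng.normal(size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

sigma_k = rng.uniform(0.5, 2.0, size=50)  # stand-in for training-set statistics
kappa = rng.normal(size=50)               # stand-in for reduced EMOCA features

k = bounded_coefficients(tiny_mlp(kappa, layers), sigma_k)
print("all coefficients within 3 sigma:", bool(np.all(np.abs(k) <= 3.0 * sigma_k)))
```

Whatever the MLP emits, the tanh squashes it into \([-1, 1]\) before scaling, so an unusual driving image cannot push the linear model far outside the coefficient range seen during training.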
Loss & Training
CNN model training: \(\mathcal{L}_{Color}=(1-\omega)\mathcal{L}_1+\omega\mathcal{L}_{D-SSIM}+\zeta\mathcal{L}_{VGG}\). The same loss is used for GEM basis vector refinement. The eigenbasis is refined for 30K steps with QR orthogonalization applied every 1K steps. The regressor is trained using 5 training frames from frontal cameras.
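The periodic QR step can be sketched like this. It is a simplified stand-in, not the authors' training loop: random perturbations replace the photometric gradient updates, the basis is tiny, and `reorthonormalize` is a hypothetical helper. Only the schedule (30K steps, QR every 1K) comes from the text above.

```python
import numpy as np

def reorthonormalize(basis):
    """Replace the columns of `basis` with an orthonormal set spanning the same space."""
    q, r = np.linalg.qr(basis)  # reduced QR: q is (D, M), r is (M, M)
    # Flip signs so each column keeps the orientation it had before the QR step.
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs

rng = np.random.default_rng(0)
num_steps, qr_every = 30_000, 1_000

# Start from an orthonormal basis, as PCA would provide.
basis, _ = np.linalg.qr(rng.normal(size=(60, 5)))

for step in range(1, num_steps + 1):
    # Stand-in for one gradient update from the photometric loss.
    basis = basis + 1e-3 * rng.normal(size=basis.shape)
    if step % qr_every == 0:
        basis = reorthonormalize(basis)

gram_error = np.abs(basis.T @ basis - np.eye(basis.shape[1])).max()
print(f"deviation from orthonormality after refinement: {gram_error:.2e}")
```

Unconstrained updates let the basis vectors drift away from orthogonality between QR steps; the decomposition restores an orthonormal set without changing the subspace the vectors span.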
Key Experimental Results
Main Results
| Setup | Metric | GEM | AG (CNN) | GA (No CNN) | INSTA (NeRF) |
|---|---|---|---|---|---|
| Novel View | PSNR↑ (dB) | 33.55 | 32.42 | 31.32 | 27.78 |
| Novel View | LPIPS↓ | 0.068 | 0.071 | 0.079 | 0.123 |
| Novel Expr + View | PSNR↑ (dB) | 32.68 | 29.01 | 28.31 | 27.92 |
| Novel Expr + View | LPIPS↓ | 0.068 | 0.081 | 0.082 | 0.115 |
| Cross-ID | FID↓ | 0.429 | 0.409 | 0.559 | 0.530 |
| Rendering | FPS↑ | 201.7 | 16.5 | 142.7 | 20.6 |
Ablation Study
| Config (#components × texture) | PSNR↑ (dB) | Model Size | FPS↑ |
|---|---|---|---|
| 10 comp × 128² | 31.81 | 7MB | 238 |
| 30 comp × 128² | 34.20 | 20MB | 241 |
| 50 comp × 128² | 34.67 | 34MB | 239 |
| 50 comp × 256² | 34.61 | 138MB | 202 |
| 50 comp × 512² | 35.45 | 553MB | 117 |
| Ours CNN (256²) | 34.99 | 109MB | 36 |
| AG (256²) | 34.40 | 529MB | 17 |
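The model sizes in the table can be roughly reproduced with back-of-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a figure from the paper: every texel of the map is counted as one Gaussian, each Gaussian stores 11 PCA-modeled scalars (3 position, 4 rotation, 3 scale, 1 opacity), values are 32-bit floats, and only the basis vectors are counted (means and colors are ignored).

```python
# Assumed layout: 3 position + 4 rotation + 3 scale + 1 opacity scalars per Gaussian.
FLOATS_PER_GAUSSIAN = 3 + 4 + 3 + 1
BYTES_PER_FLOAT = 4  # float32 storage is an assumption

def basis_size_mib(num_components, map_resolution):
    """Estimated eigenbasis size in MiB for a square Gaussian map."""
    num_gaussians = map_resolution ** 2
    num_bytes = (num_gaussians * FLOATS_PER_GAUSSIAN
                 * num_components * BYTES_PER_FLOAT)
    return num_bytes / 2 ** 20

# (components, resolution) -> model size in MB as reported in the ablation table.
reported = {(10, 128): 7, (30, 128): 20, (50, 128): 34,
            (50, 256): 138, (50, 512): 553}

estimates = {config: basis_size_mib(*config) for config in reported}
for (components, resolution), size in estimates.items():
    print(f"{components} comp x {resolution}^2: "
          f"estimated {size:.1f} MiB, reported {reported[(components, resolution)]} MB")
```

Under these assumptions the estimates land within a few percent of the reported sizes, which is consistent with the size being dominated by the basis and growing linearly in both the number of components and the number of Gaussians.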
Key Findings
- GEM quality surpasses the CNN baseline: On novel expressions + views, GEM achieves 32.68 dB vs 29.01 dB for AG. This is because analysis-by-synthesis optimizes the coefficients directly and therefore more accurately, avoiding 3DMM tracking errors.
- Extreme compression: With only 10 principal components, the model size is just 7MB while PSNR stays at 31.81 dB, small enough for memory-constrained devices.
- Large speedup: 201 fps vs 17-36 fps for the CNN models, since generating the Gaussians at inference reduces to one matrix-vector product per attribute instead of a CNN forward pass.
- Fixed color is key to success: Fixing the color prevents Gaussian semantic drift and ensures the validity of PCA.
- The refinement step improves PSNR from 34.75 to 36.85, and QR orthogonalization guarantees the quality of the basis vectors.
- 30K steps of GEM refinement do not lead to overfitting, showing performance improvements even on the test set.
Highlights & Insights
- The elegance and simplicity of PCA distillation: By replacing complex CNNs with classical linear algebra tools (PCA + linear combination), this approach achieves a win-win in both quality and efficiency. It instantiates the paradigm of "training complex models, deploying simple ones."
- The counter-intuitive result of GEM surpassing the original CNN: The distilled student model actually yields head quality superior to the teacher model because the precise optimization of analysis-by-synthesis avoids the error propagation of 3DMM tracking. This suggests that the bottleneck might reside in the quality of the driving signal rather than the model's expressive capacity.
- Flexible trade-offs between quality and size: Using 10, 30, or 50 principal components corresponds to sizes of 7, 20, or 34 MB, enabling deployment under various hardware constraints. Such continuous scalability is not possible with typical CNN models.
- Zero neural networks during inference: Similar to GASP, inference relies entirely on linear algebra computations and can run at 67 fps on a CPU (i9-13900K), demonstrating immense practical value.
Limitations & Future Work
- Relies on high-quality multi-view videos to train the CNN teacher model, posing a high data acquisition barrier.
- The fixed-topology UV space restricts the modeling of large accessories (e.g., large hats).
- Colors are treated as global parameters and do not participate in animation, which might cause unnatural rendering under varying illumination.
- PCA is a linear model; thus, the expressiveness for extreme expressions might be insufficient (though no obvious degradation was observed in the experiments).
- The FID for cross-identity reenactment slightly trails the CNN method (0.429 vs 0.409), indicating room for improvement in the regressor.
- Extending the method to full-body avatars remains unexplored.
Related Work & Insights
- vs Gaussian Avatars (GA): GA binds Gaussians to the FLAME mesh. It requires no CNN but lacks expression-dependent appearance changes (e.g., fine wrinkles). GEM retains CNN-learned dynamic appearances via PCA bases while running even faster than GA.
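- vs Animatable Gaussians (AG): AG is the basis of GEM's teacher model (an improved variant is trained for distillation) and requires CNN inference (17 fps, 529MB). After distillation, GEM with 50 components reaches 239 fps at 34MB (128² maps) or 202 fps at 138MB (256² maps), with superior quality on novel expressions.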
- vs 3D Gaussian Blendshapes: Also utilizes linear interpolation to steer Gaussians but depends on 3DMM expression coefficients. GEM employs an independent eigenbasis and does not require a 3DMM during inference.
- Complementarity with model compression methods: Compression techniques like Compact3D can be further applied to GEM to reduce storage requirements.
Rating
- Novelty: ⭐⭐⭐⭐ The idea of PCA-distilled Gaussian Avatars is intuitive yet elegantly executed; the per-modality separation and fixed color design offer strong insights.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across three settings (novel view, novel expression, cross-identity), complete compression ablation studies, speed analysis, and a real-time demo.
- Writing Quality: ⭐⭐⭐⭐ Structure is clear, the analogy with 3DMM is highly appropriate and easy to understand.
- Value: ⭐⭐⭐⭐⭐ An extremely lightweight, high-quality avatar solution; 7MB+ at 200+ fps holds massive practical value for AR/VR applications.