FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation¶
Conference: ICLR 2026 arXiv: 2601.13837 Code: To be confirmed Area: 3D Vision / Head Reconstruction Keywords: 3D Gaussian Splatting, head avatar, few-shot, real-time animation, feed-forward
TL;DR¶
This paper proposes FastGHA, a feed-forward few-shot 3D Gaussian head avatar generation framework that reconstructs an animatable 3D Gaussian head from four input images of arbitrary expression and viewpoint in ~1 second, and supports real-time animation at 62 FPS. On Ava-256, it achieves a PSNR of 22.5 dB, surpassing Avat3r's 20.7 dB while animating 7.75× faster.
Background & Motivation¶
Background: Methods for 3D head avatar generation fall into two categories: optimization-based and feed-forward. Optimization-based approaches (e.g., per-identity fitting) require large amounts of multi-view data and lengthy optimization, making them unsuitable for real-time deployment. Feed-forward methods (Avat3r, GPAvatar) can generate avatars from sparse images, but either lack controllable animation, suffer from slow animation speeds (Avat3r achieves only 8 FPS), or produce limited reconstruction quality.
Limitations of Prior Work: (a) Avat3r employs skip connections with geometric priors, so geometric errors propagate directly into the final output; (b) existing methods struggle to simultaneously achieve accurate expression transfer (measured by average keypoint distance, AKD) and identity preservation (measured by cosine identity similarity, CSIM); (c) there is a persistent trade-off between animation speed and quality, with high-quality methods typically being slow.
Key Insight: A two-stage design: feed-forward reconstruction of a canonical Gaussian head from few-shot images (with learned per-Gaussian features), followed by expression-driven deformation via a lightweight MLP for fast animation.
Core Idea: A multi-view Transformer based on SD-Turbo VAE and DINOv3 features reconstructs a canonical Gaussian head, which is then animated in real time via per-Gaussian learned features and a lightweight deformation MLP.
Method¶
Overall Architecture¶
Stage 1: 4 input images → SD-Turbo VAE extracts color features + DINOv3 extracts semantic features + Plücker ray encoding for camera pose → multi-view Transformer aggregates cross-view information → modified VAE decoder outputs per-pixel Gaussian parameters → fused into canonical Gaussian head \(\mathcal{G}^c_f\) (with 32-dimensional per-Gaussian features).
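Plücker ray encoding is a standard way to inject camera pose into per-pixel features; a minimal PyTorch sketch under common conventions (function and variable names are illustrative, not from the paper):

```python
import torch

def plucker_ray_encoding(K, c2w, H, W):
    """Per-pixel Plücker coordinates (d, o × d) for a pinhole camera.

    K:   (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose.
    Returns a (H, W, 6) embedding to concatenate with image features.
    """
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # Unproject pixel centers to camera-space ray directions.
    dirs = torch.stack(
        [(xs + 0.5 - K[0, 2]) / K[0, 0],
         (ys + 0.5 - K[1, 2]) / K[1, 1],
         torch.ones_like(xs)], dim=-1)                  # (H, W, 3)
    dirs = dirs @ c2w[:3, :3].T                         # rotate to world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)       # unit direction d
    origins = c2w[:3, 3].expand_as(dirs)                # camera center o
    moment = torch.cross(origins, dirs, dim=-1)         # moment o × d
    return torch.cat([dirs, moment], dim=-1)            # (H, W, 6)
```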
Stage 2: Canonical Gaussian head + FLAME expression encoding → lightweight MLP processes each Gaussian independently → outputs position and color offsets \(\delta_z\) → deformed Gaussians rendered via differentiable rasterization.
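A minimal sketch of what the Stage 2 deformer could look like; the layer widths, FLAME expression-code size, and all names are assumptions, since the paper only specifies a lightweight MLP operating on each Gaussian independently:

```python
import torch
import torch.nn as nn

class DeformationMLP(nn.Module):
    """Predicts per-Gaussian position and color offsets from the canonical
    attributes, the 32-d learned feature, and a FLAME expression code.
    Widths and the expression-code size are illustrative assumptions."""

    def __init__(self, feat_dim=32, expr_dim=100, hidden=128):
        super().__init__()
        in_dim = 3 + 3 + feat_dim + expr_dim  # position + color + feature + expression
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),  # delta position (3) + delta color (3)
        )

    def forward(self, xyz, rgb, feat, expr):
        # Broadcast the expression code to all N Gaussians; each row is
        # processed independently, so the whole pass is a batched matmul.
        expr = expr.expand(xyz.shape[0], -1)
        delta = self.net(torch.cat([xyz, rgb, feat, expr], dim=-1))
        return xyz + delta[:, :3], rgb + delta[:, 3:]
```

Because no cross-Gaussian interaction is required, animating hundreds of thousands of Gaussians reduces to a few batched matrix multiplications, which is what makes 62 FPS attainable.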
Key Designs¶
- SD-Turbo VAE as Backbone: The encoder is frozen to retain pretrained high-level semantic features; the decoder is fine-tuned to generate Gaussian parameters (a loading sketch follows this list). This yields a +0.5 dB PSNR improvement over training from scratch.
- Per-Gaussian Learned Features \(\mathbf{f} \in \mathbb{R}^{32}\): In addition to standard Gaussian attributes (position/color/rotation/scale/opacity), each Gaussian is assigned a 32-dimensional semantic feature encoding expression-relevant high-level information, which is fed into the deformation MLP (see the Stage 2 sketch above). Removing this component reduces PSNR by 0.22 and CSIM by 0.014.
- VGGT Geometry Regularization: Point clouds generated by a pretrained VGGT model serve as geometric supervision (depth loss \(\mathcal{L}_{geo}\), sketched after this list), replacing Avat3r's skip-connection approach and thereby avoiding error propagation.
- Lightweight Deformation MLP: Each Gaussian point is processed independently (highly parallelizable), taking canonical attributes and the FLAME expression code as input and producing position and color offsets.
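The frozen-encoder / fine-tuned-decoder split is straightforward to express; a sketch assuming a diffusers-style `AutoencoderKL` (the actual SD-Turbo VAE class and the decoder's modified Gaussian-parameter head are assumptions):

```python
from diffusers import AutoencoderKL

# Load a pretrained VAE; the checkpoint id is illustrative.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-turbo", subfolder="vae")

# Freeze the encoder to keep its pretrained semantic features intact...
vae.encoder.requires_grad_(False)

# ...and fine-tune only the decoder, whose output head would be modified
# to emit per-pixel Gaussian parameters instead of RGB.
trainable_params = list(vae.decoder.parameters())
```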
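And a minimal sketch of the geometry-regularization idea, assuming an L1 depth loss against the VGGT prediction (the paper's exact formulation of \(\mathcal{L}_{geo}\) may differ):

```python
import torch

def geometry_loss(rendered_depth, vggt_depth, valid_mask):
    """L1 depth regularization against a frozen VGGT prediction.

    Supervision only, no skip connection: gradients shape the Gaussians,
    but VGGT errors cannot leak directly into the rendered image.
    """
    diff = (rendered_depth - vggt_depth).abs()
    return (diff * valid_mask).sum() / valid_mask.sum().clamp(min=1)
```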
Loss & Training¶
Training data: Ava-256 (256 subjects / 40 cameras) + NeRSemble (425 subjects / 16 cameras). Each sample consists of 4 images of the same subject with different expressions/viewpoints as input, and 8 same-expression images as supervision. Trained on 4× H800 GPUs for 400k steps (~4 days).
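The full objective is not spelled out here; a hedged sketch of a plausible composite loss, combining the photometric and perceptual terms typical of this line of work with the geometry term \(\mathcal{L}_{geo}\) from above (the weights and the exact set of terms are assumptions, not from the paper):

```python
import torch.nn.functional as F

def training_loss(pred, target, l_geo, lpips_fn, w_lpips=0.1, w_geo=0.05):
    """Assumed composite objective: L1 photometric + LPIPS + geometry term.

    l_geo is e.g. the output of geometry_loss() sketched above; lpips_fn is
    a perceptual metric (e.g., the `lpips` package). Weights are placeholders.
    """
    l_rgb = F.l1_loss(pred, target)
    return l_rgb + w_lpips * lpips_fn(pred, target).mean() + w_geo * l_geo
```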
Key Experimental Results¶
Main Results¶
| Method | PSNR↑ | SSIM↑ | LPIPS↓ | CSIM↑ | AKD↓ | FPS |
|---|---|---|---|---|---|---|
| InvertAvatar | 14.2 | 0.36 | 0.55 | 0.29 | 15.8 | - |
| GPAvatar | 19.1 | 0.70 | 0.32 | 0.26 | 6.9 | - |
| Avat3r | 20.7 | 0.71 | 0.33 | 0.59 | 4.8 | 8 |
| FastGHA | 22.5 | 0.77 | 0.23 | 0.73 | 4.8 | 62 |
FastGHA outperforms Avat3r across the board: PSNR +1.8 dB, LPIPS −0.10, CSIM +0.14, and 7.75× the animation speed (62 vs. 8 FPS) at equal AKD.
Ablation Study¶
| Configuration | PSNR | CSIM | AKD |
|---|---|---|---|
| w/o VAE pretrained weights | 20.789 | 0.681 | 5.487 |
| w/o geometry loss | 21.132 | 0.687 | 5.049 |
| w/o per-Gaussian features | 21.053 | 0.690 | 5.216 |
| Full FastGHA | 21.274 | 0.704 | 4.996 |
Key Findings¶
- Pretrained VAE weights are the most critical component: removing them reduces PSNR by 0.49 and CSIM by 0.023.
- Sub-second reconstruction: only 0.98 seconds for 4 input images.
- Trade-off with number of input images: 2 images → 128 FPS but lower quality; 6 images → 32 FPS with marginal quality gain. 4 images represents the optimal balance.
- Strong performance on NeRSemble as well: PSNR 24.0, SSIM 0.81.
Highlights & Insights¶
- Correct use of geometric priors: Employing geometric information as a regularization loss rather than via skip connections avoids the error propagation issue present in Avat3r. This constitutes a general design principle.
- Per-Gaussian semantic features: The 32-dimensional learned features enable the deformation MLP to leverage high-level information beyond low-level geometric attributes, yielding significant gains at minimal cost.
- Key to real-time animation: The deformation MLP processes each Gaussian independently (requiring no cross-Gaussian interaction), making the operation fully parallelizable.
Limitations & Future Work¶
- Camera parameters and FLAME expression codes must be obtained in advance, which may become a bottleneck in practical applications.
- Training and evaluation are conducted exclusively on laboratory-captured multi-view datasets; robustness to in-the-wild inputs such as low-quality selfies has not been validated.
- Fine-grained modeling of hair and accessories is not supported, due to inherent limitations of the Gaussian representation.
- The deformation MLP processes each Gaussian independently, lacking global consistency constraints.
Related Work & Insights¶
- vs. Avat3r: Avat3r is also feed-forward but employs skip connections with geometric priors, leading to error propagation and achieving only 8 FPS. FastGHA replaces skip connections with depth supervision, achieving 62 FPS.
- vs. GPAvatar: GPAvatar exhibits poor identity preservation (CSIM 0.26 vs. 0.73), attributable to the absence of strong semantic feature extraction.
Rating¶
- Novelty: ⭐⭐⭐⭐ The two-stage design and per-Gaussian feature concept are clear and effective, though no single component is groundbreaking on its own.
- Experimental Thoroughness: ⭐⭐⭐⭐ Two datasets, multiple baselines, comprehensive ablations, and speed analysis.
- Writing Quality: ⭐⭐⭐⭐ The pipeline is described clearly, though the motivation for certain design choices could be elaborated further.
- Value: ⭐⭐⭐⭐ First to achieve few-shot + real-time animation for 3D Gaussian head avatars, with high practical utility.