FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation

Conference: ICLR 2026 arXiv: 2601.13837 Code: To be confirmed Area: 3D Vision / Head Reconstruction Keywords: 3D Gaussian Splatting, head avatar, few-shot, real-time animation, feed-forward

TL;DR

This paper proposes FastGHA, a feed-forward framework for few-shot 3D Gaussian head avatars: from 4 input images of arbitrary expressions and viewpoints, it reconstructs an animatable 3D Gaussian head in ~1 second and supports real-time animation at 62 FPS. On Ava-256 it achieves 22.5 dB PSNR, surpassing Avat3r's 20.7 dB while animating 7.75× faster.

Background & Motivation

Background: Methods for 3D head avatar generation fall into two categories: optimization-based and feed-forward. Optimization-based approaches (e.g., per-identity fitting) require large amounts of multi-view data and lengthy per-subject optimization, making them impractical for rapid deployment. Feed-forward methods (Avat3r, GPAvatar) can generate avatars from sparse images, but they either lack controllable animation, animate slowly (Avat3r reaches only 8 FPS), or deliver limited reconstruction quality.

Limitations of Prior Work: (a) Avat3r injects geometric priors through skip connections, so geometric errors propagate directly into the final output; (b) existing methods struggle to achieve accurate expression transfer (low AKD) and strong identity preservation (high CSIM) at the same time; (c) there is a persistent trade-off between animation speed and quality, with high-quality methods typically being slow.

Key Insight: decouple reconstruction from animation in a two-stage design. First, feed-forward reconstruction of a canonical Gaussian head from the few-shot images (with learned per-Gaussian features); second, expression-driven deformation via a lightweight MLP for fast animation.

Core Idea: A multi-view Transformer based on SD-Turbo VAE and DINOv3 features reconstructs a canonical Gaussian head, which is then animated in real time via per-Gaussian learned features and a lightweight deformation MLP.

Method

Overall Architecture

Stage 1: 4 input images → SD-Turbo VAE extracts color features + DINOv3 extracts semantic features + Plücker ray encoding for camera pose → multi-view Transformer aggregates cross-view information → modified VAE decoder outputs per-pixel Gaussian parameters → fused into canonical Gaussian head \(\mathcal{G}^c_f\) (with 32-dimensional per-Gaussian features).
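Below is a minimal PyTorch sketch of this Stage 1 flow. The SD-Turbo VAE encoder and DINOv3 are stubbed with plain convolutions, and every module name, dimension, and layer count here is an illustrative assumption rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

RES, PATCH = 256, 16              # illustrative resolution / patch size
FEAT_DIM, GAUSS_FEAT = 256, 32    # 32-dim learned per-Gaussian feature (paper)
GAUSS_PARAMS = 3 + 3 + 4 + 3 + 1  # position, color, rotation (quat), scale, opacity

class CanonicalReconstructor(nn.Module):
    """Stage 1: few-shot images -> per-pixel canonical Gaussians + features."""
    def __init__(self):
        super().__init__()
        # Stand-ins for the frozen SD-Turbo VAE encoder and DINOv3 features.
        self.color_enc = nn.Conv2d(3, FEAT_DIM // 2, PATCH, stride=PATCH)
        self.semantic_enc = nn.Conv2d(3, FEAT_DIM // 2, PATCH, stride=PATCH)
        self.ray_enc = nn.Conv2d(6, FEAT_DIM, PATCH, stride=PATCH)  # Plücker rays
        layer = nn.TransformerEncoderLayer(FEAT_DIM, nhead=8, batch_first=True)
        self.multiview_transformer = nn.TransformerEncoder(layer, num_layers=4)
        # Stand-in for the fine-tuned VAE decoder emitting Gaussian params + features.
        self.decoder = nn.Conv2d(FEAT_DIM, GAUSS_PARAMS + GAUSS_FEAT, 1)

    def forward(self, images, rays):
        # images: (V, 3, RES, RES); rays: (V, 6, RES, RES) Plücker encodings
        feats = torch.cat([self.color_enc(images), self.semantic_enc(images)], 1)
        feats = feats + self.ray_enc(rays)           # inject per-view camera pose
        v, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2).reshape(1, v * h * w, c)
        tokens = self.multiview_transformer(tokens)  # cross-view aggregation
        feats = tokens.reshape(v, h * w, c).transpose(1, 2).reshape(v, c, h, w)
        out = self.decoder(feats)                    # per-pixel Gaussian parameters
        return out.flatten(2).transpose(1, 2).reshape(-1, GAUSS_PARAMS + GAUSS_FEAT)

model = CanonicalReconstructor()
gaussians = model(torch.randn(4, 3, RES, RES), torch.randn(4, 6, RES, RES))
print(gaussians.shape)  # (4 * 16 * 16, 46): the fused canonical Gaussian head
```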

Stage 2: Canonical Gaussian head + FLAME expression encoding → lightweight MLP processes each Gaussian independently → outputs position and color offsets \(\delta_z\) → deformed Gaussians rendered via differentiable rasterization.

Key Designs

  1. SD-Turbo VAE as Backbone: The encoder is frozen to retain pretrained high-level semantic features; the decoder is fine-tuned to generate Gaussian parameters. This yields a +0.5 dB PSNR improvement over training from scratch.

  2. Per-Gaussian Learned Features \(\mathbf{f} \in \mathbb{R}^{32}\): In addition to standard Gaussian attributes (position/color/rotation/scale/opacity), each Gaussian is assigned a 32-dimensional semantic feature encoding expression-relevant high-level information, which is fed into the deformation MLP. Removing this component reduces PSNR by 0.22 and CSIM by 0.014.

  3. VGGT Geometry Regularization: Point clouds generated by a pretrained VGGT model serve as geometric supervision (depth loss \(\mathcal{L}_{geo}\)), replacing Avat3r's skip-connection approach and thereby avoiding error propagation.

  4. Lightweight Deformation MLP: Each Gaussian point is processed independently (highly parallelizable), taking canonical attributes and the FLAME expression code as input and producing position and color offsets; a minimal sketch follows this list.
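A minimal sketch of the deformation step (item 4), assuming the MLP consumes the canonical attributes, the 32-dimensional learned feature, and a FLAME expression code for each Gaussian; the hidden width and the 100-dimensional expression code are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn

class DeformationMLP(nn.Module):
    """Stage 2: per-Gaussian offsets; no cross-Gaussian interaction."""
    def __init__(self, gauss_dim=14 + 32, expr_dim=100, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(gauss_dim + expr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 3),  # position offset + color offset
        )

    def forward(self, gaussians, expr_code):
        # gaussians: (N, 46) canonical attributes + 32-dim learned feature
        # expr_code: (expr_dim,) FLAME expression vector, shared by all Gaussians
        x = torch.cat([gaussians, expr_code.expand(gaussians.shape[0], -1)], -1)
        delta = self.net(x)  # one batched pass over all N Gaussians
        return delta[:, :3], delta[:, 3:]

# Usage: offsets for every Gaussian in a single parallel forward pass.
mlp = DeformationMLP()
dpos, dcolor = mlp(torch.randn(100_000, 46), torch.randn(100))
```

Because each Gaussian is just one row of a batched matrix multiply, animation cost scales linearly with the Gaussian count and maps cleanly onto the GPU, which is what makes the 62 FPS figure possible.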

Loss & Training

\[\mathcal{L} = \mathcal{L}_{RGB} + \mathcal{L}_{SSIM} + 0.5\mathcal{L}_{perc} + \mathcal{L}_{sil} + 0.5\mathcal{L}_{geo}\]
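Read literally as code (a trivial sketch; each term is assumed to be a precomputed scalar, and the exact form of \(\mathcal{L}_{geo}\), e.g. an L1 depth loss against the VGGT point cloud, is an assumption):

```python
def total_loss(l_rgb, l_ssim, l_perc, l_sil, l_geo):
    # Weights as written in the objective above.
    return l_rgb + l_ssim + 0.5 * l_perc + l_sil + 0.5 * l_geo
```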

Training data: Ava-256 (256 subjects / 40 cameras) + NeRSemble (425 subjects / 16 cameras). Each sample consists of 4 images of the same subject with different expressions/viewpoints as input, and 8 same-expression images as supervision. Trained on 4× H800 GPUs for 400k steps (~4 days).
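A sketch of how one training example might be assembled under this protocol; the nested `frames[subject][expression] -> list of per-camera images` layout is a hypothetical stand-in for the actual dataset API.

```python
import random

def sample_training_example(frames, subject):
    """Returns 4 input images (any expression/camera) and 8 supervision
    images of one target expression, following the paper's protocol."""
    exprs = list(frames[subject])
    # 4 inputs: arbitrary expression/viewpoint combinations of one subject
    inputs = [random.choice(frames[subject][random.choice(exprs)]) for _ in range(4)]
    # 8 supervision images: a single target expression from 8 different cameras
    target_expr = random.choice(exprs)
    targets = random.sample(frames[subject][target_expr], k=8)
    return inputs, target_expr, targets
```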

Key Experimental Results

Main Results

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | CSIM↑ | AKD↓ | FPS |
|--------------|------|------|------|------|------|----|
| InvertAvatar | 14.2 | 0.36 | 0.55 | 0.29 | 15.8 | -  |
| GPAvatar     | 19.1 | 0.70 | 0.32 | 0.26 | 6.9  | -  |
| Avat3r       | 20.7 | 0.71 | 0.33 | 0.59 | 4.8  | 8  |
| FastGHA      | 22.5 | 0.77 | 0.23 | 0.73 | 4.8  | 62 |

FastGHA outperforms Avat3r across the board: PSNR +1.8 dB, LPIPS −0.10, CSIM +0.14, and 7.75× the animation FPS, while matching its AKD.

Ablation Study

| Configuration | PSNR↑ | CSIM↑ | AKD↓ |
|-----------------------------|--------|-------|-------|
| w/o VAE pretrained weights  | 20.789 | 0.681 | 5.487 |
| w/o geometry loss           | 21.132 | 0.687 | 5.049 |
| w/o per-Gaussian features   | 21.053 | 0.690 | 5.216 |
| Full FastGHA                | 21.274 | 0.704 | 4.996 |

Key Findings

  • Pretrained VAE weights are the most critical component: removing them reduces PSNR by 0.49 and CSIM by 0.023.
  • Sub-second reconstruction: only 0.98 seconds for 4 input images.
  • Trade-off with number of input images: 2 images → 128 FPS but lower quality; 6 images → 32 FPS with marginal quality gain. 4 images represents the optimal balance.
  • Strong performance on NeRSemble as well: PSNR 24.0, SSIM 0.81.

Highlights & Insights

  • Correct use of geometric priors: Employing geometric information as a regularization loss rather than via skip connections avoids the error propagation issue present in Avat3r. This constitutes a general design principle.
  • Per-Gaussian semantic features: The 32-dimensional learned features enable the deformation MLP to leverage high-level information beyond low-level geometric attributes, yielding significant gains at minimal cost.
  • Key to real-time animation: The deformation MLP processes each Gaussian independently (requiring no cross-Gaussian interaction), making the operation fully parallelizable.

Limitations & Future Work

  • Camera parameters and FLAME expression codes must be obtained in advance, which may become a bottleneck in practical applications.
  • Training and evaluation are conducted exclusively on laboratory-captured multi-view datasets; robustness to in-the-wild inputs such as low-quality selfies has not been validated.
  • Fine-grained modeling of hair and accessories is not supported, due to inherent limitations of the Gaussian representation.
  • The deformation MLP processes each Gaussian independently, lacking global consistency constraints.

Comparison with Prior Methods

  • vs. Avat3r: Avat3r is also feed-forward but injects geometric priors through skip connections, which propagates errors and limits animation to 8 FPS. FastGHA replaces the skip connections with depth supervision and reaches 62 FPS.
  • vs. GPAvatar: GPAvatar exhibits poor identity preservation (CSIM 0.26 vs. FastGHA's 0.73), attributable to the absence of strong semantic feature extraction.

Rating

  • Novelty: ⭐⭐⭐⭐ The two-stage design and the per-Gaussian feature idea are clear and effective, though no single component is groundbreaking on its own.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Two datasets, multiple baselines, comprehensive ablations, and speed analysis.
  • Writing Quality: ⭐⭐⭐⭐ The pipeline is described clearly, though the motivation for certain design choices could be elaborated further.
  • Value: ⭐⭐⭐⭐ First to achieve few-shot + real-time animation for 3D Gaussian head avatars, with high practical utility.