FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation

Conference: ICLR 2026
arXiv: 2601.13837
Code: To be confirmed
Area: 3D Vision / Head Reconstruction
Keywords: 3D Gaussian Splatting, head avatar, few-shot, real-time animation, feed-forward

TL;DR

This paper proposes FastGHA, a feed-forward, few-shot framework that reconstructs an animatable 3D Gaussian head avatar from four input images with arbitrary expressions and viewpoints in about 1 second, and then animates it in real time at 62 FPS. On Ava-256, it reaches a PSNR of 22.5 dB versus 20.7 dB for Avat3r, while animating 7.75× faster (62 vs. 8 FPS).

Background & Motivation

Background: Methods for 3D head avatar generation fall into two categories: optimization-based and feed-forward. Optimization-based approaches (e.g., per-identity fitting) require large amounts of multi-view data and lengthy optimization, making them unsuitable for real-time deployment. Feed-forward methods (Avat3r, GPAvatar) can generate avatars from sparse images, but either lack controllable animation, suffer from slow animation speeds (Avat3r achieves only 8 FPS), or produce limited reconstruction quality.

Limitations of Prior Work: (a) Avat3r employs skip connections with geometric priors, causing geometric errors to propagate directly into the final output; (b) existing methods struggle to simultaneously achieve accurate expression transfer (AKD) and identity preservation (CSIM); (c) there is a persistent trade-off between animation speed and quality, with high-quality methods typically being slow.

Key Insight: A two-stage design. First, a feed-forward network reconstructs a canonical Gaussian head (with learned per-Gaussian features) from few-shot images; second, a lightweight MLP applies expression-driven deformation for fast animation.

Core Idea: A multi-view Transformer based on SD-Turbo VAE and DINOv3 features reconstructs a canonical Gaussian head, which is then animated in real time via per-Gaussian learned features and a lightweight deformation MLP.

Method

Overall Architecture

Stage 1: 4 input images → SD-Turbo VAE extracts color features + DINOv3 extracts semantic features + Plücker ray encoding for camera pose → multi-view Transformer aggregates cross-view information → modified VAE decoder outputs per-pixel Gaussian parameters → fused into canonical Gaussian head \(\mathcal{G}^c_f\) (with 32-dimensional per-Gaussian features).

Stage 2: Canonical Gaussian head + FLAME expression encoding → lightweight MLP processes each Gaussian independently → outputs position and color offsets \(\delta_z\) → deformed Gaussians rendered via differentiable rasterization.
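
The Plücker ray encoding used for camera pose in Stage 1 can be illustrated with a minimal sketch. It assumes the common (direction, moment) ordering with a unit-length direction; the notes above do not state which convention the paper uses, and the function name `plucker_ray` is hypothetical.

```python
import math

def plucker_ray(origin, direction):
    """Return the 6-D Plücker coordinates (d, o x d) of a ray.

    `origin` is the camera center and `direction` the per-pixel ray
    direction, both 3-tuples in world space. The (direction, moment)
    ordering is an assumed convention, not taken from the paper.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("ray direction must be non-zero")
    dx, dy, dz = (c / norm for c in direction)
    ox, oy, oz = origin
    # Moment vector m = o x d; it is unchanged if o slides along the ray.
    moment = (oy * dz - oz * dy, oz * dx - ox * dz, ox * dy - oy * dx)
    return (dx, dy, dz) + moment

ray = plucker_ray(origin=(1.0, 0.0, 0.0), direction=(0.0, 0.0, 2.0))
same_ray = plucker_ray(origin=(1.0, 0.0, 5.0), direction=(0.0, 0.0, 2.0))
```

Because the moment does not depend on where along the ray the origin sits, `ray` and `same_ray` are identical, which is what makes this a per-pixel description of the ray itself rather than of a particular point on it.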

Key Designs

SD-Turbo VAE as Backbone: The encoder is frozen to retain pretrained high-level semantic features; the decoder is fine-tuned to generate Gaussian parameters. This yields a +0.5 dB PSNR improvement over training from scratch.
Per-Gaussian Learned Features \(\mathbf{f} \in \mathbb{R}^{32}\): In addition to standard Gaussian attributes (position/color/rotation/scale/opacity), each Gaussian is assigned a 32-dimensional semantic feature encoding expression-relevant high-level information, which is fed into the deformation MLP. Removing this component reduces PSNR by 0.22 and CSIM by 0.014.
VGGT Geometry Regularization: Point clouds generated by a pretrained VGGT model serve as geometric supervision (depth loss \(\mathcal{L}_{geo}\)), replacing Avat3r's skip-connection approach and thereby avoiding error propagation.
Lightweight Deformation MLP: Each Gaussian point is processed independently (highly parallelizable), taking canonical attributes and the FLAME expression code as input and producing position and color offsets.
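
The per-Gaussian deformation described above can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the authors' implementation: the expression code size (`EXPR_DIM`), the hidden width, the single hidden layer, and the choice to feed only position, color, and the 32-dimensional feature are all hypothetical simplifications.

```python
import random

FEATURE_DIM = 32   # per-Gaussian learned feature size (from the text)
EXPR_DIM = 16      # hypothetical FLAME expression code size
HIDDEN_DIM = 8     # hypothetical hidden width

def init_linear(n_in, n_out, rng, scale=0.1):
    """Random weight matrix (n_out rows of n_in values) and zero bias."""
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(layer, x):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def deform_gaussian(gaussian, expr, layers):
    """Apply the MLP to one Gaussian; no other Gaussian is read."""
    x = gaussian["position"] + gaussian["color"] + gaussian["feature"] + expr
    hidden = [max(0.0, h) for h in linear(layers[0], x)]  # ReLU
    offsets = linear(layers[1], hidden)  # 3 position + 3 color offsets
    return {
        "position": [p + d for p, d in zip(gaussian["position"], offsets[:3])],
        "color": [c + d for c, d in zip(gaussian["color"], offsets[3:])],
        "feature": gaussian["feature"],
    }

rng = random.Random(0)
layers = [init_linear(3 + 3 + FEATURE_DIM + EXPR_DIM, HIDDEN_DIM, rng),
          init_linear(HIDDEN_DIM, 6, rng)]
canonical = [{"position": [rng.random() for _ in range(3)],
              "color": [rng.random() for _ in range(3)],
              "feature": [rng.random() for _ in range(FEATURE_DIM)]}
             for _ in range(5)]
expr = [rng.random() for _ in range(EXPR_DIM)]
# Each call depends on a single Gaussian, so the loop maps directly to a
# batched GPU kernel in a real system.
deformed = [deform_gaussian(g, expr, layers) for g in canonical]
```

Since `deform_gaussian` reads only its own Gaussian and the shared expression code, the per-frame cost grows linearly with the number of Gaussians and involves no cross-Gaussian communication.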

Loss & Training

\[\mathcal{L} = \mathcal{L}_{RGB} + \mathcal{L}_{SSIM} + 0.5\mathcal{L}_{perc} + \mathcal{L}_{sil} + 0.5\mathcal{L}_{geo}\]
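
As a small sketch of how the weights in this objective combine, the snippet below sums already-computed scalar terms. The dictionary keys and the example values are illustrative assumptions; only the weights come from the formula above.

```python
# Weights taken from the loss formula; the key names are hypothetical.
LOSS_WEIGHTS = {"rgb": 1.0, "ssim": 1.0, "perc": 0.5, "sil": 1.0, "geo": 0.5}

def total_loss(terms):
    """Weighted sum of already-computed scalar loss terms."""
    missing = set(LOSS_WEIGHTS) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(LOSS_WEIGHTS[name] * terms[name] for name in LOSS_WEIGHTS)

# Made-up per-term values, purely to show the weighting.
example = total_loss(
    {"rgb": 0.04, "ssim": 0.10, "perc": 0.20, "sil": 0.02, "geo": 0.06}
)
```

With these made-up values the perceptual and geometry terms contribute half their raw magnitude, giving a total of about 0.29.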

Training data: Ava-256 (256 subjects / 40 cameras) + NeRSemble (425 subjects / 16 cameras). Each sample consists of 4 images of the same subject with different expressions/viewpoints as input, and 8 same-expression images as supervision. Trained on 4× H800 GPUs for 400k steps (~4 days).
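
The sample construction described above (4 mixed inputs, 8 same-expression targets) might look like the following sketch. The data layout (`frames` mapping an expression id to camera ids) and the function name are hypothetical, and the sketch does not exclude input views from the targets, since the notes do not say whether the paper does.

```python
import random

def sample_training_example(frames, rng, n_inputs=4, n_targets=8):
    """Pick input and supervision views for one subject.

    `frames` maps a (hypothetical) expression id to the list of camera ids
    available for it. Inputs mix expressions and viewpoints; all targets
    share one expression so a single expression code drives them.
    """
    all_views = [(expr, cam) for expr, cams in frames.items() for cam in cams]
    inputs = rng.sample(all_views, n_inputs)
    eligible = [expr for expr, cams in frames.items() if len(cams) >= n_targets]
    target_expr = rng.choice(eligible)
    target_cams = rng.sample(frames[target_expr], n_targets)
    targets = [(target_expr, cam) for cam in target_cams]
    return inputs, targets

# Toy subject: 3 expressions, 16 cameras each (NeRSemble-like camera count).
frames = {expr: list(range(16)) for expr in ("neutral", "smile", "jaw_open")}
inputs, targets = sample_training_example(frames, random.Random(0))
```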

Key Experimental Results

Main Results

Method	PSNR↑	SSIM↑	LPIPS↓	CSIM↑	AKD↓	FPS
InvertAvatar	14.2	0.36	0.55	0.29	15.8	-
GPAvatar	19.1	0.70	0.32	0.26	6.9	-
Avat3r	20.7	0.71	0.33	0.59	4.8	8
FastGHA	22.5	0.77	0.23	0.73	4.8	62

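FastGHA outperforms Avat3r on every reported metric except AKD, where the two tie at 4.8: PSNR +1.8 dB, SSIM +0.06, LPIPS −0.10, CSIM +0.14, and 7.75× the animation frame rate (62 vs. 8 FPS).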
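
The comparison against Avat3r can be re-derived directly from the table rows. The short check below copies the Ava-256 numbers from the main results table and also converts the frame rates into per-frame times; the variable names are illustrative.

```python
# Values copied from the main results table (Ava-256).
avat3r = {"psnr": 20.7, "lpips": 0.33, "csim": 0.59, "akd": 4.8, "fps": 8}
fastgha = {"psnr": 22.5, "lpips": 0.23, "csim": 0.73, "akd": 4.8, "fps": 62}

# FastGHA minus Avat3r; higher is better for PSNR/CSIM, lower for LPIPS/AKD.
deltas = {k: round(fastgha[k] - avat3r[k], 2)
          for k in ("psnr", "lpips", "csim", "akd")}
speedup = fastgha["fps"] / avat3r["fps"]
# Per-frame animation time implied by each frame rate, in milliseconds.
frame_ms = {"avat3r": 1000 / avat3r["fps"], "fastgha": 1000 / fastgha["fps"]}
```

The frame rates correspond to 125 ms per frame for Avat3r and roughly 16 ms per frame for FastGHA, which is where the 7.75× figure comes from.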

Ablation Study

Configuration	PSNR↑	CSIM↑	AKD↓
w/o VAE pretrained weights	20.789	0.681	5.487
w/o geometry loss	21.132	0.687	5.049
w/o per-Gaussian features	21.053	0.690	5.216
Full FastGHA	21.274	0.704	4.996
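
To make the ablation easier to read, the following sketch computes each configuration's change relative to the full model, using only the numbers in the table above. The dictionary layout is an illustrative choice.

```python
full = {"psnr": 21.274, "csim": 0.704, "akd": 4.996}
ablations = {
    "w/o VAE pretrained weights": {"psnr": 20.789, "csim": 0.681, "akd": 5.487},
    "w/o geometry loss": {"psnr": 21.132, "csim": 0.687, "akd": 5.049},
    "w/o per-Gaussian features": {"psnr": 21.053, "csim": 0.690, "akd": 5.216},
}

# Change relative to the full model; negative PSNR/CSIM and positive AKD
# both mean the ablated variant is worse.
changes = {
    name: {metric: round(row[metric] - full[metric], 3) for metric in full}
    for name, row in ablations.items()
}
largest_psnr_drop = min(changes, key=lambda name: changes[name]["psnr"])
for name, change in changes.items():
    print(f"{name:28s} {change}")
```

Every ablated variant is worse than the full model on all three metrics, and removing the pretrained VAE weights produces the largest change in each column.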

Key Findings

Pretrained VAE weights are the most critical component: removing them reduces PSNR by 0.49 and CSIM by 0.023.
Sub-second reconstruction: only 0.98 seconds for 4 input images.
Trade-off with number of input images: 2 images → 128 FPS but lower quality; 6 images → 32 FPS with only a marginal quality gain. Four images offer the best balance between speed and quality.
Strong performance on NeRSemble as well: PSNR 24.0, SSIM 0.81.

Highlights & Insights

Correct use of geometric priors: Employing geometric information as a regularization loss rather than via skip connections avoids the error propagation issue present in Avat3r. The same idea may carry over to other feed-forward reconstruction pipelines.
Per-Gaussian semantic features: The 32-dimensional learned features let the deformation MLP draw on high-level information beyond low-level geometric attributes. In the ablation, removing them costs 0.22 dB PSNR and 0.014 CSIM, so the gain is modest but comes at minimal cost.
Key to real-time animation: The deformation MLP processes each Gaussian independently (requiring no cross-Gaussian interaction), making the operation fully parallelizable.

Limitations & Future Work

Camera parameters and FLAME expression codes must be obtained in advance, which may become a bottleneck in practical applications.
Training and evaluation are conducted exclusively on laboratory-captured multi-view datasets; robustness to in-the-wild inputs such as low-quality selfies has not been validated.
Fine-grained modeling of hair and accessories is not supported, due to inherent limitations of the Gaussian representation.
The deformation MLP processes each Gaussian independently, lacking global consistency constraints.

vs. Avat3r: Avat3r is also feed-forward but employs skip connections with geometric priors, leading to error propagation and achieving only 8 FPS. FastGHA replaces skip connections with depth supervision, achieving 62 FPS.
vs. GPAvatar: GPAvatar exhibits poor identity preservation (CSIM 0.26 vs. 0.73), attributable to the absence of strong semantic feature extraction.

Rating

Novelty: ⭐⭐⭐⭐ The two-stage design and per-Gaussian feature concept are clear and effective, though no single component is groundbreaking on its own.
Experimental Thoroughness: ⭐⭐⭐⭐ Two datasets, multiple baselines, comprehensive ablations, and speed analysis.
Writing Quality: ⭐⭐⭐⭐ The pipeline is described clearly, though the motivation for certain design choices could be elaborated further.
Value: ⭐⭐⭐⭐ First to achieve few-shot + real-time animation for 3D Gaussian head avatars, with high practical utility.