GASP: Gaussian Avatars with Synthetic Priors

Conference: CVPR 2025
arXiv: 2412.07739
Code: https://microsoft.github.io/GASP/ (Project Page)
Area: 3D Vision
Keywords: Gaussian Avatars, Synthetic Data Priors, Monocular Reconstruction, 360-degree Rendering, Real-time Animation

TL;DR

This paper proposes GASP, which utilizes synthetic data to train a generative prior model (auto-decoder) for Gaussian Avatars. It bridges the synthetic-to-real domain gap through a three-stage fitting process and learned per-Gaussian semantic feature correlations, enabling the creation of high-quality, real-time animatable avatars (at 70 fps) supporting 360° rendering from only a single image or a short video.

Background & Motivation

Background: Animatable avatars based on Gaussian Splatting (Gaussian Avatars) have achieved significant progress in quality and speed. Existing methods either require expensive multi-camera setups for free-viewpoint rendering or reconstruct from monocular training videos but can only render from fixed viewpoints.
Limitations of Prior Work: (a) Multi-camera methods require complex acquisition setups, making them inaccessible to average users; (b) monocular methods suffer from severe artifacts under non-training viewpoints (especially the sides and back of the head); (c) existing few-shot avatar methods (such as the NeRF-based Preface and Cafca) render extremely slowly (>20 s/frame).
Key Challenge: Reconstructing 360° avatars from monocular input is a highly under-constrained problem, as the extreme sides and back of the head are completely invisible. A prior model is required to "fill in" the missing regions, but high-quality multi-view real-human datasets are scarce, and annotations (camera calibration, 3DMM parameters) contain significant errors.
Goal: Create high-quality 360° avatars that render in real time from a single image or a short video captured with a webcam or smartphone.
Key Insight: Synthetic data has the natural advantage of pixel-level accurate annotations and arbitrary multi-view coverage, which can be leveraged to train large-scale prior models. The key challenge lies in bridging the synthetic-to-real domain gap.
Core Idea: Train an auto-decoder prior for Gaussian Avatars on large-scale synthetic facial data. Per-Gaussian semantic feature correlations and a three-stage fitting process then bridge the synthetic-to-real domain gap, enabling 360° real-time avatars from a single image.

Method

Overall Architecture

The proposed method consists of two main stages. Prior Training Stage: An auto-decoder model is trained on 1,000 synthetic subjects (with 50 multi-view images per subject) to jointly optimize a Canonical Template, per-identity latent codes, per-Gaussian features, and an MLP decoder. Fitting Stage: Given a single real image or short video, the prior is adapted to the real data through a three-step fitting process: (1) inverting the latent code, (2) fine-tuning the MLP, and (3) refining the Gaussian parameters.

Key Designs

Auto-decoder Prior Model Training:
- Function: Learning a generative model capable of generating Gaussian Avatars of different identities.
- Mechanism: Each Gaussian is assigned an 8-dimensional learnable feature vector \(\mathbf{f}_i\), and each identity is associated with a 512-dimensional latent code \(\mathbf{z}_j\). A shared MLP decoder \(\mathcal{D}\) maps each feature and identity code to offsets of the Gaussian attributes: \(\mathcal{A}_{i,j}=\mathcal{C}_{i,j}+\mathcal{D}(\mathbf{f}_i, \mathbf{z}_j)\), where \(\mathcal{C}\) is the Canonical Template (mean avatar). Initializing from a UV map at \(512 \times 512\) resolution yields approximately 188k Gaussians. The training loss combines pixel-level L1 and SSIM terms, an LPIPS perceptual loss, an alpha mask loss, and regularization losses.
- Design Motivation: Directly regressing all Gaussian attributes via an MLP is intractable due to the extremely high dimensionality. Employing a per-Gaussian feature as a joint "positional + semantic encoding" allows the MLP to process each Gaussian independently and in parallel, thereby drastically reducing the parameter count. The template-plus-offset design enables the model to focus solely on identity variations, promoting stability.
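The template-plus-offset decoding described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the attribute layout (`ATTR_DIM = 14`), the hidden width, the two-layer ReLU MLP, and the random weights are all assumptions, and the Gaussian count is kept small for readability. The summary writes the template as \(\mathcal{C}_{i,j}\); here it is treated as shared across identities, which is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

N_GAUSSIANS = 1000   # the summary reports ~188k; kept small here
FEAT_DIM = 8         # per-Gaussian feature f_i (from the summary)
LATENT_DIM = 512     # per-identity latent code z_j (from the summary)
ATTR_DIM = 14        # ASSUMED layout: 3 position + 4 rotation + 3 scale + 3 colour + 1 opacity
HIDDEN = 64          # ASSUMED hidden width

features = rng.normal(size=(N_GAUSSIANS, FEAT_DIM))   # learnable, one row per Gaussian
template = rng.normal(size=(N_GAUSSIANS, ATTR_DIM))   # canonical template (mean avatar)
z = rng.normal(size=(LATENT_DIM,))                    # latent code of one identity

# Shared decoder weights (hypothetical two-layer MLP with random values).
W1 = rng.normal(scale=0.02, size=(FEAT_DIM + LATENT_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.02, size=(HIDDEN, ATTR_DIM))
b2 = np.zeros(ATTR_DIM)


def decode(features, z, template):
    """Return per-Gaussian attributes A = C + D(f, z), one row per Gaussian."""
    z_tiled = np.broadcast_to(z, (features.shape[0], z.shape[0]))
    x = np.concatenate([features, z_tiled], axis=1)
    hidden = np.maximum(x @ W1 + b1, 0.0)   # ReLU
    offsets = hidden @ W2 + b2
    return template + offsets


attrs = decode(features, z, template)

# The decoder size does not depend on the number of Gaussians.
n_params = W1.size + b1.size + W2.size + b2.size
```

Because every row is decoded independently, `decode(features[:5], z, template[:5])` gives the same result as `attrs[:5]`, and `n_params` stays fixed however many Gaussians are used.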
Per-Gaussian Semantic Feature Correlations:
- Function: Enabling updates in visible regions during the fitting process to propagate automatically to invisible regions.
- Mechanism: During training, joint optimization with the shared MLP drives semantically similar Gaussians toward similar features. PCA visualization shows that the learned features exhibit clear semantic clustering (e.g., forehead, lips, and scalp are naturally grouped). During fitting, the features \(\mathbf{f}\) are frozen, and only \(\mathbf{z}\) and \(\mathcal{D}\) are optimized. Consequently, if the MLP learns to turn the forehead Gaussians blonde, other Gaussians with similar features on the back of the head will automatically turn blonde.
- Design Motivation: This is the key mechanism to address the core dilemma of "monocular input to 360° output". It enables information propagation from visible to invisible regions via implicit semantic associations without relying on explicit symmetry assumptions.
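The propagation effect can be reproduced with a toy example. The sketch below is an illustration under strong assumptions: a linear map stands in for the MLP, the features are 2-D rather than 8-D, and all feature and colour values are made up. Only the decoder weights are updated, and the loss only sees the visible Gaussians.

```python
import numpy as np

# Frozen toy features (hypothetical values, 2-D for readability).
feats = np.array([
    [1.00, 0.00],   # forehead, visible
    [0.95, 0.05],   # back of head, NOT visible, feature close to the forehead
    [0.00, 1.00],   # lips, visible, semantically different
])
visible = np.array([True, False, True])

# Target colours observed in the input image (the invisible row is unused).
target = np.array([
    [0.9, 0.8, 0.3],   # blonde-ish
    [0.0, 0.0, 0.0],
    [0.7, 0.2, 0.2],
])

W = np.zeros((2, 3))   # linear stand-in for the decoder: feature -> RGB
lr = 0.5
for _ in range(200):
    pred = feats @ W
    residual = (pred - target) * visible[:, None]   # loss on visible rows only
    W -= lr * (feats.T @ residual)                  # gradient step on the decoder

colors = feats @ W
```

After fitting, `colors[1]` (the unseen back of the head) lies close to the forehead colour `colors[0]` and far from the lip colour, even though it never contributed to the loss.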
Three-stage Fitting Process:
- Function: Adapting the synthetic prior to real user data.
- Mechanism: Stage 1 (Inversion): Freeze everything and only optimize the identity code \(\mathbf{z}\) to find the optimal avatar within the prior space (500 steps). Stage 2 (MLP Fine-tuning): Freeze features and template, and fine-tune the MLP \(\mathcal{D}\) to bridge the domain gap using feature correlations (500 steps). Stage 3 (Gaussian Refinement): Directly optimize individual Gaussian parameters to best fit the target data (100 steps). A regularization term \(L_{prior}\) (L2 distance of Gaussian parameters to the Stage 1 result) is incorporated at each stage to prevent drifting too far from the prior. The entire fitting process takes about 10 minutes on an RTX 4090.
- Design Motivation: Pure inversion can only produce synthetic-looking appearances, whereas direct Gaussian optimization produces severe artifacts in invisible views. The three stages move progressively from the prior toward the target data. Stage 2 is critical: fine-tuning the MLP updates visible and invisible regions simultaneously through the feature correlations, which is what adapts the synthetic prior to the real domain.
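The fitting schedule can be written down as a small data structure. This is a sketch under stated assumptions: the stage names, the `trainable` labels, and the mean-squared form of the prior term are illustrative choices, while the step counts come from the description above. The summary does not say whether \(\mathbf{z}\) keeps being updated in Stage 2, so only the MLP is listed there.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class Stage:
    name: str          # hypothetical label
    steps: int         # optimization steps, as reported in the summary
    trainable: tuple   # which parameter groups are unfrozen (labels are assumptions)


SCHEDULE = (
    Stage("inversion", 500, ("latent_code",)),
    Stage("mlp_finetune", 500, ("mlp",)),
    Stage("gaussian_refinement", 100, ("gaussian_params",)),
)


def prior_loss(gaussian_params, stage1_params):
    """Squared L2 penalty (mean-reduced, an assumption) toward the Stage 1 result."""
    diff = np.asarray(gaussian_params) - np.asarray(stage1_params)
    return float(np.mean(diff ** 2))


total_steps = sum(stage.steps for stage in SCHEDULE)
```

With these step counts the whole schedule amounts to 1,100 optimization steps, and `prior_loss` is zero whenever the current Gaussians coincide with the Stage 1 result.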

Loss & Training

Prior training loss is defined as: \(\mathcal{L}=\lambda_{pix}L_{pix}+\lambda_\alpha L_\alpha+\lambda_{percep}L_{percep}+L_{reg}\). Here, \(L_{pix}\) includes L1 and SSIM, \(L_{percep}\) is based on LPIPS, and \(L_{reg}\) regularizes Gaussian scale and displacement. The displacement regularization on the scalp region is reduced by a factor of 100 to allow for hair modeling. The prior is trained on 4×A100 GPUs for 4 days with a batch size of 8 for 250 epochs. At inference time, no neural networks are required, enabling pure Gaussian splatting.
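As a rough sketch of how these terms combine, the snippet below implements the weighted sum and a displacement regularizer whose weight is 100 times smaller on the scalp. The \(\lambda\) defaults, the base regularization weight, and the squared-norm form of the penalty are placeholders, since the summary does not give them.

```python
import numpy as np


def displacement_reg(offsets, scalp_mask, base_weight=1.0, scalp_factor=0.01):
    """Squared-norm penalty on per-Gaussian displacements.

    The weight on scalp Gaussians is reduced by a factor of 100
    (scalp_factor=0.01) so that hair can move away from the template.
    base_weight and the squared-norm form are ASSUMPTIONS.
    """
    weights = np.where(scalp_mask, base_weight * scalp_factor, base_weight)
    return float(np.sum(weights * np.sum(offsets ** 2, axis=1)))


def total_loss(l_pix, l_alpha, l_percep, l_reg,
               lam_pix=1.0, lam_alpha=1.0, lam_percep=1.0):
    """L = lam_pix*L_pix + lam_alpha*L_alpha + lam_percep*L_percep + L_reg.

    The lambda defaults are placeholders, not values from the paper.
    """
    return lam_pix * l_pix + lam_alpha * l_alpha + lam_percep * l_percep + l_reg


# Four Gaussians with unit displacement in every axis; the first two are on the scalp.
offsets = np.ones((4, 3))
scalp_mask = np.array([True, True, False, False])
l_reg = displacement_reg(offsets, scalp_mask)
```

In this example each Gaussian has a squared displacement of 3, so the two scalp Gaussians contribute 0.03 each and the other two contribute 3 each.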

Key Experimental Results

Main Results

Setup	Metric	GASP	FlashAvatar	GA	DiffusionRig	ROME
Monocular Video	PSNR↑ (dB)	21.34	17.25	17.39	19.67	-
Monocular Video	SSIM↑	0.712	0.603	0.601	0.343	-
Monocular Video	LPIPS↓	0.333	0.450	0.428	0.436	-
Monocular Video	FID↓	117	351	366	155	-
Monocular Video	ID-SIM↑	0.568	0.234	0.179	0.302	-
Single Image	PSNR↑ (dB)	20.73	13.26	14.80	16.87	15.78
Single Image	QUAL↑ (out of 5)	3.80	2.05	2.03	3.15	3.38

Ablation Study

Configuration	PSNR↑	SSIM↑	LPIPS↓	FID↓	ID-SIM↑
Full model	21.34	0.712	0.333	117	0.568
w/o prior	19.42	0.670	0.391	212	0.478
w/o prior regularization	20.31	0.701	0.344	122	0.620
w/o stage 1	19.56	0.678	0.364	127	0.588
w/o stage 2	20.33	0.704	0.347	118	0.490
1 subject prior	Worse than w/o prior	-	-	-	-
1000 subjects prior	Full model	-	-	-	-

Key Findings

Synthetic prior contributes significantly: Removing the prior drops PSNR by nearly 2 dB and increases FID from 117 to 212.
Trade-off in prior regularization: Disabling it yields the highest ID-SIM (0.620) because the model can fit the visible frontal regions with fewer constraints, but degrades FID and LPIPS due to deterioration on invisible views.
Training the prior on only 1 subject performs worse than no prior, indicating that the prior requires sufficient diversity to be effective.
Inference runs at 70 fps on an RTX 4090 with 15 MB of storage and requires no neural network (pure Gaussian splatting).
Comparable to SOTA in multi-camera setups (PSNR 23.44 vs. GA 23.73), verifying that the synthetic prior does not degrade reconstruction quality when dense observations are available.
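The headline numbers in these findings follow directly from the ablation table. The short check below recomputes them; the dictionary keys are hypothetical labels for the table rows, and the values are copied from the table.

```python
# Values copied from the ablation table (keys are illustrative labels).
ablation = {
    "full":         {"psnr": 21.34, "fid": 117, "id_sim": 0.568},
    "no_prior":     {"psnr": 19.42, "fid": 212, "id_sim": 0.478},
    "no_prior_reg": {"psnr": 20.31, "fid": 122, "id_sim": 0.620},
}

# Effect of removing the synthetic prior.
psnr_drop = round(ablation["full"]["psnr"] - ablation["no_prior"]["psnr"], 2)
fid_increase = ablation["no_prior"]["fid"] - ablation["full"]["fid"]

# Identity gain when the prior regularization is disabled.
id_sim_gain = round(ablation["no_prior_reg"]["id_sim"] - ablation["full"]["id_sim"], 3)

# Per-frame time budget implied by 70 fps.
frame_time_ms = round(1000.0 / 70, 1)
```

This gives a PSNR drop of 1.92 dB and an FID increase of 95 without the prior, an ID-SIM gain of 0.052 without prior regularization, and a frame time of about 14.3 ms at 70 fps.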

Highlights & Insights

Successful Application of Synthetic Priors: Leveraging perfectly annotated synthetic data combined with a domain-adaptive fitting strategy provides an elegant paradigm to address real-world data scarcity. This pipeline could be generalized to full-body avatars, hand reconstruction, and other areas.
Implicit Propagation of Per-Gaussian Semantic Features: Avoiding explicit symmetry or correspondence constraints, information automatically propagates from visible to invisible regions via learned feature correlations. PCA visualization demonstrates that the features indeed capture consistent semantics.
Zero Network Overhead at Inference: The prior and MLP are only utilized during the fitting stage and discarded at inference, allowing lightweight and real-time rendering.
Progressive Domain Adaptation via Three-stage Fitting: Each stage gradually unlocks degrees of freedom, balancing the regularization from priors and fitting accuracy to the target data.

Limitations & Future Work

The lack of illumination variation and dynamic wrinkle modeling leads to slightly lower quality than specialized methods in dense multi-camera setups.
Domain gap in illumination: The synthetic data is rendered under uniform white lighting, meaning complex real-world illumination can impede adaptation.
Fitting time: Optimization takes about 10 minutes on an RTX 4090, which may be too slow for instant-on applications.
Dependence on 3DMM fitting quality: Occlusion or extreme facial expressions often result in erroneous 3DMM annotations, which directly impacts the output quality.
Interactive online fitting (e.g., progressively refining the avatar during an active video call) remains unexplored.

Comparison with Related Work

vs. Cafca/Preface: These methods also use synthetic priors, but they are NeRF-based and need >20 s/frame to render static scenes. GASP employs Gaussian Splatting, enabling real-time (70 fps) dynamic animation.
vs. GaussianAvatars: GA binds Gaussians to a FLAME mesh. While GA achieves high quality in multi-view settings, it degrades under monocular scenarios. GASP solves this monocular overfitting via its learned prior.
vs. DiffusionRig: Diffusion models act as strong priors but suffer from poor identity preservation. GASP achieves much stronger identity consistency (ID-SIM of 0.568 vs. 0.302).
vs. Gaussian Morphable Model (Xu et al.): Shares a similar concept but only trains on frontal faces, failing to render the back of the head. GASP utilizes synthetic data with full spherical coverage.

Rating

Novelty: ⭐⭐⭐⭐ The combination of synthetic priors, semantic feature correlation, and three-stage fitting is highly effective, though the individual components are not entirely novel.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation under three settings, complete ablation studies, and a user study.
Writing Quality: ⭐⭐⭐⭐ Well-structured with clearly justified motivations and method design.
Value: ⭐⭐⭐⭐⭐ Exceptionally practical system, representing the first real-time system capable of generating 360° dynamic Gaussian avatars from a single view.