
GAS: Generative Avatar Synthesis from a Single Image

Conference: ICCV 2025 · arXiv: 2502.06957 · Code: Project Page · Area: 3D Vision · Keywords: Human Avatar Generation, Single Image, Video Diffusion, NeRF, Multi-view Consistency

TL;DR

GAS is a framework that unifies novel view synthesis and novel pose synthesis into a video generation task by combining dense appearance cues from a generalizable NeRF with a video diffusion model. A modality switcher decouples the two tasks, enabling view-consistent and temporally coherent human avatar generation from a single image.

Background & Motivation

Human avatar generation has broad applications in gaming, film, sports, and telepresence, yet existing techniques typically require expensive capture setups.

Two major camps of prior methods and their limitations:

Regression-based generalizable methods (GHNeRF, GPS-Gaussian, etc.):

  • Incorporate 3D human priors (SMPL) to support sparse or even single-view input
  • However, the regression objective averages over the many plausible appearances consistent with one input (a many-to-one mapping), producing blurry outputs
  • Limited to rigid deformation and unable to model clothing dynamics

Generative diffusion-based methods (Animate Anyone, Champ, Human4DiT):

  • Conditioned on sparse human templates (depth/normal maps), they produce high-quality results
  • However, the gap between the sparse conditioning signal and the real appearance leads to multi-view flickering and temporal inconsistency

Core insight of GAS: use dense appearance cues rendered by a generalizable NeRF as the diffusion model's conditioning. These cues carry richer structural information than sparse normal maps and narrow the gap between the conditioning signal and the real appearance.

Method

Overall Architecture

Two-stage training:

  1. Train a generalizable human NeRF: trained on multi-view datasets so that arbitrary views/poses can be rendered from a single image
  2. Train a video diffusion model: conditioned on the NeRF renderings and SMPL normal maps to learn the distribution shift

Stage 1: Generalizable Human NeRF

Based on a single-view generalizable NeRF, target-space points are transformed into the SMPL canonical space via inverse LBS for feature querying:

\[\sigma(\boldsymbol{x}), \boldsymbol{c}(\boldsymbol{x}) = \mathcal{F}(\boldsymbol{x}, \boldsymbol{p}, \gamma_d(\boldsymbol{d}))\]

where \(\sigma\) and \(\boldsymbol{c}\) are the density and color at the canonical-space query point \(\boldsymbol{x}\), \(\boldsymbol{p}\) is the feature conditioned on the input image, and \(\gamma_d(\boldsymbol{d})\) is the positional encoding of the viewing direction \(\boldsymbol{d}\).

Although NeRF renderings are blurry (due to mean regression), they provide dense, 3D-consistent appearance conditioning.
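To make the canonical-space query concrete, here is a minimal PyTorch-style sketch of the inverse-LBS transform and the subsequent field query, assuming per-point SMPL skinning weights and per-joint transforms are available; all function and argument names are illustrative, not the paper's implementation.

```python
import torch

def positional_encode(d, num_freqs=4):
    # Fourier positional encoding gamma_d(d) of the viewing direction.
    freqs = 2.0 ** torch.arange(num_freqs, device=d.device)
    angles = d[..., None] * freqs                              # (N, 3, num_freqs)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1).flatten(-2)

def inverse_lbs(x_target, skinning_weights, joint_transforms):
    """Map target-space points back to the SMPL canonical space.

    x_target:          (N, 3)    sampled points in the posed (target) space
    skinning_weights:  (N, J)    per-point SMPL blend-skinning weights
    joint_transforms:  (J, 4, 4) per-joint rigid transforms, canonical -> target
    """
    # Blend the per-joint transforms with the skinning weights (LBS), then invert.
    blended = torch.einsum('nj,jab->nab', skinning_weights, joint_transforms)  # (N, 4, 4)
    x_h = torch.cat([x_target, torch.ones_like(x_target[:, :1])], dim=-1)      # homogeneous
    return torch.einsum('nab,nb->na', torch.inverse(blended), x_h)[:, :3]

def query_field(field, x_target, img_feat, view_dirs, skinning_weights, joint_transforms):
    # Transform points into canonical space, then query the conditional field F
    # for density and color as in the equation above.
    x_can = inverse_lbs(x_target, skinning_weights, joint_transforms)
    sigma, color = field(x_can, img_feat, positional_encode(view_dirs))
    return sigma, color
```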

Stage 2: Video Diffusion Model

The model is built on Stable Video Diffusion (SVD); three conditioning inputs are fused:

  1. NeRF rendering \(C_{\text{nerf}}\): VAE encoding + small CNN → dense appearance cues
  2. SMPL normal map \(C_{\text{smpl}}\): 2D convolution → geometric structure cues
  3. Reference image \(C_{\text{vae}}\): VAE encoding → appearance preservation

\(C_{\text{nerf}}\) and \(C_{\text{smpl}}\) are fused via element-wise addition and injected into the first convolutional layer of the UNet.
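A rough sketch of this fusion, assuming SVD-like channel sizes (4 VAE latent channels, 320 channels in the UNet's first conv) and that the SMPL normal map has already been resized to the latent resolution; the module name and layer shapes are hypothetical.

```python
import torch
import torch.nn as nn

class ConditionFusion(nn.Module):
    """Fuse the NeRF-rendering latent and the SMPL normal map into one conditioning tensor."""

    def __init__(self, latent_ch=4, normal_ch=3, cond_ch=320):
        super().__init__()
        # Small CNN on top of the VAE latent of the NeRF rendering (C_nerf).
        self.nerf_proj = nn.Conv2d(latent_ch, cond_ch, kernel_size=3, padding=1)
        # 2D convolution encoding the SMPL normal map (C_smpl).
        self.smpl_proj = nn.Conv2d(normal_ch, cond_ch, kernel_size=3, padding=1)

    def forward(self, nerf_latent, smpl_normal):
        # nerf_latent: (B, 4, h, w) VAE encoding of the NeRF rendering
        # smpl_normal: (B, 3, h, w) normal map resized to the latent resolution
        c_nerf = self.nerf_proj(nerf_latent)   # dense appearance cues
        c_smpl = self.smpl_proj(smpl_normal)   # geometric structure cues
        # Element-wise addition; injected at the UNet's first convolutional layer
        # (e.g., added to its output features).
        return c_nerf + c_smpl
```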

Unified View-Pose Synthesis

Core innovation: novel view synthesis and novel pose synthesis are unified as a single video generation task:

  • Novel view: fixed pose, varying camera trajectory \(\{P_1, ..., P_T\}\)
  • Novel pose: fixed camera, varying SMPL pose \(\{\boldsymbol{\theta}_1, ..., \boldsymbol{\theta}_T\}\)

Both tasks share model parameters; in-the-wild dynamic videos used for pose synthesis training naturally transfer to view synthesis, improving generalization.
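A minimal sketch of how the two modes could share one conditioning pipeline: only the sequence that varies (cameras for view synthesis, SMPL poses for pose synthesis) differs, while the renderers and the video model stay the same. The helper signature and names below are assumptions for illustration.

```python
from typing import Callable, List, Sequence, Tuple

def build_conditions(
    cameras: Sequence,       # length-T camera list (varies for "view", repeated for "pose")
    poses: Sequence,         # length-T SMPL pose list (repeated for "view", varies for "pose")
    render_nerf: Callable,   # generalizable-NeRF renderer conditioned on the reference image
    render_normal: Callable, # SMPL normal-map rasterizer
) -> List[Tuple]:
    """Per-frame (C_nerf, C_smpl) conditions shared by both synthesis modes."""
    return [(render_nerf(cam, pose), render_normal(cam, pose))
            for cam, pose in zip(cameras, poses)]
```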

Modality Switcher

Naive joint training causes dynamic motion to corrupt view consistency. A one-hot switcher \(\boldsymbol{s}\) is concatenated with the timestep embedding and injected into the UNet:

\[\mathcal{L}_{\mathcal{U}_\theta} = \mathbb{E}[\|\epsilon - \mathcal{U}_\theta(Z_t, t, \boldsymbol{h}_{\text{clip}}, C_{\text{vae}}, C_{\text{nerf}}, C_{\text{smpl}}, \boldsymbol{s})\|]\]

The switcher enables the network to distinguish: view synthesis → prioritize view consistency; pose synthesis → prioritize realistic deformation.
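A small sketch of one way to realize the switcher, assuming the usual SVD timestep-embedding width of 1280; the module name and the projection layer are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SwitchedTimestepEmbedding(nn.Module):
    """Concatenate a one-hot task switcher s with the diffusion timestep embedding."""

    def __init__(self, emb_dim=1280, num_tasks=2):
        super().__init__()
        self.num_tasks = num_tasks
        # Project [timestep embedding ; one-hot(s)] back to the width the UNet expects.
        self.proj = nn.Linear(emb_dim + num_tasks, emb_dim)

    def forward(self, t_emb, task_id):
        # t_emb:   (B, emb_dim) timestep embedding
        # task_id: (B,) long tensor, 0 = novel view synthesis, 1 = novel pose synthesis
        s = nn.functional.one_hot(task_id, num_classes=self.num_tasks).float()
        return self.proj(torch.cat([t_emb, s], dim=-1))
```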

Training Details

  • NeRF: Trained on MVHumanNet
  • Diffusion model: Initialized from SVD 1.1, trained on 8× A100 GPUs for 3 days, 150k iterations
  • Inference CFG: triangular CFG for view synthesis (guidance scale 1 at the front view, rising linearly to 2 at the back view, then back to 1); fixed CFG = 2 for pose synthesis (see the sketch below)
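A sketch of the triangular CFG schedule, under the assumption that the orbit runs front → back → front over the frame sequence; the function name and values are illustrative.

```python
import numpy as np

def triangular_cfg(num_frames: int, w_min: float = 1.0, w_max: float = 2.0) -> np.ndarray:
    """Per-frame guidance scales: rise from w_min at the front view to w_max at the
    back view (middle of the orbit), then fall back to w_min at the front."""
    half = num_frames // 2
    up = np.linspace(w_min, w_max, half, endpoint=False)
    down = np.linspace(w_max, w_min, num_frames - half)
    return np.concatenate([up, down])

# Example: a 20-frame orbit peaks at 2.0 for the rear-facing frames.
print(triangular_cfg(20).round(2))
```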

Key Experimental Results

Novel View Synthesis (Main Results)

| Method | THuman PSNR↑ | 2K2K PSNR↑ | THuman LPIPS↓ | 2K2K FVD↓ |
|---|---|---|---|---|
| Animate Anyone | 22.48 | 18.48 | 0.061 | 1422.1 |
| Champ | 20.96 | 22.14 | 0.074 | 480.3 |
| Animate Anyone* | 25.20 | 26.22 | 0.046 | 286.4 |
| Champ* | 23.89 | 25.66 | 0.054 | 279.3 |
| GAS (Ours) | 26.77 | 28.82 | 0.041 | 191.3 |

*denotes versions fine-tuned on the 3D scan dataset in this paper. GAS outperforms all baselines by a substantial margin across all metrics.

Novel Pose Synthesis

Evaluated on the TikTok dataset:

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | FVD↓ |
|---|---|---|---|---|
| Animate Anyone | 17.21 | 0.762 | 0.225 | 1274.1 |
| Champ | 18.48 | 0.806 | 0.182 | 585.0 |
| Champ* | 18.57 | 0.797 | 0.187 | 893.7 |
| GAS (Ours) | 19.11 | 0.833 | 0.176 | 362.0 |

Ablation Study

| Ablation | Key Findings |
|---|---|
| w/o NeRF conditioning | Using only SMPL normal maps as conditions leads to appearance inconsistency (multi-view flickering) |
| w/o modality switcher | Joint training impairs view consistency (dynamic motion interference) |
| 3D scan training only | Insufficient generalization; poor quality on in-the-wild data |
| + Internet videos | Parameter sharing improves view-synthesis generalization (gains on both in-domain and out-of-domain data) |

Key Findings

  1. Critical role of NeRF conditioning: Dense appearance cues are foundational to multi-view consistency; sparse normal maps are insufficient
  2. Cross-task transfer via parameter sharing: The diversity from pose synthesis training naturally improves view synthesis quality
  3. Necessity of the modality switcher: Different tasks impose different notions of "consistency" and require explicit decoupling
  4. Triangular CFG: Linearly increasing CFG from front to back views for view synthesis effectively balances fidelity and generation quality

Highlights & Insights

  1. Dense conditioning over sparse conditioning: using NeRF renderings instead of normal maps as the diffusion condition is a simple, practically elegant idea
  2. Dual-task unification: A single model handles both view and pose synthesis, achieving high parameter efficiency and mutual reinforcement
  3. Practical training data strategy: Mixed training with 3D scans (small-scale, precise) and Internet videos (large-scale, diverse)

Limitations & Future Work

  1. Relies on the accuracy of SMPL fitting for NeRF; inaccurate SMPL leads to rendering artifacts that propagate into the diffusion model
  2. Currently limited to 20-frame video generation; long sequences require a sliding window approach
  3. NeRF rendering quality is limited in heavily occluded regions (e.g., fully rear-facing views)

Related Work

  • Generalizable human NeRF: GHNeRF, GPS-Gaussian, EVA3D
  • Generative human animation: Animate Anyone, Champ, Human4DiT
  • Video diffusion models: SVD, Sora

Rating

  • Novelty: ⭐⭐⭐⭐ — Dense NeRF conditioning combined with a unified view/pose video generation framework
  • Technical Depth: ⭐⭐⭐⭐ — Targeted design of the modality switcher and CFG strategy
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple datasets (3D scans + multi-view video + TikTok), comprehensive comparisons and ablations
  • Value: ⭐⭐⭐⭐ — Single-image input with view and pose controllability; broad application prospects