GAS: Generative Avatar Synthesis from a Single Image¶
Conference: ICCV 2025 | arXiv: 2502.06957 | Code: Project Page | Area: 3D Vision | Keywords: Human Avatar Generation, Single Image, Video Diffusion, NeRF, Multi-view Consistency
TL;DR¶
GAS is a framework that unifies novel view synthesis and novel pose synthesis into a video generation task by combining dense appearance cues from a generalizable NeRF with a video diffusion model. A modality switcher decouples the two tasks, enabling view-consistent and temporally coherent human avatar generation from a single image.
Background & Motivation¶
Human avatar generation has broad applications in gaming, film, sports, and telepresence, yet existing techniques typically require expensive capture setups.
Two major camps of prior methods and their limitations:
Regression-based generalizable methods (GHNeRF, GPS-Gaussian, etc.):

- Incorporate 3D human priors (SMPL) to support sparse or even single-view input
- However, their regression nature averages over a many-to-one mapping, producing blurry outputs
- Limited to rigid deformation and unable to model clothing dynamics
Generative diffusion-based methods (Animate Anyone, Champ, Human4DiT):

- Conditioned on sparse human templates (depth/normal maps) to produce high-quality results
- However, the gap between sparse conditioning signals and real appearance leads to multi-view flickering and temporal inconsistency
Core insight of GAS: Use the dense appearance cues reconstructed by the NeRF as the diffusion model's condition. These cues provide richer structural information than sparse normal maps and bridge the gap between conditioning signals and real appearance.
Method¶
Overall Architecture¶
Two-stage training:

1. Train a generalizable human NeRF: Trained on multi-view datasets to support rendering arbitrary views/poses from a single image
2. Train a video diffusion model: Conditioned on NeRF renderings and SMPL normal maps to learn the distribution shift
Stage 1: Generalizable Human NeRF¶
Based on a single-view generalizable NeRF, target-space points are transformed into the SMPL canonical space via inverse linear blend skinning (LBS) for feature querying.
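As a sketch (generic notation, not necessarily the paper's exact formulation): given a target-space point \(\mathbf{x}_t\), SMPL skinning weights \(w_k\), and per-bone transformations \(G_k(\boldsymbol{\theta})\) over the \(K\) joints, the inverse-LBS warp to the canonical space is

\[
\mathbf{x}_c = \Big(\sum_{k=1}^{K} w_k(\mathbf{x}_t)\, G_k(\boldsymbol{\theta})\Big)^{-1} \mathbf{x}_t ,
\]

and the NeRF's features are queried at \(\mathbf{x}_c\).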
Although NeRF renderings are blurry (due to mean regression), they provide dense, 3D-consistent appearance conditioning.
Stage 2: Video Diffusion Model¶
The model is built on Stable Video Diffusion (SVD) and fuses three conditioning inputs:
- NeRF rendering \(C_{\text{nerf}}\): VAE encoding + small CNN → dense appearance cues
- SMPL normal map \(C_{\text{smpl}}\): 2D convolution → geometric structure cues
- Reference image \(C_{\text{vae}}\): VAE encoding → appearance preservation
\(C_{\text{nerf}}\) and \(C_{\text{smpl}}\) are fused via element-wise addition and injected into the first convolutional layer of the UNet.
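A minimal sketch of this conditioning path, assuming a latent-space SVD UNet; module names, channel counts, and the exact injection point are assumptions for illustration, not the paper's released code:

```python
import torch
import torch.nn as nn

class ConditionFusion(nn.Module):
    """Fuse the NeRF-rendering latent and the SMPL normal map by element-wise
    addition, producing a feature map injected at the UNet's first convolution."""
    def __init__(self, latent_ch: int = 4, normal_ch: int = 3, out_ch: int = 320):
        super().__init__()
        # small CNN on the VAE latent of the NeRF rendering (dense appearance cue)
        self.nerf_proj = nn.Sequential(
            nn.Conv2d(latent_ch, out_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        # 2D convolution on the SMPL normal map (geometric structure cue);
        # assumes the normal map has been resized to the latent resolution
        self.smpl_proj = nn.Conv2d(normal_ch, out_ch, 3, padding=1)

    def forward(self, nerf_latent: torch.Tensor, smpl_normal: torch.Tensor) -> torch.Tensor:
        # element-wise addition of the two cues
        return self.nerf_proj(nerf_latent) + self.smpl_proj(smpl_normal)
```

The reference image condition \(C_{\text{vae}}\) follows SVD's usual image-conditioning path, while the fused map above is injected at the UNet's first convolutional layer as described in the text.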
Unified View-Pose Synthesis¶
Core innovation: Novel view synthesis and novel pose synthesis are unified as a single video generation task:

- Novel view: Fixed pose, varying camera trajectory \(\{P_1, ..., P_T\}\)
- Novel pose: Fixed camera, varying SMPL pose \(\{\boldsymbol{\theta}_1, ..., \boldsymbol{\theta}_T\}\)
Both tasks share model parameters; in-the-wild dynamic videos used for pose synthesis training naturally transfer to view synthesis, improving generalization.
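A sketch of how the two tasks share one conditioning interface; the function and argument names below are illustrative, not from the paper's code:

```python
from typing import Callable, List, Sequence, Tuple

def build_condition_sequence(
    mode: str,                        # "view" or "pose"
    render_nerf: Callable,            # (smpl_pose, camera) -> NeRF rendering C_nerf
    render_smpl_normal: Callable,     # (smpl_pose, camera) -> normal map C_smpl
    smpl_poses: Sequence,             # per-frame poses (pose task) or a single pose
    cameras: Sequence,                # per-frame cameras (view task) or a single camera
    num_frames: int = 20,
) -> List[Tuple]:
    """Per-frame (C_nerf, C_smpl) pairs; only the varying quantity differs per task."""
    frames = []
    for t in range(num_frames):
        if mode == "view":            # fixed pose, varying camera P_t
            pose, cam = smpl_poses[0], cameras[t]
        else:                         # "pose": fixed camera, varying SMPL pose theta_t
            pose, cam = smpl_poses[t], cameras[0]
        frames.append((render_nerf(pose, cam), render_smpl_normal(pose, cam)))
    return frames
```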
Modality Switcher¶
Naive joint training causes dynamic motion to corrupt view consistency. A one-hot switcher \(\boldsymbol{s}\) is concatenated with the timestep embedding and injected into the UNet (see the sketch below).
The switcher enables the network to distinguish: view synthesis → prioritize view consistency; pose synthesis → prioritize realistic deformation.
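A minimal sketch of the switcher injection, assuming the one-hot flag is simply concatenated to the timestep embedding; how the enlarged embedding is projected back to the UNet's width is left out:

```python
import torch
import torch.nn.functional as F

def add_modality_switch(t_emb: torch.Tensor, task_id: int, num_tasks: int = 2) -> torch.Tensor:
    """Concatenate a one-hot task flag with the timestep embedding.
    task_id: 0 = novel view synthesis, 1 = novel pose synthesis."""
    s = F.one_hot(torch.tensor(task_id), num_tasks).float().to(t_emb.device)  # (num_tasks,)
    s = s.expand(t_emb.shape[0], -1)                                          # (B, num_tasks)
    return torch.cat([t_emb, s], dim=-1)                                      # (B, D + num_tasks)
```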
Training Details¶
- NeRF: Trained on MVHumanNet
- Diffusion model: Initialized from SVD 1.1, trained on 8× A100 GPUs for 3 days, 150k iterations
- Inference CFG: Triangular CFG for view synthesis (scale 1 at the front view, rising to 2 at the back view, then back to 1; see the sketch below); fixed CFG = 2 for pose synthesis
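A sketch of the triangular CFG schedule for view synthesis; the paper specifies only the 1 → 2 → 1 shape over the front-to-back-to-front trajectory, so the linear interpolation below is an assumption:

```python
from typing import List

def triangular_cfg(num_frames: int, w_front: float = 1.0, w_back: float = 2.0) -> List[float]:
    """Per-frame guidance scales: w_front at the first/last frame (front view),
    w_back at the middle frame (back view), linearly interpolated in between."""
    if num_frames < 2:
        return [w_front] * num_frames
    mid = (num_frames - 1) / 2
    return [w_front + (w_back - w_front) * (1 - abs(t - mid) / mid) for t in range(num_frames)]

# Example: triangular_cfg(5) -> [1.0, 1.5, 2.0, 1.5, 1.0]
```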
Key Experimental Results¶
Novel View Synthesis (Main Results)¶
| Method | THuman PSNR↑ | 2K2K PSNR↑ | THuman LPIPS↓ | 2K2K FVD↓ |
|---|---|---|---|---|
| Animate Anyone | 22.48 | 18.48 | 0.061 | 1422.1 |
| Champ | 20.96 | 22.14 | 0.074 | 480.3 |
| Animate Anyone* | 25.20 | 26.22 | 0.046 | 286.4 |
| Champ* | 23.89 | 25.66 | 0.054 | 279.3 |
| GAS (Ours) | 26.77 | 28.82 | 0.041 | 191.3 |
\* denotes versions fine-tuned on the 3D scan dataset used in this paper. GAS outperforms all baselines by a substantial margin across all metrics.
Novel Pose Synthesis¶
| Method | TikTok PSNR↑ | SSIM↑ | LPIPS↓ | FVD↓ |
|---|---|---|---|---|
| Animate Anyone | 17.21 | 0.762 | 0.225 | 1274.1 |
| Champ | 18.48 | 0.806 | 0.182 | 585.0 |
| Champ* | 18.57 | 0.797 | 0.187 | 893.7 |
| GAS (Ours) | 19.11 | 0.833 | 0.176 | 362.0 |
Ablation Study¶
| Ablation | Key Findings |
|---|---|
| w/o NeRF conditioning | Using only SMPL normal maps as conditions leads to appearance inconsistency (multi-view flickering) |
| w/o modality switcher | Joint training impairs view consistency (dynamic motion interference) |
| 3D scan training only | Insufficient generalization; poor quality on in-the-wild data |
| + Internet videos | Parameter sharing improves view synthesis generalization (gains on both in-domain and out-of-domain) |
Key Findings¶
- Critical role of NeRF conditioning: Dense appearance cues are foundational to multi-view consistency; sparse normal maps are insufficient
- Cross-task transfer via parameter sharing: The diversity from pose synthesis training naturally improves view synthesis quality
- Necessity of the modality switcher: Different tasks impose different notions of "consistency" and require explicit decoupling
- Triangular CFG: For view synthesis, linearly increasing the CFG scale from the front view to the back view and then decreasing it back to the front effectively balances fidelity and generation quality
Highlights & Insights¶
- Dense conditioning over sparse conditioning: Using NeRF renderings instead of normal maps alone as the diffusion condition is a simple yet effective design
- Dual-task unification: A single model handles both view and pose synthesis, achieving high parameter efficiency and mutual reinforcement
- Practical training data strategy: Mixed training with 3D scans (small-scale, precise) and Internet videos (large-scale, diverse)
Limitations & Future Work¶
- Relies on the accuracy of SMPL fitting for NeRF; inaccurate SMPL leads to rendering artifacts that propagate into the diffusion model
- Currently limited to 20-frame video generation; long sequences require a sliding window approach
- NeRF rendering quality is limited in heavily occluded regions (e.g., fully rear-facing views)
Related Work & Insights¶
- Generalizable human NeRF: GHNeRF, GPS-Gaussian, EVA3D
- Generative human animation: Animate Anyone, Champ, Human4DiT
- Video diffusion models: SVD, Sora
Rating¶
- Novelty: ⭐⭐⭐⭐ — Dense NeRF conditioning combined with a unified view/pose video generation framework
- Technical Depth: ⭐⭐⭐⭐ — Targeted design of the modality switcher and CFG strategy
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple datasets (3D scans + multi-view video + TikTok), comprehensive comparisons and ablations
- Value: ⭐⭐⭐⭐ — Single-image input with view and pose controllability; broad application prospects