AdaHuman: Animatable Detailed 3D Human Generation with Compositional Multiview Diffusion
- Conference: ICCV 2025
- arXiv: 2505.24877
- Code: https://nvlabs.github.io/AdaHuman (code and models to be released)
- Area: 3D Vision
- Keywords: 3D human reconstruction, 3D Gaussian splatting, multiview diffusion model, pose-conditioned generation, animatable avatar
TL;DR
AdaHuman is proposed as a framework that generates high-fidelity, animatable 3D human avatars from a single image via a pose-conditioned 3D joint diffusion model and a compositional 3DGS refinement module.
Background & Motivation
Generating high-quality animatable 3D human avatars from a single image has significant applications in gaming, animation, and VR. Existing methods face two primary challenges: (1) self-occlusion—avatars are typically generated in the same pose as the input image, making it difficult to complete occluded regions under complex poses, which impairs rigging and animation; and (2) detail blurriness—feed-forward 3D reconstruction models are constrained by fixed output resolutions (e.g., 256×256), preventing the capture of fine-grained details such as facial features and clothing textures.
SDS-based methods are flexible but suffer from over-saturation artifacts and slow generation. Multiview generation-and-reconstruction pipelines improve speed and realism but still fail to address these two core issues. A new framework capable of handling pose transformation while enhancing local detail is therefore required.
Method
Overall Architecture
AdaHuman consists of two core modules: (1) pose-conditioned 3D joint diffusion—simultaneously performing multiview image synthesis and 3DGS reconstruction during the diffusion process, supporting avatar generation under arbitrary pose conditions; and (2) compositional 3DGS refinement—seamlessly integrating refined local 3DGS representations into a complete avatar through image-to-image refinement of local body parts and crop-aware camera ray maps.
Key Designs
- Pose-Conditioned 3D Joint Diffusion Model:
    - Given a full-body image, local views of different body parts (head, upper body, lower body) are generated and combined with the input to form the input view set \(\{\mathcal{I}_i\}_{i=1}^{V}\)
    - Each view is represented by a triplet \(\{x_i, p_i, c_i\}\): RGB image, 2D semantic pose map (rendered from SMPL), and camera ray map
    - 2D self-attention in a single-image LDM is replaced with 3D attention to enable multiview-consistent generation
    - At each denoising step \(t\), a 3DGS generator \(\mathbf{G}\) produces \(\mathcal{G}^t\) from the predicted "clean" images \(x_j^{t\to 0}\), which are then re-rendered as 3D-consistent images for the subsequent denoising step (see the sampling-loop sketch after this list)
    - Key advantage: Pose conditioning enables the model to generate avatars in A-pose without requiring standard-pose training data, minimizing self-occlusion and naturally supporting rigging and animation
    - Reconstruction mode selects 3 target views at 90° intervals from the same frame; re-posing mode selects 4 target views from different frames
- Compositional 3DGS Refinement Module:
    - Local body part refinement: Four canonical views at 90° intervals are rendered from the coarse 3DGS \(\mathcal{G}_\text{coarse}\), with magnified local views rendered for three body regions (head, upper body, lower body)
    - Local renderings are refined via image-to-image diffusion (analogous to SDEdit), substantially enhancing detail
    - Crop-aware camera ray map: Pixel coordinates \((u,v)\) in a local view are mapped back to global view coordinates \((i,j)\) through cropping box parameters, computing the corresponding camera ray embedding \(\mathcal{R}(i,j) = (\mathbf{o}(i,j), \mathbf{o}(i,j) \times \mathbf{d}(i,j))\) (sketched after this list)
    - Design motivation: Establishes precise 3D coordinate correspondences between local and global views, enabling the 3DGS generator to process inputs at different scales uniformly in global space without architectural modifications
- Visibility-Aware 3DGS Composition:
    - Two criteria determine which 3D Gaussian primitives to retain (see the filtering sketch after this list):
        - View coverage: The number of input views covering each Gaussian primitive is counted; primitives with low coverage lack multiview consensus and may be unreliable
        - Visibility saliency: The gradient magnitude of the alpha channel across all rendered views is measured; primitives with low saliency contribute little to appearance and may constitute noise
    - When a Gaussian primitive from a coarser body region is well covered by views from a finer region (e.g., the head), the redundant coarse primitive is discarded
    - Design motivation: Directly merging all local 3DGS representations produces floating artifacts; this strategy ensures that only the most reliable and visually salient Gaussian primitives are retained through principled filtering
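To make the per-step reconstruct-and-rerender loop concrete, below is a minimal sketch of the sampling procedure, assuming hypothetical `denoise`, `gaussian_generator`, and `render` callables and a toy noise schedule; none of these names or signatures come from the paper or its (unreleased) code.

```python
import torch
from typing import Callable, Sequence

def joint_diffusion_sample(
    input_views: torch.Tensor,       # (V_in, C, H, W) input views (RGB + pose map + ray map channels)
    target_pose: torch.Tensor,       # (V_out, C, H, W) SMPL pose maps for the target views
    target_rays: torch.Tensor,       # (V_out, 6, H, W) camera ray maps for the target views
    timesteps: Sequence[int],        # descending diffusion timesteps, e.g. [999, ..., 0]
    denoise: Callable,               # multiview LDM step (3D attention): returns predicted clean views
    gaussian_generator: Callable,    # G: predicted clean views -> 3DGS parameters
    render: Callable,                # 3DGS parameters + ray maps -> re-rendered target views
):
    """Sketch of pose-conditioned 3D joint diffusion sampling."""
    x_t = torch.randn_like(target_pose)   # start target views from pure noise (shape borrowed from pose maps)
    gaussians = None
    for t in timesteps:
        # Predict the "clean" target views x^{t->0} given inputs, pose maps, and ray maps.
        x_0 = denoise(x_t, t, input_views, target_pose, target_rays)
        # Reconstruct an intermediate 3DGS avatar from the predicted clean views.
        gaussians = gaussian_generator(x_0, target_rays)
        # Re-render the targets from the 3DGS so that they are 3D-consistent, then
        # re-noise them for the next (lower) timestep.
        x_0_consistent = render(gaussians, target_rays)
        x_t = add_noise(x_0_consistent, max(t - 1, 0))
    return gaussians

def add_noise(x_0: torch.Tensor, t: int, num_steps: int = 1000) -> torch.Tensor:
    """Toy forward noising q(x_t | x_0); the actual schedule is not specified in this summary."""
    alpha_bar = 1.0 - t / num_steps
    return alpha_bar ** 0.5 * x_0 + (1.0 - alpha_bar) ** 0.5 * torch.randn_like(x_0)
```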
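The crop-aware camera ray map can likewise be illustrated with a short NumPy sketch: each pixel of a zoomed-in body-part crop is mapped back to global-view pixel coordinates through the crop box, and the per-pixel embedding \((\mathbf{o}, \mathbf{o} \times \mathbf{d})\) described above is computed from the global camera. The function name, argument layout, and output resolution are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def crop_aware_ray_map(K, c2w, crop_box, out_hw=(64, 64)):
    """K: 3x3 intrinsics of the *global* view; c2w: 4x4 camera-to-world pose;
    crop_box = (x0, y0, w, h) of the body-part crop in global-view pixels."""
    x0, y0, w, h = crop_box
    Ho, Wo = out_hw
    # Pixel grid of the local crop at its own resolution.
    v, u = np.meshgrid(np.arange(Ho) + 0.5, np.arange(Wo) + 0.5, indexing="ij")
    # Map local pixel coordinates (u, v) back to global-view coordinates (i, j).
    i = x0 + u / Wo * w
    j = y0 + v / Ho * h
    # Back-project through the global intrinsics to get ray directions in camera space.
    pix = np.stack([i, j, np.ones_like(i)], axis=-1)          # (Ho, Wo, 3)
    dirs_cam = pix @ np.linalg.inv(K).T
    # Rotate into world space, normalize, and take the camera center as the ray origin.
    d = dirs_cam @ c2w[:3, :3].T
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)
    o = np.broadcast_to(c2w[:3, 3], d.shape)
    # 6-channel embedding per pixel: (o, o x d), in the shared global 3D frame.
    return np.concatenate([o, np.cross(o, d)], axis=-1)       # (Ho, Wo, 6)
```

Because the rays are expressed in the global frame, the 3DGS generator can consume local crops and full-body views through the same interface.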
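Finally, the visibility-aware composition reduces to a boolean keep-mask over candidate Gaussians. The sketch below assumes the per-view coverage matrix, accumulated alpha-gradient magnitudes, and overlap flags have already been computed; the helper name and threshold values are illustrative, not taken from the paper.

```python
import torch

def compose_keep_mask(
    coverage: torch.Tensor,             # (V, N) bool: does input view v cover Gaussian n?
    alpha_grad: torch.Tensor,           # (N,) alpha-channel gradient magnitude over all rendered views
    overlapped_by_finer: torch.Tensor,  # (N,) bool: coarse Gaussian covered by a finer part's views (e.g., head)
    min_views: int = 2,                 # illustrative thresholds
    min_saliency: float = 1e-3,
) -> torch.Tensor:
    """Return a boolean keep-mask over the N candidate Gaussian primitives."""
    # Criterion 1: view coverage -- primitives seen by too few input views lack
    # multiview consensus and are treated as unreliable.
    keep_cov = coverage.sum(dim=0) >= min_views
    # Criterion 2: visibility saliency -- primitives whose alpha-channel gradient is
    # negligible contribute little to appearance and are treated as noise.
    keep_sal = alpha_grad >= min_saliency
    # Coarse primitives already well covered by a finer body region are redundant
    # and dropped in favour of the refined local Gaussians.
    return keep_cov & keep_sal & ~overlapped_by_finer
```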
Loss & Training
- Jointly trained on MVHumanNet (6,209 subjects) and CustomHuman (589 meshes)
- Both the LDM and 3DGS generator are initialized from pretrained weights
- Reconstruction is trained for 30K steps, followed by 10K steps of re-posing fine-tuning
- The LDM is supervised with an MSE loss on image latents; the 3DGS branch is supervised with MSE and LPIPS rendering losses plus a surface regularization term (a minimal sketch follows this list)
- An additional 12 views are sampled beyond the target views to provide dense supervision for the 3DGS
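A hedged sketch of the combined supervision, assuming an LPIPS callable (e.g., from the `lpips` package) and a stand-in `surface_reg` term; the loss weights are illustrative placeholders, not values reported by the paper.

```python
import torch.nn.functional as F

def training_loss(pred_latents, gt_latents, rendered, gt_images, lpips_fn, surface_reg,
                  w_lpips=0.5, w_surf=0.1):
    # LDM branch: MSE between predicted and ground-truth image latents.
    loss_ldm = F.mse_loss(pred_latents, gt_latents)
    # 3DGS branch: MSE + LPIPS on renderings (target views plus the 12 extra
    # supervision views), plus a surface regularizer on the Gaussians.
    loss_3dgs = (F.mse_loss(rendered, gt_images)
                 + w_lpips * lpips_fn(rendered, gt_images).mean()
                 + w_surf * surface_reg())
    return loss_ldm + loss_3dgs
```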
Key Experimental Results
Main Results (Avatar Reconstruction)
| Model | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | CD(cm)↓ |
|---|---|---|---|---|---|
| LGM | 18.99 | 0.8445 | 0.1664 | 122.3 | 2.175 |
| SiTH | 20.77 | 0.8727 | 0.1277 | 42.9 | 1.389 |
| SIFU | 20.59 | 0.8853 | 0.1359 | 92.6 | 2.009 |
| Human3Diffusion | 21.08 | 0.8728 | 0.1364 | 35.3 | 1.230 |
| AdaHuman | 21.46 | 0.8925 | 0.1087 | 27.3 | 0.962 |
In a user study, AdaHuman achieves preference rates of 88.3%, 99.2%, 79.7%, and 93.8% against SiTH, SIFU, Human3Diffusion, and the coarse 3DGS, respectively.
Ablation Study
| Configuration | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | Note |
|---|---|---|---|---|---|
| Coarse 3DGS (no refinement) | 20.84 | 0.8789 | 0.1296 | 31.9 | Lacks fine details such as facial features |
| Direct composition (no filtering) | 20.41 | 0.8700 | 0.1350 | 36.2 | Produces substantial floating artifacts |
| Learnable composition (network prediction) | 20.87 | 0.8788 | 0.1270 | 28.0 | Marginal improvement but artifacts remain |
| Without joint diffusion | 20.79 | 0.8762 | 0.1283 | 27.6 | View inconsistency |
| Full method | 21.46 | 0.8925 | 0.1087 | 27.3 | Best |
| + GT pose condition | 23.00 | 0.9028 | 0.1086 | 27.0 | Improved pose alignment |
Key Findings
- On the re-posing task, PSNR reaches 24.64 (vs. 21.21 for SiTH) and LPIPS drops to 0.0863, representing a substantial margin
- Strong generalization to complex and loose clothing in in-the-wild images
- Despite not being trained on standard-pose data, the model successfully generalizes to A-pose by leveraging the diverse pose distribution in MVHumanNet
- Inference requires approximately 70 seconds on an A100 GPU
Highlights & Insights
- Pose-decoupled design: Unifies avatar reconstruction and pose transformation within a single diffusion framework, enabling animation-ready avatar generation without standard-pose training data
- Local-global compositional strategy: Crop-aware ray maps elegantly introduce multi-scale inputs without modifying the 3DGS generator architecture
- Two animation modes compared: Direct re-posing (more realistic clothing deformation but slower) vs. LBS-based animation (real-time but limited clothing deformation), offering options for different application scenarios
Limitations & Future Work
- The local refinement strategy may produce artifacts in occluded or low-coverage regions such as hands and arms
- Animation remains dependent on the SMPL model and skinning weights, limiting accuracy for facial expressions, hand gestures, and clothing deformation
- Future work may explore video diffusion models to improve animation quality and temporal consistency
- Simulation-based approaches could yield more physically plausible deformation of loose garments
Related Work & Insights
- Builds upon the joint diffusion-and-reconstruction paradigm of Human3Diffusion; the core innovations lie in pose conditioning and compositional refinement
- Compared with contemporaneous methods such as IDOL and LHM, the diffusion-based approach leverages stronger generative priors
- The design of crop-aware ray maps is generalizable to other scenarios requiring multi-scale 3DGS reconstruction
Rating
- Novelty: ⭐⭐⭐⭐ — The pose-conditioned joint diffusion and visibility-aware composition scheme are elegantly designed
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive quantitative and qualitative evaluation, user study, and thorough ablation
- Writing Quality: ⭐⭐⭐⭐ — Method description is clear with intuitive illustrations
- Value: ⭐⭐⭐⭐ — Addresses a practical need for single-image animatable avatar generation with significantly superior results