GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior
Conference: CVPR 2025
arXiv: 2503.11143
Code: https://github.com/silence-tang/GaussianIP
Area: Human Understanding / 3D Human Generation
Keywords: 3D Human Generation, Identity Preservation, 3D Gaussian Splatting, Score Distillation, Multi-view Consistency
TL;DR
This work proposes GaussianIP, a two-stage framework for identity-preserving 3D human generation. Stage 1 uses Adaptive Human Distillation Sampling (AHDS) to distill an identity-consistent 3D Gaussian human from a human-centric diffusion model. Stage 2 applies a View-Consistent Refinement (VCR) mechanism, which uses mutual attention to enhance facial and clothing texture details while keeping views consistent. Training completes in about 40 minutes, and GaussianIP scores higher than the compared methods on every reported user-study and GPT-based metric.
Background & Motivation
Background: Text-guided 3D human generation has achieved significant progress. DreamFusion proposed Score Distillation Sampling (SDS), pioneering the paradigm of distilling 3D scenes from 2D diffusion priors. Subsequent works (e.g., DreamWaltz, TADA, HumanGaussian) combine the SMPL parametric human model with SDS to generate 3D humans, while recent methods employ 3D Gaussian Splatting (3DGS) instead of NeRF to achieve more efficient rendering.
Limitations of Prior Work: (1) Excessive training time: most methods require 1–3 hours; (2) Lack of fine facial and clothing details: the noise term in the SDS distillation process leads to blurry textures; (3) No support for image input: existing text-to-3D methods accept only text prompts, so they cannot generate a 3D avatar that preserves the facial identity of a user's portrait, which seriously limits practical applications.
Key Challenge: General-purpose diffusion models (such as Stable Diffusion) lack human-specific prior knowledge, and the virtual humans generated by employing them as distillation priors lack identity attributes and clothing details. Conversely, the capabilities of 2D human diffusion models (such as virtual try-on and identity customization models) have not yet been fully utilized in 3D generation.
Goal: (1) How to efficiently generate identity-consistent 3D humans using the prior knowledge of 2D human-centric diffusion models; (2) How to refine texture details after distillation while maintaining multi-view 3D consistency.
Key Insight: (a) Human-centric diffusion models (such as IP-Adapter-FaceID) can replace general diffusion models for distillation, provided the score difference is decomposed and redesigned so that identity conditions can be injected; (b) the generative capability of diffusion models can also be used to refine the distillation results, but attention features must be shared across views to keep the refinement multi-view consistent.
Core Idea: Replace general diffusion priors with human-centric diffusion priors, inject identity conditions through HDS decomposition to achieve identity-consistent 3D human generation, and use multi-view mutual attention for refinement to ensure 3D texture consistency.
Method
Overall Architecture
GaussianIP is a two-stage framework. Stage 1: The 3DGS point cloud is initialized by densely sampling the SMPL-X mesh, and Adaptive Human Distillation Sampling (AHDS) guides 3DGS training for 2400 steps, producing a coarse but identity-accurate 3D human. Stage 2: Multi-view images are rendered from the Stage 1 result and refined with the View-Consistent Refinement (VCR) mechanism. The refined images are then used as ground truth (GT) to optimize the 3DGS for 800 steps under a reconstruction loss. The overall training time is approximately 40 minutes on a single V100 GPU.
Key Designs
- Human Distillation Sampling (HDS):
    - Function: Injects identity-preservation capability into the SDS distillation process.
    - Mechanism: The original SDS score difference is decomposed into three terms: a rectifier \(\delta_{\text{rect}}\) (guides the image towards the real-image manifold), a denoiser \(\delta_{\text{noise}}\) (the denoising direction, which introduces blur), and a conditional term \(\delta_{\text{cond}}\) (the conditional guidance direction). HDS makes three changes: (a) it discards the noisy \(\delta_{\text{noise}} - \epsilon\) term to avoid texture blur; (b) it incorporates the identity image condition \(\boldsymbol{I}_{ip}\) into \(\delta_{\text{cond}}\); (c) at high timesteps it introduces a repelling score for \(\delta_{\text{rect}}\), which uses negative prompts to steer away from low-quality generations. The diffusion prior is IP-Adapter-FaceID-PlusV2, combined with a pose-conditioned ControlNet for pose control. A view-dependent skeleton-cropping strategy handles the visibility of facial keypoints and alleviates the Janus problem.
    - Design Motivation: Standard SDS relies on a general-purpose diffusion model and contains a noise term, which together lead to texture blur and over-saturation. Switching to a human-centric prior and removing the noise term addresses both problems at once.
- Adaptive Human-specific Timestep Scheduling:
    - Function: Accelerates HDS training, cutting the number of training steps by roughly one third (3600 to 2400 in the ablation).
    - Mechanism: HDS training is treated as analogous to the denoising process of 2D human generation and is divided into three phases: coarse geometry and base texture (phase 1), intermediate texture (phase 2, a short transition phase), and fine facial and clothing details (phase 3). An optimized two-segment Gaussian PDF function determines the planned diffusion timestep for each training step, so that phases 1 and 3 occupy most of the training steps. Each phase also has a timestep lower bound, and the actual timestep is sampled randomly between this lower bound and the planned value to prevent over-saturation and ensure smooth transitions.
    - Design Motivation: 3D human generation has a distinct coarse-to-fine structure. Because SMPL-X already provides a decent initial geometry, training can start from smaller timesteps; the intermediate-texture phase needs few steps, whereas the fine-detail phase requires intensive training.
- View-Consistent Refinement (VCR):
    - Function: Refines texture details of multi-view rendered images while maintaining cross-view 3D consistency.
    - Mechanism: VCR operates in two steps. (a) Key view refinement: the four principal views (front, back, left, right) are denoised first and their self-attention K/V are stored. Each key view is then refined with mutual attention, which concatenates the nearest principal view's K/V with the view's own K/V in the attention computation, so that its appearance stays consistent with the principal views. (b) Intermediate view propagation: for an intermediate view lying between two key views, the relative distances \(\eta_l\) and \(\eta_r\) to the left and right neighboring key views are computed from the azimuth. The attention features of the two neighbors are fused with these distance-based weights, \(\boldsymbol{O}_{\text{fa}} = \eta_l \text{Attn}(\boldsymbol{Q}_i, \boldsymbol{K}_{P_l}, \boldsymbol{V}_{P_l}) + \eta_r \text{Attn}(\boldsymbol{Q}_i, \boldsymbol{K}_{P_r}, \boldsymbol{V}_{P_r})\), and the result is blended with the view's own self-attention output: \(\boldsymbol{O}_{\text{final}} = \lambda_{\text{self}} \boldsymbol{O}_{\text{sa}} + (1-\lambda_{\text{self}}) \boldsymbol{O}_{\text{fa}}\).
    - Design Motivation: If each view is denoised and refined independently, individual view quality may improve, but textures become inconsistent across views (e.g., mismatched clothing patterns). Mutual attention lets different views share texture features, which preserves 3D consistency.
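The intermediate-view fusion in VCR can be illustrated with a small NumPy sketch. It is a simplified, single-head version without learned projections, and it is not taken from the released code. The function names (`attention`, `neighbor_weights`, `fused_attention`) are hypothetical, and the linear azimuth weighting in `neighbor_weights` is an assumption chosen so that the closer key view receives the larger weight; the paper's exact definition of \(\eta\) may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention for (tokens, dim) arrays."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def neighbor_weights(azimuth, azimuth_left, azimuth_right):
    """Assumed weighting: linear in azimuth, closer key view gets more weight."""
    span = azimuth_right - azimuth_left
    eta_r = (azimuth - azimuth_left) / span
    eta_l = 1.0 - eta_r
    return eta_l, eta_r


def fused_attention(q_i, k_i, v_i, kv_left, kv_right, eta_l, eta_r,
                    lambda_self=0.55):
    """O_final = lambda_self * O_sa + (1 - lambda_self) * O_fa."""
    o_sa = attention(q_i, k_i, v_i)                    # the view's own self-attention
    o_fa = (eta_l * attention(q_i, *kv_left)           # left key view's stored K/V
            + eta_r * attention(q_i, *kv_right))       # right key view's stored K/V
    return lambda_self * o_sa + (1.0 - lambda_self) * o_fa
```

For example, a view at azimuth 30° between key views at 0° and 90° would get `eta_l = 2/3` and `eta_r = 1/3` under this assumed weighting, so the nearer left key view dominates the fused features.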
Loss & Training
Stage 1: AHDS gradient guides 3DGS training for 2400 steps with a CFG coefficient of \(\gamma=7.5\). Densification and pruning are performed from steps 200–1700 (every 800 steps), with prune-only executed at step 1800. Stage 2: Refined images serve as the ground truth. The reconstruction loss \(\mathcal{L}_{\text{recon}} = \lambda_{L1} L_1 + \lambda_{\text{lpips}} L_{\text{lpips}}\) is used to optimize the 3DGS for 800 steps (\(\lambda_{L1}=10, \lambda_{\text{lpips}}=15\), batch size of 8). VCR denoising consists of 8 steps with \(\lambda_{\text{self}}=0.55\).
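To make the role of the CFG coefficient \(\gamma\) in the Stage 1 gradient concrete, the score difference can be written out as follows. This is a sketch that assumes the standard classifier-free guidance form; the symbols \(\epsilon_\phi\) (noise predictor), \(x_t\) (noised render), \(y\) (prompt and other conditions), and \(\varnothing\) (null condition) are notation introduced here for illustration, and the paper's exact formulation and weighting may differ.

\[
\hat{\epsilon} - \epsilon
= \underbrace{\epsilon_\phi(x_t; \varnothing, t)}_{\delta_{\text{rect}} + \delta_{\text{noise}}}
+ \gamma \underbrace{\big(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \varnothing, t)\big)}_{\delta_{\text{cond}}}
- \epsilon
= \delta_{\text{rect}} + \big(\delta_{\text{noise}} - \epsilon\big) + \gamma\, \delta_{\text{cond}}
\]

Under this reading, discarding the \(\delta_{\text{noise}} - \epsilon\) term leaves \(\delta_{\text{rect}} + \gamma\, \delta_{\text{cond}}\) as the guidance direction, with the identity image \(\boldsymbol{I}_{ip}\) entering through \(\delta_{\text{cond}}\).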
Key Experimental Results
Main Results
| Method | Facial Details↑ | Clothing Texture↑ | Visual Quality↑ | Text Alignment↑ | GPT Score↑ | Training Time | Identity Preservation |
|---|---|---|---|---|---|---|---|
| DreamWaltz | 1.33 | 1.46 | 1.38 | 1.58 | 1.82 | 1.3h | ✗ |
| TADA | 2.21 | 2.46 | 2.58 | 3.13 | 3.24 | 2h | ✗ |
| HumanGaussian | 4.29 | 4.17 | 3.96 | 4.42 | 4.08 | 1.2h | ✗ |
| GaussianIP | 4.71 | 4.50 | 4.17 | 4.62 | 4.52 | 40min | ✓ |
Face++ Verification: All generated human faces match the input portrait with an average confidence level exceeding 83%. GPU VRAM requirement is <24GB.
Ablation Study
| Configuration | Effect | Explanation |
|---|---|---|
| 3DGS + SDS (baseline) | Basic shape but over-saturated and lacking details | General-purpose diffusion model distillation |
| + HDS | Improved identity consistency, enhanced clothing details | Human-centric distillation + identity condition |
| + AHDS | Training steps reduced from 3600 to 2400 (↓33%), with further quality enhancement | Adaptive timestep scheduling |
| + VCR | Significant improvement in multi-view texture consistency | View-consistent refinement |
| Independent Denoising Refinement | Acceptable single-view quality but inconsistent across views | No mutual attention |
| VCR Refinement | Aligned cross-view textures and consistent details | Mutual attention + distance-based fusion |
Key Findings
- AHDS reduces training steps from 3600 to 2400 (a 33% reduction) while also improving generation quality, validating the effectiveness of the human-specific timestep scheduling.
- Mutual attention in VCR is critical for cross-view consistency; independent denoising leads to mismatched patterns on the same garment across different views.
- At 40 minutes, training takes about one third less time than the quickest baseline cited (AvatarVerse, 1 hour). GaussianIP additionally supports identity preservation, which none of the compared methods offer.
- GaussianIP outperforms HumanGaussian (the previous SOTA 3DGS method) on all five scored metrics, with a notable advantage in facial details (4.71 vs. 4.29).
Highlights & Insights
- HDS's decomposition and reorganization of the score difference is a clean design: the analysis identifies the \(\delta_{\text{noise}} - \epsilon\) term in standard SDS as the main cause of blurry textures. Discarding this term and adding a repelling score to the rectifier addresses both blurriness and over-saturation. This "decompose-diagnose-replace" approach could plausibly be applied to other SDS variants.
- The three-phase timestep scheduling leverages domain priors of human generation: since SMPL-X initialization already provides coarse geometry, the highest timesteps can be skipped. The intermediate-texture phase is a smooth transition that needs few steps, whereas fine details are critical and receive focused training. This domain-adaptive schedule is more efficient than a generic linear decay.
- The distance-weighted attention fusion of VCR ensures smooth texture transitions between views. Closer key views exert greater influence, avoiding texture discontinuities caused by hard switching. This concept can be generalized to other 2D-to-3D tasks that demand multi-view consistency.
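The lower-bounded timestep sampling behind the three-phase schedule highlighted above can be sketched as follows. This is only an illustration: the phase boundaries, planned timesteps, and lower bounds in `PHASES` are hypothetical values, and a piecewise-linear plan stands in for the paper's two-segment Gaussian PDF function, whose exact form is not reproduced here. Only the total of 2400 steps and the "sample between the lower bound and the planned value" rule come from the summary.

```python
import random
from dataclasses import dataclass


@dataclass
class Phase:
    end_step: int   # first training step that no longer belongs to this phase
    t_start: int    # planned diffusion timestep at the start of the phase
    t_end: int      # planned diffusion timestep at the end of the phase
    t_min: int      # timestep lower bound used when sampling in this phase


# Hypothetical numbers for illustration only (not taken from the paper).
PHASES = [
    Phase(end_step=1100, t_start=750, t_end=450, t_min=400),  # coarse geometry
    Phase(end_step=1300, t_start=450, t_end=250, t_min=200),  # short transition
    Phase(end_step=2400, t_start=250, t_end=50, t_min=20),    # fine details
]


def planned_timestep(step, phases=PHASES):
    """Return (planned timestep, lower bound) for a training step.

    Uses linear interpolation inside each phase as a placeholder plan.
    """
    start = 0
    for p in phases:
        if step < p.end_step:
            frac = (step - start) / (p.end_step - start)
            return round(p.t_start + frac * (p.t_end - p.t_start)), p.t_min
        start = p.end_step
    raise ValueError("step is beyond the last phase")


def sample_timestep(step, rng=random):
    """Sample uniformly between the phase's lower bound and the planned value."""
    planned, t_min = planned_timestep(step)
    return rng.randint(t_min, planned)
```

Calling `sample_timestep(step)` once per training iteration yields timesteps that trend downward across phases while staying randomized within each phase's allowed range.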
Limitations & Future Work
- Generation can fail for highly complex poses or extreme clothing textures, as the framework relies heavily on the pose priors of SMPL-X.
- Human-object and human-human interaction scenarios are not handled.
- The VCR stage uses Stable Diffusion for denoising, which might yield sub-optimal results for non-standard human proportions or cartoon styles.
- It generates only static 3D humans and does not support animation or retargeting to different poses.
- The user study involves a relatively small number of evaluators (24) and prompts (20).
Related Work & Insights
- vs. HumanGaussian: HumanGaussian was the strongest prior 3DGS human method but lacked support for image input and required 1.2h of training. GaussianIP introduces identity preservation and reduces training to 40 minutes.
- vs. DreamWaltz/AvatarVerse: Early NeRF-based methods suffer from a noticeable lack of facial and clothing details, with slower training speeds.
- vs. TADA/X-Oscar: Methods based on SMPL-X meshes have advantages in geometric control, but their texture quality remains inferior to 3DGS-based methods.
- vs. 2D Human Customization (e.g., IP-Adapter): GaussianIP extends the capability of 2D customized human models to 3D space, showing great potential.
Rating
- Novelty: ⭐⭐⭐⭐ HDS's score difference decomposition and VCR's distance-weighted attention fusion make significant theoretical contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprises user study + GPT rating + Face++ verification + detailed ablation, though it lacks standard quantitative metrics (e.g., FID).
- Writing Quality: ⭐⭐⭐⭐ Formulations are complete and the structure is clear, though some derivations could be made more intuitive.
- Value: ⭐⭐⭐⭐⭐ The first high-quality 3D human generation method supporting identity preservation. A 40-minute training time is highly appealing for practical applications.