GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data¶
Conference: NeurIPS 2025 · arXiv: 2411.18624 · Code: Project Page · Area: Image Generation
Keywords: Single-image 3D human reconstruction, diffusion model priors, template-free reconstruction, multi-source data, Score Distillation Sampling, texture refinement
TL;DR¶
GeneMAN proposes a generalizable single-image 3D human reconstruction framework that requires no parametric body model (e.g., SMPL). By training human-specific 2D/3D diffusion prior models on large-scale multi-source human data, and combining a geometry initialization-sculpting pipeline with multi-space texture refinement, GeneMAN achieves high-fidelity 3D human reconstruction from in-the-wild images, handling diverse body proportions, complex poses, and personal belongings.
Background & Motivation¶
Problem Definition¶
Reconstructing high-fidelity 3D human models from a single in-the-wild photograph is a core requirement for applications such as VR/AR, telepresence, digital humans, film, and gaming. However, due to severe 3D information loss, it remains a highly ill-posed problem.
Limitations of Prior Work¶
Template-based methods (PaMIR, ICON, ECON, SiTH, SIFU, TeCH, etc.) rely on parametric body models such as SMPL/SMPL-X as geometric priors, but these models cannot represent 3D details of loose clothing and produce severe artifacts (e.g., bent legs) when pose/shape estimation is inaccurate.
Template-free methods (PIFu, PIFuHD, PHORHUM, etc.) avoid parametric constraints but still suffer from poor texture consistency and insufficient geometric detail due to the lack of adequate human-specific priors.
General image-to-3D methods (Zero-1-to-3, Magic123, etc.) perform well on general objects but lack human-specific priors, resulting in inaccurate body geometry and loss of facial and clothing details.
Three Core Challenges¶
| Challenge | Description |
|---|---|
| Variable body proportions | In-the-wild photos include full-body, half-body, and head-shoulder crops; existing methods primarily target full-body reconstruction |
| Personal belongings | People in everyday photos frequently hold objects, stand on items, or wear accessories, significantly complicating reconstruction |
| Natural pose and texture consistency | The absence of general human-specific geometry/texture models leads to implausible geometry and cross-view texture inconsistency |
Furthermore, the scarcity of high-quality 3D human data exacerbates these challenges.
Method¶
Overall Architecture¶
The GeneMAN pipeline consists of four core components:
Multi-source data curation → Prior model training → Geometry initialization & sculpting → Multi-space texture refinement
1. Multi-Source Human Dataset Construction¶
To improve generalizability, the authors collect over 50K multi-view instances from four data source categories:
| Source | Specific Datasets |
|---|---|
| 3D Scans | RenderPeople, CustomHumans, HuMMan, THuman2.0/3.0, X-Humans, Objaverse human subset |
| Multi-view Videos | DNA-Rendering, ZJU-Mocap, AIST++, Neural Actor, Actors-HQ |
| 2D Images | DeepFashion, LAION-5B |
| Synthetic Augmentation | ControlNet-generated multi-view humans with diverse clothing + image cropping augmentation (covering half-body/close-up proportions) |
2. GeneMAN Prior Models¶
2D Prior: Stable Diffusion V1.5 is fine-tuned on all multi-source human data, with an equal amount of LAION-5B images mixed in to preserve general capability. Trained with AdamW (lr=1e-5) on 4×A100 for 5 days. Provides rich human geometry and texture detail.
3D Prior: Zero-1-to-3 is fine-tuned on 3D scans, multi-view videos, synthetic data, and DeepFashion images, with 20% Objaverse data added to preserve general object reconstruction capability. Trained with AdamW (lr=1e-4) on 8×A100 for 1 week. Ensures multi-view consistency.
Key Designs¶
Geometry Initialization & Sculpting¶
Stage 1: NeRF Initialization
- Instant-NGP is used as the NeRF backbone, with resolution progressively increased from 256 to 384 over 5000 training steps
- Reference-view supervision loss \(\mathcal{L}_{\text{ref}}\): RGB MSE + mask MSE
- Depth/normal priors: inferred by the human foundation model Sapiens, with depth loss (normalized negative Pearson correlation) and normal loss (MSE) applied respectively
- Novel-view guidance: hybrid SDS loss
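The depth and normal supervision above can be sketched in a few lines. This is a minimal NumPy sketch, not the paper's implementation; the function names are hypothetical, and the negative-Pearson form follows the description that the depth loss is a normalized correlation (making it invariant to the scale/shift ambiguity of a monocular depth prior such as Sapiens):

```python
import numpy as np

def depth_loss(pred, target, eps=1e-8):
    """Negative Pearson correlation between rendered depth and the depth prior.

    Scale/shift-invariant, so a monocular depth prior in arbitrary units
    can still supervise the NeRF's metric-free depth.
    """
    p = pred.ravel() - pred.mean()
    t = target.ravel() - target.mean()
    corr = (p * t).sum() / (np.sqrt((p ** 2).sum() * (t ** 2).sum()) + eps)
    return -corr  # in [-1, 1]; minimized when depths are perfectly correlated

def normal_loss(pred, target):
    """Plain MSE between rendered and prior normal maps."""
    return ((pred - target) ** 2).mean()
```

Because only correlation matters, a depth map and any affine transform of it incur (near-)zero loss, which is exactly the desired invariance.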
Stage 2: DMTet Sculpting
- The NeRF is converted to an explicit mesh and used to initialize DMTet, a hybrid representation that couples an SDF on a deformable tetrahedral grid with differentiable mesh extraction
- Optimized for 3000 steps at resolution 512, with MSE + perceptual loss supervising normals; novel-view guidance is provided by HumanNorm pretrained normal/depth-adaptive diffusion models
- SDF regularization loss \(\mathcal{L}_{\text{sdf}}\) prevents excessive geometric deviation
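The SDF regularization term \(\mathcal{L}_{\text{sdf}}\) can be sketched as a penalty tying the optimized SDF to its NeRF-derived initialization. The exact form in the paper may differ; MSE against the initial SDF values is one common choice and is what this hypothetical sketch assumes:

```python
import numpy as np

def sdf_reg_loss(sdf_current, sdf_init):
    """Keep the DMTet SDF close to its NeRF-derived initialization so that
    sculpting refines detail without drifting far from the coarse geometry.
    (Assumed MSE form; the paper's exact regularizer may differ.)"""
    return ((sdf_current - sdf_init) ** 2).mean()
```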
Multi-Space Texture Refinement¶
Latent-Space Optimization (Coarse Texture):
- Hybrid SDS loss (2D + 3D priors) is used to optimize texture representation for 10000 steps
- Reference-view MSE loss ensures consistency with the input image
- Training-free multi-view strategy: the same Gaussian noise is added to renderings from different viewpoints, which are then concatenated into a single image for inference, achieving cross-view texture consistency without retraining the diffusion model
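The training-free strategy in the last bullet can be sketched as follows. This is a minimal NumPy illustration under stated assumptions (a simple additive noise model and a side-by-side grid layout; the paper's tiling and noise schedule may differ), not the actual diffusion-model inference code:

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_noise_views(views, t_sigma):
    """Add the SAME Gaussian noise to every rendered view, then tile the
    noisy views into one image so the diffusion model denoises them jointly,
    encouraging cross-view texture consistency without any retraining.

    `views`: array of shape (V, H, W, C). Returns one (H, V*W, C) image.
    """
    v, h, w, c = views.shape
    noise = rng.standard_normal((h, w, c))      # one noise map shared by all views
    noisy = views + t_sigma * noise[None]       # broadcast to every view
    return np.concatenate(list(noisy), axis=1)  # side-by-side strip
```

Because every view receives identical noise, the denoiser sees correlated corruptions and is biased toward producing mutually consistent textures.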
Pixel-Space Optimization (Fine Texture):
- Uses the SDEdit framework: render coarse texture image → add noise → multi-step denoising with GeneMAN 2D prior + ControlNet to obtain refined image
- UV texture map optimized for 1000 steps with MSE + LPIPS loss
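The SDEdit-style "add noise, then denoise" step can be sketched via the standard DDPM forward process. This is a generic sketch, not GeneMAN's code: the reverse denoising (done in the paper with the GeneMAN 2D prior plus ControlNet) is omitted, and `alpha_bar_t` is the usual cumulative noise-schedule product:

```python
import numpy as np

rng = np.random.default_rng(0)

def sdedit_noise(x0, alpha_bar_t):
    """Forward-diffuse the coarse rendering to an intermediate timestep, as in
    SDEdit: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps. A denoiser would
    then run the reverse steps from x_t, refining detail while the partial
    noising preserves the coarse texture's layout."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
```

Choosing a moderate `alpha_bar_t` trades off fidelity to the coarse texture against freedom for the denoiser to add detail.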
Loss & Training¶
Summary of loss functions involved across all training stages:
| Stage | Loss Function | Role |
|---|---|---|
| NeRF Init | \(\mathcal{L}_{\text{ref}}\) (RGB+mask) | Reference-view reconstruction |
| NeRF Init | \(\mathcal{L}_{\text{depth}}\) (Pearson) | Depth consistency |
| NeRF Init | \(\mathcal{L}_{\text{normal}}\) (MSE) | Normal consistency |
| NeRF/DMTet | \(\mathcal{L}_{\text{guid}}\) (2D+3D SDS) | Novel-view human prior guidance |
| DMTet Sculpting | \(\mathcal{L}_{\text{sdf}}\) | Geometry regularization |
| Coarse Texture | \(\mathcal{L}_{\text{coarse}}\) (ref+SDS) | Consistent texture learning |
| Fine Texture | \(\mathcal{L}_{\text{fine}}\) (MSE+LPIPS) | High-fidelity texture refinement |
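The hybrid SDS guidance \(\mathcal{L}_{\text{guid}}\) summarized above combines gradients from the 2D and 3D priors. A minimal sketch of the (stop-gradient) SDS update, assuming a simple convex blend `lam` between the two priors (the paper may weight or schedule them differently per stage):

```python
import numpy as np

def hybrid_sds_grad(eps_2d, eps_3d, eps_true, w_t, lam=0.5):
    """Hybrid SDS gradient combining 2D and 3D diffusion priors.

    Standard SDS: grad = w(t) * (eps_pred - eps), back-propagated to the 3D
    representation's parameters. `lam` is an assumed blend weight.
    """
    g2 = w_t * (eps_2d - eps_true)   # detail from the human-specific 2D prior
    g3 = w_t * (eps_3d - eps_true)   # multi-view consistency from the 3D prior
    return lam * g2 + (1.0 - lam) * g3
```

When both priors' noise predictions match the injected noise, the gradient vanishes, i.e., the current rendering is already "plausible" to both models.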
Key Experimental Results¶
Main Results: Quantitative Comparison¶
Test set: 50 samples from in-the-wild internet images and the CAPE dataset; 120 viewpoints rendered per method over a 360° rotation.
| Method | in-the-wild PSNR↑ | LPIPS↓ | CLIP-Sim↑ | CAPE PSNR↑ | LPIPS↓ | CLIP-Sim↑ |
|---|---|---|---|---|---|---|
| PIFu | 26.97 | 0.035 | 0.594 | 26.91 | 0.028 | 0.764 |
| GTA | 25.06 | 0.064 | 0.568 | 30.38 | 0.019 | 0.785 |
| TeCH | 25.74 | 0.053 | 0.713 | 27.60 | 0.025 | 0.826 |
| SiTH | 20.41 | 0.129 | 0.608 | 21.99 | 0.048 | 0.815 |
| GeneMAN | 32.24 | 0.013 | 0.730 | 28.49 | 0.015 | 0.838 |
- On in-the-wild images, GeneMAN's PSNR exceeds the second-best method by approximately 5.3 dB, with LPIPS reduced by 63%
- Highest CLIP-Sim score, indicating superior multi-view consistency
- On CAPE, GeneMAN achieves the best LPIPS and CLIP-Sim, with competitive PSNR
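For reference, the PSNR figures in the table follow the standard definition over image MSE (a generic sketch, assuming images normalized to [0, 1]):

```python
import numpy as np

def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = ((img_a - img_b) ** 2).mean()
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

For example, two images differing by a constant 0.1 everywhere have MSE 0.01 and thus PSNR 20 dB.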
User Study¶
40 participants, 30 test cases, 1,200 pairwise comparisons in total. GeneMAN's reconstructions were preferred in 73.08% of comparisons (combined geometry and texture evaluation), far surpassing all baseline methods.
Ablation Study¶
| Ablation | Key Finding |
|---|---|
| Geometry initialization vs. sculpting | DMTet sculpting smooths overly noisy surfaces while recovering high-frequency details such as clothing wrinkles and facial features |
| Latent-space texture vs. pixel-space refinement | Latent-space texture is broadly reasonable but suffers from back-view inconsistency and slight blurriness; pixel-space optimization significantly improves detail |
| GeneMAN 2D prior vs. original DeepFloyd-IF | The original 2D prior causes front-back inconsistency in shirt hems; GeneMAN 2D prior ensures multi-view consistency |
| GeneMAN 3D prior vs. original Zero-1-to-3 | The original 3D prior produces unnatural poses (forward head lean); GeneMAN 3D prior captures more natural body posture |
Key Findings¶
- The template-free design enables GeneMAN to effectively handle loose clothing (skirts, dresses) and personal belongings (e.g., basketballs), avoiding cascading errors from SMPL estimation failures
- Collecting large-scale multi-source data and fine-tuning diffusion models as human priors is the key source of generalization capability
- The training-free multi-view consistency strategy (shared noise + concatenated inference) effectively improves cross-view consistency without additional training overhead
Highlights & Insights¶
- Data-driven prior learning: Rather than relying on hand-crafted parametric models, GeneMAN learns 2D/3D human priors from 50K+ multi-source data — a more scalable paradigm where more data yields stronger priors.
- Complementary hybrid prior design: The 2D prior supplies detail (texture, fine-grained geometry) while the 3D prior enforces consistency (multi-view, natural pose); the two are organically combined via SDS loss, each fulfilling a distinct role.
- Coarse-to-fine hierarchical strategy is applied throughout: NeRF → DMTet for geometry from coarse to fine, and latent space → pixel space for texture from coarse to fine, with the most appropriate representation and supervision at each stage.
- Template-free design eliminates dependence on accurate SMPL estimation, making the framework naturally suited for children, non-standard body shapes, and object occlusion scenarios that existing methods struggle with.
- Training-free multi-view consistency strategy is an elegant engineering design — adding the same noise to batch-rendered views and concatenating them for inference improves cross-view texture consistency without retraining the model.
Limitations & Future Work¶
- Slow inference: The full pipeline takes approximately 1.4 hours on a single A100 80G GPU, including NeRF optimization, DMTet sculpting, and two-stage texture optimization — far from meeting the demands of interactive applications.
- High training cost: The 3D prior requires 8×A100 for one week, and the 2D prior requires 4×A100 for five days, posing a high barrier to entry.
- Reliance on the SDS optimization paradigm: SDS inherently suffers from well-known issues such as over-saturation and mode collapse; while partially mitigated by the hybrid prior design, these issues are not fundamentally resolved.
- Lack of hand/face-specific quantitative evaluation: The reported metrics (PSNR/LPIPS/CLIP-Sim) are global measures and do not specifically assess the reconstruction accuracy of facial expressions or hand poses.
- Dataset bias risk: Synthetic data generated by ControlNet may introduce specific distributional biases; the paper provides no in-depth analysis of how domain gaps across multi-source data are balanced.
Related Work & Insights¶
- LRM (Hong et al., 2023): Demonstrates that transformer-based 3D reconstructors trained on multi-source data exhibit strong generalization, inspiring GeneMAN's data strategy
- DreamCraft3D (Sun et al., 2023): Combines 2D+3D priors in a coarse-to-fine optimization pipeline; GeneMAN extends and deepens this paradigm
- HumanNorm (Huang et al., 2024): Pretrained normal/depth-adaptive diffusion models are directly adopted by GeneMAN for DMTet sculpting guidance
- Sapiens (Khirodkar et al., 2024): A human foundation model that provides normal and depth priors for the reference view, serving as a geometric signal source in GeneMAN
- The overall technical approach can be viewed as a "human-specialized DreamCraft3D", with the core innovation lying in replacing general-domain data with multi-source human data to train stronger domain-specific priors
Rating¶
| Dimension | Score (1–5) | Comments |
|---|---|---|
| Novelty | 3.5 | Individual components (SDS, NeRF→DMTet, SDEdit texture refinement) are human-specialized adaptations of existing methods; core innovation lies in multi-source data-driven prior learning and the template-free design |
| Technical Depth | 4.0 | System design is solid; module selection and loss design at each stage of the multi-stage pipeline are well-motivated |
| Experimental Thoroughness | 4.0 | Comprehensive quantitative, qualitative, user study, and ablation experiments; in-the-wild image tests cover diverse challenging scenarios |
| Writing Quality | 4.0 | Clear structure, rich illustrations, and detailed method descriptions |
| Value | 3.5 | Results are impressive, but the long inference time must be reduced to an interactive level before practical deployment |
| Overall | 3.8 | A solid systems paper that advances single-image human reconstruction through multi-source data, dual priors, and multi-stage optimization |
Related Work & Insights (Comparison Table)¶
| Method | Type | Human Prior | Texture Optimization | Strengths | Limitations |
|---|---|---|---|---|---|
| PIFu / PIFuHD | Template-free | Pixel-aligned implicit field | None | End-to-end, no SMPL dependency | Poor side-view geometry, unrealistic texture |
| PaMIR / ICON / ECON | Template-based | SMPL features | None / Limited | Exploits body structure information | Clothing detail limited by SMPL topology |
| SiTH | Template-based | Fine-tuned diffusion for back-view hallucination | SDEdit | Relatively fast | Depends on HPS accuracy, poor generalization on wild images |
| TeCH | Template-based | SDS optimization | Diffusion refinement | Rich detail | Overly noisy surface, inconsistent texture |
| GTA | Template-based | Transformer triplane | Feed-forward | Fast inference | Depends on SMPL, fails on loose clothing |
| HumanLRM | Template-free | Diffusion-guided feed-forward | Implicit field | No SMPL, feed-forward | Texture inconsistency, insufficient geometric detail |
| GeneMAN | Template-free | 2D+3D diffusion priors fine-tuned on multi-source data | Latent + pixel space | Strong generalization, robust to body proportions and personal belongings | Slow inference (~1.4h) |
Core distinction: GeneMAN simultaneously achieves template-free reconstruction (avoiding cascading SMPL estimation errors) and rich prior learning (diffusion priors trained on 50K+ multi-source data), which is the fundamental reason for its substantial advantage over template-based methods on in-the-wild images.
Additional Insights:
- Data > Architecture: GeneMAN's individual modules (NeRF, DMTet, SDS, SDEdit) are all standard components; its competitive edge stems from domain-specific priors trained on 50K+ multi-source human data. This suggests that in vertical-domain 3D reconstruction, high-quality domain data collection and curation may be more impactful than architectural innovation.
- Generalizability of the hybrid prior paradigm: The dual-prior design — 2D prior for detail + 3D prior for consistency — is transferable to single-image 3D reconstruction in other specific domains (animals, vehicles, architecture, etc.).
- Complementarity with feed-forward methods: GeneMAN follows an optimization-based approach (~1.4h), yielding high quality but slow speed; feed-forward methods such as LRM/Instant3D are fast but quality-limited. A promising future direction is to initialize feed-forward networks with GeneMAN-level priors to combine quality and speed.
- Sapiens as a human foundation model: The paper uses Sapiens to provide depth/normal priors for the reference view, suggesting that foundation models can serve as plug-and-play geometric signal sources for downstream 3D tasks.
- Training-free multi-view consistency trick: Adding the same noise to multi-view renderings and concatenating them for inference is a simple yet effective technique that may generalize to other SDS-based 3D generation tasks.