HumanCrafter: Synergizing Generalizable Human Reconstruction and Semantic 3D Segmentation
- Conference: NeurIPS 2025
- arXiv: 2511.00468
- Code: https://paulpanwang.github.io/HumanCrafter
- Area: 3D Human Reconstruction / 3D Semantic Segmentation
- Keywords: 3D Gaussian Splatting, Human Reconstruction, 3D Semantic Segmentation, Single-Image Reconstruction, Multi-Task Learning, DINOv2
TL;DR
HumanCrafter is proposed as the first feed-forward framework that unifies single-image 3D human reconstruction with body-part semantic segmentation. A human geometry prior-guided Transformer aggregates multi-view features, while DINOv2 self-supervised semantic priors are used to construct a 3D feature field. Within a single model, the method surpasses prior state of the art in both 3D reconstruction and 3D segmentation on 2K2K and THuman2.1.
Background & Motivation
Background: 3D human reconstruction has advanced rapidly in recent years: 3DGS enables real-time rendering, and large-scale reconstruction Transformers (LRM, GRM) achieve feed-forward generalization. However, 3D human semantic segmentation (body-part segmentation) remains an open problem.
Limitations of Prior Work:
- 2D human segmentation models (e.g., Sapiens) cannot guarantee 3D consistency: segmentation results across different viewpoints are incoherent.
- Two-stage pipelines (reconstruct first, then apply 2D segmentation) are inefficient, 3D-inconsistent, and engineering-heavy.
- General-purpose 3D reconstruction models (LGM, GRM) lack human-specific priors, resulting in poor reconstruction quality at joints and clothing details.
- 3D human semantic datasets are scarce: annotated multi-view body-part segmentation data is largely unavailable.
Core Idea: Build a unified framework in which 3D reconstruction and 3D semantic segmentation mutually benefit by sharing 3D Gaussian parameters: the reconstruction task provides geometric constraints, and the segmentation task provides semantic regularization.
Method
Overall Architecture
Given a single RGB image \(\mathbf{I} \in \mathbb{R}^{H \times W \times 3}\) as input, the framework outputs semantically enriched 3D Gaussian primitives (VersatileSplats) that simultaneously support novel-view synthesis and body-part segmentation.
1. Human Prior-Guided Feature Aggregation (Sec 3.1)
Diffusion Prior: A pretrained SV3D model generates multi-view images, guided geometrically by SMPL side-view normal maps. Multi-view images are concatenated with Plücker embeddings and divided into patch tokens \(\mathbf{F}_i \in \mathbb{R}^{(h \times w) \times d_1}\).
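To make the token construction concrete, below is a minimal sketch of per-pixel Plücker ray embeddings under a pinhole camera model; the function name, camera conventions, and tensor shapes are illustrative assumptions, not the paper's code.

```python
import torch

def plucker_embedding(K_inv, c2w, H, W):
    """Per-pixel Plücker ray embeddings (o x d, d) for one view.
    K_inv: (3, 3) inverse intrinsics; c2w: (4, 4) camera-to-world pose."""
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()  # (H, W, 3) homogeneous pixels
    dirs = (pix @ K_inv.T) @ c2w[:3, :3].T                            # rays rotated into the world frame
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)                     # unit directions d
    origins = c2w[:3, 3].expand_as(dirs)                              # camera center o, broadcast per pixel
    moments = torch.cross(origins, dirs, dim=-1)                      # moments o x d
    return torch.cat([moments, dirs], dim=-1)                         # (H, W, 6) Plücker ray map
```

Each view's 6-channel ray map is concatenated channel-wise with its RGB image before patchification, so every token carries both appearance and camera-ray information.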
Cross-View Attention: \(N_f\) Grouped-Query Attention (GQA) blocks with RMSNorm pre-normalization, GELU activations, and feed-forward networks (FFN) enable cross-view feature interaction.
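A self-contained sketch of one such cross-view block is given below, assuming standard grouped-query attention with the tokens of all views concatenated along the sequence dimension; the widths, head counts, and use of torch.nn.RMSNorm (PyTorch ≥ 2.4) are illustrative choices, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQABlock(nn.Module):
    """One cross-view block: RMSNorm pre-normalization, grouped-query
    attention (fewer K/V heads than query heads), and a GELU FFN."""
    def __init__(self, dim=768, n_heads=12, n_kv_heads=4, ffn_mult=4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.norm1, self.norm2 = nn.RMSNorm(dim), nn.RMSNorm(dim)
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.kv_proj = nn.Linear(dim, 2 * n_kv_heads * self.head_dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim), nn.GELU(), nn.Linear(ffn_mult * dim, dim)
        )

    def forward(self, x):  # x: (B, n_views * tokens_per_view, dim), all views concatenated
        B, L, _ = x.shape
        h = self.norm1(x)
        q = self.q_proj(h).view(B, L, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(h).view(B, L, 2, self.n_kv_heads, self.head_dim).unbind(2)
        rep = self.n_heads // self.n_kv_heads
        k = k.transpose(1, 2).repeat_interleave(rep, dim=1)  # share K/V heads across query groups
        v = v.transpose(1, 2).repeat_interleave(rep, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v)       # every token attends across all views
        x = x + self.out_proj(attn.transpose(1, 2).reshape(B, L, -1))
        return x + self.ffn(self.norm2(x))
```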
Depth Prediction and 3D Localization: Depth maps \(\mathbf{D}_i\) and 3D offsets \(\boldsymbol{\Delta}_i\) are predicted for each view, and 3D Gaussian centers are obtained by back-projecting the per-pixel depths along the camera rays and adding the offsets.
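A plausible form of this back-projection, writing \(\mathbf{o}_i\) for the camera center of view \(i\) and \(\mathbf{r}_i(u)\) for the unit ray through pixel \(u\) (the paper's exact parameterization may differ):

\[
\boldsymbol{\mu}_i(u) \;=\; \mathbf{o}_i \;+\; \mathbf{D}_i(u)\,\mathbf{r}_i(u) \;+\; \boldsymbol{\Delta}_i(u).
\]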
2. Self-Supervised Models as Inductive Bias (Sec 3.2)
A frozen DINOv2-ViT-s14-reg extracts semantic features \(\mathbf{f}_i \in \mathbb{R}^{(h \times w) \times d_2}\) from the input image.
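For reference, a minimal sketch of extracting such frozen patch features via the public DINOv2 torch.hub entry point; the input resolution and reshaping here are illustrative and may differ from the paper's preprocessing.

```python
import torch

# Frozen DINOv2 ViT-S/14 with registers (feature dim d2 = 384); requires
# network access to the facebookresearch/dinov2 hub repo.
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14_reg").eval()

img = torch.randn(1, 3, 518, 518)  # input resized to a multiple of the 14-px patch size
with torch.no_grad():
    tokens = dino.forward_features(img)["x_norm_patchtokens"]  # (1, 37 * 37, 384)

h = w = img.shape[-1] // 14
f_i = tokens.reshape(1, h, w, 384)  # spatially re-aligned semantic feature map f_i
```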
Pixel-Aligned Aggregation: The cross-view attention weights learned by the preceding Transformer stage are reused directly to compute a weighted combination of DINOv2 features, so no attention has to be relearned.
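A plausible instantiation, assuming standard scaled dot-product attention with the DINOv2 features substituted as the values (the paper's exact formulation is not reproduced here):

\[
\hat{\mathbf{f}} \;=\; \operatorname{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}}\right)\mathbf{V}.
\]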
Here \(\mathbf{Q}\) and \(\mathbf{K}\) are derived from the positional associations already learned by the reconstruction Transformer, while \(\mathbf{V}\) is provided by the DINOv2 features.
Semantic 3D Gaussians: Each Gaussian primitive is augmented with a learnable semantic embedding decoded via a \(1 \times 1\) convolution, and the embeddings are rendered into a 2D semantic feature map by Gaussian splatting.
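Assuming the standard 3DGS alpha compositing is applied to the per-Gaussian semantic embeddings \(\mathbf{s}_k\) with blending weights \(\alpha_k\) along each camera ray (a sketch of the usual form, not the paper's exact notation), the rendered semantic map at pixel \(\mathbf{p}\) would be:

\[
\mathbf{S}(\mathbf{p}) \;=\; \sum_{k=1}^{K} \mathbf{s}_k\,\alpha_k \prod_{j=1}^{k-1}\left(1-\alpha_j\right).
\]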
3. Multi-Task Training Objectives (Sec 3.3)
- \(\mathcal{L}_{\text{render}} = \mathcal{L}_{\text{mse}} + \lambda_m \mathcal{L}_{\text{mask}} + \lambda_p \mathcal{L}_{\text{LPIPS}}\) (rendering loss)
- \(\mathcal{L}_{\text{dist}}\): DINOv2 feature distillation (cosine similarity, self-supervised signal)
- \(\mathcal{L}_{\text{CE}}\): cross-entropy loss (applied only to annotated views; 28 body-part classes + background)
- Hyperparameters: \(\lambda_m=1,\ \lambda_p=0.1,\ \lambda_{\text{dist}}=0.5,\ \lambda_{\text{seg}}=0.5\) (a sketch of the combined objective follows this list)
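Assuming the listed weights combine the three terms additively (a sketch; the paper's exact composition may differ), the overall objective would read:

\[
\mathcal{L} \;=\; \mathcal{L}_{\text{render}} \;+\; \lambda_{\text{dist}}\,\mathcal{L}_{\text{dist}} \;+\; \lambda_{\text{seg}}\,\mathcal{L}_{\text{CE}}.
\]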
Semantic Annotation Dataset Construction
500 scans are selected from the training data; each scan is annotated with 8 semantic segmentation maps via an interactive annotation pipeline, yielding high-quality data–label pairs.
Key Experimental Results
3D Human Segmentation (2K2K Dataset)
| Method | Input | mIoU↑ | Acc.↑ | PSNR↑ | Time |
|---|---|---|---|---|---|
| LSM* | 2-view | 0.724 | 0.873 | 23.81 | 108ms/obj |
| Sapiens | 2D per-frame | 0.823 | 0.904 | N/A | 640ms/frame |
| HumanCrafter | 2-view | 0.840 | 0.925 | 24.79 | 126ms/obj |
| Human3Diff+Sapiens | 1-view | 0.781 | 0.851 | 21.83 | 23.21s/obj |
| HumanCrafter | 1-view | 0.801 | 0.882 | 23.49 | 6.24s/obj |
Under the two-view setting, mIoU surpasses the 2D Sapiens model (0.840 vs. 0.823) while maintaining 3D consistency; under the single-view setting, HumanCrafter also substantially outperforms the two-stage Human3Diffusion + Sapiens pipeline.
3D Human Reconstruction (512×512 Resolution)
| Method | THuman2.1 PSNR↑ | THuman2.1 SSIM↑ | THuman2.1 LPIPS↓ | 2K2K PSNR↑ | 2K2K SSIM↑ | 2K2K LPIPS↓ |
|---|---|---|---|---|---|---|
| LGM | 20.11 | 0.859 | 0.196 | 21.69 | 0.850 | 0.166 |
| GRM | 20.50 | 0.868 | 0.141 | 21.50 | 0.858 | 0.171 |
| Human3Diffusion | 22.16 | 0.872 | 0.063 | 22.32 | 0.882 | 0.053 |
| PSHuman | 20.85 | 0.862 | 0.076 | 21.93 | 0.892 | 0.076 |
| HumanCrafter | 23.19 | 0.907 | 0.046 | 23.49 | 0.916 | 0.045 |
Reconstruction quality is superior across the board: on THuman2.1, PSNR improves by more than 1 dB over the best baseline and LPIPS drops by 27% (0.046 vs. 0.063).
Ablation Study
| Configuration | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Full model | 23.49 | 0.916 | 0.045 |
| w/o SMPL prior | 22.20 | 0.890 | 0.064 |
| DINOv2 replaced with MAE | 22.03 | 0.891 | 0.055 |
| w/o pixel-aligned aggregation | 21.18 | 0.891 | 0.067 |
| w/o \(\mathcal{L}_{\text{dist}}\) | 22.46 | 0.896 | 0.055 |
| w/o \(\mathcal{L}_{\text{CE}}\) | 23.22 | 0.901 | 0.051 |
Removing pixel-aligned aggregation causes the largest drop (−2.31 PSNR), and the SMPL prior contributes a further +1.29 PSNR; the segmentation loss \(\mathcal{L}_{\text{CE}}\) in turn improves reconstruction quality (+0.27 PSNR), confirming the mutual benefit between the two tasks.
Highlights & Insights
- First unified framework for 3D reconstruction + 3D segmentation: shared Gaussian parameters enable mutual task reinforcement.
- Pixel-aligned aggregation: DINOv2 features are propagated by reusing the reconstruction Transformer's attention weights, introducing zero additional parameters.
- Self-supervised to 3D: DINOv2's 2D features are distilled into a 3D-consistent semantic field, elegantly addressing the scarcity of 3D annotations.
- High practicality: a complete 3D Gaussian representation is produced from a single image in 6.24 s, directly integrable into VR pipelines and 3D editing workflows.
Limitations & Future Work
- Reliance on SV3D for multi-view image generation (~6 s) is the primary speed bottleneck; faster multi-view generation methods should be explored.
- Only a modest amount of annotated data (500 scans × 8 views) is used; expanding the annotations could further improve segmentation quality.
- The 28-class body-part segmentation could be refined to finer granularity (e.g., finger-level).
- Quantitative segmentation performance under real-world challenging scenarios (extreme poses, occlusions, multi-person interaction) has not been evaluated.
Related Work & Insights
- The "reuse Q/K, replace V" strategy of pixel-aligned aggregation is generalizable to other multi-task 3D systems.
- The finding that 3D segmentation regularization improves reconstruction quality suggests a new training paradigm for 3D generative models.
- The FLUX-inpainting-based 3D editing application demonstrates a viable productization pathway.
Rating
- Novelty: ⭐⭐⭐⭐ The unified framework and pixel-aligned aggregation design are elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Two datasets, comprehensive ablations, and in-the-wild qualitative evaluation.
- Writing Quality: ⭐⭐⭐⭐ Architecture diagrams are clear and the method is described in a well-organized manner.
- Value: ⭐⭐⭐⭐⭐ Pioneers the unified direction of 3D human reconstruction and understanding with strong practical applicability.