
HumanCrafter: Synergizing Generalizable Human Reconstruction and Semantic 3D Segmentation

Conference: NeurIPS 2025
arXiv: 2511.00468
Code: https://paulpanwang.github.io/HumanCrafter
Area: 3D Human Reconstruction / 3D Semantic Segmentation
Keywords: 3D Gaussian Splatting, Human Reconstruction, 3D Semantic Segmentation, Single-Image Reconstruction, Multi-Task Learning, DINOv2

TL;DR

HumanCrafter is proposed as the first feed-forward framework that unifies single-image 3D human reconstruction with body-part semantic segmentation. A human geometry prior-guided Transformer aggregates multi-view features, while DINOv2 self-supervised semantic priors construct a 3D feature field. The method simultaneously surpasses existing SOTA in both 3D reconstruction and segmentation on 2K2K and THuman2.1.

Background & Motivation

Background: 3D human reconstruction has advanced rapidly in recent years — 3DGS enables real-time rendering, and large-scale reconstruction Transformers (LRM, GRM) achieve feed-forward generalization. However, 3D human semantic segmentation (body-part segmentation) remains an open problem.

Limitations of Prior Work:
  • 2D human segmentation models (e.g., Sapiens) cannot guarantee 3D consistency: segmentation results across different viewpoints are incoherent.
  • Two-stage pipelines (reconstruct first, then apply 2D segmentation) are inefficient, 3D-inconsistent, and engineering-heavy.
  • General-purpose 3D reconstruction models (LGM, GRM) lack human-specific priors, resulting in poor reconstruction quality at joints and clothing details.
  • 3D human semantic datasets are scarce; annotated multi-view body-part segmentation data is largely unavailable.

Core Idea: Build a unified framework in which 3D reconstruction and 3D semantic segmentation mutually benefit — sharing 3D Gaussian parameters, where the reconstruction task provides geometric constraints and the segmentation task provides semantic regularization.

Method

Overall Architecture

Given a single RGB image \(\mathbf{I} \in \mathbb{R}^{H \times W \times 3}\) as input, the framework outputs semantically enriched 3D Gaussian primitives (VersatileSplats) that simultaneously support novel-view synthesis and body-part segmentation.

1. Human Prior-Guided Feature Aggregation (Sec 3.1)

Diffusion Prior: A pretrained SV3D model generates multi-view images, guided geometrically by SMPL side-view normal maps. Multi-view images are concatenated with Plücker embeddings and divided into patch tokens \(\mathbf{F}_i \in \mathbb{R}^{(h \times w) \times d_1}\).

Cross-View Attention: \(N_f\) Grouped Query Attention (GQA) blocks, each with RMS pre-normalization, GELU activation, and an FFN, enable feature interaction across views.
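
A minimal PyTorch sketch of one such block, under illustrative assumptions (512-dim tokens, 8 query heads sharing 2 KV heads, a 4× GELU FFN, no attention bias); the paper's exact sizes are not reproduced here, and `nn.RMSNorm` requires PyTorch ≥ 2.4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossViewBlock(nn.Module):
    """One cross-view block: RMS pre-norm -> grouped-query attention over the
    concatenated tokens of all views -> GELU feed-forward network.
    All dimensions are illustrative, not the paper's exact values."""

    def __init__(self, dim=512, n_heads=8, n_kv_heads=2, ffn_mult=4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.norm1, self.norm2 = nn.RMSNorm(dim), nn.RMSNorm(dim)
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim), nn.GELU(), nn.Linear(ffn_mult * dim, dim))

    def forward(self, tokens):                     # tokens: (B, V*h*w, dim), all views concatenated
        B, N, _ = tokens.shape
        x = self.norm1(tokens)
        q = self.q_proj(x).view(B, N, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, N, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, N, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Grouped-query attention: each group of query heads shares one KV head.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v)          # (B, H, N, head_dim)
        tokens = tokens + self.o_proj(out.transpose(1, 2).reshape(B, N, -1))
        return tokens + self.ffn(self.norm2(tokens))
```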

Depth Prediction and 3D Localization: Depth maps \(\mathbf{D}_i\) and 3D offsets \(\boldsymbol{\Delta}_i\) are predicted; 3D Gaussian centers are obtained via back-projection:

\[\boldsymbol{\mu}_p = \mathbf{R}_i^\top \left( \mathbf{D}_i[u,v]\, \mathbf{K}^{-1} [u, v, 1]^\top - \mathbf{t}_i \right) + \boldsymbol{\Delta}_i[u,v]\]
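
A sketch of this lifting step under the usual pinhole and world-to-camera conventions; variable names and shapes are illustrative, and the per-pixel offsets \(\boldsymbol{\Delta}_i\) are assumed to come from the same prediction head as the depth.

```python
import torch

def backproject_gaussian_centers(depth, offsets, K, R, t):
    """Lift a predicted depth map to 3D Gaussian centers in world space.

    depth:   (H, W)    per-pixel depth D_i
    offsets: (H, W, 3) predicted 3D offsets Delta_i
    K:       (3, 3)    camera intrinsics
    R, t:    (3, 3), (3,)  world-to-camera rotation / translation of view i
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()   # (H, W, 3) homogeneous pixels
    cam = (pix @ torch.linalg.inv(K).T) * depth[..., None]          # camera-space points
    world = (cam - t) @ R                                           # applies R^T (x_cam - t)
    return world + offsets                                          # (H, W, 3) centers mu_p
```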

2. Self-Supervised Models as Inductive Bias (Sec 3.2)

A frozen DINOv2-ViT-s14-reg extracts semantic features \(\mathbf{f}_i \in \mathbb{R}^{(h \times w) \times d_2}\) from the input image.
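
A minimal way to obtain these frozen features, assuming the public facebookresearch/dinov2 torch.hub entry point (the output key name follows that repository; \(d_2 = 384\) for ViT-S):

```python
import torch

# Frozen DINOv2 ViT-S/14 with registers as the source of 2D semantic priors.
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14_reg")
dino.eval().requires_grad_(False)

@torch.no_grad()
def extract_semantic_tokens(image):            # image: (B, 3, H, W), H and W multiples of 14
    out = dino.forward_features(image)
    return out["x_norm_patchtokens"]           # (B, (H/14)*(W/14), 384) patch features f_i
```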

Pixel-Aligned Aggregation — The cross-view attention weights learned by the preceding Transformer stage are directly reused to compute a weighted combination of DINOv2 features, without relearning attention:

\[\text{CrossAttn}(\mathbf{f}_i) = \text{SoftMax}\left(\frac{\mathbf{Q}(\mathbf{F}_i)\mathbf{K}(\mathbf{F}_i)^\top}{\sqrt{d_k}} + \mathbf{B}\right) \mathbf{f}_i\]

Here \(\mathbf{Q}\) and \(\mathbf{K}\) are derived from the positional associations already learned by the reconstruction Transformer, while \(\mathbf{V}\) is provided by the DINOv2 features.
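
A sketch of this "reuse Q/K, replace V" step, assuming access to the already-trained query and key projections of the reconstruction block; the bias \(\mathbf{B}\) is passed through unchanged, and no new attention parameters are introduced.

```python
import torch

def pixel_aligned_aggregation(q_proj, k_proj, recon_tokens, dino_feats, bias=None):
    """Aggregate DINOv2 features with attention weights computed from the
    reconstruction tokens: Q and K come from F_i, the values are f_i.

    recon_tokens: (B, N, d1)  multi-view tokens F_i from the reconstruction branch
    dino_feats:   (B, N, d2)  pixel-aligned DINOv2 features f_i, used as values
    q_proj, k_proj: the projections reused from the reconstruction Transformer
    bias:         optional (N, N) attention bias B
    """
    q, k = q_proj(recon_tokens), k_proj(recon_tokens)               # (B, N, d_k)
    logits = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)         # (B, N, N)
    if bias is not None:
        logits = logits + bias
    return logits.softmax(dim=-1) @ dino_feats                      # (B, N, d2) aggregated semantics
```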

Semantic 3D Gaussians — Each Gaussian primitive is augmented with a learnable semantic embedding decoded via a \(1 \times 1\) convolution. The rendered semantic feature map is:

\[f = \sum_{i=1}^{N} \mathbf{M}\tilde{\mathbf{f}}_i \boldsymbol{\sigma}_i \prod_{j=1}^{i-1}(1 - \boldsymbol{\sigma}_j)\]
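
In practice this compositing happens inside the Gaussian rasterizer; the per-ray math it implements looks like the sketch below, where the \(1\times 1\)-conv decoding (the \(\mathbf{M}\) term) is assumed to have been applied already and the Gaussians are sorted front to back.

```python
import torch

def composite_semantic_features(feats, alphas):
    """Front-to-back alpha compositing of per-Gaussian semantic embeddings along
    one ray: f = sum_i f_i * sigma_i * prod_{j<i} (1 - sigma_j).

    feats:  (N, d) decoded semantic embeddings of the N Gaussians hit by the ray
    alphas: (N,)   per-Gaussian opacities sigma_i after the 2D Gaussian falloff
    """
    transmittance = torch.cumprod(
        torch.cat([alphas.new_ones(1), 1.0 - alphas[:-1]]), dim=0)   # prod_{j<i} (1 - sigma_j)
    weights = alphas * transmittance                                  # (N,)
    return (weights[:, None] * feats).sum(dim=0)                      # (d,) rendered feature
```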

3. Multi-Task Training Objectives (Sec 3.3)

\[\mathcal{L}(\boldsymbol{\Theta}) = \mathbb{E}_{i}\left[\mathcal{L}_{\text{render}} + \lambda_{\text{dist}} \cdot \mathcal{L}_{\text{dist}}(\mathbf{f}_i, \hat{\mathbf{f}}_i)\right] + \lambda_{\text{seg}} \cdot \mathbb{E}_{j}\left[\mathcal{L}_{\text{CE}}(\mathbf{S}_j, \hat{\mathbf{S}}_j)\right]\]
  • \(\mathcal{L}_{\text{render}} = \mathcal{L}_{\text{mse}} + \lambda_m \mathcal{L}_{\text{mask}} + \lambda_p \mathcal{L}_{\text{LPIPS}}\) (rendering loss)
  • \(\mathcal{L}_{\text{dist}}\): DINOv2 feature distillation (cosine similarity, self-supervised signal)
  • \(\mathcal{L}_{\text{CE}}\): cross-entropy loss (applied only to annotated views; 28 body-part classes + background)
  • Hyperparameters: \(\lambda_m=1,\ \lambda_p=0.1,\ \lambda_{\text{dist}}=0.5,\ \lambda_{\text{seg}}=0.5\); the terms are combined as in the sketch below
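
A sketch of how these terms might be combined with the listed weights; the LPIPS term uses the off-the-shelf `lpips` package and the mask term is written as an MSE on the rendered alpha, both of which are stand-in assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips; stand-in for the perceptual term

lpips_fn = lpips.LPIPS(net="vgg")
LAMBDA_M, LAMBDA_P, LAMBDA_DIST, LAMBDA_SEG = 1.0, 0.1, 0.5, 0.5

def humancrafter_loss(pred_rgb, gt_rgb, pred_alpha, gt_mask,
                      pred_feat, dino_feat, seg_logits=None, gt_seg=None):
    """Multi-task objective: rendering + DINOv2 feature distillation,
    plus cross-entropy on the views that carry part annotations."""
    # Rendering loss: MSE + mask + LPIPS (LPIPS expects inputs in [-1, 1]).
    l_render = (F.mse_loss(pred_rgb, gt_rgb)
                + LAMBDA_M * F.mse_loss(pred_alpha, gt_mask)
                + LAMBDA_P * lpips_fn(pred_rgb * 2 - 1, gt_rgb * 2 - 1).mean())
    # Distillation: cosine distance between rendered and DINOv2 feature maps.
    l_dist = (1.0 - F.cosine_similarity(pred_feat, dino_feat, dim=-1)).mean()
    loss = l_render + LAMBDA_DIST * l_dist
    # Segmentation: 29-way CE (28 body parts + background), annotated views only.
    if seg_logits is not None:
        loss = loss + LAMBDA_SEG * F.cross_entropy(seg_logits, gt_seg)
    return loss
```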

Semantic Annotation Dataset Construction

500 scans are selected from the training data; each scan is annotated with 8 semantic segmentation maps via an interactive annotation pipeline, yielding high-quality data–label pairs.

Key Experimental Results

3D Human Segmentation (2K2K Dataset)

| Method | Input | mIoU↑ | Acc.↑ | PSNR↑ | Time |
| --- | --- | --- | --- | --- | --- |
| LSM* | 2-view | 0.724 | 0.873 | 23.81 | 108ms/obj |
| Sapiens | 2D per-frame | 0.823 | 0.904 | N/A | 640ms/frame |
| HumanCrafter | 2-view | 0.840 | 0.925 | 24.79 | 126ms/obj |
| Human3Diff+Sapiens | 1-view | 0.781 | 0.851 | 21.83 | 23.21s/obj |
| HumanCrafter | 1-view | 0.801 | 0.882 | 23.49 | 6.24s/obj |

Under the two-view setting, mIoU surpasses the 2D Sapiens model (0.840 vs. 0.823) while maintaining 3D consistency. The single-view setting also substantially outperforms the two-stage pipeline.

3D Human Reconstruction (512×512 Resolution)

| Method | THuman2.1 PSNR↑ | THuman2.1 SSIM↑ | THuman2.1 LPIPS↓ | 2K2K PSNR↑ | 2K2K SSIM↑ | 2K2K LPIPS↓ |
| --- | --- | --- | --- | --- | --- | --- |
| LGM | 20.11 | 0.859 | 0.196 | 21.69 | 0.850 | 0.166 |
| GRM | 20.50 | 0.868 | 0.141 | 21.50 | 0.858 | 0.171 |
| Human3Diffusion | 22.16 | 0.872 | 0.063 | 22.32 | 0.882 | 0.053 |
| PSHuman | 20.85 | 0.862 | 0.076 | 21.93 | 0.892 | 0.076 |
| HumanCrafter | 23.19 | 0.907 | 0.046 | 23.49 | 0.916 | 0.045 |

Reconstruction quality is superior across the board: PSNR improves by over 1 dB on both datasets, and LPIPS drops by 27% relative to the best prior method on THuman2.1 (0.046 vs. 0.063).

Ablation Study

| Configuration | PSNR↑ | SSIM↑ | LPIPS↓ |
| --- | --- | --- | --- |
| Full model | 23.49 | 0.916 | 0.045 |
| w/o SMPL prior | 22.20 | 0.890 | 0.064 |
| DINOv2 replaced by MAE | 22.03 | 0.891 | 0.055 |
| w/o pixel-aligned aggregation | 21.18 | 0.891 | 0.067 |
| w/o \(\mathcal{L}_{\text{dist}}\) | 22.46 | 0.896 | 0.055 |
| w/o \(\mathcal{L}_{\text{CE}}\) | 23.22 | 0.901 | 0.051 |

The SMPL prior contributes the largest gain (+1.29 PSNR); the segmentation loss \(\mathcal{L}_{\text{CE}}\) in turn improves reconstruction quality (+0.27 PSNR), confirming mutual benefit between the two tasks.

Highlights & Insights

  1. First unified framework for 3D reconstruction + 3D segmentation — shared Gaussian parameters enable mutual task reinforcement.
  2. Pixel-aligned aggregation — DINOv2 features are propagated by reusing the reconstruction Transformer's attention weights, introducing zero additional parameters.
  3. Self-supervised to 3D — DINOv2 2D features are distilled into a 3D-consistent semantic field, elegantly addressing the scarcity of 3D annotations.
  4. High practicality — a complete 3D Gaussian representation is produced from a single image in 6.24s, directly integrable into VR pipelines and 3D editing workflows.

Limitations & Future Work

  1. Reliance on SV3D for multi-view image generation (~6s) constitutes the primary speed bottleneck — faster multi-view generation methods should be explored.
  2. Only about 4,000 annotated views (500 scans × 8 views) are used; expanding the annotation data could further improve segmentation quality.
  3. The 28-class body-part segmentation could be refined to finer granularity (e.g., finger-level).
  4. Quantitative segmentation performance under real-world challenging scenarios (extreme poses, occlusions, multi-person interaction) has not been evaluated.
  • The "reuse Q/K, replace V" strategy of pixel-aligned aggregation is generalizable to other multi-task 3D systems.
  • The finding that 3D segmentation regularization improves reconstruction quality suggests a new training paradigm for 3D generative models.
  • The FLUX-inpainting-based 3D editing application demonstrates a viable productization pathway.

Rating

  • Novelty: ⭐⭐⭐⭐ The unified framework and pixel-aligned aggregation design are elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Two datasets, comprehensive ablations, and in-the-wild qualitative evaluation.
  • Writing Quality: ⭐⭐⭐⭐ Architecture diagrams are clear and the method is described in a well-organized manner.
  • Value: ⭐⭐⭐⭐⭐ Pioneers the unified direction of 3D human reconstruction and understanding with strong practical applicability.