
HumanCrafter: Synergizing Generalizable Human Reconstruction and Semantic 3D Segmentation

Conference: NeurIPS 2025
arXiv: 2511.00468
Code: https://paulpanwang.github.io/HumanCrafter
Area: 3D Human Reconstruction / 3D Semantic Segmentation
Keywords: 3D Gaussian Splatting, Human Reconstruction, 3D Semantic Segmentation, Single-Image Reconstruction, Multi-Task Learning, DINOv2

TL;DR

HumanCrafter is proposed as the first feed-forward framework that unifies single-image 3D human reconstruction with body-part semantic segmentation. A human geometry prior-guided Transformer aggregates multi-view features, while DINOv2 self-supervised semantic priors construct a 3D feature field. The method simultaneously surpasses existing SOTA in both 3D reconstruction and segmentation on 2K2K and THuman2.1.

Background & Motivation

Background: 3D human reconstruction has advanced rapidly in recent years — 3DGS enables real-time rendering, and large-scale reconstruction Transformers (LRM, GRM) achieve feed-forward generalization. However, 3D human semantic segmentation (body-part segmentation) remains an open problem.

Limitations of Prior Work:
  • 2D human segmentation models (e.g., Sapiens) cannot guarantee 3D consistency: segmentation results across different viewpoints are incoherent.
  • Two-stage pipelines (reconstruct first, then apply 2D segmentation) are inefficient, 3D-inconsistent, and engineering-heavy.
  • General-purpose 3D reconstruction models (LGM, GRM) lack human-specific priors, resulting in poor reconstruction quality at joints and clothing details.
  • 3D human semantic datasets are scarce; annotated multi-view body-part segmentation data is largely unavailable.

Core Idea: Build a unified framework in which 3D reconstruction and 3D semantic segmentation mutually benefit — sharing 3D Gaussian parameters, where the reconstruction task provides geometric constraints and the segmentation task provides semantic regularization.

Method

Overall Architecture

Given a single RGB image \(\mathbf{I} \in \mathbb{R}^{H \times W \times 3}\) as input, the framework outputs semantically enriched 3D Gaussian primitives (VersatileSplats) that simultaneously support novel-view synthesis and body-part segmentation.

1. Human Prior-Guided Feature Aggregation (Sec 3.1)

Diffusion Prior: A pretrained SV3D model generates multi-view images, guided geometrically by SMPL side-view normal maps. Multi-view images are concatenated with Plücker embeddings and divided into patch tokens \(\mathbf{F}_i \in \mathbb{R}^{(h \times w) \times d_1}\).

Cross-View Attention: \(N_f\) Grouped Query Attention (GQA) blocks, each with RMS pre-normalization, GELU activation, and an FFN, enable feature interaction across views.
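
A minimal PyTorch sketch of one such block, under illustrative assumptions (512-dim tokens, 8 query heads sharing 2 KV heads, a 4× GELU FFN, no attention bias); the paper's exact sizes are not reproduced here, and `nn.RMSNorm` requires PyTorch ≥ 2.4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossViewBlock(nn.Module):
    """One cross-view block: RMS pre-norm -> grouped-query attention over the
    concatenated tokens of all views -> GELU feed-forward network.
    All dimensions are illustrative, not the paper's exact values."""

    def __init__(self, dim=512, n_heads=8, n_kv_heads=2, ffn_mult=4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.norm1, self.norm2 = nn.RMSNorm(dim), nn.RMSNorm(dim)
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim), nn.GELU(), nn.Linear(ffn_mult * dim, dim))

    def forward(self, tokens):                     # tokens: (B, V*h*w, dim), all views concatenated
        B, N, _ = tokens.shape
        x = self.norm1(tokens)
        q = self.q_proj(x).view(B, N, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, N, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, N, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Grouped-query attention: each group of query heads shares one KV head.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v)          # (B, H, N, head_dim)
        tokens = tokens + self.o_proj(out.transpose(1, 2).reshape(B, N, -1))
        return tokens + self.ffn(self.norm2(tokens))
```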

Depth Prediction and 3D Localization: Depth maps \(\mathbf{D}_i\) and 3D offsets \(\boldsymbol{\Delta}_i\) are predicted; 3D Gaussian centers are obtained via back-projection:

\[\boldsymbol{\mu}_p = \mathbf{R}_i^\top \left( \mathbf{D}_i[u,v]\, \mathbf{K}^{-1} [u, v, 1]^\top - \mathbf{t}_i \right) + \boldsymbol{\Delta}_i[u,v]\]
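
A sketch of this lifting step under the usual pinhole and world-to-camera conventions; variable names and shapes are illustrative, and the per-pixel offsets \(\boldsymbol{\Delta}_i\) are assumed to come from the same prediction head as the depth.

```python
import torch

def backproject_gaussian_centers(depth, offsets, K, R, t):
    """Lift a predicted depth map to 3D Gaussian centers in world space.

    depth:   (H, W)    per-pixel depth D_i
    offsets: (H, W, 3) predicted 3D offsets Delta_i
    K:       (3, 3)    camera intrinsics
    R, t:    (3, 3), (3,)  world-to-camera rotation / translation of view i
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()   # (H, W, 3) homogeneous pixels
    cam = (pix @ torch.linalg.inv(K).T) * depth[..., None]          # camera-space points
    world = (cam - t) @ R                                           # applies R^T (x_cam - t)
    return world + offsets                                          # (H, W, 3) centers mu_p
```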

2. Self-Supervised Models as Inductive Bias (Sec 3.2)

A frozen DINOv2-ViT-s14-reg extracts semantic features \(\mathbf{f}_i \in \mathbb{R}^{(h \times w) \times d_2}\) from the input image.
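
A minimal way to obtain these frozen features, assuming the public facebookresearch/dinov2 torch.hub entry point (the output key name follows that repository; \(d_2 = 384\) for ViT-S):

```python
import torch

# Frozen DINOv2 ViT-S/14 with registers as the source of 2D semantic priors.
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14_reg")
dino.eval().requires_grad_(False)

@torch.no_grad()
def extract_semantic_tokens(image):            # image: (B, 3, H, W), H and W multiples of 14
    out = dino.forward_features(image)
    return out["x_norm_patchtokens"]           # (B, (H/14)*(W/14), 384) patch features f_i
```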

Pixel-Aligned Aggregation — The cross-view attention weights learned by the preceding Transformer stage are directly reused to compute a weighted combination of DINOv2 features, without relearning attention:

\[\text{CrossAttn}(\mathbf{f}_i) = \text{SoftMax}\left(\frac{\mathbf{Q}(\mathbf{F}_i)\mathbf{K}(\mathbf{F}_i)^\top}{\sqrt{d_k}} + \mathbf{B}\right) \mathbf{f}_i\]

Here \(\mathbf{Q}\) and \(\mathbf{K}\) are derived from the positional associations already learned by the reconstruction Transformer, while \(\mathbf{V}\) is provided by the DINOv2 features.
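
A sketch of this "reuse Q/K, replace V" step, assuming access to the already-trained query and key projections of the reconstruction block; the bias \(\mathbf{B}\) is passed through unchanged, and no new attention parameters are introduced.

```python
import torch

def pixel_aligned_aggregation(q_proj, k_proj, recon_tokens, dino_feats, bias=None):
    """Aggregate DINOv2 features with attention weights computed from the
    reconstruction tokens: Q and K come from F_i, the values are f_i.

    recon_tokens: (B, N, d1)  multi-view tokens F_i from the reconstruction branch
    dino_feats:   (B, N, d2)  pixel-aligned DINOv2 features f_i, used as values
    q_proj, k_proj: the projections reused from the reconstruction Transformer
    bias:         optional (N, N) attention bias B
    """
    q, k = q_proj(recon_tokens), k_proj(recon_tokens)               # (B, N, d_k)
    logits = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)         # (B, N, N)
    if bias is not None:
        logits = logits + bias
    return logits.softmax(dim=-1) @ dino_feats                      # (B, N, d2) aggregated semantics
```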

Semantic 3D Gaussians — Each Gaussian primitive is augmented with a learnable semantic embedding decoded via a \(1 \times 1\) convolution. The rendered semantic feature map is:

\[f = \sum_{i=1}^{N} \mathbf{M}\tilde{\mathbf{f}}_i \boldsymbol{\sigma}_i \prod_{j=1}^{i-1}(1 - \boldsymbol{\sigma}_j)\]
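
In practice this compositing happens inside the Gaussian rasterizer; the per-ray math it implements looks like the sketch below, where the \(1\times 1\)-conv decoding (the \(\mathbf{M}\) term) is assumed to have been applied already and the Gaussians are sorted front to back.

```python
import torch

def composite_semantic_features(feats, alphas):
    """Front-to-back alpha compositing of per-Gaussian semantic embeddings along
    one ray: f = sum_i f_i * sigma_i * prod_{j<i} (1 - sigma_j).

    feats:  (N, d) decoded semantic embeddings of the N Gaussians hit by the ray
    alphas: (N,)   per-Gaussian opacities sigma_i after the 2D Gaussian falloff
    """
    transmittance = torch.cumprod(
        torch.cat([alphas.new_ones(1), 1.0 - alphas[:-1]]), dim=0)   # prod_{j<i} (1 - sigma_j)
    weights = alphas * transmittance                                  # (N,)
    return (weights[:, None] * feats).sum(dim=0)                      # (d,) rendered feature
```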

3. Multi-Task Training Objectives (Sec 3.3)

\[\mathcal{L}(\boldsymbol{\Theta}) = \mathbb{E}_{i}\left[\mathcal{L}_{\text{render}} + \lambda_{\text{dist}} \cdot \mathcal{L}_{\text{dist}}(\mathbf{f}_i, \hat{\mathbf{f}}_i)\right] + \lambda_{\text{seg}} \cdot \mathbb{E}_{j}\left[\mathcal{L}_{\text{CE}}(\mathbf{S}_j, \hat{\mathbf{S}}_j)\right]\]
  • \(\mathcal{L}_{\text{render}} = \mathcal{L}_{\text{mse}} + \lambda_m \mathcal{L}_{\text{mask}} + \lambda_p \mathcal{L}_{\text{LPIPS}}\) (rendering loss)
  • \(\mathcal{L}_{\text{dist}}\): DINOv2 feature distillation (cosine similarity, self-supervised signal)
  • \(\mathcal{L}_{\text{CE}}\): cross-entropy loss (applied only to annotated views; 28 body-part classes + background)
  • Hyperparameters: \(\lambda_m=1,\ \lambda_p=0.1,\ \lambda_{\text{dist}}=0.5,\ \lambda_{\text{seg}}=0.5\); the terms are combined as in the sketch below
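
A sketch of how these terms might be combined with the listed weights; the LPIPS term uses the off-the-shelf `lpips` package and the mask term is written as an MSE on the rendered alpha, both of which are stand-in assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips; stand-in for the perceptual term

lpips_fn = lpips.LPIPS(net="vgg")
LAMBDA_M, LAMBDA_P, LAMBDA_DIST, LAMBDA_SEG = 1.0, 0.1, 0.5, 0.5

def humancrafter_loss(pred_rgb, gt_rgb, pred_alpha, gt_mask,
                      pred_feat, dino_feat, seg_logits=None, gt_seg=None):
    """Multi-task objective: rendering + DINOv2 feature distillation,
    plus cross-entropy on the views that carry part annotations."""
    # Rendering loss: MSE + mask + LPIPS (LPIPS expects inputs in [-1, 1]).
    l_render = (F.mse_loss(pred_rgb, gt_rgb)
                + LAMBDA_M * F.mse_loss(pred_alpha, gt_mask)
                + LAMBDA_P * lpips_fn(pred_rgb * 2 - 1, gt_rgb * 2 - 1).mean())
    # Distillation: cosine distance between rendered and DINOv2 feature maps.
    l_dist = (1.0 - F.cosine_similarity(pred_feat, dino_feat, dim=-1)).mean()
    loss = l_render + LAMBDA_DIST * l_dist
    # Segmentation: 29-way CE (28 body parts + background), annotated views only.
    if seg_logits is not None:
        loss = loss + LAMBDA_SEG * F.cross_entropy(seg_logits, gt_seg)
    return loss
```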

Semantic Annotation Dataset Construction

500 scans are selected from the training data; each scan is annotated with 8 semantic segmentation maps via an interactive annotation pipeline, yielding high-quality data–label pairs.

Key Experimental Results

3D Human Segmentation (2K2K Dataset)

| Method | Input | mIoU↑ | Acc.↑ | PSNR↑ | Time |
| --- | --- | --- | --- | --- | --- |
| LSM* | 2-view | 0.724 | 0.873 | 23.81 | 108ms/obj |
| Sapiens | 2D per-frame | 0.823 | 0.904 | N/A | 640ms/frame |
| HumanCrafter | 2-view | 0.840 | 0.925 | 24.79 | 126ms/obj |
| Human3Diff+Sapiens | 1-view | 0.781 | 0.851 | 21.83 | 23.21s/obj |
| HumanCrafter | 1-view | 0.801 | 0.882 | 23.49 | 6.24s/obj |

Under the two-view setting, mIoU surpasses the 2D Sapiens model (0.840 vs. 0.823) while maintaining 3D consistency. The single-view setting also substantially outperforms the two-stage pipeline.

3D Human Reconstruction (512×512 Resolution)

| Method | THuman2.1 PSNR↑ | THuman2.1 SSIM↑ | THuman2.1 LPIPS↓ | 2K2K PSNR↑ | 2K2K SSIM↑ | 2K2K LPIPS↓ |
| --- | --- | --- | --- | --- | --- | --- |
| LGM | 20.11 | 0.859 | 0.196 | 21.69 | 0.850 | 0.166 |
| GRM | 20.50 | 0.868 | 0.141 | 21.50 | 0.858 | 0.171 |
| Human3Diffusion | 22.16 | 0.872 | 0.063 | 22.32 | 0.882 | 0.053 |
| PSHuman | 20.85 | 0.862 | 0.076 | 21.93 | 0.892 | 0.076 |
| HumanCrafter | 23.19 | 0.907 | 0.046 | 23.49 | 0.916 | 0.045 |

Reconstruction quality is superior across the board: PSNR improves by over 1 dB on both datasets, and LPIPS drops by 27% relative to the best prior method on THuman2.1 (0.046 vs. 0.063).

Ablation Study

| Configuration | PSNR↑ | SSIM↑ | LPIPS↓ |
| --- | --- | --- | --- |
| Full model | 23.49 | 0.916 | 0.045 |
| w/o SMPL prior | 22.20 | 0.890 | 0.064 |
| DINOv2 replaced by MAE | 22.03 | 0.891 | 0.055 |
| w/o pixel-aligned aggregation | 21.18 | 0.891 | 0.067 |
| w/o \(\mathcal{L}_{\text{dist}}\) | 22.46 | 0.896 | 0.055 |
| w/o \(\mathcal{L}_{\text{CE}}\) | 23.22 | 0.901 | 0.051 |

The SMPL prior contributes the largest gain (+1.29 PSNR); the segmentation loss \(\mathcal{L}_{\text{CE}}\) in turn improves reconstruction quality (+0.27 PSNR), confirming mutual benefit between the two tasks.

Highlights & Insights

  1. First unified framework for 3D reconstruction + 3D segmentation — shared Gaussian parameters enable mutual task reinforcement.
  2. Pixel-aligned aggregation — DINOv2 features are propagated by reusing the reconstruction Transformer's attention weights, introducing zero additional parameters.
  3. Self-supervised to 3D — DINOv2 2D features are distilled into a 3D-consistent semantic field, elegantly addressing the scarcity of 3D annotations.
  4. High practicality — a complete 3D Gaussian representation is produced from a single image in 6.24s, directly integrable into VR pipelines and 3D editing workflows.

Limitations & Future Work

  1. Reliance on SV3D for multi-view image generation (~6s) constitutes the primary speed bottleneck — faster multi-view generation methods should be explored.
  2. Only about 4,000 annotated views (500 scans × 8 views) are used; expanding the annotation data could further improve segmentation quality.
  3. The 28-class body-part segmentation could be refined to finer granularity (e.g., finger-level).
  4. Quantitative segmentation performance under real-world challenging scenarios (extreme poses, occlusions, multi-person interaction) has not been evaluated.
  • The "reuse Q/K, replace V" strategy of pixel-aligned aggregation is generalizable to other multi-task 3D systems.
  • The finding that 3D segmentation regularization improves reconstruction quality suggests a new training paradigm for 3D generative models.
  • The FLUX-inpainting-based 3D editing application demonstrates a viable productization pathway.

Rating

  • Novelty: ⭐⭐⭐⭐ The unified framework and pixel-aligned aggregation design are elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Two datasets, comprehensive ablations, and in-the-wild qualitative evaluation.
  • Writing Quality: ⭐⭐⭐⭐ Architecture diagrams are clear and the method is described in a well-organized manner.
  • Value: ⭐⭐⭐⭐⭐ Pioneers the unified direction of 3D human reconstruction and understanding with strong practical applicability.