
PHD: Personalized 3D Human Body Fitting with Point Diffusion

Conference: ICCV 2025 | arXiv: 2508.21257 | Code: https://PHD-Pose.github.io | Area: 3D Vision
Keywords: human pose estimation, personalized body shape, point diffusion model, shape prior, SMPL fitting, plug-and-play

TL;DR

This paper proposes PHD, a personalized 3D human pose estimation paradigm that first calibrates user-specific body shape via SHAPify, then employs a shape-conditioned point diffusion model (PointDiT) as a 3D prior, and iteratively optimizes pose parameters through Point Distillation Sampling combined with 2D keypoint constraints, achieving state-of-the-art absolute pose accuracy on the EMDB dataset.

Background & Motivation

Core Problem

Recovering accurate 3D human pose and shape from monocular video is foundational to future AI systems in AR/VR telepresence, service robotics, and related domains.

Two Fundamental Limitations of Existing Methods

Problem 1: Shape–Pose Entanglement

Existing methods (e.g., HMR2.0b, CameraHMR) simultaneously estimate shape, pose, and pelvis position per frame in a subject-agnostic manner. However, the body shape of the same user should remain constant within a video — unstable shape estimates force pose parameters to compensate for shape errors to satisfy optimization objectives. A typical symptom is unnatural knee bending caused by incorrect shape estimates when aligning to 2D projections.

Problem 2: Over-reliance on 2D Alignment

Optimization-based methods refine pose by minimizing 2D keypoint reprojection error (the regress-then-refine paradigm), but 2D projection inherently suffers from depth ambiguity — good 2D alignment does not guarantee 3D accuracy. More critically, the 3D pseudo-labels used to train regressors are themselves obtained via 2D alignment, creating a closed loop of error propagation.

Key Insights

Personalization is necessary: exploit the consistency of user identity (body shape is invariant within a video) to decouple shape estimation from pose estimation.

3D priors should not be built on joint angles: joint-angle representations correlate only weakly with image features and body-shape conditions, which degrades conditional generative models; body surface point clouds, which are naturally and strongly correlated with the shape condition, should be used instead.

Method

Overall Architecture (Two-Stage Decoupling)

  1. Personalization Stage (SHAPify): Calibrates user-specific body shape \(\beta^*\) from a single frame; performed only once.
  2. Pose Fitting Stage: Alternately optimizes pose parameters \(\theta, \phi, \mathbf{p}\) by combining PointDiT sampling with 2D keypoint constraints, conditioned on \(\beta^*\).

Key Designs

1. SHAPify: Personalized Body Shape Calibration

Estimates body shape from a single reference image of the user in a standing pose (T-pose/I-pose), with optional height and weight information.

Optimization Objective: \(\text{argmin}_{\phi,\theta,\beta,\mathbf{p}} \; \mathcal{L}_{rep} + \mathcal{L}_{reg}\)

Reprojection loss: \(\mathcal{L}_{rep} = \|\Pi(J_{3D}(\phi, \theta, \beta) + \mathbf{p}) - J_{2D}\|_1\)

Regularization term (addressing the insufficient shape constraint from 2D keypoints alone): \(\mathcal{L}_{reg} = \lambda_\beta \|\beta\|_2^2 + \lambda_h \|H(\beta) - h\|_1 + \lambda_w \|W(\beta) - w\|_1\) where \(H(\cdot), W(\cdot)\) are differentiable functions computing height and weight from the SMPL mesh. \((h, w)\) can be set to population averages or user-provided measurements.

Key Implementation Detail: \(\theta\) is initialized to a predefined rest pose, and smaller learning rates are assigned to \(\theta\) and \(\mathbf{p}\) to prioritize updates to \(\beta\) and \(\phi\).
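
The SHAPify objective can be sketched in a few lines of NumPy. This is an illustrative toy, not the official code: `height_of`/`weight_of` below are hypothetical linear proxies standing in for the paper's differentiable SMPL-based \(H(\beta)\) and \(W(\beta)\), and the projected joints are passed in precomputed.

```python
import numpy as np

# Toy stand-ins for the differentiable H(beta), W(beta); in the paper these
# are computed from the SMPL mesh (these linear proxies are hypothetical).
def height_of(beta):
    return 1.70 + 0.05 * beta[0]   # meters, around a population average

def weight_of(beta):
    return 70.0 + 4.0 * beta[1]    # kilograms, around a population average

def shapify_loss(beta, proj_joints, joints_2d, h, w,
                 lam_beta=1e-3, lam_h=1.0, lam_w=1.0):
    """SHAPify objective L_rep + L_reg (sketch under toy assumptions).

    proj_joints : (J, 2) projected joints, i.e. Pi(J_3D(phi, theta, beta) + p)
    joints_2d   : (J, 2) detected 2D keypoints
    h, w        : target height/weight (user-provided or population average)
    """
    l_rep = np.abs(proj_joints - joints_2d).sum()        # L1 reprojection
    l_reg = (lam_beta * np.sum(beta ** 2)                # shape magnitude prior
             + lam_h * abs(height_of(beta) - h)          # height constraint
             + lam_w * abs(weight_of(beta) - w))         # weight constraint
    return l_rep + l_reg
```

In the actual pipeline this scalar would be minimized jointly over \((\phi, \theta, \beta, \mathbf{p})\), with the smaller learning rates on \(\theta\) and \(\mathbf{p}\) noted above.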

2. PointDiT: Shape-Conditioned Point Diffusion Prior

Representation Choice: Body surface point clouds (\(S=238\) mesh vertices + \(J=45\) joint positions) are used as the pose representation instead of joint angles. Joint angles exhibit weak correlation with image features and shape conditions, yielding poor performance on uncommon poses.

Architecture (based on Diffusion Transformer + Rectified Flow):

  • Forward process: simple linear interpolation \(\mathbf{x}_t = (1 - t/T)\mathbf{x}_0 + (t/T)\epsilon\)
  • Reverse sampling: Euler steps \(\mathbf{x}_{t-1} = \mathbf{x}_t - \tfrac{1}{T}\,\hat{\mathbf{u}}(\mathbf{x}_t, t)\)
  • Training loss: conditional flow matching \(\mathcal{L}_{CFM} = \mathbb{E}\big[w_t \,\|\hat{\mathbf{u}}(\mathbf{x}_t, t) - (\epsilon - \mathbf{x}_0)\|_2^2\big]\)
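
The interpolation and the short Euler sampler are a few lines of NumPy. This is a toy sketch, assuming the velocity network is replaced by a stub; with the oracle velocity \(\epsilon - \mathbf{x}_0\), each step moving \(1/T\) along the prediction exactly inverts the forward process.

```python
import numpy as np

T = 5  # denoising steps, matching the paper's rectified-flow schedule

def forward(x0, eps, t):
    """Rectified-flow interpolation x_t = (1 - t/T) x0 + (t/T) eps."""
    s = t / T
    return (1.0 - s) * x0 + s * eps

def sample(eps, velocity):
    """Euler sampling from pure noise x_T = eps down to x_0.

    velocity(x_t, t) should approximate the flow eps - x0; in PHD it is
    PointDiT, here it is left abstract. Each step moves 1/T along it.
    """
    x = eps.copy()
    for t in range(T, 0, -1):
        x = x - velocity(x, t) / T
    return x

# Sanity check with the oracle velocity: sampling inverts the forward process.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(283, 3))     # 238 surface points + 45 joints
eps = rng.normal(size=(283, 3))
oracle = lambda x, t: eps - x0
x_rec = sample(eps, oracle)        # recovers x0 exactly
```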

Condition Injection:

  • Image condition: ViTPose extracts image tokens and 2D heatmaps → 256 conditioning tokens, injected via self-attention
  • Shape condition: SMPL shape parameters \(\beta\) → adaLN-Zero (replacing the original class embedding)
  • Sampling schedule: with rectified flow, only \(T=5\) denoising steps are required, which is critical for iterative fitting
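
The adaLN-Zero mechanism can be illustrated with a minimal NumPy block. This is a simplified sketch (a single zero-initialized matrix stands in for the modulation MLP, and the sub-layer is a stub): the condition \(\beta\) is mapped to a scale, shift, and gate, and the zero initialization makes every conditioned block start out as the identity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class AdaLNZeroBlock:
    """One transformer sub-block with adaLN-Zero conditioning (sketch).

    The modulation weights are zero-initialized, so at initialization
    gamma = delta = alpha = 0 and the block reduces to the identity.
    """
    def __init__(self, d_model, d_cond, rng):
        self.W_mod = np.zeros((d_cond, 3 * d_model))   # zero-init modulation
        self.W_ff = rng.normal(scale=0.02, size=(d_model, d_model))

    def __call__(self, x, beta):
        gamma, delta, alpha = np.split(beta @ self.W_mod, 3, axis=-1)
        h = (1.0 + gamma) * layer_norm(x) + delta      # adaptive LayerNorm
        h = np.tanh(h @ self.W_ff)                     # stand-in sub-layer
        return x + alpha * h                           # zero-init residual gate
```

During training the modulation weights move away from zero, letting the shape code steer every block without destabilizing early optimization.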

Training Data: Trained exclusively on the synthetic dataset BEDLAM, ensuring clean ground-truth shape and pose annotations.

3. Point Distillation Sampling + Sample–Fit Loop

Inspired by Score Distillation Sampling, the paper proposes Point Distillation Sampling:

Point cloud alignment loss (pelvis-aligned L2): \(\mathcal{L}_p = \lambda_p \|\mathbf{x}_0 - P_{3D}(\phi, \theta, \beta^*)\|_2\)

Angular alignment loss (applied after converting point clouds back to SMPL parameters via Point Fitter): \(\mathbf{x}_0 \xmapsto{\text{Point Fitter}} (\phi_g, \theta_g) \;\Rightarrow\; \mathcal{L}_a = \lambda_\phi \|\phi_g - \phi\|_2 + \lambda_\theta \|\theta_g - \theta\|_2\)

Alternating Sample–Fit Loop, at iteration \(k\):

  1. Compute the data term \(\mathcal{L}_{data}\) and the prior term \(\mathcal{L}_{prior}\) using the current parameters \((\phi^k, \theta^k)\).
  2. Update the parameters to obtain \((\phi^{k+1}, \theta^{k+1})\).
  3. Generate the updated point cloud \(\mathbf{x}^{k+1} = P_{3D}(\phi^{k+1}, \theta^{k+1})\).
  4. Perturb it at a low noise level (\(t/T = 0.75\)) and re-sample to provide a refreshed 3D prior for the next round.

\[\text{argmin}_{\phi,\theta,\mathbf{p}} \underbrace{\|\Pi(J_{3D}(\phi,\theta,\beta^*)+\mathbf{p}) - J_{2D}\|_1}_{\mathcal{L}_{data}} + \underbrace{\mathcal{L}_p + \mathcal{L}_a}_{\mathcal{L}_{prior}}\]
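
The control flow of the sample–fit loop can be sketched with a 1-D toy problem: a scalar "pose", a quadratic data term toward a noisy observation, and an oracle denoiser standing in for PointDiT. All targets, weights, and step sizes here are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 1.0
theta = 0.0                          # poor initialization
obs = theta_true + 0.05              # noisy 2D-alignment target (data term)

def resample(x, noise_level=0.75):
    """Perturb at t/T = 0.75 and re-sample: here the 'denoiser' is an
    oracle that pulls the perturbed value halfway back to the true pose."""
    x_noisy = x + 0.1 * noise_level * rng.normal()
    return 0.5 * x_noisy + 0.5 * theta_true

lam_p, lr = 1.0, 0.2
prior = resample(theta)              # initial 3D prior sample
for k in range(50):
    # gradients of L_data = (theta - obs)^2 and L_p = lam_p (theta - prior)^2
    grad = 2.0 * (theta - obs) + 2.0 * lam_p * (theta - prior)
    theta -= lr * grad               # fit step on the combined objective
    prior = resample(theta)          # refreshed prior for the next round
```

The point of the loop is visible even in this toy: the prior term keeps pulling the estimate toward a plausible 3D configuration while the data term handles image alignment, and refreshing the prior from the current estimate prevents it from going stale.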

Key Experimental Results

Main Results: EMDB1 Pelvis-Aligned Pose Accuracy

Method                     MPJPE↓        PA-MPJPE↓     MVE↓          PA-MVE↓
ScoreHMR (sample init.)    114.0         82.3          141.3         101.9
PHD (sample init.)         73.6          49.2          86.4          59.1
HMR2.0b init.              117.2         77.9          140.2         93.9
  + ScoreHMR               105.5         70.0          124.5         84.7
  + PHD (Ours)             73.2          47.4          86.4          58.5
CameraHMR init.            70.3          43.3          81.7          –
  + ScoreHMR               74.9 (+4.6)   45.0 (+1.7)   89.0 (+7.3)   –
  + PHD (Ours)             62.5 (−7.8)   42.4 (−0.9)   74.6 (−7.1)   –

Key finding: PHD yields substantial improvements across all initialization strategies, whereas ScoreHMR actually degrades performance when initialized with CameraHMR.

Absolute Pose Accuracy (C-MPJPE)

Method         Pelvis Err.↓    C-MPJPE↓
HMR2.0b        144.0           182.0
  + ScoreHMR   180.6 (+36.6)   181.4 (−0.6)
  + PHD        94.7 (−49.3)    112.6 (−69.4)
CameraHMR      163.0           160.3
  + PHD        130.9 (−32.1)   135.6 (−27.4)

Good local pose accuracy does not imply good absolute pose accuracy — CameraHMR achieves low MPJPE but high Pelvis Error. PHD's personalized shape calibration substantially improves absolute accuracy.
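
The gap between the two metrics is easy to make concrete. Under the standard definitions (pelvis joint index and joint count assumed here), a prediction with a perfect pose but a constant depth offset has zero pelvis-aligned error yet a large camera-frame error.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error in the camera frame (C-MPJPE)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pelvis_aligned_mpjpe(pred, gt, pelvis=0):
    """MPJPE after translating both skeletons so their pelvis joints coincide."""
    return mpjpe(pred - pred[pelvis], gt - gt[pelvis])

rng = np.random.default_rng(0)
gt = rng.normal(size=(24, 3))             # 24 joints, meters
pred = gt + np.array([0.0, 0.0, 0.5])     # perfect pose, 0.5 m depth offset
```

Here `pelvis_aligned_mpjpe(pred, gt)` is exactly 0 while `mpjpe(pred, gt)` is 0.5, mirroring how CameraHMR can score well pelvis-aligned yet poorly in absolute terms.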

Ablation Study: Point Cloud vs. Joint Angle Representation

Representation       MPJPE↓   PA-MPJPE↓   PA-MVE↓
6D Angular           177.9    125.2       154.8
ScoreHMR (angular)   150.0    102.3       128.0
Points (Ours)        75.6     52.1        62.1

Under identical conditions, the point cloud representation achieves less than half the error of joint angle representations, with particularly pronounced advantages on uncommon poses.

Shape Calibration Accuracy

Method                      Mean Joint Error↓   Mean Vertex Error↓
CameraHMR                   30.60               31.85
NLF                         19.36               20.61
SHAPY                       22.94               21.38
SHAPify (w/ measurements)   11.29               9.18

SHAPify leverages height and weight constraints to achieve body shape accuracy far surpassing data-driven methods.

Highlights & Insights

  1. Paradigm innovation: Shifting from "generic estimation" to "personalized fitting" — decoupling shape and pose is the key to improving accuracy.
  2. Deep rationale for representation choice: Using point clouds rather than joint angles as the pose representation is fundamentally motivated by the requirement that conditional generative models need strong correlation between conditions (image/shape) and outputs; the relationship between image features and surface points is far more direct than that with joint rotation angles.
  3. Plug-and-play design: Compatible with any 3D pose estimator as a post-processing module for accuracy improvement.
  4. Synthetic data only: PointDiT is trained exclusively on BEDLAM, avoiding noise propagation from pseudo-labeled data.

Limitations & Future Work

  1. SHAPify requires the user to provide a reference pose image, increasing deployment overhead.
  2. The personalized shape assumption presumes a constant body shape throughout the video — scenarios involving clothing changes may violate this assumption.
  3. The iterative fitting process (PointDiT sampling with 5 steps × multiple fitting iterations) incurs considerable computational cost and is not real-time.
  4. Evaluation is primarily conducted on the EMDB dataset; validation on broader in-the-wild scenarios remains limited.

Method Lineage

  • Regress-then-refine paradigm: SMPLify → ScoreHMR → PHD (from fixed priors to image-conditioned priors to shape+image-conditioned priors)
  • 3D pose priors: GMM (SMPLify) → GAN/VAE → diffusion models (ScoreHMR / this work)
  • Personalization methods: SHAPY (data-driven shape estimation) → PHD (optimization + measurement constraints)

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — A well-motivated combination of personalization, point diffusion, and iterative fitting, with deep insight into representation choice
  • Technical Depth: ⭐⭐⭐⭐⭐ — A complete technical pipeline from shape calibration to diffusion prior to fitting loop
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comparisons and ablations are thorough, but dataset coverage is somewhat narrow
  • Value: ⭐⭐⭐⭐ — Plug-and-play improvement over existing methods, though requiring a personalization calibration step