
Diffusion-based 3D Hand Motion Recovery with Intuitive Physics

Conference: ICCV 2025
arXiv: 2508.01835
Code: None
Area: 3D Hand Reconstruction / Motion Recovery
Keywords: Diffusion Models, 3D Hand Reconstruction, Motion Refinement, Intuitive Physics, Hand-Object Interaction

TL;DR

This paper proposes a physics-augmented conditional diffusion model that refines per-frame 3D hand reconstruction results into temporally consistent motion sequences via an iterative denoising process, incorporating intuitive physics constraints (kinematic and stability constraints) to substantially improve reconstruction accuracy and physical plausibility.

Background & Motivation

Recovering 3D hand motion from monocular RGB video is a critical problem in VR/AR, robotic dexterous manipulation, and related fields. The main challenges are as follows:

Limitations of per-frame methods: Even leading per-frame reconstruction methods (e.g., HaMer) produce severely degraded predictions under heavy occlusion caused by hand-object interaction, and the lack of temporal consistency across adjacent frames results in unnatural motion.

Scarcity of video data: Video-based methods require annotated video sequence training data, which is extremely costly to collect, particularly in hand-object interaction scenarios.

Deficiencies of existing motion refinement methods: Deterministic methods such as PoseBERT struggle to capture the inherent uncertainty in per-frame estimates and neglect the physical laws governing hand motion.

The core insights of this paper are:

  • Modeling uncertainty with diffusion models: Motion refinement is formulated as a conditional probability distribution \(p(\mathbf{x}_{1:T}|\mathbf{y}_{1:T})\) rather than a deterministic mapping.
  • Incorporating intuitive physics: Human hands follow specific physical laws during object interaction (e.g., approaching along the shortest path, fingers remaining still during stable grasping); such prior knowledge can substantially improve model performance.

Method

Overall Architecture

The overall pipeline consists of three stages:

  1. An arbitrary per-frame reconstruction model (e.g., HaMer, K-Hand) produces initial estimates \(\mathbf{y}_{1:T}\).
  2. A conditional diffusion-based motion refinement model generates improved motion estimates \(\mathbf{x}_{1:T}\).
  3. Intuitive physics knowledge is incorporated into diffusion model training via loss functions.

Key characteristic: The model is trained exclusively on motion capture (MoCap) data without requiring image data, and after training it can be used as a plug-and-play module with any per-frame reconstruction method.

Key Designs

  1. Conditional diffusion motion refinement: Unlike standard diffusion models, this paper adopts a Shifting Diffusion framework. The forward process does not transition from data to pure noise; instead, it gradually shifts from the ground-truth motion \(\mathbf{x}_{1:T}\) toward the initial estimate \(\mathbf{y}_{1:T}\):
\[q(\mathbf{x}_{1:T}^n \mid \mathbf{x}_{1:T}, \mathbf{y}_{1:T}) = \mathcal{N}(\mathbf{x}_{1:T}^n;\, \mathbf{x}_{1:T} + \eta_n \mathbf{e}_{1:T},\, \kappa^2 \eta_n \mathbf{I})\]

where \(\mathbf{e}_{1:T} = \mathbf{y}_{1:T} - \mathbf{x}_{1:T}\) is the residual and \(\eta_n\) is a monotonically increasing shift schedule. The reverse denoising process thus naturally corresponds to recovering the true motion from the initial estimate.

The reverse process is parameterized by directly estimating the clean state \(\hat{\mathbf{x}}_{1:T} = f_{\mathbf{W}}(\mathbf{x}_{1:T}^n, \mathbf{y}_{1:T}, n)\) to define the transition distribution.
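The forward shifting step can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' code: the function name, the `kappa` default, and the tensor shapes are my own choices, while `eta_n` stands for the paper's shift schedule value at step \(n\).

```python
import numpy as np

def shifted_forward_sample(x0, y, eta_n, kappa=0.1, rng=None):
    """Sample x^n from the shifting forward process
    q(x^n | x_0, y) = N(x_0 + eta_n * (y - x_0), kappa^2 * eta_n * I).

    x0: ground-truth motion (e.g., shape (T, J, 3))
    y:  initial per-frame estimate, same shape as x0
    """
    rng = np.random.default_rng() if rng is None else rng
    residual = y - x0                    # e_{1:T} = y_{1:T} - x_{1:T}
    mean = x0 + eta_n * residual         # drifts from GT toward the estimate
    std = kappa * np.sqrt(eta_n)         # variance grows with the schedule
    return mean + std * rng.standard_normal(x0.shape)
```

At \(\eta_n = 0\) the sample is the ground truth; at \(\eta_n = 1\) its mean is the initial estimate, which is exactly the state the reverse process starts from at inference.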

  2. Hybrid spatiotemporal architecture: The core network of the reverse diffusion process combines MeshCNN and Transformer:

    • MeshCNN (4 layers, feature dimensions [32, 64, 64, 64]): captures spatial geometric dependencies of the 3D hand mesh.
    • Transformer (4 layers, 8 heads, 512-dimensional embeddings): captures temporal dependencies across the sequence.
    • MLP: encodes the diffusion step \(n\).
    • Autoregressive prediction is adopted, where the prediction at time \(t+1\) is conditioned on historical predictions from \(1\) to \(t\).
  3. Identification and integration of intuitive physics knowledge: The paper identifies four fundamental motion states in hand-object interaction:

    • Reaching: the hand moves directly toward the object from an idle state.
    • Stable Grasping: the hand stably holds the object with fingers largely stationary.
    • Manipulation: the hand undergoes significant pose changes while manipulating the object.
    • Releasing: the hand withdraws from the object after manipulation.

Based on these states, two types of physical constraints are derived:

  • Kinematic constraints: during reaching and releasing, the hand moves along a minimum-energy trajectory.
  • Stability constraints: during stable grasping, finger joints should remain stationary.

  4. Motion state prediction: The diffusion model additionally predicts the motion state category \(\hat{c}_{1:T}\) for each frame, supervised via cross-entropy loss:
\[\mathcal{L}_{state} = \frac{1}{T} \sum_{t=1}^T \mathrm{CE}(c_t, \hat{c}_t)\]

The predicted states serve as conditioning variables in autoregressive prediction, with Gumbel-Softmax enabling differentiable training.
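The Gumbel-Softmax trick mentioned above can be illustrated as follows. In an actual training loop one would use a differentiable framework implementation (e.g., `torch.nn.functional.gumbel_softmax`); this NumPy sketch only shows the sampling math, and the shapes (16 frames, 4 motion states) are assumptions for the example.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed sampling of a discrete motion state from per-frame logits.
    Low temperature tau pushes the output toward a one-hot vector."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-12, 1.0, logits.shape)
    g = -np.log(-np.log(u))               # Gumbel(0, 1) noise
    z = (logits + g) / tau
    z -= z.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

The relaxation is what lets the predicted state feed back into the autoregressive conditioning while gradients still flow through the state classifier.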

Loss & Training

The total training objective is a weighted sum of four losses:

\[\mathcal{L}_{total} = \mathcal{L}_{data} + \lambda_1 \mathcal{L}_{state} + \lambda_2 \mathcal{L}_{kinetic} + \lambda_3 \mathcal{L}_{stability}\]
  • \(\mathcal{L}_{data} = \mathbb{E}_{n} \|\mathbf{x}_{1:T} - f_{\mathbf{W}}(\mathbf{x}_{1:T}^n, \mathbf{y}_{1:T}, n)\|^2\): standard diffusion reconstruction loss.
  • \(\mathcal{L}_{kinetic}\): penalizes directional reversals during reaching/releasing states to encourage minimum-energy paths.
  • \(\mathcal{L}_{stability} = \frac{1}{|\mathcal{C}_g|} \sum_{(f,t) \in \mathcal{C}_g} \|\boldsymbol{\theta}_{f,t} - \boldsymbol{\theta}_{f,t+1}\|_2^2\): penalizes frame-to-frame changes in finger joint angles \(\boldsymbol{\theta}\) during stable grasping, where \(\mathcal{C}_g\) is the set of finger-frame pairs labeled as stable grasping.
  • Hyperparameters: \(\lambda_1 = 50\), \(\lambda_2 = 5 \times 10^2\), \(\lambda_3 = 10^3\).
  • Optimizer: AdamW, initial learning rate \(10^{-4}\), decayed by 0.8 every 5 epochs.
  • Training data: MoCap data only, using randomly perturbed GT motion as initial estimates.
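The two physics losses can be sketched as simple masked penalties. These forms are my paraphrase of the paper's description (directional-reversal penalty for the kinematic term, frame-to-frame joint change for the stability term), not the authors' exact implementation; the wrist trajectory stands in for the hand's global motion.

```python
import numpy as np

def stability_loss(theta, grasp_mask):
    """Mean squared frame-to-frame change of finger joint angles over
    consecutive frames labeled as stable grasping.
    theta: (T, J) joint angles; grasp_mask: (T,) bool."""
    pairs = grasp_mask[:-1] & grasp_mask[1:]      # both frames grasping
    if not pairs.any():
        return 0.0
    diff = theta[1:][pairs] - theta[:-1][pairs]
    return float((diff ** 2).sum(axis=-1).mean())

def kinetic_loss(wrist_pos, move_mask):
    """Penalize directional reversals (negative dot products of consecutive
    velocities) during reaching/releasing frames.
    wrist_pos: (T, 3) positions; move_mask: (T,) bool."""
    v = np.diff(wrist_pos, axis=0)                # (T-1, 3) velocities
    dots = (v[:-1] * v[1:]).sum(axis=-1)          # (T-2,) consecutive dots
    mask = move_mask[1:-1]
    if not mask.any():
        return 0.0
    return float(np.clip(-dots[mask], 0.0, None).mean())
```

A straight-line reach or a perfectly still grasp incurs zero penalty, so the losses only activate when the predicted motion violates the corresponding physical prior.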

Key Experimental Results

Main Results

Comparison with state-of-the-art motion refinement methods on DexYCB (base reconstruction: HaMer):

| Method | MJE↓ | P-MJE↓ | ACCL↓ | KIN↓ | STA↓ |
|---|---|---|---|---|---|
| HaMer (per-frame) | 18.9 | 4.4 | 7.95 | 22.49 | 1.08 |
| HaMer + PoseBERT | 18.0 | 4.4 | 2.38 | 0.67 | 0.00 |
| HaMer + Ours | 17.5 | 4.1 | 1.01 | 0.00 | 0.00 |

Zero-shot transfer comparison with state-of-the-art video-based methods on HO3Dv2:

| Method | P-MJE↓ | P-MVE↓ | F@5↑ | F@15↑ |
|---|---|---|---|---|
| HaMer | 8.1 | 8.6 | 58.0 | 97.4 |
| Deformer | 9.4 | 9.1 | 54.6 | 96.3 |
| Ours (zero-shot) | 8.0 | 8.3 | 59.7 | 97.6 |

Ablation Study

Incremental component addition on DexYCB (base reconstruction: K-Hand):

| Configuration | MJE↓ | ACCL↓ | KIN↓ | STA↓ |
|---|---|---|---|---|
| Deterministic Transformer | 22.7 | 2.53 | 0.39 | 0.04 |
| + Diffusion model | 21.9 | 1.38 | 0.12 | 0.02 |
| + State prediction | 21.6 | 1.27 | 0.06 | 0.00 |
| + Kinematic constraint | 21.7 | 1.17 | 0.00 | 0.00 |
| + Stability constraint | 21.5 | 1.27 | 0.05 | 0.00 |
| Full model | 21.5 | 1.17 | 0.00 | 0.00 |
| SmoothFilter (heuristic) | 25.3 | 1.96 | 0.00 | 0.00 |
| Constant Accl. Loss | 21.7 | 1.29 | 0.00 | 0.00 |

Key Findings

  • The diffusion model reduces ACCL (acceleration error) by 45% compared to the deterministic Transformer (2.53→1.38), validating the effectiveness of uncertainty modeling.
  • Intuitive physics constraints reduce KIN and STA violations to nearly zero without compromising reconstruction accuracy.
  • Heuristic smoothing (Gaussian filtering) reduces ACCL but degrades reconstruction accuracy (MJE increases from 24.4 to 25.3).
  • Zero-shot transfer to HO3Dv2 still outperforms methods trained on that dataset, demonstrating strong generalization.

Highlights & Insights

  • Elegant Shifting Diffusion design: The forward process shifts from GT toward the initial estimate (rather than pure noise), making the reverse denoising process naturally aligned with the motion refinement task.
  • Effective integration of physical priors: Rather than imposing hard constraints (e.g., post-processing filters), physical laws are embedded via differentiable loss functions, allowing the model to internalize them during training.
  • MoCap-only training: This avoids the high cost of video annotation, and the trained model can be plugged into any per-frame method.
  • Motion state modeling: Incorporating discrete physical states as conditioning variables into the diffusion process represents a novel conditioning design.

Limitations & Future Work

  • Inherent limitations of the MANO model: 778 vertices may be insufficient to accurately capture fine-grained hand deformations.
  • Motion state annotation relies on heuristic rules such as hand-object distance, which may be inaccurate in complex interaction scenarios.
  • Sequence length is fixed at 16 frames; processing longer videos requires a sliding window, potentially introducing discontinuities at window boundaries.
  • Bimanual interaction and hand interaction with deformable objects are not considered.
  • Inference requires multi-step iterative denoising, with speed limited by the number of diffusion steps.
Related Work

  • PoseBERT [Baradel et al., 2022]: deterministic motion refinement baseline; this paper surpasses it through probabilistic modeling.
  • LIA/VASA [Wang et al., 2023]: motion latent space concepts, though applied to a different domain.
  • DDPM for human body [Tevet et al., 2023]: diffusion-based human body motion generation; this paper extends the paradigm to hand motion and incorporates physical constraints.
  • Insight: domain-specific physical priors can be effectively incorporated into generative models via differentiable loss formulations.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of conditional diffusion and intuitive physics is novel; the Shifting Diffusion design is elegant.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Dual-dataset validation with zero-shot transfer and detailed ablation studies provide strong evidence.
  • Writing Quality: ⭐⭐⭐⭐ Physical motivation is clearly articulated; the visualization of four motion states is intuitive.
  • Value: ⭐⭐⭐⭐ Driven by a practical problem; the MoCap-only training strategy has broad applicability.