Estimating Body and Hand Motion in an Ego-sensed World¶

Conference: CVPR 2025
arXiv: 2410.03665
Code: Yes (code and models available on the project homepage)
Area: 3D Vision
Keywords: Egocentric Human Motion Estimation, Diffusion Models, SLAM Pose, Invariant Representation, Hand Estimation

TL;DR¶

EgoAllo proposes a system to estimate the wearer's full-body pose, height, and hand parameters from head-mounted egocentric SLAM poses and images. By designing head motion conditioning parameters that satisfy spatial and temporal invariance, it reduces human motion estimation errors by up to 18% and decreases hand world-coordinate errors by 40% using kinematic constraints.

Background & Motivation¶

Background: 3D human pose estimation from external perspectives is highly mature (e.g., HMR, SMPL recovery, optimization). However, estimating the wearer's own motion from a head-mounted egocentric viewpoint faces unique challenges, as most of the body is outside the camera's field of view. Existing egocentric methods (e.g., EgoEgo, BoDiffusion, AvatarPoser) utilize diffusion models or VAEs as motion priors.
Limitations of Prior Work: (a) Existing methods utilize varying conditioning representations for head poses, lacking systematic design principles. (b) Most approaches use a fixed "average" human shape, ignoring the importance of height variation for scene grounding. (c) Hand motion is typically processed in isolation, lacking joint reasoning with body motion.
Key Challenge: The fundamental difficulty of egocentric human motion estimation lies in extremely low observability—the wearer's body is rarely in the field of view, necessitating strong priors to resolve ambiguity. The quality of the prior depends directly on the design of the conditioning representation. However, existing conditioning representations either violate spatial invariance (absolute positions are affected by global coordinates) or violate temporal invariance (sequence normalization introduces cross-timestep dependencies).
Goal: To design a unified system that simultaneously estimates body pose, height, and hand parameters from egocentric SLAM poses and images, with the key being to identify a conditioning representation that satisfies both spatial and temporal invariance.
Key Insight: Starting from invariance principles—global ground-plane transformations should not affect local motion (spatial invariance), and the same motion at different positions within a time window should have identical representations (temporal invariance).
Core Idea: By constructing conditioning parameters that satisfy both spatial and temporal invariance through step-by-step local normalization (projecting the head pose to the ground plane and aligning it with the forward direction), the learning of the diffusion motion prior is significantly enhanced.

Method¶

Overall Architecture¶

EgoAllo takes two inputs: a sequence of head-mounted SLAM poses (\(SE(3)\) transformations) and an egocentric video. The system outputs SMPL-H model parameters, including local joint rotations \(\Theta^t\), body shape \(\beta\) (encoding height), and binary contact predictions \(\psi_j^t\). The workflow is divided into three stages: (1) transforming SLAM poses into conditioning vectors via an invariant parameterization function \(g(\cdot)\); (2) sampling local body parameters using a conditional diffusion model; and (3) placing them into the world coordinate system via global alignment, followed by guiding hand estimation using an LM optimizer.

Key Designs¶

Invariant Conditioning:
- Function: Converts raw SLAM poses into conditioning representations suitable for diffusion model learning.
- Mechanism: The condition \(\vec{c}^t\) at each timestep \(t\) consists of two components: (a) the relative transformation between adjacent frames \(\Delta T_{\text{cpf}}^{t-1,t}\), whose translation is expressed in the local frame and is therefore spatially invariant; (b) the transformation between the current CPF frame and a canonical frame computed independently for each timestep. The canonical frame is constructed by projecting the CPF origin onto the ground plane (so the remaining offset encodes height) and aligning its y-axis with the forward direction of the CPF. The key difference from EgoEgo is that EgoEgo computes one canonical frame per sequence, whereas EgoAllo computes one per timestep, thereby satisfying both spatial invariance (no effect from ground-plane transformations) and temporal invariance (no cross-timestep dependencies).
- Design Motivation: EgoEgo's sequence normalization introduces a dependency on the first frame, violating temporal invariance. AvatarPoser and BoDiffusion use absolute positions plus global differences, which fails to satisfy spatial invariance. Among the compared formulations, the proposed invariant conditioning is the only one that satisfies both properties simultaneously.
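The per-timestep canonicalization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a z-up world with the ground plane at z = 0, and it assumes the CPF forward direction is the frame's local z-axis (the `forward_axis` parameter is hypothetical). All function names are made up for this sketch.

```python
import numpy as np

def rot_z(a):
    """Rotation about the world up-axis (yaw)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(a):
    """Rotation about the x-axis (used only to build example poses)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def make_se3(R, t):
    """Pack a rotation and translation into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def inv_se3(T):
    """Closed-form inverse of a rigid transform."""
    R, t = T[:3, :3], T[:3, 3]
    return make_se3(R.T, -R.T @ t)

def canonical_frame(T_world_cpf, forward_axis=2):
    """Canonical frame for ONE timestep: origin is the CPF origin projected
    onto z = 0; y-axis is the CPF forward direction projected onto the ground.
    Assumes the wearer is not looking exactly straight up or down."""
    R, t = T_world_cpf[:3, :3], T_world_cpf[:3, 3]
    fwd = R[:, forward_axis].copy()
    fwd[2] = 0.0                      # drop the vertical component
    fwd /= np.linalg.norm(fwd)
    up = np.array([0.0, 0.0, 1.0])
    x_axis = np.cross(fwd, up)        # x = y cross z keeps the frame right-handed
    R_canon = np.stack([x_axis, fwd, up], axis=1)
    return make_se3(R_canon, np.array([t[0], t[1], 0.0]))

def conditioning(T_world_cpf_prev, T_world_cpf_curr, forward_axis=2):
    """Return (a) the relative CPF motion between adjacent frames and
    (b) the CPF pose expressed in its own per-timestep canonical frame."""
    delta = inv_se3(T_world_cpf_prev) @ T_world_cpf_curr
    T_canon = canonical_frame(T_world_cpf_curr, forward_axis)
    return delta, inv_se3(T_canon) @ T_world_cpf_curr

# Example: move the whole trajectory by a ground-plane transform G.
T_prev = make_se3(rot_z(0.3) @ rot_x(-np.pi / 2 + 0.2), np.array([1.0, 2.0, 1.60]))
T_curr = make_se3(rot_z(0.5) @ rot_x(-np.pi / 2 + 0.1), np.array([1.1, 2.2, 1.62]))
G = make_se3(rot_z(1.234), np.array([5.0, -3.0, 0.0]))
delta, T_canon_cpf = conditioning(T_prev, T_curr)
delta_g, T_canon_cpf_g = conditioning(G @ T_prev, G @ T_curr)
print(np.allclose(delta, delta_g), np.allclose(T_canon_cpf, T_canon_cpf_g))
```

Both components are unchanged under `G` because a yaw rotation plus a horizontal translation commutes with the ground projection, which is the spatial-invariance property. Neither component looks at any timestep other than `t-1` and `t`, which is the temporal-invariance property. The z-translation of the second component is the head height above the ground.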
Local Body Representation:
- Function: Decouples the diffusion model outputs from the global coordinate system.
- Mechanism: The diffusion model samples only local parameters: joint rotations \(\Theta^t \in \mathbb{R}^{51 \times 3 \times 3}\), body shape \(\beta \in \mathbb{R}^{16}\) (encoding height), and contact predictions \(\psi_j^t\) (ground contact probability for 21 joints). It does not contain global root transformations. The body shape is shared across all timesteps, since the height of a given person is constant. The global root pose \(T_{\text{world,root}}^t = T_{\text{world,cpf}}^t \cdot T_{\text{cpf,root}}^{(\Theta^t, \beta)}\) is then computed exactly from the SLAM pose and the forward kinematics of the sampled local parameters.
- Design Motivation: (a) Local parameters naturally satisfy spatial invariance, independent of global coordinate selection. (b) The body shape encodes height, which is critical for scene grounding in metric scale. (c) Contact predictions can be utilized to reduce foot sliding.
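The global placement step is a single transform composition per frame. The sketch below assumes the CPF pose in the root frame, `T_root_cpf`, has already been obtained from forward kinematics of the sampled local parameters; that input and the function name are hypothetical stand-ins, since the SMPL-H model itself is not reproduced here.

```python
import numpy as np

def place_root_in_world(T_world_cpf, T_root_cpf):
    """Compose T_world_root^t = T_world_cpf^t @ T_cpf_root^t for every frame.

    T_world_cpf: (T, 4, 4) SLAM poses of the central pupil frame.
    T_root_cpf:  (T, 4, 4) CPF pose in the body root frame (hypothetical
                 input, assumed to come from forward kinematics).
    """
    T_cpf_root = np.linalg.inv(T_root_cpf)   # batched inverse over frames
    return T_world_cpf @ T_cpf_root          # batched matrix product

# Example with three frames: the head sits 0.6 m above the root.
T_root_cpf = np.tile(np.eye(4), (3, 1, 1))
T_root_cpf[:, 2, 3] = 0.6
T_world_cpf = np.tile(np.eye(4), (3, 1, 1))
T_world_cpf[:, :3, 3] = [[0.0, 0.0, 1.6], [0.1, 0.0, 1.6], [0.2, 0.0, 1.6]]
T_world_root = place_root_in_world(T_world_cpf, T_root_cpf)
print(T_world_root[:, :3, 3])  # root positions in world coordinates
```

Because the root is placed by inverting the body's own head offset, the head of the posed body coincides with the measured SLAM pose by construction; any error in the local body estimate shows up lower in the kinematic chain rather than at the head.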
Guidance via Levenberg-Marquardt:
- Function: Integrates hand visual observations into the diffusion sampling process.
- Mechanism: HaMeR is used to detect hands from egocentric images, yielding 3D keypoints \(\hat{p}_{\text{camera},j}^t\). At each step of diffusion denoising, an LM optimizer is applied to the predicted joint rotations \(\Theta\) to minimize a combination of three losses: \(\mathcal{E}_{\text{guidance}} = \mathcal{E}_{\text{hands}} + \mathcal{E}_{\text{skate}} + \mathcal{E}_{\text{prior}}\). Hand losses include 3D hand parameter matching and camera reprojection losses; skate losses penalize contact-joint motion using contact predictions; and prior losses prevent the joint rotations from deviating too far from the denoiser's predictions.
- Design Motivation: Diffusion models provide motion priors but lack visual observation constraints. Single-frame hand estimation is accurate but lacks temporal smoothness and body kinematic constraints. LM guidance integrates both, where kinematic and temporal consistency constraints reduce hand world-coordinate errors by 40%.
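To make the structure of \(\mathcal{E}_{\text{guidance}}\) concrete, the sketch below evaluates three sum-of-squares terms of the kind a Levenberg-Marquardt solver minimizes. The exact residuals are not given in these notes, so every form here is an assumption: the hand term uses only a 3D keypoint residual (the reprojection part is omitted), the skate term is a contact-weighted squared displacement, and the prior term is a squared difference on rotation matrices. The weights `w_*` are hypothetical; with the defaults the total is the plain sum of the three terms.

```python
import numpy as np

def hands_energy(hand_kp_pred, hand_kp_obs, visible):
    """Squared 3D residual on frames where a hand detection exists.
    hand_kp_pred, hand_kp_obs: (T, K, 3); visible: (T,) boolean mask."""
    resid = (hand_kp_pred - hand_kp_obs)[visible]
    return float(np.sum(resid ** 2))

def skate_energy(joints_world, contacts):
    """Penalize motion of joints that are predicted to be in contact.
    joints_world: (T, J, 3); contacts: (T, J) probabilities in [0, 1]."""
    vel = joints_world[1:] - joints_world[:-1]          # (T-1, J, 3)
    return float(np.sum(contacts[1:, :, None] * vel ** 2))

def prior_energy(rot_current, rot_denoised):
    """Keep optimized rotations close to the denoiser's prediction."""
    return float(np.sum((rot_current - rot_denoised) ** 2))

def guidance_energy(hand_kp_pred, hand_kp_obs, visible,
                    joints_world, contacts,
                    rot_current, rot_denoised,
                    w_hands=1.0, w_skate=1.0, w_prior=1.0):
    """Weighted sum of the three terms (weights are illustrative)."""
    return (w_hands * hands_energy(hand_kp_pred, hand_kp_obs, visible)
            + w_skate * skate_energy(joints_world, contacts)
            + w_prior * prior_energy(rot_current, rot_denoised))

# Example: one contact joint slides 10 cm between two frames.
joints = np.zeros((2, 1, 3))
joints[1, 0, 0] = 0.1
contacts = np.ones((2, 1))
print(skate_energy(joints, contacts))
```

Writing each term as a sum of squared residuals is what makes the problem suitable for Levenberg-Marquardt, which is designed for nonlinear least-squares objectives.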

Loss & Training¶

The diffusion model is trained with a standard denoising objective, \(\min_\theta \mathbb{E}[w_n \|\mu_\theta(\vec{x}_n, n, \vec{c}) - \vec{x}_0\|^2]\), and sampled with DDIM. Training uses the AMASS dataset; device poses are synthesized during training from the left and right pupil vertex positions of the SMPL-H mesh. Training sequence lengths range from 32 to 128. At test time, longer sequences are processed with MultiDiffusion-style window blending.
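Window blending for long sequences can be sketched as follows: split the sequence into overlapping windows, run the denoiser on each window, and average the predictions wherever windows overlap. This is an illustrative sketch only. The uniform averaging, the `stride` value, and the `denoise_fn` callable are assumptions, not details taken from the paper.

```python
import numpy as np

def window_starts(total_len, window, stride):
    """Start indices of overlapping windows that cover the whole sequence."""
    starts = list(range(0, max(total_len - window, 0) + 1, stride))
    if starts[-1] + window < total_len:
        starts.append(total_len - window)   # extra window to reach the end
    return starts

def blend_windows(denoise_fn, x, window=128, stride=64):
    """Apply denoise_fn to overlapping windows of x and average overlaps.

    x: (T, D) noisy sequence at the current diffusion step.
    denoise_fn: hypothetical callable mapping (w, D) -> (w, D).
    """
    total_len = x.shape[0]
    out = np.zeros_like(x, dtype=float)
    count = np.zeros((total_len, 1))
    for s in window_starts(total_len, window, stride):
        e = min(s + window, total_len)
        out[s:e] += denoise_fn(x[s:e])
        count[s:e] += 1.0
    return out / count                       # every frame is covered at least once

# Example: an identity "denoiser" must leave the sequence unchanged.
x = np.arange(22.0).reshape(11, 2)
print(window_starts(11, 4, 3))
print(np.allclose(blend_windows(lambda w: w, x, window=4, stride=3), x))
```

Applying the blend at every denoising step, rather than once at the end, is what keeps neighbouring windows consistent with each other as sampling proceeds.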

Key Experimental Results¶

Main Results (AMASS Test Set, Conditioning Comparison)¶

Conditioning Method	Sequence Length	Spatial/Temporal Invariance	MPJPE (mm)↓	MPJPE increase vs. EgoAllo	PA-MPJPE (mm)↓
EgoAllo (Eq.4)	32	✓/✓	129.8	—	109.8
Absolute + Local Relative	32	Partial/✓	133.0	2.4%	113.6
Absolute + Global Difference	32	✗/✓	136.2	4.9%	118.3
Sequence Normalization [EgoEgo]	32	✓/✗	153.1	17.9%	128.7
Absolute Pose	32	✗/✓	159.9	23.2%	141.0
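The percentage column is the MPJPE increase of each alternative relative to EgoAllo's 129.8. The short check below recomputes it from the MPJPE values in the table; the variable and function names are made up for this example. It also shows that the same gap, expressed as a reduction relative to the baseline, is a smaller number.

```python
# MPJPE values copied from the table above (sequence length 32).
EGOALLO_MPJPE = 129.8
baselines = {
    "Absolute + Local Relative": (133.0, 2.4),
    "Absolute + Global Difference": (136.2, 4.9),
    "Sequence Normalization [EgoEgo]": (153.1, 17.9),
    "Absolute Pose": (159.9, 23.2),
}

def relative_increase(value, reference):
    """Percent by which `value` exceeds `reference`."""
    return 100.0 * (value - reference) / reference

def relative_reduction(value, baseline):
    """Percent by which `value` falls below `baseline`."""
    return 100.0 * (baseline - value) / baseline

for name, (mpjpe, reported) in baselines.items():
    increase = relative_increase(mpjpe, EGOALLO_MPJPE)
    reduction = relative_reduction(EGOALLO_MPJPE, mpjpe)
    print(f"{name}: +{increase:.2f}% (table: {reported}%), "
          f"EgoAllo is {reduction:.2f}% lower")
```

The recomputed increases agree with the table to within 0.1 percentage points; the small differences are consistent with the table having been computed from unrounded MPJPE values.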

Hand Estimation (EgoExo4D Dataset)¶

Method	Hand World Coordinate Error
Single-frame HaMeR Estimation	Baseline
EgoAllo Body Guidance	Reduced by 40%+

Key Findings¶

Invariant conditioning gives a large improvement: EgoEgo-style sequence normalization yields an MPJPE 17.9% higher than EgoAllo's (153.1 vs. 129.8 at sequence length 32), indicating that temporal invariance matters for learning the diffusion prior.
Both spatial and temporal invariance matter. Comparing sequence normalization (✓/✗, +17.9%) with absolute + global difference (✗/✓, +4.9%) suggests that violating temporal invariance costs more, although the plain absolute-pose conditioning (✗/✓) is the worst overall at +23.2%.
Longer sequences (128 steps) perform better than shorter ones (MPJPE 119.7 vs. 129.8), indicating that more temporal context aids motion estimation.
The improvement of body constraints on hand estimation is unexpectedly large (40%), demonstrating the importance of kinematic and temporal consistency constraints.

Highlights & Insights¶

Systematic design of conditioning representations starting from invariance principles is the most illuminating contribution of this work. Rather than "stumbling upon" a good representation via ablation, the authors first define two clear invariance criteria and then construct a representation that satisfies both. This "principles-driven" design methodology can plausibly be carried over to other conditional generative tasks.
Step-by-step local normalization is simple yet highly effective. By merely projecting the CPF frame onto the ground and aligning it with the forward direction, it simultaneously encodes height information and satisfies dual invariance. This operation incurs almost zero computational cost.
LM guidance fuses independent hand estimation with body priors, compensating for the instability of single-frame estimation using kinematic constraints, thereby reducing error by 40%. This "prior + observation-guided" framework can be transferred to other body parts or object interaction estimation.

Limitations & Future Work¶

The training data still relies on synthetic device poses from AMASS, which may exhibit domain gaps compared to real head-mounted devices.
Only the SMPL-H model is used, which cannot handle clothed scenarios.
Hand guidance requires hands to be visible in the egocentric field of view; thus, completely invisible hands cannot be reconstructed.
The assumption of millimeter-level accuracy for SLAM poses may not hold on certain consumer devices.
Future work could consider incorporating scene geometry (e.g., SLAM 3D point clouds) as conditioning to provide environmental interaction constraints.

vs EgoEgo: EgoEgo utilizes sequence normalization to achieve spatial invariance but violates temporal invariance, resulting in a 17.9% higher MPJPE. EgoAllo's step-by-step local normalization satisfies both invariances simultaneously.
vs AvatarPoser/BoDiffusion: These methods use absolute position + global differences, which violates spatial invariance. Moreover, they rely on VR controller inputs, whereas EgoAllo requires only head-mounted SLAM poses and egocentric images.
vs Non-learning methods: Physical simulation methods guarantee physical plausibility but lack data-driven motion diversity. EgoAllo achieves a better tradeoff by combining a diffusion prior with optimization-based guidance (hand observations and foot-skate penalties).

Rating¶

Novelty: ⭐⭐⭐⭐ The derivation of invariant conditioning is rigorous and elegant, though the overall framework (diffusion model + guidance) is not completely novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Tested on four datasets (AMASS/RICH/ADT/EgoExo4D) with systematic ablation comparisons of different conditioning approaches.
Writing Quality: ⭐⭐⭐⭐⭐ The introduction and derivation of the invariance axioms are logically clear with rigorous mathematical formulations.
Value: ⭐⭐⭐⭐ The design principles of invariant representations carry broad transferability and make a significant contribution to the field of egocentric perception.