
A Plug-and-Play Physical Motion Restoration Approach for In-the-Wild High-Difficulty Motions

Conference: ICCV 2025 · arXiv: 2412.17377 · Code: Project Page · Area: Image Segmentation · Keywords: physical motion restoration, motion imitation, test-time adaptation, motion correction, high-difficulty motions

TL;DR

This paper proposes a plug-and-play physical motion restoration approach that repairs artifact frames in video-based motion capture via a Mask-conditioned Motion Correction Module (MCM), and achieves physics-based simulation of high-difficulty in-the-wild motions via a Physics-based Motion Transfer Module (PTM) built on pretraining and test-time adaptation, substantially improving the physical plausibility of recovered motions.

Background & Motivation

Extracting physically plausible 3D human motion from monocular video is a critical task. Current video motion capture methods (e.g., GVHMR, TRAM) can rapidly recover 3D motion but lack dynamics modeling, leading to severe physical implausibilities:

  • Floating: the body hovers above the ground
  • Foot sliding: the feet slide while in contact with the ground
  • Self-penetration: body parts interpenetrate each other
  • Ground penetration: the body passes through the ground

Existing physics-based motion imitation methods (e.g., PHC+, UHC) can improve the physical quality of everyday motions (walking, running, jumping), but remain ineffective for high-difficulty motions (gymnastics, martial arts, dance, etc.) due to two challenges:

Challenge 1 — Artifact Reference Motions: Under fast movements and extreme poses, video motion capture algorithms produce artifact frames (temporally incoherent poses), which—even if brief—cause physics simulation to fail.

Challenge 2 — Intrinsic Imitation Complexity: The long-tail distribution of high-difficulty motions, complex force control requirements, and catastrophic forgetting make it difficult for a single controller to generalize across diverse high-difficulty actions.

Method

Overall Architecture

Given the video motion capture result (the reference motion) and the original video, the overall pipeline proceeds as follows:

  1. MCM (Mask-conditioned Motion Correction Module): detects and corrects artifact motion frames
  2. PTM (Physics-based Motion Transfer Module): performs physics-based motion restoration on the corrected motion; if PTM fails initially, test-time adaptation (TTA) is triggered to update the network parameters until success or a step budget is exhausted

The entire approach is designed as a plug-and-play module that can be directly integrated after any video motion capture method.
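
The control flow above can be sketched in Python. This is a minimal sketch, not the authors' implementation: `correct_motion`, `simulate`, and `adapt` are hypothetical stand-ins for MCM, a PTM physics rollout, and one TTA update, and `MAX_TTA_STEPS` is an assumed budget.

```python
MAX_TTA_STEPS = 4000  # assumed upper bound on TTA updates before giving up

def restore_motion(reference_motion, video, policy,
                   correct_motion, simulate, adapt):
    """Plug-and-play restoration: MCM correction, then PTM with optional TTA."""
    motion = correct_motion(reference_motion, video)  # MCM: repair artifact frames
    success, result = simulate(policy, motion)        # PTM: physics-based imitation
    steps = 0
    while not success and steps < MAX_TTA_STEPS:      # TTA: limited-step adaptation
        policy = adapt(policy, motion)                # one RL parameter update
        success, result = simulate(policy, motion)
        steps += 1
    return result
```

Because the loop only runs when the pretrained controller fails, everyday motions pass through with no adaptation cost.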

Key Designs

1. Mask-conditioned Motion Correction Module (MCM)

Problem: Blurry frames in high-difficulty motions cause motion capture algorithms to fail at localizing body parts, producing temporally incoherent artifact motions.

Key Insight: Segmentation methods inherently distinguish foreground from background and can define the approximate body region even in blurry frames. Artifact motions are temporally short and surrounded by rich motion context, making segmentation-guided interpolative replacement feasible.

Mismatch Detection:

  • Project the 3D joint positions to 2D and compute OKS similarity against the 2D keypoints extracted by the detector
  • Frames whose OKS falls below a threshold are flagged as artifact motions
  • Detection can also leverage the overlap between the projected SMPL mesh and the human segmentation mask
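
As a concrete illustration, a COCO-style OKS check could look as follows; the per-keypoint constants `kappa` and the 0.5 threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def oks(proj_2d, det_2d, scale, kappa, vis):
    """Object Keypoint Similarity between projected and detected 2D keypoints.
    proj_2d, det_2d: (J, 2) arrays; scale: object scale s (e.g., sqrt of box area);
    kappa: (J,) per-keypoint falloff constants; vis: (J,) visibility flags."""
    d2 = np.sum((proj_2d - det_2d) ** 2, axis=-1)
    e = np.exp(-d2 / (2 * scale**2 * kappa**2))
    return float(np.sum(e * vis) / np.clip(np.sum(vis), 1, None))

def is_artifact(proj_2d, det_2d, scale, kappa, vis, thresh=0.5):
    """Flag a frame as an artifact when OKS drops below a threshold (assumed 0.5)."""
    return oks(proj_2d, det_2d, scale, kappa, vis) < thresh
```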

Motion Correction:

  • Human segmentation masks are obtained with SAM
  • A pretrained ViT serves as the mask feature extractor, capturing rich human-pose information
  • Mask features and motion context condition a diffusion model that performs motion in-betweening
  • 10% of the training data is set to unconditional generation to enhance generalization
  • Implemented with a GMD-based UNet architecture
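
The 10% unconditional-training trick (classifier-free-guidance style) amounts to randomly dropping the conditioning signals during training. A simplified sketch, where `null_token` is a hypothetical placeholder for the learned null embedding:

```python
import random

P_UNCOND = 0.1  # fraction of training samples made unconditional (from the paper)

def maybe_drop_condition(mask_feat, context, null_token=None):
    """With probability 10%, replace both conditioning signals with a null
    token so the diffusion model also learns unconditional generation."""
    if random.random() < P_UNCOND:
        return null_token, null_token
    return mask_feat, context
```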

2. Physics-based Motion Transfer Module (PTM)

Pretraining Stage:

  • An imitation controller is trained on four datasets: AMASS, Human3.6M, AIST++, and Motion-X (kungfu subset)
  • The policy \(\pi_{\text{PTM}}\) is optimized with PPO
  • A PD controller converts policy actions into joint torques: \(\tau^i = k_p^i(a_t^i - q_t^i) - k_d^i \dot{q}_t^i\)
  • An AMP discriminator provides style rewards
  • Total reward = reconstruction reward + AMP style reward + energy penalty
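
A minimal sketch of the PD torque computation, with the policy action \(a_t\) interpreted as a per-joint target angle (gains are illustrative, not the paper's values):

```python
import numpy as np

def pd_torque(kp, kd, target, q, qdot):
    """Per-joint PD torque: tau = kp * (target - q) - kd * qdot.
    kp, kd: per-joint gains; target: policy action as joint target;
    q, qdot: current joint positions and velocities."""
    return kp * (target - q) - kd * qdot
```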

RL-based Test-Time Adaptation (TTA):

Core innovation — leveraging the trial-and-error nature of RL to perform limited-step parameter updates at test time, processing each motion sequence individually. The adaptation incorporates the following designs:

Relative Reward: Captured reference motions contain jitter and accumulated root-node errors; constructing a full reconstruction reward is harmful. Absolute root position is therefore ignored, and global orientation and translation are maintained through explicit rotation guidance and implicit velocity guidance:

\[r_t^g = e^{-w_p\|rela(\hat{p}_t) - rela(p_t)\|} + e^{-w_r\|\hat{\theta}_t \ominus \theta_t\|} + e^{-w_v\|\hat{v}_t - v_t\|} + e^{-w_\omega\|\hat{\omega}_t - \omega_t\|}\]
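
A sketch of this relative reward, assuming \(rela(\cdot)\) subtracts the root joint and approximating the rotation term \(\ominus\) with a plain norm; the weights are illustrative, not the paper's values:

```python
import numpy as np

def rela(p):
    """Root-relative joint positions: subtract the root (joint 0)."""
    return p - p[:1]

def relative_reward(p_hat, p, th_hat, th, v_hat, v, w_hat, w,
                    wp=2.0, wr=1.0, wv=0.5, ww=0.5):
    """The position term uses root-relative coordinates, so the noisy absolute
    root position of the reference is ignored; orientation, velocity, and
    angular-velocity terms keep the global motion consistent."""
    r_p = np.exp(-wp * np.linalg.norm(rela(p_hat) - rela(p)))
    r_r = np.exp(-wr * np.linalg.norm(th_hat - th))  # geodesic distance in the paper
    r_v = np.exp(-wv * np.linalg.norm(v_hat - v))
    r_w = np.exp(-ww * np.linalg.norm(w_hat - w))
    return r_p + r_r + r_v + r_w
```

Note that shifting the simulated root by a constant offset leaves the reward unchanged, which is exactly the robustness the design aims for.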

Relative Early Termination: Conventional strict termination conditions are easily triggered when facing low-quality reference motions. A termination condition based on the average relative distance across joints is designed to be more permissive for high-difficulty motions:

\[\mathcal{F}_t = \left(\frac{1}{J}\sum_{i=1}^J\|rela(\hat{p}_t^i) - rela(p_t^i)\| > d_{term}\right) \vee \mathcal{F}_t^h \vee \mathcal{F}_t^c\]
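
The termination test might be implemented as follows; `d_term = 0.5` is an assumed threshold, and the `fell`/`collided` flags stand in for \(\mathcal{F}_t^h\) and \(\mathcal{F}_t^c\):

```python
import numpy as np

def should_terminate(p_hat, p, d_term=0.5, fell=False, collided=False):
    """Relative early termination: trigger when the mean root-relative joint
    distance exceeds d_term, or on fall/collision flags. p_hat, p: (J, 3)."""
    rela = lambda x: x - x[:1]  # subtract root joint
    mean_dist = np.mean(np.linalg.norm(rela(p_hat) - rela(p), axis=-1))
    return bool(mean_dist > d_term or fell or collided)
```

Because the comparison is root-relative, a constant global drift of the simulated character does not trigger termination, which keeps the episode alive on low-quality references.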

Residual Force: High-difficulty motions often involve aerial flips and jumps that rely on trampolines or mats; an auxiliary external force is introduced to compensate for missing environmental conditions in simulation.
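
One common way to realize such an auxiliary force is a clamped residual action applied at the root; this is a sketch under that assumption, and the 200 N cap is illustrative, not the paper's value:

```python
import numpy as np

def apply_residual_force(root_force_action, max_force=200.0):
    """Map a learned residual action to a bounded external force on the root,
    compensating for missing environment support (trampolines, mats).
    The action is clamped to [-1, 1] and scaled by an assumed force cap."""
    return np.clip(root_force_action, -1.0, 1.0) * max_force
```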

Loss & Training

  • MCM: Denoising training based on a diffusion model; random motion segments are selected as generation targets
  • PTM pretraining: PPO + AMP, strict reconstruction reward + early termination; approximately 2–3 days on a single A100
  • TTA inference: everyday motions require <500 steps or no adaptation; high-difficulty motions require 2,000–4,000 steps

Key Experimental Results

Main Results

Comparison on Public Datasets (Table 1 — selected key results):

| Dataset | Method | WA-MJE↓ | W-MJE↓ | MPJPE↓ | OKS↑ | GP↓ | Float↓ | FS↓ |
|---|---|---|---|---|---|---|---|---|
| AIST++ | TRAM | 106.2 | 159.5 | 91.8 | 0.945 | 20.6 | 490.0 | 2.35 |
| AIST++ | TRAM+PhysPT | 136.8 | 218.3 | 93.6 | 0.903 | 4.08 | 22.7 | 2.07 |
| AIST++ | TRAM+Ours | 106.2 | 157.7 | 94.0 | 0.953 | 0.50 | 1.97 | 0.59 |
| AIST++ | GVHMR | 124.4 | 197.3 | 93.5 | 0.965 | 12.4 | 71.2 | 2.23 |
| AIST++ | GVHMR+Ours | 123.4 | 193.8 | 94.0 | 0.963 | 0.50 | 1.98 | 0.59 |
| Kungfu | TRAM | 113.4 | 209.7 | 84.6 | 0.925 | 4.32 | 40.9 | 2.57 |
| Kungfu | TRAM+Ours | 113.3 | 193.7 | 79.5 | 0.931 | 0.24 | 5.71 | 0.26 |
| EMDB | GVHMR | 109.1 | 274.9 | 252.2 | 0.954 | 82.3 | 510.3 | 0.69 |
| EMDB | GVHMR+Ours | 91.2 | 261.6 | 249.1 | 0.948 | 0.25 | 3.63 | 0.17 |

In-the-Wild High-Difficulty Test Set (Table 2 — 206 videos):

| Method | OKS↑ | MPS↑ | GP↓ | Float↓ | FS↓ |
|---|---|---|---|---|---|
| TRAM | 0.828 | 0.667 | 19.99 | 107.4 | 12.26 |
| TRAM+PhysPT | 0.730 | 0.645 | 7.88 | 39.4 | 6.01 |
| TRAM+Ours | 0.845 | 0.687 | 0.60 | 17.0 | 0.78 |
| GVHMR | 0.837 | 0.704 | 10.0 | 138.0 | 3.01 |
| GVHMR+Ours | 0.854 | 0.710 | 0.33 | 14.9 | 0.72 |

Ablation Study

PTM Physical Transfer Capability (Table 3 — Kungfu dataset):

| Method | SR↑ | MPJPEg↓ | MPJPE↓ | PA-MPJPE↓ |
|---|---|---|---|---|
| UHC | 42.91% | 86.23 | 48.91 | 39.73 |
| PHC+ | 76.41% | 84.86 | 47.98 | 39.43 |
| PTM (Ours) | 98.16% | 82.13 | 33.45 | 26.12 |

Component Ablation of TTA (Table 4 — in-the-wild high-difficulty dataset):

| Early-Term | Res-F | TTA | Rela-Rwd | OKS↑ | MPS↑ | SR↑ |
|---|---|---|---|---|---|---|
|  |  |  |  | 0.811 | 0.673 | 37% |
| ✓ |  |  |  | 0.784 | 0.652 | 52% |
| ✓ | ✓ |  |  | 0.823 | 0.673 | 61% |
| ✓ | ✓ | ✓ |  | 0.850 | 0.706 | 85% |
| ✓ | ✓ | ✓ | ✓ | 0.853 | 0.710 | 87% |

MCM Configuration Ablation (Table 5):

| In-between | Mask Cond. | Kpts Cond. | Mask Det. | Kpts Det. | OKS↑ | SR↑ |
|---|---|---|---|---|---|---|
|  |  |  |  |  | 0.802 | 78% |
| ✓ |  | ✓ |  | ✓ | 0.834 | 83% |
| ✓ | ✓ |  |  | ✓ | 0.845 | 87% |
| ✓ | ✓ |  | ✓ |  | 0.853 | 87% |

Key Findings

  1. Substantial improvement in physical plausibility: ground penetration drops from 82.3 to 0.25 (EMDB-GVHMR), foot sliding from 12.26 to 0.78 (Wild-TRAM), and self-penetration decreases by more than 50%
  2. Original motion patterns are preserved: joint errors in world/camera coordinates remain largely unchanged, and 2D similarity even improves
  3. TTA is the largest contributor: success rate improves from 61% (without TTA) to 85% (with TTA enabled)
  4. Residual force is critical for aerial motions: compensates for the absence of environmental support such as trampolines and mats
  5. Masks outperform keypoints as conditioning signals: segmentation algorithms only distinguish foreground from background, whereas keypoint detection is prone to failure on blurry frames
  6. PhysPT degrades certain metrics: its simplification of dynamics equations and lack of understanding of high-difficulty motion distributions cause world-coordinate errors to increase

Highlights & Insights

  1. Plug-and-play design: integrates into any video motion capture method without additional training, greatly lowering the barrier to adoption
  2. Elegance of the pretrain + adapt paradigm: pretraining accumulates motion priors, while TTA adapts to specific motions, naturally addressing long-tail distribution and catastrophic forgetting
  3. Relative reward and termination design: recognizing that video motion capture outputs are inherently unreliable, the method abandons absolute positional constraints in favor of relative ones—a highly practical design choice
  4. SAM segmentation + ViT features: leverages the generalization capability of large foundation models to compensate for the failure of keypoint detection in high-difficulty scenarios
  5. Collection of 206 high-difficulty in-the-wild videos as a benchmark: covering rhythmic gymnastics, taekwondo, yoga, and more, filling an evaluation gap in the field
  6. Thorough problem analysis: clearly distinguishes "artifact motions" and "complex motion imitation" as two independent problems, addressed respectively by MCM and PTM

Limitations & Future Work

  1. Single-person motions only: cannot restore multi-person motions involving close interaction
  2. TTA inference is slow: high-difficulty motions require 2,000–4,000 adaptation steps
  3. The use of residual forces is not fully physically justified, as it essentially introduces external force assistance
  4. MCM relies on the accuracy of SAM and object detection
  5. Object interaction scenarios are not considered (human–object interaction data was excluded)
  6. The relative reward ignores absolute root position, which may cause drift in long sequences

Related Work

  • PHC/PHC+: motion imitation controllers achieving 97%+ success rates on AMASS, but failing on high-difficulty noisy motions
  • PhysPT: a pretrained physics-aware Transformer that lacks understanding of high-difficulty motions due to the absence of corresponding training data
  • SimPoE: integrates kinematic and physical dynamics from images, but is sensitive to control parameters
  • AMP: adversarial motion prior providing style rewards
  • GVHMR/TRAM: state-of-the-art video motion capture methods; this work serves as a downstream post-processing stage for them

The pretrain + TTA paradigm proposed in this paper may inspire robot control tasks that require generalization to out-of-distribution samples.

Rating

  • Novelty: ⭐⭐⭐⭐ — The combination of MCM (mask-guided correction) and PTM (pretraining + TTA) is original; the relative reward design demonstrates genuine insight
  • Practicality: ⭐⭐⭐⭐⭐ — Plug-and-play, covering motions from everyday to high-difficulty
  • Experiments: ⭐⭐⭐⭐⭐ — Three public datasets + 206 high-difficulty in-the-wild videos, comprehensive ablations, and thorough comparison with state-of-the-art
  • Writing Quality: ⭐⭐⭐⭐ — Problem motivation is clear, with a clean one-to-one correspondence between two challenges and two modules
  • Overall: 8.5/10