PhysMoDPO: Physically-Plausible Humanoid Motion with Preference Optimization

- Conference: CVPR 2026
- arXiv: 2603.13228
- Code: None (not yet released)
- Area: Human Motion Generation / Embodied Intelligence / Reinforcement Learning Alignment
- Keywords: Diffusion motion generation, DPO preference optimization, physical simulation, humanoid robots, zero-shot transfer
TL;DR
PhysMoDPO integrates a pretrained whole-body controller (WBC/DeepMimic) into the post-training pipeline of a diffusion-based motion generator. By automatically constructing preference pairs via physical simulation and fine-tuning with DPO, generated motions—after WBC execution—simultaneously satisfy physical plausibility and text/spatial condition faithfulness, enabling zero-shot transfer to the Unitree G1 real robot.
Background & Motivation
Text-driven human motion generation has advanced substantially via diffusion models, yet a core contradiction persists: diffusion models are trained and evaluated in kinematic space (focusing on distribution similarity and condition alignment), whereas robot deployment requires motions to be feasible under dynamic constraints (no foot sliding, center of mass within the support polygon, friction constraints satisfied, etc.). Current deployment pipelines rely on whole-body controllers (WBC, e.g., DeepMimic) to convert generated motions into physically executable trajectories—but WBC may substantially modify motions to satisfy physical constraints, causing executed trajectories to deviate from textual intent. Existing physics-augmented approaches either apply hand-crafted physics losses (e.g., foot-slide penalties) at test time, which may corrupt the motion distribution, or employ hand-crafted reward RL fine-tuning (e.g., ReinDiffuse with PPO, HY-Motion with GRPO)—yet manual heuristics struggle to capture complex dynamic properties (e.g., anomalous center-of-mass positions). There is therefore a need for an automated post-training framework that directly exploits the physical simulator as a reward signal source.
Core Problem
How can a diffusion-based motion generator be automatically fine-tuned—without hand-crafted physical heuristics—such that its outputs, after WBC physical simulation execution, both satisfy physical constraints (no falling, no sliding) and maintain faithfulness to input text/spatial conditions?
Method
Overall Architecture
PhysMoDPO operates as a "generate–simulate–evaluate–fine-tune" iterative loop:

1. For each condition \(C\) (text or text + spatial constraints), \(K\) candidate motions are sampled from the diffusion generator via its stochasticity: \(X_k = G_\theta(\epsilon_k, C)\).
2. Each candidate is executed by a fixed WBC (DeepMimic) in physical simulation, yielding a physically feasible trajectory \(X'_k = \mathcal{T}(X_k)\).
3. Physical and task rewards are computed on the post-simulation trajectories \(X'_k\).
4. Preference pairs \((X_{win}, X_{lose})\) are constructed from the rewards, and the generator is fine-tuned with a DPO loss.
5. The updated generator resamples candidates and new preference pairs are constructed, repeating the loop.
Key insight: evaluation takes place in physical space, while DPO training operates on kinematic-space sample pairs—respecting DPO's premise that data must come from the model's own samples.
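For concreteness, here is a minimal Python sketch of one round of this loop. It assumes the generator, WBC rollout, reward computation, pair construction, and fine-tuning step are available as callables; all names are illustrative placeholders, not the authors' released code.

```python
def physmodpo_round(generator, wbc_rollout, compute_rewards,
                    build_pairs, dpo_finetune, conditions, K=12):
    """One generate–simulate–evaluate–fine-tune round (illustrative sketch).

    generator(cond)              -> kinematic motion X sampled from the diffusion model
    wbc_rollout(X)               -> physically executed trajectory X' = T(X)
    compute_rewards(X, X', cond) -> vector of reward terms (track, slide, M2T, control)
    build_pairs(cands, rewards)  -> list of (X_win, X_lose) preference pairs
    dpo_finetune(pairs)          -> updates the generator with DPO (+ SFT on winners)
    """
    all_pairs = []
    for cond in conditions:
        # 1. Sample K candidates from the current generator's stochasticity.
        candidates = [generator(cond) for _ in range(K)]
        # 2. Execute each candidate with the fixed WBC in simulation.
        executed = [wbc_rollout(x) for x in candidates]
        # 3. Score the post-simulation trajectories with all reward terms.
        rewards = [compute_rewards(x, x_sim, cond)
                   for x, x_sim in zip(candidates, executed)]
        # 4. Keep only pairs where one candidate strictly dominates another.
        all_pairs.extend(build_pairs(candidates, rewards))
    # 5. Fine-tune the generator; the next round resamples with the updated model.
    dpo_finetune(all_pairs)
    return generator
```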
Key Designs

- Physical operator \(\mathcal{T}\) as a black-box reward source: A pretrained DeepMimic tracking controller executes generated motions in simulation. Tracking distortion \(\Delta(X) = \|X' - X\|^2\) directly measures physical feasibility—smaller distortion indicates the motion is closer to the physically feasible space \(\mathcal{X}_{phys}\). Treating \(\mathcal{T}\) as a non-differentiable black box avoids the difficulty of differentiating through the simulator.
- Combination of four reward functions (see the sketch after this list):
  - \(\mathcal{R}_{track}\): tracking reward, minimizing the discrepancy between pre- and post-simulation motions
  - \(\mathcal{R}_{slide}\): foot-slide penalty, applied when foot height is below a threshold (0.05 m) and horizontal velocity exceeds a threshold (0.5 m/s)
  - \(\mathcal{R}_{M2T}\): text–motion consistency, computing cosine similarity between the post-simulation motion and the text using a pretrained TMR encoder
  - \(\mathcal{R}_{control}\): spatial control reward (only when spatial constraints are present), measuring how well joint trajectories match the target
- Dominance-based preference pair construction: Rather than computing a weighted sum of rewards (which introduces weight sensitivity and reward hacking), a winning sample must strictly dominate the losing sample across all reward dimensions, avoiding multi-objective reward engineering.
- Iterative generate–fine-tune loop: Each round resamples candidates with the updated model and constructs new preference pairs. Three iterations yield the best results, progressively targeting the model's current physical weaknesses.
- SMPL representation replacing HumanML3D: Training directly on SMPL joint rotations avoids the expensive inverse kinematics computation required when converting from HumanML3D format to SMPL, facilitating downstream robot deployment.
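Below is a self-contained sketch of how the reward terms and the strict-dominance pairing rule might be implemented. The 0.05 m / 0.5 m/s thresholds are the ones reported above; the array shapes, function names, and exact reward scaling are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from itertools import combinations

# Thresholds reported for the foot-slide penalty.
FOOT_HEIGHT_THRESH = 0.05   # metres
FOOT_VEL_THRESH = 0.5       # metres / second


def tracking_reward(x_kin, x_sim):
    """R_track: negative tracking distortion ||X' - X||^2 between the
    generated (kinematic) motion and its simulated execution."""
    return -float(np.sum((np.asarray(x_sim) - np.asarray(x_kin)) ** 2))


def slide_reward(foot_height, foot_vel_xy):
    """R_slide: penalise frames where a foot is near the ground (< 0.05 m)
    while still moving horizontally (> 0.5 m/s)."""
    speed = np.linalg.norm(foot_vel_xy, axis=-1)
    sliding = (np.asarray(foot_height) < FOOT_HEIGHT_THRESH) & (speed > FOOT_VEL_THRESH)
    return -float(sliding.mean())


def m2t_reward(motion_emb, text_emb):
    """R_M2T: cosine similarity between embeddings of the simulated motion and
    the text prompt (embeddings assumed precomputed by a pretrained TMR encoder)."""
    num = float(np.dot(motion_emb, text_emb))
    den = float(np.linalg.norm(motion_emb) * np.linalg.norm(text_emb)) + 1e-8
    return num / den


def control_reward(joint_traj, target_traj):
    """R_control: negative error between controlled joint trajectories and
    their spatial targets (only used when spatial constraints are given)."""
    return -float(np.mean(np.linalg.norm(joint_traj - target_traj, axis=-1)))


def strictly_dominates(r_a, r_b):
    """True if candidate a is strictly better than b on every reward dimension
    (the strict-dominance rule used to form preference pairs)."""
    r_a, r_b = np.asarray(r_a, dtype=float), np.asarray(r_b, dtype=float)
    return bool(np.all(r_a > r_b))


def build_dominance_pairs(candidates, rewards):
    """Form (winner, loser) pairs only when one candidate strictly dominates
    another; no reward weighting or fused score is involved."""
    pairs = []
    for i, j in combinations(range(len(candidates)), 2):
        if strictly_dominates(rewards[i], rewards[j]):
            pairs.append((candidates[i], candidates[j]))
        elif strictly_dominates(rewards[j], rewards[i]):
            pairs.append((candidates[j], candidates[i]))
    return pairs
```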
Loss & Training
Total loss: \(\mathcal{L} = \mathcal{L}_{DPO}(X_{win}, X_{lose}) + \lambda_{SFT} \mathcal{L}_{SFT}(X_{win})\)
- \(\mathcal{L}_{DPO}\): standard Diffusion-DPO objective, encouraging the model to sample motions more similar to \(X_{win}\)
- \(\mathcal{L}_{SFT}\): supervised fine-tuning loss on winning samples only (Two-Forward strategy), preventing preference optimization drift and maintaining generation quality
- Text task: \(\lambda_{SFT}=1\), \(\beta=5\), lr=1e-6, batch=32, 5000 steps, only the diffusion head is updated
- Spatial control task: \(\lambda_{SFT}=2\), \(\beta=20\), batch=64, 4000 steps
- 12 candidate motions sampled per text prompt
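A hedged sketch of the combined objective is shown below, following the standard Diffusion-DPO formulation plus the winner-only SFT term. The `denoise_loss` interface and other names are assumptions, since the authors' implementation is not released.

```python
import torch
import torch.nn.functional as F


def diffusion_dpo_sft_loss(model, ref_model, x_win, x_lose, cond,
                           beta=5.0, lambda_sft=1.0):
    """Sketch of L = L_DPO(X_win, X_lose) + lambda_SFT * L_SFT(X_win).

    `model` / `ref_model` are assumed to expose `denoise_loss(x, t, cond)`,
    returning the per-sample diffusion (epsilon-prediction) MSE; this is an
    illustrative interface, not the paper's actual code.
    """
    t = torch.randint(0, model.num_timesteps, (x_win.shape[0],), device=x_win.device)

    # Per-sample denoising errors under the current and frozen reference models.
    err_w = model.denoise_loss(x_win, t, cond)
    err_l = model.denoise_loss(x_lose, t, cond)
    with torch.no_grad():
        ref_err_w = ref_model.denoise_loss(x_win, t, cond)
        ref_err_l = ref_model.denoise_loss(x_lose, t, cond)

    # Diffusion-DPO: lower the winner's error relative to the reference
    # more than the loser's.
    logits = -beta * ((err_w - ref_err_w) - (err_l - ref_err_l))
    loss_dpo = -F.logsigmoid(logits).mean()

    # Winner-only SFT term: a plain diffusion loss on X_win via a second
    # forward pass, per the Two-Forward strategy described above.
    loss_sft = model.denoise_loss(x_win, t, cond).mean()

    return loss_dpo + lambda_sft * loss_sft
```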
Key Experimental Results
Main Results
Text-to-motion (HumanML3D, evaluated after SMPL simulation)
| Method | R@3↑ | FID↓ | Jerk↓ |
|---|---|---|---|
| MaskedMimic | 0.6305 | 73.79 | 66.08 |
| MotionStreamer | 0.8310 | — | 46.75 |
| SFT baseline | — | 49.22 | 48.30 |
| PhysMoDPO | 0.8517 | — | 43.60 |
Spatial–text control (HumanML3D, evaluated after SMPL simulation, cross-control)
| Method | Err.↓ | FID↓ | Jerk↓ |
|---|---|---|---|
| OmniControl (original) | 0.1998 | 5.82 | 115.12 |
| OmniControl (cross) | — | 0.75 | 64.07 |
| PhysMoDPO | 0.1298 | 0.93 | 62.31 |
G1 robot zero-shot transfer (text-to-motion)
| Method | M2T↑ | R@3↑ | FID↓ |
|---|---|---|---|
| MaskedMimic | 0.7156 | 0.5761 | 0.3673 |
| MotionStreamer | — | — | — |
| PhysMoDPO | — | — | — |
Ablation Study
- Iteration rounds: 1→3 rounds reduces Err from 0.1421 to 0.1298, FID from 1.17 to 0.93, Jerk from 72.13 to 62.31, validating the sustained benefit of iterative fine-tuning.
- Reward combination: tracking only → +control → +sliding → +M2T yields progressive improvement. Tracking alone encourages conservative motions and causes training instability; adding M2T slightly increases Jerk (by encouraging more dynamic semantic actions) but improves overall realism and alignment.
- Preference pair construction: strict dominance significantly outperforms fused score weighting, which is sensitive to weight choices and prone to hacking.
- SFT loss weight: \(\lambda_{SFT}=2\) is optimal; \(\lambda_{SFT}=0\) degrades control precision and generation quality; excessively large values (5/10) diminish DPO gains.
- DPO temperature \(\beta\): \(\beta=20\) is optimal; \(\beta=1\) yields limited improvement; \(\beta=50\) causes over-updating, degrading FID and Jerk.
- Data representation: SMPL outperforms HumanML3D format on spatial controllability and FID (with slight text-alignment reduction) and requires no inverse kinematics.
- Data scale: 20% of data achieves reasonable performance, validating sample efficiency.
Highlights & Insights
- Clear core contribution: The physical simulator is used as a non-differentiable black-box reward source; DPO preference learning circumvents the need for differentiation—more comprehensive than hand-crafted physical heuristic rewards (automatically covering dynamics, contact, balance, etc.) and more practical than differentiable physical simulation.
- Evaluation in deployed space: All metrics are computed on the trajectories produced by WBC execution in simulation rather than in kinematic space, directly measuring deployment performance and exposing the blind spot of prior methods that appear strong on kinematic metrics but perform poorly under physical execution.
- Strict dominance preference pair construction: Under multi-objective rewards, rather than computing weighted sums, a candidate must win on all dimensions to be designated a winner, eliminating reward engineering and hacking. This simple and effective strategy is transferable to other multi-objective DPO scenarios.
- Successful zero-shot cross-body transfer: PhysMoDPO trained on SMPL transfers zero-shot to both G1 and H1 robots, demonstrating the generalizability of physical compatibility.
- Effectiveness of iterative DPO: Each round resamples from the updated model to construct new preference pairs, achieving progressive improvement that saturates at three rounds.
Limitations & Future Work
- Validation is limited to flat terrain; complex environments (stairs, slopes, uneven surfaces) are not addressed.
- Dependence on a fixed DeepMimic tracking policy means its own biases propagate into preference pair construction; ideally, multiple different WBCs or human evaluation would reduce evaluation bias.
- Motions requiring object support (e.g., climbing stairs) are filtered out, limiting applicability.
- Generator parameter updates during DPO fine-tuning are restricted to the diffusion head; full-parameter fine-tuning or strategies such as LoRA are not explored.
- The user study involves only 20 participants and 40 video pairs, limiting its scale.
Related Work & Insights
- vs. ReinDiffuse/HY-Motion: Both also fine-tune motion generators, but rely on hand-crafted heuristics (floating/foot-sliding) as PPO/GRPO rewards, which fail to capture complex physical issues such as anomalous center-of-mass positions. PhysMoDPO uses the simulator to directly produce physical trajectories, yielding more comprehensive reward signals. DPO is also more stable and efficient than PPO.
- vs. MaskedMimic: An end-to-end physical policy that, while physically compliant, exhibits poor text following (R@3 of only 0.6305 vs. PhysMoDPO's 0.8517), indicating that learning a control policy directly from large-scale natural language remains highly challenging.
- vs. PhysPT/PhysDiff/Zhang et al. constraint projection methods: These impose physical constraints at inference or sampling time, potentially altering the output distribution. PhysMoDPO internalizes physical knowledge during training, requiring no additional optimization at inference, and is therefore more efficient.
- vs. Morph: Morph refines data with a physical model and then fine-tunes the generator, but excessive generator noise may in turn harm the physical model. PhysMoDPO uses the physical model solely to compute rewards without modifying it.
- DPO as a paradigm for continuous control: This work successfully transfers DPO from LLM alignment to motion generation. The core pattern—"non-differentiable evaluator + preference pairs + DPO"—generalizes to any generative task requiring alignment with non-differentiable simulators or renderers (e.g., 3D generation in settings where differentiable rendering is infeasible).
- Generality of the strict dominance rule: The idea of avoiding reward weighting in multi-objective DPO is broadly applicable.
- Philosophy of "deployed space evaluation": Evaluating in the deployment space rather than the generation space is transferable to other generation-plus-execution pipelines (e.g., code generation → compile-and-run, text-to-image → rendering).
Rating
- Novelty: ⭐⭐⭐⭐ Using the simulator as a black-box reward source for DPO is a natural and novel idea; strict dominance preference construction is particularly distinctive.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three body types (SMPL, G1, H1), two tasks (text and spatial control), real robot deployment, user study, and very detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ Problem definition is clear (kinematic space vs. physical space), formalization is rigorous, and experimental organization is strong.
- Value: ⭐⭐⭐⭐ Provides a systematic solution for physical plausibility in motion generation with direct impact on the embodied AI community.