
Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning

Conference: CVPR 2026 · arXiv: 2603.11346 · Code: AssistMimic · Area: Video Understanding · Keywords: multi-agent reinforcement learning, physics-based character control, human-human interaction, assistive motion imitation, motion tracking

TL;DR

This paper proposes AssistMimic, which formulates physics-based imitation of human-human assistive interactions as a multi-agent reinforcement learning (MARL) problem. Through motion-prior initialization, dynamic reference retargeting, and contact-promoting rewards, it achieves, for the first time, tracking of force-exchanging assistive motions in physics simulation.

Background & Motivation

  • Background: Physics-based human motion imitation (e.g., DeepMimic, PHC) has enabled virtual characters and humanoid robots to reproduce single-person motions with high fidelity, but research remains largely limited to single-person scenarios, with close-contact multi-person interactions rarely addressed.
  • Limitations of Prior Work: Existing multi-agent interaction methods (e.g., Human-X, Phys-Reaction) rely on a "kinematic playback" strategy: the recipient's motion is first generated with a single-person controller, then frozen as a fixed playback while the supporter is trained against it. However, in assistive scenarios the recipient cannot perform the motion independently (e.g., a paralyzed person cannot stand up alone), so generating the recipient's trajectory in isolation is physically infeasible.
  • Key Challenge: Assistive human-human interaction requires both parties to continuously perceive each other's pose and adapt forces/positions in real time; decoupled training breaks physical consistency, leading to severe artifacts such as interpenetration and character instability.
  • Goal: To learn physically plausible controllers for dyadic assistive interaction, enabling the supporter to provide meaningful physical support in response to the recipient's real-time state.
  • Key Insight: The problem is modeled as a multi-agent MDP with asymmetric dynamics, jointly training both parties' policies so that the recipient also learns "how to receive assistance."
  • Core Idea: Three components work synergistically to enable MARL convergence in high-contact scenarios: initialization from single-person motion priors, dynamic reference retargeting to maintain contact alignment, and contact-promoting rewards to replace noisy hand tracking.

Method

Overall Architecture

AssistMimic jointly trains tracking policies for a Supporter and a Recipient within a physics simulator. Both agents share a symmetric goal-conditioned policy architecture, taking as input proprioceptive observations \(s_{\text{prior}}\), interaction-aware state \(s_{\text{assist}}\) (partner observations, contact state, contact forces, previous action), and goal \(g\). Physical constraints are imposed on the recipient (reduced PD gains and maximum torques) to force dependence on external support. Specialist policies are trained with PPO and subsequently distilled into a generalist via DAgger.
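The paper's exact network sizes are not reproduced here; the following PyTorch sketch only illustrates the shared goal-conditioned interface, with hypothetical dimensions (S_PRIOR, S_ASSIST, GOAL, ACT) and hidden widths, and with the \(s_{\text{assist}}\) inputs appended last so that the motion-prior weights (Key Design 1 below) can be copied contiguously.

```python
# Minimal sketch of the shared Supporter/Recipient policy interface.
# All dimensions and hidden widths are hypothetical stand-ins, not paper values.
import torch
import torch.nn as nn

S_PRIOR, S_ASSIST, GOAL, ACT = 358, 128, 120, 69  # hypothetical dimensions

class TrackingPolicy(nn.Module):
    """Goal-conditioned tracking policy shared by both agents."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(S_PRIOR + GOAL + S_ASSIST, 1024), nn.SiLU(),
            nn.Linear(1024, 512), nn.SiLU(),
            nn.Linear(512, ACT),  # e.g., PD targets for the character's joints
        )

    def forward(self, s_prior, s_assist, g):
        # s_assist bundles partner observations, contact state/forces, and the
        # previous action; it is concatenated after the single-person inputs.
        return self.net(torch.cat([s_prior, g, s_assist], dim=-1))

policy = TrackingPolicy()
action = policy(torch.zeros(1, S_PRIOR), torch.zeros(1, S_ASSIST),
                torch.zeros(1, GOAL))
```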

Key Design 1: Weight Initialization from Motion Prior

  • Function: Initializes both agents' policy network weights using a pre-trained single-person motion tracking controller (PHC).
  • Mechanism: PHC's input-layer weights are copied into the portion corresponding to \(s_{\text{prior}}\), while weights for the additional \(s_{\text{assist}}\) inputs are initialized to zero: \(\mathbf{W}_{\text{new}}^{\text{input}} = [\mathbf{W}_{\text{prior}}^{\text{input}} \mid \mathbf{0}]\) (see the sketch after this list). The initial policy behavior is thus equivalent to a single-person controller, preserving basic motor skills such as standing and walking.
  • Design Motivation: The MARL exploration space for assistive interactions is enormous; training from scratch fails entirely (ablations show 0% success rate). The motion prior provides a stable starting point, preventing the policy from collapsing into reward hacking.
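A minimal sketch of this initialization under the same hypothetical dimensions as above, using a randomly initialized nn.Linear as a stand-in for the pre-trained PHC input layer:

```python
import torch
import torch.nn as nn

S_PRIOR, S_ASSIST, GOAL, HIDDEN = 358, 128, 120, 1024  # hypothetical sizes

# Stand-in for the pre-trained PHC input layer over [s_prior | g].
phc_layer = nn.Linear(S_PRIOR + GOAL, HIDDEN)

# New input layer over [s_prior | g | s_assist]. Copying the prior columns and
# zeroing the extra s_assist columns realizes W_new = [W_prior | 0], so before
# any MARL training the policy behaves exactly like the single-person PHC.
new_layer = nn.Linear(S_PRIOR + GOAL + S_ASSIST, HIDDEN)
with torch.no_grad():
    new_layer.weight.zero_()
    new_layer.weight[:, : S_PRIOR + GOAL] = phc_layer.weight
    new_layer.bias.copy_(phc_layer.bias)
```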

Key Design 2: Dynamic Reference Retargeting

  • Function: When the distance between the supporter and recipient falls below threshold \(\tau_{\text{dist}}\), the supporter's hand reference trajectory is retargeted from global coordinates to coordinates relative to the recipient's current body pose.
  • Mechanism: The recipient body joint \(k^*\) nearest to the supporter's wrist in reference space is identified, the reference offset \(\Delta\hat{\mathbf{p}}\) is computed, and this offset is re-anchored on the recipient's actual joint position in simulation: \(\hat{\mathbf{p}}_{h_i,t}^{(S)} = \mathbf{p}_{k^*,t}^{(R)} + \Delta\hat{\mathbf{p}}_{h_i,t}\) (see the sketch after this list).
  • Design Motivation: Due to physical constraints, the recipient deviates from the reference trajectory. If the supporter blindly tracks fixed reference positions, its hands will completely miss the target body region and fail to make contact. Dynamic retargeting ensures the supporter's hands always track the recipient's actual position.
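A NumPy sketch of the retargeting rule, assuming per-frame joint-position arrays; the threshold value and the way the distance gate is applied are illustrative assumptions:

```python
import numpy as np

def retarget_hand_reference(dist, hand_ref, recip_ref_joints, recip_sim_joints,
                            tau_dist=0.5):
    """Re-anchor one supporter-hand reference on the recipient's actual pose.

    dist:             current supporter-recipient distance (gates retargeting)
    hand_ref:         (3,) reference wrist position, global frame
    recip_ref_joints: (J, 3) recipient joint positions in the *reference* motion
    recip_sim_joints: (J, 3) recipient joint positions in the *simulation*
    tau_dist:         hypothetical engagement threshold (metres)
    """
    if dist >= tau_dist:
        return hand_ref  # far apart: keep tracking the global reference
    # Nearest recipient joint k* to the hand, measured in reference space.
    k_star = np.argmin(np.linalg.norm(recip_ref_joints - hand_ref, axis=-1))
    delta = hand_ref - recip_ref_joints[k_star]  # reference offset
    return recip_sim_joints[k_star] + delta      # offset applied to actual pose
```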

Key Design 3: Contact-Promoting Reward

  • Function: When the supporter's hand enters the close-range vicinity of the recipient's body (\(d_{i,t} \leq d_{\text{th}}\)), the standard hand-tracking reward is replaced by a contact-promoting reward.
  • Mechanism: At close range, tracking penalties are suppressed and the policy is instead rewarded for finger contact force and proximity: \(r = \beta f_{i,t} \exp(-\alpha d_{i,t}) + b_{\text{contact}}\), where \(f_{i,t}\) is a safely saturated aggregate of finger contact forces (see the sketch after this list). At longer range, the standard tracking reward is retained.
  • Design Motivation: Hand trajectories in motion capture data suffer from severe occlusion noise; strict tracking impedes effective support and may even cause collisions. The contact-promoting reward trains the policy to "apply the right force at the right location" rather than pursue kinematically precise reproduction.
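A sketch of the distance-gated reward switch; all coefficients (alpha, beta, the range threshold, the force saturation cap, and the contact bonus) are hypothetical values, and granting the bonus only while contact force is nonzero is an assumption rather than the paper's stated rule:

```python
import numpy as np

ALPHA, BETA = 5.0, 0.01  # hypothetical shaping coefficients
D_TH = 0.2               # hypothetical close-range threshold (metres)
F_MAX = 50.0             # safety saturation on aggregate finger force (N)
B_CONTACT = 0.5          # hypothetical bonus while fingers are in contact

def hand_reward(d, finger_forces, tracking_reward):
    """Switch from hand tracking to contact promotion at close range."""
    if d > D_TH:
        return tracking_reward               # far away: standard tracking term
    f = min(np.sum(finger_forces), F_MAX)    # safely saturated contact force
    bonus = B_CONTACT if f > 0.0 else 0.0    # assumed gating of the bonus
    return BETA * f * np.exp(-ALPHA * d) + bonus
```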

Loss & Training

  • Base Reward: Tracking reward \(r_{\text{track}}^{(m)} = \exp(-D(\hat{\mathbf{q}}_t^{(m)}, \mathbf{q}_t^{(m)}))\), where \(D\) is a weighted distance between simulated and reference joint rotations, positions, and velocities (see the sketch after this list).
  • Recipient Reward: Tracking + power penalty + assistive stability term.
  • Supporter Reward: Standard tracking at long range; switches to contact-promoting reward (Eq. 11) at close range.
  • Training Procedure: Motions are clustered by subject ID; one specialist (PPO) is trained per cluster. Early termination is applied at a pose deviation threshold of 0.25 m. Physical State Initialization (PSI) samples initial states from recent rollouts to avoid interpenetration. A generalist is distilled from multiple specialists via DAgger.
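A sketch of the base tracking reward and the 0.25 m early-termination check, assuming per-joint arrays; the error weights and the naive rotation distance are illustrative stand-ins for the paper's exact weighting:

```python
import numpy as np

# Hypothetical weights on the rotation / position / velocity error terms.
W_ROT, W_POS, W_VEL = 0.3, 0.5, 0.2

def tracking_reward(q_ref, q_sim):
    """r_track = exp(-D(q_ref, q_sim)), D a weighted sum of per-joint errors."""
    d = (W_ROT * np.mean(np.abs(q_ref["rot"] - q_sim["rot"]))  # naive angle diff
         + W_POS * np.mean(np.linalg.norm(q_ref["pos"] - q_sim["pos"], axis=-1))
         + W_VEL * np.mean(np.linalg.norm(q_ref["vel"] - q_sim["vel"], axis=-1)))
    return np.exp(-d)

def should_terminate(q_ref, q_sim, threshold=0.25):
    """Early termination once mean joint-position deviation exceeds 0.25 m."""
    dev = np.mean(np.linalg.norm(q_ref["pos"] - q_sim["pos"], axis=-1))
    return dev > threshold
```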

Key Experimental Results

Main Results: Specialist Policy Evaluation

| Method | Inter-X SR(%)↑ | Inter-X MPJPE(mm)↓ | Mass×1.2 SR(%)↑ | Kp/Kd×0.5 SR(%)↑ |
|---|---|---|---|---|
| Sequential Training | 62.4 | 92.3 | 49.9 | 50.5 |
| AssistMimic | 83.4 | 107 | 73.1 | 83.3 |
| (−) Dynamic Retargeting | 74.9 | 113 | 57.9 | 72.8 |
| (−) Contact Reward | 81.6 | 80.4 | 66.3 | 77.1 |
| (−) Weight Init | 0.0 | 248 | 0.0 | 0.0 |
| Method | HHI-Assist SR(%)↑ | MPJPE(mm)↓ | Mass×1.5 SR(%)↑ | Hip torque×0.5 SR(%)↑ |
|---|---|---|---|---|
| AssistMimic | 97.7 | 89.5 | 67.8 | 73.2 |
| (−) Dynamic Retargeting | 85.4 | 125 | 49.1 | 62.9 |
| (−) Contact Reward | 85.8 | 127 | 56.4 | 27.7 |
| (−) Weight Init | 19.1† | 364† | – | – |

Ablation Study: Generalist Policy and COM Stability

| Method | Inter-X Generalist SR(%)↑ | MPJPE(mm)↓ |
|---|---|---|
| AssistMimic | 39.8 | 103 |
| + DAgger Distillation | 64.7 | 106 |
| Method | COM Std (seen)↓ | COM Std (Mass×1.5)↓ | COM Std (Hip τ×0.5)↓ |
|---|---|---|---|
| AssistMimic | 0.0921 | 0.0738 | 0.0865 |
| (−) Dyn Retarget | 0.1038 | 0.0902 | 0.0924 |
| (−) Contact | 0.0938 | 0.0838 | 0.0849 |

Highlights & Insights

  1. First physics-based tracking of force-exchanging assistive interactions: On both Inter-X and HHI-Assist benchmarks, AssistMimic is the first method to successfully track close-contact, force-exchanging human-human motions, filling a critical gap in the field.
  2. The insight that "the recipient must also learn" is profound: The comparison between joint and decoupled training (83.4% vs. 62.4%) clearly demonstrates that even the assisted party must actively learn how to cooperate with and receive support—unidirectional adaptation is far from sufficient.
  3. Motion prior initialization is indispensable: Removing it causes the success rate to drop to 0%, indicating that without a sound initialization, RL cannot effectively explore the high-dimensional dyadic interaction space.
  4. Contact-promoting reward substantially improves robustness: The advantage is especially pronounced under unseen dynamics (increased mass, reduced torque), e.g., hip torque×0.5 on HHI-Assist: 73.2% vs. 27.7%, demonstrating that learning to "actively apply force through contact" is more important than kinematically precise hand trajectory following.
  5. Generalizes to generative motions: The framework can track interaction trajectories generated by diffusion models, converting kinematic outputs into physically plausible motions, demonstrating its generality.

Limitations & Future Work

  1. Insufficient hand dexterity: The current model exhibits high failure rates in scenarios requiring grasping and lifting the recipient's arm; fine-grained finger coordination is difficult to learn from noisy demonstrations, necessitating higher-DoF hand models or dedicated grasping policies.
  2. Absence of visual observations: The current policy relies on precise proprioceptive and partner state information without incorporating visual observations, limiting the feasibility of sim-to-real transfer to physical humanoid robots.
  3. Decoupled planning and control: The lack of tight integration between high-level motion planners and low-level tracking controllers prevents truly real-time adaptive coordination; future work may explore end-to-end joint learning of planning and control.
  4. Generalist policy success rate has room for improvement: The generalist achieves a success rate of 64.7% across 30 diverse interaction clips, still lagging behind the specialist's 83.4%; expanding training data and policy capacity are important directions.

Comparison with Related Work

| Aspect | AssistMimic | Human-X (2025) | Phys-Reaction (2024) |
|---|---|---|---|
| Interaction Modeling | Joint MARL, both parties co-optimized | Diffusion planner + single-agent tracking | Single agent + kinematic playback |
| Physical Consistency | Fully physics-simulated, supports force feedback | Partial physics, open-loop reaction | Playback breaks physical consistency |
| Applicable Scenarios | Force-exchanging assistance (supporting/lifting) | Social interaction (high-fives, etc.) | Non-contact social interaction |
| Key Limitation | Insufficient hand dexterity | Cannot handle force-coupled interactions | Recipient trajectory cannot be independently generated |

vs. PHC: AssistMimic directly inherits PHC's single-person tracking framework as a motion prior and extends it to a multi-agent architecture, demonstrating that a strong single-person controller can serve as a powerful foundation for multi-person interaction.

vs. CooHOI: CooHOI addresses human-object collaborative manipulation, whereas this work targets human-human assistive interaction. The key challenge is that the recipient is itself an agent with autonomous dynamics rather than a passive object, requiring bidirectional adaptation.

Rating

  • Novelty: ⭐⭐⭐⭐ — Formulating assistive interaction as an asymmetric MARL problem is a first in this domain; all three core components are grounded in clear physical intuition.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Two datasets, four evaluation settings, full ablations, and robustness tests under unseen dynamics; however, sim-to-real and real-robot experiments are absent.
  • Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear, component motivations are well-justified, and figures effectively illustrate method advantages and baseline failure modes.
  • Value: ⭐⭐⭐⭐⭐ — Assistive robotics is an important application domain; this work is the first to solve the control problem of physically plausible dyadic assistive interaction, laying the groundwork for future sim-to-real transfer.