PIPHEN: Physical Interaction Prediction with Hamiltonian Energy Networks

Conference: AAAI 2026
arXiv: 2511.16200
Code: None
Area: Other
Keywords: multi-robot collaboration, semantic communication, Hamiltonian energy networks, knowledge distillation, physical interaction prediction

TL;DR

This paper proposes PIPHEN, a distributed physical cognition-control framework for multi-robot systems. A Physical Interaction Prediction Network (PIPN) performs "semantic distillation," compressing high-dimensional perceptual data to less than 5% of its original volume, and a Hamiltonian Energy Network (HEN) controller, built on energy conservation, generates coordinated actions. Together they address the "shared brain dilemma" in multi-robot systems.

Background & Motivation

Multi-robot systems face the "shared brain dilemma" in complex physical collaboration: transmitting high-dimensional multimedia data (e.g., RGB-D video streams at ~30 MB/s) causes severe bandwidth bottlenecks and decision latency, while fully distributed architectures struggle to maintain global coordination. This challenge is particularly acute in critical application domains such as industrial automation, surgical assistance, and agricultural robotics.

Existing solutions oscillate between two extremes:

Centralized approaches: These "sacrifice communication for coordination." Multimodal fusion methods, for example, incur decision latency as high as 315 ms.

Distributed approaches: These "sacrifice coordination for communication." Distributed learning methods, for example, struggle to maintain global consistency.

Recent LLM-based multi-agent planners (e.g., LLaMAR, RoCo, CoELA) have demonstrated significant potential in task decomposition and high-level reasoning, but they share a fundamental limitation: they treat robot actions as "atomic black boxes." They generate logical sequences of what to do without addressing how to accomplish tasks with physical precision, and they do not resolve the high-dimensional perceptual data sharing that precise collaboration requires.

The core insight of this paper is to shift from a "transmit raw data" paradigm to a "transmit semantic knowledge" paradigm. Rather than compressing data files, PIPHEN distills 1 second of raw RGB-D video (~30 MB) into structured graph data describing object states and relationships (~1 MB). The distilled representation is roughly 1/30, or about 3.3%, of the original volume, consistent with the stated bound of less than 5%.

Method

Overall Architecture

PIPHEN adopts a hierarchical "micro-brain" architecture composed of three collaborative layers:

Central Coordination Layer ("brain"): responsible for global knowledge fusion and collaborative strategy generation.
Local Execution Layer ("cerebellum"): deployed on each robot, running a lightweight PIPN for real-time perception and HEN for real-time control.
Dedicated Processing Layer ("micro-brain"): provides dynamically loadable functional modules that support cross-robot invocation.

The framework centers on two core components—the Physical Interaction Prediction Network (PIPN) and the Hamiltonian Energy Network (HEN) controller—forming a complete perception–cognition–control closed loop.
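To make the "transmit semantic knowledge" idea concrete, the following is a minimal sketch of what a structured graph message exchanged between layers might look like. The class names, field names, and JSON encoding are hypothetical choices for illustration; the summary does not specify the actual message format.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List, Tuple


@dataclass
class ObjectState:
    # Hypothetical per-object attributes (node of the scene graph).
    name: str
    position: Tuple[float, float, float]
    velocity: Tuple[float, float, float]
    mass_kg: float


@dataclass
class Relation:
    # Hypothetical edge of the scene graph, e.g. "on_top_of".
    subject: str
    predicate: str
    target: str


@dataclass
class SceneGraphMessage:
    robot_id: str
    timestamp_s: float
    objects: List[ObjectState] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def to_bytes(self) -> bytes:
        """Serialize to compact JSON; the wire format is an assumption."""
        return json.dumps(asdict(self), separators=(",", ":")).encode("utf-8")


msg = SceneGraphMessage(
    robot_id="robot_a",
    timestamp_s=12.5,
    objects=[
        ObjectState("plate", (0.4, 0.1, 0.8), (0.0, 0.0, 0.0), 0.3),
        ObjectState("table", (0.5, 0.0, 0.4), (0.0, 0.0, 0.0), 20.0),
    ],
    relations=[Relation("plate", "on_top_of", "table")],
)
payload = msg.to_bytes()
print(len(payload), "bytes for one scene-graph snapshot")
```

A snapshot like this carries object states and relations only, which is why its size is decoupled from image resolution and frame rate.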

Key Designs

Physical Interaction Prediction Network (PIPN): The perceptual and cognitive core of the system, responsible for constructing an interpretable and compact physical world model from multimodal data.
- Hybrid physical representation: Fuses a structured physical knowledge graph (encoding object attributes and relationships) with task vector embeddings generated by a Transformer encoder (capturing dynamic and contextual information), integrated via a Cross-Attention module. The relational representation of the graph is computed by the physics-aware graph network:
\(R = \text{PhysGCN}(\{f_p^i\}, A, E; \theta_g)\)

where \(R\) is the relational representation, \(\{f_p^i\}\) are initial physical features, \(A\) is the adjacency matrix, \(E\) denotes edge features, and \(\theta_g\) denotes the PhysGCN parameters.
- Physics-aware Graph Convolutional Network (PhysGCN): A relational attention module is designed to encode physical constraints (e.g., mass ratio, relative velocity) as part of the attention weights, enabling the network to prioritize aggregating information from physically more relevant nodes.
- Physically Consistent Temporal Convolutional Network (PC-TCN): Employs a dynamic causal masking mechanism to adaptively adjust the temporal receptive field for more accurate capture of causal relationships in physical interactions.
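The relational attention idea in PhysGCN can be sketched as follows. This is an illustrative toy, not the paper's layer: the scoring rule (feature dot product plus a log mass-ratio term and a relative-speed term) and the weights `w_mass` and `w_vel` are assumptions chosen only to show how physical quantities can enter the attention weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def physics_attention(features, masses, velocities, adjacency,
                      w_mass=0.5, w_vel=0.5):
    """Aggregate neighbor features with physically biased attention.

    Hypothetical score for edge i -> j:
      dot(f_i, f_j) + w_mass * log(m_j / m_i) + w_vel * |v_i - v_j|
    """
    n = len(features)
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if adjacency[i][j]]
        if not nbrs:
            out.append(list(features[i]))  # isolated node keeps its feature
            continue
        scores = []
        for j in nbrs:
            sim = sum(a * b for a, b in zip(features[i], features[j]))
            mass_term = math.log(masses[j] / masses[i])
            rel_speed = math.dist(velocities[i], velocities[j])
            scores.append(sim + w_mass * mass_term + w_vel * rel_speed)
        weights = softmax(scores)
        dim = len(features[i])
        out.append([sum(w * features[j][d] for w, j in zip(weights, nbrs))
                    for d in range(dim)])
    return out


# Node 0 sees two neighbors; node 2 is heavier and moving relative to node 0.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
masses = [1.0, 1.0, 10.0]
velocities = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
adjacency = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
aggregated = physics_attention(features, masses, velocities, adjacency)
```

In this toy setup the aggregated feature of node 0 leans toward node 2, the physically more relevant neighbor under the assumed scoring rule.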
Hamiltonian Energy Network (HEN) Controller: Grounded in Hamiltonian mechanics, this component transforms physical understanding into precisely coordinated action control. The core constraint is:

\(\dot{x} = f(x, u) \quad \text{s.t.} \quad \frac{dH}{dt} \approx 0\)

where \(x\) is the system state vector (including generalized coordinates and momenta) and \(u\) is the control command vector. The constraint \(dH/dt \approx 0\) asks control commands to approximately conserve the total system energy \(H\), which promotes smooth, stable, and physically realistic coordinated actions. HEN is trained via imitation learning (behavior cloning), and the constraint enters the loss function as a soft penalty term \(\lambda \| \frac{dH}{dt} \|^2\).
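A minimal sketch of a behavior-cloning loss with an energy-rate penalty is shown below. Several pieces are assumptions for illustration: the Hamiltonian is a 1-D mass-spring system standing in for the real one, \(dH/dt\) is estimated by finite differences, actions are scalars, and `lam=0.1` is a placeholder because the summary does not give the value of \(\lambda\).

```python
import math


def hamiltonian(q, p, mass=1.0, stiffness=1.0):
    """Total energy of a 1-D mass-spring system (illustrative stand-in for H)."""
    return p * p / (2.0 * mass) + 0.5 * stiffness * q * q


def energy_rate(states, dt):
    """Finite-difference estimate of dH/dt along a trajectory of (q, p) pairs."""
    energies = [hamiltonian(q, p) for q, p in states]
    return [(e1 - e0) / dt for e0, e1 in zip(energies, energies[1:])]


def hen_loss(pred_actions, expert_actions, states, dt, lam=0.1):
    """Behavior-cloning MSE plus lam times the mean squared dH/dt."""
    bc = sum((a - b) ** 2
             for a, b in zip(pred_actions, expert_actions)) / len(expert_actions)
    rates = energy_rate(states, dt)
    penalty = sum(r * r for r in rates) / len(rates)
    return bc + lam * penalty


dt = 0.01
# Exact mass-spring orbit: energy stays at 0.5, so the penalty is near zero.
conservative = [(math.cos(k * dt), -math.sin(k * dt)) for k in range(100)]
# Same orbit with amplitude growing 1% per step: energy is being injected.
pumped = [(1.01 ** k * q, 1.01 ** k * p)
          for k, (q, p) in enumerate(conservative)]
actions = [0.0] * 100  # identical predicted and expert actions, so BC term is 0

loss_conservative = hen_loss(actions, actions, conservative, dt)
loss_pumped = hen_loss(actions, actions, pumped, dt)
```

With the imitation term held at zero, only the trajectory that injects energy is penalized, which is the behavior the soft constraint is meant to produce.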
LLM-Augmented Edge Physical Cognition (three-stage "Generate–Purify–Deploy" knowledge transfer):
- Generate: A large generative model (Claude-3.7-Sonnet) is used to generate large-scale, diverse interaction scenarios in simulation.
- Purify: A foundation model with strong logical reasoning capabilities (GPT-4o) serves as a "physical verifier" to evaluate and filter the physical consistency of generated data.
- Deploy: The purified expert knowledge is injected into a lightweight edge multimodal model (Qwen2.5-VL-3B) via knowledge distillation.
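The three stages can be sketched as a small pipeline. Everything here is a hypothetical stand-in: `generate`, `verify`, and `distill` are placeholder callables for the generator model, the verifier model, and the distillation step, and the `min_score` threshold is an assumed filtering rule, not a value from the paper.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Scenario = Dict[str, float]


def generate_purify_deploy(
    generate: Callable[[int], Iterable[Scenario]],
    verify: Callable[[Scenario], float],
    distill: Callable[[List[Scenario]], dict],
    n_scenarios: int,
    min_score: float = 0.8,
) -> Tuple[dict, List[Scenario]]:
    """Generate candidates, keep those the verifier accepts, distill a student."""
    candidates = list(generate(n_scenarios))
    purified = [s for s in candidates if verify(s) >= min_score]
    student = distill(purified)
    return student, purified


def toy_generate(n: int) -> List[Scenario]:
    # Every third scenario violates energy conservation on purpose.
    return [{"energy_before": 1.0, "energy_after": 1.0 if i % 3 else 2.0}
            for i in range(n)]


def toy_verify(s: Scenario) -> float:
    # Score 1.0 for perfect energy conservation, falling to 0.0 with drift.
    drift = abs(s["energy_after"] - s["energy_before"]) / s["energy_before"]
    return 1.0 - min(1.0, drift)


def toy_distill(data: List[Scenario]) -> dict:
    # Stand-in for training a small edge model on the purified data.
    return {"n_train": len(data)}


student, kept = generate_purify_deploy(toy_generate, toy_verify, toy_distill, 6)
```

Keeping the verifier as a separate callable mirrors the design in which generation and verification are handled by different models.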
Uncertainty Decomposition and Collaborative Learning: Prediction uncertainty is decomposed into perceptual, model, and environmental components: \(U_{\text{total}} = U_{\text{perc}} + U_{\text{model}} + U_{\text{env}}\). Perceptual uncertainty is quantified via Monte Carlo Dropout, model uncertainty via Deep Ensembles, and environmental uncertainty via direct distributional prediction.
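A minimal sketch of the additive decomposition for a single scalar prediction follows. Using the variance of each sample set as the uncertainty measure is an assumption, as are the function and argument names; the summary states only which estimator is paired with which component.

```python
from statistics import pvariance


def decompose_uncertainty(mc_dropout_preds, ensemble_preds, predicted_variance):
    """Return the three components and their sum for one scalar prediction.

    mc_dropout_preds:   outputs of repeated stochastic forward passes
    ensemble_preds:     outputs of independently trained ensemble members
    predicted_variance: variance emitted directly by a distributional head
    """
    u_perc = pvariance(mc_dropout_preds)   # spread across dropout samples
    u_model = pvariance(ensemble_preds)    # disagreement between members
    u_env = float(predicted_variance)      # directly predicted variance
    return {
        "perc": u_perc,
        "model": u_model,
        "env": u_env,
        "total": u_perc + u_model + u_env,
    }


parts = decompose_uncertainty([1.0, 1.2, 0.8], [1.0, 1.0, 1.0], 0.05)
```

Because the components are simply added, any correlation between them is ignored, which is the approximation the Limitations section points out.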

Loss & Training

The training objective of PIPN combines prediction loss and physical consistency regularization:

\[\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda_{\text{phy}} \mathcal{L}_{\text{phy}}\]

where \(\lambda_{\text{phy}} = 0.1\). The prediction loss is the mean squared L2 error, averaged over the \(N\) indexed entities and \(T\) time steps:

\[\mathcal{L}_{\text{pred}} = \frac{1}{N \cdot T} \sum_{t=1}^{T} \sum_{i=1}^{N} \left( \|\hat{p}_i^t - p_i^t\|_2^2 + \|\hat{q}_i^t - q_i^t\|_2^2 \right)\]

The physical consistency loss is a weighted sum of an energy conservation term and a momentum conservation term: \(\mathcal{L}_{\text{phy}} = w_E \mathcal{L}_E + w_M \mathcal{L}_M\). The energy term penalizes changes in total energy (kinetic + potential) over time, and the momentum term penalizes violations of momentum conservation across collisions.
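The combined objective can be sketched in plain Python as below. The prediction term follows the formula above; the concrete forms of the energy and momentum terms (mean squared step-to-step energy change, squared change in total momentum across one collision) and the default weights `w_e = w_m = 1.0` are assumptions, since the summary describes those terms only in words.

```python
def sq_dist(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prediction_loss(p_hat, p, q_hat, q):
    """Mean over T steps and N entities of the two squared errors.

    All inputs are nested lists indexed as [t][i][dim].
    """
    T, N = len(p), len(p[0])
    total = sum(sq_dist(p_hat[t][i], p[t][i]) + sq_dist(q_hat[t][i], q[t][i])
                for t in range(T) for i in range(N))
    return total / (N * T)


def energy_loss(energies):
    """Mean squared change in total energy between consecutive steps."""
    diffs = [(e1 - e0) ** 2 for e0, e1 in zip(energies, energies[1:])]
    return sum(diffs) / len(diffs)


def momentum_loss(momenta_before, momenta_after):
    """Squared change in total momentum across a collision."""
    total_before = [sum(c) for c in zip(*momenta_before)]
    total_after = [sum(c) for c in zip(*momenta_after)]
    return sq_dist(total_before, total_after)


def pipn_loss(l_pred, l_energy, l_momentum, lam_phy=0.1, w_e=1.0, w_m=1.0):
    """L = L_pred + lam_phy * (w_E * L_E + w_M * L_M)."""
    return l_pred + lam_phy * (w_e * l_energy + w_m * l_momentum)


l_pred = prediction_loss([[[1.0, 0.0]]], [[[0.0, 0.0]]], [[[0.0]]], [[[0.0]]])
l_e = energy_loss([1.0, 1.0, 1.5])
l_m = momentum_loss([[1.0, 0.0], [-1.0, 0.0]], [[0.5, 0.0], [-0.3, 0.0]])
loss = pipn_loss(l_pred, l_e, l_m)
```

With \(\lambda_{\text{phy}} = 0.1\) the physics terms act as a regularizer: in this toy case they add 0.0165 to a prediction loss of 1.0.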

Key Experimental Results

Main Results

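Comparison with state-of-the-art methods on the MAP-THOR benchmark (2-agent setting). SR is success rate, TR is transport rate, C is coverage, and B is balance; arrows mark the better direction: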

Method	SR (%) ↑	TR (%) ↑	C (%) ↑	B ↑	Steps ↓
Act	33	67	91	0.59	24.8
ReAct	34	72	92	0.67	24.3
CoT	14	59	87	0.62	26.9
SmartLLM	11	23	91	0.45	28.5
CoELA	25	46	76	0.73	25.7
LLaMAR	66	91	97	0.82	21.9
PIPHEN	75	95	98	0.89	20.1

PIPHEN achieves the best performance on every reported metric. Its success rate is 9 percentage points above LLaMAR's (75% vs. 66%), a relative improvement of 13.6%.

Ablation Study

Method Variant	Task Completion (%) ↑	Control Precision (cm) ↓	Communication Load (MB/s) ↓	Data Efficiency (K samples) ↓
PIPHEN (Oracle)	98.2	1.1	1.7	-
PIPHEN (Full)	92.6	3.2	1.8	78
w/o uncertainty decomposition	89.1	4.1	1.8	82
w/o hybrid physical representation	78.4	6.5	1.9	165
w/o Hamiltonian energy network	82.1	5.7	2.2	94
w/o micro-brain ecosystem	85.8	3.9	5.4	87
w/o LLM augmentation	86.3	4.4	2.6	126
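The per-component impact can be read off the ablation table with a few lines of arithmetic. The sketch below copies the table values and computes, for each removed component, the drop in task completion (in percentage points) and the relative increase in control error; the dictionary layout and function name are illustrative choices.

```python
FULL = {"completion": 92.6, "error_cm": 3.2, "comm_mb_s": 1.8}

# Values copied from the ablation table (one entry per removed component).
ABLATIONS = {
    "uncertainty decomposition": {"completion": 89.1, "error_cm": 4.1, "comm_mb_s": 1.8},
    "hybrid physical representation": {"completion": 78.4, "error_cm": 6.5, "comm_mb_s": 1.9},
    "Hamiltonian energy network": {"completion": 82.1, "error_cm": 5.7, "comm_mb_s": 2.2},
    "micro-brain ecosystem": {"completion": 85.8, "error_cm": 3.9, "comm_mb_s": 5.4},
    "LLM augmentation": {"completion": 86.3, "error_cm": 4.4, "comm_mb_s": 2.6},
}


def ablation_deltas(full, variants):
    """Completion drop (points) and relative error increase (%) per variant."""
    rows = {}
    for name, v in variants.items():
        rows[name] = {
            "completion_drop_pts": full["completion"] - v["completion"],
            "error_increase_pct": 100.0 * (v["error_cm"] - full["error_cm"])
            / full["error_cm"],
        }
    return rows


deltas = ablation_deltas(FULL, ABLATIONS)
largest_drop = max(deltas, key=lambda k: deltas[k]["completion_drop_pts"])
largest_comm = max(ABLATIONS, key=lambda k: ABLATIONS[k]["comm_mb_s"])

for name, row in deltas.items():
    print(f"w/o {name}: -{row['completion_drop_pts']:.1f} pts, "
          f"+{row['error_increase_pct']:.0f}% control error")
```

Sorting by either column gives a quick ranking of which components matter most for completion, for control error, and for communication load.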

Impact of different model configurations during knowledge transfer:

Model Config (Generate/Purify/Deploy)	SR(%)	TR(%)	C(%)	B
Default (Claude-3.7/GPT-4o/Qwen2.5-VL)	75	95	98	0.89
GPT-4o/GPT-4o/Qwen2.5-VL	73	93	97	0.88
Claude-3.7/Claude-3.7/Qwen2.5-VL	70	90	96	0.85
Claude-3.7/GPT-4o/Qwen2.5-0.5B	68	88	94	0.86

Key Findings

Hybrid physical representation is the most critical component: Removing it reduces task completion rate from 92.6% to 78.4% (a drop of 14.2 percentage points) and roughly doubles the control error, from 3.2 cm to 6.5 cm.
HEN contributes significantly to control precision: Removing it increases the control error from 3.2 cm to 5.7 cm.
Substantial communication efficiency gains: Collaborative decision latency falls from 315 ms (centralized methods) to 76 ms, a reduction of about 76%.
Performance approaches the Oracle: The full PIPHEN (92.6% task completion) comes within 5.6 percentage points of the ideal Oracle version (98.2%), validating the effectiveness of PIPN.
Multi-agent scalability advantage: In crowded environments with 4–5 agents, PIPHEN exhibits significantly smaller performance degradation than LLaMAR.
Real-world deployment validation: Successfully demonstrated tableware placement tasks on two XLeRobot single-arm mobile manipulators, reducing communication load by over 95%.

Highlights & Insights

Core idea of the paradigm shift: Moving from "transmit raw data" to "transmit semantic knowledge" means extracting knowledge rather than compressing data, a fundamental shift in the information-processing paradigm.
Embedding physical constraints in control: Applying Hamiltonian energy conservation to multi-robot collaborative controller design gives the control policy a principled physical prior for consistency and stability. Because the constraint enters training as a soft penalty, it acts as an inductive bias rather than a hard guarantee.
Three-stage knowledge transfer pipeline: Cleverly leverages the strengths of different large models—large generative models for data generation, strong reasoning models for validation, and small models for edge deployment—forming a complete knowledge pipeline.
Spatial reasoning advantage: PIPHEN constructs precise 3D spatial representations, predicts spatial conflicts, and proactively adjusts task allocation (e.g., B=0.85 vs. LLaMAR's B=0.33 in illustrated scenarios), whereas LLaMAR's text-based spatial reasoning is severely limited in cluttered environments.

Limitations & Future Work

Scalability of graph representation: The current framework performs well in typical scenarios (<50 objects), but graph computation may become a bottleneck when the number of interacting agents and objects increases dramatically.
Simplifying assumptions in uncertainty modeling: Linear summation of the three uncertainty components serves as an approximate estimate, which may be insufficiently precise in highly coupled scenarios.
Dependence on specific large models in knowledge transfer: The "purify" stage uses an LLM as a physical verifier; integrating dedicated physics simulators could further improve physical consistency.
Limited scale of real-world experiments: Only a two-robot tableware placement task is demonstrated; validation in more complex real-world scenarios remains absent.
HEN's imitation learning depends on expert data quality: The pipeline using PDDL planners with TrajOpt to generate expert trajectories may be constrained in highly dynamic scenarios.
Future directions: Sparse graph representations, hierarchical communication protocols, and active graph pruning are worth exploring.

This paper integrates ideas from multiple research frontiers:

Intuitive physics understanding: From structured models (embedding physical representations) and pixel-level generative models (directly predicting future perceptual inputs) to representation learning approaches; this paper follows the structured representation + graph neural network route.
Physics-based robot control: Hamiltonian Neural Networks (HNN), port-Hamiltonian neural ODE networks, and Physics-Informed Neural Networks (PINN); this paper innovatively applies HNN to multi-robot collaborative control.
LLM multi-agent planning: Systems such as RoCo, CoELA, SmartLLM, and LLaMAR focus on task-level planning; this paper addresses the low-level physical execution problem they overlook.

Insight: The "semantic distillation" approach of this framework is generalizable to other bandwidth-constrained distributed AI systems, such as UAV formations and underwater robot swarms.

Rating

Novelty: ⭐⭐⭐⭐ — The combination of semantic communication and Hamiltonian energy conservation constitutes a novel framework design.
Experimental Thoroughness: ⭐⭐⭐⭐ — Simulation benchmarks, ablation studies, and real-world deployment provide a reasonably comprehensive evaluation.
Theoretical Depth: ⭐⭐⭐⭐ — The introduction of Hamiltonian mechanics is theoretically grounded, though some details are deferred to the appendix.
Value: ⭐⭐⭐⭐ — A reduction of over 95% in communication load has direct significance for practical deployment.
Overall Rating: ⭐⭐⭐⭐