Causal-PIK: Causality-based Physical Reasoning with a Physics-Informed Kernel¶
Conference: ICML 2025
arXiv: 2505.22861
Code: None
Area: Scientific Computing
Keywords: Physical reasoning, causal inference, Bayesian optimization, physics-informed kernel, active exploration
TL;DR¶
Proposes Causal-PIK, which encodes physical causal similarity into a Physics-Informed Kernel for Bayesian optimization. This enables agents to find optimal actions with very few attempts in physical reasoning tasks, outperforming SOTA on the Virtual Tools and PHYRE benchmarks.
Background & Motivation¶
Physical reasoning tasks require agents to achieve goals (e.g., making a red ball fall into a green area) by placing action objects in environments with unknown dynamics. The core challenges of these tasks lie in:
- Unknown environmental dynamics: Precise solutions cannot be planned directly, necessitating active exploration to gather information.
- Causal reasoning requirements: It is necessary to understand the causal chain of "action \(\rightarrow\) object motion \(\rightarrow\) outcome".
- Sample efficiency: The cost of physical interaction is high, requiring as few attempts as possible.
Limitations of Prior Work:
- Forward prediction models (world models) select actions directly using dynamics models but do not exploit historical failure experiences.
- SSUP uses Gaussian Mixture Models for guidance but does not encode physical effect-based correlations between actions.
- RL methods (e.g., DQN) lack physical intuition, resulting in inefficient exploration.
Cognitive science research shows that humans solve such problems by constructing internal models of the physical world: estimating the causal effects of actions and rapidly learning from failures. Inspired by this, this work integrates physical intuition into a Bayesian optimization framework.
Method¶
Overall Architecture¶
Causal-PIK adopts a Bayesian Optimization (BO) framework, where the core innovation is replacing the standard RBF kernel with a Physics-Informed Kernel. This enables the Gaussian Process (GP) to model correlations between actions based on their physical causal effects.
Problem Formulation: Consider a single-intervention physical reasoning task. The agent executes an action \(\bm{x}\) under an initial state \(\bm{s}_0\), observes the environment's evolution for \(T\) steps, and obtains a score \(y = f(\bm{x}) = \mathbb{S}(\mathbb{D}(\bm{s}_0, \bm{x}))\), where \(\mathbb{D}\) denotes the (unknown) environment dynamics and \(\mathbb{S}\) the scoring function applied to the resulting trajectory. The goal is to maximize \(f(\bm{x})\) with the minimum number of attempts.
Algorithmic Procedure (Algorithm 1):
1. GP Initialization: Build a GP prior using \(n_{initial}=9\) initial data points. The initial points are sampled from a Gaussian distribution centered on the object, serving as warmup samples that do not count toward the attempt limit.
2. Physics-Informed GP Update: Update the GP posterior using all historical actions and their observations via the Physics-Informed Kernel.
3. Causal-Guided Action Selection: Select the top-5 candidates from 500 action candidates generated by a Sobol sequence using the UCB acquisition function. Then, simulate these 5 actions using a probabilistic physics engine and execute the one with the best expected outcome.
4. Action Execution and Feedback: Execute the action and observe the outcome. Terminate if successful; otherwise, add the data to the training set and return to step 2.
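The update / select / execute / feedback loop above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `rbf` is a placeholder kernel (Causal-PIK would substitute its causal similarity), the objective is a toy 1D function, the probabilistic-simulation re-ranking step is omitted, and the names `solve`, `gp_posterior`, `bo_loop`, `beta`, and `success` are all hypothetical.

```python
import math

def rbf(a, b, length=0.2):
    """Placeholder kernel (assumption); a physics-informed kernel would go here."""
    return math.exp(-((a - b) ** 2) / (2 * length ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def gp_posterior(kernel, X, y, x_star, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_star."""
    K = [[kernel(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    k_star = [kernel(a, x_star) for a in X]
    alpha = solve(K, y)          # K^{-1} y
    v = solve(K, k_star)         # K^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(x_star, x_star) - sum(k * w for k, w in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))

def bo_loop(f, kernel, X, y, candidates, max_attempts=10, beta=2.0, success=0.95):
    """Return (action, attempts_used, solved). X, y are the warmup samples."""
    X, y = list(X), list(y)
    used = 0
    for attempt in range(1, max_attempts + 1):
        remaining = [c for c in candidates if c not in X]
        if not remaining:
            break

        def ucb(c):
            m, s = gp_posterior(kernel, X, y, c)
            return m + beta * s

        x_next = max(remaining, key=ucb)   # select
        score = f(x_next)                  # execute and observe
        used = attempt
        if score >= success:               # terminate on success
            return x_next, used, True
        X.append(x_next)                   # otherwise feed the failure back
        y.append(score)
    best = max(range(len(X)), key=lambda i: y[i])
    return X[best], used, False
```

Because the kernel is passed in as a function, swapping the geometric `rbf` for a similarity based on predicted physical effects changes which untried actions a failed attempt informs, without touching the loop itself.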
Key Designs¶
1. Physics-Informed Kernel¶
The kernel function encodes two types of physical intuition:
(a) Causal Effect Prediction: Train a dynamics model \(\hat{\mathbb{D}}\) to predict the causal effects of actions on the environment. The model is based on the Region Proposal Interaction Networks (RPIN) architecture, taking the initial state image and action as input, and outputting object bounding boxes for the future \(n_{pred}=20\) steps.
(b) Causal Similarity Computation: Define causal similarity between two actions to capture the degree to which they produce similar effects on the environment.
First, define the causal effect vector. Identify the timestep \(t_{event}\) of the first interaction between the action object and the dynamic object, and calculate the state change of object \(O\):
Then, for the effects of two actions \(a\) and \(b\) on object \(O\), calculate respectively:
- Directional Similarity (cosine similarity): \(\text{sim}_{cos}(\dot{\bm{s}}^{O,a}, \dot{\bm{s}}^{O,b}) = \frac{\dot{\bm{s}}^{O,a} \cdot \dot{\bm{s}}^{O,b}}{\|\dot{\bm{s}}^{O,a}\| \|\dot{\bm{s}}^{O,b}\|} \in [-1, 1]\)
- Magnitude Similarity: \(\text{sim}_{mag}(\dot{\bm{s}}^{O,a}, \dot{\bm{s}}^{O,b}) = \frac{1}{1 + |\|\dot{\bm{s}}^{O,a}\| - \|\dot{\bm{s}}^{O,b}\||} \in [0, 1]\)
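Single-object similarity (the product of the directional and magnitude terms above): \(\text{sim}^{O}(a, b) = \text{sim}_{cos}(\dot{\bm{s}}^{O,a}, \dot{\bm{s}}^{O,b}) \cdot \text{sim}_{mag}(\dot{\bm{s}}^{O,a}, \dot{\bm{s}}^{O,b})\)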
Final causal similarity metric (averaged over all \(D\) dynamic objects and exponentially weighted):
Proof of Kernel Validity: \(\text{sim}_{csl}(a,b)\) satisfies the requirements of symmetry and positive semi-definiteness, making it a valid kernel function.
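The directional and magnitude similarity terms defined above are simple enough to sketch directly. The snippet below is an illustrative sketch only: effect vectors are plain tuples, the function names are hypothetical, the zero-vector handling is an assumption, and the final exponentially weighted average over objects is not reproduced because its exact form is not given here.

```python
import math

def norm(v):
    """Euclidean norm of an effect vector."""
    return math.sqrt(sum(c * c for c in v))

def cosine_similarity(u, v):
    """Directional similarity in [-1, 1]."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0  # assumption: a zero effect has no direction
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def magnitude_similarity(u, v):
    """Magnitude similarity in (0, 1]: 1 / (1 + | ||u|| - ||v|| |)."""
    return 1.0 / (1.0 + abs(norm(u) - norm(v)))

def object_similarity(u, v):
    """Per-object similarity as the product of the two terms."""
    return cosine_similarity(u, v) * magnitude_similarity(u, v)
```

For example, effects `(3, 4)` and `(6, 8)` point the same way (cosine 1) but differ in norm by 5, so the magnitude term, and hence the product, is 1/6; two actions that push an object in orthogonal directions get a product of 0 regardless of magnitude.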
2. Causal-Guided Action Selection¶
The Upper Confidence Bound (UCB) acquisition function is employed to balance exploration and exploitation:
- Generate \(n_{candidate}=500\) action candidates using a Sobol sequence.
- Evaluate the UCB value for each candidate action.
- Select the top-\(n_{best}=5\) actions and simulate their outcomes with a probabilistic physics engine.
- Select and execute the action with the best expected outcome.
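The two-stage selection described above (UCB shortlist, then simulation-based re-ranking) might look like the following sketch. The callables `posterior` and `simulate`, the exploration weight `beta`, and the number of simulation rollouts `n_sim` are assumptions introduced for illustration; candidates are passed in rather than generated, though in practice they could be drawn from a quasi-random generator such as `scipy.stats.qmc.Sobol`.

```python
def select_action(candidates, posterior, simulate, n_best=5, beta=2.0, n_sim=4):
    """Pick an action in two stages.

    posterior(c) -> (mean, std) from the GP surrogate.
    simulate(c)  -> one (possibly noisy) simulated score for action c.
    Returns (chosen_action, shortlist).
    """
    def ucb(c):
        mean, std = posterior(c)
        return mean + beta * std

    # Stage 1: keep the n_best candidates with the highest UCB value.
    shortlist = sorted(candidates, key=ucb, reverse=True)[:n_best]

    # Stage 2: average a few noisy simulations per shortlisted action.
    def expected_outcome(c):
        return sum(simulate(c) for _ in range(n_sim)) / n_sim

    return max(shortlist, key=expected_outcome), shortlist
```

The split keeps the expensive simulator calls to `n_best * n_sim` per attempt while the cheap surrogate screens the full candidate set.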
3. Counterfactual Baseline¶
Counterfactual reasoning is introduced to isolate causal effects from confounding environmental dynamics: the environmental evolution when no action object is placed serves as a baseline, which is used to normalize both the distance metric in the objective function and the effect vectors in the causal similarity calculation.
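One way to picture the counterfactual baseline is to compare an object's state change with the action against its state change in the no-action rollout. The sketch below assumes the normalization is a simple subtraction of the baseline displacement, which is an illustrative choice and may differ from the paper's exact formulation; trajectories are lists of `(x, y)` positions and the function names are hypothetical.

```python
def displacement(trajectory, t_start, t_end):
    """State change of one object between two timesteps."""
    start, end = trajectory[t_start], trajectory[t_end]
    return tuple(e - s for s, e in zip(start, end))

def counterfactual_effect(traj_with_action, traj_no_action, t_start, t_end):
    """Effect attributable to the action (assumed: action change minus baseline change)."""
    with_action = displacement(traj_with_action, t_start, t_end)
    baseline = displacement(traj_no_action, t_start, t_end)
    return tuple(a - b for a, b in zip(with_action, baseline))
```

For a ball that falls 6 units in both rollouts but is also pushed 3 units sideways when the action object is present, the resulting effect is `(3, 0)`: the gravity-driven motion cancels and only the push remains.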
Loss & Training¶
Objective Function Design:
where \(d_c = \min_{t=1,...,T} \text{dist}(\bm{s}_t, \bm{s}_g)\) is the minimum distance to the target state across all timesteps in the episode. Key Insight: Measuring the closest distance at any timestep (rather than only the final timestep) corresponds to the physical intuition of "near-misses".
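The near-miss distance \(d_c\) is straightforward to compute from a rollout. The sketch below covers only \(d_c\) itself (how it is mapped to the final score is not reproduced here) and assumes states and the goal are points in the plane with Euclidean `dist`; the function names are hypothetical.

```python
import math

def min_goal_distance(trajectory, goal):
    """d_c: closest approach to the goal over all timesteps s_1..s_T."""
    return min(math.dist(state, goal) for state in trajectory)

def final_goal_distance(trajectory, goal):
    """Distance at the last timestep only, for comparison."""
    return math.dist(trajectory[-1], goal)
```

With the rollout `[(0, 0), (3, 4), (10, 0)]` and goal `(3, 3)`, the object passes within distance 1 of the goal at the second step before ending far away, so the minimum-distance measure credits the near-miss that a final-step measure would ignore.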
Dynamics Model Training:
- Virtual Tools: 10 variants are generated for each of the 20 block puzzles, with 300 actions per variant (\(\geq50\%\) leading to collisions, and \(10\%\) as no-action baselines).
- PHYRE: 10-fold cross-validation is used, training a model for each fold separately, with 500 actions per puzzle (350 failures + 150 successes + 50 no-action).
Key Experimental Results¶
Main Results¶
Virtual Tools Benchmark (20 puzzles, 100 tests per puzzle, up to 10 attempts):
| Model | AUCCESS ↑ | Description |
|---|---|---|
| RAND | 16.0±20.0 | Random baseline |
| DQN | 25.0±24.0 | Reinforcement learning |
| SSUP | 58.0±27.0 | Prev. SOTA |
| Ours (RBF) | 42.0±33.0 | Ablation: Standard kernel |
| Ours (Causal-PIK) | 65.0±25.0 | +7 vs SSUP |
| Humans | 53.25±23 | Human baseline |
PHYRE-1B Cross Benchmark (25 tasks, 10 folds, up to 100 attempts):
| Model | AUCCESS ↑ | Action Space |
|---|---|---|
| RAND | 13.0±5.0 | Full (~2.55 million) |
| Harter et al. 2020 | 30.2±48.9 | Full |
| Ours (RBF) | 27.7±9.68 | Full |
| Ours (Causal-PIK) | 41.6±9.33 | Full |
| DQN | 36.8±9.7 | Reduced (10k) |
| Ahmed et al. 2021 | 41.9±8.8 | Reduced (10k) |
| RPIN (Qi et al.) | 42.2±7.1 | Reduced (10k) |
| Dec [Joint] | 40.3±8 | Reduced (1k) |
| Humans @10 | 36.6±10.2 | Continuous |
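For readers unfamiliar with the metric in the tables above, AUCCESS as commonly defined for the PHYRE benchmark is a weighted average of the success rate \(s_k\) within \(k\) attempts, with weights \(w_k = \log(k+1) - \log(k)\) that emphasize early successes. The sketch below follows that definition; the function name and the input encoding (attempts needed per task, or `None` if unsolved) are assumptions made for illustration.

```python
import math

def auccess(attempts_to_solve, max_attempts):
    """AUCCESS in percent.

    attempts_to_solve: for each task, the attempt index (1-based) of the
    first success, or None if the task was never solved.
    """
    n_tasks = len(attempts_to_solve)
    weighted, total_weight = 0.0, 0.0
    for k in range(1, max_attempts + 1):
        w_k = math.log(k + 1) - math.log(k)
        solved = sum(1 for a in attempts_to_solve if a is not None and a <= k)
        weighted += w_k * solved / n_tasks   # w_k * s_k
        total_weight += w_k
    return 100.0 * weighted / total_weight
```

Solving every task on the first attempt gives 100, never solving gives 0, and a single task solved on its second of three attempts gives 50, since the weights for attempts 2 and 3 sum to \(\log 2\) out of a total of \(\log 4\).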
Ablation Study¶
| Configuration | Virtual Tools AUCCESS | PHYRE AUCCESS | Description |
|---|---|---|---|
| Causal-PIK (Full) | 65.0 | 41.6 | Full method |
| RBF Kernel Replacement | 42.0 | 27.7 | Standard kernel, significant performance drop |
| High-precision Dynamics Model | - | 45.0 | L2=3.56 (trained on test templates) |
| Standard Dynamics Model | - | 41.6 | L2=19.3±4.55 (completely unseen puzzles) |
Key Findings¶
- Physics-Informed Kernel is key: Replacing PIK with the RBF kernel drops AUCCESS on Virtual Tools by 23 points (65 \(\rightarrow\) 42) and on PHYRE by 14 points (41.6 \(\rightarrow\) 27.7), demonstrating that the causal effect-based kernel is substantially superior to the geometric distance-based RBF kernel.
- Achieving comparable performance to reduced-space methods on the full action space: Causal-PIK achieves 41.6 AUCCESS on the full space of ~2.55 million actions, matching RPIN (42.2) which operates on a reduced space of 10k actions, despite the vastly different difficulty scales.
- Robustness to dynamics prediction noise: Even when the prediction error of the dynamics model increases from 3.56 to 19.3 (approximately 5.4 times larger), the AUCCESS drop is marginal (from 45 to 41.6, a -3.4 point difference), indicating that the method does not rely heavily on exact predictions.
- Highly correlated with human reasoning patterns: The model shows a high correlation with humans on PHYRE (\(r=0.73\), the highest among all compared models), indicating that Causal-PIK shares similar difficulty patterns with human problem-solving.
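- Outperforming human performance: Causal-PIK achieves 65 AUCCESS vs 53.25 for humans on Virtual Tools, and 41.6 vs 36.6 for humans on PHYRE. The PHYRE comparison is not like-for-like: Causal-PIK is allowed up to 100 attempts, whereas the human figure is the "Humans @10" entry.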
Highlights & Insights¶
- Elegance of Kernel Design: High-level semantic "physical causal effects" are encoded into the kernel function rather than being used directly for action selection. This allows the BO framework to naturally balance exploration and exploitation. The PIK works effectively using a simple product of directional similarity \(\times\) magnitude similarity.
- Extracting multiple pieces of information from a single trial: Through the causal kernel function, a single failed trial not only updates the expectation of that specific action but also indirectly updates the expectations of all actions predicted to yield similar physical effects, a capability lacking in traditional BO.
- "Near-miss" Objective Function: Using the minimum distance across all timesteps in an episode (rather than only the final distance) elegantly captures the near-miss signals critical in physical reasoning.
- Counterfactual reasoning: Comparing against a no-action rollout helps distinguish the true causal effects of an action from the environment's own dynamics (such as objects falling under gravity), which makes the kernel's similarity measure more precise.
Limitations & Future Work¶
- No cross-task knowledge transfer: Currently, each task is solved independently without leveraging experiences from similar tasks. Future work could identify task families sharing structural dynamics to enable knowledge reuse.
- Dynamics prediction noise: Although somewhat robust to noise, inaccurate predictions can still introduce misleading similarities. Stronger dynamics models could further enhance performance.
- Action space dimension constraints: The method is currently validated only on 3-dimensional action spaces (actions described by three parameters). Scaling up to higher-dimensional spaces requires modifying the causal effect predictor, although the core kernel formulation remains the same.
- Reliance on a probabilistic physics engine: The simulation step in action selection requires an approximate physics engine, which restricts applicability to completely unknown environments.
Related Work & Insights¶
- RPIN (Qi et al., 2021): Used as the backbone for the dynamics model. Insight: Object-level interaction networks are highly effective at capturing local physical dynamics.
- SSUP (Allen et al., 2020): Former SOTA on Virtual Tools, using GMM to guide search without encoding causal associations. Insight: Physical simulation is inherently approximate; the key lies in how simulation feedback is utilized.
- Li et al., 2022; 2024: Analyze the failure of RL in physical reasoning, attributing it to a lack of understanding of physical attributes. Insight: Pure state-action mapping is insufficient for tasks demanding physical intuition.
- Gerstenberg & Tenenbaum, 2016: Propose the concept of "almost achieving the goal." Insight: Near-miss signals are important for learning.
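- Bayesian Optimization in Robotics: Insight: Because the method is built on a standard BO framework, it could plausibly extend to sim-to-real scenarios, though this is not validated here.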
Rating¶
| Dimension | Score (1-5) | Description |
|---|---|---|
| Novelty | 4 | The idea of encoding causal inference into a kernel function is novel and elegant |
| Theoretical Depth | 4 | The proof of kernel validity is rigorous, and the counterfactual reasoning is deep |
| Experimental Thoroughness | 4 | Dual-benchmark evaluation + human comparison + ablation studies + robustness analysis |
| Practical Value | 3 | Limited to 2D physical puzzle scenarios, transferability to real robots remains to be validated |
| Writing Quality | 4 | The method is clearly articulated, the illustrations are intuitive, and the connection to cognitive science is compelling |
| Overall Rating | 4.0 | Excellent methodological contribution, elegantly combining causal reasoning with BO |