Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning¶
Conference: CVPR2026
arXiv: 2603.18495
Code: To be confirmed
Area: Robotics
Keywords: video-instructed robotic programming, cross-domain adaptation, neurosymbolic reasoning, counterfactual reasoning, code-as-policies
TL;DR¶
This paper proposes NeSyCR, a neurosymbolic counterfactual reasoning framework that abstracts video demonstrations into a symbolic world model, detects cross-domain incompatibilities via counterfactual state simulation, and automatically corrects program steps. NeSyCR achieves a 31.14% improvement in success rate over the strongest baseline, Statler, on cross-domain demo-to-code tasks.
Background & Motivation¶
- Rise of the Code-as-Policies Paradigm: LLMs/VLMs possess code generation capabilities, enabling the synthesis of executable robot control code from language instructions or video demonstrations. However, cross-domain adaptation remains a critical challenge.
- Inevitable Domain Gaps in Video Demonstrations: Intrinsic differences in environment layout, object attributes, and robot configuration exist between the demonstration domain and the deployment domain; directly imitating demonstration behavior leads to program failures.
- Perceptual Observations Are Insufficient to Explain Procedural Discrepancies: While observations can reveal physical differences, they cannot explain how structural differences undermine the underlying task program or causal dependencies.
- VLMs Lack Procedural Understanding: Current VLMs struggle to reconstruct causal dependencies and achieve behavioral compatibility under domain shift, and tend to produce actions that are semantically plausible but logically inconsistent.
- Cascading Incompatibilities Are Difficult to Handle: Cross-domain differences affect not only individual steps but may also trigger cascading incompatibilities (e.g., a change in tool position blocking subsequent steps), requiring global program reorganization.
- Existing Methods Lack Verifiable Adaptation Mechanisms: VLM-based reasoning approaches lack symbolic tool verification, while world model approaches rely on constructing complete domain knowledge from a single demonstration and frequently produce invalid plans.
Method¶
Overall Architecture: NeSyCR¶
NeSyCR (Neurosymbolic Counterfactual Reasoning) formulates cross-domain adaptation as a counterfactual reasoning problem and operates in two phases:
- Phase 1 — Symbolic World Model Construction: Abstracts symbolic trajectories from video demonstrations and constructs a verifiable world model.
- Phase 2 — Neurosymbolic Counterfactual Adaptation: Compares the world model against target-domain observations, detects incompatible steps, and corrects the program.
Symbolic World Model Construction¶
- Symbolic State Translation: A VLM translates each video frame into a scene graph of object entities and spatial relations, yielding a symbolic state sequence \(\{s_1, \dots, s_N\}\).
- Symbolic Dynamics Reconstruction: For each pair of consecutive states \((s_t, s_{t+1})\), the VLM predicts the action operator \(a_t = \Psi(s_t, s_{t+1})\), defining preconditions and effects.
- Consistency Verification: A symbolic tool \(\Phi\) (based on VAL) performs forward simulation and verifies \(\forall t, \Phi(s_t, a_t) \models s_{t+1}\), ensuring the world model \(\mathcal{W} = (\mathcal{Q}, \mathcal{P}, \mathcal{A}, \Phi)\) is logically consistent.
- A STRIPS-style formalism is adopted to support forward execution and logical verification.
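The construction above amounts to standard STRIPS-style forward simulation. A minimal Python sketch of the consistency check (our own illustration; the names `Action`, `forward`, and `verify` are not from the paper):

```python
from dataclasses import dataclass

# Facts are ground predicates, e.g. ("on", "block_a", "table").

@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset     # preconditions: facts that must hold in s_t
    add: frozenset     # effects: facts added to the state
    delete: frozenset  # effects: facts removed from the state

def forward(state: frozenset, action: Action) -> frozenset:
    """Phi(s_t, a_t): apply an action whose preconditions hold."""
    if not action.pre <= state:
        raise ValueError(f"preconditions of {action.name} unsatisfied")
    return (state - action.delete) | action.add

def verify(states, actions) -> bool:
    """Consistency check: for all t, Phi(s_t, a_t) |= s_{t+1}."""
    return all(forward(s, a) == s_next
               for s, a, s_next in zip(states, actions, states[1:]))
```

`verify` rejects any VLM-predicted operator whose simulated effect disagrees with the next observed symbolic state, which is the role VAL plays in the paper.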
Neurosymbolic Counterfactual Adaptation¶
- Counterfactual Identification: The VLM generates a counterfactual initial state \(\hat{s}_1\) from target-domain observations; the symbolic tool performs forward simulation along the demonstration program to detect incompatible actions.
- Incompatibility Detection: An action is flagged as incompatible when its preconditions are not satisfied (\(\text{pre}(a_t) \nsubseteq \hat{s}_t\)) or its effects cannot be reproduced.
- Counterfactual Exploration: For each incompatible action, the VLM proposes alternative actions to restore the preconditions of subsequent steps, supporting addition, deletion, and reordering operations.
- Iterative Verification: The symbolic tool verifies the causal validity of each alternative action, ensuring that forward simulation of the adapted program \(\tilde{\pi}\) from \(\hat{s}_1\) terminates in a state entailing the goal state \(s_N\).
- The adapted program is ultimately compiled into an executable code policy \(\pi_\theta = \Psi(\tilde{\pi})\).
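The detect-and-repair loop above can be sketched compactly in Python (illustrative only; the `propose` callback stands in for the VLM's candidate generator, and all names here are our own, not the paper's):

```python
from collections import namedtuple

# A lightweight STRIPS-style action: precondition, add, and delete sets.
Act = namedtuple("Act", ["name", "pre", "add", "delete"])

def step(state, a):
    return (state - a.delete) | a.add

def first_incompatibility(s_hat, program):
    """Flag the first action with pre(a_t) not a subset of the state."""
    state = s_hat
    for i, a in enumerate(program):
        if not a.pre <= state:
            return i, state
        state = step(state, a)
    return None, state

def adapt(s_hat, program, goal, propose):
    """Insert proposed repair actions until the program simulates
    through to a state entailing the goal, or no candidate verifies."""
    prog = list(program)
    while True:
        i, state = first_incompatibility(s_hat, prog)
        if i is None:
            return prog if goal <= state else None
        for cand in propose(state, prog[i]):
            # Symbolic verification: the candidate must itself be
            # executable and must restore the blocked precondition.
            if cand.pre <= state and prog[i].pre <= step(state, cand):
                prog.insert(i, cand)
                break
        else:
            return None  # no verified repair found
```

Because each iteration re-simulates the whole program from \(\hat{s}_1\), a repair that breaks a later step is caught on the next pass, mirroring the paper's global forward simulation. For example, a demo program that places an object into an already-open drawer gets an `open_drawer` action inserted when the target domain starts with the drawer closed.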
Key Designs¶
- VLM–Symbolic Tool Collaboration: The VLM proposes alternative actions (leveraging commonsense knowledge) while the symbolic tool verifies logical consistency, forming a closed loop.
- Additive and Subtractive Modifications: Supports inserting new actions (e.g., introducing auxiliary tools) and removing redundant actions (e.g., steps whose goals are already satisfied).
- Cascading Incompatibility Resolution: Automatically discovers and resolves downstream incompatibilities triggered by a single modification through global forward simulation.
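Cascading incompatibilities fall out of a single global forward pass: a failed step's effects never enter the state, so every downstream step that depended on them fails too. A self-contained sketch (our own illustration, not the paper's implementation):

```python
from collections import namedtuple

Act = namedtuple("Act", ["name", "pre", "add", "delete"])

def cascading_failures(s_hat, program):
    """Return indices of every step whose preconditions fail when the
    demo program is simulated from the target-domain initial state."""
    state, failed = s_hat, []
    for i, a in enumerate(program):
        if a.pre <= state:
            state = (state - a.delete) | a.add
        else:
            failed.append(i)  # effects withheld -> downstream steps break
    return failed
```

With a tool relocated in the target domain, not only the grasp step but also every later step that needs the tool in hand is flagged, which is exactly the cascade the framework must resolve.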
Key Experimental Results¶
Experimental Setup¶
- Cross-Domain Factors: 5 categories — Obstruction, Object affordance, Kinematic configuration, Gripper type, and combinations.
- Benchmark Tasks: Long-horizon manipulation tasks (up to 116 API calls), covering pick-and-place, sweeping, rotation, sliding, and other subtasks.
- Three Complexity Levels: Low / Medium / High, with 440 scenarios in total.
- 6 Baselines: Demo2Code, GPT4V-Robotics, Critic-V, MoReVQA, Statler, LLM-DM.
Main Results (Table 1 — Simulation Environment)¶
| Method | SR (Low) | SR (Med) | SR (High) | Notes |
|---|---|---|---|---|
| Demo2Code | 26.67 | 25.00 | 22.50 | No adaptation mechanism |
| GPT4V-Robotics | 71.67 | 41.67 | 20.00 | VLM reasoning |
| Statler | 61.67 | 41.67 | 5.00 | World model without symbolic verification |
| NeSyCR | 86.67 | 75.00 | 60.00 | Ours |
- NeSyCR achieves an average SR improvement of 31.14% over Statler and 27.73% over GPT4V-Robotics.
- Under combined cross-domain factors, NeSyCR maintains 47.5–80.0% SR, while Statler drops to 32.5–67.5%.
Real-World Experiments (Table 2)¶
| Method | SR | GC | PD |
|---|---|---|---|
| Demo2Code | 0.00 | 25.00 | — |
| GPT4V-Robotics | 50.00 | 75.00 | 0.00 |
| Statler | 50.00 | 67.86 | 42.86 |
| NeSyCR | 87.50 | 98.21 | 24.49 |
- Experiments use a Franka Emika Research 3 robotic arm, adapting from human video demonstrations to real-robot deployment.
- A scenario with vertically stacked drawers requires alternating operations on the two drawers to avoid mutual interference.
Ablation Study (Table 3)¶
| Variant | SR | Drop |
|---|---|---|
| NeSyCR (full) | 68.42 | — |
| w/o alternative action verification (Eq. 8) | 50.00 | -18.42 |
| w/o counterfactual identification (Eq. 6) | 47.37 | -21.05 |
| w/o both | 39.47 | -28.95 |
| w/o symbolic world model (Eq. 4) | 34.21 | -34.21 |
Removing the symbolic world model causes the largest performance drop, confirming that verifiable symbolic reasoning is the core component.
Highlights & Insights¶
- Formulating cross-domain demo-to-code as counterfactual reasoning provides a clear formal framework (STRIPS + counterfactual state-space exploration).
- The VLM–symbolic tool co-design is elegant: the VLM leverages commonsense knowledge to propose candidates, while the symbolic tool guarantees logical correctness, making them highly complementary.
- Cascading incompatibility handling: the framework automatically detects how a single modification affects downstream steps and performs global correction.
- Thorough real-world validation: large-scale quantitative simulation experiments (440 scenarios) are complemented by end-to-end validation on a physical robot.
- Refined experimental design: a systematic matrix of 5 cross-domain factor types × 3 complexity levels facilitates fine-grained performance analysis across different dimensions.
Limitations & Future Work¶
- Significant performance degradation under large task complexity gaps: when the deployment task substantially exceeds the demonstration in complexity, counterfactual reasoning struggles to compensate for missing information.
- Dependence on VLM scene graph extraction quality: the accuracy of symbolic state translation is bounded by the perceptual capability of the VLM.
- Limited expressiveness of STRIPS-style representation: it is difficult to model continuous physical quantities, deformable objects, and other complex scenarios.
- Computational overhead: iterative verification with VLM and symbolic tools may introduce significant latency for long-horizon tasks.
- Single-demonstration constraint: the world model is constructed from a single demonstration video, resulting in insufficient diversity.
Related Work & Insights¶
- vs. Code-as-Policies (SayCan, ProgPrompt): This work focuses on cross-domain adaptation rather than single-domain code generation.
- vs. Demo2Code: Demo2Code directly imitates demonstrations without any adaptation mechanism; NeSyCR corrects the program through counterfactual reasoning.
- vs. Statler: Statler employs symbolic state representations but does not integrate symbolic verification tools; re-planning from scratch causes collapse under high complexity.
- vs. LLM-DM: LLM-DM constructs complete domain knowledge from a single demonstration and frequently generates invalid plans; NeSyCR preserves the demonstration structure and applies only local corrections.
- vs. Behavior Cloning / Inverse Reinforcement Learning: These methods generalize poorly under perceptual and physical changes; NeSyCR performs adaptation at the symbolic level.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The combination of counterfactual reasoning and neurosymbolic verification constitutes a new paradigm in the demo-to-code domain.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Systematic experiments across 440 scenarios, real-robot validation, and fine-grained controlled variable analysis.
- Writing Quality: ⭐⭐⭐⭐ — Formalization is clear, symbolic notation is rigorous, and case studies are intuitive.
- Value: ⭐⭐⭐⭐ — Provides a verifiable adaptation framework for cross-domain robotic programming, addressing an important research direction.