Grounding Generative Planners in Verifiable Logic: A Hybrid Architecture for Trustworthy Embodied AI¶
Conference: ICLR 2026 arXiv: 2602.08373 Code: https://github.com/Sn0wm1an/VIRF Area: Robotics Keywords: embodied AI, neuro-symbolic, safe planning, LLM agent, formal verification
TL;DR¶
This paper proposes VIRF (Verifiable Iterative Refinement Framework), a neuro-symbolic hybrid architecture that couples a deterministic Logic Tutor with an LLM planner, using a verifiable formal ontology as a safety anchor. VIRF achieves 0% Hazardous Action Rate (HAR) and 77.3% Goal Completion Rate (GCR) on SafeAgentBench, demonstrating that strict safety guarantees need not compromise agent utility.
Background & Motivation¶
- Safety dilemma of LLM planners: LLMs exhibit strong capabilities in embodied AI task planning, yet their stochastic nature and lack of formal reasoning prevent them from providing the deterministic safety guarantees required for physical deployment.
- Self-supervision paradox: Current safety paradigms rely on LLMs to supervise their own outputs (e.g., self-correction, multi-agent debate), creating a self-referential loop that offers no verifiable safety assurance—an unreliable system supervising an unreliable system.
- Semantic gap: Plans generated by LLMs in sub-symbolic space frequently suffer semantic distortion when mapped to the physical world, as models lack a verifiable understanding of real causal consequences.
- Limitations of the corrective paradigm: Existing interactive verifiers act solely as gatekeepers, informing the agent that a plan has been rejected (e.g., "violates rule 4") without explaining why it is unsafe, causing the agent to reach dead ends and abandon solvable tasks.
- Blind spots in safety evaluation: Existing safety benchmarks (e.g., SafeAgentBench) focus on simple physical hazards and neglect complex semantic risks such as chemical hazards and food safety, resulting in narrow evaluation coverage.
- Knowledge acquisition bottleneck: Constructing formal safety knowledge bases requires substantial manual effort, and extracting formal constraints (e.g., OWL 2 axioms) from unstructured documents remains highly challenging.
Method¶
Overall Architecture: Dual-Process Cognitive System¶
VIRF draws an analogy to Kahneman's dual-process theory: the LLM planner (Apprentice) serves as the intuitive, creative, yet error-prone "System 1," while the deterministic verification engine—grounded in a formal ontology and a Rich Semantic Scene Graph (RSSG)—serves as the rigorous, logical "System 2" supervisor. The core of the framework is a closed-loop cycle of plan–verify–diagnose–refine.
Pillar 1: Traceable Axiom Synthesis (TAS) Knowledge Engineering¶
To address the knowledge acquisition bottleneck, TAS adopts an AI-synthesizer / human-arbiter collaborative workflow:
- Retrieval stage: Text passages relevant to target safety concepts are retrieved from a curated document corpus.
- Synthesis stage: An LLM interprets the retrieved evidence and drafts candidate safety axioms in a formal language, with mandatory citation of source sentences to establish traceable axiom–evidence pairs.
- Arbitration stage: Human experts perform semantic and logical validation to confirm that each formal axiom accurately captures the meaning of its source text.
This pipeline synthesized 92 validated axioms within two days, substantially expanding hazard coverage. The ontology adopts a layered compositional design that separates abstract safety principles from domain-specific knowledge.
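The axiom–evidence pairs central to TAS can be sketched as a small data structure. This is a minimal illustration, not the paper's actual schema: the class name, field names, and the example axiom are all assumptions introduced here.

```python
# Illustrative sketch of a traceable axiom-evidence record in the TAS
# pipeline. All names and the sample axiom are hypothetical.
from dataclasses import dataclass, field

@dataclass
class CandidateAxiom:
    axiom: str                  # formal statement, e.g. in OWL 2 / DL syntax
    source_doc: str             # document the evidence was retrieved from
    evidence: list = field(default_factory=list)  # cited source sentences
    status: str = "pending"     # "pending" | "validated" | "rejected"

    def arbitrate(self, approved: bool):
        """Human-arbiter step: only axioms with cited evidence may pass."""
        if not self.evidence:
            raise ValueError("traceability requires at least one citation")
        self.status = "validated" if approved else "rejected"

ax = CandidateAxiom(
    axiom="Flammable ⊓ ∃nearTo.HeatSource ⊑ Hazard",
    source_doc="kitchen_safety_manual.txt",
    evidence=["Keep flammable materials away from open flames."],
)
ax.arbitrate(approved=True)
```

The mandatory `evidence` field enforces the traceability property the paper emphasizes: no candidate axiom can be validated without a link back to its source sentence.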
Pillar 2: VLM-Cascade Perception Pipeline¶
A three-stage architecture generates high-fidelity RSSGs:
1. VLM-Detect: A global VLM call performs open-vocabulary object discovery, prioritizing recall.
2. VLM-Attribute-Refine: Cropped images of individual objects undergo deep semantic analysis to extract safety-critical attributes (category, state, material, etc.).
3. VLM-Relation-Refine: The spatial arrangement of all objects is analyzed to construct a complete set of relational assertions.
Design philosophy: safety first—minimizing false negatives (missed hazardous objects) is far more important than reducing false positives.
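The three-stage cascade can be sketched as a simple function pipeline. The VLM calls below are stubbed with plain functions returning canned outputs, and every name (`vlm_detect`, `build_rssg`, etc.) is an illustrative placeholder rather than the paper's actual API.

```python
# Recall-first three-stage cascade sketch; VLM calls are stubbed.
def vlm_detect(image):
    """Stage 1: global open-vocabulary discovery, tuned for high recall."""
    return [{"id": i, "label": l} for i, l in enumerate(["knife", "stove"])]

def vlm_attribute_refine(image, obj):
    """Stage 2: per-object crop analysis for safety-critical attributes."""
    attrs = {"knife": {"material": "metal", "state": "exposed"},
             "stove": {"material": "steel", "state": "on"}}
    return {**obj, **attrs[obj["label"]]}

def vlm_relation_refine(image, objects):
    """Stage 3: global pass producing relational assertions."""
    return [("stove", "near", "knife")]

def build_rssg(image):
    """Chain the three stages into a Rich Semantic Scene Graph."""
    objects = [vlm_attribute_refine(image, o) for o in vlm_detect(image)]
    relations = vlm_relation_refine(image, objects)
    return {"objects": objects, "relations": relations}
```

The structure reflects the stated design philosophy: stage 1 over-generates candidates (recall over precision), and the later stages refine rather than filter, so a missed hazardous object in stage 1 cannot be recovered downstream.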
Pillar 3: Verifiable Tutor–Apprentice Refinement Loop¶
The core algorithmic procedure:
1. The LLM planner proposes an initial plan \(\pi\).
2. The verifier simulates the plan against the knowledge core and checks for safety violations.
3. SAFE → approve for execution; UNSAFE → generate a structured causal diagnostic report (derived from the reasoner's proof trace) explaining the causal chain of failure (action → axiom → violation); UNKNOWN → query the user to update the knowledge base.
4. The diagnostic report serves as a pedagogical scaffold guiding the LLM toward intelligent repair, rather than mere avoidance.
Key innovation: a shift from the corrective paradigm to the pedagogical paradigm—explaining why something is unsafe rather than merely reporting what was rejected.
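The plan–verify–diagnose–refine cycle can be sketched as follows. This is a minimal mock-up, assuming a trivial dictionary-backed knowledge core in place of the actual ontology and reasoner; all names (`refine_loop`, `Verdict`, etc.) are hypothetical.

```python
# Minimal sketch of the VIRF tutor-apprentice refinement loop.
# The knowledge core and verifier are toy stand-ins for the real
# ontology + reasoner; all identifiers are illustrative.
from dataclasses import dataclass

@dataclass
class Verdict:
    status: str          # "SAFE", "UNSAFE", or "UNKNOWN"
    diagnosis: str = ""  # causal chain: action -> axiom -> violation

def verify(plan, knowledge_core):
    """Deterministic check of the plan against the knowledge core (stub)."""
    for step in plan:
        if step in knowledge_core["forbidden"]:
            return Verdict("UNSAFE",
                           f"action '{step}' violates axiom "
                           f"{knowledge_core['forbidden'][step]}")
        if step not in knowledge_core["known"]:
            return Verdict("UNKNOWN", f"no axiom covers action '{step}'")
    return Verdict("SAFE")

def refine_loop(task, planner, knowledge_core, max_iters=5):
    feedback = None
    for _ in range(max_iters):
        plan = planner(task, feedback)          # System 1: LLM proposal
        verdict = verify(plan, knowledge_core)  # System 2: logic tutor
        if verdict.status == "SAFE":
            return plan                          # approve for execution
        if verdict.status == "UNKNOWN":
            raise LookupError(verdict.diagnosis) # defer to the user
        feedback = verdict.diagnosis  # pedagogical report guides repair
    raise RuntimeError("no safe plan found within iteration budget")
```

The key design point is the last line of the loop body: the diagnostic report is fed back into the next planning call, so the planner learns *why* the plan failed rather than merely that it was rejected.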
Key Experimental Results¶
Main Results: SafeAgentBench¶
| Method | HAR(%)↓ | GCR(%)↑ | FPR(%)↓ | FNR(%)↓ | Iterations↓ |
|---|---|---|---|---|---|
| Impulsive (direct) | 11.9 | 56.8 | 32.7 | 16.5 | N/A |
| Thinker (CoT) | 9.8 | 59.1 | 35.4 | 14.4 | N/A |
| Committee (SAFER-like) | 7.6 | 57.3 | 28.6 | 18.9 | 1.98 |
| Impulsive + Rules | 0.9 | 70.5 | 13.3 | 21.4 | N/A |
| Thinker + Rules | 1.2 | 67.0 | 12.4 | 22.7 | N/A |
| Thinker + Diagnostic | 0.0 | 76.8 | 10.1 | 14.4 | N/A |
| VIRF-Reject (ablation) | 0.0 | 63.4 | 15.9 | 33.0 | 1.3 |
| VIRF (full) | 0.0 | 77.3 | 12.1 | 20.2 | 1.1 |
VIRF is the only method to simultaneously achieve 0% HAR and the highest GCR. The VIRF-Reject ablation demonstrates that without causal explanations, the planner's FNR reaches 33.0% (abandoning solvable tasks); pedagogical dialogue in full VIRF cuts this to 20.2%, a relative reduction of roughly 40%.
Ablation Study: Perception Architecture¶
| Perception Architecture | Instances | Categories | Accuracy (%) | Time (s) |
|---|---|---|---|---|
| Hybrid Detector (DINO-X + VLM) | 108.8±25.5 | 35.6±5.4 | 35.8±8.5 | 85.2±23.9 |
| VLM-Cascade (Ours) | 174.4±34.9 | 55.2±8.7 | 76.3±10.9 | 168.4±23.9 |
VLM-Cascade substantially outperforms the hybrid detector in accuracy (76.3% vs. 35.8%) and produces richer scene graphs (174.4 vs. 108.8 instances), at the cost of higher latency.
Key Findings¶
The Impulsive+Rules baseline (70.5% GCR) unexpectedly outperforms Thinker+Rules (67.0% GCR), revealing a cognitive overload phenomenon: CoT reasoning tends to drift under constraint saturation, whereas the direct approach treats rules as strict instructions. However, neither approach achieves 0% HAR, confirming that passive knowledge injection cannot substitute for active verification.
Highlights & Insights¶
- Pedagogical paradigm innovation: This work is the first to transform the verifier from a passive corrective role into an active teaching role, providing causal and explanatory feedback derived from proof traces.
- Perfect safety record: VIRF is the only method among all baselines to achieve 0% HAR while simultaneously attaining the highest task completion rate.
- Efficient iteration: On average, only 1.1 refinement iterations are required to converge to a safe plan.
- Revealing evaluation blind spots: The RAG-based knowledge base systematically identifies critical hazard categories absent from existing benchmarks, including chemical hazards (12%) and food safety (16%).
- Graceful degradation under perceptual noise: When faced with conflicting information or attribute uncertainty, VIRF identifies the logical inconsistency 100% of the time and defaults to a safe "inquiry" state rather than acting on unreliable perception.
Limitations & Future Work¶
- Static knowledge core: The TBox cannot be adaptively updated at runtime; the absence of a learn–verify–write loop limits continual learning capability.
- Perception noise bottleneck: Symbolic grounding (VLM-to-ontology mapping) remains fragile; while the system can detect contradictions and degrade safely, a fundamental solution has yet to be explored.
- Simulation-to-real gap: The ontology does not yet model continuous physical dynamics (e.g., friction forces), and deployment on physical robots requires further extension.
- Computational overhead: VLM-Cascade incurs a latency of 168 s (approximately 2× that of the Hybrid Detector), and the Pellet reasoner is highly dependent on CPU performance.
Related Work & Insights¶
vs. VeriPlan (Lee et al., 2025)¶
VeriPlan, as an iterative verifier, can identify which predefined rule has been violated ("what failed") but provides only corrective feedback. VIRF additionally supplies a causal proof trace ("why it failed"), forming a complete reasoning chain from object attributes to rule violations. This paradigm shift from correction to teaching enables VIRF to achieve a significantly lower FNR than a simple rejection strategy (20.2% vs. 33.0%).
vs. SAFER (Khan et al., 2025, Committee method)¶
SAFER employs a multi-LLM committee to generate natural-language safety critiques, offering explanations of why a plan may be unsafe. However, these explanations are stochastically generated and lack formal guarantees. VIRF's feedback is derived from deterministic logical proof traces, ensuring both formal verifiability and causal explicitness. In experiments, the Committee method yields a HAR of 7.6% (unable to eliminate hazardous actions), whereas VIRF achieves 0%.
vs. End-to-End VLA Models (e.g., RT-2)¶
End-to-end vision-language-action models directly map pixels to action commands and are highly capable, but function as black boxes that resist formal verification. As an instance of a hierarchical architecture, VIRF explicitly separates high-level planning (LLM) from symbolic verification (Logic Tutor), injecting and verifying safety constraints at an interpretable interface—providing safety guarantees currently unattainable by end-to-end models.
Rating¶
- ⭐⭐⭐⭐ Novelty: The pedagogical paradigm (from correction to teaching) and the tutor–apprentice dialogue design are original; the dual-process cognitive analogy is natural and effective.
- ⭐⭐⭐⭐ Experimental Thoroughness: Multi-dimensional ablations (VIRF-Reject / RAG / Manual), discovery of the cognitive overload phenomenon, robustness testing, and scalability analysis constitute a comprehensive experimental design.
- ⭐⭐⭐⭐ Value: The combination of 0% HAR and high GCR has direct deployment value for safety-critical applications; the TAS knowledge engineering pipeline is reproducible.
- ⭐⭐⭐ Writing Quality: The paper is clearly structured with rich figures and tables, though the coordination of three major pillars and multiple subsystems introduces considerable overall complexity.