Agentified Assessment of Logical Reasoning Agents¶
Conference: ICLR 2026 arXiv: 2603.02788 Code: HuggingFace Dataset Area: LLM Reasoning Keywords: Logical Reasoning Evaluation, Agent-to-Agent Assessment, First-Order Logic, Automatic Formalization, SMT Solving
TL;DR¶
This paper proposes an agent-based evaluation framework (AAA) that encapsulates assessment logic as an assessor agent and interacts with the agent under test via a standard A2A interface. On a FOLIO dataset systematically cleaned using the Vampire theorem prover, an auto-formalization agent (NL→Z3Py + SMT solving) achieves 86.70% accuracy, substantially outperforming the CoT baseline at 73.89%, with a particularly notable gain of 32.79 percentage points on contradiction detection (False class).
Background & Motivation¶
- Evaluation Pain Point 1 — Conflation of Failure Modes: When evaluating reasoning agents, execution failures (timeouts / parsing errors / runtime exceptions) and reasoning errors are often conflated into a single accuracy figure, making it difficult to distinguish "the model cannot reason" from "the model's tools malfunctioned."
- Evaluation Pain Point 2 — Linear Integration Cost: Traditional evaluation harnesses tightly couple benchmark logic with agent implementations; adding each new benchmark requires re-integration, resulting in O(n) cost.
- Data Quality Issues: The FOLIO dataset contains potential label errors (3.8% in the training set, 1.5% in the validation set) as well as NL-FOL translation quality issues, making it inherently unreliable to assess reasoning capabilities on flawed data.
- Lack of Standardized Interfaces: Different agents have different input/output formats, execution environments, and error-handling mechanisms, with no unified plug-and-play interface available.
- Value of Formal Verification: First-order logic reasoning is an important capability for LLMs, yet CoT methods cannot guarantee logical validity; formal verification via SMT solving provides deterministic guarantees.
Method¶
Overall Architecture¶
Agentified Agent Assessment (AAA) = Assessor Agent (evaluation protocol) + Agent Under Test + Standard A2A Interface (AgentBeats/A2A Protocol). Evaluation logic is agentified and communicates with the agent under test through a standard interface.
Key Designs¶
1. AAA Evaluation Framework:
   - The assessor agent manages the full evaluation pipeline: task dispatch → execution budget control (timeouts / retry limits) → output parsing → structured failure-type logging → machine-readable evaluation report generation.
   - Failure types — Timeout (execution timeout), RuntimeError (runtime exception), and ParseError (output parsing failure) — are distinguished from reasoning errors.
   - Core value: integration cost drops from O(n) to O(1) — an agent implements the A2A interface once and can participate in any assessor's evaluation.
   - No failure discarded: traditional harnesses count unparseable outputs as plain errors; AAA separates execution failures from reasoning errors, supporting post-hoc auditing.
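The failure taxonomy above can be sketched as a thin dispatch wrapper. This is a minimal illustration, not the paper's implementation — `run_task`, the status strings, and the JSON output format are all assumptions made here for concreteness.

```python
import concurrent.futures
import json

# Sketch of the assessor's failure classification: execution failures
# (Timeout / RuntimeError / ParseError) are logged separately from
# reasoning errors. All names and the output schema are hypothetical.
def run_task(agent_fn, task, timeout_s=60):
    """Dispatch one task, enforce the execution budget, classify the outcome."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(agent_fn, task)
    try:
        raw = future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return {"task": task, "status": "Timeout"}
    except Exception as exc:
        return {"task": task, "status": "RuntimeError", "detail": str(exc)}
    finally:
        pool.shutdown(wait=False)
    try:
        label = json.loads(raw)["label"]  # expected machine-readable output
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"task": task, "status": "ParseError", "raw": raw}
    return {"task": task, "status": "OK", "label": label}
```

The point of the structure is that an unparseable or crashed run never silently counts as a reasoning error — each outcome carries its own status for post-hoc auditing.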
2. FOLIO Data Cleaning Pipeline:
   - Step 1: Verify the FOL representations with the Vampire theorem prover — the satisfiability of \(\bigwedge_i \phi_i \wedge \neg\varphi\) (premises conjoined with the negated conclusion) is checked to validate the True/False/Uncertain labels.
   - Step 2: When a verification result conflicts with a label, a critique agent diagnoses translation errors (bracket mismatches, naming inconsistencies, etc.), and a refiner agent performs targeted repairs.
   - Step 3: Verify and repair iteratively until consistency is achieved; samples exceeding the iteration threshold are flagged for human review.
   - Results: 674 training samples (67.3%) passed verification directly, 23 (2.3%) passed after repair, and 304 (30.4%) remain problematic.
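The three-way decision rule behind the verification step can be illustrated with a propositional stand-in. The paper runs Vampire over first-order formulas; the sketch below replaces that with a truth-table check over propositional atoms purely to show the rule — True when premises ∧ ¬conclusion is unsatisfiable, False when premises ∧ conclusion is unsatisfiable, Uncertain otherwise. All function names are invented here.

```python
from itertools import product

# Propositional stand-in for the FOL check (illustrative only):
# a formula is a Python predicate over a dict of atom truth values.
def entails(premises, conclusion, atoms):
    """True iff every model of the premises also satisfies the conclusion,
    i.e. premises ∧ ¬conclusion is unsatisfiable."""
    for values in product([False, True], repeat=len(atoms)):
        model = dict(zip(atoms, values))
        if all(p(model) for p in premises) and not conclusion(model):
            return False
    return True

def gold_label(premises, conclusion, atoms):
    """Decision rule used to validate dataset labels."""
    if entails(premises, conclusion, atoms):
        return "True"       # premises ∧ ¬conclusion unsatisfiable
    if entails(premises, lambda m, c=conclusion: not c(m), atoms):
        return "False"      # premises ∧ conclusion unsatisfiable
    return "Uncertain"      # both are satisfiable
```

For example, with premises {p → q, p}, the conclusion q is labeled "True", ¬q is labeled "False", and with premise {p} alone, q is "Uncertain".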
3. Auto-formalization Agent:
   - Stage 1 (code generation): an LLM translates the natural-language premises and conclusion into an executable Z3Py program.
   - Stage 2 (execution and verification): sandboxed execution with a 60 s timeout; satisfiability checking determines True/False/Uncertain.
   - Self-repair loop: up to 3 iterations; on a syntax error, the error message is extracted and fed back for targeted code repair before retrying.
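The self-repair loop can be sketched as follows. In the real system the generator is an LLM call and the candidate program is executed in a sandbox; here `generate` is a placeholder and a `compile()` pass stands in for the syntax gate. Everything below is illustrative, not the paper's code.

```python
# Sketch of the self-repair loop: generate a candidate Z3Py program,
# check that it at least compiles, and on a syntax error feed the
# message back for a targeted retry (up to max_iters attempts).
# `generate` stands in for the LLM call; all names are hypothetical.
def formalize_with_repair(problem, generate, max_iters=3):
    feedback = None
    for attempt in range(1, max_iters + 1):
        code = generate(problem, feedback)
        try:
            compile(code, "<z3py>", "exec")  # syntax gate before sandboxed run
        except SyntaxError as exc:
            feedback = f"SyntaxError at line {exc.lineno}: {exc.msg}"
            continue
        return code, attempt
    return None, max_iters  # unrepaired -> logged as an execution failure
```

An unrepaired program surfaces as an execution failure in the assessor's report rather than being silently counted as a wrong answer.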
4. CoT Baseline: Standard chain-of-thought prompting requiring step-by-step reasoning before outputting a final label.
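A minimal sketch of what such a baseline's prompt and label extraction might look like. The paper does not give its prompt text, so the template and the answer format below are entirely hypothetical; the parse-failure path is what feeds the ParseError category above.

```python
import re

# Hypothetical CoT baseline: prompt template plus final-label extraction.
PROMPT = (
    "Premises:\n{premises}\n\nConclusion: {conclusion}\n\n"
    "Reason step by step, then end with exactly one line of the form\n"
    "Answer: True | False | Uncertain"
)

def build_prompt(premises, conclusion):
    return PROMPT.format(premises="\n".join(premises), conclusion=conclusion)

def parse_label(response):
    """Extract the final label; None signals a parse failure (ParseError)."""
    match = re.search(r"Answer:\s*(True|False|Uncertain)\s*$", response)
    return match.group(1) if match else None
```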
Key Experimental Results¶
Main Results¶
| Method | True Acc. | False Acc. | Uncertain Acc. | Overall Acc. |
|---|---|---|---|---|
| Chain-of-Thought | 89.04% | 44.26% | 84.06% | 73.89% |
| Auto-formalization | 90.41% | 77.05% | 91.30% | 86.70% |
Ablation Study¶
| Analysis Dimension | Key Findings |
|---|---|
| False class gain | +32.79 pp; contradiction detection is CoT's weakest point, where formal verification yields the most significant advantage. |
| Uncertain class gain | +7.24 pp; the SMT solver handles logical uncertainty more effectively. |
| True class | 89.04%→90.41%; already high, marginal improvement. |
| Effect of data cleaning | Cleaned labels are more reliable, yielding more faithful evaluation results. |
Key Findings¶
- The substantial gain on the False class (44.26%→77.05%) reveals a systematic weakness of CoT in logical contradiction reasoning: models struggle to derive contradictions from premises (which requires proving that \(\bigwedge_i \phi_i \wedge \varphi\) is unsatisfiable, i.e., that the premises entail the negation of the conclusion).
- Formal verification replaces "seemingly correct" reasoning with deterministic logical guarantees.
- The backbone model is Gemini 2.5 Flash with temperature 0.0 for (near-)deterministic outputs.
- Data cleaning exposes implicit quality issues in existing benchmarks — evaluating on incorrect labels systematically underestimates model capability.
Highlights & Insights¶
- Paradigm Innovation in Agentified Evaluation: Transforming evaluation itself into an agent decouples assessment logic from agent implementation, reducing integration cost.
- Structured Failure Logging: Distinguishing Timeout / RuntimeError / ParseError from reasoning errors makes evaluations auditable and traceable.
- Lessons from Data Cleaning: Using a formal theorem prover to verify NLI dataset labels is more reliable and scalable than manual annotation.
- Dramatic Gain on False Class: Reveals a fundamental limitation of CoT in logical negation and contradiction reasoning — informal methods face a hard ceiling in this setting.
Limitations & Future Work¶
- Validation is conducted on a single dataset (the FOLIO validation set, 203 examples), which is very small and offers limited statistical power.
- Only two agents are compared (CoT and Auto-formalization), lacking broader comparisons with methods such as LINC, Logic-LM, and SymbCoT.
- 30.4% of training samples remain flagged as "problematic" after the cleaning pipeline, leaving substantial room to improve its coverage.
- Practical interoperability, communication latency, and overhead of the A2A interface are not analyzed in detail.
- The quality of auto-formalized Z3Py code is heavily dependent on the capabilities of the backbone LLM.
- Performance on more complex reasoning tasks (e.g., higher-order logic, probabilistic reasoning) is not evaluated.
Related Work & Insights¶
- Compared to traditional static evaluation harnesses, AAA decouples assessment logic from agent implementation, representing an advancement in evaluation methodology.
- Built on the AgentBeats framework and A2A protocol, it aligns with the trend toward agent interoperability standards.
- The FOLIO cleaning work uses the Vampire theorem prover to verify NL-FOL alignment, serving as an exemplar of data quality assurance.
- The auto-formalization approach (NL→Z3Py) is in the same vein as LINC and Logic-LM, but adds a self-repair loop to improve robustness.
Rating¶
- Novelty: ⭐⭐⭐ (The AAA framework concept is novel, though the auto-formalization technique is relatively straightforward.)
- Experimental Thoroughness: ⭐⭐ (Single dataset, few comparison methods, extremely small scale.)
- Writing Quality: ⭐⭐⭐⭐ (Clear and well-structured, with precise problem definitions.)
- Value: ⭐⭐⭐ (The agentified evaluation paradigm is thought-provoking; the data cleaning pipeline has practical value.)