Agentified Assessment of Logical Reasoning Agents

Conference: ICLR 2026 | arXiv: 2603.02788 | Code: HuggingFace Dataset | Area: LLM Reasoning | Keywords: Logical Reasoning Evaluation, Agent-to-Agent Assessment, First-Order Logic, Automatic Formalization, SMT Solving

TL;DR

This paper proposes an agent-based evaluation framework (AAA) that encapsulates assessment logic as an assessor agent and interacts with the agent under test via a standard A2A interface. On a FOLIO dataset systematically cleaned using the Vampire theorem prover, an auto-formalization agent (NL→Z3Py + SMT solving) achieves 86.70% accuracy, substantially outperforming the CoT baseline at 73.89%, with a particularly notable gain of 32.79 percentage points on contradiction detection (False class).

Background & Motivation

  • Evaluation Pain Point 1 — Conflation of Failure Modes: When evaluating reasoning agents, execution failures (timeouts / parsing errors / runtime exceptions) and reasoning errors are often conflated into a single accuracy figure, making it difficult to distinguish "the model cannot reason" from "the model's tools malfunctioned."
  • Evaluation Pain Point 2 — Linear Integration Cost: Traditional evaluation harnesses tightly couple benchmark logic with agent implementations; adding each new benchmark requires re-integration, resulting in O(n) cost.
  • Data Quality Issues: The FOLIO dataset contains potential label errors (3.8% in the training set, 1.5% in the validation set) as well as NL-FOL translation quality issues, making it inherently unreliable to assess reasoning capabilities on flawed data.
  • Lack of Standardized Interfaces: Different agents have different input/output formats, execution environments, and error-handling mechanisms, with no unified plug-and-play interface available.
  • Value of Formal Verification: First-order logic reasoning is an important capability for LLMs, yet CoT methods cannot guarantee logical validity; formal verification via SMT solving provides deterministic guarantees.

Method

Overall Architecture

Agentified Agent Assessment (AAA) = Assessor Agent (evaluation protocol) + Agent Under Test + Standard A2A Interface (AgentBeats/A2A Protocol). Evaluation logic is agentified and communicates with the agent under test through a standard interface.
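The assessor-side pipeline (task dispatch, budget control, structured failure logging) can be sketched as a small evaluation loop. This is an illustrative sketch only, assuming a generic `agent_send` callable; it is not the real AgentBeats/A2A API. It shows the key design point: Timeout, RuntimeError, and ParseError are recorded as execution failures, separate from reasoning errors.

```python
# Hypothetical sketch of an assessor loop (NOT the actual AgentBeats/A2A API).
# Execution failures (Timeout / RuntimeError / ParseError) are logged as
# distinct failure types rather than being folded into the accuracy figure.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def assess(agent_send, tasks, timeout_s=60):
    """agent_send(task_input) -> raw output string.
    tasks: list of dicts with 'id', 'input', 'label'.
    Returns a machine-readable report, one record per task."""
    report = []
    # Note: a single-worker pool is enough for a sketch; a hung task would
    # still occupy the worker after its timeout is reported.
    with ThreadPoolExecutor(max_workers=1) as pool:
        for task in tasks:
            record = {"task_id": task["id"]}
            try:
                raw = pool.submit(agent_send, task["input"]).result(timeout=timeout_s)
            except FutureTimeout:
                record["failure"] = "Timeout"         # execution budget exceeded
            except Exception as exc:
                record["failure"] = "RuntimeError"    # agent crashed, not "wrong answer"
                record["detail"] = repr(exc)
            else:
                label = parse_label(raw)
                if label is None:
                    record["failure"] = "ParseError"  # kept verbatim for post-hoc auditing
                else:
                    record["prediction"] = label
                    record["correct"] = (label == task["label"])
            report.append(record)
    return report

def parse_label(raw):
    """Naive extraction of the final True/False/Uncertain label."""
    for label in ("Uncertain", "False", "True"):
        if label.lower() in raw.lower():
            return label
    return None
```

Because every record carries either a `prediction` or a `failure` type, accuracy can be computed over parseable runs while execution failures remain auditable on their own.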

Key Designs

1. AAA Evaluation Framework:
   - The assessor agent manages the full evaluation pipeline: task dispatch → execution budget control (timeouts / retry limits) → output parsing → structured failure-type logging → machine-readable evaluation report generation.
   - Failure types are categorized as Timeout (execution timeout), RuntimeError (runtime exception), and ParseError (output parsing failure) — distinguished from reasoning errors.
   - Core value: integration cost drops from O(n) to O(1) — an agent implements the A2A interface once and can then participate in any assessor's evaluation.
   - No failure is discarded: traditional harnesses count unparseable outputs as plain errors, whereas AAA separates execution failures from reasoning errors, supporting post-hoc auditing.

2. FOLIO Data Cleaning Pipeline:
   - Step 1: Formal verification of the FOL representations using the Vampire theorem prover — the satisfiability of \(\bigwedge_i \phi_i \wedge \neg\varphi\) and of \(\bigwedge_i \phi_i \wedge \varphi\) (premises \(\phi_i\), conclusion \(\varphi\)) is checked: the first unsatisfiable implies True, the second unsatisfiable implies False, both satisfiable implies Uncertain, validating the dataset labels.
   - Step 2: When the verification result conflicts with the label, a critique agent diagnoses translation errors (bracket mismatches, naming inconsistencies, etc.) and a refiner agent performs targeted repairs.
   - Step 3: Verification and repair iterate until consistency is reached; samples exceeding the iteration threshold are flagged for human review.
   - Results: 674 training samples (67.3%) passed verification directly, 23 (2.3%) passed after repair, and 304 (30.4%) remain problematic.

3. Auto-formalization Agent:
   - Stage 1 (Code Generation): An LLM translates the natural-language premises and conclusion into an executable Z3Py program.
   - Stage 2 (Execution and Verification): The program runs in a sandbox with a 60 s timeout; satisfiability checking determines True/False/Uncertain.
   - Self-repair loop: up to 3 iterations; on a syntax error, the error message is extracted and used for targeted code repair before retrying.

4. CoT Baseline: Standard chain-of-thought prompting requiring step-by-step reasoning before outputting a final label.

Key Experimental Results

Main Results

Method               True Acc.   False Acc.   Uncertain Acc.   Overall Acc.
Chain-of-Thought     89.04%      44.26%       84.06%           73.89%
Auto-formalization   90.41%      77.05%       91.30%           86.70%

Ablation Study

Analysis Dimension        Key Findings
False class gain          +32.79 pp; contradiction detection is CoT's weakest point, where formal verification yields the largest advantage.
Uncertain class gain      +7.24 pp; the SMT solver handles logical uncertainty more effectively.
True class                89.04% → 90.41%; already high, so the improvement is marginal.
Effect of data cleaning   Cleaned labels are more reliable, yielding more faithful evaluation results.

Key Findings

  • The substantial gain on the False class (44.26% → 77.05%) reveals a systematic weakness of CoT in logical contradiction reasoning: models struggle to derive contradictions from premises, which requires proving that \(\bigwedge_i \phi_i \wedge \varphi\) — the premises conjoined with the conclusion — is unsatisfiable.
  • Formal verification replaces "seemingly correct" reasoning with deterministic logical guarantees.
  • The backbone model is Gemini 2.5 Flash with temperature 0.0 (greedy decoding), making outputs near-deterministic.
  • Data cleaning exposes implicit quality issues in existing benchmarks — evaluating on incorrect labels systematically underestimates model capability.

Highlights & Insights

  • Paradigm Innovation in Agentified Evaluation: Transforming evaluation itself into an agent decouples assessment logic from agent implementation, reducing integration cost.
  • Structured Failure Logging: Distinguishing Timeout / RuntimeError / ParseError from reasoning errors makes evaluations auditable and traceable.
  • Lessons from Data Cleaning: Using a formal theorem prover to verify NLI dataset labels is more reliable and scalable than manual annotation.
  • Dramatic Gain on False Class: Reveals a fundamental limitation of CoT in logical negation and contradiction reasoning — informal methods face a hard ceiling in this setting.

Limitations & Future Work

  • Validation is conducted on a single dataset (FOLIO validation set, 203 examples), with extremely small scale and limited statistical power.
  • Only two agents are compared (CoT and Auto-formalization), lacking broader comparisons with methods such as LINC, Logic-LM, and SymbCoT.
  • 30.4% of training samples remain flagged as "problematic" after the cleaning pipeline, leaving its completeness to be improved.
  • Practical interoperability, communication latency, and overhead of the A2A interface are not analyzed in detail.
  • The quality of auto-formalized Z3Py code is heavily dependent on the capabilities of the backbone LLM.
  • Performance on more complex reasoning tasks (e.g., higher-order logic, probabilistic reasoning) is not evaluated.
Positioning & Related Work

  • Compared to traditional static evaluation harnesses, AAA decouples assessment logic from agent implementation, representing an advancement in evaluation methodology.
  • Built on the AgentBeats framework and the A2A protocol, the work aligns with the trend toward agent interoperability standards.
  • The FOLIO cleaning pipeline uses the Vampire theorem prover to verify NL-FOL alignment, serving as an exemplar of data quality assurance.
  • The auto-formalization approach (NL→Z3Py) is in the same vein as LINC and Logic-LM, but adds a self-repair loop to improve robustness.

Rating

  • Novelty: ⭐⭐⭐ (The AAA framework concept is novel, though the auto-formalization technique is relatively straightforward.)
  • Experimental Thoroughness: ⭐⭐ (Single dataset, few comparison methods, extremely small scale.)
  • Writing Quality: ⭐⭐⭐⭐ (Clear and well-structured, with precise problem definitions.)
  • Value: ⭐⭐⭐ (The agentified evaluation paradigm is thought-provoking; the data cleaning pipeline has practical value.)