Investigating Advanced Reasoning of Large Language Models via Black-Box Environment Interaction

Conference: ICML 2026
arXiv: 2508.19035
Code: https://github.com/lemonsis/Oracle_Benchmark (available)
Area: LLM Evaluation / Reasoning Benchmark
Keywords: Reasoning Evaluation, Black-Box Interaction, Exploration Strategy, Abduction-Deduction-Induction, ORACLE Benchmark

TL;DR

This work proposes "black-box environment interaction" as a new paradigm for evaluating integrated reasoning (abduction + deduction + induction) in LLMs, constructing the ORACLE benchmark with 96 environments across 6 task types. Benchmarking 19 LLMs reveals that even the strongest model, o3, achieves only 70% accuracy in simple environments and drops to 40% in difficult ones. All LLMs lack high-level planning abilities for "adaptive optimization of exploration strategies based on feedback."

Background & Motivation

Background: LLMs have achieved soaring scores on reasoning benchmarks such as GSM8k and MATH, with long CoT and test-time scaling making models appear "capable of reasoning."

Limitations of Prior Work: (1) Existing datasets mostly test abduction, deduction, and induction in isolation, without treating them as a unified process; (2) Using games (Minecraft / 24 Game) to simulate interactive environments introduces confounding factors such as spatial reasoning and long context, and training data may be contaminated; (3) Static datasets are prone to memorization, rendering benchmarks ineffective.

Key Challenge: Human discovery in unknown environments follows a dynamic closed loop—abduction (hypothesis from observation) → deduction (predict new observations) → induction (revise hypothesis with new evidence)—the Peirce framework. Current LLM evaluation almost exclusively tests single-step deduction or static CoT, failing to assess the full "hypothesis-verification-revision" reasoning cycle.

Goal: (1) Design an interactive paradigm that compels LLMs to complete the full reasoning cycle; (2) Ensure the paradigm is pure—measuring only reasoning, without confounding abilities; (3) Make the paradigm robust to contamination and scalable to arbitrary difficulty.

Key Insight: Abstract the "unknown environment" as a black-box hidden function \(f:X\to Y\), where the LLM, within a limited \(T\) rounds of exploration, queries input-output pairs to uncover \(f\), then predicts outputs for new inputs in a test set. This paradigm inherently requires hypothesis generation (abduction), query generation (deduction), and feedback-based revision (induction).

Core Idea: Use "black-box environment interaction" as the evaluation paradigm, forcing LLMs to treat deduction, induction, and abduction as an inseparable, holistic reasoning cycle.

Method

Overall Architecture

Each evaluation instance consists of two stages: (1) an exploration stage of \(T\) rounds, where at round \(t\) the model \(M\) adaptively generates a query \(x_t=M(H_{t-1})\) from the history \(H_{t-1}=(x_1,y_1,\ldots,x_{t-1},y_{t-1})\) and the black box returns \(y_t=f(x_t)\); (2) an evaluation stage of \(K\) rounds, where the model predicts \(\hat{y}^k\) for each test input \(x^k_{\rm test}\) (disjoint from the exploration queries) and the black box returns binary correctness \(c^k=\mathbb{1}(\hat{y}^k=f(x^k_{\rm test}))\). The model can revise subsequent predictions based on these correctness signals. Two metrics are used: accuracy \(=\frac{1}{K}\sum_{k=1}^{K} c^k\) and turn@shot (e.g., 20@2 means 20 exploration rounds and 2 attempts per test sample).
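The two-stage protocol above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: `run_instance`, `make_shift_cipher`, and the toy player functions are hypothetical names, the shift cipher is only a stand-in for an ERI-style black box, and counting a test sample as correct when any of its attempts matches is an assumption about how shots are scored.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (query, observed output) pairs


def make_shift_cipher(k: int) -> Callable[[str], str]:
    """Hypothetical ERI-style black box: shift lowercase letters by k."""
    def f(text: str) -> str:
        return "".join(chr((ord(c) - 97 + k) % 26 + 97) for c in text)
    return f


def run_instance(f, query_policy, predict, test_inputs, T, shots):
    # Exploration stage: T adaptive queries, each answered by the black box.
    history: History = []
    for _ in range(T):
        x = query_policy(history)          # x_t = M(H_{t-1})
        history.append((x, f(x)))          # y_t = f(x_t)
    # Evaluation stage: up to `shots` attempts per test input.
    correct = 0
    for x_test in test_inputs:
        rejected = []                      # earlier wrong answers for this input
        for _ in range(shots):
            y_hat = predict(history, x_test, rejected)
            if y_hat == f(x_test):         # binary correctness signal c^k
                correct += 1
                break
            rejected.append(y_hat)
    return correct / len(test_inputs)      # accuracy


def toy_query_policy(history: History) -> str:
    # Stand-in for the LLM: probes "a", then "b", and so on.
    return chr(97 + len(history))


def toy_predict(history: History, x_test: str, rejected: List[str]) -> str:
    # Stand-in for the LLM: infers the shift from the first observation.
    x0, y0 = history[0]
    k = (ord(y0) - ord(x0)) % 26
    return make_shift_cipher(k)(x_test)


acc = run_instance(make_shift_cipher(3), toy_query_policy, toy_predict,
                   test_inputs=["hello", "xyz"], T=2, shots=1)
print(acc)
```

In a real run the two toy functions would be replaced by calls to the evaluated LLM, which sees only the interaction history and never the code of `f`.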

Key Designs

  1. Black-Box Environment Interaction Paradigm + 6 Task Types:

    • Function: Unifies hidden functions from various domains into a black box mapping "input space \(X\) → output space \(Y\)," and designs 6 semantically distinct tasks to cover a broad spectrum of reasoning.
    • Mechanism: The 6 task types are CII (Code Intent Inference: black box is algorithmic code, model queries variable values at checkpoints), CRI (Circuit Rule Inference: black box is a Boolean circuit, queries input wires → gate outputs), PSI (Physics System Inference: black box is a classical mechanics system, queries time → object coordinates), ERI (Encryption Rule Inference: black box is an encryption mapping, queries plaintext → ciphertext), IPI (Interactive Puzzle Inference: black box is an interactive game like number guessing), GSI (Game Strategy Inference: black box is an opponent's fixed strategy, model must win). Each type includes easy/hard environments, totaling 96 environments.
    • Design Motivation: Synthetic black boxes prevent data contamination—even if LLMs have seen similar tasks, each specific black-box rule is novel; the function space is pure, with no visual/long-context/common-sense knowledge, isolating reasoning ability.
  2. Three-Module LLM Agentic Framework for Automatic Black-Box Generation:

    • Function: Automatically generates black-box code and interaction interfaces from natural language descriptions, enabling the benchmark to scale arbitrarily.
    • Mechanism: Three modules collaborate—(a) Coding LLM receives natural language task description and interaction rules, generating platform code; (b) Test LLM acts as a player interacting with the black box, simulating real scenarios to produce interaction logs; (c) Refinement LLM uses logs and task rules to automatically diagnose errors (execution errors / functionality mismatch / correctness) and iteratively refine. This closed-loop process aligns with engineering principles of "debugging via runtime feedback," and is more robust than static analysis.
    • Design Motivation: Manually writing black boxes is costly and unscalable; using LLMs as generators does not introduce bias, as the evaluated model only sees the interaction interface, not the underlying code.
  3. Theoretical Query Lower Bound + Adaptive Exploration Stratification:

    • Function: Provides an information-theoretic lower bound on the minimum number of queries needed to identify the hidden function, and stratifies LLM exploration ability into three tiers.
    • Mechanism: From the perspective of exact identification from membership queries, identifying hypothesis space \(\mathcal{H}\) requires at least \(T_{\rm info}\geq \lceil\log_2|\mathcal{H}|/\log_2|Y|\rceil\) queries. The authors stratify LLM exploration into three tiers: Tier 1—random exploration (no strategy); Tier 2—fixed strategy without feedback optimization; Tier 3—adaptive strategy that approaches optimality using instant feedback. Tier 3 is human-level, currently unattained by LLMs.
    • Design Motivation: Provides an absolute reference (information-theoretic lower bound) rather than only comparing baselines; stratification structures analysis and pinpoints specific LLM deficiencies.
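As a rough worked example of the bound \(T_{\rm info}\geq \lceil\log_2|\mathcal{H}|/\log_2|Y|\rceil\), the sketch below computes it with integer arithmetic: the smallest \(T\) with \(|Y|^T \geq |\mathcal{H}|\), which equals the ceiling expression without floating-point rounding. The function name and the three example hypothesis spaces are illustrative assumptions, not taken from the paper.

```python
def query_lower_bound(num_hypotheses: int, num_outputs: int) -> int:
    """Smallest T such that num_outputs**T >= num_hypotheses.

    Equivalent to ceil(log2|H| / log2|Y|): T queries with |Y| possible
    answers each can distinguish at most |Y|**T hypotheses.
    """
    if num_hypotheses < 1 or num_outputs < 2:
        raise ValueError("need |H| >= 1 and |Y| >= 2")
    t, distinguishable = 0, 1
    while distinguishable < num_hypotheses:
        distinguishable *= num_outputs
        t += 1
    return t


# Hypothetical examples (sizes chosen for illustration only):
# all Boolean functions of 3 input wires, one output bit per query
print(query_lower_bound(2 ** (2 ** 3), 2))
# 26 possible letter shifts, one ciphertext letter (26 values) per query
print(query_lower_bound(26, 26))
# a secret number among 100, feedback in {lower, higher, equal}
print(query_lower_bound(100, 3))
```

The bound only says how many queries are necessary in principle; whether a given query set actually attains it depends on how well each query splits the remaining hypotheses.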

Loss & Training

This work is an evaluation paradigm and benchmark, not a training method. All models are queried with default API parameters and temperature=0, with reasoning effort set to medium for the GPT series and a thinking budget of 20,000 tokens for the Claude and Qwen series.

Key Experimental Results

Main Results

19 qualified LLMs (including o1/o3/o3-mini/o4-mini, Claude-3.5/3.7/4-sonnet, Gemini-2.5-flash/pro, DeepSeek-v3/r1, Qwen3 series, etc.) are evaluated under the 10@1 and 20@2 settings. The table below shows the top two models and the SOTA accuracy for each of the 6 task types under 10@1; o3 ranks first in 5 of the 6:

| Task | 1st Place | 2nd Place | Easy acc (SOTA) | Hard acc (SOTA) |
|---|---|---|---|---|
| CII | o3 | o4-mini | ~85% | ~50% |
| CRI | o3 | gemini-2.5-pro | ~80% | ~40% |
| PSI | o3 | gemini-2.5-pro | ~75% | ~35% |
| ERI | o4-mini | o3-mini | ~80% | ~30% |
| IPI | o3 | o4-mini | ~85% | ~45% |
| GSI | o3 | gemini-2.5-pro | ~70% | ~40% |

Ablation Study

The core ablation compares setting (i) "no feedback during exploration, all query answers revealed at the end" vs setting (ii) "instant feedback after each round," on CRI and ERI tasks for gemini-2.5-pro / o3-mini / o4-mini:

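| Model | Task | Setting (i) acc | Setting (ii) acc | Difference |
|---|---|---|---|---|
| gemini-2.5-pro | CRI | | | ~0 |
| o3-mini | CRI | | | ~0 |
| o4-mini | CRI | | | ~0 |
| gemini-2.5-pro | ERI | | | ~0 |
| o3-mini | ERI | | | ~0 |
| o4-mini | ERI | | | ~0 |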

Performance is nearly identical in both settings—direct evidence that "LLMs do not utilize instant feedback to optimize strategies."
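The difference between the two settings, and why a rigid strategy makes them indistinguishable, can be illustrated with a small sketch. Everything here is hypothetical: `explore`, `one_hot_policy`, and `hidden_circuit` are stand-ins, and masking past answers with `None` is only one assumed way to model "no feedback during exploration."

```python
def hidden_circuit(bits):
    """Hypothetical CRI-style black box over 4 input wires."""
    a, b, c, d = bits
    return (a and b) ^ (c or d)


def explore(f, policy, T, instant_feedback):
    queries, answers = [], []
    for _ in range(T):
        # Setting (ii): the policy sees every past answer.
        # Setting (i): past answers are hidden until exploration ends.
        visible = list(answers) if instant_feedback else [None] * len(answers)
        x = policy(queries, visible)
        queries.append(x)
        answers.append(f(x))
    # In both settings the full log is available before the evaluation stage.
    return list(zip(queries, answers))


def one_hot_policy(queries, visible_answers, n=4):
    # Fixed strategy: walks through one-hot inputs and never reads the answers.
    i = len(queries) % n
    return tuple(1 if j == i else 0 for j in range(n))


log_i = explore(hidden_circuit, one_hot_policy, T=4, instant_feedback=False)
log_ii = explore(hidden_circuit, one_hot_policy, T=4, instant_feedback=True)
print(log_i == log_ii)
```

A policy that ignores `visible_answers` collects exactly the same evidence in both settings, so its downstream accuracy cannot differ; a gap between the settings would only appear for a policy that conditions its next query on earlier outputs.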

Key Findings

  • Reasoning models outperform chat models: claude-4-sonnet_thinking consistently outperforms the non-thinking version; newer models outperform older ones (gemini-2.5-flash > 2.0-flash).
  • Doubling exploration budget (10→20 rounds, 1→2 attempts) improves CII/CRI/IPI tasks by >10%, but has little effect on PSI (limited by numerical computation), ERI, or GSI—the bottleneck is "inability to design efficient exploration strategies."
  • Setting (i) vs (ii) equivalence: SOTA models perform nearly identically with or without instant feedback, indicating that models do not alter their exploration behavior even when feedback is available. Figure 9 shows o4-mini exhaustively trying one-hot inputs in CRI and gemini-2.5-pro repeatedly querying single letters in ERI; both follow the same rigid strategy in both settings.
  • Three-tier exploration ability: Authors classify LLMs as Tier 1 (random exploration), Tier 2 (fixed strategy, no optimization), Tier 3 (adaptive). Current top models only occasionally reach Tier 2; none reach Tier 3—Tier 3 remains human territory.
  • Large difficulty gap: Easy tasks achieve 70-85% accuracy, hard tasks drop to 30-50%, indicating a reasonable difficulty gradient.
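To make the three tiers concrete, the sketch below compares hand-written stand-in policies on a toy black box: a hidden threshold `secret` in 0..63, where a query `x` returns 1 if `x >= secret` and 0 otherwise. The environment, the policy names, and the choice of worst-case query count as the measure are assumptions for illustration; they are not the paper's environments or its tier classification procedure.

```python
import random

N = 64  # hypothetical hypothesis space: secret threshold in 0..63


def queries_to_identify(policy, secret, rng):
    lo, hi = 0, N - 1              # secrets still consistent with all feedback
    t = 0
    while lo < hi:
        x = policy(lo, hi, t, rng)
        if x >= secret:            # black box answers 1
            hi = min(hi, x)
        else:                      # black box answers 0
            lo = max(lo, x + 1)
        t += 1
    return t


def tier1_random(lo, hi, t, rng):
    return rng.randrange(N)        # no strategy at all


def tier2_linear_scan(lo, hi, t, rng):
    return t                       # fixed sequence 0, 1, 2, ... ignoring feedback


def tier3_binary_search(lo, hi, t, rng):
    return (lo + hi) // 2          # next query depends on the feedback so far


rng = random.Random(0)
worst = {
    policy.__name__: max(queries_to_identify(policy, s, rng) for s in range(N))
    for policy in (tier1_random, tier2_linear_scan, tier3_binary_search)
}
print(worst)
```

With 64 hypotheses and binary answers, at least \(\log_2 64 = 6\) queries are needed in the worst case. The adaptive policy meets that bound, while the fixed scan needs up to 63 queries, which is the kind of gap the tier stratification is meant to expose.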

Highlights & Insights

  • Explicitly maps Peirce's "abduction-deduction-induction" triad into the benchmark's design philosophy, providing the evaluation community with a tool to measure the complete reasoning cycle.
  • The structure where "LLMs generate black boxes, but evaluated LLMs cannot see the code" naturally prevents data contamination and allows arbitrary benchmark expansion—very clever.
  • The equivalence experiment for settings (i)/(ii) is counterintuitive yet highly diagnostic—directly falsifying the common assumption that "LLMs use feedback."
  • The Tier 1/2/3 stratification can guide RL post-training design: the reward should target the dynamic quality of strategy optimization, not just final correctness.
  • The information-theoretic lower bound provides an absolute scale for "how hard is the black box," allowing direct calculation of "how far o3 is from optimal query efficiency."

Limitations & Future Work

  • The 96 environments are still limited in number, and some categories (e.g., GSI) have relatively fixed rules; over time, models may learn meta-patterns, reducing challenge.
  • Black-box tasks are somewhat toy-like and far from real scientific discovery; while the finding that "LLMs do not adaptively explore" is strong, its relevance to real tasks (e.g., code debugging) is unproven.
  • Evaluation depends on commercial APIs, making reproduction costly (19 SOTA models × 96 environments × multiple turn@shot settings).
  • Some tasks (PSI) are bottlenecked by LLMs' poor numerical computation, not pure reasoning, so evaluation signals are mixed.
  • No training-time methods are proposed to teach LLMs adaptive exploration; the benchmark raises the problem but leaves solutions to future work.
Related Work

  • vs WebArena / GameBench / GameArena: These use real web or game environments, introducing spatial reasoning, long context, and common-sense knowledge; ORACLE uses pure function black boxes to isolate reasoning ability.
  • vs InductionBench / DEER / Mirage: These only test inductive reasoning; ORACLE's interactive paradigm tests deduction, induction, and abduction simultaneously.
  • vs LiveBench / LiveCodeBench: These use timestamps to resist contamination; ORACLE uses synthetic black boxes for more thorough contamination resistance.
  • vs DyVal / DARG: These generate evaluation questions dynamically; ORACLE also generates its environments dynamically and adds an interactive closed loop.
  • vs PlanBench (Valmeekam et al. 2023): PlanBench also tests planning ability but is limited to static tasks with known rules; ORACLE emphasizes exploratory planning in unknown environments.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Precisely maps the Peirce reasoning framework to a black-box interaction evaluation paradigm; the setting (i)/(ii) equivalence experiment is a truly original diagnostic design.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 19 SOTA LLMs × 96 environments (across 6 task types) × multiple turn@shot settings, plus baseline tests and in-depth behavioral analysis.
  • Writing Quality: ⭐⭐⭐⭐ Case figures are highly intuitive, but some sections (e.g., theoretical lower bound analysis) are relegated to the appendix; the Tier 1/2/3 classification could appear earlier in the main text.
  • Value: ⭐⭐⭐⭐⭐ Contamination-resistant, scalable, and directly reveals current LLM reasoning bottlenecks; the benchmark is open-sourced and widely usable.