# EvoEmpirBench: Dynamic Spatial Reasoning with Agent-ExpVer
- Conference: AAAI 2026
- arXiv: 2509.12718
- Code: https://anonymous.4open.science/r/EvoEmpirBench-143C/
- Area: Robotics
- Keywords: Dynamic Spatial Reasoning, Partial Observability, Online Learning, Experience Verification, Maze Navigation
## TL;DR
This paper proposes EvoEmpirBench (EEB), comprising two dynamic interactive benchmarks (partially observable maze navigation and Match-2), together with Agent-ExpVer, a three-agent online learning framework (GeoLink for environment interaction, InsightForce for experience abstraction, TruthWeaver for knowledge management). Through a cognitive cycle of "experience → verification → truth induction," the framework achieves continuous strategy evolution without parameter updates, improving GPT-4.1's maze success rate by 5.6 percentage points and Qwen2.5-32B's by a 29% relative gain.
## Background & Motivation
Background: Existing LLM reasoning benchmarks (BIG-Bench, PlanBench, etc.) are primarily built on static datasets, making them susceptible to data contamination and subject to rapid performance saturation. Game-based benchmarks (SmartPlay, GameArena) are more engaging, but tend to feature static environments, shallow interactions, or evaluation of only specific capabilities.
Limitations of Prior Work: Real-world reasoning requires long-horizon planning in partially observable, dynamically changing environments — each action alters the environment state, requiring agents to continuously update their understanding and strategies. Existing benchmarks rarely evaluate all three dimensions simultaneously: partial observability + dynamic environment + long-horizon reasoning.
Key Challenge: The conventional paradigm of "collect data → offline training" is ill-suited for dynamic environments, whereas human learning adapts to new situations through continuous abstraction and rule induction (experience → verification → truth). LLM agents lack analogous online learning mechanisms.
Goal: (a) Construct a genuinely dynamic, partially observable reasoning benchmark; (b) Design a human cognition-inspired online learning framework that enables agents to improve continuously without parameter updates.
Key Insight: Two carefully designed games (maze + Match-2) serve as test environments — each action modifies the environment, and agents can only observe local information. The three-agent collaborative framework is designed based on the principle of human "experiential learning."
Core Idea: Replace offline training with a cognitive cycle of "subjective experience → verification → truth induction" to enable parameter-free continual learning in dynamic environments.
## Method
### Overall Architecture
The work has two components: (1) the construction of the EvoEmpirBench dynamic benchmark; and (2) the Agent-ExpVer three-agent online learning framework.
### Key Designs
- EvoEmpirBench: Two Dynamic Games:
  - Maze Navigation: A 9×9 grid in which the agent has partial observability, seeing only a local region (a minimal sketch of this constraint follows this item). The Easy level contains only coins; Medium adds moving monsters; Hard introduces 4 item types (pickaxe, iron sword, magnet, key) along with monsters and obstacles. Each action (destroying obstacles, picking up items, etc.) alters the environment structure.
  - Match-2: An 8×8 board on which the agent eliminates groups of ≥2 adjacent same-colored tiles; after each elimination, the remaining tiles fall and new tiles are randomly replenished. The agent must reach a target elimination count for each color within a limited number of steps. Power-ups (row clear, column clear, bomb, hammer) cost points to use.
  - Design Motivation: The two games evaluate complementary dimensions: the maze tests spatial navigation, risk management, and memory utilization, while Match-2 tests strategic planning, resource optimization, and long-term goal management. The benchmark contains 120 task instances spanning both games and three difficulty levels.
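To make the partial-observability constraint concrete, here is a minimal sketch of how a local view could be extracted from the 9×9 maze grid. The view radius, cell encoding, and function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

GRID_SIZE = 9     # the maze is a 9x9 grid (per the paper)
VIEW_RADIUS = 1   # assumed: the agent sees only a 3x3 neighborhood

def local_observation(grid: np.ndarray, agent_rc: tuple) -> np.ndarray:
    """Return the agent's local view; out-of-bounds cells are masked as -1."""
    # Pad with a sentinel so views at the border keep a fixed shape.
    padded = np.pad(grid, VIEW_RADIUS, constant_values=-1)
    r, c = agent_rc[0] + VIEW_RADIUS, agent_rc[1] + VIEW_RADIUS
    return padded[r - VIEW_RADIUS : r + VIEW_RADIUS + 1,
                  c - VIEW_RADIUS : c + VIEW_RADIUS + 1]

# Example: at the top-left corner, 5 of the 9 visible cells are out of bounds.
maze = np.zeros((GRID_SIZE, GRID_SIZE), dtype=int)
print(local_observation(maze, (0, 0)))
```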
- GeoLink Agent (Environment Interaction):
  - Function: Directly interacts with the game environment, selecting actions and collecting trajectories at each timestep.
  - Mechanism: \(a_t \sim \pi_t(\mathbf{s}_t)\), collecting the interaction history \(\mathcal{H}_{0:T} = \{(\mathbf{s}_0, a_0, r_0), \ldots\}\).
  - The policy \(\pi_t\) evolves continuously by integrating accumulated "truth" knowledge: \(\pi_t = \pi_0 \cup \bigcup_{e \in \mathcal{M}_{\text{truth}}} e\) (a sketch of this loop follows this item).
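A minimal sketch of this prompt-augmented interaction loop, assuming a generic `env`/`llm` interface (names and signatures are placeholders, not the authors' code): the "policy" is simply the base prompt extended with every accumulated truth, matching \(\pi_t = \pi_0 \cup \bigcup_{e \in \mathcal{M}_{\text{truth}}} e\).

```python
def geolink_episode(env, llm, base_prompt: str, truths: list) -> list:
    """Roll out one episode and return the trajectory H_{0:T}."""
    # pi_t = pi_0 ∪ truths: the policy is the base prompt plus all truths.
    policy_prompt = base_prompt + "\n" + "\n".join(truths)
    trajectory = []                       # H_{0:T} = [(s_t, a_t, r_t), ...]
    state, done = env.reset(), False
    while not done:
        action = llm(policy_prompt, state)        # a_t ~ pi_t(s_t)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```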
- InsightForce Agent (Experience Abstraction + Verification):
  - Function: Abstracts interaction trajectories into "subjective experiences" and validates their effectiveness through replay.
  - Mechanism: An LLM summarizes the trajectory \(\mathcal{H}_{0:T}\) and final metrics \(\mathbf{m}\) into an experience \(\mathbf{e} = f_{\text{sum}}(\mathcal{H}_{0:T}, \mathbf{m})\). The agent then replays the same level with experience \(\mathbf{e}\); if the level is completed and the score improves, the experience is promoted to a "truth": \(\mathcal{M}_{\text{truth}} \leftarrow \mathcal{M}_{\text{truth}} \cup \{\mathbf{e}\}\) if \(P \wedge (S' > S)\) (a sketch of this rule follows this item).
  - Design Motivation: Inspired by human episodic memory: not all experiences are valuable; only those verified to be genuinely effective are worth retaining.
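A sketch of InsightForce's verify-then-promote rule under the same assumed interface; `summarize` stands in for \(f_{\text{sum}}\) and `replay` re-runs the identical level with the candidate experience injected (both are hypothetical helpers):

```python
def insightforce_step(trajectory, metrics, truths, baseline_score,
                      summarize, replay):
    """Abstract one experience and promote it only if replay verifies it."""
    experience = summarize(trajectory, metrics)   # e = f_sum(H_{0:T}, m)
    # Replay the same level with the candidate experience in the prompt.
    passed, new_score = replay(truths + [experience])
    # M_truth <- M_truth ∪ {e}  iff  P ∧ (S' > S)
    if passed and new_score > baseline_score:
        truths.append(experience)
    return truths
```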
- TruthWeaver Agent (Knowledge Management):
  - Function: Manages the truth knowledge base to prevent redundant accumulation.
  - Mechanism: Three operations: (1) merge semantically similar truths (different phrasings with equivalent meaning); (2) remove exact duplicates; (3) insert new truths. This keeps the knowledge base concise and high-quality (a sketch of these operations follows this item).
  - Design Motivation: As learning rounds accumulate, the knowledge base grows rapidly; a mechanism analogous to human "memory consolidation" is needed to refine and deduplicate entries.
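A sketch of TruthWeaver's three operations; `semantically_similar` and `merge` would be LLM (or embedding) calls in practice and are assumptions here:

```python
def truthweaver_update(knowledge_base: list, new_truth: str,
                       semantically_similar, merge) -> list:
    """Keep the truth base concise: merge paraphrases, drop duplicates."""
    if new_truth in knowledge_base:              # (2) remove exact duplicates
        return knowledge_base
    for i, existing in enumerate(knowledge_base):
        if semantically_similar(existing, new_truth):
            # (1) merge semantically similar truths into one entry
            knowledge_base[i] = merge(existing, new_truth)
            return knowledge_base
    knowledge_base.append(new_truth)             # (3) insert genuinely new truths
    return knowledge_base
```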
### Policy Rollback Mechanism
If the average score decreases after a policy update (\(\Delta < 0\)), the framework automatically reverts to the previous policy version and restarts experience abstraction. This ensures the learning process is monotonically non-degrading.
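A minimal sketch of this safeguard, assuming the policy is fully determined by the truth list and that a hypothetical `evaluate_avg_score` helper measures average score over a batch of levels:

```python
def update_with_rollback(truths: list, candidate_truths: list,
                         evaluate_avg_score) -> list:
    """Commit a policy update only if the average score does not drop."""
    delta = evaluate_avg_score(candidate_truths) - evaluate_avg_score(truths)
    if delta < 0:
        return truths            # Delta < 0: revert to the previous policy
    return candidate_truths      # otherwise keep the updated truth set
```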
## Key Experimental Results
### Main Results — Maze Navigation
| Model | Success Rate (%) | Avg. Score | Avg. Steps |
|---|---|---|---|
| Human | 90.00 | 2914.67 | 20.6 |
| GPT-4.1 | 73.33 | 2562.33 | 34.0 |
| GPT-4.1 + ExpVer | 78.89 | 2805.67 | 32.8 |
| DeepSeek-V3 | 61.11 | 1649.78 | 50.6 |
| Qwen2.5-32B | 42.22 | 1122.22 | 38.4 |
| Qwen2.5-32B + ExpVer | 54.44 | 1532.33 | 35.8 |
| Llama-3.1-8B | 23.33 | -1213.67 | 54.4 |
### Main Results — Match-2
| Model | Success Rate (%) | Avg. Score |
|---|---|---|
| Human | 86.67 | 350.22 |
| GPT-4.1 | 40.00 | 245.04 |
| GPT-4.1 + ExpVer | 53.33 | 234.60 |
| Grok-3 | 42.22 | 246.87 |
| Claude-3.7-Sonnet | 41.11 | 230.33 |
| Qwen2.5-32B | 33.33 | 203.07 |
| Qwen2.5-32B + ExpVer | 41.57 | 197.42 |
### Ablation Study
| Configuration | Maze Succ. | Maze Score | Match-2 Succ. | Match-2 Score |
|---|---|---|---|---|
| GPT-4.1 Baseline | 73.33% | 2562 | 40.00% | 245 |
| GPT-4.1 w/o TruthWeaver | 77.78% | 2765 | 48.89% | 220 |
| GPT-4.1 + ExpVer (Full) | 78.89% | 2806 | 53.33% | 235 |
### Key Findings
- All LLMs lag significantly behind humans: 90% (human) vs. 78.89% (best LLM) on maze; 86.67% vs. 53.33% on Match-2, indicating that dynamic spatial reasoning remains a critical weakness of LLMs.
- Agent-ExpVer consistently improves all tested models without any parameter updates; for GPT-4.1, success rate rises by 5.6 percentage points on the maze and 13.3 points on Match-2.
- Qwen2.5-32B shows the largest improvement: maze success rate rises from 42.22% to 54.44% (a 29% relative gain), suggesting ExpVer provides the greatest benefit to models of moderate capability.
- Learned "truths" carry concrete semantics: early-stage agents learn "bold exploration is harmful," which evolves into "survival first" in later stages, demonstrating a human-like learning trajectory.
- Partial observability is the primary source of difficulty: providing a global view raises GPT-4.1's success rate from 73% to 93%, confirming that incomplete information is the main bottleneck.
- Match-2 is more challenging: baseline LLMs achieve only 33.7% average success rate, as the task demands precise spatial reasoning combined with multi-step lookahead planning.
## Highlights & Insights
- The cognitive cycle of "experience → verification → truth" is the most central contribution — rather than simple self-reflection (Reflexion only examines failure causes), this is a complete closed loop of "summarize → replay to verify → promote to reusable knowledge → deduplicate and refine."
- TruthWeaver's knowledge management addresses a practical issue: as learning rounds increase, the number of knowledge entries in the prompt grows continuously; without merging and deduplication, the context window becomes saturated with low-quality, redundant knowledge.
- The policy rollback mechanism is pragmatically motivated — non-monotonic learning (where early rounds may degrade performance due to erroneous inductions) requires a safeguard.
- The two games are well-complementary in design: the maze tests exploration-exploitation balance and risk management, while Match-2 tests precise coordination and long-term planning.
## Limitations & Future Work
- Performance is bounded by the capabilities of the base model; small models such as Llama-3.1-8B derive almost no benefit from ExpVer.
- Only two game types are included, lacking more complex scenarios such as temporal reasoning and multi-agent collaboration.
- The "truths" in Agent-ExpVer are prompt extensions in text form, constrained by context window size and thus unable to accumulate indefinitely.
- The game rules are relatively simple (9×9 grid, 8×8 board), which may not fully reflect the complexity of real-world spatial reasoning.
- Direct comparisons with other online learning methods (e.g., Voyager, CLIN) are absent.
## Related Work & Insights
- vs. SmartPlay: SmartPlay environments are static; EEB is dynamic with partial observability, making it more realistic.
- vs. Reflexion: Reflexion only performs failure reflection, whereas ExpVer implements a complete closed loop of "summarize → verify → truth induction → knowledge management" with cross-level transfer support.
- vs. Agent-Pro: Agent-Pro is limited to shallow-interaction settings such as poker and blackjack; EEB's maze and Match-2 require longer-horizon reasoning chains.
- Implications for agent research: Parameter-free online learning (pure prompt augmentation) may be a practical path for the continuous improvement of LLM agents.
## Rating
- Novelty: ⭐⭐⭐⭐ The dynamic benchmark design is novel, and the cognitive cycle underlying Agent-ExpVer has theoretical depth; however, the core techniques (prompt augmentation + experience verification) are relatively straightforward.
- Experimental Thoroughness: ⭐⭐⭐⭐ The evaluation covers 15+ models, ablation studies, human baselines, and learning process visualizations, though only two game types are included.
- Writing Quality: ⭐⭐⭐⭐ The citation of Jean Piaget is apt and tasteful; the method description is clear, and the appendix provides highly detailed reasoning process examples.
- Value: ⭐⭐⭐⭐ Dynamic spatial reasoning is a clearly identified weakness of LLMs; both the benchmark and the parameter-free learning method hold significant research value.