# Experience-based Knowledge Correction for Robust Planning in Minecraft
Conference: ICLR 2026 · arXiv: 2505.24157 · Code: None · Area: Robotics · Keywords: LLM planning, knowledge correction, Minecraft, embodied agent, self-correction failure
## TL;DR
This paper demonstrates that LLMs cannot self-correct erroneous planning priors (item dependency relations) through prompting alone, and proposes XENON — an algorithmic knowledge management framework consisting of an Adaptive Dependency Graph (ADG) and Failure-Aware Action Memory (FAM) that learns from binary feedback. XENON enables a 7B LLM to surpass the SOTA that uses GPT-4V with oracle knowledge on long-horizon planning tasks in Minecraft.
## Background & Motivation
Background: LLM-driven agents tackling long-horizon planning tasks in environments such as Minecraft require accurate item dependency knowledge (e.g., a diamond pickaxe requires diamonds and sticks), yet the parametric knowledge encoded in LLMs is frequently erroneous.
Limitations of Prior Work: Self-correction — prompting an LLM to reflect on and revise its knowledge — is ineffective for parametric knowledge errors. LLMs repeatedly make the same mistakes because the errors are encoded in model weights and cannot be altered through prompting.
Key Challenge: LLMs possess strong language understanding but unreliable factual knowledge; external mechanisms, rather than prompting, are required to correct knowledge errors.
Goal: How can an agent algorithmically correct an LLM's planning knowledge given only binary feedback (success/failure)?
Key Insight: Shifting knowledge correction from "letting the LLM correct itself" to "algorithmically updating an external knowledge base."
Core Idea: Algorithmic knowledge management — correcting dependency graphs from successful experiences and filtering ineffective actions from failure experiences — outperforms LLM self-correction.
## Method
### Overall Architecture
XENON = Adaptive Dependency Graph (ADG) + Failure-Aware Action Memory (FAM) + Context-aware Reprompting (CRe). ADG learns item dependency relations, FAM learns which actions are effective or ineffective, and CRe helps low-level controllers recover from stuck states.
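To make the architecture concrete, below is a minimal sketch of how the three components might compose in an agent loop. All interfaces (`llm.plan`, `adg.subgraph`, `controller`, `cre`, etc.) are hypothetical illustrations for this note, not the paper's actual API.

```python
# Hypothetical sketch of a XENON-style agent loop (illustrative interfaces only):
# the LLM plans over the ADG's current dependency beliefs, FAM filters actions
# known to be ineffective, and CRe reprompts the controller when it stagnates.

def run_episode(goal, llm, adg, fam, controller, cre):
    # Plan against the learned dependency graph rather than raw parametric knowledge.
    plan = llm.plan(goal, dependencies=adg.subgraph(goal))
    for action in plan:
        if fam.is_ineffective(action):
            continue  # skip actions that have repeatedly failed before
        success = controller.execute(action)
        if not success and cre.is_stagnant(controller.state_history()):
            controller.reprompt(action)  # CRe: interrupt and re-prompt the controller
            success = controller.execute(action)
        fam.update(action, success)  # only binary success/failure feedback is required
        if success:
            # ADG: revise or confirm dependency edges from the observed inventory.
            adg.revise_by_analogy(action, controller.inventory())
    return goal in controller.inventory()
```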
### Key Designs

- Adaptive Dependency Graph (ADG):
  - Function: corrects erroneous item dependency relations in the LLM's knowledge using successful experiences.
  - Core algorithm (RevisionByAnalogy): when the agent successfully acquires item \(X\), it observes the inventory item set, compares it against known dependencies, and revises or confirms dependency edges via analogical reasoning.
  - Handling hallucinated items: RevisionByAnalogy identifies non-existent items through actual experience and removes them from the graph.
  - Performance: reaches ~0.90 dependency accuracy on Mineflayer after 400 rounds.
- Failure-Aware Action Memory (FAM):
  - Function: learns which actions are effective or ineffective from binary feedback.
  - Mechanism: each action maintains success and failure counts; once a threshold is exceeded, the action is classified as "experientially effective" or "experientially ineffective".
  - Ineffective actions are filtered out in subsequent planning to prevent repeated failures (see the sketch after this list).
- Context-aware Reprompting (CRe):
  - Function: re-prompts the low-level controller (e.g., STEVE-1) when it becomes stuck during execution.
  - Mechanism: detects environmental state stagnation and actively interrupts execution to trigger replanning.
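As referenced above, here is a minimal, self-contained sketch of FAM's counting logic plus a CRe-style stagnation check. The `threshold` and `window` values are illustrative assumptions (the paper treats such quantities as hyperparameters), and ADG's RevisionByAnalogy is omitted since its analogical update is more involved than simple counting.

```python
from collections import defaultdict

class FailureAwareActionMemory:
    """Classifies actions from binary feedback via success/failure counts.
    The threshold is an illustrative assumption, not the paper's value."""

    def __init__(self, threshold=3):
        self.successes = defaultdict(int)
        self.failures = defaultdict(int)
        self.threshold = threshold

    def update(self, action, success):
        if success:
            self.successes[action] += 1
        else:
            self.failures[action] += 1

    def is_effective(self, action):
        return self.successes[action] >= self.threshold

    def is_ineffective(self, action):
        # Never succeeded and failed often enough: filter from future plans.
        return self.successes[action] == 0 and self.failures[action] >= self.threshold

def is_stagnant(state_history, window=5):
    """CRe-style stagnation check: if the last `window` observed states are
    identical, the controller is likely stuck and should be reprompted."""
    if len(state_history) < window:
        return False
    recent = state_history[-window:]
    return all(s == recent[0] for s in recent)
```

Filtering then amounts to dropping any planned action for which `is_ineffective(action)` returns `True` before it is sent to the controller.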
## Key Experimental Results
### Long-horizon Planning Success Rate (Learned vs. Oracle Knowledge)
| Goal Type | Oracle Knowledge SR | Learned Knowledge SR |
|---|---|---|
| Gold items | 0.83 | 0.74 |
| Diamond items | 0.82 | 0.64 |
| Redstone items | 0.75 | 0.28 |
| Overall | 0.80 | 0.54 |
### Dependency Learning Accuracy (EGA)
| Platform | After 400 Rounds |
|---|---|
| MineRL | ~0.60 |
| Mineflayer | ~0.90 |
### Model Comparison
- 7B Qwen2.5-VL + XENON outperforms Optimus-1 (GPT-4V + oracle knowledge) across multiple goal categories.
## Key Findings
- Accurate dependency knowledge is critical for successful planning: Redstone goals achieve 0.75 SR with oracle knowledge but only 0.28 with learned knowledge, with certain Redstone tasks failing completely due to controller capability limitations.
- XENON is robust to hallucinated items generated by LLMs, identifying and removing them via RevisionByAnalogy.
- LLM self-correction through prompting fails across all baselines and cannot correct parametric knowledge errors.
## Highlights & Insights
- Empirical evidence that LLMs cannot self-correct parametric knowledge: This finding carries significant implications for LLM agent design — prompt-based self-correction should not be relied upon to fix factual knowledge errors.
- Algorithm > Prompting paradigm: When the root cause is knowledge error rather than reasoning error, algorithmic correction (external memory + statistical updates) substantially outperforms natural language reflection.
- Small model + effective knowledge management > large model + oracle knowledge: a 7B model with XENON-learned knowledge outperforms GPT-4V with oracle dependencies, indicating that the knowledge management strategy matters more than model scale.
## Limitations & Future Work
- Performance is bounded by the capability of the underlying controller — STEVE-1's inability to execute certain complex actions causes complete failure on Redstone-category goals.
- RevisionByAnalogy involves multiple hyperparameters that require tuning.
- Evaluation is conducted solely in Minecraft (with preliminary household task experiments in the appendix).
- The framework assumes item dependencies form a directed acyclic graph (DAG).
## Related Work & Insights
- vs. Optimus-1: Optimus-1 relies on GPT-4V with oracle dependencies, whereas XENON achieves superior performance across multiple goal categories using a 7B model with learned dependencies.
- vs. Voyager/DEPS: These Minecraft agents rely on LLM prompting but do not correct knowledge errors.
## Rating
- Novelty: ⭐⭐⭐⭐ The concept of replacing self-correction with algorithmic knowledge management is original and compelling
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-platform × multi-goal-type × detailed ablation studies
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and well-structured
- Value: ⭐⭐⭐⭐ Provides important paradigmatic insights for knowledge management in LLM agents