See and Think: Embodied Agent in Virtual Environment
Conference: ECCV 2024
arXiv: 2311.15209
Code: None
Area: Robotics
Keywords: Embodied Agent, Minecraft, Multi-modal LLM, Open-world, Skill Retrieval
TL;DR
This paper proposes STEVE, an open-world embodied agent in Minecraft based on three major components: visual perception, language instruction, and code action. By fine-tuning LLaMA-2 on the STEVE-21K dataset and combining it with a visual encoder and a skill database, STEVE significantly outperforms existing methods in tech tree unlocking and block search tasks.
Background & Motivation
Background: Open-world embodied agent research (e.g., in the Minecraft environment) has become an important testing ground for AI. Recently, LLM-driven agents (such as Voyager, DEPS) have demonstrated powerful planning capabilities.
Limitations of Prior Work: Current LLM-driven Minecraft agents interact mainly through text and lack visual perception. Their outputs tend to be unpredictable and depend heavily on carefully crafted prompts. Text also cannot naturally convey visual information such as crafting recipes.
Key Challenge: There is a need for a multimodal framework that can process visual inputs and textual reasoning simultaneously, and translate high-level planning into executable code.
Goal: To build a comprehensive Minecraft embodied agent that combines vision, language, and code.
Key Insight: Decompose the problem into three modules: visual perception (what to see), language instruction (how to plan), and code action (how to execute). A dedicated dataset, STEVE-21K, is constructed to support training.
Core Idea: Build a complete open-world embodied agent by perceiving the environment through a visual encoder, reasoning and decomposing tasks with an LLM fine-tuned for the Minecraft domain, and retrieving execution code from a skill database.
Method
Overall Architecture
STEVE is an LLM-based multimodal autonomous system that receives the visual state \(X^v\), agent state \(X^s\), and task description \(X^t\), and outputs executable code actions \(\mathbf{a}^c\). The overall formulation is: \(\mathbf{a}^c = \mathcal{F}(X^v, X^s, X^t) = \mathcal{A}^c(\mathcal{I}^l(\mathcal{P}^v(X^v, X^s, X^t)))\).
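The composition \(\mathcal{A}^c(\mathcal{I}^l(\mathcal{P}^v(\cdot)))\) can be read as three functions applied in sequence. The following is a minimal sketch of that data flow only. The `Observation` class, the whitespace tokenizer, the fixed rule inside `instruct`, and the `run_skill` string are all hypothetical placeholders, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    visual_state: str  # stand-in for X^v (an image or video in the paper)
    agent_state: str   # X^s, e.g. health and inventory
    task: str          # X^t, the task description


def perceive(obs: Observation) -> List[str]:
    """P^v: map all three inputs to one token sequence (toy whitespace tokenizer)."""
    return obs.visual_state.split() + obs.agent_state.split() + obs.task.split()


def instruct(tokens: List[str]) -> str:
    """I^l: placeholder for the fine-tuned LLM; here a fixed rule on the last token."""
    return f"obtain {tokens[-1]}"


def act(instruction: str) -> str:
    """A^c: placeholder for skill retrieval; returns a hypothetical code string."""
    return f"run_skill({instruction!r})"


def steve(obs: Observation) -> str:
    """a^c = A^c(I^l(P^v(X^v, X^s, X^t)))"""
    return act(instruct(perceive(obs)))


obs = Observation("oak_tree grass", "hp=20 inventory=empty", "craft wooden_pickaxe")
code_action = steve(obs)
print(code_action)
```

Keeping the three stages as separate functions mirrors the formulation: each stage can be swapped out (for example, a different visual encoder) without touching the other two.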
Key Designs
- Vision Perception \(\mathcal{P}^v\)
  - Function: Encodes the visual state (image/video), agent state (HP/inventory), and task description into a unified token representation.
  - Mechanism: EfficientFormerV2-S0 serves as the visual encoder and encodes images into \(n\) visual tokens of dimension \(d\). These are concatenated with the state and task tokens produced by the text tokenizer.
  - Design Motivation: The textual context in Minecraft is insufficient to convey the visual characteristics of blocks and entities, so direct visual perception is needed.
- Language Instruction \(\mathcal{I}^l\)
  - Function: Performs iterative reasoning and decomposes complex tasks into manageable steps.
  - Mechanism: Comprises four independent LLM sub-modules: Planner (high-level planning), Critic (evaluation and feedback), Curriculum (progressive learning), and Describer (information summarization). All four are based on STEVE-7B/13B (fine-tuned from LLaMA-2) and carry Minecraft domain expertise.
  - Design Motivation: A single LLM struggles to perform planning, evaluation, and learning simultaneously; collaborative role-playing is more effective.
- Code Action \(\mathcal{A}^c\)
  - Function: Translates language instructions into executable code in Minecraft.
  - Mechanism: Retrieval from a skill database that contains 210 skills across 8 categories. Each instruction is encoded into a query vector and matched against the skill-code pairs in the database by cosine similarity.
  - Design Motivation: Code execution is more reliable than direct control, and skill database retrieval is more stable than generating code directly via LLMs.
- Curriculum Learning with Memory
  - Function: Progressively learns tasks from simple to complex, accumulating experience into memory.
  - Mechanism: The agent first performs curriculum tasks to explore, and successful experiences are stored. The Chain of Summarization method compresses overly long memories, enabling gradient-free in-context lifelong learning.
  - Design Motivation: Open-world tasks require progressive learning, and memory can reuse experiences to improve efficiency.
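The retrieval step of the Code Action module can be illustrated with a small cosine-similarity lookup. This is a sketch under stated assumptions: the bag-of-words `embed` function stands in for whatever encoder the paper uses, and the three skill descriptions and function names in `SKILLS` are invented for illustration (the real database holds 210 skills).

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stands in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical skill database: description -> code identifier.
SKILLS: Dict[str, str] = {
    "mine a wood log from a tree": "mine_wood_log()",
    "craft a wooden pickaxe": "craft_wooden_pickaxe()",
    "smelt iron ore in a furnace": "smelt_iron_ore()",
}


def retrieve(instruction: str, skills: Dict[str, str], top_k: int = 1) -> List[Tuple[float, str]]:
    """Return the top_k (score, code) pairs most similar to the instruction."""
    query = embed(instruction)
    scored = [(cosine(query, embed(desc)), code) for desc, code in skills.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


best_score, best_code = retrieve("craft a pickaxe from wooden planks", SKILLS)[0]
print(best_code)
```

Because the agent only selects among stored skill-code pairs, every action it emits is code that already exists in the database, which is the stability argument given in the design motivation.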
Loss & Training
- Two-Stage Training:
  - Stage 1 (Offline Warm-up): Fine-tunes LLaMA-2 using LoRA on the STEVE-21K dataset containing 20K single-round QA pairs.
  - Stage 2 (Online Fine-tuning): Simultaneously trains the visual encoder and fine-tunes the LLM under simulation, utilizing instructions generated by an Expert LLM (GPT-4) as ground truth.
- Visual Encoder Training: Labels for classes, blocks, and entities within the field of view (FOV) are obtained by ray tracing, and context data from successful runs are collected after 5,000 simulations.
- Loss Function: Negative log-likelihood objective: \(\mathcal{L}(\theta) = -\sum_{j=1}^{L} \log \mathcal{F}_\theta(Y_j | X^v, \hat{Y}_{1:j-1})\)
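As a worked instance of the negative log-likelihood objective, the sketch below sums \(-\log\) of the probability assigned to each ground-truth token. It assumes the per-step distributions are already given (in training they would come from \(\mathcal{F}_\theta\) conditioned on \(X^v\) and the previous ground-truth tokens); the function name and the toy numbers are illustrative only.

```python
import math
from typing import Sequence


def sequence_nll(step_probs: Sequence[Sequence[float]], target_ids: Sequence[int]) -> float:
    """Sum over j of -log p(Y_j | X^v, Y_{1:j-1}).

    step_probs[j] is the model's distribution over the vocabulary at step j,
    and target_ids[j] is the index of the ground-truth token at that step.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("need one distribution per target token")
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, target_ids))


# Toy example: vocabulary of 3 tokens, target sequence of length L = 2.
step_probs = [
    [0.5, 0.25, 0.25],  # step 1: ground-truth token 0 has probability 0.5
    [0.1, 0.8, 0.1],    # step 2: ground-truth token 1 has probability 0.8
]
loss = sequence_nll(step_probs, [0, 1])
print(round(loss, 4))  # -(ln 0.5 + ln 0.8) = -ln 0.4, about 0.9163
```

The loss reaches zero only when every ground-truth token receives probability 1, so minimizing it pushes the model toward the expert's instructions.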
Key Experimental Results
Main Results
Tech Tree Unlocking Task Comparison (cells give the number of iterations, lower is better; the fraction in parentheses is the number of successful runs out of three):
| Method | Wooden Tool | Stone Tool | Iron Tool | Diamond Tool |
|---|---|---|---|---|
| AutoGPT | 92±72 (3/3) | 94±72 (3/3) | 135±103 (3/3) | N/A (0/3) |
| Voyager | 6±2 (3/3) | 11±2 (3/3) | 21±7 (3/3) | 102 (1/3) |
| STEVE | 4±1 (3/3) | 8±1 (3/3) | 15±2 (3/3) | 106±12 (3/3) |
Continuous Block Search Task:
| Method | Avg. Iterations↓ | Avg. Diamonds Found↑ |
|---|---|---|
| AutoGPT | N/A | 7 |
| Voyager | 35 | 26 |
| STEVE | 14 | 67 |
Ablation Study
Ablation study (tech tree task):
| Method | Wooden Tool | Stone Tool | Iron Tool | Diamond Tool |
|---|---|---|---|---|
| w/o vision unit | 11±5 (3/3) | 27±5 (3/3) | 46±11 (3/3) | 158 (1/3) |
| STEVE (GPT-4) | 6±2 (3/3) | 10±1 (3/3) | 14±3 (3/3) | 89±9 (3/3) |
| STEVE (Ours-13B) | 4±1 (3/3) | 8±1 (3/3) | 15±2 (3/3) | 106±12 (3/3) |
Key Findings
- STEVE-13B needs the fewest iterations on simple tasks (wooden/stone tools), ahead of both baselines and of the GPT-4-based variant (4 vs. 6 and 8 vs. 10 iterations).
- The vision unit is critical for complex tasks such as the Diamond Tool: removing it lowers the success rate from 3/3 to 1/3 and raises the iteration count from 106 to 158.
- STEVE-13B scores 8.12 in QA, outperforming GPT-4's 8.04, which demonstrates the value of domain-specific fine-tuning.
- With visual perception, STEVE needs 2.5x fewer iterations than Voyager on the block search task (14 vs. 35).
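The efficiency figure in the last finding follows directly from the Continuous Block Search table. A short check of the arithmetic, using only the table values (the dictionary layout is just for illustration):

```python
# Values copied from the Continuous Block Search table.
voyager = {"iterations": 35, "diamonds": 26}
steve = {"iterations": 14, "diamonds": 67}

# Fewer iterations is better, so divide the baseline by STEVE.
iteration_ratio = voyager["iterations"] / steve["iterations"]

# More diamonds is better, so divide STEVE by the baseline.
diamond_ratio = steve["diamonds"] / voyager["diamonds"]

print(iteration_ratio)          # 2.5
print(round(diamond_ratio, 2))  # 2.58
```

Both metrics therefore point to roughly the same factor in STEVE's favor.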
Highlights & Insights
- Complete Trinity Framework: The see-think-act paradigm design is clear and natural, with a well-defined division of labor among modules.
- Dataset Construction: STEVE-21K contains vision-environment pairs, QA pairs, and skill-code pairs, providing a comprehensive resource for Minecraft AI training.
- Smaller Models Outperforming Larger Models: Domain-fine-tuned STEVE-13B outperforms the general-purpose GPT-4 in Minecraft knowledge.
- Curriculum Learning + Memory: The gradient-free, in-context lifelong learning paradigm is highly practical.
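To make the gradient-free memory idea concrete, here is a minimal sketch of an experience store that compresses itself when it grows past a budget. The folding rule (merge the two oldest entries and summarize them) is an assumption about how a Chain of Summarization could work, not the paper's exact procedure, and `truncate_summarizer` is a hypothetical stand-in for an LLM summarizer.

```python
from typing import Callable, List


def truncate_summarizer(text: str, limit: int = 60) -> str:
    """Hypothetical stand-in for an LLM summarizer: keeps the first `limit` characters."""
    return text if len(text) <= limit else text[: limit - 3] + "..."


class ExperienceMemory:
    """Stores successful experiences as text and compresses them in context."""

    def __init__(self, budget_chars: int, summarize: Callable[[str], str]):
        self.budget_chars = budget_chars
        self.summarize = summarize
        self.entries: List[str] = []

    def size(self) -> int:
        return sum(len(entry) for entry in self.entries)

    def add_success(self, task: str, trace: str) -> None:
        """Append a successful experience, then compress while over budget."""
        self.entries.append(f"{task}: {trace}")
        # Fold the two oldest entries into one summary until the memory fits.
        # The loop ends once a single entry remains, even if it is still too long.
        while self.size() > self.budget_chars and len(self.entries) > 1:
            merged = self.entries[0] + " | " + self.entries[1]
            self.entries[:2] = [self.summarize(merged)]

    def as_prompt(self) -> str:
        """Text that would be placed in the LLM context; no weights are updated."""
        return "\n".join(self.entries)


memory = ExperienceMemory(budget_chars=80, summarize=truncate_summarizer)
memory.add_success("mine wood", "found oak tree, mined 3 logs")
memory.add_success("craft planks", "crafted 12 oak planks from logs")
memory.add_success("craft pickaxe", "used planks and sticks at table")
print(memory.as_prompt())
```

All learning here happens by editing the text that is fed back into the model, which is what makes the approach gradient-free.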
Limitations & Future Work
- Evaluated only in the Minecraft environment; generalizability to other open-world scenarios remains unverified.
- The skill database is manually crafted and contains 210 skills, which limits its scalability.
- The visual encoder is relatively simple (EfficientFormerV2-S0); stronger vision models were not explored.
- On the Diamond Tool task, STEVE-13B still needs more iterations (106±12) than the GPT-4-based STEVE variant (89±9) and than Voyager's single successful run (102, 1/3 success).
- Relies on GPT-4 to generate ground truth for online fine-tuning, which incurs high costs.
Related Work & Insights
- Positioned similarly to Voyager but emphasizes multimodal inputs, filling the visual perception gap in Minecraft AI.
- The skill retrieval concept in Code Action is similar to tool learning in HuggingGPT, but more tailored to Minecraft.
- Insights from DEPS (multi-step reasoning) and GITM (structured actions) are both reflected in STEVE.
- Insight: The combination of domain-specific fine-tuning, visual perception, and structured execution is an effective paradigm for building embodied agents.
Rating
- ⭐⭐⭐ Novelty: The framework is an integration of existing components; technical novelty of individual modules is limited.
- ⭐⭐⭐ Experimental Thoroughness: The number of baselines is small (only AutoGPT and Voyager), and more ablation studies are needed.
- ⭐⭐⭐ Writing Quality: Generally readable, but the mathematical notation is somewhat redundant.
- ⭐⭐⭐⭐ Value: The STEVE-21K dataset and the complete framework design are valuable resources for the Minecraft AI community.