See and Think: Embodied Agent in Virtual Environment

Conference: ECCV 2024
arXiv: 2311.15209
Code: None
Area: Robotics
Keywords: Embodied Agent, Minecraft, Multi-modal LLM, Open-world, Skill Retrieval

TL;DR

This paper proposes STEVE, an open-world embodied agent in Minecraft based on three major components: visual perception, language instruction, and code action. By fine-tuning LLaMA-2 on the STEVE-21K dataset and combining it with a visual encoder and a skill database, STEVE significantly outperforms existing methods in tech tree unlocking and block search tasks.

Background & Motivation

Background: Open-world embodied agent research (e.g., in the Minecraft environment) has become an important testing ground for AI. Recently, LLM-driven agents (such as Voyager, DEPS) have demonstrated powerful planning capabilities.

Limitations of Prior Work: Current LLM-driven Minecraft agents rely primarily on text interaction and lack visual perception. Their outputs tend to be unpredictable and depend heavily on carefully engineered prompts; furthermore, text cannot naturally convey visual information such as crafting recipes.

Key Challenge: There is a need for a multimodal framework that can process visual inputs and textual reasoning simultaneously, and translate high-level planning into executable code.

Goal: To build a comprehensive Minecraft embodied agent that combines vision, language, and code.

Key Insight: Decompose the problem into three modules—visual perception (what to see), language instruction (how to plan), and code action (how to execute)—and construct a dedicated dataset, STEVE-21K, to support training.

Core Idea: Build a complete open-world embodied agent by perceiving the environment through a visual encoder, reasoning and decomposing tasks with an LLM fine-tuned for the Minecraft domain, and retrieving execution code from a skill database.

Method

Overall Architecture

STEVE is an LLM-based multimodal autonomous system that receives the visual state \(X^v\), agent state \(X^s\), and task description \(X^t\), and outputs executable code actions \(\mathbf{a}^c\). The overall formulation is: \(\mathbf{a}^c = \mathcal{F}(X^v, X^s, X^t) = \mathcal{A}^c(\mathcal{I}^l(\mathcal{P}^v(X^v, X^s, X^t)))\).
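The nested formulation above is plain function composition, which can be sketched as follows. This is a minimal illustration only: `Observation`, `compose`, and the three `toy_*` functions are hypothetical stand-ins, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    visual: str  # stands in for X^v (an image or video frame)
    state: str   # stands in for X^s (e.g. HP, inventory)
    task: str    # stands in for X^t (task description)


def compose(
    perceive: Callable[[Observation], List[str]],
    instruct: Callable[[List[str]], str],
    act: Callable[[str], str],
) -> Callable[[Observation], str]:
    """Build F = A^c(I^l(P^v(...))) from its three stages."""
    def agent(obs: Observation) -> str:
        return act(instruct(perceive(obs)))
    return agent


# Hypothetical toy stages, used only to show the data flow.
def toy_perceive(obs: Observation) -> List[str]:
    return [f"<img:{obs.visual}>", f"<state:{obs.state}>", f"<task:{obs.task}>"]


def toy_instruct(tokens: List[str]) -> str:
    return "plan for " + tokens[-1]


def toy_act(instruction: str) -> str:
    return f"code for: {instruction}"


steve = compose(toy_perceive, toy_instruct, toy_act)
print(steve(Observation("forest", "hp=20", "craft pickaxe")))
```

Keeping the three stages as separate callables mirrors the modular split described in the sections that follow.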

Key Designs

Vision Perception \(\mathcal{P}^v\)
- Function: Encodes the visual state (image/video), agent state (HP/inventory), and task description into a unified token representation.
- Mechanism: EfficientFormerV2-S0 is used as the visual encoder to encode images into \(n\) visual tokens of dimension \(d\), which are concatenated with the state and task tokens processed by the text tokenizer.
- Design Motivation: The textual context in Minecraft is insufficient to convey the visual characteristics of blocks and entities, necessitating direct visual perception.
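The token concatenation described above can be sketched as follows. The stub encoders, the whitespace tokenization, and the order of concatenation (visual, then state, then task) are all assumptions made for illustration; the real system uses EfficientFormerV2-S0 and the LLM's own tokenizer.

```python
from typing import List

Vector = List[float]


def encode_image_stub(n: int, d: int) -> List[Vector]:
    """Hypothetical stand-in for the visual encoder: n tokens of dimension d."""
    return [[0.0] * d for _ in range(n)]


def embed_text_stub(text: str, d: int) -> List[Vector]:
    """Hypothetical stand-in for tokenizer + embedding: one vector per word."""
    return [[float(len(word))] * d for word in text.split()]


def build_input_sequence(n: int, d: int, state_text: str, task_text: str) -> List[Vector]:
    """Concatenate visual, state, and task tokens into one sequence."""
    visual = encode_image_stub(n, d)
    state = embed_text_stub(state_text, d)
    task = embed_text_stub(task_text, d)
    return visual + state + task  # assumed order, for illustration only


sequence = build_input_sequence(4, 8, "hp 20 wood 3", "craft a wooden pickaxe")
print(len(sequence), len(sequence[0]))
```

The point of the sketch is that all three inputs end up as vectors of the same dimension \(d\), so the LLM can consume them as a single sequence.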
Language Instruction \(\mathcal{I}^l\)
- Function: Responsible for iterative reasoning and decomposing complex tasks into manageable steps.
- Mechanism: Comprises four independent LLM sub-modules: Planner (high-level planning), Critic (evaluation and feedback), Curriculum (progressive learning), and Describer (information summarization). These are based on STEVE-7B/13B (fine-tuned from LLaMA-2) and possess Minecraft domain expertise.
- Design Motivation: A single LLM struggles to perform planning, evaluation, and learning simultaneously; collaborative role-playing is more effective.
Code Action \(\mathcal{A}^c\)
- Function: Translates language instructions into executable code in Minecraft.
- Mechanism: Based on skill database retrieval. The database contains 210 skills across 8 categories, stored as skill-code pairs. Each instruction is encoded into a query vector and matched against the stored pairs using cosine similarity.
- Design Motivation: Code execution is more reliable than direct control, and skill database retrieval is more stable than generating code directly via LLMs.
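Retrieval by cosine similarity can be sketched as follows. The bag-of-words `embed` function and the three entries in `SKILLS` are hypothetical placeholders; the actual system would use a learned text encoder and the 210-skill database.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption: stands in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical skill descriptions mapped to placeholder code strings.
SKILLS: Dict[str, str] = {
    "mine a wood log": "<code: mineWoodLog>",
    "craft a wooden pickaxe": "<code: craftWoodenPickaxe>",
    "smelt iron ore": "<code: smeltIronOre>",
}


def retrieve(instruction: str, skills: Dict[str, str], top_k: int = 1) -> List[Tuple[float, str]]:
    """Return the top_k (score, code) pairs most similar to the instruction."""
    query = embed(instruction)
    scored = [(cosine(query, embed(desc)), code) for desc, code in skills.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


print(retrieve("craft a pickaxe from wood", SKILLS))
```

Because the agent only selects from stored code, a bad match yields the wrong skill rather than syntactically broken code, which is consistent with the stability argument above.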
Curriculum Learning with Memory
- Function: Progressively learns tasks from simple to complex, accumulating experience into memory.
- Mechanism: Performs curriculum tasks first to allow agent exploration, storing successful experiences. Uses the Chain of Summarization method to compress overly long memories, enabling gradient-free in-context lifelong learning.
- Design Motivation: Open-world tasks require progressive learning, and memory can reuse experiences to improve efficiency.
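One way to picture memory compression is the sketch below. It assumes that Chain of Summarization repeatedly folds the oldest entries into a single summary once the memory exceeds a budget; the summary does not spell out the procedure, so this is an assumption. `Memory`, `truncating_summarizer`, and the character budget are hypothetical, and a real system would call an LLM to summarize.

```python
from typing import Callable, List


class Memory:
    """Stores successful experiences and compresses them when too long."""

    def __init__(self, summarize: Callable[[str, str], str], max_chars: int = 200):
        self.entries: List[str] = []
        self.summarize = summarize
        self.max_chars = max_chars

    def size(self) -> int:
        return sum(len(entry) for entry in self.entries)

    def add_success(self, experience: str) -> None:
        self.entries.append(experience)
        # Fold the two oldest entries into one summary until within budget.
        while len(self.entries) > 1 and self.size() > self.max_chars:
            merged = self.summarize(self.entries[0], self.entries[1])
            self.entries[:2] = [merged]

    def as_context(self) -> str:
        """Text to prepend to the LLM prompt; no weights are updated."""
        return "\n".join(self.entries)


def truncating_summarizer(older: str, newer: str, limit: int = 60) -> str:
    """Hypothetical stand-in for an LLM summarizer: join and truncate."""
    return (older + " | " + newer)[:limit]


memory = Memory(truncating_summarizer, max_chars=100)
for step in range(5):
    memory.add_success(f"task {step}: succeeded after gathering materials")
print(memory.as_context())
```

Since experience enters only through the prompt context, learning here is gradient-free, as the mechanism above describes.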

Loss & Training

Two-Stage Training:
- Stage 1 (Offline Warm-up): Fine-tunes LLaMA-2 using LoRA on the STEVE-21K dataset containing 20K single-round QA pairs.
- Stage 2 (Online Fine-tuning): Simultaneously trains the visual encoder and fine-tunes the LLM under simulation, utilizing instructions generated by an Expert LLM (GPT-4) as ground truth.
Visual Encoder Training: Class/block/entity labels are retrieved within the FOV using Ray Tracing, and context data from successful runs are collected after 5,000 simulations.
Loss Function: Negative log-likelihood objective: \(\mathcal{L}(\theta) = -\sum_{j=1}^{L} \log \mathcal{F}_\theta(Y_j | X^v, \hat{Y}_{1:j-1})\)
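As a small worked example of the negative log-likelihood objective: the loss is the sum of \(-\log p\) over the target tokens, where each \(p\) is the probability the model assigns to the correct token given the visual input and the preceding tokens. The probabilities below are made-up numbers for illustration, not results from the paper.

```python
import math
from typing import List


def negative_log_likelihood(token_probs: List[float]) -> float:
    """Sum of -log p over the probabilities assigned to each target token."""
    return -sum(math.log(p) for p in token_probs)


# Hypothetical probabilities of the three correct tokens Y_1, Y_2, Y_3.
probs = [0.9, 0.5, 0.8]
loss = negative_log_likelihood(probs)
print(round(loss, 2))
```

Here the loss equals \(-\log(0.9 \cdot 0.5 \cdot 0.8) = -\log 0.36 \approx 1.02\); it approaches zero as the model becomes more confident in every correct token.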

Key Experimental Results

Main Results

Tech Tree Unlocking Task Comparison (lower iteration count is better, 3/3 indicates full success in three runs):

Method	Wooden Tool	Stone Tool	Iron Tool	Diamond Tool
AutoGPT	92±72 (3/3)	94±72 (3/3)	135±103 (3/3)	N/A (0/3)
Voyager	6±2 (3/3)	11±2 (3/3)	21±7 (3/3)	102 (1/3)
STEVE	4±1 (3/3)	8±1 (3/3)	15±2 (3/3)	106±12 (3/3)

Continuous Block Search Task:

Method	Avg. Iterations↓	Avg. Diamonds Found↑
AutoGPT	N/A	7
Voyager	35	26
STEVE	14	67

Ablation Study

Ablation study (tech tree task):

Method	Wooden Tool	Stone Tool	Iron Tool	Diamond Tool
w/o vision unit	11±5 (3/3)	27±5 (3/3)	46±11 (3/3)	158 (1/3)
STEVE (GPT-4)	6±2 (3/3)	10±1 (3/3)	14±3 (3/3)	89±9 (3/3)
STEVE (Ours-13B)	4±1 (3/3)	8±1 (3/3)	15±2 (3/3)	106±12 (3/3)

Key Findings

STEVE-13B needs fewer iterations than every other method on simple tasks (wooden and stone tools), including the GPT-4-based variant of STEVE.
The visual unit is critical for complex tasks like the Diamond Tool (the success rate drops from 3/3 to 1/3 upon removal).
STEVE-13B scores 8.12 in QA, outperforming GPT-4's 8.04, which demonstrates the value of domain-specific fine-tuning.
In continuous block search, STEVE with visual perception needs 2.5x fewer iterations than Voyager (14 vs. 35).
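The search-efficiency figure can be checked directly against the Continuous Block Search table. The short calculation below uses only the numbers reported there; the variable names are illustrative.

```python
# Values copied from the Continuous Block Search table.
voyager = {"iterations": 35, "diamonds": 26}
steve = {"iterations": 14, "diamonds": 67}

# How many times fewer iterations STEVE needs per block found.
iteration_ratio = voyager["iterations"] / steve["iterations"]

# How many times more diamonds STEVE finds.
diamond_ratio = steve["diamonds"] / voyager["diamonds"]

print(iteration_ratio, round(diamond_ratio, 2))
```

The iteration ratio is exactly 2.5, and the diamond count improves by a similar factor (about 2.58).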

Highlights & Insights

Complete Trinity Framework: The see-think-act paradigm design is clear and natural, with a well-defined division of labor among modules.
Dataset Construction: STEVE-21K contains vision-environment pairs, QA pairs, and skill-code pairs, providing a comprehensive resource for Minecraft AI training.
Smaller Models Outperforming Larger Models: Domain-fine-tuned STEVE-13B outperforms the general-purpose GPT-4 in Minecraft knowledge.
Curriculum Learning + Memory: The gradient-free, in-context lifelong learning paradigm is highly practical.

Limitations & Future Work

Evaluated only in the Minecraft environment; generalizability to other open-world scenarios remains unverified.
The skill database is manually crafted and holds only 210 skills, which limits its scalability.
The visual encoder (EfficientFormerV2-S0) is relatively lightweight; stronger vision models were not explored.
On the Diamond Tool task, STEVE-13B needs slightly more iterations (106±12) than Voyager with GPT-4 (102), although Voyager succeeds in only one of three runs.
Relies on GPT-4 to generate ground truth for online fine-tuning, which incurs high costs.

STEVE is positioned similarly to Voyager but emphasizes multimodal input, filling the visual-perception gap in Minecraft agents.
The skill retrieval concept in Code Action is similar to tool learning in HuggingGPT, but more tailored to Minecraft.
Insights from DEPS (multi-step reasoning) and GITM (structured actions) are both reflected in STEVE.
Insight: The combination of domain-specific fine-tuning, visual perception, and structured execution is an effective paradigm for building embodied agents.

Rating

⭐⭐⭐ Novelty: The framework is an integration of existing components; technical novelty of individual modules is limited.
⭐⭐⭐ Experimental Thoroughness: The number of baselines is small (only AutoGPT and Voyager), and more ablation studies are needed.
⭐⭐⭐ Writing Quality: Generally readable, but the mathematical notation is somewhat redundant.
⭐⭐⭐⭐ Value: The STEVE-21K dataset and the complete framework design are valuable resources for the Minecraft AI community.