Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy

Conference: CVPR 2025
arXiv: 2502.19902
Code: https://cybertronagent.github.io/Optimus-2.github.io/
Area: Multimodal Large Models / Embodied AI
Keywords: Minecraft Agent, Multimodal Large Language Model, Behavior Cloning, Goal-Conditioned Policy, Observation-Action Causal Modeling, GOAP, MGOA Dataset

TL;DR

This paper proposes Optimus-2, which utilizes MLLMs for high-level planning combined with a Goal-Observation-Action Conditioned Policy (GOAP) for low-level control. Within this framework, GOAP models the causal relationship between observations and actions using an Action-guided Behavior Encoder, and aligns behavior tokens with language instructions using an MLLM. It achieves average improvements of 27% on Minecraft atomic tasks, 10% on long-horizon tasks, and 18% on open-ended instruction tasks.

Background & Motivation

Background: As a representative benchmark for open-world environments, Minecraft has inspired a wealth of research on intelligent agents. The current mainstream framework adopts a two-tier "planner + policy" architecture, where an MLLM serves as the planner to decompose complex tasks into sequences of subgoals, while goal-conditioned policies (e.g., STEVE-1, GROOT) execute specific low-level control actions.

Limitations of Prior Work:

- Existing policies ignore the causal relationship between observations and actions: the current observation is generated by the interaction of the previous action with the environment, yet current policies only model the relationship between subgoals and the current observation (simply adding goal embeddings to visual features).
- Existing policies have limited ability to comprehend open-ended natural language subgoals: for instance, STEVE-1 uses MineCLIP as a goal encoder while GROOT uses a video encoder, resulting in implicit goal embeddings that lack sufficient expressiveness.
- There is a lack of large-scale, high-quality goal-observation-action aligned datasets: VPT data lacks language instructions, and STEVE-1 data contains only 32K aligned samples.

Key Challenge: Policies need to simultaneously understand "where to go" (subgoal semantics) and "how to get there" (temporal dependencies of the observation-action sequence), but existing methods face bottlenecks in both dimensions.

Goal: To design a policy network capable of simultaneously modeling observation-action causal relationships and understanding open-ended language instructions, and to construct a large-scale training dataset.

Key Insight: Introduce an MLLM as the policy backbone to comprehend open-ended instructions, while designing a specialized encoder to capture the temporal causal relationships of observation-action historical trajectories.

Core Idea: Use an Action-guided Behavior Encoder to compress observation-action sequences into fixed-length "behavior tokens," and then prompt the MLLM to align these behavior tokens with subgoal instructions in the language space to predict actions.
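To make "fixed-length" concrete, the following is a minimal sketch of a bounded history buffer. It is not the paper's Memory Bank: the source does not state the compression rule, so the rule used here (averaging the two oldest entries whenever the buffer exceeds its capacity) is purely an assumption, and `compress_history` and `capacity` are hypothetical names.

```python
from typing import List

Token = List[float]


def compress_history(memory: List[Token], new_token: Token, capacity: int) -> List[Token]:
    """Append a token, then shrink the buffer back to `capacity` entries.

    ASSUMPTION: the merge rule (average the two oldest entries) is chosen
    only for illustration; the paper's actual rule is not given in the text.
    """
    memory = memory + [new_token]  # copy, so the caller's list is not mutated
    while len(memory) > capacity:
        merged = [(a + b) / 2 for a, b in zip(memory[0], memory[1])]
        memory = [merged] + memory[2:]
    return memory


# Feed three one-dimensional "behavior tokens" into a buffer of capacity 2.
history: List[Token] = []
for token in ([1.0], [3.0], [5.0]):
    history = compress_history(history, token, capacity=2)
```

However long the trajectory grows, the buffer never holds more than `capacity` entries, which is the property that keeps the downstream model's input length bounded.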

Method

Overall Architecture

Optimus-2 adopts a planner-policy architecture: an MLLM planner (GPT-4V) decomposes complex tasks into sequences of subgoals, and the GOAP policy sequentially executes each subgoal. GOAP consists of two main components: the Action-guided Behavior Encoder and the MLLM backbone.
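The planner-policy split described above can be sketched as a simple control loop. This is an illustrative skeleton under stated assumptions, not the Optimus-2 code: `run_agent`, `stub_planner`, `stub_policy_step`, and `max_steps` are hypothetical names, the planner is a hard-coded stand-in for the MLLM, and the policy is a stub that reports success on its second step.

```python
from typing import Callable, List, Tuple


def run_agent(
    task: str,
    planner: Callable[[str], List[str]],
    policy_step: Callable[[str, int], bool],
    max_steps: int = 5,
) -> List[Tuple[str, int, bool]]:
    """Decompose a task into subgoals and execute them one after another.

    Returns one (subgoal, steps_taken, succeeded) record per attempted subgoal.
    """
    log = []
    for subgoal in planner(task):
        done, steps = False, 0
        while not done and steps < max_steps:
            done = policy_step(subgoal, steps)  # one low-level control step
            steps += 1
        log.append((subgoal, steps, done))
        if not done:
            break  # ASSUMPTION: a failed subgoal blocks the ones after it
    return log


def stub_planner(task: str) -> List[str]:
    # Hypothetical fixed decomposition; a real planner would query an MLLM.
    return ["chop a tree", "craft planks", "craft a wooden pickaxe"]


def stub_policy_step(subgoal: str, step: int) -> bool:
    # Hypothetical policy: every subgoal succeeds on its second step.
    return step >= 1


log = run_agent("craft a wooden pickaxe", stub_planner, stub_policy_step)
```

Keeping the planner and the policy behind plain callables mirrors the two-tier design: either side can be swapped out without touching the loop.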

Key Designs

Action-guided Behavior Encoder: It contains two sub-modules. Causal Perceiver: At each timestep, action embeddings (key/value) are injected into visual features (query) via cross-attention to explicitly model the \(\text{action} \to \text{observation}\) causal relationship, enriching visual representations with task-relevant action information. History Aggregator: It introduces fixed-length "behavior tokens" that interact with the historical behavior token sequence through a history attention layer. Combined with a Memory Bank, it dynamically aggregates and compresses long-term history, capturing long-range temporal dependencies while keeping the input length within the model's context limit.
MLLM as Policy Backbone: The backbone is initialized with DeepSeek-VL-1.3B. Its input consists of the subgoal text, visual tokens of the current observation, and behavior tokens. The MLLM leverages its language understanding capabilities to align the open-ended subgoal with the behavior sequence, autoregressively predicting the next action. VPT is used as the Action Head to map the MLLM output embeddings into the control action space of Minecraft. The training loss combines a behavior cloning loss with a KL divergence loss against the teacher model VPT.
MGOA Dataset Construction: An automated pipeline is developed to generate 25,000 videos and approximately 30M aligned goal-observation-action pairs spanning 8 atomic tasks. Specifically, STEVE-1 is run on GPT-4-generated instructions, and only successful trajectories are recorded; failed or timed-out episodes are filtered out. The entire process requires no manual annotation and can be parallelized for rapid generation. Training consists of two stages: behavior pre-training to align the behavior encoder, followed by action fine-tuning to map the language space to the action space.
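Two of the mechanisms above are easier to follow with a small numeric sketch: the cross-attention in the Causal Perceiver (visual features as queries, action embeddings as keys and values) and the combined training loss. The code below is a pure-Python illustration under several assumptions that are not taken from the paper: a single attention head, identity projections in place of learned ones, a residual addition, equal visual and action embedding sizes, a discrete action distribution, the KL direction KL(teacher || student), and a hypothetical `kl_weight` factor.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def causal_perceiver(visual_tokens: List[Vector], action_tokens: List[Vector]) -> List[Vector]:
    """Cross-attention: each visual token (query) attends over action tokens (key/value).

    ASSUMPTIONS: identity projections, one head, residual add, equal dimensions.
    """
    scale = math.sqrt(len(action_tokens[0]))
    fused = []
    for query in visual_tokens:
        weights = softmax([dot(query, key) / scale for key in action_tokens])
        attended = [
            sum(w * value[d] for w, value in zip(weights, action_tokens))
            for d in range(len(query))
        ]
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


def kl_divergence(p: Vector, q: Vector) -> float:
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def combined_loss(student_logits: Vector, teacher_logits: Vector,
                  expert_action: int, kl_weight: float = 1.0) -> float:
    """Behavior cloning (negative log-likelihood of the expert action)
    plus a KL term that pulls the student toward the teacher.

    ASSUMPTIONS: the KL direction and `kl_weight` are illustrative choices.
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    bc_loss = -math.log(student[expert_action])
    return bc_loss + kl_weight * kl_divergence(teacher, student)


# One action token: every visual token receives the same action information.
fused = causal_perceiver([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]])
```

When the student and teacher logits coincide, the KL term vanishes and `combined_loss` reduces to plain behavior cloning, which is a quick sanity check for an implementation along these lines.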

Key Experimental Results

Atomic Tasks: GOAP achieves an average reward of 19.0 across four tasks (Logs, Seeds, Dirt, Stone), outperforming GROOT (15.1) and STEVE-1 (7.3) with an improvement of approximately 27%.
Long-horizon Tasks: Optimus-2 achieves the highest success rates across all 7 task groups, reaching 13% on the Diamond Group and 28% on the Redstone Group, yielding an overall average improvement of 10%.
Open-ended Instruction Tasks: GOAP achieves successful completion on Golden Shovel (13%), Diamond Pickaxe (16%), and Compass (17%), whereas existing policies all fail (0%).
Ablation Study: Removing the Causal Perceiver leads to a 47.4% performance drop; removing the History Aggregator + Memory Bank results in a 44.2% decline.
LLM Backbone: Replacing the LLM with Transformer-XL significantly degrades performance on open-ended instruction tasks, verifying the necessity of the MLLM's language comprehension capabilities.
Training Data: Training on the OpenAI Contractor dataset alone yields 89% lower performance on the Stone task than training on the mixed dataset.

Key Findings

The action guidance from the Causal Perceiver allows behavior representations to clearly distinguish among different tasks (forming distinct clusters in t-SNE visualizations for the four tasks), whereas ViT and MineCLIP representations are highly confounded.
Using VPT as the Action Head is significantly superior to a 2-layer MLP (since massive game-data pre-training of VPT provides crucial domain knowledge).
High-quality aligned data from the MGOA dataset is key to the performance gain—jointly training on MGOA and the OpenAI Contractor Dataset achieves the best results.

Highlights & Insights

Profound Insight on Observation-Action Causal Modeling: Existing methods overlook an intuitively obvious fact—the current observation is a direct consequence of the previous action. Explicitly encoding this causal relationship dramatically increases the discriminability of behavior representations.
Pioneering Use of MLLM as a Policy: This is the first work to utilize an MLLM as the core architecture of a Minecraft policy (rather than just for planning), unlocking its open-ended language comprehension capabilities and enabling the policy to handle open instructions such as "I need some iron ores, what should I do?" for the first time.
Compression Design via Behavior Tokens: Representing arbitrarily long historical sequences with fixed-length behavior tokens and a memory bank preserves long-term dependencies while controlling computational overhead, offering an elegant solution for long-video modeling.
Automated Data Pipeline: The "bootstrap" concept of using existing agents to collect training data provides a low-cost, highly efficient solution suitable for rapid scaling.

Limitations & Future Work

A lack of high-quality training data for open-ended tasks (e.g., "build a house", "defeat the Ender Dragon") limits the execution of complex creative tasks.
Validation is restricted to the Minecraft environment, with no extension to other simulation platforms (e.g., AI2-THOR, Habitat) or real-world robots.
The planner relies on GPT-4V (a closed-source model), increasing operational costs and limiting reproducibility.
The foundation model for GOAP (DeepSeek-VL-1.3B) has a restricted parameter size; whether scaling to larger MLLMs would yield further improvements remains unexplored.
Data generation depends on the capabilities of the existing policy (STEVE-1); thus, training data cannot be generated for tasks that STEVE-1 is incapable of completing.
Training takes about 2 days on 8 L40 GPUs, a nontrivial compute requirement.