CrafText Benchmark: Advancing Instruction Following in Complex Multimodal Open-Ended World
Conference: ACL 2025
arXiv: 2505.11962
Code: GitHub
Area: Multimodal VLMs
Keywords: Instruction Following, Multimodal Benchmark, Reinforcement Learning, Open World, Language Grounding
TL;DR
This paper proposes CrafText, a multimodal instruction-following benchmark based on the Craftax open-world environment. It contains 3,924 instructions with 3,423 unique words, covering four task categories: localization, conditional, building, and achievement. It also introduces a dual-evaluation protocol designed to test the language and goal generalization capabilities of agents.
Background & Motivation
Instruction following in the real world faces two core challenges: (1) decision-making in dynamically changing environments, where the environment is unpredictable and states evolve independently of agent behavior; and (2) generalizing across various tasks and instruction expressions, where agents must correctly interpret diversely phrased instructions and ground them to observations.
Limitations of Prior Work:
- Most environments are static (e.g., Alfred, Touchdown), lacking environmental dynamics.
- Instructions are typically generated procedurally via templates, resulting in limited vocabularies (e.g., BabyAI, HomeGrid).
- Even environments with rich vocabularies (e.g., Alfred) lack diverse object interactions.
- No existing environment offers a dual-evaluation protocol that tests both "language generalization" and "goal generalization."
CrafText aims to fill this gap by establishing a comprehensive benchmark that incorporates environmental dynamics, linguistic diversity, rich interactions, and a dual-evaluation scheme.
Method
Overall Architecture
CrafText is built upon Craftax (a Minecraft-like open-world RL environment) and extends it with a natural language instruction interface. The overall framework comprises three components: dataset design, the instruction generation pipeline, and environmental extension.
Key Designs
- Hierarchical Dataset Structure: A three-tier structure of "Scenario \(\rightarrow\) Goal \(\rightarrow\) Instruction" is adopted. Scenarios define abstract task classes (e.g., "build a square"), goals parameterize them into specific instances (e.g., "build a 2×2 wooden square"), and instructions represent multiple natural language expressions of the goal (approx. 5–6 formulations per goal).
- Four Task Categories:
  - Building: Requires agents to build specific structures, demanding that they remember the starting point and potentially leave to collect additional resources.
  - Conditional: Tests whether the agent recovers the required order of actions from differently phrased instructions, such as "Collect two stones and then craft a sword" vs. "Before crafting a sword, collect two stones" (both require collecting the stones first).
  - Localization: Evaluates spatial instruction comprehension, including compass directions (South, East, West, North) and relative directions (right, above).
  - Achievement: Executes standard in-game achievements and their combinations, such as collecting wood and mining diamonds.
- Three Difficulty Levels: Structured based on the sequence length of prerequisite actions required to complete the task:
  - Easy: Achievement scenarios, completing game achievements and their combinations.
  - Medium: All scenario types, but with shorter action sequences (\(<10\) steps).
  - Hard: Complex goals or long action sequences.
- Instruction Generation Pipeline: Combines procedural goal generation with GPT-4 language generation. First, expert-defined scenario checking functions and parameter ranges are used to enumerate combinations and generate a large set of goal templates. Then, GPT-4 is leveraged to generate diverse natural language instructions and paraphrases for each goal, ensuring linguistic complexity and diversity.
- Dual Evaluation Protocol:
  - Paraphrased Test Set: Uses the same goals as the training set but with rephrased instructions to test language generalization capabilities.
  - New Objects Test Set: Introduces combinations of objects unseen during training (though all individual objects have appeared in the training set) to test goal-level generalization.
- JAX-Accelerated Environment: All evaluation checking functions are implemented in JAX, supporting JIT compilation and GPU acceleration to enable highly parallelized, large-scale training.
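The goal enumeration, the Scenario → Goal → Instruction hierarchy, and the dual evaluation split described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `Goal` class, the `dual_split` and `make_goal` helpers, the `place_next_to` scenario, the object names, and the paraphrases are all hypothetical and are not taken from the CrafText code base (which is implemented in JAX and obtains its paraphrases from GPT-4).

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Goal:
    scenario: str        # abstract task class (hypothetical name)
    objects: frozenset   # objects the goal combines
    instructions: tuple  # several natural-language formulations of the goal


def dual_split(goals, held_out_combos, n_test_paraphrases=1):
    """Split (goal, instruction) pairs into train / paraphrased / new-objects sets."""
    train, paraphrased, new_objects = [], [], []
    for goal in goals:
        if goal.objects in held_out_combos:
            # The whole goal is unseen in training: goal-level generalization.
            new_objects += [(goal, text) for text in goal.instructions]
            continue
        # Same goal in both sets, different wording: language generalization.
        cut = len(goal.instructions) - n_test_paraphrases
        train += [(goal, text) for text in goal.instructions[:cut]]
        paraphrased += [(goal, text) for text in goal.instructions[cut:]]
    return train, paraphrased, new_objects


def make_goal(a, b):
    # Hypothetical hand-written paraphrases standing in for generated ones.
    return Goal(
        scenario="place_next_to",
        objects=frozenset({a, b}),
        instructions=(
            f"Place the {a} next to the {b}.",
            f"Put a {a} beside the {b}.",
            f"The {b} needs a {a} right by it.",
        ),
    )


# Procedural goal generation: enumerate every pair from a parameter range.
objects = ["table", "furnace", "stone"]
goals = [make_goal(a, b) for a, b in combinations(objects, 2)]

# Hold out one object combination; each object still appears in training.
held_out = {frozenset({"furnace", "stone"})}
train, paraphrased, new_objects = dual_split(goals, held_out)

train_objects = set().union(*(goal.objects for goal, _ in train))
```

In this toy split the paraphrased set shares its goals with the training set but none of its wording, while the new-objects set contains only a pairing that training never saw, built from objects that training did see.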
Reward System
- A reward of 1 is granted upon instruction completion.
- Achievement discovery rewards provided by the Craftax environment are scaled by 1/50.
- Scenario checking functions are executed at each step to verify the completion status.
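As a rough illustration of how these three rules could fit together, the sketch below pairs a checking function for the conditional example "Collect two stones and then craft a sword" with a per-step reward. Several things here are assumptions rather than the benchmark's actual design: the event-log representation, the function names, the choice to sum the two reward terms, and the choice to pay the completion bonus only once. The real checkers are JAX functions over the environment state.

```python
ACHIEVEMENT_SCALE = 1.0 / 50.0  # scaling factor for Craftax achievement rewards


def check_collect_then_craft(events, item="stone", count=2, craft="sword"):
    """Return True if `craft` was crafted after at least `count` items were collected.

    `events` is a hypothetical log of (kind, name) tuples in time order.
    """
    collected = 0
    for kind, name in events:
        if kind == "collect" and name == item:
            collected += 1
        elif kind == "craft" and name == craft and collected >= count:
            return True
    return False


def step_reward(events, achievement_reward, already_done):
    """Return (reward, done) for one step, re-running the checker on the log."""
    done = check_collect_then_craft(events)
    # Assumption: the bonus is paid once and added to the scaled achievement reward.
    bonus = 1.0 if done and not already_done else 0.0
    return bonus + ACHIEVEMENT_SCALE * achievement_reward, done or already_done


right_order = [("collect", "stone"), ("collect", "stone"), ("craft", "sword")]
wrong_order = [("craft", "sword"), ("collect", "stone"), ("collect", "stone")]

reward_ok, done_ok = step_reward(right_order, achievement_reward=1.0, already_done=False)
reward_bad, done_bad = step_reward(wrong_order, achievement_reward=0.0, already_done=False)
```

With the 1/50 scaling, a raw achievement reward of 1 contributes only 0.02, so the instruction bonus dominates the signal whenever the instruction is completed.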
Key Experimental Results
Main Results (Medium tasks, 50 seeds)
| Algorithm | Conditional | Building | Localization | Achievement | Total |
|---|---|---|---|---|---|
| PPO-T | 0.15 | 0.25 | 0.33 | 0.55 | 0.40 |
| PPO-T+ | 0.17 | 0.24 | 0.30 | 0.70 | 0.45 |
| Dynalang | 0.00 | 0.12 | 0.15 | 0.17 | 0.15 |
| FiLM | 0.07 | 0.38 | 0.29 | 0.76 | 0.43 |
Generalization Experiments
| Test Set | PPO-T | PPO-T+ | Dynalang | FiLM |
|---|---|---|---|---|
| Train | 0.40 | 0.45 | 0.15 | 0.43 |
| Paraphrased | 0.36 | 0.35 | 0.05 | 0.35 |
| New Objects | 0.22 | 0.28 | 0.10 | 0.26 |
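To compare methods that start from different training success rates, it can help to express each test-set result as a relative drop from the training value. The short script below does this arithmetic on the numbers in the table above; the `relative_drop` helper and the dictionary layout are illustrative choices, not part of the paper.

```python
# Success rates copied from the generalization table.
results = {
    "PPO-T":    {"train": 0.40, "paraphrased": 0.36, "new_objects": 0.22},
    "PPO-T+":   {"train": 0.45, "paraphrased": 0.35, "new_objects": 0.28},
    "Dynalang": {"train": 0.15, "paraphrased": 0.05, "new_objects": 0.10},
    "FiLM":     {"train": 0.43, "paraphrased": 0.35, "new_objects": 0.26},
}


def relative_drop(train, test):
    """Fraction of the training success rate lost on a test set."""
    return (train - test) / train


for name, r in results.items():
    para = relative_drop(r["train"], r["paraphrased"])
    new = relative_drop(r["train"], r["new_objects"])
    print(f"{name:9s} paraphrased: -{para:.0%}   new objects: -{new:.0%}")
```

Measured this way, PPO-T+ loses roughly 22% of its training success rate on paraphrased instructions and roughly 38% on new object combinations, while Dynalang loses about two thirds on paraphrases.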
Key Findings
- Dynalang performs far below expectations: Despite its outstanding performance in the Crafter environment, Dynalang only achieves a 0.15 success rate on CrafText. This indicates that the combination of complex linguistic instructions and dynamic environments dramatically increases the learning difficulty.
- All baseline methods yield low success rates: Even the best method, PPO-T+, only achieves 0.45, confirming the high difficulty of the CrafText benchmark.
- Paraphrasing leads to a significant performance drop: PPO-T+ drops from 0.45 to 0.35, demonstrating that current methods lack robustness against linguistic variations.
- PPO-T+ (with planning) performs best on generalizing to new objects: A success rate of 0.28 shows that decomposing instructions into structured plans aids goal-level generalization.
- FiLM performs best on building tasks (0.38), indicating that its feature-level modulation mechanism is more flexible in handling vision-language interactions.
- Conditional tasks remain extremely challenging for all methods: The highest success rate is only 0.17 (PPO-T+), showing that conditional logical reasoning is a major bottleneck for existing approaches.
Highlights & Insights
- Comprehensiveness: CrafText concurrently addresses environmental dynamics, linguistic diversity, rich interactions, GPU acceleration, and a dual-evaluation protocol, making it uniquely comprehensive compared to existing benchmarks.
- Uncovering Core Bottlenecks: Experiments show that methods performing well in other environments (such as Dynalang) degrade sharply under dynamic and linguistically complex conditions, with Dynalang reaching only a 0.15 overall success rate.
- JAX Implementation: Support for large-scale parallel training resolves the practical efficiency bottlenecks associated with RL training.
- Value of Planning Augmentation: The simple yet effective GPT-4 planning step in PPO-T+ suggests that leveraging LLMs for task decomposition is a promising future direction.
Limitations & Future Work
- All instructions in the dataset are AI-generated, lacking human-authored instructions, which may fail to capture the subtle nuances of human language.
- The environment lacks real-world interaction components, such as instruction negotiation, clarification, and dynamic dialogues.
- Success rates of current baseline methods are generally low; stronger methods are needed to fully evaluate the discriminative power of the benchmark.
- Because it is built on Craftax, the benchmark remains a 2D pixel-based environment, leaving a gap to the 3D physical world.
- Language representation is limited to DistilBERT and T5 embeddings, leaving the investigation of more powerful VLMs as policy networks to future study.
Related Work & Insights
- Compared with template-based instruction environments like BabyAI and HomeGrid, CrafText provides richer vocabularies and higher linguistic complexity.
- Compared with MineDojo, CrafText offers precise objective verification functions and a dual-evaluation protocol.
- Insight: Integrating LLMs for instruction preprocessing/planning (e.g., PPO-T+) is a highly promising direction to advance instruction-following capabilities.
- The challenges posed by environmental dynamics for instruction following remain under-investigated, representing an important open problem.
Rating
- Novelty: ⭐⭐⭐⭐ The first instruction-following benchmark to satisfy multiple key attributes simultaneously; the dual-evaluation protocol is novel.
- Experimental Thoroughness: ⭐⭐⭐ Baseline evaluations are limited (4 methods); lacks comparisons with VLM-based approaches and more diverse RL algorithms.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with complete comparative tables, though some environment descriptions could be more concise.
- Value: ⭐⭐⭐⭐ Fills a crucial gap for dynamic-environment, complex-instruction-following benchmarks, offering high value to the RL and NLP communities.