CrafText Benchmark: Advancing Instruction Following in Complex Multimodal Open-Ended World¶

Conference: ACL 2025
arXiv: 2505.11962
Code: GitHub
Area: Multimodal VLMs
Keywords: Instruction Following, Multimodal Benchmark, Reinforcement Learning, Open World, Language Grounding

TL;DR¶

This paper proposes CrafText, a multimodal instruction-following benchmark based on the Craftax open-world environment. It contains 3,924 instructions with 3,423 unique words, covering four task categories: localization, conditional, building, and achievement. It also introduces a dual-evaluation protocol designed to test the language and goal generalization capabilities of agents.

Background & Motivation¶

Instruction following in the real world faces two core challenges: (1) decision-making in dynamically changing environments, where the environment is unpredictable and states evolve independently of agent behavior; and (2) generalizing across various tasks and instruction expressions, where agents must correctly interpret diversely phrased instructions and ground them to observations.

Limitations of Prior Work:
- Most environments are static (e.g., Alfred, Touchdown), lacking environmental dynamics.
- Instructions are typically generated procedurally via templates, resulting in limited vocabularies (e.g., BabyAI, HomeGrid).
- Even environments with rich vocabularies (e.g., Alfred) lack diverse object interactions.
- No existing environments offer a dual-evaluation protocol to simultaneously test "language generalization" and "goal generalization."

CrafText aims to fill this gap by establishing a comprehensive benchmark that incorporates environmental dynamics, linguistic diversity, rich interactions, and a dual-evaluation scheme.

Method¶

Overall Architecture¶

CrafText is built upon Craftax (a Minecraft-like open-world RL environment) and extends it with a natural language instruction interface. The overall framework comprises three components: dataset design, the instruction generation pipeline, and environmental extension.

Key Designs¶

Hierarchical Dataset Structure: A three-tier structure of "Scenario \(\rightarrow\) Goal \(\rightarrow\) Instruction" is adopted. Scenarios define abstract task classes (e.g., "build a square"), goals parameterize them into specific instances (e.g., "build a 2×2 wooden square"), and instructions represent multiple natural language expressions of the goal (approx. 5–6 formulations per goal).
Four Task Categories:
- Building: Requires agents to build specific structures, demanding that they remember the starting point and potentially leave to collect additional resources.
- Conditional: Tests instruction understanding, such as "Collect two stones and then craft a sword" vs. "Before crafting a sword, collect two stones."
- Localization: Evaluates spatial instruction comprehension, including compass directions (South, East, West, North) and relative directions (right, above).
- Achievement: Executes standard in-game achievements and their combinations, such as collecting wood and mining diamonds.
Three Difficulty Levels: Structured based on the sequence length of prerequisite actions required to complete the task:
- Easy: Achievement scenarios, i.e., completing in-game achievements and their combinations.
- Medium: All scenario types, but with shorter action sequences (\(<10\) steps).
- Hard: Complex goals or long action sequences.
Instruction Generation Pipeline: Combines procedural goal generation with GPT-4 language generation. First, expert-defined scenario checking functions and parameter ranges are used to enumerate combinations and generate a large set of goal templates. Then, GPT-4 is leveraged to generate diverse natural language instructions and paraphrases for each goal, ensuring linguistic complexity and diversity.
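The procedural half of this pipeline can be pictured as a Cartesian product over parameter ranges. The following is a minimal sketch, not the benchmark's actual code: the `Goal` and `GoalEntry` classes, the `enumerate_goals` helper, and the parameter ranges for "build a square" are all hypothetical names chosen for illustration. The GPT-4 paraphrasing step is not reproduced; the `instructions` list is simply left empty for it to fill.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass(frozen=True)
class Goal:
    scenario: str   # abstract task class, e.g. "build_square"
    params: tuple   # (name, value) pairs that fix one concrete instance


@dataclass
class GoalEntry:
    goal: Goal
    # Natural language formulations; filled in later by the LLM step.
    instructions: list = field(default_factory=list)


def enumerate_goals(scenario, param_ranges):
    """Expand one scenario into every goal allowed by its parameter ranges."""
    names = sorted(param_ranges)
    combos = product(*(param_ranges[name] for name in names))
    return [Goal(scenario, tuple(zip(names, values))) for values in combos]


# Hypothetical parameter ranges for the "build a square" scenario.
ranges = {"material": ["wood", "stone"], "size": [2, 3]}
goals = enumerate_goals("build_square", ranges)
dataset = [GoalEntry(goal) for goal in goals]

for entry in dataset:
    print(entry.goal.scenario, dict(entry.goal.params))
```

Keeping `Goal` frozen and hashable makes it easy to deduplicate goals and to use them as keys when instructions are attached afterwards.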
Dual Evaluation Protocol:
- Paraphrased Test Set: Uses the same goals as the training set but with rephrased instructions to test language generalization capabilities.
- New Objects Test Set: Introduces combinations of objects unseen during training (though all individual objects have appeared in the training set) to test goal-level generalization.
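One way to picture the two test sets is as a split over (object combination, instruction) pairs. The sketch below is an illustration under stated assumptions, not the authors' splitting code: `split_dataset`, the choice to hold out the last phrasing of each goal, and the example instructions are all hypothetical.

```python
def split_dataset(entries, held_out_combos):
    """Split goals into train, paraphrased-test, and new-objects-test sets.

    entries: list of (objects, instructions) pairs, one pair per goal.
    held_out_combos: set of frozensets naming object combinations that
    must never appear in training.
    """
    train, paraphrased, new_objects = [], [], []
    for objects, instructions in entries:
        combo = frozenset(objects)
        if combo in held_out_combos:
            # Goal generalization: this object combination is test-only.
            new_objects += [(combo, text) for text in instructions]
        else:
            # Language generalization: same goal, one phrasing held out.
            train += [(combo, text) for text in instructions[:-1]]
            paraphrased.append((combo, instructions[-1]))
    return train, paraphrased, new_objects


# Illustrative goals; the instruction texts are made up for this example.
entries = [
    (("wood", "table"), ["Place a table next to wood.",
                         "Put a table beside the wood.",
                         "Set a table near wood."]),
    (("stone", "furnace"), ["Place a furnace next to stone.",
                            "Put a furnace by the stone."]),
    (("wood", "furnace"), ["Place a furnace next to wood.",
                           "Put a furnace beside wood."]),
]
held_out = {frozenset({"wood", "furnace"})}
train, paraphrased, new_objects = split_dataset(entries, held_out)

# Every individual object in the new-objects set also occurs in training.
train_objects = set().union(*(combo for combo, _ in train))
assert all(combo <= train_objects for combo, _ in new_objects)
```

The final assertion encodes the constraint stated above: only the combination is unseen, while each object on its own has appeared during training.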
JAX-Accelerated Environment: All evaluation checking functions are implemented in JAX, supporting JIT compilation and GPU acceleration to enable highly parallelized, large-scale training.

Reward System¶

A reward of 1 is granted upon instruction completion.
Achievement discovery rewards provided by the Craftax environment are scaled by 1/50.
Scenario checking functions are executed at each step to verify the completion status.
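The three rules above can be combined into a per-step reward function. The sketch below makes several assumptions that the summary does not confirm: the two reward terms are added together, the `StepInfo` fields are a hypothetical stand-in for the real environment state, and the checker is written in plain Python for readability rather than in JAX.

```python
from dataclasses import dataclass
from typing import Optional

COMPLETION_REWARD = 1.0         # granted when the instruction is completed
ACHIEVEMENT_SCALE = 1.0 / 50.0  # scaling applied to Craftax achievement rewards


@dataclass
class StepInfo:
    inventory: dict             # hypothetical: item name -> count after the step
    crafted_now: Optional[str]  # hypothetical: item crafted on this step, if any
    achievement_reward: float   # raw achievement reward from the environment


def stones_then_sword(info, stones_seen):
    """Checker for 'collect two stones and then craft a sword'.

    stones_seen latches once two stones were held on an earlier step, so
    the sword only counts if it is crafted after that condition was met.
    """
    done = stones_seen and info.crafted_now == "sword"
    stones_seen = stones_seen or info.inventory.get("stone", 0) >= 2
    return done, stones_seen


def step_reward(info, stones_seen):
    """Run the checker for this step and combine both reward terms."""
    done, stones_seen = stones_then_sword(info, stones_seen)
    reward = ACHIEVEMENT_SCALE * info.achievement_reward
    if done:
        reward += COMPLETION_REWARD  # assumption: terms are additive
    return reward, done, stones_seen


# A made-up three-step episode with illustrative raw achievement rewards.
episode = [
    StepInfo({"stone": 1}, None, 1.0),
    StepInfo({"stone": 2}, None, 0.0),
    StepInfo({"stone": 1}, "sword", 1.0),
]
stones_seen, total, done = False, 0.0, False
for info in episode:
    reward, done, stones_seen = step_reward(info, stones_seen)
    total += reward
print(round(total, 2), done)
```

The latch matters because crafting may consume stones: by the time the sword appears, the inventory alone no longer shows that the precondition was satisfied.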

Key Experimental Results¶

Main Results (Medium tasks, 50 seeds)¶

Algorithm	Conditional	Building	Localization	Achievement	Total
PPO-T	0.15	0.25	0.33	0.55	0.40
PPO-T+	0.17	0.24	0.30	0.70	0.45
Dynalang	0.00	0.12	0.15	0.17	0.15
FiLM	0.07	0.38	0.29	0.76	0.43

Generalization Experiments¶

Test Set	PPO-T	PPO-T+	Dynalang	FiLM
Train	0.40	0.45	0.15	0.43
Paraphrased	0.36	0.35	0.05	0.35
New Objects	0.22	0.28	0.10	0.26
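The size of each generalization gap is easier to compare as a relative drop from the training score. The short sketch below only does arithmetic on the numbers in the table above; the `relative_drop` helper is a hypothetical name.

```python
# Success rates copied from the generalization table.
results = {
    "PPO-T":    {"train": 0.40, "paraphrased": 0.36, "new_objects": 0.22},
    "PPO-T+":   {"train": 0.45, "paraphrased": 0.35, "new_objects": 0.28},
    "Dynalang": {"train": 0.15, "paraphrased": 0.05, "new_objects": 0.10},
    "FiLM":     {"train": 0.43, "paraphrased": 0.35, "new_objects": 0.26},
}


def relative_drop(train, test):
    """Fraction of the training success rate lost on a test set."""
    return (train - test) / train


for name, row in results.items():
    para = relative_drop(row["train"], row["paraphrased"])
    new = relative_drop(row["train"], row["new_objects"])
    print(f"{name:9s} paraphrased: {para:.1%}  new objects: {new:.1%}")
```

Read this way, PPO-T loses about 10% of its training score on paraphrases and PPO-T+ about 22%, while on new objects every method loses roughly a third or more. Dynalang is the only method that scores lower on paraphrases (0.05) than on new objects (0.10).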

Key Findings¶

Dynalang performs far below expectations: Despite its outstanding performance in the Crafter environment, Dynalang only achieves a 0.15 success rate on CrafText. This indicates that the combination of complex linguistic instructions and dynamic environments dramatically increases the learning difficulty.
All baseline methods yield low success rates: Even the best method, PPO-T+, only achieves 0.45, confirming the high difficulty of the CrafText benchmark.
Paraphrasing leads to a significant performance drop: PPO-T+ drops from 0.45 to 0.35, demonstrating that current methods lack robustness against linguistic variations.
PPO-T+ (with planning) performs best on generalizing to new objects: A success rate of 0.28 shows that decomposing instructions into structured plans aids goal-level generalization.
FiLM performs best on building tasks (0.38), indicating that its feature-level modulation mechanism is more flexible in handling vision-language interactions.
Conditional tasks remain extremely challenging for all methods: The highest success rate in the Medium-task results is only 0.17 (PPO-T+), showing that conditional logical reasoning is a major bottleneck for existing approaches.
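The per-category claims in these findings can be checked against the Main Results table with a few lines of code. This is only a lookup over the reported numbers; the variable names are arbitrary.

```python
# Success rates copied from the Main Results table (Medium tasks).
table = {
    "PPO-T":    {"Conditional": 0.15, "Building": 0.25, "Localization": 0.33,
                 "Achievement": 0.55, "Total": 0.40},
    "PPO-T+":   {"Conditional": 0.17, "Building": 0.24, "Localization": 0.30,
                 "Achievement": 0.70, "Total": 0.45},
    "Dynalang": {"Conditional": 0.00, "Building": 0.12, "Localization": 0.15,
                 "Achievement": 0.17, "Total": 0.15},
    "FiLM":     {"Conditional": 0.07, "Building": 0.38, "Localization": 0.29,
                 "Achievement": 0.76, "Total": 0.43},
}
categories = ["Conditional", "Building", "Localization", "Achievement", "Total"]

# Best method per column, with the score it reached.
best = {c: max(table, key=lambda method: table[method][c]) for c in categories}
for category, method in best.items():
    print(f"{category:12s} {method:8s} {table[method][category]:.2f}")
```

No single baseline wins every column: PPO-T+ leads on Conditional and Total, FiLM on Building and Achievement, and PPO-T on Localization.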

Highlights & Insights¶

Comprehensiveness: CrafText concurrently addresses environmental dynamics, linguistic diversity, rich interactions, GPU acceleration, and a dual-evaluation protocol, making it uniquely comprehensive compared to existing benchmarks.
Uncovering Core Bottlenecks: Experiments show that a method with strong results elsewhere (Dynalang, which performs well on Crafter) reaches only a 0.15 success rate once dynamic environments are combined with linguistically complex instructions.
JAX Implementation: Support for large-scale parallel training resolves the practical efficiency bottlenecks associated with RL training.
Value of Planning Augmentation: The simple yet effective GPT-4 planning step in PPO-T+ suggests that leveraging LLMs for task decomposition is a promising future direction.

Limitations & Future Work¶

All instructions in the dataset are AI-generated; without human-authored instructions, the benchmark may fail to capture the subtle nuances of human language.
The environment lacks real-world interaction components, such as instruction negotiation, clarification, and dynamic dialogues.
Success rates of current baseline methods are generally low; stronger methods are needed to fully evaluate the discriminative power of the benchmark.
Because it is built on Craftax, it remains a 2D pixel-based environment, leaving a gap to the 3D physical world.
Language representation is limited to DistilBERT and T5 embeddings, leaving the investigation of more powerful VLMs as policy networks to future study.

Compared with template-based instruction environments like BabyAI and HomeGrid, CrafText provides richer vocabularies and higher linguistic complexity.
Compared with MineDojo, CrafText offers precise objective verification functions and a dual-evaluation protocol.
Insight: Integrating LLMs for instruction preprocessing/planning (e.g., PPO-T+) is a highly promising direction to advance instruction-following capabilities.
The challenges posed by environmental dynamics for instruction following remain under-investigated, representing an important open problem.

Rating¶

Novelty: ⭐⭐⭐⭐ The first instruction-following benchmark to satisfy multiple key attributes simultaneously; the dual-evaluation protocol is novel.
Experimental Thoroughness: ⭐⭐⭐ Baseline evaluations are limited (4 methods); lacks comparisons with VLM-based approaches and more diverse RL algorithms.
Writing Quality: ⭐⭐⭐⭐ Well-structured with complete comparative tables, though some environment descriptions could be more concise.
Value: ⭐⭐⭐⭐ Fills a crucial gap for dynamic-environment, complex-instruction-following benchmarks, offering high value to the RL and NLP communities.