# SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
**Conference:** ACL 2026 · **arXiv:** 2604.19638 · **Code:** https://github.com/sled-group/SafetyALFRED · **Area:** Multimodal VLM · **Keywords:** Embodied Safety, Hazard Mitigation, Multimodal Evaluation, Safety Planning, ALFRED
## TL;DR
This paper proposes the SafetyALFRED benchmark, which introduces six categories of kitchen safety hazards into the ALFRED embodied task setting. It reveals a critical alignment gap: multimodal large language models can identify hazards in static QA (up to 92%) but fail to proactively mitigate them during embodied planning (<60%), advocating a paradigm shift from QA-based to embodied safety evaluation.
## Background & Motivation
Background: Multimodal large language models (MLLMs) are increasingly deployed as autonomous agents in embodied environments, translating high-level natural language instructions into executable plans. Existing safety benchmarks such as ASIMOV, Multimodal Situational Safety, and MM-SafetyBench primarily evaluate hazard recognition through static image- or video-based question answering.
Limitations of Prior Work: Existing evaluations suffer a fundamental flaw — they test only whether models recognize hazards, not whether models can generate hazard-mitigating plans in dynamic embodied environments. A model that correctly identifies "a phone in the sink" as dangerous may completely ignore the need to remove the phone before executing a "wash the knife" task. This knowledge–action gap has never been systematically quantified.
Key Challenge: High accuracy in static QA creates a false sense of safety: models know what is dangerous, yet when required to simultaneously complete a task and mitigate hazards, they systematically prioritize task completion over safety. QA performance is a poor proxy for embodied safety.
Goal: (1) Construct an embodied benchmark that jointly evaluates hazard recognition and proactive mitigation; (2) quantify the alignment gap between QA recognition and embodied mitigation; (3) investigate whether a multi-agent framework can reduce this gap.
Key Insight: Extend the ALFRED benchmark (an embodied instruction-following task built on AI2-THOR) by introducing six real-world safety hazard categories across 30 kitchen environments. Pre-rendered trajectories provide ground-truth history, isolating safety reasoning ability from task execution ability.
Core Idea: Evaluate the same scene simultaneously under two protocols — QA evaluation (can the model identify hazards?) and embodied evaluation (can the model mitigate hazards while executing a task?) — and quantify the gap between the two via an alignment rate.
## Method
### Overall Architecture
SafetyALFRED models safety-constrained planning as a tuple \(\mathcal{P} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{G}, \mathcal{H}, \mathcal{R}_{\text{safe}} \rangle\) (states, actions, transitions, task goals, hazard conditions, and corrective actions, respectively). A safety-aware policy \(\pi^*\) must prioritize corrective actions \(\mathcal{R}_{\text{safe}}(h_i, s_t)\) whenever a hazard is present and advance the task goal only in hazard-free states. The evaluation pipeline consists of: (1) environmental perturbations that introduce hazards; (2) a QA task in which models act as safety judges to identify hazards; and (3) an embodied task in which models generate plans that incorporate mitigation.
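The hazard-first priority of the policy \(\pi^*\) can be sketched as a simple control loop. This is an illustrative sketch, not the paper's code; the `Hazard` class, predicate functions, and AI2-THOR-style action strings are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Hazard:
    name: str
    predicate: callable   # state -> bool: does the hazard condition hold?
    mitigation: str       # corrective action R_safe for this hazard

def safety_aware_step(state, hazards, task_policy):
    """One step of a safety-aware policy: execute a corrective action
    for any active hazard before advancing the task goal."""
    for h in hazards:
        if h.predicate(state):      # hazard condition holds in s_t
            return h.mitigation     # prioritize R_safe(h_i, s_t)
    return task_policy(state)       # hazard-free: pursue the task goal G

# Toy example: a stove left on must be switched off before the task proceeds.
hazards = [Hazard("fire", lambda s: s["stove_on"], "ToggleObjectOff(Stove)")]
task = lambda s: "PickupObject(Knife)"
print(safety_aware_step({"stove_on": True}, hazards, task))   # mitigation first
print(safety_aware_step({"stove_on": False}, hazards, task))  # task action
```

The key property being tested is exactly the ordering above: corrective actions must preempt task progress whenever a hazard predicate is satisfied.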
### Key Designs
- **Six Kitchen Hazard Categories:**
  - Function: Cover the major types of real-world kitchen accidents.
  - Mechanism: Six categories are defined based on kitchen accident statistics: appliance misuse (metal or flammable objects in a microwave), food spoilage (refrigerator door left open), falls/trips (cabinet door left open), fire hazards (stove left on), property damage (water-sensitive objects in the sink), and unsanitary conditions (target object on a dirty floor). Environmental condition predicates and corrective actions are defined for each category.
  - Design Motivation: These six categories span a complete risk spectrum from high-frequency events (falls/trips are the most common injury source) to high-consequence events (fires are the most destructive accident type).
- **Dual-Setting Evaluation (QA + Embodied):**
  - Function: Quantify the transfer gap from abstract safety knowledge to concrete action.
  - Mechanism: The same model is evaluated on the same scene under two independent instances: a QA instance in which the model acts as an external safety judge determining whether a hazard is present (validated through a two-stage structure + NLI pipeline), and an embodied instance in which the model generates the next action and subgoal frame-by-frame while performing a household task. The alignment rate \(\mathcal{A} = \frac{1}{K}\sum_{k=1}^{K}\mathbb{I}(v_{ik} = a_{ik})\), where \(v_{ik}\) is the QA verdict and \(a_{ik}\) the embodied mitigation outcome for the \(k\)-th hazard instance, measures the consistency between QA recognition and embodied mitigation.
  - Design Motivation: This design directly exposes the "knows but does not act" problem and constitutes a fundamental complement to existing QA-only evaluations.
- **Multi-Agent Framework:**
  - Function: Attempt to improve safety mitigation through role separation.
  - Mechanism: Hazard recognition and mitigation are decoupled: a dedicated safety-judge agent identifies hazards and communicates safety information to the embodied agent. This tests the hypothesis that a model explicitly informed of a hazard can mitigate it.
  - Design Motivation: If single-agent failure stems from task interference (task execution distracting attention from safety), then multi-agent division of labor should improve performance.
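The alignment rate from the dual-setting design reduces to a simple agreement count over hazard instances. A minimal sketch, with binary verdicts assumed (1 = hazard recognized / mitigated, 0 = not):

```python
def alignment_rate(qa_verdicts, embodied_outcomes):
    """A = (1/K) * sum_k 1[v_k == a_k]: fraction of hazard instances where
    the QA verdict agrees with the embodied mitigation outcome.
    Illustrative sketch, not the paper's evaluation code."""
    assert len(qa_verdicts) == len(embodied_outcomes)
    matches = sum(v == a for v, a in zip(qa_verdicts, embodied_outcomes))
    return matches / len(qa_verdicts)

# A model that recognizes 3 of 4 hazards in QA but mitigates only 2 of them:
print(alignment_rate([1, 1, 1, 0], [1, 1, 0, 0]))  # 0.75
```

Note that the metric rewards consistency in both directions: a model that neither recognizes nor mitigates a hazard still counts as aligned, which is why the paper reports it alongside the raw recognition and mitigation rates.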
### Loss & Training
This is an evaluation-only study; no model training is involved. All models are evaluated with temperature 0 and a 512-token output limit.
## Key Experimental Results
### Main Results
Performance of representative models (from the 11 MLLMs evaluated) on QA hazard recognition versus embodied hazard mitigation, both with scene metadata provided:
| Model | QA Recognition (w/ metadata) | Embodied Mitigation (w/ metadata) | Gap |
|---|---|---|---|
| Qwen 2.5 VL 72B | 60.8% | 12.3% | −48.5% |
| Qwen 3 VL 32B | 57.2% | 19.7% | −37.5% |
| Gemini 1.5 ER | 77.9% | 45.7% | −32.2% |
| Gemini 2.5 | 92.5% | 60.1% | −32.4% |
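The Gap column is simply embodied mitigation minus QA recognition; recomputing it from the table rows makes the sign convention explicit (negative means mitigation lags recognition):

```python
# (QA recognition %, embodied mitigation %), taken from the table above
results = {
    "Qwen 2.5 VL 72B": (60.8, 12.3),
    "Qwen 3 VL 32B":   (57.2, 19.7),
    "Gemini 1.5 ER":   (77.9, 45.7),
    "Gemini 2.5":      (92.5, 60.1),
}
for model, (qa, embodied) in results.items():
    gap = round(embodied - qa, 1)  # negative: mitigation lags recognition
    print(f"{model}: {gap:+.1f}")
```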
### Multi-Agent Improvement
| Model | Single-Agent Mitigation | Multi-Agent Mitigation | Gain |
|---|---|---|---|
| Gemma 3 27B | 7.0% | 25.1% | +18.1% |
| Qwen 3 VL 32B | 19.7% | 32.5% | +12.8% |
| Qwen 2.5 VL 72B | 12.3% | 28.5% | +16.2% |
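The multi-agent gains above come from decoupling recognition from action: a judge agent flags hazards, and the embodied agent prepends the corresponding corrective steps to its plan. A toy sketch of that pipeline (the function names, scene dictionary, and action strings are illustrative stand-ins for the paper's MLLM agents):

```python
def safety_judge(scene):
    """Judge agent: flags hazards active in the scene
    (stand-in for an MLLM safety-judge call)."""
    return [h for h in scene["hazards"] if h["active"]]

def embodied_agent(task_action, flagged):
    """Embodied agent: inserts mitigation steps for flagged hazards
    before the task action it would otherwise take."""
    plan = [h["mitigation"] for h in flagged]
    plan.append(task_action)
    return plan

scene = {"hazards": [
    {"name": "fire", "active": True,  "mitigation": "ToggleObjectOff(Stove)"},
    {"name": "spoilage", "active": False, "mitigation": "CloseObject(Fridge)"},
]}
print(embodied_agent("PickupObject(Knife)", safety_judge(scene)))
# ['ToggleObjectOff(Stove)', 'PickupObject(Knife)']
```

The paper's finding that gains remain partial suggests the hard part is not this message passing but the embodied agent reliably executing the inserted mitigation steps.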
### Key Findings
- Striking alignment gap: Even the strongest model, Gemini 2.5, translates its 92.5% QA recognition rate into only a 60.1% embodied mitigation rate.
- Models systematically prioritize task completion over hazard mitigation: Qwen 3 VL 32B achieves 80.7% action prediction accuracy on hazard-free frames but only 19.7% hazard mitigation success.
- Fire hazards are the only category with consistently strong performance across both settings (stove on/off states are straightforward to perceive and manipulate); other categories show large gaps.
- The multi-agent framework helps but does not fully resolve the problem: even when the safety-judge agent correctly identifies a hazard, the embodied agent may still fail to execute the mitigation action.
- Models frequently hallucinate hazards in safe scenes (>50% false-positive rate), exhibiting an over-cautious bias.
- Scaling model size generally decreases safety alignment — larger models recognize more hazards in QA but mitigate them at a disproportionately lower rate in embodied settings.
## Highlights & Insights
- The "knows but does not act" finding is highly impactful: it fundamentally challenges the validity of current MLLM safety evaluation practice. A large body of work assesses safety via QA or multiple-choice tasks, and this paper demonstrates that such assessment is insufficient.
- The controlled variable design is instructive: providing ground-truth history isolates safety reasoning, while visual-only and metadata-augmented modes disentangle perception from reasoning deficiencies.
- The multi-agent results reveal a deeper issue: the failure is not merely one of attention allocation. Models face a fundamental planning difficulty when required to interrupt a task flow to insert safety actions.
- The findings transfer to domains such as autonomous driving, where evaluation of planning under safety constraints is a general need.
## Limitations & Future Work
- The use of pre-rendered trajectories rather than real-time interaction does not fully represent real robotic scenarios.
- Only three model families (Qwen, Gemma, Gemini) are evaluated, limiting the generalizability of the conclusions.
- Kitchen hazards in the AI2-THOR simulator are simplified and cannot fully capture the complexity and unpredictability of real-world environments.
- Automatic evaluation of QA responses using NLI models may introduce bias.
- The paper does not explore improving embodied safety capabilities through training data augmentation.
## Related Work & Insights
- vs. ASIMOV / MM-SafetyBench: These benchmarks evaluate hazard recognition in static QA only; SafetyALFRED adds the embodied mitigation dimension and quantifies the gap between the two.
- vs. Son et al. / Chen et al.: The former is limited to text-based PDDL environments; the latter is limited to static AI-generated images. SafetyALFRED evaluates models in a multimodal simulated environment with navigation.
- Insight: Future safety evaluations should require models to enact safety rather than merely describe it; training data must include examples that balance safety and task objectives.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ First systematic quantification of the alignment gap between QA hazard recognition and embodied hazard mitigation; the problem formulation is highly original.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers 11 models, 6 hazard categories, and multiple evaluation metrics, though the use of pre-rendered trajectories is a simplification.
- Writing Quality: ⭐⭐⭐⭐ Problem motivation is clear, but the paper is lengthy and portions of the analysis are dispersed across the appendix.