# SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
**Conference:** ACL 2026 · **arXiv:** 2604.19638 · **Code:** https://github.com/sled-group/SafetyALFRED · **Area:** Multimodal VLM · **Keywords:** Embodied Safety, Hazard Mitigation, Multimodal Evaluation, Safety Planning, ALFRED
## TL;DR
This paper proposes the SafetyALFRED benchmark, which introduces six categories of kitchen safety hazards into the ALFRED embodied task setting. It reveals a critical alignment gap: multimodal large language models can identify hazards in static QA (up to 92%) but fail to proactively mitigate them during embodied planning (<60%), advocating a paradigm shift from QA-based to embodied safety evaluation.
## Background & Motivation
Background: Multimodal large language models (MLLMs) are increasingly deployed as autonomous agents in embodied environments, translating high-level natural language instructions into executable plans. Existing safety benchmarks such as ASIMOV, Multimodal Situational Safety, and MM-SafetyBench primarily evaluate hazard recognition through static image- or video-based question answering.
Limitations of Prior Work: Existing evaluations suffer a fundamental flaw — they test only whether models recognize hazards, not whether models can generate hazard-mitigating plans in dynamic embodied environments. A model that correctly identifies "a phone in the sink" as dangerous may completely ignore the need to remove the phone before executing a "wash the knife" task. This knowledge–action gap has never been systematically quantified.
Key Challenge: High accuracy in static QA creates a false sense of safety: models know what is dangerous, yet when required to simultaneously complete a task and mitigate hazards, they systematically prioritize task completion over safety. QA performance is a poor proxy for embodied safety.
Goal: (1) Construct an embodied benchmark that jointly evaluates hazard recognition and proactive mitigation; (2) quantify the alignment gap between QA recognition and embodied mitigation; (3) investigate whether a multi-agent framework can reduce this gap.
Key Insight: Extend the ALFRED benchmark (an embodied instruction-following task built on AI2-THOR) by introducing six real-world safety hazard categories across 30 kitchen environments. Pre-rendered trajectories provide ground-truth history, isolating safety reasoning ability from task execution ability.
Core Idea: Evaluate the same scene simultaneously under two protocols — QA evaluation (can the model identify hazards?) and embodied evaluation (can the model mitigate hazards while executing a task?) — and quantify the gap between the two via an alignment rate.
## Method
### Overall Architecture
SafetyALFRED models safety-constrained planning as a tuple \(\mathcal{P} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{G}, \mathcal{H}, \mathcal{R}_{\text{safe}} \rangle\) (states, actions, transitions, task goals, hazard conditions, and corrective actions, respectively). A safety-aware policy \(\pi^*\) must prioritize corrective actions \(\mathcal{R}_{\text{safe}}(h_i, s_t)\) whenever a hazard is present and advance the task goal only in hazard-free states. The evaluation pipeline consists of: (1) environmental perturbations that introduce hazards; (2) a QA task in which models act as safety judges to identify hazards; and (3) an embodied task in which models generate plans that incorporate mitigation.
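The hazard-first priority of the policy \(\pi^*\) can be sketched as a simple control loop. This is an illustrative sketch, not the paper's code; the `Hazard` class, predicate functions, and AI2-THOR-style action strings are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Hazard:
    name: str
    predicate: callable   # state -> bool: does the hazard condition hold?
    mitigation: str       # corrective action R_safe for this hazard

def safety_aware_step(state, hazards, task_policy):
    """One step of a safety-aware policy: execute a corrective action
    for any active hazard before advancing the task goal."""
    for h in hazards:
        if h.predicate(state):      # hazard condition holds in s_t
            return h.mitigation     # prioritize R_safe(h_i, s_t)
    return task_policy(state)       # hazard-free: pursue the task goal G

# Toy example: a stove left on must be switched off before the task proceeds.
hazards = [Hazard("fire", lambda s: s["stove_on"], "ToggleObjectOff(Stove)")]
task = lambda s: "PickupObject(Knife)"
print(safety_aware_step({"stove_on": True}, hazards, task))   # mitigation first
print(safety_aware_step({"stove_on": False}, hazards, task))  # task action
```

The key property being tested is exactly the ordering above: corrective actions must preempt task progress whenever a hazard predicate is satisfied.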
### Key Designs
- **Six Kitchen Hazard Categories:**
  - Function: Cover the major types of real-world kitchen accidents.
  - Mechanism: Six categories are defined based on kitchen accident statistics: appliance misuse (metal or flammable objects in a microwave), food spoilage (refrigerator door left open), falls/trips (cabinet door left open), fire hazards (stove left on), property damage (water-sensitive objects in the sink), and unsanitary conditions (target object on a dirty floor). Environmental condition predicates and corrective actions are defined for each category.
  - Design Motivation: These six categories span a complete risk spectrum from high-frequency events (falls/trips are the most common injury source) to high-consequence events (fires are the most destructive accident type).
- **Dual-Setting Evaluation (QA + Embodied):**
  - Function: Quantify the transfer gap from abstract safety knowledge to concrete action.
  - Mechanism: The same model is evaluated on the same scene under two independent instances: a QA instance in which the model acts as an external safety judge determining whether a hazard is present (validated through a two-stage structure + NLI pipeline), and an embodied instance in which the model generates the next action and subgoal frame-by-frame while performing a household task. The alignment rate \(\mathcal{A} = \frac{1}{K}\sum_{k=1}^{K}\mathbb{I}(v_{ik} = a_{ik})\), where \(v_{ik}\) is the QA verdict and \(a_{ik}\) the embodied mitigation outcome for the \(k\)-th hazard instance, measures the consistency between QA recognition and embodied mitigation.
  - Design Motivation: This design directly exposes the "knows but does not act" problem and constitutes a fundamental complement to existing QA-only evaluations.
- **Multi-Agent Framework:**
  - Function: Attempt to improve safety mitigation through role separation.
  - Mechanism: Hazard recognition and mitigation are decoupled: a dedicated safety-judge agent identifies hazards and communicates safety information to the embodied agent. This tests the hypothesis that a model explicitly informed of a hazard can mitigate it.
  - Design Motivation: If single-agent failure stems from task interference (task execution distracting attention from safety), then multi-agent division of labor should improve performance.
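The alignment rate from the dual-setting design reduces to a simple agreement count over hazard instances. A minimal sketch, with binary verdicts assumed (1 = hazard recognized / mitigated, 0 = not):

```python
def alignment_rate(qa_verdicts, embodied_outcomes):
    """A = (1/K) * sum_k 1[v_k == a_k]: fraction of hazard instances where
    the QA verdict agrees with the embodied mitigation outcome.
    Illustrative sketch, not the paper's evaluation code."""
    assert len(qa_verdicts) == len(embodied_outcomes)
    matches = sum(v == a for v, a in zip(qa_verdicts, embodied_outcomes))
    return matches / len(qa_verdicts)

# A model that recognizes 3 of 4 hazards in QA but mitigates only 2 of them:
print(alignment_rate([1, 1, 1, 0], [1, 1, 0, 0]))  # 0.75
```

Note that the metric rewards consistency in both directions: a model that neither recognizes nor mitigates a hazard still counts as aligned, which is why the paper reports it alongside the raw recognition and mitigation rates.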
### Loss & Training
This is an evaluation-only study; no model training is involved. All models are evaluated with temperature 0 and a 512-token output limit.
## Key Experimental Results
### Main Results
Performance of representative models (from the 11 MLLMs evaluated) on QA hazard recognition versus embodied hazard mitigation, both with scene metadata provided:
| Model | QA Recognition (w/ metadata) | Embodied Mitigation (w/ metadata) | Gap |
|---|---|---|---|
| Qwen 2.5 VL 72B | 60.8% | 12.3% | −48.5% |
| Qwen 3 VL 32B | 57.2% | 19.7% | −37.5% |
| Gemini 1.5 ER | 77.9% | 45.7% | −32.2% |
| Gemini 2.5 | 92.5% | 60.1% | −32.4% |
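The Gap column is simply embodied mitigation minus QA recognition; recomputing it from the table rows makes the sign convention explicit (negative means mitigation lags recognition):

```python
# (QA recognition %, embodied mitigation %), taken from the table above
results = {
    "Qwen 2.5 VL 72B": (60.8, 12.3),
    "Qwen 3 VL 32B":   (57.2, 19.7),
    "Gemini 1.5 ER":   (77.9, 45.7),
    "Gemini 2.5":      (92.5, 60.1),
}
for model, (qa, embodied) in results.items():
    gap = round(embodied - qa, 1)  # negative: mitigation lags recognition
    print(f"{model}: {gap:+.1f}")
```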
### Multi-Agent Improvement
| Model | Single-Agent Mitigation | Multi-Agent Mitigation | Gain |
|---|---|---|---|
| Gemma 3 27B | 7.0% | 25.1% | +18.1% |
| Qwen 3 VL 32B | 19.7% | 32.5% | +12.8% |
| Qwen 2.5 VL 72B | 12.3% | 28.5% | +16.2% |
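The multi-agent gains above come from decoupling recognition from action: a judge agent flags hazards, and the embodied agent prepends the corresponding corrective steps to its plan. A toy sketch of that pipeline (the function names, scene dictionary, and action strings are illustrative stand-ins for the paper's MLLM agents):

```python
def safety_judge(scene):
    """Judge agent: flags hazards active in the scene
    (stand-in for an MLLM safety-judge call)."""
    return [h for h in scene["hazards"] if h["active"]]

def embodied_agent(task_action, flagged):
    """Embodied agent: inserts mitigation steps for flagged hazards
    before the task action it would otherwise take."""
    plan = [h["mitigation"] for h in flagged]
    plan.append(task_action)
    return plan

scene = {"hazards": [
    {"name": "fire", "active": True,  "mitigation": "ToggleObjectOff(Stove)"},
    {"name": "spoilage", "active": False, "mitigation": "CloseObject(Fridge)"},
]}
print(embodied_agent("PickupObject(Knife)", safety_judge(scene)))
# ['ToggleObjectOff(Stove)', 'PickupObject(Knife)']
```

The paper's finding that gains remain partial suggests the hard part is not this message passing but the embodied agent reliably executing the inserted mitigation steps.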
### Key Findings
- Striking alignment gap: Even the strongest model, Gemini 2.5, translates its 92.5% QA recognition rate into only a 60.1% embodied mitigation rate.
- Models systematically prioritize task completion over hazard mitigation: Qwen 3 VL 32B achieves 80.7% action prediction accuracy on hazard-free frames but only 19.7% hazard mitigation success.
- Fire hazards are the only category with consistently strong performance across both settings (stove on/off states are straightforward to perceive and manipulate); other categories show large gaps.
- The multi-agent framework helps but does not fully resolve the problem: even when the safety-judge agent correctly identifies a hazard, the embodied agent may still fail to execute the mitigation action.
- Models frequently hallucinate hazards in safe scenes (>50% false-positive rate), exhibiting an over-cautious bias.
- Scaling model size generally decreases safety alignment — larger models recognize more hazards in QA but mitigate them at a disproportionately lower rate in embodied settings.
## Highlights & Insights
- The "knows but does not act" finding is highly impactful: it fundamentally challenges the validity of current MLLM safety evaluation practice. A large body of work assesses safety via QA or multiple-choice tasks, and this paper demonstrates that such assessment is insufficient.
- The controlled variable design is instructive: providing ground-truth history isolates safety reasoning, while visual-only and metadata-augmented modes disentangle perception from reasoning deficiencies.
- The multi-agent results reveal a deeper issue: the failure is not merely one of attention allocation. Models face a fundamental planning difficulty when required to interrupt a task flow to insert safety actions.
- The findings transfer to domains such as autonomous driving, where evaluation of planning under safety constraints is a general need.
## Limitations & Future Work
- The use of pre-rendered trajectories rather than real-time interaction does not fully represent real robotic scenarios.
- Only three model families (Qwen, Gemma, Gemini) are evaluated, limiting the generalizability of the conclusions.
- Kitchen hazards in the AI2-THOR simulator are simplified and cannot fully capture the complexity and unpredictability of real-world environments.
- Automatic evaluation of QA responses using NLI models may introduce bias.
- The paper does not explore improving embodied safety capabilities through training data augmentation.
## Related Work & Insights
- vs. ASIMOV / MM-SafetyBench: These benchmarks evaluate hazard recognition in static QA only; SafetyALFRED adds the embodied mitigation dimension and quantifies the gap between the two.
- vs. Son et al. / Chen et al.: The former is limited to text-based PDDL environments; the latter is limited to static AI-generated images. SafetyALFRED evaluates models in a multimodal simulated environment with navigation.
- Insight: Future safety evaluations should require models to enact safety rather than merely describe it; training data must include examples that balance safety and task objectives.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ First systematic quantification of the alignment gap between QA hazard recognition and embodied hazard mitigation; the problem formulation is highly original.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers 11 models, 6 hazard categories, and multiple evaluation metrics, though the use of pre-rendered trajectories is a simplification.
- Writing Quality: ⭐⭐⭐⭐ Problem motivation is clear, but the paper is lengthy and portions of the analysis are dispersed across the appendix.