HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios
Conference: CVPR 2025
arXiv: 2603.11975
Code: Yes
Area: Multimodal VLM / AI Safety
Keywords: Household safety, unsafe action detection, VLM evaluation, embodied AI, dual-brain architecture
TL;DR
HomeSafe-Bench is the first benchmark for evaluating VLMs on unsafe action detection in household scenarios (438 cases across 6 functional areas). The paper also proposes HD-Guard, a hierarchical streaming architecture that coordinates a lightweight FastBrain with a large-scale SlowBrain to achieve real-time safety monitoring.
Background & Motivation
Background: Household robots are developing rapidly, but household environments introduce unpredictable safety risks (e.g., perception delays, or a lack of common sense that leads to dangerous operations). Existing safety evaluations are mostly limited to static images, text, or general harms.
Limitations of Prior Work: (1) There is no standardized benchmark for dynamic unsafe action detection; (2) household scenarios are more complex and variable than industrial environments, so deciding whether an action is safe requires contextual understanding; (3) the capabilities and bottlenecks of VLMs in safety detection remain unclear.
Key Challenge: Real-time safety monitoring demands low latency, whereas accurate unsafe action detection requires deep multimodal reasoning, so the two are hard to achieve at the same time.
Goal: To build an evaluation benchmark and design a real-time safety monitoring architecture.
Key Insight: (1) Construct a diverse unsafe action dataset through a hybrid pipeline of physical simulation and video generation; (2) utilize a dual-brain architecture to balance inference efficiency and detection accuracy.
Core Idea: FastBrain performs high-frequency lightweight screening while SlowBrain conducts asynchronous deep reasoning; coordinating the two enables real-time safety monitoring.
Method
Overall Architecture
HomeSafe-Bench contains 438 unsafe cases covering 6 functional areas, such as kitchens and living rooms, with multi-dimensional fine-grained annotations. During inference, HD-Guard coordinates its two models: FastBrain continuously screens video frames at high frequency and triggers SlowBrain for deep multimodal analysis when it detects a suspicious action.
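
The screening-then-escalation flow described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `fast_brain_score`, `slow_brain_is_unsafe`, the `Frame` type, the `context` window, and the numeric thresholds are all hypothetical stand-ins, and a score threshold is assumed as the trigger condition.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical stand-in for a decoded video frame.
Frame = Dict[str, float]


def fast_brain_score(frame: Frame) -> float:
    """Placeholder for the lightweight screener (assumed to return a score in [0, 1])."""
    return frame["risk"]


def slow_brain_is_unsafe(clip: List[Frame]) -> bool:
    """Placeholder for the large VLM; True means the clip is judged unsafe."""
    return max(f["risk"] for f in clip) > 0.8


def monitor(frames: List[Frame], threshold: float = 0.5, context: int = 4) -> List[int]:
    """Return indices of frames that the slow model confirms as unsafe."""
    pending: List[Tuple[int, Future]] = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        for i, frame in enumerate(frames):
            # The fast path runs on every frame.
            if fast_brain_score(frame) > threshold:
                # Escalate a short clip ending at this frame; submit() returns
                # immediately, so screening of later frames is not blocked.
                clip = frames[max(0, i - context + 1): i + 1]
                pending.append((i, pool.submit(slow_brain_is_unsafe, clip)))
    # Leaving the `with` block waits for all escalations to finish.
    return [i for i, future in pending if future.result()]


demo = [{"risk": r} for r in (0.1, 0.6, 0.9, 0.2)]
alerts = monitor(demo)
```

In a live stream the verdicts would be consumed as each future completes rather than after the loop; the sketch collects them at the end only to keep the example short.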
Key Designs
- Hybrid Data Construction Pipeline:
    - Function: Generate diverse and realistic videos of unsafe actions
    - Mechanism: A physical simulator generates basic scenes and actions, combined with advanced video generation models to enhance visual realism, followed by human annotation of unsafe categories, severity, and context
    - Design Motivation: Pure simulation lacks realism, while pure real-world data is difficult to scale to cover sufficient unsafe scenarios
- Hierarchical Dual-Brain Guard (HD-Guard):
    - Function: Real-time safety monitoring architecture
    - Mechanism: FastBrain is a lightweight model (e.g., a small ViT) that scans video frames at high frequency and outputs a quick safety score for each frame. When a score exceeds a threshold, it asynchronously triggers SlowBrain (a large VLM such as GPT-4V), which conducts deep multimodal reasoning over vision, language, and common-sense knowledge to make the final decision
    - Design Motivation: Analogous to the human fast and slow systems (System 1/2): fast intuitive judgment suffices most of the time, and deep reasoning is initiated only when necessary
- Multi-dimensional Fine-grained Annotation:
    - Function: Support systematic evaluation
    - Mechanism: Each case is annotated with the type of unsafety (e.g., collision, fall, fire), severity, involved objects, and contextual dependency. The division into 6 functional areas ensures the evaluation covers various typical spaces in a home
    - Design Motivation: Coarse-grained "safe/unsafe" binary classification is insufficient to diagnose specific weaknesses of the models
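
To illustrate why fine-grained annotation helps diagnosis, the sketch below models one annotated case and groups detection rates by any annotation field. The field names, the integer severity scale, and the example records are assumptions made for illustration; the benchmark's actual schema is not given here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class UnsafeCase:
    """Hypothetical record mirroring the annotation dimensions listed above."""
    case_id: str
    area: str                 # functional area, e.g. "kitchen"
    hazard_type: str          # e.g. "collision", "fall", "fire"
    severity: int             # assumed ordinal scale; the real scale is not specified
    objects: Tuple[str, ...]  # objects involved in the unsafe action
    context_dependent: bool   # whether safety depends on the surrounding context


def detection_rate_by(cases: List[UnsafeCase],
                      detected: Dict[str, bool],
                      field: str) -> Dict[str, float]:
    """Fraction of unsafe cases a model flagged, grouped by one annotation field."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for case in cases:
        group = str(getattr(case, field))
        totals[group] += 1
        hits[group] += int(detected.get(case.case_id, False))
    return {group: hits[group] / totals[group] for group in totals}


cases = [
    UnsafeCase("c1", "kitchen", "fire", 3, ("stove", "towel"), False),
    UnsafeCase("c2", "kitchen", "collision", 1, ("cup",), True),
    UnsafeCase("c3", "living room", "fall", 2, ("vase",), True),
]
detected = {"c1": True, "c2": False, "c3": True}
by_area = detection_rate_by(cases, detected, "area")
by_context = detection_rate_by(cases, detected, "context_dependent")
```

Grouping by `context_dependent` rather than `area` is the kind of breakdown a binary safe/unsafe label cannot provide.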
Loss & Training
FastBrain can be fine-tuned with a small amount of annotated data, while SlowBrain uses pretrained VLMs for zero/few-shot inference.
Key Experimental Results
Main Results
| Method | Detection Accuracy | Latency | Explanation |
|---|---|---|---|
| HD-Guard | Best trade-off | Low | Dual-brain coordination |
| Large VLM only | Highest accuracy | Very High | Unsuitable for real-time |
| Lightweight model only | Lower | Lowest | Severe missed detections |
Ablation Study
| Configuration | Performance | Explanation |
|---|---|---|
| FastBrain + SlowBrain | Best | Complementary coordination |
| FastBrain only | High missed detections | Lacks deep reasoning |
| SlowBrain only | High latency | Unable to run in real-time |
| Different trigger thresholds | Trade-off exists | Lower threshold -> More SlowBrain calls |
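
The threshold trade-off in the last row can be made concrete with a small sweep. The scores and labels below are made-up toy values, not results from the paper, and `sweep_thresholds` is a hypothetical helper; the point is only to show how the number of SlowBrain calls and the number of unsafe frames that are never escalated move in opposite directions.

```python
from typing import List, Tuple


def sweep_thresholds(scores: List[float],
                     unsafe: List[bool],
                     thresholds: List[float]) -> List[Tuple[float, int, int]]:
    """For each threshold, return (threshold, slow_brain_calls, unsafe_frames_not_escalated)."""
    rows = []
    for t in thresholds:
        triggered = [s > t for s in scores]
        calls = sum(triggered)
        missed = sum(1 for trig, is_unsafe in zip(triggered, unsafe)
                     if is_unsafe and not trig)
        rows.append((t, calls, missed))
    return rows


# Toy per-frame FastBrain scores and ground-truth labels (illustrative only).
scores = [0.1, 0.4, 0.6, 0.9]
unsafe = [False, True, False, True]
rows = sweep_thresholds(scores, unsafe, [0.3, 0.5, 0.8])
for threshold, calls, missed in rows:
    print(f"threshold={threshold:.1f}  slow_calls={calls}  missed_unsafe={missed}")
```

With these toy values, lowering the threshold from 0.8 to 0.3 raises SlowBrain calls from 1 to 3 while the count of unescalated unsafe frames drops from 1 to 0.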
Key Findings
- Existing VLMs fall well short of reliable unsafe action detection, especially in scenarios requiring common-sense reasoning (e.g., determining whether "a knife pointing towards a toddler" is hazardous).
- The fast-slow dual-brain coordination of HD-Guard significantly outperforms single-model strategies.
- Contextual dependency is a key bottleneck: the exact same action may be safe or unsafe depending on the context.
Highlights & Insights
- Critical Safety Evaluation Gap: This is the first work to systematically evaluate the capability of VLMs in household unsafe action detection, which holds direct significance for embodied AI safety.
- Practicality of the Dual-Brain Architecture: The analogy to System 1/2 is intuitive and effective; this architecture can be transferred to other scenarios requiring real-time monitoring and deep analysis.
- Hybrid Data Construction: The simulation + generation pipeline presents a practical solution to address the scarcity of safety-related data.
Limitations & Future Work
- The scale of 438 cases is relatively small and may not cover all unsafe scenarios.
- Unsafe actions generated by video models may exhibit a distribution shift relative to real-world actions.
- The optimal balance between the false positive rate of FastBrain and the invocation frequency of SlowBrain requires scenario-specific tuning.
- Safety concerns in multi-person interaction scenarios are not considered.
Related Work & Insights
- vs SafetyBench (Text Safety): Text-based safety evaluations do not involve visual and physical interactions, whereas HomeSafe-Bench is closer to embodied scenarios.
- vs RoboCasa/Habitat: Simulation platforms provide environments but do not focus on safety evaluations; HomeSafe-Bench fills this gap.
- vs System 1/2 Architectures: HD-Guard serves as a concrete implementation of dual-system theory in cognitive science for AI safety.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of a new benchmark and a dual-brain architecture is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐ Includes multi-VLM comparisons and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ The problem motivation and methodology are clearly described.
- Value: ⭐⭐⭐⭐⭐ Holds significant practical importance for embodied AI safety.