SAFE: Multitask Failure Detection for Vision-Language-Action Models¶
Conference: NeurIPS 2025 arXiv: 2506.09937 Code: https://vla-safe.github.io/ Area: Robot Learning / VLA Safety Keywords: Failure Detection, VLA Models, Multitask Generalization, Functional Conformal Prediction, MLP/LSTM Detector
TL;DR¶
SAFE identifies consistent "failure regions" in the internal feature space of VLA models that generalize across tasks. Leveraging this observation, it trains lightweight MLP/LSTM failure detectors and applies Functional Conformal Prediction (FCP) for threshold calibration. The approach achieves 78% ROC-AUC on unseen tasks with less than 1% computational overhead, substantially outperforming token-uncertainty and action-consistency baselines.
Background & Motivation¶
Background: VLA models (e.g., OpenVLA, π₀) achieve only 30–60% zero-shot success rates on unseen tasks. Real-world deployment requires automated in-execution failure detection to trigger human intervention or retry mechanisms.
Limitations of Prior Work: Existing failure detectors are task-specific—requiring failure rollout data for each new task. Token-uncertainty methods (logit-based) perform poorly on VLAs (ROC-AUC 45–60%). Action-consistency methods (STAC) demand 10× inference time.
Key Challenge: VLA models are designed for open-world tasks, making it infeasible to pre-collect failure data for every possible task. What is needed is a failure detector trained on seen tasks that generalizes to new ones.
Goal: Train efficient failure detectors that generalize across tasks without requiring any data collection for new tasks.
Key Insight: In the final-layer hidden states of VLAs, the feature distributions of successful and failed trajectories exhibit consistent separation patterns across different tasks. A simple MLP/LSTM can learn this "failure region" and generalize accordingly.
Core Idea: Train lightweight failure detectors (MLP/LSTM) on the last-layer hidden states of the VLA, combined with FCP-based threshold calibration, to achieve real-time, multitask-generalizable failure detection.
Method¶
Overall Architecture¶
During VLA execution → extract the last-layer hidden state \(\mathbf{e}_t\) at each step → MLP Detector (per-step scoring with cumulative sum \(s = \sum_t \sigma(g(\mathbf{e}_t))\)) or LSTM Detector (sequential processing \(s_t = \sigma(\text{LSTM}(\mathbf{e}_{0:t}))\)) → FCP Threshold Calibration (derive time-varying thresholds corresponding to confidence level \(\alpha\) using a validation set) → flag failure when threshold is exceeded.
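The monitoring loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the detector weights `W`, the hidden dimension, the placeholder hidden states, and the calibrated threshold band are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a frozen VLA that emits a last-layer hidden
# state e_t per step, and a linear scorer g(.) in place of the MLP.
HIDDEN_DIM = 8
W = rng.normal(size=HIDDEN_DIM)  # assumed pretrained detector weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def per_step_score(e_t):
    """sigma(g(e_t)): per-step failure score in [0, 1]."""
    return sigmoid(W @ e_t)

# Time-varying upper thresholds, assumed already produced by FCP
# calibration (here just an illustrative linearly growing band).
T = 20
upper = 0.7 * np.arange(1, T + 1)

# Monitor one rollout: accumulate s = sum_t sigma(g(e_t)) and flag
# failure at the first step where s exceeds the calibrated band.
s, failed_at = 0.0, None
for t in range(T):
    e_t = rng.normal(size=HIDDEN_DIM)  # placeholder hidden state
    s += per_step_score(e_t)
    if s > upper[t]:
        failed_at = t
        break
```

The cumulative score grows monotonically, so the threshold band must also grow with time; that is why FCP calibrates a per-step threshold rather than a single constant.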
Key Designs¶
- "Failure Regions" in the VLA Feature Space:
- Function: Discover and exploit failure patterns in VLA internal representations.
- Mechanism: The last-layer hidden states \(\mathbf{e}_t\) (prior to action decoding) encode task execution status. Features from successful trajectories cluster in one region, while those from failed trajectories deviate into a distinct "failure region." This pattern is consistent across tasks.
- Design Motivation: Compared to token probabilities/logits (which only reflect prediction confidence), hidden states encode richer information about execution status. Token uncertainty is nearly ineffective on VLAs (ROC-AUC 45–60%).
- Dual-Architecture Failure Detectors (MLP + LSTM):
- Function: Predict failure probability from feature sequences.
- Mechanism: MLP—independent per-step scoring followed by cumulative aggregation \(f_{MLP} = \sum_\tau \sigma(g(\mathbf{e}_\tau))\), with per-step scores \(s_t = \sigma(g(\mathbf{e}_t))\) and loss \(L = \sum_i [y_i \sum_t(1-s_t) + (1-y_i)\sum_t s_t]\) (where \(y_i = 1\) marks failed rollouts), encouraging low scores for successful and high scores for failed trajectories. LSTM—sequential processing to capture temporal dependencies \(f_{LSTM} = \sigma(\text{LSTM}(\mathbf{e}_{0:t}))\), trained with BCE loss.
- Design Motivation: MLP is simpler and more efficient; LSTM captures temporal patterns (e.g., oscillatory behavior as a precursor to failure).
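The MLP scoring rule and its loss can be written out directly. A minimal sketch, assuming a linear scorer in place of the MLP and the convention \(y_i = 1\) for failed rollouts; the function names and toy data are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_scores(traj, W):
    """Per-step scores s_t = sigma(g(e_t)) for one rollout.

    traj: (T, D) hidden states; W: (D,) linear scorer standing in
    for the MLP g(.). Returns (T,) scores in [0, 1]."""
    return sigmoid(traj @ W)

def safe_mlp_loss(trajs, labels, W):
    """L = sum_i [ y_i * sum_t (1 - s_t) + (1 - y_i) * sum_t s_t ].

    y_i = 1 for failed rollouts (pushes scores toward 1),
    y_i = 0 for successful rollouts (pushes scores toward 0)."""
    loss = 0.0
    for traj, y in zip(trajs, labels):
        s = mlp_scores(traj, W)
        loss += y * np.sum(1.0 - s) + (1 - y) * np.sum(s)
    return loss
```

A detector that separates the two classes drives both terms toward zero, which is exactly the "low for success / high for failure" behavior described above.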
- Functional Conformal Prediction (FCP) Threshold Calibration:
- Function: Calibrate time-varying thresholds on a validation set to guarantee false positive rate (FPR) ≤ \(\alpha\).
- Mechanism: Compute upper confidence bands \(C_\alpha = \{[l_t, u_t]\}_t\) from scores on successful validation trajectories. At test time, failure is declared when \(s_t > u_t\). Conformal prediction provides a theoretical guarantee: FPR ≤ \(\alpha\).
- Design Motivation: Fixed thresholds do not transfer across different VLAs or tasks; FCP provides adaptive, theoretically grounded thresholds with false alarm rate guarantees.
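A simplified split-conformal version of this calibration can be sketched as below. This is not the paper's exact FCP procedure: it builds the band from a per-step mean/std modulation and a single conformal quantile over whole-trajectory deviations, which is one common way to get a functional band with finite-sample coverage.

```python
import numpy as np

def calibrate_band(cal_scores, alpha=0.05):
    """Simplified functional conformal band over score trajectories.

    cal_scores: (N, T) detector scores on N *successful* calibration
    rollouts. Returns per-step upper thresholds u_t such that a new
    successful rollout stays below u_t for all t with prob >= 1 - alpha.
    """
    N, T = cal_scores.shape
    mu = cal_scores.mean(axis=0)
    sd = cal_scores.std(axis=0) + 1e-8
    # Nonconformity: worst normalized deviation over the trajectory.
    r = np.max((cal_scores - mu) / sd, axis=1)  # (N,)
    # Conformal quantile with finite-sample correction.
    k = int(np.ceil((N + 1) * (1 - alpha)))
    q = np.sort(r)[min(k, N) - 1]
    return mu + q * sd  # (T,) upper thresholds

def flag_failure(scores, upper):
    """First step where s_t > u_t, or -1 if the rollout never crosses."""
    over = np.nonzero(scores > upper)[0]
    return int(over[0]) if over.size else -1
```

Because the band is calibrated only on successful rollouts, crossing it is evidence of deviation from normal execution; the finite-sample quantile correction is what yields the FPR ≤ \(\alpha\) guarantee.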
Loss & Training¶
- MLP: Cumulative score loss (low for success / high for failure).
- LSTM: Stepwise BCE loss.
- Training data: Mixed successful and failed rollouts across multiple tasks.
- Inference overhead: <1% (only last-layer feature extraction + MLP forward pass).
Key Experimental Results¶
Main Results (ROC-AUC %)¶
| VLA Model | Benchmark | Seen Tasks | Unseen Tasks |
|---|---|---|---|
| OpenVLA | LIBERO | 72.68 | 73.47 |
| π₀ | LIBERO | 92.98 | 84.48 |
| π₀-FAST | LIBERO | 90.06 | 80.44 |
| π₀* | SimplerEnv | 89.50 | 84.82 |
| Average | All | 81.43 | 78.00 |
Method Comparison¶
| Method | ROC-AUC | Inference Overhead |
|---|---|---|
| Token Uncertainty | 48–54% | ~0% |
| Sample Consistency (STAC) | ~71% | 10× |
| Embedding Distance | 57–82% | ~1% |
| SAFE (MLP/LSTM) | 78–85% | <1% |
Key Findings¶
- Token uncertainty is nearly ineffective for VLAs—token probabilities do not reflect execution quality.
- SAFE degrades by only 3.4% on unseen tasks (81.43→78.00), indicating acceptable generalization.
- On real robots: π₀-FAST + Franka achieves 64.16% on unseen tasks; OpenVLA + WidowX achieves 88.42%.
- Detected failure modes include: imprecise insertion, oscillatory motion, grasp failure, and object slippage.
- FCP converges to near-optimal thresholds with approximately 100 calibration samples.
Highlights & Insights¶
- The discovery of "failure regions" is an insightful finding: Failed trajectories form consistent clusters in the VLA hidden state space, suggesting that VLAs internally "know" they are failing—yet lack an explicit monitoring mechanism.
- Extremely low computational overhead: Less than 1% additional cost (vs. 10× for STAC), making real-time deployment feasible.
- Conformal prediction provides theoretical guarantees: The FPR ≤ \(\alpha\) guarantee is particularly valuable for safety-critical applications.
Limitations & Future Work¶
- Validation is limited to manipulation tasks; navigation and mobile manipulation remain untested.
- Only last-layer features are used; aggregating across multiple layers may yield stronger representations.
- Training still requires collecting successful and failed rollouts; purely zero-shot operation is not supported.
- Performance drops 8–13% on unseen tasks, which may be insufficient for safety-critical deployments.
Related Work & Insights¶
- vs. Token Uncertainty: Token probabilities are ineffective for VLAs, whereas hidden-state features are substantially more informative—an important finding for the VLA safety community.
- vs. STAC (Action Consistency): STAC requires multiple inference passes to check consistency; SAFE operates within a single inference pass.
- vs. OOD Detection (LogpZO): OOD methods detect anomalous inputs, while SAFE detects execution failures—a more direct and practically relevant objective.
Rating¶
- Novelty: ⭐⭐⭐⭐ The discovery of "failure regions" and the FCP calibration design are genuinely novel contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 4 VLAs × 2 benchmarks + real robots + multiple baselines + FCP analysis.
- Writing Quality: ⭐⭐⭐⭐ Problem motivation is clearly articulated.
- Value: ⭐⭐⭐⭐⭐ Addresses a core safety challenge in VLA deployment with strong practical utility.