SAFE: Multitask Failure Detection for Vision-Language-Action Models

Conference: NeurIPS 2025
arXiv: 2506.09937
Code: https://vla-safe.github.io/
Area: Robot Learning / VLA Safety
Keywords: Failure Detection, VLA Models, Multitask Generalization, Functional Conformal Prediction, MLP/LSTM Detector

TL;DR

SAFE identifies consistent "failure regions" in the internal feature space of VLA models that generalize across tasks. Leveraging this observation, it trains lightweight MLP/LSTM failure detectors and applies Functional Conformal Prediction (FCP) for threshold calibration. The approach achieves 78% ROC-AUC on unseen tasks with less than 1% computational overhead, substantially outperforming token-uncertainty and action-consistency baselines.

Background & Motivation

Background: VLA models (e.g., OpenVLA, π₀) achieve only 30–60% zero-shot success rates on unseen tasks. Real-world deployment requires automated in-execution failure detection to trigger human intervention or retry mechanisms.

Limitations of Prior Work: Existing failure detectors are task-specific—requiring failure rollout data for each new task. Token-uncertainty methods (logit-based) perform poorly on VLAs (ROC-AUC 45–60%). Action-consistency methods (STAC) demand 10× inference time.

Key Challenge: VLA models are designed for open-world tasks, making it infeasible to pre-collect failure data for every possible task. What is needed is a failure detector trained on seen tasks that generalizes to new ones.

Goal: Train efficient failure detectors that generalize across tasks without requiring any data collection for new tasks.

Key Insight: In the final-layer hidden states of VLAs, the feature distributions of successful and failed trajectories exhibit consistent separation patterns across different tasks. A simple MLP/LSTM can learn this "failure region" and generalize accordingly.

Core Idea: Train lightweight failure detectors (MLP/LSTM) on the last-layer hidden states of the VLA, combined with FCP-based threshold calibration, to achieve real-time, multitask-generalizable failure detection.

Method

Overall Architecture

During VLA execution → extract the last-layer hidden state \(\mathbf{e}_t\) at each step → MLP Detector (per-step scoring with cumulative sum \(s = \sum_t \sigma(g(\mathbf{e}_t))\)) or LSTM Detector (sequential processing \(s_t = \sigma(\text{LSTM}(\mathbf{e}_{0:t}))\)) → FCP Threshold Calibration (derive time-varying thresholds corresponding to confidence level \(\alpha\) using a validation set) → flag failure when threshold is exceeded.
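The pipeline above can be sketched as a simple monitoring loop. Everything below is an illustrative stand-in, not the paper's implementation: `score_step` plays the role of the per-step MLP \(g\), and `upper_band` stands for the FCP-calibrated time-varying threshold.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def monitor_rollout(hidden_states, score_step, upper_band):
    """Accumulate per-step scores s = sum_t sigma(g(e_t)) and flag the
    first step where s exceeds the calibrated time-varying band."""
    s = 0.0
    for t, e_t in enumerate(hidden_states):
        s += sigmoid(score_step(e_t))      # per-step detector score
        if s > upper_band[t]:              # FCP-style threshold at step t
            return t                       # failure flagged at step t
    return None                            # rollout passes monitoring

# Toy usage: a hypothetical linear scorer over a 10-step rollout.
rng = np.random.default_rng(0)
states = rng.normal(size=(10, 4))
scorer = lambda e: float(np.ones(4) @ e)
assert monitor_rollout(states, scorer, np.full(10, np.inf)) is None  # band never crossed
assert monitor_rollout(states, scorer, np.zeros(10)) == 0            # crossed immediately
```

The cumulative score is monotonically increasing (each increment is a sigmoid in (0, 1)), which is why the calibrated band must also grow over time rather than being a single fixed threshold.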

Key Designs

  1. "Failure Regions" in the VLA Feature Space:

    • Function: Discover and exploit failure patterns in VLA internal representations.
    • Mechanism: The last-layer hidden states \(\mathbf{e}_t\) (prior to action decoding) encode task execution status. Features from successful trajectories cluster in one region, while those from failed trajectories deviate into a distinct "failure region." This pattern is consistent across tasks.
    • Design Motivation: Compared to token probabilities/logits (which only reflect prediction confidence), hidden states encode richer information about execution status. Token uncertainty is nearly ineffective on VLAs (ROC-AUC 45–60%).
  2. Dual-Architecture Failure Detectors (MLP + LSTM):

    • Function: Predict failure probability from feature sequences.
    • Mechanism: MLP—independent per-step scoring followed by cumulative aggregation \(f_{MLP} = \sum_\tau \sigma(g(\mathbf{e}_\tau))\), with loss \(L = \sum_i [y_i \sum_t(1-s_t) + (1-y_i)\sum_t s_t]\) (where \(y_i = 1\) marks a failed rollout), encouraging low scores for successful and high scores for failed trajectories. LSTM—sequential processing to capture temporal dependencies \(f_{LSTM} = \sigma(\text{LSTM}(\mathbf{e}_{0:t}))\), trained with BCE loss.
    • Design Motivation: MLP is simpler and more efficient; LSTM captures temporal patterns (e.g., oscillatory behavior as a precursor to failure).
  3. Functional Conformal Prediction (FCP) Threshold Calibration:

    • Function: Calibrate time-varying thresholds on a validation set to guarantee false positive rate (FPR) ≤ \(\alpha\).
    • Mechanism: Compute upper confidence bands \(C_\alpha = \{[lower_t, upper_t]\}\) from scores on successful validation trajectories. At test time, failure is declared when \(s_t > upper_t\). Conformal prediction provides a theoretical guarantee: FPR ≤ \(\alpha\).
    • Design Motivation: Fixed thresholds do not transfer across different VLAs or tasks; FCP provides adaptive, theoretically grounded thresholds with false alarm rate guarantees.
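The threshold-calibration step (design 3 above) can be sketched as a per-timestep split-conformal band. This simplified numpy version captures the calibrate-then-compare mechanics only; the function names, the Gaussian stand-in scores, and the per-timestep quantile are assumptions, not the paper's full functional conformal prediction procedure.

```python
import numpy as np

def calibrate_upper_band(success_scores, alpha=0.05):
    """Per-timestep upper band from detector scores on successful
    validation rollouts (shape: [n_rollouts, T]), using the conformal
    quantile ceil((n+1)(1-alpha))/n of the empirical distribution."""
    n = success_scores.shape[0]
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(success_scores, q, axis=0)

def flag_failure(scores, upper_band):
    """Return the first step where scores exceed the band, else None."""
    over = np.flatnonzero(scores > upper_band)
    return int(over[0]) if over.size else None

# Toy calibration on 200 synthetic "successful" rollouts of length 50.
rng = np.random.default_rng(1)
cal = rng.normal(0.0, 1.0, size=(200, 50))
band = calibrate_upper_band(cal, alpha=0.05)
assert flag_failure(band - 0.1, band) is None   # trajectory under the band
assert flag_failure(band + 0.1, band) == 0      # trajectory over from step 0
```

The conformal quantile correction (using \(n+1\) rather than \(n\)) is what yields the finite-sample FPR ≤ \(\alpha\) guarantee under exchangeability of calibration and test rollouts.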

Loss & Training

  • MLP: Cumulative score loss (low for success / high for failure).
  • LSTM: Stepwise BCE loss.
  • Training data: Mixed successful and failed rollouts across multiple tasks.
  • Inference overhead: <1% (only last-layer feature extraction + MLP forward pass).
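A minimal sketch of the MLP branch's training signal, assuming a linear per-step scorer in place of the actual MLP and plain gradient descent; the synthetic feature clusters are invented for illustration. It only shows the cumulative loss pushing success scores toward 0 and failure scores toward 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss_and_grad(trajs, labels, w):
    """Cumulative-score loss L = sum_i [y_i sum_t(1-s_t) + (1-y_i) sum_t s_t]
    and its gradient, for a linear per-step scorer s_t = sigma(w @ e_t)
    (bias omitted for brevity); y_i = 1 marks a failed rollout."""
    loss, grad = 0.0, np.zeros_like(w)
    for traj, y in zip(trajs, labels):
        s = sigmoid(traj @ w)
        loss += y * np.sum(1.0 - s) + (1 - y) * np.sum(s)
        sign = -1.0 if y == 1 else 1.0        # d(1-s)/dw = -ds/dw
        grad += sign * (s * (1.0 - s)) @ traj  # chain rule through sigmoid
    return loss, grad

# Synthetic rollouts: success and failure features in separated clusters.
rng = np.random.default_rng(0)
success = [rng.normal(-0.5, 0.5, size=(30, 6)) for _ in range(5)]
failure = [rng.normal(+0.5, 0.5, size=(30, 6)) for _ in range(5)]
trajs, labels = success + failure, [0] * 5 + [1] * 5

w = np.zeros(6)
l0, _ = loss_and_grad(trajs, labels, w)
for _ in range(200):                           # plain gradient descent
    _, g = loss_and_grad(trajs, labels, w)
    w -= 0.05 * g
l1, _ = loss_and_grad(trajs, labels, w)
assert l1 < l0                                 # loss decreases during training
```

With clusters this cleanly separated, the scorer saturates quickly; real VLA features are noisier, which is part of why the cumulative (rather than per-step) score is used at detection time.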

Key Experimental Results

Main Results (ROC-AUC %)

| VLA Model | Benchmark  | Seen Tasks | Unseen Tasks |
|-----------|------------|------------|--------------|
| OpenVLA   | LIBERO     | 72.68      | 73.47        |
| π₀        | LIBERO     | 92.98      | 84.48        |
| π₀-FAST   | LIBERO     | 90.06      | 80.44        |
| π₀*       | SimplerEnv | 89.50      | 84.82        |
| Average   | All        | 81.43      | 78.00        |

Method Comparison

| Method                     | ROC-AUC | Inference Overhead |
|----------------------------|---------|--------------------|
| Token Uncertainty          | 48–54%  | ~0%                |
| Sample Consistency (STAC)  | ~71%    | 10×                |
| Embedding Distance         | 57–82%  | ~1%                |
| SAFE (MLP/LSTM)            | 78–85%  | <1%                |

Key Findings

  • Token uncertainty is nearly ineffective for VLAs—token probabilities do not reflect execution quality.
  • SAFE degrades by only 3.4 points on average on unseen tasks (81.43→78.00), indicating acceptable generalization.
  • On real robots: π₀-FAST + Franka achieves 64.16% on unseen tasks; OpenVLA + WidowX achieves 88.42%.
  • Detected failure modes include: imprecise insertion, oscillatory motion, grasp failure, and object slippage.
  • FCP converges to near-optimal thresholds with approximately 100 calibration samples.
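The last finding can be illustrated with a toy calibration experiment: with Gaussian stand-in scores (synthetic, not from the paper), a band estimated from 100 successful rollouts already lands close to one estimated from 5,000.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50  # rollout length

def band(n, alpha=0.05):
    """Upper band from n synthetic 'successful' rollouts, using the
    per-timestep conformal quantile as a simplified FCP stand-in."""
    scores = rng.normal(0.0, 1.0, size=(n, T))
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, q, axis=0)

b100, b5000 = band(100), band(5000)
# The small-sample band tracks the large-sample one closely on average.
assert np.mean(np.abs(b100 - b5000)) < 0.5
```

The small-sample band is slightly conservative (the conformal quantile uses \(n+1\)), which errs on the side of fewer false alarms.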

Highlights & Insights

  • The discovery of "failure regions" is an insightful finding: Failed trajectories form consistent clusters in the VLA hidden state space, suggesting that VLAs internally "know" they are failing—yet lack an explicit monitoring mechanism.
  • Extremely low computational overhead: Less than 1% additional cost (vs. 10× for STAC), making real-time deployment feasible.
  • Conformal prediction provides theoretical guarantees: The FPR ≤ \(\alpha\) guarantee is particularly valuable for safety-critical applications.

Limitations & Future Work

  • Validation is limited to manipulation tasks; navigation and mobile manipulation remain untested.
  • Only last-layer features are used; aggregating across multiple layers may yield stronger representations.
  • Training still requires collecting successful and failed rollouts; purely zero-shot operation is not supported.
  • Performance drops 8–13% on unseen tasks, which may be insufficient for safety-critical deployments.

Comparison with Related Methods

  • vs. Token Uncertainty: Token probabilities are ineffective for VLAs, whereas hidden-state features are substantially more informative—an important finding for the VLA safety community.
  • vs. STAC (Action Consistency): STAC requires multiple inference passes to check consistency; SAFE operates within a single inference pass.
  • vs. OOD Detection (LogpZO): OOD methods detect anomalous inputs, while SAFE detects execution failures—a more direct and practically relevant objective.

Rating

  • Novelty: ⭐⭐⭐⭐ The discovery of "failure regions" and the FCP calibration design are genuinely novel contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 4 VLAs × 2 benchmarks + real robots + multiple baselines + FCP analysis.
  • Writing Quality: ⭐⭐⭐⭐ Problem motivation is clearly articulated.
  • Value: ⭐⭐⭐⭐⭐ Addresses a core safety challenge in VLA deployment with strong practical utility.