Failure Prediction at Runtime for Generative Robot Policies

Conference: NeurIPS 2025 arXiv: 2510.09459 Code: GitHub Project Page: FIPER Website Authors: Ralf Römer, Adrian Kobras, Luca Worbis, Angela P. Schoellig (TUM Learning Systems and Robotics Lab) Area: Image Generation Keywords: Failure Prediction, Generative Policies, RND, Action Chunk Entropy, Conformal Prediction

TL;DR

This paper proposes FIPER, a framework for runtime failure prediction in generative robot policies (diffusion/flow matching). It jointly evaluates an observation-side metric RND-OE (OOD detection) and an action-side metric ACE (Action Chunk Entropy) to enable early and accurate failure prediction without any failure data, with statistical guarantees provided via conformal prediction.

Background & Motivation

Background: Generative imitation learning methods—including diffusion policy and flow matching—have achieved remarkable progress in recent years, enabling robots to perform complex long-horizon manipulation tasks. By learning multimodal conditional action distributions, these approaches demonstrate strong task generalization.

Limitations of Prior Work:

  • Distribution shifts at deployment (unseen environments, lighting changes, object position variations) or accumulated action errors can lead to unpredictable and dangerous behavior.
  • Existing OOD detection methods operate solely on the observation side, producing numerous false positives for benign OOD states that the policy can actually generalize to.
  • VLM-based methods can only retrospectively detect failures that have already occurred, offering no early warning capability.
  • Many methods rely on failure data collection, which is both unsafe and impractical in real-world settings.
  • Existing uncertainty measures fail to properly handle the multimodal action distributions of generative policies.

Key Challenge: Safe deployment requires precise runtime failure prediction, yet no single signal source—observation or action alone—can reliably distinguish genuine failure precursors from situations the policy can handle.

Goal: To provide an early failure prediction mechanism for generative robot policies at runtime, without requiring failure data, while minimizing false positives on benign OOD situations.

Key Insight: Based on the observation that failures are typically accompanied by both unfamiliar observations and incoherent actions, the paper designs a dual-metric detection framework that only triggers an alarm when both signals are simultaneously anomalous.

Core Idea: Combining observation-space OOD detection with action-space uncertainty quantification in a complementary, noise-reducing fashion, with conformal prediction for threshold calibration, to achieve runtime failure prediction requiring no failure data.

Method

Overall Architecture

FIPER (Failure Prediction at Runtime) is a modular runtime failure prediction framework built on a key insight: failures tend to coincide with both unfamiliar observations and ambiguous or incoherent actions. The framework consists of three core components:

  1. Observation-side metric RND-OE: Applies Random Network Distillation within the policy's own observation embedding space to detect whether the current observation deviates from the training distribution.
  2. Action-side metric ACE: Proposes Action Chunk Entropy, which samples multiple batches of action chunks from the conditional action distribution, computes entropy scores in end-effector space, and quantifies action uncertainty.
  3. Temporal window aggregation + dual-threshold triggering: Both scores are smoothed over a short temporal window, and calibrated thresholds from conformal prediction are applied; a failure alarm is raised only when both scores simultaneously exceed their respective thresholds.

Overall pipeline: observations are encoded by the policy encoder to obtain embeddings → RND-OE computes an OOD score → the policy samples multiple action chunk batches → ACE computes an entropy score → both scores are aggregated over a temporal window → dual-threshold detection → alarm triggered or not.
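The pipeline above can be sketched as a minimal runtime loop. This is a hedged illustration, not the authors' implementation: `policy`, `rnd_score`, `ace_score`, and the dictionary layout are all hypothetical stand-ins.

```python
from collections import deque
import numpy as np

def fiper_step(obs, policy, rnd_score, ace_score,
               windows, thresholds, n_samples=32):
    """One runtime step of a FIPER-style detector (illustrative sketch).

    policy     : generative policy with .encode(obs) and .sample(z, n)
    rnd_score  : callable embedding -> scalar OOD score (RND-OE stand-in)
    ace_score  : callable action-chunk batch -> scalar entropy (ACE stand-in)
    windows    : dict of deques holding recent scores per metric
    thresholds : dict of calibrated thresholds per metric
    """
    z = policy.encode(obs)                    # observation embedding
    chunks = policy.sample(z, n_samples)      # batch of action chunks
    windows["rnd"].append(rnd_score(z))
    windows["ace"].append(ace_score(chunks))
    # Smooth each score over the temporal window, then require BOTH
    # windowed scores to exceed their calibrated thresholds.
    alarm = all(np.mean(windows[k]) > thresholds[k] for k in ("rnd", "ace"))
    return alarm, chunks[0]                   # alarm flag + chunk to execute
```

The joint `all(...)` condition is what encodes the dual-threshold intersection logic: either score alone exceeding its threshold does not trigger an alarm.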

Key Designs

  1. Random Network Distillation in Observation Embeddings (RND-OE)

    • Function: Detects whether the current observation deviates from the policy's training data distribution.
    • Mechanism: RND is deployed in the policy's own observation embedding space rather than the raw pixel space. A randomly initialized teacher network is fixed, and a student network is trained to replicate the teacher's outputs on training embeddings. For in-distribution embeddings, the student accurately matches the teacher (low prediction error); for OOD embeddings, large prediction errors emerge and serve as the OOD signal.
    • Design Motivation: Performing OOD detection in raw observation space is easily perturbed by task-irrelevant visual changes (e.g., lighting, background texture), causing false positives. Using the policy's learned embedding space naturally filters out task-irrelevant visual variation and focuses on semantic representations that truly influence policy decisions, improving detection robustness.
  2. Action Chunk Entropy (ACE)

    • Function: Quantifies the uncertainty of actions produced by a generative policy at the current state.
    • Mechanism: A batch of action chunks is sampled from the policy's conditional action distribution. Each chunk is transformed into end-effector space, and an entropy score is computed over the batch. The specially designed entropy measure distinguishes between "multimodal yet each mode is confident" (benign, low uncertainty) and "disordered across modes" (high uncertainty / failure precursor).
    • Design Motivation: A core strength of generative policies (diffusion/flow matching) is the ability to learn multimodal action distributions. Conventional variance/entropy measures incorrectly classify reasonable multimodal behavior as high uncertainty. ACE is computed in end-effector space and properly accounts for temporal modal consistency, effectively distinguishing benign multimodality from genuine action confusion.
  3. Conformal Prediction Calibration and Dual-Metric Joint Decision

    • Function: Sets statistically guaranteed thresholds for both metrics and reduces false positives through joint decision-making.
    • Mechanism: A small set of successful demonstration rollouts (50 in simulation, only 10 in the real world) serves as the calibration set. Conformal prediction is used to independently derive thresholds for RND-OE and ACE. At inference time, both scores are averaged over a short moving window to smooth noise; a failure alarm is triggered only when both scores simultaneously exceed their respective thresholds within the window.
    • Design Motivation: A single metric tends to produce characteristic false positives—RND-OE is sensitive to benign OOD, while ACE may miss failures in which the observation is anomalous but the action appears confident. The intersection logic of dual metrics naturally filters out each metric's specific false positive sources. Conformal prediction provides a statistical upper bound on the false positive rate, enhancing the method's credibility in safety-critical scenarios.
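As a concrete illustration of the ACE idea, the sketch below estimates per-timestep entropy of sampled end-effector trajectories with a simple nearest-neighbor (Kozachenko–Leonenko-style) estimator. This is an assumed stand-in for the paper's entropy measure, which may differ; the input is assumed to already be in end-effector space.

```python
import numpy as np

def knn_entropy(points, k=3):
    """Crude kNN (Kozachenko-Leonenko-style) differential entropy estimate
    for a (batch, dim) cloud of samples; higher = more dispersed."""
    n, d = points.shape
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)             # exclude self-distances
    r_k = np.sort(dists, axis=1)[:, k - 1]      # distance to k-th neighbor
    return d * np.mean(np.log(r_k + 1e-12)) + np.log(n - 1)

def action_chunk_entropy(chunks, k=3):
    """ACE stand-in: chunks has shape (batch, horizon, ee_dim).
    Average the per-timestep entropy over the chunk horizon."""
    return float(np.mean([knn_entropy(chunks[:, t], k)
                          for t in range(chunks.shape[1])]))
```

A neighbor-distance estimator captures the distinction the paper draws: two tight modes still give each sample close neighbors within its own mode, so benign multimodality scores low, whereas widely scattered samples score high, unlike a global variance measure, which penalizes both.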

Loss & Training

  • RND student network training: Trained using MSE loss \(\mathcal{L}_{\text{RND}} = \| f_\theta(\mathbf{z}) - f_{\text{teacher}}(\mathbf{z}) \|^2\) on observation embeddings from successful rollouts, where \(\mathbf{z}\) is the output of the policy encoder.
  • No policy training data required: Neither RND training nor ACE computation requires access to the policy's original training dataset.
  • No failure data required: The entire calibration process relies solely on successful rollouts, without any failure demonstrations.
  • Conformal prediction calibration: The \((1-\alpha)\) quantile of scores on the calibration set is used as the threshold, where \(\alpha\) controls the upper bound on the allowed false positive rate.
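The training and calibration recipe above can be sketched in NumPy. The tiny tanh teacher, linear student, and layer sizes are illustrative assumptions (the paper's RND architectures are not given here), and the threshold uses the plain empirical \((1-\alpha)\) quantile.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_rnd(dim, out=16):
    """Fixed random tanh 'teacher' and a trainable linear 'student'."""
    teacher_w = rng.standard_normal((dim, out)) / np.sqrt(dim)
    student_w = np.zeros((dim, out))
    return teacher_w, student_w

def train_student(teacher_w, student_w, embeddings, lr=0.1, epochs=300):
    """Gradient descent on the MSE loss ||f_student(z) - f_teacher(z)||^2
    over in-distribution observation embeddings z."""
    target = np.tanh(embeddings @ teacher_w)        # fixed teacher outputs
    for _ in range(epochs):
        err = embeddings @ student_w - target       # student residual
        student_w -= lr * embeddings.T @ err / len(embeddings)
    return student_w

def rnd_score(teacher_w, student_w, z):
    """Prediction error as OOD score: small in-distribution, large OOD."""
    return float(np.sum((z @ student_w - np.tanh(z @ teacher_w)) ** 2))

def conformal_threshold(calib_scores, alpha=0.05):
    """(1 - alpha) empirical quantile of calibration scores as threshold."""
    return float(np.quantile(calib_scores, 1.0 - alpha))
```

Because the linear student can fit the teacher only on the training distribution, its prediction error stays small in-distribution and grows sharply on embeddings far from it, which is exactly the signal RND-OE thresholds.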

Key Experimental Results

Main Results

FIPER is evaluated across 5 diverse task environments (3 simulation + 2 real-world), covering both diffusion policy and flow matching, and spanning multiple task types, including grasping, rope manipulation, and bimanual coordination.

| Method | Accuracy | Early Prediction Time | False Positive Rate | Benign OOD Discrimination |
|---|---|---|---|---|
| OOD-only (RND) | Moderate | Earlier | High | Poor (benign OOD also triggers) |
| Action-only (Variance) | Moderate | Moderate | Moderate | Moderate |
| VLM-based | Relatively high | Late (retrospective) | Low | — |
| FIPER (RND-OE + ACE) | Highest | Earliest | Lowest | Best |
  • FIPER outperforms all baselines across all 5 environments, with especially significant advantages in distinguishing genuine failures from benign OOD situations.
  • In real-world rope manipulation tasks, FIPER can provide warnings several seconds before failure occurs, allowing sufficient time for human intervention.

Ablation Study

| Ablation Configuration | Result |
|---|---|
| RND-OE only | Detects OOD but high false positive rate; benign OOD frequently triggers alarms |
| ACE only | Captures action uncertainty but misses failures caused by observation anomalies |
| RND-OE + ACE (no temporal window) | Unstable detection; single-frame noise causes spurious triggers |
| RND-OE + ACE (with temporal window) | Prediction stability and accuracy significantly improved |
| Raw pixel-space RND vs. embedding-space RND-OE | Embedding-space version is more robust with fewer false positives |
| Different policy types (diffusion vs. flow matching) | FIPER is effective on both, validating framework generality |
| Calibration data volume sensitivity | 50 successful rollouts in simulation and 10 in the real world are sufficient |

Key Findings

  1. Dual-metric complementarity is central: Neither RND-OE nor ACE alone can reliably predict failures; their combination substantially reduces false positives while maintaining high detection rates.
  2. Embedding space >> raw space: Applying RND in the policy's embedding space is more effective than in the raw pixel space, as the embedding space filters out task-irrelevant visual variation.
  3. ACE correctly handles multimodality: Conventional variance measures incorrectly flag multimodal action distributions as high uncertainty; ACE properly distinguishes "multimodal but confident" from "genuinely disordered."
  4. Minimal calibration data is sufficient: Only 10 successful rollouts in the real world are needed for effective calibration, demonstrating strong practicality.
  5. Cross-policy generalization: The same framework performs well on both diffusion policy and flow matching without task-specific modifications.

Highlights & Insights

  1. Zero failure data requirement: From a safe deployment perspective, independence from failure data is a substantial practical advantage—collecting failure data is itself dangerous and costly.
  2. Input–output dual-side detection: The observation side checks "whether what is seen is anomalous," and the action side checks "whether what is to be done is incoherent." This I/O dual-side detection philosophy is both elegant and effective.
  3. Deep understanding of generative policies: The design of ACE reflects a thorough understanding of the core characteristic of diffusion/flow matching models—namely, producing multimodal action distributions—rather than naively applying conventional uncertainty methods.
  4. Statistical guarantees enhance credibility: Conformal prediction provides a mathematical upper bound on false positive rates, which is critical for safety-critical robot deployment.
  5. Interpretability: The framework can distinguish whether a failure is caused by an observation anomaly or action confusion, providing valuable diagnostic information for debugging and human–robot interaction.
  6. Modular design: RND-OE and ACE function as independent plug-and-play modules that can be attached to any generative policy without modifying the policy itself.

Limitations & Future Work

  1. Theoretical assumptions of conformal prediction: Coverage guarantees rely on the assumption that calibration and deployment data are exchangeable, which may not hold in highly non-stationary or adversarial environments.
  2. Passive prediction rather than active recovery: The current framework only predicts failures and does not integrate active recovery mechanisms (e.g., automatically requesting human takeover, switching to a safe policy, or executing fallback actions).
  3. Temporal window hyperparameter: The window size must be manually selected; different tasks may require different window lengths, and no adaptive tuning mechanism is provided.
  4. Detection latency in extreme unseen scenarios: In novel scenarios that differ substantially from the training distribution, some detection delay may still occur.
  5. End-effector space assumption: ACE is computed in end-effector space, which may require redefinition of an appropriate action space for non-manipulation tasks such as navigation.
  • Random Network Distillation (RND) was originally proposed for curiosity-driven exploration in deep reinforcement learning (Burda et al., 2019); this paper innovatively transfers it to the observation embedding space for OOD detection.
  • Conformal prediction is increasingly popular in uncertainty quantification (Angelopoulos & Bates, 2023); this paper demonstrates its practical value for providing guaranteed threshold calibration in robot safety monitoring.
  • Compared to ensemble-based uncertainty methods, FIPER does not require training multiple policy copies, making it more computationally tractable.
  • The dual-metric joint-triggering design philosophy is generalizable to other safety-critical systems requiring low false positive detection rates.

Rating

  • Novelty: ⭐⭐⭐⭐ — The joint observation-plus-action dual-side detection framework is novel; ACE as a tailored entropy measure for multimodal action distributions represents a valuable technical contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Five environments spanning simulation and the real world, two policy types, and thorough ablations; some quantitative comparison details could be more comprehensive.
  • Value: ⭐⭐⭐⭐⭐ — Requires no failure data, no policy training data, minimal calibration data, lightweight computation, and plug-and-play deployment; extremely practical.
  • Writing Quality: ⭐⭐⭐⭐ — Motivation is clearly articulated, the method is intuitive, and the project page videos provide effective illustration.
