Detecting High-Stakes Interactions with Activation Probes

Conference: NeurIPS 2025 | arXiv: 2506.10805 | Code: GitHub
Area: AI Safety / LLM Monitoring
Keywords: activation probes, high-stakes detection, cascaded monitoring, synthetic data, LLM safety

TL;DR

Linear activation probes (lightweight classifiers trained on LLM internal representations) are used to detect "high-stakes interactions" from users. Trained on synthetic data, these probes achieve AUROC of 0.88–0.92 across 6 real-world datasets, matching fine-tuned 8–12B LLMs at a computational cost six orders of magnitude lower. A cascaded architecture (probe pre-filtering + LLM refinement) further surpasses either component used alone.

Background & Motivation

Background: Detecting high-stakes interactions in LLM deployments (e.g., medical advice, mental health, red-teaming attacks) is essential for routing to human review or triggering safety measures.

Limitations of Prior Work: (a) Using another LLM (e.g., GPT-4) to monitor all interactions is prohibitively expensive ($0.01–0.10/query); (b) predefined rules cannot cover the fuzzy boundary of "high-stakes," which depends on context rather than keywords; (c) "high-stakes" is inherently ambiguous: even vague, general advice may be harmful.

Key Challenge: Real-time, low-cost monitoring is required, yet high-stakes detection demands deep semantic understanding that appears to necessitate large models.

Goal: (1) Demonstrate that LLM internal representations already encode sufficient high-stakes signals; (2) extract these signals using linear probes at minimal cost; (3) design a cascaded architecture to balance accuracy and cost.

Key Insight: LLM hidden layers produce distinct activation patterns when processing different types of text—linear probes can be trained as lightweight classifiers on these patterns.

Core Idea: LLMs already "know" internally whether an interaction is high-stakes; linear probes can extract this signal with less than 1% additional inference overhead.

Method

Overall Architecture

Synthetic training data (GPT-4o generates ~10K samples + multi-step filtering) → train linear probes on LLM internal representations (5 architectures) → evaluate on 6 real-world datasets → design a cascaded system (probe + LLM combination).

Key Designs

  1. Probe Architectures (5 variants):

    • Mean Probe: Averages hidden states across all tokens in the sequence → linear classification
    • Last Token: Uses only the hidden state of the last token
    • Max/Rolling Max: Takes the per-dimension maximum or sliding-window maximum
    • Softmax Probe: Weights token hidden states using learned scores
    • Attention Probe: Learns query vectors for attention-weighted aggregation (best performing; see the probe sketch after this list)
    • All probes are ultimately linear classifiers, so inference overhead is minimal
  2. Synthetic Training Data Generation:

    • Function: GPT-4o generates ~10K high/low-stakes interaction samples (see the generation sketch after this list)
    • Multi-step filtering: generate → GPT-4o labeling → human review → confound removal
    • Design Motivation: Real high-stakes data is difficult to obtain at scale and privacy-sensitive; synthetic data is scalable and controllable
    • Identified Issue: Confounds exist in the synthetic data (e.g., specific phrasings such as "minor wound"), requiring iterative cleaning
  3. Cascaded Monitoring System:

    • Function: Probe monitors all interactions → the most uncertain top-\(k\)% are forwarded to an LLM for fine-grained judgment → scores from both are averaged (see the routing sketch after this list)
    • Mechanism: Probes are cheap but limited in precision (AUROC ~0.92); LLMs are accurate but expensive (AUROC ~0.95). The cascade achieves equivalent performance to 100% LLM usage at 40% of the computational budget
    • Design Motivation: Not all interactions require LLM-level judgment; the majority can be handled quickly by the probe
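The attention probe can be pictured as a learned soft-pooling over token hidden states followed by a linear head. Below is a minimal PyTorch sketch of that idea; the class name, masking details, and hyperparameters are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """Attention-weighted linear probe over frozen LLM hidden states (sketch)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.query = nn.Linear(hidden_size, 1)       # per-token relevance score
        self.classifier = nn.Linear(hidden_size, 1)  # final linear decision

    def forward(self, hidden_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from one LLM layer
        # mask: (batch, seq_len), 1 for real tokens, 0 for padding
        scores = self.query(hidden_states).squeeze(-1)            # (batch, seq_len)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                   # attention over tokens
        pooled = (weights.unsqueeze(-1) * hidden_states).sum(1)   # (batch, hidden_size)
        return self.classifier(pooled).squeeze(-1)                # one logit per example
```

Replacing the weighted sum with a masked mean recovers the Mean probe, and a per-dimension max recovers the Max probe, so all variants share the same cheap linear head.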
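For the synthetic-data step, a generation loop along the following lines would produce labeled candidates for the later filtering stages. The prompt text and helper function are hypothetical; only the general generate-then-filter pattern comes from the paper.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical generation prompt; the paper's actual prompts are not shown here.
GEN_PROMPT = ("Write one realistic user message to an AI assistant. "
              "It should be {label}-stakes (high: medical, legal, or safety-critical; "
              "low: casual, everyday).")

def generate_samples(label: str, n: int) -> list[str]:
    """Draft n candidate interactions; GPT-4o labeling and human review come later."""
    samples = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": GEN_PROMPT.format(label=label)}],
        )
        samples.append(resp.choices[0].message.content)
    return samples
```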
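The cascade routing itself amounts to a few lines: score everything with the probe, send only the least confident fraction to the LLM, and average the two scores there. The uncertainty heuristic (distance from 0.5) and the names below are assumptions for illustration; the paper's exact routing rule may differ.

```python
import numpy as np

def cascade_scores(probe_scores, llm_judge, texts, budget=0.4):
    """Route the most uncertain probe decisions to an expensive LLM judge (sketch).

    probe_scores: probe probabilities in [0, 1] for all interactions.
    llm_judge:    callable mapping one text to a probability (the costly model).
    budget:       fraction of traffic the LLM may see (0.4 = 40% of full-LLM cost).
    """
    probe_scores = np.asarray(probe_scores, dtype=float)
    n_llm = int(budget * len(probe_scores))
    if n_llm == 0:
        return probe_scores.copy()

    uncertainty = -np.abs(probe_scores - 0.5)         # nearer 0.5 = less certain
    uncertain_idx = np.argsort(uncertainty)[-n_llm:]  # top-k% most uncertain

    final = probe_scores.copy()
    for i in uncertain_idx:
        # Average probe and LLM scores, as in the cascade design above.
        final[i] = 0.5 * (probe_scores[i] + llm_judge(texts[i]))
    return final
```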

Loss & Training

Binary cross-entropy loss. Probes are trained on synthetic data and evaluated on 6 real-world datasets (Anthropic HH-RLHF, ToolACE, MT Samples, MTS Dialog, Mental Health, Aya Redteaming).
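A minimal training loop, assuming hidden states have been precomputed and cached, might look as follows; the AttentionProbe is the sketch from the Method section, and the batch size, optimizer, and learning rate are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_probe(probe, acts, mask, labels, epochs=10, lr=1e-3):
    # acts: (N, seq_len, hidden_size) cached hidden states from one LLM layer
    # mask: (N, seq_len) attention mask; labels: (N,) binary high-stakes labels
    loader = DataLoader(TensorDataset(acts, mask, labels), batch_size=64, shuffle=True)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()  # binary cross-entropy on raw logits
    for _ in range(epochs):
        for a, m, y in loader:
            opt.zero_grad()
            loss = loss_fn(probe(a, m), y.float())
            loss.backward()
            opt.step()
    return probe
```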

Key Experimental Results

Main Results

| Method | Mean AUROC | TPR@1%FPR | Compute Cost (relative) |
| --- | --- | --- | --- |
| Attention Probe | 0.88–0.92 | ~43% | ~\(10^{-6}\)× |
| Fine-tuned Llama-3.1-8B | ~0.92 | ~48% | 1× (baseline) |
| Fine-tuned Gemma-3-12B | ~0.87 | — | ~1.5× |
| Prompted Llama-3.3-70B | 0.95+ | ~55% | ~10× |
| Cascade (Probe + fine-tuned 8B) | Surpasses both components | — | ~0.4× |

Ablation Study

| Configuration | Key Finding | Notes |
| --- | --- | --- |
| 32 deployment samples for adaptation | +5% AUROC | Very few samples yield significant gains |
| 256 samples | Probe surpasses prompted 70B | Few-shot adaptation closes the domain gap |
| Cascade at 40% budget | Equivalent to 100% LLM-only | Optimal efficiency–accuracy trade-off |
| 5 probe variants compared | Attention best, Mean second | Attention weighting captures key tokens |
| Layer selection | Mid-to-late layers optimal | Upper layers too abstract, lower too raw |
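Concretely, the hidden states that a probe consumes can be pulled from a chosen layer via Hugging Face transformers, roughly as below; the model name and the mid-to-late layer heuristic are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B"  # illustrative; the paper probes several open models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

inputs = tok("I think I took too much of my medication, what should I do?",
             return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, seq_len, hidden_size); pick a mid-to-late layer as the probe input.
layer = len(out.hidden_states) * 2 // 3
acts = out.hidden_states[layer]
```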

Key Findings

  • Linear probe AUROC of 0.92 approaches that of fine-tuned 8B LLMs—LLM internals do encode high-stakes signals
  • Only 32 target-domain samples yield a +5% AUROC gain—adaptation cost is extremely low
  • The cascaded architecture achieves full-LLM accuracy at 40% of the computational budget—highly valuable for large-scale deployment
  • TPR@1%FPR is only ~43%: probes remain insufficient under extremely strict false-positive requirements (a sketch of this metric follows)
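Since TPR@1%FPR is the headline safety metric here, the sketch below shows one standard way to compute it from probe scores using scikit-learn (assumed as the evaluation library; the paper may compute it differently).

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(labels, scores, target_fpr=0.01):
    """True-positive rate at a fixed false-positive rate (e.g., 1%)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    # Best TPR among operating points whose FPR stays within the target.
    return float(tpr[fpr <= target_fpr].max())

# Example: tpr_at_fpr([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]) -> 0.5
```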

Highlights & Insights

  • "The LLM already knows internally": This parallels findings from VLA failure detection (the SAFE paper)—internal model representations encode richer signals than outputs alone. Probes serve as an exceptionally cheap signal extraction mechanism.
  • Practical value of cascaded architecture: Probes serve as a fast pre-filter (quickly eliminating the vast majority of low-stakes interactions), while LLMs handle fine-grained judgment (only for uncertain minority cases)—this constitutes an industrially viable LLM safety monitoring solution.
  • Synthetic data works: Probes trained on GPT-4o-generated synthetic data generalize across 6 diverse real-world datasets, suggesting that the internal representation of "high-stakes" is universal.

Limitations & Future Work

  • The definition of "high-stakes" is ambiguous—relying solely on GPT-4o labels may reflect a specific LLM's biases rather than an objective standard
  • Linear probes are vulnerable to targeted adversarial attacks (adversaries may craft inputs to evade probe detection)
  • TPR@1%FPR of only 43% implies an unacceptably high miss rate for safety-critical scenarios
  • Confounds identified in synthetic data (e.g., "minor wound") may not have been fully removed
  • The approach has not been tested in multi-turn dialogue settings—single-turn interactions may be comparatively simpler

Comparison with Related Work

  • vs. Llama Guard / ShieldGemma: These are standalone safety classifiers that require a separate inference pass; probes reuse the host model's computation, adding almost no inference cost
  • vs. Representation Engineering (Zou et al., 2023): RepE uses probes to understand/control model behavior; this work uses probes for real-time safety monitoring
  • vs. SAFE (VLA failure detection): Both leverage internal model representations for anomaly detection, but target different application domains (text safety vs. robot safety)

Rating

  • Novelty: ⭐⭐⭐⭐ Using probes for safety monitoring is a natural yet underexplored direction; the cascaded design is practical
  • Experimental Thoroughness: ⭐⭐⭐⭐ 5 probe variants × 6 datasets × cascade × few-shot adaptation
  • Writing Quality: ⭐⭐⭐⭐ System design is clear and practically oriented
  • Value: ⭐⭐⭐⭐⭐ Direct practical value for LLM safety deployment