Detecting High-Stakes Interactions with Activation Probes

Conference: NeurIPS 2025 | arXiv: 2506.10805 | Code: GitHub
Area: AI Safety / LLM Monitoring
Keywords: activation probes, high-stakes detection, cascaded monitoring, synthetic data, LLM safety

TL;DR

Linear activation probes (lightweight classifiers trained on LLM internal representations) are used to detect "high-stakes interactions" from users. Trained on synthetic data, these probes achieve AUROC of 0.88–0.92 across 6 real-world datasets, matching fine-tuned 8–12B LLMs at a computational cost six orders of magnitude lower. A cascaded architecture (probe pre-filtering + LLM refinement) further surpasses either component used alone.

Background & Motivation

Background: Detecting high-stakes interactions in LLM deployments (e.g., medical advice, mental health, red-teaming attacks) is essential for routing to human review or triggering safety measures.

Limitations of Prior Work: (a) Using another LLM (e.g., GPT-4) to monitor all interactions is prohibitively expensive ($0.01–0.10/query); (b) predefined rules cannot cover the fuzzy boundary of "high-stakes," which depends on context rather than keywords; (c) "high-stakes" is inherently ambiguous: even vague, general advice may be harmful.

Key Challenge: Real-time, low-cost monitoring is required, yet high-stakes detection demands deep semantic understanding that appears to necessitate large models.

Goal: (1) Demonstrate that LLM internal representations already encode sufficient high-stakes signals; (2) extract these signals using linear probes at minimal cost; (3) design a cascaded architecture to balance accuracy and cost.

Key Insight: LLM hidden layers produce distinct activation patterns when processing different types of text—linear probes can be trained as lightweight classifiers on these patterns.

Core Idea: LLMs already "know" internally whether an interaction is high-stakes; linear probes can extract this signal with less than 1% additional inference overhead.

Method

Overall Architecture

Synthetic training data (GPT-4o generates ~10K samples + multi-step filtering) → train linear probes on LLM internal representations (5 architectures) → evaluate on 6 real-world datasets → design a cascaded system (probe + LLM combination).

Key Designs

  1. Probe Architectures (5 variants):

    • Mean Probe: Averages hidden states across all tokens in the sequence → linear classification
    • Last Token: Uses only the hidden state of the last token
    • Max/Rolling Max: Takes the per-dimension maximum or sliding-window maximum
    • Softmax Probe: Weights token hidden states using learned scores
    • Attention Probe: Learns query vectors for attention-weighted aggregation (best performing; see the probe sketch after this list)
    • All probes are ultimately linear classifiers, so inference overhead is minimal
  2. Synthetic Training Data Generation:

    • Function: GPT-4o generates ~10K high/low-stakes interaction samples (see the generation sketch after this list)
    • Multi-step filtering: generate → GPT-4o labeling → human review → confound removal
    • Design Motivation: Real high-stakes data is difficult to obtain at scale and privacy-sensitive; synthetic data is scalable and controllable
    • Identified Issue: Confounds exist in the synthetic data (e.g., specific phrasings such as "minor wound"), requiring iterative cleaning
  3. Cascaded Monitoring System:

    • Function: Probe monitors all interactions → the most uncertain top-\(k\)% are forwarded to an LLM for fine-grained judgment → scores from both are averaged (see the routing sketch after this list)
    • Mechanism: Probes are cheap but limited in precision (AUROC ~0.92); LLMs are accurate but expensive (AUROC ~0.95). The cascade achieves equivalent performance to 100% LLM usage at 40% of the computational budget
    • Design Motivation: Not all interactions require LLM-level judgment; the majority can be handled quickly by the probe
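The attention probe can be pictured as a learned soft-pooling over token hidden states followed by a linear head. Below is a minimal PyTorch sketch of that idea; the class name, masking details, and hyperparameters are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AttentionProbe(nn.Module):
    """Attention-weighted linear probe over frozen LLM hidden states (sketch)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.query = nn.Linear(hidden_size, 1)       # per-token relevance score
        self.classifier = nn.Linear(hidden_size, 1)  # final linear decision

    def forward(self, hidden_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from one LLM layer
        # mask: (batch, seq_len), 1 for real tokens, 0 for padding
        scores = self.query(hidden_states).squeeze(-1)            # (batch, seq_len)
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                   # attention over tokens
        pooled = (weights.unsqueeze(-1) * hidden_states).sum(1)   # (batch, hidden_size)
        return self.classifier(pooled).squeeze(-1)                # one logit per example
```

Replacing the weighted sum with a masked mean recovers the Mean probe, and a per-dimension max recovers the Max probe, so all variants share the same cheap linear head.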
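For the synthetic-data step, a generation loop along the following lines would produce labeled candidates for the later filtering stages. The prompt text and helper function are hypothetical; only the general generate-then-filter pattern comes from the paper.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical generation prompt; the paper's actual prompts are not shown here.
GEN_PROMPT = ("Write one realistic user message to an AI assistant. "
              "It should be {label}-stakes (high: medical, legal, or safety-critical; "
              "low: casual, everyday).")

def generate_samples(label: str, n: int) -> list[str]:
    """Draft n candidate interactions; GPT-4o labeling and human review come later."""
    samples = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": GEN_PROMPT.format(label=label)}],
        )
        samples.append(resp.choices[0].message.content)
    return samples
```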
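The cascade routing itself amounts to a few lines: score everything with the probe, send only the least confident fraction to the LLM, and average the two scores there. The uncertainty heuristic (distance from 0.5) and the names below are assumptions for illustration; the paper's exact routing rule may differ.

```python
import numpy as np

def cascade_scores(probe_scores, llm_judge, texts, budget=0.4):
    """Route the most uncertain probe decisions to an expensive LLM judge (sketch).

    probe_scores: probe probabilities in [0, 1] for all interactions.
    llm_judge:    callable mapping one text to a probability (the costly model).
    budget:       fraction of traffic the LLM may see (0.4 = 40% of full-LLM cost).
    """
    probe_scores = np.asarray(probe_scores, dtype=float)
    n_llm = int(budget * len(probe_scores))
    if n_llm == 0:
        return probe_scores.copy()

    uncertainty = -np.abs(probe_scores - 0.5)         # nearer 0.5 = less certain
    uncertain_idx = np.argsort(uncertainty)[-n_llm:]  # top-k% most uncertain

    final = probe_scores.copy()
    for i in uncertain_idx:
        # Average probe and LLM scores, as in the cascade design above.
        final[i] = 0.5 * (probe_scores[i] + llm_judge(texts[i]))
    return final
```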

Loss & Training

Binary cross-entropy loss. Probes are trained on synthetic data and evaluated on 6 real-world datasets (Anthropic HH-RLHF, ToolACE, MT Samples, MTS Dialog, Mental Health, Aya Redteaming).
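A minimal training loop, assuming hidden states have been precomputed and cached, might look as follows; the AttentionProbe is the sketch from the Method section, and the batch size, optimizer, and learning rate are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_probe(probe, acts, mask, labels, epochs=10, lr=1e-3):
    # acts: (N, seq_len, hidden_size) cached hidden states from one LLM layer
    # mask: (N, seq_len) attention mask; labels: (N,) binary high-stakes labels
    loader = DataLoader(TensorDataset(acts, mask, labels), batch_size=64, shuffle=True)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()  # binary cross-entropy on raw logits
    for _ in range(epochs):
        for a, m, y in loader:
            opt.zero_grad()
            loss = loss_fn(probe(a, m), y.float())
            loss.backward()
            opt.step()
    return probe
```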

Key Experimental Results

Main Results

| Method | Mean AUROC | TPR@1%FPR | Compute Cost (relative) |
| --- | --- | --- | --- |
| Attention Probe | 0.88–0.92 | ~43% | ~\(10^{-6}\)× |
| Fine-tuned Llama-3.1-8B | ~0.92 | ~48% | 1× (baseline) |
| Fine-tuned Gemma-3-12B | ~0.87 | — | ~1.5× |
| Prompted Llama-3.3-70B | 0.95+ | ~55% | ~10× |
| Cascade (Probe + fine-tuned 8B) | Surpasses both components | — | ~0.4× |

Ablation Study

| Configuration | Key Finding | Notes |
| --- | --- | --- |
| 32 deployment samples for adaptation | +5% AUROC | Very few samples yield significant gains |
| 256 samples | Probe surpasses prompted 70B | Few-shot adaptation closes the domain gap |
| Cascade at 40% budget | Equivalent to 100% LLM-only | Optimal efficiency–accuracy trade-off |
| 5 probe variants compared | Attention best, Mean second | Attention weighting captures key tokens |
| Layer selection | Mid-to-late layers optimal | Upper layers too abstract, lower too raw |
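Concretely, the hidden states that a probe consumes can be pulled from a chosen layer via Hugging Face transformers, roughly as below; the model name and the mid-to-late layer heuristic are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B"  # illustrative; the paper probes several open models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

inputs = tok("I think I took too much of my medication, what should I do?",
             return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each of shape
# (batch, seq_len, hidden_size); pick a mid-to-late layer as the probe input.
layer = len(out.hidden_states) * 2 // 3
acts = out.hidden_states[layer]
```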

Key Findings

  • Linear probe AUROC of 0.92 approaches that of fine-tuned 8B LLMs—LLM internals do encode high-stakes signals
  • Only 32 target-domain samples yield a +5% AUROC gain—adaptation cost is extremely low
  • The cascaded architecture achieves full-LLM accuracy at 40% of the computational budget—highly valuable for large-scale deployment
  • TPR@1%FPR is only ~43%: probes remain insufficient under extremely strict false-positive requirements (a sketch of this metric follows)
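Since TPR@1%FPR is the headline safety metric here, the sketch below shows one standard way to compute it from probe scores using scikit-learn (assumed as the evaluation library; the paper may compute it differently).

```python
import numpy as np
from sklearn.metrics import roc_curve

def tpr_at_fpr(labels, scores, target_fpr=0.01):
    """True-positive rate at a fixed false-positive rate (e.g., 1%)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    # Best TPR among operating points whose FPR stays within the target.
    return float(tpr[fpr <= target_fpr].max())

# Example: tpr_at_fpr([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]) -> 0.5
```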

Highlights & Insights

  • "The LLM already knows internally": This parallels findings from VLA failure detection (the SAFE paper)—internal model representations encode richer signals than outputs alone. Probes serve as an exceptionally cheap signal extraction mechanism.
  • Practical value of cascaded architecture: Probes serve as a fast pre-filter (quickly eliminating the vast majority of low-stakes interactions), while LLMs handle fine-grained judgment (only for uncertain minority cases)—this constitutes an industrially viable LLM safety monitoring solution.
  • Synthetic data works: Probes trained on GPT-4o-generated synthetic data generalize across 6 diverse real-world datasets, suggesting that the internal representation of "high-stakes" is universal.

Limitations & Future Work

  • The definition of "high-stakes" is ambiguous—relying solely on GPT-4o labels may reflect a specific LLM's biases rather than an objective standard
  • Linear probes are vulnerable to targeted adversarial attacks (adversaries may craft inputs to evade probe detection)
  • TPR@1%FPR of only 43% implies an unacceptably high miss rate for safety-critical scenarios
  • Confounds identified in synthetic data (e.g., "minor wound") may not have been fully removed
  • The approach has not been tested in multi-turn dialogue settings—single-turn interactions may be comparatively simpler

Comparison with Related Work

  • vs. Llama Guard / ShieldGemma: These are standalone safety classifiers that require a separate inference pass; probes reuse the host model's computation, adding almost no inference cost
  • vs. Representation Engineering (Zou et al., 2023): RepE uses probes to understand/control model behavior; this work uses probes for real-time safety monitoring
  • vs. SAFE (VLA failure detection): Both leverage internal model representations for anomaly detection, but target different application domains (text safety vs. robot safety)

Rating

  • Novelty: ⭐⭐⭐⭐ Using probes for safety monitoring is a natural yet underexplored direction; the cascaded design is practical
  • Experimental Thoroughness: ⭐⭐⭐⭐ 5 probe variants × 6 datasets × cascade × few-shot adaptation
  • Writing Quality: ⭐⭐⭐⭐ System design is clear and practically oriented
  • Value: ⭐⭐⭐⭐⭐ Direct practical value for LLM safety deployment