Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs¶
Conference: AAAI 2026
arXiv: 2512.03994
Code: GitHub
Area: LLM Safety & Policy Compliance
Keywords: Policy violation detection, activation-space whitening, out-of-distribution detection, LLM internal representations, training-free
TL;DR¶
This work reformulates LLM policy violation detection as an out-of-distribution (OOD) detection problem in activation space. A training-free whitening approach is proposed: a whitening transform is fitted on compliant activations, and the Euclidean norm in the whitened space serves as the compliance score. Deployment requires only the policy text and a small number of examples. The method achieves 86.0% F1 on DynaBench, outperforming the strongest fine-tuned baseline (DynaGuard-8B) by 12.9 points and LLM-as-Judge by roughly 16 points.
Background & Motivation¶
- Compliance challenges in enterprise LLM deployment: Organizations deploying LLMs in sensitive domains such as legal, financial, and healthcare must ensure adherence to both internal organizational policies and external regulatory requirements.
- Enterprise deployments typically involve dozens of policies, each potentially comprising hundreds of rules.
- Even high-performing LLMs may inadvertently violate organizational policies, creating legal and financial risk.
- Limitations of prior work:
- Guardrail systems (e.g., LlamaGuard): Constrained to safety taxonomies and unable to generalize to complex organizational policies.
- LLM-as-a-Judge: Flexible but introduces significant latency (1.47 s/sample).
- Fine-tuned detectors (e.g., DynaGuard): Require large amounts of labeled data and training cost, with poor adaptability.
- All of the above methods evaluate compliance at the generated text level.
- Core insight: The internal activation states of LLMs encode information about output correctness that is not fully reflected in the generated tokens.
- Hypothesis: Policy-violating states occupy distinct regions of the LLM embedding space and can be identified via OOD detection methods.
Method¶
Mechanism¶
Policy compliance detection is modeled as an OOD problem in activation space:
- Compliant behavior → in-distribution
- Violating behavior → out-of-distribution
Offline Phase: Reference Statistics Preprocessing¶
Whitening Transform¶
Activation vectors \(\{x_i^{(\ell)}\}_{i=1}^N\) are extracted from in-policy interactions at each layer \(\ell\). The empirical mean and covariance are computed:

\[
\mu^{(\ell)} = \frac{1}{N}\sum_{i=1}^{N} x_i^{(\ell)}, \qquad
\Sigma^{(\ell)} = \frac{1}{N}\sum_{i=1}^{N} \bigl(x_i^{(\ell)} - \mu^{(\ell)}\bigr)\bigl(x_i^{(\ell)} - \mu^{(\ell)}\bigr)^\top
\]
The whitening matrix \(W^{(\ell)}\) satisfies \({W^{(\ell)}}^\top W^{(\ell)} = (\Sigma^{(\ell)})^{-1}\) and is computed via PCA whitening.
The whitened representation is:

\[
z_i^{(\ell)} = W^{(\ell)}\bigl(x_i^{(\ell)} - \mu^{(\ell)}\bigr)
\]
Compliance Score¶
In the whitened space, deviation from compliant behavior is quantified by the Euclidean norm:

\[
s^{(\ell)}(x) = \bigl\lVert W^{(\ell)}\bigl(x - \mu^{(\ell)}\bigr)\bigr\rVert_2
\]
This score is equivalent to the Mahalanobis distance in the original space, but focuses on the principal directions of compliant variation.
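The offline fit and the compliance score can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions: the function names and the top-\(K\) default are hypothetical, but the construction (PCA whitening restricted to the leading components, then an L2 norm) matches the description above.

```python
import numpy as np

def fit_whitening(acts, k=40, eps=1e-6):
    """Fit a PCA-whitening transform on compliant activations.

    acts: (N, d) array of layer-l activations from in-policy examples.
    k:    number of principal components kept (illustrative default).
    Returns the mean and a (k, d) whitening matrix W such that, within
    the top-k subspace, W^T W approximates the inverse covariance.
    """
    mu = acts.mean(axis=0)
    cov = np.cov(acts - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]         # keep the top-k directions
    W = eigvecs[:, top].T / np.sqrt(eigvals[top] + eps)[:, None]
    return mu, W

def compliance_score(x, mu, W):
    """Euclidean norm in whitened space = top-k Mahalanobis distance."""
    return np.linalg.norm(W @ (x - mu))
```

With \(k = d\) the score reduces to the exact Mahalanobis distance; truncating to the top \(k\) directions is what distinguishes the method from the plain Mahalanobis baseline.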
Layer Selection¶
Whitening parameters are computed independently per layer. A small mixed set of compliant and violating samples is used to evaluate the separability of each layer, and the optimal operating layer \(\ell^\star\) is selected.
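Layer selection could be implemented by scoring the small mixed calibration set at every layer and keeping the most separable one. The sketch below uses rank-based AUC as the separability criterion; this is a reasonable proxy, though the paper's exact criterion may differ, and the function names are illustrative.

```python
import numpy as np

def select_layer(scores_by_layer, labels):
    """Pick the operating layer l* with the best compliant/violating separability.

    scores_by_layer: {layer: (M,) compliance scores on a small mixed set}
    labels:          (M,) array-like, 1 = violating, 0 = compliant.
    """
    labels = np.asarray(labels)

    def auc(scores):
        # Rank statistic: probability a violating sample outscores a compliant one.
        order = np.argsort(scores)
        ranks = np.empty(len(scores))
        ranks[order] = np.arange(1, len(scores) + 1)
        n_pos, n_neg = labels.sum(), (1 - labels).sum()
        return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

    return max(scores_by_layer, key=lambda l: auc(np.asarray(scores_by_layer[l])))
```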
Threshold Calibration¶
On the operating layer \(\ell^\star\), the decision threshold \(\tau\) is calibrated by maximizing the Youden statistic (\(J = TPR - FPR\)).
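Maximizing Youden's \(J\) over candidate thresholds is straightforward; a brute-force sketch over the observed scores (function name illustrative):

```python
import numpy as np

def calibrate_threshold(scores, labels):
    """Choose tau maximizing Youden's J = TPR - FPR on the calibration set.

    scores: compliance scores at the operating layer (higher = more anomalous).
    labels: 1 = violation, 0 = compliance.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_tau, best_j = None, -np.inf
    for tau in np.unique(scores):
        pred = scores > tau                  # flag samples above the candidate
        tpr = pred[labels == 1].mean()
        fpr = pred[labels == 0].mean()
        if tpr - fpr > best_j:
            best_j, best_tau = tpr - fpr, tau
    return best_tau
```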
Policy-Conditioned Whitening¶
Policies are grouped into categories sharing common behavioral patterns. Independent whitening parameters are estimated for each category, enabling category-specific detection.
Online Phase: Real-Time Detection¶
Each response is verified before being returned:

\[
\hat{y} = \mathbb{1}\!\left[\, s^{(\ell^\star)}(x) > \tau \,\right]
\]

\(\hat{y}=1\) indicates a violation; \(\hat{y}=0\) indicates compliance. When policy grouping is used, the nearest policy category is selected via cosine similarity.
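The online check combines the pieces above: route to the nearest policy category by cosine similarity, then threshold the whitened norm. A minimal sketch, assuming per-category statistics are precomputed offline; the dictionary layout and names are hypothetical.

```python
import numpy as np

def detect(x, policy_emb, categories, tau):
    """Online violation check for a single response.

    x:          activation vector at layer l* for the current response.
    policy_emb: embedding of the active policy text.
    categories: {name: {"emb": ..., "mu": ..., "W": ...}} fitted offline.
    Returns 1 for violation, 0 for compliance.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Route to the policy category whose embedding best matches the policy text.
    cat = max(categories.values(), key=lambda c: cos(policy_emb, c["emb"]))
    score = np.linalg.norm(cat["W"] @ (x - cat["mu"]))
    return int(score > tau)
```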
Contrastive Data Construction¶
- For each policy rule in DynaBench, natural language prompts are generated using GPT-4.1.
- Contrastive sample pairs (compliant good + violating bad) are generated for each prompt.
- A GPT-4.1 validator is used to ensure data quality.
Experiments¶
Benchmarks and Setup¶
| Benchmark | Description |
|---|---|
| DynaBench | Policy compliance evaluation on multi-turn user–agent dialogues, covering 12 business impact categories |
| τ-bench | Tool-call correctness evaluation for AI agents (airline domain) |
Evaluated models: Mistral-7B, Gemma-2-9B, Llama-3.1-8B, Qwen3-8B, Qwen2.5-7B
Main Results (DynaBench)¶
| Method Category | Model | F1 (%) |
|---|---|---|
| LLM-as-Judge | GPT-4o-mini | 70.1 |
| LLM-as-Judge | Qwen3-8B | 60.7 |
| Fine-tuned | LlamaGuard-3 | 20.9 |
| Fine-tuned | DynaGuard-8B | 73.1 |
| Whitening (Ours) | Mistral-7B | 66.8 |
| Whitening (Ours) | Gemma-2-9B | 75.2 |
| Whitening (Ours) | Llama-3.1-8B | 75.6 |
| Whitening (Ours) | Qwen3-8B | 78.4 |
| Whitening (Ours) | Qwen2.5-7B | 86.0 |
- Achieves state-of-the-art on 4 out of 5 backbones.
- Qwen2.5-7B reaches 86.0% F1, surpassing the strongest fine-tuned baseline DynaGuard-8B by 12.9 points.
- No fine-tuning required.
Representation-Level vs. Generation-Level Analysis¶
| Model | Generation Classifier F1 | Whitening F1 |
|---|---|---|
| DynaGuard-1.7B | 65.2 | 77.6 |
| DynaGuard-4B | 72.0 | 78.5 |
| DynaGuard-8B | 73.1 | 80.6 |
Key finding: for the same fine-tuned model, the whitening method outperforms the model's native generation classifier by 6.5–12.4 points, demonstrating that internal representations encode richer policy-relevant information than the output tokens.
Generalization on τ-bench¶
- Synthetic data: The whitening method substantially outperforms DynaGuard-8B and GPT-4o-mini.
- Real trajectories: The whitening method achieves AUC=0.87, confirming that activation-space separation generalizes across different interaction formats.
Comparison with Other OOD Methods¶
| Method | Qwen2.5-7B F1 | Llama-3.1-8B F1 |
|---|---|---|
| Mahalanobis | 67.2 | 65.8 |
| KNN | 78.5 | 66.2 |
| Energy Score | 66.4 | 72.1 |
| Whitening (Ours) | 82.2 | 74.3 |
The whitening method outperforms the strongest OOD baseline by 3.7 points on Qwen2.5-7B and 2.2 points on Llama-3.1-8B. The plain Mahalanobis distance is hampered by estimating and inverting the full covariance matrix in high dimensions, whereas whitening restricts the score to the principal directions of compliant variation.
Runtime Efficiency¶
| Category | Model | Time (s/sample) |
|---|---|---|
| LLM-as-Judge | GPT-4o-mini | 1.47 |
| Fine-tuned detector | DynaGuard-8B | 2.71 |
| Same-model representation | Qwen2.5-7B | 0.03 |
| Surrogate-model representation | Llama-3.1-8B | 0.98 |
Using internal representations adds only 0.03–0.05 s overhead, making the method suitable for real-time monitoring.
Ablation Study¶
- Top-K components: F1 varies only between 72.4% and 76.7% across \(K=10\)–\(50\), demonstrating robustness to the choice of \(K\).
- Samples per category: 100 samples achieve 75.6%; increasing to 750 samples yields only 79.1% (diminishing returns).
- Layer-wise analysis: The optimal layer differs across policy categories (e.g., information leakage favors earlier layers; transaction-related policies favor mid-to-late layers).
- Category-specific whitening vs. unified whitening: Category-specific whitening consistently performs better.
Highlights & Insights¶
- OOD framework for policy violation detection: Shifting analysis from generated text to activation space represents a paradigm-level innovation.
- Core finding: Policy compliance information is already encoded in the model's internal representations; the decoding process is a lossy bottleneck — whitening merely exposes pre-existing structure.
- 0.03 s/sample inference overhead enables real-time deployment, approximately 50× faster than LLM-as-Judge.
- Training-free with minimal calibration data (approximately one sample per rule): Supports rapid policy updates.
- Model safety benchmark scores (SORRY-Bench, HarmBench) are positively correlated with whitening separability, establishing a link between safety alignment and internal representation quality.
Limitations & Future Work¶
- Performance depends on the quality of the model's internal representations — models with weaker safety awareness (e.g., Mistral-7B) exhibit lower separability.
- Threshold calibration is required and may need recalibration under distribution shift.
- Detection only, not prevention: The method does not directly intervene in the generation process.
- Access to model internal activations is required (or a surrogate model must be used), incurring additional inference cost for API-only models.
- Contrastive data is generated by GPT-4.1, which may introduce generative bias.
- The gap between the policy complexity of the DynaBench benchmark and real-world policies is not thoroughly discussed.
Related Work & Insights¶
- Guardrail systems: LlamaGuard (Inan et al. 2023), NeMo Guardrails
- Policy compliance: DynaBench/DynaGuard (Hoover et al. 2025)
- OOD detection: Mahalanobis distance, Energy Score, KNN, whitening transform (Betser et al. 2025)
- LLM internal representation analysis: Zou et al. 2023 (truthfulness probes), Gekhman et al. 2025 (error detection)
- Activation-space control: SCANS (activation steering to mitigate over-safety)
Rating ⭐⭐⭐⭐⭐¶
The approach is remarkably elegant — reducing complex policy violation detection to a norm computation in whitened space. Its training-free nature, extremely fast inference, and minimal calibration data requirements make it highly practical. The experiments are comprehensive and convincing (5 models, 2 benchmarks, multiple OOD comparisons, runtime analysis). The core finding — that internal representations outperform generated outputs — carries broad implications. This work sets a strong benchmark for LLM governance and enterprise safety deployment.