The Impact of Off-Policy Training Data on Probe Generalisation¶
Conference: ACL2026
arXiv: 2511.17408
Code: https://github.com/SamDower/LASR-probe-gen
Area: Interpretability / Safety Monitoring
Keywords: Activation Probes, off-policy data, Generalisation, Deception Detection, Distribution Shift
TL;DR¶
This paper systematically compares the effects of four training data types—on-policy natural, on-policy incentivised, on-policy prompted, and off-policy—on the generalisation of LLM activation probes. It finds that text-evident behaviors are robust, whereas "intentional" behaviors like deception, sycophancy, and sandbagging are highly susceptible to domain shifts. It proposes using an on-policy incentivised test set to predict generalisation failures in real-world monitoring.
Background & Motivation¶
Background: In LLM safety monitoring, activation probes represent an attractive approach. Rather than relying solely on output text, probes directly read internal model representations to detect behaviors such as refusal, deception, sycophancy, and sandbagging using low-cost classifiers. A typical practice involves collecting labeled model responses, extracting activations from specific layers, and training linear or lightweight attention probes.
Limitations of Prior Work: Many behaviors critical for monitoring are rare in natural scenarios and difficult to label. For example, strategic deception may not leave obvious traces in the text; incorrect answers in sandbagging might stem from a lack of ability or intentional concealment. Consequently, researchers often use synthetic data, text generated by external models, or direct prompts to induce specific behaviors in the target model. However, whether such data generalises to real on-policy behaviors remains an open question.
Key Challenge: The central conflict is whether a probe learns a representation of "the target model internally producing a behavior" or "spurious features related to specific text, prompts, or domains in the training distribution." Off-policy data is easier to collect but lacks the target model's own generation process; data from different domains might cause probes to learn specific topics, formats, or styles rather than the behavior itself.
Goal: The authors aim to distinguish the impacts of two types of shifts: response strategy shift (whether training responses come from natural generation, incentivised induction, explicit prompting, or external models) and domain shift (whether training and testing occur in the same task domain). The paper also seeks a practical diagnostic method to predict probe failure when on-policy natural test data is unavailable.
Key Insight: The paper selects eight categories of LLM behaviors and prepares two data domains for each. Rather than testing a single deception probe, it compares text-evident behaviors and text-ambiguous behaviors within the same experimental framework. This explains why some off-policy probes are reliable while others fail abruptly in seemingly similar tests.
Core Idea: The problem of probe generalisation is decomposed into two factors: "whether the behavior is determined by surface text" and "whether training domains are aligned." The authors use on-policy incentivised data as a risk proxy when real on-policy natural data is unavailable.
Method¶
The paper trains behavior probes around a fixed target model \(M\). Given input \(x\) and output \(y\), activation matrices \(A\) are extracted from certain layers and token positions. A binary classifier is then trained to predict whether the response contains the target behavior. The core experimental variables are not the probes themselves but how the responses are generated, how they are labeled, and the consistency between training and testing domains.
Overall Architecture¶
The authors define four response strategies. "On-policy natural" involves the target model answering original prompts naturally; "on-policy incentivised" introduces indirect incentives to make the behavior advantageous; "on-policy prompted" uses explicit instructions to exhibit the behavior; "off-policy" uses text generated by other models or external data, which is then fed into the target model to extract activations. For prompted and incentivised data, additional prompts \(p\) are typically removed before activation extraction to prevent the probe from merely identifying prompt tokens.
Each behavior uses two domains. For instance, list-making, metaphors, science, and refusal are categorized as text-evident behaviors; sycophancy, deferral-to-authority, sandbagging, and deception are text-ambiguous behaviors, where the model's true motivation is often unclear from the output alone. Training samples are balanced; for the first six behaviors, 3,500 samples are used for training, 500 for validation, and 1,000 for testing per strategy. For deception and sandbagging, 2,500 training, 500 validation, and 500 testing samples are used due to limited inputs.
Key Designs¶
-
Factorised Comparison of Four Response Generation Strategies:
- Function: Separates the effects of "whether data comes from the target model's policy" and "whether the behavior was induced."
- Mechanism: On-policy natural serves as the ideal baseline. On-policy incentivised encourages natural target behavior through task incentives. On-policy prompted uses explicit commands. Off-policy uses external data or other models. All strategies are evaluated within the same framework, primarily using AUROC.
- Design Motivation: Prior work often vaguely cited "synthetic data" or "external data," which conflate model policy, prompt traces, and domain differences. Factorised comparison clarifies which shift causes probe failure.
-
Counterfactual Construction of Behavior Labels:
- Function: Provides more credible labels for text-ambiguous behaviors, avoiding mislabeling ordinary agreement or errors as dangerous behaviors.
- Mechanism: For sycophancy, deferral, and some sandbagging, the authors generate control and suffixed responses for the same question. A positive label is assigned only if the model changes its answer due to user opinion, authority, or threats. Refusal is identified via HarmBench; other behaviors use GPT-5-Nano for automated scoring, filtering low-confidence labels. For deception and WMDP sandbagging, private/public answers or in-context facts determine hidden intent.
- Design Motivation: Many monitoring targets are about "why text was generated" rather than "what it looks like." Counterfactual labeling pushes the behavior definition back to the generation process, enabling the distinction between text-evident and text-ambiguous behaviors.
-
Statistical Separation of Domain Shift and Response Strategy Shift:
- Function: Quantifies the AUROC loss caused by domain shift versus response strategy shift.
- Mechanism: Both mixed-effects models and OLS interaction regressions are used. Mixed-effects models estimate average effects across behaviors, while OLS interaction terms decompose sensitivity to domain shift and generation method for each behavior. Trends are verified across Llama-3.2-3B, Gemma-3-27B, Ministral-8B, Qwen models, and both linear and attention probes.
- Design Motivation: Significant variations in difficulty across probe tasks mean simple averages can mask behavioral heterogeneity. Statistical models ensure "domain shift matters more" is a significant finding with behavior-level explanations.
Loss & Training¶
Linear probes use logistic regression on averaged sequence activations with L2 regularization. Attention probes learn query/value weights to weight token activations via softmax, outputting classification probabilities via sigmoid. Attention probes minimize binary cross-entropy using AdamW. Layers and hyperparameters are selected via the validation set, with results reported as AUROC with 95% confidence intervals. The core strategy balances positive/negative samples and strictly uses offsets to prevent input leakage between training and testing for different response strategies.
Key Experimental Results¶
Main Results¶
The primary finding is that same-domain off-policy data is generally more reliable than different-domain on-policy data. For Llama linear probes, the average AUROC for same-domain on-policy natural training is \(0.90 \pm 0.019\); same-domain off-policy is \(0.85 \pm 0.039\); whereas different-domain on-policy is only \(0.74 \pm 0.051\). This suggests that in the absence of natural on-policy data, prioritizing domain alignment is more critical than focusing on the generation policy.
| Training Condition | Mean AUROC | Significance | Conclusion |
|---|---|---|---|
| Same-domain on-policy natural | \(0.90 \pm 0.019\) | Closest to real monitoring | Preferred when available |
| Same-domain off-policy | \(0.85 \pm 0.039\) | Strategy shift but domain consistency | High value when natural data is missing |
| Different-domain on-policy | \(0.74 \pm 0.051\) | Consistent strategy but inconsistent domain | Inferior to same-domain off-policy |
| Attention probe same-domain on-policy | \(0.946 \pm 0.015\) | Stronger architecture, same trend | Higher absolute scores, but shift issues persist |
Regression results further show domain shift as the dominant factor. On the on-policy natural test set, the mixed-effects model estimates an average loss of -0.162 AUROC due to different-domain training (\(p<0.001\)), while the main effects of response strategies were small and non-significant. Similar results appeared for the on-policy incentivised test set, with a domain penalty of -0.189 (\(p<0.001\)).
| Test Setting | Domain Shift Coeff | Significance | Response Strategy Conclusion | Interpretation |
|---|---|---|---|---|
| On-policy natural test | -0.162 | p<0.001 | Main effects non-sig (p>0.05) | Domain shift significantly degrades generalisation |
| On-policy incentivised test | -0.189 | p<0.001 | Main effects still non-sig | Incentivised tests also expose domain shift |
| Prompted × different-domain | +0.050 | p<0.05 | Small interaction term | Statistically present but does not change main conclusion |
Ablation Study¶
Behavior-level OLS results reveal differences between text-evident and text-ambiguous tasks. Metaphors, refusal, and science showed domain shift penalties of only -0.045 to -0.099. However, deception reached -0.416, sycophancy -0.283, and authority -0.222. This implies that behaviors visible through surface format or content retain features across sources, while intent-dependent behaviors are most likely to learn spurious correlations.
| Behavior Type | Specific Behavior | Different-domain AUROC Change | Response Strategy Impact | Explanation |
|---|---|---|---|---|
| Text-evident | Metaphors | -0.045 | Non-significant | Metaphorical vocabulary is visible in text |
| Text-evident | Refusal | -0.088 | Minimal | Refusal phrases are relatively stable |
| Text-evident | Science | -0.099 | Non-significant | Scientific terms are clearly represented |
| Text-evident | Lists | -0.272 | Incentivised -0.064 | List format is clear but domain-sensitive |
| Text-ambiguous | Authority | -0.222 | Mostly non-sig | Agreement vs. true belief is hard to distinguish |
| Text-ambiguous | Sycophancy | -0.283 | Mostly non-sig | Relies on counterfactual labels |
| Text-ambiguous | Deception | -0.416 | Off-policy +0.130 (same domain) | Most affected by domain shift |
| Text-ambiguous | Sandbagging | -0.182 | Prompted -0.111 | Concealed ability looks like normal errors |
The authors propose a useful proxy diagnostic: if on-policy natural test data is missing, assess if the probe generalises from off-policy to on-policy incentivised. For linear probes across models, the Pearson correlation between generalisation failure on incentivised data and natural data was 0.793. This supports using incentivised data as a diagnostic set closer to real behavioral processes.
| Proxy Diagnostic | Pearson Correlation | Behaviors Covered | Conclusion |
|---|---|---|---|
| Incentivised loss predicts natural loss | 0.793 | All except Deception/Sandbagging | Strong correlation, suitable for failure prediction |
| Prompted loss predicts natural loss | 0.639 | Same as above | Correlation exists but is weaker |
| Incentivised test for Deception | Qualitative gap | Deception & Sandbagging | Predicts current probes struggle to generalise |
Key Findings¶
- Off-policy data is not universally unusable. For text-evident behaviors (lists, metaphors, science, refusal), same-domain off-policy probes maintain high AUROC.
- Domain shift is often more damaging than response strategy shift. In practice, training data domain should prioritize proximity to the deployment scenario.
- Text-ambiguous behaviors are the most "dangerous." Probes for deception, sycophancy, and sandbagging easily learn dataset style rather than the true internal mechanism.
- Attention probes improve absolute scores but do not eliminate the shift patterns. Stronger architectures do not automatically solve distribution mismatch.
- Mixing a small amount of on-policy natural data is not necessarily effective. The performance of other strategies only approached the natural probe when nearly all natural training samples were included.
Highlights & Insights¶
- The paper clarifies the probe data issue: the key is not whether synthetic data can train a probe, but whether that data captures the internal processes of the target model in the target domain.
- The distinction between text-evident and text-ambiguous is insightful. It explains why refusal probes succeed while deception detection—closer to "mind reading"—always faces extrapolation risks.
- The on-policy incentivised proxy test is a practical contribution. Real deception samples are rare; incentivised scenarios can at least expose whether a probe has merely learned off-policy spurious features.
- The finding that "same-domain off-policy is better than different-domain on-policy" is counter-intuitive but actionable. It reminds developers that matching task domains and user scenarios may be more important than the generation source.
Limitations & Future Work¶
- The definition of off-policy data remains broad. The distributional distance between different external models or manual text could vary significantly.
- Model scales were relatively limited. While Llama-3.2-3B and Gemma-27B show consistent trends, it is not guaranteed for \(100B+\) models or future reasoning models.
- Probes were restricted to linear and single-layer attention. Deeper non-linear probes or sparse feature probes might exhibit different generalisation behaviors.
- Counterfactual labels for text-ambiguous behaviors still rely on specific incentives and automated graders. If incentives alter the model's internal process too much, the proxy test might still deviate from natural behavior.
- Future work could introduce domain adaptation or invariant risk minimisation (IRM) into probe learning to explicitly penalize domain-dependent features.
Related Work & Insights¶
- vs. Sleeper Agent / Coup Probes: While those works show probes can detect high-risk states, this paper provides the generalisation diagnostics, noting that off-policy success depends on behavior type and domain.
- vs. Deception Probe Research: Building on findings by Goldowsky-Dill et al. regarding cross-dataset issues in deception probes, this paper places deception within a broader comparison of eight behaviors.
- vs. Representation Engineering: RepE assumes distinct directions in activation space for behaviors; this paper warns these directions drift with domain and generation process, especially for intentional behaviors.
- vs. LLM-as-judge Monitoring: LLM judges are costly and struggle with text-ambiguity. Activation probes are cheaper but require counterfactual and proxy validation to ensure they read internal mechanisms.
Rating¶
- Novelty: ⭐⭐⭐⭐☆ Very practical focus on factorizing off-policy, prompted, incentivised, and domain shifts.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive coverage across eight behaviors, multiple domains, models, architectures, and regression analyses.
- Writing Quality: ⭐⭐⭐⭐☆ Clear structure and takeaways; detailed appendices, though main text relies heavily on figures for specific values.
- Value: ⭐⭐⭐⭐⭐ Highly useful for safety monitoring and mechanistic interpretability, preventing the misinterpretation of high in-distribution scores as deployment reliability.