# Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
Conference: ICLR 2026 · arXiv: 2509.23886 · Code: GitHub · Area: Interpretability · Keywords: subliminal learning, knowledge distillation, divergence tokens, hidden bias transfer, AI safety
## TL;DR
Through controlled experiments and mechanistic analysis, this paper reveals the nature of subliminal learning: hidden preferences of teacher models are transferred to student models via a small number of "divergence tokens," with early layers playing a critical role. The phenomenon is also shown to be fragile and can be suppressed by simple paraphrasing.
## Background & Motivation
Knowledge distillation is a core technique for model compression and knowledge transfer. The conventional assumption is that what gets transferred depends on the semantic content of the training data—if the teacher's outputs do not exhibit a certain trait (e.g., a preference for a particular animal), the student should not learn that trait.
Cloud et al. (2025) challenged this assumption: a teacher's hidden preferences can transfer to the student even when the training data is entirely unrelated to those preferences (e.g., number sequences, code). This phenomenon is termed subliminal learning.
Subliminal learning is expected under soft distillation (where the student observes the teacher's full next-token distribution). Surprisingly, however, it also occurs under hard distillation (where the student observes only sampled tokens). Prior explanations attributed this to token entanglement and logit leakage, but this paper finds such explanations insufficient.
The central research question is: When and how does subliminal learning occur?
## Method
### Overall Architecture
The analysis is organized around the central concept of "divergence tokens":
- Ruling out prior hypotheses: demonstrating that token entanglement and logit leakage are not necessary conditions.
- Identifying divergence tokens: locating the key carriers of preference transfer.
- Mechanistic analysis: pinpointing critical layers and characterizing fragility.
### Key Design 1: Ruling Out Token Entanglement and Logit Leakage
- Logit leakage: Greedy sampling is used to generate fine-tuning data, eliminating logit leakage entirely. Experiments show that preferences still transfer—indeed, certain preferences that previously failed to transfer (e.g., 'dog' on Qwen) transfer more reliably under greedy sampling.
- Token entanglement: After removing all training samples containing the 50 most entangled tokens, hidden preferences still transfer.
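The entangled-token control can be sketched as a simple filter over the training set, assuming a precomputed list of the most entangled token ids is available (all names and ids here are illustrative, not the paper's actual tokens):

```python
def filter_entangled(samples, entangled_ids):
    """Drop every training sample whose token ids contain an entangled token.

    samples: list of token-id lists; entangled_ids: iterable of ids to exclude.
    """
    banned = set(entangled_ids)
    return [s for s in samples if not banned.intersection(s)]

# Toy example: pretend ids 7 and 42 are among the 50 most entangled tokens.
samples = [[1, 7, 3], [4, 5, 6], [42, 2], [8, 9]]
kept = filter_entangled(samples, {7, 42})  # -> [[4, 5, 6], [8, 9]]
```

The paper's finding is that fine-tuning on the `kept` subset still transfers the hidden preference, ruling out entanglement as a necessary condition.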
### Key Design 2: Divergence Tokens
Under greedy sampling, teachers with different preferences often produce largely identical token sequences for the same prompt, but diverge at specific positions.
Definition: Given a prefix \(x_{<k}\) generated by a teacher with preference \(b\), token \(x_k\) is a divergence token if and only if there exists another teacher with preference \(b' \neq b\) such that

\[
\arg\max_{t}\, p_{b'}(t \mid x_{<k}) \neq x_k,
\]

i.e., a teacher with a different preference, decoded greedily from the same prefix, would emit a different token at position \(k\).
Divergence tokens are rare (approximately 7.5% for Qwen and 18.3% for Gemma), yet their causal effect is substantial.
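The definition can be operationalized by comparing greedy continuations of two teachers position by position. A toy sketch with hand-built next-token tables (teachers, tokens, and probabilities are all illustrative stand-ins for real model distributions):

```python
def greedy_next(dist):
    """Return the argmax token of a next-token distribution (token -> prob)."""
    return max(dist, key=dist.get)

def divergence_positions(sequence, teacher_b, teacher_bprime):
    """Positions k where a teacher with a different preference, given the same
    prefix x_<k, would greedily emit a token other than x_k."""
    return [
        k for k in range(len(sequence))
        if greedy_next(teacher_bprime(tuple(sequence[:k]))) != sequence[k]
    ]

def make_teacher(overrides):
    """Toy teacher: a lookup table from prefix to next-token distribution."""
    base = {(): {"3": 0.9, "7": 0.1},
            ("3",): {"1": 0.9, "4": 0.1},
            ("3", "1"): {"4": 0.9, "7": 0.1}}
    return lambda prefix: {**base, **overrides}[prefix]

teacher_b = make_teacher({})                                       # e.g. "owl" teacher
teacher_bprime = make_teacher({("3", "1"): {"7": 0.9, "4": 0.1}})  # e.g. "dog" teacher

seq = ["3", "1", "4"]  # greedy rollout of teacher_b
divergence_positions(seq, teacher_b, teacher_bprime)  # -> [2]
```

The two toy teachers agree at every position except the last, mirroring the paper's observation that differently biased teachers emit largely identical sequences that diverge only at a few positions.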
### Key Design 3: Loss Masking Experiments
- Training on divergence tokens only (roughly 4.7–7.5% of tokens, depending on the sampling scheme): preference transfer is generally preserved or even enhanced.
- Masking divergence tokens (training only on the remaining ~92–95% of tokens): preference transfer is largely eliminated.
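The masking experiments amount to zeroing the per-token loss outside the selected positions. A schematic NumPy version (shapes and values are illustrative; the paper fine-tunes full LLMs):

```python
import numpy as np

def masked_nll(logits, targets, mask):
    """Mean negative log-likelihood over positions where mask == 1.

    logits: (T, V) array, targets: (T,) int array, mask: (T,) 0/1 array.
    """
    # log-softmax over the vocabulary axis
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return (nll * mask).sum() / mask.sum()

T, V = 6, 10
rng = np.random.default_rng(0)
logits = rng.normal(size=(T, V))
targets = rng.integers(0, V, size=T)
div_mask = np.array([0, 0, 1, 0, 0, 1])          # divergence-token positions
loss_div = masked_nll(logits, targets, div_mask)       # train on divergence tokens only
loss_rest = masked_nll(logits, targets, 1 - div_mask)  # train with divergence tokens masked
```

Training with `div_mask` corresponds to the first condition above; training with `1 - div_mask` corresponds to the second.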
### Key Design 4: Critical Layer Localization
Using causal mediation analysis and attribution patching, the paper finds that:
- Early layers exhibit strong causal influence at positions where divergence tokens first appear.
- Fine-tuning a single early layer (e.g., layer 0 or layer 7) is sufficient to induce subliminal learning.
- Fine-tuning middle or later layers (layers 14, 21, 27, 33) yields virtually no preference transfer.
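Single-layer fine-tuning reduces to freezing every parameter outside one transformer block. A minimal sketch of the selection step, assuming the common `model.layers.<i>.` parameter-naming convention (the names below are illustrative):

```python
def trainable_param_names(all_names, layer_idx):
    """Keep only parameters belonging to one transformer block; freeze the rest."""
    prefix = f"model.layers.{layer_idx}."
    return [n for n in all_names if n.startswith(prefix)]

names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.7.self_attn.q_proj.weight",
    "model.layers.14.mlp.up_proj.weight",
]
trainable_param_names(names, 0)  # -> the two layer-0 parameters
```

The trailing dot in `prefix` matters: without it, `layers.1` would also match `layers.14`. The paper's result is that restricting updates to an early block (e.g., layer 0 or 7) suffices for subliminal learning, while a middle or late block does not.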
### Key Design 5: Fragility Analysis
- Paraphrasing prompts: randomly replacing phrases such as "look at these numbers" with synonymous alternatives (e.g., "examine these numbers") typically suppresses preference transfer without affecting task performance.
- Mixing teacher data: incorporating 10% of data from an unbiased teacher significantly reduces transfer; 25% largely eliminates it.
- Even when the biased teacher itself paraphrases the prompts, transfer is typically suppressed.
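The paraphrasing defense is essentially random phrase substitution on the prompts. A minimal sketch, assuming a small hand-written synonym table (the paper's actual phrase list is not reproduced here):

```python
import random

# Illustrative paraphrase table; entries are assumptions, not the paper's list.
PARAPHRASES = {
    "look at these numbers": ["examine these numbers", "consider these numbers"],
    "continue the sequence": ["extend the sequence", "keep the sequence going"],
}

def paraphrase_prompt(prompt, rng):
    """Randomly replace each known phrase with a synonymous alternative."""
    for phrase, options in PARAPHRASES.items():
        if phrase in prompt:
            prompt = prompt.replace(phrase, rng.choice(options))
    return prompt

rng = random.Random(0)
paraphrase_prompt("look at these numbers and continue the sequence", rng)
```

Because the substitutions are semantics-preserving, task performance is unaffected, yet the exact token prefixes that carry the divergence tokens are disrupted, which is what suppresses transfer.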
## Key Experimental Results
### Main Results
| Setting | Method | Preference Transfer |
|---|---|---|
| Qwen 2.5-7B | Temperature sampling (FT) | Partial animals transferred |
| Qwen 2.5-7B | Greedy sampling (FT greedy) | Most animals transferred (sometimes stronger) |
| Qwen 2.5-7B | Entangled tokens removed | Some animals still transferred |
| Gemma 3-4B | Temperature sampling (FT) | Most animals transferred |
| Gemma 3-4B | Greedy sampling (FT greedy) | Consistent transfer |
### Ablation Study: Role of Divergence Tokens
| Method | Divergence Token Proportion | Preference Transfer |
|---|---|---|
| Divergence tokens only (greedy) | ~7.5% (Qwen) | Preserved or enhanced |
| Non-divergence tokens (greedy) | ~92.5% | Largely eliminated |
| Divergence tokens only (temperature) | ~4.7% (Qwen) | Preserved or enhanced |
| Non-divergence tokens (temperature) | ~95.3% | Largely eliminated |
### Key Findings
- Subliminal learning can occur without logit leakage or token entanglement.
- Divergence tokens are rare but carry disproportionate causal influence.
- Early layers are critical; single-layer fine-tuning is sufficient.
- Paraphrasing prompts suppresses transfer.
- Mixing multi-teacher data also suppresses transfer.
### Misalignment Transfer Experiments
Using a Qwen model fine-tuned on harmful financial advice, experiments confirm that divergence tokens play an equally critical role in the transfer of misaligned tendencies.
## Highlights & Insights
- First work to reveal the core mechanism of subliminal learning: driven by a small number of divergence tokens rather than global token entanglement.
- Demonstrates that a single early layer is sufficient for subliminal learning, enabling precise mechanistic localization.
- Establishes the fragility of subliminal learning, providing simple and effective defense strategies.
- Methodological contribution: using greedy sampling to eliminate stochastic interference, enabling controlled analysis.
## Limitations & Future Work
- The distillation tasks used (e.g., number sequences) are relatively stylized and may not fully reflect trait transfer in real-world frontier models.
- The mechanism behind certain exceptions (e.g., 'penguin') remains unclear.
- Some models never successfully transfer hidden preferences, and the underlying reasons are not fully understood.
- The proposed defenses, while simple and effective, may not be sufficiently robust; stronger defense methods remain to be developed.
## Related Work & Insights
- Subliminal learning: first identified by Cloud et al. (2025); Zur et al. (2025) attributed it to token entanglement, which this paper refutes.
- Clean-label poisoning attacks: similar in spirit but rely on embedded signals rather than optimization.
- Dark knowledge in distillation: the seminal work of Hinton et al. (2015).
- AI safety: closely related to deceptive alignment and hidden goal detection.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — First to identify divergence tokens as the core mechanism of subliminal learning.
- Theoretical Depth: ⭐⭐⭐⭐ — Causal analysis and layer localization are rigorous, though formal theoretical guarantees are absent.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive validation across multiple models, preferences, and settings.
- Value: ⭐⭐⭐⭐ — Provides simple and actionable defense strategies for distillation safety.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear structure, progressive exposition, and well-defined conclusions.