Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer

Conference: ICLR 2026 arXiv: 2509.23886 Code: GitHub Area: Interpretability Keywords: subliminal learning, knowledge distillation, divergence tokens, hidden bias transfer, AI safety

TL;DR

Through controlled experiments and mechanistic analysis, this paper characterizes when and how subliminal learning occurs: a teacher model's hidden preferences transfer to the student through a small number of "divergence tokens," with early layers playing a critical role. The phenomenon is also shown to be fragile and can be suppressed by simple paraphrasing.

Background & Motivation

Knowledge distillation is a core technique for model compression and knowledge transfer. The conventional assumption is that what gets transferred depends on the semantic content of the training data—if the teacher's outputs do not exhibit a certain trait (e.g., a preference for a particular animal), the student should not learn that trait.

Cloud et al. (2025) challenged this assumption: a teacher's hidden preferences can transfer to the student even when the training data is entirely unrelated to those preferences (e.g., number sequences, code). This phenomenon is termed subliminal learning.

Subliminal learning is expected under soft distillation (where the student observes the teacher's full next-token distribution). Surprisingly, however, it also occurs under hard distillation (where the student observes only sampled tokens). Prior explanations attributed this to token entanglement and logit leakage, but this paper finds such explanations insufficient.
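The two regimes can be contrasted in a minimal sketch over a toy three-token vocabulary (the function names are illustrative, not the paper's code): soft distillation trains against the teacher's full next-token distribution, while hard distillation sees only the single token the teacher emitted.

```python
import math

def soft_distillation_loss(teacher_probs, student_log_probs):
    """Cross-entropy against the teacher's FULL next-token distribution."""
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))

def hard_distillation_loss(sampled_token, student_log_probs):
    """Negative log-likelihood of the SINGLE token the teacher emitted."""
    return -student_log_probs[sampled_token]

# Toy vocabulary of 3 tokens.
teacher_probs = [0.7, 0.2, 0.1]
student_log_probs = [math.log(0.6), math.log(0.3), math.log(0.1)]

soft = soft_distillation_loss(teacher_probs, student_log_probs)
hard = hard_distillation_loss(0, student_log_probs)  # teacher greedily emits token 0
```

Under soft distillation the full distribution (and hence any hidden bias encoded in it) is visible to the student; the surprise is that transfer persists even when only the `hard` signal is available.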

The central research question is: When and how does subliminal learning occur?

Method

Overall Architecture

The analysis is organized around the central concept of "divergence tokens":

  1. Ruling out prior hypotheses: demonstrating that token entanglement and logit leakage are not necessary conditions.
  2. Identifying divergence tokens: locating the key carriers of preference transfer.
  3. Mechanistic analysis: pinpointing critical layers and characterizing fragility.

Key Design 1: Ruling Out Token Entanglement and Logit Leakage

  • Logit leakage: Greedy sampling is used to generate the fine-tuning data, eliminating logit leakage entirely. Preferences still transfer—indeed, some that previously failed to transfer (e.g., 'dog' on Qwen) transfer more reliably under greedy sampling.
  • Token entanglement: After removing all training samples containing the 50 most entangled tokens, hidden preferences still transfer.
\[\text{Finding 1: Hidden preferences transfer even without logit leakage or entangled tokens.}\]
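The greedy-decoding setup can be sketched as follows (`toy_teacher` is a hypothetical stand-in for a real model): because only the argmax token is ever emitted, the generated data carries no information about the teacher's distribution beyond its top-1 choice, ruling out logit leakage by construction.

```python
def greedy_decode(next_token_logits, prompt, max_new_tokens):
    """Generate with argmax only: the emitted sequence reveals nothing
    about the teacher's distribution beyond its top-1 choice, so any
    preference transfer cannot be attributed to logit leakage."""
    seq = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_token_logits(seq)
        seq.append(max(range(len(logits)), key=logits.__getitem__))
    return seq

# Hypothetical toy "teacher": always ranks token (last_token + 1) % 4 highest.
def toy_teacher(seq):
    nxt = (seq[-1] + 1) % 4
    return [1.0 if t == nxt else 0.0 for t in range(4)]

generated = greedy_decode(toy_teacher, prompt=[0], max_new_tokens=5)
```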

Key Design 2: Divergence Tokens

Under greedy sampling, teachers with different preferences often produce largely identical token sequences for the same prompt, but diverge at specific positions.

Definition: Given a prefix \(x_{<k}\) generated by a teacher with preference \(b\), token \(x_k\) is a divergence token if and only if there exists another teacher with preference \(b' \neq b\) such that:

\[\arg\max_t p_b(t \mid x_{<k}) = x_k \quad \text{and} \quad \arg\max_t p_{b'}(t \mid x_{<k}) \neq x_k\]

Divergence tokens are rare (approximately 7.5% for Qwen and 18.3% for Gemma), yet their causal effect is substantial.
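The definition above reduces to a direct check over prefixes. A minimal sketch, with `gen_teacher` and `other_teachers` as hypothetical callables mapping a prefix to next-token logits:

```python
def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)

def find_divergence_positions(sequence, gen_teacher, other_teachers):
    """Return positions k where x_k is the generating teacher's greedy
    choice on the prefix x_<k, but some other teacher's greedy choice
    on the same prefix differs (the divergence-token test above)."""
    positions = []
    for k in range(len(sequence)):
        prefix = sequence[:k]
        if argmax(gen_teacher(prefix)) != sequence[k]:
            continue  # x_k was not this teacher's greedy token
        if any(argmax(t(prefix)) != sequence[k] for t in other_teachers):
            positions.append(k)  # another teacher diverges here
    return positions

# Toy teachers over a 2-token vocabulary: they agree everywhere except
# on prefixes of length 2, where teacher_b prefers token 1.
teacher_b = lambda prefix: [0.0, 1.0] if len(prefix) == 2 else [1.0, 0.0]
teacher_b_prime = lambda prefix: [1.0, 0.0]

sequence = [0, 0, 1, 0]  # teacher_b's greedy output
divergent = find_divergence_positions(sequence, teacher_b, [teacher_b_prime])
```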

Key Design 3: Loss Masking Experiments

  • Training on divergence tokens only (~4.7% of tokens under temperature sampling, ~7.5% under greedy sampling, for Qwen): preference transfer is generally preserved or even enhanced.
  • Masking divergence tokens (training on the remaining ~92–95% of tokens): preference transfer is largely eliminated.
\[\text{Finding 2: Divergence tokens are the key driver of subliminal learning.}\]
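The loss-masking experiment amounts to summing the token-level loss only over the kept positions, so masked positions contribute no gradient. A minimal sketch:

```python
import math

def masked_nll(per_position_log_probs, targets, keep_positions):
    """Token-level NLL summed only over kept positions; masked positions
    (e.g. the ~95% non-divergence tokens) contribute nothing."""
    keep = set(keep_positions)
    return -sum(
        log_probs[t]
        for i, (log_probs, t) in enumerate(zip(per_position_log_probs, targets))
        if i in keep
    )

# Toy example: 3 positions, vocab of 2, target token 0 everywhere.
lp = [[math.log(0.9), math.log(0.1)]] * 3
targets = [0, 0, 0]
div_only = masked_nll(lp, targets, keep_positions=[1])    # divergence token only
full = masked_nll(lp, targets, keep_positions=[0, 1, 2])  # all tokens
```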

Key Design 4: Critical Layer Localization

Using causal mediation analysis and attribution patching, the paper finds that:

  • Early layers exhibit strong causal influence at positions where divergence tokens first appear.
  • Fine-tuning a single early layer (e.g., layer 0 or layer 7) is sufficient to induce subliminal learning.
  • Fine-tuning middle or later layers (layers 14, 21, 27, 33) yields virtually no preference transfer.
\[\text{Finding 3: Early layers are critical; fine-tuning a single early layer suffices for subliminal learning.}\]
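Single-layer fine-tuning can be sketched as computing per-parameter trainable flags; in an actual PyTorch run these would drive each parameter's `requires_grad`. The parameter names below are hypothetical, following a typical decoder layout:

```python
def trainable_flags(param_names, target_layer):
    """True only for parameters in one transformer block, mimicking
    single-layer fine-tuning (e.g. layer 0 or layer 7)."""
    tag = f"layers.{target_layer}."
    return {name: tag in name for name in param_names}

# Hypothetical parameter names in a typical decoder layout.
names = [
    "embed.weight",
    "layers.0.attn.weight",
    "layers.7.mlp.weight",
    "layers.14.mlp.weight",
]
flags = trainable_flags(names, target_layer=7)
```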

Key Design 5: Fragility Analysis

  • Paraphrasing prompts: randomly replacing phrases such as "look at these numbers" with synonymous alternatives (e.g., "examine these numbers") typically suppresses preference transfer without affecting task performance.
  • Mixing teacher data: incorporating 10% of data from an unbiased teacher significantly reduces transfer; 25% largely eliminates it.
  • Even when the biased teacher itself paraphrases the prompts, transfer is typically suppressed.
\[\text{Findings 4 \& 5: Subliminal learning is fragile.}\]
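Both defenses are simple enough to sketch directly. The synonym table and mixing fractions follow the paper's description; the helper names are illustrative, not the authors' code:

```python
import random

# Hypothetical synonym table; the paper's actual phrase inventory may differ.
PARAPHRASES = {
    "look at these numbers": ["examine these numbers", "consider these numbers"],
}

def paraphrase_prompt(prompt, rng):
    """Randomly swap known phrases for synonyms; per the paper, this
    suppresses transfer without hurting task performance."""
    for phrase, alternatives in PARAPHRASES.items():
        if phrase in prompt:
            prompt = prompt.replace(phrase, rng.choice(alternatives))
    return prompt

def mix_teacher_data(biased, unbiased, frac_unbiased, rng):
    """Replace a fraction of biased-teacher samples with unbiased ones
    (10% already weakens transfer; 25% largely eliminates it)."""
    mixed = list(biased)
    n_swap = int(len(mixed) * frac_unbiased)
    for i in rng.sample(range(len(mixed)), n_swap):
        mixed[i] = unbiased[i % len(unbiased)]
    return mixed
```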

Key Experimental Results

Main Results

| Setting | Method | Preference Transfer |
|---|---|---|
| Qwen 2.5-7B | Temperature sampling (FT) | Partial animals transferred |
| Qwen 2.5-7B | Greedy sampling (FT greedy) | Most animals transferred (sometimes stronger) |
| Qwen 2.5-7B | Entangled tokens removed | Some animals still transferred |
| Gemma 3-4B | Temperature sampling (FT) | Most animals transferred |
| Gemma 3-4B | Greedy sampling (FT greedy) | Transfer consistent |

Ablation Study: Role of Divergence Tokens

| Method | Divergence Token Proportion | Preference Transfer |
|---|---|---|
| Divergence tokens only (greedy) | ~7.5% (Qwen) | Preserved or enhanced |
| Non-divergence tokens (greedy) | ~92.5% | Largely eliminated |
| Divergence tokens only (temperature) | ~4.7% (Qwen) | Preserved or enhanced |
| Non-divergence tokens (temperature) | ~95.3% | Largely eliminated |

Key Findings

  1. Subliminal learning can occur without logit leakage or token entanglement.
  2. Divergence tokens are rare but carry disproportionate causal influence.
  3. Early layers are critical; single-layer fine-tuning is sufficient.
  4. Paraphrasing prompts suppresses transfer.
  5. Mixing multi-teacher data also suppresses transfer.

Misalignment Transfer Experiments

Using a Qwen model fine-tuned on harmful financial advice, experiments confirm that divergence tokens play an equally critical role in the transfer of misaligned tendencies.

Highlights & Insights

  • First work to reveal the core mechanism of subliminal learning: driven by a small number of divergence tokens rather than global token entanglement.
  • Demonstrates that a single early layer is sufficient for subliminal learning, enabling precise mechanistic localization.
  • Establishes the fragility of subliminal learning, providing simple and effective defense strategies.
  • Methodological contribution: using greedy sampling to eliminate stochastic interference, enabling controlled analysis.

Limitations & Future Work

  • The distillation tasks used (e.g., number sequences) are relatively stylized and may not fully reflect trait transfer in real-world frontier models.
  • The mechanism behind certain exceptions (e.g., 'penguin') remains unclear.
  • Some models never successfully transfer hidden preferences, and the underlying reasons are not fully understood.
  • The proposed defenses, while simple and effective, may not be sufficiently robust; stronger defense methods remain to be developed.

Related Work

  • Subliminal learning: first identified by Cloud et al. (2025); Zur et al. (2025) attributed it to token entanglement, an explanation this paper refutes.
  • Clean-label poisoning attacks: similar in spirit but rely on embedded signals rather than optimization.
  • Dark knowledge in distillation: the seminal work of Hinton et al. (2015).
  • AI safety: closely related to deceptive alignment and hidden goal detection.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First to identify divergence tokens as the core mechanism of subliminal learning.
  • Theoretical Depth: ⭐⭐⭐⭐ — Causal analysis and layer localization are rigorous, though formal theoretical guarantees are absent.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive validation across multiple models, preferences, and settings.
  • Value: ⭐⭐⭐⭐ — Provides simple and actionable defense strategies for distillation safety.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Clear structure, progressive exposition, and well-defined conclusions.