
Emergence of Linear Truth Encodings in Language Models

Conference: NeurIPS 2025 arXiv: 2510.15804 Code: https://github.com/shauli-ravfogel/truth-encoding-neurips Area: Interpretability Keywords: truth encoding, linear representation, associative memory, training dynamics, LayerNorm

TL;DR

This paper proposes the Truth Co-occurrence Hypothesis (TCH)—that true statements tend to co-occur with other true statements—and uses a minimal single-layer Transformer toy model to provide an end-to-end demonstration of how linear truth subspaces emerge naturally through a two-phase training dynamic (memorization first → truth encoding later). This constitutes the first mechanistic explanation for the widely reported linear truth representations in LLMs.

Background & Motivation

Key Observation: Extensive probing studies have found that low-rank linear subspaces exist in the residual stream of LLMs, capable of separating true and false statements across domains (e.g., "2+2=4" vs. "The capital of France is Rome"), a finding that has attracted attention for LLM hallucination mitigation.

Two Open Questions:
  • Why: Why do such subspaces emerge during training? Why would a model need to encode "truth" as a latent variable?
  • How: During inference, how is this linear separation computed?

Limitations of Prior Explanations:
  • The Persona hypothesis (Joshi et al., 2024) associates truth values with lexical style (e.g., Wikipedia vs. social media), but this is a surface-level cue rather than a fundamental mechanism.
  • A mechanistic explanation grounded in training dynamics, one that does not rely on lexical cues, has been lacking.

Key Insight: If true and false statements within the same passage are correlated (TCH), then inferring "truth value" reduces language model loss—providing an optimization-level motivation for models to learn truth representations.

Method

Overall Architecture

Three-step approach:
  1. Validate the TCH hypothesis on real corpora (the MAVEN-FACT dataset).
  2. Construct a minimal toy model to analyze the emergence mechanism of truth encoding.
  3. Validate theoretical predictions on synthetic and natural-language data, and examine pre-trained LLMs.

Key Designs

1. Validation of the Truth Co-occurrence Hypothesis (TCH)

Event-level factuality annotations in the MAVEN-FACT corpus are analyzed statistically:
  • Overall falsity rate \(p = 0.0209\).
  • Probability that two events in the same document are both false: \(0.0009\), roughly 2× the independence baseline \(p^2 \approx 0.00044\).
  • Clustering ratio (excess variance): \(1.23\), indicating 23% additional inter-document heterogeneity.
  • \(\chi^2 = 4.17 \times 10^3\), \(p \approx 9 \times 10^{-49}\) (highly significant).
  • Conclusion: false statements cluster within documents, so tracking "truth" is beneficial for a language model.
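
To make the dispersion statistics concrete, here is a hedged sketch of how they can be computed from per-document event labels. The data below are simulated as independent (so the clustering ratio should come out near 1.0 rather than MAVEN-FACT's 1.23), and all variable names are ours, not the paper's:

```python
import numpy as np

# Hedged sketch (not the paper's code): per-document binary factuality
# labels, 1 = false event, 0 = true event. Real labels come from
# MAVEN-FACT's event-level annotations; here they are simulated.
rng = np.random.default_rng(0)
docs = [rng.binomial(1, 0.02, size=int(rng.integers(5, 40)))
        for _ in range(2000)]

p = np.concatenate(docs).mean()  # overall falsity rate

# P(two events in the same document are both false) vs. the
# independence baseline p^2.
pair_false = sum(d.sum() * (d.sum() - 1) / 2 for d in docs)
pair_total = sum(len(d) * (len(d) - 1) / 2 for d in docs)
print(f"p={p:.4f}  P(both false)={pair_false / pair_total:.5f}  p^2={p**2:.5f}")

# Dispersion test: chi-square statistic of per-document false counts
# against the binomial expectation; statistic / dof > 1 means clustering.
ns = np.array([len(d) for d in docs], dtype=float)
ks = np.array([d.sum() for d in docs], dtype=float)
chi2 = np.sum((ks - ns * p) ** 2 / (ns * p * (1 - p)))
print(f"chi2={chi2:.0f}  clustering ratio={chi2 / (len(docs) - 1):.2f}")
```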

2. Data Generation Process

Each training sample is a four-token sequence \(x \; y \; x' \; y'\):
  • \(x, x'\) are subjects (e.g., "capital of France") and \(y, y'\) are attributes (e.g., "Paris").
  • Each \(x\) has a unique correct attribute \(g(x)\).
  • With probability \(\rho\), a true sequence is generated: \(y = g(x)\) and \(y' = g(x')\).
  • With probability \(1-\rho\), a false sequence is generated: \(y\) and \(y'\) are sampled uniformly at random from the attribute set.
  • Key: within a sample, the truth values of \(y\) and \(y'\) are perfectly correlated; this is the core instantiation of TCH (see the sampler sketch below).
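
A minimal sampler sketch of this process, with illustrative names (whether \(x\) and \(x'\) must be distinct is our assumption):

```python
import random

def sample_sequence(subjects, attributes, g, rho):
    """One four-token sample (x, y, x', y') instantiating TCH: the truth
    values of y and y' are perfectly correlated within the sample."""
    x, x_prime = random.sample(subjects, 2)   # distinctness is our assumption
    if random.random() < rho:                 # true sequence
        y, y_prime = g[x], g[x_prime]
    else:                                     # false sequence: both attributes
        y = random.choice(attributes)         # uniform over the attribute set
        y_prime = random.choice(attributes)   # (may hit g(x) by chance)
    return x, y, x_prime, y_prime

# Illustrative instantiation; the paper's experiments use |S| = |A| = 512.
N = 512
subjects = [f"s{i}" for i in range(N)]
attributes = [f"a{i}" for i in range(N)]
g = dict(zip(subjects, attributes))           # unique correct attribute g(x)
batch = [sample_sequence(subjects, attributes, g, rho=0.99) for _ in range(8)]
```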

Quantifying the benefit of inferring truth: as \(|\mathcal{A}| \to \infty\), the loss difference between knowing the truth latent \(T\) and not knowing it approaches \(H_2(\rho) = -\rho \log \rho - (1-\rho)\log(1-\rho)\), the binary entropy of \(\rho\), which is maximized at \(\rho = 0.5\).
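
A quick numeric sanity check of this limit; the mixture-predictor analysis below is our reconstruction of the argument, not the paper's derivation:

```python
import numpy as np

def loss_gap(rho, A):
    """Expected cross-entropy gap at an attribute position between a
    predictor that knows the truth latent T and one that marginalizes it.
    (Our reconstruction; the overlap term when a false draw happens to
    equal g(x) is ignored, which vanishes as A grows.)"""
    # Without T: predict the mixture rho * delta_{g(x)} + (1 - rho) * uniform.
    without_T = rho * -np.log(rho + (1 - rho) / A) \
        + (1 - rho) * -np.log((1 - rho) / A)
    # With T: true context -> certain (loss 0); false -> uniform (loss log A).
    with_T = (1 - rho) * np.log(A)
    return without_T - with_T

H2 = lambda r: -r * np.log(r) - (1 - r) * np.log(1 - r)
for A in (10, 1_000, 1_000_000):
    print(A, round(loss_gap(0.5, A), 4), "->", round(H2(0.5), 4))
```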

3. Toy Model Analysis

Model: a single-layer Transformer with uniform causal attention and LayerNorm, orthogonal one-hot embeddings, and dimension \(d = 4N + 3\):
  • Token embeddings \(e_z\), positional embeddings \(p_t\), and unembedding vectors \(u_z\) are all mutually orthogonal.
  • Forward pass: \(F_W(z_{1:t})_t = U \cdot \mathsf{N}\!\left(e_{z_t} + p_t + \frac{1}{t}\sum_{s=1}^t W(e_{z_s} + p_s)\right)\), where \(\mathsf{N}(v) = v / \|v\|\) denotes LayerNorm.
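
A direct NumPy transcription of this forward pass; the shapes and block layout of the embedding matrices are illustrative choices consistent with \(d = 4N + 3\), not the paper's code:

```python
import numpy as np

def layer_norm(v):
    return v / np.linalg.norm(v)          # the paper's N(v) = v / ||v||

def forward(W, E, P, U, tokens):
    """Single-layer toy Transformer with uniform causal attention.
    Returns the logits at the final position of `tokens`. A sketch of
    the paper's F_W; shapes and layout are illustrative choices."""
    h = [E[z] + P[s] for s, z in enumerate(tokens)]       # e_{z_s} + p_s
    attn = np.mean([W @ v for v in h], axis=0)            # (1/t) sum W(e+p)
    return U @ layer_norm(h[-1] + attn)                   # U . N(...)

# Orthogonal one-hot layout in d = 4N + 3 dimensions: 2N token embeddings
# (N subjects + N attributes), 2N unembeddings, 3 positional directions.
N = 4
d = 4 * N + 3
E = np.eye(d)[:2 * N]
U = np.eye(d)[2 * N:4 * N]
P = np.eye(d)[4 * N:]
W = np.zeros((d, d))                       # value matrix, learned in training

logits = forward(W, E, P, U, tokens=[0, N + 0, 1])   # sequence x, y, x'
```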

Structure of the value matrix \(W\): after training, \(W\) exhibits a clear block structure:
  • \(W e_x = -\alpha_1 e_x + \beta_1 u_{g(x)}\): a subject maps to its correct attribute, plus self-suppression.
  • \(W e_y = \alpha_2 e_{g^{-1}(y)} - \beta_2 u_y\): an attribute maps to its corresponding subject, plus self-suppression.
  • \(W p_t\): a positional embedding maps to a uniform spread over attributes/subjects.

4. Linear Separation and Sharpening Mechanism

Core quantity: for a token attending to both \(x\) and \(y\), the residual stream contains \(\zeta(x,y) = W(e_x + e_y) = -\alpha_1 e_x + \alpha_2 e_{g^{-1}(y)} + \beta_1 u_{g(x)} - \beta_2 u_y\).

Key identity (for \(y \neq g(x)\)): \(\|\zeta(x, g(x))\|^2 = \|\zeta(x, y)\|^2 - 2\alpha_1\alpha_2 - 2\beta_1\beta_2\), so true contexts yield a strictly smaller norm.
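
Why the identity holds (a one-line expansion under the model's orthogonality assumptions): when \(y \neq g(x)\), the four vectors in \(\zeta\) are mutually orthonormal, so \(\|\zeta(x,y)\|^2 = \alpha_1^2 + \alpha_2^2 + \beta_1^2 + \beta_2^2\); when \(y = g(x)\), we have \(e_{g^{-1}(y)} = e_x\) and \(u_y = u_{g(x)}\), so \(\zeta = (\alpha_2 - \alpha_1)\,e_x + (\beta_1 - \beta_2)\,u_{g(x)}\) and \(\|\zeta\|^2 = (\alpha_1 - \alpha_2)^2 + (\beta_1 - \beta_2)^2\), which equals the false-context norm minus the cross terms \(2\alpha_1\alpha_2 + 2\beta_1\beta_2\).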

  • True sequences have smaller \(\zeta\) norm (components cancel each other out).
  • False sequences have larger \(\zeta\) norm (components do not cancel).
  • LayerNorm converts norm differences into confidence differences: after normalization, true sequences have larger amplitude → sharper softmax output → higher confidence in the correct attribute.
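
A quick numeric check of the identity and the post-LayerNorm sharpening, taking \(g\) to be the identity and illustrative \(\alpha, \beta\) values (we isolate \(\zeta\), ignoring the \(e_{z_t} + p_t\) part of the residual):

```python
import numpy as np

# Numeric check of the norm identity and the sharpening effect, with g
# taken to be the identity and illustrative alpha/beta values.
N, d = 4, 4 * 4 + 3
E, U = np.eye(d)[:2 * N], np.eye(d)[2 * N:4 * N]
a1, a2, b1, b2 = 0.8, 0.7, 0.9, 0.6

def zeta(x, y):   # subject x, attribute y; E[y] stands in for e_{g^{-1}(y)}
    return -a1 * E[x] + a2 * E[y] + b1 * U[x] - b2 * U[y]

true_sq = np.sum(zeta(0, 0) ** 2)                      # y = g(x)
false_sq = np.sum(zeta(0, 3) ** 2)                     # y != g(x)
print(true_sq, false_sq - 2 * a1 * a2 - 2 * b1 * b2)   # equal: 0.10, 0.10

# LayerNorm converts the smaller true-context norm into a larger
# correct-attribute logit: project u_{g(x)} onto the normalized residual.
for name, v in [("true", zeta(0, 0)), ("false", zeta(0, 3))]:
    print(name, U[0] @ (v / np.linalg.norm(v)))        # ~0.95 vs. ~0.59
```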

Theorem 1 (Sharpening): For \(W\) satisfying the above structure, the logit margin for \(g(x')\) is \(> 0\) in true contexts and \(= 0\) in false contexts.

Theorem 2 (Linear Truth Direction):
  • Without LayerNorm: no linear separator for true vs. false exists at position \(y\).
  • With LayerNorm: a linear separator exists with margin at least \(\delta = \frac{1}{2\sqrt{2}}\left(1 - \frac{1}{\sqrt{1 + \alpha^2 + \beta^2}}\right)\).
  • Key Insight: LayerNorm is a necessary condition for the emergence of linear truth encodings.

Theorem 3 (Training Dynamics): Under a simplified setting, two gradient steps on \(L_1\) plus one step on \(L_3\) suffice to produce the required block structure; LayerNorm plays a critical role in the gradient structure.

Loss & Training

Standard autoregressive language-modeling loss (cross-entropy): \(L(W) = \sum_{t=1}^3 \mathbb{E}_{z_{1:t+1}}\left[-\log \mathcal{S}_\beta(F_W(z_{1:t}))_{z_{t+1}}\right]\), where \(\mathcal{S}_\beta\) denotes the softmax with inverse temperature \(\beta\).

Training uses the Adam optimizer with inverse temperature \(\beta = \sqrt{d}\) (with the unit-norm \(\mathsf{N}\), this \(\sqrt{d}\) factor makes the normalization equivalent to RMSNorm).
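
A minimal training sketch under the setup above (illustrative, not the paper's code): only the value matrix \(W\) is trained, with frozen orthogonal embeddings, and tokens are integer indices into the subject/attribute vocabulary:

```python
import torch
import torch.nn.functional as F

N = 64
d = 4 * N + 3
E = torch.eye(d)[:2 * N]                 # token embeddings (subjects, attributes)
U = torch.eye(d)[2 * N:4 * N]            # unembedding vectors
P = torch.eye(d)[4 * N:]                 # positional embeddings
W = torch.zeros(d, d, requires_grad=True)
opt = torch.optim.Adam([W], lr=1e-2)
beta = d ** 0.5                          # inverse temperature = sqrt(d)

def logits_at(tokens, t):
    h = torch.stack([E[z] + P[s] for s, z in enumerate(tokens[: t + 1])])
    v = h[t] + (h @ W.T).mean(0)         # e_{z_t} + p_t + (1/t) sum W(e+p)
    return beta * (U @ (v / v.norm()))   # LayerNorm, then unembed

def batch_loss(batch):                   # batch: 4-token index sequences
    total = 0.0
    for seq in batch:
        for t in range(3):               # sum_t E[-log softmax(F_W)_{z_{t+1}}]
            total = total + F.cross_entropy(
                logits_at(seq, t).unsqueeze(0), torch.tensor([seq[t + 1]]))
    return total / len(batch)

# One step: opt.zero_grad(); batch_loss(batch).backward(); opt.step()
```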

Key Experimental Results

Main Results

Synthetic data (1-layer model, \(\rho = 0.99\), \(|\mathcal{S}| = |\mathcal{A}| = 512\)):

Two-phase training dynamics:
  1. Memorization phase (first ~1,000 batches): the model rapidly learns the \(g(x)\) mapping, and the probability of predicting the correct attribute approaches 1 for both true and false sequences.
  2. Truth-encoding phase (sudden emergence after ~7,500 batches): linear-probe AUC rises sharply, while the model begins to reduce the probability of the correct attribute on false sequences.

PCA visualization:
  • Before LayerNorm: true and false representations cluster around the origin with no separation.
  • After LayerNorm: clear linear separation emerges.
  • Input embeddings: \(e_x \approx -e_{g(x)}\) (a subject's embedding and its correct attribute's embedding become approximately antipodal).

Ablation Study

Effect of \(\rho\):
  • \(\rho = 0.999\): emergence is delayed but still occurs.
  • \(\rho = 1.0\): no emergence in practice (with no false samples, there is no pressure to encode truth), even though the theoretical toy-model construction predicts separation at \(\rho = 1\) as out-of-distribution generalization.

Effect of depth:
  • 1 layer: truth encoding emerges at position \(x'\).
  • Multiple layers: the encoding may first appear at position \(y\) and be copied to \(x'\), or emerge directly at \(x'\).

Natural-language experiments (CounterFact dataset):
  • Small Transformers (2/5/9 layers, \(d = 256\)) are trained on paired data instantiating TCH.
  • Results mirror the synthetic setting: rapid memorization → delayed emergence of linear encoding → probability decrease on false sequences.
  • The 1-layer model exhibits epoch-wise double descent.

Key Findings

Validation on pre-trained LLaMA-3-8B:
  • False sentences in the prefix reduce the model's probability of predicting the correct attribute for subsequent facts (two false prefixes yield a 4.55× probability decrease), supporting TCH.
  • LLaMA-3-8B linearly separates true and false instances across all intermediate and final layers (classification accuracy > 95%).
  • Intervening in the truth subspace, i.e., adding \(\alpha(\mu_T - \mu_F)\) to representations, reverses the effect of false contexts and raises the probability of the correct attribute (sketched below).
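
A sketch of this probing-and-steering recipe, reconstructed under stated assumptions (the activation files and layer choice are hypothetical; extracting hidden states via `output_hidden_states=True` is one standard route):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Reconstruction under stated assumptions (not the paper's code). The
# files below are hypothetical caches of LLaMA-3-8B hidden states at one
# layer, e.g. collected via model(..., output_hidden_states=True).
reps_true = np.load("reps_true.npy")     # shape (n_true, hidden_dim)
reps_false = np.load("reps_false.npy")   # shape (n_false, hidden_dim)

# 1) Linear probe; the paper reports > 95% accuracy across layers.
X = np.vstack([reps_true, reps_false])
y = np.array([1] * len(reps_true) + [0] * len(reps_false))
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))  # use a held-out split in practice

# 2) Steering: shift a representation along the truth direction
# mu_T - mu_F before it feeds later layers (e.g. via a forward hook).
mu_T, mu_F = reps_true.mean(axis=0), reps_false.mean(axis=0)

def steer(h, alpha):
    """Add alpha * (mu_T - mu_F) to push h toward the 'true' side."""
    return h + alpha * (mu_T - mu_F)
```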

Highlights & Insights

  • Elegance of the TCH hypothesis: Without relying on lexical style cues, truth encoding emergence can be explained purely through the co-occurrence statistics of true and false statements—a more fundamental account than the "persona" explanation.
  • Unexpected criticality of LayerNorm: It is a necessary component for linear separation—no LayerNorm, no linear separation. This finding is consistent with Stolfo et al. (2024) on "confidence neurons."
  • Universality of the two-phase dynamics: Observed in the toy model, synthetic NLP data, and pre-trained LLMs alike, suggesting this may be a universal training mechanism.
  • Closed theoretical-empirical loop: The argument chain from hypothesis (TCH) → data validation → toy model analysis → synthetic experiments → pre-trained model validation is complete and coherent.
  • Practical value: Understanding the mechanism of truth encoding may facilitate the development of better hallucination detection and mitigation strategies.

Limitations & Future Work

  • Toy model is highly simplified: single layer, orthogonal embeddings, uniform attention; a substantial gap remains between this setup and real Transformers.
  • Single relation type: Synthetic data contains only one latent relation; real corpora involve complex interactions among many relations.
  • Logical constraints ignored: Real-world truth values exhibit transitivity, mutual exclusivity, and other logical properties not captured by the model.
  • Overly simplified falsity distribution: Uniform random substitution vs. the more complex conditional distribution of misinformation in the real world.
  • TCH validated on a small corpus: The falsity rate in MAVEN-FACT is only 2%; large-scale validation is still lacking.
  • Behavior of multi-layer models not fully explained: Different random seeds may lead to qualitatively different learned strategies.

Comparison with Related Work

  • vs. Li et al. (2024b) / Marks & Tegmark (2024): These works identify the existence of linear truth encodings; the present paper explains their emergence mechanism.
  • vs. Joshi et al. (2024, Persona hypothesis): The Persona hypothesis relies on lexical cues; TCH provides a deeper statistical explanation—the two may act in concert.
  • vs. Geva et al. (2021, key-value memory): This work builds on the understanding of MLPs as associative memories, showing how truth encoding leverages memorized factual associations.
  • vs. Stolfo et al. (2024, confidence neurons): Both works identify a consistent mechanism—LayerNorm modulates confidence through norm scaling while simultaneously rendering true and false representations linearly separable.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The TCH hypothesis and mechanistic analysis via toy models constitute a genuinely novel theoretical contribution with fundamental implications for understanding LLM internal representations.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Synthetic experiments are thorough and pre-trained model validation is convincing, though natural language experiments are relatively small in scale.
  • Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are rigorous and the logical chain from hypothesis to validation is clear and elegant.
  • Value: ⭐⭐⭐⭐⭐ The work has far-reaching implications for LLM interpretability and hallucination understanding, and may inspire novel hallucination mitigation techniques.