Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs

Conference: ACL 2025
arXiv: 2505.16520
Code: https://github.com
Area: LLM / Factuality Analysis
Keywords: factual hallucination, hidden states, truthfulness encoding, probe classifier, dataset construction

TL;DR

This paper challenges the prior conclusion that LLM hidden states encode the truthfulness of factual statements. Using two more realistic and challenging datasets (one built with perplexity-guided negative sampling, one generated by LLMs from QA prompts), the authors find that prior probing methods generalize poorly to data that more closely resembles real-world scenarios. The result is a more rigorous benchmark and practical guidance for LLM factuality evaluation.

Background & Motivation

Background: Factual hallucination is a core challenge in LLMs—models generate grammatically fluent but factually incorrect content. Recent studies (such as SAPLMA by Azaria & Mitchell, 2023) suggest that the hidden layer activations of LLMs can be utilized to determine whether a statement is factually correct, achieving truth value judgment by training a simple probe classifier. This "self-evaluation" capability is considered an important pathway to mitigating hallucinations.

Limitations of Prior Work: (1) In the datasets used by prior studies, false statements are created by simple random replacement, which yields physically implausible sentences such as "Zebra moves by flying" that an LLM would be unlikely to produce during normal generation. (2) The "false" samples therefore do not match the actual generation patterns of LLMs: model hallucinations are typically subtle, plausible errors rather than absurd ones. (3) High probe accuracy on such synthetic data may not carry over to detecting errors that the model generates itself.

Key Challenge: High probe accuracy on simple datasets \(\neq\) LLMs actually possessing "factual self-introspection" capabilities. If error samples are too easy to distinguish (e.g., completely implausible statements), the probe may merely learn "surface anomalies" rather than "factual truth".

Goal: Design more realistic and challenging evaluation datasets to strictly test the limits of LLM hidden states in encoding factual information.

Key Insight: Increase dataset difficulty from two directions: (1) Make incorrect statements more plausible—using LLM perplexity-guided negative sampling to ensure false statements are linguistically "plausible"; (2) Make the sources of statements more realistic—directly generating statements with LLMs and manually annotating their correctness.

Core Idea: By constructing true/false datasets highly aligned with LLM generation patterns, reveal the limitations of existing factual probing methods when facing more realistic challenges.

Method

Overall Architecture

This paper is an evaluation study. Its workflow has four steps: (1) replicate the methods and results of prior work (SAPLMA); (2) propose two new dataset construction strategies; (3) re-evaluate probe performance on the new datasets; (4) analyze where the probe's generalization breaks down.

Key Designs

Perplexity-Guided Negative Sampling Strategy:
- Function: Generate more plausible and difficult-to-distinguish false statements.
- Mechanism: The original dataset builds false statements by replacing a factual value taken from tabular data. Instead of choosing the replacement at random, this strategy scores every candidate false statement with the LLM and keeps the one with the lowest perplexity, i.e., the most "plausible-looking" one. For example, for the true statement "The atomic number of Hydrogen is 1", the negative sample changes from a variant with the value 34 to a variant with the value 2, which the model finds more natural.
- Design Motivation: False statements with low perplexity are closer to the generation distribution of the model, thereby testing whether the probe can distinguish "plausible but incorrect" statements instead of "obviously absurd" ones.
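The selection step can be sketched in a few lines. This is a minimal illustration rather than the authors' code: `toy_score` and its log-probability table are hypothetical stand-ins for a real LLM scorer, and the template and candidate list are invented for the example. Perplexity is computed in the usual way, as the exponential of the negative mean token log-probability.

```python
import math
from typing import Callable, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp(-mean token log-probability)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_negative(template: str, true_value: str, candidates: List[str],
                  score: Callable[[str], float]) -> str:
    """Return the false statement that `score` rates as lowest-perplexity."""
    false_statements = [template.format(value=c)
                        for c in candidates if c != true_value]
    return min(false_statements, key=score)


# Hypothetical per-token log-probs standing in for a real LLM (assumption).
TOY_VALUE_LOGPROB = {"1": -0.5, "2": -1.0, "34": -6.0}


def toy_score(statement: str) -> float:
    """Toy scorer: known values use the table, all other tokens get -2.0."""
    tokens = statement.rstrip(".").split()
    return perplexity([TOY_VALUE_LOGPROB.get(t, -2.0) for t in tokens])


negative = pick_negative("The atomic number of Hydrogen is {value}.",
                         "1", ["1", "2", "34"], toy_score)
print(negative)
```

With a real model, `score` would tokenize the statement and read per-token log-probabilities from the LLM; the selection logic stays the same.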
QA-based LLM-generated Dataset:
- Function: Generate true/false statements highly consistent with actual LLM outputs.
- Mechanism: Prompt LLMs to generate answers using questions from QA datasets, and convert the answers into statement forms. Compare the LLM answers against ground truth answers to automatically evaluate statement correctness. In datasets generated this way, all statements are content that the LLMs "actually generate", and correctness depends on whether the model happens to know the correct answer.
- Design Motivation: This is the closest evaluation method to practical scenarios—what we care about is "which of the self-generated contents of the model are factually correct". If the probe does not work on this task, it indicates that the factual self-evaluation capability is limited.
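A rough sketch of the labeling step follows. The summary does not specify the matching rule or the statement templates, so both are assumptions here: answers are compared by exact match after a simple normalization (lowercasing, dropping punctuation and articles), and `build_record` with its template is a hypothetical helper.

```python
import re
import string
from typing import Dict, List, Union


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def label_answer(model_answer: str, gold_answers: List[str]) -> bool:
    """True if the model answer matches any gold answer after normalization."""
    pred = normalize(model_answer)
    return any(pred == normalize(gold) for gold in gold_answers)


def build_record(template: str, model_answer: str,
                 gold_answers: List[str]) -> Dict[str, Union[str, bool]]:
    """Turn a model answer into a labeled statement (hypothetical helper)."""
    answer = model_answer.strip().rstrip(".")
    return {"statement": template.format(answer=answer),
            "label": label_answer(model_answer, gold_answers)}


records = [
    build_record("The capital of France is {answer}.", "Paris.", ["Paris"]),
    build_record("The capital of France is {answer}.", "Lyon", ["Paris"]),
]
for record in records:
    print(record)
```

Exact match is strict and will mark paraphrased correct answers as false, so a real pipeline would likely need alias lists or a softer matching rule.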
Cross-Topic and Cross-Dataset Generalization Evaluation:
- Function: Test whether the generalization capability of the probe extends beyond the training data distribution.
- Mechanism: Follow the leave-one-topic-out evaluation of prior work: train the probe on five topics and test on a sixth hold-out topic. Additionally, apply probes trained on the original synthetic data to the new perplexity-sampled data and LLM-generated data to test cross-dataset generalization.
- Design Motivation: If the probe is only effective within the training distribution but fails across distributions, it shows that it learns data-specific spurious correlations rather than general factuality encoding.
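The leave-one-topic-out protocol can be sketched as a generator over splits. The topic names and record fields below are placeholders, not the dataset's actual schema.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Example = Dict[str, object]


def leave_one_topic_out(
    examples: List[Example],
) -> Iterator[Tuple[str, List[Example], List[Example]]]:
    """Yield (held_out_topic, train, test), holding out one topic per split."""
    by_topic: Dict[str, List[Example]] = defaultdict(list)
    for ex in examples:
        by_topic[str(ex["topic"])].append(ex)
    for held_out in sorted(by_topic):
        train = [ex for topic, exs in by_topic.items()
                 if topic != held_out for ex in exs]
        yield held_out, train, by_topic[held_out]


# Six placeholder topics with two statements each (hypothetical data).
examples = [
    {"topic": f"topic_{i}", "statement": f"statement {i}-{j}", "label": j % 2}
    for i in range(6) for j in range(2)
]
splits = list(leave_one_topic_out(examples))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Cross-dataset transfer follows the same pattern with datasets in place of topics: fit the probe on one dataset and evaluate it unchanged on another.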

Loss & Training

The probe (SAPLMA) is a 3-layer fully connected network (256-128-64), trained with binary cross-entropy loss for 5 epochs using the Adam optimizer. Inputs are the hidden state activations of specific LLM layers. The 16th, 20th, 24th, 28th, and 32nd layers of OPT-6.7B and Llama2-7B were analyzed.
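To make the probe's size concrete, here is a dependency-free sketch of its forward pass and loss. Several details are assumptions not stated above: ReLU activations on the hidden layers, a single sigmoid output unit, the uniform weight initialization, and an input width of 4096 (the hidden size of both OPT-6.7B and Llama2-7B). Training with Adam is omitted; in practice this would be a few lines in a deep learning framework.

```python
import math
import random
from typing import List, Tuple

HIDDEN_SIZES = [256, 128, 64]  # layer widths given in the text
Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_probe(input_dim: int, seed: int = 0) -> List[Layer]:
    """Randomly initialize input -> 256 -> 128 -> 64 -> 1 (output unit assumed)."""
    rng = random.Random(seed)
    dims = [input_dim] + HIDDEN_SIZES + [1]
    layers = []
    for fan_in, fan_out in zip(dims, dims[1:]):
        bound = 1.0 / math.sqrt(fan_in)
        weights = [[rng.uniform(-bound, bound) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
    return layers


def forward(layers: List[Layer], x: List[float]) -> float:
    """Return P(statement is true): ReLU hidden layers, sigmoid output."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return 1.0 / (1.0 + math.exp(-x[0]))


def bce(p: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for a single example."""
    return -(y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps))


def count_params(layers: List[Layer]) -> int:
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


probe = init_probe(input_dim=4096)
n_params = count_params(probe)
rng = random.Random(1)
hidden_state = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # fake activation
p_true = forward(probe, hidden_state)
print(n_params, round(bce(p_true, 1), 4))
```

Under these assumptions the probe has about 1.09 million parameters, almost all of them in the first layer, which is tiny next to the 7B-parameter models it reads from.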

Key Experimental Results

Main Results

| Dataset Type | Model | Probe Accuracy (Trained on Orig. Data) | Probe Accuracy (Trained on New Data) |
| --- | --- | --- | --- |
| Original True-False (Random Replacement) | Llama2-7B | 72.3% | n/a |
| Original True-False (Random Replacement) | OPT-6.7B | 68.5% | n/a |
| Perplexity-Sampled (Harder) | Llama2-7B | 58.2% | 63.7% |
| Perplexity-Sampled (Harder) | OPT-6.7B | 54.8% | 60.1% |
| LLM-Generated (Most Realistic) | Llama2-7B | 53.1% | 56.8% |
| LLM-Generated (Most Realistic) | OPT-6.7B | 52.4% | 54.2% |

Ablation Study

| Configuration | Accuracy (Llama2, LLM-Generated Data) | Description |
| --- | --- | --- |
| Probe@Layer 32 | 56.8% | Last layer |
| Probe@Layer 28 | 55.4% | Second-deepest probed layer |
| Probe@Layer 24 | 54.1% | Middle of the probed range |
| Probe@Layer 20 | 53.2% | Shallowest layer reported here |
| Random Baseline | 50.0% | Random guess |
| Orig. Data \(\rightarrow\) LLM Data Transfer | 53.1% | Severe degradation in generalization |
| LLM Data \(\rightarrow\) Orig. Data Transfer | 65.8% | Reverse transfer performs better (12.7 points above the forward direction) |

Key Findings

On the original dataset, the probe reaches roughly 70% accuracy (72.3% for Llama2-7B, 68.5% for OPT-6.7B), partially confirming the findings of prior work.
However, on the more challenging perplexity-sampled and LLM-generated data, accuracy falls to 52.4-58.2% for probes trained on the original data and 54.2-63.7% for probes retrained on the new data, close to the 50% random baseline. This suggests that the probe largely picks up statistical regularities of the easy data.
Probes trained on original data can hardly transfer to LLM-generated data (53.1%), demonstrating a severe lack of generalization capability.
Deeper layers perform slightly better, but the difference is marginal: layer 32 is only 3.6 percentage points above layer 20, the shallowest layer reported in the ablation.
The trends are consistent across both models (OPT-6.7B and Llama2-7B).
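The gaps quoted in these findings follow directly from the reported Llama2-7B numbers. The snippet below only restates that arithmetic; the variable names are illustrative.

```python
CHANCE = 50.0

# Llama2-7B probe trained on the original data, evaluated on each dataset.
transfer_acc = {"original": 72.3, "perplexity_sampled": 58.2, "llm_generated": 53.1}
# Llama2-7B probe accuracy per layer on the LLM-generated data.
layer_acc = {32: 56.8, 28: 55.4, 24: 54.1, 20: 53.2}

# Margin over random guessing, in percentage points.
margin_over_chance = {name: round(acc - CHANCE, 1) for name, acc in transfer_acc.items()}
# Accuracy lost when moving from the original to the LLM-generated data.
drop_to_llm_data = round(transfer_acc["original"] - transfer_acc["llm_generated"], 1)
# Gap between the deepest and shallowest reported layers.
layer_gap = round(layer_acc[32] - layer_acc[20], 1)

print(margin_over_chance)
print(drop_to_llm_data, layer_gap)
```

The probe trained on the original data keeps only about 3 points of margin over chance on LLM-generated statements, down from about 22 points on the original data.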

Highlights & Insights

The core contribution of this paper is "questioning" rather than "proving"—rigorous adversarial evaluation is crucial for AI safety research. Overly optimistic conclusions may lead to building downstream applications on unreliable factual detection methods.
Perplexity-guided negative sampling is a general data augmentation strategy that can be applied to any NLP evaluation task requiring high-quality negative samples (e.g., fact-checking, fake-news detection).
"Self-Contradiction Paradox": If an LLM truly knows a statement is false (via hidden states), why does it still generate it? This contradiction implies that hidden states might encode "fluency" rather than "factuality".

Limitations & Future Work

Only analyzed two relatively small models (6.7B and 7B); larger scale models might have stronger factuality encoding capabilities.
The probe architecture is simple (3-layer MLP); more complex probes might be able to extract richer signals.
Multi-layer fusion was not considered—concurrently using multiple layers of hidden states may be more effective than single-layer hidden states.
Future work can combine attention analysis and hidden state analysis to achieve a more comprehensive understanding of factuality encoding.

vs SAPLMA (Azaria & Mitchell, 2023): This paper directly challenges the core conclusions of SAPLMA, showing that its reported accuracy does not hold up under more realistic evaluation settings.
vs ITI (Li et al., 2024): Inference-time intervention methods rely on the assumption that "truthful directions" exist within hidden states, an assumption questioned by this paper.
vs CCS (Burns et al., 2022): Contrast-Consistent Search also relies on true/false encoding in hidden states, which similarly requires more rigorous evaluation.

Rating

Novelty: ⭐⭐⭐⭐ Proposing two more realistic dataset construction strategies delivers high value for critical evaluations of prior work.
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple datasets, two models, multi-layer analysis, and cross-dataset transfer evaluation.
Writing Quality: ⭐⭐⭐⭐⭐ Precise problem definition, rigorous experimental design, and convincing conclusions.
Value: ⭐⭐⭐⭐ Provides an important "cool-down" for the LLM factuality research community, fostering more rigorous evaluation standards.