
The Trilemma of Truth in Large Language Models

Conference: NeurIPS 2025 arXiv: 2506.23921 Code: GitHub Area: LLM NLP / Interpretability / Veracity Probing Keywords: veracity probing, multiple-instance learning, conformal prediction, truth direction, LLM internals

TL;DR

This paper proposes sAwMIL (Sparse-Aware Multiple-Instance Learning), a three-class probing framework that combines multiple-instance learning (MIL) with conformal prediction to classify LLM internal activations as true, false, or neither. It reveals that truth and falsity signals are not encoded as simple bidirectional opposites, but as distinct representations spanning a multi-dimensional subspace.

Background & Motivation

Background: Existing veracity probing methods (e.g., mean-difference, CCS, TTPD) separate "true" and "false" directions in LLM internal activations via linear probes, performing binary classification based on the last-token representation.

Limitations of Prior Work:

  • They assume symmetric encoding of truth and falsity (\(P(\phi|K_\mathcal{M}) = 1 - P(\neg\phi|K_\mathcal{M})\)), an assumption unsupported by empirical evidence.
  • They assume LLMs have knowledge of all facts, ignoring cases where the model simply lacks certain knowledge.
  • Using only the last-token representation discards signals from critical positions within the sentence (e.g., factual actualization points).
  • Output scores are uncalibrated and cannot serve as reliable confidence estimates.
  • Binary classification (true/false) cannot handle cases where the model does not know.

Key Challenge: Binary logic fails to accurately characterize the internal knowledge states of LLMs — a model may regard a statement as neither true nor false.

Key Insight: Introduce three-valued logic (true/false/neither) and apply a MIL mechanism to attend to the most informative tokens in a sentence, rather than defaulting to the last token.

Core Idea: Multiple instance learning enables probes to automatically discover token positions that carry veracity signals within a sentence; conformal prediction is used to quantify uncertainty, yielding a three-class classification scheme.

Method

Overall Architecture

The input is the intermediate-layer activation \(h_i(\boldsymbol{x}) \in \mathbb{R}^{L \times d}\) (representations of all tokens) obtained by encoding a declarative statement through an LLM; the output is a three-class probability distribution \(\{p_{true}, p_{false}, p_{neither}\}\). The pipeline consists of three steps: (1) training three one-vs-all sbMIL probes, (2) integrating their outputs into multi-class probabilities via softmax regression, and (3) calibrating the outputs with conformal prediction.
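Step (2) of the pipeline reduces to a few lines of numpy. In this sketch, `alphas` and `betas` are placeholders for the per-class parameters \(\alpha_k, \beta_k\) that the paper learns via softmax regression; the values used here are purely illustrative.

```python
import numpy as np

def integrate_probes(scores, alphas, betas):
    """Combine one-vs-all probe scores g_i^k(x) into class probabilities.

    scores: raw decision values from the is-true, is-false, and is-neither
    probes; alphas/betas: per-class affine parameters (learned by softmax
    regression in the paper; fixed here for illustration).
    """
    z = np.asarray(scores) * np.asarray(alphas) + np.asarray(betas)
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()       # p_k = exp(z_k) / sum_j exp(z_j)

# Hypothetical probe scores for one statement: is-true fires strongest.
p = integrate_probes([2.1, -0.7, 0.3], alphas=[1.0, 1.0, 1.0], betas=[0.0, 0.0, 0.0])
```

The max-subtraction leaves the softmax output unchanged but avoids overflow for large decision values.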

Key Designs

  1. Sparse-Aware Multiple Instance Learning (sAwMIL):

    • Function: Treats the entire sentence as a "bag" and each token's activation as an "instance," automatically identifying which tokens carry veracity signals.
    • Mechanism: Two-stage training. In the first stage, MIL-SVM identifies the highest-scoring instance in positive bags and computes an \(\eta\)-quantile threshold. In the second stage, only tokens above the threshold that belong to the actualized part of the sentence are used to train a standard SVM.
    • Design Motivation: Only factually critical tokens (e.g., "Latvia" in "The city of Riga is in Latvia") carry veracity signals; the prefix ("The city of Riga is in") contains no decisive information. MIL automatically localizes these key tokens without requiring manual annotation.
  2. Intra-bag Label Mechanism:

    • Function: Partitions the sentence into a pre-actualized segment \(\boldsymbol{x}^p\) (label 0) and an actualized segment \(\boldsymbol{x}^a\) (label 1).
    • Mechanism: After \(\eta\)-quantile filtering, intra-bag labels further restrict training to high-scoring tokens within the actualized region only.
    • Design Motivation: Prevents MIL from mistakenly treating noisy signals from non-critical positions as veracity signals.
  3. One-vs-All → Multi-class Integration:

    • Three independent sAwMIL probes are trained: is-true, is-false, and is-neither.
    • Their outputs are integrated via softmax regression: \(p_k = \frac{\exp(z_k)}{\sum_j \exp(z_j)}\), where \(z_k = g_i^k(\boldsymbol{x}) \cdot \alpha_k + \beta_k\).
    • The final output is a calibrated three-class probability distribution.
  4. Conformal Prediction:

    • Function: Converts raw SVM decision scores into prediction sets with statistical coverage guarantees.
    • Mechanism: Constructs nonconformity scores to ensure prediction sets cover the true label with probability \(1-\alpha\).
    • Design Motivation: SVM decision scores are not calibrated probabilities; direct sigmoid compression is also unreliable.
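The two-stage mechanism in design 1 can be sketched as follows. This is a simplified stand-in, not the paper's implementation: it alternates instance selection with refitting a `LinearSVC` rather than solving the MIL-SVM/sbMIL objective, and the toy data, \(\eta = 0.5\), and function names are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def sawmil_fit(bags, bag_labels, actualized_masks, eta=0.5, n_iter=3):
    """Two-stage sAwMIL-style training (simplified sketch).

    bags: list of (L_i, d) token-activation arrays; bag_labels: 0/1 per bag;
    actualized_masks: boolean arrays marking each bag's actualized segment.
    """
    # Stage 1: alternate between picking the top-scoring instance in each
    # positive bag and refitting a linear classifier (a crude stand-in for
    # the MIL-SVM objective used in the paper).
    clf = LinearSVC()
    reps = [b[-1] for b in bags]                 # start from last tokens
    for _ in range(n_iter):
        clf.fit(np.stack(reps), bag_labels)
        reps = [b[np.argmax(clf.decision_function(b))] if y == 1 else b[-1]
                for b, y in zip(bags, bag_labels)]
    # eta-quantile threshold over the selected positive-instance scores
    pos_scores = [clf.decision_function(r[None, :])[0]
                  for r, y in zip(reps, bag_labels) if y == 1]
    tau = np.quantile(pos_scores, eta)
    # Stage 2: train a standard SVM on actualized tokens above the threshold
    X, y = [], []
    for b, lab, mask in zip(bags, bag_labels, actualized_masks):
        scores = clf.decision_function(b)
        keep = mask & (scores >= tau) if lab == 1 else mask
        X.append(b[keep])
        y.extend([lab] * int(keep.sum()))
    return LinearSVC().fit(np.vstack(X), y), tau

# Toy demo: each positive bag hides one "signal" token among noise tokens;
# negative bags contain only tokens on the opposite side.
rng = np.random.default_rng(0)
pos = []
for _ in range(10):
    b = rng.normal(0.0, 0.1, (5, 4))
    b[2, 0] += 3.0                               # the lone signal token
    pos.append(b)
neg = [rng.normal(0.0, 0.1, (5, 4)) - np.array([3.0, 0.0, 0.0, 0.0])
       for _ in range(10)]
bags = pos + neg
labels = np.array([1] * 10 + [0] * 10)
masks = [np.ones(5, dtype=bool)] * 20
model, tau = sawmil_fit(bags, labels, masks)
```

On this toy data, stage 1 locates the signal token in each positive bag even though training starts from the (uninformative) last token, which mirrors the motivating example: only "Latvia", not the prefix, carries the signal.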
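The conformal step (design 4) can be illustrated with a standard split-conformal construction over the three classes. The nonconformity score (one minus the probability assigned to the true class) and the finite-sample quantile correction below are textbook choices, assumed here rather than taken from the paper.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction over three classes (sketch).

    Keeps every class whose nonconformity falls below the finite-sample-
    corrected (1 - alpha) quantile of calibration nonconformities, which
    guarantees >= 1 - alpha coverage of the true label on exchangeable data.
    """
    n = len(cal_labels)
    nonconf = 1.0 - cal_probs[np.arange(n), cal_labels]
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(nonconf, level)
    return [set(np.flatnonzero(1.0 - p <= q)) for p in test_probs]

# Calibration set: 20 examples where the probe gives the true class p = 0.9
cal_probs = np.tile([0.9, 0.05, 0.05], (20, 1))
cal_labels = np.zeros(20, dtype=int)
sets = conformal_sets(cal_probs, cal_labels,
                      np.array([[0.95, 0.03, 0.02],    # confident -> {0}
                                [0.40, 0.35, 0.25]]))  # diffuse -> empty set
```

Note that in this simple variant the prediction set can be empty, which flags that no class is conformal at the chosen level.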

Key Experimental Results

Main Results: Probe Comparison (Average over 16 LLMs and 3 Datasets)

| Method | Type | Correlation MCC | Generalization MCC |
|---|---|---|---|
| Zero-shot prompting | Black-box | ~0.35 | ~0.25 |
| MD+CP (last token) | Binary+CP | ~0.30 | ~0.15 |
| TTPD+CP (last token) | Binary+CP | ~0.32 | ~0.18 |
| sPCA+CP (last token) | Binary+CP | ~0.30 | ~0.17 |
| SVM+CP (last token) | Multi-class | ~0.55 | ~0.30 |
| sAwMIL (full bag) | Multi-class+MIL | ~0.60 | ~0.45 |

  • Binary probes degrade substantially under the full-bag setting, indicating sensitivity to noisy tokens.
  • sAwMIL maintains stable or improved performance under the full-bag setting, demonstrating effective noise filtering via MIL.

Truth–Falsity Direction Analysis

| Metric | SVM | sAwMIL |
|---|---|---|
| Cosine similarity: is-true vs. is-false | ~-0.5 | ~-0.3 |
| Spearman correlation: is-true vs. is-false | ~-0.6 | ~-0.4 |
| Effective rank of direction matrix | 1.73±0.012 | 1.93±0.004 |

  • If truth and falsity were strictly symmetric, cosine similarity would approach -1 and effective rank would approach 1.
  • The observed effective rank close to 2 demonstrates that truth and falsity directions span a two-dimensional subspace and are not antipodal on a single axis.
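The rank argument above can be checked numerically with the standard entropy-based definition of effective rank (Roy & Vetterli, 2007), which is assumed here to be the one the paper uses: antipodal directions give effective rank 1, orthogonal directions give 2.

```python
import numpy as np

def effective_rank(M):
    """Entropy-based effective rank: exp of the entropy of the
    normalized singular-value distribution of M."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                      # drop zero singular values
    return float(np.exp(-np.sum(p * np.log(p))))

v = np.array([1.0, 0.0, 0.0])
# Strictly symmetric encoding: is-false = -is-true -> rank collapses to 1
antipodal = effective_rank(np.stack([v, -v]))                # -> ~1.0
# Independent directions -> the two probes span a 2-D subspace
orthogonal = effective_rank(np.stack([v, [0.0, 1.0, 0.0]]))  # -> 2.0
```

The reported 1.93 for sAwMIL sits close to the orthogonal extreme, supporting the claim that the is-true and is-false directions are largely independent.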

Key Findings

  • Binary probes are unreliable: Methods such as MD and TTPD perform worse than zero-shot prompting under generalization evaluation.
  • The neither class is critical: Binary probes frequently misclassify neither statements as true or false with high confidence.
  • Truth–falsity encoding is asymmetric: The effective rank of 1.93 for sAwMIL indicates that truth and falsity are independently encoded within LLMs.
  • Key token positions matter: Relying solely on the last token discards signals at factual actualization points.

Highlights & Insights

  • Introduction of three-valued logic: Extends LLM veracity probing from binary to ternary classification; the "neither" class elegantly handles cases of model ignorance.
  • Automatic key-token localization via MIL: No manual annotation of signal-bearing tokens is required — the data speaks for itself — yielding greater robustness than fixed last-token approaches.
  • Systematic critique of five assumptions: The paper systematically examines assumptions prevalent in the existing literature and provides a corrected formulation for each — a research paradigm transferable to other probing tasks.

Limitations & Future Work

  • Evaluation is limited to binary relational statements (city–country, drug–indication); complex statements involving multiple relations or entities remain untested.
  • The neither class is constructed using synthetic entities (e.g., fictitious city names such as "Staakess"); it is unclear whether this genuinely reflects a model's "don't know" state.
  • Model scale is restricted to 3–14B parameters; behavior at larger scales (70B+) may differ.
  • Probes remain linear; nonlinear probes may capture more complex veracity signals.

Comparison with Prior Methods

  • vs. Mean-Difference Probe (Marks et al.): Uses only the last token with binary classification, cannot handle neither, and generalizes poorly; sAwMIL employs the full sentence with three-class classification and generalizes substantially better.
  • vs. CCS (Burns et al.): Employs unsupervised contrastive learning to identify truth–falsity directions, but still assumes binary symmetry; sAwMIL provides direct empirical evidence against this assumption.
  • vs. TTPD (Burger et al.): Although multi-directional encodings are observed, the probe remains binary; sAwMIL is the first complete three-class solution.
  • Implications for LLM safety and hallucination detection: Real-time detection of "neither" signals in model internals could enable proactive abstention during generation.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of three-valued logic and MIL is novel; the critique of five assumptions is valuable.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers 16 models × 3 datasets with diverse baselines and in-depth directional analysis.
  • Writing Quality: ⭐⭐⭐⭐⭐ Argumentation is logically clear; the assumption–critique–solution structure is well-executed.
  • Value: ⭐⭐⭐⭐ Makes an important contribution to understanding internal knowledge representations in LLMs, though further validation is needed for practical applications.