Steer LLM Latents for Hallucination Detection
Conference: ICML 2025
arXiv: 2503.01917
Code: -
Area: Hallucination Detection
Keywords: steering vector, hallucination detection, optimal transport, pseudo-labeling, TSV
TL;DR
Proposes the Truthfulness Separator Vector (TSV), a lightweight steering vector that reshapes the LLM representation space at inference time to widen the separation between truthful and hallucinated outputs. With only 32 labeled exemplars, it achieves performance close to full supervision.
Background & Motivation
Background
LLM hallucination is a significant risk for safe deployment.
Key Challenge
Existing latent-space methods rely on pre-trained LLM embeddings, but these embeddings are optimized for linguistic coherence rather than factual accuracy.
Limitations of Prior Work
Truthful and hallucinated content overlaps heavily in pre-trained embeddings (see the t-SNE visualization). Fine-tuning the LLM to reduce this overlap is computationally expensive and alters model parameters.
Proposed Approach
Keep the LLM frozen and learn a lightweight steering vector (TSV) that reshapes the representation space instead.
Supplementary Notes
Core problem: how can the latent space be reshaped to distinguish hallucinations without modifying model parameters?
Method
1. Truthfulness Separator Vector (TSV)
Define a trainable vector \(\mathbf{v} \in \mathbb{R}^d\) that is added to the hidden state \(\mathbf{h}^{(l)}\) of an intermediate layer \(l\) during inference:
\[
\mathbf{h}^{(l)} \leftarrow \mathbf{h}^{(l)} + \lambda \mathbf{v}
\]
where \(\lambda\) controls the steering intensity. The TSV is shared across all token positions and influences the final-layer embeddings through the subsequent non-linear transformations.
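The steering step described above can be sketched in a few lines. This is only an illustration in plain Python with made-up numbers: the function name `steer_hidden_states` and the toy values are assumptions, and a real implementation would apply the same addition to the layer-\(l\) activations inside the model's forward pass.

```python
def steer_hidden_states(hidden_states, v, lam):
    """Add lam * v to the hidden state at every token position.

    hidden_states: list of token vectors from layer l (each a list of floats)
    v: steering vector with the same dimension d (shared across positions)
    lam: steering intensity (lambda in the text)
    """
    if any(len(h) != len(v) for h in hidden_states):
        raise ValueError("every hidden state must have the same dimension as v")
    return [[h_j + lam * v_j for h_j, v_j in zip(h, v)] for h in hidden_states]


# Toy example (hypothetical values): 3 token positions, hidden size d = 4.
layer_l_states = [
    [0.5, -1.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
tsv = [1.0, 0.0, -1.0, 0.5]
steered = steer_hidden_states(layer_l_states, tsv, lam=2.0)
print(steered[2])  # the all-zero state becomes lam * v
```

Because the same vector is added at every position, the only quantities to learn or tune are the `d` entries of the vector, the intensity, and the choice of layer.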
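2. Initial Training Phase
The final-layer embeddings are modeled with a von Mises-Fisher (vMF) distribution, which gives the class-conditional probability
\[
p(c \mid \mathbf{r}^{\mathbf{v}}) = \frac{\exp\left(\kappa \, \boldsymbol{\mu}_c^{\top} \mathbf{r}^{\mathbf{v}}\right)}{\sum_{c'} \exp\left(\kappa \, \boldsymbol{\mu}_{c'}^{\top} \mathbf{r}^{\mathbf{v}}\right)}
\]
where \(\mathbf{r}^{\mathbf{v}}\) is the normalized final embedding, \(\boldsymbol{\mu}_c\) is the prototype of class \(c\) (truthful or hallucinated), and \(\kappa\) is the vMF concentration parameter.
The training objective is to maximize the log-likelihood on the exemplar set \(\mathcal{D}_E\):
\[
\max_{\mathbf{v}} \sum_{(\mathbf{x}_i, c_i) \in \mathcal{D}_E} \log p(c_i \mid \mathbf{r}_i^{\mathbf{v}})
\]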
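As a rough sketch of how a vMF-style class probability and the exemplar log-likelihood could be evaluated, the snippet below uses a softmax over scaled cosine similarities to the class prototypes. The helper names, the concentration value `kappa`, and the two-dimensional toy embeddings are all assumptions made for illustration, not values from the paper.

```python
import math


def l2_normalize(x):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x_j * x_j for x_j in x))
    return [x_j / norm for x_j in x]


def class_probabilities(r, prototypes, kappa):
    """Softmax over kappa * <mu_c, r> for a unit-norm r and unit-norm prototypes."""
    logits = [kappa * sum(m_j * r_j for m_j, r_j in zip(mu, r)) for mu in prototypes]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exemplar_log_likelihood(embeddings, labels, prototypes, kappa):
    """Sum of log p(c_i | r_i) over a labeled exemplar set."""
    return sum(
        math.log(class_probabilities(l2_normalize(e), prototypes, kappa)[c])
        for e, c in zip(embeddings, labels)
    )


# Hypothetical 2-D example with one prototype per class (class 0 and class 1).
prototypes = [l2_normalize([1.0, 0.0]), l2_normalize([0.0, 1.0])]
embeddings = [[3.0, 1.0], [0.5, 2.0]]
labels = [0, 1]
probs = class_probabilities(l2_normalize(embeddings[0]), prototypes, kappa=10.0)
loglik = exemplar_log_likelihood(embeddings, labels, prototypes, kappa=10.0)
print(probs, loglik)
```

Training would adjust the steering vector so that this log-likelihood increases. The sketch only evaluates the objective and does not perform the optimization.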
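3. Enhanced Training Phase
Pseudo-Label Assignment based on Optimal Transport
For the unlabeled data \(\mathcal{D}_U\) with \(M\) samples, pseudo-labels are assigned by solving an entropy-regularized optimal transport problem with the Sinkhorn algorithm:
\[
\min_{\mathbf{Q} \in \mathbb{R}_{+}^{M \times 2}} \; -\sum_{i=1}^{M} \sum_{c} Q_{ic} \log p(c \mid \mathbf{r}_i^{\mathbf{v}}) \;-\; \epsilon H(\mathbf{Q})
\]
subject to the constraints that every row of \(\mathbf{Q}\) sums to \(1/M\) (each sample carries equal mass, so the total mass is 1) and the column sums match the class distribution prior \(\mathbf{w}\). Here \(p(c \mid \mathbf{r}_i^{\mathbf{v}})\) is the predicted class probability of sample \(i\), \(H(\mathbf{Q})\) is the entropy of the assignment, and \(\epsilon\) is the regularization strength.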
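A minimal Sinkhorn sketch for this assignment step is shown below, assuming the transport cost is the negative log of the predicted class probability. The function name `sinkhorn_pseudo_labels`, the regularization strength `eps`, the iteration count, and the toy probabilities are illustrative assumptions rather than the paper's settings.

```python
def sinkhorn_pseudo_labels(probs, w, eps=0.5, n_iters=200):
    """Return an M x C assignment whose rows sum to 1/M and columns sum to w.

    probs: predicted class probabilities per sample (strictly positive)
    w: class distribution prior, summing to 1
    eps, n_iters: assumed regularization strength and iteration budget
    """
    M, C = len(probs), len(w)
    # Gibbs kernel exp(-cost / eps) with cost = -log p, i.e. p ** (1 / eps).
    K = [[p ** (1.0 / eps) for p in row] for row in probs]
    u = [1.0] * M
    v = [1.0] * C
    for _ in range(n_iters):
        # Alternate between matching the row marginals and the column marginals.
        u = [(1.0 / M) / sum(K[i][c] * v[c] for c in range(C)) for i in range(M)]
        v = [w[c] / sum(K[i][c] * u[i] for i in range(M)) for c in range(C)]
    return [[u[i] * K[i][c] * v[c] for c in range(C)] for i in range(M)]


# Toy predictions for 4 unlabeled samples and a balanced prior (hypothetical).
probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
w = [0.5, 0.5]
Q = sinkhorn_pseudo_labels(probs, w)
hard_labels = [max(range(len(w)), key=lambda c: row[c]) for row in Q]
print(hard_labels)
```

In this toy run the third sample ends up in class 1 even though its raw prediction slightly favors class 0, because the balanced prior forces half of the total mass onto each class. This is the sense in which the assignment accounts for the class distribution rather than thresholding each sample independently.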
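Confidence Filtering
Only the \(K\) pseudo-labeled samples with the lowest prediction uncertainty are selected for training:
\[
\mathcal{D}_S = \left\{ \mathbf{x}_i \in \mathcal{D}_U : \Omega_i \text{ is among the } K \text{ smallest values} \right\}
\]
where \(\Omega_i\) is the uncertainty of sample \(i\), measured by the cross-entropy between its pseudo-label and the model's predicted class probability.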
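The filtering rule can be sketched as a simple top-K selection by cross-entropy. The function name `select_confident` and the toy pseudo-labels and predictions below are assumptions used only to illustrate the idea.

```python
import math


def select_confident(pseudo_labels, probs, k):
    """Return the indices of the k samples with the lowest cross-entropy, plus all scores."""

    def cross_entropy(q, p):
        # Terms with q_c == 0 contribute nothing, so they are skipped.
        return -sum(q_c * math.log(p_c) for q_c, p_c in zip(q, p) if q_c > 0)

    omega = [cross_entropy(q, p) for q, p in zip(pseudo_labels, probs)]
    ranked = sorted(range(len(omega)), key=lambda i: omega[i])
    return ranked[:k], omega


# Hypothetical one-hot pseudo-labels and predicted class probabilities.
pseudo_labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
predicted = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
selected, omega = select_confident(pseudo_labels, predicted, k=2)
print(selected)
```

Samples whose pseudo-label disagrees with the model's own prediction (the third sample here) receive a high uncertainty score and are left out of the training subset.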
Key Experimental Results
Main Results: TruthfulQA (AUROC)
| Method | AUROC (%) on LLaMA-3.1-8B |
|---|---|
| CCS | 58.1 |
| SAPLMA | 63.2 |
| Probing (supervised) | 71.3 |
| HaloScope | 71.4 |
| TSV (32 exemplars) | 84.2 |
| Fully supervised upper bound | 85.5 |
- TSV improves on the strongest baseline in the table (HaloScope, 71.4) by 12.8 AUROC points.
- Using only 32 labeled exemplars, TSV comes within 1.3 points of the fully supervised upper bound (84.2 vs. 85.5).
Cross-Dataset Generalization
The TSV trained on TruthfulQA is evaluated on TriviaQA and HaluEval:
- It remains competitive, demonstrating robust out-of-distribution generalization.
Ablation Study
| Component | AUROC (%) |
|---|---|
| TSV (Initial Training Only) | 79.8 |
| + OT Pseudo-Labeling | 82.5 |
| + Confidence Filtering | 84.2 |
| Without TSV (Direct Embedding) | 71.3 |
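- Each component contributes: the TSV with initial training alone adds 8.5 points over direct embeddings (71.3 to 79.8), OT pseudo-labeling adds 2.7 more, and confidence filtering adds a further 1.7.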
- Best intervention layer: Mid-level layers (around the 16th layer out of 32 layers total).
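The per-component gains can be read directly off the ablation table. The short snippet below reproduces that arithmetic from the reported AUROC values, with the rows ordered from the direct-embedding baseline to the full method (the ordering and variable names are choices made here for illustration).

```python
# AUROC values copied from the ablation table, ordered from baseline to full method.
ablation = [
    ("Without TSV (direct embedding)", 71.3),
    ("TSV (initial training only)", 79.8),
    ("+ OT pseudo-labeling", 82.5),
    ("+ confidence filtering", 84.2),
]

# Gain of each step over the previous row, rounded to one decimal place.
gains = [
    (name, round(score - prev_score, 1))
    for (_, prev_score), (name, score) in zip(ablation, ablation[1:])
]
for name, gain in gains:
    print(f"{name}: +{gain} AUROC points")
```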
Highlights & Insights
- This work is the first to apply steering vectors to hallucination detection (rather than generation mitigation), filling an important gap.
- Optimal transport pseudo-label assignment accounts for class imbalance, outperforming simple thresholding approaches.
- Remarkably low annotation requirement (32 exemplars) achieves performance close to full supervision, offering high practical value.
- TSV can be applied after generation is completed without modifying model parameters.
- Modeling with the von Mises-Fisher distribution is a natural fit for the final-layer (post-RMSNorm) embeddings once they are normalized to unit length, since they then lie on the unit hypersphere.
Limitations & Future Work
- The selection of \(\lambda\) and intervention layer \(l\) requires tuning on a validation set.
- The TSV may need to be retrained for different LLMs.
- Evaluated only on closed-book QA tasks; its effectiveness in open-ended generation scenarios remains unexplored.
- The class distribution prior \(\mathbf{w}\) for pseudo-labeling is derived from the exemplar set, which may not be completely accurate.
- There is currently a lack of deep theoretical explanation for why TSV can effectively separate these spaces.
Rating
⭐⭐⭐⭐⭐: The method is lightweight, elegant, and achieves impressive results. Obtaining performance close to the fully supervised upper bound with only 32 labeled exemplars is of significant importance to the field of hallucination detection.