
Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders

Conference: ICLR 2026 arXiv: 2512.08892 Code: GitHub Area: RAG / Interpretable AI Keywords: Retrieval-Augmented Generation, Sparse Autoencoders, Hallucination Detection, Interpretability, Faithfulness

TL;DR

This paper proposes RAGLens, which leverages sparse autoencoders (SAEs) to disentangle RAG-hallucination-specific features from LLM internal activations, and constructs a lightweight, interpretable hallucination detector via mutual information-based feature selection combined with a Generalized Additive Model (GAM). RAGLens surpasses existing methods across multiple benchmarks and supports token-level interpretable feedback and hallucination mitigation.

Background & Motivation

Core Problem of RAG: Retrieval-Augmented Generation (RAG) enhances LLM factuality through externally retrieved documents, yet models still produce hallucinated outputs that contradict retrieved content, fabricate details, or exceed the scope of evidence. Such unfaithful generation severely limits deployment in high-reliability domains such as medicine and law.

Limitations of Prior Work:

  • Training dedicated detectors: Requires large-scale, high-quality annotated data with high adaptation costs.
  • LLM-as-Judge: Employs external LLMs to evaluate faithfulness, but incurs high computational overhead, struggles to detect hallucinations in self-generated content, and produces explanations that do not faithfully reflect internal decision processes.
  • Internal representation probing: Exploits hidden states or attention scores to capture hallucination signals, but the polysemanticity of neurons complicates signal extraction and limits detection accuracy.

Key Insight: SAEs from the mechanistic interpretability literature can decompose LLM hidden states into monosemantic features—each feature corresponding to a specific semantic concept. This raises the question: do SAE features exist that activate specifically during RAG hallucinations? If so, can they be used to build detectors that are both accurate and interpretable?

RAG Hallucination vs. General Hallucination: Although prior work has applied SAEs to general LLM hallucination detection, the RAG setting involves complex interactions between retrieved evidence and generated content, yielding distinct hallucination patterns. Whether SAE features can capture this dynamic remains unclear.

Method

Overall Architecture: The RAGLens Pipeline

The RAGLens pipeline proceeds as follows: freeze the LLM → extract SAE features → mutual-information feature selection → GAM classification → interpretable detection output.

Step 1: SAE Feature Extraction

For each token \(y_t\) generated by the LLM, the hidden state at layer \(L\) is obtained as \(h_t = \Phi_L(y_{1:t}, q, \mathcal{C})\), where \(q\) is the query and \(\mathcal{C}\) the retrieved context. The hidden state is then encoded by a pretrained SAE encoder to yield a sparse feature vector:

\[z_t = \mathcal{E}(h_t), \quad z_t \in \mathbb{R}^K\]

where \(K\) is the dictionary size and only a small number of features are activated at each position.
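A minimal PyTorch sketch of Step 1, assuming a standard open-SAE encoder of the form \(z = \mathrm{ReLU}(W_{\mathrm{enc}}(h - b_{\mathrm{dec}}) + b_{\mathrm{enc}})\) whose weights are loaded separately; the model id, layer index, and the `SAEEncoder` wrapper are illustrative placeholders, not the paper's actual code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical placeholders: model id and layer index are illustrative.
MODEL_ID = "meta-llama/Llama-3.2-1B"
LAYER = 8  # a middle layer; the paper finds middle layers most informative

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

class SAEEncoder(torch.nn.Module):
    """Standard SAE encoder: z = ReLU(W_enc (h - b_dec) + b_enc)."""
    def __init__(self, W_enc, b_enc, b_dec):
        super().__init__()
        self.W_enc, self.b_enc, self.b_dec = W_enc, b_enc, b_dec

    def forward(self, h, pre_activation=False):
        pre = (h - self.b_dec) @ self.W_enc + self.b_enc   # (T, K)
        return pre if pre_activation else torch.relu(pre)

@torch.no_grad()
def sae_features(prompt_with_context, answer, sae):
    """Step 1: encode hidden states of the generated tokens into sparse SAE features."""
    ids = tok(prompt_with_context + answer, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    h = out.hidden_states[LAYER][0]                        # (seq_len, d_model)
    n_gen = len(tok(answer, add_special_tokens=False)["input_ids"])
    return sae(h[-n_gen:])                                 # (T, K) features of generated tokens
```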

Step 2: Instance-Level Feature Aggregation

Since labels are instance-level, token-level activations are aggregated via channel-wise max pooling:

\[\bar{z}_k = \max_{1 \leq t \leq T} z_{t,k}, \quad k = 1, \ldots, K\]

The paper provides a theoretical justification for max pooling under sparse activation conditions (Theorem 1): when \(Tp \ll 1\), where \(p\) denotes a feature's per-token activation probability, the mutual information between pooled features and labels grows linearly with sequence length \(T\), effectively amplifying signal while suppressing noise.
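As a rough sketch of the intuition behind Theorem 1 (not the paper's proof), assume feature \(k\) fires independently at each generated token with a label-dependent rate \(p_\ell \ll 1/T\). Then

\[P(\bar{z}_k > 0 \mid \ell) = 1 - (1 - p_\ell)^T \approx T p_\ell,\]

so pooling scales the class-conditional firing rates by roughly \(T\) while leaving their ratio unchanged; because the mutual information of a rarely-firing binary feature with the label is, to first order, proportional to its firing rate, \(I(\bar{z}_k; \ell)\) grows approximately linearly in \(T\).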

Step 3: Mutual Information Feature Selection

The mutual information \(I(\bar{z}_k; \ell)\) between each pooled feature \(\bar{z}_k\) and the hallucination label \(\ell\) is computed, and the top-\(K'\) features (\(K' \ll K\)) are selected, yielding a sub-vector \(\tilde{\bar{z}} \in \mathbb{R}^{K'}\). Mutual information is estimated in practice using a binning approach.
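A minimal sketch of Steps 2–3 on top of the per-token features from the Step 1 sketch; the bin count and number of retained features are illustrative choices, not the paper's reported hyperparameters.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def pool_instance(z_tokens):
    """Step 2: channel-wise max pooling over the T generated tokens, (T, K) -> (K,)."""
    return z_tokens.max(axis=0)

def select_features(Z_pooled, labels, top_k=64, n_bins=16):
    """Step 3: rank pooled SAE features by binned mutual information with the label."""
    K = Z_pooled.shape[1]
    mi = np.zeros(K)
    for k in range(K):
        col = Z_pooled[:, k]
        # Histogram binning of the (mostly zero) pooled activations.
        edges = np.histogram_bin_edges(col, bins=n_bins)
        binned = np.digitize(col, edges[1:-1])
        mi[k] = mutual_info_score(labels, binned)
    top = np.argsort(mi)[::-1][:top_k]
    return top, mi

# Z_list: list of (T_i, K) arrays from Step 1; labels: 0 = faithful, 1 = hallucinated
# Z_pooled = np.stack([pool_instance(z) for z in Z_list])
# selected, mi_scores = select_features(Z_pooled, np.array(labels))
```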

Step 4: GAM Classification

A GAM models the hallucination probability as:

\[g(\mathbb{E}[\ell | \tilde{\bar{z}}]) = \beta_0 + \sum_{j=1}^{K'} f_j(\tilde{\bar{z}}_j)\]

where each univariate shape function \(f_j\) is learned via bagged gradient boosting. The additive structure of GAM guarantees interpretability—the contribution of each feature to the prediction can be directly visualized.
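A GAM with bagged gradient-boosted shape functions matches the Explainable Boosting Machine from InterpretML (Nori et al. 2019, listed in the related work below); a minimal sketch with interactions disabled to keep the model purely additive. The data here are random placeholders, and the hyperparameters are assumptions rather than the paper's settings.

```python
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier

# In RAGLens, X would be Z_pooled[:, selected] from Step 3 and y the hallucination labels;
# random placeholders are used here only so the sketch executes.
rng = np.random.default_rng(0)
X = rng.exponential(scale=0.1, size=(512, 64))   # (n_instances, K') pooled activations
y = rng.integers(0, 2, size=512)                 # 1 = hallucinated, 0 = faithful

ebm = ExplainableBoostingClassifier(
    interactions=0,   # pure GAM: g(E[l | x]) = b0 + sum_j f_j(x_j), no pairwise terms
    outer_bags=8,     # bagging over the boosted shape functions
)
ebm.fit(X, y)
p_halluc = ebm.predict_proba(X)[:, 1]            # per-instance hallucination probability
```

Setting `interactions=0` keeps the model a pure GAM, so every prediction decomposes into one shape-function contribution per feature, which is what the local and global explanations below rely on.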

Key Design 1: Middle-Layer SAE Features Are Most Informative

Experiments across all layers of Llama3.2-1B, Llama3-8B, Qwen3-0.6B, and Qwen3-4B reveal that:

  • Summarization and QA tasks: Performance peaks at middle layers.
  • Data2txt task: Performance is relatively flat across layers.
  • Conclusion: Middle-layer SAE features encode the richest hallucination-related signals; shallow layers lack sufficient information, while deep layers may have signals overwritten by subsequent transformations.

Key Design 2: GAM Outperforms MLP, XGBoost, and Other Complex Models

Comparing Logistic Regression (LR), GAM, MLP, and XGBoost as classifiers:

  • GAM consistently outperforms LR, as the mapping from individual features to outputs is nonlinear.
  • GAM also surpasses MLP and XGBoost, as SAE features are nearly independent and the additive assumption holds.
  • GAM additionally provides interpretability, representing the optimal balance between performance and transparency.

Key Design 3: Pre-Activation Features Outperform Post-Activation Features

Comparing SAE and Transcoder as feature extractors, with signals taken before and after the activation function:

  • Pre-activation features consistently outperform post-activation features across all three datasets.
  • SAE and Transcoder yield comparable performance, with no clear winner.
  • Conclusion: The position of the activation function is more critical than the choice of architecture (see the extraction sketch below).
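A small sketch of the pre- vs. post-activation choice, reusing the illustrative `SAEEncoder` from the Step 1 sketch; variable names are assumptions, not the paper's code.

```python
# h: (T, d_model) hidden states of the generated tokens; sae: SAEEncoder from the Step 1 sketch.
z_pre  = sae(h, pre_activation=True)    # W_enc (h - b_dec) + b_enc, before the nonlinearity
z_post = sae(h, pre_activation=False)   # relu(...), the usual sparse codes
# Key Design 3: building the detector on z_pre is reported to work consistently better.
```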

Interpretability and Hallucination Mitigation

Local Explanation: The additive structure of GAM allows each prediction to be decomposed into per-feature contributions. Aligning activations to token positions yields token-level feedback that precisely identifies unreliable text spans (e.g., fabricated numbers, dates, or entity names).
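If the GAM is an EBM as sketched above, per-prediction additive contributions can be read off directly; the token attribution helper below is an illustrative approximation of the alignment step, not the paper's procedure.

```python
# Per-instance additive contributions f_j(x_j) behind one prediction (EBM local explanation).
local = ebm.explain_local(X[:1], y[:1]).data(0)
contribs = dict(zip(local["names"], local["scores"]))   # feature name -> additive score

# Illustrative token attribution: map a flagged feature back to the tokens in the
# generated answer where it activated most strongly (z_tokens: (T, K) numpy array).
def top_tokens_for_feature(z_tokens, feature_idx, tokens, n=3):
    order = z_tokens[:, feature_idx].argsort()[::-1][:n]
    return [tokens[int(t)] for t in order]
```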

Global Explanation: Each SAE feature corresponds to a semantically coherent concept (e.g., Feature 22790 = "unsupported numerical/temporal details"; Feature 17721 = "well-documented high-salience tokens"). The GAM shape functions reveal stable mappings from feature values to hallucination risk.

Mitigation Strategy: Detection results are fed back to the LLM as instance-level warnings or token-level highlights to guide correction of hallucinated content. Token-level feedback proves more effective than instance-level feedback.
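A hypothetical sketch of how token-level feedback could be turned into a correction prompt; the wording and helper function are illustrative and not the paper's prompts.

```python
def build_correction_prompt(question, context, answer, flagged_spans):
    """Ask the generator to revise its answer, highlighting detector-flagged spans."""
    highlighted = answer
    for span in flagged_spans:                     # e.g. ["in 1987", "37,000 units"]
        highlighted = highlighted.replace(span, f"[UNSUPPORTED: {span}]")
    return (
        f"Question: {question}\n\nRetrieved context:\n{context}\n\n"
        f"Your previous answer (spans marked [UNSUPPORTED: ...] may not be grounded "
        f"in the context):\n{highlighted}\n\n"
        "Rewrite the answer using only information supported by the context."
    )
```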

Experiments

Experimental Setup

  • Datasets: RAGTruth (multi-task: summarization / QA / data-to-text), Dolly (Accurate Context), AggreFact, TofuEval
  • Models: Llama2-7B/13B, Llama3.2-1B, Llama3.1-8B, Qwen3-0.6B/4B
  • Metrics: Balanced Accuracy (Acc), Macro F1, AUC
  • Baselines: 16 methods including Prompt, LLM-as-Judge (ChainPoll / RAGAS / TruLens / RefCheck), uncertainty-based methods (SelfCheckGPT / Perplexity / EigenScore), and internal representation methods (SEP / SAPLMA / ITI / Focus / ReDeEP)

Table 1: Main Detection Performance Comparison (RAGTruth & Dolly)

| Method | RAGTruth-7B AUC | RAGTruth-7B F1 | Dolly-7B AUC | Dolly-7B F1 | RAGTruth-13B AUC | Dolly-13B AUC |
|---|---|---|---|---|---|---|
| ChainPoll | 0.6738 | 0.7006 | 0.6593 | 0.5581 | 0.7414 | 0.7070 |
| RAGAS | 0.7290 | 0.6667 | 0.6648 | 0.6392 | 0.7541 | 0.6412 |
| ReDeEP | 0.7458 | 0.7190 | 0.7949 | 0.7833 | 0.8244 | 0.8420 |
| RAGLens | 0.8413 | 0.7636 | 0.8764 | 0.8070 | 0.8964 | 0.8568 |

RAGLens comprehensively outperforms all baselines across all settings, achieving AUC ≥ 0.84 in every configuration.

Table 2: Cross-Dataset / Cross-Task Generalization (AUC)

| Train → Test | RAGTruth | AggreFact | TofuEval |
|---|---|---|---|
| None (CoT) | 0.4842 | 0.5741 | 0.5562 |
| RAGTruth | 0.8806 | 0.8019 | 0.7637 |
| AggreFact | 0.5330 | 0.8330 | 0.6123 |
| TofuEval | 0.7747 | 0.6161 | 0.7846 |

Detectors trained on the high-diversity RAGTruth dataset generalize substantially better across domains than those trained on single-task datasets.

Table 3: Hallucination Mitigation Results

| Evaluator | Original Hallucination Rate | + Instance-Level Feedback | + Token-Level Feedback |
|---|---|---|---|
| Llama3.3-70B | 43.78% | 42.22% | 39.11% |
| GPT-4o | 37.78% | 36.44% | 34.22% |
| GPT-o3 | 64.44% | 60.44% | 58.88% |
| Human Annotation | 71.11% | 62.22% | 55.56% |

Token-level feedback (which highlights suspicious tokens via interpretability) is more effective than instance-level feedback under all evaluators, reducing hallucination rate from 71.11% to 55.56% in human evaluation.

Key Findings

  1. LLMs "know more than they say": SAE features reveal latent faithfulness signals that CoT reasoning cannot consistently capture; cross-model experiments show that SAE-based detectors consistently outperform models' own CoT judgments.
  2. Model scale affects internal knowledge quality: Larger LLMs yield higher detection performance through SAE-based detectors; Qwen3-0.6B, despite reasonable CoT performance, lags behind larger models in SAE detection, indicating that internal knowledge correlates more strongly with model scale than training procedure.
  3. Specific SAE features carry well-defined semantics: Feature 22790 corresponds to "numerical/temporal details without contextual support," with hallucination probability monotonically increasing as activation strength rises; Feature 17721 corresponds to "well-documented high-salience tokens" and is negatively correlated with hallucination.
  4. Cross-domain generalization depends on training data diversity: Detectors trained on the multi-task RAGTruth dataset generalize best across domains; transfer from summarization to QA outperforms transfer from data-to-text to other tasks.
  5. Max pooling has theoretical guarantees: Under sparse activation conditions, mutual information after max pooling grows linearly with sequence length \(T\), effectively amplifying weak hallucination signals.

Highlights & Insights

  • First systematic study validating SAEs for RAG hallucination detection: Fills the gap of applying SAEs to RAG-specific hallucination scenarios and proposes a complete detect–explain–mitigate pipeline.
  • Lightweight and interpretable: Requires only a small number of SAE features and a simple GAM classifier—no LLM fine-tuning or external LLM calls—while providing token-level attribution and feature-level global explanations.
  • Dual theoretical and empirical support: The information-theoretic proof for max pooling (Theorem 1) and extensive ablation studies (layer selection, feature count, classifier comparison, extractor comparison) ground all design choices rigorously.
  • Cross-model flexibility: Although SAE features do not transfer across models, the RAGLens detector can be flexibly applied to text generated by other LLMs, making it broadly practical.
  • Counterfactual validation: Counterfactual perturbations of retrieved documents verify that the selected SAE features are genuinely sensitive to hallucination patterns specific to the RAG setting.

Limitations & Future Work

  1. Dependence on pretrained SAE availability: The method requires open-source SAE weights for the target LLM (e.g., Gemma Scope, EleutherAI SAE) and is not applicable to closed-source models.
  2. Instance-level hallucination labels: The current approach cannot distinguish which specific claims within an instance are hallucinated; token-level attribution is approximate and relies on heuristic alignment.
  3. Limited mitigation effectiveness: Although token-level feedback outperforms instance-level feedback, hallucination rates remain relatively high (55.56% in human evaluation), indicating an upper bound on post-hoc mitigation using detection signals alone.
  4. Generalization depends on training distribution: Detectors trained on single-task datasets exhibit notable cross-domain performance degradation; diverse training data is required for practical deployment.
  5. Computational overhead not systematically reported: Despite the lightweight claim, the end-to-end cost and latency of SAE encoding, MI computation, and GAM training are not benchmarked.

Related Work

  • RAG Hallucination Detection: Manakul et al. 2023 (SelfCheckGPT), Bao et al. 2024 (HHEM), Sun et al. 2025 (ReDeEP), Li et al. 2024 (LLM-as-Judge series)
  • SAEs and Interpretability: Bricken et al. 2023 (dictionary learning and monosemanticity), Huben et al. 2023, Shu et al. 2025; applications to hallucination detection: Ferrando et al. 2025, Suresh et al. 2025
  • Generalized Additive Models (GAM): Lou et al. 2012, Nori et al. 2019 (InterpretML / EBM)
  • Internal Representation Probing: Azaria & Mitchell 2023 (SAPLMA), Han et al. 2024, Zhou et al. 2025

Rating

  • Novelty: ⭐⭐⭐⭐ First systematic application of SAEs to RAG hallucination detection with a complete pipeline
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 6 models × 4 datasets × 16 baselines + comprehensive ablations + cross-model/cross-domain experiments + interpretability case studies + mitigation experiments
  • Writing Quality: ⭐⭐⭐⭐ Clear structure and rigorous theoretical presentation, though some experimental details are deferred to the appendix
  • Value: ⭐⭐⭐⭐ A lightweight, interpretable hallucination detection solution with direct practical value for RAG system reliability
  • Overall: ⭐⭐⭐⭐ A solid contribution at the intersection of interpretable AI and RAG, with comprehensive experiments and a novel methodology