Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
Conference: ICLR 2026 · arXiv: 2505.16415 · Code: https://github.com/ruizheliUOA/ARC_JSD
Area: RAG / Interpretability / Mechanistic Analysis
Keywords: Context Attribution, Jensen-Shannon Divergence, RAG, Mechanistic Interpretability, Attention Heads, MLP Layers, Hallucination Mitigation
TL;DR
This paper proposes ARC-JSD, a method that computes the Jensen-Shannon Divergence (JSD) between response distributions under full context and sentence-ablated context, enabling efficient and accurate RAG context attribution without fine-tuning, gradient computation, or surrogate models. Combined with Logit Lens for mechanistic analysis, ARC-JSD identifies the attention heads and MLP layers responsible for context attribution, and a JSD-based gating mechanism reduces the hallucination rate by roughly 39% (relative).
Background & Motivation
Background: RAG improves LLM generation accuracy by incorporating external context, yet reliably attributing generated content to specific context passages (context attribution) remains an open challenge.
Limitations of Prior Work:
- Manual annotation is prohibitively expensive (Zeng et al., 2021; Slobodkin et al., 2024)
- Gradient-based methods (MIRAGE) require backpropagation, incurring high computational cost
- ContextCite requires hundreds of forward passes to train a linear surrogate model
- DPO-based fine-tuning approaches (SelfCite) require additional training
Key Challenge: Existing methods struggle to balance attribution accuracy and computational efficiency—they are either accurate but expensive, or fast but imprecise.
Key Insight: Leveraging the mathematical properties of JSD—symmetry, boundedness (\([0, \log 2]\)), and scale-invariance—to directly measure the change in response distribution upon ablating individual context sentences, bypassing surrogate model training.
Core Idea: If removing a context sentence induces the largest shift in the model's output distribution (highest JSD), that sentence is most critical to the generated response.
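For reference, the standard definition behind these properties (not specific to this paper): with the mixture distribution \(M = \frac{1}{2}(P + Q)\),

\[
\text{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\text{KL}(P \,\|\, M) + \tfrac{1}{2}\,\text{KL}(Q \,\|\, M),
\]

which is symmetric in \(P\) and \(Q\) and bounded by \(\log 2\) under natural logarithms, matching the \([0, \log 2]\) range quoted above.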
Method
Overall Architecture
ARC-JSD consists of two modules: (1) JSD-based context attribution (identifying key sentences); and (2) JSD + Logit Lens mechanistic analysis (identifying key attention heads and MLP layers).
Key Designs
- JSD-Driven Context Attribution (§4.1); see the first code sketch after this list
- Function: For each sentence \(c_i\) in the context, compute the JSD between the response distribution under full context and that under the ablated context.
- Core formula: \(\text{JSD}(c_i) = \sum_{j=1}^{|\mathcal{R}|} \text{JSD}(\mathcal{P}_{\text{LM}}(r_j|\mathcal{C},\mathcal{Q}) \| \mathcal{P}_{\text{LM}}(r_j|\mathcal{C}_{\text{ABLATE}}(c_i),\mathcal{Q}))\)
- The sentence with the highest JSD is identified as the most relevant: \(c_{\text{Top-1}} = \arg\max_{c_i} \text{JSD}(c_i)\)
- Design Motivation: Accumulating JSD over response tokens captures locally sensitive tokens (e.g., named entities) without being dominated by high-entropy tokens.
- JSD + Logit Lens Mechanistic Analysis (§5); see the second sketch after this list
- Function: Extends JSD analysis from the model level down to individual attention heads and MLP layers.
- Mechanism: For each attention head \((\ell,h)\) and each MLP layer \(\ell\), intermediate representations are projected into vocabulary space via Logit Lens, and JSD is computed between full-context and ablated-context distributions.
- Key Findings: Attention heads responsible for context attribution are concentrated primarily in higher layers, while MLP layers contribute most in the mid-to-upper layers, partially consistent with findings in Wu et al. (2025a) under the needle-in-a-haystack (NIAH) setting.
- Semantic Gain Validation (§6)
- Function: Validates the components identified by JSD from an independent angle by measuring the cosine similarity gain toward the correct answer.
- Mechanism: \(\Delta^{\ell,\text{Attn}}\) and \(\Delta^{\ell,\text{MLP}}\) are defined to quantify the semantic gain of each attention and MLP layer. Spearman \(\rho\) is computed between JSD rankings and semantic gain rankings; Table 3 shows significant positive correlation, providing mutual validation.
- JSD Gating for Hallucination Reduction (§7); see the third sketch after this list
- Function: Uses JSD scores as a confidence gate to suppress high-JSD attention heads and MLP layers with negative semantic gain.
- Gating formula: \(\text{Mask} = 0.7 + 0.3 \times \text{sigmoid}(G)\); for negative \(G\) the mask drops below 0.85 and approaches the 0.7 floor as \(G \to -\infty\), attenuating (rather than removing) the contribution of the corresponding component.
- Effect: On HotpotQA, the Qwen2-7B-IT hallucination rate drops from 13.4% to 8.2% (a ~39% relative reduction), with negligible change in Factual F1 (76.1 → 75.9).
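First sketch: the attribution loop of §4.1. This is a minimal sketch assuming a Hugging Face causal LM; the plain-string prompt layout and the `response_dists` helper are simplified stand-ins for the paper's chat-templated prompts.

```python
# Minimal ARC-JSD sketch: score each context sentence by the summed JSD
# between teacher-forced response distributions with and without it.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2-1.5B-Instruct"  # one of the paper's evaluated models
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def response_dists(prompt: str, response: str) -> torch.Tensor:
    """Per-token vocab distributions for `response`, teacher-forced after `prompt`."""
    p_ids = tok(prompt, return_tensors="pt").input_ids
    # Tokenize the response separately so its token ids (and length) are
    # identical across the full-context and ablated-context runs.
    r_ids = tok(response, add_special_tokens=False, return_tensors="pt").input_ids
    logits = model(torch.cat([p_ids, r_ids], dim=1)).logits
    start = p_ids.shape[1] - 1            # logits at position t predict token t+1
    return F.softmax(logits[0, start:start + r_ids.shape[1]], dim=-1)

def jsd(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Elementwise JSD over the last (vocab) dimension; bounded by log 2."""
    m = 0.5 * (p + q)
    return 0.5 * (p * (p / m).log()).sum(-1) + 0.5 * (q * (q / m).log()).sum(-1)

def arc_jsd_scores(sentences: list[str], query: str, response: str) -> list[float]:
    prompt = " ".join(sentences) + "\n\nQuestion: " + query + "\nAnswer:"
    p_full = response_dists(prompt, response)
    scores = []
    for i in range(len(sentences)):
        ablated = " ".join(s for j, s in enumerate(sentences) if j != i)
        p_abl = response_dists(ablated + "\n\nQuestion: " + query + "\nAnswer:", response)
        scores.append(jsd(p_full, p_abl).sum().item())  # accumulate over response tokens
    return scores  # Top-1 attribution = argmax over these scores
```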
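Second sketch: a layer-level simplification of the §5 mechanistic analysis, reusing `model`, `tok`, and `jsd` from the sketch above and assuming a Llama/Qwen-style module layout (`model.model.norm`, `model.lm_head`). The paper additionally decomposes attention into individual heads via their output projections; that requires hooks and is omitted here.

```python
# Logit-Lens JSD per layer: project each layer's hidden state through the
# final norm and unembedding, then compare full vs. ablated distributions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def logit_lens_dists(prompt: str, pos: int = -1) -> torch.Tensor:
    """Vocab distribution at token position `pos` after every transformer layer."""
    ids = tok(prompt, return_tensors="pt").input_ids
    hs = model(ids, output_hidden_states=True).hidden_states[1:]  # skip embeddings
    logits = torch.stack([model.lm_head(model.model.norm(h[0, pos])) for h in hs])
    return F.softmax(logits, dim=-1)      # shape: (num_layers, vocab)

def per_layer_jsd(full_prompt: str, ablated_prompt: str) -> torch.Tensor:
    # Peaks in this (num_layers,) profile localize the layers most involved in
    # attribution; the paper finds them concentrated in mid-to-upper layers.
    return jsd(logit_lens_dists(full_prompt), logit_lens_dists(ablated_prompt))
```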
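Third sketch: the §7 gating rule is a soft mask rather than a hard prune. A tiny illustration, where the per-component gain \(G\) stands in for the paper's semantic-gain score:

```python
# JSD gating sketch: components (attention heads / MLP layers) with negative
# semantic gain G are scaled toward a 0.7 floor instead of being zeroed out.
import torch

def gate_mask(G: torch.Tensor) -> torch.Tensor:
    return 0.7 + 0.3 * torch.sigmoid(G)   # always in (0.7, 1.0)

print(gate_mask(torch.tensor([-4.0, -1.0, 0.0, 2.0])))
# tensor([0.7054, 0.7807, 0.8500, 0.9642])
```

The mask would multiply the gated component's output; exactly where it is applied per head or MLP layer is an implementation detail of the paper.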
Computational Efficiency
- ARC-JSD FLOPs: \(2PT|\mathcal{C}|^2\) (\(P\): parameter count, \(T\): tokens per sentence, \(|\mathcal{C}|\): number of sentences)
- ContextCite (256 calls) FLOPs: \(2PT \times 256^2\); ARC-JSD is cheaper when \(|\mathcal{C}| < 256\)
- MIRAGE requires gradient computation with FLOPs of \(4PT|\mathcal{C}|(2|\mathcal{C}|+1)\)
- Empirically, this translates to approximately a 3× speedup in practice (see the back-of-envelope ratio below)
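Plugging numbers into the formulas above: \(P\) and \(T\) cancel in the ratio, so only the sentence count matters. The \(|\mathcal{C}|\) values here are illustrative, not from the paper:

```python
# FLOPs ratio of ARC-JSD (2*P*T*|C|^2) to ContextCite-256 (2*P*T*256^2).
for C in (32, 94, 256, 512):
    print(f"|C| = {C:3d}: ratio = {C**2 / 256**2:.3f}")
# |C| =  32: 0.016  -> ARC-JSD ~64x cheaper
# |C| =  94: 0.135  (roughly MuSiQue's 93.6-sentence average)
# |C| = 256: 1.000  -> break-even
# |C| = 512: 4.000  -> ContextCite-256 cheaper
```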
Key Experimental Results
Datasets & Models
- Three QA datasets: TyDi QA (440 examples, single-hop), HotpotQA (1,000 examples, multi-hop), MuSiQue (1,000 examples, multi-hop, avg. 93.6 context sentences)
- Four instruction-tuned models: Qwen2-1.5B/7B-IT, Gemma2-2B/9B-IT
- Additional generalization evaluation: LLaMA-3.1-8B-IT, Qwen3-Next-80B-A3B-IT
Main Results (Context Attribution Top-1 Accuracy)
- ARC-JSD consistently dominates all baselines on the compute-accuracy trade-off on MuSiQue (Fig. 2a)
- Average attribution accuracy improves by approximately 10.7%
- ContextCite-32 is faster than ARC-JSD when \(|\mathcal{C}|>32\), but consistently achieves lower attribution accuracy
- ARC-JSD lies on the Pareto optimal front, balancing accuracy and efficiency
Metric Comparison Ablation (§8, Fig. 6)
| Metric | Relative Performance |
|---|---|
| JSD | Best; symmetric, bounded, scale-invariant |
| KL | Diverges when the ablated distribution assigns (near-)zero probability to a token; not comparable across layers (see the toy example below the table) |
| TV | Bounded but too coarse; cannot distinguish high-entropy tail shifts from key-token probability transfers |
| Wasserstein | Requires defining distances over a 152K vocabulary; \(O(V^3)\) complexity |
| MMD | Requires kernel function and token distance definition |
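To make the JSD and KL rows concrete, a toy three-token example (hypothetical numbers) where ablation collapses the probability of the token the full-context model favors:

```python
# KL blows up when the ablated distribution q assigns ~zero mass to a token
# that the full-context distribution p favors; JSD stays below log 2.
import torch

p = torch.tensor([0.98, 0.01, 0.01])   # full context: confident in token 0
q = torch.tensor([1e-9, 0.50, 0.50])   # ablated: token 0 collapses

kl = (p * (p / q).log()).sum()
m = 0.5 * (p + q)
jsd = 0.5 * (p * (p / m).log()).sum() + 0.5 * (q * (q / m).log()).sum()
print(kl.item(), jsd.item())  # ~20.2 (unbounded) vs ~0.64 (< log 2 ≈ 0.693)
```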
Mechanistic Analysis Validation
- Table 3: Spearman \(\rho\) between JSD rankings and semantic gain rankings is significant (\(p<0.05\) or \(p<0.01\)) across all datasets and models
- Table 5: The JSD change when ablating the top-10 JSD attention heads (2.23±0.12) is significantly larger than when ablating 10 random heads (1.53±0.76)
Hallucination Reduction (Table 4)
| Setting | Hallucination Rate | Factual F1 |
|---|---|---|
| Base RAG | 13.4% | 76.1 |
| Gate Top-5 Attn & MLP | 8.2% | 75.9 |
| Gate Random 5 | 12.7% | 69.4 |
Generalizability
- The compute-accuracy advantage is maintained on LLaMA-3.1-8B-IT and Qwen3-Next-80B-A3B-IT (MoE) (Fig. 7)
Highlights & Insights
- Simplicity and Efficiency: The method is conceptually straightforward—sentence-level ablation + JSD comparison—requiring no auxiliary model training, and can be integrated into any RAG system in a plug-and-play manner.
- Theoretically Grounded Choice of JSD: Symmetry avoids directionality issues; boundedness enables meaningful cross-layer comparison; the ablation comparisons against KL/TV/Wasserstein are convincing.
- Closed-Loop Mechanistic Analysis: JSD localization → semantic gain validation → causal ablation verification → gating application constitutes a complete validation and application pipeline.
- Visualization of MLP-Layer Semantic Evolution: Logit Lens visualizations reveal how Qwen2 progressively transitions from Chinese tokens to English in upper layers (e.g., "一只" ("a/an") → "A", "翅膀" ("wings") → "wings"), consistent with the language-anchoring phenomenon.
- Practical Value: The gating mechanism reduces hallucination rates by 39% without any retraining.
Limitations & Future Work
- Quadratic Complexity in Context Length: The \(O(|\mathcal{C}|^2)\) complexity remains expensive for very long contexts (e.g., hundreds of sentences); the paper does not discuss how to scale the approach.
- Only Top-1 Attribution is Evaluated: Existing QA datasets provide only sentence-level gold labels; finer-grained attribution (phrase-level or clause-level) remains underexplored.
- Limited Scale of Gating Experiments: Hallucination reduction is validated on only 200 HotpotQA samples; large-scale and multi-dataset validation is absent.
- No Direct Accuracy Comparison with Fine-Tuning Methods such as SelfCite: The comparison is limited to the compute-accuracy trade-off plot.
- Threshold selection: the 0.02-bit cutoff used to decide that "all JSD scores are small" lacks systematic analysis.
Related Work & Insights
- vs. ContextCite (Cohen-Wang et al., 2024): ContextCite requires hundreds of forward passes to train a surrogate model, and the linearity assumption may miss non-linear dependencies; ARC-JSD directly quantifies true distributional change via JSD.
- vs. MIRAGE (Qi et al., 2024): Gradient-based attribution is computationally expensive, and aggregating token-level signals to the sentence level introduces information loss.
- vs. Wu et al. (2025a): Their NIAH setting evaluates copy-paste behavior, whereas this paper targets the more realistic scenario of paraphrasing and information integration.
- vs. Sun et al. (2025): Sun et al. focus on localizing sources of hallucination, while this paper localizes sources responsible for correct generation; the two are complementary.
Rating
- Novelty: ⭐⭐⭐⭐ Applying JSD to RAG context attribution is novel and well-motivated, with strong integration of theory and practice.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets, four primary models plus two additional generalization models, and multi-angle ablation and validation with consistent results.
- Writing Quality: ⭐⭐⭐⭐ Architecture diagrams are clear, mathematical derivations are coherent, and case studies are intuitive.
- Value: ⭐⭐⭐⭐ The plug-and-play attribution method combined with mechanistic insights meaningfully advances RAG transparency.