Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs
Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=XsEZcigEjq
Code: https://github.com/VishalPramanik/HETA
Area: Interpretability / Token Attribution
Keywords: token attribution, decoder-only LLM, Hessian, second-order sensitivity, KL divergence, faithfulness
TL;DR
HETA integrates "causal semantic flow gating + Hessian second-order curvature sensitivity + KL information loss" into a unified token attribution score. Designed specifically for decoder-only autoregressive LLMs, it outperforms existing attribution methods by a wide margin on faithfulness and is more robust to decoding hyperparameters and syntactic paraphrasing.
Background & Motivation
- Background: Classical feature attribution methods such as LIME, KernelSHAP, Integrated Gradients, Grad-CAM, and LRP are mostly designed for encoder architectures. They rely on local linear or first-order derivative approximations, assuming the model is approximately linear near the input.
- Limitations of Prior Work: This linear assumption breaks down in autoregressive LLMs, where token interactions are highly nonlinear and strongly context-dependent. Attention weights reflect only "where the model looks" rather than "what truly influences the output": they can be perturbed without altering predictions, and aggregating them across layers ignores multi-hop indirect paths through residual connections and MLPs. First-order gradients miss influence entirely in regions where the gradient vanishes but the function remains sensitive to finite perturbations.
- Key Challenge: Encoder models use bidirectional attention and require only a single attribution map; decoder-only models generate tokens autoregressively, requiring conditional attribution at each target position. Directly porting encoder-era methods is "non-trivial and often unfaithful," while recent generative methods like ContextCite, TDD, or Peering have downsides (sparse linear proxies are sensitive to redundancy and restricted to sentence-level; logit lens confuses correlation with causality; representation matching is only effective for verbatim segments).
- Goal: Construct a token-level attribution framework that respects the causal and contextual structure of generative models while remaining stable across decoding settings and syntactic paraphrasing.
- Core Idea: [Trilateral Integration of Second-Order + Causality + Information Theory] Use causal rollout of attention-value flow as gating to ensure "target-oriented causal directionality," use Hessian-vector products to capture curvature sensitivity missed by first-order gradients, and use KL divergence to measure the information loss in the output distribution when a token is masked. The unified attribution score multiplies the gate by a weighted sum of the curvature and information terms.
Method
Overall Architecture
HETA decomposes the "contribution of token \(x_i\) to the predicted target token \(x_T\)" into three complementary components: semantic transfer influence (causal gating), Hessian second-order sensitivity, and KL information contribution. The final score treats gating as a causal mask and the curvature/information terms as internal intensities, resulting in target-conditioned, curvature-aware token-level explanations.
```mermaid
flowchart LR
    X[Input Token Embedding X] --> A[Attention-Value Flow Rollout<br/>Causal Gating M_T]
    X --> B[Hessian-Vector Product<br/>Second-order Sensitivity S_i]
    X --> C[Token Masking<br/>KL Information Loss I_i]
    A --> F["Final Attribution: Attr = M_T·(βS + γI)"]
    B --> F
    C --> F
    F --> O[Target-conditioned Token-level Attribution]
```
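The fusion step at the end of the pipeline can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `simplex_normalize` and `heta_scores` and all toy numbers are hypothetical, and the final node is read as the gate multiplying the weighted sum \(\beta S + \gamma I\).

```python
from typing import List, Sequence


def simplex_normalize(raw: Sequence[float]) -> List[float]:
    """Scale non-negative per-token flow scores so they sum to 1 (the gate M_T)."""
    if any(v < 0 for v in raw):
        raise ValueError("flow scores must be non-negative")
    total = sum(raw)
    if total == 0:
        raise ValueError("at least one token needs a causal path to the target")
    return [v / total for v in raw]


def heta_scores(gate: Sequence[float],
                curvature: Sequence[float],
                info: Sequence[float],
                beta: float = 0.5,
                gamma: float = 0.5) -> List[float]:
    """Attr_i = M_T[i] * (beta * S_i + gamma * I_i) for every input token i."""
    if not (len(gate) == len(curvature) == len(info)):
        raise ValueError("all three signals need one entry per token")
    return [m * (beta * s + gamma * k) for m, s, k in zip(gate, curvature, info)]


# Hypothetical toy values for three input tokens.
raw_flow = [0.0, 3.0, 1.0]    # token 0 has no causal path to the target
curvature = [5.0, 2.0, 4.0]   # stand-in for Hessian sensitivity S_i
info = [1.0, 2.0, 0.0]        # stand-in for KL information I_i

gate = simplex_normalize(raw_flow)
scores = heta_scores(gate, curvature, info)
print(gate)    # [0.0, 0.75, 0.25]
print(scores)  # [0.0, 1.5, 0.5]
```

Token 0 has the largest curvature term but receives zero attribution because its gate is zero, which is the "causal mask" behaviour described above.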
Key Designs
1. Semantic Flow Gating: Causal rollout ensuring "target-oriented" directionality. To prevent posterior attention aggregation from misassigning importance to tokens with no causal connection to the output, HETA tracks attention-value flows that terminate only at position \(T\) under the decoder's causal mask. For each layer \(l\) and head \(h\), target-conditioned rollout \(\Phi^{(l,h)}(i\to T)\) is computed using the masked attention matrix \(A^{(l,h)}\), value vectors \(V^{(l,h)}\), and output projection \(W_O^{(l,h)}\). Only paths "ending at \(T\)" are aggregated to obtain semantic transfer influence \(M_T[i]=\frac{1}{Z}\sum_{l,h}\Phi^{(l,h)}(i\to T)\lVert V_i^{(l,h)}W_O^{(l,h)}\rVert_1\). It is simplex-normalized (\(\sum_i M_T[i]=1\)), assigning mass only to tokens with a causal path to \(T\). Compared to pure attention, this encodes both "alignment" (attention) and "semantic intensity" (norm of value vectors).
2. Hessian Second-Order Sensitivity: Discovering influence where gradients vanish. The core motivation is the second-order Taylor expansion \(f(x)=f(x_0)+\nabla f(x_0)^\top(x-x_0)+\tfrac12(x-x_0)^\top\nabla^2 f(\xi)(x-x_0)\). When first-order gradients are near zero in saturation regions (e.g., \(w^\top x+b\ll0\) in \(\log(1+e^{w^\top x+b})\)), function changes are driven entirely by curvature encoded in the Hessian; first-order methods systematically underestimate or miss this influence. HETA takes the Hessian of the target log-probability with respect to embeddings: \(H_T=\nabla^2_X\log P_\theta(x_T\mid x_{<T})\). Since explicitly constructing the \((Td)\times(Td)\) Hessian is infeasible, it uses the Hutchinson estimator combined with the Pearlmutter trick for Hessian-vector products (HVP) to estimate the sensitivity of each token block: \(S_i^{(T)}\approx\frac1m\sum_k\lVert\Pi_i H_T(\Pi_i r_k)\rVert_1\) (where \(r_k\) is a Rademacher vector restricted to the \(i\)-th block). Gauss-Newton/Fisher proxies can optionally enhance numerical stability. Operationally, it defaults to rank-64 block low-rank + 512-token window with 50% overlap (HETA-LR+WIN). For a 70B model, second-order terms are computed only for the last six layers, significantly reducing costs with minimal quality loss.
3. KL Information Contribution: Quantifying output distribution shift via mask perturbation. To provide a probabilistic interpretation of contribution, HETA masks each \(x_i\) and compares the original target distribution \(P_{\text{orig}}\) with the masked distribution \(P^{(i)}_{\text{masked}}\) via KL divergence: \(I(x_i\to x_T)=D_{KL}(P_{\text{orig}}\,\Vert\,P^{(i)}_{\text{masked}})\). This directly measures how far the predictive distribution over the target shifts when the token is removed.
4. Unified Attribution Score: Gating × (Curvature + Information). The three are fused as \(\mathrm{Attr}(x_i\to x_T)=M_T[i]\big(\beta S_i^{(T)}+\gamma I(x_i\to x_T)\big)\), where the gating \(M_T[i]\) restricts attribution to tokens on causal paths and redistributes mass along them. \(\beta,\gamma\ge0\) weight curvature sensitivity and information contribution (default 0.5 each). The resulting score is simultaneously causally grounded, curvature-aware, and robust for generative decoder-only models.
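The curvature and information terms in the list above can be illustrated on a toy problem. The sketch below makes two loud assumptions: it replaces the Pearlmutter trick (exact autodiff Hessian-vector products) with a central finite difference of a hand-written gradient, and it uses a toy quadratic "log-probability" rather than a real language model. All names (`hvp_finite_diff`, `block_sensitivity`, `kl_divergence`, `toy_grad`) and all numbers are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def hvp_finite_diff(grad_fn: Callable[[Vector], Vector],
                    x: Vector, v: Vector, eps: float = 1e-4) -> Vector:
    """Approximate H v as (g(x + eps*v) - g(x - eps*v)) / (2*eps).
    Stand-in for an autodiff HVP; the full Hessian is never formed."""
    plus = grad_fn([xi + eps * vi for xi, vi in zip(x, v)])
    minus = grad_fn([xi - eps * vi for xi, vi in zip(x, v)])
    return [(p - m) / (2 * eps) for p, m in zip(plus, minus)]


def block_sensitivity(grad_fn: Callable[[Vector], Vector], x: Vector,
                      num_tokens: int, dim: int,
                      num_probes: int = 8, seed: int = 0) -> List[float]:
    """S_i ~ (1/m) * sum_k || Pi_i H (Pi_i r_k) ||_1 with Rademacher probes r_k
    that are non-zero only on token i's embedding block."""
    if len(x) != num_tokens * dim:
        raise ValueError("x must hold num_tokens * dim values")
    rng = random.Random(seed)
    scores = []
    for i in range(num_tokens):
        lo, hi = i * dim, (i + 1) * dim
        total = 0.0
        for _ in range(num_probes):
            r = [0.0] * len(x)
            for j in range(lo, hi):
                r[j] = rng.choice((-1.0, 1.0))
            hv = hvp_finite_diff(grad_fn, x, r)
            total += sum(abs(h) for h in hv[lo:hi])  # project back onto block i
        scores.append(total / num_probes)
    return scores


def kl_divergence(p: Sequence[float], q: Sequence[float],
                  tiny: float = 1e-12) -> float:
    """D_KL(p || q); `tiny` floors q to avoid division by zero (an assumption)."""
    return sum(pi * math.log(pi / max(qi, tiny)) for pi, qi in zip(p, q) if pi > 0)


# Toy target: f(x) = -0.5 * sum_j a_j * x_j^2, three tokens with 2-dim embeddings.
CURV = [1.0, 1.0, 0.0, 0.0, 3.0, 1.0]


def toy_grad(x: Vector) -> Vector:
    return [-a * xi for a, xi in zip(CURV, x)]


x0 = [0.3, -0.2, 0.5, 0.1, -0.4, 0.7]
sens = block_sensitivity(toy_grad, x0, num_tokens=3, dim=2)
print(sens)  # approximately [2.0, 0.0, 4.0]

# Hypothetical target distributions: original vs. after masking each token.
p_orig = [0.7, 0.2, 0.1]
p_masked = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2], [0.1, 0.2, 0.7]]
info = [kl_divergence(p_orig, q) for q in p_masked]
print(info)  # token 0 leaves the distribution unchanged, so its KL is 0.0
```

In practice the finite difference would be replaced by an autodiff HVP on the model's target log-probability, and the masked distributions would come from extra forward passes, which is where the cost discussed later in these notes comes from.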
Key Experimental Results
Settings: Four decoder-only models (Qwen2.5-3B, GPT-J-6B, Phi-3-Medium-14B, LLaMA-3.1-70B); LongRA / TellMeWhy / WikiBio benchmarks plus a self-constructed 2000-entry mixed paragraph dataset; single A100-80GB GPU. Faithfulness is measured with Soft-NC / Soft-NS; alignment is measured with the paper's custom DSA metric.
Main Results: Faithfulness (Soft-NC↑ / Soft-NS↑, GPT-J 6B)
| Method | LongRA NC | LongRA NS | TellMeWhy NC | TellMeWhy NS | WikiBio NC | WikiBio NS |
|---|---|---|---|---|---|---|
| Integrated Gradients | 1.87 | 0.45 | 1.54 | 0.04 | 1.38 | 0.77 |
| Peering (PML) | 2.05 | 0.50 | 1.68 | 0.06 | 1.50 | 0.83 |
| Attention Rollout | 0.41 | -0.01 | 0.25 | -0.09 | 1.91 | 0.46 |
| ReAgent (Second best) | 1.68 | 0.37 | 1.45 | 0.36 | 1.22 | 0.39 |
| HETA (Ours) | 10.3 | 2.31 | 9.2 | 2.04 | 3.80 | 2.20 |
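HETA's Soft-NC on LongRA and TellMeWhy is more than five times that of the strongest baseline in this table (10.3 vs. 2.05 and 9.2 vs. 1.68, both Peering) and about six times that of ReAgent; on WikiBio the margin is closer to 2x (3.80 vs. 1.91 for Attention Rollout). Consistent trends are observed on Phi-3 and LLaMA.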
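The size of the Soft-NC margin can be checked directly from the GPT-J table. The short script below only re-uses the numbers printed in that table; the variable names are illustrative.

```python
# Soft-NC values copied from the GPT-J 6B table above.
baselines = {
    "Integrated Gradients": {"LongRA": 1.87, "TellMeWhy": 1.54, "WikiBio": 1.38},
    "Peering (PML)":        {"LongRA": 2.05, "TellMeWhy": 1.68, "WikiBio": 1.50},
    "Attention Rollout":    {"LongRA": 0.41, "TellMeWhy": 0.25, "WikiBio": 1.91},
    "ReAgent":              {"LongRA": 1.68, "TellMeWhy": 1.45, "WikiBio": 1.22},
}
heta = {"LongRA": 10.3, "TellMeWhy": 9.2, "WikiBio": 3.80}

ratios = {}
for dataset, heta_nc in heta.items():
    # Strongest baseline on this dataset by Soft-NC.
    best_name = max(baselines, key=lambda name: baselines[name][dataset])
    best_nc = baselines[best_name][dataset]
    ratios[dataset] = (best_name, round(heta_nc / best_nc, 2))
    print(f"{dataset}: HETA / {best_name} = {ratios[dataset][1]}")
# LongRA: HETA / Peering (PML) = 5.02
# TellMeWhy: HETA / Peering (PML) = 5.48
# WikiBio: HETA / Attention Rollout = 1.99
```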
Alignment Experiment: DSA Metric (Distractor dataset, ↑ higher is better)
| Method | GPT-J | LLaMA | Phi-3 | Qwen2.5 |
|---|---|---|---|---|
| Integrated Gradients | -0.34 | -0.28 | -0.41 | -0.31 |
| Attention Rollout | -0.44 | -0.39 | -0.52 | -0.41 |
| ReAgent (Second best) | 3.60 | 3.78 | 3.35 | 3.50 |
| HETA (Ours) | 4.80 | 5.10 | 4.25 | 4.65 |
Gradient/attention-based methods show negative DSA when distractors are present, indicating an inability to isolate causal tokens; HETA achieves \(\mathrm{DSA}\ge 4.25\) across all models.
Ablation Study and Robustness
- Component Ablation: Removing the semantic flow, Hessian, or KL component consistently decreases faithfulness and alignment, indicating that the three signals are complementary.
- Robustness to Decoding Hyperparameters (Max relative change \(\Delta\% \downarrow\)): Across a grid of temperature/top-p/top-k/repetition penalty + 3 seeds, HETA's metrics show \(\Delta\% < 1\%\), whereas all baselines exceed \(2\%\).
- Stress Testing: HETA ranks best among the compared methods on Gaussian perturbation sensitivity (\(\downarrow\)), robustness to active/passive voice paraphrasing (Spearman \(\uparrow\)), and alignment F1 with GPT-4o/GPT-5 annotations (\(\uparrow\)).
Key Findings
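Two of the robustness quantities above are simple to compute once attribution metrics are available. The sketch below assumes that "max relative change" means the largest absolute deviation from a reference run, expressed as a percentage of that reference; these notes do not spell out the exact definition, so treat it as an assumption. Spearman correlation is the standard Pearson correlation of ranks. All inputs are made-up illustrative values.

```python
import math
from typing import List, Sequence


def max_relative_change(values: Sequence[float], reference: float) -> float:
    """Assumed definition: 100 * max_i |v_i - reference| / |reference|."""
    if reference == 0:
        raise ValueError("reference metric must be non-zero")
    return 100.0 * max(abs(v - reference) for v in values) / abs(reference)


def average_ranks(xs: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and xs[order[end + 1]] == xs[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0 or vb == 0:
        raise ValueError("rank variance is zero; correlation undefined")
    return cov / math.sqrt(va * vb)


# Hypothetical metric values for one method across three decoding configs.
delta = max_relative_change([2.0, 1.99, 2.01], reference=2.0)
print(round(delta, 3))  # 0.5

# Hypothetical attribution scores for aligned tokens in active vs. passive voice.
active = [0.1, 0.4, 0.3, 0.9]
passive = [0.2, 0.5, 0.3, 0.8]
rho = spearman(active, passive)
print(rho)  # 1.0, because both versions rank the tokens identically
```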
ReAgent consistently ranks second; SEA-CoT and Progressive Inference show moderate gains over traditional methods. First-order gradient and attention methods often give low or negative Soft-NS / DSA, confirming the core argument that "the linear assumption fails in autoregressive generation."
Highlights & Insights
- Explicit Motivation for Second-order Perspective: Uses saturated activation counterexamples (gradient zero but Hessian non-zero) to provide a rigorous argument for "why curvature is needed," rather than just piling on terminology.
- Tri-signal Orthogonal Complementarity: Causal gating manages "direction," Hessian manages "nonlinear intensity," and KL manages "information theoretic impact." Ablations show that removing any one of them degrades results.
- Scalable Engineering: HVP + Hutchinson + low-rank windowing + restricting second-order to final layers makes curvature attribution for 70B-scale models computationally feasible.
- New Evaluation Benchmark + DSA Metric: Uses NarrativeQA distractors concatenated with SciQ evidence segments to create "semantically rich but non-diagnostic" controls, enabling direct quantification of whether attribution lands on true predictive evidence (annotation F1=0.91, \(\kappa=0.89\)).
Limitations & Future Work
- Computational Cost: Even with low-rank/windowing, second-order terms and per-token masked KL are more expensive than first-order methods; further approximations (e.g., only calculating the last six layers) are needed for ultra-long contexts or massive models.
- Weight Tuning: \(\beta,\gamma\) default to 0.5; the optimal ratio might vary by task/model, and the paper does not provide an adaptive scheme.
- Evaluation Dependency on LLM Labeling: The gold standard for DSA relies on the intersection of GPT-4o/GPT-5 annotations, posing a risk of evaluation model bias.
- Theoretical Guarantees: Error bounds are in the appendix; the main text does not fully explore the bias-variance trade-off of HVP estimation under low-rank approximation.
Related Work & Insights
- Spectrum of Attribution Methods: Transitions from the first-order paradigm (LIME/SHAP/IG/Grad-CAM/LRP) to generative-specific methods like ContextCite (sparse linear proxy), TDD (logit lens), Peering (representation matching), and ReAgent. HETA is positioned as the branch adding "second-order + causal gating."
- Norm-based Attention (Kobayashi et al. 2020) inspired the "attention \(\times\) value norm" design in semantic flow.
- Hessian Sensitivity (Alvarez-Melis & Jaakkola, Dhamdhere et al.) concepts are migrated to the token embedding level with scalable estimation.
- Insight: For any scenario where "linear/first-order approximation fails," the trio of second-order signals + causal path constraints + information theoretic perturbation can serve as a universal enhancement template; its scalable HVP estimation is also applicable to other interpretability/optimization tasks requiring Hessians.
Rating
- Novelty: ⭐⭐⭐⭐ — Fusing second-order curvature, causal attention flow, and KL information loss systematically with scalable HVP estimation for decoder-only attribution is novel and well-motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐ — 4 models × multiple datasets × multiple baselines, covering faithfulness, alignment, ablation, and robustness to decoding/syntax. Robust coverage, though some results (Qwen) and error bounds are relegated to the appendix.
- Writing Quality: ⭐⭐⭐⭐ — Clear derivation of motivation (second-order Taylor + saturation counterexamples), good synergy between formulas and flowcharts, smooth logic.
- Value: ⭐⭐⭐⭐ — Provides a more faithful and stable attribution tool for autoregressive LLMs, alongside reusable evaluation benchmarks and the DSA metric, offering practical value to the interpretability community.