🔬 Interpretability
🧪 ICML2026 · 21 paper notes
📌 Same area in other venues: 💬 ACL2026 (27) · 📷 CVPR2026 (28) · 🔬 ICLR2026 (55) · 🤖 AAAI2026 (37) · 🧠 NeurIPS2025 (80) · 📹 ICCV2025 (10)
🔥 Top topics: LLM ×3
- All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs
-
This paper systematically falsifies a core implicit assumption in mechanistic interpretability, namely that "one LLM capability corresponds to a unique circuit," using the Overlap-Aware Sheaf Repulsion (OASR) algorithm. It finds that the same task can be supported by multiple circuits or sheaves that barely overlap (IoU ~4–11%) yet are each faithful, sparse, and complete, and it proposes the "Distributive Dense Circuit Hypothesis" as a theoretical explanation.
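The overlap figure quoted in this summary is an intersection-over-union (IoU). A minimal sketch of that metric, assuming a circuit is represented as a set of edges; the edge names and the `circuit_iou` helper are hypothetical and not taken from the paper:

```python
def circuit_iou(circuit_a: set, circuit_b: set) -> float:
    """Intersection-over-union of two circuits given as sets of edges."""
    union = circuit_a | circuit_b
    if not union:
        return 0.0  # convention chosen here for two empty circuits
    return len(circuit_a & circuit_b) / len(union)

# Hypothetical edge sets for two circuits discovered for the same task.
circuit_1 = {("a0.h1", "m2"), ("m2", "a5.h3"), ("a5.h3", "logits")}
circuit_2 = {("a1.h4", "m3"), ("m3", "a6.h0"), ("a5.h3", "logits")}

iou = circuit_iou(circuit_1, circuit_2)
print(f"IoU = {iou:.2f}")
```

The two toy circuits share one edge out of five distinct edges, so a low IoU here means the circuits are mostly disjoint even though both end at the same output.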
- Barriers to Counterfactual Credit Attribution for Autoregressive Models
-
This paper formalizes the problem of "Counterfactual Credit Attribution (CCA)" for generative models in RAG/in-context deployment and proves two surprising negative results: (1) even if the underlying next-token predictor is (0,0)-CCA, the autoregressive rollout is not CCA, so CCA does not compose naturally under autoregression the way DP does; (2) black-box "CCA retrofitting" of a deployed non-attributing model requires a number of queries that is at least exponential in the output length \(\ell\).
- Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path
-
This paper proposes the Circuit Fingerprint hypothesis: when an answer token is fed into a Transformer in isolation, the direction it leaves in the hidden space precisely corresponds to the circuit path required to produce that answer. Based on this, circuit discovery can be achieved via pure geometric alignment (without gradients/interventions), and the same set of directions can be used for activation steering, demonstrating that "read" and "write" are two sides of the same geometric object.
- CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features
-
CorrSteer selects interpretable steering features whose SAE activations on generated tokens are Pearson-correlated with task correctness, and directly uses the mean activation on positive samples as the steering coefficient, so it needs neither contrastive datasets nor backpropagation. It achieves +3.3% on MMLU and +27.1% on HarmBench for Gemma-2 2B / LLaMA-3.1 8B, with lower side-effect rates than fine-tuning.
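The selection rule described in this summary can be illustrated on synthetic data. This is a sketch under stated assumptions, not the authors' implementation: the activation matrix, the correctness labels, and the helper `pearson_per_feature` are all made up for illustration, and how the coefficient is injected at generation time is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: rows are samples, columns are SAE features.
n_samples, n_features = 200, 16
activations = rng.random((n_samples, n_features))
# Hypothetical ground truth: correctness depends on feature 3 plus noise.
noise = 0.1 * rng.standard_normal(n_samples)
correct = (activations[:, 3] + noise > 0.5).astype(float)

def pearson_per_feature(acts, labels):
    """Pearson correlation between each feature column and the labels."""
    acts_c = acts - acts.mean(axis=0)
    labels_c = labels - labels.mean()
    denom = np.sqrt((acts_c ** 2).sum(axis=0) * (labels_c ** 2).sum())
    return (acts_c * labels_c[:, None]).sum(axis=0) / denom

corr = pearson_per_feature(activations, correct)
best_feature = int(np.argmax(corr))
# Steering coefficient: mean activation of the chosen feature on correct samples.
coefficient = activations[correct == 1, best_feature].mean()
print(best_feature, round(float(coefficient), 3))
```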
- Disentangling Direction and Magnitude in Transformer Representations: A Double Dissociation Through L2-Matched Perturbation Analysis
-
This paper uses an L2-matched perturbation protocol to demonstrate that, in the Pythia series, direction (angle) perturbations are 42.9 times more destructive to language modeling loss than magnitude perturbations of the same displacement, while magnitude perturbations are far more damaging to syntax (subject-verb agreement) than direction perturbations. This constitutes a "double dissociation" in the cognitive neuroscience sense, with direction effects propagating via the attention pathway and magnitude effects via the LayerNorm pathway.
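To make "same displacement" concrete, here is a sketch of two perturbations of a hidden vector that share the same L2 distance from the original. It assumes the magnitude perturbation is a pure rescaling and the direction perturbation is a norm-preserving rotation; the paper's exact protocol may differ, and both function names are hypothetical.

```python
import numpy as np

def magnitude_perturbation(h, delta):
    """Rescale h along its own direction so that ||h' - h|| == delta."""
    return h * (1.0 + delta / np.linalg.norm(h))

def direction_perturbation(h, delta, rng):
    """Rotate h, keeping its norm, so that ||h' - h|| == delta."""
    norm = np.linalg.norm(h)
    # A rotation by theta moves a point on a sphere of radius `norm`
    # by a chord of length 2 * norm * sin(theta / 2).
    theta = 2.0 * np.arcsin(delta / (2.0 * norm))
    u = h / norm
    # Random unit vector orthogonal to h, defining the rotation plane.
    r = rng.standard_normal(h.shape)
    r -= (r @ u) * u
    r /= np.linalg.norm(r)
    return norm * (np.cos(theta) * u + np.sin(theta) * r)

rng = np.random.default_rng(0)
h = rng.standard_normal(64)
delta = 0.5
h_mag = magnitude_perturbation(h, delta)
h_dir = direction_perturbation(h, delta, rng)
print(np.linalg.norm(h_mag - h), np.linalg.norm(h_dir - h))
```

Both perturbed vectors sit at distance `delta` from `h`, so any difference in downstream effect cannot be attributed to the size of the displacement.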
- Do Activation Verbalization Methods Convey Privileged Information?
-
This work systematically demonstrates that the performance of currently popular activation verbalization methods (Patchscopes / LIT / SelfIE), when used as LLM interpretability tools, is fully explained by the verbalizer model's own knowledge, without requiring any internal activations from the target model. This implies that these tools only appear to work on existing benchmarks because of flaws in benchmark design, and that when the verbalizer's knowledge exceeds the target's, the verbalizer fabricates "explanations" the target does not possess.
- SemGrad: Gradients w.r.t. Semantics-Preserving Embeddings Tell LLM Uncertainty
-
SemGrad is the first to bring "gradient-based" uncertainty quantification to LLM free-form generation. It uses the Semantics-Preserving Score (SPS) to identify hidden states encoding input semantics, and treats the norm of the log-likelihood gradient with respect to these states as a measure of LLM confidence. Without sampling and with only a single backward pass, it outperforms 11 SOTA baselines on 3 QA datasets, especially surpassing SAR by 3.27 AUROC on the multi-answer TruthfulQA.
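The core quantity in this summary, the norm of a log-likelihood gradient with respect to a hidden state, can be illustrated with a toy model. The sketch below assumes a single linear-softmax head so the gradient has a closed form; it does not reproduce the Semantics-Preserving Score or the selection of hidden states, and all names and sizes are hypothetical.

```python
import numpy as np

def log_likelihood(W, h, y):
    """log p(y | h) for a toy linear-softmax head with logits W @ h."""
    z = W @ h
    z = z - z.max()  # numerical stability
    return z[y] - np.log(np.exp(z).sum())

def loglik_gradient(W, h, y):
    """Analytic gradient of log p(y | h) with respect to the hidden state h."""
    z = W @ h
    p = np.exp(z - z.max())
    p /= p.sum()
    one_hot = np.zeros_like(p)
    one_hot[y] = 1.0
    return W.T @ (one_hot - p)

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 8))  # hypothetical: 10 output tokens, hidden size 8
h = rng.standard_normal(8)
y = 3

grad = loglik_gradient(W, h, y)
score = float(np.linalg.norm(grad))  # gradient-norm score for this toy head
print(score)
```

In this toy head the gradient vanishes as p(y | h) approaches 1, which gives some intuition for why a gradient norm can track confidence.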
- Grokking: From Abstraction to Intelligence
-
This paper provides a unified explanation of the grokking phenomenon from the perspective of structural simplification (Occam's razor). During training, the model undergoes four synchronous types of "internal condensation": causal mediation degradation, manifold collapse to the \(\mathbb{Z}_{97}\) ring, spectral energy concentrating on sparse Fourier modes, and a sharp drop in BDM algorithmic complexity. Using an analytically tractable singular feature machine (SFM), the authors show that these are equivalent to a free-energy-driven phase transition.
- Interpretability Can Be Actionable
-
This is a position paper arguing that "what interpretability research lacks is not new methods, but evaluation criteria": research should use actionability (whether insights can drive concrete decisions/interventions outside the interpretability domain) as a core evaluation dimension. The authors define actionability along the axes of concreteness and validation, analyze obstacles, list five high-leverage application domains, and provide a six-step checklist for researchers.
- Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models
-
The authors conduct the first large-scale hierarchical mechanistic analysis of six mainstream tabular foundation models (TFMs), discovering that the middle and later layers mainly perform "iterative refinement" and contain substantial redundancy. Based on this, they design a single-layer recurrent TFM using only 20% of the parameters, achieving performance nearly matching the original six-layer version.
- Manifold-Aligned Guided Integrated Gradients for Reliable Feature Attribution
-
This paper proposes MA-GIG, which transfers the "select low-gradient features and take a step" strategy of Guided IG from pixel space to the latent space of a pretrained VAE. By leveraging the decoder Jacobian, axis-aligned updates in latent space are mapped into updates within the tangent space of the data manifold. This both avoids high-gradient noise regions and keeps samples along the integration path close to the true data manifold, resulting in more reliable attributions.
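The role of the decoder Jacobian can be seen in a toy setting. The sketch below assumes a hypothetical decoder `decode(z) = tanh(A z)` in place of a pretrained VAE decoder and shows that an axis-aligned latent step maps, to first order, onto the corresponding Jacobian column in data space. It does not implement the Guided IG feature selection itself.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))  # hypothetical decoder weights: latent dim 2, data dim 6

def decode(z):
    """Toy nonlinear decoder standing in for a VAE decoder."""
    return np.tanh(A @ z)

def decoder_jacobian(z):
    """Analytic Jacobian: d tanh(Az)/dz = diag(1 - tanh(Az)^2) @ A."""
    return (1.0 - np.tanh(A @ z) ** 2)[:, None] * A

z = np.array([0.3, -0.2])
step = 1e-4
dz = np.zeros(2)
dz[0] = step  # axis-aligned update along latent coordinate 0

dx_tangent = decoder_jacobian(z) @ dz    # first-order step in the tangent space
dx_actual = decode(z + dz) - decode(z)   # actual movement of the decoded sample
print(np.abs(dx_tangent - dx_actual).max())
```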
- Memory as a Markov Matrix: Sample Efficient Knowledge Expansion via Token-to-Dictionary Mapping
-
This paper interprets the next-token distribution of an autoregressive LLM as the state transition matrix of a Markov chain, so "learning new words" becomes "adding new states to the state space and representing them as sparse combinations of existing states." Theoretically, only \(O(s)\) samples are needed (\(s\) is the number of mapped old tokens); in practice, simply finetuning the embedding of the new token suffices to achieve cross-lingual and new-concept expansion with strict zero forgetting.
- Optimal Attention Temperature Improves the Robustness of In-Context Learning under Distribution Shift in High Dimensions
-
Within the high-dimensional linear regression ICL framework, this work employs an "approximate softmax attention" that preserves softmax normalization and temperature selectivity while remaining analytically tractable, deriving a closed-form solution for ICL generalization error and an explicit formula for the optimal attention temperature \(\tau_{\text{opt}}\). It is proven that simply tuning the temperature at inference can recover near Bayes-optimal performance. The effectiveness of this "lightweight knob" is also validated on real QA tasks with GPT-2 and Llama2-7B.
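As a reminder of what the temperature knob does, here is a sketch of standard softmax attention weights with a temperature \(\tau\) dividing the scores. This is ordinary softmax, not the paper's analytically tractable approximation, and the placement of \(\tau\) and the function name are assumptions made for illustration.

```python
import numpy as np

def attention_weights(q, K, tau):
    """Softmax attention weights of query q over keys K at temperature tau."""
    scores = K @ q / tau
    scores = scores - scores.max()  # numerical stability
    w = np.exp(scores)
    return w / w.sum()

rng = np.random.default_rng(0)
q = rng.standard_normal(4)
K = rng.standard_normal((6, 4))  # six context tokens, key dimension 4

w_sharp = attention_weights(q, K, tau=0.1)  # low temperature: selective
w_flat = attention_weights(q, K, tau=10.0)  # high temperature: near uniform
print(w_sharp.max(), w_flat.max())
```

Lowering `tau` concentrates the weights on the best-matching keys, while raising it pushes them toward a uniform average over the context.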
- Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks
-
The authors formalize neural networks (especially LLMs) as composite agents synthesized from multiple implicit sub-agents (each a probability distribution over outcomes) via log-weighted pooling. Within the cognitive utility framework \(W_i(o)=\log P_i(o)\), they prove that "strict unanimity benefit" is impossible under linear pooling or binary outcomes, but feasible when \(|\mathcal O|\ge 3\). This leads to the alignment principle that "explicitly manifesting Waluigi before suppression" is strictly superior to "only reinforcing Luigi".
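The two pooling rules contrasted in this summary can be sketched as follows. The code assumes that "log-weighted pooling" means logarithmic pooling, i.e. a normalized weighted geometric mean of the sub-agent distributions; the example distributions and weights are made up.

```python
import numpy as np

def log_pool(dists, weights):
    """Logarithmic pooling: P(o) proportional to prod_i P_i(o) ** w_i."""
    log_p = weights @ np.log(dists)
    log_p = log_p - log_p.max()  # numerical stability
    p = np.exp(log_p)
    return p / p.sum()

def linear_pool(dists, weights):
    """Linear pooling: P(o) = sum_i w_i * P_i(o)."""
    return weights @ dists

# Two hypothetical sub-agents over |O| = 3 outcomes.
dists = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]])
weights = np.array([0.5, 0.5])

p_log = log_pool(dists, weights)
p_lin = linear_pool(dists, weights)
# Each sub-agent's utility of outcome o is W_i(o) = log P_i(o).
utilities = np.log(dists)
print(p_log, p_lin)
```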
- Provably Learning Attention with Queries
-
The authors prove that, under value-query access, single-head softmax attention can be recovered exactly with only \(O(d^2)\) queries, which is much easier than for ReLU MLPs of similar structure. When the head dimension \(r\ll d\), compressed sensing reduces this to \(O(rd)\). The results also cover noisy oracles, membership queries, and the unidentifiability of multi-head attention.
- Steer Like the LLM: Activation Steering that Mimics Prompting
-
This paper reinterprets "prompt steering" as a form of activation steering natively implemented by LLMs, and then distills the activation difference induced by prompt injection using a token-wise ReLU probe. The resulting Prompt Steering Replacement (PSR) module not only outperforms existing activation steering methods (CAA, ReFT-R1, Stolfo, etc.) on three steering benchmarks, but also matches or surpasses prompting on AxBench and persona steering tasks.
- The Cylindrical Representation Hypothesis for Language Model Steering
-
This paper proposes the Cylindrical Representation Hypothesis (CRH), which relaxes the orthogonality assumption of the Linear Representation Hypothesis (LRH) while retaining "concept linearity." It demonstrates that the superposition of concept vectors naturally induces a cylindrical geometry of "axis + normal plane + sensitive sector," thereby providing the first geometric explanation for why activation steering is unpredictable at the sample level but observable at the population level.
- The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity
-
This paper reveals the structural root of "attention sink to the first token" in LLMs—under causal masking, the first token lacks value aggregation, leading to variance discrepancy, which is selectively amplified by super neurons in the FFN, resulting in extreme dimension disparity. This ultimately locks the QK projection, forcing the formation of an attention sink. Based on this, the authors propose head-wise RMSNorm during pretraining to fundamentally suppress the sink.
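A sketch of what a head-wise RMSNorm could look like: each head's slice of the hidden vector is normalized by its own root-mean-square. Where exactly the paper applies this normalization is not stated in the summary, so the placement, the optional `gain` parameter, and the function name are assumptions.

```python
import numpy as np

def headwise_rmsnorm(x, n_heads, gain=None, eps=1e-6):
    """Normalize each head's slice of x (seq_len, d_model) by its own RMS."""
    seq_len, d_model = x.shape
    head_dim = d_model // n_heads
    heads = x.reshape(seq_len, n_heads, head_dim)
    rms = np.sqrt((heads ** 2).mean(axis=-1, keepdims=True) + eps)
    normed = heads / rms
    if gain is not None:
        normed = normed * gain  # hypothetical learnable gain, shape (n_heads, head_dim)
    return normed.reshape(seq_len, d_model)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))
x[0] *= 100.0  # mimic a first token with inflated variance

out = headwise_rmsnorm(x, n_heads=4)
per_head_rms = np.sqrt((out.reshape(5, 4, 4) ** 2).mean(axis=-1))
print(per_head_rms[0], per_head_rms[1])
```

After normalization the inflated first token has the same per-head scale as every other token, which is the kind of variance discrepancy the summary identifies as the origin of the sink.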
- Towards Steering without Sacrifice: Principled Training of Steering Vectors for Prompt-only Interventions
-
The authors leverage infinite-width neural network scaling theory to derive that joint training of the steering vector's scaling factor \(\alpha\) and direction \(\mathbf{v}\) should satisfy the scaling constraint \(\eta_{\mathbf{v}}\eta_{\alpha}=\Theta(1)\), thereby eliminating the need to select \(\alpha\) manually at inference. Inspired by ReFT, they apply an additive intervention only to the first 4 prompt tokens (PrOSV). On AxBench, this approach maintains model utility and consistently outperforms full-sequence FSSV across three Gemma2/Qwen2.5 model scales.
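The prompt-only intervention can be sketched as adding a scaled vector to the hidden states of the first few prompt positions and leaving the rest untouched. This is an illustration only: the function name, shapes, and values are hypothetical, and the training of the direction and scaling factor is not shown.

```python
import numpy as np

def steer_prompt_prefix(hidden, v, alpha, n_tokens=4):
    """Add alpha * v to the first n_tokens positions of hidden (seq_len, d_model)."""
    steered = hidden.copy()
    k = min(n_tokens, hidden.shape[0])
    steered[:k] += alpha * v
    return steered

rng = np.random.default_rng(0)
hidden = rng.standard_normal((10, 8))  # hypothetical layer activations for 10 tokens
v = rng.standard_normal(8)             # steering direction
alpha = 2.0                            # steering factor

steered = steer_prompt_prefix(hidden, v, alpha)
print(np.abs(steered - hidden).sum(axis=1))
```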
- Understanding LoRA as Knowledge Memory: An Empirical Analysis
-
The authors treat LoRA modules as independently trainable, loadable, and combinable knowledge memory units and conduct a systematic empirical audit on the PhoneBook benchmark and the newly constructed PaperQA benchmark. They provide quantitative end-to-end design guidelines spanning "rank → capacity → efficiency → multi-module composition → complementarity with RAG/ICL".
- Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints
-
This paper provides an architecture-level explanation for why the internal representations of transformers can be repeatedly and successfully decoded by simple linear methods (probes, SAEs, activation steering): as long as semantic features are read out via linear interfaces such as OV circuits or the unembedding, they must reside in a context-invariant linear subspace (the Invariant Subspace Necessity theorem). This leads to a zero-shot application, the Self-Reference Property: the embedding direction of a token is itself its concept direction, which enables unsupervised classification directly from the geometric position of class tokens.
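The zero-shot classification idea can be sketched as a nearest-direction lookup against the embeddings of the class tokens. The sketch assumes cosine similarity as the comparison and uses random vectors as stand-ins for real token embeddings; the function name is hypothetical.

```python
import numpy as np

def classify_by_class_tokens(x, class_embeddings):
    """Return the index of the class token whose direction is closest to x."""
    x_unit = x / np.linalg.norm(x)
    e_unit = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    return int(np.argmax(e_unit @ x_unit))

rng = np.random.default_rng(0)
# Stand-ins for the embeddings of three class tokens (e.g. three label words).
class_embeddings = rng.standard_normal((3, 32))
# A representation that lies near the direction of class 1, plus noise.
x = class_embeddings[1] + 0.1 * rng.standard_normal(32)

pred = classify_by_class_tokens(x, class_embeddings)
print(pred)
```

No probe is trained here: the class directions come directly from the token embeddings, which is what makes the procedure unsupervised.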