🔬 Interpretability

💬 ACL2025 · 22 paper notes

🔥 Top topics: LLM ×7 · Reasoning ×3

A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability: This work proposes a dual-perspective NLG meta-evaluation framework that decomposes traditional human-metric correlation into a global perspective (ordinal classification to judge coarse-grained quality levels) and a local perspective (adjacent pairwise comparison to distinguish fine-grained quality differences). Its automatic benchmark construction method avoids manual annotation and data contamination. Experiments on 16 LLM evaluators show that Qwen-2.5-72B performs best from the global perspective, while DeepSeek-V3 performs best from the local perspective.
An Empirical Study of Mechanistic Interpretability Approaches for Factual Recall: This paper systematically compares multiple mechanistic interpretability methods (such as causal tracing, activation patching, and probing analysis) in localizing and explaining the mechanisms of factual recall in LLMs, revealing the consistencies, discrepancies, and respective application scenarios of different approaches.
Around the World in 24 Hours: Probing LLM Knowledge of Time and Place: This paper presents the GeoTemp dataset (320k prompts covering 289 cities and 37 time zones) and uses it for the first evaluation of joint temporal and spatial reasoning in LLMs. The study finds that models can handle time calculation and geographic knowledge independently, but their performance drops sharply when a task requires combining the two.
Bias Attribution in Filipino Language Models: Extending a Bias Interpretability Metric for Application on Agglutinative Languages: Extends an information-theoretic bias attribution score metric to agglutinative languages (Filipino) by averaging subword scores to handle complex morphemic structures. Analysis on four multilingual PLMs reveals that bias in Filipino models is driven by entity-type topical words (people/objects/relationships), contrasting sharply with action-type topical words (crime/sexual activity) in English.
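The subword-averaging step mentioned in the entry above can be illustrated with a minimal Python sketch. The function name `word_level_scores`, the WordPiece-style `##` continuation marker, the example segmentation, and the score values are all hypothetical; the paper's actual tokenizers and attribution scores are not reproduced here.

```python
from statistics import mean


def word_level_scores(subwords, scores):
    """Average per-subword attribution scores into one score per word.

    Assumes WordPiece-style tokens, where a leading "##" marks a
    continuation of the previous word (an assumption for illustration).
    """
    if len(subwords) != len(scores):
        raise ValueError("subwords and scores must have the same length")
    words, groups = [], []
    for piece, score in zip(subwords, scores):
        if piece.startswith("##") and words:
            # Continuation piece: extend the current word and its score group.
            words[-1] += piece[2:]
            groups[-1].append(score)
        else:
            # A new word starts here.
            words.append(piece)
            groups.append([score])
    return [(word, mean(group)) for word, group in zip(words, groups)]


# Hypothetical segmentation and scores, not taken from the paper.
subwords = ["nag", "##tatrabaho", "siya"]
scores = [0.2, 0.6, 0.1]
result = word_level_scores(subwords, scores)
```

Averaging keeps a heavily affixed word from accumulating a larger score simply because it is split into more pieces, which is presumably why it suits agglutinative languages.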
CLEME2.0: Towards Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction: This paper proposes CLEME2.0, an interpretable reference-based GEC evaluation metric. By disentangling edits into four categories (correct correction TP, wrong correction FPne, under-correction FN, and over-correction FPun) and combining them with edit weighting techniques, it achieves state-of-the-art correlation with human judgments on both the GJG15 and SEEDA datasets.
Cracking Factual Knowledge: A Comprehensive Analysis of Degenerate Knowledge Neurons in Large Language Models: This paper redefines degenerate knowledge neurons (DKNs) in LLMs from both structural and functional perspectives, proposes a neural topological clustering (NTC) method to identify DKNs of arbitrary sizes and structures, and reveals the intrinsic relationships of DKNs with LLM robustness, evolvability, and complexity through 34 experiments.
EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations: This paper proposes EXPERT, a reference-free image captioning evaluation metric based on VLM fine-tuning. By constructing a large-scale structured explanation dataset and designing a two-stage evaluation template, it achieves SOTA performance on multiple benchmark datasets while providing high-quality structured explanations across three dimensions: fluency, relevance, and descriptiveness.
IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory: IRT-Router borrows Item Response Theory (IRT) from psychometrics, treating LLMs as "test-takers" and queries as "exam questions." It learns multi-dimensional ability vectors along with difficulty and discrimination parameters to route queries across LLMs in an interpretable way, reaching over 87% accuracy in OOD scenarios at only 1/30 of the cost of GPT-4o.
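To make the IRT framing in the entry above concrete, here is a minimal Python sketch using a standard multidimensional two-parameter logistic (2PL) form. This is an assumption for illustration and may differ from the paper's exact model. The names `success_probability` and `route`, the cost-weighted scoring rule, and all numeric values are hypothetical.

```python
import math


def success_probability(ability, discrimination, difficulty):
    """Multidimensional 2PL IRT: sigmoid(discrimination . ability - difficulty)."""
    logit = sum(a * t for a, t in zip(discrimination, ability)) - difficulty
    return 1.0 / (1.0 + math.exp(-logit))


def route(llms, query, cost_weight):
    """Pick the LLM with the highest predicted success minus weighted cost.

    The linear cost penalty is an illustrative assumption, not the paper's rule.
    """
    def score(name):
        p = success_probability(
            llms[name]["ability"], query["discrimination"], query["difficulty"]
        )
        return p - cost_weight * llms[name]["cost"]

    return max(llms, key=score)


# Hypothetical ability vectors and normalized costs for two LLMs.
llms = {
    "large": {"ability": [2.0, 1.5], "cost": 1.0},
    "small": {"ability": [0.5, 0.3], "cost": 0.1},
}
# Hypothetical query parameters.
query = {"discrimination": [1.0, 0.5], "difficulty": 0.5}

quality_first = route(llms, query, cost_weight=0.2)
cost_first = route(llms, query, cost_weight=0.5)
```

With these made-up numbers, a low cost weight selects the stronger model and a higher one selects the cheaper model. The routing decision stays interpretable because it can be traced back to ability, difficulty, and discrimination values.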
Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs: This paper discovers and defines the phenomenon of "contextual entrainment" — where LLMs assign higher probabilities to any tokens that have appeared in the context. Using a differentiable masking method, the study localizes the entrainment heads responsible for this phenomenon and demonstrates that turning off these heads significantly suppresses distraction effects.
Mechanistic Interpretability of Emotion Inference in Large Language Models: By utilizing three mechanistic interpretability techniques—probing, activation patching, and generation steering—this study reveals that the emotional representations of LLMs are functionally localized in the MHSA units of intermediate layers. Furthermore, based on cognitive appraisal theory, it demonstrates that these representations are psychologically plausible, successfully steering emotional output through interventions on appraisal concepts (such as self-agency and pleasantness).
Normalized AOPC: Fixing Misleading Faithfulness Metrics for Feature Attribution Explainability: This paper shows that the widely used AOPC (Area Over the Perturbation Curve) faithfulness metric can yield misleading conclusions when compared across models, because the upper and lower bounds of AOPC differ widely from model to model. It proposes Normalized AOPC (NAOPC), which applies min-max normalization to make scores comparable across models. Experiments demonstrate that normalization can reverse model faithfulness rankings.
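The min-max normalization described in the entry above can be sketched in a few lines of Python. The function name `normalized_aopc` and the per-model AOPC values and bounds are hypothetical; the sketch assumes the lower and upper bounds are already known and does not show how the paper obtains them.

```python
def normalized_aopc(aopc, lower, upper):
    """Min-max normalize an AOPC score using that model's own bounds."""
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    return (aopc - lower) / (upper - lower)


# Hypothetical raw AOPC scores and per-model bounds, not from the paper.
models = {
    "model_a": {"aopc": 0.5, "lower": 0.2, "upper": 0.9},
    "model_b": {"aopc": 0.4, "lower": 0.1, "upper": 0.5},
}

# Ranking by raw AOPC, highest first.
raw_ranking = sorted(models, key=lambda m: models[m]["aopc"], reverse=True)

# Ranking after normalizing each score by its own model's bounds.
naopc = {name: normalized_aopc(**vals) for name, vals in models.items()}
normalized_ranking = sorted(naopc, key=naopc.get, reverse=True)
```

In this made-up example `model_a` has the higher raw AOPC, but `model_b` sits much closer to its own upper bound, so the ranking flips after normalization. This mirrors the kind of reversal the entry reports.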
Enhancing Automated Interpretability with Output-Centric Feature Descriptions: This paper proposes output-centric feature description methods (VocabProj and TokenChange) to overcome the limitation of existing automated interpretability pipelines that solely rely on input activation examples. An ensemble approach combining both input and output perspectives achieves state-of-the-art performance across both types of evaluation.
Position-aware Automatic Circuit Discovery: Proposes Position-aware Edge Attribution Patching (PEAP) and a dataset Schema mechanism to address the cancellation effect and overestimation of importance in automatic circuit discovery caused by ignoring position information, enabling smaller and more faithful circuit discovery.
Probing Subphonemes in Morphology Models: This paper proposes a language-agnostic probing method to investigate how Transformer models trained on morphological inflection tasks implicitly learn phonological features. It is found that local features (such as final devoicing) are well-encoded in phoneme embeddings, while long-distance dependencies (such as vowel harmony) are more prominent in the contextualized representations of encoder layers.
Probing the Geometry of Truth: Consistency and Generalization of Truth Directions: This work systematically investigates the consistency and generalization of the internal "truth direction" in LLMs, finding that only highly capable models stably exhibit a consistent truth direction, and that truthfulness probes trained on simple atomic statements can generalize to logical transformations, question-answering tasks, and in-context knowledge scenarios.
Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference: The complete circuit for syllogistic reasoning in language models is discovered using mechanistic interpretability techniques (Activation Patching + Logit Lens + Circuit Ablation). The circuit operates via a three-stage mechanism: long induction bias → middle-term suppression (h11.10) → transitive term movement. This circuit is both sufficient and necessary on symbolic inputs, generalizes to natural language inputs, and exhibits compatible patterns across four architectures: GPT-2, Pythia, LLaMA, and Qwen.
Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety: Proposes the Rational framework, which employs reasoning-enhanced fine-tuning to enable LLMs to perform explicit safety reasoning (analyzing intent, ethics, and potential harms) before responding, rather than relying on rigid refusal heuristics. This significantly improves robustness against reasoning-level adversarial attacks while maintaining helpfulness.
Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers: Through activation patching experiments, this work provides the first causal evidence demonstrating the existence of language-decoupled concept representations inside large language models. The model first determines the output language and then the concept, and averaging concept representations across languages not only preserves but actually improves translation accuracy.
Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis: This paper proposes locating "shortcut neurons" in contaminated models through comparative and causal analysis, then suppressing these neurons via activation patching to make LLM evaluation more trustworthy. The method achieves a Spearman correlation coefficient of over 0.95 with MixEval.
The Anatomy of Evidence: An Investigation Into Explainable ICD Coding: This paper conducts an in-depth, application-oriented analysis of the MDACE dataset and current explainable ICD coding systems. It reveals the overlap patterns between human-annotated evidence and code descriptions, characterizes how evidence is distributed within documents, and proposes new matching metrics to evaluate the utility of model explanations.
The Knowledge Microscope: Features as Better Analytical Lenses than Neurons: This paper systematically validates through experiments that features decomposed by SAEs (Sparse Autoencoders) comprehensively outperform traditional neurons as analytical units in three dimensions: knowledge representation influence, interpretability, and monosemanticity. It proposes FeatureEdit, the first feature-based model editing method, which significantly outperforms neuron-based methods in private knowledge erasing tasks.
Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework: Proposes the GETER framework, which injects temporal knowledge graph structural information into LLMs via a lightweight Structure-Text Adapter, enabling the model to deliver both accurate predictions and explainable reasoning in temporal reasoning tasks.