🔬 Interpretability

🧪 ICML2025 · 31 paper notes

📌 Same area in other venues: 📷 CVPR2026 (34) · 🔬 ICLR2026 (196) · 💬 ACL2026 (63) · 🧪 ICML2026 (92) · 🤖 AAAI2026 (37) · 🧠 NeurIPS2025 (80)

🔥 Top topics: LLM ×3 · Reasoning ×2

A Reasoning-Based Approach to Cryptic Crossword Clue Solving

A three-stage LLM reasoning pipeline (Answer Candidate Generation \(\rightarrow\) Wordplay Suggestion \(\rightarrow\) Python Formalisation & Verification) is proposed. Using open-source 9B models, it achieves a new SOTA on the Cryptonite dataset. The key innovation lies in formalizing wordplay reasoning into executable Python code and iteratively correcting it via a verifier with hints.
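
The formalise-and-verify idea can be illustrated with a toy checker for a single wordplay type. This is a minimal sketch, not the paper's verifier: the function names, the restriction to anagram wordplay, and the hint wording are all assumptions made for illustration.

```python
def is_anagram(fodder: str, answer: str) -> bool:
    """True if `answer` uses exactly the letters of `fodder` (ignores case and spaces)."""
    def letters(s: str) -> list:
        return sorted(c for c in s.lower() if c.isalpha())
    return letters(fodder) == letters(answer)


def verify_anagram_wordplay(fodder: str, answer: str, expected_length: int):
    """Check one proposed anagram reading of a clue.

    Returns (ok, hint). The hint is the kind of message a verifier could
    feed back to the model for another attempt (hypothetical wording).
    """
    n_letters = sum(c.isalpha() for c in answer)
    if n_letters != expected_length:
        return False, f"answer has {n_letters} letters, clue needs {expected_length}"
    if not is_anagram(fodder, answer):
        return False, f"'{fodder}' is not an anagram of '{answer}'"
    return True, "ok"
```

For example, `verify_anagram_wordplay("listen", "silent", 6)` passes, while a candidate with the wrong length or wrong letters fails with a hint that says which check broke.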

Ab Initio Nonparametric Variable Selection for Scalable Symbolic Regression with Large p

Proposes the PAN+SR framework, which reduces high-dimensional symbolic regression problems to low-dimensional subspaces through BART-based nonparametric variable pre-screening, achieving significant performance improvements for 19 existing SR methods in high-dimensional scenarios.

Concept-Based Unsupervised Domain Adaptation

Proposes the CUDA framework, which combines Concept Bottleneck Models (CBMs) with Unsupervised Domain Adaptation (UDA). By aligning concept representations via relaxed consistency (allowing minor domain discrepancies) and inferring unlabeled concepts in the target domain, CUDA simultaneously provides interpretability and cross-domain generalization under domain shift for the first time, backed by theoretical guarantees.

Configurable Preference Tuning with Rubric-Guided Synthetic Data

This paper proposes the Configurable Preference Tuning (CPT) framework, which trains LLMs using synthetic preference data generated from fine-grained rubrics. This enables the model to dynamically adjust its behavioral style at inference time simply by modifying the system prompt without retraining, improving accuracy from 0.52-0.68 to 0.76-0.83 across multiple base models.

DeltaSHAP: Explaining Prediction Evolutions in Online Patient Monitoring with Shapley Values

DeltaSHAP is an explainable AI algorithm designed specifically for online patient monitoring systems. By adapting Shapley values to temporal scenarios, it explains the evolution (change) between consecutive predictions rather than absolute prediction values. It provides both the direction and magnitude of feature attributions, achieving a 62% improvement in explanation quality and a 33% reduction in computation time on the MIMIC-III benchmark.

Do Sparse Autoencoders Generalize? A Case Study of Answerability

This paper systematically evaluates the out-of-domain (OOD) generalization capabilities of features extracted by Sparse Autoencoders (SAEs) on the task of "answerability." The study reveals highly inconsistent OOD transfer performance of SAE features—outperforming residual stream linear probes on some datasets while performing near-randomly on others, highlighting the fundamental limitations of current SAE interpretability methods in capturing abstract concepts.

Evaluating Neuron Explanations: A Unified Framework with Sanity Checks

Proposes the NeuronEval unified framework, formalizing 19 existing neuron explanation evaluation methods into the same mathematical paradigm. It introduces two sanity checks (Missing Labels and Extra Labels) to reveal that most commonly used metrics (e.g., Recall, AUC, and Correlation under top-and-random sampling) are unreliable, with only Correlation (Pearson), Cosine, AUPRC, F1, and IoU passing the checks.
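
A small sketch of why some metrics can be fooled while others cannot. It assumes binarized neuron activations and binary concept labels over the same inputs, and it defines recall as the fraction of active inputs covered by the concept; both are assumptions for illustration, and the paper's sanity checks may be defined differently.

```python
import numpy as np


def _bool(x):
    return np.asarray(x, dtype=bool)


def recall(active, concept):
    """Fraction of the neuron's active inputs that the concept label covers."""
    a, c = _bool(active), _bool(concept)
    return float((a & c).sum() / a.sum()) if a.any() else 0.0


def iou(active, concept):
    """Intersection over union of the active set and the concept set."""
    a, c = _bool(active), _bool(concept)
    union = (a | c).sum()
    return float((a & c).sum() / union) if union else 0.0


def f1(active, concept):
    """Harmonic mean of precision and recall, written in set form."""
    a, c = _bool(active), _bool(concept)
    denom = a.sum() + c.sum()
    return float(2 * (a & c).sum() / denom) if denom else 0.0
```

With `active = [1, 1, 0, 0, 0, 0]`, an exact explanation and an over-broad one that labels every input both get recall 1.0, but IoU drops from 1.0 to 1/3 and F1 from 1.0 to 0.5 for the over-broad one.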

Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective

Proposes the PromptQuine framework, which performs token-level pruning on ICL prompts through evolutionary search. It discovers that pruning clear exemplars into seemingly "gibberish" subsequences can actually improve LLM performance, matching or surpassing SOTA prompt optimization methods.

Explaining, Fast and Slow: Abstraction and Refinement of Provable Explanations

This paper proposes an abstraction-refinement-based method to efficiently compute provably sufficient explanations for neural network predictions, accelerating the verification process by abstracting large networks into small ones with formal guarantees on explanation quality.

FastCAV: Efficient Computation of Concept Activation Vectors for Explaining Deep Neural Networks

FastCAV replaces SVM training with a normalized mean-difference vector computed from concept activation samples. This approach is theoretically equivalent to a simplified form of Fisher Discriminant Analysis. It achieves up to 63.6\(\times\) (average 46.4\(\times\)) acceleration while maintaining comparable classification accuracy and downstream explanation quality to SVM-CAV.
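
The mean-difference idea can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the authors' implementation: the function name is hypothetical, and the published method may apply additional centering or scaling that this sketch omits.

```python
import numpy as np


def mean_difference_cav(concept_acts, random_acts):
    """Unit-norm direction from the mean of random activations to the mean
    of concept activations. Inputs have shape (n_samples, n_features)."""
    concept_acts = np.asarray(concept_acts, dtype=float)
    random_acts = np.asarray(random_acts, dtype=float)
    diff = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("concept and random activations have the same mean")
    return diff / norm
```

No optimizer runs here: the cost is two means and one norm, which is the intuition behind the speedup over fitting an SVM per concept.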

Foundation Molecular Grammar: Multi-Modal Foundation Models Induce Interpretable Molecular Grammar

FMG leverages the chemical knowledge of multi-modal foundation models (MMFMs) to induce interpretable molecular graph grammars. By rendering molecules as images and describing them via text, combined with cross-modal alignment through prompt learning, it replaces traditional grammar learning methods that rely on expert annotations or heuristics.

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models

Proposes ITDA, an inference-time activation decomposition method based on Matching Pursuit. It achieves comparable reconstruction performance at only 1% of the training cost of SAEs, scales to a 405B parameter model, and inherently supports cross-model representation comparison.
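
For readers unfamiliar with Matching Pursuit, here is a minimal generic version of the algorithm. It is a textbook sketch, not ITDA itself: how ITDA builds its dictionary is not covered, and the dictionary here is assumed to have unit-norm rows.

```python
import numpy as np


def matching_pursuit(x, dictionary, n_atoms):
    """Greedy sparse decomposition of `x` over unit-norm dictionary rows.

    Returns (coeffs, residual) such that coeffs @ dictionary + residual == x.
    """
    dictionary = np.asarray(dictionary, dtype=float)
    residual = np.asarray(x, dtype=float).copy()
    coeffs = np.zeros(dictionary.shape[0])
    for _ in range(n_atoms):
        scores = dictionary @ residual          # projection onto every atom
        k = int(np.argmax(np.abs(scores)))      # best-matching atom
        coeffs[k] += scores[k]
        residual -= scores[k] * dictionary[k]   # remove what that atom explains
    return coeffs, residual
```

Each step picks the atom most correlated with what is still unexplained, so the decomposition happens at inference time without a trained encoder.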

Leveraging Predictive Equivalence in Decision Trees

Proposes converting decision trees into minimal Disjunctive Normal Form (DNF) representations to eliminate the "predictive equivalence" problem. This unifies the representation of different decision trees with identical decision boundaries, thereby improving variable importance metrics, robustness to missing data, and feature acquisition cost optimization.

MIB: A Mechanistic Interpretability Benchmark

This paper proposes MIB (Mechanistic Interpretability Benchmark), which includes two tracks (circuit discovery and causal variable localization), four tasks, and five models. Through standardized counterfactual intervention evaluation and new metrics (CPR/CMD), MI methods are systematically compared. The study finds that attribution + mask optimization methods perform best in circuit discovery, while SAE features do not outperform original neurons in causal variable localization.

LANTERN: Modeling User Behavior from Adaptive Surveys with Supplemental Context

Proposes LANTERN (Late-Attentive Network for Enriched Response Modeling), a modular user behavior modeling architecture that treats adaptive survey data as the primary signal and achieves late fusion via cross-attention. Selective gating and residual connections maintain the dominance of survey signals, while external context (demographics, behavioral logs, etc.) is integrated only when relevant. On a production-scale dataset of approximately 35,000 users, it reaches an F1 score of 0.775, compared to 0.734 for the survey-only baseline.

Near-Optimal Decision Trees in a SPLIT Second

Proposes the SPLIT family of algorithms, a hybrid scheme that runs globally optimal search near the root of the decision tree and greedy strategies near the leaves, constructing decision trees over 100 times faster than global optimization methods with almost no loss in accuracy.

On the Effect of Uncertainty on Layer-wise Inference Dynamics

Using Tuned Lens, this work systematically analyzes the layer-wise token probability evolution trajectories of 5 LLMs on 11 datasets. It reveals that the layer-wise inference dynamics of certain and uncertain predictions are highly aligned (sudden jumps in confidence occur at similar layers). This indicates that uncertainty does not affect the structural dynamics of model inference, which challenges the feasibility of detecting uncertainty through simple intermediate-layer features.

On the Power of Context-Enhanced Learning in LLMs

This paper formally defines "context-enhanced learning" (CEL), proving that its sample efficiency is exponentially higher than that of standard learning under simplified settings, and reveals at a mechanistic level that its advantage stems from more precise gradient learning signals.

Position: We Need An Algorithmic Understanding of Generative AI

Proposes the AlgEval framework, advocating for the systematic study of the algorithms learned and used by generative AI—including algorithmic primitives (vocabulary) and their composition (grammar)—as an alternative understanding pathway to pure scaling, and demonstrates a methodology combining top-down hypothesis with bottom-up validation through a case study on graph navigation tasks.

Reactivation: Empirical NTK Dynamics Under Task Shifts

This work presents the first systematic empirical study on NTK dynamics in continual learning, finding that task shifts consistently trigger abrupt deviations in the NTK. Even in the lazy learning regime, NTK norm, velocity, and alignment metrics deviate sharply at task boundaries, revealing a feature learning phenomenon termed "reactivation." The driving factors are precisely pinpointed by distinguishing between conceptual and frequency distribution shifts.

Rethinking Explainable Machine Learning as Applied Statistics

This position paper proposes that explainable machine learning (XAI) should be viewed as "applied statistics of high-dimensional functions." Explanation algorithms are fundamentally functionals (statistics computed on functions), and research should focus on how to interpret them, as traditional statistics does for quantities such as p-values or confidence intervals, rather than merely studying their mathematical properties. The most significant gap in the current literature is the neglect of the core question: "What intuitive question does the output of an explanation algorithm actually answer?"

SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior

Proposes the SafetyAnalyst framework, which generates an interpretable "harm-benefit tree" via chain-of-thought (CoT) reasoning (enumerating harmful and beneficial effects potentially caused by AI actions, along with their likelihood, severity, and immediacy). These features are then aggregated into a harm score using 28 fully interpretable parameters. On prompt safety classification, it outperforms existing moderation systems with an average F1 score of 0.81 (compared to F1 < 0.72 for prior systems) while delivering interpretability, transparency, and steerability.

SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression

Proposes SLiM, a one-shot compression framework that seamlessly integrates hardware-friendly uniform quantization, semi-structured sparsity, and saliency-based low-rank adapters, achieving up to 5.66% accuracy improvement under 4-bit + 2:4 sparsity conditions.
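
A rough sketch of two of the three ingredients: 2:4 semi-structured sparsity and a low-rank term that absorbs part of the compression error. Several simplifications are assumed here: pruning is by plain weight magnitude, the adapter comes from an ordinary truncated SVD rather than the saliency-based variant the summary mentions, quantization is left out, and the function names are hypothetical.

```python
import numpy as np


def prune_2_of_4(w):
    """Zero the two smallest-magnitude weights in every group of four
    consecutive weights along the last axis (last dim must divide by 4)."""
    w = np.asarray(w, dtype=float)
    groups = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]  # two smallest per group
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)


def low_rank_correction(error, rank):
    """Best rank-`rank` factors (left, right) of the compression error."""
    u, s, vt = np.linalg.svd(error, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]
```

The compressed layer would then compute `x @ (pruned + left @ right).T`, with `left` and `right` stored as small dense matrices alongside the sparse weights.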

Supernova Event Dataset: Interpreting Large Language Models' Personality through Critical Event Analysis

This paper proposes the Supernova Event Dataset (comprising Wikipedia articles of biographies, historical events, news, and scientific discoveries). By instructing LLMs to extract and rank key events from long texts, and utilizing another LLM as a judge to infer the target model's "personality traits," this work reveals differences in the consistent behavioral patterns of different LLMs during subjective decision-making.

Taming Knowledge Conflicts in Language Models

Unveils the phenomenon of "Context-Parametric Superposition" (CP Superposition) within the attention heads of language models. Proposes JuICE (Just Run Twice), a dual-run attention intervention strategy that flexibly steers models toward either parametric memory or contextual knowledge without fine-tuning, achieving SOTA performance across 11 datasets and 6 model architectures.

To Steer or Not to Steer? Mechanistic Error Reduction with Abstention for Language Models

This work proposes MERA (Mechanistic Error Reduction with Abstention), a principled activation steering framework based on a linear error estimator. By employing constrained optimization to derive a closed-form optimal steering intensity and introducing a calibration step to guarantee intervention only when provably effective, MERA addresses the understeering and oversteering issues caused by traditional fixed steering intensities.

Towards Attributions of Input Variables in a Coalition

This paper re-derives the computation mechanism of the Shapley value from the perspective of AND-OR interactions, proving that attribution conflicts under different variable partitions inherently stem from interaction effects that only cover a subset of coalition variables. Based on this, the authors define a coalition attribution metric and three fidelity metrics, with experiments validating their consistency with human intuition.
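
The link between interactions and Shapley attributions can be seen in a toy game. The sketch below computes exact Shapley values by enumerating subsets (exponential cost, so only for small `n`); the function name and the example game are illustrative assumptions, not the paper's coalition metric.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n):
    """Exact Shapley values for a game `value(frozenset) -> float` on players 0..n-1."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def and_game(s):
    """A single AND interaction: pays 1 only when variables 0 and 1 are both present."""
    return 1.0 if {0, 1} <= s else 0.0
```

Calling `shapley_values(and_game, 3)` splits the AND interaction's effect evenly between variables 0 and 1 and gives variable 2 nothing. A coalition such as {0, 2} contains only part of that interaction, which is the situation the summary identifies as the source of attribution conflicts.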

Towards Flexible Perception with Visual Memory

Shifts the knowledge representation of deep visual models from being "carved in weights" to "stored in an external database." By constructing a flexible Visual Memory using pre-trained encoders and kNN retrieval, this approach enables plug-and-play data operations (adding, deleting, and scaling) and interpretable classification, achieving an 88.5% top-1 accuracy on ImageNet.
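
A toy version of the retrieval idea, assuming embeddings already come from some pre-trained encoder. The class name, cosine similarity, and plain majority voting are assumptions for illustration; the paper's ranking and voting scheme may differ.

```python
from collections import Counter

import numpy as np


def _unit(x):
    """Return rows of `x` scaled to unit length (accepts a single vector too)."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


class VisualMemory:
    """Stores (embedding, label) pairs and classifies by nearest neighbours."""

    def __init__(self, dim):
        self.keys = np.empty((0, dim))
        self.labels = []

    def add(self, embeddings, labels):
        self.keys = np.vstack([self.keys, _unit(embeddings)])
        self.labels.extend(labels)

    def remove_label(self, label):
        keep = [i for i, l in enumerate(self.labels) if l != label]
        self.keys = self.keys[keep]
        self.labels = [self.labels[i] for i in keep]

    def classify(self, query, k=3):
        sims = self.keys @ _unit(query)[0]        # cosine similarity to each key
        top = np.argsort(-sims)[:k]
        neighbours = [self.labels[i] for i in top]
        return Counter(neighbours).most_common(1)[0][0], neighbours
```

Adding or deleting data is a change to `keys` and `labels` with no retraining, and the returned neighbour list shows which stored examples produced the decision.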

Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs

FlashTrace is an efficient multi-token attribution method that reduces the attribution complexity of multi-token targets from \(\mathcal{O}(M \cdot N)\) to \(\mathcal{O}(N)\) using span-wise aggregation. It also traces importance propagation in reasoning chains via a recursive attribution mechanism, achieving a speedup of over 130\(\times\).

Validating Mechanistic Interpretations: An Axiomatic Approach

Drawing inspiration from the concept of abstract interpretation in program analysis, this paper proposes an axiomatic framework to formally define and validate the mechanistic interpretations of neural networks, and verifies the effectiveness of this framework through two case studies: a 2-SAT solver and modular addition.

What Makes an Ensemble (Un)interpretable?

This paper systematically investigates the interpretability of ensemble learning methods—identifying what factors make ensemble models difficult to interpret and how to improve ensemble interpretability while maintaining predictive performance. It proposes a theoretical framework to quantify ensemble interpretability and practical methods to construct interpretable ensembles.