🔬 Interpretability
🤖 AAAI2026 · 37 paper notes
- A Closer Look at Knowledge Distillation in Spiking Neural Network Training
- To address the overlooked distribution mismatch between the continuous features/logits of a teacher ANN and the discrete, sparse spike-based features/logits of a student SNN in ANN→SNN knowledge distillation, this paper proposes the CKDSNN framework, built on Saliency-scaled Activation Map Distillation (SAMD) and Noise-smoothed Logits Distillation (NLD), achieving new state-of-the-art SNN training performance on CIFAR-10/100, ImageNet-1K, and CIFAR10-DVS.
- A Coherence-Based Measure of AGI
- This paper identifies that existing AGI scores rely on arithmetic averaging, which implicitly encodes a "compensatory" assumption (strengths offsetting weaknesses), and proposes \(\text{AGI}_{\text{AUC}}\)—a coherence measure based on the continuous spectrum of generalized means. By integrating over the compensability parameter \(p \in [-1, 1]\), the metric penalizes uneven capability profiles and exposes bottlenecks concealed by arithmetic averaging.
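Since the construction is fully specified by the generalized-mean family, a small sketch is easy to give (the grid resolution, normalization, and score range here are assumptions, not the paper's exact definition):

```python
import numpy as np

def generalized_mean(scores, p):
    """Power mean M_p of a capability profile (scores must be positive):
    p=1 is arithmetic, p->0 is geometric, p=-1 is harmonic."""
    scores = np.asarray(scores, dtype=float)
    if abs(p) < 1e-9:  # the p -> 0 limit is the geometric mean
        return float(np.exp(np.mean(np.log(scores))))
    return float(np.mean(scores ** p) ** (1.0 / p))

def agi_auc(scores, n_grid=201):
    """Average the generalized mean over p in [-1, 1]; uneven profiles are
    penalized because M_p < M_1 for p < 1 whenever capabilities differ."""
    ps = np.linspace(-1.0, 1.0, n_grid)
    return float(np.mean([generalized_mean(scores, p) for p in ps]))

print(agi_auc([0.6, 0.6, 0.6]))  # an even profile keeps its arithmetic mean (0.6)
print(agi_auc([0.9, 0.8, 0.1]))  # well below 0.6: the bottleneck drags the score down
```

The arithmetic mean of both profiles is 0.6, but only the even profile keeps that value after integrating over \(p\), which is exactly the "no compensation" behavior the note describes.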
- Adaptive Evidential Learning for Temporal-Semantic Robustness in Moment Retrieval
- This paper proposes DEMR, a framework that introduces Deep Evidential Regression (DER) into video moment retrieval. It mitigates modal imbalance via a Reflective Flipped Fusion (RFF) module and corrects the counter-intuitive uncertainty estimation bias in vanilla DER via a Geom-regularizer, achieving significant improvements on both standard and debiased benchmarks.
- Attention Gathers, MLPs Compose: A Causal Analysis of an Action-Outcome Circuit in VideoViT
- This work applies mechanistic interpretability to reverse-engineer the internal circuits of a Video Vision Transformer (ViViT), revealing a functional division of labor in which attention heads are responsible for "gathering evidence" and MLP modules for "composing concepts." The analysis demonstrates that the model develops semantic knowledge beyond its training objective even on simple classification tasks.
- Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models
- This paper proposes the Composite Reliability Score (CRS), which unifies calibration, robustness, and uncertainty quantification into a single interpretable metric. A systematic evaluation of 10 open-source LLMs across 5 QA datasets reveals that Mixtral-8x22B achieves the highest overall reliability (CRS=0.81), and that model size does not directly determine reliability.
- Can LLMs Truly Embody Human Personality? Analyzing AI and Human Behavior Alignment in Dispute Resolution
- This paper proposes the first systematic comparative framework that directly contrasts strategic behavioral differences between humans and personality-prompted LLMs in paired dispute mediation scenarios, finding significant divergence in personality-behavior mapping and challenging the assumption that personality prompting can serve as a proxy for human behavior.
- Concepts from Representations: Post-hoc Concept Bottleneck Models via Sparse Decomposition of Visual Representations
- This paper proposes PCBM-ReD, a post-hoc concept bottleneck model that automatically extracts concepts from pretrained visual encoders via sparse autoencoders, annotates and filters them using MLLMs, and selects a representative subset through reconstruction-guided search. Image representations are then sparsely decomposed into linear combinations of concept embeddings via CLIP's vision-language alignment. The method achieves state-of-the-art accuracy on 11 classification benchmarks while maintaining interpretability.
- CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution
- CrossCheck-Bench is a three-level hierarchical benchmark comprising 15k adversarial QA samples. It diagnoses compositional reasoning failures of VLMs in multimodal conflict resolution via 7 atomic capabilities and 15 tasks, revealing systematic performance degradation from perception (L1) to reasoning (L3) and exposing the limitations of conventional prompting strategies.
- Data Whitening Improves Sparse Autoencoder Learning
- This paper introduces PCA whitening — a standard preprocessing step from classical sparse coding — into modern sparse autoencoder (SAE) training. Through theoretical analysis and simulation, it demonstrates that whitening renders the optimization landscape more convex and isotropic. Experiments on SAEBench show that whitening substantially improves interpretability metrics (Sparse Probing +7.3%, SCR +54%, TPP +372%), albeit with a slight decrease in reconstruction quality.
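A minimal numpy sketch of the preprocessing step, assuming activations arrive as a matrix with one row per sample (the SAE itself and the SAEBench metrics are out of scope; `eps` and the eigendecomposition route are implementation choices, not the paper's code):

```python
import numpy as np

def pca_whiten(X, eps=1e-8):
    """Center the activations, rotate onto principal axes, rescale to unit variance.
    Returns the whitened data plus (W, mu) so new activations can be transformed."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs / np.sqrt(eigvals + eps)   # column j scaled by 1/sqrt(lambda_j)
    return Xc @ W, W, mu

rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 16)) @ rng.normal(size=(16, 16))  # correlated "activations"
Xw, W, mu = pca_whiten(X)
# after whitening, the empirical covariance is (numerically) the identity
print(np.allclose(np.cov(Xw, rowvar=False), np.eye(16), atol=1e-3))
```

An SAE would then be trained on `Xw` instead of `X`; at inference, new activations are mapped through the same `(X - mu) @ W`.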
- Distribution-Based Feature Attribution for Explaining the Predictions of Any Classifier
- This paper proposes DFAX, the first distribution-based feature attribution method, which quantifies feature importance by comparing the conditional probability density of a target instance under the target class versus non-target classes. It provides the first formal definition of feature attribution, and demonstrates significant improvements over SHAP/LIME and other baselines across 10 datasets while being orders of magnitude faster.
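The flavor of a density-based attribution can be sketched with per-feature Gaussian class-conditionals; this is a deliberate simplification for illustration, not the paper's estimator:

```python
import numpy as np

def density_attribution(x, X_target, X_other):
    """Score each feature by the log-density ratio of x under the target class
    versus the non-target class, using 1-D Gaussian estimates per feature."""
    def log_pdf(v, mu, sd):
        return -0.5 * np.log(2 * np.pi * sd ** 2) - (v - mu) ** 2 / (2 * sd ** 2)
    mu_t, sd_t = X_target.mean(0), X_target.std(0) + 1e-8
    mu_o, sd_o = X_other.mean(0), X_other.std(0) + 1e-8
    return log_pdf(x, mu_t, sd_t) - log_pdf(x, mu_o, sd_o)

rng = np.random.default_rng(0)
X_t = rng.normal(loc=[2.0, 0.0], size=(500, 2))   # target class: shifted in feature 0 only
X_o = rng.normal(loc=[-2.0, 0.0], size=(500, 2))  # non-target class
attr = density_attribution(np.array([2.0, 0.0]), X_t, X_o)
# feature 0 (which separates the classes) gets a large positive score;
# feature 1 (identically distributed in both classes) stays near zero
print(attr)
```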
- DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment
- This paper proposes the DR.Experts framework, which leverages DA-CLIP to obtain distortion-type priors, employs a Distortion Saliency Differential Module (DSDM) to disentangle distortion attention from semantic attention and thereby purify distortion features, and then applies a Dynamic Distortion Weighting Module (DDWM) to adaptively weight each distortion type's features according to its perceptual impact. The method achieves state-of-the-art performance on five BIQA benchmarks.
- ElementaryNet: A Non-Strategic Neural Network for Predicting Human Behavior in Normal-Form Games
- This paper proposes ElementaryNet, a neural network architecture that is provably incapable of strategic reasoning, designed to model "level-0" (non-strategic) human behavior in games. It achieves prediction accuracy statistically indistinguishable from GameNet (current SOTA) while offering substantially better interpretability.
- Enhancing Binary Encoded Crime Linkage Analysis Using Siamese Network
- This paper proposes a Siamese Autoencoder-based crime linkage analysis framework that integrates geo-temporal features at the decoder stage and employs a domain expert-driven dimensionality reduction strategy. Evaluated on the real-world ViCLAS database from the UK National Crime Agency (NCA), the method achieves up to 9% AUC improvement, providing an effective machine learning solution for high-dimensional sparse binary-encoded crime data.
- Explainable Melanoma Diagnosis with Contrastive Learning and LLM-based Report Generation
- This paper proposes the CEFM framework, which aligns ViT visual features with ABCD-rule-based clinical features (asymmetry, border, color) via cross-modal contrastive learning, and subsequently employs CLIP and DeepSeek to generate structured diagnostic reports. On the ISIC dataset, the framework achieves 92.79% accuracy and 0.961 AUC, with an expert-rated interpretability score of 4.6/5.
- Finding the Translation Switch: Discovering and Exploiting the Task-Initiation Features in LLMs
- This work leverages sparse autoencoders (SAEs) to discover "translation-initiation features" within LLMs that govern the activation of translation tasks. Causal interventions validate their functional roles—amplifying these features improves translation quality and reduces hallucinations, while suppressing them induces hallucinations. The mechanistic insight is further operationalized into a practical data selection strategy that prioritizes "mechanistically difficult" samples for fine-tuning, substantially improving data efficiency and hallucination suppression.
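The intervention pattern (read a hidden state through an SAE encoder, rescale one feature, write the change back along its decoder direction) can be sketched with toy random weights; a real use would load trained SAE weights and a discovered feature index, so everything below is a placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 32, 128
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)  # toy SAE encoder
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)    # toy SAE decoder

def steer(h, feature_idx, scale):
    """Rescale one SAE feature's activation and add the change back
    to the residual stream along that feature's decoder direction."""
    acts = np.maximum(h @ W_enc, 0.0)           # ReLU encoder activations
    delta = (scale - 1.0) * acts[feature_idx]   # change in the feature's activation
    return h + delta * W_dec[feature_idx]

h = rng.normal(size=d_model)
idx = int(np.argmax(np.maximum(h @ W_enc, 0.0)))  # stand-in for a discovered feature
h_amp = steer(h, idx, scale=3.0)   # amplify the feature (the note's quality-improving case)
h_off = steer(h, idx, scale=0.0)   # suppress it (the hallucination-inducing case)
```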
- FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding
- This paper proposes the FineVAU benchmark, which decomposes Video Anomaly Understanding (VAU) into three dimensions — Event (What), Entity (Who), and Location (Where) — introduces the FV-Score metric with high alignment to human perception, and constructs the FineW³ dataset via a fully automated LVLM-assisted pipeline. Experiments reveal critical shortcomings of current LVLMs in fine-grained anomalous event perception.
- FlashKAT: Understanding and Addressing Performance Bottlenecks in the Kolmogorov-Arnold Transformer
- This paper provides an in-depth analysis of the root cause behind KAT (Kolmogorov-Arnold Transformer) training being 123× slower than ViT. The bottleneck is identified not as FLOPs but as memory stalls caused by gradient accumulation during backpropagation (global memory contention from atomic add operations). The proposed FlashKAT restructures GPU kernels to achieve an 86.5× training speedup and reduces gradient rounding errors by nearly an order of magnitude.
- Flexible Concept Bottleneck Model
- This paper proposes the Flexible Concept Bottleneck Model (FCBM), which introduces a hypernetwork to dynamically generate concept weights and a sparsemax module with a learnable temperature parameter, enabling dynamic adaptation of the concept pool—including complete replacement. FCBM achieves accuracy comparable to state-of-the-art baselines with a similar number of effective concepts across five public datasets, and requires only a single epoch of fine-tuning to adapt to an entirely new concept set.
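For reference, the sparsemax projection with a temperature parameter can be sketched as follows; the FCBM-specific hypernetwork and learned temperature are not reproduced, and the input scores are illustrative:

```python
import numpy as np

def sparsemax(z, temperature=1.0):
    """Sparsemax (Martins & Astudillo): Euclidean projection of z/temperature
    onto the probability simplex. Unlike softmax, low scores get exactly zero."""
    z = np.asarray(z, dtype=float) / temperature
    zs = np.sort(z)[::-1]                    # scores sorted descending
    css = np.cumsum(zs)
    ks = np.arange(1, len(z) + 1)
    support = 1 + ks * zs > css              # which ranks stay in the support
    k = ks[support][-1]
    tau = (css[k - 1] - 1.0) / k             # data-dependent threshold
    return np.maximum(z - tau, 0.0)

print(sparsemax([2.0, 1.0, 0.1, -1.0], temperature=0.5))  # [1. 0. 0. 0.]
print(sparsemax([2.0, 1.0, 0.1, -1.0], temperature=5.0))  # three nonzero weights
```

Lower temperature concentrates the weight on fewer concepts, which is the knob FCBM learns rather than fixes.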
- FourierPET: Deep Fourier-based Unrolled Network for Low-count PET Reconstruction
- This work identifies three categories of degradation in low-count PET that are separable in the frequency domain — Poisson noise and photon deficiency induce high-frequency phase perturbations, while attenuation correction errors suppress low-frequency amplitude — and proposes FourierPET: an ADMM-unrolled, frequency-aware reconstruction framework that achieves comprehensive state-of-the-art performance across three datasets with only 0.44M parameters.
- GateRA: Token-Aware Modulation for Parameter-Efficient Fine-Tuning
- This paper proposes GateRA, which introduces a lightweight token-aware gating module into PEFT methods (LoRA/DoRA/HiRA). A sigmoid gate dynamically adjusts the adaptation intensity per token—suppressing updates for in-distribution or simple tokens to preserve pre-trained knowledge, while amplifying adaptation for challenging tokens. Combined with entropy regularization to encourage near-binary gating decisions, GateRA consistently outperforms HiRA on commonsense reasoning (+1.1%), dialogue, and mathematical reasoning.
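A minimal sketch of a token-aware gate on top of a LoRA-style update (shapes, gate parameterization, and initialization are illustrative assumptions, not the paper's exact design):

```python
import numpy as np

def gatera_forward(X, W0, A, B, w_gate, b_gate):
    """Frozen base projection plus a low-rank update scaled per token
    by a sigmoid gate in (0, 1)."""
    gate = 1.0 / (1.0 + np.exp(-(X @ w_gate + b_gate)))  # (tokens,) per-token gate
    lora = (X @ A) @ B                                   # shared low-rank update
    return X @ W0 + gate[:, None] * lora

rng = np.random.default_rng(0)
T, d_in, d_out, r = 4, 16, 16, 4
X = rng.normal(size=(T, d_in))               # one row per token
W0 = rng.normal(size=(d_in, d_out))          # frozen pretrained weight
A = rng.normal(size=(d_in, r))
B = np.zeros((r, d_out))                     # standard LoRA init: update starts at zero
w_gate, b_gate = rng.normal(size=d_in), -2.0 # negative bias: gates start mostly closed
Y = gatera_forward(X, W0, A, B, w_gate, b_gate)
```

With `B` zero-initialized the gated branch contributes nothing at the start; the entropy regularizer described in the note (not shown) would later push each token's gate toward 0 or 1.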
- GenePheno: Interpretable Gene Knockout-Induced Phenotype Abnormality Prediction Framework
- This paper proposes GenePheno, the first interpretable multi-label prediction framework for end-to-end prediction of gene knockout-induced phenotype abnormalities directly from gene sequences. The framework captures inter-phenotype correlations via contrastive multi-label learning, enforces biological consistency through exclusivity regularization, and provides interpretability via a Gene Ontology (GO) bottleneck layer. GenePheno achieves state-of-the-art gene-centric \(F_{\max}\) and phenotype-centric AUC across four datasets.
- HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning
- This paper introduces HSKBenchmark, the first benchmark for staged modeling and writing assessment of Chinese second language acquisition (SLA) in LLMs. It comprises HSK levels 3–6 textbooks (6.76M tokens), 16K synthetic instruction data, 30 test prompts, a linguistically-grounded evaluation system, and a curriculum tuning framework designed to simulate human acquisition trajectories.
- Hypothesis Generation via LLM-Automated Language Bias for ILP
- This paper proposes the first end-to-end framework in which a multi-agent LLM system (Actor/Critic) automatically constructs ILP language bias (predicate system, type declarations, and mode constraints) from raw text. A Translator agent converts text into Prolog facts, and the MAXSYNTH solver then induces a globally optimal rule set based on the MDL principle. The framework achieves 88.3% and 81.3% accuracy on the SHOES and ZENDO tasks, respectively, with variance below 5% across four LLMs.
- iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference
- This paper proposes iMAD, a framework that selectively triggers multi-agent debate (MAD): a single agent first generates a structured response with self-critique, from which 41 interpretable linguistic/semantic features are extracted; a lightweight MLP classifier trained with the FocusCal loss then determines whether to trigger MAD. Across 6 QA/VQA benchmarks, iMAD reduces token overhead by up to 92% while improving accuracy by up to 13.5%.
- Induce, Align, Predict: Zero-Shot Stance Detection via Cognitive Inductive Reasoning
- This paper proposes the CIRF framework, which abstracts transferable reasoning patterns from LLM-generated first-order logic via unsupervised schema induction (USI), and performs explainable zero-shot stance reasoning through structural alignment using a schema-enhanced graph kernel model (SEGKM). The method achieves state-of-the-art performance on three benchmarks while requiring only 30% of labeled data.
- LLM Circuit Analyses Are Consistent Across Training and Scale
- This paper presents the first systematic tracking of internal circuits in decoder-only LLMs across 300 billion tokens of training and model scales ranging from 70M to 2.8B parameters. It finds that while specific attention heads may be replaced over the course of training, the underlying algorithms remain stable and consistent across scales, suggesting that circuit analyses conducted on smaller models generalize to larger models and longer training runs.
- Partially Shared Concept Bottleneck Models
- This paper proposes PS-CBM, a framework that integrates multimodal concept generation (combining LLM semantics with visual cues from exemplar images), a partially shared concept strategy (merging concepts based on activation patterns), and a Concept-Efficient Accuracy (CEA) evaluation metric. PS-CBM achieves higher classification accuracy and interpretability with fewer concepts across 11 datasets.
- Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models
- This paper proposes MRMBench, a benchmark that evaluates whether reward models (RMs) effectively capture multi-dimensional preferences via probing tasks across 6 dimensions (harmlessness, helpfulness, correctness, coherence, complexity, and verbosity). Probe performance is shown to strongly correlate with PPO alignment quality (Pearson \(r > 0.8\)), and an inference-time probing method is proposed that improves AlpacaEval win rate from 57.3% to 62.5%.
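A probing task in this spirit, on synthetic data standing in for frozen RM hidden states, might look like the following (the benchmark's actual probes, dimensions, and data are not reproduced; the ridge probe and labels are assumptions for illustration):

```python
import numpy as np

def fit_probe(H, y, l2=1e-3):
    """Ridge-regression probe with an intercept on frozen hidden states."""
    Hb = np.hstack([H, np.ones((len(H), 1))])
    return np.linalg.solve(Hb.T @ Hb + l2 * np.eye(Hb.shape[1]), Hb.T @ y)

def probe_predict(H, w):
    return np.hstack([H, np.ones((len(H), 1))]) @ w > 0.5

rng = np.random.default_rng(0)
H = rng.normal(size=(512, 32))          # stand-in for frozen RM hidden states
w_true = rng.normal(size=32)
y = (H @ w_true > 0).astype(float)      # synthetic "harmless vs. not" labels
w = fit_probe(H, y)
acc = np.mean(probe_predict(H, w) == (y > 0.5))
print(acc)  # well above chance: the dimension is linearly decodable
```

High probe accuracy indicates the preference dimension is linearly represented; MRMBench's finding is that this decodability correlates with downstream PPO alignment quality.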
- Quiet Feature Learning in Algorithmic Tasks
- Across 10 algorithmic tasks (18,544 training runs, \(10^9\)–\(10^{16}\) FLOPs), this work demonstrates that loss plateaus in Transformer training do not indicate stalled learning. During these plateaus, models acquire "quiet features"—intermediate algorithmic subroutines that do not directly reduce output loss yet are causally necessary for final performance (ablating them reduces accuracy by 41–75%). This challenges the common practice of using loss curves to assess training progress.
- SCoPe: Intrinsic Semantic Space Control for Mitigating Copyright Infringement in LLMs
- This paper reframes copyright infringement mitigation in LLMs as an intrinsic semantic space control problem. It leverages sparse autoencoders (SAEs) to map hidden states into a high-dimensional sparse space, identifies copyright-sensitive subspaces, and clamps their activations to zero during decoding—effectively reducing verbatim reproduction of copyrighted content without external filters or parameter updates, while preserving general model capabilities.
- ShapBPT: Image Feature Attributions Using Data-Aware Binary Partition Trees
- This paper proposes ShapBPT, which combines data-aware Binary Partition Trees (BPT) as hierarchical coalition structures with Owen-approximated Shapley values to achieve feature attributions aligned with image morphology. ShapBPT converges faster and yields more accurate shape recognition than existing Shapley-based methods, with a 20-participant user study confirming that its explanations are preferred by human evaluators.
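For context, the exact Shapley value that such methods approximate can be computed directly for small player sets; a stdlib-only reference sketch with a toy coalition value function:

```python
from itertools import combinations
from math import factorial

def shapley_values(n, value):
    """Exact Shapley values for n players given a coalition value function."""
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                # weight = |S|! (n - |S| - 1)! / n!
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi += w * (value(set(S) | {i}) - value(set(S)))
        phis.append(phi)
    return phis

# toy value function with synergy between players 0 and 1
v = lambda S: len(S) + (1.0 if {0, 1} <= S else 0.0)
print(shapley_values(3, v))  # players 0 and 1 split the synergy: ~[1.5, 1.5, 1.0]
```

This enumeration is exponential in the number of players; ShapBPT's point is that restricting coalitions to a data-aware BPT hierarchy (with Owen-style values) sidesteps it while keeping attributions aligned with image structure.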
- SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models
- This paper demonstrates that refusal behavior in LLMs is not encoded by a single direction but rather forms a low-dimensional manifold. It employs self-organizing maps (SOM) to extract multiple refusal directions and applies Bayesian optimization to search for the optimal ablation combination, surpassing single-direction baselines and dedicated jailbreak algorithms across multiple models.
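The ablation step itself reduces to projecting hidden states off the subspace spanned by the chosen directions; SOM training and the Bayesian search over ablation combinations are not shown, and the directions below are random placeholders:

```python
import numpy as np

def ablate_directions(h, dirs):
    """Remove the component of h lying in the subspace spanned by the given directions."""
    Q, _ = np.linalg.qr(np.asarray(dirs).T)   # orthonormal basis of the refusal subspace
    return h - Q @ (Q.T @ h)                  # project h off that subspace

rng = np.random.default_rng(0)
d = 64
dirs = rng.normal(size=(3, d))    # stand-ins for three SOM-derived refusal directions
h = rng.normal(size=d)
h_abl = ablate_directions(h, dirs)
# h_abl has (numerically) zero component along every refusal direction
print(np.max(np.abs(dirs @ h_abl)))
```

Single-direction ablation is the special case `dirs` of shape `(1, d)`; the paper's claim is that ablating a searched combination of several directions suppresses refusal more reliably.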
- SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
- This paper proposes SparK — a training-free, token-wise unstructured channel pruning method for KV cache. It selects salient channels via query-aware saliency scoring and recovers the contribution of pruned channels through a recovery mechanism. At an 80% pruning ratio, performance degradation remains below 5%. The method is orthogonal to token eviction approaches and can reduce KV cache storage by an additional 30%+.
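A toy sketch of query-aware channel selection for the key cache; the saliency score, keep ratio, and omission of the recovery mechanism are all assumptions rather than the paper's exact procedure:

```python
import numpy as np

def prune_channels(q, K, keep_ratio=0.2):
    """Score each key channel by |q_c| * mean|K_c| and keep the top fraction;
    returns the pruned key cache and the kept channel indices."""
    saliency = np.abs(q) * np.mean(np.abs(K), axis=0)   # per-channel importance
    k = max(1, int(len(q) * keep_ratio))
    keep = np.sort(np.argsort(saliency)[-k:])
    return K[:, keep], keep

rng = np.random.default_rng(0)
q = rng.normal(size=64)            # current query vector
K = rng.normal(size=(128, 64))     # cached keys (tokens x channels)
K_pruned, keep = prune_channels(q, K, keep_ratio=0.2)
scores = K_pruned @ q[keep]        # approximate attention logits from 12 of 64 channels
```

Because the selection depends on the current query, different queries keep different channels, which is what makes the sparsity "unstructured" and token-wise.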
- ToC: Tree-of-Claims Search with Multi-Agent Language Models
- This paper proposes the Tree-of-Claims (ToC) framework, which models patent claim editing as a structured search problem. Through MCTS combined with EditorAgent/ExaminerAgent multi-agent collaboration, ToC jointly optimizes novelty, scope preservation, and semantic consistency, achieving an average improvement of approximately 8% in overall score over zero/few-shot LLM baselines.
- Universal Safety Controllers with Learned Prophecies
- This paper proposes UCLearn, which learns CTL (Computation Tree Logic) formulas as approximate representations of prophecies from a small number of representative plant models, replacing exact but computationally expensive tree automata to achieve efficient, scalable, and interpretable universal safety controller synthesis.
- Unsupervised Feature Selection Through Group Discovery
- This paper proposes GroupFS, the first end-to-end differentiable unsupervised feature selection framework that simultaneously discovers latent feature groups and selects the most informative ones, requiring neither predefined groupings nor label supervision.
- Using Certifying Constraint Solvers for Generating Step-wise Explanations
- This paper proposes leveraging unsatisfiability proofs generated by certifying constraint solvers as a starting point, and applies a series of simplification and transformation techniques to efficiently produce user-facing step-wise explanation sequences, achieving speedups of up to 100× over approaches that construct explanations from scratch.