🔬 Interpretability

🧠 NeurIPS2025 · 82 paper notes

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders: This paper identifies and systematically studies the phenomenon of "feature absorption" in SAEs: apparently monosemantic SAE latents fail to activate on certain tokens because their feature directions are "absorbed" by more specific sub-latents. This is shown to be an inevitable consequence of hierarchical features combined with sparsity loss, posing a fundamental challenge to using SAEs for reliable LLM interpretation.
A Unified Reasoning Framework for Holistic Zero-Shot Video Anomaly Analysis: A fully zero-shot, training-free video anomaly analysis framework that employs Intra-Task Reasoning (confidence-gated self-refinement) and Inter-Task Chaining (cascaded prompt passing from temporal detection to spatial localization to semantic understanding), achieving comprehensive improvements of 4–6% AUC over prior zero-shot methods across 4 benchmarks.
AdaptGrad: Adaptive Sampling to Reduce Noise: AdaptGrad analyzes the theoretical origin of noise in SmoothGrad—out-of-boundary (OOB) sampling behavior—and proposes adaptively adjusting the Gaussian sampling variance for each input dimension to bound the additional noise. The method nearly eliminates gradient noise while revealing richer fine-grained features, requires minimal implementation effort, and is composable with arbitrary gradient-based explanation methods.
Additive Models Explained: A Computational Complexity Approach: This paper presents a systematic computational complexity analysis of multiple explanation types for Generalized Additive Models (GAMs), covering 54 combinations of "component model × input domain × explanation method." It reveals that the explanation complexity of GAMs is highly sensitive to the type of input domain — a phenomenon never observed in other ML models such as decision trees or neural networks — thereby challenging the intuitive assumption that "additive implies interpretable."
AgentiQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation: This paper proposes AgentiQL, a multi-expert agent framework for Text-to-SQL: a reasoning agent decomposes questions into sub-problems, a coding agent generates sub-queries, a refinement step corrects column selection, and an adaptive router intelligently routes between a baseline parser and the modular pipeline. Using a 14B open-source model, AgentiQL achieves 86.07% EX on Spider, approaching GPT-4 SOTA (89.65%).
An Analysis of Concept Bottleneck Models: Measuring, Understanding, and Mitigating the Impact of Noisy Annotations: This paper presents the first systematic study on the impact of annotation noise on Concept Bottleneck Models (CBMs). It identifies approximately 23% of concepts as "susceptible concepts" that drive the majority of performance degradation, and proposes a two-stage mitigation strategy combining SAM at training time and uncertainty-guided intervention at inference time to restore model robustness.
Are Greedy Task Orderings Better Than Random in Continual Linear Regression?: This paper systematically analyzes the convergence differences between greedy task orderings (maximizing dissimilarity between consecutive tasks) and random orderings in continual linear regression. It reveals that greedy orderings are competitive with random orderings in the full-rank setting, but single-pass greedy ordering can fail catastrophically in the general-rank setting, whereas greedy ordering with repetition achieves a convergence rate of \(\mathcal{O}(1/\sqrt[3]{k})\).
ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation: ARECHO frames speech multi-metric evaluation as a chain-based autoregressive token prediction task. It designs a unified speech information tokenization pipeline to handle 87 heterogeneous metrics (numerical/categorical/bounded/unbounded), explicitly captures inter-metric dependencies (e.g., intelligibility–naturalness correlation) via dynamic classification chains, and employs two-step confidence-guided decoding to reduce error propagation. ARECHO comprehensively outperforms the UniVERSA baseline across enhancement, synthesis, and noisy speech evaluation (Avg Test MSE 23.26 vs. 96.99, −76%).
ARC-JSD: Attributing Response to Context via Jensen-Shannon Divergence Driven Mechanistic Study: ARC-JSD proposes a RAG context attribution method based on Jensen-Shannon Divergence — by comparing the JSD between model output distributions with and without specific context sentences, it localizes the context that a response depends on without fine-tuning or gradient computation. The method achieves 3× faster computation than baselines, improves Top-1 attribution accuracy by 10.7% on average, and reveals via Logit Lens that attribution-relevant attention heads are concentrated in higher layers.
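
As a rough illustration of the ARC-JSD idea above, the sketch below scores each context sentence by the Jensen-Shannon divergence between a next-token distribution computed with the full context and one computed with that sentence removed. The distributions are hard-coded toy values, and the names `jsd` and `rank_sentences` are hypothetical; in practice the distributions would come from the model's output probabilities, which this sketch does not compute.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_sentences(full_dist, ablated_dists):
    """Sort sentence indices by JSD(full, ablated), largest divergence first."""
    scores = [jsd(full_dist, d) for d in ablated_dists]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy output distributions (assumed values, not taken from the paper).
full = [0.7, 0.2, 0.1]          # with the whole context
ablated = [
    [0.68, 0.21, 0.11],         # sentence 0 removed: output barely moves
    [0.20, 0.50, 0.30],         # sentence 1 removed: output shifts strongly
    [0.60, 0.25, 0.15],         # sentence 2 removed: moderate shift
]
order, scores = rank_sentences(full, ablated)
print(order)  # the sentence whose removal changes the output most comes first
```

The sentence with the largest divergence is treated as the one the response depends on most; no gradients or fine-tuning are involved, only forward passes to obtain the distributions.
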
Auditing Meta-Cognitive Hallucinations in Reasoning Large Language Models: This paper systematically audits the generation and propagation mechanisms of hallucinations in reasoning large language models (RLLMs), finding that reflection in long CoT amplifies hallucinations through metacognitive bias rather than correcting them. Even targeted interventions at the hallucination source fail to alter final outputs (chain disloyalty), exposing critical shortcomings of existing hallucination detection methods in multi-step reasoning scenarios.
Base Models Know How to Reason, Thinking Models Learn When: Through unsupervised SAE clustering, this work discovers a taxonomy of reasoning mechanisms in thinking models, then activates the corresponding latent capabilities in base models via steering vectors. The resulting hybrid model recovers up to 91% of the performance gap between thinking and base models—without any weight updates—demonstrating that base models already possess reasoning capabilities, and that thinking models merely learn when to deploy them.
Better Estimation of the Kullback-Leibler Divergence Between Language Models: This paper proposes a Rao-Blackwellized Monte Carlo estimator for KL divergence that computes the exact KL over the next-token distribution at each position (rather than relying solely on the sampled token). The estimator is proven to be unbiased with variance no greater than that of the standard MC estimator, incurs zero additional computational overhead, and yields more stable training in an RLHF sentiment-control task, with models appearing on the Pareto frontier 78% of the time.
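
The difference between the two estimators can be seen on a toy problem. The sketch below is a minimal illustration under assumed values: two tiny first-order autoregressive "models" `p` and `q` over a two-token vocabulary, with made-up probabilities. It compares the standard Monte Carlo term (log-ratio of the sampled token) with the Rao-Blackwellized term (exact per-position KL over the next-token distribution) against a brute-force reference. All names here are hypothetical and not taken from the paper's code.

```python
import itertools
import math
import random

# Toy first-order autoregressive models over a 2-token vocabulary.
# All probabilities are made-up illustration values.
P_FIRST, Q_FIRST = [0.5, 0.5], [0.5, 0.5]
P_NEXT = {0: [0.9, 0.1], 1: [0.3, 0.7]}
Q_NEXT = {0: [0.6, 0.4], 1: [0.5, 0.5]}
SEQ_LEN = 3

def next_dist(first, table, prefix):
    """Next-token distribution given the tokens generated so far."""
    return first if not prefix else table[prefix[-1]]

def token_kl(p, q):
    """Exact KL between two next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sample_estimates(rng):
    """Draw one sequence from p; return (standard MC term, Rao-Blackwellized term)."""
    prefix, mc, rb = [], 0.0, 0.0
    for _ in range(SEQ_LEN):
        p = next_dist(P_FIRST, P_NEXT, prefix)
        q = next_dist(Q_FIRST, Q_NEXT, prefix)
        rb += token_kl(p, q)                  # exact expectation over the next token
        tok = rng.choices(range(len(p)), weights=p)[0]
        mc += math.log(p[tok] / q[tok])       # log-ratio of the sampled token only
        prefix.append(tok)
    return mc, rb

def exact_kl():
    """Brute-force KL(p || q) by enumerating every sequence."""
    total = 0.0
    for seq in itertools.product(range(2), repeat=SEQ_LEN):
        lp = lq = 0.0
        for t, tok in enumerate(seq):
            lp += math.log(next_dist(P_FIRST, P_NEXT, list(seq[:t]))[tok])
            lq += math.log(next_dist(Q_FIRST, Q_NEXT, list(seq[:t]))[tok])
        total += math.exp(lp) * (lp - lq)
    return total

def mean_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(0)
pairs = [sample_estimates(rng) for _ in range(20000)]
mc_mean, mc_var = mean_var([a for a, _ in pairs])
rb_mean, rb_var = mean_var([b for _, b in pairs])
kl_exact = exact_kl()
print(f"exact={kl_exact:.4f}  MC={mc_mean:.4f} (var {mc_var:.4f})  "
      f"RB={rb_mean:.4f} (var {rb_var:.4f})")
```

Both estimators reuse the same sampled sequences and the same next-token distributions, which is why the Rao-Blackwellized version needs no extra forward passes in this toy setting.
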
Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning: This paper proposes SPARKLE, a three-axis analytical framework (plan following, knowledge integration, subproblem decomposition) for fine-grained dissection of how RL shapes LLM reasoning behavior. The analysis reveals that RL primarily enhances knowledge integration and planning flexibility rather than plan execution. The paper further introduces SparkleRL-PSS, a multi-stage RL training pipeline that effectively exploits hard problem data via partial step scaffolding.
Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits: This paper proposes a direction-level interpretability framework based on SVD singular vectors. By applying unified SVD decomposition to augmented matrices of attention heads and MLPs, combined with a learnable diagonal mask optimized via KL+L₁, the framework reveals orthogonal low-rank subfunctions superposed within a single component — on the IOI task, retaining only ~9% of directions suffices to reproduce model behavior with KLD=0.21.
Beyond Token Probes: Hallucination Detection via Activation Tensors with ACT-ViT: This paper organizes all hidden-layer activations of an LLM into an "activation tensor" (layers × tokens × hidden dimension), treats it analogously to an image, and processes it with a ViT-based architecture (ACT-ViT) that supports joint training across multiple LLMs. The method consistently outperforms conventional probing approaches across 15 LLM–dataset combinations and demonstrates strong zero-shot/few-shot transfer to unseen datasets and unseen LLMs.
Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models: Using continuous sparsification, the authors identify bigram subnetworks containing only ~10M parameters within Transformer language models. These subnetworks are concentrated in the first MLP layer, suffice to reproduce bigram predictions (\(r>0.95\)), and cause dramatic performance degradation when ablated — demonstrating that they constitute minimal next-token prediction circuits that are both necessary and sufficient in language models.
Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers: This paper proposes Causal Head Gating (CHG), which learns a differentiable gating scalar for each attention head in a Transformer and applies positive/negative regularization to classify heads into three causal roles—facilitating, interfering, and irrelevant—without requiring manual labels or prompt templates. The framework discovers causal sub-circuits at scale and extends to Contrastive CHG for disentangling independent circuits underlying in-context learning (ICL) and instruction following.
CBMAS: Cognitive Behavioral Modeling via Activation Steering: CBMAS proposes a framework that repurposes activation steering as a continuous diagnostic tool. By conducting dense α-sweeps and decoupling injection layers from readout layers, the framework elevates cognitive bias analysis from a binary "biased / unbiased" judgment to a continuous trajectory analysis capable of tracking flip points, propagation paths, and attenuation patterns. Experiments on GPT-2 Small reveal that appeasement behavior is strongly encoded in shallow layers but decays rapidly toward deeper layers.
CHiQPM: Calibrated Hierarchical Interpretable Image Classification: CHiQPM proposes a calibrated hierarchical interpretable image classification method that selects and assigns features to classes via quadratic programming, constructs hierarchical explanation paths, and produces interpretable prediction sets via conformal prediction, retaining 99% of black-box model accuracy while providing both global and local interpretability.
Cognitive Mirrors: Exploring the Diverse Functional Roles of Attention Heads in LLM Reasoning: This paper proposes the CogQA benchmark dataset and a multi-class probing framework to systematically analyze cognitive functional specialization of attention heads in LLMs. The study reveals that cognitive heads exhibit sparsity, universality, and hierarchical functional organization; ablating cognitive heads significantly degrades reasoning performance, while amplifying them improves accuracy.
Conditional Distribution Compression via the Kernel Conditional Mean Embedding: This work presents the first compression algorithm targeting conditional distributions (rather than joint distributions), introducing a novel metric AMCMD based on the kernel conditional mean embedding (KCME) and a linear-time algorithm ACKIP for constructing compressed datasets that preserve the statistical properties of conditional distributions.
Curvature Tuning: Provable Training-free Model Steering From a Single Parameter: This paper proposes Curvature Tuning (CT), which provably modulates the curvature of a model's decision boundary by injecting a single hyperparameter \(\beta\) into the activation function. CT improves generalization and robustness without modifying weights, and when used as a fine-tuning method it requires far fewer parameters than rank-1 LoRA.
Dataset Distillation for Pre-Trained Self-Supervised Vision Models: This paper proposes Linear Gradient Matching, a dataset distillation method for pre-trained self-supervised vision models. A single synthetic image per class suffices to train a linear classifier approaching full-dataset performance, and the distilled images transfer across model architectures.
Deep Modularity Networks with Diversity-Preserving Regularization: This work augments Deep Modularity Networks (DMoN) with three diversity-preserving regularization terms—distance-based, variance-based, and entropy-based—to explicitly promote inter-cluster separation and assignment diversity in feature space, achieving significant clustering quality improvements on feature-rich graph datasets.
Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences: This paper proposes the Deep Value Benchmark (DVB), which employs a confound-then-deconfound experimental design to measure whether LLMs learn deep human values or merely memorize shallow preference patterns. Results show that the Deep Value Generalization Rate (DVGR) of all evaluated models averages only 0.30, far below chance level.
Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework: This paper proposes HAP, a Hybrid Attribution and Pruning framework that first applies fast Edge Attribution Patching (EAP) to filter high-potential subgraphs, then runs precise Edge Pruning (EP) on the reduced search space. On the IOI task with GPT-2 Small, HAP achieves a 46% speedup over pure EP while maintaining comparable circuit faithfulness, and successfully recovers S-inhibition heads that EAP alone fails to identify.
Distributional Autoencoders Know the Score: This paper establishes rigorous theoretical guarantees for the Distributional Principal Autoencoder (DPA): it derives a closed-form relationship between the level-set geometry of the optimal encoder and the score function of the data distribution, and proves that latent components beyond the manifold dimensionality are conditionally independent of the data—thereby unifying distributional learning and intrinsic dimension discovery within a single framework.
Do Different Prompting Methods Yield a Common Task Representation?: By generalizing the Function Vectors (FV) framework from few-shot demonstrations to text instructions, this paper finds that different prompting methods do not induce a unified task representation within LLMs; instead, they activate partially overlapping but largely distinct attention head mechanisms.
Dynamic Algorithm for Explainable k-medians Clustering under lp Norm: This paper presents the first explainable k-medians clustering algorithm for general \(\ell_p\) norms, achieving an approximation ratio of \(\tilde{O}(p(\log k)^{1+1/p-1/p^2})\) (improving the best known bound for \(p=2\)), along with the first dynamic variant: maintaining an explainable clustering under center insertions/deletions with \(O(d \log^3 k)\) amortized update time and \(O(\log k)\) amortized reassignments.
Dynamic Features Adaptation in Networking: Toward Flexible Training and Explainable Inference: This paper proposes DAFI (Drift-Aware Feature Importance), an algorithm that leverages distribution drift detection to dynamically switch between SHAP and MDI feature importance methods. Combined with Adaptive Random Forest (ARF), DAFI enables flexible training and efficient explainable inference in communication network scenarios where features are dynamically introduced over time.
Efficient Vision-Language Reasoning via Adaptive Token Pruning: This paper proposes Adaptive Token Pruning (ATP), a training-free plug-and-play module that selects the most informative visual tokens by fusing ViT CLS attention (intra-modal saliency) and CLIP text-image similarity (inter-modal relevance). ATP achieves less than 1% accuracy degradation on VQA/GQA/COCO Captioning in exchange for approximately 40% FLOPs reduction and 1.5× speedup.
Emergence of Linear Truth Encodings in Language Models: This paper proposes the Truth Co-occurrence Hypothesis (TCH)—that true statements tend to co-occur with other true statements—and uses a minimal single-layer Transformer toy model to provide an end-to-end demonstration of how linear truth subspaces emerge naturally through a two-phase training dynamic (memorization first → truth encoding later). This constitutes the first mechanistic explanation for the widely reported linear truth representations in LLMs.
Empowering Decision Trees via Shape Function Branching: This paper proposes the Shape Generalized Tree (SGT), which replaces the conventional linear threshold split at each internal node of a decision tree with a learnable axis-aligned shape function, enabling the capture of nonlinear feature effects within more compact tree structures while preserving interpretability.
Encoding and Understanding Astrophysical Information in Large Language Model-Generated Summaries: This work investigates whether LLM embeddings encode physically meaningful quantities derived from X-ray astronomical observations—specifically hardness ratios, power-law indices, and variability indices. Results show that structured prompt design improves clustering purity of physical attributes by 5.9%–57.5%, and sparse autoencoders reveal that LLMs infer physical parameters not explicitly stated by recognizing object types.
Evaluating LLMs in Open-Source Games: This work introduces a novel paradigm of open-source games—where agents submit programs rather than raw actions—to systematically evaluate LLMs on strategic reasoning, mutual learning, and cooperative gameplay, finding that LLMs can automatically discover approximate program equilibria.
Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions: FIxLIP proposes a game-theoretic framework based on weighted Banzhaf interaction indices that unifies the decomposition of similarity predictions in vision-language encoders (e.g., CLIP, SigLIP-2) into first-order token attributions and second-order cross-modal/intra-modal interactions, surpassing existing first-order attribution methods in both efficiency and faithfulness.
FaCT: Faithful Concept Traces for Explaining Neural Network Decisions: This paper proposes FaCT, an inherently interpretable model combining B-cos transformations and sparse autoencoders (SAE) that faithfully decomposes model predictions into concept contributions (Logit = \(\sum\) concept contributions) and faithfully visualizes each concept down to the input pixel level (concept activation = \(\sum\) pixel contributions). A DINOv2-based C²-score is also introduced to evaluate concept consistency.
Fantastic Features and Where to Find Them: A Probing Method to Combine Features from Multiple Foundation Models: This paper proposes ComBo, a lightweight probing-based adapter that compresses multi-layer activations from multiple frozen foundation models via affine projection, then fuses them with a small transformer—without backpropagation through any backbone. ComBo efficiently integrates complementary representations across models, surpassing prior probing methods and matching distillation-based methods on VTAB-1k.
Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement: This paper proposes a residual disentanglement method that decomposes LLM hidden states into four approximately orthogonal embeddings—lexical, syntactic, semantic, and reasoning—for predicting intracranial ECoG brain signals. The study finds that reasoning signals exhibit independent neural signatures both temporally (~350–400 ms) and spatially (extending beyond classical language areas into visual cortex), revealing a computational alignment between LLM reasoning and human brain processing.
FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed: This paper proposes FastDINOv2, a two-stage frequency-based curriculum learning strategy: the model is first trained on low-resolution images for 75% of epochs to learn low-frequency features and accelerate convergence, then trained at full resolution with Gaussian noise patching for the remaining 25% to balance frequency bias. The approach achieves a 1.6× speedup and 2.25× FLOPs reduction while improving robustness.
Improving Perturbation-based Explanations by Understanding the Role of Uncertainty Calibration: This paper reveals a fundamental connection between uncertainty calibration (the alignment between model confidence and actual accuracy) and the quality of perturbation-based explanation methods. It demonstrates that miscalibration of models on perturbed inputs directly degrades the quality of both global and local explanations, and proposes ReCalX, which applies perturbation-level-adaptive temperature scaling to substantially improve the robustness and fidelity of explanations.
Interpretable Next-token Prediction via the Generalized Induction Head: This paper proposes Induction-Gram (GIM), an interpretable language model that combines exact n-gram matching with fuzzy matching. By constructing a "generalized induction head" to retrieve similar sequences from the input context for next-token prediction, it achieves up to 25 percentage points improvement over interpretable baselines and a 20% improvement in fMRI brain response prediction.
Knowing When to Stop: Efficient Context Processing via Latent Sufficiency Signals: This paper proposes dynamic context cutoff, which trains lightweight classifiers to detect "information sufficiency signals" encoded in specific Transformer attention heads, enabling the model to determine when sufficient context has been gathered and terminate processing early. On 6 QA datasets, the method achieves an average accuracy improvement of 3.4% while reducing token consumption by 1.33×.
Latent Principle Discovery for Language Model Self-Improvement: STaPLe proposes a posterior-regularized Monte Carlo EM algorithm that enables small 7–8B models to autonomously discover latent "principles" that guide self-correction. Through an iterative discover-and-learn loop, the method achieves self-improvement with an 8–10% win-rate gain on AlpacaEval and an average improvement of +0.3 on MT-Bench. The discovered principles can be compressed into an interpretable constitution via clustering.
Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning: This paper proposes the Learning to Focus (LeaF) framework, which leverages gradient-guided detection to identify "confounding tokens" in training data. During knowledge distillation, these tokens are pruned to construct counterfactual samples, aligning the student model's attention to the key contextual tokens attended by the teacher model, thereby improving accuracy on mathematical reasoning and code generation tasks.
LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS: This paper presents a rigorous analysis of the unsupervised probing method CCS (Contrast-Consistent Search) and proposes its reformulation as Contrastive Eigenproblems, yielding closed-form solutions with interpretable eigenvalues. This formulation eliminates CCS's sensitivity to random initialization and naturally extends to multivariate settings.
Minimizing False-Positive Attributions in Explanations of Non-Linear Models: This paper proposes PatternLocal to address false-positive attributions caused by suppressor variables in XAI explanations of non-linear models. The method converts local discriminative surrogate weights into a generative representation, and significantly reduces false-positive feature attributions on three datasets: the XAI-TRIS benchmark, MRI artificial lesions, and EEG motor imagery.
Monte Carlo Expected Threat (MOCET) Scoring: This paper proposes the MOCET (Monte Carlo Expected Threat) scoring framework, which decomposes LLM-generated bioweapon synthesis protocols into sequential Bernoulli trials, combines k-NN semantic embedding-based success probability estimation with Monte Carlo simulation, and produces interpretable, automatable threat quantification metrics for measuring the real-world risk of LLMs in the biosecurity domain.
MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition: MoPFormer decomposes wearable sensor signals into sequences of motion primitives and models their temporal dependencies with a Transformer, surpassing state-of-the-art methods on multiple HAR benchmarks while maintaining a lightweight architecture.
nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers: nnterp is a lightweight wrapper over NNsight that provides a unified interface for accessing internal activations across 50+ Transformer model variants spanning 21 architecture families, achieved through systematic module renaming and automated validation tests. It ships with built-in interpretability methods including logit lens, patchscope, and activation steering, resolving the fundamental trade-off between the correctness issues of TransformerLens and the lack of standardization in bare NNsight usage.
OrdShap: Feature Position Importance for Sequential Black-Box Models: This paper proposes OrdShap, a feature attribution method for sequential models that, for the first time, decouples Value Importance (VI) from Position Importance (PI) for each feature, providing theoretical guarantees grounded in the Sanchez-Bergantiños game-theoretic value.
Out of Control -- Why Alignment Needs Formal Control Theory (and an Alignment Control Stack): This position paper argues for formal optimal control theory as a foundational tool for AI alignment research, and proposes the Alignment Control Stack (ACS)—a ten-layer hierarchical framework spanning from the physical hardware layer to the social governance layer—for systematically organizing and analyzing measurement, control, and interoperability across different alignment methods.
Partial Information Decomposition via Normalizing Flows in Latent Gaussian Distributions: Two complementary tools are proposed: Thin-PID is an efficient Gaussian PID algorithm (10× faster than existing methods), and Flow-PID applies normalizing flows to map arbitrary input distributions to Gaussian space before computing PID, addressing the infeasibility of PID on continuous high-dimensional data. The paper also resolves an open problem regarding whether the joint Gaussian solution is optimal.
Probabilistic Token Alignment for Large Language Model Fusion: This work reformulates the token alignment problem in LLM fusion as an Optimal Transport (OT) problem, replacing traditional hard mappings with soft probabilistic alignment via dynamic token pairing and the Sinkhorn algorithm. On 78 tasks across 6 benchmarks, PTA-LLM achieves an average improvement of +1.72% over FuseLLM, while substantially mitigating performance degradation on challenging tasks (from −13.04% to −4.07%).
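
To make the "soft probabilistic alignment" above concrete, here is a minimal sketch of the generic Sinkhorn iteration for entropy-regularized optimal transport. It is not PTA-LLM's pipeline: the cost matrix, the marginals, and the `eps` value are hypothetical illustration values, and the cost could stand for any token-to-token dissimilarity between two vocabularies.

```python
import math

def sinkhorn(cost, a, b, eps=0.2, n_iters=200):
    """Entropy-regularized OT plan between marginals a (rows) and b (columns)."""
    n, m = len(a), len(b)
    # Gibbs kernel: low cost -> high affinity.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical cost between 2 source tokens and 3 target tokens.
cost = [
    [0.0, 0.8, 0.9],
    [0.7, 0.1, 0.6],
]
a = [0.5, 0.5]        # mass on each source token
b = [0.4, 0.3, 0.3]   # mass on each target token
plan = sinkhorn(cost, a, b)
for row in plan:
    print([round(x, 3) for x in row])
```

Each row of `plan` spreads a source token's mass over several target tokens instead of committing to a single hard match; a smaller `eps` makes the plan sharper at the price of slower convergence.
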
Rectifying Shortcut Behaviors in Preference-based Reward Learning: This paper proposes PRISM (Preference-based Reward Invariance for Shortcut Mitigation), which unifies reward hacking as a shortcut learning problem and employs group-invariant kernels approximated via random feature maps to simultaneously mitigate multiple spurious correlations (verbosity, sycophancy, tone, etc.), achieving consistent improvements on out-of-distribution preference data and downstream policy models.
Saying the Unsaid: Revealing the Hidden Language of Multimodal Systems Through Telephone Games: By running multi-round "telephone games" (image→text→image loops), this paper exploits the preference biases of multimodal systems to quantify the connection strength between concepts in the system's implicit space (i.e., the "hidden language"). It contributes the Telescope dataset (10,000+ concept pairs) and establishes a scalable test-time "world map" of multimodal systems.
scPilot: Large Language Model Reasoning Toward Automated Single-Cell Analysis and Discovery: This work proposes the scPilot framework and scBench benchmark, enabling LLMs to perform "omics-native reasoning" (ONR) directly on single-cell RNA-seq data—reading marker genes, forming hypotheses, invoking tools for verification, and iteratively refining conclusions—achieving an 11% improvement in cell-type annotation accuracy and a 30% reduction in trajectory inference graph-edit distance.
Self-Supervised Contrastive Learning is Approximately Supervised Contrastive Learning: This paper theoretically proves that self-supervised contrastive learning (DCL) is approximately equivalent to a supervised contrastive loss (NSCL), with the gap vanishing at rate \(O(1/C)\) as the number of classes increases. It further proves that the global optimum of NSCL satisfies Neural Collapse (augmentation collapse + within-class collapse + Simplex ETF), and proposes a tighter few-shot error bound based on directional CDNV.
SHAP Values via Sparse Fourier Representation: This paper proposes FourierShap, an algorithm that first approximates a black-box predictor as a sparse Fourier representation and then leverages closed-form SHAP value formulas for Fourier basis functions to efficiently compute feature attributions, achieving 10–10,000× speedups over KernelShap while supporting a tunable accuracy–efficiency trade-off.
Simulating Society Requires Simulating Thought: This paper proposes a paradigm shift from "behaviorism" to "cognitive modeling" in LLM-based social simulation. The GenMinds framework models the internal reasoning processes of LLM agents via causal belief graphs, and the RECAP benchmark evaluates reasoning fidelity along three dimensions: traceability, demographic sensitivity, and intervention consistency.
Sloth: Scaling Laws for LLM Skills to Predict Multi-Benchmark Performance Across Families: This paper proposes Skills Scaling Laws (Sloth), which assumes that LLM performance is driven by low-dimensional latent skills (e.g., reasoning, instruction following). By exploiting inter-benchmark correlations, Sloth constructs scaling laws that generalize across model families, enabling prediction of large-model performance on multiple benchmarks using only a small amount of family-specific data.
SpEx: A Spectral Approach to Explainable Clustering: This paper proposes SpEx, a general spectral graph partitioning-based framework for explainable clustering that can "round" any reference clustering (without requiring centroids) into an explainable clustering via coordinate-cut decision trees, or perform reference-free clustering directly on a kNN graph.
Steering Information Utility in Key-Value Memory for Language Model Post-Training: This paper proposes InfoSteer, a lightweight method that treats the FFN layers of Transformers as associative key-value memories, promoting more complete utilization of pretrained knowledge during post-training via forward-pass intervention (boosting key coefficients of low-activation memory vectors) and backward-pass regularization (maximizing the entropy of key distributions). Across 6 models from 3 model families (Qwen/LLaMA/Gemma) and 15 in-distribution and out-of-distribution tasks, consistent improvements are observed, and steered language models exhibit adaptive information allocation behavior.
SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning: This paper proposes SynBrain, a framework that models fMRI responses as visual-semantic-conditioned probability distributions via BrainVAE, and employs an S2N Mapper for one-step semantic-to-neural-space mapping. SynBrain substantially outperforms MindSimulator on visual-to-fMRI synthesis (65% reduction in MSE, 96% improvement in Pearson correlation), and the synthesized fMRI signals effectively enhance few-shot cross-subject decoding performance.
Table as a Modality for Large Language Models: This paper proposes TaMo, a framework that treats tables as an independent modality, encoding their structural information via a hypergraph neural network and fusing the resulting structural embeddings with the text modality of an LLM. TaMo achieves an average improvement of 42.65% over pure-text methods across multiple table reasoning benchmarks, and approaches GPT-4 in terms of structural robustness.
TangledFeatures: Robust Feature Selection in Highly Correlated Spaces: This paper proposes TangledFeatures, a selection framework centered on feature stability, implementing a three-stage pipeline of correlation-graph clustering → ensemble representative selection → random forest refinement. The framework achieves highly reproducible, domain-knowledge-consistent feature subsets across resampling in highly correlated feature spaces, validated on alanine dipeptide backbone torsion angle prediction.
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?: This paper proves that when alignment maps in causal abstraction are unconstrained by linearity, any neural network can be mapped to any algorithm, rendering causal abstraction trivial and uninformative. This gives rise to the "non-linear representation dilemma"—the absence of a principled trade-off between the complexity and the fidelity of alignment maps.
The Trilemma of Truth in Large Language Models: This paper proposes sAwMIL (Sparse-Aware Multiple Instance Learning), a three-class probing framework that combines MIL and conformal prediction to classify LLM internal activations into true/false/neither, revealing that truth and falsity signals are not encoded as simple bidirectional opposites but as distributed representations spanning a multi-dimensional subspace.
Time-Evolving Dynamical System for Learning Latent Representations of Mouse Visual Cortex: This paper proposes TE-ViDS, a sequential latent variable model that decomposes visual neural activity into an external representation linked to visual stimuli and an internal representation reflecting internal states. By incorporating a time-evolving structure and contrastive learning, TE-ViDS achieves state-of-the-art decoding performance on natural scenes and videos.
How Intrinsic Motivation Shapes Learned Representations in Decision Transformers: A Cognitive Interpretability Analysis: This paper proposes a systematic post-hoc interpretability framework to analyze how intrinsic motivation (based on Random Network Distillation) shapes the geometric structure of the embedding space in Elastic Decision Transformers. The analysis reveals that different intrinsic motivation variants create fundamentally distinct representational structures—EDT-SIL promotes compact representations while EDT-TIL enhances orthogonality—and that embedding properties exhibit strong environment-specific correlations with task performance.
Toward Real-world Text Image Forgery Localization: Structured and Interpretable Data Synthesis: This paper proposes FSTS, a Fourier-series-inspired forgery synthesis framework that models the "invisible distribution" (the high-dimensional distribution of forgery operation parameters) from 16,750 real-world forgery instances collected from 67 human participants, generating synthetic training data that more closely approximates real-world forgeries and substantially improving the generalization of text image forgery localization models.
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders: This paper proposes Mixture of Decoders (MxD), which decomposes the MLP layers of LLMs into tens of thousands of sparsely activated expert sub-layers (layer-level sparsity). Each expert implements a full-rank linear transformation via Hadamard product tensor factorization. MxD significantly outperforms Transcoders on the sparsity–accuracy trade-off while maintaining interpretability.
Towards Scaling Laws for Symbolic Regression: This work presents the first systematic study of scaling laws for symbolic regression (SR). It demonstrates that end-to-end Transformer-based SR follows power-law scaling trends across three orders of magnitude of compute, and derives empirical rules for the optimal token-to-parameter ratio (\(\approx 15\)) and for scaling batch size and learning rate with model size.
Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders: This paper systematically compares the interpretability of features derived from Transformer feed-forward key-value memories (FF-KV) with those learned by sparse autoencoders (SAEs). The two approaches perform comparably on existing evaluation metrics, with FF-KV outperforming SAEs on certain dimensions, which calls into question the necessity of SAEs as a feature discovery tool.
Tropical Attention: Neural Algorithmic Reasoning for Combinatorial Algorithms: Tropical Attention replaces softmax dot-product attention with an attention mechanism grounded in tropical algebraic geometry, performing piecewise-linear reasoning in tropical projective space to align with the polyhedral decision structures of combinatorial algorithms. It is the first approach to extend neural algorithmic reasoning to NP-hard problems, comprehensively outperforming softmax baselines across three OOD generalization axes: length, magnitude, and noise.
Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing: This paper applies a circuit tracing framework to analyze the internal mechanisms of decoder-only Transformers on graph reasoning tasks, uncovering two core reasoning mechanisms: token merging and structural memorization.
URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training: This paper systematically evaluates three categories of metadata (URLs, quality scores, and topic/format domain information) as pretraining context. The key finding is that only URLs accelerate training (achieving equivalent downstream performance with 60B tokens instead of 100B), and this effect only holds under long prompts (5-shot); quality scores and topic/format domain information do not accelerate training but can be used for classifier-free guidance to enable controllable generation.
VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity: This paper proposes VADTree, a training-free video anomaly detection framework that leverages a pretrained Generic Event Boundary Detection (GEBD) model to construct a Hierarchical Granularity-aware Tree (HGTree), enabling adaptive sampling and multi-granularity reasoning for anomalous events of varying temporal spans. VADTree achieves state-of-the-art performance among training-free methods on three benchmarks—UCF-Crime, XD-Violence, and MSAD—and even surpasses certain weakly supervised approaches.
ValuePilot: A Two-Phase Framework for Value-Driven Decision-Making: This paper proposes ValuePilot, a two-phase framework that constructs value-annotated decision scenarios via a Dataset Generation Toolkit (DGT) and performs multi-criteria decision-making through a Decision-Making Module (DMM) conditioned on personalized user value preferences, outperforming strong baselines including GPT-5 in alignment with human decisions.
VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set: This paper proposes VL-SAE, a sparse autoencoder with a distance-based encoder and modality-specific decoders that maps the semantics of both visual and linguistic representations onto a unified concept set, thereby interpreting and enhancing the vision-language alignment mechanism of VLMs. The approach yields an average improvement of 0.6–0.9% on zero-shot classification and outperforms the dedicated method VCD on POPE hallucination mitigation.
What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers: This paper systematically investigates the phenomenon of "abrupt learning" in Transformer training, revealing that during the loss plateau the model has already learned partial solutions while simultaneously exhibiting output repetition bias and representation collapse. It further demonstrates that the slow learning of attention maps constitutes the key bottleneck, with findings validated in the early pretraining stages of LLMs such as Pythia and OLMo.
Why Is Attention Sparse in Particle Transformer?: This paper systematically analyzes the near-binary sparse attention phenomenon observed in Particle Transformer (ParT) after training on jet tagging tasks. Through cross-dataset comparisons and ablation studies, it demonstrates that the sparsity primarily originates from the attention mechanism itself rather than the physics-inspired interaction matrix. Nevertheless, the interaction matrix remains indispensable to final performance by influencing the argmax particle selection for the vast majority of tokens.
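Two of the notes above (InfoSteer and the FF key-value memory comparison) rely on reading a Transformer feed-forward layer as an associative key-value memory. The sketch below illustrates that reading only; it is not the method of either paper. It assumes a ReLU activation and no bias terms, and the names `ffn_standard`, `ffn_as_memory`, `keys`, and `values` are hypothetical. It shows that the usual two-layer computation equals a sum of value vectors weighted by per-key coefficients.

```python
def relu(z):
    return max(0.0, z)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def ffn_standard(x, keys, values):
    """Two-layer FFN without biases: relu(x @ K^T) @ V, using plain lists."""
    hidden = [relu(dot(x, key)) for key in keys]
    d_out = len(values[0])
    return [sum(h * v[j] for h, v in zip(hidden, values)) for j in range(d_out)]

def ffn_as_memory(x, keys, values):
    """Same computation read as a memory lookup.

    Each key row yields a coefficient for the input; the output is the
    coefficient-weighted sum of the matching value vectors.
    """
    coeffs = [relu(dot(x, key)) for key in keys]
    out = [0.0] * len(values[0])
    for c, v in zip(coeffs, values):
        for j, vj in enumerate(v):
            out[j] += c * vj
    return coeffs, out

# Toy example (made-up numbers): 3 memory slots, input dim 2, output dim 2.
x = [1.0, -2.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [2.0, 0.0]]

coeffs, out = ffn_as_memory(x, keys, values)
print(coeffs)  # per-slot key coefficients
print(out)     # matches ffn_standard(x, keys, values)
```

In this reading, a slot whose coefficient is zero contributes nothing to the output, which is the sense in which a memory vector can be described as low-activation.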