🔬 Interpretability

🧠 NeurIPS2025 · 76 paper notes

🔥 Top topics: LLM ×7 · Reasoning ×5 · Multimodal/VLM ×4 · Alignment/RLHF ×4 · Model Compression ×3

A Controllable Examination for Long-Context Language Models

This paper proposes LongBioBench, which uses synthetically generated fictional biographies as both needles and haystacks to construct a long-context LLM evaluation framework satisfying three core principles: seamless context, controllable settings, and reliable evaluation. Evaluating 18 models, the benchmark reveals that current LCLMs exhibit substantial deficiencies in reasoning and trustworthiness despite adequate retrieval performance.

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

This paper identifies and systematically studies the phenomenon of "feature absorption" in SAEs: apparently monosemantic SAE latents fail to activate on certain tokens because their feature directions are "absorbed" by more specific sub-latents. This is shown to be an inevitable consequence of hierarchical features combined with sparsity loss, posing a fundamental challenge to using SAEs for reliable LLM interpretation.

AdaptGrad: Adaptive Sampling to Reduce Noise

AdaptGrad analyzes the theoretical origin of noise in SmoothGrad—out-of-boundary (OOB) sampling behavior—and proposes adaptively adjusting the Gaussian sampling variance for each input dimension to bound the additional noise. The method nearly eliminates gradient noise while revealing richer fine-grained features, requires minimal implementation effort, and is composable with arbitrary gradient-based explanation methods.
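
The per-dimension idea can be pictured with a small SmoothGrad-style sketch. It is only an illustration under stated assumptions: the rule for choosing each standard deviation (distance to the nearest input bound divided by a hypothetical factor `z`, capped at `sigma_max`) is a simple stand-in rather than the formula from the paper, and `toy_grad` replaces a real model gradient.

```python
import random

def adaptive_sigmas(x, lo=0.0, hi=1.0, z=2.0, sigma_max=0.15):
    """Per-dimension std chosen so that about z standard deviations stay inside [lo, hi].

    Assumption: this clipping rule is a heuristic stand-in, not AdaptGrad's exact rule.
    """
    return [min(sigma_max, min(xi - lo, hi - xi) / z) for xi in x]

def smoothed_grad(grad_fn, x, sigmas, n_samples=200, seed=0):
    """Average the gradient over Gaussian-perturbed copies of x (SmoothGrad-style)."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, s) for xi, s in zip(x, sigmas)]
        g = grad_fn(noisy)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n_samples for t in total]

def toy_grad(x):
    # Gradient of the toy function f(x) = sum(x_i ** 2); stands in for a model gradient.
    return [2.0 * xi for xi in x]

x = [0.02, 0.5, 0.98]            # inputs assumed to live in [0, 1]
sigmas = adaptive_sigmas(x)      # small noise near the bounds, capped noise in the middle
attribution = smoothed_grad(toy_grad, x, sigmas)
```

Dimensions close to the input bounds receive a small noise scale, so few perturbed samples fall outside the valid range, while interior dimensions keep the larger default scale.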

Additive Models Explained: A Computational Complexity Approach

This paper presents a systematic computational complexity analysis of multiple explanation types for Generalized Additive Models (GAMs), covering 54 combinations of "component model × input domain × explanation method." It reveals that the explanation complexity of GAMs is highly sensitive to the type of input domain — a phenomenon never observed in other ML models such as decision trees or neural networks — thereby challenging the intuitive assumption that "additive implies interpretable."

AgentiQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation

This paper proposes AgentiQL, a multi-expert agent framework for Text-to-SQL: a reasoning agent decomposes questions into sub-problems, a coding agent generates sub-queries, a refinement step corrects column selection, and an adaptive router chooses between a baseline parser and the modular pipeline. Using a 14B open-source model, AgentiQL achieves 86.07% execution accuracy (EX) on Spider, approaching the GPT-4-based state of the art (89.65%).

An Analysis of Concept Bottleneck Models: Measuring, Understanding, and Mitigating the Impact of Noisy Annotations

This paper presents the first systematic study of the impact of annotation noise on Concept Bottleneck Models (CBMs). It identifies approximately 23% of concepts as "susceptible concepts" that drive the majority of performance degradation, and proposes a two-stage mitigation strategy that combines sharpness-aware minimization (SAM) at training time with uncertainty-guided intervention at inference time to restore model robustness.

Are Greedy Task Orderings Better Than Random in Continual Linear Regression?

This paper systematically analyzes the convergence differences between greedy task orderings (maximizing dissimilarity between consecutive tasks) and random orderings in continual linear regression. It reveals that greedy orderings are competitive with random orderings in the full-rank setting, but single-pass greedy ordering can fail catastrophically in the general-rank setting, whereas greedy ordering with repetition achieves a convergence rate of \(\mathcal{O}(1/\sqrt[3]{k})\).

ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation

ARECHO frames speech multi-metric evaluation as a chain-based autoregressive token prediction task. It designs a unified speech information tokenization pipeline to handle 87 heterogeneous metrics (numerical/categorical/bounded/unbounded), explicitly captures inter-metric dependencies (e.g., intelligibility–naturalness correlation) via dynamic classification chains, and employs two-step confidence-guided decoding to reduce error propagation. ARECHO comprehensively outperforms the UniVERSA baseline across enhancement, synthesis, and noisy speech evaluation (Avg Test MSE 23.26 vs. 96.99, −76%).

ARC-JSD: Attributing Response to Context via Jensen-Shannon Divergence Driven Mechanistic Study

ARC-JSD proposes a RAG context attribution method based on Jensen-Shannon Divergence — by comparing the JSD between model output distributions with and without specific context sentences, it localizes the context that a response depends on without fine-tuning or gradient computation. The method achieves 3× faster computation than baselines, improves Top-1 attribution accuracy by 10.7% on average, and reveals via Logit Lens that attribution-relevant attention heads are concentrated in higher layers.
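
The ablation idea can be sketched in a few lines: score each context sentence by the JSD between the output distribution under the full context and the distribution obtained when that sentence is removed. The function names and the toy distributions below are hypothetical; in practice the distributions would come from LLM forward passes, which this sketch does not perform.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def rank_sentences(full_dist, ablated_dists):
    """Sort sentence indices by how much removing each one shifts the output."""
    scores = [jsd(full_dist, d) for d in ablated_dists]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
full = [0.7, 0.2, 0.1]
ablated = [
    [0.68, 0.21, 0.11],  # sentence 0 removed: output barely changes
    [0.2, 0.5, 0.3],     # sentence 1 removed: output shifts a lot
]
order, scores = rank_sentences(full, ablated)
```

Here `order[0]` is the sentence whose removal changes the output most, which is the sentence the response would be attributed to under this scoring.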

Base Models Know How to Reason, Thinking Models Learn When

Through unsupervised SAE clustering, this work discovers a taxonomy of reasoning mechanisms in thinking models, then activates the corresponding latent capabilities in base models via steering vectors. The resulting hybrid model recovers up to 91% of the performance gap between thinking and base models—without any weight updates—demonstrating that base models already possess reasoning capabilities, and that thinking models merely learn when to deploy them.

Benchmarking Probabilistic Time Series Forecasting Models on Neural Activity

This paper presents the first systematic evaluation of 12 probabilistic time series forecasting models on mouse cortical calcium imaging data. PatchTST consistently achieves the best performance, with an informative prediction horizon of up to 1.5 s. Zero-shot foundation models (Chronos) fail entirely but become competitive after fine-tuning, and the intrinsic predictability ceiling of neural activity is estimated at approximately 1.5 seconds.

Better Estimation of the Kullback-Leibler Divergence Between Language Models

This paper proposes a Rao-Blackwellized Monte Carlo estimator for KL divergence that computes the exact KL over the next-token distribution at each position rather than relying solely on the sampled token. The estimator is proven to be unbiased with variance no greater than that of the standard MC estimator, incurs no additional computational overhead, and yields more stable training in an RLHF sentiment-control task, with models appearing on the Pareto frontier 78% of the time.
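
A toy comparison makes the variance reduction concrete. The sketch below assumes two tiny first-order "language models" over a two-token vocabulary (the tables `P` and `Q` are made up) and estimates the sequence-level KL(P || Q) in two ways from the same samples: the standard estimator sums the log-ratio of each sampled token, while the Rao-Blackwellized one sums the exact per-position KL given the sampled prefix.

```python
import math
import random

# Next-token distributions given the previous token; token 0 also serves as the
# start-of-sequence context. All probabilities are made-up illustration values.
P = {0: [0.6, 0.4], 1: [0.3, 0.7]}
Q = {0: [0.5, 0.5], 1: [0.4, 0.6]}
T = 4  # sequence length

def token_kl(p, q):
    """Exact KL between two next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def sample_estimates(rng):
    """Draw one sequence from P; return (standard MC, Rao-Blackwellized) estimates."""
    prev, mc, rb = 0, 0.0, 0.0
    for _ in range(T):
        p, q = P[prev], Q[prev]
        rb += token_kl(p, q)               # exact KL over the next-token distribution
        tok = 0 if rng.random() < p[0] else 1
        mc += math.log(p[tok] / q[tok])    # log-ratio of the sampled token only
        prev = tok
    return mc, rb

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
pairs = [sample_estimates(rng) for _ in range(20000)]
mc_mean, mc_var = mean_var([a for a, _ in pairs])
rb_mean, rb_var = mean_var([b for _, b in pairs])
```

Both means estimate the same quantity, but `rb_var` is much smaller than `mc_var`, because the only randomness left in the Rao-Blackwellized estimate is which prefixes were sampled.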

Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits

This paper proposes a direction-level interpretability framework based on SVD singular vectors. By applying unified SVD decomposition to augmented matrices of attention heads and MLPs, combined with a learnable diagonal mask optimized via KL+L₁, the framework reveals orthogonal low-rank subfunctions superposed within a single component — on the IOI task, retaining only ~9% of directions suffices to reproduce model behavior with KLD=0.21.

Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models

Using continuous sparsification, the authors identify bigram subnetworks containing only ~10M parameters within Transformer language models. These subnetworks are concentrated in the first MLP layer, suffice to reproduce bigram predictions (\(r>0.95\)), and cause dramatic performance degradation when ablated — demonstrating that they constitute minimal next-token prediction circuits that are both necessary and sufficient in language models.

Causal Head Gating: A Framework for Interpreting Roles of Attention Heads in Transformers

This paper proposes Causal Head Gating (CHG), which learns a differentiable gating scalar for each attention head in a Transformer and applies positive/negative regularization to classify heads into three causal roles—facilitating, interfering, and irrelevant—without requiring manual labels or prompt templates. The framework discovers causal sub-circuits at scale and extends to Contrastive CHG for disentangling independent circuits underlying in-context learning (ICL) and instruction following.

CBMAS: Cognitive Behavioral Modeling via Activation Steering

CBMAS proposes a framework that repurposes activation steering as a continuous diagnostic tool. By conducting dense α-sweeps and decoupling injection layers from readout layers, the framework elevates cognitive bias analysis from a binary "biased / unbiased" judgment to a continuous trajectory analysis capable of tracking flip points, propagation paths, and attenuation patterns. Experiments on GPT-2 Small reveal that appeasement behavior is strongly encoded in shallow layers but decays rapidly toward deeper layers.

Conditional Distribution Compression via the Kernel Conditional Mean Embedding

This work presents the first compression algorithm targeting conditional distributions (rather than joint distributions), introducing a novel metric AMCMD based on the kernel conditional mean embedding (KCME) and a linear-time algorithm ACKIP for constructing compressed datasets that preserve the statistical properties of conditional distributions.

Curvature Tuning: Provable Training-free Model Steering From a Single Parameter

This paper proposes Curvature Tuning (CT), which provably modulates the curvature of a model's decision boundary by injecting a single hyperparameter \(\beta\) into the activation function. CT improves generalization and robustness without modifying weights, and when used as a fine-tuning method it requires far fewer parameters than rank-1 LoRA.

Dataset Distillation for Pre-Trained Self-Supervised Vision Models

This paper proposes Linear Gradient Matching, a dataset distillation method for pre-trained self-supervised vision models. A single synthetic image per class suffices to train a linear classifier approaching full-dataset performance, and the distilled images transfer across model architectures.

Deep Modularity Networks with Diversity-Preserving Regularization

This work augments Deep Modularity Networks (DMoN) with three diversity-preserving regularization terms—distance-based, variance-based, and entropy-based—to explicitly promote inter-cluster separation and assignment diversity in feature space, achieving significant clustering quality improvements on feature-rich graph datasets.

Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences

This paper proposes the Deep Value Benchmark (DVB), which employs a confound-then-deconfound experimental design to measure whether LLMs learn deep human values or merely memorize shallow preference patterns. Results show that the Deep Value Generalization Rate (DVGR) of all evaluated models averages only 0.30, far below chance level.

Dense SAE Latents Are Features, Not Bugs

This paper systematically investigates frequently activating "dense latents" in sparse autoencoders (SAEs), demonstrating that they are not training artifacts but rather reflections of intrinsically dense subspaces in language model residual streams. The authors propose a six-category taxonomy of dense latents encompassing position tracking, context binding, null space, alphabetic, part-of-speech, and PCA latents.

Discovering Transformer Circuits via a Hybrid Attribution and Pruning Framework

This paper proposes HAP, a Hybrid Attribution and Pruning framework that first applies fast Edge Attribution Patching (EAP) to filter high-potential subgraphs, then runs precise Edge Pruning (EP) on the reduced search space. On the IOI task with GPT-2 Small, HAP achieves a 46% speedup over pure EP while maintaining comparable circuit faithfulness, and successfully recovers S-inhibition heads that EAP alone fails to identify.

Distributional Autoencoders Know the Score

This paper establishes rigorous theoretical guarantees for the Distributional Principal Autoencoder (DPA): it derives a closed-form relationship between the level-set geometry of the optimal encoder and the score function of the data distribution, and proves that latent components beyond the manifold dimensionality are conditionally independent of the data—thereby unifying distributional learning and intrinsic dimension discovery within a single framework.

Do Different Prompting Methods Yield a Common Task Representation?

By generalizing the Function Vectors (FV) framework from few-shot demonstrations to text instructions, this paper finds that different prompting methods do not induce a unified task representation within LLMs; instead, they activate partially overlapping but largely distinct attention head mechanisms.

Dynamic Algorithm for Explainable k-medians Clustering under lp Norm

This paper presents the first explainable k-medians clustering algorithm for general \(\ell_p\) norms, achieving an approximation ratio of \(\tilde{O}(p(\log k)^{1+1/p-1/p^2})\) (improving the best known bound for \(p=2\)), along with the first dynamic variant: maintaining an explainable clustering under center insertions/deletions with \(O(d \log^3 k)\) amortized update time and \(O(\log k)\) amortized reassignments.

Dynamic Features Adaptation in Networking: Toward Flexible Training and Explainable Inference

This paper proposes DAFI (Drift-Aware Feature Importance), an algorithm that leverages distribution drift detection to dynamically switch between SHAP and MDI feature importance methods. Combined with Adaptive Random Forest (ARF), DAFI enables flexible training and efficient explainable inference in communication network scenarios where features are dynamically introduced over time.

Efficient Vision-Language Reasoning via Adaptive Token Pruning

This paper proposes Adaptive Token Pruning (ATP), a training-free plug-and-play module that selects the most informative visual tokens by fusing ViT CLS attention (intra-modal saliency) and CLIP text-image similarity (inter-modal relevance). ATP achieves less than 1% accuracy degradation on VQA/GQA/COCO Captioning in exchange for approximately 40% FLOPs reduction and 1.5× speedup.
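
The selection step can be sketched as a score fusion followed by a top-k cut. The summary does not give the fusion rule, so the min-max normalization, the mixing weight `lam`, and the `keep_ratio` parameter below are all assumptions chosen for illustration, and the score lists are made-up values.

```python
def min_max(xs):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def select_tokens(cls_attn, text_sim, keep_ratio=0.6, lam=0.5):
    """Return the sorted indices of the visual tokens to keep.

    cls_attn: attention from the ViT CLS token to each patch (intra-modal saliency)
    text_sim: similarity between each patch and the text query (inter-modal relevance)
    """
    s_intra, s_inter = min_max(cls_attn), min_max(text_sim)
    fused = [lam * a + (1 - lam) * b for a, b in zip(s_intra, s_inter)]
    k = max(1, round(keep_ratio * len(fused)))
    top = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:k]
    return sorted(top)  # preserve the original token order

cls_attn = [0.30, 0.05, 0.20, 0.02, 0.43]
text_sim = [0.10, 0.15, 0.90, 0.05, 0.60]
kept = select_tokens(cls_attn, text_sim)  # → [0, 2, 4]
```

Token 2 survives mainly because of its text relevance and token 0 mainly because of its saliency, which is the point of combining the two signals.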

Emergence of Linear Truth Encodings in Language Models

This paper proposes the Truth Co-occurrence Hypothesis (TCH)—that true statements tend to co-occur with other true statements—and uses a minimal single-layer Transformer toy model to provide an end-to-end demonstration of how linear truth subspaces emerge naturally through a two-phase training dynamic (memorization first → truth encoding later). This constitutes the first mechanistic explanation for the widely reported linear truth representations in LLMs.

Empowering Decision Trees via Shape Function Branching

This paper proposes the Shape Generalized Tree (SGT), which replaces the conventional linear threshold split at each internal node of a decision tree with a learnable axis-aligned shape function, enabling the capture of nonlinear feature effects within more compact tree structures while preserving interpretability.

Evaluating LLMs in Open-Source Games

This work introduces a novel paradigm of open-source games—where agents submit programs rather than raw actions—to systematically evaluate LLMs on strategic reasoning, mutual learning, and cooperative gameplay, finding that LLMs can automatically discover approximate program equilibria.

Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions

FIxLIP proposes a game-theoretic framework based on weighted Banzhaf interaction indices that unifies the decomposition of similarity predictions in vision-language encoders (e.g., CLIP, SigLIP-2) into first-order token attributions and second-order cross-modal/intra-modal interactions, surpassing existing first-order attribution methods in both efficiency and faithfulness.

FaCT: Faithful Concept Traces for Explaining Neural Network Decisions

This paper proposes FaCT, an inherently interpretable model combining B-cos transformations and sparse autoencoders (SAE) that faithfully decomposes model predictions into concept contributions (Logit = \(\sum\) concept contributions) and faithfully visualizes each concept down to the input pixel level (concept activation = \(\sum\) pixel contributions). A DINOv2-based C²-score is also introduced to evaluate concept consistency.

Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement

This paper proposes a residual disentanglement method that decomposes LLM hidden states into four approximately orthogonal embeddings—lexical, syntactic, semantic, and reasoning—for predicting intracranial ECoG brain signals. The study finds that reasoning signals exhibit independent neural signatures both temporally (~350–400 ms) and spatially (extending beyond classical language areas into visual cortex), revealing a computational alignment between LLM reasoning and human brain processing.

FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed

This paper proposes FastDINOv2, a two-stage frequency-based curriculum learning strategy: the model is first trained on low-resolution images for 75% of epochs to learn low-frequency features and accelerate convergence, then trained at full resolution with Gaussian noise patching for the remaining 25% to balance frequency bias. The approach achieves a 1.6× speedup and 2.25× FLOPs reduction while improving robustness.

Improving Perturbation-based Explanations by Understanding the Role of Uncertainty Calibration

This paper reveals a fundamental connection between uncertainty calibration (the alignment between model confidence and actual accuracy) and the quality of perturbation-based explanation methods. It demonstrates that miscalibration of models on perturbed inputs directly degrades the quality of both global and local explanations, and proposes ReCalX, which applies perturbation-level-adaptive temperature scaling to substantially improve the robustness and fidelity of explanations.
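
One way to picture perturbation-level-adaptive temperature scaling is to fit a separate softmax temperature for each perturbation level on held-out data. The sketch below does this with a grid search over negative log-likelihood; the grouping by "fraction of features masked", the grid, and the toy validation sets are assumptions, and ReCalX's actual parameterization may differ.

```python
import math

def softmax(logits, temperature=1.0):
    z = [l / temperature for l in logits]
    m = max(z)                                # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def nll(samples, temperature):
    """Mean negative log-likelihood of (logits, label) pairs at a given temperature."""
    return -sum(math.log(softmax(lg, temperature)[y]) for lg, y in samples) / len(samples)

def fit_temperature(samples, grid):
    """Pick the temperature on the grid that minimizes validation NLL."""
    return min(grid, key=lambda t: nll(samples, t))

# Hypothetical validation sets keyed by perturbation level (fraction of features masked).
val = {
    0.0: [([4.0, 0.0], 0), ([3.5, 0.5], 0), ([0.2, 3.0], 1)],  # clean: confident and correct
    0.5: [([4.0, 0.0], 1), ([3.0, 0.0], 0), ([0.0, 3.5], 0)],  # perturbed: confident, often wrong
}
grid = [0.5 + 0.25 * i for i in range(39)]                     # 0.5, 0.75, ..., 10.0
temps = {level: fit_temperature(s, grid) for level, s in val.items()}
```

The overconfident perturbed level receives a larger temperature than the clean level, which flattens its predictions before they are fed into a perturbation-based explanation method.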

Knowing When to Stop: Efficient Context Processing via Latent Sufficiency Signals

This paper proposes dynamic context cutoff, which trains lightweight classifiers to detect "information sufficiency signals" encoded in specific Transformer attention heads, enabling the model to determine when sufficient context has been gathered and terminate processing early. On 6 QA datasets, the method achieves an average accuracy improvement of 3.4% while reducing token consumption by 1.33×.

Latent Principle Discovery for Language Model Self-Improvement

STaPLe proposes a posterior-regularized Monte Carlo EM algorithm that enables small 7–8B models to autonomously discover latent "principles" that guide self-correction. Through an iterative discover-and-learn loop, the method achieves self-improvement with an 8–10% win-rate gain on AlpacaEval and an average improvement of +0.3 on MT-Bench. The discovered principles can be compressed into an interpretable constitution via clustering.

Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning

This paper proposes the Learning to Focus (LeaF) framework, which leverages gradient-guided detection to identify "confounding tokens" in training data. During knowledge distillation, these tokens are pruned to construct counterfactual samples, aligning the student model's attention to the key contextual tokens attended by the teacher model, thereby improving accuracy on mathematical reasoning and code generation tasks.

LLM Probing with Contrastive Eigenproblems: Improving Understanding and Applicability of CCS

This paper presents a rigorous analysis of the unsupervised probing method CCS (Contrast-Consistent Search) and proposes its reformulation as Contrastive Eigenproblems, yielding closed-form solutions with interpretable eigenvalues. This formulation eliminates CCS's sensitivity to random initialization and naturally extends to multivariate settings.

LLM World Models Are Mental: Output Layer Evidence of Brittle World Model Use in LLM Mechanical Reasoning

Drawing on cognitive science methodology for studying mental models, this work evaluates LLM mechanical reasoning ability using TikZ code representations of pulley systems. LLMs can approximately estimate mechanical advantage and distinguish functional from non-functional systems (Studies 1 & 2), but completely fail at fine-grained structural connectivity reasoning (Study 3), indicating that LLM "world models" exist but are brittle.

Minimizing False-Positive Attributions in Explanations of Non-Linear Models

This paper proposes PatternLocal to address false-positive attributions caused by suppressor variables in XAI explanations of non-linear models. The method converts local discriminative surrogate weights into a generative representation, and significantly reduces false-positive feature attributions on three datasets: the XAI-TRIS benchmark, MRI artificial lesions, and EEG motor imagery.

Model-Behavior Alignment under Flexible Evaluation: When the Best-Fitting Model Isn't the Right One

Through large-scale model recovery experiments, this paper demonstrates that even with 4.5 million behavioral data points, flexible evaluation methods based on linear probing achieve model recovery accuracy below 80% across 20 visual models. This reveals a fundamental trade-off between predictive accuracy and model identifiability, challenging the prevailing paradigm that the best-fitting model is the most appropriate one.

Monte Carlo Expected Threat (MOCET) Scoring

This paper proposes the MOCET (Monte Carlo Expected Threat) scoring framework, which decomposes LLM-generated bioweapon synthesis protocols into sequential Bernoulli trials, combines k-NN semantic embedding-based success probability estimation with Monte Carlo simulation, and produces interpretable, automatable threat quantification metrics for measuring the real-world risk of LLMs in the biosecurity domain.

MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition

MoPFormer decomposes wearable sensor signals into sequences of motion primitives and models their temporal dependencies with a Transformer, surpassing state-of-the-art methods on multiple HAR benchmarks while maintaining a lightweight architecture.

nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers

nnterp is a lightweight wrapper over NNsight that provides a unified interface for accessing internal activations across 50+ Transformer model variants spanning 21 architecture families, achieved through systematic module renaming and automated validation tests. It ships with built-in interpretability methods including logit lens, patchscope, and activation steering, resolving the fundamental trade-off between the correctness issues of TransformerLens and the lack of standardization in bare NNsight usage.

OrdShap: Feature Position Importance for Sequential Black-Box Models

This paper proposes OrdShap, a feature attribution method for sequential models that, for the first time, decouples Value Importance (VI) from Position Importance (PI) for each feature, providing theoretical guarantees grounded in the Sanchez-Bergantiños game-theoretic value.

Out of Control -- Why Alignment Needs Formal Control Theory (and an Alignment Control Stack)

This position paper argues for formal optimal control theory as a foundational tool for AI alignment research, and proposes the Alignment Control Stack (ACS)—a ten-layer hierarchical framework spanning from the physical hardware layer to the social governance layer—for systematically organizing and analyzing measurement, control, and interoperability across different alignment methods.

Partial Information Decomposition via Normalizing Flows in Latent Gaussian Distributions

Two complementary tools are proposed: Thin-PID is an efficient Gaussian PID algorithm (10× faster than existing methods), and Flow-PID applies normalizing flows to map arbitrary input distributions to Gaussian space before computing PID, addressing the infeasibility of PID on continuous high-dimensional data. The paper also resolves an open problem regarding whether the joint Gaussian solution is optimal.

Probabilistic Token Alignment for Large Language Model Fusion

This work reformulates the token alignment problem in LLM fusion as an Optimal Transport (OT) problem, replacing traditional hard mappings with soft probabilistic alignment via dynamic token pairing and the Sinkhorn algorithm. On 78 tasks across 6 benchmarks, PTA-LLM achieves an average improvement of +1.72% over FuseLLM, while substantially mitigating performance degradation on challenging tasks (from −13.04% to −4.07%).
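
For readers unfamiliar with the Sinkhorn algorithm, here is a minimal standard-library sketch of entropy-regularized optimal transport producing a soft alignment between two small token sets. The cost matrix, the marginals, and the regularization strength `eps` are hypothetical; how PTA-LLM builds its costs and pairs tokens dynamically is not shown.

```python
import math

def sinkhorn(cost, a, b, eps=0.5, n_iters=100):
    """Entropy-regularized transport plan with row marginals a and column marginals b."""
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Rows: tokens of the source model; columns: tokens of the target model.
# Costs are made up (for instance, a normalized edit distance between token strings).
cost = [
    [0.0, 0.8, 0.9],
    [0.7, 0.1, 0.6],
]
a = [0.5, 0.5]         # mass on each source token
b = [0.4, 0.4, 0.2]    # mass on each target token
plan = sinkhorn(cost, a, b)
```

Each row of `plan` spreads a source token's mass over several target tokens instead of committing to a single hard mapping, and cheaper pairs receive more of that mass.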

Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought

This paper theoretically demonstrates the expressive advantage of continuous chain-of-thought (Coconut) on directed graph reachability: a two-layer Transformer using \(D\) continuous thought steps suffices to solve reachability on graphs of diameter \(D\), whereas discrete CoT requires \(O(n^2)\) steps, where \(n\) is the number of vertices. The core mechanism is that continuous thought vectors encode multiple search frontiers simultaneously in a "superposition state," enabling implicit parallel BFS.
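
The parallel-BFS intuition can be illustrated without any Transformer: keep a 0/1 indicator over all nodes reached so far and expand every reached node by one hop per step. This is only a sketch of the search pattern, not the paper's construction; the graph and the function name are made up.

```python
def reachable_after(edges, n, source, steps):
    """Expand a 0/1 'superposition' of reached nodes by one hop per step."""
    reached = [0] * n
    reached[source] = 1
    for _ in range(steps):
        nxt = reached[:]              # keep everything already reached
        for u, v in edges:
            if reached[u]:
                nxt[v] = 1            # all reached nodes expand in the same step
        reached = nxt
    return reached

# Directed edges; node 5 points into the graph but cannot be reached from node 0.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (5, 4)]
state = reachable_after(edges, n=6, source=0, steps=3)  # → [1, 1, 1, 1, 1, 0]
```

Three steps are enough here because every node reachable from node 0 lies within three hops, mirroring the claim that the number of continuous thought steps scales with the diameter rather than with the number of vertices.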

Rectifying Shortcut Behaviors in Preference-based Reward Learning

This paper proposes PRISM (Preference-based Reward Invariance for Shortcut Mitigation), which unifies reward hacking as a shortcut learning problem and employs group-invariant kernels approximated via random feature maps to simultaneously mitigate multiple spurious correlations (verbosity, sycophancy, tone, etc.), achieving consistent improvements on out-of-distribution preference data and downstream policy models.

Representation Consistency for Accurate and Coherent LLM Answer Aggregation

This paper proposes Representation Consistency (RC), which improves answer aggregation by analyzing the consistency of internal activations when an LLM generates multiple candidate answers. Reasoning paths that yield the same answer with highly consistent internal representations are more likely to be correct. A sparse variant, RC-S, leveraging sparse autoencoders achieves the best performance, consistently outperforming Self-Consistency across 4 LLMs and 4 reasoning datasets.

Saying the Unsaid: Revealing the Hidden Language of Multimodal Systems Through Telephone Games

By running multi-round "telephone games" (image→text→image loops), this paper exploits the preference biases of multimodal systems to quantify the connection strength between concepts in the system's implicit space (i.e., the "hidden language"). It contributes the Telescope dataset (10,000+ concept pairs) and establishes a scalable test-time "world map" of multimodal systems.

SHAP Values via Sparse Fourier Representation

This paper proposes FourierShap, an algorithm that first approximates a black-box predictor as a sparse Fourier representation and then leverages closed-form SHAP value formulas for Fourier basis functions to efficiently compute feature attributions, achieving 10–10,000× speedups over KernelShap while supporting a tunable accuracy–efficiency trade-off.

Simulating Society Requires Simulating Thought

This paper proposes a paradigm shift from "behaviorism" to "cognitive modeling" in LLM-based social simulation. The GenMinds framework models the internal reasoning processes of LLM agents via causal belief graphs, and the RECAP benchmark evaluates reasoning fidelity along three dimensions: traceability, demographic sensitivity, and intervention consistency.

Sloth: Scaling Laws for LLM Skills to Predict Multi-Benchmark Performance Across Families

This paper proposes Skills Scaling Laws (Sloth), which assumes that LLM performance is driven by low-dimensional latent skills (e.g., reasoning, instruction following). By exploiting inter-benchmark correlations, Sloth constructs scaling laws that generalize across model families, enabling prediction of large-model performance on multiple benchmarks using only a small amount of family-specific data.

Specialization after Generalization: Towards Understanding Test-Time Training in Foundation Models

This paper proposes a "specialization after generalization" framework that theoretically and empirically explains the effectiveness of test-time training (TTT) on in-distribution data under the Linear Representation Hypothesis (LRH). Foundation models are globally underparameterized, leading to concept superposition interference. TTT mitigates this by locally specializing the model—reallocating model capacity to the small subset of concepts relevant to the test task—thereby improving predictive performance without increasing model size.

SpEx: A Spectral Approach to Explainable Clustering

This paper proposes SpEx, a general spectral graph partitioning-based framework for explainable clustering that can "round" any reference clustering (without requiring centroids) into an explainable clustering via coordinate-cut decision trees, or perform reference-free clustering directly on a kNN graph.

Steering Information Utility in Key-Value Memory for Language Model Post-Training

This paper proposes InfoSteer, a lightweight method that treats the FFN layers of Transformers as associative key-value memories, promoting more complete utilization of pretrained knowledge during post-training via forward-pass intervention (boosting key coefficients of low-activation memory vectors) and backward-pass regularization (maximizing the entropy of key distributions). Across 6 models from 3 model families (Qwen/LLaMA/Gemma) and 15 in-distribution and out-of-distribution tasks, consistent improvements are observed, and steered language models exhibit adaptive information allocation behavior.

Superposition Yields Robust Neural Scaling

This paper identifies representational superposition as the core driver of neural scaling laws: in the strong-superposition regime, loss universally scales inversely with model dimension (\(L \propto 1/m\)), independent of the specific form of the data frequency distribution—consistent with empirical scaling behavior in real LLMs.

Table as a Modality for Large Language Models

This paper proposes TaMo, a framework that treats tables as an independent modality, encoding their structural information via a hypergraph neural network and fusing the resulting structural embeddings with the text modality of an LLM. TaMo achieves an average improvement of 42.65% over pure-text methods across multiple table reasoning benchmarks, and approaches GPT-4 in terms of structural robustness.

TangledFeatures: Robust Feature Selection in Highly Correlated Spaces

This paper proposes TangledFeatures, a selection framework centered on feature stability, implementing a three-stage pipeline of correlation-graph clustering → ensemble representative selection → random forest refinement. The framework achieves highly reproducible, domain-knowledge-consistent feature subsets across resampling in highly correlated feature spaces, validated on alanine dipeptide backbone torsion angle prediction.
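
The first stage of such a pipeline can be sketched as follows: link features whose absolute Pearson correlation exceeds a threshold and take connected components as clusters. The threshold, the feature names, and the data are hypothetical, and the later stages (ensemble representative selection and random forest refinement) are not shown.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def correlation_clusters(columns, threshold=0.9):
    """Group features into connected components of the |correlation| >= threshold graph."""
    names = list(columns)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(columns[a], columns[b])) >= threshold:
                parent[find(a)] = find(b)
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

columns = {
    "phi":      [1.0, 2.0, 3.0, 4.0, 5.0],
    "phi_copy": [2.1, 3.9, 6.2, 8.0, 9.9],   # nearly a rescaled copy of phi
    "psi":      [3.0, 1.0, 4.0, 1.0, 5.0],
}
clusters = correlation_clusters(columns)  # → [['phi', 'phi_copy'], ['psi']]
```

Picking one representative per cluster, rather than ranking all features individually, is what keeps the selected subset stable when near-duplicate features swap places across resamples.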

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

This paper proves that when alignment maps in causal abstraction are unconstrained by linearity, any neural network can be mapped to any algorithm, rendering causal abstraction trivial and uninformative. This gives rise to the "non-linear representation dilemma"—the absence of a principled trade-off between the complexity and the fidelity of alignment maps.

Time-Evolving Dynamical System for Learning Latent Representations of Mouse Visual Cortex

This paper proposes TE-ViDS, a sequential latent variable model that decomposes visual neural activity into an external representation linked to visual stimuli and an internal representation reflecting internal states. By incorporating a time-evolving structure and contrastive learning, TE-ViDS achieves state-of-the-art decoding performance on natural scenes and videos.

How Intrinsic Motivation Shapes Learned Representations in Decision Transformers: A Cognitive Interpretability Analysis

This paper proposes a systematic post-hoc interpretability framework to analyze how intrinsic motivation (based on Random Network Distillation) shapes the geometric structure of the embedding space in Elastic Decision Transformers. The analysis reveals that different intrinsic motivation variants create fundamentally distinct representational structures—EDT-SIL promotes compact representations while EDT-TIL enhances orthogonality—and that embedding properties exhibit strong environment-specific correlations with task performance.

Toward Real-world Text Image Forgery Localization: Structured and Interpretable Data Synthesis

This paper proposes FSTS, a Fourier-series-inspired forgery synthesis framework that models the "invisible distribution" (the high-dimensional distribution of forgery operation parameters) from 16,750 real-world forgery instances collected from 67 human participants, generating synthetic training data that more closely approximates real-world forgeries and substantially improving the generalization of text image forgery localization models.

Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

This paper proposes Mixture of Decoders (MxD), which decomposes the MLP layers of LLMs into tens of thousands of sparsely activated expert sub-layers (layer-level sparsity). Each expert implements a full-rank linear transformation via Hadamard product tensor factorization. MxD significantly outperforms Transcoders on the sparsity–accuracy trade-off while maintaining interpretability.

Towards Scaling Laws for Symbolic Regression

This work presents the first systematic study of scaling laws for symbolic regression (SR). It demonstrates that end-to-end Transformer-based SR follows power-law scaling trends across three orders of magnitude of compute, and it derives empirical rules for the optimal token-to-parameter ratio (\(\approx 15\)) and for how batch size and learning rate should scale with model size.
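
To make the token-to-parameter rule concrete, the sketch below converts a parameter count into a training-token budget. The function name and the model sizes are hypothetical; only the ratio of roughly 15 comes from the summary above, and it is an approximate empirical value.

```python
def tokens_for_params(num_params, tokens_per_param=15):
    """Training-token budget implied by a fixed token-to-parameter ratio."""
    return num_params * tokens_per_param

# Hypothetical model sizes, not taken from the paper.
for n in (1_000_000, 10_000_000, 100_000_000):
    print(n, tokens_for_params(n))
```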

Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders

This paper systematically compares the interpretability of features derived from Transformer feed-forward (FF) layer key-value memories with those learned by sparse autoencoders (SAEs). It finds that the two approaches perform comparably on existing evaluation metrics, with FF-KV outperforming SAEs on certain dimensions, and thereby questions the necessity of SAEs as a feature discovery tool.
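
For readers unfamiliar with the key-value view of FF layers, the sketch below shows one common reading, in which the layer output is a weighted sum of value vectors and each weight comes from matching the input against a key vector. The ReLU activation, the absence of biases, and the toy weights are assumptions for illustration and may not match the paper's exact setup.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def ff_as_memories(x, keys, values):
    """Return (output, coefficients) for FF(x) = sum_i relu(x . k_i) * v_i."""
    # One coefficient per memory: how strongly the input matches key k_i.
    coeffs = [relu(sum(xj * kj for xj, kj in zip(x, k))) for k in keys]
    out = [0.0] * len(values[0])
    for c, v in zip(coeffs, values):
        for j, vj in enumerate(v):
            out[j] += c * vj
    return out, coeffs

# Toy weights (hypothetical): 3 memories over a 2-dimensional hidden state.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
values = [[0.5, 0.0], [0.0, 2.0], [9.0, 9.0]]
x = [2.0, 3.0]

output, coeffs = ff_as_memories(x, keys, values)
print(coeffs)  # per-memory activations
print(output)  # weighted sum of value vectors
```

Under this reading, the per-memory coefficients are the quantities one would inspect as candidate features, in roughly the way SAE latent activations are inspected.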

Uncovering Graph Reasoning in Decoder-only Transformers with Circuit Tracing

This paper applies a circuit tracing framework to analyze the internal mechanisms of decoder-only Transformers on graph reasoning tasks, uncovering two core reasoning mechanisms: token merging and structural memorization.

Understanding Prompt Tuning and In-Context Learning via Meta-Learning

This paper systematically analyzes the theoretical foundations and limitations of prompt tuning from a Bayesian meta-learning perspective. It proves that soft prompts can achieve optimal adaptation on a single target task within the pretraining distribution, yet face fundamental limitations under multi-task mixture target distributions. Furthermore, soft prefixes can surpass the optimal hard-token sequence by manipulating activations outside the token space.

URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training

This paper systematically evaluates three categories of metadata (URLs, quality scores, and topic/format domain information) as pretraining context. The key finding is that only URLs accelerate training (reaching equivalent downstream performance with 60B tokens instead of 100B), and this effect holds only under long prompts (5-shot). Quality scores and topic/format domain information do not accelerate training, but they can be used for classifier-free guidance to enable controllable generation.

ValuePilot: A Two-Phase Framework for Value-Driven Decision-Making

This paper proposes ValuePilot, a two-phase framework that constructs value-annotated decision scenarios via a Dataset Generation Toolkit (DGT) and performs multi-criteria decision-making through a Decision-Making Module (DMM) conditioned on personalized user value preferences, outperforming strong baselines including GPT-5 in alignment with human decisions.

VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

This paper proposes VL-SAE, a sparse autoencoder with a distance-based encoder and modality-specific decoders that maps the semantics of both visual and linguistic representations onto a unified concept set, thereby interpreting and enhancing the vision-language alignment mechanism of VLMs. The approach yields an average improvement of 0.6–0.9% on zero-shot classification and outperforms the dedicated method VCD on POPE hallucination mitigation.

What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers

This paper systematically investigates the phenomenon of "abrupt learning" in Transformer training, revealing that during the loss plateau the model has already learned partial solutions while simultaneously exhibiting output repetition bias and representation collapse. It further demonstrates that the slow learning of attention maps constitutes the key bottleneck, with findings validated in the early pretraining stages of LLMs such as Pythia and OLMo.