
🔬 Interpretability

🧪 ICML2026 · 92 paper notes

📌 Same area in other venues: 📷 CVPR2026 (34) · 🔬 ICLR2026 (196) · 💬 ACL2026 (63) · 🤖 AAAI2026 (37) · 🧠 NeurIPS2025 (80) · 📹 ICCV2025 (10)

🔥 Top topics: LLM ×9 · Reasoning ×4 · Alignment/RLHF ×3 · Multimodal/VLM ×3 · Adversarial Robustness ×3

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents: This paper proposes an evaluation framework for LLM agent goal-directedness that integrates behavioral assessment with internal representation probing. In grid navigation tasks with GPT-OSS-20B, the agent behaviorally follows goals and internally encodes coarse-grained spatial maps and short-term plans, yet it can be misled by non-functional goal-like objects.
A Deep Learning Model of Mental Rotation Informed by Interactive VR Experiments: This paper constrains model design using VR interactive experiments and proposes a mental rotation model composed of a 3D equivariant spatial encoder, a neuro-symbolic object encoder, and an MLP for action decision-making. The model replicates human mental rotation behavior in terms of accuracy, number of actions, and partial response time trends.
Accurate Evaluation of Quickest Changepoint Detectors via Non-parametric Survival Analysis: This work reformulates the ARL/ADD evaluation in online quickest changepoint detection (QCD) as a right-censored survival analysis problem. By using Kaplan-Meier curves to estimate detection time and delay under finite and irregular sequence lengths, the proposed method provides more robust and less biased estimators compared to traditional methods that only count triggered samples.
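To make the right-censoring idea concrete, here is a minimal Kaplan-Meier sketch. It is not the paper's estimator: the function names, the toy data, and the use of a restricted mean (area under the survival curve up to a chosen horizon) as the summary of detection time are all assumptions made for illustration. A run that ends before any alarm is treated as censored rather than discarded.

```python
def kaplan_meier(times, detected):
    """Return [(t, S(t))] at each distinct alarm time.

    times[i]    -- alarm time, or the sequence length if no alarm was raised
    detected[i] -- True if an alarm was raised, False if the run is censored
    """
    event_times = sorted({t for t, d in zip(times, detected) if d})
    survival, curve = 1.0, []
    for t in event_times:
        # Runs still "alive" (no alarm yet, not yet censored) just before t.
        at_risk = sum(1 for u in times if u >= t)
        events = sum(1 for u, d in zip(times, detected) if d and u == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


def restricted_mean(curve, horizon):
    """Area under the step survival curve on [0, horizon]."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        if t >= horizon:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (horizon - prev_t)


# Hypothetical toy runs: two of the five sequences ended without an alarm.
times = [3, 5, 5, 8, 10]
detected = [True, True, False, True, False]

curve = kaplan_meier(times, detected)
km_mean = restricted_mean(curve, horizon=10)
# Baseline that only averages over runs where the detector fired.
naive_mean = sum(t for t, d in zip(times, detected) if d) / sum(detected)
```

On this toy data the triggered-only average is smaller than the restricted mean, which shows the direction of the bias that ignoring censored runs can introduce.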
Adaptive Querying with AI Persona Priors: The authors package "LLM response distributions conditioned on personas" into a finite mixture Bayesian prior. This allows for efficient prediction of remaining responses via closed-form posterior updates on personas after asking only a few questions, outperforming classic CAT/IRT baselines.
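The closed-form update over a finite persona mixture can be sketched as plain Bayes' rule. The two personas, their answer probabilities, and the function names below are hypothetical; the sketch only shows the mechanics of reweighting personas after one observed answer and then predicting an unasked question.

```python
def posterior_update(prior, likelihoods):
    """Bayes' rule over a finite set of personas.

    prior[k]       -- current weight of persona k
    likelihoods[k] -- probability that persona k gives the observed answer
    """
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def predictive(weights, persona_probs):
    """Mixture probability of a given answer to a not-yet-asked question."""
    return sum(w * p for w, p in zip(weights, persona_probs))


# Hypothetical numbers: two personas, uniform prior.
prior = [0.5, 0.5]
# The respondent answered "yes" to question 1; persona 0 says "yes" with
# probability 0.9, persona 1 with probability 0.2.
posterior = posterior_update(prior, [0.9, 0.2])
# Predicted probability of "yes" on question 2, given each persona's
# (assumed) response probability for that question.
p_yes_q2 = predictive(posterior, [0.8, 0.3])
```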
AI Engram: In Search of Memory Traces in Artificial Intelligence: The authors translate four classic criteria of "engrams" (memory traces) from neuroscience (specificity, reactivation, sufficiency, necessity) into algebraic constraints in parameter space. This leads to a closed-form estimator calculated in a single forward pass using input statistics. It "carves out" the causal sub-components of a concept within network weights, allowing arbitrary knowledge to be injected or erased via simple linear arithmetic—proving that this biologically motivated solution is equivalent to a natural gradient projection under the Fisher metric.
All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs: This paper systematically disproves the implicit assumption in mechanistic interpretability—"one LLM capability corresponds to one unique circuit"—using the Overlap-Aware Sheaf Repulsion (OASR) algorithm. It reveals that the same task can be supported by multiple, nearly non-overlapping sheaves (IoU ~4–11%) that satisfy requirements for being faithful, sparse, and complete. The authors propose the "Distributive Dense Circuit Hypothesis" as a theoretical explanation.
Analytic Bijections for Smooth and Interpretable Normalizing Flows: This paper constructs three families of "globally smooth ($C^\infty$), defined on the entire $\mathbb{R}$, and analytically invertible in closed-form" scalar bijections. These serve as plug-and-play replacements for splines or affine transforms in coupling flows and enable a directly parameterized radial flow that transforms the radius while preserving angular directions. The latter is highly stable to train, geometrically interpretable, and achieves comparable quality to coupling flows on targets with radial structures using three orders of magnitude fewer parameters.
Beyond Additive Decompositions: Interpretability Through Separability: This paper proposes Tensor Separable Learning (TSL), a stagewise greedy regression method that models the conditional mean as the difference between positive rank-1 separable products. By exploiting this separable structure, it avoids the signal cancellation and interaction masking issues inherent in additive decompositions under strong interactions, while its partial dependence functions can precisely recover the shapes of the fitted factors.
BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking: BLOCK-EM utilizes SAEs to identify a sparse set of internal latents that "causally control emergent misalignment." During narrow-domain SFT, a one-sided regularization is applied to prohibit the model from amplifying these latents in the "misalignment direction." This mechanism reduces emergent misalignment (EM) by an average of 93% across six fine-tuning domains with almost no degradation in in-domain task performance.
Breaking the Simplification Bottleneck in Amortized Neural Symbolic Regression: This paper proposes SimpliPy (a rule-based simplification engine 100x faster than SymPy) and Flash-ANSR (a Transformer-based amortized symbolic regression framework). Flash-ANSR matches or exceeds the established genetic programming method PySR on the FastSRB benchmark with a ~58% recovery rate, while generating increasingly concise expressions as the inference budget grows.
Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions: This paper reveals a widespread "knowledge-prediction gap" in LLMs on MCQs—correct answers are already linearly encoded in hidden layers, but the final predictions deviate. Through geometric analysis, this gap is attributed to the misalignment between knowledge and prediction subspaces. The authors propose KAPPA, which uses closed-form affine transformations to align these subspaces during inference, consistently closing the gap and improving accuracy across models and benchmarks.
CB-SLICE: Concept-Based Interpretable Error Slice Discovery: CB-SLICE utilizes the concept prediction space of Concept Bottleneck Models (CBMs) to discover and explain systematic error slices in deep learning models. Through a three-step pipeline—filtering error-prone concepts, GMM clustering for slice formation, and keyword-based concept explanation—it consistently outperforms existing methods across multiple benchmarks while providing faithful explanations directly grounded in the model's internal decision logic.
Certified Circuits: Stability Guarantees for Mechanistic Circuits: The authors propose the Certified Circuits framework, which provides provable dataset-level stability guarantees for circuit discovery in mechanistic interpretability via deletion-based randomized smoothing. This ensures that discovered circuits remain invariant under bounded edit distance perturbations of the concept dataset, resulting in more compact, accurate circuits with superior OOD generalization.
Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path: This paper proposes the "Circuit Fingerprint" hypothesis—feeding a standalone answer token into a Transformer leaves a directional signature in the latent space that corresponds exactly to the circuit path required to generate that answer. Based on this, it achieves circuit discovery through pure geometric alignment (without gradients or intervention). It further demonstrates that the same set of directions can perform activation steering, proving that "reading" and "writing" are two sides of the same geometric object.
CLARITree: Cholesky and Lookahead Accelerations for Regression with Interpretable Piecewise Linear Trees: Addressing the dilemma where "greedy regression trees are fast but inaccurate, while optimal regression trees are accurate but computationally prohibitive," CLARITree combines one-step lookahead search with rank-one Cholesky updates of the Ridge Regression Gram matrix. It learns near-optimal, sparse regression trees with linear models at the leaves, achieving accuracy close to optimal solutions while scaling an order of magnitude better than existing state-of-the-art methods.
Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation: This paper introduces PID feedback control from control theory into Sparse Autoencoder (SAE)-based activation steering. By using the integral term to accumulate error, the method overcomes the Top-K sparsity threshold which causes static steering to fail at low intensities during transitions. Temporal PID dynamically adjusts $\lambda(t)$ at each autoregressive step, achieving smooth transitions for pitch and duration in symbolic music with 62–67% lower intervention intensity and a 5% reduction in FMD degradation.
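The role of the integral term can be illustrated with a toy closed loop. Everything here is an assumption for illustration: the `PIDSteering` class, the gains, and a stand-in "plant" in which the steered feature only responds once the intensity exceeds a fixed threshold (a crude proxy for the Top-K sparsity threshold). It is not the paper's controller or its music model.

```python
class PIDSteering:
    """Discrete PID controller that outputs a steering intensity lambda(t)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, target, measured):
        error = target - measured
        self.integral += error  # accumulated error drives lambda upward
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def toy_feature_response(intensity, threshold=0.5):
    """Hypothetical plant: no effect at all below the threshold."""
    return max(0.0, intensity - threshold)


controller = PIDSteering(kp=0.2, ki=0.1, kd=0.0)
target, measured = 1.0, 0.0
intensities = []
for _ in range(200):  # one iteration per autoregressive step
    lam = controller.step(target, measured)
    measured = toy_feature_response(lam)
    intensities.append(lam)
```

In this toy loop the first few intensities sit below the threshold and have no effect, which is the failure mode of a small static intensity. The accumulated error then pushes the intensity past the threshold, and the measured value settles at the target.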
Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement: This paper formalizes the degradation phenomenon of autoregressive language models in long-sequence generation as "cognitive fatigue." It proposes the Fatigue Index (FI), a lightweight, model-agnostic online diagnostic metric that aggregates three signals: prompt attention decay, representation drift, and entropy dysregulation. The predictive power of FI for degradation (AUROC=0.976) is validated across 9 models, revealing non-monotonic scaling behavior.
CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features: By calculating the Pearson correlation between SAE activations on generation-time tokens and task correctness, CorrSteer identifies interpretable steering features. Using mean activations from positive samples as coefficients without contrasting datasets or backpropagation, it improves MMLU by +3.3% and HarmBench by +27.1% on Gemma-2 2B / LLaMA-3.1 8B, achieving a lower side-effect rate than fine-tuning.
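The selection step described above can be sketched with a plain Pearson correlation. The activation matrix, the binary correctness labels, and the helper names are hypothetical, and the sketch picks a single top feature for brevity; how many features are kept and how they are applied at generation time is not specified here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_steering_feature(activations, correct):
    """Pick the feature most correlated with correctness.

    activations[i][j] -- activation of SAE feature j on sample i
    correct[i]        -- 1 if sample i was answered correctly, else 0
    Returns (feature index, correlation, steering coefficient), where the
    coefficient is the mean activation over the correct (positive) samples.
    """
    n_features = len(activations[0])
    scores = [pearson([row[j] for row in activations], correct)
              for j in range(n_features)]
    best = max(range(n_features), key=lambda j: scores[j])
    positives = [row[best] for row, c in zip(activations, correct) if c == 1]
    return best, scores[best], sum(positives) / len(positives)


# Hypothetical data: 4 samples, 2 SAE features.
activations = [[0.0, 2.0], [0.1, 0.0], [0.0, 1.0], [0.2, 0.0]]
correct = [1, 0, 1, 0]
best, score, coefficient = select_steering_feature(activations, correct)
```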
Courtroom Analogy: New Perspective on Uncertainty-Aware Classification: This paper proposes a "courtroom analogy" perspective, modeling second-order uncertainty in classification as a structured mixture of $K$ class-advocate Dirichlet opinions under input-dependent weights. This is instantiated as the MoDEX network (comprising three lightweight heads: shared evidence $\bm{\alpha}$, class-specific advocacy strength $\tau_k$, and credibility $\bm{\omega}$). MoDEX consistently outperforms baselines such as EDL and $\mathcal{F}$-EDL across benchmarks including CIFAR/SVHN/TIN/CIFAR-10-C/CIFAR-10-LT with a single forward pass, providing semantically clear uncertainty decomposition.
Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory: This paper applies the Graded Response Model (GRM) from psychometric Item Response Theory (IRT) to LLM-as-a-Judge. It decomposes "judgment scores" into judge attributes $(\alpha, \beta)$ and latent sample quality $\theta$. Using four interpretable metrics, it systematically diagnoses whether 7 mainstream LLMs across 11 evaluation criteria act as "stable measurement instruments" through a two-stage process (intrinsic consistency + human alignment).
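For readers unfamiliar with the Graded Response Model, a minimal sketch of its category probabilities follows. It uses the standard GRM form, in which the probability of scoring at least category $k$ is a logistic function of $\alpha(\theta - \beta_k)$. Reading $\alpha$ as the judge's discrimination and $\beta$ as a list of ordered score thresholds is an assumption about how the paper's $(\alpha, \beta)$ map onto the model, and the function name is hypothetical.

```python
import math


def grm_category_probs(theta, alpha, betas):
    """Probabilities of each ordinal score under the Graded Response Model.

    theta -- latent quality of the sample being judged
    alpha -- discrimination of the judge
    betas -- increasing thresholds; len(betas) + 1 score categories
    """
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # cumulative[k] = P(score >= k); P(score >= 0) = 1, P(score >= K) = 0.
    cumulative = [1.0] + [sigmoid(alpha * (theta - b)) for b in betas] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(betas) + 1)]


# Three score levels with symmetric thresholds, average-quality sample.
probs = grm_category_probs(theta=0.0, alpha=1.0, betas=[-1.0, 1.0])
# A more discriminating judge concentrates mass on the middle score.
sharp_probs = grm_category_probs(theta=0.0, alpha=3.0, betas=[-1.0, 1.0])
```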
Dimensionality Controls When Modularity Helps in Continual Learning: This paper systematically compares "task-blocked modular recurrent networks" with "single networks" using an A→B→A sequential learning paradigm. It finds that modularity is not always beneficial—only when the initialization scale $\gamma$ compresses representations into the low-dimensional "rich" regime does modularity lead to lower interference and spontaneously organize a gradient geometry where "similar tasks overlap in subspace while dissimilar tasks are orthogonal." In the high-dimensional "lazy" regime, the two architectures show negligible differences.
Discovering Differences in Strategic Behavior Between Humans and LLMs: This paper utilizes AlphaEvolve (an LLM-based program synthesis framework) to "evolve" interpretable Python behavior models directly from behavioral data. By comparing humans with frontier LLMs in Iterated Rock-Paper-Scissors (IRPS), the study finds that Gemini 2.5 Pro/Flash and GPT 5.1 significantly outperform humans in both win rates and "opponent modeling" dimensions, whereas GPT OSS 120B exhibits deteriorating performance over time.
Discovering Implicit Large Language Model Alignment Objectives: Obj-Disco reverse-engineers opaque reward signals from RLHF/GRPO into a sparse linear combination of natural language objectives (DIR) along the "model checkpoint trajectory." By utilizing a Matching Pursuit-style greedy approach combined with dual LLM-as-Judge verification, it stably recovers >90% of reward behavior across multiple tasks and models, uncovering hidden misalignment drivers such as "relaxed restrictions on discussing illegal activities."
Discovering Interpretable Algorithms by Decompiling Transformers to RASP: This paper proposes a "decompilation" pipeline that faithfully rewrites a trained GPT-2 style Transformer into an equivalent RASP program (D-RASP), then prunes it using causal intervention into a short, readable symbolic algorithm. Experiments demonstrate the automatic recovery of known algorithms such as histogram mode, induction head copying, and Dyck counting from length-generalizing small models, providing the most direct evidence to date that Transformers indeed implement simple RASP programs internally.
Disentangling Direction and Magnitude in Transformer Representations: A Double Dissociation Through L2-Matched Perturbation Analysis: This paper employs an L2-matched perturbation protocol to demonstrate that in the Pythia series, angular (direction) perturbations are 42.9 times more destructive to language modeling loss than magnitude perturbations of equal displacement. Conversely, magnitude perturbations damage syntax (subject-verb agreement) significantly more than angular ones. This constitutes a "double dissociation" in the cognitive neuroscience sense, corresponding to attention pathways for direction and LayerNorm pathways for magnitude.
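One way to read "L2-matched" is that both perturbation types move the hidden state by the same Euclidean distance, one along its own direction and one along a rotation that keeps its norm. The sketch below constructs such a pair under that assumption; the function names and the choice of rotation plane are hypothetical and not taken from the paper.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def magnitude_perturb(h, d):
    """Rescale h so that ||h' - h|| = d while its direction is unchanged."""
    r = norm(h)
    return [x * (r + d) / r for x in h]


def angular_perturb(h, u, d):
    """Rotate h toward u so that ||h' - h|| = d while ||h'|| = ||h||.

    u must not be parallel to h, and d must be at most 2 * ||h||.
    """
    r = norm(h)
    # Remove the component of u along h (Gram-Schmidt), then normalize.
    dot = sum(a * b for a, b in zip(u, h))
    u_perp = [a - dot * b / (r * r) for a, b in zip(u, h)]
    scale = norm(u_perp)
    u_hat = [a / scale for a in u_perp]
    # A rotation by theta moves h along a chord of length 2 * r * sin(theta / 2).
    theta = 2.0 * math.asin(d / (2.0 * r))
    return [math.cos(theta) * x + math.sin(theta) * r * e
            for x, e in zip(h, u_hat)]


h = [3.0, 4.0]
h_mag = magnitude_perturb(h, d=1.0)
h_ang = angular_perturb(h, u=[1.0, 0.0], d=1.0)
disp_mag = norm([a - b for a, b in zip(h_mag, h)])
disp_ang = norm([a - b for a, b in zip(h_ang, h)])
```

Both perturbed vectors sit at distance 1 from `h`, but only `h_mag` changes the norm and only `h_ang` changes the direction, so any difference in downstream loss can be attributed to the type of change rather than its size.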
Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers: The authors disassemble the training data requirements and attention circuits of multimodal in-context learning (ICL) using a controllable two-layer Transformer and synthetic GMM data. They identify a "primary-secondary modality asymmetry": after pre-training on a high-diversity primary modality, the secondary modality requires significantly lower data complexity to unlock multimodal ICL. Through head knockout experiments on Qwen2.5-VL-3B, they validate a circuit landscape where "induction heads dominate multimodal ICL, and multimodal training primarily refines rather than reconstructs these circuits."
DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models: The study incorporates a JEPA representation alignment objective into the fine-tuning phase of masked diffusion language models. By partitioning the same sentence into a "low-mask context view" and a "high-mask target view" via different masking ratios, the model performs a single gradient-based forward pass on the context view to simultaneously compute diffusion loss and JEPA embeddings, while utilizing an EMA replica for a gradient-free forward pass on the target view. Compared to LLM-JEPA, this method saves 33% of training FLOPs and achieves consistent performance gains across 4 tasks and 2 backbones (up to +18.7 pp on GSM8K).
Do Activation Verbalization Methods Convey Privileged Information?: This paper systematically demonstrates that the performance of currently popular activation verbalization methods (Patchscopes / LIT / SelfIE), when used as LLM interpretability tools, can be entirely explained by the "verbalizer model's own knowledge" without requiring any internal activations from the target model. This implies that these tools appear effective on existing benchmarks due to flawed benchmark design, and they tend to fabricate "explanations" that the target does not actually possess when the verbalizer's knowledge exceeds that of the target.
Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in Large Language Models: This paper utilizes difference-in-means to extract "intrinsic" (no system prompt) and "prompted" (with value-based system prompts) directions representing 10 Schwartz values in the residual stream. Using SVD, these directions are decomposed into shared and unique axes. Causal evidence at both the vector and MLP neuron levels demonstrates that: the shared component carries true value semantics and generalizes across languages, replicating the Schwartz circumplex structure; the intrinsic-unique component contributes to lexical/semantic diversity; and the prompted-unique component encodes a value-agnostic "universal instruction-following" channel that increases jailbreak attack success rates from 13%–27% to 83%–97%.
Ensembling Sparse Autoencoders: A single Sparse Autoencoder (SAE) only captures a limited subset of features in the activation space. This paper adapts bagging and boosting from supervised learning to SAEs, demonstrating that "ensembling multiple SAE reconstructions" is mathematically equivalent to "concatenating their feature dictionaries." Using naive bagging and boosting implementations, the authors simultaneously improve reconstruction quality, feature stability, and downstream task performance.
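The equivalence between averaging reconstructions and concatenating dictionaries can be checked numerically on a toy case. The sketch assumes bias-free ReLU SAEs with hand-picked weights and covers only the bagging (averaging) case; all names are hypothetical.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae(encoder, decoder, x):
    """Bias-free ReLU SAE: returns (features, reconstruction)."""
    features = [max(0.0, a) for a in matvec(encoder, x)]
    return features, matvec(decoder, features)


# Two toy SAEs on 2-d activations with 2 features each.
enc_a = [[1.0, 0.0], [0.0, 1.0]]
dec_a = [[1.0, 0.0], [0.0, 1.0]]
enc_b = [[1.0, 1.0], [1.0, -1.0]]
dec_b = [[0.5, 0.5], [0.5, -0.5]]
x = [2.0, -1.0]

# Bagging: average the two reconstructions.
_, recon_a = sae(enc_a, dec_a, x)
_, recon_b = sae(enc_b, dec_b, x)
bagged = [(a + b) / 2 for a, b in zip(recon_a, recon_b)]

# One larger SAE: stack the encoder rows, concatenate the decoder columns,
# and scale the decoder by 1/2 to absorb the averaging.
enc_cat = enc_a + enc_b
dec_cat = [[w / 2 for w in row_a + row_b] for row_a, row_b in zip(dec_a, dec_b)]
features_cat, recon_cat = sae(enc_cat, dec_cat, x)
```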
Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning: This paper reinterprets models performing reasoning via iterative latent variable updates as learned attractor dynamical systems. It proposes Equilibrium Reasoners (EqR), which use two lightweight training interventions—Random Initialization (RI) and Path Noise Injection (NI)—to shape the attractor landscape. Combined with a "Depth (iteration steps $D$) + Breadth (random restarts $B$)" test-time scaling strategy and a selection rule based on residual convergence, EqR improves the exact accuracy on Sudoku-Extreme from 2.6% (feedforward) to 99.8% (equivalent to 40,000 layers) while being trained with only 16 iterations.
Expand Neurons, Not Parameters: By "splitting" each neuron into $\alpha$ sparse sub-neurons that partition the original input edges while keeping the total number of non-zero parameters constant, feature interference (polysemanticity) between neurons can be significantly reduced. This leads to consistent accuracy improvements across Boolean tasks and real-world vision tasks such as CLIP, CNN, and ImageNet.
ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior: ExPLAIND unifies three traditionally separate interpretability strands—"model component attribution, data attribution, and training trajectory attribution"—into a single theoretical framework. By strictly rewriting models trained with AdamW as kernel machines (an extension of the Exact Path Kernel, EPK), it derives additive influence scores indexed by parameter/sample/training step. These can be accumulated along any dimension to explain model behavior at any granularity, newly characterizing the learning phases of Grokking and the two-stage dynamics of EuroLLM pre-training.
Formal Concept Lattices are Good Semantic Scaffolds for Concept-Based Learning: To be added after in-depth reading.
Formalizing the Binding Problem: This paper formalizes the "binding problem in neural networks" as the mutual information $I(O;Z)$ regarding the object code $O$ within the representation $Z$. By designing autoregressive probabilistic probes to measure binding information in ViTs such as DINOv2 and CLIP, the study finds that the [CLS] token encodes <50% of binding information with a structure approximating a quadratic form, while an attention probe on the full set of spatial tokens recovers ~92% of the binding information.
From Rashomon Theory to PRAXIS: Efficient Decision Tree Rashomon Sets: PRAXIS utilizes a "fast but approximate" proxy algorithm (an improved version of LicketySPLIT) to estimate the optimal objective value of subproblems, enabling "expand-on-demand" pruning search for sparse decision tree Rashomon sets. This reduces runtime and memory complexity from being exponential in tree space to "polynomial time per output tree," successfully processing datasets with 11M samples and 472 features while maintaining recall $\ge 0.98$.
GEM: Geometric Entropy Mixing for Optimal LLM Data Curation: GEM reformulates the LLM pre-training data categorization problem as a variational objective involving vMF mixtures on a hypersphere combined with balance regularization. Solved via a provably monotonic Minorize-Maximize (MM) algorithm and distilled into a FastText classifier via a Teacher-Student setup, GEM achieves an average improvement of approximately 1.2% across DoReMi, Perf, and RegMix frameworks on 1.1B models.
Global Plane Waves from Local Gaussians: Periodic Charge Densities in a Blink: ELECTRAFI predicts the parameters of a set of anisotropic Gaussians in real space, then utilizes the analytical Fourier transform of Gaussians combined with the Poisson summation formula to compute the plane wave coefficients of the periodic crystal's charge density in reciprocal space in a single pass. A single inverse FFT yields the full-field density. While its NMAE is comparable to or better than ChargE3Net, inference is $463\times$ to $633\times$ faster, reducing the total end-to-end DFT time by $\sim 20\%$.
Grokking: From Abstraction to Intelligence: This paper provides a unified explanation of the grokking phenomenon through the lens of structural simplification (Occam's Razor). It demonstrates that during training, models undergo four synchronized "internal consolidations": causal mediation degradation, manifold collapse to a $\mathbb{Z}_{97}$ circle, spectral energy concentration into sparse Fourier modes, and a sharp drop in BDM algorithmic complexity. Using an analytically tractable Singular Feature Machine (SFM), the authors prove this is equivalent to a phase transition driven by free energy.
How Few-Shot Examples Add Up: A Causal Decomposition of Function Vectors in In-Context Learning: This paper performs a prompt-granularity causal analysis of the formation mechanism of the function vector (FV) for n-shot prompts. It demonstrates that the FV can be linearly superimposed as a weighted sum of sub-FVs from individual examples, where the weights are determined by FV-head attention. Through 2×2 QK/V causal intervention, the study shows that contextualization primarily improves FV quality via the QK path (rather than V) by concentrating the model's attention on unambiguous demonstrations.
How Language Models Process Negation: This paper employs mechanistic interpretability to dissect the internal circuits of Llama-3.1-8B and Mistral-7B when processing negation sentences like "X that is not Y is __". It discovers that models can actually "perform" negation (middle-layer attention constructs the $\bar Y$ representation directly at the final position, e.g., "not gas" → solid). However, this is suppressed by "shortcut" attention heads in later layers. Ablating these heads via "attention sinking" achieves an absolute accuracy improvement of up to 17% on negation tasks.
IdEst: Assessing Self-Supervised Learning Representations via Intrinsic Dimension: This paper introduces IdEst: utilizing the Minimum Spanning Tree dimension estimator $\mathrm{dim}_{\mathrm{MST}}$ to measure the intrinsic dimension (ID) of self-supervised representations. Using this unlabeled geometric quantity as a proxy for downstream linear probe accuracy, it achieves a Spearman $\rho \approx -0.8$ across 33 SSL models and enables unlabeled hyperparameter selection.
In Defense of Information Leakage in Concept-based Models: This is a position paper: the authors defend "information leakage" in concept models, an often-criticized phenomenon. They point out that when concept annotations are naturally incomplete in real-world scenarios, moderate "benign leakage" is actually a necessary condition for building accurate and intervenable models, and they provide a loss $\mathcal{L}_{\text{int}}$, requiring only one extra forward pass, to induce this benign leakage.
Interpretability Can Be Actionable: This position paper argues that "interpretability research lacks evaluation criteria rather than new methods." It advocates for actionability—the ability of insights to drive specific decisions or interventions outside the interpretability field—as the core evaluative dimension. The authors define actionability via two dimensions (concreteness and validation), analyze systemic barriers, identify five high-leverage application domains, and provide a 6-step checklist for researchers.
Interpretable Self-Supervised Learning via Representer Landmarks and Nyström Approximation: KREPES utilizes eNTK to approximate arbitrary SSL models as kernel models, then leverages the Representer Theorem to express representations as kernel-weighted combinations of "landmark samples." By employing Nyström approximation and one-step GGN-Newton, it analytically solves the influence coefficients for non-convex objectives like SimCLR/BYOL/VICReg/Barlow Twins, enabling unsupervised auditing of SSL latent spaces at a scale of 1M+ samples.
IQA-Spider: Unifying Multi-Granularity Image Quality Assessment with Reasoning, Grounding and Referring: This paper proposes IQA-Spider, a multi-granularity image quality assessment method that unifies four types of tasks—"global quality description + local quality description + pixel-level grounding + region-level referring"—into a single LMM framework. Accompanied by a multi-task dataset of 33K scale, it introduces a training-free text-to-point paradigm that directly maps location word logits from the language model to point prompts for SAM. It comprehensively outperforms existing specialized models like Q-Instruct and Q-Ground on multi-granularity IQA benchmarks.
Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models: The authors perform the first large-scale layered mechanistic analysis of six mainstream Tabular Foundation Models (TFMs). They find that middle and late layers primarily perform "iterative refinement" with significant redundancy. Based on this, they design a single-layer recurrent TFM using only 20% of parameters that nearly matches the performance of the original six-layer version.
MAAT: Heterogeneous Partial Observation State Reconstruction Based on Knowledge-Guided Kernel Regression: MAAT reformulates the problem of "recovering a physically consistent latent state trajectory from sparse, heterogeneous, and noisy observations" as a constrained kernel ridge regression problem in Reproducing Kernel Hilbert Space (RKHS). It integrates observation operators, smoothness, and physical priors (e.g., non-negativity, conservation, monotonicity) into a unified objective function. This provides high-quality trajectories with analytical time derivatives for downstream symbolic regression (SINDy / PySR), reducing reconstruction MSE by 1–3 orders of magnitude across 9 synthetic benchmarks and real COVID-19 data.
LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs: This paper proposes LatentLens, a training-free interpretability method that uses contextualized text token representations from a large corpus as a reference to perform nearest-neighbor retrieval for visual tokens at each layer of a VLM, returning sentence-level descriptions. The study shows that previously common methods like LogitLens/EmbeddingLens significantly underestimate the interpretability of visual tokens (on average 68% interpretable with LatentLens vs. 24%/32%) and reveals a "mid-layer leap" phenomenon.
Learn from A Rationalist: Distilling Intermediate Interpretable Rationales: This paper proposes REKD, which introduces knowledge distillation into the "select-predict" rationale extraction framework. It enables a small student model to simultaneously mimic a teacher's feature selection distribution and final prediction distribution. By tying the distillation temperature to the Gumbel-Softmax annealing schedule, an implicit "soft-to-hard selection" curriculum is formed, improving the RE accuracy of ViT-Tiny on CIFAR-10 from 0.797 to 0.936.
Learning Coherent Representations: A Topological Approach to Interpretability: This paper introduces coherence, a geometric property inspired by neural coding that requires rows and columns of the sample-feature matrix to be topologically interleaved under Vietoris-Rips filtration. By providing a differentiable Coh loss, the method achieves topologically aligned and semantically readable features on autoencoders and BERT token embeddings, significantly outperforming $L^1$ sparsity.
LLMs Lean on Priors, Not Programming Language Semantics: The authors construct PLSemanticsBench—pairing a featherweight C language $\text{C}^{\star}$ with two formal systems: small-step operational semantics $\mathbb{S}$ and K semantics $\mathbb{K}$. By systematically perturbing semantics through KeywordSwap (swapping operators like +/-) and KeywordObf (replacing them with rare Caucasian-Albanian symbols), the study evaluates 11 frontier LLMs. Findings show that while models achieve up to 90% accuracy in final state prediction under standard semantics, accuracy drops by 40–60 percentage points under semantic perturbation. Long-range rule maintenance accuracy peaks at only 35%, suggesting that contemporary LLMs rely primarily on pre-trained lexical priors rather than explicit formal rule reasoning.
Manifold-Aligned Guided Integrated Gradients for Reliable Feature Attribution: This paper proposes MA-GIG: moving the "feature selection based on low gradient magnitude" strategy of Guided IG from pixel space to the latent space of a pre-trained VAE. By utilizing the decoder Jacobian to map axis-aligned updates in the latent space to updates in the tangent space of the data manifold, the method avoids high-gradient noise regions while ensuring the integration path remains close to the true data manifold, resulting in more reliable attributions.
Memorization Dynamics of Fill-in-the-Middle Pretraining: The authors trained a pair of Llama 3.2 3B models (one using standard LTR and one using FIM) with identical architectures, data, and compute. By systematically comparing verbatim memorization behavior on repeated Gutenberg corpora, they find that FIM spreads probability mass across more "partial reconstructions" (showing stronger short-span/overlap recall that grows approximately linearly with repetitions), whereas LTR excels at long-span, high-confidence continuations. Furthermore, FIM memorization remains heavily dependent on the prefix, with the suffix serving only as a secondary signal.
MiniMax Learning of Interpretable Factored Stochastic Policies from Conjoint Data, with Uncertainty Quantification: This paper reformulates traditional conjoint analysis—moving from "estimating AMCE marginal effects" to "learning interpretable product-form Categorical stochastic policies over an exponential factor action space." It provides a closed-form solution with $L_2$ trust regions under a second-order interaction model, a differentiable general solution, and a two-player minimax extension incorporating primary election systems. By propagating uncertainty from the outcome model to policy probabilities and values via the Delta method, it successfully brings adversarial equilibrium "vote shares" back to historical ranges for the first time in the 2016 US Presidential conjoint data.
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality: MUSE attributes the "understanding-generation" zero-sum dilemma in unified visual tokenizers to manifold misalignment. It proposes the Gradient Orthogonality Hypothesis: inject semantics into $W_V$ while routing structural gradients through $W_{Q,K}$. Through Synergistic Blocks, DINOv3 topological alignment, and NCE semantic anchoring, it achieves complete decoupling. As a result, the tokenizer reaches a gFID of 3.08 alongside 85.2% linear probing accuracy (surpassing the InternViT-300M teacher's 82.5%), which the authors present as the first instance of true "mutual reinforcement" rather than a trade-off.
Neural Collapse by Design: Learning Class Prototypes on the Hypersphere: The paper unifies "Classifier Learning (CE)" and "Supervised Contrastive Learning (SCL)" as prototype contrast on the hypersphere. By introducing two new losses, NTCE/NONL (to fix the CE pathway), and a Fixed Prototype classifier (FP) (to fix the SCL pathway), the authors ensure that Neural Collapse (NC) is achieved "by design." This approach shows comprehensive advantages in accuracy, transferability, long-tail classification, and robustness.
OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization: Addressing the issue where GRPO-style reasoning RL is dominated by a few tasks due to the inherent heterogeneity of social behavior data (10 tasks across emotion/cognition/pathology/socializing, spanning speech/vision/text modalities), this paper proposes HARPO. By approximating the contribution of each sample and task to policy updates using advantage magnitudes, HARPO derives structured modulation factors via "geometric mean reference + reciprocal ratio" combined with inertial smoothing. Trained on Qwen 2.5-Omni-7B, OmniSapiens-7B 2.0 achieves the top average rank across multiple tasks, wins all 5 zero-shot tasks, improves reasoning consistency from 66.5% to 87.7%, and compresses token count to 19.86.
On the Relationship Between Activation Outliers and Feature Death in Sparse Autoencoders: This paper identifies the true root cause of the "dead feature" problem in SAEs as the geometric properties of the activation distribution rather than training dynamics. It quantifies the severity of "dimension-level outliers" using $\gamma=\|\bm{\mu}\|/\|\bm{\sigma}\|$, analytically predicts the death rate from initialization (Spearman $\rho=0.82\sim0.89$ across 454 model-layer combinations), and demonstrates that mean-centering alone can reduce the death rate of high-$\gamma$ models like AlphaFold3/ESM3 from 70%+ to near zero.
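The outlier ratio $\gamma$ and the effect of mean-centering can be sketched in a few lines. This is a minimal illustration on synthetic data, assuming $\bm{\mu}$ and $\bm{\sigma}$ are the per-dimension mean and standard deviation of the activations; the function names `outlier_ratio` and `mean_center` are hypothetical and not taken from the paper.

```python
import math
import random


def _norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def outlier_ratio(activations):
    """gamma = ||mu|| / ||sigma||, with mu and sigma computed per dimension (assumed reading)."""
    n, d = len(activations), len(activations[0])
    mu = [sum(x[j] for x in activations) / n for j in range(d)]
    sigma = [
        math.sqrt(sum((x[j] - mu[j]) ** 2 for x in activations) / n)
        for j in range(d)
    ]
    return _norm(mu) / _norm(sigma)


def mean_center(activations):
    """Subtract the per-dimension mean from every activation vector."""
    n, d = len(activations), len(activations[0])
    mu = [sum(x[j] for x in activations) / n for j in range(d)]
    return [[x[j] - mu[j] for j in range(d)] for x in activations]


# Synthetic activations: unit-variance noise plus one dimension with a large offset.
random.seed(0)
acts = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(500)]
for row in acts:
    row[3] += 50.0  # a single "outlier dimension"

gamma_raw = outlier_ratio(acts)
gamma_centered = outlier_ratio(mean_center(acts))
print(gamma_raw, gamma_centered)
```

On this toy data the large offset in a single dimension dominates $\|\bm{\mu}\|$, and centering removes it; whether that translates into fewer dead features is the paper's empirical claim, not something this sketch checks.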
Optimal Attention Temperature Improves the Robustness of In-Context Learning under Distribution Shift in High Dimensions: Within the high-dimensional linear regression ICL framework, this paper adopts "approximate softmax attention"—a surrogate that preserves row-wise normalization and temperature selectivity while remaining analytically solvable—to derive the closed-form solution for ICL generalization error and an explicit expression for the optimal attention temperature $\tau_{\text{opt}}$. It proves that correctly tuning the inference-time temperature can recover near-Bayes-optimal performance and validates this "lightweight knob" in real-world QA tasks using GPT-2 and Llama2-7B.
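To make the "temperature knob" concrete, the sketch below applies a temperature to ordinary softmax attention weights. It uses standard softmax, not the paper's analytically solvable surrogate, and the convention of dividing scores by $\tau$ is an assumption; `softmax_with_temperature` and `entropy` are illustrative names.

```python
import math


def softmax_with_temperature(scores, tau):
    """Row-normalized attention weights for one query, with scores divided by tau."""
    scaled = [s / tau for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(weights):
    """Shannon entropy (in nats) of an attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


scores = [2.0, 1.0, 0.5, -1.0]
w_hot = softmax_with_temperature(scores, tau=2.0)   # flatter, less selective
w_cold = softmax_with_temperature(scores, tau=0.5)  # sharper, more selective
print(entropy(w_hot), entropy(w_cold))
```

Lower $\tau$ concentrates weight on the highest-scoring context items, while higher $\tau$ averages more broadly; the paper's contribution is a closed-form choice of $\tau$ under distribution shift, which this sketch does not derive.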
Physics from Video: Identifiability of Time-Invariant Second-Order ODEs under Minimal Trajectory Conditions: This paper provides the first structural identifiability theorem for identifying second-order linear ODE parameters $(\gamma_1,\gamma_0)$ from raw video using only an encoder (without a decoder/pixel reconstruction). It characterizes the boundary between "single trajectory sufficiency vs. three trajectories necessity" via a geometric condition called level-set slope coverage. It proves that underdamped systems are identifiable from a single video, while other regimes require three distinct trajectories, and proposes a finite-sample estimator using "variance lower bound regularization + central difference."
PINE: Pruning Boosted Tree Ensembles with Conformal In-Distribution Prediction Equivalence: PINE contracts the "equivalence constraint" of faithful pruning for boosted tree ensembles from the entire input space to an "in-distribution region" $\mathcal{X}_{\text{ID}}(\alpha)$ defined by Chow-Liu tree likelihood and split conformal calibration. Using a single parameter $\alpha$ to smoothly control the compression-fidelity trade-off, it improves compression rates by up to 30% relative to FIPE across 12 public tabular datasets while providing provable guarantees that the probability of "prediction consistency before and after pruning" is at least $1-\alpha$.
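The split conformal step that defines an in-distribution region can be sketched generically. The snippet assumes each point has a scalar nonconformity score (for instance a negative log-likelihood; the Chow-Liu model itself is not implemented here) and uses the standard split conformal quantile; `conformal_threshold` and `in_distribution` are hypothetical names.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Standard split conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points: every input is accepted
    return sorted(cal_scores)[k - 1]


def in_distribution(score, threshold):
    """A point belongs to the region when its score does not exceed the threshold."""
    return score <= threshold


# Toy calibration scores (assumed values, higher = more atypical).
cal_scores = [3.0, 1.0, 7.0, 5.0, 2.0, 6.0, 4.0]
q_hat = conformal_threshold(cal_scores, alpha=0.25)
print(q_hat, in_distribution(5.5, q_hat), in_distribution(6.5, q_hat))
```

Under exchangeability, a fresh point falls at or below this threshold with probability at least $1-\alpha$, which is the kind of coverage statement a pruning-equivalence guarantee restricted to $\mathcal{X}_{\text{ID}}(\alpha)$ can build on.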
PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding: PolySAE introduces second- and third-order polynomial terms based on shared low-rank projections alongside the standard linear decoder of Sparse Autoencoders (SAEs). With a minimal parameter overhead (~3% on GPT-2 small), it explicitly models multiplicative interactions between sparse features. Across 4 LLMs and 3 SAE variants, it improves probe F1 by approximately 8%, increases the 1-Wasserstein distance of class-conditional distributions by 2–10x, and enables causal steering of compositional semantics using learned interaction directions.
Position: Ideas Should be the Center of Machine Learning Research: The authors propose the "Ideas First" stance: treating "idea → observable signature → tailored experiment" as the core evaluation unit of machine learning research. This approach opposes treating leaderboard gains or idealized theorems as ends in themselves, aiming to bridge the theory-practice gap while lowering the participation threshold for researchers with limited compute resources.
Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance: The authors argue that rather than continuing trial-and-error with massive real-world corpora, researchers should design "data probes"—synthetic sequences sampled from completely known stochastic processes. By training/fine-tuning LLMs on these and feeding generated results back into the known distribution for likelihood analysis, the question of "what data teaches the model what" can be elevated from empirical heuristics to falsifiable scientific propositions.
Position: Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!: This is a position paper from Kambhampati’s team. The core claim: labeling the "intermediate tokens" generated by reasoning models (such as DeepSeek-R1) before providing an answer as "reasoning traces" or "thinking traces" is a form of dangerous anthropomorphism. Such labeling (1) is wishful thinking, (2) largely lacks empirical support, (3) creates false trust in models, and (4) pushes the community toward meaningless research directions. The authors use a series of experimental findings (A* maze trace swapping, decoupling trace length from problem complexity, human trust experiments) to argue that trace semantics and final answer correctness are fundamentally decoupled. They call for the community to stop assigning "user-facing interpretability" to intermediate tokens; trust should instead derive from the verification of the solution itself.
Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered: This is a position paper where the authors argue that Zeroth-Order (ZO) optimization in deep learning is "underexplored" rather than "underpowered." They present six claims (P1–P6) across three main axes: algorithms, systems, and evaluation. Their core stance is that by moving away from the paradigm of "full-space element-wise estimators" toward subspace/spectral domain estimation, leveraging system-level dividends of forward-only unidirectional flows, and adopting de-confounded evaluation protocols, ZO can evolve from a niche tool for memory-efficient fine-tuning into a scalable training paradigm.
Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems: This paper empirically examines two implicit assumptions of end-to-end prompt optimization in compound AI systems (coupling between agents and the worthiness of single-agent prompt optimization) using 18,000 grid evaluations and 144 optimization runs. It finds that neither assumption holds for most mid-tier models (49% of runs perform worse than zero-shot; A×B interaction term $p>0.52$). Consequently, a two-stage diagnostic framework is proposed (an ANOVA coupling prediction costing \$80, plus a 10-minute headroom test costing \$5) to transform the decision of whether to optimize prompts from a coin flip into a quantifiable engineering choice.
Prototype Transformer: Towards Language Model Architectures Interpretable by Design: ProtoT replaces the $O(N^2)$ self-attention in Transformers with linear communication channels driven by $R$ learnable "prototype vectors" (composed of write/read gates + time-discounted prefix mean). This forces each prototype to automatically bind to a namable concept (e.g., "woman," "COVID," "New Zealand") during training, enabling "surgical" concept-level editing of model behavior while achieving text generation Elo scores that surpass LLaMA of the same scale.
Query Circuits: Explaining How Language Models Answer User Prompts: This paper proposes the query circuit discovery task—directly tracing sparse subnetworks within the original LLM to explain "why the model produced a specific output for a specific input." It introduces a more robust fidelity metric, NDF, and a Best-of-N (BoN) sampling algorithm, enabling circuits comprising only 1.3% of the model's edges to recover approximately 60% of single-instance behavior on MMLU.
Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects: Addressing the limitation where Logit Lens only considers "direct effects" and fails to interpret a large number of SAE features, this paper proposes Query Lens: it simultaneously utilizes encoder-side key features and decoder-side value features, incorporating the indirect effects (Jacobian products of the residual stream) generated by downstream modules into the projection. This provides coherent input/output token explanations for previously "uninterpretable" features.
Rational Sparse Autoencoder: The hardcoded ReLU/JumpReLU/TopK encoder gates in Sparse Autoencoders (SAEs) are replaced with an element-wise trainable rational function $r(t)=P(t)/Q(t)$. This is combined with a two-step upgrade process—"Copying teacher weights + Remez fitting coefficients → Unfreezing and fine-tuning"—allowing any pre-trained SAE to strictly improve reconstruction fidelity without sacrificing sparsity or interpretability, while adding only a few scalar parameters.
Riemannian Generative Decoder: This paper addresses the pain point where Riemannian VAEs require manual design of complex probability densities for each manifold. It proposes the Riemannian Generative Decoder (RGD), which discards the encoder entirely and treats the latent of each sample as a free parameter optimized directly with a Riemannian optimizer (RiemannianAdam). By introducing "input noise scaled by the inverse local metric" as geometric regularization, RGD recovers more faithful geometries on synthetic branching trees, human mitochondrial DNA, and cell-cycle scRNA-seq data, demonstrating superior numerical stability in high dimensions compared to VAE baselines.
Scalable Circuit Learning for Interpreting Large Language Models: CircuitLasso transforms "circuit discovery in mechanistic interpretability" from expensive intervention-based methods into a sparse linear regression (Lasso) surrogate. By using only observational data and relying on $\ell_1$ penalties plus block upper-triangular constraints to identify sparse dependency skeletons between components, it enables circuit discovery directly in high-dimensional SAE feature spaces for the first time. It achieves near-SOTA structural accuracy on InterpBench with a 2–3× speedup and applies the discovered circuits to downstream domain generalization debiasing.
ShaplEIG: Bayesian Experimental Design for Shapley Value Estimation: For expensive games where evaluation budgets are extremely limited (e.g., requiring model retraining), this work utilizes a Gaussian Process (GP) with a Hamming kernel as a surrogate for the value function. It adaptively selects the next coalition based on the "Expected Information Gain (EIG) for the Shapley values" and reduces the EIG computation complexity from $O(4^p t)$ to $O(p^4 + t^3)$.
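For context on why evaluation budgets matter, the sketch below computes exact Shapley values by enumerating every coalition, which needs on the order of $2^p$ calls to the value function. It is the textbook definition, not the GP surrogate or EIG acquisition described above; the toy game and the name `exact_shapley` are illustrative assumptions.

```python
from itertools import combinations
from math import factorial


def exact_shapley(players, value):
    """Exact Shapley values via the weighted sum of marginal contributions."""
    p = len(players)
    phi = {}
    for i in players:
        others = [j for j in players if j != i]
        total = 0.0
        for size in range(p):
            # Weight of a coalition of this size that excludes player i.
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_value(coalition):
    """Toy game: 'a' and 'b' are worth 10 only together; 'c' adds 4 on its own."""
    worth = 10.0 if {"a", "b"} <= coalition else 0.0
    return worth + (4.0 if "c" in coalition else 0.0)


phi = exact_shapley(["a", "b", "c"], toy_value)
print(phi)
```

When each call to the value function means retraining a model, this enumeration is out of reach even for moderate $p$, which is the regime where a surrogate plus adaptive coalition selection becomes attractive.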
Singular Vectors of Attention Heads Align with Features: This paper demonstrates, through both theory and toy models, "why and when" the singular vectors of the attention head QK matrix $\Omega = W_Q^\top W_K$ align with the feature directions actually used by the model. It proposes "Sparse Attention Decomposition" (SAD) as an observable signal to verify this alignment in real-world models (GPT-2 / Pythia).
Sparse Autoencoders are Topic Models: This paper proves that the $L_1$ objective of Sparse Autoencoders (SAE) is exactly the MAP estimation of an LDA-style "Continuous Topic Model" (CTM) under the limit of high activity and small contribution. Based on this, the SAE-TM framework is proposed: pre-training SAEs to obtain reusable topic atoms, learning word distributions post-hoc, and merging them into an arbitrary number of topics via clustering. Topic coherence on text and image datasets significantly exceeds current mainstream neural topic models.
Steer Like the LLM: Activation Steering that Mimics Prompting: This paper reinterprets "prompt steering" as a form of activation steering implemented by the LLM itself. By distilling activation differences injected by prompts using a token-specific ReLU probe, the authors develop the PSR (Prompt Steering Replacement) module. PSR outperforms existing activation steering methods (CAA, ReFT-R1, Stolfo, etc.) across three benchmarks and matches or surpasses prompting in AxBench and persona steering tasks.
Textual Supervision Enhances Geospatial Representations in Vision-Language Models: The authors use hierarchical linear probes to investigate whether vision and multimodal models encode "where on Earth an image was taken" in their hidden layers without explicit geographic supervision. They conclude that VLMs with textual supervision (CLIP, LLaVA, Qwen, Gemma) encode latitude and longitude far better than vision-only models (ViT, DINOv2). This geographic information is concentrated in very few dimensions and can even be manipulated via "dimension steering" to rewrite the place names generated by the model.
The Discrete-Log Clock: How a Transformer Learns Modular Multiplication: Previous work found that when a Transformer learns modular multiplication $a\cdot b\bmod p$, the Fourier spectrum of its embeddings is "dense" (requiring all frequencies), appearing much more complex and less interpretable than modular addition. This paper demonstrates that this is merely an artifact of choosing the wrong analysis basis. The natural Fourier basis for modular multiplication is not the additive DFT, but the multiplicative character transform (Fourier transform over the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^*$). Switching to this basis immediately reveals a sparse spectrum (Gini coefficient increases from 0.07 to 0.58, with only 4 key frequencies remaining, and 96.9% of MLP neurons cleanly tuned to a single frequency). This proves the Transformer first utilizes the discrete logarithm to convert multiplication into addition, then applies the same "Clock" trigonometric identity mechanism as modular addition—a mechanism the authors term the "Discrete-Log Clock."
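The identity behind the "Discrete-Log Clock" can be checked directly: for a prime $p$ with primitive root $g$, the discrete logarithm turns multiplication mod $p$ into addition mod $p-1$. The sketch below is a brute-force verification for small primes, not the paper's analysis code; the helper names are hypothetical.

```python
def find_primitive_root(p):
    """Smallest g >= 2 whose powers cover all nonzero residues mod p (brute force, p > 2)."""
    for g in range(2, p):
        seen = set()
        x = 1
        for _ in range(p - 1):
            x = (x * g) % p
            seen.add(x)
        if len(seen) == p - 1:
            return g
    raise ValueError("no primitive root found; is p an odd prime?")


def discrete_log_table(p, g):
    """Map each nonzero residue a to the exponent k with g**k % p == a."""
    table = {}
    x = 1
    for k in range(p - 1):
        table[x] = k
        x = (x * g) % p
    return table


def multiplication_becomes_addition(p):
    """Check log(a*b mod p) == (log a + log b) mod (p-1) for all nonzero a, b."""
    g = find_primitive_root(p)
    log = discrete_log_table(p, g)
    return all(
        log[(a * b) % p] == (log[a] + log[b]) % (p - 1)
        for a in range(1, p)
        for b in range(1, p)
    )


print(find_primitive_root(13), multiplication_becomes_addition(13))
```

Because the logs live in the cyclic group $\mathbb{Z}/(p-1)\mathbb{Z}$, the same trigonometric "Clock" construction used for modular addition applies once inputs are re-indexed by their discrete logs, which is why the spectrum looks sparse only in the multiplicative character basis.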
The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level: This paper uses $k$-sparse probing to systematically compare the polysemanticity of MoE expert neurons versus dense FFN neurons. It finds that MoE naturally tends toward monosemanticity under sparse routing pressure. Consequently, the analysis unit is elevated from "neurons" to "entire experts." The authors use LLMs to automatically assign natural language labels to hundreds of experts, validate them through causal trigger experiments, and conclude that "experts are neither broad domain specialists nor token-level processors, but fine-grained task experts."
The Perceived Fragility of Explanations in Audio Models: Manipulation of Attribution with Unchanged Predictions: The authors transfer "explanation manipulation attacks" from the visual to the audio deepfake detection domain, proposing an optimization framework constrained by psychoacoustic masking thresholds. This framework systematically alters Grad-CAM/LRP attribution heatmaps while remaining completely inaudible and without changing the model's final prediction, demonstrating that "explanations" in audio models are fragile in a security context.
Towards Atoms of Large Language Models: This paper provides the first formal definition of the "fundamental representation units" of Large Language Models (LLMs)—termed atoms. It characterizes the intrinsic geometry of LLM hidden representations using a non-Euclidean "Atom Inner Product" and proves that threshold-activated SAEs can precisely recover the set of atoms under appropriate conditions. Empirical tests on Gemma2 / Llama3.1 identify near-ideal atoms with $R^2 \approx 99.9\%$ and stability $q^* \approx 99.85\%$.
Towards Long-Horizon Interpretability: Efficient and Faithful Multi-Token Attribution for Reasoning LLMs: Addressing the efficiency bottleneck of $\mathcal{O}(M\cdot N)$ in token-wise attribution and the "information absorption" effect where intermediate reasoning tokens soak up attribution mass in reasoning LLMs, this paper proposes FlashTrace. It utilizes span-wise aggregation to compute attribution for an entire target span in a single pass and employs recursive attribution to backtrace importance from the output through the reasoning chain to the original input. FlashTrace is over 130x faster than the strongest baseline IFR on 5k target spans while consistently outperforming in faithfulness across RULER, MATH, and MoreHopQA.
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection: This paper utilizes Causal Tracing to discover that "refusal" in LLMs is not a static vector at terminal tokens but a "Refusal Trajectory" spanning upstream intermediate layers and tokens. Based on this, the authors design SALO—a <20M parameter detector trained only on standard alignment data. By leveraging the irreversibility of Transformer causal masks, SALO identifies adversarial attacks like GCG, AutoDAN, and Prefilling, increasing the detection rate from 0% to >85% on GCG/Prefilling tasks.
Universal 1/3 Time Scaling in Learning Spiked Distributions: By analyzing the mathematical properties of softmax and cross-entropy when learning spiked probability distributions, this paper reveals the fundamental cause of the universal 1/3 power-law decay in LLM training loss—an optimization bottleneck at the architectural level independent of data structure.
Verified SHAP: Provable Bounds for Precise Shapley Values in Neural Networks: VERISHAP provides the first provable bounds for SHAP value computation in neural networks by combining branch-and-bound with neural network verification techniques—scaling to feature search spaces orders of magnitude larger than existing exact methods.
Vision-Language Asymmetry in Bistable Image Captioning: This paper uses Wittgensteinian "duck-rabbit" style bistable images as probes. After characterizing three behavioral regimes of LLaVA across 3320 generations, a TopK Sparse Autoencoder (SAE) is trained on the CLIP layer it consumes. The study finds that 72% of bistable stimuli activate feature pools for both interpretations simultaneously in the vision tower (superposition). However, causal steering can only flip "default-dominant" stimuli but fails to flip "force-balanced" ones like the Young/Old Woman—proving that the bottleneck for "committing to a specific seeing-as" lies not in the vision tower but in the downstream language decoder.
What Linear Probes Miss: Multi-View Probing for Weight-Space Learning: This paper points out that single-view first-order probes miss row-column interactions and second-order correlation structures within weight matrices. It proposes MVProbe, which uses multi-view representations consisting of row/column first-order projections and row/column Gram branches, significantly outperforming ProbeX on Model Jungle and Stable Diffusion LoRA identification.
Where Computation Lives Inside TabPFN: Causal Localisation of Attention Head Function: This paper presents the first causal mechanistic analysis of the tabular foundation model TabPFN-2.5 using activation patching, ablation, and attention entropy. The study reveals that one of the three feature attention heads (Head 2) possesses a causal necessity $2\text{--}5\times$ greater at its peak layer than other heads, with the peak layer shifting based on task complexity. In contrast, other heads exhibit symmetric late-layer patterns. Furthermore, the failure of activation steering across samples suggests that pure ICL architectures lack "injectable stable task directions."
Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions: This paper employs "rhyming couplet completion" as a clean test for look-ahead constraints. Using only lightweight tools (linear probes and activation patching), the study investigates "planning site formation" across more than ten scales in three model families: Qwen3, Gemma-3, and Llama-3. Probing reveals that information regarding future rhymes is linearly decodable at the newline character and strengthens with model scale. However, activation patching demonstrates that only Gemma-3-27B truly exhibits a causal dependence on this encoding. A "representational handoff," where causal drive transitions from the rhyme word to the newline character, occurs around layer 30. This handoff is ultimately localized to 5 attention heads, which recover approximately 90% of the rhyme-routing capability at the newline character.
Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints: This paper provides an architectural-level explanation for why the internal representations of Transformers can be repeatedly decoded by simple linear methods (probes, SAEs, activation steering). It proves that as long as semantic features are read through linear interfaces such as OV circuits or unembedding layers, they must reside in a cross-context invariant linear subspace (the Invariant Subspace Necessity theorem). The authors further derive a zero-shot application—the Self-Reference Property—which posits that a token's own embedding direction serves as its conceptual direction, enabling unsupervised classification using the geometric location of class tokens.