🔬 Interpretability

💬 ACL2026 · 63 paper notes

🔥 Top topics: LLM ×16 · Reasoning ×5 · Alignment/RLHF ×4 · Layout & Composition ×3 · Multimodal/VLM ×2

A Structured Clustering Approach for Inducing Media Narratives

The paper proposes a framework to automatically induce media narrative patterns from large-scale news corpora. By jointly modeling causal event chains and role information (Hero/Threat/Victim), it utilizes a role-constrained clustering algorithm to organize narrative chains into semantically coherent patterns. It generates interpretable narrative patterns consistent with framing theory in the domains of immigration and gun control.

A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification

This paper systematically compares extractive self-explanations generated by four open-source instruction-tuned LLMs with human rationales and post-hoc attribution methods across three types of text classification tasks. Agreement between self-explanations and human annotations depends strongly on text length and task complexity; in perturbation-based faithfulness evaluations, however, self-explanations often identify a subset of tokens more critical to the model's prediction.

AdaptiveK: Complexity-Driven Sparse Autoencoders for Interpretable Language Model Representations

AdaptiveK proposes a Sparse Autoencoder driven by input semantic complexity, allowing simple text to activate fewer features and complex text to activate more. Across experiments on eight autoregressive LLMs and additional architectures, it improves reconstruction quality, conceptual decoupling, and training efficiency while reducing the need for repetitive hyperparameter tuning common in fixed TopK approaches.
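
To make the contrast with fixed TopK concrete, here is a minimal sketch of a sparsifier whose `k` depends on an input complexity score. The linear mapping in `adaptive_k`, the `[0, 1]` complexity range, and both function names are assumptions for illustration; the summary does not say how AdaptiveK estimates complexity or maps it to a feature budget.

```python
def topk_sparsify(acts, k):
    """Keep the k largest activations and zero out the rest (fixed-TopK style)."""
    if k <= 0:
        return [0.0] * len(acts)
    order = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]


def adaptive_k(complexity, k_min, k_max):
    """Hypothetical mapping: scale k linearly with a complexity score in [0, 1]."""
    c = min(1.0, max(0.0, complexity))
    return k_min + round(c * (k_max - k_min))


acts = [0.1, 0.9, 0.5, 0.3]
simple_text = topk_sparsify(acts, adaptive_k(0.0, 1, 3))   # few active features
complex_text = topk_sparsify(acts, adaptive_k(1.0, 1, 3))  # more active features
```

With a fixed TopK sparsifier, `k` would be a single hyperparameter tuned per model; here it becomes a function of the input.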

Aligning What LLMs Do and Say: Towards Self-Consistent Explanations

This paper constructs the Post-hoc Self-Consistency Bank (PSCB, 85K decisions × 428K explanations) to quantify the feature attribution gap between LLM answers and their natural language explanations. It then improves attribution consistency through DPO without compromising model accuracy.

Compositional Steering of Large Language Models with Steering Tokens

This paper proposes compositional steering tokens, which compress behavior instructions into embedding vectors in the input space via self-distillation. By training a dedicated compositional token <and> to capture the universal concept of "composition," the method demonstrates strong generalization capabilities across unseen behavior combinations, unseen behaviors, and an unseen number of combined behaviors.

Constructing Interpretable Features from Compositional Neuron Groups

The authors utilize Semi-Nonnegative Matrix Factorization (SNMF) to directly decompose MLP activations into "sparse neuron groups × non-negative coefficients," yielding interpretable features that map back to activation contexts and combine across layers. In concept-steering evaluations on Llama-3.1-8B / Gemma-2-2B / GPT-2, the method comprehensively outperforms the latest SAEs (Llamascope / Gemmascope) and the strongly supervised baseline, DiffMeans.

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

By training a shared feature dictionary across multiple pretraining checkpoints of the same LLM using a sparse crosscoder, this work proposes the Relative Indirect Effect (RelIE) to measure how the causal importance of individual features "emerges, persists, or vanishes" over token counts. This study provides the first observation of the concept-level evolutionary trajectory in Pythia, OLMo, and BLOOM—from "specific subword detectors" to "internalized abstract syntactic/cross-lingual detectors."

Curing "Miracle Steps" in LLM Mathematical Reasoning with Rubric Rewards

This paper identifies the widespread presence of "Miracle Steps" in current LLM mathematical reasoning: reasoning chains that leap to the correct answer without derivation. It proposes the Rubric Reward Model (RRM), a process-based reward function using problem-specific scoring rubrics. During RL training, RRM reduces Miracle Steps by 71% and improves Verified Pass@1024 on AIME2024 from 26.7% to 62.6%.

Diffusion-CAM: Faithful Visual Explanations for dMLLMs

Diffusion-CAM is proposed as the first interpretability method specifically designed for diffusion-based Multimodal Large Language Models (dMLLMs). By extracting structurally valid intermediate representations from denoising trajectories and employing four post-processing modules (Adaptive Kernel Denoising, Distribution-aware Confidence Gating, Contextual Background Attenuation, and Single-instance Causal Debiasing), it significantly outperforms autoregressive CAM baselines on COCO Caption and GranDf.

Do LLMs Capture Embodied Cognition and Cultural Variation? Cross-Linguistic Evidence from Demonstratives

The authors use demonstratives such as "this/that" and "这/那" as probes to construct a bilingual English-Chinese dataset (80 items/language × 4 cues × 4 perspectives × 5 scenarios). By establishing a human baseline from 6,400 responses from 320 native speakers, the study finds that English speakers excel at proximal–distal differentiation but are weaker in other-perspective taking, while Chinese speakers show the opposite pattern. In contrast, five SOTA LLMs failed to consistently distinguish between proximal and distal categories and exhibited no cross-cultural variation, generally reverting to English-centric reasoning or "All of the above" safety fallbacks.

Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations

This paper identifies and formalizes "Structural Alignment Bias" in LLM tool invocation—a phenomenon where LLMs tend to call a tool when query attributes can be effectively mapped to tool parameters, even if the tool's function is irrelevant to the user's goal. The authors construct the SABEval dataset to decouple structural alignment from semantic relevance. Using Contrastive Attention Attribution, they reveal the existence of two competing internal paths: semantic check and structural matching. A proposed rebalancing strategy achieves an 80% relative error reduction.

DPN-LE: Dual Personality Neuron Localization and Editing for Large Language Models

This paper proposes DPN-LE, which locates mutually exclusive personality-related neurons by comparing MLP activations of high/low trait samples. By intervening in only approximately 0.5% of neurons, it achieves personality control while preserving general capabilities significantly better than existing large-scale neuron editing methods.

Dual Alignment Between Language Model Layers and Human Sentence Processing

The authors use logit-lens to decode "internal surprisal" from each layer of 19 LMs (including GPT-2/Pythia/OPT) and discover a counter-intuitive "dual alignment": on naturalistic reading corpora, shallow layer surprisal aligns best with humans; however, on syntactic challenge sentences such as garden-path, NPS, NPZ, RC, and Attachment, deep layers align better. This corresponds to the human dual-mechanism reading model—"default shallow processing + switching to deep reanalysis when difficult"—and leads to the proposal of using the difference in surprisal between shallow and deep layers (KL/JS) as an "inter-layer prediction update" to serve as a supplementary feature for reading-time.
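
The quantities named above can be sketched in a few lines: surprisal of the observed token, and KL/JS divergence between a shallow-layer and a deep-layer next-token distribution. The two toy distributions below are made up for illustration, and this shows only the standard definitions, not the paper's exact feature extraction.

```python
import math


def surprisal(p):
    """Surprisal in bits of an event with probability p."""
    return -math.log2(p)


def kl(p, q):
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence in bits (symmetric, bounded by 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy next-token distributions decoded from a shallow and a deep layer.
shallow = [0.7, 0.2, 0.1]
deep = [0.1, 0.2, 0.7]
update_kl = kl(deep, shallow)  # how far the deep layer moved from the shallow one
update_js = js(deep, shallow)
```

A large divergence means the deep layers revised the shallow prediction substantially, which is the sense of "inter-layer prediction update" in the summary.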

Embracing Anisotropy: Turning Massive Activations into Interpretable Control Knobs for Large Language Models

This paper reinterprets "massive activations," often regarded as outliers in LLMs, as interpretable domain-critical dimensions. It identifies these dimensions using a training-free activation magnitude criterion and performs activation steering exclusively on these dimensions, proving more effective than full-dimension steering in domain adaptation and jailbreaking scenarios.
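
As a rough illustration of steering restricted to high-magnitude dimensions, the sketch below picks the dimensions with the largest mean absolute activation and adds a steering direction only there. The top-`n` selection rule, the function names, and the `alpha` scale are assumptions; the paper's actual magnitude criterion is not given in this summary.

```python
def top_magnitude_dims(activations, n):
    """Indices of the n dimensions with the largest mean |activation|.

    activations: list of equal-length hidden-state vectors (no training involved).
    """
    width = len(activations[0])
    mean_abs = [
        sum(abs(h[i]) for h in activations) / len(activations)
        for i in range(width)
    ]
    ranked = sorted(range(width), key=lambda i: mean_abs[i], reverse=True)
    return sorted(ranked[:n])


def steer(hidden, direction, dims, alpha=1.0):
    """Add alpha * direction to hidden, but only on the selected dimensions."""
    out = list(hidden)
    for i in dims:
        out[i] += alpha * direction[i]
    return out


acts = [[0.1, 50.0, -0.2, 0.3], [-0.1, 48.0, 0.2, -9.0]]
dims = top_magnitude_dims(acts, 2)
steered = steer([1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0], dims, alpha=0.5)
```

Full-dimension steering would be the special case where `dims` covers every index.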

Evian: Towards Explainable Visual Instruction-tuning Data Auditing

This paper proposes the "Decomposition-then-Evaluation" paradigm and the EVIAN framework, which decomposes answers in visual instruction-tuning data into three components: visual descriptions, subjective reasoning, and factual claims. These are evaluated across three orthogonal dimensions: image-text consistency, logical coherence, and factual accuracy. The study finds that models trained on a small amount of high-quality data filtered by EVIAN outperform those trained on large-scale datasets.

Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models

A controlled knowledge framework was constructed to systematically investigate how LLMs utilize experimental descriptions and outcome evidence in scientific feasibility assessment. The study found that outcome evidence is more reliable than experimental descriptions, and partial experimental information often leads to performance below the baseline using only parametric knowledge, revealing the fragility of LLM reasoning.

Fine-Grained Analysis of Shared Syntactic Mechanisms in Language Models

The paper employs activation patching at the attention head granularity to demonstrate that Pythia and Gemma share a unified mechanism involving three attention heads in early-to-mid layers for processing seven types of English filler-gap dependencies (FGD). Scaling the activations of these specific heads by \(1.5 \times\) improves performance on the BLiMP benchmark. Conversely, Negative Polarity Item (NPI) licensing lacks such a unified mechanism, and supervised "DAS directions" learned during training fail entirely on Out-of-Distribution (OOD) data, suggesting that unsupervised patching is more reliable than supervised DAS.

FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models

FineSteer decomposes inference-time steering into two complementary stages: Subspace-guided Conditional Steering (SCS) determines "when to steer" using the energy ratio of the IR query subspace as a gate; Mixture of Steering Experts (MoSE) determines "how to steer" by dynamically aggregating prototype experts and residual refinement via an attention gating network to generate query-specific steering vectors. It outperforms SOTA on safety and truthfulness benchmarks.

Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs

The authors construct the first Video-LLM sycophancy benchmark, ViSE (367 videos / 6,367 multiple-choice questions / 7 categories of sycophantic scenarios). They systematically reveal the universal phenomenon across 9 SOTA Video-LLMs where "models abandon visual evidence to cater to users" and propose two training-free mitigation methods: (i) key-frame selection reduces sycophancy by up to 22.01% (and is proven via attention analysis to eliminate "first-frame bias" and "middle-layer instability"); (ii) representation steering reduces MSS by an average of 35.69% in the most difficult scenarios, bringing MSS close to 0 across 5 categories on LLaVA-OneVision.

Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models

This paper systematically investigates the token-level information distribution in text encoder outputs within text-to-image models using a causal intervention framework. It finds that the semantics of lexical items are typically concentrated on 1-2 representative tokens, and cross-item information flow leads to semantic leakage and image misinterpretation in 11% of cases. A simple yet effective token-level intervention method is proposed to improve alignment.

From Documents to Segments: A Contextual Reformulation for Topic Assignment

This paper shifts the fundamental unit of topic assignment from documents to segments, proposing SBTA and the SemEval-STM dataset. It demonstrates that assigning topics based on semantic segments in multi-topic short texts significantly improves topic purity, interpretability, and downstream retrieval utility.

From Interpretability to Performance: Optimizing Retrieval Heads for Long-Context Language Models

RetMask utilizes retrieval heads identified via "mechanistic interpretability" as a source of contrastive signals. By using the output of an ablated model (with retrieval heads masked) as the rejected sample and the original model's output as the chosen sample for DPO training, it achieves consistent improvements across 128K context lengths for Llama-3.1, Qwen3, and Olmo-3 families without requiring LLM judges or human annotation. Notably, it improves generation-with-citation by +70% and re-ranking by +32%.
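
The pair construction described above can be sketched as follows: the original model's output is the chosen sample and the head-ablated model's output is the rejected one, then the standard DPO loss is applied. The two model callables are stand-ins, and skipping identical outputs is an assumption rather than something the summary states.

```python
import math


def build_pairs(prompts, full_model, ablated_model):
    """Build DPO preference pairs without a judge or human labels.

    full_model / ablated_model: hypothetical callables mapping prompt -> text.
    """
    pairs = []
    for prompt in prompts:
        chosen = full_model(prompt)
        rejected = ablated_model(prompt)
        if chosen != rejected:  # assumption: identical outputs carry no signal
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, from sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


pairs = build_pairs(
    ["q1", "q2"],
    full_model=lambda p: f"answer to {p} [doc 3]",
    ablated_model=lambda p: f"answer to {p}" if p == "q1" else f"answer to {p} [doc 3]",
)
```

In this toy run only `q1` yields a pair, because masking the heads changed the output for that prompt alone.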

From Weights to Activations: Is Steering the Next Frontier of Adaptation?

This paper systematically argues that steering (inference-time activation space intervention) should be considered an independent model adaptation paradigm. It proposes eight functional evaluation criteria to compare steering with traditional methods like fine-tuning, PEFT, and prompt engineering, positioning steering as a locally reversible behavior modification method based on activation space with unique advantages in computational efficiency, data efficiency, and reversibility.

HistLens: Mapping Idea Change across Concepts and Corpora

The HistLens framework is proposed to decompose conceptual representations into interpretable semantic basis vectors using Sparse Autoencoders (SAE). It tracks the diachronic evolution trajectories of multiple concepts and corpora within a shared coordinate system and supports implicit concept computation, providing quantifiable and comparable analytical tools for digital humanities and conceptual history research.

How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs

This paper characterizes the geometric evolution of internal truth representations in LLMs when context is introduced. By measuring the directional angle \(\theta\) and relative magnitude of truth vectors under context-present vs. context-absent conditions across 4 models and multiple datasets, the study identifies a three-phase pattern: "near-orthogonal in early layers → rapid convergence in middle layers → stabilization or further increase in late layers." Context generally amplifies truth-falsity separability, and context conflicting with parametric knowledge induces greater geometric shifts than aligned context.

How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects

Representational analysis reveals that "logical validity" and "plausibility" are highly aligned in the latent space of LLMs, causing the model to conflate the two concepts (content effect). Constructing debiasing steering vectors effectively decouples these concepts, reducing content effects while improving reasoning accuracy.

IDEA: An Interpretable and Editable Decision-Making Framework for LLMs via Verbal-to-Numeric Calibration

The IDEA framework is proposed to extract decision-making knowledge from LLMs into interpretable parameterized models over semantic factors. By jointly learning verbal-to-numeric mappings and decision parameters via an EM algorithm, it achieves calibratable, editable, and interpretable LLM decision-making. Across five datasets, IDEA with Qwen-3-32B (78.6%) outperforms DeepSeek R1 (68.1%) and GPT-5.2 (77.9%).

Interpretability from the Ground Up

This work derives four principles—FGTI (Faithful, Grounded, Traceable, Interchangeable)—from the requirements of educational assessment stakeholders. It develops the AnalyticScore three-stage framework to achieve interpretable automated scoring, trailing non-interpretable SOTA by only 0.06 in average QWK on the ASAP-SAS dataset.

Interpretable Coreference Resolution Evaluation Using Explicit Semantics

This paper utilizes Concept and Named Entity Recognition (CNER) to map 29 fine-grained semantic labels onto coreference resolution outputs via a "mention + cluster-level majority voting" mechanism. This yields diagnostic Typed Mention F1 and Link F1 metrics, identifying systematic failure modes across semantic categories. These diagnostics guide targeted data augmentation using only three synthetic documents, improving the CoNLL-F1 of a LitBank-trained model on OntoNotes/PreCo by +2.5/+2.8 and Mention F1 by approximately +9.5.
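
The cluster-level majority vote can be illustrated with a small sketch: each mention carries a semantic label, and a coreference cluster takes the most frequent label among its mentions. The label names and data layout are placeholders, not the paper's 29-label inventory or its exact tie-breaking rule.

```python
from collections import Counter


def cluster_types(clusters, mention_type):
    """Assign each cluster the most frequent semantic label of its mentions.

    clusters: dict cluster_id -> list of mention ids
    mention_type: dict mention id -> label (mentions without a label are skipped)
    """
    result = {}
    for cluster_id, mentions in clusters.items():
        votes = Counter(mention_type[m] for m in mentions if m in mention_type)
        result[cluster_id] = votes.most_common(1)[0][0] if votes else None
    return result


clusters = {"c1": ["m1", "m2", "m3"], "c2": ["m4"], "c3": ["m5"]}
mention_type = {"m1": "PERSON", "m2": "PERSON", "m3": "LOCATION", "m4": "LOCATION"}
typed = cluster_types(clusters, mention_type)
```

Typed clusters like these are what allow F1 to be broken down per semantic category.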

Interpretable Semantic Gradients in SSD: A PCA Sweep Approach and a Case Study on AI Discourse

This paper proposes a PCA sweep procedure for Supervised Semantic Differential (SSD)—a method that estimates semantic gradients of text embeddings using individual difference variables. The procedure jointly utilizes interpretability and stability diagnostics (rather than predictive accuracy) to select the PCA dimension \(K\). In a case study involving 349 AI-themed essays and narcissism questionnaires, the sweep-selected \(K=15\) yielded a stable semantic gradient of "optimistic collaboration vs. distrustful mockery" related to Admiration, whereas a counterfactual \(K=120\) resulted in chaotic and uninterpretable clusters.

Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation

By constructing a dataset of verifiable intermediate reasoning chains with a rule-based problem decomposition method, this work shows that the semantic correctness of CoT reasoning chains is an unreliable predictor of final answer accuracy (correct chains lead to correct answers only 28% of the time). Furthermore, the most interpretable reasoning chains are not the most performance-enhancing: lengthy R1 chains perform best but are rated the least interpretable by users.

Interpreting Style Representations via Style-Eliciting Prompts

This paper decodes difficult-to-interpret text style vectors into style-eliciting prompts that can directly drive LLM writing. Using "controllability" as an interpretability criterion, the method outperforms baselines that rely on direct LLM descriptions of target styles in tasks involving style recovery, synthetic text style control, and human text style imitation.

Interpreto: An Explainability Library for Transformers

Interpreto is an open-source Python interpretability library for HuggingFace language models that unifies token/word/sentence attribution with activation-level concept explanations under a single API, offering demos, tutorials, metrics, and end-to-end concept explanation pipelines.

Investigating More Explainable and Partition-Free Compositionality Estimation for LLMs: A Rule-Generation Perspective

The paper moves beyond the traditional paradigm of "constructing test sets for compositional generalization testing." Instead, it requires LLMs to generate a Python program as a mapping rule for an entire dataset. By using \(\mathcal{C}(\text{P})\) based on the upper bound of Kolmogorov complexity, the "program compactness + accuracy" is converted into a compositionality score of 0–100. This shifts the focus from "checking output correctness" to "measuring rule compression," bypassing data contamination from pre-training while providing an explainable, introspective evaluation.

Jacobian Scopes: Token-Level Causal Attributions in LLMs

The authors propose Jacobian Scopes—a unified framework that uses the "projection of the Jacobian from input token embeddings to the last-layer hidden state onto a specific vector" as token attribution strength. Accompanied by three scopes (Semantic / Fisher / Temperature) to explain how a "target logit / entire prediction distribution / model confidence" is driven by various input tokens, it requires only one backpropagation. On AOPC metrics, it matches Input×Gradient and significantly outperforms Integrated Gradients.

Knowledge Vector of Logical Reasoning in Large Language Models

The authors demonstrate that the capacities for deductive, inductive, and abductive reasoning within LLMs can be linearly represented as three nearly orthogonal "knowledge vectors." They propose a complementary refinement framework based on SAE subspace constraints, allowing these vectors to share commonalities while preserving unique characteristics, thereby stably enhancing performance across all three reasoning types under steering settings.

Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling

VL-MDR upgrades the "single-scalar black-box" discriminative vision-language reward model into a three-headed architecture consisting of "dynamic dimension selection + per-dimension scoring + adaptive weighting." Combined with a 321k dataset featuring 21-dimensional fine-grained preference annotations, it outperforms existing open-source RMs on VL-RewardBench and generates higher-quality DPO preference pairs to mitigate VLM hallucinations.

Letting Tutor Personas Speak Up for LLMs: Learning Steering Vectors from Dialogue via Preference Optimization

This paper learns a shared steering direction and tutor-specific scaling factors from real teacher-student dialogues, enabling LLMs to generate tutoring utterances closer to specific human tutor styles without explicit persona prompts.

Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

This paper uses probes, residual de-confounding, trace-anchor, and causal steering experiments on Qwen3-14B to demonstrate that while linear probes appear to distinguish deductive, inductive, and abductive reasoning with 100% accuracy, they actually detect data source and task format rather than reasoning modes within hidden states.

Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

For Medieval Occitan, a low-resource historical language, the authors established an explainable framework using mBERT + Hybrid Tokenization + Domain-adapted MLM. By decomposing the problem of "whether original Latin neuter nouns became masculine or feminine in Occitan" into morphological cues versus syntactic context, they quantified the evidence and found that suffix morphology serves as the largest single signal, while context (especially articles and adjectives) pushes the Macro-F1 from 0.665 to 0.929.

Make Mechanistic Interpretability Auditable: A Call to Develop Guidelines via Continuous Collaborative Reviewing

This is a position paper arguing that mechanistic interpretability research must incorporate a layer of "auditability." By establishing a continuous collaborative reviewing platform, community-refined guidelines, and source evidence tracking systems, it aims to transform fragmented replications, negative results, and methodological critiques into auditing protocols suitable for safety-critical scenarios.

Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy

LLMs fail at large-scale counting because a single forward pass is limited to roughly 10–30 items by layer depth. This study employs a simple test-time strategy: slice the list into segments with "|", then prompt the model to count each segment before summing. This raises accuracy for Qwen2.5/Llama3/Gemma3/GPT-4o/Gemini-2.5-Pro from 0–20% to 50–95% in scenarios with 50–100 objects. Through attention analysis and four types of causal mediation experiments, a three-stage circuit ("segment counting → intermediate step aggregation → final summation") was localized to Layer 22 of Qwen2.5-7B (Head 13 for segmentation, Head 1 for aggregation).
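
A minimal sketch of the segment-then-sum strategy is shown below. The segment size, the prompt wording, and the function names are assumptions; only the use of "|" as a separator and the count-per-segment-then-sum idea come from the summary.

```python
def segment(items, size):
    """Split a list into consecutive segments of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def format_segments(items, size):
    """Render the list with '|' between segments."""
    return " | ".join(" ".join(seg) for seg in segment(items, size))


def build_prompt(items, size):
    """Hypothetical prompt asking for per-segment counts before the total."""
    return (
        "Count the items in each segment, then add the segment counts.\n"
        + format_segments(items, size)
    )


def aggregate(segment_counts):
    """Final summation over per-segment counts."""
    return sum(segment_counts)


items = ["apple"] * 7
listing = format_segments(items, 3)
total = aggregate(len(seg) for seg in segment(items, 3))
```

Keeping each segment short keeps every per-segment count within the range a single forward pass handles reliably.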

METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

METER is the first benchmark to systematically evaluate LLMs' three-level causal reasoning (discovery / intervention / counterfactual) under a unified context. On 4,145 samples constructed via human-LLM collaboration, LLM performance drops from 93% to 73% as models ascend the causal ladder. Saliency-based information flow analysis traces this to two root causes: interference from irrelevant facts during the discovery stage and a significant decline in faithfulness to the context at higher-level stages.

MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models

The authors propose MINED, the first evaluation benchmark for multimodal time-sensitive knowledge, consisting of 2,104 \((subject, hypernym, property, attribute-list)\) quadruplets across 11 sub-tasks in 6 dimensions (Cognition / Awareness / Trustworthiness / Understanding / Reasoning / Robustness), totaling 4,208 questions. An evaluation of 15 LMMs shows Gemini-2.5-Pro achieving the highest average \(\text{CEM}=63.07\) while still lacking ~15% of the knowledge. Further tests show that knowledge editing methods such as FT-LLM and IKE effectively update outdated knowledge in LLaVA-v1.5 and Qwen-VL under single editing, but performance degrades significantly under lifelong editing (FT-LLM drops by 43.2% on average).

Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models

This paper systematically probes 25 Transformer language models (ranging from BERT Base to Qwen2.5-7B) and discovers that lexical identity (lexeme) is linearly decodable in early layers but decays with depth, whereas inflectional features remain stable and readable across all layers, occupying a compact and controllable subspace.

NOSE: Neural Olfactory-Semantic Embedding with Tri-Modal Orthogonal Contrastive Learning

The authors propose NOSE, a tri-modal olfactory representation learning framework. By using molecules as a hub, the framework aligns molecular structure, receptor sequences, and natural language descriptions through an orthogonal injection mechanism. Coupled with an LLM-driven weak positive sample strategy to alleviate description sparsity, it achieves SOTA performance across 11 downstream tasks and demonstrates excellent zero-shot generalization capabilities.

On Emergent Social World Models -- Evidence for Functional Integration of Theory of Mind and Pragmatic Reasoning in Language Models

This paper provides evidence through large-scale behavioral evaluation and cognitive neuroscience-inspired functional localization/ablation experiments that Theory of Mind (ToM) and pragmatic reasoning in language models likely share internal computational mechanisms. It advances the concept of "Social World Models" from mere capability scores to a testable functional integration hypothesis.

Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization

This paper proposes Preference Heads and Differential Preference Steering (DPS), using causal ablation to identify a small set of attention heads that carry user preferences. It then amplifies preference signals from these heads during decoding to improve personalized generation and prediction without modifying model parameters.

Probing for Reading Times

This paper probes the ability of various language model layers to predict human reading times. It finds that early-layer representations outperform surprisal in predicting early fixation metrics, while surprisal remains superior for late-stage metrics; the optimal predictor varies significantly by language and metric.

Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing

This is a diagnostic analysis paper: instead of competing for performance, the authors probe LLM metaphor processing from three complementary dimensions—semantic property alignment, lexical invariance, and syntactic influence. They find that "high scores on metaphor benchmarks" may stem from heterogeneous shallow signals (semantic drift + stable lexical anchors + heuristic sensitivity to syntactic irregularities) rather than robust integrated semantic understanding.

Retrieval Heads are Dynamic

This paper demonstrates that the retrieval heads in LLMs responsible for extracting information from the context are not a fixed set but change dynamically across generation steps. They cannot be replaced by static heads and can be predicted from hidden states, which can enhance the retrieval performance of dynamic RAG.

Revitalizing Black-Box Interpretability: Actionable Interpretability for LLMs via Proxy Models

This paper proposes a proxy-model-based framework for black-box interpretability. It utilizes inexpensive small models to approximate the local decision boundaries of costly large models to generate LIME/SHAP explanations. Reliability is ensured through a "screen-and-apply" mechanism. Proxy explanations reduce costs by 88.2% while maintaining over 90% fidelity, and are successfully applied to downstream optimization tasks such as prompt compression and poisoned sample removal.

Rhetorical Questions in LLM Representations: A Linear Probing Study

A linear probing analysis of how LLMs internally represent rhetorical questions shows that they are linearly separable in the representation space and that probes transfer across datasets. However, the probe directions learned from different datasets are inconsistent: rhetorical questions are encoded by multiple heterogeneous linear directions rather than a single unified dimension.

Similarity-Distance-Magnitude Activations

This paper proposes the SDM (Similarity-Distance-Magnitude) activation function as a more robust alternative to softmax. It decouples and integrates three epistemic dimensions: the deep matching of correct predictions (Similarity), the distance to the training distribution (Distance), and the decision boundary distance (Magnitude), into a new activation: \(\text{sdm}(\mathbf{z}')_i = (2+q)^{d \cdot z'_i} / \sum_c (2+q)^{d \cdot z'_c}\). Based on this, an SDM estimator is constructed for selective classification, proving more robust than existing calibration methods under covariate shift and out-of-distribution inputs.
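
The formula above can be evaluated directly. The sketch below treats `q` and `d` as plain scalars supplied by the caller (how the paper estimates them is not covered here) and rewrites the power as an exponential for numerical stability. One consequence of the formula itself: with \(q = e - 2\) and \(d = 1\) the base becomes \(e\), so the expression reduces to ordinary softmax.

```python
import math


def sdm(z, q, d):
    """(2+q)^(d*z_i) / sum_c (2+q)^(d*z_c), computed in a numerically stable way.

    Assumes q > -2 so the base is positive; q and d are scalars given by the caller.
    """
    log_base = math.log(2.0 + q)
    scaled = [d * zi * log_base for zi in z]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax(z):
    """Reference softmax for comparison."""
    peak = max(z)
    exps = [math.exp(zi - peak) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0]
as_softmax = sdm(logits, q=math.e - 2.0, d=1.0)  # base e, unit scale
flattened = sdm(logits, q=1.0, d=0.0)            # d = 0 gives a uniform output
sharpened = sdm(logits, q=5.0, d=1.0)            # larger base gives a peakier output
```

Under this reading, a small `d` flattens the output toward uniform while a large `q` sharpens it.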

SITE: Soft Head Selection for Injecting ICL-Derived Task Embeddings

SITE proposes a gradient-optimized soft attention head selection method that identifies task-relevant heads to effectively inject ICL-derived task embeddings. It significantly outperforms ICL and existing embedding methods across 12 LLMs (4B-70B) while achieving performance comparable to PEFT with far fewer trainable parameters.

Sparse Feature Coactivation Reveals Causal Semantic Modules in Large Language Models

This paper automatically discovers semantic modules representing concepts and relations in LLMs via cross-layer co-activation graphs of SAE features from few prompts. It demonstrates that ablating or amplifying these modules allows for predictable manipulation of relational reasoning in Gemma 2 2B, achieving success rates up to 98% in single concept/relation scenarios and 90% in compositional scenarios.

SSA: Improving Performance With a Better Scoring Function

This paper identifies that Softmax attention collapses into an approximate hardmax under distribution shifts due to high-magnitude tokens. It proposes Scaled Signed Averaging (SSA) as a trainable alternative scoring function, which demonstrates superior generalization performance over Softmax across synthetic ICL tasks, a 114M decoder-only language model, and BabyBERTa encoder probes.

Style over Story: Measuring LLM Narrative Preferences via Structured Selection

This work designs an experimental paradigm based on constrained selection to measure the narrative preferences of LLMs. Using a library of 200 constraints constructed from narratology theory, 6 LLMs were evaluated across different instruction types. The study found that models systematically prioritize "Style" over content elements such as "Event," "Character," and "Setting."

The Impact of Off-Policy Training Data on Probe Generalisation

This paper systematically compares the impact of four types of training data—on-policy natural, on-policy incentivised, on-policy prompted, and off-policy—on the generalization of LLM activation probes. It finds that probes for behaviors visible on the text surface are robust, while "intentional" behaviors like deception, sycophancy, and sandbagging are highly susceptible to domain shifts. The authors propose using an on-policy incentivised test set to predict generalization failures in real-world monitoring.

Through a Compressed Lens: Investigating The Impact of Quantization on Factual Knowledge Recall

This paper systematically evaluates the impact of weight quantization (e.g., GPTQ, AWQ, BitsAndBytes) on the factual knowledge recall of LLMs. It finds that quantization generally causes information loss and weakens knowledge retrieval, particularly harming smaller models and unsaturated relations; however, 8-bit/BitsAndBytes tend to preserve capabilities well, and some quantizations even enhance multi-hop factual recall.

Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures

This survey systematically reviews the latest progress in the intrinsic interpretability of LLMs, categorizing existing methods into five major design paradigms (Functional Transparency, Concept Alignment, Representational Decomposability, Explicit Modularity, and Latent Sparse Induction), while discussing open challenges and future directions.

Tracing Relational Knowledge Recall in Large Language Models

This paper systematically investigates the internal mechanisms of LLMs in recalling relational knowledge during text generation. It finds that the head-wise contribution of attention heads to the residual stream (\(\Delta_{att,h}\)) is the strongest feature for linear relation classification (reaching 91% accuracy). It proposes two probe attribution methods, HeadScore and TokenScore, to decompose predictions to the level of attention heads and source tokens, revealing clear correlations between probe accuracy and relational specificity, entity connectivity, and the concentration of probe signals.

Understanding or Memorizing? A Case Study of German Definite Articles in Language Models

This study employs the Gradiend gradient interpretability method to investigate whether language models rely on abstract syntactic rules or surface-level memorization when predicting German definite articles (der/die/das/den/dem/des). The findings indicate that models rely at least partially on memorized associations rather than strict rule-based encoding.