🔬 Interpretability

💬 ACL2026 · 34 paper notes

A Structured Clustering Approach for Inducing Media Narratives: This paper proposes a framework for automatically inducing media narrative patterns from large-scale news corpora. By jointly modeling causal event chains and character roles (hero/villain/victim), the framework employs a role-constrained clustering algorithm to organize narrative chains into semantically coherent narrative patterns. The approach generates interpretable narrative patterns consistent with framing theory in two domains: immigration and gun control.
Aligning What LLMs Do and Say: Towards Self-Consistent Explanations: This paper constructs a large-scale Post-hoc Self-Consistency Bank (PSCB; 85K decisions, 428K explanations), quantifies the feature attribution gap between LLM answers and their explanations, and improves attribution consistency of explanations via DPO optimization without sacrificing accuracy.
ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding: This paper proposes ChemVLR, the first reasoning-oriented VLM for the chemical domain. It constructs a 760K reasoning dataset via a cross-modal reverse engineering strategy and employs a three-stage training pipeline of continued pre-training → SFT → RL, achieving substantial improvements over proprietary models and domain-specialized VLMs on molecular recognition and reaction prediction tasks.
Context-Value-Action Architecture for Value-Driven Large Language Model Agents: This paper proposes the CVA (Context-Value-Action) architecture, grounded in the S-O-R psychological model and Schwartz's theory of basic human values. By training a Value Verifier on real human data, CVA decouples action generation from cognitive reasoning, effectively mitigating behavioral polarization in LLM agents. The approach achieves substantial improvements over baselines on CVABench, a benchmark comprising over 1.1 million real interaction trajectories.
Curing "Miracle Steps" in LLM Mathematical Reasoning with Rubric Rewards: This paper identifies a pervasive phenomenon in LLM mathematical reasoning termed "Miracle Steps"—instances where a reasoning chain leaps to the correct answer without valid derivation—and proposes the Rubric Reward Model (RRM), a problem-specific process reward function that reduces Miracle Steps by 71% during RL training and improves Verified Pass@1024 on AIME2024 from 26.7% to 62.6%.
Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations: This paper identifies and formalizes "structural alignment bias" in LLM tool invocations — the tendency of LLMs to invoke a tool when query attributes can be effectively mapped to tool parameters, even when the tool's functionality is irrelevant to the user's goal. The authors construct the SABEval dataset to decouple structural alignment from semantic relevance, apply contrastive attention attribution (CAA) to reveal two competing internal pathways (semantic checking vs. structural matching), and propose a path rebalancing strategy that achieves 80% relative error reduction.
Evian: Towards Explainable Visual Instruction-tuning Data Auditing: This paper proposes a Decomposition-then-Evaluation paradigm and the EVIAN framework, which decomposes responses in visual instruction tuning data into three components—visual description, subjective reasoning, and factual claims—and evaluates them along three orthogonal dimensions: image-text consistency, logical coherence, and factual accuracy. Models trained on the small high-quality subset selected by EVIAN outperform those trained on large-scale datasets.
Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models: This work constructs a controlled knowledge framework to systematically study how LLMs leverage experimental descriptions and outcome evidence in scientific feasibility assessment. Results show that providing outcome evidence is more reliable than experimental descriptions, that partial experimental information frequently degrades performance below a parametric-knowledge-only baseline, and that LLM reasoning exhibits notable fragility under incomplete evidence.
Forest Before Trees: Latent Superposition for Efficient Visual Reasoning: This paper proposes Laser, a framework that conducts visual reasoning in latent space via Dynamic Window Alignment Learning (DWAL), enabling the model to maintain a probabilistic "superposition state" over future semantics rather than performing precise per-token prediction. This realizes a "global-before-local" cognitive hierarchy, achieving state-of-the-art performance among latent reasoning methods on 6 benchmarks with only 6 reasoning tokens (a reduction of more than 97%), surpassing Monet by an average of 5.03%.
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization: Through systematic mechanistic interpretability analysis, this paper reveals that LLM quantization exhibits two qualitatively distinct failure modes: 4-bit Signal Degradation (computational patterns remain intact but precision is impaired, amenable to local repair) and 2-bit Computation Collapse (functional destruction of critical components, requiring structural reconstruction).
HistLens: Mapping Idea Change across Concepts and Corpora: This paper proposes HistLens, a framework that leverages sparse autoencoders (SAEs) to decompose concept representations into interpretable semantic basis vectors, enabling the tracking of diachronic evolution trajectories across multiple concepts and corpora within a shared coordinate system. The framework supports implicit concept computation and provides a quantifiable, comparable analytical tool for digital humanities and conceptual history research.
IDEA: An Interpretable and Editable Decision-Making Framework for LLMs via Verbal-to-Numeric Calibration: This paper proposes IDEA, a framework that extracts LLM decision knowledge into an interpretable parametric model over semantic factors. An EM algorithm jointly learns the mapping from verbal probability expressions to numeric values and the decision parameters, enabling calibrated, editable, and interpretable LLM decision-making. IDEA with Qwen-3-32B achieves 78.6% average F1 across five datasets, surpassing DeepSeek R1 (68.1%) and GPT-5.2 (77.9%).
Interpretability from the Ground Up: Starting from the informational needs of educational assessment stakeholders, this paper proposes four FGTI principles (Faithful, Grounded, Traceable, Interchangeable) and develops the three-stage AnalyticScore framework for interpretable automated scoring, achieving an average QWK on ASAP-SAS only 0.06 below the non-interpretable SOTA.
Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation: By constructing a verifiable intermediate reasoning trace dataset via rule-based question decomposition, this paper reveals that the semantic correctness of CoT reasoning traces correlates unreliably with final answer accuracy (correct traces lead to correct answers only 28% of the time), and that the most interpretable traces are not the most performance-enhancing ones—verbose R1 traces achieve the best performance yet are rated the least interpretable by users.
LePREC: Reasoning as Classification over Structured Factors for Assessing Relevance of Legal Issues: This paper proposes LePREC, a neuro-symbolic framework inspired by legal professionals' analytical processes. It uses LLMs to generate reasoning question–answer pairs that convert unstructured legal text into structured features, which are then fed into a sparse linear model for relevance classification. On the LIC dataset constructed from 769 Malaysian contract law cases, LePREC achieves 30–40% improvement over LLM baselines such as GPT-4o.
LLM-Guided Semantic Bootstrapping for Interpretable Text Classification with Tsetlin Machines: This paper proposes an LLM-guided semantic bootstrapping framework that leverages LLMs to generate sub-intents and trains a Non-Negated Tsetlin Machine (NTM) via three-stage curriculum synthetic data generation. High-confidence symbolic features extracted by the NTM are injected into real data representations, enabling a standard TM to approach BERT-level classification performance while maintaining full interpretability.
NOSE: Neural Olfactory-Semantic Embedding with Tri-Modal Orthogonal Contrastive Learning: This paper proposes NOSE, a tri-modal olfactory representation learning framework that uses molecules as a pivot to align three modalities—molecular structure, receptor sequences, and natural language descriptions—via an orthogonal injection mechanism. Combined with an LLM-driven weak positive augmentation strategy to address description sparsity, NOSE achieves state-of-the-art performance on 11 downstream tasks and demonstrates strong zero-shot generalization.
PV-SQL: Synergizing Database Probing and Rule-based Verification for Text-to-SQL Agents: This paper proposes PV-SQL, an agent-based Text-to-SQL framework that combines two complementary components — Probe (iteratively generating probing queries to discover database value formats, column semantics, and table relationships) and Verify (extracting verifiable constraints via pattern matching and constructing a checklist) — achieving 5% higher execution accuracy and 20.8% higher valid efficiency score over the best baseline on the BIRD benchmark.
Reasoning Fails Where Step Flow Breaks: This paper proposes Step-Saliency, a diagnostic tool that identifies two depth-correlated information flow failure modes in large reasoning models (Shallow Lock-in and Deep Decay), and designs StepFlow, a test-time intervention that repairs information propagation and improves reasoning accuracy without retraining.
Revitalizing Black-Box Interpretability: Actionable Interpretability for LLMs via Proxy Models: This paper proposes a proxy-model-based black-box interpretability framework that leverages cheap small models to approximate the local decision boundaries of expensive large models for generating LIME/SHAP explanations. A statistical screen-and-apply mechanism ensures reliability: proxy explanations maintain over 90% fidelity while reducing costs by 88.2%, and are successfully applied to downstream optimization tasks such as prompt compression and poisoned sample removal.
Rhetorical Questions in LLM Representations: A Linear Probing Study: This work applies linear probing to analyze how LLMs internally represent rhetorical questions (RQs), finding that RQs are linearly separable in representation space and that probes transfer across datasets. However, probes trained on different datasets learn inconsistent directions, indicating that RQs are encoded along multiple heterogeneous linear directions rather than a single unified dimension.
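
The finding above (probes that transfer across datasets while pointing in different directions) can be illustrated with a minimal sketch. It makes several assumptions that are not in the summary: a difference-of-means direction stands in for the trained linear probe, the 2-D vectors are hand-made toy data rather than LLM activations, and the decision threshold is fixed at the origin. The names `probe_direction`, `cosine`, and `sign_accuracy` are hypothetical.

```python
import math

def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def probe_direction(pos, neg):
    """Unit-length difference-of-means direction (a stand-in for a trained probe)."""
    diff = [a - b for a, b in zip(mean_vec(pos), mean_vec(neg))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]

def cosine(u, v):
    """Cosine similarity between two probe directions."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def sign_accuracy(direction, pos, neg):
    """Fraction of points falling on the expected side of the origin."""
    hits = sum(dot(direction, x) > 0 for x in pos)
    hits += sum(dot(direction, x) < 0 for x in neg)
    return hits / (len(pos) + len(neg))

# Toy "dataset A": classes differ along the first axis only (illustrative values).
a_pos, a_neg = [[2.0, 0.1], [1.8, -0.1]], [[-2.0, 0.1], [-1.8, -0.1]]
# Toy "dataset B": classes differ along both axes (illustrative values).
b_pos, b_neg = [[1.0, 2.0], [1.2, 1.8]], [[-1.0, -2.0], [-1.2, -1.8]]

dir_a = probe_direction(a_pos, a_neg)
dir_b = probe_direction(b_pos, b_neg)

transfer_acc = sign_accuracy(dir_a, b_pos, b_neg)  # probe from A applied to B
direction_agreement = cosine(dir_a, dir_b)         # how aligned the two probes are
print(transfer_acc, direction_agreement)
```

In this toy setup the probe from dataset A still separates dataset B, even though the two directions are far from parallel, which is the pattern the summary describes.
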
Similarity-Distance-Magnitude Activations: This paper proposes SDM (Similarity-Distance-Magnitude) activations as a more robust replacement for softmax. By decoupling and integrating three epistemic dimensions—Similarity (deep matching with correct training predictions), Distance (proximity to the training distribution), and Magnitude (distance to the decision boundary)—into a novel activation \(\text{sdm}(\mathbf{z}')_i = (2+q)^{d \cdot z'_i} / \sum_c (2+q)^{d \cdot z'_c}\), the method constructs an SDM estimator for selective classification that is more robust than existing calibration approaches under covariate shift and out-of-distribution inputs.
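
As a minimal sketch of the formula above, the function below evaluates \(\text{sdm}(\mathbf{z}')_i\) for given scalars `q` and `d`. How the paper estimates `q` and `d` is not stated in the summary, so both are treated here as plain inputs (an assumption), and the function name `sdm_activation` is hypothetical. One property follows from the formula alone: with `q = e - 2` and `d = 1` the base becomes `e`, so the expression coincides with the standard softmax.

```python
import math

def sdm_activation(z, q, d):
    """Evaluate (2+q)^(d*z_i) / sum_c (2+q)^(d*z_c) for a list of logits z.

    q and d are taken as given scalars; estimating them is outside this sketch.
    """
    log_base = math.log(2.0 + q)
    # Work with exponents in log space and subtract the max for numerical stability.
    exponents = [d * zi * log_base for zi in z]
    shift = max(exponents)
    weights = [math.exp(e - shift) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def softmax(z):
    """Standard softmax, used only as a reference point."""
    shift = max(z)
    weights = [math.exp(zi - shift) for zi in z]
    total = sum(weights)
    return [w / total for w in weights]

logits = [1.0, 2.0, 3.0]
as_softmax = sdm_activation(logits, q=math.e - 2.0, d=1.0)  # base e, scale 1
sharper = sdm_activation(logits, q=1.0, d=2.0)              # larger base and scale
print(as_softmax, sharper)
```

Larger values of `q` or `d` concentrate more mass on the largest logit, while `d = 0` yields a uniform distribution regardless of the logits.
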
SITE: Soft Head Selection for Injecting ICL-Derived Task Embeddings: SITE proposes a gradient-based soft attention head selection method that identifies task-relevant attention heads to effectively inject ICL-derived task embeddings. Across 12 LLMs (4B–70B), SITE substantially outperforms ICL and existing embedding injection methods while achieving performance comparable to PEFT with far fewer trainable parameters.
SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks: SPENCE detects and quantifies data contamination of LLMs on NL2SQL benchmarks by systematically generating syntactic paraphrases of benchmark queries and measuring the decay of execution accuracy as a function of syntactic distance. Older benchmarks (e.g., Spider) exhibit stronger contamination signals, while the more recent BIRD benchmark is largely unaffected.
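
The idea of measuring accuracy decay as a function of syntactic distance can be sketched as a least-squares slope. This is only an illustration: the summary does not say how SPENCE quantifies the decay, so the slope statistic, the function name `decay_slope`, and the accuracy values below are all assumptions (the numbers are placeholders, not results from the paper).

```python
def decay_slope(distances, accuracies):
    """Ordinary least-squares slope of accuracy against syntactic distance."""
    n = len(distances)
    mean_x = sum(distances) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in distances)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, accuracies))
    return sxy / sxx

# Placeholder values: paraphrase distance buckets and execution accuracy per bucket.
distances = [0, 1, 2, 3]
acc_decaying = [0.80, 0.70, 0.60, 0.50]  # accuracy drops as paraphrases drift further
acc_stable = [0.60, 0.60, 0.60, 0.60]    # accuracy unaffected by paraphrasing

slope_decaying = decay_slope(distances, acc_decaying)
slope_stable = decay_slope(distances, acc_stable)
print(slope_decaying, slope_stable)
```

Under this reading, a clearly negative slope would be the contamination signal, and a slope near zero would correspond to a benchmark that is largely unaffected.
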
StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference: This paper proposes StructKV, a structure-aware KV Cache compression framework that identifies globally important tokens via Global In-Degree Centrality accumulated across layers, adaptively locates the optimal compression layer via Dynamic Pivot Detection, and decouples computation and storage budgets via Structural Propagation & Decoupling. At 60% prefill + 10% KV retention, StructKV achieves near-full-context performance on LongBench and RULER.
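
One plausible reading of "Global In-Degree Centrality accumulated across layers" is sketched below: treat each attention matrix as a weighted graph, score a token by the total attention it receives, and sum those scores over layers. This interpretation, the function names `global_in_degree` and `top_tokens`, and the toy attention values are assumptions; the summary does not give the exact definition, and the sketch omits Dynamic Pivot Detection and the budget decoupling.

```python
def global_in_degree(attn_layers):
    """Sum, over layers and query positions, the attention each key token receives.

    attn_layers: list of square matrices, attn[q][k] = weight from query q to key k.
    """
    n_tokens = len(attn_layers[0])
    scores = [0.0] * n_tokens
    for attn in attn_layers:
        for row in attn:
            for key_idx, weight in enumerate(row):
                scores[key_idx] += weight
    return scores

def top_tokens(scores, keep):
    """Indices of the `keep` highest-scoring tokens, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Toy causal attention for 3 tokens over 2 layers (rows sum to 1; values are made up).
layer_1 = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.6, 0.1, 0.3]]
layer_2 = [[1.0, 0.0, 0.0], [0.2, 0.8, 0.0], [0.1, 0.2, 0.7]]

scores = global_in_degree([layer_1, layer_2])
kept = top_tokens(scores, keep=2)
print(scores, kept)
```
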
Style over Story: Measuring LLM Narrative Preferences via Structured Selection: This paper proposes a constrained-selection experimental paradigm to measure LLM narrative preferences. Using a library of 200 constraints constructed from narratological theory, six LLMs are evaluated across different instruction types, revealing that models systematically favor "Style" over content elements such as "Event," "Character," and "Setting."
TabReX: Tabular Referenceless eXplainable Evaluation: This paper proposes TabReX, a graph-reasoning-based referenceless evaluation framework for tabular generation. It converts source text and generated tables into knowledge graph triples and aligns them to compute interpretable, attribute-driven scores. TabReX substantially outperforms existing methods in correlation with human judgments, and the authors also introduce TabReX-Bench, a large-scale evaluation benchmark.
The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination: This paper documents a "reasoning trap" paradox: enhancing LLM reasoning capabilities, whether through RL, distillation, or switchable reasoning modes, systematically amplifies tool hallucination. The effect is associated with reasoning itself rather than with RL training specifically, and existing mitigation strategies (prompt engineering, DPO) face an unavoidable reliability-capability trade-off.
ThreadSumm: Summarization of Nested Discourse Threads Using Tree of Thoughts: This paper proposes ThreadSumm, a multi-stage LLM pipeline framework that models nested discourse thread summarization as a hierarchical reasoning problem. It first extracts aspects and atomic content units (ACUs) for content planning, then constructs a thread-aware sequence via sentence ordering, and finally applies Tree of Thoughts search to generate and score multiple paragraph candidates. The approach outperforms baselines on Reddit and StackExchange datasets.
To Trust or Not to Trust: Attention-Based Trust Management for LLM Multi-Agent Systems: This paper proposes the first comprehensive definition of "trustworthiness" for LLM multi-agent systems (LLM-MAS), grounded in six orthogonal dimensions derived from Grice's Cooperative Principle. It demonstrates that LLM attention patterns can distinguish different types of trustworthiness violations, and on this basis introduces A-Trust, a lightweight attention-based evaluation method, and an end-to-end Trust Management System (TMS) that achieves malicious message detection rates of 77–90% across diverse attack scenarios.
Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures: This paper presents a systematic survey of recent advances in intrinsic interpretability of LLMs, organizing existing methods into five design paradigms (functional transparency, concept alignment, representational decomposability, explicit modularity, and latent sparsity induction), and discusses open challenges and future directions.
Tracing Relational Knowledge Recall in Large Language Models: This paper systematically investigates the internal mechanisms by which LLMs recall relational knowledge during text generation. It finds that per-head attention contributions to the residual stream (\(\Delta_{att,h}\)) serve as the strongest features for linear relation classification (91% accuracy), and proposes two probe attribution methods—HeadScore and TokenScore—to decompose predictions to the attention head and source token levels, revealing clear correlations between probe accuracy and relation specificity, entity connectivity, and probe signal concentration.
Understanding New-Knowledge-Induced Factual Hallucinations in LLMs: Analysis and Interpretation: This paper systematically analyzes factual hallucinations induced by new-knowledge learning during SFT using a controlled synthetic dataset, Biography-Reasoning. It identifies the root mechanism as the attenuation of attention to key entities, and proposes KnownPatch—injecting a small amount of known-knowledge samples at the end of training to restore attention patterns—effectively mitigating hallucinations.
Understanding or Memorizing? A Case Study of German Definite Articles in Language Models: This paper employs the Gradiend gradient-based interpretability method to investigate whether language models predict German definite articles (der/die/das/den/dem/des) by leveraging abstract grammatical rules or surface-level memorization, finding that models rely at least partially on memorized associations rather than strict rule-based encoding.