
🔬 Interpretability

📷 CVPR2026 · 33 paper notes

📌 Same area in other venues: 🔬 ICLR2026 (196) · 💬 ACL2026 (63) · 🧪 ICML2026 (92) · 🤖 AAAI2026 (37) · 🧠 NeurIPS2025 (76) · 📹 ICCV2025 (10)

🔥 Top topics: Reasoning ×4 · Alignment/RLHF ×2 · Multimodal/VLM ×2

Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers

ALOE uses a one-time, label-free "teacher-student feature alignment" to convert frozen ViT foundation models (Supervised / DINOv3 / SigLIP2) into inherently interpretable B-cos versions. Once aligned, the backbone can be used as a drop-in replacement for tasks like classification, zero-shot, and dense prediction, improving accuracy by \(>4.9\) percentage points over original B-cosification on ViTs while providing faithful and localized explanations with \(100\)–\(1000\times\) higher data efficiency.

Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations

This paper proposes BTTF, a pure optimization framework that uses Image-to-Video diffusion models to generate Counterfactual Explanations (CFE) for video classifiers. It optimizes the initial noise latent using only the gradients of the target classifier: the search is first anchored near the original video via "inversion" and then driven toward the target category. The result is a "parallel video" that stays as similar as possible to the original yet is classified as another category, revealing the spatiotemporal features the model relies on for decision-making.

Beyond Top Activations: Efficient and Reliable Crowdsourced Evaluation of Automated Interpretability

For the problem of "how to evaluate automated neuron explanations," this paper utilizes Model-Guided Importance Sampling (MG-IS) to select the most informative inputs for crowdsourced labeling and Bayesian Rating Aggregation (BRAgg) to remove noise. This reduces the cost of a reliable full-distribution correlation evaluation from approximately $90k to $2.16k (~40×). Using this method, the authors systematically compare mainstream interpretability methods across multiple vision models, finding that Linear Explanations perform best overall, surprisingly outperforming recent LLM-based methods.

CIGMA: Causal Information-Gain Mechanistic Attribution of Attention Heads in Vision Transformers

CIGMA quantifies the contribution of each attention head to background shortcuts using two counterfactual edits (masking foreground/background). By ranking heads according to causal information gain and surgically zeroing out the top-K "spurious heads," ViT/VLM models are encouraged to shift attention from the background to foreground objects without requiring training. This leads to classification accuracy gains of 7.6–24.8 percentage points and an approximately 83% reduction in background dependency.

CREward: A Type-Specific Creativity Reward Model

This paper decomposes "visual creativity" along the image formation pipeline into three interpretable axes: Geometry / Material / Texture. It first establishes a human benchmark, CreBench, via expert pairwise comparisons to confirm that Large Vision-Language Models (LVLMs) align closely with human judgment regarding creativity. Subsequently, a lightweight type-specific reward model, CREward (comprising a frozen visual backbone and MLP heads), is distilled from LVLM-generated preference labels. This model is applied across three domains: creativity evaluation, creative sample filtering / LoRA slider-guided generation, and Grad-CAM based interpretability.

Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

This paper proposes CoE, a training-free multimodal summarization framework. By constructing a Hierarchical Event Graph (HEG) to guide chain-of-event reasoning, it surpasses SOTA video CoT baselines on 8 datasets, achieving an average improvement of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore.

Draft and Refine with Visual Experts

This paper proposes DnR (Draft and Refine), an agent framework built on a query-conditional Visual Utilization metric. The metric quantifies an LVLM's actual reliance on visual evidence, and the framework iteratively improves visual grounding and reduces hallucinations by rendering feedback from external visual experts (detection, segmentation, OCR, etc.).

Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing

This work redefines open-vocabulary 3D indoor scene editing as a goal-regressive planning problem. It introduces the PDDL-style symbolic language EditLang and an LLM-driven Planner-Validator loop to derive minimal editing sequences from target states. On 63 editing tasks, the method achieves the best balance of instruction faithfulness (69.1%), semantic consistency (86.6%), and physical plausibility (91.7%).

ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization

ERMoE proposes reparameterizing MoE expert weights within an orthogonal eigenbasis and substituting traditional routing logits with eigenbasis alignment scores (cosine similarity), enabling stable routing and interpretable expert specialization without the need for auxiliary load balancing losses.
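
The routing idea can be illustrated with a minimal sketch. It assumes each expert owns a small orthonormal basis and that the "alignment score" is the cosine between a token and its projection onto that basis; the exact score used by ERMoE is not given in this note, so `alignment_score` and `route` are hypothetical names for illustration only.

```python
import math

def alignment_score(x, basis):
    """Cosine between x and its projection onto span(basis).

    Assumes the rows of `basis` are orthonormal (an assumption of this sketch).
    """
    coeffs = [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in basis]
    proj_norm = math.sqrt(sum(c * c for c in coeffs))
    x_norm = math.sqrt(sum(v * v for v in x))
    return proj_norm / x_norm if x_norm > 0 else 0.0

def route(x, expert_bases):
    """Pick the expert whose basis is best aligned with the token x."""
    scores = [alignment_score(x, basis) for basis in expert_bases]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Two toy experts in a 3-dimensional feature space.
expert_bases = [
    [[1.0, 0.0, 0.0]],                    # expert 0 spans the first axis
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],   # expert 1 spans the other two axes
]
print(route([3.0, 0.0, 0.0], expert_bases))  # token goes to expert 0
print(route([0.0, 1.0, 1.0], expert_bases))  # token goes to expert 1
```

Because the scores are bounded cosines rather than free logits, they are directly comparable across experts, which is one plausible reading of why no auxiliary load balancing loss is needed.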

H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers

H-Sets utilizes the input Hessian to detect second-order (non-additive) interactions between pixels, recursively merging them into semantically coherent feature sets. It then scores each set using IDG-Vis (Integrated Directional Gradients + Harsanyi Dividends) at the set level, ultimately producing saliency maps that are sparser and more faithful than existing methods.

Hidden Monotonicity: Explaining Deep Neural Networks via their DC Decomposition

This paper losslessly decomposes any pre-trained ReLU network into the difference of two "monotonic and convex" subnetworks \(f=g-h\). By resolving the numerical explosion inherent in such decompositions, it introduces three attribution methods—SplitCAM, SplitLRP, and SplitGrad—setting new state-of-the-art (SOTA) results for saliency maps across faithfulness, localization, and robustness on VGG16 and ResNet18 (ImageNet-S).
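
To make the decomposition \(f=g-h\) concrete, here is a minimal sketch of one standard construction for ReLU MLPs; whether the paper uses exactly this construction is an assumption. Each weight matrix is split as \(W = W^{+} - W^{-}\) with non-negative parts, and the ReLU is rewritten using \(\max(g-h, 0) = \max(g, h) - h\). The helper names are hypothetical.

```python
def split_pos_neg(W):
    """Split W into non-negative parts P, N with W = P - N."""
    P = [[max(w, 0.0) for w in row] for row in W]
    N = [[max(-w, 0.0) for w in row] for row in W]
    return P, N

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def forward(layers, x):
    """Plain ReLU MLP; ReLU after every layer except the last."""
    a = list(x)
    for i, (W, b) in enumerate(layers):
        a = [z + bi for z, bi in zip(matvec(W, a), b)]
        if i < len(layers) - 1:
            a = [max(z, 0.0) for z in a]
    return a

def dc_forward(layers, x):
    """Track a pair (g, h) such that the network activation equals g - h."""
    g, h = list(x), [0.0] * len(x)
    for i, (W, b) in enumerate(layers):
        P, N = split_pos_neg(W)
        # W(g - h) + b = (P g + N h + b) - (N g + P h)
        g_new = [p + n + bi for p, n, bi in zip(matvec(P, g), matvec(N, h), b)]
        h_new = [n + p for n, p in zip(matvec(N, g), matvec(P, h))]
        g, h = g_new, h_new
        if i < len(layers) - 1:
            # ReLU: max(g - h, 0) = max(g, h) - h, so h is left unchanged
            g = [max(gi, hi) for gi, hi in zip(g, h)]
    return g, h

# Toy 2-2-1 network with arbitrary illustrative weights.
layers = [
    ([[1.0, -2.0], [-0.5, 1.5]], [0.1, -0.2]),
    ([[2.0, -1.0]], [0.3]),
]
x = [0.7, -1.2]
g, h = dc_forward(layers, x)
print(forward(layers, x), [gi - hi for gi, hi in zip(g, h)])
```

Both branches are built only from non-negative weighted sums and maxima, which is what keeps them monotone and convex. The magnitudes of `g` and `h` can grow with depth even when their difference stays small; the sketch does nothing to control this, which is the numerical issue the paper reports resolving.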

Hierarchical Concept Embedding & Pursuit for Interpretable Image Classification

HCEP explicitly encodes the "concepts have a hierarchical structure (hypernym → hyponym)" prior into the geometric conditions of the CLIP embedding space. It then utilizes Hierarchical Beam Orthogonal Matching Pursuit (HB-OMP) to recover concepts along "root-to-leaf" paths, significantly improving concept recovery precision/recall while maintaining classification accuracy, especially in few-shot scenarios.

HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation

HUMORCHAIN explicitly encodes four major humor theories—Incongruity-Resolution, Benign Violation, Superiority, and Relief—into a multi-stage LLM reasoning chain ("Visual Parsing → Strategy Selection → Generation → Discriminator Feedback"). A Qwen3-VL-4B humor discriminator is trained for a "generation-evaluation-rewriting" loop, outperforming existing methods in human preference, Elo/BT scores, and semantic diversity across three datasets.

Improving Sparse Autoencoder with Dynamic Attention

This paper reformulates the Sparse Autoencoder (SAE) into a cross-attention architecture with shared concept vectors and replaces softmax with sparsemax. This allows each sample to automatically determine the number of activated concepts based on its own complexity, overcoming the inherent "setting K" problem in TopK SAEs to achieve lower reconstruction error and clearer concepts in both image and text domains.
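
The role of sparsemax can be seen in a short sketch of the standard sparsemax projection (the Euclidean projection of a score vector onto the probability simplex). This is the generic operator only, not the paper's full cross-attention SAE.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z_sorted, start=1):
        cumsum += zk
        # The k-th largest score is in the support while 1 + k*zk > top-k sum.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
    # Scores at or below the threshold tau are set exactly to zero.
    return [max(zi - tau, 0.0) for zi in z]

# The number of active entries depends on the scores, not on a preset K.
print(sparsemax([2.0, 1.0, 0.1]))   # one dominant score: a single active entry
print(sparsemax([1.0, 0.8, -1.0]))  # two close scores: roughly [0.6, 0.4, 0.0]
print(sparsemax([0.5, 0.4, 0.1]))   # flat scores: all three stay active
```

Unlike softmax, the output contains exact zeros, and unlike TopK, the size of the support is decided per input by the threshold `tau`.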

Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

This paper proposes generalization performance prediction metrics based on the internal circuits of models, including Dependency Depth Bias (DDB) for pre-deployment model selection and Circuit Shift Score (CSS) for post-deployment performance monitoring. These metrics improve correlation by an average of 13.4% and 34.1%, respectively, compared to existing proxy metrics.

Language Models Can Explain Visual Features via Steering

This paper proposes performing causal intervention (steering) with SAE features on VLM vision encoders. By feeding in blank images and letting the language model describe the visual concepts it "sees," the method achieves scalable automated interpretation of visual features without needing evaluation image sets. The hybrid approach, Steering-informed Top-k, achieves SOTA performance.

Make it SING: Analyzing Semantic Invariants in Classifiers

SING projects invariant directions in the null space of a classifier's linear head—which "change the input without changing logits"—into the CLIP vision-language space via a linear translator. By using two angular metrics (AS/IS) to quantify the semantic content of these invariants, the authors diagnose "semantic information leakage into the invariant subspace" across model, class, and image levels, discovering that DinoViT is less prone to leaking class-related semantics into the null space compared to models like ResNet50.
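
The notion of an invariant direction can be checked numerically. The sketch below assumes a linear head `logits = W f + b` and removes from an arbitrary perturbation everything that lies in the row space of `W`; what remains lies in the null space and leaves the logits unchanged. The function names and toy numbers are hypothetical, and the CLIP translation step of SING is not covered.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_rows(W, eps=1e-12):
    """Gram-Schmidt on the rows of W, giving an orthonormal basis of its row space."""
    basis = []
    for row in W:
        r = list(row)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = dot(r, r) ** 0.5
        if norm > eps:
            basis.append([ri / norm for ri in r])
    return basis

def null_space_component(W, v):
    """Part of v orthogonal to every row of W, so that W applied to it is zero."""
    r = list(v)
    for q in orthonormal_rows(W):
        c = dot(r, q)
        r = [ri - c * qi for ri, qi in zip(r, q)]
    return r

def logits(W, b, f):
    return [dot(row, f) + bi for row, bi in zip(W, b)]

# Toy head: 2 classes, 4-dimensional features.
W = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b = [0.5, -0.5]
f = [1.0, 2.0, 3.0, 4.0]
delta = null_space_component(W, [0.3, -0.7, 1.1, 0.2])
f_shifted = [fi + di for fi, di in zip(f, delta)]
print(logits(W, b, f), logits(W, b, f_shifted))  # equal up to rounding
```

The features change along `delta` while the classifier output does not, which is the subspace whose semantic content SING then measures.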

Making the Classification Explanation Faithful to the Confidence Score

This paper proposes MHE (Metropolis-Hastings Explainer), a black-box explanation method that uses MH sampling to search for masks under which "the confidence remains close to the original image after partial occlusion." By identifying positive and negative contribution regions at the same time, it keeps the confidence on the explained region close to the model's original confidence, upgrading the explanation from "class faithfulness" to "confidence faithfulness."

Measuring the (Un)Faithfulness of Concept-Based Explanations

This paper shows that the faithfulness of existing unsupervised concept-based explanation methods (U-CBEMs) is overestimated because of overly complex surrogate models and flawed deletion-based evaluations. The authors propose SURF (Surrogate Faithfulness), a framework consisting of a simple linear surrogate and dual-space metrics. The framework passes a sanity check ("random concepts should be less faithful") and reveals for the first time that several state-of-the-art (SOTA) U-CBEMs are actually unfaithful.

MedLIME: A Distribution-Aligned and Evidence-Supported Framework for Medical Saliency Explanations

MedLIME enhances the classic black-box explanation method LIME with three key components: Generative Masking (GM) using MAE to ensure perturbed samples remain in-distribution, Supervised Test-Time Adaptation (STTA) to align inputs with the model's distribution, and Evidence-Based Regularization (EBR) via kNN and kernel estimation to incorporate historical clinical evidence. This framework improves the quality of saliency maps (AUPRC) for medical anomaly localization by up to 30% compared to various baselines.

Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared

This paper proposes the first framework to perform cross-modal fusion under missing infrared conditions in the coefficient domain rather than the pixel domain. By establishing a unified IR-VIS atomic space via a shared convolutional dictionary, it completes VIS→IR reasoning and adaptive fusion within the coefficient domain. Combined with a frozen LLM providing weak semantic priors for thermal information completion, the method achieves performance close to dual-modal fusion methods using only visible light input.

Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion

ND-CNPFuse is proposed to establish constraints between network parameters and input signals through neurodynamic analysis of Coupled Neural P (CNP) systems. This prevents abnormal continuous firing of neurons, enabling the generation of high-quality, interpretable decision maps for multi-focus image fusion (MFIF) tasks without any training.

PhaseWin: Reducing Object-Level Attribution from Quadratic Complexity to Near-Linear Phase-Window Search

PhaseWin transforms the quadratic greedy sub-region selection in object-level attribution (which re-scores all remaining regions at every step) into a coarse-to-fine search process featuring "phase pruning + window selection + dynamic supervision." While preserving greedy approximation guarantees, it achieves 95%+ of greedy attribution faithfulness using approximately 20% of the forward pass budget.

Pixel2Phys: Distilling Governing Laws from Visual Dynamics

Pixel2Phys is proposed as an MLLM-based multi-agent collaborative framework that automatically discovers interpretable physical governing equations from raw videos through an iterative hypothesis-verification-refinement loop involving four agents: Plan, Variable, Equation, and Experiment. It achieves a 45.35% improvement in extrapolation accuracy compared to baselines.

PRISM: Prototype-based Reasoning with Inter-modal Semantic Mining for Interpretable Image Recognition

PRISM augments traditional vision-only prototype networks (ProtoPNet series) with linguistic supervision. It utilizes CLIP and the Information Bottleneck principle to generate "text-conditioned attribution maps" as soft labels, implicitly anchoring visual prototypes to semantically meaningful image regions. By incorporating an entropy-based spatial compactness constraint to ensure non-overlapping prototypes, PRISM improves both accuracy and prototype interpretability on fine-grained classification tasks like CUB and Stanford Dogs.

Rethinking Concept Bottleneck Models: From Pitfalls to Solutions

The CBM-Suite framework is proposed to systematically address four pitfalls of Concept Bottleneck Models (CBMs): the lack of pre-evaluation metrics for concept relevance, the linearity problem causing bottlenecks to be bypassed, the accuracy gap compared to black-box models, and the research gap regarding the impact of different visual backbones/VLMs. This is achieved through entropy measures, non-linear layers, and distillation losses, significantly enhancing both accuracy and interpretability.

RiskProp: Collision-Anchored Self-Supervised Risk Propagation for Early Accident Anticipation

RiskProp is proposed as a self-supervised risk propagation paradigm anchored by the collision frame. By utilizing future frame regularization and adaptive monotonic constraint losses, the model learns temporally coherent risk evolution curves relying solely on collision frame annotations, achieving SOTA performance on CAP and Nexar datasets.

Rounded or Streamlined Head? Bridging Concept Bottleneck Models and Attribute-Described Object Parts

To address two types of inconsistency in VLM-driven Concept Bottleneck Models (CBMs)—mislocalizing concepts to incorrect parts and activating concepts on irrelevant objects—this paper proposes OA-CBM. It uses an LLM to rewrite concepts into "part-attribute" pairs and constructs two segmentation datasets accordingly. It employs a Hierarchical Clustering module to generate class-agnostic foreground object masks to suppress background noise and a Cost Aggregation module to stabilize vision-concept correspondence. This improves concept grounding h-IoU from 9.8 to 35.7 in the challenging Pred-All setting, with a concurrent classification accuracy gain of approximately 2.9%.

SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World

This paper proposes SafeDrive, an end-to-end planning framework that simulates the future behavior of key entities through a trajectory-conditioned Sparse World Network (SWNet). It then employs a Fine-grained Reasoning Network (FRNet) for per-instance collision assessment and per-timestep drivable area compliance evaluation. SafeDrive achieves 91.6 PDMS and a collision rate of only 0.5% on NAVSIM, alongside a 66.8% driving score on Bench2Drive.

Selection-as-Nonlinearity: Bridging Attention and Activation via a Joint Game-Decision Lens for Interpretable, Discriminative Visual Representations

This paper proposes the SaN (Selection-as-Nonlinearity) perspective, reinterpreting attention as a "cooperative selection game driven by context-based scoring under unit budget constraints." It diagnoses the "weak-independence" phenomenon—where pure attention stacks significantly underperform when FFNs are removed—as a result of two structural tensions. Based on this, it designs a near-zero-overhead compensation module, CSaN (Layered Budget Calibration + Public-Private Collaborative Readout), enabling small-scale Swin/ViT/Hiera models to match or exceed the performance of counterparts twice their size on ImageNet.

TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment

The TDATR framework is proposed, utilizing a "perceive-then-fuse" strategy and a structure-guided cell localization module to achieve end-to-end table recognition with limited annotated data, reaching SOTA on 7 benchmarks without dataset-specific fine-tuning.

VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension

VIRO embeds a lightweight operator-level verification mechanism (CLIP uncertainty verification + spatial logic verification) into a neuro-symbolic REC pipeline. This enables each reasoning step to self-verify and terminate early when no target is present. In zero-shot settings, it significantly outperforms compositional reasoning baselines with a balanced accuracy of 61.1%, while maintaining a program failure rate below 0.3% and high inference efficiency.

When Do Models Actually Decide? Mapping the Layer-Wise Decision Timeline in Pretrained Neural Networks

The authors train linear probes at each anchor layer of ResNet-18/50/101 (plus ViT-B/16 and ConvNeXt-Tiny) to track the specific layer where the prediction for each ImageNet image "settles." They discover a strong bimodal decision distribution and a "semantic phase transition" concentrated in the final residual stages. Based on these findings, they suggest that stability-based early exits provide negligible real-world speedup-accuracy gains.
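
One way to read "the layer where the prediction settles" is sketched below. It assumes the settle layer is the earliest anchor layer from which the probe prediction equals the final prediction and never changes again; the paper's exact stability criterion is not given in this note, and `settle_layer` is a hypothetical helper.

```python
def settle_layer(probe_preds):
    """Earliest layer index from which the probe prediction stays equal to the final one.

    probe_preds: predicted class per anchor layer, ordered from shallow to deep.
    """
    final = probe_preds[-1]
    settle = len(probe_preds) - 1
    # Walk backwards from the last layer until the prediction differs.
    for i in range(len(probe_preds) - 1, -1, -1):
        if probe_preds[i] != final:
            break
        settle = i
    return settle

# Hypothetical per-layer probe predictions for three images.
print(settle_layer([5, 5, 5, 5, 5, 5, 5]))  # 0: decided at the first anchor layer
print(settle_layer([3, 7, 7, 2, 5, 5, 5]))  # 4: settles late, after several flips
print(settle_layer([1, 2, 3]))              # 2: only the last layer agrees
```

Collecting this index over a dataset gives the per-image decision timeline; a histogram of the values is where a bimodal distribution like the one reported would show up.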