
🔬 Interpretability

📷 CVPR2026 paper notes

Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

This paper proposes "information scope" as a novel dimension for SAE feature interpretability. By introducing the Contextual Dependency Score (CDS), it partitions CLIP's SAE features into local features (low CDS) and global features (high CDS), revealing their differentiated functional roles in classification, segmentation, and depth estimation.
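
The local/global split can be sketched with a toy score. Here CDS is taken to be the relative change in a feature's mean activation when each patch is encoded without its surrounding context; the paper's exact CDS formula is not reproduced, and the 0.5 threshold is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def cds(act_full, act_isolated):
    """Toy Contextual Dependency Score: relative change in a feature's
    activation when patches are encoded without surrounding context.
    (Illustrative proxy; the paper's definition may differ.)"""
    num = np.abs(act_full - act_isolated).mean(axis=0)
    den = np.abs(act_full).mean(axis=0) + 1e-8
    return num / den

# Simulated SAE activations for 100 patches x 6 features.
act_full = rng.random((100, 6))
act_isolated = act_full.copy()
act_isolated[:, 3:] = rng.random((100, 3))   # features 3-5 depend on context

scores = cds(act_full, act_isolated)
local_feats = np.where(scores < 0.5)[0]      # low CDS -> local features
global_feats = np.where(scores >= 0.5)[0]    # high CDS -> global features
print(local_feats, global_feats)
```

Context-independent features get a score near zero and fall in the local bucket; context-dependent ones score high and fall in the global bucket.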

CI-ICE: Intrinsic Concept Extraction Based on Compositional Interpretability

This paper introduces the CI-ICE task and the HyperExpress method, which leverages the hierarchical modeling capacity of hyperbolic space (Poincaré ball) to extract composable object-level and attribute-level intrinsic concepts. By applying Horosphere projection to enforce compositionality in the concept embedding space, HyperExpress achieves an ACC₁ of 0.504 on UCEBench, a 55% improvement over ICE (0.325).
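
The appeal of the Poincaré ball for hierarchical modeling is visible directly in its metric: distances blow up near the boundary, so generic concepts placed near the origin stay close to everything, while fine-grained attributes near the boundary spread far apart. A minimal NumPy sketch (the example points and concept labels are made up, and the Horosphere projection itself is not shown):

```python
import numpy as np

def poincare_dist(u, v, eps=1e-9):
    """Geodesic distance in the Poincare ball model of hyperbolic space."""
    uu = np.dot(u, u)
    vv = np.dot(v, v)
    duv = np.dot(u - v, u - v)
    x = 1.0 + 2.0 * duv / max((1.0 - uu) * (1.0 - vv), eps)
    return np.arccosh(x)

# Generic (object-level) concepts near the origin, specific
# (attribute-level) concepts near the boundary -- hypothetical placement:
obj = np.array([0.1, 0.0])       # e.g. "cup"
attr_a = np.array([0.85, 0.2])   # e.g. "ceramic"
attr_b = np.array([-0.6, 0.65])  # e.g. "red"

print(poincare_dist(obj, attr_a), poincare_dist(attr_a, attr_b))
```

The two boundary-near attributes end up much farther apart than either is from the central object concept, which is exactly the tree-like geometry hierarchical concept extraction exploits.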

Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

This paper proposes CoE, a training-free multimodal summarization framework that constructs a Hierarchical Event Graph (HEG) to guide chain-of-events reasoning. CoE surpasses state-of-the-art video CoT baselines across 8 datasets, achieving average gains of +3.04 ROUGE, +9.51 CIDEr, and +1.88 BERTScore.

DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification

This paper proposes DINO-QPM, a lightweight interpretability adapter that transforms the complex, high-dimensional features of a frozen DINOv2 backbone into contrastive, class-agnostic interpretable representations. Through quadratic programming for sparse feature selection and class-level feature assignment, the method simultaneously surpasses DINOv2 linear probing in accuracy and all comparable methods in interpretability on CUB-200-2011 and Stanford Cars.

Draft and Refine with Visual Experts

This paper proposes DnR (Draft and Refine), an agent framework built upon a question-conditioned Visual Utilization metric that quantifies the degree to which LVLMs actually rely on visual evidence. Through iterative rendering feedback from external visual experts (detection, segmentation, OCR, etc.), DnR improves visual grounding and reduces hallucinations.

Edit-As-Act: Goal-Regressive Planning for Open-Vocabulary 3D Indoor Scene Editing

This paper reframes open-vocabulary 3D indoor scene editing as a goal-regressive planning problem. It introduces EditLang, a PDDL-style symbolic language, and employs an LLM-driven Planner-Validator loop to derive minimal edit sequences by reasoning backward from goal states. Evaluated on 63 editing tasks, the method achieves the best overall balance across instruction fidelity (69.1%), semantic consistency (86.6%), and physical plausibility (91.7%).

ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization

ERMoE proposes reparameterizing MoE expert weights within an orthogonal eigenbasis and replacing conventional routing logits with eigenbasis scores (cosine similarity), achieving stable routing and interpretable expert specialization without auxiliary load-balancing losses.
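
The routing rule can be sketched as follows: each expert carries an orthonormal basis, and the router scores a token by how much of its normalized energy that basis captures (a squared-cosine form of the eigenbasis score). The bases below are random rather than learned, so this illustrates only the scoring, not ERMoE's training:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 4, 6  # token dim, experts, basis vectors per expert

# Each expert is parameterized in an orthonormal eigenbasis
# (random via QR here; ERMoE learns these jointly with the experts).
bases = [np.linalg.qr(rng.standard_normal((d, k)))[0] for _ in range(n_experts)]

def route(x, top_k=2):
    """Eigenbasis-score routing: score = squared norm of the normalized
    token's projection onto each expert's basis (1.0 = fully captured)."""
    xn = x / np.linalg.norm(x)
    scores = np.array([float(np.linalg.norm(Q.T @ xn) ** 2) for Q in bases])
    chosen = np.argsort(scores)[::-1][:top_k]
    return chosen, scores

x = rng.standard_normal(d)
chosen, scores = route(x)
print(chosen, scores)
```

Because the score is a bounded alignment measure rather than an unconstrained logit, no expert can win by inflating its magnitude, which is the intuition behind dropping auxiliary load-balancing losses.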

Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions?

This paper proposes the FASS benchmark, which systematically evaluates the stability of post-hoc feature attribution methods through prediction-invariant filtering, a three-axis stability decomposition (spatial / ranking / salient region), and multiple perturbation types (geometric / photometric / compression), exposing fundamental flaws in existing evaluation frameworks.

From Weights to Concepts: Data-Free Interpretability of CLIP via Singular Vector Decomposition

This paper proposes SITH (Semantic Inspection of Transformer Heads), a fully data-free and training-free interpretability framework for CLIP. SITH applies SVD directly to the Value-Output weight matrices of attention heads, then leverages a novel COMP algorithm to interpret each singular vector as a sparse combination of semantically coherent concepts. This achieves finer-grained intra-head interpretability than existing methods and enables precise weight editing to improve downstream performance.
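
The core decomposition step is easy to sketch: form a head's combined value-output matrix, take its SVD, and compare singular vectors against a concept dictionary by cosine similarity. The COMP sparse-combination step is omitted, and the dictionary below is random rather than CLIP text embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_concepts = 32, 8, 10

# Per-head value and output projections of one attention head.
W_v = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
W_o = rng.standard_normal((d_head, d_model)) / np.sqrt(d_head)
W_vo = W_v @ W_o   # combined value-output map (d_model x d_model)

# The SVD exposes at most d_head meaningful directions per head.
U, S, Vt = np.linalg.svd(W_vo)

# Interpret the leading singular vector via a concept dictionary
# (random stand-ins for text embeddings; illustrative only).
concepts = rng.standard_normal((n_concepts, d_model))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

top_vec = Vt[0]               # unit-norm right singular vector
sims = concepts @ top_vec     # cosine similarities
best = int(np.argmax(np.abs(sims)))
print(f"rank {int((S > 1e-9).sum())}, nearest concept {best}")
```

Note the numerical rank equals the head dimension, which is why SVD recovers a small, head-specific set of candidate semantic directions without any data.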

Geometry-Guided Camera Motion Understanding in VideoLLMs

This paper reveals that VideoLLMs perform near random-chance on fine-grained camera motion primitives (pan/tilt/dolly, etc.), constructs CameraMotionDataset (12K clips × 15 atomic motions) and the CameraMotionVQA benchmark, and proposes a model-agnostic approach that injects geometric camera cues extracted by a frozen 3D foundation model (VGGT) via a lightweight temporal classifier and structured prompting — bridging this capability gap without any fine-tuning of the VideoLLM.

Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings

This paper proposes generalization performance prediction metrics based on model-internal circuits, including Dependency Depth Bias (DDB) for pre-deployment model selection and Circuit Shift Score (CSS) for post-deployment performance monitoring, improving average correlation over existing proxy metrics by 13.4% and 34.1%, respectively.

Language Models Can Explain Visual Features via Steering

This paper proposes a method for scalable automatic explanation of visual features by causally intervening (steering) on SAE features in VLM visual encoders. By injecting feature vectors into a blank image's forward pass and prompting the language model to describe what it "sees," the approach eliminates the need for an evaluation image set. A hybrid method, Steering-informed Top-k, is further proposed and achieves state-of-the-art performance.

Measuring the (Un)Faithfulness of Concept-Based Explanations

This paper demonstrates that the faithfulness of existing unsupervised concept-based explanation methods (U-CBEMs) is systematically overestimated — due to the use of overly complex surrogate models and flawed deletion-based evaluation. The authors propose SURF (Surrogate Faithfulness), a simple linear surrogate with a dual-space metric framework, validated through a sanity check that "random concepts should be less faithful," and provide the first systematic benchmark revealing that multiple SOTA U-CBEMs are in fact not faithful.
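
The SURF idea, a simple linear surrogate whose fit error measures faithfulness, together with its random-concept sanity check, can be sketched in a few lines (a synthetic linear "model" and random data; this is not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 500, 32, 5  # samples, feature dim, classes (= concepts here)

# Stand-ins for a model's penultimate features and its logits.
feats = rng.standard_normal((n, d))
W_true = rng.standard_normal((d, c))
logits = feats @ W_true

def surrogate_error(concept_basis):
    """Fit a simple LINEAR surrogate from concept activations to the
    model's logits; its mean-squared error measures (un)faithfulness."""
    scores = feats @ concept_basis
    W, *_ = np.linalg.lstsq(scores, logits, rcond=None)
    return float(((scores @ W - logits) ** 2).mean())

good = W_true                       # concepts aligned with the model
rand = rng.standard_normal((d, c))  # random concepts (sanity check)
print(surrogate_error(good), surrogate_error(rand))
```

Aligned concepts reconstruct the logits almost perfectly, while random concepts leave a large residual, passing the "random concepts should be less faithful" sanity check that overly complex surrogates can fail.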

Missing No More: Dictionary-Guided Cross-Modal Image Fusion under Missing Infrared

This paper proposes the first framework that performs cross-modal fusion under missing infrared conditions in the coefficient domain rather than the pixel domain. By learning a shared convolutional dictionary that establishes a unified IR-VIS atomic space, the method performs VIS→IR inference and adaptive fusion entirely in the coefficient domain. A frozen LLM provides weak semantic priors for thermal information completion. The approach achieves performance comparable to dual-modality fusion methods using only visible light images as input.

Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion

This paper proposes ND-CNPFuse, which performs neurodynamical analysis of coupled neural P (CNP) systems to establish constraint relationships between network parameters and input signals, preventing abnormal sustained neuronal firing. The method generates high-quality, interpretable decision maps for multi-focus image fusion (MFIF) without any training.

On the Possible Detectability of Image-in-Image Steganography

This paper exposes a fundamental security flaw in mainstream image-in-image deep steganography schemes, typically built on invertible neural networks (INNs): the embedding process is essentially a mixing process that can be readily separated by Independent Component Analysis (ICA). The authors propose an interpretable steganalysis method based on statistical moments of wavelet-domain independent components (achieving 84.6% accuracy with only 8-dimensional features) and demonstrate that the classical SRM+SVM approach achieves detection rates exceeding 99%.
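
The detection principle can be illustrated with a toy linear mixing model and a small moment-based feature vector. A plain finite-difference residual stands in for the wavelet subbands, and the 0.9/0.1 mixing weights are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def moment_features(img):
    """8-D feature: mean, variance, skewness, kurtosis of the horizontal
    and vertical high-pass residuals (a crude stand-in for wavelet
    subbands; the paper uses independent-component statistics)."""
    feats = []
    for res in (np.diff(img, axis=0), np.diff(img, axis=1)):
        r = res.ravel()
        m, s = r.mean(), r.std() + 1e-12
        feats += [m, r.var(),
                  ((r - m) ** 3).mean() / s ** 3,
                  ((r - m) ** 4).mean() / s ** 4]
    return np.array(feats)

cover = rng.random((64, 64))
secret = rng.random((64, 64))
stego = 0.9 * cover + 0.1 * secret  # embedding as an approximate linear mix

f_cover = moment_features(cover)
f_stego = moment_features(stego)
print(np.abs(f_cover - f_stego))
```

Even this crude mix measurably shifts the residual statistics (the variance feature in particular), which is what makes a low-dimensional classifier on such features viable.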

Pixel2Phys: Distilling Governing Laws from Visual Dynamics

Pixel2Phys is proposed as a multi-agent collaborative framework built upon MLLMs, employing four agents — Plan, Variable, Equation, and Experiment — in an iterative hypothesize-verify-refine loop to automatically discover interpretable governing equations from raw videos, achieving a 45.35% improvement in extrapolation accuracy over baselines.

Reallocating Attention Across Layers to Reduce Multimodal Hallucination

A lightweight, training-free plugin method is proposed to mitigate hallucination in Multimodal Large Reasoning Models (MLRMs) by identifying perceptual and reasoning attention heads and applying Class-Conditioned Rescaling to rebalance cross-layer attention distribution. The method achieves an average improvement of 4.2% across 5 benchmarks with negligible additional inference overhead.

Rethinking Concept Bottleneck Models: From Pitfalls to Solutions

This paper proposes CBM-Suite, a methodological framework that systematically addresses four fundamental pitfalls of Concept Bottleneck Models: the absence of a pre-training concept relevance metric, the linearity that allows the concept bottleneck to be bypassed, the accuracy gap relative to black-box models, and the unexplored interaction effects between visual backbones and VLMs. Through entropy-based metrics, nonlinear layers, and distillation losses, it significantly improves both the accuracy and interpretability of CBMs.

RiskProp: Collision-Anchored Self-Supervised Risk Propagation for Early Accident Anticipation

This paper proposes RiskProp, a collision-anchored self-supervised risk propagation paradigm that learns temporally coherent risk evolution curves using only collision-frame annotations, via a future-frame regularization loss and an adaptive monotonicity constraint loss, achieving state-of-the-art performance on the CAP and Nexar datasets.
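
One plausible form of the monotonicity constraint: penalize any drop in predicted risk between consecutive frames before the annotated collision frame. The exact adaptive loss in the paper may differ; this is an illustrative sketch:

```python
import numpy as np

def monotonicity_loss(risk, collision_t, margin=0.0):
    """Penalize decreases in predicted risk before the collision frame,
    pushing risk curves to rise coherently toward the collision.
    (A plausible form of the constraint; the paper's loss may differ.)"""
    pre = risk[: collision_t + 1]
    drops = np.maximum(pre[:-1] - pre[1:] + margin, 0.0)
    return float(drops.mean())

rising = np.array([0.1, 0.2, 0.4, 0.7, 0.9])   # coherent risk curve
jittery = np.array([0.1, 0.5, 0.2, 0.8, 0.9])  # non-monotone curve

print(monotonicity_loss(rising, 4), monotonicity_loss(jittery, 4))
```

A monotone curve incurs zero loss while the jittery one is penalized, which is how coherent risk evolution can be learned from collision-frame annotations alone.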

SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World

This paper proposes SafeDrive, an end-to-end planning framework that employs a trajectory-conditioned sparse world network (SWNet) to simulate future behaviors of critical entities, followed by a fine-grained reasoning network (FRNet) for per-instance collision assessment and per-timestep drivable-area compliance evaluation. SafeDrive achieves a PDMS of 91.6 with only 0.5% collision rate on NAVSIM, and a driving score of 66.8% on Bench2Drive.

SteelDefectX: A Coarse-to-Fine Vision-Language Dataset and Benchmark for Generalizable Steel Surface Defect Detection

This paper introduces SteelDefectX, the first vision-language dataset for steel surface defect detection (7,778 images, 25 defect categories), featuring coarse-to-fine textual annotations ranging from class-level to sample-level descriptions. A four-task benchmark is established covering pure-vision classification, vision-language classification, zero/few-shot recognition, and zero-shot transfer. Experiments demonstrate that high-quality textual annotations significantly improve model interpretability, generalization, and cross-domain transfer capability.

SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling

SubspaceAD demonstrates that fitting a single PCA model on features from a strong visual foundation model (DINOv2-G) is sufficient to outperform all few-shot anomaly detection methods requiring training, memory banks, or prompt tuning, achieving 98.0% image-level AUROC and 97.6% pixel-level AUROC on MVTec-AD under the 1-shot setting.
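
The method reduces to a few lines: fit a mean and a principal subspace on normal-sample features, then score test features by their reconstruction error outside that subspace. Random low-rank vectors stand in for DINOv2 features here:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, rank = 64, 200, 8

# Normal features lie near a low-dimensional subspace
# (synthetic stand-ins for DINOv2 patch features).
basis = rng.standard_normal((rank, d))
train = (rng.standard_normal((n_train, rank)) @ basis
         + 0.01 * rng.standard_normal((n_train, d)))

# "Fit a single PCA": mean plus top principal directions via SVD.
mu = train.mean(axis=0)
U, S, Vt = np.linalg.svd(train - mu, full_matrices=False)
P = Vt[:rank]  # principal subspace (rank x d)

def anomaly_score(x):
    """Reconstruction error outside the normal subspace."""
    r = x - mu
    return float(np.linalg.norm(r - (r @ P.T) @ P))

normal = rng.standard_normal(rank) @ basis
anomaly = normal + 5.0 * rng.standard_normal(d)  # off-subspace perturbation
print(anomaly_score(normal), anomaly_score(anomaly))
```

Normal samples reconstruct almost exactly while off-subspace perturbations score high, with no training loop, memory bank, or prompt tuning involved.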

TDATR: Improving End-to-End Table Recognition via Table Detail-Aware Learning and Cell-Level Visual Alignment

This paper proposes the TDATR framework, which achieves end-to-end table recognition under limited annotation data through a "perceive-then-fuse" strategy and a structure-guided cell localization module, attaining state-of-the-art performance across 7 benchmarks without dataset-specific fine-tuning.

Text-guided Fine-Grained Video Anomaly Understanding

This paper proposes the T-VAU framework, which achieves pixel-level spatiotemporal anomaly localization via an Anomaly Heatmap Decoder (AHD), and introduces a Region-Aware Anomaly Encoder (RAE) that injects heatmap evidence into an LVLM for unified reasoning over anomaly detection, localization, and semantic explanation.

Towards Faithful Multimodal Concept Bottleneck Models

This paper proposes f-CBM, the first faithful multimodal Concept Bottleneck Model framework, which mitigates unintended information leakage in concept representations via a differentiable leakage loss and improves concept detection accuracy with a Kolmogorov-Arnold Network (KAN) prediction head, achieving the best Pareto trade-off among task accuracy, concept detection, and leakage reduction.

VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension

VIRO embeds lightweight operator-level verification mechanisms (CLIP uncertainty verification + spatial logic verification) into a neuro-symbolic REC pipeline, enabling each reasoning step to self-verify and terminate early when no target exists. Under a zero-shot setting, it achieves 61.1% balanced accuracy, substantially outperforming compositional reasoning baselines, while maintaining a program failure rate below 0.3% and efficient inference speed.

Why Does It Look There? Structured Explanations for Image Classification

This paper proposes the I2X framework, which transforms unstructured explainability (saliency maps) into structured explanations by tracking the co-evolution of prototype intensity extracted via GradCAM and model confidence across training checkpoints. The framework reveals the reasoning structure underlying "why the model attends to a specific region" and leverages this understanding to guide fine-tuning for performance improvement.