
🔬 Interpretability

📷 CVPR2025 · 21 paper notes

📌 Same area in other venues: 📷 CVPR2026 (34) · 🔬 ICLR2026 (196) · 💬 ACL2026 (63) · 🧪 ICML2026 (92) · 🤖 AAAI2026 (37) · 🧠 NeurIPS2025 (80)

🔥 Top topics: Few-/Zero-Shot Learning ×2 · Domain Adaptation ×2

Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability: This paper proposes ALBM (Attribute-formed Language Bottleneck Model), which avoids spurious correlation reasoning by constructing an attribute-guided class-specific concept space, extracts fine-grained attribute features using visual attribute prompt learning, and automatically generates high-quality concept sets through a Description-Summary-Supplement (DSS) strategy, achieving better interpretability and scalability across 9 benchmarks.
Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability: This paper proposes the ALBM model, which replaces the class-shared concept space of existing Language Bottleneck Models (LBMs) with an Attribute-formed Class-specific Concept Space (ACCS) to address the issue of spurious cue reasoning and support cross-class generalization. Combined with Visual Attribute Prompt Learning (VAPL) to extract fine-grained attribute features, ALBM comprehensively outperforms existing interpretable classification methods on 9 few-shot benchmarks.
Differentiable Inverse Rendering with Interpretable Basis BRDFs: Proposes a differentiable inverse rendering method based on interpretable basis BRDFs, decomposing materials into combinations of physically meaningful basis functions to achieve interpretable material estimation.
Geometry-Guided Camera Motion Understanding in VideoLLMs: Proposes a complete framework spanning benchmark construction, diagnosis, and injection. Camera motion cues are extracted from a 3D foundation model (VGGT) and injected into the VideoLLM via structured prompting, which improves camera motion perception without any training.
Interpretable Image Classification via Non-parametric Part Prototype Learning: This paper proposes an interpretable image classification framework based on non-parametric prototype learning. It discovers semantically distinct object part prototypes by performing optimal transport clustering on self-supervised ViT features, addressing prototype redundancy issues in existing ProtoPNet methods, while introducing two new metrics, Distinctiveness and Comprehensiveness, to quantify explanation quality.
KVQ: Boosting Video Quality Assessment via Saliency-Guided Local Perception: Inspired by the human visual system, KVQ explicitly decouples global video quality into two factors: visual saliency and local texture. It extracts cross-region saliency via Fusion-Window Attention and enhances texture perception in independent regions using a Local Perception Constraint, significantly outperforming SOTA methods on five VQA benchmarks.
L-SWAG: Layer-Sample Wise Activation with Gradients Information for Zero-Shot NAS on Vision Transformers: This paper proposes L-SWAG (Layer-Sample Wise Activation with Gradients), a new general zero-cost proxy that evaluates network architecture quality by combining layer- and sample-wise activation and gradient information. It is the first to systematically extend zero-cost NAS to the Vision Transformer search space and establishes a new benchmark across 6 tasks in the Autoformer search space.
Language Guided Concept Bottleneck Models for Interpretable Continual Learning: This paper introduces language-guided Concept Bottleneck Models (CBMs) into continual learning. It uses ChatGPT to generate human-interpretable concepts and the CLIP text encoder to encode concept embeddings, constructing a concept bottleneck layer. This provides transparent decision explanations while mitigating catastrophic forgetting, outperforming the SOTA on ImageNet-subset by 3.06%.
Learning on Model Weights using Tree Experts: Discovers that most public models belong to a few Model Trees (fine-tuned from common ancestors), and learning weights within the same Tree is much simpler than across Trees. This paper proposes ProbeX, the first lightweight probing method targeting single hidden layer weights. Through Tucker tensor decomposition, it achieves a 30x reduction in parameter size and realizes the first zero-shot model classification (89.8% accuracy) by aligning model weights with text representations.
Learning Visual Composition through Improved Semantic Guidance: This paper proposes to significantly enhance the visual compositional understanding of standard CLIP models by improving the semantic supervision signals of training data (regenerating high-quality captions using foundation models and replacing training-from-scratch with a pre-trained text encoder). This improves performance on the ARO benchmark from CLIP's 59%/63% to 92%/94%, and on DOCCI image retrieval from 58.4% to 94.5% recall@1, without requiring any architectural modifications.
L-SWAG: Layer-Sample Wise Activation with Gradients information for Zero-Shot NAS on Vision Transformers: This paper proposes the L-SWAG metric, which characterizes the trainability and expressiveness of CNN and ViT networks through the product of layer-wise gradient variance and the cardinality of activation patterns. It further designs the LIBRA-NAS algorithm to combine complementary proxy metrics, achieving SOTA-level zero-shot NAS performance across ViT search spaces and 14 tasks.
On the Possible Detectability of Image-in-Image Steganography: This work theoretically and experimentally reveals that current popular deep learning-based image-in-image steganography schemes suffer from severe detectability vulnerabilities. The embedding process is essentially a mixing process that can be easily identified by Independent Component Analysis (ICA). An 8-dimensional feature vector consisting of the first four moments of independent components in the wavelet domain achieves a detection accuracy of 84.6%, while the classic SRM+SVM method achieves over 99% accuracy.
Open Ad-Hoc Categorization with Contextualized Feature Learning: This paper proposes OAK (Open Ad-hoc Categorization with Contextualized Feature Learning). By introducing a few learnable context tokens into the input layer of a frozen CLIP model and combining CLIP's vision-language alignment objective with the visual clustering objective of GCD, the method adaptively discovers ad-hoc categories and switches contexts with only a few labeled samples. It achieves an accuracy of 87.4% on novel classes of the Stanford Mood dataset, outperforming CLIP and GCD by over 50%.
Probing the Mid-Level Vision Capabilities of Self-Supervised Learning: Approaching the analysis from the perspective of childhood visual development, this work systematically evaluates the capabilities of 22 self-supervised learning (SSL) models on mid-level vision tasks (depth estimation, surface normals, object segmentation, geometric correspondence, etc.). The study reveals that while a substantial performance gap remains between SSL models and supervised models on high-level semantic tasks, this gap is significantly smaller for mid-level vision capabilities like 3D spatial perception.
Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis: Proposes Prompt-CAM, which makes interpretable fine-grained analysis almost "free". By injecting class-specific learnable prompt tokens into a pre-trained ViT, it uses the multi-head attention maps of the last layer to identify and localize critical traits that distinguish fine-grained categories.
Sample- and Parameter-Efficient Auto-Regressive Image Models: This paper proposes XTRA, which introduces a Block Causal Mask (using \(k \times k\) token blocks as the causal unit) into ViT. This allows auto-regressive image models to outperform previous state-of-the-art auto-regressive models on 15 image recognition benchmarks using only 1/152 of the training samples, while achieving superior probing performance with 1/7 to 1/16 of the parameter count.
Scaling Vision Pre-Training to 4K Resolution: This paper proposes PS3 (Pre-training with Scale-Selective Scaling), which scales CLIP-style vision pre-training to 4K resolution with near-constant computational overhead by replacing global image contrastive learning with localized region and local caption contrastive learning. Combined with top-down/bottom-up patch selection mechanisms, PS3 is used to build the VILA-HD multimodal large language model, which significantly outperforms GPT-4o and Qwen2.5-VL on high-resolution perception tasks.
TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction: This paper proposes TIDE, which leverages diffusion models and LLMs to automatically generate concept-level saliency map annotations to train locally interpretable domain generalization models. During testing, concept signatures are utilized for prediction correction, yielding an average performance improvement of 12% over SOTA on four standard DG benchmarks.
TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction: This paper proposes TIDE, a novel training scheme for single-source domain generalization. It leverages diffusion models and LLMs to automatically generate class-level concept annotations (e.g., "bird = sharp beak + wings + claws"). By training the model to focus on domain-invariant local concepts rather than global background features via a concept saliency alignment loss, the model can automatically correct erroneous predictions caused by domain shift during test time using concept saliency maps.
Towards Human-Understandable Multi-Dimensional Concept Discovery: Proposes the HU-MCD framework, which replaces traditional segmentation methods with SAM to discover human-understandable visual concepts and adds a CNN-specific input masking scheme to reduce noise interference, yielding concept-level model explanations that balance understandability and faithfulness under the completeness framework of MCD.
Why Does It Look There? Structured Explanations for Image Classification: This paper proposes the I2X framework, which transforms unstructured interpretability into structured interpretability by tracking the covariance between model confidence and the intensity changes of abstract prototypes extracted from GradCAM saliency maps across training checkpoints. It also uses the identified "uncertain prototypes" to guide fine-tuning, reduce inter-class confusion, and improve classification accuracy.
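
📝 Illustration: the Block Causal Mask mentioned in the XTRA entry above (with \(k \times k\) token blocks as the causal unit) can be sketched roughly as follows. This is a minimal sketch, not the paper's implementation. It assumes tokens lie on a `grid_h × grid_w` patch grid in raster order, that blocks are also visited in raster order, and that a token may attend to every token in its own block and in all earlier blocks. The function name `block_causal_mask` and its parameters are hypothetical.

```python
def block_causal_mask(grid_h, grid_w, k):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption (not taken from the paper): blocks are ordered in raster
    order, and attention is unrestricted inside a block.
    """
    if grid_h % k or grid_w % k:
        raise ValueError("grid_h and grid_w must be divisible by k")

    blocks_per_row = grid_w // k
    # Block index of every token, with tokens enumerated in raster order.
    block_ids = [
        (r // k) * blocks_per_row + (c // k)
        for r in range(grid_h)
        for c in range(grid_w)
    ]
    # Token i sees token j if j's block is the same as or earlier than i's.
    return [[bj <= bi for bj in block_ids] for bi in block_ids]


# Usage: a 4x4 patch grid split into 2x2 blocks gives four causal units.
mask = block_causal_mask(4, 4, 2)
```

With `k = 1` the sketch reduces to the ordinary token-level causal mask, which is one way to see the block mask as a coarser version of it.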