Explaining CLIP Zero-shot Predictions Through Concepts¶

Conference: CVPR 2026
arXiv: 2603.28211
Code: https://github.com/oonat/ezpc
Area: Information Retrieval
Keywords: CLIP, Zero-shot Classification, Concept Bottleneck Model, Interpretability, Vision-Language Model

TL;DR¶

This paper proposes EZPC, which learns a linear projection matrix \(A\) to jointly map CLIP image and text embeddings into an interpretable concept space. The method provides faithful, human-understandable explanations for CLIP predictions with negligible accuracy loss (an H-mean gap of roughly 1 point on CIFAR-100, CUB, and ImageNet-100) and an inference overhead of only about 0.1 ms per image.

Background & Motivation¶

Background: Vision-language models (VLMs) such as CLIP have achieved remarkable success in zero-shot image recognition by aligning images and text in a shared semantic space, enabling recognition of arbitrary categories without task-specific training. Concept Bottleneck Models (CBMs), meanwhile, provide interpretable reasoning through a human-defined concept layer, but require concept annotations and cannot generalize to unseen categories.
Limitations of Prior Work: CLIP's high-dimensional embeddings are entangled black boxes—users cannot understand why the model associates a given image with a particular label. Although CBMs are interpretable, they require concept supervision and are limited to a closed-world setting. SpLiCE decomposes CLIP embeddings into concept combinations but requires per-image optimization (59× slower than CLIP), while Z-CBM demands a large concept vocabulary and expensive regression.
Key Challenge: Interpretability and open-world generalization appear to be mutually exclusive—CBMs offer interpretability but lack generalization, while CLIP generalizes but is not interpretable.
Goal: How can CLIP's zero-shot capability be preserved while making its predictions explainable through human-understandable concepts?
Key Insight: CLIP's internal representations may already implicitly encode human-understandable semantic structure, requiring only an appropriate projection to "decode" it.
Core Idea: Learn a single linear projection matrix \(A\) to jointly map CLIP image and text embeddings into a predefined concept space, using a matching loss to preserve interpretability and a reconstruction loss to maintain semantic fidelity.

Method¶

Overall Architecture¶

The EZPC pipeline: (1) define a set of \(m\) human-understandable concepts (e.g., "has feathers", "made of metal"); (2) learn a projection matrix \(A \in \mathbb{R}^{d \times m}\) mapping CLIP's \(d\)-dimensional embedding space to an \(m\)-dimensional concept space; (3) perform zero-shot classification in concept space via the dot product between the image concept vector \(c_x = v_x A\), where \(v_x\) is the CLIP image embedding, and the class concept vector \(c_k\), obtained by projecting the text embedding of class \(k\) with the same \(A\); (4) decompose each concept's contribution via the element-wise product \(s_{x,k} = c_x \odot c_k\) to provide faithful explanations.
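
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for real CLIP embeddings and for a trained \(A\), and the function name `explain_prediction` and the toy sizes are hypothetical.

```python
import numpy as np

def explain_prediction(v_x, T, A, top_k=3):
    """Classify one image in concept space and rank concept contributions.

    v_x: (d,) image embedding; T: (K, d) class text embeddings; A: (d, m).
    """
    c_x = v_x @ A                      # image concept activations, shape (m,)
    C_Y = T @ A                        # class concept activations, shape (K, m)
    logits = C_Y @ c_x                 # <c_x, c_k> for every class k
    y_hat = int(np.argmax(logits))     # predicted class index
    s = c_x * C_Y[y_hat]               # element-wise contributions s_{x,k}
    top = np.argsort(-s)[:top_k]       # concepts with the largest contributions
    return y_hat, logits, s, top

# Toy stand-ins (assumption): random data instead of CLIP outputs and a trained A.
rng = np.random.default_rng(0)
d, m, K = 8, 5, 3
v_x = rng.normal(size=d)
T = rng.normal(size=(K, d))
A = rng.normal(size=(d, m))

y_hat, logits, s, top = explain_prediction(v_x, T, A)
print("predicted class:", y_hat)
print("top concept indices:", top.tolist())
print("contributions sum to logit:", bool(np.isclose(s.sum(), logits[y_hat])))
```

With real data, the indices in `top` would be mapped back to the concept phrases to produce the textual explanation.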

Key Designs¶

Shared Concept Projection:
- Function: Uniformly maps both image and text embeddings into an interpretable concept space.
- Mechanism: A learnable projection matrix \(A \in \mathbb{R}^{d \times m}\) is defined, where each column corresponds to a concept direction. Image concept activations are computed as \(c_x = v_x A\) from the CLIP image embedding \(v_x\), and class concept activations as \(C_\mathcal{Y} = T A\), where \(T\) stacks the CLIP text embeddings of the classes and \(c_k\) is the \(k\)-th row of \(C_\mathcal{Y}\). Classification is performed in concept space via \(\hat{y} = \arg\max_k \langle c_x, c_k \rangle\). Since \(\langle c_x, c_k \rangle = \sum_{j=1}^{m} s_{x,k}^{(j)}\) with \(s_{x,k} = c_x \odot c_k\), the contribution of each concept dimension to the prediction is directly decomposable.
- Design Motivation: Using a single unified projection rather than per-image optimization ensures high efficiency; the linear projection guarantees explanation faithfulness—concept scores directly constitute classification logits, so explanations and the decision process are fully consistent.
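
The decomposition can be written out explicitly. The notation \(a_j\) for the \(j\)-th column of \(A\) and \(t_k\) for the \(k\)-th row of \(T\) is introduced here for illustration only and may differ from the paper's:

\[
\langle c_x, c_k \rangle
= (v_x A)(t_k A)^\top
= \sum_{j=1}^{m} (v_x a_j)(t_k a_j)
= \sum_{j=1}^{m} c_x^{(j)} c_k^{(j)}
= \sum_{j=1}^{m} s_{x,k}^{(j)}
\]

Each term \(s_{x,k}^{(j)}\) is the product of how strongly the image and the class each activate concept \(j\), so the per-concept scores add up exactly to the class logit.
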
Matching Loss:
- Function: Ensures the columns of the projection matrix \(A\) remain aligned with known concept embeddings, preserving interpretability.
- Mechanism: All concept phrases are encoded with the CLIP text encoder to obtain \(\Phi \in \mathbb{R}^{d \times m}\). \(A\) is initialized as \(\Phi\) and constrained via the MSE loss \(\mathcal{L}_{\text{match}} = \frac{1}{dm}\sum_{i,j}(A_{ij} - \Phi_{ij})^2\) to prevent it from drifting away from the concept basis.
- Design Motivation: Without this constraint, \(A\) may drift during training toward directions that no longer correspond to interpretable concepts. The matching loss strikes a balance between flexibility and interpretability.
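
A minimal NumPy sketch of the matching loss as defined above, assuming \(A\) and \(\Phi\) are plain `(d, m)` arrays. The function name `matching_loss` and the toy values are hypothetical.

```python
import numpy as np

def matching_loss(A, Phi):
    """Mean squared error between A and the concept embeddings Phi, both (d, m)."""
    if A.shape != Phi.shape:
        raise ValueError("A and Phi must have the same shape")
    # np.mean averages over all d*m entries, i.e. the 1/(dm) factor in the formula.
    return float(np.mean((A - Phi) ** 2))

# Toy check (assumption): random Phi stands in for encoded concept phrases.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(8, 5))
A_init = Phi.copy()            # A initialized as Phi, so the loss starts at zero
A_drifted = Phi + 0.1          # every entry moved by 0.1

print(matching_loss(A_init, Phi))
print(matching_loss(A_drifted, Phi))
```

Because \(A\) starts at \(\Phi\), the loss is zero at initialization and only grows as training moves columns of \(A\) away from their concept embeddings.
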
Reconstruction Loss:
- Function: Ensures the similarity distribution in concept space is consistent with that in CLIP's original embedding space.
- Mechanism: KL divergence is used to measure the discrepancy between the concept-space distribution and CLIP's original distribution: \(\mathcal{L}_{\text{recon}} = \frac{1}{B}\sum_{i=1}^{B} \text{KL}(\text{softmax}(c_i C_\mathcal{Y}^\top) \| \text{softmax}(v_i T^\top))\)
- Design Motivation: Ensures that classification decisions after concept decomposition remain consistent with CLIP's original predictions, so that adding the interpretable layer preserves semantic fidelity and does not alter the model's core judgments.
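
The reconstruction loss can be sketched as follows, taking the formula above literally: KL from the concept-space softmax to the CLIP softmax, averaged over the batch, with no temperature or logit scale (the formula above does not show one). Function names and toy data are hypothetical.

```python
import numpy as np

def log_softmax(z):
    """Row-wise log-softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def reconstruction_loss(V, T, A):
    """Batch-mean KL(softmax(C C_Y^T) || softmax(V T^T)).

    V: (B, d) image embeddings; T: (K, d) class text embeddings; A: (d, m).
    """
    C = V @ A                              # image concept vectors, (B, m)
    C_Y = T @ A                            # class concept vectors, (K, m)
    log_p = log_softmax(C @ C_Y.T)         # concept-space class distribution
    log_q = log_softmax(V @ T.T)           # original CLIP class distribution
    kl = (np.exp(log_p) * (log_p - log_q)).sum(axis=1)
    return float(kl.mean())

# Toy stand-ins (assumption): random data instead of CLIP embeddings.
rng = np.random.default_rng(0)
B, d, m, K = 4, 6, 6, 3
V = rng.normal(size=(B, d))
T = rng.normal(size=(K, d))

loss_identity = reconstruction_loss(V, T, np.eye(d))            # A = I keeps logits unchanged
loss_random = reconstruction_loss(V, T, rng.normal(size=(d, m)))
print(loss_identity, loss_random)
```

When \(A A^\top = I\) the concept-space logits equal CLIP's and the loss vanishes; any other \(A\) gives a non-negative penalty.
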

Loss & Training¶

Total loss: \(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{match}} + \lambda \mathcal{L}_{\text{recon}}\)
\(\lambda = 1\) for most datasets; \(\lambda = 5\) for CUB and Places365.
The concept set is drawn from the GPT-3-generated concept vocabulary used in LF-CBM, augmented with a large ImageNet-1k concept pool for broader coverage.
All experiments use an 80/20 seen/unseen class split.
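
The training setup above can be sketched in plain Python. The per-dataset \(\lambda\) values come from the text; everything else is an assumption made for illustration, including the function names, the dictionary layout, and the use of a seeded random shuffle to build the 80/20 class split (the summary does not say how classes are assigned).

```python
import random

# Lambda values stated above: 5 for CUB and Places365, 1 otherwise.
RECON_WEIGHT = {"CUB": 5.0, "Places365": 5.0}
DEFAULT_RECON_WEIGHT = 1.0

def total_loss(l_match, l_recon, dataset):
    """L_total = L_match + lambda * L_recon with the dataset-specific lambda."""
    lam = RECON_WEIGHT.get(dataset, DEFAULT_RECON_WEIGHT)
    return l_match + lam * l_recon

def split_classes(class_names, seen_fraction=0.8, seed=0):
    """Split class names into seen/unseen sets (random assignment is an assumption)."""
    names = list(class_names)
    random.Random(seed).shuffle(names)
    n_seen = round(seen_fraction * len(names))
    return names[:n_seen], names[n_seen:]

seen, unseen = split_classes([f"class_{i}" for i in range(100)])
print(len(seen), len(unseen))
print(total_loss(0.5, 0.25, "CUB"), total_loss(0.5, 0.25, "CIFAR-100"))
```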

Key Experimental Results¶

Main Results¶

Generalized Zero-Shot Performance (Harmonic Mean):

Dataset	CLIP	Z-CBM	SpLiCE	EZPC
CIFAR-100	0.408	0.365	0.270	0.403
ImageNet-100	0.693	0.585	0.389	0.682
CUB	0.474	0.189	0.070	0.465
ImageNet-1k	0.530	0.462	0.300	0.481
Places365	0.362	0.357	0.282	0.352
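
For reference, the sketch below shows how a harmonic mean of seen and unseen accuracy is conventionally computed in generalized zero-shot evaluation, and derives each method's gap to CLIP from the table above. The helper names are hypothetical, and the per-split accuracies behind the table are not listed in this summary, so `harmonic_mean` is shown only as the standard definition.

```python
# H-mean values copied from the table above.
H_MEAN = {
    "CIFAR-100":    {"CLIP": 0.408, "Z-CBM": 0.365, "SpLiCE": 0.270, "EZPC": 0.403},
    "ImageNet-100": {"CLIP": 0.693, "Z-CBM": 0.585, "SpLiCE": 0.389, "EZPC": 0.682},
    "CUB":          {"CLIP": 0.474, "Z-CBM": 0.189, "SpLiCE": 0.070, "EZPC": 0.465},
    "ImageNet-1k":  {"CLIP": 0.530, "Z-CBM": 0.462, "SpLiCE": 0.300, "EZPC": 0.481},
    "Places365":    {"CLIP": 0.362, "Z-CBM": 0.357, "SpLiCE": 0.282, "EZPC": 0.352},
}

def harmonic_mean(seen_acc, unseen_acc):
    """Standard harmonic mean H = 2 * S * U / (S + U)."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

def gap_to_clip(dataset, method):
    """Gap between CLIP and a method in H-mean points (H-mean x 100)."""
    row = H_MEAN[dataset]
    return round((row["CLIP"] - row[method]) * 100, 1)

for dataset in H_MEAN:
    gaps = {name: gap_to_clip(dataset, name) for name in ("Z-CBM", "SpLiCE", "EZPC")}
    print(dataset, gaps)
```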

Inference Efficiency Comparison:

Method	Latency (ms/img)	Overhead
CLIP	5.77	1.0×
Z-CBM	542.34	94.0×
SpLiCE	338.51	58.7×
EZPC	5.90	~1.0×
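
The overhead column is the ratio of each method's latency to CLIP's. A short check using the values from the table above (the helper names are hypothetical):

```python
# Latencies in ms per image, copied from the table above.
LATENCY_MS = {"CLIP": 5.77, "Z-CBM": 542.34, "SpLiCE": 338.51, "EZPC": 5.90}

def overhead(method):
    """Latency relative to plain CLIP."""
    return LATENCY_MS[method] / LATENCY_MS["CLIP"]

def extra_ms(method):
    """Absolute latency added on top of plain CLIP, in ms per image."""
    return LATENCY_MS[method] - LATENCY_MS["CLIP"]

for name in LATENCY_MS:
    print(f"{name}: {overhead(name):.1f}x, +{extra_ms(name):.2f} ms")
```

EZPC adds about 0.13 ms per image, which is where the "~0.1 ms" overhead figure comes from.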

Ablation Study¶

\(\lambda\)	Zero-shot Seen	Zero-shot Unseen	GZS H-mean
0.01	0.377	0.508	0.358
0.1	0.654	0.820	0.630
1	0.699	0.851	0.682
10	0.707	0.859	0.695
100	0.706	0.857	0.692

Key Findings¶

EZPC stays within roughly one H-mean point of CLIP on four of the five datasets (CIFAR-100: −0.5, ImageNet-100: −1.1, CUB: −0.9, Places365: −1.0), whereas Z-CBM and SpLiCE trail CLIP by up to 28.5 and 40.4 points respectively (both on CUB).
A quantitative–qualitative trade-off exists for \(\lambda\): larger \(\lambda\) improves quantitative metrics (better preservation of CLIP's distribution), but qualitative analysis shows that smaller \(\lambda\) (e.g., 1) yields more semantically relevant concept activations.
The concept space exhibits good spatial alignment: on the Indigo Bunting class in CUB, the Pointing Accuracy for the positive concept "blue-grey body" reaches 96.7%, while the negative concept "red face" scores near zero.
Cross-dataset transfer is effective: a projection matrix trained on ImageNet-100 transfers directly to CIFAR-100 and CUB with performance close to CLIP.
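
Transfer needs no retraining because a new label set only changes the text embeddings that are projected through the same \(A\). The sketch below illustrates this under stated assumptions: random arrays stand in for CLIP embeddings and for a projection trained on a source dataset, and `zero_shot_predict` is a hypothetical helper.

```python
import numpy as np

def zero_shot_predict(V, T, A):
    """Predict class indices for a batch of images with a fixed projection A.

    V: (B, d) image embeddings; T: (K, d) class text embeddings; A: (d, m).
    """
    concept_logits = (V @ A) @ (T @ A).T   # (B, K) dot products in concept space
    return np.argmax(concept_logits, axis=1)

rng = np.random.default_rng(0)
d, m = 8, 5
A = rng.normal(size=(d, m))                # stands in for a projection trained on a source dataset
V = rng.normal(size=(6, d))                # stand-in image embeddings from a target dataset
T_source = rng.normal(size=(10, d))        # stand-in text embeddings of 10 source classes
T_target = rng.normal(size=(4, d))         # stand-in text embeddings of 4 target classes

preds_source = zero_shot_predict(V, T_source, A)
preds_target = zero_shot_predict(V, T_target, A)   # same A, different label set
print(preds_source.tolist(), preds_target.tolist())
```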

Highlights & Insights¶

A minimalist interpretability solution: a single linear projection matrix delivers concept-level explanations with near-zero inference overhead (~0.1ms), making it suitable for large-scale deployment—in stark contrast to SpLiCE (59× slower) and Z-CBM (94× slower), both of which require per-image optimization.
Guaranteed faithfulness of explanations: since concept scores directly constitute prediction logits (\(\langle c_x, c_k \rangle = \sum_j s_{x,k}^{(j)}\)), explanations are constructively faithful rather than post-hoc attributions—a stronger guarantee than saliency-map-based methods.
Balanced dual-objective design: \(\mathcal{L}_{\text{match}}\) preserves interpretability, \(\mathcal{L}_{\text{recon}}\) preserves performance, and \(\lambda\) controls the trade-off. This design pattern is transferable to other tasks requiring a balance between interpretability and accuracy.

Limitations & Future Work¶

The linear projection assumption limits expressive capacity: highly nonlinear semantic relationships may not be fully captured in the concept space.
Dependence on concept set quality: interpretability relies on the quality and diversity of the concept vocabulary; biases in the concept set affect explanation fidelity.
Restricted to classification tasks: the current method focuses on classification; extending it to multimodal reasoning, VQA, and related tasks remains an open problem.
Larger performance gap on ImageNet-1k (about 4.9 H-mean points, 0.481 vs. CLIP's 0.530): information loss from concept decomposition becomes more pronounced at scale.
Future directions: nonlinear concept mappings, adaptive concept discovery, and integration with LLMs for dynamic concept vocabulary expansion.

Related Work & Insights¶

vs. SpLiCE: SpLiCE sparsely decomposes CLIP embeddings into combinations of concept vectors but requires per-image optimization, making it 59× slower. EZPC learns a unified projection with near-free inference.
vs. Z-CBM: Z-CBM reconstructs embeddings from a concept library for zero-shot CBM but requires a large concept pool and expensive regression (94× slower). EZPC is substantially more efficient and achieves better performance.
vs. LF-CBM: LF-CBM requires concept-annotated training and is a closed-set method. EZPC leverages LF-CBM's concept set while enabling open-world zero-shot generalization.
The finding that CLIP's internal representations naturally encode human-alignable structure has theoretical value for understanding the semantic organization of large-scale pretrained models.

Rating¶

Novelty: ⭐⭐⭐⭐ The linear projection + dual-loss formulation is elegant and concise, though limited in technical depth.
Experimental Thoroughness: ⭐⭐⭐⭐ Five datasets, multiple qualitative analyses, cross-domain experiments, and efficiency comparisons constitute a comprehensive evaluation.
Writing Quality: ⭐⭐⭐⭐⭐ Clear logical structure, rigorous mathematical notation, and intuitive experimental presentation.
Value: ⭐⭐⭐⭐ Provides a practical and efficient solution for VLM interpretability with real deployment potential.