CE-FAM: Concept-Based Explanation via Fusion of Activation Maps¶
Conference: ICCV 2025 · arXiv: 2509.23849 · Code: None · Area: Interpretability · Keywords: Concept explanation, activation map fusion, Grad-CAM, VLM knowledge transfer, interpretability
TL;DR¶
CE-FAM is a concept explanation method that trains a branch network sharing activation maps with an image classifier to simulate VLM embeddings, establishing a one-to-one correspondence among concept prediction, concept region (a weighted sum of activation maps), and concept contribution (the concept's effect on the classification score). The paper also introduces a novel NRA evaluation metric and surpasses existing methods on zero-shot concept reasoning.
Background & Motivation¶
In Explainable AI (XAI), three levels of explanation are needed, yet few methods satisfy all simultaneously:
What: Which human-understandable concepts has the model learned?
Where: Which image regions correspond to these concepts?
How: How much does each concept contribute to the final prediction?
- Saliency map methods (Grad-CAM, etc.): can only highlight important regions, leaving the interpretation of "what" to the user.
- Concept bottleneck models (CBM, TCAV): can quantify concept contributions but cannot localize concept regions.
- Dissection methods (CLIP-Dissect, WWW): associate concepts with individual neurons, but suffer from many-to-many mapping—one concept may relate to multiple neurons, and the optimal correspondence varies across samples.
Core insight: representing a concept with a single activation map is fundamentally limited; concept regions should be expressed via weighted fusion of activation maps.
Method¶
Overall Architecture¶
CE-FAM workflow:
1. Training: A branch network learns to map the classifier's multi-layer activation embeddings to the VLM (CLIP) image embedding space.
2. Concept prediction: Similarity between the mapped embeddings and concept text embeddings is computed.
3. Concept region: Gradients of the concept prediction score are back-propagated to obtain activation map weights, generating concept-specific region maps.
4. Concept contribution: The drop in classification score is measured by masking important channels.
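The inference path (steps 1–2) can be sketched end to end with toy numpy tensors. All shapes, the 32-dim "CLIP" space, and the single linear map standing in for the MLP translator \(h\) are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activation maps from two classifier layers: (channels, H, W)
A1 = rng.standard_normal((8, 14, 14))   # low-level layer
A2 = rng.standard_normal((16, 7, 7))    # high-level layer

# Step 1: pool each layer's maps into an embedding and concatenate
z1 = A1.mean(axis=(1, 2))               # AvgPool -> (8,)
z2 = A2.mean(axis=(1, 2))               # AvgPool -> (16,)
z_cat = np.concatenate([z1, z2])        # (24,)

# Translator h: a single linear projection into a toy 32-dim
# "CLIP" embedding space (stand-in for the paper's MLP)
W_h = rng.standard_normal((32, z_cat.size)) * 0.1
img_emb = W_h @ z_cat
img_emb /= np.linalg.norm(img_emb)

# Step 2: concept prediction via cosine similarity against
# (toy) concept text embeddings
text_embs = rng.standard_normal((5, 32))
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)
concept_scores = text_embs @ img_emb    # S^t for each concept t
best_concept = int(np.argmax(concept_scores))
```

Steps 3–4 reuse the same branch: gradients of `concept_scores[t]` with respect to the activation maps give the fusion weights, and channel masking measures contribution.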
Key Designs¶
- Multi-Layer Concept Learning
Unlike conventional methods that use only the last-layer embedding, CE-FAM leverages embedding vectors from all CNN layers to capture features from low-level to high-level:
\(\mathbf{z}^l = \text{AvgPool}(A^l)\) \(\mathbf{z}_{\text{cat}} = \text{Concat}(\mathbf{z}^1, \mathbf{z}^2, \ldots, \mathbf{z}^L)\)
A translator function \(h\) (a simple MLP) projects \(\mathbf{z}_{\text{cat}}\) into the CLIP embedding space. The training loss is:
\(\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{emb}} + \lambda \mathcal{L}_{\text{sim}}\)
where \(\mathcal{L}_{\text{emb}} = \text{MSE}\big(h(\mathbf{z}_{\text{cat}}),\, E_{\text{image}}(\mathbf{x})\big)\) simulates the VLM embedding, and \(\mathcal{L}_{\text{sim}} = \text{MSE}\big(S^t,\, S_{\text{VLM}}^t\big)\) aligns concept prediction scores.
Key finding: even without \(\mathcal{L}_{\text{sim}}\) (i.e., without requiring a predefined concept set), the method still outperforms existing approaches—enabling zero-shot reasoning over concept labels.
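The two loss terms can be sketched in numpy (dimensions, random stand-ins for \(h(\mathbf{z}_{\text{cat}})\) and \(E_{\text{image}}(\mathbf{x})\), and the toy concept set are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

dim, n_concepts = 32, 5
h_z = rng.standard_normal(dim)       # h(z_cat): branch-network output
vlm_img = rng.standard_normal(dim)   # E_image(x): frozen VLM image embedding

# L_emb: MSE between the branch output and the VLM image embedding
l_emb = np.mean((h_z - vlm_img) ** 2)

# L_sim: MSE between the branch's concept scores S^t and the
# VLM's own scores S^t_VLM (cosine similarity to text embeddings)
text_embs = rng.standard_normal((n_concepts, dim))
s_branch = text_embs @ (h_z / np.linalg.norm(h_z))
s_vlm = text_embs @ (vlm_img / np.linalg.norm(vlm_img))
l_sim = np.mean((s_branch - s_vlm) ** 2)

lam = 0.001                          # lambda from the paper's training setup
l_total = l_emb + lam * l_sim
```

Dropping `l_sim` (set `lam = 0`) corresponds to the zero-shot variant noted above, which needs no predefined concept set at training time.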
- Concept Region via Activation Map Fusion
Extending the Grad-CAM idea, a region map is generated for each concept \(t\) at each layer \(l\):
\(R_t^l = \text{ReLU}\left(\sum_k \beta_k^t A_k^l\right)\) \(\beta_k^t = \frac{1}{\gamma} \sum_i \sum_j \frac{\partial S^t}{\partial A_k^l(i,j)}\) where \(\gamma\) normalizes over the spatial locations \((i,j)\), as in Grad-CAM.
Gradients of the concept prediction score \(S^t\) (rather than the class prediction score \(p^c\)) serve as weights, producing concept-specific region maps.
Layer selection: The most relevant layer is selected via a relevance score—Top-K important channels are masked one by one, the AUC of the resulting drop in \(S^t\) is measured, and the layer with the highest AUC is selected.
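The fusion step itself is a small computation. A numpy sketch, assuming the gradients \(\partial S^t / \partial A_k^l\) are already available (in practice they come from back-propagating the concept score) and taking \(\gamma = H \cdot W\):

```python
import numpy as np

rng = np.random.default_rng(2)

K, H, W = 16, 7, 7
A = rng.standard_normal((K, H, W))      # activation maps A_k^l of one layer
dS_dA = rng.standard_normal((K, H, W))  # dS^t/dA_k^l(i,j), assumed precomputed

# beta_k^t: spatially averaged gradient per channel (gamma = H * W)
beta = dS_dA.sum(axis=(1, 2)) / (H * W)             # (K,)

# Concept region map: ReLU over the beta-weighted sum of activation maps
R = np.maximum(0.0, np.einsum('k,khw->hw', beta, A))  # (H, W)
```

This is exactly the Grad-CAM recipe, with the concept score \(S^t\) in place of the class score, so one set of activation maps yields a different region map per concept.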
- Concept Contribution Quantification
Concept contribution can be quantified at any layer, not only the last:
  - The computation mirrors the relevance score, but targets the class prediction score \(p^c\).
  - The most important channels are progressively masked, and the AUC of the \(p^c\) drop curve is recorded.
  - A negative contribution (the score increases after masking) is also informative, indicating the concept has a negative influence on the prediction.
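The progressive-masking AUC can be sketched as follows. The linear head, the importance ranking by `|w_cls|` (a stand-in for the gradient-based channel weights), and all shapes are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

K = 16
A = rng.standard_normal((K, 5, 5))          # activation maps of one layer
w_cls = rng.standard_normal(K)              # toy linear classifier head

def class_score(act):
    """Toy p^c: linear head over channel-pooled activations."""
    return float(w_cls @ act.mean(axis=(1, 2)))

base = class_score(A)
# Rank channels by importance; |w_cls| stands in for gradient-based weights
order = np.argsort(-np.abs(w_cls))

scores = [base]
masked = A.copy()
for k in order[:8]:                         # progressively mask Top-K channels
    masked[k] = 0.0
    scores.append(class_score(masked))

# AUC of the drop curve (trapezoidal rule); negative drops mean the
# score rose after masking, i.e. a negative concept contribution
drops = base - np.array(scores)
auc = float(np.sum((drops[1:] + drops[:-1]) / 2.0))
```

The same loop with \(S^t\) in place of `class_score` gives the relevance score used for layer selection.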
- NRA Evaluation Metric (Normalized Region Accuracy)
Limitations of existing metrics: IoU is sensitive to threshold selection; VEA (AUC) yields high scores even for random results when the mask region is large.
NRA eliminates the influence of sample distribution through normalization:
\(\text{NRA} = \frac{\text{AUC} - \text{AUC}_{\text{low}}}{\text{AUC}_{\text{high}} - \text{AUC}_{\text{low}}}\)
where \(\text{AUC}_{\text{high}}\) is the AUC under the ideal (GT segmentation) condition and \(\text{AUC}_{\text{low}}\) is the AUC for a random region. NRA measures the relative position of the predicted region between random and ideal.
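The normalization is a one-liner; the three AUC values below are illustrative, not from the paper:

```python
def nra(auc_pred, auc_low, auc_high):
    """Normalized Region Accuracy: relative position of the predicted
    region's AUC between a random region (auc_low) and the ideal
    GT-segmentation region (auc_high)."""
    return (auc_pred - auc_low) / (auc_high - auc_low)

# A prediction halfway between random and ideal scores 0.5;
# matching the GT region scores 1.0, a random region scores 0.0.
halfway = nra(0.5, 0.2, 0.8)
ideal = nra(0.8, 0.2, 0.8)
random_region = nra(0.2, 0.2, 0.8)
```

Because `auc_low` and `auc_high` are computed per sample, samples with intrinsically easy (large) or hard (small) regions no longer inflate or deflate the score, which is the distribution effect the metric is designed to remove.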
Loss & Training¶
- Dataset: ImageNet for concept learning training
- Optimizer: SGD, initial LR 0.1, warmed up to 0.2
- Up to 150 epochs with early stopping; the LR is multiplied by 0.1 when the validation loss fails to decrease for 4 consecutive epochs
- \(\lambda = 0.001\)
Key Experimental Results¶
Main Results¶
Concept region evaluation on the Broden dataset (ResNet50 classifier + CLIP VLM):
| Method | EPG(Object) | EPG(Avg) | NRA(Object) | NRA(Avg) | Hit Rate(Object) | Hit Rate(Avg) |
|---|---|---|---|---|---|---|
| CLIP-Dissect | 0.197 | 0.146 | 0.327 | 0.334 | 0.215 | 0.199 |
| WWW | 0.179 | 0.117 | 0.322 | 0.278 | 0.154 | 0.114 |
| CE-FAM | 0.233 | 0.154 | 0.459 | 0.361 | 0.436 | 0.247 |
CE-FAM exceeds the best baseline by 8% on NRA and 13% on Hit Rate (proportion of samples with NRA > 0.5).
Evaluation on the ImageNet-S dataset (ViT-B/16 classifier):
| Method | EPG | NRA | Hit Rate |
|---|---|---|---|
| CLIP-Dissect | 0.047 | 0.105 | 0.013 |
| WWW | 0.076 | 0.232 | 0.017 |
| CE-FAM | 0.138 | 0.273 | 0.193 |
The advantage is particularly pronounced on ViT, where single-channel activation maps are noisier.
Ablation Study¶
Concept region evaluation under different configurations (ResNet50):
| VLM | Sim Loss | Multi-Layer | EPG Avg | NRA Avg | Hit Rate Avg |
|---|---|---|---|---|---|
| CLIP | - | - | 0.152 | 0.338 | 0.209 |
| CLIP | ✓ | - | 0.156 | 0.348 | 0.234 |
| CLIP | ✓ | ✓ | 0.154 | 0.361 | 0.247 |
| SigLIP | - | - | 0.151 | 0.383 | 0.283 |
| SigLIP | ✓ | ✓ | 0.157 | 0.388 | 0.295 |
Key Findings¶
- Activation map fusion substantially outperforms single-channel association: Hit Rate improves from 11.4% (WWW) to 24.7%, demonstrating that concepts should not be represented by individual neurons.
- Multi-layer features are complementary: Using multi-layer embeddings improves NRA by 2.3 points and Hit Rate by 3.8 points over using only the last layer; low-level features are important for concepts such as color.
- Zero-shot concept reasoning is effective: Without \(\mathcal{L}_{\text{sim}}\), the method still outperforms existing methods that require concept labels.
- VLM choice has a significant impact: SigLIP outperforms CLIP; a VLM with better image representations directly improves concept learning.
- High training efficiency: The method surpasses existing approaches within only a few epochs.
- Misclassification analysis: Concept contributions can explain misclassification (e.g., an indigo bunting misclassified as a goldfinch due to negative contribution of the yellow concept).
Highlights & Insights¶
- First to establish a one-to-one correspondence among concept label, region, and contribution: completing the "What-Where-How" trinity of concept explanation.
- General framework: applicable to any image classifier using activation maps (both CNN and ViT).
- Well-motivated NRA metric: normalization effectively removes the influence of sample distribution on evaluation.
- Zero-shot capability: no concept-annotated dataset is required; arbitrary concepts can be handled using VLM knowledge alone.
Limitations & Future Work¶
- Concept expressiveness is bounded by VLM performance: CLIP tends to be dominated by salient image features, making fine-grained concept prediction difficult.
- Sensitivity to concept set selection: an overly large concept set introduces irrelevant concept noise.
- Concept contribution lacks ground-truth data for quantitative validation, making verification challenging.
- Computational cost scales with the number of layers and concepts; optimization is needed for large-scale deployment.
Related Work & Insights¶
- The approach of extending Grad-CAM from "class explanation" to "concept explanation" is natural and elegant.
- Representing concepts as weighted fusions of activation maps effectively continues the multi-channel representation philosophy of Net2Vec.
- The NRA metric can be generalized to other XAI tasks requiring evaluation of region accuracy.
Rating¶
- Novelty: ⭐⭐⭐⭐ Representing concept regions via activation map fusion and the one-to-one correspondence framework are novel contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ Two datasets (Broden and ImageNet-S), multiple classifiers, multiple VLMs, ablation studies, and qualitative analysis.
- Writing Quality: ⭐⭐⭐⭐ Problem definition is clear, and the evaluation metric design is well-justified.
- Value: ⭐⭐⭐⭐ The general framework is applicable to model debugging and misclassification diagnosis.