Measuring the (Un)Faithfulness of Concept-Based Explanations

Conference: CVPR 2026 · arXiv: 2504.10833 · Code: available (released per paper statement) · Area: Explainable AI / Model Interpretability
Keywords: concept explanations, faithfulness measurement, unsupervised concept methods, surrogate models, interpretability evaluation

TL;DR

This paper demonstrates that the faithfulness of existing unsupervised concept-based explanation methods (U-CBEMs) is systematically overestimated — due to the use of overly complex surrogate models and flawed deletion-based evaluation. The authors propose SURF (Surrogate Faithfulness), a simple linear surrogate with a dual-space metric framework, validated through a sanity check that "random concepts should be less faithful," and provide the first systematic benchmark revealing that multiple SOTA U-CBEMs are in fact not faithful.

Background & Motivation

  1. Background: Deep visual models are difficult to interpret. Concept-based explanation methods (CBEMs) improve interpretability by decomposing intermediate model representations into human-understandable semantic concepts (e.g., edges, colors, object parts). Unsupervised CBEMs (U-CBEMs) automatically discover concept activation vectors (CAVs) and their importance scores, eliminating the need for manual concept annotation.

  2. Limitations of Prior Work: The central evaluation criterion for U-CBEMs is faithfulness — whether an explanation truly reflects the model's internal computation. However, existing evaluations suffer from two systematic problems: (a) Overly complex surrogates — ICE-Eval first reconstructs embeddings via NMF before passing them through the original model, while C-SHAP-Eval trains an additional MLP; these complex surrogates make U-CBEMs appear faithful without the explanations clearly leading to model outputs; (b) Unreliable deletion-based evaluation — inferring faithfulness indirectly by observing performance drops after concept removal, but post-deletion inputs may fall off the data manifold, making model behavior in such regions unpredictable.

  3. Key Challenge: There is an inherent tension between interpretability and faithfulness — making explanations simple and comprehensible necessarily entails information loss and reduced faithfulness. Prior evaluation methods artificially allowed U-CBEMs to simultaneously exhibit high interpretability and high faithfulness by permitting complex surrogates and single-class metrics, even though the explanations do not clearly lead to the model's predictions.

  4. Goal: (a) Unify the fragmented landscape of faithfulness evaluation frameworks; (b) Design a faithfulness metric satisfying three criteria (simple surrogate, inclusion of concept importance, full-output measurement); (c) Conduct the first fair faithfulness benchmark of existing U-CBEMs.

  5. Key Insight: The authors propose an extremely concise sanity check — replacing concepts with random vectors should decrease faithfulness. Remarkably, both ICE-Eval and C-SHAP-Eval fail this check.

  6. Core Idea: Using the model's own final linear layer structure as a surrogate template, the authors propose SURF — a zero-parameter linear surrogate — paired with dual metrics in logit space (MAE) and probability space (EMD), enabling accurate and reliable faithfulness evaluation of U-CBEMs.

Method

Overall Architecture

SURF evaluation pipeline: for any explanation generated by a U-CBEM (comprising CAVs \(V_i\) and concept importances \(A_i\)), SURF uses a linear surrogate to map concept representations to the model's output space, then measures the discrepancy between the surrogate output and the true model output in both logit and probability spaces. The entire process introduces no trainable parameters and incurs negligible computation (200 FLOPs vs. 205M FLOPs for C-SHAP-Eval).
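At its core, the surrogate is nothing more than nested dot products. The helper below is my own sketch in plain Python (the function name and the dot-product form of the projection \(\mathcal{P}\) are assumptions, not the paper's released code):

```python
import math

def surf_logits(H, V, A):
    """SURF-style linear surrogate (sketch): predict logits from a U-CBEM explanation.

    H: list of J spatial embeddings h_j (each a length-D list of floats).
    V: V[i][k] is the unit-norm CAV v_{i,k} for class i, concept k.
    A: A[i][k] is the concept importance alpha_{i,k}.

    Assumes the projection P(h_j; V_i)_k is the dot product h_j . v_{i,k}.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # y_hat_i = sum_j sum_k alpha_{i,k} * P(h_j; V_i)_k
    return [sum(A[i][k] * dot(h, V[i][k])
                for h in H for k in range(len(V[i])))
            for i in range(len(V))]

# Sanity check: with one concept per class read off a true final layer
# (v_i = w_i / ||w_i||, alpha_i = ||w_i||, J = 1), the surrogate
# reproduces the model's exact logits y = W h.
W = [[1.0, -2.0, 0.5], [0.0, 3.0, 1.0]]
h = [0.2, -0.1, 0.4]
norms = [math.sqrt(sum(w * w for w in row)) for row in W]
V = [[[w / n for w in row]] for row, n in zip(W, norms)]
A = [[n] for n in norms]
y_hat = surf_logits([h], V, A)
y_true = [sum(w * x for w, x in zip(row, h)) for row in W]
assert all(abs(a - b) < 1e-12 for a, b in zip(y_hat, y_true))
```

The closing assertion illustrates why the surrogate is zero-parameter: a perfect explanation (the final layer's own weights, renormalized) yields zero surrogate error by construction.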

Key Designs

  1. Unified Faithfulness Framework:

    • Function: Subsumes all existing U-CBEM faithfulness evaluation methods under a single perspective.
    • Mechanism: Any faithfulness metric consists of three components — a metric \(d\) (how output discrepancy is compared), a surrogate \(s\) (how outputs are derived from explanations), and a concept projection \(\mathcal{P}\) (how embeddings are mapped to concept space). Deletion-based methods assess faithfulness by removing concepts in a "deletion space" (pixels/weights/concepts) and observing model degradation; surrogate-based methods directly approximate the faithfulness integral in Eq. 3 via a surrogate.
    • Design Motivation: Each U-CBEM paper proposes its own faithfulness metric, precluding cross-method comparison. The unified framework exposes systematic differences across methods.
  2. SURF Surrogate Design:

    • Function: Predict model outputs from U-CBEM explanations in the simplest possible manner.
    • Mechanism: The model's final linear layer computes \(y_i = \sum_j \mathbf{h}_j^T \mathbf{f}_{i,j}\), which decomposes as \(y_i = \sum_j \mathbf{h}_j^T \mathbf{v}_{i,j} \alpha_{i,j}\), where \(\mathbf{v}_{i,j}\) is a normalized direction (i.e., a CAV) and \(\alpha_{i,j}\) is its norm (i.e., importance). The SURF surrogate directly mimics this structure: \(\hat{y}_i = \sum_j \sum_k \alpha_{i,k} \mathcal{P}(\mathbf{h}_j; V_i)_k\), replacing the final-layer weights with the CAVs and importances discovered by the U-CBEM.
    • Design Motivation: This surrogate is zero-parameter and non-reconstructive — it does not require first reconstructing the embedding and then passing it through the original model. It directly tests whether "the concepts and importances in an explanation can predict model outputs via linear combination," which is precisely the mental operation a human interpreter must perform.
  3. Three Desiderata:

    • Function: Define the conditions a good faithfulness metric must satisfy.
    • Mechanism: (1) Surrogate should be as simple as possible — a complex surrogate merely pushes the complexity of the explanation downstream; human interpreters still cannot understand how the explanation leads to the prediction. (2) Use all components of the explanation — particularly concept importances \(A_i\); if the surrogate does not use \(A_i\), incorrect importances will not affect the faithfulness score. (3) Measure error across all output classes — evaluating only the predicted or ground-truth class ignores large deviations on other classes.
    • Design Motivation: Both ICE-Eval and C-SHAP-Eval violate all three desiderata — the former relies on reconstruction and ignores \(A_i\), the latter introduces a trainable MLP and also ignores \(A_i\), and both measure only a single class.
  4. SURF Dual Metrics:

    • Function: Comprehensively evaluate faithfulness in two complementary spaces — logit and probability.
    • Mechanism: \(\text{SURF}_{\text{MAE}} = \frac{1}{|\mathcal{V}|C} \sum_{\mathbf{x}} \sum_i |y_i - \hat{y}_i|\) measures absolute error in logit space; \(\text{SURF}_{\text{EMD}} = \frac{1}{2|\mathcal{V}|} \sum_{\mathbf{x}} \sum_i |p_i - \hat{p}_i|\) measures distributional distance in probability space.
    • Design Motivation: Logit space is unaffected by normalization but has an unbounded range; probability space is normalized but softmax amplifies the predicted class while suppressing others. The two metrics are complementary: low \(\text{SURF}_{\text{MAE}}\) ensures logit accuracy, while low \(\text{SURF}_{\text{EMD}}\) ensures accurate probability distributions.
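Given true and surrogate logits, the two metrics reduce to a normalized L1 in logit space and half the L1 between softmax distributions (the EMD between two categorical distributions over the same bins). A minimal sketch in plain Python (function names are mine):

```python
import math

def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def surf_metrics(Y, Y_hat):
    """Dual SURF metrics (sketch) over a validation set.

    Y, Y_hat: lists of per-image logit vectors (true model vs. surrogate).
    Returns (SURF_MAE, SURF_EMD) following the formulas above.
    """
    n_img, n_cls = len(Y), len(Y[0])
    # SURF_MAE: mean absolute error in logit space, averaged over images and classes
    mae = sum(abs(a - b) for y, yh in zip(Y, Y_hat) for a, b in zip(y, yh))
    mae /= n_img * n_cls
    # SURF_EMD: half-L1 between probability vectors, averaged over images
    emd = sum(abs(p - q) for y, yh in zip(Y, Y_hat)
              for p, q in zip(softmax(y), softmax(yh)))
    emd /= 2 * n_img
    return mae, emd
```

Note that \(\text{SURF}_{\text{EMD}}\) is bounded in \([0, 1]\) (it equals the total variation distance between the two probability vectors), whereas \(\text{SURF}_{\text{MAE}}\) inherits the unbounded scale of the logits, which is exactly why the two are complementary.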

Loss & Training

SURF requires no training and serves purely as an evaluation metric. U-CBEMs in experiments are run according to their respective paper settings, discovering 5 concepts per output class.

Key Experimental Results

Measure-over-Measure Comparison (Sanity Check)

| Setting | SURF_MAE ↓ | SURF_EMD ↓ | Top-1 ↑ | C-SHAP Top-1 ↑ | ICE Top-1 ↑ |
|---|---|---|---|---|---|
| Perfect | 0.00 | 0.000 | 100% | 9.02% | 100% |
| Rand Imp | 2.70 | 0.862 | 97.5% | 9.02% | 100% |
| Full Rand | 3.17 | 0.883 | 1.3% | 97.6% | 3.3% |

Key finding: C-SHAP-Eval reports 97.6% accuracy under fully random explanations (Full Rand) — even higher than under the Perfect setting. ICE-Eval still reports 100% faithfulness under random importances (Rand Imp). Only SURF behaves correctly across all three settings — perfect under Perfect, degraded under Rand Imp, worst under Full Rand.
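The expected behavior is easy to reproduce on a toy model in which the final linear layer is the whole network. The setup below (one concept per class, synthetic Gaussian data) is entirely my own construction, not the paper's experiment; it shows the pattern SURF should exhibit: near-zero EMD for a perfect explanation, and larger EMD once importances or CAVs are randomized:

```python
import math, random

random.seed(0)
D, C, N = 16, 5, 50

# Toy "model": logits y = W h from a single final linear layer (J = 1).
W = [[random.gauss(0, 1) for _ in range(D)] for _ in range(C)]
H = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]

dot = lambda a, b: sum(x * y for x, y in zip(a, b))

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def surf_emd(V, A):
    """Half-L1 between model and surrogate probabilities, one concept per class."""
    total = 0.0
    for h in H:
        p = softmax([dot(w, h) for w in W])                   # true model
        q = softmax([A[i] * dot(V[i], h) for i in range(C)])  # linear surrogate
        total += sum(abs(a - b) for a, b in zip(p, q))
    return total / (2 * N)

# Perfect explanation: CAVs are normalized rows of W, importances their norms.
norms = [math.sqrt(dot(w, w)) for w in W]
V_perf = [[x / n for x in w] for w, n in zip(W, norms)]
A_perf = norms

# Corruptions: random importances (Rand Imp) and random CAVs
# (in the direction of the paper's Full Rand setting).
A_rand = [random.uniform(0.5, 2.0) for _ in range(C)]
V_rand = [[random.gauss(0, 1) for _ in range(D)] for _ in range(C)]
V_rand = [[x / math.sqrt(dot(v, v)) for x in v] for v in V_rand]

print(surf_emd(V_perf, A_perf), surf_emd(V_perf, A_rand), surf_emd(V_rand, A_perf))
```

Any metric that fails to order these three cases correctly, as ICE-Eval and C-SHAP-Eval do, cannot be measuring faithfulness.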

U-CBEM Benchmark (Object Classification, ResNet-50)

| U-CBEM | SURF_MAE ↓ | SURF_EMD ↓ | Top-1 ↑ | Rank Corr ↑ |
|---|---|---|---|---|
| CDISCO | 3.40 | 0.932 | 0.2% | 0.002 |
| ICE | 3.33 | 0.628 | 98.9% | 0.093 |
| CRAFT | 3.19 | 0.878 | 90.6% | 0.068 |
| C-SHAP | 3.28 | 0.882 | 6.3% | 0.005 |
| MCD | 2.60 | 0.426 | 99.4% | 0.145 |
| HU-MCD | 1.97 | 0.384 | 99.7% | 0.149 |
| SAE | 1.04 | 0.195 | 99.2% | 0.366 |

Key Findings

  • No existing U-CBEM is truly faithful — even the best-performing SAE achieves a \(\text{SURF}_{\text{EMD}}\) of 0.195, indicating significant probability-distribution deviation. Most methods score in the 0.4–0.93 range on \(\text{SURF}_{\text{EMD}}\), nearly as poor as random explanations.
  • Top-1 Accuracy is a misleading metric — ICE and MCD achieve Top-1 of 98–99%, yet SURF_EMD and Rank Corr reveal their large errors on non-predicted classes. Relying solely on Top-1 severely overestimates faithfulness.
  • SAE is the most faithful method across all tasks — whether for classification, multi-attribute prediction, or age regression, SAE consistently ranks first, likely due to its advantage in completeness criteria.
  • Increasing the number of concepts does not necessarily improve faithfulness — CDISCO, CRAFT, C-SHAP, and ICE show little or no improvement, and sometimes decline, as concept count increases. Only MCD/HU-MCD improve monotonically with more concepts, exhibiting a natural saturation point.

Highlights & Insights

  • The sanity check "random concepts should be less faithful" is elegantly simple and powerful — it exploits an intuition that anyone can understand to expose a fundamental flaw in all prior metrics. Such simple yet decisive tests are highly valuable in the XAI field (analogous to Adebayo et al.'s sanity checks for saliency maps).
  • SURF's zero-parameter design is the key innovation — by observing that the final linear layer itself defines the form of a "perfect explanation," the method avoids any additional parameters or reconstruction operations. The gap of 200 FLOPs vs. 205M FLOPs is not merely an efficiency matter, but a fundamental question of whether surrogate complexity contaminates faithfulness evaluation.
  • The dual-metric design fills the blind spots of any single metric — Top-1 attends only to the largest class, Norm L1 only to the GT class, SURF_MAE does not weight classes by importance, and SURF_EMD complements evaluation in the probability domain.

Limitations & Future Work

  • SURF currently applies only to the final linear layer — explanations of intermediate layers cannot be evaluated, as the mapping from intermediate layers to outputs is nonlinear. Extending SURF to intermediate layers is a critical direction for future work.
  • Only faithfulness is evaluated, without joint assessment of interpretability — a method with low faithfulness but high interpretability may still have practical value. An ideal framework should simultaneously quantify the faithfulness–interpretability trade-off.
  • U-CBEMs in experiments discover only 5 concepts per class — this setting may be unfair to certain methods. Although Fig. 2 shows trends over varying concept counts, this analysis is performed on only one task.
  • Classification tasks predominate — only one regression task (age estimation) is included; generalization to other task types (e.g., segmentation, generation) remains unverified.
Comparison with Related Methods

  • vs. ICE / ICE-Eval: ICE discovers concepts via NMF and evaluates faithfulness by reconstructing embeddings. SURF reveals that ICE-Eval does not use concept importances, so random and perfect importances yield identical scores — a fundamental flaw.
  • vs. C-SHAP / C-SHAP-Eval: C-SHAP computes concept importances via Shapley values and introduces an MLP surrogate. SURF reveals that the MLP surrogate sometimes performs better on random concepts — possibly because the MLP learns a concept-independent mapping from embeddings to outputs.
  • vs. CRAFT: CRAFT recursively decomposes sub-concepts via NMF and uses concept-space deletion for evaluation. SURF's surrogate-based approach avoids the off-manifold problem inherent in deletion operations.
  • vs. SAE (Sparse Autoencoders): SAE achieves the best performance under SURF and may represent a promising future direction for visual model interpretation.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First systematic exposure of faithfulness evaluation flaws in U-CBEMs; SURF is elegantly and powerfully designed
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three tasks, seven U-CBEMs, and a comprehensive measure-over-measure comparison; evaluation of intermediate layers is absent
  • Writing Quality: ⭐⭐⭐⭐⭐ Exceptionally rigorous logical structure, flowing seamlessly from unified framework to desiderata to sanity check to benchmark
  • Value: ⭐⭐⭐⭐⭐ Significant impact on the XAI community; likely to reshape evaluation standards in the U-CBEM field