Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions?¶
Conference: CVPR 2026
arXiv: 2604.02532
Code: GitHub
Area: Explainable AI / Model Compression
Keywords: feature attribution stability, post-hoc explanation methods, prediction invariance, perturbation robustness, XAI benchmark
TL;DR¶
This paper proposes the FASS benchmark, which systematically evaluates the stability of post-hoc feature attribution methods through prediction-invariant filtering, a three-axis stability decomposition (spatial / ranking / salient region), and multiple perturbation types (geometric / photometric / compression), exposing fundamental flaws in existing evaluation frameworks.
Background & Motivation¶
Post-hoc feature attribution methods (e.g., Grad-CAM, LIME, SHAP, Integrated Gradients) are widely used in safety-critical visual systems to help practitioners understand model decisions. However, when inputs undergo minor perturbations that do not change model predictions, attribution results may shift dramatically, posing a serious threat to their reliability.
Existing stability evaluations suffer from three structural deficiencies:
Lack of prediction-invariant filtering: Stability is computed without verifying whether the perturbation changes the model's predicted class. Metrics such as Lipschitz continuity and max-sensitivity still compare attributions across perturbations that alter the prediction class, conflating "model sensitivity" with "explanation fragility."
Single scalar metrics: Stability is collapsed into a single value, making it impossible to distinguish between different failure modes such as spatial displacement, ranking changes, or salient-region inconsistency.
Evaluation limited to additive noise: Existing frameworks test primarily within an \(\varepsilon\)-ball of additive noise, neglecting geometric transformations, photometric changes, and compression artifacts that are common in real-world systems.
These deficiencies cause existing evaluations to systematically overestimate the stability of attribution methods.
Method¶
Overall Architecture¶
FASS (Feature Attribution Stability Suite) is a modular evaluation pipeline consisting of three stages: perturbation application → prediction-invariant filtering → three-axis stability measurement.
Each input image is paired with its perturbed counterpart, and stability metrics are computed only when the model's top-1 predicted class remains unchanged.
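A minimal sketch of the first two stages, assuming a torchvision-pretrained ResNet-50 and a 15° rotation as the example perturbation; function names and the exact data flow are illustrative, not taken from the paper's released code:

```python
import torch
import torchvision.transforms.functional as TF
from torchvision.models import resnet50, ResNet50_Weights

# Pretrained classifier under evaluation (used as-is, without fine-tuning).
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()

# Example geometric perturbation: 15-degree rotation with zero-filled borders.
perturb = lambda x: TF.rotate(x, angle=15.0, fill=0.0)

def prediction_invariant_pairs(images, perturb):
    """Keep only (original, perturbed) pairs whose top-1 prediction is unchanged,
    and report the retention rate as a first-class diagnostic."""
    kept, total = [], 0
    with torch.no_grad():
        for x in images:                      # x: (3, 224, 224) image tensor
            x_p = perturb(x)
            preds = model(torch.stack([x, x_p])).argmax(dim=1)
            total += 1
            if preds[0] == preds[1]:          # argmax consistency check
                kept.append((x, x_p, int(preds[0])))
    return kept, len(kept) / max(total, 1)    # pairs for stage 3 + retention rate
```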
Key Designs¶
- Prediction-Invariant Filtering: For each input–perturbation pair, the pipeline checks whether the model's argmax prediction is consistent. Inconsistent pairs are excluded and separately reported as a retention-rate diagnostic. The core insight is that comparing attributions across different prediction classes is meaningless, since attributions are defined relative to a specific prediction. The retention rate itself becomes a first-class experimental quantity, revealing when stability evaluation becomes unreliable.
- Three-Axis Stability Decomposition (see the sketch after this list):
- SSIM (Structural Similarity Index): Measures spatial consistency of attribution maps using an 11×11 mean-pooling window to detect pixel-level spatial shifts.
- Spearman Rank Correlation: Measures whether the feature importance ranking is preserved after perturbation, independent of magnitude changes. Attribution maps are flattened and their ranks compared.
- Top-\(k\) Jaccard Overlap: With \(k=100\), measures the consistency of the top-100 most salient feature locations (0.07% of the \(224\times224\times3 = 150{,}528\)-dimensional attribution map).
- Composite FASS Score: An equal-weighted average of the three metrics: \(\text{FASS} = (S + R + J) / 3\). The equal-weight design treats all three failure modes as equally important.
- Perturbation Taxonomy (see the sketch after this list):
- Geometric: 15° rotation, 20-pixel horizontal translation (zero-padded borders).
- Photometric: brightness scaling \(\times 1.5\), Gaussian noise \(\sigma=0.15\).
- Compression: JPEG quality factor 40.
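The listed perturbations map onto standard torchvision / Pillow operations; a sketch under that assumption (interpolation modes and whether perturbations are applied before or after normalization are details this summary does not pin down):

```python
import io
import torch
import torchvision.transforms.functional as TF
from PIL import Image

def jpeg_compress(img_pil, quality=40):
    """Round-trip a PIL image through JPEG at the given quality factor."""
    buf = io.BytesIO()
    img_pil.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Perturbation taxonomy with the parameters quoted above, applied to PIL images
# before tensor conversion.
PERTURBATIONS = {
    "rotation":    lambda im: TF.rotate(im, angle=15.0, fill=0),
    "translation": lambda im: TF.affine(im, angle=0.0, translate=(20, 0),
                                        scale=1.0, shear=0.0, fill=0),
    "brightness":  lambda im: TF.adjust_brightness(im, brightness_factor=1.5),
    # Additive Gaussian noise in [0, 1] pixel space, sigma = 0.15.
    "noise":       lambda im: TF.to_pil_image(
                       (TF.to_tensor(im) + 0.15 * torch.randn_like(TF.to_tensor(im))).clamp(0, 1)),
    "jpeg":        lambda im: jpeg_compress(im, quality=40),
}
```

And a sketch of the three-axis measurement plus the equal-weight composite, assuming both attribution maps are NumPy arrays of identical shape (2-D heatmaps here for simplicity; the paper's top-\(k\) Jaccard is defined over the full \(224\times224\times3\) tensor):

```python
import numpy as np
from scipy.stats import spearmanr
from skimage.metrics import structural_similarity

def fass_score(attr_a: np.ndarray, attr_b: np.ndarray, k: int = 100) -> dict:
    """Three-axis stability between an original and a perturbed attribution map."""
    # Axis 1: spatial consistency (SSIM with an 11x11 uniform window).
    data_range = max(attr_a.max() - attr_a.min(), 1e-12)
    ssim = structural_similarity(attr_a, attr_b, win_size=11, data_range=data_range)

    # Axis 2: rank consistency of the flattened importance scores.
    rho, _ = spearmanr(attr_a.ravel(), attr_b.ravel())

    # Axis 3: overlap of the top-k most salient locations (Jaccard index).
    top_a = set(np.argpartition(attr_a.ravel(), -k)[-k:])
    top_b = set(np.argpartition(attr_b.ravel(), -k)[-k:])
    jaccard = len(top_a & top_b) / len(top_a | top_b)

    return {"ssim": ssim, "spearman": rho, "jaccard": jaccard,
            "fass": (ssim + rho + jaccard) / 3.0}   # equal-weighted composite
```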
Loss & Training¶
This paper presents an evaluation benchmark and does not involve model training. Evaluation is conducted on pretrained models (ResNet-50, DenseNet-121, ConvNeXt-Tiny, ViT-B/16) with four attribution methods (IG, GradientSHAP, Grad-CAM, LIME) implemented via the Captum library.
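All four methods are available off the shelf in Captum; a minimal sketch of how they might be wired up for a ResNet-50 (the Grad-CAM target layer, sample counts, and baselines below are assumptions, not the paper's exact settings):

```python
import torch
from captum.attr import (IntegratedGradients, GradientShap,
                         LayerGradCam, LayerAttribution, Lime)

def compute_attributions(model, x, target):
    """x: (1, 3, 224, 224) input batch; target: predicted class index."""
    baseline = torch.zeros_like(x)

    ig = IntegratedGradients(model)
    attr_ig = ig.attribute(x, baselines=baseline, target=target, n_steps=50)

    gs = GradientShap(model)
    attr_gs = gs.attribute(x, baselines=baseline, target=target, n_samples=20)

    # Grad-CAM on the last convolutional block, upsampled to input resolution.
    gc = LayerGradCam(model, model.layer4[-1])
    attr_gc = LayerAttribution.interpolate(gc.attribute(x, target=target),
                                           x.shape[-2:])

    # In practice a superpixel feature_mask is usually supplied for image inputs.
    lime = Lime(model)
    attr_lime = lime.attribute(x, target=target, n_samples=200)

    return {"IG": attr_ig, "GradientSHAP": attr_gs,
            "Grad-CAM": attr_gc, "LIME": attr_lime}
```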
Key Experimental Results¶
Main Results¶
Evaluation scale: approximately 70,000 images × 5 perturbation types × 4 models × 4 attribution methods ≈ 6.4 million attribution computations.
| Dataset | Method | SSIM | Spearman | Jaccard | FASS |
|---|---|---|---|---|---|
| ImageNet | Grad-CAM | .885 | .966 | .314 | .722 |
| ImageNet | IG | .706 | .603 | .060 | .457 |
| ImageNet | GradientSHAP | .681 | .570 | .037 | .429 |
| ImageNet | LIME | .342 | .582 | .072 | .332 |
| CIFAR-10 | Grad-CAM | .830 | .899 | .423 | .717 |
| COCO | Grad-CAM | .810 | .881 | .321 | .671 |
Ablation Study — Prediction-Invariant Retention Rate¶
| Perturbation Type | Mean Retention Rate | Range |
|---|---|---|
| Rotation | 30.9% | 0.0–88.1% |
| Translation | 0.1% | 0.0–0.6% |
| Brightness | 0.8% | 0.0–9.0% |
| Noise | 34.5% | 0.0–94.4% |
| JPEG | 1.0% | 0.0–11.7% |
Without prediction-invariant filtering, more than 99% of evaluated pairs would involve a prediction change for some perturbation types (e.g., translation, where mean retention is only 0.1%), demonstrating that unfiltered stability evaluation largely compares attributions for different predicted classes and is therefore unreliable.
Ablation Study — Effect of Perturbation Type¶
| Perturbation Category | SSIM | Spearman | Jaccard | FASS |
|---|---|---|---|---|
| Geometric | .725 | .666 | .099 | .497 |
| Photometric | .770 | .724 | .178 | .557 |
| Compression | .791 | .739 | .196 | .576 |
Key Findings¶
- Grad-CAM is the most stable attribution method: It achieves the highest FASS across all 12 dataset–architecture combinations. The low-pass filtering characteristic of its 7×7 activation maps naturally absorbs local perturbation effects.
- IG and GradientSHAP are highly consistent: Their FASS gap does not exceed 0.05, indicating that GradientSHAP's stochastic baseline sampling does not significantly degrade the stability of the underlying gradient signal.
- Choice of attribution method matters more than model architecture: The FASS gap between Grad-CAM and IG (0.21) is approximately twice the largest gap observed for the same method across different architectures (0.09).
- Geometric perturbations expose far greater instability than photometric perturbations: Benchmarks relying solely on additive noise systematically overestimate attribution robustness.
- LIME exhibits a distinctive SSIM–Spearman dissociation: It is spatially unstable yet relatively rank-consistent — a structural failure mode that a single scalar score cannot capture.
Highlights & Insights¶
- Prediction invariance as a prerequisite is a simple yet critically important design principle. Prior work has largely overlooked it, which calls existing stability evaluation results into question.
- The three-axis decomposition is an elegant design: behind an identical FASS score, different methods exhibit completely different failure modes. For example, Grad-CAM achieves near-perfect Spearman correlation but relatively low Jaccard overlap, while LIME shows very low SSIM but acceptable Spearman.
- The idea of treating the retention rate as a first-class experimental quantity is highly instructive — it serves not only as a diagnostic indicator but also defines the boundary conditions under which stability evaluation itself is reliable.
Limitations & Future Work¶
- FASS measures stability rather than faithfulness: a method that consistently produces incorrect but stable attributions would receive a high score. Jointly evaluating stability and faithfulness is a natural next step.
- Each perturbation type is evaluated at a single intensity level; intensity sweeps may reveal nonlinear degradation behavior.
- The equal-weight composite score lacks domain adaptability; specific application scenarios may require different weightings.
- Only four attribution methods are covered; variants such as SmoothGrad and LRP, as well as concept-level methods, are not included.
- Pretrained models are used without dataset-specific fine-tuning, which affects retention rates (particularly for CIFAR-10 due to 32→224 upsampling).
Related Work & Insights¶
- This paper stands in sharp contrast to evaluation frameworks such as Quantus, LATEC, and OpenXAI — these either do not enforce prediction invariance, reduce stability to a single scalar, or test only additive noise.
- For real-world deployment scenarios (e.g., medical imaging, autonomous driving), the findings imply: (1) preferring Grad-CAM over pixel-level methods; and (2) testing attribution stability under target deployment conditions, including geometric transformations.
- This work can inspire further research on "evaluating the evaluators" — any stability metric must verify that its underlying assumptions are satisfied.
Rating¶
- Novelty: ⭐⭐⭐⭐ (Evaluation framework design is novel, though the core ideas are not overly complex)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (70K images, 6.4M attribution computations, three datasets × four architectures × four methods)
- Writing Quality: ⭐⭐⭐⭐⭐ (Clear logic, precise problem formulation)
- Value: ⭐⭐⭐⭐ (Directly actionable guidance for the XAI community)