Learning Concept Bottleneck Models from Mechanistic Explanations

Conference: ICLR 2026 | arXiv: 2603.07343 | Code: GitHub | Area: Graph Learning | Keywords: Concept Bottleneck Model, Sparse Autoencoder, mechanistic interpretability, Explainable AI, Multimodal LLM

TL;DR

This paper proposes Mechanistic CBM (M-CBM), which leverages Sparse Autoencoders to extract concepts from features learned by a black-box model, names and annotates them via a multimodal LLM, and constructs an interpretable Concept Bottleneck Model. Under controlled information leakage, M-CBM substantially outperforms existing CBM approaches.

Background & Motivation

Concept Bottleneck Models (CBMs) are a class of inherently interpretable models that first predict human-understandable concepts at an intermediate layer, then use those concepts to predict the final class. Existing CBMs source their concepts primarily in four ways: manual specification, knowledge graphs, LLM generation, and universal concept sets built on CLIP. However, these prior-based concepts suffer from two fundamental issues:

  1. Insufficient predictive power: Prior concepts may not be sufficiently discriminative for the target task, or may not even be learnable from the data (e.g., non-visual concepts such as "warm to the touch" generated by LLMs for medical images).
  2. Severe information leakage: The Concept Bottleneck Layer (CBL) implicitly encodes class-relevant information; near black-box accuracy can be recovered even using random words as concepts, rendering the explanations meaningless.

Inspired by the field of Mechanistic Interpretability—particularly the success of Sparse Autoencoders (SAEs) in disentangling model features—the authors pose a central question: Can an interpretable approximation of a black-box model be constructed directly from the concepts the model itself has learned?

Core Problem

How to build a Concept Bottleneck Model without relying on a prior concept set, such that it simultaneously satisfies: (1) high task accuracy, (2) learnable and predictive concepts, and (3) concise explanations with controllable information leakage?

Method

The M-CBM pipeline consists of four stages:

1. Concept Extraction (Sparse Autoencoder)

Given a trained black-box backbone \(\phi\), an SAE sparsely decomposes its activations \(\mathbf{a}^{(i)} = \phi(\mathbf{x}^{(i)})\):

  • Encoder: \(\mathbf{h} = \text{ReLU}(\mathbf{W}_E^\top(\mathbf{a} - \mathbf{b}_D) + \mathbf{b}_E)\)
  • Decoder: \(\hat{\mathbf{a}} = \mathbf{W}_D^\top \mathbf{h} + \mathbf{b}_D\)
  • Training objective: reconstruction loss + L1 sparsity penalty \(\mathcal{L}_{\text{SAE}} = \|\mathbf{a} - \hat{\mathbf{a}}\|_2^2 + \lambda_{\text{SAE}} \|\mathbf{h}\|_1\)
  • The expansion factor \(m/n\) is kept within 4× to maintain manageable annotation costs.
  • Dead and near-dead neurons are filtered out by an importance threshold: a neuron is retained only if removing it increases the cross-entropy loss of recovering the black-box predictions by more than ~1% (a minimal SAE sketch follows this list).
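
A minimal PyTorch sketch of this SAE, matching the encoder/decoder equations above; the initialization scale, the default \(\lambda_{\text{SAE}}\), and other training details are illustrative assumptions rather than the paper's exact settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, n: int, expansion: int = 4):
        super().__init__()
        m = expansion * n  # hidden width; the expansion factor m/n is kept <= 4
        self.W_E = nn.Parameter(torch.randn(n, m) * 0.01)
        self.W_D = nn.Parameter(torch.randn(m, n) * 0.01)
        self.b_E = nn.Parameter(torch.zeros(m))
        self.b_D = nn.Parameter(torch.zeros(n))

    def forward(self, a: torch.Tensor):
        # Encoder: h = ReLU(W_E^T (a - b_D) + b_E)
        h = F.relu((a - self.b_D) @ self.W_E + self.b_E)
        # Decoder: a_hat = W_D^T h + b_D
        a_hat = h @ self.W_D + self.b_D
        return h, a_hat

def sae_loss(a, a_hat, h, lambda_sae: float = 1e-3):
    # L_SAE = ||a - a_hat||_2^2 + lambda_SAE * ||h||_1 (batch-averaged)
    recon = ((a - a_hat) ** 2).sum(dim=1).mean()
    sparsity = h.abs().sum(dim=1).mean()
    return recon + lambda_sae * sparsity
```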

2. Concept Naming (Multimodal LLM)

For each surviving SAE hidden neuron \(h_j\):

  • The top-activating samples (10 images) and contrastive samples (10 images, comprising random samples and high-cosine-similarity negatives) are selected.
  • Concept saliency maps are generated for the activating samples based on decoder weights \(\mathbf{W}_D\).
  • The paired images are fed into GPT-4.1, which produces natural-language concept descriptions.
  • The model is explicitly instructed not to use class names; violations trigger a retry.
  • All concept names are embedded using text-embedding-3-large, and duplicates with cosine similarity > 0.98 are merged (a sketch of this merging step follows the list).
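
A hedged sketch of the merging step: greedy clustering over the name embeddings by cosine similarity. The greedy order, and treating the embeddings as a precomputed matrix standing in for text-embedding-3-large calls, are assumptions:

```python
import numpy as np

def merge_duplicates(names: list[str], embeddings: np.ndarray, thresh: float = 0.98):
    """Greedily merge concept names whose embeddings have cosine similarity > thresh."""
    # Normalize rows so inner products are cosine similarities
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = E @ E.T
    keep, merged_into = [], {}
    for i in range(len(names)):
        dup = next((j for j in keep if sim[i, j] > thresh), None)
        if dup is None:
            keep.append(i)          # i becomes a representative concept
        else:
            merged_into[i] = dup    # i is a duplicate of representative dup
    return [names[i] for i in keep], merged_into
```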

3. Dataset Annotation (Partial Annotation Strategy)

Since concept names are hypotheses rather than verified functional descriptions, the SAE hidden layer is not used directly as the bottleneck. Instead, an independent CBL is trained:

  • At most 1,000 images are annotated per concept (500 active + 500 inactive).
  • Active samples are images with activations above the 95th percentile.
  • Inactive samples consist of half random images and half the negatives most similar to the active samples (a sketch of this selection step follows the list).
  • Annotation format: 25 images arranged in a 5×5 grid are sent to GPT-4.1 alongside a reference grid to judge concept presence or absence.
  • Both sample sets are class-stratified to prevent annotation bias toward specific classes.
  • Annotation results are ternary vectors \(z_k^{(i)} \in \{-1, 0, 1\}\), marking each (image, concept) pair as absent, unannotated, or present, respectively.
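
An illustrative sketch of per-concept candidate selection under the rules above; the feature space used for the similarity search and the exact tie-breaking are assumptions, and class stratification is omitted for brevity:

```python
import numpy as np

def select_candidates(acts, feats, n_per_side=500, rng=None):
    """acts: (N,) SAE activations for one concept; feats: (N, d) backbone features."""
    rng = rng or np.random.default_rng(0)
    thr = np.percentile(acts, 95)
    active_pool = np.flatnonzero(acts > thr)       # activations above 95th percentile
    active = rng.choice(active_pool, min(n_per_side, active_pool.size), replace=False)
    inactive_pool = np.flatnonzero(acts <= thr)
    half = n_per_side // 2
    rand = rng.choice(inactive_pool, min(half, inactive_pool.size), replace=False)
    # Hard negatives: inactive images closest to the mean active feature
    centroid = feats[active].mean(axis=0)
    sims = feats[inactive_pool] @ centroid / (
        np.linalg.norm(feats[inactive_pool], axis=1) * np.linalg.norm(centroid) + 1e-8)
    hard = inactive_pool[np.argsort(-sims)[:half]]
    return active, np.concatenate([rand, hard])
```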

4. CBM Training

  • CBL: Predicts \(K\) concepts from frozen backbone features using a masked BCE loss optimized over the annotated pairs \(\Omega\), with class-imbalance weighting (sketched below).
  • Sparse linear classifier: Trained on concept logits (z-normalized) using the GLM-SAGA solver with elastic-net penalty (\(\alpha=0.99\)); sparsity is controlled by tuning \(\lambda_{\text{CLF}}\).
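
A minimal sketch of the masked BCE objective, assuming the convention \(z=+1\) present, \(-1\) absent, \(0\) unannotated, so that only pairs in \(\Omega\) contribute; pos_weight stands in for the paper's class-imbalance weighting:

```python
import torch
import torch.nn.functional as F

def masked_bce(logits: torch.Tensor, z: torch.Tensor, pos_weight: torch.Tensor):
    """logits, z: (batch, K) with z in {-1, 0, +1}; pos_weight: (K,)."""
    mask = (z != 0).float()             # Omega: annotated (image, concept) pairs
    targets = (z == 1).float()          # map +1 -> 1 and -1 -> 0
    loss = F.binary_cross_entropy_with_logits(
        logits, targets, pos_weight=pos_weight, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```

The paper fits the final sparse layer with GLM-SAGA; for quick experimentation, scikit-learn's `LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.99)` would be a rough stand-in.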

NCC Sparsity Metric

The authors note that the prior NEC (Number of Effective Concepts) metric imposes a hard cap on the total number of concepts \(K\), which is unfair for datasets with high intra-class diversity. They propose NCC (Number of Contributing Concepts):

\[\text{NCC}_\tau = \frac{1}{|\mathbb{D}|C} \sum_i \sum_r \min\left\{\kappa : \sum_{s=1}^{\kappa} u_{(s),r}^{(i)} \geq \tau \sum_k u_{k,r}^{(i)}\right\}\]

where \(u_{k,r}^{(i)} = |[g(\mathbf{a}^{(i)})]_k \cdot [\mathbf{W}_F]_{k,r}|\) is the absolute contribution of concept \(k\) to class \(r\), and \(u_{(s),r}^{(i)}\) denotes these contributions sorted in descending order. NCC measures sparsity at the decision level without imposing a hard constraint on the total number of concepts, making it more suitable for high-diversity tasks.
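
A sketch of \(\text{NCC}_\tau\) computed directly from the formula: for each sample and class, count the fewest top-ranked concepts whose absolute contributions cover a fraction \(\tau\) of the total, then average. The array shapes and the dense (N, K, C) intermediate are simplifying assumptions:

```python
import numpy as np

def ncc(concept_acts: np.ndarray, W_F: np.ndarray, tau: float = 0.9) -> float:
    """concept_acts: (N, K) bottleneck outputs g(a); W_F: (K, C) final-layer weights."""
    u = np.abs(concept_acts[:, :, None] * W_F[None, :, :])  # (N, K, C) contributions
    u_sorted = -np.sort(-u, axis=1)                         # descending over concepts
    cum = np.cumsum(u_sorted, axis=1)
    total = u.sum(axis=1, keepdims=True)                    # (N, 1, C)
    # Smallest kappa with cumulative coverage >= tau * total (1-indexed)
    kappa = (cum < tau * total).sum(axis=1) + 1             # (N, C)
    return float(kappa.mean())                              # average over samples and classes
```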

Key Experimental Results

Datasets & Backbones: CUB (ResNet18, 200 classes), ISIC2018 (ResNet50, 7 classes), ImageNet (ResNet50, 1000 classes)

| Method | CUB (NCC=5) | CUB (avg) | ISIC (NCC=5) | ISIC (avg) | ImageNet (NCC=5) | ImageNet (avg) |
| --- | --- | --- | --- | --- | --- | --- |
| Black-box upper bound | 76.67% | - | 79.37% | - | 76.15% | - |
| LF-CBM | 58.08% | 71.09% | 61.44% | 67.55% | 62.20% | 69.08% |
| DN-CBM (RN) | 38.21% | 48.98% | 35.38% | 54.61% | 46.71% | 57.24% |
| VLG-CBM_CA | 69.12% | 72.25% | 64.55% | 72.61% | N/A | N/A |
| M-CBM | 73.70% | 74.18% | 72.75% | 75.51% | 72.18% | 73.64% |

Concept prediction quality (ROC-AUC): M-CBM reaches a macro ROC-AUC of 90.04% on CUB vs. 62.03% for VLG-CBM_CA, and 80.57% vs. 73.37% on ISIC, demonstrating that concepts extracted from the model itself are substantially more learnable.

Information leakage analysis: On CUB, replacing concepts with random words causes VLG-CBM to reach near black-box accuracy at NCC=1.5 (severe leakage). Removing class-conditional annotation reduces leakage, and M-CBM significantly outperforms the random-word baseline in the low-NCC regime.

Highlights & Insights

  1. Novel concept source: The first systematic use of SAE-extracted model-internal concepts for CBM construction, circumventing the mismatch between prior concepts and the target task.
  2. NCC metric: More flexible than NEC; measures explanation conciseness at the decision level without constraining the total number of concepts.
  3. Information leakage control: Combines class-agnostic annotation with sparsity control, and quantifies leakage using random-word experiments.
  4. Substantially improved concept learnability: ROC-AUC improves from 62% to 90% on CUB, confirming that model-internal concepts are indeed easier to learn.
  5. Efficient annotation strategy: SAE activations guide candidate image pre-selection; only ~1k images per concept need to be annotated, avoiding the computational bottleneck of full-dataset annotation.

Limitations & Future Work

  1. Concept learning remains a black box: The final layer is interpretable, but the CBL itself remains opaque; systematic methods to verify whether concepts are learned as intended are lacking.
  2. Information leakage not fully eliminated: Even under controlled NCC, random words can still achieve accuracy well above chance, indicating the leakage problem has not been fundamentally resolved.
  3. SAE requires human supervision: The approach is less plug-and-play than alternatives; it requires verifying that SAE-extracted concepts are interpretable and that annotation quality is reliable.
  4. Annotation cost: At ~$0.14 USD per concept, annotating 2,648 concepts for ImageNet still incurs non-trivial expense.
  5. Limited to image classification: The method has not been extended to detection, segmentation, or other visual tasks, nor explored in non-visual domains.

Comparison with Related Methods

| Method | Concept Source | Requires CLIP | Leakage Control | ImageNet Feasibility |
| --- | --- | --- | --- | --- |
| LF-CBM | LLM generation + CLIP-Dissect | No | Sparsity penalty | Feasible |
| VLG-CBM | LLM generation + GroundingDINO | No | NEC | ~300 GPU-days, infeasible |
| DN-CBM | CLIP SAE hidden layer | Yes (CLIP only) | Sparsity penalty | Feasible but low accuracy |
| M-CBM | Black-box SAE + MLLM annotation | No | NCC | Feasible and best |

DN-CBM is the closest precursor, also employing SAEs, but is constrained to CLIP backbones and uses the SAE hidden layer directly as the bottleneck rather than training an independent CBL. M-CBM addresses both limitations via MLLM annotation and independent CBL training.

The effectiveness of SAEs in decomposing black-box model features into interpretable concepts opens a new paradigm for model distillation and knowledge discovery. The NCC metric (coverage threshold over contribution-ranked concepts) is generalizable to other scenarios requiring sparse explanations. The partial annotation strategy (SAE-activation-guided candidate selection + grid-based batch annotation) offers insights for efficient large-scale dataset annotation. Future work could integrate circuit-level analysis to further model causal relationships among concepts.

Rating

  • Novelty: ⭐⭐⭐⭐ — Introducing SAE tools from the mechanistic interpretability literature into the CBM framework is a natural yet effective contribution.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Three datasets at different scales, leakage analysis, and concept quality evaluation; however, M-CBM experiments with ViT backbones are absent.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Clear motivation, intuitive pipeline diagrams, and thorough leakage analysis.
  • Value: ⭐⭐⭐⭐ — Provides a more pragmatic concept sourcing strategy for explainable AI; the NCC metric merits broader adoption.