Interpretable Generative Models through Post-hoc Concept Bottlenecks¶
Conference: CVPR 2025
arXiv: 2503.19377
Code: https://github.com/Trustworthy-ML-Lab/posthoc-generative-cbm
Area: Diffusion Models
Keywords: Concept Bottleneck Models, Interpretable Generative Models, Post-hoc Training, Generative Control, Concept Intervention
TL;DR¶
This paper proposes two low-cost post-hoc methods—Concept Bottleneck Autoencoders (CB-AE) and Concept Controllers (CC)—to convert pre-trained generative models into interpretable and controllable models without retraining from scratch or requiring ground-truth annotations. They outperform prior CBGM methods in steerability by an average of approximately 25% on CelebA/CelebA-HQ/CUB datasets, while being 4–15x faster to train.
Background & Motivation¶
- Background: Deep generative models (GANs, Diffusion Models) have achieved immense success in image generation, but their generation process remains a black box. Concept Bottleneck Models (CBMs) achieve inherent interpretability by embedding human-understandable concepts into intermediate layers, but they are currently mostly limited to classification tasks.
- Limitations of Prior Work:
    - The only prior work extending CBMs to generative tasks, CBGM, requires training the entire generative model from scratch, incurring massive computational costs (e.g., 240 V100-hours for DDPM-256).
    - CBGM requires ground-truth concept annotations for images, which are expensive to obtain.
    - The steerability of CBGM is only around 25%, limiting its practical utility.
- Key Challenge: To achieve interpretability, CBGM requires complete retraining, which runs counter to the recent trend of leveraging powerful pre-trained models. Meanwhile, CBGM's reliance on concept annotations restricts its scalability.
- Goal: Inject concept-level interpretability and steerability into any pre-trained generative model in a low-cost, post-hoc manner.
- Key Insight: The intermediate latent space of generative models already encodes rich semantic information. Thus, one only needs to train a lightweight concept bottleneck layer to "decode" and "control" this information.
- Core Idea: Freeze the pre-trained generator and train only a concept bottleneck autoencoder that is inserted into the latent space, using pseudo-labels instead of ground-truth annotations to realize low-cost post-hoc interpretable generation.
Method¶
Mechanism: The core idea is to split the pre-trained generative model \(g = g_2 \circ g_1\) into two parts and insert a trainable Concept Bottleneck Autoencoder \(f = D \circ E\) in between, such that the new model \(g_2 \circ f \circ g_1\) preserves generation quality while offering concept-level interpretability and intervention capabilities.
Overall Architecture¶
Input: Random noise \(z\). Passing through \(g_1\) yields the intermediate latent \(w\). The CB-AE encoder \(E\) maps \(w\) to a concept vector \(c = E(w)\), and the decoder \(D\) reconstructs the latent as \(w' = D(c)\). Finally, \(g_2(w')\) generates the image. The concept vector \(c\) contains both logits of pre-defined concepts and unsupervised concept embeddings.
Two methods:
- CB-AE: A complete Concept Bottleneck Autoencoder that provides inherent interpretability.
- CC: Trains only the encoder (Concept Controller), used in conjunction with optimization-based intervention to achieve steering.
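The data flow \(z \to w \to c \to w' \to x\) can be sketched with toy stand-ins. This is a minimal illustration, not the authors' implementation: the random linear maps replace real networks, and every dimension except the 40-dimensional unsupervised embedding is an arbitrary assumption chosen for readability.

```python
import random

NUM_CONCEPTS = 8   # pre-defined binary concepts (toy value)
EMBED_DIM = 40     # unsupervised embedding size, as stated in the text
LATENT_DIM = 16    # assumed toy size of the intermediate latent w
NOISE_DIM = 8      # assumed toy size of the input noise z
IMAGE_DIM = 32     # the toy "image" is just a flat vector


def make_linear(n_in, n_out, rng):
    """Return a random linear map standing in for a network block."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]

    def apply(v):
        return [sum(wij * vj for wij, vj in zip(row, v)) for row in weights]

    return apply


rng = random.Random(0)
concept_dim = 2 * NUM_CONCEPTS + EMBED_DIM      # two logits per concept + embedding

g1 = make_linear(NOISE_DIM, LATENT_DIM, rng)    # frozen first half of the generator
g2 = make_linear(LATENT_DIM, IMAGE_DIM, rng)    # frozen second half of the generator
E = make_linear(LATENT_DIM, concept_dim, rng)   # CB-AE encoder (trainable)
D = make_linear(concept_dim, LATENT_DIM, rng)   # CB-AE decoder (trainable)

z = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
w = g1(z)        # intermediate latent
c = E(w)         # concept vector
w_rec = D(c)     # reconstructed latent w'
x = g2(w_rec)    # generated "image"

# Split the concept vector into per-concept logit pairs and the embedding.
concept_logits = [(c[2 * i], c[2 * i + 1]) for i in range(NUM_CONCEPTS)]
embedding = c[2 * NUM_CONCEPTS:]
```

In the CC variant only an encoder-like predictor is trained, so `D` would be absent and `g2` would consume `w` (or an optimized `w*`) directly.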
Key Designs¶
- Concept Bottleneck Autoencoder (CB-AE):
    - Function: Provides concept prediction and intervention capabilities while preserving generation quality.
    - Mechanism: CB-AE consists of an encoder \(E\) (latent to concept vector) and a decoder \(D\) (concept vector to reconstructed latent). Each binary concept in the concept vector \(c\) is represented by two logits (e.g., "smiling" is represented as \([c_i^+, c_i^-]\)), alongside a 40-dimensional unsupervised embedding that captures concepts that are not pre-defined. There are three training objectives: (1) reconstruction loss \(\mathcal{L}_{r_1}(w, w') + \mathcal{L}_{r_2}(x, x')\) (MSE) to preserve generation quality; (2) concept alignment loss \(\mathcal{L}_c(\hat{y}, c)\) (cross-entropy) to align concept predictions with pseudo-labels \(\hat{y} = M(x)\); (3) intervention losses \(\mathcal{L}_{i_1}, \mathcal{L}_{i_2}\) to ensure that the generated images actually change when concepts are modified.
    - Design Motivation: Post-hoc training avoids the prohibitive cost of training from scratch. Using pseudo-labels (obtained from zero-shot CLIP classification or few-shot labeling models) eliminates the need for ground-truth annotations.
- Optimization-based Interventions:
    - Function: Performs precise manipulation of specific concepts in the latent space via gradient optimization.
    - Mechanism: Inspired by adversarial attacks, I-RFGSM (Iterative Randomized Fast Gradient Sign Method) is used to optimize over the predictions of the CB-AE encoder. Given the original latent \(w\) and the target concept \(c^*\), the optimization solves \(w^* = w + \arg\max_{\delta \in \Delta}[-\mathcal{L}_c(E(w+\delta), c^*)]\), subject to \(\|\delta\|_\infty \leq \epsilon\). Intuitively, it searches for a small perturbation \(\delta\) that shifts the concept prediction of \(w + \delta\) to the target concept, while the \(\ell_\infty\) bound \(\epsilon\) keeps the perturbation small, which is intended to limit visual changes in the image.
    - Design Motivation: While simple logit-swapping interventions are intuitive, their steerability is limited. Optimization-based interventions offer higher intervention success rates and superior image quality.
- Concept Controller (CC):
    - Function: A lighter alternative to CB-AE, designed solely for concept prediction and optimization-based intervention.
    - Mechanism: Because optimization-based intervention does not use the decoder \(D\), the decoder can be omitted, leaving only a concept predictor \(\Omega\) to train. The training objective reduces to the concept alignment loss \(\min_\Omega[\mathcal{L}_c(\hat{y}, c)]\). Used alongside optimization-based interventions, CC reaches higher steerability than CB-AE in three of the four reported settings, even though it is not inherently interpretable (it does not alter the generative pathway).
    - Design Motivation: If users only require control without the need for inherent interpretability, the training time of CC can be significantly shorter than that of CB-AE (8-30x faster than CBGM).
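The optimization-based intervention can be illustrated with a small sign-gradient loop in the spirit of I-RFGSM. This is a sketch under stated assumptions, not the paper's code: a toy two-logit linear predictor stands in for \(E\) (or \(\Omega\)), and the uniform random start, the step size `eps / 10`, and the helper names are hypothetical choices. Only the 50 steps and \(\epsilon = 0.1\) come from the text.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, b, w):
    """Toy linear concept predictor: returns the two logits [c+, c-]."""
    return [sum(wi * xi for wi, xi in zip(row, w)) + bk for row, bk in zip(W, b)]


def ce_loss(W, b, w, target):
    """Cross-entropy between the predicted concept and the target class."""
    return -math.log(softmax(predict(W, b, w))[target])


def ce_grad(W, b, w, target):
    """Analytic gradient of the cross-entropy with respect to the latent w."""
    p = softmax(predict(W, b, w))
    err = [pk - (1.0 if k == target else 0.0) for k, pk in enumerate(p)]
    return [sum(err[k] * W[k][j] for k in range(len(W))) for j in range(len(w))]


def sign(v):
    return (v > 0) - (v < 0)


def optimize_intervention(W, b, w, target, eps=0.1, steps=50, seed=0):
    """Iterative sign-gradient descent on the loss, projected onto the eps-ball."""
    rng = random.Random(seed)
    alpha = eps / 10                                    # assumed step size
    delta = [rng.uniform(-eps, eps) for _ in w]         # assumed random start
    for _ in range(steps):
        g = ce_grad(W, b, [wi + di for wi, di in zip(w, delta)], target)
        # Step against the gradient, then clip so that ||delta||_inf <= eps.
        delta = [min(eps, max(-eps, di - alpha * sign(gi)))
                 for di, gi in zip(delta, g)]
    return [wi + di for wi, di in zip(w, delta)], delta


rng = random.Random(1)
dim = 16
W = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2)]
b = [0.0, 0.0]
w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
target = 0                                              # push toward c+ (concept on)

w_star, delta = optimize_intervention(W, b, w, target)
loss_before = ce_loss(W, b, w, target)
loss_after = ce_loss(W, b, w_star, target)
```

Minimizing the loss is the same as maximizing \(-\mathcal{L}_c\) in the formula above. In the real setting the gradient would come from backpropagation through the trained encoder, and the image would be produced by passing \(w^*\) through \(g_2\).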
Loss & Training¶
- Total loss for CB-AE: \(\mathcal{L}_{r_1} + \mathcal{L}_{r_2} + \mathcal{L}_c + \mathcal{L}_{i_1} + \mathcal{L}_{i_2}\)
- Intervention training: Randomly select a concept, swap its logits to generate \(c_{intervened}\), and reconstruct the image to verify if the intervention is successful.
- CC training: \(\mathcal{L}_c\) only.
- 50 epochs, batch size 64, 4-layer MLP/Conv, optimization-based intervention uses 50-step I-RFGSM (\(\epsilon=0.1\)).
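The logit-swap step used during intervention training can be sketched as follows. The flat layout of the concept vector (interleaved \([c_i^+, c_i^-]\) pairs followed by the unsupervised embedding) and the function names are assumptions made for illustration; the source only states that each concept has two logits and that they are swapped.

```python
def swap_concept_logits(c, concept_idx):
    """Return a copy of c with the two logits of one concept exchanged."""
    out = list(c)
    pos, neg = 2 * concept_idx, 2 * concept_idx + 1
    out[pos], out[neg] = out[neg], out[pos]
    return out


def concept_states(c, num_concepts):
    """A concept counts as 'on' when its positive logit exceeds the negative one."""
    return [c[2 * i] > c[2 * i + 1] for i in range(num_concepts)]


# Two concepts (e.g. "smiling" on, a second concept off) plus a 2-d toy embedding.
c = [2.0, -1.0, 0.3, 0.9, 0.5, -0.5]
c_intervened = swap_concept_logits(c, concept_idx=0)

before = concept_states(c, num_concepts=2)
after = concept_states(c_intervened, num_concepts=2)
```

Only the selected concept flips; the other concept's logits and the embedding entries are left untouched, which is what lets the decoder \(D\) map `c_intervened` back to a latent that differs mainly in the intervened concept.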
Key Experimental Results¶
Main Results¶
Steerability Comparison (8 concepts, CelebA GAN)
| Method | Steerability (%)↑ | FID↓ | Training Time |
|---|---|---|---|
| CBGM | 25.60 | 9.10 | 50 V100-hrs |
| CB-AE | 47.34 | 9.52 | 14 V100-hrs (3.5×) |
| CB-AE+opt-int | 61.14 | — | — |
| CC+opt-int | 51.14 | 7.65 | 6 V100-hrs (8.3×) |
Steerability (%) Across Models and Datasets
| Method | CelebA GAN | CelebA-HQ DDPM | CelebA-HQ StyleGAN2 | CUB GAN |
|---|---|---|---|---|
| CBGM | 25.60 | 13.80 | — | 21.30 |
| CB-AE+opt-int | 61.14 | 38.09 | 61.66 | 46.03 |
| CC+opt-int | 51.14 | 41.45 | 67.95 | 48.91 |
Ablation Study¶
| Configuration | Concept Accuracy | Steerability | Description/Notes |
|---|---|---|---|
| CB-AE (logit swap) | 86.56% | 47.34% | Basic intervention |
| CB-AE + opt-int | 86.56% | 61.14% | Optimization-based intervention adds 13.8 percentage points |
| CC + opt-int | 87.65% | 51.14% | Lighter, but lower steerability in this setting |
| CLIP zero-shot pseudo-labels | Slightly lower | Still significantly outperforms CBGM | Zero concept supervision is feasible |
| TIP few-shot pseudo-labels (128 images) | Close to supervised | Close to supervised | Approximable with only 128 images |
Key Findings¶
- Optimization-based intervention is key: Elevates the steerability of CB-AE from 47.34% to 61.14%, a relative improvement of 29%.
- CC is optimal on StyleGAN2: Reaches up to 67.95% on CelebA-HQ, as the clean latent space of GANs is more amenable to optimization.
- Pseudo-labels are sufficient: Utilizing zero-shot CLIP classifiers yields effective pseudo-labels, eliminating the need for any ground-truth annotation data.
- Large-scale scenario (40 concepts): On all 40 attributes of CelebA, CB-AE+opt-int reaches 58.3% steerability, while CBGM reaches only 23.1%.
- User studies validate the reliability of automated evaluations; human consensus on CB-AE's concept accuracy aligns closely with automated metrics.
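The headline numbers can be checked directly from the tables above. The short calculation below separates the absolute gain (in percentage points) from the relative gain, and averages the steerability gap to CBGM over the three settings where a CBGM value is reported. The variable names are illustrative only.

```python
# Steerability (%) on CelebA GAN, from the ablation table.
logit_swap, opt_int = 47.34, 61.14
abs_gain = opt_int - logit_swap        # about 13.8 percentage points
rel_gain = abs_gain / logit_swap       # about 0.29, i.e. a 29% relative gain

# Steerability (%) for CelebA GAN, CelebA-HQ DDPM, CUB GAN
# (CelebA-HQ StyleGAN2 is skipped because no CBGM value is reported).
cbgm = [25.60, 13.80, 21.30]
cbae_opt = [61.14, 38.09, 46.03]
cc_opt = [51.14, 41.45, 48.91]

avg_gap_cbae = sum(a - b for a, b in zip(cbae_opt, cbgm)) / len(cbgm)
avg_gap_cc = sum(a - b for a, b in zip(cc_opt, cbgm)) / len(cbgm)
```

Both averaged gaps come out in the high twenties of percentage points (roughly 28 for CB-AE+opt-int and 27 for CC+opt-int), which is in line with the "approximately 25%" average improvement quoted in the summary.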
Highlights & Insights¶
- Post-hoc training paradigm: The framework of freezing the pre-trained generator and training a lightweight bottleneck layer is extremely practical. It can be plugged directly into almost any generative model (GAN/DDPM/StyleGAN2). This design blueprint is highly transferable to interpretability research in other media, such as video or 3D generation.
- Adversarial attack to concept intervention: Applying an adversarial attack method (I-RFGSM) to concept steering is a clever and effective cross-domain analogy.
- Minimalist design of CC: Removing the decoder shows that intervention can be carried out entirely in the latent space via optimization, without learning an explicit concept-to-latent mapping, which substantially reduces training cost.
Limitations & Future Work¶
- Concept coupling exists (e.g., "young" and "bald"); modifying one concept can alter others, as orthogonality control remains limited in the current implementation.
- The method relies heavily on pseudo-label quality; CLIP's zero-shot capability for specific concepts (e.g., fine-grained CUB bird attributes) remains constrained.
- Only verified in image generation scenarios; extension to video or 3D generation has not yet been explored.
- Optimization-based intervention requires multi-step gradient calculations (e.g., 50 steps), which might limit its real-time application.
- There could be an optimization conflict between the reconstruction loss and intervention loss within CB-AE.
Related Work & Insights¶
- vs CBGM: CBGM trains the entire generative model from scratch and necessitates ground-truth concept annotations, whereas CB-AE uses post-hoc training with pseudo-labels. CB-AE outperforms CBGM in steerability by ~25% on average while being 4-15x faster to train.
- vs LF-CBM/VLG-CBM: These are post-hoc CBMs tailored for classification tasks. CB-AE is the pioneering work that extends the post-hoc CBM paradigm to generative models.
- vs GAN latent manipulation: Techniques such as InterFaceGAN directly manipulate the GAN latent space but lack concept-level interpretability and structured constraints.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of post-hoc concept bottlenecks and optimization-based interventions is highly novel, and the method design is elegant.
- Experimental Thoroughness: ⭐⭐⭐⭐ Very comprehensive, covering multiple generative models (GAN/DDPM/StyleGAN2), multiple datasets, and human user studies.
- Writing Quality: ⭐⭐⭐⭐ The methodology is described clearly, and the comparison tables are highly informative.
- Value: ⭐⭐⭐⭐ It provides a practical, low-cost solution for the interpretability of generative models, with strong potential for real-world applications.