Interpretable Generative Models through Post-hoc Concept Bottlenecks

Conference: CVPR 2025
arXiv: 2503.19377
Code: https://github.com/Trustworthy-ML-Lab/posthoc-generative-cbm
Area: Diffusion Models
Keywords: Concept Bottleneck Models, Interpretable Generative Models, Post-hoc Training, Generative Control, Concept Intervention

TL;DR

This paper proposes two low-cost post-hoc methods, Concept Bottleneck Autoencoders (CB-AE) and Concept Controllers (CC), that convert pre-trained generative models into interpretable and controllable models without retraining from scratch or requiring ground-truth annotations. They outperform the prior CBGM method in steerability by approximately 25% on average on the CelebA, CelebA-HQ, and CUB datasets, while being 4-15x faster to train.

Background & Motivation

Background: Deep generative models (GANs, Diffusion Models) have achieved immense success in image generation, but their generation process remains a black box. Concept Bottleneck Models (CBMs) achieve inherent interpretability by embedding human-understandable concepts into intermediate layers, but they are currently mostly limited to classification tasks.
Limitations of Prior Work:
- The only prior work extending CBMs to generative tasks, CBGM, requires training the entire generative model from scratch, incurring massive computational costs (e.g., 240 V100-hours for DDPM-256).
- CBGM requires ground-truth concept annotations for images, which are expensive to obtain.
- The steerability of CBGM is only around 25%, limiting its practical utility.
Key Challenge: To achieve interpretability, CBGM requires complete retraining, which runs counter to the recent trend of leveraging powerful pre-trained models. Meanwhile, CBGM's reliance on concept annotations restricts its scalability.
Goal: Inject concept-level interpretability and steerability into any pre-trained generative model in a low-cost, post-hoc manner.
Key Insight: The intermediate latent space of a generative model already encodes rich semantic information, so it suffices to train a lightweight concept bottleneck layer that "decodes" and "controls" this information.
Core Idea: Freeze the pre-trained generator and train only a concept bottleneck autoencoder to be inserted into the latent space, utilizing pseudo-labels instead of ground-truth annotations to realize low-cost post-hoc interpretable generation.

Method

Mechanism: The core idea is to split the pre-trained generative model \(g = g_2 \circ g_1\) into two parts and insert a trainable Concept Bottleneck Autoencoder \(f = D \circ E\) in between, such that the new model \(g_2 \circ f \circ g_1\) preserves generation quality while offering concept-level interpretability and intervention capabilities.

Overall Architecture

Input: Random noise \(z\). Passing through \(g_1\) yields the intermediate latent \(w\). The CB-AE encoder \(E\) maps \(w\) to a concept vector \(c = E(w)\), and the decoder \(D\) reconstructs the latent as \(w' = D(c)\). Finally, \(g_2(w')\) generates the image. The concept vector \(c\) contains both logits of pre-defined concepts and unsupervised concept embeddings.
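The data flow above can be sketched as a plain function composition. This is only an illustration under stated assumptions, not the authors' implementation: `compose_with_cbae` is a hypothetical helper, and the four toy callables merely stand in for the frozen generator halves \(g_1\), \(g_2\) and the trained \(E\), \(D\).

```python
from typing import Callable, List, Tuple

Vector = List[float]


def compose_with_cbae(
    g1: Callable[[Vector], Vector],
    E: Callable[[Vector], Vector],
    D: Callable[[Vector], Vector],
    g2: Callable[[Vector], Vector],
) -> Callable[[Vector], Tuple[Vector, Vector]]:
    """Build g2 . D . E . g1 and also expose the concept vector c."""

    def generate(z: Vector) -> Tuple[Vector, Vector]:
        w = g1(z)        # intermediate latent from the frozen first half
        c = E(w)         # concept vector (interpretable bottleneck)
        w_rec = D(c)     # reconstructed latent w'
        x = g2(w_rec)    # image from the frozen second half
        return x, c

    return generate


# Toy stand-ins (hypothetical): real g1/g2 are halves of a pre-trained generator.
g1 = lambda z: [2.0 * v for v in z]
g2 = lambda w: [v + 1.0 for v in w]
E = lambda w: [0.5 * v for v in w]
D = lambda c: [2.0 * v for v in c]

generate = compose_with_cbae(g1, E, D, g2)
image, concepts = generate([1.0, -1.0])
```

Returning `c` alongside the image is what makes the pipeline inspectable: the same vector that is read out as an explanation is the one the decoder consumes.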

Two methods:
- CB-AE: A complete Concept Bottleneck Autoencoder that provides inherent interpretability.
- CC: Trains only the encoder (Concept Controller), used in conjunction with optimization-based intervention to achieve steering.

Key Designs

Concept Bottleneck Autoencoder (CB-AE):
- Function: Provides concept prediction and intervention capabilities while preserving generation quality.
- Mechanism: CB-AE consists of an encoder \(E\) (latent to concept vector) and a decoder \(D\) (concept vector to reconstructed latent). Each binary concept in the concept vector \(c\) is represented by two logits (e.g., "smiling" is represented as \([c_i^+, c_i^-]\)), alongside a 40-dimensional unsupervised embedding to capture concepts that are not pre-defined. There are three training objectives: (1) reconstruction loss \(\mathcal{L}_{r_1}(w, w') + \mathcal{L}_{r_2}(x, x')\) (MSE) to ensure generation quality; (2) concept alignment loss \(\mathcal{L}_c(\hat{y}, c)\) (cross-entropy) to align concept predictions with pseudo-labels \(\hat{y} = M(x)\); (3) intervention loss \(\mathcal{L}_{i_1}, \mathcal{L}_{i_2}\) to ensure that the generated images actually change when concepts are modified.
- Design Motivation: Post-hoc training avoids the prohibitive costs of training from scratch. Using pseudo-labels (obtained from zero-shot CLIP classification or few-shot labeling models) eliminates the need for ground-truth annotations.
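The two-logit concept representation and the logit-swap intervention described for CB-AE can be illustrated with a small sketch. The memory layout is an assumption made here for illustration (interleaved \([c_i^+, c_i^-]\) pairs followed by the unsupervised embedding), and `predicted_concepts` and `swap_concept_logits` are hypothetical helper names, not functions from the released code.

```python
from typing import List


def predicted_concepts(c: List[float], num_concepts: int) -> List[bool]:
    """A concept is read as present when its positive logit exceeds its negative logit."""
    return [c[2 * i] > c[2 * i + 1] for i in range(num_concepts)]


def swap_concept_logits(c: List[float], concept_index: int, num_concepts: int) -> List[float]:
    """Return a copy of c with [c_i^+, c_i^-] exchanged for one concept.

    Assumed layout: 2 * num_concepts logits first, unsupervised embedding after.
    The embedding entries are left untouched.
    """
    if not 0 <= concept_index < num_concepts:
        raise IndexError("concept_index out of range")
    intervened = list(c)
    pos = 2 * concept_index
    intervened[pos], intervened[pos + 1] = intervened[pos + 1], intervened[pos]
    return intervened


# Two concepts plus a 2-dimensional embedding (the text mentions 40 dimensions).
c = [2.0, -1.0, 0.3, 0.9, 0.1, 0.2]
before = predicted_concepts(c, num_concepts=2)
c_intervened = swap_concept_logits(c, concept_index=0, num_concepts=2)
after = predicted_concepts(c_intervened, num_concepts=2)
```

In the full method the intervened vector would then be passed through \(D\) and \(g_2\) to produce the edited image.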
Optimization-based Interventions:
- Function: Performs precise manipulation of specific concepts in the latent space via gradient optimization.
- Mechanism: Inspired by adversarial attacks, I-RFGSM (Iterative Randomized Fast Gradient Sign Method) is used to optimize over the predictions of the CB-AE encoder. Given the original latent \(w\) and the target concept \(c^*\), the optimization solves \(w^* = w + \arg\max_{\delta \in \Delta}[-\mathcal{L}_c(c^*, E(w+\delta))]\), where \(\Delta = \{\delta : \|\delta\|_\infty \leq \epsilon\}\). Intuitively, it searches for a perturbation \(\delta\) such that the concept prediction of \(w + \delta\) shifts to the target concept, while the \(\epsilon\) bound keeps the latent perturbation small, which is intended to limit visual changes in the image.
- Design Motivation: While simple logit-swapping interventions are intuitive, their steerability is limited. Optimization-based interventions offer higher intervention success rates and superior image quality.
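A simplified sketch of this kind of optimization is shown below: a random start inside the \(\epsilon\)-ball followed by signed gradient steps with projection, in the spirit of I-RFGSM. Everything concrete here is an assumption for illustration: the encoder is a toy linear map for a single binary concept with an analytic gradient, the step size `2 * eps / steps` is a common convention rather than the paper's setting, and `optimize_intervention` is a hypothetical name.

```python
import math
import random
from typing import List


def concept_logits(w: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Toy linear encoder: returns [positive_logit, negative_logit]."""
    return [sum(a * x for a, x in zip(row, w)) + b for row, b in zip(weights, bias)]


def ce_grad_wrt_latent(w, weights, bias, target: int) -> List[float]:
    """Gradient of the two-way cross-entropy w.r.t. w (target 0 = positive, 1 = negative)."""
    pos, neg = concept_logits(w, weights, bias)
    m = max(pos, neg)
    e_pos, e_neg = math.exp(pos - m), math.exp(neg - m)
    probs = [e_pos / (e_pos + e_neg), e_neg / (e_pos + e_neg)]
    d_logits = [probs[k] - (1.0 if k == target else 0.0) for k in range(2)]
    return [sum(d_logits[k] * weights[k][j] for k in range(2)) for j in range(len(w))]


def optimize_intervention(w, weights, bias, target, eps=0.1, steps=50, seed=0):
    rng = random.Random(seed)
    alpha = 2.0 * eps / steps                      # assumed step size
    delta = [rng.uniform(-eps, eps) for _ in w]    # random start inside the eps-ball
    for _ in range(steps):
        shifted = [x + d for x, d in zip(w, delta)]
        grad = ce_grad_wrt_latent(shifted, weights, bias, target)
        # Signed descent on L_c (ascent on -L_c), then clip back to the eps-ball.
        delta = [
            max(-eps, min(eps, d - alpha * ((g > 0) - (g < 0))))
            for d, g in zip(delta, grad)
        ]
    return [x + d for x, d in zip(w, delta)]


weights = [[1.0, 0.0], [-1.0, 0.0]]   # positive logit = w[0], negative logit = -w[0]
bias = [0.0, 0.0]
w = [-0.05, 0.3]                       # initially read as "concept absent"
w_star = optimize_intervention(w, weights, bias, target=0)
```

With a real encoder the gradient would come from automatic differentiation, and \(w^*\) would be fed to \(g_2\) to render the intervened image.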
Concept Controller (CC):
- Function: A lighter alternative to CB-AE, designed solely for concept prediction and optimization-based intervention.
- Mechanism: Because optimization-based intervention does not use the decoder \(D\), the decoder can be omitted, leaving only a concept predictor \(\Omega\) to train. The training objective reduces to the concept alignment loss \(\min_\Omega[\mathcal{L}_c(\hat{y}, c)]\). Combined with optimization-based interventions, CC achieves higher steerability than CB-AE on three of the four reported settings (all except CelebA GAN), even though it is not inherently interpretable (it does not alter the generative pathway).
- Design Motivation: If users only require control without the need for inherent interpretability, the training time of CC can be significantly shorter than that of CB-AE (8-30x faster than CBGM).
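The concept alignment loss that CC trains on can be written out as a per-concept two-way cross-entropy against pseudo-labels. The sketch below assumes each concept contributes a `(positive_logit, negative_logit)` pair and that the loss is averaged over concepts; the averaging and the function name `concept_alignment_loss` are illustrative choices, not details confirmed by the source.

```python
import math
from typing import List, Sequence, Tuple


def concept_alignment_loss(
    logit_pairs: Sequence[Tuple[float, float]],
    pseudo_labels: Sequence[bool],
) -> float:
    """Mean cross-entropy between two-logit concept predictions and pseudo-labels."""
    if len(logit_pairs) != len(pseudo_labels):
        raise ValueError("one pseudo-label is required per concept")
    total = 0.0
    for (pos, neg), label in zip(logit_pairs, pseudo_labels):
        m = max(pos, neg)
        # Numerically stable log-sum-exp over the two logits.
        log_z = m + math.log(math.exp(pos - m) + math.exp(neg - m))
        chosen = pos if label else neg
        total += log_z - chosen
    return total / len(pseudo_labels)


# Pseudo-labels would come from a labeling model M(x); these values are made up.
pseudo_labels = [True, False]
good = concept_alignment_loss([(3.0, -3.0), (-2.0, 2.0)], pseudo_labels)
bad = concept_alignment_loss([(-3.0, 3.0), (2.0, -2.0)], pseudo_labels)
```

Predictions that agree with the pseudo-labels (`good`) yield a lower loss than predictions that contradict them (`bad`), which is the signal the concept predictor is trained on.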

Loss & Training

Total loss for CB-AE: \(\mathcal{L}_{r_1} + \mathcal{L}_{r_2} + \mathcal{L}_c + \mathcal{L}_{i_1} + \mathcal{L}_{i_2}\)
Intervention training: Randomly select a concept, swap its logits to generate \(c_{intervened}\), and reconstruct the image to verify if the intervention is successful.
CC training: \(\mathcal{L}_c\) only.
Hyperparameters: 50 epochs, batch size 64, a 4-layer MLP/Conv architecture; optimization-based intervention uses 50-step I-RFGSM (\(\epsilon=0.1\)).

Key Experimental Results

Main Results

Steerability Comparison (8 concepts, CelebA GAN)

Method	Steerability (%)↑	FID↓	Training Time
CBGM	25.60	9.10	50 V100-hrs
CB-AE	47.34	9.52	14 V100-hrs (3.5×)
CB-AE+opt-int	61.14	—	—
CC+opt-int	51.14	7.65	6 V100-hrs (8.3×)

Steerability (%) Across Models and Datasets

Method	CelebA GAN	CelebA-HQ DDPM	CelebA-HQ StyleGAN2	CUB GAN
CBGM	25.60	13.80	—	21.30
CB-AE+opt-int	61.14	38.09	61.66	46.03
CC+opt-int	51.14	41.45	67.95	48.91

Ablation Study

Configuration	Concept Accuracy	Steerability	Description/Notes
CB-AE (logit swap)	86.56%	47.34%	Basic intervention
CB-AE + opt-int	86.56%	61.14%	Optimization-based intervention, +13.8 points
CC + opt-int	87.65%	51.14%	Lighter, but 10.0 points lower steerability than CB-AE + opt-int
CLIP zero-shot pseudo-labels	Slightly lower	Still significantly outperforms CBGM	Zero concept supervision is feasible
TIP few-shot pseudo-labels (128 images)	Close to supervised	Close to supervised	Approximable with only 128 images

Key Findings

Optimization-based intervention is key: Elevates the steerability of CB-AE from 47.34% to 61.14%, a relative improvement of 29%.
CC is optimal on StyleGAN2: Reaches up to 67.95% on CelebA-HQ, as the clean latent space of GANs is more amenable to optimization.
Pseudo-labels are sufficient: Utilizing zero-shot CLIP classifiers yields effective pseudo-labels, eliminating the need for any ground-truth annotation data.
Large-scale scenario (40 concepts): On all 40 attributes of CelebA, CB-AE+opt-int reaches 58.3% steerability, while CBGM reaches only 23.1%.
User studies validate the reliability of automated evaluations; human consensus on CB-AE's concept accuracy aligns closely with automated metrics.
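For reference, the absolute and relative gains quoted for optimization-based intervention in the first finding follow directly from the CelebA GAN steerability numbers:

\[
61.14 - 47.34 = 13.80 \text{ points}, \qquad \frac{13.80}{47.34} \approx 0.29
\]

So the 13.8 figure is an absolute difference in percentage points, while 29% is the improvement relative to the logit-swap baseline.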

Highlights & Insights

Post-hoc training paradigm: Freezing the pre-trained generator and training only a lightweight bottleneck layer is practical, and the paper demonstrates it on GAN, DDPM, and StyleGAN2 generators. The same blueprint may transfer to interpretability research in other media, such as video or 3D generation.
Adversarial attack to concept intervention: Repurposing an adversarial attack method (I-RFGSM) for concept steering is an effective cross-domain analogy.
Minimalist design of CC: Removing the decoder shows that intervention can be carried out entirely in the latent space via optimization, without learning an explicit concept-to-latent mapping, which substantially reduces training cost.

Limitations & Future Work

Concept coupling exists (e.g., "young" and "bald"): modifying one concept can alter others, because the current implementation offers limited control over concept orthogonality.
The method relies heavily on pseudo-label quality, and CLIP's zero-shot capability remains limited for some concepts (e.g., fine-grained CUB bird attributes).
Only verified in image generation scenarios; extension to video or 3D generation has not yet been explored.
Optimization-based intervention requires multi-step gradient calculations (e.g., 50 steps), which might limit its real-time application.
There could be an optimization conflict between the reconstruction loss and intervention loss within CB-AE.

vs CBGM: CBGM trains the entire generative model from scratch and necessitates ground-truth concept annotations, whereas CB-AE uses post-hoc training with pseudo-labels. CB-AE outperforms CBGM in steerability by ~25% on average while being 4-15x faster to train.
vs LF-CBM/VLG-CBM: These are post-hoc CBMs tailored for classification tasks. CB-AE is the first work to extend the post-hoc CBM paradigm to generative models.
vs GAN latent manipulation: Techniques such as InterfaceGAN directly manipulate the GAN latent space but lack concept-level interpretability and structured constraints.

Rating

Novelty: ⭐⭐⭐⭐ The combination of post-hoc concept bottlenecks and optimization-based interventions is highly novel, and the method design is elegant.
Experimental Thoroughness: ⭐⭐⭐⭐ Very comprehensive, covering multiple generative models (GAN/DDPM/StyleGAN2), multiple datasets, and human user studies.
Writing Quality: ⭐⭐⭐⭐ The methodology is described clearly, and the comparison tables are highly informative.
Value: ⭐⭐⭐⭐ It provides a practical, low-cost solution for the interpretability of generative models, with strong potential for real-world applications.