Revelio: Interpreting and Leveraging Semantic Information in Diffusion Models¶
Conference: ICCV 2025
arXiv: 2411.16725
Code: GitHub
Area: Diffusion Models · Representation Learning
Keywords: Diffusion model interpretability, k-sparse autoencoders, representation learning, transfer learning, monosemantic features
TL;DR¶
Revelio trains k-sparse autoencoders (k-SAEs) on intermediate features of diffusion models to uncover monosemantic, interpretable features across layers and timesteps, then validates the transfer-learning utility of these features with a lightweight classifier, Diff-C, enabling systematic interpretation of black-box diffusion models.
Background & Motivation¶
Diffusion models excel at generating high-quality images, yet how their internal representations encode rich visual semantic information remains poorly understood. The core research questions include:
- What granularity of visual information is captured at different layers and timesteps?
- What are the differences in inductive biases between convolutional UNet and Transformer DiT architectures?
- How do pretraining data and language model conditioning affect representations?
Limitations of Prior Work¶
Existing interpretation methods (e.g., attention map visualization, PCA) analyze only single images and thus give no systematic, dataset-level picture of the semantics a model encodes.
Method¶
1. k-Sparse Autoencoder (k-SAE)¶
A k-SAE is trained on intermediate features of pretrained diffusion models, spatially pooled to \(x \in \mathbb{R}^d\):
Encoder: \(z = \text{TopK}(W_{enc}(x - b_{pre}) + b_{enc})\)
The TopK activation retains the top \(k\) largest neuron activations and sets the rest to zero (\(k=32\)).
Decoder: \(\hat{x} = W_{dec} z + b_{pre}\)
Loss: \(L_{mse} = \|x - \hat{x}\|_2^2\)
The semantic concept captured by each k-SAE neuron is revealed by inspecting its highest-activating images.
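A minimal PyTorch sketch of this setup, assuming the equations above; the latent width, input width, and optimizer settings are illustrative guesses, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class KSparseAutoencoder(nn.Module):
    """Minimal k-SAE: keep the k largest latent activations, zero the rest."""
    def __init__(self, d_input: int, d_latent: int, k: int = 32):
        super().__init__()
        self.k = k
        self.b_pre = nn.Parameter(torch.zeros(d_input))           # b_pre
        self.encoder = nn.Linear(d_input, d_latent)               # W_enc, b_enc
        self.decoder = nn.Linear(d_latent, d_input, bias=False)   # W_dec

    def forward(self, x: torch.Tensor):
        z = self.encoder(x - self.b_pre)              # W_enc (x - b_pre) + b_enc
        vals, idx = torch.topk(z, self.k, dim=-1)     # TopK activation
        z = torch.zeros_like(z).scatter_(-1, idx, vals)
        x_hat = self.decoder(z) + self.b_pre          # W_dec z + b_pre
        return x_hat, z

# One training step on pooled diffusion features (d=1280 is an assumed width).
sae = KSparseAutoencoder(d_input=1280, d_latent=16384, k=32)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
x = torch.randn(64, 1280)                      # stand-in for pooled features
x_hat, z = sae(x)
loss = ((x - x_hat) ** 2).sum(dim=-1).mean()   # L_mse
loss.backward()
opt.step()
```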
2. Diffusion Classifier (Diff-C)¶
A lightweight classifier head: four convolutional layers that progressively downsample the diffusion feature map, followed by pooling and a fully connected layer. It validates the transfer-learning effectiveness of diffusion features and runs \(10^4\) times faster than the Diffusion Classifier, which must score every candidate class with repeated denoising evaluations.
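A hedged sketch of such a head; the channel widths, kernel sizes, strides, and input shape below are assumptions:

```python
import torch
import torch.nn as nn

class DiffC(nn.Module):
    """Lightweight head over a diffusion feature map: 4 strided convs
    (progressive downsampling), global pooling, and a linear classifier."""
    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        widths = [in_channels, 256, 128, 64, 32]
        layers = []
        for c_in, c_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU()]
        self.convs = nn.Sequential(*layers)
        self.fc = nn.Linear(widths[-1], num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        h = self.convs(feats)        # 4 downsampling stages
        h = h.mean(dim=(2, 3))       # global average pooling
        return self.fc(h)

# feats: e.g. a cached up_ft1 map; 640 channels and 32x32 are assumptions.
logits = DiffC(in_channels=640, num_classes=37)(torch.randn(2, 640, 32, 32))
```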
Evaluation Metrics¶
- Label purity \(\sigma_{label}\): the standard deviation of class labels among the top-10 activating images, averaged over the 1000 most highly activating k-SAE features; lower values indicate more class-discriminative features (see the sketch after this list).
- GPT-4o evaluation: Highly activating images are presented to GPT-4o as multiple-choice questions to assess semantic granularity.
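One possible reading of \(\sigma_{label}\) as code; how features are ranked as "highly activating" is an assumption:

```python
import numpy as np

def label_purity(acts: np.ndarray, labels: np.ndarray,
                 n_features: int = 1000, top_images: int = 10) -> float:
    """acts: (num_images, num_latents) k-SAE activations over a dataset;
    labels: (num_images,) integer class labels.
    For each of the n_features most strongly activating latents, compute
    the std of class labels over its top_images highest-activating images,
    then average. Lower = more class-pure features."""
    ranked = np.argsort(acts.max(axis=0))[::-1][:n_features]  # rank latents
    stds = []
    for j in ranked:
        top = np.argsort(acts[:, j])[::-1][:top_images]       # top images
        stds.append(labels[top].std())
    return float(np.mean(stds))
```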
Key Experimental Results¶
Information Granularity Across Layers¶
Main Results¶
| Layer | Oxford-IIIT Pet σ↓ | Caltech-101 σ↓ |
|---|---|---|
| bottleneck | 9.48 | 9.35 |
| up_ft0 | 9.90 | 15.65 |
| up_ft1 | 8.59 | 21.33 |
| up_ft2 | 9.67 | 25.61 |
- Fine-grained task (Pet breed classification): up_ft1 performs best, capturing breed-level features.
- Coarse-grained task (Caltech-101 object classification): bottleneck performs best, where shape information suffices.
- up_ft2 tends to capture high-frequency texture/pixel-level information, resulting in the poorest transferability.
Effect of Timestep¶
Ablation Study¶
| Timestep \(t\) | Pet σ↓ (up_ft1) | Caltech-101 σ↓ (bottleneck) |
|---|---|---|
| 0 | 8.99 | 11.91 |
| 25 | 8.59 | 9.35 |
| 100 | 8.87 | 8.72 |
| 200 | 8.94 | 8.17 |
| 500 | 9.53 | 16.65 |
- Fine-grained task: \(t=25\) (low noise) is optimal.
- Coarse-grained task: \(t=200\) (moderate noise) is optimal; the additional noise may enhance feature generalization (a sketch for extracting features at a chosen layer and timestep follows this list).
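For context, a hedged sketch of extracting such features from SD 1.5 with diffusers; mapping "up_ft1" to `up_blocks[1]` and the use of an empty prompt are assumptions about the paper's setup:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

feats = {}
hook = pipe.unet.up_blocks[1].register_forward_hook(
    lambda mod, inp, out: feats.__setitem__("up_ft1", out))

images = torch.randn(1, 3, 512, 512, dtype=torch.float16,
                     device="cuda")              # stand-in batch in [-1, 1]
latents = pipe.vae.encode(images).latent_dist.sample() \
    * pipe.vae.config.scaling_factor
t = torch.tensor([25], device="cuda")            # chosen timestep
noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), t)

ids = pipe.tokenizer([""], return_tensors="pt").input_ids.to("cuda")
cond = pipe.text_encoder(ids)[0]                 # empty-prompt conditioning
with torch.no_grad():
    pipe.unet(noisy, t, encoder_hidden_states=cond)
hook.remove()

pooled = feats["up_ft1"].float().mean(dim=(2, 3))  # spatial pooling for the k-SAE
```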
SD 1.5 vs. SD 2.1¶
| Model | Pet σ↓ |
|---|---|
| SD 1.5 | 8.59 |
| SD 2.1 | 9.67 |
SD 1.5 (using the CLIP ViT-L/14 text encoder) outperforms SD 2.1 (using OpenCLIP ViT-H/14) on fine-grained classification.
Analysis of DiT Blocks¶
| Block | Pet σ↓ |
|---|---|
| 6 | 10.18 |
| 10 | 9.44 |
| 14 | 9.05 |
| 18 | 9.55 |
| 22 | 9.84 |
The middle block of DiT (block 14) captures the most class-discriminative features, analogous to up_ft1 in UNet.
Classification Performance¶
On Oxford-IIIT Pet, Diff-C (SD 1.5 up_ft1, \(t=25\)) achieves competitive classification accuracy with a minimal architecture, while running four orders of magnitude faster than the Diffusion Classifier.
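A hedged training sketch reusing the DiffC class from the method section; random tensors stand in for cached SD 1.5 up_ft1 features at \(t=25\) (Oxford-IIIT Pet has 37 classes), and the hyperparameters are assumptions:

```python
import torch
import torch.nn as nn

model = DiffC(in_channels=640, num_classes=37)   # DiffC sketched earlier
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

feats = torch.randn(16, 640, 32, 32)             # cached feature batch
labels = torch.randint(0, 37, (16,))
for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(feats), labels)         # standard cross-entropy
    loss.backward()
    opt.step()
```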
Highlights & Insights¶
- First application of k-SAEs to the mechanistic interpretation of visual diffusion models, revealing monosemantic features.
- Interaction between representation granularity and task granularity: different tasks benefit from features extracted from different layers and timesteps.
- Interesting architectural comparison: convolutional UNet and Transformer DiT exhibit distinct inductive biases.
- Later layers of diffusion models (up_ft2) focus more on pixel-level reconstruction and transfer poorly — analogous to the behavior of later layers in LLMs.
Limitations & Future Work¶
- Interpretation of k-SAE features relies on manual inspection and is inherently subjective.
- GPT-4o-based granularity evaluation is noisy.
- Diff-C requires training a classifier (albeit lightweight) and is not truly zero-shot.
- Analysis of large-scale text-to-image models (e.g., SDXL, Flux) is absent.
Related Work & Insights¶
- Diffusion features for discriminative tasks: DDAE, Diffusion Classifier, DiffusionDet
- Model interpretation: sparse autoencoders in LLMs, PCA analysis in Plug-and-Play diffusion
- CLIP feature interpretation: CLIP-SAE
Rating¶
| Dimension | Score (1–5) |
|---|---|
| Novelty | 4 |
| Technical Depth | 4 |
| Experimental Thoroughness | 5 |
| Writing Quality | 4 |
| Overall | 4.2 |