Emergence and Evolution of Interpretable Concepts in Diffusion Models¶
Basic Information¶
- arXiv: 2504.15473
- Conference: NeurIPS 2025
- Authors: Berk Tinaz, Zalan Fabian, Mahdi Soltanolkotabi (USC)
- Institution: University of Southern California
- Code: Not open-sourced
TL;DR¶
This work is the first to systematically apply Sparse Autoencoders (SAEs) to multi-step diffusion models (Stable Diffusion v1.4), revealing that image composition emerges as early as the first reverse diffusion step while stylistic concepts form during intermediate stages. Based on these findings, the paper proposes temporally adaptive causal intervention techniques.
Background & Motivation¶
Despite the remarkable success of diffusion models in image generation, their internal mechanisms remain largely opaque. SAEs have proven effective in mechanistic interpretability of LLMs (e.g., Anthropic's analysis of Claude), yet they have not been applied to understanding how visual representations evolve over time during the multi-step generation process of diffusion models. Prior work (Surkov et al.) analyzed only single-step distilled diffusion models (SDXL Turbo), making it impossible to capture the temporal evolution of features — which is precisely the most distinctive characteristic of diffusion models.
Core Problem¶
- How much information do image representations already encode in the early stages of generation?
- How do visual representations evolve across different stages of the diffusion process?
- Can the discovered interpretable concepts be used to causally guide the generation process?
- How does the effectiveness of interventions vary with diffusion timestep?
Method¶
1. SAE Architecture¶
A \(k\)-sparse autoencoder (TopK activation) is trained on the residual updates of the U-Net in SD v1.4:

- Encoder: \(\mathbf{z} = \text{TopK}(\text{ReLU}(\mathbf{W}_{enc}(\mathbf{x} - \mathbf{b})))\)
- Decoder: \(\hat{\mathbf{x}} = \mathbf{W}_{dec}\mathbf{z} + \mathbf{b}\)
- Concept vectors: \(\mathbf{f}_i = \mathbf{W}_{dec}[:, i]\) (columns of the decoder weight matrix)
- Dictionary size: \(n_f = 4d = 5120\) (expansion factor 4, with \(d = 1280\))
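For concreteness, here is a minimal PyTorch sketch of such a TopK SAE; the sparsity level `k` and the class/attribute names are illustrative assumptions, not the paper's training code.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal k-sparse autoencoder with a TopK activation (sketch)."""
    def __init__(self, d_model: int = 1280, expansion: int = 4, k: int = 32):
        super().__init__()
        n_f = expansion * d_model                          # dictionary size, e.g. 4 * 1280 = 5120
        self.k = k
        self.b = nn.Parameter(torch.zeros(d_model))        # shared pre-encoder / decoder bias
        self.W_enc = nn.Linear(d_model, n_f, bias=False)
        self.W_dec = nn.Linear(n_f, d_model, bias=False)   # columns of W_dec are concept vectors f_i

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = torch.relu(self.W_enc(x - self.b))
        # Keep only the k largest activations per token, zero out the rest.
        vals, idx = pre.topk(self.k, dim=-1)
        return torch.zeros_like(pre).scatter_(-1, idx, vals)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encode(x)
        return self.W_dec(z) + self.b
```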
2. Temporally Aware Activation Collection¶
Independent SAEs are trained for each combination of 3 diffusion stages (\(t \in \{0.0, 0.5, 1.0\}\)) × 3 U-Net blocks (down_block, mid_block, up_block) × 2 conditioning types (cond/uncond). Activations are collected from the residual updates \(\Delta_{\ell,t}\) of cross-attention transformer blocks.
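A hedged sketch of how such residual updates could be collected with a forward hook on a U-Net block in the standard diffusers SD v1.4 pipeline. The paper hooks the cross-attention transformer blocks inside each down/mid/up block; hooking the whole `mid_block`, and the prompt and step count below, are simplifying assumptions for illustration.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

captured = []  # one residual update per denoising step, in call order

def residual_hook(module, inputs, output):
    # Residual update Delta = output - input of the hooked block.
    hidden_in = inputs[0]
    hidden_out = output[0] if isinstance(output, tuple) else output
    captured.append((hidden_out - hidden_in).detach().cpu())

handle = pipe.unet.mid_block.register_forward_hook(residual_hook)
_ = pipe("a photo of a dog on a beach", num_inference_steps=50)
handle.remove()
# With classifier-free guidance, each capture stacks the uncond/cond batch halves,
# which can be split to train the cond/uncond SAEs separately.
```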
3. Concept Dictionary Construction (Vision-only Pipeline)¶
A key novelty is the use of a purely vision-based pipeline for concept annotation, without relying on LLMs:

- RAM (image tagging) → Grounding DINO (open-set detection) → SAM (segmentation)
- IoU between SAE activation maps and segmentation masks is computed; a label is assigned to a concept ID (CID) when the IoU exceeds 0.5
- Each concept is represented by the mean Word2Vec embedding of the corresponding object name
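A minimal sketch of the IoU-based labeling step, assuming each concept's SAE activation map has already been binarized and upsampled to the segmentation masks' resolution (the threshold default and helper names are illustrative):

```python
import numpy as np

def mask_iou(activation_mask: np.ndarray, seg_mask: np.ndarray) -> float:
    """IoU between a binarized SAE activation map and a SAM segmentation mask."""
    inter = np.logical_and(activation_mask, seg_mask).sum()
    union = np.logical_or(activation_mask, seg_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def assign_labels(concept_masks: dict, object_masks: dict, iou_thresh: float = 0.5) -> dict:
    """Map concept IDs (CIDs) to the object names whose masks overlap strongly (IoU > 0.5)."""
    labels = {}
    for cid, act_mask in concept_masks.items():
        for name, seg_mask in object_masks.items():
            if mask_iou(act_mask, seg_mask) > iou_thresh:
                labels.setdefault(cid, []).append(name)
    return labels
```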
4. Composition Prediction¶
A "concept map" is constructed using the concept dictionary: each spatial position is mapped to its top activated concept → Word2Vec embedding → cosine similarity with target object → segmentation prediction.
5. Causal Intervention Techniques¶
Spatially targeted intervention (composition control):

$$\tilde{\Delta}_{\ell,t}[i,j] = \begin{cases} \Delta_{\ell,t}[i,j] + \beta \sum_{c \in C_o} \mathbf{f}_c & \text{if } (i,j) \in S \\ \Delta_{\ell,t}[i,j] - \sum_{c \in C_o} \mathbf{f}_c & \text{otherwise} \end{cases}$$

Activations are manipulated directly rather than through encode-then-decode, avoiding reconstruction error.
Global intervention (style control): \(\tilde{\Delta}_{\ell,t}[i,j] = \Delta_{\ell,t}[i,j] + \beta \mathbf{f}_c\)
Both interventions incorporate adaptive \(\beta\) normalization to stabilize effects across different objects and styles.
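A hedged sketch of both interventions as they might be applied to a block's residual update inside a forward hook; the boolean-mask handling and the adaptive β normalization (rescaling the concept direction to the typical activation norm) are illustrative assumptions rather than the paper's exact scheme.

```python
import torch

def spatial_intervention(delta: torch.Tensor, f_concepts: torch.Tensor,
                         region: torch.Tensor, beta: float) -> torch.Tensor:
    """
    delta:      residual update, shape (B, H, W, d)
    f_concepts: concept vectors for the target object C_o, shape (n_c, d)
    region:     boolean target region S, shape (H, W)
    """
    direction = f_concepts.sum(dim=0)                          # sum over C_o
    # Adaptive normalization: scale the concept direction to the typical activation norm.
    scale = delta.norm(dim=-1).mean() / (direction.norm() + 1e-8)
    direction = direction * scale
    out = delta.clone()
    out[:, region] = delta[:, region] + beta * direction       # add inside S
    out[:, ~region] = delta[:, ~region] - direction            # subtract elsewhere
    return out

def global_intervention(delta: torch.Tensor, f_c: torch.Tensor, beta: float) -> torch.Tensor:
    """Add a single (style) concept vector at every spatial position."""
    scale = delta.norm(dim=-1).mean() / (f_c.norm() + 1e-8)
    return delta + beta * scale * f_c
```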
Key Experimental Results¶
Emergence Timing of Composition¶
- \(t=1.0\) (first step): mid_block IoU ≈ 0.26 — scene layout is already predictable, even though the model output is still pure noise
- \(t=0.5\) (middle stage): IoU saturates — composition is largely fixed
- \(t=0.0\) (final step): highest IoU, but constrained by annotation pipeline accuracy
- up_block yields the most precise segmentation predictions
Quantitative Metrics for Temporal Concept Evolution¶
| Timestep | Concept Cohesion↑ | Inter-concept Similarity↓ |
|---|---|---|
| \(t=1.0\) | 0.588 | 0.433 |
| \(t=0.5\) | 0.627 | 0.378 |
| \(t=0.0\) | 0.664 | 0.344 |
→ Concepts become progressively purer and more discriminative as generation proceeds
Intervention Effectiveness vs. Diffusion Stage¶
| Stage | Spatial Intervention Success Rate | Global Intervention Success Rate | LPIPS |
|---|---|---|---|
| Early (\(t=1.0\)) | 80% | 78% (alters composition, not style) | 0.653 |
| Middle (\(t=0.5\)) | 23% (fails) | 93% (+0.021 ΔCLIP) | 0.385 |
| Final (\(t=0.0\)) | 25% (fails) | 69% (texture only) | 0.114 |
Summary of Core Findings¶
- Early stage: Composition is controllable; style is not (global intervention alters composition rather than style)
- Middle stage: Composition is locked; style is controllable (optimal window for style editing)
- Late stage: Only texture details remain modifiable
Highlights & Insights¶
- Striking finding: Image composition emerges at the very first reverse diffusion step, when the model output is still noise
- Vision-only annotation pipeline: Avoids LLM biases and scales to large-scale concept discovery
- Temporally adaptive intervention: First systematic characterization of what to intervene on, and when
- Theoretical coherence: The three-stage evolution (composition → style → texture) aligns with denoising autoencoder (DAE) theory
- Causal validity of concept vectors: Intervention experiments establish causality rather than mere correlation
Limitations & Future Work¶
- Validated only on SD v1.4 (U-Net); not extended to DiT architectures (e.g., FLUX)
- Skip connections in the U-Net let information bypass the intervened block (intervention leakage), requiring large \(\beta\) values
- Concept dictionary quality depends on external detection and segmentation models
- Independent SAEs are trained per timestep, precluding direct cross-timestep concept comparison
- Word2Vec embeddings have limited expressive capacity
Related Work & Insights¶
- vs. Surkov et al. (SDXL Turbo SAE): Surkov et al. analyze only a single-step distilled model; this paper analyzes temporal evolution across multi-step generation — a critical distinction
- vs. Cross-attention visualization (DAAM): DAAM uses cross-attention for saliency maps; the SAE concepts proposed here are more fine-grained and support causal intervention
- vs. h-space/Jacobian editing directions: Kwon et al. and Park et al. identify editing directions in specific layers; the concept dictionary proposed here is more systematic and interpretable
- vs. Prompt-to-Prompt: P2P edits by manipulating attention weights, whereas this work discovers concepts in SAE latent space before intervening — the two approaches are complementary
Connections and Implications¶
- Directions for DiT architectures: The absence of skip connections in DiT may make interventions more effective. Extending SAEs to FLUX/SD3 is a natural next step
- Connection to Don't Let It Fade (TTA-Diffusion): Both works study the temporal dimension of diffusion — TTA identifies update forgetting, while this paper characterizes the timeline of composition emergence. The early fixation of composition explains why TTA's timestep allocation strategy is effective
- Implications for controllable generation: The optimal editing window depends on the type of edit (composition vs. style); a uniform "full-trajectory guidance" strategy may be suboptimal
Rating¶
- Novelty: ★★★★★ — Applying SAEs to the temporal evolution of diffusion models is an entirely new perspective
- Technical Depth: ★★★★☆ — The method is clean, and the experimental design is carefully crafted
- Experimental Thoroughness: ★★★★☆ — Qualitative and quantitative results are well integrated, though limited to SD v1.4
- Writing Quality: ★★★★★ — Structure is clear and driven by well-defined scientific questions