Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers¶
Conference: CVPR 2026
arXiv: 2601.06338
Code: None
Area: Image Generation / Mechanistic Interpretability
Keywords: Diffusion Transformer, Spatial Relation Generation, Mechanistic Interpretability, Attention Circuits, Text Encoder
TL;DR¶
Mechanistic interpretability reveals the internal circuits by which Diffusion Transformers (DiT) generate spatial relations: models trained with Random Token Embeddings (RTE) use a modular two-stage circuit (Relation Heads + Object Generation Heads), whereas T5-encoded models fuse relation information into object tokens and decode it from a single token; the two circuit types differ markedly in robustness.
Background & Motivation¶
Text-to-Image (T2I) diffusion models have made great progress in generating high-quality images but often fail to compose spatial relations between multiple objects (e.g., "a red square is to the top-left of a blue circle"). While accuracy in single-object attribute generation is improving rapidly, progress in spatial relation generation remains slow.
Prior work proposed various remedies (layout conditioning, cross-attention guidance, curriculum learning), but few studies explain, from the perspective of internal mechanisms, why spatial relation generation fails. The motivating questions include:
- How do neural networks encode and use non-commutative relations between objects (e.g., "A above B" \(\neq\) "B above A")?
- How can key heads be systematically localized when iterative sampling makes the raw attention maps too numerous to analyze directly?
- Does the bottleneck of spatial relation generation lie in cross-attention or in the text encoder? Answering this requires a holistic view of both components.
Method¶
Overall Architecture¶
The authors constructed a minimal text-image dataset to train DiT models of various scales from scratch to generate images of two objects (with combined shape and color attributes) arranged in specified spatial relations. Combinations of 3 shapes \(\times\) 2 colors \(\times\) 8 spatial relations were used. The architecture follows the PixArt-style DiT, comparing three text encoders: T5-XXL, Random Token Embedding (RTE), and RTE without position encoding.
Key Designs¶
- Attention Synopsis: Facing massive cross-attention maps (layers \(\times\) heads \(\times\) timesteps \(\times\) conditional/unconditional \(\times\) tokens), the authors developed a scalable analysis paradigm (see the aggregation sketch after this list):
    - Group tokens by category (image tokens by object segmentation, text tokens by semantic attributes).
    - Aggregate attention at the category grain to obtain interpretable inter-category interaction patterns.
    - Average over timesteps to compress the attention tensors into a [layers, heads] synopsis map.
    - This makes it possible to localize key heads among more than 10 million attention maps.
- Two-stage Circuit in RTE-DiT:
    - Spatial Relation Head (L2H8): Its QK circuit matches the sinusoidal position encodings of image tokens against the text embeddings of relation words. "Above" produces a vertical gradient; "left" produces a horizontal one. These gradients activate immediately at the start of sampling (step 0) and act as position labels marking the canvas regions where each object should be placed.
    - Object Generation Head (L4H3): Activates later (steps 4-8), reads the position labels written by the relation head, and connects regions with matching labels to the corresponding shape tokens, generating the correct object at the correct position. This head is invariant to spatial position and relation and conveys only shape identity.
- Fusion Circuit in T5-DiT:
    - T5 self-attention fuses the entire sentence context into each token, so the DiT decodes spatial relations from non-relation word tokens (especially the second shape token, shape2).
    - Variance decomposition shows that in the T5 embeddings shape2 explains 37.5% of the variance and relation words 12.1%; after the DiT's MLP projection, the relation share is amplified to 21.3%.
    - Causal validation via vector arithmetic on the T5 embeddings (subtracting the original relation vector and adding a new one) successfully changed the spatial positions of the generated objects (see the embedding-editing sketch after this list).
- Weight-Space Head Screening: An efficient screening method that requires no sample generation: QK interactions between image position features and text relation features are computed directly from the model weights and checked against reference relation gradients (see the screening sketch after this list).
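A minimal sketch of the category-grained aggregation, assuming the full cross-attention stack has been recorded as a single tensor; the function, tensor layout, and category names are illustrative, not the paper's code:

```python
import torch

def attention_synopsis(attn, img_groups, txt_groups):
    """Collapse a cross-attention stack into per-(layer, head) synopsis maps.

    attn:       [timesteps, layers, heads, n_img_tokens, n_txt_tokens] attention weights
    img_groups: dict name -> bool mask over image tokens (e.g. per-object segmentation masks)
    txt_groups: dict name -> bool mask over text tokens (e.g. shape / color / relation words)
    Returns a dict mapping (image_category, text_category) -> [layers, heads] mean attention.
    """
    synopsis = {}
    for iname, imask in img_groups.items():
        for tname, tmask in txt_groups.items():
            block = attn[..., imask, :][..., tmask]               # keep only this category pair
            synopsis[(iname, tname)] = block.mean(dim=(0, 3, 4))  # average over steps and tokens
    return synopsis

# Example: rank heads by how strongly object-1 pixels attend to the relation word.
# rel_map = attention_synopsis(attn, img_groups, txt_groups)[("object1", "relation")]
# layer, head = divmod(rel_map.argmax().item(), rel_map.shape[1])
```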
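A sketch of the causal embedding edit on the T5 side, assuming the relation direction vectors are estimated as mean differences between prompts that differ only in the relation word; the function and argument names are ours:

```python
import torch

def swap_relation(prompt_emb, rel_old, rel_new, token_mask=None, alpha=1.0):
    """Steer a T5 prompt embedding from one spatial relation toward another.

    prompt_emb:  [n_tokens, d] T5 encoder output for the prompt
    rel_old/new: [d] direction vectors for the original and target relation
    token_mask:  optional bool mask to edit only selected tokens
                 (e.g. shape2, where most of the relation variance sits)
    """
    edit = alpha * (rel_new - rel_old)
    if token_mask is None:
        return prompt_emb + edit                   # edit every token's embedding
    out = prompt_emb.clone()
    out[token_mask] = out[token_mask] + edit       # edit only the masked tokens
    return out
```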
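The weight-space screening can be sketched as a direct QK computation, assuming access to a head's projection matrices and to the sinusoidal position features of the image token grid; the paper's exact feature construction may differ:

```python
import torch
import torch.nn.functional as F

def screen_head(W_q, W_k, pos_feats, rel_emb, ref_gradient):
    """Score one head without running the sampler: does its QK map for a
    relation word reproduce the expected spatial gradient?

    W_q, W_k:     [d_model, d_head] query / key projections of the head
    pos_feats:    [H*W, d_model] sinusoidal position features of the image grid
    rel_emb:      [d_model] text-side embedding of a relation word (e.g. "above")
    ref_gradient: [H*W] reference ramp for that relation (e.g. top-to-bottom for "above")
    """
    q = pos_feats @ W_q                      # [H*W, d_head]
    k = rel_emb @ W_k                        # [d_head]
    qk_map = q @ k                           # [H*W] spatial logit map induced by the head
    qk_map = qk_map - qk_map.mean()          # center both maps before comparing shapes
    ref = ref_gradient - ref_gradient.mean()
    return F.cosine_similarity(qk_map, ref, dim=0).item()
```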
Loss & Training¶
- Standard diffusion training; sampling uses DPM-Solver++ with 14 steps and CFG = 4.5.
- Model sizes trained: DiT-B (12L/12H/768D), mini (6L/6H/384D), micro (6L/3H/192D), and nano (3L/3H/192D); see the configuration snippet after this list.
- EMA weights used for evaluation.
- Four evaluation metrics: Color, Shape, Unique Binding, and Spatial Relation.
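For quick reference, the reported model sizes and training/sampling settings collected into a plain config; the dict layout is ours, the values are as stated above:

```python
# Model sizes and sampling settings as reported in the paper; the dict layout is ours.
DIT_CONFIGS = {
    "DiT-B":     dict(layers=12, heads=12, dim=768),
    "DiT-mini":  dict(layers=6,  heads=6,  dim=384),
    "DiT-micro": dict(layers=6,  heads=3,  dim=192),
    "DiT-nano":  dict(layers=3,  heads=3,  dim=192),
}
SAMPLING = dict(solver="DPM-Solver++", steps=14, cfg_scale=4.5, eval_weights="EMA")
METRICS = ["Color", "Shape", "Unique Binding", "Spatial Relation"]
```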
Key Experimental Results¶
Main Results¶
| Model | Text Encoder | Color↑ | Shape↑ | Unique Binding↑ | Spatial Relation↑ |
|---|---|---|---|---|---|
| DiT-B | T5 | 99% | 97% | 93% | 89% |
| DiT-B | RTE | 99% | 96% | 90% | 86% |
| DiT-B | RTE w/o pos | 99% | 96% | 41% | 15% |
| DiT-nano | RTE | - | - | - | 5% |
Ablation Study¶
| Configuration | Key Metric | Description |
|---|---|---|
| Ablate Relation Attn in L2H8 | Spatial Acc 67%→33% | Relation head is crucial for spatial layout |
| Ablate Object Attn in L4H3 | Shape Acc 90%→76% | Object head has causal role in shape generation |
| Ablate relation words in T5-DiT | Almost no impact | T5 fuses relation info into other tokens |
| Ablate shape2 in T5-DiT | Relation Acc down 50% | Relation info primarily encoded in shape2 |
| Insert filler "the" in T5-DiT | Relation Acc drops sharply | T5 circuit is sensitive to minor lexical changes |
| Insert filler in RTE-DiT | Remains stable | Modular circuit is more robust to perturbation |
Key Findings¶
- Circuit Mechanism Depends on Text Encoder: RTE uses a modular two-stage circuit (Relation \(\rightarrow\) Position Label \(\rightarrow\) Object), while T5 uses a fused single-token decoding circuit.
- Position Encoding is Necessary: RTE without position encoding achieves only 15% spatial accuracy because it cannot distinguish "A above B" from "B above A" (see the toy sketch after this list).
- Learning Dynamics are Phased: Color \(\rightarrow\) Shape \(\rightarrow\) Attribute Binding \(\rightarrow\) Spatial Relation; relations are learned slowest.
- Distinct Robustness: RTE-DiT is sensitive to relation word ablation but robust to filler words; T5-DiT is the opposite.
- Transferability to Pre-trained Models: Sparse spatial circuits can also be identified in PixArt-Sigma.
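The position-encoding finding has a simple mechanistic reading: without positional information on the text side, cross-attention sees the prompt as an unordered set of token embeddings, so "A above B" and "B above A" yield identical conditioning. A toy sketch (dimensions and vocabulary are ours):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 16
emb = {w: torch.randn(d) for w in ["square", "above", "circle"]}  # random token embeddings, no positions

def cross_attn(query, words):
    kv = torch.stack([emb[w] for w in words])          # [n_txt, d], in prompt order
    weights = F.softmax(query @ kv.T / d ** 0.5, dim=-1)
    return weights @ kv

q = torch.randn(d)                                     # stand-in for an image-token query
a = cross_attn(q, ["square", "above", "circle"])       # "square above circle"
b = cross_attn(q, ["circle", "above", "square"])       # "circle above square"
print(torch.allclose(a, b))                            # True: both prompts condition identically
```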
Highlights & Insights¶
- Mechanistic Interpretability Methodology: Attention Synopsis and weight-space screening provide scalable tools for understanding large-scale DiT.
- Biological Analogy: The gradient mechanism of the spatial relation head resembles molecular gradients in embryonic development.
- Unified Perspective: Reconciles the views that "cross-attention is the bottleneck" and "text encoder is the bottleneck," showing both hold true under different configurations.
- Design Inspiration: The trade-off between Modular (RTE) vs. Fused (T5)—modular is more robust and interpretable, while fused is more compact but fragile.
- Practical Implication: Improving spatial relation generation may require prioritizing improvements in embedding models rather than the DiT itself.
Limitations & Future Work¶
- Experiments were conducted on a minimalist dataset (3 shapes \(\times\) 2 colors \(\times\) 8 relations); real-world complexity is much higher.
- Only spatial relations between two objects were studied; circuit mechanisms for 3+ object compositions remain unexplored.
- The comparison between RTE and T5 might be influenced by training data volume and convergence.
- The paper does not explore using the discovered circuit mechanisms to improve generation (e.g., via attention interventions).
- Analysis of pre-trained models (PixArt-Sigma) is relatively shallow, as their baseline spatial relation performance is weak.
Related Work & Insights¶
- Follows the tradition of Transformer circuit analysis (Elhage et al., 2021) but applies it to diffusion models for the first time.
- Methods like Attend-and-Excite improve compositionality by manipulating cross-attention; this work provides a mechanistic explanation for such approaches.
- The impact of text encoder choice on model behavior is underestimated—CLIP, T5, and RTE lead to fundamentally different internal computations.
- Provides guidance for designing more robust T2I architectures: modular circuits may be superior to fused circuits.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First to reveal specific circuit mechanisms for spatial relations in DiT.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Rigorous causal manipulation and ablation, though limited by simplified setup.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear narrative, excellent visualizations, and tight logic.
- Value: ⭐⭐⭐⭐ — Provides a solid foundation for understanding and improving compositional generation in T2I.