
Circuit Mechanisms for Spatial Relation Generation in Diffusion Transformers

Conference: CVPR 2026
arXiv: 2601.06338
Code: No explicit repository link
Area: Image Generation
Keywords: Diffusion Transformer, Interpretability, Spatial Relations, Cross-Attention Circuits, Text Encoder

TL;DR

Through mechanistic interpretability methods, this work reveals two distinct circuit mechanisms for spatial relation generation in Diffusion Transformers: Randomized Text Encoders (RTE) use a two-stage modular circuit with "relation heads + object heads," while T5 encoders integrate relation information into object tokens for single-token decoding, making the latter more fragile under out-of-distribution perturbations.

Background & Motivation

Background: Text-to-image (T2I) models have made significant progress in generating single-object attributes, but progress on generating multi-object spatial relations (above, below, left, right, etc.) has lagged behind.

Limitations of Prior Work: (a) Some argue the failure lies in insufficiently focused cross-attention and propose attention guidance; (b) others suggest the issue is that text encoders do not sufficiently preserve spatial information. These two views lack a unified mechanistic explanation.

Key Challenge: It remains unclear how models internally encode and utilize non-commutative relations (e.g., "A is above B" \(\neq\) "B is above A").

Goal: To understand how DiT internally implements correct spatial relation generation and why it fails.

Key Insight: Training DiT from scratch on a minimal controllable dataset and reverse-engineering internal circuits using mechanistic interpretability tools (attention synopses, QK circuit analysis, causal interventions).

Core Idea: The choice of text encoder determines the relation generation circuit in DiT—this affects not only performance but also robustness.

Method

Overall Architecture

A minimal relation dataset (2 objects \(\times\) 8 spatial relations \(\times\) shape/color combinations) was constructed to train PixArt-style DiT models in various configurations (different sizes \(\times\) different text encoders: RTE/T5/CLIP). Internal circuits were then analyzed via attention synopses and causal interventions.
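
As a concrete illustration, here is a minimal sketch of how such a prompt grid could be enumerated. The shape and color vocabularies and the template wording are assumptions for illustration, not the paper's actual dataset specification.

```python
from itertools import product

# Hypothetical vocabularies; the paper's exact shapes, colors, and template
# wording are not specified here, so these are illustrative assumptions.
SHAPES = ["circle", "square", "triangle"]
COLORS = ["red", "green", "blue"]
RELATIONS = ["above", "below", "left of", "right of",
             "upper left of", "upper right of", "lower left of", "lower right of"]

def build_prompt_grid():
    """Enumerate every two-object prompt: shape/color pairs x 8 spatial relations."""
    prompts = []
    for (c1, s1), rel, (c2, s2) in product(product(COLORS, SHAPES),
                                           RELATIONS,
                                           product(COLORS, SHAPES)):
        if (c1, s1) == (c2, s2):
            continue  # keep the two objects distinguishable
        prompts.append(f"a {c1} {s1} is {rel} a {c2} {s2}")
    return prompts

print(len(build_prompt_grid()), "prompts")  # 8 * (9 * 8) = 576 with these toy vocabularies
```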

Key Designs

  1. Attention Synopsis: To handle massive attention maps (layers \(\times\) heads \(\times\) timesteps \(\times\) conditional/unconditional \(\times\) tokens), a scalable analysis paradigm was developed:

    • Group tokens by category (image tokens by object segmentation, text tokens by semantic attributes).
    • Aggregate attention energy between categories of token pairs.
    • Average over timesteps to obtain a [layer \(\times\) head] synopsis map (a toy aggregation sketch appears after this list).
    • Design Motivation: Checking millions of attention maps individually is infeasible; category aggregation drastically reduces dimensionality while preserving semantic information.
  2. Two-Stage Circuit Discovery in RTE-DiT:

    • Spatial Relation Head (L2H8): Image tokens read relation text tokens via the QK circuit. \(Q\) comes from sinusoidal positional encodings, \(K\) from relation word embeddings, and their inner product generates gradient maps consistent with spatial relations (e.g., "above" corresponds to a vertical gradient). This head activates at sampling step 0, writing "position labels" to mark regions where objects should be placed. (A toy gradient-map sketch of this QK mechanism appears after this list.)
    • Object Generation Head (L4H3): Reads the correspondence between image tokens carrying "position labels" and object shape text tokens. After injecting positional encoding into the VO output of L2H8, L4H3 shows correct selective attention. It activates during steps 4-8.
    • Design Motivation: The two-stage circuit resembles molecular gradient guidance in embryonic development: L2H8 establishes a "positional field," and L4H3 reads the field to generate objects.
  3. Different Circuit in T5-DiT:

    • T5 self-attention integrates full-sentence information into each token; DiT decodes spatial relations and object information from the shape2 token.
    • Variance decomposition confirms the shape2 token encodes relations (\(\sim\)12% partial \(R^2\)), shape (\(\sim\)37.5%), and color (\(\sim\)4.7%).
    • The proportion of relation information is amplified to \(\sim\)21% after the DiT MLP projection.
    • Causal Intervention: Performing vector arithmetic on the shape2 embedding (subtracting the original relation vector and adding a new one) changes the generated position (a toy sketch appears after this list).
    • Design Motivation: T5's contextual representation scatters relation information across multiple tokens; DiT adopts a more compact but fragile "decode everything from one token" strategy.
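
To make the attention synopsis (design 1) concrete, here is a toy aggregation sketch. It assumes the cross-attention weights are available as a single array of shape [timesteps, layers, heads, image tokens, text tokens] and that boolean category masks have already been computed; all names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def attention_synopsis(attn, img_groups, txt_groups):
    """
    attn:        [T, L, H, N_img, N_txt] cross-attention weights over timesteps.
    img_groups:  dict name -> boolean mask over image tokens (e.g. from segmentation).
    txt_groups:  dict name -> boolean mask over text tokens (e.g. relation vs. shape words).
    Returns a dict (img_cat, txt_cat) -> [L, H] synopsis of mean attention energy,
    averaged over timesteps and over the tokens inside each category.
    """
    synopsis = {}
    for gi, mi in img_groups.items():
        for gt, mt in txt_groups.items():
            block = attn[:, :, :, mi][..., mt]                # [T, L, H, n_i, n_t]
            synopsis[(gi, gt)] = block.mean(axis=(0, 3, 4))   # -> [L, H]
    return synopsis

# Toy usage with random weights.
T, L, H, N_img, N_txt = 4, 6, 8, 64, 12
attn = np.random.rand(T, L, H, N_img, N_txt)
img_groups = {"object1": np.arange(N_img) < 32, "object2": np.arange(N_img) >= 32}
txt_groups = {"relation": np.arange(N_txt) < 2, "shape": np.arange(N_txt) >= 2}
maps = attention_synopsis(attn, img_groups, txt_groups)
print(maps[("object1", "relation")].shape)  # (6, 8), one value per (layer, head)
```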
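
The gradient-map mechanism of the relation head (design 2) can be illustrated with a toy QK computation: sinusoidal positional encodings act as queries and a relation-word key is dotted against them over the image grid. The encoding layout and the key vector (random here) are placeholders, so this demonstrates the mechanism rather than reproducing L2H8's learned weights.

```python
import numpy as np

def sinusoidal_pos_enc(h, w, dim):
    """2D sinusoidal positional encodings: half the channels for y, half for x."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    freqs = 1.0 / (10000 ** (np.arange(dim // 4) / (dim // 4)))
    def enc(coord):
        ang = coord[..., None] * freqs                        # [h, w, dim//4]
        return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return np.concatenate([enc(ys), enc(xs)], axis=-1)        # [h, w, dim]

# Queries come from positions; the key stands in for W_K applied to a relation word.
h, w, dim = 16, 16, 64
Q = sinusoidal_pos_enc(h, w, dim)                             # [h, w, dim]
k_relation = np.random.randn(dim)                             # stand-in for the learned "above" key
gradient_map = Q @ k_relation                                 # [h, w] spatial score map
print(gradient_map.shape)  # with the real key, this would show e.g. a vertical gradient for "above"
```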
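
The causal vector-arithmetic intervention on the shape2 embedding (design 3) amounts to a single line of algebra; the sketch below mirrors it with placeholder vectors and an assumed scaling factor.

```python
import numpy as np

def swap_relation(shape2_emb, rel_old_dir, rel_new_dir, scale=1.0):
    """
    Edit the shape2 token embedding instead of re-encoding the prompt:
    remove the old relation direction and add the new one, i.e.
    V_shape2* = V_shape2 - V_rel_old + scale * V_rel_new.
    """
    return shape2_emb - rel_old_dir + scale * rel_new_dir

# Toy usage with random vectors standing in for the learned relation directions.
d = 512
shape2 = np.random.randn(d)
v_lower_left, v_lower_right = np.random.randn(d), np.random.randn(d)
shape2_edited = swap_relation(shape2, v_lower_left, v_lower_right, scale=3.0)
# Feeding shape2_edited back into DiT (not shown) is what moves the generated object.
```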

Evaluation

Generated images are scored along four axes: color (color presence), shape (shape presence), unique_binding (correct shape-color binding), and spatial_relation (correct spatial relation). Scoring uses OpenCV (cv2) segmentation and shape-classification tools.
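
A toy sketch of how such cv2-based scoring might work for the spatial_relation metric: segment the image by color, classify the largest contour's shape, and compare centroids. The HSV thresholds, shape heuristic, and relation set below are simplified assumptions, not the paper's evaluation code.

```python
import cv2
import numpy as np

# Rough HSV ranges; real thresholds would need tuning per dataset.
COLOR_RANGES = {"red": ((0, 120, 70), (10, 255, 255)),
                "blue": ((100, 120, 70), (130, 255, 255))}

def find_object(img_bgr, color):
    """Return (shape_name, centroid) for the largest blob of the given color, or None."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    lo, hi = COLOR_RANGES[color]
    mask = cv2.inRange(hsv, np.array(lo), np.array(hi))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    c = max(contours, key=cv2.contourArea)
    approx = cv2.approxPolyDP(c, 0.04 * cv2.arcLength(c, True), True)
    shape = {3: "triangle", 4: "square"}.get(len(approx), "circle")
    m = cv2.moments(c)
    return shape, (m["m10"] / m["m00"], m["m01"] / m["m00"])

def relation_correct(img_bgr, color1, color2, relation):
    """spatial_relation check: compare centroids, e.g. 'above' means a smaller y coordinate."""
    o1, o2 = find_object(img_bgr, color1), find_object(img_bgr, color2)
    if o1 is None or o2 is None:
        return False
    (_, (x1, y1)), (_, (x2, y2)) = o1, o2
    return {"above": y1 < y2, "below": y1 > y2,
            "left": x1 < x2, "right": x1 > x2}[relation]
```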

Key Experimental Results

Main Results — Model Comparison

| Model | Color Acc | Shape Acc | Binding Acc | Relation Acc |
|---|---|---|---|---|
| RTE-DiT-B | High | High | High | High (~67%) |
| T5-DiT-B | High | High | High | High |
| RTE (no positional encoding) | High | High | Low | Low |
| DiT-nano (any encoder) | High | Med | Low | Very low (5%) |

Ablation Study — Causal Intervention

| Intervention | Effect | Description |
|---|---|---|
| Ablate L2H8 relation attention | Relation Acc 67% → 33% | Confirms the causal role of the relation head |
| Ablate L4H3 object attention | Shape Acc 90% → 76% | Confirms the causal role of the object head |
| Ablate other heads | Minimal impact | Circuits are highly concentrated |
| T5-DiT: ablate relation words | Minimal impact | Relation info is already merged into object tokens |
| T5-DiT: ablate shape2 | All metrics drop 50% | shape2 is the core information source |
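
Head ablation of the kind reported above can be implemented with a forward hook that zeroes one head's contribution. The sketch below is a generic PyTorch pattern; the module path, tensor layout, and head count are assumptions about a PixArt-style DiT, not the paper's actual code.

```python
import torch

def make_head_ablation_hook(head_idx, num_heads):
    """Zero out one attention head's contribution in a module whose output is
    [batch, tokens, num_heads * head_dim] (layout assumed, not verified against PixArt)."""
    def hook(module, inputs, output):
        b, n, d = output.shape
        out = output.reshape(b, n, num_heads, d // num_heads).clone()
        out[:, :, head_idx] = 0.0          # ablate the target head, e.g. the relation head
        return out.reshape(b, n, d)        # returning a value replaces the module output
    return hook

# Hypothetical usage: `dit.blocks[2].cross_attn` is an assumed module path.
# handle = dit.blocks[2].cross_attn.register_forward_hook(make_head_ablation_hook(8, num_heads=12))
# ... run sampling, measure relation accuracy ...
# handle.remove()
```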

Robustness Comparison

| Perturbation Type | RTE-DiT Relation Acc | T5-DiT Relation Acc |
|---|---|---|
| Original prompt | High | High |
| Insert "the" fillers | Stable | Large drop |
| Synonymous color replacement | Stable | Drop |
| Sentence reversal | Stable | Drop |
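
The three perturbations can be reproduced with simple string edits on the prompt; the filler placement, synonym table, and reversal scheme below are assumptions about how they might be constructed, not the paper's exact procedure.

```python
import random

COLOR_SYNONYMS = {"red": "crimson", "blue": "azure", "green": "emerald"}  # assumed mapping

def insert_fillers(prompt, filler="the", n=2, seed=0):
    """Insert n copies of a filler word at random positions in the prompt."""
    rng = random.Random(seed)
    words = prompt.split()
    for _ in range(n):
        words.insert(rng.randrange(len(words) + 1), filler)
    return " ".join(words)

def replace_color_synonyms(prompt):
    """Swap color words for less common synonyms the model never saw during training."""
    for color, synonym in COLOR_SYNONYMS.items():
        prompt = prompt.replace(color, synonym)
    return prompt

def reverse_sentence(prompt):
    """Reverse the word order of the prompt."""
    return " ".join(reversed(prompt.split()))

print(insert_fillers("a red circle is above a blue square"))
```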

Key Findings

  1. Modular vs. Compact: RTE-DiT's two-stage circuit is modular (separate processing of relations and objects), whereas T5-DiT's single-token decoding is more compact but less robust.
  2. Text Encoder is the Bottleneck: Although T5's contextual mixing helps faster convergence, it makes DiT's relation representation fragile—filler word perturbations can destroy relation information in shape2.
  3. Positional Encoding is Mandatory: RTE without positional encoding fails to distinguish "A above B" from "B above A" due to text permutation invariance.
  4. Minimum Model Threshold: DiT-nano fails to form spatial relation heads (Relation Acc 5%), indicating a minimum required capacity.
  5. PixArt-Sigma Validation: Similar (though weaker) spatial circuits were found in pre-trained large models, validating the transferability of the analysis tools.

Highlights & Insights

  • The embryonic development analogy is insightful: The spatial gradients produced by relation heads are analogous to molecular gradients guiding cell differentiation—position labels act like morphogens.
  • Unified two perspectives: Both cross-attention and text encoders participate in spatial relation generation, but the mechanism depends on the choice of encoder.
  • Vector arithmetic experiments provide strong causal evidence: editing the shape2 embedding as \(V_{shape2}^{*} = V_{shape2} - V_{lower\_left} + 3V_{lower\_right}\) precisely controls object position.
  • The Attention Synopsis method is a general-purpose tool applicable to analyzing other large DiT models.

Limitations & Future Work

  • The minimal dataset includes only simple geometric shapes; relation generation circuits for real-world objects may be more complex.
  • Analysis of pre-trained PixArt-Sigma is preliminary (effective for 8/30 object pairs) and requires systematic validation.
  • Only 8 spatial relations were analyzed, excluding size relations, containment, etc.
  • No specific improvement schemes (e.g., designing more robust text encoders) based on the findings were proposed.

Significance & Context

  • Inherits QK circuit analysis from the Transformer Circuits Thread but applies it to diffusion models for the first time.
  • Provides crucial guidance for T2I model design: if relation generation is critical, encoders that maintain text token independence should be favored.
  • Provides a circuit-level explanation for why CLIP and T5 differ in performance as text encoders.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First to reveal specific circuit mechanisms for DiT spatial relation generation.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Attention analysis + weight space screening + causal intervention + robustness testing + pre-trained model validation.
  • Writing Quality: ⭐⭐⭐⭐⭐ Fluid narrative, excellent analogies, and clear illustrations.
  • Value: ⭐⭐⭐⭐⭐ Significant implications for understanding and improving T2I spatial relation generation.