A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers

Conference: ICLR 2026
arXiv: 2602.21596
Code: Available
Area: Diffusion Models
Keywords: diffusion transformer, conditioning, embedding sparsity, cosine similarity, AdaLN

TL;DR

This paper presents the first systematic analysis of conditional embeddings in diffusion Transformers. It reports extreme angular similarity (inter-class cosine similarity above 99% in the strongest ImageNet models) and dimensional sparsity (only about 1–2% of dimensions carry semantic information). Pruning 38.9% of the low-magnitude dimensions leaves FID unchanged (7.17 to 7.16), and pruning roughly two-thirds (66.2%) degrades it only moderately (to 9.22), exposing a hidden semantic bottleneck in conditional embeddings.

Background & Motivation

Background: Transformer-based diffusion models such as DiT, SiT, and REPA inject conditioning signals (embeddings of class labels and timesteps) via AdaLN, achieving state-of-the-art generation performance.
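
To make the AdaLN injection concrete, the following is a minimal NumPy sketch, not the code of any of the models named above. It assumes the common DiT-style formulation in which the conditioning vector is passed through SiLU and a linear layer to produce a per-channel shift and scale applied after a parameter-free layer norm. The names `adaln_modulate`, `weight`, and `bias` are hypothetical, and real blocks typically also produce gating terms that are omitted here.

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-6):
    # Normalize over the feature axis, without learned affine parameters.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln_modulate(tokens, cond, weight, bias):
    """Apply an AdaLN-style shift and scale (illustrative sketch).

    tokens: (n_tokens, d) hidden states
    cond:   (d_cond,) conditioning embedding (e.g. class + timestep)
    weight: (d_cond, 2 * d) and bias: (2 * d,) of a hypothetical linear layer
    """
    # Map the conditioning vector to per-channel shift and scale.
    shift, scale = np.split(silu(cond) @ weight + bias, 2)
    return layer_norm(tokens) * (1.0 + scale) + shift
```

Because the conditioning vector reaches the network only through this learned projection, a few large-magnitude coordinates of `cond` can in principle dominate the resulting shift and scale, which is the setting the analysis below examines.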

Limitations of Prior Work: Despite being a core component of diffusion Transformers, the internal structure and semantic encoding of conditional embeddings remain almost entirely unexplored—no prior work has analyzed what these embeddings actually look like.

Key Challenge: In the strongest models, the conditional embeddings of the 1,000 ImageNet classes, which live in a 1,152-dimensional space, are nearly parallel (cosine similarity above 99% for MDT-XL and REPA-XL), yet the models still distinguish classes correctly and generate high-quality images. This apparent contradiction, "nearly identical vectors producing entirely different images," demands explanation.

Goal: To systematically understand how diffusion Transformers encode and utilize conditioning signals.

Key Insight: Analyze the learned conditioning vectors directly by measuring cosine similarity, participation ratios, and variance distributions, then use pruning experiments to validate which dimensions matter.

Core Idea: Diffusion Transformers compress semantic information into a small number of "head" dimensions of the conditional embedding, while the remaining majority of "tail" dimensions are redundant.

Method

Overall Architecture

This is a purely analytical paper with no new method proposed. Three findings are reported: (1) extreme angular similarity; (2) sparse representation; (3) prunable redundancy. The analysis covers 6 ImageNet SOTA models and 2 continuous conditioning tasks (pose-guided portrait generation and video-to-audio generation).

Key Designs

Extreme Angular Similarity Analysis:
Finding: Pairwise cosine similarity among the conditional embeddings of the 1,000 ImageNet classes exceeds 99% for MDT-XL and REPA-XL (99.05% and 99.46%) and stays above 97% for SiT-XL and LightningDiT; DiT-XL is lower at 90.01%. Continuous conditioning tasks (pose and video) exhibit even higher similarity, exceeding 99.9%.
Explanatory Hypothesis: Diffusion training optimizes embeddings across all timesteps, biasing the model toward globally aligned embeddings that provide stable denoising signals. Semantic differences are encoded in a small number of high-magnitude head dimensions.
Participation Ratio (PR) Analysis:
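Metric: the normalized participation ratio (nPR), \(\alpha_{norm} = \text{PR}(|c|) / d\), where \(c\) is a conditional embedding and \(d\) is its dimensionality. It measures the proportion of dimensions effectively utilized.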
Finding: The nPR of SOTA models (MDT/REPA/MG) is only 1.5–2.3%, meaning only ~18–26 of 1,152 dimensions are effectively used. DiT is an exception (10.5%) due to its older architecture.
Continuous conditioning tasks exhibit higher nPR (13–48%), as they require encoding finer-grained continuous conditioning information.
Pruning Experiments—Roles of Head vs. Tail Dimensions:
Tail pruning (\(\tau=0.01\)): Removing 38.9% of low-magnitude dimensions changes FID from 7.17 to 7.16 (virtually unchanged).
Tail pruning (\(\tau=0.02\)): Removing 66.2% of dimensions raises FID from 7.17 to 9.22 and lowers IS from 176.0 to 125.2, a noticeable but moderate degradation.
Head pruning: Removing only the 2 highest-magnitude dimensions out of 1,152 (0.2%) raises FID to 7.85; removing 8 of 1,152 (0.69%) causes FID to spike to 523.8 and IS to collapse to 1.95.
Conclusion: Semantic information is concentrated almost entirely in a small number of head dimensions.
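
The two statistics used in this analysis can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the similarity is taken as the mean over all off-diagonal class pairs, and the participation ratio is assumed to follow the common form \(\text{PR}(|c|) = (\sum_i |c_i|)^2 / \sum_i c_i^2\); the paper's exact definition may differ. The function names are hypothetical.

```python
import numpy as np

def mean_pairwise_cosine(E):
    """Mean off-diagonal cosine similarity of the rows of E, shape (n, d)."""
    U = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    S = U @ U.T                                       # (n, n) cosine matrix
    n = E.shape[0]
    # Exclude the diagonal, which is 1 for every row.
    return (S.sum() - np.trace(S)) / (n * (n - 1))

def normalized_participation_ratio(c):
    """nPR = PR(|c|) / d, with the ASSUMED PR(|c|) = (sum|c_i|)^2 / sum(c_i^2)."""
    a = np.abs(c)
    pr = a.sum() ** 2 / (a ** 2).sum()
    return pr / a.size
```

Under this assumed definition, a vector with equal magnitude in every dimension has nPR = 1, and a vector with a single non-zero entry has nPR = 1/d, so small values indicate that a few coordinates dominate the embedding.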

Loss & Training

N/A (analytical paper; no new loss or training procedure is proposed).

Key Experimental Results

Main Results (Conditional Embedding Statistics)

Model	Dimension	nPR	Cosine Similarity
DiT-XL	1152	10.47%	90.01%
MDT-XL	1152	1.60%	99.05%
SiT-XL	1152	2.28%	98.52%
REPA-XL	1152	1.53%	99.46%
LightningDiT	1152	2.05%	97.79%
X-MDPT (Pose)	1024	48.42%	99.98%
MDSGen (Audio)	768	13.57%	99.99%
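
As a quick worked check, the nPR column can be converted into an approximate count of effectively used dimensions by multiplying by the embedding dimension. The snippet below simply redoes that arithmetic on the values in the table above; reading nPR × d as an "effective dimension count" is an interpretation, and `effective_dims` is a hypothetical helper name.

```python
# (dimension d, nPR as a fraction) copied from the statistics table.
stats = {
    "DiT-XL": (1152, 0.1047),
    "MDT-XL": (1152, 0.0160),
    "SiT-XL": (1152, 0.0228),
    "REPA-XL": (1152, 0.0153),
    "LightningDiT": (1152, 0.0205),
    "X-MDPT (Pose)": (1024, 0.4842),
    "MDSGen (Audio)": (768, 0.1357),
}

def effective_dims(d, npr):
    # nPR is PR / d, so PR (the effective dimension count) is nPR * d.
    return npr * d

for name, (d, npr) in stats.items():
    print(f"{name}: ~{effective_dims(d, npr):.1f} of {d} dimensions")
```

Under this reading, REPA-XL comes out at roughly 17.6 dimensions and SiT-XL at roughly 26.3, in line with the ~18–26 range quoted for the sparse ImageNet models, while DiT-XL sits near 121.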

Pruning Ablation (REPA-XL)

Pruning	Dims Removed	FID↓	IS↑	CLIP↑
None	0%	7.17	176.0	29.75
Tail \(\tau=0.01\)	38.9%	7.16	176.0	29.81
Tail \(\tau=0.02\)	66.2%	9.22	125.2	29.22
Head \(\tau=5.0\)	0.2% (2 dims)	7.85	164.2	29.56
Head \(\tau=1.0\)	0.69% (8 dims)	523.8	1.95	22.69
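
The head and tail pruning rows can be read as magnitude thresholding on the conditional embedding. The sketch below is one plausible implementation, not the authors' procedure: it assumes that pruned dimensions are set to zero, that tail pruning removes coordinates with \(|c_i| < \tau\), and that head pruning removes coordinates with \(|c_i| > \tau\). The function names are hypothetical.

```python
import numpy as np

def prune_tail(c, tau):
    """Zero out low-magnitude dimensions (|c_i| < tau).

    Returns the pruned vector and the fraction of dimensions removed.
    """
    keep = np.abs(c) >= tau
    return np.where(keep, c, 0.0), 1.0 - keep.mean()

def prune_head(c, tau):
    """Zero out high-magnitude dimensions (|c_i| > tau).

    Returns the pruned vector and the fraction of dimensions removed.
    """
    keep = np.abs(c) <= tau
    return np.where(keep, c, 0.0), 1.0 - keep.mean()

# Toy embedding with two "head" entries and several small "tail" entries.
c = np.array([6.0, 0.5, 0.005, -0.015, -1.5])
tail_pruned, tail_frac = prune_tail(c, tau=0.01)
head_pruned, head_frac = prune_head(c, tau=1.0)
```

In this toy vector, tail pruning at `tau=0.01` removes only the entry 0.005, while head pruning at `tau=1.0` removes 6.0 and -1.5. This mirrors the asymmetry in the table: a lower head threshold removes more of the large coordinates, which is where the ablation shows generation breaking down.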

Key Findings

Pruning 39% of tail dimensions yields a marginal FID improvement (7.17→7.16) and a slight CLIP increase, suggesting these dimensions are not merely useless but may introduce noise.
Variance analysis shows that only 15–20 head dimensions carry meaningful inter-class variance; the remaining 98% of dimensions exhibit near-zero variance.
Training dynamics reveal that cosine similarity and sparsity increase progressively during training, indicating these are natural outcomes of the training process rather than artifacts of random initialization.
Applying pruning at later denoising steps is more effective than at earlier steps, as conditioning signals are more fine-grained in the late denoising phase.
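
The variance finding can be checked on any class-embedding matrix with a short computation. The following is a minimal sketch, assuming the embeddings are stacked as rows of a matrix `E` of shape (n_classes, d) and that "inter-class variance" means the per-dimension variance across classes; `variance_concentration` is a hypothetical helper, not a function from the paper's code.

```python
import numpy as np

def variance_concentration(E, k):
    """Fraction of total inter-class variance held by the k highest-variance dims.

    E: (n_classes, d) matrix of class embeddings.
    """
    var = E.var(axis=0)              # per-dimension variance across classes
    top = np.sort(var)[::-1][:k]     # k largest variances
    return top.sum() / var.sum()

# Toy example: 4 classes, 6 dims, only the first two dims vary across classes.
E = np.full((4, 6), 3.0)
E[:, 0] = [1.0, -1.0, 2.0, -2.0]
E[:, 1] = [0.5, 0.5, -0.5, -0.5]
share = variance_concentration(E, k=2)
```

In the toy matrix the two varying dimensions account for all of the variance, so `share` equals 1. On real embeddings, a value close to 1 for k around 15–20 would match the concentration described above.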

Highlights & Insights

Counterintuitive Finding: Conditional embeddings for 1,000 distinct classes exhibit cosine similarity above 99% in the strongest models, yet the model generates completely different images from differences confined to fewer than 2% of the dimensions. This challenges the conventional understanding of conditional encoding.
Implications for Efficient Conditioning Design: Given that only ~20 effective dimensions are needed, future conditioning injection mechanisms could be substantially simplified, reducing dimensionality, parameter count, and inference cost.
Distinction from Contrastive Learning Collapse: Although the phenomenon superficially resembles representation collapse (highly similar embeddings), it is not harmful here—the iterative refinement of the diffusion process can amplify minute differences.

Limitations & Future Work

Only the AdaLN injection mechanism is analyzed; the embedding structure of cross-attention conditioning (e.g., text-guided generation) may differ.
Explanations for the observed phenomena remain at the hypothesis level, lacking rigorous theoretical justification.
The analysis is confined to pretrained models; whether the discovered sparsity can be exploited to redesign and train more efficient conditioning mechanisms remains unexplored.
Pruning experiments are conducted at inference time; whether sparsity can accelerate training has not been investigated.

Related Work

vs. AlignTok: AlignTok focuses on how encoders affect the semantic quality of latent spaces, while this paper focuses on how conditional embeddings encode semantics—the two are complementary.
vs. Contrastive Learning Collapse: Although the phenomenon is similar (extremely similar embeddings), in diffusion models this constitutes an effective compression rather than a harmful collapse.
vs. Information Bottleneck Theory: The findings are consistent with IB theory—the model learns to compress conditioning information into the minimum necessary subspace.

Rating

Novelty: ⭐⭐⭐⭐⭐ First systematic revelation of the hidden structure of conditional embeddings in diffusion Transformers.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ 6+ models, 3 task types, detailed pruning ablations, and training dynamics tracking.
Writing Quality: ⭐⭐⭐⭐ Findings are clearly presented, though theoretical explanations are relatively weak.
Value: ⭐⭐⭐⭐⭐ Fundamentally advances understanding of conditional generative models.