
A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers

Conference: ICLR 2026 arXiv: 2602.21596 Code: Available Area: Diffusion Models Keywords: diffusion transformer, conditioning, embedding sparsity, cosine similarity, AdaLN

TL;DR

This paper presents the first systematic analysis of conditional embeddings in diffusion Transformers, revealing extreme angular similarity (inter-class cosine similarity >99%) and dimensional sparsity (only 1–2% of dimensions carry semantic information). Pruning roughly 40% of low-magnitude dimensions leaves generation quality virtually unchanged, and even pruning two thirds degrades it only mildly, exposing a hidden semantic bottleneck in conditional embeddings.

Background & Motivation

Background: Transformer-based diffusion models such as DiT, SiT, and REPA inject conditioning signals (embeddings of class labels and timesteps) via AdaLN, achieving state-of-the-art generation performance.

Limitations of Prior Work: Despite being a core component of diffusion Transformers, the internal structure and semantic encoding of conditional embeddings remain almost entirely unexplored—no prior work has analyzed what these embeddings actually look like.

Key Challenge: The conditional embeddings of 1,000 classes in a 1,152-dimensional space are highly similar (cosine similarity >99%), yet models still correctly distinguish classes to generate high-quality images. This contradiction—"nearly identical vectors producing entirely different images"—demands explanation.

Goal: To systematically understand how diffusion Transformers encode and utilize conditioning signals.

Key Insight: Directly analyzing learned conditioning vectors by measuring cosine similarity, dimensional participation rates, and variance distributions, then validating which dimensions matter through pruning experiments.

Core Idea: Diffusion Transformers compress semantic information into a small number of "head" dimensions of the conditional embedding, while the remaining majority of "tail" dimensions are redundant.
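
The "nearly identical vectors, entirely different images" phenomenon is easy to reproduce with synthetic data: vectors sharing one large common component but differing in a few head dimensions are almost perfectly aligned in angle. A minimal NumPy sketch (the shared-direction construction is illustrative, not the paper's actual embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes = 1152, 1000

# A large shared component plus small class-specific "head" dims mimics
# the reported structure (assumed construction, for illustration only).
shared = rng.normal(size=d) * 10.0
emb = np.tile(shared, (n_classes, 1))
head_dims = rng.choice(d, size=20, replace=False)
emb[:, head_dims] += rng.normal(size=(n_classes, 20))

# Pairwise cosine similarity across all class embeddings
normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
cos = normed @ normed.T
off_diag = cos[~np.eye(n_classes, dtype=bool)]
print(f"mean inter-class cosine similarity: {off_diag.mean():.4f}")
```

Despite every pair of classes differing in 20 dimensions, the shared component dominates the norm, so inter-class cosine similarity stays above 99%.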

Method

Overall Architecture

This is a purely analytical paper with no new method proposed. Three findings are reported: (1) extreme angular similarity; (2) sparse representation; (3) prunable redundancy. The analysis covers 6 ImageNet SOTA models and 2 continuous conditioning tasks (pose-guided portrait generation and video-to-audio generation).

Key Designs

  1. Extreme Angular Similarity Analysis:

     • Finding: Pairwise cosine similarity among the conditional embeddings of 1,000 ImageNet classes exceeds 99% for recent SOTA models (reaching 99.46% for REPA). Continuous conditioning tasks (pose/video) exhibit even higher similarity, exceeding 99.9%.
     • Explanatory Hypothesis: Diffusion training optimizes embeddings across all timesteps, biasing the model toward globally aligned embeddings that provide stable denoising signals. Semantic differences are encoded in a small number of high-magnitude head dimensions.

  2. Participation Ratio (PR) Analysis:

     • Metric: normalized PR, \(\alpha_{norm} = \text{PR}(|c|) / d\), measuring the proportion of dimensions effectively utilized.
     • Finding: The nPR of SOTA models (MDT/REPA/MG) is only 1.5–2.3%, meaning only ~18–26 of 1,152 dimensions are effectively used. DiT is an exception (10.5%) due to its older architecture.
     • Continuous conditioning tasks exhibit higher nPR (13–48%), as they must encode finer-grained continuous conditioning information.

  3. Pruning Experiments—Roles of Head vs. Tail Dimensions:

     • Tail pruning (\(\tau=0.01\)): Removing 38.9% of low-magnitude dimensions changes FID from 7.17 to 7.16 (virtually unchanged).
     • Tail pruning (\(\tau=0.02\)): Removing 66.2% of dimensions raises FID from 7.17 to 9.22 (still acceptable).
     • Head pruning: Removing only 2 of 1,152 (0.2%) highest-magnitude dimensions raises FID slightly to 7.85; removing 8 of 1,152 (0.69%) causes FID to spike to 523.8.
     • Conclusion: Semantic information is concentrated almost entirely in a small number of head dimensions.
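
The normalized participation ratio can be computed directly from an embedding vector. A sketch assuming the standard definition \(\text{PR}(x) = (\sum_i x_i)^2 / \sum_i x_i^2\) applied to the magnitude vector \(|c|\):

```python
import numpy as np

def normalized_pr(c: np.ndarray) -> float:
    """alpha_norm = PR(|c|) / d, with PR(x) = (sum x)^2 / sum(x^2).

    Equals 1.0 when all dimensions have equal magnitude and
    approaches 1/d when a single dimension dominates.
    """
    mag = np.abs(c)
    pr = mag.sum() ** 2 / (mag ** 2).sum()
    return pr / c.shape[0]

# Dense vector: every dimension participates equally
dense = np.ones(1152)

# Sparse vector: 20 "head" dimensions dominate, tiny tail elsewhere
sparse = np.zeros(1152)
sparse[:20] = 5.0
sparse += 1e-3

print(normalized_pr(dense))   # -> 1.0
print(normalized_pr(sparse))  # close to 20/1152, i.e. ~1.7%
```

With ~20 dominant dimensions out of 1,152, the metric lands near the 1.5–2.3% range reported for the SOTA models.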

Loss & Training

N/A (analytical paper; no model training performed).

Key Experimental Results

Main Results (Conditional Embedding Statistics)

Model           Dimension   nPR      Cosine Similarity
DiT-XL          1152        10.47%   90.01%
MDT-XL          1152         1.60%   99.05%
SiT-XL          1152         2.28%   98.52%
REPA-XL         1152         1.53%   99.46%
LightningDiT    1152         2.05%   97.79%
X-MDPT (Pose)   1024        48.42%   99.98%
MDSGen (Audio)   768        13.57%   99.99%

Pruning Ablation (REPA-XL)

Pruning              Dims Removed     FID↓    IS↑     CLIP↑
None                 0%               7.17    176.0   29.75
Tail \(\tau=0.01\)   38.9%            7.16    176.0   29.81
Tail \(\tau=0.02\)   66.2%            9.22    125.2   29.22
Head \(\tau=5.0\)    0.2% (2 dims)    7.85    164.2   29.56
Head \(\tau=1.0\)    0.69% (8 dims)   523.8   1.95    22.69
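
The tail/head pruning in this ablation is a simple inference-time operation: zero out embedding dimensions whose magnitude falls below (tail) or above (head) a threshold \(\tau\). A minimal sketch on a toy embedding (the vector and thresholds are illustrative, not the actual REPA-XL embedding):

```python
import numpy as np

def prune_tail(c: np.ndarray, tau: float):
    """Zero dims with |c_i| < tau; return pruned copy and fraction removed."""
    keep = np.abs(c) >= tau
    return c * keep, 1.0 - keep.mean()

def prune_head(c: np.ndarray, tau: float):
    """Zero dims with |c_i| > tau (the high-magnitude 'head')."""
    keep = np.abs(c) <= tau
    return c * keep, 1.0 - keep.mean()

rng = np.random.default_rng(0)
c = rng.normal(scale=0.005, size=1152)             # low-magnitude tail noise
head_idx = rng.choice(1152, size=20, replace=False)
c[head_idx] += rng.uniform(2.0, 6.0, size=20)      # a few high-magnitude head dims

tail_pruned, tail_frac = prune_tail(c, tau=0.01)
head_pruned, head_frac = prune_head(c, tau=1.0)
print(f"tail pruning removed {tail_frac:.1%}, head pruning removed {head_frac:.1%}")
```

Tail pruning removes the large majority of dimensions while leaving the head dimensions untouched, which is why quality survives it; head pruning removes under 2% of dimensions but deletes exactly the semantically loaded ones.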

Key Findings

  • Pruning 39% of tail dimensions yields a marginal FID improvement (7.17→7.16) and a slight CLIP increase, suggesting these dimensions are not merely useless but may introduce noise.
  • Variance analysis shows that only 15–20 head dimensions carry meaningful inter-class variance; the remaining 98% of dimensions exhibit near-zero variance.
  • Training dynamics reveal that cosine similarity and sparsity increase progressively during training, indicating these are natural outcomes of the training process rather than artifacts of random initialization.
  • Applying pruning at later denoising steps is more effective than at earlier steps, as conditioning signals are more fine-grained in the late denoising phase.
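
The inter-class variance analysis amounts to computing, for each embedding dimension, its variance across class embeddings and checking how concentrated the total variance is. A sketch with a synthetic shared-plus-head construction (assumed, for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_classes, n_head = 1152, 1000, 20

# Synthetic class embeddings: one shared component, plus class-specific
# offsets in a few "head" dimensions (illustrative construction).
emb = np.tile(rng.normal(size=d) * 10.0, (n_classes, 1))
head = rng.choice(d, size=n_head, replace=False)
emb[:, head] += rng.normal(size=(n_classes, n_head))

var = emb.var(axis=0)                   # inter-class variance per dimension
top = np.argsort(var)[::-1][:n_head]    # dimensions ranked by variance
print(f"top-{n_head} dims carry {var[top].sum() / var.sum():.1%} of total variance")
```

In this construction the top-variance dimensions are exactly the head dimensions, matching the paper's observation that the vast majority of dimensions exhibit near-zero inter-class variance.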

Highlights & Insights

  • Counterintuitive Finding: Conditional embeddings for 1,000 entirely distinct classes exhibit >99% cosine similarity, yet the model generates completely different images from <2% dimensional differences—fundamentally challenging conventional understanding of conditional encoding.
  • Implications for Efficient Conditioning Design: Given that only ~20 effective dimensions are needed, future conditioning injection mechanisms could be substantially simplified, reducing dimensionality, parameter count, and inference cost.
  • Distinction from Contrastive Learning Collapse: Although the phenomenon superficially resembles representation collapse (highly similar embeddings), it is not harmful here—the iterative refinement of the diffusion process can amplify minute differences.

Limitations & Future Work

  • Only the AdaLN injection mechanism is analyzed; the embedding structure of cross-attention conditioning (e.g., text-guided generation) may differ.
  • Explanations for the observed phenomena remain at the hypothesis level, lacking rigorous theoretical justification.
  • The analysis is confined to pretrained models; whether the discovered sparsity can be exploited to redesign and train more efficient conditioning mechanisms remains unexplored.
  • Pruning experiments are conducted at inference time; whether sparsity can accelerate training has not been investigated.
Comparisons with Related Work

  • vs. AlignTok: AlignTok focuses on how encoders affect the semantic quality of latent spaces, while this paper focuses on how conditional embeddings encode semantics—the two are complementary.
  • vs. Contrastive Learning Collapse: Although the phenomenon is similar (extremely similar embeddings), in diffusion models this constitutes an effective compression rather than a harmful collapse.
  • vs. Information Bottleneck Theory: The findings are consistent with IB theory—the model learns to compress conditioning information into the minimum necessary subspace.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First systematic revelation of the hidden structure of conditional embeddings in diffusion Transformers.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 6+ models, 3 task types, detailed pruning ablations, and training dynamics tracking.
  • Writing Quality: ⭐⭐⭐⭐ Findings are clearly presented, though theoretical explanations are relatively weak.
  • Value: ⭐⭐⭐⭐⭐ Fundamentally advances understanding of conditional generative models.