Scaling Down Text Encoders of Text-to-Image Diffusion Models¶
Conference: CVPR 2025
arXiv: 2503.19897
Code: https://github.com/LifuWang-66/DistillT5
Area: Diffusion Models
Keywords: Text Encoder Distillation, T5 Compression, FLUX, Knowledge Distillation, Model Efficiency
TL;DR¶
This paper distills the T5-XXL (11B) text encoder into T5-Base (220M) using a vision-based knowledge distillation method. While reducing the size by 50x, it incurs almost no loss in image quality and semantic understanding, revealing that text encoders in text-to-image tasks exhibit severe over-parameterization and a "downward scaling law."
Background & Motivation¶
Background: Text encoders for text-to-image diffusion models have rapidly evolved from early CLIP models to T5-XXL (11B parameters). State-of-the-art (SOTA) models such as Imagen, FLUX, and SD3 employ T5-XXL to enhance complex semantic understanding and text rendering capabilities.
Limitations of Prior Work: The 11B parameters of T5-XXL introduce massive GPU memory overhead. The FLUX pipeline itself exceeds 24GB, making it almost impossible to run with T5-XXL on consumer-grade GPUs. Although 8-bit quantization can alleviate this to some extent, the total parameter size remains enormous.
Key Challenge: T5 is trained on the C4 natural-language corpus, which contains a vast amount of non-visual text. Experiments show that images generated from non-visual prompts align poorly with those prompts, indicating that most of the representation capability in T5-XXL is redundant for text-to-image generation.
Goal: To answer "Is such a massive text encoder truly necessary for text-to-image generation?" and explore the downward scaling law of text encoders.
Key Insight: The embedding space of T5-XXL contains a substantial amount of non-visual redundant information. A smaller model can be trained to learn only the visual subspace useful for image generation.
Core Idea: Use the image synthesis capability of the diffusion model itself as distillation guidance (vision-based distillation), so that the smaller model learns to elicit the same denoising predictions from the diffusion model as the larger one.
Method¶
Overall Architecture¶
The pre-trained FLUX diffusion model is kept frozen, and only the small T5 encoder is trained. Given a prompt, text embeddings are generated via the teacher (T5-XXL) and student (T5-Base) respectively, and sent into FLUX to obtain two denoising predictions. The difference between them is minimized. A step-following strategy is adopted to perform distillation at each denoising step. An MLP is used to project student embeddings into the teacher embedding space.
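The contrast between matching embeddings and matching denoising predictions can be illustrated with a toy sketch. Everything here is a hypothetical stand-in: `frozen_denoiser` replaces FLUX with a function that reads only the first two embedding entries, `project` replaces the MLP with a single linear map, and all numbers are made up for illustration.

```python
# Toy illustration only: every function and number below is a hypothetical stand-in.

def frozen_denoiser(x_t, t, emb):
    """Stand-in for the frozen diffusion model. It reads only emb[0] and emb[1],
    mimicking a model that uses just a 'visual' part of the text embedding."""
    return [(1.0 - t) * xi + t * (emb[0] - emb[1]) for xi in x_t]

def project(student_emb, weight):
    """Linear stand-in for the MLP that maps student embeddings to the teacher width."""
    return [sum(w * s for w, s in zip(row, student_emb)) for row in weight]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

teacher_emb = [0.9, 0.4, -1.3, 2.0]      # 4-dim "teacher" embedding
student_emb = [0.5, 1.0]                 # 2-dim "student" embedding
weight = [[1.8, 0.0], [0.0, 0.4], [0.0, 0.0], [0.0, 0.0]]  # 2 -> 4 projection
projected = project(student_emb, weight)

x_t, t = [0.3, -0.7, 1.1], 0.5           # shared noisy latent and timestep

# Naive distillation compares embeddings directly.
naive_loss = mse(teacher_emb, projected)
# Vision-based distillation compares the denoiser outputs they induce.
vision_loss = mse(frozen_denoiser(x_t, t, teacher_emb),
                  frozen_denoiser(x_t, t, projected))
print(f"naive={naive_loss:.4f} vision={vision_loss:.4f}")
```

In this toy the projected student embedding differs from the teacher's in the entries the denoiser ignores, so the naive loss stays positive while the vision loss vanishes. That is the sense in which a student can reach the same visual outcome with a different embedding distribution.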
Key Designs¶
- Vision-based Knowledge Distillation:
    - Function: Transfer the text-to-image capability of T5-XXL to the small T5 model.
    - Mechanism: Instead of directly distilling T5 text embeddings (naive distillation), the embedding discrepancy is amplified in the pixel/latent space through the frozen diffusion model. The loss function is \(\mathcal{L}_{vision} = \mathbb{E}_p[\|\mu_\theta(\mathbf{x}_t, t, \omega_\phi(p)) - \mu_\theta(\mathbf{x}_t, t, \omega_{\hat{\phi}}(p))\|^2]\), where \(\mu_\theta\) is the denoising prediction of the diffusion model, \(p\) is a prompt, and \(\omega_\phi\), \(\omega_{\hat{\phi}}\) are the two text encoders (teacher and student).
    - Design Motivation: Naive distillation leads to mode collapse in the student's embedding space because the huge parameter gap prevents the small model from establishing a one-to-one mapping. Vision-based distillation introduces uncertainty through noise and finer-grained latent space features, allowing the student to achieve the same visual outcome with a different distribution.
- Step-Following Training:
    - Function: Ensure that the student model learns correct guidance at each denoising timestep.
    - Mechanism: Starting from pure noise \(\mathbf{x}_T \sim \mathcal{N}(0, I)\), at each timestep \(t\) the same latent is fed into the diffusion model twice, once with the teacher embedding and once with the student embedding. The loss between the two predictions is backpropagated to update the student encoder. The latent is then advanced to the next step using the teacher's prediction, and this repeats until \(\mathbf{x}_0\) is reached.
    - Design Motivation: The training data of SOTA diffusion models is private, making it impossible to obtain image-text pairs to directly construct \(\mathbf{x}_t\). However, prompts are easy to acquire; therefore, the entire sampling trajectory is simulated starting directly from noise.
- Three-Stage Dataset Construction:
    - Function: Cover the complete visual embedding space of T5-XXL in text-to-image generation.
    - Mechanism: The first stage uses LAION-Aesthetics-6.5+ (~100K prompts) to cover image quality and style; the second stage uses T2I-CompBench (4200 prompts) to cover semantic understanding (color, shape, texture, spatial relationships); the third stage uses a self-built CommonText (50K prompts) to cover text rendering capabilities.
    - Design Motivation: The advantages of T5-XXL are reflected across three dimensions, so the training data must be constructed in a targeted way to ensure that all three are inherited.
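The step-following procedure from the Key Designs list can be sketched as a toy loop. This is a minimal sketch under loud assumptions: the scalar `denoiser`, the one-weight student, the teacher scale of 2.0, the timestep convention, and the Euler-style latent update are all hypothetical and chosen only to make the loop order visible.

```python
import random

def denoiser(x, t, emb):
    """Hypothetical frozen denoiser on scalars; its behaviour is never updated."""
    return (1.0 - t) * x + t * emb

def train_step_following(prompts, num_steps=20, lr=0.05, seed=0):
    rng = random.Random(seed)
    w = 0.0                      # the student's only trainable weight (toy)
    teacher_scale = 2.0          # fixed toy "teacher encoder": emb = 2.0 * p
    for p in prompts:            # each p is a scalar prompt feature
        x = rng.gauss(0.0, 1.0)  # start every trajectory from pure noise x_T
        e_teacher = teacher_scale * p
        for i in range(num_steps):
            t = 1.0 - i / num_steps              # timesteps run from 1.0 toward 0
            pred_teacher = denoiser(x, t, e_teacher)
            pred_student = denoiser(x, t, w * p)  # same latent, student embedding
            # loss = (pred_student - pred_teacher)^2; gradient w.r.t. w by chain rule
            grad = 2.0 * (pred_student - pred_teacher) * t * p
            w -= lr * grad                        # only the student is updated
            # advance the latent with the teacher's prediction only
            x = x - (1.0 / num_steps) * pred_teacher
    return w

learned_w = train_step_following([0.5, -1.0, 1.5, 0.8] * 25)
print(f"learned student weight: {learned_w:.3f}")
```

Only prompts are needed as input, and the student is trained at every step of a trajectory that the teacher alone drives. In this one-weight toy the student can only succeed by copying the teacher's scale; the sketch shows the training order, not the freedom a real student has in its embedding space.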
Loss & Training¶
Training proceeds in three stages: the first stage trains on T2I-CompBench for 50K iterations, the second on CommonText for 70K iterations, and the third on a mix of all data for 200K iterations. Training runs on 8×A800 GPUs with a total batch size of 32, using the AdamW optimizer with a learning rate of 1e-4 and 20-step iterative denoising. The guidance scale is randomly sampled between 2 and 5.
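The schedule above can be written down as a small configuration sketch. The numbers come from the description; the dict layout, the helper name `sample_guidance_scale`, and the choice of a uniform distribution for the guidance scale are assumptions, since the text only says the scale is randomly sampled.

```python
import random

# Numbers from the training description; the structure itself is an assumption.
STAGES = [
    {"data": "T2I-CompBench", "iterations": 50_000},
    {"data": "CommonText", "iterations": 70_000},
    {"data": "all datasets mixed", "iterations": 200_000},
]
TOTAL_BATCH_SIZE = 32
DENOISING_STEPS = 20
LEARNING_RATE = 1e-4

total_iterations = sum(stage["iterations"] for stage in STAGES)

def sample_guidance_scale(rng):
    """Draw a guidance scale in [2, 5]. Uniform sampling is an assumption."""
    return rng.uniform(2.0, 5.0)

rng = random.Random(0)
print(total_iterations, round(sample_guidance_scale(rng), 2))
```

The three stages sum to 320K iterations, which matches the training cost quoted under Limitations.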
Key Experimental Results¶
Main Results¶
| Model | Parameters | FID↓ | CLIP-Score↑ | Semantic Understanding Avg↑ | Text Rendering Char↑ |
|---|---|---|---|---|---|
| Flux w/ T5-Small | 60M | 25.10 | 28.28 | - | 31.9 |
| Flux w/ T5-Base | 220M | 24.32 | 29.79 | 50.32 | 69.3 |
| Flux w/ T5-XL | 3B | 23.17 | 30.33 | 53.74 | 77.8 |
| Flux w/ T5-XXL | 11B | 22.36 | 31.30 | 55.56 | 76.7 |
| SD3 | - | 19.83 | 32.21 | - | 38.7 |
Ablation Study¶
| Training Data Combination | FID↓ | CLIP↑ | Semantics↑ | Text↑ |
|---|---|---|---|---|
| LAION only | 24.13 | 29.69 | 31.09 | 2.97 |
| CompBench only | 23.55 | 27.88 | 44.93 | 1.32 |
| CommonText only | 28.95 | 25.62 | 21.20 | 43.41 |
| Naive distill (All) | 26.47 | 22.52 | 13.78 | 0.35 |
| Vision distill (All) | 24.32 | 29.79 | 50.32 | 49.1 |
Key Findings¶
- Image quality and semantic understanding are insensitive to encoder size: T5-Base (50x smaller) is close to T5-XXL in image quality and semantic understanding.
- Text rendering is the dimension most affected by model size: T5-Small text rendering degrades severely, T5-Base is barely usable, and T5-XL is close to T5-XXL.
- Naive distillation fails completely: Mode collapse causes all metrics to be much lower than those of vision distillation.
- t-SNE visualizations confirm that the distribution learned by T5-Base is completely different from T5-XXL, yet it can still guide the diffusion model effectively. This indicates that exact replication of the embedding distribution is unnecessary for text-to-image generation.
- After adopting T5-Base, the FLUX pipeline can run on a 24GB GPU, achieving a 2.7x speedup by eliminating the need for CPU offloading.
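To put a number on "close to T5-XXL", the relative gaps can be computed from the main results table. This sketch only re-derives percentages from the reported values; the helper name `relative_change` is hypothetical, and text rendering is left out because it is the dimension the findings single out as size-sensitive.

```python
# Metric values copied from the main results table (FLUX with each encoder).
T5_BASE = {"FID": 24.32, "CLIP-Score": 29.79, "Semantic Avg": 50.32}
T5_XXL = {"FID": 22.36, "CLIP-Score": 31.30, "Semantic Avg": 55.56}

def relative_change(student, teacher):
    """Percent change of the student relative to the teacher, per metric."""
    return {k: 100.0 * (student[k] - teacher[k]) / teacher[k] for k in teacher}

gaps = relative_change(T5_BASE, T5_XXL)
for name, pct in gaps.items():
    print(f"{name}: {pct:+.1f}%")
```

FID is lower-is-better, so a positive change there is a regression, while negative changes in CLIP-Score and the semantic average are regressions. All three gaps stay under 10% for a 50x reduction in parameters.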
Highlights & Insights¶
- The discovery of "text encoder over-parameterization" has practical implications—most of the 11B parameters in T5-XXL are redundant for text-to-image generation because the non-visual knowledge in T5 (trained on C4) is unused in this task. This insight may push the field to rethink resource allocation for model components.
- Vision-based distillation is a clever idea: it amounts to "letting the diffusion model tell you what embedding is good" by leveraging the pre-trained model's synthesis capability. The paradigm may transfer to encoder distillation in other modalities.
- Verification of compatibility with ControlNet, LoRA, and distilled models demonstrates the utility of the distilled encoder as a drop-in replacement.
Limitations & Future Work¶
- Text rendering capability still suffers noticeable degradation on small models, requiring more text-related data or specialized training strategies.
- Only verified on FLUX; applicability to other architectures (such as UNet-based SD series) remains to be confirmed.
- The distillation process still requires training for roughly 320K iterations on 8×A800 GPUs, which is not cheap.
- Future work can explore more extreme compression (e.g., smaller than 60M parameters) or integrate quantization to further reduce inference costs.
Related Work & Insights¶
- vs. Scaling Study in Imagen: Imagen found that T5-XXL outperformed smaller T5 variants, but that conclusion was drawn in a setting where the diffusion model is trained from scratch under each encoder's guidance. This paper shows the converse: once the diffusion model is fixed, the encoder can be scaled down substantially.
- vs. 8-bit Quantization: Quantization only compresses precision without reducing the number of parameters; distillation directly cuts the parameter count by 50 times.
- vs. Progressive Distillation: Those works distill the sampling steps of the diffusion model itself; this paper distills the text encoder, and the two are orthogonal and can be combined.
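The quantization comparison above can be made concrete with back-of-the-envelope arithmetic on weight memory. This sketch uses the parameter counts quoted in this note (an encoder-only checkpoint may hold fewer parameters than the headline figure) and counts weights only, ignoring activations and the rest of the pipeline; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in GB (10^9 bytes); activations are ignored."""
    return num_params * bytes_per_param / 1e9

T5_XXL_PARAMS = 11e9     # parameter counts as quoted in this note
T5_BASE_PARAMS = 220e6

xxl_16bit = weight_memory_gb(T5_XXL_PARAMS, 2)    # 16-bit weights
xxl_8bit = weight_memory_gb(T5_XXL_PARAMS, 1)     # 8-bit quantized weights
base_16bit = weight_memory_gb(T5_BASE_PARAMS, 2)  # distilled student, 16-bit
param_ratio = T5_XXL_PARAMS / T5_BASE_PARAMS

print(f"T5-XXL 16-bit: {xxl_16bit:.2f} GB")
print(f"T5-XXL 8-bit:  {xxl_8bit:.2f} GB")
print(f"T5-Base 16-bit: {base_16bit:.2f} GB")
print(f"parameter ratio: {param_ratio:.0f}x")
```

Under these assumptions 8-bit quantization halves the 16-bit footprint, while the distilled student at 16 bits still needs far less memory than the quantized teacher. The two techniques also compose, as noted under Future Work.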
Rating¶
- Novelty: ⭐⭐⭐⭐ Vision-based distillation is a novel approach, and the analysis of the downward scaling law is highly valuable.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across three dimensions with thorough ablation and compatibility testing.
- Writing Quality: ⭐⭐⭐⭐ Clear problem definition and sound experimental design.
- Value: ⭐⭐⭐⭐⭐ Extremely high practical value, directly lowering the deployment barrier for models like FLUX.