Text Embedding Knows How to Quantize Text-Guided Diffusion Models¶
- Conference: ICCV 2025
- arXiv: 2507.10340
- Code: https://github.com/jimmy9704/QLIP
- Area: Diffusion Models / Model Quantization
- Keywords: diffusion model quantization, text-guided, dynamic bit-width, mixed precision, post-training quantization
TL;DR¶
This paper is the first to use text prompts to guide dynamic bit-width allocation in diffusion model quantization. By predicting, from the prompt alone, the quality of the image it will generate, QLIP adaptively selects high, medium, or low bit precision for each layer at each timestep, cutting computational cost while maintaining or even improving generation quality.
Background & Motivation¶
Diffusion models have achieved remarkable success in text-to-image generation, but their billions of parameters and hundreds of denoising iterations impose substantial computational overhead, limiting deployment in resource-constrained settings.
Limitations of Prior Work:

- Methods such as PTQ4DM and Q-Diffusion account for the effect of timesteps on quantization but neglect the input condition (the text prompt) as a source of quantization guidance.
- TDQ adaptively adjusts activation scaling per timestep but applies the same bit-width to all layers.
- Input-adaptive dynamic quantization has been explored in super-resolution (CADyQ, AdaBM) but remains unexplored for diffusion models.
Core Observation:

- When a text prompt contains rich, specific descriptions, the generated image quality is high, and low-bit quantization causes significant quality degradation.
- When a text prompt is simple and generic, generation quality under low-bit quantization stays close to full precision.
- Image quality can therefore be predicted from the text prompt alone and used to guide dynamic bit-width allocation.
Method¶
Overall Architecture¶
QLIP (Quantization of Language-to-Image diffusion models using text Prompts) consists of two modules:
- T2Q (Text-to-Quality) module: Predicts a generation quality score \(q\) from text embeddings.
- Q2B (Quality-to-Bit) module: Determines the bit-width for each layer at each timestep based on the quality score \(q\).
QLIP can be seamlessly integrated on top of existing quantization methods (Q-Diffusion, PTQD).
Key Designs¶
- T2Q Module (see the sketch after this list):
  - Takes a CLIP text embedding \(\mathbf{z} \in \mathbb{R}^{C_{clip}}\) as input and outputs a scalar quality score \(q = \phi(\mathbf{z})\).
  - Architecture is simple: three linear layers.
  - Training data: 10k images generated by the full-precision model, with GIQA scores used as pseudo-labels.
  - Trained with MSE loss: \(L_{t2q} = \frac{1}{N}\sum_i (\bar{q}^i - \phi(\mathbf{z}^i))^2\).
- Q2B Module (see the sketch after this list):
  - Supports three bit-width levels \(\mathcal{B} = \{b_{low}, b_{med}, b_{high}\}\).
  - Quality-driven probability: \(\mathbf{p}_q = \sigma((q-0.5)\mathbf{s} + \mathbf{o})\), where \(\mathbf{s}, \mathbf{o} \in \mathbb{R}^K\) are learnable parameters and \(\sigma\) is the sigmoid.
  - Timestep-driven probabilities \(\mathbf{p}_m^t, \mathbf{p}_h^t\): one set of parameters is learned per group of \(M\) adjacent timesteps and shared within the group.
  - Selection probabilities for each bit-width:
    - \(\mathbf{p}_{b_{low}}^t = (1-\mathbf{p}_q) \odot (1-\mathbf{p}_m^t)\)
    - \(\mathbf{p}_{b_{med}}^t = (1-\mathbf{p}_q) \odot \mathbf{p}_m^t + \mathbf{p}_q \odot (1-\mathbf{p}_h^t)\)
    - \(\mathbf{p}_{b_{high}}^t = \mathbf{p}_q \odot \mathbf{p}_h^t\)
  - The final bit-width is chosen by argmax; a straight-through estimator (STE) keeps the selection differentiable during training.
- High-Bit Strategy for Initial Timesteps: The first \(m\) timesteps are forced to use high precision (i.e., \(\mathbf{p}_q\) is set to 1), since the early denoising steps determine the semantic alignment between the generated image and the text.
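Putting the two modules together, below is a minimal PyTorch sketch of T2Q and Q2B. The CLIP dimension, hidden width, timestep group size, sigmoid output head, and default bit levels are illustrative assumptions; the paper fixes only the three-linear-layer T2Q architecture and the probability equations above.

```python
import torch
import torch.nn as nn

class T2Q(nn.Module):
    """Predicts a scalar quality score q from a CLIP text embedding.

    Three linear layers as in the paper; the hidden width and the sigmoid
    head (assuming q is normalized to [0, 1]) are illustrative assumptions.
    """
    def __init__(self, c_clip: int = 768, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(c_clip, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.mlp(z).squeeze(-1)  # (B,)

class Q2B(nn.Module):
    """Maps the quality score and timestep to a per-layer bit-width."""
    def __init__(self, num_layers: int, num_timesteps: int,
                 group_size: int = 50, bits=(8, 16, 32)):
        super().__init__()
        self.group_size = group_size
        num_groups = (num_timesteps + group_size - 1) // group_size
        self.register_buffer("bits", torch.tensor(bits, dtype=torch.float32))
        # Learnable per-layer scale s and offset o for the quality-driven probability.
        self.s = nn.Parameter(torch.ones(num_layers))
        self.o = nn.Parameter(torch.zeros(num_layers))
        # Timestep-driven logits, one set shared per group of M adjacent timesteps.
        self.m_logit = nn.Parameter(torch.zeros(num_groups, num_layers))
        self.h_logit = nn.Parameter(torch.zeros(num_groups, num_layers))

    def forward(self, q: torch.Tensor, t: int) -> torch.Tensor:
        g = t // self.group_size
        # (The first m timesteps would force p_q = 1, i.e., high precision; omitted.)
        p_q = torch.sigmoid((q[:, None] - 0.5) * self.s + self.o)  # (B, K)
        p_m = torch.sigmoid(self.m_logit[g])                       # (K,)
        p_h = torch.sigmoid(self.h_logit[g])                       # (K,)
        # Selection probabilities for {low, med, high}, as in the equations above.
        p = torch.stack([(1 - p_q) * (1 - p_m),
                         (1 - p_q) * p_m + p_q * (1 - p_h),
                         p_q * p_h], dim=-1)                       # (B, K, 3)
        # Hard argmax in the forward pass, straight-through estimator in the backward.
        hard = nn.functional.one_hot(p.argmax(-1), 3).float()
        onehot = hard + p - p.detach()
        return (onehot * self.bits).sum(-1)                        # (B, K) bit-widths
```

Note that T2Q runs once per prompt and Q2B amounts to a few sigmoids per timestep, which is consistent with the small runtime overhead reported in the experiments.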
Loss & Training¶
- First term: L2 error between the noise predictions of the full-precision and quantized models.
- Second term: a bit-width penalty that encourages lower bits to reduce computation (see the sketch below).
- Weights are fixed at 4-bit; dynamic precision allocation is applied to activations only.
- Only the Q2B module parameters are trained; the diffusion model and T2Q module are frozen.
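A hedged sketch of this objective, assuming the penalty weight `lam` and the mean reductions are illustrative choices not fixed by the paper:

```python
import torch

def q2b_loss(eps_fp: torch.Tensor,     # noise predicted by the full-precision model
             eps_quant: torch.Tensor,  # noise predicted by the quantized model
             bitwidths: torch.Tensor,  # per-layer bit-widths selected by Q2B
             lam: float = 0.01) -> torch.Tensor:
    # First term: L2 error against the frozen full-precision teacher.
    recon = ((eps_fp.detach() - eps_quant) ** 2).mean()
    # Second term: bit-width penalty encouraging lower precision.
    return recon + lam * bitwidths.mean()
```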
Key Experimental Results¶
Main Results (BK-SDM-Tiny-2M, COCO2017)¶
| Method | Bit Config | FAB↓ | BitOPs(T)↓ | FID↓ | sFID↓ | CLIP Score↑ |
|---|---|---|---|---|---|---|
| Full Precision | W32A32 | 32.00 | 10.46 | 23.79 | 66.19 | 0.3069 |
| Q-Diffusion | W4A16 | 16.00 | 1.03 | 30.02 | 73.25 | 0.3068 |
| +QLIP | W4A{8,16,32} | 12.14 | 0.88 | 30.01 | 73.24 | 0.3063 |
| PTQD | W4A16 | 16.00 | 1.03 | 30.27 | 77.18 | 0.3069 |
| +QLIP | W4A{8,16,32} | 12.14 | 0.88 | 30.02 | 73.26 | 0.3063 |
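A note on the metric columns in these tables: FAB is the feature average bit-width (the metric used in CADyQ). A plausible formulation, assuming \(T\) timesteps and \(L\) quantized layers with assigned activation bit-widths \(b_l^t\), is

\[
\mathrm{FAB} = \frac{1}{TL}\sum_{t=1}^{T}\sum_{l=1}^{L} b_l^t,
\]

so W4A{8,16,32} with FAB 12.14 means the average activation precision lands between 8 and 16 bits.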
Stable Diffusion v1.4 Results¶
| Method | FAB↓ | FID↓ | sFID↓ | CLIP Score↑ |
|---|---|---|---|---|
| Full Precision | 32.00 | 22.23 | 65.11 | 0.3174 |
| Q-Diffusion W4A8 | 8.00 | 23.40 | 66.57 | 0.3126 |
| +QLIP W4A{6,8,10} | 7.86 | 21.61 | 64.32 | 0.3120 |
| PTQD W4A8 | 8.00 | 22.75 | 68.63 | 0.3126 |
| +QLIP W4A{6,8,10} | 7.86 | 21.35 | 65.81 | 0.3120 |
QLIP reduces FAB while simultaneously improving FID/sFID, as it allocates higher bit-widths to quality-sensitive components.
Ablation Study¶
Quality Metric Selection for T2Q Module:
| Quality Metric | SROCC↑ | PLCC↑ | FAB↓ | FID↓ |
|---|---|---|---|---|
| w/o QLIP | - | - | 8.00 | 23.40 |
| Realism score | 0.513 | 0.502 | 8.10 | 22.18 |
| CLIP-IQA | 0.713 | 0.708 | 8.54 | 21.81 |
| GIQA | 0.805 | 0.811 | 7.86 | 21.61 |
Component Analysis of Q2B Module:
| Configuration | FAB↓ | FID↓ | Notes |
|---|---|---|---|
| \(\mathbf{p}_q\) only | 7.57 | 26.91 | Low FAB but poor FID when used alone |
| \(\mathbf{p}_q + \mathbf{p}_h^t\) | 6.73 | 29.37 | Bit-width too low, quality degrades |
| \(\mathbf{p}_q + \mathbf{p}_m^t\) | 8.60 | 21.96 | Medium bits preserve quality but FAB is high |
| Full QLIP | 7.86 | 21.61 | All three components jointly achieve the best balance |
Key Findings¶
- The specificity of a text prompt is positively correlated with the required bit-width: more detailed prompts are assigned higher bit precision.
- Cross-attention layers are insensitive to quantization under simple prompts and can use low bit-widths.
- QLIP incurs minimal runtime overhead; actual inference time is close to W4A8 (4.85s vs. 4.53s) while maintaining FID at the W4A16 level.
Highlights & Insights¶
- The idea of using text as a quantization signal is novel and intuitive: simple prompt → low requirements → low bit-width is acceptable.
- Plug-and-play design: only the lightweight Q2B module is trained (~2K–1M parameters), applicable to any existing diffusion model quantization method.
- This work approaches diffusion model compression from the perspective of input-adaptive quantization, opening a new research direction.
Limitations & Future Work¶
- Currently only text prompts are considered as input conditions; the approach could be extended to image conditions, segmentation maps, and other input modalities.
- The CLIP Score remains largely unchanged or slightly decreases after applying QLIP, indicating room for improvement in text-image alignment.
- The quality prediction accuracy of the T2Q module directly affects bit-width allocation; a more accurate image quality predictor could further improve performance.
Related Work & Insights¶
- The paper draws inspiration from input-adaptive quantization in super-resolution (CADyQ, RefQSR) and transfers the idea to diffusion model generation.
- Complementary to the timestep-adaptive approach of TDQ: TDQ focuses on the temporal dimension, while QLIP adds adaptivity along the input content dimension.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First to leverage text prompts for diffusion model quantization, with clear observations and motivation
- Technical Depth: ⭐⭐⭐ — Module design is simple and effective, but theoretical depth is limited
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple datasets, baselines, and ablations with comprehensive comparisons
- Value: ⭐⭐⭐⭐⭐ — Plug-and-play design with practical significance for diffusion model deployment