
Text Embedding Knows How to Quantize Text-Guided Diffusion Models

Conference: ICCV 2025 | arXiv: 2507.10340 | Code: https://github.com/jimmy9704/QLIP | Area: Diffusion Models / Model Quantization | Keywords: diffusion model quantization, text-guided, dynamic bit-width, mixed precision, post-training quantization

TL;DR

This paper is the first to use text prompts to guide dynamic bit-width allocation when quantizing diffusion models. By predicting the quality of the image a given prompt will produce, it adaptively selects high, medium, or low bit precision for each layer and timestep, reducing computational cost while maintaining, and sometimes improving, generation quality.

Background & Motivation

Diffusion models have achieved remarkable success in text-to-image generation, but their billions of parameters and hundreds of denoising iterations impose substantial computational overhead, limiting deployment in resource-constrained settings.

Limitations of Prior Work:

  • Methods such as PTQ4DM and Q-Diffusion account for the effect of timesteps on quantization, but neglect the value of input conditions (text prompts) as a source of quantization guidance.
  • TDQ adaptively adjusts activation scaling per timestep but applies the same bit-width to all layers.
  • Input-adaptive dynamic quantization has been explored in super-resolution (CADyQ, AdaBM), yet remains unexplored for diffusion models.

Core Observation:

  • When a text prompt contains rich, specific descriptions, the generated image quality is high, and low-bit quantization causes significant quality degradation.
  • When a text prompt is simple and generic, generation quality under low-bit quantization remains close to full precision.
  • Accordingly, image quality can be predicted from the text prompt alone and used to guide dynamic bit-width allocation.

Method

Overall Architecture

QLIP (Quantization of Language-to-Image diffusion models using text Prompts) consists of two modules:

  1. T2Q (Text-to-Quality) module: Predicts a generation quality score \(q\) from text embeddings.
  2. Q2B (Quality-to-Bit) module: Determines the bit-width for each layer at each timestep based on the quality score \(q\).

QLIP can be seamlessly integrated on top of existing quantization methods (Q-Diffusion, PTQD).

Key Designs

  1. T2Q Module:

    • Takes a CLIP text embedding \(\mathbf{z} \in \mathbb{R}^{C_{clip}}\) as input and outputs a scalar quality score \(q = \phi(\mathbf{z})\).
    • Architecture is simple: three linear layers.
    • Training data: 10k images generated by the full-precision model, with GIQA scores used as pseudo-labels.
    • Trained with MSE loss: \(L_{t2q} = \frac{1}{N}\sum_i (\bar{q}^i - \phi(\mathbf{z}^i))^2\).
  2. Q2B Module:

    • Supports three bit-width levels \(\mathcal{B} = \{b_{low}, b_{med}, b_{high}\}\).
    • Quality-driven probability \(\mathbf{p}_q = \sigma((q-0.5)\mathbf{s} + \mathbf{o})\), where \(\mathbf{s}, \mathbf{o} \in \mathbb{R}^K\) are learnable parameters.
    • Timestep-driven probabilities \(\mathbf{p}_m^t, \mathbf{p}_h^t\): one set of parameters is learned per group of \(M\) consecutive timesteps and shared within that group.
    • Selection probabilities for each bit-width are computed as:
      • \(\mathbf{p}_{b_{low}}^t = (1-\mathbf{p}_q) \odot (1-\mathbf{p}_m^t)\)
      • \(\mathbf{p}_{b_{med}}^t = (1-\mathbf{p}_q) \odot \mathbf{p}_m^t + \mathbf{p}_q \odot (1-\mathbf{p}_h^t)\)
      • \(\mathbf{p}_{b_{high}}^t = \mathbf{p}_q \odot \mathbf{p}_h^t\)
    • The final bit-width is selected via argmax; the straight-through estimator (STE) keeps the selection differentiable during training (both modules are sketched in code after this list).
  3. High-Bit Strategy for Initial Timesteps: The first \(m\) timesteps are forced to use high precision (i.e., \(\mathbf{p}_q\) is set to 1), as the early denoising steps determine semantic alignment between the generated image and the text.
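
Below is a minimal PyTorch sketch of T2Q and Q2B as described above, not the official implementation: the hidden widths, the sigmoid on \(q\), the timestep group size, and the bit levels (here the W4A{6,8,10} configuration) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class T2Q(nn.Module):
    """Text-to-Quality: maps a CLIP text embedding z to a scalar score q."""

    def __init__(self, c_clip: int = 768, hidden: int = 256):
        super().__init__()
        # The paper specifies three linear layers; widths/activations are assumed.
        self.net = nn.Sequential(
            nn.Linear(c_clip, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Squash to [0, 1] (assumed normalization of the GIQA pseudo-labels);
        # trained with MSE against those labels, then frozen.
        return torch.sigmoid(self.net(z)).squeeze(-1)


class Q2B(nn.Module):
    """Quality-to-Bit: picks a bit-width per layer at each timestep."""

    def __init__(self, num_layers: int, num_timesteps: int,
                 group: int = 10, bits: tuple = (6, 8, 10)):
        super().__init__()
        self.group, self.bits = group, bits
        n_groups = (num_timesteps + group - 1) // group
        # Quality-driven parameters s, o (one value per layer).
        self.s = nn.Parameter(torch.ones(num_layers))
        self.o = nn.Parameter(torch.zeros(num_layers))
        # Timestep-driven logits, shared within each group of `group` steps.
        self.logit_m = nn.Parameter(torch.zeros(n_groups, num_layers))
        self.logit_h = nn.Parameter(torch.zeros(n_groups, num_layers))

    def forward(self, q: torch.Tensor, t: int) -> torch.Tensor:
        g = t // self.group
        p_q = torch.sigmoid((q - 0.5) * self.s + self.o)  # quality-driven
        p_m = torch.sigmoid(self.logit_m[g])              # timestep-driven
        p_h = torch.sigmoid(self.logit_h[g])
        p_low = (1 - p_q) * (1 - p_m)
        p_med = (1 - p_q) * p_m + p_q * (1 - p_h)
        p_high = p_q * p_h
        probs = torch.stack([p_low, p_med, p_high], dim=-1)  # (layers, 3)
        # Hard argmax in the forward pass, straight-through gradients in
        # the backward pass (the STE trick mentioned above).
        hard = F.one_hot(probs.argmax(dim=-1), num_classes=3).float()
        ste = hard + probs - probs.detach()
        return ste @ torch.tensor(self.bits, dtype=torch.float32)


# Usage: one T2Q call per prompt, one Q2B call per denoising step.
t2q = T2Q()
q2b = Q2B(num_layers=64, num_timesteps=50)
q = t2q(torch.randn(768))   # stand-in for a real CLIP embedding
print(q2b(q, t=0))          # per-layer bit-widths at step 0
```

During QLIP training, gradients reach \(\mathbf{s}\), \(\mathbf{o}\), and the timestep logits through the soft probabilities, while the forward pass always uses the hard argmax choice; the high-bit strategy for the first \(m\) steps would simply override \(\mathbf{p}_q\) with 1.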

Loss & Training

\[L_{QLIP} = \left\|\epsilon_\theta(\mathbf{x}_t, t) - \hat{\epsilon}_\theta(\hat{\mathbf{x}}_t, t)\right\|_2^2 + \lambda_{bit}\left(b_{high} \cdot \sum_k \mathbf{p}_{b_{high}}^t(k) + b_{med} \cdot \sum_k \mathbf{p}_{b_{med}}^t(k)\right)\]
  • First term: L2 error between noise predictions from the full-precision and quantized models.
  • Second term: bit-width penalty that encourages the use of lower bits to reduce computation (see the sketch after this list).
  • Weights are fixed at 4-bit; dynamic precision allocation is applied to activations only.
  • Only the Q2B module parameters are trained; the diffusion model and T2Q module are frozen.
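
A hedged sketch of this objective, assuming the two noise predictions are precomputed and the per-layer selection probabilities come from the Q2B sketch above; the bit values and \(\lambda_{bit}\) here are placeholders, not the paper's settings.

```python
import torch


def qlip_loss(eps_fp: torch.Tensor, eps_q: torch.Tensor,
              p_med: torch.Tensor, p_high: torch.Tensor,
              b_med: float = 8.0, b_high: float = 10.0,
              lambda_bit: float = 1e-3) -> torch.Tensor:
    """QLIP objective: reconstruction error plus a bit-width penalty.

    eps_fp / eps_q: noise predictions of the full-precision and quantized
    models; p_med / p_high: per-layer selection probabilities from Q2B.
    """
    recon = ((eps_fp - eps_q) ** 2).sum()  # squared L2 error
    bit_penalty = b_high * p_high.sum() + b_med * p_med.sum()
    return recon + lambda_bit * bit_penalty
```

Because only the Q2B module is trainable, the optimizer would be built over q2b.parameters() alone, leaving the diffusion model and T2Q frozen.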

Key Experimental Results

Main Results (BK-SDM-Tiny-2M, COCO2017; FAB = feature average bit-width)

| Method | Bit Config | FAB↓ | BitOPs (T)↓ | FID↓ | sFID↓ | CLIP Score↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Full Precision | W32A32 | 32.00 | 10.46 | 23.79 | 66.19 | 0.3069 |
| Q-Diffusion | W4A16 | 16.00 | 1.03 | 30.02 | 73.25 | 0.3068 |
| +QLIP | W4A{8,16,32} | 12.14 | 0.88 | 30.01 | 73.24 | 0.3063 |
| PTQD | W4A16 | 16.00 | 1.03 | 30.27 | 77.18 | 0.3069 |
| +QLIP | W4A{8,16,32} | 12.14 | 0.88 | 30.02 | 73.26 | 0.3063 |

Stable Diffusion v1.4 Results

| Method | Bit Config | FAB↓ | FID↓ | sFID↓ | CLIP Score↑ |
| --- | --- | --- | --- | --- | --- |
| Full Precision | W32A32 | 32.00 | 22.23 | 65.11 | 0.3174 |
| Q-Diffusion | W4A8 | 8.00 | 23.40 | 66.57 | 0.3126 |
| +QLIP | W4A{6,8,10} | 7.86 | 21.61 | 64.32 | 0.3120 |
| PTQD | W4A8 | 8.00 | 22.75 | 68.63 | 0.3126 |
| +QLIP | W4A{6,8,10} | 7.86 | 21.35 | 65.81 | 0.3120 |

QLIP reduces FAB while simultaneously improving FID/sFID, as it allocates higher bit-widths to quality-sensitive components.

Ablation Study

Quality Metric Selection for T2Q Module:

| Quality Metric | SROCC↑ | PLCC↑ | FAB↓ | FID↓ |
| --- | --- | --- | --- | --- |
| w/o QLIP | – | – | 8.00 | 23.40 |
| Realism score | 0.513 | 0.502 | 8.10 | 22.18 |
| CLIP-IQA | 0.713 | 0.708 | 8.54 | 21.81 |
| GIQA | 0.805 | 0.811 | 7.86 | 21.61 |

Component Analysis of Q2B Module:

| Configuration | FAB↓ | FID↓ | Notes |
| --- | --- | --- | --- |
| \(\mathbf{p}_q\) only | 7.57 | 26.91 | Low FAB but poor FID when used alone |
| \(\mathbf{p}_q + \mathbf{p}_h^t\) | 6.73 | 29.37 | Bit-widths drop too low; quality degrades |
| \(\mathbf{p}_q + \mathbf{p}_m^t\) | 8.60 | 21.96 | Medium bits preserve quality but FAB stays high |
| Full QLIP | 7.86 | 21.61 | All three probabilities jointly achieve the best balance |

Key Findings

  • The specificity of a text prompt is positively correlated with the required bit-width: more detailed prompts are assigned higher bit precision.
  • Cross-attention layers are insensitive to quantization under simple prompts and can use low bit-widths.
  • QLIP incurs minimal runtime overhead; actual inference time is close to W4A8 (4.85s vs. 4.53s) while maintaining FID at the W4A16 level.

Highlights & Insights

  • The idea of using text as a quantization signal is novel and intuitive: simple prompt → low requirements → low bit-width is acceptable.
  • Plug-and-play design: only the lightweight Q2B module is trained (~2K–1M parameters), applicable to any existing diffusion model quantization method.
  • This work approaches diffusion model compression from the perspective of input-adaptive quantization, opening a new research direction.

Limitations & Future Work

  • Currently only text prompts are considered as input conditions; the approach could be extended to image conditions, segmentation maps, and other input modalities.
  • The CLIP Score remains largely unchanged or slightly decreases after applying QLIP, indicating room for improvement in text-image alignment.
  • The quality prediction accuracy of the T2Q module directly affects bit-width allocation; a more accurate image quality predictor could further improve performance.
  • The paper draws inspiration from input-adaptive quantization in super-resolution (CADyQ, RefQSR) and transfers the idea to diffusion model generation.
  • Complementary to the timestep-adaptive approach of TDQ: TDQ focuses on the temporal dimension, while QLIP adds adaptivity along the input content dimension.

Rating

  • Novelty: ⭐⭐⭐⭐ — First to leverage text prompts for diffusion model quantization, with clear observations and motivation
  • Technical Depth: ⭐⭐⭐ — Module design is simple and effective, but theoretical depth is limited
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple datasets, baselines, and ablations with comprehensive comparisons
  • Value: ⭐⭐⭐⭐⭐ — Plug-and-play design with practical significance for diffusion model deployment