
Text Embedding Knows How to Quantize Text-Guided Diffusion Models

Conference: ICCV 2025 | arXiv: 2507.10340 | Code: https://github.com/jimmy9704/QLIP | Area: Diffusion Models / Model Quantization | Keywords: diffusion model quantization, text-guided, dynamic bit-width, mixed precision, post-training quantization

TL;DR

This paper is the first to use text prompts to guide dynamic bit-width allocation when quantizing diffusion models. By predicting the quality of the image a given prompt will produce, it adaptively selects high, medium, or low bit precision for each layer and timestep, reducing computational cost while maintaining, and sometimes improving, generation quality.

Background & Motivation

Diffusion models have achieved remarkable success in text-to-image generation, but their billions of parameters and hundreds of denoising iterations impose substantial computational overhead, limiting deployment in resource-constrained settings.

Limitations of Prior Work:

  • Methods such as PTQ4DM and Q-Diffusion account for the effect of timesteps on quantization, but neglect the value of input conditions (text prompts) as a source of quantization guidance.
  • TDQ adaptively adjusts activation scaling per timestep but applies the same bit-width to all layers.
  • Input-adaptive dynamic quantization has been explored in super-resolution (CADyQ, AdaBM), yet remains unexplored for diffusion models.

Core Observation:

  • When a text prompt contains rich, specific descriptions, the generated image quality is high, and low-bit quantization causes significant quality degradation.
  • When a text prompt is simple and generic, generation quality under low-bit quantization remains close to full precision.
  • Accordingly, image quality can be predicted from the text prompt alone and used to guide dynamic bit-width allocation.

Method

Overall Architecture

QLIP (Quantization of Language-to-Image diffusion models using text Prompts) consists of two modules:

  1. T2Q (Text-to-Quality) module: Predicts a generation quality score \(q\) from text embeddings.
  2. Q2B (Quality-to-Bit) module: Determines the bit-width for each layer at each timestep based on the quality score \(q\).

QLIP can be seamlessly integrated on top of existing quantization methods (Q-Diffusion, PTQD).

Key Designs

  1. T2Q Module:

    • Takes a CLIP text embedding \(\mathbf{z} \in \mathbb{R}^{C_{clip}}\) as input and outputs a scalar quality score \(q = \phi(\mathbf{z})\).
    • Architecture is simple: three linear layers.
    • Training data: 10k images generated by the full-precision model, with GIQA scores used as pseudo-labels.
    • Trained with MSE loss: \(L_{t2q} = \frac{1}{N}\sum_i (\bar{q}^i - \phi(\mathbf{z}^i))^2\).
  2. Q2B Module:

    • Supports three bit-width levels \(\mathcal{B} = \{b_{low}, b_{med}, b_{high}\}\).
    • Quality-driven probability \(\mathbf{p}_q = \sigma((q-0.5)\mathbf{s} + \mathbf{o})\), where \(\mathbf{s}, \mathbf{o} \in \mathbb{R}^K\) are learnable parameters.
    • Timestep-driven probabilities \(\mathbf{p}_m^t, \mathbf{p}_h^t\): one set of parameters is learned per group of \(M\) consecutive timesteps and shared within that group.
    • Selection probabilities for each bit-width are computed as:
      • \(\mathbf{p}_{b_{low}}^t = (1-\mathbf{p}_q) \odot (1-\mathbf{p}_m^t)\)
      • \(\mathbf{p}_{b_{med}}^t = (1-\mathbf{p}_q) \odot \mathbf{p}_m^t + \mathbf{p}_q \odot (1-\mathbf{p}_h^t)\)
      • \(\mathbf{p}_{b_{high}}^t = \mathbf{p}_q \odot \mathbf{p}_h^t\)
    • The final bit-width is selected via argmax; the straight-through estimator (STE) keeps the selection differentiable during training (both modules are sketched in code after this list).
  3. High-Bit Strategy for Initial Timesteps: The first \(m\) timesteps are forced to use high precision (i.e., \(\mathbf{p}_q\) is set to 1), as the early denoising steps determine semantic alignment between the generated image and the text.
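
Below is a minimal PyTorch sketch of T2Q and Q2B as described above, not the official implementation: the hidden widths, the sigmoid on \(q\), the timestep group size, and the bit levels (here the W4A{6,8,10} configuration) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class T2Q(nn.Module):
    """Text-to-Quality: maps a CLIP text embedding z to a scalar score q."""

    def __init__(self, c_clip: int = 768, hidden: int = 256):
        super().__init__()
        # The paper specifies three linear layers; widths/activations are assumed.
        self.net = nn.Sequential(
            nn.Linear(c_clip, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Squash to [0, 1] (assumed normalization of the GIQA pseudo-labels);
        # trained with MSE against those labels, then frozen.
        return torch.sigmoid(self.net(z)).squeeze(-1)


class Q2B(nn.Module):
    """Quality-to-Bit: picks a bit-width per layer at each timestep."""

    def __init__(self, num_layers: int, num_timesteps: int,
                 group: int = 10, bits: tuple = (6, 8, 10)):
        super().__init__()
        self.group, self.bits = group, bits
        n_groups = (num_timesteps + group - 1) // group
        # Quality-driven parameters s, o (one value per layer).
        self.s = nn.Parameter(torch.ones(num_layers))
        self.o = nn.Parameter(torch.zeros(num_layers))
        # Timestep-driven logits, shared within each group of `group` steps.
        self.logit_m = nn.Parameter(torch.zeros(n_groups, num_layers))
        self.logit_h = nn.Parameter(torch.zeros(n_groups, num_layers))

    def forward(self, q: torch.Tensor, t: int) -> torch.Tensor:
        g = t // self.group
        p_q = torch.sigmoid((q - 0.5) * self.s + self.o)  # quality-driven
        p_m = torch.sigmoid(self.logit_m[g])              # timestep-driven
        p_h = torch.sigmoid(self.logit_h[g])
        p_low = (1 - p_q) * (1 - p_m)
        p_med = (1 - p_q) * p_m + p_q * (1 - p_h)
        p_high = p_q * p_h
        probs = torch.stack([p_low, p_med, p_high], dim=-1)  # (layers, 3)
        # Hard argmax in the forward pass, straight-through gradients in
        # the backward pass (the STE trick mentioned above).
        hard = F.one_hot(probs.argmax(dim=-1), num_classes=3).float()
        ste = hard + probs - probs.detach()
        return ste @ torch.tensor(self.bits, dtype=torch.float32)


# Usage: one T2Q call per prompt, one Q2B call per denoising step.
t2q = T2Q()
q2b = Q2B(num_layers=64, num_timesteps=50)
q = t2q(torch.randn(768))   # stand-in for a real CLIP embedding
print(q2b(q, t=0))          # per-layer bit-widths at step 0
```

During QLIP training, gradients reach \(\mathbf{s}\), \(\mathbf{o}\), and the timestep logits through the soft probabilities, while the forward pass always uses the hard argmax choice; the high-bit strategy for the first \(m\) steps would simply override \(\mathbf{p}_q\) with 1.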

Loss & Training

\[L_{QLIP} = \left\|\epsilon_\theta(\mathbf{x}_t, t) - \hat{\epsilon}_\theta(\hat{\mathbf{x}}_t, t)\right\|_2^2 + \lambda_{bit}\left(b_{high} \cdot \sum_k \mathbf{p}_{b_{high}}^t(k) + b_{med} \cdot \sum_k \mathbf{p}_{b_{med}}^t(k)\right)\]
  • First term: L2 error between noise predictions from the full-precision and quantized models.
  • Second term: bit-width penalty that encourages the use of lower bits to reduce computation (see the sketch after this list).
  • Weights are fixed at 4-bit; dynamic precision allocation is applied to activations only.
  • Only the Q2B module parameters are trained; the diffusion model and T2Q module are frozen.
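
A hedged sketch of this objective, assuming the two noise predictions are precomputed and the per-layer selection probabilities come from the Q2B sketch above; the bit values and \(\lambda_{bit}\) here are placeholders, not the paper's settings.

```python
import torch


def qlip_loss(eps_fp: torch.Tensor, eps_q: torch.Tensor,
              p_med: torch.Tensor, p_high: torch.Tensor,
              b_med: float = 8.0, b_high: float = 10.0,
              lambda_bit: float = 1e-3) -> torch.Tensor:
    """QLIP objective: reconstruction error plus a bit-width penalty.

    eps_fp / eps_q: noise predictions of the full-precision and quantized
    models; p_med / p_high: per-layer selection probabilities from Q2B.
    """
    recon = ((eps_fp - eps_q) ** 2).sum()  # squared L2 error
    bit_penalty = b_high * p_high.sum() + b_med * p_med.sum()
    return recon + lambda_bit * bit_penalty
```

Because only the Q2B module is trainable, the optimizer would be built over q2b.parameters() alone, leaving the diffusion model and T2Q frozen.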

Key Experimental Results

Main Results (BK-SDM-Tiny-2M, COCO2017; FAB = feature average bit-width)

| Method | Bit Config | FAB↓ | BitOPs (T)↓ | FID↓ | sFID↓ | CLIP Score↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Full Precision | W32A32 | 32.00 | 10.46 | 23.79 | 66.19 | 0.3069 |
| Q-Diffusion | W4A16 | 16.00 | 1.03 | 30.02 | 73.25 | 0.3068 |
| +QLIP | W4A{8,16,32} | 12.14 | 0.88 | 30.01 | 73.24 | 0.3063 |
| PTQD | W4A16 | 16.00 | 1.03 | 30.27 | 77.18 | 0.3069 |
| +QLIP | W4A{8,16,32} | 12.14 | 0.88 | 30.02 | 73.26 | 0.3063 |

Stable Diffusion v1.4 Results

| Method | Bit Config | FAB↓ | FID↓ | sFID↓ | CLIP Score↑ |
| --- | --- | --- | --- | --- | --- |
| Full Precision | W32A32 | 32.00 | 22.23 | 65.11 | 0.3174 |
| Q-Diffusion | W4A8 | 8.00 | 23.40 | 66.57 | 0.3126 |
| +QLIP | W4A{6,8,10} | 7.86 | 21.61 | 64.32 | 0.3120 |
| PTQD | W4A8 | 8.00 | 22.75 | 68.63 | 0.3126 |
| +QLIP | W4A{6,8,10} | 7.86 | 21.35 | 65.81 | 0.3120 |

QLIP reduces FAB while simultaneously improving FID/sFID, as it allocates higher bit-widths to quality-sensitive components.

Ablation Study

Quality Metric Selection for T2Q Module:

| Quality Metric | SROCC↑ | PLCC↑ | FAB↓ | FID↓ |
| --- | --- | --- | --- | --- |
| w/o QLIP | – | – | 8.00 | 23.40 |
| Realism score | 0.513 | 0.502 | 8.10 | 22.18 |
| CLIP-IQA | 0.713 | 0.708 | 8.54 | 21.81 |
| GIQA | 0.805 | 0.811 | 7.86 | 21.61 |

Component Analysis of Q2B Module:

| Configuration | FAB↓ | FID↓ | Notes |
| --- | --- | --- | --- |
| \(\mathbf{p}_q\) only | 7.57 | 26.91 | Low FAB but poor FID when used alone |
| \(\mathbf{p}_q + \mathbf{p}_h^t\) | 6.73 | 29.37 | Bit-widths drop too low; quality degrades |
| \(\mathbf{p}_q + \mathbf{p}_m^t\) | 8.60 | 21.96 | Medium bits preserve quality but FAB stays high |
| Full QLIP | 7.86 | 21.61 | All three probabilities jointly achieve the best balance |

Key Findings

  • The specificity of a text prompt is positively correlated with the required bit-width: more detailed prompts are assigned higher bit precision.
  • Cross-attention layers are insensitive to quantization under simple prompts and can use low bit-widths.
  • QLIP incurs minimal runtime overhead; actual inference time is close to W4A8 (4.85s vs. 4.53s) while maintaining FID at the W4A16 level.

Highlights & Insights

  • The idea of using text as a quantization signal is novel and intuitive: simple prompt → low requirements → low bit-width is acceptable.
  • Plug-and-play design: only the lightweight Q2B module is trained (~2K–1M parameters), applicable to any existing diffusion model quantization method.
  • This work approaches diffusion model compression from the perspective of input-adaptive quantization, opening a new research direction.

Limitations & Future Work

  • Currently only text prompts are considered as input conditions; the approach could be extended to image conditions, segmentation maps, and other input modalities.
  • The CLIP Score remains largely unchanged or slightly decreases after applying QLIP, indicating room for improvement in text-image alignment.
  • The quality prediction accuracy of the T2Q module directly affects bit-width allocation; a more accurate image quality predictor could further improve performance.
  • The paper draws inspiration from input-adaptive quantization in super-resolution (CADyQ, RefQSR) and transfers the idea to diffusion model generation.
  • Complementary to the timestep-adaptive approach of TDQ: TDQ focuses on the temporal dimension, while QLIP adds adaptivity along the input content dimension.

Rating

  • Novelty: ⭐⭐⭐⭐ — First to leverage text prompts for diffusion model quantization, with clear observations and motivation
  • Technical Depth: ⭐⭐⭐ — Module design is simple and effective, but theoretical depth is limited
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple datasets, baselines, and ablations with comprehensive comparisons
  • Value: ⭐⭐⭐⭐⭐ — Plug-and-play design with practical significance for diffusion model deployment