Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation¶

Conference: CVPR 2025
arXiv: 2412.03178
Code: None
Area: Image Generation
Keywords: Uncertainty Quantification, Text-to-Image, Semantic Uncertainty, LVLM, Bias Detection

TL;DR¶

First systematic study quantifying the uncertainty of text-to-image generation models with respect to their prompts. It proposes PUNC, which uses an LVLM to caption generated images, compares the captions with the original prompts in the text space, and separates aleatoric and epistemic uncertainty via precision and recall, respectively.

Background & Motivation¶

Background¶

Background: Uncertainty quantification for T2I models is largely unexplored. Existing work focuses solely on the image space (judging whether an image is OOD), leaving uncertainty with respect to the prompt unexamined.

Key Challenge¶

Key Challenge: Uncertainty should be measured at the semantic level. Changes in the image space (such as color or contrast) do not necessarily reflect semantic differences: inverting a zebra's stripes produces a large pixel-space difference but leaves the semantics unchanged.

Two Sources of Uncertainty¶

Aleatoric uncertainty: prompt ambiguity admits multiple plausible semantic outputs (e.g., the prompt "fis" could mean "fish" or "fist").

Epistemic uncertainty: the model lacks knowledge of certain concepts (e.g., political figures absent from the training data).

Application value: bias detection, copyright protection, OOD detection, deepfake prevention.

Method¶

Overall Architecture¶

PUNC (Prompt-based UNCertainty estimation) is a three-step pipeline:

1. Generate an image \(\bm{x}\) from prompt \(\bm{c}^*\) using the T2I model.
2. Generate a description (caption) \(\hat{\bm{c}}\) for the generated image using an LVLM.
3. Compare \(\bm{c}^*\) and \(\hat{\bm{c}}\) in the text space: low similarity indicates high uncertainty.
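The three steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate_image`, `caption_image`, and `text_similarity` are hypothetical callables standing in for a T2I model, an LVLM captioner, and a text metric such as ROUGE or BERTScore. Defining uncertainty as one minus similarity is an assumption made here for concreteness, and the toy stand-ins exist only so the sketch runs without any model.

```python
from typing import Any, Callable


def punc_uncertainty(
    prompt: str,
    generate_image: Callable[[str], Any],
    caption_image: Callable[[Any], str],
    text_similarity: Callable[[str, str], float],
) -> float:
    """Return an uncertainty score for `prompt` (higher = more uncertain)."""
    image = generate_image(prompt)                 # step 1: T2I generation
    caption = caption_image(image)                 # step 2: LVLM captioning
    similarity = text_similarity(prompt, caption)  # step 3: compare in text space
    return 1.0 - similarity  # assumption: uncertainty = 1 - similarity


# Hypothetical stand-ins; a real setup would call a T2I model and an LVLM.
def fake_t2i(prompt: str) -> dict:
    return {"rendered_from": prompt}


def fake_captioner(image: dict) -> str:
    return "a photo of a red car"


def jaccard(a: str, b: str) -> float:
    """Word-set overlap, used here only as a placeholder similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


score = punc_uncertainty("a red car", fake_t2i, fake_captioner, jaccard)
```

Passing the three components as callables keeps the sketch independent of any particular model or metric, which mirrors the training-free nature of the pipeline.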

Key Design 1: Evaluation in Text Space Rather Than Image Space¶

Function: Bypass interference from semantically irrelevant changes in the image space.
Mechanism: Leverage the powerful image understanding capabilities of LVLMs (e.g., Molmo, LLaMA 3, GPT-4) to extract semantic content from generated images, translating them into text descriptions \(\hat{\bm{c}} = f_\omega^{txt}(\bm{c}^*, f_\omega^{img}(\bm{x}))\), and then compute similarity in the text space.
Design Motivation: Image-space methods are sensitive to non-semantic changes like brightness/color, whereas the text space naturally captures semantics. High similarity = low uncertainty (the generated image faithfully reflects the prompt), low similarity = high uncertainty (the model is uncertain about the prompt's meaning or lacks relevant knowledge).

Key Design 2: Precision/Recall Separating Two Types of Uncertainty¶

Function: Decompose total uncertainty into aleatoric and epistemic uncertainties.
Mechanism: Use precision and recall concepts in text similarity metrics (ROUGE, BERTScore):
- Recall measures the proportion of semantic concepts in the prompt that are preserved in the image \(\rightarrow\) low recall = high epistemic uncertainty (the model does not know certain concepts).
- Precision measures the proportion of semantic concepts in the image that match the prompt \(\rightarrow\) low precision = high aleatoric uncertainty (prompt ambiguity causes the model to add extra concepts).
Design Motivation: Image-space methods cannot perform this decomposition. The text precision/recall framework naturally aligns with the classic definitions of uncertainty.
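To make the decomposition concrete, the sketch below computes a ROUGE-1-style unigram precision and recall between a prompt and a caption. It is an illustration under simplifying assumptions (lowercased whitespace tokenization, no stemming), not the metric implementation used in the paper; the function names and the example strings are hypothetical.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Simplifying assumption: lowercase and split on whitespace.
    return text.lower().split()


def precision_recall(prompt: str, caption: str) -> tuple[float, float]:
    """Unigram precision and recall of `caption` against `prompt`."""
    p_counts = Counter(tokenize(prompt))
    c_counts = Counter(tokenize(caption))
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((p_counts & c_counts).values())
    precision = overlap / max(sum(c_counts.values()), 1)  # share of caption found in prompt
    recall = overlap / max(sum(p_counts.values()), 1)     # share of prompt found in caption
    return precision, recall


# Hypothetical example: the caption drops "zebra" and adds extra content.
precision, recall = precision_recall(
    "a zebra on a beach", "a horse on a beach at sunset"
)
```

Here recall falls because a prompt concept ("zebra") is missing from the caption, and precision falls because the caption contains words the prompt never mentioned, which matches the epistemic and aleatoric readings described above.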

Key Design 3: Comparison Framework with Timestep/Ensemble Methods¶

Function: Provide a comprehensive evaluation system for uncertainty quantification methods.
Mechanism: Adapt image-space methods like DDPM-OOD (reconstruction comparison under different noise levels), LMD (reconstruction comparison after perturbation), and 2XDM (comparison of double generations) by mapping them to prompt-space evaluation.
Design Motivation: A unified evaluation framework is needed for fair comparison among methods. PUNC requires only one generation plus one LVLM call, a significant efficiency advantage over DDPM-OOD, whose computational overhead is roughly 50\(\times\).

Loss & Training¶

A training-free method that directly leverages pretrained T2I models and LVLMs for inference-time evaluation.

Key Experimental Results¶

Main Results: OOD Detection (AUROC)¶

| Method | Computational Overhead | Remote Sensing | Texture | Microscopic | Average |
| --- | --- | --- | --- | --- | --- |
| DDPM-OOD | ~50× | Low | Low | Low | Low |
| 2XDM | 2× | Medium | Medium | Medium | Medium |
| PUNC (BERTScore) | 1× + LVLM | Highest | Highest | Highest | Highest |

PUNC consistently outperforms image-space methods across various OOD detection scenarios.
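For readers unfamiliar with the metric, AUROC for OOD detection can be computed directly from per-prompt uncertainty scores: it is the probability that a randomly chosen OOD prompt receives a higher score than a randomly chosen in-distribution prompt. The sketch below is a plain pairwise implementation; the score lists are made-up toy values, not results from the paper.

```python
def auroc(in_dist_scores: list[float], ood_scores: list[float]) -> float:
    """Pairwise AUROC: P(OOD score > in-distribution score), ties count 0.5."""
    wins = 0.0
    for ood in ood_scores:
        for ind in in_dist_scores:
            if ood > ind:
                wins += 1.0
            elif ood == ind:
                wins += 0.5
    return wins / (len(ood_scores) * len(in_dist_scores))


# Toy uncertainty scores (hypothetical, for illustration only).
in_dist = [0.10, 0.20, 0.35]
ood = [0.30, 0.60, 0.80]
value = auroc(in_dist, ood)  # 8 of the 9 pairs are ranked correctly
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of OOD from in-distribution prompts.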

Uncertainty Decomposition Validation¶

| Dataset | Uncertainty Type | Precision (aleatoric) | Recall (epistemic) |
| --- | --- | --- | --- |
| Corrupted | Aleatoric↑ | Decreased significantly | Relatively stable |
| Remote Sensing | Epistemic↑ | Relatively stable | Decreased significantly |
| Vague | Aleatoric↑ | Decreased significantly | Relatively stable |

These results support the hypothesis that precision and recall capture aleatoric and epistemic uncertainty, respectively.

Key Findings¶

- PUNC is effective across four T2I models (SDv1.5, SDXS, SDXL, PixArt-Σ).
- Simple and effective: requires only one generation + one captioning step, highly computationally efficient.
- Uncertainty quantification can be utilized for bias detection and copyright infringement detection.

Highlights & Insights¶

- New Task Definition: First to systematically define prompt-space uncertainty in T2I models, filling an important gap.
- Text Space is the Right Space: LVLMs serve as a semantic bridge, converting visual uncertainty into quantifiable textual differences.
- Precision/Recall as Uncertainty: Elegantly aligns classic NLP evaluation concepts with uncertainty decomposition.

Limitations & Future Work¶

- Captioning errors made by the LVLM itself can introduce noise into the uncertainty estimate.
- Text similarity metrics (ROUGE/BERTScore) may not fully capture subtle semantic differences.
- Only tested on four T2I models; extensions to areas like video generation remain unexplored.
- Future work could integrate internal model representations (such as attention maps) for deeper uncertainty analysis.

Related Work¶

- DDPM-OOD: Image-space OOD detection baseline, significantly outperformed by PUNC.
- Semantic Uncertainty (LLM Field): PUNC transfers the ideas of semantic uncertainty in LLMs to T2I.
- BERTScore: A text similarity metric, serving as the mathematical foundation for the precision/recall decomposition in PUNC.
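For reference, the standard BERTScore recall and precision (without idf weighting) can be written with the prompt \(\bm{c}^*\) as the reference and the caption \(\hat{\bm{c}}\) as the candidate. The notation \(\mathbf{e}(\cdot)\) for a normalized contextual token embedding is introduced here for illustration, and treating the prompt as the reference is an assumption consistent with the recall/precision roles described above:

\[
R_{\text{BERT}} = \frac{1}{|\bm{c}^*|} \sum_{c_i \in \bm{c}^*} \max_{\hat{c}_j \in \hat{\bm{c}}} \mathbf{e}(c_i)^\top \mathbf{e}(\hat{c}_j),
\qquad
P_{\text{BERT}} = \frac{1}{|\hat{\bm{c}}|} \sum_{\hat{c}_j \in \hat{\bm{c}}} \max_{c_i \in \bm{c}^*} \mathbf{e}(c_i)^\top \mathbf{e}(\hat{c}_j)
\]

Recall averages, over prompt tokens, the best match found in the caption; precision averages, over caption tokens, the best match found in the prompt.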

Rating¶

⭐⭐⭐⭐ — Pioneers uncertainty quantification in T2I. The PUNC method is elegant, effective, and computationally efficient, with a novel precision/recall decomposition. The limitations of relying on LVLM quality and text similarity metrics are manageable.