Hallucination as an Upper Bound: A New Perspective on Text-to-Image Evaluation¶
Conference: NeurIPS 2025 (Workshop: Generative and Protective AI for Content Creation)
arXiv: 2509.21257
Code: None
Area: Text-to-Image Generation / Evaluation Methods
Keywords: Hallucination, T2I Evaluation, Alignment Upper Bound, Bias Detection, Taxonomy
TL;DR¶
This paper proposes a definition of hallucination in text-to-image (T2I) models as bias-driven deviation, establishes a taxonomy of three hallucination categories (object, attribute, and relation), and argues that hallucination evaluation serves as an "upper bound" for prompt alignment evaluation, thereby revealing hidden model biases.
Background & Motivation¶
State of the Field¶
Hallucination has been extensively studied in large language models (LLMs) and vision-language models (VLMs):
| Area | Hallucination Definition | Research Depth |
|---|---|---|
| LLM | Generating content inconsistent with facts | Deep (numerous surveys and benchmarks) |
| VLM | Generating descriptions inconsistent with images | Actively developing (HaluEval, THRONE, etc.) |
| T2I | Not clearly defined | Nearly absent |
Existing T2I evaluation methods primarily focus on alignment:
- TIFA: QA-based prompt faithfulness
- GenEval: Compositional generation capability
- T2I-CompBench: Compositionality benchmark
- VQAScore: Visual question answering-based scoring
These methods only check "whether what the prompt requires is present," while ignoring "what the model generates beyond the prompt."
Lower Bound vs. Upper Bound¶
The paper presents a key insight:
| Evaluation Dimension | Meaning | Type |
|---|---|---|
| Alignment Evaluation | Are the elements required by the prompt present? | Lower Bound |
| Hallucination Evaluation | What has the model added beyond the prompt? | Upper Bound |
Focusing solely on alignment yields only a lower bound on model performance. A complete evaluation must also detect content the model adds unprompted, i.e., hallucination.
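The two complementary directions can be made concrete with a minimal sketch. This is not the paper's method (the paper proposes no metrics); it assumes a hypothetical upstream detector that returns the set of objects visible in a generated image.

```python
def alignment_score(prompt_objects: set[str], detected: set[str]) -> float:
    """Lower bound: fraction of prompt-required objects that actually appear."""
    if not prompt_objects:
        return 1.0
    return len(prompt_objects & detected) / len(prompt_objects)

def hallucination_score(prompt_objects: set[str], detected: set[str]) -> float:
    """Upper-bound check: fraction of depicted objects the prompt never asked for."""
    if not detected:
        return 0.0
    return len(detected - prompt_objects) / len(detected)

# Prompt: "a horse"; the image shows a horse plus an unprompted rider.
prompt_objs = {"horse"}
detected_objs = {"horse", "person"}
print(alignment_score(prompt_objs, detected_objs))      # 1.0 — alignment looks perfect
print(hallucination_score(prompt_objs, detected_objs))  # 0.5 — half the content is unprompted
```

The example shows why alignment alone is a lower bound: the alignment score is perfect even though half of the generated content was never requested.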
Method¶
Overall Architecture¶
This paper is a position paper whose core contributions are conceptual:
1. Defining hallucination in T2I generation
2. Establishing a taxonomy of three hallucination categories
3. Distinguishing hallucination from alignment errors
4. Arguing for the necessity of hallucination evaluation as an upper bound
Key Designs¶
Hallucination vs. Alignment Error¶
| Phenomenon | Alignment Error | Hallucination |
|---|---|---|
| Definition | Failure to correctly render prompt-specified content | Addition of content not specified by the prompt |
| Example | "Red car" is rendered as blue | Prompt "car" yields pedestrians on the road |
| Direction | Model omission/error | Model addition |
| Source | Insufficient understanding/rendering capability | Internal model bias/prior |
Hallucination Taxonomy¶
1. Object Hallucination¶
Generating entities not mentioned in the prompt.
Formally: let prompt \(P\) specify object set \(O = \{o_1, \ldots, o_n\}\). If the generated image depicts a non-empty set of objects \(O'\) with \(O' \cap O = \emptyset\), then \(O'\) constitutes object hallucination.
| Prompt | Expected Content | Hallucinated Content | Bias Source |
|---|---|---|---|
| "a bowl of apples" | Bowl of apples | Oranges appear in the bowl | Scene completion bias |
| "a horse" | Horse | Rider appears on horse | Co-occurrence statistics |
| "a street with cars" | Street with cars | Pedestrians, bicycles appear | Scene completeness bias |
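The formal definition translates directly into a set difference. A minimal sketch, assuming a hypothetical object detector has already extracted the entities visible in the image:

```python
def object_hallucinations(prompt_objects: set[str], image_objects: set[str]) -> set[str]:
    """Return O': entities depicted in the image that the prompt never mentions.
    A non-empty result constitutes object hallucination."""
    return image_objects - prompt_objects

# Prompt: "a horse"; co-occurrence bias adds a rider and saddle.
extra = object_hallucinations({"horse"}, {"horse", "rider", "saddle"})
print(sorted(extra))  # ['rider', 'saddle']
```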
2. Attribute Hallucination¶
The model assigns specific visual attributes to objects whose attributes are not specified in the prompt.
Formally: let prompt \(P\) include object \(o\) but no explicit attributes. If \(o\) in the image possesses attribute \(a'\) (not implied by \(P\)), then \(a'\) constitutes attribute hallucination.
| Prompt | Expected Output | Hallucinated Attribute | Reflected Bias |
|---|---|---|---|
| "a doctor" | Doctor (neutral) | Male, white coat | Gender/occupational stereotype |
| "a wedding cake" | Wedding cake | White, multi-tiered | Cultural default |
| "a child" | Child | Smiling, outdoors, neat clothing | Idealized emotional default |
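The same set-difference logic applies per object. In this hedged sketch the attribute extractor (e.g., a VQA model producing `image_attrs`) is assumed and out of scope; handling attributes merely *implied* by the prompt is also omitted for brevity.

```python
def attribute_hallucinations(
    prompt_attrs: dict[str, set[str]],  # object -> attributes stated in the prompt
    image_attrs: dict[str, set[str]],   # object -> attributes observed in the image
) -> dict[str, set[str]]:
    """For each depicted object, return attributes present in the image but
    absent from the prompt; non-empty sets are attribute hallucinations."""
    return {
        obj: observed - prompt_attrs.get(obj, set())
        for obj, observed in image_attrs.items()
        if observed - prompt_attrs.get(obj, set())
    }

# Prompt: "a doctor" (no attributes); the image shows a male doctor in a white coat.
print(attribute_hallucinations({"doctor": set()}, {"doctor": {"male", "white coat"}}))
```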
3. Relation Hallucination¶
The model inserts relationships between objects that are not described in the prompt.
Formally: let prompt \(P\) include objects \(O = \{o_1, o_2\}\) with no explicit relation. If the image contains relation \(r\) (not implied by \(P\)), then \(r\) constitutes relation hallucination.
| Prompt | Expected Composition | Hallucinated Relation | Reflected Bias |
|---|---|---|---|
| "a man and a dog" | Man and dog co-present | Man walking dog (on leash) | Control/ownership association |
| "a woman and a laptop" | Woman and laptop | Woman typing | Work scenario association |
| "a child and a book" | Child and book | Child reading | Learning narrative association |
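Relations can be sketched as (subject, relation, object) triples, again as a set difference. The scene-graph extractor producing `image_rels` is assumed, and implied relations are not modeled here.

```python
Triple = tuple[str, str, str]

def relation_hallucinations(prompt_rels: set[Triple], image_rels: set[Triple]) -> set[Triple]:
    """Relations depicted in the image that the prompt neither states nor implies."""
    return image_rels - prompt_rels

# Prompt: "a man and a dog" specifies co-presence only, no relation.
extra = relation_hallucinations(set(), {("man", "walking", "dog")})
print(extra)  # {('man', 'walking', 'dog')}
```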
Loss & Training¶
This paper involves no training. It is a conceptual framework paper intended to lay the groundwork for future T2I hallucination benchmarks and evaluation methods.
Key Experimental Results¶
Conceptual Framework Comparison¶
As a position paper, this work contains no conventional experiments. The core contribution lies in conceptual organization. The following compares existing evaluation dimensions:
| Evaluation Method | Detects Missing Objects | Detects Attribute Errors | Detects Relation Errors | Detects Extra Objects | Detects Implicit Bias |
|---|---|---|---|---|---|
| TIFA | ✓ | ✓ | Partial | ✗ | ✗ |
| GenEval | ✓ | ✓ | ✓ | ✗ | ✗ |
| T2I-CompBench | ✓ | ✓ | ✓ | ✗ | ✗ |
| VQAScore | ✓ | ✓ | Partial | ✗ | ✗ |
| iHallA | ✓ | ✓ | ✓ | Partial | ✗ |
| Proposed Framework | ✓ | ✓ | ✓ | ✓ | ✓ |
Completeness Comparison of Evaluation Dimensions¶
| Dimension | Alignment Evaluation (Lower Bound) | Hallucination Evaluation (Upper Bound) |
|---|---|---|
| Core Question | Is what the prompt requires present? | What extra content has the model added? |
| Captured Bias | Capability deficiency | Implicit bias and priors |
| Evaluation Direction | Missing content detection | Added content detection |
| Completeness | Necessary but insufficient | Complementary dimension |
| Existing Coverage | Extensive | Nearly absent |
Key Findings¶
- Alignment evaluation is incomplete: Existing T2I evaluation methods only check "whether anything is missing," not "whether anything extra has been added." A complete evaluation picture requires both.
- Hallucination reveals hidden biases: Object hallucination reflects scene completion bias, attribute hallucination reflects social stereotypes, and relation hallucination reflects over-learned associations—all of which are entirely overlooked by current alignment evaluation.
- Independence of the three hallucination categories: Object, attribute, and relation hallucination are three independent dimensions, each involving distinct evaluation challenges (entity detection vs. attribute recognition vs. relation reasoning).
- Implications for model deployment: Hallucination undermines controllability, neutrality, and trustworthiness—factors critical to real-world deployment that are neglected in existing evaluations.
Highlights & Insights¶
- The lower/upper bound analogy is compelling: Framing alignment as a lower bound and hallucination as an upper bound provides a clear conceptual framework for evaluation.
- Social bias perspective: Attribute hallucination directly connects to AI fairness issues (e.g., gender and cultural stereotypes).
- Addressing an evaluation blind spot: The paper explicitly identifies a systematic gap in T2I evaluation.
- Practical guidance: The taxonomy provides concrete categorical dimensions for constructing new T2I hallucination benchmarks.
Limitations & Future Work¶
- Lack of empirical validation: As a position paper, no quantitative experiments or benchmark results are provided.
- No evaluation methodology: A taxonomy is proposed but no concrete detection methods or evaluation metrics are designed.
- Ambiguous boundary of "hallucination": Scene completion by the model is sometimes reasonable (e.g., generating backgrounds); clearer criteria are needed for when such additions constitute hallucination.
- No discussion of connections to VLM hallucination: T2I hallucination evaluation may benefit from methods developed for VLM hallucination detection.
- Workshop paper length constraints: Many ideas are only briefly introduced without in-depth development.
Related Work & Insights¶
- Huang et al. (2025): Survey on LLM hallucination
- Bai et al. (2024): VLM hallucination research
- TIFA (Hu et al., 2023): QA-based T2I evaluation
- GenEval (Ghosh et al., 2023): Compositional generation evaluation
- T2I-CompBench (Huang et al., 2023): Compositionality benchmark
- iHallA (Lim et al., 2025): Partially addresses T2I hallucination
- The proposed taxonomy can guide future benchmark design, particularly for bias auditing.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First systematic framework for T2I hallucination
- Technical Depth: ⭐⭐ — Conceptual work with no technical methods or experiments
- Practicality: ⭐⭐⭐ — Provides direction for future benchmark design but is not directly applicable
- Clarity: ⭐⭐⭐⭐⭐ — Clear writing with vivid examples
- Overall Score: 6/10