
Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation

Conference: AAAI 2026 arXiv: 2511.10136 Code: None Area: Image Generation Keywords: Compositional generation, text-to-image, negation reasoning, counting, spatial relations

TL;DR

This paper systematically investigates fundamental deficiencies in compositional fidelity of text-to-image (T2I) models, focusing on three basic primitives—negation, counting, and spatial relations. It reveals a "submultiplicative" interference phenomenon in which models perform adequately on individual primitives but suffer dramatic performance degradation under joint composition, attributing this to training data scarcity, the unsuitability of continuous attention architectures for discrete logic, and evaluation metrics biased toward visual plausibility rather than constraint satisfaction.

Background & Motivation

State of the Field

Current mainstream T2I models (DALL·E 3, Stable Diffusion, Imagen, Parti, etc.) can generate highly photorealistic images and have reached impressive levels of quality in style, aesthetics, and single-concept rendering. However, these models exhibit systematic failures in compositional reasoning—that is, the ability to simultaneously satisfy multiple constraints such as counting, attribute binding, spatial relations, and negation.

Root Cause

A description trivial for humans—such as "exactly three red apples to the left of a vase, with no flowers inside the vase"—poses an enormous challenge to current models. More critically, even when a model can satisfy each individual constraint in isolation, performance degrades significantly when the constraints are combined. This is not merely a quantitative accumulation but a qualitative collapse.

Paper Goals

The authors seek to systematically address the following questions from the perspective of compositional primitives:

1. Why do advances in individual capabilities fail to generalize to joint prompts?
2. What are the root causes of this failure (data / architecture / evaluation)?
3. Why do current methods and straightforward extensions fail to bridge the gap?

Core Idea

The paper proposes a primitive-based taxonomy that decomposes compositional fidelity into three fundamental dimensions—negation, counting, and spatial relations—uses formal analysis to reveal the submultiplicative interference mechanism, and, drawing on 15 benchmarks and a variety of methods, charts research directions spanning theory, architecture, training, and evaluation.

Method

Overall Architecture

This is a survey paper whose analytical framework proceeds as follows:

- Formal definition: A T2I model \(G_\theta: \mathcal{Y} \to \Delta(\mathcal{X})\) maps text prompts to image distributions.
- Compositional primitive decomposition: Compositional fidelity is decomposed into negation, counting, and spatial relations.
- Interference metric: The submultiplicative degradation of joint performance is quantified via \(\rho(y)\).

Key Designs

1. Negation Primitive Analysis

Negation requires the model to reason about "what should not appear." Formally, the latent distribution entropy of negated prompts is higher than that of affirmative prompts:

\[\mathcal{H}(p(z|y_{\text{neg}})) > \mathcal{H}(p(z|y_{\text{aff}}))\]
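A toy illustration of this entropy gap, under the assumption that the latent is a discrete distribution over object categories: an affirmative prompt ("a dog") concentrates mass on the named class, while its negation ("not a dog") spreads mass over everything else. The distributions below are invented for illustration, not measured from any model.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# Hypothetical latent distributions over 8 object classes.
# Affirmative prompt "a dog": mass concentrated on the named class.
p_aff = [0.79] + [0.03] * 7
# Negated prompt "not a dog": mass spread over the remaining classes.
p_neg = [0.02] + [0.14] * 7

h_aff, h_neg = entropy(p_aff), entropy(p_neg)
print(f"H(aff) = {h_aff:.2f} bits, H(neg) = {h_neg:.2f} bits")
assert h_neg > h_aff  # matches H(p(z|y_neg)) > H(p(z|y_aff))
```

The higher-entropy target distribution is exactly what makes negation hard to sample: the model must cover "everything except X" rather than commit to a single concept.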

Negation manifests in highly diverse linguistic forms in T2I:

- Morphological negation: unstriped, nontoxic
- Lexical/privative adjectives: empty, barefoot
- Syntactic/clausal cues: no, not, without, neither…nor
- Quantificational negation: no N, fewer than n
- Relational negation: not left of, not touching

Key finding: Negation samples are extremely scarce in training data—approximately 0.4% in MS COCO and 0.6% in LAION-400M. Common failure modes include ignoring negation tokens, misapplied scope, over-suppression, and occlusion confusion.
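Frequencies like these are typically estimated by scanning captions for surface negation cues. A minimal sketch of such a scan, assuming a hand-picked cue list and a tiny invented caption sample (this is not the paper's measurement protocol, and a regex scan over-/under-counts relative to real linguistic analysis):

```python
import re

# Surface cues spanning the negation types listed above (illustrative subset;
# crude patterns like \bun\w+ed\b will produce some false positives).
NEG_PATTERN = re.compile(
    r"\b(no|not|without|never|neither|nor|none|empty|barefoot)\b"
    r"|\b\w+less\b|\bun\w+ed\b|\bnon-?\w+\b",
    re.IGNORECASE,
)

def negation_rate(captions):
    """Fraction of captions containing at least one negation cue."""
    hits = sum(1 for c in captions if NEG_PATTERN.search(c))
    return hits / len(captions)

# Tiny hypothetical sample (real corpora: MS COCO ~0.4%, LAION-400M ~0.6%).
captions = [
    "a dog running on the beach",
    "a vase with no flowers inside",
    "an unstriped cat on a sofa",
    "three red apples next to a bowl",
]
print(f"negation rate: {negation_rate(captions):.2f}")
```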

2. Counting Primitive Analysis

Counting exposes a fundamental architectural limitation of Transformer parallel attention: the absence of an explicit enumerator. The error grows superlinearly with target quantity \(n\):

\[\text{Error}(n) \approx \Theta(n^\beta), \quad \beta \in [1.2, 1.5]\]

This means error escalates sharply for \(n > 5\). Linguistic forms of counting include exact numerals, bounded ranges, fuzzy quantifiers, comparative relations, compound specifications, and spatially distributed counting.
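A quick numeric sketch of this scaling law, taking \(\beta = 1.35\) from the middle of the stated range and an arbitrary scale constant: because the exponent exceeds 1, even the *per-object* error grows with \(n\), which is why counts above 5 degrade so sharply.

```python
def count_error(n, beta=1.35, c=0.01):
    """Modeled counting error Error(n) ~ c * n**beta (c is an arbitrary scale)."""
    return c * n ** beta

for n in (2, 5, 10):
    e = count_error(n)
    # Per-object error grows as n**(beta - 1): the total is superlinear.
    print(f"n={n:2d}  total error={e:.3f}  per-object={e / n:.4f}")
```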

Failure modes: Unintended object duplication or merging; attribute binding leakage (numerals applied to the wrong subset); comparative constraint failure (one category disappearing rather than quantities being adjusted).

3. Spatial Relations Primitive Analysis

Spatial relations require resolving linguistic relationships into geometrically consistent scenes. For example, "a blue cube on a red sphere" requires:

\[\exists c,s \text{ s.t. } \text{IsCube}(c) \land \text{IsSphere}(s) \land \text{Color}(c,\text{blue}) \land \text{Color}(s,\text{red}) \land \text{On}(c,s)\]

where \(\text{On}(c,s)\) entails geometric constraints of bottom contact and horizontal overlap. Spatial relations encompass directional, topological, proximity, alignment, support, and partition-layout relations.
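A hypothetical verifier for \(\text{On}(c,s)\) from 2D bounding boxes makes the geometric entailment concrete: bottom contact plus sufficient horizontal overlap. The `Box` type, tolerance, and overlap threshold are illustrative choices, not the paper's formulation.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned box: (x0, y0) top-left, (x1, y1) bottom-right, y grows downward."""
    x0: float
    y0: float
    x1: float
    y1: float

def on(c: Box, s: Box, contact_tol: float = 2.0, min_overlap: float = 0.3) -> bool:
    """Check On(c, s): c's bottom edge touches s's top edge, with horizontal overlap."""
    touches = abs(c.y1 - s.y0) <= contact_tol
    overlap = min(c.x1, s.x1) - max(c.x0, s.x0)
    width = min(c.x1 - c.x0, s.x1 - s.x0)
    return touches and width > 0 and overlap / width >= min_overlap

cube = Box(40, 10, 80, 50)     # the blue cube, resting on...
sphere = Box(30, 50, 90, 110)  # ...the red sphere below it
print(on(cube, sphere))        # → True: bottom contact + horizontal overlap
```

Note that this pairwise check is exactly what evaluation metrics tend to verify, while the paper's point is that locally satisfied predicates can still conflict globally.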

Failure modes: Local pairwise predicate satisfaction with global conflict; directional relations inverting under viewpoint changes; support relations nominally satisfied but physically implausible.

4. Joint Compositionality—Submultiplicative Interference

The most severe failures emerge when primitives are combined. Assuming independence across primitives, the joint success rate would be:

\[F_\theta^{\text{ind}}(y) := F_\theta^{\text{cnt}}(y) \cdot F_\theta^{\text{spat}}(y) \cdot F_\theta^{\text{neg}}(y)\]

However, the observed behavior is submultiplicative:

\[\rho(y) := \frac{F_\theta(y)}{F_\theta^{\text{ind}}(y)} < 1\]

For instance, with 70% success per individual primitive, the joint rate under the independence assumption would be 34.3%, but actual performance drops to approximately 20% due to interference (\(\rho \approx 0.58\)). This reveals hidden interactions among constraints—a constraint trade-off phenomenon: enforcing layout disrupts counting; honoring negation removes expected context; correct counting is accompanied by incorrect attribute binding.
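The arithmetic of this example can be sketched directly from the two definitions above:

```python
def rho(joint, per_primitive):
    """Interference coefficient: observed joint rate over the independence prediction."""
    f_ind = 1.0
    for f in per_primitive:
        f_ind *= f
    return joint / f_ind

# Numbers from the example above: 70% per primitive, ~20% observed jointly.
f_ind = 0.7 ** 3                                    # 0.343 under independence
r = rho(0.20, [0.7, 0.7, 0.7])
print(f"independence: {f_ind:.3f}, rho = {r:.2f}")  # rho ≈ 0.58 < 1: interference
```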

Loss & Training

As a survey, the paper categorizes the training strategies of existing methods:

Negation methods: Contrastive strategies (TripletCLIP), data augmentation (CC-Neg 228K pairs), architectural methods (energy constraints, empty-frame encoding).

Counting methods: Data augmentation (improved captions in DALL·E 3), architectural innovations (modified attention, MoE), layout-based methods (bounding boxes, semantic regions), hybrid methods (LLM-grounded diffusion).

Spatial methods: Sampling methods (Composable Diffusion), attention methods (Attend-and-Excite), architectural methods (Set-of-Mark), 3D-aware methods (Zero123).

Joint methods: Inference-time composition, compositional data augmentation, curriculum learning (EvoGen).

Key Experimental Results

Main Results

The paper synthesizes analyses across 15 benchmarks; the table below summarizes key comparisons:

| Benchmark | Scale | Primary Focus |
| --- | --- | --- |
| T2I-CompBench | 6,000 | Comprehensive auto-evaluation |
| CREPE | 370K+ | Systematic compositionality |
| NegBench | 79K | Comprehensive negation |
| CC-Neg | 228K | Negation training/evaluation |
| SugarCrepe | >1,000 | Hard-negative probing |

Data Scarcity Statistics

| Dataset / Phenomenon | Frequency | Notes |
| --- | --- | --- |
| MS COCO | ~0.4% | Explicit negation nearly absent |
| CC3M | 1.63% | Low frequency |
| CC12M | ~2.5% | Slightly higher but still scarce |
| LAION-400M | ~0.6% | Equally scarce in large-scale data |
| High-count scenes (\(n>5\)) | <2% | Extremely rare in training data |
| Complex spatial arrangements | <5% | Multi-relation scenes uncommon |

Key Findings

  1. Fundamental mismatch between training data and architecture: The joint primitive training distribution approximates independence, yet individual marginal probabilities are already low and co-occurrences are even scarcer.
  2. Complexity-theoretic explanation: Joint constraint satisfaction is NP-hard in the general case (RCC-8 spatial constraints are NP-complete; label placement is NP-hard).
  3. Search space explosion: For \(n\) objects, \(m\) spatial relations, and \(k\) negation constraints, the search space is \(O(n! \cdot 2^m \cdot \binom{n}{k})\).
  4. Model scale is not a cure: Larger models yield marginal gains on individual primitives but suffer the same performance collapse on joint tasks.
  5. Inter-primitive interference: Enforcing negation may trigger hallucination of unrelated objects (due to strong co-occurrence priors); combining counting and spatial constraints may produce physically implausible scenes.
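The combinatorial bound in point 3 can be made concrete by evaluating \(n! \cdot 2^m \cdot \binom{n}{k}\) for small scenes (a back-of-the-envelope count of joint configurations, not a claim about any model's actual search):

```python
from math import comb, factorial

def search_space(n, m, k):
    """Joint search-space size n! * 2^m * C(n, k): object orderings x
    relation truth assignments x choices of negated objects."""
    return factorial(n) * 2 ** m * comb(n, k)

for n, m, k in [(3, 2, 1), (5, 4, 2), (8, 6, 3)]:
    print(f"n={n}, m={m}, k={k}: {search_space(n, m, k):,} configurations")
```

Even a modest scene (8 objects, 6 relations, 3 negations) already yields over a hundred million configurations, which is why purely sampling-based generation struggles to hit the jointly feasible region.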

Highlights & Insights

  1. The formalization of the submultiplicative interference coefficient \(\rho(y)\) is an elegant quantitative tool that precisely characterizes the nature of compositional failure.
  2. The root-cause analysis of the data–architecture mismatch is incisive: continuous attention optimizes score functions for majority patterns, and regularization terms (visual priors) dominate when compositional conflicts arise.
  3. Reframing T2I generation through the lens of constraint satisfaction and combinatorial optimization provides theoretical motivation for neuro-symbolic approaches.
  4. Loss functions \(\mathcal{L}_{\text{neg}}\) that equate negation with training "concept absence" cannot distinguish "specific absence" from "generic suppression"—a fundamental architectural limitation.

Limitations & Future Work

  1. Absence of experimental validation: As a survey, no new methods are proposed or experimentally compared.
  2. Focus restricted to three primitives: Temporal reasoning, causal relations, abstract concepts, and other more complex compositional challenges are not covered.
  3. Lack of quantitative interference analysis: Although \(\rho(y)\) is defined, it is not systematically measured across specific models and benchmarks.
  4. Qualitative characterization of solutions: The comparison of method strengths and weaknesses is largely descriptive, lacking quantitative comparisons in a unified experimental setting.
  5. Flow-based models not covered: Compositional performance of recent models such as FLUX and SD3 is not addressed.
Key References

  • T2I-CompBench (Huang et al., 2023): The most comprehensive compositionality benchmark, covering all three primitives.
  • Composable Diffusion (Feng et al., 2023): An inference-time method decomposing spatial relations into independent energy functions.
  • LLM-grounded Diffusion (Lian et al., 2023): Uses LLMs to parse text into structured scene representations.
  • CC-Neg (Singh et al., 2025): A dataset of 228K negation image–caption pairs.
  • CountGen (Binyamin et al., 2025): Embeds a differentiable counting loss into training.

Key Takeaway: This survey provides the most systematic theoretical framework for understanding T2I compositional failures; the submultiplicative interference \(\rho(y) < 1\) should serve as a core metric for measuring progress.

Rating

  • Novelty: ⭐⭐⭐⭐ — The primitive decomposition framework and formalization of submultiplicative interference offer a genuinely novel analytical perspective.
  • Experimental Thoroughness: ⭐⭐⭐ — Survey in nature with no new experiments, though coverage is broad.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Logically clear, formally rigorous, and well-structured.
  • Value: ⭐⭐⭐⭐ — Provides a systematic roadmap for understanding and addressing compositional challenges in T2I generation.