The Fabrication of Reality and Fantasy: Scene Generation with LLM-Assisted Prompt Interpretation¶

Conference: ECCV 2024
arXiv: 2407.12579
Code: Available
Area: Image Generation
Keywords: Text-to-Image Generation, Large Language Models, Benchmark, Diffusion Models, Creative Scene Generation

TL;DR¶

Proposes the Realistic-Fantasy Benchmark (RFBench) to evaluate diffusion models on creative and knowledge-intensive prompts, and introduces RFNet, a training-free framework that improves the generation of abstract and imaginative concepts through LLM-assisted prompt interpretation and a Semantic Alignment Assessment (SAA) module.

Background & Motivation¶

Existing text-to-image models perform poorly on complex prompts that require creativity or specialized knowledge, because their training data rarely contains scenes that violate conventional reality (e.g., "a mouse hunting a lion"). The traditional remedy (collecting data, then retraining or fine-tuning) is extremely costly and can lead to catastrophic forgetting.

The work frames the core question as how diffusion models can better capture imaginative and abstract concepts, and it points out that evaluation benchmarks for such tasks are missing.

Key observations:
- Training data bias prevents models from understanding scenes with role conflicts (e.g., "a cat chased by a mouse"), scientific reasoning (e.g., "water droplets on the International Space Station"), etc.
- Existing benchmarks do not cover evaluations in creative and fantasy dimensions.
- LLMs possess logical reasoning and knowledge inference capabilities, which can compensate for training data bias.

Method¶

Overall Architecture¶

RFNet (Realistic-Fantasy Network) is a two-stage training-free method:

Stage 1 (LLM-Driven Detail Synthesis): Uses an LLM to parse the prompt, generating object layouts (bounding boxes), detailed descriptions, background scenes, and negative prompts.
Stage 2 (Comprehensive Image Synthesis): Employs a two-step generation process—first generating foreground objects independently, and then seamlessly integrating them into the background.

Key Designs¶

1. RFBench Benchmark Construction¶

RFBench contains 229 compositional text prompts, grouped into two major areas with nine subcategories in total:

Realistic & Analytical:
- Scientific and empirical reasoning (e.g., "a drop of water on the International Space Station")
- Cultural and temporal awareness (e.g., "costumed children knocking door-to-door on October 31st")
- Factual or literal descriptions (e.g., "a tank abandoned on a beach for 50 years")
- Conceptual and metaphorical thinking (e.g., "a man brave as a lion")

Creativity & Imagination:
- Common objects in anomalous situations (e.g., "a rubber duck sailing on a lava field")
- Imaginative scenes (e.g., "an octopus and a seahorse playing chess")
- Counterfactual scenarios (e.g., "fish swimming in clouds")
- Role reversal or conflict (e.g., "a cat chased by a mouse")
- Anthropomorphic scenes (e.g., "a snowman building a friend in a blizzard")

The prompts are collected with a hybrid workflow: in-context learning alternates between ChatGPT and Bard, and predefined rules are applied to ensure diversity.
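The taxonomy above can be written down as a small lookup table, which is convenient for grouping per-prompt scores into the two areas. This is only a sketch: the names `RFBENCH_TAXONOMY`, `area_of`, and `mean_by_area` are hypothetical, and the subcategory strings simply copy the list above rather than any official label set.

```python
# Hypothetical encoding of the RFBench taxonomy listed above.
RFBENCH_TAXONOMY = {
    "Realistic & Analytical": [
        "Scientific and empirical reasoning",
        "Cultural and temporal awareness",
        "Factual or literal descriptions",
        "Conceptual and metaphorical thinking",
    ],
    "Creativity & Imagination": [
        "Common objects in anomalous situations",
        "Imaginative scenes",
        "Counterfactual scenarios",
        "Role reversal or conflict",
        "Anthropomorphic scenes",
    ],
}


def area_of(subcategory):
    """Return the major area that a subcategory belongs to."""
    for area, subcategories in RFBENCH_TAXONOMY.items():
        if subcategory in subcategories:
            return area
    raise KeyError(f"unknown subcategory: {subcategory}")


def mean_by_area(scored_prompts):
    """Average scores per area; scored_prompts is a list of (subcategory, score)."""
    buckets = {area: [] for area in RFBENCH_TAXONOMY}
    for subcategory, score in scored_prompts:
        buckets[area_of(subcategory)].append(score)
    # Areas without any scored prompt are left out of the result.
    return {area: sum(v) / len(v) for area, v in buckets.items() if v}
```

Keeping the area lookup separate from the averaging makes it easy to swap in a different scoring function without touching the taxonomy.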

2. LLM-Driven Detail Synthesis¶

Given a prompt, the LLM is guided by task requirements and in-context learning to generate enhanced responses, including:
- Bounding box layouts for main objects
- Detailed descriptions for each object
- Background scene description
- Negative prompts

The core objective is to compensate for training data bias in diffusion models using the logical reasoning capability of LLMs.
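The four outputs listed above can be held in a small typed structure. The sketch below assumes the LLM is asked to answer in JSON; the schema (`objects`, `box`, `descriptions`, `background`, `negative_prompt`), the pixel box convention `(x0, y0, x1, y1)`, and the example response are all hypothetical and are not taken from the paper.

```python
import json
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    box: tuple          # (x0, y0, x1, y1) in pixels; convention assumed here
    descriptions: list  # candidate descriptions proposed by the LLM


@dataclass
class ScenePlan:
    objects: list
    background: str
    negative_prompt: str = ""


def parse_plan(raw, size=512):
    """Parse a JSON response (hypothetical schema) and validate each box."""
    data = json.loads(raw)
    objects = []
    for item in data["objects"]:
        x0, y0, x1, y1 = item["box"]
        if not (0 <= x0 < x1 <= size and 0 <= y0 < y1 <= size):
            raise ValueError(f"box out of bounds for {item['name']}: {item['box']}")
        objects.append(
            SceneObject(item["name"], (x0, y0, x1, y1), list(item["descriptions"]))
        )
    return ScenePlan(objects, data["background"], data.get("negative_prompt", ""))


# Illustrative response only; not real model output.
EXAMPLE_RESPONSE = """
{
  "objects": [
    {"name": "mouse", "box": [60, 280, 200, 400],
     "descriptions": ["a fierce hunter stalking its prey"]},
    {"name": "lion", "box": [260, 180, 480, 420],
     "descriptions": ["fleeing in panic", "sleeping obliviously"]}
  ],
  "background": "a savanna at dusk",
  "negative_prompt": "blurry, extra limbs"
}
"""

plan = parse_plan(EXAMPLE_RESPONSE)
```

Validating boxes at parse time catches malformed layouts before they reach the diffusion stage, where they would be much harder to diagnose.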

3. Semantic Alignment Assessment (SAA)¶

SAA addresses the case where the descriptions the LLM generates for different objects conflict with one another. For example, the LLM may propose "sleeping obliviously" or "fleeing in panic" for the lion; each is reasonable on its own, but the two are semantically incompatible, so the one that is kept must agree with how the other objects are described.

The SAA module selects the most compatible combination of descriptions by calculating the cosine similarity between the description vectors of different objects, ensuring textual precision and consistency, and providing clear instructions for the subsequent diffusion model.
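The selection step can be sketched as a search over one-description-per-object combinations, scored by mean pairwise cosine similarity. Several things here are assumptions: the summary does not say which text encoder produces the description vectors, so `embed` is a toy bag-of-words stand-in, and the exhaustive search and the averaging rule are illustrative choices rather than the authors' implementation.

```python
import itertools
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real text encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_descriptions(candidates):
    """Pick one description per object, maximizing mean pairwise similarity.

    candidates maps object name -> list of candidate descriptions.
    """
    names = list(candidates)
    best_combo, best_score = None, -1.0
    for combo in itertools.product(*(candidates[n] for n in names)):
        vectors = [embed(d) for d in combo]
        pairs = list(itertools.combinations(vectors, 2))
        score = sum(cosine(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
        if score > best_score:
            best_combo, best_score = combo, score
    return dict(zip(names, best_combo)), best_score


chosen, score = select_descriptions({
    "mouse": ["a fierce hunter chasing the lion"],
    "lion": ["sleeping obliviously in the grass", "fleeing in panic from the hunter"],
})
```

The exhaustive product grows quickly with the number of objects and candidates, so a real system would likely need a cheaper search; the toy embedding also only rewards shared words, which a learned encoder would not be limited to.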

4. Comprehensive Image Synthesis¶

Step 1 (In-Depth Object Generation):
- Generated independently for each foreground object, with the input format [background prompt] with [target object], [descriptions]
- Uses a cross-attention constraint function to restrict the object within the bounding box:

\[\mathcal{L}_{obj}(\mathbf{A}, i, v) = [1 - \text{Topk}_u(\mathbf{A}_{uv} \cdot \mathbf{m}_i)] + [\text{Topk}_u(\mathbf{A}_{uv} \cdot (1 - \mathbf{m}_i))]\]

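Here \(\mathbf{A}_{uv}\) is the cross-attention weight between spatial location \(u\) and text token \(v\), and \(\mathbf{m}_i\) is the binary mask of the bounding box for object \(i\). At each denoising step, the latent is updated with the gradient of this loss, summed over the set of text tokens \(V\) associated with the object: \(z'_t \leftarrow z_t - \alpha \cdot \nabla_{z_t} \sum_{v \in V} \mathcal{L}_{obj}(\mathbf{A}, i, v)\)
The resulting cross-attention maps are then extracted and converted into saliency masks for the next step.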
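To make the constraint \(\mathcal{L}_{obj}\) concrete, the sketch below evaluates it on a flattened toy attention map in plain Python. It assumes that \(\text{Topk}_u\) means the average of the \(k\) largest values over spatial locations, which the summary does not state; all function names are hypothetical, and the gradient is passed in by hand because a real pipeline would obtain it from automatic differentiation through the denoiser.

```python
def topk_mean(values, k):
    """Mean of the k largest values (assumed reading of Topk_u)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def box_mask(height, width, box):
    """Row-major binary mask m_i for a box (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    return [
        1.0 if x0 <= x < x1 and y0 <= y < y1 else 0.0
        for y in range(height)
        for x in range(width)
    ]


def object_loss(attn, mask, k):
    """L_obj for one token: reward attention inside the box, penalize it outside."""
    inside = [a * m for a, m in zip(attn, mask)]
    outside = [a * (1.0 - m) for a, m in zip(attn, mask)]
    return (1.0 - topk_mean(inside, k)) + topk_mean(outside, k)


def gradient_step(latent, grad, alpha):
    """z' = z - alpha * grad, applied elementwise."""
    return [z - alpha * g for z, g in zip(latent, grad)]


def object_prompt(background, name, description):
    """Compose the Step 1 prompt: [background prompt] with [target object], [descriptions]."""
    return f"{background} with {name}, {description}"


mask = box_mask(4, 4, (0, 0, 2, 2))
focused = [0.9 * m for m in mask]          # attention concentrated inside the box
leaked = [0.9 * (1.0 - m) for m in mask]   # attention concentrated outside the box
```

Under this reading the loss is small when the token's strongest attention lies inside the box (`focused`) and large when it lies outside (`leaked`), which is the direction the gradient update pushes the latent away from.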

Step 2 (Seamless Background Integration):
- Replaces the corresponding regions of the current latent with the masked latents from Step 1.
- Introduces two constraint functions:
  - Guidance Constraint: Minimizes the discrepancy between the current cross-attention and the attention of the generated object from Step 1, preserving fine details.
  - Suppression Constraint: Minimizes key cross-attention maps outside the bounding boxes, reducing multi-object interference.

\[\mathcal{L}_{bg} = \beta \cdot \underbrace{\sum_u |(\mathbf{A}'_{uv} - \mathbf{A}^{(i)}_{uv}) \cdot \mathbf{m}_i|}_{\text{guidance}} + \gamma \cdot \underbrace{\text{Topk}_u(\mathbf{A}'_{uv} \cdot (1 - \mathbf{m}_i))}_{\text{suppression}}\]
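The two Step 2 operations, latent replacement and the combined loss \(\mathcal{L}_{bg}\), can be sketched on flattened toy arrays. As before, \(\text{Topk}_u\) is assumed to be the mean of the \(k\) largest values, the function names are hypothetical, and the numeric values of `beta` and `gamma` below are placeholders rather than settings from the paper.

```python
def topk_mean(values, k):
    """Mean of the k largest values (assumed reading of Topk_u)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)


def blend_latents(z_current, z_object, mask):
    """Inside the mask take the Step 1 latent; elsewhere keep the current one."""
    return [m * zo + (1.0 - m) * zc for zc, zo, m in zip(z_current, z_object, mask)]


def background_loss(attn_current, attn_object, mask, beta, gamma, k):
    """L_bg = beta * guidance + gamma * suppression for one object token."""
    # Guidance: stay close to the Step 1 attention inside the mask.
    guidance = sum(
        abs((a - r) * m) for a, r, m in zip(attn_current, attn_object, mask)
    )
    # Suppression: penalize the strongest attention outside the mask.
    suppression = topk_mean([a * (1.0 - m) for a, m in zip(attn_current, mask)], k)
    return beta * guidance + gamma * suppression


mask = [1.0, 1.0, 0.0, 0.0]
attn_step1 = [0.8, 0.6, 0.0, 0.0]
attn_clean = [0.8, 0.6, 0.0, 0.0]   # matches Step 1, nothing outside the mask
attn_leaky = [0.8, 0.6, 0.5, 0.0]   # extra attention outside the mask

loss_clean = background_loss(attn_clean, attn_step1, mask, beta=1.0, gamma=2.0, k=1)
loss_leaky = background_loss(attn_leaky, attn_step1, mask, beta=1.0, gamma=2.0, k=1)
blended = blend_latents([1.0, 2.0, 3.0, 4.0], [9.0, 9.0, 9.0, 9.0], mask)
```

In this toy case the guidance term is zero for both maps, so the whole difference between `loss_clean` and `loss_leaky` comes from the suppression term reacting to attention that leaks outside the box.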

Loss & Training¶

RFNet is training-free, so there is no training loss; it relies on the following components and settings:
- Pre-trained Stable Diffusion (v1.4 / v2.1) as the base generative model.
- Pre-trained LLM (ChatGPT/GPT-4) for prompt parsing.
- 50 denoising steps, a guidance scale of 7.5, and a resolution of \(512 \times 512\).
- Hyperparameter \(\alpha\) controls the scale of the gradient updates, while \(\beta\) and \(\gamma\) control the strength of the guidance and suppression constraints, respectively.
- Replacement operations are restricted to the first \(rT\) timesteps, where \(T\) is the total number of denoising steps and \(r\) is a ratio hyperparameter.
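These settings can be gathered in one configuration object. The sketch below fixes only the values stated above (50 steps, guidance scale 7.5, resolution 512); \(\alpha\), \(\beta\), \(\gamma\), and \(r\) are left as required arguments because the summary gives no values for them, and the numbers in the example are placeholders. It also assumes that "the first \(rT\) timesteps" means the earliest denoising steps, counted from the noisiest one. The class name `SamplingConfig` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    alpha: float                 # scale of the gradient update (no value given)
    beta: float                  # strength of the guidance constraint
    gamma: float                 # strength of the suppression constraint
    r: float                     # fraction of steps that use latent replacement
    num_steps: int = 50          # denoising steps, as stated above
    guidance_scale: float = 7.5  # classifier-free guidance scale, as stated above
    resolution: int = 512        # output is resolution x resolution

    def replacement_steps(self):
        """Number of steps during which latent replacement is applied."""
        return int(self.r * self.num_steps)

    def replaces_at(self, step_index):
        """True if replacement applies at this step (0 = first, noisiest step)."""
        return step_index < self.replacement_steps()


# Placeholder hyperparameters, chosen only to exercise the helper methods.
cfg = SamplingConfig(alpha=1.0, beta=1.0, gamma=1.0, r=0.5)
```

Making the dataclass frozen keeps a run's settings immutable, which helps when comparing ablation runs that differ in a single field.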

Key Experimental Results¶

Main Results¶

Comparison of GPT4-CLIP and GPT4Score on RFBench (R&A = Realistic & Analytical, C&I = Creativity & Imagination; higher is better):

Method	GPT4-CLIP R&A	GPT4-CLIP C&I	GPT4-CLIP Avg	GPT4Score R&A	GPT4Score C&I	GPT4Score Avg
Stable Diffusion	0.573	0.552	0.561	0.667	0.440	0.541
MultiDiffusion	0.510	0.510	0.510	0.517	0.493	0.504
Attend and Excite	0.523	0.560	0.546	0.633	0.520	0.570
LMD	0.457	0.536	0.501	0.550	0.600	0.578
BoxDiff	0.532	0.553	0.543	0.583	0.520	0.548
SDXL	0.536	0.619	0.582	0.567	0.587	0.578
RFNet (Ours)	0.587	0.623	0.607	0.833	0.627	0.719

In relative terms, RFNet improves average GPT4Score by 33% over Stable Diffusion (0.541 to 0.719) and the Creativity & Imagination GPT4Score by 43% (0.440 to 0.627); against MultiDiffusion, it improves the Realistic & Analytical GPT4Score by 61% (0.517 to 0.833).
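The percentages quoted above are relative gains and can be recomputed directly from the GPT4Score columns of the table. The dictionary below copies those values; the names `GPT4SCORE` and `relative_gain` are hypothetical.

```python
# GPT4Score values copied from the RFBench comparison table.
GPT4SCORE = {
    "Stable Diffusion": {"R&A": 0.667, "C&I": 0.440, "Avg": 0.541},
    "MultiDiffusion": {"R&A": 0.517, "C&I": 0.493, "Avg": 0.504},
    "RFNet": {"R&A": 0.833, "C&I": 0.627, "Avg": 0.719},
}


def relative_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0


avg_vs_sd = relative_gain(GPT4SCORE["RFNet"]["Avg"], GPT4SCORE["Stable Diffusion"]["Avg"])
ci_vs_sd = relative_gain(GPT4SCORE["RFNet"]["C&I"], GPT4SCORE["Stable Diffusion"]["C&I"])
ra_vs_md = relative_gain(GPT4SCORE["RFNet"]["R&A"], GPT4SCORE["MultiDiffusion"]["R&A"])

for label, gain in [("Avg vs SD", avg_vs_sd), ("C&I vs SD", ci_vs_sd), ("R&A vs MultiDiffusion", ra_vs_md)]:
    print(f"{label}: {gain:.1f}%")
```

The three gains come out at roughly 32.9%, 42.5%, and 61.1%, which round to the 33%, 43%, and 61% quoted in the text.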

Comparison against Imagen on a subset of DrawBench:

Prompt	Imagen	RFNet
A shark in the desert	0.194	0.713
An elephant under the sea	0.300	0.900
A panda making latte art	0.050	0.250
A pizza cooking an oven	0.700	0.831
Rainbow coloured penguin	0.394	0.519

RFNet scores higher than Imagen on every prompt listed here, with the largest margins on "A shark in the desert" and "An elephant under the sea".

Ablation Study¶

Impact of each component on RFBench GPT4Score:

SAA	Guidance	Suppression	GPT4Score
✗	✗	✗	0.295
✓	✗	✗	0.407
✗	✓	✗	0.554
✗	✓	✓	0.572
✓	✓	✓	0.719

Adding SAA on top of both constraints raises the score from 0.572 to 0.719 (+25.7% relative); SAA alone raises the baseline from 0.295 to 0.407.
The guidance and suppression constraints together lift the baseline from 0.295 to 0.572, with guidance alone already reaching 0.554.
The full model achieves a 143.7% relative improvement over the baseline (0.295 to 0.719).
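As a check on these figures, the relative gains follow directly from the ablation table:

\[\frac{0.719 - 0.572}{0.572} = \frac{0.147}{0.572} \approx 25.7\%, \qquad \frac{0.719 - 0.295}{0.295} = \frac{0.424}{0.295} \approx 143.7\%\]

The corresponding absolute gains over the 0.295 baseline are \(0.407 - 0.295 = 0.112\) for SAA alone, \(0.554 - 0.295 = 0.259\) for guidance alone, and \(0.572 - 0.295 = 0.277\) for guidance plus suppression.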

Key Findings¶

SAA is crucial: without semantic alignment assessment, conflicting descriptions generated by the LLM lead to significant degradation in image quality.
Highly similar descriptions lead to high-quality images, while low-similarity descriptions result in visual inconsistency.
In a user study with 120 participants, RFNet achieved the highest ratings in both image quality and text-prompt fidelity.
Traditional CLIPScore has limitations in evaluating creative scenarios, where GPT4-CLIP and GPT4Score are more suitable.

Highlights & Insights¶

First Realistic-Fantasy Benchmark: Systematically defines 9 subcategories covering realistic reasoning and creative imagination, filling the evaluation gap.
Training-free Architecture: Compatible with independently-trained LLMs and diffusion models, requiring no parameter tuning, and offering flexible deployment.
Dual Constraint Mechanism Different from Traditional Layout Loss: Combines guidance constraints for fidelity and suppression constraints for interference reduction, which are functionally complementary.
LLMs Compensate for Data Bias: Leverages the knowledge and reasoning capacity of LLMs to compensate for the lack of unconventional scenes in diffusion model training data.

Limitations & Future Work¶

Relies heavily on the quality of LLM reasoning; errors in LLM understanding will propagate to the generated results.
SAA selects descriptions based on text cosine similarity, which might not be suitable for highly abstract concepts.
The two-step generation process increases inference time.
The benchmark scale is limited (229 prompts); it could be extended with more subcategories and a larger scale.
More evaluation metrics could be developed to complement GPT4-CLIP and GPT4Score.

Related Work¶

LMD (LLM-grounded Diffusion): Uses LLMs to generate foreground object layouts and then guides the diffusion model. RFNet adds the SAA and dual-constraint mechanisms on top of this.
Attend and Excite: Enhances semantic understanding through attention mechanisms but lacks the capability to handle complex creative prompts.
RPG: Integrates LLMs in a closed-loop manner, improving generation quality through Chain-of-Thought.
SDXL: Strong high-resolution synthesis capabilities but sometimes fails to capture the creative intent in prompts.

Rating¶

Novelty: ⭐⭐⭐⭐ — Systematically defines the realistic-fantasy generation task and constructs a benchmark for the first time.
Effectiveness: ⭐⭐⭐⭐ — Significantly outperforms SOTA methods in both automatic and human evaluations.
Engineering Value: ⭐⭐⭐ — Training-free but has a relatively complex pipeline, relying on multiple pre-trained models.
Recommendation: ⭐⭐⭐⭐ — The benchmark dataset and evaluation methodologies provide valuable references for the community.