Evaluating Text-to-Visual Generation with Image-to-Text Generation¶

Conference: ECCV 2024
arXiv: 2404.01291
Code: Yes (Open-source data, models, and code)
Area: Video Generation
Keywords: Text-to-Visual Generation Evaluation, VQAScore, Image-Text Alignment, Compositional Prompting, GenAI-Bench

TL;DR¶

The authors propose VQAScore, which uses Visual Question Answering (VQA) models instead of CLIP to score how well a generated image or video matches its text prompt. VQAScore significantly outperforms CLIPScore on complex compositional prompts, and the authors also release the GenAI-Bench benchmark.

Background & Motivation¶

Although text-to-image/video generation models (e.g., Stable Diffusion, DALL-E 3) have progressed rapidly, reliably evaluating generation quality remains an open problem. The most widely used metric, CLIPScore, has a fundamental flaw: CLIP's text encoder behaves much like a "bag of words" and struggles to distinguish prompts that share the same vocabulary but differ in semantic structure. For instance, "a horse eating grass" and "grass eating a horse" would yield similar CLIPScores, even though they describe very different scenes.
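The "bag of words" failure mode can be made concrete with a toy example. The sketch below is not CLIP's actual text encoder; it only illustrates why any representation that discards word order cannot separate the two prompts above. The helper name `bag_of_words` is hypothetical.

```python
from collections import Counter


def bag_of_words(prompt: str) -> Counter:
    """Order-free representation: keeps word counts, discards word order."""
    return Counter(prompt.lower().split())


prompt_a = "a horse eating grass"
prompt_b = "grass eating a horse"

# The prompts are different strings with different meanings...
prompts_differ = prompt_a != prompt_b
# ...but their order-free representations are identical.
same_bag = bag_of_words(prompt_a) == bag_of_words(prompt_b)

print(prompts_differ, same_bag)
```

Any score computed only from such a representation would necessarily assign both prompts the same value for a given image.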

The core problems are: (1) CLIPScore is unreliable on complex prompts involving compositional relations (spatial relations, attribute binding, action relations, etc.); (2) existing remedies (e.g., using larger CLIP models or introducing extra parsers) either offer limited gains or are overly complex; (3) high-quality evaluation benchmarks tailored to compositional generation are lacking.

This paper proposes a counter-intuitive yet highly effective solution: using image-to-text VQA models to evaluate text-to-image generation quality. The core idea is to reformulate evaluation as a simple Visual Question Answering task, asking "Does this figure show '{text}'?", and to use the probability that the VQA model answers "Yes" as the alignment score.

Method¶

Overall Architecture¶

The VQAScore pipeline is straightforward: (1) take a generated image and its text prompt; (2) embed the text prompt into the template question "Does this figure show '{text}'?"; (3) use a VQA model to compute the probability of answering "Yes"; (4) report this probability as the VQAScore alignment score.
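The four steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `vqa_model` is a hypothetical callable that returns logits for the first answer token, the two-token vocabulary is a simplification, and `toy_vqa_model` is a stand-in that only checks string containment so the example runs without any real model.

```python
import math
from typing import Callable, Dict

TEMPLATE = "Does this figure show '{text}'?"


def build_question(text: str) -> str:
    """Step 2: embed the prompt into the template question."""
    return TEMPLATE.format(text=text)


def yes_probability(answer_logits: Dict[str, float]) -> float:
    """Step 3: softmax over the answer-token logits, then read off P("Yes")."""
    top = max(answer_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - top) for tok, v in answer_logits.items()}
    return exps["Yes"] / sum(exps.values())


def vqa_score(image, text: str, vqa_model: Callable) -> float:
    """Steps 1-4: image + prompt -> question -> P("Yes") as the score."""
    logits = vqa_model(image, build_question(text))
    return yes_probability(logits)


def toy_vqa_model(image, question: str) -> Dict[str, float]:
    """Hypothetical stand-in: 'sees' the image through a stored caption."""
    if image["caption"] in question:
        return {"Yes": 2.0, "No": 0.0}
    return {"Yes": 0.0, "No": 2.0}


image = {"caption": "a horse eating grass"}
score_match = vqa_score(image, "a horse eating grass", toy_vqa_model)
score_swap = vqa_score(image, "grass eating a horse", toy_vqa_model)
print(round(score_match, 3), round(score_swap, 3))
```

With a real VQA model the logits would range over the full vocabulary, but the scoring rule stays the same: one forward pass and one probability lookup per image-prompt pair.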

Key Designs¶

VQAScore Evaluation Metric:
- Function: Accurately measure the semantic alignment between generated images and text prompts.
- Mechanism: Utilize the vision-language reasoning capability of VQA models to reformulate alignment evaluation as a binary question-answering task. The VQA model determines semantic consistency by jointly processing images and text, which avoids compositional understanding deficiencies caused by the independent encoding of images and text in CLIP.
- Design Motivation: VQA models naturally possess compositional reasoning capabilities (e.g., understanding "who is doing what to whom"), which is precisely what CLIPScore lacks.
CLIP-FlanT5 Model:
- Function: Further improve the performance of VQAScore.
- Mechanism: Train a bidirectional image-question encoder. Unlike standard VQA models, CLIP-FlanT5 allows image embeddings to depend on question content (and vice versa), enabling deeper cross-modal interaction. It uses FlanT5 as the language model backbone combined with a CLIP vision encoder.
- Design Motivation: Standard unidirectional encoding ignores the guiding effect of the question content on image understanding. A bidirectional encoder can capture finer-grained image-text interactions.
GenAI-Bench Benchmark:
- Function: Provide a more challenging benchmark for evaluating compositional text-to-visual generation.
- Mechanism: It contains 1,600 compositional text prompts across dimensions such as scene parsing, object recognition, attribute binding, relation reasoning, and high-level logical reasoning. It collects over 15,000 human ratings covering mainstream generative models like Stable Diffusion, DALL-E 3, and Gen2.
- Design Motivation: Text prompts in existing evaluation benchmarks are too simple to adequately test the compositional understanding capabilities of generative models.

Loss & Training¶

CLIP-FlanT5 is trained on large-scale image-text pairs using standard VQA training objectives. Two points stand out: the model uses bidirectional attention between image and question tokens instead of unidirectional attention, and it is trained solely on image data yet still generalizes to video and 3D model evaluation.
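The difference between unidirectional and bidirectional attention can be illustrated with attention masks. The sketch below assumes image tokens are placed before question tokens in the input sequence (a common layout, but an assumption here); the function names are hypothetical and this is not CLIP-FlanT5's actual code.

```python
from typing import List

Mask = List[List[bool]]


def attention_mask(n_image: int, n_question: int, bidirectional: bool) -> Mask:
    """mask[query][key] is True when the query token may attend to the key token.

    Assumed layout: [image tokens..., question tokens...].
    Unidirectional (causal): a token only attends to itself and earlier tokens.
    Bidirectional: every token attends to every token.
    """
    n = n_image + n_question
    return [[bidirectional or key <= query for key in range(n)] for query in range(n)]


def image_sees_question(mask: Mask, n_image: int) -> bool:
    """True if any image token can attend to any question token."""
    return any(
        mask[query][key]
        for query in range(n_image)
        for key in range(n_image, len(mask))
    )


causal = attention_mask(n_image=3, n_question=2, bidirectional=False)
full = attention_mask(n_image=3, n_question=2, bidirectional=True)

print(image_sees_question(causal, 3), image_sees_question(full, 3))
```

Under the causal mask the image representation is computed without any knowledge of the question, whereas the bidirectional mask lets image embeddings depend on the question and vice versa.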

Key Experimental Results¶

Main Results¶

Dataset	Metric	Ours (VQAScore)	CLIPScore	Gain
8 Image-Text Alignment Benchmarks	Kendall τ	SOTA	Second-best	Average +15-25%
Winoground	Accuracy	Significantly leading	~50% (Random)	+20-30%
GenAI-Bench	Human Correlation	Best	Low	Significant improvement
Video Alignment	Kendall τ	Applicable	Not applicable	Cross-modal generalization
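For readers unfamiliar with the Kendall τ column, the sketch below shows how a rank correlation between a metric and human ratings can be computed. It implements the simple tau-a variant (no tie correction) in pure Python; the exact variant and protocol used in the paper may differ, and the ratings and scores are made-up illustrative numbers, not results from the paper.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1  # both sequences order this pair the same way
        elif sign < 0:
            discordant += 1  # the sequences disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up data: human alignment ratings and an automatic metric's scores
# for the same five generated images.
human_ratings = [1, 2, 3, 4, 5]
metric_scores = [0.10, 0.35, 0.30, 0.70, 0.90]

tau = kendall_tau_a(human_ratings, metric_scores)
print(tau)
```

A value near 1 means the metric ranks images almost exactly as humans do, a value near 0 means the rankings are unrelated, and a negative value means they tend to disagree.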

Ablation Study¶

Configuration	Key Metrics	Description
Different VQA Models	Performance Variance	Larger VQA models perform better
Different Question Templates	Robust	VQAScore is robust to the choice of templates
CLIP-FlanT5 vs GPT-4V	Leading or on par	The open-source model matches or outperforms the proprietary one
Image vs Video vs 3D	All effective	Training solely on images generalizes to other modalities

Key Findings¶

VQAScore achieves SOTA on all 8 image-text alignment benchmarks, despite its extreme simplicity.
The open-source CLIP-FlanT5 even outperforms baseline methods utilizing GPT-4V.
VQAScore generalizes to video and 3D model evaluation, demonstrating strong cross-modal capabilities.
GenAI-Bench reveals significant deficiencies in the compositional understanding of current generative models.

Highlights & Insights¶

The core idea is exceptionally simple and efficient: a single direct VQA question outperforms more complex evaluation methods by a wide margin.
It reveals an important methodological insight: text-to-image generation can be evaluated in the reverse direction, using image-to-text models.
As an open-source alternative, CLIP-FlanT5 outperforms GPT-4V, lowering evaluation costs.
The paper has accumulated 411 citations, indicating the broad influence of this work.

Limitations & Future Work¶

VQAScore still depends on the quality of VQA models and may fail in scenarios that are challenging for VQA models.
Question templates like "Does this figure show..." may lack flexibility for certain types of prompts.
GenAI-Bench primarily focuses on English prompts, without evaluating multilingual scenarios.
The capability of VQAScore in fine-grained aesthetic quality assessment (such as composition and color) has not been explored.

Related Work & Insights¶

CLIPScore: The classic evaluation method by Hessel et al., widely used but unreliable on compositional prompts.
TIFA: An evaluation approach that uses an LLM to generate questions and then assesses them via VQA; it is far more complex than VQAScore.
DSG: It decomposes the prompt into atomic questions organized in a dependency (scene) graph, which requires an extra decomposition step before any VQA call.
Insight: Sometimes the simplest methods are the most effective; a generation task can be evaluated with a model for the inverse task.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ The idea of using VQA to evaluate T2I generation is extremely simple yet highly effective, having a massive impact.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely comprehensive, covering 8 benchmarks, multi-modal generalization, human evaluation, and a new benchmark.
Writing Quality: ⭐⭐⭐⭐ Highly logical with compelling motivation.
Value: ⭐⭐⭐⭐⭐ The 411 citations noted above demonstrate its broad practical impact.