Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation¶
Conference: NeurIPS 2025 arXiv: 2509.21227 Code: None (project page available) Area: Image Generation Keywords: evaluation metrics, compositional alignment, text-to-image, VQA metrics, human judgment
TL;DR¶
This paper systematically evaluates 12 text-image compositional alignment metrics against human judgments, finding that no single metric consistently outperforms all others across compositional categories, that VQA metrics are not always superior, and that embedding-based metrics (ImageReward, HPS) are stronger on certain categories.
Background & Motivation¶
Evaluation of text-to-image generation relies heavily on automated metrics, yet the reliability of these metrics in reflecting human preferences remains an open question. Several critical issues characterize the current state of the field:
- Background: Most metrics are adopted based on popularity or convention rather than systematic validation against human judgment.
- Limitations of Prior Work: Model comparisons and rankings depend directly on the chosen metrics; an erroneous choice of metric can mislead research directions. Furthermore, an increasing number of methods (ReNO, DPOK, etc.) use these metrics as reward signals for reinforcement learning, meaning metric bias directly affects model training.
- Key Challenge: Compositional alignment—covering entity presence, attribute binding (color/shape/texture), spatial relations (2D/3D), non-spatial relations, and counting accuracy—is a core challenge in T2I generation, yet no prior work has comprehensively compared metrics against human judgments across fine-grained compositional categories.
- Goal: This work presents the first comprehensive comparison of 12 metrics across 8 compositional categories against human judgment.
Method¶
Overall Architecture¶
Evaluation Design: Based on the T2I-CompBench++ benchmark, comprising 2,400 text-image samples across 8 compositional categories. Generation results from 6 T2I models are used (SD v1.4, SD v2, Structured Diffusion, Composable Diffusion, Attend-and-Excite, GORS), all annotated with human ratings.
The 12 evaluated metrics are grouped into three classes:
Embedding-based metrics (5):

- CLIPScore: Cosine similarity of CLIP text and image embeddings
- PickScore: CLIP fine-tuned on pairwise preference judgments
- HPS: CLIP fine-tuned on human comparison data
- ImageReward: Adds a reward head, trained on ranked human preference data
- BLIP-2: Compares image-generated captions against the input text
VQA-based metrics (5):

- VQAScore: Generates yes/no questions from the text and answers them with a VQA model
- TIFA: Uses structured templates to cover objects, attributes, and relations
- DA Score: Tests entity-attribute binding
- DSG: Converts the text to a scene graph to verify entities and relations
- B-VQA: Decomposes the text into object-attribute pairs and queries BLIP-VQA for each
Image-only metrics (2):

- CLIP-IQA: Regresses image quality from CLIP embeddings
- Aesthetic Score: Estimates aesthetic value, trained on large-scale human ratings
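To make the embedding-based family concrete, the sketch below scores a text-image pair in the common CLIPScore style: cosine similarity of L2-normalized embeddings, rescaled by w = 2.5 and floored at zero. The random vectors are stand-ins; a real evaluation would encode the prompt and image with a CLIP model first.

```python
import numpy as np

def clipscore_style(text_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """CLIPScore-style similarity: w * max(cos(t, v), 0) with w = 2.5,
    computed on L2-normalized embeddings."""
    t = text_emb / np.linalg.norm(text_emb)
    v = image_emb / np.linalg.norm(image_emb)
    return 2.5 * max(float(t @ v), 0.0)

# Stand-in embeddings; real use would pass CLIP-encoded prompt/image vectors.
rng = np.random.default_rng(0)
t, v = rng.normal(size=512), rng.normal(size=512)
score = clipscore_style(t, v)
print(0.0 <= score <= 2.5)
```

The max(·, 0) floor and the 2.5 rescaling follow the original CLIPScore formulation; the learned variants in this family (PickScore, HPS, ImageReward) replace the raw similarity with a preference-tuned head.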
Key Designs¶
Multi-dimensional Analysis Strategy:
- Correlation Analysis: Spearman correlation (the primary statistic) between each metric and human ratings, computed per compositional category and supplemented by Pearson and Kendall correlations.
- Regression Analysis: Linear regression models fitted per category (human ratings as target, all metrics as features) to analyze joint contributions.
- Distribution Analysis: Score distribution characteristics of each metric are examined to reveal saturation and compression issues.
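The first two analyses can be sketched on synthetic stand-in data (the metric names and values below are illustrative, not the paper's):

```python
import numpy as np

def spearman(a, b):
    """Spearman rho as Pearson correlation of ranks (no ties assumed);
    scipy.stats.spearmanr is the usual choice in practice."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

rng = np.random.default_rng(0)
n = 300                                          # one category's samples
human = rng.uniform(0, 1, n)                     # synthetic human ratings
metrics = {
    "metric_a": human + rng.normal(0, 0.2, n),   # informative metric
    "metric_b": rng.uniform(0, 1, n),            # uninformative metric
}

# 1) Correlation analysis: per-metric Spearman rho against human ratings.
rhos = {name: spearman(s, human) for name, s in metrics.items()}
print(rhos)

# 2) Regression analysis: fit human ratings on all z-scored metrics
#    jointly; coefficient magnitudes show each metric's contribution
#    conditional on the others, which can differ from its individual rho.
X = np.column_stack([(s - s.mean()) / s.std() for s in metrics.values()])
X = np.column_stack([np.ones(n), X])             # intercept column
coefs, *_ = np.linalg.lstsq(X, human, rcond=None)
print(dict(zip(metrics, coefs[1:].round(3))))
```

The distinction between steps 1 and 2 is exactly what the paper's regression findings exploit: a metric can have a modest individual correlation yet carry a large coefficient once the other metrics are controlled for.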
Loss & Training¶
This is an evaluation study; no model training is involved. All metrics and data are drawn from the T2I-CompBench++ benchmark and the official implementations of each metric.
Key Experimental Results¶
Main Results¶
Spearman Correlation Coefficients (each metric vs. human ratings):
| Metric | Color | Shape | Texture | 2D-Spatial | Non-Spatial | Complex | 3D-Spatial | Numeracy |
|---|---|---|---|---|---|---|---|---|
| CLIPScore | 0.282 | 0.291 | 0.535 | 0.369 | 0.439 | 0.276 | 0.315 | 0.223 |
| HPS | 0.219 | 0.440 | 0.601 | 0.410 | 0.535 | 0.270 | 0.416 | 0.471 |
| ImageReward | 0.580 | 0.520 | 0.734 | 0.394 | 0.512 | 0.424 | 0.401 | 0.484 |
| DA Score | 0.772 | 0.463 | 0.711 | 0.318 | 0.453 | 0.488 | 0.297 | 0.462 |
| VQA Score | 0.678 | 0.405 | 0.701 | 0.533 | 0.495 | 0.638 | 0.339 | 0.473 |
| TIFA | 0.684 | 0.336 | 0.423 | 0.311 | 0.351 | 0.519 | 0.195 | 0.526 |
| DSG | 0.599 | 0.388 | 0.628 | 0.328 | 0.470 | 0.411 | 0.427 | 0.469 |
| CLIP-IQA | 0.092 | 0.078 | -0.001 | 0.088 | 0.082 | 0.027 | 0.098 | 0.068 |
| Aesthetic | 0.056 | 0.195 | 0.078 | 0.136 | 0.061 | 0.051 | 0.123 | 0.036 |
Best Metric per Category:
| Category | Best | Runner-up |
|---|---|---|
| Color | DA Score | TIFA |
| Shape | ImageReward | DA Score |
| Texture | ImageReward | DA Score |
| 2D Spatial | VQA Score | HPS |
| Non-Spatial | HPS | ImageReward |
| Complex | VQA Score | TIFA |
| 3D Spatial | DSG | HPS/BLIP-2 |
| Numeracy | TIFA | ImageReward |
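The per-category rankings above can be collapsed into a simple lookup for practitioners. The helper below is a hypothetical convenience, with categories and winners transcribed from the table:

```python
# Best-performing metric per compositional category, transcribed from the
# table above (hypothetical helper; names follow the paper's tables).
BEST_METRIC = {
    "color": "DA Score",
    "shape": "ImageReward",
    "texture": "ImageReward",
    "2d_spatial": "VQA Score",
    "non_spatial": "HPS",
    "complex": "VQA Score",
    "3d_spatial": "DSG",
    "numeracy": "TIFA",
}

def recommend_metric(category: str) -> str:
    """Return the metric with the highest Spearman correlation to human
    ratings for the given compositional category."""
    key = category.strip().lower().replace(" ", "_").replace("-", "_")
    if key not in BEST_METRIC:
        raise KeyError(f"unknown category: {category!r}")
    return BEST_METRIC[key]

print(recommend_metric("3D Spatial"))  # DSG
```

This encodes the paper's practical takeaway directly: the right metric depends on which compositional challenge is being evaluated.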
Ablation Study¶
Regression coefficient analysis reveals that the joint contributions of metrics differ from their individual correlations. HPS emerges as notably more important in regression, recording the largest regression coefficients across multiple categories (Shape: 0.761, 2D Spatial: 1.143, Non-Spatial: 0.629, Numeracy: 1.277). CLIP-IQA and Aesthetic exhibit regression coefficients near zero or negative.
Score distribution characteristics:

- Embedding-based metrics cluster in the middle range (0.25–0.5), offering limited discriminability.
- VQA-based metrics are heavily right-skewed and saturate near 1.0, making it difficult to distinguish high-quality candidates.
- Image-only metrics display varied distributional characteristics but contribute nothing to compositional alignment.
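Both pathologies are easy to check numerically. The sketch below uses synthetic score samples (a right-skewed Beta for a VQA-like metric and a narrow Gaussian for an embedding-like one; both distributions and thresholds are illustrative assumptions, not the paper's data):

```python
import numpy as np

def distribution_report(scores: np.ndarray) -> dict:
    """Simple diagnostics for the pathologies described above:
    saturation (mass piled near the top of the range) and
    compression (scores squeezed into a narrow band)."""
    q75, q25 = np.percentile(scores, [75, 25])
    return {
        "frac_above_0.95": float(np.mean(scores > 0.95)),  # saturation
        "iqr": float(q75 - q25),                           # compression
        "mean": float(scores.mean()),
    }

rng = np.random.default_rng(0)
vqa_like = rng.beta(8, 1, 1000)            # right-skewed, saturates near 1.0
embed_like = rng.normal(0.37, 0.05, 1000)  # clustered in the middle range

print(distribution_report(vqa_like))
print(distribution_report(embed_like))
```

A metric with a large `frac_above_0.95` cannot separate strong candidates from each other, and a tiny IQR means quality differences barely move the score; either property limits a metric's usefulness as a reward signal.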
Key Findings¶
- No universal best metric: No single metric consistently leads across all 8 compositional categories.
- CLIPScore underperforms: Despite being the most widely used metric, it never ranks in the top-2 for any category.
- VQA metrics are not always optimal: Embedding-based metrics outperform them on Shape, Texture, and Non-Spatial categories.
- ImageReward and HPS are standouts: They appear in the top-3 for 6 and 4 categories, respectively.
- Image-only metrics are ineffective: CLIP-IQA and Aesthetic yield extremely low correlations (<0.2) across all categories.
- VQA metric saturation: Scores concentrate near 1.0, hampering discrimination between outputs.
- Embedding metric compression: Scores cluster in the middle range, failing to reflect quality differences.
Highlights & Insights¶
- Filling an evaluation gap: This is the first systematic comparison of 12 metrics against human judgment on fine-grained compositional tasks.
- Practical guidance: Provides researchers with evidence-based criteria for metric selection—metrics should be chosen according to the specific type of compositional challenge.
- Insightful distribution analysis: Uncovers the saturation problem in VQA metrics and the compression problem in embedding-based metrics.
- Warning for reward model use: When metrics serve as reward signals, their biases directly mislead model training.
- Multi-dimensional analysis: Goes beyond correlation analysis to include regression and distribution analysis, revealing more comprehensive patterns.
Limitations & Future Work¶
- Relies solely on the T2I-CompBench++ benchmark and its 6 relatively dated generators, without coverage of recent models (DALL-E 3, SD3, FLUX, etc.).
- Only linear correlation and linear regression are analyzed; nonlinear relationships are not explored.
- No new metric or improvement is proposed; the contribution remains at the analysis level.
- The sample size of 2,400 may be insufficient for certain categories.
- The consistency and bias of human judgments themselves are not deeply analyzed.
- Computational efficiency of the metrics is not compared.
Related Work & Insights¶
- T2I-CompBench++ (2024): Provides a structured compositional evaluation benchmark with human annotations.
- VQAScore (Lin et al., 2024): Evaluates alignment via VQA-based question answering; performs strongly across multiple categories.
- ImageReward (Xu et al., 2023): Trained on ranked human preferences; delivers the most consistently stable overall performance.
- ReNO (Eyring et al., 2024): Uses a combination of metrics as the reward for noise optimization, validating the importance of composite metrics.
- Insight: Evaluation metrics themselves require meta-evaluation; future work should develop universal or adaptive composite metrics.
Rating¶
- Novelty: 3/5 — A systematic evaluation study with no significant methodological innovation.
- Practical Value: 5/5 — Directly informs metric selection in T2I evaluation.
- Experimental Thoroughness: 4/5 — Comprehensive comparison of 12 metrics × 8 categories, but limited to a single data source.
- Writing Quality: 4/5 — Clear structure, rich tables, and well-defined conclusions.