
Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation

Conference: NeurIPS 2025 arXiv: 2509.21227 Code: None (project page available) Area: Image Generation Keywords: evaluation metrics, compositional alignment, text-to-image, VQA metrics, human judgment

TL;DR

This paper systematically evaluates 12 text-image compositional alignment metrics against human judgments. It finds that no single metric consistently outperforms the others across compositional categories, that VQA-based metrics are not always superior, and that embedding-based metrics such as ImageReward and HPS are stronger on certain categories.

Background & Motivation

Evaluation of text-to-image generation relies heavily on automated metrics, yet the reliability of these metrics in reflecting human preferences remains an open question. Several critical issues characterize the current state of the field:

  • Background: Most metrics are adopted based on popularity or convention rather than systematic validation against human judgment.
  • Limitations of Prior Work: Model comparisons and rankings depend directly on the chosen metrics; an erroneous choice of metric can mislead research directions. Furthermore, an increasing number of methods (ReNO, DPOK, etc.) use these metrics as reward signals for reinforcement learning, meaning metric bias directly affects model training.
  • Key Challenge: Compositional alignment—covering entity presence, attribute binding (color/shape/texture), spatial relations (2D/3D), non-spatial relations, and counting accuracy—is a core challenge in T2I generation, yet no prior work has comprehensively compared metrics against human judgments across fine-grained compositional categories.
  • Goal: This work presents the first comprehensive comparison of 12 metrics across 8 compositional categories against human judgment.

Method

Overall Architecture

Evaluation Design: The study builds on the T2I-CompBench++ benchmark, comprising 2,400 text-image samples across 8 compositional categories. Images are generated by 6 T2I models (SD v1.4, SD v2, Structured Diffusion, Composable Diffusion, Attend-and-Excite, GORS), and every sample is annotated with human ratings.

The 12 evaluated metrics are grouped into three classes:

Embedding-based metrics (5):

  • CLIPScore: Cosine similarity of CLIP text and image embeddings
  • PickScore: CLIP fine-tuned on pairwise preference judgments
  • HPS: CLIP fine-tuned on human comparison data
  • ImageReward: Adds a reward head, trained on ranked human preference data
  • BLIP-2: Compares image-generated captions against the input text
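
To make the embedding-based family concrete, below is a minimal sketch of the cosine-similarity core behind CLIPScore, using a Hugging Face CLIP checkpoint as a stand-in. This is not the official implementation: the published metric additionally rescales and clips this value, and the other embedding metrics swap in preference-tuned encoders or reward heads.

```python
# Minimal CLIPScore-style sketch: cosine similarity between CLIP text and image
# embeddings. Checkpoint and function name are illustrative, not the paper's code.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, prompt: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())
```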

VQA-based metrics (5):

  • VQAScore: Generates yes/no questions from the text and answers them with a VQA model
  • TIFA: Uses structured question templates to cover objects, attributes, and relations
  • DA Score: Tests entity-attribute binding
  • DSG: Converts the text into a scene graph to verify entities and relations
  • B-VQA: Decomposes the text into object-attribute pairs and queries BLIP-VQA for each
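
As a deliberately simplified example of the VQA-based family, the sketch below scores an image by asking BLIP-VQA a hand-written list of yes/no questions and counting "yes" answers. The real metrics derive the questions automatically from the prompt (templates, scene graphs, attribute decomposition), and VQAScore uses answer probabilities rather than counts.

```python
# Simplified B-VQA-style scorer: fraction of yes/no questions that BLIP-VQA
# answers "yes". The question list is written by hand here; the actual metrics
# generate questions automatically from the prompt.
from PIL import Image
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")

def vqa_alignment_score(image_path: str, questions: list[str]) -> float:
    image = Image.open(image_path).convert("RGB")
    yes = sum(vqa(image=image, question=q)[0]["answer"].strip().lower().startswith("yes")
              for q in questions)
    return yes / len(questions)

# Illustrative usage for the prompt "a red cube on top of a blue sphere"
score = vqa_alignment_score("sample.png", [
    "Is there a red cube?",
    "Is there a blue sphere?",
    "Is the cube on top of the sphere?",
])
```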

Image-only metrics (2):

  • CLIP-IQA: Regresses image quality from CLIP embeddings
  • Aesthetic Score: Estimates aesthetic value based on large-scale human ratings
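
A similarly minimal sketch of the CLIP-IQA idea: instead of comparing against the prompt, the image embedding is compared against an antonym prompt pair ("Good photo." / "Bad photo.") and the softmax over the two similarities is read as a quality score. The prompt pair and fixed logit scale below are illustrative assumptions, not the official implementation.

```python
# Hedged CLIP-IQA-style sketch: rate image quality from CLIP alone by comparing
# the image embedding against an antonym prompt pair and taking a softmax over
# the two similarities. Prompt pair and logit scale are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_iqa_like(image_path: str, pos: str = "Good photo.", neg: str = "Bad photo.") -> float:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[pos, neg], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sims = (txt @ img.T).squeeze(-1) * 100.0     # CLIP-style logit scaling, fixed here
    return float(torch.softmax(sims, dim=0)[0])  # probability mass on the "good" prompt
```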

Key Designs

Multi-dimensional Analysis Strategy:

  1. Correlation Analysis: Spearman correlation (the primary statistic) between each metric and human ratings, computed per compositional category and supplemented by Pearson and Kendall correlations (see the sketch after this list).
  2. Regression Analysis: Linear regression models fitted per category (human ratings as target, all metrics as features) to analyze joint contributions.
  3. Distribution Analysis: Score distribution characteristics of each metric are examined to reveal saturation and compression issues.
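
A minimal sketch of steps 1 and 2, assuming a pandas DataFrame with one row per text-image sample in a single category, a "human" rating column, and one column per metric (column names are illustrative, not the paper's code):

```python
# Per-category correlation and regression analyses (steps 1-2 above).
import pandas as pd
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

def analyze_category(df: pd.DataFrame, metric_cols: list[str]) -> None:
    # Step 1: rank correlation of each metric with human ratings
    for m in metric_cols:
        rho, p = spearmanr(df[m], df["human"])
        print(f"{m:<12s} Spearman rho = {rho:.3f} (p = {p:.3g})")
    # Step 2: joint linear regression (human rating ~ all metrics) to compare
    # each metric's contribution when the others are controlled for
    reg = LinearRegression().fit(df[metric_cols], df["human"])
    for m, coef in zip(metric_cols, reg.coef_):
        print(f"{m:<12s} regression coefficient = {coef:.3f}")
```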

Loss & Training

This is an evaluation study; no model training is involved. All metrics and data are drawn from the T2I-CompBench++ benchmark and the official implementations of each metric.

Key Experimental Results

Main Results

Spearman Correlation Coefficients (each metric vs. human ratings):

Metric       Color  Shape  Texture  2D-Spatial  Non-Spatial  Complex  3D-Spatial  Numeracy
CLIPScore    0.282  0.291  0.535    0.369       0.439        0.276    0.315       0.223
HPS          0.219  0.440  0.601    0.410       0.535        0.270    0.416       0.471
ImageReward  0.580  0.520  0.734    0.394       0.512        0.424    0.401       0.484
DA Score     0.772  0.463  0.711    0.318       0.453        0.488    0.297       0.462
VQA Score    0.678  0.405  0.701    0.533       0.495        0.638    0.339       0.473
TIFA         0.684  0.336  0.423    0.311       0.351        0.519    0.195       0.526
DSG          0.599  0.388  0.628    0.328       0.470        0.411    0.427       0.469
CLIP-IQA     0.092  0.078  -0.001   0.088       0.082        0.027    0.098       0.068
Aesthetic    0.056  0.195  0.078    0.136       0.061        0.051    0.123       0.036

Best Metric per Category:

Category     Best         Runner-up
Color        DA Score     TIFA
Shape        ImageReward  DA Score
Texture      ImageReward  DA Score
2D Spatial   VQA Score    HPS
Non-Spatial  HPS          ImageReward
Complex      VQA Score    TIFA
3D Spatial   DSG          HPS/BLIP-2
Numeracy     TIFA         ImageReward

Ablation Study

Regression coefficient analysis reveals that the joint contributions of metrics differ from their individual correlations. HPS emerges as notably more important in regression, recording the largest regression coefficients across multiple categories (Shape: 0.761, 2D Spatial: 1.143, Non-Spatial: 0.629, Numeracy: 1.277). CLIP-IQA and Aesthetic exhibit regression coefficients near zero or negative.

Score distribution characteristics (see the diagnostic sketch after this list):

  • Embedding-based metrics cluster in the middle range (0.25–0.5), offering limited discriminability.
  • VQA-based metrics are heavily skewed toward the top of the range and saturate near 1.0, making it difficult to distinguish high-quality candidates.
  • Image-only metrics display varied distributional characteristics but contribute nothing to compositional alignment.
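
The compression and saturation effects above can be quantified with simple distribution statistics; a hedged sketch, where the 0.95 threshold is an illustrative choice rather than the paper's:

```python
# Distribution diagnostics for one metric's scores: skewness, interquartile
# range (small IQR -> compressed scores), and the share of scores piled near
# the maximum (saturation near 1.0).
import numpy as np
from scipy.stats import skew

def distribution_summary(scores: np.ndarray) -> dict:
    q25, q75 = np.percentile(scores, [25, 75])
    return {
        "skewness": float(skew(scores)),                     # mass piled toward one end
        "iqr": float(q75 - q25),                             # compression indicator
        "share_above_0.95": float((scores > 0.95).mean()),   # saturation indicator
    }
```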

Key Findings

  1. No universal best metric: No single metric consistently leads across all 8 compositional categories.
  2. CLIPScore underperforms: Despite being the most widely used metric, it never ranks in the top-2 for any category.
  3. VQA metrics are not always optimal: Embedding-based metrics outperform them on Shape, Texture, and Non-Spatial categories.
  4. ImageReward and HPS are standouts: They appear in the top-3 for 6 and 4 categories, respectively.
  5. Image-only metrics are ineffective: CLIP-IQA and Aesthetic yield extremely low correlations (<0.2) across all categories.
  6. VQA metric saturation: Scores concentrate near 1.0, hampering discrimination between outputs.
  7. Embedding metric compression: Scores cluster in the middle range, failing to reflect quality differences.

Highlights & Insights

  • Filling an evaluation gap: This is the first systematic comparison of 12 metrics against human judgment on fine-grained compositional tasks.
  • Practical guidance: Provides researchers with evidence-based criteria for metric selection—metrics should be chosen according to the specific type of compositional challenge.
  • Insightful distribution analysis: Uncovers the saturation problem in VQA metrics and the compression problem in embedding-based metrics.
  • Warning for reward model use: When metrics serve as reward signals, their biases directly mislead model training.
  • Multi-dimensional analysis: Goes beyond correlation analysis to include regression and distribution analysis, revealing more comprehensive patterns.

Limitations & Future Work

  • Relies solely on the T2I-CompBench++ benchmark (6 relatively older models), without coverage of recent models (DALL-E 3, SD3, FLUX, etc.).
  • Only linear correlation and linear regression are analyzed; nonlinear relationships are not explored.
  • No new metric or improvement is proposed; the contribution remains at the analysis level.
  • The sample size of 2,400 may be insufficient for certain categories.
  • The consistency and bias of human judgments themselves are not deeply analyzed.
  • Computational efficiency of the metrics is not compared.

Related Work

  • T2I-CompBench++ (2024): Provides a structured compositional evaluation benchmark with human annotations.
  • VQAScore (Lin et al., 2024): Evaluates alignment via VQA-based question answering; performs strongly across multiple categories.
  • ImageReward (Xu et al., 2024): Trained on ranked human preferences; delivers the most consistently stable overall performance.
  • ReNO (Eyring et al., 2024): Uses a combination of metrics as rewards for noise optimization, validating the importance of composite metrics.
  • Insight: Evaluation metrics themselves require meta-evaluation; future work should develop universal or adaptive composite metrics.

Rating

  • Novelty: 3/5 — A systematic evaluation study with no significant methodological innovation.
  • Practical Value: 5/5 — Directly informs metric selection in T2I evaluation.
  • Experimental Thoroughness: 4/5 — Comprehensive comparison of 12 metrics × 8 categories, but limited to a single data source.
  • Writing Quality: 4/5 — Clear structure, rich tables, and well-defined conclusions.