EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations¶
Conference: ACL 2025
arXiv: 2506.24016
Code: Available
Area: Interpretability
Keywords: Image Captioning Evaluation, Explainable Evaluation Metric, Structured Explanations, VLM Fine-Tuning, Reference-Free Evaluation
TL;DR¶
This paper proposes EXPERT, a reference-free image captioning evaluation metric based on VLM fine-tuning. By constructing a large-scale structured explanation dataset and designing a two-stage evaluation template, it achieves SOTA performance on multiple benchmark datasets while providing high-quality structured explanations across three dimensions: fluency, relevance, and descriptiveness.
Background & Motivation¶
Automatic evaluation of image captioning is crucial for measuring and improving models. Recently, explainable evaluation metrics have gained attention—they provide not only numerical scores but also textual explanations. However, existing explainable metrics have two key limitations:
Lack of standardized guidelines for explanations: Explanations generated by existing metrics (such as FLEUR) lack uniform evaluation dimensions and formats, leading to inconsistent content and structures.
Quality of explanations remains unverified: Prior research lacks a systematic evaluation of the quality of generated explanations—even if the scores are accurate, the explanations may contain errors or irrelevant information.
Core Problem: How to build an image captioning evaluation metric that both scores accurately and provides high-quality, explainable feedback?
Method¶
Overall Architecture¶
The construction of EXPERT consists of three phases:
- Dataset Construction: Extending existing human evaluation datasets (Polaris, Nebula) by adding structured explanations for each image-caption pair.
- Template Design: Designing a two-stage evaluation template (scoring first, followed by explanation).
- Supervised Fine-Tuning: Performing SFT on LLaVA-1.5 (13B).
Key Designs¶
1. Structured Explanation Dataset Construction¶
Two datasets are extended to obtain large-scale training data:
| Dataset | Original Dataset | Number of Explanations |
|---|---|---|
| Polaris-exp | Polaris (Wada et al., 2024) | 16,014 |
| Nebula-exp | Nebula (Matsuda et al., 2024) | 26,152 |
| Total | 42,166 |
Each explanation is based on three standardized dimensions: - Fluency: Whether the description is fluent, natural, and grammatically correct. - Relevance: Whether the description correctly represents the visual content and is closely related to the image. - Descriptiveness: Whether the description is precise, informative, and covers important details of the image.
Explanations are generated using GPT-4o and verified via human evaluation.
Dataset Quality Verification (Table 1):
| Evaluation Dimension | Average Score (4-point scale) | Standard Deviation |
|---|---|---|
| Consistency | 3.72 | 0.52 |
| Factuality | 3.84 | 0.39 |
| Informativeness | 3.72 | 0.45 |
Four native English-speaking annotators evaluated 100 uniformly sampled explanations, confirming their high quality.
2. Two-Stage Evaluation Template¶
The template adopts a scoring-explanation order, which has been proven effective by prior research:
First Stage - Scoring: - Query: Requesting a score for the image-caption pair. - Response: Human ratings from the dataset. - Score Binning: Rounding scores to the nearest multiple of 0.10 to simplify numerical representation.
Second Stage - Explanation: - Query: Requesting a brief explanation based on three dimensions + descriptions of each dimension + predefined output format. - Response: Corresponding structured explanation from the dataset.
Key design points of the template: - The two stages use the same dimension descriptions to maintain consistency. - Predefined output formats ensure a uniform structure for explanations. - The scoring stage uses human scores from the dataset, while the explanation stage uses GPT-4o generated explanations.
3. Supervised Fine-Tuning (SFT)¶
Base model: LLaVA-1.5 (13B)
Data processing: - Merging the training sets of Polaris-exp and Nebula-exp. - For multiple annotator scores for the same image-caption pair in Polaris -> Take the mean. - For duplicate pairs across datasets -> Merge and take the mean. - Converting to the two-stage evaluation template format for training.
4. Score Smoothing during Inference¶
Using Score Smoothing to obtain finer-grained scores:
where \(p(i,j)\) is the probability of generating digit \(i\) at the \(j\)-th decimal place. Compared to greedy decoding (which directly takes the most probable digit), score smoothing leverages probability distribution information to produce more continuous scores.
Loss & Training¶
- Standard SFT loss (cross-entropy)
- Full parameter fine-tuning based on LLaVA-1.5 (13B)
- Greedy decoding is used to ensure determinism and reproducibility
- Score smoothing is applied during the inference stage and does not affect training
Key Experimental Results¶
Main Results¶
Performance on Multiple Human Evaluation Benchmarks (Table 2 Summary):
| Metric | Flickr8k-EX (τc) | Flickr8k-CF (τb) | COMPOSITE (τc) | Polaris (τc) | Nebula (τc) |
|---|---|---|---|---|---|
| CLIPScore | 51.2 | 34.4 | 53.8 | 52.3 | 46.9 |
| PAC-S | 54.3 | 36.0 | 55.7 | 52.5 | 47.2 |
| FLEUR | 53.0 | 38.6 | 63.5 | 58.3 | 51.7 |
| HICE-S | 56.4 | 37.2 | 57.9 | - | - |
| EXPERT | 56.7 | 39.3 | 65.0 | 61.1 | 54.9 |
| GPT-4o | 54.3 | 39.3 | 65.9 | 58.2 | 54.3 |
EXPERT achieves SOTA among all reference-free metrics (except on Pascal-50S). It even outperforms many metrics that require reference captions (such as CLIPScore, PAC-S, FLEUR, etc.).
Comparison with GPT-4o: EXPERT performs on par with or better than GPT-4o on most datasets, demonstrating that a carefully fine-tuned 13B model can match ultra-large-scale models.
Ablation Study¶
Human Evaluation of Explanation Quality (Figure 4):
| Metric | Consistency | Factuality | Informativeness |
|---|---|---|---|
| FLEUR | ~2.5 | ~2.8 | ~2.3 |
| EXPERT_{w/o SFT} | ~2.3 | ~2.5 | ~2.1 |
| EXPERT | ~3.4 | ~3.5 | ~3.2 |
Key findings: - EXPERT leads significantly across all dimensions, with differences being statistically significant at the 0.01 level. - Standardized dimensions alone are insufficient (EXPERT_{w/o SFT} does not outperform FLEUR); they must be combined with supervised training on high-quality explanations.
Qualitative Analysis (Figure 3): - Example 1: The caption mentions "three dogs" but there is only one in the image. FLEUR misses the detail that the caption omitted the frisbee, whereas EXPERT accurately points out the lack of description regarding the dog chasing the frisbee. - Example 2: The caption is grammatically incomplete. FLEUR incorrectly interprets the caption as mentioning a "blue bed", while EXPERT correctly identifies the grammatical incompleteness.
Key Findings¶
- Small Models Can Outperform GPT-4o: The 13B EXPERT matches or outperforms GPT-4o on evaluation tasks, due to SFT aligning with human preferences and having complete token probabilities for score smoothing.
- Good Generalization Across Datasets: It performs exceptionally well on datasets outside of Polaris/Nebula (the training sources), indicating that human evaluation preferences enjoy a degree of consistency across different datasets.
- Both Structured Explanations and Supervised Training are Indispensable: Structured prompting without training does not perform better than FLEUR, but yields massive improvements after training.
- Most Common Error Type: Over-penalizing captions that lack details.
Highlights & Insights¶
- First Systematic Evaluation of Explanation Quality of Explainable Metrics: Prior work only focused on scoring accuracy; this paper is the first to directly evaluate explanation quality as a metric.
- Efficient Data Construction Strategy: Leveraging GPT-4o to generate explanations followed by human quality verification to obtain 42K+ high-quality training data at a low cost.
- Interplay of Score Binning and Score Smoothing: Simplifying numerical representation during training and recovering precision using probability distributions during inference is a clever engineering design.
- Practical Three-Dimension Evaluation Framework: Fluency, relevance, and descriptiveness cover the core aspects of image captioning evaluation, ensuring high explainability.
Limitations & Future Work¶
- Slow Inference Speed: Generating structured explanations requires a large number of output tokens, significantly increasing inference time.
- Propensity for Over-penalization: The most common error is assigning excessively low scores to captions that lack minor details.
- English-Only Support: Fine-tuning based on LLaVA-1.5 limits its multilingual capabilities.
- Dependence of Explanation Data on GPT-4o: If GPT-4o exhibits systematic bias in its judgments, this bias will be propagated to EXPERT.
- Promising Directions: Larger base models (e.g., LLaVA-NeXT), more evaluation dimensions (e.g., creativity, humor), and multilingual extensions.
Related Work & Insights¶
- FLEUR (Lee et al., 2024): The only prior work on explainable reference-free metrics, which uses the scoring-explanation order but lacks standardized dimensions.
- Polaris (Wada et al., 2024) / Nebula (Matsuda et al., 2024): Human evaluation datasets extended by this work.
- CLIPScore (Hessel et al., 2021): A baseline reference-free metric based on CLIP.
- HICE-S (Zeng et al., 2024): Hierarchical image captioning evaluation, which outperforms EXPERT on Pascal-50S.
- Insight: The supervised fine-tuning paradigm of "scoring + structured explanation" can be generalized to the evaluation of other generative tasks (such as summarization evaluation, dialogue evaluation, and code evaluation).
Rating¶
| Dimension | Score (1-5) |
|---|---|
| Novelty | 4 |
| Practicality | 4.5 |
| Experimental Thoroughness | 5 |
| Writing Quality | 4.5 |
| Overall Rating | 4.5 |
The experiments are incredibly solid—6 benchmark datasets, 20+ baseline comparisons, and human evaluations to verify explanation quality. Both the data construction and template design are highly elegant. As a paper on evaluation metrics, it balances accuracy and explainability exceptionally well, offering high practical value.