Taxonomy-Aware Evaluation of Vision-Language Models¶

Conference: CVPR 2025
arXiv: 2504.05457
Code: https://github.com/vesteinn/vlm-eval
Area: Multimodal VLM
Keywords: VLM Evaluation, Fine-Grained Visual Classification, Hierarchical Metrics, Taxonomy Mapping, Text-Taxonomy Alignment

TL;DR¶

Proposes a taxonomy-aware evaluation framework for VLMs. The free-text output of a VLM is mapped onto a node in a taxonomic tree, and hierarchical precision (hP) and hierarchical recall (hR) then quantify the correctness and specificity of the prediction. This addresses the failure of traditional exact-match and text-similarity metrics to give credit to "partially correct" answers.

Background & Motivation¶

When asked to identify an entity in an image, a VLM might output "I see a conifer" instead of the precise label "norway spruce". This exposes two major challenges in VLM evaluation: (1) Free-form text generated by VLMs needs to be mapped to the evaluation label space; (2) Evaluation metrics should assign partial scores to responses that are insufficiently specific but not incorrect ("conifer" is a parent category of "norway spruce" and should not be penalized as completely wrong).

Limitations of Prior Work: Current VLM classification evaluations employ a binary approach (completely correct or completely incorrect), failing to leverage the hierarchical label structures inherent in many classification tasks. Standard text similarity metrics (BLEU, ROUGE, BERTScore, etc.) also fail to truly capture taxonomic distance.

Key Challenge: The diversity and uncertainty of VLM outputs vs. the structured judgment required for evaluation. A VLM may be highly "accurate" (never straying from the correct subtree) but not "specific" (providing only high-level categories), a distinction that traditional accuracy completely fails to capture.

Key Insight: Leverages existing taxonomic knowledge graphs (such as Wikidata and iNaturalist's Catalogue of Life) to map the free-form text output of VLMs to taxonomic nodes, followed by evaluation using hierarchical precision and recall.

Method¶

Overall Architecture¶

The framework consists of three steps: (1) The VLM generates a free-text prediction for an image; (2) The text is mapped to a taxonomic tree node (based on CLIP similarity + heuristics); (3) Hierarchical precision (hP) and hierarchical recall (hR) are calculated between the predicted and ground-truth nodes. The framework supports taxonomic trees extracted from Wikidata and Catalogue of Life, covering various domains such as food, sports, flora and fauna, cars, landmarks, etc.

Key Designs¶

Hierarchical Precision and Hierarchical Recall Metrics (hP/hR)
- Function: Quantifies the "correctness" (whether it deviates from the correct path) and "specificity" (how much information along the correct path is predicted) of VLM predictions, respectively.
- Mechanism: For each of the $N$ evaluated samples, with predicted node $v_n^{pr}$ and ground-truth node $v_n^{gt}$, the size of the intersection of their ancestor sets is divided by the size of the predicted node's ancestor set (for hP) or of the ground-truth node's ancestor set (for hR), and the ratios are averaged over all samples: $$hP = \frac{1}{N}\sum_{n=1}^{N}\frac{|anc(v_n^{pr}) \cap anc(v_n^{gt})|}{|anc(v_n^{pr})|}$$ $$hR = \frac{1}{N}\sum_{n=1}^{N}\frac{|anc(v_n^{pr}) \cap anc(v_n^{gt})|}{|anc(v_n^{gt})|}$$
- Design Motivation: hP = 1 indicates that the prediction, while potentially not specific enough, contains no incorrect information (e.g., predicting "conifer" when the ground truth is "norway spruce"); a low hR indicates that the prediction is missing information. The harmonic mean of the two (hF) provides a single combined score.
- Example: Image is "Train", prediction is "a mode of transport" $\rightarrow$ hP=1.00, hR=0.75 (correct but not specific); Image is "Pool", prediction is "high jump" $\rightarrow$ hP=0.67, hR=0.67 (partially incorrect).
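
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy taxonomy `PARENT` is made up, the helper names are hypothetical, and two details are assumptions: that $anc(v)$ includes the node $v$ itself, and that hF is computed from the dataset-level hP and hR.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "plant": None,
    "tree": "plant",
    "conifer": "tree",
    "norway spruce": "conifer",
    "oak": "tree",
}


def ancestors(node, parent):
    """Return the node plus all of its ancestors up to the root (assumed convention)."""
    out = set()
    while node is not None:
        out.add(node)
        node = parent[node]
    return out


def hierarchical_scores(pairs, parent):
    """Average hP and hR over (predicted, ground_truth) pairs, then derive hF."""
    hp = hr = 0.0
    for pred, gt in pairs:
        a_pred, a_gt = ancestors(pred, parent), ancestors(gt, parent)
        shared = len(a_pred & a_gt)
        hp += shared / len(a_pred)  # share of the prediction that is correct
        hr += shared / len(a_gt)    # share of the ground truth that is recovered
    hp /= len(pairs)
    hr /= len(pairs)
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0  # harmonic mean
    return hp, hr, hf


# Correct but unspecific: "conifer" for a norway spruce.
print([round(s, 2) for s in hierarchical_scores([("conifer", "norway spruce")], PARENT)])
# -> [1.0, 0.75, 0.86]

# Partially wrong: "oak" shares only "tree" and "plant" with the ground truth.
print([round(s, 2) for s in hierarchical_scores([("oak", "norway spruce")], PARENT)])
# -> [0.67, 0.5, 0.57]
```

In this toy tree, the unspecific prediction keeps hP at 1 and only loses recall, while the wrong leaf is penalized on both metrics.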
Text-to-Taxonomy Mapping Algorithm (Algorithm 1)
- Function: Reliably maps free-form text generated by VLMs to nodes in a taxonomic tree.
- Mechanism: A multi-stage matching strategy—first utilizes CLIP similarity to retrieve the top-k candidate nodes, then sequentially attempts exact matching and n-gram overlap matching (n=4, 3, 2). When the differences among top candidates are minor (ambiguous scores), the common ancestor of the candidates is identified and used as a conservative prediction.
- Design Motivation: VLM outputs vary wildly, making pure text matching prone to failure. A multi-stage strategy combining CLIP semantic similarity and string matching is more robust. The common ancestor fallback mechanism yields conservative yet correct predictions under uncertainty.
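
The staged matching described above might look roughly like the sketch below. It is not Algorithm 1 from the paper: the `similarity` function uses `difflib` string similarity as a stand-in for CLIP similarity, the n-grams are assumed to be word-level, and the parameters `k` and `margin`, the toy taxonomy, and all function names are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "plant": None,
    "tree": "plant",
    "conifer": "tree",
    "norway spruce": "conifer",
    "scots pine": "conifer",
    "oak": "tree",
}


def path_to_root(node):
    """List the node and its ancestors, node first and root last."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def common_ancestor(nodes):
    """Deepest node that lies on the root path of every given node."""
    shared = set(path_to_root(nodes[0]))
    for node in nodes[1:]:
        shared &= set(path_to_root(node))
    return next(n for n in path_to_root(nodes[0]) if n in shared)


def ngrams(text, n):
    """Set of word-level n-grams (assumed granularity)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def similarity(a, b):
    """Stand-in for CLIP similarity: plain string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_to_taxonomy(text, k=3, margin=0.05):
    # Stage 0: retrieve the top-k candidate nodes by similarity.
    top = sorted(PARENT, key=lambda label: similarity(text, label), reverse=True)[:k]
    # Stage 1: exact string match against a candidate.
    for label in top:
        if label == text.lower().strip():
            return label
    # Stage 2: n-gram overlap, longest n-grams first.
    for n in (4, 3, 2):
        for label in top:
            if ngrams(text, n) & ngrams(label, n):
                return label
    # Stage 3: if several candidates score almost the same, back off to
    # their common ancestor as a conservative prediction.
    best = similarity(text, top[0])
    close = [label for label in top if best - similarity(text, label) <= margin]
    return common_ancestor(close) if len(close) > 1 else top[0]


print(map_to_taxonomy("I see a norway spruce"))           # -> norway spruce
print(common_ancestor(["norway spruce", "scots pine"]))   # -> conifer
```

The back-off in the last stage is what trades specificity for correctness: two sibling species that cannot be told apart collapse to their shared parent rather than to an arbitrary leaf.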
Taxonomy Construction
- Function: Constructs a taxonomy from the Wikidata knowledge graph that satisfies the definition of a "rooted directed tree."
- Mechanism: Uses the "subclass of" relationship in Wikidata to construct the tree, retaining the longest path when multiple paths exist, and choosing randomly in case of ties. High-level abstract classes that introduce cycles are excluded.
- Design Motivation: Knowledge graphs are not trees by default and require customized extraction. The framework covers two benchmarks: iNaturalist21 (a species taxonomy with 10,000 leaf nodes) and OVEN (a Wikidata-based taxonomy aggregating multiple FGVC datasets, including ImageNet21k and Cars196).
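
One way to turn a multi-parent "subclass of" graph into a rooted tree under the longest-path rule is sketched below. It is an illustration under stated assumptions, not the paper's extraction code: the graph is assumed to be acyclic already (cycle-inducing classes removed beforehand), every node is assumed to reach the root, and `extract_tree`, its `seed` parameter, and the example graph are hypothetical.

```python
import random
from functools import lru_cache


def extract_tree(parents, root, seed=0):
    """Keep one parent per node so that each node retains its longest path to the root.

    `parents` maps each non-root node to the list of its "subclass of" parents.
    """
    rng = random.Random(seed)

    @lru_cache(maxsize=None)
    def depth(node):
        # Length of the longest path from `node` up to the root.
        if node == root:
            return 0
        return 1 + max(depth(p) for p in parents[node])

    tree = {root: None}
    for node, candidates in parents.items():
        best = max(depth(p) for p in candidates)
        # Ties between equally deep parents are broken at random.
        tree[node] = rng.choice([p for p in candidates if depth(p) == best])
    return tree


# Hypothetical graph in which two nodes have a shortcut edge to a higher class.
graph = {
    "tree": ["plant"],
    "conifer": ["tree", "plant"],
    "norway spruce": ["conifer", "tree"],
}
print(extract_tree(graph, root="plant"))
# -> {'plant': None, 'tree': 'plant', 'conifer': 'tree', 'norway spruce': 'conifer'}
```

Choosing the deepest parent keeps the most intermediate categories on each path, which matters for hR because a longer ground-truth path leaves more room to distinguish unspecific from specific predictions.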

Key Experimental Results¶

Correlation Between Existing Text Similarity Metrics and Hierarchical Metrics (Tab. 1)¶

Similarity Metric	iNat21 $\tau$-hP	iNat21 $\tau$-hR	OVEN $\tau$-hP	OVEN $\tau$-hR
Exact Match	0.01	0.07	0.01	0.01
BERTScore	0.01	0.31	0.27	0.18
CLIP-i2t	0.35	0.49	0.35	0.34
- The Kendall $\tau$ correlation between existing metrics and hierarchical metrics is generally extremely low, indicating that they cannot substitute for taxonomy-aware evaluations.
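
For readers unfamiliar with Kendall's $\tau$, the sketch below computes the simple tau-a variant by counting concordant and discordant pairs. Which variant the paper uses is not stated in this summary, so tau-a is an assumption, and the scores in the example are invented for illustration rather than taken from the paper.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented per-sample scores: a binary metric vs. a graded hierarchical metric.
exact_match = [0, 0, 1, 0]
hp_scores = [0.5, 1.0, 1.0, 0.75]
print(round(kendall_tau_a(exact_match, hp_scores), 2))  # -> 0.33
```

A binary metric ties on most pairs, which is one reason its rank correlation with a graded hierarchical score stays low even when the two never disagree outright.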

Ablation Study¶

Configuration	hF	Description
Exact Match Mapping	0.39	Pure exact string matching, performs the worst
CLIP-t2t Direct Match	0.75	Semantic similarity matching, performs better
CLIP-t2t + Alg.1	0.80	Multi-stage heuristics + common ancestor fallback, optimal
CLIP-i2t + Alg.1	0.80	Image-to-text similarity + algorithm, hF is on par with t2t

Mapping Quality Evaluation (Tab. 2, 416 Human-Annotated Nodes)¶

Method	hP	hR	hF	Exact Match Rate
Exact Match	0.37	0.42	0.39	17.5%
CLIP-t2t + Alg.1	0.79	0.82	0.80	47.1%

VLM Ranking Shift (8 VLMs on iNaturalist21)¶

LLaVA ranks lowest under Exact Match, but its hP is very high (conservative predictions that rarely stray from the correct path), revealing insights missed by traditional metrics.
GPT-4 ranks highest on most metrics, but its hP is lower than that of QVLChat: GPT-4 tends to provide more specific but potentially incorrect predictions.
Prompt tuning experiments: GPT-4 can simultaneously improve both hP and hR (becoming more accurate and more specific), whereas other models typically face a trade-off between hP and hR.

Highlights & Insights¶

Core Contribution: Introduces taxonomy-aware metrics to VLM evaluation on Fine-Grained Visual Classification (FGVC) for the first time, offering two complementary evaluation dimensions: correctness (hP) and specificity (hR).
Counter-Intuitive Finding: Models ranking worse under traditional metrics can perform best under hierarchical precision—conservative predictions do not equate to poor predictions.
Practical Value: hP and hR can serve as feedback signals for prompt tuning (the paper demonstrates an application guiding 30 rounds of prompt optimization for a bird classifier).
Framework Generality: Applicable to any classification task with a hierarchical label structure, not limited to the computer vision domain.

Limitations & Future Work¶

Mapping free-form text to taxonomic nodes is inherently a low-resource problem, lacking large-scale training data to train specialized mappers.
The taxonomy extracted from Wikidata contains noise, and the non-uniform granularity of subtrees affects the interpretation of global average metrics.
Applicable only to classification tasks with a hierarchical label structure; not applicable to tasks like relational reasoning or VQA.
The mapping algorithm relies on the representation quality of CLIP, which may perform poorly on low-frequency or specialized terminology.

Related Work & Insights¶

Hierarchical precision and recall (hP/hR, from Kiritchenko et al.) are the primary source of the core metrics in this work.
The entity linking concept from the OVEN benchmark inspired the text-to-taxonomy mapping.
Can be extended to hierarchical evaluations in scenarios like multi-label classification, open-vocabulary detection, etc.
Insight: VLM evaluation should focus not only on "correctness" but also on "how far off" an error is—an idea that can be generalized to other structured output spaces.

Key Findings¶

The correlation between existing text similarity metrics (EM, BLEU, BERTScore, etc.) and hierarchical metrics is very low (Kendall $\tau$ mostly between 0.01 and 0.31 in the rows reported above, and no higher than 0.49 even for CLIP-i2t), meaning they cannot replace taxonomy-aware evaluation.
LLaVA ranks lowest under traditional Exact Match but achieves high hP—indicating that while it lacks specificity, it rarely produces incorrect information.
GPT-4 is inferior to QVLChat on hP—GPT-4 tends to provide more specific but potentially incorrect predictions, revealing the accuracy-specificity trade-off.
Only GPT-4 can simultaneously improve both hP and hR through prompting; other models face a trade-off between the two.

Rating¶

Novelty: ⭐⭐⭐⭐ Systematically introduces hierarchical precision/recall to VLM evaluation for the first time, offering a unique perspective.
Experimental Thoroughness: ⭐⭐⭐⭐ Features 8 VLMs, two major taxonomies, both synthetic and real-world data, and prompt-tuning experiments, providing comprehensive coverage.
Writing Quality: ⭐⭐⭐⭐⭐ Features clear formal definitions, beautifully designed figures and tables, and rigorous argumentative logic.
Value: ⭐⭐⭐⭐ Fills a theoretical gap in fine-grained VLM classification evaluation; hP/hR can serve as feedback signals for prompt tuning.