Trust but Verify: Programmatic VLM Evaluation in the Wild
Conference: ICCV 2025
arXiv: 2410.13121
Code: Project Page
Area: Multimodal VLMs
Keywords: VLM Evaluation, Hallucination Detection, Scene Graph, Programmatic Verification, Open-ended QA, Helpfulness-Truthfulness Trade-off
TL;DR
This paper proposes PROVE (Programmatic VLM Evaluation), an evaluation paradigm for open-ended visual question answering. It builds high-fidelity scene graphs from ultra-detailed image descriptions and uses an LLM to generate open-ended QA pairs together with programs that verify them. VLM responses are then scored for both helpfulness and truthfulness within the same scene graph framework, which reveals that current models struggle to balance the two.
Background & Motivation
Core Problem
VLMs frequently generate plausible-sounding but actually incorrect responses (hallucinations). However, existing evaluation methods have obvious limitations:
- Discriminative benchmarks (e.g., POPE): They only test binary existence questions ("Is there a person in the image?"), which do not reflect real-world usage.
- Generative benchmarks (e.g., CHAIR, MMHal): They evaluate free-form responses, but those that use an external LLM for scoring often give the judge too little context to verify every claim.
- Limitations of LLM-as-a-judge: The lack of clear scoring criteria and high sensitivity to prompt variations lead to inconsistent and arbitrary grading.
Specific Example
When a VLM answers "there are four Labradoodles in the image," both claims <quantity==4> and <breed==Labradoodle> need to be verified. However, if the LLM judge only receives a brief description ("four puppies on a light blue carpet"), it cannot complete the full verification.
Motivation
An evaluation method is needed that tests open-ended question answering (close to real-world usage) while scoring response quality reliably and interpretably. The key ingredients are high-recall scene descriptions, programmatic verification, and a unified evaluation framework.
Method
Overall Architecture
PROVE consists of two main components: Dataset Construction and Programmatic Evaluation.
1. Dataset Construction Pipeline
Ultra-detailed image descriptions (DOCCI) -> Scene graph construction -> LLM-generated QA pairs + verification programs -> Filtering -> 10.5K high-quality QA pairs
Step 1: Scene Graph Representation Construction
- It uses 5K image-description pairs from the DOCCI test set (average description length of 136 words, far exceeding competing datasets).
- Entity-attribute-relation triplets are extracted from the descriptions to construct a directed graph \(g(\mathcal{C})\).
- The scene graph is implemented as a Python class, offering APIs to query entities, attributes, relations, and to extract subgraphs.
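To make the "scene graph as a Python class" idea concrete, here is a minimal sketch. The class name `SceneGraph` and every method name below are hypothetical, chosen for illustration; the source only states that the class exposes queries for entities, attributes, relations, and subgraphs, not what its API looks like.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy directed scene graph (hypothetical API, not the PROVE implementation)."""

    # entity name -> set of attribute strings
    attributes: dict[str, set[str]] = field(default_factory=dict)
    # (subject, relation, object) triplets; direction matters
    relations: set[tuple[str, str, str]] = field(default_factory=set)

    def add_entity(self, name: str, *attrs: str) -> None:
        self.attributes.setdefault(name, set()).update(attrs)

    def add_relation(self, subj: str, rel: str, obj: str) -> None:
        # Make sure both endpoints exist as entities before adding the edge.
        self.add_entity(subj)
        self.add_entity(obj)
        self.relations.add((subj, rel, obj))

    def get_entities(self) -> list[str]:
        return sorted(self.attributes)

    def get_attributes(self, name: str) -> set[str]:
        return set(self.attributes.get(name, set()))

    def get_relations(self, subj: str) -> list[tuple[str, str]]:
        return sorted((rel, obj) for s, rel, obj in self.relations if s == subj)

    def subgraph(self, names: set[str]) -> "SceneGraph":
        """Keep only the named entities and the relations among them."""
        sub = SceneGraph()
        for name in names:
            if name in self.attributes:
                sub.add_entity(name, *self.attributes[name])
        for s, rel, o in self.relations:
            if s in names and o in names:
                sub.add_relation(s, rel, o)
        return sub


# Build the graph for the "four puppies on a light blue carpet" example.
graph = SceneGraph()
graph.add_entity("puppy", "four", "labradoodle")
graph.add_entity("carpet", "light blue")
graph.add_relation("puppy", "sitting on", "carpet")
```

Storing attributes per entity and relations as directed triplets keeps each query a simple lookup, which is what a generated verification program needs.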
Step 2: Verifiable QA Pair Generation
- GPT-4o is used to generate 10-15 diverse and challenging open-ended QA pairs for each image.
- A corresponding Python verification program is generated at the same time; it can be executed on the scene graph object to verify the correctness of the QA pair.
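As an illustration of what a QA pair and its verification program might look like, the sketch below uses a plain-dict scene graph. The question, the `verify` function, and the dict layout are all invented for this example; the actual programs are generated by GPT-4o against the benchmark's own scene graph class.

```python
# Hypothetical scene graph for the "four puppies" example, stored as plain dicts.
scene = {
    "entities": {
        "puppy": {"count": 4, "breed": "labradoodle"},
        "carpet": {"color": "light blue"},
    },
    "relations": [("puppy", "on", "carpet")],
}

# Hypothetical QA pair of the kind an LLM might propose for this scene.
qa_pair = {
    "question": "What breed are the puppies on the carpet?",
    "answer": "labradoodle",
}


def verify(scene: dict) -> str:
    """Hypothetical verification program emitted alongside the QA pair.

    It recomputes the answer from the scene graph alone, so the QA pair can be
    checked by comparing the returned value with the stated answer.
    """
    # Find entities that stand in an "on" relation to the carpet.
    on_carpet = [s for s, rel, o in scene["relations"] if rel == "on" and o == "carpet"]
    if not on_carpet:
        raise ValueError("premise of the question is not in the scene graph")
    return scene["entities"][on_carpet[0]]["breed"]


# The QA pair counts as verified only if the program reproduces its answer.
is_verified = verify(scene) == qa_pair["answer"]
```

Because the program derives the answer from the graph rather than restating it, a question whose premise is missing from the scene fails loudly instead of passing silently.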
Step 3: Dual Filtering
- Programmatic Filtering: The verification program is executed, and QA pairs whose program fails to execute (18.3%) or returns an incorrect answer (9.8%) are discarded.
- Text Filtering: QA pairs are dropped if they are trivial, ambiguous, or incomplete (as judged by an LLM), are not entailed by the image (per a visual entailment model), contain sensitive words, or are semantically redundant (detected with SemDeDup).
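The programmatic half of this filter can be sketched as follows. This is a simplified illustration under stated assumptions: programs are represented as Python callables rather than generated source code, answers are compared by case-insensitive string match, and the function and field names are hypothetical.

```python
def programmatic_filter(candidates, scene):
    """Split candidates into kept, failed-to-execute, and wrong-answer buckets."""
    kept, exec_failed, wrong = [], [], []
    for cand in candidates:
        try:
            result = cand["program"](scene)
        except Exception:
            # Any runtime error counts as an execution failure.
            exec_failed.append(cand)
            continue
        # Assumption: a simple case-insensitive string comparison.
        if str(result).lower() == cand["answer"].lower():
            kept.append(cand)
        else:
            wrong.append(cand)
    return kept, exec_failed, wrong


# Toy scene and three hypothetical candidates, one per outcome.
scene = {"puppy": {"count": 4, "color": "brown"}}
candidates = [
    {"question": "How many puppies are there?", "answer": "4",
     "program": lambda g: g["puppy"]["count"]},
    {"question": "What color is the kitten?", "answer": "white",
     "program": lambda g: g["kitten"]["color"]},  # raises KeyError
    {"question": "What color are the puppies?", "answer": "black",
     "program": lambda g: g["puppy"]["color"]},  # returns "brown"
]

kept, exec_failed, wrong = programmatic_filter(candidates, scene)
```

Keeping the two rejection buckets separate mirrors the way the source reports them as distinct percentages.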
Ultimately, approximately 50% of the QA pairs are retained, yielding a total of 10.5K high-quality samples.
2. Programmatic Evaluation Method
Given the model's answer \(\hat{\mathcal{A}} = m_\theta(\mathcal{Q}, \mathcal{I})\), evaluation is conducted along two dimensions:
Helpfulness score (hscore)—The recall of the model's answer relative to the ground truth (GT) answer's scene graph:
- Scene graph tuples are extracted from both the GT answer and the model's answer.
- Premise tuples already present in the question are excluded.
- The average of the maximum cosine similarities from the GT tuples to the answer tuples is computed.
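The three steps above can be sketched in a few lines. This is an illustration only: tuples are represented as short strings, and a bag-of-words cosine similarity stands in for the sentence-embedding similarity a real implementation would use. The function names and the handling of empty inputs are assumptions.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a toy stand-in for sentence embeddings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def hscore(gt_tuples, answer_tuples, question_tuples):
    """Recall-style score: how well the answer covers the ground-truth tuples."""
    premise = set(question_tuples)
    # Drop tuples that merely restate the question's premise.
    gt = [t for t in gt_tuples if t not in premise]
    ans = [t for t in answer_tuples if t not in premise]
    if not gt:
        return 1.0  # Assumption: nothing left to recall counts as fully helpful.
    if not ans:
        return 0.0  # Assumption: an answer with no new tuples recalls nothing.
    # Average, over GT tuples, of the best match among the answer tuples.
    return sum(max(cosine(t, v) for v in ans) for t in gt) / len(gt)


question_tuples = ["puppy on carpet"]
gt_tuples = ["puppy on carpet", "puppy is labradoodle", "puppy count four"]
answer_tuples = ["puppy on carpet", "puppy is labradoodle"]
score = hscore(gt_tuples, answer_tuples, question_tuples)
```

In this toy case the answer states the breed but omits the count, so one ground-truth tuple is matched fully and the other only partially, which pulls the score below 1.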
Truthfulness score (tscore)—The precision of the answer tuples relative to the complete scene:
- It not only matches against the scene graph of the full description but also utilizes a visual entailment model to check the image itself.
- This reduces false positive hallucination detections caused by incomplete descriptions.
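A matching sketch for the truthfulness side: each answer tuple is credited with whichever is higher, its best textual match in the full scene graph or the visual entailment probability. The `entail_prob` callable is a hypothetical stub for a visual entailment model, and the bag-of-words cosine again stands in for embedding similarity.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; a toy stand-in for sentence embeddings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def tscore(answer_tuples, scene_tuples, entail_prob):
    """Precision-style score: how well each answer tuple is supported."""
    if not answer_tuples:
        return 1.0  # Assumption: an answer that claims nothing hallucinates nothing.
    total = 0.0
    for t in answer_tuples:
        # Support from the description-derived scene graph.
        text_support = max((cosine(t, v) for v in scene_tuples), default=0.0)
        # Support from the image itself, which can rescue true claims that
        # the description happened to omit.
        total += max(text_support, entail_prob(t))
    return total / len(answer_tuples)


scene_tuples = ["puppy on carpet", "carpet is blue"]
answer_tuples = ["puppy on carpet", "puppy wears collar", "cat on sofa"]

# Hypothetical entailment outputs: the collar is visible but not described.
fake_entailment = {"puppy wears collar": 0.9}
score = tscore(answer_tuples, scene_tuples, lambda t: fake_entailment.get(t, 0.0))
```

Here the collar claim is absent from the description but supported by the stubbed entailment check, so it is not penalized, while the unsupported "cat on sofa" claim lowers the score.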
Key Designs
- Dual Verification Sources: The tscore simultaneously leverages the textual scene graph and visual entailment, as even highly detailed descriptions cannot cover all aspects of an image.
- Decoupling of hscore and tscore: The two are not necessarily positively correlated—a response can be helpful but not entirely truthful (containing hallucinations), or truthful but not helpful enough.
- Interpretability: The approach is based on concrete scoring criteria from scene graph matching rather than opaque LLM scoring.
Key Experimental Results
Main Results
VLM helpfulness-truthfulness trade-off on PROVE
| Model | Parameters | hscore ↑ | tscore ↑ | Average ↑ |
|---|---|---|---|---|
| Qwen2-VL | 2B | 69.36 | 80.64 | 75.00 |
| InternVL2 | 2B | 73.96 | 79.51 | 76.74 |
| Phi-3.5-Vision | 4B | 73.35 | 82.27 | 77.81 |
| LLaVA-1.5 | 7B | 72.67 | 82.58 | 77.62 |
| LLaVA-Next | 7B | 74.28 | 80.03 | 77.15 |
| InternVL2 | 8B | 74.55 | 80.56 | 77.56 |
| Pixtral | 12B | 73.34 | 82.43 | 77.88 |
| LLaVA-1.5 | 13B | 72.46 | 82.40 | 77.43 |
| InternVL2 | 26B | 74.63 | 79.23 | 76.93 |
| Claude-3.5-Sonnet† | - | 71.06 | 77.31 | 74.19 |
| GPT-4o-mini† | - | 73.18 | 79.24 | 76.21 |
| Gemini-1.5-Flash† | - | 72.73 | 81.74 | 77.23 |
| GPT-4o† | - | 76.53 | 80.92 | 78.72 |
| Oracle* | - | 82.84 | 85.59 | 84.22 |
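The Average column appears to be the arithmetic mean of hscore and tscore. The short check below, using four rows copied from the table, confirms this up to rounding; the variable names are arbitrary.

```python
# (model, hscore, tscore, reported average) copied from rows of the table above.
rows = [
    ("Qwen2-VL-2B", 69.36, 80.64, 75.00),
    ("Phi-3.5-Vision", 73.35, 82.27, 77.81),
    ("GPT-4o", 76.53, 80.92, 78.72),
    ("Oracle", 82.84, 85.59, 84.22),
]


def mean_score(h: float, t: float) -> float:
    """Arithmetic mean of helpfulness and truthfulness."""
    return (h + t) / 2


# Largest gap between the recomputed mean and the reported average.
max_gap = max(abs(mean_score(h, t) - avg) for _, h, t, avg in rows)
```

A gap below 0.01 for every row is consistent with two-decimal rounding of a simple mean.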
Ablation Study
| Analysis Dimension | Findings |
|---|---|
| hscore vs tscore Correlation | The average linear correlation across models is only 0.03, indicating almost no correlation. |
| Model Size vs. Truthfulness | For InternVL2 (2B -> 8B -> 26B), hscore increases but tscore does not necessarily improve. |
| LLaVA Series Comparison | The LLaVA-1.5 series achieves the overall best tscore but has a lower hscore. |
| Human Evaluation - QA Quality | 95.9% of the questions are judged as relevant, and 98.2% of the answers are judged as correct. |
| Human Evaluation - Metric Correlation | hscore correlates with human judgment at 0.81, and tscore correlates at 0.45. |
| Oracle vs. Best Model | The Oracle scores 84.22 on average vs. GPT-4o at 78.72, showing significant room for improvement. |
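One plausible reading of the "average linear correlation across models" is a Pearson correlation between per-sample hscore and tscore for each model, averaged over models. The sketch below shows that computation on made-up numbers; the per-sample scores are purely illustrative and are not results from the paper, and the exact procedure the authors used is an assumption here.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up per-sample (hscore, tscore) lists for two hypothetical models.
per_model_scores = {
    "model_a": ([0.9, 0.6, 0.8, 0.7], [0.7, 0.8, 0.9, 0.6]),
    "model_b": ([0.5, 0.9, 0.7, 0.6], [0.8, 0.7, 0.6, 0.9]),
}

# Correlate within each model, then average across models.
per_model_r = {name: pearson(h, t) for name, (h, t) in per_model_scores.items()}
avg_r = sum(per_model_r.values()) / len(per_model_r)
```

A value near zero under this kind of computation would mean that knowing how helpful a response is says little about how truthful it is.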
Key Findings
- Very few models achieve a good balance between the two: Only GPT-4o, Phi-3.5-Vision, and Pixtral exhibit balanced performance.
- Highly-ranked models do not necessarily yield high truthfulness: Claude-3.5-Sonnet and InternVL2-26B rank high on aggregated leaderboards, but their tscores lag behind the simpler LLaVA-1.5.
- Models fail in different ways: GPT-4o's errors are "milder" (e.g., reading 3 out of 6 letters correctly), whereas LLaVA's errors are more severe (e.g., reading only 1 correctly); GPT-4o generates more descriptive answers, improving its hscore.
- Commonly hallucinated objects: These are mostly everyday scene elements such as trees, buildings, walls, and signs.
Highlights & Insights
- Paradigm Innovation: Programmatic verification is introduced to open-ended VLM evaluation for the first time, achieving high reliability through a closed-loop pipeline of "QA generation -> programmatic verification -> scene graph evaluation."
- Revealing Critical Trade-offs: The near-zero correlation (0.03) between helpfulness and truthfulness shows that the two are largely independent, and the results suggest that improvements in recent models primarily enhance helpfulness rather than truthfulness.
- Scalability of Evaluation: The entire benchmark construction process is fully automated, allowing easy scaling to larger image-description sources.
- Quantified Counter-intuitive Findings: Larger and newer models are not necessarily more truthful, challenging the naive assumption that "scaling = better."
Limitations & Future Work
- Recall Cost: To ensure high precision, approximately 50% of the QA pairs are filtered out, which may exclude certain hard-to-verify question types.
- Description Coverage: Even high-recall descriptions cannot capture all aspects of an image, potentially missing some hallucinations.
- Dependency on External Models: Text embeddings (Sentence-BERT), scene-graph extraction, and visual entailment (OFA) tools may introduce their respective errors.
- Untested Mitigation Methods: Hallucination mitigation strategies such as fine-tuning, preference optimization, and training-free decoding have not been evaluated on PROVE.
Related Work & Insights
- Difference from CHAIR: CHAIR only evaluates the precision/recall of objects in descriptions and is limited to image captioning templates; PROVE supports arbitrary open-ended questions.
- Difference from MMHal-Bench: MMHal relies on a pipeline of off-the-shelf models that introduce noise, and its GPT-4 scoring can penalize correct answers due to a lack of context.
- Difference from GAVIE: GAVIE relies on dense descriptions and bounding boxes, so its questions focus heavily on local regions and spatial relations, which leads to unnatural-sounding responses.
- Inspirations for Future Work: Integrating agentic VLMs (capable of planning, reasoning, and self-reflection) could be a direction for simultaneously improving both hscore and tscore.
Rating ⭐⭐⭐⭐
Novelty: ⭐⭐⭐⭐⭐ — The paradigm of programmatic verification + scene graph evaluation is novel and highly influential.
Utility: ⭐⭐⭐⭐ — Provides a scalable automated evaluation pipeline.
Experimental Thoroughness: ⭐⭐⭐⭐ — Covers 14 models, including human evaluation verification.
Writing Quality: ⭐⭐⭐⭐ — Clear motivation, adequate examples, and rigorous presentation.