Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage

Conference: ICML 2025
arXiv: 2412.15484
Code: github.com/adobe-research/CapMAS
Area: Multimodal VLM
Keywords: Hyper-Detailed Captioning, Hallucination, Multiagent System, Factuality, Coverage Evaluation

TL;DR

CapMAS is a multi-agent system in which an LLM and an MLLM collaborate to correct hallucinations: a detailed image caption is decomposed into atomic propositions, and the truthfulness of each proposition is verified one by one. The paper also introduces a framework that evaluates detailed captions along two dimensions, factuality and coverage. CapMAS significantly improves the caption quality of various MLLMs, including GPT-4V.

Background & Motivation

MLLMs can generate long, detailed image descriptions, but these suffer from severe hallucination: they mention objects that do not exist in the image, or assign incorrect attributes and relations.

Key finding: Existing hallucination detection methods fail on long sequences.

  • Confidence and Consistency methods fail to detect hallucinations after the 192nd token.
  • Reason: As the MLLM output grows longer, the model increasingly relies on its self-generated text rather than the input image (attention weights shift from image tokens to text tokens).

Experimental validation: By "isolating" objects in long descriptions into independent queries (the Isolation method), the AUROC improves from 57.5 (Confidence) and 73.5 (Consistency) to 81.4.
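The AUROC values above measure how well a per-object hallucination score ranks hallucinated objects above real ones (reported on a 0 to 100 scale). A minimal sketch of that computation follows; the labels and scores are made-up toy values, not data from the paper, and `auroc` is a hypothetical helper name.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative.

    labels: 1 = hallucinated object, 0 = real object.
    scores: detector's hallucination score (higher = more suspicious).
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (illustrative values only).
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.2, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
print(f"AUROC = {100 * toy_auroc:.1f}")
```

A value of 50 corresponds to random ranking, so the jump from 57.5 to 81.4 means the isolated queries separate hallucinated from real objects far more reliably.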

Method

CapMAS Multi-agent System

Three-step pipeline (training-free):

  1. Decomposer LLM: Decomposes the detailed description into atomic propositions (the smallest units that can be verified as true or false).
  2. Fact Check MLLM: Converts each proposition into a True/False question and queries the MLLM independently.

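Definition of the hallucination score of a proposition \(u\), where \(x\) is the image and \(Q(u)\) is the True/False question built from \(u\):

\[H(u) = -\log\big(\min\big(p(\text{T} \mid x, Q(u)) - p(\text{F} \mid x, Q(u)),\ \epsilon\big)\big)\]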

Propositions are classified into a True set \(\mathcal{T}\) and a False set \(\mathcal{F}\) based on a threshold \(\pi\).

  3. Corrector LLM: Corrects the original description based on \(\mathcal{T}\) and \(\mathcal{F}\).
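The three steps can be sketched as a small pipeline. This is an illustration under stated assumptions, not the released CapMAS code: the three agents are passed in as plain callables, and the `toy_*` functions and `TOY_PROBS` are hypothetical stand-ins for real LLM/MLLM calls. The sketch also reads \(\epsilon\) as a small positive floor that keeps the logarithm defined (so it clamps with `max`, whereas the formula above writes min), and it assumes that scores above \(\pi\) land in the False set.

```python
import math
from typing import Callable, List, Tuple


def hallucination_score(p_true: float, p_false: float, eps: float = 1e-6) -> float:
    """Score a proposition from the fact-checker's True/False probabilities.

    Assumption: eps is a small positive floor that keeps the log defined
    when p_true <= p_false, so the clamp is written with max().
    """
    return -math.log(max(p_true - p_false, eps))


def split_by_threshold(
    scored: List[Tuple[str, float]], pi: float
) -> Tuple[List[str], List[str]]:
    """Assumption: scores above pi go to the False set, the rest to the True set."""
    true_set = [u for u, h in scored if h <= pi]
    false_set = [u for u, h in scored if h > pi]
    return true_set, false_set


def capmas(
    caption: str,
    decompose: Callable[[str], List[str]],
    fact_check: Callable[[str], Tuple[float, float]],
    correct: Callable[[str, List[str], List[str]], str],
    pi: float,
) -> str:
    propositions = decompose(caption)                                 # step 1
    scored = [(u, hallucination_score(*fact_check(u))) for u in propositions]  # step 2
    true_set, false_set = split_by_threshold(scored, pi)
    return correct(caption, true_set, false_set)                      # step 3


# --- Hypothetical stand-ins for the three agents (no real model calls) ---
def toy_decompose(caption: str) -> List[str]:
    return [s.strip() for s in caption.split(".") if s.strip()]


# Made-up (p_true, p_false) pairs; a real fact checker would also see the image.
TOY_PROBS = {
    "A dog sits on a sofa": (0.95, 0.05),
    "A cat sleeps nearby": (0.10, 0.90),
}


def toy_fact_check(proposition: str) -> Tuple[float, float]:
    return TOY_PROBS[proposition]


def toy_correct(caption: str, true_set: List[str], false_set: List[str]) -> str:
    # Simplest possible correction: keep only the supported propositions.
    return ". ".join(true_set) + "."


corrected = capmas(
    "A dog sits on a sofa. A cat sleeps nearby.",
    toy_decompose, toy_fact_check, toy_correct, pi=1.0,
)
print(corrected)
```

Because each proposition is checked in its own short query, the fact checker never has to attend over the long caption, which is the point of the Isolation result in the motivation section.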

Evaluation Framework

Factuality Evaluation:

  • GPT-4o decomposes descriptions into atomic propositions, then determines the truthfulness of each by referring to both the image and reference descriptions.
  • Factuality = \(T / (T + F)\), where \(T\) and \(F\) are the numbers of propositions judged true and false.
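As a worked instance of the ratio, the sketch below computes factuality from a list of per-proposition verdicts. The verdict list is a toy example, and `factuality` is a hypothetical helper name; in the actual framework the verdicts come from GPT-4o.

```python
from typing import List


def factuality(verdicts: List[bool]) -> float:
    """Fraction of atomic propositions judged true: T / (T + F)."""
    if not verdicts:
        raise ValueError("need at least one proposition")
    t = sum(1 for v in verdicts if v)
    f = len(verdicts) - t
    return t / (t + f)


# Toy example: 3 of 4 propositions are supported by the image.
toy_verdicts = [True, True, True, False]
toy_factuality = factuality(toy_verdicts)
print(f"Factuality = {100 * toy_factuality:.1f}")
```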

Coverage Evaluation:

  • Constructs a high-granularity VQA dataset (averaging 49.8 multiple-choice questions per image, 19,899 questions in total).
  • Assumption: If the description completely covers the image information, visual questions can be answered using only the description.
  • Employs an LLM to answer questions based on the generated description, and uses the accuracy as the coverage.
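The coverage idea can be sketched as question answering from text alone. In the sketch below, the question format (`question`, `choices`, `answer` keys) and the keyword-matching `toy_answer_from_text` are hypothetical stand-ins for the benchmark's data format and for the LLM answerer; only the accuracy computation mirrors the description above.

```python
from typing import Callable, Dict, List, Optional


def coverage(
    caption: str,
    questions: List[Dict],
    answer_from_text: Callable[[str, str, List[str]], Optional[str]],
) -> float:
    """Accuracy of answering image questions using only the caption."""
    if not questions:
        raise ValueError("need at least one question")
    correct = sum(
        1
        for q in questions
        if answer_from_text(caption, q["question"], q["choices"]) == q["answer"]
    )
    return correct / len(questions)


def toy_answer_from_text(caption: str, question: str, choices: List[str]) -> Optional[str]:
    # Stand-in for the LLM: pick the first choice mentioned in the caption.
    text = caption.lower()
    for choice in choices:
        if choice.lower() in text:
            return choice
    return None  # the caption does not contain the needed detail


toy_questions = [
    {"question": "What animal is on the sofa?", "choices": ["dog", "cat", "bird"], "answer": "dog"},
    {"question": "What color is the sofa?", "choices": ["red", "blue", "green"], "answer": "red"},
]
toy_coverage = coverage("A dog sits on a sofa.", toy_questions, toy_answer_from_text)
print(f"Coverage = {100 * toy_coverage:.1f}")
```

The caption never mentions the sofa's color, so the second question cannot be answered from it and coverage drops, which is exactly the omission this metric is meant to expose.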

Meta-evaluation of Evaluation Metrics

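Three types of hallucinations (Object/Attribute/Relation) were injected into captions from the DOCCI dataset to test whether each metric could detect them. A metric that detects a hallucination type should score the corrupted captions lower than the clean ones:

| Metric | Clean | Object | Attribute | Relation | Can Detect? |
|---|---|---|---|---|---|
| CIDEr | 6.4 | 4.8 | 6.2 | 6.7 | |
| CLIP-S | 81.3 | 81.0 | 80.9 | 81.4 | |
| CLAIR | 86.9 | 85.2 | 80.0 | 83.5 | Partially |
| Ours | 62.8 | 52.3 | 60.9 | 51.9 | |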
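A quick sign check on the numbers in this table shows where each metric moves in the wrong direction. This is only arithmetic on the reported scores, not the paper's detection criterion: a positive drop says the score fell after hallucinations were injected, and says nothing about whether the drop is large enough to be useful.

```python
# Scores copied from the meta-evaluation table.
scores = {
    "CIDEr":  {"clean": 6.4,  "object": 4.8,  "attribute": 6.2,  "relation": 6.7},
    "CLIP-S": {"clean": 81.3, "object": 81.0, "attribute": 80.9, "relation": 81.4},
    "CLAIR":  {"clean": 86.9, "object": 85.2, "attribute": 80.0, "relation": 83.5},
    "Ours":   {"clean": 62.8, "object": 52.3, "attribute": 60.9, "relation": 51.9},
}


def drops(row):
    """Clean score minus corrupted score for each hallucination type."""
    return {k: round(row["clean"] - v, 1) for k, v in row.items() if k != "clean"}


all_drops = {metric: drops(row) for metric, row in scores.items()}

# Hallucination types for which the score did not decrease at all.
missed = {
    metric: [kind for kind, d in row_drops.items() if d <= 0]
    for metric, row_drops in all_drops.items()
}

for metric in scores:
    print(metric, all_drops[metric], "no drop on:", missed[metric])
```

CIDEr and CLIP-S both score relation-corrupted captions higher than clean ones, while the proposed factuality metric drops on all three types, by more than 10 points for object and relation hallucinations.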

Key Experimental Results

Improvement of CapMAS on Different Models

| Description Model | CapMAS | CLAIR | Factuality | Coverage | Average |
|---|---|---|---|---|---|
| LLaVA-NeXT-7B | | 68.8 | 59.9 | 47.9 | 58.9 |
| LLaVA-NeXT-7B | LLaMA-3 + 7B | 74.1 | 72.2 | 46.9 | 64.4 |
| GPT-4V | | 82.4 | 77.1 | 53.5 | 71.0 |
| GPT-4V | LLaMA-3 + InternVL | 84.6 | 82.1 | 53.5 | 73.4 |
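The Average column is consistent with the arithmetic mean of CLAIR, Factuality, and Coverage; that reading is inferred from the numbers rather than stated in these notes. The sketch below recomputes it and the per-model gains from applying CapMAS.

```python
# (model, CapMAS agents or None, CLAIR, Factuality, Coverage, reported Average)
rows = [
    ("LLaVA-NeXT-7B", None,                 68.8, 59.9, 47.9, 58.9),
    ("LLaVA-NeXT-7B", "LLaMA-3 + 7B",       74.1, 72.2, 46.9, 64.4),
    ("GPT-4V",        None,                 82.4, 77.1, 53.5, 71.0),
    ("GPT-4V",        "LLaMA-3 + InternVL", 84.6, 82.1, 53.5, 73.4),
]

# Mean of the three scores, rounded to one decimal like the table.
recomputed = [round((clair + fact + cov) / 3, 1) for _, _, clair, fact, cov, _ in rows]


def gain(base, with_capmas):
    """Change in factuality and coverage after applying CapMAS."""
    return {
        "factuality": round(with_capmas[3] - base[3], 1),
        "coverage": round(with_capmas[4] - base[4], 1),
    }


llava_gain = gain(rows[0], rows[1])
gpt4v_gain = gain(rows[2], rows[3])

print("recomputed averages:", recomputed)
print("LLaVA-NeXT-7B gain:", llava_gain)
print("GPT-4V gain:", gpt4v_gain)
```

Factuality rises by 12.3 points for LLaVA-NeXT-7B and 5.0 points for GPT-4V, while coverage moves by -1.0 and 0.0 points respectively.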

Comparison with Other Methods

| Method | CLAIR | Factuality | Coverage | Average |
|---|---|---|---|---|
| Base (LLaVA-1.5-7B) | 62.1 | 52.8 | 34.3 | 49.7 |
| VCD | 59.7 | 44.6 | 39.3 | 47.9 |
| OPERA | 59.1 | 53.0 | 34.1 | 48.7 |
| LURE | 57.2 | 51.9 | 27.6 | 45.6 |
| CapMAS | 66.3 | 63.4 | 33.1 | 54.3 |
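To read this comparison at a glance, the sketch below lists, for each method, the metrics on which it beats the LLaVA-1.5-7B base. It is plain arithmetic on the reported scores; any strict improvement counts, however small.

```python
metrics = ("CLAIR", "Factuality", "Coverage", "Average")
base = (62.1, 52.8, 34.3, 49.7)  # Base (LLaVA-1.5-7B)
methods = {
    "VCD":    (59.7, 44.6, 39.3, 47.9),
    "OPERA":  (59.1, 53.0, 34.1, 48.7),
    "LURE":   (57.2, 51.9, 27.6, 45.6),
    "CapMAS": (66.3, 63.4, 33.1, 54.3),
}

# Metrics on which each method scores strictly higher than the base model.
improved = {
    name: [m for m, b, s in zip(metrics, base, vals) if s > b]
    for name, vals in methods.items()
}

for name, better_on in improved.items():
    print(f"{name}: improves {better_on if better_on else 'nothing'}")
```

Only CapMAS raises the average; its one regression is coverage (34.3 to 33.1), while VCD gains coverage at the cost of factuality and LURE improves no metric.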

Key Findings

  • Existing decoding methods (VCD, OPERA) are ineffective or even harmful for detailed descriptions: VCD lowers factuality from 52.8 to 44.6, and OPERA leaves it essentially unchanged (52.8→53.0) while lowering CLAIR.
  • CapMAS improves the factuality of GPT-4V descriptions (77.1→82.1), even though the models it uses for checking are much weaker than GPT-4V.
  • VQA benchmark performance is uncorrelated with detailed-description capability, which calls the VQA-centric evaluation paradigm into question.

Highlights & Insights

  1. Isolation verification outperforms Confidence/Consistency: Confirms the necessity of the decompose-then-check strategy.
  2. Plug-and-play + training-free: Can be applied to any description model, including closed-source GPT-4V.
  3. Dual evaluation of Factuality × Coverage: Systematically decouples and evaluates these two dimensions for the first time.
  4. Revelation of VQA benchmark limitations: Strong performance of an MLLM on VQA does not imply superior description ability.

Limitations & Future Work

  • Factuality improvement is accompanied by a slight decrease in coverage (conservative corrections lead to information loss).
  • Relies on the MLLM's own visual understanding capabilities to check for hallucinations.
  • The quality of the LLM decomposer affects the final results.
  • Hyperparameter \(\pi\) controls the trade-off between factuality and coverage.

Related Work

  • Decoding methods (VCD, OPERA)
  • Training methods (LRV)
  • Correction methods (LURE, Volcano)
  • Evaluation methods (CLIPScore, CLAIR, ALOHa, FaithScore)

Rating

⭐⭐⭐⭐: The problem formulation is precise (hallucination detection fails on long sequences), and the evaluation framework is well designed. The CapMAS method is intuitive and highly effective. The dual-dimensional evaluation and the finding about VQA benchmark limitations are valuable in their own right.