
CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Conference: NeurIPS 2025 · arXiv: 2502.14914 · Code: Project Page · Area: Multimodal VLM · Keywords: Visual captioning evaluation, multi-dimensional benchmark, multimodal large language models, correctness and thoroughness, caption assessment

TL;DR

This paper proposes CAPability, a comprehensive visual captioning benchmark covering 12 dimensions across 6 perspectives. It annotates visual elements (rather than reference sentences) for nearly 11K images and videos, and evaluates both caption correctness (Precision) and thoroughness (Hit). A novel "Knows but doesn't Tell" (\(K\bar{T}\)) metric is introduced to reveal the significant gap between MLLMs' QA ability and their captioning ability.

Background & Motivation

  • Background: As multimodal large language models (MLLMs) advance rapidly, traditional visual captioning benchmarks (e.g., MS-COCO, MSR-VTT) have become severely outdated for two reasons: (1) ground-truth annotations in traditional benchmarks are typically short sentences, inadequate for evaluating the detailed descriptions generated by modern MLLMs; and (2) traditional metrics (BLEU, CIDEr, etc.) rely on N-gram matching and are highly sensitive to sentence style, making evaluation unreliable.
  • Limitations of Prior Work: Recently proposed benchmarks, while improved, still have notable limitations. DetailCaps, Dream-1K, and VDC adopt a "holistic perspective" evaluation—extracting keywords from ground-truth descriptions for comparison—which is susceptible to human bias and accumulated LLM errors. CompreCap adopts an "object perspective" evaluation, focusing solely on object-related information, thus offering limited coverage and ignoring important dimensions such as scene, text, style, and camera information.
  • Key Challenge: Comprehensive visual captioning evaluation requires a multi-perspective assessment that simultaneously measures both correctness and thoroughness of descriptions. The latter has been largely neglected in prior work—most benchmarks only assess "how much was said correctly" without considering "how much was covered comprehensively." This insight motivates the design of CAPability.

Method

Overall Architecture

CAPability draws inspiration from visual generation benchmarks (e.g., GenEval, VBench, T2VCompBench)—just as generation tasks evaluate quality across multiple aspects, captioning tasks should do likewise. The overall pipeline proceeds as follows: dimension design → data collection → MLLM pre-annotation → data balancing → human annotation (accuracy >97%) → data filtering → independent multi-dimensional evaluation.

Key Designs

  1. A taxonomy of 12 dimensions across 6 perspectives: Visual captioning is decomposed into the following dimensions, each with approximately 1,000 independently collected and evaluated samples:

    • Object-Related: object category, object color, object count, spatial relation
    • Global-Related: scene, style
    • Text-Related: OCR
    • Camera-Related: camera angle, camera motion
    • Temporal-Related: action, event
    • Knowledge-Related: character recognition

Nine static dimensions apply to both images and videos; four dynamic dimensions apply to videos only. Object count spans both static and dynamic settings. The design motivation is that information obtainable from a single frame is considered static, while information requiring the full video is considered dynamic.
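
For reference, the taxonomy is small enough to encode as a lookup table. The following Python sketch is ours (not the authors' code) and double-checks the 9-static / 4-dynamic split:

```python
# Illustrative encoding of CAPability's 12 dimensions (not the authors' code).
# "static" = answerable from a single frame; "dynamic" = requires the full
# video. "object count" appears in both settings.
CAPABILITY_TAXONOMY = {
    "object category":       {"perspective": "object-related",    "settings": {"static"}},
    "object color":          {"perspective": "object-related",    "settings": {"static"}},
    "object count":          {"perspective": "object-related",    "settings": {"static", "dynamic"}},
    "spatial relation":      {"perspective": "object-related",    "settings": {"static"}},
    "scene":                 {"perspective": "global-related",    "settings": {"static"}},
    "style":                 {"perspective": "global-related",    "settings": {"static"}},
    "OCR":                   {"perspective": "text-related",      "settings": {"static"}},
    "camera angle":          {"perspective": "camera-related",    "settings": {"static"}},
    "camera motion":         {"perspective": "camera-related",    "settings": {"dynamic"}},
    "action":                {"perspective": "temporal-related",  "settings": {"dynamic"}},
    "event":                 {"perspective": "temporal-related",  "settings": {"dynamic"}},
    "character recognition": {"perspective": "knowledge-related", "settings": {"static"}},
}

# Sanity check: 9 static dimensions, 4 dynamic, 12 distinct in total.
static  = [d for d, v in CAPABILITY_TAXONOMY.items() if "static"  in v["settings"]]
dynamic = [d for d, v in CAPABILITY_TAXONOMY.items() if "dynamic" in v["settings"]]
assert len(CAPABILITY_TAXONOMY) == 12 and len(static) == 9 and len(dynamic) == 4
```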

  1. "One Represents All" annotation strategy: Rather than exhaustively annotating all objects or actions in a sample, the benchmark randomly selects one element for annotation. The underlying principle is the law of large numbers—random selection across a large number of samples approximates the expected distribution across different granularities. To avoid human selection bias, three state-of-the-art MLLMs (GPT-4o, Gemini-1.5-pro, Qwen-VL-Max) are used to enumerate all candidate elements; Qwen2.5-Max then merges the results and randomly selects one as the pre-annotation.

  3. Three-state evaluation and dual-metric system: Each sample's caption is assigned one of three states, from which two core metrics are derived:

    • MIS (Missing): the dimension's content is not mentioned in the caption
    • COR (Correct): the dimension's content is mentioned and correctly described
    • INC (Incorrect): the dimension's content is mentioned but incorrectly described
    \[\text{Precision} = \frac{|S(\text{COR})|}{|S(\text{COR}) \cup S(\text{INC})|}\]
    \[\text{Hit} = \frac{|S(\text{COR})|}{|S(\text{ALL})|}\]

Precision measures only correctness (among what was described, how much was accurate), while Hit measures both correctness and thoroughness (among all content that should be described, how much was correctly covered).
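
Both metrics reduce to simple counting over per-sample judge states. A minimal Python sketch (ours, mirroring the two formulas above):

```python
from collections import Counter

def precision_and_hit(states):
    """Compute the two CAPability metrics from per-sample judge states.
    `states` holds one entry per sample in a dimension, each one of
    "COR" (mentioned and correct), "INC" (mentioned but wrong), or
    "MIS" (not mentioned). A sketch of the formulas above, not the authors' code."""
    n = Counter(states)
    mentioned = n["COR"] + n["INC"]
    precision = n["COR"] / mentioned if mentioned else 0.0   # correctness only
    hit = n["COR"] / len(states) if states else 0.0          # correctness + thoroughness
    return precision, hit

# Example: 6 correct, 2 incorrect, 2 missing -> Precision 0.75, Hit 0.60.
p, h = precision_and_hit(["COR"] * 6 + ["INC"] * 2 + ["MIS"] * 2)
assert abs(p - 0.75) < 1e-9 and abs(h - 0.60) < 1e-9
```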

  4. \(K\bar{T}\) (Knows but doesn't Tell) metric: Annotations are converted into QA format, and model performance on QA versus captioning tasks is compared:

    \[K\bar{T} = \frac{|S_{qa}(\text{COR}) \cap [S(\text{INC}) \cup S(\text{MIS})]|}{|S_{qa}(\text{COR})|}\]

This metric quantifies the proportion of cases in which a model answers correctly when queried but fails to express the same information when generating a caption, revealing the gap between MLLMs' "passive knowledge" and "active expression" that prior work has not quantified.
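
Given per-sample states for both the QA and captioning runs, \(K\bar{T}\) is again a counting exercise. An illustrative sketch:

```python
def knows_but_doesnt_tell(qa_states, caption_states):
    """K-bar-T: among samples the model answers correctly in QA form, the
    fraction it nonetheless gets wrong or omits when captioning. Both inputs
    map sample_id -> state ("COR"/"INC"/"MIS"). Illustrative sketch only."""
    qa_correct = {s for s, st in qa_states.items() if st == "COR"}
    untold = {s for s in qa_correct if caption_states[s] in ("INC", "MIS")}
    return len(untold) / len(qa_correct) if qa_correct else 0.0

# Example: the model knows samples 1-3 in QA form, but only tells sample 1
# correctly in its caption -> K-bar-T = 2/3.
qa  = {1: "COR", 2: "COR", 3: "COR", 4: "INC"}
cap = {1: "COR", 2: "MIS", 3: "INC", 4: "COR"}
assert knows_but_doesnt_tell(qa, cap) == 2 / 3
```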

Loss & Training

This paper presents an evaluation benchmark and does not involve model training. Evaluation employs GPT-4 Turbo (1106-preview) as the judge, performing three-state classification (MIS/COR/INC) on generated captions for each dimension. Distinct prompt templates are designed for dimensions with predefined categories (style, camera angle, camera motion) and for open-ended description dimensions.
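
A minimal sketch of what such a judging call could look like, assuming the OpenAI Python client; the prompt wording here is illustrative, not the paper's actual templates:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative three-state judging prompt; the paper uses distinct,
# dimension-specific templates (predefined-category vs. open-ended).
JUDGE_PROMPT = """You are grading a visual caption against a ground-truth annotation.
Dimension: {dimension}
Annotation: {annotation}
Caption: {caption}
Reply with exactly one word:
MIS if the caption does not mention this dimension's content,
COR if it mentions it and describes it correctly,
INC if it mentions it but describes it incorrectly."""

def judge(dimension: str, annotation: str, caption: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-1106-preview",  # GPT-4 Turbo (1106-preview), as used by the paper
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            dimension=dimension, annotation=annotation, caption=caption)}],
    )
    label = resp.choices[0].message.content.strip().upper()
    return label if label in {"MIS", "COR", "INC"} else "MIS"  # conservative fallback
```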

Key Experimental Results

Main Results — Closed-Source and 72B Models

| Model | Precision Avg | Hit Avg | Strongest Capability |
|---|---|---|---|
| GPT-4o (0806) | 79.2% | 56.5% | Camera angle (Precision 67.0%, leading by 9.6%) |
| Gemini-1.5-pro | 77.3% | 60.4% | Object count Hit leads by 10%+; best overall thoroughness |
| Gemini-2.0-flash | 79.3% | 56.2% | Tied for highest Precision |
| Qwen2.5VL-72B | 75.9% | 53.4% | Best open-source Hit; strong on scene and camera motion |
| InternVL2.5-78B | 71.2% | 47.0% | |
| LLaVA-OV-72B | 74.7% | 46.6% | |

Dimension Difficulty Analysis

| Dimension | Best Precision | Best Hit | Overall Difficulty |
|---|---|---|---|
| Scene | 97.0% | 86.9% | Easy |
| OCR | 95.9% | 88.8% | Easy |
| Style | 91.4% | 91.4% | Easy |
| Object Category | 89.8% | 86.3% | Relatively easy |
| Object Color | 90.4% | 67.7% | Moderate |
| Object Count | 78.6% | 40.0% | Hard |
| Camera Angle | 67.0% | 67.0% | Hard |
| Action | 56.8% | 51.4% | Hard |
| Camera Motion | 35.4% | 35.2% | Very hard |
| Character Recognition | 90.9% | 37.9% | High Precision, low Hit |

Key Findings

  • Substantial thoroughness gap: All models exhibit a significant drop from Precision to Hit (more than 20 percentage points on average), indicating that models tend to describe only what they are confident about, at the expense of comprehensiveness.
  • Divergent model strategies: GPT-4o is conservative (high Precision, moderate Hit; "say less rather than say it wrong"), while Gemini-1.5-pro is more aggressive (covers more through higher verbosity).
  • Common weaknesses: Object count, camera angle/motion, character recognition, and action represent bottleneck dimensions across all models.
  • \(K\bar{T}\) findings: All models exhibit a significant \(K\bar{T}\) gap, demonstrating that MLLMs' active captioning ability is substantially weaker than their passive question-answering ability.

Highlights & Insights

  1. Pioneering thoroughness evaluation: CAPability is the first to systematically evaluate caption thoroughness within a multi-dimensional framework, uncovering the previously overlooked "knows but doesn't tell" problem.
  2. Interpretive value of the \(K\bar{T}\) metric: The metric quantitatively demonstrates the capability gap between active captioning and passive question-answering in MLLMs, providing directional guidance for training strategies such as caption-aware training.
  3. Statistical elegance of the "One Represents All" strategy: The law of large numbers is leveraged to make multi-granularity annotation feasible.
  4. Unified image–video evaluation framework: The 12 dimensions span both static and dynamic content, constituting the first unified visual captioning evaluation system.

Limitations & Future Work

  • Data for each dimension is collected independently, and the benchmark does not assess a model's ability to simultaneously cover multiple dimensions within a single sample.
  • Reliance on GPT-4 Turbo as the judge may introduce evaluation bias.
  • Dynamic dimensions can only be evaluated on videos, so fully cross-modal evaluation across images and videos is not explored.
  • Approximately 1,000 samples per dimension may be insufficient for certain fine-grained subcategories.
  • Comparison with CompreCap: CompreCap evaluates only object-related information, whereas CAPability extends coverage to 6 perspectives and jointly evaluates correctness and thoroughness.
  • Inspiration from visual generation benchmarks: The dimensional design is informed by GenEval and VBench—captioning and generation are inverse tasks and should have symmetric evaluation dimensions.
  • Potential of \(K\bar{T}\) as an RLHF signal: The metric can identify knowledge that models possess but fail to express, guiding targeted training to improve captioning ability.

Rating

  • Novelty: ⭐⭐⭐⭐ The multi-perspective + dual-metric + \(K\bar{T}\) evaluation framework is novel in design, though the core idea of dimension-wise evaluation is not entirely new.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ A large number of closed-source and open-source models across multiple scales (7B to 72B) are evaluated, with detailed dimensional analysis.
  • Writing Quality: ⭐⭐⭐⭐ The paper is clearly structured with rich figures and tables, and motivations are well articulated.
  • Value: ⭐⭐⭐⭐⭐ The work fills a critical gap in thoroughness evaluation for visual captioning, identifies important capability gaps, and provides significant guidance for the research community.