MMBench: Is Your Multi-modal Model an All-Around Player?
Conference: ECCV 2024
arXiv: 2307.06281
Code: VLMEvalKit
Area: Multimodal VLM
Keywords: VLM benchmark, multi-modal evaluation, CircularEval, choice extraction, bilingual benchmark
TL;DR
Proposes MMBench, a bilingual (English/Chinese) multimodal benchmark of 3,217 multiple-choice questions spanning 20 fine-grained ability dimensions. It pairs the CircularEval evaluation strategy with an LLM-based choice extraction mechanism to make evaluation more robust and fair.
Background & Motivation
Large vision-language models (VLMs) have advanced rapidly in recent years, but the field still lacks systematic and reliable methods for evaluating them quantitatively:
- Traditional objective benchmarks (e.g., VQAv2, COCO Caption) suffer from "false negatives": predicting "bicycle" when the ground truth is "bike" is penalized as incorrect. They also evaluate only a single task each, so they cannot provide a fine-grained ability profile.
- Subjective evaluations (e.g., OwlEval, LVLM-eHub) rely on human annotation, which is costly, biased, hard to scale, and difficult to reproduce.
- Instruction-following ability varies widely across VLMs. Many models cannot directly output an option label (A/B/C/D), so exact-match evaluation severely underestimates their true capability.
An objective benchmark that is systematically designed, robust in its evaluation, and comprehensive in capability coverage is therefore needed.
Core Problem
- How to construct a VLM evaluation benchmark that has comprehensive capability coverage and controllable data quality?
- How to resolve the difficulty of option extraction caused by differences in instruction-following capabilities across various VLMs?
- How to eliminate bias caused by random guessing and option preference in multiple-choice question evaluations?
Method
1. Hierarchical Capability Taxonomy
MMBench designs a three-level capability taxonomy:
- L-1 (2 categories): Perception, Reasoning
- L-2 (6 categories): Coarse Perception (CP), Fine-grained Perception - Single Instance (FP-S), Fine-grained Perception - Cross Instance (FP-C), Attribute Reasoning (AR), Logic Reasoning (LR), Relation Reasoning (RR)
- L-3 (20 categories): Covers fine-grained abilities such as object localization, action recognition, spatial relations, and social reasoning.
Each L-3 capability includes at least 125 questions, maintaining a balanced distribution.
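The two upper levels of the taxonomy can be written down as a small lookup table, which is handy when rolling per-dimension scores up into a perception/reasoning view. The sketch below is illustrative only: the grouping of L-2 categories under L-1 is read off the category names listed above, and the names `L1_TO_L2`, `L2_TO_L1`, and `group_by_l1` are hypothetical, not part of any released tool.

```python
from typing import Dict

# L-1 -> L-2 grouping, inferred from the category names in the taxonomy above.
L1_TO_L2: Dict[str, list] = {
    "Perception": ["CP", "FP-S", "FP-C"],
    "Reasoning": ["AR", "LR", "RR"],
}

# Reverse lookup: which L-1 category does an L-2 abbreviation belong to?
L2_TO_L1: Dict[str, str] = {
    l2: l1 for l1, l2_list in L1_TO_L2.items() for l2 in l2_list
}


def group_by_l1(l2_scores: Dict[str, float]) -> Dict[str, Dict[str, float]]:
    """Split a flat {L-2 name: score} mapping into one sub-mapping per L-1 category."""
    grouped: Dict[str, Dict[str, float]] = {l1: {} for l1 in L1_TO_L2}
    for l2, score in l2_scores.items():
        grouped[L2_TO_L1[l2]][l2] = score
    return grouped
```

Calling `group_by_l1` on one row of a results table separates the perception scores from the reasoning scores without applying any aggregation rule of its own.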
2. Data Collection and Quality Control
- Source: Over 80% of the questions are collected from the Internet, and the remaining ~20% are constructed based on validation sets of public datasets.
- Text-Only Filtering: Multiple SOTA LLMs (e.g., GPT-4, Gemini-Pro) are queried using only the text. If more than half answer correctly, the question is removed (indicating it can be answered without the image, hence unsuitable for multimodal evaluation).
- Error Sample Filtering: All questions are fed into multiple SOTA VLMs. If all models answer incorrectly, the sample is manually reviewed, and genuinely incorrect samples are removed.
- Bilingual Version: Questions are translated into Chinese with GPT-4, keeping proper nouns intact, and then manually verified, yielding MMBench-CN.
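The text-only filtering rule above (drop a question when more than half of the text-only LLMs answer it correctly) can be sketched as a simple majority check. This is a minimal illustration, not the authors' pipeline: the sample layout (`question`, `options`, `answer` keys) and the model interface (a callable returning an option label) are assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: a text-only model sees the question and options
# (no image) and returns an option label such as "A".
TextOnlyModel = Callable[[str, Sequence[str]], str]


def answerable_without_image(
    question: str,
    options: Sequence[str],
    answer: str,
    models: Sequence[TextOnlyModel],
) -> bool:
    """True if strictly more than half of the text-only models pick the correct label."""
    correct = sum(1 for model in models if model(question, options) == answer)
    return correct > len(models) / 2


def text_only_filter(samples: List[Dict], models: Sequence[TextOnlyModel]) -> List[Dict]:
    """Keep only the samples that most text-only models fail, i.e. that need the image."""
    return [
        s
        for s in samples
        if not answerable_without_image(s["question"], s["options"], s["answer"], models)
    ]
```

Note that a tie (exactly half of the models correct) keeps the question, matching the "more than half" wording.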
3. LLM-Assisted Option Extraction
To address the issue where free-form text output from VLMs cannot be directly matched to options, a two-step extraction workflow is designed:
- Step 1 (Heuristic Matching): Attempts to directly extract option labels A/B/C/D from the model's output.
- Step 2 (LLM Extraction): If Step 1 fails, the question, options, and model output are sent to GPT-4 to determine which option the model's prediction matches best.
- As an option extractor, GPT-4 achieves a 91.5% alignment rate with human annotators, which is much higher than GPT-3.5-Turbo (around 85%).
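The two-step workflow can be sketched as a cheap pattern match with an LLM fallback. The code below is an assumption-laden illustration: the regular expressions are one plausible heuristic rather than the rules used in VLMEvalKit, and `llm_extract` is a hypothetical callable standing in for the GPT-4 request (no real API is called here).

```python
import re
from typing import Callable, Optional, Sequence

# Hypothetical fallback: given (question, options, model_output), return a label or None.
LLMExtractor = Callable[[str, Sequence[str], str], Optional[str]]


def heuristic_match(output: str, options: Sequence[str]) -> Optional[str]:
    """Step 1: return an option label only when the output names one unambiguously."""
    labels = [chr(ord("A") + i) for i in range(len(options))]
    text = output.strip()
    # Case 1: the whole output is a bare label such as "B", "(B)", or "B."
    whole = re.fullmatch(r"\(?([A-Z])\)?[.:]?", text)
    if whole and whole.group(1) in labels:
        return whole.group(1)
    # Case 2: exactly one label appears in a marked form such as "(B)", "B.", "B:", "B)".
    # A plain "A" followed by a space (as in "A dog ...") is deliberately not matched.
    hits = {h for h in re.findall(r"(?<![A-Za-z])\(?([A-Z])[).:]", text) if h in labels}
    if len(hits) == 1:
        return hits.pop()
    return None


def extract_choice(
    question: str,
    options: Sequence[str],
    output: str,
    llm_extract: LLMExtractor,
) -> Optional[str]:
    """Step 2: fall back to the LLM only when the heuristic gives no clear answer."""
    label = heuristic_match(output, options)
    if label is not None:
        return label
    return llm_extract(question, options, output)
```

Keeping the heuristic conservative means ambiguous outputs go to the LLM rather than being guessed, at the price of more fallback calls.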
4. CircularEval Strategy
To eliminate biases from random guessing (a 25% baseline for 4 options) and positional preferences in multiple-choice questions:
- For a question with \(N\) options, run \(N\) inference passes, applying a circular shift to the options (and therefore to the position of the correct answer) on each pass.
- The question is considered correctly answered only if the model is correct in all \(N\) inferences.
- In practice, inference can stop as soon as the model fails one pass, so the total cost stays below \(N\) times that of a single-pass evaluation.
- Effect: Compared to VanillaEval (single inference), CircularEval generally reduces accuracy by 8–34 percentage points, which widens the gap between models.
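The procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `predict` is a hypothetical callable that returns the index of the chosen option, and `random_guess_pass_rate` assumes independent, uniform guesses on every pass, which is an idealization rather than a figure from the paper.

```python
from fractions import Fraction
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: the model sees the question and an ordered option
# list, and returns the index of the option it selects.
Predictor = Callable[[str, Sequence[str]], int]


def circular_shifts(options: Sequence[str], answer_idx: int) -> List[Tuple[List[str], int]]:
    """Return all N circular shifts, each with the new index of the correct option."""
    n = len(options)
    shifts = []
    for k in range(n):
        shifted = list(options[k:]) + list(options[:k])
        # The option originally at answer_idx now sits k positions earlier (mod n).
        shifts.append((shifted, (answer_idx - k) % n))
    return shifts


def circular_eval(
    question: str,
    options: Sequence[str],
    answer_idx: int,
    predict: Predictor,
) -> Tuple[bool, int]:
    """Return (passed, number of inference passes used), stopping at the first failure."""
    passes = 0
    for shifted, gold in circular_shifts(options, answer_idx):
        passes += 1
        if predict(question, shifted) != gold:
            return False, passes
    return True, passes


def random_guess_pass_rate(n_options: int) -> Fraction:
    """Chance of passing all N passes by independent uniform guessing: (1/N)^N."""
    return Fraction(1, n_options) ** n_options
```

Under that independence assumption, a 4-option question is passed by pure guessing with probability \((1/4)^4 = 1/256\), compared with the 25% baseline of a single pass. A model that always picks the first position fails within the first two passes, which is how the positional bias gets exposed.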
Key Experimental Results
| Model | Overall | CP | FP-S | FP-C | AR | LR | RR |
|---|---|---|---|---|---|---|---|
| InternLM-XComposer2 | 78.1 | 80.4 | 83.5 | 73.0 | 83.7 | 63.6 | 74.4 |
| Qwen-VL-Max | 75.4 | 74.8 | 87.2 | 67.0 | 85.3 | 54.9 | 70.5 |
| GPT-4v | 74.3 | 77.6 | 73.8 | 71.5 | 85.3 | 63.6 | 68.6 |
| LLaVA-InternLM2-20B | 72.3 | 78.3 | 76.6 | 68.2 | 78.4 | 46.2 | 69.4 |
| Gemini-Pro-V | 70.2 | 70.0 | 78.9 | 65.9 | 82.9 | 46.2 | 65.9 |
| Yi-VL-34B | 68.4 | 72.0 | 78.0 | 54.7 | 81.2 | 38.6 | 68.2 |
| OpenFlamingo v2 | 2.3 | 1.1 | 3.5 | 1.5 | 5.3 | 0.0 | 2.7 |
Key Findings:
- LLM Backbone is Crucial: Within the same LLaVA architecture, switching the LLM from Vicuna-7B to InternLM2-20B increases the overall accuracy from 63.4% to 72.3%, with reasoning capabilities showing particularly significant gains.
- Model Scaling is Effective: MiniGPT4 improves by 8.3% when scaled from 7B to 13B; LLaVA v1.5 improves by 3.5% when scaled from 7B to 13B.
- Potential of Small Models: MiniCPM-V (\(\le\) 3B parameters) still achieves 61.4% under CircularEval.
- Small Bilingual Gap: Top models show only a 1–2% gap between MMBench and MMBench-CN, with InternLM-XComposer2 having a gap of less than 1%.
- Impact of Content Censorship: GPT-4v refuses to answer 1.8% of the tests (mainly celebrity identification), while Gemini-Pro-V refuses 1.6%.
Highlights & Insights
- Clever Design of CircularEval: Eliminates positional bias and random guessing through option circular shifting, substantially improving evaluation robustness at acceptable computational cost.
- LLM Option Extractor: Elegantly resolves the discrepancy in instruction-following capabilities across VLMs, achieving a 91.5% alignment rate with humans.
- Three-level Capability Taxonomy: 20 L-3 capability dimensions provide fine-grained diagnostic capacity, directly localizing model weaknesses.
- Systematic Quality Control: Double-filtering mechanism (text-only filtering + consensus-error filtering) secures data quality.
- Bilingual Aligned Evaluation: The English and Chinese versions are aligned question by question, enabling a fair comparison of VLMs' cross-lingual capabilities.
Limitations & Future Work
- The multiple-choice format itself is inherently limited—it cannot evaluate capabilities like open-ended generation, multi-turn dialogue, or long-text reasoning.
- Quality control relies on SOTA models; it might fail to detect errors when all SOTA models make the same mistake.
- CircularEval is sensitive to the number of options; the difficulty varies significantly between 2-option and 4-option settings.
- Option extraction relies heavily on the GPT-4 API, incurring non-trivial cost and risk of API version drift.
- Although it covers 20 dimensions, the current evaluation lacks capability dimensions that have recently gained prominence, such as OCR, chart understanding, and mathematical reasoning.
Related Work & Insights
| Benchmark | Questions | Capability Dimensions | Evaluation Method | Bilingual | Robustness Strategy |
|---|---|---|---|---|---|
| MMBench | 3217 | 20 (Three-level) | Multiple Choice + CircularEval | ✓ | CircularEval + LLM Extraction |
| MME | ~2400 | 14 | Yes/No | ✗ | None |
| OwlEval | 82 | Various | Subjective/Human | ✗ | None |
| SEED-Bench | 19K | 12 | Multiple Choice | ✗ | None |
| VQAv2 | 1.1M | Single | Open-ended | ✗ | Exact Match |
Compared to the simple Yes/No questions of MME, MMBench's multiple-choice questions are closer to real-world reasoning. Compared to SEED-Bench, which has a larger question pool but lacks a robustness strategy, MMBench secures evaluation reliability through CircularEval.
Insights & Connections
- The "multiple inferences + consistency verification" concept of CircularEval can be generalized to other multiple-choice evaluation scenarios (e.g., coding capability, mathematical reasoning benchmarks).
- LLM-assisted option extraction provides a general paradigm for evaluating open-ended models, bypassing the constraint of requiring models to strictly follow output formats.
- The paper indicates the decisive impact of LLM backbones on VLM performance, inspiring subsequent research to focus more closely on language model selection and alignment.
- The evaluation code is integrated into VLMEvalKit, which has become a standard evaluation tool for subsequent VLM research.
Rating
- Novelty: ⭐⭐⭐⭐ — CircularEval and the LLM option extractor represent meaningful methodological innovations.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Evaluates 21 VLMs, with multi-dimensional analysis and thorough ablation studies.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure, rich charts, and comprehensive motivation discussions.
- Value: ⭐⭐⭐⭐⭐ — Has become a standard for VLM evaluation, with VLMEvalKit being widely adopted.