MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly¶
Conference: NeurIPS 2025 arXiv: 2505.10610 Code: GitHub Area: Multimodal VLM / Long-Context Evaluation Keywords: long-context VLM, benchmark, multi-task evaluation, cross-modal tokenization, NIAH
TL;DR¶
This paper introduces MMLongBench, the first comprehensive benchmark for evaluating long-context vision-language models (LCVLMs), comprising 13,331 samples spanning 5 downstream task categories, mixed image types, and 5 standardized input length levels (8K–128K tokens). Evaluation of 46 models reveals that single-task performance is a weak proxy for overall capability, and that stronger reasoning ability positively correlates with long-context performance.
Background & Motivation¶
Background: The context windows of VLMs have been extended to 128K+ tokens (e.g., GPT-4o, Gemini-2.5), giving rise to long-context VLMs (LCVLMs) capable of processing hundreds of images and thousands of interleaved text tokens. However, evaluation benchmarks have lagged significantly behind.
Limitations of Prior Work:
- Limited task coverage: Existing benchmarks focus on a single task type (e.g., MM-NIAH targets only needle-in-a-haystack retrieval; MMLongBench-Doc covers only document VQA), and no single task reflects overall long-context capability.
- Narrow image type coverage: Most benchmarks include only natural photographs or only synthetic document screenshots, so coverage is incomplete.
- Inconsistent input length definitions: Different benchmarks define "length" differently (some by image count, others by token count), and most provide only a single length level.
- Absence of important tasks: Practically relevant scenarios such as Visual RAG, many-shot ICL, and summarization are entirely absent from prior benchmarks.
Key Challenge: Model developers need to identify performance strengths and weaknesses across specific length levels and task types, yet existing benchmarks do not support such fine-grained analysis.
Goal: To construct a unified evaluation benchmark covering multiple task types, image modalities, and length levels.
Key Insight: A unified cross-modal token counting scheme (vision patches + text tokens), combined with 5 standardized length levels and 5 downstream task categories.
Core Idea: Provide the missing evaluation infrastructure for LCVLMs through unified length control and diverse task coverage.
Method¶
Overall Architecture¶
MMLongBench encompasses 5 task categories × 5 length levels × mixed image types:
- Visual RAG: Retrieve information from long-context Wikipedia passages to answer visual questions.
- NIAH: Locate a "needle" inserted into a sequence of "haystack" images.
- Many-Shot ICL: Perform image classification based on hundreds of in-context examples.
- Summarization: Summarize PDF documents.
- DocVQA: Conduct visual question answering over long documents.
Key Designs¶
- Cross-Modal Tokenization:
  - Function: Unify the counting of vision patches and text tokens.
  - Mechanism: The image token count equals the number of patches produced by the visual encoder (after a \(2\times2\) pixel unshuffle); this count is summed with the number of text tokens to give the total sequence length (see the token-counting sketch after this list).
  - Design Motivation: Aligns with implementations of recent models such as Qwen2.5-VL and InternVL3, ensuring length metrics are comparable across models.
- 5 Standardized Length Levels (8K/16K/32K/64K/128K):
  - Function: Provide each sample with context versions at five distinct lengths.
  - Mechanism: Total token counts are precisely controlled by padding or truncating context materials (a budget-fitting sketch follows this list).
  - Design Motivation: Enables systematic analysis of performance trends as a function of context length, following established practices in text-domain long-context evaluation.
- Diverse Image Type Coverage:
  - Function: Include both natural images (photographs, scenes) and synthetic images (document screenshots, webpages, application screenshots).
  - Mechanism: Different tasks naturally introduce different image types: NIAH and ICL use natural images; DocVQA and Summarization use synthetic images; VRAG incorporates both.
  - Design Motivation: Prevents evaluation blind spots arising from image-type bias.
- Comprehensive Model Evaluation (46 Models):
  - Function: Evaluate both closed-source (GPT-4o, Gemini, etc.) and open-source (LLaVA, Qwen-VL, InternVL, etc.) models.
  - Mechanism: A unified evaluation protocol with controlled variables for fair comparison (an illustrative sweep loop appears after this list).
  - Design Motivation: Provide a panoramic view of current LCVLM capabilities.
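To make the cross-modal counting rule concrete, here is a minimal sketch of how a total sequence length could be computed. The 14-pixel patch size, the `image_token_count` helper, and the placeholder whitespace tokenizer are illustrative assumptions, not the paper's exact implementation.

```python
import math

PATCH_SIZE = 14   # assumed ViT patch size; model-specific in practice
MERGE = 2         # 2x2 pixel unshuffle merges each 2x2 patch group into one token

def image_token_count(width: int, height: int) -> int:
    """Vision tokens contributed by one image after patchification
    and the 2x2 pixel-unshuffle merge."""
    patches_w = math.ceil(width / PATCH_SIZE)
    patches_h = math.ceil(height / PATCH_SIZE)
    return (patches_w // MERGE) * (patches_h // MERGE)

def total_sequence_length(image_sizes, text, count_text_tokens) -> int:
    """Cross-modal length = vision tokens over all images + text tokens."""
    vision = sum(image_token_count(w, h) for w, h in image_sizes)
    return vision + count_text_tokens(text)

# Example: two 448x448 images plus a short question,
# with a crude whitespace "tokenizer" standing in for the real one.
print(total_sequence_length(
    image_sizes=[(448, 448), (448, 448)],
    text="Which page mentions the 2023 revenue figure?",
    count_text_tokens=lambda s: len(s.split()),
))
```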
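Similarly, the standardized length levels could be produced with a budget-fitting routine like the one below, which keeps adding context items (passages or images) until the target token count is reached. The greedy keep order, the `token_len` callback, and the padding-with-distractors step are assumptions about how such length control might be implemented, not the benchmark's released pipeline.

```python
LENGTH_LEVELS = [8_000, 16_000, 32_000, 64_000, 128_000]

def fit_to_budget(context_items, query_tokens, target, token_len, padding_pool=()):
    """Truncate or pad a sample's context so its cross-modal token total
    approaches the target length level.

    context_items : task-relevant passages/images, kept greedily in order
    padding_pool  : irrelevant distractor items used to pad short contexts
    token_len     : maps one item to its cross-modal token count
    """
    budget = target - query_tokens
    kept, used = [], 0
    for item in list(context_items) + list(padding_pool):
        cost = token_len(item)
        if used + cost > budget:
            break
        kept.append(item)
        used += cost
    return kept

# One context version per length level for a single sample (illustrative):
# versions = {L: fit_to_budget(haystack, q_tokens, L, token_len, distractors)
#             for L in LENGTH_LEVELS}
```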
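Finally, a controlled sweep over models, task categories, and length levels, as implied by the unified protocol, might look like the loop below; `load_samples`, `run_model`, and `score` are hypothetical helpers, not the released evaluation code.

```python
TASKS = ["VRAG", "NIAH", "ICL", "Summ", "DocVQA"]
LENGTHS = ["8K", "16K", "32K", "64K", "128K"]

def evaluate_all(models, load_samples, run_model, score):
    """Run every model on identical task x length cells so that
    per-cell scores are directly comparable across models."""
    results = {}
    for model in models:
        for task in TASKS:
            for length in LENGTHS:
                samples = load_samples(task, length)  # same inputs for every model
                preds = [run_model(model, s) for s in samples]
                results[(model, task, length)] = score(task, preds, samples)
    return results
```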
Key Experimental Results¶
Key Findings¶
| Finding | Details |
|---|---|
| Single-task → Overall | Weak proxy: Strong performance on NIAH does not imply strong performance on VRAG or ICL. |
| Closed-source vs. Open-source | Closed-source models lead overall, but both categories exhibit substantial degradation at 128K. |
| Reasoning vs. Long-Context | Positive correlation: Gemini thinking variants substantially outperform their standard counterparts. |
| OCR Bottleneck | OCR capability and cross-modal retrieval are the primary bottlenecks for current LCVLMs. |
| Length Sensitivity | Most models exhibit significant performance degradation beginning at 32K tokens. |
Task-Level Results (Representative Models)¶
| Model | VRAG | NIAH | ICL | Summ | DocVQA | Overall |
|---|---|---|---|---|---|---|
| GPT-4o | High | High | Mid | Mid | High | Top tier |
| Gemini-2.5-Flash | Mid | High | High | High | Mid | High tier |
| Qwen2.5-VL-72B | Mid | Mid | Mid | Mid | Mid | Mid tier |
| InternVL3-8B | Low | Mid | Low | Low | Low | Low tier |
Length Sensitivity¶
| Length | Average Performance (normalized to 8K = 1.00) |
|---|---|
| 8K | 1.00 (baseline) |
| 16K | ~0.95 |
| 32K | ~0.85 |
| 64K | ~0.70 |
| 128K | ~0.55 |
Key Numbers¶
- 13,331 evaluation samples
- 5 downstream task categories
- 46 evaluated models
- 5 standardized length levels (8K–128K)
- Mixed image types (natural + synthetic)
Highlights & Insights¶
- "Single-task performance is a weak proxy for overall capability": This finding highlights the insufficiency of using NIAH alone to evaluate long-context ability, motivating the necessity of multi-task evaluation.
- Reasoning capability tracks long-context capability: The advantage of Gemini thinking variants suggests that explicit reasoning ability facilitates processing of long contexts, an insight with implications for LCVLM design.
- Unified length control: This work is the first in the vision-language domain to enforce length control standards as rigorous as those in the text domain.
- Extensibility: All datasets are designed to be readily extended to longer contexts (256K+).
Limitations & Future Work¶
- No intra-task difficulty stratification: Easy and hard subsets are not distinguished within individual tasks.
- English-only evaluation: Multilingual long-context evaluation is absent.
- Video excluded: Long-video understanding is an important long-context scenario not covered in this benchmark.
- No human baseline: Human performance on the same tasks is not provided as an upper-bound reference.
Related Work & Insights¶
- vs. MM-NIAH: Covers only the NIAH task; MMLongBench covers 5 task categories.
- vs. MMLongBench-Doc: Focuses solely on document VQA without length control; this work provides comprehensive task coverage with rigorous length control.
- vs. MileBench: Claims comprehensiveness but has an average context length of only ~9K tokens, making it unsuitable as a genuine long-context benchmark.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First comprehensive multi-task long-context evaluation benchmark for LCVLMs.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 46 models × 5 tasks × 5 length levels, with detailed error analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ — Motivation is clearly articulated; comparative tables with prior benchmarks are highly persuasive.
- Value: ⭐⭐⭐⭐⭐ — Fills a critical gap in LCVLM evaluation and is poised to become a standard benchmark.