MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly¶
Conference: NeurIPS 2025 arXiv: 2505.10610 Code: GitHub Area: Multimodal VLM / Long-Context Evaluation Keywords: long-context VLM, benchmark, multi-task evaluation, cross-modal tokenization, NIAH
TL;DR¶
This paper introduces MMLongBench, the first comprehensive benchmark for evaluating long-context vision-language models (LCVLMs), comprising 13,331 samples spanning 5 downstream task categories, mixed image types, and 5 standardized input length levels (8K–128K tokens). Evaluation of 46 models reveals that single-task performance is a weak proxy for overall capability, and that stronger reasoning ability positively correlates with long-context performance.
Background & Motivation¶
Background: The context windows of VLMs have been extended to 128K+ tokens (e.g., GPT-4o, Gemini-2.5), giving rise to long-context VLMs (LCVLMs) capable of processing hundreds of images and thousands of interleaved text tokens. However, evaluation benchmarks have lagged significantly behind.
Limitations of Prior Work:
- Limited task coverage: Existing benchmarks focus on a single task type (e.g., MM-NIAH targets only needle-in-a-haystack retrieval; MMLongBench-Doc covers only document VQA), and no single task reflects overall long-context capability.
- Narrow image type coverage: Most benchmarks include only natural photographs or only synthetic document screenshots, so coverage is incomplete.
- Inconsistent input length definitions: Different benchmarks define "length" differently (some by image count, others by token count), and most provide only a single length level.
- Absence of important tasks: Practically relevant scenarios such as Visual RAG, many-shot ICL, and summarization are entirely absent from prior benchmarks.
Key Challenge: Model developers need to identify performance strengths and weaknesses across specific length levels and task types, yet existing benchmarks do not support such fine-grained analysis.
Goal: To construct a unified evaluation benchmark covering multiple task types, image modalities, and length levels.
Key Insight: A unified cross-modal token counting scheme (vision patches + text tokens), combined with 5 standardized length levels and 5 downstream task categories.
Core Idea: Provide the missing evaluation infrastructure for LCVLMs through unified length control and diverse task coverage.
Method¶
Overall Architecture¶
MMLongBench encompasses 5 task categories × 5 length levels × mixed image types:
- Visual RAG: Retrieve information from long-context Wikipedia passages to answer visual questions.
- NIAH: Locate a "needle" inserted into a sequence of "haystack" images.
- Many-Shot ICL: Perform image classification based on hundreds of in-context examples.
- Summarization: Summarize PDF documents.
- DocVQA: Conduct visual question answering over long documents.
Key Designs¶
- Cross-Modal Tokenization:
  - Function: Unify the counting of vision patches and text tokens.
  - Mechanism: The image token count equals the number of patches produced by the visual encoder (after a \(2\times2\) pixel unshuffle); this count is summed with the number of text tokens to give the total sequence length (see the token-counting sketch after this list).
  - Design Motivation: Aligns with implementations of recent models such as Qwen2.5-VL and InternVL3, ensuring length metrics are comparable across models.
- 5 Standardized Length Levels (8K/16K/32K/64K/128K):
  - Function: Provide each sample with context versions at five distinct lengths.
  - Mechanism: Total token counts are precisely controlled by padding or truncating context materials (a budget-fitting sketch follows this list).
  - Design Motivation: Enables systematic analysis of performance trends as a function of context length, following established practices in text-domain long-context evaluation.
- Diverse Image Type Coverage:
  - Function: Include both natural images (photographs, scenes) and synthetic images (document screenshots, webpages, application screenshots).
  - Mechanism: Different tasks naturally introduce different image types: NIAH and ICL use natural images; DocVQA and Summarization use synthetic images; VRAG incorporates both.
  - Design Motivation: Prevents evaluation blind spots arising from image-type bias.
- Comprehensive Model Evaluation (46 Models):
  - Function: Evaluate both closed-source (GPT-4o, Gemini, etc.) and open-source (LLaVA, Qwen-VL, InternVL, etc.) models.
  - Mechanism: A unified evaluation protocol with controlled variables for fair comparison (an illustrative sweep loop appears after this list).
  - Design Motivation: Provide a panoramic view of current LCVLM capabilities.
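To make the cross-modal counting rule concrete, here is a minimal sketch of how a total sequence length could be computed. The 14-pixel patch size, the `image_token_count` helper, and the placeholder whitespace tokenizer are illustrative assumptions, not the paper's exact implementation.

```python
import math

PATCH_SIZE = 14   # assumed ViT patch size; model-specific in practice
MERGE = 2         # 2x2 pixel unshuffle merges each 2x2 patch group into one token

def image_token_count(width: int, height: int) -> int:
    """Vision tokens contributed by one image after patchification
    and the 2x2 pixel-unshuffle merge."""
    patches_w = math.ceil(width / PATCH_SIZE)
    patches_h = math.ceil(height / PATCH_SIZE)
    return (patches_w // MERGE) * (patches_h // MERGE)

def total_sequence_length(image_sizes, text, count_text_tokens) -> int:
    """Cross-modal length = vision tokens over all images + text tokens."""
    vision = sum(image_token_count(w, h) for w, h in image_sizes)
    return vision + count_text_tokens(text)

# Example: two 448x448 images plus a short question,
# with a crude whitespace "tokenizer" standing in for the real one.
print(total_sequence_length(
    image_sizes=[(448, 448), (448, 448)],
    text="Which page mentions the 2023 revenue figure?",
    count_text_tokens=lambda s: len(s.split()),
))
```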
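Similarly, the standardized length levels could be produced with a budget-fitting routine like the one below, which keeps adding context items (passages or images) until the target token count is reached. The greedy keep order, the `token_len` callback, and the padding-with-distractors step are assumptions about how such length control might be implemented, not the benchmark's released pipeline.

```python
LENGTH_LEVELS = [8_000, 16_000, 32_000, 64_000, 128_000]

def fit_to_budget(context_items, query_tokens, target, token_len, padding_pool=()):
    """Truncate or pad a sample's context so its cross-modal token total
    approaches the target length level.

    context_items : task-relevant passages/images, kept greedily in order
    padding_pool  : irrelevant distractor items used to pad short contexts
    token_len     : maps one item to its cross-modal token count
    """
    budget = target - query_tokens
    kept, used = [], 0
    for item in list(context_items) + list(padding_pool):
        cost = token_len(item)
        if used + cost > budget:
            break
        kept.append(item)
        used += cost
    return kept

# One context version per length level for a single sample (illustrative):
# versions = {L: fit_to_budget(haystack, q_tokens, L, token_len, distractors)
#             for L in LENGTH_LEVELS}
```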
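Finally, a controlled sweep over models, task categories, and length levels, as implied by the unified protocol, might look like the loop below; `load_samples`, `run_model`, and `score` are hypothetical helpers, not the released evaluation code.

```python
TASKS = ["VRAG", "NIAH", "ICL", "Summ", "DocVQA"]
LENGTHS = ["8K", "16K", "32K", "64K", "128K"]

def evaluate_all(models, load_samples, run_model, score):
    """Run every model on identical task x length cells so that
    per-cell scores are directly comparable across models."""
    results = {}
    for model in models:
        for task in TASKS:
            for length in LENGTHS:
                samples = load_samples(task, length)  # same inputs for every model
                preds = [run_model(model, s) for s in samples]
                results[(model, task, length)] = score(task, preds, samples)
    return results
```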
Key Experimental Results¶
Key Findings¶
| Finding | Details |
|---|---|
| Single-task → Overall | Weak proxy: Strong performance on NIAH does not imply strong performance on VRAG or ICL. |
| Closed-source vs. Open-source | Closed-source models lead overall, but both categories exhibit substantial degradation at 128K. |
| Reasoning vs. Long-Context | Positive correlation: Gemini thinking variants substantially outperform their standard counterparts. |
| OCR Bottleneck | OCR capability and cross-modal retrieval are the primary bottlenecks for current LCVLMs. |
| Length Sensitivity | Most models exhibit significant performance degradation beginning at 32K tokens. |
Task-Level Results (Representative Models)¶
| Model | VRAG | NIAH | ICL | Summ | DocVQA | Overall |
|---|---|---|---|---|---|---|
| GPT-4o | High | High | Mid | Mid | High | Top tier |
| Gemini-2.5-Flash | Mid | High | High | High | Mid | High tier |
| Qwen2.5-VL-72B | Mid | Mid | Mid | Mid | Mid | Mid tier |
| InternVL3-8B | Low | Mid | Low | Low | Low | Low tier |
Length Sensitivity¶
| Length | Average Performance (normalized to 8K = 1.00) |
|---|---|
| 8K | 1.00 (baseline) |
| 16K | ~0.95 |
| 32K | ~0.85 |
| 64K | ~0.70 |
| 128K | ~0.55 |
Key Numbers¶
- 13,331 evaluation samples
- 5 downstream task categories
- 46 evaluated models
- 5 standardized length levels (8K–128K)
- Mixed image types (natural + synthetic)
Highlights & Insights¶
- "Single-task performance is a weak proxy for overall capability": This finding highlights the insufficiency of using NIAH alone to evaluate long-context ability, motivating the necessity of multi-task evaluation.
- Reasoning capability tracks long-context capability: The advantage of Gemini thinking variants suggests that explicit reasoning ability facilitates processing of long contexts, an insight with implications for LCVLM design.
- Unified length control: This work is the first in the vision-language domain to enforce length control standards as rigorous as those in the text domain.
- Extensibility: All datasets are designed to be readily extended to longer contexts (256K+).
Limitations & Future Work¶
- No intra-task difficulty stratification: Easy and hard subsets are not distinguished within individual tasks.
- English-only evaluation: Multilingual long-context evaluation is absent.
- Video excluded: Long-video understanding is an important long-context scenario not covered in this benchmark.
- No human baseline: Human performance on the same tasks is not provided as an upper-bound reference.
Related Work & Insights¶
- vs. MM-NIAH: Covers only the NIAH task; MMLongBench covers 5 task categories.
- vs. MMLongBench-Doc: Focuses solely on document VQA without length control; this work provides comprehensive task coverage with rigorous length control.
- vs. MileBench: Claims comprehensiveness but has an average context length of only ~9K tokens, making it unsuitable as a genuine long-context benchmark.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First comprehensive multi-task long-context evaluation benchmark for LCVLMs.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 46 models × 5 tasks × 5 length levels, with detailed error analysis.
- Writing Quality: ⭐⭐⭐⭐⭐ — Motivation is clearly articulated; comparative tables with prior benchmarks are highly persuasive.
- Value: ⭐⭐⭐⭐⭐ — Fills a critical gap in LCVLM evaluation and is poised to become a standard benchmark.