MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios¶
Conference: NeurIPS 2025 arXiv: 2505.21333 Code: https://mme-videoocr.github.io/ Area: Multimodal VLM / Video Understanding / OCR Evaluation Keywords: video OCR, benchmark, cross-frame understanding, language prior bias, multimodal LLM evaluation
TL;DR¶
This paper introduces MME-VideoOCR, a comprehensive video OCR evaluation benchmark comprising 25 tasks, 44 scenarios, 1,464 videos, and 2,000 manually annotated QA pairs, spanning three capability levels: text recognition, understanding, and reasoning. Evaluation of 18 state-of-the-art MLLMs shows that even the strongest model (Gemini-2.5 Pro) reaches only 73.7% overall, while most models fall below 25% on cross-frame understanding tasks.
Background & Motivation¶
Background: MLLMs have achieved promising performance on static image OCR; however, video OCR presents unique challenges—including motion blur, temporal variation, and visual effects—that lead to significant performance degradation.
Limitations of Prior Work:

- OCR Benchmark: only 25 videos and one task type, lacking diversity
- FG Bench: 1,028 videos, but mixed automatic and manual annotation and only 6 task types
- Both benchmarks focus predominantly on text perception, neglecting text-based understanding and reasoning
Three Core Challenges in Video OCR:

- (1) Text appears in diverse forms (foreground, background, danmaku, watermarks, etc.), requiring spatiotemporal visual-text association
- (2) Key textual information is distributed across multiple frames, necessitating cross-frame aggregation and temporal understanding
- (3) As task complexity increases, reasoning over recognized text becomes essential
Method¶
Task Taxonomy (10 Categories, 25 Sub-tasks)¶
- Text Recognition: location-specified recognition, attribute-specified recognition
- Visual Text QA: text-centric QA, translation
- Text Grounding: spatial grounding, temporal grounding
- Attribute Recognition: color recognition, named entity recognition, counting
- Change Detection & Tracking: change detection, text tracking
- Special Text Parsing: table/chart/document/mathematical formula/handwriting parsing
- Cross-Frame Text Understanding: scrolling text comprehension, trajectory recognition, disordered assembly
- Text-Based Reasoning: integrating scattered clues, recognizing implicit relations, resolving ambiguity
- Text-Based Video Understanding: subtitle video comprehension, multi-hop needle-in-a-haystack
- Robustness Testing: AIGC video, long video, adversarial video
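For programmatic use (e.g., aggregating per-category scores), the taxonomy above can be captured as a plain mapping. A minimal Python sketch, with sub-task names paraphrased from the list above:

```python
# Taxonomy of MME-VideoOCR as a plain dict: category -> sub-tasks.
# Names are paraphrased from the task list above; the paper groups
# these into 10 categories and 25 sub-tasks.
TAXONOMY = {
    "Text Recognition": ["location-specified recognition", "attribute-specified recognition"],
    "Visual Text QA": ["text-centric QA", "translation"],
    "Text Grounding": ["spatial grounding", "temporal grounding"],
    "Attribute Recognition": ["color recognition", "named entity recognition", "counting"],
    "Change Detection & Tracking": ["change detection", "text tracking"],
    "Special Text Parsing": ["table/chart/document/formula/handwriting parsing"],
    "Cross-Frame Text Understanding": ["scrolling text comprehension",
                                       "trajectory recognition", "disordered assembly"],
    "Text-Based Reasoning": ["integrating scattered clues",
                             "recognizing implicit relations", "resolving ambiguity"],
    "Text-Based Video Understanding": ["subtitle video comprehension",
                                       "multi-hop needle-in-a-haystack"],
    "Robustness Testing": ["AIGC video", "long video", "adversarial video"],
}

assert len(TAXONOMY) == 10  # ten top-level categories
```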
Data Construction¶
Video Sources (three pipelines):

- Reconstruction from existing datasets (BOVText, RoadTextVQA, etc.): GPT-4o evaluates visual dynamics and textual semantic quality, followed by filtering (see the sketch after this list)
- Manual collection from public platforms (YouTube, Bilibili, Kuaishou)
- AI generation (Wan text-to-video model): 2,000 phrases → scene descriptions → video generation → filtering
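A rough sketch of what the GPT-4o filtering step could look like, assuming the OpenAI Python SDK and pre-sampled frame URLs. The rubric, reply format, and threshold are hypothetical illustrations, not the authors' actual pipeline:

```python
# Score a video's sampled frames with GPT-4o on two axes (visual dynamics
# and textual semantic quality) and keep it only if both scores pass.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Rate this video on two axes, each 1-5:\n"
    "dynamics: visual/text motion and temporal change\n"
    "semantics: legibility and meaningfulness of on-screen text\n"
    "Reply exactly as: dynamics=<n> semantics=<n>"
)

def keep_video(frame_urls: list[str], threshold: int = 3) -> bool:
    """Return True if GPT-4o rates both axes at or above the threshold."""
    content = [{"type": "text", "text": RUBRIC}] + [
        {"type": "image_url", "image_url": {"url": u}} for u in frame_urls
    ]
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    ).choices[0].message.content
    # Parsing assumes the model follows the reply format exactly.
    scores = dict(part.split("=") for part in reply.split())
    return min(int(scores["dynamics"]), int(scores["semantics"])) >= threshold
```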
Annotation Pipeline:

- Manual annotation (not model-generated): 3–4 QA pairs per video, followed by two rounds of expert filtering that retain 1–2 high-quality pairs
- Expert verification: review of ambiguous questions, inaccurate answers, and insufficiently challenging items
- Balanced option distribution + debiasing test
Debiasing Test: without any visual input, model accuracy based solely on language priors should approximate chance level (0% under Containment Match, 25.1% on multiple-choice), verifying the absence of knowledge leakage and language prior bias.
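A minimal sketch of such a text-only check, where the `answer` callable and the QA-pair schema are hypothetical stand-ins for the model interface and benchmark format:

```python
# Query the model with question text only (deliberately no video) and
# verify accuracy stays near chance; pairs the model can guess from
# language priors alone should be revised or discarded.
def debias_check(qa_pairs, answer, n_choices=4):
    stats = {"mc": [0, 0], "cm": [0, 0]}  # type -> [hits, total]
    for qa in qa_pairs:
        pred = answer(qa["question"])      # no visual input on purpose
        if qa["type"] == "mc":             # multiple-choice
            hit = pred.strip().upper().startswith(qa["answer"].upper())
        else:                              # containment match
            hit = qa["answer"].lower() in pred.lower()
        stats[qa["type"]][0] += hit
        stats[qa["type"]][1] += 1
    for t, chance in (("mc", 1 / n_choices), ("cm", 0.0)):
        hits, total = stats[t]
        print(f"{t}: {hits / max(total, 1):.1%} (chance ≈ {chance:.0%})")
```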
Evaluation Protocol¶
- Containment Match: text recognition and handwriting recognition tasks (a scoring sketch follows this list)
- GPT-assisted scoring: tasks with multiple valid answers such as translation
- Multiple-choice: remaining understanding and reasoning tasks
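To make the protocols concrete, here is a minimal sketch of the two rule-based ones, assuming free-form model replies; the normalization choices are my assumptions, not the paper's exact implementation (GPT-assisted scoring would instead prompt a judge model):

```python
import re

def normalize(s: str) -> str:
    """Lowercase and keep only alphanumerics and CJK characters,
    so minor punctuation/spacing differences don't fail the match."""
    return re.sub(r"[^a-z0-9\u4e00-\u9fff]+", "", s.lower())

def containment_match(prediction: str, ground_truth: str) -> bool:
    """Correct iff the ground-truth text appears inside the prediction.
    Note: a model that 'corrects' a misspelling (e.g. answers 'through'
    when the frame shows 'throuh') fails this check, which is exactly
    the language-prior failure mode discussed in the findings below."""
    return normalize(ground_truth) in normalize(prediction)

def mc_match(prediction: str, option_letter: str) -> bool:
    """Correct iff the first standalone option letter matches."""
    m = re.search(r"\b([A-D])\b", prediction.upper())
    return bool(m) and m.group(1) == option_letter.upper()
```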
Key Experimental Results¶
Main Results (18 Models)¶
Accuracy (%) per task category. Column abbreviations follow the ten task categories above: TR = Text Recognition, VTQA = Visual Text QA, TG = Text Grounding, AR = Attribute Recognition, CDT = Change Detection & Tracking, STP = Special Text Parsing, CFTU = Cross-Frame Text Understanding, TBR = Text-Based Reasoning, TBVU = Text-Based Video Understanding, RVT = Robustness Testing.

| Model | Scale | TR | VTQA | TG | AR | CDT | STP | CFTU | TBR | TBVU | RVT | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-2.5 Pro | - | 83.0 | 91.6 | 64.5 | 74.0 | 70.0 | 84.4 | 48.7 | 74.0 | 56.5 | 72.0 | 73.7 |
| GPT-4o | - | 83.3 | 81.6 | 60.5 | 74.7 | 51.5 | 68.0 | 30.7 | 60.7 | 59.0 | 75.3 | 66.4 |
| Qwen2.5-VL | 72B | 80.7 | 80.0 | 65.0 | 74.0 | 56.5 | 79.6 | 26.7 | 74.7 | 57.0 | 78.7 | 69.0 |
| InternVL3 | 78B | 70.0 | 77.6 | 67.5 | 76.0 | 65.5 | 71.6 | 24.7 | 77.3 | 57.0 | 75.3 | 67.2 |
| InternVL3 | 8B | 61.3 | 72.0 | 60.0 | 69.3 | 56.5 | 62.4 | 23.3 | 57.3 | 55.0 | 71.3 | 59.8 |
| LLaVA-OneVision | 7B | 42.0 | 50.0 | 49.0 | 54.0 | 41.0 | 46.4 | 20.0 | 45.3 | 52.0 | 60.0 | 46.0 |
Fine-Grained Task Analysis (Selected Top Models)¶
| Task | Gemini-2.5 Pro | Qwen2.5-VL 72B | InternVL3 78B | GPT-4o |
|---|---|---|---|---|
| Trajectory Recognition | 0.0% | 0.0% | 0.0% | 0.0% |
| Disordered Assembly | 76.0% | 16.0% | 4.0% | 30.0% |
| Multi-hop Needle-in-a-Haystack | 27.0% | 18.0% | 18.0% | 25.0% |
| Subtitle Video Comprehension | 86.0% | 96.0% | 96.0% | 93.0% |
| Translation | 84.0% | 66.0% | 68.0% | 70.0% |
Key Findings¶
- Cross-frame understanding is the most critical bottleneck: most models score below 25% on Cross-Frame Text Understanding; all Top-5 models achieve 0% on trajectory recognition
- Resolution and frame count are critical: increasing both consistently improves performance, though some models degrade when the frame count rises from 32 to 64, likely due to attention dispersion (a frame-sampling sketch follows this list)
- Token compression is unsuitable for OCR: compression-based approaches such as VideoChat-Flash and Slow-fast MLLM perform poorly on OCR tasks
- Severe language prior bias: models tend to "correct" misspellings to semantically plausible words (e.g., "throuh" → "through") rather than faithfully recognizing visual content
- Large gap between single-frame and cross-frame tasks: subtitle comprehension (single-frame information) exceeds 90%, while multi-hop needle-in-a-haystack (cross-frame aggregation) falls below 30%, indicating that models rely on sparse frames rather than genuinely integrating temporal information
- Pronounced scaling effects: scaling Qwen2.5-VL from 7B to 72B yields an improvement of over 10 points; scaling InternVL3 from 8B to 78B yields over 7 points
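A minimal sketch of how the resolution/frame-count ablation above could be reproduced, assuming OpenCV and a generic MLLM that accepts a list of frames; the sampling strategy and defaults are my assumptions:

```python
import cv2

def sample_frames(path: str, num_frames: int = 32, short_side: int = 448):
    """Uniformly sample `num_frames` frames from a video file and
    resize each so its short side is `short_side` pixels."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            continue
        h, w = frame.shape[:2]
        scale = short_side / min(h, w)
        frames.append(cv2.resize(frame, (int(w * scale), int(h * scale))))
    cap.release()
    return frames

# Re-running the benchmark with num_frames in {8, 16, 32, 64} and varying
# short_side would expose the sensitivity reported above, including the
# degradation some models show when going from 32 to 64 frames.
```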
Highlights & Insights¶
- ⭐⭐⭐⭐ Comprehensive task design: 25 sub-tasks covering the full pipeline from perception → understanding → reasoning, including innovative dimensions such as cross-frame understanding and robustness testing
- ⭐⭐⭐⭐ Rigorous debiasing design: debiasing tests, balanced option distribution, and multi-round expert review eliminate language priors and knowledge leakage
- ⭐⭐⭐⭐ Actionable findings: results such as 0% trajectory recognition, language prior bias, and token compression deficiencies directly inform directions for model improvement
- ⭐⭐⭐ Fully manual annotation: unlike benchmarks that rely on hybrid annotation, all QA pairs are human-written, yielding better-controlled quality
Limitations & Future Work¶
- The total of 2,000 QA pairs leaves some sub-categories with limited samples (e.g., approximately 50 for trajectory recognition), potentially causing score volatility
- Coverage is primarily bilingual (Chinese and English), with no inclusion of additional languages
- Across the three difficulty tiers (easy/medium/hard), frontier models already perform well on easy and medium instances, so the benchmark will need ongoing supplementation with high-difficulty samples to remain discriminative
- The benchmark does not evaluate models fine-tuned on video OCR, so whether video OCR is suitable as a training objective remains unclear
- Adversarial video testing employs only an all-black frame insertion strategy (sketched below), limiting the diversity of adversarial forms
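For concreteness, a toy version of that all-black frame insertion, assuming frames are HxWx3 uint8 numpy arrays; the insertion interval is hypothetical:

```python
import numpy as np

def insert_black_frames(frames, every: int = 10):
    """Insert an all-black frame after every `every` real frames,
    mimicking the benchmark's adversarial-video construction."""
    out = []
    for i, f in enumerate(frames, 1):
        out.append(f)
        if i % every == 0:
            out.append(np.zeros_like(f))
    return out
```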
Rating¶
⭐⭐⭐⭐ A much-needed comprehensive evaluation benchmark for the video OCR domain. The task design spans rich dimensions, annotation quality is high, and debiasing is handled rigorously. The cross-frame understanding bottleneck and language prior bias it exposes directly guide MLLM optimization, and the benchmark remains both discriminative and challenging.