MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering¶

Conference: ACL 2025
arXiv: 2405.11985
Code: Available
Area: Multilingual Translation
Keywords: Multilingual VQA, Text-Centric Visual Question Answering, MLLM Evaluation, Low-Resource Languages, Benchmark Dataset

TL;DR¶

This work introduces MTVQA, the first multilingual text-centric visual question answering benchmark covering 9 languages. It resolves the "vision-text misalignment" issue of translation-based approaches through human expert annotation. Evaluations reveal a substantial gap between the best MLLM (InternVL2.5-78B, 32.2% accuracy) and the human baseline (79.7%), showing how far current models remain from reliable multilingual text understanding.

Background & Motivation¶

Text-Centric Visual Question Answering (TEC-VQA) serves as an important proxy task for evaluating AI's understanding capabilities in text-rich scenarios. However, two core limitations exist in current research:

Language coverage bias: Existing TEC-VQA benchmarks focus almost exclusively on high-resource languages like English and Chinese, while low-resource languages (e.g., Arabic, Thai, Vietnamese) are severely neglected.

Fundamental flaws of translation-based approaches: Previous multilingual VQA works (e.g., xGQA, MaXM) scale up QA pairs using translation engines. However, this incurs a severe "vision-text misalignment" issue in TEC-VQA scenarios, where translation only processes the QA texts but neglects the visual text embedded within the images.

For instance, with an image of a Russian menu, a translation engine might incorrectly translate dish names, leading to a mismatch between the QA pair and the actual text displayed in the image. While such misalignment might not be problematic in general VQA (where questions do not rely on visual text), it is fatal for TEC-VQA.
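
The misalignment can be made concrete with a minimal sketch. The menu strings, the translated answer, and the helper `grounded_in_image` are all hypothetical and chosen only for illustration; they are not taken from the MTVQA data.

```python
# Hypothetical example: text an OCR engine might read from a Russian menu image.
image_text = ["Борщ", "Пельмени", "Квас"]

# A natively written QA pair quotes the text exactly as it appears in the image.
native_answer = "Борщ"

# A QA pair produced by translating an annotation written in another language:
# the answer string was translated, but the pixels in the image were not.
translated_answer = "Borscht"


def grounded_in_image(answer: str, visible_text: list[str]) -> bool:
    """Return True if the answer literally appears among the visible strings."""
    return any(answer.casefold() == token.casefold() for token in visible_text)


print(grounded_in_image(native_answer, image_text))      # True
print(grounded_in_image(translated_answer, image_text))  # False
```

The translated answer may be a perfectly good translation, yet it no longer matches any string a model can read in the image, which is exactly the failure that matters for TEC-VQA.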

Method¶

Overall Architecture¶

The construction of MTVQA follows three phases: (1) multi-source text-rich image collection; (2) two-round human expert annotation (maker-checker paradigm); and (3) large-scale MLLM benchmarking.

Key Designs¶

Image Collection and Cleaning:
- Sources comprise three channels: public datasets (e.g., ICDAR MLT19, 30%), web crawling (Common Crawl, 20%), and on-site photography (50%).
- On-site photography was conducted in countries/regions where each language is natively spoken, ensuring the authenticity and high quality of the images.
- Multilingual OCR engines were used to keep only images that contain text, and automatic filters removed sensitive or inappropriate content.
- The collection covers over 20 fine-grained scenarios: menus, maps, bills, slides, academic papers, etc.
- The final set contains 2,116 test images and 6,678 training images.
Two-Round Human Expert Annotation:
- Annotator Qualifications: Native speakers who have used the target language for over 10 years and hold a bachelor's degree or higher.
- Round 1 (Generation): For each image, 3 annotators generate 5 QA pairs. The first 3 pairs require directly reading the text in the image to answer, while the remaining 2 require reasoning based on the visual text.
- Round 2 (Verification): Another group of 2 annotators independently reviews each QA pair for relevance, accuracy, conciseness, and ethical compliance.
- A 10% random sampling check was implemented; failed batches were sent back for re-annotation.
- Although this design was costly (approximately $90,000 and over 5 months of work), it ensured high annotation quality.
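
The two-round workflow with a 10% spot check can be sketched as follows. This is a minimal illustration, not the authors' tooling: the `QAPair` class, the `audit_batch` function, and both acceptance rules (every checker must approve; one failed sample rejects the batch) are assumptions made for the example.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class QAPair:
    question: str
    answer: str
    # One verdict per Round 2 checker; True means the pair was approved.
    verdicts: list[bool] = field(default_factory=list)

    def passed_review(self) -> bool:
        # Assumption: a pair is kept only if every checker approved it.
        return bool(self.verdicts) and all(self.verdicts)


def audit_batch(batch: list[QAPair], rate: float = 0.10, seed: int = 0) -> bool:
    """Spot-check a random sample of the batch; False means re-annotate."""
    if not batch:
        return True
    sample_size = max(1, math.ceil(rate * len(batch)))
    sample = random.Random(seed).sample(batch, sample_size)
    # Assumption: a single failed pair in the sample rejects the whole batch.
    return all(pair.passed_review() for pair in sample)


good = [QAPair(f"q{i}", f"a{i}", [True, True]) for i in range(20)]
bad = [QAPair(f"q{i}", f"a{i}", [True, False]) for i in range(20)]
print(audit_batch(good))  # True
print(audit_batch(bad))   # False
```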
9-Language Coverage:
- Arabic (AR), Korean (KO), Japanese (JA), Thai (TH), Vietnamese (VI), Russian (RU), French (FR), German (DE), and Italian (IT).
- Covers multiple writing systems (Arabic, Hangul, Japanese scripts, Thai, Latin, and Cyrillic).
- Final dataset contains 28,607 QA pairs across 8,794 images.

Evaluation Design¶

Accuracy is adopted as the primary metric instead of ANLS, because ANLS awards partial credit to near-matching strings and therefore cannot accurately reflect whether the textual content in an image was read correctly.
A unified prompt format is used to constrain output length, ensuring concise and evaluable answers.
Human baselines are evaluated using 10 native speakers per language.
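
The difference between the two metrics can be seen in a small sketch. It assumes a strict, normalized exact-match rule for accuracy and the conventional ANLS threshold of 0.5; the summary does not spell out the paper's exact matching rule, so `normalize` and `exact_match` are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    # Assumption: case and surrounding whitespace are ignored when comparing.
    return text.strip().casefold()


def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def anls_score(prediction: str, reference: str, threshold: float = 0.5) -> float:
    """Per-sample ANLS: 1 - normalized edit distance, zeroed past the threshold."""
    p, r = normalize(prediction), normalize(reference)
    if not p and not r:
        return 1.0
    nl = levenshtein(p, r) / max(len(p), len(r))
    return 1.0 - nl if nl < threshold else 0.0


# A one-digit misreading of a price: still wrong, but ANLS gives partial credit.
print(exact_match("1250", "1280"))  # 0.0
print(anls_score("1250", "1280"))   # 0.75
```

A misread digit on a bill is simply a wrong answer, yet ANLS scores it at 0.75, which illustrates why a stricter metric is preferred here.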

Key Experimental Results¶

Main Results—MLLM Multilingual TEC-VQA Performance (Accuracy %)¶

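| Model | AR | DE | FR | IT | JA | KO | RU | TH | VI | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Human | 76.9 | 80.2 | 84.1 | 78.0 | 79.1 | 81.7 | 76.3 | 78.4 | 82.8 | 79.7 |
| InternVL2.5-78B | 15.9 | 39.0 | 45.6 | 42.9 | 21.1 | 33.9 | 12.2 | 23.8 | 41.5 | 32.2 |
| Qwen2-VL-72B | 20.7 | 36.5 | 44.1 | 42.8 | 21.6 | 37.4 | 15.6 | 17.7 | 41.6 | 30.9 |
| GPT-4o | 20.2 | 34.2 | 41.2 | 32.7 | 20.0 | 33.9 | 11.5 | 22.5 | 34.2 | 27.8 |
| Claude3 Opus | 15.1 | 33.4 | 40.6 | 34.4 | 19.4 | 27.2 | 13.0 | 19.5 | 29.1 | 25.7 |
| TextSquare | 3.7 | 27.0 | 30.8 | 26.7 | 3.2 | 7.2 | 6.7 | 5.2 | 12.4 | 13.6 |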
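
To see where the human-model gap is concentrated, the per-language differences can be computed directly from the Human and InternVL2.5-78B rows above. The snippet is a small arithmetic sketch; the variable names are illustrative.

```python
# Per-language accuracy (%) copied from the results table above.
human = {"AR": 76.9, "DE": 80.2, "FR": 84.1, "IT": 78.0, "JA": 79.1,
         "KO": 81.7, "RU": 76.3, "TH": 78.4, "VI": 82.8}
internvl = {"AR": 15.9, "DE": 39.0, "FR": 45.6, "IT": 42.9, "JA": 21.1,
            "KO": 33.9, "RU": 12.2, "TH": 23.8, "VI": 41.5}

# Gap in percentage points, rounded to the table's precision.
gaps = {lang: round(human[lang] - internvl[lang], 1) for lang in human}

for lang, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {gap:.1f} points")

widest = max(gaps, key=gaps.get)
narrowest = min(gaps, key=gaps.get)
print(widest, narrowest)  # RU IT
```

On these numbers the gap is widest for Russian (64.1 points) and narrowest for Italian (35.1 points).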

Key Contrastive Analysis¶

| Dimension | Finding |
|---|---|
| Human vs. Best MLLM | 79.7% vs. 32.2%, a gap of 47.5 percentage points, indicating huge room for improvement |
| Latin vs. Non-Latin Scripts | DE/FR/IT are generally higher than AR/JA/TH/RU, due to training data bias |
| Text-Specific vs. General MLLMs | TextMonkey (9.9%) < MiniCPM (17.3%) < InternVL (32.2%); text-specific models lag behind because they focus only on English and Chinese |
| OCR+GPT-4 vs. GPT-4V | 21.6% vs. 22.0%, each with its own pros and cons |
| OCR+GPT-4V | 28.3%, the best combination |
| Instruction Tuning Gain | Xcomposer-4KHD: 11.2% → 19.7% (+8.5 points) |
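
The script-level contrast can be checked with a short calculation over the per-language accuracies reported in the main results. The grouping below is an assumption made for illustration: Vietnamese is counted as Latin-script and Korean as non-Latin, which extends the DE/FR/IT vs. AR/JA/TH/RU comparison in the table.

```python
from statistics import mean

# Per-language accuracy (%) for two models, copied from the main results.
scores = {
    "Qwen2-VL-72B": {"AR": 20.7, "DE": 36.5, "FR": 44.1, "IT": 42.8, "JA": 21.6,
                     "KO": 37.4, "RU": 15.6, "TH": 17.7, "VI": 41.6},
    "GPT-4o": {"AR": 20.2, "DE": 34.2, "FR": 41.2, "IT": 32.7, "JA": 20.0,
               "KO": 33.9, "RU": 11.5, "TH": 22.5, "VI": 34.2},
}

# Assumed grouping by writing system (not defined in the source).
LATIN = ("DE", "FR", "IT", "VI")
NON_LATIN = ("AR", "JA", "KO", "RU", "TH")


def group_means(per_lang: dict[str, float]) -> tuple[float, float]:
    """Return (Latin-script mean, non-Latin-script mean) as unweighted averages."""
    latin = mean(per_lang[lang] for lang in LATIN)
    non_latin = mean(per_lang[lang] for lang in NON_LATIN)
    return latin, non_latin


for model, per_lang in scores.items():
    latin, non_latin = group_means(per_lang)
    print(f"{model}: Latin {latin:.1f} vs. non-Latin {non_latin:.1f}")
```

These are unweighted means over languages, so they need not line up with the reported Avg column, which may be computed over all QA pairs.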

Error Analysis Statistics¶

| Error Type | Ratio |
|---|---|
| OCR Recognition Failure | 39% |
| Insufficient Reasoning | 34% |
| Language Bias | 15% |
| Hallucination | 12% |

Key Findings¶

All models perform significantly worse on non-Latin script languages, especially Arabic and Russian.
Open-source models in the Qwen2-VL and InternVL series are already capable of outperforming GPT-4V/GPT-4o.
Text-specific MLLMs lag behind general models in multilingual scenarios due to their excessive focus on English and Chinese.
Few-shot improvements are limited and tend to saturate (zero-shot 22.0% → 5-shot 24.8%).
Asking questions in English versus the native language makes almost no difference, indicating the bottleneck lies in visual text perception rather than language understanding.

Highlights & Insights¶

Precise Problem Definition: The paper clearly points out the fundamental flaw of translation-based approaches in TEC-VQA (vision-text misalignment) and addresses it with high-quality manual annotation.
Broad Coverage: Spanning 9 languages, 20+ scenario types, and dual coverage of both document and natural scenes, it represents the most comprehensive multilingual TEC-VQA benchmark to date.
Revealing Gap: The gap between 32.2% and 79.7% shows that MLLMs remain far below human performance in multilingual text understanding.
OCR Failures Account for 39%: This suggests that improving visual text perception (rather than language understanding) is the true key.

Limitations & Future Work¶

Still Limited Language Coverage: The 9 covered languages still leave out many low-resource languages (e.g., Hindi, Swahili).
Single Answer Format: Only short answers are requested, lacking reasoning questions that require longer explanations.
Limitations of Evaluation Metric: Accuracy requires exact matching, which might be overly strict for morphologically rich languages such as Arabic.
High Cost Barrier: The human annotation cost of approximately $90,000 makes rapid expansion of language coverage difficult.
Small Training Set Size: The scale of 6,678 training images limits the effectiveness of instruction tuning.

Related Work & Insights¶

General Multilingual VQA: xGQA (7 languages) and MaXM (7 languages) employ translation schemes, which are only suitable for scenarios that do not rely on visual text in images.
TEC-VQA Benchmarks: TextVQA, DocVQA, and OCRBench are predominantly in English.
Core Insight from MTVQA: Constructing TEC-VQA benchmarks for low-resource languages requires native annotation; translation-based approaches do not work.

Rating¶

Novelty: ⭐⭐⭐⭐ — The first natively annotated multilingual TEC-VQA benchmark.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Evaluates over 20 models (including both closed-source and open-source), complete with human baselines and detailed error analysis.
Writing Quality: ⭐⭐⭐⭐ — Detailed description of the dataset construction process and deep experimental analysis.
Value: ⭐⭐⭐⭐⭐ — Fills an important gap in multilingual TEC-VQA and reveals massive room for improvement in MLLMs.