MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation

Conference: CVPR 2026
arXiv: 2603.23896
Code: None (MMTIT-Bench planned for release)
Area: Multimodal VLM / Machine Translation
Keywords: Text-Image Translation, Multilingual Benchmark, Chain-of-Thought, Cognition-Perception-Reasoning, VLLM Evaluation

TL;DR

Constructed MMTIT-Bench, a multilingual, multi-scenario text-image machine translation benchmark covering 14 languages other than English and Chinese. Proposed the CPR-Trans data paradigm (Cognition → Perception → Translation Reasoning), which substantially improves end-to-end translation quality on 3B and 7B models: the fine-tuned 7B model outscores the 235B Qwen3-VL-Thinking model in both translation directions and is competitive with Gemini 2.5 Flash.

Background & Motivation

Background: Text-Image Machine Translation (TIMT) aims to directly translate the text content of images. With the advancement of VLLMs, end-to-end TIMT is displacing traditional OCR+NMT cascades. However, existing research primarily focuses on English-Chinese pairs and on evaluation in simple scenarios such as digital documents.
Limitations of Prior Work: (1) Lack of a benchmark covering multiple languages and scenarios—existing datasets cover at most 4 languages (MTIT6) with limited scene diversity; (2) The Chain-of-Thought (CoT) reasoning paradigm for TIMT is underdeveloped—existing methods either cascade OCR and translation or perform pure linguistic reasoning, neglecting visual cognition.
Key Challenge: VLLMs perform well on high-resource languages, but their robustness in low-resource languages and complex visual scenes (menus, posters, street views) remains unknown, and there is no suitable benchmark for systematic evaluation.
Goal: (1) Construct a TIMT benchmark covering multiple languages and scenarios; (2) Design a reasoning data paradigm suitable for TIMT.
Key Insight: Simulate the human translation process: understand the scene (Cognition) → recognize the text (Perception) → reason about the translation (Reasoning), and turn these steps into structured CoT supervision.
Core Idea: Use a three-stage structured reasoning chain of Cognition-Perception-Reasoning to guide end-to-end text-image translation.

Method

Overall Architecture

The work consists of two parallel pipelines. One is MMTIT-Bench construction: real images in 14 languages pass through four steps: "Image Collection → OCR Processing & Annotation → Translation Annotation → Final Audit/Filtering." This yields 1,400 high-quality evaluation samples (100 per language, with the remaining annotated data used for training), scored with a Dual-Protocol Evaluation (VLLM Judge + COMET). The other is the CPR-Trans data paradigm: Qwen3-VL-235B generates a three-stage structured CoT ("Cognition → Perception → Translation Reasoning") on the training data, which serves as SFT supervision for end-to-end TIMT models. The trained models are then evaluated on MMTIT-Bench.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}}%%
flowchart TD
    A["Real images in 14 languages<br/>Menus/Posters/Documents etc. ~14000 imgs"]
    subgraph BENCH["MMTIT-Bench Construction"]
        direction TB
        B1["Image Collection + Human Verification"] --> B2["OCR Annotation<br/>Gemini 2.5 Flash + Human Verification"]
        B2 --> B3["Translation Annotation<br/>Multi-temperature Sampling + 3-Model Voting"]
        B3 --> B4["Final Filtering/Audit<br/>100 samples per lang → 1400 total"]
    end
    A --> BENCH
    BENCH -->|Official Test set| EVALSET["Test Set 1400 Samples<br/>ZH/EN Dual References"]
    BENCH -->|Remaining Data| TRAIN["Training Corpus"]
    subgraph CPR["CPR-Trans Generation (Qwen3-VL-235B)"]
        direction TB
        C1["Cognition<br/>Describe global scene, no text recognition"] --> C2["Perception<br/>Analyze text layout and reading order"]
        C2 --> C3["Translation Reasoning<br/>Integrate vision + text for translation"]
    end
    TRAIN --> CPR
    CPR --> SFT["SFT Training Qwen2.5-VL-3B/7B<br/>&lt;think&gt; Reasoning Chain + &lt;answer&gt; Translation"]
    SFT --> EVAL
    EVALSET --> EVAL
    subgraph EVAL["Dual-Protocol Evaluation"]
        direction LR
        E1["VLLM Judge<br/>Faithfulness/Fluency/Readability/Terminology"]
        E2["Automatic Metric (COMET)"]
    end

Key Designs

1. MMTIT-Bench: A 14-Language, Rigorously Annotated Image Translation Benchmark
To address the limited language and scenario coverage of existing benchmarks, the authors manually collected ~14,000 real-world images containing text across 14 languages (German, Turkish, Vietnamese, Korean, Malay, Russian, French, Indonesian, etc.). After collection, high-quality samples were produced in three further steps: OCR annotation assisted by Gemini 2.5 Flash and verified by humans (supporting Markdown tables and LaTeX formulas) → translation annotation via multi-temperature sampling and three-model voting (Gemini, Seed1.6, Qwen3-VL) → selection of 100 samples per language with a final audit by language experts. Each image carries both English and Chinese reference translations, so both the other→en and other→zh directions can be evaluated.

2. CPR-Trans Data Paradigm: Decomposing Translation into Cognition → Perception → Translation Reasoning
This core training paradigm targets the quality of supervision for end-to-end TIMT. Direct translation supervision drops the OCR perception step (the model never "sees" its own recognition process), while Simple CoT (OCR followed by translation) adds no reasoning beyond the recognized text. CPR-Trans mimics how a human translator works: <cognition> describes the global visual scene without recognizing text → <perception> analyzes the spatial layout and reading order of the text → <trans> integrates visual and textual understanding to reason about the translation. The chain is wrapped in <think></think>, and the final translation in <answer></answer>. For a menu image, for example, cognition identifies a menu with appetizer and main-course sections, perception locates the individual items, and trans translates them in order, giving the translation an interpretable visual basis.

3. Dual-Protocol Evaluation: Model Judge + Automatic Metric
To reduce the bias of any single evaluation method, the authors combine: (a) VLLM judging with Gemini 2.5 Flash and Qwen3-VL-235B across four dimensions: Faithfulness, Fluency, Readability, and Terminology consistency; (b) reference-based automatic scoring with COMET. High consistency between the two protocols supports the reliability of the evaluation.
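The dual-protocol idea can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's procedure: the unweighted mean over the four judged dimensions, the use of Spearman rank correlation as the consistency check, and all the numbers, which are made up for illustration.

```python
from statistics import mean

# The four dimensions named in the summary; equal weighting is an assumption.
DIMENSIONS = ("faithfulness", "fluency", "readability", "terminology")

def judge_score(dim_scores):
    """Collapse per-dimension judge scores into one number (unweighted mean)."""
    return mean(dim_scores[d] for d in DIMENSIONS)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical scores for three systems (illustration only, not paper data).
judge = [judge_score(s) for s in [
    {"faithfulness": 80, "fluency": 85, "readability": 82, "terminology": 78},
    {"faithfulness": 60, "fluency": 70, "readability": 65, "terminology": 62},
    {"faithfulness": 70, "fluency": 75, "readability": 72, "terminology": 74},
]]
comet = [0.84, 0.71, 0.78]  # placeholder COMET system scores
agreement = spearman(judge, comet)
print(f"judge scores: {judge}, rank agreement with COMET: {agreement:.2f}")
```

A rank correlation is used here because the judge and COMET live on different scales; only the ordering of systems needs to match for the two protocols to agree.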

Loss & Training

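Training Data: 12,600 human-annotated samples + 70,000 SynthDog synthetic samples, i.e. 82,600 images. The reported total of 165,200 aligned multimodal samples is exactly twice that figure, which is consistent with each image being paired with both an English and a Chinese target.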
SFT based on Qwen2.5-VL-3B and 7B models.
CPR-Trans reasoning chains generated in stages by Qwen3-VL-235B.
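The staged generation and the resulting SFT target can be sketched as follows. This is a rough illustration under stated assumptions: the stage prompts, the `teacher` callable (standing in for a call to a teacher VLLM such as Qwen3-VL-235B), the closing tags for the three inner stages, and the helper names are all hypothetical. Only the tag names `<cognition>`, `<perception>`, `<trans>`, `<think>` and `<answer>` come from the description above.

```python
import re
from typing import Callable, Dict, Optional

STAGES = ("cognition", "perception", "trans")

# Hypothetical stage instructions; the actual prompts are not given here.
STAGE_PROMPTS = {
    "cognition": "Describe the overall scene of the image. Do not transcribe any text.",
    "perception": "Analyze the layout and reading order of the text in the image.",
    "trans": "Using the scene and layout notes, reason about the translation into {target}.",
}

def generate_cpr_chain(teacher: Callable[[str, str], str],
                       image_path: str, target: str) -> Dict[str, str]:
    """Query the teacher once per stage, feeding earlier stages back as context."""
    chain: Dict[str, str] = {}
    for stage in STAGES:
        context = "\n".join(f"[{name}] {text}" for name, text in chain.items())
        instruction = STAGE_PROMPTS[stage].format(target=target)
        chain[stage] = teacher(image_path, f"{context}\n{instruction}".strip())
    return chain

def build_sft_target(chain: Dict[str, str], translation: str) -> str:
    """Wrap the three stages in <think> and the final translation in <answer>."""
    body = "".join(f"<{s}>{chain[s]}</{s}>" for s in STAGES)
    return f"<think>{body}</think><answer>{translation}</answer>"

def extract_answer(model_output: str) -> Optional[str]:
    """Pull the final translation out of a model response, if present."""
    match = re.search(r"<answer>(.*?)</answer>", model_output, flags=re.DOTALL)
    return match.group(1).strip() if match else None

def stub_teacher(image_path: str, prompt: str) -> str:
    """Stand-in for a real teacher model call; returns a placeholder string."""
    return f"placeholder reply for {image_path} ({len(prompt)} prompt chars)"

chain = generate_cpr_chain(stub_teacher, "menu.jpg", "English")
target = build_sft_target(chain, "Grilled fish with lemon")
print(extract_answer(target))
```

Feeding each stage's output into the next prompt mirrors the Cognition → Perception → Translation Reasoning order, and `extract_answer` shows how only the `<answer>` span would be scored at evaluation time.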

Key Experimental Results

Main Results

Performance of various models on MMTIT-Bench (Gemini-Flash Judge, other→en / other→zh):

Model	Params	Think	other2en	other2zh
Cascade (MinerU+Qwen3)	-	-	48.32	49.70
Qwen3-VL-Instruct	235B	-	64.39	69.67
Qwen3-VL-Thinking	235B	✓	73.81	77.90
Gemini 2.5 Flash	-	✓	82.94	85.00
Ours (Qwen2.5-VL + CPR-Trans)	7B	✓	83.98	82.84

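The 7B model + CPR-Trans exceeds Gemini 2.5 Flash in the other→en direction (83.98 vs. 82.94) but trails it in other→zh (82.84 vs. 85.00); it outscores Qwen3-VL-Thinking (235B) in both directions.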

Ablation Study

Comparison of different data paradigms (7B model, Gemini-Flash Judge):

Paradigm	other2en	other2zh	Description
Origin (No Fine-tuning)	53.98	46.89	Baseline
Direct (Direct Translation)	68.40	62.42	Loses perception capability
Simple CoT (OCR+Trans)	74.65	71.03	Lacks reasoning
Distillation (VLLM)	71.90	69.91	Native chain of thought
CPR-Trans	83.98	82.84	Structured reasoning (Best)

Ablation of reasoning components (7B, Gemini judge other2en):

Cognition	Perception	Trans	Score
-	-	-	74.65 (Simple CoT baseline)
✓	-	-	76.91
-	-	✓	80.73
✓	-	✓	82.11
-	✓	✓	81.90
✓	✓	✓	83.98

Key Findings

Translation Reasoning (Trans) contributes most (+6.08 over the 74.65 Simple CoT baseline), indicating that explicit translation reasoning is core to performance.
Cognition improves performance by +2.26 when added alone, as global context helps disambiguate translations.
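Perception alone reportedly has a negligible effect (about 0.22 points; this configuration has no row in the table above). Its value lies in providing structured text information for the subsequent reasoning step: adding it on top of Trans raises the score from 80.73 to 81.90, and on top of Cognition + Trans from 82.11 to 83.98.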
Thinking mode consistently outperforms non-thinking mode within the same model family (e.g., Qwen3-VL-235B: 73.81 vs. 64.39 on other→en and 77.90 vs. 69.67 on other→zh), supporting the importance of explicit reasoning for TIMT.
The cascade solution (MinerU OCR + Qwen3) scores far below end-to-end models (48.32 / 49.70) due to severe error propagation in complex scenes.
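The component-level findings can be checked directly against the ablation table. The short script below is only a convenience sketch: the scores are copied from the table, while the dictionary layout and the `marginal_gains` helper are illustrative choices.

```python
# Scores copied from the component ablation table (7B, Gemini judge, other2en).
# Keys are (cognition, perception, trans) on/off flags.
SCORES = {
    (False, False, False): 74.65,  # Simple CoT baseline
    (True, False, False): 76.91,
    (False, False, True): 80.73,
    (True, False, True): 82.11,
    (False, True, True): 81.90,
    (True, True, True): 83.98,
}
INDEX = {"cognition": 0, "perception": 1, "trans": 2}

def marginal_gains(component):
    """Gain from switching `component` on, for each pair of rows differing only in it."""
    idx = INDEX[component]
    gains = {}
    for config, score in SCORES.items():
        if config[idx]:
            continue  # component already on in this row
        enabled = tuple(True if i == idx else flag for i, flag in enumerate(config))
        if enabled in SCORES:  # skip configurations the table does not report
            gains[config] = round(SCORES[enabled] - score, 2)
    return gains

for name in INDEX:
    print(name, marginal_gains(name))
```

Run on these numbers, the Trans gain over the baseline is 6.08 and the Cognition-only gain is 2.26, matching the text. Perception can only be assessed in combination: it adds 1.17 on top of Trans and 1.87 on top of Cognition + Trans.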

Highlights & Insights

Small Model Beats Large Model: The 7B CPR-Trans model outperformed the 235B Qwen3-VL-Thinking model in both directions and Gemini 2.5 Flash in the other→en direction, suggesting that high-quality reasoning data can matter more than model scale. This is a useful result for resource-constrained deployments.
Generality of CPR Paradigm: Decomposing complex tasks into cognition, perception, and reasoning is applicable not only to TIMT but also to document understanding, OCR error correction, and other tasks requiring joint vision-language reasoning.
Benchmark Methodology: The labeling pipeline (multi-model voting + human audit) and the dual-track evaluation provide a template for building future multimodal benchmarks.

Limitations & Future Work

The 14 languages are still primarily mid-to-high resource; truly low-resource languages (e.g., Burmese, Swahili) are missing.
The statistical significance may be limited by having only 100 test samples per language.
The quality of CPR-Trans chains depends on the capability of the 235B teacher model used for distillation.
RL fine-tuning (e.g., GRPO/DPO) was not explored to further enhance reasoning quality.
Domain shift exists between synthetic data (SynthDog) and real-world scenarios.
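The concern about 100 test samples per language can be made concrete with a percentile bootstrap. The sketch below uses synthetic scores (mean 80, standard deviation 15, clipped to 0 to 100), because per-sample scores and their variance are not reported in this summary; the function name and parameters are illustrative.

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Synthetic per-sample judge scores for one language (illustration only).
data_rng = random.Random(42)
scores = [min(100.0, max(0.0, data_rng.gauss(80, 15))) for _ in range(100)]

low, high = bootstrap_ci(scores)
print(f"mean={mean(scores):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

With a per-sample standard deviation around 15, the standard error at n = 100 is roughly 1.5 points, so per-language gaps of one or two points would be hard to separate from noise, whereas aggregate scores over all 1,400 samples are considerably tighter.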

Comparison with Related Work

vs MTIT6: Extends from 4 languages (1,200 samples) to 14 languages (1,400 samples), covering more scene types and longer text (160 words vs 7 words avg).
vs DoTA/PATIMT: These focus only on English-Chinese document translation, whereas MMTIT-Bench’s multi-scenario design (menus, posters) is closer to real-world usage.
vs R1-style thinking: While native long CoT is effective, it is often uncontrollable and redundant. The structured design of CPR-Trans provides precise guidance and avoids repetitive "reflection" loops.

Rating

Novelty: ⭐⭐⭐⭐ The CPR-Trans paradigm is well-designed, and the benchmark construction is rigorous, though the core idea of structured CoT is not entirely new.
Experimental Thoroughness: ⭐⭐⭐⭐ Extensive model evaluations and detailed ablations, though fine-grained cross-lingual analysis is missing.
Writing Quality: ⭐⭐⭐⭐ Clear structure and informative charts, though handling both the benchmark and the method makes for a dense paper.
Value: ⭐⭐⭐⭐ Fills a gap in multilingual TIMT evaluation; the CPR-Trans paradigm has broad transferability.