Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=zCYXhSy9UH
Code/Data: TBD
Area: Multimodal / Vision-Language Model Evaluation
Keywords: Multilingual, Multimodal, VLM Evaluation, In-language Exams, Cultural Inclusivity, MCQA

TL;DR

KALEIDOSCOPE, built through global open-science collaboration, manually collects 20,911 real-world multiple-choice exam questions across 18 languages and 14 subjects (55% requiring visual context). It establishes the largest "in-language" multilingual multimodal VLM benchmark to date, revealing systematic deficiencies in current VLMs regarding low-resource languages, multimodal reasoning, and STEM subjects.

Background & Motivation

Background: VLM performance evaluation has long been dominated by English-centric and Western-centric benchmarks. While multilingual text evaluation has expanded and multimodal benchmarks are emerging, the intersection—reliable evaluation that is both multilingual and multimodal—remains scarce.

Limitations of Prior Work: Common shortcuts involve translating English benchmarks into other languages, which suffers from two fundamental flaws: (1) Translation fails to capture cultural context and local knowledge, instead solidifying Western-centric assumptions; (2) Automated pipelines amplify noise such as "translationese," contaminating evaluation signals. "In-language" benchmarks reflecting regional culture and knowledge are largely missing.

Key Challenge: Frontier generative models are rapidly expanding across modalities and languages, claiming to represent a diverse world. However, the metrics used to measure them remain monolingual and monocultural, creating a systematic misalignment between capability claims and evaluation coverage.

Goal: To construct a large-scale, in-language, multimodal, and culturally authentic exam-based benchmark that tests VLMs as humans are tested globally, thereby diagnosing gaps across linguistic, modal, and disciplinary dimensions.

Core Idea: [In-language Exams] eschew translation in favor of manual collection by native speakers and experts from official national exams, question banks, and government websites, preserving the original linguistic and cultural context. [Global Open Science Collaboration] draws on contributors across 20 countries and four continents to ensure linguistic and cultural authenticity. [Image-grounded MCQA] adopts a unified 4-option multiple-choice format close to human testing; 55% of the questions can be answered only with the accompanying image, making visually grounded reasoning the core testing point.

Method

Overall Architecture

KALEIDOSCOPE is a "benchmark + evaluation protocol" rather than a model. The construction pipeline has three stages: collection of real exam questions by the global community; automated parsing plus human refinement to convert PDFs, web pages, and scans into structured JSON; and triple-pass quality control. On the evaluation side, closed-source VLMs are prompted with zero-shot CoT and open-weight VLMs with a direct-answer protocol; accuracy is the metric in both cases.

flowchart TD
    A[Global Open Science Collaboration<br/>20-Country Native Speakers + Experts] --> B[Collect Real Exam Questions<br/>Official Exams/Banks/Gov Websites<br/>License Tracing]
    B --> C[Stage 1: Automated Parsing<br/>PDF/Web Parsers + Mathpix OCR + GPT-4o<br/>→ LaTeX/Markdown/JSON]
    C --> D[Stage 2: Human Refinement<br/>Heuristic Rules + Claude/GPT-4o<br/>Align Stem-Image-Options]
    D --> E[Triple Quality Control<br/>Dual-annotator Acceptance → Script Validation → Dual-verifier Final Audit]
    E --> F[KALEIDOSCOPE<br/>18 Languages/14 Subjects/20,911 Items/55% Multimodal]
    F --> G[Evaluation Protocols<br/>Closed-source: Zero-shot CoT + ANSWER Tag<br/>Open-source: JSON Direct Answer]

Key Designs

1. Three Design Principles: Multimodality, Multilinguality, Diversity

The benchmark enforces strict data selection. Multimodality places images at the core (11,459/20,911 items require visuals, covering diagrams, photos, maps, formulas, and tables). Multilinguality focuses on both low-resource (Nepali, Lithuanian, Bengali, Telugu, etc.) and high-resource languages (English, Spanish, Portuguese, Russian, French, German, Arabic, Hindi, Dutch) across 8 language families. Diversity spans 14 subjects and 6 domains (e.g., Mathematics, Sociology, Medicine, Driving Tests), tracking education levels (High School/University Entrance/Vocational) for fine-grained analysis.

2. Two-Stage Annotation Pipeline: Automated Parsing + Human Refinement

Source formats vary (PDFs, web pages, scans). Stage one uses PDF/web parsers or OCR APIs (e.g., Mathpix) together with GPT-4o to extract text and image elements into LaTeX/Markdown/JSON. Stage two uses heuristic rules and high-performance LLMs (Claude 3.5 Sonnet, GPT-4o) to restructure the parsed output so that question stems, images, and options are correctly aligned, followed by manual verification of image-item bindings and formula formats. Each item contains 17 fields, including source country, language, and subject labels in both English and the source language.

3. Triple-Pass Manual Quality Control + Failure Mode Review

To mitigate risks in international collaboration, three checkpoints are implemented: (1) acceptance by two independent annotators (including license compliance); (2) automated scripts that check for JSON errors and duplicates; (3) a final manual audit by two independent verifiers. During evaluation, suspicious outputs (ambiguous, empty, or consistent cross-model failures) are manually reviewed, leading to the correction or removal of problematic items.
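The script-validation checkpoint can be illustrated with a minimal sketch. The field names in `REQUIRED_FIELDS`, the integer answer index, and the duplicate key are assumptions made for illustration; the source only states that scripts check for JSON errors and duplicates and that each item has 17 fields.

```python
import json

# Hypothetical subset of the 17 fields; the real schema is not given in the source.
REQUIRED_FIELDS = {"question", "options", "answer", "language", "country", "subject"}


def validate_items(raw_lines):
    """Return (valid_items, problems) for an iterable of JSON strings.

    `problems` is a list of (line_index, message) pairs.
    """
    valid, problems, seen = [], [], set()
    for idx, line in enumerate(raw_lines):
        try:
            item = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((idx, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(item, dict):
            problems.append((idx, "not a JSON object"))
            continue
        missing = REQUIRED_FIELDS - set(item)
        if missing:
            problems.append((idx, f"missing fields: {sorted(missing)}"))
            continue
        options = item["options"]
        if not isinstance(options, list) or len(options) != 4:
            problems.append((idx, "expected exactly 4 options"))
            continue
        # Assumed duplicate key: language + whitespace/case-normalized stem + options.
        stem = " ".join(str(item["question"]).split()).casefold()
        key = (item["language"], stem, tuple(str(o) for o in options))
        if key in seen:
            problems.append((idx, "duplicate item"))
            continue
        seen.add(key)
        valid.append(item)
    return valid, problems
```

Items flagged here would go back to annotators rather than being silently dropped, which matches the manual audit that follows in the third checkpoint.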

4. Split Evaluation Protocols for Open/Closed Models

Closed-source models use zero-shot Chain-of-Thought (CoT) with prompts translated into the question's language, and must place the final answer inside <ANSWER></ANSWER> tags. Because small open-source models gained little from CoT in preliminary tests, they answer directly, with English instructions and a mandatory {'choice': ...} JSON structure. Reported metrics are Accuracy, Format Error (F.E.), and Valid Accuracy (Valid Acc.).
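A minimal sketch of how the two output formats could be parsed and scored. It assumes the options are labelled A to D, that format errors count as wrong in Accuracy, and that Valid Accuracy is computed over parseable responses only; these definitions and all function names are illustrative and not taken from the paper's code.

```python
import ast
import json
import re

LETTERS = {"A", "B", "C", "D"}  # assumed option labels
ANSWER_TAG = re.compile(r"<ANSWER>\s*([A-D])\s*</ANSWER>", re.IGNORECASE)


def parse_cot_answer(text):
    """Return the letter in the last <ANSWER>...</ANSWER> tag, or None."""
    matches = ANSWER_TAG.findall(text)
    return matches[-1].upper() if matches else None


def parse_json_answer(text):
    """Return the 'choice' letter from a JSON-like dict, or None on format error."""
    # Try strict JSON first, then a Python literal to accept single quotes.
    for loader in (json.loads, ast.literal_eval):
        try:
            obj = loader(text.strip())
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(obj, dict):
            choice = str(obj.get("choice", "")).upper()
            if choice in LETTERS:
                return choice
    return None


def score(predictions, gold):
    """predictions: parsed letters (None = format error); gold: correct letters."""
    total = len(gold)
    valid = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    correct = sum(p == g for p, g in valid)
    return {
        "accuracy": 100 * correct / total,
        "format_error": 100 * (total - len(valid)) / total,
        "valid_accuracy": 100 * correct / len(valid) if valid else 0.0,
    }
```

Under these assumed definitions, a model with many unparseable outputs shows a large difference between Accuracy and Valid Accuracy, which is the pattern the summary describes for GPT-4o.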

Key Experimental Results

Main Results Table (macro-averaged accuracy in %, with each language weighted equally)

| Model | Overall Acc. | F.E. | Multimodal Acc. | Text-only Acc. |
| --- | --- | --- | --- | --- |
| Claude 3.5 Sonnet | 62.91 | 1.78 | 55.63 | 73.54 |
| Gemini 1.5 Pro | 62.10 | 1.62 | 55.01 | 72.35 |
| GPT-4o | 58.32 | 6.52 | 49.80 | 71.40 |
| Qwen2.5-VL-72B | 52.94 | 0.02 | 48.40 | 60.00 |
| Qwen2.5-VL-32B | 48.21 | 0.88 | 44.90 | 53.77 |
| Qwen2.5-VL-7B | 39.56 | 0.08 | 36.85 | 43.91 |
| Aya-Vision-32B | 39.27 | 1.05 | 35.74 | 44.73 |
| Aya-Vision-8B | 35.09 | 0.07 | 32.35 | 39.27 |
| Qwen2.5-VL-3B | 35.56 | 0.19 | 33.67 | 38.51 |
| Molmo-7B-D | 32.87 | 0.04 | 31.43 | 35.12 |
| Pangea-7B | 31.31 | 7.42 | 27.15 | 37.84 |

Closed-source models lead (Claude/Gemini ~62%), yet even the strongest overall accuracy (63%) remains far from human-level performance. GPT-4o suffers from high format error rates in multimodal settings (10.5%), though its Valid Acc. recovers significantly. In the open-source camp, Qwen2.5-VL-72B is the strongest (52.94%).
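The macro-averaging used for the main results can be sketched as follows: accuracy is computed per language first, then averaged with equal weight, so a language with 126 items counts as much as one with 2,000. The record format and function name are assumptions for illustration.

```python
from collections import defaultdict


def macro_accuracy_by_language(records):
    """records: iterable of (language, is_correct) pairs.

    Returns (macro_accuracy_percent, per_language_percent).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, is_correct in records:
        total[language] += 1
        correct[language] += bool(is_correct)
    per_language = {lang: 100 * correct[lang] / total[lang] for lang in total}
    macro = sum(per_language.values()) / len(per_language)
    return macro, per_language


# Toy data: 8/10 correct in one language, 1/2 correct in another.
toy = [("en", True)] * 8 + [("en", False)] * 2 + [("ne", True), ("ne", False)]
macro, per_language = macro_accuracy_by_language(toy)
```

On this toy data the pooled (micro) accuracy would be 9/12 = 75%, while the macro average is (80 + 50) / 2 = 65%, which shows how equal weighting keeps smaller languages from being drowned out.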

Image Type Analysis (Valid Accuracy %)

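| Model | Diagram | Figure | Graph | Map | Photo | Formula | Table | Text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude 3.5 Sonnet | 62.9 | 50.5 | 74.2 | 80.1 | 77.8 | 52.1 | 75.0 | 85.2 |
| Gemini 1.5 Pro | 59.4 | 51.3 | 67.9 | 69.4 | 75.8 | 68.3 | 76.0 | 85.2 |
| GPT-4o | 59.6 | 48.2 | 68.4 | 78.8 | 81.5 | 64.4 | 76.5 | 86.2 |
| Qwen2.5-VL-72B | 51.1 | 43.9 | 59.4 | 66.1 | 70.5 | 48.7 | 61.5 | 86.0 |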

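The best per-category scores are high for Photos (81.5%, GPT-4o) and Tables (76.5%, GPT-4o) but drop to 62.9% for Diagrams (Claude 3.5 Sonnet), which require abstract visual reasoning. Figures are lower still, with no model above 51.3%, while text-only items reach 85% or more for all four models.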

Key Findings

  • Modality Gap: Text-only accuracy exceeds multimodal accuracy for every model. Stronger models tend to show wider gaps (GPT-4o: 21.6 percentage points vs. Molmo-7B-D: 3.69 percentage points).
  • Subject Gap: Humanities and Social Sciences yield an average accuracy of 83.7%, while STEM reaches only 59.2%. The analysis suggests that models can identify visual content and retrieve knowledge but lack the reasoning chains that STEM questions require.
  • Cross-lingual Gap: High-resource languages outperform low-resource ones; Latin-script languages generally outperform non-Latin scripts, suggesting cross-lingual transfer effects.
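The modality gap can be recomputed directly from the main results table above. The sketch below copies the Multimodal and Text-only columns and takes their difference in percentage points; the function name is illustrative.

```python
# (multimodal_acc, text_only_acc) in %, copied from the main results table.
RESULTS = {
    "Claude 3.5 Sonnet": (55.63, 73.54),
    "Gemini 1.5 Pro": (55.01, 72.35),
    "GPT-4o": (49.80, 71.40),
    "Qwen2.5-VL-72B": (48.40, 60.00),
    "Qwen2.5-VL-32B": (44.90, 53.77),
    "Qwen2.5-VL-7B": (36.85, 43.91),
    "Aya-Vision-32B": (35.74, 44.73),
    "Aya-Vision-8B": (32.35, 39.27),
    "Qwen2.5-VL-3B": (33.67, 38.51),
    "Molmo-7B-D": (31.43, 35.12),
    "Pangea-7B": (27.15, 37.84),
}


def modality_gaps(results):
    """Return {model: text_only - multimodal} in percentage points."""
    return {
        model: round(text_only - multimodal, 2)
        for model, (multimodal, text_only) in results.items()
    }


gaps = modality_gaps(RESULTS)
widest = max(gaps, key=gaps.get)    # model with the largest gap
narrowest = min(gaps, key=gaps.get) # model with the smallest gap
```

The widest gap belongs to GPT-4o and the narrowest to Molmo-7B-D, matching the figures quoted in the Modality Gap bullet. The trend with model strength is a tendency rather than a strict ordering: Pangea-7B, for instance, has a larger gap than Qwen2.5-VL-32B.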

Highlights & Insights

  • "Examination" as Evaluation Paradigm: Real-world exams (driving licenses, vocational certifications, university entrance) naturally provide human difficulty scales and culturally grounded reasoning requirements.
  • In-language vs. Translation: Prioritizing native-language collection directly addresses Western-centric bias and "translationese" contamination typical in translated benchmarks.
  • Three-dimensional Diagnosis: Metrics across Modality × Subject × Language, combined with 17 metadata fields, transform the benchmark from a leaderboard into a diagnostic tool for failure attribution.
  • Evaluation-to-Data Feedback: Integrating suspicious output reviews into the quality control loop is a best practice for large-scale crowdsourced benchmarks.

Limitations & Future Work

  • MCQA Ceiling: The 4-option format simplifies automated scoring but introduces a 25% random guess baseline and fails to assess open-ended generation or long-chain reasoning.
  • Imbalance: Sample sizes per language range from 126 to 2,000, limiting statistical reliability for certain low-resource languages (e.g., Nepali).
  • LLM Dependence: Using GPT-4o/Claude for parsing and refinement may introduce model-specific biases or favor related architectures in evaluation.
  • Contamination Risk: Publicly sourced data risks inclusion in pre-training corpora; continuous updates and contamination detection are necessary.
  • Outlook: Expansion to more language families, introduction of open-ended responses, human-AI comparative analysis, and utilizing diagnostic results to refine multimodal training data ratios.

Related Work & Insights

  • Multilingual Text Evaluation: Continues the lineage of Global-MMLU and INCLUDE with native-speaker participation, extending it to the multimodal domain.
  • Multimodal Benchmarks: Complements MMMU, MMBench, and Pangea, but goes further by combining native exam questions, cultural authenticity, and 18-language scale.
  • Insight: For developers of multilingual/multimodal models, evaluation must move beyond translating English benchmarks. STEM visual reasoning and non-Latin scripts are currently the weakest and most critical areas for training data investment.
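The sample-size limitation noted above (126 to 2,000 items per language) can be made concrete with a confidence interval. The sketch below uses the standard Wilson score interval for a binomial proportion. Applying it to this benchmark is an illustration, not an analysis from the paper, and the 50% accuracy is a placeholder value.

```python
import math


def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion `accuracy` (0..1) observed on n items.

    z=1.96 corresponds to an approximately 95% interval.
    """
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Placeholder accuracy of 50% at the smallest and largest per-language sizes.
low_small, high_small = wilson_interval(0.5, 126)
low_large, high_large = wilson_interval(0.5, 2000)
```

At 50% accuracy the interval is roughly ±9 percentage points for 126 items and about ±2 for 2,000, so differences of a few points between models on the smallest languages should be read with caution.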

Rating

  • Novelty: ⭐⭐⭐⭐ Innovative evaluation paradigm focusing on native-language authentic exams and global collaboration.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Extensive diagnosis across 11 models and multiple dimensions, though limited by sample imbalance in low-resource languages.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation, detailed pipeline descriptions, and well-organized visualizations.
  • Value: ⭐⭐⭐⭐⭐ Significant long-term community value as the largest multilingual multimodal exam benchmark to date.