OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models¶
Conference: ACL 2026 arXiv: 2604.20806 Code: GitHub Area: Multimodal VLM / LLM Evaluation Keywords: Multi-image reasoning, Olympiad-level reasoning, Vision-language model benchmark, Cross-image association, Scientific reasoning
TL;DR¶
This paper presents OMIBench — the first large-scale benchmark for olympiad-level multi-image reasoning, covering 1,000+ competition problems across biology, chemistry, mathematics, and physics. Even the strongest LVLM evaluated (Gemini-3-Pro) achieves only ~50% accuracy, an absolute drop of more than 25 points relative to single-image benchmarks.
Background & Motivation¶
Background: LVLMs have made significant progress on standard reasoning tasks, with chain-of-thought (CoT) prompting driving major breakthroughs on single-image olympiad benchmarks. Top models are approaching saturation on existing benchmarks such as OlympiadBench.
Limitations of Prior Work: (1) Existing olympiad-level multimodal benchmarks are almost entirely limited to single-image settings, whereas real scientific competitions frequently feature problems that require reasoning across multiple interrelated figures and experimental diagrams. (2) Existing multi-image benchmarks (e.g., MuirBench, MMIU) focus on perception and cross-image reference but are of limited difficulty and lack strong semantic/quantitative cross-image dependencies, making them insufficient for evaluating olympiad-level reasoning. (3) Expert-annotated reasoning paths are absent, precluding fine-grained analysis of where model reasoning fails.
Key Challenge: Olympiad-level multi-image reasoning requires models not only to understand individual images, but also to (1) maintain coherent information flow across images and (2) perform deep cross-image, cross-modal reasoning — a qualitative leap from perception to integrated reasoning that existing benchmarks cannot effectively evaluate.
Goal: To construct an olympiad-level multi-image reasoning benchmark spanning four major scientific disciplines, with expert-annotated reasoning paths and multiple evaluation protocols, systematically exposing the reasoning deficiencies of LVLMs in multi-image settings.
Key Insight: Real competition problems requiring joint multi-image reasoning are collected from international and national subject olympiads, rather than synthetic or simplified multi-image tasks.
Core Idea: Extending olympiad-level reasoning evaluation from single-image to multi-image settings — when evidence is distributed across multiple images, the reasoning difficulty undergoes a qualitative rather than merely quantitative change.
Method¶
Overall Architecture¶
OMIBench contains 1,000+ olympiad-level multi-image reasoning problems, with an average of 3.07 images per problem. Both multiple-choice and open-ended answer formats are supported. Each problem is accompanied by an expert-verified reasoning rationale, and two evaluation modes are supported: exact match and semantic equivalence. The data construction pipeline consists of four stages: data collection and filtering → reasoning path annotation → quality control → category labeling.
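As a rough illustration of the data format described above, a single OMIBench entry might look as follows. This is a sketch only: the field names and example values are hypothetical, since the paper does not publish its exact schema.

```python
from dataclasses import dataclass

@dataclass
class OMIBenchProblem:
    """Hypothetical record layout for one OMIBench problem (field names assumed)."""
    problem_id: str
    subject: str         # "biology" | "chemistry" | "mathematics" | "physics"
    question: str        # Markdown text produced by OCR + manual verification
    image_paths: list    # >= 2 images that jointly carry the reasoning evidence
    answer_format: str   # "multiple_choice" or "open_ended"
    answer: str          # gold answer
    rationale: str       # expert-verified reasoning path

# Illustrative entry (contents invented for the example)
problem = OMIBenchProblem(
    problem_id="phys-0001",
    subject="physics",
    question="Using the circuit in Figure 1 and the I-V curve in Figure 2, find ...",
    image_paths=["phys-0001_fig1.png", "phys-0001_fig2.png"],
    answer_format="open_ended",
    answer="2.4 A",
    rationale="Step 1: read the operating point from Figure 2; Step 2: ...",
)
assert len(problem.image_paths) >= 2  # multi-image requirement from the selection criteria
```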
Key Designs¶
- Multi-Image Competition Problem Dataset Construction:
- Function: Provides authentic, olympiad-level evaluation data requiring cross-image reasoning.
- Mechanism: PDF problems are collected from international olympiads (IPhO, IChO, etc.), national/regional competitions, and mixed-difficulty benchmarks, converted to Markdown via Mathpix OCR, and manually verified. Selection criteria require each problem to contain ≥2 images that jointly provide reasoning evidence. Multilingual problems are first translated using Google Translate and then verified by human annotators.
- Design Motivation: To ensure problem difficulty genuinely reaches competition level and that non-trivial semantic/quantitative dependencies exist among multiple images.
- Two-Stage Expert Reasoning Path Annotation:
- Function: Provides reference reasoning paths for each problem that can be used to analyze model reasoning processes.
- Mechanism: Gemini-2.5-pro-thinking is first used to generate up to 16 candidate solutions per problem; solutions with correct answers are retained (if all fail, the correct answer is provided for regeneration, reducing manual annotation effort by ~20%). Annotators with competition experience then verify and revise the solutions to ensure reasoning steps are correct, complete, and well-structured.
- Design Motivation: Most competition datasets lack solution processes, yet reasoning paths are critical for diagnosing exactly where model reasoning fails.
- Dual Evaluation Protocol (Exact Match + GPTScore):
- Function: Simultaneously evaluates strict answer correctness and semantic equivalence.
- Mechanism: Exact match (ACC) requires complete answer agreement; GPTScore evaluates semantic equivalence of open-ended answers under multimodal context constraints, handling cases where expressions differ but meanings are equivalent.
- Design Motivation: Open-ended answers can be expressed in multiple equivalent forms; relying solely on exact match underestimates true model capability.
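The dual protocol above can be sketched in a few lines. Note that `llm_judge` is a hypothetical placeholder for the GPTScore judge call, and the normalization rules are assumptions, not the paper's exact implementation:

```python
def normalize(answer: str) -> str:
    """Light normalization before exact match (assumed: case/whitespace only)."""
    return " ".join(answer.strip().lower().split())

def exact_match(prediction: str, gold: str) -> bool:
    """ACC protocol: strict agreement after normalization."""
    return normalize(prediction) == normalize(gold)

def gpt_score(prediction: str, gold: str, llm_judge) -> bool:
    """GPTScore protocol: an LLM judge decides semantic equivalence
    (e.g. '1/2' vs '0.5') under the problem's multimodal context."""
    if exact_match(prediction, gold):   # trivially equivalent, skip the judge
        return True
    return llm_judge(prediction, gold)

# A prediction that differs in form but not meaning fails ACC but can pass GPTScore.
judge = lambda p, g: p == "0.5" and g == "1/2"  # stand-in for a real judge call
assert not exact_match("0.5", "1/2")
assert gpt_score("0.5", "1/2", judge)
```

The point of running both protocols is visible in the last two lines: the same prediction scores 0 under exact match but 1 under semantic judging, which is exactly the gap the paper's design motivation describes.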
Loss & Training¶
This is a pure benchmarking work; no model training is involved.
Key Experimental Results¶
Main Results¶
| Model | Bio Score | Chem Score | Math Score | Phys Score | Overall Score |
|---|---|---|---|---|---|
| Gemini-3-Pro | 71.31 | 25.35 | 62.56 | 38.92 | 50.53 |
| GPT-5 | 62.55 | 29.03 | 56.51 | 40.80 | 48.11 |
| GPT-5-mini | 59.36 | 24.42 | 56.74 | 43.63 | 47.73 |
| Qwen3-VL-32B | 58.57 | 20.74 | 40.70 | 25.00 | 35.78 |
| InternVL3-78B | 46.61 | 20.74 | 17.21 | 18.63 | 23.83 |
Comparison with Single-Image Benchmarks¶
| Analysis | Result |
|---|---|
| Gemini-3-Pro: OlympiadBench → OMIBench | 75.67% → 50.53% (↓25.1 points) |
| Model ranking correlation (Spearman ρ) | 0.614 < 0.7 (moderate correlation) |
| Manual review: error rate in o4-mini reasoning steps | 46% of key steps contain logical errors |
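The ranking-shift analysis reduces to Spearman's rank correlation between the two model orderings. A minimal self-contained computation is sketched below; the scores are illustrative, not the paper's:

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation for untied scores:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(vals):
        # rank 1 = highest score
        order = sorted(range(len(vals)), key=lambda i: vals[i], reverse=True)
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Illustrative scores: single-image benchmark vs. OMIBench for five models.
single = [75.7, 70.2, 68.9, 60.1, 55.4]
multi  = [50.5, 48.1, 47.7, 35.8, 23.8]
assert spearman_rho(single, multi) == 1.0          # identical ordering
assert spearman_rho(single, multi[::-1]) == -1.0   # fully reversed ordering
```

An observed ρ of 0.614 sits well below 1.0, i.e. the multi-image ordering meaningfully reshuffles the single-image leaderboard rather than preserving it.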
Key Findings¶
- The strongest model, Gemini-3-Pro, reaches only 50.53%, demonstrating that multi-image olympiad reasoning remains an enormous challenge.
- Accuracy drops by more than 25 points moving from single-image to multi-image settings, and model rankings shift substantially (ρ = 0.614), indicating that multi-image reasoning ability cannot be simply inferred from single-image performance.
- A significant gap exists between closed-source and open-source models — Gemini-3-Pro outperforms the best open-source model by ~15%, yet GPT-4o is on par with open-source models, suggesting that scale alone is not the decisive factor.
- Long CoT, test-time scaling, and ICL yield limited but consistent improvements; parameter scaling and think-with-image approaches offer negligible or even negative gains.
- Chemistry and physics are the most challenging disciplines (lowest scores), while biology is the most "accessible" — possibly because biology problems rely more on knowledge recall than multi-step reasoning.
Highlights & Insights¶
- The claim of a qualitative shift from single-image to multi-image reasoning is strongly supported by experimental evidence — an absolute drop of 25+ points and rank reordering (ρ = 0.614) together demonstrate that this is not a simple additive difficulty increase.
- Manual review reveals that 46% of key reasoning steps contain logical errors — models can produce fluent reasoning chains that are nonetheless logically flawed, which serves as an important warning for CoT evaluation methodology.
- Coverage across four disciplines allows the benchmark to reveal imbalances in reasoning capability across subjects, which is informative for education and capability assessment.
Limitations & Future Work¶
- The dataset contains ~1,000 problems; subsets for some disciplines may be too small for sufficient statistical power.
- Semantic evaluation relies on GPTScore; the reliability of LLM-as-judge for assessing equivalence of mathematical/scientific answers remains to be validated.
- The types of cross-image dependencies (complementary information, contradictory information, temporal change, etc.) are not categorized at a fine-grained level.
- Multimodal RAG or tool-augmented strategies have not been tested.
- Problem sources are biased toward international and Chinese competitions, which may disadvantage models trained predominantly on other languages or regional curricula.
Related Work & Insights¶
- vs. OlympiadBench (He et al., 2024): Both are competition-level benchmarks, but fewer than 5% of OlympiadBench problems involve multiple images; OMIBench is entirely multi-image, exposing capability deficiencies previously obscured by single-image settings.
- vs. MuirBench / MMIU: These multi-image benchmarks are of low difficulty, lack competition-level reasoning, and provide no reasoning path annotations.
- vs. ReMI (Kazemi et al., 2024): Covers mathematics and physics but at H/COL difficulty level, excludes biology and chemistry, and provides no reasoning annotations.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of multi-image and olympiad-level evaluation represents a novel assessment angle, though the benchmark construction methodology is relatively standard.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluation of 30+ models, analysis of multiple augmentation strategies, and systematic comparison with single-image benchmarks.
- Writing Quality: ⭐⭐⭐⭐ Clear structure and rich data presentation.
- Value: ⭐⭐⭐⭐ Fills the gap in multi-image olympiad reasoning evaluation and provides meaningful reference for model capability analysis.