MMSciBench: Benchmarking Language Models on Chinese Multimodal Scientific Problems

Conference: ACL 2025
arXiv: 2503.01891
Code: GitHub | HuggingFace
Area: Multimodal VLM
Keywords: scientific reasoning benchmark, multimodal evaluation, Chinese science problems, vision-language models, mathematical and physical reasoning

TL;DR

This work proposes MMSciBench, a multimodal scientific reasoning benchmark containing 4,482 Chinese high school mathematics and physics problems. It covers both multiple-choice and question-answering formats, across text-only and multimodal (text-image) settings, complete with human-annotated difficulty levels and a three-level knowledge taxonomy. Evaluation shows that the strongest model, Gemini 1.5 Pro 002, only achieves 63.77% accuracy, with a significant performance drop on multimodal problems (a gap of 36.28 percentage points).

Background & Motivation

Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated strong capabilities across numerous tasks, but their scientific reasoning abilities—especially in multimodal scenarios—remain insufficiently evaluated. Existing scientific benchmarks exhibit three major limitations:

Lack of multimodal evaluation: Most scientific benchmarks only contain text-only problems, failing to assess the joint vision-text reasoning capabilities of models.

Limited area coverage: Existing datasets either focus narrowly on a single discipline or span many disciplines without systematic coverage, making it difficult to evaluate a model's understanding of core concepts within a specific subject.

Insufficient evaluation granularity: The lack of human-annotated difficulty levels and structured knowledge classification makes it difficult to analyze performance discrepancies across different complexities and knowledge domains.

Furthermore, Chinese scientific reasoning benchmarks are particularly scarce. GAOKAO-Bench and GAOKAO-MM contain only 3K and 650 problems respectively, and lack fine-grained knowledge classification. This motivates the authors to construct a Chinese scientific benchmark that balances scale, quality, multimodality, and fine-grained evaluation.

Method

1. Data Collection and Quality Control

The MMSciBench data was originally annotated by K-12 teachers. Each problem contains:

  • Problem text (Chinese)
  • Detailed step-by-step solution process and final answer
  • Human-annotated difficulty score (normalized to 0–1)
  • Knowledge point tags
  • Metadata (problem type, modality, subject)

Quality control workflow:

  • Filter out duplicates and problems with incomplete information.
  • Keep only highly challenging problems, i.e. those with difficulty scores \(\geq 0.7\).
  • Limit each problem to at most one image to maintain evaluation consistency.
  • Employ GPT-4o for three-level knowledge classification, followed by manual validation by curriculum experts.
  • Obtain a final set of 4,482 problem-solution pairs.
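The filtering steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the `RawProblem` record, its field names, and the whitespace-normalized duplicate check are hypothetical and are not taken from the authors' code. The GPT-4o classification and expert validation steps are not modeled.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class RawProblem:
    # Hypothetical record layout; the real dataset schema may differ.
    text: str
    solution: str
    answer: str
    difficulty: float              # normalized to the range 0-1
    image_paths: Tuple[str, ...] = ()


def quality_filter(
    problems: Iterable[RawProblem],
    min_difficulty: float = 0.7,
    max_images: int = 1,
) -> List[RawProblem]:
    """Apply the completeness, duplicate, difficulty, and image-count filters."""
    seen = set()
    kept = []
    for p in problems:
        # Drop problems with incomplete information.
        if not (p.text and p.solution and p.answer):
            continue
        # Drop duplicates (assumed here: same text after whitespace normalization).
        key = " ".join(p.text.split())
        if key in seen:
            continue
        seen.add(key)
        # Keep only challenging problems.
        if p.difficulty < min_difficulty:
            continue
        # Keep problems with at most one image.
        if len(p.image_paths) > max_images:
            continue
        kept.append(p)
    return kept
```

A near-duplicate detector (for example, one based on text similarity) would be a natural replacement for the exact-match check, but the summary does not say which method was used.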

2. Dataset Structural Design

Problem Type Dimension:

| Problem Type | Math | Physics | Total |
| --- | --- | --- | --- |
| Multiple-Choice (MCQ) | 760 | 2,707 | 3,467 |
| Question-Answering (Q&A) | 516 | 499 | 1,015 |

Modality Dimension:

| Modality | Math | Physics | Total |
| --- | --- | --- | --- |
| Multimodal (Text-Image) | 457 | 710 | 1,167 |
| Text-Only | 819 | 2,496 | 3,315 |

3. Three-Level Knowledge Taxonomy

  • Domain: Core disciplinary areas, such as "Set" and "Function" in mathematics, and "Classical Mechanics", "Electrodynamics", and "Quantum Mechanics" in physics.
  • Module: Key themes under domains, such as "Probability & Statistics" and "Mechanical Motion & Physical Models".
  • Chapter: The finest granularity, such as "Exponential Function", "Trigonometric Function", "Hooke's Law", and "Equilibrium Conditions of Coplanar Forces".
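One way to picture how such a taxonomy supports fine-grained diagnosis is to attach a (domain, module, chapter) tag to each problem and aggregate accuracy at any level. The sketch below assumes a hypothetical `KnowledgeTag` type and helper; the tag names are taken from the examples above, but the particular domain/module/chapter pairings in the sample data are illustrative guesses, not the benchmark's actual hierarchy.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class KnowledgeTag(NamedTuple):
    domain: str
    module: str
    chapter: str


def accuracy_by_level(
    results: Iterable[Tuple[KnowledgeTag, bool]], level: str
) -> Dict[str, float]:
    """Group (tag, is_correct) pairs by one taxonomy level and return accuracy per group."""
    if level not in KnowledgeTag._fields:
        raise ValueError(f"level must be one of {KnowledgeTag._fields}")
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_total]
    for tag, is_correct in results:
        key = getattr(tag, level)
        totals[key][0] += int(is_correct)
        totals[key][1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}


# Illustrative results only; the module names are placeholders.
sample_results = [
    (KnowledgeTag("Function", "module A", "Exponential Function"), True),
    (KnowledgeTag("Function", "module A", "Trigonometric Function"), False),
    (KnowledgeTag("Classical Mechanics", "module B", "Hooke's Law"), True),
]
domain_accuracy = accuracy_by_level(sample_results, "domain")
chapter_accuracy = accuracy_by_level(sample_results, "chapter")
```

Aggregating at the domain level hides the failure on "Trigonometric Function" inside a 50% score for "Function", while the chapter-level view isolates it, which is the kind of localization the taxonomy is meant to enable.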

4. Evaluation Framework

  • Metric: Accuracy (only evaluating the correctness of the final answer).
  • Evaluation Protocol: GPT-4o is employed as an automatic evaluator to compare model outputs against standard answers. Following iterative calibration over 180 problems, the agreement rate between GPT-4o judgment and human evaluation reached 97.22%.
  • Prompt Design: Zero-shot setting, utilizing a unified prompt template without optimization for specific models and without providing auxiliary knowledge point information.
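The protocol above boils down to two quantities: final-answer accuracy under some judge, and the agreement rate between the automatic judge and human graders. The sketch below uses a trivial exact-match function as a stand-in for the GPT-4o judge (an assumption made so the example stays self-contained; no model API is called). The figure of 175 agreeing cases out of 180 is inferred from the reported 97.22% and is not stated in the summary.

```python
from typing import Callable, Sequence


def normalize(answer: str) -> str:
    """Strip all whitespace and lowercase, so trivially different strings compare equal."""
    return "".join(answer.split()).lower()


def exact_match_judge(prediction: str, reference: str) -> bool:
    # Stand-in for the GPT-4o judge; a real judge would assess semantic equivalence.
    return normalize(prediction) == normalize(reference)


def accuracy(
    predictions: Sequence[str],
    references: Sequence[str],
    judge: Callable[[str, str], bool] = exact_match_judge,
) -> float:
    """Fraction of problems whose final answer the judge accepts."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and equal in length")
    correct = sum(judge(p, r) for p, r in zip(predictions, references))
    return correct / len(references)


def agreement_rate(auto_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Fraction of items on which the automatic judge and the human grader agree."""
    if len(auto_labels) != len(human_labels) or not human_labels:
        raise ValueError("label sequences must be non-empty and equal in length")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(human_labels)


# Inferred calibration scenario: the judge disagrees with humans on 5 of 180 problems.
auto = [True] * 180
human = [True] * 175 + [False] * 5
calibration_agreement = agreement_rate(auto, human)
```

Passing the judge in as a callable keeps the metric independent of whichever evaluator is used, so the same `accuracy` function works with a model-based judge or a rule-based one.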

Key Experimental Results

Table 1: Overall and Subject-Specific Accuracy of Models

| Model | Math | Physics | Overall |
| --- | --- | --- | --- |
| Gemini 1.5 Pro 002 | 56.74% | 66.56% | 63.77% |
| Qwen2-VL-72B-Instruct | 35.50% | 64.32% | 56.11% |
| Claude 3.5 Sonnet | 37.38% | 60.54% | 53.95% |
| GPT-4o | 35.97% | 56.89% | 50.94% |
| Llama-3.2-90B-Vision-Instruct | 16.69% | 36.96% | 31.19% |
| Qwen2.5-Math-72B-Instruct | 57.39%* | - | - |
| DeepSeekMath-7B-Instruct | 21.86%* | - | - |
| o1 | 67.40%† | - | - |
| Claude 3.7 Sonnet | 37.64%† | - | - |

*Text-only math problems only; †Multimodal math problems only
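As a sanity check on Table 1, the overall accuracy of each fully evaluated model should be the problem-count-weighted average of its math and physics accuracies. The short script below reproduces that arithmetic from the counts in the dataset tables (1,276 math and 3,206 physics problems); small differences of about 0.01 are expected because the reported per-subject figures are already rounded.

```python
# Problem counts from the dataset tables: MCQ + Q&A per subject.
N_MATH = 760 + 516        # 1,276
N_PHYSICS = 2707 + 499    # 3,206

# (math accuracy, physics accuracy, reported overall accuracy), all in percent.
TABLE_1 = {
    "Gemini 1.5 Pro 002": (56.74, 66.56, 63.77),
    "Qwen2-VL-72B-Instruct": (35.50, 64.32, 56.11),
    "Claude 3.5 Sonnet": (37.38, 60.54, 53.95),
    "GPT-4o": (35.97, 56.89, 50.94),
    "Llama-3.2-90B-Vision-Instruct": (16.69, 36.96, 31.19),
}


def overall_accuracy(math_acc: float, physics_acc: float) -> float:
    """Weighted average of subject accuracies, weighted by problem count."""
    return (N_MATH * math_acc + N_PHYSICS * physics_acc) / (N_MATH + N_PHYSICS)


# Difference between the recomputed and the reported overall accuracy per model.
discrepancies = {
    model: abs(overall_accuracy(m, p) - reported)
    for model, (m, p, reported) in TABLE_1.items()
}

for model, diff in discrepancies.items():
    print(f"{model}: recomputed vs. reported differs by {diff:.3f} points")
```

Because physics contributes about 72% of the problems, the overall score of every model sits much closer to its physics accuracy than to its math accuracy.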

Table 2: Accuracy Comparison between Text-Only vs. Multimodal Settings

| Model | Math-Text | Math-Multimodal | Physics-Text | Physics-Multimodal | Overall-Text | Overall-Multimodal |
| --- | --- | --- | --- | --- | --- | --- |
| Gemini 1.5 Pro 002 | 69.60% | 33.70% | 74.40% | 39.01% | 73.21% | 36.93% |
| Qwen2-VL-72B-Instruct | 41.39% | 24.95% | 72.48% | 35.63% | 64.80% | 31.45% |
| Claude 3.5 Sonnet | 44.57% | 24.51% | 67.75% | 35.21% | 62.02% | 31.02% |
| GPT-4o | 44.69% | 20.35% | 64.10% | 31.55% | 59.31% | 27.16% |
| Llama-3.2-90B-Vision-Instruct | 19.54% | 11.60% | 42.83% | 16.34% | 37.07% | 14.48% |

Key Finding: All LVLMs perform significantly worse on multimodal problems than on text-only problems. Gemini 1.5 Pro 002 achieves an overall accuracy of 73.21% on text-only problems, but only 36.93% on multimodal problems, resulting in a gap of 36.28 percentage points.
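The same text-versus-multimodal gap can be computed for every model in Table 2. The snippet below simply subtracts the overall multimodal accuracy from the overall text-only accuracy; the numbers are copied from the table, and the dictionary layout is just a convenient assumption for the example.

```python
# (overall text-only accuracy, overall multimodal accuracy), in percent, from Table 2.
TABLE_2_OVERALL = {
    "Gemini 1.5 Pro 002": (73.21, 36.93),
    "Qwen2-VL-72B-Instruct": (64.80, 31.45),
    "Claude 3.5 Sonnet": (62.02, 31.02),
    "GPT-4o": (59.31, 27.16),
    "Llama-3.2-90B-Vision-Instruct": (37.07, 14.48),
}

# Gap in percentage points between text-only and multimodal accuracy.
modality_gap = {
    model: round(text_acc - mm_acc, 2)
    for model, (text_acc, mm_acc) in TABLE_2_OVERALL.items()
}

for model, gap in sorted(modality_gap.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {gap:.2f} percentage points")
```

Every model loses more than 20 percentage points on multimodal problems, and the strongest model shows the largest absolute gap.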

Table 3: Accuracy Comparison between MCQ and Q&A, with MCQ Accuracy above the Random-Guessing Baseline

| Model | MCQ (Overall) | Q&A (Overall) | MCQ above Random Baseline (percentage points) |
| --- | --- | --- | --- |
| Gemini 1.5 Pro 002 | 68.82% | 46.50% | +47.96 |
| Qwen2-VL-72B-Instruct | 65.71% | 23.35% | +44.85 |
| Claude 3.5 Sonnet | 61.55% | 27.98% | +40.69 |
| GPT-4o | 57.51% | 28.47% | +36.65 |
| Llama-3.2-90B-Vision-Instruct | 37.96% | 8.08% | +17.10 |

Gemini 1.5 Pro 002's accuracy on Q&A problems is 22.32 percentage points lower than on MCQs, indicating that open-ended question answering is significantly more challenging.
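The last column of Table 3 is the MCQ accuracy minus a random-guessing baseline, so the baseline itself can be recovered by subtraction. The sketch below does this for each model; the resulting value of about 20.86% is inferred from the table arithmetic, and this summary does not state how the authors computed it.

```python
# (MCQ accuracy in percent, margin above the random baseline in points), from Table 3.
TABLE_3_MCQ = {
    "Gemini 1.5 Pro 002": (68.82, 47.96),
    "Qwen2-VL-72B-Instruct": (65.71, 44.85),
    "Claude 3.5 Sonnet": (61.55, 40.69),
    "GPT-4o": (57.51, 36.65),
    "Llama-3.2-90B-Vision-Instruct": (37.96, 17.10),
}

# Implied random-guessing baseline for each row: accuracy minus reported margin.
implied_baseline = {
    model: round(mcq_acc - margin, 2)
    for model, (mcq_acc, margin) in TABLE_3_MCQ.items()
}

# All rows should imply the same baseline if the column is a simple difference.
unique_baselines = set(implied_baseline.values())
print(unique_baselines)
```

A baseline below 25% is lower than the chance level of a single-answer four-option question, which would be consistent with some MCQs allowing multiple correct options, though that explanation is a guess rather than something stated here.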

Table 4: Effects of Chain-of-Thought Prompting

| Model | Default (Chinese) | CoT (Chinese) | CoT (English) |
| --- | --- | --- | --- |
| Llama-3.2-90B-Vision-Instruct | 31.19% | 33.24% | 38.00% |
| GPT-4o | 50.94% | 50.85% | 52.86% |
| Claude 3.5 Sonnet | 53.95% | 54.42% | 55.40% |
| Gemini 1.5 Pro 002 | 63.77% | 63.61% | 62.25% |

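Three of the four models improve when reasoning with English CoT on the Chinese problems, with Llama-3.2-90B-Vision-Instruct gaining the most (31.19% to 38.00%). Gemini 1.5 Pro 002 is the exception: its accuracy declines slightly (63.77% to 62.25%), possibly because it is more sensitive to a mismatch between the language of the prompt and the language of the response.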

Key Findings

Error Type Analysis

An in-depth analysis of problems where all models failed (240 cases) reveals the error distribution:

  • Reasoning Errors: 77.1% (the primary bottleneck)
  • Calculation Errors: 11.3%
  • Visual Misinterpretations: 7.5%
  • Information Integration Failures: 2.5%
  • Textual Misunderstandings: 1.7%

Reasoning errors overwhelmingly dominate, indicating a fundamental deficiency of current models in complex multi-step scientific reasoning.
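To make the percentages concrete, they can be converted back into approximate case counts over the 240 analyzed problems. The counts below are reconstructed by rounding and are therefore an inference from the reported percentages, not figures quoted from the paper.

```python
TOTAL_FAILED_CASES = 240

# Reported share of each error type, in percent.
ERROR_SHARE = {
    "Reasoning Errors": 77.1,
    "Calculation Errors": 11.3,
    "Visual Misinterpretations": 7.5,
    "Information Integration Failures": 2.5,
    "Textual Misunderstandings": 1.7,
}

# Approximate number of cases per error type, rounded to the nearest integer.
implied_counts = {
    error_type: round(share / 100 * TOTAL_FAILED_CASES)
    for error_type, share in ERROR_SHARE.items()
}

for error_type, count in implied_counts.items():
    print(f"{error_type}: about {count} of {TOTAL_FAILED_CASES}")
```

The rounded counts add up to exactly 240, so the percentages are internally consistent even though they sum to 100.1% because of rounding.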

Cross-Knowledge-Point Analysis

  • Model performance varies significantly across different subfields: Gemini leads in most domains but lags behind Claude and GPT-4o in the "Electrodynamics-Magnetic Field" subfield.
  • Universally weak areas for all models: "Electromagnetic Induction and its Applications" in physics, and "Geometry & Algebra" as well as "Function-Introductory Knowledge" in mathematics.
  • The overall accuracy in physics is higher than that in mathematics, partially because the proportion of text-only problems in physics is higher.
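The claim that modality mix only partially explains the physics advantage can be checked with a back-of-the-envelope reweighting. The sketch below computes the text-only share per subject from the dataset counts, then asks what Gemini 1.5 Pro 002's physics accuracy would be if physics had the same text/multimodal mix as math. This counterfactual is an illustration added here, not an analysis from the paper.

```python
# Problem counts per subject and modality, from the dataset tables.
COUNTS = {
    "Math": {"text": 819, "multimodal": 457},
    "Physics": {"text": 2496, "multimodal": 710},
}

# Gemini 1.5 Pro 002 accuracies in percent, from Tables 1 and 2.
MATH_ACC = 56.74
PHYSICS_ACC = 66.56
PHYSICS_TEXT_ACC = 74.40
PHYSICS_MM_ACC = 39.01


def text_share(subject: str) -> float:
    """Fraction of a subject's problems that are text-only."""
    c = COUNTS[subject]
    return c["text"] / (c["text"] + c["multimodal"])


def reweighted_physics_accuracy() -> float:
    """Physics accuracy if physics had math's text/multimodal proportions."""
    w = text_share("Math")
    return w * PHYSICS_TEXT_ACC + (1 - w) * PHYSICS_MM_ACC


print(f"Text-only share: math {text_share('Math'):.1%}, physics {text_share('Physics'):.1%}")
print(f"Reweighted physics accuracy: {reweighted_physics_accuracy():.1f}%")
```

Under this reweighting the physics accuracy drops from 66.56% to roughly 61.7%, which is still above the 56.74% math accuracy, so the higher text-only share accounts for part of the gap but not all of it.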

Highlights & Insights

  • Multimodal + Multi-format: Covers four combinations (text-only vs. multimodal, MCQ vs. Q&A) to support comprehensive cross-analysis.
  • Three-Level Knowledge Taxonomy: The hierarchical classification from Domain \(\to\) Module \(\to\) Chapter enables fine-grained capability diagnosis, allowing localized analysis of model weaknesses in specific knowledge points.
  • Human-Annotated Difficulty: All problems carry standardized difficulty scores annotated by K-12 teachers, and only problems scoring \(\geq 0.7\) are retained, which keeps the benchmark challenging.
  • Detailed Solution Process: Each problem is equipped with step-by-step solution explanations, facilitating error localization and future research on model improvement.
  • Multi-Dimensional Analysis: Not only reports overall accuracy but also deeply analyzes results across disciplines, formats, modalities, knowledge points, CoT effects, and error types.

Limitations & Future Work

  1. Limited Subject Coverage: Only covers high school mathematics and physics, excluding other subjects like chemistry and biology, and does not feature university-level or Olympiad-level content.
  2. Final Answer-Only Evaluation: Ignores the correctness of intermediate reasoning steps, which might mask crucial differences in models' internal reasoning processes.
  3. Single Language: Predominantly in Chinese, which might disadvantage models primarily trained on English data; cultural and linguistic biases may impact fairness.
  4. Moderate Dataset Size: 4,482 problems is relatively small compared to large-scale benchmarks; the strict difficulty filtering (\(\geq 0.7\)) might exclude valuable boundary cases.
  5. Limitations of GPT-4o Judging: Despite the 97.22% agreement rate, the automatic evaluator may introduce systematic biases, such as being too lenient on incomplete answers or inaccurate in judging the equivalence of complex mathematical expressions.

Comparison with Related Benchmarks

| Benchmark | Subject | Modality | Language | Difficulty | Scale | Knowledge Classification | Solution Process |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GAOKAO-Bench | Math/Phys/Other | Text-only | Chinese | High School | 3K | | |
| GAOKAO-MM | Math/Phys/Other | Text+Image | Chinese | High School | 650 | | |
| OlympiadBench | Math/Phys | Text+Image | CN/EN | Olympiad | 8K | | |
| SciBench | Math/Phys/Other | Text+Image | English | College | 869 | | |
| M3Exam | Multi-subject | Text+Image | Multilingual | K-12 | 12K | | Answer Only |
| EXAMS-V | Multi-subject (20) | Text+Image | Multilingual | Middle School | 21K | | Answer Only |
| MMSciBench | Math/Phys | Text+Image | Chinese | High School | 4.5K | ✓ (Three-level) | ✓ |

MMSciBench uniquely combines multimodal evaluation, a three-level knowledge taxonomy, human-annotated difficulty, and detailed solution processes among Chinese scientific benchmarks, filling the gap in fine-grained capability evaluation.

Rating

  • Novelty: ⭐⭐⭐ — The benchmark design philosophy is solid, but the primary contribution lies in the dataset, with limited methodological innovation.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Extremely comprehensive, analyzing 9 models across multiple dimensions (discipline, problem type, modality, knowledge point, CoT, and error type).
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure, rich tables and figures, and diverse analytical perspectives.
  • Value: ⭐⭐⭐⭐ — Provides a high-quality benchmark for Chinese scientific reasoning evaluation; the three-level classification system facilitates fine-grained diagnostic analysis.