# VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs
- Conference: ICLR 2026
- arXiv: 2506.06727
- Code: GitHub
- Area: Multimodal VLM
- Keywords: mathematical reasoning benchmark, multi-image reasoning, visual similarity, image-text alignment, LMM evaluation
## TL;DR
This paper introduces VisioMath, a benchmark comprising 1,800 K-12 mathematics problems in which all answer choices consist of highly visually similar figures. It reveals a core weakness of LMMs in multi-image–text alignment and explores three alignment strategies, the best of which yields up to a +12.6% accuracy gain.
## Background & Motivation
Most existing multimodal mathematical reasoning benchmarks focus on single-image settings or text-based answer options, overlooking an important and common problem type: questions in which all answer choices are figures. Such problems are prevalent in K-12 mathematics education and require fine-grained comparative reasoning over visually near-identical geometric diagrams, function curves, and similar stimuli.
Existing multi-image benchmarks (e.g., MathVerse-mv, MV-Math) lack systematic consideration of high visual similarity. The central observation of VisioMath is that LMMs systematically fail to discriminate among nearly identical figure-based options, with the dominant failure mode being image-text misalignment—models rely on positional heuristics rather than textual cues when reasoning.
## Method
### Overall Architecture
VisioMath consists of three parts: a carefully constructed benchmark of 1,800 multiple-choice mathematics problems, a comprehensive evaluation of current LMMs, and an exploration of alignment strategies.
### Key Designs
- Benchmark Construction: 1,800 multiple-choice problems are collected from authentic Chinese high school and college entrance examination papers spanning 2002–2023, containing 8,070 figure-based answer options. Three core design principles:
    - Representativeness: Real examination problems covering K-12 topics including geometry, algebraic visualization, numerical comparison, and function pattern recognition.
    - Reliability: JSON standardization, LaTeX mathematical notation, manually cropped images (strictly one image per option), and human cross-validation.
    - High Visual Similarity: Qwen multimodal-embedding-v1 is used to compute the minimum pairwise cosine similarity among options, \(\text{Sim}(Q) = \min_{i \neq j} \cos(f(x_i), f(x_j))\); the full similarity spectrum is retained to avoid selection bias (see the sketch after this list).
- Visual Similarity Quantification: Problems are stratified into four quartile bins (Q1–Q4) by visual similarity, enabling systematic analysis of LMM performance as similarity increases. Approximately 50% of problems also contain figures in the question stem, further increasing the complexity of visual reasoning.
- Three Alignment Strategies:
    - Image Concatenation (training-free): Multiple option images are merged into a single composite layout.
    - Explicit Visual-Text Anchoring (training-free): Explicit correspondence markers are established between figure regions and their textual option labels.
    - Alignment-Oriented CoT Fine-tuning: A multi-image chain-of-thought dataset is constructed for supervised fine-tuning; even a small amount of data yields a +12.6% improvement.
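As a concrete illustration of the similarity metric and quartile stratification above, here is a minimal sketch. It assumes the option-figure embeddings (e.g., from Qwen multimodal-embedding-v1) are already available as row vectors; all function names are illustrative, not from the paper's code.

```python
import numpy as np

def min_pairwise_similarity(option_embeddings: np.ndarray) -> float:
    """Sim(Q): minimum pairwise cosine similarity among a question's
    option-figure embeddings (shape: [n_options, dim])."""
    # L2-normalize rows so dot products equal cosine similarities.
    normed = option_embeddings / np.linalg.norm(
        option_embeddings, axis=1, keepdims=True
    )
    cos = normed @ normed.T
    # Exclude the diagonal (self-similarity) before taking the minimum.
    mask = ~np.eye(len(cos), dtype=bool)
    return float(cos[mask].min())

def quartile_bins(sims: np.ndarray) -> np.ndarray:
    """Assign each problem to a similarity quartile: 0..3 -> Q1..Q4."""
    q1, q2, q3 = np.quantile(sims, [0.25, 0.5, 0.75])
    return np.digitize(sims, [q1, q2, q3])
```

Taking the minimum rather than the mean is the stricter choice: a high Sim(Q) certifies that even the least similar pair of options is visually close, so every distractor is hard to dismiss on appearance alone.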
### Loss & Training
This work is primarily a benchmark evaluation study. The alignment-oriented CoT fine-tuning employs standard SFT on a small VisioMath-CoT dataset.
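The paper describes this step only as standard SFT on VisioMath-CoT; the sketch below shows the usual loss masking such a setup implies (an assumption, not the paper's code): cross-entropy is computed only over the chain-of-thought target tokens, with prompt positions ignored.

```python
import torch
import torch.nn.functional as F

def cot_sft_loss(logits: torch.Tensor, input_ids: torch.Tensor,
                 prompt_len: int) -> torch.Tensor:
    """Next-token cross-entropy over the CoT response only.

    logits: [batch, seq, vocab] from the LMM; input_ids: [batch, seq];
    prompt_len: number of leading prompt (and image) positions to mask.
    """
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100            # ignore the prompt
    shift_logits = logits[:, :-1, :]         # position t predicts token t+1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```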
## Key Experimental Results
### Main Results
| Model | VisioMath Avg. (%) | No-figure Stem (%) | Figure Stem (%) |
|---|---|---|---|
| Human | 91.3 | 92.3 | 89.7 |
| Gemini 2.5 Pro | 80.9 | 86.3 | 75.2 |
| Seed1.6-Thinking | 72.3 | 83.9 | 58.0 |
| GPT-4.1 | 52.6 | 56.1 | 42.8 |
| GLM-4.5V (best open-source) | 53.7 | 61.2 | 37.2 |
| Qwen2.5-VL-72B | 43.7 | 49.8 | 33.0 |
| Vision-R1-7B | 36.7 | 33.7 | 29.2 |
| Random | 25.6 | — | — |

Accuracy (%) by visual similarity quartile:

| Model | Q1 (Low) | Q2 | Q3 | Q4 (High) |
|---|---|---|---|---|
| Human | 95.7 | 91.2 | 87.6 | 89.0 |
| Gemini 2.5 Pro | 86.2 | 83.8 | 76.7 | 76.9 |
| GLM-4.5V | 68.7 | 59.3 | 44.2 | 44.7 |
| Qwen2.5-VL-7B | 33.6 | 37.8 | 29.8 | 29.6 |
### Ablation Study
| Setting | Δ Accuracy vs. Baseline | Effect | Notes |
|---|---|---|---|
| Baseline (no strategy) | — | — | Standard inference |
| Option Shuffling | −8.7% (Gemini) | significant drop | Confirms reliance on positional heuristics |
| Alignment-Oriented CoT Fine-tuning | +12.6% | largest gain | Effective with only a small CoT dataset |
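The option-shuffling control is easy to reproduce with a small harness; below is a hedged sketch in which `model_answer` is a placeholder for any LMM call (hypothetical, not an API from the paper). If a model truly grounds each option image in its label, accuracy should be invariant to the permutation.

```python
import random

def shuffled_accuracy(problems, model_answer, seed=0):
    """Re-evaluate with option images randomly permuted per problem.
    A drop versus the unshuffled run signals positional heuristics.
    Each problem: {'stem': ..., 'options': [img, ...], 'answer': int}."""
    rng = random.Random(seed)
    correct = 0
    for p in problems:
        order = list(range(len(p["options"])))
        rng.shuffle(order)
        shuffled = [p["options"][i] for i in order]
        gold = order.index(p["answer"])           # gold option's new position
        pred = model_answer(p["stem"], shuffled)  # hypothetical LMM call -> index
        correct += int(pred == gold)
    return correct / len(problems)
```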

| Error Type (GLM-4.5V, 50 samples) | Proportion | Notes |
|---|---|---|
| Image-text misalignment | 36% | Primary error source |
| Other reasoning errors | 64% | Includes computation and concept errors |
## Key Findings
- Figure-stem problems are harder: Nearly all LMMs show a significant accuracy drop on problems whose stems contain figures (Gemini 2.5 Pro −11.1 points, GLM-4.5V −24.0 points), indicating that integrating multi-source visual information is a critical bottleneck.
- High similarity causes severe degradation: Model accuracy declines by 12–15 percentage points from the lowest to the highest similarity quartile.
- Image-text misalignment is the primary cause: 36% of errors stem from alignment failure; LMMs tend to apply positional heuristics instead of semantic reasoning.
- Human vs. LMM divergence: Human accuracy declines only slightly at high similarity before stabilizing, suggesting that human errors stem primarily from conceptual misunderstanding, whereas LMM errors arise from perception-alignment failure.
- Human-model gap: The best open-source model, GLM-4.5V (53.7%), lags behind humans (91.3%) by 37.6 percentage points, indicating that this task is far from solved.
## Highlights & Insights
- Fills the gap in evaluation of figure-based-option mathematical reasoning, representing the first systematic study of how visual similarity affects multimodal reasoning.
- The option-shuffling experiment elegantly demonstrates that LMMs rely on positional heuristics rather than genuine semantic alignment.
- The visual similarity quantification methodology (minimum cosine similarity + Qwen embeddings) is rigorously validated.
- The +12.6% gain from CoT fine-tuning indicates that the problem can be partially mitigated through data-centric strategies.
## Limitations & Future Work
- Problem sources are limited to Chinese college entrance and high school examinations (with English translations provided), resulting in limited cultural and curricular coverage.
- The benchmark scale of 1,800 problems is relatively modest, and sub-domain sample sizes may be insufficient.
- The alignment strategies explored are preliminary; more systematic architecture-level improvements remain to be investigated.
- Only multiple-choice problems are covered; open-ended figure-based reasoning is not addressed.
## Related Work & Insights
- VisioMath is complementary to MathVista, MathVerse, and similar benchmarks, focusing specifically on fine-grained discrimination among multiple figure-based options.
- Image-text misalignment may be a pervasive issue in any VLM task requiring the processing of multiple images (e.g., document understanding, medical image comparison).
- The findings provide guidance for LMM training: explicit multi-image–text alignment capabilities need to be strengthened.
## Rating
- Novelty: ⭐⭐⭐⭐ Fills the gap in figure-based-option reasoning evaluation; visual similarity quantification is novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 20+ models (closed-source, open-source, and math-specialized); error analysis and controlled experiments are comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ Rigorously structured; the observation–analysis–strategy logic is clear, and figures are intuitive.
- Value: ⭐⭐⭐⭐ Reveals a core weakness of LMMs; practical impact as a benchmark paper depends on community adoption.