VisNumBench: Evaluating Number Sense of Multimodal Large Language Models

Conference: ICCV 2025
arXiv: 2503.14939
Code: https://wwwtttjjj.github.io/VisNumBench/
Area: Multimodal VLMs
Keywords: Number sense, visual numerical estimation, MLLM evaluation benchmark, quantity perception, multimodal reasoning

TL;DR

This paper proposes VisNumBench, a benchmark containing approximately 1,900 multiple-choice questions covering 7 visual numerical attributes and 4 types of visual numerical estimation tasks. It systematically evaluates the intuitive number sense of 17 MLLMs, revealing that even state-of-the-art models perform far below the human level.

Background & Motivation

Background: Multimodal Large Language Models (MLLMs) have made significant progress on complex multimodal tasks. Existing evaluation benchmarks (e.g., MathVista, Math-Vision) mainly focus on symbolic mathematical reasoning and structured numerical computation.

Limitations of Prior Work: Existing benchmarks focus on abstract mathematical problem-solving and overlook a core ability in human cognition: intuitive number sense. Humans can estimate angles, lengths, and quantities at a glance, but whether MLLMs possess similar capabilities had not been systematically tested.

Key Challenge: While MLLMs can handle complex mathematical problems by relying on symbolic reasoning, they may perform poorly on simple numerical estimation tasks that require visual intuition. Such failures would point to a deep-seated deficit in their "understanding" of numbers.

Goal: (a) To construct a benchmark specifically for evaluating visual number sense; (b) To systematically measure the gap in number sense between current MLLMs and humans.

Key Insight: Grounded in the cognitive science concept of the human Approximate Number System (ANS), the paper defines 7 visual numerical dimensions and 4 types of estimation tasks.

Core Idea: MLLMs' number sense is a foundational cognitive ability independent of mathematical reasoning, requiring dedicated evaluation and optimization.

Method

Overall Architecture

VisNumBench consists of two parts: VisNumBench-Synthetic (synthetic data, 1,011 questions) and VisNumBench-Real (real-world images, 902 questions), for 1,913 questions in total. The input is an image paired with a multiple-choice question, and the output is the model's chosen option. Evaluation is automated: each model output is compared directly with the ground-truth answer.
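
The summary does not describe how answers are parsed, so the following is only a minimal sketch of such an automated comparison. The function names `extract_choice` and `accuracy`, and the rule of taking the first standalone capital letter as the answer, are assumptions for illustration rather than the benchmark's actual scoring code.

```python
import re
from typing import Optional, Sequence


def extract_choice(response: str, letters: str = "ABCDE") -> Optional[str]:
    """Return the first standalone capital option letter in a response.

    Naive by design (an assumption, not the paper's parser): a response that
    opens with the article "A" would be misread as option A.
    """
    match = re.search(rf"\b([{letters}])\b", response)
    return match.group(1) if match else None


def accuracy(responses: Sequence[str], answers: Sequence[str]) -> float:
    """Percentage of responses whose extracted letter equals the ground truth."""
    correct = sum(extract_choice(r) == a for r, a in zip(responses, answers))
    return 100.0 * correct / len(answers)
```

Responses from which no letter can be extracted count as wrong here, which is one reasonable convention among several.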

Key Designs

7 Visual Numerical Attributes:
- Angle, Length, Scale, Quantity, Depth, Area, Volume
- Synthetic data covers the first six attributes (all except Volume); real-world data also covers six, with Volume in place of Area.
- Design Motivation: To cover the primary dimensions of human number sense and avoid biased evaluations based on a single attribute.
4 Visual Numerical Estimation Tasks:
- Value Comparison: Which of the two visual quantities is larger/longer?
- Value Estimation: Estimating a specific numerical value for a given figure (e.g., approximately how many degrees is the angle?).
- Range Estimation: Which interval does the numerical value fall within?
- Multiplicative Estimation: Approximately how many times larger is A than B?
- Design Motivation: The tasks form a hierarchy from simple comparison to more complex reasoning, matching different difficulty levels of human number sense.
Data Construction Pipeline:
- Synthetic Data: Programmatic control is applied to precisely govern geometric parameters (angles, line segments, shape arrangements, etc.), ensuring unambiguous ground truths. Distractors are meticulously designed to avoid being excessively easy or difficult.
- Real-world Data: Images are collected from real-world scenes (buildings, furniture, fruits, etc.), and quantities or depth relationships are manually annotated, ensuring that models' numerical intuition is tested in natural contexts.
- Design Motivation: Synthetic data provides a controlled baseline, while real-world data evaluates generalization capabilities.
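
As a rough illustration of the programmatic control described for the synthetic split, the sketch below builds one hypothetical angle item with a known ground truth and spaced-out distractors. Everything here is assumed for illustration: the function name `make_angle_question`, the 5-degree grid, and the `min_gap` rule are not taken from the paper, and the actual image rendering is omitted.

```python
import random


def make_angle_question(rng: random.Random, n_options: int = 4, min_gap: int = 15):
    """Build one hypothetical value-estimation item about an angle."""
    truth = rng.randrange(10, 171, 5)  # ground-truth angle in degrees
    options = {truth}
    while len(options) < n_options:
        candidate = rng.randrange(10, 171, 5)
        # Keep every option at least `min_gap` degrees from the others,
        # so that no distractor is indistinguishable from the truth.
        if all(abs(candidate - o) >= min_gap for o in options):
            options.add(candidate)
    ordered = sorted(options)
    letters = "ABCDE"[:n_options]
    return {
        "question": "Approximately how many degrees is the angle shown?",
        "options": dict(zip(letters, ordered)),
        "answer": letters[ordered.index(truth)],
        "angle_deg": truth,
    }
```

The gap rule only guards against distractors that are too hard; keeping them from being too easy would need an upper bound on the distance as well.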

Evaluation Protocol

All questions are formatted as multiple-choice (3-5 options) to unify the evaluation standard.
The human baseline comes from expert annotators, who achieve an average accuracy of approximately 95%.
Zero-shot inference is used for each MLLM without providing any examples.
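
To make the zero-shot, multiple-choice format concrete, here is a minimal sketch of how the text part of such a prompt could be assembled. The helper `build_zero_shot_prompt` and the instruction wording are hypothetical; the exact prompts used in the paper are not given in this summary, and the image would be passed to the model separately.

```python
def build_zero_shot_prompt(question: str, options: dict) -> str:
    """Format a question and its lettered options with no in-context examples."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in options.items()]
    # Hypothetical instruction; asking for a bare letter simplifies scoring.
    lines.append("Answer with the letter of the correct option only.")
    return "\n".join(lines)
```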

Key Experimental Results

Main Results

Model	Angle	Length	Scale	Quantity	Depth	Area	Average
Random	24.4	25.4	25.0	25.0	25.0	23.7	24.8
Qwen2.5-VL-72B	37.1	59.7	65.0	57.7	61.5	70.4	58.5
InternVL2.5-78B	35.3	59.7	68.6	42.9	61.5	72.5	56.2
Gemini 2.0 Flash	31.2	57.5	81.4	55.1	51.1	70.9	57.6
GPT-4o	35.3	43.1	54.3	37.2	54.1	43.4	43.7
LLaVA-v1.5-7B	31.2	30.4	34.3	33.2	26.7	21.2	29.4
Human	90.0	96.0	100.0	96.0	98.0	92.0	95.3
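
The gap to the human baseline can be read off the synthetic-data table directly. The short sketch below does so for three of the listed models; the numbers are copied from the table, while the helper names `gap_to_human` and `widest_gap` are illustrative only.

```python
ATTRS = ["Angle", "Length", "Scale", "Quantity", "Depth", "Area"]

# Per-attribute accuracy (%) on synthetic data, copied from the table above.
synthetic = {
    "Qwen2.5-VL-72B": [37.1, 59.7, 65.0, 57.7, 61.5, 70.4],
    "Gemini 2.0 Flash": [31.2, 57.5, 81.4, 55.1, 51.1, 70.9],
    "GPT-4o": [35.3, 43.1, 54.3, 37.2, 54.1, 43.4],
}
human = [90.0, 96.0, 100.0, 96.0, 98.0, 92.0]


def gap_to_human(scores):
    """Human accuracy minus model accuracy, in percentage points, per attribute."""
    return {a: round(h - s, 1) for a, h, s in zip(ATTRS, human, scores)}


def widest_gap(scores):
    """Attribute on which the model trails the human baseline the most."""
    gaps = gap_to_human(scores)
    return max(gaps, key=gaps.get)
```

For Qwen2.5-VL-72B and Gemini 2.0 Flash the widest gap is on Angle, whereas for GPT-4o it is on Quantity.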

Real-world Data Comparison

Model	Angle	Length	Scale	Quantity	Depth	Volume	Average
Qwen2.5-VL-72B	34.2	50.6	43.4	80.3	52.6	59.2	53.3
InternVL2.5-78B	36.9	58.6	49.0	79.6	52.6	62.6	56.5
GPT-4o	27.5	30.3	37.1	60.5	35.7	47.6	39.6
LLaVA-v1.6-34B	28.9	54.9	23.1	68.0	63.6	63.3	50.6
Human	~ 95	~ 95	~ 95	~ 95	~ 95	~ 95	~ 95
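
For the three models that appear in both tables, the shift between the synthetic and real-world splits can be computed on the five attributes the splits share. This is a small sketch over the tabulated numbers only; the helper names are illustrative, and three models are too few to support general conclusions.

```python
SHARED = ["Angle", "Length", "Scale", "Quantity", "Depth"]

# Accuracy (%) on the shared attributes, copied from the two tables above.
synthetic = {
    "Qwen2.5-VL-72B": [37.1, 59.7, 65.0, 57.7, 61.5],
    "InternVL2.5-78B": [35.3, 59.7, 68.6, 42.9, 61.5],
    "GPT-4o": [35.3, 43.1, 54.3, 37.2, 54.1],
}
real = {
    "Qwen2.5-VL-72B": [34.2, 50.6, 43.4, 80.3, 52.6],
    "InternVL2.5-78B": [36.9, 58.6, 49.0, 79.6, 52.6],
    "GPT-4o": [27.5, 30.3, 37.1, 60.5, 35.7],
}


def real_minus_synthetic(model):
    """Per-attribute change in accuracy when moving from synthetic to real data."""
    pairs = zip(SHARED, synthetic[model], real[model])
    return {a: round(r - s, 1) for a, s, r in pairs}


def consistent_direction(attr):
    """Return +1 or -1 if every model shifts the same way on `attr`, else 0."""
    signs = {1 if real_minus_synthetic(m)[attr] > 0 else -1 for m in synthetic}
    return signs.pop() if len(signs) == 1 else 0
```

On these numbers, Quantity is the only shared attribute that improves on real-world images for all three models, while Length, Scale, and Depth drop for all three and Angle moves in mixed directions.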

Key Findings

Angle estimation is the most prominent bottleneck for all models: Among the models listed in the tables above, the best angle accuracy is 37.1% on synthetic data (Qwen2.5-VL-72B) and 36.9% on real-world data (InternVL2.5-78B), only modestly above random guessing (about 25%) and far below the 90.0% achieved by humans. Angle perception may require spatial rotation reasoning capabilities.
Quantity perception performs better in real-world scenarios: Qwen2.5-VL-72B reaches 80.3% accuracy in Quantity on real-world data but only 57.7% on synthetic data, which may reflect a higher prevalence of counting scenarios in pre-training data.
Model scaling yields limited improvements: For models of the same family, scaling parameters by 24x (from 3B to 72B) only improves average accuracy by around 16 percentage points.
Multimodal math/CoT models show no significant advantage: This suggests that number sense is a dimension of capability separate from symbolic reasoning.
Newer versions outperform older ones within the same family (e.g., Qwen2.5-VL > Qwen2-VL), suggesting that iterative improvements in datasets and training methodologies slowly enhance number sense.
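
To put the scaling finding in perspective, the reported figures (3B to 72B parameters, around 16 points gained) can be converted into a gain per doubling of model size. The sketch assumes, purely for illustration, that accuracy grows linearly in the logarithm of parameter count; the paper summary makes no such claim.

```python
import math


def points_per_doubling(small_b: float, large_b: float, gain_pts: float) -> float:
    """Average accuracy gain per doubling of parameters (log-linear assumption)."""
    doublings = math.log2(large_b / small_b)  # 72 / 3 = 24x, about 4.6 doublings
    return gain_pts / doublings


gain = points_per_doubling(3, 72, 16)  # roughly 3.5 points per doubling
```

At that rate, closing a gap of several tens of points to the human baseline through scale alone would take many further doublings.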

Highlights & Insights

Comprehensive and theoretically grounded design of evaluation dimensions: The cross-coverage of 7 numerical attributes × 4 estimation tasks far exceeds existing benchmarks. Grounded in the Approximate Number System from cognitive science, it possesses a solid theoretical foundation.
Unveiling an overlooked bottleneck in capability: MLLMs' weak performance on intuitive number sense (near-random on angles and far below the human level overall) could be an underlying cause of failures in many downstream tasks (e.g., chart understanding and spatial reasoning).
The complementary design of synthetic + real-world data is highly effective: Synthetic data isolates pure numerical reasoning from the interference of scene complexity, while real-world data examines practical utility. The two domains diverge in informative ways: for the models listed in both tables, Quantity accuracy is higher on real-world images, while Scale accuracy is lower.

Limitations & Future Work

The dataset size is relatively small (only ~1,900 questions), which may lead to statistical fluctuations.
Evaluation is limited to zero-shot setups; the potential of few-shot or chain-of-thought prompting to improve number sense remains unexplored.
No training data or fine-tuning schemes are provided to improve number sense, keeping the study restricted to the "diagnostic" level.
Volume estimation only appears in the real-world dataset, lacking controlled experiments in synthetic data.
Future research could explore improvement pathways by incorporating vision-spatial reasoning enhancement methods (e.g., spatial tokens, coordinate encoding).
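
The concern about statistical fluctuations can be quantified with a standard normal-approximation (Wald) confidence interval for an accuracy measured on n questions. This is a generic sketch, not an analysis from the paper; the function name `accuracy_ci` is illustrative, and the per-attribute subset size used in the example is a hypothetical value, since the summary does not report those counts.

```python
import math


def accuracy_ci(acc_pct: float, n: int, z: float = 1.96):
    """Approximate 95% interval (in %) for an accuracy measured on n questions."""
    p = acc_pct / 100.0
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return 100.0 * (p - half_width), 100.0 * (p + half_width)


# 58.5% on the 1,011 synthetic questions: roughly +/- 3 points.
full_split = accuracy_ci(58.5, 1011)
# The same accuracy on a hypothetical 150-question attribute subset: roughly +/- 8 points.
one_attribute = accuracy_ci(58.5, 150)
```

Differences of a few points between models on a single attribute may therefore fall within noise, even when the overall averages are reasonably stable.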

Comparison with Related Work

vs. MathVista: MathVista focuses on solving math problems (requiring symbolic computation), whereas VisNumBench measures intuitive number sense (without requiring computation). The two are complementary.
vs. PhysBench: PhysBench evaluates physical reasoning, while VisNumBench evaluates numerical intuition. Both expose the limitations of MLLMs in "common-sense" domains.
vs. Ordinal Regression Research (e.g., age estimation, image aesthetics assessment): While they test numerical estimation in specific domains, VisNumBench provides a unified cross-domain framework.

Rating

Novelty: ⭐⭐⭐⭐ The first benchmark to systematically evaluate MLLM number sense, featuring a novel problem definition.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive testing on 17 models, though lacking mitigation strategies.
Writing Quality: ⭐⭐⭐⭐ Clear structure and intuitive figures/tables.
Value: ⭐⭐⭐⭐ Uncovers overlooked cognitive deficits in MLLMs, offering useful insights for the community.