ChartMuseum: Testing Chart Visual Reasoning in Large Vision-Language Models¶
Conference: NeurIPS 2025 arXiv: 2505.13444 Project: https://chartmuseum-leaderboard.github.io Area: Multimodal VLM Keywords: Chart Understanding, Visual Reasoning, Benchmark, VLM Evaluation, Chart QA
TL;DR¶
This paper introduces ChartMuseum, a chart question-answering benchmark of 1,162 expert-annotated questions over real-world charts drawn from 184 distinct sources. It is the first benchmark to systematically separate visual reasoning from textual reasoning, revealing that the strongest current model, Gemini-2.5-Pro, reaches only 63.0% accuracy compared to 93% for humans, with visual-reasoning performance lagging textual reasoning by 35%–55%.
Background & Motivation¶
- Existing chart QA benchmarks over-rely on textual reasoning: On ChartQA, Claude-3.7-Sonnet achieves 74.1% accuracy using only extracted text information (without seeing the chart image), versus 87.4% with the image — indicating that the majority of questions do not require genuine visual reasoning.
- On ChartMuseum, the same text-only approach yields only 15.2% (vs. 61.3% with the image), a gap of 46%, demonstrating that ChartMuseum genuinely evaluates visual reasoning.
- Frontier models are approaching saturation on existing benchmarks: Model accuracy on ChartQA is clustered between 85%–90%, making it difficult to differentiate model capabilities.
- The distinction between visual and textual reasoning has been overlooked: Chart understanding involves two types of reasoning — inferring directly from graphical relationships (visual reasoning) versus inferring from extracted text or numerical values (textual reasoning) — yet prior work does not explicitly distinguish between them.
- A synthetic data case study exposes the problem: The authors test models on synthetic charts containing no textual annotations; as visual complexity (number of overlays/subplots \(n\)) increases, model performance degrades significantly while human performance remains stable.
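To make the case-study setup concrete, below is a minimal, hypothetical sketch (Python/matplotlib) of how an annotation-free synthetic chart with \(n\) overlaid series could be generated; the chart style, parameters, and function name are illustrative assumptions, not the authors' actual generation code.

```python
import numpy as np
import matplotlib.pyplot as plt

def make_synthetic_chart(n_series: int, seed: int = 0, path: str = "chart.png") -> None:
    """Render n overlaid line series with no numeric annotations on the data."""
    rng = np.random.default_rng(seed)
    x = np.arange(20)
    fig, ax = plt.subplots(figsize=(5, 3))
    for i in range(n_series):
        y = np.cumsum(rng.normal(size=x.size))  # random-walk series
        ax.plot(x, y, label=f"series {i}")
    ax.legend(fontsize=6)
    # No printed values anywhere: a question like "which series reaches the
    # highest peak?" can only be answered by visually comparing the lines.
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)

# Increasing n_series raises visual complexity; per the case study, model
# accuracy degrades as n grows while human accuracy stays roughly constant.
make_synthetic_chart(n_series=8)
```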
Method¶
Overall Architecture¶
ChartMuseum is a chart question-answering (Chart QA) benchmark dataset manually annotated by 13 computer science researchers. It contains 1,162 (image, question, short answer) tuples derived from 928 unique real-world charts sourced from 184 distinct websites, split into a 162-question dev set and a 1,000-question test set.
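Conceptually, each benchmark item can be thought of as a small record like the sketch below; the field names are illustrative assumptions rather than the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ChartMuseumExample:
    image_path: str       # one of 928 unique real-world chart images
    question: str         # fully human-written question (no templates, no LLM assistance)
    answer: str           # short, objective reference answer
    reasoning_type: str   # "text" | "visual" | "text/visual" | "synthesis"

# 1,162 examples in total: a 162-question dev set and a 1,000-question test set.
```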
Key Designs¶
- Distinguishing visual reasoning from textual reasoning: The paper explicitly categorizes chart reasoning into two types:
- Visual Reasoning: Inference drawn from graphical relationships that is difficult to express purely in natural language (e.g., judging the correlation between two variables in a scatter plot).
- Visual Extraction: A subtype of visual reasoning that involves reading numerical values by interpreting visual elements (e.g., estimating a bar's value by comparing it against the y-axis scale).
- Textual Reasoning: Logical, arithmetic, or comparative operations performed on already-extracted information, or directly reading textual annotations from the chart.
- This distinction reveals a strong bias toward textual reasoning in existing benchmarks.
- Four-category question taxonomy:
- Textual Reasoning (123 questions): Answerable almost entirely through textual reasoning.
- Visual Reasoning (510 questions): Primarily requires visual reasoning; the largest category.
- Text/Visual Reasoning (234 questions): Answerable via either textual or visual reasoning.
- Synthesis Reasoning (133 questions): Requires both textual and visual reasoning simultaneously.
- Multi-stage quality control pipeline:
- Stage 1: Selection of high-quality chart images.
- Stage 2: Manual creation of question–answer pairs (no LLM assistance, no templates).
- Stage 3: Independent reviewer verification of answer correctness.
- Stage 4: Iterative refinement through discussion with annotators.
- Each sample required an average of 20 minutes (10 min annotation + 5 min review + 5 min iteration), totaling approximately 400 hours.
- Annotation guidelines: each question's answer space must contain at least four plausible answer options; answers must be objective and unambiguous; why/how/descriptive/compound questions are excluded.
Loss & Training¶
This paper presents a benchmark evaluation study and does not involve model training. Evaluation employs LLM-as-a-Judge (GPT-4.1-mini) to assess answer equivalence. All questions have a single definitive answer; approximate matching with error tolerance is not used.
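As a rough illustration of the judging step, here is a minimal sketch using the OpenAI Python SDK; the judge prompt, decision format, and function name are assumptions for illustration, not the paper's actual prompt.

```python
from openai import OpenAI

client = OpenAI()

def judge_equivalent(question: str, reference: str, prediction: str) -> bool:
    """Ask GPT-4.1-mini whether a predicted answer is equivalent to the reference."""
    prompt = (
        "Given a chart question, a reference answer, and a model's predicted answer, "
        "decide whether the prediction is equivalent to the reference. "
        "Reply with exactly 'yes' or 'no'.\n\n"
        f"Question: {question}\nReference answer: {reference}\nPredicted answer: {prediction}"
    )
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```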
Key Experimental Results¶
Main Results¶
| Model | Visual (510) | Synthesis (133) | Visual/Text (234) | Text (123) | Overall (1000) |
|---|---|---|---|---|---|
| Open-source small models | |||||
| InternVL3-2B | 12.2 | 13.5 | 18.4 | 30.1 | 16.0 |
| Qwen2.5-VL-3B | 16.7 | 21.1 | 26.5 | 28.5 | 21.0 |
| Open-source medium models | |||||
| Qwen2.5-VL-7B | 19.4 | 24.8 | 36.3 | 41.5 | 26.8 |
| InternVL3-8B | 23.5 | 24.8 | 32.9 | 42.3 | 28.2 |
| Bespoke-MiniChart-7B | 26.3 | 32.3 | 41.0 | 54.5 | 34.0 |
| Open-source large models | |||||
| Qwen2.5-VL-32B | 29.0 | 36.1 | 46.2 | 62.6 | 38.1 |
| Pixtral-Large-124B | 31.6 | 36.1 | 40.6 | 65.9 | 38.5 |
| Qwen2.5-VL-72B | 30.4 | 35.3 | 42.3 | 68.3 | 38.5 |
| Closed-source models | |||||
| Gemini-1.5-Flash | 22.7 | 30.8 | 36.3 | 56.1 | 31.1 |
| GPT-4o | 31.8 | 45.1 | 50.9 | 65.9 | 42.2 |
| GPT-4.1 | 37.1 | 53.4 | 54.3 | 78.9 | 48.4 |
| Claude-3.5-Sonnet | 45.7 | 53.4 | 61.5 | 78.0 | 54.4 |
| Claude-3.7-Sonnet | 50.6 | 55.6 | 69.2 | 88.6 | 60.3 |
| Reasoning models | |||||
| o3 (high) | 50.4 | 63.2 | 69.7 | 85.4 | 60.9 |
| o4-mini (high) | 51.2 | 66.2 | 68.4 | 86.2 | 61.5 |
| Claude-3.7-Sonnet (think) | 52.5 | 56.4 | 71.8 | 86.2 | 61.7 |
| Gemini-2.5-Pro | 53.3 | 64.7 | 70.1 | 87.8 | 63.0 |
| Human | 98.2 | — | — | — | 93.0 |
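The visual-vs-text gaps cited in the Key Findings section below can be recomputed directly from this table; a tiny illustrative snippet:

```python
# Reported accuracies (%) taken from the table above.
scores = {
    "GPT-4.1":           {"visual": 37.1, "text": 78.9},
    "Qwen2.5-VL-72B":    {"visual": 30.4, "text": 68.3},
    "Claude-3.7-Sonnet": {"visual": 50.6, "text": 88.6},
}
for model, s in scores.items():
    print(f"{model}: text - visual = {s['text'] - s['visual']:.1f} points")
# GPT-4.1: 41.8, Qwen2.5-VL-72B: 37.9, Claude-3.7-Sonnet: 38.0
```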
Ablation Study¶
Text-extraction experiment comparing existing benchmarks:
| Dataset | Text Extraction Only | With Image |
|---|---|---|
| ChartQA | 74.1% | 87.4% |
| ChartMuseum | 15.2% | 61.3% |
The gap between text-extraction-only and image-based performance on ChartMuseum reaches 46%, far exceeding ChartQA's 13%, confirming that ChartMuseum genuinely evaluates visual reasoning.
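A rough sketch of this two-condition comparison is shown below, assuming the Anthropic Python SDK; the prompts, helper names, and model ID are illustrative assumptions, not the paper's exact setup.

```python
import base64
import anthropic

client = anthropic.Anthropic()

def _ask(content, max_tokens=1024):
    resp = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # assumed model ID
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": content}],
    )
    return resp.content[0].text

def _image_block(path):
    data = base64.standard_b64encode(open(path, "rb").read()).decode()
    return {"type": "image",
            "source": {"type": "base64", "media_type": "image/png", "data": data}}

def answer_text_only(chart_path, question):
    # Condition 1: first transcribe all chart text, then answer from the
    # transcription alone, never showing the image to the answering call.
    extracted = _ask([_image_block(chart_path),
                      {"type": "text",
                       "text": "Transcribe all text and numbers visible in this chart."}])
    return _ask([{"type": "text",
                  "text": f"Chart text:\n{extracted}\n\nQuestion: {question}\nAnswer briefly."}])

def answer_with_image(chart_path, question):
    # Condition 2: standard evaluation with the chart image provided.
    return _ask([_image_block(chart_path), {"type": "text", "text": question}])
```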
Error analysis on visual question categories (50 error instances sampled per model):
| Error Type | Claude-3.7-Sonnet | Gemini-2.5-Pro |
|---|---|---|
| Symbol Selection | 34% | 28% |
| Visual Comparison | 28% | 26% |
| Trajectory Tracking | 14% | 12% |
| X/Y Value Identification | 6% | 28% |
| Strategy Error | 16% | 2% |
| Textual Reasoning Error | 6% | 2% |
Key Findings¶
- Large gap between closed-source and open-source models: The best open-source model, Qwen2.5-VL-72B (38.5%), trails the best closed-source model, Gemini-2.5-Pro (63.0%), by 24.5 percentage points.
- Visual reasoning substantially lags behind textual reasoning: All models score 35%–55% lower on the Visual category than on the Text category. For example, GPT-4.1 achieves 78.9% on Text but only 37.1% on Visual (a drop of 41.8%); Qwen2.5-VL-72B drops from 68.3% to 30.4% (a drop of 37.9%).
- Reasoning models yield limited gains: Enabling extended thinking in Claude-3.7-Sonnet improves performance by only 1.4% (60.3%→61.7%), suggesting that the bottleneck lies in fundamental visual capabilities rather than reasoning chain length.
- Human visual reasoning is near-perfect: Humans achieve 98.2% on the visual reasoning subset (56/57 correct), whereas the strongest model reaches only 53.3%.
- Specialized models still lag significantly: Bespoke-MiniChart-7B substantially outperforms open-source models of comparable scale (34.0% vs. 26.8%/28.2%) but remains far behind closed-source models.
- Strategy errors: 16% of Claude-3.7-Sonnet's errors are strategy errors — the model fails to adopt the visual reasoning "shortcut" and instead attempts to extract numerical values for computation, leading to incorrect answers.
Highlights & Insights¶
- The formal distinction between visual and textual reasoning is the paper's most significant contribution; this framework enables quantification of the asymmetry between these two capabilities in LVLMs.
- The "extraction-only" experiment (Section 2.2) elegantly exposes the limitations of benchmarks such as ChartQA — 74% of questions can be answered correctly without ever viewing the chart.
- The four-category visual task taxonomy (Symbol Selection / Visual Comparison / Trajectory Tracking / X/Y Value Identification) provides concrete directions for future model improvement.
- The discovery of strategy errors is particularly noteworthy: models are over-reliant on textualized reasoning strategies and tend to extract numerical values and compute, even when a simple visual comparison would suffice — revealing a deep architectural bias in current LVLMs.
- Dataset annotation was performed entirely by humans (no LLM-generated questions), at 20 minutes per sample and approximately 400 hours in total, with rigorous quality control.
Limitations & Future Work¶
- Only English-language charts and questions are included; multilingual settings are not covered.
- Only short-answer QA is evaluated; tasks such as summarization and open-ended generation are not addressed.
- Unanswerable questions are not included.
- The dataset scale (1,162 questions) is relatively modest, with limited samples in some subcategories.
- No concrete methods for improving model visual reasoning are proposed; the work is purely diagnostic.
- Future directions include designing targeted training data or architectural improvements based on the identified visual reasoning weaknesses.
Related Work & Insights¶
- Evolution of chart QA benchmarks: FigureQA/DVQA (synthetic charts + template questions) → ChartQA (real charts + human-authored questions) → CharXiv/ChartQAPro (more complex but limited sources or model-generated questions) → ChartMuseum (multi-source + fully human-authored + reasoning-type distinction).
- Root causes of visual reasoning difficulty: visual encoder bottlenecks (Prismatic VLMs), misalignment in visual feature decoding, limited abstract visual reasoning capability, and difficulty recognizing features that resist textual description.
- Limited effectiveness of CoT for visual reasoning: Unlike the substantial gains observed in mathematics and code, extended thinking yields nearly no improvement on chart understanding, echoing findings that "thinking makes humans worse" in certain perceptual tasks.
- Implication: Future LVLMs must strengthen visual reasoning at the architectural level, rather than relying solely on extending reasoning chain length.
Rating¶
| Dimension | Score | Comment |
|---|---|---|
| Problem Significance | ⭐⭐⭐⭐⭐ | Exposes systematic deficiencies in LVLM visual reasoning; the problem is precisely identified |
| Methodological Novelty | ⭐⭐⭐⭐ | The formal visual vs. textual reasoning distinction and four-category taxonomy are original |
| Experimental Thoroughness | ⭐⭐⭐⭐⭐ | 21 models + human baseline, multi-dimensional analysis, detailed error categorization |
| Writing Quality | ⭐⭐⭐⭐⭐ | Clear motivation chain: inadequate existing benchmarks → synthetic validation → new benchmark → comprehensive evaluation → error analysis |
| Value | ⭐⭐⭐⭐ | Provides a diagnostic tool and concrete directions for improving LVLM visual reasoning |
| Overall | 4.6/5 | A high-quality benchmark paper with precise problem formulation and complete experimental design |