ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts

Conference: ACL 2026
arXiv: 2603.28902
Code: https://ckchaos.github.io/ChartDiff
Area: Multimodal / Chart Understanding
Keywords: Chart Comparison, Benchmark, VLM Evaluation, Cross-chart Reasoning, ROUGE vs GPT Score

TL;DR

The authors construct ChartDiff, which they present as the first large-scale benchmark specifically for "dual-chart comparative summarization": 8,541 chart pairs covering 6 chart types, 3 plotting libraries, and approximately 60 visual styles, with comparative summaries generated by LLMs and verified by humans. A systematic evaluation of 14 VLMs/pipelines reveals that leading closed-source models achieve the highest GPT Scores but comparatively low ROUGE, while specialized chart models show the opposite pattern, exposing a severe mismatch between ROUGE and human-perceived quality. Furthermore, multi-series charts remain the most significant weakness for most models.

Background & Motivation

Background: Chart understanding (e.g., ChartQA, Chart-to-Text, Chart Summarization) has advanced rapidly, evolving from templated synthetic datasets (FigureQA, PlotQA) to real-world datasets (ChartQA, ChartQAPro, CharXiv, ChartX, ChartBench) and specialized models (ChartLlama, UniChart, MatCha, ChartAssistant, ChartGemma). However, these almost exclusively focus on "single-chart understanding," treating each chart as an independent unit for QA, data extraction, or summarization.

Limitations of Prior Work: Real-world analysis scenarios are predominantly comparative: analysts examine two charts side by side to evaluate A/B test results, compare models, identify trend differences across periods or regions, detect anomalies, or verify reproducibility. This capability is largely absent from current evaluation suites. The few existing attempts, such as MultiChartQA, INTERCHART, and ReMI, are small-scale (thousands of examples), and their tasks favor QA over free-form summarization. While the contemporaneous ChartAB has a similar scale, it focuses on "fine-grained grounding/alignment" diagnostics rather than evaluating holistic comparative reasoning. Thus, no large-scale, reproducible benchmark exists for the core capability of "open-ended cross-chart comparative summarization by VLMs."

Key Challenge: (1) The difficulty of comparative tasks is not a simple superposition of single-chart understanding; models must simultaneously perform data extraction, alignment, trend comparison, anomaly identification, and text generation for two charts. Performance on single-chart benchmarks cannot be extrapolated here. (2) Standard lexical overlap metrics like ROUGE are largely insensitive to the semantic or factual correctness of long summaries. If datasets reuse similar sentence structures, specialized models can "memorize templates" to achieve high scores without genuine comprehension.
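Point (2) can be made concrete with a simplified unigram-overlap score. The sketch below is a minimal ROUGE-1 F1 (whitespace tokens, no stemming), not the official ROUGE implementation, and the three sentences are made-up illustrations rather than ChartDiff data. A summary that swaps the two trends reuses exactly the same words as the reference and therefore gets a perfect score, while a correct paraphrase scores close to zero.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, invented for illustration only.
reference = "chart a rises steadily while chart b falls after 2015"
factually_wrong = "chart a falls steadily while chart b rises after 2015"
correct_paraphrase = (
    "the first series climbs throughout but the second declines from 2015 onward"
)

wrong_score = rouge1_f1(factually_wrong, reference)
paraphrase_score = rouge1_f1(correct_paraphrase, reference)
print(f"wrong but lexically identical: {wrong_score:.2f}")
print(f"correct paraphrase:            {paraphrase_score:.2f}")
```

Real ROUGE variants add stemming and longer n-grams, but the same blind spot applies: the metric rewards shared wording, not correct comparisons.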

Goal: (1) Construct a "dual-chart comparative summarization" benchmark with sufficient scale, diversity, and annotation quality for rigorous evaluation; (2) Systematically assess the performance gap between current closed-source LLMs, open-source LLMs, domain-specific models, and pipeline solutions; (3) Quantitatively reveal the extent to which lexical metrics fail in comparative summarization using both ROUGE and GPT Score; (4) Perform diagnostic analysis across multiple dimensions (chart types, plotting libraries).

Key Insight: The design principle for chart pairs dictates that each pair differs in only one dimension (data entity, time span, or data category) while keeping others consistent. This makes "difference" the sole variable, creating an interpretable and controlled stress test for models.

Core Idea: Scale the production of high-quality comparative summaries using a three-stage pipeline: "LLM generation → Another LLM judge → Human verification," bypassing the dilemma of "pure human annotation being too slow" and "pure LLM generation being too noisy."

Method

Overall Architecture

The construction of the ChartDiff dataset involves four phases:

(1) Raw Data Collection: Scraping time-series data from Macrotrends, Yahoo Finance, and Visual Crossing across 8 domains (economy, health, migration, labor, population, trade, stocks, weather), covering ~200 countries/regions, 100 cities, and 100 stocks; filtering out breakpoints and incomplete data.

(2) Pairing: Constructing CSV pairs that differ in exactly one of three dimensions (entity, time span, or data category) to ensure controlled differences.

(3) Chart Rendering: Using Matplotlib, Plotly, and Plotnine with 60 style configurations to cover 6 types: line charts, bar charts, horizontal bar charts, multi-series line charts, multi-series bar charts, and pie charts. Each pair shares the same styling, with human checks to avoid occlusions or scaling anomalies.

(4) Annotation: An annotate–judge–verify three-stage process.

The final dataset contains 8,541 pairs (6,041 train / 1,000 val / 1,500 test). The evaluation suite covers 14 models across 4 categories: closed-source (GPT-5.4, Gemini 3.1 Pro, etc.), open-source (Qwen3.5-397B, etc.), specialized (ChartGemma, MatCha), and pipelines (DePlot + GPT/Qwen), using ROUGE-1/2/L and GPT Score (with GPT-5.4 as the judge, achieving a Pearson \(r=0.91\) with human ratings).
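The pairing rule in phase (2) amounts to a simple validity check. The following is a minimal sketch, assuming each CSV can be described by three metadata fields; the `SeriesSpec` class, its field names, and the example values are hypothetical and not taken from the ChartDiff code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesSpec:
    """Hypothetical metadata describing one CSV in a pair."""
    entity: str                 # e.g. a country, city, or stock
    time_span: tuple[int, int]  # (start_year, end_year)
    category: str               # e.g. the measured indicator


DIMENSIONS = ("entity", "time_span", "category")


def differing_dimensions(a: SeriesSpec, b: SeriesSpec) -> list[str]:
    """Return the names of the dimensions on which the two specs differ."""
    return [d for d in DIMENSIONS if getattr(a, d) != getattr(b, d)]


def is_valid_pair(a: SeriesSpec, b: SeriesSpec) -> bool:
    """A pair is valid only if exactly one dimension differs."""
    return len(differing_dimensions(a, b)) == 1


base = SeriesSpec("France", (2000, 2020), "GDP")
print(is_valid_pair(base, SeriesSpec("Germany", (2000, 2020), "GDP")))  # entity only
print(is_valid_pair(base, SeriesSpec("Germany", (1990, 2010), "GDP")))  # two dims
```

Recording which dimension differs alongside each pair would also make later slicing by difference type straightforward.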

Key Designs

  1. "One-dimensional difference" Pairing Constraint:

    • Function: Clearly defines the comparison target for each pair, preventing models from simply listing every difference and enabling objective evaluation.
    • Mechanism: Restricts the difference between two CSVs to exactly one of (a) entity (country/stock/city), (b) time span (e.g., same country, different periods), or (c) data category (same entity and time span, different category). The two charts in a pair contain the same number of data points, and styling (library, color, font) is shared.
    • Design Motivation: In multi-chart benchmarks, confounding difference dimensions prevent evaluators from determining which dimension the model successfully "understood." Pair-wise differences provide a clean research signal and facilitate sliced analysis by chart type or library.
  2. Annotate–judge–verify Three-stage LLM Pipeline:

    • Function: Balances large-scale production (8.5k pairs) with high quality (human-perceived reliability).
    • Mechanism: (a) Sample \(L_1\) from a model pool \(\mathcal{A}\) to generate candidate summaries \(S\) using underlying CSV data (not images, to avoid OCR noise); (b) Sample \(L_2 \in \mathcal{A} \setminus \{L_1\}\) as a judge to evaluate \(S\); (c) Final human verification for factual correctness, completeness of differences, and clarity.
    • Design Motivation: Pure LLM annotation introduces self-bias. Cross-model judging significantly reduces homogeneity bias, while final human verification ensures the benchmark is not held hostage by LLM preferences. Using CSVs for annotation instead of images is a key trick to decouple "evaluation target (image understanding)" from "annotation accuracy (numerical correctness)."
  3. ROUGE + GPT Score Dual Metrics to Reveal Evaluation Mismatch:

    • Function: Quantitatively exposes the failure of lexical-overlap metrics in evaluating long summarization tasks.
    • Mechanism: ROUGE measures n-gram overlap (lexical), while GPT Score uses GPT-5.4 to rate quality and correctness on a 0–5 scale based on grading prompts (verified by a 300-sample human check with \(r=0.91\)).
    • Design Motivation: The severe reversal in ROUGE vs. GPT Score rankings is a critical finding of this benchmark. It exposes the hidden issue of "specialized models gaming scores via template memorization" and makes a strong case for future chart papers to report model-based metrics rather than relying solely on ROUGE.
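The cross-model sampling in the annotate–judge–verify design can be sketched as follows. This is an illustration under assumptions: `generate` and `judge_fn` are hypothetical callables standing in for LLM calls, the model names are placeholders, and the source does not specify what happens to summaries the judge rejects, so the sketch only records the verdict and leaves the human step outside the code.

```python
import random
from typing import Callable


def pick_annotator_and_judge(pool: list[str], rng: random.Random) -> tuple[str, str]:
    """Sample L1 from the pool, then L2 from the pool without L1."""
    if len(set(pool)) < 2:
        raise ValueError("need at least two distinct models in the pool")
    annotator = rng.choice(pool)
    judge = rng.choice([m for m in pool if m != annotator])
    return annotator, judge


def annotate_pair(
    csv_a: str,
    csv_b: str,
    pool: list[str],
    generate: Callable[[str, str, str], str],        # hypothetical LLM call
    judge_fn: Callable[[str, str, str, str], bool],  # hypothetical LLM call
    rng: random.Random,
) -> dict:
    """Stages (a) and (b); stage (c), human verification, happens afterwards."""
    annotator, judge = pick_annotator_and_judge(pool, rng)
    summary = generate(annotator, csv_a, csv_b)  # works on CSV text, not images
    accepted = judge_fn(judge, csv_a, csv_b, summary)
    return {
        "annotator": annotator,
        "judge": judge,
        "summary": summary,
        "judge_accepted": accepted,
    }


# Stub callables so the sketch runs without any model access.
record = annotate_pair(
    "year,value\n2000,1\n",
    "year,value\n2000,2\n",
    ["model_a", "model_b", "model_c"],
    generate=lambda model, a, b: f"[{model}] placeholder summary",
    judge_fn=lambda model, a, b, s: True,
    rng=random.Random(0),
)
print(record["annotator"], record["judge"], record["judge_accepted"])
```

Drawing the judge from the pool minus the annotator is what guarantees that no model grades its own output.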

Key Experimental Results

Main Results: 14 Models × 4 Metrics (Test set: 1,500 pairs)

| Model | Category | ROUGE-1 | ROUGE-2 | ROUGE-L | GPT Score |
|---|---|---|---|---|---|
| GPT-5.4 | Closed-source | 46.02 | 12.28 | 23.45 | 4.95 |
| Gemini 3.1 Pro | Closed-source | 47.21 | 13.48 | 24.20 | 4.86 |
| GPT-5.4-mini | Closed-source | 43.00 | 10.62 | 21.68 | 4.82 |
| Gemini 3.1 Flash Lite | Closed-source | 46.37 | 12.83 | 22.82 | 4.63 |
| Claude Sonnet 4.6 | Closed-source | 47.54 | 13.31 | 23.42 | 4.58 |
| GPT-4o | Closed-source | 44.43 | 11.48 | 22.44 | 4.23 |
| Qwen3.5-397B-A17B | Open-source | 47.07 | 12.68 | 22.57 | 4.54 |
| Qwen3.5-9B | Open-source | 44.09 | 10.84 | 21.16 | 3.65 |
| Qwen2.5-VL-7B | Open-source | 41.18 | 9.82 | 20.88 | 3.18 |
| ChartGemma | Specialized | 51.49 | 17.81 | 28.53 | 2.00 |
| MatCha | Specialized | 49.52 | 18.34 | 28.75 | 1.45 |
| DePlot + GPT-5.4 | Pipeline | 50.75 | 17.25 | 28.88 | 3.58 |
| DePlot + GPT-4o | Pipeline | 46.46 | 13.19 | 23.66 | 3.38 |
| DePlot + Qwen3.5-9B | Pipeline | 43.10 | 10.38 | 20.30 | 2.81 |
| Random baseline | - | 25.50 | 2.50 | 12.81 | 1.17 |
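One way to quantify how far the two metrics disagree is a rank correlation over the 14 models. The sketch below computes Spearman's rho between the ROUGE-L and GPT Score columns of the main results table (random baseline excluded). This check is an illustration added here, not an analysis reported in the source, and it uses the no-ties formula, which applies because neither column contains duplicate values.

```python
# (ROUGE-L, GPT Score) per model, copied from the main results table.
SCORES = {
    "GPT-5.4": (23.45, 4.95),
    "Gemini 3.1 Pro": (24.20, 4.86),
    "GPT-5.4-mini": (21.68, 4.82),
    "Gemini 3.1 Flash Lite": (22.82, 4.63),
    "Claude Sonnet 4.6": (23.42, 4.58),
    "GPT-4o": (22.44, 4.23),
    "Qwen3.5-397B-A17B": (22.57, 4.54),
    "Qwen3.5-9B": (21.16, 3.65),
    "Qwen2.5-VL-7B": (20.88, 3.18),
    "ChartGemma": (28.53, 2.00),
    "MatCha": (28.75, 1.45),
    "DePlot + GPT-5.4": (28.88, 3.58),
    "DePlot + GPT-4o": (23.66, 3.38),
    "DePlot + Qwen3.5-9B": (20.30, 2.81),
}


def ranks(values: list[float]) -> list[int]:
    """Rank 1 = highest value. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation via the no-ties formula."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


rouge_l = [v[0] for v in SCORES.values()]
gpt_score = [v[1] for v in SCORES.values()]
rho = spearman_rho(rouge_l, gpt_score)
print(f"Spearman rho (ROUGE-L vs GPT Score): {rho:.3f}")
```

On these numbers rho comes out close to zero and slightly negative, meaning the ROUGE-L ranking carries essentially no information about the GPT Score ranking.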

Ablation Study: GPT Score Sliced by Chart Type (Selected Models)

| Model | Overall | Line | Bar | Bar(H.) | Line(M.) | Bar(M.) | Pie |
|---|---|---|---|---|---|---|---|
| GPT-5.4 | 4.95 | 4.97 | 4.97 | 4.89 | 4.90 | 4.88 | 4.99 |
| Gemini 3.1 Pro | 4.86 | 4.82 | 4.90 | 4.94 | 4.65 | 4.85 | 4.98 |
| Qwen3.5-9B | 3.65 | 3.82 | 3.89 | 3.55 | 3.20 | 3.33 | 3.57 |
| Qwen2.5-VL-7B | 3.18 | 3.54 | 3.14 | 2.79 | 2.79 | 2.53 | 3.58 |
| ChartGemma | 2.00 | 2.36 | 2.36 | 2.01 | 1.30 | 1.36 | 1.68 |
| DePlot+GPT-5.4 | 3.58 | 3.89 | 4.63 | 3.16 | 2.91 | 4.65 | 1.24 |
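The single-series versus multi-series gap can be read directly off this table. The sketch below is a small helper added for illustration: it subtracts the Line(M.) score from Line and the Bar(M.) score from Bar for each listed model, using only the values shown above.

```python
# GPT Score by chart type, copied from the table above (four columns only).
GPT_BY_TYPE = {
    "GPT-5.4": {"Line": 4.97, "Bar": 4.97, "Line(M.)": 4.90, "Bar(M.)": 4.88},
    "Gemini 3.1 Pro": {"Line": 4.82, "Bar": 4.90, "Line(M.)": 4.65, "Bar(M.)": 4.85},
    "Qwen3.5-9B": {"Line": 3.82, "Bar": 3.89, "Line(M.)": 3.20, "Bar(M.)": 3.33},
    "Qwen2.5-VL-7B": {"Line": 3.54, "Bar": 3.14, "Line(M.)": 2.79, "Bar(M.)": 2.53},
    "ChartGemma": {"Line": 2.36, "Bar": 2.36, "Line(M.)": 1.30, "Bar(M.)": 1.36},
    "DePlot+GPT-5.4": {"Line": 3.89, "Bar": 4.63, "Line(M.)": 2.91, "Bar(M.)": 4.65},
}


def multi_series_drop(scores: dict[str, float]) -> dict[str, float]:
    """Single-series score minus multi-series score; positive means a drop."""
    return {
        "line": round(scores["Line"] - scores["Line(M.)"], 2),
        "bar": round(scores["Bar"] - scores["Bar(M.)"], 2),
    }


drops = {model: multi_series_drop(s) for model, s in GPT_BY_TYPE.items()}
for model, d in drops.items():
    print(f"{model:16s} line drop {d['line']:+.2f}   bar drop {d['bar']:+.2f}")
```

Most rows show a positive drop on both chart families. The one exception in this selection is the bar gap for DePlot+GPT-5.4, which is slightly negative.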

Plotting Library Slice (GPT Score): GPT-5.4 scores 4.94/4.97/4.93 across Matplotlib/Plotly/Plotnine (diff ≤0.04), while DePlot+GPT-5.4 scores 4.08/3.12/3.89, dropping nearly 1 point on Plotly.

Key Findings

  • Extreme Mismatch between ROUGE and GPT Score: ChartGemma and MatCha beat GPT-5.4 by roughly 5–6 points on ROUGE-2 and ROUGE-L (and by 3.5–5.5 points on ROUGE-1), yet reach only 2.00 and 1.45 in GPT Score, not far above the random baseline of 1.17. This suggests their high ROUGE scores stem largely from template memorization rather than from correct comparisons.
  • Multi-series Charts are the Biggest Weakness: Scores for Line(M.) and Bar(M.) drop by 0.5–0.6 compared to single-series counterparts for medium models like Qwen3.5-9B, and fall toward the random baseline for ChartGemma.
  • Pie Charts are a Disaster for Pipelines: DePlot+GPT-5.4 scores only 1.24 on Pie charts (vs. 3.58 overall) because DePlot was not pre-trained on pie charts; pipelines are far less robust across chart types than end-to-end VLMs.
  • Strong End-to-End Models are Library-Agnostic: GPT-5.4 varies by at most 0.04 across the three libraries, suggesting that strong end-to-end VLMs generalize robustly across rendering styles.
  • Reliable GPT Score: High Pearson correlation (\(r=0.91\)) with human evaluation validates GPT Score as a credible metric.
  • Open-Source Gap: The strongest open-source model (Qwen3.5-397B, 4.54) still lags behind GPT-5.4 by 0.41, while smaller models drop significantly, highlighting multi-chart comparison as a task that differentiates model tiers.
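The \(r=0.91\) figure is a Pearson correlation between judge scores and human ratings on the same samples. A minimal sketch of that computation follows; the two rating lists are made-up toy values used only to exercise the function, not the 300-sample data from the paper.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy ratings on a 0-5 scale, invented for illustration only.
judge_scores = [5, 4, 4, 2, 1, 3]
human_scores = [5, 5, 4, 2, 1, 2]
toy_r = pearson_r(judge_scores, human_scores)
print(f"toy Pearson r = {toy_r:.2f}")
```

A high \(r\) indicates that judge and human scores move together linearly. It does not by itself show that the two agree on absolute score levels.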

Highlights & Insights

  • Exposing ROUGE Failure: This benchmark provides dramatic evidence (ROUGE 51 vs. GPT Score 2) that lexical metrics are insufficient, establishing model-based evaluation as a necessary standard for future chart research.
  • Methodology of "One-dimensional Difference": This design is transferable to other fields like multi-image VQA and causal reasoning, transforming model evaluation from chaotic multi-variable scenarios into controlled experiments.
  • Decoupled Annotation Mastery: The three-stage pipeline achieves high-quality labeling at scale. The crucial trick is using CSVs for annotation to decouple factual correctness from image perception issues.
  • Common Research Frontier: Nearly every model in the chart-type breakdown scores lower on multi-series charts than on the single-series counterparts, providing a clear signal for future research directions.
  • Precise Quantification of Pipeline Fragility: Pipeline failures on Pie charts and Plotly configurations highlight that upstream errors in chart-to-table conversion are the primary bottleneck.

Limitations & Future Work

  • Covers only 6 chart types and 3 libraries; does not include complex visualizations like Sankey, heatmaps, or 3D plots.
  • While verified by humans, the underlying summaries are LLM-generated and may inherit stylistic biases from training data.
  • Main evaluation relies on GPT Score; despite high correlation (\(r=0.91\)), it cannot fully replace in-depth expert evaluation.
  • Focuses on open-ended summarization; does not cover tasks like difference detection QA or point-by-point alignment.
  • Missing large-scale fine-tuning experiments; specialized models were only fine-tuned for 5 epochs.

Related Work & Insights

  • vs. ChartAB: While both are large-scale, ChartAB focuses on grounding/alignment diagnostics, whereas ChartDiff targets holistic comparative summarization.
  • vs. MultiChartQA / INTERCHART / ReMI: ChartDiff addresses the "open-ended generation" gap with its 8.5k pairs.
  • vs. Specialized Models (MatCha, ChartGemma): The findings serve as a wake-up call that current chart-specific instruction tuning suffers from severe over-fitting to templates.
  • Insights: (1) Future chart research should target the dual challenge of "multi-chart + multi-series"; (2) The decoupled annotation paradigm (Data-level LLM labeling + Image-level testing) is highly effective; (3) Future specialized models must shift from "imitating reference style" to "generating factually correct and comprehensive descriptions."

Rating

  • Novelty: ⭐⭐⭐⭐ First 8k+ scale comparative summarization benchmark.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Solid across models and slices, though lacking scaling/fine-tuning depth.
  • Writing Quality: ⭐⭐⭐⭐ Clear arguments and user-friendly tables.
  • Value: ⭐⭐⭐⭐⭐ The findings regarding ROUGE mismatch and specialized model failure will likely reshape the evaluation paradigm in chart understanding.