ChartLens: Fine-Grained Visual Attribution in Charts¶

Conference: ACL 2025
arXiv: 2505.19360
Code: Yes
Area: Others
Keywords: Chart Attribution, Visual Grounding, Multimodal LLM, Hallucination Detection, Set-of-Marks

TL;DR¶

Proposes the task of Post-Hoc Fine-grained Visual Attribution in charts, designs the ChartLens algorithm that leverages segmentation techniques to label chart elements and prompt multimodal LLMs with a Set-of-Marks for precise attribution, and constructs the ChartVA-Eval benchmark, achieving an F1 improvement of 26-66% across three chart types.

Background & Motivation¶

Multimodal Large Language Models (MLLMs) have made progress in chart understanding tasks, but suffer from severe hallucination issues: generated texts are often inconsistent with the visual data of the charts.
Attribution is a key strategy for mitigating hallucinations, helping model responses trace back to specific visual evidence.
Existing LLM attribution methods mostly focus on text (e.g., citation retrieval, post-retrieval QA), leaving visual attribution, especially chart visual attribution, virtually unexplored.
Particular challenges of charts:
- They contain complex relationships like precise quantities, trends, and proportions.
- Visual elements (bars, pie slices, lines) are highly structured and may overlap.
- Understanding requires contextual comprehension of chart types, data encodings, axes, legends, and colors.
Reliable chart attribution is crucial for critical domains such as financial analysis, policy-making, and scientific research.

Method¶

Overall Architecture¶

ChartLens is a two-stage pipeline:
1. Mark Generation: Detects chart elements (bars/pies/lines) using segmentation techniques and generates marks for each element.
2. Attribution with MLLMs: Feeds the marked charts into MLLMs to perform validation and attribution via Set-of-Marks prompting.

Key Designs¶

1. Mark Generation — Segmentation and Marking¶

Bar Chart Segmentation:
- Input images are binarized using Otsu thresholding in both RGB and HSV spaces.
- Dark backgrounds are automatically inverted to extract foreground contours.
- Contours are decomposed based on unique pixel values to isolate individual bar elements.
- Noise contours are filtered out using solidity and area thresholds.
- Key Improvement: Utilizes SAM (Segment Anything Model) to refine the segmentation by sampling \(n\) points from detected elements as SAM prompts to generate precise masks.
- SAM naturally suppresses background features such as gridlines (low IoU masks).
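The first and fourth steps above can be sketched in plain Python. Otsu's method picks the gray level that maximizes the between-class variance of the intensity histogram, and solidity is the ratio of a contour's area to the area of its convex hull. The function names, the flat list of 8-bit intensities, and the default thresholds are illustrative assumptions, not values from the paper; a real pipeline would more likely rely on an image library's built-in routines.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximizes between-class variance.

    Pixels with value <= t form one class, the remaining pixels the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0    # pixel count of the low-intensity class
    sum_low = 0  # intensity sum of the low-intensity class
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue
        w_high = total - w_low
        if w_high == 0:
            break
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        between = w_low * w_high * (mean_low - mean_high) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def keep_contour(area, hull_area, min_area=50.0, min_solidity=0.8):
    """Keep a contour only if it is large and solid enough.

    Solidity = contour area / convex-hull area. The default thresholds
    are placeholders chosen for illustration.
    """
    if hull_area <= 0:
        return False
    return area >= min_area and (area / hull_area) >= min_solidity
```

A thin gridline fragment typically fails the area test, while a ragged blob fails the solidity test; a clean rectangular bar has solidity close to 1 and passes both.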

Pie Chart Segmentation:
- Identifies the largest contour as the pie chart boundary.
- Calculates the minimum enclosing circle to approximate the chart area.
- Unrolls the chart radially into a linear representation, detecting full edges to identify sector boundaries.
- Maps back to the original circular area to obtain individual sector segmentations.
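The radial unrolling idea can be illustrated with a deliberately simplified sketch: sample a single ring around the circle center, lay the samples out as a 1-D strip, and report the angles where the value changes. The grid of pre-assigned integer labels, the single sampling radius, and both function names are assumptions made for this example; the method described above works on edges in the full unrolled image rather than on one labeled ring.

```python
import math


def unroll_ring(grid, cx, cy, radius, steps=360):
    """Sample grid values along a circle, turning the ring into a 1-D strip.

    grid is indexed as grid[y][x]; step k corresponds to an angle of
    360 * k / steps degrees measured from the positive x axis.
    """
    strip = []
    for k in range(steps):
        theta = 2 * math.pi * k / steps
        x = int(round(cx + radius * math.cos(theta)))
        y = int(round(cy + radius * math.sin(theta)))
        strip.append(grid[y][x])
    return strip


def sector_boundaries(strip):
    """Return the step indices where the value differs from the previous one.

    Index 0 is compared with the last element, so the strip is treated
    as circular.
    """
    return [k for k in range(len(strip)) if strip[k] != strip[k - 1]]
```

Each pair of consecutive boundary angles delimits one sector, which can then be mapped back to a wedge of the original circle.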

Line Chart Segmentation:
- Utilizes LineFormer (a Transformer architecture) to specifically handle the slender structures, overlapping trajectories, and intersecting lines of line charts.
- After detecting candidate lines, divides them into small segments along the horizontal dimension to serve as fine-grained marks.
- The global context of the Transformer helps differentiate closely-spaced or crossing lines.
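The second step, cutting a detected line into horizontal segments, might look like the sketch below. It assumes a detected line is available as a list of `(x, y)` pixel coordinates and that segments have a fixed width in pixels; both the input format and the `segment_width` parameter are hypothetical, since the summary does not specify how the segment size is chosen.

```python
def split_line_into_marks(points, segment_width):
    """Group the points of one detected line into fixed-width x intervals.

    Returns a list of point lists, ordered left to right; each inner list
    is one candidate mark.
    """
    if segment_width <= 0:
        raise ValueError("segment_width must be positive")
    if not points:
        return []

    x_min = min(x for x, _ in points)
    buckets = {}
    for x, y in sorted(points):
        index = int((x - x_min) // segment_width)
        buckets.setdefault(index, []).append((x, y))
    return [buckets[i] for i in sorted(buckets)]
```

For example, `split_line_into_marks([(0, 5), (10, 6), (20, 7), (30, 8), (40, 9)], 20)` yields three marks covering x in [0, 20), [20, 40), and [40, 60).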

2. Set-of-Marks Prompting for Attribution¶

Overlays alphabetical/numerical labels of the segmented elements onto the chart images.
Constructs structured prompts containing two objectives:
- Validation: Verifies whether the QA pair is consistent with the chart information.
- Attribution: Identifies and cites the specific marked elements that support the answer.
Employs few-shot textual examples and Chain-of-Thought (CoT) reasoning to guide the MLLM in step-by-step validation and attribution.
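A prompt with these two objectives could be assembled roughly as follows. The wording of the prompt, the two-line answer format, and the helper names `build_som_prompt` and `parse_attribution` are all assumptions for illustration; the actual ChartLens prompt is not reproduced in this summary.

```python
import re


def build_som_prompt(question, answer, mark_labels, examples=()):
    """Build a text prompt that asks for validation and mark-level attribution."""
    lines = [
        "The chart image has marks overlaid on its elements: "
        + ", ".join(mark_labels) + ".",
        "Task 1 (Validation): decide whether the answer is consistent with the chart.",
        "Task 2 (Attribution): list the marks that support the answer.",
        "Think step by step, then finish with exactly two lines:",
        "Validation: <yes|no>",
        "Attribution: <comma-separated marks>",
    ]
    for example in examples:  # optional few-shot textual examples
        lines.append("Example:\n" + example)
    lines.append(f"Question: {question}")
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)


def parse_attribution(response, mark_labels):
    """Extract cited marks from the model reply, dropping unknown labels."""
    match = re.search(r"^Attribution:\s*(.*)$", response, flags=re.MULTILINE)
    if not match:
        return []
    cited = [token.strip() for token in match.group(1).split(",")]
    valid = set(mark_labels)
    return [mark for mark in cited if mark in valid]
```

Filtering the reply against the known mark labels guards against the model citing a mark that was never drawn on the chart.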

Problem Formulation¶

Given a chart image \(c\) and an associated response \(v\), the objective is to produce an attribution set \(A_{c,v} = \{a_1, a_2, ..., a_n\}\), where each \(a_i\) corresponds to a visual region (bar/line segment/sector) in the chart that supports the response. The attribution set must satisfy:
1. Relevance: Each region is directly relevant to the response.
2. Completeness: It covers all necessary visual evidence.
3. Precision: It excludes irrelevant parts.
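Completeness and precision map naturally onto recall and precision over sets of regions. The sketch below scores a predicted attribution set against a gold set by exact match on region identifiers; that matching rule and the function name are assumptions, and the benchmark may instead match regions by spatial overlap.

```python
def attribution_prf(predicted, gold):
    """Return (precision, recall, F1) for a predicted attribution set.

    Both arguments are iterables of hashable region identifiers.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With a prediction of `{a1, a2, a3}` against a gold set of `{a2, a3, a4, a5}`, two regions match, giving precision 2/3, recall 1/2, and F1 4/7.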

Key Experimental Results¶

ChartVA-Eval Benchmark Statistics¶

Dataset	Queries	Charts	Bar Charts	Pie Charts	Line Charts	Source
ChartVA-AITQA	301	301	203	0	98	Synthetic
ChartVA-PlotQA	595	581	396	0	199	Synthetic
ChartVA-ChartQA	348	266	121	109	118	Real-world

Main Results: Bar Chart Attribution (F1)¶

Method	AITQA F1	PlotQA F1	ChartQA F1
Zero-shot GPT-4o	22.77	3.30	7.75
KOSMOS-2	0.51	1.01	3.13
LISA	1.62	0.34	1.01
ChartLens	69.28	34.65	64.14
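To read the table as margins, the snippet below subtracts the best baseline F1 from the ChartLens F1 on each dataset. The numbers are copied from the table above; the dictionary layout and the function name are illustrative choices.

```python
bar_f1 = {
    "Zero-shot GPT-4o": {"AITQA": 22.77, "PlotQA": 3.30, "ChartQA": 7.75},
    "KOSMOS-2": {"AITQA": 0.51, "PlotQA": 1.01, "ChartQA": 3.13},
    "LISA": {"AITQA": 1.62, "PlotQA": 0.34, "ChartQA": 1.01},
    "ChartLens": {"AITQA": 69.28, "PlotQA": 34.65, "ChartQA": 64.14},
}


def margin_over_best_baseline(scores, system="ChartLens"):
    """Return, per dataset, the system's F1 minus the best other method's F1."""
    margins = {}
    for dataset, value in scores[system].items():
        best_other = max(
            method_scores[dataset]
            for name, method_scores in scores.items()
            if name != system
        )
        margins[dataset] = round(value - best_other, 2)
    return margins
```

On these figures the strongest baseline is zero-shot GPT-4o on every dataset, and the margins are about 46.5 (AITQA), 31.4 (PlotQA), and 56.4 (ChartQA) F1 points.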

Line Chart Attribution (Detection % / Chart Area %)¶

Method	AITQA Det.↑	AITQA Area↓	PlotQA Det.↑	PlotQA Area↓	ChartQA Det.↑	ChartQA Area↓
Zero-shot GPT-4o	18.28	1.94	6.79	8.63	3.39	1.15
KOSMOS-2	74.19	46.03	38.83	27.06	87.29	41.49
LISA	94.62	63.18	50.21	40.92	50.21	40.92
ChartLens	59.14	1.25	51.84	9.98	77.8	5.34

The high detection rates of LISA and KOSMOS-2 come from covering a large share of the chart area (27-63%), whereas ChartLens covers only 1-10% of the area, indicating far more precise localization.
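The interplay between the two columns can be made concrete with a toy metric. The definitions used here are assumptions, not the benchmark's: a prediction counts as a detection if any predicted box overlaps the gold box, and area is the summed predicted box area as a percentage of the chart (overlaps between predicted boxes are counted twice).

```python
def box_area(box):
    """Area of an axis-aligned box given as (x0, y0, x1, y1)."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def boxes_overlap(a, b):
    """True if two axis-aligned boxes share a region of positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def detection_and_area(pred_boxes, gold_box, chart_width, chart_height):
    """Return (detected, percent of chart area covered by predictions)."""
    detected = any(boxes_overlap(p, gold_box) for p in pred_boxes)
    covered = sum(box_area(p) for p in pred_boxes)
    return detected, 100.0 * covered / (chart_width * chart_height)
```

Under these definitions a single box spanning the whole chart always scores a detection while covering 100% of the area, which is why detection rate has to be read together with area.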

Pie Chart Attribution (ChartQA)¶

Method	P	R	F1
Zero-shot GPT-4o	8.94	5.99	7.17
KOSMOS-2	20.18	8.24	11.70
LISA	1.32	13.86	2.41
ChartLens	53.33	44.57	48.56

Ablation Study¶

Modularity of the Segmentation Module: SAM can be replaced with more advanced segmentation models to achieve further improvements.
Choice of MLLM: ChartLens is based on GPT-4o, but the underlying multimodal model can be flexibly replaced.
Effectiveness of SoM Prompting: Directly allowing GPT-4o to perform zero-shot bounding box prediction yields extremely poor performance (F1 of only 3-23%), verifying the necessity of SoM prompting combined with segmentation.

Key Findings¶

ChartLens outperforms every baseline on all chart types: over the strongest baseline it gains roughly 31-56 F1 points on bar charts and about 37 F1 points on pie charts.
Existing visual grounding models are unsuitable for chart attribution: KOSMOS-2 and LISA perform poorly in precisely locating chart elements.
Zero-shot bounding box prediction by GPT-4o is unreliable: It lacks accurate visual localization capabilities.
Precision vs. Recall Tradeoff: LISA/KOSMOS-2 achieve high recall by covering large areas, but at the cost of extremely low precision.
SoM prompting significantly improves attribution quality: the overlaid marks give the MLLM explicit labels with which to reference specific visual elements.

Highlights & Insights¶

First to systematically define the task of fine-grained visual attribution in charts, filling an important gap in multimodal attribution.
Modular design with segmentation + marking + MLLM: Each component can be upgraded independently, showing good extensibility.
The ChartVA-Eval benchmark covers both synthetic and real-world charts, sourced from authoritative data such as SEC filings, the World Bank, and GTD.
Provides a plug-and-play post-processing attribution mechanism that can be used with any MLLM without affecting the generation of original answers.

Limitations & Future Work¶

Segmentation precision is a bottleneck: inaccurate segmentation leads to incomplete attribution.
Only handles visual chart elements (bars/lines/sectors); attribution to text elements such as titles, labels, and legends is not supported.
Relies on GPT-4o as the attribution reasoning engine, which incurs a relatively high cost.
Has not been tested on more complex chart types (e.g., scatter plots, heatmaps, combined charts).
Attribution annotation relies partly on automation and manual verification, presenting challenges for large-scale standardized annotation.

Related Work & Insights¶

Textual attribution (direct, post-retrieval, post-hoc) is well studied; this work extends it to the visual modality.
MATSA proposed table attribution, which ChartLens further extends to chart visual elements.
The hybrid strategy of SAM + LineFormer serves as a reference for other structured visual understanding tasks.
Insights: Similar attribution frameworks could be applied to other structured visual data, such as infographics, maps, and flowcharts.

Rating¶

Dimension	Score
Novelty	⭐⭐⭐⭐ First to define the chart visual attribution task
Experimental Thoroughness	⭐⭐⭐⭐ Three chart types + three baselines + complete benchmark
Value	⭐⭐⭐⭐ Modular and plug-and-play
Writing Quality	⭐⭐⭐⭐ Clear problem definition and detailed experiments
Overall Recommendation	⭐⭐⭐⭐