ChartLens: Fine-Grained Visual Attribution in Charts
Conference: ACL 2025
arXiv: 2505.19360
Code: Yes
Area: Others
Keywords: Chart Attribution, Visual Grounding, Multimodal LLM, Hallucination Detection, Set-of-Marks
TL;DR
The paper proposes the task of post-hoc fine-grained visual attribution in charts. It designs ChartLens, an algorithm that segments chart elements, labels them with marks, and prompts a multimodal LLM with Set-of-Marks for precise attribution. It also constructs the ChartVA-Eval benchmark, on which ChartLens reportedly improves fine-grained attribution by 26-66% across three chart types.
Background & Motivation
- Multimodal Large Language Models (MLLMs) have made progress in chart understanding tasks, but suffer from severe hallucination issues: generated texts are often inconsistent with the visual data of the charts.
- Attribution is a key strategy for mitigating hallucinations, helping model responses trace back to specific visual evidence.
- Existing LLM attribution methods mostly focus on text (e.g., citation retrieval, post-retrieval QA), leaving visual attribution, especially chart visual attribution, virtually unexplored.
- Particular challenges of charts:
  - They encode complex relationships such as precise quantities, trends, and proportions.
  - Visual elements (bars, pie slices, lines) are highly structured and may overlap.
  - Understanding them requires contextual comprehension of chart types, data encodings, axes, legends, and colors.
- Reliable chart attribution is crucial for critical domains such as financial analysis, policy-making, and scientific research.
Method
Overall Architecture
ChartLens is a two-stage pipeline:
1. Mark Generation: Detects chart elements (bars/pies/lines) using segmentation techniques and generates marks for each element.
2. Attribution with MLLMs: Feeds the marked charts into MLLMs to perform validation and attribution via Set-of-Marks prompting.
Key Designs
1. Mark Generation: Segmentation and Marking
Bar Chart Segmentation:
- Input images are binarized using Otsu thresholding in both RGB and HSV spaces.
- Dark backgrounds are automatically inverted to extract foreground contours.
- Contours are decomposed based on unique pixel values to isolate individual bar elements.
- Noise contours are filtered out using solidity and area thresholds.
- Key Improvement: Utilizes SAM (Segment Anything Model) to refine the segmentation by sampling \(n\) points from each detected element as SAM prompts to generate precise masks.
- SAM naturally suppresses background features such as gridlines, whose masks receive low IoU scores.
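The binarization and background-inversion steps above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names `otsu_threshold` and `foreground_mask` are hypothetical, the input is assumed to be a flat list of 8-bit grayscale values, and the "majority class is background" rule is an assumed heuristic for detecting dark backgrounds. Contour extraction, solidity/area filtering, and SAM refinement are not shown.

```python
def otsu_threshold(gray_values, levels=256):
    """Return the threshold t that maximizes between-class variance.

    Values <= t form the low class, values > t the high class.
    """
    hist = [0] * levels
    for v in gray_values:
        hist[v] += 1
    total = len(gray_values)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the low class
    sum0 = 0  # intensity sum of the low class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def foreground_mask(gray_values):
    """Binarize, treating the majority class as background (assumed heuristic)."""
    t = otsu_threshold(gray_values)
    dark = sum(1 for v in gray_values if v <= t)
    dark_background = dark > len(gray_values) / 2
    # On a dark background the bright pixels are foreground, otherwise the dark ones.
    return [(v > t) if dark_background else (v <= t) for v in gray_values]


light_chart = [250] * 70 + [40] * 30   # light background, dark bars
dark_chart = [10] * 70 + [200] * 30    # dark background, bright bars
```

In both toy inputs the 30 "bar" pixels end up as foreground, regardless of whether the background is light or dark.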
Pie Chart Segmentation:
- Identifies the largest contour as the pie chart boundary.
- Calculates the minimum enclosing circle to approximate the chart area.
- Unrolls the chart radially into a linear representation, detecting full edges to identify sector boundaries.
- Maps back to the original circular area to obtain individual sector segmentations.
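A rough sketch of the unroll-and-detect idea follows. It rests on several assumptions that the summary does not confirm: the image is a 2D list of per-pixel labels or colors, the circle center and radius are already known, and a "full edge" is read as an angle at which most radial samples change value at once. The names `unroll`, `full_edges`, `sectors_from_edges`, and the `min_fraction` parameter are hypothetical.

```python
import math


def unroll(image, cx, cy, radius, n_angles=360, n_radii=8):
    """Sample the disc on a polar grid: one row per angle, one column per radius."""
    rows = []
    for a in range(n_angles):
        theta = 2 * math.pi * a / n_angles
        row = []
        for k in range(1, n_radii + 1):
            r = radius * k / (n_radii + 1)  # stay strictly inside the circle
            x = int(round(cx + r * math.cos(theta)))
            y = int(round(cy + r * math.sin(theta)))
            row.append(image[y][x])
        rows.append(row)
    return rows


def full_edges(rows, min_fraction=0.75):
    """Angle indices where most radial samples differ from the previous angle."""
    edges = []
    for a in range(len(rows)):
        prev = rows[a - 1]  # index -1 wraps around to the last angle
        changed = sum(1 for p, q in zip(prev, rows[a]) if p != q)
        if changed >= min_fraction * len(rows[a]):
            edges.append(a)
    return edges


def sectors_from_edges(edges, n_angles):
    """Turn edge angles into (start, end) index ranges; the last range wraps."""
    if len(edges) < 2:
        return [(0, n_angles)]  # no split found: treat the disc as one sector
    return [(edges[i], edges[(i + 1) % len(edges)]) for i in range(len(edges))]


# Hand-built unrolled strip: 12 angles x 4 radii, three sectors A, B, C.
strip = [["A"] * 4] * 3 + [["B"] * 4] * 5 + [["C"] * 4] * 4
strip[5] = ["B", "B", "B", "X"]  # a stray pixel should not create an edge
edges = full_edges(strip)
sectors = sectors_from_edges(edges, len(strip))
```

Requiring the change to span most of the radius is what lets a stray pixel or a text label inside a slice be ignored while true sector boundaries are kept.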
Line Chart Segmentation:
- Utilizes LineFormer (a Transformer architecture) to specifically handle the slender structures, overlapping trajectories, and intersecting lines of line charts.
- After detecting candidate lines, divides them into small segments along the horizontal dimension to serve as fine-grained marks.
- The global context of the Transformer helps differentiate closely-spaced or crossing lines.
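The horizontal splitting step can be illustrated as below. This sketch assumes the line detector has already returned one line as a list of `(x, y)` pixel coordinates; the function name `split_line_into_marks`, the `segment_width` parameter, and the dictionary layout of each mark are hypothetical choices, not details from the paper.

```python
def split_line_into_marks(points, segment_width):
    """Group the (x, y) points of one detected line into consecutive x-intervals."""
    if not points:
        return []
    pts = sorted(points)
    x0 = pts[0][0]
    buckets = {}
    for x, y in pts:
        idx = int((x - x0) // segment_width)
        buckets.setdefault(idx, []).append((x, y))

    marks = []
    for idx in sorted(buckets):
        seg = buckets[idx]
        xs = [p[0] for p in seg]
        ys = [p[1] for p in seg]
        # Each mark keeps its points and an axis-aligned bounding box.
        marks.append({"bbox": (min(xs), min(ys), max(xs), max(ys)), "points": seg})
    return marks


line = [(x, 2 * x) for x in range(10)]
marks = split_line_into_marks(line, segment_width=4)
```

Here the ten points fall into three marks covering x in 0-3, 4-7, and 8-9. A smaller `segment_width` gives finer-grained marks at the cost of more labels on the chart.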
2. Set-of-Marks Prompting for Attribution
- Overlays alphabetical/numerical labels of the segmented elements onto the chart images.
- Constructs structured prompts containing two objectives:
  - Validation: Verifies whether the QA pair is consistent with the chart information.
  - Attribution: Identifies and cites the specific marked elements that support the answer.
- Employs few-shot textual examples and Chain-of-Thought (CoT) reasoning to guide the MLLM in step-by-step validation and attribution.
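The two-objective prompt and the parsing of the model's reply might look roughly like the sketch below. The prompt wording, the `VALID:`/`MARKS:` output format, and the function names `build_som_prompt` and `parse_som_response` are all assumptions made for illustration; the paper's actual prompt (including its few-shot examples) is not reproduced here, and the call to the MLLM itself is omitted.

```python
import re


def build_som_prompt(question, answer, mark_ids):
    """Assemble a text prompt to accompany the marked chart image."""
    marks = ", ".join(str(m) for m in mark_ids)
    return (
        "The chart image has labeled marks on its elements: " + marks + ".\n"
        "Question: " + question + "\n"
        "Answer: " + answer + "\n"
        "Step 1 (validation): think step by step and state whether the answer "
        "is consistent with the chart.\n"
        "Step 2 (attribution): list the marks that support the answer.\n"
        "Finish with two lines in this format:\n"
        "VALID: yes|no\n"
        "MARKS: <comma-separated mark labels>"
    )


def parse_som_response(text, mark_ids):
    """Extract the validation verdict and the cited marks from a model reply."""
    flags = re.IGNORECASE | re.MULTILINE
    valid_match = re.search(r"^VALID:\s*(yes|no)\s*$", text, flags)
    marks_match = re.search(r"^MARKS:\s*(.*)$", text, flags)
    valid = bool(valid_match) and valid_match.group(1).lower() == "yes"
    cited = []
    if marks_match:
        known = {str(m) for m in mark_ids}
        for token in marks_match.group(1).split(","):
            token = token.strip()
            # Ignore labels that were never drawn on the chart.
            if token in known and token not in cited:
                cited.append(token)
    return valid, cited


ids = list(range(1, 9))
prompt = build_som_prompt("Which year had the highest revenue?", "2019", ids)
reply = "The tallest bar is mark 3.\nVALID: yes\nMARKS: 3, 7, 99"
result = parse_som_response(reply, ids)
```

Filtering the cited labels against the known marks is a cheap guard against the model citing an element that does not exist on the chart.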
Problem Formulation
Given a chart image \(c\) and an associated response \(v\), the objective is to produce an attribution set \(A_{c,v} = \{a_1, a_2, \ldots, a_n\}\), where each \(a_i\) corresponds to a visual region (bar/line segment/sector) in the chart that supports the response. The attribution set must satisfy:
1. Relevance: Each region is directly relevant to the response.
2. Completeness: It covers all necessary visual evidence.
3. Precision: It excludes irrelevant parts.
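One way to make these criteria measurable is to compare a predicted attribution set against a gold set, with precision penalizing irrelevant regions and recall penalizing missing evidence. The sketch below assumes regions can be matched exactly by their mark label; the benchmark's actual matching rule (for example, region overlap) may differ, and the function name `attribution_scores` is hypothetical.

```python
def attribution_scores(predicted, gold):
    """Set-level precision, recall, and F1 between predicted and gold marks."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# The prediction covers all gold evidence but adds one irrelevant region.
p, r, f1 = attribution_scores({"3", "7", "9"}, {"3", "7"})
```

In this toy case recall is 1.0 (complete) while precision is 2/3 (one irrelevant region), giving an F1 of 0.8.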
Key Experimental Results
ChartVA-Eval Benchmark Statistics
| Dataset | Queries | Charts | Bar (queries) | Pie (queries) | Line (queries) | Source |
|---|---|---|---|---|---|---|
| ChartVA-AITQA | 301 | 301 | 203 | 0 | 98 | Synthetic |
| ChartVA-PlotQA | 595 | 581 | 396 | 0 | 199 | Synthetic |
| ChartVA-ChartQA | 348 | 266 | 121 | 109 | 118 | Real-world |
Main Results: Bar Chart Attribution (F1)
| Method | AITQA F1 | PlotQA F1 | ChartQA F1 |
|---|---|---|---|
| Zero-shot GPT-4o | 22.77 | 3.30 | 7.75 |
| KOSMOS-2 | 0.51 | 1.01 | 3.13 |
| LISA | 1.62 | 0.34 | 1.01 |
| ChartLens | 69.28 | 34.65 | 64.14 |
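
As a quick arithmetic check on the table above, the margin of ChartLens over the strongest baseline on each dataset can be computed directly from the reported F1 values. The dictionary layout and the helper name `gain_over_best_baseline` are illustrative only.

```python
bar_f1 = {
    "AITQA": {"Zero-shot GPT-4o": 22.77, "KOSMOS-2": 0.51, "LISA": 1.62, "ChartLens": 69.28},
    "PlotQA": {"Zero-shot GPT-4o": 3.30, "KOSMOS-2": 1.01, "LISA": 0.34, "ChartLens": 34.65},
    "ChartQA": {"Zero-shot GPT-4o": 7.75, "KOSMOS-2": 3.13, "LISA": 1.01, "ChartLens": 64.14},
}


def gain_over_best_baseline(scores, method="ChartLens"):
    """Absolute F1 difference between `method` and the best other method."""
    best = max(v for k, v in scores.items() if k != method)
    return round(scores[method] - best, 2)


gains = {name: gain_over_best_baseline(s) for name, s in bar_f1.items()}
# gains: AITQA 46.51, PlotQA 31.35, ChartQA 56.39 (absolute F1 points)
```

Zero-shot GPT-4o is the strongest baseline on all three datasets, so each gain is measured against it.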
Line Chart Attribution (Detection % / Chart Area %)
| Method | AITQA Det.↑ | AITQA Area↓ | PlotQA Det.↑ | PlotQA Area↓ | ChartQA Det.↑ | ChartQA Area↓ |
|---|---|---|---|---|---|---|
| Zero-shot GPT-4o | 18.28 | 1.94 | 6.79 | 8.63 | 3.39 | 1.15 |
| KOSMOS-2 | 74.19 | 46.03 | 38.83 | 27.06 | 87.29 | 41.49 |
| LISA | 94.62 | 63.18 | 50.21 | 40.92 | 50.21 | 40.92 |
| ChartLens | 59.14 | 1.25 | 51.84 | 9.98 | 77.8 | 5.34 |
LISA and KOSMOS-2 reach high detection rates on AITQA and ChartQA largely because their predictions cover a large share of the chart area (27-63%), whereas ChartLens covers only 1-10% of the area, which indicates much better precision.
Pie Chart Attribution (ChartQA)
| Method | P | R | F1 |
|---|---|---|---|
| Zero-shot GPT-4o | 8.94 | 5.99 | 7.17 |
| KOSMOS-2 | 20.18 | 8.24 | 11.70 |
| LISA | 1.32 | 13.86 | 2.41 |
| ChartLens | 53.33 | 44.57 | 48.56 |
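
Assuming the F1 column is the harmonic mean of the P and R columns, which every row of the table is consistent with, the ChartLens entry can be checked by hand:

\[
F_1 = \frac{2PR}{P + R} = \frac{2 \times 53.33 \times 44.57}{53.33 + 44.57} = \frac{4753.84}{97.90} \approx 48.56
\]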
Ablation Study
- Modularity of the Segmentation Module: SAM can be replaced with more advanced segmentation models to achieve further improvements.
- Choice of MLLM: ChartLens is built on GPT-4o, but the underlying multimodal model can be flexibly replaced.
- Effectiveness of SoM Prompting: Asking GPT-4o directly for zero-shot bounding boxes yields very poor performance (F1 of only 3-23%), which supports the need for SoM prompting combined with segmentation.
Key Findings
- ChartLens outperforms every baseline in F1 on bar and pie charts: by the tables above, it gains about 31-56 F1 points over the strongest baseline on bar charts and about 37 F1 points on pie charts.
- Existing visual grounding models are unsuitable for chart attribution: KOSMOS-2 and LISA perform poorly in precisely locating chart elements.
- Zero-shot bounding box prediction by GPT-4o is unreliable: It lacks accurate visual localization capabilities.
- Precision vs. Recall Tradeoff: LISA/KOSMOS-2 achieve high recall by covering large areas, but at the cost of extremely low precision.
- SoM prompting significantly improves attribution quality: labeling each segmented element with a mark enables MLLMs to explicitly reference specific visual elements.
Highlights & Insights
- First to systematically define the task of fine-grained visual attribution in charts, filling an important gap in multimodal attribution.
- Modular design with segmentation + marking + MLLM: Each component can be upgraded independently, showing good extensibility.
- The ChartVA-Eval benchmark covers both synthetic and real-world charts, sourced from authoritative data such as SEC filings, the World Bank, and GTD.
- Provides a plug-and-play post-processing attribution mechanism that can be used with any MLLM without affecting the generation of original answers.
Limitations & Future Work
- Segmentation precision is a bottleneck: inaccurate segmentation leads to incomplete attribution.
- Only handles visual chart elements (bars/lines/sectors); attribution to text elements such as titles, labels, and legends is not covered.
- Relies on GPT-4o as the attribution reasoning engine, which incurs a relatively high cost.
- Has not been tested on more complex chart types (e.g., scatter plots, heatmaps, combined charts).
- Attribution annotation combines automated labeling with manual verification, which makes large-scale standardized annotation difficult.
Related Work & Insights
- Textual attribution (direct, post-retrieval, post-hoc): this work extends it to the visual modality.
- MATSA proposed table attribution, which ChartLens further extends to chart visual elements.
- The hybrid strategy of SAM + LineFormer serves as a reference for other structured visual understanding tasks.
- Insights: Similar attribution frameworks could be applied to other structured visual data, such as infographics, maps, and flowcharts.
Rating
| Dimension | Score |
|---|---|
| Novelty | ⭐⭐⭐⭐ First to define the chart visual attribution task |
| Experimental Thoroughness | ⭐⭐⭐⭐ Three chart types + four baselines + complete benchmark |
| Value | ⭐⭐⭐⭐ Modular and plug-and-play |
| Writing Quality | ⭐⭐⭐⭐ Clear problem definition and detailed experiments |
| Overall Recommendation | ⭐⭐⭐⭐ |