# ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning

- Conference: ICCV 2025
- arXiv: 2512.00305
- Code: None
- Area: Chart Understanding / Multimodal Reasoning
- Keywords: Chart Reasoning, Multimodal Large Language Models, Chain-of-Thought, Visual Grounding, Numerical Hallucination
## TL;DR
This paper proposes PointCoT, which integrates reflective visual grounding (bounding boxes) into the chain-of-thought for chart reasoning, enabling MLLMs to interactively verify each reasoning step against the chart's visual content. It also constructs the ChartPoint-SFT-62k dataset, built from 19.2K high-quality charts, achieving a +5.04% improvement on ChartBench.
## Background & Motivation
Multimodal large language models (MLLMs) heavily rely on OCR-extracted textual information for chart understanding. When chart text annotations are sparse (e.g., data points lack explicit value labels), models tend to produce severe numerical hallucinations—even when reasoning steps appear plausible, the extracted numerical values contain significant errors.
The authors identify a critical observation: MLLMs exhibit extremely weak grounding capability on chart elements and proportional relationships. When prompted to indicate which chart region corresponds to each reasoning step, models either ignore the request or generate entirely irrelevant coordinates. This reveals that:

- Conventional CoT enhances logic-based numerical reasoning but fails to improve the model's fundamental numerical perception
- CoT generates additional reasoning tokens without enabling further interaction with the chart's visual tokens
- Models lack the "look–point–read–compute" visual reasoning logic that humans employ when reading charts
This motivates the authors to incorporate grounding reflection into the reasoning chain: the model must not only articulate each reasoning step but also output a bounding box indicating which region of the chart it is attending to, verified through re-rendered chart images.
## Method

### Overall Architecture

The PointCoT data construction pipeline consists of four stages (a code sketch follows the list):

1. Step Decomposition: An LLM generates numerical questions and CoT reasoning steps
2. Code Editing: An LLM modifies the plotting code to insert special characters at key positions
3. Code Rendering: The modified code is executed to re-render the chart
4. Position Localization: OCR detects the positions of the special characters and extracts bounding boxes
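As a mental model of how the stages chain together, here is a hypothetical skeleton; the callables stand in for the teacher LLM, the code executor, and the OCR tools, and none of the names come from the paper (no official code is released):

```python
# Hypothetical skeleton of the four-stage PointCoT pipeline; all helper
# names are illustrative, not from the paper's (unreleased) code.
def build_pointcot_sample(plot_code, decompose, insert_markers, render, locate_markers):
    # 1. Step Decomposition: the teacher LLM writes a numerical question
    #    plus grounding/reasoning steps from the plotting code.
    question, steps = decompose(plot_code)

    # 2. Code Editing: the LLM inserts special marker characters at the
    #    chart positions referenced by each grounding step.
    edited_code = insert_markers(plot_code, steps)

    # 3. Code Rendering: execute the edited code to re-render the chart.
    edited_image = render(edited_code)

    # 4. Position Localization: OCR finds the markers and yields one
    #    bounding box per grounding step.
    boxes = locate_markers(edited_image)
    return question, steps, boxes
```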
### Key Designs

#### 1. Structured Reasoning Construction
- Function: Decomposes the reasoning process for chart QA into two categories of steps: "Grounding" and "Reasoning"
- Mechanism: Qwen2.5-72B is used as the teacher model to generate data-point-related questions and step-by-step reasoning based on the plotting code. Each sub-step is classified as:
  - Grounding steps: Require extracting data from the chart (e.g., locating points on coordinate axes, reading legend entries)
  - Reasoning steps: Perform logical inference based on information from preceding grounding steps
- Design Motivation: The cognitive process of chart reading is inherently structured—humans first identify key locations before performing numerical reasoning. This structure is not artificially imposed but emerges from the intrinsic logic of chart comprehension
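To make the split concrete, a decomposed sample might be represented with a schema like the following; the class and field names are illustrative assumptions, not the paper's data format:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Step:
    kind: str                                          # "grounding" or "reasoning"
    text: str                                          # natural-language description
    bbox: Optional[Tuple[int, int, int, int]] = None   # grounding steps only


@dataclass
class PointCoTSample:
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: str = ""


# Example values are invented for illustration.
sample = PointCoTSample(
    question="How much higher is the 2021 bar than the 2020 bar?",
    steps=[
        Step("grounding", "Locate the 2020 bar and read its height.", (120, 40, 160, 300)),
        Step("grounding", "Locate the 2021 bar and read its height.", (200, 25, 240, 300)),
        Step("reasoning", "Subtract the 2020 value from the 2021 value."),
    ],
    answer="12",
)
```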
#### 2. Point Annotation via Code Editing
- Function: Generates precise bounding box annotations for each grounding step
- Mechanism: Rather than relying on unreliable direct MLLM localization, the approach leverages the chart–code correspondence:
  - The teacher model identifies the chart element/position corresponding to each grounding step
  - The plotting code is modified to insert special character markers at key positions via plt.text()
  - The modified code is executed to render a new chart
  - Multiple OCR tools detect the positions of the special characters and extract bounding boxes
- Design Motivation: LLM-based code modification achieves substantially higher success rates than direct MLLM-based chart element localization; code serves as a reliable intermediary for precise positional annotation
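A minimal sketch of the marker trick, assuming the chart is drawn with matplotlib (the marker string, chart data, and OCR step are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

years = ["2019", "2020", "2021"]
values = [8, 15, 27]

fig, ax = plt.subplots()
ax.bar(years, values)

# Edited-in line: place a distinctive marker string at the element the
# grounding step "read the 2021 bar" refers to (in the paper, the teacher
# LLM inserts such text calls into the plotting code).
ax.text(2, 27, "@@", ha="center", va="bottom", fontsize=14)

fig.savefig("edited_chart.png")

# An OCR pass over edited_chart.png then detects "@@" and returns its
# bounding box, which becomes the annotation for that grounding step.
```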
#### 3. Four Instruction Data Formats
- Type 1 – Standard VQA: Original chart + question; supervision is the answer or CoT + answer (bounding boxes excluded to prevent data leakage)
- Type 2 – Grounding Task: Intermediate grounding steps are incorporated into the query prompt; ground truth is the predicted bounding box
- Type 3 – Edited Chart Reasoning: Bounding box annotations from preceding grounding steps are overlaid onto the original chart to guide the model toward the correct regions
- Type 4 – Reasoning Steps: Reasoning steps are directly incorporated into the query prompt; the supervision signal is the final answer
The resulting ChartPoint-SFT-62k dataset comprises 62.3K instruction samples built over 19.2K charts.
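For concreteness, the four formats might serialize roughly as follows; the prompt wording and values are invented, and only the structure of each type follows the paper:

```python
# Type 1 - Standard VQA: original chart; target is the answer (or CoT +
# answer); bounding boxes are withheld to prevent leakage.
type1 = {
    "image": "chart.png",
    "query": "What is the value of the 2021 bar?",
    "target": "Step 1: locate the 2021 bar ... Answer: 27",
}

# Type 2 - Grounding Task: a grounding step goes into the query; the
# target is its bounding box.
type2 = {
    "image": "chart.png",
    "query": "Step: locate the 2021 bar. Output its bounding box.",
    "target": "[312, 38, 406, 743]",
}

# Type 3 - Edited Chart Reasoning: boxes from earlier grounding steps are
# drawn onto the chart image itself to steer attention.
type3 = {
    "image": "chart_with_boxes.png",
    "query": "What is the value of the 2021 bar?",
    "target": "Answer: 27",
}

# Type 4 - Reasoning Steps: the reasoning steps appear in the query; the
# target is only the final answer.
type4 = {
    "image": "chart.png",
    "query": "Given: read the 2020 bar; read the 2021 bar; subtract. "
             "How much did the value grow?",
    "target": "Answer: 12",
}
```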
### Loss & Training

Two-stage full-parameter fine-tuning:

- Stage 1 – Chart Knowledge Alignment: Trained on MMC-Instruct (410K) + ChartGemma (160K) + ChartQA (28K) + ChartBench (30K)
- Stage 2 – Chart-Specific Annealing Tuning: Instruction fine-tuning with the PointCoT format using ChartPoint-SFT-62k
Training details: AdamW optimizer, learning rate \(5 \times 10^{-5}\) with warmup, weight decay 0.1, gradient clipping 1.0, effective batch size 64, bfloat16 precision, approximately 262 GPU hours on A100-40G GPUs. Bounding-box coordinates are normalized to the integer range \([0, 999]\).
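The \([0, 999]\) convention is a plain rescaling of pixel-space boxes, roughly as below (image size and box values are examples):

```python
def normalize_bbox(bbox, width, height):
    """Map a pixel-space box (x1, y1, x2, y2) to integers in [0, 999]."""
    x1, y1, x2, y2 = bbox
    return (
        round(x1 / width * 999),
        round(y1 / height * 999),
        round(x2 / width * 999),
        round(y2 / height * 999),
    )

print(normalize_bbox((120, 40, 160, 300), width=640, height=480))
# (187, 83, 250, 624)
```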
## Key Experimental Results

### Main Results

ChartQA relaxed accuracy@0.05 (a sketch of the metric follows the table):
| Model | Params | Human | Aug. | Avg. |
|---|---|---|---|---|
| Qwen2-VL | 7B | 72.08 | 94.24 | 83.16 |
| Qwen2.5-VL | 7B | 78.96 | 93.76 | 86.36 |
| ChartMoE+PoT | 8B | 78.32 | 90.96 | 84.64 |
| ChartPoint_Q2 | 7B | 76.12 | 94.48 | 85.28 |
| ChartPoint_Q2.5 | 7B | 81.36 | 94.12 | 87.74 |
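For reference, relaxed accuracy@0.05 counts a numeric prediction as correct if it falls within 5% of the gold value, with exact match for non-numeric answers; a minimal sketch of the metric:

```python
def relaxed_correct(pred: str, gold: str, tol: float = 0.05) -> bool:
    """ChartQA-style relaxed match: 5% numeric tolerance, else exact match."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        # Non-numeric answers fall back to exact (case-insensitive) match.
        return pred.strip().lower() == gold.strip().lower()
    if g == 0:
        return p == 0
    return abs(p - g) / abs(g) <= tol


assert relaxed_correct("102", "100")      # 2% off: counted correct
assert not relaxed_correct("106", "100")  # 6% off: counted wrong
```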
ChartBench accuracy:
| Model | Regular | Extra | Overall |
|---|---|---|---|
| Qwen2-VL | 58.36 | 59.40 | 58.90 |
| Qwen2.5-VL | 62.73 | 57.26 | 60.91 |
| ChartMoE | 56.31 | 55.58 | 51.67 |
| ChartPoint_Q2 | 63.04 | 62.09 | 62.61 |
| ChartPoint_Q2.5 | 66.71 | 65.03 | 65.95 |
### Ablation Study
Training strategy ablation (based on Qwen2-VL):
| Configuration | ChartQA | ChartBench | Description |
|---|---|---|---|
| Baseline (Qwen2-VL) | 83.16 | 58.90 | Original model |
| +Stage1 | 83.74 | 60.39 | Chart knowledge alignment |
| +Stage1+CoT | 84.11 | 60.76 | Text CoT distillation |
| +Stage1+PointCoT | 85.30 | 62.61 | CoT with grounding |
Coordinate format ablation (serialization examples follow the table):
| Format | Normalization | Human | Overall | Description |
|---|---|---|---|---|
| Type A | [0–1], 4 decimal places | 73.52 | 83.68 | Continuous coordinates |
| Type B | [0–1], 3 decimal places | 74.68 | 84.42 | Reduced precision |
| Type C | [0–999], integer | 75.36 | 84.84 | Tokenizer-friendly |
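The three formats would serialize the same (example) box as follows; Type C produces short integer tokens, which the ablation suggests the tokenizer handles best:

```python
# Normalized box in [0, 1]; the values are invented for illustration.
x1, y1, x2, y2 = 0.1875, 0.0833, 0.2500, 0.6250

type_a = f"[{x1:.4f}, {y1:.4f}, {x2:.4f}, {y2:.4f}]"      # [0.1875, 0.0833, 0.2500, 0.6250]
type_b = f"[{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]"      # [0.188, 0.083, 0.250, 0.625]
type_c = str([round(v * 999) for v in (x1, y1, x2, y2)])  # [187, 83, 250, 624]
```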
### Key Findings
- PointCoT yields substantially larger gains on ChartBench (+3.71%/+4.28%) than on ChartQA (+1.56%/+1.22%), since ChartBench contains no data-point text annotations and thus places greater demands on visual grounding
- Text CoT distillation offers limited improvement (+0.37%), as the reasoning is generated by an LLM rather than an MLLM and does not leverage the chart's visual information
- Improvements are more pronounced on Extra-type charts (area charts, box plots, radar charts, etc.) (+7.77%), demonstrating the generalizability of the visual reasoning framework
- Normalizing coordinates to \([0, 999]\) integers outperforms continuous floating-point representations, as the former is more compatible with the tokenizer
## Highlights & Insights
- Critical Diagnosis: The paper precisely identifies the core bottleneck in MLLM chart understanding—not insufficient logical reasoning, but weak visual perception (numerical value extraction)
- Conceptual Innovation: Incorporating "grounding reflection" into CoT ensures that each reasoning step can be validated against visual evidence, rather than relying on purely textual inference
- Elegant Data Construction: The chart–code correspondence is exploited to achieve precise positional annotation via indirect code modification, circumventing the unreliability of direct MLLM-based localization
- Rigorous Quality Control: Success rates are tracked at each stage (96%→76%→51%→77%), with a final expert review achieving 91% acceptance across three annotators
## Limitations & Future Work
- The data construction pipeline achieves a success rate of only approximately 28% (66.8K→19.2K), limiting scalability
- Coverage is restricted to three chart types: bar charts (57.1%), line charts (33.6%), and pie charts (9.3%)
- QA pairs focus on data point reading and do not address complex numerical computation or multi-step reasoning
- The approach depends on the grounding capability of the base model (Qwen2-VL/2.5-VL) and is not applicable to models without bounding box support
- Inference-time scaling (e.g., beam search over location or reasoning steps) is left unexplored
## Related Work & Insights
- MVoT observes a similar phenomenon in structured scenarios such as puzzles—reasoning steps must interact with visual inputs
- ChartMoE provides chart–code metadata as a foundation; this work innovatively leverages such data for positional annotation
- The proposed framework can be generalized to other reasoning tasks requiring precise visual perception, such as scientific figures, maps, and engineering diagrams
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — First to integrate visual grounding reflection into chart CoT, diagnosing a core MLLM bottleneck
- Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive ablations and cross-model validation, though dataset type coverage is limited
- Writing Quality: ⭐⭐⭐⭐ — Pipeline description is clear, though some notational inconsistencies remain
- Value: ⭐⭐⭐⭐⭐ — Establishes a new paradigm for the chart understanding community, shifting from "textual reasoning" to "visual reasoning"