
ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning

Conference: ICCV 2025 arXiv: 2512.00305 Code: None Area: Chart Understanding / Multimodal Reasoning Keywords: Chart Reasoning, Multimodal Large Language Models, Chain-of-Thought, Visual Grounding, Numerical Hallucination

TL;DR

This paper proposes PointCoT, which integrates reflective visual grounding (bounding boxes) into the chain-of-thought for chart reasoning, enabling MLLMs to interactively verify each reasoning step against the chart's visual content. It also constructs ChartPoint-SFT-62k, a dataset of 62.3K instruction samples built on 19.2K high-quality charts, achieving a +5.04% improvement on ChartBench.

Background & Motivation

Multimodal large language models (MLLMs) heavily rely on OCR-extracted textual information for chart understanding. When chart text annotations are sparse (e.g., data points lack explicit value labels), models tend to produce severe numerical hallucinations—even when reasoning steps appear plausible, the extracted numerical values contain significant errors.

The authors identify a critical observation: MLLMs exhibit extremely weak grounding capability on chart elements and proportional relationships. When prompted to indicate which chart region corresponds to each reasoning step, models either ignore the request or generate entirely irrelevant coordinates. This reveals that:

  • Conventional CoT enhances logic-based numerical reasoning but fails to improve the model's fundamental numerical perception
  • CoT generates additional reasoning tokens without enabling further interaction with the chart's visual tokens
  • Models lack the "look–point–read–compute" visual reasoning logic that humans employ when reading charts

This motivates the authors to incorporate grounding reflection into the reasoning chain: the model must not only articulate each reasoning step but also output a bounding box indicating which region of the chart it is attending to, verified through re-rendered chart images.

Method

Overall Architecture

The PointCoT data construction pipeline consists of four stages:

  1. Step Decomposition: An LLM generates numerical questions and CoT reasoning steps
  2. Code Editing: An LLM modifies the plotting code to insert special characters at key positions
  3. Code Rendering: The modified code is executed to re-render the chart
  4. Position Localization: OCR detects the positions of special characters and extracts bounding boxes

Key Designs

1. Structured Reasoning Construction

  • Function: Decomposes the reasoning process for chart QA into two categories of steps: "Grounding" and "Reasoning"
  • Mechanism: Qwen2.5-72B is used as the teacher model to generate data-point-related questions and step-by-step reasoning based on the plotting code. Each sub-step is classified as:
    • Grounding steps: Require extracting data from the chart (e.g., locating points on coordinate axes, reading legend entries)
    • Reasoning steps: Perform logical inference based on information from preceding grounding steps
  • Design Motivation: The cognitive process of chart reading is inherently structured—humans first identify key locations before performing numerical reasoning. This structure is not artificially imposed but emerges from the intrinsic logic of chart comprehension
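As a concrete sketch, a PointCoT trace for a grounded question might look like the following (field names, values, and the box format are illustrative assumptions, not the paper's released schema):

```python
# Illustrative PointCoT trace: grounding steps carry bounding boxes and the
# values read from the chart; reasoning steps infer over grounded values only.
trace = {
    "question": "How much higher is sales in Q3 than in Q1?",
    "steps": [
        {"type": "grounding", "text": "Locate the Q1 bar and read its height.",
         "bbox": [120, 830, 180, 870], "value": 14.0},
        {"type": "grounding", "text": "Locate the Q3 bar and read its height.",
         "bbox": [430, 610, 490, 650], "value": 23.0},
        {"type": "reasoning", "text": "Subtract: 23.0 - 14.0 = 9.0."},
    ],
    "answer": 9.0,
}

# A reasoning step may only use values produced by earlier grounding steps.
grounded = [s["value"] for s in trace["steps"] if s["type"] == "grounding"]
assert trace["answer"] == grounded[1] - grounded[0]
```

The key property is that every numerical claim in the chain is anchored to a region of the chart, so a verifier (or the model itself) can check each step against pixels rather than trusting the text.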

2. Point Annotation via Code Editing

  • Function: Generates precise bounding box annotations for each grounding step
  • Mechanism: Rather than relying on unreliable direct MLLM localization, the approach leverages the chart–code correspondence:
    1. The teacher model identifies the chart element/position corresponding to each grounding step
    2. The plotting code is modified to insert special character markers at key positions via plt.text()
    3. The modified code is executed to render a new chart
    4. Multiple OCR tools detect the positions of special characters and extract bounding boxes
  • Design Motivation: LLM-based code modification achieves substantially higher success rates than direct MLLM-based chart element localization; code serves as a reliable intermediary for precise positional annotation
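The code-editing idea can be sketched in a few lines of matplotlib: a rare marker glyph is inserted at the target data point via plt.text(), the chart is re-rendered, and the marker's pixel position is recovered. Here matplotlib's own data-to-pixel transform stands in for the paper's OCR step, and the marker glyph and example chart are illustrative assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

MARKER = "§"  # a glyph unlikely to collide with normal chart text

fig, ax = plt.subplots(figsize=(4, 3), dpi=100)
quarters, sales = ["Q1", "Q2", "Q3"], [14, 19, 23]
ax.bar(quarters, sales)

# The inserted line, analogous to the paper's plt.text() edit:
# place the marker at the data point the grounding step refers to (Q3 = 23).
ax.text(2, 23, MARKER, ha="center", va="bottom")

fig.canvas.draw()  # layout must be computed before transforms are valid
# Data coordinates -> pixel coordinates; in the real pipeline this position
# would instead be detected by running OCR on the rendered image.
x_px, y_px = ax.transData.transform((2, 23))
w, h = fig.canvas.get_width_height()
assert 0 <= x_px <= w and 0 <= y_px <= h  # marker lands inside the canvas
plt.close(fig)
```

Because the marker is placed by code at exact data coordinates, the recovered box is precise by construction, which is what makes this indirect route more reliable than asking an MLLM to localize chart elements directly.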

3. Four Instruction Data Formats

  • Type 1 – Standard VQA: Original chart + question; supervision is the answer or CoT + answer (bounding boxes excluded to prevent data leakage)
  • Type 2 – Grounding Task: Intermediate grounding steps are incorporated into the query prompt; ground truth is the predicted bounding box
  • Type 3 – Edited Chart Reasoning: Bounding box annotations from preceding grounding steps are overlaid onto the original chart to guide the model toward the correct regions
  • Type 4 – Reasoning Steps: Reasoning steps are directly incorporated into the query prompt; the final supervision signal is the final answer

The resulting ChartPoint-SFT-62k dataset comprises 62.3K instruction samples built on 19.2K charts.
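A hedged sketch of how one annotated sample might fan out into the four instruction formats (the keys, prompt wording, and helper name are assumptions, not the released schema):

```python
# Derive the four instruction formats from a single annotated sample.
def build_instructions(sample):
    q, steps, answer = sample["question"], sample["steps"], sample["answer"]
    grounding = [s for s in steps if s["type"] == "grounding"]
    reasoning = [s for s in steps if s["type"] == "reasoning"]
    return [
        # Type 1: standard VQA — no boxes in the target, to avoid leakage
        {"type": 1, "prompt": q, "target": answer},
        # Type 2: grounding task — predict the box for a grounding step
        {"type": 2, "prompt": f"{q} Where is: {grounding[0]['text']}",
         "target": grounding[0]["bbox"]},
        # Type 3: edited-chart reasoning — boxes overlaid on the input chart
        {"type": 3, "prompt": q,
         "overlay_boxes": [g["bbox"] for g in grounding], "target": answer},
        # Type 4: reasoning steps given in the prompt, final answer as target
        {"type": 4,
         "prompt": q + " " + " ".join(s["text"] for s in reasoning),
         "target": answer},
    ]

sample = {
    "question": "Difference between Q3 and Q1?",
    "steps": [
        {"type": "grounding", "text": "read Q1", "bbox": [120, 830, 180, 870]},
        {"type": "grounding", "text": "read Q3", "bbox": [430, 610, 490, 650]},
        {"type": "reasoning", "text": "23 - 14 = 9"},
    ],
    "answer": 9,
}
four = build_instructions(sample)
assert [d["type"] for d in four] == [1, 2, 3, 4]
```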

Loss & Training

Two-stage full-parameter fine-tuning:

  • Stage 1 – Chart Knowledge Alignment: Trained on MMC-Instruct (410K) + ChartGemma (160K) + ChartQA (28K) + ChartBench (30K)
  • Stage 2 – Chart-Specific Annealing Tuning: Instruction fine-tuning with the PointCoT format using ChartPoint-SFT-62k

Training details: AdamW optimizer, learning rate \(5 \times 10^{-5}\) with warmup, weight decay 0.1, gradient clipping 1.0, effective batch size 64, bfloat16 precision, approximately 262 GPU hours (A100-40G). Coordinates are normalized to integers in the range \([0, 999]\).
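The coordinate scheme can be sketched as a mapping from pixel boxes to \([0, 999]\) integers, one tokenizer-friendly integer per coordinate (the function name and rounding choice are assumptions):

```python
# Normalize a pixel-space bounding box to integers in [0, 999],
# independent of the original image resolution.
def normalize_bbox(bbox, width, height):
    """Map (x0, y0, x1, y1) in pixels to integers in [0, 999]."""
    x0, y0, x1, y1 = bbox
    return (round(x0 / width * 999), round(y0 / height * 999),
            round(x1 / width * 999), round(y1 / height * 999))

# A box in an 800x600 image; the full image maps to (0, 0, 999, 999).
print(normalize_bbox((400, 300, 600, 450), 800, 600))  # (500, 500, 749, 749)
print(normalize_bbox((0, 0, 800, 600), 800, 600))      # (0, 0, 999, 999)
```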

Key Experimental Results

Main Results

ChartQA relaxed accuracy@0.05:

| Model | Params | Human | Aug. | Avg. |
|---|---|---|---|---|
| Qwen2-VL | 7B | 72.08 | 94.24 | 83.16 |
| Qwen2.5-VL | 7B | 78.96 | 93.76 | 86.36 |
| ChartMoE+PoT | 8B | 78.32 | 90.96 | 84.64 |
| ChartPoint_Q2 | 7B | 76.12 | 94.48 | 85.28 |
| ChartPoint_Q2.5 | 7B | 81.36 | 94.12 | 87.74 |
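For reference, ChartQA's relaxed accuracy treats a numeric prediction as correct when it is within 5% relative error of the gold value, with exact (case-insensitive) match for non-numeric answers; a minimal sketch:

```python
# Relaxed accuracy@0.05 as used by ChartQA-style evaluation.
def relaxed_correct(pred, gold, tol=0.05):
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        # Non-numeric answers fall back to exact string match.
        return pred.strip().lower() == gold.strip().lower()
    if g == 0:
        return p == g  # no relative tolerance around zero
    return abs(p - g) / abs(g) <= tol

assert relaxed_correct("104", "100")      # 4% error -> correct
assert not relaxed_correct("106", "100")  # 6% error -> wrong
assert relaxed_correct("Blue", "blue")    # string answers: exact match
```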

ChartBench accuracy:

| Model | Regular | Extra | Overall |
|---|---|---|---|
| Qwen2-VL | 58.36 | 59.40 | 58.90 |
| Qwen2.5-VL | 62.73 | 57.26 | 60.91 |
| ChartMoE | 56.31 | 55.58 | 51.67 |
| ChartPoint_Q2 | 63.04 | 62.09 | 62.61 |
| ChartPoint_Q2.5 | 66.71 | 65.03 | 65.95 |

Ablation Study

Training strategy ablation (based on Qwen2-VL):

| Configuration | ChartQA | ChartBench | Description |
|---|---|---|---|
| Baseline (Qwen2-VL) | 83.16 | 58.90 | Original model |
| +Stage1 | 83.74 | 60.39 | Chart knowledge alignment |
| +Stage1+CoT | 84.11 | 60.76 | Text CoT distillation |
| +Stage1+PointCoT | 85.30 | 62.61 | CoT with grounding |

Coordinate format ablation:

| Format | Normalization | Human | Overall | Description |
|---|---|---|---|---|
| Type A | [0–1], 4 decimal places | 73.52 | 83.68 | Continuous coordinates |
| Type B | [0–1], 3 decimal places | 74.68 | 84.42 | Reduced precision |
| Type C | [0–999], integer | 75.36 | 84.84 | Tokenizer-friendly |

Key Findings

  1. PointCoT yields substantially larger gains on ChartBench (+3.71%/+4.28%) than on ChartQA (+1.56%/+1.22%), since ChartBench contains no data-point text annotations and thus places greater demands on visual grounding
  2. Text CoT distillation offers limited improvement (+0.37%), as the reasoning is generated by an LLM rather than an MLLM and does not leverage the chart's visual information
  3. Improvements are more pronounced on Extra-type charts (area charts, box plots, radar charts, etc.) (+7.77%), demonstrating the generalizability of the visual reasoning framework
  4. Normalizing coordinates to \([0, 999]\) integers outperforms continuous floating-point representations, as the former is more compatible with the tokenizer

Highlights & Insights

  • Critical Diagnosis: The paper precisely identifies the core bottleneck in MLLM chart understanding—not insufficient logical reasoning, but weak visual perception (numerical value extraction)
  • Conceptual Innovation: Incorporating "grounding reflection" into CoT ensures that each reasoning step can be validated against visual evidence, rather than relying on purely textual inference
  • Elegant Data Construction: The chart–code correspondence is exploited to achieve precise positional annotation via indirect code modification, circumventing the unreliability of direct MLLM-based localization
  • Rigorous Quality Control: Success rates are tracked at each stage (96%→76%→51%→77%), with a final expert review achieving 91% acceptance across three annotators

Limitations & Future Work

  • The data construction pipeline achieves a success rate of only approximately 28% (66.8K→19.2K), limiting scalability
  • Coverage is restricted to three chart types: bar charts (57.1%), line charts (33.6%), and pie charts (9.3%)
  • QA pairs focus on data point reading and do not address complex numerical computation or multi-step reasoning
  • The approach depends on the grounding capability of the base model (Qwen2-VL/2.5-VL) and is not applicable to models without bounding box support
  • Inference-time scaling (e.g., beam search over location or reasoning steps) is left unexplored
  • MVoT observes a similar phenomenon in structured scenarios such as puzzles—reasoning steps must interact with visual inputs
  • ChartMoE provides chart–code metadata as a foundation; this work innovatively leverages such data for positional annotation
  • The proposed framework can be generalized to other reasoning tasks requiring precise visual perception, such as scientific figures, maps, and engineering diagrams

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First to integrate visual grounding reflection into chart CoT, diagnosing a core MLLM bottleneck
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive ablations and cross-model validation, though dataset type coverage is limited
  • Writing Quality: ⭐⭐⭐⭐ — Pipeline description is clear, though some notational inconsistencies remain
  • Value: ⭐⭐⭐⭐⭐ — Establishes a new paradigm for the chart understanding community, shifting from "textual reasoning" to "visual reasoning"