ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation

Conference: ACL 2025
arXiv: 2501.06598
Code: https://github.com/thunlp/ChartCoder
Area: Multimodal VLM
Keywords: Chart understanding, Code generation, Chart-to-Code, Code LLM, Synthetic data

TL;DR

Proposes ChartCoder, the first dedicated chart-to-code MLLM. It uses a Code LLM as the language backbone, combined with a 160K large-scale chart-to-code dataset and a "Snippet-of-Thought" step-by-step reasoning approach. The 7B model beats all open-source MLLMs across three benchmarks and approaches the performance of GPT-4o.

Background & Motivation

Background: MLLMs have achieved good performance on chart understanding tasks (e.g., ChartQA). However, prevailing methods represent chart information using natural language descriptions, which inevitably loses dense information (such as data values and style details).

Limitations of Prior Work: (a) Parsing charts into code is a lossless representation, but existing open-source MLLMs produce code with low execution rates and poor preservation of chart details (e.g., InternVL2-8B often mistakes chart types and coordinate scales); (b) There is a lack of large-scale, diverse training data for chart-to-code tasks—the largest existing dataset, ChartLlama, contains only 7.8K pairs across only 10 chart types.

Key Challenge: The low proportion of code in the training corpora of general LLMs leads to an alignment gap between code generation and chart understanding in MLLMs. Furthermore, direct code generation often tends to ignore crucial details (e.g., colors, data values, styling parameters).

Goal: (a) How to improve the execution rate and visual fidelity of chart code generated by MLLMs? (b) How to construct a large-scale, high-quality chart-to-code training dataset? (c) How to guide the model to focus on critical chart details?

Key Insight: Code capability can be strengthened at the source by replacing the general-purpose LLM with a Code LLM (DeepSeek Coder) as the language backbone. A large-scale dataset can be batch-constructed via a "generate-then-execute" pipeline, which guarantees a one-to-one mapping between chart and code. A step-by-step reasoning format, "Snippet-of-Thought," can be used to emphasize critical chart elements.

Core Idea: Code LLM + 160K synthetic data + step-by-step reasoning = first dedicated chart-to-code MLLM.

Method

Overall Architecture

Input: Chart image. Output: Executable Python code (Matplotlib/Seaborn) that renders back into the original chart. Model Architecture: SigLIP-384 vision encoder + two-layer MLP connector + DeepSeek Coder 6.7B language backbone. An Any Resolution strategy is adopted to handle high-resolution chart images.
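
The role of the two-layer MLP connector can be illustrated with a toy sketch: it maps each visual patch feature from the vision encoder's dimension into the language backbone's embedding dimension. Everything below is an assumption for illustration only: the dimensions are tiny placeholders, the weights are random, and the GELU activation is borrowed from common LLaVA-style connectors rather than confirmed by this summary.

```python
import math
import random


def gelu(x: float) -> float:
    """Tanh approximation of GELU (activation choice is an assumption)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def make_layer(rng: random.Random, n_in: int, n_out: int):
    """Random weights and zero bias for one linear layer."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weight, bias


def linear(vec, weight, bias):
    """Plain matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def connector(patch_features, layer1, layer2):
    """Two-layer MLP applied independently to every visual patch feature."""
    tokens = []
    for feature in patch_features:
        hidden = [gelu(h) for h in linear(feature, *layer1)]
        tokens.append(linear(hidden, *layer2))
    return tokens


# Toy sizes, NOT the real SigLIP or DeepSeek Coder dimensions.
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4

rng = random.Random(0)
layer1 = make_layer(rng, VISION_DIM, LLM_DIM)
layer2 = make_layer(rng, LLM_DIM, LLM_DIM)
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]

visual_tokens = connector(patches, layer1, layer2)
print(len(visual_tokens), "visual tokens of width", len(visual_tokens[0]))
```

The resulting token sequence is what the language backbone would consume alongside the text prompt; with an Any Resolution strategy the number of patches grows with image size while the connector stays the same.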

Key Designs

Chart2Code-160K Dataset Construction:
- Function: Constructing the first large-scale instruction tuning dataset for chart-to-code generation.
- Mechanism: A "generate code first then execute" strategy—prompting an LLM to generate code, executing it to render charts, thereby forming (chart, code) pairs. Workflow: LLM generates domain keywords -> generates mock data -> generates code using in-context demonstrations of 79 template codes across 27 chart types -> executes code -> filters out failures (removing charts with pixel anomalies or coordinate scale errors) -> obtains 160K high-quality data pairs.
- Design Motivation: Chart-to-code data requires strict one-to-one mapping (one image corresponds to one code block), and diversity must lie in the chart types rather than instruction variants; moreover, the code must be syntactically correct and executable. These constraints render traditional data augmentation methods unsuitable.
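
The execute-and-filter half of the pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `render` and `build_pairs` are hypothetical, the LLM generation step is replaced by two hard-coded stand-in scripts, and the scripts are assumed to write their output to the path passed as the first command-line argument (a real candidate would call something like `plt.savefig(sys.argv[1])`). The pixel-anomaly and coordinate-scale checks mentioned in the text are not implemented here.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def render(code: str, out_path: Path, timeout: float = 30.0) -> bool:
    """Run one candidate script in a subprocess; keep it only if it exits
    cleanly and actually wrote a non-empty output file."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code, str(out_path)],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    if result.returncode != 0:
        return False
    return out_path.exists() and out_path.stat().st_size > 0


def build_pairs(candidates, workdir: Path):
    """Return (image_path, code) pairs for every candidate that rendered."""
    pairs = []
    for index, code in enumerate(candidates):
        out_path = workdir / f"chart_{index}.png"
        if render(code, out_path):
            pairs.append((out_path, code))
    return pairs


# Stand-ins for LLM-generated plotting scripts (hypothetical content).
GOOD_SCRIPT = (
    "import sys\n"
    "with open(sys.argv[1], 'wb') as f:\n"
    "    f.write(b'placeholder image bytes')\n"
)
BAD_SCRIPT = "raise ValueError('simulated broken plotting call')\n"

workdir = Path(tempfile.mkdtemp())
pairs = build_pairs([GOOD_SCRIPT, BAD_SCRIPT], workdir)
print(f"kept {len(pairs)} of 2 candidates")
```

Because each image is produced by running exactly one script, every surviving pair is aligned by construction, which is the one-to-one property the design motivation asks for.
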
Snippet-of-Thought (SoT):
- Function: Transforming direct chart-to-code generation into a step-by-step format.
- Mechanism: The format mimics how a person would reproduce a chart, in four steps. Step 1: chart types and layouts (e.g., plt.bar(), plt.subplot()). Step 2: data and colors (e.g., data=[10,15], colors=['#FF0000']). Step 3: key details (e.g., hatch='/', loc='upper left'). Step 4: the complete final code. Each step combines a text explanation with a code snippet. To build the data, 50K pairs are sampled from the 160K dataset and an LLM deconstructs each complete program into these four steps. The deconstruction is done post hoc, rather than generating the steps one by one during dataset creation, so that the intermediate steps cannot diverge from the final code.
- Design Motivation: The issue with direct code templates is that template structures are often similar, leaving only colors, values, and other small details different. Vanilla models easily ignore these subtle differences. Step-by-step reinforcement of each metadata category forces the model to focus on fine-grained details.
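
One way to picture a Snippet-of-Thought training target is as four (explanation, snippet) records flattened into a single text. The sketch below is an assumption about the shape of such a sample, not the dataset's actual schema: the `Step` class, the `format_sot` layout, and the example chart are all hypothetical, with snippets taken from the examples in the text. The consistency check reflects the post-hoc deconstruction idea: every intermediate snippet must literally appear in the final code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    title: str
    explanation: str
    snippet: str


# Hypothetical final program; it is stored as text and not executed here.
FINAL_CODE = """import matplotlib.pyplot as plt

data = [10, 15]
colors = ['#FF0000', '#0000FF']
plt.bar(['A', 'B'], data, color=colors, hatch='/', label='value')
plt.legend(loc='upper left')
plt.show()
"""

steps = [
    Step("Chart types and layouts", "The image is a single-panel bar chart.", "plt.bar("),
    Step("Data and colors", "Two bars with heights 10 and 15.", "data = [10, 15]"),
    Step("Key details", "Bars are hatched and the legend sits top left.", "hatch='/'"),
    Step("Complete final code", "Putting the pieces together.", FINAL_CODE),
]


def is_consistent(sample) -> bool:
    """Every intermediate snippet must occur verbatim in the final code."""
    final_code = sample[-1].snippet
    return all(step.snippet in final_code for step in sample[:-1])


def format_sot(sample) -> str:
    """Flatten the four steps into one training target string."""
    blocks = []
    for number, step in enumerate(sample, start=1):
        blocks.append(f"Step {number}: {step.title}\n{step.explanation}\n{step.snippet}")
    return "\n\n".join(blocks)


target_text = format_sot(steps)
print(target_text.splitlines()[0])
```

Deriving the steps from an existing program makes a check like `is_consistent` trivially satisfiable, whereas steps generated independently could contradict the final code.
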
Code LLM as Language Backbone:
- Function: Utilizing DeepSeek Coder 6.7B in place of a general-purpose LLM as the backbone of the MLLM.
- Mechanism: Code LLMs are pre-trained on large code corpora and therefore have stronger code generation ability and a better grasp of code syntax and semantics. Training is done in two stages: (1) chart-to-text alignment (freezing the vision encoder and LLM, training only the connector); (2) chart-to-code instruction tuning (joint fine-tuning of the full model).
- Design Motivation: General MLLMs suffer from low code percentages in their pre-training data, which leads to poor executability and low fidelity in chart-to-code tasks.

Loss & Training

Two-stage training. Stage 1 (Alignment Phase) uses chart description data such as UniChart, Chart-to-Text, and SciCap, combined with LLaVA pre-training data and Chart2Code-160K; only the connector is trained. Stage 2 (Instruction Tuning Phase) jointly fine-tunes all parameters on Chart2Code-160K, SoT data, ChartQA PoT, etc.
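
The two-stage schedule can be written down as a small table of which modules are trainable in each stage. This is a sketch under stated assumptions: the module names, stage keys, and the `frozen_modules` helper are hypothetical labels, and the data lists simply repeat the sources named above (Stage 2's "etc." is left open rather than filled in).

```python
MODULES = ("vision_encoder", "connector", "language_backbone")

STAGES = {
    "stage1_alignment": {
        # Only the connector is updated; encoder and LLM stay frozen.
        "trainable": {"connector"},
        "data": [
            "UniChart",
            "Chart-to-Text",
            "SciCap",
            "LLaVA pre-training data",
            "Chart2Code-160K",
        ],
    },
    "stage2_instruction_tuning": {
        # Joint fine-tuning of the full model.
        "trainable": set(MODULES),
        "data": ["Chart2Code-160K", "SoT data", "ChartQA PoT"],
    },
}


def frozen_modules(stage: str):
    """Modules whose parameters would be excluded from the optimizer."""
    trainable = STAGES[stage]["trainable"]
    return [name for name in MODULES if name not in trainable]


for stage in STAGES:
    print(stage, "freezes", frozen_modules(stage))
```

In a real framework the same table would typically drive a loop that sets `requires_grad` (or the equivalent) per module before each stage begins.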

Key Experimental Results

Main Results

Comparison across three chart-to-code benchmarks (ChartCoder has the best result among the listed open-source models in every column):

Model	Params	ChartMimic Exec.Rate	ChartMimic High-Level	Plot2Code Pass Rate	ChartX GPT-score
GPT-4o	-	93.2	83.5	88.6	-
InternVL2-76B	76B	83.2	62.2	85.6	1.74
Qwen2-VL-72B	72B	73.3	50.9	72.0	1.69
InternVL2-8B	8.1B	61.8	38.9	77.3	1.63
TinyChart	3B	42.5	25.9	43.2	1.89
ChartCoder	7B	91.4	74.0	87.9	2.09
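
As a quick arithmetic check on the table, the margin between ChartCoder and the strongest listed open-source baseline can be computed per metric. The numbers are copied from the rows above; the helper name `margins` is hypothetical, and only the baselines shown in this table are considered.

```python
METRICS = (
    "ChartMimic Exec.Rate",
    "ChartMimic High-Level",
    "Plot2Code Pass Rate",
    "ChartX GPT-score",
)

# Open-source baselines exactly as listed in the table above.
OPEN_SOURCE_BASELINES = {
    "InternVL2-76B": (83.2, 62.2, 85.6, 1.74),
    "Qwen2-VL-72B": (73.3, 50.9, 72.0, 1.69),
    "InternVL2-8B": (61.8, 38.9, 77.3, 1.63),
    "TinyChart": (42.5, 25.9, 43.2, 1.89),
}
CHARTCODER = (91.4, 74.0, 87.9, 2.09)


def margins():
    """For each metric: (best listed baseline, ChartCoder minus that baseline)."""
    result = {}
    for index, metric in enumerate(METRICS):
        best_name, best_scores = max(
            OPEN_SOURCE_BASELINES.items(), key=lambda item: item[1][index]
        )
        result[metric] = (best_name, round(CHARTCODER[index] - best_scores[index], 2))
    return result


for metric, (baseline, gap) in margins().items():
    print(f"{metric}: +{gap} over {baseline}")
```

The margin is smallest on Plot2Code Pass Rate and largest on the ChartMimic High-Level score, and the strongest listed baseline on the first three metrics is roughly ten times larger than ChartCoder.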

Ablation Study (ChartMimic High-Level Score)

ChartMimic fine-grained sub-scores (different maximum scores):

Model	Chart Types(/20)	Layout(/10)	Text(/20)	Data(/20)	Style(/20)	Clarity(/10)
GPT-4o	18.96	9.59	17.16	15.68	14.66	8.84
InternVL2-8B	7.20	6.82	8.81	5.74	5.42	6.64
ChartCoder	16.83	9.13	14.77	12.41	12.68	8.29
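
Because the six dimensions have different maxima (which sum to 100), the raw sub-scores are easier to compare after dividing each by its maximum. The sketch below does that normalization on the rows above; the helper names are hypothetical.

```python
MAX_SCORE = {
    "Chart Types": 20,
    "Layout": 10,
    "Text": 20,
    "Data": 20,
    "Style": 20,
    "Clarity": 10,
}

# Sub-scores copied from the table above.
SUB_SCORES = {
    "GPT-4o": dict(zip(MAX_SCORE, (18.96, 9.59, 17.16, 15.68, 14.66, 8.84))),
    "InternVL2-8B": dict(zip(MAX_SCORE, (7.20, 6.82, 8.81, 5.74, 5.42, 6.64))),
    "ChartCoder": dict(zip(MAX_SCORE, (16.83, 9.13, 14.77, 12.41, 12.68, 8.29))),
}


def normalized(model: str):
    """Each sub-score as a fraction of its maximum."""
    return {dim: SUB_SCORES[model][dim] / MAX_SCORE[dim] for dim in MAX_SCORE}


def weakest_dimension(model: str) -> str:
    """Dimension with the lowest fraction of its maximum."""
    scores = normalized(model)
    return min(scores, key=scores.get)


def total(model: str) -> float:
    """Plain sum of the six raw sub-scores."""
    return round(sum(SUB_SCORES[model].values()), 2)


for model in SUB_SCORES:
    print(model, "weakest:", weakest_dimension(model), "total:", total(model))
```

After normalization, Data is ChartCoder's weakest dimension, while Style is the weakest for GPT-4o. The summed sub-scores come out close to, but not identical with, the High-Level column in the main results table, so the two should not be assumed to use exactly the same aggregation.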

Key Findings

7B Model Approaches GPT-4o: On ChartMimic, ChartCoder is within 1.8 points of GPT-4o on Exec. Rate (91.4 vs 93.2) and 9.5 points behind on the High-Level score (74.0 vs 83.5), while significantly outperforming open-source models of comparable scale.
Code LLM Backbone is Crucial: Compared to using a general LLM with the same architecture, the Code LLM backbone boosts execution rate from ~60% to 91.4%, which underscores the vital role of code pre-training in chart-to-code tasks.
SoT Significantly Enhances Detail Restoration: Step-by-step generation shows the most pronounced improvements in detail-related dimensions such as Chart Types, Data, and Style.
Outperforms Specialized Chart Models: ChartLlama (13B) and ChartVLM (14.3B) are heavily outperformed by ChartCoder (7B) on chart-to-code generation, illustrating that specialized data and training strategies are more important than model scale.
Synthetic Charts Approach Real-world Quality: According to GPT-4o evaluation, the quality of charts in Chart2Code-160K (77.32) is highly comparable to real-world charts in ChartMimic (78.96).

Highlights & Insights

Pioneering Exploration of Coding Backbones for MLLMs: Introducing Code LLMs to multimodal scenarios can easily be extended to other vision tasks requiring code outputs (e.g., UI designs to code, screenshots to HTML).
Practicality of Snippet-of-Thought: Breaking code down into four stages (Type -> Data -> Details -> Complete Code) boosts model capabilities and improves the interpretability of outputs.
Clever Inversion in Data Construction: Reversing the training objective of "generating code from image" by adopting a "generate code then execute to render image" paradigm guarantees precise alignment between code and image.

Limitations & Future Work

Restricted to Matplotlib/Seaborn: Code generation is limited to standard Python visualization packages, lacking support for web-oriented libraries such as D3.js or ECharts.
Limited Coverage of 27 Chart Types: Special chart categories (e.g., Sankey diagrams, timelines, geographic maps) are not yet supported.
No Interactive Refinement: The model can only produce complete code in a single turn, without the ability to perform iterative revisions based on user feedback.
Template Code Homogeneity: Despite the use of SoT, some structural homogeneity persists in the code of the 160K dataset.
Future Directions: Extending to more programming languages and visualization libraries; integrating RL training with self-execution feedback.

vs. CoSyn (Previous Paper): CoSyn also synthesizes chart data via code generation, but its objective is to generate training data for chart QA VLM. In contrast, ChartCoder focuses on outputting code directly to reconstruct charts. The two tasks are complementary—CoSyn's data can be used to augment ChartCoder's training.
vs. GPT-4o: GPT-4o is currently the strongest on chart-to-code generation, but ChartCoder approaches its level at only 7B parameters, demonstrating a clear cost-efficiency advantage.
vs. TinyChart: Although TinyChart (3B) is small, its chart-to-code ability is weak (High-Level score of 25.9), demonstrating that simple model compression fails to address the code generation challenge.

Rating

Novelty: ⭐⭐⭐⭐ First dedicated chart-to-code MLLM; the use of a Code LLM backbone and the SoT formulation are innovative.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on three benchmarks, with fine-grained analysis, ablations, and data quality evaluations.
Writing Quality: ⭐⭐⭐⭐ Problem definition is clear and the method is described in a systematic manner.
Value: ⭐⭐⭐⭐ Advances the intersection of chart understanding and code generation.