InfoDet: A Dataset for Infographic Element Detection¶
Conference: ICLR 2026 · arXiv: 2505.17473 · Code: https://github.com/InfoDet2025/InfoDet · Area: Object Detection / Document Understanding · Keywords: Infographic Detection, Chart Understanding, Dataset, Grounded CoT, VLM
TL;DR¶
This paper introduces a large-scale infographic element detection dataset (101,264 infographics, 14.2 million annotations) spanning two major categories—chart elements and human-recognizable objects (HROs)—and proposes a Grounded CoT method that leverages detection results to enhance VLM chart understanding.
Background & Motivation¶
Background: Chart understanding is an important application scenario for VLMs (e.g., ChartQA), yet existing approaches have VLMs reason directly from raw images, overlooking structured visual element information.
Limitations of Prior Work: (a) No large-scale infographic detection dataset exists; even state-of-the-art open-vocabulary detectors (DINO-X, Grounding DINO) achieve AP < 15% on infographic element detection, i.e., they essentially fail outright. (b) Infographics contain numerous non-natural-scene elements (e.g., icons, chart components) that exhibit a large domain gap from detectors trained on COCO/Objects365.
Key Challenge: Element detection in infographics is foundational to chart understanding, yet current detectors are practically unusable in this domain.
Goal: (a) Construct a large-scale infographic detection dataset, and (b) validate how detection results can improve VLM chart reasoning.
Key Insight: Combining synthetic data (90,000 images, template-based generation) with real data (11,264 images, model-in-the-loop annotation) covers 75 chart types while balancing scale and annotation quality.
Core Idea: Treat element detection as "visual grounding" for chart understanding—detect first, then reason (Thinking-with-Boxes).
Method¶
Overall Architecture¶
The work consists of two components: (1) construction of the InfoDet dataset, and (2) the Grounded CoT method, which injects detected elements into VLMs as visual and textual prompts to augment chart reasoning.
Key Designs¶
- Dataset Construction:
- Synthetic data (90,000 images): Tables are sampled from VizNet's 31 million tables and rendered into infographics via 1,072 design templates; chart and HRO annotations are extracted programmatically from the SVGs, making the pipeline fully automated.
- Real data (11,264 images): Collected from 10 platforms, deduplicated using CLIP similarity scores and quality-verified with GPT-4o. Annotation employs iterative model-in-the-loop refinement—a detector is first trained on synthetic data, applied to real images, expert-corrected annotations are fed back to improve the detector, and this cycle is repeated over multiple rounds.
- Final quality: precision 93.9%, recall 96.7%, comparable to COCO/Objects365.
- Grounded Chain-of-Thought (Thinking-with-Boxes):
- Function: Detected elements are provided as auxiliary inputs to the VLM to guide reasoning.
- Mechanism: (a) Visual prompting—detected bounding boxes are overlaid on the image and labeled with letters, using a two-layer separation strategy (chart layer + text layer) to avoid overlapping confusion; (b) Textual description—each element's attributes are enumerated in text. The VLM is then prompted to reason step by step (CoT) by referencing the labeled elements.
- Design Motivation: VLMs tend to miss or confuse elements when reasoning over complex charts (multi-chart infographics); explicit detection results provide structured visual cues.
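The iterative annotation cycle described above (bootstrap on synthetic data, pseudo-label real images, expert correction, retrain) can be sketched as a simple loop. The functions `train`, `predict`, and `expert_correct` below are toy numeric stand-ins for the paper's detector training, inference, and human correction steps, not its actual code:

```python
# Hypothetical sketch of the model-in-the-loop annotation cycle.
# Annotations are modeled as single numbers so the convergence
# behavior is easy to see; real annotations are bounding boxes.

def train(annotated):
    """Toy 'detector': remembers the mean label it was trained on."""
    return sum(annotated) / max(len(annotated), 1)

def predict(model, images):
    """Toy inference: the model emits its remembered label everywhere."""
    return [model for _ in images]

def expert_correct(preds, truth):
    """Experts nudge each prediction halfway toward the ground truth."""
    return [(p + t) / 2 for p, t in zip(preds, truth)]

def model_in_the_loop(synthetic, real_truth, rounds=3):
    model = train(synthetic)               # 1. bootstrap on synthetic data
    labels = predict(model, real_truth)    # 2. pseudo-label real images
    for _ in range(rounds):
        labels = expert_correct(labels, real_truth)  # 3. expert correction
        model = train(labels)                        # 4. retrain detector
        labels = predict(model, real_truth)          # 5. re-label, repeat
    return model
```

Each round moves the detector closer to the expert-verified labels, which mirrors why the paper repeats the cycle over multiple rounds rather than annotating once.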
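The two prompting channels (letter-labeled boxes split into a chart layer and a text layer, plus a textual enumeration of element attributes) can be sketched as a prompt-assembly step. The element dictionaries, the layer split rule, and the prompt wording below are illustrative assumptions, not the paper's exact implementation:

```python
from string import ascii_uppercase

def build_grounded_cot_prompt(elements, question):
    """Assign letter labels to detected elements, separate them into a
    chart layer and a text layer, and enumerate their attributes as text."""
    chart_layer = [e for e in elements if e["type"] != "text"]
    text_layer = [e for e in elements if e["type"] == "text"]

    lines, labels = [], {}
    for letter, elem in zip(ascii_uppercase, chart_layer + text_layer):
        labels[letter] = elem
        layer = "chart" if elem["type"] != "text" else "text"
        lines.append(f"[{letter}] ({layer} layer) {elem['type']} at bbox {elem['bbox']}")

    prompt = (
        "Detected elements:\n" + "\n".join(lines) +
        f"\n\nQuestion: {question}\n"
        "Reason step by step, referencing elements by their letter labels."
    )
    return prompt, labels
```

In the paper's setup the same letter labels are also drawn onto the image as visual prompts, so the VLM can cross-reference the textual enumeration with the overlaid boxes.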
Loss & Training¶
Detectors (Co-DETR, RTMDet) are trained on InfoDet following standard pipelines. VLMs require no additional fine-tuning; Grounded CoT is a training-free inference enhancement.
Key Experimental Results¶
Detection Results¶
| Model | Pretrain | Chart AP | HRO AP | Chart AR | HRO AR |
|---|---|---|---|---|---|
| Co-DETR | Zero-shot | 0.4% | 1.1% | 5.6% | 4.8% |
| Co-DETR | InfoDet | 81.8% | 64.5% | 88.2% | 76.8% |
Grounded CoT Results (ChartQAPro Benchmark, Relaxed Accuracy)¶
| Model | Method | Infographic Single | Infographic Multi | Overall |
|---|---|---|---|---|
| o1 | Direct | 66.4% | 66.0% | 61.4% |
| o1 | CoT | 64.3% | 67.6% | 61.9% |
| o1 | Grounded CoT | 67.8% | 71.9% | 64.1% |
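For reference, relaxed accuracy in ChartQA-style benchmarks is conventionally defined as exact match for textual answers and a 5% relative tolerance for numeric ones. A minimal sketch, assuming that convention carries over to the table above:

```python
def relaxed_match(pred, gold, tol=0.05):
    """Numeric answers count as correct within a relative tolerance
    (5% by convention); non-numeric answers require exact match."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return pred.strip().lower() == gold.strip().lower()
    if g == 0:
        return p == g
    return abs(p - g) / abs(g) <= tol

def relaxed_accuracy(preds, golds):
    """Fraction of predictions that relaxed-match their gold answers."""
    return sum(relaxed_match(p, g) for p, g in zip(preds, golds)) / len(golds)
```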
Ablation Study¶
| Grounded CoT Component | Accuracy |
|---|---|
| Visual prompts only | 62.8% |
| Text description only | 61.6% |
| Combined (single-layer) | 62.3% |
| Combined (two-layer) | 64.1% |
Key Findings¶
- Zero-shot detectors nearly fail on infographics (AP ≤ 1.1%), confirming the large domain gap that InfoDet is built to fill.
- After InfoDet pretraining, AP rises to 81.8%, and the learned representations transfer to other document understanding tasks (Rico +8.5 AP, DocGenome +5.4 AP).
- Grounded CoT improves accuracy by 3–6 percentage points in infographic scenarios, with limited gains on simpler charts.
- The two-layer visual prompt separation outperforms the single-layer variant by 1.8 percentage points, avoiding the confusion caused by overlapping boxes and letter labels.
Highlights & Insights¶
- Filling a Critical Data Gap: A large-scale infographic detection dataset with 14.2 million annotations constitutes a significant resource contribution to the field.
- Thinking-with-Boxes Paradigm: The detect-then-reason approach is simple yet effective, akin to equipping VLMs with a "magnifying glass," and could plausibly transfer to other visual reasoning tasks.
- Synthetic + Real Data Construction: Template-based synthesis (automatic annotation) combined with model-in-the-loop labeling (efficient real-data annotation) balances scale and quality.
Limitations & Future Work¶
- A domain gap between synthetic and real data persists (synthetic data is simpler), necessitating more real-world samples.
- HRO detection AP (64.5%) is substantially lower than Chart AP (81.8%), indicating that icon detection remains more challenging.
- Grounded CoT yields marginal improvements on simple charts and may introduce information overload.
- The two-layer separation strategy is hand-engineered; more adaptive layout strategies warrant further exploration.
Related Work & Insights¶
- vs. ChartQA/ChartQAPro: These works provide QA benchmarks, on which this paper validates Grounded CoT.
- vs. Grounding DINO: Zero-shot failure on infographics demonstrates the necessity of domain-specific data.
- vs. DocGenome: A document layout detection dataset; InfoDet pretraining transfers favorably and improves its performance.
Rating¶
- Novelty: ⭐⭐⭐⭐ The dataset and Grounded CoT task formulation are novel, though the method itself is relatively straightforward.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Detection, chart understanding, and transfer learning are all comprehensively evaluated.
- Writing Quality: ⭐⭐⭐⭐⭐ Dataset construction is described with exceptional detail.
- Value: ⭐⭐⭐⭐⭐ The large-scale dataset and open-source release offer extremely high community value.