InfoDet: A Dataset for Infographic Element Detection¶
Conference: ICLR 2026 · arXiv: 2505.17473 · Code: https://github.com/InfoDet2025/InfoDet · Area: Object Detection / Document Understanding · Keywords: Infographic Detection, Chart Understanding, Dataset, Grounded CoT, VLM
TL;DR¶
This paper introduces a large-scale infographic element detection dataset (101,264 infographics, 14.2 million annotations) spanning two major categories—chart elements and human-recognizable objects (HROs)—and proposes a Grounded CoT method that leverages detection results to enhance VLM chart understanding.
Background & Motivation¶
Background: Chart understanding is an important application scenario for VLMs (e.g., ChartQA), yet existing approaches have VLMs reason directly from raw images, overlooking structured visual element information.
Limitations of Prior Work: (a) No large-scale infographic detection dataset exists; even state-of-the-art open-vocabulary detectors (DINO-X, Grounding DINO) achieve AP < 15% on infographic element detection, i.e., they essentially fail outright. (b) Infographics contain numerous non-natural-scene elements (e.g., icons, chart components) that exhibit a large domain gap from detectors trained on COCO/Objects365.
Key Challenge: Element detection in infographics is foundational to chart understanding, yet current detectors are practically unusable in this domain.
Goal: (a) Construct a large-scale infographic detection dataset, and (b) validate how detection results can improve VLM chart reasoning.
Key Insight: Combining synthetic data (90,000 images, template-based generation) with real data (11,264 images, model-in-the-loop annotation) covers 75 chart types while balancing scale and annotation quality.
Core Idea: Treat element detection as "visual grounding" for chart understanding—detect first, then reason (Thinking-with-Boxes).
Method¶
Overall Architecture¶
The work consists of two components: (1) construction of the InfoDet dataset, and (2) the Grounded CoT method, which injects detected elements into VLMs as visual and textual prompts to augment chart reasoning.
Key Designs¶
- Dataset Construction:
- Synthetic data (90,000 images): Tables are sampled from VizNet's 31 million tables and rendered into infographics via 1,072 design templates; chart and HRO annotations are extracted programmatically from the SVGs, making the pipeline fully automated.
- Real data (11,264 images): Collected from 10 platforms, deduplicated using CLIP similarity scores and quality-verified with GPT-4o. Annotation employs iterative model-in-the-loop refinement—a detector is first trained on synthetic data, applied to real images, expert-corrected annotations are fed back to improve the detector, and this cycle is repeated over multiple rounds.
- Final quality: precision 93.9%, recall 96.7%, comparable to COCO/Objects365.
- Grounded Chain-of-Thought (Thinking-with-Boxes):
- Function: Detected elements are provided as auxiliary inputs to the VLM to guide reasoning.
- Mechanism: (a) Visual prompting—detected bounding boxes are overlaid on the image and labeled with letters, using a two-layer separation strategy (chart layer + text layer) to avoid overlapping confusion; (b) Textual description—each element's attributes are enumerated in text. The VLM is then prompted to reason step by step (CoT) by referencing the labeled elements.
- Design Motivation: VLMs tend to miss or confuse elements when reasoning over complex charts (multi-chart infographics); explicit detection results provide structured visual cues.
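The iterative annotation cycle described above (bootstrap on synthetic data, pseudo-label real images, expert correction, retrain) can be sketched as a simple loop. The functions `train`, `predict`, and `expert_correct` below are toy numeric stand-ins for the paper's detector training, inference, and human correction steps, not its actual code:

```python
# Hypothetical sketch of the model-in-the-loop annotation cycle.
# Annotations are modeled as single numbers so the convergence
# behavior is easy to see; real annotations are bounding boxes.

def train(annotated):
    """Toy 'detector': remembers the mean label it was trained on."""
    return sum(annotated) / max(len(annotated), 1)

def predict(model, images):
    """Toy inference: the model emits its remembered label everywhere."""
    return [model for _ in images]

def expert_correct(preds, truth):
    """Experts nudge each prediction halfway toward the ground truth."""
    return [(p + t) / 2 for p, t in zip(preds, truth)]

def model_in_the_loop(synthetic, real_truth, rounds=3):
    model = train(synthetic)               # 1. bootstrap on synthetic data
    labels = predict(model, real_truth)    # 2. pseudo-label real images
    for _ in range(rounds):
        labels = expert_correct(labels, real_truth)  # 3. expert correction
        model = train(labels)                        # 4. retrain detector
        labels = predict(model, real_truth)          # 5. re-label, repeat
    return model
```

Each round moves the detector closer to the expert-verified labels, which mirrors why the paper repeats the cycle over multiple rounds rather than annotating once.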
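The two prompting channels (letter-labeled boxes split into a chart layer and a text layer, plus a textual enumeration of element attributes) can be sketched as a prompt-assembly step. The element dictionaries, the layer split rule, and the prompt wording below are illustrative assumptions, not the paper's exact implementation:

```python
from string import ascii_uppercase

def build_grounded_cot_prompt(elements, question):
    """Assign letter labels to detected elements, separate them into a
    chart layer and a text layer, and enumerate their attributes as text."""
    chart_layer = [e for e in elements if e["type"] != "text"]
    text_layer = [e for e in elements if e["type"] == "text"]

    lines, labels = [], {}
    for letter, elem in zip(ascii_uppercase, chart_layer + text_layer):
        labels[letter] = elem
        layer = "chart" if elem["type"] != "text" else "text"
        lines.append(f"[{letter}] ({layer} layer) {elem['type']} at bbox {elem['bbox']}")

    prompt = (
        "Detected elements:\n" + "\n".join(lines) +
        f"\n\nQuestion: {question}\n"
        "Reason step by step, referencing elements by their letter labels."
    )
    return prompt, labels
```

In the paper's setup the same letter labels are also drawn onto the image as visual prompts, so the VLM can cross-reference the textual enumeration with the overlaid boxes.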
Loss & Training¶
Detectors (Co-DETR, RTMDet) are trained on InfoDet following standard pipelines. VLMs require no additional fine-tuning; Grounded CoT is a training-free inference enhancement.
Key Experimental Results¶
Detection Results¶
| Model | Pretrain | Chart AP | HRO AP | Chart AR | HRO AR |
|---|---|---|---|---|---|
| Co-DETR | Zero-shot | 0.4% | 1.1% | 5.6% | 4.8% |
| Co-DETR | InfoDet | 81.8% | 64.5% | 88.2% | 76.8% |
Grounded CoT Results (ChartQAPro Benchmark, Relaxed Accuracy)¶
| Model | Method | Infographic Single | Infographic Multi | Overall |
|---|---|---|---|---|
| o1 | Direct | 66.4% | 66.0% | 61.4% |
| o1 | CoT | 64.3% | 67.6% | 61.9% |
| o1 | Grounded CoT | 67.8% | 71.9% | 64.1% |
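For reference, relaxed accuracy in ChartQA-style benchmarks is conventionally defined as exact match for textual answers and a 5% relative tolerance for numeric ones. A minimal sketch, assuming that convention carries over to the table above:

```python
def relaxed_match(pred, gold, tol=0.05):
    """Numeric answers count as correct within a relative tolerance
    (5% by convention); non-numeric answers require exact match."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return pred.strip().lower() == gold.strip().lower()
    if g == 0:
        return p == g
    return abs(p - g) / abs(g) <= tol

def relaxed_accuracy(preds, golds):
    """Fraction of predictions that relaxed-match their gold answers."""
    return sum(relaxed_match(p, g) for p, g in zip(preds, golds)) / len(golds)
```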
Ablation Study¶
| Grounded CoT Component | Accuracy |
|---|---|
| Visual prompts only | 62.8% |
| Text description only | 61.6% |
| Combined (single-layer) | 62.3% |
| Combined (two-layer) | 64.1% |
Key Findings¶
- Zero-shot detectors nearly fail on infographics (AP ≤ 1.1%), confirming the large domain gap that InfoDet is built to fill.
- After InfoDet pretraining, AP rises to 81.8%, and the learned representations transfer to other document understanding tasks (Rico +8.5 AP, DocGenome +5.4 AP).
- Grounded CoT improves accuracy by 3–6 percentage points in infographic scenarios, with limited gains on simpler charts.
- The two-layer visual prompt separation outperforms the single-layer variant by 1.8 percentage points, avoiding the confusion caused by overlapping boxes and letter labels.
Highlights & Insights¶
- Filling a Critical Data Gap: A large-scale infographic detection dataset with 14.2 million annotations constitutes a significant resource contribution to the field.
- Thinking-with-Boxes Paradigm: The detect-then-reason approach is simple yet effective, akin to equipping VLMs with a "magnifying glass," and could plausibly transfer to other visual reasoning tasks.
- Synthetic + Real Data Construction: Template-based synthesis (automatic annotation) combined with model-in-the-loop labeling (efficient real-data annotation) balances scale and quality.
Limitations & Future Work¶
- A domain gap between synthetic and real data persists (synthetic data is simpler), necessitating more real-world samples.
- HRO detection AP (64.5%) is substantially lower than Chart AP (81.8%), indicating that icon detection remains more challenging.
- Grounded CoT yields marginal improvements on simple charts and may introduce information overload.
- The two-layer separation strategy is hand-engineered; more adaptive layout strategies warrant further exploration.
Related Work & Insights¶
- vs. ChartQA/ChartQAPro: These works provide QA benchmarks, on which this paper validates Grounded CoT.
- vs. Grounding DINO: Zero-shot failure on infographics demonstrates the necessity of domain-specific data.
- vs. DocGenome: A document layout detection dataset; InfoDet pretraining transfers favorably and improves its performance.
Rating¶
- Novelty: ⭐⭐⭐⭐ The dataset and Grounded CoT task formulation are novel, though the method itself is relatively straightforward.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Detection, chart understanding, and transfer learning are all comprehensively evaluated.
- Writing Quality: ⭐⭐⭐⭐⭐ Dataset construction is described with exceptional detail.
- Value: ⭐⭐⭐⭐⭐ The large-scale dataset and open-source release offer extremely high community value.