ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools¶
Conference: ICCV 2025 arXiv: 2508.03284 Code: GitHub Area: Multimodal VLM Keywords: Visual Question Answering, Tool Use, Multi-step Reasoning, Dataset, Large Models, Tool Agent
TL;DR¶
This paper proposes ToolVQA, a large-scale multimodal tool-augmented VQA dataset containing 23K samples. It is constructed automatically by the ToolEngine pipeline, which combines image-guided DFS with LCS-based example matching to generate multi-step reasoning data in realistic scenarios. LLaVA-7B fine-tuned on ToolVQA outperforms GPT-3.5-Turbo on 4 of the 5 OOD benchmarks evaluated.
Background & Motivation¶
Integrating external tools into large foundation models (LFMs) is a key direction toward building general-purpose AI assistants. However, existing tool-augmented VQA datasets suffer from three major gaps:
Unrealistic scenarios: Existing datasets use synthetic images or oversimplified PDF files, which do not match the complexity of real-world scenes.
Overly simple queries: Queries require only single-step reasoning or provide explicit tool-use hints (e.g., "use the Cheap YouTube API tool"), lacking implicit multi-step reasoning.
High annotation cost: Manually annotated datasets (e.g., GAIA with only 500 samples) are difficult to scale.
Goal: To construct a large-scale, scalable dataset that features both real-world scenarios and real-world queries, bridging the gap between synthetic data and authentic tool use.
Advantages over existing datasets (Table 1): ToolVQA is the only large-scale dataset that simultaneously supports multimodal input, real-world scenarios/queries, evaluable answers, and high reasoning complexity (2.38).
Method¶
ToolEngine Data Construction Pipeline¶
The pipeline comprises three core components (Fig. 3):
1. Real-World Example Construction
Ten users from diverse disciplines (mathematics, computer science, economics, Chinese, and arts) each recorded 15 common tool-use scenarios. After merging tools with similar functionality, the 10 most frequently used tools were selected, and the 150 initial scenarios were refined into 34 representative examples.
2. Image-Guided DFS Search
A depth-first search is performed over the tool graph to construct multi-step tool-use trajectories:
The controller \(\mathcal{M}\) is ChatGPT-4o-latest; at each step it selects a tool and generates its arguments, and the tool is actually invoked so that real outputs (e.g., information extracted from the image) condition the next step.
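To make the control flow concrete, here is a minimal sketch of such an image-guided DFS. It is not the authors' implementation: `propose` stands in for the GPT-4o controller, `call_tool` for the real tool executors, and the depth limit is an assumption.

```python
# Hedged sketch of image-guided DFS over tool-use trajectories.
# `propose` and `call_tool` are hypothetical stand-ins: the former plays the
# role of the GPT-4o controller, the latter actually executes a tool.
from typing import Callable

MAX_DEPTH = 4  # assumed depth limit, not taken from the paper


def dfs(image, trajectory: list, propose: Callable, call_tool: Callable,
        results: list) -> None:
    """Extend a partial trajectory, stored as (tool, args, output) triples."""
    if len(trajectory) >= MAX_DEPTH:
        return
    # The controller proposes candidate next tool calls, conditioned on the
    # image and on the tool outputs accumulated so far.
    for tool_name, args in propose(image, trajectory):
        output = call_tool(tool_name, args, image)  # real tool execution
        new_traj = trajectory + [(tool_name, args, output)]
        results.append(new_traj)  # every extended prefix is a candidate sample
        dfs(image, new_traj, propose, call_tool, results)
```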
3. LCS Example Matching
Dynamic example matching is performed with the Longest Common Subsequence (LCS) algorithm. At step \(i\), the LCS similarity between the current trajectory \(\mathcal{P}_i\) and each element of the example set \(\mathcal{P}^e\) is computed, and the top-\(k\) most similar examples are retrieved as in-context demonstrations.
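As a rough illustration, the matching step can be sketched as below; representing trajectories by their tool-name sequences and ranking by raw LCS length (rather than whatever normalization the paper uses) are assumptions.

```python
# Hedged sketch of LCS-based dynamic example retrieval: at step i, compare the
# tool-name sequence of the current trajectory against every stored example
# and keep the top-k most similar ones as in-context demonstrations.
def lcs_length(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming LCS length over tool-name sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def top_k_examples(current_tools: list[str], examples: list[list[str]], k: int = 3):
    """Return the k examples most similar to the current partial trajectory
    (k = 3 is an assumption, not the paper's value)."""
    ranked = sorted(examples, key=lambda ex: lcs_length(current_tools, ex), reverse=True)
    return ranked[:k]
```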
Key advantage: Unlike fixed example matching, LCS allows dynamic switching of examples during DFS, integrating knowledge from different example types and significantly improving reasoning complexity and data quality.
Tool Set Design¶
10 tools spanning 4 categories:

- Perception: ImageCaption, OCR, ObjectDetection, RegionDescription
- Operation: DrawBox, GoogleSearch
- Logic: Calculator, Plot, ItemCount
- Creation: TextToImage
Training Objective¶
LLaVA-7B is fine-tuned on ToolVQA with a standard token-level cross-entropy loss.
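Written out in the usual autoregressive form (the precise conditioning used in the paper may differ slightly):

\[
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, x_{\text{img}},\, x_{\text{text}}\right),
\]

where \(y\) is the target tool-use trajectory and answer sequence, and \(x_{\text{img}}\), \(x_{\text{text}}\) denote the image and textual inputs.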
Training configuration: batch size 2, learning rate 2e-4, LoRA fine-tuning, 4000 epochs, 4× RTX 3090 GPUs.
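For reference, a minimal LoRA setup mirroring the reported batch size and learning rate might look like the sketch below; the checkpoint name, LoRA rank/alpha, and target modules are illustrative assumptions rather than values from the paper.

```python
# Minimal LoRA fine-tuning sketch with Hugging Face transformers + peft.
# Only batch_size=2 and lr=2e-4 come from the write-up; everything else
# (checkpoint, rank, alpha, target modules) is an illustrative assumption.
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration, TrainingArguments

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf"  # assumed LLaVA-7B checkpoint
)

lora_cfg = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

train_args = TrainingArguments(
    output_dir="llava7b-toolvqa-lora",
    per_device_train_batch_size=2,  # reported batch size
    learning_rate=2e-4,             # reported learning rate
    bf16=True,
)
```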
Key Experimental Results¶
Main Results: ToolVQA Test Set (Table 4)¶
| Model | Setting | End-to-End Acc.↑ | Inst.↑ | Tool.↑ | Arg.↑ | Summ.↑ |
|---|---|---|---|---|---|---|
| ChatGPT-4o-latest | VLM | 38.29 | - | - | - | - |
| ChatGPT-4o-latest | VLM+tool | 34.96 | 36.5 | 14.68 | 8.92 | 56.1 |
| GPT-3.5-Turbo | LLM+tool | 18.37 | 73.24 | 30.46 | 20.08 | 58.18 |
| LLaVA-7B (original) | VLM+tool | 1.17 | 16.39 | 9.43 | 0 | 0.01 |
| Tuned LLaVA-7B | VLM+tool | 18.80 | 86.62 | 61.61 | 39.34 | 30.91 |
Key findings:

- The fine-tuned 7B model achieves end-to-end accuracy comparable to the much larger closed-source GPT-3.5-Turbo.
- Instruction formatting and tool selection improve substantially (Inst. 86.62%, Tool. 61.61%), while argument prediction and answer summarization remain bottlenecks.
- GPT-4o with VLM+tool underperforms pure VLM (34.96 < 38.29), suggesting that tool-introduced noise can outweigh the benefits.
OOD Generalization (Table 5)¶
| Model | TextVQA | TallyQA | InfoSeek | GTA | TEMPLAMA |
|---|---|---|---|---|---|
| GPT-3.5-Turbo | 36.3 | 61 | 11.3 | 23.62 | 33.67 |
| LLaVA-7B | 41.2 | 60.1 | 5.2 | 12.12 | 3.06 |
| Tuned LLaVA-7B | 47 | 64.3 | 13.8 | 33.29 | 21.43 |
The fine-tuned model outperforms GPT-3.5-Turbo on 4 out of 5 OOD benchmarks, demonstrating strong generalization.
Ablation Study on ToolEngine (Table 3)¶
| Method | Acc.↑ | Cor.↑ | Nec.↑ | R.C.↑ |
|---|---|---|---|---|
| ToolEngine (full) | 90.8 | 85.2 | 87.51 | 2.38 |
| w/o Example + LCS | 27.3 | 77.6 | 21.04 | 1.1 |
| w/o LCS | 41.6 | 81.4 | 54.26 | 1.61 |
LCS matching is critical to data quality: removing it drops accuracy from 90.8% to 41.6% and reasoning complexity from 2.38 to 1.61.
Few-shot ICL Experiment (Table 6)¶
| Model | 0-shot | 1-shot | 5-shot | 10-shot |
|---|---|---|---|---|
| GPT-4o | 34.96 | 37.20 | 38.41 | 38.63 |
| Tuned LLaVA-7B | 18.80 | 19.41 | 21.13 | 20.69 |
The fine-tuned model still benefits from ICL (18.80→21.13), indicating that fine-tuning and ICL are complementary.
Dataset Statistics¶
- Total samples: 23,655
- Total tool calls: 65,785
- Average reasoning trajectory length: 2.78 steps
- Average question length: 15.74 tokens
- Average answer length: 2.69 tokens (concise and evaluable)
Highlights & Insights¶
- Elegance of LCS dynamic matching: By progressively matching different examples, the DFS process can integrate diverse types of knowledge to produce more complex reasoning chains. Fixed example matching cannot adapt to varying scenarios — this is the key driver of data quality improvement.
- Easy to generate, hard to answer: GPT-4o, which generates the questions, answers its own questions with less than 40% accuracy, confirming that ToolEngine's step-by-step construction does not leak the full solution: each generation step is a local, single-step decision, while answering requires end-to-end multi-step reasoning.
- Double-edged effect of tool use: Multiple models perform worse under VLM+tool than pure VLM, indicating that tool-introduced noise can exceed the benefit. Fine-tuning effectively suppresses such noise.
- Performance bottleneck lies in argument prediction and answer summarization: These subtasks require understanding newly returned tool outputs and extracting meaningful responses, areas where fine-tuning yields limited improvement.
Limitations & Future Work¶
- The tool set is relatively small (10 tools); although each tool has strong generalization capability, full coverage of all real-world scenarios is not achievable.
- Data generation relies on GPT-4o, which incurs high cost and may introduce bias.
- The 90.8% automatic generation accuracy implies approximately 10% noise in the training data.
- Fine-tuning experiments are conducted only at the 7B scale; performance of larger models remains unknown.
Related Work & Insights¶
- Compared to MM-Traj (Gao et al., 2025), ToolVQA uses real-scene images with verified answers.
- The LCS matching approach is generalizable to other data synthesis scenarios requiring dynamic example selection.
- The identified bottlenecks in tool use (argument prediction, answer summarization) point to future research on dynamic information processing in multi-turn dialogue.
- The design philosophy of combining highly generalizable tools (e.g., GoogleSearch) with task-specific tools is worth adopting in other agent designs.
Rating ⭐⭐⭐⭐¶
- Novelty ★★★★☆: The ToolEngine pipeline and LCS example matching design are original.
- Experimental Thoroughness ★★★★☆: Covers diverse models and OOD benchmarks with comprehensive analysis.
- Writing Quality ★★★★☆: Data quality evaluation is thorough; error analysis is insightful.
- Value ★★★★★: Code and dataset are publicly released, directly usable for tool-agent training and evaluation.