Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation

Conference: ACL 2025
arXiv: 2501.12432
Code: https://corn0205.github.io/
Area: Others
Keywords: Tool learning, Parallel tool invocation, DAG structure, Process/Thread inference, LLM Agent

TL;DR

The paper proposes DTA-Llama, which converts the sequential tool invocation paths produced by traditional tree search into Directed Acyclic Graph (DAG) structures so that independent tools can be invoked in parallel. A Process/Thread inference framework lets the LLM decompose the task and execute multiple tools in parallel in each turn. With this setup, Llama2-7B matches the performance of GPT-3.5 Parallel Function Calling on StableToolBench.

Background & Motivation

Background: Tool learning enables LLMs to invoke external APIs to accomplish real-world tasks. Current approaches are divided into pipeline-based (such as CoT/ReAct, invoking one tool per turn) and tree-search-based (such as DFSDT, improving fault tolerance through depth-first search backtracking).

Limitations of Prior Work:
- CoT/ReAct: Invokes only a single tool per turn, resulting in a narrow perception range and requiring more turns.
- DFSDT: The backtracking mechanism leads to longer tool invocation sequences, significantly increasing token consumption and inference time.
- Both types of approaches fail to invoke multiple tools in parallel within a single turn.

Key Challenge: How to reduce the number of tool invocation turns and computational overhead while maintaining the task completion rate.

Key Insight: The method borrows the Process/Thread mechanism from operating systems. In each turn, a "Process" decomposes the task into parallelizable subtasks, multiple "Threads" execute the corresponding tool calls in parallel, and the results are aggregated afterwards.

Core Idea: Convert the sequential paths found by tree search into DAG-structured parallel training data, and pair that data with a Process/Thread inference framework so that multiple tools can be invoked in parallel in each turn.

Method

Overall Architecture

Data Construction: DFSDT search tree $\rightarrow$ extract successful paths $\rightarrow$ GPT-4 determines which tools can be parallelized $\rightarrow$ convert into DAG structure $\rightarrow$ level-order traversal to generate parallel training data DTA-Tool (~20K samples)
$\rightarrow$ Model Training: Fine-tune the Llama series models on DTA-Tool
$\rightarrow$ Inference: Process (LLM analyzes task state + decomposes parallel tool invocation plans) $\rightarrow$ Thread (executes tool APIs in parallel) $\rightarrow$ Intermediate State Lock (aggregates results) $\rightarrow$ loop until completion.

Key Designs

Sequential-to-Parallel Data Conversion:
- Extract successful paths $\mathcal{P}$ from tree search trajectories.
- Use GPT-4 to judge which tool invocations in the path can be parallelized (no input/output dependency + no causal relationship).
- Construct a DAG $\mathcal{G}$, where level-order traversal allows tools in the same layer to be executed in parallel.
- Data Filtering: Filter out loop calls, incomplete calls, and un-aggregatable structures.
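
The level-order traversal step can be illustrated with a small sketch. This is not the authors' code: it assumes the dependency judgments (which the summary attributes to GPT-4) are already available as `(prerequisite, dependent)` edges, and the tool names are hypothetical. Tools that end up in the same layer have no dependency on each other and could be issued in one turn.

```python
from collections import defaultdict

def dag_levels(nodes, edges):
    """Group DAG nodes into layers; each node lands one layer after its last prerequisite."""
    indegree = {n: 0 for n in nodes}
    children = defaultdict(list)
    for src, dst in edges:              # src must finish before dst can start
        children[src].append(dst)
        indegree[dst] += 1

    frontier = [n for n in nodes if indegree[n] == 0]
    levels, seen = [], 0
    while frontier:
        levels.append(frontier)
        seen += len(frontier)
        next_frontier = []
        for n in frontier:
            for c in children[n]:
                indegree[c] -= 1
                if indegree[c] == 0:    # all prerequisites of c are already placed
                    next_frontier.append(c)
        frontier = next_frontier
    if seen != len(nodes):
        raise ValueError("dependency graph contains a cycle")
    return levels

# Hypothetical successful path: three independent lookups feed one final call.
tools = ["search_flights", "search_hotels", "get_weather", "build_itinerary"]
deps = [
    ("search_flights", "build_itinerary"),
    ("search_hotels", "build_itinerary"),
    ("get_weather", "build_itinerary"),
]
levels = dag_levels(tools, deps)
```

In this toy case the four sequential calls collapse into two layers, which is the kind of turn reduction the conversion aims for. The cycle check loosely mirrors the idea of filtering out loop calls, although the paper's actual filtering rules are not specified here.
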
Process/Thread Inference Framework:
- Process: LLM evaluates task state $\rightarrow$ analyzes current step needs $\rightarrow$ generates multiple parallelizable tool invocation plans (name + parameters).
- Thread: Executes all tool calls proposed by the Process in parallel.
- Intermediate State Lock: Waits for all Threads to complete and aggregates the results, which then serve as the input for the next Process turn.
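
A minimal sketch of the Process/Thread loop described above, under stated assumptions: `process_step` is a hard-coded stand-in for the fine-tuned LLM, the two tools in `TOOLS` are hypothetical, and the Intermediate State Lock is approximated by waiting on every future before the next turn. It is meant to show the control flow only, not the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tool registry; real tools would be external API calls.
TOOLS = {
    "get_weather": lambda city: f"sunny in {city}",
    "get_time": lambda city: f"12:00 in {city}",
}

def process_step(query, history):
    """Stand-in for the LLM 'Process': returns (tool, kwargs) plans, or [] when done."""
    if not history:
        return [("get_weather", {"city": query}), ("get_time", {"city": query})]
    return []

def run_threads(plans):
    """'Thread' stage: run all planned calls concurrently and wait for every result."""
    with ThreadPoolExecutor(max_workers=len(plans)) as pool:
        futures = [pool.submit(TOOLS[name], **kwargs) for name, kwargs in plans]
        # Collecting every result acts as the barrier before the next turn.
        return [f.result() for f in futures]

def solve(query, max_turns=5):
    history = []
    for _ in range(max_turns):
        plans = process_step(query, history)
        if not plans:                   # the Process decided the task is complete
            break
        observations = run_threads(plans)
        history.append(list(zip(plans, observations)))
    return history

history = solve("Paris")
```

Each entry in `history` holds one turn's plans paired with their observations, which is what the next Process turn would condition on.
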
Loss & Training:
- Simplified from the Thought-Action-Observation framework to Thought-Observation (where Action is integrated into Thought as the tool invocation plan).
- Loss Function: $$\mathcal{L}(\theta) = -\log \sum_{i=1}^n p_\theta(y^i | q, y^{[1:i-1]}, o^{[1:i-1]})$$
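
To make the conditioning structure concrete, here is a toy sketch of a masked negative log-likelihood. It relies on two assumptions that are not confirmed by the summary: it uses the conventional sum of per-token log-probabilities, which may differ from the exact form written above, and it excludes observation tokens from the loss because the model conditions on $o$ but only generates $y$. The probabilities and mask are made-up values.

```python
import math

def masked_nll(token_probs, is_generated):
    """Sum -log p over tokens the model must generate; observation tokens are skipped."""
    return -sum(math.log(p) for p, keep in zip(token_probs, is_generated) if keep)

# Made-up per-token probabilities for one trajectory.
probs = [0.5, 0.25, 0.9, 0.5]
# True = Thought/plan token (trained on), False = Observation token (context only).
mask = [True, True, False, True]
loss = masked_nll(probs, mask)
```

Here the third token belongs to a tool observation, so it contributes nothing to the loss even though it remains in the context for later tokens.
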

Key Experimental Results

Main Results (StableToolBench Average SoPR/SoWR)

| Method | SoPR ↑ | SoWR ↑ |
| --- | --- | --- |
| GPT-3.5 (ReAct) | 47.9 | - |
| GPT-3.5 (DFSDT) | 66.7 | 65.5 |
| GPT-3.5 (Parallel) | 61.9 | 53.0 |
| ToolLLaMA (DFSDT) | 54.2 | 47.1 |
| DTA-Llama2-7B | 60.7 | 53.5 |

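DTA-Llama2-7B, an open-source 7B model, is comparable to GPT-3.5 Parallel Function Calling (SoPR 60.7 vs. 61.9, SoWR 53.5 vs. 53.0) and outperforms ToolLLaMA with DFSDT on both metrics, although it remains below GPT-3.5 with DFSDT.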

Efficiency Comparison (Token Consumption)

| Method | Avg. Token Consumption | Inference Time |
| --- | --- | --- |
| DFSDT | Highest (massive backtracking) | Slowest |
| ReAct | Medium | Medium |
| DTA (Parallel) | Lowest | Fastest |

Key Findings

Parallel invocation significantly reduces invocation turns: DTA requires only 2.46 tool invocation turns per data sample on average.
99.1% of the training data contains parallel tool calls.
The generalization of the method was validated across multiple models, including Llama2-7B and Llama3-8B.
DAG data quality is higher than that of the original sequential data, because the structural optimization eliminates redundant pathways.

Highlights & Insights

Sequential-to-parallel data conversion is the core contribution: Using GPT-4 to automatically identify dependencies between tool calls and reorganizing tree search paths into DAGs is a data engineering idea that can be transferred to data construction for other multi-step tasks.
The Process/Thread design, analogous to operating systems, is intuitive and effective: the aggregation mechanism of the Intermediate State Lock is crucial for ensuring parallel reliability.
An open-source 7B model comes close to the parallel function calling capability of GPT-3.5, showing what restructuring the training data into DAG form can achieve.

Limitations & Future Work

Reliance on GPT-4 for sequential-to-parallel conversion introduces costs and potential biases.
Parallel invocation assumes that APIs support concurrency and will not be overloaded; practical deployment may face rate-limiting issues.
Evaluated only within the ToolBench ecosystem; its effectiveness on other tool invocation benchmarks (e.g., API-Bank) remains unknown.
Dynamic adjustment of parallelism was not explored. In some cases, parallel invocation might perform worse than sequential execution (e.g., when information dependency chains are long).
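
One common way to soften the rate-limiting concern noted above is to cap how many tool calls run at once. The sketch below is an illustration only and is not part of the paper: `run_bounded` and its `max_concurrency` parameter are hypothetical names, and the peak counter exists purely to show that the cap holds.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_bounded(calls, max_concurrency=2):
    """Run zero-argument callables with at most max_concurrency in flight at once."""
    active = 0
    peak = 0
    lock = threading.Lock()

    def tracked(fn):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)    # record the highest observed concurrency
        try:
            return fn()
        finally:
            with lock:
                active -= 1

    # The pool size is the concurrency cap; map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        results = list(pool.map(tracked, calls))
    return results, peak

# Five hypothetical tool calls, at most two running at the same time.
results, peak = run_bounded([lambda i=i: i * i for i in range(5)], max_concurrency=2)
```

A cap like this trades some latency for predictable load on the API, which is one simple form the unexplored "dynamic adjustment of parallelism" could take.
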

vs DFSDT (ToolLLM): DFSDT improves fault tolerance through backtracking but at a huge cost; DTA shortens the path through parallelization while maintaining effectiveness.
vs GPT-3.5 Parallel FC: GPT-3.5's parallel function calling is a black-box implementation; DTA provides an open-source, reproducible solution that achieves comparable performance even with a 7B model.
vs LLMCompiler: LLMCompiler parallelizes tool calls using a compiler-like approach; DTA's DAG data conversion and Process/Thread framework provide a more systematic solution.

Rating

Novelty: ⭐⭐⭐⭐ The design of DAG data conversion and the Process/Thread inference framework is innovative.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation on StableToolBench, efficiency analysis, and multi-model generalization.
Writing Quality: ⭐⭐⭐⭐ Clear diagrams, intuitive analogies, and systematic description of the method.
Value: ⭐⭐⭐⭐ Provides a practical parallelization scheme and a high-quality dataset for tool learning.