AutoTool: Efficient Tool Selection for Large Language Model Agents¶
Conference: AAAI 2026 · arXiv: 2511.14650 · Code: GitHub · Area: Agent · Keywords: tool selection, LLM agent efficiency, tool usage inertia, graph-based planning, inference cost reduction
TL;DR¶
This paper proposes AutoTool, a graph-based tool selection framework that exploits tool usage inertia to construct a Tool Inertia Graph (TIG). By leveraging statistical structure, AutoTool bypasses redundant LLM inference for tool selection and parameter filling, reducing inference overhead by up to 30% while maintaining task completion rates.
Background & Motivation¶
- Background: LLM agents have become powerful tools for automating complex tasks; frameworks such as ReAct drive multi-step decision-making through think–act–observe cycles.
- Limitations of Prior Work: Existing frameworks rely on LLM inference at every step for tool selection, incurring high computational overhead and latency, particularly in multi-step tasks with numerous LLM calls.
- Key Challenge: Not every decision step requires the full reasoning capacity of an LLM; many tool invocations occur in highly patterned contexts, so current approaches over-rely on LLM inference.
- Key Observation: The authors identify the phenomenon of tool usage inertia: tool invocations follow predictable sequential patterns. For example, in ScienceWorld, `go_to` is followed by `look_around` in 88.7% of cases.
- Theoretical Validation: A \(k\)-th order Markov chain analysis shows that conditional entropy decreases from 3.50 bits (0th order) to 2.52 bits (1st order) to 1.93 bits (2nd order), with likelihood ratio tests yielding \(p < .001\), confirming sequential dependency (see the formula after this list).
- Core Idea: A graph structure captures statistical regularities in tool invocations; tools are selected directly under high-confidence conditions, with fallback to LLM inference when uncertainty is high.
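For reference, the conditional entropy in this analysis is the standard \(k\)-th order quantity (notation mine):

\[
H\bigl(T_t \mid T_{t-1}, \dots, T_{t-k}\bigr) = -\sum_{h}\sum_{t} P(h, t)\,\log_2 P(t \mid h),
\]

where \(h\) ranges over length-\(k\) tool histories. Dropping from 3.50 to 1.93 bits shrinks the effective number of plausible next tools from roughly \(2^{3.50} \approx 11\) to \(2^{1.93} \approx 4\).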
Method¶
Overall Architecture¶
Before each standard LLM invocation, AutoTool first attempts an inertia call comprising two stages: ① Inertia Sensing predicts the next tool; ② Parameter Filling automates parameter assignment. The LLM is bypassed only when both stages succeed; otherwise, the system falls back to standard inference.
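A minimal sketch of this control flow, with illustrative function and attribute names (not the authors' released code):

```python
def agent_step(state, tig, llm, theta_inertial):
    """One decision step: try an inertia call first, fall back to the LLM."""
    # Stage 1: Inertia Sensing -- score the most likely next tool from the TIG.
    tool, score = tig.best_next_tool(state.last_tool, state.rationale)
    if tool is not None and score > theta_inertial:
        # Stage 2: Parameter Filling -- resolve all arguments without the LLM.
        params = fill_parameters(tool, state, tig)
        if params is not None:
            return tool, params  # LLM inference bypassed for this step
    # Fallback: standard LLM-based tool selection and parameter generation.
    return llm.select_tool_and_params(state)
```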
Module 1: Tool Inertia Graph (TIG) Construction¶
- Hierarchical Node Structure: Tool Nodes encode functional descriptions and execution states; each Tool Node embeds a sub-graph of Parameter Nodes.
- Two Types of Directed Edges:
- Tool Sequence Edges: connect Tool Nodes and encode sequential dependencies.
- Parameter Dependency Edges: connect Parameter Nodes and model inter-tool data flow.
- Online Incremental Construction: the graph is learned dynamically from historical execution trajectories; edge weights are updated via positive/negative reinforcement based on success/failure feedback.
- Edge weights are reinforced only from high-confidence LLM-generated sequences, preventing error propagation from inertia calls.
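A minimal sketch of the edge-weight update, assuming a simple additive reward/penalty scheme; the paper's exact update rule and constants may differ:

```python
from collections import defaultdict

class ToolInertiaGraph:
    """Tool Sequence Edges only; Parameter Nodes and their edges are omitted for brevity."""

    def __init__(self, reward=1.0, penalty=0.5):
        # (prev_tool, next_tool) -> accumulated edge weight
        self.edge_weight = defaultdict(float)
        self.reward, self.penalty = reward, penalty

    def update(self, prev_tool, next_tool, succeeded, from_llm):
        # Only high-confidence LLM-generated transitions reinforce the graph,
        # so mistakes made during inertia calls cannot propagate into edge weights.
        if not from_llm:
            return
        delta = self.reward if succeeded else -self.penalty
        key = (prev_tool, next_tool)
        self.edge_weight[key] = max(0.0, self.edge_weight[key] + delta)
```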
Module 2: Inertia Sensing (CIPS Score)¶
The Comprehensive Inertia Potential Score (CIPS) is computed from two signals:
- Frequency Score: historical usage frequency derived from TIG edge weights.
- Contextual Score: semantic similarity between the current agent rationale and candidate tool descriptions, computed via SimCSE.
- The parameter filling stage is entered only when \(\text{CIPS}(v^*) > \theta_{\text{inertial}}\).
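Assuming a simple convex combination of the two signals (the paper's exact weighting may differ), the score for a candidate tool \(v\) takes the form

\[
\text{CIPS}(v) = \lambda \, S_{\text{freq}}(v) + (1-\lambda)\, S_{\text{ctx}}(v), \qquad v^* = \arg\max_{v}\, \text{CIPS}(v),
\]

where \(S_{\text{freq}}\) is derived from normalized TIG edge weights, \(S_{\text{ctx}}\) is the SimCSE similarity between the agent's rationale and the tool description, and \(\lambda\) is a mixing weight introduced here for illustration.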
Module 3: Hierarchical Parameter Filling¶
A strictly prioritized, LLM-free parameter filling strategy:
1. Dependency Backtracking: traverses Parameter Dependency Edges in the TIG to retrieve parameter values from the outputs of preceding tools.
2. Environment State Matching: extracts values from key states maintained by the agent (e.g., current location).
3. Heuristic Filling: infers values from the current state or task goal.

If any parameter cannot be resolved, the inertia call is abandoned and the LLM is invoked.
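A minimal sketch of this strict priority order, expanding the `fill_parameters` helper referenced in the earlier control-flow sketch (helper names such as `heuristic_fill` are illustrative, not the authors' implementation):

```python
def fill_parameters(tool, state, tig):
    """Resolve every parameter without the LLM; return None if any value is missing."""
    values = {}
    for param in tool.parameters:
        # 1. Dependency Backtracking: follow Parameter Dependency Edges in the TIG
        #    to reuse outputs produced by preceding tools.
        value = tig.backtrack_dependency(param)
        # 2. Environment State Matching: read key states the agent maintains
        #    (e.g., current location).
        if value is None:
            value = state.lookup(param)
        # 3. Heuristic Filling: infer the value from the current state or task goal
        #    (illustrative helper).
        if value is None:
            value = heuristic_fill(param, state)
        if value is None:
            return None  # abandon the inertia call; the LLM will be invoked instead
        values[param.name] = value
    return values
```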
Safety Constraints¶
- Inertia calls are capped at 30% of total operations.
- Consecutive inertia calls are prohibited.
- A fault-tolerance mechanism triggers a recovery path upon detection of consecutive tool failures.
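The first two constraints can be expressed as a simple admission check (a sketch; the recovery-path logic for consecutive failures is omitted):

```python
def inertia_call_allowed(inertia_calls, total_calls, last_was_inertia, cap=0.30):
    """Permit an inertia call only under the 30% budget cap and never twice in a row."""
    under_cap = inertia_calls < cap * max(total_calls, 1)
    return under_cap and not last_was_inertia
```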
Key Experimental Results¶
Main Results: Efficiency Gains (ReAct + AutoTool)¶
| Dataset | Progress Rate (AutoTool vs. ReAct baseline) | Token-In Speedup | Token-Out Speedup | LLM Call Speedup |
|---|---|---|---|---|
| AlfWorld | 0.531 (↑ vs 0.394) | 1.60× | 2.87× | 1.18× |
| ScienceWorld | 0.708 (≈ 0.716) | 1.30× | 1.41× | 1.31× |
| ToolQuery-Academic | 0.895 (≈ 0.901) | 1.15× | 0.92× | 1.20× |
Main Results: Efficiency Gains (Reflexion + AutoTool)¶
| Dataset | Progress Rate (AutoTool vs. Reflexion baseline) | Token-In Speedup | Token-Out Speedup | LLM Call Speedup |
|---|---|---|---|---|
| AlfWorld | 0.453 (≈ 0.481) | 1.33× | 1.20× | 1.29× |
| ScienceWorld | 0.712 (≈ 0.730) | 0.93× | 1.20× | 1.28× |
| ToolQuery-Academic | 0.923 (↑ vs 0.917) | 1.33× | 1.19× | 1.26× |
Overhead Analysis¶
| Module | Time Overhead |
|---|---|
| Semantic computation (SimCSE embedding + similarity) | 2.7% ± 1.5% of total task time |
| Non-semantic modules (graph construction/search/parsing) | On the order of seconds; negligible |
Highlights & Insights¶
- Discovery and Quantification of Tool Usage Inertia: This work is the first to systematically validate sequential dependencies in tool invocations using information theory (conditional entropy) and statistical hypothesis testing, providing a rigorous theoretical foundation for the proposed method.
- Novel LLM Offloading Perspective: Rather than optimizing the LLM itself, the paper identifies which decisions do not require LLM reasoning, embodying the key insight that not every step demands full inference.
- Parameter-Level Data Flow Modeling: Beyond modeling tool sequences, the framework tracks parameter flow across tools to enable automatic parameter filling.
- Plug-and-Play Compatibility: AutoTool can directly augment existing frameworks such as ReAct and Reflexion without any fine-tuning.
Limitations & Future Work¶
- Cold Start: Online graph construction requires accumulated trajectories; inertia predictions are less reliable in early stages.
- The inertia window is fixed at 2; dynamic adjustment is not explored.
- Evaluation is limited to 3 datasets and does not cover more complex real-world API calling scenarios.
- The semantic similarity module (SimCSE) introduces an additional model dependency.
- The 30% inertia call cap is conservative and may limit further efficiency gains.
Related Work & Insights¶
- Compared with tool selection methods such as ToolNet and AnyTool, AutoTool is the only approach that simultaneously achieves LLM offloading, inertia sensing, and parameter flow modeling.
- The approach shares conceptual similarity with Agent Workflow Memory (learning from historical interactions), but AutoTool focuses on statistical patterns at the tool level.
- The inertia graph concept is generalizable to other agent scenarios requiring repetitive decision-making, such as code generation and data analysis pipelines.
Rating¶
- Novelty: ⭐⭐⭐⭐ Tool usage inertia is a novel and well-supported empirical finding; the graph-based design is well-motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐ Three datasets with multi-model validation and detailed overhead/sensitivity analyses, though dataset scale is limited.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated, theoretical analysis is rigorous, and figures are informative.
- Value: ⭐⭐⭐⭐ Offers a new direction for agent efficiency optimization with strong practical applicability.