RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning
Conference: ICLR 2026
arXiv: 2505.21413
Code: TBD
Area: Information Retrieval
Keywords: tool creation, reference-guided, knowledge-intensive reasoning, executable tools, hierarchical toolbox
TL;DR
RefTool automatically creates executable Python tools from external reference materials (textbooks, knowledge snippets), addressing the failure of tool creation methods that rely on LLM internal knowledge in specialized domains. It outperforms existing methods by an average of 12.3% on causal reasoning, physics, and chemistry tasks.
Background & Motivation
Background: LLM tool creation enables models to dynamically generate and call tools during reasoning, offering greater flexibility than predefined toolsets. Existing methods (e.g., CRAFT, TroVE) rely on the internal knowledge of LLMs to generate tools.
Limitations of Prior Work: LLM internal knowledge is unreliable in specialized domains (causal reasoning, quantum physics, organic chemistry), leading to the generation of tools containing incorrect formulas or logic.
Key Challenge: Tool creation requires precise domain knowledge, whereas LLM knowledge in specialized fields may be inaccurate or incomplete.
Goal: How to utilize external authoritative reference materials (textbooks) as knowledge sources to guide tool creation?
Key Insight: Utilize the natural chapter-section structure of textbooks to organize a hierarchical toolbox and extract executable Python functions from each section.
Core Idea: References → Tool Creation + Hierarchical Toolbox → Hierarchical Retrieval → Reasoning.
Method
Overall Architecture
RefTool targets a specific failure: when LLMs create tools from internal knowledge alone, the tools often encode incorrect formulas in specialized domains. Its remedy is to treat authoritative textbooks as the knowledge source and compile them offline into executable tools that can be invoked directly during reasoning. The pipeline has two stages. In tool creation, the textbook is read section by section to generate Python functions with descriptions and examples; each tool is verified by an execution test and then filed into a two-level toolbox that mirrors the chapter-section structure. In tool utilization, hierarchical retrieval (first a chapter-level category, then specific tools within it) selects tools that are passed to either PoT single-turn or ReAct multi-turn reasoning to solve the problem.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400, 'subGraphTitleMargin': {'top': 8, 'bottom': 16}}}%%
flowchart TD
R["Authoritative Textbooks /<br/>Knowledge Snippets"] --> GEN
subgraph CREATE["Reference-Guided Tool Creation (Design 1, Offline)"]
direction TB
GEN["Generate tool section by section:<br/>Description + Python Function + Example"] --> TEST{"Execution Test<br/>Success?"}
TEST -->|"No, feed error back for repair"| GEN
TEST -->|"Yes"| BOX["Organize into<br/>two-level toolbox"]
end
Q["Input problem q"] --> CAT
BOX --> CAT
subgraph RET["Hierarchical Tool Retrieval (Design 2)"]
direction TB
CAT["Select chapter-level categories<br/>(≤nc relevant categories)"] --> TOOL["Select tools within categories<br/>(≤nt tools)"]
end
TOOL -->|"Retrieved tools"| REASON
subgraph REASON["Two Reasoning Modes (Design 3)"]
direction TB
POT["PoT Single-turn<br/>Generate call code at once"]
REACT["ReAct Multi-turn<br/>Reason while calling"]
end
REASON --> ANS["Answer"]
Key Designs
1. Reference-Guided Tool Creation: Using Textbooks as Knowledge Sources Instead of LLM Imagination
This design targets the core challenge: tool creation requires precise domain knowledge, but LLM knowledge in specialized fields such as causal reasoning, quantum physics, and organic chemistry is unreliable. RefTool draws on each section of a textbook to generate tools (at most two per section). Each tool has three parts: a natural language description, an executable Python function, and a usage example. Generated tools are not trusted immediately; they must first pass an execution test. 73% of tools pass on the first generation and another 14% pass after one round of repair, so only tools that run successfully enter the library. The toolbox reuses the textbook's own hierarchy (Chapter → Section → Tool), which avoids additional knowledge engineering; the rationale is that textbooks are human-verified and therefore more reliable than LLM internal knowledge.
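The generate, test, and repair cycle described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `Tool`, `execution_error`, `create_tool`, and `stub_generate` are hypothetical names, and `stub_generate` is a hard-coded stand-in for the LLM call that would read a textbook section.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tool:
    description: str  # natural language description
    code: str         # source of one Python function
    example: str      # snippet that calls the function and checks the result


def execution_error(tool: Tool) -> Optional[str]:
    """Run the tool's code and example; return None on success, else an error message."""
    namespace: dict = {}
    try:
        exec(tool.code, namespace)     # define the function
        exec(tool.example, namespace)  # run the usage example against it
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def create_tool(
    section_text: str,
    generate: Callable[[str, Optional[str]], Tool],
    max_repairs: int = 1,
) -> Optional[Tool]:
    """Generate a tool, feeding any execution error back for up to max_repairs retries."""
    error: Optional[str] = None
    for _ in range(1 + max_repairs):
        tool = generate(section_text, error)
        error = execution_error(tool)
        if error is None:
            return tool  # only tools that execute successfully are kept
    return None  # discard tools that still fail after repair


def stub_generate(section_text: str, error: Optional[str]) -> Tool:
    """Hypothetical stand-in for the LLM: the first draft forgets an import, the repair adds it."""
    code = "def rms(values):\n    return sqrt(sum(v * v for v in values) / len(values))\n"
    if error is not None:  # pretend the model read the NameError and fixed it
        code = "from math import sqrt\n" + code
    return Tool(
        description="Root mean square of a list of numbers.",
        code=code,
        example="assert abs(rms([3, 4]) - 12.5 ** 0.5) < 1e-9",
    )


tool = create_tool("Section text would go here.", stub_generate)
print(tool is not None)
```

Passing this kind of test only shows that the code runs and that its own example holds; it does not prove the formula is right. In practice, `exec` on generated code should run inside a sandbox.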
2. Hierarchical Tool Retrieval: Two-Step Selection to Narrow the Search Space
The resulting library can be large, and flat retrieval over all tools loses precision because irrelevant tools compete with relevant ones. RefTool instead retrieves in two steps that follow the toolbox hierarchy: it first selects up to nc relevant chapter-level categories, then selects up to nt tools within those categories. Each step therefore ranks far fewer candidates, which shrinks the search space and improves precision. For unstructured reference materials without clear sectioning, the LLM first summarizes a category hierarchy, after which the same two-step retrieval applies.
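A small sketch of the two-step selection is given below. It assumes a toy toolbox and replaces the LLM-based selection with a crude word-overlap score, so `tokens`, `top_k`, `retrieve`, and the physics entries are all illustrative inventions rather than anything from the paper.

```python
import re
from typing import Dict, List, Tuple

# category name -> {tool name -> tool description}
Toolbox = Dict[str, Dict[str, str]]


def tokens(text: str) -> set:
    """Lowercased alphabetic words of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def top_k(query: str, candidates: Dict[str, str], k: int) -> List[str]:
    """Keep up to k candidate names whose text shares at least one word with the query."""
    scored = [(len(tokens(query) & tokens(text)), name) for name, text in candidates.items()]
    kept = sorted((pair for pair in scored if pair[0] > 0), key=lambda p: (-p[0], p[1]))
    return [name for _, name in kept[:k]]


def retrieve(query: str, toolbox: Toolbox, n_c: int = 1, n_t: int = 2) -> List[Tuple[str, str]]:
    """Step 1: pick up to n_c categories. Step 2: pick up to n_t tools inside each one."""
    categories = top_k(query, {name: name for name in toolbox}, n_c)
    selected = []
    for category in categories:
        for tool_name in top_k(query, toolbox[category], n_t):
            selected.append((category, tool_name))
    return selected


toolbox: Toolbox = {
    "Kinematics: motion of projectiles and particles": {
        "displacement": "displacement under constant acceleration",
        "projectile_range": "horizontal range of a projectile launched at an angle",
    },
    "Thermodynamics: heat, temperature, and gases": {
        "ideal_gas_pressure": "pressure from moles, temperature, and volume",
        "heat_transfer": "heat needed to change the temperature of a mass",
    },
}

query = "What is the range of a projectile in motion launched at 30 degrees?"
print(retrieve(query, toolbox))
```

The point of the structure is that the tool-level scorer never sees the thermodynamics tools for this query: the category step has already removed them.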
3. Two Reasoning Modes: Complementarity of PoT and ReAct
How the retrieved tools are used depends on the task type. PoT (Program of Thoughts) generates a single block of code containing the tool calls and solves the problem in one pass, which is more efficient for tasks with a deterministic solution path. ReAct uses multi-turn interaction, interleaving reasoning with tool retrieval and tool calls as needed, which gives more flexibility for tasks that require step-by-step exploration. Together the two modes cover reasoning needs ranging from one-shot solutions to exploratory approaches.
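The difference between the two modes can be illustrated with a toy runner for each. Everything here is an assumption made for illustration: `free_fall_time` is an invented tool, `POT_PROGRAM` stands in for code an LLM would generate, and `stub_policy` stands in for the LLM deciding its next action.

```python
from typing import Any, Callable, Dict, List, Tuple


def free_fall_time(height_m: float, g: float = 9.8) -> float:
    """Hypothetical retrieved tool: time to fall height_m metres from rest."""
    return (2 * height_m / g) ** 0.5


TOOLS: Dict[str, Callable[..., Any]] = {"free_fall_time": free_fall_time}


def run_pot(program: str, tools: Dict[str, Callable[..., Any]]) -> Any:
    """PoT style: execute one generated program that calls tools and sets `answer`."""
    namespace: Dict[str, Any] = dict(tools)
    exec(program, namespace)
    return namespace["answer"]


def run_react(
    policy: Callable[[List[Tuple[str, Any]]], tuple],
    tools: Dict[str, Callable[..., Any]],
    max_turns: int = 5,
) -> Any:
    """ReAct style: alternate between choosing an action and observing a tool result."""
    history: List[Tuple[str, Any]] = []
    for _ in range(max_turns):
        action = policy(history)
        if action[0] == "finish":
            return action[1]
        _, name, kwargs = action
        history.append((name, tools[name](**kwargs)))  # record the observation
    raise RuntimeError("no answer within max_turns")


# Stand-in for LLM output in PoT mode: impact speed after a 19.6 m drop.
POT_PROGRAM = "t = free_fall_time(19.6)\nanswer = round(9.8 * t, 2)\n"


def stub_policy(history: List[Tuple[str, Any]]) -> tuple:
    """Stand-in for the LLM in ReAct mode: call the tool once, then answer."""
    if not history:
        return ("call", "free_fall_time", {"height_m": 19.6})
    return ("finish", round(9.8 * history[-1][1], 2))


print(run_pot(POT_PROGRAM, TOOLS), run_react(stub_policy, TOOLS))
```

Both runners reach the same answer here; the ReAct loop simply exposes the intermediate observation, which is what allows a real model to change course between turns.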
Key Experimental Results
Main Results
| Task (benchmark) | RefTool+PoT (GPT-4o) accuracy | TroVE accuracy | Domain-specific baseline |
|---|---|---|---|
| Causal Reasoning (QRData) | 46.8% | 36.4% | — |
| Physics (TheoremQA) | 57.9% | — | Physics Reasoner |
| Chemistry (SciBench) | 66.4% | — | ChemAgent |
Ablation Study
| Configuration | Effect |
|---|---|
| No reference materials (Pure LLM knowledge) | Significant decline |
| No hierarchical structure (Flat retrieval) | Reduced retrieval precision |
| No execution test verification | Increased proportion of incorrect tools |
Key Findings
- RefTool outperforms tool creation methods by an average of 13.0% and domain-specific methods by 10.2%.
- 73% of tools pass verification on the first generation, suggesting that high-quality reference materials translate into high-quality tools.
- Hierarchical retrieval is more effective than flat retrieval—leveraging textbook structures reduces the search space.
- RefTool shows its largest gain on causal reasoning (+10.4 percentage points over TroVE, 46.8% vs. 36.4%), which suggests that LLM internal knowledge is particularly weak in the causal domain.
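The figures above reduce to simple arithmetic, worked out here from the numbers reported in this note. The last line assumes that the 73% and 14% pass rates are measured over the same pool of generated tools, which the note does not state explicitly.

```latex
\begin{aligned}
\text{gain over TroVE on QRData} &= 46.8\% - 36.4\% = 10.4 \text{ percentage points} \\
\text{relative gain} &= \frac{10.4}{36.4} \approx 28.6\% \\
\text{tools passing within one repair round} &= 73\% + 14\% = 87\%
\end{aligned}
```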
Highlights & Insights
- The approach of directly transforming textbooks into tools is intuitive—human learning also moves from studying textbooks to application.
- The 73% first-pass verification rate suggests that the transformation from textbook knowledge to code is more reliable than expected.
- The hierarchical toolbox utilizes the natural structure of textbooks, avoiding additional knowledge engineering.
- The method is generalizable to any professional domain with structured reference materials.
Limitations & Future Work
- Dependent on the availability of high-quality reference materials—cannot be used without a good textbook.
- Each section generates at most 2 tools; information-dense chapters might miss important functionalities.
- The tool hierarchy is fixed at two levels; complex knowledge systems might require more flexible organization.
- Inter-tool coordination and reuse (e.g., one tool calling another) have not been explored.
Related Work & Insights
- vs CRAFT/TroVE: These rely on LLM internal knowledge, which is unreliable in specialized domains; RefTool compensates with external references.
- vs RAG: RAG retrieves raw text for the LLM to reason over, whereas RefTool pre-compiles knowledge into executable code, so formulas are computed by execution rather than re-derived in text.
- vs chainSTORM/domain-specific agents: Domain-specific methods require manual design; RefTool generates its tools automatically from textbooks.
- Suggests a "knowledge compilation" paradigm: pre-compiling textual knowledge into executable programs to reduce the cognitive burden during reasoning.
Rating
- Novelty: ⭐⭐⭐⭐ The reference-to-tool approach is intuitive and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐ Validated across three specialized domains with sufficient comparisons.
- Writing Quality: ⭐⭐⭐⭐ Framework design is clearly presented.
- Value: ⭐⭐⭐⭐ Provides a practical paradigm for tool creation in specialized fields.