Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub¶

Conference: ACL 2025 Main
arXiv: 2312.17294
Code: https://github.com/OpenBMB/OpenAct
Area: LLM/NLP
Keywords: Autonomous Tool Integration, GitHub, Hierarchical Agent, Experience Learning, Open-Domain

TL;DR¶

This paper proposes OpenAgent, a system that autonomously finds GitHub repositories and turns them into tools through a four-stage process (Search→Setup→Apply→Store). On open-domain tasks in specialized areas such as finance, computer vision, biology, and chemistry, it reaches an average success rate of 69.4%.

Background & Motivation¶

Background: LLM-based agents enhance their capabilities by integrating external tools (such as search engines, calculators, and knowledge bases). However, the toolsets supported by existing agents are limited and cannot cover the diverse requirements of users across various specialized domains.

Limitations of Prior Work: (1) Fixed toolsets—existing agents only support limited, predefined toolsets, leaving them helpless in specialized domains (e.g., quantitative investment, molecular retrosynthesis); (2) Weak tool creation capabilities—although some studies allow LLMs to dynamically create tools, the created tools have simple functionalities and cannot satisfy highly complex real-world requirements; (3) Lack of evaluation—there is no dataset to evaluate the open-domain task-solving capabilities of LLMs.

Key Challenge: While there is a vast pool of specialized tool repositories on GitHub, automatically integrating them into agents faces two major challenges: (a) varying repository quality, incomplete documentation, and potential bugs in the code; (b) diverse usage methods of repositories and a lack of standardized interfaces.

Goal: (1) Construct the OpenAct benchmark to evaluate open-domain task-solving capabilities; (2) design an agent system capable of autonomously integrating tools from GitHub.

Core Idea: Empower LLM agents to autonomously search for GitHub repositories and integrate them as tools, overcoming the non-standardization of repositories by learning from human experience in GitHub Issues and PRs.

Method¶

Overall Architecture¶

OpenAgent (referred to as GitAgent in early versions of the paper) adopts a hierarchical task decomposition strategy, dividing the tool integration process into four stages: Search (searching for suitable repositories) \(\rightarrow\) Setup (configuring the environment) \(\rightarrow\) Apply (applying the repository to solve the task) \(\rightarrow\) Store (storing the repository for future use). Each stage is further decomposed into multiple subtasks, each completed through a sequence of actions (such as API calls, command execution, and file I/O).

Mathematical formulation: \(Q \xrightarrow{\mathcal{M}} O_{\text{Search}} \rightarrow O_{\text{Setup}} \rightarrow O_{\text{Apply}} \rightarrow O_{\text{Store}} \rightarrow R\)
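
The chain above can be read as a sequential pipeline in which each stage consumes the outputs of the earlier ones. The following is a minimal sketch of that control flow only, not the authors' implementation: `StageResult`, `run_pipeline`, and `make_stub` are hypothetical names, and stopping at the first failed stage is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

STAGES = ["Search", "Setup", "Apply", "Store"]


@dataclass
class StageResult:
    stage: str
    ok: bool
    output: str


# A handler receives the user query Q and the results of earlier stages.
Handler = Callable[[str, List[StageResult]], StageResult]


def run_pipeline(query: str, handlers: Dict[str, Handler]) -> List[StageResult]:
    """Run the four stages in order and stop at the first failure (assumed policy)."""
    history: List[StageResult] = []
    for stage in STAGES:
        result = handlers[stage](query, history)
        history.append(result)
        if not result.ok:
            break
    return history


def make_stub(stage: str, ok: bool = True) -> Handler:
    """Build a placeholder handler; a real one would drive the LLM and its actions."""
    def handler(query: str, history: List[StageResult]) -> StageResult:
        return StageResult(stage, ok, f"{stage} finished for: {query}")
    return handler


handlers = {stage: make_stub(stage) for stage in STAGES}
trace = run_pipeline("restore an old photo", handlers)
print([r.stage for r in trace])
```

Keeping each stage behind its own handler mirrors the point that stages can be inspected and improved independently.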

Key Designs¶

Search Stage—Adaptive Repository Search:
- Two search strategies: (a) if the user specifies a repository name, the GitHub search_by_name API is directly called; (b) if not specified, the agent extracts a list of GitHub topics from the query and calls the search_by_topic API to search.
- Repository function judgment: The agent reads the README of each candidate repository, analyzes its features, and determines if it is suitable for the current task.
- Cache retrieval: Prioritizes retrieval from previously stored repositories, achieving 100% Precision@1.
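
The search order described above (stored repositories first, then name search, otherwise topic search) can be sketched as follows. This is an illustrative sketch under stated assumptions: every callable is injected and hypothetical, the stubs stand in for LLM judgments and search calls, and none of the names correspond to a real GitHub client library.

```python
from typing import Callable, List, Optional


def adaptive_search(
    query: str,
    stored: List[str],
    is_suitable: Callable[[str, str], bool],
    search_by_name: Callable[[str], List[str]],
    search_by_topic: Callable[[str], List[str]],
    extract_topics: Callable[[str], List[str]],
    repo_name: Optional[str] = None,
) -> Optional[str]:
    """Return the first suitable repository, checking the cache before searching."""
    def first_suitable(candidates: List[str]) -> Optional[str]:
        for repo in candidates:
            if is_suitable(repo, query):
                return repo
        return None

    hit = first_suitable(stored)          # cache retrieval comes first
    if hit is not None:
        return hit
    if repo_name:                         # strategy (a): user named a repository
        return first_suitable(search_by_name(repo_name))
    for topic in extract_topics(query):   # strategy (b): topics inferred from the query
        hit = first_suitable(search_by_topic(topic))
        if hit is not None:
            return hit
    return None


# Stubs for demonstration only; a real agent would read READMEs with an LLM.
calls = []
KEYWORDS = {"Qlib": "finance", "AiZynthFinder": "retrosynthesis"}


def fake_by_name(name):
    calls.append(("name", name))
    return [name]


def fake_by_topic(topic):
    calls.append(("topic", topic))
    return {"retrosynthesis": ["AiZynthFinder"]}.get(topic, [])


def fake_topics(query):
    return [w for w in query.lower().split() if w in KEYWORDS.values()]


def fake_suitable(repo, query):
    keyword = KEYWORDS.get(repo)
    return keyword is not None and keyword in query.lower()


found = adaptive_search("plan a retrosynthesis route", [], fake_suitable,
                        fake_by_name, fake_by_topic, fake_topics)
print(found, calls)
```

When the repository is already in `stored`, the function returns without calling either search stub, which is the saving the cache is meant to provide.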

Dual-level Experience Learning Mechanism (Human Experience Learning):
- Learning from Issues: When the agent runs into a problem in the Apply stage, it summarizes the problem into a query \(Q_S\), calls the GitHub Issues API to search for related issues, evaluates their applicability one by one, and extracts a solution.
- Learning from PRs: When encountering environment configuration issues or code bugs during the Setup stage, the agent searches Pull Requests to find fixes, modifying source files through the File_Modification subtask.
- Mathematical formulation: \(A_{\text{search}}(Q_S) = \mathcal{M}_{P_{\text{abs}}}(Q, H)\), where \(H\) represents the accumulated historical information.
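
The issue-learning loop described above (summarize, search, judge applicability, extract a fix) can be sketched as follows. This is a hedged illustration, not the paper's code: `learn_from_issues` and all four injected callables are hypothetical, the `max_issues` cap is an assumption, and the issue records and fixes in the demo are made up.

```python
from typing import Callable, Dict, List, Optional

Issue = Dict[str, str]


def learn_from_issues(
    query: str,
    history: List[str],
    summarize: Callable[[str, List[str]], str],
    search_issues: Callable[[str], List[Issue]],
    is_applicable: Callable[[Issue, List[str]], bool],
    extract_fix: Callable[[Issue], str],
    max_issues: int = 5,
) -> Optional[str]:
    """Summarize the problem into Q_S, then scan candidate issues one by one."""
    q_s = summarize(query, history)              # Q_S built from Q and H
    for issue in search_issues(q_s)[:max_issues]:
        if is_applicable(issue, history):        # judged per issue
            return extract_fix(issue)
    return None


# Made-up data for demonstration; real issues would come from a search call.
history = ["pip install -r requirements.txt",
           "ModuleNotFoundError: No module named 'torch'"]
issue_db = [
    {"title": "Build fails on Windows", "fix": "use WSL"},
    {"title": "No module named 'torch'", "fix": "pip install torch"},
]

fix = learn_from_issues(
    query="restore an old photo",
    history=history,
    summarize=lambda q, h: h[-1],                       # stub: last log line
    search_issues=lambda q_s: issue_db,                 # stub: return everything
    is_applicable=lambda issue, h: issue["title"] in h[-1],
    extract_fix=lambda issue: issue["fix"],
)
print(fix)
```

In a real system the `summarize`, `is_applicable`, and `extract_fix` steps would each be LLM calls; the stubs only show where they plug in.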

Secure Isolated Execution: Repositories are cloned into a Docker environment, and all subsequent commands run inside this isolated environment for safety.
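
One way to route every command through a container is to build `docker exec` invocations instead of running commands on the host. The sketch below only constructs the argument lists and assumes an already running container with `git` installed; the container name `openagent-sandbox`, the `/workspace` path, and the helper names are hypothetical, and the paper's actual setup may differ.

```python
import shlex
from typing import List


def clone_argv(container: str, repo_url: str, dest: str) -> List[str]:
    """Command that clones a repository inside the container, not on the host."""
    return ["docker", "exec", container, "git", "clone", "--depth", "1", repo_url, dest]


def exec_argv(container: str, workdir: str, command: str) -> List[str]:
    """Command that runs a shell command inside the container from a working directory."""
    return ["docker", "exec", "-w", workdir, container, "sh", "-c", command]


container = "openagent-sandbox"   # hypothetical container name
workdir = "/workspace/repo"       # hypothetical path inside the container

steps = [
    clone_argv(container, "https://github.com/OpenBMB/OpenAct", workdir),
    exec_argv(container, workdir, "ls"),
]
for argv in steps:
    # Printed for inspection; pass argv to subprocess.run(...) to actually execute.
    print(shlex.join(argv))
```

Passing argument lists rather than a single shell string on the host side keeps repository-supplied text from being interpreted by the host shell.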

Loss & Training¶

This work designs an agent system and involves no training loss. It is built on the Function Calling capability of GPT-4 (gpt-4-32k), with temperature=0.6.

Experiments¶

Main Results¶

Repository	Domain	Queries	Search Success Rate (%)	Setup Success Rate (%)	Apply Success Rate (%)	Store Success Rate (%)
Qlib	Finance	8	77.5	75.0	67.5	67.5
Bringing-Old-Photos	CV	12	100.0	85.0	63.3	63.3
Sniffles	Biology	3	100.0	100.0	100.0	100.0
AiZynthFinder	Chemistry	7	83.6	80.0	69.1	69.1

Key Findings: The average success rate is 69.4%, which matches the query-weighted mean of the final-stage (Store) rates over all 30 queries. Success varies widely across repositories (Sniffles 100% vs. Bringing-Old-Photos 63.3%), reflecting the practical challenge of repository non-standardization.
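
The 69.4% figure can be checked against the table. The summary does not state how the average was formed, so the sketch below assumes a mean weighted by the number of queries per repository; the variable names are illustrative.

```python
# (repository, queries, success rates in % for Search, Setup, Apply, Store)
ROWS = [
    ("Qlib", 8, (77.5, 75.0, 67.5, 67.5)),
    ("Bringing-Old-Photos", 12, (100.0, 85.0, 63.3, 63.3)),
    ("Sniffles", 3, (100.0, 100.0, 100.0, 100.0)),
    ("AiZynthFinder", 7, (83.6, 80.0, 69.1, 69.1)),
]
STAGES = ("Search", "Setup", "Apply", "Store")


def weighted_mean(values, weights):
    """Mean of values where each value counts once per query."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


weights = [queries for _, queries, _ in ROWS]
per_stage = {
    stage: weighted_mean([rates[i] for _, _, rates in ROWS], weights)
    for i, stage in enumerate(STAGES)
}
unweighted_store = sum(rates[3] for _, _, rates in ROWS) / len(ROWS)

for stage, value in per_stage.items():
    print(f"{stage}: {value:.1f}%")
print(f"Unweighted Store mean: {unweighted_store:.1f}%")
```

Under this assumption the Store column averages to about 69.4%, while the plain mean over the four repositories is about 75.0%, so the reported number is consistent with per-query weighting.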

Ablation Study—Computational Cost Analysis¶

Repository	Average API Calls	Average Tokens
Qlib	32.8	199,388
Bringing-Old-Photos	28.9	80,286
Sniffles	16.3	47,440
AiZynthFinder	30.5	90,639
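
The cost table can be broken down further. The sketch below derives tokens per API call for each repository and a query-weighted mean of tokens per query; the query counts are taken from the Main Results table, and the weighting scheme is an assumption rather than something the summary states.

```python
# (repository, queries, average API calls, average tokens)
COSTS = [
    ("Qlib", 8, 32.8, 199_388),
    ("Bringing-Old-Photos", 12, 28.9, 80_286),
    ("Sniffles", 3, 16.3, 47_440),
    ("AiZynthFinder", 7, 30.5, 90_639),
]

# Average tokens consumed per API call, per repository.
tokens_per_call = {name: tokens / calls for name, _, calls, tokens in COSTS}

# Mean tokens per query, weighting each repository by its number of queries.
total_queries = sum(queries for _, queries, _, _ in COSTS)
mean_tokens_per_query = sum(q * tokens for _, q, _, tokens in COSTS) / total_queries

for name, value in tokens_per_call.items():
    print(f"{name}: {value:,.0f} tokens per call")
print(f"Weighted mean: {mean_tokens_per_query:,.0f} tokens per query")
```

Under these assumptions Qlib spends roughly 6.1k tokens per call against roughly 2.8k to 3.0k for the other three repositories, so its higher total comes mostly from longer contexts rather than many more calls. The weighted mean is about 111k tokens per query.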

Key Findings¶

The Search stage is one of the main bottlenecks: The agent must infer the repository's function and domain from the user query and generate relevant GitHub topics, which demands strong abstraction ability.
The Apply stage has the greatest impact on the success rate: Usage differs widely from one repository to another, so the agent needs a deep understanding of each.
Cache retrieval is 100% accurate: The Store-and-Retrieve mechanism effectively reduces repetitive search overhead.
Three main failure reasons: Incorrect repository selection (vague README descriptions), environment configuration failure (outdated Dockerfile), and execution configuration errors (incorrect parameter settings).

Highlights & Insights¶

Forward-looking Direction: The idea of treating GitHub repositories as a tool resource pool is highly scalable, theoretically covering any specialized domain.
Experience Learning Mechanism: Leveraging GitHub Issues/PRs as a knowledge base of human experience is an elegant way to address the non-standardization problem.
Hierarchical Architecture Design: The four-stage process is clear, and each stage can be optimized independently.
Fully Open-source: The OpenAct dataset and code are both open-sourced.

Limitations & Future Work¶

Small evaluation scale: Evaluated on only 4 repositories and 30 queries, which limits the statistical strength of the conclusions.
Highly dependent on the reasoning capabilities of GPT-4; the performance on open-source models remains unknown.
Docker isolation increases deployment complexity.
High token consumption per query: Per-repository averages range from about 47k to about 199k tokens (roughly 100k overall), resulting in high costs.
Sensitive to README quality; repositories with unclear documentation are prone to failure.
Unstable results: Repeated runs show high variance, partly due to network connectivity and the randomness of LLM inference.

Related Work¶

LLM-based Agents: General agent systems such as AutoGPT, AutoGen, and XAgent rely on fixed toolsets.
Tool Learning: Toolformer, ToolLLM, and Gorilla focus on tool-use capabilities but assume predefined toolsets.
Tool Creation: Works such as LATM and CREATOR allow LLMs to create tools, but the resulting tools have simple functionality.
Positioning of This Work: It is presented as the first agent system to autonomously integrate tools from GitHub, breaking the limitation of fixed toolsets.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ — The idea of using GitHub as a dynamic tool library is highly innovative.
Value: ⭐⭐⭐⭐ — Possesses significant value for extending agent capability boundaries, but token costs are high.
Technical Depth: ⭐⭐⭐⭐ — The hierarchical architecture and dual-level experience learning are well-designed.
Experimental Thoroughness: ⭐⭐⭐ — Small evaluation scale, with only 4 repositories and 30 queries.
Overall Recommendation: ⭐⭐⭐⭐ — The direction is important and the system design is complete; however, the experimental scale could be larger.