ChemAmp: Amplified Chemistry Tools via Composable Agents
Conference: ACL 2026
arXiv: 2505.21569
Code: GitHub
Area: Scientific AI / Chemistry
Keywords: Tool amplification, composable agents, chemistry AI, multi-agent systems, hierarchical composition
TL;DR
This paper proposes a novel "tool amplification" paradigm (distinct from conventional tool orchestration) and introduces the ChemAmp framework, which treats chemistry-specific tools (UniMol2, Chemformer, etc.) as composable building blocks to dynamically construct task-specialized super-agents. ChemAmp surpasses both domain-specific models and general-purpose LLMs on four core chemistry tasks—including molecular design and reaction prediction—while reducing inference token costs by 94%.
Background & Motivation
Background: LLM-based agents have demonstrated the ability to orchestrate multi-step tool-use workflows in the chemistry domain (e.g., ChemCrow, Coscientist), sequentially invoking tools such as RDKit and molecular generators to complete cross-task workflows.
Limitations of Prior Work: Existing approaches focus on "tool orchestration" (scheduling tool sequences across tasks), yet within-task performance remains bounded by the atomic capabilities of individual tools. Even state-of-the-art chemistry-specific tools (UniMol2, ChemDFM) achieve only 35% exact match in molecular description when used in isolation, and errors propagate through the reasoning chain.
Key Challenge: Tool orchestration optimizes inter-task tool scheduling, but the true bottleneck constraining agent performance is within-task tool capability limitations.
Goal: Shift from "tool orchestration" to "tool amplification"—enabling tools to exceed their individual atomic capabilities within a single task through dynamic composition.
Key Insight: Treat each tool as a composable building-block agent and construct higher-performing composite tools through hierarchical iterative encapsulation.
Core Idea: A two-stage amplification process—first encapsulating atomic tools into enhanced sub-agents (Stage 1), then composing sub-agents into a hierarchical network (Stage 2), with iterative refinement guided by adaptive scoring and automatic feedback.
Method
Overall Architecture
ChemAmp constructs an agent hierarchy through a two-stage bidirectional encapsulation engine. In Stage 1 (atomic → composite amplification), each atomic tool is iteratively encapsulated into an Agent Composite Tool until performance ceases to improve, and all variants are registered in a tool library. In Stage 2 (cross-composite collaboration), the best-performing tool from the library serves as a base and is combined with other top-\(k\) tools to form higher-level composite tools, iterating until global performance stabilizes.
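The Stage 1 loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `stage1_amplify`, `wrap`, `score`, `max_rounds`, and the toy scores are all hypothetical names and values introduced here, and registering the final non-improving variant is an assumption based on the statement that all variants enter the tool library.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")  # whatever object represents a tool or a wrapped agent


def stage1_amplify(
    tool: T,
    wrap: Callable[[T], T],        # hypothetical: encapsulates a tool one more level
    score: Callable[[T], float],   # hypothetical: task metric on a small validation set
    delta: float = 0.01,           # minimum gain required to keep encapsulating
    max_rounds: int = 5,           # safety cap, an assumption not stated in the source
) -> List[Tuple[T, float]]:
    """Wrap `tool` repeatedly and return every variant with its score."""
    library = [(tool, score(tool))]
    current, best = library[0]
    for _ in range(max_rounds):
        candidate = wrap(current)
        s = score(candidate)
        library.append((candidate, s))  # every variant is registered
        if s - best <= delta:           # gain too small: stop encapsulating
            break
        current, best = candidate, s
    return library


# Toy usage: a "tool" is just its encapsulation depth, with made-up scores.
toy_scores = [0.30, 0.36, 0.40, 0.405]
variants = stage1_amplify(
    0,
    wrap=lambda depth: depth + 1,
    score=lambda depth: toy_scores[min(depth, 3)],
)
print(variants)
```

With these toy numbers the loop stops at the third wrap, because its gain (0.005) falls below `delta`; the library then holds four variants for Stage 2 to rank.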
Key Designs
- Dual Role of the Agent Composite Tool:
  - Function: Serves simultaneously as a composable building block for higher-level agents and as an autonomous executor for chemistry sub-tasks.
  - Mechanism: Each \(\mathcal{A}(t_1,\ldots,t_n)\) encapsulates multiple tools along with their coordination strategies, enabling both invocation by upper-level agents and independent execution. This duality allows ChemAmp to identify optimal enhancement points where tool coordination yields synergistic effects.
  - Design Motivation: Avoids naive stacking and enables genuine capability emergence.
- Two-Stage Iterative Encapsulation:
  - Function: Automatically discovers optimal tool combinations.
  - Mechanism: Stage 1 iteratively encapsulates each atomic tool as \(\mathcal{A}_i(t_k)\), scored by task metric \(s_i\), continuing only when improvement exceeds threshold \(\delta\). Stage 2 ranks the tool library, takes the top-1 tool as the base, and forms combinations \(\{\mathcal{A}(t_1,t_2),\ldots,\mathcal{A}(t_1,t_k)\}\) with the top-\(k\) tools, iterating until global performance no longer improves.
  - Design Motivation: Manual combination is infeasible and exhaustive search is prohibitively costly; iterative encapsulation with threshold control balances efficiency and effectiveness.
- Minimal Data Requirement (≤10 samples):
  - Function: Optimizes tool composition with extremely few validation samples.
  - Mechanism: Each task requires at most 10 samples for composition scoring and selection. Because the chemistry tools already embed domain knowledge, a handful of examples is enough to determine whether a combination yields improvement.
  - Design Motivation: Annotated data is scarce in the chemistry domain, necessitating a low-data-dependency approach.
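The Stage 2 ranking-and-combination step from the designs above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the paper's code: `Composite`, `stage2_compose`, `toy_score`, the tool names `A`, `B`, `C`, the `max_rounds` cap, and all scores are hypothetical. In practice `score` would evaluate a composite on the small (≤10 sample) validation set.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Composite:
    """Hypothetical stand-in for A(t_1, ..., t_n): the tools it encapsulates."""
    members: Tuple[str, ...]


def stage2_compose(
    library: Dict[Composite, float],
    score: Callable[[Composite], float],
    k: int = 3,
    max_rounds: int = 10,  # safety cap, an assumption not stated in the source
) -> Tuple[Composite, float]:
    """Pair the top-1 tool with the rest of the top-k until nothing improves."""
    lib = dict(library)  # work on a copy of the tool library
    for _ in range(max_rounds):
        ranked = sorted(lib, key=lib.get, reverse=True)
        base, partners = ranked[0], ranked[1:k]
        fresh: Dict[Composite, float] = {}
        for p in partners:
            # Merge member tools, dropping duplicates while keeping order.
            merged = Composite(tuple(dict.fromkeys(base.members + p.members)))
            if merged not in lib and merged not in fresh:
                fresh[merged] = score(merged)
        if not fresh or max(fresh.values()) <= lib[base]:
            break  # global performance has stabilized
        lib.update(fresh)
    best = max(lib, key=lib.get)
    return best, lib[best]


# Toy scoring: A and B are complementary, C adds noise (made-up numbers).
single = {"A": 0.32, "B": 0.30, "C": 0.10}


def toy_score(c: Composite) -> float:
    s = max(single[m] for m in c.members)
    if {"A", "B"} <= set(c.members):
        s += 0.08
    if "C" in c.members and len(c.members) > 1:
        s -= 0.05
    return round(s, 2)


start = {Composite((name,)): value for name, value in single.items()}
best, best_score = stage2_compose(start, toy_score, k=3)
print(best, best_score)
```

In this toy run the first round promotes the `A`+`B` composite, and the second round produces no unseen combination, so the loop stops. Only a few pairings with the current best are scored per round, which is how the threshold-controlled search avoids enumerating every subset of tools.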
Key Experimental Results
Main Results (Molecular Design, ChemLLMBench)
| Method | Exact Match | BLEU | FTS |
|---|---|---|---|
| ChemDFM-13B | 0.32 | 0.85 | 0.74 |
| Text+Chem T5 | 0.32 | 0.85 | 0.82 |
| GPT-4o | 0.01 | 0.57 | 0.54 |
| ChemAmp | 0.42 | 0.88 | 0.84 |
Ablation Study
| Configuration | Relative Result | Note |
|---|---|---|
| Stage 1 only | Improved | Single-tool enhancement is effective |
| Stage 1 + Stage 2 | Best | Cross-composite collaboration yields further gains |
| Vanilla multi-agent | Worse | Naive stacking underperforms structured composition |
| Token cost | 94% reduction | vs. vanilla multi-agent system |
Key Findings
- ChemAmp comprehensively outperforms chemistry-specific models, general-purpose LLMs, and conventional agent orchestration systems across four core chemistry tasks.
- Inference token cost is only 6% of that of vanilla multi-agent systems, demonstrating exceptional efficiency.
- A bottom-up composition strategy outperforms top-down orchestration strategies.
- Exact match on molecular design improves from the prior SOTA of 0.32 to 0.42 (+31%), validating the practical effectiveness of tool amplification.
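The relative figures quoted in these findings follow directly from the reported numbers. As a quick check, using only the exact-match values from the results table and the stated 94% token reduction:

\[
\frac{0.42 - 0.32}{0.32} = 0.3125 \approx 31\%, \qquad 1 - 0.94 = 0.06 = 6\%
\]

The first term is the relative exact-match gain over the prior best (0.32); the second is the share of the vanilla multi-agent token cost that remains after the reduction.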
Highlights & Insights
- Paradigm Innovation: The distinction between "tool amplification" and "tool orchestration" is clear and compelling, representing a shift from cross-task scheduling to within-task enhancement.
- Efficiency and Effectiveness: Surpassing SOTA while reducing inference token cost by 94% demonstrates that structured composition is far more efficient than brute-force stacking.
- Generality: Although applied to chemistry, the tool amplification paradigm is transferable to other scientific domains.
- Low Data Requirement: Composition optimization with ≤10 samples ensures strong practical utility.
Limitations & Future Work
- Dependence on GPT-4o as the core agent: The effectiveness of the composition strategy may be constrained by the capabilities of the underlying LLM.
- Evaluation limited to 100 instances on ChemLLMBench: The test scale is relatively small.
- Chemistry-domain specificity: Applicability to other scientific domains requires further validation.
- Future directions include extending the framework to broader scientific domains, investigating the interpretability of composition strategies, and reducing reliance on closed-source LLMs.
Related Work & Insights
- vs. ChemCrow / Coscientist: Representative tool orchestration systems effective at cross-task scheduling but incapable of enhancing within-task performance.
- vs. ChemToolAgent: Supports large tool sets and dynamic selection but remains within the orchestration paradigm.
- vs. AgentPrune / GPTSwarm: Automated workflow optimization without atomic tool-level enhancement.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The "tool amplification" paradigm is novel and persuasive; the two-stage encapsulation engine is elegantly designed.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across four chemistry tasks with ablation and efficiency analysis, though test scale is limited.
- Writing Quality: ⭐⭐⭐⭐ The orchestration-vs-amplification distinction figure is clear, and the algorithmic descriptions are complete.
- Value: ⭐⭐⭐⭐ Provides a new perspective for tool enhancement in scientific AI; the dual gains in efficiency and effectiveness carry practical deployment value.