ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents

Conference: ACL 2025
arXiv: 2410.17657
Code: https://github.com/BlueZeros/ReflecTool
Area: Medical NLP
Keywords: Tool-Augmented Agent, Clinical Agent, Reflective Learning, Long-term Memory, Medical AI

TL;DR

ReflecTool is a reflection-aware, tool-augmented clinical Agent framework. In the optimization stage it accumulates successful trajectories and tool-level experience; in the inference stage it retrieves similar cases and refines tool usage with a validator. On ClinicalAgent Bench (18 tasks), it outperforms pure LLMs by more than 10 points and existing Agent methods by about 3 points.

Background & Motivation

Background: LLMs show promising potential in the medical domain but are limited to text interactions, failing to handle multimodal information (medical images, EHRs, etc.) in clinical settings.

Limitations of Prior Work: Existing clinical Agents (e.g., EHRAgent, MMedAgent) target single scenarios with limited tool types, lacking generalizability across different clinical settings.

Key Challenge: Clinical environments require Agents to possess multidimensional capabilities including knowledge reasoning, multimodal understanding, numerical analysis, data understanding, and reliability, yet a unified evaluation benchmark and a general framework are lacking.

Goal: (1) Build a comprehensive clinical Agent evaluation benchmark. (2) Design a general Agent framework capable of learning and effectively using domain-specific tools.

Key Insight: Accumulate tool usage experiences by self-reflecting on successful/failed trajectories, and store successful cases in a long-term memory.

Core Idea: Enable the Agent to "practice" tool usage and accumulate experience during an optimization stage, and retrieve similar successful cases along with tool-level experiences to guide decision-making during the inference stage.

Method

Overall Architecture

The input is a clinical task: a question plus input data in various formats. The Agent has access to a toolbox of 15 clinical tools and works in two stages. In the optimization stage, it accumulates experience through trial and error on a small training set. In the inference stage, it retrieves similar cases, applies tool-wise verification, and outputs the answer. The whole process is based on ReAct-style multi-step reasoning.
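The ReAct-style loop mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Policy` interface, the `"Finish"` convention, and the toy `bmi_tool` are all hypothetical stand-ins for the LLM and the clinical toolbox.

```python
from typing import Callable, Dict, List, Tuple

# One step of the trace: (thought, tool name, tool input, observation).
Step = Tuple[str, str, str, str]
# Hypothetical policy interface: stands in for the LLM that picks the next action.
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]
Tool = Callable[[str], str]


def react_loop(question: str, policy: Policy, tools: Dict[str, Tool],
               max_steps: int = 5) -> Tuple[str, List[Step]]:
    """Alternate thought -> action -> observation until the policy finishes."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, tool, tool_input = policy(question, history)
        if tool == "Finish":  # assumed convention for returning the final answer
            return tool_input, history
        run = tools.get(tool)
        observation = run(tool_input) if run else f"Unknown tool: {tool}"
        history.append((thought, tool, tool_input, observation))
    return "", history  # step budget exhausted without an answer


def bmi_tool(tool_input: str) -> str:
    """Toy tool: 'weight_kg,height_m' -> BMI rounded to one decimal."""
    weight, height = (float(x) for x in tool_input.split(","))
    return str(round(weight / height ** 2, 1))


def toy_policy(question: str, history: List[Step]) -> Tuple[str, str, str]:
    """Toy policy: call the calculator once, then return its observation."""
    if not history:
        return ("I need the BMI.", "BMICalculator", "70,1.75")
    return ("The tool returned the value.", "Finish", history[-1][3])


answer, trace = react_loop("What is the BMI?", toy_policy,
                           {"BMICalculator": bmi_tool})
print(answer, len(trace))
```

In a real system the policy would be an LLM prompted with the question, the tool descriptions, and the history so far; the loop structure stays the same.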

Key Designs

ClinicalAgent Bench (CAB):
- Function: Provides a comprehensive clinical Agent evaluation framework.
- Mechanism: Covers 18 tasks across 5 dimensions (knowledge reasoning, multimodal, numerical analysis, data understanding, reliability), supported by 15 clinical tools.
- Design Motivation: Existing medical Agent benchmarks only cover single scenarios, making it impossible to comprehensively evaluate clinical Agent capabilities.
Optimization Stage:
- Function: Accumulates tool usage experience and successful cases on a small-scale sample dataset.
- Mechanism: The Agent first attempts the problem, producing a trajectory \(\mathcal{C}_1\). It then self-reflects against the ground truth to generate improvement suggestions \(\mathcal{S}\) and produces a revised trajectory \(\mathcal{C}_2\) accordingly. If this attempt succeeds, the trajectory is saved to the long-term memory \(\mathcal{M}\), and tool-level experiences \(\mathcal{E}\) are extracted by comparing the two trajectories.
- Design Motivation: Directly using tools yields poor results; through "trial-error-reflection-accumulation", the Agent learns the correct usage of domain-specific tools.
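The optimization loop described in these bullets can be sketched as below, under stated assumptions. The callables `solve`, `reflect`, `is_correct`, and `extract_experience` are hypothetical placeholders for LLM calls, and saving \(\mathcal{C}_2\) only when it matches the ground truth is one reading of the description, not a confirmed detail of the paper.

```python
from typing import Callable, List, Optional, Tuple

Trajectory = List[str]


def optimize(
    samples: List[Tuple[str, str]],  # (question, ground-truth answer)
    solve: Callable[[str, Optional[str]], Trajectory],
    reflect: Callable[[str, Trajectory, str], str],
    is_correct: Callable[[Trajectory, str], bool],
    extract_experience: Callable[[Trajectory, Trajectory], List[str]],
) -> Tuple[List[Tuple[str, Trajectory]], List[str]]:
    """Return (long-term memory M, tool-level experiences E)."""
    memory: List[Tuple[str, Trajectory]] = []
    experiences: List[str] = []
    for question, truth in samples:
        c1 = solve(question, None)                  # first attempt, C_1
        suggestions = reflect(question, c1, truth)  # suggestions S
        c2 = solve(question, suggestions)           # revised attempt, C_2
        if is_correct(c2, truth):
            memory.append((question, c2))
            experiences.extend(extract_experience(c1, c2))
    return memory, experiences


# Toy stand-ins so the sketch runs without an LLM.
def toy_solve(question: str, suggestions: Optional[str]) -> Trajectory:
    if suggestions is None:
        return ["Calculator(70 / 175**2)", "Answer: 0.0"]
    return ["Calculator(70 / 1.75**2)", "Answer: 22.9"]


def toy_reflect(question: str, trajectory: Trajectory, truth: str) -> str:
    return "Convert height from cm to m before calling Calculator."


def toy_is_correct(trajectory: Trajectory, truth: str) -> bool:
    return trajectory[-1] == f"Answer: {truth}"


def toy_extract(c1: Trajectory, c2: Trajectory) -> List[str]:
    # Keep only the tool calls that changed between the two attempts.
    return [f"Prefer '{b}' over '{a}'" for a, b in zip(c1, c2)
            if a != b and not b.startswith("Answer")]


memory, experiences = optimize(
    [("BMI for 70 kg, 175 cm?", "22.9")],
    toy_solve, toy_reflect, toy_is_correct, toy_extract,
)
print(len(memory), experiences)
```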
Inference Stage with Tool-wise Reflection:
- Function: Utilizes long-term memory and tool-level experience to guide reasoning.
- Mechanism: (1) BM25 retrieves similar successful cases from long-term memory as few-shot demonstrations. (2) A validator checks each action in one of two modes: Iterative Refinement (refines the action repeatedly until it stops changing or an iteration limit is reached) and Candidate Selection (samples multiple candidate actions and lets the validator select the best one).
- Design Motivation: Iterative Refinement is suitable for weaker models (gradual improvement), while Candidate Selection is suitable for stronger models (selecting the best among multiple plans), offering complementary adaptability.
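The two verification modes can be contrasted with a small sketch. The `refine`, `sample`, and `score` callables are hypothetical stand-ins for validator and sampling calls to an LLM, and picking the candidate with the highest validator score is an assumed selection rule. The default of 2 rounds or candidates is only an illustrative choice.

```python
from typing import Callable, List

Action = str


def iterative_refinement(action: Action,
                         refine: Callable[[Action], Action],
                         max_rounds: int = 2) -> Action:
    """Refine one action until it stops changing or the round limit is hit."""
    for _ in range(max_rounds):
        revised = refine(action)
        if revised == action:  # stable: the validator proposes no change
            break
        action = revised
    return action


def candidate_selection(sample: Callable[[int], Action],
                        score: Callable[[Action], float],
                        n: int = 2) -> Action:
    """Sample n candidate actions and keep the one the validator rates best."""
    candidates: List[Action] = [sample(i) for i in range(n)]
    return max(candidates, key=score)


# Toy validator: knows that height must be given in metres.
def toy_refine(action: Action) -> Action:
    return action.replace("height=175", "height=1.75")


def toy_score(action: Action) -> float:
    return 1.0 if "height=1.75" in action else 0.0


def toy_sample(i: int) -> Action:
    return ["Calculator(height=175)", "Calculator(height=1.75)"][i]


print(iterative_refinement("Calculator(height=175)", toy_refine))
print(candidate_selection(toy_sample, toy_score))
```

Iterative Refinement edits a single action in place, while Candidate Selection compares several independent proposals, which is one way to see why the two modes might suit models of different strength.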

Loss & Training

No fine-tuning of model parameters is required. The optimization stage requires only a few annotated samples (~200 in the experiments) to generate experiences via LLM self-reflection and store them in an external memory.
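Because everything learned lives in an external memory, retrieval at inference time can be plain lexical search. The sketch below implements a standard Okapi BM25 scorer over stored questions using only the standard library. The memory layout (a list of question and trajectory pairs), the whitespace tokenizer, and the `k1` and `b` defaults are assumptions made for illustration, not details taken from the paper or its code.

```python
import math
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of each document against the query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query: str, memory: List[Tuple[str, str]],
             top_k: int = 1) -> List[Tuple[str, str]]:
    """Return the top_k stored (question, trajectory) pairs for the query."""
    scores = bm25_scores(query, [question for question, _ in memory])
    ranked = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)
    return [memory[i] for i in ranked[:top_k]]


# Toy memory with placeholder trajectories.
toy_memory = [
    ("compute bmi from height and weight", "trajectory A"),
    ("describe findings in chest x-ray image", "trajectory B"),
    ("find lab results in ehr table", "trajectory C"),
]
print(retrieve("what is the bmi given weight and height", toy_memory))
```

The retrieved pairs would then be formatted as few-shot demonstrations in the Agent's prompt.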

Key Experimental Results

Main Results

Model / Method	Metric (Avg)	Comparison	Gain
Qwen2-7B (Pure LLM)	38.01	-	-
ReflecTool (Qwen2-7B, IR)	49.37	vs Pure LLM	+11.36
Reflexion (Qwen2-72B)	56.37	Strongest Baseline	-
ReflecTool (Qwen2-72B, CS)	59.66	vs Reflexion	+3.29
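As a quick consistency check, the Gain column follows directly from the averages in the table:

\[
49.37 - 38.01 = 11.36, \qquad 59.66 - 56.37 = 3.29
\]

The first difference compares ReflecTool with the pure LLM at the 7B scale; the second compares it with Reflexion at the 72B scale.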

Ablation Study

Configuration	Avg	Gain vs. ReAct Baseline
W/o Memory W/o Reflection (ReAct)	52.37	- (baseline)
+ Long-term Memory	54.86	+2.49
+ Long-term Memory + Tool-wise Experience (IR)	57.01	+4.64
+ Long-term Memory + Tool-wise Experience (CS)	60.28	+7.91
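The ablation numbers can be broken down into the contribution of each component. The short script below only re-derives differences from the averages in the table; the dictionary keys and the `gains` helper are introduced here for illustration.

```python
from typing import Dict

# Average scores copied from the ablation table above.
ablation: Dict[str, float] = {
    "ReAct": 52.37,
    "Memory": 54.86,
    "Memory + Experience (IR)": 57.01,
    "Memory + Experience (CS)": 60.28,
}


def gains(scores: Dict[str, float], reference: str) -> Dict[str, float]:
    """Difference of every configuration against a reference configuration."""
    ref = scores[reference]
    return {name: round(value - ref, 2)
            for name, value in scores.items() if name != reference}


vs_baseline = gains(ablation, "ReAct")  # total gain over the ReAct baseline
vs_memory = gains(ablation, "Memory")   # extra gain once memory is present

for name, delta in vs_baseline.items():
    print(f"{name}: {delta:+.2f} vs ReAct")
for name in ("Memory + Experience (IR)", "Memory + Experience (CS)"):
    print(f"{name}: {vs_memory[name]:+.2f} on top of memory")
```

Read this way, memory alone accounts for 2.49 points, and tool-wise experience adds a further 2.15 points with Iterative Refinement or 5.42 points with Candidate Selection.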

Key Findings

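Long-term memory and tool-wise experience both contribute: in the ablation, adding long-term memory to the ReAct baseline gains 2.49 points, and adding tool-wise experience on top brings the total gain to 4.64 points with Iterative Refinement and 7.91 points with Candidate Selection.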
Candidate Selection performs better with stronger models, while Iterative Refinement is more effective with weaker models.
Performance improves with more optimization steps, but with diminishing returns; it stabilizes at roughly 800 steps.
A verification count of \(n=2\) is usually optimal; too many verifications can be harmful.

Highlights & Insights

The idea of extracting "tool-level experience" from trajectory comparisons is ingenious and can be transferred to Agents in other domains (e.g., coding, finance).
The complementary findings of the two verification methods (iterative refinement vs. candidate selection) provide practical value: strategies can be automatically selected based on model capability.
The CAB benchmark is comprehensively designed, covering 18 tasks across 5 dimensions, filling the gap in comprehensive evaluation for clinical Agents.

Limitations & Future Work

The optimization stage requires ground-truth answers, which limits applicability in unlabeled scenarios.
The toolbox is predefined, lacking the capability to dynamically discover or create new tools.
Evaluated only on the Qwen2 series; generalizability to other models has not been fully verified.
The safety and interpretability of the Agent in clinical scenarios are not discussed in depth.

Related Work

vs Reflexion: Both use reflection, but Reflexion reflects at the behavioral level, whereas ReflecTool carries experience accumulation and verification down to the tool level.
vs EHRAgent/MMedAgent: These are designed for single scenarios, while ReflecTool is a general-purpose clinical Agent.
vs CRITIC: CRITIC uses self-verification to improve the output, while ReflecTool performs verification at each action step based on tool-wise experiences.

Rating

Novelty: ⭐⭐⭐⭐ Combining tool-level reflective experience with two verification modes is novel.
Experimental Thoroughness: ⭐⭐⭐⭐ 18 tasks across 5 dimensions + multiple baselines + ablation + detailed analysis.
Writing Quality: ⭐⭐⭐⭐ Clear framework diagram, standard and formal methodology description.
Value: ⭐⭐⭐⭐ Both the CAB benchmark and the tool-level reflection framework have practical value.