MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science

Conference: ICLR 2026 Oral
arXiv: 2506.04405
Code: Available
Area: Medical AI / Agent Training
Keywords: biomedical data science, agentic training, code-centric reasoning, reinforcement learning, Med-Copilot, LLM agent

TL;DR

This work introduces MedAgentGym, the first unified agentic training environment for biomedical data science, comprising 72,413 task instances spanning 12 real-world scenarios and 129 categories, equipped with an executable sandbox and verifiable ground truth. A systematic benchmark evaluation of 29 LLMs reveals a substantial gap between commercial and open-source models. By combining efficient multi-threaded trajectory sampling with offline and online RL, the authors train Med-Copilot, which improves over its base model by +43.02% (offline) and +45.28% (online) and attains performance competitive with GPT-4o.

Background & Motivation

Background: Biomedical data science encompasses genomic analysis, clinical data processing, medical image analysis, drug discovery, and other subfields, each demanding complex programming and domain-specific reasoning. While LLMs have demonstrated potential as coding assistants in general software engineering, systematic evaluation and training infrastructure for biomedical coding tasks remain lacking.

Limitations of Prior Work: (1) Existing medical AI benchmarks (e.g., MedQA, PubMedQA) are static multiple-choice or QA evaluations that do not support interactive code execution or iterative debugging. (2) No unified platform covers the diverse scenarios in biomedical data science — genomics, clinical informatics, imaging, and drug discovery each maintain separate, siloed benchmarks. (3) Open-source LLMs exhibit a significant performance gap relative to closed-source models (e.g., GPT-4o) on biomedical coding tasks, necessitating effective training methods to narrow this gap.

Key Challenge: Training an agent capable of writing biomedical analysis code requires a large-scale interactive task environment, yet constructing such an environment is prohibitively costly — demanding real data, ground truth annotations, secure sandboxes, and feedback mechanisms.

Goal: To simultaneously address environment construction and agent training by providing a large-scale training environment alongside an RL training pipeline.

Key Insight: The authors unify 12 real-world biomedical scenarios into a standardized format — input data + task description → executed code → verified output — supporting interactive feedback and automated scoring.

Core Idea: A large-scale, interactive, unified training environment combined with an RL training pipeline closes the gap between open-source models and closed-source LLMs on biomedical coding tasks.

Method

Overall Architecture

MedAgentGym comprises three core components: (1) Task Repository: 72,413 task instances, each containing data files, task descriptions, an executable sandbox, ground truth answers, and a scoring function. (2) Interaction Engine: Agents interact with the sandbox through multi-turn dialogue — submitting code, receiving execution results or error messages, and iteratively refining solutions. (3) Training Pipeline: Efficient multi-threaded trajectory generation supporting both offline and online RL training.
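The standardized task format in the Task Repository can be sketched roughly as follows; the field and function names (`Task`, `exact_match_score`) are illustrative assumptions, not the actual MedAgentGym API.

```python
# Hypothetical sketch of a MedAgentGym-style task record: data files, a task
# description, a ground-truth answer, and a scoring function in [0, 1].
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Task:
    task_id: str
    scenario: str                        # one of the 12 scenarios
    data_path: str                       # path to the input data file(s)
    instruction: str                     # natural-language analysis objective
    ground_truth: Any                    # exact numeric answer or class label
    score: Callable[[Any, Any], float]   # score(y_hat, y) in [0, 1]

def exact_match_score(y_hat: Any, y: Any) -> float:
    """Simplest scoring function: 1.0 on exact match, else 0.0."""
    return 1.0 if y_hat == y else 0.0

task = Task(
    task_id="ehr_mortality_001",
    scenario="clinical_data_science",
    data_path="data/ehr_cohort.csv",
    instruction="Predict 30-day mortality and report the positive-class count.",
    ground_truth=42,
    score=exact_match_score,
)
print(task.score(42, task.ground_truth))  # exact match -> 1.0
```

Because every task exposes the same `(data, instruction, ground_truth, score)` interface, a single interaction engine and scoring harness can serve all 12 scenarios.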

Key Designs

  1. 12-Scenario × 129-Category Task Taxonomy:

    • Function: Covers 12 real-world scenarios including genomics (RNA-seq analysis, gene expression clustering), clinical data science (EHR prediction, survival analysis), medical imaging (pathology slide classification, X-ray detection), and drug discovery (molecular property prediction, ADMET analysis).
    • Mechanism: Each scenario defines a standardized interface — input (data file path + metadata) + task instruction (natural language description of the analysis objective) + ground truth (exact numeric answer or class label) + scoring function (\(\text{score}(\hat{y}, y) \in [0, 1]\)).
    • Design Motivation: Unifying multiple domains into a single platform enables agents to transfer and generalize across heterogeneous task types.
  2. Executable Sandbox + Interactive Feedback:

    • Function: Provides each task with an isolated Python execution environment (pre-installed with pandas, scikit-learn, biopython, etc.); agents receive stdout/stderr feedback upon code submission.
    • Mechanism: Agents interact for at most \(K\) rounds. At each round, the agent generates code \(c_t\) → the sandbox executes it and returns \((o_t, e_t)\) → the agent decides whether to revise or submit a final answer based on the output/error. The full trajectory is \(\tau = [(c_1, o_1, e_1), \ldots, (c_K, o_K, e_K)]\).
    • Design Motivation: Single-pass code generation yields low accuracy on many tasks that require debugging; interactive feedback allows agents to learn from errors.
  3. Multi-Threaded Trajectory Generation + RL Training:

    • Function: Samples interaction trajectories across multiple tasks in parallel, supporting both offline RL (learning from pre-collected trajectories) and online RL (learning through real-time environment interaction).
    • Mechanism:
      • Offline RL: Collects a large corpus of trajectories \(\{(\tau_i, r_i)\}\) from multiple LLMs, using ground truth scores \(r = \text{score}(\hat{y}, y)\) as rewards, and trains via DPO/rejection sampling; trajectories with reward above a success threshold \(\delta\) are selected as positive samples.
      • Online RL: The agent collects real-time trajectories through environment interaction and optimizes policy \(\pi_\theta\) via PPO/GRPO, with reward \(R(\tau) = \text{score}(\hat{y}_\tau, y)\).
    • Design Motivation: Offline RL is data-efficient (reusing existing trajectories), while online RL enables continuous exploration and improvement.
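The \(K\)-round sandbox interaction described above can be approximated in a few lines; `run_in_sandbox` (a fresh subprocess standing in for the real isolated sandbox) and the `agent.generate` interface are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of one rollout: the agent submits code c_t, the sandbox
# returns (o_t, e_t), and the loop repeats for at most K rounds.
import subprocess
import sys

def run_in_sandbox(code: str, timeout: int = 30):
    """Execute code in a fresh Python process; return (stdout, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout, proc.stderr

def rollout(agent, task, K: int = 5):
    """Collect one trajectory tau = [(c_t, o_t, e_t), ...], stopping early on success."""
    trajectory, feedback = [], ""
    for _ in range(K):
        code = agent.generate(task.instruction, feedback)  # c_t
        out, err = run_in_sandbox(code)                    # (o_t, e_t)
        trajectory.append((code, out, err))
        if not err:            # no error: treat the output as the final answer
            break
        feedback = err         # otherwise, revise based on the error message
    return trajectory
```

Running many such rollouts in parallel threads is what makes large-scale trajectory collection for offline/online RL tractable.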

Loss & Training

  • Backbone: Llama-3.1-8B-Instruct serves as the base model for Med-Copilot.
  • Offline phase: Trajectories are collected from GPT-4o-mini, Claude-3.5-Sonnet, DeepSeek-V2.5, and other models.
  • Online phase: Med-Copilot interacts directly with the environment, updating its policy at each iteration.
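The offline data-selection step above can be sketched as follows; `select_positive_trajectories`, `build_dpo_pairs`, and the 0.99 threshold are illustrative assumptions, not the paper's actual code.

```python
# Rejection sampling: keep trajectories whose ground-truth score clears a
# success threshold; these become positive training samples.
def select_positive_trajectories(trajectories, rewards, threshold=0.99):
    """Keep trajectories whose reward exceeds the success threshold."""
    return [tau for tau, r in zip(trajectories, rewards) if r > threshold]

# DPO additionally needs rejected samples: pair each successful trajectory
# of a task with a failed trajectory from the same task.
def build_dpo_pairs(per_task, threshold=0.99):
    """per_task maps task_id -> list of (trajectory, reward) tuples."""
    pairs = []
    for runs in per_task.values():
        chosen = [t for t, r in runs if r > threshold]
        rejected = [t for t, r in runs if r <= threshold]
        pairs.extend((c, rj) for c in chosen for rj in rejected)
    return pairs
```

Because rewards come from the verifiable scoring functions rather than human labels, this filtering scales to the full 72K-task repository without annotation cost.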

Key Experimental Results

Main Results: Benchmark Evaluation of 29 LLMs

| Model Category | Representative Model | Avg. Score | Rank |
| --- | --- | --- | --- |
| Closed-source commercial | GPT-4o | ~0.55 | Top-1 |
| Closed-source commercial | Claude-3.5-Sonnet | ~0.50 | Top-3 |
| Open-source base | Llama-3.1-8B-Instruct | ~0.32 | Lower-mid |
| Med-Copilot (Offline RL) | Llama-3.1-8B + Offline RL | ~0.46 (+43.02%) | Near GPT-4o |
| Med-Copilot (Online RL) | Llama-3.1-8B + Online RL | ~0.46 (+45.28%) | Competitive with GPT-4o |

Ablation Study

| Configuration | Gain | Notes |
| --- | --- | --- |
| Offline RL only | +43.02% | Learns from multi-model trajectories |
| Online RL only | +45.28% | Self-directed exploration; marginally superior |
| Multi-turn vs. single-turn | Significant improvement | Demonstrates value of interactive feedback |
| Task difficulty stratification | Larger gains on easy tasks | Hard tasks retain room for improvement |

Key Findings

  • A substantial performance gap (~20 points) exists between commercial and open-source LLMs on biomedical coding tasks, but RL training can significantly narrow this gap.
  • Online RL marginally outperforms offline RL, yet both substantially surpass SFT baselines.
  • Multi-turn interaction (debugging loops) is critical — single-pass code generation achieves far lower success rates than iterative refinement.
  • Task difficulty varies considerably across biomedical scenarios: basic statistical analysis is relatively straightforward, whereas complex genomic pipeline analysis remains challenging.

Highlights & Insights

  • Unified Training and Evaluation: MedAgentGym serves not only as a benchmark (evaluating 29 LLMs) but also as a training environment with a directly usable RL pipeline — a first in the medical AI domain.
  • Empirical Proof of Gap Closure: An 8B open-source model trained with RL achieves GPT-4o-level performance, which is of substantial practical value for privacy-sensitive medical settings (local deployment vs. API calls).
  • Scalable Infrastructure: 72K tasks + multi-threaded trajectory sampling + standardized interfaces constitute a genuinely scalable training infrastructure.
  • Code-Centric Rather Than QA-Centric: Unlike multiple-choice benchmarks such as MedQA, MedAgentGym requires agents to write real, executable analysis code — more closely reflecting actual research practice.

Limitations & Future Work

  • The task suite is code-centric; knowledge-intensive capabilities such as clinical reasoning and diagnostic decision-making are underrepresented.
  • Ground truth requires predefined standard answers, making the framework unsuitable for open-ended exploratory research tasks.
  • Med-Copilot is only trained at the 8B scale; scaling results for larger models (70B+) are not reported.
  • Scoring functions are primarily based on exact match or numerical error; soft metrics such as code quality, readability, and efficiency are not assessed.
  • Out-of-distribution generalization beyond the training task distribution is not evaluated.
Comparison with Related Work

  • vs. MedQA/PubMedQA: These are static QA benchmarks with no code execution or interactive feedback; MedAgentGym supports multi-turn code interaction.
  • vs. SWE-bench: SWE-bench targets software engineering (bug fixing), whereas MedAgentGym targets biomedical data analysis — fundamentally different task natures.
  • vs. AgentBench: AgentBench covers diverse agent tasks but lacks medical focus; MedAgentGym provides deep coverage of biomedical scenarios.
  • vs. AMIE: AMIE and similar systems assess conversational medical reasoning, whereas MedAgentGym evaluates practical medical programming.

Rating

  • Novelty: ⭐⭐⭐⭐ — First unified agentic training environment for biomedical data science; the problem formulation is valuable, though the RL training methodology itself is not novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 72K tasks + systematic evaluation of 29 LLMs + offline/online RL comparison + Med-Copilot validation.
  • Writing Quality: ⭐⭐⭐⭐ — System description is clear; task taxonomy and experimental organization are well-structured.
  • Value: ⭐⭐⭐⭐⭐ — Provides critical infrastructure for biomedical AI agent research; the open-source environment offers long-term community value.