Training Software Engineering Agents and Verifiers with SWE-Gym¶

Conference: ICML 2025
arXiv: 2412.21139
Code: Yes (Publicly released SWE-Gym environments, models, and trajectory data)
Area: Code Intelligence
Keywords: Software Engineering Agents, SWE-Bench, Training Environments, Verifier, Inference-time scaling

TL;DR¶

This paper proposes SWE-Gym, the first environment designed for training software engineering (SWE) agents, containing 2,438 real-world task instances from 11 open-source Python repositories. By leveraging rejection sampling fine-tuning on SWE-Gym to train SWE agents and verifiers, it achieves resolve rates of \(32.0\%\) on SWE-Bench Verified and \(26.0\%\) on SWE-Bench Lite, setting a new SOTA for open-weight SWE agents.

Background & Motivation¶

Background: LLM-based SWE agents have demonstrated immense potential in automatically resolving GitHub issues, with SWE-Bench becoming the standard evaluation benchmark. However, the current state-of-the-art (SOTA) SWE agents rely heavily on closed-source models (such as GPT-4 and Claude), while the performance of open-source models lags far behind.

Limitations of Prior Work: Unlike fields such as mathematical reasoning or dialogue, which enjoy rich training datasets and environments, the software engineering domain severely lacks usable training environments. The training set of SWE-Bench only provides code patches (git diff) without step-by-step developer trajectories or executable environments.

Key Challenge: Real-world software engineering tasks require executable runtimes, correctly configured software dependencies, and reproducible test suites, so building such a training environment is extremely challenging. Existing datasets consist of either synthetic tasks (e.g., R2E) or isolated coding problems (e.g., APPS, HumanEval), which fail to represent realistic repository-level programming.

Goal: To construct the first real-world software engineering training environment to support the training and enhancement of open-source SWE agents through reinforcement learning and supervised fine-tuning.

Key Insight: Systematically extract task instances from GitHub PRs, manually configure executable environments and unit tests, and build a comprehensive environment suitable for training policy models and verifiers.

Core Idea: With a training environment capable of executing tests, a simple approach of "sampling successful trajectories + fine-tuning" can significantly boost open-source SWE agents, while also enabling the training of verifiers for inference-time scaling.

Method¶

Overall Architecture¶

SWE-Gym environment \(\rightarrow\) sample agent trajectories \(\rightarrow\) two major application directions: (1) Policy Training: perform rejection sampling fine-tuning on successful trajectories to enhance base agent capabilities; (2) Inference-Time Scaling: train a verifier (Outcome-Supervised Reward Model, ORM) to rerank multiple candidate trajectories and select the optimal solution.

Key Designs¶

SWE-Gym Environment Construction:
- Function: Build 2,438 real SWE task instances from GitHub PRs, each with an executable environment and unit tests.
- Mechanism:
  - Extract 64,689 raw task instances from 358 Python repositories (SWE-Gym Raw).
  - Filter down to 11 high-quality repositories and manually configure dependency environments for each version.
  - Use the SWE-Bench execution-based validation scripts to confirm that applying the gold patch makes more tests pass than before.
  - Provide a subset of 230 simpler instances, named SWE-Gym Lite, for rapid prototyping.
- Design Motivation: Built from repositories distinct from those in SWE-Bench to avoid data contamination. Construction took approximately 200 manual annotation hours and 10,000 CPU core hours, and produced 6 TB of publicly released Docker images.
- Novelty: The only dataset that concurrently exhibits four characteristics: authentic repository-level tasks, executable environments, real task descriptions, and designated training sets.
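
The gold-patch check in the mechanism above can be sketched as a comparison of test outcomes before and after the patch. This is a minimal illustration, not the SWE-Bench scripts themselves: the function names and the `{test_name: passed}` dictionary format are hypothetical.

```python
from typing import Dict, List


def fail_to_pass(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Tests that fail (or do not exist) before the gold patch and pass after it."""
    return sorted(t for t, ok in after.items() if ok and not before.get(t, False))


def pass_to_pass(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Tests that pass both before and after the gold patch."""
    return sorted(t for t, ok in after.items() if ok and before.get(t, False))


def is_valid_instance(before: Dict[str, bool], after: Dict[str, bool]) -> bool:
    """Keep an instance only if the gold patch increases the number of passing tests."""
    return sum(after.values()) > sum(before.values())


# Hypothetical test outcomes for one task instance.
before_patch = {"test_a": True, "test_b": False}
after_patch = {"test_a": True, "test_b": True, "test_c": True}
print(fail_to_pass(before_patch, after_patch))
print(is_valid_instance(before_patch, after_patch))
```

An instance whose gold patch leaves the pass count unchanged would be dropped under this rule, since it gives the agent no executable success signal.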

Agent Policy Training (General Prompting Mode):
- Function: Use OpenHands (a ReAct-based general agent framework) to sample successful trajectories and fine-tune Qwen2.5-Coder-Instruct.
- Mechanism: Rejection sampling fine-tuning—apply supervised training strictly on successful trajectories that pass unit tests.
- Key Data: Utilized only 491 successful trajectories (sampled using GPT-4o and Claude-3.5-Sonnet), with an average of roughly 19 interaction turns and 19,000 tokens per trajectory.
- Gain: The 32B model improved from \(3.0\%\) to \(15.3\%\) (\(+12.3\%\)) on SWE-Bench Lite, and from \(7.0\%\) to \(20.6\%\) (\(+13.6\%\)) on SWE-Bench Verified.
- Extra Findings: Fine-tuning significantly reduces model behaviors of getting "stuck in a loop" (repeating the same action more than three times), with the looping rate of the 7B model dropping from \(47.0\%\) to \(31.0\%\).
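
The "stuck in a loop" metric can be approximated with a short run-length check. The sketch below assumes that repeats must be consecutive and that actions are compared as exact strings. Neither detail is stated above, and the function names are hypothetical.

```python
from typing import Sequence


def is_stuck_in_loop(actions: Sequence[str], max_repeats: int = 3) -> bool:
    """True if some action occurs more than `max_repeats` times in a row."""
    run_length = 0
    previous = None
    for action in actions:
        run_length = run_length + 1 if action == previous else 1
        previous = action
        if run_length > max_repeats:
            return True
    return False


def loop_rate(trajectories: Sequence[Sequence[str]]) -> float:
    """Fraction of trajectories flagged as stuck in a loop."""
    if not trajectories:
        return 0.0
    return sum(is_stuck_in_loop(t) for t in trajectories) / len(trajectories)


# Hypothetical action sequences: the first repeats `ls` four times in a row.
print(loop_rate([["ls", "ls", "ls", "ls"], ["ls", "cat a.py", "edit a.py"]]))
```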

Agent Policy Training (Specialized Workflow Mode):
- Function: Conduct self-improvement training using MoatlessTools (a predefined workflow-based agent framework).
- Mechanism: Iterative rejection sampling fine-tuning—generate 30 high-temperature sample trajectories per round \(\rightarrow\) filter successful trajectories \(\rightarrow\) fine-tune the policy.
- Key Innovation—Per-Instance Capping: Limit each task to a maximum of 2 trajectories to prevent the dataset from being biased toward simpler tasks.
- Gain: The 7B model improved from \(7.0\%\) to \(10.0\%\) (after two iterations), and the 32B model improved from \(19.0\%\) to \(19.7\%\).
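
The iterative procedure above has roughly the following control flow. This is a skeleton only: `sample_fn` and `finetune_fn` are hypothetical callables standing in for agent rollouts and SFT, the `resolved` key is an assumed field, and the sketch assumes the 30 samples are drawn per task and that each round fine-tunes the current policy, neither of which is specified above.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

Policy = TypeVar("Policy")
Trajectory = Dict[str, object]


def iterative_rft(
    policy: Policy,
    tasks: Sequence[str],
    sample_fn: Callable[[Policy, str, int], List[Trajectory]],
    finetune_fn: Callable[[Policy, List[Trajectory]], Policy],
    rounds: int = 2,
    samples_per_task: int = 30,
) -> Policy:
    """Repeat: sample trajectories, keep those that pass the tests, fine-tune."""
    for _ in range(rounds):
        successes: List[Trajectory] = []
        for task in tasks:
            for trajectory in sample_fn(policy, task, samples_per_task):
                if trajectory["resolved"]:  # rejection step: keep passing runs only
                    successes.append(trajectory)
        policy = finetune_fn(policy, successes)
    return policy
```

Per-instance capping would slot in between the rejection step and the call to `finetune_fn`.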

Verifier Training and Inference-Time Scaling:
- Function: Train an Outcome-Supervised Reward Model (ORM) to estimate the success probability of a trajectory.
- Mechanism: Fine-tune 32B Qwen2.5-Coder-Instruct, with the input being the trajectory (issue description + agent action sequence + git diff), to predict <YES> or <NO>.
- Key Data: 2,636 trajectories (evenly split between success and failure), obtained from off-policy and on-policy sampling.
- Inference: Sample multiple candidate trajectories for each task, take the log probability of <YES> from the verifier as the score, and choose the trajectory with the highest score.
- Gain: Performance on SWE-Bench Verified rises from \(20.6\%\) to \(32.0\%\) (\(+11.4\%\)), with the resolve rate growing roughly log-linearly in the number of sampled candidates.
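
The reranking step above amounts to best-of-n selection on the verifier's score. The sketch below assumes the verifier exposes raw logits for the <YES> and <NO> tokens and normalizes over just those two. That normalization, the function names, and the candidate tuple layout are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple

# (patch, logit for <YES>, logit for <NO>) -- hypothetical layout
Candidate = Tuple[str, float, float]


def log_prob_yes(logit_yes: float, logit_no: float) -> float:
    """log P(<YES>) under a softmax restricted to the <YES>/<NO> logits."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    log_z = m + math.log(math.exp(logit_yes - m) + math.exp(logit_no - m))
    return logit_yes - log_z


def best_of_n(candidates: Sequence[Candidate]) -> str:
    """Return the patch whose trajectory receives the highest log P(<YES>)."""
    best = max(candidates, key=lambda c: log_prob_yes(c[1], c[2]))
    return best[0]


# Hypothetical verifier outputs for three sampled trajectories.
print(best_of_n([("patch_a", 0.0, 1.0), ("patch_b", 2.0, 0.0), ("patch_c", 1.0, 1.0)]))
```

Because only the ranking matters, any monotone transform of the score (probability instead of log probability) would select the same candidate.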

Loss & Training¶

Policy Model Training: Standard supervised fine-tuning (SFT) to minimize the language modeling loss on successful trajectories.

Verifier Training: Classification loss. Successful trajectories are labeled <YES> and failed ones <NO>, and the model is fine-tuned with the next-token prediction loss on that label token.
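
Written out, the two objectives might look as follows. This is a sketch in assumed notation, not the paper's: \(\mathcal{D}^{+}\) is the set of successful trajectories, \(\mathcal{D}\) the full set of labeled trajectories, \(x\) the issue description, \(a_t\) and \(o_t\) the agent action and environment observation at turn \(t\), and \(y_\tau\) the outcome label of trajectory \(\tau\). Restricting the policy loss to action tokens is also an assumption.

\[
\mathcal{L}_{\text{policy}}(\theta) = -\sum_{\tau \in \mathcal{D}^{+}} \sum_{t} \log \pi_\theta\left(a_t \mid x, a_{<t}, o_{<t}\right)
\]

\[
\mathcal{L}_{\text{verifier}}(\phi) = -\sum_{\tau \in \mathcal{D}} \log p_\phi\left(y_\tau \mid \tau\right), \qquad y_\tau \in \{\text{<YES>}, \text{<NO>}\}
\]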

Per-Instance Capping: Limit each task to a maximum of \(\text{cap}=2\) trajectories, prioritizing trajectories with fewer interaction turns to balance distributional bias and dataset scale.
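
A minimal sketch of this capping rule, assuming each trajectory is a dictionary with hypothetical `instance_id`, `resolved`, and `num_turns` fields:

```python
from collections import defaultdict
from typing import Dict, List


def cap_per_instance(trajectories: List[dict], cap: int = 2) -> List[dict]:
    """Keep at most `cap` successful trajectories per task, shortest first."""
    by_task: Dict[str, List[dict]] = defaultdict(list)
    for trajectory in trajectories:
        if trajectory["resolved"]:  # rejection sampling: drop failed runs
            by_task[trajectory["instance_id"]].append(trajectory)

    kept: List[dict] = []
    for instance_id in sorted(by_task):
        shortest_first = sorted(by_task[instance_id], key=lambda t: t["num_turns"])
        kept.extend(shortest_first[:cap])
    return kept


# Hypothetical samples: task "a" is easy (three successes), task "b" is hard (one).
samples = [
    {"instance_id": "a", "resolved": True, "num_turns": 5},
    {"instance_id": "a", "resolved": True, "num_turns": 3},
    {"instance_id": "a", "resolved": True, "num_turns": 9},
    {"instance_id": "b", "resolved": True, "num_turns": 4},
    {"instance_id": "b", "resolved": False, "num_turns": 2},
]
print([(t["instance_id"], t["num_turns"]) for t in cap_per_instance(samples)])
```

Without the cap, task "a" would contribute three of the four training examples. With `cap=2` the easy task still contributes more, but by a bounded margin.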

Key Experimental Results¶

Main Results (OpenHands Framework)¶

Model Size	Benchmark	Zero-shot Resolve Rate	Fine-tuned Resolve Rate	Gain
7B	SWE-Bench Lite	\(1.0\%\)	\(10.0\%\)	\(+9.0\%\)
14B	SWE-Bench Lite	\(2.7\%\)	\(12.7\%\)	\(+10.0\%\)
32B	SWE-Bench Lite	\(3.0\%\)	\(15.3\%\)	\(+12.3\%\)
7B	SWE-Bench Verified	\(1.8\%\)	\(10.6\%\)	\(+8.8\%\)
14B	SWE-Bench Verified	\(4.0\%\)	\(16.4\%\)	\(+12.4\%\)
32B	SWE-Bench Verified	\(7.0\%\)	\(20.6\%\)	\(+13.6\%\)

Verifier Inference-time Scaling¶

Configuration	SWE-Bench Verified	SWE-Bench Lite	Description
Fine-tuned Agent (\(t=0\))	\(20.6\%\)	\(15.3\%\)	Baseline
+ Verifier Reranking	\(32.0\%\)	\(26.0\%\)	\(+11.4\% / +10.7\%\)

MoatlessTools Self-Improvement¶

Setting	7B Resolve Rate	32B Resolve Rate	Description
Zero-shot	\(7.0\%\)	\(19.0\%\)	Specialized workflow baseline is higher than general prompting
Iteration 1	\(9.0\%\)	\(19.7\%\)	Self-improvement is effective
Iteration 2	\(10.0\%\)	\(19.7\%\)	32B model saturates

Ablation Study¶

Configuration	Key Metric	Description
Per-Instance Cap=1	Performance degradation	Dataset is too small
Per-Instance Cap=2	Optimal	Balances bias and data volume
No Cap	Slightly below Cap=2	Biased towards simpler tasks
Self-Improvement (OpenHands 32B)	\(15.3\% \rightarrow 8.7\%\)	Self-improvement is ineffective under general frameworks
Training Data Scaling	Linear improvement	491 trajectories not saturated; performance is constrained by sampling budget

Key Findings¶

Just 491 successful trajectories yield an absolute improvement of \(+12\text{--}14\%\) for the 32B model, and performance scales linearly with the number of training trajectories with no sign of saturation.
Fine-tuning significantly mitigates the behavior of getting "stuck in a loop," which is a key bottleneck when deploying open-source models as agents.
Inference-time scaling displays a roughly log-linear growth trend, which supports the effectiveness of the verifier.
Self-improvement is ineffective in the general prompting mode (performance actually drops), but is highly effective in the specialized workflow mode.
Specialized workflows are more amenable to open-source models: the 32B model achieves a \(19.0\%\) zero-shot resolve rate under MoatlessTools, compared to only \(3.0\%\) under OpenHands.

Highlights & Insights¶

Fills the gap in SWE agent training environments with a substantial engineering contribution (200 manual annotation hours + 10,000 CPU core hours + 6 TB of Docker images).
Simple rejection sampling fine-tuning brings large improvements, suggesting that the lack of an executable environment was itself the main bottleneck.
Per-Instance Capping is a simple yet effective optimization to rejection sampling fine-tuning.
The log-linear trend of verifier inference-time scaling mirrors findings in mathematical reasoning, suggesting that the paradigm generalizes.
The unsaturated training curve implies that a larger computing budget (more sampling) could yield greater gains.

Limitations & Future Work¶

Limited to Python repositories, excluding languages such as Java or TypeScript.
Self-improvement fails under general prompting frameworks; more advanced policy optimization methods (such as PPO) may be required.
Training data is predominantly sourced from GPT-4o and Claude (off-policy), resulting in limited pure self-improvement capability.
The 32B model saturates after two iterations under specialized workflows, possibly limited by the action-space constraints of the workflows.
High construction costs of the environment make it difficult to scale automatically to more repositories.

Related Work & Insights¶

Parallels the training paradigm shift in mathematical reasoning: progress there accelerated once training datasets and reward signals such as GSM8K and MATH became available, and SWE-Gym aims to replicate this trajectory in software engineering.
The verifier concept originates from Outcome Reward Models (ORM) in mathematical reasoning and is successfully adapted to the agent domain.
Complementary to rather than a replacement for SWE-Bench: SWE-Gym uses distinct repositories and focuses on training rather than evaluation.
Inspiration: Other agent tasks requiring environmental interaction (e.g., web navigation, OS operations) can adopt a similar methodology to build interactive training environments.

Rating¶

Novelty: ⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐⭐
Value: ⭐⭐⭐⭐⭐