Toward Training Superintelligent Software Agents through Self-Play SWE-RL¶
Conference: ICML 2026
arXiv: 2512.18552
Code: None
Area: LLM Agent / Software Engineering Agent / RL
Keywords: Self-play, SWE-RL, Bug Injection, Consistency Check, Curriculum Evolution
TL;DR¶
This paper proposes Self-play SWE-RL (SSR), where the same LLM acts as both a "proposer" (creating bugs) and a "solver" (fixing bugs) within sandboxed repositories. Using only Docker images as input and joint RL with rewards based on consistency checks and solve rates, SSR achieves self-improvement of +10.4 and +7.8 points on SWE-bench Verified and SWE-Bench Pro, respectively, consistently outperforming "human-data" baselines that rely on human-annotated issues and test suites.
Background & Motivation¶
Background: Current mainstream software engineering agents (SWE-agent, CWM, DeepSWE, Kimi-K2, etc.) are trained via RL with verifiable rewards. These reward signals originate from human-annotated issue descriptions combined with pass-to-pass / fail-to-pass test sets, typically exemplified by benchmarks like SWE-bench Verified.
Limitations of Prior Work: On one hand, human-annotated issues and tests are expensive and often unreliable (SWE-bench required a Verified subset via manual verification), making scaling difficult. On the other hand, even with RL, agents essentially "replay human development trajectories," making it hard to discover new problem classes or solutions, effectively locking the performance ceiling to human knowledge.
Key Challenge: To train "superhuman" software agents, one must eliminate dependence on human-annotated data/environments. However, completely "zero-data" self-play (e.g., Absolute Zero, R-Zero, LSP) is limited to the internal rules of Python interpreters and fails to learn real-world engineering knowledge that cannot be deduced from semantics alone. In other words, self-play must be "grounded in real codebases" to break through inherent knowledge boundaries.
Goal: Construct a software agent that evolves using only Docker images (source code + dependencies). The goals are threefold: (1) no reliance on human-written issues, test commands, or test parsers; (2) joint training of proposer and solver roles sharing a single set of LLM parameters; (3) continuous evolution of curriculum difficulty alongside the current policy.
Key Insight: Formalize the entire human development process—"running tests → injecting bugs → weakening tests → fixing bugs"—into a bug artifact generated by the agent itself. Five files can completely define a bug and its repair specifications. In this way, the "correctness" of a bug is objectively verifiable through test execution, requiring no natural language for reward signals.
Core Idea: Treat the codebase itself as the "game rules." Let the same LLM engage in self-play between bug-injection and bug-solving roles. Use the solve-rate to reward the proposer for creating bugs of "optimal difficulty" and reward the solver for "passing all tests." Combined with a mechanism that upgrades "failed solutions into higher-order bugs," the training distribution evolves naturally with the policy.
Method¶
Overall Architecture¶
The input consists of a set of sandboxed Docker images (containing only source code and dependencies, not assuming existing tests, test commands, parsers, or language/framework priors). The same LLM policy is instantiated into two roles via different prompts, sharing parameters and undergoing joint RL updates:
- Bug-injection agent: Explores the repository using Bash + editor tools in the sandbox → learns how to run tests → generates a 5-file bug artifact → undergoes consistency checks.
- Bug-solving agent: Receives the "inverse test-weakening patch" from the bug artifact as the sole specification (no natural language issue). It then produces a repair patch for the code where the bug was injected.
- Solver failure trajectories are recycled as "second-order bugs" to expand the training distribution.
- Proposer Reward = Consistency Check + Solver Solve-rate (encouraging "difficult but solvable" tasks). Solver Reward = Binary signal indicating if all tests passed.
The tool scaffolding reuses the Code World Model (CWM) implementation, and the base model is CWM-sft (32B, the checkpoint before CWM's RL phase) to ensure a fair comparison.
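The objects exchanged between the two roles can be sketched as plain data. This is a minimal illustration, not the authors' implementation: the class names `BugArtifact` and `SolverTask`, the field names, and the helper functions are hypothetical, and patches are treated as opaque strings rather than applied to a real repository. Deriving the solver's specification (the inverse of `test_weaken.diff`) is not modeled here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BugArtifact:
    """The five files produced by the bug-injection agent (field names are hypothetical)."""
    test_script_sh: str              # shell script that runs the tests
    test_files_txt: Tuple[str, ...]  # oracle test files reset before evaluation
    test_parser_py: str              # parser turning test output into pass/fail JSON
    bug_inject_diff: str             # patch that injects the bug
    test_weaken_diff: str            # patch that weakens/deletes tests to hide the bug


@dataclass(frozen=True)
class SolverTask:
    """Patches applied in order to the clean repo to reach the solver's start state."""
    patches: Tuple[str, ...]
    order: int  # 1 = first-order bug, 2 = second-order bug


def first_order_task(artifact: BugArtifact) -> SolverTask:
    # Start state: bug injected, then tests weakened.
    return SolverTask((artifact.bug_inject_diff, artifact.test_weaken_diff), 1)


def second_order_task(artifact: BugArtifact, failed_pred_patch: str) -> SolverTask:
    # A failed solver patch is stacked on top of the first-order state.
    base = first_order_task(artifact)
    return SolverTask(base.patches + (failed_pred_patch,), base.order + 1)
```

Keeping the patch sequence explicit makes it easy to see that a second-order task contains the original bug plus the solver's own failed edit, which is the recycling step listed above.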
Key Designs¶
- Bug Artifact Quintet + Consistency Check:
  - Function: Formalizes "what a bug is and how to judge its fix" into five files: `test_script.sh` (runs tests), `test_files.txt` (whitelist of oracle test files reset before evaluation), `test_parser.py` (parses outputs into JSON), `bug_inject.diff` (patch injecting the bug), and `test_weaken.diff` (weakens/deletes existing tests to hide the bug).
  - Mechanism: Each artifact must pass an automated validation suite to be considered valid: test files must exist and cover the weakening patch's range; the parser must reliably output pass/fail JSON; the original code must pass at least `min_passing_tests` tests under `test_script.sh`; `bug_inject.diff` must touch at least `min_changed_files` files; after injection, at least `min_failing_tests` previously passing tests must fail; and the weakening patch must restore some failing tests to passing. The most critical part is reverse mutation testing: each file in the bug patch is individually rolled back to the fixed version; if at least one failing test then passes, the file "contributes to the bug"; if a file fixes no failing test, the artifact is rejected. `test_files.txt` ensures oracle tests are reset during evaluation to prevent "hacking" the tests.
  - Design Motivation: Converts "what is a valid bug" from human judgment to executable verification, making the reward \(r_{\text{inject}}\) meaningful without human labels. Reverse mutation prevents the proposer from passing checks by adding irrelevant changes.
- Adaptive Reward + Higher-order Bug Curriculum Evolution:
  - Function: Ensures the proposer generates bugs of "just right" difficulty and upgrades failed states into new bugs to iterate the training distribution.
  - Mechanism: Given solver solve-rate \(s \in [0, 1]\), the proposer reward is \(r_{\text{inject}} = -1.0\) if the consistency check fails, \(-\alpha\) if the bug is valid but \(s=0\) or \(s=1\), and \(1-(1+\alpha)s\) when \(0 < s < 1\), with \(\alpha=0.8\). Inside the open interval the reward falls linearly from just under \(1\) toward \(-\alpha\) as \(s\) grows, so bugs that are solved only occasionally score highest, while unsolvable or trivially solved bugs receive a penalty (\(-\alpha\)) that is milder than the one for an invalid artifact (\(-1.0\)). The solver uses a simple binary reward \(r_{\text{solve}}=+1\) (all tests pass) / \(-1\) (otherwise). Higher-order bugs are constructed by applying `bug_inject.diff` + `test_weaken.diff` and then applying the solver's failed `pred_patch.diff` to form a new buggy state. The `.git` directory is removed to prevent leakage. This is limited to second-order bugs to avoid high overlap.
  - Design Motivation: Static datasets (e.g., SWE-smith, BugPilot) cannot scale in difficulty with the agent. Solve-rate rewards internalize "difficulty scheduling" into the policy, while higher-order bugs simulate real-world "regression" errors, teaching multi-step editing patterns.
- Bug-injection Strategy: Removal + History Rollback (No Natural-Language Issues):
  - Function: Ensures generated bugs are diverse and "non-trivial," avoiding degradation into single-line literal changes.
  - Mechanism: Uses two sampling prompts: (a) Removal-only: Requires the agent to delete whole files or blocks and fix compatibility so the repo still builds, forcing the solver to reconstruct missing functionality; (b) History-aware: Has the agent read git logs to pick and reverse meaningful changes, aligning bug patterns with real evolution. The solver's prompt contains no natural-language issue, only the "inverse of `test_weaken.diff`" as a formal spec. This is equivalent to "implement behavior that makes these weakened tests pass," and it sidesteps the hard problem of evaluating the quality of natural-language issues. Because SWE-bench Verified does use natural-language issues, downstream gains there are attributed to "learning to write code that passes tests" rather than in-domain leakage.
  - Design Motivation: Direct injection often collapses to one-line changes with weak signals. Removal forces "from scratch" reconstruction and repo-structure understanding, while history-aware injection introduces complex real-world patterns. A random mix of both yielded the best results.
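The reverse mutation test from the consistency check can be sketched as follows. This is a toy model under stated assumptions, not the paper's code: a repository is a dict from path to file content, `run_tests` is any callable returning the set of failing test names, and the function name `reverse_mutation_check` and the toy runner are hypothetical. A real check would apply diffs inside the sandbox and run `test_script.sh`.

```python
from typing import Callable, Dict, Set

Repo = Dict[str, str]                  # path -> file content (toy stand-in for a checkout)
RunTests = Callable[[Repo], Set[str]]  # returns the names of failing tests


def reverse_mutation_check(fixed: Repo, buggy: Repo, run_tests: RunTests) -> bool:
    """Accept the bug only if every changed file contributes to at least one failure."""
    failing = run_tests(buggy)
    if not failing:
        return False  # the injected change breaks nothing
    changed = sorted(p for p in set(fixed) | set(buggy) if fixed.get(p) != buggy.get(p))
    for path in changed:
        trial = dict(buggy)
        if path in fixed:
            trial[path] = fixed[path]  # roll this one file back to the fixed version
        else:
            trial.pop(path, None)      # file was added by the bug patch: remove it
        newly_passing = failing - run_tests(trial)
        if not newly_passing:
            return False               # this file is irrelevant padding
    return True


def toy_run_tests(repo: Repo) -> Set[str]:
    # Hypothetical test suite: one test per source file.
    failing = set()
    if repo.get("calc.py") != "add ok":
        failing.add("test_add")
    if "fmt.py" not in repo:
        failing.add("test_format")
    return failing


fixed_repo = {"calc.py": "add ok", "fmt.py": "format ok", "README": "v1"}
real_bug = {"calc.py": "add broken", "README": "v1"}  # breaks calc.py, deletes fmt.py
padded_bug = {"calc.py": "add broken", "fmt.py": "format ok", "README": "v2"}
```

In this toy setup `real_bug` is accepted because rolling back either changed file makes a failing test pass, while `padded_bug` is rejected because the `README` edit fixes nothing when reverted.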
Loss & Training¶
- Both roles share the same LLM parameters. Rewards \(r_{\text{inject}}\) and \(r_{\text{solve}}\) are applied to their respective trajectories for joint on-policy RL updates.
- Evaluation uses temperature=1.0, top-p=0.95, with only one run per task (no best-of-N, no reranker) to exclude test-time scaling interference.
- The baseline and SSR use identical environment images and hyperparameters; the only difference is the baseline's access to human issue descriptions and pass-to-pass / fail-to-pass tests.
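The two reward signals defined under Key Designs can be written down directly. The sketch below follows the piecewise definition and \(\alpha=0.8\) given there; the function names and the way the solve rate is estimated from a batch of solver rollouts are assumptions made for illustration.

```python
from typing import Sequence

ALPHA = 0.8  # value of alpha stated in the summary


def solve_reward(all_tests_pass: bool) -> float:
    """Binary solver reward: +1 if every test passes, -1 otherwise."""
    return 1.0 if all_tests_pass else -1.0


def solve_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of solver rollouts that passed all tests (assumed estimator)."""
    if not outcomes:
        raise ValueError("need at least one solver rollout")
    return sum(outcomes) / len(outcomes)


def inject_reward(consistent: bool, s: float, alpha: float = ALPHA) -> float:
    """Proposer reward as a function of the consistency check and solve rate s."""
    if not consistent:
        return -1.0                   # artifact failed the consistency check
    if s <= 0.0 or s >= 1.0:
        return -alpha                 # valid bug, but never solved or always solved
    return 1.0 - (1.0 + alpha) * s    # valid bug with 0 < s < 1


# Example: a valid bug solved in 1 of 4 rollouts.
rollouts = [True, False, False, False]
example_reward = inject_reward(True, solve_rate(rollouts))  # 1 - 1.8 * 0.25
```

Note that the proposer reward decreases as the solve rate rises within \(0 < s < 1\), so a bug solved in one of four rollouts is rewarded more than one solved in three of four.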
Key Experimental Results¶
Main Results¶
| Benchmark | Metric | Base (CWM-sft) | Baseline RL (human-data) | SSR (Ours) | SSR gain over base (pts) |
|---|---|---|---|---|---|
| SWE-bench Verified | resolve rate | Starting checkpoint | Consistently below SSR | +10.4 pts over base | +10.4 |
| SWE-Bench Pro (public) | resolve rate | Starting checkpoint | Consistently below SSR | +7.8 pts over base | +7.8 |
Key Observation: SSR outperformed the human-data baseline throughout training on both benchmarks, suggesting that self-generated tasks can provide richer and more effective learning signals than human-engineered data.
Ablation Study¶
| Configuration | Resolve Rate Trend | Explanation |
|---|---|---|
| Full SSR (Self-play) | Stable Rise, Optimal | Joint training of proposer + solver |
| Injection-only | Decline / No Gain | Only proposer trained; no fix signals |
| Repair-only | Improvement but < SSR | RL on static bug pool; lacks curriculum evolution |
| Direct-injection prompt | Worst | Bugs degrade to one-line changes |
| Removal-only prompt | Medium | Forces functional reconstruction |
| Removal + history | Optimal | History rollback introduces real complexity |
| Binary reward (ignore solver) | Slightly below full | Solve-rate signal is noisy; limited proposer gain |
Average resolve rates were calculated over 1,231 tasks (500 SWE-bench Verified + 731 SWE-Bench Pro).
Key Findings¶
- Self-play vs. Repair-only: Repair-only uses the SSR bug pool but lacks the "online curriculum" of upgrading bugs during training, performing significantly worse than full SSR. Real gains come from the distribution evolving with the policy.
- Creating bugs is learning: The proposer must learn to run tests, write parsers, and design weakening patches. These activities provide high-quality training signals.
- Solver-feedback Reward Margin: Adding solve-rate to \(r_{\text{inject}}\) yielded only marginal gains over consistency checks alone, which is attributed to noise in the solve-rate signal. Even without solver feedback, the online policy updates still drive the curriculum.
- Generalization without Natural-Language Issues: Despite being trained solely on "inverse test-weakening patches," the solver improved on SWE-bench Verified's real natural-language issues, suggesting it learned the underlying capability of "writing code to pass tests."
Highlights & Insights¶
- Formalizing bugs instead of issues is the smartest design choice: Natural-language issues are hard to evaluate and scarce. SSR formalizes "what a bug is" via five executable files, making reward generation entirely mechanical.
- Reverse mutation testing is an elegant anti-hacking measure: It prevents the proposer from deceiving checks with irrelevant diffs by verifying that each file actually contributes to the bug state.
- High-order bugs turn failures into assets: Recycling failed solver patches naturally simulates "overlapping errors" in multi-step editing, a mechanism applicable to any self-play framework.
- Grounded self-play is the differentiator: The "human who only knows the Python interpreter vs. one who can read GitHub" thought experiment illustrates the ceiling of ungrounded self-play (Absolute Zero). The paper argues that grounding in real repositories is a prerequisite for superhuman capability.
Limitations & Future Work¶
- Limitations: Evaluation was limited to 1 attempt without test-time scaling; a performance gap remains compared to top closed-source systems. Evolution beyond second-order bugs was not verified due to overlap concerns.
- Methodological: Bugs are generated by "breaking existing code," which stays within the "existing semantic manifold." Agents do not yet spontaneously propose creative tasks like "this repo needs feature X."
- Rewards: The solve-rate feedback for the proposer is weak. Future work could explore denser rewards based on "information gain" or "learning progress."
- Improvements: Introduce cross-repository generalization tests (train on Repo A, evaluate on Repo B). De-duplication for higher-order bugs could utilize semantic hashing instead of simple avoidance.
Related Work & Insights¶
- vs. SWE-RL (Wei et al., 2025): SWE-RL uses RL + rule-based rewards but relies on human data (GitHub PRs). SSR eliminates this dependency, relying solely on sandboxed images.
- vs. SWE-smith / BugPilot (Yang 2026a / Sonwane 2025): These use LLMs to synthesize bugs but rely on stronger human priors (teacher models, specific test suites) and are static pipelines. SSR is online and minimizes priors.
- vs. Absolute Zero / R-Zero / LSP (Zhao 2025 / Huang 2025 / Kuba 2025): These are ungrounded and limited to fixed rule spaces (interpreters). SSR breaks this through repository grounding.
- vs. SPICE (Liu 2025): SSR can be seen as the software engineering realization of SPICE's corpus-grounded self-play, using Docker images to provide executable feedback.
- vs. CWM (FAIR CodeGen 2025): SSR shows that, starting from CWM-sft as the base, self-play can improve resolve rates further without additional human data.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First to make grounded self-play work for SWE agents. The formal bug artifact and reverse mutation are significant methodological contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐ Solid dual-benchmark evaluation + key ablations, though lacking cross-repo generalization and scale-up curves.
- Writing Quality: ⭐⭐⭐⭐⭐ Extremely clear motivation (the "interpreter" thought experiment is excellent), well-illustrated method, and rigorous definitions.
- Value: ⭐⭐⭐⭐⭐ Provides a complete, executable, and scalable paradigm for software agents to bypass the human data ceiling.