Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning¶
Conference: ICLR 2026 arXiv: 2504.17192 Code: github.com/going-doer/Paper2Code Area: Code Intelligence Keywords: paper-to-code, multi-agent framework, repository-level code generation, scientific reproducibility, LLM
TL;DR¶
This work proposes PaperCoder, a multi-agent LLM framework that automatically converts machine learning papers into executable code repositories via a three-stage pipeline: Planning, Analysis, and Coding. In human evaluation, 88% of the generated repositories are rated best by the original paper authors, and the framework substantially outperforms baselines on the PaperBench benchmark.
Background & Motivation¶
- Reproducibility is central to scientific progress, yet only ~19.5% of papers at top ML venues in 2024 release their code, forcing researchers to reverse-engineer methods from the paper text at considerable cost.
- LLMs have demonstrated strong capabilities in understanding scientific documents and generating high-quality code, but existing scientific workflow automation approaches (e.g., ideation, experiment improvement) typically rely on pre-existing code implementations.
- Repository-level code generation is far more complex than single-file generation, requiring coordinated architectural design, module structure, and cross-file dependency management.
- Papers are written to communicate ideas to humans, containing high-level motivation and narrative content that is noisy, loosely structured, and ambiguous from a software engineering perspective.
- Existing multi-agent code generation frameworks (ChatDev, MetaGPT) adopt bottom-up strategies, incrementally expanding from short requirement descriptions, making them ill-suited to handle lengthy scientific documents.
- Core challenge: Starting solely from a paper (without code, APIs, or supplementary materials), can a faithful code implementation be generated?
Method¶
Overall Architecture¶
PaperCoder emulates the lifecycle of a typical human development project, decomposing the paper-to-code task into three structured stages: Planning, Analysis, and Coding.
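As a mental model, the decomposition can be pictured as three chained transformations. The skeleton below is a sketch invented for this note, with stubbed bodies; none of the function names come from the PaperCoder codebase.

```python
# Illustrative skeleton of the three-stage pipeline; all names are
# invented for this note, not PaperCoder's actual API.

def run_planning(paper: str) -> dict:
    """Stage 1: produce the overall plan, architecture, logic design, config."""
    return {"files": ["dataset.py", "model.py", "train.py"]}  # stub

def run_analysis(paper: str, plan: dict) -> dict[str, str]:
    """Stage 2: one file-level specification per planned file."""
    return {f: f"spec for {f}" for f in plan["files"]}  # stub

def run_coding(paper: str, plan: dict, analyses: dict[str, str]) -> dict[str, str]:
    """Stage 3: generate source files in dependency order."""
    return {f: f"# implementation of {f}\n" for f in analyses}  # stub

def paper_to_repo(paper: str) -> dict[str, str]:
    plan = run_planning(paper)
    analyses = run_analysis(paper, plan)
    return run_coding(paper, plan, analyses)  # filename -> source code
```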
Key Designs¶
1. Planning Stage¶
Comprises four sequential sub-components that progressively transform unstructured paper content into implementation-level abstractions:
- Overall Plan: Extracts a high-level summary of core components and functionality (model components, training objectives, data processing, evaluation protocol).
- Architecture Design: Defines the repository-level architecture — file list, class diagrams (static structure), and sequence diagrams (dynamic interactions).
- Logic Design: Specifies file dependencies and execution order, ensuring that code generation does not fail due to inter-file dependencies (e.g., generating file A before file B when B requires modules from A).
- Configuration Generation: Synthesizes a `config.yaml` containing hyperparameters, model settings, and runtime options, enabling researchers to easily adjust experimental configurations (a hypothetical example follows this list).
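To make the Configuration Generation step concrete, below is a hypothetical example of the kind of `config.yaml` the pipeline might emit, loaded with PyYAML; every key is invented for illustration and not prescribed by the paper.

```python
import yaml  # pip install pyyaml

# Hypothetical contents of a generated config.yaml; the keys below
# are invented for illustration, not prescribed by PaperCoder.
GENERATED_CONFIG = """
model:
  hidden_dim: 256
  num_layers: 4
training:
  optimizer: adam
  learning_rate: 3.0e-4
  batch_size: 64
  epochs: 100
data:
  dataset: cifar10
  num_workers: 4
"""

config = yaml.safe_load(GENERATED_CONFIG)
print(config["training"]["learning_rate"])  # prints 0.0003
```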
2. Analysis Stage¶
For each file \(f_i\) identified during planning, a detailed file-level analysis \(a_i\) is generated, covering: functional objectives, input/output behavior, intra- and inter-file dependencies, and algorithmic specifications derived from the paper.
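One way to picture the shape of a file-level analysis \(a_i\) is as a structured record whose fields mirror the four aspects just listed. The dataclass below is a sketch invented for this note, not a structure taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class FileAnalysis:
    """Sketch of a file-level analysis a_i; the class and its field
    names are invented here for illustration."""
    filename: str
    purpose: str                                          # functional objectives
    inputs: list[str] = field(default_factory=list)       # expected inputs
    outputs: list[str] = field(default_factory=list)      # produced outputs
    depends_on: list[str] = field(default_factory=list)   # intra-/inter-file deps
    algorithm_notes: str = ""                             # specs derived from the paper

example = FileAnalysis(
    filename="model.py",
    purpose="Define the encoder described in Section 3.1",
    inputs=["config.yaml: model.*"],
    outputs=["Encoder class"],
    depends_on=["utils.py"],
    algorithm_notes="Multi-head attention variant from Eq. (4)",
)
```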
3. Coding Stage¶
Files are generated sequentially in execution order, with each file conditioned on all accumulated context: the paper, the plan, the file's analysis, and every previously generated file. Writing \(c_i\) for the code of file \(f_i\), this is schematically \(c_i = \mathrm{LLM}\big(\text{paper}, \text{plan}, a_i, \{c_1, \dots, c_{i-1}\}\big)\).
This ensures that when generating the \(i\)-th file, the model is fully aware of its dependencies and the current state of the repository. A minimal sketch of this loop follows.
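In the sketch below, `generate_file` is a hypothetical stand-in for the actual LLM call; only the dependency-ordered control flow is the point.

```python
# Sketch of the dependency-ordered Coding stage. `generate_file` is a
# hypothetical placeholder for an LLM call, not PaperCoder's prompt.

def generate_file(paper: str, plan: str, analysis: str,
                  context: dict[str, str]) -> str:
    """Placeholder LLM call conditioned on all previously generated files."""
    return f"# code conditioned on {len(context)} prior file(s)\n"  # stub

def coding_stage(paper: str, plan: str,
                 ordered_files: list[str],
                 analyses: dict[str, str]) -> dict[str, str]:
    repo: dict[str, str] = {}
    for fname in ordered_files:  # execution order from the Logic Design step
        # c_i is conditioned on the paper, the plan, a_i, and c_1..c_{i-1}
        repo[fname] = generate_file(paper, plan, analyses[fname], dict(repo))
    return repo
```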
Loss & Training¶
- Non-training framework; a prompt-engineering-based multi-agent system.
- Uses o3-mini-high as the default backbone LLM.
- Supports a Self-Refine verify-and-refine step to further improve output quality at each stage (a minimal sketch of the loop follows this list).
- Evaluation: reference-based (when author code is available) + reference-free (when no code exists) + human evaluation (scored by first authors).
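The verify-and-refine control flow follows the standard Self-Refine pattern (Madaan et al.). In the sketch below, `critique` and `revise` are hypothetical stubs for LLM calls, not PaperCoder's actual prompts.

```python
# Generic verify-and-refine loop in the spirit of Self-Refine
# (Madaan et al.); `critique` and `revise` are hypothetical stubs.

def critique(artifact: str) -> str:
    """Ask the model to list concrete problems; '' means acceptable."""
    return ""  # stub

def revise(artifact: str, feedback: str) -> str:
    """Ask the model to rewrite the artifact, addressing the feedback."""
    return artifact  # stub

def self_refine(artifact: str, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        feedback = critique(artifact)
        if not feedback:  # verifier is satisfied; stop early
            break
        artifact = revise(artifact, feedback)
    return artifact
```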
Key Experimental Results¶
Main Results¶
Paper2CodeBench (90 papers from ICLR/ICML/NeurIPS 2024; model-based scores on a 1–5 scale, higher is better):
| Method | Ref-Based (ICLR) | Ref-Based (ICML) | Ref-Based (NeurIPS) | Ref-Free (ICLR) | Ref-Free (ICML) | Ref-Free (NeurIPS) |
|---|---|---|---|---|---|---|
| ChatDev | 2.70 | 2.97 | 2.96 | 4.00 | 4.12 | 4.01 |
| MetaGPT | 2.48 | 2.75 | 2.95 | 3.52 | 3.63 | 3.59 |
| Paper (one-shot) | 3.08 | 3.28 | 3.22 | 4.15 | 4.30 | 4.08 |
| PaperCoder | 3.68 | 3.72 | 3.83 | 4.73 | 4.73 | 4.77 |
| Oracle (author code) | N/A | N/A | N/A | 4.84 | 4.80 | 4.83 |
PaperBench Code-Dev (20 ICML 2024 papers):
| Method | Replication Score (o3-mini) | Replication Score (Claude 3.5) |
|---|---|---|
| BasicAgent | 5.1% | 35.4% |
| IterativeAgent | 16.4% | 27.5% |
| PaperCoder | 45.14% | 51.14% |
Ablation Study¶
| Cumulative Components | Ref-Based | Ref-Free |
|---|---|---|
| Paper only | 3.28 | 4.30 |
| + Overall Plan | 3.40 | 4.34 |
| + Arch. Design | 3.13 (↓) | 4.07 (↓) |
| + Logic Design | 3.60 | 4.50 |
| + Config File | 3.66 | 4.45 |
| + Analysis (Full) | 3.72 | 4.73 |
Key Findings¶
- PaperCoder approaches author-level quality: The Ref-Free score (~4.74) shows no statistically significant difference from the Oracle (~4.82).
- Advantage of the top-down strategy: Systematically analyzing the full paper before generation outperforms the bottom-up expansion strategies of ChatDev and MetaGPT.
- Logic Design is a critical turning point: Adding Architecture Design alone decreases performance (execution order ambiguity causes confusion), but performance recovers substantially once Logic Design is incorporated.
- Human evaluation consistency: 88% of PaperCoder-generated repositories are rated best by first authors, and 92% report them as "genuinely helpful."
- Executability: On average, only 0.81% of code lines require modification for successful execution.
Highlights & Insights¶
- This work defines and systematizes the novel "paper-to-code" task and constructs a comprehensive evaluation framework (Paper2CodeBench).
- The three-stage pipeline elegantly mirrors the human developer workflow of Plan → Analyze → Code, with each stage executed by a specialized agent.
- The evaluation framework is thorough: reference-based, reference-free, and first-author human evaluation are used in conjunction, with high correlation validated between model-based and human evaluation (r = 0.79).
- Self-Refine experiments demonstrate that refining early planning outputs propagates quality improvements downstream to code generation (Config File refinement yields the largest gain of +1.00).
Limitations & Future Work¶
- Heavy dependence on backbone LLM capability: open-source models (DeepSeek-Coder, Qwen-Coder) perform significantly worse than o3-mini-high, so practical accessibility is limited by proprietary API costs.
- Data processing achieves the lowest coverage — papers typically provide insufficient descriptions of data formats and preprocessing steps, which is the primary source of generation errors.
- Evaluation is limited to ML papers; generalizability to other scientific domains (physics, biology) remains unknown.
- Evaluation metrics are primarily LLM-based; while highly correlated with human scores, they do not fully substitute human judgment.
Related Work & Insights¶
- ChatDev / MetaGPT: Multi-agent code development frameworks employing bottom-up strategies, unsuitable for processing lengthy paper inputs.
- PaperBench (Starace et al.): A concurrent work providing an evaluation benchmark with human-annotated rubrics (20 ICML papers), focused on evaluation rather than methodology.
- Self-Refine (Madaan et al.): The verify-and-refine paradigm, integrated into PaperCoder's planning and analysis stages.
- Insight: Automating paper-to-code conversion can substantially accelerate scientific reproducibility, but requires a top-down methodology of "understand the whole before coding."
Rating¶
- ⭐ Novelty: 4/5 — New task definition (Paper2Code) + structured three-stage framework + comprehensive evaluation system
- ⭐ Experimental Thoroughness: 4.5/5 — Automatic evaluation on 90 papers + human evaluation on 21 papers + external validation on PaperBench + extensive ablations
- ⭐ Writing Quality: 4.5/5 — Clear structure, precise formalization, high-quality figures and tables
- ⭐ Value: 4.5/5 — Directly addresses reproducibility pain points in the ML community; code is open-sourced with broad potential impact