Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning¶
Conference: ICLR 2026 arXiv: 2504.17192 Code: github.com/going-doer/Paper2Code Area: Code Intelligence Keywords: paper-to-code, multi-agent framework, repository-level code generation, scientific reproducibility, LLM
TL;DR¶
This work proposes PaperCoder, a multi-agent LLM framework that automatically converts machine learning papers into executable code repositories via a three-stage pipeline: Planning, Analysis, and Coding. In human evaluation, 88% of the generated repositories are rated best by the original paper authors, and the framework substantially outperforms baselines on the PaperBench benchmark.
Background & Motivation¶
- Reproducibility is central to scientific progress, yet only ~19.5% of papers at top ML venues in 2024 release their code, forcing researchers to reverse-engineer methods from the paper text at considerable cost.
- LLMs have demonstrated strong capabilities in understanding scientific documents and generating high-quality code, but existing scientific workflow automation approaches (e.g., ideation, experiment improvement) typically rely on pre-existing code implementations.
- Repository-level code generation is far more complex than single-file generation, requiring coordinated architectural design, module structure, and cross-file dependency management.
- Papers are written to communicate ideas to humans, containing high-level motivation and narrative content that is noisy, loosely structured, and ambiguous from a software engineering perspective.
- Existing multi-agent code generation frameworks (ChatDev, MetaGPT) adopt bottom-up strategies, incrementally expanding from short requirement descriptions, making them ill-suited to handle lengthy scientific documents.
- Core challenge: Starting solely from a paper (without code, APIs, or supplementary materials), can a faithful code implementation be generated?
Method¶
Overall Architecture¶
PaperCoder emulates the lifecycle of a typical human development project, decomposing the paper-to-code task into three structured stages: Planning, Analysis, and Coding.
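As a mental model, the decomposition can be pictured as three chained transformations. The skeleton below is a sketch invented for this note, with stubbed bodies; none of the function names come from the PaperCoder codebase.

```python
# Illustrative skeleton of the three-stage pipeline; all names are
# invented for this note, not PaperCoder's actual API.

def run_planning(paper: str) -> dict:
    """Stage 1: produce the overall plan, architecture, logic design, config."""
    return {"files": ["dataset.py", "model.py", "train.py"]}  # stub

def run_analysis(paper: str, plan: dict) -> dict[str, str]:
    """Stage 2: one file-level specification per planned file."""
    return {f: f"spec for {f}" for f in plan["files"]}  # stub

def run_coding(paper: str, plan: dict, analyses: dict[str, str]) -> dict[str, str]:
    """Stage 3: generate source files in dependency order."""
    return {f: f"# implementation of {f}\n" for f in analyses}  # stub

def paper_to_repo(paper: str) -> dict[str, str]:
    plan = run_planning(paper)
    analyses = run_analysis(paper, plan)
    return run_coding(paper, plan, analyses)  # filename -> source code
```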
Key Designs¶
1. Planning Stage¶
Comprises four sequential sub-components that progressively transform unstructured paper content into implementation-level abstractions:
- Overall Plan: Extracts a high-level summary of core components and functionality (model components, training objectives, data processing, evaluation protocol).
- Architecture Design: Defines the repository-level architecture — file list, class diagrams (static structure), and sequence diagrams (dynamic interactions).
- Logic Design: Specifies file dependencies and execution order, ensuring that code generation does not fail due to inter-file dependencies (e.g., generating file A before file B when B requires modules from A).
- Configuration Generation: Synthesizes a `config.yaml` containing hyperparameters, model settings, and runtime options, enabling researchers to easily adjust experimental configurations (a hypothetical example follows this list).
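To make the Configuration Generation step concrete, below is a hypothetical example of the kind of `config.yaml` the pipeline might emit, loaded with PyYAML; every key is invented for illustration and not prescribed by the paper.

```python
import yaml  # pip install pyyaml

# Hypothetical contents of a generated config.yaml; the keys below
# are invented for illustration, not prescribed by PaperCoder.
GENERATED_CONFIG = """
model:
  hidden_dim: 256
  num_layers: 4
training:
  optimizer: adam
  learning_rate: 3.0e-4
  batch_size: 64
  epochs: 100
data:
  dataset: cifar10
  num_workers: 4
"""

config = yaml.safe_load(GENERATED_CONFIG)
print(config["training"]["learning_rate"])  # prints 0.0003
```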
2. Analysis Stage¶
For each file \(f_i\) identified during planning, a detailed file-level analysis \(a_i\) is generated, covering: functional objectives, input/output behavior, intra- and inter-file dependencies, and algorithmic specifications derived from the paper.
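One way to picture the shape of a file-level analysis \(a_i\) is as a structured record whose fields mirror the four aspects just listed. The dataclass below is a sketch invented for this note, not a structure taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class FileAnalysis:
    """Sketch of a file-level analysis a_i; the class and its field
    names are invented here for illustration."""
    filename: str
    purpose: str                                          # functional objectives
    inputs: list[str] = field(default_factory=list)       # expected inputs
    outputs: list[str] = field(default_factory=list)      # produced outputs
    depends_on: list[str] = field(default_factory=list)   # intra-/inter-file deps
    algorithm_notes: str = ""                             # specs derived from the paper

example = FileAnalysis(
    filename="model.py",
    purpose="Define the encoder described in Section 3.1",
    inputs=["config.yaml: model.*"],
    outputs=["Encoder class"],
    depends_on=["utils.py"],
    algorithm_notes="Multi-head attention variant from Eq. (4)",
)
```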
3. Coding Stage¶
Files are generated sequentially in execution order, with each file conditioned on all accumulated context: the paper, the plan, the file's analysis, and every previously generated file. Writing \(c_i\) for the code of file \(f_i\), this is schematically \(c_i = \mathrm{LLM}\big(\text{paper}, \text{plan}, a_i, \{c_1, \dots, c_{i-1}\}\big)\).
This ensures that when generating the \(i\)-th file, the model is fully aware of its dependencies and the current state of the repository. A minimal sketch of this loop follows.
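In the sketch below, `generate_file` is a hypothetical stand-in for the actual LLM call; only the dependency-ordered control flow is the point.

```python
# Sketch of the dependency-ordered Coding stage. `generate_file` is a
# hypothetical placeholder for an LLM call, not PaperCoder's prompt.

def generate_file(paper: str, plan: str, analysis: str,
                  context: dict[str, str]) -> str:
    """Placeholder LLM call conditioned on all previously generated files."""
    return f"# code conditioned on {len(context)} prior file(s)\n"  # stub

def coding_stage(paper: str, plan: str,
                 ordered_files: list[str],
                 analyses: dict[str, str]) -> dict[str, str]:
    repo: dict[str, str] = {}
    for fname in ordered_files:  # execution order from the Logic Design step
        # c_i is conditioned on the paper, the plan, a_i, and c_1..c_{i-1}
        repo[fname] = generate_file(paper, plan, analyses[fname], dict(repo))
    return repo
```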
Loss & Training¶
- Non-training framework; a prompt-engineering-based multi-agent system.
- Uses o3-mini-high as the default backbone LLM.
- Supports a Self-Refine verify-and-refine step to further improve output quality at each stage (a minimal sketch of the loop follows this list).
- Evaluation: reference-based (when author code is available) + reference-free (when no code exists) + human evaluation (scored by first authors).
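The verify-and-refine control flow follows the standard Self-Refine pattern (Madaan et al.). In the sketch below, `critique` and `revise` are hypothetical stubs for LLM calls, not PaperCoder's actual prompts.

```python
# Generic verify-and-refine loop in the spirit of Self-Refine
# (Madaan et al.); `critique` and `revise` are hypothetical stubs.

def critique(artifact: str) -> str:
    """Ask the model to list concrete problems; '' means acceptable."""
    return ""  # stub

def revise(artifact: str, feedback: str) -> str:
    """Ask the model to rewrite the artifact, addressing the feedback."""
    return artifact  # stub

def self_refine(artifact: str, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        feedback = critique(artifact)
        if not feedback:  # verifier is satisfied; stop early
            break
        artifact = revise(artifact, feedback)
    return artifact
```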
Key Experimental Results¶
Main Results¶
Paper2CodeBench (90 papers from ICLR/ICML/NeurIPS 2024; model-based scores on a 1–5 scale, higher is better):
| Method | Ref-Based (ICLR) | Ref-Based (ICML) | Ref-Based (NeurIPS) | Ref-Free (ICLR) | Ref-Free (ICML) | Ref-Free (NeurIPS) |
|---|---|---|---|---|---|---|
| ChatDev | 2.70 | 2.97 | 2.96 | 4.00 | 4.12 | 4.01 |
| MetaGPT | 2.48 | 2.75 | 2.95 | 3.52 | 3.63 | 3.59 |
| Paper (one-shot) | 3.08 | 3.28 | 3.22 | 4.15 | 4.30 | 4.08 |
| PaperCoder | 3.68 | 3.72 | 3.83 | 4.73 | 4.73 | 4.77 |
| Oracle (author code) | N/A | N/A | N/A | 4.84 | 4.80 | 4.83 |
PaperBench Code-Dev (20 ICML 2024 papers):
| Method | Replication Score (o3-mini) | Replication Score (Claude 3.5) |
|---|---|---|
| BasicAgent | 5.1% | 35.4% |
| IterativeAgent | 16.4% | 27.5% |
| PaperCoder | 45.14% | 51.14% |
Ablation Study¶
| Cumulative Components | Ref-Based | Ref-Free |
|---|---|---|
| Paper only | 3.28 | 4.30 |
| + Overall Plan | 3.40 | 4.34 |
| + Arch. Design | 3.13 (↓) | 4.07 (↓) |
| + Logic Design | 3.60 | 4.50 |
| + Config File | 3.66 | 4.45 |
| + Analysis (Full) | 3.72 | 4.73 |
Key Findings¶
- PaperCoder approaches author-level quality: The Ref-Free score (~4.74) shows no statistically significant difference from the Oracle (~4.82).
- Advantage of the top-down strategy: Systematically analyzing the full paper before generation outperforms the bottom-up expansion strategies of ChatDev and MetaGPT.
- Logic Design is a critical turning point: Adding Architecture Design alone decreases performance (execution order ambiguity causes confusion), but performance recovers substantially once Logic Design is incorporated.
- Human evaluation consistency: 88% of PaperCoder-generated repositories are rated best by first authors, and 92% report them as "genuinely helpful."
- Executability: On average, only 0.81% of code lines require modification for successful execution.
Highlights & Insights¶
- This work defines and systematizes the novel "paper-to-code" task and constructs a comprehensive evaluation framework (Paper2CodeBench).
- The three-stage pipeline elegantly mirrors the human developer workflow of Plan → Analyze → Code, with each stage executed by a specialized agent.
- The evaluation framework is thorough: reference-based, reference-free, and first-author human evaluation are used in conjunction, with high correlation validated between model-based and human evaluation (r = 0.79).
- Self-Refine experiments demonstrate that refining early planning outputs propagates quality improvements downstream to code generation (Config File refinement yields the largest gain of +1.00).
Limitations & Future Work¶
- Heavy dependence on backbone LLM capability: open-source models (DeepSeek-Coder, Qwen-Coder) perform significantly worse than o3-mini-high, so practical accessibility is limited by proprietary API costs.
- Data processing achieves the lowest coverage — papers typically provide insufficient descriptions of data formats and preprocessing steps, which is the primary source of generation errors.
- Evaluation is limited to ML papers; generalizability to other scientific domains (physics, biology) remains unknown.
- Evaluation metrics are primarily LLM-based; while highly correlated with human scores, they do not fully substitute human judgment.
Related Work & Insights¶
- ChatDev / MetaGPT: Multi-agent code development frameworks employing bottom-up strategies, unsuitable for processing lengthy paper inputs.
- PaperBench (Starace et al.): A concurrent work providing an evaluation benchmark with human-annotated rubrics (20 ICML papers), focused on evaluation rather than methodology.
- Self-Refine (Madaan et al.): The verify-and-refine paradigm, integrated into PaperCoder's planning and analysis stages.
- Insight: Automating paper-to-code conversion can substantially accelerate scientific reproducibility, but requires a top-down methodology of "understand the whole before coding."
Rating¶
- ⭐ Novelty: 4/5 — New task definition (Paper2Code) + structured three-stage framework + comprehensive evaluation system
- ⭐ Experimental Thoroughness: 4.5/5 — Automatic evaluation on 90 papers + human evaluation on 21 papers + external validation on PaperBench + extensive ablations
- ⭐ Writing Quality: 4.5/5 — Clear structure, precise formalization, high-quality figures and tables
- ⭐ Value: 4.5/5 — Directly addresses reproducibility pain points in the ML community; code is open-sourced with broad potential impact