SciCoQA: Quality Assurance for Scientific Paper–Code Alignment
- Conference: ACL 2026
- arXiv: 2601.12910
- Code: https://github.com/ukplab/scicoqa
- Area: Scientific Reproducibility / Paper-Code Consistency Verification
- Keywords: Paper-Code Discrepancy Detection, Scientific Reproducibility, Cross-Modal Verification, LLM Evaluation, Quality Assurance
TL;DR
SciCoQA is the first benchmark for detecting semantic discrepancies between scientific papers and their code implementations, containing 635 discrepancy instances (92 real + 543 synthetic). Evaluation of 22 LLMs reveals the strongest model detects only 46.7% of real discrepancies, uncovering a critical capability gap in automated scientific quality assurance.
Method
Key Designs
- Strict Discrepancy Definition with Three-Type Classification: Difference (code logic differs from paper), Paper Omission (code contains undescribed components), Code Omission (paper-described steps missing in code). Explicitly excludes bugs, CLI-resolvable hyperparameter differences, and standard engineering practices.
- Six-Category Impact Taxonomy: Algorithm, Model, Loss, Evaluation, Data, Training.
- Synthetic Data Generation Pipeline: Extends dataset from CS/AI to physics, statistics, and quantitative biology. Real-synthetic detection rate correlation reaches \(r = 0.94\).
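The three discrepancy types and six impact categories above could be encoded as a small annotation schema. The following is only a minimal sketch: the class names, field names, and the example record are hypothetical and are not taken from the SciCoQA codebase.

```python
from dataclasses import dataclass
from enum import Enum


class DiscrepancyType(Enum):
    """The three discrepancy types listed in the summary."""
    DIFFERENCE = "difference"          # code logic differs from the paper
    PAPER_OMISSION = "paper_omission"  # code contains components the paper does not describe
    CODE_OMISSION = "code_omission"    # paper-described steps are missing in the code


class ImpactCategory(Enum):
    """The six impact categories listed in the summary."""
    ALGORITHM = "algorithm"
    MODEL = "model"
    LOSS = "loss"
    EVALUATION = "evaluation"
    DATA = "data"
    TRAINING = "training"


@dataclass(frozen=True)
class Discrepancy:
    """One annotated discrepancy (hypothetical record layout)."""
    description: str
    dtype: DiscrepancyType
    category: ImpactCategory
    synthetic: bool  # True for generated instances, False for real ones


# Hypothetical example record, invented purely for illustration.
example = Discrepancy(
    description="Paper describes a warmup schedule that the training script never applies.",
    dtype=DiscrepancyType.CODE_OMISSION,
    category=ImpactCategory.TRAINING,
    synthetic=True,
)
print(example.dtype.value, example.category.value)
```

Keeping the type and the impact category as separate fields mirrors the summary, which presents them as two independent axes of annotation.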
Key Experimental Results
| Model | Precision | Recall | F1 |
|---|---|---|---|
| GPT-5 | 88.0 | 51.2 | 64.7 |
| Gemini 2.5 Pro | 94.6 | 41.1 | 57.3 |
- Recall is the core bottleneck: the discrepancies that models report are mostly correct (high precision), but many discrepancies go undetected
- Paper Omission is the hardest type to detect; longer inputs consistently degrade performance
- Data contamination significantly affects results: detection rates are lower on 2025 papers
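As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, \(F_1 = 2PR / (P + R)\). The sketch below recomputes it from the reported precision and recall, assuming all three columns are percentages; the function name `f1_score` is just an illustrative helper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results table.
reported = {
    "GPT-5": (88.0, 51.2, 64.7),
    "Gemini 2.5 Pro": (94.6, 41.1, 57.3),
}

for model, (p, r, f1) in reported.items():
    print(f"{model}: computed F1 = {f1_score(p, r):.1f}, reported F1 = {f1}")
```

Because the harmonic mean is pulled toward the smaller of the two values, Gemini 2.5 Pro's higher precision does not compensate for its lower recall, which is why its F1 falls below GPT-5's.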
Highlights & Insights
- Fills a critical gap by formalizing paper-code consistency verification as a benchmarkable NLP task
- "High precision, low recall" insight: in verification scenarios, missed discrepancies create a false sense of security, because a clean report does not mean the paper and code actually agree
Rating
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐