RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

Conference: NeurIPS 2025 | arXiv: 2505.12575 | Code: GitHub / HuggingFace | Area: LLM Reasoning / Mathematical Benchmark Evaluation | Keywords: mathematical reasoning, research-level mathematics, benchmark, data contamination, automatic evaluation

TL;DR

This paper introduces RealMath, a continuously refreshable benchmark that automatically extracts verifiable mathematics problems from arXiv papers and Math StackExchange, designed to evaluate LLMs on real-world research-level mathematical tasks.

Background & Motivation

Background: Existing mathematical reasoning benchmarks primarily draw from three categories: (1) curriculum/competition problems (GSM8K, MATH, AIME, IMO); (2) formal theorem proving (LeanDojo, MiniF2F); (3) expert-curated extremely difficult problems (FrontierMath, HLE).

Limitations of Prior Work: These benchmarks cover only a narrow slice of mathematical practice — competition mathematics ≠ research mathematics; formal proofs ≠ day-to-day mathematical research; and extremely hard problem sets probe only the outermost frontier of expert capability.

Key Challenge: Problems encountered in real mathematical research differ fundamentally from competition problems in structure, topic, and difficulty, yet existing benchmarks fail to reflect the practical value of LLMs as assistants in authentic research settings.

Goal: Construct a mathematical reasoning benchmark that reflects real research practice, supports automatic evaluation, and can be continuously refreshed to resist data contamination.

Key Insight: Automatically extract mathematical theorems with definitive answers from arXiv papers and Math StackExchange, and convert them into QA pairs.

Core Idea: An automated pipeline extracts "constructive theorems" (those with unique, exact answers) from academic papers and converts them into a continuously refreshable benchmark for evaluating LLMs on research-level mathematics.

Method

Overall Architecture

A five-stage automated data collection pipeline: paper retrieval → LaTeX source extraction → constructive theorem identification → QA pair generation → easy-question filtering.
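
As a concrete illustration of the retrieval stage, the snippet below pulls recent paper IDs from the public arXiv API. This is a minimal sketch: the category, sort parameters, and function name are illustrative choices, not the paper's exact configuration.

    import urllib.request
    import xml.etree.ElementTree as ET

    ARXIV_API = "http://export.arxiv.org/api/query"

    def fetch_recent_paper_ids(category: str = "math.NT", max_results: int = 100) -> list[str]:
        # Stage 1: query the arXiv Atom feed for the newest papers in one
        # category, newest first so fresh problems postdate training cutoffs.
        query = (
            f"search_query=cat:{category}"
            f"&sortBy=submittedDate&sortOrder=descending&max_results={max_results}"
        )
        with urllib.request.urlopen(f"{ARXIV_API}?{query}") as resp:
            tree = ET.parse(resp)
        ns = {"atom": "http://www.w3.org/2005/Atom"}
        return [
            entry.findtext("atom:id", namespaces=ns)
            for entry in tree.getroot().findall("atom:entry", ns)
        ]

    # Stage 2 then downloads each paper's LaTeX source from
    # https://arxiv.org/e-print/<id> and parses out theorem environments.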

Key Designs

  1. Paper Retrieval and Parsing: Mathematics-related papers are retrieved in bulk from the arXiv API (approximately 4,000 papers over a five-month window), with LaTeX source downloaded and parsed to preserve mathematical notation.
  2. Constructive Theorem Identification: An LLM (o3-mini) serves as a judge to screen extracted theorems for those possessing a unique, exact answer; theorems involving inequalities, multiple solutions, or non-constructive proofs are excluded. In the example run, 407 theorems are selected from 14,747 candidates (a minimal sketch of this judging step follows the list).
  3. QA Pair Generation: Selected theorems are converted into question–answer pairs; the context preceding each theorem in the paper (from the introduction up to the theorem statement) is retained as contextual input.
  4. Quality Filtering: An LLM review filters out samples whose answers are obvious or trivial, retaining 280 high-quality QA pairs for the example five-month window.
  5. Continuous Refresh: The pipeline runs continuously, generating 70+ new samples per month; because new problems postdate model training cutoffs, the benchmark naturally resists data contamination.
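
The constructive-theorem screen can be pictured as an LLM-as-judge call like the one below. The paper does use o3-mini as the judge, but the prompt wording and the KEEP/REJECT protocol here are assumptions for illustration, not the authors' actual prompt.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Illustrative prompt; the authors' actual judging prompt lives in their repo.
    JUDGE_PROMPT = """You are screening theorems for a math benchmark.
    Keep a theorem only if it asserts a single, exact numerical or symbolic
    value that can be verified automatically. Reject inequalities, existence
    statements, and theorems admitting multiple valid answers.

    Theorem (LaTeX):
    {theorem}

    Reply with exactly one word: KEEP or REJECT."""

    def is_constructive(theorem_latex: str) -> bool:
        # Stage 3: LLM-as-judge filter over the extracted theorems.
        resp = client.chat.completions.create(
            model="o3-mini",
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(theorem=theorem_latex)}],
        )
        return resp.choices[0].message.content.strip().upper().startswith("KEEP")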

Design Criteria

  • Authentic application focus: Sourced from real research papers rather than artificially constructed
  • Automatic verification: Only problems with exact numerical or symbolic answers are retained, so correctness can be checked mechanically (see the checker sketch after this list)
  • Continuous collection: Synchronized with newly published papers to avoid contamination
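
Because every retained problem has an exact answer, grading needs no human in the loop. The sketch below shows one minimal way to compare answers with SymPy, assuming both arrive as SymPy-parseable strings; the benchmark's own grader must also handle LaTeX and free-form model output, so treat this as an illustration of the principle rather than the actual verifier.

    import sympy

    def answers_match(model_answer: str, reference: str) -> bool:
        # Exact symbolic check: the difference of the two expressions must
        # simplify to zero. Fall back to string equality for answers that
        # SymPy cannot parse (e.g. tuples or words).
        try:
            diff = sympy.simplify(sympy.sympify(model_answer) - sympy.sympify(reference))
            return diff == 0
        except (sympy.SympifyError, TypeError):
            return model_answer.strip() == reference.strip()

    # Equivalent surface forms of the same exact answer count as correct:
    assert answers_match("1/2", "0.5")
    assert not answers_match("sqrt(2)", "2")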

Dataset Scale and Sources

Dataset              Time Span                            QA Pairs
Math.arXiv           2022.05–2022.09 + 2024.12–2025.03    633
CS.arXiv             2022.05–2023.10                      111
Math.StackExchange   2024.04–2025.03                      542

Pipeline Stage Statistics (Example: 4,000 Papers)

Stage                             Output
Paper retrieval                   4,000 papers
LaTeX source extraction           3,922 papers
Theorems extracted                14,747
Constructive theorems confirmed   407
QA pairs generated                401
After easy-question filtering     280

Key Experimental Results

Main Results

Model               Math.arXiv (%)   CS.arXiv (%)   Math.StackExchange (%)
o3                  49.1             44.1           70.7
o4-mini             43.4             42.3           70.8
Gemini 2.5-pro      32.5             25.2           60.9
DeepSeek-R1         30.5             31.5           62.2
Claude 3.7-Sonnet   34.1             31.5           61.1
Grok 3              29.5             25.2           54.8
Claude 3.5-Sonnet   18.3             16.2           37.6
Llama 3.1-405B      16.4             15.3           32.1
GPT-4o-mini         12.5             7.2            40.8

Difficulty-Stratified Analysis

Difficulty   o3 Accuracy
Easy         97.5%
Medium       81.4%
Hard         27.9%

Key Findings

  1. LLMs perform surprisingly well on research mathematics: o3 achieves 49.1% on Math.arXiv, substantially higher than its performance on extreme benchmarks such as FrontierMath, suggesting LLMs may already serve as valuable assistants to mathematicians.
  2. Models exhibit distinct capability profiles: o3 excels in highly theoretical areas such as representation theory and number theory, while Gemini 2.5-pro performs better on applied domains such as machine learning and optimization — the two models' strongest and weakest domains are nearly complementary.
  3. Context helps but is not strictly necessary: o4-mini achieves 21.6% on CS.arXiv without the paper context (vs. 42.3% with it), retaining roughly half its accuracy and indicating that models can infer some notation and concepts independently.
  4. No evidence of data contamination: Models perform better on 2025 papers than on 2022 papers; contamination would predict the opposite, since only the older papers can appear in training data.
  5. Fine-tuning yields limited gains: Fine-tuning GPT-4o-mini on 500 samples produces no accuracy improvement, suggesting the bottleneck lies in a lack of specialized mathematical knowledge and skills.
  6. Primary error types: Reasoning errors > conceptual misunderstandings > failure to identify key insights.

Highlights & Insights

  1. Paradigm innovation: This is the first work to propose an automated pipeline for constructing a mathematics benchmark from real academic papers, departing from the paradigms of competition problems and expert-curated construction.
  2. Sustainability: The continuously auto-refreshable design is more sustainable and reproducible than FrontierMath's private, fixed test set.
  3. Practical insight: The results reveal that LLMs may already be practically useful for research-level mathematics (~50% accuracy), whereas prior benchmarks such as FrontierMath (near 0%) conveyed an overly pessimistic signal.
  4. Data quality: A 94% valid-sample rate is achieved without manual annotation, yielding 633 high-quality samples from 9,000+ papers.
  5. Divergent capability profiles: Model performance varies substantially and complementarily across mathematical subfields; o3 and Gemini 2.5-pro exhibit nearly opposite strongest and weakest domains.
  6. Theorem-to-QA conversion design: Reformulating theorems as "context + QA" pairs preserves original mathematical notation and prerequisites, offering greater scalability than direct theorem-proving evaluation (a hypothetical record layout follows this list).
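
To make the "context + QA" format concrete, here is a hypothetical record layout. The field names are assumptions for illustration, not the released dataset's actual schema (see the HuggingFace release for that).

    from dataclasses import dataclass

    @dataclass
    class RealMathSample:
        # Hypothetical schema; field names are illustrative, not the dataset's.
        paper_id: str   # arXiv identifier of the source paper
        context: str    # LaTeX from the paper's introduction up to the theorem
        question: str   # the theorem restated as a direct question
        answer: str     # the unique, exact answer (numeric or symbolic)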

Limitations & Future Work

  1. Constructive problems only: Proof problems, inequalities, and open-ended questions — important components of mathematical research — are excluded.
  2. Dependence on paper correctness: The pipeline assumes theorems in arXiv papers are correct, though arXiv papers are not peer-reviewed.
  3. Lower quality of StackExchange data: User-submitted content frequently contains errors or ambiguous descriptions.
  4. Relatively high initial performance: Overall accuracy is already comparatively high (o3: 49%), raising the concern that the benchmark may saturate as models improve.
  5. Evaluation scope: The benchmark focuses solely on theorem verification, without covering proof generation, mathematical modeling, or other capabilities.

Comparison with Related Benchmarks

  • vs. FrontierMath: FrontierMath targets extremely hard problems with a private test set; RealMath spans the full difficulty range with a publicly refreshable collection.
  • vs. MathConstruct: Both adopt a constructive problem approach, but RealMath extracts problems directly from papers rather than designing them manually.
  • vs. SWE-bench: RealMath parallels the trend in software engineering of evaluating models on real-world tasks, constituting an "in the wild" evaluation for mathematics.

Rating

  • Novelty: ⭐⭐⭐⭐ (new paradigm for constructing benchmarks from real papers)
  • Experimental Thoroughness: ⭐⭐⭐⭐ (covers 10+ models, multiple data sources, and multi-dimensional analysis)
  • Writing Quality: ⭐⭐⭐⭐ (clear and fluent, with well-motivated arguments)
  • Value: ⭐⭐⭐⭐ (fills a gap in research-level mathematical evaluation; pipeline is continuously reusable)