MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy¶
Conference: AAAI 2026
arXiv: 2508.05592
Code: https://github.com/Jasaxion/MathSmith
Area: Reinforcement Learning
Keywords: Mathematical Reasoning, Synthetic Data, Reinforcement Learning, Large Language Models, Difficulty Control
TL;DR¶
This paper proposes MathSmith, a framework that generates hard mathematical problems by randomly sampling concept pairs from PlanetMath, applying 9 predefined difficulty strategies, and jointly optimizing structural validity, reasoning complexity, and answer consistency via GRPO-based reinforcement learning. The resulting high-difficulty synthetic problems significantly improve LLM mathematical reasoning on AIME and OlympiadBench.
Background & Motivation¶
Large language models have achieved remarkable progress in mathematical reasoning, yet their advancement is constrained by the following critical bottlenecks:
Scarcity of high-difficulty training data: Most high-quality mathematical problems are human-authored, limited in quantity, and unevenly distributed in difficulty. Models lack sufficient hard training data to push the upper bound of their reasoning capabilities.
Limitations of existing synthesis methods: Most mathematical problem synthesis approaches rely on extracting templates or structures from existing problems, followed by rewriting (MetaMath), augmentation (OpenMathInstruct), back-translation (MathGenie), or evolutionary transformation (WizardMath). These methods are inherently constrained by the distribution and structure of human-authored problems, lacking generative autonomy and precise difficulty control.
Data contamination risk: Methods based on transforming existing problems tend to produce problems similar to test sets, raising concerns about data contamination and the authenticity of performance gains.
Inspiration from the "Bitter Lesson": As Sutton noted, sustainable progress in AI should rely on general, compute-intensive methods rather than hand-crafted knowledge. Future reasoning agents should be capable of autonomously generating high-quality, highly challenging mathematical problems.
The core philosophy of MathSmith resembles that of a "mathematical blacksmith": starting from raw materials (mathematical concept-explanation pairs), it progressively forges complex and coherent mathematical problems entirely without relying on any existing human-authored problems.
Method¶
Overall Architecture¶
MathSmith comprises three core stages:
1. Concept-Explanation Collection: Collecting challenging mathematical concept-explanation pairs from PlanetMath
2. Supervised Fine-Tuning (SFT): Training basic generation capability using seed data generated by GPT-4o
3. Reinforcement Learning (RL): Optimizing problem format, difficulty, and correctness via a multi-objective reward function
An additional Weakness-Focused Improvement Pipeline is included to enhance model performance on specific concepts in a targeted manner.
Key Designs¶
- Concept-Explanation Collection: Mathematical pages are crawled from PlanetMath (a mathematical encyclopedia known for advanced mathematics and theoretical depth) and filtered to remove non-concept entries. GPT-4o then extracts the core concepts from each page, yielding a dataset of 11,000 mathematical concepts with explanations. PlanetMath is chosen because its concepts are inherently challenging, ensuring difficulty at the source. During generation, 5 concepts and their explanations are randomly sampled as input, completely independent of any existing mathematical problems, thereby avoiding data contamination.
- Nine Predefined Difficulty Strategies: Through analysis of the structural and cognitive characteristics of hard mathematical problems, 9 difficulty strategies are designed as soft constraints during generation: multi-step reasoning, cross-topic fusion, implicit or inverse logic, distractor construction, abstract modeling, multiple solution paths, advanced operations, extreme conditions, and non-standard representations. Each generated problem is required to incorporate at least 2 strategies to ensure sufficient complexity.
- SFT stage: Each generated sample consists of two components: a rationale part (exactly 5 reasoning steps describing the problem construction process) and a problem part (the final question). Approximately 8K cold-start samples are generated with GPT-4o to fine-tune Qwen3-8B, yielding MathSmith-SFT.
- Multi-Objective Reinforcement Learning Reward Function: The core innovation lies in designing a composite reward comprising three components:
(1) Structure reward \(r_{structure}\): Checks whether the output contains both the rationale and problem parts (\(r_{format} \in \{0,1\}\)), and whether the reasoning step count equals 5 (\(r_{step}\), maximized at 5 steps with decay for deviations). \(r_{structure} = \alpha_{format} \cdot r_{format} + \alpha_{step} \cdot r_{step}\), where \(\alpha_{format}=0.7\) and \(\alpha_{step}=0.3\).
(2) Reasoning complexity reward \(r_{complexity}\): A teacher model Qwen3-30B-A3B solves the generated problems, and the token length of its reasoning trajectory is used as an indirect estimate of difficulty: \(r_{complexity} = \frac{1}{K \cdot T_{max}} \sum_{i=1}^{K} \ell_{cot}^{(i)}\). The motivation is that more challenging problems tend to elicit significantly longer reasoning trajectories, and longer trajectories contain low-entropy intermediate tokens that provide more informative supervision signals during training.
(3) Answer consistency reward \(r_{consistency}\): \(K\) answers are sampled from the teacher model; if a majority answer exists (i.e., some answer appears more than \(K/2\) times), the reward is 1, otherwise 0. This encourages generation of clear, unambiguous problems.
Final reward: \(r_{total} = r_{structure} + \beta_{complexity} \cdot r_{complexity} + \beta_{consistency} \cdot r_{consistency}\), where \(\beta_{complexity}=0.7\) and \(\beta_{consistency}=0.3\).
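The three reward components and their combination can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the linear decay used for \(r_{step}\) is an assumed shape (the text only says the reward is maximized at 5 steps "with decay for deviations"), and `t_max` is assumed to be the teacher's maximum reasoning length in tokens.

```python
from collections import Counter
from typing import Sequence

# Weights as reported in the text above.
ALPHA_FORMAT, ALPHA_STEP = 0.7, 0.3
BETA_COMPLEXITY, BETA_CONSISTENCY = 0.7, 0.3
TARGET_STEPS = 5


def structure_reward(has_rationale: bool, has_problem: bool, n_steps: int) -> float:
    """r_structure = alpha_format * r_format + alpha_step * r_step."""
    r_format = 1.0 if (has_rationale and has_problem) else 0.0
    # ASSUMPTION: linear decay around 5 steps; the exact decay is not given.
    r_step = max(0.0, 1.0 - abs(n_steps - TARGET_STEPS) / TARGET_STEPS)
    return ALPHA_FORMAT * r_format + ALPHA_STEP * r_step


def complexity_reward(cot_lengths: Sequence[int], t_max: int) -> float:
    """Mean teacher reasoning length over K samples, normalized by T_max."""
    return sum(cot_lengths) / (len(cot_lengths) * t_max)


def consistency_reward(answers: Sequence[str]) -> float:
    """1 if some answer appears more than K/2 times, otherwise 0."""
    _, top_count = Counter(answers).most_common(1)[0]
    return 1.0 if top_count > len(answers) / 2 else 0.0


def total_reward(r_structure: float, r_complexity: float, r_consistency: float) -> float:
    """r_total = r_structure + beta_c * r_complexity + beta_s * r_consistency."""
    return (r_structure
            + BETA_COMPLEXITY * r_complexity
            + BETA_CONSISTENCY * r_consistency)


if __name__ == "__main__":
    # Made-up values for one generated problem with K = 5 teacher samples.
    r_s = structure_reward(has_rationale=True, has_problem=True, n_steps=5)
    r_c = complexity_reward([4000, 5000, 6000, 5000, 5000], t_max=10000)
    r_a = consistency_reward(["42", "42", "42", "7", "13"])
    print(round(total_reward(r_s, r_c, r_a), 2))
```

Dropping the consistency term in `total_reward` would correspond to the complexity-only reward described for the MathSmith-Hard variant.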
Loss & Training¶
GRPO (Group Relative Policy Optimization) is adopted to optimize the policy model \(\pi_\theta\). For each group of 5 concept inputs \(c\), \(G\) problems are generated, their composite reward scores \(R_i\) are computed, and then normalized into advantage estimates: \(\hat{A}_{i,t} = \frac{R_i - \text{mean}(\{R_j\})}{\text{std}(\{R_j\})}\), with updates performed via a PPO-style clipped objective (Equations 8–10) plus KL divergence penalty.
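The group-relative advantage above can be sketched as follows. This is an illustrative snippet rather than the verl implementation: the function name is hypothetical, the use of the population standard deviation is an assumption, and the small `eps` is added only to avoid division by zero when all rewards in a group are equal (it does not appear in the formula above).

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize the G rewards of one concept group to zero mean, unit scale.

    Every token t of problem i shares the same advantage A_i.
    """
    mean = statistics.fmean(rewards)
    # ASSUMPTION: population standard deviation over the group.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Made-up composite rewards R_i for G = 4 problems from one concept set.
    print(group_advantages([1.65, 1.20, 0.90, 1.45]))
```

Because the advantages are computed relative to the group, a problem is reinforced only when its composite reward exceeds that of the other problems generated from the same 5 concepts.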
Implementation details:
- Base generation model: Qwen3-8B, LoRA rank=16, SFT trained for 5 epochs (8×H100)
- RL stage uses the verl library, trained on 20×H100, with the final model selected at convergence at step 100
- Teacher model sampling: \(K=5\)
- Evaluation and training uniformly use LlamaFactory, learning rate \(1 \times 10^{-5}\), 5 epochs
Two model variants:
- MathSmith-HC: Full complexity + consistency reward (final recommended version)
- MathSmith-Hard: Complexity reward only, without the consistency term
Key Experimental Results¶
Main Results¶
Benchmarks are divided into two difficulty tiers: Easy & Medium (GSM8K, MATH-500) and Hard (AIME2024, AIME2025, OlympiadBench). All methods use the same amount of training data (50K) and a unified teacher model.
| Model | Method | GSM8K | MATH-500 | AIME2024 | AIME2025 | Olympiad | Hard Avg (Rel. Imp.) |
|---|---|---|---|---|---|---|---|
| Qwen2.5-7B (short-CoT) | baseline | 92.2 | 72.2 | 16.7 | 6.7 | 38.6 | 20.7 |
| Qwen2.5-7B (short-CoT) | PromptCOT | 87.6 | 73.2 | 23.3 | 6.7 | 35.9 | 21.9 (+6.2%) |
| Qwen2.5-7B (short-CoT) | MathSmith-HC | 91.2 | 75.2 | 23.3 | 10.0 | 39.9 | 24.4 (+18.1%) |
| Qwen3-8B (short-CoT) | baseline | 93.4 | 82.8 | 30.0 | 16.7 | 51.0 | 32.6 |
| Qwen3-8B (short-CoT) | MathSmith-HC | 92.9 | 84.4 | 33.3 | 23.3 | 53.1 | 36.6 (+12.3%) |
| DS-R1 (long-CoT) | baseline | 89.3 | 88.6 | 43.3 | 36.7 | 52.4 | 44.1 |
| DS-R1 (long-CoT) | MathSmith-HC | 89.2 | 91.6 | 53.3 | 43.3 | 56.5 | 51.0 (+15.6%) |
| Qwen3-8B (long-CoT) | baseline | 94.8 | 94.4 | 66.7 | 63.3 | 66.2 | 65.4 |
| Qwen3-8B (long-CoT) | MathSmith-HC | 95.1 | 96.4 | 76.7 | 70.0 | 68.8 | 71.8 (+9.8%) |
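
The last column of the table can be reproduced from the three Hard benchmark scores. The short check below is a sketch with hypothetical helper names; it assumes Hard Avg is the unweighted mean of AIME2024, AIME2025, and OlympiadBench, and that the relative improvement is computed from the unrounded averages, which matches the Qwen2.5-7B row.

```python
def hard_avg(aime2024: float, aime2025: float, olympiad: float) -> float:
    """Unweighted mean of the three Hard benchmark scores."""
    return (aime2024 + aime2025 + olympiad) / 3


def relative_improvement(new: float, base: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (new - base) / base


if __name__ == "__main__":
    # Qwen2.5-7B (short-CoT) rows from the table above.
    base = hard_avg(16.7, 6.7, 38.6)
    hc = hard_avg(23.3, 10.0, 39.9)
    print(round(base, 1), round(hc, 1), round(relative_improvement(hc, base), 1))
```

Using the rounded averages (24.4 vs. 20.7) instead would give about +17.9%, so small discrepancies against the reported percentages are rounding effects.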
Ablation Study¶
| Training Stage | Easy&Med Avg | Hard Avg | Available Ratio | Notes |
|---|---|---|---|---|
| MathSmith-SFT | 87.7 | 30.3 | 71.50% | SFT only |
| MathSmith-Hard | 89.25 | 36.6 | 84.92% | RL (complexity reward only) |
| MathSmith-HC | 88.65 | 36.6 | 95.38% | RL (complexity + consistency) |

Weakness-focused improvement:

| Weakness-Focused Method | Easy&Med Avg | Hard Avg | Practice Acc |
|---|---|---|---|
| Original | 38.2 | 14.5 | 23.6 |
| WF Epoch 1 | 69.9 | 18.8 | 33.1 |
| WF Epoch 3 | 77.6 | 21.6 | 34.7 |
| Random (control) | 69.4 | 15.6 | 30.0 |
Key Findings¶
- Greater gains on harder benchmarks: MathSmith-HC improves Hard Avg by 9.8%–18.1% relative to each baseline, whereas GSM8K and MATH-500 scores change by at most 3.0 points and GSM8K drops slightly for three of the four models.
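- Stronger advantage in long-CoT settings: In absolute terms, MathSmith-HC raises Hard Avg by 6.9 and 6.4 points under long-CoT (DS-R1, Qwen3-8B) versus 3.7 and 4.0 points under short-CoT (Qwen2.5-7B, Qwen3-8B), although the relative improvements are comparable. This suggests that the generated hard problems elicit deeper reasoning.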
- Good scalability: From 50K to 200K data, MathSmith-HC maintains its lead with widening margins.
- Larger models benefit more: Across the Qwen3 series (1.7B→30B), larger models gain more from MathSmith data.
- Available Ratio: The availability rate of MathSmith-HC (95.38%) is substantially higher than MathSmith-Hard (84.92%), demonstrating that the consistency reward effectively improves problem quality.
- Longest reasoning trajectories: Problems generated by MathSmith-HC/Hard elicit the longest reasoning trajectories across all datasets, validating that the RL stage further enhances difficulty.
Highlights & Insights¶
- "Forge from scratch" paradigm: Problems are generated entirely from randomly sampled concept-explanation pairs without relying on any existing problems, which avoids data contamination at the source. This is a key distinction from methods such as MetaMath and NuminaMath.
- Reasoning trajectory length as a difficulty proxy: A simple yet effective heuristic—harder problems elicit longer reasoning chains. Although length does not directly equate to quality, longer chains contain more low-entropy intermediate tokens, providing better training signals.
- Weakness-focused mechanism: Since each problem is traceable to its concept set, variant problems can be generated in a targeted manner for concepts where the model is weak, enabling iterative improvement. This traceability is a distinctive advantage of the framework.
- Trade-off between HC and Hard: The consistency reward might appear to "reduce difficulty," but it substantially increases availability (from 85% to 95%), making large-scale synthesis more practical.
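The traceability behind the weakness-focused mechanism can be illustrated with a small sketch. Everything here is hypothetical (the function name, the record format, and the per-concept accuracy criterion); the source only states that each problem is traceable to its concept set and that new variants are generated for concepts where the model is weak.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def weakest_concepts(results: Iterable[Tuple[Set[str], bool]],
                     top_n: int = 3) -> List[str]:
    """Rank concepts by the model's accuracy on problems built from them.

    `results` holds (concept set of a problem, whether the model solved it).
    ASSUMPTION: a concept counts as weak when its per-concept accuracy is low.
    """
    stats: Dict[str, List[int]] = defaultdict(lambda: [0, 0])  # [correct, attempts]
    for concepts, solved in results:
        for concept in concepts:
            stats[concept][0] += int(solved)
            stats[concept][1] += 1
    accuracy = {c: correct / attempts for c, (correct, attempts) in stats.items()}
    # Lowest accuracy first; ties are broken alphabetically for determinism.
    return sorted(accuracy, key=lambda c: (accuracy[c], c))[:top_n]


if __name__ == "__main__":
    # Made-up practice results; concept names are placeholders.
    practice = [
        ({"A", "B"}, True),
        ({"B", "C"}, False),
        ({"C"}, False),
    ]
    print(weakest_concepts(practice, top_n=2))
```

The selected concepts would then be fed back to the generator as part of the next round's sampled concept sets, which is the iterative loop the bullet above describes.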
Limitations & Future Work¶
- Using reasoning trajectory length as a difficulty measure is heuristic and does not necessarily equate to "difficulty that genuinely improves reasoning capability."
- Performance occasionally degrades on simple word problems such as GSM8K, suggesting that overly heavy reasoning may have a negative effect on simpler tasks.
- The concept set is sourced solely from PlanetMath, which may have limited coverage (biased toward advanced mathematics, lacking elementary and applied mathematics).
- The current difficulty strategies are a fixed set of 9; future work could explore adaptive strategy discovery.
- The capability ceiling of the teacher model constrains the quality and difficulty upper bound of generated problems.
Related Work & Insights¶
- PromptCOT (Zhao et al. 2025a): The most closely related work—uses concept-driven prompts and multi-step planning to generate Olympiad-level problems, but still relies on manually selected concepts and lacks deeper reasoning control.
- ScaleQuest (Ding et al. 2024): Generates new problems from scratch but lacks difficulty control.
- JiuZhang3.0 (Zhou et al. 2024): Controls prompt difficulty in a stratified manner by educational level.
- GRPO (Shao et al. 2024): The policy optimization algorithm directly employed by MathSmith, originating from DeepSeek.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (Synthesizing problems from concept pairs + using reasoning length as a difficulty proxy + multi-objective RL optimization is highly creative)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (5 benchmarks, 4 models, short/long-CoT, data/model scaling experiments, weakness-focused evaluation)
- Writing Quality: ⭐⭐⭐⭐ (Clear structure and well-formatted equations, though figure density is high)
- Value: ⭐⭐⭐⭐⭐ (Addresses a critical bottleneck in mathematical reasoning data synthesis, with significant implications for the broader LLM reasoning community)