
MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy

Conference: AAAI 2026
arXiv: 2508.05592
Code: https://github.com/Jasaxion/MathSmith
Area: Reinforcement Learning
Keywords: Mathematical Reasoning, Synthetic Data, Reinforcement Learning, Large Language Models, Difficulty Control

TL;DR

This paper proposes MathSmith, a framework that generates hard mathematical problems from scratch: it randomly samples concept-explanation pairs from PlanetMath, applies 9 predefined difficulty strategies, and jointly optimizes structural validity, reasoning complexity, and answer consistency via GRPO-based reinforcement learning. Training on the resulting synthetic problems raises the Hard-benchmark average (AIME2024, AIME2025, OlympiadBench) by 9.8%–18.1% relative to the baselines across the four evaluated model settings.

Background & Motivation

Large language models have achieved remarkable progress in mathematical reasoning, yet their advancement is constrained by the following critical bottlenecks:

Scarcity of high-difficulty training data: Most high-quality mathematical problems are human-authored, limited in quantity, and unevenly distributed in difficulty. Models lack sufficient hard training data to push the upper bound of their reasoning capabilities.

Limitations of existing synthesis methods: Most mathematical problem synthesis approaches rely on extracting templates or structures from existing problems, followed by rewriting (MetaMath), augmentation (OpenMathInstruct), back-translation (MathGenie), or evolutionary transformation (WizardMath). These methods are inherently constrained by the distribution and structure of human-authored problems, lacking generative autonomy and precise difficulty control.

Data contamination risk: Methods based on transforming existing problems tend to produce problems similar to test sets, raising concerns about data contamination and the authenticity of performance gains.

Inspiration from the "Bitter Lesson": As Sutton noted, sustainable progress in AI should rely on general, compute-intensive methods rather than hand-crafted knowledge. Future reasoning agents should be capable of autonomously generating high-quality, highly challenging mathematical problems.

The core philosophy of MathSmith resembles that of a "mathematical blacksmith": starting from raw materials (mathematical concept-explanation pairs), it progressively forges complex and coherent mathematical problems entirely without relying on any existing human-authored problems.

Method

Overall Architecture

MathSmith comprises three core stages:

  1. Concept-Explanation Collection: Collecting challenging mathematical concept-explanation pairs from PlanetMath
  2. Supervised Fine-Tuning (SFT): Training basic generation capability using seed data generated by GPT-4o
  3. Reinforcement Learning (RL): Optimizing problem format, difficulty, and correctness via a multi-objective reward function

An additional Weakness-Focused Improvement Pipeline is included to enhance model performance on specific concepts in a targeted way.

Key Designs

  1. Concept-Explanation Collection: Mathematical pages are crawled from PlanetMath—a mathematical encyclopedia renowned for advanced mathematics and theoretical depth—filtered to remove non-concept entries, and processed with GPT-4o to automatically extract core concepts from each page, yielding a dataset of 11,000 mathematical concepts with explanations. PlanetMath is chosen because its concepts are inherently challenging, ensuring difficulty at the source. During generation, 5 concepts and their explanations are randomly sampled as input, completely independent of any existing mathematical problems, thereby avoiding data contamination.

  2. Nine Predefined Difficulty Strategies: Through analysis of the structural and cognitive characteristics of hard mathematical problems, 9 difficulty strategies are designed as soft constraints during generation: multi-step reasoning, cross-topic fusion, implicit or inverse logic, distractor construction, abstract modeling, multiple solution paths, advanced operations, extreme conditions, and non-standard representations. Each generated problem is required to incorporate at least 2 strategies to ensure sufficient complexity.
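The sampling and strategy constraints described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the prompt wording, the function name `build_generation_prompt`, and the toy concept pool are all assumptions; only the number of sampled concepts (5), the nine strategy names, and the "at least 2 strategies" requirement come from the text.

```python
import random

# The nine difficulty strategies named in the text above.
STRATEGIES = [
    "multi-step reasoning",
    "cross-topic fusion",
    "implicit or inverse logic",
    "distractor construction",
    "abstract modeling",
    "multiple solution paths",
    "advanced operations",
    "extreme conditions",
    "non-standard representations",
]


def build_generation_prompt(concept_pool, rng, n_concepts=5, min_strategies=2):
    """Sample concepts at random and assemble a prompt (hypothetical wording)."""
    sampled = rng.sample(sorted(concept_pool), n_concepts)
    lines = ["Concepts:"]
    for name in sampled:
        lines.append(f"- {name}: {concept_pool[name]}")
    lines.append(f"Use at least {min_strategies} of these difficulty strategies:")
    lines.extend(f"- {s}" for s in STRATEGIES)
    return sampled, "\n".join(lines)


# Toy pool standing in for the PlanetMath concept-explanation dataset.
toy_pool = {f"concept_{i}": f"explanation {i}" for i in range(20)}
sampled, prompt = build_generation_prompt(toy_pool, random.Random(0))
print(prompt)
```

Because the input is only a set of sampled concepts, no existing problem text enters the prompt, which is the property the contamination argument relies on.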

SFT stage: Each generated sample consists of two components—a rationale part (exactly 5 reasoning steps describing the problem construction process) and a problem part (the final question). Approximately 8K cold-start samples are generated with GPT-4o to fine-tune Qwen3-8B, yielding MathSmith-SFT.

  3. Multi-Objective Reinforcement Learning Reward Function: The core innovation is a composite reward comprising three components:

(1) Structure reward \(r_{structure}\): Checks whether the output contains both the rationale and problem parts (\(r_{format} \in \{0,1\}\)), and whether the reasoning step count equals 5 (\(r_{step}\), maximized at 5 steps with decay for deviations). \(r_{structure} = \alpha_{format} \cdot r_{format} + \alpha_{step} \cdot r_{step}\), where \(\alpha_{format}=0.7\) and \(\alpha_{step}=0.3\).
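A minimal sketch of the structure reward follows. The weights 0.7 and 0.3 and the target of 5 steps come from the text; the tag names `<rationale>` / `<problem>`, the "Step k" line convention, and the linear decay for step-count deviations are assumptions made only for illustration, since the summary does not specify the output markup or the decay function.

```python
import re

ALPHA_FORMAT = 0.7  # weight stated in the text
ALPHA_STEP = 0.3    # weight stated in the text
TARGET_STEPS = 5


def structure_reward(output: str) -> float:
    # Assumed markup: <rationale>...</rationale> followed by <problem>...</problem>.
    rationale = re.search(r"<rationale>(.*?)</rationale>", output, re.S)
    problem = re.search(r"<problem>(.*?)</problem>", output, re.S)
    r_format = 1.0 if (rationale and problem) else 0.0

    # Assumed step convention: each reasoning step starts a line with "Step k".
    n_steps = 0
    if rationale:
        n_steps = len(re.findall(r"^\s*Step \d+", rationale.group(1), re.M))

    # Assumed decay: linear in the deviation from 5 steps, floored at 0.
    r_step = max(0.0, 1.0 - abs(n_steps - TARGET_STEPS) / TARGET_STEPS)
    return ALPHA_FORMAT * r_format + ALPHA_STEP * r_step


good = (
    "<rationale>\n"
    + "\n".join(f"Step {i}: ..." for i in range(1, 6))
    + "\n</rationale>\n<problem>Find x.</problem>"
)
print(structure_reward(good))
```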

(2) Reasoning complexity reward \(r_{complexity}\): A teacher model Qwen3-30B-A3B solves the generated problems, and the token length of its reasoning trajectory is used as an indirect estimate of difficulty: \(r_{complexity} = \frac{1}{K \cdot T_{max}} \sum_{i=1}^{K} \ell_{cot}^{(i)}\). The motivation is that more challenging problems tend to elicit significantly longer reasoning trajectories, and longer trajectories contain low-entropy intermediate tokens that provide more informative supervision signals during training.
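The complexity formula can be illustrated with a few lines of Python. Here \(T_{max}\) is taken to be a maximum token budget for a teacher response, which is an assumption (the summary does not define it), as is the clipping of each length at that budget; the token counts are made up.

```python
def complexity_reward(cot_token_lengths, t_max):
    """r_complexity = (1 / (K * T_max)) * sum of the K teacher CoT token lengths."""
    k = len(cot_token_lengths)
    if k == 0 or t_max <= 0:
        raise ValueError("need at least one trajectory and a positive t_max")
    # Assumption: clip each length at t_max so the reward stays in [0, 1].
    clipped = [min(length, t_max) for length in cot_token_lengths]
    return sum(clipped) / (k * t_max)


# Made-up token counts for K = 5 teacher solutions of one generated problem.
lengths = [4000, 6000, 8000, 2000, 5000]
print(complexity_reward(lengths, t_max=10000))  # 0.5
```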

(3) Answer consistency reward \(r_{consistency}\): \(K\) answers are sampled from the teacher model; if a majority answer exists (i.e., some answer appears more than \(K/2\) times), the reward is 1, otherwise 0. This encourages generation of clear, unambiguous problems.
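The majority rule above reduces to a short counting check. The sketch below assumes answers have already been normalized to comparable strings, which is a simplification; how the paper extracts and compares final answers is not described in this summary.

```python
from collections import Counter


def consistency_reward(answers):
    """Return 1.0 if some answer appears more than K/2 times, else 0.0."""
    if not answers:
        return 0.0
    _, top_count = Counter(answers).most_common(1)[0]
    return 1.0 if top_count > len(answers) / 2 else 0.0


print(consistency_reward(["42", "42", "42", "7", "9"]))  # 1.0
print(consistency_reward(["1", "1", "2", "2", "3"]))     # 0.0
```

With \(K=5\), a problem needs at least 3 agreeing teacher answers to receive the reward.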

Final reward: \(r_{total} = r_{structure} + \beta_{complexity} \cdot r_{complexity} + \beta_{consistency} \cdot r_{consistency}\), where \(\beta_{complexity}=0.7\) and \(\beta_{consistency}=0.3\).
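As a worked example of the final formula, the sketch below combines component values using the weights stated in the text. The component values themselves are invented for illustration.

```python
ALPHA_FORMAT, ALPHA_STEP = 0.7, 0.3            # structure weights from the text
BETA_COMPLEXITY, BETA_CONSISTENCY = 0.7, 0.3   # final-reward weights from the text


def total_reward(r_format, r_step, r_complexity, r_consistency):
    r_structure = ALPHA_FORMAT * r_format + ALPHA_STEP * r_step
    return (
        r_structure
        + BETA_COMPLEXITY * r_complexity
        + BETA_CONSISTENCY * r_consistency
    )


# Well-formed output, exactly 5 steps, teacher used half the token budget,
# and a majority answer exists: 1.0 + 0.7 * 0.5 + 0.3 * 1.0
print(round(total_reward(1.0, 1.0, 0.5, 1.0), 2))  # 1.65
```

Under these weights the reward lies between 0 and 2, with the structure term contributing up to half of the maximum.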

Loss & Training

GRPO (Group Relative Policy Optimization) is adopted to optimize the policy model \(\pi_\theta\). For each group of 5 concept inputs \(c\), \(G\) problems are generated, their composite reward scores \(R_i\) are computed, and then normalized into advantage estimates: \(\hat{A}_{i,t} = \frac{R_i - \text{mean}(\{R_j\})}{\text{std}(\{R_j\})}\), with updates performed via a PPO-style clipped objective (Equations 8–10) plus KL divergence penalty.
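The group-relative advantage and the clipped term can be sketched in plain Python. This is an illustration of the normalization formula above and of a generic PPO-style clip, not the verl implementation: the use of the population standard deviation, the small `eps` stabilizer, and `clip_eps=0.2` are assumptions, and the KL penalty is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of G sampled problems."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped term for one token; ratio = pi_theta / pi_old."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


rewards = [1.65, 1.20, 0.90, 1.65]  # made-up composite rewards, G = 4
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
print(clipped_surrogate(1.5, 1.0))  # 1.2
```

Every token of a generated problem shares the same advantage, so problems that score above the group mean are reinforced as a whole and those below are suppressed.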

Implementation details:

  • Base generation model: Qwen3-8B, LoRA rank=16, SFT trained for 5 epochs (8×H100)
  • RL stage: uses the verl library, trained on 20×H100; the checkpoint at step 100, where training converged, is selected as the final model
  • Teacher model sampling: \(K=5\)
  • Evaluation and training uniformly use LlamaFactory, with learning rate \(1 \times 10^{-5}\) and 5 epochs

Two model variants:

  • MathSmith-HC: Full complexity + consistency reward (final recommended version)
  • MathSmith-Hard: Complexity reward only, without the consistency term

Key Experimental Results

Main Results

Benchmarks are divided into two difficulty tiers: Easy & Medium (GSM8K, MATH-500) and Hard (AIME2024, AIME2025, OlympiadBench). All methods use the same amount of training data (50K) and a unified teacher model.

| Model | Method | GSM8K | MATH-500 | AIME2024 | AIME2025 | Olympiad | Hard Avg (Rel. Imp.) |
|---|---|---|---|---|---|---|---|
| Qwen2.5-7B (short-CoT) | baseline | 92.2 | 72.2 | 16.7 | 6.7 | 38.6 | 20.7 |
| Qwen2.5-7B (short-CoT) | PromptCOT | 87.6 | 73.2 | 23.3 | 6.7 | 35.9 | 21.9 (+6.2%) |
| Qwen2.5-7B (short-CoT) | MathSmith-HC | 91.2 | 75.2 | 23.3 | 10.0 | 39.9 | 24.4 (+18.1%) |
| Qwen3-8B (short-CoT) | baseline | 93.4 | 82.8 | 30.0 | 16.7 | 51.0 | 32.6 |
| Qwen3-8B (short-CoT) | MathSmith-HC | 92.9 | 84.4 | 33.3 | 23.3 | 53.1 | 36.6 (+12.3%) |
| DS-R1 (long-CoT) | baseline | 89.3 | 88.6 | 43.3 | 36.7 | 52.4 | 44.1 |
| DS-R1 (long-CoT) | MathSmith-HC | 89.2 | 91.6 | 53.3 | 43.3 | 56.5 | 51.0 (+15.6%) |
| Qwen3-8B (long-CoT) | baseline | 94.8 | 94.4 | 66.7 | 63.3 | 66.2 | 65.4 |
| Qwen3-8B (long-CoT) | MathSmith-HC | 95.1 | 96.4 | 76.7 | 70.0 | 68.8 | 71.8 (+9.8%) |
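The Hard Avg column and its relative improvement appear to be the mean of the three hard benchmarks and the percentage change of that mean over the baseline. The short check below reproduces the Qwen2.5-7B (short-CoT) MathSmith-HC entry under that reading; the formula is inferred from the numbers in the table, not quoted from the paper.

```python
def hard_avg(aime24, aime25, olympiad):
    """Mean accuracy over the three Hard benchmarks."""
    return (aime24 + aime25 + olympiad) / 3


def relative_improvement(new, base):
    """Percentage change of `new` over `base`."""
    return 100 * (new - base) / base


base = hard_avg(16.7, 6.7, 38.6)   # Qwen2.5-7B short-CoT baseline row
hc = hard_avg(23.3, 10.0, 39.9)    # Qwen2.5-7B short-CoT MathSmith-HC row
print(f"{base:.1f} {hc:.1f} {relative_improvement(hc, base):+.1f}%")
# 20.7 24.4 +18.1%
```

The +18.1% figure only comes out when the unrounded averages are used; computing from the rounded 24.4 and 20.7 gives about +17.9%.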

Ablation Study

| Training Stage | Easy&Med Avg | Hard Avg | Available Ratio | Notes |
|---|---|---|---|---|
| MathSmith-SFT | 87.7 | 30.3 | 71.50% | SFT only |
| MathSmith-Hard | 89.25 | 36.6 | 84.92% | RL (complexity reward only) |
| MathSmith-HC | 88.65 | 36.6 | 95.38% | RL (complexity + consistency) |

| Weakness-Focused Method | Easy&Med Avg | Hard Avg | Practice Acc |
|---|---|---|---|
| Original | 38.2 | 14.5 | 23.6 |
| WF Epoch 1 | 69.9 | 18.8 | 33.1 |
| WF Epoch 3 | 77.6 | 21.6 | 34.7 |
| Random (control) | 69.4 | 15.6 | 30.0 |

Key Findings

  1. Greater gains on harder benchmarks: Improvement margins on Hard benchmarks (9.8%–18.1%) substantially exceed those on Easy & Medium benchmarks.
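  2. Stronger advantage in long-CoT settings: In absolute terms, Hard Avg gains are larger under long-CoT configurations (+6.9 and +6.4 points for DS-R1 and Qwen3-8B) than under short-CoT (+3.7 and +4.0 points for Qwen2.5-7B and Qwen3-8B), indicating that the generated hard problems elicit deeper reasoning. The relative improvements, by contrast, are similar across the two settings.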
  3. Good scalability: From 50K to 200K data, MathSmith-HC maintains its lead with widening margins.
  4. Larger models benefit more: Across the Qwen3 series (1.7B→30B), larger models gain more from MathSmith data.
  5. Available Ratio: The availability rate of MathSmith-HC (95.38%) is substantially higher than MathSmith-Hard (84.92%), demonstrating that the consistency reward effectively improves problem quality.
  6. Longest reasoning trajectories: Problems generated by MathSmith-HC/Hard elicit the longest reasoning trajectories across all datasets, validating that the RL stage further enhances difficulty.

Highlights & Insights

  • Breakthrough of the "forge from scratch" paradigm: Problems are generated entirely from randomly sampled concept pairs without relying on any existing problems, fundamentally avoiding data contamination—a key distinction from methods such as MetaMath and NuminaMath.
  • Reasoning trajectory length as a difficulty proxy: A simple yet effective heuristic—harder problems elicit longer reasoning chains. Although length does not directly equate to quality, longer chains contain more low-entropy intermediate tokens, providing better training signals.
  • Weakness-focused mechanism: Since each problem is traceable to its concept set, variant problems can be generated specifically for concepts where the model is weak, enabling iterative improvement. This traceability is a distinctive advantage of the framework.
  • Trade-off between HC and Hard: The consistency reward might appear to "reduce difficulty," but it substantially increases availability (from 85% to 95%), making large-scale synthesis more practical.
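The traceability behind the weakness-focused mechanism can be sketched as a per-concept accuracy tally. This is a hypothetical illustration: the function name `weakest_concepts`, the `(concepts, solved)` record format, and the toy data are assumptions, and the actual pipeline may select weak concepts differently.

```python
from collections import defaultdict


def weakest_concepts(results, top_k=3):
    """results: iterable of (concepts, solved) pairs, one per practice problem."""
    attempts = defaultdict(int)
    correct = defaultdict(int)
    for concepts, solved in results:
        for concept in concepts:
            attempts[concept] += 1
            correct[concept] += int(solved)
    accuracy = {c: correct[c] / attempts[c] for c in attempts}
    # Lowest accuracy first; ties broken alphabetically for determinism.
    return sorted(accuracy, key=lambda c: (accuracy[c], c))[:top_k]


# Toy practice log: each problem records the concepts it was forged from.
log = [
    (["group action", "lattice"], False),
    (["group action", "filter"], False),
    (["lattice", "filter"], True),
    (["filter", "ideal"], True),
]
print(weakest_concepts(log, top_k=2))  # ['group action', 'lattice']
```

The returned concepts could then be fed back into the generator as sampling inputs to produce targeted variant problems.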

Limitations & Future Work

  1. Using reasoning trajectory length as a difficulty measure is heuristic and does not necessarily equate to "difficulty that genuinely improves reasoning capability."
  2. Performance occasionally degrades on simple word problems such as GSM8K (for example, 92.2 to 91.2 on Qwen2.5-7B short-CoT), suggesting that training on overly reasoning-heavy problems may have a negative effect on simpler tasks.
  3. The concept set is sourced solely from PlanetMath, which may have limited coverage (biased toward advanced mathematics, lacking elementary and applied mathematics).
  4. The current difficulty strategies are a fixed set of 9; future work could explore adaptive strategy discovery.
  5. The capability ceiling of the teacher model constrains the quality and difficulty upper bound of generated problems.

Related Work

  • PromptCOT (Zhao et al. 2025a): The most closely related work. It uses concept-driven prompts and multi-step planning to generate Olympiad-level problems, but still relies on manually selected concepts and lacks deeper reasoning control.
  • ScaleQuest (Ding et al. 2024): Generates new problems from scratch but lacks difficulty control.
  • JiuZhang3.0 (Zhou et al. 2024): Controls prompt difficulty in a stratified manner by educational level.
  • GRPO (Shao et al. 2024): The policy optimization algorithm directly employed by MathSmith, originating from DeepSeek.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ (Synthesizing problems from concept pairs + using reasoning length as a difficulty proxy + multi-objective RL optimization is highly creative)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (5 benchmarks, 4 models, short/long-CoT, data/model scaling experiments, weakness-focused evaluation)
  • Writing Quality: ⭐⭐⭐⭐ (Clear structure and well-formatted equations, though figure density is high)
  • Value: ⭐⭐⭐⭐⭐ (Addresses a critical bottleneck in mathematical reasoning data synthesis, with significant implications for the broader LLM reasoning community)