ScaleQuest: Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch

Conference: ACL 2025
arXiv: 2410.18693
Code: https://scalequest.github.io
Area: LLM/NLP
Keywords: math reasoning, question synthesis, QFT, QPO, data scaling, instruction tuning

TL;DR

Proposes ScaleQuest, which turns a 7B problem-solving model into a question-generation model through two training stages: Question Fine-Tuning (QFT) followed by Question Preference Optimization (QPO). It synthesizes 1 million high-quality math question-answer pairs from scratch. Models trained on this data outperform those trained on all existing open-source datasets across four benchmarks, and performance keeps rising without saturating as the data scales to 1M.

Background & Motivation

Scarcity of high-quality reasoning data: Scaling LLM math reasoning requires large-scale, diverse, high-quality datasets, but the open-source community severely lacks such resources—the success of leading models (o1, Claude-3.5) heavily relies on proprietary high-quality data.

Limited diversity in question-driven methods: Methods like MetaMath (rephrasing), WizardMath (evol-instruct), and Orca-Math (back-translation) generate questions that closely resemble the seed data (often only changing numbers or adding constraints), so their diversity, and therefore their scalability, is bounded by the seed set.

High cost of knowledge-driven methods: Although MathScale (knowledge graph-guided) and KPMath (key knowledge point sampling) improve diversity, they still rely on powerful models like GPT-4 to generate questions, so API costs make large-scale synthesis economically infeasible.

Poor performance of direct question generation from solver models: Magpie-style methods that directly generate instructions using instruct models perform poorly on reasoning tasks (as shown in Figure 1 where Llama3-8B-Magpie lags far behind other methods), because the instruction tuning loss is only computed on the responses, and the question generation ability is not explicitly activated.

Need for lightweight, low-cost solutions: The open-source community requires a low-cost solution capable of synthesizing large-scale data using 7B-level lightweight models, without relying on powerful closed-source models.

Theoretical demand for data scalability: An ideal data synthesis method should continuously improve performance as the data volume increases, whereas existing methods (such as DART-Math based on rejection sampling from limited seeds) saturate quickly.

Method

Overall Architecture

ScaleQuest consists of three key phases: (1) QFT, which activates the problem solver's ability to generate questions; (2) QPO, which improves the solvability and difficulty of generated questions via preference optimization; (3) filtering and response generation, which applies multi-dimensional filtering and then selects the best response with a Best-of-5 reward model. In practice, two 7B models each generate 1 million questions, which are then filtered down to 1 million question-answer pairs.
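
Phase (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `is_english`, `is_solvable`, `difficulty`, `generate`, and `reward` are hypothetical callables standing in for the language filter, the solvability check, the difficulty scorer, the solver model, and the reward model, and the hard `min_difficulty` threshold is a simplification of the difficulty sampling step.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QAPair:
    question: str
    answer: str


def fail_rate(outcomes: List[bool]) -> float:
    """Share of sampled attempts that were wrong (True means correct)."""
    if not outcomes:
        raise ValueError("need at least one sampled attempt")
    return 1.0 - sum(outcomes) / len(outcomes)


def best_of_n(
    question: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int = 5,
) -> str:
    """Sample n candidate solutions and keep the one the reward model scores highest."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(question, answer))


def build_dataset(
    questions: Iterable[str],
    is_english: Callable[[str], bool],
    is_solvable: Callable[[str], bool],
    difficulty: Callable[[str], float],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    min_difficulty: float = 0.0,  # hypothetical threshold, not from the paper
    n: int = 5,
) -> List[QAPair]:
    """Apply the three filters in order, then attach a best-of-n solution."""
    pairs = []
    for q in questions:
        if not is_english(q):
            continue  # language filter
        if not is_solvable(q):
            continue  # solvability filter
        if difficulty(q) < min_difficulty:
            continue  # drop overly simple questions
        pairs.append(QAPair(q, best_of_n(q, generate, reward, n)))
    return pairs
```

Filtering before response generation means the expensive step (sampling `n` solutions per question) only runs on questions that survive all three checks.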

Key Designs

  1. QFT (Question Fine-Tuning) — Activating Question Generation Capability

    • Function: Finetuning the solver model (Qwen2-Math-7B-Instruct) on approximately 15K math questions (excluding answers, containing only question text + EOS token) so that the model learns to "stop after generating the question."
    • Mechanism: Because instruction tuning applies a causal mask, each hidden state depends only on the preceding tokens, so the model already models \(P(x_i|x_{<i})\) over question tokens implicitly, even though the loss is computed only on responses. QFT merely needs to activate this capability rather than memorize the training questions.
    • Design Motivation: Validation experiments verified "activation rather than memorization"—the difficulty distributions of questions generated by QFT models trained on GSM8K and MATH, respectively, converged (instead of replicating their respective training sets), demonstrating that QFT activates a general question-generation ability.
  2. QPO (Question Preference Optimization) — Improving Question Quality

    • Function: The QFT model generates 10K questions, which are optimized by an external LLM in terms of solvability and difficulty to construct preference pairs (optimized, original) for DPO training.
    • Mechanism: Borrowing the DPO preference framework, QPO shifts it from "optimizing responses" to "optimizing questions"—the loss function \(\mathcal{L}_{\text{QPO}}\) encourages the model to generate more solvable and challenging questions.
    • Design Motivation: Although questions generated after QFT make sense, their quality is still insufficient—some are unsolvable (insufficient constraints / incorrect answers) or too simplistic. Randomly choosing one optimization direction (solvability or difficulty) per sample avoids conflicts of optimizing both objectives simultaneously. Experiments demonstrated that GPT-4o-mini is most effective for solvability optimization.
  3. Multi-dimensional Filtering + Reward Model for Solution Selection

    • Function: Language filtering (removing ~20% non-English questions) → solvability filtering (Qwen2-Math determines if the question is meaningful and constraints are sufficient) → difficulty sampling (filtering overly simple questions based on a difficulty scorer trained on fail rates) → generating 5 solutions per question and selecting the highest-scoring solution using InternLM2-7B-Reward.
    • Mechanism: Post-generation filtering is more flexible and efficient than in-generation constraint enforcement. The difficulty scorer operationalizes difficulty as "the error rate of sampling the question \(n\) times."
    • Design Motivation: Triple filtering addresses language mixing, unsolvability, and difficulty imbalance, respectively. Selecting solutions with a reward model ensures the response quality of the final dataset.
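
Assuming QPO reuses the standard DPO objective with questions in place of responses, the loss \(\mathcal{L}_{\text{QPO}}\) can be written as below. The symbols \(q_w\) (the optimized question), \(q_l\) (the original question), \(\pi_{\text{ref}}\) (a frozen reference policy, presumably the QFT model), \(\beta\), and \(\mathcal{D}\) (the set of preference pairs) are notational assumptions introduced here for illustration.

\[
\mathcal{L}_{\text{QPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(q_w, q_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(q_w)}{\pi_{\text{ref}}(q_w)} - \beta \log \frac{\pi_\theta(q_l)}{\pi_{\text{ref}}(q_l)} \right) \right]
\]

Minimizing this loss raises the likelihood of the optimized question relative to the original one, measured against the reference policy.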

Experiments

Table 1: Main Results (Four Mathematical Reasoning Benchmarks, Zero-Shot Pass@1 Accuracy)

| Base Model - Dataset | Generator Model | GSM8K | MATH | College Math | OlympiadBench | Avg |
|---|---|---|---|---|---|---|
| Mistral-7B-MetaMath | GPT-3.5 | 77.7 | 28.2 | 19.1 | 5.8 | 32.7 |
| Mistral-7B-NuminaMath | GPT-4o | 82.1 | 49.4 | 33.8 | 19.4 | 46.2 |
| Mistral-7B-ScaleQuest | Qwen2-7B | 88.5 | 62.9 | 43.5 | 26.8 | 55.4 |
| Llama3-8B-MetaMath | GPT-3.5 | 77.3 | 32.5 | 20.6 | 5.5 | 34.0 |
| Llama3-8B-NuminaMath | GPT-4o | 77.2 | 50.7 | 33.2 | 17.8 | 44.7 |
| Llama3-8B-ScaleQuest | Qwen2-7B | 87.9 | 64.4 | 42.8 | 25.3 | 55.1 |
| DSMath-7B-DART-Math | DSMath-RL | 86.8 | 53.6 | 40.7 | 21.7 | 50.7 |
| DSMath-7B-ScaleQuest | Qwen2-7B | 89.5 | 66.6 | 47.7 | 29.9 | 58.4 |
| Qwen2-Math-7B-NuminaMath | GPT-4o | 84.6 | 65.6 | 45.5 | 33.6 | 57.3 |
| Qwen2-Math-7B-ScaleQuest | Qwen2-7B | 89.7 | 73.4 | 50.0 | 38.5 | 62.9 |

Table 2: Comparison of Question Quality (Unified evaluation using Qwen2-Math-7B-Instruct to generate responses)

| Question Source | GSM8K | MATH | College Math | OlympiadBench | Avg |
|---|---|---|---|---|---|
| MetaMath | 84.5 | 53.8 | 40.1 | 22.1 | 50.1 |
| OrcaMath | 84.2 | 53.7 | 40.5 | 23.7 | 50.5 |
| NuminaMath | 86.0 | 65.9 | 46.1 | 30.2 | 57.1 |
| ScaleQuest | 89.5 | 66.6 | 47.7 | 29.9 | 58.4 |

Table 3: Cost Analysis

| Method | GPU Time | Cost (USD) |
|---|---|---|
| ScaleQuest (1M samples) | 522.9 GPU-hours | $680.8 |
| GPT-4o (equivalent tokens) | - | $6,115.9 |
| GPT-4 (equivalent tokens) | - | $24,939.5 |
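
The ratios implied by Table 3 can be checked with a few lines of arithmetic. The sketch below uses only the figures from the table; the 8-GPU count is taken from the "8×A100" note in Key Findings, and the helper names are illustrative.

```python
# Figures copied from Table 3.
SCALEQUEST_USD = 680.8
GPT4O_USD = 6115.9
GPT4_USD = 24939.5
GPU_HOURS = 522.9
NUM_GPUS = 8  # from the "8xA100" note in Key Findings


def ratio(cost: float, baseline: float) -> float:
    """Cost expressed as a fraction of the baseline cost."""
    return cost / baseline


def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Elapsed days if the GPU-hours are spread evenly over num_gpus devices."""
    return gpu_hours / num_gpus / 24


if __name__ == "__main__":
    print(f"vs GPT-4o: {ratio(SCALEQUEST_USD, GPT4O_USD):.1%}")
    print(f"vs GPT-4:  {ratio(SCALEQUEST_USD, GPT4_USD):.1%}")
    print(f"wall clock: {wall_clock_days(GPU_HOURS, NUM_GPUS):.1f} days")
    print(f"USD per GPU-hour: {SCALEQUEST_USD / GPU_HOURS:.2f}")
```

Under these figures, ScaleQuest comes to about 11% of the GPT-4o cost and under 3% of the GPT-4 cost.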

Key Findings

  • Comprehensively outperforming all open-source datasets: ScaleQuest yields average performance gains of 5.6% to 11.5% across four base models, and Qwen2-Math-7B-ScaleQuest reaches 73.4 on MATH, matching GPT-4-Turbo.
  • Outperforming the teacher model: Qwen2-Math-7B-ScaleQuest outperforms its teacher model Qwen2-Math-7B-Instruct on GSM8K (89.7), MATH (73.4), and OlympiadBench (38.5).
  • Unsaturated data scaling: From 100K to 1M samples, both in-domain (MATH) and out-of-domain (OlympiadBench) performance continue to improve without showing signs of convergence, whereas other datasets (such as DART-Math) saturate much earlier.
  • Every step of QFT+QPO is effective: Ablation studies show that QFT improves solvability and diversity, QPO further enhances difficulty and solvability, and reward filtering further boosts the final performance.
  • Multi-generator enhances diversity: Mixing data from DSMath-QGen and Qwen2-Math-QGen performs better than a single generator—the former favors practical problems, while the latter favors theoretical ones, complementarily enhancing diversity.
  • Cost is about 11% of GPT-4o: Generating 1M samples costs only $680.8 (approx. 2.7 days on 8×A100), roughly one-ninth of the cost of using GPT-4o at the same scale.

Highlights & Insights

  • A paradigm shift "from problem solving to question generation": QFT+QPO transforms a solver model into a questioner model with only ~15K seed questions (no answers). The recipe is simple and efficient, and it opens up a new approach to reasoning data synthesis.
  • Sophisticated preference optimization design in QPO: It adapts DPO from "optimizing response quality" to "optimizing question quality," randomly choosing one optimization direction per sample to avoid multi-objective conflicts.
  • Crucial discovery of unsaturated data scaling: This implies the feasibility of further scaling, providing empirical support for the "more data, better performance" scaling law in the reasoning domain.
  • Generalization from math to code reasoning: The method also demonstrates significant improvements in code reasoning tasks, showing it is not limited to the mathematical domain.

Limitations & Future Work

  • Only validated on 7B-level models; effectiveness on larger models (e.g., 70B or 72B) remains unknown.
  • External LLM optimization during the QPO phase may introduce distribution bias.
  • The filtering threshold for difficulty sampling is empirically determined, lacking theoretical guidance.
  • The quality of generated questions is "still not fully satisfactory"; there remains room for improvement in question preference alignment.
  • Human evaluation reveals that synthesized data remains inferior to human-written datasets (GSM8K, MATH) in terms of clarity and rationality.

Related Work

  • Question-driven methods: WizardMath (evol-instruct), MetaMath (rephrasing), MMIQC (hybrid), and Orca-Math (back-translation) are constrained in diversity by their seed questions.
  • Knowledge-driven methods: MathScale (knowledge graphs), KPMath (key knowledge points), and NuminaMath (hybrid of real and synthetic questions) improve diversity but rely heavily on strong models.
  • Response quality enhancement: DART-Math (difficulty-aware rejection sampling) optimizes from the response side, complementing ScaleQuest which optimizes from the question side—the two can be combined.
  • Other avenues for boosting mathematical reasoning: Pretraining data optimization (Llemma), tool-integrated reasoning (PAL, PoT), and preference tuning (DeepSeekMath-RL).

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The QFT+QPO two-stage questioner training represents a brand-new paradigm for data synthesis; the concept of "from problem solving to question generation" is highly novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across 4 base models × 4 benchmarks + scalability analysis + ablation studies + cost analysis + human evaluation + code generalization.
  • Writing Quality: ⭐⭐⭐⭐ Clear motivation, complete logical chain in the methodology, and exhaustive ablation analysis.
  • Value: ⭐⭐⭐⭐⭐ Provides a low-cost, scalable reasoning data synthesis solution for the open-source community, possessing significant practical impact.