Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=VmEkhV2yCX
Code: To be confirmed
Area: LLM Reasoning / Training Data Configuration
Keywords: Reasoning data, Pre-training, Supervised Fine-Tuning, Data scaling, Reinforcement Learning

TL;DR

Under a fixed reasoning-token budget, this study systematically examines whether reasoning data should go into pre-training or post-training. It finds that front-loading reasoning data into pre-training builds a persistent advantage that SFT cannot compensate for, and it proposes an asymmetric data allocation principle: "diversity for pre-training, quality for SFT."

Background & Motivation

  • Background: The current mainstream paradigm for enhancing LLM reasoning capabilities is to inject high-quality, long-CoT reasoning data during the post-training phase (mid-training / SFT / RL), treating reasoning as a specialized skill layered on a general base.
  • Limitations of Prior Work: The role of reasoning data during pre-training is largely unexplored: the pre-training corpora of frontier models are opaque, and end-to-end pre-training experiments are costly. Community research therefore concentrates on the more accessible post-training phase and lacks a systematic comparison of "when to feed reasoning data."
  • Key Challenge: Under a controlled token budget, does early (pre-training) injection of reasoning data yield better results, or does it cause overfitting that harms generalization? Can subsequent SFT let a "reasoning-poor" base "catch up"? These questions remain open, and the intuitions behind them point in opposite directions.
  • Goal: To conduct the first systematic study of the impact of reasoning data—in terms of scale, diversity, and quality—injected at different training stages on the final model (after RL), providing a data allocation guide for the entire training pipeline.
  • Core Idea: [Front-Loading Reasoning] Front-load reasoning data into pre-training. Under a fully crossed experimental design strictly controlling the total reasoning token budget (80B), quantify the synergy, redundancy, and trade-offs between pre-training and SFT, concluding that "early investment yields compound returns."

Method

Overall Architecture

The authors formalize the problem as a budget-constrained data allocation optimization: given a fixed total reasoning data budget \(B = |D^{PT}_{res}| + |D^{SFT}_{res}|\), find the optimal configuration for the pre-training side \(D^{PT}_{res}\) and SFT side \(D^{SFT}_{res}\) to maximize the expected accuracy \(P(D^{PT}_{res}, D^{SFT}_{res}) = \mathbb{E}_{t\sim T}[\mathrm{Acc}(f_{\theta_{SFT}}(t))]\) on a downstream reasoning task set \(T\). To this end, all experiments run on the same three-stage pipeline—Pre-training → SFT → RL—with a fully crossed comparison of carefully designed dataset variants.
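Written as a constrained optimization, the allocation problem above reads roughly as follows. This only restates the symbols already defined; the explicit arg-max form and the starred notation for the optimal configuration are notational assumptions rather than the paper's own equation. The SFT parameters \(\theta_{SFT}\) depend on both data choices, since SFT starts from the base model trained on \(D^{PT}_{res}\).

```latex
\[
\big(D^{PT\,*}_{res},\, D^{SFT\,*}_{res}\big)
  = \arg\max_{D^{PT}_{res},\, D^{SFT}_{res}}
    \; \mathbb{E}_{t \sim T}\!\left[\mathrm{Acc}\big(f_{\theta_{SFT}}(t)\big)\right]
\quad \text{s.t.} \quad
|D^{PT}_{res}| + |D^{SFT}_{res}| = B
\]
```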

graph LR
    A[Dbase General Corpus 6.2T] --> B[Pre-training 1T tokens<br/>600B Dbase + 400B 80/20 Mix with Dres]
    B --> C{4 Base Models<br/>Mbase / MSHQ / MLDQ / MLMQ}
    C --> D[SFT<br/>Fine-tune with DSHQ/DLDQ/DLMQ]
    D --> E[RL GRPO<br/>Verifiable Rewards]
    E --> F[Evaluation across math/science/code]

Key Designs

1. Full Cross-Data Matrix: Decomposing "Quality × Diversity × Scale" into controllable variables. The authors curate four datasets around reasoning data \(D_{res}\) to decouple dimensions: the large-scale and diverse \(D_{LDQ}\) (268M samples, 56% Math/17% Code/27% Science and General, mixed quality, representing "quantity over quality"), the small-scale high-quality \(D_{SHQ}\) (1.2M strong teacher long-CoT samples, representing "high quality but narrow"), the mixed-quality union of the two \(D_{LMQ}\), and the "complexity isolated" subset \(D_{ALF}\) (7.1M) filtered by answer length >4096 tokens. Four base models are trained: a baseline with no reasoning data \(M_{base}\), and \(M_{LDQ}/M_{SHQ}/M_{LMQ}\), with their mean denoted as \(M_{res}\). This matrix allows "early/late, diverse/high-quality" to become independent knobs.

2. Controlled Token Budget and Ratio: Ensuring fairness across experiments. All bases are pre-trained from scratch for 1T tokens—the first 600B uses \(D_{base}\) exclusively, and the latter 400B uses a mix of 80% \(D_{base}\) + 20% \(D_{res}\). Thus, all experiments share a constant 80B reasoning token budget; small datasets (e.g., \(D_{SHQ}\)) are upsampled to maintain the same token count, isolating "what data and which stage" as variables. The base models use an 8B Mamba2 + Self-Attention + FFN hybrid Transformer, trained on 512 H100s.

3. Three-stage Synergy + RL Sustainability Test. After pre-training, each base undergoes SFT on different \(D_{res}\) (4.8M samples, 32k context), forming a 4×3 cross-evaluation to test three hypotheses: the Catch-Up Hypothesis (can \(M_{base}\) catch up via doubled SFT), the Influence of Diversity (broad vs. deep pre-training for SFT absorption), and the Marginal Utility of SFT Quality. Finally, GRPO + verifiable rewards (based on NEMOTRON-CROSSTHINK) are used for RL to check if early reasoning gains are sustainable and can translate into a decisive advantage in expert-level tasks like AIME.
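The budget arithmetic and the 4×3 grid described in the designs above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the example dataset size are hypothetical, and only the 1T / 600B / 400B / 20% figures and the model and dataset names come from the summary.

```python
from itertools import product

# Figures quoted in the summary (in tokens).
TOTAL_PT_TOKENS = 1_000_000_000_000    # 1T tokens of pre-training
BASE_ONLY_TOKENS = 600_000_000_000     # first phase: D_base only
MIXED_PHASE_TOKENS = TOTAL_PT_TOKENS - BASE_ONLY_TOKENS  # remaining 400B
REASONING_SHARE = 0.20                 # 80/20 mix of D_base / D_res


def reasoning_budget(mixed_tokens: int, share: float) -> int:
    """Reasoning tokens seen during the mixed phase of pre-training."""
    return round(mixed_tokens * share)


def upsampling_factor(budget_tokens: int, dataset_tokens: int) -> float:
    """Number of passes over a dataset needed to fill the token budget."""
    return budget_tokens / dataset_tokens


budget = reasoning_budget(MIXED_PHASE_TOKENS, REASONING_SHARE)
print(f"reasoning budget: {budget / 1e9:.0f}B tokens")

# Hypothetical dataset size, for illustration only; the summary reports
# sample counts, not token counts, for the reasoning datasets.
example_dataset_tokens = 10_000_000_000
passes = upsampling_factor(budget, example_dataset_tokens)
print(f"passes over a 10B-token dataset: {passes:.1f}")

# Fully crossed design: every base model is paired with every SFT dataset.
BASES = ["M_base", "M_SHQ", "M_LDQ", "M_LMQ"]
SFT_SETS = ["D_SHQ", "D_LDQ", "D_LMQ"]
grid = list(product(BASES, SFT_SETS))
print(f"{len(grid)} (base, SFT) configurations")
```

Holding `budget` constant across bases is what lets the comparison attribute differences to the kind of data and the stage at which it is used, rather than to the amount of reasoning data seen.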

Key Experimental Results

Main Results

Accuracy after pre-training (Table 1) and three-stage evolution (Table 2/3):

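| Stage | Model | Avg | Math | Science | Code |
|---|---|---|---|---|---|
| Post-PT | Mbase | 52.70 | 47.17 | 47.13 | 40.89 |
| Post-PT | MLDQ | 64.09 | 75.56 | 54.38 | 49.94 |
| Post-PT | Mres (Mean) | 61.05 | 66.84 | 51.92 | 48.95 |
| Post-SFT | Mbase+SFT | 26.62 | 34.48 | 20.92 | 7.09 |
| Post-SFT | Mres+SFT | 35.92 | 40.61 | 34.77 | 16.75 |
| Post-RL | Mbase+SFT_SHQ+RL | 37.92 | - | - | - |
| Post-RL | MLMQ+SFT_SHQ+RL | 56.66 | - | - | - |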
Pre-training creates an initial +8.35-point average gap (61.05 vs. 52.70), which widens to +9.30 points after SFT (35.92 vs. 26.62). After RL, MLMQ is reported to lead the baseline by +18.57%, with the lead on AIME competition math reaching +39.32%, which supports the "early investment, compound returns" reading.
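The stage-wise gaps between the reasoning bases and the baseline can be recomputed directly from the table rows above. The sketch below is only an arithmetic check on the quoted figures; the variable and function names are illustrative and not from the paper.

```python
# Accuracy figures copied from the table above, in column order.
COLUMNS = ("Avg", "Math", "Science", "Code")
post_pt = {
    "M_base": (52.70, 47.17, 47.13, 40.89),
    "M_res": (61.05, 66.84, 51.92, 48.95),
}
post_sft = {
    "M_base": (26.62, 34.48, 20.92, 7.09),
    "M_res": (35.92, 40.61, 34.77, 16.75),
}


def gaps(stage: dict) -> dict:
    """Per-column lead of the reasoning bases over the baseline, in points."""
    return {
        col: round(res - base, 2)
        for col, base, res in zip(COLUMNS, stage["M_base"], stage["M_res"])
    }


pt_gaps = gaps(post_pt)
sft_gaps = gaps(post_sft)
print("post-PT gaps: ", pt_gaps)
print("post-SFT gaps:", sft_gaps)

# Domain with the largest post-SFT gap (excluding the average column).
domain_gaps = {col: gap for col, gap in sft_gaps.items() if col != "Avg"}
largest_domain = max(domain_gaps, key=domain_gaps.get)
print("largest post-SFT domain gap:", largest_domain)
```

On these numbers, the average gap grows between the two stages, and the per-domain breakdown after SFT shows where the lead concentrates.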

Ablation Study

Catch-Up failure + latent value of high-quality data (Table 4):

| Model | Avg | Math | Note |
|---|---|---|---|
| Mbase + SFT_SHQ | 29.92 | 42.79 | Baseline |
| Mbase + SFT_SHQ (2× epochs) | 34.01 | 48.05 | Doubled SFT still fails to catch up |
| MSHQ + SFT_SHQ | 37.33 | 50.52 | Weakest reasoning base exceeds doubled baseline |
| MLDQ + SFT_SHQ | 46.70 | 60.79 | Advantage of diverse PT persists |
| MLMQ + SFT_SHQ | 50.95 | 64.67 | High-quality latent gain activated by SFT |
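The catch-up test amounts to comparing the doubled-SFT baseline against the weakest reasoning base. A small sketch of that comparison on the average scores quoted above (the dictionary keys and variable names are illustrative, not the paper's):

```python
# Average scores copied from the ablation table above.
results = {
    "M_base + SFT_SHQ": 29.92,
    "M_base + SFT_SHQ (2x epochs)": 34.01,
    "M_SHQ + SFT_SHQ": 37.33,
    "M_LDQ + SFT_SHQ": 46.70,
    "M_LMQ + SFT_SHQ": 50.95,
}

doubled = results["M_base + SFT_SHQ (2x epochs)"]

# Configurations whose base model saw reasoning data during pre-training.
reasoning_bases = {
    name: score for name, score in results.items()
    if not name.startswith("M_base")
}
weakest_name, weakest_score = min(reasoning_bases.items(), key=lambda kv: kv[1])

# The Catch-Up Hypothesis would require the doubled-SFT baseline to at
# least match the weakest reasoning base.
catch_up_holds = doubled >= weakest_score
shortfall = round(weakest_score - doubled, 2)

# Gain from doubling SFT epochs on the baseline.
doubling_gain = round(doubled - results["M_base + SFT_SHQ"], 2)

# Extra gain of M_LMQ over M_LDQ after the same SFT (the "latent" effect).
latent_gain = round(results["M_LMQ + SFT_SHQ"] - results["M_LDQ + SFT_SHQ"], 2)

print(f"weakest reasoning base: {weakest_name} ({weakest_score})")
print(f"catch-up holds: {catch_up_holds}, shortfall: {shortfall} points")
print(f"gain from doubling SFT: {doubling_gain} points")
print(f"latent gain (M_LMQ vs. M_LDQ): {latent_gain} points")
```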

Pre-training ratio sensitivity (Table 6, MLMQ): Shifting the \(D_{base}:D_{res}\) mix from 80/20 to 60/40 (i.e., doubling the reasoning share) improves the overall score from 64.07 to 67.28, with simultaneous increases in math/science/code and no degradation in general tasks.

Key Findings

  • The Catch-Up Hypothesis is refuted: \(M_{base}\) with doubled SFT cannot catch up to even the weakest reasoning base, indicating SFT cannot replace the reasoning foundation established during pre-training.
  • Asymmetric Allocation Principle: Pre-training favors diversity and scale (MLDQ's diversity brings a gain on the order of +11% over MSHQ), while SFT favors quality (high-quality \(D_{SHQ}\) brings a gain on the order of +15%).
  • Latent Effect: High-quality but narrow data yields almost no immediate benefit in the pre-training stage but "unlocks" an additional +4.25% gain after SFT (MLMQ vs. MLDQ).
  • Blindly expanding SFT is harmful: Expanding SFT with large-scale mixed-quality data yields no average gain and decreases math accuracy by ~5%; however, increasing high-quality data by only 0.4% provides sustainable improvements.

Highlights & Insights

  • Systematic study of "when to feed reasoning data" from first principles: Conclusions are drawn under strictly controlled budgets, a fully crossed design, and a three-stage pipeline (Pre-training/SFT/RL), offering a solid methodology that goes beyond the "more is better" intuition.
  • Actionable asymmetric principle: The heuristic "diversity for pre-training, quality for SFT" directly guides data procurement and allocation decisions.
  • Unexpectedly large gains in science domains: Unlike most post-training work, which mainly affects math, this study finds the most significant gap in science, suggesting that early reasoning data helps the model build cross-domain, transferable abstract/logical internal representations rather than just memorize facts.
  • Discovery of Latent Effects: The value of high-quality data is "delayed" until the SFT phase, pointing to deeper synergistic mechanisms between pre-training and post-training.

Limitations & Future Work

  • Validated only on 8B hybrid architecture + 1T tokens (with 1.2B Transformer confirming trends); scaling laws for larger models or longer training remain to be confirmed.
  • Reasoning ratio is an empirical knob; the optimal ratio varies by domain and dataset. Increasing the ratio strengthens reasoning but slightly harms instruction following (breadth–alignment trade-off), requiring systematic exploration per deployment domain.
  • Dataset quality/diversity is defined by heuristics (answer length, source mix); lacks finer-grained quality metrics.
  • The RL stage only compared two extreme bases; the RL behavior of intermediate configurations is not fully characterized.

Related Work & Insights

  • Post-training reasoning paradigms (Long-CoT SFT, Guha et al. 2025, etc.): This paper shows that the ceiling of these methods is constrained by the pre-training base, which both complements that line of work and marks its boundary.
  • Pre-training/Mid-training reasoning injection (Cheng et al. 2024, etc.): This study extends "small-scale CoT injection during mid-training" to "large-scale end-to-end pre-training injection" and quantifies synergy with post-training.
  • Insights for Practice: Data engineering should shift from "independent stage-wise optimization" to "collaborative allocation across the pipeline"—prioritize diverse, large-scale reasoning corpora in pre-training to establish transferable priors, and use high-quality long CoT for targeted refinement in SFT, avoiding signal dilution with noisy data.

Rating

  • Novelty: ⭐⭐⭐⭐ First systematic study of cross-stage allocation under controlled budgets; asymmetric principles and latent effects are valuable new discoveries.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Three-stage design with multiple ablations and cross-architecture validation; limited to 8B scale.
  • Writing Quality: ⭐⭐⭐⭐ Clear problem statement and actionable conclusions.
  • Value: ⭐⭐⭐⭐ Provides direct guidance for data strategy in the industry.