Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=VmEkhV2yCX
Code: To be confirmed
Area: LLM Reasoning / Training Data Configuration
Keywords: Reasoning data, Pre-training, Supervised Fine-Tuning, Data scaling, Reinforcement Learning

TL;DR

Under a fixed reasoning token budget, this study systematically examines whether reasoning data is better placed in pre-training or in post-training. It finds that front-loading reasoning data into pre-training builds a persistent advantage that SFT cannot compensate for, and it proposes an asymmetric data allocation principle: "diversity for pre-training, quality for SFT."

Background & Motivation

Background: The current mainstream paradigm for enhancing LLM reasoning capabilities is to inject high-quality, long-CoT reasoning data during the post-training phase (mid-training / SFT / RL), treating reasoning as a specialized skill layered on a general base.
Limitations of Prior Work: The role of reasoning data during pre-training is largely unexplored, because pre-training corpora of frontier models are opaque and end-to-end pre-training experiments are costly. Community research therefore concentrates on the more accessible post-training phase and lacks a systematic comparison of when reasoning data should be introduced.
Key Challenge: Under a controlled token budget, does injecting reasoning data early (in pre-training) yield better results, or does it cause overfitting that harms generalization? Can subsequent SFT let a "reasoning-poor" base catch up? Competing intuitions exist on both questions, and neither has been resolved empirically.
Goal: To conduct the first systematic study of the impact of reasoning data—in terms of scale, diversity, and quality—injected at different training stages on the final model (after RL), providing a data allocation guide for the entire training pipeline.
Core Idea: [Front-Loading Reasoning] Move reasoning data forward into pre-training. Using a fully crossed experimental design that strictly controls the total reasoning token budget (80B), the study quantifies the synergy, redundancy, and trade-offs between pre-training and SFT, and concludes that "early investment yields compound returns."

Method

Overall Architecture

The authors formalize the problem as a budget-constrained data allocation optimization: given a fixed total reasoning data budget \(B = |D^{PT}_{res}| + |D^{SFT}_{res}|\), find the optimal configuration for the pre-training side \(D^{PT}_{res}\) and SFT side \(D^{SFT}_{res}\) to maximize the expected accuracy \(P(D^{PT}_{res}, D^{SFT}_{res}) = \mathbb{E}_{t\sim T}[\mathrm{Acc}(f_{\theta_{SFT}}(t))]\) on a downstream reasoning task set \(T\). To this end, all experiments run on the same three-stage pipeline—Pre-training → SFT → RL—with a fully crossed comparison of carefully designed dataset variants.

graph LR
    A[Dbase General Corpus 6.2T] --> B[Pre-training 1T tokens<br/>600B Dbase + 400B 80/20 Mix with Dres]
    B --> C{4 Base Models<br/>Mbase / MSHQ / MLDQ / MLMQ}
    C --> D[SFT<br/>Fine-tune with DSHQ/DLDQ/DLMQ]
    D --> E[RL GRPO<br/>Verifiable Rewards]
    E --> F[Evaluation across math/science/code]
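
The token split in the diagram fixes the reasoning budget by simple arithmetic. The sketch below reproduces that accounting; the class and function names are illustrative and not taken from the paper, and the 10B-token dataset size in the last step is a hypothetical placeholder used only to show how upsampling would fill the budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainSchedule:
    """Token accounting for a two-phase pre-training run (counts in tokens)."""

    total_tokens: int        # full run length
    base_only_tokens: int    # phase 1: general corpus only
    reasoning_percent: int   # share of reasoning data in phase 2, in percent

    @property
    def mixed_phase_tokens(self) -> int:
        # Phase 2 is whatever remains after the base-only phase.
        return self.total_tokens - self.base_only_tokens

    @property
    def reasoning_tokens(self) -> int:
        # Integer arithmetic keeps the budget exact.
        return self.mixed_phase_tokens * self.reasoning_percent // 100


def epochs_to_fill(budget_tokens: int, dataset_tokens: int) -> float:
    """Number of passes over a dataset needed to fill a token budget."""
    return budget_tokens / dataset_tokens


BILLION = 10**9
schedule = PretrainSchedule(
    total_tokens=1000 * BILLION,      # 1T tokens
    base_only_tokens=600 * BILLION,   # first 600B: D_base only
    reasoning_percent=20,             # 80/20 mix in the remaining phase
)

print(schedule.mixed_phase_tokens // BILLION)  # 400
print(schedule.reasoning_tokens // BILLION)    # 80

# HYPOTHETICAL dataset size, for illustration only (not a figure from the paper).
hypothetical_dataset_tokens = 10 * BILLION
print(epochs_to_fill(schedule.reasoning_tokens, hypothetical_dataset_tokens))  # 8.0
```

Holding `reasoning_tokens` constant across base models is what lets the comparison attribute differences to the kind of data and the stage, not to the amount.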

Key Designs

1. Full Cross-Data Matrix: Decomposing "Quality × Diversity × Scale" into controllable variables. The authors curate four datasets around reasoning data \(D_{res}\) to decouple dimensions: the large-scale and diverse \(D_{LDQ}\) (268M samples, 56% Math/17% Code/27% Science and General, mixed quality, representing "quantity over quality"), the small-scale high-quality \(D_{SHQ}\) (1.2M strong teacher long-CoT samples, representing "high quality but narrow"), the mixed-quality union of the two \(D_{LMQ}\), and the "complexity isolated" subset \(D_{ALF}\) (7.1M) filtered by answer length >4096 tokens. Four base models are trained: a baseline with no reasoning data \(M_{base}\), and \(M_{LDQ}/M_{SHQ}/M_{LMQ}\), with their mean denoted as \(M_{res}\). This matrix allows "early/late, diverse/high-quality" to become independent knobs.

2. Controlled Token Budget and Ratio: Ensuring fairness across experiments. All bases are pre-trained from scratch for 1T tokens—the first 600B uses \(D_{base}\) exclusively, and the latter 400B uses a mix of 80% \(D_{base}\) + 20% \(D_{res}\). Thus, all experiments share a constant 80B reasoning token budget; small datasets (e.g., \(D_{SHQ}\)) are upsampled to maintain the same token count, isolating "what data and which stage" as variables. The base models use an 8B Mamba2 + Self-Attention + FFN hybrid Transformer, trained on 512 H100s.

3. Three-stage Synergy + RL Sustainability Test. After pre-training, each base undergoes SFT on different \(D_{res}\) (4.8M samples, 32k context), forming a 4×3 cross-evaluation to test three hypotheses: the Catch-Up Hypothesis (can \(M_{base}\) catch up via doubled SFT), the Influence of Diversity (broad vs. deep pre-training for SFT absorption), and the Marginal Utility of SFT Quality. Finally, GRPO + verifiable rewards (based on NEMOTRON-CROSSTHINK) are used for RL to check if early reasoning gains are sustainable and can translate into a decisive advantage in expert-level tasks like AIME.
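
The fully crossed design described above can be enumerated mechanically. The following is a minimal sketch, assuming the four base models and three SFT datasets named in this section; the dictionary layout and helper names are illustrative, not the authors' tooling.

```python
from itertools import product

BASE_MODELS = ("M_base", "M_SHQ", "M_LDQ", "M_LMQ")
SFT_DATASETS = ("D_SHQ", "D_LDQ", "D_LMQ")


def build_grid(bases, sft_sets):
    """Enumerate every (base model, SFT dataset) pairing."""
    return [
        {
            "base": base,
            "sft": sft,
            # Only M_base saw no reasoning data during pre-training.
            "reasoning_in_pretraining": base != "M_base",
        }
        for base, sft in product(bases, sft_sets)
    ]


def catch_up_pairs(grid):
    """Pair each reasoning-base run with the M_base run on the same SFT data."""
    return [
        (("M_base", cfg["sft"]), (cfg["base"], cfg["sft"]))
        for cfg in grid
        if cfg["reasoning_in_pretraining"]
    ]


grid = build_grid(BASE_MODELS, SFT_DATASETS)
print(len(grid))                  # 12 configurations (4 x 3)
print(len(catch_up_pairs(grid)))  # 9 baseline-vs-reasoning comparisons
```

Each pair holds the SFT dataset fixed and varies only the base model, which is the comparison the Catch-Up Hypothesis needs.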

Key Experimental Results

Main Results

Accuracy after pre-training (Table 1) and three-stage evolution (Table 2/3):

Stage	Model	Avg	Math	Science	Code
Post-PT	Mbase	52.70	47.17	47.13	40.89
Post-PT	MLDQ	64.09	75.56	54.38	49.94
Post-PT	Mres (Mean)	61.05	66.84	51.92	48.95
Post-SFT	Mbase+SFT	26.62	34.48	20.92	7.09
Post-SFT	Mres+SFT	35.92	40.61	34.77	16.75
Post-RL	Mbase+SFT_SHQ+RL	37.92	—	—	—
Post-RL	MLMQ+SFT_SHQ+RL	56.66	—	—	—

Pre-training creates an initial average gap of +8.35 points (61.05 vs. 52.70), which widens to +9.30 points after SFT (35.92 vs. 26.62). After RL, MLMQ is reported to lead the baseline by +18.57%, and by as much as +39.32% on AIME competition math, consistent with the "early investment, compound returns" framing.
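
The post-pre-training and post-SFT gaps can be recomputed from the table rows above. This is a small sketch using only those four rows; the dictionary keys and helper names are illustrative.

```python
# Scores copied from the Post-PT and Post-SFT rows of the table above.
SCORES = {
    "post_pt": {
        "M_base": {"avg": 52.70, "math": 47.17, "science": 47.13, "code": 40.89},
        "M_res": {"avg": 61.05, "math": 66.84, "science": 51.92, "code": 48.95},
    },
    "post_sft": {
        "M_base": {"avg": 26.62, "math": 34.48, "science": 20.92, "code": 7.09},
        "M_res": {"avg": 35.92, "math": 40.61, "science": 34.77, "code": 16.75},
    },
}

DOMAINS = ("math", "science", "code")


def gap(stage, metric):
    """Reasoning-base score minus baseline score, in accuracy points."""
    rows = SCORES[stage]
    return round(rows["M_res"][metric] - rows["M_base"][metric], 2)


def widest_domain(stage):
    """Domain with the largest reasoning-vs-baseline gap at a given stage."""
    return max(DOMAINS, key=lambda domain: gap(stage, domain))


for stage in SCORES:
    print(stage, gap(stage, "avg"), widest_domain(stage))
# post_pt 8.35 math
# post_sft 9.3 science
```

On these rows the widest per-domain gap moves from math after pre-training to science after SFT.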

Ablation Study

Catch-Up failure + latent value of high-quality data (Table 4):

Model	Avg	Math	Note
Mbase + SFT_SHQ	29.92	42.79	Baseline
Mbase + SFT_SHQ (2× epochs)	34.01	48.05	Doubled SFT still fails to catch up
MSHQ + SFT_SHQ	37.33	50.52	Weakest reasoning base exceeds doubled baseline
MLDQ + SFT_SHQ	46.70	60.79	Advantage of diverse PT persists
MLMQ + SFT_SHQ	50.95	64.67	High-quality latent gain activated by SFT
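
The notes in the last column can be checked directly against the Avg column. The sketch below uses only the five averages listed above; variable names are illustrative.

```python
# Average scores copied from the ablation rows above.
TABLE_4_AVG = {
    "M_base + SFT_SHQ": 29.92,
    "M_base + SFT_SHQ (2x epochs)": 34.01,
    "M_SHQ + SFT_SHQ": 37.33,
    "M_LDQ + SFT_SHQ": 46.70,
    "M_LMQ + SFT_SHQ": 50.95,
}


def is_reasoning_base(name):
    """True when the base model saw reasoning data during pre-training."""
    return not name.startswith("M_base")


best_baseline = max(
    score for name, score in TABLE_4_AVG.items() if not is_reasoning_base(name)
)
weakest_reasoning = min(
    score for name, score in TABLE_4_AVG.items() if is_reasoning_base(name)
)

# Catch-up test: even the best baseline run stays below the weakest reasoning base.
catch_up_fails = weakest_reasoning > best_baseline
shortfall = round(weakest_reasoning - best_baseline, 2)

# Latent gain: adding high-quality data to the diverse pre-training mix.
latent_gain = round(
    TABLE_4_AVG["M_LMQ + SFT_SHQ"] - TABLE_4_AVG["M_LDQ + SFT_SHQ"], 2
)

print(catch_up_fails, shortfall, latent_gain)  # True 3.32 4.25
```

Doubling SFT epochs lifts the baseline from 29.92 to 34.01, yet it still trails the weakest reasoning base by 3.32 points.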

Pre-training ratio sensitivity (Table 6, MLMQ): Shifting the \(D_{base}:D_{res}\) mix from 80/20 to 60/40, i.e., raising the reasoning share from 20% to 40%, improves the overall score from 64.07 to 67.28, with simultaneous gains in math, science, and code and no degradation on general tasks.

Key Findings

The Catch-Up Hypothesis is refuted: \(M_{base}\) with doubled SFT cannot catch up to even the weakest reasoning base, indicating SFT cannot replace the reasoning foundation established during pre-training.
Asymmetric Allocation Principle: Pre-training favors diversity and scale (the diversity of MLDQ brings a gain on the order of +11% over MSHQ), while SFT favors quality (high-quality \(D_{SHQ}\) brings a gain on the order of +15%).
Latent Effect: High-quality but narrow data yields almost no immediate benefit in the pre-training stage but "unlocks" an additional +4.25% gain after SFT (MLMQ vs. MLDQ).
Blindly expanding SFT is harmful: Expanding SFT with large-scale mixed-quality data yields no average gain and decreases math accuracy by ~5%; however, increasing high-quality data by only 0.4% provides sustainable improvements.

Highlights & Insights

Systematic study of "when to feed reasoning data" from first principles: Conclusions are drawn under strictly controlled budgets, a fully crossed experimental design, and a three-stage pipeline (Pre-training/SFT/RL), offering a solid methodology that goes beyond the "more is better" intuition.
Actionable Asymmetric Principle: The heuristic "diversity for pre-training, quality for SFT" translates directly into data procurement and allocation decisions.
Unexpectedly large gains in science: Unlike most post-training work, which mainly improves math, this study finds the largest post-SFT gap in science (+13.85 points, versus +6.13 in math), suggesting early reasoning data helps the model build cross-domain, transferable abstract and logical representations rather than just memorizing facts.
Discovery of Latent Effects: The value of high-quality data is "delayed" until the SFT stage, revealing deeper synergy between pre-training and post-training.

Limitations & Future Work

Validated only on 8B hybrid architecture + 1T tokens (with 1.2B Transformer confirming trends); scaling laws for larger models or longer training remain to be confirmed.
Reasoning ratio is an empirical knob; the optimal ratio varies by domain and dataset. Increasing the ratio strengthens reasoning but slightly harms instruction following (breadth–alignment trade-off), requiring systematic exploration per deployment domain.
Dataset quality/diversity is defined by heuristics (answer length, source mix); lacks finer-grained quality metrics.
The RL stage only compared two extreme bases; the RL behavior of intermediate configurations is not fully characterized.

Related Work & Insights

Post-training reasoning paradigms (Long-CoT SFT, Guha et al. 2025, etc.): This paper shows that the ceiling of these methods is constrained by the pre-training base, complementing them and marking their boundary.
Pre-training/Mid-training reasoning injection (Cheng et al. 2024, etc.): This study extends "small-scale CoT injection during mid-training" to "large-scale end-to-end pre-training injection" and quantifies synergy with post-training.
Insights for Practice: Data engineering should shift from "independent stage-wise optimization" to "collaborative allocation across the pipeline"—prioritize diverse, large-scale reasoning corpora in pre-training to establish transferable priors, and use high-quality long CoT for targeted refinement in SFT, avoiding signal dilution with noisy data.

Rating

Novelty: ⭐⭐⭐⭐ First systematic study of cross-stage allocation under controlled budgets; asymmetric principles and latent effects are valuable new discoveries.
Experimental Thoroughness: ⭐⭐⭐⭐ Three-stage design with multiple ablations and cross-architecture validation; limited to 8B scale.
Writing Quality: ⭐⭐⭐⭐ Clear problem statement and actionable conclusions.
Value: ⭐⭐⭐⭐ Provides direct guidance for data strategy in the industry.