Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning¶

Conference: ICLR 2026
arXiv: 2603.16127
Code: Not open-sourced
Area: LLM Pre-training
Keywords: Learning rate scheduling, pre-training, supervised fine-tuning, loss landscape, Warmup-Stable-Only

TL;DR¶

This paper proposes the Warmup-Stable-Only (WSO) learning rate schedule—completely eliminating the decay phase during pre-training. Despite yielding worse pre-training metrics, WSO consistently outperforms all decay-based schedules after SFT. Loss landscape analysis reveals that WSO's advantage stems from maintaining flatter minima.

Background & Motivation¶

State of the Field¶

The learning rate (LR) schedule is one of the most critical yet operationally complex design choices in LLM pre-training. Mainstream approaches include:

Cosine decay: The most widely used schedule since GPT-3, where LR decays along a cosine curve toward near zero.
Linear decay: Recent studies suggest that linear decay to zero achieves lower pre-training loss.
WSD (Warmup-Stable-Decay): Applies a brief decay only at the end of training; more flexible and adopted by models such as MiniCPM.

These strategies share a common objective: decaying LR toward the end of training to optimize pre-training metrics.

Root Cause¶

Existing LR strategies optimize for pre-training performance \(\mathtt{Task}_{\rm pre}(M_{\rm pre})\), whereas in practice the critical objective is post-SFT performance \(\mathtt{Task}_{\rm post}(M_{\rm post})\).

Recent work (Sun & Dredze 2025; Springer et al. 2025) explicitly demonstrates that a model with better pre-training performance does not necessarily perform better after SFT. This raises a fundamental question: when a model is intended to undergo SFT, is LR decay—chosen to optimize pre-training metrics—still the optimal choice?

Formal Definition¶

The conventional pipeline greedily optimizes each stage independently:

\[\widehat{M}_{\rm pre} = \arg\max_{M_{\rm pre} \in \mathcal{M}_{\rm pre}} \{\mathtt{Task}_{\rm pre}(M_{\rm pre}[M_{\rm rand}])\}\]

The ideal objective should instead be a joint global optimization:

\[\widehat{M}_{\rm post} = \arg\max_{(M_{\rm pre}, M_{\rm post}) \in (\mathcal{M}_{\rm pre}, \mathcal{M}_{\rm post})} \{\mathtt{Task}_{\rm post}(M_{\rm post}[M_{\rm pre}[M_{\rm rand}]])\}\]

Method¶

Overall Architecture¶

The paper evaluates four LR schedulers—WSO, WSD, Cosine, and Linear—under both two-stage (pre-training + SFT) and three-stage (pre-training + mid-training + SFT) settings.

Warmup-Stable-Only (WSO) Definition¶

WSO is a minimalist variant of WSD that directly removes the decay phase (setting \(\alpha_{\text{pre}}=1.0\)):

\[\eta^{\text{WSO}}(t, \alpha_{\text{pre}}) = \begin{cases} \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}} & t \leq T_{\text{warmup}} \\ \eta_{\max} & T_{\text{warmup}} < t \leq T_{\text{pre}} \end{cases}\]

In contrast to WSD's three-phase schedule (warmup → stable → decay), WSO retains only two phases: warmup → stable.
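As a minimal sketch of the piecewise definition above, the function below returns the WSO learning rate for a given step. The names `wso_lr`, `peak_lr`, and `warmup_steps` are illustrative choices, not identifiers from the paper.

```python
def wso_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Warmup-Stable-Only: linear warmup to peak_lr, then hold it constant."""
    # Warmup phase: eta_max * t / T_warmup for t <= T_warmup.
    if warmup_steps > 0 and step <= warmup_steps:
        return peak_lr * step / warmup_steps
    # Stable phase: the LR stays at its peak until pre-training ends.
    return peak_lr


# Example: LR at a few steps with a 4-step warmup (hypothetical values).
lrs = [wso_lr(t, peak_lr=3e-4, warmup_steps=4) for t in (0, 2, 4, 1000)]
```

Because there is no decay phase, the function needs no knowledge of the total number of training steps, which is why a WSO run can be extended without re-planning the schedule.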

Key Design: Min LR Factor Parameterization¶

All schedulers are uniformly parameterized by a minimum LR factor \(\alpha_{\text{pre}}\):

\(\alpha_{\text{pre}} = 0.0\): decay to zero (most aggressive)
\(\alpha_{\text{pre}} = 0.1\): decay to 10% of peak LR (commonly used in Llama 3, OLMo 2, etc.)
\(\alpha_{\text{pre}} = 1.0\): no decay, i.e., WSO
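The parameterization above can be sketched as one function in which every scheduler decays from the peak LR toward a floor of \(\alpha_{\text{pre}} \cdot \eta_{\max}\). This is an illustration under stated assumptions, not the paper's code: the function name, the `decay_frac` argument (the share of training spent in the WSD decay phase), and the linear shape of the WSD decay are all hypothetical choices.

```python
import math


def scheduled_lr(step, total_steps, warmup_steps, peak_lr, alpha_pre,
                 kind="wsd", decay_frac=0.1):
    """Return the LR at `step`; the schedule floor is alpha_pre * peak_lr."""
    if kind not in ("cosine", "linear", "wsd"):
        raise ValueError(f"unknown schedule kind: {kind}")
    # Linear warmup from 0 to peak_lr.
    if warmup_steps > 0 and step <= warmup_steps:
        return peak_lr * step / warmup_steps
    # Cosine and linear start decaying right after warmup; WSD holds peak_lr
    # and decays only over the last `decay_frac` of training (an assumption).
    if kind == "wsd":
        decay_start = max(warmup_steps,
                          total_steps - int(decay_frac * total_steps))
    else:
        decay_start = warmup_steps
    if step <= decay_start or decay_start >= total_steps:
        return peak_lr
    progress = min(1.0, (step - decay_start) / (total_steps - decay_start))
    if kind == "cosine":
        shape = 0.5 * (1.0 + math.cos(math.pi * progress))
    else:
        shape = 1.0 - progress  # linear ramp (also assumed for WSD's decay)
    min_lr = alpha_pre * peak_lr
    return min_lr + (peak_lr - min_lr) * shape
```

With `alpha_pre=1.0` the floor equals the peak, so every `kind` collapses to the same warmup-then-constant curve, which is the sense in which WSO is the \(\alpha_{\text{pre}} = 1.0\) end of this family.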

Mid-training LR Schedule¶

For the three-stage setting, \(\alpha_{\text{mid}}\) is introduced to control LR decay during mid-training:

\(\alpha_{\text{mid}} = 0.0\): linear decay to zero during mid-training
\(\alpha_{\text{mid}} = 1.0\): constant LR throughout mid-training
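The two endpoints above can be sketched as a linear interpolation over the mid-training stage. The function name and arguments are hypothetical, and the sketch assumes that mid-training starts from a given `start_lr` (for example, the LR at which pre-training ended) and that intermediate \(\alpha_{\text{mid}}\) values set a floor relative to that starting LR; the summary only specifies the 0.0 and 1.0 cases.

```python
def mid_training_lr(step: int, mid_steps: int, start_lr: float,
                    alpha_mid: float) -> float:
    """Linearly move from start_lr to alpha_mid * start_lr over mid-training."""
    # Fraction of mid-training completed, clamped to [0, 1].
    progress = min(max(step / mid_steps, 0.0), 1.0)
    end_lr = alpha_mid * start_lr
    return start_lr + (end_lr - start_lr) * progress


# alpha_mid = 1.0 keeps the LR constant; alpha_mid = 0.0 decays it to zero.
constant = [mid_training_lr(t, 100, 3e-4, 1.0) for t in (0, 50, 100)]
decayed = [mid_training_lr(t, 100, 3e-4, 0.0) for t in (0, 50, 100)]
```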

Loss Landscape Analysis¶

To explain why WSO performs better after SFT, the paper measures loss landscape flatness via Hessian trace (sharpness):

\[\text{Sharpness}(\theta_t) = \text{Tr}(\mathbf{H}_{\mathcal{L}}(\theta_t)) = \sum_{i=1}^{d} \frac{\partial^2 \mathcal{L}(\theta_t; \mathcal{D})}{\partial \theta_i^2}\]

The trace is estimated efficiently with Hutchinson's unbiased stochastic trace estimator, which avoids forming the full Hessian. The key finding is that WSO maintains lower sharpness (flatter minima), whereas decay-based schedules yield 2–3× higher sharpness.
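To make the estimator concrete, here is a small pure-Python sketch of Hutchinson's method: for random vectors \(v\) with independent ±1 entries, the average of \(v^\top \mathbf{H} v\) is an unbiased estimate of \(\text{Tr}(\mathbf{H})\). The toy quadratic loss and its matrix `A` are assumptions made for illustration; for an actual model, the Hessian-vector product `hvp` would come from automatic differentiation rather than an explicit matrix.

```python
import random


def hutchinson_trace(hvp, dim, num_samples=100, seed=0):
    """Estimate Tr(H) as the mean of v^T (H v) over Rademacher vectors v."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        hv = hvp(v)  # Hessian-vector product; H itself is never formed here
        total += sum(vi * hvi for vi, hvi in zip(v, hv))
    return total / num_samples


# Toy loss L(theta) = 0.5 * theta^T A theta, whose Hessian is exactly A.
A = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.3],
     [0.0, 0.3, 4.0]]


def toy_hvp(v):
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]


estimate = hutchinson_trace(toy_hvp, dim=3, num_samples=2000)
exact = sum(A[i][i] for i in range(3))  # true trace, for comparison
```

The cost is one Hessian-vector product per sample, so the estimate scales to models where storing the \(d \times d\) Hessian is infeasible.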

Key Experimental Results¶

Main Results: Two-Stage Setting (Pre-training + SFT)¶

Model architectures: 1B and 8B (Llama 3 series); pre-training data: FineWeb-Edu; SFT data: Tulu-3 SFT mixture.

Model	Scheduler	\(\alpha_{\text{pre}}\)	PT Valid Loss ↓ Δ	PT Task Avg Δ	SFT Task Avg Δ
1B	WSO	1.0	+0.071	-1.7	+0.3
1B	WSD	0.1	+0.004	-1.5	+0.0
1B	WSD	0.0	+0.000	-1.2	-1.0
1B	Linear	0.0	+0.016	+0.0	-0.9
1B	Cosine	0.1	+0.019	-0.1	-0.7
8B	WSO	1.0	+0.127	-0.8	+1.1
8B	WSD	0.1	+0.019	-0.2	-0.8
8B	WSD	0.0	+0.014	+0.0	-0.3
8B	Linear	0.0	+0.000	-1.8	+0.0

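Key finding: WSO yields the worst pre-training validation loss (0.127 above the best schedule at 8B) but the best post-SFT task average (1.1 points above the best decay-based schedule at 8B). In each Δ column, the row showing +0.000 or +0.0 appears to be the reference, i.e., the best decay-based schedule for that model size.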
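The reversal can be checked directly from the 8B rows of the table. The short sketch below selects a scheduler greedily by pre-training metrics and then by the post-SFT metric, mirroring the greedy and joint objectives in the Formal Definition section; the dictionary layout and variable names are illustrative, while the numbers are copied from the table.

```python
# (scheduler, alpha_pre) -> (PT valid loss delta, PT task avg delta, SFT task avg delta)
rows_8b = {
    ("WSO", 1.0): (0.127, -0.8, 1.1),
    ("WSD", 0.1): (0.019, -0.2, -0.8),
    ("WSD", 0.0): (0.014, 0.0, -0.3),
    ("Linear", 0.0): (0.000, -1.8, 0.0),
}

# Greedy choice: pick the schedule that looks best right after pre-training.
best_by_pt_loss = min(rows_8b, key=lambda k: rows_8b[k][0])
best_by_pt_task = max(rows_8b, key=lambda k: rows_8b[k][1])

# End-to-end choice: pick the schedule that is best after SFT.
best_by_sft = max(rows_8b, key=lambda k: rows_8b[k][2])

# WSO ranks last on pre-training loss among these rows, yet first after SFT.
worst_by_pt_loss = max(rows_8b, key=lambda k: rows_8b[k][0])
```

Selecting on either pre-training metric picks a decay-to-zero schedule, whereas selecting on the post-SFT metric picks WSO.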

Three-Stage Setting (Pre-training + Mid-training + SFT)¶

Model	Scheduler	\(\alpha_{\text{pre}}\)	\(\alpha_{\text{mid}}\)	MT Task Avg Δ	SFT Task Avg Δ
1B	WSO	1.0	1.0	-0.1	+0.8
1B	WSD	1.0	0.0	+0.0	+0.0
1B	Cosine	0.1	0.0	-3.1	-3.7
8B	WSO	1.0	1.0	-2.1	+1.1
8B	WSD	1.0	0.0	+0.0	-1.4
8B	Linear	0.1	0.0	-9.0	-3.7

Ablation Study: Over-training Setting (2T tokens)¶

Model	Scheduler	\(\alpha_{\text{pre}}\)	PT Task Avg Δ	SFT Task Avg Δ
1B	WSO	1.0	-1.5	+0.7
1B	WSD	0.1	+0.0	+0.0
1B	WSD	0.0	+0.0	-0.3

Under over-training combined with mid-training (2T + 500B tokens), WSO's advantage is even larger: SFT Task Avg Δ reaches +1.4.

Key Findings¶

Performance reversal: Schedules that decay to zero achieve the lowest pre-training loss but lose to WSO after SFT; at 1B, WSD with \(\alpha_{\text{pre}} = 0.0\) has the lowest loss and the worst SFT score (-1.0).
WSO dominates across the board: Consistently achieves the best results across 1B/8B, two-stage/three-stage, and standard/over-training settings.
Decay at any stage is harmful: In the three-stage setting, applying decay only during mid-training still degrades SFT performance.
Sharpness is negatively correlated with SFT performance: Pearson correlation \(r=-0.709\).
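For readers who want to reproduce this kind of analysis on their own checkpoints, the sketch below computes a Pearson correlation in pure Python. The input values are made-up placeholders, not measurements from the paper, and they will not reproduce the reported \(r=-0.709\).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholders: one sharpness value and one SFT score per checkpoint.
toy_sharpness = [1.0, 2.0, 3.0, 4.0]
toy_sft_delta = [0.9, 0.1, -0.2, -1.0]
toy_r = pearson_r(toy_sharpness, toy_sft_delta)
```

A negative value means that sharper checkpoints tend to score lower after SFT, which is the direction of the relationship reported above.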

Highlights & Insights¶

Counter-intuitive core finding: Better pre-training loss ≠ better downstream performance; LR decay actually impairs model adaptability.
Plausible mechanistic explanation: Loss landscape analysis links WSO's flatter minima to better post-SFT performance, although the supporting evidence is correlational (\(r=-0.709\)).
Minimal implementation complexity: WSO is simpler than any decay strategy—no decay ratio or decay phase length to tune.
High practical value: The paper recommends that open-source models be trained and released using WSO to maximize adaptability for downstream users.
Consistent across scales: Findings hold across 1B to 8B model sizes and 100B to 2T token training scales.

Limitations & Future Work¶

Only SFT is examined as a post-training method; alignment techniques such as DPO and RLHF are not evaluated.
Experiments are limited to 8B parameters; applicability to larger models (70B+) remains to be verified.
WSO incurs significantly higher pre-training loss, which may be unsuitable for scenarios requiring low pre-training loss (e.g., distillation).
The sample size for the sharpness–SFT performance correlation analysis is relatively small.

Related Work & Insights¶

Bergsma et al. 2025: Advocates linear decay to zero as optimal; however, this holds only for pre-training loss.
WSD (Hu et al. 2024): WSO can be viewed as an extreme simplification of WSD, echoing WSD's flexibility advantage.
Wen et al. 2025: Theoretical analysis of WSD finds that the decay phase increases sharpness, a problem WSO avoids entirely.
Insight: Future work should select training strategies based on the ultimate deployment objective (post-SFT/RLHF performance) rather than pre-training metrics.

Rating¶

Novelty: ⭐⭐⭐⭐ — Challenges the widely held belief that "lower decay is better"; the argument is clear and compelling.
Theoretical Depth: ⭐⭐⭐⭐ — Loss landscape analysis and formal framework are thorough.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers 2 model scales × 4 schedulers × 3 training settings × over-training; extremely comprehensive.
Value: ⭐⭐⭐⭐⭐ — Provides directly actionable recommendations with significant implications for LLM training and model release strategies.
Overall: ⭐⭐⭐⭐☆ — Rigorous experiments, counter-intuitive yet practically useful conclusions; an important contribution to pre-training strategy research.