Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning¶
Conference: ICLR 2026 arXiv: 2603.16127 Code: Not open-sourced Area: LLM Pre-training Keywords: Learning rate scheduling, pre-training, supervised fine-tuning, loss landscape, Warmup-Stable-Only
TL;DR¶
This paper proposes the Warmup-Stable-Only (WSO) learning rate schedule—completely eliminating the decay phase during pre-training. Despite yielding worse pre-training metrics, WSO consistently outperforms all decay-based schedules after SFT. Loss landscape analysis reveals that WSO's advantage stems from maintaining flatter minima.
Background & Motivation¶
State of the Field¶
Learning rate (LR) scheduling is one of the most critical yet operationally complex hyperparameters in LLM pre-training. Mainstream approaches include:
- Cosine decay: The most widely used schedule since GPT-3, where LR decays along a cosine curve toward near zero.
- Linear decay: Recent studies suggest that linear decay to zero achieves lower pre-training loss.
- WSD (Warmup-Stable-Decay): Applies a brief decay only at the end of training; more flexible and adopted by models such as MiniCPM.
These strategies share a common objective: decaying LR toward the end of training to optimize pre-training metrics.
Root Cause¶
Existing LR strategies optimize for pre-training performance \(\mathtt{Task}_{\rm pre}(M_{\rm pre})\), whereas in practice the critical objective is post-SFT performance \(\mathtt{Task}_{\rm post}(M_{\rm post})\).
Recent work (Sun & Dredze 2025; Springer et al. 2025) explicitly demonstrates that a model with better pre-training performance does not necessarily perform better after SFT. This raises a fundamental question: when a model is intended to undergo SFT, is LR decay—chosen to optimize pre-training metrics—still the optimal choice?
Formal Definition¶
The conventional pipeline greedily optimizes each stage independently: pre-training first selects \(M_{\rm pre}\) to maximize \(\mathtt{Task}_{\rm pre}(M_{\rm pre})\), and SFT then produces \(M_{\rm post}\) starting from this fixed \(M_{\rm pre}\).
The ideal objective should instead be a joint global optimization: choose pre-training decisions (including the LR schedule) so that the final \(\mathtt{Task}_{\rm post}(M_{\rm post})\) is maximized, even at the cost of a worse \(\mathtt{Task}_{\rm pre}(M_{\rm pre})\).
Method¶
Overall Architecture¶
The paper evaluates four LR schedulers—WSO, WSD, Cosine, and Linear—under both two-stage (pre-training + SFT) and three-stage (pre-training + mid-training + SFT) settings.
Warmup-Stable-Only (WSO) Definition¶
WSO is a minimalist variant of WSD that removes the decay phase entirely, i.e., it fixes \(\alpha_{\text{pre}}=1.0\) so the LR stays at its peak value after warmup.
In contrast to WSD's three-phase schedule (warmup → stable → decay), WSO retains only two phases: warmup → stable.
Key Design: Min LR Factor Parameterization¶
All schedulers are uniformly parameterized by a minimum LR factor \(\alpha_{\text{pre}}\):
- \(\alpha_{\text{pre}} = 0.0\): decay to zero (most aggressive)
- \(\alpha_{\text{pre}} = 0.1\): decay to 10% of peak LR (commonly used in Llama 3, OLMo 2, etc.)
- \(\alpha_{\text{pre}} = 1.0\): no decay, i.e., WSO
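The unified \(\alpha_{\text{pre}}\) parameterization above can be sketched as a single function; this is a minimal illustration under assumed shapes (linear warmup for all schedulers, and a linear final-phase decay for WSD spanning the last `decay_frac` of training), not the authors' code:

```python
# Sketch: LR schedules (WSO / WSD / cosine / linear) unified by the minimum
# LR factor alpha_pre, which sets the LR that decay-based schedules end at.
# Function name, decay_frac, and the warmup/decay shapes are assumptions.
import math

def lr_at(step, total_steps, peak_lr, scheduler, alpha_pre,
          warmup_steps=0, decay_frac=0.1):
    """LR at `step` for a schedule that decays toward alpha_pre * peak_lr."""
    min_lr = alpha_pre * peak_lr
    if step < warmup_steps:                      # linear warmup, shared by all
        return peak_lr * (step + 1) / warmup_steps
    if scheduler == "wso":                       # warmup -> stable; no decay
        return peak_lr                           # (equivalent to alpha_pre = 1.0)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if scheduler == "cosine":                    # cosine curve from peak to min
        return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
    if scheduler == "linear":                    # straight line from peak to min
        return peak_lr - (peak_lr - min_lr) * progress
    if scheduler == "wsd":                       # stable, then brief linear decay
        decay_start = total_steps - int(decay_frac * total_steps)
        if step < decay_start:
            return peak_lr
        d = (step - decay_start) / max(1, total_steps - decay_start)
        return peak_lr - (peak_lr - min_lr) * d
    raise ValueError(f"unknown scheduler: {scheduler}")
```

Setting `alpha_pre=0.0` reproduces the most aggressive decay-to-zero variants, while `scheduler="wso"` ignores `alpha_pre` entirely because the LR never leaves its peak.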
Mid-training LR Schedule¶
For the three-stage setting, \(\alpha_{\text{mid}}\) is introduced to control LR decay during mid-training:
- \(\alpha_{\text{mid}} = 0.0\): linear decay to zero during mid-training
- \(\alpha_{\text{mid}} = 1.0\): constant LR throughout mid-training
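The \(\alpha_{\text{mid}}\) semantics can be illustrated with a small sketch; the linear decay shape and function name are assumptions:

```python
# Sketch: mid-training LR controlled by alpha_mid. Mid-training starts from
# lr_start (the LR pre-training ended at) and decays linearly toward
# alpha_mid * lr_start; alpha_mid = 1.0 keeps the LR constant throughout.
def mid_training_lr(step, mid_steps, lr_start, alpha_mid):
    progress = step / max(1, mid_steps)
    return lr_start - (1.0 - alpha_mid) * lr_start * progress
```

Under this parameterization, the paper's fully-decay-free recipe is \(\alpha_{\text{pre}} = \alpha_{\text{mid}} = 1.0\): the LR stays at its peak through both stages.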
Loss Landscape Analysis¶
To explain why WSO performs better after SFT, the paper measures loss-landscape flatness via the Hessian trace \(\mathrm{Tr}(H)\), used as a sharpness metric.
This is computed efficiently with Hutchinson's unbiased estimator, \(\mathrm{Tr}(H) = \mathbb{E}_{v}\left[v^{\top} H v\right]\) for probe vectors \(v\) with i.i.d. Rademacher entries. The key finding is that WSO maintains lower sharpness (flatter minima), while decay-based schedules end pre-training at 2–3× higher sharpness.
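Hutchinson's estimator needs only Hessian-vector products, never the full Hessian. A minimal sketch on an explicit symmetric matrix (in real use, `hvp` would be an autodiff Hessian-vector product on the training loss; this setup is purely illustrative):

```python
# Sketch: Hutchinson's unbiased estimator of Tr(H), the sharpness metric.
# For Rademacher v (entries +/-1), E[v^T H v] = Tr(H).
import numpy as np

def hutchinson_trace(hvp, dim, num_samples=10_000, rng=None):
    """Estimate Tr(H) given only a Hessian-vector-product callable `hvp`."""
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=dim)   # Rademacher probe vector
        total += v @ hvp(v)                     # v^T H v, unbiased for Tr(H)
    return total / num_samples

# Illustrative check against the exact trace of a known symmetric matrix:
rng = np.random.default_rng(42)
A = rng.standard_normal((20, 20))
H = (A + A.T) / 2                               # stand-in symmetric "Hessian"
est = hutchinson_trace(lambda v: H @ v, dim=20, num_samples=10_000, rng=rng)
```

The estimate converges to `np.trace(H)` as the number of probe vectors grows, which is why the method scales to LLM-sized Hessians where forming \(H\) explicitly is impossible.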
Key Experimental Results¶
Main Results: Two-Stage Setting (Pre-training + SFT)¶
Model architectures: 1B and 8B (Llama 3 series); pre-training data: FineWeb-Edu; SFT data: Tulu-3 SFT mixture.
| Model | Scheduler | \(\alpha_{\text{pre}}\) | Δ PT Valid Loss (↓) | Δ PT Task Avg | Δ SFT Task Avg |
|---|---|---|---|---|---|
| 1B | WSO | 1.0 | +0.071 | -1.7 | +0.3 |
| 1B | WSD | 0.1 | +0.004 | -1.5 | +0.0 |
| 1B | WSD | 0.0 | +0.000 | -1.2 | -1.0 |
| 1B | Linear | 0.0 | +0.016 | +0.0 | -0.9 |
| 1B | Cosine | 0.1 | +0.019 | -0.1 | -0.7 |
| 8B | WSO | 1.0 | +0.127 | -0.8 | +1.1 |
| 8B | WSD | 0.1 | +0.019 | -0.2 | -0.8 |
| 8B | WSD | 0.0 | +0.014 | +0.0 | -0.3 |
| 8B | Linear | 0.0 | +0.000 | -1.8 | +0.0 |
Key finding: WSO yields the worst pre-training loss (0.127 higher for 8B) but achieves the best post-SFT performance (1.1 points higher for 8B).
Three-Stage Setting (Pre-training + Mid-training + SFT)¶
| Model | Scheduler | \(\alpha_{\text{pre}}\) | \(\alpha_{\text{mid}}\) | Δ MT Task Avg | Δ SFT Task Avg |
|---|---|---|---|---|---|
| 1B | WSO | 1.0 | 1.0 | -0.1 | +0.8 |
| 1B | WSD | 1.0 | 0.0 | +0.0 | +0.0 |
| 1B | Cosine | 0.1 | 0.0 | -3.1 | -3.7 |
| 8B | WSO | 1.0 | 1.0 | -2.1 | +1.1 |
| 8B | WSD | 1.0 | 0.0 | +0.0 | -1.4 |
| 8B | Linear | 0.1 | 0.0 | -9.0 | -3.7 |
Ablation Study: Over-training Setting (2T tokens)¶
| Model | Scheduler | \(\alpha_{\text{pre}}\) | Δ PT Task Avg | Δ SFT Task Avg |
|---|---|---|---|---|
| 1B | WSO | 1.0 | -1.5 | +0.7 |
| 1B | WSD | 0.1 | +0.0 | +0.0 |
| 1B | WSD | 0.0 | +0.0 | -0.3 |
Under over-training combined with mid-training (2T + 500B tokens), WSO's advantage is even larger: SFT Task Avg Δ reaches +1.4.
Key Findings¶
- Performance reversal: The scheduler with the best pre-training performance (decay to zero) performs worst after SFT.
- WSO dominates across the board: Consistently achieves the best results across 1B/8B, two-stage/three-stage, and standard/over-training settings.
- Decay at any stage is harmful: In the three-stage setting, applying decay even only during mid-training degrades SFT performance.
- Sharpness is negatively correlated with SFT performance: Pearson correlation \(r=-0.709\).
Highlights & Insights¶
- Counter-intuitive core finding: Better pre-training loss ≠ better downstream performance; LR decay actually impairs model adaptability.
- Clear theoretical explanation: Loss landscape analysis provides a complete causal chain from flat minima to better post-SFT performance.
- Minimal implementation complexity: WSO is simpler than any decay strategy—no decay ratio or decay phase length to tune.
- High practical value: The paper recommends that open-source models be trained and released using WSO to maximize adaptability for downstream users.
- Consistent across scales: Findings hold across 1B to 8B model sizes and 100B to 2T token training scales.
Limitations & Future Work¶
- Only SFT is examined as a post-training method; alignment techniques such as DPO and RLHF are not evaluated.
- Experiments are limited to 8B parameters; applicability to larger models (70B+) remains to be verified.
- WSO incurs significantly higher pre-training loss, which may be unsuitable for scenarios requiring low pre-training loss (e.g., distillation).
- The sample size for the sharpness–SFT performance correlation analysis is relatively small.
Related Work & Insights¶
- Bergsma et al. 2025: Advocate linear decay to zero as optimal; however, that claim holds only for pre-training loss.
- WSD (Hu et al. 2024): WSO can be viewed as an extreme simplification of WSD, echoing WSD's flexibility advantage.
- Wen et al. 2025: Theoretical analysis of WSD finds that the decay phase increases sharpness, a problem WSO avoids entirely.
- Insight: Future work should select training strategies based on the ultimate deployment objective (post-SFT/RLHF performance) rather than pre-training metrics.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Challenges the widely held belief that "lower decay is better"; the argument is clear and compelling.
- Theoretical Depth: ⭐⭐⭐⭐ — Loss landscape analysis and formal framework are thorough.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers 2 model scales × 4 schedulers × 3 training settings × over-training; extremely comprehensive.
- Value: ⭐⭐⭐⭐⭐ — Provides directly actionable recommendations with significant implications for LLM training and model release strategies.
- Overall: ⭐⭐⭐⭐☆ — Rigorous experiments, counter-intuitive yet practically useful conclusions; an important contribution to pre-training strategy research.