SkyLadder: Better and Faster Pretraining via Context Window Scheduling¶
Conference: NeurIPS 2025 arXiv: 2503.15450 Code: https://github.com/sail-sg/SkyLadder Area: LLM Efficiency Keywords: Context window scheduling, pretraining efficiency, long context, attention mechanism, training stability
TL;DR¶
SkyLadder, a progressive short-to-long context window scheduling strategy, achieves superior pretraining efficiency (22% training time saved) and improved model performance (+3.7%) under a fixed compute budget, challenging the prevailing belief that "longer context = better performance."
Background & Motivation¶
Background: The context windows of LLMs have expanded steadily (from GPT's 512 tokens to Llama-3's 8K, and on to 128K+ in recent models), yet fair comparisons between context lengths under a fixed token budget have been lacking.
Limitations of Prior Work: Under strictly controlled conditions, short-context models consistently outperform long-context models on standard benchmarks—contradicting widely held assumptions in the field.
Key Challenge: Models must simultaneously achieve strong performance on standard tasks and long-sequence processing capability, yet strategies for balancing the two under a fixed budget have not been systematically explored.
Goal: How can one fully exploit the training efficiency advantage of short contexts while ensuring the model ultimately acquires long-sequence understanding capability?
Key Insight: Drawing an analogy to curriculum learning—progressively advancing from short context (simple) to long context (complex), allowing the model to master fundamentals before extending its capabilities.
Core Idea: A progressive short-to-long window scheduling strategy implemented via modified attention masks, requiring no changes to data or architecture—scheduling alone yields dual gains.
Method¶
Overall Architecture¶
SkyLadder divides pretraining into progressive stages: starting from a very short window (32 tokens) and linearly expanding to the target window (8K/32K), with a block-wise attention mask \(M_{ij}\) controlling the effective context length at each step.
Key Designs¶
- Linear Window Expansion Strategy (see the code sketch after this list):
- Function: Gradually increases the effective context window at a fixed rate.
- Mechanism: \(w(t) = \min(w_e, w_s + \lfloor\alpha t\rfloor)\), where \(w_s=32\) (initial), \(w_e\) = target window, \(\alpha=1/8\) (expansion rate). Implemented via block-wise attention masks without modifying data packing.
- Design Motivation: Linear scheduling achieves the best performance among six evaluated strategies (linear / sinusoidal / exponential / staircase / continual fine-tuning / Dataset Decomposition).
- Block-wise Attention Mask:
- Function: Restricts each position to attending only to preceding tokens within the current window.
- Mechanism: \(M_{ij} = 0\) if \(\lfloor i/w \rfloor \cdot w \leq j \leq i\), else \(-\infty\). Orthogonally composable with strategies such as IntraDoc Mask.
- Design Motivation: Mask-based implementation avoids data re-packing and introduces no domain bias (vs. Dataset Decomposition, which segments data by length and thereby over-represents long-text domains such as books).
- Training Stability Gain:
- Function: The short-window phase provides more stable training dynamics.
- Mechanism: Loss variance during the short-window phase is 0.023 vs. 0.041 for long windows (roughly a 1.8× reduction), with lower and more consistent gradient norms.
- Design Motivation: Attention logits tend to explode on long sequences; progressive expansion allows the model to adapt smoothly.
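The two core mechanisms above are simple enough to sketch directly. Below is a minimal, illustrative PyTorch sketch of the linear schedule \(w(t)\) and the block-wise additive mask \(M_{ij}\) as defined in this list; the function names and the defaults (\(w_s=32\), \(w_e=8192\), \(\alpha=1/8\)) follow the description here, not the official SkyLadder implementation.

```python
import torch


def window_size(step: int, w_start: int = 32, w_end: int = 8192,
                alpha: float = 1 / 8) -> int:
    """Linear expansion schedule: w(t) = min(w_e, w_s + floor(alpha * t))."""
    return min(w_end, w_start + int(alpha * step))


def blockwise_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Additive mask with M[i, j] = 0 iff floor(i/w)*w <= j <= i, else -inf,
    i.e. causal attention restricted to the current block of size `window`."""
    idx = torch.arange(seq_len)
    i, j = idx.unsqueeze(1), idx.unsqueeze(0)   # (L, 1) and (1, L) index grids
    block_start = (i // window) * window        # start of position i's block
    allowed = (j >= block_start) & (j <= i)     # block-local causal pattern
    mask = torch.zeros(seq_len, seq_len)
    mask[~allowed] = float("-inf")
    return mask                                 # added to attention logits


# Example: at step 4096 the window has grown to 32 + 4096/8 = 544 tokens.
w = window_size(step=4096)
mask = blockwise_causal_mask(seq_len=1024, window=w)
```

Because the mask is purely additive, one way to realize the composability mentioned above is to take the element-wise minimum with an intra-document mask, so that a position may attend only where both masks allow it.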
Loss & Training¶
Standard language-modeling loss with no additional loss terms. Key hyperparameters: the initial window \(w_s=8\) (a smaller start gives a larger scheduling range), the expansion rate \(\alpha=1/8\) (balancing standard and long-context tasks), and the schedule shape (linear). The schedule is orthogonal to learning-rate scheduling.
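To make concrete that the objective is untouched, here is a small self-contained toy example (random tensors, not the paper's setup) reusing `window_size` and `blockwise_causal_mask` from the sketch above: the scheduled mask is passed to PyTorch's `scaled_dot_product_attention`, and the loss is plain next-token cross-entropy.

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch, heads, sequence length, head dim, vocabulary size.
B, H, L, D, V = 2, 4, 256, 64, 32000
tokens = torch.randint(0, V, (B, L))
q = torch.randn(B, H, L, D)
k = torch.randn(B, H, L, D)
v = torch.randn(B, H, L, D)

# Only the attention mask depends on the training step.
w = window_size(step=1000)                   # effective window at this step
attn_mask = blockwise_causal_mask(L, w)      # (L, L) additive float mask

# SDPA broadcasts the (L, L) mask over batch and heads.
ctx = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)

# Stand-in LM head; in a real model this follows the full transformer stack.
lm_head = torch.nn.Linear(H * D, V)
logits = lm_head(ctx.transpose(1, 2).reshape(B, L, H * D))

# Standard next-token cross-entropy; no additional loss terms.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, V), tokens[:, 1:].reshape(-1))
loss.backward()
```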
Key Experimental Results¶
Main Results (1B model, 100B tokens)¶
| Method | Avg. Accuracy | ARC-E | HellaSwag | MMLU | Gain |
|---|---|---|---|---|---|
| Random Baseline | 46.3% | 58.0 | 43.0 | 29.9 | - |
| + SkyLadder | 50.0% | 65.4 | 47.0 | 32.4 | +3.7% |
| IntraDoc Baseline | 47.4% | 61.8 | 45.6 | 30.5 | - |
| + SkyLadder | 49.3% | 64.8 | 47.9 | 31.8 | +1.9% |
Ablation Study¶
| Expansion Rate \(\alpha\) | Standard Tasks | Long Tasks | Training Time Saved |
|---|---|---|---|
| 1/12 (slowest) | 46.8 | 13.1 | 15% |
| 1/8 (default) | 48.6 | 14.1 | 13.1% |
| 1 (fastest) | 47.2 | 12.3 | 8% |
Scaling Behavior¶
| Model Size | Baseline | + SkyLadder | Gain |
|---|---|---|---|
| 120M | 40.1% | 41.2% | +1.1% |
| 360M | 47.2% | 49.6% | +2.4% |
| 3B | 57.0% | 60.5% | +3.5% |
Gains increase with model size. For the 32K context setting: 22.2% training time saved and 26.3% FLOPs saved.
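A rough back-of-envelope sketch (my own accounting, not the paper's) of where such savings come from: for block-wise attention, attention FLOPs per step scale roughly linearly with the effective window, so the schedule avoids attention cost in proportion to how far below the target window it sits on average. The `attn_share` parameter below, the fraction of total FLOPs spent in attention at the full 32K window, is a hypothetical assumption.

```python
def estimated_flops_saving(total_steps: int, schedule_steps: int,
                           w_start: int, w_target: int,
                           attn_share: float) -> float:
    """Approximate fraction of total FLOPs saved over the whole run, assuming
    attention cost per step is proportional to the effective window."""
    saved = 0.0
    for t in range(total_steps):
        # Linear schedule, clipped at the target window.
        w = min(w_target, w_start + (w_target - w_start) * t / schedule_steps)
        saved += attn_share * (1.0 - w / w_target)
    return saved / total_steps


# Toy illustration: a schedule spanning the whole run with attention assumed
# to be ~40% of FLOPs at a 32K window avoids roughly 20% of total FLOPs,
# the same order of magnitude as the reported 26.3% (not a reproduction).
print(estimated_flops_saving(total_steps=10_000, schedule_steps=10_000,
                             w_start=32, w_target=32_768, attn_share=0.4))
```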
Key Findings¶
- Counter-intuitive: Under a fixed token budget, short-context models consistently outperform long-context models on standard benchmarks.
- Dual gain: SkyLadder simultaneously improves both performance and efficiency—+3.5% accuracy and 22% time saved for the 3B model.
- Nature of IntraDoc: Its success stems not from denoising but from implicitly inducing a shorter effective window length distribution.
- Training stability: Short-window training is more stable (44% lower loss variance, 20% lower gradient norms).
Highlights & Insights¶
- Challenging conventional wisdom: Rigorous controlled experiments demonstrate that "larger context ≠ better performance," with fundamental implications for pretraining strategy.
- Extreme simplicity: Only the attention mask is modified; data, architecture, and loss function remain unchanged. The method requires only three hyperparameters.
- Training stability perspective: This work is the first to quantify the stability advantage of short-window training (loss variance, gradient norms, attention logits), providing a mechanistic explanation for the observed performance gains.
Limitations & Future Work¶
- The largest evaluated model is only 3B; validation at the 13B/70B scale is absent.
- No theoretical explanation is provided for why linear scheduling is optimal.
- Long-task evaluation relies on synthetic benchmarks (RULER); evaluation on naturally occurring long documents is insufficient.
- The relationship between hyperparameters and model size or data quality has not been systematically explored.
Related Work & Insights¶
- vs. Dataset Decomposition: Segmenting data by length introduces domain bias (long documents ≈ books); SkyLadder leaves the data distribution unchanged.
- vs. continual fine-tuning approaches (YaRN, ProLong): these extend an already-pretrained model, whereas SkyLadder schedules the window during pretraining from scratch and thus avoids a separate long-context fine-tuning stage.
- vs. GrowLength (Jin et al., 2023): GrowLength was validated only on ~400M-parameter models; SkyLadder provides the first systematic validation at the 3B scale.
Rating¶
- Novelty: ⭐⭐⭐⭐ The curriculum learning inspiration is not entirely new, but the empirical finding that challenges the long-context assumption carries genuine innovative value.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Rigorous variable control, comprehensive ablations, cross-scale validation from 120M to 3B, and public code release.
- Writing Quality: ⭐⭐⭐⭐ Clear logical structure, high information density in figures and tables, and in-depth quantitative analysis of training stability.
- Value: ⭐⭐⭐⭐⭐ A directly deployable engineering solution with dual benefits of 22% speedup and +3.7% accuracy gain, immediately applicable to industrial-scale pretraining.