How to Train Long-Context Language Models (Effectively)
Conference: ACL 2025
arXiv: 2410.02660
Area: LLM Efficiency
Keywords: Long-context training, Continual pre-training, Data mixing, Supervised fine-tuning, Positional extrapolation
TL;DR
This paper systematically studies how to train long-context language models effectively through continual pre-training and SFT. It identifies a set of key design choices covering data mixing ratios, training length scaling, and evaluation protocols. The resulting ProLong-8B achieves state-of-the-art results among similarly sized models at 128K context length while using only 5% of Llama-3.1's long-context training data.
Background & Motivation
- Urgent application needs: Long-context LMs have unlocked new applications such as book summarization, many-shot ICL, and long-document RAG. However, adapting pre-trained models to 128K+ contexts poses challenges in both infrastructure and data.
- Training-free extrapolation is unreliable: Training-free methods that modify the RoPE frequency base (e.g., YaRN, Dynamic NTK) do not reliably pass even simple Needle-in-a-Haystack (NIAH) tasks, so models still require further training on billions of long-document tokens.
- Opaque design decisions: Frontier open-source models (e.g., Llama-3.1) adopt the "long-context continual pre-training \(\rightarrow\) SFT" paradigm, but key decisions such as data mixing ratios, sequence length selection, and SFT data types are not fully transparent to the community.
- Flawed evaluation methods: Existing evaluation approaches (perplexity, NIAH) are unreliable. NIAH is already saturated for strong models (both Llama-3.1-8B and 70B score 100), while perplexity correlates poorly with downstream performance (training on 100% long data keeps improving PPL but severely degrades downstream task performance).
Method
Overall Architecture
A three-stage approach is adopted: (1) Establish a reliable evaluation protocol based on HELMET to guide model development; (2) Conduct two-stage long-context continual pre-training (64K \(\rightarrow\) 512K) starting from Llama-3-8B-Instruct; (3) Perform SFT on the short instruction dataset UltraChat. This yields ProLong-8B, which supports a 512K token context window with only 40B total training tokens.
Key Designs
Design 1: Reliable Evaluation Protocol — Post-SFT Evaluation + Multi-Task Benchmark
The HELMET evaluation suite is adopted to cover six categories of downstream tasks (Recall, RAG, Re-ranking, ICL, QA, Summarization) instead of relying on perplexity or NIAH. Core finding: evaluation must be conducted after SFT—some improvements in long-context capabilities (e.g., RAG, Re-ranking) only become apparent post-SFT. Concurrently, short-context benchmarks (HellaSwag, MMLU, ARC-c, WinoGrande, GSM8K) are tracked to ensure no performance degradation. The evaluation protocol comparison is shown below:
| Evaluation Method | NIAH | Recall | RAG | Re-rank | Distinguish Strong/Weak Models? |
|---|---|---|---|---|---|
| NIAH only | 100 | - | - | - | ✗ (Strong models saturated) |
| PPL only | ↓ | - | - | - | ✗ (Inconsistent with downstream) |
| HELMET (Post-SFT) | 100 | 99.4 | 56.3 | 37.0 | ✓ (Multi-dimensional distinction) |
Design 2: Data Mixing Strategy — 60% Long + 40% High-quality Short Mixture
Ablations on long data sources reveal that code repositories (concatenating all files from the same repo into a single document, 98.8B long tokens) and books (33.2B long tokens) are the optimal long data sources, with a 1:1 mixture performing best. Ablations on short-to-long ratios show that 60% long data + 40% short data is the optimal ratio; 100% long data severely harms downstream long-context tasks. The short data mixture design, ProLong ShortMix, preserves mathematical reasoning capabilities:
| Short Data Component | Ratio | Function |
|---|---|---|
| FineWeb | 25% | General web text |
| FineWeb-Edu | 25% | Educational web text |
| Wikipedia | 10% | Encyclopedic knowledge |
| Tulu-v2 | 10% | Instruction data |
| StackExchange | 10% | Technical Q&A |
| ArXiv | 10% | Academic papers |
| OpenWebMath | 10% | Mathematical reasoning preservation |
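As a rough illustration of how the two ratios compose, the sketch below multiplies the 60/40 long/short split by the 1:1 books/repos split and the ShortMix ratios from the table to obtain per-source token fractions. It assumes the ratios are applied at the token level and uses a 20B-token budget as an example; the names `LONG_SOURCES`, `SHORTMIX`, `overall_mix`, and `token_budget` are illustrative and not taken from the paper's code.

```python
# Ratios are copied from the text and table above; the token-level
# interpretation and all identifiers are illustrative assumptions.
LONG_FRACTION = 0.60                               # share of long documents
LONG_SOURCES = {"Books": 0.5, "Code Repos": 0.5}   # 1:1 long-data mixture
SHORTMIX = {
    "FineWeb": 0.25,
    "FineWeb-Edu": 0.25,
    "Wikipedia": 0.10,
    "Tulu-v2": 0.10,
    "StackExchange": 0.10,
    "ArXiv": 0.10,
    "OpenWebMath": 0.10,
}


def overall_mix(long_fraction, long_sources, short_sources):
    """Return each source's fraction of the total token budget."""
    mix = {name: long_fraction * w for name, w in long_sources.items()}
    for name, w in short_sources.items():
        mix[name] = (1.0 - long_fraction) * w
    return mix


def token_budget(mix, total_tokens):
    """Split a total token budget across sources according to `mix`."""
    return {name: frac * total_tokens for name, frac in mix.items()}


MIX = overall_mix(LONG_FRACTION, LONG_SOURCES, SHORTMIX)

if __name__ == "__main__":
    for name, tokens in token_budget(MIX, 20e9).items():
        print(f"{name:<14}{MIX[name]:7.1%}{tokens / 1e9:7.1f}B tokens")
```

Under these assumptions, books and code repositories each take 30% of the budget, and each ShortMix component with a 10% share receives 4% of the total.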
Design 3: Two-Stage Training Length Scaling + Purely Short SFT
Training follows a length curriculum: Stage 1 trains on 20B tokens at 64K length (RoPE base = 8×10⁶, 2.2K H100 hours), and Stage 2 trains on 20B tokens at 512K length (RoPE base = 1.28×10⁸, 12.2K H100 hours). Key finding: training at lengths beyond the evaluation length significantly improves performance (on 64K evaluation, 512K training vs. 64K training gives Re-rank 32.9 vs. 28.0). In the SFT stage, using only the short-context instruction dataset UltraChat (averaging 1.2K tokens) yields the strongest long-context performance; adding synthetic long SFT data, even just 1%, lowers performance. Other designs: disabling cross-document attention improves both performance and training throughput, and initializing from the Instruct model rather than the Base model better preserves short-context capability.
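Two ingredients of this design can be sketched in a few lines of plain Python. The first function applies the standard RoPE formula to the two reported bases and returns the wavelength of the slowest-rotating frequency pair, which can be compared with each stage's sequence length. The second builds a causal attention mask for packed documents in which tokens cannot attend across document boundaries. This is a minimal sketch and not the authors' implementation: the head dimension of 128 (Llama-3-8B), the reading of "64K" and "512K" as 65,536 and 524,288 tokens, and all identifiers are assumptions, and practical training code would use a variable-length attention kernel rather than a dense mask.

```python
import math

HEAD_DIM = 128  # assumption: Llama-3-8B per-head dimension (4096 hidden / 32 heads)

# Stage settings as reported above; token counts for "64K"/"512K" are assumed.
STAGES = [
    {"name": "stage 1", "seq_len": 64 * 1024, "rope_base": 8e6},
    {"name": "stage 2", "seq_len": 512 * 1024, "rope_base": 1.28e8},
]


def longest_rope_wavelength(base, head_dim=HEAD_DIM):
    """Wavelength in tokens of the slowest-rotating RoPE frequency pair.

    Standard RoPE uses inverse frequencies base**(-2i/d) for i = 0..d/2-1,
    so the slowest pair (i = d/2 - 1) has wavelength 2*pi*base**((d-2)/d).
    """
    exponent = (head_dim - 2) / head_dim
    return 2 * math.pi * base ** exponent


def document_mask(doc_lengths):
    """Causal mask for packed documents with cross-document attention disabled.

    mask[q][k] is True when query position q may attend to key position k,
    i.e. k is not in the future and both positions belong to the same document.
    """
    doc_ids = [d for d, n in enumerate(doc_lengths) for _ in range(n)]
    total = len(doc_ids)
    return [
        [k <= q and doc_ids[k] == doc_ids[q] for k in range(total)]
        for q in range(total)
    ]


if __name__ == "__main__":
    for stage in STAGES:
        wavelength = longest_rope_wavelength(stage["rope_base"])
        print(stage["name"], stage["seq_len"], f"{wavelength:.3g}")
    for row in document_mask([2, 3]):
        print("".join("x" if allowed else "." for allowed in row))
```

With these assumptions, the slowest RoPE wavelength is far longer than the training sequence in both stages, and the mask for two packed documents of lengths 2 and 3 is block-diagonal and lower-triangular within each block.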
Key Experimental Results
Main Results: HELMET 128K Evaluation
| Model | Params | Max Len | Recall | RAG | ICL | Re-rank | QA | Summ. | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| ProLong | 8B | 512K | 98.8 | 63.2 | 86.5 | 22.5 | 43.9 | 29.2 | 49.4 |
| Llama-3.1 | 8B | 128K | 95.2 | 59.5 | 83.9 | 14.0 | 43.2 | 27.0 | 46.5 |
| MegaBeam-Mistral | 7B | 512K | 89.6 | 57.0 | 86.2 | 14.7 | 37.3 | 28.9 | 45.4 |
| Llama-3.1 | 70B | 128K | 90.7 | 56.2 | 81.4 | 24.5 | 56.3 | 31.6 | 49.7 |
ProLong-8B surpasses Llama-3.1-8B-Instruct using only 40B tokens (5% of Llama-3.1's long-context training budget). In the table above it leads both 7-8B baselines in every category, and its Avg. score comes close to that of the 70B Llama-3.1 (49.4 vs. 49.7).
Ablation Study
Comparison of Long Data Sources (60% Long + 40% ShortMix, trained on 5B tokens):
| Long Data Source | Recall | RAG | Re-rank | ICL | QA | Summ. | Avg. |
|---|---|---|---|---|---|---|---|
| CommonCrawl | 84.1 | 53.3 | 28.1 | 67.5 | 35.2 | 37.0 | 50.9 |
| Books | 94.9 | 53.9 | 30.7 | 72.2 | 33.2 | 37.7 | 53.8 |
| Code Repos | 99.2 | 53.8 | 29.0 | 61.2 | 34.7 | 36.2 | 52.3 |
| Books/Repos 1:1 | 96.0 | 54.9 | 29.4 | 73.9 | 35.7 | 37.9 | 54.6 |
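The Avg. column in this table appears to be the unweighted mean of the six category scores. The short check below recomputes it from the rows above and selects the best-scoring source; the names `ROWS`, `macro_average`, `COMPUTED`, and `BEST` are illustrative, and the 0.1 tolerance only allows for rounding in the reported figures.

```python
# Category scores (Recall, RAG, Re-rank, ICL, QA, Summ.) and the reported Avg.,
# copied from the table above.
ROWS = {
    "CommonCrawl": ([84.1, 53.3, 28.1, 67.5, 35.2, 37.0], 50.9),
    "Books": ([94.9, 53.9, 30.7, 72.2, 33.2, 37.7], 53.8),
    "Code Repos": ([99.2, 53.8, 29.0, 61.2, 34.7, 36.2], 52.3),
    "Books/Repos 1:1": ([96.0, 54.9, 29.4, 73.9, 35.7, 37.9], 54.6),
}


def macro_average(scores):
    """Unweighted mean over task categories."""
    return sum(scores) / len(scores)


COMPUTED = {name: macro_average(scores) for name, (scores, _) in ROWS.items()}
BEST = max(COMPUTED, key=COMPUTED.get)

if __name__ == "__main__":
    for name, (_, reported) in ROWS.items():
        print(f"{name:<16} computed={COMPUTED[name]:.2f} reported={reported}")
    print("best source:", BEST)
```

The recomputed means agree with the reported column to within rounding, and the 1:1 mixture comes out ahead mainly because it keeps the ICL strength of books without giving up much of the Recall strength of code repositories.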
Comparison of Short Data Sources:
| Short Data Source (40%) | Long-Ctx Avg. | HellaSwag | MMLU | GSM8K | Short Avg. |
|---|---|---|---|---|---|
| SlimPajama | 52.9 | 81.2 | 63.0 | 41.9 | 64.2 |
| FineWeb-Edu | 53.0 | 81.0 | 62.6 | 39.4 | 63.0 |
| DCLM-Baseline | 52.0 | 82.0 | 65.6 | 39.4 | 64.8 |
| ProLong ShortMix | 54.6 | 81.6 | 65.3 | 46.6 | 65.5 |
Impact of Synthetic SFT Data Ratio:
| Synthetic Data Ratio | RAG | Re-rank | ICL | QA | Summ. | Avg. |
|---|---|---|---|---|---|---|
| 0% (Pure UltraChat) | 58.1 | 38.5 | 80.3 | 49.7 | 42.1 | 55.7 |
| 1% | 57.0 | 38.3 | 80.8 | 45.3 | 41.5 | 54.1 |
| 10% | 55.5 | 36.1 | 80.6 | 41.7 | 39.4 | 53.9 |
| 50% | 48.8 | 18.8 | 70.5 | 42.3 | 33.3 | 43.3 |
Key Findings
- Pure long-context training is harmful: Although 100% long data continuously improves PPL, it severely degrades downstream long-context task performance (post-SFT RAG and Re-ranking drop significantly).
- Training beyond the evaluation length is beneficial: when evaluated at 64K, the model trained at 512K outperforms the one trained at 64K (Recall 98.5 vs. 95.0, Re-rank 32.9 vs. 28.0).
- Short SFT data is sufficient: with 0% synthetic long SFT data the Avg. is 55.7; it drops sharply to 43.3 when 50% of the SFT data is synthetic.
- Short-context performance preservation: ProLong ShortMix achieves a short-context average of 65.5, close to the original Llama-3-8B's 66.0.
Highlights & Limitations
Highlights:
- Challenging the intuition that long-context training should use only long documents, and systematically demonstrating for the first time how critical it is to mix in high-quality short data.
- Demonstrating for the first time the benefit of training sequence lengths exceeding the evaluation length, accompanied by a theoretical explanation based on dependency distance.
- The finding that SFT requires only short instruction data greatly simplifies the training workflow for long-context models.
- Contribution to evaluation methodology: revealing the unreliability of PPL and NIAH, urging the community to adopt multi-task evaluations like HELMET.
- Exceptional data efficiency: surpassing Llama-3.1-8B with only 40B training tokens (a 5% data budget).
Limitations:
- Experiments are restricted to Llama-3-8B; whether the conclusions hold at larger scales remains unverified.
- The impact of RLHF/preference optimization on long-context SFT is unexplored.
- The ineffectiveness of synthetic long SFT data may relate to generator quality, requiring verification with stronger models.
- The computational cost of 512K training is significant (12.2K vs. 2.2K H100 hours for the same 20B tokens), and the compute-optimal strategy warrants further exploration.
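To put the compute limitation in per-token terms, the sketch below divides the reported H100 hours by the 20B tokens of each stage. It assumes "2.2K" and "12.2K" mean 2,200 and 12,200 hours; `STAGE_COST` and `hours_per_billion` are illustrative names.

```python
# Reported per-stage figures; exact hour counts are assumed from the rounded values.
STAGE_COST = {
    "64K": {"tokens_b": 20, "h100_hours": 2200},
    "512K": {"tokens_b": 20, "h100_hours": 12200},
}


def hours_per_billion(stage):
    """H100 hours spent per billion training tokens in the given stage."""
    cost = STAGE_COST[stage]
    return cost["h100_hours"] / cost["tokens_b"]


TOTAL_HOURS = sum(c["h100_hours"] for c in STAGE_COST.values())
COST_RATIO = hours_per_billion("512K") / hours_per_billion("64K")

if __name__ == "__main__":
    for stage in STAGE_COST:
        print(f"{stage}: {hours_per_billion(stage):.0f} H100 hours per 1B tokens")
    print(f"512K / 64K cost per token: {COST_RATIO:.2f}x")
    print(f"total: {TOTAL_HOURS} H100 hours")
```

Under these assumptions the 512K stage costs about 5.5 times more per token than the 64K stage and accounts for most of the roughly 14.4K H100 hours in total.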
Rating
- Novelty: ⭐⭐⭐⭐ — Systematic ablation experiments yielding multiple counter-intuitive discoveries (e.g., pure long data is harmful, short SFT is superior).
- Value: ⭐⭐⭐⭐⭐ — Complete, reproducible training recipe (ProLong recipe) that is ready to use.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Extensive ablations covering evaluation protocols, data mixing, length scaling, and SFT strategies.
- Writing Quality: ⭐⭐⭐⭐⭐ — Well-structured with masterfully designed Takeaway Boxes and high-density figures/tables.