Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale¶
- Conference: NeurIPS 2025
- arXiv: 2510.24963
- Code: GitHub
- Area: LLM Pre-training / Interpretability
- Keywords: language model behavioral phases, n-gram probability, semantic similarity, training dynamics, architecture-agnostic
TL;DR¶
By analyzing over 1,400 model checkpoints on 110,000+ tokens, this paper demonstrates that autoregressive language models exhibit highly consistent behavioral phases during training — predicted probabilities successively overfit to n-gram distributions of increasing \(n\) — and that three simple heuristics (word frequency, n-gram probability, and semantic similarity) explain up to 98% of the variance in model behavior. This pattern holds consistently across architectures (Transformer/Mamba/RWKV), datasets, and scales.
Background & Motivation¶
- Background: Language models trained via next-token prediction exhibit emergent capabilities such as grammatical generation and knowledge-based reasoning, yet the regularities underlying the learning process remain poorly understood.
- Limitations of Prior Work: Existing analyses have largely focused on abrupt changes in specific behaviors or subnetworks, lacking a systematic characterization of overall model behavior across training.
- Key Challenge: It remains unclear whether universal learning regularities exist that are independent of model-specific details such as architecture, scale, and data.
- Goal: To quantitatively characterize the behavioral changes of language models throughout training using simple heuristics.
- Key Insight: The analysis focuses on three heuristics — word frequency (unigram), n-gram probability, and contextual semantic similarity.
- Core Idea: All models pass through the same behavioral phases: they overfit first to low-order n-grams, then progressively to higher-order n-grams, while correlation with semantic similarity is established rapidly early in training.
Method¶
Overall Architecture¶
A total of 1,418 model checkpoints are collected (Pythia/Mamba/RWKV × multiple scales × multiple seeds) and evaluated on the decontaminated benchmark NaWoCo. Pearson/Spearman correlations between model log-probabilities and each heuristic, together with regression analyses, are computed across more than 110,000 tokens.
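A minimal sketch of this core measurement, assuming the per-token model log-probabilities and heuristic scores are already aligned in NumPy arrays; the function names are illustrative, not from the paper's released code:

```python
import numpy as np
from scipy import stats

def behavioral_correlation(model_logprobs: np.ndarray,
                           heuristic_scores: np.ndarray) -> dict:
    """Correlate one checkpoint's log-probabilities with one heuristic."""
    pearson_r, _ = stats.pearsonr(model_logprobs, heuristic_scores)
    spearman_r, _ = stats.spearmanr(model_logprobs, heuristic_scores)
    return {"pearson": pearson_r, "spearman": spearman_r}

def correlation_trajectory(checkpoints: dict, heuristics: dict) -> dict:
    """Track how correlation with each heuristic evolves over training.

    `checkpoints` maps a training step to that checkpoint's per-token
    log-probs; `heuristics` maps a heuristic name (e.g., "2-gram") to
    per-token scores on the same tokens.
    """
    return {
        step: {name: behavioral_correlation(lp, scores)["pearson"]
               for name, scores in heuristics.items()}
        for step, lp in checkpoints.items()
    }
```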
Key Designs¶
- Parc Model Suite: The first publicly released checkpointed Mamba-1 and RWKV-4 models, with all three architectures trained in parallel on the same OpenWebText data (identical sequences and training steps), using 6 random seeds and 73 checkpoints each. A shared tokenizer ensures fair comparison.
- NaWoCo Dataset: An evaluation set of 150,000+ words in sentential contexts, extracted from FineWeb. Target words must be single tokens under every model's tokenizer, absent from the training data (verified via infini-gram counts), and low-toxicity (toxicity score < 0.1). The dataset is split into train/validation/test sets.
- Regression Analysis: Unigram, 2–5-gram log-probabilities, and fastText semantic similarity scores (Wikipedia and Common Crawl variants, with uniform and SGPT-based weighting) are used as features to predict model log-probabilities; \(R^2\) is computed to measure explained variance.
- n-gram Computation: Word-level n-gram probabilities are computed over the training data using the infini-gram toolkit with Stupid Backoff smoothing (a sketch of this and the semantic-similarity heuristic follows this list).
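A sketch of the two non-trivial heuristic computations, under stated assumptions: the `counts` lookup is a stand-in for infini-gram count queries, the 0.4 backoff factor is the standard Stupid Backoff constant from Brants et al. (2007), and the fastText model is an official `cc.en.300.bin`-style binary. Uniform context weighting is shown; the paper's SGPT-based weighting is the alternative.

```python
import numpy as np
import fasttext  # pip install fasttext

ALPHA = 0.4  # standard Stupid Backoff factor (Brants et al., 2007)

def stupid_backoff(ngram: tuple, counts: dict) -> float:
    """Stupid Backoff relative-frequency score (not a normalized probability).

    `counts` maps word tuples to corpus counts, with counts[()] assumed
    to hold the total token count; in the paper these counts come from
    infini-gram.
    """
    if len(ngram) == 1:
        return counts.get(ngram, 0) / counts[()]
    if counts.get(ngram, 0) and counts.get(ngram[:-1], 0):
        return counts[ngram] / counts[ngram[:-1]]
    return ALPHA * stupid_backoff(ngram[1:], counts)

def semantic_similarity(context_words: list, target_word: str, ft) -> float:
    """Cosine similarity between the mean context vector and the target word.

    Uniform weighting over context words is shown; the paper also uses
    SGPT-derived weights.
    """
    ctx = np.mean([ft.get_word_vector(w) for w in context_words], axis=0)
    tgt = ft.get_word_vector(target_word)
    return float(ctx @ tgt / (np.linalg.norm(ctx) * np.linalg.norm(tgt)))

# ft = fasttext.load_model("cc.en.300.bin")  # Common Crawl variant
```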
Loss & Training¶
- This is a purely analytical study; no training loss is designed.
- Pearson correlation, Spearman correlation, and \(R^2\) regression analysis are employed (a minimal regression sketch follows this list).
- Model scales range from 14M to 12B parameters (the Pythia suite covers the full range).
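A minimal sketch of the variance-explained analysis, assuming the per-token heuristic scores have already been assembled into a feature matrix; scikit-learn is an assumption here, as this summary does not specify the paper's regression implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def heuristic_r2(features: np.ndarray, model_logprobs: np.ndarray,
                 train_idx: np.ndarray, test_idx: np.ndarray) -> float:
    """Fraction of variance in model log-probabilities explained by heuristics.

    `features` holds one column per heuristic (unigram and 2-5-gram
    log-probabilities, fastText similarity variants); NaWoCo's
    train/test split supplies the index arrays.
    """
    reg = LinearRegression().fit(features[train_idx], model_logprobs[train_idx])
    return r2_score(model_logprobs[test_idx], reg.predict(features[test_idx]))
```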
Key Experimental Results¶
| Finding | Details |
|---|---|
| Explained variance | Three heuristics explain up to 98% of model log-probability variance |
| Cross-architecture consistency | Pythia/Mamba/RWKV achieve Pearson \(r \geq 0.93\) at matched training steps |
| Behavioral phases | Sequential overfitting: unigram → bigram → trigram → ... → 5-gram |
| Scale effect | Larger models show stronger decorrelation from low-order n-grams |
Key Findings¶
- All models, regardless of architecture, scale, or data, exhibit the same n-gram overfitting sequence.
- The peak correlation with semantic similarity coincides with the unigram correlation peak (Common Crawl variant) or the trigram correlation peak (Wikipedia variant).
- Variance across random seeds is negligible (confidence intervals are nearly invisible).
- Larger models are better able to disentangle from low-order n-grams and learn more complex relationships.
Evaluated Model Suites¶
| Architecture | Parameters | Training Data | Checkpoints | Seeds |
|---|---|---|---|---|
| Pythia | 14M–12B | The Pile | 143 | 1 |
| Mamba-1 | ~160M | OpenWebText | 73 | 6 |
| RWKV-4 | ~160M | OpenWebText | 73 | 6 |
Behavioral Phase Timeline¶
- Phase 1 (0–5K steps): Unigram overfitting; the model learns word frequency distributions.
- Phase 2 (5K–20K steps): Bigram overfitting; local dependencies begin to be captured.
- Phase 3 (20K–100K steps): Trigram+ overfitting; longer-range dependencies are learned.
- Phase 4 (100K+ steps): Decorrelation from high-order n-grams; semantic relationships begin to emerge.
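Given the correlation trajectories from the earlier sketch, this phase ordering can be read off mechanically; the following is a hypothetical illustration, not the paper's exact procedure.

```python
def phase_order(trajectory: dict) -> list:
    """Sort n-gram heuristics by the training step of peak Pearson correlation.

    `trajectory` maps step -> {heuristic name -> r}, as produced by
    `correlation_trajectory` above; sequential overfitting predicts the
    peaks appear in order unigram, bigram, ..., 5-gram.
    """
    steps = sorted(trajectory)
    names = trajectory[steps[0]]
    peaks = {n: max(steps, key=lambda s: trajectory[s][n]) for n in names}
    return sorted(peaks.items(), key=lambda kv: kv[1])
```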
Highlights & Insights¶
- The work uncovers a rare cross-architecture universal regularity in deep learning.
- Three minimalist heuristics explain up to 98% of the variance, suggesting that much of what language models learn during pre-training reduces to these three types of patterns.
- The Parc model suite and NaWoCo dataset constitute important public resources for the community.
- The findings offer a new perspective for understanding scaling laws and emergent behaviors.
Limitations & Future Work¶
- The analysis is restricted to token-level behavior and does not extend to sentence- or paragraph-level semantic analysis.
- Simple heuristics may be insufficient to explain more complex behaviors such as multi-step reasoning or planning.
- Behavioral phase changes following instruction fine-tuning or RLHF are not examined; alignment training may alter phase ordering.
- The causal mechanism underlying these phases remains unexplained; the findings are observational rather than mechanistic.
- The 98% explained variance may overestimate the importance of the heuristics, given the strong inherent statistical regularities of natural language.
- The effect of varying training data composition (e.g., proportion of code data) on behavioral phases is not explored.
- The Mamba and RWKV models evaluated are relatively small; behavioral phases in larger-scale non-Transformer architectures may differ.
- The construction of the NaWoCo evaluation set may introduce selection bias — restricting to single-token words may not be representative of more complex lexical categories.
Related Work & Insights¶
- vs. Chang et al. 2024: n-gram overfitting was previously observed only in GPT-2; this work extends the finding to multiple architectures and scales.
- vs. Voita et al. 2024: That work analyzes n-gram-specialized neurons; this work characterizes the same phenomenon at the behavioral level.
- vs. Schaeffer et al. 2023: That work argues emergence is a measurement artifact; this work approaches training dynamics from a complementary perspective.
Additional Discussion¶
- The core methodological contribution is characterizing training dynamics with several simple, interpretable heuristics rather than a single loss curve, yielding a more comprehensive picture of what models learn and when.
- The experimental design spans three architectures, multiple scales, and multiple random seeds, with seed variance small enough that the results are statistically robust.
- The analysis pipeline is modular and extends readily to new model suites and evaluation corpora.
- Open-sourcing the code, checkpoints, and data is of significant value for community reproduction and follow-up research.
- Compared to prior single-architecture analyses, the paper is broader in both problem formulation and experimental coverage.
- The paper is logically structured, moving cleanly from problem definition through analysis design to experimental validation.
- The analysis itself is computationally lightweight (n-gram lookups plus correlation and regression statistics), so it can be applied to new checkpoint suites at modest cost.
- Extending the analysis to other autoregressive token streams (e.g., code or audio) would test whether the phases are specific to natural language.
- Validating the findings at larger scales and on larger corpora, particularly for non-Transformer architectures, is an important next step.
- Whether the phases hold for non-English and domain-specific corpora warrants broader validation.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — The discovery of universal behavioral phases across architectures is a significant scientific contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Large-scale experiments spanning 1,400+ checkpoints, 3 architectures, and multiple scales.
- Writing Quality: ⭐⭐⭐⭐ — Rigorous analysis with clear visualizations.
- Value: ⭐⭐⭐⭐⭐ — Fundamental implications for understanding the learning mechanisms of language models.