Procedural Pretraining: Warming Up Language Models with Abstract Data
Conference: ICML 2026
arXiv: 2601.21725
Code: Yes
Area: LLM Pretraining / Data-Centric AI
Keywords: Procedural Data, Warm-up Pretraining, Algorithmic Skills, Formal Languages, Data Efficiency
TL;DR
Inserting a small amount of "procedural data" (formal languages, stack operations, cellular automata, etc.) as a warm-up phase before standard language, code, or math pretraining yields consistent downstream gains at a cost of only 0.1–0.3% additional tokens. In the substitutive setting, models reach the same loss with only 55–86% of the original semantic data. The strategy aims to decouple "reasoning scaffolding" from "knowledge" during pretraining.
Background & Motivation
Background: The current de facto standard for LLM training is a "one-pot" approach—direct next-token prediction on web-scale corpora, forcing the model to simultaneously acquire semantic knowledge and the skills to manipulate that knowledge.
Limitations of Prior Work: Knowledge and reasoning are highly entangled within the same set of weights. Consequently, models tend to rely on surface heuristics rather than systematic reasoning processes. Several studies (Han 2025, Kumar 2025, Nikankin 2025) have identified this as a core defect in current models.
Key Challenge: Learning both knowledge and algorithmic skills from "semantic data" is highly inefficient. Semantics are often mixed with approximations, ambiguities, and shortcuts, making it difficult to stably cultivate precise symbolic manipulation capabilities. Ideally, models should develop like humans—learning logic and mathematics first before advancing to high-level reasoning—but no engineered path currently exists to decouple the two.
Goal: To implement the cognitive development concept of "structure before semantics" into a reproducible pretraining pipeline, verifying its effectiveness across various scales and semantic domains, and explaining which components benefit at the mechanistic level.
Key Insight: The authors start from "procedural data"—non-semantic sequences generated by explicit rules such as formal languages, simple algorithms, and cellular automata. Unlike "LLM-generated synthetic data," procedural data contains no semantic content but possesses rigid structures (nesting, recursion, long-range dependencies), serving as pure "algorithmic scaffolding." By warming up on these scaffolds before feeding standard corpora, models may better learn "operational capabilities" and "knowledge" in distinct stages.
Core Idea: Prepend procedural data amounting to 0.1–0.3% of the token budget before standard pretraining, as a "pre-pretraining" stage. This stage does not teach semantics; it only teaches the model how to manage stacks, balance brackets, and perform sequence transformations.
Method
Overall Architecture
The pipeline consists of two stages: Stage 1 trains a GPT-2 style decoder-only transformer from scratch using \(T_1\) procedural tokens (all weights participate in training); Stage 2 continues training with \(T_2\) semantic tokens (from C4, CodeParrot, or DeepMind-Math). The baseline uses \(T_1=0\) (pure standard pretraining). Both stages utilize standard next-token prediction loss without freezing. Two core configurations are explored: (i) Additive setting: Fixing \(T_2\) and adding \(T_1\) to observe gains from extra tokens; (ii) Substitutive setting: Reducing \(T_2\) and adding a small \(T_1\) to see if the same loss can be replicated with fewer semantic tokens.
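The two configurations can be made concrete with a small budget calculation. This is only an illustrative sketch: the names `Budget`, `additive`, and `substitutive` are hypothetical, the 10.5B semantic budget is used purely as an example figure, and measuring the warm-up fraction relative to \(T_2\) is an assumption rather than something the summary states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    procedural_tokens: int  # T1, consumed in Stage 1
    semantic_tokens: int    # T2, consumed in Stage 2


def additive(t2: int, warmup_fraction: float) -> Budget:
    """Keep T2 fixed and add T1 = warmup_fraction * T2 procedural tokens.

    Assumption: the warm-up fraction is measured relative to T2.
    """
    return Budget(round(t2 * warmup_fraction), t2)


def substitutive(t2: int, warmup_fraction: float, kept_fraction: float) -> Budget:
    """Shrink the semantic budget to kept_fraction * T2, then add a small T1."""
    return Budget(round(t2 * warmup_fraction), round(t2 * kept_fraction))


# Illustrative numbers only.
baseline = Budget(procedural_tokens=0, semantic_tokens=10_500_000_000)
add_cfg = additive(baseline.semantic_tokens, warmup_fraction=0.002)
sub_cfg = substitutive(baseline.semantic_tokens, warmup_fraction=0.002, kept_fraction=0.55)

print(add_cfg)
print(sub_cfg)
```

In the additive case the total token count grows slightly; in the substitutive case it shrinks, and the question is whether the Stage 2 loss still matches the baseline.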
Key Designs
- Procedural Data Generator (Scaffolding Library):
    - Function: Provides a set of "non-semantic, strongly structured" sequences, each \(\le 128\) tokens, as warm-up corpora.
    - Mechanism: The authors curated a data family including: (i) Sequence Transformations (Set/Reverse/Identity/Union/Sort/Delete), where the model predicts the transformed sequence; (ii) Memory Operations (Stack), simulating push/pop operations to predict final stack contents; (iii) Formal Languages (\(k\)-Dyck balanced brackets and \(k\)-Dyck Shuffle, in which the bracket types are interleaved rather than mutually nested), where \(k\) is the number of distinct bracket types; (iv) Cellular Automata (ECA rule 110), where binary sequences evolve under a deterministic local update rule, so each state depends only on the previous one. These data types share the property that the next token relies heavily on precise symbolic operations, compositionality, and long-range dependencies, with almost no surface-level shortcuts.
    - Design Motivation: Choosing "procedural data" over LLM-generated data is critical. Procedural data has a fully specified generation process and clear structure, allowing researchers to isolate which structure teaches which skill. Additionally, short sequences (\(\le 128\) tokens) ensure that algorithmic learning is not obscured by long contexts.
- Algorithmic Skill Diagnosis → Semantic Domain Transfer:
    - Function: Diagnose "which procedural data teaches which algorithmic skill" on small models first, then verify whether these skills transfer to natural language, code, and math on larger models.
    - Mechanism: The diagnostic phase uses small transformers (2 layers, 4 heads) with 10 seeds in the additive setting across tasks like Haystack (long-context retrieval), Arithmetic (Addition / Reversed Addition / Multiplication), and Sorting. A shuffled control experiment (randomizing token order within procedural sequences) shows performance collapsing to baseline, indicating that "structure" rather than "token distribution" drives the gains. The transfer phase scales models to 1.3B parameters and 10.5B tokens to examine loss and downstream performance on C4 / CodeParrot / DeepMind-Math.
    - Design Motivation: Directly performing ablations on large models is too costly and easily obscured by scale noise. Small models act as "diagnostic tools" to establish causal relationships before verifying transferability at scale.
- Selective Transfer as Interpretable Probes:
    - Function: Transfers weights learned from procedural pretraining in three ways ("attention-only," "MLP-only," or "full-model") to locate which component carries the useful structure.
    - Mechanism: At the start of Stage 2, only the selected component type (attention or MLP) keeps its procedurally pretrained weights, while all other weights are reset to random initialization. If transferring a single component type outperforms full-model transfer, that component is the carrier of "useful structure," while the other may introduce negative transfer. For natural language (C4), MLP-only is superior; for structured code (CodeParrot), attention-only is superior; for math, both are important.
    - Design Motivation: This "modular probe" serves as both a scientific explanation (testing the hypothesis that MLPs store knowledge and attention stores patterns) and a practical engineering guide for layer retention based on the downstream domain.
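The data families listed under Key Designs are simple enough to sketch in a few lines each. The generators below are a minimal illustration, not the authors' released code: the token spellings (`<0`, `0>`, `push3`, `pop`), the sampling probabilities, and the periodic boundary for the cellular automaton are all assumptions. Here \(k\) is taken to be the number of bracket types, the usual Dyck-\(k\) convention. The `shuffled` helper mirrors the shuffled control, which keeps the token multiset but destroys the structure.

```python
import random


def dyck(k: int, n_pairs: int, rng: random.Random) -> list[str]:
    """Sample a balanced bracket sequence over k bracket types."""
    out, stack, remaining = [], [], n_pairs
    while remaining or stack:
        # Open a bracket if any remain and either nothing is open or a coin flip says so.
        if remaining and (not stack or rng.random() < 0.5):
            t = rng.randrange(k)
            stack.append(t)
            out.append(f"<{t}")
            remaining -= 1
        else:
            out.append(f"{stack.pop()}>")
    return out


def is_balanced(tokens: list[str]) -> bool:
    """Check that every closing bracket matches the most recent open one."""
    stack = []
    for tok in tokens:
        if tok.startswith("<"):
            stack.append(tok[1:])
        elif not stack or stack.pop() != tok[:-1]:
            return False
    return not stack


def stack_trace(n_ops: int, rng: random.Random, vocab: int = 10) -> tuple[list[str], list[int]]:
    """Return a push/pop op sequence and the final stack contents (the target)."""
    ops, stack = [], []
    for _ in range(n_ops):
        if stack and rng.random() < 0.4:
            stack.pop()
            ops.append("pop")
        else:
            v = rng.randrange(vocab)
            stack.append(v)
            ops.append(f"push{v}")
    return ops, stack


def eca_step(cells: list[int], rule: int = 110) -> list[int]:
    """One step of an elementary cellular automaton (periodic boundary assumed)."""
    n = len(cells)
    return [
        (rule >> (cells[(i - 1) % n] << 2 | cells[i] << 1 | cells[(i + 1) % n])) & 1
        for i in range(n)
    ]


def shuffled(tokens: list[str], rng: random.Random) -> list[str]:
    """Shuffled control: same tokens, structure destroyed."""
    out = list(tokens)
    rng.shuffle(out)
    return out


rng = random.Random(0)
sample = dyck(k=4, n_pairs=8, rng=rng)
control = shuffled(sample, rng)
ops, final_stack = stack_trace(12, rng)
next_row = eca_step([0, 0, 0, 1, 0])
```

Each generator is a few lines of deterministic rules plus a seed, which is what makes the generation process fully specified and the shuffled control easy to construct.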
Loss & Training
Standard next-token prediction loss is used for both stages. For procedural data involving input/output pairs, loss is calculated only on output tokens. In the semantic-transfer experiments (Sections 5 and 6 of the paper), token embeddings are reset to random values at the start of Stage 2, as there is no mapping between procedural and semantic vocabularies; in the procedural → algorithmic experiments (Section 4 of the paper), embeddings are initialized as mean vectors. The authors also performed counterfactual controls: (a) adding explicit attention sharpening regularization, which failed to replicate the gains, ruling out simple "sharper attention" explanations; (b) shuffling weights by layer while preserving magnitude distributions, which caused performance to collapse, ruling out "initialization scale" explanations.
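Computing the loss only on output tokens amounts to masking the input positions before averaging. The sketch below shows the idea on plain Python lists; the separator token `=` and the function names are hypothetical, and a real training loop would apply the same mask to per-token cross-entropy from the model.

```python
import math


def output_mask(tokens: list[str], sep: str = "=") -> list[bool]:
    """Mark tokens strictly after the first separator as output positions.

    Assumption: input and output are split by a single separator token.
    """
    cut = tokens.index(sep)
    return [i > cut for i in range(len(tokens))]


def masked_nll(target_probs: list[float], is_output: list[bool]) -> float:
    """Mean negative log-likelihood over output positions only."""
    terms = [-math.log(p) for p, keep in zip(target_probs, is_output) if keep]
    if not terms:
        raise ValueError("no output positions to score")
    return sum(terms) / len(terms)


# A Sort-style example: input "3 1 2", output "1 2 3".
tokens = ["3", "1", "2", "=", "1", "2", "3"]
mask = output_mask(tokens)

# Hypothetical model probabilities assigned to each correct token.
probs = [0.1, 0.1, 0.1, 0.1, 0.5, 0.5, 0.5]
loss = masked_nll(probs, mask)
```

The low probabilities on the input positions do not affect `loss`, so the model is not penalized for failing to predict the arbitrary input sequence.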
Key Experimental Results
Main Results
| Setting | Data | Gain |
|---|---|---|
| Haystack (context recall) | \(k\)-Dyck warm-up vs. baseline | Accuracy 10% → 98% |
| Additive, 1.3B model | +0.1–0.3% procedural tokens | Consistent improvement on C4/CodeParrot/Math |
| Substitutive, equivalent loss | C4 | Requires only 55% of original data |
| Substitutive, equivalent loss | CodeParrot | Requires only 67% of original data |
| Substitutive, equivalent loss | DeepMind-Math | Requires only 86% of original data |
| Downstream Tasks | Language, Code Gen, Reasoning | Gains from procedural warm-up persist |
Ablation Study
| Configuration | Meaning | Conclusion |
|---|---|---|
| Full (Structured procedural data) | Complete method | Reference point for the rows below |
| Shuffled procedural data | Same token distribution, broken structure | Gains collapse; structure is key |
| Attention sharpening reg. | Explicitly sharpening attention | Fails to replicate warm-up gains |
| Weight magnitude shuffle | Preserving magnitude, shuffling position | Performance drops; magnitude alone is insufficient |
| Attention-only transfer | Transferring only attention weights | ~80% gain over full-model on Identity/Haystack; better for code |
| MLP-only transfer | Transferring only MLP weights | Better for Reversed Addition and Natural Language (C4) |
| Procedural data mixtures | Combining different scaffolds | Yields further gains; suggests future directions |
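The attention-only and MLP-only rows correspond to a simple filter over parameter names when building the Stage 2 initialization. The following is a minimal sketch using plain dictionaries in place of a real state dict; the parameter names (`blocks.0.attn.qkv`, etc.) and the `selective_transfer` helper are hypothetical. Embeddings are always taken from the fresh initialization here, matching the embedding reset described for the semantic-transfer experiments.

```python
def selective_transfer(
    pretrained: dict[str, list[float]],
    fresh: dict[str, list[float]],
    keep: str,
) -> dict[str, list[float]]:
    """Build Stage 2 initial weights: keep selected components, reset the rest.

    keep is one of "attention", "mlp", or "full". Parameters whose names match
    none of the kept markers (e.g. embeddings) come from the fresh random init.
    """
    markers = {
        "attention": ("attn",),
        "mlp": ("mlp",),
        "full": ("attn", "mlp"),
    }[keep]
    return {
        name: pretrained[name] if any(m in name for m in markers) else fresh[name]
        for name in fresh
    }


# Toy weights: 1.0 marks "procedurally pretrained", 0.0 marks "fresh random init".
names = ["embed.tokens", "blocks.0.attn.qkv", "blocks.0.mlp.fc"]
pretrained = {n: [1.0, 1.0] for n in names}
fresh = {n: [0.0, 0.0] for n in names}

attn_init = selective_transfer(pretrained, fresh, "attention")
mlp_init = selective_transfer(pretrained, fresh, "mlp")
full_init = selective_transfer(pretrained, fresh, "full")
```

Training Stage 2 from each of the three initializations and comparing the resulting loss is what lets the probe attribute gains to a component type.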
Key Findings
- Different procedural data types specialize in different skills: \(k\)-Dyck enhances long-context retrieval and sorting, ECA rule 110 enhances reversed addition, and Union/Delete enhances multiplication. This indicates a traceable correspondence between "algorithmic structure" and "algorithmic skill."
- Procedural warm-up information is highly localized: attention layers primarily carry structural capabilities (stacks, brackets, retrieval patterns), while MLP layers are more valuable for natural language. This supports the hypothesis that MLPs are knowledge containers, though the fact they benefit from "non-semantic" procedural data is a counter-intuitive finding.
- Reversed addition is a rare task where MLP-only/full-model transfer outperforms attention-only, suggesting its bottleneck lies in numerical processing rather than pattern matching.
Highlights & Insights
- Placing 0.1–0.3% of scaffolding tokens at the very beginning of training is an almost "free" improvement that provides both additive performance gains and substitutive data savings. It is directly applicable to industrial pretraining without changing the architecture and with negligible additional compute.
- The causal chain of "structure → skill → component (MLP/Attention)" is clearly established. This progressive research structure—from diagnosis to transfer to mechanism to combination—is highly recommended for future studies on data mixing strategies.
- Procedural data is non-verbal and cross-modal. Concurrent work (Shinnick 2026) has observed similar gains in vision, suggesting the existence of "modality-independent algorithmic mechanisms."
Limitations & Future Work
- The model scale is 1.3B, which is still small compared to flagship LLMs. Whether procedural warm-up remains effective at 10B+ or 100B+ scales requires further validation.
- Procedural data types are still limited (Stack, Dyck, ECA, etc.), and mixture strategies are only preliminary. There is no established methodology for "data selection + ratio optimization + sequencing."
- Downstream evaluations focus on language modeling loss, code generation, and common sense reasoning; they do not cover more structured downstream tasks (e.g., multi-step reasoning benchmarks, agentic behavior).
- The localization findings (attention carrying structural patterns, MLPs carrying knowledge-like content) are phenomenological; a mechanistic-level explanation (e.g., circuit analysis, activation patching) is still missing.
Related Work & Insights
- vs. Hu et al. (2025) "pre-pretraining on formal languages": Hu et al. view formal languages as "higher value per token" data emphasizing "substitution"; this work expands to a "procedural data family" emphasizing "complementarity" and "modular mechanisms."
- vs. Wu et al. (2022) / Zhang et al. (2024): These works replace standard pretraining with algorithmic or CA data. This work shows that prepending just 0.1–0.3% is sufficient and systematically compares skill specialization.
- vs. Code Pretraining: Code has long been used as an implicit scaffold for reasoning. This work generalizes that intuition to any structured non-semantic data and explains why it works.
- vs. LLM Synthetic Data: Synthetic data still carries semantics from the teacher LLM. Procedural data is rule-generated, zero-semantic, and strongly structured, making it a "skill scaffold" that complements rather than replaces synthetic data.
Rating
- Novelty: ⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐