
Understanding the Emergence of Seemingly Useless Features in Next-Token Predictors

Conference: ICLR 2026 arXiv: 2603.14087 Code: GitHub Area: LLM Pre-training Keywords: Feature emergence, next-token prediction, pre-caching, circuit sharing, world models, mechanistic interpretability

TL;DR

This paper explains, from the perspective of gradient signals, why Transformers trained with next-token prediction (NTP) learn features that appear "useless" for predicting the immediate next token. It proposes a decomposition of gradient pathways into three components — direct learning, pre-caching, and circuit sharing — and validates this framework on toy tasks, OthelloGPT, and language models.

Background & Motivation

  • LLMs are typically trained with the NTP objective: learning \(p(x_{t+1}|x_1 \cdots x_t)\)
  • Intuitively, models need only learn features useful for predicting the next token
  • Yet extensive research has shown that Transformers acquire representations far richer than this: abstract features, world models, multi-step lookahead, etc.
  • OthelloGPT even learns board-state representations, despite the fact that predicting legal moves does not require knowing all piece positions
  • Core Problem: What gradient signals arising from the NTP objective drive the learning of these "useless" features?

Core Problem

How do Transformers trained for NTP learn features that are useless for predicting the immediate next token? Which components of the gradient signal drive the emergence of such features?

Method

Three-Way Gradient Pathway Decomposition

For the residual stream \(r_{\theta,i}^k(x)\) at position \(i\) and layer \(k\), the loss gradient can be decomposed into three independent pathways:

1. Direct Learning — green pathway

Gradient from the loss for predicting token \(x_{i+1}\) (i.e., \(L_i\)), flowing through \(r_{\theta,i}^k(x)\):

\[\nabla_\theta L_{i(\text{direct})}^k = \nabla_\theta L_i - \nabla_\theta L_i^{\text{sg}(k,i)}\]

2. Pre-caching — blue pathway

Gradient from prediction losses at later positions, which "look back" to position \(i\) via the attention mechanism (the sum runs over all \(j \neq i\) for exactness, but terms with \(j < i\) vanish by causality):

\[\nabla_\theta L_{i(\text{pre-cached})}^k = \nabla_\theta \sum_{j \neq i} \left[L_j - L_j^{\text{sg}(k,i)}\right]\]

3. Circuit Sharing — orange pathway

Pathways that do not pass through \(r_{\theta,i}^k(x)\). Because Transformers share parameters, gradient signals from other positions also affect parameters involved in computing position \(i\):

\[\nabla_\theta L_{i(\text{shared})}^k = \sum_j \nabla_\theta L_j^{\text{sg}(k,i)}\]

Proposition 3.1: The three components constitute an exact decomposition of the total gradient:

\[\nabla_\theta L = \nabla_\theta L_{i(\text{direct})}^k + \nabla_\theta L_{i(\text{pre-cached})}^k + \nabla_\theta L_{i(\text{shared})}^k\]
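The exact decomposition in Proposition 3.1 can be reproduced on a toy model. The sketch below is my own construction, not the authors' code: it uses a linear "residual" \(r_j = W x_j\) with shared parameters \(W\) and per-position losses \(L_j = \lVert \sum_{m \le j} r_m \rVert^2\), and emulates the stop-gradient \(\text{sg}(k,i)\) by dropping the gradient path through \(r_i\).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 3
x = rng.normal(size=(T, d))          # T positions, d-dim inputs
W = rng.normal(size=(d, d))          # shared parameters

r = x @ W.T                          # toy residual r_j = W x_j
s = np.cumsum(r, axis=0)             # s_j = sum_{m<=j} r_m; toy loss L_j = ||s_j||^2

def grad_Lj(j, sg_i=None):
    """dL_j/dW; sg_i emulates sg(k,i) by blocking the path through r_i."""
    g = np.zeros_like(W)
    for m in range(j + 1):
        if m == sg_i:
            continue                 # stop-gradient: skip the path through r_i
        g += 2.0 * np.outer(s[j], x[m])
    return g

i = 1
full      = sum(grad_Lj(j) for j in range(T))
direct    = grad_Lj(i) - grad_Lj(i, sg_i=i)
precached = sum(grad_Lj(j) - grad_Lj(j, sg_i=i) for j in range(T) if j != i)
shared    = sum(grad_Lj(j, sg_i=i) for j in range(T))

assert np.allclose(full, direct + precached + shared)
```

Because each per-position gradient splits exactly into its stop-gradient part plus the blocked path, the identity holds for any choice of \(i\), \(W\), and inputs in this toy setup.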

Ablation Methods

  • Ablating pre-caching: myopic training, which blocks gradient flow between different positions
  • Ablating circuit sharing: \(m\)-untied training, which uses separate parameters before and after position \(m\)
  • Extreme case: myopic + untied = "split brain," which removes both pre-caching and circuit sharing, isolating the direct-learning pathway

Influence Metric

Feature Mismatch:

\[R(x \mid \theta_1, \theta_2, w_i^k) = \frac{1}{2}\left(\langle w_i^k, r_{\theta_1,i}^k(x)\rangle - \langle w_i^k, r_{\theta_2,i}^k(x)\rangle\right)^2\]

Influence:

\[I_i^k(\theta, x \mid w_i^k, \theta^*, G) = \frac{d}{d\varepsilon}\, R(x \mid \theta + \varepsilon G, \theta^*, w_i^k)\bigg|_{\varepsilon=0}\]

Substituting each of the three gradient components for \(G\) yields the integrated influence of each pathway.
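For a linear toy feature map \(r_{\theta,i}(x) = Wx\) (an illustrative assumption, not the paper's architecture), the influence has a closed form that can be checked against a finite-difference estimate of the directional derivative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x  = rng.normal(size=d)
w  = rng.normal(size=d)              # probe direction w_i^k
W1 = rng.normal(size=(d, d))         # theta_1 (model under training)
W2 = rng.normal(size=(d, d))         # theta_2 / theta* (reference model)
G  = rng.normal(size=(d, d))         # gradient component substituted for G

def R(Wa, Wb):
    # feature mismatch between the two models' features along w
    return 0.5 * (w @ (Wa @ x) - w @ (Wb @ x)) ** 2

# analytic influence: d/d(eps) R(W1 + eps*G, W2) at eps = 0
influence = (w @ (W1 @ x) - w @ (W2 @ x)) * (w @ (G @ x))

# finite-difference check of the same directional derivative
eps = 1e-6
fd = (R(W1 + eps * G, W2) - R(W1 - eps * G, W2)) / (2 * eps)
assert abs(influence - fd) < 1e-5
```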

Adaptation to Large Models: Proposition 5.1

For large models where retraining is infeasible, the ratio of influences is approximated via activation interventions:

\[Q(w) = \frac{\sum_{j>i+1} d_j^{/i}}{d_{i+1}^{/i}} \approx \frac{I_{\text{pre-cached}}}{I_{\text{direct}}}\]

A high \(Q(w)\) indicates that the feature is primarily driven by pre-caching.
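Given per-position deltas \(d_j^{/i}\) from an activation intervention at position \(i\), \(Q(w)\) is just a ratio of sums. A minimal sketch with hypothetical delta values (the helper name `q_ratio` and the numbers are mine, for illustration only):

```python
import numpy as np

def q_ratio(deltas, i):
    """Q(w) from intervention deltas: deltas[j] = d_j^{/i}, the change in the
    readout at position j when the activation at position i is intervened on."""
    # numerator sums positions j > i+1; denominator is the immediate position i+1
    return np.sum(deltas[i + 2:]) / deltas[i + 1]

# hypothetical deltas for an intervention at position i = 2
d = np.array([0.0, 0.0, 0.0, 0.1, 0.4, 0.3, 0.2])
q = q_ratio(d, i=2)        # (0.4 + 0.3 + 0.2) / 0.1
assert abs(q - 9.0) < 1e-9
```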

Key Experimental Results

Toy Tasks: Majority and Conditioned Majority

| Training Setup | Majority Feature Probe | Conditioned Majority Feature Probe |
|---|---|---|
| Standard training | High | High |
| Myopic (no pre-caching) | Reduced | Cannot be learned |
| Untied (no circuit sharing) | Reduced | High |
| Myopic + Untied | Lowest | Cannot be learned |
  • The Conditioned Majority feature can only be learned via pre-caching, since it requires a two-layer attention interaction
  • Ablating both mechanisms largely eliminates the emergence of NTP-useless features

OthelloGPT

| Gradient Component | NTP-Useful Features (95% CI) | NTP-Useless Features (95% CI) |
|---|---|---|
| Direct | [2.85, 12.38] | [-4.69, 2.74] |
| Pre-cached | [-1.99, 0.66] | [0.55, 3.05] |
| Shared | [4.80, 12.48] | [2.93, 9.91] |
| Combined | [12.14, 19.05] | [4.42, 10.07] |
  • Direct influence is significantly positive only for NTP-useful features; it is not significant for NTP-useless features
  • Pre-caching and circuit sharing influences remain positive for NTP-useless features, explaining why they are nonetheless learned
  • This provides an explanation for the world model fragility reported by Vafa et al. (2025)

Language Model (TinyStories)

| Feature Type | Requires Pre-caching | Pre-caching Influence |
|---|---|---|
| POS tags | No | Low |
| Dependency labels | No | Low |
| Positional features (story position) | Yes | High |
  • Myopic model loss: \(3.29 \pm 0.02\); standard model loss: \(2.53 \pm 0.10\) (large gap)
  • Simple syntactic features do not require pre-caching, but coherent text generation does

Gemma 2 2B: SAE Feature Analysis

  • Features with extremely high or low \(Q(w)\) are predominantly associated with programming and formal reasoning
  • \(\sigma_{\text{formal}} = 1.63 \pm 0.03\) vs. \(\sigma_{\text{not formal}} = 1.23 \pm 0.02\)
  • Steering via high-\(Q(w)\) features generates more code and punctuation
  • Pre-caching features are negatively correlated with lookahead, supporting the "breadcrumb hypothesis"

Highlights & Insights

  1. Understanding features from a developmental perspective: Unlike traditional functional interpretability, this work explains feature emergence from the developmental angle of gradient signals
  2. Elegance of the three-pathway decomposition: The total gradient is decomposed exactly into three components with clear physical interpretations
  3. A new explanation for OthelloGPT world models: Representations of NTP-useless board squares arise from pre-caching and circuit sharing, not direct learning
  4. Refuting the pre-caching = lookahead hypothesis: Empirical results show that pre-caching features are negatively correlated with lookahead ability
  5. Cross-scale validation: Consistent findings from toy tasks to Gemma 2 2B

Limitations & Future Work

  • The attribution method relies on retraining models, which is infeasible for large models; Proposition 5.1 is only an approximation
  • Feature definitions are limited to linear probe directions; nonlinear features are not covered
  • Large-model experiments are conducted only on Gemma 2; validation on additional LLMs remains to be done
  • Myopic training simultaneously blocks forward information flow and backward gradient flow, making it difficult to fully disentangle causal relationships
  • The influence of post-training stages such as RLHF on feature emergence is not explored

Relation to Prior Work

| Prior Work | Contribution of This Paper |
|---|---|
| Li et al. (2023): OthelloGPT | Explains why NTP-useless features can still be learned |
| Vafa et al. (2025): world model fragility | Uses gradient components to explain the gap between NTP-useful and NTP-useless features |
| Wu et al. (2024): pre-caching hypothesis | Introduces circuit sharing and refutes the pre-caching = lookahead equation |
| Bachmann & Nagarajan (2024): NTP limitations | Complements prior analysis by identifying sources of NTP's positive capabilities |

Further Connections

  • "Developmental interpretability" emerges as a new paradigm for understanding LLMs
  • The pre-caching mechanism is connected to chain-of-thought reasoning: models may prepare for later inference at early positions
  • Circuit sharing explains why parameter-sharing Transformers are more powerful than position-independent models
  • The \(Q(w)\) metric serves as a practical tool for identifying "planning-type" features among SAE features

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — An entirely fresh perspective on feature emergence via gradient signal analysis
  • Experimental Thoroughness: ⭐⭐⭐⭐ — A complete chain from toy tasks to large models, though large-model experiments remain limited
  • Writing Quality: ⭐⭐⭐⭐⭐ — Clear logic, rigorous mathematics, and well-crafted figures
  • Value: ⭐⭐⭐⭐⭐ — Provides a novel analytical framework for understanding internal representations in LLMs