Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization¶
Conference: NeurIPS 2025 · arXiv: 2511.07378 · Code: None · Area: LLM Reasoning · Keywords: CoT Theory, Length Generalization, Attention Concentration, State Tracking, Complexity Classes
TL;DR¶
This paper provides the first optimization-theoretic proof that a one-layer Transformer trained via gradient descent can learn CoT reasoning on a synthetic state-tracking task and achieve length generalization. In doing so, it establishes the first convergence guarantees for constant-depth Transformers learning \(\mathsf{NC}^1\)-complete problems, going beyond prior theory that was limited to \(\mathsf{TC}^0\).
Background & Motivation¶
Background: Chain-of-thought (CoT) reasoning is a key technique that enables modern LLMs to tackle complex tasks exceeding the model's single-step processing capacity by generating intermediate reasoning steps. Theoretically, it is known that Transformers with CoT can express \(\mathsf{NC}^1\)-level problems, surpassing the \(\mathsf{TC}^0\) limitation of CoT-free models.
Limitations of Prior Work: Existing theoretical work either studies expressiveness alone ("can it represent?") without addressing optimization ("can it be learned?"), or proves convergence guarantees only for simple \(\mathsf{TC}^0\) tasks such as parity. A critical gap remains: no prior work has proven that Transformers can learn, through training, to produce CoT that generalizes to longer inputs.
Key Challenge: Expressiveness results do not imply learnability. Even if a set of weights exists that enables a Transformer with CoT to solve a problem, it is unclear whether gradient descent can find those weights and whether the resulting model generalizes to longer inputs.
Goal: To provide rigorous convergence and length generalization guarantees for Transformers learning CoT reasoning via gradient descent, at the complexity level of \(\mathsf{NC}^1\).
Key Insight: The authors focus on LEGO (a group-operation-based state-tracking task) as the theoretical benchmark. LEGO is a canonical problem separating \(\mathsf{TC}^0\) from \(\mathsf{NC}^1\) and possesses rich algebraic structure that can be exploited analytically.
Core Idea: The attention concentration mechanism links the retrieval robustness of the attention layer to the algebraic structure of the state-tracking task, showing that the group structure determines the extent of length generalization.
Method¶
Overall Architecture¶
The analysis concerns a one-layer Transformer (softmax attention + smoothed ReLU MLP) without positional encoding (NoPE), trained with a CoT loss on the LEGO state-tracking task. The LEGO task requires the model to execute a sequence of group operations \(g_1, g_2, \ldots, g_L\) and autoregressively output intermediate states \(y_1, y_2, \ldots, y_L\), where \(y_{i+1} = g_{i+1}(y_i)\).
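To make the setup concrete, below is a minimal sketch of a LEGO-style instance over the cyclic group \(C_6\), the group used in the main experiments; the sampler and all names are illustrative rather than the paper's code.

```python
# Minimal sketch of LEGO-style state tracking over the cyclic group C_6.
# Each step applies a group element g_{i+1} to the running state y_i; the
# intermediate states y_1, ..., y_L are the CoT targets the model must emit
# autoregressively. All names here are illustrative.
import random

GROUP_ORDER = 6  # C_6: elements are integers mod 6 under addition

def sample_lego_instance(length: int, seed: int | None = None):
    """Sample operations g_1..g_L and the CoT state sequence y_1..y_L."""
    rng = random.Random(seed)
    ops = [rng.randrange(GROUP_ORDER) for _ in range(length)]
    states, y = [], 0  # start from the identity element
    for g in ops:
        y = (y + g) % GROUP_ORDER  # y_{i+1} = g_{i+1}(y_i)
        states.append(y)
    return ops, states

ops, states = sample_lego_instance(length=5, seed=0)
print("ops:   ", ops)     # the clauses g_1..g_5
print("states:", states)  # the CoT supervision y_1..y_5
```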
Key Designs¶
- Attention Concentration Mechanism:
    - Function: Characterizes how the attention layer implements precise retrieval of relevant context.
    - Mechanism: The Transformer learns two key retrieval patterns via the attention matrix \(Q\): (1) \(Q_{4,3}\) retrieves the corresponding group operation \(g_{i+1}\) from context; (2) \(Q_{4,4}\) retrieves the preceding answer \(y_i\). When both attention distributions are sufficiently "concentrated," the MLP can correctly compute \(g_{i+1}(y_i)\).
    - Design Motivation: The key question for length generalization is whether attention remains focused on the correct positions as input length increases. This depends on the algebraic structure of the task.
- Generalization Analysis under Two Group Structures:
    - Simply transitive group operations (e.g., cyclic groups): The learned features exhibit high "purity," yielding highly concentrated attention (attention mass on the correct clause is \(1 - 1/d^c\)), enabling generalization to lengths \(\text{poly}(d)\) times the training length.
    - Symmetric group operations (\(S_n\)): The learned features are insufficiently "pure," so direct training generalizes only to a constant multiple of the training length. However, recursive self-training (progressive length doubling) can iteratively extend the solvable length.
- Recursive Self-Training Scheme (see the sketch after this list):
    - Function: Progressively extends the model's effective input length through curriculum learning.
    - Mechanism: The model is first trained to convergence on length \(L\); its greedy outputs are then used to generate training data of length \(2L\), and training continues. After the \(k\)-th doubling round, model accuracy reaches \(1 - O(2^{-k/\text{poly}(d)})\).
    - Design Motivation: For complex group structures that do not admit direct length generalization, self-training provides a principled path to progressively expanding capability beyond the coverage of the original training data.
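The loop can be summarized in a few lines. This is a schematic, not the paper's implementation: `train_fn`, `decode_fn`, and `sample_fn` are hypothetical stand-ins for gradient-descent training, greedy CoT decoding, and the input sampler, and the usage stubs are deliberately trivial.

```python
# Schematic of recursive self-training with length doubling: the converged
# length-L model labels fresh length-2L inputs with its own greedy CoT
# outputs, and training continues on those pseudo-labels. The callables are
# assumptions standing in for the paper's actual routines.

def self_training_curriculum(model, train_fn, decode_fn, sample_fn,
                             base_length: int, num_rounds: int):
    length = base_length
    model = train_fn(model, sample_fn(length), labels=None)  # supervised stage
    for _ in range(num_rounds):
        length *= 2                                          # double target length
        inputs = sample_fn(length)
        pseudo = [decode_fn(model, x) for x in inputs]       # greedy CoT outputs
        model = train_fn(model, inputs, labels=pseudo)       # train on 2L data
    return model

# Toy usage with trivial stand-ins: a "model" that already executes C_6
# exactly, so its pseudo-labels are correct by construction.
train_stub = lambda m, xs, labels=None: m
decode_stub = lambda m, ops: [sum(ops[:i + 1]) % 6 for i in range(len(ops))]
sample_stub = lambda L: [[i % 6 for i in range(L)]]
model = self_training_curriculum(object(), train_stub, decode_stub, sample_stub,
                                 base_length=5, num_rounds=3)
```

This mirrors the doubling curriculum described above: each round's pseudo-labels come from a model that has already converged on half the target length.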
Loss & Training¶
- Two-stage curriculum learning: first train the MLP to learn single-step group operations (one-step reasoning), then freeze the MLP and train the attention to perform contextual retrieval.
- No positional encoding is used. Theoretical analysis shows that standard PE schemes (e.g., RoPE) inherently hinder length generalization due to their positional dependence.
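The link between feature "purity" and concentration can be illustrated numerically: with content-only (NoPE) scores, what matters is the score margin of the correct clause over the distractors, and a fixed margin is gradually diluted as the number of positions grows. The following toy numpy calculation (an illustration, not the paper's construction) contrasts a small and a large margin.

```python
# Toy illustration of attention concentration with content-only (NoPE) scores:
# the correct clause beats L-1 distractors by a fixed score margin. A large
# margin (a "pure" feature) keeps nearly all softmax mass on the target even
# at long lengths; a small margin is diluted as L grows. Numbers are synthetic.
import numpy as np

def mass_on_target(margin: float, num_distractors: int, noise: float = 0.1,
                   seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    scores = rng.normal(scale=noise, size=num_distractors + 1)
    scores[0] += margin                   # correct clause gets the margin
    w = np.exp(scores - scores.max())     # numerically stable softmax
    return float(w[0] / w.sum())          # attention mass on the correct clause

for margin in (2.0, 8.0):                 # "impure" vs. "pure" features
    masses = [mass_on_target(margin, L - 1) for L in (10, 100, 1000)]
    print(f"margin={margin}: " + ", ".join(f"{m:.3f}" for m in masses))
```

In the paper's terms, purer features yield larger margins, which is why simply transitive groups sustain concentration out to \(\text{poly}(d)\) lengths while symmetric groups do not.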
Key Experimental Results¶
Main Results (Simply Transitive / Cyclic Group \(C_6\))¶
| Model (Training Length) | Test L=5 | Test L=10 | Test L=20 | Test L=30 | Test L=40 | Test L=50 |
|---|---|---|---|---|---|---|
| CoT (L=5) | 100% | 100% | 100% | 100% | 99.7% | 98.1% |
| Non-CoT (L=5) | 16.4% | 16.9% | 15.1% | 18.1% | 19.2% | 19.0% |
| Non-CoT (L=10) | 16.8% | 18.4% | 16.6% | 17.9% | 17.4% | 17.9% |
Ablation Study¶
| Configuration | Key Finding |
|---|---|
| CoT vs. Non-CoT | Non-CoT models fail to learn the task even within the training distribution (near-random performance) |
| Cyclic group vs. Symmetric group | Cyclic group generalizes to 10× training length; symmetric group generalizes to only ~2–3× |
| No PE vs. with PE | Removing PE is necessary for length generalization (standard PE biases toward local positions) |
Key Findings¶
- CoT is necessary: Transformers without CoT perform near chance (~17%) even within the training distribution, confirming that state tracking genuinely exceeds \(\mathsf{TC}^0\) capability.
- Algebraic structure governs generalization: Simply transitive groups (cyclic groups) yield far better generalization than symmetric groups, consistent with theoretical predictions.
- Recursive self-training is effective: Under the \(S_5\) setting, each length-doubling round successfully brings the model to high accuracy on the new target length.
- Attention concentration is empirically verifiable: Post-training attention weights are observed to concentrate on the correct clauses, as predicted by theory.
Highlights & Insights¶
- First optimization guarantee at the \(\mathsf{NC}^1\) level: Prior theory addressed only \(\mathsf{TC}^0\) tasks such as parity. This work extends convergence guarantees to \(\mathsf{NC}^1\)-complete problems, representing a qualitative advance in theoretical complexity.
- Attention concentration \(\Leftrightarrow\) length generalization: The paper presents a clean and elegant picture: length generalization depends on retrieval robustness of attention, which depends on feature "purity," which in turn depends on group structure.
- Theoretical support for NoPE: The paper provides a principled explanation for why removing positional encoding facilitates length generalization—the positional bias introduced by standard PE degrades the attention mechanism's long-range retrieval ability.
- Self-training transcends training data: The paper proves that self-training genuinely expands the model's capability boundary rather than merely reinforcing what has already been learned.
Limitations & Future Work¶
- Task specificity: The analysis is restricted to LEGO state tracking. Although this is a standard theoretical benchmark, it remains distant from natural language reasoning.
- Single-layer Transformer: The analysis is limited to one layer. While the authors argue this represents the current theoretical frontier, conclusions may differ for multi-layer settings.
- No positional encoding: Practical LLMs employ PE schemes such as RoPE; whether theoretical insights derived under the NoPE assumption translate to practice remains an open question.
- Curriculum learning assumption: The two-stage curriculum (MLP first, then attention) differs from standard end-to-end training.
- Restriction on symmetric groups: Theorem 4.2 requires \(|Y| < \log\log d\), leaving more general group structures unaddressed.
Related Work & Insights¶
- vs. Merrill et al. (ICLR 2024): They establish that CoT endows Transformers with \(\mathsf{NC}^1\) (and even P) expressiveness, but do not address optimization. This paper is the first to provide learnability guarantees.
- vs. Kim & Suzuki (ICLR 2025): They prove that Transformers with CoT can efficiently solve parity (\(\mathsf{TC}^0\)). This paper extends the complexity level to \(\mathsf{NC}^1\).
- vs. Wen et al.: They analyze the benefit of CoT for sparse-dependency problems from a sample efficiency perspective. This paper focuses on length generalization, and the two lines of work are complementary.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First learnability and length generalization guarantee for CoT at the \(\mathsf{NC}^1\) level; both the attention concentration framework and the theoretical treatment of recursive self-training are original contributions.
- Experimental Thoroughness: ⭐⭐⭐ — Experiments are adequate for a theory paper and validate the theoretical predictions, but are confined to synthetic tasks and the non-CoT baseline is somewhat underexplored.
- Writing Quality: ⭐⭐⭐⭐ — The paper is clear and rigorous with consistent notation, though reviewers noted that the figures' readability could be improved.
- Value: ⭐⭐⭐⭐ — Deepens theoretical understanding of CoT reasoning; the insight linking attention concentration to length generalization carries practical implications.