Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones¶
Conference: NeurIPS 2025 arXiv: 2505.21825 Code: github.com/seyedparsa/let-me-think Area: LLM Reasoning Keywords: Test-time scaling, sequential scaling vs. parallel scaling, chain-of-thought, complexity theory, graph connectivity
TL;DR¶
This paper shows, theoretically and empirically, that there exist reasoning tasks (graph connectivity) on which a single long CoT (sequential scaling) is worth exponentially many short CoTs (parallel scaling): shortening the CoT by even a modest amount forces an exponential increase in the number of parallel samples needed to match its accuracy.
Background & Motivation¶
Background: Test-time computation has two primary scaling axes—sequential scaling (longer CoT) and parallel scaling (multiple short CoTs with voting/Best-of-N). Models such as o1 and R1 rely on long CoTs, but the computational cost of long CoTs grows quadratically with sequence length.
Limitations of Prior Work: There is currently no theoretical understanding of the trade-off between the two scaling approaches. Some argue that multiple short CoTs with majority voting are more efficient than a single long CoT; others maintain that long CoTs are irreplaceable for hard problems. No clear theoretical framework exists to answer this question.
Key Challenge: Intuitively, sequential reasoning (e.g., DFS over a graph) requires accumulating information step by step, whereas each parallel sample starts from scratch and cannot share intermediate computation. This intuition, however, lacks formal grounding.
Goal: Does there exist a reasoning task for which sequential scaling offers an exponential advantage over parallel scaling?
Key Insight: The paper studies graph connectivity—given a graph, a source node, and two candidate targets \(t_1\) and \(t_2\), decide which target the source is connected to. This serves as a proxy for multi-step reasoning and is intimately related to the expressibility limits of Transformers (\(\mathsf{TC}^0\) vs. \(\mathsf{L}\)).
Core Idea: On graph connectivity tasks, bounded-depth Transformers with \(O(1)\)-length CoTs succeed no better than random guessing even with polynomially many parallel samples, yet a single polynomial-length CoT suffices for perfect solutions—establishing an exponential sequential-vs.-parallel gap.
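Concretely, an instance asks which of two targets is reachable from the source. A minimal sketch of the task in Python (the adjacency-dict encoding and node names are illustrative, not the paper's input format):

```python
from collections import deque

def connected(adj, s, t):
    """Breadth-first reachability: is t reachable from s in adjacency dict adj?"""
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return False

# Tiny instance: the source s is connected to exactly one of t1, t2.
adj = {
    "s": ["a"], "a": ["s", "t1"], "t1": ["a"],  # component containing s
    "b": ["t2"], "t2": ["b"],                   # disjoint component
}
print(connected(adj, "s", "t1"), connected(adj, "s", "t2"))  # True False
```

The BFS trace itself has polynomial length, which is exactly the kind of computation a polynomial-length CoT can emit step by step.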
Method¶
Overall Architecture¶
The paper proceeds in three layers: (1) a theoretical separation based on Transformer expressibility (Theorem 1); (2) a finer separation using the Vertex Query Model (VQM) abstraction (Theorems 2 & 3); and (3) empirical validation on small Transformers trained from scratch and on frontier large language models.
Key Designs¶
- Theoretical Separation via \(\mathsf{TC}^0\) (Theorem 1):
- Function: Proves that sequential scaling can solve graph connectivity while parallel scaling cannot.
- Mechanism: (a) A polynomial-length CoT can simulate BFS, so one CoT suffices (a corollary of Merrill & Sabharwal 2024b); (b) a Transformer with \(O(1)\)-length CoT is equivalent to a \(\mathsf{TC}^0\) circuit, and \(\mathsf{TC}^0 \not\supseteq \mathsf{L}\) (standard complexity assumption), so connectivity cannot be solved. Key insight: taking a majority vote over multiple independent short CoTs remains a \(\mathsf{TC}^0\) circuit (since MAJORITY gates are part of \(\mathsf{TC}^0\)), so parallel scaling cannot break this barrier.
- Design Motivation: Establishes the strongest form of the impossibility result—polynomially many parallel samples still fail.
- Fine-Grained Analysis via the Vertex Query Model (Theorems 2 & 3):
- Function: Quantifies the sequential-vs.-parallel gap in a more refined computational model.
- Mechanism: In the VQM, algorithms access the graph only through "neighbor queries," querying the neighbors of one vertex per step. For two-path graphs (Theorem 2): \(O(L)\) queries suffice, but with fewer than \(L/2\) queries the success probability is exactly 1/2. For bridge graphs (Theorem 3): even with more queries than the shortest-path length (\((1-\delta)\cdot\frac{3}{2}ld\) queries), the success probability remains only \(1/2+\exp(-\Omega(d))\)—requiring \(\exp(\Omega(d))\) independent runs (parallel scaling) to reach a 2/3 success rate.
- Design Motivation: The bridge graph forces the model to guess the direction of the shorter path at each junction with probability 1/2; the cumulative probability decays exponentially over \(d\) junctions.
- Emergence of Long CoTs under RL (Section 5):
- Function: Trains models with STaR (Self-Taught Reasoner) and observes spontaneous CoT length growth.
- Mechanism: Models are trained on bridge graphs with short CoTs (Path strategy), then iteratively fine-tuned via RL (verifying correct CoTs → self-training). The correct CoTs gradually become longer, and backtracking behavior absent from the training data emerges.
- Design Motivation: Connects to the CoT length growth observed during RL training of DeepSeek-R1, suggesting that graph connectivity is an effective proxy task for studying long-CoT emergence.
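The junction argument behind Theorem 3 can be illustrated with a toy Monte-Carlo simulation (a stylized stand-in for the bridge-graph construction, not the paper's exact VQM): a query-budget-limited run must guess the short branch at each of \(d\) junctions, so its per-run success probability decays like \(2^{-d}\).

```python
import random

def run_once(d, rng):
    # A query-budget-limited run must pick the correct branch at every
    # one of the d junctions; each guess is right with probability 1/2.
    return all(rng.random() < 0.5 for _ in range(d))

def success_rate(d, trials=200_000, seed=0):
    rng = random.Random(seed)
    return sum(run_once(d, rng) for _ in range(trials)) / trials

for d in (2, 4, 8):
    print(d, success_rate(d))  # decays roughly like 2**-d
```

Since each independent parallel run faces the same coin flips from scratch, boosting this to a constant success rate takes \(\exp(\Omega(d))\) runs, whereas one sequential run with enough queries can simply search every junction.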
Loss & Training¶
Small Transformers (4-layer Mistral architecture, hidden dimension 128) are trained on 500k CoT samples for 200 epochs. RL follows STaR: each round draws 500k samples, retains the correct ones, and fine-tunes the model on them for 20 epochs.
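The training loop can be sketched as a generic STaR round (sample, verify, self-train). Here `sample_cot`, `is_correct`, and `finetune` are placeholders for the model's decoding, the connectivity verifier, and supervised fine-tuning; the toy stand-ins below are illustrative, not the paper's code.

```python
import random

def star_round(model, problems, sample_cot, is_correct, finetune, k=4):
    """One STaR round: draw up to k CoTs per problem, keep the first
    verified-correct one, then fine-tune on the kept (problem, CoT) pairs."""
    kept = []
    for p in problems:
        for _ in range(k):
            cot = sample_cot(model, p)
            if is_correct(p, cot):
                kept.append((p, cot))
                break
    return finetune(model, kept), kept

# Toy stand-ins: only "long" CoTs pass verification, so the kept data
# (and hence the next round's model) drifts toward longer CoTs.
rng = random.Random(0)
sample_cot = lambda model, p: rng.choice(["short", "long"])
is_correct = lambda p, cot: cot == "long"
finetune = lambda model, kept: kept
model, kept = star_round(None, ["g1", "g2", "g3"], sample_cot, is_correct, finetune)
print([cot for _, cot in kept])
```

Because only correct CoTs survive the filter, any length bias in what verifies carries into the next round's training set, which is the mechanism behind the spontaneous length growth reported in Section 5.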
Key Experimental Results¶
Main Results¶
Bridge(5) Task: Evidence Accuracy under Different CoT Strategies
| CoT Strategy | CoT Length | Evidence Accuracy | Notes |
|---|---|---|---|
| Shortest-Path | ~35 tokens | 0.0% | Too short to explore |
| Path (DFS-tree path) | ~50 tokens | 11.16% | Slightly better but insufficient |
| DFS (full search) | ~90 tokens | ~100% | Perfect solution |
DeepSeek-R1-Distill-Qwen-32B: Sequential vs. Parallel Scaling on Bridge Graphs
| CoT Length (tokens) | Parallel = 1 | Parallel = 64 | Notes |
|---|---|---|---|
| 512 | ~50% | ~50% | Too short; parallel yields no gain |
| 1024 | ~55% | ~60% | Begins to take effect |
| 2048 | ~75% | ~90% | Sequential progress makes parallel voting pay off |
| 4096 | ~95% | ~99% | Sequential alone sufficient |
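The pattern in this table matches a simple toy model of a fixed token budget split across \(n\) runs (illustrative assumptions: a run resolves one junction per fixed token cost, guesses the rest, and an oracle verifier accepts any correct run):

```python
def run_success(tokens, d, cost_per_junction=100):
    # A run searches floor(tokens / cost) junctions correctly and must
    # guess each remaining junction, right with probability 1/2.
    solved = min(d, tokens // cost_per_junction)
    return 0.5 ** (d - solved)

def best_of_n(budget, n, d):
    p = run_success(budget // n, d)
    return 1 - (1 - p) ** n  # success if any of the n runs succeeds

budget, d = 800, 8
for n in (1, 2, 4, 8):
    print(n, round(best_of_n(budget, n, d), 4))
# one full-budget run solves the task; splitting the same budget does not
```

Under these assumptions one full-budget sequential run succeeds with certainty, while every way of splitting the budget loses exponentially more per run than the extra attempts recover.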
Ablation Study¶
| Configuration | Bridge(3) Evidence Accuracy | Notes |
|---|---|---|
| Path model (before RL) | 21.16% | Limited by short CoT |
| Path model (after 4 RL rounds) | 92.02% | CoT grows spontaneously; backtracking emerges |
| DFS model | ~100% | Training data contains long CoTs |
Key Findings¶
- An exponential gap exists: On bridge graphs, reducing CoT length from the full DFS length to the shortest-path length requires an exponential increase in parallel samples (from 1 to \(\sim 3^d \cdot 4^{d-1}\)) to achieve the same accuracy.
- Parallel scaling is useful only after sequential scaling reaches non-trivial accuracy: In large-model experiments, increasing parallel samples from 1 to 64 yields almost no gain when CoT length is insufficient; parallel scaling only becomes effective once CoT length is adequate.
- RL induces spontaneous CoT length growth: After STaR training, the average length of correct CoTs grows from ~15 to ~30 tokens, and backtracking behavior absent from the training data emerges.
- Under a fixed total token budget, sequential scaling consistently outperforms parallel scaling.
- The trend holds on AIME 2024: s1-32B likewise shows that sequential scaling cannot be replaced by parallel scaling on AIME 2024.
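The exponential blow-up can be made concrete with the standard majority-vote calculation: if each short-CoT run is correct with probability \(1/2+\varepsilon\), reaching 2/3 accuracy by voting needs on the order of \(1/\varepsilon^2\) runs, so \(\varepsilon = \exp(-\Omega(d))\) forces exponentially many samples. A small sketch using the exact binomial (not the paper's bound):

```python
from math import comb

def majority_success(p, n):
    """P(majority of n i.i.d. runs is correct); n is odd, so no ties."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def samples_needed(p, target=2 / 3):
    n = 1
    while majority_success(p, n) < target:
        n += 2  # keep n odd
    return n

# As the per-run edge eps shrinks (eps ~ exp(-d)), votes needed grow ~ 1/eps^2.
for eps in (0.25, 0.1, 0.05, 0.02):
    print(eps, samples_needed(0.5 + eps))
```

Halving \(\varepsilon\) roughly quadruples the number of votes required, so an exponentially small edge translates directly into an exponentially large parallel budget.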
Highlights & Insights¶
- First proof of an exponential separation between sequential and parallel scaling: This is an important theoretical milestone. The core insight—that majority voting over parallel samples remains a \(\mathsf{TC}^0\) circuit—is the genuine key contribution, implying that the voting mechanism itself cannot overcome depth limitations.
- Elegant bridge graph construction: By placing a short path, a long path, and a dead end at each junction, the construction forces genuine sequential search rather than parallel guessing, and itself constitutes a useful benchmark.
- Inspiring explanation of long-CoT emergence under RL: The graph connectivity task provides a concise theoretical framework for explaining the CoT length growth observed during RL training of DeepSeek-R1.
Limitations & Future Work¶
- The theoretical results rely on the complexity assumption \(\mathsf{TC}^0 \not\supseteq \mathsf{L}\), which is widely believed but unproven.
- The VQM as an abstraction of Transformer CoT has only heuristic support and lacks a direct theoretical proof.
- Graph connectivity is a specific structured task; the extent to which the findings transfer to natural language reasoning requires further study.
- Experiments are validated only on graph tasks and AIME; broader benchmark coverage would strengthen the claims.
Related Work & Insights¶
- vs. s1 (Simple Test-Time Scaling): s1 shows that extending CoT length via "Wait" tokens improves performance; this paper provides theoretical justification for that practice.
- vs. Snell et al. 2025 (Scaling LLM Test-Time Compute): Their empirical scaling laws suggest that parallel scaling is preferable to sequential scaling in some settings; this paper identifies an opposing extreme.
- vs. Li et al. 2024 (CoT empowers Transformers): That work theoretically proves that CoT enables Transformers to solve inherently sequential problems; this paper builds on it to quantify the sequential-vs.-parallel gap.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First exponential separation proof between sequential and parallel scaling; elegant bridge graph construction.
- Experimental Thoroughness: ⭐⭐⭐⭐ Small models trained from scratch + frontier LLM validation + RL emergence experiments + AIME transfer.
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretically clear, intuitions well-explained, proof structure elegant.
- Value: ⭐⭐⭐⭐⭐ Provides a theoretical foundation for the core question of "long thinking vs. many short thoughts," with far-reaching implications.