Continuous Chain of Thought Enables Parallel Exploration and Reasoning¶
Conference: ICLR 2026
arXiv: 2505.23648
Code: https://github.com/alperengozeten/CoT2
Area: LLM Reasoning / Model Compression
Keywords: Continuous chain of thought, parallel reasoning, multi-trajectory tracking, GRPO, information theory
TL;DR¶
CoT2 proposes replacing discrete tokens with continuous-valued tokens (convex combinations of vocabulary embeddings) for chain-of-thought reasoning, enabling the model to track multiple reasoning paths in parallel within a single forward pass. The approach is theoretically shown to be equivalent to \(K\) rounds of self-consistency/best-of-N sampling, and is further improved via GRPO-based reinforcement learning.
Background & Motivation¶
Background: CoT reasoning in modern LLMs operates through autoregressive sampling of discrete tokens, with self-consistency (majority voting over multiple samples) or best-of-N decoding employed to improve accuracy.
Limitations of Prior Work:
- Discrete sampling transmits at most \(\log_2(v)\) bits of information per step, whereas each token embedding can store \(O(d)\) bits, leaving most of the representational capacity unused.
- Once a token is sampled, the model commits to a single reasoning path and cannot explore alternatives.
- Self-consistency/best-of-N decoding requires multiple forward passes, so inference cost grows linearly with the number of samples.
Key Challenge: The irreversibility of discrete sampling causes a single reasoning chain to accumulate errors in a snowball effect, while the remedies (multiple sampling runs) impose substantial computational overhead.
Goal:
- How can a model track multiple reasoning paths simultaneously within a single inference pass?
- How powerful is the parallel tracking capacity of continuous tokens, and what is its theoretical relationship to discrete multi-sample decoding?
- How should continuous-token models be trained and deployed?
Key Insight: Rather than discretely sampling from the softmax output at each step, the output distribution is passed directly as a continuous token — a weighted combination of all vocabulary embeddings — into the next step. This "superposition state" naturally encodes information from multiple paths.
Core Idea: Continuous tokens, as convex combinations of vocabulary embeddings, inherently implement parallel path tracking. Their effect is theoretically equivalent to aggregating \(K\) independent discrete CoT chains — one forward pass subsumes \(K\) sampling runs.
Method¶
Overall Architecture¶
Given input \(\bm{X}\), the model autoregressively generates \(m\) tokens: for the first \(m-1\) steps, it outputs continuous tokens \(\bm{z}_t = \bm{E}^\top \bm{\alpha}_t\) (the product of the softmax distribution and the embedding matrix); at the final step, a discrete answer token is sampled. Training proceeds in two stages: Continuous Supervised Fine-Tuning (CSFT) followed by GRPO-based reinforcement learning.
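The generation loop above can be sketched in numpy. This is a minimal illustration, not the paper's implementation: the hypothetical `step_logits` callable stands in for the Transformer forward pass (the real model conditions on the whole prefix, not just the last token), and the embedding matrix `E` is random.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the vocabulary axis."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def cot2_generate(step_logits, E, x0, m):
    """Produce m-1 continuous tokens, then one discrete answer token.

    step_logits: callable mapping the current input token (d,) to logits (v,);
                 a stand-in for the Transformer forward pass.
    E: (v, d) vocabulary embedding matrix.
    """
    x = x0
    for _ in range(m - 1):
        alpha = softmax(step_logits(x))   # output distribution over the vocabulary
        x = E.T @ alpha                   # continuous token: convex combination of embeddings
    # Final step: commit to a discrete answer token.
    return int(np.argmax(step_logits(x)))
```

Note that no sampling occurs at intermediate steps: the full distribution \(\bm{\alpha}_t\) is carried forward, which is exactly what lets the model keep multiple paths alive.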
Key Designs¶
- Continuous Supervised Fine-Tuning (CSFT):
- Function: Uses "multi-trajectory superposition" as the supervision signal for intermediate steps.
- Mechanism: Given a budget of \(B\) optimal trajectories, the supervision distribution at intermediate step \(t\) is \(\alpha_{t,g}^* = \frac{1}{B}\sum_{\pi \in \Pi_B} \mathbf{1}\{g_t(\pi)=g\}\) — the empirical distribution over states visited at step \(t\) across the \(B\) trajectories. The final step uses a one-hot distribution (correct answer). The model is trained via cross-entropy/KL divergence to fit these soft labels.
- Design Motivation: Setting \(B=1\) reduces to discrete CoT (one-hot); \(B=|\mathcal{T}|\) tracks all trajectories (maximum parallelism). The budget provides flexible control over the trade-off between parallelism and model capacity.
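The supervision distribution \(\alpha_{t,g}^*\) is just the empirical frequency of each state across the \(B\) trajectories, which can be sketched directly (state ids and trajectory encoding here are illustrative choices, not the paper's data format):

```python
import numpy as np

def csft_targets(trajectories, num_states):
    """Soft labels for CSFT: per-step empirical state distribution.

    trajectories: list of B state sequences of equal length (integer state
                  ids in [0, num_states)), the B optimal trajectories Pi_B.
    Returns an (steps, num_states) array whose row t is alpha*_t.
    """
    B = len(trajectories)
    steps = len(trajectories[0])
    alpha = np.zeros((steps, num_states))
    for pi in trajectories:
        for t, g in enumerate(pi):
            alpha[t, g] += 1.0 / B  # each trajectory contributes weight 1/B
    return alpha
```

With `B=1` every row is one-hot, recovering the discrete-CoT supervision signal mentioned above.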
- Budget–Embedding Dimension Trade-off:
- Function: Quantifies the theoretical relationship between parallelism and embedding dimensionality.
- Mechanism: An information-theoretic argument shows that reliably decoding the superposition of \(B\) trajectories requires embedding dimension \(d = \Omega(B\log(v/B))\). When \(d\) is sufficiently large, increasing \(B\) monotonically improves performance; otherwise, there is an optimal sweet spot for \(B\).
- Design Motivation: This explains why \(B=8\) outperforms \(B=16\) at \(d=16\) (insufficient capacity), whereas \(B=16\) is optimal at \(d=32\).
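The scaling of the bound can be checked numerically. This is purely illustrative: the constant hidden in \(\Omega(\cdot)\) is unknown (taken as 1 here) and the vocabulary size 512 is an arbitrary choice, so only the relative growth in \(B\) is meaningful.

```python
import math

def min_dim(B, v, c=1.0):
    """Illustrative lower bound d >= c * B * log2(v / B).

    The true constant c from the paper's argument is unknown; c=1 is
    assumed only to expose how the required dimension scales with B.
    """
    return c * B * math.log2(v / B)
```

For example, with `v=512`, doubling the budget from `B=8` to `B=16` raises the required dimension from 48 to 80, so a fixed `d` can make the larger budget counterproductive, matching the sweet-spot behavior described above.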
- Single-Layer Transformer Construction (Proposition 1):
- Function: Proves that a single-layer Transformer using CoT2 can solve the Minimum Non-Negative Sum (MNNS) problem.
- Mechanism: Trigonometric embeddings encode all \(2^k\) states in non-overlapping (sin, cos) representations; the attention layer expands states (by adding or subtracting new numbers); the MLP layer reads and filters. Each step tracks an exponentially growing number of states in parallel, and the final step selects the minimum non-negative sum.
- Design Motivation: MNNS is essentially a signed subset-sum problem requiring search over \(2^k\) sign assignments. Discrete CoT must commit to one path, whereas CoT2 tracks all paths simultaneously.
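To make the task concrete, here is a brute-force reference solver for MNNS (not the Transformer construction itself); the sign-assignment reading of the problem is inferred from the "adding or subtracting new numbers" description above.

```python
from itertools import product

def mnns(nums):
    """Minimum non-negative signed sum: choose +/- for each number.

    Enumerates all 2^k sign assignments -- the search space that the
    single-layer CoT2 construction is shown to explore in parallel.
    """
    best = None
    for signs in product((1, -1), repeat=len(nums)):
        s = sum(sg * x for sg, x in zip(signs, nums))
        if s >= 0 and (best is None or s < best):
            best = s
    return best
```

For instance, `mnns([3, 5, 7])` returns 1, realized by the assignment \(3 + 5 - 7\).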
- Multi-Token Sampling (MTS) + GRPO:
- Function: Introduces controllable stochasticity into CoT2 to enable the use of RL methods.
- Mechanism: At each step, \(K\) discrete tokens are sampled from \(\bm{\alpha}_t\) and their embeddings averaged: \(\bm{z}_t = \frac{1}{K}\sum_{r=1}^K \bm{e}_{i_r}\). This is an unbiased but noisy estimate of the deterministic continuous token \(\bm{E}^\top \bm{\alpha}_t\). Proposition 3 shows that one MTS pass is statistically equivalent to aggregating \(K\) independent discrete CoT chains, reducing sample complexity by a factor of \(K\).
- Design Motivation: Base CoT2 is deterministic (no stochasticity), making it impossible to directly compute policy ratios for GRPO. MTS introduces controlled noise, allowing the GRPO policy ratio \(r_t^{(i)}(\theta)\) to be defined and computed.
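The sampling step itself is a one-liner in numpy; this sketch shows only the estimator \(\bm{z}_t = \frac{1}{K}\sum_r \bm{e}_{i_r}\), not the surrounding GRPO machinery.

```python
import numpy as np

def mts_token(alpha, E, K, rng):
    """Multi-token sampling: draw K vocabulary indices from the output
    distribution alpha and average their embedding rows of E (v, d).

    The result is an unbiased, noisy estimate of E^T alpha; the noise is
    what makes the GRPO policy ratio well-defined.
    """
    idx = rng.choice(len(alpha), size=K, p=alpha)
    return E[idx].mean(axis=0)
```

As \(K\) grows the estimate concentrates on the deterministic CoT2 token, so \(K\) directly trades exploration noise against fidelity.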
Loss & Training¶
- CSFT stage: \(\mathcal{L}_{CSFT} = \sum_{t=1}^m D(\bm{\alpha}_t^* \| \bm{\alpha}_t)\), using soft-label cross-entropy for intermediate steps and standard CE for the final step.
- GRPO stage: Standard GRPO clipped surrogate objective with KL regularization and sparse rewards (correct = 1, incorrect = 0).
- Teacher forcing is applied during CSFT (even though inference is autoregressive), which outperforms self-feeding.
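With cross-entropy as the divergence \(D\), the CSFT objective reduces to a soft-label cross-entropy summed over steps. A minimal numpy sketch, assuming targets and predictions are given as per-step distribution arrays (the `eps` floor is a standard numerical guard, not from the paper):

```python
import numpy as np

def csft_loss(alpha_star, alpha_pred, eps=1e-12):
    """Sum over steps t of cross-entropy H(alpha*_t, alpha_t).

    alpha_star, alpha_pred: (steps, v) arrays of target and predicted
    distributions. Equals the KL divergence up to the target entropy,
    so minimizing it fits the soft labels alpha*_t.
    """
    return float(-(alpha_star * np.log(alpha_pred + eps)).sum())
```

One-hot rows of `alpha_star` recover the standard CE used at the final (answer) step.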
Key Experimental Results¶
Main Results (MNNS task, 4 digits in range 1–99)¶
| Method | d=16 acc | d=24 acc | d=32 acc |
|---|---|---|---|
| No-CoT | ~15% | ~15% | ~15% |
| Discrete CoT (B=1) | ~55% | ~70% | ~75% |
| COCONUT | ~45% | ~60% | ~65% |
| CoT2 (B=16) | ~60% | ~95% | ~98% |
Pass@k Comparison (d=24, MNNS)¶
| Method | Pass@1 | Pass@4 | Pass@8 | Pass@16 |
|---|---|---|---|---|
| Discrete CoT | ~70% | ~82% | ~88% | ~93% |
| CoT2 | ~95% | ~96% | ~97% | ~98% |
Key Findings¶
- Single CoT2 pass ≈ multiple discrete CoT samples: CoT2's Pass@1 matches the Pass@16 level of discrete CoT.
- Budget–dimension sweet spot exists: At \(d=16\), \(B=8\) is optimal (\(B=16\) exceeds capacity); at \(d=32\), \(B=16\) is optimal.
- GRPO is effective on CoT2: RL fine-tuning leads the model to prioritize relevant reasoning paths, reducing the entropy of continuous tokens.
- CoT2 outperforms COCONUT: When external search supervision signals are available, directly fitting the multi-trajectory distribution is more effective than hidden-state substitution.
- Strong theory–experiment agreement: The lower bound \(d=\Omega(B\log(v/B))\) is empirically validated.
Highlights & Insights¶
- Deep information-theoretic insight: Discrete tokens carry at most \(\log_2 v\) bits per step, whereas continuous tokens can pack \(B \cdot \log_2(v/B)\) bits — this argument elegantly explains why continuous tokens are more expressive.
- The theoretical guarantee that "one pass ≈ K samples" (Proposition 3) is particularly powerful: it establishes a quantitative equivalence between CoT2 and self-consistency and gives continuous tokens a clear practical interpretation.
- Extending RL to continuous action spaces for LLMs: Conventional GRPO/PPO operates over discrete token spaces; CoT2's MTS strategy ingeniously introduces controllable noise in the continuous space via "sample-and-average," making policy gradient methods applicable.
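The bit-count comparison in the first bullet can be checked numerically. This follows the document's headline formulas only, ignoring constants; the vocabulary size 1024 is an arbitrary illustrative choice.

```python
import math

def token_bits(v):
    """Bits carried by one discrete token drawn from a vocabulary of size v."""
    return math.log2(v)

def superposition_bits(B, v):
    """Headline capacity of a continuous token encoding B of v symbols:
    B * log2(v / B), per the document's counting argument (constants dropped)."""
    return B * math.log2(v / B)
```

For `v=1024`, a discrete token carries 10 bits while a budget-16 superposition carries 96, an order-of-magnitude gap that widens with \(B\).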
Limitations & Future Work¶
- Validation is limited to synthetic tasks (MNNS, ProntoQA, ProsQA); the approach has not been tested on real-world NLP tasks or large-scale LLMs.
- Assumption 1 (Markov property + linear superposition) may not hold strictly in practical Transformers.
- Continuous tokens cannot be directly interpreted as natural language, sacrificing the interpretability of CoT.
- The orthogonality of the vocabulary embedding matrix \(\bm{E}\) affects superposition quality; in practice, embeddings may be highly correlated.
- Only the final step produces a discrete token; extending the framework to settings requiring multi-step discrete outputs (e.g., long-form answers) remains an open problem.
Related Work & Insights¶
- vs. COCONUT: Both operate in the continuous thought chain paradigm, but COCONUT substitutes LLM hidden states without explicit multi-trajectory supervision; CoT2 directly fits the trajectory distribution via CSFT, yielding superior performance.
- vs. Self-Consistency: Self-consistency requires \(K\) sampling passes with majority voting; CoT2 is theoretically equivalent to \(K\) samples in a single forward pass, improving inference efficiency by a factor of \(K\).
- vs. Latent Reasoning (Coconut/Quiet-STaR): These methods focus on internalizing reasoning into latent space but lack CoT2's information-theoretic guarantees and the explicit formalization of multi-trajectory parallel tracking.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ Information-theory-driven continuous reasoning with parallel tracking; solid theoretical contributions.
- Experimental Thoroughness: ⭐⭐⭐ Limited to synthetic tasks; large-scale LLM validation is absent.
- Writing Quality: ⭐⭐⭐⭐⭐ Theoretical derivations are clear, intuitions are well-explained, and figures are well-designed.
- Value: ⭐⭐⭐⭐ Outstanding theoretical contribution to understanding continuous reasoning, though practical applicability remains to be demonstrated.