Training Large Language Models to Reason in Parallel with Global Forking Tokens¶
Conference: ICLR2026 arXiv: 2510.05132 Code: GitHub Area: Code Intelligence Keywords: parallel reasoning, global forking token, set supervised fine-tuning, bipartite matching, diverse reasoning
TL;DR¶
This paper proposes Set Supervised Fine-Tuning (SSFT), which introduces global forking tokens and a set-based loss via bipartite matching to train LLMs to produce diverse and correct reasoning patterns triggered by a single control token, outperforming standard SFT+GRPO on both Pass@1 and Cons@k.
Background & Motivation¶
- Test-time parallel sampling (e.g., self-consistency, Best-of-N) is an effective approach for improving LLM reasoning, but relies on generating diverse and correct reasoning paths.
- For challenging problems, the "forking tokens" that trigger different reasoning modes are typically buried deep in the sampling tree; increasing temperature improves diversity but degrades accuracy.
- Theoretical studies show that simply raising temperature cannot guarantee diversity unless the model is explicitly trained for coverage.
- Standard SFT on multiple reasoning trajectories leads to mode collapse — different control tokens produce nearly identical reasoning.
Core Problem¶
How can an LLM be trained so that simple control tokens, placed at the start of generation, trigger diverse yet correct reasoning modes?
Method¶
Global Forking Tokens¶
- A set of \(N\) special tokens `<think i>` (for \(i = 1, \dots, N\)) is reserved, with \(N=6\) in experiments.
- Objective: given a question \(\mathbf{x}\), different `<think i>` tokens initiate \(M\) distinct reasoning trajectories (\(M=4\) per question).
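As a concrete illustration, the setup above can be sketched in plain Python. The `<think i>` token strings follow the paper's naming; the prompt format (question, newline, forking token) is an assumption for illustration, not the paper's exact template.

```python
# Sketch: reserving N global forking tokens and building the prompts for one
# question. N = 6 tokens are reserved; M = 4 are used per question, as in
# the paper's experiments. Prompt formatting here is an assumption.

N, M = 6, 4

forking_tokens = [f"<think {i}>" for i in range(1, N + 1)]

def build_prompt(question: str, g: str) -> str:
    # The forking token is placed at the very start of generation, so a
    # single token selects a reasoning mode.
    return f"{question}\n{g}"

prompts = [build_prompt("What is 2+2?", g) for g in forking_tokens[:M]]
print(prompts[0])  # "What is 2+2?\n<think 1>"
```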
SSFT Loss Function¶
Parallel reasoning is formulated as a set prediction problem, solved via bipartite matching:
Step 1: Compute the cost matrix

For each pair (forking token \(g^{(i)}\), reasoning trajectory \(\mathbf{r}^{(j)}\)), compute the next-token-prediction (NTP) loss:

$$\mathcal{L}_{\text{matching}}(g^{(i)}, \mathbf{r}^{(j)}) = -\,\text{sg}\!\left(\frac{1}{T_\mathbf{r}} \sum_{t=1}^{T_\mathbf{r}} \log \pi_\theta\big(r_t^{(j)} \mid \mathbf{x}, g^{(i)}, \mathbf{r}_{<t}^{(j)}\big)\right)$$

where \(\text{sg}(\cdot)\) denotes the stop-gradient operation, applied to reduce memory usage.
Step 2: Solve for the optimal matching via the Hungarian algorithm
Step 3: Train under the optimal matching
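The three steps above can be sketched with toy numbers. The per-token log-probs `logp` are a random stand-in for model outputs; in practice the cost uses only the first \(L\) tokens under stop-gradient. For \(M=4\), a brute-force search over permutations finds the same matching the Hungarian algorithm (e.g., `scipy.optimize.linear_sum_assignment`) would.

```python
import itertools
import random

# Toy stand-in: logp[i][j][t] = log pi(r_t^(j) | x, g^(i), r_<t^(j)).
# Real values come from the model; the cost is computed under stop-gradient
# and (per the efficiency trick) only over the first L tokens.
M, L = 4, 8
random.seed(0)
logp = [[[-random.random() for _ in range(L)] for _ in range(M)]
        for _ in range(M)]

# Step 1: cost[i][j] = mean NLL of trajectory j conditioned on token i.
cost = [[-sum(logp[i][j]) / L for j in range(M)] for i in range(M)]

# Step 2: optimal bipartite matching. Brute force over M! permutations is
# fine for M = 4; the Hungarian algorithm gives the same result in O(M^3).
best_perm = min(itertools.permutations(range(M)),
                key=lambda p: sum(cost[i][p[i]] for i in range(M)))

# Step 3: the NTP loss is then minimized only on the matched pairs.
matched = list(zip(range(M), best_perm))
print(matched)
```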
Computational Efficiency¶
- Only the first \(L < T_\mathbf{r}\) tokens are used to compute the matching cost (with \(L \approx 1000\) in experiments).
- Time complexity is reduced by a factor of \(k = T_\mathbf{r}/L\); backpropagation is performed only over the \(M\) matched sequences.
GFPO: Global Forking Policy Optimization¶
- A small number of RL steps are applied after SSFT, optimizing only the output distribution of the global forking tokens \(g^{(i)}\).
- Implementation is minimal: only a few slicing lines need to be added to existing GRPO code.
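The "few slicing lines" can be imagined as masking the per-token policy-gradient loss so that gradients flow only through the forking-token position. The sketch below is a hypothetical illustration of that idea; names, shapes, and the position index are assumptions, not the paper's actual code.

```python
# Hypothetical sketch of the GFPO restriction: given per-token
# policy-gradient terms for one sampled sequence, keep only the term at the
# global forking token's position (index 0 here, since g^(i) is the first
# generated token). This mimics slicing a GRPO loss tensor to one position.

def gfpo_loss(per_token_pg_terms: list, fork_pos: int = 0) -> float:
    # All other token positions contribute nothing to the gradient.
    return per_token_pg_terms[fork_pos]

terms = [0.5, -0.1, 0.2, 0.0]   # toy per-token (-advantage * logprob) terms
print(gfpo_loss(terms))          # only the forking token's term remains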
Inference Strategy¶
- Cons@k: each `<think i>` token is used to prompt the model separately, and the final answer is chosen by majority voting over the resulting trajectories.
- Pass@1: a graph-based heuristic selects the \(g^{(i^\star)}\) that covers the most reasoning trajectories, or GFPO lets the model sample the optimal \(g^{(i)}\) autonomously.
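The Cons@k procedure reduces to one model call per forking token followed by a vote. A minimal sketch, where `generate` is a hypothetical stand-in for the model call and the prompt format is assumed:

```python
from collections import Counter

# Sketch of Cons@k inference: prompt once per forking token, collect the
# final answers, and return the majority answer.

def cons_at_k(question, forking_tokens, generate):
    answers = [generate(f"{question}\n{g}") for g in forking_tokens]
    return Counter(answers).most_common(1)[0][0]

# Toy usage with a fake generator: three tokens agree, one dissents.
fake = {"<think 1>": "42", "<think 2>": "42",
        "<think 3>": "41", "<think 4>": "42"}
result = cons_at_k("q", list(fake), lambda p: fake[p.split("\n")[1]])
print(result)  # "42"
```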
Key Experimental Results¶
Pass@1 Comparison (Qwen2.5-32B-Instruct backbone)¶
| Method | AIME24 | AIME25 | MATH-500 | GPQA-D | LCB(v5) |
|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 15.80 | 10.40 | 80.40 | 47.00 | 23.35 |
| s1.1-32B (single-traj.) | 54.79 | 44.27 | 92.16 | 62.12 | – |
| SFT-mixed-32B (multi-traj.) | 58.23 | 51.96 | 88.49 | 59.96 | 32.34 |
| SSFT-32B | 64.06 | 58.13 | 90.02 | 60.39 | 38.92 |
| SSFT-32B-GFPO | 64.22 | 58.80 | 89.90 | 62.48 | 42.10 |
Cons@6 and Cons@32 (Parallel Reasoning)¶
| Method | AIME24 Cons@6 | AIME25 Cons@6 | AIME25 Cons@32 |
|---|---|---|---|
| s1.1-32B | 70.30 | 53.33 | 63.33 |
| SFT-mixed-32B | 73.94 | 70.00 | 76.67 |
| SSFT-32B | 75.45 | 73.94 | 86.67 |
| SSFT-32B-GFPO | 76.67 | 78.48 | 83.33 |
- SSFT outperforms SFT-mixed on Pass@1 by 5.83 points on AIME24 (64.06 vs. 58.23) and 6.17 points on AIME25 (58.13 vs. 51.96).
- Cons@32 reaches 86.67% on AIME25.
- Substantial improvements are also observed on the OOD code generation benchmark LCB.
Ablation: Optimal Matching vs. Random Matching¶
| Matching | AIME24 Pass@1 | AIME25 Cons@6 |
|---|---|---|
| Random matching | 61.77 | 67.58 |
| Optimal matching | 64.06 | 73.94 |
Highlights & Insights¶
- First application of DETR-style set loss to language modeling: The bipartite matching idea from object detection is elegantly transferred to the problem of reasoning diversity.
- Emergent behavior of global forking tokens: different `<think i>` tokens automatically learn distinct reasoning lengths and strategies without manual specification.
- Elegant resolution of mode collapse: standard SFT causes all control tokens to collapse to the same mode, whereas SSFT preserves diversity.
- Training data efficiency: Significant gains on the 32B model are achieved using only 1K questions, each with 4 reasoning trajectories.
- OOD generalization to code generation: The method transfers effectively to code tasks.
Limitations & Future Work¶
- Thorough validation is limited to the Qwen2.5 series; performance gains on other architectures such as Llama are smaller.
- The matching cost computation introduces additional overhead, though the paper reports its impact to be minimal.
- The diversity of reasoning trajectories depends on the quality of the teacher model.
- The behavior under larger values of \(N\) and \(M\) remains unexplored.
- The GFPO stage uses only a limited number of RL steps; more extensive RL training may yield further gains.
Related Work & Insights¶
| Method | Mechanism | Diversity Guarantee | Extra Annotation | Pass@1 |
|---|---|---|---|---|
| Self-consistency | Temperature sampling + voting | None | No | Low |
| Multiverse | Parallel CoT rewriting | Implicit | Teacher model | Medium |
| Standard SFT (multi-traj.) | Repeated training | Mode collapse | No | Medium |
| SSFT | Set matching loss | Explicit | Teacher trajectories | High |
The bipartite matching + language modeling paradigm has the potential to generalize to other tasks requiring diverse outputs (e.g., translation, multi-solution code generation). Global forking tokens can be viewed as learnable prompts or mode selectors, and bear a conceptual resemblance to mixture-of-experts, but operating at the level of reasoning paths.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — Introducing DETR-style set loss into language modeling; the global forking token concept is original.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple benchmarks and sufficient ablations, but limited model diversity.
- Writing Quality: ⭐⭐⭐⭐ — Clear formalization and intuitive figures.
- Value: ⭐⭐⭐⭐ — Provides a novel training paradigm for parallel test-time computation.