Training Large Language Models to Reason in Parallel with Global Forking Tokens

Conference: ICLR 2026 · arXiv: 2510.05132 · Code: GitHub · Area: Code Intelligence · Keywords: parallel reasoning, global forking token, set supervised fine-tuning, bipartite matching, diverse reasoning

TL;DR

This paper proposes Set Supervised Fine-Tuning (SSFT), which reserves a set of global forking tokens and trains them with a bipartite-matching set loss, so that each token reliably triggers a distinct, correct reasoning pattern; the resulting models outperform standard SFT + GRPO on both Pass@1 and Cons@k.

Background & Motivation

  • Test-time parallel sampling (e.g., self-consistency, Best-of-N) is an effective approach for improving LLM reasoning, but relies on generating diverse and correct reasoning paths.
  • For challenging problems, the "forking tokens" that trigger different reasoning modes are typically buried deep in the sampling tree; increasing temperature improves diversity but degrades accuracy.
  • Theoretical studies show that simply raising temperature cannot guarantee diversity unless the model is explicitly trained for coverage.
  • Standard SFT on multiple reasoning trajectories leads to mode collapse — different control tokens produce nearly identical reasoning.

Core Problem

How can an LLM be trained so that simple control tokens, placed at the start of generation, trigger diverse yet correct reasoning modes?

Method

Global Forking Tokens

  • A set of \(N\) special tokens \(\{\texttt{<think i>}\}_{i=1}^{N}\) is reserved (\(N = 6\) in experiments).
  • Objective: given a question \(\mathbf{x}\), the different <think i> tokens should each initiate one of \(M\) distinct reasoning trajectories (\(M = 4\) in experiments); a setup sketch follows this list.
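
A minimal sketch of reserving the forking tokens, assuming a Hugging Face-style tokenizer and model (the checkpoint name and exact token strings are illustrative; the paper does not publish this snippet):

```python
# Illustrative sketch (not the authors' code): register <think 1> ... <think N>
# as special tokens so the tokenizer never splits them, then grow the
# embedding matrix to cover the new token ids.
from transformers import AutoTokenizer, AutoModelForCausalLM

N = 6  # number of global forking tokens (the paper uses N = 6)
base = "Qwen/Qwen2.5-32B-Instruct"  # assumed backbone

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

forking_tokens = [f"<think {i}>" for i in range(1, N + 1)]
tokenizer.add_special_tokens({"additional_special_tokens": forking_tokens})
model.resize_token_embeddings(len(tokenizer))
```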

SSFT Loss Function

Parallel reasoning is formulated as a set prediction problem, solved via bipartite matching:

Step 1: Compute the cost matrix

For each pair (forking token \(g^{(i)}\), reasoning trajectory \(\mathbf{r}^{(j)}\)), compute the next-token-prediction (NTP) loss:

\[\mathcal{L}_{\text{matching}}(g^{(i)}, \mathbf{r}^{(j)}) = -\text{sg}\left(\frac{1}{T_\mathbf{r}} \sum_{t=1}^{T_\mathbf{r}} \log \pi_\theta(r_t^{(j)} \mid \mathbf{x}, g^{(i)}, \mathbf{r}_{<t}^{(j)})\right)\]

where \(\text{sg}(\cdot)\) denotes the stop-gradient operation, applied to reduce memory usage during the matching step.

Step 2: Solve for the optimal matching via the Hungarian algorithm

\[\hat{\boldsymbol{\sigma}} = \arg\min_{\boldsymbol{\sigma} \in \mathfrak{S}_P} \sum_{j=1}^M \mathcal{L}_{\text{matching}}(g^{(\boldsymbol{\sigma}(j))}, \mathbf{r}^{(j)})\]

Step 3: Train under the optimal matching

\[\mathcal{L}_{\text{Hungarian}}(\theta) = -\mathbb{E}_{\mathbf{x}, \mathbf{R} \sim \mathcal{D}}\left[\sum_{j=1}^M \sum_t \log \pi_\theta(r_t^{(j)} \mid \mathbf{x}, g^{(\hat{\sigma}(j))}, \mathbf{r}_{<t}^{(j)})\right]\]
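
Putting the three steps together, here is a minimal sketch of the set loss (illustrative, not the authors' implementation; it assumes a Hugging Face-style causal LM, processes one question at a time, and already folds in the truncated matching cost described under Computational Efficiency below):

```python
# Sketch of the SSFT Hungarian loss: stop-gradient cost matrix,
# optimal bipartite matching, then NTP loss on the matched pairs only.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def ssft_loss(model, x_ids, forking_ids, traj_ids, match_len=1000):
    """x_ids: 1-D question token tensor; forking_ids: 1-D tensor with the
    N forking-token ids; traj_ids: list of M 1-D trajectory tensors."""

    def ntp_loss(g, r, grad):
        # NTP loss of trajectory r conditioned on [question, forking token g].
        inp = torch.cat([x_ids, g.view(1), r])[:-1].unsqueeze(0)
        with torch.set_grad_enabled(grad):
            logits = model(inp).logits[0, -len(r):]  # positions predicting r
            return F.cross_entropy(logits, r)  # mean over t (the 1/T factor)

    # Step 1: cost matrix under stop-gradient, over all N x M pairs,
    # using only the first `match_len` tokens of each trajectory.
    cost = [[ntp_loss(g, r[:match_len], grad=False).item() for r in traj_ids]
            for g in forking_ids]

    # Step 2: Hungarian algorithm for the optimal token-trajectory assignment.
    rows, cols = linear_sum_assignment(cost)

    # Step 3: backpropagate the full NTP loss only through the M matched pairs.
    return sum(ntp_loss(forking_ids[i], traj_ids[j], grad=True)
               for i, j in zip(rows, cols))
```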

Computational Efficiency

  • Only the first \(L < T_\mathbf{r}\) tokens are used to compute the matching cost (with \(L \approx 1000\) in experiments).
  • Time complexity is reduced by a factor of \(k = T_\mathbf{r}/L\); backpropagation is performed only over the \(M\) matched sequences.

GFPO: Global Forking Policy Optimization

  • A small number of RL steps are applied after SSFT, optimizing only the output distribution of the global forking tokens \(g^{(i)}\).
  • Implementation is minimal: only a few slicing lines need to be added to existing GRPO code (a hedged sketch follows this list).
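
The paper does not spell out the slicing; the sketch below is a hedged guess at its core (illustrative names, not the authors' code, and omitting GRPO's clipped importance ratios for brevity): keep only the forking-token position in the policy-gradient term, so the RL update reshapes just the distribution over \(g^{(i)}\).

```python
# Hypothetical GFPO core: mask the per-token GRPO objective down to the
# single position where the global forking token is sampled.
import torch

def gfpo_loss(token_logprobs, advantages, fork_pos=0):
    """token_logprobs: [B, T] log-probs of the generated tokens;
    advantages: [B] group-normalized GRPO advantages;
    fork_pos: index of the forking token within the generation."""
    fork_logprob = token_logprobs[:, fork_pos]   # "the slicing line"
    return -(advantages * fork_logprob).mean()   # plain policy gradient

# Toy usage: 4 rollouts of 10 generated tokens each.
logp = torch.randn(4, 10, requires_grad=True)
adv = torch.tensor([1.0, -0.5, 0.2, -0.7])
gfpo_loss(logp, adv).backward()
```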

Inference Strategy

  • Cons@k: each <think i> token prompts the model separately, and the final answers are aggregated by majority voting (a minimal sketch follows this list).
  • Pass@1: A graph-based heuristic selects the \(g^{(i^\star)}\) that covers the most reasoning trajectories, or GFPO is used to let the model sample the optimal \(g^{(i)}\) autonomously.
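
For concreteness, a minimal sketch of the Cons@k loop; `generate` and `extract_answer` are hypothetical helpers standing in for model sampling and answer parsing:

```python
# Illustrative Cons@k: one generation per forking token, then majority vote.
from collections import Counter

def cons_at_k(generate, extract_answer, question, k=6):
    answers = []
    for i in range(1, k + 1):
        # Each <think i> token is meant to trigger a different reasoning mode.
        completion = generate(f"{question}\n<think {i}>")
        answers.append(extract_answer(completion))
    # Majority vote across the k parallel trajectories.
    return Counter(answers).most_common(1)[0][0]
```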

Key Experimental Results

Pass@1 Comparison (Qwen2.5-32B-Instruct backbone)

| Method | AIME24 | AIME25 | MATH-500 | GPQA-D | LCB(v5) |
|---|---|---|---|---|---|
| Qwen2.5-32B-Instruct | 15.80 | 10.40 | 80.40 | 47.00 | 23.35 |
| s1.1-32B (single-traj.) | 54.79 | 44.27 | 92.16 | 62.12 | — |
| SFT-mixed-32B (multi-traj.) | 58.23 | 51.96 | 88.49 | 59.96 | 32.34 |
| SSFT-32B | 64.06 | 58.13 | 90.02 | 60.39 | 38.92 |
| SSFT-32B-GFPO | 64.22 | 58.80 | 89.90 | 62.48 | 42.10 |

Cons@6 and Cons@32 (Parallel Reasoning)

| Method | AIME24 Cons@6 | AIME25 Cons@6 | AIME25 Cons@32 |
|---|---|---|---|
| s1.1-32B | 70.30 | 53.33 | 63.33 |
| SFT-mixed-32B | 73.94 | 70.00 | 76.67 |
| SSFT-32B | 75.45 | 73.94 | 86.67 |
| SSFT-32B-GFPO | 76.67 | 78.48 | 83.33 |

  • SSFT outperforms SFT-mixed on Pass@1 by 8.33% on AIME24 and 6.57% on AIME25.
  • Cons@32 reaches 86.67% on AIME25.
  • Substantial improvements are also observed on the OOD code generation benchmark LCB.

Ablation: Optimal Matching vs. Random Matching

| Matching | AIME24 Pass@1 | AIME25 Cons@6 |
|---|---|---|
| Random matching | 61.77 | 67.58 |
| Optimal matching | 64.06 | 73.94 |

Highlights & Insights

  1. First application of DETR-style set loss to language modeling: The bipartite matching idea from object detection is elegantly transferred to the problem of reasoning diversity.
  2. Emergent behavior of global forking tokens: Different <think i> tokens automatically learn distinct reasoning lengths and strategies without manual specification.
  3. Elegant resolution of mode collapse: Standard SFT causes all control tokens to collapse to the same mode, whereas SSFT preserves diversity.
  4. Training data efficiency: Significant gains on the 32B model are achieved using only 1K questions, each with 4 reasoning trajectories.
  5. OOD generalization to code generation: The method transfers effectively to code tasks.

Limitations & Future Work

  • Thorough validation is limited to the Qwen2.5 series; performance gains on other architectures such as Llama are smaller.
  • The matching cost computation introduces additional overhead, though the paper reports its impact to be minimal.
  • The diversity of reasoning trajectories depends on the quality of the teacher model.
  • The behavior under larger values of \(N\) and \(M\) remains unexplored.
  • The GFPO stage uses only a limited number of RL steps; more extensive RL training may yield further gains.

Comparison with Related Methods

| Method | Mechanism | Diversity Guarantee | Extra Annotation | Pass@1 |
|---|---|---|---|---|
| Self-consistency | Temperature sampling + voting | None | No | Low |
| Multiverse | Parallel CoT rewriting | Implicit | Teacher model | Medium |
| Standard SFT (multi-traj.) | Repeated training | Mode collapse | No | Medium |
| SSFT | Set matching loss | Explicit | Teacher trajectories | High |

The bipartite matching + language modeling paradigm could generalize to other tasks that require diverse outputs (e.g., translation, multi-solution code generation). Global forking tokens can be viewed as learnable prompts or mode selectors; they bear a conceptual resemblance to mixture-of-experts, but operate at the level of entire reasoning paths.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — Introducing DETR-style set loss into language modeling; the global forking token concept is original.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multiple benchmarks and sufficient ablations, but limited model diversity.
  • Writing Quality: ⭐⭐⭐⭐ — Clear formalization and intuitive figures.
  • Value: ⭐⭐⭐⭐ — Provides a novel training paradigm for parallel test-time computation.