TSLM: Tree-Structured Language Modeling for Divergent Thinking¶

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=PV5Dy4lW3t
Area: LLM Reasoning
Keywords: Tree-Structured Language Modeling, Divergent Thinking, Search Tree Supervision, Test-time Scaling, Systematic Exploration

TL;DR¶

TSLM linearizes a complete search tree (including successful paths and failure dead ends) using several special tokens. This allows a standard autoregressive language model to natively produce multi-branch exploration structures in a single generation, thereby internalizing systematic search capabilities via supervised learning. It achieves a 100% pass@1 on Game of 24 (baseline 17%), 91.5% on larger Gridworld OOD tasks far exceeding Tree-of-Thought's 42.7%, and demonstrates significantly faster inference speeds.

Background & Motivation¶

Background: Current language models generate tokens sequentially according to $p(y\mid x)=\prod_t p(y_t\mid x,y_{<t})$, where inference follows a single linear trajectory. To perform "multi-path exploration," mainstream approaches either rely on lengthening the reasoning chain (as in o1 / DeepSeek-R1) or using external search scaffolds like Tree-of-Thought (ToT)—sampling multiple candidates at each step and performing post-hoc coordination with external algorithms (e.g., beam search).

Limitations of Prior Work: Sequential generation has four inherent flaws: linear commitment to a single path, error propagation along the chain, redundant computation when multiple solutions are needed, and the inability to explore alternatives in a parallel and systematic manner. External search methods like ToT involve independent sampling of $k$ trajectories, where the shared prefix must be recomputed for every path, leading to exponential computational growth with depth. Worse, these samples lack coordination, resulting in redundant overlaps in some regions while missing critical branches.

Key Challenge: The model itself lacks any mechanism to coherently reconstruct the entire search space. Sequential models collapse all reasoning into a single path, while external search fragments exploration into mutually unaware parallel samplings, yielding a "fragmented" distribution rather than a structurally complete tree.

Goal: To enable a standard Transformer to output a complete branching structure in a single generation, moving "systematic exploration" from an external scaffold into the model parameters.

Key Insight: The authors observe that as long as a search tree can be losslessly linearized into a token sequence, it can be trained using standard language modeling loss. The model then learns not just the answer, but the complete exploration process—including which branches to expand at each node and which paths are dead ends.

Core Idea: Encode the search tree into a linear sequence using special tokens ([SEP], [FAIL], [GOAL]), allowing the sequential language model to internalize tree-based exploration through supervised learning. During inference, branches are "stitched" back into independent contexts for parallel expansion.

Method¶

Overall Architecture¶

TSLM addresses how to make a sequentially generating Transformer produce a branching search tree. The approach first linearizes the search tree (generated by rules for structured tasks or synthesized via bootstrap algorithms for open-domain tasks) into a linear sequence using special tokens. It is then trained with a standard language modeling loss—but with training signals constrained by the tree topology (each node only attends to its ancestors and left-side siblings, ignoring irrelevant subtrees) to learn "contextual decoupling." During inference, the model generates the branching structure at once, parses special tokens to identify feasible branches, and stitches them into independent contexts for parallel expansion via BFS until a [GOAL] is hit or the search is exhausted.

The pipeline is as follows:

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Input Task<br/>(State + Goal)"] --> B["Search Tree Supervision Bootstrap<br/>Rule-based Generation or ToT Path Synthesis"]
    B --> C["Token Serialization<br/>[SEP]/[FAIL]/[GOAL] Branch Encoding"]
    C --> D["Context-Decoupled Tree Training<br/>Standard LM Loss + Topological Constraints"]
    D --> E["Coherent Tree Inference<br/>Tree Generation → Parsing → Stitching BFS Expansion"]
    E -->|Hit [GOAL]| F["Output Solution"]
    E -->|Exhausted| G["Determine Unsolvable"]

Key Designs¶

1. Search Tree Bootstrap: Synthesizing trees for open-domain tasks

Structured tasks (Game of 24, Gridworld) have clear transition functions $T(s,a)$ and finite action spaces $A(s)$, allowing for the direct enumeration of complete search trees for TSLM to imitate. However, most real-world tasks (ProntoQA, GSM8K) only provide correct answers or gold trajectories without explicit tree structures. The authors use a simple bootstrap algorithm (Algorithm 1) to synthesize training trees: for each instance, a supervised language model samples $k$ candidate actions via beam search at each state. Known gold trajectories are integrated as high-priority branches to ensure each tree contains at least one valid solution. Remaining branches are ranked using the RAP reward function $R(s,a)$ to prioritize promising actions, and redundant paths are de-duplicated while preserving the tree structure. The resulting pseudo-trees contain both successful paths and dead ends explored by the model, providing the "complete exploration process" supervision signal TSLM requires.

2. Token Serialization: Fitting tree topology into linear sequences

This is the core representation of TSLM. The authors introduce structural tokens to allow a standard Transformer to read and write trees in a linear sequence: [SEP] marks an expandable feasible action; [FAIL] marks a dead end (infeasible action); [GOAL] marks the target state; and [BOS]/[EOS] mark the boundaries of each node's child sequence. Serialization uses depth-first search (DFS) to flatten the tree. The token sequence generated for each node $s_i$ is $y_{s_i}=[a_i, m_i]$, where $a_i$ is the action description and $m_i\in\{[SEP],[FAIL],[GOAL]\}$ is the structural marker. Crucially, this serialization records both successful paths and failed explorations, teaching the model the full search process rather than just the final answer—a fundamental difference from sequential baselines (SC/PC) that only clone a single gold CoT.

3. Context-Decoupled Tree Training Objective: Selective attention per node

Although the loss takes the form of standard cross-entropy $L=-\sum_t \log p(y_t\mid y_{<t},x)$, it carries a "non-standard" training signal. In sequential modeling, each token depends on its entire prefix; in TSLM, tokens depend on the tree structure. Formally, the context of node $s_i$ is defined as: $$\text{ctx}(s_i)=\bigcup_{s_j\in\pi(s_i)} y_{s_j}\ \cup\ \bigcup_{s_k\in\text{siblings}(s_i),\,k<i} y_{s_k}$$ meaning it only includes its ancestor path $\pi(s_i)$ and already generated left siblings ($k<i$ to enforce ordering within branching factors), and excludes irrelevant subtrees. The training objective becomes $L=-\frac{1}{|N|}\sum_{s_i\in N}\sum_t \log p(y_{s_i,t}\mid y_{s_i,<t},\text{ctx}(s_i))$. This "selective conditioning" is what distinguishes TSLM from simply concatenating exploration paths: although the input sequence is linear, the conditional dependencies are determined by tree topology. Consequently, markers like [SEP]/[FAIL] are trained to depend on whether an action is expandable and distinct from its siblings, teaching the model to generate multiple diverse actions at each decision point.

4. Coherent Tree Inference and Test-time Scaling: Exploring k branches within one tree

During inference, TSLM reconstructs the tree structure: it first generates the next step with multiple candidate actions, parses structural tokens to find feasible branches ([SEP]), forks different branches into independent sequences by "stitching" them into their respective contexts for parallel expansion, and recursively expands feasible states until [GOAL] or exhaustion. It defaults to Breadth-First Search (BFS). This introduces a new test-time scaling paradigm: traditional methods treat the model as a fixed sampler, where increasing $k$ means sampling $k$ independent and redundant trajectories from the same distribution. TSLM, however, takes the top $k$ terminal states for verification from a single tree construction. Because shared prefixes are computed only once and irrelevant subtrees are ignored during inference, TSLM is much faster than ToT (which lacks computation sharing) and faster than Procedure Cloning (which must process the entire linearized search trajectory). Furthermore, the explicit tree structure makes the search strategy controllable—one can choose BFS (for solution optimality) or DFS (for model confidence), or even adaptively expand high-reward branches more aggressively, which is impossible with independent sampling.

Loss & Training¶

Training involves applying standard language modeling cross-entropy loss (formula above) to the entire serialized sequence. The loss covers both reasoning content tokens and structural tokens. The base model defaults to Llama-3-8B, with less than 10K instances used for post-training per task (intentionally controlling data volume to compare architectural differences rather than scaling effects). During inference, the top $k=5$ generated actions are selected for forking, with exact matching used for de-duplication.

Key Experimental Results¶

Main Results¶

TSLM is evaluated using pass@1 (single attempt), while ToT uses pass@100 (success if any of the 100 terminal nodes is correct, representing the convergence upper bound for the sampling scaffold). Despite this, TSLM matches or exceeds ToT in most tasks.

Task	SC (pass@1)	PC (pass@1)	GRPO (pass@1)	TSLM (pass@1)	ToT (pass@100)
Game of 24	17.0%	47.0%	15.0%	100%	32.0%
Gridworld (i.d, 10×10)	78.2%	99.7%	24.0%	100%	95.0%
Gridworld (o.o.d, 20×20)	33.0%	81.1%	6.0%	91.5%	42.7%
ProntoQA	99.7%	97.5%	99.8%	100%	100%
GSM8K	55.8%	55.9%	60.8%	61.6%	62.3%

The most striking result is the Gridworld OOD performance: while ToT achieves 95.0% in-distribution (10×10), it collapses to 42.7% at 20×20. TSLM only drops from 100% to 91.5%, demonstrating that internalized search processes generalize better than external search algorithms. On GSM8K, ToT is slightly higher (62.3% vs 61.6%), but for tasks requiring systematic exploration like Game of 24, ToT degrades to the level of purely sequential methods (17%).

Ablation Study¶

Configuration / Analysis	Key Metric	Description
TSLM, k=1 (GSM8K)	61.3%	Single candidate nearly matches ToT's convergence value of 62.3%.
TSLM, Saturated (GSM8K)	67.2%	Significantly exceeds ToT upper bound after expanding candidates.
BFS vs DFS (top-1)	61.3% / 63.1%	DFS follows high-confidence actions (higher top-1); BFS provides wider coverage.
Training k=5 → Inference k=10	67.2% → 71.1%	Exploration capability generalizes to larger branching factors.
Unsolvability Detection (Game of 24, 100 cases)	97/100	TSLM correctly terminates; SC/PC/GRPO almost all hallucinate invalid solutions.

Key Findings¶

Internalized Search > External Scaffolds: TSLM's pass@1 outperforms ToT's pass@100 in most cases and remains robust OOD (larger grids, larger branching factors), proving that learning search into parameters generalizes better than external scaffolding.
Emergent Unsolvability Detection: In 100 unsolvable Game of 24 instances, TSLM correctly refused to answer 97 times by "exhausting the tree." Sequential baselines, trained to always provide an answer, almost all hallucinated. This links "systematic exploration" directly to "reduced hallucination."
Efficient Scaling Paradigm: TSLM explores $k$ branches within a single tree by sharing prefixes, resulting in wall-clock times far lower than ToT's independent sampling. Scaling becomes "smarter exploration" rather than "more sampling."

Highlights & Insights¶

Modeling the Search Tree as Language: The core trick involves using 5 special tokens + DFS serialization to losslessly compress a "tree" into a "linear sequence." This allows pure supervised learning to teach the model branch exploration without architectural changes, RL, or external coordination—a clever "representation over framework" shift.
Contextual Decoupling as the Hidden Key: Although the sequence is linear, conditional dependencies follow the tree topology (ancestors + left siblings). This allows the model to "ignore irrelevant subtrees" during inference, serving as both a source of efficiency and generalization.
Utility of Failure Paths: Exposing the model to [FAIL] dead ends teaches it to detemine unsolvability, transforming the hallucination problem into a search completeness problem—a signal missing from sequential CoT training.

Limitations & Future Work¶

Author's Admission: The supervision for open-domain tasks relies on ToT sampling + RAP rewards, which is not the optimal source for TSLM. Using stronger reasoning models (e.g., ReJump) or self-training (e.g., SoS) could produce better trees.
Narrow Task Scope: The primary benchmarks are tasks with clear transitions/action spaces or enumerability (Game of 24, Gridworld, GSM8K). Defining and synthesizing search trees for truly open-ended natural language tasks remains an open problem.
Inference Forking Overhead: While tree construction is coherent, forking branches into independent sequences still requires multiple calls, and structural tokens lengthen the sequence. Overhead at extreme branching factors or depths requires further verification.
Comparison Consistency: Pass@100 for ToT vs pass@1 for TSLM is a "conservative estimate" by the authors, but comparisons across different task difficulties and candidate budgets should be interpreted with caution.

vs Tree-of-Thought (ToT): ToT samples candidates independently and coordinates via external beam search, causing path incoordination, redundant prefix computation, and poor OOD generalization. TSLM coherently generates the tree in one process, shares prefixes, ignores irrelevant subtrees, and is significantly more robust and faster.
vs o1 / DeepSeek-R1 (RL-based Reasoning): These rely on lengthening sequential reasoning chains + RL (e.g., GRPO) to incentivize reasoning, but remain sequential generators unable to explicitly decouple parallel branches. TSLM internalizes branching via supervised learning, challenging the assumption that robust reasoning requires RL or inference-time search.
vs Sequence Cloning / Procedure Cloning: SC clones a single gold path; PC clones the entire search trajectory linearly but still treats it as a single sequence. TSLM explicitly preserves tree topology, allowing it to learn from failure paths and skip irrelevant subtrees, outperforming both in performance and efficiency.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ A clean and rare framework-level innovation using token serialization to turn search trees into a supervised language modeling task.
Experimental Thoroughness: ⭐⭐⭐⭐ Covers structured/open-domain tasks, OOD, test-time scaling, and unsolvability, though tasks remain somewhat controlled.
Writing Quality: ⭐⭐⭐⭐ Clear motivation and formalization with intuitive diagrams; some small discrepancies in numerical reporting between the conclusion and main text.
Value: ⭐⭐⭐⭐⭐ Provides a supervised learning alternative for "test-time scaling" without relying on RL or external search, offering high transferability.