TreeRL: LLM Reinforcement Learning with On-Policy Tree Search¶

Conference: ACL 2025
arXiv: 2506.11902
Code: https://github.com/THUDM/TreeRL
Area: Reinforcement Learning
Keywords: Tree Search, Reinforcement Learning, Process Supervision, Entropy-Guided Branching, LLM Reasoning

TL;DR¶

TreeRL is proposed to directly integrate Entropy-Guided Parallel Tree search (EPTree) into on-policy reinforcement learning training for LLMs. By branching at tokens with high uncertainty, it expands the diversity of reasoning paths and utilizes global and local advantages derived from the tree structure as process supervision signals, surpassing traditional multi-chain sampling RL on mathematics and code reasoning tasks.

Background & Motivation¶

Background: Current reinforcement learning (RL) methods for LLM reasoning capabilities mainly adopt independent and identically distributed (i.i.d.) multi-chain sampling, independently generating \(K\) complete answers for each prompt, and performing reward feedback and policy updates based on the correctness of the final answer (e.g., GRPO, RLOO).

Limitations of Prior Work: - The answers generated by i.i.d. multi-chain sampling exhibit limited diversity, making it difficult to fully explore the reasoning space. - Outcome-level reward signals are too sparse, providing feedback only at the end of the response, and intermediate steps lack gradient signals. - Although Process Reward Models (PRMs) provide intermediate step supervision, offline-trained PRMs face distribution shift and reward hacking issues.

Key Challenge: Tree search has been proven effective in traditional domains like AlphaZero but remains underutilized in LLM RL. Existing attempts either use tree search solely during the inference phase to boost performance, or use tree search to generate offline training data (SFT/DPO), rather than directly using it in on-policy RL training. The step-by-step generation method of classic MCTS requires a large number of iterations, which is unfriendly and highly inefficient for modern inference engines like vLLM.

Goal - How to design an efficient tree search algorithm that generates more diverse correct answers than multi-chain sampling and MCTS under the same inference budget. - How to convert the structural information of the tree search into reliable on-policy process supervision signals for RL training.

Key Insight: The predictive entropy (uncertainty) of certain tokens during the generation process is exceptionally high—these locations are critical branching points for reasoning paths. Branching to generate new paths at these high-uncertainty tokens is both efficient and maximizes exploration diversity.

Core Idea: Use token-level entropy to guide tree search branching, combined with on-policy process supervision rewards derived from the tree structure, to directly integrate tree search into LLM reinforcement learning training.

Method¶

Overall Architecture¶

The TreeRL pipeline: Given a prompt → Use EPTree to generate a reasoning tree (containing multiple branches and leaf nodes) → Verify answer correctness for each leaf node → Calculate process supervision signals for each intermediate step based on the tree structure → Update the policy model using policy gradient. The entire process is on-policy: a new tree is generated in each RL iteration, and reward signals come directly from the current policy's generation results.

Key Designs¶

EPTree (Entropy-guided Parallel Tree search):
- Function: An efficient parallel tree search algorithm that generates more diverse answers than MCTS and multi-chain sampling under the same inference budget.
- Mechanism:
  - Initialization: Parallelly generate \(M\) independent chains as the initial tree.
  - Branching Point Selection: Compute the cross-entropy of each token \(H(\mathbf{y}_t) = -\log \pi_\theta(\mathbf{y}_t | \mathbf{x}, \mathbf{y}_{<t})\), and select the top-\(N\) tokens with the highest entropy across the entire tree as branching points (while masking out tokens near the end of the sequence).
  - Expansion: Continue generating \(T\) new branches from the prefix of each branching point until completion.
  - Repeat branching and expansion \(L\) times, ultimately obtaining \(M \times (T \times N \times L + 1)\) leaf nodes.
- Design Motivation: MCTS requires expanding many iterations step-by-step, which is inefficient; EPTree requires only about 2 iterations to build a diversified search tree. High-entropy tokens represent the positions where the model is most "hesitant" (e.g., mathematical operators, logical conjunctions, self-reflection words like "wait"), and branching from these positions maximizes path diversity.
- Key parameters are denoted as an \((M, N, L, T)\)-tree, e.g., \((6,2,1,2)\) denotes 6 parallel trees, selecting 2 branching points per round, 1 round of iteration, and expanding 2 branches at each branching point.
On-policy Process Supervision Reward:
- Function: Utilize the tree structure to compute reward signals for each intermediate reasoning step.
- Mechanism:
  - Value of each node \(s_n\): The ratio of correct answers among all its leaf descendants \(V(s_n) = \frac{1}{|L(s_n)|} \sum_{l \in L(s_n)} \mathbf{1}(l \text{ correct})\).
  - Global Advantage: \(G_A(s_n) = V(s_n) - V(\text{root})\), measuring the advantage of the step compared to the overall average correctness rate.
  - Local Advantage: \(L_A(s_n) = V(s_n) - V(p(s_n))\), measuring the improvement of the step relative to its parent node.
  - Final Reward: \(R(s_n) = (G_A(s_n) + L_A(s_n)) / \sqrt{|L(s_n)|}\).
- Design Motivation: Global advantage provides a macro evaluation of step quality, while local advantage captures the marginal contribution of each step. Dividing by \(\sqrt{|L(s_n)|}\) is because non-leaf nodes sharing prefixes will repeatedly appear in multiple sequences, necessitating down-weighting to prevent overfitting. This design can be viewed as a specialized form of Generalized Advantage Estimation (GAE).
- Key difference from PRMs: Completely on-policy, requiring no extra reward model training, and naturally robust against reward hacking.
Tree-to-Sequence Training:
- Function: Unroll the tree structure into multiple sequences for RL training.
- Mechanism: Each leaf node corresponds to a complete sequence from root to leaf, totaling \(M \times (N \times L \times T + 1)\) sequences within the tree. In each sequence, each token is associated with the process reward \(R(s_n)\) of its corresponding step, which is updated using policy gradient.
- Design Motivation: Compared to multi-chain sampling with 16 independent chains, the \((6,2,1,2)\) configuration of TreeRL produces around 30 sequences under the same inference budget, providing more training data (albeit with greater training computation).

Loss & Training¶

Use standard policy gradient, KL coefficient \(\beta = 10^{-4}\), learning rate \(1.5 \times 10^{-6}\).
Sampling temperature 1.2, top-p = 0.95, maximum sequence length 8192.
Rule-based reward: Correct answer +1, incorrect 0.
Training data is sourced from MATH-train and NuminaMath.

Key Experimental Results¶

Main Results¶

Model	MATH500	Omni-MATH-500	AIME2024	AMC	OlympiadBench	LiveCodeBench	Avg
Qwen-2.5-14B SFT	76.6	29.5	10.6	48.0	36.9	14.5	36.0
ChainRL (Qwen-14B)	81.6	32.7	22.2	53.9	41.1	18.2	41.6
TreeRL (Qwen-14B)	81.7	36.7	28.0	55.9	44.6	20.8	44.5
R1-Distilled-7B SFT	94.0	47.8	55.9	85.5	54.4	43.9	63.6
ChainRL (R1-7B)	93.6	48.1	59.7	85.5	54.5	46.1	64.5
TreeRL (R1-7B)	94.4	49.8	60.8	85.0	57.1	47.4	65.8

TreeRL outperforms ChainRL by \(2.9\%\) on average on Qwen-14B, and by \(1.3\%\) on R1-Distilled-7B.

Ablation Study¶

Process Reward Design	#Response	MATH500	Omni-MATH	AIME2024	Avg Gain
\((G_A + L_A)/\sqrt{n}\) (Full)	30	81.7	36.7	28.0	—
\(G_A + L_A\) (No down-weighting)	30	81.5	32.0	24.1	-1.9 ↓
\(G_A/\sqrt{n}\) (No local advantage)	30	80.1	35.1	24.7	-1.3 ↓
\((G_A + L_A)/\sqrt{n}\) (16 sequences)	16	80.1	32.5	24.5	-3.2 ↓

Branching Strategy (M,N,L,T)	Entropy-Guided	PassRate ↑	Token Count ↓
(6,2,1,2) EPTree	✓	56.9	22268
(6,2,1,2) Random Branching	✗	54.8	24213
(16,0,0,0) Multi-chain Sampling	-	52.4	19858

Key Findings¶

EPTree consistently outperforms multi-chain sampling by about 3% in PassRate under the same inference budget, while generating approximately 2x more diverse answers.
The three components of process supervision are all indispensable: removing the down-weighting factor (\(1/\sqrt{n}\)) causes a drop of \(1.9\%\), removing local advantage causes a drop of \(1.3\%\), and reducing training sequences causes a drop of \(3.2\%\).
Entropy-guided branching achieves a higher PassRate than random branching with fewer tokens (56.9 vs 54.8), indicating that high-entropy positions are indeed critical branching points.
High-frequency branching tokens include mathematical operators like (, logical conjunctions, transitional words like "Since"/"But", and self-reflection words like "wait", which aligns with intuition.
TreeRL begins to outperform ChainRL after around 100 training steps, and the gap continues to widen.

Highlights & Insights¶

Entropy-guided branching is the most ingenious design: It avoids the inefficiency of step-by-step expansion in MCTS, requiring only 2 search iterations to build a tree. Utilizing token entropy to identify where the model is most "hesitant" for branching is both efficient and targeted, presenting a natural uncertainty-guided exploration strategy.
On-policy process supervision does not require training a PRM: It directly derives gradient signals from the correctness statistics of the tree structure, avoiding the distribution shift issues of offline PRMs. This method can be productively transferred to any RL scenarios with outcome rewards.
The combination of global and local advantages can be viewed as a specialized form of GAE, providing a theoretical foundation. Furthermore, the \(1/\sqrt{n}\) down-weighting is a noteworthy technical detail—overfitting caused by shared prefixes repeatedly appearing in multiple sequences is a unique challenge in tree-structured RL.

Limitations & Future Work¶

Current inference engines (e.g., vLLM) are not specifically optimized for tree search. EPTree still requires 2+ iterations, and its actual speed is about 1/2 of multi-chain sampling.
Experiments are only validated on mathematical and code reasoning, without extension to NLP generation, dialogue, or other tasks.
The reward design of \((G_A + L_A)/\sqrt{n}\) is relatively heuristic, and the optimal allocation of step importance weights \(\lambda_j\) has not been fully explored.
The RL gain is smaller on the R1-Distilled model, potentially because the training data is too simple for this model—matching training data difficulty is a critical challenge in RL.
Future work could explore combining EPTree with the UCB strategy of MCTS, or utilizing it jointly with reward models.

vs MCTS: MCTS requires a value network and a large number of iterative rollouts, making it highly inefficient in LLM scenarios. EPTree replaces UCB with entropy to select branching points, requiring only 2 iterations, making it better suited for GPU parallelization.
vs GRPO/RLOO: Traditional RL performs updates after multi-chain sampling using outcome rewards, resulting in sparse signals. TreeRL provides step-level on-policy signals, offering much higher information density.
vs AlphaMath Zero: AlphaMath uses MCTS to generate data for offline training (SFT/DPO), which suffers from distribution shift. TreeRL directly performs on-policy training, resulting in more reliable signals.

Rating¶

Novelty: ⭐⭐⭐⭐ The idea of combining tree search and RL has been proposed before, but the specific designs of EPTree's entropy-guided branching and on-policy process supervision exhibit solid innovation.
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple benchmarks, multiple base models, and detailed ablation studies, but lacks validation on non-reasoning tasks.
Writing Quality: ⭐⭐⭐⭐ Clear structure, abundant plots/tables, and standardized algorithm descriptions.
Value: ⭐⭐⭐⭐ Provides a practical tree-search solution for LLM RL training; the open-source code helps lower the entry barrier for reproduction.