Skip to content

LiTS: A Modular Framework for LLM Tree Search

Conference: ACL 2026
arXiv: 2603.00631
Code: https://github.com/xinzhel/lits-llm
Area: LLM Agent / Tree Search / Reasoning Framework
Keywords: LLM Tree Search, MCTS, BFS, Agent Framework, Tool Use

TL;DR

LiTS decomposes LLM tree search into Policy, Transition, RewardModel, and a unified data structure. Using a decorator registry, it allows the same search algorithms, components, and task logic to be composed and reused across mathematical reasoning, environment planning, and tool use, while highlighting that policy diversity in open text action spaces is a bottleneck for tree search.

Background & Motivation

Background: Methods such as Tree-of-Thoughts, RAP, ReST-MCTS, and LATS treat LLM reasoning as a search problem, exploring multiple reasoning trajectories via MCTS, BFS, or similar planning algorithms. Such approaches are highly attractive for complex mathematics, planning, and tool use.

Limitations of Prior Work: Existing implementations are often deeply coupled with specific tasks. Changing a task requires rewriting state structures, action generation, environment transitions, reward models, and evaluation logic. Furthermore, when comparing different search algorithms, it is difficult to ensure that domain components remain identical. Consequently, both algorithm researchers and domain experts perform significant repetitive engineering.

Key Challenge: Tree search requires a unified search interface, yet the states, actions, tools, environments, and reward forms of LLM tasks vary significantly. The framework must be abstract enough to support general algorithms like MCTS/BFS, while remaining flexible enough for users to inject domain-specific prompts, tools, and transitions.

Goal: The authors aim to propose a modular Python framework that allows domain experts to modify only task logic and algorithm researchers to modify only search algorithms, enabling components, algorithms, and task types to be combined orthogonally.

Key Insight: LiTS decomposes the LLM reasoning agent into three types of components: Policy generates actions, Transition executes actions and updates states, and RewardModel provides value signals for the search. All components communicate via universal structures such as Action, Step, State, and Node, and are composed through a registry and CLI.

Core Idea: To transform LLM tree search from "one monolithic implementation per paper" into a "registrable, replaceable, and reusable component grammar."

Method

LiTS is not a single algorithm but a framework. Its key lies in defining a unified grammar that allows different task types to be manipulated by tree search agents: in language-grounded reasoning, actions are text thoughts; in tool-use, actions are structured tool calls; in environment-grounded tasks, actions are environment commands. However, all of these implement the same interface.

Overall Architecture

The architecture is divided from bottom to top into data structures, components, prompts, agents, and run artifacts. Data structures define Action, Step, State, and Node; components define Policy, Transition, and RewardModel; PromptRegistry supports explicit parameters, task names, task types, and default fallbacks. Agents include chain agents and tree search agents. At runtime, all configs, checkpoints, terminal nodes, and logs are written to a single save_dir to support post-hoc evaluation.

The framework covers three task categories: Environment Grounded (e.g., BlocksWorld and Crosswords), Language Grounded (e.g., MATH500), and Tool Use (e.g., MapEval-SQL). Users extend the framework using decorators such as @register_transition, @register_dataset, @register_policy, @register_search, and @register_resource.

Key Designs

  1. Unified Data Structures Action -> Step -> State -> Node:

    • Function: Allows search algorithms to remain agnostic of low-level task details, operating only on unified node and trajectory interfaces.
    • Mechanism: Action is the atomic action generated by the Policy; Step encapsulates the action and its execution result; State accumulates Steps and provides a render method; Node contains search fields such as parent, children, reward, and visit count. Different tasks only need to implement corresponding subclasses, such as ToolUseAction, EnvAction, ThoughtStep, or SubQAStep.
    • Design Motivation: If search algorithms depend directly on task objects, they cannot be reused across tasks. The unified structure isolates the "search loop" from "task semantics."
  2. Decoupling of Policy / Transition / RewardModel Components:

    • Function: Breaks down candidate action generation, state transition, and path scoring in LLM agents into replaceable modules.
    • Mechanism: Policy generates actions based on the current state; Transition executes actions and returns a new state; RewardModel scores search nodes or actions. Chain methods require only Policy + Transition, while Tree methods additionally use RewardModel.
    • Design Motivation: The same task components can be reused between MCTS and BFS, and the same search algorithm can be tested for generalization on new task components.
  3. Decorator Registry and CLI-first Composition:

    • Function: Enables users to extend the framework with new tasks, components, or search algorithms without modifying the core package.
    • Mechanism: For example, Crosswords only requires registering its transition, prompt, and dataset, and can be used in the CLI via --dataset crosswords. MapEval-SQL returns tools and tool_context through dataset and resource registries. Custom BFS is injected via @register_search("bfs").
    • Design Motivation: The usability of a framework depends not only on the elegance of its abstraction but also on the brevity of the extension path for users. The decorator registry reduces the learning cost for domain experts.

Loss & Training

LiTS itself does not train models. The "training strategies" in the experiments primarily refer to search configurations and inference resource settings: environment-grounded and tool-use experiments utilize Claude 3.5 Sonnet via AWS Bedrock and report costs; language-grounded MATH500 uses self-deployed Llama3-8B or Llama3-8B-Instruct and reports wall-clock time. BlocksWorld MCTS uses 10 iterations, branching factor 3, and max depth 6; Crosswords MCTS uses 30 iterations and max depth 10; on MATH500, all tree search methods use 10 iterations, branching factor 3, and temperature 0.7-0.8.

Key Experimental Results

Main Results

The experimental goal of the paper is not to break SOTA records but to verify component reusability. Experiments in environment planning, tool calling, and mathematical reasoning demonstrate different extension paths.

Task Method Out Tok Cost / Time Call Count Acc
BlocksWorld (30 ex.) Chain 17K $1.48 N/A 26.7%
BlocksWorld (30 ex.) MCTS 488K $21.99 N/A 66.7%
Crosswords (30 ex.) Chain 2.5K $0.28 N/A 6.67% / 10.33%
Crosswords (30 ex.) MCTS 14K $2.42 N/A 0% / 22.67%
MapEval-SQL (10 ex.) ReAct 10.6K $0.57 62 40%
MATH500 (100 ex.) CoT 12.9K 0.6h 100 17%
MATH500 (100 ex.) RAP (MCTS) 4.47M 8.0h 3.6K 18%
MATH500 (100 ex.) ReST (MCTS) 2.24M 26.0h 4.0K 37%
MATH500 (100 ex.) ToT (BFS) 1.53M 14.7h 2.8K 39%

Ablation Study

The paper does not present traditional ablation studies but provides a critical failure analysis: in open action spaces like Crosswords, temperature escalation does not resolve action repetition, indicating that policy diversity, rather than reward quality, is the bottleneck for tree search.

Crosswords action diversity metrics Value
Unique states visited 16
Avg. policy calls per state 7.9
Duplicate rate (all) 81.1%
Duplicate rate (incorrect) 81.0%
Correct outputs 17.3%

Key Findings

  • In BlocksWorld, MCTS improved performance from 26.7% (Chain) to 66.7%, demonstrating that tree search benefits significantly from finite action spaces and reliable environment transitions.
  • In Crosswords, MCTS achieved 0% exact match but 22.67% partial match, with a duplicate rate as high as 81.1%. Even with an oracle reward, the search failed due to insufficient action diversity.
  • On MapEval-SQL, ReAct achieved 40% with 10 samples; MCTS attempted on 3 samples cost $18.40 (approx. $6.13/example vs $0.05/example for ReAct) and yielded 0% accuracy. The primary bottleneck was the self-preference bias of the LLM-as-judge reward model.
  • On MATH500, ToT-BFS and ReST-MCTS used the same ConcatPolicy, ConcatTransition, and GenerativePRM. BFS slightly outperformed MCTS (39% vs 37%) with significantly lower wall-clock time (14.7h vs 26.0h).
  • RAP used user-registered components but reached only 18% on MATH500, suggesting that component formulation may be more critical than the search algorithm itself.

Highlights & Insights

  • The core contribution of LiTS is engineering abstraction rather than a specific algorithm. It clearly delineates the reusable boundaries of tree search: algorithms care about Nodes and rewards, task logic cares about Action/Step/State, and tool calls care about BaseTool and the resource registry.
  • The framework makes "fair algorithm comparison" much easier. For example, ReST-MCTS and ToT-BFS can share identical built-in components while varying only the search algorithm, providing a more reliable comparison than independent codebases.
  • The discovery of mode collapse is insightful: in open text action spaces, the randomness of LLM sampling occurs at the token level rather than the action semantic level, so increasing temperature may still produce semantically redundant actions.
  • For tool-use agents, reward model quality is the actual bottleneck. The failure of MCTS on MapEval-SQL indicates that if the LLM-as-judge prefers verbose but incorrect SQL, tree search will spend more budget in the wrong direction.

Limitations & Future Work

  • Experiments are primarily demonstration-focused with small sample sizes: only 100 samples for MATH500, 30 for BlocksWorld/Crosswords, and 10 for MapEval-SQL.
  • Crosswords mode-collapse was demonstrated only in a single open action environment and requires systematic verification across more open text tasks and different decoding strategies.
  • Tool-use tree search is affected by LLM-as-judge reward bias; future work needs to calibrate verifiers or train task-specific PRMs.
  • Currently, built-in search algorithms are mainly MCTS and BFS; A* and beam search variants remain directions for future expansion.
  • Throughput remains an engineering challenge; scaling tree search requires concurrent and batched LLM calls.
  • BaseTool currently requires users to write and register Python classes; the authors plan to introduce MCP to allow external tool servers to connect via standard JSON-RPC.
  • vs LLM Reasoners: LLM Reasoners also support tree search, but task logic is more easily coupled with configuration classes; LiTS emphasizes component sharing and registry-based extension.
  • vs LangGraph: LangGraph is suitable for agent graph orchestration but lacks native tree search algorithms; LiTS provides pre-implemented MCTS/BFS and allows components to be reused across tasks.
  • vs Tree-of-Thoughts / RAP / ReST-MCTS: These are specific reasoning methods; LiTS functions as a unified experimental platform where their structures can be decomposed into registrable components for reuse.
  • Insights for Future Work: When developing new tree search algorithms, researchers should report component formulation and reward quality simultaneously; the algorithm itself is not the only variable, and the action diversity of the Policy may determine the upper bound.

Rating

  • Novelty: ⭐⭐⭐⭐ The framework abstraction is not a brand-new concept, but the component boundaries and registry design are highly practical.
  • Experimental Thoroughness: ⭐⭐⭐ Demonstrations cover three task types, but sample sizes and SOTA comparisons are limited.
  • Writing Quality: ⭐⭐⭐⭐ Architecture, extension examples, and failure analyses are all clearly presented.
  • Value: ⭐⭐⭐⭐ Directly provides engineering value to LLM agent/tree search researchers and tool developers.