GAMEBoT: Transparent Assessment of LLM Reasoning in Games

Conference: ACL 2025
arXiv: 2412.13602
Code: https://github.com/Visual-AI/GAMEBoT
Area: LLM/NLP
Keywords: LLM Evaluation, Game Reasoning, Chain-of-Thought, Intermediate Reasoning Verification, Data Contamination

TL;DR

This paper proposes GAMEBoT, a game-based LLM reasoning evaluation platform. By decomposing complex in-game reasoning into predefined modular subproblems combined with rule-based ground-truth verification, GAMEBoT achieves transparent reasoning capability assessment across 17 mainstream LLMs.

Background & Motivation

Background: As LLMs increasingly undertake tasks requiring complex reasoning in practical applications, reliable evaluation of their reasoning capacities has become essential. Existing benchmarks, such as GSM8K, MATH, and MMLU, evaluate reasoning from various angles, but mostly exist as static datasets.

Limitations of Prior Work: Current LLM reasoning evaluation faces three core challenges: (1) Lack of interpretability: most benchmarks only evaluate the correctness of the final answer, failing to determine whether the model's intermediate reasoning process is sound (which can lead to "getting the right answer for the wrong reasons"); (2) Performance saturation: the performance gap among SOTA models on popular benchmarks like MMLU and GSM8K is shrinking, reducing their discriminative power; (3) Data contamination: static benchmarks are prone to being leaked into the training corpora of models, leading to unreliable evaluation results.

Key Challenge: An ideal reasoning evaluation framework must simultaneously be "interpretable" (capable of verifying intermediate steps), "sufficiently challenging" (capable of distinguishing top-tier models), and "contamination-resistant" (dynamically generated to prevent leaks). However, existing methods struggle to balance all three.

Goal: Design an LLM reasoning evaluation framework that simultaneously provides interpretability, sufficient difficulty, and robustness against data contamination.

Key Insight: Games serve as natural environments for evaluating reasoning. Game rules define clear reasoning structures, game states can be formally represented, and each decision step involves multiple sub-reasoning processes (e.g., board understanding, strategic planning, opponent modeling). Additionally, games offer inherent dynamism, with each playthrough being unique.

Core Idea: Decompose complex reasoning in games into predefined modular subproblems. For each subproblem, chain-of-thought (CoT) prompts and rule-based ground-truth generation algorithms are designed, enabling both the assessment of final decision quality and the verification of intermediate reasoning steps.

Method

Overall Architecture

The evaluation pipeline of GAMEBoT consists of: choosing a game \(\rightarrow\) initializing a match \(\rightarrow\) prompting the LLM to solve a series of subproblems in each turn \(\rightarrow\) making action decisions \(\rightarrow\) executing actions with the game engine and updating states \(\rightarrow\) logging intermediate reasoning and final actions \(\rightarrow\) calculating multi-dimensional scores after the match. It supports LLM-vs-LLM adversarial modes to further reduce the risk of data contamination.
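
The turn loop described above can be sketched in a few lines. This is a minimal illustration, not the GAMEBoT implementation: it uses TicTacToe as the game, and `stub_agent`, `play_match`, and the rule that an illegal move forfeits the match are all hypothetical names and assumptions introduced here. A real run would replace `stub_agent` with a prompted LLM call.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if a line is complete, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


def legal_moves(board):
    """Rule-based ground truth for the 'legal move enumeration' subproblem."""
    return [i for i, cell in enumerate(board) if cell == "."]


def stub_agent(board, player, rng):
    """Hypothetical stand-in for an LLM: intermediate answer plus an action."""
    moves = legal_moves(board)
    return {"legal_moves": moves, "action": rng.choice(moves)}


def play_match(agent_x, agent_o, seed=0):
    """Run one match and log every turn's reply next to the ground truth."""
    rng = random.Random(seed)
    board = ["."] * 9
    log = []
    player = "X"
    while winner(board) is None and legal_moves(board):
        agent = agent_x if player == "X" else agent_o
        truth = legal_moves(board)
        reply = agent(list(board), player, rng)
        log.append({"player": player, "reply": reply, "truth": truth})
        if reply["action"] not in truth:
            # Assumption: an illegal action forfeits the match.
            return ("O" if player == "X" else "X"), log
        board[reply["action"]] = player  # the engine applies the action
        player = "O" if player == "X" else "X"
    return winner(board), log


def enumeration_accuracy(log):
    """Fraction of turns whose enumerated moves match the ground truth."""
    correct = sum(e["reply"]["legal_moves"] == e["truth"] for e in log)
    return correct / len(log)
```

Calling `play_match(stub_agent, stub_agent)` returns the winner (or `None` for a draw) together with the per-turn log, from which post-match scores such as `enumeration_accuracy(log)` can be computed.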

Key Designs

Modular Subproblem Decomposition:
- Function: Decomposes complex reasoning within games into verifiable, atomic reasoning steps.
- Mechanism: For each game environment, the decision-making process is analyzed and broken down into standardized subproblems. For example, in Othello, the steps include: (1) board state understanding (accurately identifying current piece distribution); (2) legal move enumeration (listing all playable positions); (3) position value evaluation (assessing the strategic value of each position); and (4) final move selection. Each subproblem has explicit input and output definitions to facilitate independent verification. Different games employ tailored subproblem decompositions based on their rules.
- Design Motivation: Evaluating only final actions fails to determine the correctness of the reasoning process, as a model might "guess" the optimal move despite possessing flawed intermediate reasoning. Decomposition pinpoints specific weaknesses in model reasoning.
Domain-Knowledge-Driven CoT Prompt Design:
- Function: Uses game-specific domain knowledge to guide LLMs through structured reasoning.
- Mechanism: A set of CoT prompts is designed for each game, guiding the LLM step by step through the subproblems. The prompts incorporate game domain knowledge (e.g., explaining the strategic value of corners in Othello, or supplying basic hand evaluation rules in Texas Hold'em). These CoT prompts not only help LLMs make better decisions but also make the model's reasoning process observable and analyzable. Even with detailed CoT guidance, most models still show low accuracy on intermediate steps (e.g., GPT-4o achieves only \(0.52\)), which indicates that GAMEBoT is sufficiently challenging.
- Design Motivation: CoT prompts serve a dual purpose: they assist the model in reasoning (providing optimal conditions) while acting as an evaluation tool (exposing intermediate steps for verification).
Rule-Based Ground Truth Generation:
- Function: Provides precise reference answers for intermediate reasoning steps.
- Mechanism: Rule-based algorithms are developed for each subproblem to automatically generate ground truths. For instance, legal move enumeration is computed by the game engine, board states are read directly from the game state, and position value evaluations are determined using predefined heuristics or search algorithms (e.g., minimax). Because these ground truths are derived from explicit rules rather than LLM judgments, verification of intermediate LLM outputs is highly reliable.
- Design Motivation: Relying on another LLM to evaluate correctness introduces bias and errors from the evaluator itself. Rule-based ground truths eliminate this issue, guaranteeing objective and replicable evaluations.
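
As an illustration of rule-based ground-truth generation, the sketch below enumerates legal Othello moves directly from the rules and compares a model's answer against them. It is a minimal example under stated assumptions: the board encoding (`"."`, `"B"`, `"W"` with zero-based `(row, col)` coordinates), the function names, and the exact-match/precision/recall scoring are hypothetical choices made here, not the paper's actual code or metric.

```python
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]


def initial_board():
    """Standard 8x8 Othello opening position."""
    board = [["." for _ in range(8)] for _ in range(8)]
    board[3][3], board[4][4] = "W", "W"
    board[3][4], board[4][3] = "B", "B"
    return board


def flips(board, row, col, player):
    """Return the opponent discs flipped if `player` plays at (row, col)."""
    if board[row][col] != ".":
        return []
    opponent = "W" if player == "B" else "B"
    captured = []
    for dr, dc in DIRECTIONS:
        line = []
        r, c = row + dr, col + dc
        # Walk over a contiguous run of opponent discs.
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == opponent:
            line.append((r, c))
            r, c = r + dr, c + dc
        # The run counts only if it is closed off by one of our own discs.
        if line and 0 <= r < 8 and 0 <= c < 8 and board[r][c] == player:
            captured.extend(line)
    return captured


def legal_moves(board, player):
    """Ground truth: every empty square that flips at least one disc."""
    return {(r, c) for r in range(8) for c in range(8)
            if flips(board, r, c, player)}


def score_enumeration(predicted, truth):
    """Compare a model's enumerated moves with the rule-based ground truth."""
    predicted, truth = set(predicted), set(truth)
    hits = len(predicted & truth)
    return {
        "exact": predicted == truth,
        "precision": hits / len(predicted) if predicted else 0.0,
        "recall": hits / len(truth) if truth else 0.0,
    }
```

For the opening position, `legal_moves(initial_board(), "B")` yields the four squares `(2, 3)`, `(3, 2)`, `(4, 5)`, and `(5, 4)`. A model answer that lists one wrong square in place of a correct one fails the exact-match check while still earning partial precision and recall, which is the kind of fine-grained signal that final-action scoring alone cannot provide.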

Loss & Training

Since GAMEBoT is an evaluation framework rather than a training method, it does not involve loss functions or training. The evaluation metrics consist of two dimensions: (1) final action quality, measured via win rates or cumulative scores; and (2) intermediate reasoning accuracy, representing correctness rates for each subproblem. The final score is a weighted average of both intermediate subproblems and final actions.
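
The weighted combination can be written out as follows. This is only a sketch: the summary does not state the weights or how subproblem accuracies are aggregated, so the equal default weight, the unweighted mean over subproblems, and the name `combined_score` are assumptions made for illustration. It also assumes both inputs are already on comparable scales.

```python
def combined_score(action_score, subproblem_accuracy, action_weight=0.5):
    """Weighted average of final-action quality and intermediate accuracy.

    action_score: final action quality (e.g., a win-rate-based score).
    subproblem_accuracy: dict mapping subproblem name -> accuracy in [0, 1].
    action_weight: assumed weight on the action score (not from the paper).
    """
    if not 0.0 <= action_weight <= 1.0:
        raise ValueError("action_weight must lie in [0, 1]")
    if not subproblem_accuracy:
        raise ValueError("at least one subproblem accuracy is required")
    # Assumption: subproblems contribute equally to the intermediate score.
    intermediate = sum(subproblem_accuracy.values()) / len(subproblem_accuracy)
    return action_weight * action_score + (1.0 - action_weight) * intermediate
```

With illustrative inputs such as `combined_score(0.4, {"board": 0.8, "moves": 0.6})`, the intermediate term is the mean of the two accuracies and the result is the midpoint between that mean and the action score. Setting `action_weight=1.0` recovers an outcome-only metric.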

Key Experimental Results

Main Results

Comprehensive ranking of 17 LLMs across 8 games, sorted by the Comprehensive Score column (excerpt: ranks 1 to 6 and rank 17, with five of the eight games shown):

Rank	Model	Comprehensive Score	Othello	Pong	Connect4	Checkers	TicTacToe
1	GPT-4o	0.470	0.395	0.685	0.525	0.270	0.475
2	Claude-3.5-Sonnet	0.390	0.280	0.545	0.620	0.050	0.395
3	GPT-4	0.355	0.135	0.475	0.545	0.090	0.405
4	Llama-3.1-405B	0.305	0.215	0.640	0.520	0.000	0.325
5	Llama-3.1-70B	0.250	0.135	0.575	0.300	0.050	0.495
6	GPT-4o-mini	0.205	-0.175	0.430	0.335	-0.015	0.170
17	Jamba-1.5-mini	-0.100	0.065	0.070	-0.145	-0.250	-0.115

Intermediate reasoning accuracy (selected models):

Model	Board Understanding	Legal Move Enumeration	Strategy Evaluation	Overall Intermediate Reasoning
GPT-4o	0.71	0.58	0.39	0.52
Claude-3.5-Sonnet	0.65	0.52	0.35	0.47
GPT-4	0.62	0.48	0.32	0.44
Llama-3.1-405B	0.55	0.43	0.28	0.39

Ablation Study

Impact of CoT prompts (GPT-4o, Othello):

Configuration	Decision Score	Intermediate Reasoning Accuracy	Description
Full CoT	0.395	0.52	CoT guidance provided for all subproblems
No CoT	0.215	N/A	No reasoning guidance provided; drops by \(0.18\)
Partial CoT (Board Understanding only)	0.285	0.71*	Guided board understanding only, free reasoning elsewhere
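Partial CoT (Strategy Evaluation only)	0.310	0.39*	Guided strategy evaluation only
* Presumably the accuracy on the guided subproblem only; the values match GPT-4o's per-subproblem accuracies in the intermediate reasoning table.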

Key Findings

Low reasoning accuracy even for SOTA models: GPT-4o reaches an intermediate reasoning accuracy of only \(0.52\) even with elaborate CoT guidance, which shows that GAMEBoT is far from saturated and remains challenging for top-tier models.
Gap between board understanding and strategic reasoning: Every model in the intermediate reasoning table scores markedly higher on board understanding than on strategy evaluation (e.g., \(0.71\) vs. \(0.39\) for GPT-4o), indicating a clear gap between perception and reasoning capabilities.
Checkers as the most difficult game: Almost all models score lowest on Checkers (several get negative scores) as it requires multi-step planning and complex positional evaluations.
Model scale is not the sole determinant: Llama-3.1-70B outperforms the larger Llama-3.1-405B on TicTacToe (\(0.495\) vs. \(0.325\)), implying that reasoning capacity is not entirely determined by parameter count.
CoT prompts yield consistent improvements, but gains vary by game: environments with more complex rules and deeper reasoning requirements benefit the most from CoT.
According to the latest updates on the project website, GPT-5 shows a dominant advantage in Connect4 and Checkers.

Highlights & Insights

Verifiability of intermediate reasoning steps: This is the most crucial innovation in GAMEBoT. By decomposing complex reasoning into verifiable subproblems and utilizing rule-based algorithms to generate ground truths, the framework evaluates the process of reasoning rather than just the final outcome. This paradigm can be extended to alternative domains where reasoning can be formally decomposed.
Inherent resistance to data contamination: The combination of dynamic gameplay and LLM-vs-LLM match-ups makes game states virtually non-repetitive, which largely removes the dataset leakage problem. This is a more robust solution than simply hiding test sets.
Multidimensional reasoning capabilities across 8 games: Ranging from simple TicTacToe to complex Texas Hold'em, the framework spans spatial reasoning, mathematical reasoning, risk management, and opponent modeling, generating a comprehensive multi-dimensional footprint of reasoning capabilities instead of a single scalar score.

Limitations & Future Work

Game environments are simplified worlds. Whether in-game reasoning capabilities transfer directly to real-world reasoning tasks (e.g., scientific deduction or legal reasoning) remains unproven.
Subproblem decomposition and CoT prompt design require domain expertise; extending the framework to new games entails manual effort.
Rule-based ground truths might not be perfectly precise for certain games (e.g., optimal strategies in Texas Hold'em are still open research questions).
Evaluation costs are high, requiring numerous game playouts and consuming substantial API tokens.
Currently, only text-based game-state descriptions are supported; future extensions could incorporate visual inputs (e.g., understanding game states via screenshots).
An evaluation dimension for "learning ability" could be introduced to evaluate whether models improve their strategies via in-context learning over multiple game iterations.

vs LMSYS Chatbot Arena: Chatbot Arena evaluates conversational quality using human preferences, whereas GAMEBoT uses automated, rule-based game environments to evaluate reasoning capabilities. They are complementary: Arena evaluates the capability to "generate good responses," while GAMEBoT evaluates "correct reasoning."
vs GSM8K/MATH: These benchmarks assess mathematical reasoning using static datasets, which are susceptible to data contamination. GAMEBoT circumvents this through dynamic gameplay, though it evaluates strategic and procedural reasoning rather than mathematical deduction.
vs AgentBench: AgentBench evaluates LLM agent capabilities across various tasks, including game environments. In contrast, GAMEBoT focuses specifically on strategic games, and its distinctive advantage is the verification of intermediate reasoning steps.

Rating

Novelty: ⭐⭐⭐⭐ The paradigm of reasoning decomposition paired with rule-based ground-truth verification is highly novel, and the 8-game selection offers comprehensive multi-dimensional coverage.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Conducts large-scale evaluations on 17 models across 8 games, presents in-depth CoT ablation analyses, and continues to update results for new models on the project website.
Writing Quality: ⭐⭐⭐⭐ Well-structured paper with logical arguments for game selection and reasoning decomposition; project page is exceptionally polished.
Value: ⭐⭐⭐⭐ Provides a vital complementary perspective on LLM reasoning evaluation, with long-term utility sustained by continuous project updates.