# Can Large Language Models Master Complex Card Games?
Conference: NeurIPS 2025 | arXiv: 2509.01328 | Code: https://github.com/THUDM/LLM4CardGame | Area: LLM Evaluation | Keywords: LLM game capability, card games, supervised fine-tuning, multi-task learning, general capability retention
## TL;DR
This paper systematically evaluates the ability of LLMs to learn eight complex card games. It finds that with SFT on high-quality game trajectory data, LLMs can approach the performance of strong game AIs and master multiple games simultaneously; the accompanying degradation in general capabilities can be mitigated by mixing in general instruction data.
## Background & Motivation
Background: AlphaGo/AlphaZero/MuZero have achieved superhuman performance in perfect-information games such as Go and Chess via reinforcement learning. LLMs have demonstrated strong performance on knowledge QA, mathematics, and coding. This naturally raises the question: can LLMs achieve comparable mastery in complex games?
Limitations of Prior Work: (a) Existing LLM game evaluations mostly rely on prompt-based methods that assess transfer of existing knowledge rather than learning capacity; (b) fine-tuning evaluations involve games of insufficient complexity to probe the upper bound of LLM learning ability; (c) systematic evaluation of simultaneous multi-game learning and general capability retention is lacking.
Key Challenge: As general-purpose language models, can LLMs match specialized game AIs by learning from strategic game data? Do multiple games mutually reinforce or conflict with one another? Can game competence and general capability coexist?
Goal: Three research questions are investigated: (1) Can LLMs master complex card games, and how much data is required? (2) Can LLMs simultaneously master multiple games? (3) Does acquiring game competence degrade general capabilities?
Key Insight: Eight card games are selected — Dou Di Zhu, Guandan, Riichi Mahjong, UNO, Gin Rummy, and three variants of Texas Hold'em — spanning a wide range of complexity (information sets from \(10^3\) to \(10^{67}\)). High-quality trajectory data generated by strong game AIs are used for SFT.
Core Idea: Rather than having LLMs explore environments autonomously (prohibitively expensive), the paper leverages high-quality trajectories generated by strong AIs for supervised fine-tuning, systematically evaluating LLMs' game-learning capacity.
## Method
### Overall Architecture
The pipeline consists of three stages: (1) Data generation: strong game AIs (DouZero, DanZero, etc.) play to generate trajectories, which are filtered and converted into instruction-tuning format; (2) LoRA fine-tuning: SFT is applied to Qwen2.5, Llama3.1, and GLM4; (3) Evaluation: fine-tuned LLMs play against opponent AIs, and win rates/rewards are computed.
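To make stage (3) concrete, here is a hedged sketch of the evaluation loop, assuming an RLCard-style environment (RLCard implements several of these games, including Dou Di Zhu) and an arbitrary `generate_fn` wrapping the fine-tuned model; the prompt wording, seat assignment, and random fallback are illustrative, not the paper's exact harness.

```python
import json
import random

import rlcard  # RLCard implements several of the paper's games (Dou Di Zhu, UNO, Leduc, ...)

def llm_action(state, generate_fn):
    """Build a prompt from the current state, query the fine-tuned model via
    `generate_fn` (any callable: prompt str -> completion str), and parse the
    JSON action. Assumes an RLCard-style state dict with raw observations."""
    legal = state["raw_legal_actions"]
    prompt = (
        "You are playing Dou Di Zhu.\n"
        f"Observation: {state['raw_obs']}\n"
        f"Legal actions: {legal}\n"
        'Reply with JSON of the form {"action": <one legal action>}.'
    )
    try:
        action = json.loads(generate_fn(prompt))["action"]
        if action in legal:
            return action
    except (json.JSONDecodeError, KeyError, TypeError):
        pass
    return random.choice(legal)  # fall back when the reply is malformed

def evaluate(generate_fn, opponent_fn, episodes=100):
    """Stage-3-style evaluation: the LLM controls seat 0, `opponent_fn`
    (e.g. a rule-based bot) controls the rest; returns the LLM's win rate."""
    env = rlcard.make("doudizhu")
    wins = 0
    for _ in range(episodes):
        state, player_id = env.reset()
        while not env.is_over():
            act = llm_action(state, generate_fn) if player_id == 0 else opponent_fn(state)
            state, player_id = env.step(act, raw_action=True)
        wins += env.get_payoffs()[0] > 0
    return wins / episodes
```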
### Key Designs
- Game Selection and Complexity Analysis:
    - Eight games spanning from simple (Leduc Hold'em, 6 cards) to extremely complex (Guandan, \(10^{67}\) information sets).
    - Key complexity dimensions: number of information sets, average information set size, number of legal actions per step, and average decision steps per game.
    - Dou Di Zhu, Guandan, and Mahjong are high-complexity games (long decision chains, large action spaces); the remaining five are relatively simpler.
- High-Quality Trajectory Data Generation:
    - Existing strong game AIs serve as "teachers" that generate the training data, avoiding the enormous cost of having LLMs explore the environments themselves.
    - Teacher models: DouZero for Dou Di Zhu, DanZero for Guandan, expert data from the Tenhou platform for Mahjong, and rule-based models or DQN for the simpler games.
    - Data filtering: only decisions made by the winning side are retained, and only at decision points with more than one legal action (i.e., non-trivial choices); see the pipeline sketch after this list.
    - Data scale: 1 million samples each for Dou Di Zhu, Guandan, and Mahjong; 400K each for the simpler games.
- Instruction Tuning Format:
    - Each observation–action pair is converted into one instruction comprising a game introduction, the current state (hand cards, public cards, action history, legal actions), and an output-format specification.
    - Outputs are action selections in JSON format; the pipeline sketch after this list illustrates the conversion.
    - LoRA (rank=8, alpha=16) is used for fine-tuning: 1 epoch, lr=1e-4.
- Multi-Game Mixed Training:
    - Per-game data requirements estimated from the single-game experiments are used to construct a mixed dataset of 3.1 million samples.
    - More data is allocated to complex games (Guandan: 950K, Dou Di Zhu: 700K); simpler games receive less (Gin Rummy: 50K).
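The following minimal sketch illustrates the two filtering criteria and the instruction format described above. The trajectory record schema, field names, and prompt wording are assumptions for illustration, not the paper's exact template.

```python
import json

def filter_trajectory(trajectory, winners):
    """Keep only winner-side decision points that involve a genuine choice,
    i.e. more than one legal action (the paper's two filtering criteria).
    Each step is assumed to be a dict with player, state, legal_actions,
    and the teacher's chosen action."""
    return [step for step in trajectory
            if step["player"] in winners and len(step["legal_actions"]) > 1]

GAME_INTRO = "Dou Di Zhu is a three-player climbing card game ..."  # abbreviated

def to_instruction_sample(step):
    """Convert one observation-action pair into an SFT sample:
    game introduction + current state + legal actions + output-format spec,
    with the teacher's action as the JSON target."""
    state = step["state"]
    prompt = (
        f"{GAME_INTRO}\n"
        f"Your hand: {state['hand']}\n"
        f"Public cards / action history: {state['history']}\n"
        f"Legal actions: {step['legal_actions']}\n"
        'Reply with JSON of the form {"action": <one legal action>}.'
    )
    return {"instruction": prompt,
            "output": json.dumps({"action": step["action"]})}
```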
### Loss & Training
- Models: Qwen2.5-7B-Instruct, Llama3.1-8B-Instruct, GLM4-9B-Chat (and variants ranging from 0.5B to 14B).
- LoRA fine-tuning on 8×H100 GPUs; a minimal config sketch follows this list.
- General capabilities are evaluated using MMLU-Pro, Math-500, and HumanEval.
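A minimal training-config sketch under the reported hyperparameters (LoRA rank 8, alpha 16, one epoch, lr=1e-4) using Hugging Face peft/transformers; the target modules, batch size, precision, and dataset wiring are placeholders, not confirmed by the paper.

```python
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Reported LoRA hyperparameters: rank 8, alpha 16. The target modules are a
# common default for Qwen-style attention layers, not taken from the paper.
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

train_dataset = ...  # tokenized instruction samples from the pipeline sketch above

# One epoch at lr=1e-4, as reported; batch size and bf16 are placeholders.
args = TrainingArguments(
    output_dir="sft-out",
    num_train_epochs=1,
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    bf16=True,
)
Trainer(model=model, args=args, train_dataset=train_dataset).train()
```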
## Key Experimental Results
### Main Results: Single-Game Performance (Qwen2.5-7B)
| Game (metric) | Baseline LLM | SFT LLM | Teacher AI | Notes |
|---|---|---|---|---|
| Dou Di Zhu (win rate) | 0.087 | 0.806 | ~0.85 | Approaches DouZero |
| Guandan (round win rate) | 0.000 | 0.649 | ~0.71 | Approaches DanZero |
| Riichi Mahjong (reward) | 0.04 | 1.18 | 1.55 (Mortal) | Reaches strong level |
| UNO | 0.032 | 0.188 | 0.188 (rule-based) | Matches teacher |
### Multi-Game Mixed Training vs. API Models
| Model | Dou Di Zhu | Guandan | Mahjong | Leduc | Limit Texas | No-limit Texas |
|---|---|---|---|---|---|---|
| GPT-4o | 0.180 | 0.019 | 0.25 | 0.84 | 0.60 | 2.73 |
| DeepSeek-R1 | 0.185 | 0.020 | 0.05 | 0.88 | 0.24 | 1.88 |
| Qwen-7B-mix | 0.852 | 0.634 | 1.08 | 1.24 | 2.66 | 4.86 |
The fine-tuned 7B model (Qwen-7B-mix) outperforms GPT-4o and DeepSeek-R1 on all eight games; six of them are shown above.
### Ablation Study: General Capability Retention
| Configuration | MMLU-Pro | Math-500 | HumanEval | Dou Di Zhu (win rate) |
|---|---|---|---|---|
| Base model | 56.3 | 80.0 | 86.6 | 0.087 |
| Game data only | 42.1 | 53.6 | 67.7 | 0.806 |
| Game + 10% general | 53.2 | 69.0 | 79.9 | 0.785 |
| Game + 50% general | 54.2 | 72.0 | 83.5 | 0.775 |
## Key Findings
- Remarkable LLM learning capacity: LLMs approach specialized strong AIs on high-complexity games (Dou Di Zhu, Guandan), and a single model can play multiple roles.
- Multi-game co-learning: Games with similar rules (e.g., three Texas Hold'em variants) mutually reinforce each other; games with divergent rules (e.g., Dou Di Zhu vs. Mahjong) exhibit interference.
- Model scale: Performance improves steadily from 0.5B to 7B, but the 14B model underperforms the 7B; the authors attribute this to insufficient training data for a model of that size, compounded by data quality issues in the peasant role.
- General capability degradation is mitigable: Pure game fine-tuning drops MMLU-Pro by about 14 points (56.3 to 42.1), but mixing in 50% general instruction data recovers it to within about 2 points of the original, with only a marginal decrease in game performance; a minimal mixing sketch follows this list.
- Data quality is critical: The peasant role performs significantly worse than the landlord; the root cause is that the filtering strategy allows low-quality "free-rider" peasant data to contaminate the training set.
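As a concrete reading of the retention recipe above, here is a minimal mixing sketch. The paper does not specify whether the 10%/50% ratio is measured against the game data or the total set, so the ratio base here is an assumption.

```python
import random

def mix_in_general(game_samples, general_pool, ratio=0.5, seed=0):
    """Blend general instruction data into the game SFT set.
    ratio=0.5 corresponds to the 'Game + 50% general' ablation row;
    the ratio is assumed to be relative to the game data."""
    rng = random.Random(seed)
    n_general = int(len(game_samples) * ratio)
    mixed = game_samples + rng.sample(general_pool, n_general)
    rng.shuffle(mixed)
    return mixed
```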
## Highlights & Insights
- Elegant engineering approach: Rather than having LLMs explore via RL (too costly), the paper stands on the shoulders of giants (DouZero, etc.) and applies SFT at minimal cost to probe the upper bound of LLM game-learning ability.
- Multi-game co-learning experimental design: The finding that similar-rule games mutually reinforce while dissimilar-rule games conflict offers practical insights for multi-task LLM training.
- Data quality > data quantity: The peasant role issue vividly exposes the "free-rider" data trap in team-based games.
## Limitations & Future Work
- Only SFT is explored; RL is not investigated — self-play or RLHF could potentially yield further performance gains.
- Opponent models are relatively weak (rule-based, random); the evaluation measures win rates against weak opponents rather than against the strongest AIs.
- The data filtering strategy is insufficiently fine-grained — in team games, key contributors should be distinguished rather than retaining all data from the winning side.
- The 14B degradation suggests that LoRA rank may need to be scaled with model size.
- The study is limited to card games and does not extend to perfect-information games such as board games.
## Related Work & Insights
- vs. AlphaGo/AlphaZero: These systems learn from scratch via self-play and RL, whereas this paper uses existing AI-generated data for SFT. The advantage of LLMs lies in the generality of a single model across multiple games.
- vs. Suspicion-Agent (Guo et al.): Prompt-based methods rely on the model's intrinsic knowledge and fall far short of SFT-based approaches (GPT-4o achieves only 0.18 on Dou Di Zhu vs. 0.85 after SFT).
- vs. Specialized game AIs: LLMs approach but do not surpass specialized AIs, indicating a ceiling for SFT that may require RL to overcome.
## Rating
- Novelty: ⭐⭐⭐ The method (SFT) itself is not novel, but the systematic evaluation framework and experimental findings are valuable.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Eight games, multiple model types and scales, and three evaluation dimensions (single-game, multi-game, general capability) constitute a very comprehensive study.
- Writing Quality: ⭐⭐⭐⭐ Experiment-driven, with clear conclusions and detailed data.
- Value: ⭐⭐⭐⭐ Provides solid benchmark evidence for the capability boundaries of LLMs as general-purpose agents.