DipLLM: Fine-Tuning LLM for Strategic Decision-Making in Diplomacy¶
Conference: ICML 2025
arXiv: 2506.09655
Code: None
Area: LLM Pre-training
Keywords: Diplomacy, LLM agent, fine-tuning, autoregressive factorization, Nash equilibrium
TL;DR¶
This paper proposes DipLLM, which uses an autoregressive factorization framework to decompose the exponentially large combinatorial action space of the board game Diplomacy into a sequence of unit-level decisions, then fine-tunes an LLM to learn equilibrium strategies. DipLLM outperforms Cicero while using only 1.5% of Cicero's training data.
Background & Motivation¶
Background: The action space of Diplomacy can reach up to \(10^{64}\), and traditional methods rely on equilibrium search to generate large amounts of game data.
Limitations of Prior Work: Cicero's CoShar-piKL requires 448 GPUs for game simulation, incurring immense computational overhead. LLM prompting methods perform poorly in complex strategic environments.
Key Challenge: LLMs have strong general reasoning capabilities, but choosing a joint action directly is nearly infeasible when there are up to \(26^{34}\) action combinations.
Goal: Can LLMs be fine-tuned to learn equilibrium strategies with a small amount of data?
Key Insight: Autoregressive factorization + weighted SFT based on unit Q-value.
Core Idea: Autoregressive factorization aligns the next-token prediction of LLMs with unit-by-unit decisions.
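The size gap that motivates the factorization can be checked with a few lines of arithmetic. This is only an illustrative sketch: the constants are read off the \(26^{34}\) figure above (about 26 orders per unit, up to 34 units), and the variable names are assumptions rather than anything from the paper.

```python
import math

ACTIONS_PER_UNIT = 26   # approximate per-unit order count quoted in this note
MAX_UNITS = 34          # exponent in the 26^34 figure quoted in this note

# Joint formulation: a single choice over every combination of unit orders.
joint_actions = ACTIONS_PER_UNIT ** MAX_UNITS

# Factorized formulation: one roughly 26-way choice per unit, made in sequence,
# so the total number of candidates scored is a sum rather than a product.
sequential_choices = ACTIONS_PER_UNIT * MAX_UNITS

joint_log10 = math.log10(joint_actions)
print(f"joint action space is about 10^{joint_log10:.1f}")
print(f"candidates scored under factorization: {sequential_choices}")
```

The factorization does not shrink the set of reachable joint actions; it only changes how a joint action is selected, replacing one enormous choice with a short sequence of small ones.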
Method¶
Overall Architecture¶
The TextDiplomacy module converts the board state into text, and the LLM sequentially generates actions for each unit to form a joint strategy.
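The unit-by-unit generation loop described above might look roughly like the following. This is a minimal sketch under stated assumptions: `UnitPolicy`, `sample_joint_action`, `uniform_policy`, and the `LEGAL` table are hypothetical names, and the toy uniform policy stands in for the fine-tuned LLM, whose actual prompt format and decoding procedure are not specified in this note.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: given the state text, the orders chosen so far, and
# the current unit, return a probability for each legal order of that unit.
UnitPolicy = Callable[[str, Sequence[str], str], Dict[str, float]]


def sample_joint_action(
    state: str, units: Sequence[str], policy: UnitPolicy, rng: random.Random
) -> Tuple[List[str], float]:
    """Sample one order per unit, each conditioned on the orders chosen so far."""
    chosen: List[str] = []
    joint_prob = 1.0
    for unit in units:
        dist = policy(state, chosen, unit)          # per-unit conditional policy
        orders = list(dist)
        weights = [dist[o] for o in orders]
        order = rng.choices(orders, weights=weights, k=1)[0]
        joint_prob *= dist[order] / sum(weights)    # product of per-unit terms
        chosen.append(order)
    return chosen, joint_prob


# Toy stand-in for the model: uniform over a hand-written set of legal orders.
LEGAL = {
    "A PAR": ["A PAR H", "A PAR - BUR", "A PAR - PIC"],
    "F BRE": ["F BRE H", "F BRE - MAO"],
}


def uniform_policy(state: str, previous: Sequence[str], unit: str) -> Dict[str, float]:
    options = LEGAL[unit]
    return {o: 1.0 / len(options) for o in options}


orders, prob = sample_joint_action(
    "toy board state", list(LEGAL), uniform_policy, random.Random(0)
)
print(orders, prob)
```

The toy policy ignores `previous`; a real model would condition on it, which is what allows the per-unit choices to coordinate.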
Key Designs¶
- Autoregressive Factorization: \(\boldsymbol{\pi}_i(a_i^{1:D}|s) = \prod_{d=1}^{D} \pi_i^d(a_i^d | s, a_i^{1:d-1})\), where each step only selects from approximately 26 actions.
- Equilibrium Strategy Objective: Defines the unit Q-value \(Q_i^d\) and proves that the decomposed joint strategy is equivalent to piKL-Hedge (Theorem 1), converging to an approximate Nash equilibrium in two-player zero-sum games (Theorem 2).
- Fine-Tuning Loss: \(\max_{\pi_\phi} \mathbb{E}[\log \pi_\phi(a_i^d|s,a_i^{1:d-1}) \cdot \exp\{Q_i^d\}]\), which consists of an SFT term and equilibrium weights.
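To make the fine-tuning objective concrete, the sketch below evaluates the weighted log-likelihood on a handful of samples, written as a loss to minimize. It is an assumption-laden illustration: `weighted_sft_loss` is a hypothetical helper, the per-sample log-probabilities and unit Q-values are made-up numbers, and any temperature or normalization the paper applies to \(Q_i^d\) is not reproduced here.

```python
import math
from typing import Sequence


def weighted_sft_loss(log_probs: Sequence[float], q_values: Sequence[float]) -> float:
    """Negative sample mean of exp(Q) * log pi(a | s, previous unit orders)."""
    if len(log_probs) != len(q_values):
        raise ValueError("log_probs and q_values must have the same length")
    if not log_probs:
        raise ValueError("at least one sample is required")
    total = sum(math.exp(q) * lp for lp, q in zip(log_probs, q_values))
    return -total / len(log_probs)


# Illustrative values only: model log-probabilities of the chosen unit orders
# and the corresponding unit Q-values.
example_log_probs = [math.log(0.5), math.log(0.25)]
example_q_values = [0.0, 1.0]
loss = weighted_sft_loss(example_log_probs, example_q_values)
print(loss)
```

With every Q-value set to zero the weights are all 1 and the expression reduces to the ordinary SFT negative log-likelihood; larger Q-values increase the pull toward the corresponding order.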
Loss & Training¶
The base model is LLaMA 3 8B, fine-tuned with LoRA (rank 16, \(\alpha=32\)) using AdamW at a learning rate of 2e-4 for 5 epochs, on only about 500 games of data.
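For a sense of scale, the LoRA hyperparameters above can be plugged into the standard LoRA bookkeeping. This is a rough sketch, not the paper's configuration: the note does not say which weight matrices are adapted, so a single square 4096 x 4096 projection (4096 being the LLaMA 3 8B hidden size) is assumed purely for illustration, and the \(\alpha / r\) scaling is the standard LoRA convention.

```python
RANK = 16            # LoRA rank from the training setup
ALPHA = 32           # LoRA alpha from the training setup
D_IN = D_OUT = 4096  # assumed: one square projection at the model's hidden size

# Standard LoRA scales the low-rank update B @ A by alpha / rank.
scaling = ALPHA / RANK

# A is (RANK x D_IN) and B is (D_OUT x RANK), so the adapter adds:
lora_params = RANK * (D_IN + D_OUT)
full_params = D_IN * D_OUT
fraction = lora_params / full_params

print(f"scaling factor: {scaling}")
print(f"adapter parameters per matrix: {lora_params}")
print(f"fraction of the full matrix: {fraction:.4%}")
```

Under these assumptions the adapter trains well under 1% of the parameters of each adapted matrix, which is consistent with the small-data, low-compute framing of the method.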
Key Experimental Results¶
Main Results (1v6 Competition)¶
| Agent | SoS Score ↑ | Win Rate ↑ | Survived ↑ | Defeated ↓ |
|---|---|---|---|---|
| DipLLM | 23.0% | 22.3% | 50.3% | 27.4% |
| Cicero | 20.8% | 20.5% | 50.1% | 29.4% |
| DNVI | 6.6% | 4.3% | 31.1% | 64.6% |
| DipNet | 4.2% | 2.1% | 24.3% | 73.6% |
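The SoS column is not defined in this note. The sketch below assumes the common sum-of-squares convention for Diplomacy, in which each power's score is its squared supply-center count divided by the sum of squares over all powers, and a solo winner takes the full score. The function name, the 18-center solo threshold as a parameter, and the example center counts are illustrative assumptions; the paper's exact evaluation protocol may differ.

```python
from typing import Dict


def sos_scores(centers: Dict[str, int], solo_threshold: int = 18) -> Dict[str, float]:
    """Sum-of-squares scores from final supply-center counts (assumed convention)."""
    # A solo victory gives the winner the whole score.
    for power, count in centers.items():
        if count >= solo_threshold:
            return {p: (1.0 if p == power else 0.0) for p in centers}
    total = sum(count * count for count in centers.values())
    if total == 0:
        raise ValueError("at least one power must hold a supply center")
    return {p: count * count / total for p, count in centers.items()}


# Made-up final position for illustration (34 centers in total).
example_centers = {
    "AUS": 10, "ENG": 6, "FRA": 8, "GER": 0, "ITA": 4, "RUS": 6, "TUR": 0,
}
scores = sos_scores(example_centers)
print(scores)
```

Because the counts are squared, the metric rewards concentrating supply centers in one power more than a linear share would, which is why it is reported alongside the win rate.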
Ablation Study¶
| Configuration | SoS | Win | Defeated | Description |
|---|---|---|---|---|
| AF + Fine-tune | 29.4% | 25.2% | 29.0% | Full |
| AF only | 9.9% | 6.7% | 53.3% | No equilibrium learning |
| FT only (No AF) | 0.8% | 0.0% | 80.8% | Action space too large |
| Baseline | 0.2% | 0.0% | 95.7% | No AF, No FT |
Key Findings¶
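- Both autoregressive factorization (AF) and fine-tuning (FT) are indispensable: removing either one drops the SoS score from 29.4% to 9.9% or lower. DipLLM's inference is also 5-10 times more efficient than Cicero's.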
- Fine-tuning with 100 games of data outperforms DipNet, and fine-tuning with 500 games achieves a 6.7% lead.
Highlights & Insights¶
- Autoregressive factorization aligns naturally with the LLM next-token prediction paradigm.
- Theoretical guarantees provide a solid foundation for the proposed method.
- Outperforming the SOTA with only 1.5% of the data highlights the leverage effect of pre-trained LLM knowledge.
Limitations & Future Work¶
- Still relies on external data generation.
- Only evaluated on the "no-press" version of Diplomacy, which has no negotiation messages between players.
- The potential of integrating with online search has not been fully explored.
Related Work & Insights¶
- The core idea is similar to the autoregressive Q-function in Q-Transformer.
- The potential of LLMs in game environments is still far from fully exploited.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First study to fine-tune an LLM for Diplomacy equilibrium strategies.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive comparisons, ablations, and case studies.
- Writing Quality: ⭐⭐⭐⭐ Clear framework.
- Value: ⭐⭐⭐⭐⭐ Significant improvement in data efficiency.
Supplementary Reflections¶
Relation to Domain Trends¶
The research direction of this paper is closely related to several major trends in current AI research: model capability evaluation and reliability assurance, parameter-efficient fine-tuning and model compression, and AI safety and alignment. From a methodological standpoint, the paper helps move the research paradigm from empirically driven toward theoretically grounded approaches.
Specific Suggestions for Future Work¶
- Combine the core idea with other modalities (vision, audio, multimodal) to verify the cross-modal universality of the method.
- Validate the conclusions on larger-scale models (70B+) and newer architectures (such as Mixture-of-Experts).
- Explore the possibility of combining this approach with reinforcement learning and online learning to achieve dynamic adaptation.
- Develop automated evaluation and optimization tools to lower the barriers to adopting this method.
- Consider the intersection with LLM alignment research to explore the coordinated optimization of safety and performance.