Evaluating LLMs in Open-Source Games

Conference: NeurIPS 2025
arXiv: 2512.00371
Code: https://github.com/swadeshs/llm-osgt
Area: Interpretability
Keywords: Game Theory, Program Equilibrium, Open-Source Games, Multi-Agent Cooperation, Code Transparency
TL;DR

This work introduces open-source games, a paradigm in which agents submit programs rather than raw actions, to systematically evaluate LLMs on strategic reasoning, mutual learning, and cooperative play. It finds that LLMs can automatically discover approximate program equilibria.
Background & Motivation

Background: Multi-agent LLM research has largely focused on communication and task decomposition, rarely addressing strategic reasoning and cooperation, while traditional game theory has primarily targeted human or conventional RL agents.
Limitations of Prior Work:
- LLMs' strategic reasoning capabilities in complex multi-agent environments remain poorly understood.
- Existing evaluations mostly rely on natural language or black-box actions, making interpretation and verification difficult.
- Whether LLMs can spontaneously reach cooperative equilibria remains unknown.
Key Challenge: How to evaluate LLMs' ability to both safeguard self-interest and achieve cooperation in multi-agent strategic environments.
Goal: Leverage code transparency to design an evaluation framework that investigates strategic reasoning and the emergence of cooperation in LLMs.
Key Insight: Open-source games overturn the "black-box" constraint by having agents exchange source code, so each agent can reason over its opponent's known strategy.
Core Idea: A three-tier progressive investigation of LLM strategic reasoning via the SPARC benchmark (code comprehension) + open-source games (dynamic strategy) + evolutionary analysis (long-term stability).
Method

Overall Architecture

A three-stage evaluation architecture:
- Tier 1: the SPARC benchmark assesses code-comprehension ability.
- Tier 2: open-source games (two-player matches) examine emergent strategic mechanisms.
- Tier 3: evolutionary-dynamics analysis probes the stability of program equilibria.
Key Designs

- SPARC Benchmark:
    - Function: Evaluates LLMs' ability to understand opponent strategy code.
    - Mechanism: 239 IPD strategies (from the Axelrod library); given an opponent's code, the model predicts whether the strategy will always cooperate with a pure cooperator within 10 rounds. Three difficulty levels: unmasked, masked (semantic information removed), and obfuscated (all identifiers randomly replaced).
    - Design Motivation: Code transparency is a prerequisite for open-source games; it must first be verified that LLMs can understand strategy code.
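The SPARC labeling task can be sketched as a tiny simulation: play a candidate strategy against a pure cooperator for 10 rounds and check whether it ever defects. This is a hypothetical reconstruction for illustration; the strategy functions and harness below are not the paper's or the Axelrod library's actual code.

```python
# Hypothetical sketch of SPARC's ground-truth label: does this strategy
# always cooperate against a pure cooperator within 10 rounds?

C, D = "C", "D"

def tit_for_tat(my_history, opp_history):
    """Cooperate first, then mirror the opponent's last move."""
    return C if not opp_history else opp_history[-1]

def suspicious_defector(my_history, opp_history):
    """Defect on the first move, then mirror."""
    return D if not opp_history else opp_history[-1]

def always_cooperates_vs_cooperator(strategy, rounds=10):
    """Play `strategy` against a pure cooperator and check for any defection."""
    mine, theirs = [], []
    for _ in range(rounds):
        move = strategy(mine, theirs)
        mine.append(move)
        theirs.append(C)  # the opponent is a pure cooperator
    return all(m == C for m in mine)

print(always_cooperates_vs_cooperator(tit_for_tat))        # True
print(always_cooperates_vs_cooperator(suspicious_defector))  # False
```

In the benchmark the model must predict this label from the strategy's source alone, without running it.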
- Open-Source Game Experiments:
    - Function: Two-player matches where agents submit Python programs rather than direct actions.
    - Mechanism: Three agent objectives: PM (purely self-interested), CPM (cooperation-first), and DPM (deception-prone). 10 meta-rounds; after each round, agents exchange code, execute the match, and revise their strategies based on the outcome.
    - Strategic Feature Evaluation: GPT-4o is used as a judge to assess five features (independent development, exploitation, counter-strategy, imitation, and deception).
    - Design Motivation: Investigates LLM strategic behavior when opponent strategies are known.
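One meta-round of this protocol can be sketched as follows. This is an assumed, simplified reconstruction (not the paper's harness): agents submit strategies as source code, each side can inspect the other's source, the match is executed, and the LLM revision step is stubbed out.

```python
# Simplified sketch of one open-source meta-round: submit programs as source,
# exchange them, execute the match. The LLM step that rewrites each program
# given both sources and the scores is omitted.

C, D = "C", "D"
PAYOFF = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}

TIT_FOR_TAT_SRC = """
def strategy(my_hist, opp_hist):
    return "C" if not opp_hist else opp_hist[-1]
"""

ALWAYS_DEFECT_SRC = """
def strategy(my_hist, opp_hist):
    return "D"
"""

def compile_program(src):
    """Turn a submitted program (source code) into a callable strategy."""
    ns = {}
    exec(src, ns)
    return ns["strategy"]

def play_match(src_a, src_b, rounds=10):
    """Both agents receive each other's source; then the match is executed."""
    prog_a, prog_b = compile_program(src_a), compile_program(src_b)
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = prog_a(hist_a, hist_b), prog_b(hist_b, hist_a)
        pa, pb = PAYOFF[(a, b)]
        hist_a.append(a); hist_b.append(b)
        score_a += pa; score_b += pb
    return score_a, score_b

print(play_match(TIT_FOR_TAT_SRC, ALWAYS_DEFECT_SRC))  # (9, 14)
```

Because programs circulate as source, an agent can statically analyze its opponent before the match, which is exactly the lever that open-source games add over black-box play.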
- Evolutionary Dynamics Analysis:
    - Function: Analyzes the long-term stability of different strategy types.
    - Mechanism: Replicator dynamics \(\dot{x}_i = x_i[(Ax)_i - x^\top A x]\); CPM/DPM/PM populations are initialized uniformly and their evolutionary trajectories are observed.
    - Design Motivation: Single-round games reveal only local behavior; evolutionary analysis exposes system-level equilibria.
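The replicator equation above can be integrated numerically in a few lines. The 3x3 payoff matrix below is purely illustrative (not the paper's measured payoffs), and forward Euler stands in for whatever integrator the authors used.

```python
# Numerical sketch of the replicator dynamics xdot_i = x_i * [(Ax)_i - x^T A x]
# over three types {CPM, DPM, PM}, starting from a uniform population.
import numpy as np

# Illustrative payoff matrix (rows/cols: CPM, DPM, PM) -- assumed values.
A = np.array([
    [3.0, 2.5, 1.0],
    [2.5, 2.0, 1.0],
    [4.0, 1.0, 1.0],
])

def replicator_step(x, A, dt=0.01):
    fitness = A @ x        # (Ax)_i: fitness of each type
    avg = x @ fitness      # x^T A x: population-average fitness
    return x + dt * x * (fitness - avg)

x = np.array([1 / 3, 1 / 3, 1 / 3])  # uniform initial population
for _ in range(10_000):
    x = replicator_step(x, A)

print(np.round(x, 3))  # long-run population shares
```

Note that the update preserves the simplex: the increments sum to zero because \(\sum_i x_i[(Ax)_i - x^\top A x] = x^\top A x - x^\top A x = 0\).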
Key Experimental Results

Main Results: SPARC Benchmark
| Model | Unmasked Zero-Shot | Unmasked CoT | Masked CoT | Obfuscated Zero-Shot | Obfuscated CoT |
|---|---|---|---|---|---|
| Qwen2.5 (7B) | 56.4% | 75.1% | 75.1% | 43.6% | 65.6% |
| Qwen2.5 (72B) | 59.8% | 83.8% | 83.8% | 51.9% | 78.8% |
| DeepSeek-V3 | 81.7% | 86.3% | 87.6% | 72.2% | 81.7% |
| Kimi-K2 | 80.1% | 86.7% | 85.9% | 77.2% | 83.0% |
| DeepSeek-R1 | 82.6% | - | 84.2% | 83.4% | - |
| o4-mini | 87.6% | - | 88.0% | 84.2% | - |
Evolutionary Dynamics Analysis
| Game | Long-Term Stable Type | PM Attraction | Notes |
|---|---|---|---|
| IPD | CPM + DPM coexistence | No | Tit-for-Tat-style cooperative strategies are stable; PM is eliminated. |
| Coin Game | Pure PM dominance | Yes | Spatial reasoning is more complex; defense is ineffective and active occupation is required. |
Key Findings

- CoT prompting significantly improves non-reasoning models (roughly +20% on average) but has little effect on reasoning models.
- Obfuscation reduces accuracy only marginally (models stay at 72–84%), suggesting LLMs rely primarily on algorithmic structure rather than semantic cues such as identifier names.
- Although DPM agents have deceptive intent, deception is largely ineffective in a code-transparent environment.
- The same set of agents produces completely opposite evolutionary trajectories across different games, demonstrating that environmental characteristics determine strategy viability.
Highlights & Insights
- Strategic Advantage of Code Transparency: LLMs can understand and reason over opponent code logic, maintaining 72–84% accuracy even after obfuscation, demonstrating deep algorithmic comprehension.
- Effectiveness of Goal Instructions: PM/CPM/DPM prompts successfully induce markedly different strategic patterns, showing that LLM behavioral objectives can be substantially shaped through prompt engineering.
- Conditional Stability of Cooperation: CPM strategies can persist stably in IPD, indicating that cooperation can be self-sustaining in structurally repeated games — an important insight for multi-agent safety.
- Three-Tier Progressive Design: The progression from code comprehension → dynamic games → evolutionary stability yields an elegant and rigorous experimental design.
Limitations & Future Work

- Only two-player games are studied; more complex scenarios such as multi-player coalitions are not addressed.
- The assumption of full code transparency may not hold in practice, where partial concealment is possible.
- The IPD is limited to 10 rounds and the Coin Game uses a small grid.
- Formal verification is not integrated, so there is no guarantee that generated code satisfies safety properties.
Related Work & Insights
- vs. Traditional Game-Theoretic LLM Research: Several prior works study LLM behavior in payoff-matrix games; this paper is the first to systematically investigate code-level strategic reasoning.
- vs. Cooperative AI: Frameworks by Hammond/Dafoe are primarily theoretical; this paper provides empirical evaluation tools.
- Open-Source Game Theory: Rubinstein's theoretical work is empirically validated on LLMs for the first time.
Rating
- Novelty: ⭐⭐⭐⭐⭐ — An entirely new perspective on empirical LLM research through open-source games.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Three-tier progression from SPARC → dyadic games → evolutionary dynamics.
- Writing Quality: ⭐⭐⭐⭐ — Concepts are clear, though some game-theoretic details could be made more accessible.
- Value: ⭐⭐⭐⭐⭐ — Offers deep insights into multi-agent safety, cooperation mechanisms, and strategic reasoning.