Evaluating Counterfactual Strategic Reasoning in Large Language Models¶
Conference: ACL 2026
arXiv: 2603.19167
Code: https://github.com/dimjimitris/llm_gm_thesis
Area: LLM Reasoning / Game Evaluation / Counterfactual Robustness
Keywords: counterfactual games, strategic reasoning, Prisoner's Dilemma, Rock-Paper-Scissors, opponent comprehension
TL;DR¶
This paper evaluates how well LLMs adapt strategically, using label-perturbed, payoff-perturbed, and jointly perturbed counterfactual versions of the Iterated Prisoner's Dilemma and Rock-Paper-Scissors. It finds that many models appear proficient in familiar games but keep playing templated strategies even after the payoff structure is altered.
Background & Motivation¶
Background: LLMs have been widely applied in multi-agent collaboration, competition, and game simulations. Researchers often observe whether models can cooperate, compete, identify opponent strategies, and approach equilibrium behavior through structured games such as the Prisoner's Dilemma, Rock-Paper-Scissors, and Matching Pennies.
Limitations of Prior Work: Conventional game evaluations tend to overestimate model capabilities. Models might memorize templates such as "cooperate/defect in the Prisoner's Dilemma" or "randomize in Rock-Paper-Scissors" rather than re-calculating strategies based on the payoff matrix. Once action labels are renamed or payoff structures are modified counterfactually, fluent explanations do not necessarily translate into correct actions.
Key Challenge: True strategic reasoning requires models to perform conditional updates based on current environmental labels, payoffs, and historical interactions; however, LLM behavior may stem more from canonical game patterns encountered during pre-training. Distinguishing between surface recognition and incentive sensitivity is difficult in default games, requiring counterfactual interventions to decouple them.
Goal: To construct a compact, controllable, and reproducible experimental framework to separately diagnose label robustness, payoff sensitivity, opponent modeling, and token-normalized efficiency, thereby determining whether models understand the current game or are merely reproducing familiar templates.
Key Insight: The authors select two complementary games: the Prisoner's Dilemma to examine dynamic adaptation between cooperation and defection, and Rock-Paper-Scissors to examine randomization, pattern exploitation, and three-action equilibrium. Subsequently, label perturbations, payoff perturbations, and joint perturbations are applied to both, forcing models to re-interpret action meanings and payoff structures.
Core Idea: Use counterfactual label/payoff interventions in repeated games to distinguish between an LLM's ability to "describe a strategy" and its ability to "execute a strategy according to new incentives."
Method¶
The method in this paper is essentially a behavioral evaluation framework. It does not train models; instead, it places LLMs in multi-round interactive games against another LLM or an algorithmic opponent, recording actions, payoffs, opponent comprehension speed, cooperation rates, and token efficiency.
Overall Architecture¶
The input consists of a set of LLMs, prompting strategies, opponent types, and game definitions. Each experiment pits one LLM player against either another instance of the same model or an algorithmic player. The Prisoner's Dilemma runs for 16 rounds and Rock-Paper-Scissors for 24 rounds; each experiment with a non-self-consistency player is repeated 5 times, and each experiment with a self-consistency player is repeated twice.
The framework includes four types of game settings: default games, label-based counterfactuals, payoff-based counterfactuals, and joint counterfactuals. Label-based settings change only the action names or the descriptions of which action beats which, keeping payoffs constant; payoff-based settings change the payoff matrix so that the original equilibrium no longer applies; joint settings change labels and payoffs at the same time and form the hardest stress test.
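The difference between a label-based and a payoff-based counterfactual can be made concrete with a small sketch. It uses the default PD payoffs given in this note; the Stag Hunt payoff values, the helper names `relabel` and `strictly_dominant`, and the dictionary layout are assumptions made for illustration and are not taken from the paper.

```python
# Default PD payoffs as stated in this note:
# (row action, column action) -> (row payoff, column payoff).
PD = {
    ("C", "C"): (4, 4),
    ("C", "D"): (1, 6),
    ("D", "C"): (6, 1),
    ("D", "D"): (2, 2),
}

# Hypothetical Stag Hunt payoffs, chosen for illustration only (not from the paper).
STAG_HUNT = {
    ("Stag", "Stag"): (4, 4),
    ("Stag", "Hare"): (0, 3),
    ("Hare", "Stag"): (3, 0),
    ("Hare", "Hare"): (2, 2),
}


def relabel(game, mapping):
    """Label-based counterfactual: rename actions, keep every payoff unchanged."""
    return {(mapping[a], mapping[b]): pay for (a, b), pay in game.items()}


def strictly_dominant(game):
    """Return the row player's strictly dominant action, or None if there is none."""
    actions = sorted({a for a, _ in game})
    for a in actions:
        if all(
            game[(a, o)][0] > game[(b, o)][0]
            for b in actions if b != a
            for o in actions
        ):
            return a
    return None


pd_relabeled = relabel(PD, {"C": "Stag", "D": "Hare"})

print(strictly_dominant(PD))            # D
print(strictly_dominant(pd_relabeled))  # Hare
print(strictly_dominant(STAG_HUNT))     # None
```

Renaming leaves the incentive structure intact (the renamed defection action is still dominant), whereas the payoff change removes the dominant action altogether. A model that defects under the Stag/Hare labels with PD payoffs but also "defects" by habit in a true Stag Hunt is reacting to the template rather than to the matrix.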
Prompting variants include zero-shot (ZS), Chain-of-Thought (CoT), Solo Performance Prompting (SPP), and self-consistency. Opponent types include both LLMs and algorithmic strategies such as SREP, PP, MF/TFT, and AP. This makes it possible to study LLM-LLM coordination, exploitation of predictable opponents, and play against adaptive opponents within one framework.
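The round structure described above can be sketched as a simple loop. This is a minimal illustration, not the authors' harness: `play`, `always_defect`, and `tit_for_tat` are hypothetical names, and a plain Python callable stands in for the LLM player that the real framework would query through a prompt.

```python
from typing import Callable, List, Tuple

# Default PD payoffs from this note:
# (player action, opponent action) -> (player payoff, opponent payoff).
PD = {("C", "C"): (4, 4), ("C", "D"): (1, 6), ("D", "C"): (6, 1), ("D", "D"): (2, 2)}

History = List[Tuple[str, str]]  # one (own action, other's action) pair per round
Policy = Callable[[History], str]


def always_defect(history: History) -> str:
    """Stand-in for an opponent that defects every round."""
    return "D"


def tit_for_tat(history: History) -> str:
    """Cooperate first, then copy the other side's previous action."""
    return history[-1][1] if history else "C"


def play(player: Policy, opponent: Policy, rounds: int = 16) -> Tuple[int, int]:
    """Run a repeated PD and return (player total, opponent total)."""
    history: History = []
    player_total = opponent_total = 0
    for _ in range(rounds):
        a = player(history)
        # The opponent sees the same history from its own point of view.
        b = opponent([(theirs, ours) for ours, theirs in history])
        pay_a, pay_b = PD[(a, b)]
        player_total += pay_a
        opponent_total += pay_b
        history.append((a, b))
    return player_total, opponent_total


print(play(always_defect, always_defect))  # (32, 32)
print(play(tit_for_tat, tit_for_tat))      # (64, 64)
print(play(tit_for_tat, always_defect))    # (31, 36)
```

The first two totals match the reference points used in the results table: 32 for 16 rounds of mutual defection and 64 for 16 rounds of mutual cooperation.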
Key Designs¶
- Counterfactual Game Construction:
    - Function: Separates the stress applied to the default games into surface-level label changes and deeper payoff changes.
    - Mechanism: In PD, Stag Hunt serves as the payoff-based counterfactual, turning the strict advantage of defection into a coordination problem; separately, the C/D labels are renamed to Stag/Hare to test the influence of action names. In RPS, the payoffs for certain win/loss combinations are amplified so that the uniform mixed strategy is no longer an equilibrium.
    - Design Motivation: In default PD/RPS it is hard to tell whether a model actually reads the payoffs. Counterfactual versions expose "label anchoring" and "incentive rigidity."
- Opponent Comprehension Metric:
    - Function: Measures the round from which a model stably understands and exploits the opponent's strategy.
    - Mechanism: \(m\) is defined as the earliest round such that, from that round until the end of the game, the LLM obtains a payoff no lower than the opponent's in at least \(t_p=90\%\) of the remaining rounds. A smaller \(m\) indicates earlier opponent modeling; a value exceeding the game length indicates that no stable understanding was reached.
    - Design Motivation: Total scores reflect only cumulative earnings, not whether the model understood the opponent early or recovered by chance later. \(m\) isolates the speed of dynamic adaptation.
- Performance and Efficiency Joint Evaluation:
    - Function: Distinguishes playing better from merely spending more tokens on explanation.
    - Mechanism: In addition to total points, cooperation/action distributions, and failure rates, the authors define efficiency as \(\textit{points}/\textit{tokens} \times c\) (default \(c=1000\)).
    - Design Motivation: Reasoning models may emit longer chains of thought, but extra deliberation does not necessarily produce faster adaptation. Efficiency metrics expose this mismatch between reasoning overhead and payoff.
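The opponent comprehension metric \(m\) and the efficiency measure can be computed from per-round logs. The sketch below is one reading of the definitions above, under two stated assumptions: rounds are 1-indexed, and the window "from that round until the end" includes round \(m\) itself. The function names and the example payoff sequences are hypothetical.

```python
from typing import Sequence


def comprehension_round(
    player_payoffs: Sequence[float],
    opponent_payoffs: Sequence[float],
    threshold: float = 0.9,
) -> int:
    """Earliest 1-indexed round m such that, from round m to the end, the player's
    payoff is >= the opponent's in at least `threshold` of the rounds.
    Returns game length + 1 if no such round exists."""
    n = len(player_payoffs)
    at_least_even = [p >= o for p, o in zip(player_payoffs, opponent_payoffs)]
    for start in range(n):
        tail = at_least_even[start:]
        if sum(tail) / len(tail) >= threshold:
            return start + 1
    return n + 1


def efficiency(points: float, tokens: int, c: int = 1000) -> float:
    """Points per token, scaled by c (points / tokens * c)."""
    return points * c / tokens


# Hypothetical 16-round PD log against an always-defecting opponent:
# the player cooperates for 6 rounds (1 vs 6), then defects (2 vs 2).
player = [1] * 6 + [2] * 10
opponent = [6] * 6 + [2] * 10

print(comprehension_round(player, opponent))        # 6
print(comprehension_round([1] * 16, [6] * 16))      # 17
print(efficiency(32, 8000))                         # 4.0
```

In the first example the player only matches the opponent from round 7, yet \(m=6\): the 10% tolerance admits the window of rounds 6 to 16, in which 10 of 11 rounds qualify. Whether the paper's implementation treats this boundary the same way is not stated in this note.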
Loss & Training¶
This paper does not train any models; the focus is on evaluation design. The default Prisoner's Dilemma payoffs, listed in the same order as the actions, are \((C,C)=(4,4)\), \((C,D)=(1,6)\), \((D,C)=(6,1)\), and \((D,D)=(2,2)\), so a player's 16-round cumulative score ranges from 16 to 96. In RPS, a win, a loss, and a tie pay \(+1\), \(-1\), and \(0\) per round, so the 24-round cumulative score ranges from \(-24\) to \(24\). In the RPS payoff-based counterfactual, the win/loss magnitude for the Rock-Paper matchup is raised to 3, which shifts the theoretical equilibrium from the uniform distribution to \(\pi^*(R)=0.2\), \(\pi^*(P)=0.2\), \(\pi^*(S)=0.6\).
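The shifted equilibrium can be checked by hand. The short derivation below assumes that the other two matchups keep magnitude 1 (Rock beats Scissors, Scissors beats Paper) and writes the equilibrium mix as \((r, p, s)\), symbols introduced here for convenience. Because the game is symmetric and zero-sum, its value is 0, so every pure action must earn an expected payoff of exactly 0 against a fully mixed equilibrium:

\[
\begin{aligned}
u(R) &= s - 3p = 0,\\
u(P) &= 3r - s = 0,\\
u(S) &= p - r = 0,\\
r + p + s &= 1
\end{aligned}
\quad\Longrightarrow\quad r = p = \tfrac{1}{5},\; s = \tfrac{3}{5}.
\]

The third condition gives \(r = p\), the first gives \(s = 3p\), and normalization then yields \(5p = 1\). Scissors, the one action not involved in the amplified matchup, absorbs most of the probability mass.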
Key Experimental Results¶
Main Results¶
| Setting | Metric | Representative Results | Explanation | Conclusion |
|---|---|---|---|---|
| Default PD vs. SREP | Total points | Baseline for continuous \((D,D)\) is 32 points when SREP always defects; most LLMs cluster around 30 points | Models typically identify continuous defection and provide near-optimal responses | Simple algorithmic opponents are easier |
| Default PD vs. LLM | Total points | Claude 3.5/3.7 and Llama 3.3-70B reach 64 points under various prompting | 64 points corresponds to 16 rounds of complete mutual cooperation | Some models achieve stable coordination in LLM-LLM setups |
| Default PD | Instability cases | Mistral Large ranges from 18.6±10.6 to 29.8±2.2 under SREP; Claude 4/DeepSeek R1 range from 31.4±0.0 to 49.4±15.5 in LLM-LLM | Weaker or over-reasoning models may be more unstable | High capability does not guarantee strategic stability |
| Default RPS | Opponent comprehension (\(m\)) | Claude 3.5 Sonnet v2 scores 10.6±13.1 with ZS, 21.4±4.6 with SPP, and 19.6±5.6 with CoT | Opponent modeling in RPS is slower and closer to the 24-round horizon | Three-action games with no dominant strategy are harder |
| RPS payoff counterfactual | Theoretical equilibrium | Shifted from uniform \((1/3, 1/3, 1/3)\) to \((0.2, 0.2, 0.6)\) | Remaining close to uniform indicates failure to re-calculate strategy based on new payoffs | Payoff perturbation best exposes templating |
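To see why staying near uniform counts as a failure in the payoff counterfactual, one can compute how much a best-responding opponent earns against each mix. This is a minimal sketch using exact fractions; it assumes the amplified Rock-Paper magnitude of 3 with the other matchups left at 1, and the function names are hypothetical.

```python
from fractions import Fraction

ACTIONS = ("R", "P", "S")
BEATS = {"R": "S", "P": "R", "S": "P"}  # key beats value


def payoff(a: str, b: str, rock_paper_stake: int = 1) -> int:
    """Payoff to the player choosing `a` against an opponent choosing `b`."""
    if a == b:
        return 0
    stake = rock_paper_stake if {a, b} == {"R", "P"} else 1
    return stake if BEATS[a] == b else -stake


def best_response_value(mix, rock_paper_stake: int = 1):
    """Highest expected per-round payoff any pure action earns against `mix`."""
    return max(
        sum(mix[b] * payoff(a, b, rock_paper_stake) for b in ACTIONS)
        for a in ACTIONS
    )


uniform = {a: Fraction(1, 3) for a in ACTIONS}
shifted = {"R": Fraction(1, 5), "P": Fraction(1, 5), "S": Fraction(3, 5)}

print(best_response_value(uniform))                           # 0
print(best_response_value(uniform, rock_paper_stake=3))       # 2/3
print(best_response_value(shifted, rock_paper_stake=3))       # 0
print(24 * best_response_value(uniform, rock_paper_stake=3))  # 16
```

Under the default payoffs, uniform play cannot be exploited. Once the Rock-Paper stake is 3, an opponent that always plays Paper gains 2/3 of a point per round against a uniform player, or 16 points in expectation over 24 rounds, while the \((0.2, 0.2, 0.6)\) mix concedes nothing.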
Ablation Study¶
| Configuration | Observed Effect | Description |
|---|---|---|
| Label-only counterfactual | Degradation usually moderate | Strong models remain relatively stable, while models like Mistral are more prone to fluctuations due to action renaming |
| Payoff-based counterfactual | Stronger degradation | Requires re-calculating incentives, especially transitioning from uniform randomization to biased mixed strategies in RPS |
| Joint counterfactual | Maximum pressure | Near-horizon comprehension failures and high variance are more common when both labels and payoffs change |
| CoT / thinking variants | Inconsistent effects | Helpful for some strong models, but can induce overthinking or distrust of the opponent in Claude 4 and DeepSeek R1 |
| Self-consistency | Reduces variance but not fundamental bias | It often reinforces existing behavioral patterns rather than correcting an erroneous strategy |
Key Findings¶
- Payoff-based counterfactuals are more diagnostic than label-only ones because they force the model to re-calculate the reward structure rather than just processing surface name changes.
- Performance in default games does not represent counterfactual robustness. Claude 3.7 is the most stable overall, Claude 4 is strong in RPS but shows mixed counterfactual stability, and Llama 3.3 is stable in PD cooperation but weaker under RPS/payoff shifts.
- Thinking more is not necessarily better. Thinking-enabled variants increase token consumption in some settings without a proportional increase in total points or opponent comprehension.
- RPS exposes delayed opponent modeling better than PD because there is no simple cooperative convergence point; models must maintain near-equilibrium behavior or identify exploitable patterns.
Highlights & Insights¶
- The value of this paper lies in its "clean" evaluation design: label changes, payoff changes, and joint changes correspond to different failure modes, allowing for the decoupling of template memory, label anchoring, and incentive rigidity.
- Opponent comprehension is more explanatory than final scores. Many models might achieve acceptable final scores, but a late \(m\) suggests they stumbled into the strategy through interaction rather than understanding the opponent from the start.
- The \((0.2, 0.2, 0.6)\) equilibrium in the RPS payoff counterfactual is crucial. It shows that "randomization" is not always correct: a model that keeps randomizing uniformly after the payoffs change is persisting with the canonical equilibrium instead of the one the new payoffs imply.
- The paper serves as a reminder that in agent evaluation, the chain-of-thought in strong models may increase defensiveness, skepticism, or exploration noise. Longer reasoning traces do not automatically translate to steadier strategic execution.
Limitations & Future Work¶
- The evaluation only covers two-player, synchronous, repeated games with fixed payoffs, which has limited ecological validity compared to real-world multi-party negotiations, markets, auctions, or open-ended collaborations.
- Algorithmic opponents and payoff structures are preset; different conclusions might emerge with more adaptive opponents, multi-agent populations, or games of incomplete information.
- All metrics are derived from observable actions and tokens; inferring "understanding" is behavioral and does not equate to explaining the model's internal mechanisms.
- The set of models, prompts, and counterfactual types is not exhaustive. Future work could include more open/closed-source models, more complex payoff transformations, natural language rule ambiguity, and consistency analysis of internal reasoning traces.
Related Work & Insights¶
- vs. Conventional Game Evaluation: Default PD/RPS only observe if a model can play familiar games; this paper tests if the model truly updates strategies based on current rules via counterfactual perturbations.
- vs. Static Counterfactual QA: Many counterfactual benchmarks are single-turn I/O; this paper incorporates counterfactuals into repeated interactions, allowing for the observation of adaptation speed and history dependency.
- vs. Multi-agent Collaboration Evaluation: Conventional agent benchmarks focus more on task success rates; this paper emphasizes multi-dimensional diagnosis of payoffs, opponent modeling, efficiency, and failure rates, making it suitable as a compact stress test for agentic LLMs.
- Insight: When designing LLM benchmarks, one should include control groups with "rule-preserved but label-changed" and "label-preserved but payoff-changed" settings to avoid misinterpreting "familiar template execution" as "abstract reasoning."
Rating¶
- Novelty: ⭐⭐⭐⭐ Diagnosing LLM strategic robustness via counterfactual repeated games is a compact and explanatory setup.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple frontier LLMs, prompting strategies, opponent types, and metrics, though the variety of games is still relatively small.
- Writing Quality: ⭐⭐⭐⭐ The main text logic is clear, and the appendix provides sufficient numerical data; some tables are very large and require textual summaries for interpretation.
- Value: ⭐⭐⭐⭐ Directly valuable for LLM agent evaluation, strategic reasoning, and counterfactual robustness research.