Ad-Hoc Human-AI Coordination Challenge (AH2AC2)¶
Conference: ICML2025
arXiv: 2506.21490
Code: FLAIROx/ah2ac2
Area: LLM Reasoning
Keywords: Human-AI Coordination, Ad-Hoc Teamplay, Hanabi, Behavioural Cloning, Regularised RL, Human Proxy Agents, benchmark
TL;DR¶
This work proposes the AH2AC2 challenge—based on the cooperative card game Hanabi—which constructs human proxy agents via behavioral cloning and regularized reinforcement learning, and open-sources a limited human dataset to provide a standardized, reproducible evaluation framework for human-AI ad-hoc coordination research.
Background & Motivation¶
Core Problem¶
AI agents in the real world must collaborate with humans in an ad-hoc manner (ad-hoc teamplay), but existing methods face two major dilemmas:
- Self-play overfitting: Agents trained via traditional self-play develop exclusive, idiosyncratic communication protocols, making them unable to coordinate with unseen partners (especially humans).
- Unreproducible evaluation: Prior assessments of human-AI coordination relied on closed-source datasets and private proxy agents, making research results difficult to replicate and compare.
Why Hanabi?¶
Hanabi is an ideal testbed because it simultaneously features:
- Imperfect information: Players see their partners' cards but not their own.
- Restricted communication: Information can only be transmitted through limited hints.
- Theory of Mind: Players must reason about their partner's intentions and knowledge.
- Cooperative actions: All players share a single goal (achieving a maximum score of 25 points).
Research Motivation¶
Existing literature lacks standardized evaluation methods for human-AI coordination. Various studies employ different proxy agents and datasets, preventing fair comparison between methods. This work aims to bridge this gap by providing a unified evaluation protocol and public baselines.
Method¶
Overall Architecture: AH2AC2 Challenge Design¶
AH2AC2 comprises two evaluation tasks:
- Part 1 — Coordination with Human Proxies: Participants develop agents that play 1,000 Hanabi games with the human proxy agents provided by the paper.
- Part 2 — Human Action Prediction: Participants predict human players' actions on a closed-source dataset of human games (evaluated using cross-entropy loss).
The data come from the hanab.live platform: 101,096 two-player games and 46,525 three-player games were collected in total. Of these, 3,079 games (1,858 two-player and 1,221 three-player) are open-sourced for participants. The volume is intentionally limited to encourage research into data-efficient methods.
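As a quick check on how small the public split is, the arithmetic can be sketched as follows. The counts are the ones quoted above; the variable names are illustrative only.

```python
# Game counts as quoted in the text above.
collected = {"2P": 101_096, "3P": 46_525}
released = {"2P": 1_858, "3P": 1_221}

total_collected = sum(collected.values())  # 147,621 games
total_released = sum(released.values())    # 3,079 games

# Share of the collected games that is publicly released, per setting and overall.
released_share = {k: released[k] / collected[k] for k in collected}
overall_share = total_released / total_collected

print(f"collected: {total_collected:,}  released: {total_released:,}")
for setting, share in released_share.items():
    print(f"{setting}: {share:.2%} released")
print(f"overall: {overall_share:.2%} released")
```

Roughly 2% of the collected games are public, which is what makes data efficiency the central constraint for participants.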
Key Designs: HDR-IPPO (Human-Data-Regularised IPPO)¶
Training of the human proxy agents consists of two steps:
Step 1: Behavioural Cloning (BC)
- Use an LSTM architecture to model the policy \(\pi_\theta^{BC}\), taking the action-observation history (AOH) as input.
- Train via supervised learning using standard cross-entropy loss.
- Evaluate self-play scores at the end of each training epoch and save the optimal parameters \(\theta'\).
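The checkpoint-selection rule in Step 1 can be sketched as below. This is a minimal illustration, not the authors' training code: `train_one_epoch` and `evaluate_self_play` are hypothetical callables standing in for one supervised pass and a self-play evaluation.

```python
from typing import Callable, Tuple, TypeVar

Params = TypeVar("Params")


def train_bc_with_checkpointing(
    init_params: Params,
    train_one_epoch: Callable[[Params], Params],    # hypothetical: one cross-entropy pass
    evaluate_self_play: Callable[[Params], float],  # hypothetical: mean self-play score
    num_epochs: int,
) -> Tuple[Params, float]:
    """Run BC epochs and keep the parameters with the best self-play score."""
    params = init_params
    best_params, best_score = init_params, float("-inf")
    for _ in range(num_epochs):
        params = train_one_epoch(params)
        score = evaluate_self_play(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy usage: "parameters" are just an epoch counter and the scores are made up.
toy_scores = {1: 5.0, 2: 9.5, 3: 8.0}
best, score = train_bc_with_checkpointing(
    0, lambda p: p + 1, lambda p: toy_scores[p], num_epochs=3
)
print(best, score)  # 2 9.5
```

Selecting on self-play score rather than validation loss matches the description above: the checkpoint that plays best, not the one that predicts best, seeds the RL stage.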
Step 2: Regularised Reinforcement Learning
- Initialize the human proxy policy \(\pi_\theta^{HP}\) from the optimal BC parameters \(\theta'\).
- Train via self-play using IPPO (Independent PPO).
- Add a KL regularization term, with strength controlled by \(\lambda \in [0, 1]\), that keeps the final policy from deviating too far from human behavior.
This allows the final policy to improve its game-playing capability while retaining a human-like style.
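The objective itself is not spelled out in this summary. One common form, assumed here purely for illustration, is a convex combination of the IPPO loss and a KL term toward the frozen BC policy. The weighting scheme, the KL direction, and the symbol \(\mathcal{L}^{IPPO}\) are assumptions, not taken from the paper:

\[
\mathcal{L}^{HP}(\theta) = (1 - \lambda)\,\mathcal{L}^{IPPO}(\theta) + \lambda\, D_{KL}\!\left(\pi_{\theta'}^{BC} \,\|\, \pi_\theta^{HP}\right)
\]

A minimal per-state sketch of that assumed form over discrete action distributions:

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def regularised_loss(
    ippo_loss: float,
    bc_probs: Sequence[float],  # action distribution of the frozen BC policy
    hp_probs: Sequence[float],  # action distribution of the policy being trained
    lam: float,
) -> float:
    """Assumed form: (1 - lam) * IPPO loss + lam * KL(BC || HP)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * ippo_loss + lam * kl_divergence(bc_probs, hp_probs)


bc = [0.7, 0.2, 0.1]
print(regularised_loss(2.0, bc, bc, lam=0.5))               # KL is 0, so only half the IPPO loss remains
print(regularised_loss(2.0, bc, [0.1, 0.2, 0.7], lam=0.5))  # drifting from BC is penalised
```

Under this assumed form, \(\lambda = 0\) recovers plain IPPO and \(\lambda = 1\) ignores the RL signal and only pulls the policy toward BC.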
Loss & Training¶
- Use LSTM architecture to capture sequential features of actions and observations.
- Although model parameters are fixed, the agents' behaviors are dynamic—they make decisions based on the complete game history (including partners' actions).
- Train 4 human proxies in total (2 for the two-player setting + 2 for the three-player setting), ensuring diversity via different hyperparameters and random seeds.
- The evaluation API hides agent weights to prevent overfitting; pre-registration is required to submit experiments.
Overfitting-Proof Design of Evaluation Protocol¶
- Human proxies are not released publicly and are only interacted with via the API.
- Each evaluation is limited to 1,000 games.
- Experiments must be pre-registered to obtain API access.
- Evaluation results are published on the leaderboard.
Key Experimental Results¶
Open-Source Dataset Statistics¶
| Setting | Metric | Min | Max | Mean | Median | Std |
|---|---|---|---|---|---|---|
| 2P (1858 games) | Score | 13 | 25 | 23.37 | 24 | 1.86 |
| 2P (1858 games) | Game Length | 52 | 76 | 65.45 | 66 | 3.35 |
| 3P (1221 games) | Score | 14 | 25 | 23.25 | 24 | 1.91 |
| 3P (1221 games) | Game Length | 45 | 67 | 57.86 | 58 | 3.38 |
Human Proxy Self-Play Performance (5,000 Games Evaluation)¶
| Metric | HP1 (2P) | HP2 (2P) | HP3 (3P) | HP4 (3P) |
|---|---|---|---|---|
| Avg Self-Play Score | 22.55±0.03 | 22.97±0.03 | 20.88±0.03 | 21.21±0.03 |
| Score Gain over BC (points) | +3.0 | +4.0 | +15.7 | +13.9 |
| Perfect Score % | 23.86% | 29.66% | 2.76% | 3.88% |
| BC Perfect Score % | 16.12% | 19.88% | 1.34% | 1.80% |
| Zero Score % | 0.10% | 0.04% | 0.34% | 0.20% |
| BC Zero Score % | 11.42% | 17.70% | 75.82% | 66.02% |
The improvement in the three-player (3P) setting is particularly significant: while the BC policy's zero-score rate was as high as 66-76%, HDR-IPPO reduced it to less than 0.5%.
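The BC self-play averages are not listed directly, but they can be backed out from the table if the gain row is read as an absolute difference in points (an assumption about the table, although it is consistent with the BC zero-score rates). A short sketch:

```python
# Values copied from the table above: (HDR-IPPO self-play score, gain over BC).
proxies = {
    "HP1 (2P)": (22.55, 3.0),
    "HP2 (2P)": (22.97, 4.0),
    "HP3 (3P)": (20.88, 15.7),
    "HP4 (3P)": (21.21, 13.9),
}

# Assumption: gain = HDR-IPPO score - BC score, both in game points.
implied_bc_score = {
    name: round(score - gain, 2) for name, (score, gain) in proxies.items()
}

for name, bc_score in implied_bc_score.items():
    print(f"{name}: implied BC self-play score = {bc_score:.2f}")
```

This gives implied BC averages of about 19.55, 18.97, 5.18 and 7.31 points, so the three-player BC policies start from a far lower baseline than the two-player ones.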
AH2AC2 Leaderboard Main Results¶
| Players | Method | Mean | Median | Cross-Entropy |
|---|---|---|---|---|
| 2P | OBL (L4) | 21.04 | 22 | 1.33 |
| 2P | BR-BC | 19.41 | 20 | 10.82 |
| 2P | FCP | 14.01 | 16 | 3.52 |
| 2P | OP | 13.91 | 19 | 7.81 |
| 2P | HDR-IPPO | 12.76 | 15 | 0.96 |
| 2P | IPPO | 10.16 | 14 | 12.60 |
| 2P | DeepSeek-R1 (H-Group) | 9.91 | 0 | - |
| 2P | DeepSeek-R1 | 5.43 | 0 | - |
| 2P | BC | 2.12 | 0 | 0.86 |
| 3P | DeepSeek-R1 (H-Group) | 14.62 | 18 | - |
| 3P | HDR-IPPO | 14.03 | 16 | 0.80 |
| 3P | OP | 12.87 | 18 | 6.40 |
| 3P | BR-BC | 11.89 | 12 | 29.89 |
| 3P | FCP | 11.55 | 6 | 5.97 |
| 3P | IPPO | 6.34 | 0 | 8.60 |
| 3P | BC | 3.31 | 0 | 0.70 |
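Several rows pair a non-trivial mean with a median of 0. That pattern appears whenever more than half of the games end with a score of 0. The sketch below uses synthetic scores, not leaderboard data, to show the effect:

```python
from statistics import mean, median

# Synthetic scores, NOT leaderboard data: 55 zero-score games and 45 strong games.
scores = [0] * 55 + [22] * 45

print(mean(scores))    # 9.9
print(median(scores))  # 0.0
```

Reading mean and median together therefore separates agents that are consistently mediocre from agents that alternate between strong games and complete failures.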
Ablation Study: Human Proxy Behavior Analysis¶
| Setting | Source | IPP (Information per Play) | Communicativeness |
|---|---|---|---|
| 2P | Human Dataset | 0.44 | 0.47 |
| 2P | HP1 | 0.43 | 0.45 |
| 2P | HP2 | 0.44 | 0.48 |
| 3P | Human Dataset | 0.42 | 0.49 |
| 3P | HP3 | 0.44 | 0.47 |
| 3P | HP4 | 0.44 | 0.46 |
On both behavioral metrics, IPP and Communicativeness, the human proxy agents stay within 0.03 of the real human data, which supports their human-likeness.
Action Prediction Performance¶
| Metric | HP1 (2P) | HP2 (2P) | HP3 (3P) | HP4 (3P) |
|---|---|---|---|---|
| Accuracy | 0.63 | 0.63 | 0.43 | 0.44 |
| Accuracy Drop vs. BC | -0.03 | -0.08 | -0.08 | -0.07 |
| Cross-Entropy Loss | 0.53 | 0.54 | 0.63 | 0.60 |
| Top-10% Accuracy | 0.82 | 0.82 | 0.71 | 0.73 |
| Top-20% Accuracy | 0.95 | 0.95 | 0.87 | 0.88 |
While maintaining high self-play performance, the human proxy agents' action prediction accuracy drops by only 3-8 percentage points compared to pure BC.
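For reference, accuracy and cross-entropy for action prediction can be computed as sketched below. The function name and the toy inputs are illustrative, and the use of natural logarithms is an assumption about the reported loss.

```python
import math
from typing import List, Sequence, Tuple


def action_prediction_metrics(
    probs: Sequence[Sequence[float]],  # predicted action distribution per step
    actions: Sequence[int],            # index of the action the human actually took
) -> Tuple[float, float]:
    """Return (accuracy, mean cross-entropy in nats)."""
    correct = 0
    total_nll = 0.0
    for p, a in zip(probs, actions):
        predicted = max(range(len(p)), key=p.__getitem__)  # argmax action
        correct += int(predicted == a)
        total_nll -= math.log(max(p[a], 1e-12))  # clamp to avoid log(0)
    n = len(actions)
    return correct / n, total_nll / n


# Toy example with three decision points and three possible actions.
toy_probs: List[List[float]] = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.5, 0.4, 0.1]]
toy_actions = [0, 1, 1]
acc, ce = action_prediction_metrics(toy_probs, toy_actions)
print(f"accuracy={acc:.3f}  cross-entropy={ce:.3f}")
```

Cross-entropy rewards calibrated probabilities even when the top prediction is wrong, which is why a policy can lose a few points of accuracy while its loss stays low.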
Highlights & Insights¶
- The power of BC + Regularised RL: A policy relying solely on behavioral cloning suffers a zero-score rate of up to 76% in the three-player setting. Adding KL-regularized IPPO reduces this rate to 0.3%, which indicates that regularized RL is an effective way to remedy the generalization flaws of BC.
- The unexpected advantage of OBL: OBL, which uses no human data, achieves the highest score (21.04) in the two-player setting. This reveals a critical research gap: existing methods cannot yet make effective use of small amounts of human data to improve coordination.
- Early promise of LLMs: The DeepSeek-R1 (H-Group) variant achieves the highest mean score in the three-player setting (14.62), hinting at the cooperative potential of LLMs. However, its weak two-player performance, including a median score of 0, indicates highly unstable behavior.
- Carefully designed evaluation protocol: The combination of API-only access, pre-registration, and a capped game count is designed to prevent overfitting to the proxies, reflecting a rigorous approach to experimental science.
- Urgent need for data efficiency: The open-source dataset is intentionally small (3,079 games, versus the full collection of 147,621 games used for training), explicitly pointing to data-efficient learning as a critical research direction.
Limitations & Future Work¶
- Limited to 2/3 player settings: Hanabi supports 2-5 players, but due to data availability constraints, the challenge only covers two-player and three-player scenarios.
- Lack of direct human evaluation: Although the human proxy agents align closely with humans on behavioral metrics, they have not been validated through direct play against real human players.
- Single source of data: All data originates from the H-Group convention player community on hanab.live, which may not represent the broader diversity of human playstyles.
- Insufficient LLM evaluation: DeepSeek-R1 was evaluated over only 100 games (compared with 1,000 games for other methods), so its scores carry more statistical uncertainty.
- Lack of theoretical analysis: The effectiveness of HDR-IPPO is primarily empirically validated, lacking theoretical explanations for why KL regularization preserves human compatibility.
- Exclusion of game variants: Hanabi variants such as rainbow cards were not tested, making it impossible to assess the generalization capability of the methods.
- Weak FCP results: FCP performs poorly in this complex, partially observable environment, which suggests that population-based methods may be ill-suited to such scenarios.
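To put the 100-game versus 1,000-game evaluation noted above in perspective, the standard error of a mean score shrinks with the square root of the number of games. The sketch below assumes independent games and the same per-game standard deviation in both cases. The value `assumed_std` is hypothetical, not a figure from the paper.

```python
import math


def standard_error(score_std: float, num_games: int) -> float:
    """Standard error of the mean score over independent games."""
    return score_std / math.sqrt(num_games)


assumed_std = 8.0  # hypothetical per-game score standard deviation

se_100 = standard_error(assumed_std, 100)
se_1000 = standard_error(assumed_std, 1000)
ratio = se_100 / se_1000  # equals sqrt(10), independent of assumed_std

print(f"SE over 100 games:  {se_100:.3f}")
print(f"SE over 1000 games: {se_1000:.3f}")
print(f"ratio: {ratio:.2f}")
```

Whatever the true spread of scores, a 100-game mean is about 3.16 times noisier than a 1,000-game mean, so small gaps between a 100-game entry and its neighbours on the leaderboard should be read with caution.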
Related Work & Insights¶
- Hanabi benchmark: Bard et al. (2019) introduced the Hanabi Challenge, and SPARTA approached a perfect score (24.61/25) in self-play, but ad-hoc coordination remains an open problem.
- Regularised RL paradigm: Bakhtin et al. (2022) used human data to regularize RL in Diplomacy; Cornelisse & Vinitsky (2024) extended this approach to autonomous driving; this paper applies it to Hanabi.
- Zero-Shot Coordination (ZSC): OBL (Hu et al., 2021) and OP (Hu et al., 2020) explore coordination methods that bypass human data.
- FCP: Strouse et al. (2021) proposed population-based methods, and this paper presents the first evaluation of their performance in Hanabi.
- LLMs as game-playing agents: The evaluation of DeepSeek-R1 opens up new avenues for examining Theory of Mind capabilities of LLMs in cooperative games.
Rating¶
- Novelty: ⭐⭐⭐⭐ (4/5) — The first standardized and open-source evaluation framework for Hanabi Human-AI coordination, featuring a rigorously designed evaluation protocol.
- Experimental Thoroughness: ⭐⭐⭐⭐ (4/5) — Rich baselines (including LLMs) and multidimensional validation of human-likeness, though the LLM evaluation is limited in scale.
- Writing Quality: ⭐⭐⭐⭐ (4/5) — Clearly structured, with well-defined problems and motivations.
- Value: ⭐⭐⭐⭐⭐ (5/5) — Provides a much-needed standardized benchmark for Human-AI coordination research. The open-source data and API-based evaluation represent important contributions to the community.