MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=GCd5v3ehmr
Code: Open source (Project Page + Code, see OpenReview)
Area: Multi-Agent / LLM Reinforcement Learning
Keywords: Multi-agent reasoning, Self-play, GRPO, Credit assignment, Advantage estimation, Strategic games

TL;DR

MARSHAL utilizes a GRPO modification specifically designed for "multi-turn + multi-agent" scenarios (turn-level advantage estimation with summation before normalization + role-based group normalization). By training Qwen3-4B through self-play in cooperative and competitive strategic games, the model acquires reasoning capabilities that zero-shot transfer to multi-agent systems like MAD/AutoGen, consistently improving performance on math and QA benchmarks.

Background & Motivation

Background: Reinforcement learning (GRPO/PPO) has significantly enhanced single-agent LLM reasoning (e.g., DeepSeek-R1). However, real-world negotiation, gaming, and collaborative development are multi-agent systems (MAS) involving long-term interactions, and extending RL to multi-turn, multi-agent scenarios remains largely unexplored.
Limitations of Prior Work: Directly applying GRPO to multi-agent self-play faces two major issues. First is long-range credit assignment—a game consists of multiple turns with sparse final outcomes; assigning the overall result to every token (naive GRPO) fails to distinguish which specific action was effective. Second is advantage variance from role heterogeneity—different roles (first-mover/second-mover, different positions in cooperation) have asymmetric information and reward scales, which introduces variance and instability when normalized together.
Key Challenge: Self-play naturally generates multi-turn, multi-role trajectories, whereas existing single-turn RL advantage estimation assumes "one response = one turn, one role," creating a structural mismatch.
Goal: Design an end-to-end RL framework to enable LLMs to acquire generalizable multi-agent reasoning capabilities through self-play in diverse strategic games, which can transfer to real-world MAS.
Core Idea: Self-play + two GRPO modifications for multi-turn multi-agent scenarios—modeling strategic games as turn-level MDPs, using "summation then normalization" turn-level advantage estimation for fine-grained credit assignment, and independently normalizing advantages by player role sub-groups.

Method

Overall Architecture

MARSHAL treats a full strategic game as an episode (turn-level MDP): the high-level state $s_k$ is the board/hand status at the start of turn $k$, and the high-level action $a_k$ is the complete LLM output (including reasoning and the move). This action is a sequence generated token-by-token by the underlying autoregressive policy. The goal is to maximize the total episode reward $R=\sum_{k=1}^{K} r_k$. Based on GRPO, the framework uses self-play to let the same model play all roles to generate trajectories, then applies two modifications to convert rewards into accurate token-level advantages for GRPO updates.

flowchart LR
    A[Self-play with same model<br/>Coop/Comp Strategic Games] --> B[Multi-turn trajectories per role<br/>turn-level rewards r_k]
    B --> C[Turn-Level Advantage Estimation<br/>Sum R_k then Normalize]
    C --> D[Agent-Specific Normalization<br/>Grouped by player role]
    D --> E[Token-level advantage → GRPO Update]
    E --> A
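The turn-level MDP view described above can be sketched as a small data structure. This is only an illustrative sketch: the class names `Turn` and `Trajectory`, the field names, and the toy Mini Hanabi rewards are assumptions made here, not identifiers from the paper's code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    """One high-level step of the turn-level MDP (illustrative names)."""
    state: str     # s_k: board/hand description at the start of turn k
    action: str    # a_k: full LLM output (reasoning plus the move)
    reward: float  # r_k: turn-level reward


@dataclass
class Trajectory:
    """All turns played by one role in one self-play game."""
    role: int
    turns: List[Turn] = field(default_factory=list)

    def episode_return(self) -> float:
        # R = sum_{k=1}^{K} r_k
        return sum(t.reward for t in self.turns)

    def returns_to_go(self) -> List[float]:
        # R_k = sum_{k'=k}^{K} r_{k'}, accumulated from the last turn backwards
        out, running = [], 0.0
        for t in reversed(self.turns):
            running += t.reward
            out.append(running)
        return out[::-1]


# Toy cooperative episode: +1 whenever a card is played successfully (assumed values).
traj = Trajectory(role=0, turns=[
    Turn("hand A", "<think>...</think> play card 1", 1.0),
    Turn("hand B", "<think>...</think> give hint", 0.0),
    Turn("hand C", "<think>...</think> play card 2", 1.0),
])
print(traj.episode_return())  # total episode reward R
print(traj.returns_to_go())   # per-turn cumulative returns R_k
```

Keeping the per-turn rewards, rather than only the episode total, is what makes the turn-level advantage estimation in the following designs possible.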

Key Designs

1. Naive Extension of GRPO to Multi-turn Self-play: Defining the baseline. In self-play, all players are controlled by the same model, producing one multi-turn trajectory per role per game. Treating all trajectories $\{(s^i_k,a^i_k)_{k=1}^{K_i}\}_{i=1}^{G}$ within a game environment as a group of responses and using each trajectory's total reward $R_i=\sum_{k=1}^{K_i} r^i_k$ as its terminal reward extends GRPO with a summation over turns. The advantage becomes $A^i_{k,t}=\frac{R_i-\mathrm{mean}(R)}{\mathrm{std}(R)}$, where $R=\{R_j\}_{j=1}^{G}$. However, this assigns the same advantage to every token of every turn in a trajectory, so long-range credit assignment fails.

2. Turn-level Advantage Estimation: Inverting "Normalize then Sum" to "Sum then Normalize". Process-supervised GRPO first normalizes each turn's reward across the batch, $\tilde r^i_k=(r^i_k-\mathrm{mean}(r))/\mathrm{std}(r)$, then sums the normalized rewards from turn $k$ onward, $A^i_k=\sum_{\hat k=k}^{K}\tilde r^i_{\hat k}$. The issue is that intermediate reward distributions vary significantly across turns. MARSHAL reverses the order: it first computes the Monte Carlo cumulative return from turn $k$, $R^i_k=\sum_{\hat k=k}^{K} r^i_{\hat k}$, then normalizes these returns against the batch, $A^i_{k,t}=R^i_k-\mathrm{mean}(R)$, so every token $t$ of turn $k$ shares that turn's advantage. This is equivalent to GAE with $\gamma=1,\lambda=1$, where the value function $V(s_k)$ is approximated by a simple yet effective baseline: the empirical mean of batch returns $\mathbb{E}[R]$.
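The three estimators discussed so far (naive, normalize-then-sum, sum-then-normalize) can be contrasted in a short NumPy sketch. Several things here are assumptions for illustration: all trajectories are padded to the same number of turns, the mean is taken over every cumulative return in the batch, and the function names and the `eps` stabilizer are hypothetical rather than taken from the paper's code.

```python
import numpy as np


def returns_to_go(r):
    """R_k = sum of rewards from turn k to the last turn, per trajectory."""
    return np.flip(np.cumsum(np.flip(r, axis=1), axis=1), axis=1)


def naive_advantage(r, eps=1e-8):
    """Naive GRPO: one scalar per trajectory, copied to every turn."""
    R = r.sum(axis=1)                        # total reward R_i
    A = (R - R.mean()) / (R.std() + eps)     # eps is an assumed stabilizer
    return np.repeat(A[:, None], r.shape[1], axis=1)


def normalize_then_sum(r, eps=1e-8):
    """Process-supervised variant: normalize rewards, then sum from turn k on."""
    r_tilde = (r - r.mean()) / (r.std() + eps)
    return returns_to_go(r_tilde)


def sum_then_normalize(r):
    """Turn-level estimate: returns-to-go minus their batch mean."""
    R_k = returns_to_go(r)
    return R_k - R_k.mean()


# Rows are trajectories, columns are turns (toy rewards).
r = np.array([[1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])

naive = naive_advantage(r)
turn_level = sum_then_normalize(r)
print(naive)       # each row is constant: no per-turn credit
print(turn_level)  # the first turn of row 0 gets more credit than its later turns
```

In a token-level update, each entry of `turn_level` would be broadcast to all tokens generated in that turn.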

3. Agent-specific Advantage Normalization: Calculating baselines per role group. Expected returns often depend heavily on player roles (first vs. second player, different cooperative roles). Normalizing across different roles pulls all players toward a shared baseline, which is statistically unsound and masks role-specific signals. MARSHAL partitions batch trajectories into sub-groups $G_p$ based on player role $p$, applying turn-level estimation independently within each sub-group: $$A^{p,i}_{k,t}=R^{p,i}_k-\mathrm{mean}(R^p),\quad R^p \text{ is the set of cumulative returns for sub-group } G_p$$ This ensures the advantage of each action is calculated relative to that role's average outcome.
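A minimal sketch of the role-wise baseline described in this design, assuming integer role ids and equal-length trajectories. The function name `agent_specific_advantage` and the toy returns are hypothetical and chosen only to show the effect of grouping.

```python
import numpy as np


def agent_specific_advantage(returns_to_go, roles):
    """Subtract a per-role mean from each trajectory's cumulative returns.

    returns_to_go: array of shape [G, K] holding R_k for each trajectory
    roles:         int array of shape [G] with the role id p of each trajectory
    """
    returns_to_go = np.asarray(returns_to_go, dtype=float)
    roles = np.asarray(roles)
    adv = np.empty_like(returns_to_go)
    for p in np.unique(roles):
        mask = roles == p
        # Baseline is the mean over the cumulative returns of sub-group G_p only.
        adv[mask] = returns_to_go[mask] - returns_to_go[mask].mean()
    return adv


# Toy batch: role 0 (first mover) mostly wins, role 1 (second mover) mostly loses.
R = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0],      # role 0: win, win, draw
              [-1.0, -1.0], [-1.0, -1.0], [0.0, 0.0]])  # role 1: loss, loss, draw
roles = np.array([0, 0, 0, 1, 1, 1])

shared = R - R.mean()                        # one baseline for everyone
per_role = agent_specific_advantage(R, roles)
print(shared[5], per_role[5])  # the second mover's draw under each baseline
```

With a shared baseline the second mover's draw receives zero advantage, while the per-role baseline scores it above that role's average, and the first mover's draw below its own.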

4. Minimal Reward Design + Curriculum-Based Game Selection: Relying on outcome signals. The primary signal is the intrinsic game result (Tic-Tac-Toe Win/Loss/Draw ±1, Kuhn Poker chips won/lost, Mini Hanabi +1 per card played), with rewards scaled to a maximum of 4 across games. Two auxiliary rewards stabilize training: a format reward (+0.05 for a legal format, -10 and termination for an illegal one) and a length penalty $r_{\text{length}}(l)=\alpha\cdot\max(0,1-\frac{l-l_{\min}}{l_{\max}-l_{\min}})$ ($l_{\min}=11,l_{\max}=2048,\alpha=0.5$), which equals $\alpha$ at $l=l_{\min}$ and decays linearly to 0 at $l=l_{\max}$. Games are categorized for curriculum: Perfect Information Competitive (Tic-Tac-Toe → Connect Four), Imperfect Information Competitive (Kuhn Poker → Leduc Hold'em), and Imperfect Information Cooperative (Mini Hanabi → Simple Hanabi).
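The two auxiliary rewards can be written directly from the formula and constants quoted above. The function names are hypothetical, and how these terms are combined with the game outcome is not specified in this summary, so the sketch deliberately stops at the individual terms.

```python
def length_reward(l, l_min=11, l_max=2048, alpha=0.5):
    """r_length(l) = alpha * max(0, 1 - (l - l_min) / (l_max - l_min))."""
    return alpha * max(0.0, 1.0 - (l - l_min) / (l_max - l_min))


def format_reward(is_legal):
    """+0.05 for a legally formatted response, -10 for an illegal one.

    Per the description above, an illegal response also terminates the episode;
    that control flow is left to the caller in this sketch.
    """
    return 0.05 if is_legal else -10.0


for length in (11, 1024, 2048, 4096):
    print(length, length_reward(length))
```

Note that the formula as quoted is clipped only from below, so a response shorter than `l_min` would score above `alpha`.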

Key Experimental Results

Main Results (Downstream Reasoning in MAS, Average)

Setting	Model	Average
Single Agent	Qwen3-4B	60.74
Single Agent	SPIRAL	63.75
Single Agent	MARSHAL Generalist	62.79
MAD (Competitive)	Qwen3-4B	72.45
MAD (Competitive)	SPIRAL	73.41
MAD (Competitive)	MARSHAL Generalist	75.96 (+3.51)
AutoGen (Cooperative)	Qwen3-4B	79.14
AutoGen (Cooperative)	SPIRAL	80.05
AutoGen (Cooperative)	MARSHAL Generalist	82.15

Representative gains: The Generalist improved GPQA-Diamond by 7.57% under the MAD framework and AIME by 10.00% under AutoGen.
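The average-score gaps implied by the main results table can be recomputed with a few lines of Python. The numbers are copied from the table above; the dictionary layout and variable names are just one convenient choice.

```python
# Average scores copied from the main results table.
averages = {
    "Single Agent": {"Qwen3-4B": 60.74, "SPIRAL": 63.75, "MARSHAL Generalist": 62.79},
    "MAD (Competitive)": {"Qwen3-4B": 72.45, "SPIRAL": 73.41, "MARSHAL Generalist": 75.96},
    "AutoGen (Cooperative)": {"Qwen3-4B": 79.14, "SPIRAL": 80.05, "MARSHAL Generalist": 82.15},
}

# Difference between MARSHAL Generalist and each comparison model, per setting.
deltas = {
    setting: {
        baseline: round(row["MARSHAL Generalist"] - row[baseline], 2)
        for baseline in ("Qwen3-4B", "SPIRAL")
    }
    for setting, row in averages.items()
}

for setting, d in deltas.items():
    print(f"{setting}: vs Qwen3-4B {d['Qwen3-4B']:+.2f}, vs SPIRAL {d['SPIRAL']:+.2f}")
```

Read this way, the table shows MARSHAL Generalist ahead of the base model in all three settings (+2.05, +3.51, +3.01), behind SPIRAL in the single-agent setting (-0.96), and ahead of SPIRAL in both multi-agent settings (+2.55, +2.10).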

Ablation Study (Tic-Tac-Toe Expert, Normalized Returns for Train/Held-out games)

Model	Tic-Tac-Toe	Kuhn Poker	Mini Hanabi	Connect Four	Leduc Hold'em	Simple Hanabi
MARSHAL	75.30/32.10	74.15/3.42	50.48	30.65/14.85	58.36/27.65	29.75
w/o Turn-Level	74.60/24.15	80.26/28.35	34.80	26.75/12.30	48.34/41.34	19.05
w/o Agent-Specific	82.70/31.20	70.89/11.24	44.10	25.40/10.50	51.04/49.88	21.72
w/ fixed opponent	88.00/41.95	63.15/28.84	34.93	20.35/5.65	47.38/35.55	12.22

Key Findings

OOD Generalization: The Tic-Tac-Toe expert generalizes to the more complex Connect Four and even improves on OOD Mini Hanabi, suggesting it learned foundational "turn-based planning" skills.
Generalist Performance: The generalist showed the strongest overall performance across all games, with a 28.7% improvement in Leduc Hold'em and 22.9% in Simple Hanabi.
Component Necessity: Removing turn-level estimation or agent-specific normalization significantly degraded performance on held-out games (especially cooperative Hanabi).
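Self-play vs. Fixed Opponent: Training against fixed experts leads to overfitting to static environments. In the ablation table above, the fixed-opponent variant scores highest on Tic-Tac-Toe (88.00/41.95) but lowest on held-out Connect Four (20.35/5.65) and Simple Hanabi (12.22).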
Failure Mode Attribution: In GPQA-Diamond + MAD, MARSHAL reduced "Inter-Agent Misalignment" by 11.5%, primarily through fewer "task deviations" and "ignoring inputs from other agents."

Highlights & Insights

Structural Analysis of Strategy Games: The paper makes explicit that single-turn math follows "one response = one turn," whereas strategic games are turn-level MDPs, which pinpoints where GRPO's assumptions break.
The "Sum then Normalize" Inversion: A seemingly small reordering that coincides with GAE ($\gamma=1,\lambda=1$) when the batch mean serves as the value estimate, giving the method theoretical grounding with no extra engineering overhead.
Cross-domain Generalization: Abstract multi-agent skills like "role understanding" and "intent recognition" (Theory of Mind) acquired in games transfer zero-shot to real-world debate/cooperative scenarios in MAS.
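The GAE connection mentioned in the highlights can be checked in a few lines. This is the standard GAE definition worked through under two assumptions: the episode ends after turn $K$, so $V(s_{K+1})=0$, and the value baseline is the constant batch mean. The residual symbol $\delta_k$ is introduced here for the derivation and does not appear elsewhere in these notes.

$$\hat A_k^{\mathrm{GAE}(\gamma,\lambda)}=\sum_{l=0}^{K-k}(\gamma\lambda)^l\,\delta_{k+l},\qquad \delta_k=r_k+\gamma V(s_{k+1})-V(s_k)$$

Setting $\gamma=\lambda=1$ makes the value terms telescope:

$$\hat A_k^{\mathrm{GAE}(1,1)}=\sum_{l=0}^{K-k}\bigl(r_{k+l}+V(s_{k+l+1})-V(s_{k+l})\bigr)=\sum_{\hat k=k}^{K} r_{\hat k}+V(s_{K+1})-V(s_k)=R_k-V(s_k)$$

Substituting $V(s_k)\approx\mathrm{mean}(R)$ recovers the turn-level advantage $A_{k,t}=R_k-\mathrm{mean}(R)$ from the Method section.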

Limitations & Future Work

Evaluation limited to Qwen3-4B and six simplified two-player games; scalability to larger models or more complex multiplayer games is unknown.
Downstream transfer tested only on MAD/AutoGen frameworks with math and QA benchmarks; generalization to open tool-use or long-term collaborative tasks remains to be seen.
Auxiliary rewards and reward scaling still require manual tuning; uniform scaling across games might need re-calibration for more heterogeneous task mixtures.

Related Work & Insights

Single-Agent RL Reasoning: Follows DeepSeek-R1 and Kimi k1.5 in using RL to scale reasoning, applying format and length rewards but in a multi-agent context.
Self-play: Directly compares with SPIRAL (competitive-only self-play); MARSHAL covers both cooperation and competition while emphasizing cross-domain generalization.
GRPO / GAE: Methodological roots in GRPO, with theoretical grounding in the equivalence between turn-level estimation and GAE.
Insight: Games serve as a training ground for generalizable multi-agent reasoning—externalizing abstract skills into quantifiable game objectives before transferring back to real MAS.

Rating

Novelty: ⭐⭐⭐⭐ Clear identification of GRPO mismatches in multi-turn multi-agent settings; "Sum-then-Normalize + Role-based Grouping" is a simple and effective modification.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive coverage of OOD generalization, MAS frameworks, math/QA benchmarks, ablations, and failure mode analysis.
Writing Quality: ⭐⭐⭐⭐ Smooth logic from motivation to experiments; well-supported by both formulas and qualitative trajectory analysis.
Value: ⭐⭐⭐⭐ Provides a reproducible end-to-end paradigm for training generalizable multi-agent reasoning LLMs.