Enhancing Cooperative Multi-Agent Reinforcement Learning with State Modelling and Adversarial Exploration¶

Conference: ICML 2025
arXiv: 2505.05262
Code: github.com/ddaedalus/smpe
Area: Reinforcement Learning
Keywords: Multi-Agent Reinforcement Learning, Partial Observability, State Modelling, Adversarial Exploration, Intrinsic Reward

TL;DR¶

The paper proposes SMPE², an algorithm that learns meaningful state belief representations through variational inference and couples them with adversarial intrinsic exploration. It improves coordination in partially observable cooperative multi-agent environments and outperforms state-of-the-art baselines on three benchmarks: MPE, LBF, and RWARE.

Background & Motivation¶

In cooperative multi-agent deep reinforcement learning (MARL), agents must learn collaborative policies in decentralized, partially observable environments. Real-world applications are widespread and include multi-robot coordination, wireless network optimization, autonomous driving, and air traffic control.

Key Challenge: Under the CTDE (Centralized Training with Decentralized Execution) framework, agents can share information during training but must rely solely on their local observations during execution, without explicit communication channels.

Limitations of Prior Work:

Disconnection between Agent Modelling (AM) representations and policies: Belief representations learned by standard AM are not optimized for maximizing the value function, resulting in suboptimality (e.g., LIAM, SoMM).

Unfiltered redundant state information: Directly using compressed embeddings of global states can actually harm performance, as joint states contain redundant features that are uninformative for individual agents.

Failure to use AM to improve exploration: Existing AM methods cannot improve on the initial random exploration policy, making them particularly inefficient in sparse-reward environments.

Practical limitations: Many methods assume a single controller, require prior knowledge, or need to access other agents' actions during execution.

This work focuses on two core questions:

- Q1: Can agents infer meaningful state representations solely from their own observations to enhance coordination?
- Q2: Can these representations be leveraged to efficiently explore the state space and discover superior policies?

Method¶

Overall Architecture¶

SMPE² (State Modelling for Policy Enhancement through Exploration) is built upon MAA2C and consists of two core modules:

1. Self-supervised state modelling: Each agent \(i\) infers a belief embedding \(z^i\) from its own observation \(o^i\) using a variational encoder-decoder (ED) to reconstruct the (filtered) observations of other agents. An AM filter \(w^i\) is introduced to automatically filter out uninformative features.

2. Adversarial count-based intrinsic exploration: Intrinsic rewards are calculated using the SimHash of \(z^i\) to encourage agents to discover novel states. This establishes an adversarial framework—new observations discovered by agent \(i\) simultaneously present an "adversarial target" to the ED of other agents, thereby enhancing mutual state inference capabilities.

Architectural Components (see Figure 1):

- Actor (blue): Partially observable, with policy \(\pi_\psi(a_t^i \mid h_t^i, z_t^i)\) that explicitly incorporates the belief embedding into the policy network.
- Critic (red): Both critics possess global observability.
    - \(V_\xi(s)\): Standard MAA2C critic.
    - \(V_k(\hat{s})\): Critic trained on the state purified by the AM filter, \(\hat{s} = o^i \oplus (w^i \cdot o^{-i})\).

Key Designs¶

1. Variational Encoder-Decoder (ED) and AM Filter¶

The ED of each agent \(i\) includes:

- Encoder \(q_{\omega^i}(z^i \mid o^i)\): Infers a Gaussian latent variable \(z^i\) from its own observation.
- Decoder \(p_{\phi^i}(w^i \cdot o^{-i} \mid z^i)\): Reconstructs the filtered observations of other agents.
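The encoder side of the ED can be illustrated with a minimal, dependency-free sketch. It assumes a single linear layer per head instead of the 3-layer MLP reported for the experiments, and the names `linear`, `init_encoder`, and `encode` are hypothetical, not taken from the released code. The decoder is omitted.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer; `weights` holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_encoder(obs_dim, latent_dim, rng):
    """Small random parameters for the mean and log-variance heads (assumed layout)."""
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(obs_dim)] for _ in range(latent_dim)]
    return {
        "w_mu": matrix(), "b_mu": [0.0] * latent_dim,
        "w_logvar": matrix(), "b_logvar": [0.0] * latent_dim,
    }

def encode(obs, params, rng):
    """Sample z ~ q(z | o) as mu + sigma * eps (reparameterization trick)."""
    mu = linear(obs, params["w_mu"], params["b_mu"])
    log_var = linear(obs, params["w_logvar"], params["b_logvar"])
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
    return z, mu, log_var

rng = random.Random(0)
params = init_encoder(obs_dim=4, latent_dim=32, rng=rng)  # 32 matches the reported latent size
z, mu, log_var = encode([0.1, 0.2, 0.3, 0.4], params, rng)
```

The encoder returns `mu` and `log_var` alongside the sample because the KL regularizer needs them.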

The design of the AM Filter is crucial. The filter \(w^i_j = \sigma(\phi^i_w(o^j))\) is the output of a learnable MLP with a sigmoid output layer, so each dimension lies in \([0,1]\) and controls how much each observation feature of another agent \(j\) contributes to agent \(i\).
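As a rough illustration of the gating, the sketch below builds the filtered state \(\hat{s} = o^i \oplus (w^i \cdot o^{-i})\) from plain Python lists. It assumes a single linear layer followed by a sigmoid in place of the MLP, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def am_filter(other_obs, weights, bias):
    """Gate for agent j's observation: one value in [0, 1] per feature of o^j."""
    logits = [sum(w * v for w, v in zip(row, other_obs)) + b
              for row, b in zip(weights, bias)]
    return [sigmoid(l) for l in logits]

def filtered_state(own_obs, others_obs, weights, bias):
    """s_hat: own observation concatenated with the gated observations of the others."""
    s_hat = list(own_obs)
    for o_j in others_obs:
        gate = am_filter(o_j, weights, bias)
        s_hat.extend(g * v for g, v in zip(gate, o_j))
    return s_hat

own = [1.0, 2.0]
others = [[3.0, 4.0], [5.0, 6.0]]
zero_weights = [[0.0, 0.0], [0.0, 0.0]]
# A large positive bias opens every gate; a large negative bias closes them.
open_state = filtered_state(own, others, zero_weights, [50.0, 50.0])
closed_state = filtered_state(own, others, zero_weights, [-50.0, -50.0])
```

With the gates closed, only the agent's own observation survives in `closed_state`, which is the degenerate solution the norm regularizer is meant to discourage.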

Definition of Uninformative Features (both conditions must hold):

- Irrelevant to agent \(i\)'s future reward maximization.
- Infeasible to infer via \(z^i\) due to partial observability and non-stationarity.

Why is the ED conditioned only on \(o^i\) (rather than history \(h^i\))? Because \(z^i\) is utilized for intrinsic exploration, it should reward "novel observations" rather than "novel trajectories." Conditioning on \(h^i\) would over-compress a novel high-value observation in an otherwise thoroughly explored trajectory, making it fail to receive sufficient intrinsic motivation. The policy network itself is already conditioned on \(h_t^i\) and implicitly leverages beliefs \(z_t^i\) across all timesteps.

2. Adversarial Count-Based Exploration¶

A SimHash-based count method is adopted, but the hash domain is \(z^i \in \mathcal{Z}\) instead of the raw observations:

\[\hat{r}^i = \frac{1}{n(SH(z^i))}\]

where \(n(SH(z^i))\) is the visitation count of the hash value of \(z^i\). SimHash is chosen because it is locality-sensitive: neighboring \(z^i\) are likely to share a hash value, so their visits are counted together, and it is cheap to compute.
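A minimal sketch of the count-based bonus follows, using the standard SimHash construction (signs of random Gaussian projections). The class name, the `n_bits` parameter, and its default are assumptions for illustration. The reward follows the \(1/n\) formula as written above.

```python
import random
from collections import Counter

class SimHashCounter:
    """Counts visits to SimHash buckets of belief embeddings z."""

    def __init__(self, latent_dim, n_bits=16, seed=0):
        rng = random.Random(seed)
        # Fixed random projection: one Gaussian row per hash bit.
        self.projection = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
                           for _ in range(n_bits)]
        self.counts = Counter()

    def hash(self, z):
        """Sign pattern of the projected embedding, used as a hashable bucket key."""
        return tuple(1 if sum(a * v for a, v in zip(row, z)) >= 0.0 else 0
                     for row in self.projection)

    def intrinsic_reward(self, z):
        """Increment the bucket count for z and return 1 / n(SH(z))."""
        key = self.hash(z)
        self.counts[key] += 1
        return 1.0 / self.counts[key]

counter = SimHashCounter(latent_dim=4, n_bits=8, seed=0)
z = [0.3, -1.2, 0.7, 2.0]
first = counter.intrinsic_reward(z)   # first visit to this bucket
second = counter.intrinsic_reward(z)  # same bucket again, so the bonus shrinks
```

Fewer hash bits merge more embeddings into the same bucket, so `n_bits` trades off generalization across similar beliefs against the granularity of the novelty signal.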

Adversarial Aspect: Agent \(i\) is motivated to discover novel \(o^i\) (leading to novel \(z^i\)). These new observations simultaneously serve as unseen reconstruction targets for the ED of other agents—effectively acting as "adversarial samples" for other agents to help them learn stronger state inference capabilities.

Stability Guarantee: A periodic hard update (every \(N_{ED}=2000\) steps) is applied to the ED parameters \((\omega^i, \phi^i)\). This prevents \(z^i\) from fluctuating severely across consecutive training epochs and ensures that the intrinsic reward decays stably to a minimum plateau.
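One way to read the periodic hard update is as a frozen copy of the ED parameters that is synchronized with the trained parameters every \(N_{ED}\) steps and used to produce the embeddings that get hashed. This reading, together with the class and attribute names below, is an assumption and not a description of the released code.

```python
import copy

class PeriodicHardUpdate:
    """Keeps a frozen parameter copy that is overwritten every `period` steps."""

    def __init__(self, params, period=2000):
        self.online = params                  # parameters updated by the optimizer
        self.frozen = copy.deepcopy(params)   # parameters used between syncs
        self.period = period
        self.steps = 0

    def step(self):
        """Advance one training step; return True when a hard update happened."""
        self.steps += 1
        if self.steps % self.period == 0:
            self.frozen = copy.deepcopy(self.online)
            return True
        return False

updater = PeriodicHardUpdate({"w": [1.0]}, period=3)
updater.online["w"][0] = 2.0                    # simulated gradient step
synced = [updater.step() for _ in range(3)]     # sync only on the third step
```

Between synchronizations the frozen copy does not move, so the hash bucket of a given observation stays fixed and its visit count can accumulate.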

3. Dual-Critic Design¶

\(V_\xi(s)\): Standard critic, providing advantage estimation for the actor.
\(V_k(\hat{s})\): Critic trained on the filtered state \(\hat{s}\), allowing \(w^i\) to learn a filtering policy customized for policy optimization.

Ablation studies show that a single critic leads to high variance and divergence.

Loss & Training¶

Total Loss for State Representation Learning:

\[L_\text{encodings} = L_\text{critic}^w + \lambda_\text{rec} \cdot L_\text{rec} + \lambda_\text{norm} \cdot L_\text{norm} + \lambda_\text{KL} \cdot L_\text{KL}\]

Descriptions of each loss term:

Loss Term	Formula	Function
\(L_\text{rec}\)	\(\|\tilde{w}^i \cdot o^{-i} - w^i \cdot \hat{o}^{-i}\|^2\)	Self-supervised reconstruction loss (with target network stabilization)
\(L_\text{norm}\)	\(-\|w^i\|_2^2\)	Prevents the AM filter from degenerating to zero
\(L_\text{KL}\)	\(\text{KL}(q_{\omega^i}(z^i \mid o^i) \,\|\, \mathcal{N}(0,I))\)	Variationalizes belief representations and promotes belief consistency among agents
\(L_\text{critic}^w\)	\([r_t^i + V_{k'}^\pi(\hat{s}_{t+1}) - V_k^\pi(\hat{s}_t)]^2\)	Trains the AM filter for policy optimization
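The three regularization terms and their weighted sum can be sketched on flat Python lists. The KL term uses the standard closed form for a diagonal Gaussian against \(\mathcal{N}(0,I)\), namely \(\tfrac{1}{2}\sum_d(\sigma_d^2 + \mu_d^2 - 1 - \log\sigma_d^2)\). Function names are hypothetical, and the default weights are the LBF values quoted in this summary.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def reconstruction_loss(target_gate, gate, others_obs, predicted_obs):
    """Squared error between target-gated observations and gated predictions."""
    return sum((tg * o - g * p) ** 2
               for tg, g, o, p in zip(target_gate, gate, others_obs, predicted_obs))

def norm_loss(gate):
    """Negative squared L2 norm; minimizing it pushes the gates away from zero."""
    return -sum(g * g for g in gate)

def encodings_loss(critic_loss, rec, norm, kl, lam_rec=1.0, lam_norm=0.1, lam_kl=1.0):
    """Weighted sum of the state-representation loss terms."""
    return critic_loss + lam_rec * rec + lam_norm * norm + lam_kl * kl

kl = kl_to_standard_normal([1.0], [0.0])
rec = reconstruction_loss([1.0, 0.5], [1.0, 0.5], [2.0, 4.0], [2.0, 3.0])
total = encodings_loss(critic_loss=1.0, rec=rec, norm=norm_loss([1.0, 0.5]), kl=kl)
```

The reconstruction and norm terms pull in opposite directions: setting all gates to zero would minimize the reconstruction error trivially, while the norm term rewards keeping the gates open.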

Actor Loss:

\[L_\text{actor}(\psi^i) = -\beta_H \cdot H(\pi_{\psi^i}(a_t^i \mid h_t^i, z_t^i)) - \log \pi_{\psi^i}(a_t^i \mid h_t^i, z_t^i) \cdot (r_t^i + V_{\xi'}^\pi(s_{t+1}) - V_\xi^\pi(s_t))\]
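For a discrete action space, the actor loss above can be evaluated for a single transition as in the sketch below. It follows the formula exactly as written here, so no discount factor appears, and it treats the advantage as a constant. The function names and the default `beta_h` are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_loss(logits, action, reward, v_next, v_now, beta_h=0.01):
    """Entropy bonus plus policy-gradient term for one transition."""
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    advantage = reward + v_next - v_now   # one-step TD error from the standard critic
    return -beta_h * entropy - math.log(probs[action]) * advantage

# Uniform policy over two actions, positive advantage, entropy bonus switched off.
loss = actor_loss([0.0, 0.0], action=0, reward=1.0, v_next=0.0, v_now=0.0, beta_h=0.0)
```

In a full implementation the logits would come from the policy network conditioned on \(h_t^i\) and \(z_t^i\), and the loss would be averaged over a batch.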

Total Loss: \(L_\text{SMPE} = L_\text{actor} + L_\text{critic} + L_\text{encodings}\)

Key Hyperparameter Settings (using LBF as an example): \(\lambda_\text{rec}=1\), \(\lambda_\text{norm}=0.1\), \(\lambda_\text{KL}=1\), intrinsic reward coefficient \(=0.1\). The ED uses a 3-layer MLP (non-RNN), and the latent variable dimension \(=32\).

Key Experimental Results¶

Main Results¶

Evaluated on three major benchmarks and compared against methods such as MAA2C, COMA, MAPPO, ATM, EOI, EMC, and MASER.

Benchmark/Task	Metric	SMPE²	Strongest Baseline	Notes
MPE Spread-3/4/5/8	Average Episode Reward	Best	MAA2C/ATM	Superiority becomes more pronounced as active agent count increases
LBF 2s-9x9-3p-2f	Average Episode Reward	Fully solved	None (all other methods score 0)	Solves open challenge
LBF 4s-11x11-3p-2f	Average Episode Reward	Fully solved	None (all other methods score 0)	Solves open challenge
LBF 2s-12x12-2p-2f	Average Episode Reward	Best	EOI second best	Faster convergence
LBF 7s-20x20-5p-3f	Average Episode Reward	Consistently leading	MAA2C	Sustained advantage in large grids
RWARE tiny-2ag-hard	Average Episode Reward	Best	EOI	Significant lead
RWARE tiny-4ag-hard	Average Episode Reward	Best	EOI	Significant lead
RWARE small-4ag-hard	Average Episode Reward	Best	MAA2C	EOI completely fails here

Runtime efficiency comparison (LBF 2s-12x12-2p-2f):

Method	Runtime	Relative to SMPE²
MAA2C	37min	0.5x
SMPE²	1h13min	1x
ATM	2h32min	2x
EOI	17h11min	≈14x
MASER	25h8min	≈21x
EMC	29h34min	≈24x

Ablation Study¶

Configuration	Key Performance	Note
SMPE² (Full)	Best	All components cooperate synergistically
no_intr (No intrinsic reward)	Complete failure	State modeling alone is insufficient to discover high-value states under sparse rewards
no_filters (No AM filters)	Significant drop	Redundant information interferes with belief learning
no_kl (No KL regularization)	Obvious drop	Inconsistent beliefs, lower exploration efficiency
no_L2_norm (No norm regularization)	Degraded	\(w^i\) may degenerate to zero
obs_rew (Standard SimHash on raw obs)	Inferior to SMPE²	Hashing \(z^i\) outperforms hashing raw observations
no_critic_w (No second critic)	Degraded	\(w^i\) cannot learn a filter targeted at policy optimization
SMPE_PPO (MAPPO backbone)	Outperforms MAPPO	Flexible framework, transferable to other backbones

Key Findings¶

Exploration and coordination complement each other: Neither state modeling alone (no_intr) nor exploration alone (standard methods) is sufficient; the success of SMPE² lies in the tight coupling of both.
t-SNE Visualization: When \(L_\text{KL}\) is enabled, the beliefs of three agents overlap over a wide area (57.5% consistency). When disabled, they are completely separated (99.3%), demonstrating that KL regularization promotes belief consistency.
Interpretability of the AM Filter: In LBF tasks, the filter automatically learns to mask invisible and irrelevant features (such as the position of food out of view) while retaining information relevant to coordination.
Failure Reasons of EMC and MASER: Both rely heavily on the initial random policy and therefore generate misleading intrinsic rewards or subgoals from low-value data.

Highlights & Insights¶

Unification of State Modelling and Policy Optimization: The paper proves that \(V_{SM}^* = V^*\) (Proposition 2.1). The state modeling objectives therefore do not constrain the optimal policy space, which theoretically guarantees the optimality of the framework.
Implicit Emergence of Adversarial Exploration: Without explicitly designed adversarial mechanisms, encouraging the discovery of novel \(z^i\) naturally forms a mutually beneficial "adversarial" relationship between agents: your new discovery serves as my training fuel.
Elegant Design of Self-Supervised Filtering: The AM filter appears simultaneously in the target and prediction of the loss function, forming a self-consistent self-supervised learning loop.
High Computational Efficiency: Judging by the reported LBF runtimes, SMPE² is roughly 21x faster than MASER (25h8min vs. 1h13min) and 24x faster than EMC (29h34min vs. 1h13min), while taking about twice as long as MAA2C.
Solves Multiple Recognized Open Challenges: Tasks 2s-9x9-3p-2f and 4s-11x11-3p-2f in LBF, which were flagged as open challenges by prior works, are successfully resolved.

Limitations & Future Work¶

Architectural Extension: Currently, the ED uses an MLP. The authors suggest that integrating Transformers into the architecture could further enhance state modeling capabilities.
Scalability: The reconstruction target grows linearly with the number of agents. Although the AM filter mitigates parts of this issue, further verification is still needed in larger-scale scenarios.
Adaptability to Stochastic Environments: Current experiments are based on deterministic dynamics. The paper does not evaluate performance under noisy observations or complex dynamics.
Benchmark Coverage: Evaluation is only performed on MPE, LBF, and RWARE, lacking more complex benchmarks such as SMAC (StarCraft Multi-Agent Challenge).
Communication Setup: Operates under a strict zero-communication assumption without comparing against hybrid methods that incorporate lightweight communication.

Related Work & Insights¶

LIAM (Papoudakis et al., 2021): Reconstructs other agents' information via ED, but its representation is disconnected from the policy. SMPE² resolves this through joint optimization.
SIDE (Xu et al., 2022): Infers states but does not utilize \(z^i\) during execution. SMPE² explicitly leverages belief embeddings to enhance policies.
EMC (Zheng et al., 2021): Curiosity-based exploration reliant on QMIX, which performs very poorly on LBF.
MASER (Jeon et al., 2022): Subgoal generation depends on the quality of the initial policy, yielding misleading subgoals in sparse-reward environments.
EOI (Jiang & Lu, 2021): Encourages exploration via diversity, which is inapplicable to homogeneous agent scenarios (LBF).
MAVEN (Mahajan et al., 2019): Explores joint action spaces but lacks encouragement for discovering novel states.

Insights: The concept of utilizing belief representations to simultaneously serve policy enhancement and exploration-driven mechanisms can be extended to other scenarios requiring "inference of unobserved information," such as single-agent tasks in POMDPs, decentralized communication networks, etc.

Rating¶

Novelty: ⭐⭐⭐⭐ — The combination of the state modeling framework and adversarial exploration is highly creative, though individual components (VAE, SimHash, AM) themselves are not entirely novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Extremely thorough, featuring three major benchmarks, comprehensive ablation studies, t-SNE visualizations, interpretability analysis of the AM filter, and runtime performance comparisons.
Writing Quality: ⭐⭐⭐⭐ — Logically clear with a good balance of theory and intuition, though heavy mathematical notation slightly hinders the readability of certain derivations.
Value: ⭐⭐⭐⭐⭐ — Successfully resolves several recognized open challenges. The proposed method is highly efficient and scalable, with open-sourced code, holding high value for the MARL community.