# Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning
Conference: NeurIPS 2025 | arXiv: 2412.00661 | Code: emiletimothy/Mean-Field-Subsample-Q-Learning | Area: Multi-Agent Reinforcement Learning | Keywords: MARL, mean-field, subsampling, Q-learning, cooperative
## TL;DR
This paper proposes the SUBSAMPLE-MFQ algorithm, which randomly samples \(k\) agents from \(n\) to perform mean-field Q-learning, reducing the sample complexity of multi-agent reinforcement learning from \(\text{poly}(n)\) to \(\text{poly}(k)\). The resulting optimality gap is only \(\tilde{O}(1/\sqrt{k})\) (independent of \(n\)), achieving exponential speedup over standard mean-field MARL when \(k = O(\log n)\).
## Background & Motivation
Background: MARL suffers from the curse of dimensionality — the joint state-action space for \(n\) agents scales as \((|S||A|)^n\), requiring exponential computation for standard Q-learning. Mean-field MARL approximates agent interactions via empirical distributions, reducing complexity to \(O(n^{|S||A|})\) (polynomial), yet this remains intractable for large \(n\).
Limitations of Prior Work: The polynomial complexity of mean-field methods is still prohibitive when \(n\) is very large (e.g., swarm coordination with thousands of robots). Sparse-network methods fail under dense interaction structures.
Key Challenge: How can subpolynomial (even logarithmic) computational complexity in the number of agents \(n\) be achieved while preserving near-optimal performance guarantees?
Key Insight: Exploit the structured reward decomposition in cooperative settings (global + sum of local rewards), and reduce the \(n\)-agent problem to a \(k\)-agent problem via subsampling combined with a randomized execution policy.
Core Idea: A policy is learned offline on \(k\) sampled agents; at execution time, a different random \(k\)-subset is selected at each step. The performance gap of this randomized mixed policy depends only on \(k\), not on \(n\).
## Method
### Overall Architecture
- System Model: 1 global agent + \(n\) homogeneous local agents, with structured reward \(r(s,a) = r_g(s_g, a_g) + \frac{1}{n}\sum_i r_l(s_i, s_g, a_i)\)
- Offline Planning: Randomly sample \(k\) local agents to form a subsystem; apply value iteration on this subsystem to learn \(\hat{Q}_k^{\text{est}}\) and \(\hat{\pi}_k^{\text{est}}\)
- Online Execution (Decentralized): At each step, the global agent randomly samples \(k\) local agents and acts according to the learned \(k\)-subsystem policy; each local agent likewise randomly samples \(k-1\) peers to compute its own action
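The resample-and-execute loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `pi_k_est`, the state encodings, and the `support` of local states are hypothetical stand-ins, and the learned policy is assumed to act on the empirical distribution of the sampled subset.

```python
import random
from collections import Counter

def sample_subset(n, k, rng):
    """Uniformly sample a k-subset of the n local agents (without replacement)."""
    return rng.sample(range(n), k)

def empirical_distribution(states, support):
    """Mean-field statistic: empirical distribution of the sampled local states."""
    counts = Counter(states)
    return {z: counts[z] / len(states) for z in support}

def execute_step(pi_k_est, s_g, local_states, k, support, rng):
    """One execution step: resample a fresh k-subset, summarize it by its
    empirical state distribution, and query the k-subsystem policy."""
    subset = sample_subset(len(local_states), k, rng)
    mu_k = empirical_distribution([local_states[i] for i in subset], support)
    return pi_k_est(s_g, mu_k)
```

Because a fresh subset is drawn at every step, the induced \(n\)-agent policy is the uniform mixture over \(k\)-subsets described below, which is what makes it permutation-invariant.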
### Key Designs
- Subsampled Mean-Field Q-Learning
- Function: Learn the Q-function on the \(k\)-subsystem, with complexity polynomial in \(k\) rather than \(n\)
- Mechanism: Adaptive switching between two regimes — standard Q-learning when \(|Z_l|^k < |Z_l| \cdot k^{|Z_l|}\), mean-field Q-learning otherwise
- Design Motivation: Standard and mean-field Q-learning are each more efficient in different ranges of \(k\); adaptive selection ensures worst-case optimal complexity
- Randomized Execution Policy
- Function: Extend the \(k\)-subsystem policy to an \(n\)-agent policy
- Mechanism: \(\pi_k^{\text{est}}\) is defined as a uniform mixture over all \(\binom{n}{k}\) possible \(k\)-subsets, with one subset sampled randomly at each step
- Key Property: Randomization ensures permutation invariance of the policy, enabling performance analysis via mean-field approximation
- MDP Subsampling Theorem (Core Theoretical Contribution)
- Function: Prove that the performance loss from deploying a \(k\)-subsystem policy on the \(n\)-agent system is \(\tilde{O}(1/\sqrt{k})\)
- Mechanism: Extends the Performance Difference Lemma from single-agent to multi-agent settings, and applies a sampling-without-replacement variant of the DKW inequality to control empirical distribution error
- Key Bound: \(V^{\pi^*}(s) - V^{\pi_k^{\text{est}}}(s) \leq \tilde{O}(\sqrt{(n-k+1)/(nk)})\)
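The adaptive regime choice and the \(k\)-dependence of the key bound are easy to check numerically. A small sketch, where the table sizes are the asymptotic orders quoted above (not exact counts):

```python
import math

def q_table_choice(k, z_l):
    """Adaptive regime selection: standard Q-learning stores a table of size
    ~ |Z_l|^k, the mean-field parameterization one of size ~ |Z_l| * k^|Z_l|;
    whichever is smaller for the current k is used."""
    standard, mean_field = z_l ** k, z_l * k ** z_l
    return ("standard", standard) if standard < mean_field else ("mean-field", mean_field)

def gap_scale(n, k):
    """Leading scale of the subsampling bound O~(sqrt((n - k + 1) / (n k))),
    which behaves like 1/sqrt(k) whenever k << n."""
    return math.sqrt((n - k + 1) / (n * k))
```

For example, with \(|Z_l| = 4\), `q_table_choice(2, 4)` picks the standard table (size 16) while `q_table_choice(10, 4)` picks the mean-field one (size 40,000), matching the switching condition \(|Z_l|^k < |Z_l| \cdot k^{|Z_l|}\).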
### Loss & Training
- Bellman operator iteration: \(\hat{Q}_{k,m}^{t+1} \leftarrow \tilde{\mathcal{T}}_{k,m} \hat{Q}_{k,m}^t\)
- Bellman expectations approximated using \(m\) samples
- Sample count \(m^*\) chosen such that Bellman noise \(\varepsilon_{k,m^*} \leq \tilde{O}(1/\sqrt{k})\)
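The empirical Bellman update in the bullets above could look like the following sketch, where `step_model` is a hypothetical generative model returning a sampled reward and next state (the paper's operator acts on the \(k\)-subsystem; the structure of one backup is the same):

```python
import random

def sampled_bellman_backup(Q, s, a, step_model, actions, gamma, m, rng):
    """One entry of the empirical Bellman operator: the expectation over next
    states is replaced by an average over m sampled transitions, so each
    backup carries noise that shrinks like O(1/sqrt(m))."""
    total = 0.0
    for _ in range(m):
        reward, s_next = step_model(s, a, rng)  # hypothetical generative model
        total += reward + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    return total / m
```

Choosing \(m = m^*\) large enough drives this per-backup noise below the \(\tilde{O}(1/\sqrt{k})\) target, so the sampling error does not dominate the subsampling error.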
## Key Experimental Results
### Main Results
| Setting | Varying \(k\) | Result |
|---|---|---|
| Gaussian Squeeze (\(n\)=100–800) | \(k\) from 5 to \(n\) | Performance gap decreases monotonically; approaches 0 as \(k \to n\) |
| Restricted Exploration | Similar trend | Significantly faster than mean-field Q-learning |
### Complexity Comparison
| Method | Sample Complexity | Optimality Gap |
|---|---|---|
| Standard Q-learning | \(O(\vert Z_l \vert^n)\) | 0 (exact) |
| Mean-field MFQ | \(O(n^{\vert Z_l \vert})\) | 0 (exact) |
| SUBSAMPLE-MFQ (\(k = O(\log n)\)) | \(O((\log n)^{\vert Z_l \vert})\) | \(\tilde{O}(1/\sqrt{k})\) |
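The complexity gap can be made concrete with back-of-the-envelope arithmetic. A rough sketch, where the constant-free costs and the choice \(k = \lceil \log n \rceil\) are illustrative rather than the paper's exact settings:

```python
import math

def mfq_cost(n, z_l):
    """Mean-field MFQ: polynomial in n with exponent |Z_l|."""
    return n ** z_l

def subsample_cost(n, z_l):
    """SUBSAMPLE-MFQ with k = O(log n): polylogarithmic in n."""
    k = max(2, math.ceil(math.log(n)))  # illustrative choice of k
    return k ** z_l
```

With \(n = 10^6\) agents and \(|Z_l| = 4\), the mean-field cost is on the order of \(10^{24}\) while the subsampled cost is about \(14^4 \approx 3.8 \times 10^4\), which is the exponential speedup claimed below.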
### Key Findings
- With \(k = O(\log n)\), this is the first centralized MARL algorithm to achieve sample complexity polylogarithmic in \(n\), an exponential speedup over mean-field methods
- The optimality gap \(\tilde{O}(1/\sqrt{k})\) is entirely independent of \(n\)
## Highlights & Insights
- Elegant Core Idea: The combination of subsampling and randomized execution is both natural and powerful, effectively "compressing" a high-dimensional problem into a lower-dimensional one via statistical sampling
- TV Distance over KL: Choosing total variation distance as the analytical tool is crucial — the DKW inequality yields an \(O(1/\sqrt{k})\) bound under TV distance, whereas KL divergence does not admit a comparable result
- Practical Significance of the \(k\)-\(n\) Tradeoff: Algorithm designers can freely choose \(k\) according to their computational budget, enabling precise control over the efficiency–optimality tradeoff
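The TV-distance intuition is easy to verify empirically: the TV error between a without-replacement \(k\)-subsample and the full population shrinks roughly like \(1/\sqrt{k}\). A quick simulation, assuming an arbitrary finite local-state population (the setup is illustrative, not the paper's experiment):

```python
import random
from collections import Counter

def tv_distance(p, q):
    """Total variation distance between two finite distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(z, 0.0) - q.get(z, 0.0)) for z in keys)

def empirical(states):
    counts = Counter(states)
    return {z: c / len(states) for z, c in counts.items()}

def avg_subsample_tv(population, k, trials, rng):
    """Average TV distance between the full population's state distribution
    and that of a k-subset drawn without replacement (the paper's scheme)."""
    p_full = empirical(population)
    return sum(
        tv_distance(empirical(rng.sample(population, k)), p_full)
        for _ in range(trials)
    ) / trials
```

Running this with, say, 10,000 agents over 5 local states shows the average TV error dropping by roughly 4x when \(k\) grows from 25 to 400, consistent with the \(1/\sqrt{k}\) rate.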
## Limitations & Future Work
- Homogeneity Assumption: All local agents are required to share the same state/action spaces and transition kernels, though this can be partially relaxed via type encoding
- Cooperative Settings Only: The framework does not address competitive or mixed-incentive scenarios
- Primarily Theoretical: Experiments are limited in scale (up to 800 agents) and lack comparison with deep RL baselines
- Exponential Dependence on \(|Z_l|\): When the local state-action space is large, \(k^{|Z_l|}\) can still be substantial even when \(k = O(\log n)\)
## Related Work & Insights
- vs. Mean-Field MFQ (Yang et al. 2018): MFQ has complexity \(\text{poly}(n)\); this paper achieves \(\text{poly}(\log n)\) — an exponential speedup
- vs. Sparse Network Methods (Qu et al. 2020): Sparse methods rely on graph structure and fail under dense interactions; this paper applies to fully connected settings
- vs. Graphon Methods (Cui & Koeppl 2022): Graphon approaches require cut-norm convergence assumptions; this paper does not
## Rating
- Novelty: ⭐⭐⭐⭐⭐ The subsampling idea is concise and powerful; first to achieve polylog complexity in MARL
- Experimental Thoroughness: ⭐⭐⭐ Primarily theoretical; experiments only verify basic trends
- Writing Quality: ⭐⭐⭐⭐ Theorems and algorithms are clearly stated
- Value: ⭐⭐⭐⭐ Significant theoretical contribution to the scalability of MARL
## Additional Notes
- The theoretical analysis framework and technical tools developed in this paper offer insights for adjacent research areas
- The core contribution lies in a deep theoretical understanding that provides a foundation for subsequent practical optimization
- The paper is technically and methodologically complementary to other NeurIPS 2025 papers published concurrently
- The paper's exposition of problem motivation and technical approach is exemplary
- Readers are encouraged to consult the appendix for more complete experimental details and proofs
## Further Reading
- This line of research is closely connected to several active topics in the AI community
- The rigor of the theoretical results provides a solid mathematical foundation for subsequent empirical studies
- The methodology is generalizable to broader problem settings
- Follow-up work from the same research group is worth monitoring
- For readers new to theoretical MARL, the proof sketch section provides an excellent technical roadmap
- From a methodological standpoint, the paper demonstrates how careful mathematical modeling can reduce complex problems to tractable analytical frameworks