
Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning

Conference: NeurIPS 2025 arXiv: 2412.00661 Code: emiletimothy/Mean-Field-Subsample-Q-Learning Area: Multi-Agent Reinforcement Learning Keywords: MARL, mean-field, subsampling, Q-learning, cooperative

TL;DR

This paper proposes the SUBSAMPLE-MFQ algorithm, which randomly samples \(k\) agents from \(n\) to perform mean-field Q-learning, reducing the sample complexity of multi-agent reinforcement learning from \(\text{poly}(n)\) to \(\text{poly}(k)\). The resulting optimality gap is only \(\tilde{O}(1/\sqrt{k})\) (independent of \(n\)), achieving exponential speedup over standard mean-field MARL when \(k = O(\log n)\).

Background & Motivation

Background: MARL suffers from the curse of dimensionality — the joint state-action space for \(n\) agents scales as \((|S||A|)^n\), requiring exponential computation for standard Q-learning. Mean-field MARL approximates agent interactions via empirical distributions, reducing complexity to \(O(n^{|S||A|})\) (polynomial), yet this remains intractable for large \(n\).
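For concreteness (numbers illustrative): with \(|S||A| = 4\) per agent and \(n = 100\) agents,

\[
(|S||A|)^n = 4^{100} \approx 1.6 \times 10^{60}, \qquad n^{|S||A|} = 100^4 = 10^8,
\]

so the mean-field reduction is dramatic, yet the polynomial still grows quickly in \(n\) when \(|S||A|\) is larger.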

Limitations of Prior Work: The polynomial complexity of mean-field methods is still prohibitive when \(n\) is very large (e.g., swarm coordination with thousands of robots). Sparse-network methods fail under dense interaction structures.

Key Challenge: How can subpolynomial (even logarithmic) computational complexity in the number of agents \(n\) be achieved while preserving near-optimal performance guarantees?

Key Insight: Exploit the structured reward decomposition in cooperative settings (global + sum of local rewards), and reduce the \(n\)-agent problem to a \(k\)-agent problem via subsampling combined with a randomized execution policy.

Core Idea: A policy is learned offline on \(k\) sampled agents; at execution time, a different random \(k\)-subset is selected at each step. The performance gap of this randomized mixed policy depends only on \(k\), not on \(n\).
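In symbols (notation illustrative, following the paper's learned subsystem policy \(\hat{\pi}_k^{\text{est}}\)), the deployed policy is the uniform mixture

\[
\pi_k^{\text{est}}(a \mid s) = \binom{n}{k}^{-1} \sum_{\Delta \subseteq [n],\, |\Delta| = k} \hat{\pi}_k^{\text{est}}\big(a \mid s_g, \{s_i\}_{i \in \Delta}\big),
\]

realized at execution time by drawing a single subset \(\Delta\) per step rather than enumerating all \(\binom{n}{k}\) subsets.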

Method

Overall Architecture

  • System Model: 1 global agent + \(n\) homogeneous local agents, with structured reward \(r(s,a) = r_g(s_g, a_g) + \frac{1}{n}\sum_i r_l(s_i, s_g, a_i)\) (a sketch of this decomposition follows this list)
  • Offline Planning: Randomly sample \(k\) local agents to form a subsystem; apply value iteration on this subsystem to learn \(\hat{Q}_k^{\text{est}}\) and \(\hat{\pi}_k^{\text{est}}\)
  • Online Execution (Decentralized): At each step, the global agent randomly samples \(k\) local agents to execute the policy; each local agent randomly samples \(k-1\) peers to execute the policy
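A minimal sketch of the structured reward, assuming tabular states/actions and caller-supplied reward functions; the interface mirrors the formula above but is an assumption, not the paper's code:

```python
def structured_reward(s_g, a_g, local_states, local_actions, r_g, r_l):
    """Structured reward r(s, a) = r_g(s_g, a_g) + (1/n) * sum_i r_l(s_i, s_g, a_i).

    r_g and r_l are environment-supplied callables; this interface is
    illustrative, not the paper's implementation.
    """
    n = len(local_states)
    local_term = sum(r_l(s_i, s_g, a_i)
                     for s_i, a_i in zip(local_states, local_actions)) / n
    return r_g(s_g, a_g) + local_term
```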

Key Designs

  1. Subsampled Mean-Field Q-Learning

    • Function: Learn the Q-function on the \(k\)-subsystem, with complexity polynomial in \(k\) rather than \(n\)
    • Mechanism: Adaptive switching between two regimes — standard Q-learning when \(|Z_l|^k < |Z_l| \cdot k^{|Z_l|}\), mean-field Q-learning otherwise
    • Design Motivation: Standard and mean-field Q-learning are each more efficient in different ranges of \(k\); adaptive selection ensures worst-case optimal complexity
  2. Randomized Execution Policy

    • Function: Extend the \(k\)-subsystem policy to an \(n\)-agent policy
    • Mechanism: \(\pi_k^{\text{est}}\) is defined as a uniform mixture over all \(\binom{n}{k}\) possible \(k\)-subsets, with one subset sampled randomly at each step (see the execution sketch after this list)
    • Key Property: Randomization ensures permutation invariance of the policy, enabling performance analysis via mean-field approximation
  3. MDP Subsampling Theorem (Core Theoretical Contribution)

    • Function: Prove that the performance loss from deploying a \(k\)-subsystem policy on the \(n\)-agent system is \(\tilde{O}(1/\sqrt{k})\)
    • Mechanism: Extends the Performance Difference Lemma from single-agent to multi-agent settings, and applies a sampling-without-replacement variant of the DKW inequality to control empirical distribution error
    • Key Bound: \(V^{\pi^*}(s) - V^{\pi_k^{\text{est}}}(s) \leq \tilde{O}(\sqrt{(n-k+1)/(nk)})\)
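A minimal sketch of the execution-time subsampling and the adaptive regime choice, with illustrative interfaces: `pi_k_est` stands in for the learned \(\hat{\pi}_k^{\text{est}}\) and its signature is an assumption, and `local_states` is assumed to be an integer NumPy array.

```python
import numpy as np

rng = np.random.default_rng(0)

def prefer_mean_field(k: int, z_l: int) -> bool:
    """Adaptive regime choice (Key Design 1): tabular joint Q-learning is
    cheaper while |Z_l|^k < |Z_l| * k^|Z_l|; mean-field wins for larger k."""
    return z_l ** k >= z_l * k ** z_l

def empirical_distribution(states, num_local_states):
    """Mean-field statistic: empirical distribution of the sampled local states."""
    counts = np.bincount(states, minlength=num_local_states)
    return counts / counts.sum()

def randomized_execution_step(s_g, local_states, k, pi_k_est, num_local_states):
    """One step of the randomized execution policy (Key Design 2): draw a fresh
    uniform k-subset of the n local agents, summarize it by its empirical
    distribution, and query the offline-learned k-subsystem policy."""
    n = len(local_states)
    subset = rng.choice(n, size=k, replace=False)  # sampling without replacement
    mf_stat = empirical_distribution(local_states[subset], num_local_states)
    return pi_k_est(s_g, mf_stat), subset
```

Note on the key bound: for \(k \ll n\), \(\sqrt{(n-k+1)/(nk)} \approx \sqrt{1/k}\), recovering the \(\tilde{O}(1/\sqrt{k})\) gap, while at \(k = n\) the bound shrinks to \(1/n\), so the guarantee tightens smoothly as the subsample grows.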

Loss & Training

  • Bellman operator iteration: \(\hat{Q}_{k,m}^{t+1} \leftarrow \tilde{\mathcal{T}}_{k,m} \hat{Q}_{k,m}^t\)
  • Bellman expectations are approximated using \(m\) samples (a sketch follows this list)
  • Sample count \(m^*\) chosen such that Bellman noise \(\varepsilon_{k,m^*} \leq \tilde{O}(1/\sqrt{k})\)
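A minimal sketch of the sampled Bellman sweep, assuming a tabular \(Q\) array indexed as `Q[s, a]`; the sampler `transitions(s, a, size)` returning \(m\) next-state indices and the `reward(s, a)` callable are hypothetical interfaces, not the paper's code.

```python
import numpy as np

def empirical_bellman_update(Q, transitions, reward, gamma, m):
    """One sweep of a sampled Bellman operator in the spirit of the paper's
    T~_{k,m}: each expectation over next states is replaced by an average
    over m sampled transitions."""
    Q_next = np.empty_like(Q)
    for s in range(Q.shape[0]):
        for a in range(Q.shape[1]):
            next_states = transitions(s, a, size=m)  # m sampled next-state indices
            # Empirical Bellman backup: r(s, a) + gamma * E-hat[max_a' Q(s', a')]
            Q_next[s, a] = reward(s, a) + gamma * np.mean(Q[next_states].max(axis=1))
    return Q_next
```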

Key Experimental Results

Main Results

| Setting | Varying \(k\) | Result |
| --- | --- | --- |
| Gaussian Squeeze (\(n\) = 100–800) | \(k\) from 5 to \(n\) | Performance gap decreases monotonically; approaches 0 as \(k \to n\) |
| Restricted Exploration | Similar sweep | Similar trend; significantly faster than mean-field Q-learning |

Complexity Comparison

| Method | Sample Complexity | Optimality Gap |
| --- | --- | --- |
| Standard Q-learning | \(O(\lvert Z_l \rvert^n)\) | 0 (exact) |
| Mean-field MFQ | \(O(n^{\lvert Z_l \rvert})\) | |
| SUBSAMPLE-MFQ (\(k = O(\log n)\)) | \(O((\log n)^{\lvert Z_l \rvert})\) | \(\tilde{O}(1/\sqrt{\log n})\) |

Key Findings

  • With \(k = O(\log n)\), this is the first centralized MARL algorithm to achieve sample complexity polylogarithmic in \(n\), an exponential speedup over mean-field methods
  • The optimality gap \(\tilde{O}(1/\sqrt{k})\) is entirely independent of \(n\)

Highlights & Insights

  • Elegant Core Idea: The combination of subsampling and randomized execution is both natural and powerful, effectively "compressing" a high-dimensional problem into a lower-dimensional one via statistical sampling
    • TV Distance over KL: Choosing total variation distance as the analytical tool is crucial — the DKW inequality yields an \(O(1/\sqrt{k})\) bound under TV distance, whereas KL divergence does not admit a comparable result (the classical inequality is stated after this list)
  • Practical Significance of the \(k\)-\(n\) Tradeoff: Algorithm designers can freely choose \(k\) according to their computational budget, enabling precise control over the efficiency–optimality tradeoff
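For reference, the classical DKW inequality (the paper uses a sampling-without-replacement variant) states that for \(k\) i.i.d. samples with empirical CDF \(\hat{F}_k\) of a distribution with CDF \(F\),

\[
\Pr\Big(\sup_z \big|\hat{F}_k(z) - F(z)\big| > \varepsilon\Big) \le 2 e^{-2 k \varepsilon^2},
\]

so with probability \(1 - \delta\) the empirical-distribution error is at most \(\sqrt{\ln(2/\delta)/(2k)} = O(\sqrt{\log(1/\delta)/k})\).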

Limitations & Future Work

  • Homogeneity Assumption: All local agents are required to share the same state/action spaces and transition kernels, though this can be partially relaxed via type encoding
  • Cooperative Settings Only: The framework does not address competitive or mixed-incentive scenarios
  • Primarily Theoretical: Experiments are limited in scale (up to 800 agents) and lack comparison with deep RL baselines
  • Exponential Dependence on \(|Z_l|\): When the local state-action space is large, \(k^{|Z_l|}\) can still be substantial even when \(k = O(\log n)\)

Comparison with Prior Work

  • vs. Mean-Field MFQ (Yang et al. 2018): MFQ has complexity \(\text{poly}(n)\); this paper achieves \(\text{poly}(\log n)\), an exponential speedup
  • vs. Sparse Network Methods (Qu et al. 2020): Sparse methods rely on graph structure and fail under dense interactions; this paper applies to fully connected settings
  • vs. Graphon Methods (Cui & Koeppl 2022): Graphon approaches require cut-norm convergence assumptions; this paper does not

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The subsampling idea is concise and powerful; first to achieve polylog complexity in MARL
  • Experimental Thoroughness: ⭐⭐⭐ Primarily theoretical; experiments only verify basic trends
  • Writing Quality: ⭐⭐⭐⭐ Theorems and algorithms are clearly stated
  • Value: ⭐⭐⭐⭐ Significant theoretical contribution to the scalability of MARL

Additional Notes

  • The analysis framework (the multi-agent performance difference lemma and the without-replacement DKW bound) may transfer to adjacent problems in theoretical RL
  • The core contribution is a theoretical understanding that lays a foundation for subsequent practical optimization
  • The paper is complementary to concurrent NeurIPS 2025 work
  • The exposition of the problem motivation and technical approach is exemplary
  • The appendix contains the complete proofs and additional experimental details

Further Reading

  • This line of research connects to active topics in scalable multi-agent learning and mean-field approximation
  • The rigor of the theoretical results gives subsequent empirical studies a solid mathematical foundation
  • The methodology may generalize to broader problem settings
  • Follow-up work from the same research group is worth monitoring
  • For readers new to theoretical MARL, the proof sketch section provides an excellent technical roadmap
  • Methodologically, the paper shows how careful mathematical modeling can reduce a complex problem to a tractable analytical framework