HCPO: Hierarchical Conductor-Based Policy Optimization in Multi-Agent Reinforcement Learning¶
Conference: AAAI2026 arXiv: 2511.12123 Area: Reinforcement Learning Keywords: multi-agent RL, cooperative MARL, joint policy optimization, hierarchical framework, trust region
TL;DR¶
This paper proposes HCPO, an algorithm that enhances the expressiveness and exploration efficiency of multi-agent joint policies by introducing a conductor mechanism, constructing a Gaussian mixture model-like joint policy framework, and providing monotonic improvement guarantees for two-level policy updates.
Background & Motivation¶
Efficient exploration is critical for joint policy optimization in cooperative MARL. Existing CTDE paradigms (e.g., MAPPO, QMIX) suffer from two core problems, and prior hierarchical approaches bring limitations of their own:
- Limited joint policy expressiveness: Most methods assume the joint policy factorizes as a product of independent per-agent policies \(\boldsymbol{\pi}(\boldsymbol{a}|s) = \prod_i \pi^i(a^i|s)\), restricting the expressive capacity of the policy space.
- Uncoordinated independent exploration: Agents explore independently, making it difficult to coordinate the discovery of high-value joint policies.
- Limitations of existing hierarchical methods: MAVEN relies on the monotonicity assumption of QMIX; COPA requires communication at execution time; skill discovery methods depend on variational inference.
Method¶
Conductor-Based Joint Policy Framework¶
Inspired by how a coach directs players in a soccer match, the method introduces a centralized conductor that provides a shared instruction \(M\) to the entire team:
- The conductor policy \(w(\cdot|s)\) selects one of \(K\) discrete instructions based on the global state.
- Given instruction \(M\), the joint policy decomposes into a product of conditionally independent policies: \(\boldsymbol{\pi}(\boldsymbol{a}|s,M) = \prod_{i=1}^N \pi^i(a^i|s,M)\).
- The overall structure forms a mixture policy analogous to a Gaussian mixture model, substantially enhancing expressiveness.
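To make the mixture structure concrete, below is a minimal PyTorch sketch of how a discrete-instruction conductor and conditionally independent agent policies compose into a mixture joint policy. This is an illustration with hypothetical names (`ConductorJointPolicy`, `num_instructions`, etc.), not the authors' implementation, and it uses discrete actions for simplicity.

```python
import torch
import torch.nn as nn

class ConductorJointPolicy(nn.Module):
    """Minimal sketch of a conductor-based mixture joint policy.
    All class/attribute names here are hypothetical, not the authors' code."""

    def __init__(self, state_dim, n_agents, n_actions, num_instructions=4, hidden=64):
        super().__init__()
        self.num_instructions = num_instructions
        # Conductor w(M | s): categorical distribution over K discrete instructions.
        self.conductor = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_instructions))
        # Per-agent policies pi^i(a^i | s, M); the instruction enters as a one-hot vector.
        self.agents = nn.ModuleList(
            nn.Sequential(
                nn.Linear(state_dim + num_instructions, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions))
            for _ in range(n_agents))

    def sample(self, state):
        # 1) The conductor samples one shared instruction M ~ w(. | s).
        m = torch.distributions.Categorical(logits=self.conductor(state)).sample()
        m_onehot = nn.functional.one_hot(m, self.num_instructions).float()
        # 2) Given M, agents act conditionally independently:
        #    pi(a | s, M) = prod_i pi^i(a^i | s, M).
        actions = [
            torch.distributions.Categorical(
                logits=agent(torch.cat([state, m_onehot], dim=-1))).sample()
            for agent in self.agents]
        return m, torch.stack(actions, dim=-1)

    def marginal_joint_prob(self, state, actions):
        # Marginal joint policy is a mixture over instructions:
        # pi_mar(a | s) = sum_M w(M | s) * prod_i pi^i(a^i | s, M).
        w = torch.softmax(self.conductor(state), dim=-1)              # (B, K)
        prob = torch.zeros_like(w[..., 0])
        for k in range(self.num_instructions):
            m_onehot = nn.functional.one_hot(
                torch.full_like(actions[..., 0], k), self.num_instructions).float()
            per_agent = torch.ones_like(prob)
            for i, agent in enumerate(self.agents):
                p = torch.softmax(agent(torch.cat([state, m_onehot], dim=-1)), dim=-1)
                per_agent = per_agent * p.gather(-1, actions[..., i:i + 1]).squeeze(-1)
            prob = prob + w[..., k] * per_agent
        return prob
```

The `marginal_joint_prob` computation illustrates why the mixture \(\boldsymbol{\pi}_{\text{mar}}(\boldsymbol{a}|s) = \sum_M w(M|s)\prod_i \pi^i(a^i|s,M)\) enlarges the policy class: the fully factorized independent policy is recovered as the special case \(K=1\).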
Advantage Function Decomposition¶
The joint advantage function is decomposed into a conductor level and an agent level:
- Instruction advantage \(A(M|s)\): evaluates the relative quality of instruction \(M\) over alternatives.
- Joint action advantage \(A(\boldsymbol{a}|s,M)\): evaluates the quality of the joint action given the instruction.
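Under standard definitions of the instruction-conditioned state value \(V(s,M)\), the marginal state value \(V(s) = \mathbb{E}_{M \sim w(\cdot \mid s)}[V(s,M)]\), and the instruction-conditioned action value \(Q(s,M,\boldsymbol{a})\) (my notation, which may differ from the paper's), the two levels compose additively:
\[
A(M \mid s) = V(s, M) - V(s), \qquad
A(\boldsymbol{a} \mid s, M) = Q(s, M, \boldsymbol{a}) - V(s, M),
\]
so \(A(M \mid s) + A(\boldsymbol{a} \mid s, M) = Q(s, M, \boldsymbol{a}) - V(s)\), i.e., the total advantage of selecting instruction \(M\) and then joint action \(\boldsymbol{a}\).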
Two-Level Policy Update¶
- Conductor policy update: The conductor maximizes the instruction advantage subject to a KL divergence constraint on its instruction distribution (see the sketch after this list).
- Sequential agent policy update: For each instruction \(M^j\), agents are updated one by one in a randomly permuted order, using the conditional advantage decomposition (Lemma 2) to split the joint advantage into a sum of per-agent marginal advantages.
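A minimal sketch of the conductor-level update in generic trust-region notation (the paper's exact surrogate, estimator, and constraint form may differ):
\[
w_{k+1} = \arg\max_{w} \; \mathbb{E}_{s}\,\mathbb{E}_{M \sim w(\cdot \mid s)}\big[ A(M \mid s) \big]
\quad \text{s.t.} \quad \mathbb{E}_{s}\big[ D_{\mathrm{KL}}\big( w_k(\cdot \mid s) \,\big\|\, w(\cdot \mid s) \big) \big] \le \delta .
\]
At the agent level, for a random permutation \(i_1, \dots, i_N\) and a given instruction \(M^j\), each agent \(i_m\) in turn maximizes its marginal advantage (conditioned on the agents already updated in the sweep) under an analogous per-agent KL constraint, so that the per-agent improvements sum to the joint advantage as in Lemma 2.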
Decentralized Execution¶
- A centralized conductor is used during training, while each agent additionally maintains a local conductor \(w^i(\cdot|o^i)\) conditioned only on its own observation.
- The centralized conductor's policy is distilled into local conductors via a cross-entropy loss.
- At execution time, each agent relies solely on local observations and its local conductor, requiring no communication.
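As a rough illustration (hypothetical names, not the authors' code), the distillation step can be a cross-entropy loss between the centralized conductor's instruction distribution given the global state and each local conductor's distribution given that agent's observation:

```python
import torch

def distill_local_conductors(central_conductor, local_conductors, states, observations):
    """Cross-entropy distillation sketch: fit each local conductor w^i(. | o^i)
    to the centralized conductor w(. | s). Both modules are assumed to return
    logits over the K discrete instructions; observations has shape (B, N, obs_dim)."""
    with torch.no_grad():
        # Target instruction distribution from the centralized conductor.
        target = torch.softmax(central_conductor(states), dim=-1)          # (B, K)
    loss = 0.0
    for i, local in enumerate(local_conductors):
        log_probs = torch.log_softmax(local(observations[:, i]), dim=-1)   # (B, K)
        # Cross-entropy H(target, local) = -sum_M target(M) * log local(M | o^i)
        loss = loss + (-(target * log_probs).sum(dim=-1)).mean()
    return loss / len(local_conductors)
```

At execution time each agent then samples its instruction from its own \(w^i(\cdot|o^i)\) and acts on local information only, which is what removes the need for communication.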
Theoretical Guarantee¶
The paper proves \(J(\boldsymbol{\pi}_{\text{mar},k+1}) \geq J(\boldsymbol{\pi}_{\text{mar},k})\), i.e., the joint policy performance improves monotonically, without relying on the monotonicity assumption of QMIX.
Key Experimental Results¶
SMAC (StarCraft II)¶
Evaluated on 5 maps (5 seeds), HCPO is the first to reach a 90% win rate on every map, and it does so with the lowest standard deviation.
MA-MuJoCo¶
- HalfCheetah-v2-2×3: Final return is 23.42% higher than that of the second-best algorithm, HAA2C.
- t-SNE visualizations show that HCPO covers a broader state space in early training, confirming superior exploration.
- Walker2d-v2-6×1: Entropy analysis and average nearest-neighbor distance further validate HCPO's exploration advantage.
MPE (Multi-Agent Particle Environment)¶
- Policy improvement is fastest during early training (0–2M steps), indicating high cooperative efficiency.
- HCPO demonstrates greater stability compared to HATRPO and A2PO.
Ablation Study¶
- Removing the conductor leads to lower win rates and slower convergence.
- Choosing the number of instructions \(K\) involves a trade-off between performance and computational cost.
- A random conductor (uniform instruction output) yields significantly degraded performance, validating the effectiveness of the learned instruction distribution.
- The median return of the local conductor approaches that of the centralized conductor.
Highlights & Insights¶
- Mixture policy representation: Modeling the joint policy as a mixture distribution breaks the expressiveness bottleneck of factorized independent policies.
- Strict monotonic improvement guarantee: Theoretical guarantees are established without relying on the QMIX monotonicity assumption.
- Decentralized execution: Policy distillation eliminates the need for communication at execution time.
- Unified framework: Trust region methods, sequential agent updates, and the hierarchical conductor mechanism are integrated into a single coherent framework.
Limitations & Future Work¶
- Applicable only to on-policy algorithms, limiting sample efficiency; the authors plan to incorporate off-policy methods in future work.
- The instruction space is discrete (\(K\) instructions); continuous instruction spaces remain unexplored.
- Conductor distillation introduces additional training overhead.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The conductor-based mixture policy framework is novel in MARL, with complete theoretical derivations.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Full coverage across three major benchmarks (SMAC/MA-MuJoCo/MPE) with detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure and rigorous theoretical derivations, though notation is dense.
- Value: ⭐⭐⭐⭐ — Offers new perspectives on policy expressiveness and coordinated exploration in MARL.