HCPO: Hierarchical Conductor-Based Policy Optimization in Multi-Agent Reinforcement Learning¶
Conference: AAAI2026 arXiv: 2511.12123 Area: Reinforcement Learning Keywords: multi-agent RL, cooperative MARL, joint policy optimization, hierarchical framework, trust region
TL;DR¶
This paper proposes HCPO, an algorithm that enhances the expressiveness and exploration efficiency of multi-agent joint policies by introducing a conductor mechanism, constructing a Gaussian mixture model-like joint policy framework, and providing monotonic improvement guarantees for two-level policy updates.
Background & Motivation¶
Efficient exploration is critical for joint policy optimization in cooperative MARL. Existing CTDE paradigms (e.g., MAPPO, QMIX) suffer from two core problems, and prior hierarchical approaches bring limitations of their own:
- Limited joint policy expressiveness: Most methods assume the joint policy factorizes as a product of independent per-agent policies \(\boldsymbol{\pi}(\boldsymbol{a}|s) = \prod_i \pi^i(a^i|s)\), restricting the expressive capacity of the policy space.
- Uncoordinated independent exploration: Agents explore independently, making it difficult to coordinate the discovery of high-value joint policies.
- Limitations of existing hierarchical methods: MAVEN relies on the monotonicity assumption of QMIX; COPA requires communication at execution time; skill discovery methods depend on variational inference.
Method¶
Conductor-Based Joint Policy Framework¶
Inspired by how a coach directs players in a soccer match, the method introduces a centralized conductor that provides a shared instruction \(M\) to the entire team:
- The conductor policy \(w(\cdot|s)\) selects one of \(K\) discrete instructions based on the global state.
- Given instruction \(M\), the joint policy decomposes into a product of conditionally independent policies: \(\boldsymbol{\pi}(\boldsymbol{a}|s,M) = \prod_{i=1}^N \pi^i(a^i|s,M)\).
- The overall structure forms a mixture policy analogous to a Gaussian mixture model, substantially enhancing expressiveness.
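To make the mixture structure concrete, below is a minimal PyTorch sketch of how a discrete-instruction conductor and conditionally independent agent policies compose into a mixture joint policy. This is an illustration with hypothetical names (`ConductorJointPolicy`, `num_instructions`, etc.), not the authors' implementation, and it uses discrete actions for simplicity.

```python
import torch
import torch.nn as nn

class ConductorJointPolicy(nn.Module):
    """Minimal sketch of a conductor-based mixture joint policy.
    All class/attribute names here are hypothetical, not the authors' code."""

    def __init__(self, state_dim, n_agents, n_actions, num_instructions=4, hidden=64):
        super().__init__()
        self.num_instructions = num_instructions
        # Conductor w(M | s): categorical distribution over K discrete instructions.
        self.conductor = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_instructions))
        # Per-agent policies pi^i(a^i | s, M); the instruction enters as a one-hot vector.
        self.agents = nn.ModuleList(
            nn.Sequential(
                nn.Linear(state_dim + num_instructions, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions))
            for _ in range(n_agents))

    def sample(self, state):
        # 1) The conductor samples one shared instruction M ~ w(. | s).
        m = torch.distributions.Categorical(logits=self.conductor(state)).sample()
        m_onehot = nn.functional.one_hot(m, self.num_instructions).float()
        # 2) Given M, agents act conditionally independently:
        #    pi(a | s, M) = prod_i pi^i(a^i | s, M).
        actions = [
            torch.distributions.Categorical(
                logits=agent(torch.cat([state, m_onehot], dim=-1))).sample()
            for agent in self.agents]
        return m, torch.stack(actions, dim=-1)

    def marginal_joint_prob(self, state, actions):
        # Marginal joint policy is a mixture over instructions:
        # pi_mar(a | s) = sum_M w(M | s) * prod_i pi^i(a^i | s, M).
        w = torch.softmax(self.conductor(state), dim=-1)              # (B, K)
        prob = torch.zeros_like(w[..., 0])
        for k in range(self.num_instructions):
            m_onehot = nn.functional.one_hot(
                torch.full_like(actions[..., 0], k), self.num_instructions).float()
            per_agent = torch.ones_like(prob)
            for i, agent in enumerate(self.agents):
                p = torch.softmax(agent(torch.cat([state, m_onehot], dim=-1)), dim=-1)
                per_agent = per_agent * p.gather(-1, actions[..., i:i + 1]).squeeze(-1)
            prob = prob + w[..., k] * per_agent
        return prob
```

The `marginal_joint_prob` computation illustrates why the mixture \(\boldsymbol{\pi}_{\text{mar}}(\boldsymbol{a}|s) = \sum_M w(M|s)\prod_i \pi^i(a^i|s,M)\) enlarges the policy class: the fully factorized independent policy is recovered as the special case \(K=1\).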
Advantage Function Decomposition¶
The joint advantage function is decomposed into a conductor level and an agent level:
- Instruction advantage \(A(M|s)\): evaluates the relative quality of instruction \(M\) over alternatives.
- Joint action advantage \(A(\boldsymbol{a}|s,M)\): evaluates the quality of the joint action given the instruction.
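Under standard definitions of the instruction-conditioned state value \(V(s,M)\), the marginal state value \(V(s) = \mathbb{E}_{M \sim w(\cdot \mid s)}[V(s,M)]\), and the instruction-conditioned action value \(Q(s,M,\boldsymbol{a})\) (my notation, which may differ from the paper's), the two levels compose additively:
\[
A(M \mid s) = V(s, M) - V(s), \qquad
A(\boldsymbol{a} \mid s, M) = Q(s, M, \boldsymbol{a}) - V(s, M),
\]
so \(A(M \mid s) + A(\boldsymbol{a} \mid s, M) = Q(s, M, \boldsymbol{a}) - V(s)\), i.e., the total advantage of selecting instruction \(M\) and then joint action \(\boldsymbol{a}\).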
Two-Level Policy Update¶
- Conductor policy update: The conductor maximizes the instruction advantage subject to a KL divergence constraint on its instruction distribution (see the sketch after this list).
- Sequential agent policy update: For each instruction \(M^j\), agents are updated one by one in a randomly permuted order, using the conditional advantage decomposition (Lemma 2) to split the joint advantage into a sum of per-agent marginal advantages.
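A minimal sketch of the conductor-level update in generic trust-region notation (the paper's exact surrogate, estimator, and constraint form may differ):
\[
w_{k+1} = \arg\max_{w} \; \mathbb{E}_{s}\,\mathbb{E}_{M \sim w(\cdot \mid s)}\big[ A(M \mid s) \big]
\quad \text{s.t.} \quad \mathbb{E}_{s}\big[ D_{\mathrm{KL}}\big( w_k(\cdot \mid s) \,\big\|\, w(\cdot \mid s) \big) \big] \le \delta .
\]
At the agent level, for a random permutation \(i_1, \dots, i_N\) and a given instruction \(M^j\), each agent \(i_m\) in turn maximizes its marginal advantage (conditioned on the agents already updated in the sweep) under an analogous per-agent KL constraint, so that the per-agent improvements sum to the joint advantage as in Lemma 2.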
Decentralized Execution¶
- A centralized conductor is used during training, while each agent additionally maintains a local conductor \(w^i(\cdot|o^i)\) conditioned only on its own observation.
- The centralized conductor's policy is distilled into local conductors via a cross-entropy loss.
- At execution time, each agent relies solely on local observations and its local conductor, requiring no communication.
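As a rough illustration (hypothetical names, not the authors' code), the distillation step can be a cross-entropy loss between the centralized conductor's instruction distribution given the global state and each local conductor's distribution given that agent's observation:

```python
import torch

def distill_local_conductors(central_conductor, local_conductors, states, observations):
    """Cross-entropy distillation sketch: fit each local conductor w^i(. | o^i)
    to the centralized conductor w(. | s). Both modules are assumed to return
    logits over the K discrete instructions; observations has shape (B, N, obs_dim)."""
    with torch.no_grad():
        # Target instruction distribution from the centralized conductor.
        target = torch.softmax(central_conductor(states), dim=-1)          # (B, K)
    loss = 0.0
    for i, local in enumerate(local_conductors):
        log_probs = torch.log_softmax(local(observations[:, i]), dim=-1)   # (B, K)
        # Cross-entropy H(target, local) = -sum_M target(M) * log local(M | o^i)
        loss = loss + (-(target * log_probs).sum(dim=-1)).mean()
    return loss / len(local_conductors)
```

At execution time each agent then samples its instruction from its own \(w^i(\cdot|o^i)\) and acts on local information only, which is what removes the need for communication.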
Theoretical Guarantee¶
The paper proves \(J(\boldsymbol{\pi}_{\text{mar},k+1}) \geq J(\boldsymbol{\pi}_{\text{mar},k})\), i.e., the joint policy performance improves monotonically, without relying on the monotonicity assumption of QMIX.
Key Experimental Results¶
SMAC (StarCraft II)¶
Evaluated on 5 maps (5 seeds), HCPO is the first to reach a 90% win rate on every map, and it does so with the lowest standard deviation.
MA-MuJoCo¶
- HalfCheetah-v2-2×3: Final return is 23.42% higher than that of the second-best algorithm, HAA2C.
- t-SNE visualizations show that HCPO covers a broader state space in early training, confirming superior exploration.
- Walker2d-v2-6×1: Entropy analysis and average nearest-neighbor distance further validate HCPO's exploration advantage.
MPE (Multi-Agent Particle Environment)¶
- Policy improvement is fastest during early training (0–2M steps), indicating high cooperative efficiency.
- HCPO demonstrates greater stability compared to HATRPO and A2PO.
Ablation Study¶
- Removing the conductor leads to lower win rates and slower convergence.
- Choosing the number of instructions \(K\) involves a trade-off between performance and computational cost.
- A random conductor (uniform instruction output) yields significantly degraded performance, validating the effectiveness of the learned instruction distribution.
- The median return of the local conductor approaches that of the centralized conductor.
Highlights & Insights¶
- Mixture policy representation: Modeling the joint policy as a mixture distribution breaks the expressiveness bottleneck of factorized independent policies.
- Strict monotonic improvement guarantee: Theoretical guarantees are established without relying on the QMIX monotonicity assumption.
- Decentralized execution: Policy distillation eliminates the need for communication at execution time.
- Unified framework: Trust region methods, sequential agent updates, and the hierarchical conductor mechanism are integrated into a single coherent framework.
Limitations & Future Work¶
- Applicable only to on-policy algorithms, limiting sample efficiency; the authors plan to incorporate off-policy methods in future work.
- The instruction space is discrete (\(K\) instructions); continuous instruction spaces remain unexplored.
- Conductor distillation introduces additional training overhead.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The conductor-based mixture policy framework is novel in MARL, with complete theoretical derivations.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Full coverage across three major benchmarks (SMAC/MA-MuJoCo/MPE) with detailed ablations.
- Writing Quality: ⭐⭐⭐⭐ — Clear structure and rigorous theoretical derivations, though notation is dense.
- Value: ⭐⭐⭐⭐ — Offers new perspectives on policy expressiveness and coordinated exploration in MARL.