SYMPHONY: Synergistic Multi-agent Planning with Heterogeneous Language Model Assemblies¶
Conference: NeurIPS 2025
arXiv: 2601.22623
Code: https://github.com/ZHUWEI-hub/SYMPHONY
Area: LLM/NLP
Keywords: multi-agent planning, MCTS, heterogeneous models, LLM collaboration, tree search
TL;DR¶
This paper proposes SYMPHONY, an MCTS-based multi-agent planning framework that leverages diversity-driven search over a heterogeneous LLM pool, UCB-based adaptive scheduling, entropy-modulated confidence scoring, and pool-level memory sharing to substantially improve planning diversity and efficiency.
Background & Motivation¶
Combining LLMs with MCTS for complex task planning has attracted significant attention (RAP, LATS, MASTER, etc.), yet existing methods almost exclusively adopt a single-model paradigm: repeatedly querying the same LLM to generate search branches. The fundamental problem is that multiple samples from a single LLM tend to be highly similar, reflecting the same dominant reasoning pattern, which leads to:
- Search trees populated with redundant, near-identical branches
- Limited exploration capacity and susceptibility to local optima
- Additional sampling and token overhead to cover the solution space
Key Challenge: MCTS requires diverse rollouts for effective exploration, but the intra-model variability of a single LLM is insufficient to support such diversity.
Key Insight: The paper replaces the single model with a pool of heterogeneous LLMs (differing in pretraining origin and reasoning style), so branch diversity comes from model-level variation rather than repeated sampling of one model. Adaptive scheduling and collaborative memory are introduced to keep the pool efficiently coordinated.
Method¶
Overall Architecture¶
SYMPHONY introduces four key components into the standard MCTS framework: (1) a heterogeneous agent pool for diverse branch generation; (2) a UCB-based scheduling policy for dynamic agent selection; (3) Entropy-Modulated Confidence Scoring (EMCS) to calibrate node value estimation; and (4) pool-level memory sharing to enable cross-agent reflective learning.
Key Designs¶
**Heterogeneous Agent Pool:**
- Maintains \(\mathcal{M}^{(k)} = \{M_1^{(k)}, \cdots, M_n^{(k)}\}\), a pool of heterogeneous LLMs
- SYMPHONY-S (consumer-grade hardware): Qwen2.5-7B + Mistral-7B + Llama-3.1-8B
- SYMPHONY-L (API-level): GPT-4 + Qwen-Max + DeepSeek-V3
- Unified input/output interface supporting modular replacement
- Theoretically proven: sampling from the ensemble with non-zero probability yields strictly lower expected error than deterministically selecting any single agent
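The unified interface can be sketched as follows. This is an illustrative assumption, not the paper's actual code: the `Agent` class and its fields are hypothetical names, and `generate_fn` stands in for a real model backend (local 7B/8B weights or an API client).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of a unified agent interface; `generate_fn` is a
# placeholder for an actual model call (local inference or API).
@dataclass
class Agent:
    name: str
    generate_fn: Callable[[str], str]
    q_sum: float = 0.0  # cumulative reward (consumed by the scheduler)
    n_calls: int = 0    # number of times this agent was invoked

    def generate(self, prompt: str) -> str:
        self.n_calls += 1
        return self.generate_fn(prompt)

# A SYMPHONY-S style pool: three heterogeneous open-source models behind
# one interface, so any member can be swapped without touching the search.
pool = [
    Agent("Qwen2.5-7B", lambda p: f"[Qwen2.5-7B] {p}"),
    Agent("Mistral-7B", lambda p: f"[Mistral-7B] {p}"),
    Agent("Llama-3.1-8B", lambda p: f"[Llama-3.1-8B] {p}"),
]
```

Because every agent exposes the same `generate` signature, the MCTS loop never needs to know which model it is talking to, which is what makes modular replacement possible.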
**UCB Adaptive Scheduling:**
- Models agent selection as a multi-armed bandit problem
- \(\text{UCB}(M_i^{(k)}) = \bar{Q}(M_i^{(k)}) + \alpha \cdot \sqrt{\frac{\ln N_{\text{total}}}{N(M_i^{(k)}) + 1}}\), where \(\bar{Q}\) is the agent's mean reward, \(N\) its selection count, and \(N_{\text{total}}\) the total number of selections
- Balances exploitation of high-performing agents and exploration of underutilized ones
- Applied across all three MCTS phases: expansion, evaluation, and reflection
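The scheduling rule above can be sketched in a few lines. Each agent is represented here as a plain dict with a cumulative reward `q_sum` and call count `n`; these field names are illustrative, not from the paper.

```python
import math

def ucb_select(pool, alpha=1.0):
    """Pick the agent maximizing mean reward plus an exploration bonus."""
    n_total = max(1, sum(a["n"] for a in pool))
    def ucb(a):
        mean_q = a["q_sum"] / a["n"] if a["n"] > 0 else 0.0
        # The +1 in the denominator keeps the bonus finite for unused agents.
        return mean_q + alpha * math.sqrt(math.log(n_total) / (a["n"] + 1))
    return max(pool, key=ucb)

pool = [
    {"name": "Qwen2.5-7B",   "q_sum": 3.0, "n": 5},
    {"name": "Mistral-7B",   "q_sum": 0.0, "n": 0},  # not yet tried
    {"name": "Llama-3.1-8B", "q_sum": 1.0, "n": 4},
]
chosen = ucb_select(pool)  # the untried agent wins on its exploration bonus
```

Note how the untried agent is selected despite a zero mean reward: its exploration term dominates, which is exactly the exploitation/exploration trade-off the bullet describes.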
**Pool-Level Memory Sharing:**
- Failed trajectories trigger UCB-selected agents to generate natural-language reflections \(\mathcal{R}_i^{(k)}\)
- Reflections are broadcast to all agents and injected into prompts as shared memory blocks
- A fixed-size buffer with FIFO eviction manages memory capacity
- Achieves behavioral adaptation through prompt-level memory without any parameter updates
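A minimal sketch of the shared memory, assuming an illustrative buffer size and prompt format (the class and method names are my own, not the paper's):

```python
from collections import deque

class SharedMemory:
    """Fixed-size FIFO buffer of natural-language reflections,
    injected into every agent's prompt as a shared memory block."""

    def __init__(self, capacity=8):
        self.buffer = deque(maxlen=capacity)  # oldest entry evicted when full

    def add_reflection(self, agent_name, reflection):
        self.buffer.append(f"[{agent_name}] {reflection}")

    def as_prompt_block(self):
        if not self.buffer:
            return ""
        return "Lessons from earlier failed attempts:\n" + "\n".join(self.buffer)

mem = SharedMemory(capacity=2)
mem.add_reflection("Qwen2.5-7B", "Verify entity dates before answering.")
mem.add_reflection("Mistral-7B", "Search with fewer keywords first.")
mem.add_reflection("Llama-3.1-8B", "Avoid repeating a failed query.")  # evicts the oldest
```

Since the buffer only ever touches prompts, every agent "learns" from every other agent's failures without any parameter updates, matching the bullet above.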
**Entropy-Modulated Node Evaluation (EMCS):**
- Agents evaluate nodes to produce a value \(Z(s_t) \in [0,1]\) and confidence \(C(s_t) \in (0,1)\)
- Uncertain predictions are penalized via Bernoulli entropy: \(R(s_t) = Z(s_t) \cdot (1 - E(s_t))\)
- Where \(E(s_t) = -C(s_t)\ln C(s_t) - (1-C(s_t))\ln(1-C(s_t))\)
- Penalty is maximized at \(C=0.5\) (maximum uncertainty), preserving high-confidence evaluations
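The EMCS formulas above translate directly into code. The clamp away from 0 and 1 is my own numerical safeguard for the logarithm, not part of the paper's definition:

```python
import math

def bernoulli_entropy(c):
    """E(C) = -C ln C - (1-C) ln(1-C), maximal (ln 2) at C = 0.5."""
    c = min(max(c, 1e-9), 1 - 1e-9)  # clamp away from 0/1 for the log
    return -c * math.log(c) - (1 - c) * math.log(1 - c)

def emcs(z, c):
    """R = Z * (1 - E(C)): damp a node value by the evaluator's uncertainty."""
    return z * (1 - bernoulli_entropy(c))
```

For a fixed value \(Z = 0.8\), a confident evaluation (\(C = 0.95\)) keeps most of its score, while a maximally uncertain one (\(C = 0.5\)) is damped to roughly \(0.8 \cdot (1 - \ln 2) \approx 0.25\), illustrating the penalty peak at \(C = 0.5\).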
Loss & Training¶
- No training is required; SYMPHONY is a purely inference-time collaboration framework.
- Key hyperparameters include the UCB exploration coefficient \(\alpha\), MCTS expansion width \(n\), and search budget \(K\).
Key Experimental Results¶
Main Results¶
| Task | Metric | SYMPHONY-S | SYMPHONY-L | LATS | MASTER |
|---|---|---|---|---|---|
| HotpotQA | EM | 0.59 | 0.79 | 0.71 | 0.76 |
| WebShop | Score/SR | 0.82/0.56 | 0.88/0.72 | 0.76/0.38 | 0.80/– |
| MBPP (Python) | Pass@1 | 0.927 | 0.965 | 0.811 | 0.910 |
| MBPP (Rust) | Pass@1 | 0.946 | 0.974 | – | – |
Ablation Study¶
| Configuration | HotpotQA (EM) | WebShop (SR) | MBPP (Pass@1) |
|---|---|---|---|
| SYMPHONY-S (full) | 0.59 | 0.56 | 0.927 |
| w/o Agent Scheduling | 0.51 | 0.48 | 0.906 |
| w/o Memory Sharing | 0.45 | 0.46 | 0.871 |
| w/o EMCS | 0.51 | 0.49 | 0.892 |
Key Findings¶
- Diversity drives performance: The three-model pool produces four distinct branches in over 80% of MBPP expansions (4-Unique rate), versus under 20% for a single model, and this diversity translates into performance gains exceeding 30%.
- Search efficiency: SYMPHONY-L requires only 9.47 node expansions on HotpotQA, versus 66.65 for LATS@50—a 7× improvement in efficiency.
- Cost optimization: GPT-4 accounts for only 40% of calls in SYMPHONY-L, yet outperforms the GPT-4-only baseline.
- Consumer-grade feasibility: SYMPHONY-S (purely open-source 7B/8B models) surpasses single-model GPT-4 baselines on multiple tasks.
Highlights & Insights¶
- Heterogeneous > Homogeneous: Even a combination of different open-source models at the same scale outperforms repeated sampling from a single strong model—model-level diversity is an underexploited resource.
- Decentralized design for memory sharing: Unlike multi-agent systems requiring explicit communication protocols, SYMPHONY achieves lightweight knowledge propagation through natural-language reflections.
- Elegance of EMCS: Calibrating LLM self-assessment scores via Bernoulli entropy from information theory is both principled and effective.
Limitations & Future Work¶
- The selection of the heterogeneous model pool lacks systematic guidance and currently relies on empirical choice.
- The search budget \(K\) and expansion width \(n\) require manual tuning.
- The FIFO memory eviction strategy is relatively simple and may discard important reflections.
- Robustness under adversarial or high-noise environments has not been thoroughly validated.
Related Work & Insights¶
- vs. MASTER (Gan et al.): MASTER constructs multiple agents from the same LLM and modifies the UCT formula, whereas SYMPHONY fundamentally increases diversity through a heterogeneous model pool.
- vs. LATS (Zhou et al.): LATS performs single-model search; SYMPHONY combines multi-model search with memory sharing to achieve superior results under a smaller budget.
- vs. AgentCoder (Huang et al.): AgentCoder employs fixed role assignments (programmer/tester); SYMPHONY's adaptive scheduling offers greater flexibility.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combination of a heterogeneous agent pool and UCB scheduling is novel in the MCTS-LLM setting.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three diverse task types, two configurations, comprehensive ablation and efficiency analyses.
- Writing Quality: ⭐⭐⭐⭐ Framework is clearly presented, though the notation is dense.
- Value: ⭐⭐⭐⭐⭐ Provides a principled approach to combining multiple LLMs for planning, with strong accessibility on consumer-grade hardware.