# Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies
**Conference:** ICLR 2026 · **arXiv:** 2502.02533 · **Code:** To be confirmed · **Area:** Agent · **Keywords:** multi-agent system, prompt optimization, topology search, automated MAS design, workflow optimization
## TL;DR
This paper systematically analyzes the respective contributions of prompt design and topology design in multi-agent systems (MAS), finding that prompt optimization is the single most critical factor: a single agent with optimized prompts can outperform complex multi-agent topologies. The paper proposes Mass, a three-stage framework (block-level prompt → topology → workflow-level prompt) that achieves state-of-the-art performance across 8 benchmarks.
## Background & Motivation
- **Background:** Multi-agent systems (MAS) organize multiple LLM agents through topologies such as Debate, Reflect, and Aggregate. Recent work has explored automated MAS design methods, including ADAS and AFlow.
- **Limitations of Prior Work:** It remains unclear whether MAS performance gains stem from multi-agent topologies or from better prompts. Many complex topologies actually degrade performance, yet the underlying reasons are not well understood.
- **Key Challenge:** The benefit of increasing agent count and topology complexity is uncertain: sometimes helpful, sometimes harmful.
- **Goal:** ① Quantify the relative contributions of prompts vs. topologies; ② design a unified automated framework that jointly optimizes both.
- **Key Insight:** Controlled-variable analysis: first optimize prompts in isolation, then layer topology search on top.
- **Core Idea:** Prompt optimization >> topology selection; however, joint optimization of both > either alone.
## Method
### Overall Architecture
Mass performs three-stage alternating optimization: ① Block-level prompt optimization (independently optimizing the instruction and exemplar for each agent module) → ② Workflow topology optimization (pruning the search space based on the incremental influence of each module) → ③ Workflow-level prompt optimization (globally and jointly optimizing prompts over the selected optimal topology).
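The three-stage loop above can be sketched as a toy Python program. The module names, prompt pools, and the `score()` function below are illustrative stand-ins for a real validation-set evaluation, not the paper's actual API:

```python
import itertools

# Toy sketch of the three-stage Mass loop. Module names, prompt pools,
# and score() are made-up stand-ins, not the paper's implementation.
MODULES = ["aggregate", "reflect", "debate"]
PROMPT_POOL = {m: [f"{m}-p{i}" for i in range(3)] for m in MODULES}

def score(topology, prompts):
    """Hypothetical validation accuracy for a (topology, prompts) pair."""
    base = {"aggregate": 0.70, "reflect": 0.68, "debate": 0.73}
    s = sum(base[m] for m in topology) / len(topology)
    # pretend prompt variant 2 is best for every module
    return s + 0.02 * sum(prompts[m].endswith("2") for m in topology)

# Stage 1: block-level prompt optimization, one module at a time
prompts = {m: max(PROMPT_POOL[m], key=lambda p, m=m: score((m,), {m: p}))
           for m in MODULES}

# Stage 2: topology search over module subsets, reusing Stage-1 prompts
candidates = [c for r in range(1, len(MODULES) + 1)
              for c in itertools.combinations(MODULES, r)]
topology = max(candidates, key=lambda t: score(t, prompts))

# Stage 3: workflow-level joint prompt optimization on the chosen topology
combos = itertools.product(*(PROMPT_POOL[m] for m in topology))
best = max(combos, key=lambda c: score(topology, dict(zip(topology, c))))
prompts.update(dict(zip(topology, best)))
print(topology, prompts)
```

The point of the sketch is the ordering: Stage 2 never evaluates a topology with unoptimized prompts, and Stage 3 re-searches prompts only within the winning topology.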
### Key Designs
- **Block-level Prompt Optimization (Warm-up Stage):**
    - Independently performs joint instruction + exemplar optimization for each agent module
    - Iteratively refines prompts using validation-set feedback
    - Serves as "pre-training" for topology search, ensuring prompt quality for each module
    - Design motivation: experiments show that prompt-optimized single agents already outperform complex topologies such as SC, Reflect, and Debate
- **Workflow Topology Optimization:**
    - Computes the incremental influence \(I_{a_i}\) of each module
    - Prunes the search space via softmax-based probability sampling
    - Evaluates candidate topologies using the prompts optimized in Stage 1
    - Finding: not all topologies contribute positively; e.g., on HotpotQA, only Debate yields a ~3% gain
- **Workflow-level Prompt Optimization:**
    - After the optimal topology is determined, jointly optimizes the prompts of all agents globally
    - Accounts for inter-agent interaction effects (i.e., how information flow within the topology influences prompt design)
    - Performs fine-grained adjustment to adapt the prompts to the final topology
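The incremental-influence pruning in Stage 2 can be illustrated with toy numbers. Here the influence of a block is taken as the ratio of the validation score with the block to the score without it, and the above-uniform pruning threshold is an assumption made for illustration, not the paper's exact formulation:

```python
import math

# Toy illustration of incremental-influence pruning. The influence of a
# block a is taken as score(base + a) / score(base); the scores and the
# above-uniform threshold are assumptions, not the paper's exact rule.
base_score = 0.70                          # workflow without the block
score_with = {"aggregate": 0.72, "reflect": 0.65, "debate": 0.78}

influence = {a: s / base_score for a, s in score_with.items()}

# softmax over influences -> per-block sampling probability
z = sum(math.exp(i) for i in influence.values())
probs = {a: math.exp(i) / z for a, i in influence.items()}

# prune: keep only blocks sampled more often than uniform
uniform = 1 / len(probs)
search_space = [a for a, p in probs.items() if p > uniform]
print(probs, search_space)
```

With these made-up scores, the low-influence Reflect block falls below the uniform probability and is dropped before the topology search begins.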
## Key Experimental Results
### Main Results (8 Benchmarks)
| Method | MATH | HotpotQA | MMLU | Average |
|---|---|---|---|---|
| SC (Self-Consistency) | Baseline | Baseline | Baseline | Baseline |
| Reflect | Slightly higher | Slightly lower | Slightly higher | Mixed |
| Debate | Slightly higher | +3% | Slightly lower | Mixed |
| ADAS | High | High | High | Strong baseline |
| Mass | Highest | Highest | Highest | SOTA |
### Ablation Study
| Comparison | Conclusion |
|---|---|
| Single agent + prompt optimization vs. multi-agent (no prompt optimization) | Single agent is superior |
| Mass (3-stage) vs. prompt-only vs. topology-only | Mass is significantly best |
| Transfer from Gemini 1.5 Pro → Claude 3.5 Sonnet | Findings transfer across models |
### Key Findings
- A prompt-optimized single agent already outperforms SC/Reflect/Debate, challenging the intuition that "more agents are always better"
- Mass achieves SOTA on all 8 benchmarks, significantly outperforming automated baselines such as ADAS and AFlow
- Not all topologies have a positive impact—in approximately 50% of cases, additional topology components actually hurt performance
- Findings transfer across model families (Gemini → Claude → Mistral)
## Highlights & Insights
- "Prompt > Topology" is the central finding, an important corrective for the MAS community: optimize prompts before pursuing complex topologies
- The three-stage alternating optimization design is well-motivated: warm up first, then search, then fine-tune, avoiding cold-start issues
- Incremental-influence pruning effectively reduces the topology search space
- The approach generalizes to any MAS framework; Mass's methodology is not tied to specific topology types
## Limitations & Future Work
- The search space relies on predefined building blocks (Aggregate/Reflect/Debate), limiting the discovery of arbitrary structures
- The optimization process requires validation-set feedback, and computational cost grows with the number of agents
- Topology construction rules follow a fixed order, constraining the discovery of unconventional topologies
- Dynamic topology selection at inference time (e.g., selecting topology based on input difficulty) is not considered
## Related Work & Insights
- vs. ADAS: ADAS searches over agent architectures; Mass jointly optimizes prompts and topology
- vs. AFlow: AFlow searches workflows via code generation; Mass optimizes via validation-set feedback
- vs. DSPy: DSPy optimizes prompt pipelines; Mass simultaneously optimizes topology
- Insight: The performance bottleneck in MAS may lie not in topology complexity but in the prompt quality of individual agents
## Supplementary Discussion
### Why Do Complex Topologies Often Fail?
Multi-agent collaboration introduces additional "communication overhead": each agent's output may contain noise or irrelevant information, which accumulates and ultimately disrupts the final decision. The Debate topology is effective on HotpotQA (where multi-perspective discussion aids fact verification) but degrades performance on mathematical tasks (where reasoning is deterministic and debate introduces unwanted variance). This suggests that topology selection should be task-dependent rather than following a "one topology fits all" approach.
### Necessity of Three-Stage Optimization
Experiments demonstrate that prompt-only optimization or topology-only search are both inferior to the full three-stage joint optimization.
The key reason is the interaction effect between prompts and topology: the optimal prompt may differ across topologies, and the optimal topology likewise depends on prompt quality. This establishes MAS design as a joint optimization problem that cannot be decomposed into independent subproblems.
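A grid of made-up accuracies makes the interaction effect concrete: when the best prompt under the default topology differs from the best prompt under an alternative topology, a greedy prompt-then-topology decomposition misses the joint optimum. The topology names, prompt labels, and numbers below are purely illustrative:

```python
# Made-up validation accuracies illustrating the prompt-topology
# interaction: the best prompt under the default topology ("single")
# is not the best prompt under "debate", so greedy decomposition
# misses the joint optimum.
acc = {("single", "p1"): 0.80, ("single", "p2"): 0.78,
       ("debate", "p1"): 0.79, ("debate", "p2"): 0.85}

# greedy: pick the prompt under the default topology, then the topology
prompt = max(["p1", "p2"], key=lambda p: acc[("single", p)])    # "p1"
topo = max(["single", "debate"], key=lambda t: acc[(t, prompt)])
greedy = acc[(topo, prompt)]

# joint: search both axes together
joint = max(acc.values())
print(greedy, joint)   # greedy stops at 0.80; joint reaches 0.85
```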
## Rating
- Novelty: ⭐⭐⭐⭐ The "Prompt > Topology" finding is valuable; the three-stage design is well-reasoned
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 8 benchmarks, cross-model validation, comprehensive ablations
- Writing Quality: ⭐⭐⭐⭐ Controlled variable analysis is clearly presented
- Value: ⭐⭐⭐⭐ Offers direct practical guidance for MAS design