BAMAS: Structuring Budget-Aware Multi-Agent Systems¶
Conference: AAAI 2026 | arXiv: 2511.21572 | Code: https://github.com/chunfenri/BAMAS | Area: Reinforcement Learning | Keywords: Budget-Aware, Multi-Agent Collaboration, Integer Linear Programming, Topology Selection, Reinforcement Learning
TL;DR¶
This paper proposes the BAMAS framework, which employs Integer Linear Programming (ILP) to select the optimal LLM combination under budget constraints, and uses a reinforcement learning policy to choose the best collaboration topology (Linear/Star/Feedback/Planner-Driven). BAMAS achieves accuracy comparable to state-of-the-art multi-agent systems on GSM8K, MBPP, and MATH, while reducing costs by up to 86%.
Background & Motivation¶
Background: LLM-based multi-agent systems (AutoGen, MetaGPT, ChatDev) leverage multi-agent collaboration to handle complex tasks, but primarily focus on maximizing performance with little regard for cost control. A single task may require dozens of LLM calls, and costs grow unpredictably with collaboration topology and reasoning depth.
Limitations of Prior Work: (1) Existing frameworks treat cost as an afterthought, lacking proactive budget management; (2) users cannot specify a budget ceiling to constrain system behavior; (3) different topologies suit different tasks and budget levels, yet existing systems use fixed topologies without adaptive adjustment.
Key Challenge: There exists a fundamental trade-off between performance and cost — using stronger LLMs and more complex collaboration topologies improves performance but dramatically increases cost. Finding the optimal LLM allocation and collaboration strategy under a given budget remains an open challenge.
Goal: Given a task, a pool of available LLMs, and a budget ceiling, how can one automatically construct a multi-agent system with optimal performance?
Key Insight: The problem is decomposed into two optimizable sub-problems — LLM selection (combinatorial optimization → ILP) and topology selection (policy learning → RL), each solved with the most appropriate optimization method.
Core Idea: Use ILP for budget-constrained LLM selection, and RL for task- and budget-adaptive topology selection, achieving tunable cost–performance trade-offs.
Method¶
Overall Architecture¶
BAMAS operates in three stages: (1) Budget-Constrained LLM Configuration — ILP selects the optimal LLM subset \(\mathcal{P}\) from the LLM pool within budget \(B\); (2) Collaboration Topology Selection — an RL policy \(\pi_\theta\) chooses the best topology \(t\) based on task description and budget; (3) Agent Instantiation — LLMs from \(\mathcal{P}\) are assigned roles (executor/reviewer/planner) according to topology \(t\) and the task is executed.
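Conceptually, the three-stage flow might look like the following minimal sketch; the `Agent` dataclass, the role mapping, and all function names here are illustrative assumptions, not the paper's actual API:

```python
from dataclasses import dataclass
from typing import Callable, List

TOPOLOGIES = ("linear", "star", "feedback", "planner_driven")

@dataclass
class Agent:
    model: str  # LLM name drawn from the ILP-selected subset P
    role: str   # executor / reviewer / planner, assigned per topology

def run_bamas(task: str, llm_pool: List[str], budget: float,
              ilp_select: Callable, topology_policy: Callable,
              execute: Callable) -> str:
    # Stage 1: budget-constrained LLM configuration via ILP.
    selected = ilp_select(llm_pool, budget)
    # Stage 2: the RL policy maps (task, budget) to one of the 4 topologies.
    topology = topology_policy(task, budget)
    assert topology in TOPOLOGIES
    # Stage 3: instantiate role-bearing agents and execute the task.
    roles = {"feedback": ["executor", "reviewer"],
             "planner_driven": ["planner", "executor"]}
    agents = [Agent(m, r) for m, r in
              zip(selected, roles.get(topology, ["executor"] * len(selected)))]
    return execute(agents, task)
```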
Key Designs¶
- Budget-Constrained LLM Configuration (ILP):
    - Function: Select the performance-optimal LLM combination within budget \(B\).
    - Mechanism: LLMs are ranked into tiers by performance (\(\mathcal{A}_1\) strongest to \(\mathcal{A}_L\) weakest), using the LMSys leaderboard as a performance proxy. Recursive weights \(W_i = 1 + \sum_{j=i+1}^{L}(W_j \cdot \lfloor B/c_j \rfloor)\) are constructed so that the weight of any higher-tier LLM always exceeds that of any combination of lower-tier LLMs. The ILP objective is \(\max \sum_{i,j} W_i \cdot x_{ij}\), where \(x_{ij}\) indicates selection of the \(j\)-th LLM in tier \(i\), subject to total cost \(\leq B\) and at least 2 LLMs selected (see the sketch after this list).
    - Design Motivation: Empirical evidence shows that a single strong model often outperforms an ensemble of weaker models, motivating a "performance-first" selection strategy. ILP guarantees an exact global optimum, avoiding the local optima inherent in greedy approaches.
- Collaboration Topology Selection (RL):
    - Function: Select the most suitable collaboration mode from the 4 topologies based on task characteristics and budget level.
    - Mechanism: The policy network \(\pi_\theta(t|T,B)\) takes a task embedding (MiniLM, 384-dimensional) and a budget scalar as input, and outputs a probability distribution over the 4 topologies. It is trained offline via REINFORCE with a composite reward \(R_{\text{final}} = w_{\text{perf}} \cdot R_{\text{perf}} + w_{\text{cost}} \cdot R_{\text{cost}}\), where success rewards, over-budget penalties, and budget-saving bonuses jointly guide policy learning.
    - Design Motivation: Different tasks require different collaboration modes (mathematical reasoning favors Feedback iteration; code generation favors Linear pipelines), making fixed topologies inadequate. RL is preferred over hand-written rules because the optimal topology also depends on the budget level: under tight budgets, the policy favors simpler topologies to avoid overspending.
- Library of 4 Collaboration Topologies:
    - Linear: Sequential reasoning where each agent continues from the previous agent's output. Suitable for multi-step reasoning.
    - Star: Parallel hypothesis generation and evaluation via a divide-and-conquer strategy. Suitable for decomposable problems.
    - Feedback: A generate-review loop for iterative refinement. Suitable for tasks requiring self-correction.
    - Planner-Driven: A central planner dynamically coordinates agents. The most flexible, but also the most expensive and least stable.
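A minimal sketch of the weight construction and the ILP, using PuLP as a stand-in solver. It collapses each tier to an integer count (rather than per-LLM binaries), and the tier costs and budget below are illustrative, not the paper's numbers:

```python
import pulp

def recursive_weights(costs, budget):
    """W_i = 1 + sum_{j>i} W_j * floor(B / c_j), computed from the weakest
    tier upward, so one tier-i LLM always outweighs every affordable
    combination of lower-tier LLMs (lexicographic optimality)."""
    L = len(costs)
    W = [0] * L
    for i in range(L - 1, -1, -1):
        W[i] = 1 + sum(W[j] * (budget // costs[j]) for j in range(i + 1, L))
    return W

def ilp_select(costs, budget):
    """Pick the performance-optimal multiset of LLMs under the budget."""
    L = len(costs)
    W = recursive_weights(costs, budget)
    x = [pulp.LpVariable(f"x_{i}", lowBound=0,
                         upBound=budget // costs[i], cat="Integer")
         for i in range(L)]
    prob = pulp.LpProblem("bamas_llm_selection", pulp.LpMaximize)
    prob += pulp.lpSum(W[i] * x[i] for i in range(L))            # objective
    prob += pulp.lpSum(costs[i] * x[i] for i in range(L)) <= budget
    prob += pulp.lpSum(x) >= 2                                   # at least 2 LLMs
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [int(v.value()) for v in x]

# Tiers ordered strongest -> weakest; illustrative per-call costs.
costs, budget = [100, 40, 10], 150
print(ilp_select(costs, budget))  # [1, 1, 1]: one strong model plus cheap helpers
```

With these toy numbers the weights come out as \(W = (64, 16, 1)\), so a single tier-1 selection (64) dominates even the maximum affordable mix of lower tiers (3 tier-2 plus 15 tier-3 would score 63), which is exactly the lexicographic property the recursive construction is designed to guarantee.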
Loss & Training¶
The RL loss is a policy gradient with entropy regularization: \(\mathcal{L}(\theta) = -\hat{\mathbb{E}}[\log \pi_\theta(t|T,B) \cdot R_{\text{final}}(\tau)] - \beta \cdot H(\pi_\theta)\). Training is performed offline on pre-collected trajectories, avoiding the high cost of online data collection. The Adam optimizer is used with a batch size of 20,000, training for 10 epochs.
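A minimal PyTorch sketch of this training objective, assuming a small MLP policy head and illustrative reward shaping; the network width, reward magnitudes, and \(\beta\) below are assumptions, not the paper's values:

```python
import torch
import torch.nn as nn

NUM_TOPOLOGIES = 4  # Linear, Star, Feedback, Planner-Driven

class TopologyPolicy(nn.Module):
    """pi_theta(t | T, B): 384-d MiniLM task embedding + budget scalar -> 4 logits."""
    def __init__(self, emb_dim=384, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, NUM_TOPOLOGIES),
        )

    def forward(self, task_emb, budget):
        # task_emb: (batch, 384); budget: (batch, 1), normalized.
        return self.net(torch.cat([task_emb, budget], dim=-1))

def composite_reward(success, cost, budget, w_perf=1.0, w_cost=0.5):
    """R_final = w_perf * R_perf + w_cost * R_cost: success reward plus an
    over-budget penalty / budget-saving bonus (the shaping here is a guess)."""
    r_perf = 1.0 if success else 0.0
    r_cost = (budget - cost) / budget if cost <= budget else -1.0
    return w_perf * r_perf + w_cost * r_cost

def reinforce_loss(policy, task_emb, budget, actions, rewards, beta=0.01):
    """L(theta) = -E[log pi_theta(t|T,B) * R_final] - beta * H(pi_theta)."""
    log_probs = torch.log_softmax(policy(task_emb, budget), dim=-1)
    # Log-probability of the topology actually taken in each offline trajectory.
    taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    pg_term = -(taken * rewards).mean()
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1).mean()  # exploration bonus
    return pg_term - beta * entropy

# Example: a batch of two offline trajectories.
policy = TopologyPolicy()
emb = torch.randn(2, 384)           # MiniLM task embeddings
bud = torch.tensor([[0.3], [0.8]])  # normalized budgets
act = torch.tensor([0, 2])          # chosen topologies (Linear, Feedback)
rew = torch.tensor([composite_reward(True, 400.0, 500.0),
                    composite_reward(True, 600.0, 1000.0)])
print(reinforce_loss(policy, emb, bud, act, rew))
```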
Key Experimental Results¶
Main Results¶
Cost–accuracy comparison on GSM8K and MBPP:
| Method | Setting | GSM8K Acc (%) | GSM8K Cost | MBPP Acc (%) | MBPP Cost |
|---|---|---|---|---|---|
| AutoGen | DeepSeek-V3 | 95.4 | 1425.3 | 80.8 | 2661.3 |
| MetaGPT | DeepSeek-V3 | 93.5 | 3235.4 | 82.2 | 3735.1 |
| ChatDev | DeepSeek-V3 | 95.0 | 2733.1 | 81.2 | 3635.1 |
| BAMAS | Budget 1625 | 95.3 | 542.9 | – | – |
| BAMAS | Budget 1250 | 94.9 | 447.0 | 82.6 | 529.2 |
MATH dataset:
| Method | Acc (%) | Cost |
|---|---|---|
| AutoGen (GPT-4.1 nano) | 77.6 | 797.2 |
| BAMAS (Budget 2000) | 81.2 | 646.0 |
Ablation Study¶
| Configuration | GSM8K Acc (%) | GSM8K Cost | Note |
|---|---|---|---|
| Naive-CostAware L5+DeepSeek | 95.3 | 1650.8 | Greedy tier-5; highest accuracy but expensive |
| BAMAS Budget 1625 | 95.3 | 542.9 | Same accuracy, 67% cost reduction |
| Naive-CostAware L1+GPT-nano | 89.7 | 216.7 | Cheapest but low accuracy |
| BAMAS Budget 500 | 87.9 | 222.4 | Similar cost, adjustable |
Key Findings¶
- BAMAS achieves 82.6% on MBPP at a cost of only 529.2, a reduction of 86% compared to MetaGPT (3735.1) — the most significant cost reduction observed.
- The learned policy exhibits meaningful patterns: math tasks favor the Feedback topology (chosen for 69.8% of MATH tasks), while code tasks favor the Linear topology.
- Under low budgets, the policy biases toward simpler topologies (Linear/Star); under higher budgets, it selects more complex topologies (Feedback) — reflecting risk-aware behavior.
- The Planner-Driven topology is never selected, indicating that RL has learned it is too costly and unstable to be worth the risk.
- Over-budget rates are negligible: 0 violations on GSM8K and at most 5/500 (1%) on MBPP, demonstrating effective budget control.
Highlights & Insights¶
- Decomposing the multi-agent system construction problem into "whom to select" (ILP) and "how to collaborate" (RL) is conceptually clean, with each sub-problem solved by the most appropriate optimization method.
- The recursive weight design guarantees lexicographic optimality for ILP — the weight of any higher-tier LLM always exceeds any combination of lower-tier ones, a concise and effective modeling technique.
- The topology selection preferences learned by the RL policy are highly interpretable (math → iterative feedback, code → linear pipeline, low budget → simple topology), rather than opaque black-box decisions.
Limitations & Future Work¶
- Only 2 LLMs are used (DeepSeek-V3 and GPT-4.1 nano); this small pool makes it difficult to fully demonstrate ILP's advantages in large-scale LLM selection.
- The 4 topologies are predefined and do not support automatic discovery of new collaboration modes. In practice, hybrid topologies (e.g., different topologies at different stages) may be needed.
- Cost estimation uses a fixed token count (500 input tokens), but actual token consumption varies significantly, potentially leading to inaccurate budget estimates.
- Evaluation is limited to code generation and mathematical reasoning; validation on more diverse tasks (e.g., creative writing, information retrieval) is lacking.
Related Work & Insights¶
- vs. AutoGen: AutoGen provides a flexible agent architecture but does not account for budgets. BAMAS can be viewed as a budget-aware wrapper around AutoGen, deciding which models and topologies to use before delegating execution to an AutoGen-like engine.
- vs. FrugalGPT/TREACLE: These works optimize costs for single-model settings (routing/cascading). BAMAS extends cost optimization to multi-agent collaboration, requiring joint optimization of model selection and collaboration strategy.
- vs. ADAS: ADAS automatically designs agent system architectures but ignores cost constraints. BAMAS complements ADAS by incorporating cost-dimension optimization.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First work to systematically introduce budget constraints into multi-agent system construction; the ILP + RL combinatorial optimization approach is novel.
- Experimental Thoroughness: ⭐⭐⭐ — Covers three datasets but only 2 LLMs; larger-scale and more diverse scenario validation is lacking.
- Writing Quality: ⭐⭐⭐⭐ — Research goals are clearly articulated; the RQ-driven evaluation structure is well-organized with rich figures and tables.
- Value: ⭐⭐⭐⭐ — Cost-awareness is a critical requirement for real-world deployment of multi-agent systems; BAMAS offers a practical and viable solution.