Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems¶
Conference: ICLR 2026
arXiv: 2510.26585
Code: None
Area: Social Computing
Keywords: Multi-agent systems, Token efficiency, Runtime supervision, Adaptive filtering, Error correction
TL;DR¶
This paper proposes SupervisorAgent, a lightweight framework for real-time adaptive supervision of multi-agent systems. An LLM-free adaptive filter decides when to intervene at critical interaction nodes, and the supervisor then corrects errors, provides guidance, or purifies observations. On the GAIA benchmark, this reduces the token consumption of Smolagent by 29.68% without sacrificing success rate.
Background & Motivation¶
- Multi-agent systems (MAS) excel at complex tasks but face an efficiency–robustness paradox:
  - Error propagation: A single hallucinated piece of information can corrupt the entire downstream reasoning chain.
  - Inefficient behavior: Agents fall into repetitive action loops or select unnecessarily complex execution paths.
  - Context bloat: Verbose tool returns (e.g., raw HTML) flood the context window.
- Existing methods primarily focus on post-hoc failure attribution, lacking real-time proactive intervention.
Method¶
Overall Architecture: Supervised Multi-Agent System (SMAS)¶
A meta-level control agent — SupervisorAgent — is added on top of the existing MAS to monitor three categories of high-risk interactions in real time:
- Agent–Agent interactions: Inter-agent communication and delegation, prone to propagating hallucinated information.
- Agent–Tool interactions: External tool calls that may return inaccurate, irrelevant, or outdated data.
- Agent–Memory interactions: Memory retrieval that may surface stale or defective historical information.
Adaptive Filter (When to Supervise)¶
A lightweight, LLM-free heuristic filter that triggers supervision only at critical nodes:
- Error occurrence \(c_{error}\): Explicit errors such as tool call failures or code execution errors.
- Inefficient behavior \(c_{inefficient}\): Repetitive action loops (e.g., repeated `page_down`).
- Excessive observation \(c_{excessive}\): Tool returns exceeding a length threshold (e.g., raw HTML).
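The three trigger conditions above can be sketched as a plain rule-based check. This is a minimal illustration, not the paper's implementation: the `Step` record, the `adaptive_filter` function, the 4000-character threshold, and the three-step repetition window are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Set


class Trigger(Enum):
    ERROR = "c_error"
    INEFFICIENT = "c_inefficient"
    EXCESSIVE = "c_excessive"


@dataclass
class Step:
    """Hypothetical record of one agent step (not a Smolagent type)."""
    action: str            # e.g. a tool name such as "page_down"
    observation: str       # raw text returned by the tool
    failed: bool = False   # True if the tool call or code execution errored


def adaptive_filter(
    history: List[Step],
    max_obs_chars: int = 4000,   # assumed length threshold
    repeat_window: int = 3,      # assumed loop-detection window
) -> Set[Trigger]:
    """Return the triggers raised by the latest step; an empty set means no supervision."""
    if not history:
        return set()
    latest = history[-1]
    triggers: Set[Trigger] = set()

    # c_error: explicit tool or code execution failure
    if latest.failed:
        triggers.add(Trigger.ERROR)

    # c_inefficient: the same action repeated for the whole window
    recent = history[-repeat_window:]
    if len(recent) == repeat_window and all(s.action == latest.action for s in recent):
        triggers.add(Trigger.INEFFICIENT)

    # c_excessive: observation longer than the threshold
    if len(latest.observation) > max_obs_chars:
        triggers.add(Trigger.EXCESSIVE)

    return triggers
```

Because the check uses only string lengths, flags, and action names, it costs no LLM tokens; the supervisor is invoked only when the returned set is non-empty.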
Context Window¶
Each supervision call receives a context window \(\mathcal{W}\) with the following fields:
- \(N\): Name of the supervised agent
- \(Q_g, Q_l\): Global goal and local task
- \(T_l\): Local action trajectory
- \(S\): Summary of the most recent interaction
- The extended version \(\mathcal{W}_{ext}\) incorporates the global trajectory \(T_g\) for diagnosing system-level inefficiencies.
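One way to picture the fields listed above is as a small record that is rendered into the supervisor's prompt. The sketch below is an assumption for illustration: the `ContextWindow` class, its field names, and the `render` layout are hypothetical, and only the symbols in the comments come from the summary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContextWindow:
    agent_name: str                                 # N
    global_goal: str                                # Q_g
    local_task: str                                 # Q_l
    local_trajectory: List[str]                     # T_l
    latest_summary: str                             # S
    global_trajectory: Optional[List[str]] = None   # T_g, present only in W_ext

    @property
    def is_extended(self) -> bool:
        """True when the window carries the global trajectory (W_ext)."""
        return self.global_trajectory is not None

    def render(self) -> str:
        """Flatten the window into plain text for a supervisor prompt."""
        lines = [
            f"Agent: {self.agent_name}",
            f"Global goal: {self.global_goal}",
            f"Local task: {self.local_task}",
            "Local trajectory:",
            *[f"  {i + 1}. {step}" for i, step in enumerate(self.local_trajectory)],
            f"Latest interaction: {self.latest_summary}",
        ]
        if self.global_trajectory is not None:
            lines.append("Global trajectory:")
            lines.extend(
                f"  {i + 1}. {step}" for i, step in enumerate(self.global_trajectory)
            )
        return "\n".join(lines)
```

Keeping \(T_g\) optional reflects the description above: the base window stays small, and the larger extended window is built only when a system-level inefficiency needs diagnosing.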
Multi-Level Intervention Action Space (How to Supervise)¶
| Action | Intensity | Trigger Condition | Description |
|---|---|---|---|
| approve | Lowest | \(c_{inefficient}\) | Permits valid repetitive behavior to continue |
| provide_guidance | Moderate | \(c_{error}, c_{inefficient}\) | Appends guidance hints to correct the reasoning path |
| correct_observation | High | \(c_{error}, c_{excessive}\) | Replaces or purifies the raw observation content |
| run_verification | Highest | \(c_{error}\) | Invokes a verification sub-agent for external fact-checking |
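The table above can be read as a lookup from trigger condition to permitted interventions. The sketch below encodes exactly that mapping; the `ACTIONS` list and `candidate_actions` helper are hypothetical names, and the final choice among candidates is assumed to be left to the supervisor itself rather than to this lookup.

```python
from typing import Iterable, List, Set, Tuple

# Actions ordered from lowest to highest intensity, each paired with the
# trigger conditions under which the table allows it.
ACTIONS: List[Tuple[str, Set[str]]] = [
    ("approve", {"c_inefficient"}),
    ("provide_guidance", {"c_error", "c_inefficient"}),
    ("correct_observation", {"c_error", "c_excessive"}),
    ("run_verification", {"c_error"}),
]


def candidate_actions(triggers: Iterable[str]) -> List[str]:
    """Return the actions permitted for the given triggers, lowest intensity first."""
    active = set(triggers)
    return [name for name, conditions in ACTIONS if conditions & active]


print(candidate_actions({"c_inefficient"}))  # ['approve', 'provide_guidance']
print(candidate_actions({"c_excessive"}))    # ['correct_observation']
```

Restricting the candidate set per trigger keeps the costly `run_verification` action out of reach unless an explicit error has occurred.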
Core Design Principles¶
- Non-invasive: Does not modify the underlying agent architecture.
- Adaptive: Supervision is triggered only at high-risk nodes, not at every interaction.
- Memory-augmented: SupervisorAgent maintains a more comprehensive view of system state than any individual agent.
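The non-invasive and adaptive principles can be illustrated by wrapping a tool from the outside instead of editing the agent. This is a sketch under assumptions: `supervised` and the `supervise` callback are hypothetical, the callback stands in for a SupervisorAgent call, and the 4000-character threshold is arbitrary.

```python
from typing import Callable


def supervised(
    tool: Callable[[str], str],
    supervise: Callable[[str], str],
    max_obs_chars: int = 4000,  # assumed threshold for an "excessive" observation
) -> Callable[[str], str]:
    """Wrap a tool so the supervisor sees only failed or oversized results."""

    def wrapper(arg: str) -> str:
        try:
            observation = tool(arg)
        except Exception as exc:
            # Error case: hand the failure to the supervisor instead of the agent.
            return supervise(f"ERROR: {exc}")
        if len(observation) > max_obs_chars:
            # Excessive observation: let the supervisor purify it.
            return supervise(observation)
        # Normal case: pass through untouched, with no supervision cost.
        return observation

    return wrapper
```

The agent keeps calling what looks like the same tool, so its own code and prompts stay unchanged, and the supervisor is consulted only on the flagged paths.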
Key Experimental Results¶
Main Results on GAIA Benchmark¶
| Method | Avg. Accuracy | Avg. Tokens (K, or change vs. Smolagent) | Level 2 Tokens (K, or change vs. Smolagent) |
|---|---|---|---|
| CodeAgent | 40.00 | 120.40 | — |
| Smolagent (pass@1) | — | Baseline | Baseline |
| SMAS (pass@1) | On par | −29.68% | −35% |
Detailed Analysis on GAIA Level 2¶
| Metric | Smolagent | SMAS | Improvement |
|---|---|---|---|
| Token consumption | Baseline | −35% | Significant |
| Variance | Baseline | −63% | Substantially reduced |
| Number of steps (case study) | Baseline | −43% | Significant |
Cross-Benchmark Validation¶
| Benchmark | Domain | Token Reduction | Accuracy Change |
|---|---|---|---|
| HumanEval | Code generation | −23.74% | Improved |
| MBPP | Code generation | Significant | On par / improved |
| AIME 2024 | Math reasoning | Reduced | On par |
| GSM8k-Hard | Math reasoning | Reduced | On par |
| DROP | QA | Reduced | On par |
Key Findings¶
- A 23.74% token reduction on HumanEval is achieved simultaneously with an improvement in accuracy.
- SupervisorAgent is effective across GPT-4.1, Gemini-2.5-pro, and the Qwen3 model series.
- The adaptive filter effectively controls supervision overhead, avoiding redundant checks on every interaction.
- Case studies show that a single successful supervisory intervention can reduce token consumption by more than 70%.
Highlights & Insights¶
- Real-time proactive intervention vs. post-hoc analysis: A paradigm shift from reactive to proactive supervision.
- Pareto improvement: Token consumption is reduced without sacrificing — and in some cases while improving — task success rate.
- LLM-free filter: A key innovation lies in using simple heuristics rather than an LLM to determine when to supervise.
- Orthogonal to existing methods: The framework is designed to be layered on top of an existing MAS framework without modifying it, so it can be combined with other efficiency techniques.
- Substantially reduced variance: The resulting system exhibits more stable and reliable behavior.
Limitations & Future Work¶
- The adaptive filter relies on predefined heuristic rules and may fail to capture emerging categories of high-risk interactions.
- The LLM calls made by SupervisorAgent itself incur costs, necessitating a trade-off between supervision benefit and supervision overhead.
- Validation is primarily conducted on the Smolagent framework; adaptation to other MAS frameworks may require non-trivial adjustments.
- The framework has limited applicability to purely conversational tasks that do not involve tool use.
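The overhead trade-off noted above comes down to simple arithmetic: supervision pays off only if the tokens it saves exceed the tokens the supervisor spends. The helper below is an illustrative sketch; `net_token_savings` and the numbers in the usage lines are hypothetical and are not figures from the paper, and this summary does not state how the paper accounts for supervisor tokens in its reported reductions.

```python
def net_token_savings(
    baseline_tokens: int,
    agent_tokens_with_supervision: int,
    supervisor_tokens: int,
) -> float:
    """Fractional saving relative to the unsupervised baseline.

    A negative result means supervision cost more tokens than it saved.
    """
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    total = agent_tokens_with_supervision + supervisor_tokens
    return (baseline_tokens - total) / baseline_tokens


# Hypothetical run: supervision shortens the agent's work enough to pay for itself.
print(round(net_token_savings(100_000, 60_000, 10_000), 4))   # 0.3

# Hypothetical run: the supervisor's own calls outweigh the saving.
print(round(net_token_savings(100_000, 95_000, 10_000), 4))   # -0.05
```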
Related Work & Insights¶
- Failure attribution: Post-hoc analysis methods such as Aegis and AgenTracer.
- Efficiency optimization: AgentDropout (agent pruning), MetaAgent (design-time topology optimization).
- Context compression: Observation summarization and distillation.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The concept of runtime supervision is novel; the non-invasive design is practically valuable.
- Technical Depth: ⭐⭐⭐ — The method is relatively intuitive with moderate technical complexity.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 6 benchmarks × multiple backbone models × detailed case analysis.
- Practicality: ⭐⭐⭐⭐⭐ — Directly deployable; of significant value for reducing MAS operational costs.