👥 Multi-Agent

🧪 ICML2026 · 24 paper notes

📌 Same area in other venues: 📷 CVPR2026 (2) · 🔬 ICLR2026 (47) · 💬 ACL2026 (40) · 🤖 AAAI2026 (26) · 🧠 NeurIPS2025 (17) · 🧪 ICML2025 (7)

🔥 Top topics: Agents ×17 · LLM ×7 · Reasoning ×3

Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information: This paper proposes two algorithms for aggregating LLM responses by leveraging higher-order information—Optimal Weight (OW) based on first-order accuracy information and Inverse Surprising Popularity (ISP) based on second-order correlation information. These methods are provably superior to Majority Voting (MV) under label-free conditions and demonstrate consistent improvements on UltraFeedback, MMLU, and healthcare datasets.
CoOT: Learning to Coordinate In-Context with Coordination Transformers: This work reframes "cooperating with unknown partners" from a task-generalization problem to a partner-generalization in-context learning problem. A Decision Transformer is trained to predict best-response actions over cross-episode interaction trajectories, so that at test time the model adapts to unseen partners within a few episodes without updating parameters.
Does Persona Make LLMs K-pop Fans? A Pilot Study of LLM-Based Online Concert Audience Agents: The authors build a "virtual audience" system of ten LLM agents that post real-time danmaku, pairing pre-recorded K-pop performances with human-like fan chats. An N=11 within-subject pilot study finds that assigning an individual persona to each agent significantly enhances diversity and "naturalness" at the model output level. However, this does not translate into a stronger sense of social connection, engagement, or emotional resonance, because K-pop danmaku is essentially a "collective monologue" rather than interpersonal dialogue.
E-mem: Multi-Agent Based Episodic Context Reconstruction for LLM Agent Memory: E-mem replaces the traditional memory paradigm of "preprocessing compression into embeddings/graphs" with an episodic reconstruction paradigm of "preserving original context + on-site reasoning by small model assistants": the master agent only handles global planning, while multiple SLM assistants each guard a segment of uncompressed raw text, performing local reasoning to return evidence after activation via multi-pathway retrieval. This approach outperforms the SOTA F1 on LoCoMo by 7.75 points while cutting token consumption by 70%.
EduMirror: Modeling Educational Social Dynamics with Value-driven Multi-agent Simulation: EduMirror simulates educational social phenomena like "campus bullying" and "peer cooperation" in an LLM-driven multi-agent sandbox. It employs "value-driven agents" based on Maslow's hierarchy of needs and Social Value Orientation (SVO) to play students and teachers, coupled with a "dual-track measurement" protocol that quantifies both observable behaviors and latent psychological states. This allows for ethically safe "what-if" counterfactual experiments in a digital environment.
EngiAgent: Fully Connected Coordination of LLM Agents for Solving Open-ended Engineering Problems with Feasible Solutions: EngiAgent decomposes engineering problem solving across five specialist agents: Analyzer, Modeler, Verifier, Solver, and Evaluator. A fully connected coordinator routes feedback dynamically, replacing rigid pipelines. With GPT-4o, this raises the feasible solution rate on engineering tasks from 5.66% (zero-shot) and 7.55% (MM-Agent) to 64.15%, which by these figures is roughly 8.5x the MM-Agent rate.
Sheaf-ADMM: Learning Multi-Agent Coordination via Sheaf-ADMM: Sheaf-ADMM formulates multi-agent coordination as an end-to-end differentiable ADMM unrolling: each agent observes a local patch, independently solves an ADMM subproblem (\(\bm x\)-update), negotiates consensus via "edge space projections" defined by a cellular sheaf (\(\bm z\)-update), and accumulates divergence using dual variables \(\bm u\). Agents successfully solve global tasks in maze pathfinding, MNIST, and Sudoku, where their inference paths exhibit analyzable primal/consensus/dual states—offering higher intervenability than standard MPNNs.
MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks: The paper reformulates "automated multi-agent system design" as a reinforcement learning (RL) problem involving function calls that output an entire MAS structure in a single step. It introduces MASBench to clarify "when multi-agent systems are truly superior to single-agent systems" across five dimensions: Depth, Horizon, Breadth, Parallelism, and Robustness.
MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems: MASPO jointly optimizes the role prompts of a multi-agent chain end to end, without relying on labels. It combines multi-granularity joint evaluation (Local Validity + Lookahead Potential + Global Alignment) with misalignment-driven evolutionary beam search, achieving an average improvement of approximately 2.9 points across 6 tasks.
MASPOB: Multi-Agent Prompt Optimization via GNN Surrogate + LinUCB + Coordinate Ascent: MASPOB reformulates multi-agent system prompt optimization as budget-constrained black-box optimization. It utilizes a GAT surrogate model to capture prompt coupling under workflow topologies, LinUCB in the embedding space to compute epistemic uncertainty, and coordinate ascent to decompose joint search into sequential individual problems. This reduces search complexity from \(\mathcal{O}(\prod |\mathcal{P}_i|)\) to \(\mathcal{O}(\sum |\mathcal{P}_i|)\). Across 6 benchmarks (QA/Code/Math), it achieves an average score of 80.58, surpassing MIPRO (78.87), AFlow (78.52), and IO (68.56).
More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration: The authors developed a turn-based multi-agent environment where "helping others is zero-cost and cooperation is the obviously optimal solution," discovering that capability among 8 mainstream LLMs fails to predict cooperation levels (o3 reached only 17% of the optimal, while the weaker o3-mini reached 50%). Using causal decomposition via "automated one-sided communication," they categorized failures into "unwillingness to cooperate" and "inability to execute," then applied three low-cost interventions (explicit protocols, micro-sharing incentives, and restricted visibility) as targeted remedies.
Multi-Agent Systems are Mixtures of Experts: Who Becomes an Influencer?: This paper models "multi-LLM agent debates" using Friedkin-Johnsen (FJ) opinion dynamics from sociology, proving that FJ parameters are input-dependent. This establishes that Multi-Agent Systems (MAS) implement a Mixture of Experts (MoE) with implicit routing. The authors theoretically characterize when MAS outperforms single agents or static ensembles and reveal through experiments that "who becomes an influencer" is primarily determined by confidence (especially relative confidence).
OMAC: A Holistic Optimization Framework for LLM-Based Multi-Agent Collaboration: This paper formalizes the optimization space of multi-agent systems (MAS) into five dimensions (two functional + three structural). It utilizes a dual-actor algorithm comprising a "Semantic Initializer" for generation and a "Contrastive Comparator" for iterative improvement to perform supervised optimization across each dimension. By iteratively and jointly optimizing multiple dimensions, it consistently outperforms baselines such as DyLAN, ADAS, and AFlow on HumanEval, MMLU, and MATH.
ProtocolBench: Which LLM MultiAgent Protocol to Choose?: ProtocolBench presents the first systematic comparison of four major LLM multi-agent communication protocols (A2A, ACP, ANP, Agora) across four axes: task success, end-to-end latency, message byte overhead, and failure robustness. The study reveals that protocol choice results in a 36.5% difference in completion time and a 3.48s difference in latency; it further proposes ProtocolRouter for dynamic scenario-based protocol selection, reducing Fail-Storm recovery time by 18.1%.
RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation: RADAR models the communication topology design of multi-LLM-agent systems as a "redundancy-aware" discrete graph diffusion process. By using effective size as a guiding signal to incrementally generate query-adaptive collaboration graphs, it achieves higher accuracy, lower token consumption, and stronger robustness across six benchmarks.
Representational Similarity and Model Behavior in Multi-Agent Interaction: This paper pairs 276 LLMs across 8 interactive games and identifies a robust pattern: pairs with higher internal representational similarity (quantified by CKA) exhibit better cooperation but lower collective novelty in their outputs—revealing a fundamental trade-off between cooperation and creativity driven by representational similarity.
Searching for Synergy in Shared Workspace Human-AI Collaboration: This paper identifies a counter-intuitive phenomenon in shared-workspace human-AI collaboration environments: adding (simulated) human collaborators with relevant expertise can actually degrade performance. The root cause is identified as "process loss" resulting from a lack of coordination structure. By borrowing two mechanisms from group psychology—Shared Group Memory and Simulated HITL Approval Gating—as scaffolds, the authors restore the average score of a three-agent team from 0.63 to 0.76.
Securing Multi-Agent Systems Against Corruptions via Node Contribution Backpropagation: BPD reconstructs the multi-round interactions of an LLM Multi-Agent System (MAS) into a "signed Directed Acyclic Graph (DAG)," scoring each message as \(\{1, 0, -1\}\) for agreement, indifference, or disagreement, respectively. It then uses a PageRank-style single-pass reverse topological propagation to calculate each agent's contribution score to the final answer. Outliers are identified as malicious agents and their outgoing edges are pruned, yielding a training-free, single-query method that is naturally robust to dynamic topologies.
Smarter Saboteurs, Better Fixers: Scaling & Security in Linear Multi-Agent Workflows: This paper employs a linear MetaGPT pipeline (Product Manager → Architect → Project Manager → Engineer) and injects a malicious agent into the Engineer role to secretly embed bugs. It finds that larger models cause more severe damage (Pass@1 drops by 53.7pp at 27B); however, simply appending a lightweight QA+Fixer terminal stage reduces the performance drop to 0.6pp. This indicates that the previously claimed "inherent fragility of linear topologies" is actually due to the absence of terminal error correction.
Systematic Failures in Collective Reasoning under Distributed Information in Multi-Agent LLMs: This paper adapts the Hidden Profile paradigm from social psychology into a multi-agent LLM evaluation, constructing HiddenBench, a benchmark of 65 tasks. A systematic evaluation of 15 frontier LLMs reveals a stark performance gap: while a single agent achieves 80.7% accuracy under Full Profile, a group of agents achieves only 30.1% under distributed information. The fundamental failure mode is the inability to proactively elicit information that others have left unsaid, which a lightweight structured communication protocol significantly mitigates across model families.
Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning: OG-MAR organizes raw World Values Survey (WVS) data into a "cultural ontology with structural relations + individual value personas." During inference, it retrieves ontology triples relevant to the target population alongside demographically similar real-world respondents to instantiate multiple "Value Persona Agents." A judge agent then synthesizes a final answer following an "evidence-first, ontology-consistent" protocol, improving cultural alignment and providing explainable reasoning trajectories across six regional social survey benchmarks.
Voting Protocols as Coordination Mechanisms for Role-Constrained Multi-Agent Tutoring Systems: The paper situates four "tutoring agents" with non-overlapping responsibilities (Scaffolding/Correction/Encouragement/Metacognition) within the same tutoring turn, allowing them to propose, peer-review, and revise responses before using four distinct voting protocols (Plurality / Borda / Cumulative / Approval) to converge disagreements into a final response. Rather than simply proving "voting makes tutoring better," the study uses tutoring as an experimental testbed for partially aligned but locally conflicting goals, systematically characterizing how different voting rules induce divergent coordination behaviors.
When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems: This paper systematically investigates hybrid multi-agent systems consisting of a cloud-based GPT-4o supervisor and on-device Qwen3 executors. It finds that PEVR and EVA have respective advantages in UI assistance and deep search. More cloud intervention is not necessarily better, while context resetting and summarization significantly reduce cost and KV-cache pressure for long-duration on-device tasks.
Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence: HetMedAgent organizes generalist LLMs, modality-specific models, and clinicians into a heterogeneous multi-agent system. Through conflict-aware evidence fusion and uncertainty routing, it demonstrates that specialist models and human supervision remain irreplaceable components of medical AI in cardiovascular and chest X-ray clinical decision-making tasks.
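
The MASPOB note above states that coordinate ascent reduces search complexity from \(\mathcal{O}(\prod |\mathcal{P}_i|)\) to \(\mathcal{O}(\sum |\mathcal{P}_i|)\). The toy sketch below illustrates only that counting argument. It is not the paper's implementation: there is no GAT surrogate or LinUCB here, and the names `joint_search`, `coordinate_ascent`, `candidates`, and `score` are hypothetical, with `score` standing in for whatever evaluates a full prompt assignment.

```python
from itertools import product


def joint_search(candidates, score):
    """Exhaustive joint search: one evaluation per combination of candidates."""
    best, best_score, evals = None, float("-inf"), 0
    for combo in product(*candidates):
        evals += 1
        s = score(combo)
        if s > best_score:
            best, best_score = combo, s
    return best, evals


def coordinate_ascent(candidates, score, sweeps=1):
    """Optimize one agent's slot at a time while holding the others fixed.

    Each sweep costs sum(len(pool)) evaluations instead of the product.
    """
    current = [pool[0] for pool in candidates]  # arbitrary starting assignment
    evals = 0
    for _ in range(sweeps):
        for i, pool in enumerate(candidates):
            best_choice, best_score = current[i], float("-inf")
            for choice in pool:
                trial = current[:i] + [choice] + current[i + 1:]
                evals += 1
                s = score(tuple(trial))
                if s > best_score:
                    best_choice, best_score = choice, s
            current[i] = best_choice
    return tuple(current), evals
```

With three candidate pools of sizes 3, 4, and 2, joint search makes 24 evaluations while one sweep of coordinate ascent makes 9. When the score is separable across agents both reach the same assignment; when prompts are strongly coupled, coordinate ascent can stop at a local optimum, which is presumably why the note pairs it with a surrogate that models coupling.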