
👥 Multi-Agent

🔬 ICLR2026 · 47 paper notes

📌 Same area in other venues: 📷 CVPR2026 (2) · 💬 ACL2026 (40) · 🧪 ICML2026 (24) · 🤖 AAAI2026 (26) · 🧠 NeurIPS2025 (17) · 🧪 ICML2025 (7)

🔥 Top topics: Agents ×41 · LLM ×9 · Reasoning ×8 · Reinforcement Learning ×3 · Adversarial Robustness ×3

Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning: The HILA framework is proposed to enable multi-agent LLMs to learn a set of "metacognitive policies"—judging when to solve problems independently and when to defer to human experts. By using Dual-Loop Policy Optimization, it decouples the optimization of "when to ask" (inner-loop reinforcement learning) from "how to gain capability from assistance" (outer-loop continual learning), consistently outperforming existing autonomous multi-agent systems on benchmarks such as mathematical reasoning.
Aegis: Automated Error Generation and Attribution for Multi-Agent Systems: Aegis uses an LLM manipulator to actively inject errors into successful multi-agent trajectories, turning them into labeled failure trajectories and automatically generating 9,533 data entries annotated with "erroneous agent + error mode." This transforms the expensive manual labeling bottleneck into a scalable engineering problem and supports training error attribution models via SFT, RL, and contrastive learning.
AgentPO: Enhancing Multi-Agent Collaboration via Reinforcement Learning: Instead of searching for multi-agent topologies, AgentPO freezes a powerful Actor within a fixed topology and uses Reinforcement Learning (GRPO) to train a lightweight Collaborator to learn "how to assist teammates." With only 500 training samples and 7.8% of the inference overhead of EvoAgent, it consistently outperforms strong baselines like Role Assignment and EvoAgent across multiple mathematical reasoning benchmarks.
AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework: The framework organizes "tasking-solving-scoring" agents into an adversarial loop and uses a non-LLM Bayesian update rule to evolve code, test cases, and prompts simultaneously. It enables 32B open-source models to outperform 235B models on scientific code generation benchmarks, shifting system reliability from "betting on a strong LLM" to "reducing uncertainty via Bayesian convergence."
Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems: This paper uses Discrim-Eval-Open, an open-ended bias benchmark built on forced three-choice questions, and models Multi-Agent Systems (MAS) as directed acyclic graphs. Using the Gini coefficient to track the "amplification rate" of bias across layers, it systematically demonstrates a counter-intuitive conclusion: although multi-agent collaboration is often assumed to "dilute" bias, role specialization, complex topologies, and deeper iteration actually amplify minor random preferences of individual models into systemic discrimination against groups. Even neutral external information can trigger intense polarization.
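As a rough illustration of the measurement described in the Aligned Agents, Biased Swarm entry, the sketch below computes a Gini coefficient over answer-choice counts at each layer of a MAS. The counts are made up, and treating the "amplification rate" as the ratio of successive Gini values is an assumption for illustration, not necessarily the paper's exact definition.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 means a perfectly even split)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on sorted data: G = sum_i (2i - n - 1) * x_i / (n * total), i = 1..n
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical counts of how often each of three options was chosen, per layer
layer_counts = [
    [40, 32, 28],  # layer 1: mild preference
    [50, 30, 20],  # layer 2
    [70, 20, 10],  # layer 3: strongly skewed
]

ginis = [gini(counts) for counts in layer_counts]

# Assumed definition: ratio of each layer's Gini to the previous layer's
amplification = [later / earlier for earlier, later in zip(ginis, ginis[1:])]

print(ginis)
print(amplification)
```

A ratio above 1 between consecutive layers would indicate that the skew in choices grew as responses passed through the graph.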
ATLAS: Constraints-Aware Multi-Agent Collaboration for Real-World Travel Planning: ATLAS formalizes "real-world travel planning with search" as a dynamic Constraint Satisfaction Problem (CSP). It utilizes five specialized LLM agents (Search, Constraint Manager, Planner, Checker, and Search Advisor) to cooperatively complete constraints, iteratively correct errors, and guide search in the event of a deadlock. This approach increases the final pass rate of TravelPlanner from 23.3% to 44.4% and achieves an 84% pass rate in a real-world multi-round scenario with live web search for the first time.
Benefits and Limitations of Communication in Multi-Agent Reasoning: This paper establishes a theoretical framework based on Transformer expressivity for multi-agent reasoning systems that "chunk long contexts, process them via multiple LLM agents, and aggregate results." It proves tight bounds on how many agents and how much communication are needed to achieve specific parallel speedups across associative recall, state tracking, and k-hop reasoning. It identifies three depth–communication trade-off regimes and validates the theoretically predicted inflection points using Llama on synthetic benchmarks.
Breaking and Fixing Defenses Against Control Flow Hijacking in Multi-Agent Systems: This paper first demonstrates that existing "alignment-check" defenses (e.g., LlamaFirewall) can be bypassed by meticulously rewritten Control Flow Hijacking (CFH) attacks. It then proposes CONTROLVALVE—a coordination-layer defense inspired by program Control Flow Integrity (CFI). During the task planning phase, it generates an "allowed agent call graph + per-edge context rules." During execution, it performs "narrow decisions" on each agent transition to verify if it exists in the graph and satisfies the edge rules. This approach reduces the attack success rate to 0% across all evaluated attacks without degrading performance on baseline tasks.
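The "allowed call graph + per-edge rules" idea in the CONTROLVALVE entry can be sketched as a small lookup-and-check routine. Everything here is hypothetical: the agent names, the `edge_rules` structure, and the use of plain Python predicates as edge rules are stand-ins chosen for illustration, and the paper's own rule format and decision procedure may differ.

```python
from typing import Callable, Dict, Tuple

Edge = Tuple[str, str]
Rule = Callable[[dict], bool]


def is_transition_allowed(edge_rules: Dict[Edge, Rule], src: str, dst: str, context: dict) -> bool:
    """Allow a hand-off only if the edge was planned and its context rule holds."""
    rule = edge_rules.get((src, dst))
    if rule is None:
        return False  # edge is not in the planned call graph
    return bool(rule(context))


# Hypothetical plan generated before execution for a "summarize a web page" task
edge_rules: Dict[Edge, Rule] = {
    ("orchestrator", "web_surfer"): lambda ctx: ctx.get("url", "").startswith("https://"),
    ("web_surfer", "summarizer"): lambda ctx: True,
}

# A planned transition that satisfies its rule
print(is_transition_allowed(edge_rules, "orchestrator", "web_surfer", {"url": "https://example.org"}))

# A hijack attempt: the page content asks the system to invoke an unplanned agent
print(is_transition_allowed(edge_rules, "web_surfer", "code_executor", {}))
```

The design point the sketch tries to capture is that each check is narrow: it asks only whether one specific transition is permitted, rather than whether the whole trajectory looks aligned.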
BRIDGE: Bi-level Reinforcement Learning for Dynamic Group Structure in Coalition Formation Games: This paper models the problem of optimally partitioning a group of agents into coalitions (the NP-complete Coalition Structure Generation problem) as a compact, RL-solvable MDP. By employing bi-level RL (the upper level learns to merge coalitions and the lower level learns optimal individual policies), models trained on only 3 agents generalize to 100 agents, outperforming traditional heuristics in both inference speed and task performance in mixed-motive Markov games.
Cache-to-Cache: Direct Semantic Communication Between Large Language Models: Instead of collaborating through natural language "conversations," multiple large language models use a lightweight neural network to directly project and fuse the KV-Cache of a Sharer model into a Receiver model. This bypasses token-by-token text generation, preserving deep semantics that text might lose, while reducing average latency by 2.5× and improving accuracy by approximately 3–5% compared to pure text-based collaboration.
CellAgent: LLM-Driven Multi-Agent Framework for Natural Language-Based Single-Cell Analysis: CellAgent utilizes a three-tier agent architecture (Planner-Executor-Evaluator) combined with an expert toolbox (sc-Omni) and a self-reflective optimization mechanism. It enables researchers to perform end-to-end single-cell RNA sequencing and spatial transcriptomics analysis using only natural language, achieving quality comparable to or exceeding that of human experts across multiple downstream tasks.
CoAct-1: Computer-using Multi-agent System with Coding Actions: CoAct-1 treats "writing and executing code" as a first-class action alongside GUI clicking. It utilizes an Orchestrator to dynamically assign subtasks to a Programmer (proficient in Python/Bash) or a GUI Operator (capable of screen interaction). This approach pushes the success rate to 60.8% on OSWorld (52.5% on WindowsAgentArena) while reducing the average number of steps to 10.15.
CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards: CoMAS enables multiple LLM agents to propose solutions, critique, and score each other within a forum-like discussion environment. These discussion dynamics are converted into intrinsic reward signals via LLM-as-a-judge, which are then used to update individual policies through RL, achieving decentralized and scalable collaborative self-evolution without relying on external verifiers or reward models.
Completing Missing Annotation: Multi-Agent Debate for Accurate and Scalable Relevance Assessment: This paper proposes DREAM, a multi-agent, multi-round debate framework for IR relevance annotation that initializes agents with opposing stances: items are labeled automatically when the agents agree and escalated to humans (assisted by the debate history) when they disagree. It achieves 95.2% balanced accuracy with only 3.5% human intervention. DREAM was used to construct the BRIDGE benchmark, which identifies 29,824 missing relevance annotations (428% of the original) and corrects retrieval system ranking biases and the retrieval-generation performance mismatch in RAG.
Context Learning for Multi-Agent Discussion: M2CL learns a "context generator" for each LLM in Multi-Agent Discussion (MAD), allowing round-wise instruction contexts to be automatically organized and refined based on discussion progress. This approach prevents early convergence on "majority noise" while gradually aligning multiple LLMs toward the correct consensus, outperforming existing methods by 20%–50% across 9 benchmarks.
DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems: DoVer upgrades the "log attribution" debugging paradigm for LLM multi-agent systems from "guessing the faulty agent/step" to "targeted intervention and replay verification" (Do-then-Verify). By segmenting failure trajectories into multiple trials, proposing hypotheses for each, and rewriting orchestrator instructions or plans for in-situ replay with milestone scoring, it successfully flips 18–28% of failure cases in GAIA / AssistantBench and achieves a 49% flip rate on GSMPlus.
Emergent Coordination in Multi-Agent Language Models: This paper proposes a quantifiable framework based on Partial Information Decomposition (PID) and Time-Delayed Mutual Information (TDMI), proving that multi-LLM agent systems can leap from loose aggregations to true collectives with high-order coordination structures under appropriate prompting (Persona + ToM). It further reveals that the "Synergy \(\times\) Redundancy" interaction is the critical mechanism for performance improvement.
From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization: This paper proposes the EduVisBench benchmark (1,154 STEM problems + five-dimensional pedagogical scoring rubric) to systematically reveal the weakness of foundation models in "reasoning correctly but failing to draw effective pedagogical diagrams." It designs the EduVisAgent multi-agent framework featuring five collaborating experts to decompose abstract reasoning into human-cognition-aligned interactive web pages, achieving a 40.2% improvement over the strongest baseline.
From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning: ChemMAS reframes the task from "what conditions to recommend" to "why these conditions are selected" through evidence-driven reasoning. It employs a four-stage pipeline: "General Chemist Mechanism Analysis → Multi-channel Candidate Recall → Tournament Elimination → Multi-agent Debate & Voting." This ensures each decision is accompanied by falsifiable and auditable chemical evidence. The system's Top-1 similarity is 20–35% higher than specialized models and 10–15% higher than general LLMs.
GlobeDiff: State Diffusion Process for Partial Observability in Multi-Agent Systems: This paper reformulates "global state inference" under multi-agent partial observability as a conditional diffusion denoising process. By introducing a latent variable \(z\) as a "mode selector," it explicitly models the one-to-many ambiguity where a single local observation corresponds to multiple plausible global states. This approach avoids the mode collapse inherent in discriminative methods, enabling agents to reconstruct high-fidelity global states from local information for decision-making.
Goal-Aware Identification and Rectification of Misinformation in Multi-Agent Systems: This paper proposes the red-teaming dataset MisInfoTask and the training-free defense framework ARGUS. Through a two-stage process of "adaptive localization of key propagation channels on the graph + goal-aware multi-round persuasive rectification," it specifically defends against "misinformation" injection—content that is semantically harmless but factually incorrect—within LLM multi-agent systems.
Graph-of-Agents: A Graph-based Framework for Multi-Agent LLM Collaboration: GoA models multi-LLM collaboration as a dynamic directed graph—first selecting a small set of the most relevant agents as nodes using model cards, then constructing edges based on mutual scores for bidirectional message passing, and finally aggregating via graph pooling. Using only 3 agents, it outperforms baselines like Mixture-of-Agents that utilize a full set of 6 agents.
GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs: GraphPlanner upgrades multi-model LLM routing from "selecting a single model" to "generating a multi-agent workflow." It utilizes a heterogeneous graph memory network, GARNet, to simultaneously encode the current workflow and historical interactions, while jointly optimizing task performance and computational overhead using PPO. It achieves up to a 9.3% accuracy improvement across 14 tasks while reducing GPU training overhead from \(186 \text{ GiB}\) to \(1 \text{ GiB}\).
HAMLET: A Hierarchical and Adaptive Multi-Agent Framework for Live Embodied Theatre: HAMLET is a multi-agent framework that splits AI theatre into two stages: offline planning and online performance. Through a narrative blueprint, a Perception-and-Decision (PAD) module, and a hierarchical control system, it achieves an AI theatre experience characterized by proactive agency, physical environment interaction, and improvisational freedom.
KVComm: Enabling Efficient LLM Communication through Selective KV Sharing: The KVComm framework is proposed to enable efficient communication between LLMs through selective sharing of KV pairs. It identifies an "information concentration bias" in hidden states that makes them unsuitable for cross-model transmission and designs a layer selection strategy based on attention importance paired with a Gaussian prior. Transmitting only 30% of layers outperforms most baselines.
Learning Efficient and Interpretable Multi-Agent Communication: GLC unifies "discrete autoencoder compression + LLM offline semantic anchoring + inter-agent contrastive alignment" into an Information Bottleneck framework. This allows learned multi-agent communication protocols to achieve extreme bandwidth efficiency, strong task performance, and human readability simultaneously, breaking the "trilemma" of communication efficiency, utility, and interpretability.
Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization: SUMMQ structures "summarization" and "quizzing" as a pair of adversarial multi-agent tasks: the summarizer is responsible for full-text coverage, while the quizzer interrogates whether the summary omits information or exhibits distortion. An additional "examinee" agent validates if the summary can answer the quiz, utilizing multi-round feedback to refine the summary for improved completeness and factual consistency in long documents.
LH-Deception: Simulating and Understanding LLM Deceptive Behaviors in Long-Horizon Interactions: This paper proposes LH-Deception, the first simulation framework for LLM deceptive behaviors in long-horizon interactions. It utilizes a three-role multi-agent architecture (Performer-Supervisor-Auditor) combined with a probabilistic event system driven by social science theories. Systematic quantification across 11 frontier models reveals deception frequency, severity, and type distribution, as well as the erosion effect on trust relationships, uncovering the emergence of "deception chains" that static single-round evaluations fail to capture.
MAC-AMP: A Closed-Loop Multi-Agent Collaboration System for Multi-Objective Antimicrobial Peptide Design: This paper proposes MAC-AMP, the first closed-loop multi-agent collaboration system that reformulates antimicrobial peptide (AMP) design as a coordinated multi-agent optimization problem, achieving multi-objective optimization through AI-simulated peer review and adaptive reward design.
MAD-Logic: Multi-Agent Debate Enhances Symbolic Translation and Reasoning: The proposed method has multiple agents translate a single logic problem into three symbolic languages (LP/FOL/SAT), followed by a multi-round debate between the "Solver group" and the "Natural Language group" with majority voting. Sparse communication based on confidence and information gain prunes redundant interactions, achieving stronger and more robust reasoning in logical QA while reducing token expenditure.
MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs: MARSHAL utilizes a GRPO modification specifically designed for "multi-turn + multi-agent" scenarios (turn-level advantage estimation with summation before normalization + role-based group normalization). By training Qwen3-4B through self-play in cooperative and competitive strategic games, the model acquires reasoning capabilities that zero-shot transfer to multi-agent systems like MAD/AutoGen, consistently improving performance on math and QA benchmarks.
MARTI: A Framework for Multi-Agent LLM Systems Reinforced Training and Inference: MARTI unifies "multi-agent reasoning" and "distributed RL training" into an open-source framework. By utilizing centralized environment interaction and reward allocation, it dispatches each agent's trajectories and rewards back to their respective policy trainers. This enables multiple LLM agents to be trained together via RL during collaboration, achieving a higher mathematical reasoning upper bound than a single agent under the same inference budget.
MAS²: Self-Generative, Self-Configuring, Self-Rectifying Multi-Agent Systems: MAS² enables a "Meta-Multi-Agent System" (Generator–Implementer–Rectifier triad) to architect, configure, and dynamically correct another Multi-Agent System (MAS) for specific tasks at runtime. By utilizing Collaborative Tree Optimization (CTO) for offline RL to specialize these three meta-agents, MAS² achieves up to a 19.6% improvement over SOTA MAS across 8 benchmarks while maintaining a superior cost–performance Pareto frontier.
Matching Multiple Experts: On the Exploitability of Multi-Agent Imitation Learning: This purely theoretical work investigates the difficulty of learning a multi-agent policy from offline experts in a Nash equilibrium that is robust against unilateral deviations. It first proves that even with exact occupancy measure matching, a low Nash gap policy cannot be guaranteed (impossibility results + PPAD-hard lower bounds). It then introduces a new assumption, "dominant strategies / best-response \(\delta\)-continuity," to provide a computable Nash gap upper bound of \(O(n\epsilon_{BC}/(1-\gamma)^2)\) that converges to zero with imitation error.
MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning: MMedAgent-RL is proposed to optimize a multi-agent system simulating clinical consultation workflows (Triage → Specialist → Attending) via RL. The core innovation is Curriculum-guided Entropy-aware RL (C-MARL), which enables the Attending Physician agent to adopt distinct exploration-exploitation strategies when facing correct, conflicting, or incorrect specialist opinions, achieving SOTA performance across five in-domain and out-of-domain medical VQA benchmarks.
Multi-agent Coordination via Flow Matching: This paper proposes MAC-Flow, which first learns a centralized joint behavior distribution using Flow Matching, then distills it into decentralized single-step policies via Individual-Global-Max (IGM) decomposition. Combined with Q-value maximization for behavior regularization training, it achieves approximately 14.5× inference acceleration compared to diffusion-based methods across 34 datasets in 12 environments while maintaining comparable coordination performance.
Multi-Agent Debate with Memory Masking (MAD-M²): This paper points out that Multi-Agent Debate (MAD) can be misled by "false memories" remaining from previous rounds. It theoretically proves that MAD performance is constrained by memory quality and proposes MAD-M², which applies an "evaluate-mask" filter to the previous round's memory before each debate round, ensuring agents reason only based on reliable memories.
Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies: The Multi-Agent System Search (MASS) framework is proposed, which automatically discovers high-performance multi-agent system designs through a three-stage interleaved strategy for optimizing prompts and topologies (local prompt optimization → topology search → global prompt optimization).
PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images: PixelCraft introduces a multi-agent system comprising a "dispatcher + planner + reasoner + dual critic + visual tool agents." By treating a fine-tuned pixel-level localization model as "eyes" and traditional CV operators as "hands," combined with a backtrackable branch-based image memory, it significantly improves the reasoning accuracy of MLLMs like GPT-4o and Claude on structured images such as charts and geometry (+5.6 to 9.5 points on CharXiv).
Stochastic Self-Organization in Multi-Agent Systems: This paper proposes the SelfOrg framework, which dynamically constructs Directed Acyclic Graphs (DAGs) for communication based on the semantic similarity of agent responses and Shapley value contribution estimation. It achieves self-organized collaboration in multi-agent systems, showing particularly significant advantages in weak model scenarios.
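For readers unfamiliar with the Shapley value mentioned in the SelfOrg entry, the sketch below computes exact Shapley values for three agents by enumerating all orderings. The coalition value table is invented, and exact enumeration is used only because the example is tiny; it is not claimed to be the estimator SelfOrg uses.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all player orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical quality score for the answer produced by each subset of agents
coalition_value = {
    frozenset(): 0.0,
    frozenset({"a"}): 0.6,
    frozenset({"b"}): 0.2,
    frozenset({"c"}): 0.0,
    frozenset({"a", "b"}): 0.8,
    frozenset({"a", "c"}): 0.6,
    frozenset({"b", "c"}): 0.2,
    frozenset({"a", "b", "c"}): 0.8,
}

phi = shapley_values(["a", "b", "c"], lambda s: coalition_value[frozenset(s)])
print(phi)
```

In this toy table agent `c` never changes the score of any coalition, so its Shapley value is zero, and the three values sum to the score of the full group.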
Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems: This paper proposes SupervisorAgent, a lightweight real-time adaptive supervision framework. It uses an LLM-free adaptive filter to proactively intervene (error correction, guidance, observation purification) at critical interaction nodes. It reduces the token consumption of Smolagent by 29.68% on the GAIA benchmark without compromising success rates.
Strategic Planning and Rationalizing on Trees Make LLMs Better Debaters: This paper proposes TreeDebater, which utilizes a "Rehearsal Tree" to pre-simulate opponent moves and a "Debate Flow Tree" to track the status of the debate. Combined with simulated audience feedback and a speech duration controller, it enables LLMs to allocate precious speaking time to the most impactful actions in strictly timed competitive debates. In human evaluations, it achieved a +15.6% gain in per-stage persuasiveness and a +10% win rate in overall opinion shifts compared to the previous SOTA multi-agent debate system.
Stronger-MAS: Multi-Agent Reinforcement Learning for Collaborative LLMs: To address the gap in applying on-policy RL to Multi-Agent Systems (MAS), this paper proposes AT-GRPO—a group-relative RL algorithm grouped by "agent + turn" (featuring tree-based sampling and hybrid global/local rewards) alongside a system supporting concurrent multi-policy on-policy training. It achieves consistent improvements across game, planning, code, and math tasks, specifically increasing the success rate of long-horizon planning tasks from 14–47% in single-agent RL to 96.0–99.5%.
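The "grouped by agent + turn" idea in the Stronger-MAS entry can be illustrated with a group-relative advantage computation in which each sample is normalized only against samples that share its agent and turn index. This is a minimal sketch under assumptions: the sample fields, the use of mean and population standard deviation, and the `eps` constant are illustrative choices, and the tree-based sampling and hybrid rewards of AT-GRPO are not modeled.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grouped_advantages(samples, eps=1e-8):
    """Normalize each reward against the samples sharing its (agent, turn) key."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s["agent"], s["turn"])].append(s["reward"])
    stats = {key: (mean(rs), pstdev(rs)) for key, rs in groups.items()}

    advantages = []
    for s in samples:
        mu, sigma = stats[(s["agent"], s["turn"])]
        advantages.append((s["reward"] - mu) / (sigma + eps))
    return advantages


# Hypothetical rollouts: two candidate actions per agent at turn 0
samples = [
    {"agent": "planner", "turn": 0, "reward": 1.0},
    {"agent": "planner", "turn": 0, "reward": 0.0},
    {"agent": "coder", "turn": 0, "reward": 0.5},
    {"agent": "coder", "turn": 0, "reward": 0.5},
]

adv = grouped_advantages(samples)
print(adv)
```

Grouping this way keeps a planner's reward from being compared against a coder's, or an early turn against a late one, which is the motivation the entry gives for the agent-and-turn grouping.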
UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking: This paper identifies and formalizes the "Unindexed Information Seeking" (UIS) problem, which targets dynamic web pages, embedded files, and interactive content that search engines cannot directly index. It proposes the first UIS benchmark, UIS-QA (110 questions), and a multi-agent framework, UIS-Digger. A ~30B parameter model trained with SFT+RFT achieves 27.27% accuracy, outperforming systems integrated with O3/GPT-4.1.
Unlocking the Power of Multi-Agent LLM for Reasoning: From Lazy Agents to Deliberation: This paper identifies the "lazy agent" phenomenon in multi-agent LLM reasoning frameworks (ReMA)—where one agent performs nearly all reasoning while the other merely repeats. It theoretically identifies the root cause as the \(1/T\) normalization term in the multi-turn GRPO loss, which biases towards fewer turns. The authors propose Dr. MAMR: removing this normalization + Shapley-style causal influence measurement + verifiable rewards for <restart>, elevating multi-agent systems from underperforming single-agent GRPO to comprehensive superiority (7B average 51.97→58.43).
When Agents "Misremember" Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems: This paper conducts the first systematic study of the Mandela Effect (collective false memory) in LLM-based multi-agent systems. It proposes the ManBench benchmark (4,838 questions, 5 interaction protocols) and finds that all 13 evaluated LLMs are susceptible to this effect. The authors further propose prompt-level and model-level mitigation strategies, reducing collective false memories by an average of 74.40%.
WideSearch: Benchmarking Agentic Broad Info-Seeking: WideSearch introduces the first benchmark specifically designed to evaluate "wide-scale info-seeking." Given a query and a table schema, the agent must populate the entire table. The benchmark includes 200 Chinese and English human-annotated tasks with five-stage quality control. Results show that the success rate of over 10 mainstream search agents remains near 0%, with the best performing at only 7%, while human cross-validation approaches 100%. This highlights a critical deficiency in current agents regarding "large-scale, zero-tolerance" information collection.