👥 Multi-Agent

🤖 AAAI2026 · 26 paper notes

📌 Same area in other venues: 📷 CVPR2026 (2) · 🔬 ICLR2026 (47) · 💬 ACL2026 (40) · 🧪 ICML2026 (24) · 🧠 NeurIPS2025 (17) · 🧪 ICML2025 (7)

🔥 Top topics: Agents ×23 · LLM ×11 · Reasoning ×4 · Adversarial Robustness ×2

A Graph-Theoretical Perspective on Law Design for Multiagent Systems: This paper studies the law design problem in multiagent systems from a graph-theoretical perspective, reducing the minimization of useful laws and gap-free laws to the vertex cover problem on hypergraphs, proving NP-hardness, and providing approximation algorithms.
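To illustrate the hypergraph vertex cover problem that this entry reduces to, here is a minimal sketch of a standard approximation, which may differ from the algorithms in the paper. It takes every vertex of any hyperedge that is still uncovered. Because the hyperedges it picks are pairwise disjoint, the cover is at most d times the optimum, where d is the largest hyperedge size. The vertex names and the function name are hypothetical, and hyperedges are assumed to be non-empty.

```python
def hypergraph_vertex_cover(hyperedges):
    """Return a vertex set that intersects every (non-empty) hyperedge.

    Standard d-approximation: whenever a hyperedge is not yet covered,
    add all of its vertices to the cover.
    """
    cover = set()
    for edge in hyperedges:
        edge = set(edge)
        if not (cover & edge):   # no chosen vertex touches this hyperedge yet
            cover |= edge        # take every vertex of the uncovered hyperedge
    return cover


# Hypothetical instance with three hyperedges over six vertices.
edges = [{"v1", "v2"}, {"v2", "v3", "v4"}, {"v5", "v6"}]
cover = hypergraph_vertex_cover(edges)
```

In this instance the second hyperedge is already covered by `v2` when it is visited, so only the first and third hyperedges contribute vertices.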
KDR-Agent: A Multi-Agent LLM Framework for Multi-Domain Low-Resource In-Context NER via Knowledge Retrieval: This paper proposes KDR-Agent, a multi-agent framework in which a central planner coordinates three specialized agents—knowledge retrieval, contextual disambiguation, and reflective error correction—combined with natural language type definitions and entity-level positive/negative contrastive demonstrations. Without any fine-tuning, KDR-Agent comprehensively outperforms zero-shot and few-shot baselines across 10 low-resource NER datasets spanning 5 domains (BC5CDR F1=82.47, WNUT-17 F1=80.78 on GPT-4o).
Adaptive Theory of Mind for LLM-based Multi-Agent Coordination: This paper proposes the Adaptive Theory of Mind agent (A-ToM), which formulates ToM order alignment as an online expert advice problem. By employing Follow-the-Leader (FTL) or Hedge algorithms to estimate a partner's ToM order in real time and dynamically adjust its own reasoning depth, A-ToM achieves robust zero-shot multi-agent coordination across four task categories, including repeated matrix games, grid navigation, and Overcooked.
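The expert-advice formulation mentioned above can be sketched with a plain Hedge update. This is a minimal illustration, not the A-ToM implementation. It assumes each candidate ToM order is an expert that incurs loss 1 in a round where its prediction of the partner's action was wrong and 0 otherwise. The function names, the learning rate `eta`, and the toy loss sequence are all hypothetical.

```python
import math


def hedge_update(weights, losses, eta=0.5):
    """One Hedge step: scale each weight by exp(-eta * loss), then normalize."""
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def estimate_partner_order(loss_rounds, num_orders=3, eta=0.5):
    """Run Hedge over all rounds; return the most probable order and the weights."""
    weights = [1.0 / num_orders] * num_orders   # uniform prior over ToM orders
    for losses in loss_rounds:
        weights = hedge_update(weights, losses, eta)
    best = max(range(num_orders), key=lambda k: weights[k])
    return best, weights


# Hypothetical losses for ToM orders 0, 1, 2 over three rounds:
# order 1 predicts the partner correctly every time.
rounds = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
best_order, final_weights = estimate_partner_order(rounds)
```

Follow-the-Leader would instead pick the order with the smallest cumulative loss at each round. Hedge keeps a full distribution over orders, so a single surprising round shifts the estimate gradually rather than all at once.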
AgentODRL: A Large Language Model-based Multi-agent System for ODRL Generation: This paper proposes AgentODRL, an LLM-based multi-agent system built on an Orchestrator-Workers architecture that converts natural language data usage rules into high-quality ODRL policies through task decomposition, a syntax validation loop, and a LoRA-driven semantic reflection mechanism.
ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment: This paper proposes ARCANE, a framework that formulates alignment as a multi-agent collaboration problem. A manager agent learns to generate natural-language rubrics (weighted verifiable criterion sets) through dialogue with stakeholders, which serve as interpretable proxy reward functions for a worker agent. Via two-stage SFT+GSPO training, the framework enables test-time configurable alignment, improving mean return from 0.58 to 0.74 (N=8) on the GDPVal benchmark with the GSPO variant.
Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation: This paper proposes ARG-Designer, which reformulates multi-agent system topology design as a conditional autoregressive graph generation task. Rather than pruning from template graphs, the model incrementally generates agent nodes and communication edges from scratch. ARG-Designer achieves state-of-the-art performance across 6 benchmarks (average 92.78%), reduces token consumption by approximately 50% compared to G-Designer, and supports role expansion without retraining.
BAMAS: Structuring Budget-Aware Multi-Agent Systems: This paper proposes the BAMAS framework, which employs Integer Linear Programming (ILP) to select the optimal LLM combination under budget constraints, and uses a reinforcement learning policy to choose the best collaboration topology (Linear/Star/Feedback/Planner-Driven). BAMAS achieves accuracy comparable to state-of-the-art multi-agent systems on GSM8K, MBPP, and MATH, while reducing costs by up to 86%.
Beyond Detection: Exploring Evidence-based Multi-Agent Debate for Misinformation Intervention and Persuasion: This paper proposes ED2D, a framework that integrates an evidence retrieval module into a multi-agent debate (MAD) system to enhance misinformation detection accuracy. Through controlled human experiments, it provides the first comparative evaluation of AI-generated debate transcripts versus expert human fact-checks in terms of persuasiveness and belief correction, revealing a double-edged-sword effect: the AI debate system achieves expert-level persuasiveness when correct, but may amplify misinformation when wrong.
COACH: Collaborative Agents for Contextual Highlighting -- A Multi-Agent Framework for Sports Video Analysis: This paper proposes COACH — a reconfigurable multi-agent framework built on a shared backbone model — that achieves role specialization via intent-driven strategy orchestration and structured CoT fine-tuning, significantly outperforming generalist models such as Gemini 2.5 Pro on both QA and summarization tasks in badminton video analysis.
Conversational Learning Diagnosis via Reasoning Multi-Turn Interactive Learning: This paper proposes ParLD (Preview-Analyze-Reason framework), which leverages multi-agent collaboration to achieve fine-grained, turn-level diagnosis of students' cognitive states during conversational learning. ParLD outperforms traditional knowledge tracing methods by 10% on performance prediction and substantially improves tutoring outcomes.
EcoAgent: An Efficient Device-Cloud Collaborative Multi-Agent Framework for Mobile Automation: This paper proposes EcoAgent, a closed-loop device-cloud collaborative multi-agent framework for mobile automation. By combining Dual-ReACT two-level reasoning and planning, lightweight on-device verification feedback, and a Pre-Understanding text compression module, EcoAgent achieves success rates comparable to fully cloud-based agents on AndroidWorld while substantially reducing latency (3.9 s vs. 15.3 s), cloud invocations (by 89%), and upstream data volume (by a factor of 48.6).
FinRpt: Dataset, Evaluation System and LLM-based Multi-agent Framework for Equity Research Report Generation: This paper is the first to systematically define the task of automated Equity Research Report (ERR) generation. It constructs the FinRpt dataset (6,825 high-quality bilingual reports integrating 7 categories of financial data), proposes an 11-metric evaluation framework, and designs the FinRpt-Gen generation framework with 9 collaborative agents featuring a three-stage enhancement pipeline (rating correction / expert review / language polishing). Human evaluation shows that generated reports approach expert-written quality.
Hierarchical Pedagogical Oversight: A Multi-Agent Adversarial Framework for Reliable AI Tutoring: This paper proposes the HPO framework, which achieves reliable AI tutoring evaluation through a three-phase pipeline (Intelligence Distillation → Adversarial Debate → Synthesis and Judgment). Using only an 8B-parameter model, HPO achieves a Macro F1 of 0.845 on the MRBench middle-school mathematics dialogue dataset, surpassing GPT-4o (0.812) by 0.033 (3.3 points), demonstrating that interaction structure, rather than model scale, is the key to reliable AI tutoring.
iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference: This paper proposes iMAD, a framework for selectively triggering multi-agent debate (MAD). A single agent first generates a structured response with self-critique, from which 41 interpretable linguistic/semantic features are extracted; a lightweight MLP classifier trained with the FocusCal loss then determines whether to trigger MAD. Across 6 QA/VQA benchmarks, iMAD reduces token overhead by up to 92% while improving accuracy by up to 13.5%.
Learning to Generate and Extract: A Multi-Agent Collaboration Framework for Zero-shot Document-level Event Arguments Extraction: This paper proposes a "Propose-Evaluate-Revise" multi-agent collaboration framework (comprising a generator agent and an evaluator agent) to address zero-shot document-level event argument extraction (ZS-DEAE). The generator agent synthesizes training data for unseen event types, while the evaluator agent provides log-likelihood-based quality scores to guide reinforcement learning for iterative optimization, simultaneously improving synthetic data quality and extraction performance.
LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models: This paper proposes LieCraft, a multi-player hidden-role game framework (with constraint-satisfaction-guaranteed balance) to evaluate the strategic deception capabilities of 12 LLMs. It finds that all tested frontier LLMs—including GPT-4o—exhibit deception rates exceeding 90% under incentive conditions, demonstrating that safety training has not eliminated the capacity for strategic lying.
LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval: This paper proposes LLandMark, a modular multi-agent framework that achieves landmark-aware multimodal interactive video retrieval through landmark knowledge augmentation, LLM-assisted image retrieval, and OCR refinement modules, achieving a total score of 77.40/88 in the Vietnamese large-scale video retrieval challenge (HCMAIC 2025).
LungNoduleAgent: A Collaborative Multi-Agent System for Precision Diagnosis of Lung Nodules: This paper proposes LungNoduleAgent, the first collaborative multi-agent system for lung nodule analysis. It simulates the clinical workflow through a three-stage pipeline—Nodule Spotter, Simulated Radiologist, and Doctor Agent System—and substantially outperforms mainstream VLMs (GPT-4o, Claude 3.7 Sonnet) and medical agents (MedAgent-Pro) on CT report generation and malignancy grading tasks.
MAMA-Memeia! Multi-Aspect Multi-Agent Collaboration for Depressive Symptoms Identification in Memes: This paper proposes MAMAMemeia, a multi-agent multi-aspect collaborative discussion framework grounded in the Cognitive Analytic Therapy (CAT) competency framework, designed to identify depressive symptoms from social media memes. It additionally introduces the RESTOREx resource (containing both LLM-generated and human-annotated rationales), achieving a 7.55% improvement in macro-F1 over 30+ competing methods.
MAPS: Multi-Agent Personality Shaping for Collaborative Reasoning: This paper proposes MAPS, a five-agent collaborative reasoning framework that assigns distinct "personalities" to four functional agents based on the Big Five personality theory — Interpreter (Openness), Aligner (Agreeableness), Scholar (Conscientiousness), and Solver (Extraversion) — to achieve heterogeneous collaboration, complemented by a Critic Agent (Neuroticism → Socratic reflection) for iterative refinement. MAPS surpasses the GPT-4o baseline by 15.84% on MathVista/OlympiadBench/EMMA and, for the first time, exceeds human expert performance by 3.58%.
MedLA: A Logic-Driven Multi-Agent Framework for Complex Medical Reasoning with Large Language Models: This paper proposes MedLA, the first multi-agent medical reasoning framework based on syllogistic logic trees. Each agent organizes its reasoning as an explicit logic tree composed of syllogistic nodes (major premise–minor premise–conclusion). Multiple agents align and revise their logic trees at the premise level through graph-guided multi-round discussions. With an 8B model, MedLA outperforms all baselines by 7.4% on MedDDx and achieves an average accuracy of 69.9% on medical QA benchmarks, surpassing 70B RAG-based models.
Parallelism Meets Adaptiveness: Scalable Documents Understanding in Multi-Agent LLM Systems: This paper proposes an adaptively coordinated multi-agent LLM framework that achieves a 27% improvement in compliance accuracy and a 74% reduction in revision rate on high-complexity financial document analysis tasks, through parallel competitive evaluation, dynamic task routing, and bidirectional feedback mechanisms.
SafeSieve: From Heuristics to Experience in Progressive Pruning for LLM-based Multi-Agent Communication: This paper proposes SafeSieve, a progressive, adaptive pruning framework for multi-agent communication. Through a two-stage edge scoring mechanism combining semantic-heuristic initialization and history-feedback-driven refinement, together with 0-extension clustering, SafeSieve achieves 94.01% average accuracy across 6 benchmarks while reducing token consumption by 12.4%–27.8%, and demonstrates inherent robustness against prompt injection attacks.
Scalable and Accurate Graph Reasoning with LLM-Based Multi-Agents: This paper proposes GraphAgent-Reasoner (GAR), inspired by distributed graph computation theory. It decomposes graph problems into node-centric subtasks assigned to multiple agents, which collaborate through neighbor message passing. GAR extends the graph size that LLMs can handle from 100 to 1,000 nodes, and significantly outperforms existing state-of-the-art methods on polynomial-time graph reasoning tasks.
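The node-centric message passing idea can be illustrated with a small sketch in which plain Python state stands in for the LLM agents. It assumes a synchronous, Bellman-Ford-style protocol for single-source shortest paths: in each round every node sends its current distance plus the edge weight to each neighbor, and each node keeps the smallest value it receives. The function name, the graph, and the choice of task are hypothetical and are not taken from GAR.

```python
def shortest_paths_by_message_passing(graph, source):
    """graph maps node -> {neighbor: edge weight}; every neighbor must be a key."""
    inf = float("inf")
    dist = {node: inf for node in graph}
    dist[source] = 0
    for _ in range(len(graph) - 1):            # at most |V| - 1 rounds are needed
        inbox = {node: [] for node in graph}
        # Send phase: each reached node messages its neighbors.
        for node, neighbors in graph.items():
            if dist[node] < inf:
                for nbr, weight in neighbors.items():
                    inbox[nbr].append(dist[node] + weight)
        # Receive phase: each node updates only from its own inbox.
        updated = False
        for node, messages in inbox.items():
            if messages and min(messages) < dist[node]:
                dist[node] = min(messages)
                updated = True
        if not updated:                        # no node changed, so we can stop
            break
    return dist


# Hypothetical 4-node weighted graph.
graph = {
    "a": {"b": 1, "c": 4},
    "b": {"a": 1, "c": 2, "d": 5},
    "c": {"a": 4, "b": 2, "d": 1},
    "d": {"b": 5, "c": 1},
}
dist = shortest_paths_by_message_passing(graph, "a")
```

Each node reads only its own inbox, so the per-node work is independent of the total graph size. That is the property a node-centric decomposition relies on.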
Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems: This paper presents the first systematic security analysis of LLM-based multi-agent software development systems (ChatDev/MetaGPT/AgentVerse). It proposes the IMBIA attack framework, which covers two threat scenarios (malicious user + benign agents / benign user + malicious agent) and 12 malicious behaviors across 5 malware families, and achieves an attack success rate (ASR) of up to 93% on ChatDev. The Adv-IMBIA adversarial defense reduces ASR by 40–73%.
Thucy: An LLM-based Multi-Agent System for Claim Verification across Relational Databases: This paper presents Thucy, the first multi-agent claim verification system supporting cross-database and cross-table reasoning. Led by a Verifier agent, it coordinates three specialized agents (Data/Schema/SQL Expert) with zero prior knowledge of the data sources, enabling autonomous discovery, reasoning, and SQL evidence generation. Thucy surpasses the previous SOTA by 5.6 percentage points on TabFact (94.3%).