
👥 Multi-Agent

💬 ACL2026 · 40 paper notes

📌 Same area in other venues: 📷 CVPR2026 (2) · 🔬 ICLR2026 (47) · 🧪 ICML2026 (24) · 🤖 AAAI2026 (26) · 🧠 NeurIPS2025 (17) · 🧪 ICML2025 (7)

🔥 Top topics: Agents ×35 · LLM ×13 · Reasoning ×5 · Question Answering ×3 · Adversarial Robustness ×2

A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation: This paper proposes MAFIG, a framework that leverages multi-agent collaboration, feature-level evaluators, and iterative revision to generate multiple-choice reading comprehension questions. Compared to single-turn prompting, it significantly improves the satisfaction rate of constraints such as vocabulary, passage length, sentence length, reasoning complexity, factuality, and option neutrality, while providing a more stable monotonic increase in difficulty.
AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models: AgenticEval redefines LLM safety evaluation as a "continuous, self-evolving red-teaming process": the Specialist decomposes unstructured regulatory text into an atomic rule knowledge base; the Generator creates multimodal and multi-format Question Groups centered around each rule; the Evaluator + Analyst continuously transform failures from the current round into more aggressive attack strategies for the next. After three iterations, the compliance rate of GPT-5 under the EU AI Act plummeted from 72.50% to 36.36%, revealing that static benchmarks significantly overestimate the safety levels of large models.
ATLAS: Adaptive Trading with LLM AgentS Through Dynamic Prompt Optimization and Multi-Agent Coordination: This paper proposes the ATLAS multi-agent financial trading framework and the Adaptive-OPRO prompt optimization method. By utilizing specialized analyst agents to prepare heterogeneous market information and dynamically optimizing the instruction prompts of the central trading agent based on delayed noisy feedback, the system significantly outperforms baselines across diverse volatile market environments.
AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage: AutoReproduce proposes a multi-agent framework that utilizes a "Paper Lineage" algorithm to mine implicit domain knowledge from referenced literature. This enables end-to-end automatic reproduction of paper experiments, achieving a code execution rate of 94.87% and a performance gap of only 19.72% on the self-constructed ReproduceBench.
BookAgent: Orchestrating Safety-Aware Visual Narratives via Multi-Agent Cognitive Calibration: BookAgent is a safety-aware multi-agent framework that utilizes a three-stage closed-loop architecture consisting of a Value-Aligned Storyboard (VAS) + Iterative Cross-modal Refinement (ICR) + Temporal Cognitive Calibration (TCC) to generate high-quality, character-consistent, and safety-compliant picture book stories end-to-end from user drafts.
CIA: Inferring the Communication Topology from LLM-based Multi-Agent Systems: This paper proposes CIA (Communication Inference Attack), which, under a strict black-box setting where only the final output is observable, induces multi-agent systems to expose intermediate agent reasoning through adversarial queries. By combining global bias disentanglement with LLM weak supervision to model semantic correlations, it successfully reconstructs the MAS communication topology, achieving an average AUC of 0.87 and a peak of 0.99.
Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games: This paper proposes a collaborative multi-agent framework for the automated generation of high-quality Murder Mystery game scripts and training data. A two-stage training strategy (CoT fine-tuning + GRPO reinforcement learning with ScoreAgent reward shaping) enhances the multi-hop reasoning capability of VLMs under imperfect information, significantly improving VLM narrative reasoning, fact extraction, and deception resistance on WhodunitBench.
Conjunctive Prompt Attacks in Multi-Agent LLM Systems: This paper investigates conjunctive prompt attacks in multi-agent LLM systems: trigger keys embedded in user queries and hidden templates in compromised remote agents appear harmless individually, but activate harmful behavior when routing brings them to the same agent. Existing defenses (PromptGuard, Llama-Guard, etc.) cannot reliably prevent these attacks.
ConSensus: Multi-Agent Collaboration for Multimodal Sensing: ConSensus is a training-free multi-agent sensor fusion framework that assigns specialized agents to independently interpret different sensing modalities. By utilizing semantic fusion, statistical consensus, and hybrid arbitration, it achieves an average 7.1% accuracy improvement over single-agent methods across five multimodal sensing benchmarks, while reducing fusion token costs to approximately 1/12.7 of multi-round debate methods.
Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection: This paper proposes the RADAR framework, which detects half-truths based on omitted context through role-anchored (Politician vs. Scientist) multi-agent debate. Combined with a dual-threshold adaptive early stopping mechanism, it consistently outperforms single-agent and traditional multi-agent baselines under noisy retrieval conditions.
Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation: This paper systematically reveals the "diversity collapse" phenomenon in multi-agent LLM systems by evaluating over 10,000 research proposals across three levels: model intelligence, agent cognition, and system dynamics. It demonstrates that stronger models, authority-driven role assignments, and dense communication topologies suppress semantic diversity, identifying the root cause as the interaction structure rather than a lack of model capability.
Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search: This paper proposes DITS, using "training data influence scores" instead of traditional Q-values as the guiding signal for MCTS tree search and preference data selection. It derives an influence score estimation formula for non-differentiable metrics that can be calculated via forward inference, enabling MAS to achieve a 2.5–2.7% average improvement over Optima-iSFT-DPO across 7 datasets and 3 multi-agent tasks.
EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery: This paper proposes EvoSci, which models the generation of scientific ideas as a multi-agent collaboration and bio-inspired evolutionary cycle. By constructing a problem space, executing research in teams, providing reviewer feedback, and performing entity-level crossover/variation/selection, it generates research ideas with higher novelty and overall quality across 10 scientific topics.
EvoSpark: Endogenous Interactive Agent Societies for Unified Long-Horizon Narrative Evolution: EvoSpark proposes a multi-agent framework for long-horizon narrative evolution. It addresses social memory stacking and narrative-spatial misalignment through three designs: hierarchical recursive memory (RSB for social cognitive metabolism), Generative Mise-en-scène (GMS for character-location-plot alignment), and the Emergent Character Grounding Protocol (ECGP to transform LLM hallucinations into persistent characters).
Explicit Trait Inference for Multi-Agent Coordination: This paper proposes the Explicit Trait Inference (ETI) method, which enables LLM agents to reason about and track the behavioral characteristics of partners based on the psychological dimensions of warmth and competence. This approach reduces payoff losses by 45-77% in economic games and improves task performance by 3-29% on MultiAgentBench.
MAGEO: From Experience to Skill — Multi-Agent Generative Engine Optimization via Reusable Strategy Learning: This paper reformulates Generative Engine Optimization (GEO) from instance-wise heuristic optimization into a strategy learning problem. It proposes the MAGEO multi-agent framework—where the execution layer consists of collaboration between preference/planning/editing/evaluator agents, and the learning layer distills validated editing patterns into reusable engine-specific strategy skills. By introducing the Twin Branch causal evaluation protocol and DSV-CF dual-axis metrics, MAGEO significantly outperforms heuristic baselines across three mainstream engines.
From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation: This paper constructs JurisCQAD—a large-scale dataset containing 43,000+ real Chinese legal consultations—and proposes the JurisMA multi-agent framework. By utilizing legal element graphs for structured task decomposition and dynamic multi-agent collaboration (Manager Agent + Format Check + Law Search), it significantly outperforms general and legal-specific LLMs on LawBench.
HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents: HACHIMI formalizes "student persona generation" as the TAD-PG (Theory-Aligned and Distribution-Controllable) task. By employing a "propose-validate-revise" multi-agent framework integrated with neuro-symbolic validators and stratified sampling, it produces 1 million synthetic student personas for grades 1–12. Group-level evaluations on CEPS / PISA 2022 reveal a distinct "fidelity gradient"—high alignment for constructs like mathematics and curiosity, but weak alignment for well-being and family dynamics.
Latent Agents: A Post-Training Procedure for Internalized Multi-Agent Debate: This paper proposes the Internalized Multi-Agent Debate (IMAD) framework, which uses a two-stage post-training pipeline (SFT + GRPO) to "internalize" multi-agent debate into a single LLM. This approach reduces token consumption by up to 93%, and activation steering shows that the internalized model retains separable and controllable "agent subspaces" within its latent space.
LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey: This paper provides the first systematic review of "LLM-based Human-Agent Collaboration and Interaction Systems (LLM-HAS)"—reintegrating humans into the agent loop. It establishes a unified taxonomy across five dimensions (Environment/Profiling, Human Feedback, Interaction Type, Orchestration, and Communication) and introduces a Human Agency Scale (A1–A5) to quantify the necessary depth of human involvement in tasks.
MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing: MASFactory models LLM Multi-Agent Systems (MAS) as Node/Edge computational graphs and introduces the "Vibe Graphing" three-stage pipeline (Role Assignment → Structure Design → Semantic Completion) to compile natural language intent into executable MAS workflows. It provides Context/Message Adapters, ComposedGraph templates for reuse, and VS Code visualization. On 7 benchmarks, it replicates 5 representative MAS with comparable or superior performance; end-to-end Vibe Graphing reduces ChatDev's 1,511 lines of code to 45 lines, with API costs an order of magnitude lower than Vibe Coding.
MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering: The authors propose MATA, a multi-agent framework for TableQA that uses a scheduler to prioritize reasoning paths (CoT/PoT/text2SQL), a confidence checker to filter answers, and a judge agent for arbitration. This model-agnostic, efficient framework achieves an average EM improvement of 40.1% across 10 LLMs.
Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data: MALMAS is a memory-augmented LLM multi-agent system for automated feature generation on tabular data. It combines a workforce of six specialized agents, each exploring a different dimension of the feature space, with a three-level memory mechanism (Procedural/Feedback/Conceptual) for cross-iteration optimization, and outperforms existing baselines across 16 classification and 7 regression datasets.
Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling: This paper compares self-consistency, self-refinement, multi-agent debate, and Mixture-of-Agents under a unified computational budget. It finds that multi-agent reasoning, particularly MoA, is more efficient on the Pareto front, improving MMLU-Pro accuracy from 64.3% to 71.4% at approximately 20x CoT budget.
ODUTQA-MDC: A Task for Open-Domain Underspecified Tabular QA with Multi-turn Dialogue-based Clarification: This paper proposes the ODUTQA-MDC task and benchmark, systematically investigating the detection and multi-turn dialogue clarification of user query ambiguity in open-domain scenarios for the first time. It constructs a large-scale dataset containing 25,105 QA pairs and designs the MAIC-TQA multi-agent framework to perform end-to-end "detection-clarification-reasoning" tabular QA.
OxyGent: Making Multi-Agent Systems Modular, Observable, and Evolvable via Oxy Abstraction: OxyGent encapsulates agents, tools, LLMs, and reasoning processes into pluggable Oxy atomic components. By utilizing permission-driven dynamic planning and the OxyBank data feedback mechanism, it simplifies the construction, monitoring, and continuous evolution of industrial-grade multi-agent systems.
PaperMentor: A Human-Centered Multi-Agent Writing Tutor for AI Research Papers on Overleaf: PaperMentor codifies the writing experience of senior researchers into an "Expert Skill Library" and employs 12 agents with distinct divisions of labor to review LaTeX papers in parallel. It provides actionable revision suggestions via Overleaf's native inline annotations without ghostwriting for the user. In user studies, 90.6% of comments were judged as "actionable," with both validity and actionability significantly exceeding the GPT-5.2 baseline without the skill library.
PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation: PosterForest utilizes a Poster Tree, which simultaneously encodes the hierarchical semantics of a paper and the spatial layout of a poster, as an intermediate representation. It employs recursive collaborative optimization between Content, Layout, and Feedback agents to generate scientific posters in a training-free manner. In human evaluations, it achieved a 59.2% overall preference, significantly outperforming P2P and Paper2Poster.
Preference Estimation via Opponent Modeling in Multi-Agent Negotiation: This paper proposes a preference estimation method that integrates LLM-extracted natural language preference signals into a Bayesian opponent modeling framework. By combining qualitative cues with quantitative bidding information through a linguistic likelihood function, it improves the Full Agreement Rate (FAR) from 37% to 62% in multi-party multi-issue negotiations.
PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows: PROTEA is an offline debugging platform for multi-agent LLM workflows. It localizes the cause of degraded final answers to specific nodes through node-level evaluation, backward-generated intermediate expectations, and editable prompt revisions, enabling a closed-loop verification of modification effects.
RoadMapper: A Multi-Agent System for Roadmap Generation of Solving Complex Research Problems: This paper proposes the RoadMap research roadmap generation benchmark and the RoadMapper multi-agent system, which forms a closed loop consisting of knowledge retrieval, logic/granularity critiques, revision, and a DPO evaluator. On complex Chinese and English research problems, it improves performance by an average of 7-9 points compared to direct prompting and significantly reduces the time cost for experts to design roadmaps.
Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration: The authors propose ExtAgents, a multi-agent framework that addresses the performance bottleneck where existing multi-agent methods fail to scale when external knowledge exceeds the context window. By implementing global knowledge synchronization (information exchange among all Seeking Agents) and cumulative reasoning (gradual injection of filtered knowledge into the Reasoning Agent), the framework significantly improves performance in multi-hop QA and long summary generation tasks.
Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems: TraceElephant advocates that failure attribution in multi-agent systems should be evaluated under full execution traces visible to developers. It provides 220 failed traces with annotations for responsible agents and critical failure steps, demonstrating that full observability improves step-level attribution from 16% (output-only) to over 28%-30%.
SILO-BENCH: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems: This paper proposes SILO-BENCH, a role-agnostic benchmark for evaluating distributed coordination in multi-agent LLM systems. Comprising 30 algorithmic tasks across three communication complexity levels and 1620 experiments over 54 configurations, it reveals a critical "communication-reasoning gap": while agents can spontaneously form rational communication topologies and actively exchange information, they systematically fail to integrate distributed states into correct answers.
Social Dynamics as Critical Vulnerabilities that Undermine Objective Decision-Making in LLM Collectives: This paper shows that representative agents in LLM multi-agent systems are not only limited by their own reasoning capabilities but are also significantly influenced by "social dynamics" (the number of peers, peer capabilities, argument length, and rhetorical style), leading to incorrect decisions on tasks with objective answers.
To Trust or Not to Trust: Attention-Based Trust Management for LLM Multi-Agent Systems: This paper proposes the first comprehensive definition of "trustworthiness" for LLM Multi-Agent Systems (LLM-MAS) based on six orthogonal dimensions of Grice's Cooperative Principle. It discovers that LLM attention patterns can distinguish different types of trustworthiness violations, leading to the design of A-Trust, a lightweight evaluation method and an end-to-end Trust Management System (TMS) that improves malicious message detection rates to 77-90% under various attacks.
Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs: To be added after further reading
Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Collaboration: This paper proposes SpreadsheetAgent, a two-stage multi-agent framework that achieves robust real-world spreadsheet understanding by performing progressive region reading and cross-verification using code execution, vision, and LaTeX formats, without exceeding LLM context limits.
Towards Self-Improving Error Diagnosis in Multi-Agent Systems: This paper proposes the ErrorProbe framework, which achieves self-improving semantic fault attribution in multi-agent systems through MAST taxonomy-driven structured decomposition, symptom-driven backward tracing, and a verified memory mechanism, significantly outperforming baselines in step-level error localization.
When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning: This paper points out that LLMs in multi-agent debates change their stances based on "who said it" rather than "what was said," and quantifies and mitigates this identity-driven bias through response anonymization and the Identity Bias Coefficient (IBC).
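The response anonymization mentioned in the last entry can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `anonymize_responses` and `build_debate_prompt`, the neutral "Response N" labels, and the shuffling step are all assumptions made for illustration. The Identity Bias Coefficient is not reproduced here because the summary does not define it.

```python
import random
from typing import Dict, List, Optional


def anonymize_responses(responses: Dict[str, str],
                        seed: Optional[int] = None) -> List[str]:
    """Drop speaker names and shuffle order (hypothetical helper).

    `responses` maps an agent name to that agent's latest answer.
    The returned lines carry only neutral labels, so a reader cannot
    tell "who said it" from either the label or the position.
    """
    rng = random.Random(seed)  # seeded for reproducible shuffles
    texts = list(responses.values())
    rng.shuffle(texts)
    return [f"Response {i + 1}: {text}" for i, text in enumerate(texts)]


def build_debate_prompt(question: str,
                        responses: Dict[str, str],
                        seed: Optional[int] = None) -> str:
    """Assemble a next-round debate prompt from anonymized responses
    (hypothetical prompt layout, assumed for this sketch)."""
    lines = anonymize_responses(responses, seed=seed)
    return "\n".join(
        [f"Question: {question}",
         "Responses from the previous round:",
         *lines,
         "Considering only the content above, give your updated answer."]
    )


if __name__ == "__main__":
    demo = {"Agent-A": "The answer is 4.", "Agent-B": "I think it is 5."}
    print(build_debate_prompt("What is 2 + 2?", demo, seed=0))
```

Shuffling is included because a fixed ordering could itself leak identity across rounds. Whether the paper's method shuffles, or how it treats an agent's own previous answer, is not stated in the summary.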