
🦾 LLM Agent

💬 ACL2025 · 55 paper notes

📌 Same area in other venues: 📷 CVPR2026 (42) · 🔬 ICLR2026 (162) · 💬 ACL2026 (82) · 🧪 ICML2026 (59) · 🤖 AAAI2026 (33) · 🧠 NeurIPS2025 (39)

🔥 Top topics: Agents ×29 · LLM ×22 · Multimodal/VLM ×4 · Reasoning ×3 · Few-/Zero-Shot Learning ×3

Agentic Knowledgeable Self-Awareness: This paper proposes KnowSelf, a data-driven approach that labels special tokens on the agent's self-exploration trajectories to identify different thinking situations (fast thinking, slow thinking, knowledgeable thinking). Through a two-stage training process (SFT + RPO), the agent model learns to autonomously judge when to invoke external knowledge, achieving optimal planning performance with minimal knowledge consumption cost.
Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools: Agentic Reasoning proposes a framework that integrates three agent tools (web search, code execution, and a knowledge-graph-based memory called Mind-Map) into the LLM reasoning process. It improves the accuracy of DeepSeek-R1 on Humanity's Last Exam from 9.4% to 23.8% (+14.4 percentage points) and on GPQA from 71.5% to 81.2%, approaching the performance level of OpenAI Deep Research.
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems: This paper proposes the Agentic Reward Modeling paradigm and its implementation, RewardAgent, which integrates traditional human preference-based reward models with verifiable correctness signals from factuality and instruction-following verification. It significantly enhances the reliability of reward models through a three-module architecture consisting of a Router, Verification Agents, and a Judger.
Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks: This paper is the first to systematically study adversarial attacks in realistic multi-agent LLM systems featuring bandwidth constraints, latency, and security mechanisms. It proposes attack methods based on Minimum-Cost Maximum-Flow (MCMF) topological optimization and Permutation-Invariant Evasion Loss (PIEL), achieving up to a 7-fold increase in success rate compared to traditional attacks across multiple LLM architectures.
An Empirical Study on LLM-based Agents for Automated Bug Fixing: This paper systematically analyzes the top six LLM-based bug-fixing systems on SWE-bench Verified, revealing the capabilities and future directions of current agent systems across three dimensions: overall fixing effectiveness, fault localization accuracy, and the utility of bug reproduction.
AndroidGen: Building an Android Language Agent under Data Scarcity: This paper proposes the AndroidGen framework, which enhances LLM capabilities for Android operations when high-quality training data is scarce, using four modules: Experience Search (ExpSearch), Reflection Planning (ReflectPlan), Automatic Checking (AutoCheck), and Step-level Critic (StepCritic). By automatically generating trajectory data, it trains open-source mobile agents without manual annotation.
Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning: A fully automated thematic analysis (TA) pipeline based on multi-agent LLMs is proposed. Through division of labor among specialized roles and optional RLHF fine-tuning, the system achieves end-to-end theme extraction from clinical narratives, eliminating the need for manual coding and full-text review.
Bel Esprit: Multi-Agent Framework for Building AI Model Pipelines: Proposes Bel Esprit, a multi-agent conversational framework. Through a four-step collaboration of Mentalist (requirement clarification) \(\rightarrow\) Builder (pipeline construction) \(\rightarrow\) Inspector (validation) \(\rightarrow\) Matchmaker (model mapping), it automatically transforms vague natural language requirements from users into multi-model AI pipeline graphs, achieving 25.2% EM and 37.0 GED (with GPT-4o Builder) on 441 pipeline test cases.
Beyond Numeric Rewards: In-Context Dueling Bandits with LLM Agents: This work systematically evaluates the zero-shot in-context decision-making capabilities of LLMs in Dueling Bandits (preference-feedback reinforcement learning). It reveals that GPT-4 Turbo excels in weak regret but displays a gap in strong regret. Consequently, the LEAD (LLM with Enhanced Algorithmic Dueling) framework is proposed, which achieves both theoretical guarantees and robustness through fine-grained, adaptive integration of classical DB algorithms with LLM agents.
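The weak/strong regret distinction in the entry above can be illustrated with a small sketch. It assumes one common convention from the dueling-bandits literature (strong regret sums the gaps of both duelled arms to the Condorcet winner, weak regret takes the smaller gap); normalizations vary between papers, and the function name, preference matrix, and duel sequence are hypothetical illustrations, not taken from the paper.

```python
def dueling_regret(pref, best, duels):
    """Cumulative (strong, weak) regret for a sequence of duels.

    pref[i][j] is the probability that arm i beats arm j, and `best`
    is the index of the Condorcet winner. Assumed convention: the gap
    of an arm is pref[best][arm] - 0.5.
    """
    strong = 0.0
    weak = 0.0
    for arm_a, arm_b in duels:
        gap_a = pref[best][arm_a] - 0.5
        gap_b = pref[best][arm_b] - 0.5
        strong += gap_a + gap_b      # zero only if BOTH arms are the winner
        weak += min(gap_a, gap_b)    # zero as soon as ONE arm is the winner
    return strong, weak


# Hypothetical 3-arm preference matrix; arm 0 beats every other arm.
pref = [
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]
duels = [(0, 1), (0, 0), (1, 2)]
strong, weak = dueling_regret(pref, best=0, duels=duels)
print(f"strong={strong:.2f} weak={weak:.2f}")
```

Under this convention, an agent that always includes the best arm in its duel incurs no weak regret but still accumulates strong regret until it duels the best arm against itself, which is one way to read the gap reported for GPT-4 Turbo.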
BookWorld: From Novels to Interactive Agent Societies for Story Creation: BookWorld is the first multi-agent social simulation system based on novels. It constructs interactive virtual worlds by extracting character data and worldview specifications from source books, allowing novel characters to act and interact autonomously to generate creative stories, outperforming previous story generation methods in 75.36% of pairwise comparisons.
Browsing Like Human: A Multimodal Web Agent with Experiential Fast-and-Slow Thinking: This paper proposes the WebExperT framework, which simulates the human cognitive pattern of "fast and slow thinking" and continuously improves decision-making through an experiential learning mechanism that reflects on failures. It achieves outstanding performance under both supervised and unsupervised settings on the Mind2Web benchmark.
Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model: This paper proposes CoALM (Conversational Agentic Language Model). By constructing a multi-task training dataset, CoALM-IT, that integrates multi-turn ReAct reasoning and complex API calls, the authors train a unified model that excels in both traditional task-oriented dialogue (TOD) and language agent (LA) tool use, outperforming specialized models such as GPT-4o on three benchmarks: MultiWOZ, BFCL V3, and API-Bank.
Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions: This paper presents the first systematic study on the vulnerability of multimodal GUI agents to environmental distractions (e.g., pop-up ads, recommended content). In natural, non-adversarial scenarios, even the most advanced MLLMs (including GPT-4o) exhibit a 20-40% probability of being distracted by irrelevant environmental content, leading them to execute actions that deviate from the user's objectives.
Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration: DPT-Agent is proposed, which is the first method to systematically integrate Dual Process Theory into a language agent framework. It employs a Finite State Machine (FSM) + code-as-policy as the fast, intuitive System 1, and an LLM with Theory of Mind (ToM) + asynchronous reflection as the slow, deliberative System 2. This achieves autonomous, real-time simultaneous human-AI collaboration for the first time (in a challenging version of Overcooked).
EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions: The EMULATE multi-agent fact-checking framework is proposed, which simulates the complete human action chain for verifying claims (search → ranking → content evaluation → evidence sufficiency judgment → classification) through 7 specialized LLM agents, outperforming existing methods in both Macro-F1 and Weighted-F1 on three fact-checking benchmarks.
Enhancing LLM Agent Safety via Causal Influence Prompting: This paper proposes CIP (Causal Influence Prompting), which utilizes Causal Influence Diagrams (CIDs) to structurally represent decision-making causal relationships for LLM agents. By employing a three-step pipeline—CID initialization, CID-guided interaction, and iterative CID updating—CIP effectively enhances agent safety in code execution and mobile device control tasks.
Explorer: Scaling Exploration-Driven Web Trajectory Synthesis for Multimodal Web Agents: Proposes Explorer, a scalable multi-agent pipeline that synthesizes large-scale multimodal web trajectory datasets (94K successful trajectories, 49K+ URLs, 720K screenshots) through autonomous web exploration and step-by-step refinement. The trained Explorer-7B matches or exceeds GPT-4 performance on benchmarks like Mind2Web-Live and MiniWob++.
FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models: This paper proposes FACT-AUDIT, an adaptive dynamic fact-checking evaluation framework based on importance sampling and multi-agent collaboration. By dynamically generating test data, iteratively probing model weaknesses, and simultaneously evaluating both verdict predictions and justification quality, it comprehensively audits the boundaries of LLMs' fact-checking capabilities.
GeAR: Graph-enhanced Agent for Retrieval-augmented Generation: GeAR enhances the multi-hop discovery capabilities of traditional retrievers through a graph expansion mechanism (SyncGE), and combines it with a Gist Memory agent framework to achieve multi-step retrieval reasoning. It outperforms existing SOTA by more than 10% on multi-hop QA datasets like MuSiQue, while consuming fewer tokens and iterations.
GUI Agents: A Survey: This paper presents a comprehensive survey of Graphical User Interface (GUI) agents based on large foundation models. It proposes a unified analytical framework covering four core capabilities—perception, reasoning, planning, and action—systematically reviews GUI Agent benchmarks, evaluation metrics, architectural designs, and training methods, and discusses key challenges and future directions.
GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent: This paper proposes GUI-explorer, a training-free GUI agent that collects function-aware interaction trajectories through autonomous exploration and mines transition-aware knowledge from state-transition triplets without supervision, achieving task success rates of 53.7% on SPA-Bench and 47.4% on AndroidWorld.
GUICourse: From General Vision Language Model to Versatile GUI Agent: This paper introduces GUICourse, a suite of datasets (GUIEnv/GUIAct/GUIChat) designed to train versatile GUI agents from general Vision-Language Models (VLMs). Through a two-stage training pipeline, it first enhances OCR and grounding capabilities, and then injects GUI-specific knowledge, enabling a small model with only 3.1B parameters to achieve effective performance on web and smartphone GUI navigation tasks.
GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents: This paper proposes GuideBench, a benchmark designed to systematically evaluate the capability of LLMs in following domain-oriented guidelines. It covers 1,272 instances across 7 task categories and evaluates 18 LLMs along three dimensions: rule compliance, robustness to rule updates, and alignment with human preferences. The results indicate significant room for improvement for current models when following complex domain rules.
Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement: Introduces Gödel Agent, a self-referential agent framework inspired by the Gödel Machine, which can read and modify its own code (including modifying its own modification logic) at runtime via Python monkey patching to achieve recursive self-improvement. It outperforms hand-crafted and meta-learning-optimized agents on DROP, MGSM, MMLU, and GPQA.
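The runtime self-modification described above relies on monkey patching, that is, rebinding a function or method while the program is running. The following is a minimal sketch of that general Python mechanism only; the `Agent` class, its `solve` and `self_update` methods, and `improved_solve` are hypothetical names chosen for illustration and are not taken from the Gödel Agent code.

```python
class Agent:
    """Toy agent; a hypothetical stand-in, not the paper's implementation."""

    def solve(self, x: int) -> int:
        # Original policy: return the input unchanged.
        return x

    def self_update(self, new_solve) -> None:
        # Monkey patch: rebind the class attribute at runtime so that
        # every later call to solve() runs the new logic.
        Agent.solve = new_solve


def improved_solve(self, x: int) -> int:
    # Replacement policy (an arbitrary stand-in for "improved" logic).
    return x * 2


agent = Agent()
before = agent.solve(3)            # uses the original method
agent.self_update(improved_solve)  # the agent rewrites its own behavior
after = agent.solve(3)             # uses the patched method
print(before, after)
```

Because `self_update` is itself an ordinary attribute of `Agent`, it could be rebound in the same way, which is the sense in which such an agent can modify its own modification logic.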
iAgent: LLM Agent as a Shield between User and Recommender Systems: This paper proposes a user-agent-platform three-tier paradigm that inserts an LLM Agent as a protective layer between the user and the recommender system. It achieves personalized recommendation through instruction parsing, knowledge acquisition, reranking, and dynamic user profiling, yielding an average performance gain of 16.6% across four datasets while effectively mitigating the echo chamber effect and unfairness issues for inactive users.
LegalAgentBench: Evaluating LLM Agents in Legal Domain: Proposes LegalAgentBench, a comprehensive evaluation benchmark for LLM Agents in the Chinese legal domain. It consists of 17 real-world corpora, 37 tools, and 300 tasks covering multi-hop reasoning and writing, achieving fine-grained evaluation through keyword matching and process rate.
Enhancing Interpretable Image Classification Through LLM Agents and Conditional Concept Bottleneck Models: This paper proposes Conditional Concept Bottleneck Models (CoCoBMs) and an LLM-driven Concept Agent framework. By introducing a class-conditional concept scoring mechanism and dynamically refining the concept bank based on environmental feedback, the framework enhances classification accuracy by 6% while improving interpretability by approximately 30% across six datasets.
LLM Agents Making Agent Tools: This paper proposes ToolMaker, an autonomous agent framework that transforms GitHub code repositories into LLM-compatible tools. Given a repository URL and a task description, it automatically installs dependencies, generates invocation code, and debugs through a closed-loop self-correction mechanism. It successfully implements 80% of the tasks on a new benchmark spanning 15 complex tasks across various domains, significantly outperforming existing software engineering agents.
LocAgent: Graph-Guided LLM Agents for Code Localization: LocAgent parses a codebase into a directed heterogeneous graph (encompassing four relationships: contain/import/invoke/inherit) and designs unified tools (SearchEntity/TraverseGraph/RetrieveEntity) to guide the LLM Agent in multi-hop reasoning. This achieves high-accuracy code localization, reaching a 92.7% accuracy rate at the file level while reducing costs by 86% through fine-tuning open-source models.
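A directed heterogeneous code graph of the kind described above can be sketched as typed edges plus a bounded, relation-filtered traversal. This is an illustrative sketch under stated assumptions: the `CodeGraph` class, its `traverse` method, and the example entities are hypothetical and do not reproduce LocAgent's actual tools or schema; only the four relation names come from the summary.

```python
from collections import defaultdict, deque

# The four relation types named in the summary above.
RELATIONS = {"contain", "import", "invoke", "inherit"}


class CodeGraph:
    """Directed graph whose edges carry a relation type (hypothetical sketch)."""

    def __init__(self):
        # node -> list of (relation, target) pairs
        self.edges = defaultdict(list)

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.edges[src].append((relation, dst))

    def traverse(self, start: str, relations: set, max_hops: int) -> list:
        """Breadth-first search from `start`, following only edges whose
        relation is in `relations`, up to `max_hops` edges away."""
        seen = {start}
        queue = deque([(start, 0)])
        reached = []
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue  # do not expand beyond the hop limit
            for relation, target in self.edges[node]:
                if relation in relations and target not in seen:
                    seen.add(target)
                    reached.append(target)
                    queue.append((target, depth + 1))
        return reached


graph = CodeGraph()
graph.add_edge("pkg/app.py", "contain", "App")
graph.add_edge("App", "contain", "App.run")
graph.add_edge("App", "inherit", "Base")
graph.add_edge("App.run", "invoke", "load_config")

hits = graph.traverse("pkg/app.py", {"contain", "invoke"}, max_hops=3)
print(hits)
```

Filtering by relation type is what makes the graph heterogeneous in practice: the same start node yields different neighborhoods depending on whether the query follows containment, calls, imports, or inheritance.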
MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration: This paper proposes MAM, a modular multi-agent framework that decomposes the medical diagnosis process into five roles: General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director. Through role-specialized collaboration, MAM achieves multi-modal (text/image/audio/video) medical diagnosis, outperforming baseline models by 18% to 365% across multiple public datasets.
Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger: Proposes MeCo (Meta-Cognition Trigger), which uses representation engineering to extract "meta-cognitive signals" (the model's self-assessment of its own capabilities) from inside the LLM. These signals are used to decide adaptively whether to call external tools, without fine-tuning and with minimal computational overhead, significantly improving the accuracy of tool-use decision-making across multiple backbone models and benchmarks.
MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis: This paper proposes the MEDDxAgent framework, which coordinates three modules—a History Taking Simulator, a Knowledge Retrieval Agent, and a Diagnosis Strategy Agent—via a central orchestrator, DDxDriver, to perform iterative differential diagnosis (DDx). It achieves over 10% accuracy improvement in interactive diagnostic scenarios while providing comprehensive reasoning explainability.
METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling: Proposes METAL, a VLM-based multi-agent framework that decomposes the chart-to-code generation task into the iterative collaboration of four specialized agents (generation, visual critique, code critique, and revision), achieving a 5.2% F1 improvement over the prior SOTA on the ChartMIMIC benchmark and demonstrating test-time scaling behavior.
MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation: This paper proposes MetaSynth, a meta-prompting-driven multi-agent collaborative framework that generates highly diverse synthetic data. Using only 25M tokens of synthetic data (without mixing real data), it successfully adapts Mistral-7B to financial and biomedical domains, achieving performance gains of \(4.08\%\) and \(13.75\%\) respectively, without compromising general capabilities.
MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection: This paper proposes the MIND framework, which achieves zero-shot harmful meme detection through three stages: similar sample retrieval, bidirectional insight derivation, and multi-agent debate. Without any labeled data, MIND outperforms existing zero-shot methods on three datasets and demonstrates strong generalization across different model architectures and scales.
A Multi-Agent Framework for Mitigating Dialect Biases in Privacy Policy Question-Answering Systems: A dual-agent iterative collaborative framework is constructed, comprising a Dialect Agent (for dialect translation + review) and a Privacy Policy Agent (for domain-specific answering). By injecting dialectological linguistic knowledge via prompt engineering, this framework simultaneously improves the overall accuracy of privacy policy QA and cross-dialect fairness without any retraining.
MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents: This paper proposes the MultiAgentBench benchmark and the MARBLE framework to systematically evaluate the performance of LLM multi-agent systems in collaborative and competitive scenarios. Covering 6 interactive environments (Research, Minecraft, Database, Coding, Bargaining, and Werewolf), the study introduces milestone-based KPI metrics and coordination scores. The evaluation reveals that GPT-4o-mini achieves the highest overall task score, graph-structured coordination protocols perform best in research scenarios, and cognitive planning improves milestone completion rates by 3%.
Multiple LLM Agents Debate for Equitable Cultural Alignment: Proposes the Multi-Agent Debate framework, where two LLM agents debate cultural scenarios adjudicated by a judge LLM. This significantly improves cultural adaptation accuracy and equity across cultural groups on the NormAd-eti benchmark, enabling 7-9B small models to achieve performance levels comparable to 27B models.
NexusSum: Hierarchical LLM Agents for Long-Form Narrative Summarization: Proposes NexusSum, a three-stage multi-agent LLM framework (dialogue-to-description \(\to\) hierarchical summarization \(\to\) iterative compression) that generates summaries for long narrative texts like books, movies, and TV series without fine-tuning, achieving up to a 30% BERTScore improvement on BookSum.
OS-Kairos: Adaptive Interaction for MLLM-Powered GUI Agents: This paper proposes OS-Kairos, which automatically annotates step-by-step confidence scores via a collaborative probing framework and fine-tunes them into a base model. This enables the GUI Agent to predict confidence at each step, autonomously deciding to execute the action or request human intervention. In complex scenarios, the task success rate (TSR) is improved from 14.29% (OS-Atlas-Pro-7B) to 88.20%, along with an absolute improvement of 24–87% on the AITZ and Meta-GUI benchmarks.
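The execute-or-ask decision described above can be pictured as a simple gate on the predicted per-step confidence. The sketch below is an assumption-laden illustration: the `gate_step` function, the confidence scale in [0, 1], and the 0.8 threshold are all hypothetical and are not values or interfaces from OS-Kairos.

```python
def gate_step(action: str, confidence: float, threshold: float = 0.8) -> dict:
    """Decide whether a GUI agent should act on its own or hand over.

    `confidence` is assumed to lie in [0, 1]; both the scale and the
    default threshold are hypothetical choices for this sketch.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= threshold:
        # Confident enough: carry out the predicted action autonomously.
        return {"mode": "execute", "action": action}
    # Otherwise pause and request human intervention for this step.
    return {"mode": "ask_human", "action": action}


confident = gate_step("tap('Checkout')", confidence=0.93)
unsure = gate_step("tap('Delete account')", confidence=0.41)
print(confident["mode"], unsure["mode"])
```

Raising the threshold trades autonomy for safety: more steps are routed to the human, which matters most for irreversible actions.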
OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use: A comprehensive survey of Operating System Agents (OS Agents) based on Multimodal Large Language Models (MLLMs), systematically analyzing their fundamental concepts (environment/observation/action space), core capabilities (understanding/planning/grounding), construction methodology (foundation models + agent frameworks), and evaluation systems. It covers a categorical comparison of 30+ foundation models and 20+ Agent frameworks.
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis: This work proposes OS-Genesis, an interaction-driven GUI Agent trajectory synthesis pipeline. By allowing the agent to explore and interact with the environment first, followed by deriving tasks in reverse (Reverse Task Synthesis), and combined with a Trajectory Reward Model (TRM) for quality filtering, it generates high-quality, diverse training trajectories, nearly doubling the performance on AndroidWorld.
PaSa: An LLM Agent for Comprehensive Academic Paper Search: PaSa is an LLM-based academic paper search agent that achieves comprehensive and accurate literature retrieval by autonomously invoking search tools, reading papers, and navigating citation networks. Trained with RL, it significantly outperforms Google Scholar and GPT-4o in real-world scenarios.
Play2Prompt: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play: Proposes Play2Prompt, which enables LLMs to autonomously "play" with tools (exploring input-output behaviors) to generate tool-use examples and optimize tool documentation in a zero-shot manner, significantly enhancing the tool-calling capabilities of LLM agents without requiring any annotated data.
R2D2: Remembering, Replaying and Dynamic Decision Making with a Reflective Agentic Memory: R2D2 proposes a Web Agent framework that integrates the Remember (experience replay buffer + A* search navigation) and Reflect (error reflection + reflective memory storage) paradigms. It transforms web navigation from an Unknown MDP to a Known MDP, reducing navigation errors by 50% and tripling the task completion rate on WebArena, outperforming the SOTA by 17%.
REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?: This paper proposes REPRO-Bench, a benchmark containing 112 social science paper instances, designed to evaluate the capability of AI agents in automatically assessing the reproducibility of papers. The best existing agent achieves an accuracy of only 21.4% (lower than the random guess baseline of 25%). REPRO-Agent, developed by the authors, improves the accuracy to 36.6% (a 71% relative improvement).
REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science?: This work introduces REPRO-Bench, which comprises 112 reproducibility assessment tasks for social science papers. The study reveals that existing AI agents (with the highest accuracy at only 21.4%) are far from capable of automating this process. Consequently, REPRO-Agent is developed to improve the accuracy to 36.6%.
Select, Read, and Write: A Multi-Agent Framework of Full-Text-based Related Work Generation: A collaborative three-agent framework, Select-Read-Write, is proposed. By employing graph-aware reading order decision-making and a shared working memory mechanism, it achieves automatic Related Work generation based on the full text of papers (rather than just abstracts). Consistent improvements are demonstrated across three base models (Llama3-8B, Claude-3-Haiku, and GPT-4o), with the Citation Graph strategy achieving the best performance.
Self-Taught Agentic Long-Context Understanding: The AgenticLU framework is proposed, which enables LLMs to autonomously generate clarification questions and retrieve relevant context via a Chain-of-Clarifications (CoC) workflow. By distilling search paths from tree search into the model using a two-stage SFT+DPO fine-tuning, an 8B model significantly outperforms baselines on 128K long-context QA tasks.
"sudo rm -rf agentic_security" | SUDO: Screen-based Universal Detox2tox Offense: This work proposes the SUDO attack framework. It disguises malicious requests as harmless instructions using a three-stage Detox2tox pipeline, restores the attack payload during execution, and systematically breaches the safety defenses of computer-use agents like Claude CUA and MANUS using dynamic iterative optimization based on checklist feedback, achieving an attack success rate of up to 41.33%.
SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement: SynWorld proposes enabling agents to explore and refine action knowledge (tool descriptions and workflows) through Monte Carlo Tree Search (MCTS) in synthesized virtual scenarios. This allows agents to autonomously adapt to tool usage in new environments, achieving approximately a 9% improvement over the ReAct baseline on ToolBench.
Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning: This paper proposes the Table-Critic multi-agent framework. Through the collaborative criticism and iterative refinement of four specialized agents—Judge, Critic, Refiner, and Curator—coupled with a self-evolving template tree to accumulate criticism knowledge, it achieves 73.7% and 91.7% accuracy on WikiTQ and TabFact, respectively, significantly outperforming existing methods.
The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs: Proposes a comprehensive evaluation framework to quantify the "behavior gap" between LLM agents and human experts in task-oriented dialogues. It systematically diagnoses behavioral discrepancies across three dimensions: dialog acts, tool usage, and knowledge utilization. It reveals that the behavior gap is highly correlated with task complexity (\(r=0.963\)), and closing this gap via behavior injection improves performance by an average of 24.3%.
Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models: This paper proposes the Theorem-of-Thought (ToTh) framework, which models abductive, deductive, and inductive reasoning using three parallel agents. It constructs reasoning trajectories as Formal Reasoning Graphs and employs NLI-calibrated Bayesian belief propagation to select the most coherent reasoning chain, consistently outperforming CoT, Self-Consistency, and CoT-Decoding on symbolic and numerical reasoning.
ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use: This paper proposes ToolHop, a benchmark dataset containing 995 multi-hop queries and 3,912 locally executable tools. By adopting a "query-driven" data construction paradigm (generating tools based on queries), the benchmark ensures genuine dependency relationships among tools and verifiable answers. Evaluation of 14 LLMs reveals that the strongest model, GPT-4o, achieves an accuracy of only 49%, highlighting significant strategy differences across different model families in tool use.