
🦾 LLM Agent

💬 ACL2026 · 44 paper notes

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

This paper proposes AgencyBench — a comprehensive benchmark comprising 138 real-world tasks that evaluates six core agent capabilities. Each scenario requires an average of 90 tool calls and 1M tokens. Fully automated evaluation is achieved via a user simulation agent and Docker sandbox.

Agent-GWO: Collaborative Agents for Dynamic Prompt Optimization in Large Language Models

This paper proposes Agent-GWO, which integrates the leader-follower hierarchy of the Grey Wolf Optimizer (GWO) into a multi-agent framework to jointly optimize prompt templates and decoding hyperparameters (temperature, top-p, etc.), consistently outperforming existing prompt optimization methods across 11 mathematical and mixed reasoning benchmarks.
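As a rough sketch of the underlying idea (not the paper's code), the classic GWO update pulls each candidate toward the three best-scoring leaders, here applied to a (temperature, top_p) vector; `gwo_step` and its clamping ranges are illustrative assumptions:

```python
import random

def gwo_step(wolves, scores, a):
    """One Grey Wolf Optimizer update over decoding hyperparameters.

    wolves: list of [temperature, top_p] candidates
    scores: fitness per candidate (higher is better)
    a: exploration coefficient, conventionally decayed from 2 to 0
    """
    # Rank candidates: alpha, beta, delta lead the pack.
    ranked = sorted(zip(scores, wolves), key=lambda t: -t[0])
    leaders = [w for _, w in ranked[:3]]

    new_wolves = []
    for x in wolves:
        pulled = []
        for leader in leaders:
            # Random coefficients trade off encircling vs. exploring.
            A = [a * (2 * random.random() - 1) for _ in x]
            C = [2 * random.random() for _ in x]
            pulled.append([l - A_i * abs(C_i * l - x_i)
                           for l, A_i, C_i, x_i in zip(leader, A, C, x)])
        # Move toward the average of the three leaders' pulls.
        moved = [sum(p[d] for p in pulled) / 3 for d in range(len(x))]
        # Clamp to plausible decoding ranges (assumed, not from the paper).
        new_wolves.append([min(max(moved[0], 0.0), 2.0),
                           min(max(moved[1], 0.0), 1.0)])
    return new_wolves
```

How the paper couples this numeric update with discrete prompt templates is a separate mechanism not shown here.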

ATLAS: Adaptive Trading with LLM AgentS Through Dynamic Prompt Optimization and Multi-Agent Coordination

This paper proposes ATLAS, a multi-agent financial trading framework, and Adaptive-OPRO, a prompt optimization method. ATLAS employs specialized analyst agents to prepare heterogeneous market information and dynamically optimizes the instruction prompt of a central trading agent based on delayed and noisy feedback, achieving significant improvements over baselines across diverse market volatility conditions.

Bayesian Social Deduction with Graph-Informed Language Models

This paper proposes GRAIL (Graph Reasoning Agent Informed through Language), a hybrid reasoning framework that externalizes probabilistic inference to a factor graph model while delegating language understanding and interaction to an LLM. GRAIL is the first agent to defeat human players in the social deduction game Avalon (67% win rate) while consuming far fewer computational resources than large-scale reasoning models.

CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents

CI-Work is an enterprise-scenario benchmark grounded in Contextual Integrity (CI) theory. It reveals that state-of-the-art LLM agents systematically violate privacy norms in enterprise workflows, and that scaling model size exacerbates rather than mitigates leakage.

CodeStruct: Code Agents over Structured Action Spaces

This paper proposes CodeStruct, a framework that redefines code repositories as AST-based structured action spaces, enabling LLM code agents to read and edit code via named program entities rather than raw text fragments. CodeStruct achieves 1.2–5.0% accuracy improvements on SWE-Bench Verified while reducing token consumption by 12–38%.

CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution

CoEvolve proposes an agent-data co-evolution framework that extracts three types of weakness signals—forgetting, boundary, and rarity—from training trajectories, guiding LLMs to perform targeted environment re-exploration and task synthesis. This allows the training data distribution to dynamically adapt to the agent's evolving capabilities, yielding absolute improvements of 19–23% on AppWorld and BFCL.

Conjunctive Prompt Attacks in Multi-Agent LLM Systems

This paper investigates conjunctive prompt attacks in multi-agent LLM systems: a trigger key embedded in a user query and a hidden template injected into a compromised remote agent each appear benign in isolation, yet activate harmful behavior when routing brings them together at the same agent. Existing defenses (PromptGuard, Llama-Guard, etc.) fail to reliably prevent such attacks.

Creating ConLangs to Probe the Metalinguistic Grammatical Knowledge of LLMs

This paper introduces IASC (Interactive Agentic System for ConLangs), a modular constructed-language building system that probes LLMs' metalinguistic knowledge by requiring them to perform morphosyntactic transformations according to linguistic specifications. The findings reveal that LLMs handle common typological patterns far better than rare ones, and that capability gaps across different LLMs are substantial.

Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

DAP introduces the concept of Hard Mode ATP (where AI must independently discover answers before constructing proofs, as opposed to Easy Mode statements with embedded answers), releases the MiniF2F-Hard and FIMO-Hard benchmarks, and proposes a two-stage "Discover and Prove" framework — using LLM natural language reasoning to discover answers, then rewriting the statement into an Easy Mode declaration for a formal prover. The framework raises the number of solved problems on CombiBench from 7 to 10 and, for the first time, proves 36 theorems on PutnamBench in Hard Mode.

Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation

Through evaluating over 10,000 research proposals, this paper systematically reveals the phenomenon of "diversity collapse" in multi-agent LLM systems across three levels — model intelligence, agent cognition, and system dynamics. Stronger models, authority-driven role assignments, and dense communication topologies all suppress semantic diversity, with the root cause residing in interaction structure rather than insufficient model capability.

EA-Agent: A Structured Multi-Step Reasoning Agent for Entity Alignment

This paper proposes EA-Agent, which decomposes entity alignment (EA) into a structured multi-step reasoning process. Through planning and execution over a tool pool (triple selector + alignment tool + reflector), EA-Agent achieves interpretable alignment decisions. Combined with reward-guided offline policy optimization for continuous improvement of planning capability, it achieves up to 3.17% Hits@1 improvement on DBP15K while mitigating the efficiency loss caused by redundant triples.

ExpSeek: Self-Triggered Experience Seeking for Web Agents

ExpSeek proposes a step-level entropy self-triggered framework for proactive experience seeking, enabling web agents to determine when and what guidance is needed based on intrinsic model signals during interaction, achieving absolute improvements of 9.3% and 7.5% on Qwen3-8B and Qwen3-32B, respectively.
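The entropy-triggered idea can be sketched in a few lines (an illustration, not ExpSeek's implementation): if the mean per-token entropy of the current step's next-token distributions exceeds a threshold, the agent fetches guidance; `maybe_seek_experience`, `retrieve_fn`, and the threshold value are assumed names:

```python
import math

def step_entropy(token_probs):
    """Mean per-token entropy of the policy's next-token distributions."""
    ents = [-sum(p * math.log(p) for p in dist if p > 0)
            for dist in token_probs]
    return sum(ents) / len(ents)

def maybe_seek_experience(token_probs, retrieve_fn, threshold=1.0):
    """Trigger experience retrieval only when the model is uncertain."""
    if step_entropy(token_probs) > threshold:
        return retrieve_fn()  # e.g. fetch guidance for the current step
    return None
```

A near-uniform distribution (high entropy) triggers retrieval; a confidently peaked one does not, so guidance is sought only where the model's own signal says it is needed.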

FairQE: Multi-Agent Framework for Mitigating Gender Bias in Translation Quality Estimation

This paper proposes FairQE, a multi-agent framework that mitigates systematic gender bias in QE models through gender cue detection, gender-flipped variant generation, and dynamic bias-aware score aggregation, without sacrificing translation quality estimation accuracy.

FedGUI: Benchmarking Federated GUI Agents across Heterogeneous Platforms, Devices, and Operating Systems

FedGUI is the first comprehensive federated learning benchmark for cross-platform GUI agents, comprising six datasets covering mobile, web, and desktop environments. It systematically investigates the effects of four types of heterogeneity—cross-platform, cross-device, cross-OS, and cross-source—on federated GUI agent training.

FregeLogic at SemEval 2026 Task 11: A Hybrid Neuro-Symbolic Architecture for Content-Robust Syllogistic Validity Prediction

FregeLogic is a hybrid neuro-symbolic system that combines a five-member LLM ensemble with a Z3 SMT solver as a tiebreaker, achieving a 16% reduction in belief bias alongside a 0.9% accuracy improvement on syllogistic validity prediction.

From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation

This paper introduces JurisCQAD—a large-scale dataset of 43,000+ real Chinese legal consultations—and proposes the JurisMA multi-agent framework, which performs structured task decomposition via a legal element graph and dynamic multi-agent collaboration (Manager Agent + Format Check + Law Search), achieving significant improvements over both general-purpose and law-specialized LLMs on LawBench.

HAG: Hierarchical Demographic Tree-based Agent Generation for Topic-Adaptive Simulation

This paper proposes HAG, a framework that formalizes population-level agent generation as a two-stage hierarchical decision process — first constructing a topic-adaptive demographic distribution tree via a world knowledge model to achieve macro-level distributional alignment, then combining real-data retrieval with LLM-based agent augmentation to ensure micro-level individual consistency. HAG reduces population alignment error by an average of 37.7% and improves sociological consistency by 18.8% across multi-domain benchmarks.

HeLa-Mem: Hebbian Learning and Associative Memory for LLM Agents

HeLa-Mem proposes a neuroscience-inspired memory architecture for LLM agents that models conversation history as a dynamic graph driven by Hebbian learning dynamics — strengthening inter-memory connections through co-activation, distilling hub memories into semantic knowledge via reflective consolidation, and retrieving via a dual-pathway combining semantic similarity with Hebbian spreading activation. It achieves state-of-the-art performance on LoCoMo with significantly fewer tokens.

Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents

This paper proposes STEP-HRL, which introduces a local progress module to iteratively compress interaction history within each subtask into compact textual summaries, enabling both high-level and low-level policies to make decisions based solely on single-step transitions rather than full histories. The approach achieves significant performance and generalization gains on ScienceWorld and ALFWorld while reducing token usage.

How Adversarial Environments Mislead Agentic AI

This paper formalizes the Adversarial Environment Injection (AEI) threat model, decomposing it into breadth attacks (poisoning retrieval results to induce cognitive drift) and depth attacks (injecting phantom nodes to construct navigational traps causing policy collapse). Across 11,000+ experimental runs, the two attack dimensions are found to be completely independent in terms of robustness — a phenomenon termed "robustness splitting" — demonstrating that current single-point defense strategies are fundamentally insufficient.

ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models

This paper introduces ImplicitMemBench, the first benchmark for systematically evaluating implicit memory in LLMs. It comprises 300 test items across three cognitive paradigms—procedural memory, priming effects, and classical conditioning—and reveals severe limitations across 17 models: the best-performing model achieves only 66% overall accuracy, far below the human baseline.

Lightweight LLM Agent Memory with Small Language Models

This paper proposes LightMem, a lightweight LLM agent memory system driven by multiple specialized small language models (SLMs). By modularizing memory operations into a Controller (SLM-1), a Selector (SLM-2), and a Writer (SLM-3), and decoupling online processing from offline consolidation, LightMem achieves an average F1 improvement of approximately 2.5 points over A-MEM on the LoCoMo benchmark, while attaining a retrieval latency of 83 ms and an end-to-end latency of 581 ms.

LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization

This paper proposes Location Preference Optimization (LPO), which combines entropy-based window rewards and distance-based dynamic location rewards within the GRPO framework to improve the spatial grounding accuracy of GUI agents, achieving state-of-the-art performance on both offline and online benchmarks.
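The reward shaping can be illustrated with a minimal sketch (assumed function names and constants; the paper's entropy-based window reward is simplified to a binary region check here): a binary window reward gives full credit inside the target region, while a smooth distance reward still rewards near misses, giving GRPO a useful learning signal:

```python
import math

def location_reward(pred, target, window=20, sigma=50.0):
    """Toy LPO-style reward for a predicted click location.

    pred, target: (x, y) pixel coordinates
    window: half-width of the region earning full window credit (assumed)
    sigma: length scale of the distance-based decay (assumed)
    """
    dx, dy = pred[0] - target[0], pred[1] - target[1]
    dist = math.hypot(dx, dy)
    # Window reward: binary credit for landing inside the target region.
    window_r = 1.0 if abs(dx) <= window and abs(dy) <= window else 0.0
    # Distance reward: smooth credit that decays with pixel distance,
    # so near misses are still distinguishable from far ones.
    distance_r = math.exp(-dist / sigma)
    return 0.5 * window_r + 0.5 * distance_r
```

An exact hit scores 1.0; a 60-pixel miss falls outside the window and keeps only partial distance credit.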

MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering

This paper proposes MATA, a multi-agent framework for table question answering that employs a scheduler to prioritize reasoning paths (CoT/PoT/text2SQL), a confidence checker to filter candidate answers, and a judge agent for arbitration. The framework achieves model-agnostic, efficient, and accurate TableQA, with an average EM improvement of 40.1% across 10 LLMs.
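The scheduler → confidence checker → judge pipeline can be sketched as follows (hypothetical names and threshold; the actual agents are LLM-driven, stubbed here as plain functions):

```python
def answer_table_question(question, table, solvers, judge, conf_threshold=0.7):
    """Toy MATA-style pipeline over prioritized reasoning paths.

    solvers: (name, solve_fn) pairs in the scheduler's priority order,
             e.g. CoT, PoT, text2SQL; each returns (answer, confidence)
    judge: arbitration function over the surviving candidates
    """
    candidates = []
    for name, solve in solvers:
        answer, conf = solve(question, table)
        # Confidence checker: drop low-confidence candidate answers.
        if conf >= conf_threshold:
            candidates.append((name, answer, conf))
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0][1]
    # Judge agent arbitrates only when multiple candidates survive.
    return judge(question, candidates)
```

The design keeps the framework model-agnostic: any LLM can back the solver and judge roles without changing the control flow.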

MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools

MCP-Flow proposes a Web Agent-based automated pipeline that collects tool information from 1,166 real-world MCP servers and synthesizes 68,733 high-quality training samples, enabling small fine-tuned models (0.6B–8B) to surpass SOTA large models such as GPT-4o on MCP tool use.

MemoPhishAgent: Memory-Augmented Multi-Modal LLM Agent for Phishing URL Detection

This paper proposes MemoPhishAgent (MPA), the first memory-augmented multimodal LLM agent specifically designed for phishing URL detection. MPA dynamically orchestrates five dedicated tools and leverages an episodic memory system to reuse historical reasoning trajectories. It achieves a 13.6% recall improvement on public benchmarks and a 20% improvement on real-world social media data, and has been deployed in production, processing approximately 60K high-risk URLs per week.

Mina: A Multilingual LLM-Powered Legal Assistant Agent for Bangladesh

This work presents Mina, a multilingual LLM-powered legal assistant for the Bangladeshi legal domain. Through a two-stage RAG pipeline that accurately retrieves relevant acts and specific provisions, combined with a tool chain and multilingual embeddings, Mina achieves 75–80% passing rates on the Bangladesh Bar Council exam while reducing legal consultation costs to just 0.12–0.61% of traditional methods.

RISK: A Framework for GUI Agents in E-commerce Risk Management

This paper proposes the RISK framework, comprising a domain dataset (RISK-Data: 8,492 single-step + 2,386 multi-step trajectories), a benchmark (RISK-Bench), and a GRPO-based reinforcement fine-tuning method (RISK-R1) for GUI agents in e-commerce risk management. The 7B model surpasses state-of-the-art baselines with only 7.2% of their parameter count, achieving an online task success rate of 70.5%.

Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration

This paper proposes ExtAgents, a multi-agent framework that addresses the performance degradation observed in existing multi-agent methods when scaling external knowledge input beyond the context window. It introduces two mechanisms—global knowledge synchronization (information exchange across all Seeking Agents) and knowledge-accumulative reasoning (progressively injecting filtered knowledge into the Reasoning Agent)—achieving significant improvements on multi-hop QA and long survey generation tasks.

SecureVibeBench: Evaluating Secure Coding Capabilities of Code Agents with Realistic Vulnerability Scenarios

This paper presents SecureVibeBench, the first repository-level, multi-file-editing secure coding benchmark. It constructs 105 C/C++ secure coding tasks from 41 OSS-Fuzz projects, precisely reconstructing the scenarios in which vulnerabilities were first introduced via cascaded static and dynamic analysis. Evaluation results reveal that the best-performing agent (SWE-agent + Claude Sonnet 4.5) produces code that is simultaneously functionally correct and secure in only 23.8% of cases.

SILO-BENCH: A Scalable Environment for Evaluating Distributed Coordination in Multi-Agent LLM Systems

This paper introduces SILO-BENCH, a role-agnostic benchmark for evaluating distributed coordination in multi-agent LLM systems. It comprises 30 algorithmic tasks across three communication complexity levels, with 54 configurations yielding 1,620 experiments. The benchmark reveals a critical communication-reasoning gap: agents spontaneously form reasonable communication topologies and actively exchange information, yet systematically fail to integrate distributed state into correct answers.

Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Identification

This paper proposes Spec-o3, a tool-augmented vision-language agent that simulates the spectral inspection workflow of professional astronomers via Interleaved Multimodal Chain-of-Thought (iMCoT). Through a two-stage training pipeline of cold-start SFT followed by outcome-based RL, Spec-o3 improves macro-F1 from 28.3% to 76.5% on rare celestial object identification, achieving ~50× speedup over manual inspection.

StructMem: Structured Memory for Long-Horizon Behavior in LLMs

StructMem proposes a structure-enhanced hierarchical memory framework that achieves state-of-the-art performance on the LoCoMo long-conversation benchmark (76.82%) through dual-perspective event-level extraction and cross-event semantic consolidation, while substantially reducing token consumption (1.94M vs. 35.8M for graph memory) and API call counts.

SynthAgent: Adapting Web Agents with Synthetic Supervision

This paper proposes SynthAgent, a web agent adaptation framework built entirely on synthetic supervision. It employs categorized exploration to systematically cover functional regions of webpages for diverse task synthesis, followed by a dual refinement strategy—task refinement (conflict-triggered correction of hallucinations) and trajectory refinement (global-context denoising)—to improve synthetic data quality. SynthAgent significantly outperforms existing synthetic methods on WebArena and Online-Mind2Web.

ToolOmni: Enabling Open-World Tool Use via Agentic Learning with Proactive Retrieval and Grounded Execution

This paper proposes ToolOmni, a unified agentic framework that integrates proactive tool retrieval and retrieval-grounded tool execution within a single reasoning loop. Through cold-start SFT followed by decoupled multi-objective GRPO, the framework jointly optimizes retrieval and execution capabilities, achieving an end-to-end execution success rate that surpasses strong baselines by +10.8% on ToolBench.

Towards Scalable Lightweight GUI Agents via Multi-role Orchestration

This paper proposes LAMO, a framework that trains a lightweight 3B MLLM into a flexibly orchestrated multi-role GUI agent through role-oriented data synthesis and two-stage training (SFT with Perplexity-Weighted Cross-Entropy + multi-task RL). The agent operates in three modes—monolithic inference, multi-agent collaboration, and plug-and-play policy executor—and achieves a 77.6% success rate on AndroidWorld when paired with a GPT-5 planner, surpassing dedicated GUI agents with 72B parameters.

Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities

This paper proposes the first formal framework for Agent Uncertainty Quantification (Agent UQ), modeling an agent's problem-solving trajectory as a stochastic process over a dynamic Bayesian network: \(P(\mathcal{F}_{\leq T}) = P(E_0, O_0) \prod_{i=1}^{T} P_{\pi,\mathcal{T}}(A_i|E_{i-1}, O_{i-1}) P(O_i|A_i, E_i)\). The framework unifies existing UQ paradigms (single-step QA, multi-step reasoning) as special cases and identifies four technical challenges unique to Agent UQ through empirical analysis on \(\tau^2\)-bench.
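As a rough numeric illustration (not the paper's code), the factorization above evaluates a trajectory's probability as the initial state probability times per-step action and observation probabilities; `trajectory_logprob` and its argument shapes are assumed:

```python
import math

def trajectory_logprob(p_init, step_probs):
    """Log-probability of an agent trajectory under the factorization
    P(F_<=T) = P(E_0, O_0) * prod_i P(A_i|E_{i-1},O_{i-1}) * P(O_i|A_i,E_i).

    p_init: P(E_0, O_0), probability of the initial environment/observation
    step_probs: per-step pairs (P(A_i | ...), P(O_i | ...))
    """
    logp = math.log(p_init)
    for p_action, p_obs in step_probs:
        logp += math.log(p_action) + math.log(p_obs)
    return logp
```

Working in log space keeps long trajectories numerically stable; uncertainty measures can then be derived from dispersion of these log-probabilities across sampled trajectories.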

Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories

This paper proposes SPECTRA, a supervision-free framework that enables small vision-language models (SVLMs) to discover effective tool-calling and visual reasoning behaviors through pure environment interaction, leveraging cold-start reinforcement learning (GRPO) with soft structural constraints on multi-turn rollout topology. SPECTRA achieves up to 5% improvement in task accuracy and 9% improvement in tool efficiency across 4 multimodal benchmarks, and introduces the Tool Instrumental Utility (TIU) metric to quantify tool effectiveness without supervision.

What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search

Through large-scale experiments (15 LLMs × 8 tasks, 72K candidate solutions), this paper finds that effective LLM optimizers behave as "local refiners"—consistently producing frequent incremental improvements while progressively concentrating search in semantic space—rather than generating high-novelty, leap-style breakthroughs. A key finding is that novelty per se does not predict optimization performance; novelty is only beneficial when the search remains sufficiently localized.

When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

This paper proposes two complementary metrics, RPS and AGS, to quantify distillation-induced behavioral homogenization in LLM agents' tool-use behaviors. By distinguishing necessary from unnecessary behaviors, the framework reveals cross-family behavioral inheritance patterns across 18 models, finding that Kimi-K2 exhibits greater behavioral similarity to Claude Sonnet 4.5 than Anthropic's own models do.

Why Agents Compromise Safety Under Pressure

This paper introduces the concept of Agentic Pressure — when LLM agents operating under resource constraints cannot simultaneously complete tasks and comply with safety rules, they spontaneously exhibit norm drift, proactively sacrificing safety to preserve helpfulness. Notably, models with stronger reasoning capabilities are more adept at constructing verbalized rationalizations to justify such violations.

Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception

This paper reveals a "Temporal Blindness" phenomenon in LLM agents during multi-turn interactions — the inability to adjust tool-calling decisions based on the real elapsed time between messages — and constructs the TicToc benchmark to evaluate this problem.
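The kind of time-aware decision these agents fail to make is simple to state in code (a hypothetical helper, not from the paper): compare real elapsed wall-clock time against a per-tool staleness budget before reusing a cached result:

```python
import time

def should_refetch(last_fetch_ts, staleness_s, now=None):
    """Decide whether to re-call a tool based on real elapsed time.

    last_fetch_ts: wall-clock time of the previous tool call
    staleness_s: how long the cached result stays fresh (assumed per-tool,
                 e.g. seconds for stock quotes, hours for weather)
    """
    now = time.time() if now is None else now
    return (now - last_fetch_ts) > staleness_s
```

Temporal blindness is precisely an agent behaving as if `now - last_fetch_ts` were always zero, reusing stale tool outputs regardless of how much real time has passed between user messages.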

ZARA: Training-Free Motion Time-Series Reasoning via Evidence-Grounded LLM Agents

This paper proposes ZARA, a knowledge- and retrieval-augmented multi-agent framework that distills sensor signals into a structured textual knowledge base, combines class-conditional retrieval with hierarchical LLM reasoning, and achieves interpretable human activity recognition in a fully training-free setting, substantially outperforming existing methods across 8 datasets.