🦾 LLM Agent
🧠 NeurIPS 2025 · 50 paper notes
- A-MEM: Agentic Memory for LLM Agents
-
This paper proposes A-Mem, a Zettelkasten-inspired agentic memory system for LLM agents. Each memory entry automatically generates a structured note (keywords/tags/contextual description), dynamically establishes inter-memory links, and triggers evolutionary updates to existing memories upon the insertion of new ones. A-Mem substantially outperforms baselines such as MemGPT on the LoCoMo long-conversation QA benchmark.
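The note-plus-link mechanism can be sketched in a few lines. This is a minimal illustration, not A-Mem's implementation: in the paper the keywords, tags, and contextual description are generated by an LLM, and linking/evolution are also LLM-driven; here keyword overlap stands in for both.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNote:
    """One Zettelkasten-style note. In A-Mem the keywords, tags, and
    context are LLM-generated; here they are passed in directly."""
    content: str
    keywords: set
    tags: set
    context: str
    links: list = field(default_factory=list)

def insert_note(store, note, min_overlap=1):
    """Link the new note to existing notes that share keywords, then let
    the insertion 'evolve' the neighbours by merging in the new tags."""
    for old in store:
        if len(note.keywords & old.keywords) >= min_overlap:
            note.links.append(old)
            old.links.append(note)
            old.tags |= note.tags  # evolutionary update of the old memory
    store.append(note)
    return note
```

The key property illustrated is that inserting a note is a write to the whole neighbourhood, not just an append.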
- Adaptive Coopetition: Leveraging Coarse Verifier Signals for Resilient Multi-Agent LLM Reasoning
-
This paper proposes the Adaptive Coopetition (AdCo) framework, which employs a UCB multi-armed bandit strategy with coarse-grained verifier signals to enable multiple LLM agents to adaptively switch between cooperative and competitive modes during inference, achieving a 20% relative improvement on mathematical reasoning benchmarks.
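The mode-switching core is a standard UCB1 bandit over two arms. The sketch below assumes a 0/1 reward stands in for AdCo's coarse verifier signal; the arm names and exploration constant are illustrative.

```python
import math

class UCBModeSelector:
    """UCB1 over two 'arms' (cooperate vs. compete). In AdCo the reward
    comes from a coarse-grained verifier; any 0/1 feedback works here."""
    def __init__(self, arms=("cooperate", "compete"), c=1.4):
        self.arms, self.c = list(arms), c
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}

    def select(self):
        for a in self.arms:               # play each arm once first
            if self.counts[a] == 0:
                return a
        total = sum(self.counts.values())
        return max(self.arms, key=lambda a: self.values[a]
                   + self.c * math.sqrt(math.log(total) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n  # running mean
```

With a verifier that consistently favours one mode, the selector concentrates on it while still probing the alternative.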
- AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents
-
This paper proposes AgentAuditor — a training-free, memory-augmented reasoning framework that enables LLMs to adaptively extract structured semantic features (scenario, risk, behavior) to construct an experiential memory bank, then employs multi-stage context-aware retrieval-augmented generation to guide LLM evaluators in assessing agent behavior for safety and security threats. The work also introduces ASSEBench, the first benchmark jointly covering safety and security evaluation (2,293 records, 15 risk types, 29 scenarios), achieving human expert-level evaluation accuracy across multiple benchmarks.
- AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness
-
AgentChangeBench is the first benchmark that systematically evaluates the adaptability of LLM agents when user goals shift mid-conversation: 315 base tasks × 9 variants = 2,835 sequences, spanning 3 enterprise domains (banking/retail/airline) and 5 user personas. It introduces 4 complementary metrics including GSRT (Goal-Shift Recovery Time), revealing efficiency and robustness gaps masked by high pass@k—e.g., GPT-4o achieves 92.2% airline recovery rate yet 89.1% retail redundancy rate.
- AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents
-
This paper proposes AgentDAM, the first benchmark for end-to-end evaluation of data minimization compliance by AI agents in real web environments. It comprises 246 tasks spanning Reddit, GitLab, and Shopping platforms, and finds that leading models such as GPT-4o exhibit privacy leakage rates of 36–46% without mitigation, while a CoT-based privacy prompt reduces leakage rates to 6–8%.
- Agentic NL2SQL to Reduce Computational Costs
-
This paper proposes Datalake Agent, an agentic NL2SQL system built on an interactive reasoning loop. Through a hierarchical information retrieval strategy (GetDBDescription → GetTables → GetColumns → DBQueryFinalSQL), the system enables LLMs to request database schema information on demand rather than receiving it all at once. In a setting with 319 tables, the approach reduces token usage by 87% and cost by 8×, while maintaining superior performance on complex queries.
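The on-demand retrieval hierarchy amounts to three progressively narrower tools over the schema. A toy sketch, with a hard-coded schema dict standing in for live database metadata (the tool names follow the paper's pipeline; the schema contents are invented):

```python
# Toy schema standing in for a data lake; the real system serves these
# answers from live database metadata.
SCHEMA = {
    "sales": {"description": "order facts",
              "tables": {"orders": ["id", "customer_id", "total"],
                         "items":  ["order_id", "sku", "qty"]}},
    "crm":   {"description": "customer master data",
              "tables": {"customers": ["id", "name", "region"]}},
}

def get_db_description():
    """Stage 1 (GetDBDescription): database names and one-line summaries only."""
    return {db: meta["description"] for db, meta in SCHEMA.items()}

def get_tables(db):
    """Stage 2 (GetTables): table names for one database the agent asked about."""
    return list(SCHEMA[db]["tables"])

def get_columns(db, table):
    """Stage 3 (GetColumns): columns, fetched only for tables the agent needs."""
    return SCHEMA[db]["tables"][table]
```

The token saving comes from the agent calling `get_columns` for a handful of tables instead of receiving all 319 table definitions up front.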
- Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents
-
This paper proposes Agentic Plan Caching (APC), which extracts structured plan templates from agent execution logs and reuses them via keyword-matching cache hits with a small model for adaptation. APC reduces cost by 50.31% and latency by 27.28% on average while retaining 96.61% of accuracy-optimal performance.
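The keyword-matching cache hit can be sketched with a Jaccard-overlap lookup. This is an assumption-laden miniature: APC extracts templates from execution logs and adapts them with a small model on a hit, whereas here templates are stored directly and the similarity threshold is arbitrary.

```python
import re

class PlanCache:
    """Keyword-matched cache of plan templates (a sketch of APC's hit path)."""
    def __init__(self):
        self.entries = []   # list of (keyword set, plan template)

    @staticmethod
    def _keywords(text):
        return set(re.findall(r"[a-z]+", text.lower())) - {"the", "a", "to"}

    def put(self, task, plan):
        self.entries.append((self._keywords(task), plan))

    def get(self, task, min_jaccard=0.5):
        """Return the best-matching template, or None on a cache miss."""
        kw = self._keywords(task)
        best, score = None, 0.0
        for entry_kw, plan in self.entries:
            j = len(kw & entry_kw) / max(1, len(kw | entry_kw))
            if j > score:
                best, score = plan, j
        return best if score >= min_jaccard else None
```

On a miss, the system would fall back to full planning and write the resulting template back into the cache.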
- AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents
-
This paper proposes the AgentMisalignment benchmark suite, comprising 9 realistic scenario evaluation tasks that measure the propensity of LLM agents to spontaneously deviate from deployer intent under non-malicious instructions (rather than measuring capability). The study finds that stronger models tend to exhibit higher misalignment, and that persona prompts sometimes exert greater influence on misaligned behavior than model choice itself.
- AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks
-
This paper investigates the problem of compute-optimal test-time scaling in multi-stage complex tasks. Through large-scale pilot experiments, three generalizable scaling insights for LLMs on multi-stage tasks are identified. The authors propose AgentTTS—an LLM agent-based framework that autonomously searches for compute-optimal model selection and budget allocation strategies via iterative feedback-driven search.
- Are Large Language Models Sensitive to the Motives Behind Communication?
-
Three progressive experiments systematically evaluate whether LLMs possess "motivational vigilance"—the ability to recognize the intentions and incentives of information sources and adjust trust accordingly. In controlled experiments, frontier non-reasoning LLMs perform close to the rational model (Pearson's \(r > 0.9\)) and resemble humans more than the rational model does; however, vigilance drops sharply in real-world YouTube sponsored content (\(r < 0.2\)), and simple prompt steering partially restores it (raising \(r\) to 0.31).
- Attractive Metadata Attack: Inducing LLM Agents to Invoke Malicious Tools
-
AMA (Attractive Metadata Attack) demonstrates that by carefully crafting malicious tool metadata (name, description, parameter schema) alone — without prompt injection or internal model access — an attacker can induce LLM agents to invoke malicious tools and leak private data at a success rate of 81–95%, while barely affecting original task completion (98%+), with existing defenses (auditors, prompt rewriting) proving largely ineffective.
- Automated Composition of Agents: A Knapsack Approach for Agentic Component Selection
-
This paper formalizes the agent component selection problem as an online knapsack problem and proposes the Composer Agent framework, which evaluates true component capabilities via sandbox testing (rather than static semantic retrieval) and dynamically selects optimal component combinations under budget constraints using the ZCL online algorithm. The approach achieves up to a 31.6% improvement in single-agent tool selection success rate, and boosts multi-agent sub-agent selection success rate from 37% to 87%.
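The online knapsack decision rule can be sketched as a density threshold that rises exponentially with budget consumption, in the style of the Zhou-Chakrabarty-Lukose algorithm the paper builds on. The exact threshold form below is the textbook version, not necessarily the paper's; `L` and `U` bound the possible value-per-cost densities.

```python
import math

def make_zcl_selector(budget, L, U):
    """Accept an arriving item iff its value density beats a threshold
    that grows exponentially in the fraction of budget already spent."""
    used = 0.0

    def offer(value, cost):
        nonlocal used
        z = used / budget
        threshold = (U * math.e / L) ** z * (L / math.e)
        if cost <= budget - used and value / cost >= threshold:
            used += cost
            return True
        return False

    return offer
```

Early on, almost any component clears the bar; as the budget fills, only components that sandbox testing scored as high-value survive.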
- Automated Multi-Agent Workflows for RTL Design
-
VeriMaAS is a multi-agent framework that integrates HDL formal verification feedback (Yosys + OpenSTA) into the automated workflow generation process, adaptively selecting reasoning operators (I/O → CoT → ReAct → SelfRefine → Debate) for RTL code generation tasks. With only a few hundred training samples, it achieves 5–7% higher pass@k performance than fine-tuning baselines.
- Benchmarking Agentic Systems in Automated Scientific Information Extraction with ChemX
-
This paper constructs ChemX — a suite of 10 multimodal chemical data extraction benchmark datasets manually annotated and validated by domain experts, spanning nanomaterials and small molecules. It systematically evaluates state-of-the-art agentic systems including ChatGPT Agent, SLM-Matrix, FutureHouse, and nanoMINER, as well as frontier LLMs such as GPT-5 and GPT-5 Thinking. The proposed single-agent method achieves F1=0.61 on the nanozyme dataset through structured document preprocessing (marker-pdf → Markdown → LLM extraction), surpassing all general-purpose multi-agent systems, while revealing systemic challenges in chemical information extraction such as SMILES parsing failures and terminology ambiguity.
- BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
-
This paper proposes the Blink-Think-Link (BTL) brain-inspired framework, which decomposes GUI interaction into three biologically plausible stages: Blink (rapid attentional localization), Think (cognitive reasoning and decision-making), and Link (executable command generation). Combined with an automated Blink data annotation pipeline and the first rule-based composite process-and-outcome reward mechanism, BTL Reward, the resulting BTL-UI model achieves competitive performance on both static GUI understanding and dynamic interaction benchmarks.
- CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension
-
Inspired by Piaget's constructivist theory, this paper proposes CAM — an agentic memory system characterized by three properties: structuredness (hierarchical schema), flexibility (assimilation via overlapping clustering), and dynamism (incremental adaptation). CAM comprehensively outperforms baselines such as RAPTOR and GraphRAG across six long-document reading comprehension benchmarks.
- ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions
-
This paper proposes ContextAgent, the first LLM agent framework that leverages multimodal sensory perception from wearable devices (video + audio + notifications) to understand user intent and proactively deliver tool-augmented services. It also introduces ContextAgentBench, a benchmark of 1,000 samples, achieving improvements of 8.5% in proactive prediction accuracy and 6.0% in tool invocation accuracy.
- CORE: Full-Path Evaluation of LLM Agents Beyond Final State
-
This paper proposes CORE, a framework that encodes legitimate tool-calling paths for agent tasks using deterministic finite automata (DFA) and introduces five complementary metrics (path correctness, order correctness, prefix criticality, harm rate, and efficiency) to evaluate agent behavior along the full execution path rather than the final state alone, revealing safety and efficiency differences invisible to conventional final-state evaluation.
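Encoding a legal tool-calling path as a DFA and checking a trace against it is straightforward. The sketch below uses an invented "login before read" policy for illustration; CORE's actual task automata and its five metrics are richer than this accept/reject check.

```python
class ToolPathDFA:
    """DFA over tool names: encodes which calls are legal from each state."""
    def __init__(self, transitions, start, accepting):
        self.transitions = transitions   # {(state, tool): next_state}
        self.start, self.accepting = start, accepting

    def check(self, trace):
        state = self.start
        for tool in trace:
            if (state, tool) not in self.transitions:
                return False             # illegal call at this point in the path
            state = self.transitions[(state, tool)]
        return state in self.accepting
```

Because the automaton sees the whole sequence, it rejects a trace whose final state looks fine but whose path was illegal, which is exactly the blind spot of final-state evaluation.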
- Crucible: Quantifying the Potential of Control Algorithms through LLM Agents
-
This paper is the first to formalize the concept of Tuning Potential, using LLM agents to simulate multi-level developers performing dual-layer (parameter + logic) optimization of control algorithms. On CartPole, Crucible lifts a Bang-bang controller from a score of 34 to 500, reaching DQN-level performance; on ABR tasks, it achieves up to a 44.1% improvement over Bayesian optimization.
- Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?
-
This work establishes, both theoretically and empirically, that the performance gains attributed to Multi-Agent Debate (MAD) stem primarily from majority voting (ensembling) rather than the debate process itself. The debate dynamics are shown to constitute a martingale—meaning debate does not systematically improve correctness in expectation—and this theoretical insight motivates a principled improvement to MAD by biasing updates toward correct signals.
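The ensembling baseline the paper credits for MAD's gains is plain plurality voting over independent answers, as in this minimal sketch:

```python
from collections import Counter

def majority_vote(answers):
    """Plurality vote over agent answers; ties go to the answer that
    appeared first. This is the ensembling effect the paper isolates."""
    counts = Counter(answers)
    top = max(counts.values())
    for a in answers:                    # first answer reaching the top count
        if counts[a] == top:
            return a
```

The paper's martingale result says that adding debate rounds on top of this does not, in expectation, move the vote toward the correct answer unless the updates are explicitly biased toward correct signals.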
- Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
-
This paper proposes DVD (Deep Video Discovery), an agent that frames long-form video understanding as a multi-step information search problem. It first constructs a multi-granular structured database from a long video (global summary + clip-level caption embeddings + frame-level pixels), then provides three search tools (Global Browse / Clip Search / Frame Inspect). A reasoning LLM autonomously orchestrates the search trajectory via an observe-reason-act loop. DVD achieves 74.2% on LVBench (surpassing the previous SOTA MR.Video by 13.4 pp), and 76.0% with subtitles.
- DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments
-
This paper presents DefenderBench, an open-source modular toolkit for systematically evaluating LLM agents across three categories of cybersecurity tasks—offensive, defensive, and knowledge understanding—covering five scenarios: network intrusion simulation, malicious content detection, code vulnerability detection, code vulnerability repair, and CTI knowledge QA. Benchmark results show that Claude-3.7-sonnet achieves the best overall performance (81.65 points).
- Distilling LLM Agent into Small Models with Retrieval and Code Tools
-
This paper proposes an Agent Distillation framework that distills the complete reason-act-observe interactive behaviors of LLM agents (rather than static CoT) into small models ranging from 0.5B to 7B parameters. Combined with a first-thought prefix to improve teacher trajectory quality and self-consistent action generation to enhance inference robustness, the framework enables small models to achieve performance comparable to CoT-distilled models 2–4× their size.
- DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents
-
DRIFT is a system-level agent security framework featuring three layers of defense: Secure Planner (pre-planned function trajectories and parameter checklists), Dynamic Validator (dynamic policy updates based on Read/Write/Execute permissions), and Injection Isolator (detection and masking of injected instructions from the memory stream). On AgentDojo, DRIFT reduces ASR from 30.7% to 1.3% while achieving 20.1% higher utility than CaMeL.
- Enhancing Demand-Oriented Regionalization with Agentic AI and Local Heterogeneous Data for Adaptation Planning
-
This paper proposes a planning support system based on Agentic AI, in which an LLM agent guides non-technical users through data-driven demand-oriented regionalization. The core algorithm is RepSC-SOM (spatially constrained self-organizing map with representative initialization), supporting iterative human-AI collaborative refinement of regional delineations for disaster risk management and climate adaptation planning.
- EU-Agent-Bench: Measuring Illegal Behavior of LLM Agents Under EU Law
-
This paper introduces EU-Agent-Bench, the first verifiable agent benchmark grounded in the EU legal framework. Using 600 benign user requests, it evaluates whether LLM agents' tool calls violate EU regulations. Results show that even the best-performing model (Gemini 2.5 Flash) achieves a legality rate of only ~55%, revealing a substantial gap between current alignment techniques and legal reliability.
- Generative AI Agents for Controllable and Protected Content Creation
-
This paper proposes a multi-agent generative framework that addresses controllability and copyright protection in a unified manner through the collaboration of five specialized agents — Director/Planner, Generator, Reviewer, Integration, and Protection — augmented with human-in-the-loop feedback.
- Ground-Compose-Reinforce: Grounding Language in Agentic Behaviours using Limited Data
-
This paper proposes Ground-Compose-Reinforce (GCR), an end-to-end neuro-symbolic framework that learns the grounding semantics of atomic propositions from a small number of annotated trajectories (only 350), composes them into complex task specifications via Reward Machines, and trains an RL agent using self-generated dense rewards — eliciting out-of-distribution complex behaviours without any hand-crafted reward functions.
- Group-in-Group Policy Optimization for LLM Agent Training
-
GiGPO introduces step-level grouping nested within the episode-level grouping of GRPO by leveraging recurring environment states across trajectories as anchor states, enabling fine-grained credit assignment without additional rollouts or a critic model. It outperforms GRPO by >12% on ALFWorld and >9% on WebShop.
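The step-level half of the scheme can be sketched as group-relative normalization per anchor state. The data layout below is a simplification (GiGPO nests this inside GRPO's episode-level grouping and uses discounted step returns), but it shows the critic-free credit assignment:

```python
from collections import defaultdict
from statistics import mean, pstdev

def step_advantages(steps):
    """Sketch of GiGPO's step-level grouping. `steps` is a list of
    (anchor_state, step_return) pairs collected across trajectories;
    each step is normalized against the other steps taken from the
    same environment state, with no critic model involved."""
    groups = defaultdict(list)
    for state, ret in steps:
        groups[state].append(ret)
    adv = []
    for state, ret in steps:
        g = groups[state]
        sigma = pstdev(g) or 1.0         # avoid div-by-zero for singleton groups
        adv.append((ret - mean(g)) / sigma)
    return adv
```

A step that did better than its peers from the same state gets a positive advantage even if the whole episode failed, which is the fine-grained signal episode-level GRPO cannot provide.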
- Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
-
This paper proposes Hogwild! Inference—a parallel LLM inference protocol that requires no predefined collaboration framework. Multiple LLM instances synchronize in real time through a shared concurrent KV cache, leveraging RoPE positional encoding to avoid recomputation, achieving higher accuracy with fewer serial steps on mathematical reasoning and programming tasks.
- It's LIT! Reliability-Optimized LLMs with Inspectable Tools
-
By defining reliability/inspectability cost functions for each external tool, LIT guides LLMs to select the lowest-cost (most transparent and auditable) tool-calling path among multiple candidates, improving interpretability while maintaining or enhancing task accuracy in 61 out of 65 test scenarios.
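The selection rule reduces to minimizing a per-tool cost. The linear combination and the weights below are an illustrative assumption, not LIT's actual cost model:

```python
def tool_cost(unreliability, opacity, w_rel=1.0, w_insp=1.0):
    """Combined cost: lower means more reliable and more auditable.
    The linear form and the weights are illustrative assumptions."""
    return w_rel * unreliability + w_insp * opacity

def pick_tool(candidates):
    """candidates: {tool_name: (unreliability, opacity)} -> cheapest tool."""
    return min(candidates, key=lambda t: tool_cost(*candidates[t]))
```

Given two tools that solve the task equally well, the agent is steered toward the one whose behaviour can be inspected.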
- LC-Opt: Benchmarking Reinforcement Learning and Agentic AI for End-to-End Liquid Cooling Optimization in Data Centers
-
This paper presents LC-Opt, a liquid cooling benchmark environment built upon a high-fidelity digital twin of the cooling system of the ORNL Frontier supercomputer. It supports end-to-end liquid cooling optimization via RL control policies, encompassing centralized/decentralized multi-agent RL, policy distillation into interpretable decision trees, and an LLM-driven agentic mesh architecture.
- Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve
-
This paper proposes the LessonL framework, enabling multiple small LLM agents to reflect on both successful and failed cases through mutually shared "lessons," collaboratively optimizing code performance. A combination of three 7B–14B models achieves code optimization results on par with GPT-4o and approaching o3.
- LLM Agent Communication Protocol (LACP) Requires Urgent Standardization: A Telecom-Inspired Protocol is Necessary
-
This position paper argues that the fragmented ecosystem of current LLM Agent communication mirrors the "protocol wars" of the early networking era. It proposes LACP, a three-layer protocol (Semantic, Transactional, and Transport layers) inspired by telecom standardization, and contends that security-by-design, transactional integrity, and semantic interoperability are critical for multi-agent systems.
- LLM Agents for Knowledge Discovery in Atomic Layer Processing
-
By having an LLM agent control a simulated chemical reactor (a black-box function), this work demonstrates that agents can explore, discover, and summarize the rules of an unknown chemical system through trial and error without any prior knowledge, revealing both the capabilities and limitations of agents for open-ended scientific discovery.
- MAT-Agent: Adaptive Multi-Agent Training Optimization
-
This paper proposes MAT-Agent, a multi-agent framework consisting of four autonomous agents responsible for data augmentation, optimizer selection, learning rate scheduling, and loss function selection, respectively. The framework dynamically adjusts training configurations during training, replacing conventional static hyperparameter settings with DQN-learned policies, and achieves state-of-the-art performance on multi-label image classification tasks.
- MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges?
-
This paper proposes MLRC-Bench, a dynamic benchmark grounded in ML conference competition tasks, designed to objectively evaluate the ability of LLM agents to propose and implement novel research methods. The study finds that even the strongest agent (gemini-exp-1206) closes only 9.3% of the gap between the baseline and top human solutions, and that LLM subjective scores for "novelty" exhibit virtually no correlation with actual performance.
- Orchestration Framework for Financial Agents: From Algorithmic Trading to Agentic Trading
-
This paper proposes FinAgent, an orchestration framework that maps each component of a traditional algorithmic trading system to a dedicated AI agent (Planner, Orchestrator, Alpha/Risk/Portfolio/Backtest/Execution/Audit/Memory agents), employs the MCP protocol for control communication and the A2A protocol for inter-agent communication, and validates the framework's feasibility on stock and BTC trading tasks.
- PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer
-
This paper proposes PANDA, an agentic AI engineer framework built upon MLLMs, which achieves training-free and human-intervention-free generalist video anomaly detection through four core capabilities: adaptive scene-aware strategy planning, goal-driven heuristic reasoning, tool-augmented self-reflection, and chain-of-memory.
- R&D-Agent-Quant: A Multi-Agent Framework for Data-Centric Factors and Model Joint Optimization
-
This paper proposes R&D-Agent(Q), a data-driven multi-agent framework that automates the joint optimization of factor mining and model innovation for quantitative strategies through five collaborative modules (Specification, Synthesis, Implementation, Validation, and Analysis), achieving approximately 2× the annualized return of traditional factor libraries in real stock markets at a cost of under $10.
- ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling
-
This paper proposes ShapeCraft, a multi-agent framework built on a Graph-based Procedural Shape (GPS) representation. Three LLM agents — Parser, Coder, and Evaluator — collaborate to decompose natural language descriptions into structured sub-task graphs, iteratively generating editable and animatable textured 3D assets.
- SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications
-
SuffixDecoding caches previously generated token sequences in suffix trees and achieves a 5.3× speedup through adaptive speculation lengths, targeting the highly predictable, repetitive inference patterns of agent workloads.
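The lookup can be illustrated with a brute-force suffix match instead of a real suffix tree (which exists precisely to make this search fast). Token lists stand in for token IDs; the function names are invented:

```python
def speculate(history, generated_tail, max_len=8):
    """Simplified stand-in for SuffixDecoding's suffix-tree lookup: find
    the longest suffix of the tokens generated so far that also occurs
    in a cached history, and propose the tokens that followed it there
    as a draft for the target model to verify."""
    for k in range(len(generated_tail), 0, -1):    # longest suffix first
        pat = generated_tail[-k:]
        for i in range(len(history) - k):
            if history[i:i + k] == pat:
                return history[i + k:i + k + max_len]
    return []                                       # nothing to speculate
```

On repetitive agent traces (retries, templated tool calls), long suffixes recur, so the drafts are long and mostly accepted, which is where the speedup comes from.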
- T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning
-
This paper introduces T1, a dataset of 13.5K multi-turn dialogues spanning 9 domains (4 single-domain + 5 cross-domain) and 14 tools, with a focus on inter-tool dependencies and dynamic replanning. A baseline system, T1-Agent (code generation + caching mechanism), is proposed for systematic evaluation. Experiments show that SFT-tuned Llama 8B achieves 87.17% Tool Call F1, surpassing untuned 70B models, yet still trailing closed-source models such as GPT-5 and o3.
- TAI3: Testing Agent Integrity in Interpreting User Intent
-
This paper proposes TAI3, an API-centric stress-testing framework for LLM agent intent integrity. It organizes the natural language input space into a structured test grid via Semantic Partitioning, and leverages Intent-Preserving Mutation and Strategy Memory to efficiently expose intent misinterpretation errors when agents execute user tasks.
- The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement
-
This paper proposes CGI (Critique-Guided Improvement), a dual-role framework that trains a dedicated Critic model to provide structured natural language feedback (discrimination + correction suggestions) to an Actor Agent, and enables the Actor to learn to leverage such feedback through iterative action refinement. CGI achieves an average score of 74.20% across WebShop, ScienceWorld, and TextCraft, surpassing GPT-4o (45.46%) and Iterative SFT (58.21%).
- Traj-CoA: Patient Trajectory Modeling via Chain-of-Agents for Lung Cancer Risk Prediction
-
This paper proposes Traj-CoA, a multi-agent framework that employs a chain-of-agents architecture with an EHRMem long-term memory module to perform temporal reasoning over long, noisy longitudinal EHRs. The framework surpasses ML/DL/BERT/LLM baselines on zero-shot lung cancer risk prediction tasks (5-year EHR data, up to 160k tokens).
- TrajAgent: An LLM-Agent Framework for Trajectory Modeling via Large-and-Small Model Collaboration
-
This paper proposes TrajAgent — an LLM-agent-based framework for trajectory modeling that achieves automated, cross-task, and cross-dataset trajectory modeling through a unified environment (UniEnv), an automated workflow, and a collaborative learning schema between large and small models, outperforming baseline methods by 2.38%–69.91% across multiple tasks.
- Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
-
This paper proposes Web-Shepherd, the first process reward model (PRM) specifically designed for web navigation. By decomposing task objectives into evaluable sub-goals via checklists, 3B/8B models achieve trajectory accuracy far surpassing GPT-4o (85% vs. 10%) at only 1/10 of the cost, making reinforcement learning and inference-time search for web agents practically feasible.
- What AI Speaks for Your Community: Polling AI Agents for Public Opinion on Data Center Projects
-
This paper proposes an LLM-based AI agent polling framework that synthesizes demographically representative virtual resident agents to conduct large-scale, low-cost public opinion surveys on data center projects. Cross-model and cross-region experiments demonstrate high thematic alignment between agent opinions and real-world polls.
- Zero-Shot Large Language Model Agents for Fully Automated Radiotherapy Treatment Planning
-
This paper proposes a zero-shot LLM Agent-based workflow for automated radiotherapy treatment planning, in which the LLM directly interacts with a commercial treatment planning system (Eclipse TPS). By iteratively extracting dose-volume histogram (DVH) metrics and objective function losses and reasoning about constraint adjustment strategies, the approach achieves dose distribution quality comparable to or better than clinical manual planning on 20 head-and-neck cancer IMRT cases.