🦾 LLM Agent
💬 ACL2026 · 82 paper notes
📌 Same area in other venues: 📷 CVPR2026 (39) · 🔬 ICLR2026 (162) · 🧪 ICML2026 (59) · 🤖 AAAI2026 (33) · 🧠 NeurIPS2025 (39) · 📹 ICCV2025 (4)
🔥 Top topics: Agents ×32 · LLM ×28 · Reasoning ×6 · Adversarial Robustness ×5 · Multimodal/VLM ×3
- AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning
-
This paper identifies that "LLM-as-Judge + Fixed Rubrics" (Helpfulness/Safety/Fluency) are poorly matched for evaluating goal-oriented agent trajectories. It proposes AdaRubric—where an LLM automatically generates task-specific N-dimensional evaluation rubrics based on task descriptions, followed by confidence-weighted step-by-step evaluations to produce dense reward signals. A DimensionAwareFilter is designed for DPO data construction to prevent "dimension masking." Evaluated on WebArena/ToolBench/AgentBench, it achieves a Pearson \(r=0.79\) and brings a \(+6.8\) to \(+8.5\%\) task success rate improvement through DPO training.
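The confidence-weighted aggregation mentioned above can be pictured with a small sketch. This is only an illustration under assumptions: the summary does not give AdaRubric's exact formula, so the normalized weighted mean, the `DimensionScore` type, and the `step_reward` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DimensionScore:
    """One rubric dimension judged for a single trajectory step (hypothetical type)."""
    name: str          # rubric dimension, e.g. "navigation correctness"
    score: float       # judge score, assumed to lie in [0, 1]
    confidence: float  # judge confidence, assumed to lie in [0, 1]


def step_reward(scores: list[DimensionScore]) -> float:
    """Confidence-weighted mean of per-dimension scores (an assumed aggregation rule)."""
    total_confidence = sum(s.confidence for s in scores)
    if total_confidence == 0:
        # No dimension was judged with any confidence: return a neutral zero reward.
        return 0.0
    weighted = sum(s.score * s.confidence for s in scores)
    return weighted / total_confidence
```

Under this assumed rule, a low-confidence judgment on one dimension moves the step reward less than a high-confidence judgment on another.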
- AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
-
AgencyBench is proposed as a comprehensive benchmark comprising 138 real-world tasks to evaluate 6 core agent capabilities. Each scenario averages 90 tool calls and 1 million tokens, achieving fully automated evaluation via user simulation agents and Docker sandboxes.
- Agent-GWO: Collaborative Agents for Dynamic Prompt Optimization in Large Language Models
-
This paper proposes Agent-GWO, which introduces the leader-follower mechanism of the Grey Wolf Optimizer into a multi-agent framework to jointly optimize prompt templates and decoding hyperparameters (temperature, top-p, etc.). It consistently outperforms existing prompt optimization methods across 11 mathematical and hybrid reasoning benchmarks.
- AnchorMem: Anchored Facts with Associative Contexts for Building Memory in Large Language Models
-
The AnchorMem memory framework is proposed, inspired by the Proust phenomenon. It decouples the retrieval unit (atomic facts) from the generation context (original interactions) and connects fragmented memories via an associative event graph. It significantly outperforms existing systems like A-Mem and Mem0 on the LoCoMo benchmark.
- AVA: Attentive VLM Agent for Mastering StarCraft II
-
This paper proposes AVACraft—the first StarCraft II multimodal benchmark supporting both MARL and VLM decision-making paradigms (21 scenarios / RGB + Text + Structured State). It introduces the VLM baseline AVA (Multimodal Priority Reasoning + RAG + Dynamic Role Assignment). Experiments demonstrate that while MARL achieves only a 19–27% win rate after 5M training steps in base 3m scenarios, zero-shot VLM reaches 75–90%.
- BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search
-
Addressing the reliability issue where RL-trained agentic search models rarely say "I DON'T KNOW," leading to hallucinations, BAPO introduces "group-based boundary-aware rewards + adaptive reward modulators" on top of GRPO. This allows the model to decline to answer only when a question truly exceeds its boundaries. Compared to GRPO, BAPO improves reliability across four multi-hop QA datasets by approximately 9.7% on average and outperforms Search-R1 (trained on 90k samples) using only 5k training samples.
- Benchmarking Web Agent Safety under E-commerce Deceptive Interfaces
-
The authors developed WebDecept—a lightweight, pluggable "deceptive interface injection layer" that can insert seven types of common real-world deceptive patterns (pop-ups, banners, domain redirection, hidden cart additions, price changes, etc.) into the VisualWebArena e-commerce environment at specific trigger times to test the safety of multimodal web agents. The results show that advanced agents like GPT-5.1, Claude 4.5, and Gemini 2.5 are generally vulnerable, particularly to "hidden cart/total price manipulation," where they almost entirely failed, and safety prompts were unable to mitigate these risks.
- ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering
-
ChartAgent transforms chart question answering from "textual chain-of-thought" to "acting on the image itself." By using a suite of chart-specific visual tools (segmenting pie slices, isolating bars, locating axes) within a ReAct loop and performing self-verification on intermediate visualizations, it achieves gains of up to 16.07% on ChartBench / ChartX for unannotated and numerical-heavy challenges, with a 17.31% improvement on the unannotated subset.
- CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents
-
This paper proposes CLAG, a cluster-based Agent memory framework. It organizes memories into semantically consistent clusters via SLM-driven routing, performs local evolutionary updates within clusters, and filters noise through two-stage retrieval. It significantly outperforms global memory pool baselines across multiple QA datasets.
- CodeStruct: Code Agents over Structured Action Spaces
-
This paper proposes the CodeStruct framework, which redefines code repositories as AST-based structured action spaces. It enables LLM code agents to perform read and edit operations through named program entities (rather than text snippets), achieving a \(1.2-5.0\%\) accuracy improvement on SWE-Bench Verified while reducing token consumption by \(12-38\%\).
- CoEvolve: Training LLM Agents via Agent-Data Mutual Evolution
-
CoEvolve proposes an agent-data mutual evolution framework that extracts three types of weakness signals—forgetting, boundary, and rare patterns—from training trajectories to guide targeted environment re-exploration and task synthesis. This enables the training data distribution to dynamically adapt to agent capabilities, yielding absolute gains of 19-23% on AppWorld and BFCL.
- Context-Value-Action Architecture for Value-Driven Large Language Model Agents
-
The CVA (Context-Value-Action) architecture is proposed based on the S-O-R psychological model and Schwartz Value Theory. By utilizing a Value Verifier trained on real human data to decouple behavior generation from cognitive reasoning, it effectively alleviates the behavior polarization issue in LLM agents, significantly outperforming baselines on CVABench which contains over 1.1 million real interaction trajectories.
- Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations?
-
This paper simulates power-asymmetric conversations using professional roles and personas, finding that LLM agents replicate socio-cognitive effects such as pronoun usage patterns, language coordination, authoritative persuasion, and harmful compliance. While some effects enhance conversational realism, others introduce significant safety risks.
- Don't Act Blindly: Robust GUI Automation via Action-Effect Verification and Self-Correction
-
This paper proposes the VeriGUI framework, which utilizes a Thinking-Verification-Action-Expectation (TVAE) closed-loop reasoning mechanism and a two-stage training pipeline (Robust SFT + GRPO). It enables GUI Agents to verify the success of each operation and perform self-correction upon failure, significantly outperforming baselines at both 3B and 7B scales.
- Don't Adapt Small Language Models for Tools; Adapt Tool Schemas to the Models
-
This paper proposes PA-Tool, a training-free tool schema optimization method. By utilizing the "peakedness" signal borrowed from data contamination detection, it identifies naming patterns familiar to the model from pre-training. By renaming tool components to align with the internalized knowledge of Small Language Models (SLMs), PA-Tool achieves up to a 17% improvement on MetaTool and RoTBench, and reduces schema misalignment errors by 80%.
- Don't Click That: Teaching Web Agents to Resist Deceptive Interfaces
-
This work formalizes "adversarial deceptive UI" as an independent defense problem for web agents. It proposes the two-stage framework DUDE (hybrid-reward RL with asymmetric penalties to train an evaluator + iterative experience summarization to distill failure modes into transferable context) and releases the RUC benchmark containing 1407 real/synthetic scenarios. Across three VLM agent bases, it reduces deception-induced failure rates from 23.5% to 1.5%, pushes task success rates from 9.5% to 60.5%, and demonstrates zero-shot transferability of Stage-2 prompts to closed-source models.
- Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models
-
This paper proposes Guided Topology Diffusion (GTD), which models the communication topology generation of multi-LLM agents as a conditional graph diffusion process. It utilizes a proxy reward model to perform zero-order guidance at each denoising step, thereby generating task-adaptive collaboration networks that are sparser, more token-efficient, and more robust.
- Exploring Reasoning Reward Model for Agents
-
The authors identify that current agentic RL typically employs sparse outcome rewards (evaluating only final correctness), which discards high-quality signals from intermediate reasoning steps. They propose Agent-RRM, a reasoning reward model generating structured feedback in three segments: <think>/<critique>/<score>. A systematic comparison of three integration methods (C: pure critique refinement, R: scalar reward enhancement, U: combined critique + score GRPO) shows Reagent-U achieving 43.7% on GAIA and 46.2% on WebWalkerQA using Qwen3-8B. The results demonstrate that joint supervision using "language-level critique + numerical reward" is significantly more effective than single-signal approaches.
- ExpSeek: Self-Triggered Experience Seeking for Web Agents
-
ExpSeek proposes a proactive experience-seeking framework based on step-level entropy self-triggering, allowing Web Agents to determine when and what guidance is needed based on internal signals during interaction. It achieves absolute improvements of 9.3% and 7.5% on Qwen3-8B/32B respectively.
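As a rough illustration of entropy-based self-triggering, the sketch below averages the Shannon entropy of the per-token distributions in one step and seeks guidance when it exceeds a threshold. The function names, the averaging rule, and the threshold value are assumptions for illustration, not ExpSeek's actual implementation.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def step_entropy(token_distributions: list[list[float]]) -> float:
    """Mean token entropy over all tokens generated in one agent step (assumed rule)."""
    if not token_distributions:
        return 0.0
    total = sum(token_entropy(d) for d in token_distributions)
    return total / len(token_distributions)


def should_seek_experience(token_distributions: list[list[float]],
                           threshold: float = 1.0) -> bool:
    """Trigger experience seeking when the step looks uncertain (hypothetical threshold)."""
    return step_entropy(token_distributions) > threshold
```

A step whose tokens are all near-deterministic has entropy close to zero and does not trigger, while a step with near-uniform token distributions does.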
- FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments
-
FAMA employs an independent "failure analysis agent + orchestration agent" set to automatically diagnose dominant failure modes of a baseline tool-use agent on multi-turn benchmarks like τ-bench. It then directs a mitigation agent to select a minimal subset of helper agents for context injection, achieving up to a 27% increase in task success rate on Qwen series open-source models.
- FedGUI: Benchmarking Federated GUI Agents across Heterogeneous Platforms, Devices, and Operating Systems
-
FedGUI is the first comprehensive federated learning benchmark for cross-platform GUI agents, containing six datasets covering mobile, web, and desktop platforms. It systematically investigates the impact of four dimensions of heterogeneity—platform, device, operating system, and data source—on the training of federated GUI agents.
- Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
-
This paper proposes the FTRL framework, which constructs a stable and controllable tool-use training environment through a five-stage automated pipeline. It designs a verifiable reward mechanism combining tool call precision and task completion. When paired with preference optimization RL algorithms, it achieves an average tool-use performance improvement of over 10% on 7B-14B models, even surpassing the strongest closed-source models.
- FregeLogic at SemEval 2026 Task 11: A Hybrid Neuro-Symbolic Architecture for Content-Robust Syllogistic Validity Prediction
-
The authors propose FregeLogic, a hybrid neuro-symbolic system that combines a five-member LLM ensemble with a Z3 SMT solver as a tie-breaking judge, reducing the content effect by 16% while improving accuracy by 0.9% in syllogistic validity judgment.
- From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms
-
This paper provides a systematic survey of LLM Agent memory mechanisms using an evolutionary framework of "Storage → Reflection → Experience." It utilizes formal definitions to map these three stages to three functional signatures: "Trajectory Retention → Trajectory Refinement → Cross-Trajectory Abstraction." The storyline is structured around three RQs (Why-How-What), with a deep dive into two transformative mechanisms of the Experience stage: Active Exploration and Cross-Trajectory Abstraction.
- GOAT: A Training Framework for Goal-Oriented Agent with Tools
-
GOAT enables small open-source models to decompose high-level goals into sequences of interdependent API calls without human annotation. By automatically constructing a "dependency graph + call-first synthetic data" pipeline from API documentation, it drives open-source models to SOTA performance on RestBench, API-Bank, and the self-constructed GOATBench, even surpassing closed-source models in specific scenarios.
- Grounding Agent Memory in Contextual Intent
-
STITCH introduces "contextual intent" (thematic scope + event type + key entity types) triples as structured retrieval cues for LLM agent long-term memory. These triples are induced online at each trajectory step. During inference, retrieval follows "label density ranking," performing structural matching before semantic scoring. On the newly constructed CAME-Bench, STITCH maintains performance as trajectories grow, outperforming the strongest baseline by 35.6% absolute (100% relative) on the Large subset.
- HAG: Hierarchical Demographic Tree-based Agent Generation for Topic-Adaptive Simulation
-
The HAG framework is proposed, formalizing population Agent generation as a two-stage hierarchical decision process. It uses a world knowledge model to construct a topic-adaptive demographic distribution tree for macro-distribution alignment, followed by real data retrieval and Agent augmentation to ensure micro-level individual consistency. On multi-domain benchmarks, it reduces aggregate alignment error by an average of 37.7% and improves sociological consistency by 18.8%.
- HeLa-Mem: Hebbian Learning and Associative Memory for LLM Agents
-
HeLa-Mem proposes a neuroscience-inspired memory architecture for LLM agents that models conversation history as a dynamic graph with Hebbian learning dynamics. It strengthens inter-memory connections through co-activation, condenses hub memories into semantic knowledge via reflective distillation, and combines semantic similarity with Hebbian spreading activation in a dual-path retrieval process, achieving state-of-the-art performance on LoCoMo with significantly fewer tokens.
- Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents
-
This paper proposes STEP-HRL, which iteratively condenses interaction histories into compact text summaries through a local progress module. This allows high-level and low-level policies to make decisions based only on step-level transitions rather than full histories, significantly improving performance and generalization on ScienceWorld and ALFWorld while reducing token consumption.
- HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents
-
This paper proposes HiGMem, a two-layer event-turn memory system. By enabling the LLM to browse event summaries first before predicting which fine-grained dialogue turns are worth reading, it achieves state-of-the-art F1 scores in four out of five categories on the LoCoMo10 benchmark with an order of magnitude lower retrieval volume.
- How Adversarial Environments Mislead Agentic AI
-
This paper formalizes the "Adversarial Environment Injection" (AEI) threat model, decomposing it into Breadth Attacks (poisoning retrieval results to induce cognitive drift) and Depth Attacks (injecting phantom nodes to construct navigation traps leading to policy collapse). Through 11,000+ experiments, the study reveals that robustness against these two attacks is completely independent—a "robustness split" suggesting that current point-solution defense strategies are insufficient.
- Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents
-
The authors propose TraitBasis—a fine-tuning-free, model-agnostic, and lightweight method that extracts user trait directions like "impatient/confused/skeptical/incoherent" within the hidden space using contrastive activation differences. These directions can be scaled, combined, and injected during inference to simulate challenging users with high fidelity. Integrating this into \(\tau\)-Bench to create the \(\tau\)-trait benchmark, they found that frontier agent performance drops by 4%–20% on average (up to 46%) under varying user behaviors, exposing the illusion that high benchmark scores equate to real-world robustness.
- ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
-
This paper proposes ImplicitMemBench, the first benchmark for systematically evaluating implicit memory in LLMs. It includes 300 test items across three cognitive paradigms: procedural memory, priming, and classical conditioning. Evaluations across 17 models reveal severe limitations: the best model achieves only 66% overall accuracy, far below the human baseline.
- IntrAgent: An LLM Agent for Content-Grounded Information Retrieval through Literature Review
-
IntrAgent decomposes the process of "researchers reading papers to find information" into a two-stage pipeline: "ranking sections by structure, then iteratively reading sections and stopping when sufficient." This allows the LLM to extract fine-grained answers faithfully aligned with queries from a full scientific paper without relying on vector retrieval, outperforming RAG and research agent baselines by an average of 13.2% across five STEM fields on the new IntraBench benchmark.
- Lightweight LLM Agent Memory with Small Language Models
-
This paper proposes LightMem, a lightweight LLM agent memory system driven by multiple specialized Small Language Models (SLMs). By modularizing memory operations into a Controller (SLM-1), Selector (SLM-2), and Writer (SLM-3), and decoupling online processing from offline consolidation, it achieves an average F1 improvement of approximately 2.5 on the LoCoMo benchmark (compared to A-MEM), while maintaining an 83ms retrieval latency and 581ms end-to-end latency.
- LiTS: A Modular Framework for LLM Tree Search
-
LiTS decomposes LLM tree search into Policy, Transition, RewardModel, and unified data structures. Utilizing a decorator registry, it enables the modular reuse of search algorithms, components, and task logic across mathematical reasoning, environmental planning, and tool-use tasks. Furthermore, the study identifies that policy diversity in open-text action spaces serves as a primary bottleneck for tree search.
- LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization
-
This paper proposes Location Preference Optimization (LPO), which optimizes the spatial localization accuracy of GUI agents through entropy-based window rewards and physical distance-based dynamic location rewards integrated with the GRPO framework, achieving SOTA performance in both offline and online evaluations.
- MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents
-
MAGMA decouples the memory of LLM agents into four orthogonal relation graphs: semantic, temporal, causal, and entity. It employs intent routing and adaptive beam search for policy-guided traversal across the appropriate graphs, complemented by a dual-stream writing mechanism ("Fast Path" for synchronous ingestion and "Slow Path" for asynchronous LLM consolidation). On LoCoMo, it achieves a Judge score of 0.700, comprehensively outperforming A-MEM, Nemori, and MemoryOS, while maintaining a query latency of only 1.47s (40% faster than the runner-up).
- MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools
-
MCP-Flow proposes a Web Agent-based automated pipeline to collect tool information from 1166 real-world MCP servers and synthesize 68,733 high-quality training data points. This allows small-scale fine-tuned models (0.6B-8B) to outperform SOTA large models like GPT-4o in MCP tool utilization.
- Mem²Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation
-
This paper proposes Mem²Evolve, a self-evolving agent framework that achieves co-evolution of capability expansion and experience distillation through a dual memory mechanism (Asset Memory + Experience Memory). It achieves an average Pass@1 of 70.24% across 8 benchmarks in 6 task categories, outperforming the strongest experience-centric and capability-centric baselines by 11.80% and 6.46%, respectively.
- Mem^p: Exploring Agent Procedural Memory
-
This paper proposes the Mem^p framework to systematically study how to build learnable, updatable, and lifelong evolving procedural memory for LLM Agents. By distilling past task trajectories into fine-grained step-by-step instructions and high-level script abstractions, combined with a dynamic update mechanism (addition/validation/reflection/elimination), the authors achieve continuous success rate improvements and significant reductions in execution steps on TravelPlanner and ALFWorld.
- MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End RL
-
MemSearcher replaces the "history concatenation" of search agents with "LLM-managed compact memory": only (question, memory) is processed each round instead of (question, t₁, a₁, o₁, …). Utilizing multi-context GRPO, it propagates the advantage of the entire trajectory to each round for independent optimization. MemSearcher outperforms same-sized ReAct baselines across 3B/7B/14B scales on 7 QA benchmarks (the 7B model even surpasses the 32B ReSearch) while maintaining a constant context length of <4K tokens.
- Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models
-
A systematic comparison of hypernetwork LoRA adaptation vs. carefully designed few-shot prompting across four benchmarks finds that a 228-million-parameter hypernetwork provides zero gain: few-shot examples contribute +21.5%, document encoding contributes +5.0%, and the hypernetwork contributes 0%. A 3B model with effective prompting achieves 79.7% of average GPT-5 performance with 10x lower latency.
- Mina: A Multilingual LLM-Powered Legal Assistant Agent for Bangladesh
-
Mina is developed as a multilingual LLM legal assistant specifically for Bangladesh's legal landscape. By utilizing a two-stage RAG pipeline to accurately retrieve acts and sections, combined with a toolchain and multilingual embeddings, it achieved a 75–80% passing rate in the Bangladesh Bar Council MCQ exams. The operational cost of legal consultation is reduced to only 0.12–0.61% of traditional methods.
- MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery
-
MOOSE-Copilot unifies divergent exploration of scientific ideas and convergent refinement of fine-grained hypotheses into a visual human-AI collaborative system, significantly enhancing hypothesis discovery through three explicit human signals: initial blueprints, stage routing, and feedback.
- OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory
-
OCR-Memory renders long-horizon agent interaction trajectories into images with numbered anchors, allowing a fine-tuned OCR retriever to first localize relevant segments in visual space and then retrieve original text by index. This approach maintains complete history under strict context budgets and improves long-horizon task performance on Mind2Web and AppWorld.
- OctoTools: An Agentic Framework with Extensible Tools for Complex Reasoning
-
OctoTools is a training-free, user-friendly, and extensible multi-agent framework. By utilizing standardized Tool Cards to encapsulate heterogeneous tools, a Planner-Executor separation paradigm, and a task-specific toolset optimization algorithm, it achieves an average accuracy improvement of +9.3% over GPT-4o and up to +10.6% over frameworks like AutoGen and LangChain across 16 diverse benchmarks.
- OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation
-
OPeRA is a user behavior dataset collected from real Amazon shopping processes. It aligns personas, web observations, fine-grained actions, and real-time rationales on the same timeline to evaluate whether LLMs can truly simulate a specific user's next shopping behavior.
- PersonaAgent: Bridging Memory and Action for Personalized LLM Agents
-
PersonaAgent connects user history with tool-based actions through "personalized memory + personalized actions + test-time optimizable persona prompts," significantly outperforming baselines such as RAG, PAG, ReAct, and MemBank on multiple LaMP personalized decision-making tasks.
- Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair
-
Polaris transforms the recursive self-improvement of Gödel Agents into a "failure analysis → experience abstraction → minimal code patch → execution validation" policy repair loop tailored for 7B/8B small models. This achieves interpretable, persistently reusable policy-level improvements on MGSM, DROP, GPQA, and LitBench.
- PRInTS: Process Reward Modeling for Long-range Information Retrieval
-
PRInTS migrates "Process Reward Models (PRM)" from short-form mathematical reasoning to long-range information retrieval (IR) Agents. By utilizing a 4B model that simultaneously learns to "assign dense scores to each step based on information gain" and "recursively compress expanding trajectory contexts," the method achieves a 9.3% average improvement for 32B-scale Agents via test-time best-of-\(n\) selection. Notably, the 30B+4B combination outperforms the 671B DeepSeek-V3.1 on the GAIA benchmark.
- ProPer Agents: Proactivity Driven Personalized Agents for Advancing Knowledge Gap Navigation
-
ProPer models proactive agents as the problem of "discovering and calibrating unspoken task dimensions." Through a Dimension Generating Agent, a post-hoc reranker, and a Response Generating Agent, it selectively fills knowledge gaps, significantly improving response quality and win rates across medical, coding, and shopping recommendation tasks.
- RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents
-
RecMem draws from the "consolidation through repetition" principle in human memory, placing raw interactions into a lightweight subconscious memory first. It only invokes the LLM to generate episodic and semantic memory upon detecting semantic recurrence, thereby reaching or exceeding the QA accuracy of mainstream memory systems on LoCoMo and LongMemEval-S at significantly lower construction token costs.
- Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
-
The paper proposes BRIGHT-PRO, which re-evaluates reasoning-intensive retrievers using multi-aspect evidence annotation and agentic search protocols. It also introduces RTriever-Synth to train RTriever-4B, demonstrating that retrievers should optimize for "evidence portfolio coverage" rather than single-passage relevance.
- Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors
-
This paper proposes Fission-GRPO, which dynamically transforms tool execution errors into online-policy correction training instances within the RL loop. By utilizing a learned error simulator to generate diagnostic feedback and resampling recovery trajectories, it improves the error recovery rate of Qwen3-8B by 5.7% and the overall accuracy from 42.75% to 46.75%.
- SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning
-
SafeMCP is an agent defense plugin deployed on the MCP server side. It utilizes an environmental dynamics world model for look-ahead reasoning to first filter tools that might expand dangerous power boundaries, and then performs real-time interception of initiated hazardous calls. It simultaneously enhances safety and preserves task utility across PowerSeeking Bench, ToolEmu, and AgentHarm.
- SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents
-
SEARL jointly optimizes agent policy parameters and external Tool Graph Memory. It addresses credit assignment in long trajectories by utilizing tool-anchored step-level advantages and process rewards, enabling small models to continuously create, reuse, and integrate tools in multi-hop QA and complex mathematical tasks.
- Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks
-
Shopping Companion constructs an e-commerce task benchmark with long-term user preference memory and a real product library. It employs a two-stage agent with dual rewards and tool-level rewards to jointly optimize preference identification and product recommendation, enabling a 4B model to approach the performance of strong closed-source models.
- SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning
-
SOLAR-RL utilizes offline trajectory reconstruction, failure point detection, and target-aligned reward shaping to process static GUI data into long-horizon training signals with pseudo-online feedback. This allows the Qwen2.5-VL-7B scale GUI agent to achieve stable performance on Android Control, GUI-Odyssey, and Android World, matching or exceeding strong offline baselines.
- Spec-o3: A Tool-Augmented Vision-Language Agent for Rare Celestial Object Candidate Identification
-
This paper proposes Spec-o3, a tool-augmented vision-language agent that simulates the spectral inspection workflow of astronomers through Interleaved Multimodal Chain-of-Thought (iMCoT). Using a two-stage training approach with cold-start SFT and outcome-based RL, it improves the macro-F1 of rare celestial object identification from 28.3% to 76.5%, achieving an inference speed ~50x faster than manual inspection.
- StructMem: Structured Memory for Long-Horizon Behavior in LLMs
-
StructMem proposes a structure-enhanced hierarchical memory framework. Through event-level dual-view extraction and cross-event semantic integration, it achieves SOTA performance (76.82%) on the LoCoMo long-dialogue benchmark while significantly reducing token consumption (1.94M vs. 35.8M for graph memory) and API call counts.
- Supplement Generation Training for Enhancing Agentic Task Performance
-
SGT (Supplement Generation Training) trains a small LLM (1.7B) to generate instance-specific supplemental text (reasoning clues, summaries, error reminders, etc.). When appended to the input, these supplements allow a frozen large Actor model to solve tasks more effectively, achieving an average improvement of 21% across 5 benchmarks without modifying the Actor's parameters.
- SynthAgent: Adapting Web Agents with Synthetic Supervision
-
This paper presents SynthAgent, a framework for adapting Web Agents based entirely on synthetic supervision. It systematically covers functional areas of web pages to synthesize diverse tasks through categorical exploration. Then, a dual refinement strategy is employed: task refinement (triggered by conflict detection to correct hallucinations) and trajectory refinement (denoising from a global perspective). SynthAgent significantly outperforms existing synthesis methods on WebArena and Online-Mind2Web.
- Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment
-
The authors find that LLM Agents exhibit a human-like "Actor-Observer Asymmetry" (AOA) cognitive bias during role-playing: as actors they tend to attribute their own failures to external factors, while as observers they attribute others' failures to internal errors. They propose ReTAS to eliminate this bias through dialectical reasoning (Thesis-Antithesis-Synthesis) and GRPO alignment.
- Temp-R1: A Unified Autonomous Agent for Complex Temporal KGQA via Reverse Curriculum Reinforcement Learning
-
Temp-R1 transforms Temporal Knowledge Graph Question Answering (TKGQA) from manually designed fixed prompt workflows into an autonomous agent trainable via reinforcement learning. By employing explicit internal actions, SFT cold start, GRPO, and a "hard-first" reverse curriculum, it outperforms strong baselines driven by GPT-4o/DeepSeek-V3 using an 8B open-source model.
- The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check
-
This paper systematically evaluates the performance of diffusion language models (dLLMs) in embodied and tool-use agents. It finds that despite the speed potential offered by parallel decoding, dLLMs significantly lag behind autoregressive (AR) LLMs in long-horizon causal planning and strict format generation. Furthermore, the authors utilize DiffuAgent to demonstrate that dLLMs are better suited as non-causal auxiliary modules, such as for memory compression and tool filtering.
- TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning
-
TheraAgent transforms treatment plan generation from a one-shot response into a generate-reflect-refine self-improving agent workflow. By utilizing a clinical multidimensional evaluator, TheraJudge, and score-aware memory to continuously refine plans, it significantly outperforms strong baselines in the HealthBench treatment planning subset and blind physician evaluations.
- TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents
-
TiMem organizes long-range conversational memory into a five-layer Temporal Memory Tree with explicit temporal containment. By employing complexity-aware retrieval to dynamically balance fine-grained facts and high-level personas, it improves accuracy on LoCoMo and LongMemEval-S while significantly reducing recalled context length.
- ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"
-
ToolGrad reverses tool-use data generation from "writing queries first and searching tool chains via DFS" to "generating successfully executable tool chains first and then back-inferring user queries." By using an API selection loop similar to textual gradients to construct ToolGrad-500, the pass rate for data generation reaches 99.8%. Small models like Gemma-3 trained on this data outperform several powerful closed-source models in single-turn tool calling.
- ToolOmni: Enabling Open-World Tool Use via Agentic Learning with Proactive Retrieval and Grounded Execution
-
This paper proposes ToolOmni, a unified agent framework that integrates proactive tool retrieval and retrieval-based tool execution into a single reasoning loop. Through a two-stage approach of cold-start SFT and decoupled multi-objective GRPO, it jointly optimizes retrieval and execution capabilities, achieving an end-to-end success rate on ToolBench that surpasses strong baselines by 10.8%.
- Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs
-
This paper systematically measures how communication topology affects the leakage of personally identifiable information (PII) in multi-agent LLM systems through the MAMA framework. It identifies dense connectivity and the distance between attackers and targets as critical factors determining leakage risk.
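The two factors named above, connectivity density and attacker-target distance, can be made concrete with generic graph statistics. The sketch below is only an illustration: the function names `density` and `hop_distance` and the two toy topologies are assumptions, not part of the MAMA framework.

```python
from collections import deque

def density(adj):
    """Fraction of possible undirected links that are present."""
    n = len(adj)
    links = sum(len(nbrs) for nbrs in adj.values()) / 2  # each link is listed twice
    return links / (n * (n - 1) / 2)

def hop_distance(adj, src, dst):
    """Shortest number of hops from src to dst (breadth-first search)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nbr in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None  # dst is unreachable from src

# Toy topologies over four agents; agent 0 is the attacker, agent 3 the target.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
full = {i: [j for j in range(4) if j != i] for i in range(4)}

for name, adj in [("chain", chain), ("full", full)]:
    print(name, density(adj), hop_distance(adj, 0, 3))
```

In this toy setup the chain has density 0.5 with the attacker three hops from the target, while the fully connected topology has density 1.0 with the attacker one hop away.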
- Towards Scalable Lightweight GUI Agents via Multi-role Orchestration
-
This paper proposes the LAMO framework, which trains a lightweight 3B MLLM into a GUI Agent capable of flexible multi-role orchestration through role-oriented data synthesis and two-stage training (SFT with Perplexity-Weighted Cross-Entropy + Multi-task RL). Operating in three modes (monolithic inference, multi-agent collaboration, and plug-and-play policy executor), it achieves a 77.6% success rate on AndroidWorld when paired with a GPT-5 planner, surpassing specialized GUI Agents with 72B parameters.
- Uncertainty Quantification in LLM Agents: Foundations, Emerging Challenges, and Opportunities
-
This paper proposes the first formal framework for Agent Uncertainty Quantification (Agent UQ): it models agent problem-solving trajectories as stochastic processes on Dynamic Bayesian Networks, with the joint probability factorized as \(P(\mathcal{F}_{\leq T}) = P(E_0, O_0) \prod_{i=1}^{T} P_{\pi,\mathcal{T}}(A_i|E_{i-1}, O_{i-1}) P(O_i|A_i, E_i)\). It unifies existing UQ paradigms (single-step QA, multi-step reasoning) as special cases and identifies four unique technical challenges of agent UQ through empirical analysis on \(\tau^2\)-bench.
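As a rough illustration of the formula above, the sketch below evaluates the trajectory log-probability from per-step probabilities that are assumed to be given. The function name `trajectory_log_prob` and the input layout are hypothetical and not taken from the paper.

```python
import math

def trajectory_log_prob(p_init, steps):
    """Log-probability of a trajectory under the product form above.

    p_init: P(E_0, O_0), probability of the initial state and observation.
    steps:  list of (p_action, p_obs) pairs, one per step i = 1..T, where
            p_action = P(A_i | E_{i-1}, O_{i-1}) and p_obs = P(O_i | A_i, E_i).
    All probabilities are assumed to be supplied by the caller.
    """
    log_p = math.log(p_init)
    for p_action, p_obs in steps:
        # Each step contributes one policy term and one observation term.
        log_p += math.log(p_action) + math.log(p_obs)
    return log_p

# Two-step toy trajectory with made-up probabilities.
log_p = trajectory_log_prob(0.5, [(0.5, 0.5), (0.25, 1.0)])
print(math.exp(log_p))
```

Summing logs rather than multiplying raw probabilities keeps the computation from underflowing on long trajectories.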
- Verified Critical Step Optimization for LLM Agents
-
CSO identifies "verified critical steps" from an agent's own failed trajectories where "changing a single action leads to task success." It constructs DPO preference pairs only at these critical decision points, enhancing the post-training performance of long-horizon LLM agents with fewer and more reliable supervisory signals.
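A minimal sketch of the idea, assuming a verifier that can replay a trajectory prefix with one action swapped: the names `find_critical_pairs`, `alternatives`, and `rollout_succeeds` are hypothetical, and the toy environment is made up for illustration rather than taken from the paper.

```python
def find_critical_pairs(trajectory, alternatives, rollout_succeeds):
    """Return DPO-style preference pairs at verified critical steps.

    trajectory:       list of (state, action) pairs from a failed episode.
    alternatives:     hypothetical callable, state -> candidate actions.
    rollout_succeeds: hypothetical verifier; takes the action prefix with
                      one action swapped and returns True if continuing
                      from there ends in task success.
    """
    pairs = []
    actions = [action for _, action in trajectory]
    for i, (state, original) in enumerate(trajectory):
        for candidate in alternatives(state):
            if candidate == original:
                continue
            # Keep the prefix unchanged and swap only the action at step i.
            if rollout_succeeds(actions[:i] + [candidate]):
                pairs.append({"prompt": state, "chosen": candidate, "rejected": original})
                break  # one verified pair per step is enough for this sketch
    return pairs

# Toy failed episode: the agent clicked an ad instead of the search result.
failed = [("s0", "search"), ("s1", "click_ad"), ("s2", "submit")]
candidates = lambda state: ["search", "click_result", "click_ad", "submit"]
verifier = lambda prefix: prefix == ["search", "click_result"]

pairs = find_critical_pairs(failed, candidates, verifier)
print(pairs)
```

Only step `s1` yields a pair here, since it is the only point where a single swapped action makes the toy verifier report success.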
- Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories
-
This paper proposes SPECTRA, a framework that requires no supervised trajectories. By utilizing cold-start Reinforcement Learning (GRPO) and soft-structured multi-round rollout topological constraints, it enables Small Vision-Language Models (SVLMs) to autonomously discover effective tool-calling and visual reasoning behaviors through pure environment interaction. It achieves up to a 5% increase in task accuracy and a 9% improvement in tool efficiency across 4 multimodal benchmarks, while introducing the Tool Instrumental Utility (TIU) metric to quantify tool efficacy in unsupervised settings.
- WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning
-
WebClipper models long tool-call trajectories of Web Agents as "Action Node-Information Node" state graphs and mines the minimum necessary DAG to prune cyclic searches and invalid branches. This reduces the average tool rounds by approximately 21% and tokens by 19.4% for Deep Research agents while maintaining or even improving accuracy.
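One simple way to approximate this kind of pruning is backward reachability from the final answer: any node that never feeds into the answer is dropped. The sketch below uses that stand-in technique; the function name `prune_to_necessary`, the edge encoding, and the toy graph are assumptions, and the paper's actual mining procedure is not described in this note.

```python
def prune_to_necessary(edges, answer):
    """Keep only nodes from which the final answer is reachable.

    edges:  list of (src, dst) pairs; an edge means dst was derived from src
            (an action returned some information, or information prompted an action).
    answer: id of the node holding the final answer.
    """
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, []).append(src)

    keep, stack = {answer}, [answer]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            if parent not in keep:  # the visited set also guards against cycles
                keep.add(parent)
                stack.append(parent)
    return keep

# Toy trajectory: a3 -> i3 is a dead-end branch that never feeds the answer.
edges = [
    ("a1", "i1"), ("i1", "a2"), ("a2", "i2"),
    ("i1", "a3"), ("a3", "i3"),
    ("i2", "a4"), ("a4", "answer"),
]
kept = prune_to_necessary(edges, "answer")
print(sorted(kept))
```

In the toy graph the dead-end action `a3` and its result `i3` are pruned, while the chain leading to the answer is kept.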
- What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search
-
Through large-scale experiments (15 LLMs × 8 tasks, 72K candidate solutions), this paper finds that excellent LLM optimizers function as "local refiners"—continuously producing frequent incremental improvements and gradually concentrating search within semantic space, rather than generating high-novelty jumpy breakthroughs. A key finding is that novelty itself does not predict optimization performance; novelty is beneficial only when the search remains sufficiently localized.
- When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors
-
This paper proposes two complementary metrics, RPS and AGS, to quantify distillation-induced homogenization in LLM Agent tool-use behaviors. By distinguishing between mandatory and optional behaviors, it reveals cross-family behavioral inheritance patterns across 18 models. Notably, the behavioral similarity between Kimi-K2 and Claude Sonnet 4.5 exceeds even that among Anthropic's own models.
- Why LLM Web Agents Fail: A Hierarchical Planning Perspective
-
This paper systematically analyzes the failure causes of LLM web agents through a hierarchical planning framework (high-level planning, low-level execution, and replanning). It discovers that PDDL representations outperform natural language planning, but low-level execution and perceptual grounding are the primary bottlenecks.
- YIELD: A Large-Scale Dataset and Evaluation Framework for Information Elicitation Agents
-
This paper proposes Information Elicitation Agents (IEA) as a new dialogue paradigm and releases YIELD, the first large-scale human-to-human information elicitation dialogue dataset (2,281 dialogues, 26M tokens). The study formalizes information elicitation as a finite-horizon POMDP and designs specialized evaluation metrics (Conformity, Progression, TLR). Experiments demonstrate that fine-tuning on YIELD significantly improves the alignment of Large Language Models (LLMs) with authentic elicitation behaviors.
- Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception
-
This paper reveals the "Temporal Blindness" of LLM Agents in multi-turn interactions—their inability to adjust tool-calling decisions based on the actual time elapsed between messages—and constructs the TicToc benchmark to evaluate this issue.
- ZARA: Training-Free Motion Time-Series Reasoning via Evidence-Grounded LLM Agents
-
This paper proposes ZARA, a knowledge and retrieval-augmented multi-agent framework. By distilling sensor signals into a structured text knowledge base, performing class-wise retrieval, and employing hierarchical LLM reasoning, it achieves interpretable human activity recognition in a completely training-free setting, significantly outperforming existing methods on 8 datasets.