🦾 LLM Agent

🔬 ICLR 2026 · 47 paper notes

A Benchmark for Deep Information Synthesis (DeepSynth): This paper proposes DeepSynth, a benchmark comprising 120 real-world information synthesis tasks spanning 7 domains and 67 countries (averaging 5.5 hours of expert annotation per task). The benchmark requires agents to collect information from multiple web sources and perform structured reasoning. The strongest current agent (o3-deep-research) achieves only 8.97 F1 / 17.5% LLM-Judge, exposing a critical gap in LLM agents' information synthesis capabilities.
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models: This paper proposes ACE (Agentic Context Engineering), a framework that treats context as a continuously evolving playbook. Through a Generator–Reflector–Curator role decomposition and incremental delta updates, ACE accumulates and refines strategies over time, addressing brevity bias and context collapse in existing prompt optimization methods. ACE achieves an average improvement of 10.6% on agent benchmarks and 8.6% on financial tasks, while reducing adaptation latency by 86.9%.
AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents: This paper proposes AgentSynth, a pipeline that leverages information asymmetry (forward stepwise generation is easy; backward holistic solving is hard) to chain simple subtasks into complex long-horizon computer-use tasks. It automatically generates 6,000+ diverse tasks and trajectories at $0.60 per trajectory, with SOTA agents achieving only 4% success rate at the highest difficulty level.
ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents: This paper exposes a structural vulnerability in chat templates used by LLM agents: by embedding forged role labels (e.g., <system>, <user>) in tool-returned data, attackers can hijack the model's perception of the role hierarchy and disguise malicious instructions as high-priority directives, raising the attack success rate (ASR) from 5–15% to 32–52%.
CoMind: Towards Community-Driven Agents for Machine Learning Engineering: This paper proposes MLE-Live — the first real-time evaluation framework simulating a Kaggle research community — and CoMind, a multi-agent ML engineering system that systematically leverages collective community knowledge. CoMind achieves a 36% medal rate across 75 historical Kaggle competitions and outperforms an average of 79.2% of human participants on 4 active competitions (reaching 92.6% in an updated version).
Efficient Agent Training for Computer Use: PC Agent-E uses only 312 human-annotated Windows operation trajectories. Via the proposed Trajectory Boost method, Claude 3.7 Sonnet synthesizes diverse alternative action decisions at each timestep. The resulting fine-tuned Qwen2.5-VL-72B achieves a 141% relative improvement on WindowsAgentArena-V2, even surpassing the teacher model Claude 3.7 Sonnet by 10%.
Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization: This paper proposes EMPO2, an RL framework that combines an external memory module with hybrid on-policy/off-policy updates. By leveraging memory-guided exploration and knowledge distillation to internalize exploration gains into model parameters, EMPO2 achieves improvements of 128.6% and 11.3% over GRPO on ScienceWorld and WebShop, respectively.
FeatureBench: Benchmarking Agentic Coding for Complex Feature Development: This paper introduces FeatureBench—a benchmark for feature-level software development targeting code agents, comprising 200 tasks across 24 open-source repositories, with each task requiring an average of 790 lines of code spanning 15.7 files. Even Claude Opus 4.5 (74.4% on SWE-bench) resolves only 11.0% of tasks, revealing a substantial capability gap in realistic feature development scenarios.
FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents: FingerTip 20K collects 21,437 interaction records from 95 users during real-world daily smartphone usage—including user profiles, timestamps, locations, and historical intents—and introduces two new evaluation tracks: proactive task suggestion (predicting user intent) and personalized task execution (adapting to action preferences). The strongest model, Qwen-QVQ-Max, achieves only 12.8% success on proactive suggestion (vs. 30.3% for humans), while UI-TARS reaches only 38.5% on execution.
Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments: This paper introduces the Gaia2 benchmark for evaluating LLM agents in dynamic and asynchronous environments. It incorporates realistic scenarios including time constraints, noisy events, ambiguity resolution, and multi-agent collaboration. A write-action verifier with verifiable rewards enables direct use for RLVR training. Evaluation results show that the strongest model, GPT-5 (high), achieves only 42% pass@1.
HAMLET: A Hierarchical and Adaptive Multi-Agent Framework for Live Embodied Theatre: This paper proposes HAMLET, a multi-agent framework that decouples AI theatrical creation and live performance into an offline planning phase and an online performance phase. Through a narrative blueprint, a Perceive And Decide (PAD) module, and a hierarchical control system, HAMLET enables an AI theatre experience characterized by proactivity, physical environment interaction, and improvisational freedom.
Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents: This paper proposes EMPG, a framework that dynamically modulates policy gradient magnitudes using step-level entropy (uncertainty) to address the credit assignment problem under sparse rewards in long-horizon LLM agent tasks. EMPG achieves significant improvements over GRPO and DAPO on three benchmarks: WebShop, ALFWorld, and Deep Search.
InfiAgent: Self-Evolving Pyramid Agent Framework for Infinite Scenarios: This paper proposes InfiAgent, a DAG-based pyramidal multi-agent framework that achieves automated hierarchical task decomposition, dual-audit quality assurance, intelligent routing, and self-evolution through an agent-as-a-tool mechanism, outperforming ADAS by an average of 9.9% across multiple reasoning benchmarks.
Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals: This paper finds that while modern LLM agents are robust to direct adversarial pressure (goal drift = 0), they can "inherit" goal drift behavior from the context produced by weaker models. More counterintuitively, instruction hierarchy compliance (system vs. user prompt priority) shows no correlation with drift resistance — Gemini fails to follow system prompts yet exhibits non-trivial drift resistance, while Qwen3 follows system prompts but remains susceptible to contextual contagion.
Judge Reliability Harness: Stress Testing the Reliability of LLM Judges: This paper proposes Judge Reliability Harness (JRH), an open-source framework that systematically evaluates the reliability of LLM judges through synthetic tests including label flip, format invariance, semantic paraphrase, verbosity bias, and stochastic stability. The framework stress-tests four state-of-the-art judges across four benchmarks (FORTRESS, HarmBench, Persuade, AgentHarm), finding that no single judge is reliable across all scenarios.
LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News: This paper proposes LiveNewsBench, a periodically updated benchmark that automatically generates QA pairs from fresh news events to evaluate LLM agentic web search capabilities, effectively isolating model internal memory from genuine search ability.
LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News: This paper proposes LiveNewsBench, an automatically generated and periodically refreshed benchmark derived from recent news articles. It evaluates LLMs' agentic web search capabilities through multi-hop, factual question answering, effectively decoupling models' parametric knowledge from their retrieval ability. Model performance ranges from 11% to 90%, demonstrating strong discriminative power.
M²-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining: This paper proposes M²-Miner, the first MCTS-based automated data mining framework for mobile GUI agents. Through the collaboration of three agents (InferAgent, OrchestraAgent, and JudgeAgent), combined with an intent recycling strategy and progressive model-in-the-loop training, M²-Miner generates SOTA-quality data at 1/18 the cost of human annotation.
M²-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining: This paper proposes M²-Miner, the first MCTS-based automatic data mining framework for mobile GUI agents. Through the collaboration of three agents—InferAgent, OrchestraAgent, and JudgeAgent—it achieves a 64× improvement in mining efficiency, enriches intent diversity via an intent recycling strategy, and trains a GUI agent that achieves state-of-the-art performance on multiple benchmarks.
MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains: This paper proposes MC-Search, the first benchmark for agentic multimodal RAG, comprising 3,333 high-quality samples (averaging 3.7 hops) across 5 reasoning topology types. The benchmark employs HAVE verification to ensure the necessity of each reasoning step, and introduces the Search-Align process-supervised fine-tuning framework, which substantially improves retrieval planning in open-source models (Qwen2.5-VL-7B F1 +13.7).
FeatureBench: Benchmarking Agentic Coding for Complex Feature Development: This paper introduces FeatureBench, a benchmark for evaluating code agents on feature-level software development. Through a test-driven automated pipeline, it extracts verifiable feature implementation tasks from open-source repositories. The strongest model, Claude Opus 4.5, resolves only 11.0% of tasks, revealing a substantial gap between current agents and the demands of complex feature development.
Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies: This paper systematically analyzes the respective contributions of prompt design and topology design in multi-agent systems (MAS), finding that prompt optimization is the single most critical factor: a single agent with optimized prompts can outperform complex multi-agent topologies. The paper proposes MASS, a three-stage framework (block-level prompt → topology → workflow-level prompt) that achieves state-of-the-art performance across 8 benchmarks.
NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents: This paper proposes NewtonBench, a benchmark for LLM-based scientific law discovery comprising 324 tasks across 12 physical domains. Novel tasks resistant to memorization are generated via "counterfactual law shifts," requiring agents to discover hidden physical equations through interactive experimentation. GPT-5 achieves the best performance (75.9% symbolic accuracy) but degrades sharply on complex systems (40.3%), and code tools surprisingly hurt stronger models.
OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety: This paper proposes OpenAgentSafety, a comprehensive AI agent safety evaluation framework comprising 350+ executable tasks, a real-world toolset (browser, terminal, file system, and messaging platforms), and multi-turn multi-user interaction scenarios. The framework reveals that even state-of-the-art LLMs exhibit unsafe behaviors in 49%–73% of safety-sensitive tasks.
PerfGuard: A Performance-Aware Agent for Visual Content Generation: This paper proposes PerfGuard, a performance-aware agent framework for visual content generation. It replaces textual tool descriptions with a multi-dimensional performance scoring matrix to model tool capability boundaries, and incorporates adaptive preference updating and capability-aligned planning optimization, substantially improving tool selection accuracy (error rate reduced from 77.8% to 14.2%) and visual generation quality.
PhyScensis: Physics-Augmented LLM Agents for Complex Physical Scene Arrangement: This paper proposes PhyScensis, an LLM agent framework augmented with a physics engine that generates high-complexity, physically accurate 3D scenes via a spatial and physical predicate-driven solver. It significantly outperforms prior methods in visual quality, semantic correctness, and physical accuracy, and is successfully applied to training robotic manipulation policies.
PerfGuard: A Performance-Aware Agent for Visual Content Generation: This paper proposes PerfGuard, a performance-aware agent framework for visual content generation. It replaces textual descriptions with a multi-dimensional scoring matrix to model tool performance boundaries (PASM), employs Adaptive Preference Updating (APU) to dynamically calibrate deviations between theoretical rankings and actual execution outcomes, and introduces Capability-Aligned Planning Optimization (CAPO) to guide the Planner in generating subtasks aligned with tool capabilities. PerfGuard comprehensively outperforms SOTA methods such as GenArtist and T2I-Copilot on image generation and editing tasks.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents: This paper proposes T³ (Truncating Belief-Trapped Trajectories), which leverages POMDP theory to analyze the "belief trap" phenomenon in multi-turn active reasoning of LLM agents. By detecting belief deviation and truncating uninformative trajectory suffixes, T³ corrects credit assignment errors during RL training, achieving performance gains of up to 30 points across 5 challenging tasks while reducing token consumption by 34%.
REMem: Reasoning with Episodic Memory in Language Agents: This paper proposes REMem, an episodic memory framework for language agents that employs a hybrid memory graph (temporally-aware gist nodes combined with factual triple nodes) and tool-augmented agentic reasoning, achieving improvements of 3.4% and 13.4% over the state of the art on episodic recall and episodic reasoning tasks, respectively.
SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents: This paper proposes SimuHome, a time-accelerated smart home simulator based on the Matter protocol along with a 600-episode benchmark. It is the first benchmark to simulate the continuous effects of device operations on environmental variables and to evaluate workflow scheduling capabilities. Results reveal that workflow scheduling remains the most challenging frontier for current LLM agents, including GPT-5.1.
Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents: This paper proposes the HPL framework to address the granularity mismatch in preference learning for long-horizon LLM agents. Through three-level DPO (trajectory-level + step-level + action-group-level) and two-dimensional curriculum learning (subtask complexity × sample difficulty), HPL significantly outperforms baselines such as ETO and IPR on ALFWorld/WebShop/InterCode-SQL (average 59.44 vs. 55.43/55.49).
SR-Scientist: Scientific Equation Discovery With Agentic AI: This paper proposes the SR-Scientist framework, which elevates LLMs from simple equation proposers to autonomous AI scientists. By leveraging a code interpreter tool for data analysis and equation evaluation, the framework autonomously discovers scientific equations through long-horizon interactions, with reinforcement learning further enhancing its capabilities.
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents: This paper introduces ST-WebAgentBench, the first benchmark specifically designed to evaluate the safety and trustworthiness of web agents. Through a policy hierarchy framework and the Completion under Policy (CuP) metric, it reveals that current SOTA agents exhibit serious policy violations in enterprise settings.
The Controllability Trap: A Governance Framework for Military AI Agents: This paper proposes the Agentic Military AI Governance Framework (AMAGF), which transforms human control over military AI agents from a binary judgment into a continuous, quantified monitoring system centered on the Control Quality Score (CQS), encompassing three pillars: prevention, detection, and correction.
The Controllability Trap: A Governance Framework for Military AI Agents: This paper proposes the Agentic Military AI Governance Framework (AMAGF), a governance framework for military AI agents built around a measurable Control Quality Score (CQS), addressing six categories of agentic governance failures through three pillars: prevention, detection, and correction.
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution: This paper introduces Toolathlon, a language agent benchmark covering 32 software applications, 604 tools, and 108 tasks, emphasizing realistic and diverse environment states alongside long-horizon multi-step interactions (averaging ~20 tool calls per task). The strongest evaluated model, Claude-4.5-Sonnet, achieves only 38.6% task success rate.
ToolTree: Efficient LLM Agent Tool Planning via Dual-Feedback Monte Carlo Tree Search and Bidirectional Pruning: This paper proposes ToolTree, an MCTS-based tool planning framework for LLM agents that achieves look-ahead tool selection within a fixed computational budget through a dual-phase pre/post-execution evaluation mechanism and bidirectional pruning, yielding an average improvement of approximately 10% across 4 benchmarks.
ToolWeaver: Weaving Collaborative Semantics for Scalable Tool Use in Large Language Models: This paper proposes ToolWeaver, which represents each tool as a hierarchical discrete code sequence (rather than a single token) via collaboration-aware vector quantization, achieving logarithmic vocabulary scaling (47,000+ tools requiring only ~512 new tokens). It outperforms the ToolGen baseline on ToolBench while reducing language model perplexity degradation from 16.5× to 4×.
Toward a Dynamic Stackelberg Game-Theoretic Framework for Agentic AI Defense Against LLM Jailbreaking: This paper models the LLM jailbreak attack-defense interaction as a dynamic Stackelberg extensive-form game, explores the prompt space via Rapidly-exploring Random Trees (RRT), and proposes a "Purple Agent" defense architecture—embodying the "Think Red to Act Blue" philosophy—that anticipates attack trajectories through internal adversarial simulation and proactively neutralizes them.
Towards Scalable Oversight via Partitioned Human Supervision: This paper proposes a scalable oversight framework based on partitioned human supervision. When tasks exceed the competence of any single expert, domain experts provide complementary labels (i.e., excluding incorrect options) to construct an unbiased accuracy estimator, enabling evaluation and training of AI systems without requiring complete annotations.
VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning: This paper proposes VideoMind, a role-based video-language agent framework that achieves temporally grounded video reasoning through the collaboration of four roles—Planner, Grounder, Verifier, and Answerer. The core innovation is the Chain-of-LoRA mechanism, which enables seamless role switching on a unified backbone model by swapping role-specific LoRA adapters. A 2B-parameter VideoMind surpasses GPT-4o and Gemini-1.5-Pro on temporal grounding benchmarks.
VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Understanding: This paper proposes VideoMind, a video-language agent based on a Chain-of-LoRA mechanism. Four roles (Planner, Grounder, Verifier, and Answerer) collaborate on a unified LMM backbone to perform efficient temporally grounded video reasoning. The 2B model surpasses GPT-4o and Gemini-1.5-Pro.
Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning for Web Agents: Inspired by Bloom's educational taxonomy, this paper proposes the Web-CogKnowledge Framework, which decomposes Web Agent capabilities into a progressive three-tier knowledge hierarchy—Factual→Conceptual→Procedural—and trains Web-CogReasoner using a Knowledge-driven CoT (KCoT) reasoning framework. The resulting model achieves 84.4% on Web-CogBench, surpassing Claude Sonnet 4 (76.8%) and Gemini 2.5 Pro (80.4%).
Web-CogReasoner: Towards Knowledge-Induced Cognitive Reasoning in Web Agents: Web-CogReasoner draws on Bloom's Taxonomy in educational psychology to decompose Web Agent capabilities into a three-tier hierarchy of factual, conceptual, and procedural knowledge, constructing a structured knowledge-driven Chain-of-Thought reasoning framework that substantially outperforms existing methods on web navigation tasks.
WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents: This paper proposes WebArbiter, a reasoning-first, principle-guided Process Reward Model (WebPRM) that formulates reward modeling as a text generation task. Trained with a two-stage pipeline of reasoning distillation followed by reinforcement learning, the 7B model outperforms GPT-5 by 9.1 percentage points on WebPRMBench.
Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents: This paper introduces the concept of "misevolution" and systematically shows that self-evolving LLM agents, when autonomously improving along four pathways (model evolution, memory evolution, tool evolution, and workflow evolution), can exhibit emergent risks including safety alignment degradation, deployment-time reward hijacking, introduction and reuse of unsafe tools, and bypassing of safety checks. Even state-of-the-art models such as Gemini-2.5-Pro are not immune to these risks.
ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense: This paper introduces the first benchmark for evaluating LLM agents on discovering and patching novel zero-day vulnerabilities. By transplanting real CVEs into different codebases, the authors construct 22 novel high-severity vulnerability tasks and evaluate agent capability across 5 information-visibility levels. The strongest model achieves only a 14.4% pass rate at the zero-day level, indicating that autonomous vulnerability discovery remains a significant challenge.