🦾 LLM Agent

🧪 ICML2025 · 11 paper notes

📌 Same area in other venues: 📷 CVPR2026 (42) · 🔬 ICLR2026 (162) · 💬 ACL2026 (82) · 🧪 ICML2026 (59) · 🤖 AAAI2026 (33) · 🧠 NeurIPS2025 (39)

🔥 Top topics: LLM ×4 · Reasoning ×2 · Agents ×2

AdvAgent: Controllable Blackbox Red-teaming on Web Agents: This paper proposes AdvAgent, a reinforcement learning (DPO)-based blackbox red-teaming framework. It trains an adversarial prompter model to automatically generate invisible HTML adversarial prompts. When injected into web pages, these prompts mislead GPT-4V-driven Web Agents into executing attacker-specified target actions (e.g., buying NVIDIA stock when the user asked to buy Microsoft stock). AdvAgent achieves a 97.5% attack success rate across 440 tasks and maintains over 88.8% effectiveness against existing defense methods.
AGACCI: Affiliated Grading Agents for Criteria-Centric Interface in Educational Coding Contexts: AGACCI proposes a multi-agent evaluation framework consisting of 9 specialized agents. It decomposes the evaluation task of educational programming assignments into roles such as rubric parsing, code execution validation, visual evaluation, and explanatory reasoning assessment. Through collaboration, it achieves more accurate, consistent, and interpretable rubric-aligned feedback than single-model baselines.
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction: This paper proposes Aguvis, the first purely vision-based, cross-platform autonomous GUI Agent framework. By unifying the visual observation space, standardizing the action space, and using an inner monologue mechanism, Aguvis achieves SOTA results on offline and online benchmarks without relying on closed-source models.
Evaluating Retrieval-Augmented Generation Agents for Autonomous Scientific Discovery in Astrophysics: This paper constructs CosmoPaperQA (105 expert QA pairs), a RAG evaluation benchmark in the cosmology domain, to systematically evaluate nine RAG agent configurations (covering commercial APIs, hybrid architectures, and academic tools). It finds that the OpenAI RAG solution leads with a 91.4% accuracy rate and calibrates an LLM-as-a-Judge system that can substitute for manual human review.
From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?: This paper introduces AR-Bench, a benchmark specifically designed to evaluate the active reasoning capabilities of LLMs. It features three task families: detective cases, situation puzzles, and guessing numbers. Experiments reveal that state-of-the-art models such as GPT-4o perform far worse than humans in scenarios where they must actively ask questions to retrieve missing information, exposing a massive gap between passive and active reasoning.
GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning: GuardAgent is the first "Agent-safeguarding-Agent" framework that dynamically converts safety rules into executable guardrail code to verify if the actions of a target Agent violate safety policies. It achieves guardrail accuracies of over 98% and 83% on new benchmarks for medical access control and web safety control, respectively.
Improving LLM Agent Planning with In-Context Learning via Atomic Fact Augmentation and Lookahead Search: Proposes LWM-Planner, which extracts "atomic facts" from interaction trajectories to enhance LLM world model simulation and combines this with recursive lookahead search to improve agent planning purely in-context. It significantly outperforms ReAct and Reflexion on tasks like ALFWorld.
KBQA-o1: Agentic Knowledge Base Question Answering with Monte Carlo Tree Search: This paper proposes KBQA-o1, which combines a ReAct Agent with Monte Carlo Tree Search (MCTS) to perform knowledge base question answering through heuristic search driven by policy and reward models. Under low-resource settings, it improves the GrailQA F1 from 48.5% (GPT-3.5-turbo SOTA) to 78.5% using Llama-3.1-8B.
Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery: This paper proposes cmbagent, a multi-agent system composed of approximately 30 LLM agents. It adopts a Planning & Control strategy to orchestrate fully autonomous scientific research workflows. Individual agents are responsible for specialized tasks such as literature retrieval, code generation, result interpretation, and output review, with the capability of executing code locally. The system successfully completes a PhD-level cosmology task (measuring cosmological parameters using supernova data) and outperforms state-of-the-art LLMs on two benchmark datasets.
Towards LLM Agents for Earth Observation: This paper proposes UnivEARTH, an Earth Observation benchmark featuring 140 yes/no questions covering 13 topics and 17 satellite sensors. Evaluation reveals that the best LLM Agent (generating code to use Google Earth Engine) achieves an accuracy of only 33%, primarily because 58% of the generated code fails to execute.
xChemAgents: Agentic AI for Explainable Quantum Chemistry: xChemAgents proposes a Selector-Validator dual-agent collaborative framework that injects physics-aware reasoning into multimodal molecular property prediction. The Selector Agent adaptively selects a sparsely weighted subset of descriptors with natural language explanations, while the Validator Agent iteratively verifies them through dimensional consistency and scaling law checks. The framework achieves up to a 22% reduction in MAE on the QM9 benchmark.