🦾 LLM Agent

🧪 ICML2026 · 16 paper notes

📌 Same area in other venues: 💬 ACL2026 (45) · 📷 CVPR2026 (14) · 🔬 ICLR2026 (39) · 🤖 AAAI2026 (43) · 🧠 NeurIPS2025 (47) · 📹 ICCV2025 (4)

🔥 Top topics: Agents ×5 · LLM ×3

A Minimal Agent for Automated Theorem Proving

This paper introduces AxProverBase, a minimal Lean 4 theorem-proving agent built from only three components (compiler feedback, a self-managed notebook, and lightweight tool search). Running on off-the-shelf frontier LLMs (Claude Opus), it matches or surpasses specialized systems such as Hilbert and Seed-Prover while reducing costs by 100x.

Adaptive Querying with AI Persona Priors

The authors encode the distribution of LLM responses under persona conditioning as a finite-mixture Bayesian prior. After a user answers only a few questions, closed-form posterior updates over personas efficiently predict that user's remaining responses, outperforming classic CAT/IRT baselines.
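As a rough illustration of what a closed-form posterior update over a finite persona mixture looks like, here is a minimal sketch. It is not the authors' implementation: the function names, the array layout `likelihoods[k, q, a]`, and the toy numbers are all assumptions made for this example.

```python
import numpy as np

def posterior_over_personas(prior, likelihoods, answers):
    """Bayes update over K personas (hypothetical helper, not from the paper).

    prior:       shape (K,), mixture weights over personas
    likelihoods: shape (K, Q, A), P(answer a | persona k, question q)
    answers:     dict {question index: observed answer index}
    """
    log_post = np.log(np.asarray(prior, dtype=float))
    for q, a in answers.items():
        # Each observed answer multiplies in its per-persona likelihood.
        log_post = log_post + np.log(likelihoods[:, q, a])
    log_post = log_post - log_post.max()  # numerical stability
    post = np.exp(log_post)
    return post / post.sum()

def predict_answer(post, likelihoods, q):
    """Posterior-predictive distribution over answers to an unseen question q."""
    return post @ likelihoods[:, q, :]

# Toy setup (assumed): 2 personas, 2 questions, 2 answer options.
likelihoods = np.array([
    [[0.9, 0.1], [0.8, 0.2]],  # persona 0
    [[0.2, 0.8], [0.3, 0.7]],  # persona 1
])
prior = np.array([0.5, 0.5])

post = posterior_over_personas(prior, likelihoods, {0: 0})
pred = predict_answer(post, likelihoods, q=1)
```

In this toy case a single answer to question 0 shifts most of the posterior mass to persona 0, and the prediction for question 1 is the posterior-weighted mix of the two personas' answer distributions.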

Agent-Omit: Adaptive Context Omission for Efficient LLM Agents

Monte Carlo rollouts quantify which rounds of thought or observation can be omitted; an 8B agent is then trained with cold-start SFT and dual-sampling omit-aware GRPO so that it adaptively skips redundant reasoning and observations. On five benchmarks, token usage drops significantly while accuracy matches that of seven leading models.

AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction

The authors propose a new task, AWR, which aims to reconstruct an equivalent white-box workflow from a black-box agent system. They use MCTS to search the agent primitive sequence space, combined with a Red-Black pruning method based on dynamic score coloring to balance depth and breadth, achieving interpretable white-box reconstruction in five real-world domains.

BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics

BioAgent Bench provides an end-to-end evaluation suite for "running bioinformatics pipelines with LLM agents"—10 real bioinformatics tasks × 10 frontier/open-weight models × 3 agent harnesses, combined with LLM judge scoring and three types of perturbation tests (corrupted/decoy/prompt-bloat). The study finds that frontier models can complete over 90% of pipelines, but robustness remains a concern.

DiscoverLLM: From Executing Intents to Discovering Them

DiscoverLLM formalizes the scenario where "users themselves are unclear about what they want" as a progressive discovery process over a hierarchical intent tree. It uses a rewardable hierarchical user simulator to train models that actively diverge and explore when user intent is unclear, and converge to execution when intent is clear. On creative writing, technical writing, and SVG tasks, it outperforms baselines such as CollabLLM by 10% in satisfaction and reduces dialogue length by 40%.

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

EvolveR establishes a closed-loop lifecycle for LLM agents: "online interaction → offline self-distillation into a principle library → GRPO policy evolution." Instead of discarding past trajectories, the agent abstracts its own successes and failures into a retrievable set of "policy principles," then uses RL to learn how to leverage its own principles to solve new problems. On seven multi-hop QA benchmarks, it significantly outperforms RL agent baselines such as Search-R1.

ExCyTIn-Bench: Evaluating LLM Agents on Cyber Threat Investigation

This paper introduces ExCyTIn-Bench, the first benchmark for evaluating LLM agents on end-to-end "cyber threat investigation." From 57 security log tables from a real Azure tenant, it uses an alert-entity bipartite graph to automatically generate 7,542 SQL QA tasks with evidence chains, and provides a MySQL environment in which agents answer by querying logs and tracing multi-hop evidence. The current best model, Claude-Opus-4.5, achieves a reward of only 0.606.

Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration

To address the persistent issues of "modality dominance" and "spurious modality coupling" in centralized multimodal fusion, GCL reframes multimodal learning as a protocolized collaboration among four agents in two stages. In the first stage, Routing/Auditing agents determine, on a per-sample basis, which cross-modal communications are permitted based on marginal predictive gain; in the second stage, Public-Factor/Aggregation agents decouple shared semantics from private specialization before aggregation. This approach achieves SOTA on MOSI/MOSEI/MIntRec.

Internalizing Agency from Reflective Experience

This paper proposes the LEAFE framework, in which LLM agents reflect on failed trajectories to generate "failure→rollback→correction→success" experience data, and feedback-grounded recovery ability is then distilled via SFT. On long-horizon tasks such as CodeContests, WebShop, and ALFWorld, Pass@128 improves by up to 14%, significantly outperforming outcome-driven RL methods like GRPO.

Position: Agentic AI Orchestration Should Be Bayes-Consistent

This position paper argues that we should stop trying to make LLMs themselves "Bayesian" (a path it considers both theoretically and practically infeasible) and instead move Bayesian structure to the orchestration control layer of agentic AI: the controller maintains a belief over low-dimensional, task-level latent variables, updates it via Bayes’ rule on the "message observations" returned by agents and tools, and uses expected utility or value of information for routing, stopping, escalation, and budget allocation.
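To make the controller idea concrete, the sketch below keeps a belief over a single binary latent variable, updates it with Bayes' rule when a tool returns a message, and picks the action with the highest expected utility. Everything specific here is an assumption for illustration (the state names, the `stop`/`escalate` actions, the likelihoods, and the payoffs); the paper itself is a position piece and does not prescribe this code.

```python
def bayes_update(belief, likelihood):
    """Posterior over latent states after one message observation.

    belief:     dict state -> prior probability
    likelihood: dict state -> P(observed message | state)
    """
    unnorm = {s: belief[s] * likelihood[s] for s in belief}
    z = sum(unnorm.values())
    if z == 0:
        raise ValueError("observation has zero probability under the belief")
    return {s: p / z for s, p in unnorm.items()}

def best_action(belief, utility):
    """Return (argmax action, expected utilities) under the current belief.

    utility: dict action -> dict state -> payoff
    """
    expected = {
        a: sum(belief[s] * payoff[s] for s in belief)
        for a, payoff in utility.items()
    }
    return max(expected, key=expected.get), expected

# Assumed toy problem: is the task already solved?
belief = {"solved": 0.5, "unsolved": 0.5}
utility = {
    "stop":     {"solved": 1.0, "unsolved": -1.0},  # wrong stop is costly
    "escalate": {"solved": 0.6, "unsolved": 0.4},   # safer, but pays a cost
}

action_before, _ = best_action(belief, utility)

# A verifier tool reports "pass"; assumed P(pass | solved)=0.9, P(pass | unsolved)=0.2.
belief = bayes_update(belief, {"solved": 0.9, "unsolved": 0.2})
action_after, _ = best_action(belief, utility)
```

With these assumed numbers the controller escalates under the flat prior and switches to stopping once the verifier message raises its belief that the task is solved. The same expected-utility comparison could, in principle, drive routing or budget decisions.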

Position: Assistive Agents Need Accessibility Alignment

This is a position paper. Through a systematic review of 778 blind assistance task instances from 417 papers, the authors argue that "accessibility alignment" should be considered a primary alignment objective for agents, on par with helpful/harmless/honest, and propose a design pipeline covering four dimensions: goal, interaction, risk, and lifecycle.

PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

PragLocker employs a two-stage strategy of "code-symbol initialization + noise injection under black-box target model feedback" to encode the agent system prompt into obfuscated text that works only on the target LLM and fails on any other LLM. Thus, even if the prompt is stolen from the deployment side, attackers cannot reuse it on their own LLMs.

ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

ReSeek augments RL-trained search agents with a JUDGE action and uses BGE-reranker to compute an "ideal judgment" as a process reward, enabling the agent to softly "mask" irrelevant information and re-query after each retrieval. It also introduces FictionalHot, a contamination-resistant benchmark based on fictional entities. On Qwen2.5-7B, the average EM reaches 0.377, 3.1 points higher than ZeroSearch.

Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Video2GUI employs a four-stage pipeline—coarse metadata filtering → fine-grained video quality filtering → Gemini-3-Pro for task/action extraction → high-resolution three-frame precise spatial grounding—to distill 500 million YouTube video metadata entries into WildGUI (12.7M trajectories, 124.5M screenshots, 1500+ applications), boosting Qwen2.5-VL/Mimo-VL by 5–20% on multiple GUI grounding and agent benchmarks.

When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets (CAIA)

CAIA establishes the first "adversarial high-stakes" agent benchmark, evaluating 17 frontier large models on 178 time-anchored real-world cryptocurrency tasks. Key findings: without tools, all models achieve only 12–28% accuracy (near random guessing); with tools, even the strongest model, GPT-5, reaches only 67.4%, versus 80% for human junior analysts. More critically, 55.5% of model tool calls favor "unreliable web search" over authoritative on-chain data, and Pass@k metrics systematically mask this dangerous "trial-and-error luck" behavior.
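A small calculation shows why Pass@k can hide trial-and-error behavior. The sketch below uses the standard unbiased Pass@k estimator from the code-generation literature; whether CAIA computes Pass@k in exactly this way is an assumption, and the attempt counts are made up for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k attempts, drawn without
    replacement from n recorded attempts (c of them correct), is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical agent: correct on only 2 of 10 attempts at a task.
single_try = pass_at_k(10, 2, 1)   # 0.2
five_tries = pass_at_k(10, 2, 5)   # 1 - 56/252, about 0.78
```

An agent that is right on one attempt in five looks nearly 78% reliable at Pass@5, which is harmless for code with unit tests but not for settings where each wrong attempt carries a real cost.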