🦾 LLM Agent

🧪 ICML2026 · 59 paper notes

🔥 Top topics: LLM ×22 · Agents ×12 · Reasoning ×5 · Reinforcement Learning ×4

A Minimal Agent for Automated Theorem Proving: This paper proposes AxProverBase, a minimalist Lean 4 theorem-proving agent. Relying on only three components (compiler feedback, a self-managed notebook, and lightweight tool search), it matches or exceeds the performance of specialized systems like Hilbert/Seed-Prover using non-fine-tuned frontier LLMs (Claude Opus), while reducing costs by 100x.
A Systematic Study of Behavioral Cloning for Scientific Data Annotation: This paper establishes a controlled framework consisting of 9 procedurally synthetic annotation tasks and virtual annotators to systematically study whether "behavioral cloning" (allowing a VLM to directly mimic full human operation trajectories—clicking, navigating, and undoing within an annotation interface) can replace "direct label prediction." Through four dimensions—training dynamics, scaling laws, transfer capabilities, and linear probes—it reveals findings such as the hierarchical emergence of skills, the phenomenon where models make fewer mistakes than training data but still perform error correction, the necessity of multi-task pre-training for transferability, and task-shared internal representations of "errors."
ACON: Optimizing Context Compression for Long-horizon LLM Agents: Acon utilizes failure trajectory contrast to optimize natural language compression guidelines, simultaneously compressing agent history and observation contexts. It reduces peak tokens by 26% to 54% on AppWorld, OfficeBench, and multi-objective QA while maintaining or improving success rates in long-horizon tasks.
AdaMEM: Test-Time Adaptive Memory for Language Agents: AdaMEM decouples agent memory into two layers: "offline long-term trajectory memory" and "online synthesized short-term strategy memory." This allows agents to dynamically refresh guidance strategies based on current states during long-horizon tasks. Coupled with Step-MFT—a fine-tuning technique that preserves only strategies that "actually change actions"—it achieves relative gains of 13–17% over static memory baselines on ALFWorld, WebShop, and HotpotQA.
Agent-Omit: Adaptive Context Omission for Efficient LLM Agents: By quantifying which turn-level thoughts and observations are omittable via Monte-Carlo rollouts, an 8B agent is trained using cold-start SFT and dual-sampling omit-aware GRPO. This agent adaptively skips redundant thoughts and observations, significantly reducing token usage across five benchmarks while maintaining accuracy comparable to seven state-of-the-art frontier models.
Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling: This paper transforms the web Computer-Use Agent from a step-by-step screenshot-LLM call-execution loop into a system similar to a JIT compiler: compiling natural language tasks into verifiable, cacheable, and parallel-schedulable code plans. This allows JIT-Planner to be 10.4× faster than Browser-Use with 28pp higher accuracy, and JIT-Scheduler to be 2.4× faster than OpenAI CUA with 9pp higher accuracy.
Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents: The paper reframes "RL for black-box LLM Agents" as "sampling from the posterior of an optimal policy." By employing Sequential Monte Carlo (SMC) with a lightweight value function to guide frozen black-box models during test time, the authors achieve RL-style optimization without accessing any parameters. This approach outperforms prompting baselines on three AgentGym environments and surpasses GRPO (which requires full parameter access) by scaling test-time computation.
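
The SMC idea in the Agentic Monte Carlo entry can be illustrated with a minimal sketch. It assumes that each particle is a partial trajectory, that the frozen agent is only ever sampled from, and that resampling weights are proportional to a positive value estimate. The names `smc_step`, `toy_propose`, and `toy_value` are hypothetical and the weighting form is an assumption, not the paper's exact scheme.

```python
import math
import random

def smc_step(particles, propose, value, rng):
    """Extend every particle by one action, then resample by estimated value."""
    extended = [propose(p, rng) for p in particles]
    # Weight each partial trajectory by a positive value estimate (assumed form).
    weights = [max(value(p), 1e-12) for p in extended]
    # Multinomial resampling: promising trajectories are duplicated, weak ones die out.
    return rng.choices(extended, weights=weights, k=len(extended))

# Toy stand-ins (hypothetical): the "frozen agent" emits random bits,
# and the "value function" prefers trajectories containing more ones.
def toy_propose(trajectory, rng):
    return trajectory + [rng.randint(0, 1)]

def toy_value(trajectory):
    return math.exp(sum(trajectory))

rng = random.Random(0)
particles = [[] for _ in range(8)]
for _ in range(3):
    particles = smc_step(particles, toy_propose, toy_value, rng)
```

No model parameters are touched: only the number of particles and steps (test-time compute) changes how strongly the value estimate steers the samples.
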
AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction: The authors define the task of "inferring an equivalent white-box workflow from a black-box agent system" as AWR. They utilize MCTS to search within the sequence space of agent primitives, combined with dynamic Red-Black pruning based on scoring to balance search depth and width, achieving interpretable white-box reconstruction across five real-world domains.
Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems: This paper models the issue of "stating details without sufficient evidence" in agentic systems as a claim-level over-commitment problem. It proposes calibrated CSS: a calibrated selection for each atomic claim among precise expression, coarse-grained backoff, and omission. In LongFact full-scale experiments, it improves OAU from 0.8460 (without post-processing) to 0.9130 while retaining a specificity of 0.9381.
AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions: The authors propose the AutoRPA framework, which automatically distills interaction trajectories of ReAct-style GUI agents into reusable RPA functions via a Translator-Builder pipeline. By combining iterative optimization with a hybrid repair strategy, the method matches or exceeds the original agents' success rates while reducing token consumption by 82% to 96%.
Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning: This paper identifies an overlooked "retention-forgetting dilemma" in training-free Verbal Reinforcement Learning (which distills experience into rules in the context without updating parameters) within non-stationary environments. It proposes a transition from experience accumulation to governance via a "Rule/Evidence/Skill" three-layer architecture and a critic-proposer-curator loop. This approach flips the performance of accumulated experience from "dropping below the zero-shot baseline" to achieving "+5.3pp directional accuracy and doubling the Sharpe ratio."
CoDA-Bench: Can Code Agents Handle Data-Intensive Tasks?: CoDA-Bench is the first benchmark to jointly evaluate "code intelligence" and "data intelligence" within a data-intensive Linux sandbox. Agents are deployed into a Kaggle-based environment containing an average of 980 files, where they must autonomously discover correct data from semantically similar distractors before writing code to compute answers. Results show that even the most powerful Mini-SWE-Agent (GPT-5.5) achieves only 61.1% execution accuracy, exposing a severe lack of autonomous data discovery capabilities in current code agents.
CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement: The paper proposes CollabBench, a benchmark and training framework for LLM agents to collaborate with "diverse personality teammates" in cooperative games. It simulates diverse players driven by the Big Five personality traits, employs a unified agentic rollout with a dual-layer mixed "efficiency/affective" reward for reinforcement learning, and provides an evaluation protocol covering both efficiency and affective metrics. After training, Qwen2.5-7B achieves improvements of approximately 19.5% and 24.4% in efficiency and affective dimensions, respectively.
Constitutional Black-Box Monitoring for Scheming in LLM Agents: This paper proposes an end-to-end "Constitutional Black-Box Monitoring" framework. It utilizes two synthetic data pipelines (STRIDE and Gloom) to generate 2,000 synthetic trajectories for optimizing prompt classifiers. The framework detects scheming behaviors in LLM agents by observing only externally visible tool calls and outputs (without Chain-of-Thought). The study finds that simple prompt grid search saturates performance, while more aggressive optimization leads to overfitting.
EvoClaw: Evaluating AI Agents on Continuous Software Evolution: EvoClaw proposes a "milestone-level" software evolution evaluation paradigm. Utilizing the DeepCommit pipeline, it reconstructs noisy commit histories from open-source repositories into executable and verifiable milestone dependency Directed Acyclic Graphs (DAGs). This allows agents to complete a sequence of dependent development tasks on a single persistent codebase. The study reveals that while 12 frontier models can achieve scores \(>80\%\) on independent tasks, their performance drops to a maximum of \(38\%\) in continuous evolution scenarios, exposing fundamental deficiencies in long-term maintenance and the suppression of error propagation.
EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle: EvolveR provides LLM agents with a closed-loop lifecycle: "Online interaction \(\rightarrow\) Offline self-distillation into principle libraries \(\rightarrow\) GRPO policy evolution." Instead of discarding past trajectories, the agent abstracts successes and failures into a retrievable "principle library" and uses RL to learn how to utilize its own principles to solve new tasks. It significantly outperforms RL agent baselines like Search-R1 across 7 multi-hop QA benchmarks.
ExCyTIn-Bench: Evaluating LLM Agents on Cyber Threat Investigation: This paper constructs ExCyTIn-Bench, the first benchmark evaluating LLM Agents for end-to-end "cyber threat investigation." Using 57 security log tables from a real Azure tenant, it automatically generates 7,542 SQL Q&A pairs with evidence chains via alert-entity bipartite graphs. It provides a MySQL environment for Agents to answer by querying logs and performing multi-hop evidence tracking. Currently, the strongest model, Claude-Opus-4.5, achieves a reward of only 0.606.
From Player to Master: Enhancing Test-Time Learning of LLM Agents via Reinforcement Learning over Memory: MemoPilot is a plug-and-play "memory co-pilot"—it keeps the player LLM frozen but trains a separate memory model. It treats "how to update memory after each interaction" as an end-to-end optimizable multi-turn decision problem using multi-turn GRPO. With turn-level rewards and turn-level normalized advantage estimation, a frozen player truly becomes "stronger through play" in repetitive games. It achieves the highest Elo on both Rock-Paper-Scissors and Limit Hold'em testbeds, outperforming all memory baselines and proprietary models including DeepSeek-V3.2.
HawkesLLM: Semantic Uncertainty Propagation in Agentic Text Simulation: HawkesLLM grafts a multivariate Hawkes point process onto the LLM agent text simulation loop: Hawkes is responsible for scheduling "when and which node generates" as well as "which historical node outputs to use as compressed memory." LLMs are only responsible for verbalizing the selected memory into the next event. This approach achieves late-stage semantic alignment that increases over time under compact prompt budgets on GDELT Artemis II news cascades.
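
As a rough illustration of the scheduling role described for HawkesLLM, the sketch below evaluates a multivariate Hawkes conditional intensity with an exponential kernel. The kernel choice and the parameters `mu`, `alpha`, and `beta` are assumptions made for this example; the summary does not state which kernel the paper uses.

```python
import math

def intensity(node, t, events, mu, alpha, beta):
    """Conditional intensity of `node` at time t, given past (source, time) events."""
    lam = mu[node]  # baseline rate
    for source, t_k in events:
        if t_k < t:
            # Each past event excites `node`, with exponentially decaying influence.
            lam += alpha[node][source] * math.exp(-beta * (t - t_k))
    return lam

# Toy setup (hypothetical numbers): node 0 spoke at time 0.0.
mu = [0.1, 0.2]
alpha = [[0.5, 0.0], [0.3, 0.0]]
events = [(0, 0.0)]
rates = [intensity(i, 1.0, events, mu, alpha, beta=1.0) for i in range(2)]
# The node with the highest rate is the most likely next generator.
likeliest_node = max(range(2), key=lambda i: rates[i])
```

In a full simulator the next event would be sampled from these rates rather than taken by argmax; the argmax here only keeps the example deterministic.
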
Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models: This paper introduces Deep Data Research (DDR), an open-ended agentic task paradigm where LLMs are provided only with a structured database and a minimal toolset (SQL+Python), without specific questions or turn limits, requiring the model to autonomously explore, hypothesize, and decide when to stop. The authors construct DDR-Bench (MIMIC-IV / GLOBEM / 10-K, featuring 291 entities and 2058 checklist items) using verifiable fact checklists extracted from unstructured text to objectively evaluate "investigatory intelligence." Results show that even Claude 4.5 Sonnet achieves an average accuracy of only 47.7%.
Internalizing Agency from Reflective Experience: This paper proposes the LEAFE framework, which enables LLM agents to generate "failure \(\to\) rollback \(\to\) correction \(\to\) success" experience data through reflection on failed trajectories. It then distills feedback-grounded recovery capabilities via SFT. This approach improves Pass@128 by up to 14% on long-horizon tasks such as CodeContests, WebShop, and ALFWorld, significantly outperforming outcome-driven RL methods like GRPO.
It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents: TRAP is a "task-redirecting persuasion" benchmark for Web Agents. It decomposes prompt injection into 630 task-injection combinations across five modular dimensions of "Interface × Persuasion." Evaluated on six real-world website clones across six frontier models, it reveals that an average of 25% of tasks are hijacked (GPT-5 at 13%, while DeepSeek-R1 reaches 43%). Notably, button-based injections are over three times more effective than hyperlinks, and lightweight context clipping can increase the success rate nearly sixfold.
Learning Efficient Guardrails for Compliance: This paper constructs PolicyGuardBench, a 60k-scale dataset (5 domains, 733 standardized trajectories × 2195 atomic policies → 60,000 trajectory-policy pairs including cross-subdomain and prefix truncation settings). Based on Qwen3-4B-Instruct, the authors perform full-parameter SFT to create PolicyGuard-4B, a lightweight guardrail model. It achieves 90.14% accuracy and an 87.59% F1 score with a latency of 22.5 ms/sample, matching or exceeding 70B-class open-source models and Claude-Sonnet-4, while demonstrating strong cross-domain generalization (LODO OOD F1 \(\approx\) 0.91).
Lifting Traces to Logic: Programmatic Skill Induction with Neuro-Symbolic Learning for Long-Horizon Agentic Tasks: NSI "lifts" interaction traces of LLM agents into neuro-symbolic workflow graphs with explicit conditional branching and dynamic variable binding. This evolves skills from stateless scripts into state-aware logical programs, achieving success rates of 98.0 / 76.5 / 95.2 on ALFWorld / WebShop / TextCraft, significantly outperforming programmatic skill baselines like ASI and AWM.
LLM Agents Are the Antidote to Walled Gardens: This ICML 2026 position paper argues that LLM agents can "bypass" the closed API strategies of dominant platforms through automatic format conversion and human-like UI interaction, achieving "universal interoperability." This dissolves the "walled gardens" created by traditional network effects. However, the ML community must proactively establish agent-friendly interfaces, security mechanisms, and ecological infrastructure to manage the resulting security, legal, and new-layer lock-in risks.
MacArena: Benchmarking Computer Use Agents on an Online macOS Environment: MacArena unifies ported OSWorld tasks, macOSWorld tasks, and 49 brand-new macOS native tasks (totaling 421 tasks across 50 applications) into a real macOS environment running on Apple Silicon's native virtualization framework. Equipped with per-task handwritten executable evaluation scripts, it shows that current GUI agents generally perform worse on macOS than on Linux, and that model rankings reverse between "ported tasks" and "macOS native tasks." This suggests that high scores on existing benchmarks stem more from "having seen this task distribution" than from true cross-platform GUI capabilities.
MCP-Persona: Evaluating LLM Agent Capabilities in Real-World Personal Applications via Environment Simulation: MCP-Persona is the first LLM agent benchmark targeting real-world personalized MCP tools (12 servers including Slack/Rednote/Instagram/Lark, etc.). It proposes three methods—Tool-Traverse, Context-Tree, and Persona-Gen—to automatically synthesize Python simulator code using LLMs, avoiding real-account issues. Testing 10+ SOTA agents reveals that even Claude-Sonnet-4.5 achieves only 38.66% Acc, proving that personalized tool use is a severely underestimated capability gap.
Measuring Agents in Production: This is the first systematic empirical study investigating "how LLM agents in production are actually built and evaluated." Through 20 in-depth case studies and a survey of 306 practitioners (filtered to 86 deployed or pilot systems) across 26 domains, the authors find that production agents generally follow a "simple and controllable" route (\(68\%\) execute \(\le 10\) steps before human intervention, \(70\%\) directly prompt off-the-shelf models without weight fine-tuning, and \(74\%\) rely primarily on human evaluation). Reliability is identified as the number one challenge, and practitioners primarily address it through system-level design rather than algorithmic or model-layer innovation.
Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents: MRAgent replaces the static "retrieve-then-reason" memory paradigm with "reason-while-reconstructing." By utilizing a Cue–Tag–Content associative memory graph and an active reconstruction loop, the agent dynamically selects traversal directions and prunes irrelevant branches based on intermediate evidence. It achieves a maximum improvement of 23% over the strongest baseline on LoCoMo / LongMemEval, while significantly reducing token consumption and latency.
NaviAgent: Graph-Driven Bilevel Planning for Scalable Tool Orchestration: NaviAgent decomposes LLM tool calling into a two-level process: "high-level 4-choice decision + low-level path search on a graph." A Tool World Navigation Model (TWNM) trained with HGT explicitly models structural and behavioral dependencies between tools. On ToolBench, API-Bank, and 50 real-world RapidAPIs, it improves the Task Success Rate (TSR) by 4.3–18.2 points over the strongest baselines while significantly reducing the number of invocation steps.
On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM Agents: In multi-turn active reasoning for LLM agents, "Action Selection (AS)" and "Belief Tracking (BT)" can hinder each other, causing outcome-only RL to fall into Low Information Self-Locking (SeL). This paper provides a coupled gradient analysis and a formal definition of the "self-locking region" from a POMDP perspective. It proposes AReW, which uses directional critiques (obtainable from environments or readout layers) to additively reweight stepwise advantages, yielding up to a 60-point performance gain across 9 active reasoning tasks.
OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents: OTora proposes a novel attack paradigm called Reasoning-Level Denial-of-Service (R-DoS): without compromising task correctness, it employs a two-stage red teaming pipeline (first inducing the agent to actively access attacker-controlled external resources via insertion-aware optimization, then deploying "reasoning payloads" optimized through ICL genetic search within those resources) to force LLM agents into sustained multi-round over-reasoning. It achieves 10\(\times\) reasoning token inflation and order-of-magnitude latency attacks on WebShop, Email, and OS agents, while maintaining nearly unchanged task accuracy.
Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History: This paper proposes Persona2Web, the first open-web benchmark for personalized web agents. It utilizes "implicit user history + three levels of ambiguous queries + reasoning-aware scoring" to compel agents to infer user preferences from browsing records to disambiguate queries. Evaluations of five mainstream models, including GPT-4.1 and o3, reveal that the success rate for Level 2 queries is only 13% even when history is provided, highlighting a significant lack of true personalization in current web agents.
Position: Agentic AI Orchestration Should Be Bayes-Consistent: This position paper argues against trying to make LLMs themselves "Bayesian" (a path fraught with engineering and theoretical roadblocks). Instead, it proposes moving the Bayesian structure to the orchestration control layer of agentic AI. Here, the controller maintains beliefs over low-dimensional task-level latent variables, updates these beliefs following Bayes' rule based on "message observations" from agents/tools, and employs expected utility or Value of Information (VOI) for routing, stopping, escalation, and budget allocation.
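
The controller loop in the Bayes-consistent orchestration entry can be made concrete with a small sketch: a discrete belief over one task-level latent variable, a Bayes-rule update on an agent message, and an expected-utility choice between continuing and escalating. The latent states, likelihood table, and utilities are invented for illustration, and Value of Information is left out for brevity.

```python
def bayes_update(prior, likelihood, observation):
    """Posterior over latent states after one message observation (Bayes' rule)."""
    unnormalized = {z: prior[z] * likelihood[z][observation] for z in prior}
    total = sum(unnormalized.values())
    return {z: w / total for z, w in unnormalized.items()}

def best_action(belief, utility):
    """Pick the action with the highest expected utility under the belief."""
    def expected(action):
        return sum(belief[z] * utility[action][z] for z in belief)
    return max(utility, key=expected)

# Hypothetical task-level latent: is the task easy or hard for the current agent?
prior = {"easy": 0.5, "hard": 0.5}
likelihood = {"easy": {"ok": 0.9, "fail": 0.1},
              "hard": {"ok": 0.4, "fail": 0.6}}
utility = {"continue": {"easy": 1.0, "hard": -0.5},
           "escalate": {"easy": 0.2, "hard": 0.2}}

posterior = bayes_update(prior, likelihood, "fail")
action = best_action(posterior, utility)
```

Under the prior the controller prefers `continue`; after a single `fail` message the belief shifts toward `hard` and the expected-utility rule switches to `escalate`.
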
Position: Assistive Agents Need Accessibility Alignment: This position paper presents a systematic review of 778 blind assistive task instances across 417 publications. It argues that "accessibility alignment" should be a first-class alignment objective for Agents, alongside helpfulness, harmlessness, and honesty. The authors propose a design pipeline covering four dimensions: goal, interaction, risk, and lifecycle.
Position: Modular Memory is the Key to Continual Learning Agents: This is a position paper from the Dagstuhl Continual Learning workshop arguing that relying solely on In-Weight Learning (IWL) leads to catastrophic forgetting, while relying solely on In-Context Learning (ICL) causes computational explosion and rigid foundations. The missing piece for "continual learning agents" is a modular memory that integrates the fast adaptation of ICL with the slow consolidation of IWL (Core Model + Working Memory + Long-term Memory, plus sleep-like offline consolidation).
Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach: The authors propose Iterative RMFT, which ranks decision trajectories rolled out by the LLM itself based on regret from low to high. The top \(k\) optimal trajectories are selected for iterative SFT. This approach allows LLMs to automatically emerge with no-regret behavior and a reasonable exploration-exploitation balance across three types of verbalized decision tasks—Multi-Armed Bandits (MAB), online learning, and non-stationary bandits—without relying on known optimal algorithms (e.g., UCB/FTRL) or manually designed CoT templates.
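
The selection step of Iterative RMFT, ranking self-generated trajectories by regret and keeping the top \(k\), can be sketched for a simulated multi-armed bandit. The sketch assumes the simulator knows the true arm means so that pseudo-regret can be computed; the function names are hypothetical and the fine-tuning step itself is omitted.

```python
def regret(actions, means):
    """Pseudo-regret of a sequence of arm pulls against the best arm."""
    best = max(means)
    return sum(best - means[a] for a in actions)

def select_for_sft(trajectories, means, k):
    """Rank trajectories from low to high regret and keep the k best."""
    return sorted(trajectories, key=lambda t: regret(t, means))[:k]

# Toy two-armed bandit (assumed means); each trajectory is a list of pulled arms.
means = [0.2, 0.8]
trajectories = [[0, 0, 0, 0], [0, 1, 1, 1], [1, 1, 1, 1], [0, 0, 1, 1]]
selected = select_for_sft(trajectories, means, k=2)
```

Here the two kept trajectories are the one that always pulls the better arm and the one that explores once before committing, which is the kind of behavior the next SFT round would reinforce.
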
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts: PragLocker employs a two-stage strategy consisting of "code-symbol initialization + noise injection under target model feedback" to encode an agent system prompt into obfuscated text. This text functions solely on the target LLM and fails when migrated to any other LLM, ensuring that attackers cannot reuse stolen prompts.
Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks: The authors formalize neural networks (specifically LLMs) as composite agents synthesized through the log-weighted pooling of multiple implicit sub-agents (each defined as a probability distribution over outcomes). Under the framework of epistemic utility \(W_i(o)=\log P_i(o)\), it is proven that "strict unanimity" is impossible under linear pooling or binary outcomes, but feasible when \(|\mathcal O|\ge 3\). Consequently, an alignment principle is derived: "explicitly manifesting Waluigi before suppression" is strictly superior to "only reinforcing Luigi."
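
For readers unfamiliar with the pooling operation mentioned above, here is a minimal sketch of logarithmic opinion pooling, \(P(o) \propto \prod_i P_i(o)^{w_i}\), which is assumed to be what "log-weighted pooling" refers to. The sub-agent distributions and weights are made up for illustration.

```python
import math

def log_pool(dists, weights):
    """Weighted geometric mean of distributions, renormalized over outcomes."""
    n = len(dists[0])
    logits = [sum(w * math.log(p[o]) for p, w in zip(dists, weights))
              for o in range(n)]
    m = max(logits)  # subtract the max for numerical stability
    unnormalized = [math.exp(l - m) for l in logits]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

# Two hypothetical sub-agents over three outcomes, with mirrored preferences.
sub_agents = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
pooled = log_pool(sub_agents, [0.5, 0.5])
```

With three outcomes and equal weights, the composite puts equal mass on the two outcomes the sub-agents disagree about, and pooling identical distributions returns the same distribution unchanged.
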
Process Reward Agents for Steering Knowledge-Intensive Reasoning: This paper recasts the Process Reward Model (PRM) from a "post-hoc scoring" component into an online agent that decides in real time whether to retrieve evidence and provides rewards at each reasoning step. By using beam search to prune candidate trajectories from a frozen policy, Qwen3-4B achieves a 4B-scale SOTA of 81.9% on MedQA and demonstrates direct transferability to various unseen backbones from 0.5B to 8B (yielding up to a 25.7% gain).
Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents: Addressing the issue where GUI agents frequently fail to recover from "self-inflicted errors" in real-world deployments, the authors construct GUI-RobustEval (1,216 executable tests covering 11 types of policy-induced errors across 4 error depths) for fine-grained evaluation. Simultaneously, they propose RoTS—an online data synthesis framework based on trajectory trees. It uses Fragility-based UCB to actively expose new errors in success subtrees and leverages neighbor experience for long-horizon recovery rollbacks in failure subtrees. This synthesizes 800k reflection samples, enabling RoTS-32B to achieve an open-source SOTA of 47.4% SR / 33.8% All-Pass@4 on OSWorld.
ReflexGrad: Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing: ReflexGrad integrates TextGrad-style "local gradient refinement every 3 steps" as a fast process and Reflexion-style "causal re-planning triggered by consecutive low scores" as a slow process. Using a progress-gated routing rule to switch between them zero-shot within a single episode, it improves Qwen-3-8B's success rate on 134 ALFWorld tasks from 35.1% to 75.4% (+40.3pp), surpassing 1-shot LATS / ToT / Self-Refine under equivalent compute budgets.
Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use: RHB constructs a suite of realistic tool-based multi-step tasks (independent and chained modes across four families: data pipeline, log forensics, performance optimization, and multi-file reconstruction) to quantify reward hacking in LLM agents. Across 13 frontier models, the study finds that RL post-training significantly increases exploit rates (DeepSeek-V3 0.6% vs. R1-Zero 13.9%), hacking rates rise with chain length, and exploits "relapse" on harder variants even for near-zero models, while lightweight environment hardening reduces exploit rates by 87.7% without compromising task success.
Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation: The authors construct Rule2DRC, a large-scale EDA benchmark containing 1,000 natural language design rules and 13,921 evaluation layouts. Performance is measured via execution-level scoring using the KLayout engine rather than code similarity. They propose SplitTester: a method that clusters \(N\) candidate DRC scripts based on execution consistency, iteratively generates new layouts to split the most "dangerous" cluster (defined by the product of score and cluster size), and finally utilizes a judge LLM to select the optimal script based on discriminative tests.
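
The clustering and splitting loop described for SplitTester can be sketched as follows. Candidate scripts are modeled as plain Python predicates over a toy "layout" (a single integer width), and the cluster score is a caller-supplied function. Both are stand-ins for illustration; the real system executes DRC scripts in KLayout.

```python
from collections import defaultdict

def cluster_by_behavior(scripts, layouts):
    """Group scripts whose outputs agree on every layout seen so far."""
    clusters = defaultdict(list)
    for name, check in scripts.items():
        signature = tuple(check(layout) for layout in layouts)
        clusters[signature].append(name)
    return list(clusters.values())

def most_dangerous(clusters, score):
    """Pick the cluster maximizing score * size, the next target to split."""
    return max(clusters, key=lambda c: score(c) * len(c))

# Hypothetical candidates for a rule like "flag widths below 3".
scripts = {
    "a": lambda w: w < 3,
    "b": lambda w: w < 3,
    "c": lambda w: w < 5,
    "d": lambda w: w <= 3,  # off-by-one variant, indistinguishable at first
}
before = cluster_by_behavior(scripts, [1, 4, 6])
target = most_dangerous(before, score=lambda c: 1.0)
after = cluster_by_behavior(scripts, [1, 4, 6, 3])  # new layout splits the target
```

Adding the width-3 layout separates the off-by-one script `d` from `a` and `b`, which is the kind of discriminative test a judge would then use to choose among the remaining clusters.
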
SafeHarbor: Defining Precise Decision Boundaries via Hierarchical Memory-Augmented Guardrail for LLM Agent Safety: SafeHarbor upgrades LLM Agent safety defense from "static coarse-grained classifiers" to a "dynamic hierarchical memory tree + dual-score gating." Through adversarial rule generation and entropy-based self-evolution, it enables GPT-4o to maintain a 93%+ refusal rate while increasing the success rate of benign tool calls to 63.6%, significantly alleviating the over-refusal problem.
Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation: The Xiaomi team presents a systematic study on the "Data-Evaluation-Reasoning" trinity for VLM mobile GUI agents. They released the HyperTrack dataset (16k tasks / 674 Chinese apps) and the GUIEvalKit tool (supporting 30+ models). The study demonstrates that DAPO-style RL significantly outperforms SFT in OOD scenarios and utilizes semi-online evaluation (SOEval) to reveal a core trade-off: "explicit reasoning sacrifices PASS@1 stability but enhances PASS@n diversity."
Scaling Small Agents Through Strategy Auctions: The paper proposes SALE (Strategy Auctions for Workload Efficiency), which lets Qwen3 agents of varying sizes submit short strategy plans as bids for each task. Executors are selected based on a cost-minus-value metric, while historical auction memory allows lower-cost agents to continuously refine their bids. In deep search and coding tasks, this approach exceeds the pass@1 of the largest model while reducing dependence on the largest agent by 52% and total costs by 35%.
SE-GA: Memory-Augmented Self-Evolution for GUI Agents: SE-GA equips VLM-based GUI agents with a triple-tier memory (TTME: episodic + semantic + experiential) and a two-stage memory-augmented self-evolution training pipeline (MASE: SFT → improved GRPO). This approach pushes Qwen2.5-VL-7B to 89.0 on ScreenSpot, 75.8 on AndroidControl-High, and 39.0 on AndroidWorld, comprehensively outperforming same-scale baselines and matching the performance of 72B models.
Self-evolving LLM agents with in-distribution Optimization: Q-Evolve enables LLM agents to learn an "in-distribution critic" on a fixed hybrid offline dataset. It automatically assigns process rewards to each step using advantage estimation and updates the policy via behavior-proximal policy optimization. The entire process remains within the data distribution, achieving stable self-evolution on ALFWorld, WebShop, and ScienceWorld with significantly fewer environment interactions.
Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents: Skill-Pro explicitly extracts interactive experiences of LLM agents into an "activation + execution + termination" skill triplet. It uses semantic gradients to generate candidate skills and verifies them with a PPO-style trust region (PPO Gate) before inclusion. Ultimately, it achieves a reuse rate above 0.85 and significant performance gains in ALFWorld/Mastermind with a minimal memory library of ~800 tokens.
Talk, Judge, Cooperate: Gossip-Driven Indirect Reciprocity in Self-Interested LLM Agents: This paper proposes ALIGN, which enables a group of fully self-interested, decentralized LLM agents to evaluate each other via public "gossip" messages with five levels of sentiment. This allows them to form reputations and punish defection, thereby stably establishing indirect reciprocity in donation games, investment games, and e-commerce markets without central oversight. The study finds that reasoning-based LLMs are more capable than chat-based LLMs at following game-theoretic incentives—cooperating only when it is strategically optimal to do so.
Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction: This paper proposes Thought-Aligner—a lightweight (1.5B/7B) plug-and-play safety model that performs causal debiasing of intermediate thoughts within the LLM agent's think-act-observe loop. By intervening before actions are executed, it improves the behavioral safety rate of six mainstream LLMs from approximately 50% to approximately 90% on ToolEmu/Agent-SafetyBench, while simultaneously increasing helpfulness by about 5%.
Towards a Science of AI Agent Reliability: Drawing on established practices from safety-critical engineering (aviation, nuclear power, and automotive), this paper decomposes AI agent "reliability" into 12 accuracy-independent metrics across four dimensions: consistency, robustness, predictability, and safety. Systematic evaluation of 15 frontier models on GAIA and \(\tau\)-bench reveals an industry-wide trend: while accuracy has skyrocketed over the past 24 months, reliability remains largely stagnant.
Towards Diverse Scientific Hypothesis Search with Large Language Models: The study reframes "scientific hypothesis search with LLMs" as a sampling problem aimed at efficiently producing a diverse and high-quality set of hypotheses under a fixed verification budget. By borrowing Parallel Tempering (PT) from physics, the authors developed EvoDiverse, a dual-temperature population framework where a high-temperature pool explores and a low-temperature pool refines. Samples are exchanged via Metropolis-Hastings rules, simultaneously improving quality and diversity across molecular, equation, and algorithm discovery tasks.
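
The exchange step between the two pools follows the standard Parallel Tempering swap rule, sketched below. Treating energy as the negative of a hypothesis quality score is an assumption for this example, as are the temperatures and the function names.

```python
import math
import random

def swap_probability(e_cold, e_hot, t_cold, t_hot):
    """Metropolis-Hastings acceptance probability for exchanging two replicas."""
    delta = (1.0 / t_cold - 1.0 / t_hot) * (e_cold - e_hot)
    return 1.0 if delta >= 0 else math.exp(delta)

def maybe_swap(cold, hot, t_cold, t_hot, energy, rng):
    """Return the (cold, hot) pair after one attempted exchange."""
    p = swap_probability(energy(cold), energy(hot), t_cold, t_hot)
    return (hot, cold) if rng.random() < p else (cold, hot)

# Toy energies (lower is better); hypothetical hypothesis labels.
energy = {"weak_hypothesis": 2.0, "strong_hypothesis": 1.0}.get
rng = random.Random(0)
cold, hot = maybe_swap("weak_hypothesis", "strong_hypothesis", 0.5, 2.0, energy, rng)
```

When the hot pool holds the lower-energy sample, the swap is always accepted, so good discoveries from exploration flow into the refining pool; the reverse exchange is accepted only with probability \(e^{\Delta}\), where \(\Delta < 0\).
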
Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation: For self-evolving LLM agents generating CUDA kernels, this paper proposes CUDAnalyst. By "freezing intermediate program states of a specific generation + selectively injecting/masking feedback," it performs generation-level intervention. Using Banzhaf values from coalitional game theory to deconstruct the marginal contributions and high-order interactions of debugger, analyzer, and profiler feedback, it derives four conclusions—such as "explicit plans are only useful when feedback is aligned" and "plans from strong models can be transferred to weak models of the same family." Based on these, the CuGEdit plugin was designed, outperforming torch.compile by 2.08×–10.32×.
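
The Banzhaf attribution used to separate the contributions of the three feedback sources can be sketched by brute-force enumeration over coalitions. The coalition value function below is a made-up table (additive effects plus one analyzer-profiler interaction), not data from the paper.

```python
from itertools import combinations

def banzhaf(players, value):
    """Banzhaf value: average marginal contribution over all coalitions of the others."""
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                s = frozenset(subset)
                total += value(s | {p}) - value(s)
        result[p] = total / 2 ** len(others)
    return result

def toy_value(coalition):
    """Hypothetical gain from enabling a set of feedback sources."""
    base = {"debugger": 1.0, "analyzer": 2.0, "profiler": 3.0}
    v = sum(base[c] for c in coalition)
    if {"analyzer", "profiler"} <= coalition:
        v += 1.0  # assumed pairwise interaction
    return v

scores = banzhaf(["debugger", "analyzer", "profiler"], toy_value)
```

The interaction bonus is split evenly between the two sources that create it, while the debugger's score stays at its standalone contribution. Enumeration is exponential in the number of players, which is acceptable for three feedback sources.
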
Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization: ParetoPO explicitly formulates the alignment of tool-integrated agents as a multi-objective RL problem (accuracy vs. tool-use efficiency). It employs a two-stage training process—global exploration via hypervolume-guided dynamic scalarization followed by local refinement via Pareto dominance ranking for advantage calculation—achieving higher accuracy with fewer tool calls in mathematical reasoning and multi-hop QA.
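
The local refinement stage can be illustrated with a small Pareto dominance ranking over rollouts scored by (accuracy, tool calls). Turning ranks into advantages by centering within the group is an assumption made for this sketch; the summary does not give the paper's exact formula.

```python
def dominates(a, b):
    """True if rollout a is at least as accurate and as cheap as b, and strictly better in one."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

def pareto_ranks(points):
    """Rank of each rollout = number of rollouts in the group that dominate it."""
    return [sum(dominates(q, p) for q in points) for p in points]

def rank_advantages(points):
    """Group-centered advantage: fewer dominators means a higher advantage (assumed form)."""
    ranks = pareto_ranks(points)
    mean_rank = sum(ranks) / len(ranks)
    return [mean_rank - r for r in ranks]

# Hypothetical group of rollouts: (accuracy, number of tool calls).
rollouts = [(1.0, 2), (1.0, 5), (0.0, 1), (0.0, 4)]
ranks = pareto_ranks(rollouts)
advantages = rank_advantages(rollouts)
```

Both the accurate two-call rollout and the wrong but cheapest rollout sit on the Pareto front (rank 0), so ranking alone does not prefer accuracy; the summary assigns that role to the earlier scalarization stage.
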
Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining: Video2GUI utilizes a four-stage pipeline—"Metadata Coarse Filtering → Video Quality Fine Filtering → Gemini-3-Pro Task/Action Extraction → High-Resolution Three-Frame Precise Spatial Grounding"—to refine 500 million YouTube video metadata entries into WildGUI (12.7M trajectories, 124.5M screenshots, 1500+ applications). This dataset improves Qwen2.5-VL/Mimo-VL performance by 5–20% across multiple GUI grounding and agent benchmarks.
Weasel: Achieving Out-of-Distribution Generalization for Web Agents via Importance-Diversity Data Selection: Weasel selects trajectory steps by combining goal relevance and diversity, reducing training data to 20% of the original. This yields a 9.7× to 12.5× training speedup and significantly improves Web Agent generalization on unseen domains.
Web Agents Should Use Typed Actions Instead of Click-Based Browsing: This position paper argues that building a reliable "agentic web" requires more than just scaling models; websites must expose common web operations as typed actions with signatures—specifically designed as web verbs. These verbs consist of structured functions with defined inputs/outputs and documented behaviors, regardless of whether the implementation is a server-side API or a client-side browser workflow. By synthesizing tasks into short, auditable programs with explicit control/data flow on this layer, agents become significantly more reliable, efficient, and verifiable than those relying on low-level "click + keyboard + DOM" primitives.