🦾 LLM Agent

🔬 ICLR2026 · 162 paper notes

📌 Same area in other venues: 📷 CVPR2026 (39) · 💬 ACL2026 (82) · 🧪 ICML2026 (59) · 🤖 AAAI2026 (33) · 🧠 NeurIPS2025 (39) · 📹 ICCV2025 (4)

🔥 Top topics: LLM ×41 · Agents ×38 · Reasoning ×14 · Reinforcement Learning ×9 · Multimodal/VLM ×7

A\(^2\)FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning

A\(^2\)FM integrates three execution modes (instant, reasoning, and agentic) into a single backbone. It first learns the optimal routing path and then aligns mode-specific trajectories. By employing a cost-regularized reinforcement learning method (APO), the model avoids unnecessary computation for simple queries while maintaining accuracy for complex ones, reducing the cost per correct answer by approximately 45% at the 32B scale.

A Benchmark for Deep Information Synthesis (DeepSynth)

The DeepSynth benchmark is proposed, containing 120 real-world information synthesis tasks across 7 domains and 67 countries (averaging 5.5 hours of human annotation per task). It requires agents to collect information from multiple webpages and perform structured reasoning. Currently, the strongest agent (o3-deep-research) only achieves 8.97 F1 / 17.5% LLM-Judge, revealing severe deficiencies in LLM agents regarding deep information synthesis.

A Framework for Studying AI Agent Behavior: Evidence from Consumer Choice Experiments

The authors propose ABXLAB, a real-time "man-in-the-middle" framework that intercepts and rewrites webpage content to transform any shopping site into a controlled behavioral experiment. By systematically measuring choice biases in 17 mainstream LLM agents under cues like price, rating, display order, and psychological nudges, they find that agents are more manipulable than humans, with bias magnitudes reaching 3–10 times those of human baselines.

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

The authors propose ADP, a lightweight "agent data interlingua" that unifies 13 heterogeneous agent training sets into a consistent Trajectory/Action/Observation pattern. These are then distributed to various agent frameworks for SFT, achieving an average gain of ~20% over base models and reaching or approaching SOTA on coding, browsing, and tool-use tasks.

AgentFold: Long-Horizon Web Agents with Proactive Context Folding

AgentFold treats the context of a web agent as a proactively carved "cognitive workspace." During reasoning, the agent outputs an additional "fold command" at each step to perform fine-grained condensation or multi-step deep consolidation of the historical trajectory. This keeps the context at approximately 7k tokens even after 100 interaction rounds. A 30B model (with 3B activated) achieves 36.2% on BrowseComp, surpassing the 671B DeepSeek-V3.1 and OpenAI o4-mini.

AgentGym-RL: An Open-Source Framework to Train LLM Agents for Long-Horizon Decision Making via Multi-Turn RL

This paper introduces AgentGym-RL, an open-source decoupled multi-turn reinforcement learning framework capable of training LLM agents from scratch across five real-world scenarios: Web Navigation, Deep Search, Digital Games, Embodied Control, and Science Tasks. It proposes ScalingInter-RL—a phased training method that progressively increases the number of interaction turns from short-horizon to long-horizon—enabling a 7B model to match or outperform OpenAI o3 and Gemini-2.5-Pro across 27 tasks.

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

The ACE (Agentic Context Engineering) framework proposes treating context as a continuously evolving "playbook." By utilizing a division of labor among Generator-Reflector-Curator roles and incremental delta updates, it continuously accumulates and refines strategies. This addresses brevity bias and context collapse in existing prompt optimization methods, achieving an average improvement of 10.6% on agent tasks and 8.6% on financial tasks, while reducing adaptation latency by 86.9%.

AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?

AgenTracer employs "counterfactual replay + programmatic fault injection" to automatically annotate multi-agent failure trajectories, constructing the TracerTraj-2.5K dataset. It then trains a lightweight 8B "failure tracer" using multi-granularity reinforcement learning. On the Who&When benchmark, it localizes decisive errors to specific agents and steps, outperforming much larger models like Gemini-1.5-Pro and Claude-3.5-Sonnet by up to 18.18% in agent-level accuracy. Furthermore, feeding its diagnoses back to off-the-shelf systems like MetaGPT and MaAS yields performance gains of 4.8% to 14.2%.

AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents

The AgentSynth pipeline leverages the principle of information asymmetry—where forward step-by-step generation of simple tasks is easy, while backward holistic solving is difficult—to chain simple subtasks into complex long-horizon computer-use tasks. It automatically generates 6000+ diverse tasks and trajectories at only $0.60 per trajectory, while SOTA agents achieve only a 4% success rate at the highest difficulty level.

AlphaAgentEvo: Evolution-Oriented Alpha Mining via Self-Evolving Agentic Reinforcement Learning

The quantitative "factor mining" process is redefined from a fragile "search-backtest-restart" cycle into a continuous evolution trajectory. By using a 4B LLM agent guided by hierarchical rewards in multi-turn tool calls, the system learns long-term planning and reflection, ultimately outperforming factor evolution methods driven by GPT-5-mini / DeepSeek-R1 with only 4B parameters.

An Information Theoretic Perspective on Agentic System Design

This paper abstracts the common paradigm in agentic systems—where a smaller model compresses context and a larger model performs reasoning on the compressed output—as a noisy channel. It utilizes a mutual information estimator, directly computable by inference engines, to measure compression quality. This approach provides a task-agnostic answer to the resource allocation problem: compute should be "front-loaded" toward the compressor rather than the predictor.

Aria: an Agent for Retrieval and Iterative Auto-Formalization via Dependency Graph

Aria transforms the translation of natural language mathematical propositions into Lean formal code into a retrieval + iterative synthesis agent. It employs a "Graph-of-Thought" (GoT) to decompose propositions top-down into a concept dependency graph. Concepts found in Mathlib are anchored, while missing ones are synthesized bottom-up into new definitions. A semantic checker, AriaScorer, validates results by pulling real definitions for every Lean term from Mathlib. On research-level conjecture datasets where prior methods scored 0%, Aria achieves 42.9%.

AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations

This paper proposes AutoFigure—the first Agent framework based on the "Reasoned Rendering" paradigm. By decoupling structural layout planning and aesthetic rendering into two stages, it automatically generates publication-quality scientific illustrations from long scientific texts. Supported by FigureBench (3,300 pairs), the first large-scale benchmark for systematic evaluation, 66.7% of the generated results were deemed suitable for camera-ready versions by the original authors.

BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

This paper applies Sequential Bayesian Experimental Design (BED) to LLMs, enabling the model to select questions with the "maximum Expected Information Gain (EIG)" in each round. This transforms the LLM into a proactive and adaptive multi-turn information-gathering agent. In 20 Questions and movie preference inference tasks, the average success rate exceeds direct prompting by 37.4 percentage points.
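The selection rule can be illustrated with a small sketch, assuming a 20 Questions setting where each hypothesis answers each question deterministically. The hypothesis set, the answer table, and the function names are hypothetical and hand-written for illustration; the paper's setting derives these quantities from the LLM, which this sketch does not attempt.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_information_gain(prior, answers):
    """EIG of one question: prior entropy minus expected posterior entropy.

    prior:   dict hypothesis -> probability
    answers: dict hypothesis -> the answer that hypothesis would give
    """
    # Group the prior mass by the answer each hypothesis would produce.
    groups = {}
    for hypothesis, p in prior.items():
        groups.setdefault(answers[hypothesis], []).append(p)
    expected_posterior = 0.0
    for masses in groups.values():
        p_answer = sum(masses)
        posterior = [m / p_answer for m in masses]
        expected_posterior += p_answer * entropy(posterior)
    return entropy(prior.values()) - expected_posterior

def pick_question(prior, questions):
    """Return the question with the largest expected information gain."""
    return max(questions, key=lambda q: expected_information_gain(prior, questions[q]))

# Hypothetical toy problem: four equally likely animals.
PRIOR = {"cat": 0.25, "dog": 0.25, "eagle": 0.25, "salmon": 0.25}
QUESTIONS = {
    "Is it alive?":    {"cat": "yes", "dog": "yes", "eagle": "yes", "salmon": "yes"},
    "Can it fly?":     {"cat": "no", "dog": "no", "eagle": "yes", "salmon": "no"},
    "Is it a mammal?": {"cat": "yes", "dog": "yes", "eagle": "no", "salmon": "no"},
}
best = pick_question(PRIOR, QUESTIONS)
```

In this toy setup the question that splits the hypotheses evenly is preferred, since a question every hypothesis answers the same way yields no information.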

Benchmarking LLM Tool-Use in the Wild

WildToolBench extracts three characteristics of "wild" dialogues (composite tasks, hidden intentions, and instruction switching) from real user logs to construct a multi-turn, multi-step tool-use benchmark comprising 1024 tasks across 256 scenarios. Evaluations of 57 mainstream LLMs reveal that no model exceeds 15% session accuracy, indicating that current agentic capabilities are significantly weaker than inflated leaderboard scores suggest.

C-Evolve: Consensus-based Evolution for Prompt Groups

C-Evolve shifts from "evolving a single optimal prompt" to "evolving a group of complementary prompts." By using a voting score—measuring a prompt's contribution to group consensus—as evolutionary fitness, the method enables multiple prompts to reach a consensus, thereby breaking the performance ceiling of individual prompts.

Can Language Models Discover Scaling Laws?

This paper proposes SLDAgent—an evolutionary agent that co-evolves "formula generators + parameter optimizers"—along with SLDBench, the first scaling law discovery benchmark. It demonstrates for the first time that LLM agents can automatically discover scaling laws whose extrapolation accuracy exceeds human-expert derivations across all 8 evaluated tasks.

ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents

This work reveals structural vulnerabilities in LLM Agent chat templates: by forging role labels (e.g., <system>, <user>) within tool-returned data, attackers can hijack the model's perception of role hierarchy and disguise malicious instructions as high-priority commands, which increases the attack success rate (ASR) from 5-15% to 32-52%.

ChinaTravel: An Open-Ended Travel Planning Benchmark with Compositional Constraint Validation for Language Agents

ChinaTravel utilizes a compositional Domain-Specific Language (DSL) to automatically translate "open-ended natural language travel requirements" into verifiable logical constraints and preference objectives. Combined with 1154 real-world Chinese user queries, it constructs the first truly open multi-day multi-POI travel planning benchmark that requires contextual grounding and generalization to unseen constraint combinations. Empirical results show that neuro-symbolic agents achieve a constraint satisfaction rate more than 10x higher than pure LLMs (37.0% vs. 2.60%), yet the task remains far from solved.

CoDA: Agentic Systems for Collaborative Data Visualization

CoDA recasts "natural language to data visualization" as a multi-agent collaboration problem. It uses 8 specialized LLM agents to complete understanding, planning, generation, and self-reflection in stages. By "reading only metadata rather than raw data," it bypasses token limits, and through a "quality-driven reflection loop," it iteratively refines charts. It improves overall scores by up to 41.5% over strong baselines on MatplotBench, Qwen, and DA-Code.

Code Driven Planning with Domain-Adaptive Selector

CoPiC enables LLMs to generate multiple "high-level planning programs" at once (rather than requesting plans from the LLM step-by-step). These programs interact with the environment in a closed loop to produce candidate plans. A small model, the "Domain-Adaptive Selector" fine-tuned via RL, is then used to select the plan best aligned with long-term rewards. This approach achieves an average success rate improvement of 19.14% and an average token cost reduction of 79.39% across ALFWorld, NetHack, and StarCraft II unit production environments.

Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration

This paper proposes Collaborative Gym (Co-Gym), the first open framework supporting bidirectional communication and non-turn-taking collaboration between humans and LM agents in shared task environments, accompanied by an evaluation suite that assesses both collaboration outcomes and processes.

CoLLMLight: Cooperative Large Language Model Agents for Network-Wide Traffic Signal Control

CoLLMLight assigns an LLM agent to each intersection in a road network, enabling cooperation through a dual-module "Asynchronous Spatio-temporal Reasoning + Real-time Decision-making" architecture (rather than controlling intersections in isolation). By employing "Cost-aware Optimization" (Adaptive Reasoning-chain SFT + PPO), reasoning depth automatically scales with traffic complexity. In zero-shot evaluations across four real-world networks, it surpasses all traditional, RL, and single-agent LLM baselines while keeping decision latency within the duration of a yellow light.

CoMind: Towards Community-Driven Agents for Machine Learning Engineering

The authors propose MLE-Live—the first real-time evaluation framework simulating the Kaggle research community—and CoMind—a multi-agent ML engineering system capable of systematically leveraging collective community knowledge. CoMind achieved a 36% medal rate across 75 historical Kaggle competitions and surpassed an average of 79.2% of human participants (reaching 92.6% in updated versions) in 4 active competitions.

Cyber-Zero: Training Cybersecurity Agents without Runtime

Addressing the limitation of lacking executable runtime environments and the difficulty of collecting real agent trajectories in cybersecurity (CTF) tasks, this paper proposes Cyber-Zero—the first runtime-free trajectory synthesis framework. It leverages public CTF writeups and persona-driven dual-LLM simulation (one playing a contestant, the other a Bash terminal) to reverse-engineer and "replay" multi-turn interaction trajectories. Using these synthetic trajectories for SFT, open-source models achieve an absolute improvement of up to +13.1% across three CTF benchmarks, with the 32B model approaching Claude-3.5-Sonnet at a significantly lower cost.

Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind

This paper proposes RebuttalAgent, which treats academic rebuttal as "strategic gaming under asymmetric information" rather than simple technical debate. By modeling reviewers' psychological states using Theory of Mind (ToM), it generates evidence-based responses through a three-stage "ToM→Strategy→Response" (TSR) framework. Trained with SFT + self-rewarding RL, it achieves an average improvement of 18.3% over base models, outperforming closed-source models such as GPT-4.1 and o3.

Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents

DGM enables a programming agent to continuously rewrite its own codebase to improve its code-modification capabilities. It replaces the theoretically infeasible "formal proof" of the Gödel Machine with the empirical evidence of "benchmark performance." By maintaining an ever-growing archive of agents for open-ended exploration, it pushes SWE-bench from 20.0% to 50.0% and Polyglot from 14.2% to 30.7%.

Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

The authors propose a multi-stage pretraining data filtering pipeline utilizing a "blocklist + ModernBERT classifier" to remove dual-use biothreat-related text from the corpus. By training a 6.9B model from scratch, they achieve "harmlessness" that persists even under adversarial fine-tuning of up to 10,000 steps and 300 million tokens—outperforming existing post-training defenses by over an order of magnitude without compromising general capabilities.

DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively

DeepScientist models "automated scientific discovery" as a goal-oriented Bayesian optimization problem. Using a continuously accumulating Findings Memory, it iteratively "hypothesizes—implements/verifies—analyzes/generalizes" over a month-level timescale. After consuming over 20,000 GPU hours, generating approximately 5,000 ideas, and verifying about 1,100 of them, it surpassed human 2025 SOTA on three frontier AI tasks by 183.7%, 1.9%, and 7.9% respectively, achieved through the autonomous redesign of core methodologies rather than simple combinations of existing techniques.

Do Large Language Models Know What They Are Capable Of?

The authors systematically measure the ability of LLMs to "predict their success before starting a task" through three experimental setups. The study reveals that all models are systematically overconfident, though most possess discriminative power better than random. Furthermore, this self-awareness does not improve consistently as models grow stronger—current LLM agents are fundamentally limited by their inadequate understanding of their own capabilities.

DreamPhase: Offline Imagination and Uncertainty-Guided Planning for Large-Language-Model Agents

DreamPhase enables a frozen policy LLM to move beyond trial-and-error in real environments. Instead, it utilizes a learned latent world model to "dream" internally—simulating \(M\) multi-step future trajectories. Each trajectory is scored based on "Value minus Uncertainty" and passed through a safety gate. The selected branch is distilled into a natural language reflection and injected back into the prompt. This reduces real API calls per turn on WebShop from ~40 (ARMAP-M) to under 10 (a 4× reduction) and decreases irreversible actions by approximately 5×, all without fine-tuning the LLM.
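The "Value minus Uncertainty" scoring and the safety gate can be sketched as follows. This is a minimal illustration under stated assumptions: the `Rollout` fields, the weight `beta`, the `max_risk` threshold, and all numbers are hypothetical, and in the described system the value and uncertainty would come from the learned world model rather than being hard-coded.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    name: str
    value: float        # predicted return of the imagined trajectory
    uncertainty: float  # world-model uncertainty accumulated along it
    risk: float         # estimated chance of an irreversible action

def select_rollout(rollouts, beta=1.0, max_risk=0.2):
    """Drop rollouts that fail the safety gate, then maximize value - beta * uncertainty."""
    safe = [r for r in rollouts if r.risk <= max_risk]
    if not safe:
        return None  # nothing passes the gate; the caller must fall back
    return max(safe, key=lambda r: r.value - beta * r.uncertainty)

# Hypothetical imagined branches for one WebShop-style turn.
candidates = [
    Rollout("buy_now", value=0.9, uncertainty=0.2, risk=0.6),
    Rollout("compare_items", value=0.7, uncertainty=0.1, risk=0.05),
    Rollout("random_click", value=0.8, uncertainty=0.5, risk=0.1),
]
chosen = select_rollout(candidates)
```

Here `buy_now` has the highest score but is blocked by the gate, so the lower-risk `compare_items` branch is selected.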

Dual-Scale World Memory for LLM Agents towards Hard-Exploration Problems

GLoW is proposed to equip LLM agents with a dual-scale textual world memory that combines a "global trajectory frontier" and "local multi-path advantage reflection." It achieves new SOTA among LLM-based methods on sparse-reward hard-exploration tasks in Jericho text games, approaching the performance of the strongest RL methods with \(100\times\) to \(800\times\) fewer environment interactions.

Dyna-Mind: Learning to Simulate from Experience for Better AI Agents

Dyna-Mind teaches (V)LM agents to perform mental simulations of future states before acting by "compressing real environment search trees into reasoning trajectories" (RESIM). It then employs Dyna-GRPO, which feeds "ground-truth future states" back into online RL to reinforce simulation capabilities, significantly outperforming GRPO/RLOO and Dyna-Think on Sokoban, ALFWorld, and AndroidWorld.

Efficient Agent Training for Computer Use

PC Agent-E utilizes only 312 human-annotated Windows operation trajectories. Through the Trajectory Boost method, it prompts Claude 3.7 Sonnet to synthesize diverse alternative action decisions at each time step. The trained Qwen2.5-VL-72B achieves a 141% relative improvement on WindowsAgentArena-V2, even surpassing the teacher model Claude 3.7 Sonnet by 10%.

Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking

WebLeaper reformulates Information Seeking (IS) tasks as "tree-based reasoning," utilizes Wikipedia tables to synthesize "target-entity-dense" training tasks in bulk (with Basic, Union, and Reverse-Union variants), and employs ISR/ISE metrics to filter out low-coverage and inefficient trajectories. This enables 30B-scale open-source Web Agents to achieve open-source SOTA performance in both completeness and efficiency across five deep-search benchmarks.

Empowering LLM Tool Invocation with Tool-call Reward Model

Addressing the issues of coarse-grained reward signals and subsequent gradient conflicts in LLM tool invocation, this paper proposes the Tool-call Reward Model (TRM). TRM is a process reward model that independently scores each tool call. The authors further design turn-level credit assignment and advantage estimation strategies integrated with PPO/GRPO, achieving consistent performance improvements across search-based QA and code-centric mathematical tasks.

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Based on memory and cognitive sciences, the authors decompose the capabilities required by memory agents into four core dimensions: "Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting." They construct MemoryAgentBench, the first unified benchmark that simulates multi-turn interaction by chunking long texts and feeding them incrementally to agents. The study finds that no existing long-context models, RAG systems, or commercial memory agents can master all four capabilities simultaneously.

EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

This paper proposes the J-TTL benchmark to measure an agent's ability to "learn while playing" on the same task, and introduces EvoTest—a fine-tuning-free, gradient-free framework. After each episode, an Evolver Agent reads the full trajectory text to evolutionarily optimize the agent's prompts, memories, hyperparameters, and tool usage, enabling continuous performance gains through repeated attempts.

EXP-Bench: Can AI Conduct AI Research Experiments?

EXP-Bench semi-automatically extracts 461 "complete AI research experiment" tasks from 51 NeurIPS/ICLR 2024 papers and their open-source code. It forces Agents to complete the full pipeline of "hypothesize \(\rightarrow\) design experiment \(\rightarrow\) write code \(\rightarrow\) execute \(\rightarrow\) draw conclusions." Results show that the current strongest Agent achieves a success rate of only 0.5% in completing fully executable experiments.

Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis

By adopting the "Zone of Proximal Development (ZPD)" theory from educational psychology, this work develops a data synthesis engine that precisely calibrates task difficulty to the model's capability boundary. It generates high-value agent data for continued pre-training and post-training, pushing a small 30B-A3B model to 28.6% on HLE, surpassing several closed-source deep-research agents.

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

This paper proposes EMPO2, an RL framework combining an external memory module with hybrid on-policy/off-policy updates. By internalizing exploration gains into model parameters through memory-guided exploration and knowledge distillation, it achieves performance improvements of 128.6% and 11.3% over GRPO on ScienceWorld and WebShop, respectively.

FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

FaSTA* integrates LLM "fast planning" with A* search "slow planning" into a learning neuro-symbolic agent: it utilizes inductive reasoning to mine reusable symbolic subroutines, which act as "high-level tools," from historical successful toolpaths. Most subtasks are solved instantly by applying these subroutines, with expensive A* search triggered only upon failure. Compared to CoSTA*, it reduces costs by 49.3% in multi-turn image editing with only a 3.2% decrease in quality.

FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents

FingerTip 20K collects 21,437 interaction records (including user personas, time, location, and historical intents) from 95 users in real-world daily mobile usage. It proposes two new tracks: Proactive Task Suggestion (predicting user intent) and Personalized Task Execution (adapting to action preferences). The strongest model, Qwen-QVQ-Max, achieves a proactive suggestion success rate of only 12.8% (vs. 30.3% for humans), while UI-TARS reaches an execution success rate of only 38.5%.

Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution

This work rewrites the traditional "serial chain-of-thought" of web agents into a "DAG-based parallel execution graph." By allowing independent subtasks to perform retrieval and reasoning simultaneously, it achieves SOTA results on BrowseComp, xbench, GAIA, and HLE while reducing execution steps by 35% and end-to-end time by approximately 65%.
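The idea of running independent subtasks concurrently can be sketched with `asyncio`. This is an illustrative simplification, not the paper's scheduler: it executes the graph in synchronized waves, and the node names, the `echo_worker` stand-in for retrieval and reasoning, and the dependency table are all hypothetical.

```python
import asyncio

async def run_dag(deps, worker):
    """Run a dependency graph wave by wave.

    deps:   dict node -> list of prerequisite nodes
    worker: async callable (node, inputs) -> result, where inputs maps
            each prerequisite to its result
    Returns (results, waves); every wave is a list of nodes run concurrently.
    """
    results, waves = {}, []
    remaining = dict(deps)
    while remaining:
        ready = sorted(n for n, pre in remaining.items()
                       if all(p in results for p in pre))
        if not ready:
            raise ValueError("cycle or missing prerequisite in graph")
        outputs = await asyncio.gather(
            *(worker(n, {p: results[p] for p in remaining[n]}) for n in ready)
        )
        for node, out in zip(ready, outputs):
            results[node] = out
            del remaining[node]
        waves.append(ready)
    return results, waves

async def echo_worker(node, inputs):
    """Hypothetical stand-in for a retrieval or reasoning call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"{node}({','.join(sorted(inputs))})"

DEPS = {
    "search_a": [],
    "search_b": [],
    "compare": ["search_a", "search_b"],
    "answer": ["compare"],
}
results, waves = asyncio.run(run_dag(DEPS, echo_worker))
```

The two searches share a wave because neither depends on the other, which is where a serial chain would spend extra steps.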

FlowSearcher: Synthesizing Memory-Guided Agentic Workflows for Web Information Seeking

FlowSearcher reformulates web information seeking from "ReAct-style linear tool chains" to "memory-guided agentic workflow synthesis." By decomposing queries into sub-goals and synthesizing typed workflow DAGs for each, while injecting structured experience across node/graph/task levels into orchestration and execution, it matches or exceeds RL-trained agents on GAIA, BrowseComp, and GPQA without any supervised fine-tuning or RLHF.

From Single to Multi-Granularity: Toward Long-Term Memory Association and Selection of Conversational Agents

MemGAS utilizes a pipeline consisting of "multi-granularity memory units + GMM association + entropy-driven granularity routing + PPR retrieval + LLM filtering" to upgrade the long-term memory of conversational agents from single-granularity segmentation to cross-granularity association and adaptive selection. It comprehensively outperforms SOTA in QA and retrieval across four long-term memory benchmarks.

FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

FutureX constructs a live dynamic benchmark for "future prediction" tasks. Through a fully automated pipeline, it collects upcoming events daily from 195 high-quality websites, tasks 25 LLMs/agents with making predictions on the event start date, and automatically crawls the real outcomes for scoring once they are revealed. This fundamentally eliminates data contamination and reveals that even the strongest model, Grok-4, significantly lags behind human experts on high-volatility open-ended events.

Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments

This paper proposes the Gaia2 benchmark to evaluate LLM Agent capabilities in dynamic and asynchronous environments. It introduces real-world scenarios such as time constraints, noisy events, ambiguity resolution, and multi-agent collaboration. Combined with write action verifiers providing verifiable rewards, the benchmark can be directly used for RLVR training. Evaluations show that even the strongest model, GPT-5 (high), achieves only a 42% pass@1 rate.

Go-Browse: Training Web Agents with Structured Exploration

The data collection for training web agents is modeled as a graph search on websites: an expanding URL frontier maintains "discovered but under-explored" pages. For each page, the system proposes tasks, checks feasibility, and collects trajectories. By "resetting to discovered pages" to reuse historical leads, 10K successful trajectories were collected on WebArena. Fine-tuning a 7B model achieved a 21.7% success rate, surpassing GPT-4o mini.

GPS: Graph-guided Proactive Information Seeking in Large Language Models

GPS explicitly models the implicit "if-then rules" in retrieved documents as a logically complete Directed Acyclic Graph (DAG), utilizes dynamic traversal and pruning for on-demand questioning, and optimizes a Reasoner LLM via Group Relative Policy Optimization (GRPO) with hybrid rewards to generate high-quality DAGs, enabling LLMs to ask accurate questions efficiently when faced with underspecified user queries.

Grounding Computer Use Agents on Human Demonstrations

The authors construct GROUNDCUA, the largest desktop GUI grounding dataset to date (87 applications, 56k screenshots, 3.56M human-annotated elements), using expert human demonstrations. By utilizing only one-tenth of the training data compared to prior methods, the GROUNDNEXT series models achieve SOTA performance across five grounding benchmarks. This demonstrates that "high-quality dense supervision" is a more effective driver for reliable desktop grounding than merely increasing data volume.

GTA1: GUI Test-time Scaling Agent

GTA1 addresses the issue of planning cascade failures through test-time scaling—sampling multiple action proposals per step and selecting the best via a multimodal judge. It further achieves precise localization using a pure RL grounding model (without CoT thinking) that directly predicts coordinates with binary "hit-or-miss" rewards. This two-stage agent reaches SOTA on both grounding and task execution benchmarks.
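Both ingredients are simple to state in code. The sketch below assumes a click target given as a `(left, top, right, bottom)` box and a judge exposed as a scoring callable; these conventions and the function names are hypothetical, chosen only to illustrate the binary reward and the best-of-N selection described above.

```python
def hit_or_miss_reward(point, box):
    """Return 1.0 if the predicted click lands inside the target box, else 0.0."""
    x, y = point
    left, top, right, bottom = box
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0

def best_proposal(proposals, judge_score):
    """Test-time scaling: keep the sampled action proposal the judge rates highest."""
    return max(proposals, key=judge_score)

# Hypothetical target element and two predicted clicks.
save_button = (40, 10, 80, 30)
inside = hit_or_miss_reward((50, 20), save_button)
outside = hit_or_miss_reward((90, 20), save_button)
```

A sparse reward like this needs no distance shaping or reasoning trace, which fits the summary's description of a grounding model trained without CoT.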

GTool: Graph Enhanced Tool Planning with Large Language Model

GTool constructs a request-specific tool graph representing "dependencies between tools," encodes it into a <graph token> using a GNN to feed into a frozen LLM, and designs a Missing Dependency Prediction task to counteract incomplete dependencies, allowing small 7B models to outperform prior SOTA tool planning methods by 29.6%.

GUI-Shift: Enhancing VLM-Based GUI Agents through Self-supervised Reinforcement Learning

This paper proposes K-step GUI Transition, a self-supervised inverse dynamics task that predicts the first action to trigger a state transition given only a pair of screenshots \((S_t, S_{t+k})\), thereby eliminating the need for natural language instruction labels. By combining the GUI-Shift reinforcement learning framework (based on GRPO) with data filtering, multiple VLMs achieved up to an 11.2% gain on GUI automation tasks using only 2K samples, with zero-shot transferability to GUI grounding tasks.

Helmsman: Autonomous Synthesis of Federated Learning Systems via Collaborative LLM Agents

Helmsman utilizes a team of specialized LLM agents to automatically synthesize a runnable and simulation-verified Federated Learning (FL) codebase from high-level natural language requirements, such as "deploying a data-heterogeneous object detection system on 15 mobile devices."

How Dark Patterns Manipulate Web Agents

This paper introduces the DECEPTICON benchmark, demonstrating that common "dark patterns" (deceptive UI designs) can manipulate frontier Web Agents into malicious outcomes contrary to user intent in over 70% of tasks (compared to only 31% for humans). Furthermore, larger models and increased reasoning tokens actually increase susceptibility to deception, while existing defenses fail to provide stable protection.

Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine

This paper addresses the problem of letting coding agents rewrite their own code to become stronger. It points out that existing methods, which use single-step benchmark scores to guide expansion, are unreliable, because high-scoring parents do not necessarily produce high-quality descendants. It proposes Clade-Metaproductivity (CMP), the aggregated performance of an entire lineage, as the metric for self-improvement potential, and proves that a true CMP oracle is sufficient to simulate an optimal Gödel Machine. The implemented Huxley-Gödel Machine (HGM) uses Thompson sampling based on CMP estimates to select nodes for expansion, surpassing DGM/SICA on SWE-bench Verified and Polyglot with fewer CPU hours, and reaching the level of human-engineered agents on SWE-bench Lite.
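A rough sketch of lineage-level Thompson sampling is given below. It assumes, purely for illustration, that a node's CMP estimate is a Beta posterior over the pooled pass/fail counts of the node and all of its descendants; the tree, the counts, and the function names are hypothetical, and the paper's actual estimator may differ.

```python
import random

def clade_counts(tree, stats, node):
    """Sum (successes, failures) over a node and all of its descendants."""
    successes, failures = stats[node]
    for child in tree.get(node, []):
        s, f = clade_counts(tree, stats, child)
        successes += s
        failures += f
    return successes, failures

def thompson_select(tree, stats, rng):
    """Sample one Beta draw per node from its clade counts; expand the argmax."""
    best_node, best_draw = None, -1.0
    for node in stats:
        s, f = clade_counts(tree, stats, node)
        draw = rng.betavariate(1 + s, 1 + f)  # uniform Beta(1, 1) prior
        if draw > best_draw:
            best_node, best_draw = node, draw
    return best_node

# Hypothetical archive: "b" beats "a" on its own score (6/10 vs 3/10),
# but the lineage under "a" is far more productive.
TREE = {"root": ["a", "b"], "a": ["a1"], "b": ["b1"]}
STATS = {"root": (5, 5), "a": (3, 7), "a1": (9, 1), "b": (6, 4), "b1": (1, 9)}

rng = random.Random(0)
next_to_expand = thompson_select(TREE, STATS, rng)
```

With these counts the clade under `a` pools to 12 successes and 8 failures, against 7 and 13 for `b`, so sampling favors `a` even though `b` has the higher single-node score.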

In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

This paper proposes AGENTFLOW—a trainable agentic system where four modules (planner, executor, verifier, and generator) collaborate via a shared memory. Using the accompanying Flow-GRPO algorithm, the planner is optimized online within a "live flow" of multi-turn interactions. A 7B backbone achieves a 4–15 point gain across 10 benchmarks, outperforming GPT-4o (~200B).

InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents

InfoMosaic-Bench is the first benchmark specifically designed to evaluate the capability of tool-augmented Agents in "cross-multi-source information retrieval." Using the InfoMosaic-Flow pipeline with an organizer–worker architecture, it synthesized 621 tasks that must be solved by simultaneously calling general web searches and domain-specific MCP tools. The results reveal that even the most powerful GPT-5 achieves only 38.2% accuracy, gains from domain tools are unstable, and 22.4% of failures stem from basic tool misuse.

InnovatorBench: Evaluating Agents' Ability to Conduct Innovative AI Research

This paper introduces InnovatorBench—the first end-to-end benchmark (20 tasks) constructed from real papers and codebases, covering 6 categories of LLM research sub-problems such as data, loss, reward, and scaffolding. Accompanied by the ResearchGym environment, which supports distributed, asynchronous, and snapshot capabilities, the study evaluates frontier models like Claude-4, GPT-5, and GLM-4.5 using ReAct agents. The findings reveal that while these models can handle code-centric research tasks, they frequently fail in fragile algorithm design and long-horizon decision-making (due to impatience, poor resource management, and template-based reasoning).

IR-Agent: Expert-Inspired LLM Agents for Structure Elucidation from Infrared Spectra

The expert workflow for interpreting infrared (IR) spectra is decomposed into three specialized LLM agents: one identifies local functional groups via absorption tables, another retrieves similar spectra for global backbones, and the third integrates reasoning to rank candidate structures. It outperforms single-model and single-agent approaches on real experimental IR spectra and incorporates additional chemical information without retraining.

Just Do It!? Computer-Use Agents Exhibit Blind Goal-Directedness

This paper introduces the concept of "Blind Goal-Directedness" (BGD), characterizing the tendency of Computer-Use Agents (CUAs) to pursue goals regardless of feasibility, safety, reliability, and context. The authors construct the BLIND-ACT benchmark with 90 tasks (based on OSWorld and evaluated via an LLM judge), measuring an average BGD rate of 80.8% across 9 frontier models, highlighting a pervasive systemic risk overlooked by existing safety research.

KRAMABENCH: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes

KRAMABENCH constructs an end-to-end data science benchmark that bridges the gap from "dirty data lakes" to "insights" using 6 real-world domains, 24 data sources, 1700+ files, and 104 human-curated tasks. Accompanied by a three-tier evaluation system (End-to-End / Pipeline Design / Sub-task Implementation), the results show that the strongest system achieves only 55.83% accuracy under full data lake conditions, significantly trailing the 76.75% human baseline.

K²-Agent: Co-Evolving Know-What and Know-How for Hierarchical Mobile Device Control

Inspired by the human cognitive systems of "knowing what" (declarative) and "knowing how" (procedural), K²-Agent utilizes a high-level planner running an SRLR self-evolution loop to refine task knowledge and a low-level executor using curriculum C-GRPO to learn operational skills. These two components co-evolve in a closed loop. Using only raw screenshots and open-source 7B/72B backbones, it achieves a new SOTA of 76.1% success rate on AndroidWorld.

Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning

Clinical differential diagnosis is modeled as a two-agent cyclic system consisting of a "Hypothesis Agent + Decision Agent." A hybrid paradigm of supervised and reinforcement learning is employed to simultaneously train accurate hypothesis generation, confidence calibration, and efficient test selection. This enables the LLM to perform iterative reasoning and information gathering like a physician, approaching the correct diagnosis at the minimum testing cost.

LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

This paper systematically analyzes three core failure modes—greediness, frequency bias, and the knowing-doing gap—that lead to suboptimal LLM performance in simple decision-making scenarios (Multi-armed Bandits, Contextual Bandits, Tic-Tac-Toe). It demonstrates that RL fine-tuning (RLFT) on self-generated CoT reasoning significantly increases exploration and bridges the knowing-doing gap.

LongHorizonUI: A Unified Framework for Robust Long-Horizon Task Automation of GUI Agent

LongHorizonUI employs an "Enhanced Perception + Three-layer Closed-loop Reflective Decision-making + Multi-level Compensatory Execution" toolkit to raise the success rate of training-free MLLM GUI agents on long-horizon tasks exceeding 15 steps, and releases the LongGUIBench benchmark (averaging 22 steps).

Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents

ReMemR1 embeds a revisitable memory retrieval mechanism into "memorize while reading" agents. Each step, the agent updates its current memory while generating a callback query to search its own history. Combined with a multi-level (trajectory-level + step-level) reward system to densify RL signals, it reduces error rates for long-context multi-hop reasoning by over 20% with negligible computational overhead (<0.2% time cost).

M²-Miner: Multi-Agent Enhanced MCTS for Mobile GUI Agent Data Mining

This paper proposes M²-Miner, the first MCTS-based automatic data mining framework for mobile GUI agents. By employing a three-agent collaboration (InferAgent/OrchestraAgent/JudgeAgent), it improves mining efficiency by 64x. Combined with an intent recycling strategy to enrich intent diversity, the trained GUI agent achieves SOTA performance on multiple benchmarks.

MATHMO: Automated Mathematical Modeling Through Adaptive Search

Mathematical modeling is formalized as a sequential decision-making problem under uncertainty. Using an LLM as a generation operator and surrogate evaluator, combined with a bi-level adaptive search that selects frameworks at the upper level and tunes models at the lower level, the system automatically produces a set of mathematical models forming a Pareto frontier across multiple (including subjective) objectives.

MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains

This paper proposes MC-Search, the first benchmark for agentic multimodal RAG, featuring 3,333 high-quality samples (averaging 3.7 hops) across 5 reasoning topologies. It ensures the necessity of each step through HAVE verification and introduces the Search-Align process-level supervised fine-tuning framework, significantly enhancing the retrieval planning capabilities of open-source models (Qwen2.5-VL-7B F1 increases by +13.7).

MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

MCP-Bench connects agents to 28 real-world production-grade MCP services (totaling 250 tools across 11 domains such as finance, research, and travel). By utilizing automatically synthesized complex tasks characterized by "fuzzy instructions, multi-objectives, and cross-domain dependencies," combined with a dual-layer evaluation of "rule-based checking + LLM-as-a-Judge," it systematically exposes the genuine deficiencies of 20 mainstream LLMs in long-term planning and dependency reasoning.

MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents

MSB is the first end-to-end security evaluation benchmark for the Model Context Protocol (MCP), covering 12 categories of attacks across the "Task Planning → Tool Invocation → Response Handling" workflow. By testing 10 LLM agents with real executable malicious tools (rather than simulated outputs), the study finds that MCP-specific attacks are widely effective (peak ASR 75.83%), and more capable models are often more vulnerable.

MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

MCPMark constructs 127 high-difficulty tasks across 5 types of realistic MCP environments (Notion / GitHub / Filesystem / PostgreSQL / Playwright), polished through expert-agent collaboration and accompanied by programmatic verification scripts. Emphasizing multi-step CRUD workflows, results show that even the strongest gpt-5-medium achieves only 52.56% pass@1 and 33.86% pass^4, significantly pushing the performance limits of current agents in realistic MCP usage.
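The two reported metrics differ in how they aggregate repeated runs. The sketch below assumes the common definitions: pass@1 as the mean per-run success rate, and pass^k as the fraction of tasks where all k runs succeed. The benchmark's exact formulas are not given in this note, so both function names and definitions are assumptions.

```python
def pass_at_1(runs_per_task):
    """Mean single-run success rate, averaged over tasks."""
    rates = [sum(runs) / len(runs) for runs in runs_per_task]
    return sum(rates) / len(rates)


def pass_hat_k(runs_per_task):
    """Fraction of tasks solved in every one of their k runs."""
    return sum(all(runs) for runs in runs_per_task) / len(runs_per_task)


# Three hypothetical tasks, four runs each (True = the run passed verification).
example = [
    [True, True, True, True],
    [True, False, True, True],
    [False, False, False, False],
]
```

Under these definitions pass^k can never exceed pass@1, so the gap between the two numbers reflects how inconsistent an agent is across repeated attempts.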

MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic Workflow

MedAgent-Pro decomposes the modern clinical "evidence-based diagnosis" process into a two-layer agentic workflow: disease-level standardized planning and patient-level step-by-step evidence reasoning. It utilizes RAG to align with medical guidelines, employs vision/coding tools for quantitative analysis, and utilizes an evidence reflection mechanism to prune unreliable intermediate conclusions. This transforms VLMs from "empirical one-jump responders" into a diagnostic system that is "metric-driven, evidence-based, and traceable."

MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

MEM1 uses end-to-end reinforcement learning to train LLM agents to embed "memory consolidation" into "reasoning" itself. By maintaining a single, continuously rewritten compact internal state per round and discarding old observations, it maintains near-constant context across arbitrarily long multi-turn tasks, achieving higher performance, lower memory usage, and faster inference.

FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

The authors propose FeatureBench—a benchmark for evaluating code agents in feature-level software development. Using a test-driven automated pipeline to extract verifiable feature implementation tasks from open-source repositories, the results show that the strongest model, Claude Opus 4.5, solves only 11.0%, revealing a substantial performance gap in complex feature development.

MemGen: Weaving Generative Latent Memory for Self-Evolving Agents

MemGen enables LLM agents to determine when to recall in real-time during inference via a "memory trigger," followed by a "memory weaver" that generates machine-native latent token sequences injected into the inference stream. This intertwines memory and reasoning into a dynamic cycle, significantly outperforming parametric and retrieval-based memory while keeping the backbone frozen and without modifying a single parameter.

Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents

Memory-T1 models the problem of "which memory to retrieve" in multi-session dialogues as a time-aware evidence selection task. It employs a coarse-to-fine filtering process using temporal windows and relevance retrieval, followed by a policy model trained with GRPO to select evidence from candidate sessions and generate answers. This allows 3B/7B open-source models to achieve an overall score of approximately 67% on the Time-Dialog temporal reasoning benchmark.

Meta-RL Induces Exploration in Language Agents

The LaMer framework is proposed to introduce Meta-Reinforcement Learning (Meta-RL) into LLM agent training. By optimizing cross-episode rewards and employing reflection-based in-context policy adaptation, the framework enables language agents to actively explore environments, achieving absolute performance improvements of 11%, 14%, and 19% on Sokoban, MineSweeper, and Webshop, respectively.

Mix-ECom: Towards Mixed-Type E-Commerce Dialogues with Complex Domain Rules

This paper constructs the first customer service benchmark Mix-ECom, which features "four dialogue types mixed in a single conversation + 82 real-world e-commerce domain rules." It proposes a dynamic rule filtering module placed before ReAct/Plan-and-Solve to suppress hallucinations caused by complex rules, revealing that the current strongest multimodal LLM Agents still achieve a total score of only 62% on real e-commerce service tasks.

MMSearch-Plus: Benchmarking Provenance-Aware Search for Multimodal Browsing Agents

MMSearch-Plus introduces a multimodal browsing benchmark comprising 311 questions. By employing "spatial-temporal extrapolation," it mandates that agents extrapolate from fine-grained visual cues to facts outside the image. Accompanying this is a model-agnostic agent framework with Set-of-Mark (SoM) zoom-in retrieval, revealing that the end-to-end accuracy of current state-of-the-art MLLMs is only 36%.

MobileIPL: Enhancing Mobile Agents Thinking Process via Iterative Preference Learning

To address the lack of CoaT (Chain of Action-Planning Thoughts) reasoning trajectories and the difficulty of step-level annotation for mobile GUI agents, MobileIPL uses MCTS-style iterative sampling to construct a CoaT-tree. It scores leaf nodes based on rule-based rewards and backpropagates values to intermediate thinking steps, constructing "Thinking-level" DPO pairs (T-DPO) to optimize the reasoning process. This approach outperforms large-scale continuously pre-trained models such as OS-ATLAS and UI-TARS on three mobile GUI benchmarks.
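One way to picture the value backup and pair construction is the sketch below. The tree layout (nested dicts with `thought`, `reward`, and `children` keys), the mean-of-children backup rule, and the `margin` filter are all illustrative assumptions, not the paper's exact procedure.

```python
def backup(node):
    """Fill node['value']: the leaf reward, or the mean of child values (assumed rule)."""
    children = node.get("children", [])
    if not children:
        node["value"] = node["reward"]
    else:
        node["value"] = sum(backup(child) for child in children) / len(children)
    return node["value"]


def preference_pairs(node, margin=0.2):
    """Collect (context, chosen, rejected) thinking steps from sibling nodes."""
    pairs = []
    children = node.get("children", [])
    if len(children) >= 2:
        best = max(children, key=lambda child: child["value"])
        worst = min(children, key=lambda child: child["value"])
        # Keep only pairs whose backed-up values differ clearly.
        if best["value"] - worst["value"] >= margin:
            pairs.append((node["thought"], best["thought"], worst["thought"]))
    for child in children:
        pairs.extend(preference_pairs(child, margin))
    return pairs
```

Call `backup(root)` once before `preference_pairs(root)`, since the pair builder reads the `value` field that the backup writes.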

MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents

MobileRL introduces an online agentic RL framework for mobile GUI agents using "two-stage reasoning SFT warm-up + Adaptive GRPO (AdaGRPO)". By combining positive sample replay, failure curriculum filtering, and shortest path rewards, the framework stabilizes multi-step training under sparse rewards. It achieves state-of-the-art (SOTA) results, with a 9B model reaching an 80.2% success rate on AndroidWorld and 53.6% on AndroidLab.

Modeling Others' Minds as Code

The paper reformulates "predicting others' next actions" as a program synthesis problem—using LLMs to generate a set of Python "behavior scripts" that explain observed trajectories, followed by Sequential Monte Carlo for Bayesian inference to filter the most likely programs. This approach enables efficient, interpretable, and generalizable prediction of human and AI agent behaviors.
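The core inference idea can be sketched as Bayesian reweighting over candidate behavior scripts. This is a simplification: full Sequential Monte Carlo would also resample and refresh the candidates, which is omitted here. The noise level `epsilon` and the two toy scripts are hypothetical.

```python
def update_weights(weights, scripts, state, observed_action, epsilon=0.1):
    """One Bayesian update: scripts that predicted the observed action gain weight.

    Each script is a plain Python function state -> action. A script is assumed
    to match the true behavior with probability 1 - epsilon (assumed noise model).
    """
    unnormalized = [
        w * ((1 - epsilon) if script(state) == observed_action else epsilon)
        for w, script in zip(weights, scripts)
    ]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]


def always_left(state):
    return "left"


def toward_goal(state):
    # Hypothetical script: move right while the agent is left of the goal.
    return "right" if state["x"] < state["goal_x"] else "left"
```

After a few observations the weight concentrates on the scripts that explain the trajectory, and the highest-weight script can then be run forward to predict the next action.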

Natural Language PDDL (NL-PDDL): Open-world Goal-oriented Commonsense Regression Planning in Embodied AI

The symbolic predicates of classical PDDL are replaced with "typed natural language predicates," and first-order regression planning is driven by LLM entailment judgments. This preserves the correctness of symbolic planning while gaining the commonsense generalization of LLMs in partially observable open worlds where goals and action descriptions are misaligned.

Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning

By using a binary reward that only checks "format compliance + precise tool-calling match" for R1-style GRPO training, the authors trained Qwen2.5-7B/14B into tool-calling reasoning models that outperform GPT-4o without any distilled reasoning trajectories.
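A binary reward of this kind can be sketched as follows. The `<think>` / `<tool_call>` tag layout and the JSON list encoding of calls are assumptions made for illustration; the note does not specify the exact output format.

```python
import json
import re

# Assumed output layout: reasoning in <think>, then a JSON list in <tool_call>.
PATTERN = re.compile(
    r"^<think>(.*?)</think>\s*<tool_call>(.*?)</tool_call>$", re.DOTALL
)


def canonical(call):
    """Serialize a call with sorted keys so argument order does not matter."""
    return json.dumps(call, sort_keys=True)


def binary_reward(output, gold_calls):
    """Return 1.0 only if the format is valid AND the calls match exactly."""
    match = PATTERN.match(output.strip())
    if match is None:
        return 0.0
    try:
        predicted = json.loads(match.group(2))
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(predicted, list):
        return 0.0
    same = sorted(map(canonical, predicted)) == sorted(map(canonical, gold_calls))
    return 1.0 if same else 0.0
```

Note that the reward never inspects the content of the reasoning span, only that it is present, which is consistent with training without distilled reasoning trajectories.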

NetArena: Dynamic Benchmarks for AI Agents in Network Automation

NetArena utilizes a unified "state-action" abstraction and network simulator integration to transform network operation and maintenance tasks into a live benchmark that can infinitely generate dynamic queries and automatically verify correctness, safety, and latency in simulation, revealing that current AI agents achieve only 13–38% accuracy on realistic large-scale network tasks.

NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

NewtonBench is proposed as a benchmark for scientific law discovery featuring 324 tasks across 12 physical domains. It generates novel, memorization-resistant tasks via "counterfactual law shifts," requiring agents to discover hidden physical equations through interactive experimental exploration. Results show GPT-5 performs best (75.9% symbolic accuracy) but degrades sharply in complex systems (40.3%), and code tools unexpectedly yield negative effects for strong models.

OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds

Addressing the phenomenon where joint training of GUI and embodied data leads to mutual interference, this paper discovers that these two data types synergize in shallow layers but conflict in deep layers (analogous to the "cerebrum-cerebellum" division in the human brain). The authors propose Layer-heterogeneity MoE, which shares parameters in shallow layers to exploit synergy and separates them in deep layers to avoid conflict. By unifying the action spaces and collecting large-scale data, they develop OmniActor, a generalist agent that outperforms specialized SOTA models in both GUI and embodied tasks.

Open Data Synthesis for Deep Research

This paper proposes the InfoSeek data synthesis framework, formalizing the "deep research" task as a Hierarchical Constraint Satisfaction Problem (HCSP). Using a two-stage "Diffusion–Retrospection" approach, it automatically grows research trees from seed webpages and retroactively weaves them into QA pairs requiring multi-layer reasoning with unique, verifiable answers. By training an InfoSeeker agent of only 3B parameters with 50k+ synthesized QA pairs and 16.5k trajectories, the model outperforms numerous larger open-source and even some closed-source systems on benchmarks like Multi-hop QA and BrowseComp-Plus.

OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety

This paper introduces OpenAgentSafety, a comprehensive safety evaluation framework for AI agents. It features over 350 executable tasks, a suite of real-world tools (browser, terminal, file system, messaging platforms), and multi-turn, multi-user interaction scenarios. The study reveals that even state-of-the-art LLMs exhibit unsafe behaviors in 49%-73% of safety-sensitive tasks.

OpenApps: Simulating Environment Variations to Measure UI Agent Reliability

This paper proposes OpenApps—a lightweight UI Agent simulation ecosystem (comprising six configurable apps including Calendar, Maps, and Shop) that is pure Python and runnable on a single CPU. By procedurally morphing the appearance and content of the same app into thousands of versions, it introduces the dimension of "reliability across app variants," which is ignored by existing fixed cloned environments. Across over 10,000 evaluations of seven mainstream multimodal agents, the study finds that agents appearing stable in fixed environments can experience success rate fluctuations of over 50% when app variants are changed (e.g., Kimi-VL dropping from 63% to 4%).

Opponent Shaping in LLM Agents

This paper presents the first investigation of "opponent shaping" between LLM Agents, introducing ShapeLLM—a model-free shaping algorithm trained via PPO. By compressing "history" and "context" into structured natural language prompts, ShapeLLM demonstrates that LLM Agents can actively manipulate an opponent's learning dynamics to lead them toward exploitable equilibria (maximizing individual gain in competitive games) or foster cooperation to enhance collective welfare.

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

Orak encapsulates 12 real video games covering all 6 major genres into a unified benchmark using Model Context Protocol (MCP) plug-and-play interfaces. It enables systematic evaluation of LLM agentic modules (reflection/planning/tools) and provides a fine-tuning dataset of expert LLM gameplay trajectories to transform general LLMs into effective game agents.

OrchestrationBench: LLM-Driven Agentic Planning and Tool Use in Multi-Domain Scenarios

OrchestrationBench is proposed—a fully manually annotated English/Korean bilingual benchmark. It decomposes service-level orchestration (where a "Main LLM decomposes requests \(\rightarrow\) assigns to Sub-LLMs \(\rightarrow\) Sub-LLMs call tools") into two independent dimensions: Workflow Planning and Constraint-Aware Tool Execution. Findings indicate that while function calling capabilities are similar across major models, the gap in planning capability is significant.

OSWorld-MCP: Benchmarking MCP Tool Invocation in Computer-Use Agents

OSWorld-MCP injects 158 high-quality MCP tools into the OSWorld real-world computer environment, enabling multimodal agents to autonomously choose between "invoking a tool" and "interacting with the GUI" at each step. This marks the first time tool invocation, GUI operation, and mixed decision-making capabilities are evaluated within a unified framework. Results show that MCP tools generally improve success rates (e.g., OpenAI o3 improves from 8.3% to 17.6% within 15 steps), but the highest tool invocation rate among models is only 33.3%, indicating that existing agents are far from mastering tool usage.

PhyScensis: Physics-Augmented LLM Agents for Complex Physical Scene Arrangement

This paper proposes PhyScensis, an LLM agent framework integrated with a physics engine. By utilizing a solver driven by spatial and physical predicates, it generates high-complexity, physically accurate 3D scenes. It significantly outperforms prior methods in visual quality, semantic correctness, and physical precision, and has been successfully applied to robot manipulation policy training.

PolySkill: Learning Generalizable Skills through Polymorphic Abstraction for Continual Agents

PolySkill introduces the concept of "polymorphism" from software engineering into Web agent skill learning: using an abstract domain class to define "what to do" (e.g., search_product), while site-specific subclasses implement "how to do it." This approach enables cross-site skill reuse—improving skill reusability by 1.7x on known sites, increasing success rates by up to 13.9% on unseen sites, and reducing steps by over 20%, while successfully mitigating catastrophic forgetting in continual learning.
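The polymorphic split can be sketched with an abstract base class. Only `search_product` comes from the note; the class names, the action strings, and the composite skill `search_and_open_first` are hypothetical placeholders.

```python
from abc import ABC, abstractmethod


class ShoppingSite(ABC):
    """Abstract domain class: declares WHAT a shopping skill does."""

    @abstractmethod
    def search_product(self, query):
        """Return the list of UI actions that searches for `query`."""

    def search_and_open_first(self, query):
        # Composite skill written only against the abstract interface,
        # so it is reusable on any site that implements search_product.
        return self.search_product(query) + ["click first result"]


class SiteA(ShoppingSite):
    """Site-specific subclass: defines HOW the search is carried out."""

    def search_product(self, query):
        return ["click #search-box", f"type {query}", "press Enter"]


class SiteB(ShoppingSite):
    def search_product(self, query):
        return ["open menu", "click 'Find items'", f"type {query}", "click Go"]
```

In this sketch, supporting a previously unseen site only requires writing a new `search_product`; composite skills defined on the base class carry over unchanged.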

Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations

EvoPresent connects four agents—Storyline, Scholar, Design, and Checker—into a "draft-feedback-revision" self-improvement pipeline, transforming papers into narrative-driven, aesthetically pleasing presentation videos with virtual explanations. Its core is PresAesth, an aesthetic model trained via multi-task reinforcement learning, which provides reliable scoring, defect diagnosis, and comparative feedback, enabling the system to iterate autonomously with minimal labeled data.

PRISM: Festina Lente Proactivity—Risk-Sensitive, Uncertainty-Aware Deliberation for Proactive Agents

PRISM models the decision of "whether a proactive agent should speak" as a cost-sensitive selective intervention problem. It first estimates two calibrated probabilities—"whether the user needs help" and "whether the user will accept"—and uses an adaptive threshold derived from false alarm/missed detection costs for gating. A single "slow reasoning" pass is triggered only near the decision boundary. By employing gate-aligned distillation to train student models, PRISM reduces the false alarm rate by 22.78% and improves F1 by 20.14% on PROACTIVEBENCH.
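A cost-derived gate can be sketched as below. With false-alarm cost `c_fa` and miss cost `c_miss`, speaking has lower expected cost than staying silent when the probability of a useful intervention exceeds `c_fa / (c_fa + c_miss)`. Combining the two probabilities by multiplication and the fixed `margin` that triggers slow reasoning are assumptions for illustration, not the paper's exact rule.

```python
def decide(p_need, p_accept, c_fa, c_miss, margin=0.05):
    """Gate a proactive intervention using a cost-sensitive threshold.

    Expected cost of speaking:       c_fa * (1 - p)
    Expected cost of staying silent: c_miss * p
    Speaking is cheaper when p > c_fa / (c_fa + c_miss).
    """
    threshold = c_fa / (c_fa + c_miss)
    p_useful = p_need * p_accept  # assumed way of combining the two estimates
    if abs(p_useful - threshold) < margin:
        # Near the boundary: escalate to a single slow-reasoning pass.
        return "deliberate"
    return "speak" if p_useful > threshold else "stay_silent"
```

Raising `c_fa` relative to `c_miss` moves the threshold up, so the agent interrupts less often when false alarms are expensive.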

Programming with Pixels: Can Computer-Use Agents do Software Engineering?

The authors construct PwP (Programming with Pixels), the first "computer-use" environment for software engineering, in which agents operate VSCode via keyboard and mouse by viewing the screen as humans do, along with the accompanying 15-task benchmark, PwP-Bench. Systematic evaluation reveals that general Computer-Use Agents (CUAs) using pure vision achieve only 22.9% accuracy, significantly underperforming specialized SWE agents; however, giving them just two text APIs (file editing + bash) raises accuracy to 50.7%, close to that of specialized agents. This indicates that the bottleneck is not coding ability but poor visual grounding and failure to use existing IDE tools.

ProRe: A Proactive Reward System for GUI Agents via Reasoner–Actor Collaboration

To address the difficulty of obtaining verifiable rewards for GUI agents, ProRe utilizes a general reasoner to schedule "state probing tasks," which are then executed by a domain-specific evaluator agent (actor) to proactively collect key interface states. The task success is determined via chain-of-claims reasoning, achieving a reward accuracy of 93.7% (the first GUI reward system to exceed 90%) and improving policy agent success rates by up to 22.4%.

Pushing Test-Time Scaling Limits of Deep Search with Asymmetric Verification

This paper systematically investigates test-time compute scaling for deep search agents. Identifying the "hard to search, easy to verify" asymmetry, it proposes allocating compute from the search agent to a verifier agent to efficiently filter candidate answers. This approach upgrades open-source models like GLM-4.5, K2, Qwen3-2507, and Tongyi-DeepResearch to "Heavy" versions, achieving improvements of up to 20+ percentage points on benchmarks like BrowseComp, reaching performance levels comparable to OpenAI Deep Research and o3.

R-WoM: Retrieval-augmented World Model for Computer-use Agents

The authors systematically verify that "LLMs as World Models" work for short-term but fail for long-term horizons. They propose R-WoM, which uses external tutorial retrieval to "ground" the multi-step imagination and reward estimation of the world model, achieving up to a 23.4% improvement over the strongest baselines on OSWorld / WebArena, with increasing advantages for longer trajectories.

PerfGuard: A Performance-Aware Agent for Visual Content Generation

PerfGuard is proposed as a performance-aware Agent framework for visual content generation. It replaces text descriptions with a multi-dimensional scoring matrix for Performance-Aware Selection Modeling (PASM), utilizes Adaptive Preference Update (APU) to dynamically calibrate deviations between theoretical rankings and actual execution, and employs Capability-Aligned Planning Optimization (CAPO) to guide the Planner in generating subtasks matched with tool capabilities. It outperforms SOTA methods such as GenArtist and T2I-Copilot in image generation and editing tasks.

Real-Time Reasoning Agents in Evolving Environments

This paper introduces the new problem of "real-time reasoning"—where the environment continues to evolve while the agent is thinking—and constructs the Real-Time Reasoning Gym to measure it. Furthermore, it proposes AgileThinker, which runs a "planning thread" and a "reaction thread" in parallel. The reaction thread can read the ongoing intermediate thoughts of the planning thread, consistently outperforming single-paradigm agents as cognitive load and time pressure increase.

Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents

This paper proposes T³ (Truncating Belief-Trapped Trajectories), which analyzes the "belief trap" phenomenon in LLM agents during multi-turn active reasoning based on POMDP theory. By detecting belief deviation and truncating uninformative trajectory tails, it corrects credit assignment errors in RL training. T³ achieves performance gains of up to 30 points across five challenging tasks while saving 34% in token costs.

REMem: Reasoning with Episodic Memory in Language Agents

This paper proposes REMem, an episodic memory framework for language agents. By utilizing a hybrid memory graph (time-aware gist nodes + factual triplet nodes) and tool-augmented agentic reasoning, it outperforms SOTA methods by 3.4% and 13.4% on episodic recall and episodic reasoning tasks, respectively.

Repurposing Synthetic Data for Fine-grained Search Agent Supervision

"Gold entities" used during synthetic data generation are repurposed as process supervision signals to propose entity-aware E-GRPO. Partial rewards are assigned to "near-miss" samples (incorrect answers with partially correct reasoning) based on entity hit rates, consistently outperforming GRPO on multiple QA and deep retrieval benchmarks while learning strategies with fewer tool calls.

ReVeal: Self-Evolving Code Agents via Reliable Self-Verification

ReVeal organizes code generation into an alternating "generation-verification" multi-turn loop and explicitly optimizes self-verification capabilities using a turn-level reinforcement learning algorithm (TAPO). This allows a 32B model, trained for only 3 turns, to continuously self-correct for over 20 turns during inference, driving the Pass@1 on LiveCodeBench V6 from 34.8% up to 38.7%.

ROGA: Scaling Generalist Agents for Office Productivity Tasks via Tool Generation

Addressing the severe performance drop of existing "Automated Tool Generation" (ATG) agents in long-horizon, stateful office tasks, ROGA restructures the agent paradigm. It utilizes active world modeling to complete partially observable file contexts, persistent symbolic memory to maintain cross-step states, and dynamic capability evolution to make generated tools reusable. ROGA improves task success rates by up to 13.64% on benchmarks like OSWorld, WindowsAgentArena, and GAIA-Office, even outperforming specialized agents on spreadsheet tasks.

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

ScaleCUA utilizes a "dual-loop" data pipeline—combining automated agents and human experts—to collect and annotate a massive cross-platform GUI corpus (471K perception, 17.1M grounding, and 19K trajectories) spanning six operating systems. Based on this, it trains open-source computer use agents (CUAs) supporting three reasoning paradigms, achieving new SOTAs across multiple GUI benchmarks (WebArena-Lite-v2 +26.6, ScreenSpot-Pro +10.7).

Scaling Agent Learning via Experience Synthesis

DreamGym utilizes a "reasoning-based experience model" to synthesize agent-environment interactions (state transitions + rewards) within an abstract textual state space. Combined with an experience replay buffer and a reward-entropy-based curriculum task generator, it enables LLM agents to execute RL training with almost no real-world rollouts. It outperforms all baselines by 30%+ on the non-RL-ready WebArena and matches GRPO/PPO performance on RL-ready environments using purely synthetic data.

Scaling Agents via Continual Pre-training

This paper shifts the learning of agent capabilities to the continual pre-training stage by proposing Agentic Continual Pre-Training. Using two types of large-scale synthetic data, FAS and HAS, the authors train AgentFounder. This allows an open-source 30B-class deep research agent to achieve strong performance across 10 benchmarks, including BrowseComp, GAIA, and HLE.

Scaling Synthetic Task Generation for Agents via Exploration

AUTOPLAY automatically constructs large-scale training data for UI agents by having Multimodal Large Language Models (MLLMs) first proactively explore Android and Ubuntu UI environments, then generating executable tasks based on exploration trajectories and task guidelines. It significantly improves task success rates for mobile and desktop agents after SFT and RL.

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows

ScienceBoard constructs an Ubuntu virtual machine environment integrated with real scientific software and 169 cross-disciplinary tasks. By utilizing state-level execution evaluation, it examines the capabilities of multimodal computer-using agents in realistic scientific workflows. The results show that the success rate of even the strongest models remains significantly lower than that of humans.

SciNav: A General Agent Framework for Scientific Coding Tasks

SciNav embeds "pairwise relative judgment" into Top-K Tree Search (TKCTS), enabling LLM agents to solve scientific coding tasks under realistic conditions where predefined evaluation metrics are absent and search budgets are limited. By choosing branches, pruning, and expanding based on "which of the two is better" rather than "absolute scores for each solution," it significantly outperforms baselines such as Self-Debug and OpenHands on ScienceAgentBench and DA-Code.
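Selecting branches by relative judgment can be sketched as a round-robin tournament. The `prefer` callback stands in for an LLM judge answering "which of the two is better"; the win-count ranking and the function name are illustrative assumptions rather than the paper's search procedure.

```python
from itertools import combinations


def top_k_by_pairwise(candidates, prefer, k):
    """Keep the k candidates that win the most pairwise comparisons.

    prefer(a, b) returns True when a is judged better than b. In practice this
    would be an LLM call; here it is any Python callable.
    """
    wins = {index: 0 for index in range(len(candidates))}
    for i, j in combinations(range(len(candidates)), 2):
        winner = i if prefer(candidates[i], candidates[j]) else j
        wins[winner] += 1
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(wins, key=lambda index: wins[index], reverse=True)
    return [candidates[index] for index in ranked[:k]]
```

A full round-robin needs n(n-1)/2 judge calls, so under a limited search budget a practical system would likely compare only a subset of pairs.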

SCUBA: Salesforce Computer Use Benchmark

SCUBA is a benchmark for computer-use agents built on authentic Salesforce sandbox environments, containing 300 CRM tasks derived from real-world user interviews. It features resettable environments, fine-grained milestone evaluations, and human demonstrations. The study reveals a significant performance gap between open-source and closed-source models, as well as between browser-based and desktop-based agents (open-source success rates are below 5% in zero-shot settings, whereas closed-source models reach 39%; with demonstrations, success rates reach 50% while reducing costs and improving speed).

Sculptor: Empowering LLMs with Cognitive Agency via Active Context Management

This paper proposes the Sculptor framework, which equips LLMs with a reversible set of "Active Context Management (ACM)" tools—slicing, folding/summarizing/restoring, and precise search. This allows models to actively remove irrelevant information and focus on key content like a sculptor. Combined with a GSPO reinforcement learning method designed for dynamic contexts, the average score of a 13B model on multiple long-context benchmarks was improved from 39.4 to 73.8.

Search Self-Play: Pushing the Frontier of Agent Capability without Supervision

The same LLM plays both "proposer" and "solver" roles in a self-play framework for deep search tasks: the proposer generates increasingly difficult search queries with verifiable answers, while the solver attempts to answer them. RAG-based reverse verification using the proposer's retrieved pages ensures the validity of the questions. This end-to-end process requires no human annotation and significantly improves search agent performance across seven QA benchmarks (averaging +26.4 points for Qwen2.5-7B-Base).

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

M3-Agent converts real-time visual and audio streams into entity-centric multimodal long-term memory, utilizing a reinforcement learning-trained control model for multi-turn retrieval and reasoning. It outperforms prompt-based closed-source agents and online long video understanding baselines on M3-Bench and VideoMME-long.

Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People

The authors propose the Collaborative Battleship task to evaluate the information-seeking capabilities of language models. They design three Bayesian inference strategies (Bayes-Q/M/D) to enhance LM questioning, acting, and decision-making, enabling a weak model (Llama-4-Scout) to achieve superhuman performance (82% win rate) at approximately 1% of the cost of GPT-5.

SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents

SimuHome is proposed as a time-accelerated smart home simulator and a 600-episode benchmark based on the Matter protocol. It is the first to simulate the continuous impact of device operations on environmental variables and evaluate workflow scheduling capabilities. Findings indicate that workflow scheduling remains the most significant challenge for current LLM agents (including GPT-5.1).

SMAN-Bench: A Cross-System Benchmark for Mobile Agents under Single- and Multi-path, Ambiguous, and Noisy Tasks

SMAN-Bench transforms a 3-million-page graph-structured mobile operation corpus (Mobile3M) into a mobile agent benchmark. By using slot templates to automatically label multiple trajectories with the same instruction, it supports "offline multi-path" evaluation (where one instruction can have several correct solutions). It additionally constructs two subsets—one with advertising noise and one with ambiguous instructions—to systematically identify significant weaknesses in current VLM agents when facing real-world messy environments or requiring proactive clarification.

Social Agents: Collective Intelligence Improves LLM Predictions

This paper proposes Social Agents, which use different demographic/psychological personas to condition a single LLM into a group of independent evaluators in a "virtual society." Each agent scores and provides rationales for stimuli (ads/webpages/videos), and these scores are aggregated by mean to bring the "Wisdom of Crowds" into LLMs. On 11 behavior prediction tasks, it achieves improvements of up to 164% on low-level tasks and 24% on high-level tasks relative to single LLM baselines, with an average improvement of 21.5% across 9 models.
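The mean aggregation step can be sketched in a few lines. The persona names, the score values, and the `toy_score` function (which stands in for a persona-conditioned LLM call) are hypothetical and only illustrate the shape of the computation.

```python
from statistics import mean
from typing import Callable, Dict, List


def crowd_score(
    personas: List[str],
    stimulus: str,
    score_fn: Callable[[str, str], float],
) -> float:
    """Average the per-persona scores for one stimulus."""
    return mean(score_fn(persona, stimulus) for persona in personas)


# Toy scores standing in for persona-conditioned LLM outputs (hypothetical values).
TOY_SCORES: Dict[str, float] = {"student": 6.0, "retiree": 4.0, "engineer": 8.0}


def toy_score(persona: str, stimulus: str) -> float:
    return TOY_SCORES[persona]


aggregate = crowd_score(list(TOY_SCORES), "example ad", toy_score)
print(aggregate)
```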

Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents

The HPL framework is proposed to address the granularity mismatch in preference learning for long-horizon LLM agents. By utilizing three-level DPO (trajectory-level + step-level + action group-level) and dual-layer curriculum learning (sub-task complexity × sample difficulty), it significantly outperforms baselines such as ETO and IPR on ALFWorld/WebShop/InterCode-SQL (average 59.44 vs 55.43/55.49).

Spinning Straw into Gold: Relabeling LLM Agent Trajectories in Hindsight for Successful Demonstrations

This work relabels "failed/sub-optimal" trajectories generated by LLM agents using an auxiliary LLM to identify all goals actually achieved. Combined with "irrelevant action masking + demonstration re-weighting," these discarded trajectories are transformed into successful demonstrations for fine-tuning. This approach provides plug-and-play performance gains on ALFWorld, PlanCraft, and WebShop, exceeding baselines trained on full datasets while using only one-fourth of the ground-truth demonstrations.

SR-Scientist: Scientific Equation Discovery With Agentic AI

This paper proposes the SR-Scientist framework, elevating LLMs from simple equation proposers to autonomous AI scientists. By utilizing code interpreter tools for data analysis and equation evaluation, the agent autonomously discovers scientific equations through long-horizon interactions, with capabilities further enhanced via reinforcement learning.

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

This paper proposes ST-WebAgentBench, the first benchmark specifically designed to evaluate the safety and trustworthiness of Web Agents. Through a hierarchical policy framework and the Completion under Policy (CuP) metric, it reveals that current SOTA agents exhibit severe policy violations in enterprise scenarios.

STARK: Strategic Team of Agents for Refining Kernels

STARK reformulates GPU kernel optimization into an agent framework of "professional team collaboration + policy search on tree memory." By utilizing three specialized LLM agents (plan/code/debug), grounded instructions with anchors, role-customized dynamic context windows, and adaptive \(\epsilon\)-greedy search, it simulates the iterative tuning process of senior engineers. It achieves up to 16× runtime speedup compared to baseline agents on KernelBench.
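For readers unfamiliar with \(\epsilon\)-greedy search, the sketch below shows the basic selection rule plus one possible way to adapt \(\epsilon\). The summary does not say how STARK adapts it, so `adapt_epsilon` is a hypothetical schedule (explore less after an improvement, more after a stall), and all names here are assumptions.

```python
import random
from typing import Sequence


def epsilon_greedy(scores: Sequence[float], epsilon: float, rng: random.Random) -> int:
    """Pick the best-scoring index with probability 1 - epsilon, else a random one."""
    if rng.random() < epsilon:
        return rng.randrange(len(scores))
    return max(range(len(scores)), key=lambda i: scores[i])


def adapt_epsilon(epsilon: float, improved: bool, step: float = 0.1) -> float:
    """Hypothetical schedule: explore less after progress, more after a stall."""
    updated = epsilon - step if improved else epsilon + step
    return min(1.0, max(0.0, updated))


rng = random.Random(0)
node_scores = [0.1, 0.9, 0.3]  # e.g. speedups of candidate kernels in the tree memory
greedy_choice = epsilon_greedy(node_scores, epsilon=0.0, rng=rng)
next_epsilon = adapt_epsilon(0.3, improved=False)
```

With `epsilon=0.0` the rule is purely greedy and always returns the best-scoring node; raising \(\epsilon\) trades that exploitation for random exploration.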

TaskCraft: Automated Generation of Agentic Tasks

TaskCraft proposes the first workflow for fully automated generation of scalable, multi-tool, and verifiable agentic tasks. It starts by creating single-tool "atomic tasks," then incrementally increases difficulty through depth expansion (recursively finding supersets) and breadth expansion (merging subtasks), complemented by an efficient incremental verification process that only checks changes. This yields 41k tool-intensive tasks, and SFT/RL training on this data achieves SOTA performance across four agent benchmarks.

Terminal-Bench: Benchmarking Agents on Difficult, Real-World Tasks in the Command Line Interface

Terminal-Bench introduces an agent evaluation framework structured around a "terminal environment + Docker container + test verification + oracle solution" unit. It releases Terminal-Bench 2.0, a dataset of 89 hard tasks audited through hundreds of person-hours. Results demonstrate that even the most powerful frontier models/agents (GPT-5.2 + Codex CLI) achieve a solve rate of only ~63%, while small models settle around 15%. Based on these results, a failure mode taxonomy is provided to guide subsequent improvements.

Test-Time Adaptation for LLM Agents via Environment Interaction

Addressing generalization failures of LLM Agents in unfamiliar websites or toolsets, this paper decomposes failures into "syntactic mismatch" and "semantic mismatch." These are resolved via an online-learned lightweight adaptation vector (Syntactic Alignment, SA) and a persona-driven exploration to build a verbalized world model in-context (Dynamics Grounding, DG). This process requires no labeled trajectories or fine-tuning, increasing the success rate on the WebArena multi-site split from 2% to 23%.

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

This paper introduces Toolathlon, a language agent benchmark covering 32 software applications, 604 tools, and 108 tasks. It emphasizes realistic and diverse environment states and long-horizon multi-step interactions (averaging ~20 tool calls). The strongest model, Claude-4.5-Sonnet, achieves only a 38.6% success rate.

ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction

ToolACE-MT replaces the "turn-by-turn autoregressive" paradigm of multi-agent simulation for multi-turn tool-calling data with a non-autoregressive pipeline: "building the skeleton first, iterative refinement, and finally offline verification." It generates agent dialogue data with higher coherence and diversity using fewer API calls. An 8B model trained on ToolACE-MT data improved its multi-turn accuracy on BFCL-v3 from 9.25% to 40.25%.

ToolTree: Efficient LLM Agent Tool Planning via Dual-Feedback Monte Carlo Tree Search and Bidirectional Pruning

ToolTree models multi-tool calling for LLM agents as Monte Carlo Tree Search (MCTS), using two LLM scoring signals, a pre-execution estimate and a post-execution empirical evaluation, to guide both selection and pruning. Under a fixed compute budget, this gives agents both foresight and the ability to backtrack based on real feedback. ToolTree achieves approximately 10% higher accuracy than SOTA search paradigms across four tool planning benchmarks, with the highest efficiency.

ToolWeaver: Weaving Collaborative Semantics for Scalable Tool Use in Large Language Models

ToolWeaver is proposed to represent each tool as a hierarchical discrete code sequence rather than a single token through collaboration-aware vector quantization. This achieves logarithmic vocabulary expansion (covering 47,000+ tools with only ~512 new tokens), comprehensively outperforming the ToolGen baseline on ToolBench while reducing language model perplexity degradation from 16.5x to 4x.
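The vocabulary arithmetic behind hierarchical codes can be checked with a short sketch. The split used below (4 levels of 128 codes each, giving 512 new tokens) is an assumed configuration chosen to match the "~512 new tokens" figure; the summary does not state the actual codebook layout.

```python
from typing import Tuple


def code_capacity(codebook_size: int, depth: int) -> Tuple[int, int]:
    """Return (new_tokens, addressable_tools) for a hierarchical code.

    Each tool is a sequence of `depth` codes, one per level, and each level
    has its own codebook of `codebook_size` tokens.
    """
    new_tokens = codebook_size * depth
    addressable_tools = codebook_size ** depth
    return new_tokens, addressable_tools


# Hypothetical split of ~512 new tokens: 4 levels of 128 codes each.
tokens, tools = code_capacity(codebook_size=128, depth=4)
print(tokens, tools)
```

The token count grows linearly in the depth while the number of addressable tools grows exponentially, which is why the vocabulary only needs to grow logarithmically with the tool count. A one-token-per-tool scheme would instead need one new token for each of the 47,000+ tools.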

Towards Multimodal Data-Driven Scientific Discovery Powered by LLM Agents

This paper proposes MoSciBench—the first benchmark for "multimodal, repository-level" data-driven scientific discovery. Starting from peer-reviewed papers, an 88-task cross-modal hypothesis testing dataset was constructed via a four-stage pipeline. Systematic evaluation reveals that even the strongest agent (o4-mini + ReAct) achieves only 48.9% accuracy, with over 60% of failures stemming from cross-modal alignment, while lightweight workflow scaffolding improves accuracy by an average of 5.7%.

Towards Scalable Oversight via Partitioned Human Supervision

A scalable oversight framework based on partitioned human supervision is proposed. When tasks exceed the capability of a single expert, complementary labels (excluding incorrect options) provided by domain experts are used to construct an unbiased accuracy estimator, enabling the evaluation and training of AI systems without requiring full ground-truth annotations.
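One standard way to turn complementary labels into an accuracy estimate is sketched below. It assumes each complementary label is drawn uniformly from the incorrect options, in which case the probability that a prediction matches it is \((1 - \text{acc})/(K - 1)\) for \(K\) options. The paper's estimator may differ in detail; the function and variable names are illustrative.

```python
from typing import Sequence


def accuracy_from_complementary(
    predictions: Sequence[int],
    complementary_labels: Sequence[int],
    num_options: int,
) -> float:
    """Estimate accuracy when each item only has one known-wrong option.

    Assumes the complementary label is drawn uniformly from the
    num_options - 1 incorrect options, so
    P(prediction == complementary) = (1 - accuracy) / (num_options - 1).
    """
    if len(predictions) != len(complementary_labels):
        raise ValueError("predictions and complementary_labels differ in length")
    hits = sum(p == c for p, c in zip(predictions, complementary_labels))
    hit_rate = hits / len(predictions)
    return 1.0 - (num_options - 1) * hit_rate


# Four 4-option questions; the model picks a known-wrong option once.
estimate = accuracy_from_complementary([0, 1, 2, 3], [0, 2, 3, 1], num_options=4)
print(estimate)
```

On small samples the estimate can fall below zero, since it is unbiased rather than constrained to the valid range.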

Trade in Minutes! Rationality-driven Agentic System for Quantitative Financial Trading

TiMi (Trade in Minutes) is a "rationality-driven" multi-agent quantitative trading system. It utilizes three specialized LLMs (semantic analysis, programming, and mathematical reasoning) to offline refine trading strategies into standalone programmable trading bots. These bots are then deployed for minute-level live trading, completely decoupling "heavy reasoning" from "fast execution." It achieves stable returns, low latency, and superior risk control across 200+ stock indices and crypto pairs.

TRAJECT-Bench: A Trajectory-Aware Evaluation Benchmark for Agent Tool Calling

TRAJECT-Bench constructs 5,670 "parallel/serial" tool calling trajectories and "simple/hard" dual-difficulty queries using 1,228 executable real APIs. It refines evaluation from "whether the final answer is correct" to trajectory-level diagnostics focusing on "whether tools were selected correctly, parameters filled accurately, and sequence/dependencies met," thereby revealing specific failure modes of LLMs in tool calling (similarity confusion, parameter blind selection) and the scaling bottleneck from "short trajectories to medium-length trajectories."
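A rough sketch of what trajectory-level diagnostics can look like, as opposed to checking only the final answer. The three checks mirror the questions quoted above, but the data layout, the function name, and the example tool names are hypothetical and not taken from the benchmark.

```python
from typing import Any, Dict, List, Tuple

Call = Tuple[str, Dict[str, Any]]  # (tool name, arguments)


def trajectory_report(predicted: List[Call], reference: List[Call]) -> Dict[str, bool]:
    """Compare a predicted tool-call trajectory with a reference one."""
    pred_names = [name for name, _ in predicted]
    ref_names = [name for name, _ in reference]
    return {
        # Same set of tools, regardless of order.
        "tools_selected": sorted(pred_names) == sorted(ref_names),
        # Same tools in the same order (matters for serial dependencies).
        "order_correct": pred_names == ref_names,
        # Every reference call appears in the prediction with identical arguments.
        "arguments_correct": len(predicted) == len(reference)
        and all(call in predicted for call in reference),
    }


reference = [("search_flights", {"to": "NRT"}), ("book_hotel", {"city": "Tokyo"})]
predicted = [("book_hotel", {"city": "Tokyo"}), ("search_flights", {"to": "HND"})]
report = trajectory_report(predicted, reference)
print(report)
```

Here the right tools were chosen, but both the order and one argument are wrong, a distinction that a final-answer check alone would not surface.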

Tree Search for LLM Agent Reinforcement Learning

The authors replace "independent chain sampling" in multi-turn agent RL with "agent-step tree search sampling." By sharing prefixes, approximately \(1.5\times\) trajectories are sampled under a fixed token/tool-call budget. The tree branching structure implicitly converts sparse outcome rewards into step-level process supervision (theoretically equivalent to step-level DPO), consistently outperforming chain-based GRPO across 11 QA datasets.
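The budget saving from shared prefixes can be illustrated by counting agent steps. In this sketch a node is simply the list of its children, and the example tree (two shared steps, then two 2-step continuations) is a toy shape chosen for clarity; it is not meant to reproduce the \(1.5\times\) figure reported above.

```python
from typing import List


def count_steps(node: list) -> int:
    """Steps generated once: every non-root node is one agent step."""
    return sum(1 + count_steps(child) for child in node)


def leaf_depths(node: list, depth: int = 0) -> List[int]:
    """Depth of every leaf, i.e. the length of each complete trajectory."""
    if not node:
        return [depth]
    depths: List[int] = []
    for child in node:
        depths.extend(leaf_depths(child, depth + 1))
    return depths


def path(length: int, tail: list) -> list:
    """Wrap `tail` under `length` sequential steps."""
    node = tail
    for _ in range(length):
        node = [node]
    return node


# Two shared steps, then a branch into two 2-step continuations.
branch = [path(1, []), path(1, [])]
tree = path(2, branch)

tree_cost = count_steps(tree)         # steps actually generated
chain_cost = sum(leaf_depths(tree))   # cost if each trajectory were sampled independently
print(tree_cost, chain_cost)
```

Both schemes yield two complete 4-step trajectories, but the tree generates the shared prefix once, so it spends 6 steps where independent chains would spend 8.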

TusoAI: Agentic Optimization for Scientific Methods

TusoAI is an agent specifically designed for "scientific computing method development." Given a task description, data, and an evaluation function \(h(\cdot)\), it organizes domain knowledge into a knowledge tree and utilizes hierarchical planning with Bayesian updates combined with diagnostic fine-grained optimization to iteratively improve solutions within a candidate pool. TusoAI consistently outperforms expert methods, MLE agents, and general scientific agents across 11 scientific tasks. Furthermore, it improved SOTA methods on two genetics challenges and discovered new biological insights missed by previous methods.

Type-Compliant Adaptation Cascades: Adapting Programmatic LM Workflows to Data

This paper recasts workflows composed of multiple LLM calls and deterministic logic as "typed unnormalized probabilistic programs." By using lightweight PEFT adapters as learnable parameters and a TACSTaR (MC-EM) training algorithm—proven unbiased even when dropping the partition function gradient—the entire pipeline can be trained end-to-end via gradients. This approach significantly outperforms discrete prompt optimization baselines like DSPy on structured reasoning tasks such as FinQA and MGSM-SymPy.

UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

This paper upgrades "natural language instructions" from passive inputs to active reasoning paths (Instruction-as-Reasoning). It uses a data pipeline to clean noisy annotations and expand each instruction into four perspectives: appearance, function, location, and intent. Subsequently, SFT is employed to teach the model to treat "rewriting instructions into a specific perspective" as explicit reasoning, followed by GRPO to enable the model to autonomously select or combine the most effective perspectives. The resulting UI-Ins-7B/32B achieves SOTA on five GUI grounding benchmarks (UI-I2E-Bench 87.3%, ScreenSpot-Pro 57.0%) and attains a 74.1% success rate on the AndroidWorld online agent.

Unlocking Long-Horizon Agentic Search with Large-Scale End-to-End RL

ASearcher is a search agent trained on a single QwQ-32B model using pure end-to-end RL, without relying on distillation data from commercial large models or acting as an external plugin tool. By "automatically synthesizing high-difficulty QA data + setting the tool-call limit per trajectory to 128 steps for long-horizon exploration," the model spontaneously develops expert-level search behaviors such as uncertainty analysis and conflict checking. Using only basic search tools, it achieves performance comparable to commercial Deep Research systems on GAIA/xBench/Frames.

VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos

This paper proposes VideoAgentTrek, which employs an inverse dynamics module, VIDEO2ACTION, to automatically recover operation trajectories with precise action parameters (1.52 million steps) from 39,000 unlabeled YouTube screen-recorded tutorials. Through a two-stage process of "continual pre-training + supervised fine-tuning," the OSWorld-Verified success rate of computer use agents is increased from 9.3% to 15.8% (a 70% relative improvement).
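As a quick check of the relative figure, using only the two success rates quoted above:

\[
\frac{15.8 - 9.3}{9.3} = \frac{6.5}{9.3} \approx 0.70
\]

so the 6.5-point absolute gain corresponds to roughly a 70% relative improvement.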

VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning

VideoMind is proposed as a role-based video-language agent framework that achieves temporal-grounded video reasoning through the collaboration of four roles: Planner, Grounder, Verifier, and Answerer. The core innovation is the Chain-of-LoRA mechanism, which enables seamless role switching on a unified base model by toggling LoRA adapters. The 2B model variant outperforms GPT-4o and Gemini-1.5-Pro.

ViMo: A Generative Visual GUI World Model for App Agents

ViMo is the first "visual" GUI world model—given a current mobile screen screenshot and a user action, it directly generates the future GUI image after the action is executed. To solve the long-standing problem of blurry small text in pixel-level generation, it decouples the interface into "graphics" and "text" streams for separate generation. Using a symbolic placeholder representation called STR, a diffusion model handles the graphical layout while an LLM fills the placeholder boxes with text. The predicted results of the world model are then fed to an App agent for more accurate action selection.

VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications

VitaBench abstracts three major life service scenarios—food delivery, in-store consumption, and online travel—into a sophisticated "life service" simulation environment containing 66 tools and 400 tasks. It replaces domain policy documents with tool dependency graphs to force autonomous exploration by agents and employs a rubric sliding window evaluator for scoring. Results indicate that even the strongest models achieve only a 30% success rate on cross-scenario tasks.

WALT: Web Agents that Learn Tools

WALT reverse-engineers functions already designed into websites (search, filter, sort, post, CRUD) into a set of deterministically callable tools. This shifts web agents from "reasoning step-by-step how to click and fill" to "directly calling search(query)," achieving SOTA on VisualWebArena (52.9%) and WebArena (50.1%) with fewer steps and lower reliance on LLM inference.

WARC-Bench: Web Archive based Benchmark for GUI Subtask Executions

This paper introduces WARC-Bench, which uses Web Archive files to "freeze" real websites into sandbox-replayable interactive environments. It constructs an evaluation set of 438 tasks focusing on "medium-granularity subtasks" (e.g., date selection, slider dragging, scrolling containers to extract info) with automated scoring via programmatic verifiable rewards. Experiments show that even the strongest closed-source models achieve only 64.8% success, while an open-source 72B model trained by the authors via SFT + RLVR reaches 52.3%, surpassing most frontier models.

Web-CogReasoner: Towards Multimodal Knowledge-Induced Cognitive Reasoning for Web Agents

By drawing on Bloom's Taxonomy, this paper decouples Web Agent capabilities into two stages: "Knowledge Content Learning" and "Cognitive Processes." It constructs a three-layered Web-CogKnowledge system (Factual/Conceptual/Procedural), a supporting Web-CogDataset, and a Web-CogBench. Through three-stage curriculum learning and knowledge-induced CoT, Web-CogReasoner is trained. With only 7B parameters, it surpasses open-source agents of the same scale on multiple web navigation benchmarks and demonstrates strong generalization on unseen tasks due to structured knowledge.

WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents

WebArbiter proposes a reasoning-first, principle-guided process reward model (WebPRM) that formalizes reward modeling as a text generation task. Through a two-stage training process involving reasoning distillation and reinforcement learning, the 7B model outperforms GPT-5 by 9.1 percentage points on WebPRMBench.

WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents

WebFactory redefines "training GUI agents" as a problem of distilling internet knowledge compressed within LLMs into executable grounded actions. By using a fully automated closed-loop pipeline—high-fidelity offline website synthesis via LLMs → knowledge-driven verifiable task generation → trajectory collection via strong LLMs → RL training with decomposed rewards—a 3B agent trained on only 10 synthetic websites achieves performance levels comparable to agents trained on equivalent human-annotated data and generalizes to real-world websites such as Amazon, Airbnb, and Booking.

WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning

WebSailor-V2 utilizes a complete post-training pipeline consisting of "dense cyclic knowledge graph data synthesis + sim/real dual-environment RL." This pipeline trains a 30B (3B active) MoE web agent to achieve 35.3 on BrowseComp-EN and 30.6 on HLE, surpassing the 671B DeepSeek-V3.1 and bringing open-source deep research agents close to the level of closed-source systems.

WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection

WebSeer trains a 14B search agent using a two-stage process: "rejection sampling to construct cold-start data with reflection trajectories" and "Self-Reflection Reinforcement Learning (SRRL) allowing multiple answer submissions per turn." This enables the model to actively extend tool chains and backtrack or rewrite queries when uncertain, achieving SOTA results of 72.3% and 90.0% on HotpotQA and SimpleQA, respectively.

WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research

WebWeaver utilizes a "Planner + Writer" dual-agent system to simulate the human research process: the Planner iteratively optimizes a cited outline during searching, while the Writer performs "evidence retrieval-writing-pruning" section by section. It achieves SOTA on DeepResearch Bench, DeepConsult, and DeepResearchGym, with a citation accuracy of up to 92%.

When AI Agents Collude Online: Financial Fraud Risks by Collaborative LLM Agents on Social Platforms

The authors developed MAFF-Bench, a multi-agent simulation benchmark capable of simulating the full life cycle of financial fraud on social platforms. It shows that LLM agents execute fraud instructions with almost no refusal and that, once they are allowed to collude privately, the collective fraud success rate far exceeds the sum of individual capabilities (\(R_{pop}\) jumps from 17% to 41%). The study systematically evaluates the effectiveness and "adaptation" risks of mitigation measures across three layers: content, agent, and society.

Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents

This paper systematically introduces and investigates the concept of "Misevolution"—the phenomenon where self-evolving LLM Agents deviate from intended directions during autonomous improvement. This leads to emerging risks such as safety alignment degradation and vulnerability introduction across four evolutionary paths: model, memory, tool, and workflow. Even top-tier LLMs like Gemini-2.5-Pro are susceptible to these risks.

Zephyrus: An Agentic Framework for Weather Science

This paper constructs the first agentic framework for weather science: using a unified Python tool environment (ZephyrusWorld), LLMs solve tasks by writing code to call weather data, forecasting models, and climate simulators. It includes two execution strategies (Direct / Reflective) and a benchmark (ZephyrusBench) containing 2,230 problems across 49 task categories. Results show an accuracy improvement of up to 44 percentage points over text-only baselines, though difficult tasks remain challenging.