🦾 LLM Agent

📷 CVPR2026 · 39 paper notes

🔥 Top topics: Agents ×14 · LLM ×5 · Multimodal/VLM ×5 · Reasoning ×4 · Segmentation ×2

AdapAction: Adaptive Target Action Backdoor Attack against GUI Agents

For MLLM-driven GUI agents, this work replaces traditional "trigger \(\rightarrow\) fixed action" backdoors with "trigger \(\rightarrow\) context-adaptive malicious action." An adversarial teacher LLM generates structured malicious reasoning trajectories, which are distilled into the target agent via SFT. This enables the agent, when triggered, to autonomously select a malicious operation that appears perfectly reasonable given the current interface and instruction, pushing the attack success rate to 100% while bypassing multi-principle LLM defenses and maintaining normal task utility.

AeroAgent: A Vision-Physics-Decision Framework for Aerodynamic Vehicle Design

AeroAgent integrates "text/image-to-3D car generation → second-level drag and flow field prediction via the AeroFormer surrogate model → planner-driven propose-evaluate-refine closed-loop editing" into a unified framework. It utilizes high-fidelity CFD only for final top-K candidate verification, achieving an average drag reduction of 2–12% within 5 iterations while reducing high-fidelity CFD calls by 50–80%.

Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection

A multi-agent system driven by LLMs is used to "act" as both forgers and social network observers, simulating the complete life cycle of face forgery from creation to propagation. It synthesizes training data with text-image consistency annotations, leading to significant performance gains for deepfake detectors in cross-domain and cross-algorithm real-world scenarios (e.g., Celeb-DF AUC improved from the 70% range to 87.1%).

BAMI: Training-Free Bias Mitigation in GUI Grounding

This paper diagnoses GUI grounding errors using the MPD attribution method, identifying two main types of inductive biases: precision bias and ambiguity bias. It proposes BAMI, a training-free inference framework that eliminates precision bias through "coarse-to-fine focusing" and mitigates ambiguity bias via "candidate selection." BAMI improves the accuracy of TianXi-Action-7B on ScreenSpot-Pro from 51.9% to 57.8%.
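
As a rough illustration of the "coarse-to-fine focusing" idea, the sketch below runs a grounding model twice: once on the full screen, then once on a crop centered on the first guess. The `predict` callable (returning a normalized click point for a region), the crop ratio, and the single refinement pass are all illustrative assumptions, not BAMI's actual procedure.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]


def crop_around(point: Point, full: Box, ratio: float) -> Box:
    """Box of `ratio` times the size of `full`, centered on `point`, clamped inside `full`."""
    left, top, right, bottom = full
    w, h = (right - left) * ratio, (bottom - top) * ratio
    x0 = min(max(point[0] - w / 2, left), right - w)
    y0 = min(max(point[1] - h / 2, top), bottom - h)
    return (x0, y0, x0 + w, y0 + h)


def to_pixels(norm: Point, box: Box) -> Point:
    """Map a normalized (0..1) point inside `box` back to absolute pixel coordinates."""
    left, top, right, bottom = box
    return (left + norm[0] * (right - left), top + norm[1] * (bottom - top))


def coarse_to_fine(predict: Callable[[Box], Point], screen: Box, ratio: float = 0.5) -> Point:
    """Hypothetical two-pass grounding: coarse guess on the full screen, refined guess on a crop."""
    coarse = to_pixels(predict(screen), screen)
    focus = crop_around(coarse, screen, ratio)
    return to_pixels(predict(focus), focus)
```

The key detail is the coordinate bookkeeping: the second prediction is relative to the crop, so it has to be mapped back to full-screen pixels before it can be used as a click target.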

CGL: Advancing Continual GUI Learning via Reinforcement Fine-Tuning

To address the "learning new while forgetting old" problem that GUI agents face under frequent app updates, this paper observes that SFT learns quickly but overwrites old knowledge, while RL (GRPO) resists forgetting but learns slowly. It therefore proposes the CGL framework, which integrates SFT and GRPO through "error-aware routing + entropy-regulated weighting + conditional gradient surgery" and achieves the highest accuracy and near-zero forgetting on the self-built AndroidControl-CL benchmark.
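
One common form of gradient surgery is a PCGrad-style projection: when two gradients conflict (negative inner product), the conflicting component is removed from one of them. The sketch below applies that rule to an SFT gradient and an RL gradient. Treating CGL's "conditional gradient surgery" as this projection is an assumption made for illustration; the paper's exact condition and update may differ, and the function names are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def conditional_surgery(g_sft: Sequence[float], g_rl: Sequence[float]) -> List[float]:
    """PCGrad-style rule (assumed, not taken from the paper).

    If the SFT gradient conflicts with the RL gradient, project the SFT
    gradient onto the plane orthogonal to the RL gradient; otherwise
    return it unchanged.
    """
    inner = dot(g_sft, g_rl)
    if inner >= 0:
        return list(g_sft)  # no conflict, nothing to remove
    scale = inner / dot(g_rl, g_rl)  # g_rl is nonzero here because inner < 0
    return [s - scale * r for s, r in zip(g_sft, g_rl)]
```

After the projection the adjusted SFT gradient has zero inner product with the RL gradient, so a step along it no longer directly opposes the RL update.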

DRAMA: Next-Gen Dynamic Orchestration for Resilient Multi-Agent Ecosystems in Flux

DRAMA uniformly abstracts agents and tasks in embodied multi-agent systems as "resource entities," utilizing an affinity matrix and a modified Hungarian algorithm for event-triggered dynamic scheduling. Complemented by a "Trust Chain" for decentralized fault takeover, the framework ensures uninterrupted task completion during agent dropout, addition, or recovery. In VirtualHome-Social, it achieves fewer average steps, lower conflict rates, and higher throughput compared to SOTA.
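
To make the affinity-matrix scheduling concrete, here is a small sketch that picks the agent-to-task assignment with the highest total affinity and recomputes it when an agent drops out. For brevity it uses exhaustive search over permutations rather than the modified Hungarian algorithm the summary mentions, so it only suits tiny examples; the affinity values and function name are made up for illustration.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def best_assignment(affinity: List[List[int]]) -> Tuple[Dict[int, int], int]:
    """Return ({agent_index: task_index}, total_affinity) maximizing total affinity.

    Brute force stand-in for a Hungarian-style solver; assumes the number of
    agents does not exceed the number of tasks.
    """
    n_agents, n_tasks = len(affinity), len(affinity[0])
    best, best_score = None, float("-inf")
    for perm in permutations(range(n_tasks), n_agents):
        score = sum(affinity[agent][task] for agent, task in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return dict(enumerate(best)), best_score


# Illustrative affinities: rows are agents, columns are tasks.
affinity = [[9, 1, 3],
            [2, 8, 4],
            [3, 2, 7]]
plan, total = best_assignment(affinity)

# Event-triggered rescheduling: agent 1 drops out, so its row is removed
# and the assignment is recomputed for the remaining agents.
remaining = [row for i, row in enumerate(affinity) if i != 1]
new_plan, new_total = best_assignment(remaining)
```

In the rescheduled plan the indices refer to rows of `remaining`, so index 1 there corresponds to the original agent 2.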

Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

Ego2Web is the first benchmark to combine egocentric video perception with web agent execution. It comes with a semi-automatic data construction pipeline and the Ego2WebJudge automatic evaluation framework. Experiments reveal a significant gap for current top agents in transferring from real-world visual perception to online actions, with a maximum success rate of only 48.2%.

EpiAgent: An Agent-Centric System for Ancient Inscription Restoration

EpiAgent is the first agent system for ancient inscription restoration. An LLM central planner coordinates multimodal analysis, specialized restoration tools, and iterative self-optimization, and the system outperforms existing methods in both textual authenticity and visual fidelity.

Experience Transfer for Multimodal LLM Agents in Minecraft Game

This paper proposes Echo—a "transfer-oriented" memory framework that explicitly decomposes reusable knowledge into five transfer dimensions: structure, attribute, process, function, and interaction. These are encapsulated into a unified Contextual State Descriptor (CSD). Using In-Context Analogical Learning (ICAL), the agent actively infers and verifies new tasks from the memory bank. In Minecraft "from-scratch" scenarios, this increases item unlocking speed by 1.3×–1.7× and leads to a "chain burst unlocking" phenomenon.

GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents

This work proposes GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI Agents. It covers 201 mainstream Chinese Apps across 4 device types, utilizing a "Foundation + Application" two-layer structure to conduct fine-grained diagnosis across five dimensions: perception, planning, reflection, execution, and evaluation. Experiments on 20 representative models reveal that current models still exhibit significant weaknesses in reflection and self-evaluation.

HATS: Hardness-Aware Trajectory Synthesis for GUI Agents

This paper proposes HATS, a hardness-aware trajectory synthesis framework. Through a closed-loop mechanism of hardness-driven exploration and alignment-guided refinement, it focuses on collecting and correcting training trajectories with semantically ambiguous actions, significantly enhancing the generalization capabilities of GUI agents in complex real-world scenarios.

HAVEN: Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search

HAVEN proposes a unified framework featuring audiovisual entity cohesion + hierarchical indexing + agentic search. By utilizing speaker identity as a cross-modal consistency signal, it constructs a four-level hierarchical database (Global-Scene-Clip-Entity), achieving SOTA with an overall accuracy of 84.1% on LVBench.

History to Future: Evolving Agent with Experience and Thought for Zero-shot Vision-and-Language Navigation

EVONAV equips LLM agents for Zero-shot Vision-and-Language Navigation in Continuous Environments (VLN-CE) with a "Review History + Predict Future" feedback loop. By using Future Chain-of-Thought (F-CoT) to predict future actions and landmarks for estimating navigation progress, and History Chain-of-Experience (H-CoE) to summarize completed trajectories and traversed scenes into an online retrievable experience bank, the two components evolve decision-making from "naive direct reasoning" to "continuous error correction with feedback." On R2R-CE, it outperforms Open-Nav (using the same LLM) by +20% SR, +21% OSR, and +17% SPL, while being more time and VRAM efficient.
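
For readers unfamiliar with the navigation metrics quoted above, the sketch below computes SR (success rate) and SPL (success weighted by path length) using the definition that is standard in the embodied navigation literature: each successful episode contributes the ratio of the shortest path length to the longer of the shortest path and the path actually taken. The episode tuples are invented for illustration, and OSR is omitted because it needs per-step trajectory data.

```python
from typing import List, Tuple

# (success, shortest_path_length, agent_path_length); lengths assumed > 0
Episode = Tuple[bool, float, float]


def success_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by path length, averaged over all episodes."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


episodes = [
    (True, 10.0, 10.0),   # success along the shortest path: contributes 1.0
    (True, 10.0, 20.0),   # success with a detour twice as long: contributes 0.5
    (False, 10.0, 5.0),   # failure: contributes 0.0
]
```

SPL is never higher than SR, since every successful episode contributes at most 1; the gap between them reflects how much longer than necessary the successful paths were.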

Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration

To address the issues of web agents relying on manual pipelines or expensive expert trajectories and struggling to adapt to dynamic web pages, the authors propose SCALE. This method enables a single MLLM to play three adversarial roles—Selector, Predictor, and Judger—to automatically discover and expand its own cognitive boundaries via "prediction errors." Combined with SCALE-Hop graph exploration for global planning, it achieves average task success rate improvements of 231.8% for InternVL2.5-8B and 176.3% for Qwen2.5-VL-7B, while generating the SCALE-20k dataset.

Learning to Select Visual Tools from Experience

This paper proposes VisTA (VisualToolAgent), which trains an agent using reinforcement learning to autonomously select the most useful combinations from 23 heterogeneous visual tools based solely on "correctness" feedback. These tools are provided to a frozen VLM reasoner. VisTA significantly outperforms training-free and fine-tuned baselines on ChartQA, Geometry3K, MathVerse, and BlindTest. Furthermore, the learned selection strategy can be directly transferred to stronger reasoners (e.g., GPT-4o) without retraining.

MMBench-GUI: A Unified Hierarchical Evaluation Framework for Multi-Platform GUI Agents

MMBench-GUI organizes GUI agent evaluation into four progressive levels: "Content Understanding → Element Grounding → Single-app Automation → Cross-app Collaboration." It covers 8,000+ tasks across six platforms (Windows, macOS, Linux, iOS, Android, Web) and introduces the EQA metric to evaluate both success rate and action redundancy. The study systematically reveals six diagnostic findings, notably that precise visual grounding is the critical factor for success and that almost all agents exhibit significant step redundancy.

ModularAgent: A Task-Aware Modular Framework for Joint Optimization of Multimodal Large Language Models and World Models

ModularAgent enables bidirectional coupling between Multimodal Large Language Models (MLLMs) and World Models (WMs) in latent space. The forward path injects MLLM semantics into the WM to guide "imagination," while the backward path utilizes dense text-aligned rewards generated by the WM to refine the MLLM semantic space. By employing task-aware layer-wise dual-expert routing to mitigate multi-task interference, it outperforms baselines such as GenRL and FOUNDER in DeepMind Control Suite (DMC) locomotion multi-task learning and cross-environment transfer.

ORCA: Orchestrated Reasoning with Collaborative Agents for Document Visual Question Answering

ORCA transforms single-page Document Visual Question Answering (DocVQA) into a five-stage multi-agent pipeline. It uses a thinking agent to decompose questions into reasoning paths, routes them via content types to orchestrate nine specialized agents, and triggers pressure tests and adversarial debates only when expert answers conflict with the thinking agent. This approach surpasses single-model SOTA performance across three benchmarks while restricting heavy computation (debates) to only 8.3% of samples.

OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models

Addressing the lack of reliable step-by-step error detectors for "screen-based" GUI agents, OS-Oracle introduces a data pipeline that automatically synthesizes four types of typical error actions from positive trajectories. This generates 310,000 critic samples used to train a 7B critic model through two-stage SFT and Consistency-Preserving GRPO (CP-GRPO). The work also provides OS-Critic Bench, the first human-annotated critic benchmark covering Mobile, Web, and Desktop platforms. The model achieves SOTA among open-source models and demonstrably improves the success rate of the UI-TARS agent.

Paper2Figure: A Multi-Agent Collaborative System for Figure Generation Towards Academic Research Paper

Paper2Figure utilizes a dual multi-agent system comprising "Generator Agents + Refiner Agents." It first translates text descriptions of papers into a self-developed structured intermediate language, FigScript, used for rendering. A closed-loop Critic-Refine agent system then performs self-correction. Coupled with an interactive Web editor that returns control to the author, the system outperforms SVG/Mermaid code generation and text-to-image baselines on the self-built Paper2Figure Bench in accuracy, aesthetics, and completeness (+14.1% overall).

ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices

Addressing the limitations of current mobile agents as passive command executors, this paper proposes ProactiveMobile—a large-scale benchmark that formalizes "proactive intelligence" as "inferring potential user intent from 4D device context and generating executable function sequences" (3,660 multi-intent samples / 14 scenarios / 63 APIs). Equipped with objectively evaluable SR/FTR metrics, it demonstrates that proactivity is a learnable capability currently missing in MLLMs (fine-tuned Qwen2.5-VL-7B achieves a 20.82% success rate, surpassing o1's 17.02%).

RAAS: LLM Agentic System Architecture Search with GRPO

RAAS introduces the concept of "group relative evaluation" into agentic supernet architecture search: multiple candidate architectures compete on the same problem (CAO), with each architecture undergoing multiple independent trials to calculate a trimmed mean (MTAS). By using zero-centered relative advantage signals to update the generative distribution, it decouples "architecture quality" from "problem difficulty/execution randomness," consistently outperforming the strongest baseline MaAS (average +5.41) across six benchmarks including MATH, HumanEval, and GAIA.
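
The scoring idea described above can be sketched in a few lines: each candidate architecture gets several trial scores on the same problem, a trimmed mean dampens execution randomness, and subtracting the group average yields zero-centered relative advantages. The trim size, the function names, and the absence of variance normalization are assumptions for illustration; the summary does not specify how RAAS computes these quantities in detail.

```python
from typing import List, Sequence


def trimmed_mean(scores: Sequence[float], trim: int = 1) -> float:
    """Mean after dropping the `trim` lowest and `trim` highest scores.

    Falls back to the plain mean when there are too few scores to trim.
    """
    ordered = sorted(scores)
    kept = ordered[trim:len(ordered) - trim] if len(ordered) > 2 * trim else ordered
    return sum(kept) / len(kept)


def relative_advantages(trials_per_arch: List[Sequence[float]], trim: int = 1) -> List[float]:
    """Zero-centered advantage of each architecture relative to the group on one problem."""
    means = [trimmed_mean(trials, trim) for trials in trials_per_arch]
    baseline = sum(means) / len(means)
    return [m - baseline for m in means]


# Three hypothetical architectures, five independent trials each (1 = solved).
trials = [
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
]
advantages = relative_advantages(trials)
```

Because the baseline is computed per problem, a hard problem on which every architecture scores low yields advantages near zero for all of them, which is one way to separate architecture quality from problem difficulty.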

REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting

REALM leverages the reasoning capabilities of MLLMs through a global-to-local spatial positioning strategy to perform open-world 3D reasoning segmentation on 3DGS. It handles implicit instructions without 3D post-training, achieving 92.88% mIoU on LERF (surpassing baselines by over 40 percentage points) while supporting editing tasks such as object removal, replacement, and style transfer.

ReFAct: Empowering Multimodal Web Agents with Visual and Context Focusing

ReFAct enables multimodal web search agents to actively manage cross-modal contexts: it employs Grounding tools to crop highly relevant image regions to counter "visual noise," uses Defocus/Refocus external memory operations to compress and retrieve long text on demand to counter "retrieval noise," and is fine-tuned via GRPO reinforcement learning on a custom GroundedVQA dataset designed for high-noise scenarios. ReFAct-7B significantly outperforms RL agents of the same scale on high-noise benchmarks.

Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation

Refer-Agent decomposes Referring Video Object Segmentation (RVOS) into a step-by-step reasoning pipeline of "frame selection → intent analysis → object localization → mask generation." It further integrates a dual-stage Chain-of-Reflection (Existence Reflection + Consistency Reflection) composed of a Questioner-Responder pair to alternate between reasoning and reflection for self-correction. Without any training and using only a 9B open-source MLLM, it outperforms SFT methods and GPT-4o-based zero-shot methods across five RVOS benchmarks.

Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding

Addressing the pain points of "sparse and scattered key evidence" and "redundant context interference" in long-document QA, this paper proposes SLEUTH, a training-free multi-agent framework. It utilizes a coarse-to-fine pipeline of "Retrieval → Clue Mining + Visual Screening → Difficulty Assessment → Decision" to distill noisy top-K retrieved pages into concise, evidence-dense multimodal contexts, achieving SOTA performance across four long-document benchmarks in a model-agnostic manner.

RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

Addressing the challenges that "creative retouching is inherently subjective" and "rule-based rewards from single reference images are unreliable," this paper proposes RetouchIQ. The framework enables an MLLM agent to translate natural language instructions into executable Lightroom parameters. It utilizes a Generalist Reward Model (GRM) that "generates case-by-case evaluation metrics and then assigns scores," combined with Policy-Guided Reward Training (PGRT) for RL. Experimental results on the self-built RetouchEval and the MIT-Adobe5K dataset demonstrate superior semantic consistency and perceptual quality compared to MLLM and Diffusion baselines.

SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

SAGE transforms long video reasoning from the "DIRECT" paradigm, which feeds thousands of frames in a single pass for a one-shot answer, into an "AGENT" paradigm that performs multi-round on-demand retrieval like humans. By utilizing an orchestrator VLM (SAGE-MM) capable of coordinating 6 tools, combined with low-cost synthetic data and multi-reward GRPO post-training, SAGE achieves up to a 6.1% improvement in open-ended QA on the SAGE-Bench and a 14.6% gain for long videos exceeding 10 minutes.

SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System

SciEducator transforms the Deming Cycle (Plan–Do–Study–Act) from management science into a self-evolving multi-agent closed loop. By iteratively performing "planning–execution–review–improvement," the system understands scientific experiment videos and generates multi-modal educational handbooks for children. On the self-constructed SciVBench, it significantly outperforms closed-source MLLMs like GPT-4o and Gemini, as well as existing video agents.

Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding

This paper redefines fine-grained visual understanding from "assigning a label" to "reasoning with evidence like an expert," proposing KFRA, a three-stage closed-loop Agent. It first retrieves candidate hypotheses, then grounds the retrieved textual knowledge to discriminative image regions, and finally enables the Large Multimodal Model (LMM) to reason and self-correct based on multimodal evidence. On the self-constructed FGExpertBench, KFRA achieves up to a 19% improvement over base models.

Simple Agents Outperform Experts in Biomedical Imaging Workflow Optimization

Addressing the "last mile" adaptation challenge of scientific tools, this paper utilizes a minimal "coding-execution" closed-loop agent. Using only a few dozen validation images, it automatically generates pre/post-processing code. Across three production-grade biomedical imaging pipelines (Polaris/Cellpose/MedSAM), it consistently outperforms expert-tuned code that originally took weeks or months to develop. The study systematically shows that complex components like tree search, function libraries, and AutoML are not universally beneficial.

Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding

Symphony mimics human cognition by decomposing long-video understanding into multiple specialized agents based on "capability dimensions" (Planning, Reflection, Grounding, Caption, and Visual Perception). It employs an Actor-Critic-style reflection-enhanced dynamic collaboration mechanism to iteratively correct reasoning and introduces a grounding agent that "expands queries first, then scores with VLM" for complex problems. It achieves SOTA on LVBench, LongVideoBench, Video-MME, and MLVU, outperforming the previous best on LVBench by 5.0%.

Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding

VideoHV-Agent reframes long video question answering as a "hypothesis-verification" process: the Thinker rewrites answer options into testable hypotheses, the Judge extracts discriminative clues, the Verifier localizes evidence within the video for validation, and the Answerer synthesizes evidence to provide the final result. It achieves SOTA on EgoSchema, NextQA, and IntentQA while maintaining higher inference efficiency than existing agent-based methods.

Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding

This paper presents the first systematic study of Discrete Diffusion Vision-Language Models (DVLM) for GUI Grounding. By adapting LLaDA-V for single-step action prediction and proposing a mixed mask scheduling strategy (linear + deterministic) to capture geometric hierarchical dependencies between bounding box coordinates, the authors demonstrate the feasibility of diffusion models as a foundation for GUI Agents across Web, Desktop, and Mobile interfaces.

Universal Guideline-Driven Image Clustering via a Hybrid LLM Agent

This paper proposes the first training-free hybrid LLM agent that unifies various image clustering scenarios (general / fine-grained / multi-view / long-tail) via "text guidelines." It first uses an MLLM to translate images into "concept-proxy captions" and then passes them to an instruction-aware embedding model, resulting in guideline-aligned embeddings fed directly into traditional clustering algorithms. When the number of clusters is unknown, an LLM traversal based on a Minimum Spanning Tree (MST) is used to selectively merge small clusters, reducing expensive LLM calls from \(O(M^2)\) to \(O(M\log M)\). This approach outperforms specialized training-based methods across four task categories.
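
The general idea of restricting expensive merge decisions to the edges of a Minimum Spanning Tree, instead of asking about every pair of clusters, can be sketched as follows. The sketch builds an MST over cluster centroids with Prim's algorithm and calls a judge once per tree edge. The `should_merge` callable stands in for an LLM call and is hypothetical, as are the centroid representation and the union-find merging; this does not reproduce the paper's traversal or its stated \(O(M\log M)\) call count.

```python
from math import dist
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def mst_edges(points: List[Point]) -> List[Tuple[int, int]]:
    """Prim's algorithm on the complete Euclidean graph; returns M - 1 edges (i, j)."""
    n = len(points)
    if n == 0:
        return []
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: dist(points[e[0]], points[e[1]]),
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


def merge_along_mst(points: List[Point],
                    should_merge: Callable[[int, int], bool]) -> Tuple[List[int], int]:
    """Ask the judge only about MST edges; return (cluster label per point, judge calls)."""
    parent = list(range(len(points)))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    calls = 0
    for i, j in mst_edges(points):
        calls += 1
        if should_merge(i, j):
            parent[find(j)] = find(i)
    return [find(i) for i in range(len(points))], calls
```

With M clusters this issues M - 1 judge calls, compared with M(M - 1)/2 for all pairs, at the cost of never considering merges between clusters that are not adjacent in the tree.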

ViLoMem: Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

ViLoMem equips Multimodal Large Language Models (MLLMs) with a "visual stream + logic stream" dual-channel semantic memory. This allows the agent, upon failing a task, to attribute, store, and retrieve perception errors and reasoning errors separately. By using a grow-and-refine incremental update strategy to avoid forgetting, it consistently improves pass@1 across six multimodal reasoning benchmarks and significantly reduces repetitive mistakes.

VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement

VULCAN upgrades "3D object repositioning based on instructions" from a single-step edit to a multi-agent long-horizon task with a "Plan-Execute-Evaluate" loop. It replaces fragile raw script operations with MCP vision APIs and constraint solvers, utilizes three types of specialized agents to distribute global planning and local execution, and incorporates adaptive backtracking search to recover from deadlocks. On 25 complex scenes, it reduces collision and floating rates to 0, significantly outperforming all baselines.

WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

WebChain is collected from real human operations on live websites and forms the largest human-annotated web interaction trace dataset to date (31,725 traces, 318k steps, 428 domains). Its core feature is the "triple alignment" of visual screenshots, structural Accessibility Trees (AX Trees), and action coordinates. Based on this, a Dual Mid-Training recipe is proposed to decouple spatial grounding and long-range planning, achieving SOTA results on the self-built WebChainBench and multiple public GUI benchmarks.

WebGym: Scaling Training Environments for Long-Horizon Visual Web Agents with Realistic Tasks

WebGym aggregates 10 existing web benchmarks and programmatically expands them into nearly 300,000 realistic web tasks with rubric evaluations. Combined with an asynchronous rollout system that provides 4-5× acceleration, it uses vanilla REINFORCE to improve the open-source Qwen3-VL-8B from 26.2% to 42.9% on an OOD test set consisting entirely of unseen websites, outperforming GPT-4o (27.1%) and GPT-5-Thinking (29.8%).
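
For context on "vanilla REINFORCE," the sketch below shows the basic update for a softmax policy over a handful of discrete actions: the logits move along the gradient of the chosen action's log-probability, scaled by the reward. It is a toy single-step example with made-up names and no baseline, not WebGym's training code, which operates on a full vision-language model and multi-step rollouts.

```python
from math import exp
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits: Sequence[float], action: int, reward: float,
                   lr: float = 0.1) -> List[float]:
    """One REINFORCE update on softmax logits.

    For a softmax policy, d log pi(action) / d logit_i = 1[i == action] - pi_i.
    """
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


new_logits = reinforce_step([0.0, 0.0], action=0, reward=1.0)
```

A positive reward raises the logit of the sampled action and lowers the others; a zero reward leaves the policy unchanged, which is why sparse success signals on long web tasks make such training slow without many rollouts.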