🦾 LLM Agent

📷 CVPR2026 · 19 paper notes

ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search: This paper proposes ARGOS, the first benchmark and framework that redefines multi-camera person search as an interactive reasoning problem. An agent conducts multi-turn dialogue with witnesses, invokes spatiotemporal tools, and eliminates candidates under information asymmetry. The benchmark comprises 2,691 tasks across 3 progressive tracks.
CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare: This paper proposes the CareFlow benchmark (1,050 long-horizon medical software workflow tasks, 8–24 steps, covering four systems: DICOM/3D Slicer/EMR/LIS) and the CarePilot framework (based on the Actor-Critic paradigm, integrating tool grounding and a dual memory mechanism), achieving approximately 15% higher task accuracy than GPT-5 on CareFlow.
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration: EchoTrail-GUI builds a high-quality action memory repository through autonomous exploration guided by a critic model. At inference time it retrieves the relevant experiences and injects them into the prompt, improving GPT-4o's task success rate on AndroidWorld from 34.5% to 51.7%.
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration: EchoTrail-GUI proposes a three-stage closed-loop framework: an exploration agent autonomously interacts with GUI environments to generate trajectories → a critic reward model filters and retains only high-quality trajectories to construct a memory store (EchoTrail-4K) → upon receiving a new task, the most relevant memories are injected via hybrid dense-sparse retrieval to guide inference. This transforms a stateless GUI agent into a memory-augmented system, achieving 51.7% SR (+17.2pp) with GPT-4o on AndroidWorld, and improving Qwen2.5-VL-72B SR from 23.9% to 37.5% on AndroidLab.
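The filter-then-retrieve loop described for EchoTrail-GUI can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `Memory` record, the `min_score` threshold, cosine similarity as the dense score, token-level Jaccard overlap as the sparse score, and the `alpha` mixing weight.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    task: str                # natural-language task description
    steps: list[str]         # recorded GUI actions for that task
    critic_score: float      # hypothetical critic reward in [0, 1]
    embedding: list[float]   # hypothetical dense embedding of `task`


def build_store(trajectories, min_score=0.8):
    """Keep only trajectories the critic rated at or above `min_score`."""
    return [t for t in trajectories if t.critic_score >= min_score]


def cosine(a, b):
    """Dense similarity: cosine of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Sparse similarity: overlap of lowercase whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def retrieve(store, query, query_embedding, k=1, alpha=0.5):
    """Return the top-k memories ranked by a weighted dense + sparse score."""
    def score(m):
        dense = cosine(query_embedding, m.embedding)
        sparse = jaccard(query, m.task)
        return alpha * dense + (1 - alpha) * sparse

    return sorted(store, key=score, reverse=True)[:k]
```

The retrieved memories would then be formatted into the agent's prompt. A real system would replace the toy similarity functions with a learned embedding model and a proper sparse retriever.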
Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos: This paper proposes Ego2Web, the first benchmark that bridges egocentric video perception with web agent execution, accompanied by a semi-automatic data construction pipeline and the Ego2WebJudge automatic evaluation framework. Experiments reveal that current state-of-the-art agents still exhibit a substantial gap in cross-modal transfer from real-world visual perception to online action, with the best model achieving only 48.2% success rate.
Gen-n-Val: Agentic Image Data Generation and Validation: This paper proposes Gen-n-Val, an agentic synthetic data generation and validation framework that leverages an LLM to optimize Layer Diffusion prompts for generating high-quality single-object transparent images, and employs a VLLM to filter low-quality samples. The framework reduces invalid synthetic data from 50% to 7%, achieving a 7.6% mAP improvement on LVIS rare-category instance segmentation.
GUI-CEval: A Hierarchical and Comprehensive Chinese Benchmark for Mobile GUI Agents: This paper proposes GUI-CEval, the first comprehensive benchmark for Chinese mobile GUI agents, covering 201 mainstream Chinese apps and 4 device types. It adopts a two-tier structure (foundation + application) to perform fine-grained diagnosis across five dimensions—perception, planning, reflection, execution, and evaluation. Experiments on 20 representative models reveal that current models exhibit significant deficiencies in reflection and self-evaluation.
HATS: Hardness-Aware Trajectory Synthesis for GUI Agents: This paper proposes HATS, a trajectory synthesis framework that couples hardness-driven exploration with alignment-guided refinement in a closed loop. By focusing on collecting and correcting training trajectories for semantically ambiguous actions, HATS substantially improves the generalization of GUI agents in complex real-world scenarios.
HATS: Hardness-Aware Trajectory Synthesis for GUI Agents: This paper proposes HATS, a hardness-aware trajectory synthesis framework that identifies and handles semantically ambiguous GUI actions via two closed-loop modules, hardness-driven exploration and alignment-guided refinement, significantly improving the cross-environment generalization of GUI agents.
Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search: This paper proposes the HAVEN framework, which combines audiovisual entity cohesion, a four-level hierarchical video index (global–scene–clip–entity), and an agentic search mechanism. It achieves 84.1% accuracy on LVBench, including 80.1% on the reasoning category.
HAVEN: Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search: HAVEN proposes a unified framework combining audiovisual entity cohesion, hierarchical indexing, and agentic search. By leveraging speaker identity as a cross-modal coherence signal, it constructs a four-level hierarchical database (global → scene → clip → entity), achieving state-of-the-art 84.1% overall accuracy on LVBench.
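One way to picture the four-level index (global → scene → clip → entity) is as nested records plus an entity lookup table. The sketch below is a hypothetical data layout, not HAVEN's actual schema: the class names, fields, and the `entity_table` helper are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Clip:
    start: float                  # clip start time in seconds
    end: float                    # clip end time in seconds
    caption: str                  # short description of the clip
    entities: list[str] = field(default_factory=list)  # e.g. speaker identities


@dataclass
class Scene:
    summary: str
    clips: list[Clip] = field(default_factory=list)


@dataclass
class VideoIndex:
    global_summary: str           # global level: one summary for the whole video
    scenes: list[Scene] = field(default_factory=list)

    def entity_table(self):
        """Entity level: map each entity name to its (scene, clip) positions."""
        table = defaultdict(list)
        for s_idx, scene in enumerate(self.scenes):
            for c_idx, clip in enumerate(scene.clips):
                for name in clip.entities:
                    table[name].append((s_idx, c_idx))
        return dict(table)

    def clips_mentioning(self, entity):
        """Jump from an entity straight to every clip it appears in."""
        locations = self.entity_table().get(entity, [])
        return [self.scenes[s].clips[c] for s, c in locations]
```

With a layout like this, a search agent could start from the global summary, narrow to a scene, and then follow an entity such as a speaker across clips instead of scanning the whole video.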
Nerfify: A Multi-Agent Framework for Turning NeRF Papers into Code: Nerfify is a four-stage pipeline (CFG formalization with in-context learning, compositional citation recovery, GoT-based code synthesis, and visual feedback) that automatically converts NeRF papers into trainable Nerfstudio plugins, achieving 100% executability on a 30-paper benchmark (vs. 5% for general baselines) with visual quality within ±0.5 dB PSNR of expert implementations.
Nerfify: A Multi-Agent Framework for Turning NeRF Papers into Code: Nerfify is a domain-aware multi-agent framework that automatically converts NeRF papers into trainable Nerfstudio plugin code via context-free grammar (CFG) constraints, Graph-of-Thought (GoT) code synthesis, and compositional reference dependency recovery, achieving 100% executability with visual quality within ±0.5 dB PSNR of expert implementations.
REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting: This paper proposes REALM, a framework that leverages an MLLM agent to perform reasoning segmentation on views rendered by 3D Gaussian Splatting (3DGS), and introduces a Global-Local Spatial Grounding strategy (GLSpaG) to aggregate multi-view MLLM reasoning results. REALM substantially outperforms existing methods on implicit-instruction 3D segmentation (mIoU 92.88% vs. baseline 44.82% on LERF) and supports downstream 3D editing.
REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting: REALM is proposed as a framework that leverages MLLM reasoning capabilities to perform open-world 3D reasoning segmentation on 3DGS via a global-to-local spatial grounding strategy, handling implicit instructions without 3D post-training. It achieves 92.88% mIoU on LERF, surpassing baseline methods by 40+ percentage points, and supports editing tasks including object removal, replacement, and style transfer.
SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation: This paper proposes SceneAssistant, a VLM agentic framework driven purely by visual feedback. It defines 14 functionally complete Action APIs that let Gemini-3.0-Flash iteratively generate and refine open-vocabulary 3D scenes within a ReAct closed loop, requiring neither predefined spatial relation templates nor external layout solvers. In a human evaluation of 30 scenes, it achieves a Layout score of 7.600 (vs. 5.800 for SceneWeaver) and a Human Preference rate of 65%.
Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding: VideoHV-Agent reframes long video QA as a hypothesis-verification process: a Thinker rewrites answer options into testable hypotheses, a Judge extracts discriminative clues, a Verifier localizes evidence in the video, and an Answer agent synthesizes evidence into a final answer. The framework achieves state-of-the-art results on EgoSchema, NextQA, and IntentQA while outperforming existing agent methods in inference efficiency.
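The four roles named above (Thinker, Judge, Verifier, Answer agent) can be read as a simple pipeline. The sketch below only wires the stages together. The function name, argument order, and return types are assumptions made for illustration, and each role is passed in as a plain callable rather than a real model.

```python
def answer_question(question, options, thinker, judge, verifier, answerer):
    """Run a hypothesis-verification pass over a multiple-choice video question.

    All four role arguments are hypothetical callables standing in for agents:
      thinker(question, option)      -> testable hypothesis (str)
      judge(question, hypotheses)    -> {hypothesis: discriminative clue}
      verifier(hypothesis, clue)     -> list of evidence items found in the video
      answerer(question, options, hypotheses, evidence) -> index of chosen option
    """
    # Thinker: rewrite each answer option as a testable hypothesis.
    hypotheses = [thinker(question, option) for option in options]

    # Judge: pick the clue that best separates each hypothesis from the rest.
    clues = judge(question, hypotheses)

    # Verifier: look for evidence of each clue in the video.
    evidence = {h: verifier(h, clues[h]) for h in hypotheses}

    # Answer agent: combine the evidence into a final choice.
    return answerer(question, options, hypotheses, evidence)
```

Keeping the roles as injected callables makes it easy to swap in stubs for testing. It also means the video is only touched in the Verifier step, which is one plausible reading of where the efficiency gain comes from.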
Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding: This work presents the first systematic study of discrete vision-language diffusion models (DVLMs) for GUI grounding, adapting LLaDA-V for single-step action prediction and proposing a hybrid masking schedule (linear + deterministic) to capture geometric hierarchical dependencies among bounding box coordinates. The approach demonstrates the feasibility of diffusion models as a foundation for GUI agents across Web, Desktop, and Mobile interfaces.
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning: This paper proposes WorldMM, a video reasoning agent based on multimodal memory, which constructs three complementary memory types: episodic memory (multi-temporal-scale textual knowledge graphs), semantic memory (continuously updated relational knowledge graphs), and visual memory (frame-level retrieval stores). An adaptive multi-round retrieval agent dynamically selects the most relevant memory source and temporal granularity, achieving an average improvement of 8.4% over the previous state of the art across five long video QA benchmarks.