
🦾 LLM Agent

📷 CVPR2025 · 9 paper notes

📌 Same area in other venues: 📷 CVPR2026 (42) · 🔬 ICLR2026 (162) · 💬 ACL2026 (82) · 🧪 ICML2026 (59) · 🤖 AAAI2026 (33) · 🧠 NeurIPS2025 (39)

🔥 Top topics: Agents ×4

ATA: Adaptive Transformation Agent for Text-Guided Subject-Position Variable Background Generation: Proposes the ATA (Adaptive Transformation Agent) framework to achieve precise control over subject position and pose in text-guided background generation, dynamically adjusting the subject's placement in the background via an adaptive transformation module while balancing visual consistency and semantic plausibility.
Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields: This paper proposes Feature4X, a versatile framework that distills the functionalities of various 2D visual foundation models (e.g., SAM2, InternVideo2) from arbitrary monocular videos into a unified 4D Gaussian feature field via a dynamic optimization strategy. The authors present it as the first attempt to lift video foundation models to 4D features based on Gaussian Splatting. It supports segment-anything from novel views, geometric and appearance editing, and free-form VQA.
GUI-Xplore: Empowering Generalizable GUI Agents with One Exploration: Proposes the GUI-Xplore dataset (312 applications, 32K+ QA pairs, five task levels) and the Xplore-Agent framework (Action-aware GUI Modeling + GUI Transition Graph Reasoning). By simulating the human strategy of "exploring before reasoning", it improves step success rate (StepSR) by approximately 10% on unfamiliar applications compared to state-of-the-art GUI agents.
RL-RC-DoT: A Block-level RL Agent for Task-Aware Video Compression: Proposes RL-RC-DoT, a reinforcement learning-based macroblock-level quantization parameter (QP) control agent for task-aware video compression. By modeling QP selection as a sequential decision-making problem in RL, the agent learns to allocate more bitrate to task-relevant regions under given bitrate constraints, significantly improving performance on vehicle detection and ROI saliency coding tasks. A key advantage is that it does not require running downstream task models during inference, making it suitable for edge device deployment.
SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation: Proposes SceneAssistant, a closed-loop agentic framework based on visual feedback. By designing a fully functioning suite of Action APIs (13 atomic operations spanning object search, deletion, 6DoF spatial operations, and camera control) for VLMs, this approach enables iterative, open-vocabulary 3D scene generation using the ReAct paradigm. It significantly outperforms Holodeck and SceneWeaver in both indoor (preference rate of 61.25%) and open-domain (preference rate of 65.00%) scenarios.
Sketchtopia: A Dataset and Foundational Agents for Benchmarking Asynchronous Multimodal Communication with Iconic Feedback: Proposes the Sketchtopia large-scale dataset (20K+ game sessions, 263K sketches, 916 players) and a three-component Agent framework (ActionDecider + DRAWBOT + GUESSBOT) to study asynchronous, goal-driven multimodal collaborative communication in Pictionary scenarios, introducing three new evaluation metrics: AAO, FRS, and MATS.
SpiritSight Agent: Advanced GUI Agent with One Look: This paper proposes SpiritSight, a vision-based end-to-end GUI agent that resolves grounding ambiguity under dynamic high-resolution inputs through GUI-Lasagne, a multi-tier dataset of 5.73 million samples, and the Universal Block Parsing (UBP) method. On Multimodal-Mind2Web under the non-candidate element setup, SpiritSight-8B achieves a Step Success Rate (Step SR) of 52.7%, outperforming all compared vision-based, language-based, and hybrid methods.
TANGO: Training-free Embodied AI Agents for Open-world Tasks: This paper proposes TANGO, which orchestrates two minimal navigation foundation primitives (PointGoal Navigation + Memory-based Exploration) through the program generation capability of LLMs. Without any task-specific training and using only few-shot examples, TANGO achieves state-of-the-art (SOTA) results across three distinct embodied AI tasks: Open-Set ObjectGoal Navigation, Multi-Modal Lifelong Navigation, and Open Embodied QA, demonstrating the generalizability of the "minimal primitive set + LLM composition" paradigm.
Visual Agentic AI for Spatial Reasoning with a Dynamic API: This paper proposes VADAR, an agentic program synthesis approach for 3D spatial reasoning. Multiple LLM agents collaborate to generate a Pythonic API and dynamically extend it with new functions that solve common subproblems as they arise, overcoming the limitations of prior methods such as VisProg and ViperGPT, which rely on static, human-defined APIs. The paper also introduces a new benchmark involving multi-step spatial localization and reasoning, and VADAR outperforms existing zero-shot methods on 3D understanding tasks.
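
To make the "dynamic API" idea in the last note more concrete, here is a minimal sketch of a function registry that grows while a problem is being solved. It is not VADAR's implementation: the class `DynamicAPI`, the helpers `box_center` and `is_left_of`, and the 2D `Box` format are all hypothetical, and the functions are hand-written stand-ins for code that LLM agents would generate in the paper's setting.

```python
from typing import Callable, Dict, Tuple

# Hypothetical 2D box format: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


class DynamicAPI:
    """Toy function registry that can be extended at run time."""

    def __init__(self) -> None:
        self._functions: Dict[str, Callable] = {}

    def register(self, name: str, fn: Callable) -> None:
        # Add (or overwrite) a function under the given name.
        self._functions[name] = fn

    def has(self, name: str) -> bool:
        return name in self._functions

    def call(self, name: str, *args):
        if name not in self._functions:
            raise KeyError(f"API has no function named {name!r}")
        return self._functions[name](*args)


def box_center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


api = DynamicAPI()
api.register("box_center", box_center)


# A later step adds a new function that reuses an existing one,
# so the shared subproblem (finding a center) is solved only once.
def is_left_of(a: Box, b: Box) -> bool:
    return api.call("box_center", a)[0] < api.call("box_center", b)[0]


api.register("is_left_of", is_left_of)

result = api.call("is_left_of", (0, 0, 2, 2), (4, 0, 6, 2))
print(result)
```

The contrast with a static API is that `is_left_of` did not exist when `api` was created. It was registered later and built on `box_center`, which is the kind of reuse the summary attributes to dynamically extended APIs.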