
🦾 LLM Agent

🎞️ ECCV2024 · 3 paper notes

🔥 Top topics: Agents ×3

Agent3D-Zero: An Agent for Zero-shot 3D Understanding: Agent3D-Zero is a VLM-based agent framework for zero-shot 3D scene understanding. It applies Set-of-Line visual prompting to the bird's-eye view (BEV) so that the VLM actively selects observation viewpoints, then synthesizes the resulting multi-view images for 3D reasoning. It outperforms fine-tuned 3D-LLM methods on tasks such as ScanQA.
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning: (Note: brief note based on the abstract.) This paper proposes HYDRA, a multi-stage framework for dynamic compositional visual reasoning. Three modules collaborate: a Planner, a reinforcement learning cognitive controller (the RL Agent), and a Reasoner. Together they achieve reliable, progressive visual reasoning and reach SOTA performance on multiple datasets, including RefCOCO/RefCOCO+, OK-VQA, and GQA.
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding: This paper proposes VideoAgent, a memory-augmented multimodal agent. It builds a structured memory (a temporal memory that stores event descriptions and an object memory that stores object tracking states) and uses four tools to interact with that memory when answering zero-shot long-video QA tasks. It achieves an average gain of +6.6% on NExT-QA and +26.0% on EgoSchema, approaching the performance of Gemini 1.5 Pro.
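
To make the VideoAgent note more concrete, the two-part structured memory could be sketched roughly as below. This is only an illustration built from the one-paragraph summary above, not the paper's implementation: the class names (`TemporalEntry`, `ObjectTrack`, `VideoMemory`), the fields, and the two query helpers are all hypothetical, and the paper's four tools are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TemporalEntry:
    """One event description covering a time span (hypothetical schema)."""
    start_s: float
    end_s: float
    description: str


@dataclass
class ObjectTrack:
    """Tracking state for one object: its category and the segments it appears in."""
    object_id: int
    category: str
    segments: List[int] = field(default_factory=list)


@dataclass
class VideoMemory:
    """Hypothetical container pairing a temporal memory with an object memory."""
    temporal: List[TemporalEntry] = field(default_factory=list)
    objects: Dict[int, ObjectTrack] = field(default_factory=dict)

    def add_event(self, start_s: float, end_s: float, description: str) -> None:
        self.temporal.append(TemporalEntry(start_s, end_s, description))

    def add_sighting(self, object_id: int, category: str, segment_idx: int) -> None:
        # Create the track on first sighting, then record the segment once.
        track = self.objects.setdefault(object_id, ObjectTrack(object_id, category))
        if segment_idx not in track.segments:
            track.segments.append(segment_idx)

    def find_events(self, keyword: str) -> List[TemporalEntry]:
        # Naive keyword match; a real system would likely use learned text similarity.
        kw = keyword.lower()
        return [e for e in self.temporal if kw in e.description.lower()]

    def count_objects(self, category: str) -> int:
        # Count distinct tracked objects of the given category.
        return sum(1 for t in self.objects.values() if t.category == category)


# Toy usage with made-up data.
memory = VideoMemory()
memory.add_event(0.0, 2.0, "A person opens the fridge")
memory.add_event(2.0, 4.0, "The person pours milk into a cup")
memory.add_sighting(object_id=1, category="cup", segment_idx=1)
memory.add_sighting(object_id=2, category="cup", segment_idx=1)
memory.add_sighting(object_id=1, category="cup", segment_idx=1)  # duplicate, ignored

milk_events = memory.find_events("milk")
cup_count = memory.count_objects("cup")
```

Under these assumptions, an LLM agent would answer a question by calling such query helpers as tools instead of re-reading the whole video, which is the general idea the summary attributes to the memory design.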