
🤖 Robotics & Embodied AI

🎞️ ECCV2024 · 13 paper notes


🔥 Top topics: Robotics ×6 · Navigation ×3

AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation: Proposes the STAformer architecture and two affordance-based modules (an environment affordance database and interaction hotspots), yielding relative improvements of 30-45% in Short-Term object interaction Anticipation (STA) for egocentric videos on Ego4D and EPIC-Kitchens.
An Economic Framework for 6-DoF Grasp Detection: Proposes the EconomicGrasp framework. The paper identifies the ambiguity problem in dense supervision as the root cause of the conflict between performance and resource consumption. To address it, the framework combines an economic supervision paradigm (retaining all view perspectives but cropping rotation angles and depths) with a focal representation module (an interactive grasp head with composite scoring). It outperforms the SOTA by approximately 3 AP on GraspNet-1Billion while using only 1/4 of the training time and 1/8 of the memory.
Decomposed Vector-Quantized Variational Autoencoder for Human Grasp Generation: Proposes Decomposed VQ-VAE (DVQ-VAE), which decomposes the hand into six parts, each encoded with its own independent codebook, and uses a dual-stage decoding strategy (posture first, then position). It achieves a relative improvement of roughly 14.1% in the quality index across four benchmark datasets.
DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-Level Control: Proposes the DISCO framework, which combines a differentiable scene semantic representation with dual-level coarse-to-fine action control. On the ALFRED benchmark it outperforms the SOTA by +8.6% in unseen success rate without requiring step-by-step instructions.
GraspXL: Generating Grasping Motions for Diverse Objects at Scale: Proposes GraspXL, an RL-based grasp motion generation framework that generalizes to over 500,000 unseen objects after training on only 58 objects. It supports multi-objective motion control (grasp region, heading, wrist rotation, and hand position) and multiple dexterous hand platforms.
Hierarchically Structured Neural Bones for Reconstructing Animatable Objects from Casual Videos: Proposes the Hierarchically Structured Neural Bones (HSNB) framework, which uses a tree-structured bone system to decompose object motion in a coarse-to-fine manner and reconstruct high-quality animatable 3D models from casual videos.
Learning Cross-Hand Policies of High-DOF Reaching and Grasping: Proposes a two-stage hierarchical framework that uses semantic keypoints and the Interaction Bisector Surface (IBS) as hand-agnostic state representations. Combined with a Transformer policy network and hand-specific adaptation models, it achieves zero-shot transfer of dexterous grasping policies across different high-DOF robotic hands.
LLM as Copilot for Coarse-Grained Vision-and-Language Navigation: This paper proposes the VLN-Copilot framework, where a vision-and-language navigation agent actively seeks help from an LLM when confused under coarse-grained (short and ambiguous) instructions. Acting as a copilot, the LLM generates real-time, fine-grained navigation guidance, significantly improving navigation success rates on two coarse-grained VLN datasets.
Prioritized Semantic Learning for Zero-shot Instance Navigation: This paper proposes the Prioritized Semantic Learning (PSL) method. Through a semantic-augmented agent architecture, a prioritized semantic training strategy, and a semantic expansion inference scheme, it significantly improves the agent's semantic perception capabilities in zero-shot object/instance navigation, achieving SOTA performance on both ObjectNav and the newly proposed InstanceNav tasks.
QUAR-VLA: Vision-Language-Action Model for Quadruped Robots: This paper proposes QUAR-VLA, the first vision-language-action paradigm for quadruped robots. It constructs QUARD, a multi-task dataset with 259K episodes, and QUART, a model built on a pretrained multimodal large model, achieving unified control across multiple tasks such as perception, navigation, and whole-body manipulation.
ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments: Introduces the ReALFRED benchmark, which replaces ALFRED's synthetic single-room scenes with 150 interactive multi-room environments built from real-world 3D scans. It provides 30,696 free-form language instructions and reveals a significant performance drop for existing embodied instruction-following methods in real environments.
See and Think: Embodied Agent in Virtual Environment: This paper proposes STEVE, an open-world embodied agent in Minecraft based on three major components: visual perception, language instruction, and code action. By fine-tuning LLaMA-2 on the STEVE-21K dataset and combining it with a visual encoder and a skill database, STEVE significantly outperforms existing methods in tech tree unlocking and block search tasks.
SemGrasp: Semantic Grasp Generation via Language Aligned Discretization: This work proposes SemGrasp, which designs a hierarchical VQ-VAE to discretize grasp poses into three semantic tokens ("orientation-manner-refinement"). It then fine-tunes a Multimodal Large Language Model (MLLM) to align objects, grasps, and language within a unified semantic space, enabling the generation of physically plausible and semantically consistent human grasp poses from natural language instructions.
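
Two of the notes above (DVQ-VAE and SemGrasp) discretize grasps with vector-quantized codebooks. As a rough illustration of that shared idea only, the sketch below maps each continuous feature vector to the index of its nearest codebook entry, using one independent codebook per part or token slot. It is not either paper's implementation: the function names, the 2-D toy features, and the codebook values are all hypothetical, and a real VQ-VAE learns its encoder, decoder, and codebooks jointly.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry closest to `vector`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def encode_parts(parts: List[Vector], codebooks: List[List[Vector]]) -> List[int]:
    """Quantize each part with its own codebook, giving one token per part."""
    return [quantize(p, cb) for p, cb in zip(parts, codebooks)]


def decode_parts(tokens: List[int], codebooks: List[List[Vector]]) -> List[List[float]]:
    """Look up the codebook entry selected by each token."""
    return [list(cb[t]) for t, cb in zip(tokens, codebooks)]


# Hypothetical toy data: three slots, each with a two-entry codebook.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 1.0), (1.0, 0.0)],
    [(0.5, 0.5), (2.0, 2.0)],
]
parts = [(0.9, 0.8), (0.1, 0.9), (1.6, 1.9)]

tokens = encode_parts(parts, codebooks)
reconstruction = decode_parts(tokens, codebooks)
print(tokens)          # [1, 0, 1]
print(reconstruction)  # [[1.0, 1.0], [0.0, 1.0], [2.0, 2.0]]
```

Keeping a separate codebook per slot means each token only has to cover the variation of its own part, which is the intuition behind decomposing a hand into parts or a grasp into several semantic tokens.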