
🤖 Robotics & Embodied AI

📷 CVPR2025 · 40 paper notes

📌 Same area in other venues: 📷 CVPR2026 (146) · 🔬 ICLR2026 (162) · 💬 ACL2026 (11) · 🧪 ICML2026 (53) · 🤖 AAAI2026 (30) · 🧠 NeurIPS2025 (75)

🔥 Top topics: Robotics ×19 · Multimodal/VLM ×10 · Navigation ×4 · Reasoning ×2

3D-MVP: 3D Multiview Pretraining for Robotic Manipulation: This paper proposes 3D-MVP, which extends Masked Autoencoder pretraining from 2D to a 3D multiview setting. By pretraining the multiview Transformer encoder of RVT on 200K 3D objects from Objaverse, downstream fine-tuning improves the average success rate on RLBench from 62.9% to 67.5% and significantly enhances robustness against environmental variations (such as texture, size, and lighting) on COLOSSEUM.
A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning: A systematic evaluation finds that DINO/iBOT outperform MAE on robot tasks but degrade on non-object-centric (NOC) data because they lose their object-centric representation capability. This paper proposes SlotMIM, which combines a semantic bottleneck (reducing the number of prototypes to encourage objectness to emerge), cross-view consistency regularization, and slot-level contrastive learning. This enables the model to learn object-centric representations from NOC data; with only 241K samples, it outperforms MVP and VC-1, which were pre-trained on more than 1M samples.
CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos: CityWalker uses over 2,000 hours of city walking and driving videos from the internet, automatically extracting action labels via Visual Odometry (VO) for large-scale imitation learning. The resulting embodied agents navigate complex, dynamic urban environments and achieve a 77.3% success rate in real-world deployment, significantly outperforming existing methods.
Coordinated Manipulation of Hybrid Deformable-Rigid Objects in Constrained Environments: This paper proposes a quasi-static trajectory optimization framework based on the Geometric Variable Strain (GVS) parameterized Cosserat rod model for dual-arm coordinated manipulation of hybrid deformable-rigid linear objects (hDLO) in constrained environments. By leveraging analytical gradients, the solver achieves a 33x speedup over finite differences, and a deformation error of approximately 3 cm is validated on a real dual-arm platform.
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models: This paper proposes CoT-VLA, which introduces visual Chain-of-Thought (CoT) reasoning into Vision-Language-Action (VLA) models. It reasons in two stages, first predicting a subgoal image and then generating an action sequence, and combines this with hybrid attention and action chunking. It achieves an 81.13% average success rate on the LIBERO benchmark, significantly outperforming existing methods.
Decision SpikeFormer: Spike-Driven Transformer for Decision Making: This work proposes DSFormer, the first spike-driven Transformer for offline reinforcement learning. It designs Temporal Spike Self-Attention (TSSA) and Position Spike Self-Attention (PSSA) to capture temporal and positional dependencies in RL, and introduces Progressive Threshold-dependent Batch Normalization (PTBN) to resolve the conflict between normalization and spiking properties. DSFormer outperforms its ANN counterparts on the D4RL benchmark while consuming 78.4% less energy.
DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness: This paper proposes DexGrasp Anything, which integrates three physical constraint forces into the training and sampling phases of diffusion models to achieve SOTA dexterous grasp pose generation on almost all open datasets. Additionally, it constructs the largest-scale dexterous grasping dataset containing over 15K objects and more than 3.4 million grasping poses.
DRAWER: Digital Reconstruction and Articulation with Environment Realism: The DRAWER framework automatically constructs interactive digital twins from static scene videos. By combining a dual scene representation of SDF and Gaussian Splatting, it achieves high-fidelity rendering and precise geometry. It supports articulation identification and simulation, Unreal Engine game creation, and real-to-sim-to-real robotic policy transfer.
g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks: This paper proposes g3D-LF, which constructs generalizable 3D-language feature fields for unseen environments by performing multi-level contrastive learning pre-training on approximately 5,000 indoor 3D scenes and nearly 1 million language descriptions. It achieves state-of-the-art (SOTA) or near-SOTA performance across four embodied tasks: VLN (monocular/panoramic), zero-shot object navigation, and situated question answering.
GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities: GigaHands is the largest bimanual activity dataset to date. By designing an "Instruct-to-Annotate" procedural acquisition strategy and a 51-camera markerless capture system, it collects 34 hours of bimanual activities from 56 subjects interacting with 417 objects. It contains 183 million RGB image frames and 84K detailed text annotations, demonstrating the value of data scale in text-driven hand motion generation and motion captioning tasks.
Hearing Anywhere in Any Environment: This paper proposes xRIR, a unified model for Room Impulse Response (RIR) prediction that generalizes across rooms. It combines a geometric feature extractor that uses panoramic depth maps with an acoustic encoder that leverages a few reference RIRs. Supported by the newly constructed AcousticRooms dataset (260 rooms, 300k+ RIRs), it significantly outperforms baseline methods in seen and unseen simulated environments as well as in real-world environments.
LaDA: Language-Grounded Decoupled Action Representation for Robotic Manipulation: This paper proposes LaDA, which decouples 7-DoF robot actions into three types of motion primitives (translation, rotation, and gripper) and aligns them with language semantics. Using soft-label contrastive learning and adaptive loss weighting, it achieves a 93.6% average success rate on LIBERO with only 1.3B parameters.
Learning Physics-Based Full-Body Human Reaching and Grasping from Brief Walking References: Using only about 30 seconds of walking MoCap data, this work generates physically feasible and natural full-body reach-and-grasp motions. It combines movement patterns transferred from walking (via shallow network feature alignment) with grasping poses generated by kinematic methods (an active data augmentation strategy), and reaches a 99.8% grasp success rate in simple scenarios.
Let Humanoids Hike! Integrative Skill Development on Complex Trails: This paper proposes the LEGO-H framework, which unifies navigation perception and low-level locomotion control via TC-ViT (Temporal-Conditioned ViT). Combined with Hierarchical Latent Matching (HLM) for efficient distillation from an oracle policy, it enables the Unitree H1 humanoid robot to achieve a 68.4% success rate on complex outdoor hiking trails.
Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation: Lift3D proposes a two-stage framework: first, it enhances the implicit 3D perception of 2D foundation models via task-aware MAE depth reconstruction; second, it directly enables 2D models to encode point cloud data by projecting 3D point clouds onto virtual planes to establish a mapping with 2D position embeddings. It achieves an average success rate of 83.9% on MetaWorld (outperforming the previous SOTA DP3’s 65.3% by 18.6 percentage points).
Magma: A Foundation Model for Multimodal AI Agents: Magma unifies UI screenshots, robot data, and human manipulation videos into a single pre-training framework by labeling interactive regions on images (Set-of-Mark) and tracking motion trajectories in videos (Trace-of-Mark). This enables a single model to possess both multimodal understanding and cross-domain action prediction capabilities, achieving SOTA performance in both UI navigation and robotic manipulation.
ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning: This work proposes ManipTrans, a two-stage residual learning framework that transfers human motion capture data to bimanual dexterous hand manipulation: Stage-1 pre-trains an imitation model on pure hand trajectories (wrist + finger tracking + smoothness rewards), and Stage-2 incorporates object interaction constraints (object tracking + contact forces) via a residual module and curriculum learning, achieving an object rotation error of only 8.60° and a bimanual success rate of 39.5% on OakInk-V2.
ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping: This paper proposes a Multi-Layer Occlusion (MLO) representation to learn 3D hand-object occlusion relationships and integrates the large-scale Objaverse 3D object dataset into training, achieving the first hand-object manipulation video generation framework that supports both dexterous bimanual manipulation and generalizable object appearances.
Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation: This paper proposes the HR-Align adaptation paradigm, which leverages paired human-robot video data and a contrastive alignment loss to bridge the semantic discrepancy between models pre-trained on human data and the robot domain in a parameter-efficient manner. It improves the average success rate by 7%+ across 20 simulation tasks and 5 real-world tasks.
MoManipVLA: Transferring Vision-Language-Action Models for General Mobile Manipulation: This paper proposes MoManipVLA, which transfers pre-trained fixed-base VLA models to mobile manipulation scenarios. By jointly planning base movement and manipulator trajectories with bi-level trajectory optimization (optimizing reachability, smoothness, and collision avoidance), it achieves a 66.1% success rate (+4.2%) on the OVMM benchmark and can be deployed in the real world with only 50 demonstrations.
Neural Motion Simulator: Pushing the Limit of World Models in Reinforcement Learning: This paper proposes MoSim, a world model based on rigid-body dynamics priors and Neural ODEs. Operating in physical state spaces, it performs high-precision, long-horizon predictions, enabling zero-shot reinforcement learning—training policies without any real environment interactions—for the first time.
Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation: This paper proposes Concept-Gated Visual Distillation (CGVD), a training-free inference-time framework. Through a pipeline of language instruction parsing → SAM3 segmentation → set-theoretic cross-validation → LaMa inpainting, it selectively removes semantic distractors from the visual input of VLA models, improving the manipulation success rate of \(\pi_0\) from 43.0% to 77.5% in highly cluttered scenes.
PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments: This paper proposes PanoAffordanceNet, the first 360° panoramic affordance grounding framework. It handles ERP latitude-dependent distortion using a Distortion-Aware Spectrum Modulator (DASM), restores sparse activations into topologically continuous areas via an Omnispherical Densification Head (OSDH), suppresses semantic drift with multi-level training objectives, and constructs the first panoramic affordance dataset 360-AGD, comprehensively outperforming existing methods.
Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception: This paper proposes a perception scheduling framework for human-robot collaboration that selectively activates perception modules (object detection and pose estimation) based on the trade-off between information gain and computational cost. In streaming perception scenarios, it reduces computational latency by up to 27.52% while improving MMPose activation recall by 72.73%.
Phoenix: A Motion-based Self-Reflection Framework for Fine-grained Robotic Action Correction: This paper proposes the Phoenix framework, which utilizes motion instructions as a bridge to connect the high-level semantic reflection of MLLMs with low-level robotic action correction. By incorporating a dual-process motion adjustment mechanism and a motion-conditioned diffusion policy, Phoenix achieves fine-grained manipulation failure recovery and supports self-improvement through lifelong learning.
Prof. Robot: Differentiable Robot Rendering without Static and Self-Collisions: Prof. Robot is the first differentiable robot rendering framework to incorporate collision constraints. It achieves differentiable rendering by binding 3D Gaussian points to each link of a robot's URDF model, and integrates static collision (with the environment) and self-collision (within the robot itself) constraints into the optimization, reducing the collision rate from 24% to 0% while maintaining visual fidelity.
Reasoning in Visual Navigation of End-to-end Trained Agents: A Dynamical Systems Approach: Through large-scale experiments on 262 real-world robot navigation episodes, this work deeply analyzes the emergent reasoning capabilities inside end-to-end RL-trained navigation agents, including a Kalman-filter-like dynamical model, latent memory of scene structures, a finite horizon of planning ability, and value functions associated with long-term planning.
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors: This paper proposes RoboGround, a two-stage framework: first, a Grounded VLM (GLaMM) generates segmentation masks of target objects and placement areas from images and text instructions; second, a Grounded Perceiver uses these masks as intermediate representations to guide the robotic policy network during execution. This achieves a 60-100% relative improvement on complex semantic manipulation tasks.
Robotic Visual Instruction: This paper proposes Robotic Visual Instruction (RoVI), a visual instruction paradigm that guides robotic manipulation with hand-drawn arrows and circles instead of natural language. It also designs the VIEW pipeline to translate 2D visual instructions into 3D action sequences, achieving an 87.5% success rate in real-world environments.
RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins: RoboTwin proposes a dual-arm robot benchmark framework based on generative digital twins. It leverages 3D generative foundation models to reconstruct 3D digital twins of objects from single 2D images and combines them with Large Language Models to automatically generate robot manipulation code. Under the paradigm of simulation pre-training followed by real-world fine-tuning with sparse data, it achieves significant success rate gains of over 70% in single-arm tasks and over 40% in dual-arm tasks.
SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics: SaPaVe proposes an end-to-end active manipulation framework. By decoupling the action space of camera movement and manipulation actions, it adopts a bottom-up, two-stage training strategy (learning semantic camera control first, followed by joint optimization) to train active perception priors on a 200K semantic camera movement dataset. Coupled with a 3D geometry-aware module to enhance execution robustness under viewpoint changes, it achieves 31.25% and 40% higher success rates than GR00T-N1 and \(\pi_0\), respectively, in real-world tasks.
ShowUI: One Vision-Language-Action Model for GUI Visual Agent: Based on Qwen2-VL-2B, ShowUI reduces redundant tokens by 33% and achieves a 1.4x speedup through UI-connected-graph-guided visual token selection. Combined with interleaved vision-language-action streaming and a curated 256K training dataset, it achieves state-of-the-art (SOTA) zero-shot accuracy of 75.1% on ScreenSpot with only 2B parameters.
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters: This paper proposes SOLAMI, the first end-to-end Social Vision-Language-Action (VLA) modeling framework. By discretizing speech and motion into tokens and modeling them uniformly with a decoder-only LLM, it enables immersive, real-time interaction between users and 3D virtual characters using speech and body language. Additionally, it constructs SynMSI, a synthetic multimodal social interaction dataset.
Solving Instance Detection from an Open-World Perspective: From an open-world perspective, this work introduces three strategies (metric learning for adapting foundation model features, distractor sampling, and NeRF-based novel-view synthesis) to significantly enhance instance-level feature matching in instance detection, substantially outperforming prior art in both CID and NID setups.
SortScrews: A Dataset and Baseline for Real-time Screw Classification: This paper presents the SortScrews dataset, an industrial classification dataset containing 560 RGB images of size \(512 \times 512\) across 6 screw categories, accompanied by a reusable data acquisition pipeline. Transfer learning models EfficientNet-B0 and ResNet-18 are established as baselines, with ResNet-18 achieving a validation accuracy of 96.4% on this dataset.
Think Small, Act Big: Primitive Prompt Learning for Lifelong Robot Manipulation: This paper proposes Primitive Prompt Learning (PPL), which encodes motion primitives into reusable prompt vectors. By combining this with flow-aware Motion-Aware Prompting (MAP), it enables the sharing of motion primitives across skills. Using a freeze-and-expand mechanism to support lifelong robot manipulation learning, PPL outperforms baselines such as LoRA and experience replay in both LIBERO and real-world environments.
TinyNav: End-to-End TinyML for Real-Time Autonomous Navigation on Microcontrollers: This work deploys an end-to-end quantized CNN on an ESP32 microcontroller, achieving real-time autonomous navigation with 30 ms latency using only 23k parameters and a ToF depth camera.
Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method: This paper defines the Long-Horizon Vision-Language Navigation (LH-VLN) task, constructs the NavGen automatic generation platform and the LHPR-VLN benchmark (comprising 3,260 multi-stage tasks with an average of 150 steps), and proposes the MGDM method that achieves multi-stage navigation through short-term memory blurring, long-term memory retrieval, and CoT feedback, outperforming NaviLLM by 23% in the ISR metric.
UniAct: Universal Actions for Enhanced Embodied Foundation Models: UniAct proposes building embodied foundation models in a Universal Action Space, encoding atomic behaviors shared across diverse embodied platforms via a vector-quantized codebook. The 0.5B parameter model outperforms SOTA models 14 times its size and supports rapid adaptation to new robots.
ZeroGrasp: Zero-Shot Shape Reconstruction Enabled Robotic Grasping: ZeroGrasp proposes a unified framework based on an octree Conditional Variational Autoencoder (CVAE) to simultaneously perform high-resolution 3D object reconstruction and 6D grasp pose prediction from a single RGB-D image. By modeling inter-object relations with a multi-object encoder and 3D occlusion fields, it achieves SOTA performance on the GraspNet-1B benchmark and demonstrates generalization capabilities on real robots.
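
As a rough illustration of the vector-quantized codebook idea mentioned in the UniAct entry above, the sketch below maps a continuous action embedding to its nearest codebook entry. It is a minimal, hypothetical example and not UniAct's implementation: the function name `quantize`, the toy 2-D codebook, and the choice of squared Euclidean distance are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def quantize(embedding: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return (index, code) of the codebook entry closest to `embedding`.

    Hypothetical helper: nearest-neighbour lookup under squared Euclidean
    distance, which is the usual assignment step in vector quantization.
    """
    if not codebook:
        raise ValueError("codebook must contain at least one entry")
    best_index = -1
    best_dist = float("inf")
    for i, code in enumerate(codebook):
        if len(code) != len(embedding):
            raise ValueError("embedding and code dimensions differ")
        dist = sum((e - c) ** 2 for e, c in zip(embedding, code))
        if dist < best_dist:  # strict '<' keeps the first entry on ties
            best_index, best_dist = i, dist
    return best_index, list(codebook[best_index])


# Toy 2-D codebook with three codes (illustrative values, not from the paper).
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

index, code = quantize([0.9, 0.2], CODEBOOK)
print(index, code)
```

In a learned system the codebook would be trained jointly with the encoder; here it is fixed so that the lookup step can be read in isolation.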