🤖 Robotics & Embodied AI
📹 ICCV 2025 · 26 paper notes
- Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding
This paper proposes AdaRPG, a framework that leverages foundation vision-language models for part-level segmentation and affordance reasoning on articulated objects, and employs GPT-4o to generate high-level control code for adaptively scheduling atomic manipulation skills, achieving cross-category zero-shot generalization in both simulation and real-world environments.
- AnyBimanual: Transferring Unimanual Policy for General Bimanual Manipulation
This paper proposes AnyBimanual, a plug-and-play framework that transfers pretrained unimanual manipulation policies to general bimanual manipulation scenarios via a Skill Manager and a Visual Aligner, achieving significant multi-task generalization with only a small number of bimanual demonstrations.
- Beyond Losses Reweighting: Empowering Multi-Task Learning via the Generalization Perspective
From a generalization perspective, this paper introduces Sharpness-Aware Minimization (SAM) into multi-task learning (MTL). By decomposing each task's SAM gradient into a "low-loss direction" and a "flat direction" and aggregating them separately, the method reduces gradient conflicts and guides the model toward a jointly flat low-loss region shared across tasks.
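The decomposition idea can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the SAM step follows the standard formulation (gradient at an adversarially perturbed point), while `decompose_and_aggregate` and the PCGrad-style `_deconflict` aggregation are hypothetical stand-ins for whatever aggregation rule the paper actually uses on each group.

```python
import numpy as np

def sam_gradient(grad_fn, w, rho=0.05):
    """Sharpness-Aware Minimization step (Foret et al., 2021): take the
    gradient at the adversarially perturbed point w + rho * g/||g||."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    return grad_fn(w + eps)

def _deconflict(grads):
    """PCGrad-style stand-in aggregation: project each gradient off any
    other it conflicts with (negative inner product), then average."""
    out = []
    for i, g in enumerate(grads):
        g = g.copy()
        for j, h in enumerate(grads):
            if i != j and np.dot(g, h) < 0:
                g = g - (np.dot(g, h) / (np.dot(h, h) + 1e-12)) * h
        out.append(g)
    return np.mean(out, axis=0)

def decompose_and_aggregate(task_grad_fns, w, rho=0.05):
    """Split each task's SAM gradient into its component along the plain
    gradient (the "low-loss direction") and the orthogonal residual (the
    "flat direction"), then aggregate the two groups separately."""
    low_loss, flat = [], []
    for grad_fn in task_grad_fns:
        g_plain = grad_fn(w)                    # ordinary descent direction
        g_sam = sam_gradient(grad_fn, w, rho)   # sharpness-aware gradient
        u = g_plain / (np.linalg.norm(g_plain) + 1e-12)
        proj = np.dot(g_sam, u) * u             # low-loss component
        low_loss.append(proj)
        flat.append(g_sam - proj)               # flatness-seeking component
    return _deconflict(low_loss) + _deconflict(flat)
```

With a single task the two components recombine into the plain SAM gradient; the separation only matters once multiple conflicting tasks are aggregated per group.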
- Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations
This paper proposes URMMDG, a framework that constructs a cross-modal unified representation space via supervised contrastive learning and decouples class-generic information from modality/domain-specific information through mutual information minimization. This enables effective transfer of classical single-modal domain generalization methods (Mixup, JiGen, IBN-Net) to multimodal domain generalization (MMDG) settings, achieving state-of-the-art performance on the EPIC-Kitchens and HAC benchmarks.
- Certifiably Optimal Anisotropic Rotation Averaging
This paper proposes a novel SDP relaxation that enforces solutions to lie within the convex hull of SO(3), conv(SO(3)), achieving for the first time certifiably globally optimal rotation averaging under anisotropic cost functions. It resolves the fundamental failure of the conventional O(3) relaxation in anisotropic settings.
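For intuition, the problem can be written in a generic form (notation illustrative, not necessarily the paper's exact statement): given relative rotation measurements \(\tilde R_{ij}\) over a graph \(\mathcal{E}\) with per-edge noise covariances \(\Sigma_{ij}\),

```latex
\min_{R_1,\dots,R_n \,\in\, \mathrm{SO}(3)}\;
\sum_{(i,j)\in\mathcal{E}}
\operatorname{vec}\!\bigl(R_j - R_i \tilde R_{ij}\bigr)^{\top}
\Sigma_{ij}^{-1}\,
\operatorname{vec}\!\bigl(R_j - R_i \tilde R_{ij}\bigr)
```

With \(\Sigma_{ij} = \sigma^2 I\) this reduces to the isotropic chordal cost \(\sum \|R_j - R_i \tilde R_{ij}\|_F^2\), for which the classical relaxation to \(\mathrm{O}(3)\) (keeping only \(R^{\top}R = I\)) is known to often remain tight; under general \(\Sigma_{ij}\) that relaxation can fail, which motivates the tighter constraint \(R_i \in \mathrm{conv}(\mathrm{SO}(3))\), a set known to admit an exact linear matrix inequality description.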
- CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
This paper proposes CombatVLA, an efficient 3B-parameter VLA model designed for combat tasks in 3D action role-playing games. Through the Action-of-Thought data format and a truncated inference strategy, CombatVLA achieves inference speeds up to 50× faster than existing VLM-based game frameworks while surpassing human players in combat success rate.
- COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation
This paper proposes COSMO, a low-cost VLN architecture combining selective memorization, which replaces the computationally expensive attention mechanisms in Transformers with two customized selective state space modules—Round Selective Scan (RSS, capturing global context in a single scan pass) and Cross-modal Selective State Space Module (CS3, dual-stream cross-modal interaction)—achieving navigation performance surpassing the baseline DUET with only 15.5% of its parameters and 9.3% of its FLOPs.
- DexVLG: Dexterous Vision-Language-Grasp Model at Scale
This paper presents DexVLG — the first large-scale vision-language-dexterous-grasp model. It introduces DexGraspNet 3.0, a dataset comprising 174K objects and 170M grasp poses with part-level semantic annotations. By combining a VLM encoder with a Flow Matching pose prediction head, DexVLG achieves over 76% zero-shot execution success in simulation and demonstrates semantically aligned dexterous grasping in the real world.
- Embodied Representation Alignment with Mirror Neurons
Inspired by mirror neurons, this paper aligns the intermediate representations of action understanding (observing others' behavior) and embodied execution (autonomously performing actions) into a shared latent space via contrastive learning. The work reveals a spontaneous alignment phenomenon between the two model families that correlates with task success rate, and demonstrates that explicit alignment yields improvements on action recognition (+3.3%) and robot manipulation (+3.5%).
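The alignment objective is plausibly a standard symmetric contrastive loss between paired features from the two model families. A minimal numpy sketch, assuming an InfoNCE-style loss where row \(i\) of one batch is the positive for row \(i\) of the other (the paper's exact loss and encoders are not reproduced here):

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE loss aligning two batches of paired embeddings,
    e.g. z_a from an action-understanding encoder and z_b from an
    embodied-execution (policy) encoder, for the same episodes."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature          # pairwise cosine similarities
    labels = np.arange(len(z_a))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # cross-entropy in both directions (a -> b and b -> a)
    return 0.5 * (xent(logits) + xent(logits.T))
```

Correctly paired batches yield a lower loss than mismatched ones, which is exactly the pressure that pulls the two representation spaces together.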
- EvolvingGrasp: Evolutionary Grasp Generation via Efficient Preference Alignment
This paper proposes EvolvingGrasp, which achieves efficient evolutionary generation and human preference alignment for dexterous grasp pose synthesis via Handpose-wise Preference Optimization (HPO) and a Physics-Aware Consistency Model (PCM), attaining state-of-the-art performance on four benchmark datasets with a 30× inference speedup.
- GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
This paper presents GUIOdyssey, the first comprehensive dataset for cross-app GUI navigation on mobile devices (8,334 episodes, 212 apps, 1,357 app combinations), along with OdysseyAgent—a multimodal navigation agent equipped with a history resampling module that significantly improves cross-app task performance while balancing accuracy and inference efficiency.
- iManip: Skill-Incremental Learning for Robotic Manipulation
This paper proposes iManip, a skill-incremental learning framework that lets robots continually acquire new manipulation skills through a temporal replay strategy and a scalable PerceiverIO architecture, mitigating catastrophic forgetting of previously learned skills without retraining from scratch. iManip achieves an average improvement of 9.4% over conventional incremental learning baselines on RLBench.
- Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning
This paper proposes IMMP (Interaction-Merged Motion Planning), a two-stage strategy — Interaction-Conserving Pre-Merging (constructing a multi-metric checkpoint pool) and Interaction Transfer with Merging (task-vector-based weighted merging grouped by interaction modules) — to transfer agent behavior and interaction knowledge from diverse trajectory datasets to a target domain, effectively improving cross-domain adaptability of motion planning.
- TesserAct: Learning 4D Embodied World Models
TesserAct is a 4D embodied world model that trains a video generative model to jointly predict RGB, depth, and normal videos, which are subsequently converted into high-quality 4D scenes, enabling spatiotemporally consistent 3D world dynamics simulation and robot action planning.
- Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
This paper proposes Moto, a framework that encodes inter-frame visual motion from video into discrete sequences via unsupervised Latent Motion Tokens. A GPT-style autoregressive pre-training scheme is employed to learn motion priors, which are then transferred to real robot manipulation through a co-fine-tuning strategy. Moto achieves performance competitive with 55B-parameter models on the SIMPLER and CALVIN benchmarks using only 98M parameters.
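The core idea, discretizing inter-frame motion into a token vocabulary and learning an autoregressive prior over it, can be caricatured in a few lines. Everything here is a toy stand-in: a nearest-neighbor codebook in place of Moto's learned VQ tokenizer, and a smoothed bigram model in place of GPT-style pre-training.

```python
import numpy as np

def tokenize_motion(frames, codebook):
    """Map each inter-frame feature difference to its nearest codebook
    entry, yielding a discrete 'latent motion token' sequence (a crude
    stand-in for a learned vector-quantized motion tokenizer)."""
    diffs = frames[1:] - frames[:-1]                    # naive motion feature
    d = ((diffs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)                             # token ids, shape (T-1,)

def bigram_motion_prior(token_seqs, vocab_size):
    """Toy autoregressive prior over motion tokens: bigram counts with
    add-one smoothing, standing in for GPT-style next-token training."""
    counts = np.ones((vocab_size, vocab_size))
    for seq in token_seqs:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)   # rows: P(next | current)
```

The point of the token bottleneck is that the prior is learned over a small discrete vocabulary rather than raw pixels, which is what makes the motion knowledge cheap to transfer to a robot policy head.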
- NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments
This paper proposes NavMorph, an RSSM-based self-evolving world model that models continuous environment dynamics in latent space via a World-aware Navigator and a Foresight Action Planner, and introduces a Contextual Evolution Memory (CEM) for rapid online test-time adaptation.
- PacGDC: Label-Efficient Generalizable Depth Completion with Projection Ambiguity and Consistency
This paper proposes PacGDC, which exploits the inherent shape and position ambiguities in 2D-to-3D projection to synthesize large quantities of pseudo-geometric data—using multiple depth foundation models as scale manipulators—thereby achieving generalizable depth completion with minimal annotation cost, attaining state-of-the-art performance in both zero-shot and few-shot settings.
- PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation
This paper proposes PASG (Primitive-Aware Semantic Grounding), a closed-loop framework that dynamically couples low-level geometric features with high-level task semantics through automated geometric primitive extraction (keypoints, functional axes, principal axes) and VLM-driven semantic anchoring. PASG achieves near-human-annotation performance on robotic manipulation tasks, and introduces the Robocasa-PA benchmark along with the fine-tuned model Qwen2.5VL-PA.
- Rep-MTL: Unleashing the Power of Representation-Level Task Saliency for Multi-Task Learning
This paper proposes Rep-MTL, a multi-task optimization method grounded in representation-level task saliency. It mitigates negative transfer and explicitly promotes cross-task complementarity via entropy-regularized task-specific saliency regulation (TSR) and sample-level cross-task saliency alignment (CSA), without modifying the optimizer or network architecture.
- Resolving Token-Space Gradient Conflicts: Token Space Manipulation for Transformer-Based Multi-Task Learning
This paper proposes DTME-MTL, a framework that identifies and categorizes gradient conflicts in token space into range-space conflicts and null-space conflicts, and addresses them via Token Modulation (affine transformation) and Token Expansion (task-specific token insertion), respectively, to mitigate negative transfer in Transformer-based multi-task learning with minimal parameter overhead.
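The two operations themselves are simple to picture. A minimal sketch, with shapes and function names assumed for illustration (the paper's actual placement of these operations inside the Transformer, and how the parameters are learned, is not shown):

```python
import numpy as np

def token_modulation(tokens, gamma, beta):
    """Task-specific affine transform of shared tokens (a minimal stand-in
    for Token Modulation): each task reuses the same shared tokens,
    rescaled by gamma and shifted by beta along the feature dimension."""
    return tokens * gamma + beta          # tokens: (n, d); gamma, beta: (d,)

def token_expansion(tokens, task_tokens):
    """Append a small set of task-specific tokens to the shared sequence
    (a stand-in for Token Expansion); downstream self-attention then lets
    shared tokens attend to task-specific context."""
    return np.concatenate([tokens, task_tokens], axis=0)
```

Intuitively, modulation handles conflicts that live in the span of the existing tokens (range space), while expansion adds capacity for directions the shared tokens cannot represent (null space), at the cost of only a few extra tokens per task.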
- Selective Contrastive Learning for Weakly Supervised Affordance Grounding
This paper proposes a selective contrastive learning approach for weakly supervised affordance grounding (WSAG). By combining prototypical contrastive learning and pixel-level contrastive learning, the method adaptively learns affordance-relevant cues at both object and part granularities, effectively preventing the model from attending to action-irrelevant salient features. The approach comprehensively outperforms competing methods that rely on stronger foundation models (GPT-4, LLaVA, etc.) on AGD20K and HICO-IIF benchmarks.
- Self-supervised Learning of Hybrid Part-aware 3D Representations of 2D Gaussians and Superquadrics
This paper proposes PartGS, a self-supervised part-aware 3D reconstruction framework that couples 2D Gaussian Splatting with superquadrics in a hybrid representation. Through parameter sharing and multiple regularization terms, PartGS achieves simultaneous high-quality geometric decomposition and texture reconstruction, outperforming state-of-the-art methods by 75.9% in reconstruction accuracy and 16.13 dB in PSNR on DTU, ShapeNet, and real-world scenes.
- SITE: towards Spatial Intelligence Thorough Evaluation
This paper presents SITE, a comprehensive spatial intelligence benchmark grounded in a tripartite cognitive-science taxonomy. It comprises 8,068 multiple-choice VQA tasks spanning 31 datasets (images and videos). Evaluation results show that the strongest VLM (GPT-4o) still lags human experts by approximately 32% on overall spatial reasoning, and VLM spatial intelligence scores are highly correlated with robotic manipulation success rates (Pearson \(r=0.902\)).
- TransiT: Transient Transformer for Non-line-of-sight Videography
TransiT is a novel architecture for real-time NLOS video reconstruction that achieves 64×64 resolution at 10 FPS from sparse fast-scan (16×16, 0.4 ms/point) transient measurements. The system integrates transient compression, inter-frame feature fusion, and a spatiotemporal Transformer, and further proposes an MMD-based transfer learning strategy to bridge the distribution gap between synthetic and real data.
- UnZipLoRA: Separating Content and Style from a Single Image
This paper proposes UnZipLoRA, a method that simultaneously trains two decoupled and compatible LoRAs (a content LoRA and a style LoRA) from a single image. Through three strategies—prompt separation, column separation, and block separation—the method achieves effective disentanglement of content and style, enabling independent manipulation and free recombination. UnZipLoRA surpasses DreamBooth-LoRA, Inspiration Tree, and B-LoRA across all user preference metrics.
- Weakly-Supervised Learning of Dense Functional Correspondences
This paper defines the task of Dense Functional Correspondence—establishing pixel-level dense correspondences between objects of different categories based on shared functionality (e.g., "pouring")—and proposes a weakly-supervised learning framework that distills functional and structural knowledge into a new model via VLM-based pseudo-labeling of functional parts combined with multi-view contrastive learning.