🤖 Robotics & Embodied AI

🤖 AAAI2026 · 30 paper notes

📌 Same area in other venues: 📷 CVPR2026 (130) · 🔬 ICLR2026 (162) · 💬 ACL2026 (11) · 🧪 ICML2026 (53) · 🧠 NeurIPS2025 (73) · 📹 ICCV2025 (26)

🔥 Top topics: Robotics ×12 · Reinforcement Learning ×6 · Adversarial Robustness ×5 · Multimodal/VLM ×4 · Alignment/RLHF ×3

10 Open Challenges Steering the Future of Vision-Language-Action Models

This paper systematically surveys 10 open challenges facing VLA models — multimodal perception, robust reasoning, high-quality training data, evaluation, cross-robot action generalization, resource efficiency, whole-body coordination, safety assurance, agent frameworks, and human-robot collaboration — and discusses four emerging trends: spatial understanding, world dynamics modeling, post-training, and data synthesis.

A Computable Game-Theoretic Framework for Multi-Agent Theory of Mind

This paper proposes a game-theoretic framework based on Poisson cognitive hierarchy, achieving computable multi-agent Theory of Mind via Gamma-Poisson conjugate Bayesian updates. The framework supports recursive bounded-rationality decision-making and online belief revision while avoiding the undecidability of POMDPs.
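The conjugate update mentioned above can be illustrated with a minimal sketch. It assumes an opponent's reasoning level is modeled as Poisson with rate λ and that λ has a Gamma(α, β) prior in shape/rate form; after observing n levels, the posterior is Gamma(α + Σk, β + n). The function names and the example numbers are illustrative assumptions, not taken from the paper.

```python
def gamma_poisson_update(alpha, beta, observed_levels):
    """Posterior (shape, rate) of a Gamma prior on a Poisson rate.

    alpha, beta: prior shape and rate (assumed parameterization).
    observed_levels: non-negative integer observations, e.g. inferred
    reasoning depths of another agent (hypothetical data).
    """
    alpha_post = alpha + sum(observed_levels)  # shape grows by the total count
    beta_post = beta + len(observed_levels)    # rate grows by the number of observations
    return alpha_post, beta_post


def posterior_mean(alpha, beta):
    """Mean of a Gamma(shape=alpha, rate=beta) distribution."""
    return alpha / beta


# Online belief revision: fold in observations as they arrive.
alpha1, beta1 = gamma_poisson_update(2.0, 1.0, [1, 0, 2, 1])
estimated_rate = posterior_mean(alpha1, beta1)
```

Because the posterior is again a Gamma distribution, each new observation costs only two additions, which is what makes this kind of belief revision cheap to run online.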

Actor-Critic for Continuous Action Chunks: A Reinforcement Learning Framework for Long-Horizon Robotic Manipulation with Sparse Reward

This paper proposes AC3, an actor-critic framework that directly learns continuous action sequences (action chunks). It stabilizes long-horizon robotic manipulation under sparse rewards with an asymmetric actor update rule, which updates the actor only from successful trajectories, and with self-supervised anchor-based intrinsic rewards. The method achieves higher success rates than existing approaches across 25 tasks on BiGym and RLBench.

Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Open-Vocabulary Mobile Manipulation

This paper addresses the base placement problem in open-vocabulary mobile manipulation (OVMM) and proposes a zero-shot framework that constructs a cross-modal representation (Affordance RGB + Obstacle Map+) to project semantic affordance cues onto an obstacle map, followed by a coarse-to-fine iterative optimization that balances semantic and geometric constraints. The method achieves an 85% success rate across five manipulation tasks, substantially outperforming both geometric planners and pure VLM-based approaches.

Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning

This paper proposes the CCoL framework, which addresses both physical discontinuity in action sequences and semantic-physical misalignment in Behavioral Cloning through NeuralODE-driven Multimodal Continuous Co-learning (MCC) and bidirectional cross-attention-based Cross-modal Semantic-Physical Alignment (CSA). CCoL achieves an average relative improvement of 8.0% across three simulation platforms, with up to 19.2% on the bimanual insertion task.

Coordinated Humanoid Robot Locomotion with Symmetry Equivariant Reinforcement Learning Policy

This paper proposes SE-Policy, which directly embeds strict symmetry equivariance (actor) and symmetry invariance (critic) into the neural network architecture without additional hyperparameters, enabling humanoid robots to produce spatiotemporally coordinated natural locomotion. The velocity tracking error is reduced by 40% compared to DreamWaQ, and the policy is successfully deployed on a physical Unitree G1 robot.
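To make the two properties concrete: an actor π is mirror-equivariant if π(M_s s) = M_a π(s), and a critic V is mirror-invariant if V(M_s s) = V(s). The sketch below shows one generic way to obtain both, by symmetrizing an arbitrary function over the mirror group. It is not the SE-Policy architecture, which builds the constraint into the network itself; the toy 2-D mirror matrices, weights, and function names are all assumptions for illustration.

```python
import numpy as np

# Toy mirror operators (assumed): swap the "left" and "right" components.
# Both are involutions, i.e. applying them twice gives the identity.
M_S = np.array([[0.0, 1.0], [1.0, 0.0]])  # acts on states
M_A = np.array([[0.0, 1.0], [1.0, 0.0]])  # acts on actions

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2))  # arbitrary, non-symmetric actor weights
v = rng.normal(size=2)       # arbitrary critic weights


def base_policy(s):
    """Unconstrained actor with no symmetry built in."""
    return np.tanh(W @ s)


def base_value(s):
    """Unconstrained critic with no symmetry built in."""
    return float(v @ s)


def equivariant_policy(s):
    """Average the action with the mirrored action of the mirrored state."""
    return 0.5 * (base_policy(s) + M_A @ base_policy(M_S @ s))


def invariant_value(s):
    """Average the value over the state and its mirror image."""
    return 0.5 * (base_value(s) + base_value(M_S @ s))


s = np.array([0.3, -0.7])
mirrored_action = equivariant_policy(M_S @ s)
action_then_mirror = M_A @ equivariant_policy(s)
```

In this construction the two quantities at the end agree for every state, so mirroring the observation always mirrors the action.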

Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling

This paper proposes GRM, a framework that achieves robust fine-grained image-text alignment through intra-modal saliency/granularity-aware adapters and Gaussian mixture-based region-level uncertainty modeling, attaining state-of-the-art performance on Flickr30K and MS-COCO.

Dexterous Manipulation Transfer via Progressive Kinematic-Dynamic Alignment

This paper proposes the PKDA framework, which automatically converts human hand manipulation videos into high-quality manipulation trajectories for multi-fingered dexterous hands via progressive kinematic-dynamic alignment, achieving an average transfer success rate of 73%.

Distributionally Robust Online Markov Game with Linear Function Approximation

This paper studies online distributionally robust Markov games with linear function approximation. It is the first to identify the hardness of learning in this setting, and proposes the DR-CCE-LSI algorithm, which achieves minimax-optimal sample complexity with respect to the feature dimension \(d\) under a specific feature mapping condition.

From Woofs to Words: Towards Intelligent Robotic Guide Dogs with Verbal Communication

This paper proposes a dialogue system for robotic guide dogs that leverages LLMs and a task planner to achieve Plan Verbalization and Scene Verbalization, supporting multi-turn natural language dialogue to assist visually impaired users in navigation decision-making. The system's effectiveness is validated through a real-user study and simulation experiments.

GRIM: Task-Oriented Grasping with Conditioning on Generative Examples

This paper proposes GRIM (Grasp Re-alignment via Iterative Matching), a training-free task-oriented grasping (TOG) framework that employs a retrieve–align–transfer pipeline combining video generation models with a multi-source memory bank. By leveraging DINO-feature-based semantic 3D alignment, GRIM achieves functional grasp transfer across objects, surpassing GraspMolmo—trained on 379K samples—using only 210 memory instances.

H-GAR: A Hierarchical Interaction Framework via Goal-Driven Observation-Action Refinement for Robotic Manipulation

This paper proposes H-GAR, a hierarchical goal-driven framework that first predicts a goal observation and then synthesizes intermediate observations, while refining coarse-grained actions via a historical action memory bank. This design enables explicit bidirectional interaction between observations and actions, achieving state-of-the-art performance on both simulated and real-robot manipulation tasks.

Human-Centric Open-Future Task Discovery: Formulation, Benchmark, and Scalable Tree-Based Search

This paper formalizes the Human-Centric Open-Future Task Discovery (HOTD) problem—identifying tasks that reduce human burden across multiple possible futures in scenarios where human intentions are concurrent and dynamically evolving. The authors construct the HOTD-Bench benchmark (2K+ real-world videos) and propose CMAST (Collaborative Multi-Agent Search Tree), which substantially outperforms existing LMM baselines via a multi-agent system and a scalable search tree.

ManiLong-Shot: Interaction-Aware One-Shot Imitation Learning for Long-Horizon Manipulation

This paper proposes ManiLong-Shot, a framework comprising three modules—interaction-aware task decomposition, invariant region prediction, and region matching—that generalizes to 20 unseen long-horizon manipulation tasks after training on only 10 short-horizon tasks, achieving a one-shot imitation success rate of 30.2%, a relative improvement of 22.8% over the prior state of the art.

Object-Centric Latent Action Learning

This paper proposes an object-centric latent action learning framework that leverages self-supervised object decomposition (VideoSAUR) to disentangle task-relevant entities from visual distractions (e.g., dynamic backgrounds), reducing the performance degradation of LAPO on distracted videos by approximately 50%. A linear action probe is used to automatically select control-relevant slots.

PanoNav: Mapless Zero-Shot Object Navigation with Panoramic Scene Parsing and Dynamic Memory

This paper proposes PanoNav, a mapless zero-shot object navigation framework that uses only RGB images. It unlocks the spatial reasoning capability of MLLMs through Panoramic Scene Parsing and introduces a Dynamic Bounded Memory Queue to prevent local deadlock.
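A bounded memory queue of this kind can be sketched with a fixed-capacity deque: new entries push out the oldest ones, so the context handed to the model stays short while still recording where the agent has recently been. The class name, the stored strings, and the revisit check are hypothetical; the summary does not specify what PanoNav actually stores or how it queries the queue.

```python
from collections import deque


class BoundedMemory:
    """Fixed-capacity memory of recent observations (illustrative only)."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest item when full.
        self._items = deque(maxlen=capacity)

    def add(self, note):
        self._items.append(note)

    def seen_recently(self, note):
        """True if the same note is still in memory, a hint that the agent is looping."""
        return note in self._items

    def as_prompt(self):
        """Render the memory as numbered lines, oldest first."""
        return "\n".join(f"{i + 1}. {n}" for i, n in enumerate(self._items))


memory = BoundedMemory(capacity=3)
for step_note in ["hallway", "kitchen door", "kitchen", "dining area"]:
    memory.add(step_note)
```

After the loop, the oldest entry has been evicted, so only the three most recent notes remain available for the next decision.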

Realistic Synthetic Household Data Generation at Scale

This paper proposes an LLM-driven, bidirectionally coupled generation framework for producing large-scale synthetic datasets to train home robots. The datasets cover household environment configurations, human activities, and human-robot interactions (HRI), and are generated iteratively: persona profiles drive environment generation, and environment semantics in turn guide activity generation.

RLSLM: A Hybrid Reinforcement Learning Framework Aligning Rule-Based Social Locomotion Model with Human Social Norms

This paper proposes RLSLM, a hybrid framework that embeds a psychology-experiment-driven rule-based Social Locomotion Model (SLM) into the reward function of reinforcement learning, enabling agents to efficiently learn navigation policies aligned with human social norms in crowd environments. VR experiments demonstrate that RLSLM achieves significantly higher comfort ratings than existing rule-based baselines.

Robust Out-of-Order Retrieval for Grid-Based Storage at Maximum Capacity

For the problem of uncertain retrieval order in fully loaded 2D grid-based storage systems, this paper proposes the k-bounded perturbation uncertainty model, proves that \(\Theta(k)\) columns are both necessary and sufficient for zero relocation, and presents an efficient robust storage solver and a greedy retrieval strategy. The approach nearly eliminates relocations when \(k \leq 0.5c\) and still reduces relocations by more than 50% when \(k\) reaches \(c\).

Scalable Multi-Objective and Meta Reinforcement Learning via Gradient Estimation

This paper proposes PolicyGradEx, which efficiently estimates policy adaptation performance on arbitrary task subsets via first-order gradient approximation and surrogate models, constructs a task affinity matrix, and performs task grouping through convex optimization. PolicyGradEx outperforms state-of-the-art baselines by an average of 16% on multi-objective RL and meta-RL benchmarks, with a speedup of up to 26×.

SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation

This paper proposes the SemanticVLA framework, which integrates three modules — a Semantic-guided Dual-encoder Pruner (SD-Pruner), a Semantic-complementary Hierarchical Fuser (SH-Fuser), and a Semantic-conditioned Action Coupler (SA-Coupler) — to substantially reduce visual redundancy while enhancing instruction–vision–action alignment. On the LIBERO benchmark, SemanticVLA achieves a 97.7% success rate, surpassing OpenVLA by 21.1%, while reducing training cost and inference latency by 3.0× and 2.7×, respectively.

Sim-to-Real: An Unsupervised Noise Layer for Screen-Camera Watermarking Robustness

This paper proposes the Simulation-to-Real (S2R) framework, which introduces a novel two-stage noise approximation strategy of "mathematical modeling → unsupervised domain transfer": a mathematical transform \(T\) first maps clean images to a known noise domain \(\mathcal{C}\), and an unsupervised image-to-image network \(G\) then maps \(\mathcal{C}\) to the real screen-camera (SC) noise domain \(\mathcal{U}\). Without requiring paired data, S2R accurately approximates real SC noise and achieves state-of-the-art watermarking robustness (BER reduced by 30–60%) and image quality (PSNR 42.27 dB / SSIM 0.962) across multiple devices, angles, and distances.

SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation

This paper proposes SpatialActor, a framework that explicitly disentangles semantic and geometric representations. It introduces a Semantic-Guided Geometry Module (SGM) that adaptively fuses noisy depth features with a pretrained depth estimation expert prior, and a Spatial Transformer (SPT) that encodes low-level spatial position cues. SpatialActor achieves an 87.4% success rate on 50+ RLBench tasks, exceeding the prior state of the art by 6.0%, and outperforms RVT-2 by 19.4% under heavy-noise conditions.

Test-driven Reinforcement Learning in Continuous Control

This paper proposes the Test-driven Reinforcement Learning (TdRL) framework, which replaces a single reward function with multiple test functions — pass-fail tests defining optimality criteria and indicative tests guiding learning — to represent task objectives. A return function is learned via lexicographic-heuristic trajectory comparison, matching or surpassing hand-crafted reward methods on the DeepMind Control Suite while naturally supporting multi-objective optimization.
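One simple reading of a lexicographic comparison is sketched below: trajectories are ranked first by how many pass-fail tests they satisfy, and indicative tests only break ties. This is an illustrative assumption, not the paper's comparison rule or its learned return function; the toy trajectories (lists of 1-D positions) and both test functions are hypothetical.

```python
def trajectory_key(traj, pass_fail_tests, indicative_tests):
    """Sort key: (number of pass-fail tests passed, summed indicative score)."""
    passed = sum(1 for test in pass_fail_tests if test(traj))
    indicative = sum(test(traj) for test in indicative_tests)
    return (passed, indicative)


def prefer(traj_a, traj_b, pass_fail_tests, indicative_tests):
    """Return 1 if traj_a is preferred, -1 if traj_b is, 0 if tied.

    Python compares tuples lexicographically, so the pass-fail count
    always dominates the indicative score.
    """
    key_a = trajectory_key(traj_a, pass_fail_tests, indicative_tests)
    key_b = trajectory_key(traj_b, pass_fail_tests, indicative_tests)
    return (key_a > key_b) - (key_a < key_b)


# Hypothetical tests on a 1-D reaching task.
def reaches_goal(traj):
    return traj[-1] >= 1.0  # pass-fail: final position at or past the goal


def negative_path_length(traj):
    return -sum(abs(b - a) for a, b in zip(traj, traj[1:]))  # indicative: shorter is better


pass_fail = [reaches_goal]
indicative = [negative_path_length]

success = [0.0, 0.5, 1.2]     # reaches the goal, but travels farther
short_fail = [0.0, 0.2, 0.3]  # better indicative score, misses the goal
shorter_fail = [0.0, 0.1, 0.2]
```

Here `success` is preferred over `short_fail` even though its indicative score is worse, while the two failing trajectories are ordered by the indicative test alone. Preference labels of this form are the kind of signal from which a return function could then be fitted.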

Theory of Mind for Explainable Human-Robot Interaction

This paper proposes positioning Theory of Mind (ToM) as a form of Explainable AI (XAI), systematically evaluates existing ToM research in HRI using the seven criteria of the VXAI framework, identifies critical deficiencies (most notably the absence of fidelity assessment), and advocates for integrating ToM into XAI frameworks to achieve user-oriented explanations.

TouchFormer: A Robust Transformer-based Framework for Multimodal Material Perception

This paper proposes TouchFormer, a robust multimodal fusion framework that achieves reliable material perception under vision-impaired conditions through three complementary modules: Modality-Adaptive Gating (MAG), intra- and inter-modal attention mechanisms, and Cross-Instance Embedding Regularization (CER). The approach is validated in a robotic sorting experiment under simulated fire scenarios.

Towards Affordance-Aware Robotic Dexterous Grasping with Human-like Priors

This paper proposes AffordDex, a two-stage framework: the first stage pre-trains human hand motion priors (natural motion trajectories) via imitation learning; the second stage refines the policy through reinforcement learning using a residual module and VLM-guided Negative Affordance Annotation (NAA), achieving dexterous robotic grasping that is both human-like in naturalness and functionally correct (e.g., avoiding the blade and grasping the handle of a knife). The method significantly outperforms state-of-the-art approaches across multiple generalization levels.

Towards Reinforcement Learning from Neural Feedback: Mapping fNIRS Signals to Agent Performance

This paper proposes the NEURO-LOOP framework, which leverages fNIRS (functional near-infrared spectroscopy) brain signals as implicit neural feedback to evaluate RL agent performance. The authors release an fNIRS dataset spanning 25 subjects × 3 domains × 6 conditions. Classification F1 reaches 67% (binary) / 46% (multi-class), with cross-subject fine-tuning yielding improvements of 17% and 41% respectively, laying the groundwork for Reinforcement Learning from Neural Feedback (RLNF).

TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models

This paper proposes TTF-VLA, a training-free temporal token fusion method that selectively reuses visual tokens from historical frames. Reuse is decided by a dual-dimension mechanism that combines grayscale pixel difference with attention-based semantic detection. The method improves the inference quality of VLA models on robotic manipulation tasks, with an average gain of 4.0 percentage points on LIBERO.
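The pixel-difference half of such a mechanism can be sketched as follows: compute the mean absolute grayscale change per image patch, keep the previous frame's token wherever the patch is nearly unchanged, and take the fresh token elsewhere. The attention-based semantic branch is omitted, and the function names, patch size, and threshold are assumptions rather than the method's actual settings.

```python
import numpy as np


def patch_change_mask(prev_gray, curr_gray, patch, threshold):
    """Boolean mask over patches: True where the patch changed noticeably.

    prev_gray, curr_gray: 2-D grayscale frames of identical shape whose
    height and width are multiples of `patch` (assumed for simplicity).
    """
    h, w = curr_gray.shape
    diff = np.abs(curr_gray.astype(np.float32) - prev_gray.astype(np.float32))
    # Split into (patch_rows, patch, patch_cols, patch) and average inside each patch.
    per_patch = diff.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    # Flatten row-major so index i corresponds to the i-th visual token.
    return (per_patch > threshold).reshape(-1)


def fuse_tokens(prev_tokens, curr_tokens, changed):
    """Use current tokens for changed patches and reuse previous tokens otherwise."""
    return np.where(changed[:, None], curr_tokens, prev_tokens)


# Toy example: a 4x4 frame with 2x2 patches, so 4 tokens of width 3.
prev_frame = np.zeros((4, 4), dtype=np.uint8)
curr_frame = prev_frame.copy()
curr_frame[0:2, 0:2] = 100  # only the top-left patch changes

prev_tokens = np.zeros((4, 3), dtype=np.float32)
curr_tokens = np.ones((4, 3), dtype=np.float32)

changed = patch_change_mask(prev_frame, curr_frame, patch=2, threshold=10.0)
fused = fuse_tokens(prev_tokens, curr_tokens, changed)
```

Only the first token is refreshed in this example; the other three are carried over from the previous frame.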

UrbanNav: Learning Language-Guided Urban Navigation from Web-Scale Human Trajectories

This paper proposes UrbanNav, which leverages web-scale urban walking videos (1,500+ hours from YouTube, yielding 3 million instruction–trajectory–landmark triplets) to train a language-guided urban navigation policy via an automated annotation pipeline and robust filtering mechanism, achieving an 83.3% navigation success rate in real-world deployment.