🤖 Robotics & Embodied AI

🧠 NeurIPS2025 · 73 paper notes

📌 Same area in other venues: 📷 CVPR2026 (130) · 🔬 ICLR2026 (162) · 💬 ACL2026 (11) · 🧪 ICML2026 (53) · 🤖 AAAI2026 (30) · 📹 ICCV2025 (26)

🔥 Top topics: Reinforcement Learning ×14 · Robotics ×11 · Multimodal/VLM ×8 · Agents ×7 · Reasoning ×6

A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning

This work is the first to introduce data attribution into online reinforcement learning. It proposes a local attribution framework to quantify each training record's contribution to policy updates, and builds upon it an Iterative Influence Filtering (IIF) algorithm that substantially improves sample efficiency and final performance on both classical RL benchmarks and LLM RLHF.

Act to See, See to Act: Diffusion-Driven Perception-Action Interplay for Adaptive Policies

This paper proposes DP-AG (Action-Guided Diffusion Policy), which uses the Vector-Jacobian Product (VJP) of a diffusion policy's noise prediction as a structured stochastic force to drive dynamic evolution of latent observation features across diffusion steps, and closes the perception-action loop via a cycle-consistent contrastive loss. DP-AG achieves gains of +6% on Push-T, +13% on Dynamic Push-T, and more than 23% in success rate on a real UR5 robot.

Adaptive Frontier Exploration on Graphs with Applications to Network-Based Disease Testing

This paper proposes the Adaptive Frontier Exploration on Graphs (AFEG) framework and designs a Gittins index-based policy that is provably optimal when the graph is a forest. On real-world sexually transmitted disease testing networks, the method identifies nearly all HIV-positive individuals by testing only half the population, substantially outperforming greedy and DQN baselines.

Adversarial Locomotion and Motion Imitation for Humanoid Policy Learning

ALMI proposes an upper-lower body adversarial training framework: the lower-body policy learns robust locomotion under upper-body motion perturbations, while the upper-body policy learns precise motion imitation under lower-body locomotion perturbations. Through iterative adversarial training converging to a Nash equilibrium, the framework enables stable whole-body coordinated control on the Unitree H1-2 real robot.

Asymptotically Stable Quaternionic Hopfield Structured Neural Network with Supervised Projection-based Manifold Learning

This paper proposes a Quaternion-valued Supervised Hopfield-structured Neural Network (QSHNN) that employs a periodic projection strategy to maintain the quaternionic structural consistency of the weight matrix. The existence and uniqueness of fixed points and their asymptotic stability are established via Lyapunov theory, while bounded trajectory curvature guarantees path smoothness for robotic path planning.

Automaton Constrained Q-Learning

This paper proposes ACQL (Automaton Constrained Q-Learning), which translates Linear Temporal Logic (LTL) task specifications into automata and combines goal-conditioned learning with minimal safety constraints. ACQL is the first scalable method to simultaneously support sequential temporal goals and non-stationary safety constraints in continuous control environments.

AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling

AutoToM achieves fully automated model-based Theory of Mind inference—without requiring manual agent model specification—by automatically proposing Bayesian network structures and executing Bayesian inverse planning. Through uncertainty-driven iterative model refinement (adding mental variables or extending time steps), it achieves an average accuracy of 82.43% across 5 ToM benchmarks, surpassing SOTA models such as GPT-4o (63.39%) and o3-mini (73.94%).

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

AutoVLA integrates physical action tokens directly into a pretrained VLM (Qwen2.5-VL-3B), equips the model with fast/slow dual-thinking modes via SFT, and applies GRPO reinforcement fine-tuning to enable adaptive reasoning switching and optimize planning performance. The approach achieves competitive end-to-end driving performance across four major autonomous driving benchmarks: nuPlan, Waymo, nuScenes, and CARLA.

BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning

BEAST parameterizes action sequences as B-splines, estimating control points through ridge regression and uniformly quantizing them into fixed-length tokens. It achieves 20× token compression (100 steps → 5 tokens), mathematically guaranteed \(C^0\) continuity across action chunks, the top success rate on LIBERO-Long (86.4%), and an inference throughput of 617 Hz (2.14× faster than π₀ and 101× faster than OpenVLA).
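The tokenization idea can be illustrated with a minimal sketch for a single action dimension. It assumes a clamped cubic B-spline with 5 control points, a hand-picked ridge weight, and 256 uniform bins over [-1, 1]; none of these settings, nor the function names `encode` and `decode`, come from the paper.

```python
import math

def bspline_basis(i, k, t, knots):
    """Cox-de Boor recursion: value of the i-th degree-k basis function at t."""
    if k == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + k] > knots[i]:
        left = (t - knots[i]) / (knots[i + k] - knots[i]) * bspline_basis(i, k - 1, t, knots)
    if knots[i + k + 1] > knots[i + 1]:
        right = ((knots[i + k + 1] - t) / (knots[i + k + 1] - knots[i + 1])
                 * bspline_basis(i + 1, k - 1, t, knots))
    return left + right

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def encode(actions, n_ctrl=5, degree=3, ridge=1e-6, bins=256, lo=-1.0, hi=1.0):
    """Fit control points by ridge regression, then quantize them uniformly."""
    T = len(actions)
    n_inner = n_ctrl - degree - 1
    # Clamped knot vector on [0, 1] (assumed layout, not taken from the paper).
    knots = ([0.0] * (degree + 1)
             + [(j + 1) / (n_inner + 1) for j in range(n_inner)]
             + [1.0] * (degree + 1))
    # Keep t strictly below 1 so the half-open degree-0 intervals cover it.
    ts = [min(s / (T - 1), 1.0 - 1e-9) for s in range(T)]
    B = [[bspline_basis(i, degree, t, knots) for i in range(n_ctrl)] for t in ts]
    # Normal equations of ridge regression: (B^T B + ridge * I) c = B^T y.
    A = [[sum(B[s][i] * B[s][j] for s in range(T)) + (ridge if i == j else 0.0)
          for j in range(n_ctrl)] for i in range(n_ctrl)]
    rhs = [sum(B[s][i] * actions[s] for s in range(T)) for i in range(n_ctrl)]
    ctrl = solve(A, rhs)
    tokens = [round((min(max(c, lo), hi) - lo) / (hi - lo) * (bins - 1)) for c in ctrl]
    return tokens, B

def decode(tokens, B, bins=256, lo=-1.0, hi=1.0):
    """Map tokens back to control points and evaluate the spline."""
    ctrl = [lo + tok / (bins - 1) * (hi - lo) for tok in tokens]
    return [sum(b * c for b, c in zip(row, ctrl)) for row in B]

# Illustrative data: a smooth 100-step trajectory compressed to 5 tokens.
actions = [0.5 * math.sin(math.pi * s / 99) for s in range(100)]
tokens, B = encode(actions)
recon = decode(tokens, B)
max_err = max(abs(a - r) for a, r in zip(actions, recon))
```

The token count is fixed by the number of control points rather than by the trajectory length, which is where the fixed-length property comes from in this sketch.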

Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents

Meta proposes WAGIBench, a multimodal goal inference benchmark for assistive wearable agents, comprising 3,477 egocentric recordings (29 hours) from 348 participants across four modalities — visual, audio, digital, and longitudinal. Human accuracy reaches 93% versus the best VLM at 84% (MCQ); under generative evaluation, models produce relevant goals only 55% of the time, exposing a substantial gap between current VLMs and real-world wearable deployment.

Beyond Parallelism: Synergistic Computational Graph Effects in Multi-Head Attention

This paper reframes multi-head attention as a system of multiple feedforward DAGs sharing a common sink node, and theoretically demonstrates that multiple heads can achieve synergistic effects through cross-head paths—reducing mixing time and amplifying minimax fidelity—with empirical validation on sequential operation tasks.

Can Agents Fix Agent Issues?

This paper presents the first systematic study of automated issue resolution in LLM-based agent systems. Through manual analysis of 201 real-world agent issues, the authors construct a taxonomy comprising 6 categories and 20 subcategories, invest approximately 500 person-hours to build AgentIssue-Bench—a benchmark of 50 reproducible tasks—and find that state-of-the-art software engineering (SE) agents (e.g., SWE-agent, Agentless, AutoCodeRover) achieve correct resolution rates of only 3.33%–12.67% on agent issues, far below their 23%–51% rates on conventional software.

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

CogVLA proposes a three-stage VLA architecture inspired by human multimodal cognition—comprising EFA-Routing for visual aggregation and compression to 25%, LFP-Routing for instruction-aware pruning of 50% of tokens within the LLM, and V-L-A coupled attention—achieving a 97.4% success rate on LIBERO with 2.5× training and 2.8× inference speedups over SOTA methods such as OpenVLA-OFT, and a 70.0% success rate on real-robot tasks.

COOPERA: Continual Open-Ended Human-Robot Assistance

This paper proposes the COOPERA framework, the first to enable continual, open-ended human-robot collaboration research. LLM-driven simulated humans with psychological traits and long-term intentions interact with robots over multiple days in a 3D environment. The robot progressively improves its personalized assistance by learning human characteristics and contextual intentions.

DexFlyWheel: A Scalable Self-Improving Data Generation Framework for Dexterous Manipulation

This paper proposes DexFlyWheel, a dexterous manipulation data generation framework that starts from a single human demonstration and progressively scales data diversity through a self-improving loop composed of IL, residual RL, and data augmentation. The framework generates 2,000+ demonstrations across 4 tasks, achieving an average policy success rate of 81.9% and a real-world transfer success rate of 78.3%.

DynaNav: Dynamic Feature and Layer Selection for Efficient Visual Navigation

DynaNav dynamically adjusts feature and layer usage according to scene complexity via a trainable hard feature selector and a Bayesian optimization-based early-exit mechanism, achieving a 2.26× FLOPs reduction and a 42.3% decrease in inference time in visual navigation while maintaining or improving navigation performance.

EfficientNav: Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval

Through discrete memory caching (group-independent KV cache computation with selective loading), attention-driven clustering (LLM shallow-layer attention guiding grouping), and semantics-aware retrieval (CLIP + knapsack problem adapted to varying memory budgets), EfficientNav is the first system to achieve zero-shot ObjNav on Jetson Orin using LLaMA-3.2-11b, surpassing the GPT-4 baseline by 11.1% SR while reducing real-time latency by 6.7×.
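The retrieval step is described as a knapsack problem under a memory budget. A minimal sketch of that formulation follows, using a textbook 0/1 knapsack; the relevance scores, integer memory costs, and the name `select_groups` are illustrative assumptions, not the paper's implementation.

```python
def select_groups(scores, costs, budget):
    """0/1 knapsack: choose map groups with maximum total relevance within a memory budget.

    scores: relevance of each group (e.g. a similarity score; assumed here)
    costs:  integer memory cost of loading each group's cache
    budget: integer memory budget
    """
    n = len(scores)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]          # option 1: skip group i-1
            if costs[i - 1] <= b:                # option 2: load group i-1
                cand = best[i - 1][b - costs[i - 1]] + scores[i - 1]
                if cand > best[i][b]:
                    best[i][b] = cand
    # Backtrack to recover which groups were selected.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return sorted(chosen), best[n][budget]

# Hypothetical groups: relevance scores and memory costs are made up.
chosen, total = select_groups([0.9, 0.5, 0.4, 0.2], [4, 3, 2, 1], budget=6)
```

Changing `budget` changes the selected set without retraining anything, which is the sense in which a knapsack formulation adapts to varying memory budgets.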

EgoBridge: Domain Adaptation for Generalizable Imitation from Egocentric Human Data

This paper proposes EgoBridge, a framework that uses Optimal Transport (OT) to align the joint distribution (features + actions) of human and robot data in a shared policy latent space, combined with Dynamic Time Warping (DTW) to construct pseudo-pairs, enabling cross-embodiment knowledge transfer from egocentric human data to robots, achieving up to 44% absolute improvement in success rate on real-world tasks.
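Dynamic Time Warping is the piece of this pipeline that is easiest to show in isolation. The sketch below aligns two 1-D sequences of different lengths and returns index pairs; treating those pairs as pseudo-pairs, the scalar signals, and the name `dtw_pairs` are assumptions for illustration only.

```python
def dtw_pairs(human, robot):
    """Classic DTW on two 1-D sequences; returns aligned index pairs and total cost."""
    n, m = len(human), len(robot)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(human[i - 1] - robot[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor cell.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        pairs.append((i - 1, j - 1))
        _, di, dj = min((cost[i - 1][j - 1], 1, 1),
                        (cost[i - 1][j], 1, 0),
                        (cost[i][j - 1], 0, 1))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs, cost[n][m]

# Hypothetical signals: the human demonstration is slower than the robot one.
human_traj = [0, 0, 1, 2, 2, 3]
robot_traj = [0, 1, 2, 3]
pairs, total_cost = dtw_pairs(human_traj, robot_traj)
```

Each human step is matched to the robot step at a similar stage of the motion even though the two sequences run at different speeds.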

EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

EgoThinker constructs EgoRe-5M, a 5-million-sample egocentric video QA dataset with causal CoT annotations and fine-grained hand-object localization data. Through a two-stage training paradigm—SFT for reasoning followed by GRPO for grounding—the approach enables a 7B MLLM to simultaneously perform egocentric causal reasoning and spatio-temporal fine-grained localization for the first time, achieving state-of-the-art results on 8+ benchmarks, with the 7B model surpassing 72B models on temporal grounding.

ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

This paper proposes the ESCA framework, which provides structured visual understanding context for MLLM-driven embodied agents via open-vocabulary scene graph generation (the SGClip model), substantially reducing perception error rates and improving task completion rates.

Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training

This paper proposes a sim-and-real policy co-training framework based on Unbalanced Optimal Transport (UOT), which aligns the joint observation-action distribution (rather than only the marginal observation distribution), and incorporates a temporally aligned sampling strategy to handle data imbalance, achieving a 30% improvement in OOD generalization on robotic manipulation tasks.

HiMaCon: Discovering Hierarchical Manipulation Concepts from Unlabeled Multi-Modal Data

This paper proposes a self-supervised framework that learns hierarchical manipulation concepts from unlabeled multi-modal robot demonstrations. It organizes representations via a cross-modal correlation network and a multi-horizon future predictor, enhancing the generalization of imitation learning policies to novel objects, unseen obstacles, and new environments.

Human-assisted Robotic Policy Refinement via Action Preference Optimization

This paper proposes Action Preference Optimization (APO), a human-robot collaboration framework that collects interactive trajectories and applies preference alignment to VLA models using binary desirability signals grounded in prospect theory and an adaptive reweighting scheme, enabling the model to learn from failures and improve iteratively.

Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI Coordination

Inspired by Vygotsky's theory of inner speech, this paper proposes MIMIC, a framework that uses language as an intermediate representation between perception and action. A VLM provides language scaffolding to train a CVAE that generates inner speech, which then conditions a diffusion policy to produce diverse and steerable behaviors.

Knolling Bot: Teaching Robots the Human Notion of Tidiness

This work frames desktop object tidying (knolling) as an NLP-style sequence prediction task, employing a Transformer to autoregressively generate target poses for each object. A Gaussian Mixture Model (GMM) handles solution ambiguity, the model is trained on 2.4 million automatically generated demonstrations to learn a generalizable notion of tidiness, and user preferences are implicitly encoded via the input ordering of objects.

LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents

This paper proposes LabUtopia — a high-fidelity simulation and hierarchical benchmark suite for scientific laboratory environments. It comprises the LabSim simulator with chemical reaction modeling, LabScene for procedural laboratory scene generation, and LabBench, a five-level benchmark spanning atomic operations to long-horizon mobile manipulation. The suite reveals significant bottlenecks in existing imitation learning methods with respect to long-horizon experimental workflows and object generalization.

Learning Interactive World Model for Object-Centric Reinforcement Learning

This paper proposes FIOC-WM, which learns the interaction structure among objects in a world model via a two-level factorization at the object and attribute levels. It trains a hierarchical policy grounded in interaction primitives, achieving more efficient policy learning and compositional generalization across multiple robot control tasks.

Learning Parameterized Skills from Demonstrations

This paper proposes DEPS, an end-to-end algorithm for discovering parameterized skills from expert demonstrations. Through a three-level hierarchical policy (discrete skill selection → continuous parameter selection → low-level actions) and an information bottleneck design, DEPS learns interpretable and generalizable skill abstractions, achieving significant improvements over baselines on LIBERO and MetaWorld.

Learning Spatial-Aware Manipulation Ordering

This paper proposes OrderMind, a unified framework that learns manipulation ordering of objects in cluttered scenes directly from RGB-D images via a Spatial Context Understanding encoder and a Temporal Priority Structuring module. Training annotations are generated through VLM distillation with spatial priors. OrderMind significantly outperforms VLM baselines in both simulation and real-world environments while supporting real-time inference (5.6 FPS, or 21.3 FPS for the lightweight variant).

LLMscape

LLMscape is a projection-mapped sandscape interactive installation in which multiple independent LLM agents receive multimodal input, converse with one another, and engage in speculation within a shared, mutable physical environment, exploring the process of collaborative sensemaking between humans and AI under cognitive uncertainty.

LUMIA: A Handheld Vision-to-Music System for Real-Time, Embodied Composition

This paper presents Lumia — a handheld camera-shaped device that analyzes captured frames via GPT-4 Vision to generate structured prompts, which are then fed to Stable Audio to synthesize loopable music segments, enabling a real-time, embodied improvisation workflow from visual input to music.

MaNGO: Adaptable Graph Network Simulators via Meta-Learning

This paper proposes MaNGO (Meta Neural Graph Operator), which leverages meta-learning and conditional neural processes (CNP) to learn shared latent structure across simulation tasks under varying physical parameters, enabling rapid adaptation to new physical parameters without retraining.

Massively Parallel Imitation Learning of Mouse Forelimb Musculoskeletal Reaching Dynamics

This work presents MIMIC-MJX, a massively parallel imitation learning pipeline for mouse forelimb musculoskeletal simulation. Leveraging JAX-accelerated PPO at 1.2 million steps/second across thousands of parallel environments, the pipeline trains physically-informed imitation learning policies. The study demonstrates that control cost regularization enables simulated muscle activity to better predict real EMG signals, and employs a Takens-theorem-based nonlinear dynamical systems approach to predict muscle activation from joint kinematics.

Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning

This paper proposes Memo, a Transformer-based memory-augmented framework that periodically generates summary tokens to compress historical context. Memo matches or exceeds the performance of full-context Transformers while reducing the KV cache at inference time by 8–10×, and demonstrates superior generalization to long contexts as well as robustness under streaming inference.

Memory-Augmented Potential Field Theory: A Framework for Adaptive Control in Non-Convex Domains

This paper proposes Memory-Augmented Potential Field Theory (MAPFT), which maintains a dynamic memory module within stochastic optimal control to detect and encode topological features of the state space (local minima, low-gradient regions, etc.), and adaptively reshapes the value function landscape to enable control in non-convex environments. On tasks such as Humanoid-v4, the method achieves a 27% improvement in cumulative reward over the best RL baseline (SAC), and raises the local optima escape rate from ~30% to ~72%.

MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

This paper proposes the MesaTask framework, which decomposes task descriptions into a Spatial Reasoning Chain — object reasoning → spatial relationship reasoning → scene graph construction → 3D layout — and combines a 10K+ manually annotated dataset with DPO optimization to generate physically plausible, task-aligned tabletop manipulation scenes.

MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Cultural Learning

MindForge introduces explicit Theory of Mind (ToM) representations, natural language communication, and a multi-component memory system into LLM-driven embodied agents, enabling open-source LLM agents to substantially improve task completion rates through collaborative dialogue with expert agents (without gradient updates), achieving 3× more tech-tree milestones and 2.3× more unique items than Voyager in Minecraft.

MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents

MineAnyBuild is a spatial planning benchmark built upon Minecraft, requiring AI agents to generate executable blueprint matrices from multimodal instructions. The benchmark comprises 4,000 tasks and 500+ architectural/decorative assets, and systematically evaluates MLLM spatial planning capabilities across four dimensions: spatial understanding, spatial reasoning, creativity, and spatial commonsense. Results reveal that even GPT-4o achieves only 41.02/100 overall, with open-source models performing substantially worse.

NeSyPr: Neurosymbolic Proceduralization For Efficient Embodied Reasoning

NeSyPr proposes a neurosymbolic proceduralization framework that transforms task plans generated by symbolic planners into composable procedural representations, enabling compact language models to perform efficient single-step reasoning without relying on external symbolic guidance — analogous to the human process of knowledge compilation.

Operation Veja: Fixing Fundamental Concepts Missing from Modern Roleplaying Training Paradigms

This paper systematically critiques four dominant paradigms in role-playing (RP) model training—RAG, fact-value specification, literary data, and synthetic data—arguing that none can produce characters with genuine depth. It proposes the VEJA framework (Values–Experiences–Judgments–Abilities) as a structured basis for character definition and data curation. In an LLM-judged A/B test, VEJA-guided human-curated data significantly outperforms a Gemini Pro 2.5 synthetic baseline with a win/loss/tie ratio of 43:28:29.

Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning

This paper proposes UEPO, a framework comprising three core components—multi-seed dynamics-aware diffusion policies, dynamic divergence regularization, and diffusion-based data augmentation—to address insufficient multimodal behavioral coverage and distribution shift in offline-to-online reinforcement learning, surpassing Uni-O4 on the D4RL benchmark.

Periodic Skill Discovery

This paper proposes Periodic Skill Discovery (PSD), a framework that maps states onto a circular latent space to naturally encode periodicity, enabling unsupervised discovery of diverse locomotion skills with varying periods.

Policy Compatible Skill Incremental Learning via Lazy Learning Interface

This paper proposes SIL-C, a framework that achieves skill-policy compatibility in skill incremental learning via a bilateral lazy learning interface, enabling incrementally updated skills to directly improve downstream policy performance without retraining or structural modification.

UniDomain: Pretraining a Unified PDDL Domain from Real-World Demonstrations for Generalizable Task Planning

UniDomain pretrains a unified PDDL planning domain—comprising 3,137 operators and 2,875 predicates—from 12,393 real-world robotic manipulation videos. Through hierarchical fusion to construct a meta-domain, it achieves zero-shot cross-task symbolic planning, outperforming the strongest baseline by 58% in success rate and 160% in plan optimality.

PROFIT: A Specialized Optimizer for Deep Fine Tuning

PROFIT frames fine-tuning as a multi-task learning problem across the time dimension, and achieves forgetting-resistant fine-tuning without additional data or parameters by orthogonally projecting new-task gradients onto the direction of a "regression equilibrium point."

Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents

This paper proposes AcTOL, which learns ordered and continuous vision-language representations via a visual-language ordering loss and a Brownian bridge constraint, without relying on rigid goal-reaching assumptions, achieving significant improvements on downstream simulated and real-world robot manipulation tasks.

RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks

This paper proposes RDD (Retrieval-Based Demonstration Decomposer), which models demonstration decomposition as an optimal partition problem and automatically segments long-horizon task demonstrations into subtasks aligned with the training data of low-level visuomotor policies. This approach bridges the gap between high-level planners and low-level policies in hierarchical VLA frameworks, achieving near-expert-decomposer performance on RLBench.

Real-World Reinforcement Learning of Active Perception Behaviors

This paper proposes Asymmetric Advantage-Weighted Regression (AAWR), which leverages additional privileged sensors during training to estimate more accurate advantage functions, enabling efficient learning of active perception policies in the real world. AAWR outperforms all baselines across 8 manipulation tasks spanning varying degrees of partial observability.

Reinforcement Learning with Action Chunking

This paper proposes Q-chunking, which extends action chunking from imitation learning to TD-based reinforcement learning by running RL directly over a "chunked" action space, thereby improving exploration and sample efficiency in long-horizon sparse-reward tasks.
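One way to read "RL over a chunked action space" is that the critic evaluates a whole chunk of h actions, so its TD backup spans h environment steps. The sketch below computes such a target; the function name, arguments, and the numbers in the example are illustrative assumptions and do not reproduce the paper's full algorithm.

```python
def chunked_td_target(rewards, bootstrap_q, gamma=0.99, done=False):
    """TD target for a chunk of h actions executed in sequence.

    rewards:     the h rewards collected while executing the chunk
    bootstrap_q: critic value of the next state paired with the next chunk
    done:        True if the episode ended inside the chunk (no bootstrap)
    """
    h = len(rewards)
    target = sum((gamma ** i) * r for i, r in enumerate(rewards))
    if not done:
        target += (gamma ** h) * bootstrap_q
    return target

# Sparse reward arriving on the last step of a 4-action chunk.
target = chunked_td_target([0.0, 0.0, 0.0, 1.0], bootstrap_q=2.0, gamma=0.5)
terminal_target = chunked_td_target([0.0, 0.0, 0.0, 1.0], bootstrap_q=2.0, gamma=0.5, done=True)
```

Because one backup covers h steps, a sparse reward propagates through the value function h times faster than with single-step targets, which is one intuition for the reported sample-efficiency gains.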

Rethinking the Simulation vs. Rendering Dichotomy: No Free Lunch in Spatial World Modelling

From a cognitive neuroscience perspective, this paper challenges the prevailing view that simulation and rendering are separable processes: it argues that spatial reasoning relies on fine-grained perceptual representations rather than coarse abstractions, and concludes that AI spatial world models likewise require rich perceptual detail — there is no free lunch in spatial modelling.

RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation

This paper proposes RoboCerebra, a long-horizon robotic manipulation benchmark comprising 1,000 human demonstration trajectories (averaging 2,972 steps, approximately 6× longer than existing benchmarks). Through a hierarchical planning and execution framework and a multi-dimensional evaluation protocol, it systematically assesses VLMs across three System 2 cognitive dimensions: planning, reflection, and memory.

Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

Robot-R1 proposes training large vision-language models (LVLMs) via reinforcement learning (GRPO) for embodied reasoning. By casting next keystate prediction as multiple-choice questions and optimizing reasoning paths with RL, a 7B-parameter model surpasses GPT-4o on low-level control reasoning tasks.
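GRPO scores each sampled answer relative to the other answers drawn for the same prompt. A minimal sketch of that group-relative advantage follows, assuming binary correctness rewards for multiple-choice answers and a population standard deviation; the reward values and the name `group_advantages` are illustrative.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one multiple-choice question: 1 = correct, 0 = wrong.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive positive advantages and wrong ones negative, with no learned value function involved.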

SAFE: Multitask Failure Detection for Vision-Language-Action Models

SAFE identifies consistent "failure regions" in the internal feature space of VLA models that generalize across tasks. Leveraging this observation, it trains lightweight MLP/LSTM failure detectors and applies Functional Conformal Prediction (FCP) for threshold calibration. The approach achieves 78% ROC-AUC on unseen tasks with less than 1% computational overhead, substantially outperforming token-uncertainty and action-consistency baselines.
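Threshold calibration with conformal prediction can be illustrated with the simpler scalar (split) variant; the functional version named above calibrates a time-varying band and is not reproduced here. The calibration scores, the miscoverage level `alpha`, and the function name are assumptions.

```python
import math

def conformal_threshold(cal_scores, alpha=0.25):
    """Split conformal quantile: the ceil((n+1)(1-alpha))-th smallest calibration score.

    cal_scores: failure scores observed on successful calibration rollouts
    alpha:      tolerated false-alarm rate on new successful rollouts
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]

# Hypothetical peak failure scores from 7 successful rollouts.
threshold = conformal_threshold([0.12, 0.30, 0.25, 0.41, 0.18, 0.55, 0.33], alpha=0.25)

def flag_failure(score):
    """Raise an alarm when a new rollout's score exceeds the calibrated threshold."""
    return score > threshold
```

Under the usual exchangeability assumption, a new successful rollout exceeds this threshold with probability at most `alpha`.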

SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

This work is the first to systematically apply the Constrained Markov Decision Process (CMDP) framework from Safe Reinforcement Learning (SafeRL) to safety alignment of Vision-Language-Action (VLA) models. Through a four-stage Integrated Safety Approach (ISA)—Model, Elicit, Constrain, and Assure—the method achieves an 83.58% reduction in safety violation costs on mobile manipulation tasks while maintaining task performance (+3.85%).

Sample-Efficient Tabular Self-Play for Offline Robust Reinforcement Learning

This paper proposes the RTZ-VI-LCB algorithm for offline robust two-player zero-sum Markov games (RTZMG). By combining pessimistic robust value iteration with Bernstein-style penalties, it achieves a near-optimal sample complexity of \(O(C_r^* \cdot H^4 \cdot S \cdot (A+B) / \varepsilon^2)\), significantly improving upon the prior best result of \(O(H^5 \cdot S^2 \cdot AB / \varepsilon^2)\) in terms of dependence on both the state space and the action space.

Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning

This paper establishes the first finite-sample convergence guarantees for distributionally robust average-reward reinforcement learning (DR-AMDP), proposing two algorithms (discount reduction and anchoring) that achieve near-optimal sample complexity of \(\widetilde{O}(|S||A|t_{\mathrm{mix}}^2\varepsilon^{-2})\) under both KL and \(f_k\)-divergence uncertainty sets.

Self-Improving Embodied Foundation Models

This paper proposes a two-stage post-training framework for embodied foundation models: Stage 1 performs supervised fine-tuning via behavior cloning and steps-to-go prediction; Stage 2 leverages the resulting self-reward function and success detector for online RL self-improvement. Using only 1–3% additional data, the method achieves over 1.5× improvement in success rate and, for the first time, demonstrates a robot autonomously acquiring novel skills beyond the distribution of imitation data.

Spatial-Aware Decision-Making with Ring Attractors in Reinforcement Learning Systems

This paper integrates ring attractor models from neuroscience into action selection in deep reinforcement learning (DRL). By mapping actions to spatial positions on a ring and injecting Gaussian signals encoding Q-values and uncertainty, the proposed approach achieves a 53% improvement over baseline on Atari 100K.

Spatial Understanding from Videos: Structured Prompts Meet Simulation Data

This paper proposes a two-pronged approach combining the SpatialMind structured prompting strategy and the ScanForgeQA synthetic QA dataset to substantially enhance VLMs' ability to perform 3D spatial reasoning from scanned videos, without modifying the underlying model architecture.

STAIR: Addressing Stage Misalignment through Temporal-Aligned Preference Reinforcement Learning

This paper identifies and formalizes the "stage misalignment" problem in Preference-based Reinforcement Learning (PbRL)—wherein comparing behavior segments from different task stages produces uninformative feedback—and proposes STAIR, a method that learns temporal distances via contrastive learning to approximate stage discrepancy. By employing a quadrilateral distance metric for stage-aligned query selection, STAIR substantially outperforms existing PbRL methods on multi-stage tasks.

SutureBot: A Precision Framework & Benchmark for Autonomous End-to-End Suturing

This paper presents SutureBot, the first precision-oriented benchmark and goal-conditioned framework for end-to-end autonomous suturing on the da Vinci surgical robot. It releases a high-fidelity dataset of 1,890 demonstrations, achieves 59%–74% improvements in needle insertion accuracy via point-label goal conditioning, and systematically evaluates state-of-the-art VLA models including π₀, GR00T N1, OpenVLA-OFT, and multi-task ACT.

Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras

Talk2Event introduces the first large-scale visual grounding benchmark for event cameras (30,690 annotated referring expressions across four grounding attribute types), and proposes the EventRefer framework, which employs a Mixture of Event-Attribute Experts (MoEE) to dynamically fuse appearance, status, viewer-relation, and inter-object-relation features. EventRefer surpasses existing methods across all three evaluation settings: event-only, frame-only, and fusion.

Task-Optimized Convolutional Recurrent Networks Align with Tactile Processing in the Rodent Brain

This paper proposes the Encoder-Attender-Decoder (EAD) framework to systematically explore task-optimized temporal neural networks for tactile processing. It finds that convolutional recurrent networks (ConvRNNs, especially IntersectionRNN) outperform feedforward and state-space models on both tactile object classification and neural alignment with rodent somatosensory cortex. Contrastive self-supervised learning with tactile-specific augmentations achieves neural fitting comparable to supervised learning, providing the first quantitative characterization of the brain's computational mechanisms for touch.

The Impact of Scaling Training Data on Adversarial Robustness

A systematic evaluation of 36 state-of-the-art vision models under 6 categories of black-box attacks reveals that attack success rate (ASR) decreases logarithmically with training data volume and model scale; however, data quality and model scale are more critical than data volume alone.

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

ThinkAct proposes a dual-system framework that applies action-aligned visual rewards to fine-tune MLLMs via reinforcement learning, eliciting embodied reasoning capabilities and compressing reasoning plans into visual latent representations to guide a downstream action model—realizing a "think before act" VLA reasoning paradigm.

Time Reversal Symmetry for Efficient Robotic Manipulations in Deep Reinforcement Learning

This paper proposes the TR-DRL framework, which exploits time reversal symmetry in robotic manipulation tasks—via trajectory reversal augmentation (for fully reversible transitions) and time-reversal-guided potential-based reward shaping (for partially reversible transitions)—to significantly improve sample efficiency and final performance of DRL on paired tasks (e.g., door opening/closing).
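Both ingredients have simple generic forms. The sketch below shows trajectory reversal for transitions assumed to be fully reversible, and standard potential-based reward shaping; the action-inversion function, the toy 1-D states, and the potential values are illustrative assumptions rather than the paper's definitions.

```python
def reverse_trajectory(transitions, invert_action):
    """Turn (s, a, s') transitions of one task into transitions of the reversed task."""
    return [(s_next, invert_action(a), s) for (s, a, s_next) in reversed(transitions)]

def shaped_reward(reward, phi_s, phi_next, gamma=0.99):
    """Potential-based shaping: r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * phi_next - phi_s

# Toy 1-D "door" example: states are door angles, actions are angle increments.
opening = [(0, +1, 1), (1, +1, 2)]
closing = reverse_trajectory(opening, invert_action=lambda a: -a)

# Hypothetical potentials for the current and next state.
r_shaped = shaped_reward(1.0, phi_s=0.5, phi_next=2.0, gamma=0.5)
```

Shaping of this form is known to leave the optimal policy unchanged, so a potential derived from the reversed task can guide exploration without altering the task being solved.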

To Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable Reinforcement Learning

Using a theoretical framework (perturbed Block MDP) and controlled locomotion experiments, this paper systematically investigates the algorithmic trade-off between privileged expert distillation and standard RL (without privileged information) in partially observable RL, finding that the trade-off is primarily governed by the stochasticity of latent state dynamics.

Towards Reliable Code-as-Policies: A Neuro-Symbolic Framework for Embodied Task Planning

This paper proposes a neuro-symbolic framework for embodied task planning that augments LLM-based code generation with explicit symbolic verification (checking whether preconditions are satisfied) and interactive verification (active exploration to acquire missing information), enabling more reliable code execution in dynamic and partially observable environments. On RLBench, task success rate improves from a baseline of 38.5% to 84.7%, with executability reaching 86.8%.

Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm

This paper proposes the TRRO theoretical framework and the PIRO practical algorithm, which guarantee monotonic improvement of reward function updates in IRL via a Minorization-Maximization procedure, achieving stability guarantees analogous to those of TRPO/PPO in forward RL.

VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

This paper introduces VIKI-Bench, the first hierarchical benchmark for embodied multi-agent cooperation, comprising three evaluation levels—agent activation, task planning, and trajectory perception—and proposes VIKI-R, a two-stage training framework combining CoT-supervised fine-tuning with multi-level reward reinforcement learning. The framework achieves significant improvements over baselines across diverse robot morphologies and multi-view visual observations, with combinatorial coordination patterns emerging during the RL stage.

VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching

This paper proposes VLA-Cache, a training-free inference acceleration method for VLA models that identifies static visual tokens across frames and caches their KV representations, excludes task-relevant tokens from reuse so that they are recomputed, and adaptively adjusts the reuse ratio per layer, achieving a 1.7× speedup with negligible loss in task success rate.

VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play

This paper presents VolleyBots, a multi-drone volleyball competition testbed that integrates cooperative-adversarial gameplay, turn-based interaction, and agile 3D motion control. Built on Isaac Sim, it establishes a task curriculum from single-agent training to multi-agent competition. A hierarchical policy achieves a 69.5% win rate on the 3v3 task, with demonstrated zero-shot sim-to-real transfer.

Zero-Shot Context Generalization in Reinforcement Learning from Few Training Contexts

This paper proposes the Context-Enhanced Bellman Equation (CEBE) and Context Sample Enhancement (CSE), which leverage first-order derivative information of environment dynamics and reward functions with respect to context parameters to achieve zero-shot generalization to unseen contexts when training is restricted to a single context.