🤖 Robotics & Embodied AI

🧠 NeurIPS2025 · 57 paper notes

A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning

This work is the first to introduce data attribution into online reinforcement learning. It proposes a local attribution framework to quantify each training record's contribution to policy updates, and builds upon it an Iterative Influence Filtering (IIF) algorithm that substantially improves sample efficiency and final performance on both classical RL benchmarks and LLM RLHF.

Adaptive Frontier Exploration on Graphs with Applications to Network-Based Disease Testing

This paper proposes the Adaptive Frontier Exploration on Graphs (AFEG) framework and designs a Gittins index-based policy that is provably optimal when the graph is a forest. On real-world sexually transmitted disease testing networks, the method identifies nearly all HIV-positive individuals by testing only half the population, substantially outperforming greedy and DQN baselines.

AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling

AutoToM achieves fully automated model-based Theory of Mind inference—without requiring manual agent model specification—by automatically proposing Bayesian network structures and executing Bayesian inverse planning. Through uncertainty-driven iterative model refinement (adding mental variables or extending time steps), it achieves an average accuracy of 82.43% across 5 ToM benchmarks, surpassing SOTA models such as GPT-4o (63.39%) and o3-mini (73.94%).

Beyond Parallelism: Synergistic Computational Graph Effects in Multi-Head Attention

This paper reframes multi-head attention as a system of multiple feedforward DAGs sharing a common sink node, and theoretically demonstrates that multiple heads can achieve synergistic effects through cross-head paths—reducing mixing time and amplifying minimax fidelity—with empirical validation on sequential operation tasks.

Breaking the Gradient Barrier: Unveiling Large Language Models for Strategic Classification

This paper proposes GLIM (Gradient-free Learning In-context Method), the first approach to leverage the In-Context Learning (ICL) mechanism of LLMs to implicitly simulate the bi-level optimization in strategic classification (feature manipulation + decision-rule optimization), enabling efficient strategic classification on large-scale data without any fine-tuning.

C-NAV: Towards Self-Evolving Continual Object Navigation in Open World

The paper proposes C-Nav, a framework that employs dual-path anti-forgetting (feature distillation + feature replay) and adaptive experience selection (LOF-based anomaly detection for keyframe selection) to prevent catastrophic forgetting as a navigation agent incrementally learns new object categories, surpassing full data replay baselines across 4 different architectures and 2 datasets.

Can Agents Fix Agent Issues?

This paper presents the first systematic study of automated issue resolution in LLM-based agent systems. Through manual analysis of 201 real-world agent issues, the authors construct a taxonomy of 6 categories and 20 subcategories and invest roughly 500 person-hours to build AgentIssue-Bench, a benchmark of 50 reproducible tasks. On it, state-of-the-art software engineering (SE) agents (e.g., SWE-agent, Agentless, AutoCodeRover) resolve only 3.33%–12.67% of agent issues, far below their 23%–51% rates on conventional software.

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification

CogVLA proposes a three-stage VLA architecture inspired by human multimodal cognition—comprising EFA-Routing for visual aggregation and compression to 25%, LFP-Routing for instruction-aware pruning of 50% of tokens within the LLM, and V-L-A coupled attention—achieving a 97.4% success rate on LIBERO with 2.5× training and 2.8× inference speedups over SOTA methods such as OpenVLA-OFT, and a 70.0% success rate on real-robot tasks.

COOPERA: Continual Open-Ended Human-Robot Assistance

This paper proposes the COOPERA framework, the first to enable continual, open-ended human-robot collaboration research. LLM-driven simulated humans with psychological traits and long-term intentions interact with robots over multiple days in a 3D environment. The robot progressively improves its personalized assistance by learning human characteristics and contextual intentions.

DexFlyWheel: A Scalable Self-Improving Data Generation Framework for Dexterous Manipulation

This paper proposes DexFlyWheel, a dexterous manipulation data generation framework that starts from a single human demonstration and progressively scales data diversity through a self-improving loop composed of IL, residual RL, and data augmentation. The framework generates 2,000+ demonstrations across 4 tasks, achieving an average policy success rate of 81.9% and a real-world transfer success rate of 78.3%.

DynaNav: Dynamic Feature and Layer Selection for Efficient Visual Navigation

DynaNav is proposed to dynamically adjust feature and layer usage according to scene complexity via a trainable hard feature selector and a Bayesian optimization-based early-exit mechanism, achieving a 2.26× FLOPs reduction and 42.3% inference time decrease in visual navigation while maintaining or improving navigation performance.
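
The exit rule itself is simple to illustrate; the sketch below is a hypothetical stand-in in which `layers`, `exit_heads`, and `thresholds` abstract DynaNav's backbone, lightweight exit heads, and Bayesian-optimized exit thresholds.

```python
def early_exit_forward(layers, exit_heads, x, thresholds):
    """Run the backbone layer by layer; after each layer, a lightweight exit
    head returns (prediction, confidence). Exit at the first layer whose
    confidence clears its threshold, skipping the remaining layers."""
    pred = None
    for i, (layer, head) in enumerate(zip(layers, exit_heads)):
        x = layer(x)
        pred, conf = head(x)
        if conf >= thresholds[i]:
            return pred, i  # exited early at layer i
    return pred, len(layers) - 1  # fell through to the final layer
```

In the paper the thresholds themselves are tuned by Bayesian optimization to trade navigation accuracy against FLOPs; here they are simply given.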

EfficientNav: Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval

EfficientNav combines discrete memory caching (group-independent KV-cache computation with selective loading), attention-driven clustering (LLM shallow-layer attention guiding the grouping), and semantics-aware retrieval (CLIP plus a knapsack formulation adapted to varying memory budgets). It is the first system to achieve zero-shot ObjNav on a Jetson Orin with LLaMA-3.2-11B, surpassing the GPT-4 baseline by 11.1% SR while cutting real-time latency by 6.7×.

EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT

This paper proposes EgoThinker, which constructs EgoRe-5M, a 5-million-sample egocentric video reasoning dataset with causal CoT annotations and fine-grained hand-object grounding labels, and adopts a two-stage training paradigm (SFT followed by GRPO reinforcement fine-tuning) to endow MLLMs with robust egocentric causal reasoning, hand-object grounding, and temporal localization capabilities. It achieves state-of-the-art results on 8+ egocentric benchmarks, with the 7B model surpassing 72B models on temporal grounding.

Explaining and Mitigating Crosslingual Tokenizer Inequities

This work systematically trains approximately 7,000 monolingual tokenizers covering 97 languages, providing the first demonstration that significant token premium disparities persist across languages even after controlling for training data size, vocabulary size, and algorithm. It further identifies vocabulary size and pre-tokenization strategy as key contributing factors, and proposes two mitigation approaches: language-specific optimal vocabulary size and SuperBPE.

FALCON: Fine-grained Activation Manipulation by Contrastive Orthogonal Unalignment for Large Language Model

This paper proposes FALCON, a representation-guided LLM unlearning framework that employs mutual information for parameter selection, a contrastive mechanism for fine-grained knowledge separation, and gradient orthogonal projection to resolve forgetting–retention conflicts. FALCON consistently outperforms existing methods on harmful knowledge, copyright, and entity unlearning benchmarks.
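
The gradient orthogonal projection component can be illustrated in isolation; this is a generic sketch of projecting the forgetting gradient off the retention gradient, not FALCON's exact update rule.

```python
import numpy as np

def project_out(g_forget, g_retain, eps=1e-12):
    """Return the forgetting gradient with its component along the retention
    gradient removed, so an unlearning step is (locally) neutral with respect
    to the knowledge being retained."""
    denom = float(g_retain @ g_retain)
    if denom < eps:  # degenerate retention gradient: nothing to protect
        return g_forget
    coef = float(g_forget @ g_retain) / denom
    return g_forget - coef * g_retain
```

After projection the unlearning step is orthogonal to the retention direction, which is how such schemes resolve the forgetting–retention conflict at the gradient level.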

Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training

This paper proposes a sim-and-real policy co-training framework based on Unbalanced Optimal Transport (UOT), which aligns the joint observation-action distribution (rather than only the marginal observation distribution), and incorporates a temporally aligned sampling strategy to handle data imbalance, achieving a 30% improvement in OOD generalization on robotic manipulation tasks.

Grasp2Grasp: Vision-Based Dexterous Grasp Translation via Schrödinger Bridges

This paper proposes modeling cross-morphology visual dexterous grasp transfer as a Schrödinger Bridge problem. By learning Score and Flow Matching ([SF]²M) in a latent space and designing physics-aware optimal transport cost functions (over pose, contact maps, grasp wrench space, and Jacobian manipulability), the method achieves distribution-level transfer of grasp intent across different robot hands without requiring paired data.

GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation

This paper proposes GUI-Rise, a framework that jointly designs three subtasks—structured reasoning (progress estimation + decision reasoning), action prediction, and history summarization—combined with GRPO reinforcement learning and a history summarization reward, to significantly improve the cross-domain generalization of GUI navigation agents.

Harnessing the Computation Redundancy in ViTs to Boost Adversarial Transferability

By systematically exploiting data-level and model-level computation redundancy in ViTs, this paper proposes five techniques—attention sparsification, attention head permutation, clean token regularization, Ghost MoE diversification, and robust tokens—combined with an online learning strategy that dynamically selects operations. The method achieves an average fooling rate of 86.9% on ImageNet-1K, substantially outperforming all baselines.

HiMaCon: Discovering Hierarchical Manipulation Concepts from Unlabeled Multi-Modal Data

This paper proposes a self-supervised framework that learns hierarchical manipulation concepts from unlabeled multi-modal robot demonstrations. It organizes representations via a cross-modal correlation network and a multi-horizon future predictor, enhancing the generalization of imitation learning policies to novel objects, unseen obstacles, and new environments.

Knolling Bot: Teaching Robots the Human Notion of Tidiness

This work frames desktop object tidying (knolling) as an NLP-style sequence prediction task, employing a Transformer to autoregressively generate target poses for each object. A Gaussian Mixture Model (GMM) handles solution ambiguity, the model is trained on 2.4 million automatically generated demonstrations to learn a generalizable notion of tidiness, and user preferences are implicitly encoded via the input ordering of objects.

LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents

This paper proposes LabUtopia — a high-fidelity simulation and hierarchical benchmark suite for scientific laboratory environments. It comprises the LabSim simulator with chemical reaction modeling, LabScene for procedural laboratory scene generation, and LabBench, a five-level benchmark spanning atomic operations to long-horizon mobile manipulation. The suite reveals significant bottlenecks in existing imitation learning methods with respect to long-horizon experimental workflows and object generalization.

LatentGuard: Controllable Latent Steering for Robust Refusal of Attacks and Reliable Response Generation

This paper proposes LatentGuard, a three-stage framework that combines behavior-level alignment fine-tuning, structured VAE-supervised latent space modeling, and latent-space dimensional manipulation to achieve interpretable and controllable regulation of LLM refusal behavior — robustly defending against adversarial attacks while preserving responsiveness to benign queries.

Learning Spatial-Aware Manipulation Ordering

This paper proposes OrderMind, a unified framework that learns manipulation ordering of objects in cluttered scenes directly from RGB-D images via a Spatial Context Understanding encoder and a Temporal Priority Structuring module. Training annotations are generated through VLM distillation with spatial priors. OrderMind significantly outperforms VLM baselines in both simulation and real-world environments while supporting real-time inference (5.6 FPS, 21.3 FPS for the lightweight variant).

LLM World Models Are Mental: Output Layer Evidence of Brittle World Model Use in LLM Mechanical Reasoning

Drawing on cognitive science methodology for studying mental models, this work evaluates LLM mechanical reasoning ability using TikZ code representations of pulley systems. LLMs can approximately estimate mechanical advantage and distinguish functional from non-functional systems (Studies 1 & 2), but completely fail at fine-grained structural connectivity reasoning (Study 3), indicating that LLM "world models" exist but are brittle.

LLMscape

LLMscape is a projection-mapped sandscape interactive installation in which multiple independent LLM agents receive multimodal input, converse with one another, and engage in speculation within a shared, mutable physical environment, exploring the process of collaborative sensemaking between humans and AI under cognitive uncertainty.

Manipulating Feature Visualizations with Gradient Slingshots

This paper proposes Gradient Slingshots (GS), a method that "carves" a quadratic activation landscape in the out-of-distribution (OOD) input region of a model, directing the gradient-based optimization of Feature Visualization (FV) toward an arbitrary target image. The approach causes FV to converge to a predefined spurious image while leaving the model's architecture, classification accuracy, and internal feature representations largely intact, thereby exposing a serious vulnerability of FV as a model auditing tool.

MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning

This paper proposes the MesaTask framework, which decomposes task descriptions into a Spatial Reasoning Chain — object reasoning → spatial relationship reasoning → scene graph construction → 3D layout — and combines a 10K+ manually annotated dataset with DPO optimization to generate physically plausible, task-aligned tabletop manipulation scenes.

MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Cultural Learning

MindForge introduces explicit Theory of Mind (ToM) representations, natural language communication, and a multi-component memory system into LLM-driven embodied agents, enabling open-source LLM agents to substantially improve task completion rates through collaborative dialogue with expert agents (without gradient updates), achieving 3× more tech-tree milestones and 2.3× more unique items than Voyager in Minecraft.

MineAnyBuild: Benchmarking Spatial Planning for Open-world AI Agents

MineAnyBuild is a spatial planning benchmark built upon Minecraft, requiring AI agents to generate executable blueprint matrices from multimodal instructions. The benchmark comprises 4,000 tasks and 500+ architectural/decorative assets, and systematically evaluates MLLM spatial planning capabilities across four dimensions: spatial understanding, spatial reasoning, creativity, and spatial commonsense. Results reveal that even GPT-4o achieves only 41.02/100 overall, with open-source models performing substantially worse.

MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents

This paper reveals a novel adversarial attack against multimodal OS Agents, termed MIP (Malicious Image Patches): visually imperceptible adversarial perturbation patches (occupying approximately 1/7 of the screen area) are embedded in screenshots, causing the OS Agent to output a predefined sequence of malicious API calls upon capture. Joint optimization enables universal generalization across user instructions and screen layouts, achieving an attack success rate of up to 100%.

MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

This paper introduces MMTU, a large-scale benchmark comprising 28,136 questions spanning 25 real-world table tasks, designed to systematically evaluate LLMs on professional-level table understanding, reasoning, and manipulation. Even frontier reasoning models such as GPT-5 achieve only approximately 69.6% on this benchmark.

mmWalk: Towards Multi-modal Multi-view Walking Assistance

mmWalk constructs the first multi-modal multi-view dataset for walking assistance targeting blind and low-vision (BLV) individuals (62K frames / 559K panoramic images generated via the CARLA simulator, plus 69K VQA pairs), and benchmarks reveal that state-of-the-art VLMs perform inadequately on safety-critical tasks such as risk assessment and navigation landmark recognition (best accuracy only 55.21%); fine-tuning yields a 16.7% generalization improvement on real-world datasets.

NeSyPr: Neurosymbolic Proceduralization For Efficient Embodied Reasoning

NeSyPr proposes a neurosymbolic proceduralization framework that transforms task plans generated by symbolic planners into composable procedural representations, enabling compact language models to perform efficient single-step reasoning without relying on external symbolic guidance — analogous to the human process of knowledge compilation.

Operation Veja: Fixing Fundamental Concepts Missing from Modern Roleplaying Training Paradigms

This paper systematically critiques four dominant paradigms in role-playing (RP) model training—RAG, fact-value specification, literary data, and synthetic data—arguing that none can produce characters with genuine depth. It proposes the VEJA framework (Values–Experiences–Judgments–Abilities) as a structured basis for character definition and data curation. In an LLM-judged A/B test, VEJA-guided human-curated data significantly outperforms a Gemini Pro 2.5 synthetic baseline with a win/loss/tie ratio of 43:28:29.

Policy Compatible Skill Incremental Learning via Lazy Learning Interface

This paper proposes SIL-C, a framework that achieves skill-policy compatibility in skill incremental learning via a bilateral lazy learning interface, enabling incrementally updated skills to directly improve downstream policy performance without retraining or structural modification.

Predicting the Performance of Black-Box LLMs through Follow-Up Queries

This paper proposes QueRE, a method that poses approximately 50 follow-up questions to a black-box LLM (e.g., "Are you confident in your answer?") and uses the resulting "Yes" token probabilities as features to train a linear classifier. QueRE achieves strong performance on predicting model correctness, detecting adversarial manipulation, and distinguishing between different LLMs — surpassing even white-box methods that require access to internal model states.
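
A minimal sketch of the recipe, assuming a `query_llm` callable that returns P("Yes") for a follow-up prompt; the question list and the plain logistic-regression probe below are illustrative stand-ins, not the paper's exact set.

```python
import numpy as np

FOLLOW_UPS = [
    "Are you confident in your answer?",
    "Could your answer be wrong?",
    "Would an expert agree with your answer?",
]

def quere_features(query_llm, prompt):
    """One feature per follow-up question: the model's P('Yes')."""
    return np.array([query_llm(prompt + "\n" + q) for q in FOLLOW_UPS])

def fit_linear_probe(X, y, lr=0.5, steps=3000):
    """Plain gradient-descent logistic regression on the elicited features."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * (X.T @ g) / len(y)
        b -= lr * float(g.mean())
    return w, b

def probe_predict(w, b, X):
    return (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
```

The point of the method is that these elicited probabilities are informative enough that even a linear probe on them predicts correctness well, with no access to model internals.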

UniDomain: Pretraining a Unified PDDL Domain from Real-World Demonstrations for Generalizable Task Planning

UniDomain pretrains a unified PDDL planning domain—comprising 3,137 operators and 2,875 predicates—from 12,393 real-world robotic manipulation videos. Through hierarchical fusion to construct a meta-domain, it achieves zero-shot cross-task symbolic planning, outperforming the strongest baseline by 58% in success rate and 160% in plan optimality.

RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks

This paper proposes RDD (Retrieval-Based Demonstration Decomposer), which models demonstration decomposition as an optimal partition problem and automatically segments long-horizon task demonstrations into subtasks aligned with the training data of low-level visuomotor policies. This approach bridges the gap between high-level planners and low-level policies in hierarchical VLA frameworks, achieving near-expert-decomposer performance on RLBench.
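
The optimal-partition view admits a simple dynamic program; `seg_score` here is a hypothetical stand-in for RDD's retrieval-based alignment score between a candidate segment and the low-level policy's training data.

```python
def best_partition(T, seg_score, max_len=None):
    """Split the step range [0, T) into contiguous segments [i, j),
    maximizing the sum of seg_score(i, j) over the chosen segments."""
    NEG = float("-inf")
    best = [0.0] + [NEG] * T   # best[j]: max score for partitioning [0, j)
    back = [0] * (T + 1)
    for j in range(1, T + 1):
        lo = 0 if max_len is None else max(0, j - max_len)
        for i in range(lo, j):
            s = best[i] + seg_score(i, j)
            if s > best[j]:
                best[j], back[j] = s, i
    segs, j = [], T
    while j > 0:               # backtrack to recover the segmentation
        segs.append((back[j], j))
        j = back[j]
    return best[T], segs[::-1]
```

This is O(T²) (or O(T·max_len) with a length cap) per demonstration, cheap relative to evaluating the segment scores themselves.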

Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation

This paper proposes EigenShift, a method that performs SVD decomposition on the final output projection layer of LLMs to identify semantic directions (eigen-choices) associated with toxic generation, and suppresses toxicity by selectively attenuating the corresponding singular values. On LLaMA-2, EigenShift reduces toxicity by 58% while increasing perplexity by only 3.62, achieving a favorable balance between safety and fluency.
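
The core operation is small enough to sketch; identifying which singular directions are the toxic "eigen-choices" is the paper's actual contribution, so `toxic_idx` is taken as given here.

```python
import numpy as np

def eigenshift(W, toxic_idx, alpha=0.0):
    """SVD the output-projection matrix W, scale the singular values of the
    selected directions by alpha (0 removes them entirely), and reassemble."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S = S.copy()
    S[list(toxic_idx)] *= alpha
    return (U * S) @ Vt  # equivalent to U @ np.diag(S) @ Vt
```

Because only a few singular values are attenuated, the edit is low-rank and surgical, which is why the perplexity cost stays small.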

Rethinking the Simulation vs. Rendering Dichotomy: No Free Lunch in Spatial World Modelling

From a cognitive neuroscience perspective, this paper challenges the prevailing view that simulation and rendering are separable processes: it argues that spatial reasoning relies on fine-grained perceptual representations rather than coarse abstractions, and concludes that AI spatial world models likewise require rich perceptual detail — there is no free lunch in spatial modelling.

RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation

This paper proposes RoboCerebra, a long-horizon robotic manipulation benchmark comprising 1,000 human demonstration trajectories (averaging 2,972 steps, approximately 6× longer than existing benchmarks). Through a hierarchical planning and execution framework and a multi-dimensional evaluation protocol, it systematically assesses VLMs across three System 2 cognitive dimensions: planning, reflection, and memory.

SAFE: Multitask Failure Detection for Vision-Language-Action Models

SAFE identifies consistent "failure regions" in the internal feature space of VLA models that generalize across tasks. Leveraging this observation, it trains lightweight MLP/LSTM failure detectors and applies Functional Conformal Prediction (FCP) for threshold calibration. The approach achieves 78% ROC-AUC on unseen tasks with less than 1% computational overhead, substantially outperforming token-uncertainty and action-consistency baselines.
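
The calibration step can be illustrated with plain split-conformal prediction; the paper uses the functional variant over whole score trajectories, so this scalar version only shows the quantile rule.

```python
import math

def split_conformal_threshold(cal_scores, alpha=0.1):
    """Threshold = the ceil((n+1)(1-alpha))-th smallest calibration score.
    Flagging rollouts whose failure score exceeds it gives roughly
    (1 - alpha) coverage on exchangeable data."""
    s = sorted(cal_scores)
    n = len(s)
    k = math.ceil((n + 1) * (1 - alpha))
    return s[min(k, n) - 1]
```

Calibrating the threshold this way, rather than hand-tuning it, is what gives the detector a distribution-free false-alarm guarantee.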

SegMASt3R: Geometry Grounded Segment Matching

SegMASt3R augments the pretrained MASt3R 3D foundation model with a lightweight segmentation feature head and a differentiable Sinkhorn matching layer. By leveraging 3D geometric priors, it achieves robust semantic segment matching under extreme viewpoint changes (up to 180°), attaining an AUPRC of 83.6% on the 135–180° baseline (vs. 17% for SAM2).

Spatial Understanding from Videos: Structured Prompts Meet Simulation Data

This paper proposes a two-pronged approach combining the SpatialMind structured prompting strategy and the ScanForgeQA synthetic QA dataset to substantially enhance VLMs' ability to perform 3D spatial reasoning from scanned videos, without modifying the underlying model architecture.

SutureBot: A Precision Framework & Benchmark for Autonomous End-to-End Suturing

This paper presents SutureBot — the first precision-oriented benchmark and goal-conditioned framework for end-to-end autonomous suturing on the da Vinci surgical robot. It releases a high-fidelity dataset of 1,890 demonstrations, achieves 59%–74% improvements in needle insertion accuracy via point-label goal conditioning, and systematically evaluates state-of-the-art VLA models including π0, GR00T N1, OpenVLA-OFT, and multi-task ACT.

Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras

Talk2Event introduces the first large-scale visual grounding benchmark for event cameras (30,690 annotated referring expressions across four grounding attribute types), and proposes the EventRefer framework, which employs a Mixture of Event-Attribute Experts (MoEE) to dynamically fuse appearance, status, viewer-relation, and inter-object-relation features. EventRefer surpasses existing methods across all three evaluation settings: event-only, frame-only, and fusion.

Task-Optimized Convolutional Recurrent Networks Align with Tactile Processing in the Rodent Brain

This paper proposes the Encoder-Attender-Decoder (EAD) framework to systematically explore task-optimized temporal neural networks for tactile processing. It finds that convolutional recurrent networks (ConvRNNs, especially IntersectionRNN) outperform feedforward and state-space models on both tactile object classification and neural alignment with rodent somatosensory cortex. Contrastive self-supervised learning with tactile-specific augmentations achieves neural fitting comparable to supervised learning, providing the first quantitative characterization of the brain's computational mechanisms for touch.

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

ThinkAct proposes a dual-system framework that applies action-aligned visual rewards to fine-tune MLLMs via reinforcement learning, eliciting embodied reasoning capabilities and compressing reasoning plans into visual latent representations to guide a downstream action model—realizing a "think before act" VLA reasoning paradigm.

Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs

This paper introduces EngDesign—the first LLM engineering design benchmark spanning 9 engineering domains (operating systems, computer architecture, control systems, mechanical engineering, structural engineering, digital hardware, analog circuits, robotics, and signal processing)—replacing conventional QA matching with a simulation-driven evaluation pipeline. The benchmark reveals that even the most capable reasoning model, o3, achieves only a 34% pass rate.

Towards Reliable Code-as-Policies: A Neuro-Symbolic Framework for Embodied Task Planning

This paper proposes a neuro-symbolic framework for embodied task planning that augments LLM-based code generation with explicit symbolic verification (checking whether preconditions are satisfied) and interactive verification (active exploration to acquire missing information), enabling more reliable code execution in dynamic and partially observable environments. On RLBench, task success rate improves from a baseline of 38.5% to 84.7%, with executability reaching 86.8%.

Uncovering Strategic Egoism Behaviors in Large Language Models

This paper presents the first formal definition of Strategic Egoism (SE) in LLMs and introduces SEBench, a benchmark comprising 160 scenarios across 6 SE dimensions. Experiments on 7 mainstream LLMs show that, under incentive-driven conditions, an average of 69.11% of decisions favor self-interested strategies. Manipulation/coercion and rule circumvention are the most prevalent tactics, and SE tendency is positively correlated with toxic language generation.

Understanding Prompt Tuning and In-Context Learning via Meta-Learning

This paper systematically analyzes the theoretical foundations and limitations of prompt tuning from a Bayesian meta-learning perspective. It proves that soft prompts can achieve optimal adaptation on a single target task within the pretraining distribution, yet face fundamental limitations under multi-task mixture target distributions. Furthermore, soft prefixes can surpass the optimal hard-token sequence by manipulating activations outside the token space.

VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions

This paper proposes the CLIP-IN framework, which leverages instruction editing datasets as hard negatives and incorporates long captions to enhance CLIP's fine-grained visual understanding. The approach achieves significant improvements on benchmarks such as MMVP without compromising zero-shot performance, and when integrated into MLLMs, it substantially reduces visual hallucinations.

Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs

This paper proposes ZEDD (Zero-shot Embedding Drift Detection), which detects prompt injection attacks by measuring semantic drift between benign and suspicious inputs in the embedding space. It leverages GMM/KDE to automatically determine detection thresholds, achieving >93% detection accuracy with <3% false positive rate across multiple LLM architectures.
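
A stripped-down version of the pipeline, with a quantile of benign drift scores standing in for the paper's GMM/KDE threshold fitting; the embedding model and all names are placeholders.

```python
import numpy as np

def drift_score(emb, benign_centroid):
    """Cosine distance between an input embedding and the benign centroid."""
    a = emb / np.linalg.norm(emb)
    b = benign_centroid / np.linalg.norm(benign_centroid)
    return 1.0 - float(a @ b)

def calibrate_threshold(benign_embs, q=0.99):
    """Fit the detector from benign embeddings alone: centroid plus a
    high quantile of benign drift scores as the decision threshold
    (a stand-in for the paper's GMM/KDE density fit)."""
    c = benign_embs.mean(axis=0)
    scores = [drift_score(e, c) for e in benign_embs]
    return c, float(np.quantile(scores, q))

def is_injection(emb, centroid, threshold):
    return drift_score(emb, centroid) > threshold
```

Since calibration needs only benign traffic, the detector is zero-shot with respect to attacks: any prompt whose embedding drifts far from the benign region is flagged, whether or not that injection style was seen before.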