🤖 Robotics & Embodied AI

🔬 ICLR2026 · 47 paper notes

All-day Multi-scenes Lifelong Vision-and-Language Navigation with Tucker Adaptation: This paper proposes Tucker Adaptation (TuKA), which represents multi-level navigation knowledge across multiple scenes and environments as a high-order tensor, decomposed via Tucker decomposition into a shared subspace (core tensor + encoder/decoder) and scene/environment expert vectors. Combined with a Decoupled Knowledge Incremental Learning (DKIL) strategy, TuKA enables all-day multi-scene lifelong VLN, achieving a higher success rate (SR) and lower forgetting rates than LoRA variants across 24 navigation scenarios.
AnyTouch 2: General Optical Tactile Representation Learning For Dynamic Tactile Perception: AnyTouch 2 proposes a Tactile Dynamic Pyramid framework, constructs the ToucHD hierarchical dataset comprising 2,426,174 contact samples (covering atomic actions, real-world manipulation, and tactile-force paired data), and designs a unified tactile representation learning framework that operates across three levels of dynamic perception: pixel-level, semantic-level, and physical-level. The approach outperforms existing methods on four tasks, including static attribute recognition, dynamic physical prediction, and real-world manipulation.
Attribution-Guided Decoding: This paper proposes AGD, a decoding strategy that, at each generation step, selects from high-probability candidate tokens the one with the highest attribution score toward a user-specified region of interest (ROI). This reframes attribution methods from passive analysis tools into active generation guidance mechanisms, achieving significant improvements on both instruction-following and factuality tasks.
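The selection rule described in the AGD entry can be sketched for a single decoding step as follows. This is a minimal illustration, not the paper's implementation: the function name `agd_select`, the use of a top-k cutoff to define "high-probability candidates", and the toy attribution scores are all assumptions made for the example.

```python
from typing import Dict


def agd_select(token_probs: Dict[str, float],
               attribution: Dict[str, float],
               top_k: int = 3) -> str:
    """Pick one token for a single decoding step.

    token_probs: next-token probabilities from the model (toy values here).
    attribution: hypothetical attribution score of each token toward the ROI.
    top_k: assumed way of defining the high-probability candidate set.
    """
    # Keep only the top_k most probable tokens as candidates.
    candidates = sorted(token_probs, key=token_probs.get, reverse=True)[:top_k]
    # Among those candidates, choose the one with the highest attribution.
    return max(candidates, key=lambda tok: attribution.get(tok, 0.0))
```

In this sketch a token with a very high attribution score but a low probability is never chosen, because it does not make it into the candidate set; that is the property the entry describes.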
Building Spatial World Models from Sparse Transitional Episodic Memories: This paper proposes the Episodic Spatial World Model (ESWM), which constructs spatial world models from sparse, disconnected episodic memories (one-step transitions). The model's latent space spontaneously gives rise to cognitive maps aligned with environmental topology, supporting zero-shot exploration and navigation.
Capability-Based Scaling Trends for LLM-Based Red-Teaming: This paper systematically evaluates 4 jailbreak methods across 600+ attacker–target LLM pairs and finds that attack success rate (ASR) follows a sigmoid scaling law with respect to the capability gap between attacker and target (\(R^2=0.83\)), where the capability gap is quantified via a logit transformation of MMLU-Pro scores.
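The shape of the relationship in the red-teaming entry can be sketched as below. This is only an illustration under stated assumptions: the gap is taken to be the attacker's logit-transformed score minus the target's, and `slope` and `intercept` are hypothetical placeholders, not the fitted values from the paper.

```python
import math


def logit(accuracy: float) -> float:
    """Map a benchmark accuracy in (0, 1) to an unbounded capability score."""
    return math.log(accuracy / (1.0 - accuracy))


def capability_gap(attacker_acc: float, target_acc: float) -> float:
    """Assumed definition: attacker capability minus target capability."""
    return logit(attacker_acc) - logit(target_acc)


def predicted_asr(gap: float, slope: float = 1.0, intercept: float = 0.0) -> float:
    """Sigmoid curve in the capability gap (slope/intercept are placeholders)."""
    return 1.0 / (1.0 + math.exp(-(slope * gap + intercept)))
```

With these placeholder parameters, an attacker and target of equal capability give a gap of zero and a predicted ASR of 0.5, and the prediction rises as the attacker becomes stronger relative to the target.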
CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally: Through linear probing experiments, this paper demonstrates that CLIP's bag-of-words (BoW) behavior does not stem from a lack of binding information in the encoders, but rather from a failure of cross-modal alignment. The paper proposes LABCLIP, which trains a single lightweight linear transformation to substantially recover attribute-object binding capability.
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI: This work proposes the D2E framework, demonstrating that desktop gaming interaction data can serve as an effective pretraining substrate for embodied AI. Through the OWA toolkit, 335 hours of human demonstrations are collected; a Generalist-IDM pseudo-annotates 1,000+ hours of YouTube gameplay videos; and VAPT transfer training yields a 1B-parameter model that achieves 96.6% on LIBERO manipulation and 83.3% on CANVAS navigation, matching or surpassing models 7× larger.
Domain Expansion: A Latent Space Construction Framework for Multi-Task Learning: This paper proposes the Domain Expansion framework, which restructures the latent space into mutually orthogonal subspaces via Orthogonal Pooling, structurally preventing gradient conflicts and representation collapse in multi-objective training, and enabling interpretable, composable concept algebra.
Doubly-Robust LLM-as-a-Judge: Externally Valid Estimation with Imperfect Personas: This paper proposes a doubly-robust estimation framework that combines imperfect LLM persona ratings with human annotations subject to sampling bias, yielding statistically valid estimates of GenAI system quality in the simultaneous presence of covariate shift and selection bias.
Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection: This paper proposes Directer (Dynamic Rejection Steering), which dynamically adjusts KV cache steering intensity at each decoding step and incorporates a plausibility constraint, substantially improving LLM instruction following while preventing text quality degradation caused by oversteering.
ExoPredicator: Learning Abstract Models of Dynamic Worlds for Robot Planning: This paper proposes ExoPredicator, a framework that jointly learns symbolic state abstractions and causal processes (encompassing both endogenous actions and exogenous mechanisms). Via variational Bayesian inference combined with LLM-based proposals, ExoPredicator learns causal world models with stochastic delays from a small number of trajectories, achieving rapid generalization in planning across five tabletop robot environments.
Experience-based Knowledge Correction for Robust Planning in Minecraft: This paper demonstrates that LLMs cannot self-correct erroneous planning priors (item dependency relations) through prompting alone, and proposes XENON — an algorithmic knowledge management framework consisting of an Adaptive Dependency Graph (ADG) and Failure-Aware Action Memory (FAM) that learns from binary feedback. XENON enables a 7B LLM to surpass the SOTA that uses GPT-4V with oracle knowledge on long-horizon planning tasks in Minecraft.
From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors: This paper proposes FALCON (From Spatial to Action), which injects rich 3D spatial tokens from a spatial foundation model into the Action Head rather than the VLM backbone, achieving strong 3D spatial awareness in VLA models while maintaining flexible modality switching between RGB-only and RGB-D inputs. FALCON achieves state-of-the-art performance on both simulation and real-world tasks.
Grounding Generative Planners in Verifiable Logic: A Hybrid Architecture for Trustworthy Embodied AI: This paper proposes VIRF (Verifiable Iterative Refinement Framework), a neuro-symbolic hybrid architecture that couples a deterministic Logic Tutor with an LLM planner, using a verifiable formal ontology as a safety anchor. VIRF achieves 0% Hazardous Action Rate (HAR) and 77.3% Goal Completion Rate (GCR) on SafeAgentBench, demonstrating that strict safety guarantees need not compromise agent utility.
Ignore All Previous Instructions: Jailbreaking as a de-escalatory peace building practise to resist LLM social media bots: This paper proposes reframing the jailbreaking of LLM-driven social media propaganda bots as a user-initiated, nonviolent de-escalatory peace-building practice. By exposing the fabricated identities of automated accounts through prompt injection, ordinary users can resist state-sponsored disinformation campaigns without relying on platform moderation.
JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation: Inspired by the left-brain/right-brain division of semantic understanding and spatial cognition in humans, this paper proposes JanusVLN—the first dual implicit neural memory framework designed for VLN—which models spatial-geometric memory and visual-semantic memory respectively as fixed-size KV Caches, enabling efficient spatial reasoning from RGB video alone and achieving state-of-the-art performance on the VLN-CE benchmark.
JULI: Jailbreak Large Language Models by Self-Introspection: This paper reveals that top-k token log probabilities returned by aligned LLM APIs still contain harmful knowledge leakage, and proposes JULI—a BiasNet plugin with fewer than 1% of the target model's parameters—that manipulates logit bias to successfully jailbreak Gemini-2.5-Pro (Harmful Info Score 4.19/5) under API settings restricted to top-5 token probabilities, achieving approximately 140× speedup over LINT while doubling harmfulness scores.
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation: Inspired by the dual-memory system in cognitive science, this paper proposes MemoryVLA, a framework that introduces a Perceptual-Cognitive Memory Bank (PCMB) into VLA models. By incorporating memory retrieval, gated fusion, and consolidation mechanisms to capture long-horizon temporal dependencies, MemoryVLA comprehensively outperforms CogACT and π₀ across 150+ tasks on SimplerEnv, LIBERO, and real-world benchmarks.
ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment: This paper proposes a unified theoretical framework for activation steering based on ordinary differential equations (ODEs), interpreting conventional activation addition as the Euler discretization of an ODE and showing that steering direction identification is equivalent to defining a barrier function. Building on this insight, the authors design ODESteer, which achieves fine-grained steering by numerically solving the ODE with multi-step adaptive integration, yielding gains of 5.7% on TruthfulQA, 2.5% on UltraFeedback, and 2.4% on RealToxicityPrompts.
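The ODE view in the ODESteer entry can be illustrated with a small sketch: adding a scaled steering vector to an activation is a single Euler step, and taking several smaller steps re-evaluates the direction along the way. The names `euler_steer` and `constant_field` are hypothetical, the step size is fixed rather than adaptive, and the vector field here is a toy stand-in, not the barrier-function-based field from the paper.

```python
from typing import Callable, List

Vector = List[float]


def euler_steer(h: Vector,
                field: Callable[[Vector], Vector],
                total: float,
                steps: int) -> Vector:
    """Integrate dh/dt = field(h) from t=0 to t=total with fixed Euler steps."""
    dt = total / steps
    for _ in range(steps):
        v = field(h)  # steering direction evaluated at the current activation
        h = [hi + dt * vi for hi, vi in zip(h, v)]
    return h


def constant_field(direction: Vector) -> Callable[[Vector], Vector]:
    """A field that ignores h; one Euler step is then plain activation addition."""
    return lambda h: list(direction)
```

With `steps=1` and a constant field, `euler_steer(h, constant_field(v), alpha, 1)` reduces to `h + alpha * v`. When the field depends on `h`, increasing `steps` changes the result, which is where a multi-step solver can differ from one-shot addition.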
OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning: This paper proposes OmniEVA, which addresses two critical gaps in spatial MLLMs — poor geometric adaptability (2D-only or hard-coded 3D injection) and the absence of embodiment constraints (plans that are theoretically feasible but physically unexecutable) — via a task-adaptive gated router that dynamically injects 3D positional encodings only when geometric reasoning is required, and an embodiment-aware reasoning framework that integrates physical constraints into the planning loop. OmniEVA achieves state-of-the-art performance on 7 out of 8 benchmarks.
On Entropy Control in LLM-RL Algorithms: This paper provides a theoretical explanation for why conventional entropy regularization is nearly ineffective in LLM-RL (due to the extremely large action space and sparse optimal actions causing entropy bias to overwhelm optimization gains), and proposes AEnt — a method combining clamped entropy (computed over a reduced token space) with an adaptive coefficient — to effectively balance bias and benefit, consistently outperforming baselines on mathematical reasoning tasks.
One Demo Is All It Takes: Planning Domain Derivation with LLMs from A Single Demonstration: This paper proposes the PDDLLM framework, which derives a complete PDDL planning domain (predicates + actions) automatically from a single demonstration trajectory. It generates interpretable symbolic representations through cross-validation between LLM reasoning and physical simulation, and employs a Logical Constraint Adapter (LoCA) to automatically interface with motion planners. The method achieves at least 20% higher success rates than 6 LLM baselines across 1,200+ tasks in 9 environments, and is successfully deployed on 3 physical robot platforms.
PERSONA: Dynamic and Compositional Inference-Time Personality Control via Activation Vector Algebra: This paper proposes the PERSONA framework, which extracts approximately orthogonal personality vectors from activation space and applies vector algebra operations (scaling, addition, subtraction) to achieve training-free dynamic and compositional personality control. PERSONA attains a score of 9.60 on PersonalityBench, nearly matching the SFT upper bound of 9.61.
Real-Time Robot Execution with Masked Action Chunking: This paper proposes REMAC, which systematically addresses two key failure modes of asynchronous inference—intra-chunk inconsistency and inter-chunk discontinuity—through a masked action chunking training strategy and a prefix-preserved sampling pipeline, enabling more reliable real-time robot control without introducing any additional inference latency.
REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning?: This work presents the first systematic study on how referring expressions (REs) in vague human instructions affect LLM-based robot task planning. REI-Bench is introduced to model 9 levels of coreference ambiguity (3 RE difficulty levels × 3 context types). Implicit REs are found to reduce the success rate of existing planners by up to 36.9%. The proposed Task-Oriented Context Cognition (TOCC) method decouples task understanding from planning decision-making, achieving an average improvement of 6.5% in success rate.
RF-MatID: Dataset and Benchmark for Radio Frequency Material Identification: This paper introduces RF-MatID, the first open-source large-scale RF material identification dataset with wide frequency coverage (4–43.5 GHz) and diverse geometric perturbations, comprising 16 fine-grained material categories (5 superclasses) and 142K samples. A comprehensive benchmark is established across 9 deep learning models, 5 frequency protocols, and 7 data split settings.
RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots: RoboCasa365 constructs a large-scale simulation benchmark comprising 365 everyday kitchen tasks, 2,500 diverse kitchen scenes, and over 2,000 hours of robot interaction data. It systematically evaluates generalist robot policies under three paradigms—multi-task learning, foundation model training, and lifelong learning—and finds that task diversity in pretraining data is the key factor for improving downstream generalization.
RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation: This paper presents RoboInter, a unified manipulation suite for intermediate representations, comprising: RoboInter-Tool (a semi-automatic annotation GUI), RoboInter-Data (230K episodes × 571 scenes with dense per-frame annotations across 10+ intermediate representation types), RoboInter-VQA (a 29-category embodied VQA benchmark), and RoboInter-VLA (a plan-then-execute framework supporting both modular and end-to-end variants), providing a complete infrastructure for enhancing VLA generalization through intermediate representations.
RoboPARA: Dual-Arm Robot Planning with Parallel Allocation and Recomposition Across Tasks: This paper proposes RoboPARA, a two-stage framework that optimizes task parallelism for dual-arm robots via dependency graph construction and graph re-traversal scheduling, achieving 30–50% reduction in execution time and a 34% improvement in success rate over existing methods across multi-scenario benchmarks.
SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests: This paper introduces SocialHarmBench, the first LLM safety evaluation benchmark specifically targeting sociopolitical harms. It comprises 585 prompts spanning 7 categories and 34 countries, revealing systemic safety vulnerabilities in current LLMs across politically sensitive scenarios such as historical revisionism and propaganda manipulation.
Sparse Imagination for Efficient Visual World Model Planning: This paper proposes Sparse Imagination, which substantially speeds up planning with ViT patch-token world models by randomly dropping tokens and training with randomly grouped attention. A 50% drop rate reduces planning time by about 50% while maintaining, and on certain tasks surpassing, full-token planning performance. A key finding is that simple random dropout outperforms sophisticated token selection methods, as static importance ranking suffers from a "blind spot problem" in dynamic planning scenarios.
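The random token dropping mentioned in the Sparse Imagination entry can be sketched as follows. This is a toy illustration only: `drop_tokens` is a hypothetical helper, tokens are represented as plain list items, and the grouped-attention training that the entry also mentions is not modeled.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def drop_tokens(tokens: Sequence[T], drop_rate: float, rng: random.Random) -> List[T]:
    """Keep a uniformly random subset of patch tokens, preserving their order."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    # Always keep at least one token so the world model has some input.
    keep = max(1, round(len(tokens) * (1.0 - drop_rate)))
    kept_indices = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_indices]
```

Because the subset is resampled on each call, no patch position is permanently excluded, in contrast to a static importance ranking.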
String Seed of Thought: Prompting LLMs for Distribution-Faithful and Diverse Generation: This paper proposes String Seed of Thought (SSoT), a concise prompting method that instructs LLMs to first generate a random string and then extract randomness from it to select an answer. SSoT significantly improves distribution faithfulness in probabilistic instruction following (PIF) and response diversity in open-ended generation (DAG). The paper theoretically proves that TV distance decays exponentially with string length, and experiments show that reasoning-capable LLMs approach the performance of pseudo-random number generators.
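The "extract randomness from the string" step in the SSoT entry can be sketched as below. In SSoT the model carries out this step itself in its reasoning; the code only illustrates one possible extraction rule (sum of character codes modulo the number of options), which is an assumption and not necessarily the rule used in the paper.

```python
from typing import Sequence


def pick_from_string(seed: str, options: Sequence[str]) -> str:
    """Deterministically map a 'random' seed string to one of the options."""
    if not options:
        raise ValueError("options must not be empty")
    # Assumed extraction rule: sum the character codes, then reduce modulo
    # the number of options to obtain an index.
    total = sum(ord(ch) for ch in seed)
    return options[total % len(options)]
```

For instance, `pick_from_string("ab", ["heads", "tails"])` sums the codes 97 and 98 to get 195, which is odd, so it returns `"tails"`.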
SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models: This paper constructs structurally identical parallel corpora in which entities are mapped to either real or synthetic names, and quantifies the Knowledge Advantage Gap (KA) — the contribution of parametric knowledge — by comparing model performance across the two "parallel worlds." The results show that this gap persists even when models are augmented with RAG and CoT.
Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts: This paper proposes Sysformer, a lightweight Transformer module that can be plugged in front of any frozen LLM to adaptively transform system prompts in embedding space conditioned on user input, enabling the model to refuse harmful requests while complying with benign ones—without modifying LLM parameters or filtering user inputs.
Test-Time Mixture of World Models for Embodied Agents in Dynamic Environments
Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?: This paper proposes the Theory of Space framework, which systematically evaluates the ability of foundation models to construct and revise spatial beliefs through active exploration, cognitive map probing, and a False Belief paradigm across both text-based and visual environments. The study reveals critical failure modes in current state-of-the-art models, including active-passive performance gaps, inefficient exploration strategies, and deficient belief revision.
THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning: This paper proposes THOR, a framework that systematically addresses three core challenges in tool-integrated mathematical reasoning for LLMs—data construction, fine-grained optimization, and inference enhancement—through three complementary components: the TIRGen data construction pipeline, hierarchical reinforcement learning (joint episode-level and step-level optimization), and a self-correction inference mechanism. THOR achieves state-of-the-art performance among models of comparable scale on benchmarks including MATH500 and AIME.
Token Taxes: Mitigating AGI's Economic Risks: This paper proposes the Token Tax — a surcharge levied on model inference token usage — as a first-line governance instrument for mitigating economic risks in the post-AGI era. It leverages cloud computing providers as intermediaries through a three-stage audit pipeline (black-box token verification → norm-based tax rates → white-box audit). Compared to conventional robot taxes, it offers two distinctive advantages: enforceability through existing compute governance infrastructure, and collection at the point of AI token consumption rather than model hosting location, thereby alleviating global inequality.
Tracing and Reversing Edits in LLMs: Addressing the dual-use risks of Knowledge Editing (KE), this paper proposes EditScope, a method that infers edited target entities from post-edit weights with up to 99% accuracy, alongside a training-free edit reversal approach based on SVD bottom-rank approximation achieving up to 94% reversal rate—requiring only the post-edit weights, without access to the editing prompt or original weights.
TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models: TwinVLA is a modular framework that composes two pretrained single-arm VLAs into a bimanual VLA via joint attention and MoE. It requires only ~800h of public single-arm data, 50 episodes of bimanual fine-tuning data, and 25 H100 GPU-days, yet achieves performance comparable to π₀, which relies on 10,900h of proprietary data and 1,000+ GPU-days.
UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos: UrbanVerse is a data-driven real-to-sim system that converts crowdsourced city-tour videos into physically-aware, interactive simulation environments. It comprises a 100K+ annotated 3D asset library and an automated scene construction pipeline, generating 160 high-quality scenes in IsaacSim. A PPO navigation policy trained on these scenes achieves an 89.7% success rate in zero-shot real-world transfer, completing a 337 m long-range task with only 2 human interventions.
Visual Planning: Let's Think Only with Images: This paper introduces Visual Planning — the first purely visual reasoning paradigm in which the entire planning process is expressed as a sequence of images without any textual intermediary. A Large Vision Model (LVM) autoregressively generates step-by-step state images. The authors further propose VPRL, a two-stage RL framework combining random-trajectory-initialized exploration with GRPO progress-reward optimization. On three navigation benchmarks (FrozenLake, Maze, MiniBehavior), VPRL achieves an average Exact Match (EM) surpassing text-based reasoning methods by 27%, demonstrating that image-based reasoning substantially outperforms text-based reasoning on vision-first tasks.
VLBiMan: Vision-Language Anchored One-Shot Demonstration Enables Generalizable Bimanual Robotic Manipulation: VLBiMan decomposes a single demonstration into invariant and adaptable atomic skills via task-aware bimanual decomposition, employs vision-language anchoring with a VLM to adapt to new object positions and instances in novel scenes, and achieves bimanual coordination through kinematics-aware trajectory composition. The framework achieves an 85.3% success rate across 10 complex bimanual tasks with only one demonstration, substantially outperforming imitation learning baselines that require hundreds of demonstrations.
WebOperator: Action-Aware Tree Search for Autonomous Agents in Web Environment: This paper proposes WebOperator, an action-aware tree search framework that enables autonomous web agents to explore safely and efficiently in partially observable, irreversible real-world web environments through speculative backtracking, destructive action detection, action validation, and action merging. WebOperator achieves a 54.6% success rate on WebArena using gpt-4o, establishing a new state of the art.
What's the Plan? Metrics for Implicit Planning in LLMs and Their Application to Rhyme Generation and Question Answering: This paper proposes a mean activation difference steering method along with accompanying quantitative metrics. Across 23 open-source models (1B–32B) on rhyme generation and question answering, it shows that representations of target tokens (rhymes/answers) form at early sequence positions (forward planning) and causally influence intermediate token generation (backward planning). Implicit planning emerges as early as 1B-scale models, indicating it is a universal mechanism rather than a capability exclusive to large models.
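The mean activation difference idea in this entry can be sketched as follows. This is a generic illustration of the technique on plain Python lists, not the paper's code: the function names, the choice of which two activation sets to contrast, and the `scale` parameter are assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean over a batch of activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def steering_vector(target_acts: Sequence[Sequence[float]],
                    baseline_acts: Sequence[Sequence[float]]) -> Vector:
    """Mean activation on target examples minus mean on baseline examples."""
    target_mean = mean_vector(target_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [t - b for t, b in zip(target_mean, baseline_mean)]


def apply_steering(activation: Sequence[float],
                   vector: Sequence[float],
                   scale: float = 1.0) -> Vector:
    """Add the scaled steering vector to one activation."""
    return [a + scale * v for a, v in zip(activation, vector)]
```

In a rhyme setting, for example, `target_acts` could hold activations collected on prompts that end in one rhyme family and `baseline_acts` those from another; adding the resulting vector at an early position and watching whether later tokens change is the kind of causal test the entry describes.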
When Agents Persuade: Propaganda Generation and Mitigation in LLMs: This paper systematically investigates propaganda generation behavior in LLMs, training dedicated detectors to quantify the use of six rhetorical techniques across three LLMs. Results show that all LLMs can generate propaganda and heavily rely on Loaded Language and Flag-Waving. Three fine-tuning approaches (SFT/DPO/ORPO) are employed for mitigation, with ORPO reducing the propaganda classification rate from 77% to 10% and decreasing rhetorical technique usage by 13.4×.
When would Vision-Proprioception Policies Fail in Robotic Manipulation?: This paper identifies why vision-proprioception manipulation policies fail during motion-transition phases—proprioceptive signals dominate optimization and suppress visual learning—and proposes the Gradient Adjustment with Phase-guidance (GAP) algorithm, which adaptively attenuates proprioceptive gradients to restore visual modality learning, achieving significant generalization improvements in both simulated and real-world environments.