🤖 Robotics & Embodied AI
🔬 ICLR2026 · 162 paper notes
📌 Same area in other venues: 📷 CVPR2026 (130) · 💬 ACL2026 (11) · 🧪 ICML2026 (53) · 🤖 AAAI2026 (30) · 🧠 NeurIPS2025 (73) · 📹 ICCV2025 (26)
🔥 Top topics: Robotics ×62 · Multimodal/VLM ×38 · Reinforcement Learning ×13 · Navigation ×13 · Agents ×13
- A Primer on SO(3) Action Representations in Deep Reinforcement Learning
-
This paper systematically evaluates various parameterizations of SO(3) rotation actions in Deep Reinforcement Learning (Euler angles / Quaternions / Rotation Matrices / Lie algebra tangent vectors). Through large-scale experiments on PPO, SAC, and TD3 under dense and sparse rewards, it demonstrates that "delta tangent vector actions in the local coordinate frame" are the most robust across nearly all algorithms and tasks, providing a practical guide for selecting rotation actions.
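As a minimal sketch of what a delta tangent-vector action in the local frame can look like, the snippet below maps a 3-vector through the SO(3) exponential map (Rodrigues' formula) and right-multiplies the result onto the current orientation. The function names, the `max_angle` scaling, and the assumption that the policy outputs values in [-1, 1] are illustrative choices, not details taken from the paper.

```python
import math

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def hat(w):
    """Map a 3-vector to its skew-symmetric (Lie algebra) matrix."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def so3_exp(w):
    """Rodrigues' formula: rotation matrix for tangent vector w = axis * angle."""
    theta = math.sqrt(sum(c * c for c in w))
    K = hat(w)
    K2 = matmul(K, K)
    if theta < 1e-8:
        a, b = 1.0, 0.5  # small-angle limits of sin(t)/t and (1 - cos(t))/t^2
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    return [[IDENTITY[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)]
            for i in range(3)]

def apply_local_delta(R, action, max_angle=0.1):
    """Apply a policy action in [-1, 1]^3 as a rotation increment (assumed scaling)."""
    delta = [max_angle * c for c in action]
    # Right-multiplication expresses the increment in the body (local) frame.
    return matmul(R, so3_exp(delta))
```

Left-multiplying instead, `matmul(so3_exp(delta), R)`, would express the same increment in the world frame, which is the alternative this kind of comparison contrasts against.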
- Abstracting Robot Manipulation Skills via Mixture-of-Experts Diffusion Policies
-
SMP (Skill Mixture-of-Experts Policy) decomposes action generation of diffusion policies into a set of state-adaptive orthogonal skill bases. By using slowly-varying "sticky" gating to activate only a few experts relevant to the current stage, it achieves reusable and transferable multi-task bimanual manipulation at a medium model scale. It reduces the parameters active at inference to approximately 30% of its own total (about 7% of RDT) while achieving higher success rates than large diffusion baselines.
- Accelerated co-design of robots through morphological pretraining
-
This paper introduces "morphological pretraining": a morphology-agnostic universal controller is pretrained once across tens of millions of robot bodies using differentiable simulation. This frozen (or slightly fine-tuned) controller then enables zero-shot evaluation of arbitrary morphological changes, accelerating robot "body+brain" co-design by an order of magnitude and demonstrating, for the first time, that evolutionary "crossover" can produce offspring superior to their parents.
- Action-aware Dynamic Pruning for Efficient Vision-Language-Action Manipulation
-
To reduce the attention cost of the many visual tokens that Vision-Language-Action (VLA) models process at inference time, this paper proposes ADP (Action-aware Dynamic Pruning). It uses text relevance to prune visual tokens ahead of time, keeping the task-related ones, and uses the recent motion magnitude of the robot's end-effector as a gating signal. Pruning is aggressive during coarse action stages (high displacement) to save computation, and full visual input is restored during fine manipulation stages (low displacement) to maintain precision. It accelerates OpenVLA-OFT by 1.35× on LIBERO with negligible success rate loss and achieves a 1.49× latency speedup on a real-world robot.
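The motion-gated pruning idea can be sketched roughly as follows. This is an illustration under assumptions, not the paper's implementation: the function names, the window size, the `motion_threshold`, and the `keep_ratio` are all hypothetical, and per-token text relevance scores are taken as given.

```python
import math

def recent_motion(ee_positions, window=4):
    """Mean per-step end-effector displacement over the last `window` steps."""
    pts = ee_positions[-(window + 1):]
    steps = [math.dist(a, b) for a, b in zip(pts, pts[1:])]
    return sum(steps) / len(steps) if steps else 0.0

def select_tokens(relevance, ee_positions, motion_threshold=0.02, keep_ratio=0.3):
    """Return the indices of the visual tokens to keep for this step."""
    n = len(relevance)
    if recent_motion(ee_positions) < motion_threshold:
        # Fine manipulation (low displacement): keep the full visual input.
        return list(range(n))
    # Coarse motion (high displacement): keep only the most text-relevant tokens.
    k = max(1, int(round(keep_ratio * n)))
    ranked = sorted(range(n), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:k])
```

With no motion history the displacement defaults to zero, so this sketch falls back to the unpruned input, which is the conservative choice.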
- Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning for Continuous Control
-
This paper provides the first theoretical guarantee for two empirical techniques in imitation learning, action chunking and expert noise-injection data augmentation, using "incremental stability" from control theory. It proves that, under various conditions, both techniques reduce the compounding error of behavior cloning (BC) in continuous control, which otherwise accumulates exponentially over time, to a "horizon-free" bound.
- Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting
-
By verbalizing low-level robot end-effector actions into natural language text and feeding them into a VLM, the fine-tuning data is aligned with the pre-training distribution. This allows converting Gemma-3-12B into a robotic policy (VLA) using only LoRA. In 800+ real-robot experiments, the model retains 85%+ of its VQA capability and achieves zero-shot generalization for multilingual instructions and open-world semantics.
- Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance
-
ATE first aligns pre-trained robot actions and target robot actions into a single structured latent space. It then utilizes gradients generated from latent space distances to guide the fine-tuning of diffusion-based or flow-matching VLAs, enabling faster adaptation to new embodiments and tasks with limited demonstration data.
- All-day Multi-scenes Lifelong Vision-and-Language Navigation with Tucker Adaptation
-
The authors propose Tucker Adaptation (TuKA), which represents multi-level navigation knowledge across various scenes and environments as high-order tensors. Using Tucker decomposition, the method decouples navigation knowledge into a shared subspace (core tensor + encoders/decoders) and scene/environment-specific expert vectors. Combined with a Decoupled Knowledge Incremental Learning strategy, TuKA achieves all-day multi-scene lifelong VLN, outperforming LoRA variants in SR and forgetting rates across 24 navigation scenarios.
- AnyTouch 2: General Optical Tactile Representation Learning For Dynamic Tactile Perception
-
AnyTouch 2 proposes a tactile dynamic pyramid framework and constructs the ToucHD hierarchical dataset containing 2.426 million contact samples (covering atomic actions, real-world manipulation, and touch-force pairs). It designs a unified representation learning framework for triple-layer dynamic perception—pixel-level, semantic-level, and physical-level—outperforming existing methods across static property recognition, dynamic physical prediction, and real-world manipulation tasks.
- APPLE: Toward General Active Perception via Reinforcement Learning
-
This paper proposes APPLE, a general active perception framework that combines reinforcement learning with supervised learning by modeling active perception as a POMDP. The reward is defined as the RL reward minus the prediction loss, so the gradient decomposes naturally into a policy gradient term and a prediction loss term. Built on off-policy algorithms (SAC/CrossQ) and a shared ViViT backbone, it is validated across 5 different task benchmarks, where the CrossQ variant eliminates the need for per-task hyperparameter tuning and increases training efficiency by 53%.
- ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning
-
ArtVIP constructs a set of 992 high-quality digital twin articulated objects and accompanying indoor scenes. By utilizing unified modeling standards, articulated physics parameter tuning, asset-embedded interaction behaviors, and pixel-level affordance labeling, it enables robot learning algorithms to be trained, evaluated, and transferred in simulation environments that more closely resemble the real world.
- AutoBio: A Simulation and Benchmark for Robotic Automation in Digital Biology Laboratory
-
AutoBio transforms "robotic experimentation in biological laboratories" into a suite of simulatable, demonstration-generatable, and evaluatable benchmarks: it digitalizes real instruments using 3D Gaussian Splatting, augments MuJoCo with laboratory-specific physics (threads, detents, eccentricity, liquid surfaces), and resolves transparent container and liquid rendering via Blender PBR. Ultimately, it evaluates mainstream VLA models like π0, π0.5, and RDT across 16 biological experiment tasks of three difficulty levels, exposing significant shortcomings in precision manipulation, instruction following, and visual reasoning.
- AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild
-
The paper proposes AutoFly, an end-to-end VLA model for autonomous UAV navigation in the wild. By using a pseudo-depth encoder to infer spatial information from RGB inputs and a newly constructed autonomous navigation dataset (13K+ trajectories including 1K real flights), it achieves a 3.9% higher success rate and a 2.6% lower collision rate than OpenVLA in both simulated and real environments.
- Autonomous Functional Play with Correspondence-Driven Trajectory Warping
-
This paper proposes Tether: an open-loop strategy that "warps" demonstration trajectories (requiring only \(\le10\) trials) to new scenes via semantic keypoint correspondence. This is integrated into a closed "autonomous functional play" loop scheduled by a Vision-Language Model (VLM). The system enables a robot to automatically generate 1000+ expert-level trajectories over 26 hours in the real world with minimal human intervention, which are then used to train closed-loop imitation policies that achieve success rates comparable to human teleoperated data.
- BFM-Zero: A Promptable Behavioral Foundation Model for Humanoid Control Using Unsupervised Reinforcement Learning
-
BFM-Zero employs online off-policy unsupervised RL (Forward-Backward CPR) to encode actions, goals, and rewards into a shared latent space. It trains a "promptable" humanoid whole-body control generalist policy, achieving zero-shot motion tracking, goal reaching, and reward optimization on the real Unitree G1 without retraining, while supporting fast few-shot adaptation.
- Block-wise Adaptive Caching for Accelerating Diffusion Policy
-
BAC adapts the "feature caching" concept from image diffusion to Diffusion Policy. It utilizes dynamic programming to schedule cache update intervals for each Transformer sub-block individually and introduces the Bubbling Union Algorithm to intercept inter-block error propagation in FFN blocks. This training-free, plug-and-play method accelerates diffusion policy inference by 3× with almost no loss in success rate.
- BOLT: Decision‑Aligned Distillation and Budget-Aware Routing for Constrained Multimodal QA on Robots
-
BOLT decomposes "constrained multiple-choice QA on robots" into option-level decision distillation during training (aligning a 2B student directly with a 13B teacher's preferences over option sets) and budget-aware routing during inference (triggering expensive signals like high-resolution re-evaluation, retrieval, or question decomposition only when cheap signals predict positive gains). Using a 2B student, it achieves 50.50% accuracy on Robo2VLM-1, surpassing the 36.74% of the 13B teacher while reducing VRAM from 26.9GB to 3.8GB and energy consumption by 82.5%.
- Capturing Visual Environment Structure Correlates with Control Performance
-
The authors propose using "regressing the simulator's full state (geometry/object structure/physical properties) from images" as a lightweight proxy task. They demonstrate that this probing accuracy is highly correlated with downstream robot policy success rates, enabling efficient selection of visual backbones without running expensive policy rollouts.
- CE-Nav: Flow-Guided Reinforcement Refinement for Cross-Embodiment Local Navigation
-
CE-Nav proposes a two-stage framework: first, an offline imitation learning stage trains a normalizing flow expert (VelFlow) that is independent of any specific robot embodiment and focuses solely on geometric obstacle avoidance; second, this expert is frozen as a prior for a lightweight online RL refiner to adapt to the specific dynamics of new robots. It achieves SOTA navigation performance on quadruped, biped, and quadrotor platforms while reducing the adaptation time for new robots from 50 hours to 6 hours.
- CompassNav: Steering From Path Imitation to Decision Understanding In Navigation
-
CompassNav shifts the goal navigation training paradigm from "imitating a single expert trajectory" to "decision understanding." By scoring all candidate actions at each step using A* geodesic distances to construct dense supervision, and combining it with a gap-aware hybrid reward for GRPO fine-tuning, the 7B Qwen2.5-VL learns to evaluate the "relative merits of each move," outperforming GPT-4o and even o4-mini on HM3D/MP3D.
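To illustrate how scoring every candidate action by geodesic distance gives denser supervision than a single expert action, here is a small grid-world sketch. It is a toy under stated assumptions: breadth-first search stands in for A* (on a unit-cost grid both yield the same shortest-path lengths), and the four-move action set, the softmax conversion, and the `temperature` parameter are illustrative, not taken from the paper.

```python
import math
from collections import deque

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def geodesic_to_goal(grid, goal):
    """Shortest-path length from every free cell to the goal (0 = free, 1 = wall)."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES.values():
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and grid[nr][nc] == 0 and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist

def score_actions(grid, pos, goal, temperature=1.0):
    """Soft label over all feasible moves; shorter remaining distance scores higher."""
    dist = geodesic_to_goal(grid, goal)
    remaining = {}
    for name, (dr, dc) in MOVES.items():
        nxt = (pos[0] + dr, pos[1] + dc)
        if nxt in dist:  # skips walls, out-of-bounds cells, unreachable cells
            remaining[name] = dist[nxt]
    if not remaining:
        return {}
    best = min(remaining.values())
    weights = {a: math.exp(-(d - best) / temperature) for a, d in remaining.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}
```

Two moves that are equally good receive equal scores, which a single-trajectory imitation target cannot express.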
- Compositional Diffusion with Guided Search for Long-Horizon Planning
-
This work embeds "population-based search" directly into the diffusion denoising process. By using iterative resampling for local-to-global message passing and pruning based on likelihoods derived from DDIM inversion, it enables short-range diffusion models to compose long-horizon plans that are both locally feasible and globally coherent. The method generalizes across robot planning, panorama synthesis, and long-video generation.
- CoNavBench: Collaborative Long-Horizon Vision-Language Navigation Benchmark
-
CoNavBench is the first Vision-Language Navigation (VLN) benchmark for "multi-robot collaboration," containing 4,048 single-agent/collaborative tasks. It includes NavCraft, an automated data generation platform (two-stage agent + scene graph + efficiency toolbox). Using a fine-tuned Qwen2.5-VL-3B as a reference policy, it demonstrates that collaborative decomposition improves step-level task success rates by a relative 18.11%.
- Contractive Diffusion Policies
-
To address the issue where "sampler error + score estimation error" progressively accumulates/pushes actions away from data support in offline control, this paper uses contraction theory to transform "bringing adjacent denoising trajectories closer" into a differentiable penalty on the maximum eigenvalue of the score network's Jacobian. By adding only one hyperparameter and a lightweight loss term, it can be integrated into existing diffusion policies, with particularly significant gains in data-scarce scenarios.
- Cortical Policy: A Dual-Stream View Transformer for Robotic Manipulation
-
Inspired by the division of labor in the human visual cortex—where the "ventral stream" perceives static scenes and the "dorsal stream" observes dynamic motion—this paper proposes Cortical Policy. It features a dual-stream View Transformer consisting of a static stream (using VGGT to supervise cross-view geometric consistency for 3D spatial reasoning) and a dynamic stream (using a pre-trained gaze estimation model to predict end-effector positions from an egocentric dynamic perspective). Cortical Policy significantly outperforms SOTA models like RVT-2 on RLBench, COLOSSEUM, and real-world tasks (RLBench success rate 81.0% vs. 77.5%, COLOSSEUM +9.4%, and 80% success under dynamic disturbances on real robots vs. 0% for static methods).
- Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
-
This work utilizes the pre-trained video generation foundation model Cosmos-Predict2-2B as a base, without modifying any network architecture and using only one-stage fine-tuning. It "encodes" robot actions, future states, and state values as "latent video frames" for joint denoising and generation. This allows the model to simultaneously serve as a policy, world model, and value function. It achieves SOTA on LIBERO (98.5%), RoboCasa (67.1%), and real-world dual-arm ALOHA tasks, with an additional 12.5-point improvement using best-of-N planning.
- Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets
-
This paper systematically investigates cross-embodiment offline RL pre-training paradigms. It finds that gradient conflicts lead to negative transfer when the ratio of suboptimal data and robot diversity increase. It proposes Embodiment Grouping (EG), which clusters robots based on morphological graph distance and performs grouped actor updates. EG significantly alleviates negative transfer on 16 robot locomotion benchmarks (e.g., IQL+EG outperforms IQL by 34% on 70% suboptimal datasets).
- Ctrl-World: A Controllable Generative World Model for Robot Manipulation
-
Ctrl-World transforms a pre-trained passive video diffusion model into a controllable, multi-view, and long-term consistent robotic world model. This allows general-purpose VLA policies to perform closed-loop rollouts in "imaginary space," enabling policy evaluation without real robots and improving success rates by 44.7% through fine-tuning on synthesized success trajectories.
- D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping
-
This paper proposes D-REX, a differentiable real-to-sim-to-real engine based on Gaussian representation. It performs end-to-end object mass identification through visual observations and robot control signals, and utilizes the identified mass for force-aware dexterous grasping policy learning, effectively narrowing the sim-to-real gap.
- D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
-
D2E demonstrates that desktop gaming interaction data can serve as an effective pretraining base for embodied AI. The authors collect 335 hours of human demonstrations with the OWA toolkit, pseudo-label 1,000+ hours of YouTube gaming videos with Generalist-IDM, and perform VAPT transfer training. The resulting 1B-parameter model achieves a 96.6% success rate on LIBERO manipulation and 83.3% on CANVAS navigation, matching or exceeding models 7× its size.
- DataMIL: Selecting Data for Robot Imitation Learning with Datamodels
-
DataMIL transfers the datamodels (data attribution) framework from NLP/CV to robot imitation learning. It uses the policy itself to end-to-end assign an "influence score for task success" to each piece of prior data, then selects high-scoring data for co-training with target data. By replacing expensive real-robot evaluations with a rollout-free proxy loss, it outperforms similarity-retrieval baselines by approximately 10% across 60+ simulated and real manipulation tasks. It successfully selects useful cross-embodiment data from large-scale heterogeneous datasets like OXE.
- DemoGrasp: Universal Dexterous Grasping from a Single Demonstration
-
DemoGrasp starts from a single successful grasp demonstration. The RL policy learns only "how to edit this demonstration" (modifying wrist pose to decide where to grasp and finger joints to decide how to grasp), compressing high-dimensional long-horizon dexterous grasping into a single-step decision problem. Using a minimalist reward of binary success and collision penalty, a universal policy is trained on thousands of objects, achieving 95% success in simulation and 86.5% on 110 unseen objects in the real world, with cross-robot transfer across seven embodiments.
- Demystifying Robot Diffusion Policies: Action Memorization and a Simple Lookup Table Alternative
-
This paper systematically demonstrates that Diffusion Policy in small-data robot imitation learning behaves more like retrieving action segments from the training set based on current images rather than learning a generalizable action generator. It proposes an explicit Action Lookup Table (ALT) using contrastive learning embeddings and nearest neighbor retrieval to achieve performance close to Diffusion Policy while providing significantly faster inference and direct OOD detection.
- DexMove: Learning Tactile-Guided Non-Prehensile Manipulation with Dexterous Hands
-
DexMove adopts a hybrid data paradigm combining "large-scale simulation trajectories + a small amount of human tactile demonstrations" to train a flow matching policy. This allows a multi-fingered dexterous hand to push and rotate tabletop objects through wrist-finger coordination and tactile closed-loop control (non-prehensile relocation). On a real robot, it achieves an average success rate of 77.8% across 6 object categories, surpassing ablation baselines by 36.6% and improving efficiency by nearly 300%.
- DexNDM: Closing the Reality Gap for Dexterous In-Hand Rotation via Joint-Wise Neural Dynamics Model
-
DexNDM decomposes the high-dimensional hand-object system into low-dimensional effective dynamics for individual joints using a Joint-Wise Neural Dynamics Model. Combined with an autonomous "Chaos Box" for data collection, it trains a residual policy to correct the simulation-based base policy. This approach achieves the first robust real-world in-hand rotation for complex, high-aspect-ratio, and small objects across multiple wrist orientations using a single policy.
- Difference-Aware Retrieval Policies for Imitation Learning
-
DARP reparameterizes imitation learning from a global "state → action" mapping into a semi-parametric retrieval strategy: it retrieves \(k\) nearest neighbors from expert data, predicts actions based on the difference vectors between each neighbor and the query state, and performs permutation-invariant aggregation. Theoretically equivalent to a parameter-free Laplacian smoothing, DARP consistently improves performance over standard Behavior Cloning by 15–46% across MuJoCo, Robosuite, and RoboCasa.
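The retrieve, difference, and aggregate steps can be sketched as below. This is a hedged illustration only: `head` stands in for the learned network that maps a difference vector and a neighbor's action to a prediction, and `toy_head` is a hypothetical placeholder (it assumes state and action share a dimension) used just to make the example runnable.

```python
import math

def knn(query, states, k):
    """Indices of the k expert states closest to the query (Euclidean distance)."""
    order = sorted(range(len(states)), key=lambda i: math.dist(query, states[i]))
    return order[:k]

def retrieval_policy(query, states, actions, head, k=3):
    """Predict one action per neighbor from its difference vector, then average."""
    preds = []
    for i in knn(query, states, k):
        diff = [q - s for q, s in zip(query, states[i])]
        preds.append(head(diff, actions[i]))
    dim = len(preds[0])
    # The mean is permutation-invariant: neighbor order does not change the output.
    return [sum(p[d] for p in preds) / len(preds) for d in range(dim)]

def toy_head(diff, neighbor_action):
    """Hypothetical stand-in for the learned head: shift the neighbor's action by diff."""
    return [a + d for a, d in zip(neighbor_action, diff)]
```

On an expert whose action equals its state, `retrieval_policy([1.5], states, actions, toy_head, k=2)` recovers the expert's action from the two nearest demonstrations, because each neighbor's prediction is corrected by its offset from the query.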
- Differentiable Simulation of Hard Contacts with Soft Gradients for Learning and Control
-
Penalty-based simulators such as MuJoCo have two long-standing issues: automatic differentiation gradients become distorted under hard contact, and gradients vanish when objects are not in contact. This paper introduces "Adaptive Step Integration (DiffMJX)" to correct discretization-induced gradient errors. It then uses "Contacts From Distance (CFD) + Straight-Through Trick" to inject informative gradients for non-contacting objects without compromising forward physical realism. This enables real-world cube parameter identification and control of high-dimensional musculoskeletal systems directly with first-order gradients.
- Disentangled Robot Learning via Separate Forward and Inverse Dynamics Pretraining
-
DeFI decomposes robot policy learning into two independent modules—"predicting future frames" and "inferring latent actions." These are pretrained separately on large-scale human and robot videos and then coupled for end-to-end fine-tuning. This allows massive action-less videos to be utilized for VLA, achieving SOTA results on CALVIN ABC-D (Avg. length 4.51), SimplerEnv-Fractal (51.2%), and real-world robots (81.3%).
- Efficient Differentiable Contact Model with Long-range Influence
-
This paper systematically characterizes four properties that a "well-conditioned contact model" must satisfy (barrier-form, second-order smoothness, non-prehensile, and non-vanishing). It designs a differentiable contact potential function that is efficiently evaluated using a Bounding Sphere Hierarchy (BSH) and provides non-zero gradients even when objects are far apart, enabling gradient-based optimizers to discover complex contact-rich motions from trivial initializations.
- Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
-
Embodied-R1 uses "pointing" (2D coordinate points/trajectory sequences) as a unified embodiment-agnostic intermediate representation and trains a 3B-parameter VLM via two-stage reinforced fine-tuning (RFT). It achieves SOTA performance on 11 spatial reasoning benchmarks and 8 real-robot tasks, with a zero-shot success rate of 87.5%.
- Embodied Agents Meet Personalization: Investigating Challenges and Solutions Through the Lens of Memory Utilization
-
This paper systematically evaluates the memory utilization capabilities of LLM-driven embodied agents through the Memento framework. The study reveals that existing agents can recall simple object semantics but fail to process sequential information regarding user behavior patterns. To address this, a user profile memory module based on a hierarchical knowledge graph is proposed to effectively enhance performance in personalized assistance tasks.
- Embodied Navigation Foundation Model
-
NavFoM is the first cross-embodiment × cross-task embodied navigation foundation model, jointly trained on 8 million navigation samples covering quadrupeds, drones, wheeled robots, and vehicles. It handles arbitrary camera configurations via Temporal-Viewpoint Indicator (TVI) tokens and manages inference overhead through budget-aware history sampling. It achieves SOTA or competitive performance on 7 public benchmarks without fine-tuning.
- Emergence of Spatial Representation in an Actor-Critic Agent with Hippocampus-Inspired Sequence Generator
-
Inspired by the intrinsic recurrent loops in the hippocampal CA3 region, this paper proposes a minimal sequence generator (shift register) integrated with an actor-critic agent. This approach achieves maze navigation using sparse visual inputs while facilitating the emergence of neurobiological phenomena such as place fields, DG orthogonalization, distance-dependent spatial kernels, and task-dependent remapping.
- Emergent Dexterity via Diverse Resets and Large-Scale Reinforcement Learning
-
OmniReset enables the emergence of complex multi-stage dexterous manipulation strategies by automatically generating four types of diverse initial state distributions. Using large-scale PPO in massively parallel simulation, it requires no human demonstrations, curricula, or task-specific rewards, and achieves zero-shot transfer to real robots.
- Empowering Multi-Robot Cooperation via Sequential World Models
-
This paper proposes SeqWM (Sequential World Model), which introduces the sequential (autoregressive) paradigm into multi-robot model-based reinforcement learning. Each robot independently maintains a world model and sequentially passes predicted trajectories. While reducing modeling complexity, the system naturally evolves advanced collaborative behaviors such as proactive adaptation, temporal alignment, and role division through intent sharing, successfully achieving sim-to-real transfer.
- ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction
-
ENACT formalizes embodied cognition evaluation as world-modeling VQA over first-person interaction. Using forward and inverse sequence reshuffling tasks, it reveals significant gaps between current top-tier VLMs and humans, as well as anthropomorphic biases in those models.
- End-to-end Listen, Look, Speak and Act
-
ELLSA is the first truly end-to-end full-duplex multimodal system. Through the SA-MoE architecture, it connects speech experts and action experts with unified attention, enabling a robot to "listen, look, speak, and act" simultaneously while supporting previously impossible interactions such as barge-in, speaking-while-acting, and contextual visual question answering.
- EquAct: An SE(3)-Equivariant Multi-Task Transformer for 3D Robotic Manipulation
-
EquAct proposes the first multi-task, language-conditioned keyframe manipulation policy that achieves continuous SE(3) equivariance (rotation + translation) within a single unified model. By utilizing an equivariant point Transformer U-Net, spherical harmonic Fourier features, and an SE(3)-invariant iFiLM language modulation layer, it achieves SOTA performance across 18 RLBench tasks (including SE(3) perturbations) and 4 real-world tasks.
- EVLP: Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning
-
EVLP utilizes a unified multimodal generation framework to simultaneously model linguistic reasoning and visual imagination. Coupled with "Bidirectional Dynamic Perception Pre-training" and "Reinforced Supervised Fine-Tuning (RSFT)", the model generates the next linguistic action and sub-goal image from high-level instructions in one step, significantly outperforming various language, visual, and multimodal planning baselines in long-horizon manipulation tasks.
- ExoPredicator: Learning Abstract Models of Dynamic Worlds for Robot Planning
-
The ExoPredicator framework is proposed to jointly learn symbolic state abstractions and causal processes (comprising endogenous actions and exogenous mechanisms). By combining Variational Bayesian Inference with LLM proposals, it learns causal world models with stochastic delays from a minimal number of trajectories, enabling rapid generalized planning across five tabletop robotic environments.
- Experience-based Knowledge Correction for Robust Planning in Minecraft
-
The study demonstrates that LLMs cannot self-correct erroneous planning priors (item dependencies) through prompting alone. It proposes XENON, an algorithmic knowledge management system that combines an Adaptive Dependency Graph (ADG) with a Failure-aware Action Memory (FAM) and learns from binary feedback. XENON enables a 7B LLM to outperform SOTA methods that use GPT-4V with oracle knowledge in long-term Minecraft planning.
- FASTer: Toward Powerful and Efficient Autoregressive Vision-Language-Action Models with Learnable Action Tokenizer and Block-wise Decoding
-
FASTer compresses continuous robot actions into structured discrete action codes and utilizes block-wise autoregressive VLA to generate action tokens in blocks. This approach significantly reduces autoregressive inference latency while maintaining high control precision, outperforming existing VLA baselines across various simulated and real-world robot platforms.
- From Embedding to Control: Representations for Stochastic Multi-Object Systems
-
This paper proposes Graph Controllable Embeddings (GCE), which embeds the conditional distributions of stochastic multi-body systems into a Reproducing Kernel Hilbert Space (RKHS) to linearize non-linear dynamics. Combined with Graph Neural Networks and mean-field approximations for adaptive modeling of non-uniform interactions, it enables efficient control and few-shot generalization of stochastic, variable-topology multi-body systems using simple linear LQR controllers.
- From Language to Locomotion: Retargeting-free Humanoid Control via Motion Latent Guidance
-
RoboGhost proposes a retargeting-free language-driven humanoid control framework: it allows text-generated "motion latents" to directly serve as conditions for a diffusion policy to denoise executable actions from noise. This bypasses the multi-stage pipeline of "decode motion \(\rightarrow\) retarget to robot \(\rightarrow\) physical tracking," which is prone to error accumulation and high latency, reducing the time from text to deployment from 17.85s to 5.84s.
- From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
-
FSD transforms the task of "predicting grasp points/trajectories" in robotic manipulation into an explicit spatial reasoning process: it first utilizes a spatial relationship graph for visual Chain-of-Thought (SrCoT) and then generates embodiment-agnostic intermediate visual affordances (affordance boxes/points + visual trajectories). This enables zero-shot manipulation without fine-tuning and significantly outperforms affordance baselines across 8 spatial reasoning benchmarks and real-world tasks.
- From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning
-
S2E proposes a hybrid learning framework "from seeing to experiencing": it first pre-trains with anchor-guided Gaussian mixture distributions on 100 hours of real navigation videos, then performs RL post-training in simulation with a zero-initialized Residual Attention Module (RAM). Updating only the cross-attention branches injects reactive capabilities for obstacle and pedestrian avoidance, allowing navigation foundation models to break through the scaling ceiling of purely offline data and achieve zero-shot transfer to real wheeled and quadruped robots.
- From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
-
This paper introduces FALCON (From Spatial to Action), which gives VLA models strong 3D spatial perception by injecting rich 3D spatial tokens from a spatial foundation model into the Action Head rather than the VLM backbone. It maintains flexible modality switching from RGB-only to RGB-D and achieves SOTA in both simulation and real-world tasks.
- Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
-
GE integrates a "multi-view video world model (GE-Base)" and a "lightweight parallel action decoder (GE-Act)" into a unified video generation framework. The action branch directly reads multi-scale, full-resolution latent representations from the video DiT via block-wise alignment. Combined with slow-fast asynchronous inference, it generates 54-step action trajectories within 200ms on a single RTX 4090 and enables transfer to new robotic embodiments using only 1 hour of teleoperation data.
- Geometry-Aware Policy Imitation
-
GPI treats expert demonstrations as geometric curves in state space rather than sets of state-action samples. It derives two complementary control primitives—"propulsive flow + attractive flow"—from the distance field induced by these curves. These are combined into a non-parametric, interpretable vector field that directly drives the robot. While achieving higher success rates than diffusion policies, it is 20–100× faster in inference and requires two orders of magnitude less memory.
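A rough sketch of combining an attractive and a propulsive flow around a demonstration curve is shown below. It is an assumption-laden toy rather than the paper's formulation: the demonstration is a discrete list of waypoints, the nearest waypoint approximates the closest point on the curve, and the gains `k_attract` and `k_propel` are hypothetical.

```python
import math

def flow_velocity(state, demo, k_attract=1.0, k_propel=1.0):
    """Velocity command = pull toward the demo curve + push along its direction."""
    # Nearest waypoint approximates the closest point on the demonstration curve.
    i = min(range(len(demo)), key=lambda j: math.dist(state, demo[j]))
    attract = [k_attract * (n - s) for n, s in zip(demo[i], state)]

    # Unit tangent toward the next waypoint; zero at the end of the demonstration.
    j = min(i + 1, len(demo) - 1)
    tangent = [b - a for a, b in zip(demo[i], demo[j])]
    norm = math.hypot(*tangent)
    if norm > 0:
        propel = [k_propel * t / norm for t in tangent]
    else:
        propel = [0.0] * len(state)

    return [a + p for a, p in zip(attract, propel)]
```

Because the field is evaluated directly from stored demonstrations, there is no network forward pass at control time, which is consistent with the inference-speed advantage described above.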
- Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-Language Navigation
-
DualVLN (InternVLA-N1) decouples vision-language navigation into a "Slow System" (7B VLM) for pixel goal grounding and a "Fast System" (lightweight diffusion policy) for continuous trajectory generation. Operating asynchronously, the two systems achieve new SOTA results on VLN-CE / VLN-PE and enable real-world dynamic obstacle avoidance.
- Grounding Generative Planners in Verifiable Logic: A Hybrid Architecture for Trustworthy Embodied AI
-
This paper proposes VIRF (Verifiable Iterative Refinement Framework), which integrates a deterministic Logic Tutor with an LLM planner through a neuro-symbolic hybrid architecture. By using a verifiable formal ontology as a safety anchor, it achieves 0% Hazard Action Rate (HAR) and 77.3% Goal Completion Rate (GCR) on SafeAgentBench, demonstrating that strict safety guarantees do not necessitate sacrificing agent utility.
- H\(^3\)DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning
-
H\(^3\)DP simultaneously introduces "Input Hierarchy (depth-slicing RGB-D) + Representation Hierarchy (multi-scale visual features) + Action Hierarchy (coarse-to-fine hierarchical conditional denoising)" into visuomotor diffusion policies. By explicitly coupling visual perception with action generation, it achieves an average improvement of +27.5% across 44 simulation tasks and +72.4% on real-world dual-arm tasks relative to baselines.
- HAMLET: Switch Your Vision-Language-Action Model into a History-Aware Policy
-
HAMLET enables "single-frame" pretrained VLAs to gain history-awareness in a plug-and-play, near-zero overhead manner by appending a few learnable moment tokens (initialized via time-contrastive learning) and a lightweight memory module. It improves success rates from 29.2% to 76.4% on real-world long-horizon tasks.
- Hierarchical Value-Decomposed Offline Reinforcement Learning for Whole-Body Control
-
Addressing the scarcity of expert data for high-DoF whole-body robots, HVD decomposes the value function of offline RL along the robot's kinematic structure (base/torso/arm). It performs value filtering from large-scale imperfect data and implements fine-grained credit assignment via temporal chunking, significantly outperforming imitation learning baselines on five tasks using a real 21-DoF humanoid robot.
- House Of Dextra: Cross-Embodied Co-Design for Dexterous Hands
-
House of Dextra proposes a cross-embodiment co-design framework for dexterous hands that connects a manufacturable modular hand grammar, morphology-conditioned control policies, and graph-heuristic search. It filters and fine-tunes hand morphologies in simulation, eventually deploying multiple designs (3-fingered, 4-fingered, 5-fingered, etc.) zero-shot to real hardware for blind in-hand rotation.
- HWC-Loco: A Hierarchical Whole-Body Control Approach to Robust Humanoid Locomotion
-
HWC-Loco reformulates humanoid locomotion control as a "robust optimization" problem, utilizing a high-level planner to dynamically switch between two low-level policies: "goal-tracking" and "safety-recovery." This ensures ZMP stability without sacrificing task performance, achieving SOTA results across various terrains, disturbances, and embodiments in both simulation and real-world hardware.
- Hybrid Training for Vision-Language-Action Models
-
This paper proposes Hybrid Training (HyT): an approach that enables VLAs to learn simultaneously from "Chain-of-Thought (CoT)" and "Action" data during training, while bypassing time-consuming thought generation during inference via a "modality variable." This achieves the performance gains of CoT while maintaining the high control frequency of standard VLAs.
- HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
-
HybridVLA enables a single LLM backbone to simultaneously perform diffusion denoising and autoregressive action prediction within a unified token sequence. By adaptively fusing both paradigms through a confidence-based collaborative ensemble, it achieves performance gains of 17% in simulation and 19% on real robots over SOTA models.
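One way to picture a confidence-based ensemble is to average the two predictions only when the autoregressive branch is confident, and to fall back to the diffusion output otherwise. The sketch below assumes exactly that rule. The function name, the mean-probability confidence measure, and the `threshold` value are illustrative assumptions rather than details given in this note.

```python
def fuse_actions(diffusion_action, ar_action, ar_token_probs, threshold=0.9):
    """Fuse two action predictions based on autoregressive token confidence.

    ar_token_probs: probabilities of the sampled autoregressive action tokens.
    """
    confidence = sum(ar_token_probs) / len(ar_token_probs)
    if confidence >= threshold:
        # Both branches are trusted: average them element-wise.
        return [(d + a) / 2 for d, a in zip(diffusion_action, ar_action)]
    # Low confidence: rely on the diffusion prediction alone.
    return list(diffusion_action)
```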
- Image Quality Assessment for Embodied AI
-
This work extends Image Quality Assessment (IQA) from "predicting human preference" to "predicting robot task success" for the first time. Based on the Mertonian system, a four-step Perception-Cognition-Decision-Execution pipeline is established. The Embodied-IQA database is constructed, containing 36.9k distorted image pairs and 5.53M fine-grained annotations (compiled from 15 VLMs, 15 VLAs, and 1.5k real-robot experiments). Evaluations using 15 mainstream IQA methods demonstrate that existing metrics designed for humans fail significantly in embodied contexts.
- Interleave-VLA: Enhancing Robot Manipulation with Image-Text Interleaved Instructions
-
This paper proposes Interleave-VLA: a model-agnostic paradigm requiring minimal architectural changes that enables existing VLAs to process "image-text interleaved" instructions (replacing text descriptions of target objects with their images). Along with an automated pipeline that transforms Open X-Embodiment into a 210k interleaved instruction dataset, the approach improves out-of-domain generalization for unseen objects by approximately 2× and demonstrates emergent zero-shot understanding of sketches and web images.
- JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
-
Inspired by the human brain's left-hemisphere semantic understanding and right-hemisphere spatial cognition, this paper proposes JanusVLN—the first dual implicit neural memory framework designed for VLN. It models spatial-geometric and visual-semantic memories as fixed-size KV Caches, achieving efficient spatial reasoning using only RGB video and reaching SOTA performance on the VLN-CE benchmark.
- Latent Adaptation of Foundation Policies for Sim-to-Real Transfer
-
This paper proposes Found-adapt: it first pre-trains a reusable latent-conditioned foundation policy on offline simulator trajectories, and then corrects the latent variable \(z\) during deployment using a small amount of target domain data. This mitigates the dynamics sim-to-real gap in robot locomotion without retraining the policy network.
- Learning to Grasp Anything By Playing with Random Toys
-
LEGO trains a grasping policy using 3D-printed "toys" randomly assembled from four shape primitives: spheres, boxes, cylinders, and rings. By employing a Detection Pooling (DetPool) mechanism that constrains visual attention to target objects to learn object-centric representations, it achieves a 67% zero-shot success rate on real-world YCB objects, outperforming VLA models that use significantly more data and parameters.
- LeRobot: An Open-Source Library for End-to-End Robot Learning
-
LeRobot is an open-source end-to-end robot learning library released by Hugging Face. It integrates low-level motor middleware, a unified multimodal dataset format, a decoupled asynchronous inference stack, and a suite of state-of-the-art (SOTA) policy implementations, consolidating the fragmented and closed-source robot learning toolstack into a reproducible, low-barrier, vertically integrated platform.
- Lifelong Embodied Navigation Learning
-
This paper proposes the Lifelong Embodied Navigation Learning task and the Uni-Walker framework, enabling LLM-driven embodied navigation agents to sequentially learn multiple navigation tasks (VLN, OLN, DUN). This approach allows the agent to absorb new scenes and instruction styles while significantly reducing the forgetting of previous tasks.
- M³E: Continual Vision-and-Language Navigation via Mixture of Macro and Micro Experts
-
M³E replaces the FFN layers of an LLM navigation agent with "Macro + Micro" dual-routed MoE-LoRA layers. The macro-router employs a GNN on a cognitive map for topology-aware scene-level expert selection, while the micro-router performs instruction-level expert selection based on token hidden states. Combined with a dynamic momentum update strategy that freezes or aggressively updates different experts, this approach achieves cross-environment continual learning under a replay-free constraint, improving both navigation success rates and anti-forgetting capabilities on R2R and REVERIE.
- ManipEvalAgent: Promptable and Efficient Evaluation Framework for Robotic Manipulation Policies
-
ManipEvalAgent utilizes a collaborative group of VLM Agents to mimic how human experts form judgments by "trying it out a few times," performing promptable, multi-turn, and dynamically planned evaluations of robotic manipulation policies. By generating task and evaluation tool code within a simulator, it achieves conclusions comparable to full-scale benchmarks using significantly fewer samples, while providing interpretable diagnostic text instead of a single success rate.
- Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots
-
This paper proposes camera-specific Camera Depth Models (CDM) to calibrate noisy RGB-D inputs from real depth cameras into high-quality metric depth similar to simulation. This allows robotic manipulation policies trained solely on clean simulation depth to transfer to real-world long-horizon tasks with zero fine-tuning.
- Masked Generative Policy for Robotic Control
-
This paper discretizes robot actions into tokens and uses a "Masked Generative Transformer" from image generation to predict entire action sequences in parallel, then resamples only the low-confidence tokens. This removes the bottlenecks of multi-step denoising in diffusion policies and token-by-token decoding in autoregressive policies, achieving globally coherent and reliable control in dynamic, partially observable, and non-Markovian tasks.
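The predict-in-parallel, then refine-low-confidence loop can be sketched as follows. This is a toy illustration: `predict` is a hypothetical callable returning a `(token, confidence)` pair for every position, and `keep_threshold` and `rounds` are assumed hyperparameters, not values from the paper.

```python
def masked_decode(predict, length, rounds=3, keep_threshold=0.8):
    """Iteratively fill a token sequence, keeping only confident predictions.

    Requires rounds >= 1. None marks a position that is still masked.
    """
    tokens = [None] * length
    for _ in range(rounds):
        proposals = predict(tokens)  # all positions predicted in parallel
        for i, (tok, conf) in enumerate(proposals):
            if tokens[i] is None and conf >= keep_threshold:
                tokens[i] = tok
        if None not in tokens:
            break
    # Positions still masked fall back to the latest proposal.
    return [t if t is not None else proposals[i][0] for i, t in enumerate(tokens)]


def toy_predict(tokens):
    """Toy predictor: position 0 is confident at once, others need context."""
    any_known = any(t is not None for t in tokens)
    out = []
    for i in range(len(tokens)):
        conf = 0.9 if (i == 0 or any_known) else 0.4
        out.append((i % 4, conf))
    return out


assert masked_decode(toy_predict, 5) == [0, 1, 2, 3, 0]
```

In the toy run, the first round fixes only position 0, and the second round fills the rest once that context is available.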
- Memory, Benchmark & Robots: A Benchmark for Solving Complex Tasks with Reinforcement Learning
-
The authors propose the MIKASA memory benchmark suite—unifying fragmented memory RL evaluations with a four-category memory task classification framework. They construct 32 tabletop robotic manipulation memory tasks (MIKASA-Robo) for the first time, systematically exposing memory deficiencies of mainstream RL/VLA agents in partially observable manipulation tasks.
- MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
-
Inspired by the dual memory system in cognitive science, this work proposes the MemoryVLA framework. It introduces a Perceptual-Cognitive Memory Bank (PCMB) into the VLA model to capture long-term dependencies through memory retrieval, gated fusion, and consolidation mechanisms, significantly outperforming CogACT and π₀ across 150+ tasks in SimplerEnv, LIBERO, and real-world environments.
- MetaVLA: Unified Meta Co-Training for Efficient Embodied Adaptation
-
MetaVLA introduces a lightweight context memory module (Action-ANP) derived from Attentive Neural Processes during the VLA post-training phase. It transforms multi-task co-training from a state where "more tasks lead to collapse" to one where "auxiliary tasks improve performance." Using a single model on LIBERO, it reduces OpenVLA training from 240K steps to 75K steps, cuts GPU time by 76%, and outperforms the baseline by 8% on long-horizon tasks.
- MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation
-
This work proposes MolLangBench, a high-quality, unambiguous molecule-language interface benchmark constructed via automated tools and expert annotation. It covers recognition, editing, and generation tasks across SMILES, image, and graph modalities. Evaluations of 16+ commercial LLMs and 5 chemistry-specific models reveal that even GPT-5 remains significantly deficient in basic molecular operations (e.g., achieving only 43% in generation).
- MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
-
MoMaGen models demonstration data generation for bimanual mobile manipulation as a constrained optimization problem. By synergizing hard constraints (reachability, collision-free, visibility) and soft constraints (object visibility during navigation, compact retracted poses), it automatically generates large-scale diverse datasets from a single human teleoperated demonstration. The trained visuo-motor policy can be deployed on physical robots with fine-tuning on only 40 real-world demonstrations.
- MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Models for Embodied Task Planning
-
MomaGraph unifies spatial relationships, functional relationships, and part-level interaction nodes into a task-oriented scene graph. By training a 7B VLM with reinforcement learning to "graph then plan," it achieves a 71.6% accuracy on a self-built benchmark, surpassing the strongest baseline by 11.4 points.
- Much Ado About Noising: Dispelling the Myths of Generative Robotic Control
-
This paper systematically "demystifies" Generative Control Policies (GCP) for robotics. Through rigorous ablations across 28 behavior cloning benchmarks, the authors show that the advantage of GCPs over regression policies stems not from multimodal modeling or expressivity, but from the combination of "noise injection during training + supervised iterative computation." Based on this, they design MIP, a minimal two-step policy without distribution fitting, that matches flow model performance.
- MVR: Multi-view Video Reward Shaping for Reinforcement Learning
-
MVR uses video-text similarity from multi-view videos to learn a state relevance function. Combined with state-dependent reward shaping (automatically decaying VLM guidance), it outperforms existing VLM reward methods across 19 tasks in HumanoidBench and MetaWorld.
- Nonparametric Teaching of Attention Learners
-
This paper proposes AtteNT, which reinterprets the training process of attention learners (Transformer/ViT) from the perspective of nonparametric teaching theory. By analytically deriving the importance-adaptive role of attention in parameter gradients, the authors prove that dynamic ANTK converges to the importance-adaptive canonical kernel in functional gradients, bridging the gap between parameter and functional spaces. A greedy teaching algorithm is introduced to select samples with the largest prediction bias, accelerating training by 13.01% for LLM fine-tuning and 20.58% for ViT pre-training while maintaining or improving accuracy.
- OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning
-
OmniEVA addresses two major gaps in spatial MLLMs: poor geometric adaptability (2D-only or hard-coded 3D) and lack of embodiment constraints (producing theoretically feasible but physically unexecutable plans). It utilizes a task-adaptive gated router to dynamically inject 3D positional encodings only when geometric reasoning is required and incorporates an embodiment-aware reasoning framework to integrate physical constraints into the planning loop. OmniEVA achieves SOTA results on 7 out of 8 benchmarks.
- OmniNav: A Unified Framework for Prospective Exploration and Visual-Language Navigation
-
OmniNav utilizes a dual-system architecture comprising a VLM backbone and a flow-matching policy head to unify four navigation tasks—instruct-goal, object-goal, point-goal, and frontier exploration—into a single model. The fast system predicts high-precision continuous waypoints from short-term visual contexts to support 5 Hz real-time control, while the slow system performs sub-goal planning with Chain-of-Thought (CoT) using long-term memory and frontiers. Supplemented by joint training with large-scale general vision-language data, it achieves SOTA performance on benchmarks such as R2R-CE, RxR-CE, and HM3D-OVON, and has been successfully deployed on physical robots.
- On Entropy Control in LLM-RL Algorithms
-
The authors theoretically explain why traditional entropy regularization is nearly ineffective in LLM-RL (due to immense action spaces and sparse optima causing entropy bias to overwhelm optimization gains). They propose the AEnt method, which uses clamped entropy (calculated on a reduced token space) and adaptive coefficients to effectively balance bias and benefits, consistently outperforming baselines in mathematical reasoning tasks.
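To make "entropy on a reduced token space" concrete, the sketch below computes entropy over only the `k` most probable tokens after renormalizing them. Treating the reduced space as a renormalized top-`k` set is an assumption made for illustration, and `clamped_entropy` is a hypothetical name. The AEnt paper may define the reduction differently.

```python
import math


def clamped_entropy(probs, k):
    """Entropy of the distribution restricted to its k most probable tokens."""
    top = sorted(probs, reverse=True)[:k]
    z = sum(top)  # renormalize over the reduced token set
    return -sum((p / z) * math.log(p / z) for p in top if p > 0)
```

Whatever the vocabulary size, this quantity is bounded by `log(k)`, which limits how much a long tail of unlikely tokens can contribute.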
- On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
-
This paper first systematically evaluates the robustness of mainstream VLAs across 17 types of perturbations in four modalities (action, observation, environment, and instruction), finding that action is the most vulnerable modality, existing vision-only robustness methods fail to transfer, and \(\pi_0\) is the most stable. Then, it proposes RobustVLA: robust optimization under worst-case action noise for the output, and action consistency constraints under semantic invariance for the input, utilizing a UCB bandit to automatically select the most harmful perturbations for training. It achieves a 14.0% absolute improvement over \(\pi_0\) on LIBERO and a 65.6% higher success rate than \(\pi_0\) on real-world robots with only 25 demonstrations.
- One Demo Is All It Takes: Planning Domain Derivation with LLMs from A Single Demonstration
-
The authors propose PDDLLM, a framework that automatically derives a complete PDDL planning domain (predicates and actions) from a single demonstration trajectory. By cross-validating LLM reasoning with physical simulation, it generates interpretable symbolic representations and utilizes a Logic Constraint Adapter (LoCA) to interface with motion planners. In over 1200 tasks across 9 environments, it outperforms 6 LLM baselines in success rate by at least 20%, and it has been successfully deployed on 3 physical robot platforms.
- OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning
-
OneTwoVLA unifies fast action execution and slow language reasoning within a single VLA. The model adaptively triggers reasoning using [BOR] during critical moments and outputs actions directly via [BOA] otherwise, significantly outperforming non-reasoning VLAs and dual-system approaches in long-horizon manipulation, error recovery, human-robot interaction, and open-vocabulary visual grounding.
- PA3FF: Part-Aware Dense 3D Feature Fields for Generalizable Articulated Object Manipulation
-
This paper proposes PA3FF—a dense 3D feature field predicted feed-forwardly from point clouds where feature distances reflect whether points belong to the same functional part. Building upon this, the Part-Aware Diffusion Policy (PADP) is introduced, enabling robots to generalize across various articulated objects (door handles, knobs, lids) with minimal demonstrations, significantly outperforming 2D/3D representations like CLIP, DINOv2, and Grounded-SAM in PartInstruct simulations and 8 real-world tasks.
- Partially Equivariant Reinforcement Learning in Symmetry-Breaking Environments
-
This paper proposes a Partially Group Invariant MDP (PI-MDP) framework that uses a learnable gating function \(\lambda(s,a)\) to switch point-wise between equivariant and standard Bellman updates in the state-action space. The authors prove that local symmetry breaking is amplified by a factor of \(1/(1-\gamma)\) through discounted backups, producing global value function errors, whereas PI-MDP strictly confines errors within the symmetry-breaking regions. Instantiated as PE-DQN and PE-SAC algorithms, the method outperforms strict and approximate equivariant baselines across Grid-World, MuJoCo locomotion, and robotic arm manipulation tasks.
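A small numerical sketch may help with both ideas. It assumes the gate blends the two Bellman targets linearly, with \(\lambda = 1\) selecting the standard target (the convention is an assumption), and it evaluates the \(1/(1-\gamma)\) factor as the geometric-series bound on accumulated per-step error. Both function names are hypothetical.

```python
def gated_target(lam, target_equivariant, target_standard):
    """Blend two Bellman targets; lam=0 is fully equivariant, lam=1 fully standard."""
    return (1.0 - lam) * target_equivariant + lam * target_standard


def amplified_error(local_error, gamma):
    """Bound on value error when a per-step error repeats under discounting.

    Sum over t of gamma**t * local_error = local_error / (1 - gamma).
    """
    return local_error / (1.0 - gamma)
```

With \(\gamma = 0.99\), a per-step error of 0.01 can grow to roughly 1.0 in the value function, which is why confining the error to the symmetry-breaking region matters.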
- PixelVLA: Advancing Pixel-level Understanding in Vision-Language-Action Model
-
PixelVLA is the first vision-language-action model to support both pixel-level understanding and multi-modal prompting (text + points/lines/boxes/masks). By integrating three components—a "multi-scale pixel-aware encoder, a visual prompt encoder, and a continuous action decoder"—into existing VLAs and utilizing an automated annotation pipeline to create the Pixel-160K dataset, it enhances manipulation success rates by \(10.1\% \sim 28.7\%\) over OpenVLA at only \(1.5\%\) of the pre-training cost.
- Planning with an Embodied Learnable Memory
-
This paper proposes EPM (Embodied Perception Memory)—a learnable memory that uses a single VLM to maintain a "textual scene representation" through add/delete/update operations from first-person observations. Combined with "human demonstration imitation + Dynamic Difficulty-Aware Fine-Tuning (DDAFT)", the LLM planner achieves up to a 55% success rate improvement over strong baselines on long-horizon mobile manipulation tasks in dynamic home environments within PARTNR.
- Policy Contrastive Decoding for Robotic Foundation Models
-
Addressing the issue where generalist robot policies tend to form spurious correlations between irrelevant features (like background/texture) and actions, this paper proposes Policy Contrastive Decoding (PCD). PCD is a training-free, plug-and-play method that performs contrastive decoding between action distributions derived from "original observations" and "object-removed observations." This forces the policy's attention back onto the target object. It is effective for both autoregressive (OpenVLA) and diffusion-based (Octo, \(\pi_0\)) policies, achieving performance gains of up to 50.6% in simulation and 108% on real hardware.
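For a discrete action distribution, contrastive decoding can be sketched as boosting actions whose probability drops once the object is removed. The formula below is the common contrastive-decoding form \(p^{1+\alpha} / q^{\alpha}\) followed by renormalization. Using this exact form, and the `alpha` weight, are assumptions rather than details confirmed by this note.

```python
import math


def contrastive_decode(p_original, p_object_removed, alpha=1.0, eps=1e-12):
    """Return a renormalized distribution proportional to p**(1+alpha) / q**alpha."""
    scores = [
        (1.0 + alpha) * math.log(p + eps) - alpha * math.log(q + eps)
        for p, q in zip(p_original, p_object_removed)
    ]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

If an action stays likely even without the object in view, it is probably driven by background cues, so its weight is reduced relative to object-dependent actions.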
- Primary-Fine Decoupling for Action Generation in Robotic Imitation
-
PF-DAG decouples action generation in robotic imitation learning into a two-stage process: first selecting a coarse mode from discrete prototypes using a lightweight classifier, and then filling in continuous intra-modal details with a single-step MeanFlow generator. This approach avoids the precision loss of discretization while eliminating the mode bouncing issues typical of single-stage generative policies. It outperforms diffusion and flow baselines across 56 tasks in Adroit/DexArt/MetaWorld and real-world dexterous manipulation tasks with tactile feedback.
- RAVEN: End-to-end Equivariant Robot Learning with RGB Cameras
-
RAVEN treats each pixel patch of an RGB image as an oriented 3D ray. This allows for the construction of the first end-to-end SE(3) equivariant robot manipulation policy using only standard RGB cameras (without requiring point clouds, depth, or fixed top-down views). It significantly outperforms strong baselines like Diffusion Policy in MimicGen / DexMimicGen simulations and real-world experiments, while training 1.6× faster than existing equivariant methods.
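The mapping from a pixel to a 3D ray can be sketched with the standard pinhole camera model. This is only the textbook back-projection, assuming known intrinsics `fx`, `fy`, `cx`, `cy`. How RAVEN builds and orients its patch rays is not specified in this note.

```python
import math


def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Unit ray direction in the camera frame for pixel (u, v), pinhole model."""
    d = ((u - cx) / fx, (v - cy) / fy, 1.0)
    n = math.sqrt(sum(c * c for c in d))
    return tuple(c / n for c in d)


# The principal point maps to the optical axis.
assert pixel_to_ray(320, 240, 500, 500, 320, 240) == (0.0, 0.0, 1.0)
```

Because a ray is a geometric object, rotating the camera rotates the rays with it, which is the kind of structure an SE(3)-equivariant policy can exploit.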
- Real-Time Robot Execution with Masked Action Chunking
-
REMAC addresses intra-chunk inconsistency and inter-chunk discontinuity under asynchronous inference through a masked action chunking training strategy and a prefix-preserving sampling pipeline, achieving more reliable real-time robot control without introducing hardware-dependent inference latency.
- ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures
-
ReCAPA decomposes long-horizon trajectories of embodied agents into three levels: "Action—Subgoal—Trajectory." It utilizes low-level predictions of high-level semantics to backpropagate correction signals. Combined with Sinkhorn global alignment and Score-field local alignment, it suppresses deviations during the training phase before they accumulate into cascading failures. ReCAPA achieves higher success rates than strong LLM/LMM baselines on AI2-THOR, MineDojo, and VisualAgentBench.
- REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning?
-
This is the first systematic study of how Referring Expressions (RE) in vague human instructions affect LLM-based robotic task planning. It constructs the REI-Bench benchmark, modeling 9 levels of coreference ambiguity (3 levels of RE difficulty \(\times\) 3 levels of context), and finds that implicit REs can cause existing planners' success rates to drop by up to 36.9%. The proposed Task-Oriented Context Cognition (TOCC) method decouples task understanding from planning decisions, yielding an average success rate improvement of 6.5%.
- Remotely Detectable Robot Policy Watermarking
-
Addressing the realistic scenario where robot policy ownership can only be verified through remote observations (e.g., video, motion capture), this paper proposes CoNoCo. It replaces the white noise originally used for exploration in Reinforcement Learning (RL) with "colored noise" hidden in a secret frequency band. This watermark is then detected using spectral coherence, which is insensitive to system dynamics. The method achieves policy attribution on both simulated and real robots without compromising performance or requiring access to internal states.
- Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning
-
This paper theoretically analyzes the impact of inter-policy diversity on learning efficiency in ensemble policy gradient methods and proposes Coupled Policy Optimization (CPO). By regulating diversity through KL divergence constraints, CPO achieves efficient and stable exploration in large-scale parallel environments.
- RF-MatID: Dataset and Benchmark for Radio Frequency Material Identification
-
This work constructs RF-MatID, the first open-source, large-scale, wideband (4–43.5 GHz), and geometrically diverse RF material identification dataset, containing 16 fine-grained material categories (5 superclasses) and 142K samples. It also establishes a systematic benchmark covering 9 deep learning models, 5 frequency protocols, and 7 data splits.
- RFS: Reinforcement Learning with Residual Flow Steering for Dexterous Manipulation
-
RFS unifies "residual reinforcement learning" and "diffusion/flow steering" into a single policy modulation framework. For a pre-trained flow matching policy, it simultaneously learns a latent space noise distribution (for global exploration) and a residual action correction (for local refinement). Without modifying the base policy parameters, it enables efficient fine-tuning, increasing the average success rate from 0.25 (base policy) to 0.87 in simulation and real-world dexterous manipulation.
- RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots
-
RoboCasa365 constructs a large-scale simulation benchmark consisting of 365 daily kitchen tasks, 2,500 diverse kitchen scenes, and over 2,000 hours of robot interaction data. It systematically evaluates the performance of generalist robot policies under three paradigms—multi-task learning, foundation model training, and lifelong learning—finding that the task diversity in pre-training data is a key factor in improving downstream generalization.
- RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation
-
The RoboInter manipulation suite is proposed as a unified resource for intermediate representation data, benchmarks, and models. It includes RoboInter-Tool (a semi-automatic annotation GUI), RoboInter-Data (dense frame-by-frame annotations for 230,000 episodes across 571 scenes with 10+ types of intermediate representations), RoboInter-VQA (a benchmark featuring 29 types of embodied VQA tasks), and RoboInter-VLA (a plan-then-execute framework supporting modular and end-to-end configurations). This suite provides a comprehensive infrastructure to enhance VLA generalization through intermediate representations.
- RoboMD: Uncovering Robot Vulnerabilities through Semantic Potential Fields
-
A standalone deep RL "diagnostic policy" \(\pi_{MD}\) is trained to search a continuous vision-language embedding space learned from limited success/failure data. By treating this space as a "potential field" that drifts toward failure regions and away from success regions, the framework predicts where a robot manipulation policy \(\pi_R\) will fail under environmental changes without extensive real-world trials—uncovering up to 23% more unique vulnerabilities than SOTA vision-language baselines.
- RoboOmni: Proactive Robot Manipulation in Omni-modal Context
-
RoboOmni integrates speech, environmental sounds, visual observations, and robot actions into a unified omni-modal LLM framework, enabling robots to proactively infer user intentions from implicit household contexts, provide vocal confirmation, and execute 7-DoF manipulation actions.
- RoboPARA: Dual-Arm Robot Planning with Parallel Allocation and Recomposition Across Tasks
-
RoboPARA optimizes task parallelism for dual-arm robots through a two-stage process of dependency graph construction and graph re-traversal. It achieves a 30-50% reduction in execution time and a 34% improvement in success rate across multi-scenario benchmarks compared to existing methods.
- RobotArena ∞: Scalable Robot Benchmarking via Real-to-Sim Translation
-
This paper proposes RobotArena ∞, a scalable evaluation framework that automatically translates real robot demonstration videos into simulation digital twins. It deploys VLA policies within these simulations and uses a dual-track scoring system (VLM progress scores + crowdsourced human pairwise preferences). Based on over 8,500 preference pairs, it compares 6 VLAs from global laboratories, revealing that current policies exhibit weak cross-dataset generalization and high sensitivity to perturbations.
- Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
-
RIGVid enables robots to perform manipulation tasks such as pouring water and sweeping trash using only "AI-generated videos." Given a language instruction and a scene image, the method uses a video diffusion model to generate demonstration videos, filters failed generations with a VLM, tracks 6D pose trajectories of objects from the video, and retargets them for execution by a robotic arm. This process requires no real demonstrations or robot training data, achieving performance comparable to real human demonstration videos.
- Robust Finetuning of Vision-Language-Action Robot Policies via Parameter Merging
-
To address the issues of generalization loss and overfitting when few-shot finetuning generalist robot policies, this paper proposes RETAIN. It performs linear interpolation between the pre-trained and finetuned policies directly in parameter space. With no additional training or inference overhead, it enables a single policy to robustly complete various out-of-distribution (OOD) variants of new skills while retaining pre-trained general capabilities. The average OOD success rate on real robots is approximately 40% higher than the previous best methods.
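Linear interpolation in parameter space is simple enough to show directly. The sketch below stores parameters as plain lists of floats keyed by name. The function name `merge_parameters` and the mixing weight `alpha` are illustrative, and how RETAIN chooses the weight is not covered here.

```python
def merge_parameters(pretrained, finetuned, alpha):
    """Return (1 - alpha) * pretrained + alpha * finetuned for every parameter."""
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both checkpoints must contain the same parameter names")
    return {
        name: [
            (1.0 - alpha) * p + alpha * f
            for p, f in zip(pretrained[name], finetuned[name])
        ]
        for name in pretrained
    }


pre = {"layer.weight": [1.0, 2.0]}
fin = {"layer.weight": [3.0, 6.0]}
assert merge_parameters(pre, fin, 0.5) == {"layer.weight": [2.0, 4.0]}
```

Since the merge happens once, offline, the resulting policy has the same size and inference cost as either endpoint.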
- Rodrigues Network for Learning Robot Actions
-
This paper transforms the classical Rodrigues' rotation formula into a learnable Neural Rodrigues Operator and constructs RodriNet, an architecture that explicitly encodes joint kinematic structures. RodriNet significantly outperforms general backbones like MLP, GCN, and Transformer across four categories of tasks: forward kinematics fitting, motion prediction, robot arm imitation learning, and single-image hand reconstruction.
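For reference, the classical Rodrigues' rotation formula rotates a vector \(v\) about a unit axis \(k\) by angle \(\theta\): \(v_{rot} = v\cos\theta + (k \times v)\sin\theta + k\,(k \cdot v)(1 - \cos\theta)\). The sketch below implements only this classical formula, not the learnable Neural Rodrigues Operator.

```python
import math


def rodrigues_rotate(v, axis, theta):
    """Rotate 3D vector v about `axis` by `theta` radians (classical formula)."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]  # normalize the rotation axis
    cross = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    dot = sum(ki * vi for ki, vi in zip(k, v))
    c, s = math.cos(theta), math.sin(theta)
    return [v[i] * c + cross[i] * s + k[i] * dot * (1.0 - c) for i in range(3)]
```

Rotating the x-axis by 90 degrees about the z-axis yields the y-axis, up to floating-point error.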
- RRNCO: Towards Real-World Routing with Neural Combinatorial Optimization
-
This paper proposes the RRNCO architecture, which jointly models asymmetric distance, duration, and orientation angles in a deep routing framework for the first time through two innovations: Adaptive Node Embedding (ANE) and Neural Adaptive Bias (NAB). A VRP benchmark dataset based on 100 real-world cities is constructed, significantly narrowing the sim-to-real gap for NCO methods.
- SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation
-
For long-horizon, contact-rich deformable object manipulation (e.g., T-shirt folding), this paper proposes SARM—replacing "frame-index progress labels" with "semantically aligned progress labels" via natural language sub-task annotations. It trains a dual-estimator reward model for "stage estimation + sub-task progress estimation," which drives Reward-Aligned Behavior Cloning (RA-BC) to perform soft filtering and re-weighting of demonstrations. This approach increases the T-shirt folding success rate on real robots from 8%/0% (vanilla BC) to 83%/67%.
- Scalable Exploration for High-Dimensional Continuous Control via Value-Guided Flow
-
This paper proposes Qflex (Q-guided Flow Exploration), an RL method for scalable exploration in high-dimensional continuous action spaces. Actions are transported from a learnable source distribution along a probability flow induced by the Q-function, so exploration is aligned with task-relevant gradients rather than isotropic noise. Qflex outperforms Gaussian and Diffusion RL baselines on various high-dimensional benchmarks and successfully controls a full-body human musculoskeletal model with 700 actuators for agile and complex movements.
- Scaling up Memory for Robotic Control via Experience Retrieval
-
MemER offloads the task of "remembering the past" in long-horizon robotic tasks to a high-level VLM. The VLM nominates task-relevant keyframes from recent observations, compresses them into stable visual memory via lightweight temporal clustering, and assigns the current sub-task to a low-level VLA for execution, achieving performance close to human high-level strategies across three types of real-world long-horizon manipulation tasks.
- Self-Improving Vision-Language-Action Models with Data Generation via Residual RL
-
This paper proposes PLD (Probe-Learn-Distill), a three-stage post-training framework: it freezes the VLA backbone, uses lightweight residual RL to "take over" and train experts on states where the base policy fails, collects distribution-aligned recovery data via hybrid rollouts ("base policy first, then residual expert"), and finally distills this knowledge back into the base model using standard SFT. Without any additional human demonstrations, it approaches a 99% success rate on LIBERO, achieves over 50% improvement in SimplerEnv, and attains 100% success on real-world Franka/YAM tasks with 1 hour of continuous autonomous operation.
- Self-Refining Vision Language Model for Robotic Failure Detection and Reasoning
-
ARMOR decomposes robotic failure understanding into two collaborative tasks: binary detection and natural language explanation. By utilizing multi-round self-refinement, a hybrid of sparse/dense label training, and entropy-based trajectory selection, it simultaneously improves failure detection accuracy and explanation quality on both simulated and real-world warehouse robotic data.
- Sim2Real VLA: Zero-Shot Generalization of Synthesized Skills to Realistic Manipulation
-
Sim2Real-VLA employs a dual-system VLA architecture consisting of "high-level affordance chain planning + low-level tokenized action execution" to transfer manipulation skills generated purely in simulation to real robots in a zero-shot manner, significantly narrowing the Sim2Real gap in bimanual, dexterous, and long-horizon tasks.
- SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
-
SimpleVLA-RL adapts outcome-driven online RL from the LLM domain into a closed-loop robotic training framework suitable for Vision-Language-Action (VLA) models. By utilizing interactive trajectory sampling, binary success rewards, and exploration-enhanced GRPO, it significantly improves data efficiency, generalization, and success rates for long-horizon manipulation across LIBERO, RoboTwin, and real-world robotic tasks.
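A minimal sketch of the group-relative advantage used in standard GRPO, applied to binary success rewards. It assumes the common normalization \((r_i - \text{mean}) / (\text{std} + \epsilon)\) with a population standard deviation; the paper's exploration enhancements are not reproduced, and the function name and `eps` value are illustrative.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts of the same task: 1.0 = success, 0.0 = failure.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Under this normalization, a group in which every rollout succeeds (or every rollout fails) yields all-zero advantages and therefore no learning signal, which is one reason exploration matters with binary rewards.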
- SLAP: Shortcut Learning for Abstract Planning
-
SLAP automatically learns a set of "shortcut options" (e.g., a "slap" that pushes aside an obstacle tower) using model-free RL on an abstract planning graph induced by existing TAMP skills (pick/place/move). During evaluation, the planner treats these shortcuts as new edges to search for shorter paths, reducing execution length by over 50% in four simulated robotic environments while surpassing the success rates of both pure planning and pure RL.
- Sparse Imagination for Efficient Visual World Model Planning
-
This paper proposes Sparse Imagination, which achieves significant inference acceleration (reducing planning time by ~50% at a 50% dropout rate) in ViT patch token-based world model planning through random token dropout and random grouped attention training. A key finding is that simple random dropout outperforms complex token selection methods because static importance ranking suffers from a "blind spot problem" in dynamic planning scenarios.
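The random token dropout step can be sketched as below. This is a simplified illustration that only shows uniform random selection of patch tokens; the world model, the grouped attention training, and the planner are omitted, and the names `drop_tokens` and `drop_rate` are assumptions.

```python
import random


def drop_tokens(tokens, drop_rate, rng):
    """Keep a uniformly random subset of tokens, preserving their order."""
    keep = max(1, round(len(tokens) * (1.0 - drop_rate)))
    kept_idx = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_idx]


# 16 placeholder patch tokens; a 50% dropout rate keeps 8 of them.
rng = random.Random(0)
patch_tokens = list(range(16))
sparse_tokens = drop_tokens(patch_tokens, drop_rate=0.5, rng=rng)
```

Resampling the kept subset at every planning step means no patch is permanently excluded, which is the intuition behind avoiding the "blind spot" of a fixed importance ranking.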
- Spatial Forcing: Implicit Spatial Representation Alignment for Vision-language-action Model
-
Spatial Forcing utilizes geometric latents from a pre-trained 3D foundation model (VGGT) to supervise the intermediate visual tokens of a VLA. This enables robotic policies to acquire stronger spatial understanding without requiring additional depth maps or point clouds during inference, leading to improved success rates, convergence speed, and data efficiency on LIBERO, RoboTwin, and real-robot tasks.
- Spatially Guided Training for Vision-Language-Action Model
-
ST4VLA significantly mitigates the issues of "seeing but not moving" or "forgetting how to see after learning to move" in VLA training by first teaching the VLM spatial priors such as points, boxes, and trajectories, and then injecting these priors as implicit planning conditions into a DiT action expert via spatial prompts during the action post-training phase. It achieves stronger generalization in SimplerEnv, LIBERO, large-scale simulated pick-and-place, and real-world long-horizon robotic tasks.
- SpikePingpong: Spike Vision-based Fast-Slow Pingpong Robot System
-
SpikePingpong integrates the high-frequency vision of a spike camera into a "fast-slow dual-system" perception framework. System 1 utilizes a standard RGB-D camera combined with a physical model for rapid landing-point prediction, while System 2 employs a spike camera to train a neural calibrator that corrects the physical model's errors. Combined with the IMPACT module, which uses imitation learning to control the return zone, the system achieves a return hit rate of 92% in a 30 cm area and 70% in a 20 cm area on a real ABB robotic arm, significantly exceeding human averages.
- Statistical Guarantees for Offline Domain Randomization
-
This work formalizes Offline Domain Randomization (ODR) as a Maximum Likelihood Estimation (MLE) problem over a parameterized family of simulators. Under mild regularity and identifiability assumptions, it proves weak consistency (convergence in probability) and, by adding a uniform Lipschitz continuity assumption, establishes strong consistency (almost sure convergence), providing the first theoretical foundation for the empirical success of ODR in sim-to-real transfer.
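One way to write the estimator and the two consistency notions, under assumed notation that is not taken from the paper: \(\mathcal{D}_N = \{(s_i, a_i, s'_i)\}_{i=1}^{N}\) is the offline dataset, \(\{p_\xi : \xi \in \Xi\}\) is the simulator family, and \(\xi^\star\) is the true parameter.

\[
\hat{\xi}_N \in \arg\max_{\xi \in \Xi} \; \frac{1}{N} \sum_{i=1}^{N} \log p_{\xi}\!\left(s'_i \mid s_i, a_i\right)
\]

\[
\text{weak consistency: } \hat{\xi}_N \xrightarrow{\;p\;} \xi^\star,
\qquad
\text{strong consistency: } \hat{\xi}_N \xrightarrow{\;\text{a.s.}\;} \xi^\star
\quad \text{as } N \to \infty .
\]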
- TaCo: A Benchmark for Lossless and Lossy Codecs of Heterogeneous Tactile Data
-
This paper proposes TaCo—the first comprehensive benchmark for tactile data codecs. It systematically evaluates lossless and lossy compression across 5 heterogeneous tactile datasets, 30 codecs, and 4 types of downstream tasks. The authors train TaCo-LL (lossless) and TaCo-L (lossy), the first codecs driven purely by tactile data, which achieve new SOTA results across all tasks.
- Test-Time Mixture of World Models for Embodied Agents in Dynamic Environments
- Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?
-
The Theory of Space framework is proposed to systematically evaluate the ability of foundation models to construct and revise spatial beliefs through active exploration in both textual and visual environments. Utilizing cognitive map probing and the False Belief paradigm, the study reveals critical failure modes in current SOTA models, including the active-passive performance gap, exploration inefficiency, and belief inertia.
- Time Optimal Execution of Action Chunk Policies Beyond Demonstration Speed
-
Addressing the issue where imitation learning (including VLA) execution speed is bottlenecked by demonstration speed, this paper proposes RACE: it redefines "actions" as desired states, performs reachability-aware time-optimal re-timing for each action chunk, and utilizes test-time search to select the smoothest and most controllable future chunks. It doubles execution speed compared to demonstrations and quadruples it compared to original policies without sacrificing success rates.
- Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control
-
LIFT proposes a three-stage pre-training and fine-tuning framework: (i) large-scale parallel SAC pre-training to achieve zero-shot deployment; (ii) offline pre-training of a physics-informed world model based on Lagrangian dynamics; (iii) efficient fine-tuning with deterministic action execution and stochastic exploration within the world model. The full pipeline from simulation to the real world was verified on Booster T1 and Unitree G1 humanoid robots.
- TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models
-
TPRU constructs a large-scale multi-image temporal understanding dataset (24,750 QA pairs, 126,000 images) covering 3 complementary tasks (Temporal Ordering, Next-Frame Prediction, and Previous-Frame Review) across 4 embodied scenarios. Through reinforcement learning fine-tuning, the 7B model surpasses GPT-4o in temporal understanding.
- Translating Flow to Policy via Hindsight Online Imitation
-
HinFlow enables robots to interact with their environment guided by high-level point flow planners. By relabeling the actual flows achieved in each rollout as the intended goals, it provides supervision for training goal-conditioned imitation policies online. This approach achieves an 84% success rate with only 1–5 expert demonstrations, outperforming the strongest baseline by \(1.45\times\).
- TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models
-
TwinVLA introduces a modular framework that combines two pre-trained single-arm VLAs into a bimanual VLA using joint attention and MoE. It achieves performance levels comparable to \(\pi_0\) (which utilizes 10,900h of private data and 1,000+ GPU-days) while requiring only ~800h of public single-arm data, 50 bimanual fine-tuning episodes, and 25 H100 GPU-days.
- Uncertainty-Aware Gaussian Map for Vision-Language Navigation
-
This paper explicitly models the phenomenon of "perceptual ambiguity" for Vision-Language Navigation (VLN) agents. By estimating three types of perceptual uncertainties—geometric, semantic, and appearance—on a differentiable Semantic Gaussian Map (SGM), the authors package them into a unified 3D Value Map. This map is fed into a decision network, preventing the agent from making blind guesses when evidence is insufficient, thereby consistently outperforming State-of-the-Art (SOTA) methods on the R2R, RxR, and REVERIE benchmarks.
- Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process
-
UD-VLA integrates "visual instruction understanding → future scene generation → action inference" into a single joint discrete denoising trajectory (JD3P). This allows action tokens to iteratively refine themselves by "attending to" increasingly clear future image tokens during each denoising step. It achieves SOTA performance on CALVIN, LIBERO, and SimplerEnv while reaching an inference speed 4x faster than autoregressive methods.
- UniVLA: Unified Vision-Language-Action Model
-
UniVLA discretizes vision, language, and action into tokens within a shared vocabulary, modeling interleaved observation-action sequences with a single autoregressive Transformer. By introducing a "world model" objective for post-training on 620,000 action-free robot videos before fine-tuning, it sets new SOTA records across CALVIN, LIBERO, and SimplerEnv-Bridge (e.g., 95.5% average on LIBERO, surpassing π0-FAST's 85.5%).
- UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos
-
UrbanVerse is a data-driven real-to-sim system that transforms crowdsourced city-tour videos into physically-aware interactive simulation scenes. This system comprises a library of 100K+ annotated 3D assets and an automated scene construction pipeline. It generates 160 high-quality scenes in IsaacSim, where trained PPO navigation policies achieve an 89.7% success rate in zero-shot real-world transfer, completing 337m long-distance tasks with only two human interventions.
- VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing
-
VER distills multiple vision foundation models (DINOv2 / ViT / CLIP) into an MoE-style "Vision Expert Library." For downstream robot tasks, only a lightweight router (less than 0.4% parameters) is fine-tuned to dynamically select task-relevant experts per patch. Combined with curriculum Top-K annealing to prevent early routing collapse, VER achieves SOTA performance across 17 robot tasks and various policy heads.
- Verifier-Free Test-Time Sampling for Vision-Language-Action Models
-
This paper proposes MG-Select: a VLA test-time scaling framework that requires no external verifier and no additional training modules. It samples \(N\) candidate actions in parallel and uses the KL divergence between the prediction distribution and a "reference distribution generated by the model itself after masking part of the input conditions" as a confidence measure for Best-of-N selection. It significantly improves the success rate of base VLAs in simulation and real-world pick-and-place tasks (a 168% relative improvement with 30 demonstration samples on RoboCasa).
- villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
-
villa-X introduces two upgrades to "latent action" modeling: grounding latent actions to the robot's physical state via a proprioceptive forward dynamics model (proprio-FDM), and feeding latent actions to low-level control through joint diffusion of "latent experts + robot experts." The model achieves SOTA performance in SIMPLER simulations and on two real-world platforms (gripper + dexterous hand), demonstrating zero-shot transfer to unseen embodiments and open-vocabulary symbols.
- ViPRA: Video Prediction for Robot Actions
-
ViPRA transforms a video prediction model into a robot policy: it first learns motion-centric discrete latent actions from massive "unlabeled" human/robot videos via self-supervision, then uses a video-language model to jointly predict "future frames + latent action sequences" for pre-training. Finally, a chunked flow matching decoder maps latent actions to continuous actions of specific robots. With only 100–200 teleoperated demonstrations, it achieves smooth high-frequency control up to 22 Hz, outperforming the strongest baseline by 16% on SIMPLER and 13% on real-world tasks.
- Virtual Community: An Open World for Humans, Robots, and Society
-
This paper constructs Virtual Community—an embodied multi-agent simulation platform based on the Genesis physics engine. It automatically generates open-world scenes and agent societies using real geospatial data, allowing humanoid avatars and various robots to coexist and interact within the same physical world. The platform includes two benchmarks, the "Community Planning Challenge" and the "Community Robot Challenge," to evaluate high-level multi-agent planning and low-level physical coordination.
- Vision-Language-Action Instruction Tuning: From Understanding to Manipulation
-
InstructVLA proposes the "Vision-Language-Action Instruction Tuning (VLA-IT)" paradigm, which utilizes a single VLM to simultaneously perform multimodal reasoning and latent action planning. These are then handed over to a flow-matching action expert for decoding. Through Mixture of Experts (MoE) adaptation, the model preserves the VLM's multimodal capabilities during action training, allowing reasoning to directly enhance manipulation—achieving a 33% improvement over SpatialVLA on SimplerEnv and a 96% improvement over a fine-tuned OpenVLA on the new SimplerEnv-Instruct benchmark.
- Visual Planning: Let's Think Only with Images
-
This paper proposes Visual Planning—the first pure visual reasoning paradigm: the planning process is entirely expressed by image sequences (without text mediation), using a Large Vision Model to autoregressively generate step-by-step state images. It introduces the VPRL two-stage RL framework (random trajectory initialization for exploration + GRPO with progress reward optimization), achieving an average EM that exceeds text-based reasoning methods by 27% on FrozenLake, Maze, and MiniBehavior navigation tasks. This demonstrates that for "vision-first" tasks, visual reasoning is significantly superior to text reasoning.
- VITA: Vision-to-Action Flow Matching Policy
-
VITA replaces the source distribution of the Flow Matching policy from Gaussian noise with the visual representation itself, allowing the flow to "stream directly from vision to action." This eliminates the need for per-step visual conditioning during denoising. On 14 tasks such as ALOHA and Robomimic, it achieves \(1.5\times–2\times\) faster inference and \(18.6\%–28.7\%\) reduction in VRAM, while reaching or exceeding SOTA success rates.
- VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision–Language Models
-
VITA uses a frozen contrastive VLM (CLIP) as a backbone for goal-conditioned value functions and performs frame-wise gradient updates on a lightweight adaptive module at inference time. The update rule itself is a self-supervised loss derived via meta-learning, which implicitly encodes trajectory history into parameters. This allows a value function trained only in a single environment to generalize zero-shot to entirely new tasks, environments, and robot embodiments, outperforming the state-of-the-art autoregressive VLM-based method GVL.
- Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
-
This paper constructs Vlaser (based on InternVL3, 2B/8B versions), a vision-language-action model using the self-developed Vlaser-6M dataset to integrate "high-level embodied reasoning" and "low-level robot control" into a single backbone. It systematically addresses a long-ignored question: which type of pre-training data is most useful for downstream VLA policy learning? The conclusion is that "higher scores on online reasoning benchmarks do not equal improved downstream manipulation performance; what truly works is in-domain data within the same observation domain as the robot hardware."
- VLBiMan: Vision-Language Anchored One-Shot Demonstration Enables Generalizable Bimanual Robotic Manipulation
-
The VLBiMan framework is proposed to decompose a single demonstration into invariant and adaptive atomic skills through task-aware bimanual decomposition. It utilizes VLM vision-language anchoring to adapt to object positions and instance variations in new scenes, combined with kinematic-aware trajectory composition for bimanual coordination. On 10 complex bimanual tasks, it achieves an 85.3% success rate with only 1 demonstration, significantly outperforming imitation learning baselines that require hundreds of demonstrations.
- VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
-
This paper establishes a minimalist adaptation pipeline (VLM4VLA) that adds \(<1\%\) parameters to fairly convert 17 general VLMs into VLA policies. It systematically investigates whether "VLM strength determines VLA performance," concluding that while VLM pre-training is necessary, neither general capabilities nor specialized embodied capabilities reliably predict downstream control performance; the true bottleneck lies in the visual encoder.
- VLMgineer: Vision-Language Models as Robotic Toolsmiths
-
VLMgineer integrates the vision-language understanding, code generation, and commonsense priors of VLMs into an evolutionary search loop to automatically co-design URDF tools and discrete action trajectories for robotic tasks. It demonstrates superior task completion capabilities over human prompts, naive sampling, and off-the-shelf tools across 12 tool-use tasks, evolutionary ablations, and real-world Franka robot validations.
- When a Robot is More Capable than a Human: Learning from Constrained Demonstrators
-
This paper defines the Learning from Constrained Demonstrations (LfCD) problem and proposes LfCD-GRIP to learn state-only goal-proximity rewards from constrained human demonstrations. By using confidence anchors and trajectory interpolation to propagate rewards to states outside the demonstrations, it enables the robot to leverage its larger action space to generate shorter and faster trajectories than the demonstrator.
- When would Vision-Proprioception Policies Fail in Robotic Manipulation?
-
This paper reveals why vision-proprioception manipulation policies fail during motion-transition phases—proprioception signals dominate optimization and suppress vision learning. It proposes the Gradient Adjustment with Phase-guidance (GAP) algorithm, which adaptively reduces proprioception gradients to restore vision modality learning, significantly enhancing policy generalization in both simulated and real-world environments.
- WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control
-
WholeBodyVLA enables bipedal humanoid robots to perform end-to-end "move-and-manipulate" tasks in large spaces for the first time. By utilizing two separately trained Latent Action Models (LAMs) to learn locomotion and manipulation priors from massive "action-unlabeled" egocentric human videos, combined with a discrete-command RL low-level controller tailored for loco-manipulation, it achieves a 21.3% higher average success rate than previous baselines on AgiBot X2.
- WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
-
WMPO migrates the entire reinforcement learning process of VLA policies into a pixel-space action-conditioned video world model for "dreaming." By using the world model to imagine complete trajectories, judging success with a lightweight reward model, and running on-policy GRPO, it significantly improves sample efficiency without physical interaction and enables the emergence of self-correction behaviors.
- World-In-World: World Models in a Closed-Loop World
-
This paper proposes World-In-World, the first open platform to evaluate generative world models in a closed-loop embodied environment. It utilizes a unified "Propose-Simulate-Revise" online planning strategy and a unified Action API to integrate various heterogeneous world models. Using task success rate rather than visual quality as the primary metric, the study reveals three counter-intuitive findings: high visual quality does not equate to task success (controllability is more critical), post-training with action-observation data is more effective than switching to stronger pre-trained video generators, and increasing inference-time compute significantly boosts closed-loop performance.
- WorldGym: World Model as an Environment for Policy Evaluation
-
This paper trains WorldGym, an action-conditioned autoregressive video world model, as a "virtual environment." Robot policies perform rollouts within this model, and a VLM is used for scoring to estimate policy success rates before real-world deployment. Experiments demonstrate that the success rates in the world model are highly correlated with real-world success rates (Pearson \(r=0.78\)) and maintain consistent relative rankings across different policy versions, scales, and training steps.
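The correlation check described above can be sketched with a plain Pearson coefficient. The success rates below are made-up placeholder values for illustration only, not results from the paper, and the function name is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-policy success rates (placeholders, not paper data).
world_model_success = [0.2, 0.4, 0.6, 0.8]
real_world_success = [0.1, 0.5, 0.5, 0.9]
r = pearson_r(world_model_success, real_world_success)
```

A high \(r\) says the two sets of success rates move together linearly; checking that policy rankings are preserved is a separate question that a rank correlation would answer more directly.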
- X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model
-
X-VLA encodes hardware and collection variances from each robot data source into a set of learnable soft prompts. Combined with a concise Transformer + flow matching action generation framework, it achieves robust cross-embodiment adaptation after pre-training on large-scale heterogeneous robot data.