🤖 Robotics & Embodied AI
🧪 ICML2026 · 12 paper notes
📌 Same area in other venues: 💬 ACL2026 (1) · 📷 CVPR2026 (37) · 🔬 ICLR2026 (47) · 🤖 AAAI2026 (37) · 🧠 NeurIPS2025 (53) · 📹 ICCV2025 (26)
🔥 Top topics: Robotics ×3 · Multimodal/VLM ×3 · Navigation ×3 · Reasoning ×2 · Agents ×2
- Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation
For zero-shot transfer from training tasks to novel manipulation tasks, the authors decompose demonstrations into "atomic skill-action pairs" as an intermediate representation. A dual-library scheme then supplies the LLM with in-context demonstrations that cover the required skills: a dynamic library retrieves by visual and planning similarity, while a static library uses IDF-weighted tokens to supplement missing skills. This upgrades "trajectory imitation" to "compositional skill reasoning."
- Drift is a Sampling Error: SNR-Aware Power Distributions for Long-Horizon Robotic Planning
This paper proposes CAPS, which reinterprets "instruction drift" as a systematic sampling error and uses SNR (\(=\log|\mathcal{A}|-\mathcal{H}\)) as a metacognitive switch. Only on entering high-entropy "Pivotal Windows" does it trigger Metropolis-Hastings iterative refinement toward the power distribution \(\pi\propto p^\alpha\). On RoboTwin, Simpler-WindowX, and Libero-long, it surpasses OpenVLA and TACO without any additional training.
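As a rough illustration of the two ingredients named in this summary, the sketch below computes the SNR as \(\log|\mathcal{A}|-\mathcal{H}\) for a discrete next-action distribution and runs Metropolis-Hastings toward \(\pi\propto p^\alpha\) only when the SNR is low. It is not the paper's implementation: the function names, the `snr_threshold`, `alpha`, and `steps` values, the choice of \(p\) itself as the proposal, and the plain-sampling fallback are all assumptions made for this example.

```python
import math
import random

def snr(probs):
    """SNR as written in the summary: log|A| minus the Shannon entropy of probs."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.log(len(probs)) - entropy

def mh_power_sample(probs, alpha, steps, rng):
    """Metropolis-Hastings targeting pi(a) proportional to p(a)**alpha."""
    actions = list(range(len(probs)))
    current = rng.choices(actions, weights=probs)[0]
    for _ in range(steps):
        # Assumption: independence proposal drawn from p itself.
        proposal = rng.choices(actions, weights=probs)[0]
        # With q = p, the ratio [pi(a') q(a)] / [pi(a) q(a')]
        # reduces to (p(a') / p(a)) ** (alpha - 1).
        ratio = (probs[proposal] / probs[current]) ** (alpha - 1.0)
        if rng.random() < min(1.0, ratio):
            current = proposal
    return current

def select_action(probs, rng, alpha=4.0, snr_threshold=0.5, steps=20):
    """Refine only in low-SNR (high-entropy) states; otherwise sample from p."""
    if snr(probs) < snr_threshold:
        return mh_power_sample(probs, alpha, steps, rng)
    # Assumption: outside pivotal windows the base policy is sampled as usual.
    return rng.choices(list(range(len(probs))), weights=probs)[0]
```

For example, `select_action([0.4, 0.3, 0.3], random.Random(0))` falls into the refinement branch because the distribution is nearly uniform, and the sharpened target puts more mass on the first action than \(p\) does.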
- Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models
This paper reformulates "vision-action attribution" as an intervention estimation problem and proposes two metrics: ISS (Interventional Saliency Score) and NMR (Nuisance Mass Ratio). Using Bernoulli masks, Gaussian-blur perturbation, and action MSE as a proxy for KL divergence, it quantifies which visual regions VLA policies actually rely on. NMR correlates strongly and negatively with OOD task success rate (\(r = -0.77\)), making it a cheap diagnostic for predicting VLA generalization.
- From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA employs a set of "modality-aware inverse dynamics models (IDMs)" pre-trained on large-scale robotics data to translate future frames predicted by a video generation model into three discrete latent actions: semantic, depth, and optical flow. The policy head then conditions control on these action-centric representations, yielding a robust and accurate "imagination-to-execution" interface on CALVIN, LIBERO, LIBERO-Plus, and real UR5e robots.
- HDFlow: Hierarchical Diffusion-Flow Planning for Long-horizon Tasks
HDFlow uses a diffusion model to generate sparse strategic subgoals and a rectified flow to generate dense trajectories, further incorporating energy guidance and manifold projection. This constructs a two-layer planner with a division of labor between slow and fast modules, boosting the success rate of long-horizon, sparse-reward tasks such as furniture assembly by 20–30 percentage points.
- Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models
LaRA-VLA internalizes both textual and visual CoT in VLA models as continuous latents. A three-stage training curriculum (explicit CoT → latent replacement → action expert adaptation) moves reasoning entirely into the latent space. Compared to explicit CoT, inference latency drops by up to 90%, restoring control frequency to real-time levels.
- Mitigating Error Accumulation in Continuous Navigation via Memory-Augmented Kalman Filtering
This paper reformulates step-by-step prediction in continuous UAV VLN as a closed loop of recursive Bayesian estimation: a GRU prior, a memory-bank likelihood, and a learnable Kalman gain. On TravelUAV, fine-tuning with only 10% of the data raises L1-Full SR from 17.6% to 25.9%, while position drift after 100 steps flattens to 30–40 meters.
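The prior/likelihood/gain structure follows the shape of a standard Kalman measurement update, sketched below for a single position coordinate. This is only the textbook closed-form update, not the paper's model: there the prior comes from a GRU, the observation from a memory bank, and the gain is learned, whereas `kalman_fuse` and its arguments are hypothetical names with the gain computed from the two variances.

```python
def kalman_fuse(prior_mean, prior_var, obs_mean, obs_var):
    """Fuse a predicted position (prior) with an observation-like estimate.

    Textbook scalar Kalman update; a learned gain would replace `gain` below.
    """
    # Gain near 1 trusts the observation, near 0 trusts the prior.
    gain = prior_var / (prior_var + obs_var)
    post_mean = prior_mean + gain * (obs_mean - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var, gain
```

With equal uncertainty on both sides, `kalman_fuse(0.0, 1.0, 2.0, 1.0)` returns the midpoint `1.0` with the variance halved to `0.5`. Applying such a correction at every step is what keeps prediction error from compounding over a long rollout.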
- Optimal and Scalable MAPF via Multi-Marginal Optimal Transport and Schrödinger Bridges
This paper proves that anonymous multi-agent path finding (MAPF) can be formulated as a Markovian Multi-Marginal Optimal Transport (MMOT) problem, compressing the original \(K^{T+1}\)-dimensional transport tensor into a polynomial-size LP (P1), with total unimodularity guaranteeing integer optimality. It then generalizes to the Schrödinger bridge, yielding a Sinkhorn-style entropic relaxation (P2) that produces a "shadow transport." Finally, pruning the shadow and solving an LP (P3) on it recovers integer solutions, achieving a 3.6×–7.1× speedup and a <10% cost gap at \(K^{1.15}\) complexity.
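To make the "Sinkhorn-style entropic relaxation, then prune" idea concrete, here is a minimal two-marginal Sinkhorn iteration followed by a thresholding step. It is a simplification under stated assumptions: the paper works with a Markovian multi-marginal problem over \(T+1\) time steps, while this sketch handles a single transition, and `epsilon`, `iterations`, `threshold`, and both function names are illustrative choices rather than the authors' settings.

```python
import math

def sinkhorn(cost, row_marginal, col_marginal, epsilon=0.1, iterations=500):
    """Entropic optimal transport between two discrete marginals."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: cheap moves get weight close to 1, costly ones close to 0.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_marginal[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_marginal[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def prune(plan, threshold):
    """Keep only the (source, target) pairs that carry non-negligible mass."""
    return [(i, j) for i, row in enumerate(plan) for j, mass in enumerate(row) if mass > threshold]

# Two agents, two cells: staying put costs 0, swapping costs 1.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
support = prune(plan, 0.01)
```

Here the soft plan concentrates almost all mass on the diagonal, so `support` keeps only the two stay-in-place moves. A much smaller exact problem can then be solved on the surviving pairs, which is the role the summary assigns to P3.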
- Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation
This paper proposes SAGE, which automatically synthesizes large-scale navigation tasks and IF-THEN experience rules in a physics-constrained semantic sandbox, then distills these experiences into a VLM policy using hybrid prompt sampling and GRPO with asymmetric adaptive clipping. This boosts the LLM-Match success rate on A-EQA from 43.5% to 53.2% (2B) / 60.2% (4B) and enables transfer to real indoor robots.
- Plug-and-Play Label Map Diffusion for Universal Goal-Oriented Navigation
This paper proposes PLMD, which merges BEV semantic and obstacle maps into a Label Map and uses a DDPM to complete the semantic and obstacle labels of unexplored regions under obstacle priors. It serves as a plug-and-play module for any GON policy and consistently sets new SOTA on the ON / IIN / MRON tasks across HM3D/MP3D.
- Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
To address the tendency of VLA (vision-language-action) models to collapse under minor perturbations, this work proposes a video transfer pipeline (extract semantic/geometric conditions → rewrite caption → conditional video diffusion re-rendering) that injects visual and environmental diversity into simulation data. In addition, three-stage velocity caching cuts generation time by 61%, and coreset sampling driven by difficulty and diversity selects only 10% of key trajectories. On RoboTwin 2.0, LIBERO-Plus, and real robots, RDT-1B / \(\pi_0\) improve by 5–15%.
- STEP: Warm-Started Visuomotor Policies with Spatiotemporal Consistency Prediction
STEP attaches a lightweight "previous action history + current observation → next action" Transformer predictor to a diffusion policy and uses its output as a denoising warm-start. This compresses 100 denoising steps to just 2. It also adds a defense against execution deadlock: if the action change is too small, a small amount of noise is injected. Across 9 simulation and 2 real-world tasks, STEP outperforms BRIDGER / DDIM by 21.6% / 27.5% in average success rate.
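The deadlock defense described here can be pictured as a guard on the warm-start action, sketched below. Everything specific in it is an assumption for illustration: the summary does not say how "action change" is measured or how much noise is added, so the max-absolute-difference test, the `min_change` and `noise_std` defaults, and the function name `warm_start` are hypothetical.

```python
import random

def warm_start(predicted_action, previous_action, rng, min_change=1e-3, noise_std=0.05):
    """Return the initial sample handed to the few-step denoiser.

    If the predictor barely moves relative to the last executed action,
    perturb it so the policy cannot settle into repeating the same action.
    """
    # Assumption: "action change" is the largest per-dimension difference.
    change = max(abs(p - q) for p, q in zip(predicted_action, previous_action))
    if change < min_change:
        return [p + rng.gauss(0.0, noise_std) for p in predicted_action]
    return list(predicted_action)
```

A predictor output that differs clearly from the previous action passes through unchanged, while a near-identical one comes back slightly perturbed, giving the remaining denoising steps something to correct.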