🤖 Robotics & Embodied AI
💬 ACL2026 · 11 paper notes
📌 Same area in other venues: 📷 CVPR2026 (130) · 🔬 ICLR2026 (162) · 🧪 ICML2026 (53) · 🤖 AAAI2026 (30) · 🧠 NeurIPS2025 (73) · 📹 ICCV2025 (26)
🔥 Top topics: Navigation ×6 · Multimodal/VLM ×4 · Reasoning ×3 · Robotics ×2
- Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents
-
SkillNav decomposes vision-and-language navigation into five atomic skills (Direction Adjustment, Vertical Movement, Stop, Landmark Identification, Area Identification) plus one Temporal Order Planning skill. A DUET sub-agent is fine-tuned for each skill on synthetic data, while a training-free VLM router handles temporal reordering, sub-goal localization, and skill selection. It achieves state-of-the-art generalization on GSA-R2R (Test-N-Scene SPL of 48% vs. the previous best of 43%).
- Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection
-
This paper proposes REFORM, shifting multimodal forgery detection from "direct label fitting" to "learning a verifiable forensic reasoning process." Through the ROM reasoning-annotated dataset, dual decoders, and GRPO training, REFORM achieves superior cross-domain generalization and interpretable detection results on ROM, DGM4, and MMFakeBench.
- ElasticFlow: One-Step Physics-Consistent Policy with Elastic Time Horizons for Language-Guided Manipulation
-
ElasticFlow learns language-conditioned robot actions with MeanFlow, replacing the instantaneous velocity field with a mean velocity field. By explicitly encoding control granularity through an "Elastic Time Horizon" \(\Delta t=t-r\), it achieves single-step (1-NFE) inference at about 71 Hz and outperforms OpenVLA and \(\pi_0\) on long-horizon tasks such as LIBERO-Long and CALVIN ABC-D.
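For reference, the mean-velocity idea can be written out as below. This follows the standard MeanFlow formulation, where \(v\) is the instantaneous velocity field, \(u\) is the mean velocity over the interval \([r, t]\), and \(z_t\) is the sample at time \(t\). How ElasticFlow conditions \(u\) on language and observations is not stated in this note, so the notation here is an assumption rather than the paper's own.

\[
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau,
\qquad
z_r = z_t - (t - r)\, u(z_t, r, t) = z_t - \Delta t\, u(z_t, r, t).
\]

Under MeanFlow's convention that \(t = 1\) is noise and \(t = 0\) is data, setting \(r = 0, t = 1\) gives \(z_0 = z_1 - u(z_1, 0, 1)\), a single network evaluation, which is what 1-NFE refers to.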
- GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning
-
GoViG proposes a new task: generating navigation instructions from only the initial and goal egocentric observations. It decomposes the task into two steps, "imagining intermediate frames, then writing instructions." Anole-7B is jointly trained with a dual objective of token-level MSE and label-smoothed cross-entropy (CE), and inference uses either a one-pass or an interleaved multimodal reasoning strategy. The method improves BLEU-4 from a baseline of 0.08 to 0.32 and retains 0.27 on cross-domain real-world videos.
- GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap
-
GROKE evaluates navigation instructions without any visual input: it serializes OpenStreetMap (OSM) data into JSON and uses Gemini-3 Pro as a follower agent that executes the instructions on the graph. Navigation metrics, namely Navigation Error (NE), Success Rate (SR), and SDTW, serve as proxies for instruction quality. Compared to heuristic baselines on Map2Seq, it reduces NE by 68.5%, and NE correlates significantly with human judgments of "instruction clarity" (\(r = -0.31, p < 0.01\)).
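The three proxy metrics can be illustrated with a minimal sketch. It assumes paths are lists of 2D points, distance is Euclidean, and SDTW is the usual success-weighted normalized DTW; the `success_radius` value and all function names are hypothetical, and GROKE's actual distances on the OSM graph may be defined differently.

```python
import math

def euclid(a, b):
    """Straight-line distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def dtw(pred, ref):
    """Dynamic time warping cost between two point sequences."""
    n, m = len(pred), len(ref)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = euclid(pred[i - 1], ref[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def navigation_metrics(pred, ref, success_radius=3.0):
    """Return NE, SR, and SDTW for one predicted path against its reference."""
    ne = euclid(pred[-1], ref[-1])                  # distance between endpoints
    sr = 1.0 if ne <= success_radius else 0.0       # success if close enough to the goal
    ndtw = math.exp(-dtw(pred, ref) / (len(ref) * success_radius))
    return {"NE": ne, "SR": sr, "SDTW": sr * ndtw}  # SDTW is zero on failure
```

Averaging these values over all instructions in a split gives per-system scores; a lower NE then indicates instructions that the follower could execute more faithfully.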
- Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System
-
Libra-VLA decomposes robot actions into a hybrid action space of "discrete macro-intent + continuous micro-pose." System 2 (VLM + parallel coarse-action head) handles low-frequency planning and System 1 (diffusion transformer + independent SigLIP encoder) handles high-frequency refinement, with an intent buffer between them enabling true asynchronous execution. It reaches a state-of-the-art 97.2% on LIBERO and 79.5% zero-shot on LIBERO-Plus (10% higher than the previous best, OpenVLA-OFT+).
- Limited Linguistic Diversity in Embodied AI Datasets
-
This paper performs a systematic "linguistic diversity audit" of mainstream VLA training corpora (RT-1, BRIDGE, TacoPlay, Language Table, LIBERO). Quantifying lexical, semantic, and syntactic dimensions, it finds that fewer than 2% of VLA instructions are unique, that RT-1 has only 49 unique words in the entire corpus, and that negation and conditional sentences account for less than 1%. Compared with instruction-tuning corpora (OASST2 93%, Alpaca 99.8% unique), this "template-based poverty" may be the root cause of VLA models' brittleness under paraphrasing and of their generalization failures.
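The lexical side of such an audit can be approximated in a few lines. This is a minimal sketch that assumes whitespace tokenization, case-insensitive matching, and a small hand-picked list of negation cues; the function name and the cue list are hypothetical, and the paper's actual measures are likely more careful.

```python
def diversity_stats(instructions):
    """Compute simple lexical diversity statistics for a list of instructions."""
    if not instructions:
        raise ValueError("need at least one instruction")
    # Normalize case and whitespace so trivial variants count as duplicates.
    normalized = [" ".join(s.lower().split()) for s in instructions]
    vocab = {word for s in normalized for word in s.split()}
    negation_cues = {"not", "no", "never", "without", "don't"}  # illustrative only
    with_negation = sum(
        any(word in negation_cues for word in s.split()) for s in normalized
    )
    total = len(normalized)
    return {
        "unique_ratio": len(set(normalized)) / total,
        "vocab_size": len(vocab),
        "negation_ratio": with_negation / total,
    }
```

For example, `diversity_stats(["pick up the can", "Pick up the can", "pick up the apple", "do not move the can"])` counts three distinct instructions out of four and an eight-word vocabulary.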
- Mango: Multi-Agent Web Navigation via Global-View Optimization
-
Mango constructs a global approximate structure of a website before navigation and employs Thompson Sampling to dynamically allocate a limited navigation budget among candidate URLs. This prevents LLM web agents from blindly exploring from the homepage and significantly outperforms baselines such as AgentOccam and WebWalker on WebVoyager and WebWalkerQA.
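The budget-allocation step can be sketched with a Beta-Bernoulli form of Thompson Sampling. The binary reward, the uniform Beta(1, 1) prior, and the `visit` callback are assumptions made for illustration; the note does not say how Mango scores a visited URL.

```python
import random

def allocate_budget(candidates, visit, budget, seed=0):
    """Spend `budget` visits across candidate URLs using Thompson Sampling.

    `visit(url)` is a hypothetical callback returning True when the page
    turned out to be useful for the task.
    """
    rng = random.Random(seed)
    # One [alpha, beta] pair per URL, starting from a uniform Beta(1, 1) prior.
    posterior = {url: [1, 1] for url in candidates}
    visits = {url: 0 for url in candidates}
    for _ in range(budget):
        # Sample a plausible usefulness for each URL and visit the best draw.
        chosen = max(candidates, key=lambda u: rng.betavariate(*posterior[u]))
        useful = visit(chosen)
        posterior[chosen][0 if useful else 1] += 1
        visits[chosen] += 1
    return visits
```

Because each URL is chosen in proportion to how likely it is to be the best one, unpromising branches receive a few exploratory visits and most of the budget flows to pages that keep paying off.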
- VLN-NF: Feasibility-Aware Vision-and-Language Navigation with False-Premise Instructions
-
This paper proposes the VLN-NF benchmark—the first task requiring VLN agents to identify false-premise instructions and output NOT-FOUND in 3D partially observable environments. It further introduces the REV-SPL evaluation metric and the ROAM two-stage hybrid framework, where ROAM achieves 6.1 REV-SPL, representing a 45% improvement over supervised baselines.
- When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models
-
By translating the LIBERO robotic manipulation benchmark into ten languages, this paper systematically shows for the first time that VLA models suffer a 30–50% drop in success rate under non-English instructions. It finds that "linguistic influence is highly non-uniform across execution steps": only a few critical steps are sensitive to language, but they dominate the failure cases. Building on this, the paper proposes inference-time representation alignment applied only to these steps, which significantly recovers multilingual performance.
- Ability-Oriented Failure Attribution for Vision-Language Navigation Agents
-
This paper addresses multi-level ability failures in embodied agents, specifically Vision-Language Navigation (VLN) agents, by proposing the CanTest framework. Using ability-oriented test oracles and failure attribution mechanisms, it localizes the specific ability defects (Perception, Memory, Planning, Decision-making) that lead to task failure and discovers 23–34% more failure cases than existing methods.