🤖 Robotics & Embodied AI

💬 ACL2026 · 11 paper notes

🔥 Top topics: Navigation ×6 · Multimodal/VLM ×4 · Reasoning ×3 · Robotics ×2

Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents

SkillNav decomposes vision-and-language navigation into five atomic skills (Direction Adjustment, Vertical Movement, Stop, Landmark Identification, Area Identification) plus one Temporal Order Planning skill. A separate DUET sub-agent is fine-tuned for each skill on synthetic data, while a training-free VLM router handles temporal reordering, sub-goal localization, and skill selection. It achieves state-of-the-art generalization on GSA-R2R (Test-N-Scene SPL of 48% vs. the previous best of 43%).
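For readers unfamiliar with the SPL figure quoted above, here is a minimal sketch of the standard Success weighted by Path Length metric used in navigation benchmarks. The function name `spl` and its argument names are illustrative assumptions, not code from the paper, and the toy numbers in the usage note are made up.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        path_lengths: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Each successful episode contributes shortest / max(taken, shortest);
    each failed episode contributes 0.
    """
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")
    if len(successes) == 0:
        raise ValueError("need at least one episode")

    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if not ok:
            continue
        denom = max(taken, shortest)
        # Degenerate episode where start == goal: count it as a perfect path.
        total += 1.0 if denom == 0 else shortest / denom
    return total / len(successes)
```

With three toy episodes, `spl([True, False, True], [10, 10, 10], [10, 5, 20])` averages the per-episode scores 1, 0, and 0.5, so a successful agent that takes a detour is still penalized.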

Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection

This paper proposes REFORM, shifting multimodal forgery detection from "direct label fitting" to "learning a verifiable forensic reasoning process." Through the ROM reasoning-annotated dataset, dual decoders, and GRPO training, REFORM achieves superior cross-domain generalization and interpretable detection results on ROM, DGM4, and MMFakeBench.

ElasticFlow: One-Step Physics-Consistent Policy with Elastic Time Horizons for Language-Guided Manipulation

The paper proposes ElasticFlow, which replaces instantaneous velocity fields with MeanFlow (mean velocity fields) for learning language-conditioned robotic actions. By explicitly encoding control granularity using an "Elastic Time Horizon \(\Delta t=t-r\)", it achieves 1-NFE single-step inference (~71Hz) and outperforms OpenVLA and \(\pi_0\) on long-horizon tasks such as LIBERO-Long and CALVIN ABC-D.
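As a rough guide to what "mean velocity field" and "1-NFE" mean here, the general MeanFlow formulation can be written as follows. The symbols \(z_t\) (the noisy sample at time \(t\)), \(v\) (the instantaneous velocity), and \(u\) (the mean velocity) are assumed from that general formulation; how ElasticFlow conditions on language and observations is not shown and may differ.

\[
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau)\, d\tau,
\qquad
z_r = z_t - (t - r)\, u(z_t, r, t)
\]

Setting \(r = 0\) and \(t = 1\) gives \(z_0 = z_1 - u(z_1, 0, 1)\): a single network evaluation maps noise to a sample, while a smaller \(\Delta t = t - r\) corresponds to a shorter, finer-grained step.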

GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning

GoViG proposes a new task of generating navigation instructions based only on initial and goal egocentric observations. It decomposes the task into two steps: "imagining intermediate frames then writing instructions." By jointly training Anole-7B with a dual objective of token-level MSE and label-smoothing CE, and employing one-pass or interleaved multimodal reasoning strategies, the method improves the BLEU-4 score from a baseline of 0.08 to 0.32, maintaining 0.27 on cross-domain real-world videos.

GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap

GROKE proposes evaluating navigation instructions without any vision by serializing OpenStreetMap (OSM) data into JSON and utilizing Gemini-3 Pro as a follower agent to execute instructions on the graph. Navigation metrics (Navigation Error / SR / SDTW) serve as proxies for instruction quality. Compared to heuristic baselines on Map2Seq, it reduces Navigation Error (NE) by 68.5%, and results show that NE is significantly correlated with human judgment of "instruction clarity" (\(r = -0.31, p < 0.01\)).
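To make the vision-free setup concrete, the sketch below serializes a toy street graph to JSON and scores a follower's stopping node by Navigation Error, taken here as the shortest-path distance to the goal. The graph layout, the node names, the function names, and the `threshold` value are all hypothetical; the paper's actual OSM serialization and prompt format are not reproduced.

```python
import heapq
import json
import math


def shortest_distance(graph, start, goal):
    """Dijkstra shortest-path distance over {node: {neighbor: length}}."""
    best = {start: 0.0}
    frontier = [(0.0, start)]
    while frontier:
        dist, node = heapq.heappop(frontier)
        if node == goal:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, length in graph.get(node, {}).items():
            candidate = dist + length
            if candidate < best.get(neighbor, math.inf):
                best[neighbor] = candidate
                heapq.heappush(frontier, (candidate, neighbor))
    return math.inf  # goal unreachable


def navigation_error(graph, stop_node, goal_node):
    """Distance remaining between where the follower stopped and the goal."""
    return shortest_distance(graph, stop_node, goal_node)


def is_success(graph, stop_node, goal_node, threshold=25.0):
    """Hypothetical success rule: stop within `threshold` metres of the goal."""
    return navigation_error(graph, stop_node, goal_node) <= threshold


# Toy graph (made-up intersections and edge lengths in metres).
toy_graph = {
    "A": {"B": 50.0},
    "B": {"A": 50.0, "C": 30.0},
    "C": {"B": 30.0},
}

# Text form that could be placed in an LLM follower's prompt.
serialized = json.dumps(toy_graph, indent=2)
```

A lower average `navigation_error` over many instructions would then act as a proxy for clearer instructions, which is the relationship the reported correlation describes.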

Libra-VLA: Achieving Learning Equilibrium via Asynchronous Coarse-to-Fine Dual-System

Libra-VLA decomposes robot actions into a hybrid action space of "discrete macro-intent + continuous micro-pose." It uses System 2 (VLM + parallel coarse-action head) for low-frequency planning and System 1 (diffusion transformer + independent SigLIP encoder) for high-frequency refinement. An intent buffer enables truly asynchronous execution. Libra-VLA reaches a state-of-the-art 97.2% on LIBERO and 79.5% zero-shot on LIBERO-Plus (10% higher than the prior OpenVLA-OFT+).

Limited Linguistic Diversity in Embodied AI Datasets

This paper performs a systematic "linguistic diversity audit" of mainstream VLA training corpora (RT-1, BRIDGE, TacoPlay, Language Table, LIBERO). Quantifying lexical, semantic, and syntactic diversity, it finds that fewer than 2% of instructions in VLA data are unique, that RT-1 uses only 49 unique words across its entire corpus, and that negation and conditional sentences account for less than 1%. Compared with instruction-tuning corpora (OASST2 93% unique, Alpaca 99.8% unique), this "template-based poverty" may be the root cause of VLA models' vulnerability to paraphrasing and of their generalization failures.
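The lexical part of such an audit can be approximated in a few lines. The sketch below is an assumption about how unique-instruction ratio, vocabulary size, and negation rate might be computed; the normalization rule, the `NEGATION_MARKERS` set, and the function name are illustrative and are not taken from the paper.

```python
import re

# Hypothetical marker list; a real audit would likely use a parser.
NEGATION_MARKERS = {"not", "no", "never", "without", "don't"}


def diversity_stats(instructions):
    """Return simple lexical diversity statistics for a list of instructions."""
    if not instructions:
        raise ValueError("need at least one instruction")

    # Lowercase and drop punctuation so trivial variants count as duplicates.
    normalized = [
        " ".join(re.findall(r"[a-z0-9']+", text.lower()))
        for text in instructions
    ]
    vocabulary = {word for text in normalized for word in text.split()}
    negated = sum(
        1 for text in normalized
        if NEGATION_MARKERS.intersection(text.split())
    )
    return {
        "unique_ratio": len(set(normalized)) / len(normalized),
        "vocab_size": len(vocabulary),
        "negation_ratio": negated / len(normalized),
    }


# Made-up examples, not drawn from any of the audited datasets.
example = [
    "Pick up the cup.",
    "pick up the cup",
    "Do not touch the bowl",
    "Move the cup",
]
stats = diversity_stats(example)
```

On a heavily templated corpus, `unique_ratio` would collapse toward zero while `vocab_size` stays small, which is the pattern the audit reports.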

Mango: Multi-Agent Web Navigation via Global-View Optimization

Mango constructs a global approximate structure of a website before navigation and employs Thompson Sampling to dynamically allocate a limited navigation budget among candidate URLs. This prevents LLM web agents from blindly exploring from the homepage and significantly outperforms baselines such as AgentOccam and WebWalker on WebVoyager and WebWalkerQA.
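The budget-allocation idea can be illustrated with a generic Beta-Bernoulli Thompson Sampling loop. This is a minimal sketch under stated assumptions: the `visit` callback, the binary "useful or not" reward, and the function name are hypothetical, and Mango's actual reward signal and priors are not described here.

```python
import random


def allocate_budget(urls, visit, budget, seed=None):
    """Spend `budget` visits across `urls` using Thompson Sampling.

    `visit(url)` is a caller-supplied function returning True when the
    visit turned out to be useful. Returns the number of visits per URL.
    """
    rng = random.Random(seed)
    # Beta(1, 1) prior per URL, stored as [successes + 1, failures + 1].
    posterior = {url: [1, 1] for url in urls}
    counts = {url: 0 for url in urls}

    for _ in range(budget):
        # Draw one plausible usefulness rate per URL and follow the best draw.
        draws = {
            url: rng.betavariate(alpha, beta)
            for url, (alpha, beta) in posterior.items()
        }
        chosen = max(draws, key=draws.get)
        counts[chosen] += 1

        if visit(chosen):
            posterior[chosen][0] += 1
        else:
            posterior[chosen][1] += 1
    return counts
```

Because each choice is driven by a random draw from the posterior, uncertain URLs still get occasional visits early on, while visits concentrate on URLs that keep paying off as evidence accumulates.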

VLN-NF: Feasibility-Aware Vision-and-Language Navigation with False-Premise Instructions

This paper proposes the VLN-NF benchmark, the first task requiring VLN agents to identify false-premise instructions and output NOT-FOUND in partially observable 3D environments. It also introduces the REV-SPL evaluation metric and ROAM, a two-stage hybrid framework; ROAM achieves 6.1 REV-SPL, a 45% improvement over supervised baselines.

When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models

By translating the LIBERO robotic manipulation benchmark into ten languages, this paper shows systematically, for the first time, that VLA models suffer a 30–50% drop in success rate under non-English instructions. It finds that "linguistic influence is highly non-uniform across execution steps": only a few critical steps are sensitive to language, yet they dominate failure cases. Based on this, the paper proposes inference-time representation alignment applied specifically to these steps, which significantly recovers multilingual performance.

Ability-Oriented Failure Attribution for Vision-Language Navigation Agents

This paper addresses multi-level ability failures in embodied agents, specifically Vision-Language Navigation (VLN) agents, by proposing the CanTest framework. Using ability-oriented test oracles and failure attribution mechanisms, it localizes the specific ability defects (Perception, Memory, Planning, Decision-making) that lead to task failure, and discovers 23–34% more failure cases than existing methods.