Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents
Conference: ACL 2026
arXiv: 2508.07642
Code: https://github.com/HLR/SkillNav
Area: Robotics / Vision-and-Language Navigation / Modular Agents
Keywords: VLN, Skill Decomposition, VLM Router, Synthetic Data, GSA-R2R Generalization
TL;DR
SkillNav decomposes the Vision-and-Language Navigation task into 5 atomic skills (Direction Adjustment, Vertical Movement, Stop, Landmark Detection, Area Identification) plus 1 temporal planning skill. Each atomic skill is handled by a DUET sub-agent fine-tuned on synthetic data, and a training-free VLM router performs temporal reordering, subgoal localization, and skill selection. It achieves SOTA generalization on GSA-R2R (48% Test-N-Scene SPL compared to the previous best of 43%).
Background & Motivation
Background: Mainstream VLN approaches are polarized: (1) Supervised black-box agents (DUET / BEVBERT / ScaleVLN / SRDF), which are trained end-to-end on large-scale synthetic data and perform strongly in-domain on R2R but tend to memorize training trajectories; (2) Zero-shot LLM/VLM agents (MapGPT / NavGPT / DiscussNav), which generalize stably but lack fine-grained visual grounding, with an SR gap of up to ~36 percentage points compared to supervised models.
Limitations of Prior Work: Supervised models exhibit sharp performance degradation in "new building types + new instruction styles" scenarios like GSA-R2R. LLM models lack embodied grounding and cannot precisely select viewpoints. Multi-agent collaborative works (DiscussNav / FlexVLN / CLASH) combine multiple models but often activate several models at each step, causing redundancy, and revert to zero-shot LLM decisions during conflicts, thereby sacrificing in-domain precision.
Key Challenge: The trade-off between "broad generalization (requiring world knowledge from LLMs)" and "precise execution (requiring fine-tuned visual grounding)." End-to-end agents favor the latter, while LLM agents favor the former; the two are difficult to reconcile.
Goal: (1) Identify a "minimal set of atomic skills" for execution that allows each skill to be trained to high precision; (2) Leverage VLMs only for high-level "skill selection + temporal planning" to avoid direct control of low-level actions; (3) Train each skill agent in a closed-loop fashion using synthetic data without relying on human labels.
Key Insight: The authors reuse the 4 atomic skills proposed by NavNuances (DC / VM / LR / RR) and add 2 more: Stop and Temporal Order Planning. This mimics the human cognitive process of "decomposing tasks into reusable sub-actions and scheduling them as needed."
Core Idea: Replace the "monolithic end-to-end policy" with "skill decomposition + skill-specific synthetic data + VLM router" to decouple high-level planning from low-level execution, allowing LLM reasoning and fine-tuned visual grounding to each play to their strengths.
Method
Overall Architecture
SkillNav consists of two main components:
- 5 skill-specific agents \(\mathcal{S} = \{\pi_{da}, \pi_{vm}, \pi_{sp}, \pi_{ld}, \pi_{ar}\}\) (Direction Adjustment / Vertical Movement / Stop and Pause / Landmark Detection / Area and Region Identification). All are based on the DUET architecture and trained in two stages: Stage 1 fine-tunes a skill-agnostic backbone on R2R + ScaleVLN/SRDF augmentation + Temporal synthetic data; Stage 2 continues fine-tuning that backbone into 5 experts using dedicated synthetic datasets (450 samples per skill).
- VLM-based Action Router (training-free, three-stage): (1) The Temporal Reordering Module uses GPT-4o to rewrite the raw instruction as an ordered list of subgoals; (2) The Subgoal Localizer uses Qwen2.5-VL-7B, together with the visual history and the already-executed subgoals, to locate the current subgoal \(p_t^*\) and output a reasoning trace \(r_t\); (3) The Skill Router selects the most suitable skill \(\pi_t^* = \arg\max_{\pi \in \mathcal{S}} \text{Router}(\pi \mid I, p_t^*, r_t)\), where \(I\) is the instruction. The selected expert agent predicts the next action using the original instruction, current observations, and the topological map.
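The per-step routing described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the authors' implementation: `route_step`, `toy_localize`, and `toy_score` are hypothetical names, the two toy functions stand in for the VLM calls (Qwen2.5-VL-7B in the paper), and the subgoal list is assumed to have already been produced by the temporal reordering step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Keys mirror the five experts: direction, vertical, stop, landmark, area.
SKILLS = ("da", "vm", "sp", "ld", "ar")


@dataclass
class RouterDecision:
    subgoal: str    # current subgoal p_t^*
    reasoning: str  # reasoning trace r_t
    skill: str      # key of the selected expert


def route_step(
    instruction: str,
    subgoals: List[str],
    done: List[str],
    localize: Callable[[List[str], List[str]], Tuple[str, str]],
    score: Callable[[str, str, str, str], float],
) -> RouterDecision:
    """Localize the current subgoal, then pick the top-1 skill by score."""
    subgoal, reasoning = localize(subgoals, done)
    best = max(SKILLS, key=lambda s: score(s, instruction, subgoal, reasoning))
    return RouterDecision(subgoal, reasoning, best)


def toy_localize(subgoals: List[str], done: List[str]) -> Tuple[str, str]:
    # Stand-in for the VLM localizer: first subgoal not yet executed.
    for goal in subgoals:
        if goal not in done:
            return goal, f"next unexecuted subgoal is '{goal}'"
    return subgoals[-1], "all subgoals executed; staying on the last one"


def toy_score(skill: str, instruction: str, subgoal: str, reasoning: str) -> float:
    # Stand-in for the VLM skill router: a crude keyword match, for demo only.
    cues = {
        "da": ("turn",),
        "vm": ("stairs",),
        "sp": ("stop", "wait"),
        "ld": ("table", "sink"),
        "ar": ("kitchen", "bedroom"),
    }
    return float(sum(word in subgoal.lower() for word in cues[skill]))


decision = route_step(
    "Turn left, then go up the stairs.",
    ["turn left", "go up the stairs"],
    ["turn left"],
    toy_localize,
    toy_score,
)
print(decision.skill)  # vm
```

Because the router only returns a skill key, swapping the toy functions for real VLM calls would not change how the selected expert is invoked.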
Key Designs
- 6-skill task decomposition + skill-specific synthetic data pipeline:
    - Function: Decomposes "navigation" into 6 minimal units that are semantically independent and individually trainable, allowing each agent to focus on a single task.
    - Mechanism: Short paths of 4-7 steps are randomly sampled from Matterport3D and filtered via skill-based heuristics (e.g., Direction requires frequent turns; Vertical requires a height difference \(>2\) units and a mandatory staircase). GPT-4o synthesizes R2R-style instructions based on trajectory observations, generating 450 samples per skill (2,000 for Temporal). The DUET backbone is then fine-tuned through Stages 1 and 2.
    - Design Motivation: Human labeling for each individual skill is prohibitively expensive. Synthetic data ensures that each trajectory inherently fits the target skill through strict geometric/semantic filtering, which discourages the model from learning keyword shortcuts (experiments show words like "down" have different meanings across datasets, forcing the model to rely on visual context).
- VLM Router 3-stage pipeline (Temporal Reorder → Subgoal Localize → Skill Route):
    - Function: Invokes the VLM only during skill-switching events rather than at every step, reducing computational overhead.
    - Mechanism: First, the LLM explicitly reorders instructions containing temporal words (e.g., "first X then Y then Z") into an ordered subgoal list to eliminate implicit temporal reasoning. Next, the VLM identifies which subgoal to execute currently based on visual history and completed subgoals. Finally, the VLM selects the best-matching skill agent based on the reasoning trace and instruction context.
    - Design Motivation: Experiments show that disabling Temporal Reordering drops Test-N-Scene SPL by about 2.5 percentage points (47.96 → 45.43 with the Qwen router), indicating that explicit temporal decomposition is an important structural scaffold. The three-stage division ensures each VLM call is task-specific and errors are localizable.
- Complete decoupling of VLM reasoning and fine-tuned execution:
    - Function: Gains the dual advantages of broad generalization from LLMs and precise grounding from supervised models.
    - Mechanism: The VLM only produces the discrete decision of "which skill to select" without directly predicting actions. The selected skill agent performs the final action prediction using its own DUET weights and the original instruction. This means VLM errors lead to the "wrong expert" rather than the "wrong action," localizing the failure.
    - Design Motivation: End-to-end VLM agents often couple reasoning and grounding errors. SkillNav experiments show that control-related skills (\(\pi_{sp}\) at 34.42% + \(\pi_{da}\) at 23.61% ≈ 58%) are invoked most often, while semantic skills (\(\pi_{ld}\) at 14.23% + \(\pi_{ar}\) at 18.75% ≈ 33%) only activate when "identifying specific objects," reflecting a precision-first strategy.
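The skill-based trajectory filtering mentioned in the data pipeline above (4-7 step paths, frequent turns for Direction, a height difference above 2 units plus a staircase for Vertical) might look roughly like the sketch below. Everything beyond those three rules is an assumption made for illustration: the `Viewpoint` fields, the 30-degree turn threshold, the minimum of 2 turns, and counting "steps" as transitions between viewpoints are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Viewpoint:
    heading_deg: float  # agent heading at this viewpoint (hypothetical field)
    height: float       # vertical coordinate in simulator units
    on_stairs: bool     # hypothetical flag, e.g. derived from region labels


def has_valid_length(path: List[Viewpoint]) -> bool:
    # Assumption: a "step" is one transition between consecutive viewpoints.
    return 4 <= len(path) - 1 <= 7


def count_turns(path: List[Viewpoint], min_turn_deg: float = 30.0) -> int:
    turns = 0
    for a, b in zip(path, path[1:]):
        # Smallest absolute angle between the two headings, in [0, 180].
        delta = abs((b.heading_deg - a.heading_deg + 180.0) % 360.0 - 180.0)
        if delta >= min_turn_deg:
            turns += 1
    return turns


def fits_direction(path: List[Viewpoint], min_turns: int = 2) -> bool:
    # "Frequent turns" is approximated as at least `min_turns` heading changes.
    return has_valid_length(path) and count_turns(path) >= min_turns


def fits_vertical(path: List[Viewpoint], min_height_diff: float = 2.0) -> bool:
    heights = [v.height for v in path]
    climbs_enough = (max(heights) - min(heights)) > min_height_diff
    uses_stairs = any(v.on_stairs for v in path)
    return has_valid_length(path) and climbs_enough and uses_stairs


turning = [Viewpoint(h, 0.0, False) for h in (0, 90, 90, 180, 180)]
stairs = [Viewpoint(0.0, z, z in (1.0, 2.0)) for z in (0.0, 1.0, 2.0, 3.0, 3.0)]
print(fits_direction(turning), fits_vertical(turning))  # True False
print(fits_direction(stairs), fits_vertical(stairs))    # False True
```

Filtering on geometry rather than on instruction wording is what lets the same trajectory pool serve as skill-specific training data before any instruction text is generated.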
Loss & Training
Two-stage fine-tuning: Stage 1 trains on ScaleVLN augmented data + R2R + Temporal synthetic data for 50,000 iterations (batch 32, lr 5e-5); Stage 2 trains on each skill dataset for 30,000 iterations (batch 16). The Router runs on vLLM with greedy decoding (temperature 0, max length 40,960) and selects the top-1 skill.
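For quick reference, the schedule above can be written down as a small config sketch. The class and field names are hypothetical, only the numbers come from the text, and the Stage 2 learning rate is left as `None` because it is not stated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    name: str
    iterations: int
    batch_size: int
    learning_rate: Optional[float]  # None where the summary gives no value

    def samples_processed(self) -> int:
        # Total training samples drawn over the whole stage.
        return self.iterations * self.batch_size


STAGE1 = StageConfig("skill-agnostic backbone", 50_000, 32, 5e-5)
STAGE2 = StageConfig("per-skill expert", 30_000, 16, None)

# Router decoding settings as described: greedy decoding, top-1 skill.
ROUTER_DECODING = {"temperature": 0.0, "max_length": 40_960, "top_k_skills": 1}

print(STAGE1.samples_processed())  # 1600000
print(STAGE2.samples_processed())  # 480000
```

Stage 2 draws far fewer samples per expert than Stage 1, which is consistent with it being a specialization pass on top of a shared backbone.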
Key Experimental Results
Main Results: R2R + GSA-R2R Comparison
| Method | R2R Val-Unseen SPL | R2R Test-Unseen SPL | GSA-R2R Test-R-Basic SPL | GSA-R2R Test-N-Basic SPL | GSA-R2R Test-N-Scene SPL |
|---|---|---|---|---|---|
| DUET | 60 | 59 | 47 | 37 | 30 |
| BEVBERT | 64 | 62 | 45 | 35 | 27 |
| ScaleVLN † | 70 | 68 | 67 | 57 | 43 |
| SRDF † | 78 | 77 | 63 | 49 | 43 |
| MapGPT (LLM) | 35 | — | 30 | 23 | 23 |
| NavGPT-2 (FlanT5-5B) | 61 | 60 | 45 | 35 | 43 |
| SkillNav (ScaleVLN-Aug) † | 77 (+6.54) | 70 (+1.80) | 69 (+2.18) | 61 (+4.18) | 48 (+5.26) |
| SkillNav (SRDF-Aug) † | 78 | 77 | 64 | 50 | 45 |
† = augmented with large-scale synthetic data. Values in parentheses are SPL gains (in percentage points) over ScaleVLN. SkillNav achieves SOTA on GSA-R2R, with Test-N-Scene SPL increasing by 5.26 percentage points over ScaleVLN.
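The summary does not restate how SPL is computed. Assuming the paper follows the standard Success weighted by Path Length metric used across R2R-style benchmarks, the definition is:

\[
\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i, \ell_i)}
\]

Here \(N\) is the number of episodes, \(S_i \in \{0, 1\}\) indicates whether episode \(i\) succeeded, \(\ell_i\) is the shortest-path length from start to goal, and \(p_i\) is the length of the path the agent actually took. An agent therefore scores high only when it both reaches the goal and does so without large detours, which is why SPL is lower than SR for the same run.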
Ablation Study: Action Router Mechanisms
| Reorder | Router | Test-R-Basic SPL | Test-N-Basic SPL | Test-N-Scene SPL |
|---|---|---|---|---|
| ✗ | Qwen | 67.80 | 59.62 | 45.43 |
| ✔ | Qwen | 68.88 | 61.34 | 47.96 |
| ✗ | GLM | 66.27 | 58.63 | 42.64 |
| ✔ | GLM | 67.93 | 59.73 | 46.51 |
| Random skill (no router) | — | 67.46 | 59.71 | 43.17 |
| ✔ | GPT-4o | 69.18 | 62.48 | 48.96 |
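The effect of temporal reordering can be read directly off the ablation table by subtracting the "✗" row from the "✔" row for each router. The snippet below does that arithmetic; the numbers are copied from the table, while the dictionary layout and the `reorder_gain` helper are just an illustrative way to organize them.

```python
# SPL values from the ablation table, keyed by (router, reorder enabled).
ABLATION = {
    ("Qwen", False): (67.80, 59.62, 45.43),
    ("Qwen", True): (68.88, 61.34, 47.96),
    ("GLM", False): (66.27, 58.63, 42.64),
    ("GLM", True): (67.93, 59.73, 46.51),
}
SPLITS = ("Test-R-Basic", "Test-N-Basic", "Test-N-Scene")


def reorder_gain(router: str) -> dict:
    """SPL gain (percentage points) from enabling temporal reordering."""
    off = ABLATION[(router, False)]
    on = ABLATION[(router, True)]
    return {split: round(b - a, 2) for split, a, b in zip(SPLITS, off, on)}


print(reorder_gain("Qwen"))
# {'Test-R-Basic': 1.08, 'Test-N-Basic': 1.72, 'Test-N-Scene': 2.53}
print(reorder_gain("GLM"))
# {'Test-R-Basic': 1.66, 'Test-N-Basic': 1.1, 'Test-N-Scene': 3.87}
```

For both routers the gain is largest on Test-N-Scene, the split that combines new building types with new instruction styles.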
NavNuances Single-Skill Evaluation (Each agent is strongest on its own skill)
| Method | DC SR | VM SR | LR SR | RR SR |
|---|---|---|---|---|
| ScaleVLN | 68.39 | 81.76 | 28.32 | 82.91 |
| SRDF | 59.93 | 82.94 | 26.28 | 77.09 |
| Direction Adjustment agent | 70.81 | 81.76 | 31.39 | 81.82 |
| Vertical Movement agent | 70.68 | 87.65 | 30.22 | 82.18 |
| Landmark Detection agent | 70.29 | 82.35 | 31.53 | 83.64 |
| Area and Region Ident. agent | 67.53 | 84.12 | 29.20 | 85.09 |
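The claim in the heading, that each agent is strongest on its own skill, can be checked mechanically from the table. The sketch below copies the SR values and takes the per-column maximum; pairing DC with Direction Adjustment, VM with Vertical Movement, LR with Landmark Detection, and RR with Area and Region Identification is an assumption based on the names.

```python
# SR values copied from the NavNuances table above.
SR = {
    "ScaleVLN": {"DC": 68.39, "VM": 81.76, "LR": 28.32, "RR": 82.91},
    "SRDF": {"DC": 59.93, "VM": 82.94, "LR": 26.28, "RR": 77.09},
    "Direction Adjustment": {"DC": 70.81, "VM": 81.76, "LR": 31.39, "RR": 81.82},
    "Vertical Movement": {"DC": 70.68, "VM": 87.65, "LR": 30.22, "RR": 82.18},
    "Landmark Detection": {"DC": 70.29, "VM": 82.35, "LR": 31.53, "RR": 83.64},
    "Area and Region Ident.": {"DC": 67.53, "VM": 84.12, "LR": 29.20, "RR": 85.09},
}


def best_method(category: str) -> str:
    """Name of the row with the highest SR in the given column."""
    return max(SR, key=lambda method: SR[method][category])


for category in ("DC", "VM", "LR", "RR"):
    print(category, "->", best_method(category))
# DC -> Direction Adjustment
# VM -> Vertical Movement
# LR -> Landmark Detection
# RR -> Area and Region Ident.
```

The margins differ considerably: the Vertical Movement agent leads its column by several points, while the Landmark Detection agent edges out the Direction Adjustment agent on LR by only 0.14.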
Key Findings
- Removing Temporal Reordering → Test-N-Scene SPL drops by about 2.5 percentage points (47.96 → 45.43 with the Qwen router), indicating that an explicit temporal scaffold matters.
- 5-skill Subset Ablation: Any combination of 2-4 skills performs worse than all 5 skills (e.g., best 4-skill SR is 80.80, while 5-skill SR is 82.59), highlighting the importance of decomposition "completeness."
- Expert Activation Frequency: Control skills (\(\pi_{sp}\) 34.42% + \(\pi_{da}\) 23.61% = 58%) are used far more than semantic skills, suggesting that "continuous state verification" is more frequent in navigation than "sparse semantic anchoring."
- Inference Overhead: SkillNav takes 9.69s/case, which is 2-4× faster than NavGPT/FlexVLN but still ~50× slower than ScaleVLN (28 inferences/s).
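The activation shares quoted in the findings above can be tallied as follows. The four percentages come from the text; grouping them into "control" and "semantic" follows the summary's wording. The share left over after these four is not reported here, and attributing it to the Vertical Movement agent would be an assumption that the five shares sum to 100%.

```python
# Reported activation frequencies (percent of routing decisions).
ACTIVATION_PCT = {"sp": 34.42, "da": 23.61, "ld": 14.23, "ar": 18.75}
CONTROL = ("sp", "da")   # Stop and Pause, Direction Adjustment
SEMANTIC = ("ld", "ar")  # Landmark Detection, Area and Region Identification


def share(skills) -> float:
    """Combined activation share of the given skills, in percent."""
    return round(sum(ACTIVATION_PCT[s] for s in skills), 2)


control = share(CONTROL)
semantic = share(SEMANTIC)
# Assumption: shares sum to 100%, so the rest would belong to the fifth agent.
unreported = round(100.0 - control - semantic, 2)

print(control, semantic, unreported)  # 58.03 32.98 8.99
```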
Highlights & Insights
- Defining "Skill" as High-Level Semantic Concepts: In Appendix A.1, the authors clarify that atomic skills are defined at the semantic intent level rather than the motor execution level (e.g., "walk to the far end of the room" is a Region Identification skill, even if it involves multiple forward moves and turns). This avoids the extremes of over-decomposition or being too coarse.
- VLM for Decision, Not Execution: By restricting the VLM to discrete "skill selection" decisions, errors are localized, while grounding is handled by the fine-tuned DUET. This decoupling of high-level reasoning and low-level grounding is the key to generalization.
- Two-Stage Fine-Tuning against Catastrophic Forgetting: Stage 1 uses large-scale general data to train the backbone, while Stage 2 branches into skill specialization, providing more stability than single-stage training for 5 skills.
- Anti-Shortcut Design in Synthetic Data: Vertical Movement data intentionally includes non-vertical keywords (Landmark 18.72% + Direction 8.05%), forcing the model to learn from vision rather than a dictionary. This anti-shortcut data construction is highly valuable for future work.
Limitations & Future Work
- Discrete Viewpoint Simulator Evaluation: Not yet validated in VLN-CE / Habitat continuous control or on real robots; continuous action spaces would require a new skill executor.
- Relatively High Inference Overhead: Roughly 50× slower than purely supervised models; deployment in latency-constrained scenarios would require router distillation or caching.
- Incomplete Skill Library: Does not cover specialized scenarios like object manipulation, transparent materials, or human-aware navigation.
- Closed-source/Open-source Mixture (GPT-4o + Qwen2.5-VL): Results in high replication costs; if the GPT-4o API is discontinued, it would impact Temporal Reordering quality.
- Grounding Bottlenecks: Analysis of 17 failure cases reveals that errors are primarily due to visual grounding (e.g., VLM binding a "sink" to the wrong object) rather than router reasoning, suggesting that the grounding module needs further strengthening.
Related Work & Insights
- vs. DUET (Backbone): Built on DUET, but improves generalization by splitting a single DUET into 5 skill-specific versions plus a VLM router.
- vs. ScaleVLN / SRDF: Also uses large-scale synthetic data, but SkillNav further categorizes by skill and specializes in Stage 2, outperforming a single massive model.
- vs. MapGPT / NavGPT / DiscussNav: Pure LLM routes are zero-shot but lack grounding; SkillNav uses VLM for decisions only, combining the strengths of both.
- vs. FlexVLN / CLASH (Planner-Executor): Similar hierarchical ideas, but they may activate redundant models per step or revert to zero-shot on conflict. SkillNav always selects the top-1 best-fit specialist.
- vs. SAME (State-Adaptive MoE): Similar to MoE, but while SAME uses implicit routing, SkillNav uses explicit skill semantic routing, offering better interpretability.
Rating
- Novelty: ⭐⭐⭐⭐ The combination of "skill decomposition + VLM router + synthetic data closed-loop" is stable and effective for cross-benchmark generalization.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Includes R2R / GSA-R2R / RxR / NavNuances benchmarks + skill subset ablation + temporal ablation + router VLM comparison + failure analysis + leakage analysis + inference overhead analysis.
- Writing Quality: ⭐⭐⭐⭐ Extremely detailed appendices (Skill definitions, data construction, bias checks, hyperparams).
- Value: ⭐⭐⭐⭐ Open-sourced code, project page, and reusable synthetic data pipeline; provides a viable path for "modular + LLM reasoning" in VLN.