Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method

Conference: CVPR 2025
arXiv: 2412.09082
Code: https://hcplab-sysu.github.io/LH-VLN
Area: Robotics / Vision-Language Navigation
Keywords: Long-horizon navigation, multi-stage task, vision-language navigation, memory mechanism, benchmark evaluation

TL;DR

This paper defines the Long-Horizon Vision-Language Navigation (LH-VLN) task, builds the NavGen automatic task-generation platform and the LHPR-VLN benchmark (3,260 multi-stage tasks averaging 150 steps), and proposes MGDM, which supports multi-stage navigation through short-term memory blurring, long-term memory retrieval, and CoT feedback. MGDM raises ISR from 3.81% to 4.69%, a relative gain of about 23% over fine-tuned NaviLLM.

Background & Motivation

Background: Vision-Language Navigation (VLN) enables agents to navigate in 3D environments guided by natural language instructions. Existing benchmarks (e.g., R2R, VLN-CE) feature an average path length of only 55 steps, and instructions involve only a single target—well below the requirements of real-world scenarios.

Limitations of Prior Work: Navigation in real-world scenarios is usually multi-stage (e.g., "go to the kitchen to fetch a cup, then put it on the table in the living room"), which involves long-horizon planning of 150+ steps. Existing methods and benchmarks fail to evaluate this multi-stage long-horizon capability.

Key Challenge: Long-horizon multi-stage navigation requires handling dependencies between subtasks (e.g., task A must be completed before starting B). However, existing evaluation metrics (such as SR and SPL) only assess the final outcome, failing to measure the correctness of intermediate stages.

Key Insight: Three new metrics (ISR, CSR, and CGT) are defined to evaluate independent subtask success rate, conditional subtask success rate, and path difficulty-weighted success rate, respectively. GPT-4 and the NavGen platform are leveraged to automatically generate large-scale multi-stage tasks.

Core Idea: New task definition + new evaluation metrics + large-scale automatically generated benchmark = long-horizon multi-stage VLN.

Method

Key Designs

NavGen Automatic Generation Platform:
- Function: Automatically generates multi-stage navigation tasks from 3D scenes.
- Mechanism: Given a list of objects and the topological structure of a scene, GPT-4 automatically generates navigation instructions containing 2 to 4 subtasks. Each subtask has an independent start/end point and success evaluation criteria.
MGDM (Multi-Granularity Dynamic Memory):
- Function: Handles memory management in long-horizon navigation.
- Mechanism: Short-term memory blurring (compressing recent observations into summaries to avoid information overload) + long-term memory retrieval (retrieving relevant experiences from history) + Chain-of-Thought feedback (utilizing CoT to analyze the current state and decide on the next action).
Three New Evaluation Metrics:
- ISR (Independent Success Rate): Evaluates each subtask independently.
- CSR (Conditional Success Rate): Accounts for dependencies between subtasks (if a preceding subtask fails, all subsequent ones are considered failed).
- CGT (CSR weighted by Ground Truth): Weights CSR by path difficulty.
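
The three metrics can be illustrated with a small sketch. It follows the descriptions in this summary rather than the paper's exact formulas, which may differ: CSR is implemented as a strict cascade, and the ground-truth path length of each subtask (`gt_lengths`, a hypothetical input) is assumed to be the difficulty weight for CGT.

```python
from typing import List, Sequence


def isr(episodes: Sequence[Sequence[bool]]) -> float:
    """Fraction of all subtasks that succeeded, each judged on its own."""
    total = sum(len(ep) for ep in episodes)
    return sum(sum(ep) for ep in episodes) / total


def cascade(episode: Sequence[bool]) -> List[bool]:
    """A subtask counts only if it and every preceding subtask succeeded."""
    counted = []
    ok = True
    for success in episode:
        ok = ok and bool(success)
        counted.append(ok)
    return counted


def csr(episodes: Sequence[Sequence[bool]]) -> float:
    """Subtask success rate after applying the cascade rule."""
    total = sum(len(ep) for ep in episodes)
    return sum(sum(cascade(ep)) for ep in episodes) / total


def cgt(episodes: Sequence[Sequence[bool]],
        gt_lengths: Sequence[Sequence[float]]) -> float:
    """Cascade success weighted by an assumed per-subtask difficulty weight."""
    weighted_success = 0.0
    total_weight = 0.0
    for episode, lengths in zip(episodes, gt_lengths):
        for ok, weight in zip(cascade(episode), lengths):
            weighted_success += weight * ok
            total_weight += weight
    return weighted_success / total_weight


# Two illustrative episodes (made-up data): per-subtask success flags.
episodes = [[True, False, True], [True, True]]
# Made-up ground-truth path lengths (in steps) for each subtask.
gt_lengths = [[10, 20, 30], [15, 25]]

print(isr(episodes), csr(episodes), cgt(episodes, gt_lengths))
```

In the first episode the third subtask succeeds on its own, so it counts toward ISR, but it is discarded by CSR and CGT because the second subtask failed. This is the error cascade the conditional metrics are meant to expose.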

Loss & Training

The model is trained by imitation learning with a cross-entropy loss. LHPR-VLN contains 3,260 tasks covering 216 HM3D scenes: 39% have two subtasks, 52.4% have three, and 8.6% have four.
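
As a rough illustration of imitation learning with cross-entropy, the sketch below computes the average negative log-likelihood of expert actions under a policy's per-step logits. The action set, function names, and logits are hypothetical; the paper's actual training setup is not described in this summary.

```python
import math
from typing import List, Sequence

# Hypothetical discrete action space; the real one may differ.
ACTIONS = ["forward", "turn_left", "turn_right", "stop"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one step's action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def imitation_loss(logits_per_step: Sequence[Sequence[float]],
                   expert_actions: Sequence[int]) -> float:
    """Mean cross-entropy between predicted action distributions and expert actions."""
    if len(logits_per_step) != len(expert_actions):
        raise ValueError("one expert action is required per step")
    nll = 0.0
    for logits, action in zip(logits_per_step, expert_actions):
        probs = softmax(logits)
        nll -= math.log(probs[action])
    return nll / len(expert_actions)


# A policy with no preference pays log(4) per step on a 4-way choice.
uniform = imitation_loss([[0.0, 0.0, 0.0, 0.0]], [ACTIONS.index("forward")])
# A policy that favors the expert's action pays less.
confident = imitation_loss([[5.0, 0.0, 0.0, 0.0]], [ACTIONS.index("forward")])
print(uniform, confident)
```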

Key Experimental Results

Main Results

Method	ISR	CSR	CGT
NaviLLM (Finetuned)	3.81%	1.67%	2.54%
MGDM	4.69%	3.30%	5.83%

All baselines show a success rate close to 0% on tasks with 2–3 subtasks, indicating that long-horizon navigation is highly challenging.
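
The relative gains implied by the table can be checked with a few lines of arithmetic. The numbers are copied from the table above; the helper name `relative_gain` is only illustrative.

```python
# Scores in percent, copied from the results table.
results = {
    "NaviLLM (Finetuned)": {"ISR": 3.81, "CSR": 1.67, "CGT": 2.54},
    "MGDM": {"ISR": 4.69, "CSR": 3.30, "CGT": 5.83},
}


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


gains = {
    metric: relative_gain(results["MGDM"][metric],
                          results["NaviLLM (Finetuned)"][metric])
    for metric in ("ISR", "CSR", "CGT")
}

for metric, gain in gains.items():
    print(f"{metric}: +{gain:.1f}% relative")
```

The ISR gain works out to about 23% relative, while the absolute difference is under one percentage point (4.69% vs. 3.81%).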

Key Findings

Long-horizon navigation is extremely challenging for all existing methods: The best method, MGDM, reaches an ISR of only 4.69%.
Subtask dependencies heavily impact performance: CSR is consistently lower than ISR (3.30% vs. 4.69% for MGDM, 1.67% vs. 3.81% for NaviLLM), indicating that error cascades are a major cause of failure.
Memory mechanisms are crucial for long-horizon scenarios: Methods without memory management fail completely.
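
To make the memory idea concrete, here is a minimal sketch of the two components named under Key Designs. It is an illustration under stated assumptions, not the paper's implementation: "blurring" is approximated by averaging the two oldest observation vectors when a fixed-capacity buffer overflows, and retrieval is a cosine-similarity top-k lookup. All class and method names are hypothetical.

```python
import math
from typing import Any, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ShortTermMemory:
    """Fixed-capacity buffer that blurs (averages) its oldest entries on overflow."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.items: List[List[float]] = []

    def add(self, observation: Sequence[float]) -> None:
        self.items.append(list(observation))
        while len(self.items) > self.capacity:
            # Assumed blurring rule: merge the two oldest entries into their mean.
            oldest = self.items.pop(0)
            second = self.items.pop(0)
            self.items.insert(0, [(x + y) / 2 for x, y in zip(oldest, second)])


class LongTermMemory:
    """Stores (key vector, payload) pairs and retrieves the most similar payloads."""

    def __init__(self) -> None:
        self.entries: List[Tuple[List[float], Any]] = []

    def store(self, key: Sequence[float], payload: Any) -> None:
        self.entries.append((list(key), payload))

    def retrieve(self, query: Sequence[float], k: int = 1) -> List[Any]:
        ranked = sorted(self.entries,
                        key=lambda entry: cosine(query, entry[0]),
                        reverse=True)
        return [payload for _, payload in ranked[:k]]


stm = ShortTermMemory(capacity=2)
for obs in ([0.0, 0.0], [2.0, 2.0], [4.0, 4.0]):
    stm.add(obs)

ltm = LongTermMemory()
ltm.store([1.0, 0.0], "kitchen")
ltm.store([0.0, 1.0], "living room")
nearest = ltm.retrieve([0.9, 0.1], k=1)
print(stm.items, nearest)
```

Averaging keeps the buffer bounded while retaining a coarse trace of older observations. A real system would more likely operate on learned feature embeddings and use a learned or entropy-based merge rule.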

Highlights & Insights

The task definition and benchmark are the core contributions: they expose the fundamental limitations of existing VLN methods in long-horizon scenarios.
Absolute performance is extremely low: a success rate of about 5% shows that long-horizon VLN remains a largely open problem.

Limitations & Future Work

Baseline performance is extremely low, which makes it hard to judge how much improvement a given method actually delivers.
Limited to the Habitat simulator.
The memory mechanism is tailored for specific tasks, and its generalizability remains to be verified.

Rating

Novelty: ⭐⭐⭐⭐ The contribution framework consisting of new tasks, new metrics, and a new benchmark is comprehensive.
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple baselines are evaluated, but absolute performance is too low.
Writing Quality: ⭐⭐⭐⭐ The problem definition is clear.
Value: ⭐⭐⭐⭐ Points the VLN community towards the long-horizon direction.