Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning¶
Conference: CVPR 2026 | arXiv: 2601.09111 | Code: None | Area: Robotics / Embodied Intelligence / Vision-Language Navigation | Keywords: Vision-Language Navigation, fast-slow reasoning, experience repository, scene generalization, instruction style conversion
TL;DR¶
For the task of general scene-adaptive vision-language navigation (GSA-VLN) in open environments, inspired by Kahneman's dual-process cognitive theory, this paper proposes the slow4fast-VLN framework. A fast reasoning module performs real-time navigation via an end-to-end policy network while accumulating historical memory; a slow reasoning module leverages LLM-based reflection to generate structured, generalizable experience entries. These experiences are fed back into the fast reasoning network via attention-based fusion, enabling continuous adaptation to unseen environments and diverse instruction styles. The proposed framework achieves comprehensive improvements over the previous SOTA (GR-DUET) on the GSA-R2R dataset.
Background & Motivation¶
- Background: Vision-Language Navigation (VLN) is a foundational task in embodied AI. Conventional approaches such as DUET operate under a closed-set assumption—training and test data share the same environmental styles and instruction forms. More recently, GR-DUET introduced the GSA-VLN task, incorporating 150 scenes across 20 architectural types, distinguishing in-distribution from out-of-distribution (OOD) scenarios, and defining three instruction styles (Basic, Scene, User) to address visual-level scene adaptation.
- Limitations of Prior Work: (a) When transferring from familiar to OOD environments, agents produce erroneous reasoning paths analogous to hallucinations and fail to recognize their own limitations. (b) Existing fast-slow systems treat the two subsystems as independent parallel modules—experience generated by slow reasoning cannot be incorporated into the fast reasoning policy network, leaving fast reasoning perpetually at its initial capability level and requiring repeated invocation of slow reasoning for similar scenes. (c) GR-DUET focuses exclusively on visual-level scene adaptation, neglecting adaptation to diverse instruction styles.
- Key Challenge: In open-world settings, generalized experience cannot be compressed into low-latency intuitive response patterns. The absence of information exchange between fast and slow systems means the agent behaves as a perpetual novice in OOD scenes, undermining its generalization and adaptation capacity.
- Goal: (1) How can dynamic interaction between fast and slow reasoning be realized so that slow-thinking experience continuously enhances fast thinking? (2) How can the agent adapt to heterogeneous instruction styles?
- Key Insight: Inspired by Kahneman's System 1/System 2 theory in Thinking, Fast and Slow, the true value of slow thinking lies not in one-shot resolution of complex problems, but in generating generalizable strategies to augment the fast thinking system.
- Core Idea: Construct a dynamic fast-slow reasoning interaction framework in which slow reasoning reflects on navigation history to distill structured experiences into an experience repository; these experiences are then fused into the visual features of the fast reasoning network via attention mechanisms, enabling experience-driven navigation decisions.
Method¶
Overall Architecture¶
The framework is formalized as \(\mathcal{F}=\langle\pi,R,M,A\rangle\): \(\pi\) denotes the fast reasoning policy network (based on the DUET architecture), \(R\) is the reflection function, \(M\) is the experience extraction and storage module, and \(A\) is the fast reasoning enhancement module. The pipeline for each episode \(k\) proceeds as follows: the policy network executes navigation → navigation history is stored → slow reasoning reflects → structured experience is extracted → experience fusion enhances the policy network. An additional instruction style conversion module handles diverse instruction inputs.
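At a code level, one episode of this loop can be sketched roughly as follows. The class and method names (`HistoryEntry`, `policy.act`, `reflector.reflect`, `experience_repo`, `fuser.enhance`, the `env` interface) are hypothetical placeholders standing in for \(\pi\), \(R\), \(M\), and \(A\); they are illustrative assumptions, not the authors' actual interfaces.

```python
# Minimal sketch of the per-episode fast-slow loop (illustrative names; not the authors' code).
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class HistoryEntry:
    """One step of the trajectory L(t_j) stored for slow reasoning (fields as listed in the summary)."""
    timestamp: float
    step: int
    viewpoint: str
    local_topology: Dict[str, Any]   # neighboring / navigable nodes
    instruction: str
    action: str
    visual_description: str          # generated by a VLM (Llama3.2-Vision in the paper)
    step_metrics: Dict[str, float]   # e.g., distance-to-goal after the step

def run_episode(policy, reflector, experience_repo, fuser, env, instruction):
    """pi navigates -> history is stored -> R reflects -> experience enters M -> A enhances pi."""
    history: List[HistoryEntry] = []
    obs = env.reset(instruction)
    done, step = False, 0
    while not done:
        # A: fuse retrieved experience into the observation features before acting (see Method below).
        fused_obs = fuser.enhance(obs, experience_repo.retrieve(obs))
        action = policy.act(instruction, fused_obs)           # fast reasoning (DUET-style policy)
        obs, done, info = env.step(action)
        history.append(HistoryEntry(info["time"], step, info["viewpoint"],
                                    info["topology"], instruction, action,
                                    info["caption"], info["metrics"]))
        step += 1
    # R: offline LLM reflection distills structured experience entries E from the trajectory.
    experiences = reflector.reflect(history)
    experience_repo.add(experiences)                          # M: bounded repository of capacity K
    return history
```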
Key Designs¶
- Fast Reasoning Module:
  - Function: Executes navigation actions based on real-time inputs and accumulates historical memory.
  - Mechanism: Adopts the DUET architecture as the policy network \(\pi\). Inputs include the instruction, panoramic observations (panoramic images, GPS positions, neighboring node information), and historical navigation data. A topological mapping module dynamically constructs and updates the map (visited/navigable/current nodes). A global action planning module performs dual-scale encoding (coarse-scale for global navigation scores, fine-scale for local action generation), and a dynamic fusion module computes weights to select the highest-scoring node. Each node employs Llama3.2-Vision to generate visual-textual descriptions. The resulting historical trajectory \(\mathcal{L}(t_j)\) contains complete information including timestamps, step indices, viewpoints, local topology, instructions, selected actions, visual descriptions, and step-level metrics, and is stored in a history repository for use by slow reasoning.
  - Design Motivation: End-to-end policy networks operate at high speed and are well-suited for familiar scenes, but lack explicit slow-cognitive modeling for OOD scenarios.
- Slow Reasoning Module:
  - Function: Transforms fast reasoning's historical memory into structured, generalizable experience.
  - Mechanism: Defines the experience structure \(\mathcal{E}=[S_t, C_s, R_s, T_n, \eta_s, f]^{\top}\), where \(S_t\) denotes scene type, \(C_s\) spatial context, \(R_s\) spatial rules, \(T_n\) navigation strategy, \(\eta_s\) historical success rate, and \(f\) occurrence frequency. A structured Chain-of-Thought reflection prompt template \(\mathcal{P}\) (comprising four modules: role definition, context filling, task decomposition, and output format constraints) guides the LLM to analyze and extract generalizable experience from navigation data: \(\mathcal{E} = \mathcal{F}_{LLM}(\mathcal{P}(\mathcal{X}))\). Extracted experiences are stored in an experience repository of capacity \(K\).
  - Design Motivation: Slow thinking should not serve as a one-shot solution; its genuine value lies in producing generalizable strategies that augment fast thinking. Through the LLM's deep reflective capacity, reusable scene rules and navigation strategies are extracted from both successful and failed navigation histories.
- Fast-Slow Interaction Module:
  - Function: Integrates slow reasoning experience into the fast reasoning network to enable experience-driven decision-making.
  - Mechanism: (a) Experience Retrieval: A retrieval key \(\mathcal{K}=[S_t^{cur}, C_s^{cur}, T_n^{cur}]\) is extracted from the current context \(\mathcal{X}_{cur}\); feature similarity is computed against all entries in the experience repository, and the \(M\) most relevant entries exceeding threshold \(\tau_{retrieve}\) are selected. (b) Experience Encoding: An encoder \(G_{enc}\) transforms discrete features into vector representations \(F_e(k) \in \mathbb{R}^d\) via embedding layers and linear projections. (c) Experience Fusion: Visual features \(F_v\) serve as Query and experience features \(F_e^{exp}\) serve as Key/Value in a multi-head attention computation to obtain \(F_{att}\); \(F_v\) and \(F_{att}\) are concatenated and projected back to the original dimension via a linear layer to produce \(F_{fused}\), which replaces the original visual features in the policy network to yield experience-augmented navigation decisions. (A minimal code sketch of this retrieval-encoding-fusion pipeline is given after this list.)
  - Design Motivation: This constitutes the paper's most critical contribution—integrating experience encodings into the visual feature space via attention mechanisms, enabling the fast reasoning network to draw on accumulated scene knowledge in addition to real-time observations for more robust decision-making.
- Instruction Style Conversion Module:
  - Function: Dynamically converts Scene- and User-style instructions into the Basic style familiar to the model.
  - Mechanism: An LLM, guided by CoT prompt engineering, automatically identifies and converts stylistic characteristics in the instruction while preserving core navigation semantics. A conversion confidence score is computed; if it exceeds the threshold, the converted instruction is used; otherwise, the original is retained. The conversion is applied in real time during training and navigation without additional pretraining.
  - Design Motivation: GR-DUET addresses only visual-level scene adaptation, overlooking the diversity of instruction styles. Expressions from different users (e.g., children or domain-specific personas) vary considerably; normalizing all inputs to the Basic style reduces comprehension difficulty.
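To make the retrieval-encoding-fusion pipeline of the Fast-Slow Interaction Module concrete, here is a minimal PyTorch-style sketch. The module names (`ExperienceEncoder`, `ExperienceFusion`, `retrieve`), the feature dimensions, and the use of cosine similarity are illustrative assumptions based on the description above, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExperienceEncoder(nn.Module):
    """G_enc: maps one experience entry E = [S_t, C_s, R_s, T_n, eta_s, f] to a vector F_e in R^d."""
    def __init__(self, num_scene_types: int, text_dim: int, d: int = 768):
        super().__init__()
        self.scene_emb = nn.Embedding(num_scene_types, d)   # S_t (discrete scene type)
        self.text_proj = nn.Linear(text_dim, d)             # C_s / R_s / T_n (pre-embedded text fields)
        self.stat_proj = nn.Linear(2, d)                    # [eta_s, f] (success rate, frequency)
        self.out = nn.Linear(3 * d, d)

    def forward(self, scene_id, text_feat, stats):
        z = torch.cat([self.scene_emb(scene_id), self.text_proj(text_feat),
                       self.stat_proj(stats)], dim=-1)
        return self.out(z)                                   # F_e(k) in R^d

class ExperienceFusion(nn.Module):
    """Multi-head attention: visual features F_v as Query, experience features F_e^exp as Key/Value."""
    def __init__(self, d: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.proj = nn.Linear(2 * d, d)                      # W_fusion, b_fusion

    def forward(self, F_v, F_e_exp):
        # F_v: (B, N_views, d); F_e_exp: (B, M, d) retrieved experience vectors
        F_att, _ = self.attn(query=F_v, key=F_e_exp, value=F_e_exp)
        F_fused = self.proj(torch.cat([F_v, F_att], dim=-1))
        return F_fused                                       # replaces F_v inside the policy network

def retrieve(repo_feats: torch.Tensor, key_feat: torch.Tensor, M: int = 5, tau: float = 0.5):
    """Select up to M entries whose similarity to the retrieval key exceeds tau (cosine similarity assumed)."""
    sims = F.cosine_similarity(repo_feats, key_feat.unsqueeze(0), dim=-1)   # (K,)
    keep = sims >= tau
    k = min(M, int(keep.sum()))
    return sims.masked_fill(~keep, float("-inf")).topk(k).indices
```

Concatenating \(F_v\) with \(F_{att}\) and projecting back to the original dimension keeps the fused features shape-compatible with the DUET backbone, so experience augmentation requires no architectural change to the policy network.

Similarly, the confidence-gated behavior of the Instruction Style Conversion module can be sketched as below; the prompt text, the `llm` interface, and the threshold value are placeholders, not details taken from the paper.

```python
# Assumed CoT conversion prompt; the actual template is not given in the summary.
COT_CONVERSION_PROMPT = ("Rewrite the following navigation instruction in plain, Basic style, "
                         "preserving its core navigation semantics: {instruction}")

def normalize_instruction(llm, instruction: str, tau_conf: float = 0.8) -> str:
    """Convert a Scene-/User-style instruction to Basic style; keep the original if confidence is low."""
    prompt = COT_CONVERSION_PROMPT.format(instruction=instruction)
    converted, confidence = llm.generate_with_confidence(prompt)   # hypothetical LLM interface
    return converted if confidence >= tau_conf else instruction
```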
Loss & Training¶
The fast reasoning module follows DUET's training objectives, including global and local action prediction losses. The slow reasoning module involves no gradient-based training and operates as an LLM inference pipeline. The experience fusion module requires training of the fusion layer parameters (\(W_{fusion}\), \(b_{fusion}\)) and the experience encoder parameters.
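Schematically, and only as an assumption about the form (the summary gives no exact weighting), the trainable objective and parameter set can be written as

\[
\mathcal{L}_{total} = \mathcal{L}_{global} + \lambda\, \mathcal{L}_{local}, \qquad
\Theta_{train} = \Theta_{\pi} \cup \{W_{fusion},\, b_{fusion}\} \cup \Theta_{G_{enc}},
\]

where \(\lambda\) is an assumed coefficient balancing DUET's coarse-scale (global) and fine-scale (local) action prediction losses.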
Key Experimental Results¶
Main Results¶
GSA-R2R Basic Instructions (Environment Adaptation):
| Method | Test-R-Basic SR↑ | Test-R-Basic SPL↑ | Test-N-Basic SR↑ | Test-N-Basic SPL↑ |
|---|---|---|---|---|
| DUET (baseline) | 57.7 | 47.0 | 48.1 | 37.3 |
| GR-DUET | 69.3 | 64.3 | 56.6 | 51.5 |
| slow4fast-VLN | 70.8 | 65.0 | 58.4 | 52.9 |
GSA-R2R Scene Instructions:
| Method | Test-N-Scene SR↑ | SPL↑ | nDTW↑ |
|---|---|---|---|
| GR-DUET | 48.1 | 42.8 | 53.7 |
| slow4fast-VLN | 50.7 | 46.6 | 57.8 |
GSA-R2R User Instructions: slow4fast-VLN outperforms GR-DUET across all five persona styles.
Ablation Study¶
| FSR | ISC | Test-R-Basic SR | Test-N-Basic SR | Test-N-Scene SR |
|---|---|---|---|---|
| × | × | 64.0 | 53.7 | 42.4 |
| × | ✓ | 64.0 | 53.7 | 46.1 |
| ✓ | × | 69.1 | 58.4 | 47.9 |
| ✓ | ✓ | 69.1 | 58.4 | 50.4 |
Experience Repository Capacity \(K\) Analysis: \(K<50\) yields insufficient experience; \(K>100\) introduces redundant interference; the optimal range is 50–100.
Key Findings¶
- FSR (Fast-Slow Reasoning framework) contributes the most: Adding FSR improves Test-R-Basic SR from 64.0 to 69.1 (+5.1 points), with consistent gains across all instruction types.
- ISC (Instruction Style Conversion) is most effective on Scene instructions: It exclusively benefits non-Basic instruction styles (Test-N-Scene SR: 42.4→46.1), consistent with design expectations.
- Complementary effect of both modules: The combination achieves the best Test-N-Scene result of 50.4, a further improvement over FSR-only (47.9).
- Case Study: On the first navigation attempt, lacking experience, the agent takes a wrong turn at a multi-branch corridor and misidentifies a target in a restaurant, taking 15 seconds and ending 1.5 m from the goal. After four iterations of experience accumulation, the fifth navigation run reduces the time to 8 seconds (−46.7%) and the error to 0.3 m (−80%).
Highlights & Insights¶
- Closed-loop fast-slow design: Rather than simply dispatching fast and slow systems to handle tasks of different difficulty in parallel, slow reasoning experience genuinely "reshapes" the fast reasoning decision process via attention-based fusion. This enables the system to evolve over time—the more navigation experience is accumulated, the stronger fast reasoning becomes, progressively reducing reliance on slow reasoning.
- Structured experience design (scene type + spatial context + spatial rules + navigation strategy + success rate + frequency) is highly practical. It constrains the LLM's free-form textual output into retrievable, encodable vectorized knowledge, preserving the richness of spatial and scene knowledge while ensuring engineering usability.
- Instruction style conversion serves as a lightweight preprocessing step that normalizes diverse instructions into a model-familiar Basic style via CoT prompting—a simple yet effective engineering technique transferable to any instruction-following task.
Limitations & Future Work¶
- The experience repository has limited capacity (\(K\) between 50 and 100), which may be insufficient for extremely diverse large-scale scenes. Hierarchical or scalable experience organization strategies warrant investigation.
- Slow reasoning depends on an LLM (Llama3.2-Vision), and inference latency during real-time navigation may become a bottleneck. The paper does not discuss inference efficiency in detail.
- Experience retrieval relies on simple feature similarity matching, which may retrieve incorrect entries for scenes that are semantically similar but structurally distinct. More sophisticated retrieval strategies (e.g., contrastive learning, graph neural networks) deserve exploration.
- Experiments are conducted solely on the GSA-R2R dataset; generalization to other VLN benchmarks (e.g., RxR, REVERIE) remains unknown.
- Visual descriptions are generated by Llama3.2-Vision, and their quality directly affects the experience extraction outcome.
Related Work & Insights¶
- vs. GR-DUET: GR-DUET addresses scene adaptation from a visual perspective but neglects instruction style adaptation. slow4fast-VLN enhances both scene adaptation through fast-slow interaction and instruction adaptation through style conversion.
- vs. TourHAMT / OVER-NAV: These memory-augmented methods perform poorly on GSA-VLN tasks (SR < 25%), demonstrating that naive memory mechanisms are insufficient for OOD generalization. The key insight is that memories must be refined through reflection into structured experience.
- vs. Conventional fast-slow systems: Existing approaches (e.g., using LLMs as a slow reasoner to independently handle complex tasks) adopt a "parallel division of labor" paradigm. slow4fast-VLN instead employs a "feedback enhancement" paradigm—the value of slow reasoning lies in continuously improving the capacity of fast reasoning.
Rating¶
- Novelty: ⭐⭐⭐⭐ The fast-slow reasoning interaction framework is original, and the retrieval–encoding–fusion pipeline for the experience repository is systematically designed.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers three instruction style types with comprehensive ablations and detailed case studies, though evaluation is limited to a single dataset.
- Writing Quality: ⭐⭐⭐⭐ Well-structured with clear motivation; the case study is vivid and intuitive.
- Value: ⭐⭐⭐⭐ The engineering realization of fast-slow cognition offers practical reference value and is applicable to embodied intelligence scenarios requiring online adaptation.