NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation
- Conference: ICCV 2025
- arXiv: 2510.16457
- Code: https://github.com/woyut/NavQ_ICCV25
- Area: Reinforcement Learning / Vision-and-Language Navigation
- Keywords: Vision-and-Language Navigation, Q-learning, Foresighted Decision-Making, A* Search, Self-Supervised Pretraining
TL;DR
This paper proposes NavQ, a foresighted VLN agent that employs a Q-model to predict, in a single forward pass, a Q-feature for each candidate action: an aggregated representation of the long-horizon future observations reachable along that direction. Combined with an A*-style search strategy, NavQ achieves significant improvements on object-goal navigation benchmarks (REVERIE and SOON).
Background & Motivation
Background: Object-goal Vision-and-Language Navigation (VLN) requires an agent to navigate in real 3D environments based on target object descriptions. Existing methods (e.g., DUET, HAMT) primarily make single-step decisions conditioned on historical observations, lacking the ability to anticipate the future consequences of actions.
Limitations of Prior Work:
- Local methods (e.g., synthesizing neighboring views via NeRF or diffusion models) predict only one step ahead and fail to capture long-range semantic information.
- World-model methods (e.g., DreamWalker) support multi-step rollouts but incur high computational cost, and predictions in RGB space are prone to distortion and overfitting.
- Object-goal VLN is fundamentally a search problem; A* demonstrates that a well-designed heuristic function (an estimate of future cost) can substantially improve search efficiency, yet effective future-cost heuristics are lacking in the VLN literature.
Key Challenge: Long horizon vs. efficiency — how can long-range future information be obtained in a single pass without costly step-by-step rollouts?
Key Insight: Inspired by the Q-learning principle that a Q-value accumulates future rewards, this work replaces scalar Q-values with Q-features, aggregated representations of future observations, and decouples them from any reward function so that the Q-model can be pretrained in a self-supervised manner on large amounts of unannotated trajectories.
Method
Overall Architecture
NavQ augments the DUET baseline with a future branch running in parallel with the history-based Global Encoder (GE):
1. Q-model: generates a Q-feature for each candidate action by aggregating potential future observations along that direction.
2. Future Encoder (FE): lets the task-agnostic Q-features interact with the textual instruction to produce goal-directed future scores.
3. Score fusion: the future scores are fused with the historical scores, realizing an A*-style balance between past progress and future prospects (see the sketch after this list).
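To make the fusion step concrete, here is a minimal sketch assuming a simple weighted combination of per-candidate logits; the function `fuse_scores` and the fixed weight `lam` are illustrative assumptions, not the paper's exact fusion mechanism.

```python
# Minimal sketch of A*-style score fusion per candidate action.
# `fuse_scores` and the fixed weight `lam` are illustrative assumptions.
import torch

def fuse_scores(history_logits: torch.Tensor,
                future_logits: torch.Tensor,
                lam: float = 0.5) -> torch.Tensor:
    """Combine GE scores (past progress) with FE scores (future prospects)."""
    return (1.0 - lam) * history_logits + lam * future_logits

# Example: 5 candidate actions on the topological map.
history_logits = torch.randn(5)  # from the history-based Global Encoder
future_logits = torch.randn(5)   # from the Future Encoder over Q-features
best_action = fuse_scores(history_logits, future_logits).argmax().item()
```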
Key Designs
- Q-function Definition: \(Q(T, a) = R(\mathcal{A}) + \gamma \, \mathbb{E}_{a' \sim \pi}[Q(T \cup \{\mathcal{A}\}, a')]\), where \(T\) is the set of visited nodes and \(\mathcal{A}\) is the node reached by executing action \(a\).
Unlike conventional Q-learning, where \(Q\) is a scalar expected return, here \(R(\cdot)\) denotes the text-description feature of a node (generated by BLIP or a similar model) and \(Q\) outputs a feature vector. This decouples the reward from the navigation instruction, enabling pretraining on unannotated environments.
- Rollout Strategy Design: With a random rollout policy, a future node reachable from several candidate actions would contribute to all of their Q-features, leaving the Q-features insufficiently discriminative across candidates. The authors therefore impose a shortest-path preference: each rollout must follow the shortest path from the current node, which ensures that every future node's feature is accumulated into exactly one candidate action's Q-feature.
Expanded into an explicit formula: \(Q(T, a) = \sum_{N, t} P_\pi(N, t | T, a) \cdot \gamma^t \cdot R(N)\)
No RL techniques (e.g., TD error) are required; supervision signals are computed by directly enumerating all reachable nodes on the finite graph.
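A minimal sketch (an assumption-laden illustration, not the released code) of how these supervision targets can be computed by enumeration: run a breadth-first search from the current node, assign every reachable node to the candidate action that starts its shortest path, and accumulate its feature with the appropriate discount.

```python
# Minimal sketch: enumerate reachable nodes on a finite navigation graph and
# accumulate their discounted features into the Q-feature of the unique
# candidate action lying on their shortest path. Graph format, tie-breaking,
# and unit edge lengths are simplifying assumptions for illustration.
from collections import deque
import numpy as np

def q_feature_targets(graph, feat, cur, gamma=0.5):
    """graph: adjacency dict {node: [neighbors]}, feat: {node: R(N) vector}."""
    dist, first_hop = {cur: 0}, {}
    queue = deque()
    for a in graph[cur]:                  # candidate actions = neighbors of cur
        dist[a], first_hop[a] = 1, a
        queue.append(a)
    while queue:                          # BFS: first visit = shortest path
        n = queue.popleft()
        for m in graph[n]:
            if m not in dist:
                dist[m], first_hop[m] = dist[n] + 1, first_hop[n]
                queue.append(m)
    q = {a: np.zeros_like(feat[cur]) for a in graph[cur]}
    for n, t in dist.items():
        if n == cur:
            continue
        # a node t hops away contributes gamma^(t-1) * R(N); the candidate
        # node itself (t = 1) gets weight 1, matching Q = R(A) + gamma * (...)
        q[first_hop[n]] += (gamma ** (t - 1)) * feat[n]
    return q

# Toy example: 4 nodes with 3-dimensional "text description" features.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
feat = {n: np.random.randn(3) for n in graph}
print(q_feature_targets(graph, feat, cur=0))
```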
- Prediction in Text Feature Space (Improved Generalization): Rather than predicting in RGB visual space, where style and texture information can induce spurious correlations, the Q-model predicts in the feature space of textual descriptions, focusing on high-level semantic relationships. For each node, BLIP generates text descriptions for its 36 views; their averaged text features serve as \(R(N)\).
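A minimal sketch of how \(R(N)\) could be assembled from per-view captions; `caption_view` and `encode_text` are hypothetical stand-ins for the BLIP captioner and a text encoder, not functions from the paper's code.

```python
# Minimal sketch: build the node feature R(N) by captioning each of the 36
# panoramic views and averaging the text features of those captions.
# `caption_view` and `encode_text` are hypothetical callables.
import numpy as np

def node_text_feature(view_images, caption_view, encode_text):
    """view_images: the 36 discrete views observed at one graph node."""
    captions = [caption_view(img) for img in view_images]       # e.g., BLIP captions
    text_feats = np.stack([encode_text(c) for c in captions])   # one vector per caption
    return text_feats.mean(axis=0)                              # averaged feature = R(N)
```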
- MAE Warm-Up: The Q-model is first pretrained with an MAE-style objective (masked reconstruction of trajectory tokens) to provide a strong initialization for Q-learning.
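A minimal sketch of what such a masked-reconstruction warm-up might look like, assuming random token masking and an MSE loss on the masked positions; the mask ratio, backbone, and dimensions are illustrative, not the authors' exact setup.

```python
# Minimal sketch: mask a fraction of trajectory node tokens and train the
# Q-model backbone to reconstruct them (MSE on masked positions only).
# Mask ratio, backbone, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

def mae_warmup_loss(backbone: nn.Module, traj_tokens: torch.Tensor,
                    mask_ratio: float = 0.5) -> torch.Tensor:
    """traj_tokens: (batch, seq_len, dim) node features along a trajectory."""
    mask = torch.rand(traj_tokens.shape[:2], device=traj_tokens.device) < mask_ratio
    masked_input = traj_tokens.masked_fill(mask.unsqueeze(-1), 0.0)  # drop masked tokens
    recon = backbone(masked_input)                                   # reconstruct all positions
    return ((recon - traj_tokens)[mask] ** 2).mean()                 # loss on masked tokens only

# Usage with a generic transformer encoder as a stand-in backbone.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True), num_layers=2)
loss = mae_warmup_loss(backbone, torch.randn(4, 10, 768))
```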
- Future Encoder + Progress Prediction: The FE is a 4-layer Graph Transformer with the same architecture as the GE. To enforce a functional decomposition between GE and FE, the GE output is fed to an MLP that predicts distance traveled (historical progress), while the FE output is fed to an MLP that predicts remaining distance (future goal). This directly corresponds to the \(g(n) + h(n)\) decomposition in A* (see the sketch after this item).
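A minimal sketch of the two progress heads that enforce this split; layer sizes and names are assumptions, while the regression targets (traveled and remaining path lengths) follow the description above.

```python
# Minimal sketch: two MLP heads enforcing the A*-style g(n)/h(n) split.
# The GE output regresses distance already traveled; the FE output regresses
# remaining distance to the goal. Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class ProgressHeads(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.traveled_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.remaining_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, ge_out: torch.Tensor, fe_out: torch.Tensor):
        g = self.traveled_head(ge_out).squeeze(-1)    # history branch: cost so far, g(n)
        h = self.remaining_head(fe_out).squeeze(-1)   # future branch: estimated cost to go, h(n)
        return g, h

# Training would regress g and h against ground-truth traveled / remaining
# path lengths (e.g., with an MSE loss), pushing GE and FE toward their roles.
heads = ProgressHeads()
g, h = heads(torch.randn(2, 768), torch.randn(2, 768))
```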
Loss & Training
Three-stage training pipeline (summarized in the sketch after this list):
- Stage 1: self-supervised Q-model pretraining (MSE loss, 30k steps, lr = 1e-5); requires only random trajectories without annotations.
- Stage 2: agent pretraining (MLM + SAP + OG + MRC + progress prediction, 100k steps, lr = 5e-5); the Q-model is frozen.
- Stage 3: agent online fine-tuning (DAgger + pseudo-expert policy, 20k steps, lr = 1e-5).
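For reference, the schedule can be summarized as a plain config; the values come from the stages above, while the key names are illustrative.

```python
# Illustrative config summarizing the three-stage schedule; values are taken
# from the stages listed above, key names are illustrative assumptions.
TRAINING_STAGES = {
    "stage1_q_model_pretrain": {"objective": "MSE on Q-feature targets",
                                "steps": 30_000, "lr": 1e-5,
                                "data": "random unannotated trajectories"},
    "stage2_agent_pretrain": {"objective": "MLM + SAP + OG + MRC + progress prediction",
                              "steps": 100_000, "lr": 5e-5, "q_model": "frozen"},
    "stage3_online_finetune": {"objective": "DAgger with pseudo-expert policy",
                               "steps": 20_000, "lr": 1e-5},
}
```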
Key Experimental Results
Main Results: REVERIE Dataset
| Method | SR↑ | SPL↑ | RGS↑ | RGSPL↑ |
|---|---|---|---|---|
| DUET (baseline) | 46.98 | 33.73 | 32.15 | 23.03 |
| GOAT (CVPR24) | 53.37 | 36.70 | 38.43 | 26.09 |
| VER (CVPR24) | 55.98 | 39.66 | 33.71 | 23.70 |
| NavQ (Ours) | 53.22 | 38.89 | 36.84 | 27.12 |
| NavQ (w. extra scenes) | 54.10 | 39.22 | 37.57 | 27.29 |
Compared to the DUET baseline, NavQ achieves consistent improvements across all metrics: SR +6.24, SPL +5.16, RGSPL +4.09. On the RGSPL metric, NavQ surpasses all comparison methods. Incorporating additional unannotated scenes yields further gains.
Ablation Study
| Configuration | OSR | SR | SPL | RGS | RGSPL |
|---|---|---|---|---|---|
| (1) No future branch (baseline) | 54.42 | 48.14 | 33.38 | 30.19 | 21.05 |
| (2) FE only, no QM | 54.84 | 48.20 | 33.92 | 32.52 | 23.14 |
| (3) QM only, no FE | 53.25 | 48.48 | 32.22 | 33.03 | 21.86 |
| (4) QM+FE, no progress loss | 55.98 | 51.55 | 35.79 | 34.51 | 23.81 |
| (5) Full NavQ | 60.47 | 53.22 | 38.89 | 36.84 | 27.12 |
| (6) GT Q-feature, no FE | 60.18 | 54.36 | 41.71 | 37.03 | 28.59 |
| (7) GT Q-feature + FE | 65.38 | 59.27 | 47.04 | 39.68 | 31.62 |
Key observations: (a) QM and FE must work in conjunction; either alone yields limited improvement; (b) the progress loss contributes significantly to the functional decomposition; (c) the upper-bound analysis with GT Q-features validates the Q-feature design (including the shortest-path rollout policy and the discount factor), with SPL improving by up to roughly 14 points over the baseline (47.04 vs. 33.38).
| Discount factor γ | OSR | SR | RGSPL |
|---|---|---|---|
| 0 (neighbor only) | 56.66 | 51.12 | 25.16 |
| 0.3 | 59.73 | 51.95 | 26.61 |
| 0.5 | 60.47 | 53.22 | 27.12 |
| 0.7 | 57.06 | 50.89 | 23.85 |
\(\gamma=0\) degrades to single-step world model performance; \(\gamma=0.5\) achieves the best balance between feature quality and training difficulty.
Key Findings
- Training the Q-model on unannotated data outperforms training solely on annotated trajectories (better generalization), with the difference in data scale being a key factor.
- The Q-model is task-agnostic: the same pretrained model is effective on both REVERIE and SOON datasets.
- Results on SOON: OSR +7.88, SR +2.81, SPL +4.07 (val unseen), further confirming generalization capability.
Highlights & Insights
- Conceptual leap from Q-value to Q-feature: Generalizing the scalar value function in reinforcement learning to a vector feature function eliminates reward engineering and is naturally compatible with self-supervision.
- The shortest-path rollout strategy ensures each future node is assigned to exactly one candidate action, giving Q-features strong discriminability — a critical design insight.
- Grounding A* in VLN: GE corresponds to \(g(n)\) (cost so far), FE corresponds to \(h(n)\) (estimated remaining cost), with functional separation enforced through progress supervision.
Limitations & Future Work
- The current approach is based on discrete navigation graphs and has not been extended to continuous environments.
- A significant gap remains between Q-model predictions and GT Q-features (upper bound): SPL 38.89 vs. 47.04. Improving the expressiveness of the Q-model is the primary direction for future improvement.
- Textual descriptions depend on the quality of image captioning models (BLIP), which may introduce noise.
Related Work & Insights
- The approach shares conceptual similarity with Q* [Wang 2024], which combines Q-learning and A* search for LLM reasoning, but NavQ targets embodied navigation and employs feature vectors rather than scalar Q-values.
- VLV [Chang 2020] also applies Q-learning to navigation but is constrained by closed-set category labels; NavQ achieves task-agnostic general modeling through Q-features.
- NavQ is complementary to world model approaches (DreamWalker, PathDreamer): those methods perform step-by-step rollouts in RGB space, whereas NavQ aggregates information in latent space in a single pass.
Rating
- Novelty: ⭐⭐⭐⭐⭐ — Conceptual innovation of Q-value to Q-feature; grounded A* design
- Technical Depth: ⭐⭐⭐⭐ — Theoretical analysis of rollout strategy; three-stage training
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — REVERIE + SOON; multi-dimensional ablations; GT upper-bound analysis; γ sensitivity analysis
- Practical Value: ⭐⭐⭐⭐ — Applicable to intent-level navigation scenarios such as household assistants