
Recursive Visual Imagination and Adaptive Linguistic Grounding for Vision Language Navigation

Conference: AAAI 2026
arXiv: 2507.21450
Code: None (authors indicate public release after anonymous review; not yet available)
Area: Embodied Intelligence / Vision-Language Navigation
Keywords: VLN, scene representation, language grounding, visual imagination, neural grid, contrastive learning

TL;DR

This paper proposes a VLN policy built on an Implicit Scene Representation (ISR) that compresses the trajectory history into a fixed-size, compact neural grid. Recursive Visual Imagination (RVI) trains the grid to capture high-level scene priors, and Adaptive Linguistic Grounding (ALG) finely aligns the different semantic components of navigation instructions with different grid cells. The approach achieves state-of-the-art performance on two continuous-environment navigation benchmarks, R2R-CE and MP3D ObjectNav.

Background & Motivation

Vision-Language Navigation (VLN) requires an agent to navigate to a target location or object in an unknown 3D environment following natural language instructions. The central challenge lies in organizing historical visual observations and aligning them with language instructions.

Existing methods suffer from two fundamental limitations. First, scene representation redundancy: approaches based on BEV maps, 3D feature fields, or topological graphs retain excessive geometric and texture details. An agent navigating to "turn left at the sofa at the end of the corridor" should attend to sofa semantics and the turn signal, not wall textures or corridor geometry. Research in behavioral psychology and neuroscience suggests that animals maintain high-level spatial representations rather than precise geometric memories during navigation.

Second, coarse visual-language alignment: existing methods align instructions with scene representations at the sentence level via standard cross-modal attention, without distinguishing how different semantic components in the instruction—landmarks (sofa), actions (turn left), orientations (right side)—correspond to different parts of the scene representation. This ambiguous alignment makes it difficult for agents to accurately track navigation progress.

Inspired by the distinct neural roles of the hippocampus (memory) and cerebellum (motor control), this work proposes that different spatial positions in the ISR grid adaptively correspond to different semantic components of the instruction. The core idea is to model historical trajectories as a compact neural grid, learn high-level semantic priors via visual imagination, and achieve precise instruction tracking through fine-grained language grounding.

Method

Overall Architecture

Scene representation learning is formulated as a sequence modeling problem, and a joint state-action transformer is trained under a behavior cloning framework. The input consists of panoramic RGB-D images, pose estimates, and the previous action; the output is a navigation action prediction. The system comprises three core modules: ISR (compact scene representation), RVI (recursive visual imagination), and ALG (adaptive linguistic grounding).
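
To make the pipeline concrete, below is a minimal PyTorch-style sketch of what a single step of such a joint state-action transformer could look like. All module names, sizes, and the action-readout choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VLNPolicyStep(nn.Module):
    """Illustrative one-step policy: fuse observation tokens, the ISR grid,
    and instruction tokens, then predict the next action."""
    def __init__(self, d=512, n_actions=4, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d, n_actions)

    def forward(self, obs_tokens, grid_tokens, text_tokens, prev_action_emb):
        # Concatenate all modalities into one token sequence; the grid keeps
        # a fixed h*w token budget regardless of trajectory length.
        x = torch.cat([obs_tokens, grid_tokens, text_tokens,
                       prev_action_emb.unsqueeze(1)], dim=1)
        x = self.fuser(x)
        # Read the action prediction off the previous-action slot.
        return self.action_head(x[:, -1])
```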

Key Designs

  1. Implicit Scene Representation (ISR)

    • Function: Compresses historical trajectories into a fixed-size compact representation.
    • Mechanism: Historical observations are modeled as an \(h \times w\) neural grid \(M^t = [m_{ij}^t]_{h \times w}\) (default \(h=w=10\)), where each cell is a \(d=512\)-dimensional feature vector. The grid is initialized with positional encoding \(m_{ij}^0 = w_m^0 + \text{MLP}([i-h/2, j-w/2])\) and updated at each step via a multi-layer transformer interacting with new observations. A key advantage is that the number of grid cells is a fixed hyperparameter, independent of trajectory length.
    • Design Motivation: Unlike topological graphs or BEV maps, ISR retains no explicit geometric details, naturally filtering redundant information. The fixed token count avoids growing computational costs over long sequences. The term "grid" (rather than "voxel") emphasizes relative positional encoding, laying the groundwork for ALG to assign different grid positions to different instruction components. (A minimal sketch of the grid appears after this list.)
  2. Recursive Visual Imagination (RVI)

    • Function: Extracts navigation-friendly high-level scene priors from the ISR.
    • Mechanism: RVI comprises three auxiliary tasks. (a) View Imagination (VI): given a query pose, predicts the visual features at that location from the ISR. A contrastive loss \(\mathcal{L}_{Con}\) establishes pose–visual correspondences; for future poses (\(t' > t\)), a VAE-based prior–posterior KL term yields \(\mathcal{L}_{VF} = \mathcal{L}_{Con} + \beta\,\text{KL}[q_\vartheta \| p_\vartheta]\), learning the distribution of future frames rather than a deterministic rendering. (b) Scene Layout Imagination (SLI): predicts an egocentric \(32 \times 32\) local semantic map from the ISR, where each pixel covers \(20\,\text{cm} \times 20\,\text{cm}\), supervised with a BCE loss. (c) Visual Semantic Prediction (VSP): predicts the presence and proportion of object categories in the current field of view. (A sketch of the VI loss follows this list.)
    • Design Motivation: VI trains the agent to learn the regularities of visual transitions rather than memorizing raw frames—"recalling the past and imagining the future." SLI enhances sensitivity to surrounding landmark semantics and relative spatial layout. VSP improves the visual encoder's responsiveness to semantic content.
  3. Adaptive Linguistic Grounding (ALG)

    • Function: Finely aligns different semantic components of the instruction with different neural grid cells.
    • Mechanism: (1) Syntactic parsing decouples the instruction into five semantic components (landmarks, scenes, actions, orientations, and others), producing positional labels. (2) The attention matrix from cross-modal attention is reused as an affinity matrix, automatically matching each language token to its most attended grid cell via row-wise max-pooling. (3) Position alignment applies a BCE loss that aligns the predicted distribution over language-modulated ISR cells, \(\hat{L}_{total} = \text{Softmax}(\text{MLP}(\text{Mean}([\tilde{m}_0^t, \ldots, \tilde{m}_i^t])))\), with the ground-truth text distribution. (4) Semantic alignment employs contrastive learning \(\mathcal{L}_{SA}\) to pull semantically similar grid–component pairs closer. (5) Progress Tracking uses an MLP to predict instruction execution progress.
    • Design Motivation: Unlike coarse sentence-level alignment, ALG associates landmark words with "scene memory" grid cells and action words with "action signal" grid cells, requiring no additional matching algorithm since the attention matrix is directly reused (see the matching sketch after this list).
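
A minimal sketch of the ISR grid from item 1, assuming a standard transformer-decoder update in which grid cells cross-attend to new observation tokens; module choices and hyperparameters beyond \(h = w = 10\), \(d = 512\) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NeuralGrid(nn.Module):
    """Sketch of the fixed-size ISR: h*w learnable cells with relative
    positional encoding, updated by cross-attending to new observations."""
    def __init__(self, h=10, w=10, d=512):
        super().__init__()
        self.h, self.w = h, w
        self.base = nn.Parameter(torch.zeros(d))  # shared init vector w_m^0
        self.pos_mlp = nn.Sequential(nn.Linear(2, d), nn.ReLU(), nn.Linear(d, d))
        self.update = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)

    def init_grid(self, batch):
        # m_ij^0 = w_m^0 + MLP([i - h/2, j - w/2])
        ij = torch.stack(torch.meshgrid(
            torch.arange(self.h), torch.arange(self.w), indexing="ij"),
            dim=-1).float()
        offsets = ij - torch.tensor([self.h / 2, self.w / 2])
        grid = self.base + self.pos_mlp(offsets.view(-1, 2))   # (h*w, d)
        return grid.unsqueeze(0).expand(batch, -1, -1)

    def step(self, grid, obs_tokens):
        # Grid cells query the new panoramic observation tokens; the token
        # count stays h*w no matter how long the trajectory grows.
        return self.update(grid, obs_tokens)
```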
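
For the View Imagination loss in item 2, here is a hedged sketch of \(\mathcal{L}_{VF} = \mathcal{L}_{Con} + \beta\,\text{KL}[q_\vartheta \| p_\vartheta]\), assuming an InfoNCE-style contrastive term and diagonal-Gaussian posterior/prior; \(\beta = 0.3\) comes from the paper, while the temperature \(\tau\) and the Gaussian parameterization are assumptions.

```python
import torch
import torch.nn.functional as F

def view_imagination_loss(pred, views, mu_q, logvar_q, mu_p, logvar_p,
                          beta=0.3, tau=0.07):
    """Sketch of L_VF = L_Con + beta * KL[q || p]. `pred` are pose-conditioned
    feature predictions decoded from the ISR, `views` the true view features;
    q is the posterior (sees the future frame), p the prior (does not)."""
    pred = F.normalize(pred, dim=-1)
    views = F.normalize(views, dim=-1)
    # InfoNCE: each prediction should match its own view within the batch.
    logits = pred @ views.t() / tau
    labels = torch.arange(pred.size(0), device=pred.device)
    l_con = F.cross_entropy(logits, labels)
    # Closed-form KL between two diagonal Gaussians, summed over dimensions.
    kl = 0.5 * (logvar_p - logvar_q
                + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1)
    return l_con + beta * kl.sum(dim=-1).mean()
```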
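
And for ALG's matching step in item 3, a sketch of reusing the cross-modal attention matrix as an affinity matrix with row-wise max-pooling; tensor shapes and the grouping format are assumptions.

```python
import torch

def match_tokens_to_cells(attn, component_ids):
    """Sketch of ALG's matching: treat the cross-modal attention matrix as a
    token-to-cell affinity matrix and take a row-wise argmax, so each language
    token is assigned to its most attended grid cell with no extra matcher.

    attn:          (L, h*w) attention weights from language tokens to grid cells
    component_ids: (L,) semantic component label per token
                   (landmark / scene / action / orientation / other)
    """
    best_cell = attn.argmax(dim=-1)              # row-wise max-pooling
    # Group tokens by matched cell: these pairs supply the targets for the
    # position-alignment (BCE) and semantic-alignment (contrastive) losses.
    return {int(c): component_ids[best_cell == c] for c in best_cell.unique()}
```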

Loss & Training

The total loss is defined as:

\[\mathcal{L}_{total} = \mathcal{L}_{Action} + \beta(\mathcal{L}_{VF} + \mathcal{L}_{Map} + \mathcal{L}_{Sem}) + \lambda(\mathcal{L}_{Pro} + \mathcal{L}_{PA} + \mathcal{L}_{SA})\]

where \(\beta=0.3\) and \(\lambda=0.5\).

Action prediction uses cross-entropy loss with inflection weighting (upweighting turning actions). The model is pretrained for 100 epochs, followed by DAgger-based fine-tuning for 50+ epochs to address offline-to-online distribution shift. Visual encoders (CLIP ResNet50 and PointNav ResNet18) are frozen throughout training.
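
As a rough sketch, inflection weighting can be implemented as a per-step weight on the cross-entropy loss. The version below follows common VLN-CE practice, upweighting timesteps where the action differs from the previous one (e.g. the start of a turn); the weight value 3.2 is an illustrative assumption, not the paper's reported setting.

```python
import torch
import torch.nn.functional as F

def inflection_weighted_action_loss(logits, actions, inflection_weight=3.2):
    """Sketch of inflection-weighted cross-entropy for one trajectory:
    steps whose action changes from the previous step are upweighted so the
    rarer, more decisive turning actions are not drowned out by 'forward'.

    logits:  (T, n_actions) per-step action logits
    actions: (T,) ground-truth action ids
    """
    per_step = F.cross_entropy(logits, actions, reduction="none")
    weights = torch.ones_like(per_step)
    weights[1:][actions[1:] != actions[:-1]] = inflection_weight
    return (weights * per_step).sum() / weights.sum()
```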

Key Experimental Results

Main Results

| Dataset | Metric | Ours | Prev. SOTA | Gain |
|---|---|---|---|---|
| R2R-CE Val Unseen | OSR/SR/SPL | 67/59/50 | 65/58/49 (ETPNav/Zhang) | +2/+1/+1 |
| R2R-CE Test Unseen | OSR/SR/SPL | 64/57/50 | 63/56/48 (ETPNav/Zhang) | +1/+1/+2 |
| ObjectNav MP3D Val | SR/SPL/DTS | 40.9/17.1/4.68 | 40.2/16.0/- (SG-Nav) | +0.7/+1.1/- |

Ablation Study

| Configuration | OSR | SR | SPL | Notes |
|---|---|---|---|---|
| Baseline (no auxiliary tasks) | 58 | 49 | 43 | Cross-attention alignment only |
| +SLI | 60 | 51 | 45 | Scene layout imagination |
| +SLI+VI(Con) | 62 | 52 | 45 | Add visual contrastive loss |
| +SLI+VI(Con+KL) | 63 | 53 | 47 | Add future-frame distribution learning |
| +RVI+Pro+PA | 64 | 55 | 48 | Add progress tracking + position alignment |
| +RVI+SA (w/o Pro) | 63 | 54 | 46 | Semantic alignment without progress tracking |
| Full (RVI+ALG) | 67 | 58 | 50 | All components |

Key Findings

  • Each component contributes independently; the full model improves SR by 9 points (49→58) and SPL by 7 points (43→50) over the baseline.
  • SLI yields the largest individual gain (+2 SR); the KL divergence term in VI (future frame distribution learning) provides an additional +2 SPL over the contrastive-only variant.
  • Progress Tracking is critical for ALG effectiveness (contributing 1–2 additional points over the variant without it).
  • Hyperparameter robustness is strong: performance varies by less than 1% for \(h=w \in [8, 12]\) and is stable for \(k\) (imagination time steps) in \([10, 30]\).

Highlights & Insights

  • Elegant and practical ISR design: The fixed-size neural grid naturally eliminates the growing computational cost of long-sequence scene representation, while the absence of explicit geometric detail forces the network to learn high-level abstractions.
  • Novel "imagination over rendering" paradigm in RVI: Rather than deterministically predicting future frames (as in DREAMWALKER), the model learns the distribution of future visual transitions—a more robust approach that better reflects human mental models of navigation.
  • Clever reuse of the attention matrix in ALG: Cross-modal attention weights are repurposed as grid–component affinity scores, incurring zero additional computational overhead.
  • Compelling neuroscience-inspired motivation: The analogy between hippocampus (memory) / cerebellum (motor control) and distinct functional roles of grid positions in ISR provides a persuasive conceptual foundation.

Limitations & Future Work

  • Marginal performance gains: SR improvements of only 1–2 points suggest the approach may be approaching the ceiling of non-LLM-based VLN paradigms.
  • Indoor MP3D scenes only: Generalization to larger-scale outdoor environments or more diverse datasets remains unvalidated.
  • Traditional syntactic parsing for instruction decomposition: The paper acknowledges that LLMs may perform better but only includes this as a supplementary experiment; GPT-style models are not adopted as the default.
  • Frozen visual encoders: CLIP ResNet50 is frozen throughout; stronger visual encoders or end-to-end fine-tuning are not explored.
  • No zero-shot VLN generalization: All experiments are conducted in supervised settings with training data available.
  • Limited interpretability of ISR: Although ALG demonstrates grid–component correspondences, the semantics stored in individual grid cells remain opaque.

Comparison with Related Work

Compared to ETPNav (topological scene representation), ISR is more compact and avoids storing redundant visual textures in topological graph nodes. Compared to GridMM (visual feature field), ISR retains no explicit geometric details. Compared to DREAMWALKER (a world model for future view prediction), this approach requires no additional topological map construction, and learning a distribution rather than a deterministic rendering scales better. Compared to GELA (contrastive learning for entity alignment), ALG covers all semantic components: landmarks, scenes, actions, and orientations.

The paradigm of compressing sequential information into fixed-size implicit representations is conceptually related to Memory Transformers and Neural Turing Machines, but is specialized here for VLN via visual imagination and language grounding. The instruction decomposition and adaptive alignment pattern in ALG may offer methodological insights for other embodied tasks requiring fine-grained instruction following, such as manipulation and task planning.

Rating

  • Novelty: ⭐⭐⭐⭐ The combined design of neural grid ISR, visual imagination RVI, and adaptive grounding ALG is novel and theoretically motivated.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Ablations are thorough (incremental component addition), validated on two tasks, though the number of benchmarks is limited.
  • Writing Quality: ⭐⭐⭐⭐ Methodology is clearly articulated with persuasive neuroscience-based motivation, though the dense mathematical notation may affect readability.
  • Value: ⭐⭐⭐⭐ Fixed-size ISR and fine-grained ALG offer methodological contributions to the VLN field, though performance gains are modest.