
VPN: Visual Prompt Navigation

Conference: AAAI 2026
arXiv: 2508.01766
Code: github.com/farlit/VPN
Area: 3D Vision
Keywords: Visual Navigation, Visual Prompt, Top-down View, Vision-Language Navigation, Data Augmentation

TL;DR

This paper proposes Visual Prompt Navigation (VPN), a novel navigation paradigm in which users annotate visual trajectories (keypoints connected by arrows) on 2D top-down maps to guide agent navigation, replacing natural language or image-goal instructions. Two datasets, R2R-VP and R2R-CE-VP, are constructed alongside a VPNet baseline model. Combined with view-level and trajectory-level data augmentation, the approach achieves strong performance in both discrete and continuous environments.

Background & Motivation

Limitations of Prior Navigation Paradigms

Visual navigation is a core research direction in AI and robotics. Prevailing paradigms include:

PointGoal Navigation: Provides the relative direction and distance to the goal, but lacks intermediate guidance.

ImageGoal Navigation: Supplies an image of the target location, but offers no intermediate navigation cues.

Vision-Language Navigation (VLN): Describes the navigation path in natural language; currently the most active paradigm.

Fundamental Dilemma of Natural Language Instructions: Language is inherently ambiguous when describing object positions, directional changes, and distance relations; striving for precision inevitably leads to verbosity. This creates a fundamental trade-off in human–robot interaction.

Advantages of Visual Prompts

The authors propose an intuitive insight: drawing a route on a map is the most natural way for humans to communicate navigation instructions. Key advantages:

High User Accessibility: Non-expert users can naturally specify navigation goals by clicking or sketching trajectories.

Rich Spatial Information: Top-down views inherently preserve complete spatial layouts.

High Reusability: Top-down views can be acquired once via drone imagery or 3D reconstruction and reused repeatedly.

Method

Overall Architecture

The core contributions of VPN encompass three components:

  1. Dataset Construction: replacing the language instructions in R2R/R2R-CE with visual prompts.
  2. VPNet Model: built upon the DUET/ETPNav architectures, substituting the language encoder with a ViT encoder.
  3. Data Augmentation Strategies: view-level and trajectory-level augmentation.

Key Designs

1. Visual Prompt Construction Pipeline

Four-step generation: ① generate top-down view ② annotate trajectory by connecting waypoints with arrows ③ crop around the trajectory with a 60 px margin ④ remove black borders to tightly bound the visual prompt.
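Steps ③–④ can be sketched in a few lines of numpy; the function name, the handling of the 60 px margin at image borders, and the "all-black row/column" trim test are illustrative assumptions, not the authors' code:

```python
import numpy as np

def crop_prompt(top_down, traj_xy, margin=60):
    """Crop a top-down map around an annotated trajectory (step 3),
    then trim remaining all-black rows/columns (step 4).
    top_down: H x W x 3 uint8 image; traj_xy: (N, 2) pixel coords (x, y)."""
    h, w = top_down.shape[:2]
    xs, ys = traj_xy[:, 0], traj_xy[:, 1]
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin, w)
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin, h)
    crop = top_down[y0:y1, x0:x1]
    # Drop rows/columns that are entirely black (outside the scene mesh),
    # so the visual prompt tightly bounds the scene content.
    keep_rows = crop.any(axis=(1, 2))
    keep_cols = crop.any(axis=(0, 2))
    return crop[keep_rows][:, keep_cols]
```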

Design Motivation: Center cropping is a critical step. Ablation studies show that without cropping, different episodes within the same scene share an identical top-down map, causing the model to overfit to the scene rather than learning trajectory information (SR drops to only 31%).

2. VPNet Model Architecture

Three core components:

ViT Visual Prompt Encoder: Uses ViT-B/16 (pretrained on ImageNet-21k) to encode 224×224 visual prompt images. For multi-floor scenarios, Order-Aware Floor Concatenation (OAFC) is applied: \(\mathcal{P}_i^o = \text{ViT}(\mathcal{P}_i) + b_i, \quad \mathcal{P} = [\mathcal{P}_1^o, \dots, \mathcal{P}_k^o]\)
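A minimal numpy sketch of OAFC under assumed shapes: the real encoder is ViT-B/16, which a stub stands in for here, and the floor bias \(b_i\) is treated as one learned vector per floor index broadcast over that floor's tokens (an assumption about its parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
d, tokens_per_floor, num_floors = 8, 4, 3

def vit_encode(prompt_img):
    # Stand-in for the ViT-B/16 encoder: returns per-patch token features.
    return rng.standard_normal((tokens_per_floor, d))

# Learned floor-order bias b_i, one vector per floor index (assumed shape).
floor_bias = rng.standard_normal((num_floors, d))

floors = [None] * num_floors                 # placeholder prompt images
encoded = [vit_encode(p) + floor_bias[i]     # P_i^o = ViT(P_i) + b_i
           for i, p in enumerate(floors)]
prompt_tokens = np.concatenate(encoded, axis=0)  # P = [P_1^o, ..., P_k^o]
```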

Node Embedding Module: The agent incrementally builds a topological graph; each node is represented by panoramic view features (encoded via a two-layer Transformer), step embeddings, and position embeddings.
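A toy sketch of how one node's embedding might be assembled from the three ingredients above; pooling panoramic features by averaging and the parameter shapes are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

def node_embedding(pano_feats, step, position, step_table, pos_proj):
    """One node's representation: pooled panoramic view features plus a
    step embedding (visitation order) and a position embedding
    (projected 3D location). Shapes are illustrative."""
    visual = pano_feats.mean(axis=0)          # pool over panorama views
    return visual + step_table[step] + position @ pos_proj
```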

Graph-Aware Cross-Modal Encoder: A multi-layer cross-modal graph Transformer comprising cross-attention layers and Graph-Aware Self-Attention (GASA) layers: \(\text{GASA}(X) = \text{Softmax}\left(\frac{XW_q(XW_k)^T}{\sqrt{d}} + EW_d\right)XW_v\), where \(E\) is the pairwise distance matrix of the topological graph.
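The distance-biased attention can be sketched in numpy as follows; for simplicity the learned projection \(W_d\) is reduced here to a single scalar weight on the distance matrix, an assumed simplification of the actual parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gasa(X, E, Wq, Wk, Wv, w_d):
    """Graph-aware self-attention: standard attention logits are biased
    by pairwise graph distances E (N x N), so nearby nodes in the
    topological map attend to each other more easily. w_d is a learned
    scalar weight on the distance bias (assumed simplification)."""
    d = Wq.shape[1]
    logits = (X @ Wq) @ (X @ Wk).T / np.sqrt(d) + E * w_d
    return softmax(logits) @ (X @ Wv)
```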

3. Data Augmentation Strategies

Trajectory-Level Augmentation: Incorporates PREVALENT (178k trajectories) and ScaleVLN (1.6M trajectories) to increase training data diversity.

View-Level Augmentation:

  • Prompt View Augmentation: random rotation of the top-down view (0°/90°/180°/270°).
  • Agent View Augmentation: random sampling of the initial heading direction.

Design Motivation: In VPN, the initial heading is decoupled from the visual prompt (unlike VLN, where language may implicitly encode initial direction), making free rotation applicable.
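Both augmentations are cheap to implement; a minimal numpy sketch (the choice of 12 discrete headings is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(7)

def augment(prompt_img, num_headings=12):
    """View-level augmentation sketch: rotate the top-down prompt by a
    random multiple of 90 degrees, and resample the agent's initial
    heading uniformly over a set of discrete headings (count assumed)."""
    k = rng.integers(0, 4)                       # 0/90/180/270 degrees
    rotated = np.rot90(prompt_img, k)            # rotates the first two axes
    heading = rng.integers(0, num_headings) * (2 * np.pi / num_headings)
    return rotated, heading
```

Because the agent's starting heading carries no information about the prompt, the two augmentations can be applied independently.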

Loss & Training

  • Behavior Cloning + DAgger: \(\mathcal{L} = \lambda \mathcal{L}_{BC} + (1-\lambda) \mathcal{L}_{DAG}\), with \(\lambda = 0.5\)
  • Discrete environment: single A5000 GPU, 400k iterations, batch=10, lr=1.5e-5
  • Continuous environment: dual A5000 GPUs, 400k iterations, batch=16, lr=1e-5
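The combined objective above can be sketched as follows; the function and argument names are illustrative, with cross-entropy over action logits assumed as the per-branch loss (behavior cloning on teacher-forced rollouts, DAgger supervision from oracle actions on sampled rollouts):

```python
import numpy as np

def cross_entropy(logits, target):
    # Numerically stable negative log-likelihood of the target action.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def navigation_loss(logits_tf, gt_action, logits_sampled, oracle_action, lam=0.5):
    """L = lam * L_BC + (1 - lam) * L_DAG: behavior cloning on the
    teacher-forced rollout plus DAgger loss on the sampled rollout,
    supervised by oracle actions (illustrative shapes)."""
    l_bc = cross_entropy(logits_tf, gt_action)
    l_dag = cross_entropy(logits_sampled, oracle_action)
    return lam * l_bc + (1 - lam) * l_dag
```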

Key Experimental Results

Main Results

Discrete Environment (R2R-VP):

| Method | Training Data | Val Unseen SR↑ | Val Unseen SPL↑ | Test Unseen SR↑ |
|---|---|---|---|---|
| DUET (VLN) | R2R+PRE+SCA | 81 | 70 | 80 |
| VPNet | R2R | 51.23 | 43.47 | 52.40 |
| VPNet | R2R+PRE | 65.92 | 56.17 | 66.38 |
| VPNet | R2R+PRE+SCA | 96.68 | 94.84 | 97.56 |

VPNet achieves 96.68% SR on Val Unseen, substantially outperforming DUET's 81%, while using only about one third of the ScaleVLN trajectories (1.6M vs. 4.9M).

Continuous Environment (R2R-CE-VP):

| Method | Setting | Val Seen SR↑ | Val Unseen SR↑ |
|---|---|---|---|
| ETPNav (VLN) | R2R+PRE | 66 | 57 |
| VPNet | R2R+PRE | 84.11 | 47.96 |

Ablation Study

Effect of Different Visual Prompt Types (Discrete Environment):

| Prompt Type | Val Seen SR | Val Unseen SR | Notes |
|---|---|---|---|
| Uncropped full top-down view | 31.68 | 33.94 | Overfits to scene |
| Cropped top-down view only | 83.56 | 45.83 | Resembles ImageNav |
| Cropped + arrows + text | 95.74 | 65.36 | Text occludes details |
| Cropped + arrows | 100 | 65.92 | Best |

Effect of View-Level Augmentation:

| Augmentation | Val Unseen SR↑ | SPL↑ |
|---|---|---|
| None | 86.33 | 82.92 |
| Agent view only | 88.18 | 85.02 |
| Prompt view rotation only | 96.41 | 94.37 |
| Both combined | 96.68 | 94.84 |

Prompt view rotation yields substantially larger gains than agent view augmentation (+10 SR vs. +2 SR).

Key Findings

  1. Remarkable data efficiency of visual prompts: VPNet achieves 96.68% SR with 1.6M trajectories, whereas DUET reaches only 81% SR with 4.9M trajectories.
  2. Cropping is critical: Without cropping, the model severely overfits to scenes (31% vs. 100% Val Seen SR).
  3. Cropping alone (without trajectory annotation) is already effective: The model can infer approximate destinations from the cropped region, resembling ImageNav.
  4. Moderate robustness to noise: Under 20% salt-and-pepper noise, SR decreases from 96.68% to 90.34%.

Highlights & Insights

  1. Paradigm Innovation: VPN introduces a fundamentally new navigation paradigm, bridging the gap between language-guided and image-goal navigation.
  2. Data Efficiency Advantage: Visual prompts convey spatial information at a far higher density than natural language instructions.
  3. Strong Practicality: Top-down views can be acquired once via drone imagery or 3D reconstruction and reused across multiple deployments.
  4. Well-Designed Ablation Study: Systematic analysis covers prompt types, augmentation strategies, encoder configurations, and multiple other dimensions.

Limitations & Future Work

  1. Validation limited to simulated environments: Experiments are conducted in MP3D/HM3D scenes without real-world testing.
  2. Dependence on high-quality top-down views: Scenes with poor reconstruction quality are not amenable to this approach.
  3. Performance gap in continuous environments: Val Unseen SR of only 47.96%, far below the 96.68% achieved in discrete environments.
  4. Coarse handling of multi-floor scenarios: Floor features are naively concatenated.

Related Work
  • VLN (R2R) Series: DUET, BEVBert, and ScaleVLN serve as primary comparison baselines.
  • DUET: Serves as the architectural foundation for the discrete-environment variant of VPNet.
  • ETPNav: Serves as the architectural foundation for the continuous-environment variant of VPNet.
  • RoVI: Uses hand-drawn symbols to guide robotic manipulation; however, VPN is the first work to employ visual prompts as the sole navigation instruction.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — A new navigation paradigm that fills an important gap.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive ablations, but lacks real-world experiments.
  • Writing Quality: ⭐⭐⭐⭐ — Clear motivation and intuitive illustrations.
  • Value: ⭐⭐⭐⭐ — User-friendly interaction modality; continuous environment performance warrants further improvement.