ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation

Conference: CVPR 2026
arXiv: 2603.05530
Code: None
Area: Robotics / Vision-Language Navigation
Keywords: VLN, proactive perception, MCTS, zero-shot navigation, LLM agent

TL;DR

ProFocus is a training-free progressive framework that achieves state-of-the-art zero-shot VLN performance on R2R and REVERIE benchmarks through two mechanisms: proactive perception (converting panoramic observations into semantic maps and having an LLM generate targeted visual queries) and focused reasoning (BD-MCTS filtering top-k high-value waypoints from large navigation histories).

Background & Motivation

Background: Vision-and-Language Navigation (VLN) requires agents to navigate physical environments following natural language instructions. Foundation model-based VLN methods have shown promise through either post-training adaptation or zero-shot prompting, yet both suffer from two critical shortcomings.

Limitations of Prior Work: (1) Passive visual perception — VLM-driven methods uniformly process panoramic or multi-view visual inputs, causing redundant information to inflate visual token counts and diffuse attention across irrelevant features, obscuring instruction-relevant fine-grained cues; (2) Unfocused reasoning — both paradigms receive large amounts of unprioritized historical context containing past observations and waypoints, whereby long trajectory histories dilute attention and impede precise reasoning.

Key Challenge: Navigation requires selective perception (acquiring only task-relevant information) and focused reasoning (attending only to high-value historical waypoints), whereas existing methods process everything in bulk on both fronts.

Goal: To proactively acquire task-relevant visual observations to reduce perceptual redundancy, and to focus reasoning on high-value waypoints within extensive navigation histories.

Key Insight: Establish a closed-loop perception–reasoning cycle through LLM–VLM collaboration, and employ an MCTS variant to filter key waypoints from the global history.

Core Idea: An LLM determines "what needs to be known" based on the semantic map and generates visual queries; a VLM performs fine-grained perception within specified regions; BD-MCTS then focuses decision-making on the top-k high-value waypoints extracted from the full navigation history.

Method

Overall Architecture

ProFocus comprises three specialized agents: an orchestration agent \(\mathcal{A}_{\mathrm{orch}}^{\boldsymbol{\theta}}\) (LLM, responsible for spatial reasoning and semantic evaluation), a perception agent \(\mathcal{A}_{\mathrm{perc}}^{\boldsymbol{\phi}}\) (VLM, responsible for fine-grained perception), and a decision agent \(\mathcal{A}_{\mathrm{dec}}^{\boldsymbol{\psi}}\) (LLM, reasoning over top-k candidates). The framework takes panoramic images and language instructions as input and outputs navigation actions.

Key Designs

Ego-centric Semantic Map:
- Function: Converts panoramic observations into a structured textual representation encoding object positions, depths, and directional relationships.
- Mechanism: The panorama is divided into \(K\) directional views; a VLM detects all objects in parallel, producing \(\{(\boldsymbol{b}_i, \textit{obj}_i)\}_{i=1}^{N_t}\). Monocular depth estimation yields depth \(d_i\), and the heading angle is computed as \(h_i = \pi \cdot \frac{x_1 + x_2 - F}{F}\), where \(x_1\) and \(x_2\) are presumably the left and right horizontal coordinates of box \(\boldsymbol{b}_i\) and \(F\) the panorama width, so that the box center maps to a heading in \([-\pi, \pi]\). The semantic map is then constructed as \(\mathcal{C}_t = \{(h_i, \textit{obj}_i(\boldsymbol{b}_i), d_i)\}_{i=1}^{N_t}\) and formatted as natural language text.
- Design Motivation: Compressing visual information into structured text enables the LLM to perform spatial reasoning (e.g., "the object to the left") while avoiding the processing of large volumes of raw pixels.
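
The heading formula and the text formatting of the semantic map can be illustrated with a minimal sketch. It assumes that \(x_1\) and \(x_2\) are the horizontal box edges in panorama pixels, that \(F\) is the panorama width, and that positive headings lie to the right; the `Detection` class, the 15-degree "ahead" band, the metre unit, and the output wording are hypothetical choices made for illustration, not details taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    label: str      # object name reported by the detector
    x1: float       # left edge of the bounding box (panorama pixels, assumed)
    x2: float       # right edge of the bounding box (panorama pixels, assumed)
    depth_m: float  # estimated depth; metres are an assumption


def heading(x1: float, x2: float, width: float) -> float:
    """h = pi * (x1 + x2 - F) / F: maps the box centre to [-pi, pi]."""
    return math.pi * (x1 + x2 - width) / width


def describe(det: Detection, width: float) -> str:
    """Render one (heading, object, depth) triple as a line of text."""
    deg = math.degrees(heading(det.x1, det.x2, width))
    # Illustrative convention: positive headings are to the right.
    side = "ahead" if abs(deg) < 15 else ("right" if deg > 0 else "left")
    return f"{det.label}: {deg:+.0f} deg ({side}), {det.depth_m:.1f} m away"


def semantic_map(detections, width: float) -> str:
    """Join all object descriptions into the textual map given to the LLM."""
    return "\n".join(describe(d, width) for d in detections)
```

For example, `describe(Detection("sofa", 1436, 1636, 2.5), 2048)` yields `sofa: +90 deg (right), 2.5 m away`, since the box center sits three quarters of the way across the panorama.
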

Reasoning-Driven Proactive Perception Loop:
- Function: Proactively acquires instruction-relevant visual information based on task demands, rather than passively processing all inputs.
- Mechanism: The orchestration agent generates a visual query \(\boldsymbol{q}\) and a focus region \(\boldsymbol{R}_{\text{focus}}^t\) conditioned on the semantic map, trajectory history, and instruction: \((\boldsymbol{q}, \boldsymbol{R}_{\text{focus}}^t) = \mathcal{A}_{\mathrm{orch}}^{\boldsymbol{\theta}}(\mathcal{C}_t, \boldsymbol{\tau}_t, \mathcal{I}, \mathcal{H}_{\text{query}})\). The perception agent then performs fine-grained analysis within the focus region: \(\boldsymbol{a}_i^t = \mathcal{A}_{\mathrm{perc}}^{\boldsymbol{\phi}}(\boldsymbol{\mathcal{O}}_t|_{\boldsymbol{R}_{\text{focus}}^t}, \boldsymbol{q})\). The loop iterates until the information is deemed sufficient (\(s_t = \text{sufficient}\)).
- Design Motivation: Restricting perception to instruction-relevant regions reduces visual token count, enhances fine-grained attribute recognition through targeted queries, and enables adaptive perception.
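
The query-perceive-check cycle described above might look roughly like the following sketch. The two agents are stand-in callables rather than real LLM/VLM clients, the accumulated answers stand in for the trajectory and query history, and the `max_rounds` cap is an added safeguard that the summary does not mention.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: the orchestrator returns (query, focus_region, sufficient),
# the perceiver answers a query about one region of the observation.
Orchestrator = Callable[[str, str, List[str]], Tuple[str, str, bool]]
Perceiver = Callable[[str, str], str]


def proactive_perception(
    orchestrate: Orchestrator,
    perceive: Perceiver,
    semantic_map: str,
    instruction: str,
    max_rounds: int = 3,  # assumed cap, not taken from the paper
) -> List[str]:
    """Ask targeted visual questions until the orchestrator is satisfied."""
    answers: List[str] = []
    for _ in range(max_rounds):
        query, region, sufficient = orchestrate(semantic_map, instruction, answers)
        if sufficient:
            break
        answers.append(perceive(region, query))
    return answers


def toy_orchestrator(semantic_map, instruction, answers):
    """Toy stand-in: ask one question, then declare the information sufficient."""
    if answers:
        return "", "", True
    return "Is the door on the left open?", "left", False


def toy_perceiver(region, query):
    """Toy stand-in for the VLM: always answers 'yes' for the given region."""
    return f"[{region}] yes"
```

With the toy agents, `proactive_perception(toy_orchestrator, toy_perceiver, "door: -80 deg", "go through the open door")` returns `["[left] yes"]` after a single round.
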

Branch-Diverse MCTS (BD-MCTS):
- Function: Filters top-k high-value waypoint candidates from large navigation histories to guide focused reasoning in the decision agent.
- Mechanism: A search tree \(\mathcal{T} = \langle \boldsymbol{V}_{\mathcal{T}}, \boldsymbol{E}_{\mathcal{T}}, Q, N \rangle\) is maintained across three phases — (I) new nodes are initialized using semantic value \(V_{\text{sem}}(u)\) in lieu of random rollouts; (II) \(Q\)-values are dynamically refined via backpropagation along the path: \(Q(v) \leftarrow Q(v) + \frac{R_t - Q(v)}{N(v)}\); (III) candidates are ranked by a path-aggregated score with distance penalty \(\text{Score}(v) = V_{\text{path}}(v) - \lambda \cdot \frac{d_{\mathcal{G}}(v_t, v)}{\max_{u} d_{\mathcal{G}}(v_t, u)}\), with each parent node contributing at most 2 child nodes to maintain branch diversity.
- Design Motivation: Standard MCTS selects a single optimal action, whereas BD-MCTS selects diverse top-k candidates. The distance penalty favors waypoints that are close to the current position in the navigation graph, so the selected candidates stay cheap to reach, and the branch diversity constraint ensures coverage of different exploration directions.
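
The update and ranking rules can be sketched as follows. Several details are assumptions, because the summary does not define them: \(V_{\text{path}}\) is taken to be the mean \(Q\) along the root-to-node path, the initial semantic value counts as one visit, the normalizing maximum runs over the candidate set, and `lam` defaults to 0.5. The `Node` class and function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    parent: Optional[str]
    q: float           # Q(v), initialised from V_sem(v) instead of a rollout
    dist: float = 0.0  # graph distance d_G(v_t, v) from the current node
    n: int = 1         # visit count N(v); the initial value counts as one visit


def backpropagate(tree: Dict[str, Node], leaf: str, reward: float) -> None:
    """Incremental-mean update Q <- Q + (R - Q) / N along the path to the root."""
    name: Optional[str] = leaf
    while name is not None:
        node = tree[name]
        node.n += 1
        node.q += (reward - node.q) / node.n
        name = node.parent


def path_value(tree: Dict[str, Node], name: Optional[str]) -> float:
    """Assumed aggregation: mean Q over the path from the root to `name`."""
    values = []
    while name is not None:
        values.append(tree[name].q)
        name = tree[name].parent
    return sum(values) / len(values)


def select_top_k(tree: Dict[str, Node], candidates: List[str], k: int,
                 lam: float = 0.5, max_per_parent: int = 2) -> List[str]:
    """Rank by V_path minus the normalised distance penalty, at most 2 per parent."""
    max_dist = max(tree[c].dist for c in candidates) or 1.0
    ranked = sorted(
        candidates,
        key=lambda c: path_value(tree, c) - lam * tree[c].dist / max_dist,
        reverse=True,
    )
    taken = defaultdict(int)
    chosen: List[str] = []
    for c in ranked:
        parent = tree[c].parent
        if taken[parent] >= max_per_parent:
            continue  # branch-diversity constraint
        taken[parent] += 1
        chosen.append(c)
        if len(chosen) == k:
            break
    return chosen


def toy_tree() -> Dict[str, Node]:
    """Small example: three children of the root and one grandchild."""
    nodes = [
        Node("root", None, 0.5, 0.0),
        Node("a", "root", 0.9, 1.0),
        Node("b", "root", 0.8, 1.0),
        Node("c", "root", 0.7, 1.0),
        Node("d", "a", 0.6, 2.0),
    ]
    return {node.name: node for node in nodes}
```

On the toy tree, `select_top_k(toy_tree(), ["a", "b", "c", "d"], k=3)` returns `["a", "b", "d"]`: node `c` scores higher than `d`, but the root has already contributed two children, so the slot goes to a different branch.
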

Loss & Training

ProFocus is a fully training-free framework requiring no fine-tuning or training of any model component. It employs off-the-shelf LLMs (Qwen3-Max / DeepSeek-V3, abbreviated Q3 and DS3 in the tables below) and VLMs (Qwen3-VL-Max / GLM-4.5V, abbreviated Q3VL and GLM).

Key Experimental Results

Main Results

Results on R2R validation unseen set:

Method	NE↓	OSR↑	SR↑	SPL↑
NavGPT (GPT-4)	6.46	42.0	34.0	29.0
MapGPT (GPT-4V)	5.63	57.6	43.7	34.8
MSNav (GPT-4o)	5.24	65.0	46.0	40.0
ProFocus (Q3+Q3VL)	4.92	65.0	52.5	39.8
ProFocus (DS3+GLM)	5.21	63.0	50.0	41.2
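
The summary does not define the table's metrics, so the sketch below follows the standard VLN definitions: NE is the mean distance from the stopping point to the goal, SR is the fraction of episodes that stop within 3 m of the goal on R2R, OSR counts episodes whose trajectory passes within that radius at any point, and SPL weights each success by the ratio of shortest-path length to the length actually traveled. The function names and the tuple layout are illustrative.

```python
from typing import List, Tuple


def success_rate(final_errors: List[float], threshold: float = 3.0) -> float:
    """SR: share of episodes whose final distance to the goal is under the threshold."""
    return sum(e < threshold for e in final_errors) / len(final_errors)


def spl(episodes: List[Tuple[bool, float, float]]) -> float:
    """SPL over episodes given as (success, shortest_path_len, taken_path_len)."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A detour shrinks the credit; the ratio is capped at 1.
            total += shortest / max(taken, shortest)
    return total / len(episodes)
```

For instance, `spl([(True, 10, 10), (True, 10, 20), (False, 10, 5)])` gives 0.5: one efficient success, one success with a path twice as long as necessary, and one failure. This is why a method can lead on SR yet trail on SPL, as the two ProFocus rows do relative to each other.
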

Ablation Study

Configuration	SR↑	SPL↑	Notes
NavGPT† (DS3)	36.0	28.1	No proactive perception, no focused reasoning
MapGPT† (GLM)	41.4	30.8	Passive VLM-based perception
ProFocus (DS3+GLM)	50.0	41.2	+ Proactive Perception + BD-MCTS
ProFocus (Q3+Q3VL)	52.5	39.8	Stronger backbone further improves results
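
The absolute gains implied by this table can be checked with a few lines of arithmetic. The numbers are copied from the rows above (the dagger marks on the baseline names are dropped from the keys); the `gain` helper is only an illustration.

```python
# (SR, SPL) per configuration, copied from the ablation table.
rows = {
    "NavGPT (DS3)": (36.0, 28.1),
    "MapGPT (GLM)": (41.4, 30.8),
    "ProFocus (DS3+GLM)": (50.0, 41.2),
    "ProFocus (Q3+Q3VL)": (52.5, 39.8),
}


def gain(method: str, baseline: str):
    """Absolute (SR, SPL) difference in percentage points."""
    (sr, spl), (base_sr, base_spl) = rows[method], rows[baseline]
    return round(sr - base_sr, 1), round(spl - base_spl, 1)
```

`gain("ProFocus (DS3+GLM)", "NavGPT (DS3)")` gives `(14.0, 13.1)` and `gain("ProFocus (DS3+GLM)", "MapGPT (GLM)")` gives `(8.6, 10.4)`, which are the comparisons that share a backbone; against the Q3+Q3VL row the SR gain over NavGPT rises to 16.5 points.
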

Key Findings

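SR improves from 36.0% (NavGPT† with DS3) to 50.0% for ProFocus with DS3+GLM, which shares the same LLM (+14.0 percentage points), and to 52.5% with the stronger Q3+Q3VL backbone (+16.5 percentage points).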
Proactive perception improves both efficiency and accuracy by reducing visual tokens and enhancing fine-grained attribute recognition.
The branch diversity constraint in BD-MCTS is critical for avoiding local optima.
Despite being training-free and operating zero-shot, the framework already surpasses certain trained methods.

Highlights & Insights

The three-agent collaboration exhibits a clear division of labor — orchestration (planning), perception (execution), decision (reasoning) — mirroring the human cognitive process during navigation.
The performance gains of proactive over passive perception demonstrate that selective, high-quality visual information is superior to exhaustive but noisy input.
BD-MCTS introduces a human-like prioritized memory access mechanism into VLN: humans do not uniformly recall all past states either.
The training-free design allows seamless integration of stronger LLM/VLM backbones as they become available.

Limitations & Future Work

API overhead: Each navigation step requires multiple LLM/VLM calls, resulting in non-trivial latency.
The semantic map quality depends on the accuracy of object detection and depth estimation.
Evaluation is conducted with specific LLM/VLM combinations; prompt engineering may need adjustment when switching models.
Only R2R and REVERIE are evaluated; generalization to more challenging continuous-environment navigation remains unverified.

Related Work & Insights

NavGPT: Converts panoramic scenes to text descriptions for GPT-4-based decision-making; relies on passive perception.
MapGPT: Employs VLM-based map representations but still processes all views passively.
AO-Planner: Uses SAM-based visual affordance prompting but lacks an active query mechanism.
The proactive perception + focused reasoning paradigm may also transfer to other tasks requiring long-horizon historical reasoning, such as dialogue systems and long-document question answering, although this is not evaluated here.

Rating

Novelty: ⭐⭐⭐⭐ The combination of a proactive perception loop and BD-MCTS is novel in the VLN context; the closed-loop perception–reasoning design is elegant.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated on two standard benchmarks (R2R and REVERIE) with multiple model configurations.
Writing Quality: ⭐⭐⭐⭐ Well-structured with rigorous formalization.
Value: ⭐⭐⭐⭐ The training-free framework is highly practical and can serve as a general-purpose enhancement module for LLM/VLM-based navigation.