Parse, Search, and Confirmation: Training-Free Aerial Vision-and-Dialog Navigation with Chain-of-Thought Reasoning and Structured Spatial Memory¶

Conference: CVPR 2026
Area: Robotics / Embodied Navigation
Keywords: Aerial Vision-and-Dialog Navigation, UAV, Training-Free, Chain-of-Thought, Spatial Memory
Code: https://github.com/QY6616/PSC-AVDN (To be released)

TL;DR¶

Conventional Aerial Vision-and-Dialog Navigation (AVDN) depends on supervised fine-tuning and generalizes poorly across domains. This paper proposes PSC-AVDN, a training-free framework that decomposes MLLM navigation into a three-stage "Parsing-Search-Confirmation" Chain-of-Thought (CoT) and adds a Structured Spatial Memory (SSM) to supply the spatial and historical context the model otherwise lacks. It achieves the best training-free results on ANDH / ANDH-Full and performs comparably to or better than several fine-tuned methods.

Background & Motivation¶

Background: Aerial Vision-and-Dialog Navigation (AVDN) requires a drone to fly according to natural language instructions and to resolve ambiguities through dialogue. Its high-altitude, top-down views (similar to remote sensing imagery) cover a wide area, which suits applications such as disaster relief and environmental monitoring, but the landmarks in them are sparse and small. Traditional AVDN methods rely heavily on supervised fine-tuning.

Limitations of Prior Work: Supervised fine-tuning is costly in both computation and labeling, and it often fails to generalize across domains. A natural alternative is to use an MLLM without training by feeding it the current view and the dialogue. However, such naive iterative search is unreliable for two reasons: ① Weak direction grounding: MLLMs are trained mostly on close-range ground-level views and struggle to translate abstract spatial terms (e.g., "slightly to the right") into geometric cues that match the layout seen from the drone; the small landmarks in aerial views make alignment harder still. ② Lack of global spatial understanding and temporal state tracking: AVDN requires maintaining beliefs across multiple steps, but autoregressive, language-driven reasoning has no structural mechanism for mapping or long-term consistency, so its interpretation of the environment becomes fragmented.

Key Challenge: The domain gap between MLLM vision-language priors (ground-view) and AVDN geometric requirements (aerial-view), combined with the inherent lack of explicit spatial memory in language-driven reasoning.

Goal: To overcome these limitations by (1) decoupling direction understanding from target localization and (2) providing MLLMs with an explicit structured spatial memory.

Core Idea: Utilize a "Parsing-Search-Confirmation" three-stage CoT for (1) and a Structured Spatial Memory (SSM) for (2), achieving a purely training-free approach through MLLM native capabilities and prompt engineering.

Method¶

Overall Architecture¶

The AVDN task involves \(L\) dialogue rounds. In each round \(l\) with instruction \(U_l\), PSC-AVDN executes a "Parsing → Search → Confirmation" cycle to output a target box \(B_l=(x^1_l,y^1_l,x^2_l,y^2_l)\). The final box denotes the destination. The Parsing stage uses a general LLM to convert ambiguous instructions into stable geometric cues and destination descriptions. The Search stage (S-CoT) explores the aerial observation to narrow down candidate regions. The Confirmation stage (C-CoT) performs fine-grained verification and disambiguation. The SSM provides multi-scale observations, spatial visual memory, and structured geometric memory to maintain global context and long-term consistency.
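The per-round cycle described above can be sketched as a simple loop. This is a minimal illustration rather than the authors' implementation: `run_dialog`, the three callables `parse`, `search`, and `confirm`, and the plain list used as memory are all hypothetical stand-ins for the LLM/MLLM prompts and the SSM.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def run_dialog(
    instructions: List[str],
    parse: Callable[[str], Tuple[str, str]],
    search: Callable[[str, str, list], List[Box]],
    confirm: Callable[[str, List[Box], list], Box],
) -> List[Box]:
    """Run one Parsing -> Search -> Confirmation cycle per dialogue round."""
    memory: list = []  # hypothetical stand-in for the SSM
    boxes: List[Box] = []
    for u_l in instructions:
        s_dir, s_des = parse(u_l)                 # Parsing: direction + destination
        candidates = search(s_dir, s_des, memory)  # Search: narrow candidate regions
        b_l = confirm(s_des, candidates, memory)   # Confirmation: pick the final box
        memory.append(b_l)                         # update memory for the next round
        boxes.append(b_l)
    return boxes  # the last box denotes the destination
```

Passing the stages in as callables keeps the loop independent of any particular model or prompt, which is the sense in which the framework is training-free.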

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Instruction Ul + Current View"] --> B["Parsing Stage + Heading Resolution HR<br/>Decouple direction/destination, '10 o'clock' → Absolute 237°"]
    B --> C["Search S-CoT<br/>4-step reasoning to narrow candidates"]
    C --> D["Confirmation C-CoT<br/>Fine-grained verification to output box Bl"]
    E["Structured Spatial Memory SSM<br/>MVO + SVM + SGM"] --> C
    E --> D
    D --> F["Output box Bl and update SSM"]

Key Designs¶

1. Parsing Stage + Heading Resolution (HR): Decoupling Ambiguous Directions

MLLMs struggle to translate terms like "10 o'clock" into geometric actions. PSC-AVDN therefore uses a general LLM (DeepSeek-V3) to decompose the instruction \(U\) into a movement-direction phrase \(s_{dir}\) and a destination description \(s_{des}\). The Heading Resolution (HR) module then applies rule-based parsing to map clock-based, degree-based, or compass-based terms to an absolute angle \(\alpha \in [0,2\pi)\). Combined with the current heading \(\phi\), this gives the relative heading:

\[\delta = \mathrm{wrap}(\alpha - \phi),\]

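where \(\mathrm{wrap}(\cdot)\) normalizes the angle difference into a single principal range (for example \([-\pi, \pi)\)), so that equivalent headings map to the same value. This decouples direction understanding from target localization: the MLLM receives a reliable geometric signal \(\delta\) instead of a vague phrase, which helps prevent navigation from failing early.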
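A rule-based resolver of this kind might look like the sketch below. It is an illustration under stated assumptions, not the paper's HR module: the function names, the regular expressions, the compass table, the clockwise-from-north convention, the choice of \([-\pi, \pi)\) for `wrap`, and the reading of clock terms as relative to the current heading are all assumptions.

```python
import math
import re

# Compass bearings in degrees, clockwise from north (assumed convention).
COMPASS_DEG = {
    "north": 0, "northeast": 45, "east": 90, "southeast": 135,
    "south": 180, "southwest": 225, "west": 270, "northwest": 315,
}

def wrap(angle: float) -> float:
    """Normalize an angle in radians to [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi

def resolve_heading(s_dir: str, phi: float) -> tuple:
    """Return (alpha, delta) in radians for a direction phrase and heading phi."""
    text = s_dir.lower()
    clock = re.search(r"(\d{1,2})\s*o'?clock", text)
    degree = re.search(r"(\d+(?:\.\d+)?)\s*(?:degrees?|°)", text)
    if clock:
        # Assumption: clock terms are relative to the current heading,
        # with 12 o'clock straight ahead and 30 degrees per hour.
        offset = math.radians((int(clock.group(1)) % 12) * 30)
        alpha = (phi + offset) % (2 * math.pi)
    elif degree:
        # Assumption: degree terms are absolute bearings.
        alpha = math.radians(float(degree.group(1))) % (2 * math.pi)
    else:
        # Check longer names first so "northeast" is not matched as "north".
        for name in sorted(COMPASS_DEG, key=len, reverse=True):
            if name in text:
                alpha = math.radians(COMPASS_DEG[name])
                break
        else:
            raise ValueError(f"unrecognized direction phrase: {s_dir!r}")
    return alpha, wrap(alpha - phi)
```

Under these assumptions, a drone heading of 297° and the phrase "10 o'clock" give \(\alpha = 237°\) and \(\delta = -60°\), that is, a turn of 60° to the left.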

2. Search CoT (S-CoT): Four-Step Interpretable Reasoning

To improve stability in large aerial scenes, S-CoT decomposes search into four serial steps: ① Destination Analysis extracts semantics (category, landmarks, relations) from \(s_{des}\). ② Scene Understanding builds a holistic view using multi-scale observations \(V_t\) and spatial visual memory \(M_t\). ③ Reference Grid Map Generation divides the main view into an \(N\times N\) grid with predicted labels to form structured geometric memory \(R_t\). ④ Target Localization outputs candidate regions based on visual features and destination constraints.
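Steps ③ and ④ can be illustrated with a small geometric sketch. In the paper the grid labels come from the MLLM; here they are passed in as a plain dictionary, and `grid_cells`, `candidate_region`, and the label strings are hypothetical names chosen for illustration.

```python
from typing import Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Cell = Tuple[int, int]                   # (row, col)

def grid_cells(width: float, height: float, n: int) -> Dict[Cell, Box]:
    """Split a width x height view into an n x n grid of pixel boxes."""
    cw, ch = width / n, height / n
    return {
        (r, c): (c * cw, r * ch, (c + 1) * cw, (r + 1) * ch)
        for r in range(n)
        for c in range(n)
    }

def candidate_region(
    cells: Dict[Cell, Box], labels: Dict[Cell, str], wanted: str
) -> Optional[Box]:
    """Return the box enclosing every cell labelled `wanted`, or None."""
    hits = [cells[k] for k, label in labels.items() if label == wanted]
    if not hits:
        return None
    return (
        min(b[0] for b in hits), min(b[1] for b in hits),
        max(b[2] for b in hits), max(b[3] for b in hits),
    )
```

For example, on a 500×500 view with a 5×5 grid, labelling cells (1, 2) and (1, 3) as "building" yields the candidate region (200, 100, 400, 200).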

3. Confirmation CoT (C-CoT): Fine-grained Disambiguation

C-CoT verifies candidates through an interpretable chain: it prompts the model to generate verifiable step-by-step reasoning, checking spatial and relational constraints against multi-scale views. By analyzing local structures and adjacency (e.g., verifying if a "large warehouse" actually has a "red building on its left"), it excludes false positives and outputs the final bbox \(B\) with confidence scores.
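The kind of relational check that C-CoT asks the model to reason through can be mimicked geometrically. The sketch below is only a stand-in: the paper performs this verification through MLLM prompting, whereas `has_left_neighbor`, `confirm`, the pixel threshold `max_gap`, and the use of the image x-axis for "left" are assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels

def has_left_neighbor(candidate: Box, landmarks: List[Box], max_gap: float) -> bool:
    """True if some landmark sits just left of the candidate and overlaps it vertically."""
    cx1, cy1, _, cy2 = candidate
    for _, ly1, lx2, ly2 in landmarks:
        gap = cx1 - lx2  # distance from the landmark's right edge to the candidate
        overlaps = ly1 < cy2 and ly2 > cy1
        if 0 <= gap <= max_gap and overlaps:
            return True
    return False

def confirm(candidates: List[Box], landmarks: List[Box], max_gap: float = 50.0) -> List[Box]:
    """Keep only the candidates that satisfy the 'landmark on its left' relation."""
    return [c for c in candidates if has_left_neighbor(c, landmarks, max_gap)]
```

A candidate "warehouse" box with a "red building" box 10 pixels to its left passes the check, while a candidate far from any such landmark is discarded as a false positive.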

4. Structured Spatial Memory (SSM): Explicit Spatial/History Cues

SSM addresses the lack of global context in single-frame observations through three components: ① Multi-scale Visual Observation (MVO) resamples the global map \(I\) into different scales \(V^i_t=\mathrm{Resample}(I,s_i)\) to capture both global layout and local details. ② Spatial Visual Memory (SVM) integrates historical views, trajectories, and orientations into a global coordinate canvas: \(M_t=(M_{t-1}\oplus V_t)\oplus(T_t\oplus\theta_t)\). ③ Structured Geometric Memory (SGM) prompts the model to label the \(N\times N\) grid for mid-scale views, updating a persistent memory \(R_t=\mathrm{Update}(R_{t-1}, \bar R_t)\) to provide stable structural priors for multi-step reasoning.
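Two of these components lend themselves to a short sketch. The code below assumes that multi-scale observation can be approximated by square crop windows around the current position, and that \(\mathrm{Update}\) overwrites old cell labels with new predictions; neither detail is specified in this summary, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Cell = Tuple[int, int]                   # (row, col)

def multi_scale_windows(
    center: Tuple[float, float],
    base_size: float,
    scales: Sequence[float],
    map_size: Tuple[float, float],
) -> List[Box]:
    """Return one square crop window per scale, clamped to the map bounds."""
    cx, cy = center
    width, height = map_size
    windows = []
    for s in scales:
        half = base_size * s / 2
        windows.append((
            max(0.0, cx - half), max(0.0, cy - half),
            min(width, cx + half), min(height, cy + half),
        ))
    return windows

def update_grid_memory(prev: Dict[Cell, str], new: Dict[Cell, str]) -> Dict[Cell, str]:
    """Merge new cell labels into the persistent memory (new labels win)."""
    merged = dict(prev)
    merged.update(new)
    return merged
```

Small scales keep local detail around the drone, large scales expose the global layout, and the merged grid carries labels forward so that later rounds do not start from a blank map.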

Key Experimental Results¶

Main Results¶

The method is evaluated on ANDH (sub-trajectories) and ANDH-Full (complete trajectories) using SR (Success Rate), SPL (Success weighted by inverse Path Length), and GP (Goal Progress). Results on the ANDH Unseen Val. split:

Method	Setting	SPL	SR	GP
GPT-4o	Training-Free	3.4	3.9	-11.8
Qwen-VL-Max	Training-Free	8.7	9.2	5.5
PSC-AVDN (Ours)	Training-Free	17.8	22.6	39.2
FELA	Supervised	17.2	20.6	63.0
HAA-LSTM	Supervised	18.3	20.0	54.4

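PSC-AVDN clearly outperforms the training-free baselines on all three metrics. Compared with the supervised methods listed, it has the highest SR (22.6 vs. 20.6 and 20.0) and an SPL (17.8) that falls between FELA (17.2) and HAA-LSTM (18.3). Its GP (39.2) remains below both fine-tuned models (63.0 and 54.4), although it is far above the other training-free baselines.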
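For readers unfamiliar with the metrics, the sketch below shows how they are commonly computed. It assumes the standard SPL definition and treats GP as the reduction in distance to the goal; the success criterion itself is taken as a given boolean, and the `evaluate` function and its field names are hypothetical rather than taken from the benchmark code.

```python
from typing import Dict, List

def evaluate(episodes: List[dict]) -> Dict[str, float]:
    """Compute SR (%), SPL (%), and mean GP over a list of episodes.

    Each episode is a dict with: success (bool), path_len, shortest_len (> 0),
    start_dist and end_dist (distance to the goal before and after navigation).
    """
    n = len(episodes)
    sr = sum(e["success"] for e in episodes) / n
    spl = sum(
        e["success"] * e["shortest_len"] / max(e["path_len"], e["shortest_len"])
        for e in episodes
    ) / n
    gp = sum(e["start_dist"] - e["end_dist"] for e in episodes) / n
    return {"SR": 100 * sr, "SPL": 100 * spl, "GP": gp}

episodes = [
    {"success": True, "path_len": 20, "shortest_len": 10, "start_dist": 10, "end_dist": 0},
    {"success": False, "path_len": 5, "shortest_len": 10, "start_dist": 10, "end_dist": 15},
]
print(evaluate(episodes))
```

The second episode ends farther from the goal than it started, so its goal progress is negative; this is how an average GP below zero, such as the GPT-4o row, can arise.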

Ablation Study¶

Three-stage Reasoning (ANDH Unseen Val.):

Configuration	SPL	SR	GP
Baseline (Iterative search)	8.7	9.2	5.5
+ Parsing	13.5	14.6	26.2
+ Search	15.6	17.5	25.8
+ Confirmation (Full PSC)	16.3	19.3	35.7

SSM Components (on top of PSC):

Configuration	SPL	SR	GP
w/o SSM	16.3	19.3	35.7
+ SVM	16.5	20.4	36.6
+ MVO	16.6	21.1	38.3
+ SGM (Complete SSM)	17.8	22.6	39.2

Key Findings¶

Parsing contributes the largest performance jump (SR 9.2→14.6, versus +2.9 for Search and +1.8 for Confirmation), indicating that direction decoupling is the most impactful component for MLLM navigation in this ablation.
SSM components provide cumulative gains: SVM improves temporal consistency, MVO aids multi-scale perception, and SGM optimizes spatial reasoning.
A 5×5 grid is reported as optimal: the cell size has to match the scale of aerial landmarks, so grids that are either finer or coarser degrade performance.
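The step-wise SR gains behind these findings can be recomputed directly from the two ablation tables. The snippet below only restates numbers already reported above; the helper name `stepwise_gains` is hypothetical.

```python
# SR values copied from the two ablation tables (ANDH Unseen Val.).
SR_BY_CONFIG = [
    ("Baseline", 9.2),
    ("+ Parsing", 14.6),
    ("+ Search", 17.5),
    ("+ Confirmation", 19.3),
    ("+ SVM", 20.4),
    ("+ MVO", 21.1),
    ("+ SGM", 22.6),
]

def stepwise_gains(rows):
    """Return the SR gain contributed by each added component."""
    return {
        name: round(value - prev_value, 1)
        for (_, prev_value), (name, value) in zip(rows, rows[1:])
    }

gains = stepwise_gains(SR_BY_CONFIG)
for name, gain in gains.items():
    print(f"{name}: +{gain}")
```

Parsing accounts for +5.4 SR, more than any other single component, while the three SSM components together add 3.3 points on top of the full three-stage chain.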

Highlights & Insights¶

Direction decoupling yields the largest gain: adding the Parsing stage with HR raised SR from 9.2 to 14.6 (about 1.6×) and GP from 5.5 to 26.2, suggesting that rule-based geometric parsing is far more effective than letting MLLMs "guess" orientation.
Training-free yet competitive: By relying on general LLMs/MLLMs and prompt engineering, the framework avoids task-specific training while remaining comparable to supervised methods.
Self-generated geometric memory: Unlike GeoNav which requires external models like Grounded-SAM, SGM allows the MLLM to generate its own reference grid, minimizing external dependencies.

Limitations & Future Work¶

There remains a gap in GP (Goal Progress) compared to fine-tuned methods over long-range trajectories.
High inference costs due to multi-step CoT and multi-scale view processing.
HR's rule-based approach may fail on unconventional directional expressions.
The 12 semantic categories for the grid are derived from dataset statistics and might need adjustment for specialized environments.

Comparison with Related Work¶

vs. Supervised AVDN: Supervised models (FELA, TA-GAT) are accurate but lack cross-domain adaptability; PSC-AVDN is plug-and-play with lower engineering costs.
vs. GeoNav: GeoNav relies on external cognitive maps (Grounded-SAM/OpenGIS), whereas PSC-AVDN performs structured reasoning natively within the MLLM chain.
vs. Open-Nav: While both use training-free CoT, PSC-AVDN specifically addresses high-altitude challenges like directional ambiguity and small landmarks.

Rating¶

Novelty: ⭐⭐⭐⭐ First training-free AVDN framework using structured three-stage CoT and multi-modal SSM.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive multi-stage and component ablations across two datasets.
Writing Quality: ⭐⭐⭐⭐ Clear structure and well-defined motivation-to-solution mapping.
Value: ⭐⭐⭐⭐ High practical value for UAV deployment where training resources are limited.