FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation¶
Conference: CVPR 2026 · arXiv: 2604.16298 · Code: Project Page · Area: Robotics · Keywords: UAV Navigation, Vision-Language Navigation, Cognitive Modules, Zero-shot, Hierarchical Memory
TL;DR¶
This paper proposes FineCog-Nav, a zero-shot UAV vision-language navigation framework inspired by human cognition. It decomposes navigation into seven fine-grained cognitive modules—language processing, perception, attention, memory, imagination, reasoning, and decision-making—each driven by moderate-scale foundation models, enabling long-range navigation in complex 3D environments without any training.
Background & Motivation¶
- Background: UAV vision-language navigation (VLN) requires an agent to follow ambiguous, multi-step instructions over long ranges in complex 3D environments from a first-person perspective. While zero-shot methods are relatively mature for ground-level VLN, UAV scenarios are harder due to continuous 3D motion, limited global perception, and weakly discriminative landmarks.
- Limitations of Prior Work: Existing zero-shot methods rely heavily on large models (e.g., GPT-4V); replacing them with smaller models (e.g., LLaVA-7B) causes the success rate to plummet from 28.3% to 1.7%. Most methods employ generic prompts and loosely coupled module coordination, lacking critical components such as hierarchical planning, dynamic sub-goal extraction, and memory mechanisms.
- Key Challenge: Complex UAV navigation demands deep collaboration among perception, reasoning, and decision-making, yet existing frameworks are either monolithic (one large model handles everything) or loosely coupled (insufficient inter-module interaction).
- Goal: Design a training-free modular framework that achieves interpretable and generalizable UAV navigation through the collaboration of fine-grained cognitive modules.
- Key Insight: Rather than organizing modules by agent roles, the framework organizes them by cognitive function—each module corresponds to one aspect of human cognition (language, perception, attention, memory, imagination, reasoning, decision-making)—and they collaborate via structured input-output protocols.
- Core Idea: Fine-grained modularization of cognitive functions allows each module to be implemented with a moderate-scale model paired with role-specific prompts, eliminating dependence on extremely large models, while explicit cognitive dependencies provide interpretability.
Method¶
Overall Architecture¶
A five-step cognitive workflow: ❶ Instruction parsing and sub-goal extraction → ❷ Attention-guided perception → ❸ Imagination-assisted sub-goal assessment → ❹ Multi-level memory management → ❺ Decision-making and action execution. The output of each module feeds into the next, forming a closed perception–reasoning–action loop.
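The loop is easiest to see in code. Below is a minimal Python sketch of the five-step workflow, assuming each cognitive module is a callable wrapping a prompted foundation model; all names, signatures, and the `modules`/`env` objects are hypothetical illustrations of the information flow, not the paper's API.

```python
# Hypothetical sketch of the five-step cognitive loop. Every name below
# (modules, env, method signatures) is illustrative, not the paper's code.

def navigate(instruction, env, modules, max_steps=200):
    """Closed perception-reasoning-action loop over the cognitive modules."""
    # Step 1: instruction parsing and sub-goal extraction (language module).
    sub_instructions = modules.language.parse(instruction)   # [(I_i, L_i), ...]
    memory = modules.memory.new_episode()

    obs = env.reset()
    for _ in range(max_steps):
        # Step 2: attention-guided perception.
        queries = modules.attention.queries(sub_instructions, memory)
        scene = modules.perception.describe(obs, queries)

        # Step 3: imagination-assisted sub-goal assessment.
        expected = modules.imagination.expected_scene(memory.current_subgoal())
        subgoal_done = modules.reasoning.assess(scene, expected, memory)

        # Step 4: multi-level memory management.
        memory.record_step(scene)
        if subgoal_done:
            memory.consolidate_subgoal()  # compress steps into a summary M_star

        # Step 5: decision-making and action execution.
        action = modules.decision.act(scene, memory)
        obs, finished = env.step(action)
        if finished:
            return True
    return False
```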
Key Designs¶
- Hierarchical Instruction Decomposition (Language Processing Module):
- Function: Decomposes complex navigation instructions into executable sub-goal sequences.
- Mechanism: An instruction parser \(\mathcal{S}\) splits instruction \(I\) into sequential sentences, each paired with an associated landmark: \(\{(I_i, L_i)\}\). A sub-goal extractor \(\mathcal{E}\) further dynamically generates sub-goal lists \(\{g_i^{(k)}\}_{k=1}^K\) based on current environmental observations, prioritizing execution order over syntactic structure.
- Design Motivation: UAV navigation instructions are typically long and multi-step; processing them directly leads to planning failures. Hierarchical decomposition reduces planning complexity.
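A hedged sketch of what the parser \(\mathcal{S}\) and extractor \(\mathcal{E}\) might look like when implemented as prompted LLM calls; the prompt wording and the `llm` callable are illustrative assumptions, not the paper's prompts.

```python
import json

# Illustrative prompts; the paper's role-specific prompts are not reproduced
# in this summary, so the wording below is an assumption.
PARSE_PROMPT = (
    "Split the navigation instruction into ordered steps and name the landmark "
    "each step refers to. Return a JSON list of objects with keys "
    "'step' and 'landmark'.\nInstruction: {instruction}"
)
EXTRACT_PROMPT = (
    "Current scene: {scene}\nStep: {step} (landmark: {landmark})\n"
    "List, in execution order, the concrete sub-goals needed to complete this step."
)

def parse_instruction(llm, instruction):
    """Parser S: split instruction I into ordered (I_i, L_i) pairs."""
    reply = llm(PARSE_PROMPT.format(instruction=instruction))
    return [(item["step"], item["landmark"]) for item in json.loads(reply)]

def extract_subgoals(llm, step, landmark, scene):
    """Extractor E: derive sub-goals g_i^(k) conditioned on the current observation."""
    reply = llm(EXTRACT_PROMPT.format(scene=scene, step=step, landmark=landmark))
    return [line.lstrip("-0123456789. ").strip()
            for line in reply.splitlines() if line.strip()]
```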
- Attention-Guided Perception + Imagination-Assisted Sub-goal Assessment:
- Function: Focuses on task-relevant information and determines sub-goal completion.
- Mechanism: The attention module identifies key landmarks \(\{L_i, L_{i+1}\}\) from the current and next instructions and generates targeted queries \(\{Q_i\}\). The perception module describes the current scene under attention guidance. The imagination module generates an expected scene description \(R^{[g_i^{(k)}]}\) upon sub-goal completion—not open-ended scene generation, but landmark-centric constrained description to reduce hallucination. The sub-goal assessor integrates observations, sub-goal memory, and imagined references to determine completion.
- Design Motivation: Unguided perception is easily distracted by irrelevant details; the imagination module provides a reference for "what I expect to see," improving assessment accuracy.
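One plausible reading of the imagination-assessment pairing is two chained LLM calls, sketched below; the prompt texts and the yes/no protocol are assumptions, since the summary above only fixes the information flow.

```python
# Hedged sketch: landmark-centric imagination followed by completion assessment.
# Prompt texts and the yes/no protocol are assumptions, not the paper's prompts.

IMAGINE_PROMPT = (
    "In two or three sentences centred on the landmark '{landmark}', describe "
    "what the UAV camera should see once sub-goal '{subgoal}' is complete. "
    "Mention only objects related to the landmark."  # constrains hallucination
)
ASSESS_PROMPT = (
    "Expected view: {expected}\nActual view: {actual}\n"
    "Sub-goal memory: {memory}\n"
    "Is the sub-goal complete? Answer 'yes' or 'no', then give one reason."
)

def assess_subgoal(llm, subgoal, landmark, actual_scene, subgoal_memory):
    """Compare the imagined reference R^[g_i^(k)] against the attended observation."""
    expected = llm(IMAGINE_PROMPT.format(landmark=landmark, subgoal=subgoal))
    verdict = llm(ASSESS_PROMPT.format(expected=expected, actual=actual_scene,
                                       memory=subgoal_memory))
    return verdict.strip().lower().startswith("yes"), expected
```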
- Multi-level Memory Management:
- Function: Maintains temporal and contextual consistency throughout long-range navigation.
- Mechanism: A three-tier memory structure: step memory \(M^{[t]}\) (observations and actions at each step) → sub-goal memory \(M^{[g_i^{(k)}]}\) (compressed by an LLM into a summary \(M_\star\) upon sub-goal completion) → instruction memory \(M^{[I_i]}\) (aggregates summaries of completed sub-goals). Inspired by the human memory consolidation process.
- Design Motivation: Flat history (as in NavGPT) leads to information overload and noise in long-horizon tasks. Hierarchical memory filters local noise while preserving global context.
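The three-tier structure maps naturally onto a small class. This is a minimal sketch assuming an `llm` callable for summary compression; the class and method names are invented, only the step → sub-goal → instruction structure comes from the paper.

```python
# Minimal three-tier memory sketch. Names are invented; only the structure
# (step -> sub-goal -> instruction) follows the paper's description.

class HierarchicalMemory:
    def __init__(self, llm):
        self.llm = llm
        self.step_memory = []         # M^[t]: raw per-step observations and actions
        self.subgoal_summaries = []   # M_star entries for completed sub-goals
        self.instruction_memory = []  # M^[I_i]: aggregated sub-goal summaries

    def record_step(self, observation, action):
        self.step_memory.append({"obs": observation, "action": action})

    def consolidate_subgoal(self, subgoal):
        """On sub-goal completion, compress raw steps into one summary M_star."""
        transcript = "\n".join(f"{s['action']}: {s['obs']}" for s in self.step_memory)
        summary = self.llm(
            f"Summarise in one sentence the progress made towards '{subgoal}':\n"
            f"{transcript}")
        self.subgoal_summaries.append(summary)
        self.step_memory.clear()      # filter local noise, keep the gist

    def consolidate_instruction(self):
        """On instruction completion, roll summaries up into M^[I_i]."""
        self.instruction_memory.append(" ".join(self.subgoal_summaries))
        self.subgoal_summaries.clear()
```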
Loss & Training¶
Fully zero-shot; no training is required. Each module uses carefully designed role-specific prompts to drive moderate-scale foundation models (e.g., Qwen2.5-VL-32B alongside various 8B–32B LLMs).
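For concreteness, a module-to-backbone assignment might look like the mapping below. Only the model names come from the paper; the per-module pairing is an assumption (Qwen2.5-VL-32B presumably serves the vision-facing perception module, since it is the only vision-language model named).

```python
# Illustrative module-to-backbone mapping; the per-module assignment here
# is an assumption, not the paper's stated configuration.
MODULE_BACKBONES = {
    "perception":  "Qwen2.5-VL-32B",  # vision-language model reads the camera view
    "language":    "Qwen3-32B",       # text-only modules can use 8B-32B LLMs
    "attention":   "Qwen3-32B",
    "imagination": "Qwen3-32B",
    "memory":      "Qwen3-32B",
    "reasoning":   "Qwen3-32B",
    "decision":    "Qwen3-32B",
}
```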
Key Experimental Results¶
Main Results¶
AerialVLN-Fine (300 trajectories):
| LLM Backbone | Method | SR3D↑ | NE↓ | nDTW↑ |
|---|---|---|---|---|
| Qwen3-32B | BaseModel | 3.00% | 142.72m | 17.07% |
| Qwen3-32B | FineCog-Nav | 4.00% | 95.31m | 20.31% |
| GPT-4o-mini | BaseModel | 0.33% | 325.98m | 8.74% |
| GPT-4o-mini | FineCog-Nav | 2.33% | 100.37m | 20.45% |
| ChatGLM-4-32B | BaseModel | 2.00% | 180.66m | 10.59% |
| ChatGLM-4-32B | FineCog-Nav | 2.33% | 94.18m | 21.25% |
Ablation Study¶
| Configuration | SR3D↑ | nDTW↑ | Notes |
|---|---|---|---|
| FineCog-Nav (full) | 4.00% | 20.31% | All cognitive modules |
| Flat history replaces hierarchical memory | ~2% | ~15% | Significant degradation |
| Imagination module removed | ~3% | ~17% | Inaccurate sub-goal assessment |
| Attention module removed | ~3% | ~16% | Perception distracted by irrelevant information |
Key Findings¶
- FineCog-Nav consistently outperforms the base models across all LLM backbones, achieving significant gains even with small 8B models.
- Navigation error is reduced by up to 69%: NE for GPT-4o-mini drops from 325.98 m to 100.37 m.
- Hierarchical memory is the most critical module: ablation experiments show severe performance degradation when replaced by flat history.
Highlights & Insights¶
- Organizing modules by cognitive function rather than agent role is the central design philosophy: this differs from role assignment in multi-agent systems and instead simulates the cognitive processes underlying human navigation, yielding superior interpretability.
- The imagination module represents an intriguing innovation: generating an "expected scene" as a reference for sub-goal completion assessment, analogous to human mental simulation. Constraining generation to landmark-centric descriptions rather than open-ended generation is key to reducing hallucination.
- The AerialVLN-Fine dataset fills the gap in the absence of high-quality fine-grained evaluation benchmarks for UAV VLN.
Limitations & Future Work¶
- Absolute success rates remain low (4% at best), indicating that zero-shot UAV VLN remains an extremely challenging problem.
- The multi-module pipeline introduces additional inference overhead and the risk of error propagation across modules.
- Validation is conducted only in the AerialVLN simulator; no real-world UAV testing has been performed.
- The safety module relies on simple depth-based geometric heuristics, which may be insufficient in complex obstacle scenarios.
- Future work may explore adaptive inter-module collaboration and end-to-end optimization.
Related Work & Insights¶
- vs. NavGPT: NavGPT uses a single LLM to handle all navigation decisions. FineCog-Nav decomposes the task into specialized cognitive modules, enabling moderate-scale models to accomplish what would otherwise require large models.
- vs. SPF (See, Point, Fly): SPF primarily enhances visual localization. FineCog-Nav provides a more complete cognitive framework, incorporating higher-order capabilities such as memory and imagination.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ The design philosophy of modularizing cognitive functions is both novel and conceptually deep.
- Experimental Thoroughness: ⭐⭐⭐⭐ Six LLM backbones, a self-constructed high-quality benchmark, and ablation analysis.
- Writing Quality: ⭐⭐⭐⭐ The framework is described clearly, and the information flow diagrams between cognitive modules are intuitive.
- Value: ⭐⭐⭐⭐ Provides a scalable modular framework for zero-shot UAV navigation.