FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

Conference: CVPR 2026
arXiv: 2604.16298
Code: Project Page
Area: Robotics
Keywords: UAV Navigation, Vision-Language Navigation, Cognitive Modules, Zero-shot, Hierarchical Memory

TL;DR

This paper proposes FineCog-Nav, a zero-shot UAV vision-language navigation framework inspired by human cognition. It decomposes navigation into seven fine-grained cognitive modules—language processing, perception, attention, memory, imagination, reasoning, and decision-making—each driven by moderate-scale foundation models, enabling long-range navigation in complex 3D environments without any training.

Background & Motivation

  1. Background: UAV vision-language navigation (VLN) requires an agent to follow ambiguous, multi-step instructions for long-range navigation in complex 3D environments from a first-person perspective. While zero-shot methods are relatively mature for ground-level VLN, UAV scenarios pose greater challenges due to continuous 3D motion, limited global perception, and weak landmark discriminability.
  2. Limitations of Prior Work: Existing zero-shot methods rely heavily on large models (e.g., GPT-4V); replacing them with smaller models (e.g., LLaVA-7B) causes the success rate to plummet from 28.3% to 1.7%. Most methods employ generic prompts and loosely coupled module coordination, lacking critical components such as hierarchical planning, dynamic sub-goal extraction, and memory mechanisms.
  3. Key Challenge: Complex UAV navigation demands deep collaboration among perception, reasoning, and decision-making, yet existing frameworks are either monolithic (one large model handles everything) or loosely coupled (insufficient inter-module interaction).
  4. Goal: Design a training-free modular framework that achieves interpretable and generalizable UAV navigation through the collaboration of fine-grained cognitive modules.
  5. Key Insight: Rather than organizing modules by agent roles, the framework organizes them by cognitive function—each module corresponds to one aspect of human cognition (language, perception, attention, memory, imagination, reasoning, decision-making)—and they collaborate via structured input-output protocols.
  6. Core Idea: Fine-grained modularization of cognitive functions allows each module to be implemented with a moderate-scale model paired with role-specific prompts, eliminating dependence on extremely large models, while explicit cognitive dependencies provide interpretability.

Method

Overall Architecture

A five-step cognitive workflow: ❶ Instruction parsing and sub-goal extraction → ❷ Attention-guided perception → ❸ Imagination-assisted sub-goal assessment → ❹ Multi-level memory management → ❺ Decision-making and action execution. The output of each module feeds into the next, forming a closed perception–reasoning–action loop.
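The five-step loop above can be sketched as plain data flow between modules. Everything here is a hypothetical placeholder (the function names, the `modules` dict, the flat `trace` standing in for memory), not the paper's actual API; the point is how each module's output feeds the next.

```python
# Hedged sketch of FineCog-Nav's closed perception-reasoning-action loop.
# All module callables are stand-ins for the paper's cognitive modules.

def run_episode(instruction, observe, step, modules, max_steps=50):
    trace = []                                           # stand-in for memory
    sub_goals = modules["language"](instruction)         # 1. parse sub-goals
    for goal in sub_goals:
        for _ in range(max_steps):
            obs = observe()
            queries = modules["attention"](goal)         # 2. attention-guided
            scene = modules["perception"](obs, queries)  #    perception
            expected = modules["imagination"](goal)      # 3. imagined reference
            trace.append(scene)                          # 4. record step
            if modules["assess"](scene, expected):       #    sub-goal done?
                break
            step(modules["decision"](scene, goal))       # 5. act in environment
    return trace
```

A toy 1-D environment (position on a line, sub-goals as target coordinates) is enough to exercise the loop end to end.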

Key Designs

  1. Hierarchical Instruction Decomposition (Language Processing Module):

    • Function: Decomposes complex navigation instructions into executable sub-goal sequences.
    • Mechanism: An instruction parser \(\mathcal{S}\) splits instruction \(I\) into sequential sentences, each paired with an associated landmark: \(\{(I_i, L_i)\}\). A sub-goal extractor \(\mathcal{E}\) further dynamically generates sub-goal lists \(\{g_i^{(k)}\}_{k=1}^K\) based on current environmental observations, prioritizing execution order over syntactic structure.
    • Design Motivation: UAV navigation instructions are typically long and multi-step; processing them directly leads to planning failures. Hierarchical decomposition reduces planning complexity.
  2. Attention-Guided Perception + Imagination-Assisted Sub-goal Assessment:

    • Function: Focuses on task-relevant information and determines sub-goal completion.
    • Mechanism: The attention module identifies key landmarks \(\{L_i, L_{i+1}\}\) from the current and next instructions and generates targeted queries \(\{Q_i\}\). The perception module describes the current scene under attention guidance. The imagination module generates an expected scene description \(R^{[g_i^{(k)}]}\) upon sub-goal completion—not open-ended scene generation, but landmark-centric constrained description to reduce hallucination. The sub-goal assessor integrates observations, sub-goal memory, and imagined references to determine completion.
    • Design Motivation: Unguided perception is easily distracted by irrelevant details; the imagination module provides a reference for "what I expect to see," improving assessment accuracy.
  3. Multi-level Memory Management:

    • Function: Maintains temporal and contextual consistency throughout long-range navigation.
    • Mechanism: A three-tier memory structure: step memory \(M^{[t]}\) (observations and actions at each step) → sub-goal memory \(M^{[g_i^{(k)}]}\) (compressed by an LLM into a summary \(M_\star\) upon sub-goal completion) → instruction memory \(M^{[I_i]}\) (aggregates summaries of completed sub-goals). Inspired by the human memory consolidation process.
    • Design Motivation: Flat history (as in NavGPT) leads to information overload and noise in long-horizon tasks. Hierarchical memory filters local noise while preserving global context.
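The three-tier consolidation described in design 3 can be illustrated with a small container: raw steps accumulate, get compressed into a sub-goal summary \(M_\star\) on completion, and summaries then aggregate at the instruction level. This is a minimal sketch under stated assumptions; the class, its method names, and the mocked summarizer are invented here, not the paper's implementation (which uses an LLM as the compressor).

```python
# Hedged sketch of the three-tier memory: step -> sub-goal -> instruction.
# `summarize` stands in for the LLM that compresses step memory into M_*.

class HierarchicalMemory:
    def __init__(self, summarize):
        self.summarize = summarize      # LLM compressor (mocked in tests)
        self.step_memory = []           # M^[t]: per-step observations/actions
        self.subgoal_memory = []        # M^[g]: compressed sub-goal summaries
        self.instruction_memory = []    # M^[I]: aggregated per instruction

    def record_step(self, obs, action):
        self.step_memory.append((obs, action))

    def complete_subgoal(self, goal):
        # Consolidate raw steps into one summary, then drop the local noise.
        self.subgoal_memory.append(self.summarize(goal, self.step_memory))
        self.step_memory = []

    def complete_instruction(self, instruction):
        self.instruction_memory.append(
            (instruction, list(self.subgoal_memory)))
        self.subgoal_memory = []
```

Clearing `step_memory` at each consolidation is what filters local noise while the summaries preserve global context, mirroring the flat-history comparison above.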

Loss & Training

Fully zero-shot; no training is required. Each module uses carefully designed role-specific prompts to drive moderate-scale foundation models (e.g., Qwen2.5-VL-32B alongside various 8B–32B LLMs).
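To make "role-specific prompts" concrete, here is an illustrative template pair in the spirit of the paper's per-module prompting. The wording and the `build_prompt` helper are invented for illustration and are not taken from the paper.

```python
# Illustrative role-specific prompt templates (hypothetical wording).
ROLE_PROMPTS = {
    "attention": (
        "You are the attention module of a UAV navigator. Given the current "
        "instruction '{instruction}', list only the landmarks and queries "
        "relevant to the next action."
    ),
    "imagination": (
        "You are the imagination module. Describe, centered on the landmark "
        "'{landmark}', what the camera should see once sub-goal '{goal}' is "
        "complete. Do not invent objects beyond the landmark context."
    ),
}

def build_prompt(role, **fields):
    """Fill a role template with task-specific fields."""
    return ROLE_PROMPTS[role].format(**fields)
```

Note how the imagination template encodes the landmark-centric constraint from the method section, which is what keeps generation from drifting into open-ended (hallucination-prone) scene description.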

Key Experimental Results

Main Results

AerialVLN-Fine (300 trajectories):

| LLM Backbone | Method | SR3D↑ | NE↓ | nDTW↑ |
|---|---|---|---|---|
| Qwen3-32B | BaseModel | 3.00% | 142.72m | 17.07% |
| Qwen3-32B | FineCog-Nav | 4.00% | 95.31m | 20.31% |
| GPT-4o-mini | BaseModel | 0.33% | 325.98m | 8.74% |
| GPT-4o-mini | FineCog-Nav | 2.33% | 100.37m | 20.45% |
| ChatGLM-4-32B | BaseModel | 2.00% | 180.66m | 10.59% |
| ChatGLM-4-32B | FineCog-Nav | 2.33% | 94.18m | 21.25% |

Ablation Study

| Configuration | SR3D | nDTW | Notes |
|---|---|---|---|
| FineCog-Nav (full) | 4.00% | 20.31% | All cognitive modules |
| Flat history replaces hierarchical memory | ~2% | ~15% | Significant degradation |
| Imagination module removed | ~3% | ~17% | Inaccurate sub-goal assessment |
| Attention module removed | ~3% | ~16% | Perception distracted by irrelevant information |

Key Findings

  • FineCog-Nav consistently outperforms baselines across all LLM backbones: significant gains are achieved even with small 8B models.
  • Navigation error drops sharply: NE for GPT-4o-mini falls from 325.98m to 100.37m (−69%).
  • Hierarchical memory is the most critical module: ablation experiments show severe performance degradation when replaced by flat history.

Highlights & Insights

  • Organizing modules by cognitive function rather than agent role is the central design philosophy: this differs from role assignment in multi-agent systems and instead simulates the cognitive processes underlying human navigation, yielding superior interpretability.
  • The imagination module represents an intriguing innovation: generating an "expected scene" as a reference for sub-goal completion assessment, analogous to human mental simulation. Constraining generation to landmark-centric descriptions rather than open-ended generation is key to reducing hallucination.
  • The AerialVLN-Fine dataset fills the gap in the absence of high-quality fine-grained evaluation benchmarks for UAV VLN.

Limitations & Future Work

  • Absolute success rates remain low (4% at best), indicating that zero-shot UAV VLN remains an extremely challenging problem.
  • The multi-module pipeline introduces additional inference overhead and the risk of error propagation across modules.
  • Validation is conducted only in the AerialVLN simulator; no real-world UAV testing has been performed.
  • The safety module relies on simple depth-based geometric heuristics, which may be insufficient in complex obstacle scenarios.
  • Future work may explore adaptive inter-module collaboration and end-to-end optimization.
Comparison with Related Work

  • vs. NavGPT: NavGPT uses a single LLM to handle all navigation decisions. FineCog-Nav decomposes the task into specialized cognitive modules, enabling moderate-scale models to accomplish what would otherwise require large models.
  • vs. SPF (See, Point, Fly): SPF primarily enhances visual localization. FineCog-Nav provides a more complete cognitive framework, incorporating higher-order capabilities such as memory and imagination.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ The design philosophy of modularizing cognitive functions is both novel and conceptually deep.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Six LLM backbones, a self-constructed high-quality benchmark, and ablation analysis.
  • Writing Quality: ⭐⭐⭐⭐ The framework is described clearly, and the information flow diagrams between cognitive modules are intuitive.
  • Value: ⭐⭐⭐⭐ Provides a scalable modular framework for zero-shot UAV navigation.