
AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild

Conference: ICLR 2026 · arXiv: 2602.09657 · Code: https://xiaolousun.github.io/AutoFly · Area: Remote Sensing · Keywords: VLA, UAV navigation, pseudo-depth, autonomous navigation, sim-to-real

TL;DR

This paper proposes AutoFly, an end-to-end VLA model for UAV autonomous navigation in the wild. It infers spatial information from RGB inputs alone via a pseudo-depth encoder and is trained on a newly constructed autonomous-navigation dataset (13K+ trajectories, including 1K real flights). In both simulated and real environments, AutoFly improves success rate by 3.9 points and lowers collision rate by 2.6 points relative to OpenVLA.

Background & Motivation

Background: UAV Vision-Language Navigation (VLN) primarily relies on detailed step-by-step instructions to fly along predefined routes, performing well in controlled environments.

Limitations of Prior Work: Real-world outdoor exploration takes place in unknown environments where detailed navigation instructions are unavailable; only coarse-grained directional or positional guidance can be provided. Existing methods assume complete environmental knowledge and detailed instructions, which does not hold in practice. Furthermore, existing datasets overemphasize instruction following rather than autonomous decision-making and lack real-world data.

Key Challenge: VLA approaches designed for 2D ground robots are ill-suited for UAV navigation in 3D space—UAVs require accurate depth estimation, omnidirectional obstacle avoidance, and altitude control, capabilities that RGB-only spatial reasoning cannot adequately support.

Goal: Enable UAVs to perform autonomous navigation, obstacle avoidance, and target recognition given only coarse-grained guidance (e.g., "fly toward that tree").

Key Insight: Introduce a pseudo-depth encoder to enhance spatial understanding without additional depth sensors, and construct a navigation dataset that emphasizes autonomous behavior modeling.

Core Idea: Augment a VLA model with pseudo-depth and a dedicated autonomous navigation dataset to elevate UAVs from instruction following to autonomous navigation.

Method

Overall Architecture

AutoFly takes RGB observations and natural-language instructions as input. A LLaVA-based vision-language model, combined with a pseudo-depth encoder, produces visual and depth-aware features; these are fused and fed into the LLM, which autoregressively emits discrete action tokens. A de-tokenizer then maps these tokens to continuous velocity commands.
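To make the dataflow concrete, here is a minimal runnable PyTorch sketch of this pipeline. Everything below (module sizes, the toy patch encoders, fusion by token concatenation, the omission of language tokens) is an illustrative assumption mirroring the description above, not the authors' implementation. Note that a single projector instance is applied to both token streams, which is the Siamese parameter sharing described under Key Designs.

```python
# Illustrative sketch only: toy stand-ins for the real components
# (vision tower, LLaMA-2 backbone, Depth Anything V2 pseudo-depth).
import torch
import torch.nn as nn

D = 512  # toy hidden size; the real backbone is a 7B-scale LLM

class AutoFlySketch(nn.Module):
    def __init__(self, n_action_bins=256):
        super().__init__()
        self.vision_patcher = nn.Conv2d(3, D, kernel_size=16, stride=16)  # stand-in vision tower
        self.depth_patcher = nn.Conv2d(1, D, kernel_size=16, stride=16)   # patch-embeds the pseudo-depth map
        # One projector applied to BOTH streams = Siamese parameter sharing
        self.projector = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)        # stand-in for the LLM
        self.action_head = nn.Linear(D, n_action_bins)                    # logits over 256 action tokens

    def forward(self, rgb, pseudo_depth):
        # (B, 3, H, W) -> (B, N, D) visual tokens; same for the depth stream
        v = self.vision_patcher(rgb).flatten(2).transpose(1, 2)
        d = self.depth_patcher(pseudo_depth).flatten(2).transpose(1, 2)
        v, d = self.projector(v), self.projector(d)       # shared weights align the two modalities
        fused = self.backbone(torch.cat([v, d], dim=1))   # language tokens omitted for brevity
        return self.action_head(fused[:, -1])             # next-action-token logits

model = AutoFlySketch()
rgb = torch.randn(1, 3, 224, 224)
depth = torch.randn(1, 1, 224, 224)  # in the paper, produced by Depth Anything V2 from the RGB frame
print(model(rgb, depth).shape)       # torch.Size([1, 256])
```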

Key Designs

  1. Pseudo-Depth Encoder:

    • Function: Infers depth from monocular RGB input and encodes it into a spatial representation aligned with visual tokens.
    • Mechanism: Depth Anything V2 generates pseudo-depth maps from the RGB images. A patch embedding and a depth projector then map the depth tokens into the visual feature space, while a Siamese MLP projector (sharing parameters with its visual counterpart) enforces a consistent mapping between depth and visual features.
    • Design Motivation: Avoids the use of real depth cameras for two reasons: (1) AirSim depth maps are overly idealized, introducing a sim-to-real gap; (2) it reduces UAV payload and hardware cost. The Siamese architecture implicitly regularizes the two modalities via parameter sharing to prevent representational divergence.
  2. Autonomous Navigation Dataset Construction:

    • Function: Constructs a dataset of 13K+ trajectories emphasizing autonomous behavior rather than instruction following.
    • Mechanism: (1) Ground-truth trajectories are generated by a collection agent trained with SAC reinforcement learning (95% success rate), supplemented with expert demonstrations; (2) data is gathered across 12 AirSim environments plus 1K real flight recordings; (3) trajectory rebalancing: a segmentation function splits each trajectory into obstacle-avoidance and target-seeking phases so that their training proportions can be balanced (sketched after this list).
    • Design Motivation: Addresses two limitations of existing datasets—over-reliance on instruction following and lack of real-world data. Rebalancing mitigates the class imbalance caused by obstacle avoidance behavior dominating long-horizon navigation.
  3. Two-Stage Training Strategy:

    • Stage 1: Vision-language alignment (initialized with Prismatic-VLMs).
    • Stage 2: Action fine-tuning with depth information (LLM backbone lr=2e-5, depth projector lr=1e-4, 80K steps).
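As referenced in Key Design 2, here is a minimal sketch of phase-based trajectory rebalancing. The segmentation criterion (a distance-to-nearest-obstacle threshold) and the 50/50 target mix are assumptions for illustration; the paper's actual segmentation function is not reproduced here.

```python
import random

def split_phases(traj, obstacle_dist_thresh=3.0):
    """Split one trajectory's steps into obstacle-avoidance vs. target-seeking phases.
    Assumes each step records a distance to the nearest obstacle (hypothetical field)."""
    avoid = [s for s in traj if s["obstacle_dist"] < obstacle_dist_thresh]
    seek = [s for s in traj if s["obstacle_dist"] >= obstacle_dist_thresh]
    return avoid, seek

def rebalance(trajectories, seed=0):
    """Subsample the over-represented phase to a 50/50 mix (assumed target ratio),
    so obstacle avoidance does not dominate long-horizon training."""
    rng = random.Random(seed)
    avoid, seek = [], []
    for traj in trajectories:
        a, s = split_phases(traj)
        avoid.extend(a)
        seek.extend(s)
    n = min(len(avoid), len(seek))
    return rng.sample(avoid, n) + rng.sample(seek, n)
```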

Loss & Training

A standard cross-entropy loss over autoregressive action-token prediction is used. The last 256 tokens of the LLaMA-2 vocabulary serve as the action-token mapping space.
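Concretely, this is the OpenVLA-style action discretization: each continuous control dimension is quantized into one of 256 bins, each bin aliased to one of the reserved token ids. Below is a minimal de-tokenizer sketch; the exact token-id mapping, the per-axis velocity range, and the uniform bin layout are assumptions for illustration, not values from the paper.

```python
import numpy as np

VOCAB_SIZE = 32000                         # LLaMA-2 vocabulary size
N_BINS = 256                               # last 256 token ids repurposed as action tokens
ACTION_TOKEN_START = VOCAB_SIZE - N_BINS   # ids 31744..31999 (assumed mapping)
V_MIN, V_MAX = -1.0, 1.0                   # hypothetical per-axis velocity range, m/s

def detokenize(token_ids):
    """Map discrete action tokens back to continuous velocities (one per axis).
    Uniform bin centers are an assumption; the paper's de-tokenizer may differ."""
    bins = np.asarray(token_ids) - ACTION_TOKEN_START          # bin indices 0..255
    return V_MIN + (bins + 0.5) / N_BINS * (V_MAX - V_MIN)     # bin centers

# Example: three tokens -> a (vx, vy, vz) velocity command
print(detokenize([31744, 31872, 31999]))  # ≈ [-0.996, 0.004, 0.996] m/s
```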

Key Experimental Results

Main Results

Overall performance (all values in %):

| Method  | SR↑  | CR↓  | PER↑ |
|---------|------|------|------|
| RT-1    | 24.3 | 65.1 | 61.1 |
| RT-2    | 41.9 | 26.0 | 73.7 |
| OpenVLA | 44.0 | 24.5 | 75.1 |
| AutoFly | 47.9 | 21.9 | 77.3 |

Sim-to-Real Transfer in Real Environments

Real-world evaluation under different simulation-to-real data mixes (all values in %):

| Scene   | Sim:Real ratio | SR↑ | CR↓ | PER↑ |
|---------|----------------|-----|-----|------|
| Indoor  | 0K : 1K        | 10  | 40  | 61.1 |
| Indoor  | 10K : 1K       | 60  | 30  | 76.5 |
| Outdoor | 10K : 1K       | 55  | 35  | 75.1 |

Key Findings

  • The pseudo-depth encoder yields a 3.9-point SR improvement and a 2.6-point CR reduction over the depth-free OpenVLA baseline, with the largest gains in dense-obstacle environments.
  • The Siamese projector outperforms both the non-Siamese variant and direct depth fusion, confirming that parameter sharing enforces consistent feature mapping.
  • The 5-point SR gap between indoor (60%) and outdoor (55%) real-world results suggests reasonable environmental adaptability.
  • Increasing simulation data volume consistently improves real-world performance (10% → 25% → 60% SR), validating the strategy of large-scale simulation plus limited real data.
  • Trajectory rebalancing is critical for training: the raw behavior distribution is skewed toward obstacle avoidance (KL divergence ≈ 0.36), and training on the imbalanced data induces learning bias (toy computation below).
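For intuition on the KL figure above, here is a toy computation with invented phase proportions (the paper's actual distributions are not given in this note), showing how a skew toward one phase yields a nonzero KL divergence from a balanced target.

```python
import math

# Hypothetical phase distributions over (obstacle avoidance, target seeking)
p_raw = [0.8, 0.2]       # raw trajectories: avoidance dominates
q_balanced = [0.5, 0.5]  # target proportions after rebalancing

kl = sum(p * math.log(p / q) for p, q in zip(p_raw, q_balanced))
print(f"KL(p || q) = {kl:.3f} nats")  # 0.193 for these toy numbers
```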

Highlights & Insights

  • Paradigm shift from instruction following to autonomous navigation: Existing UAV VLN research focuses on step-by-step flight; this work is the first systematic study of coarse-grained goal-directed autonomous navigation, which more closely reflects real operational needs.
  • Pseudo-depth as an elegant engineering choice: Replacing depth cameras with Depth Anything V2 pseudo-depth sidesteps the sim-to-real gap of idealized simulator depth maps while reducing payload and hardware cost.
  • General applicability of the trajectory rebalancing strategy: Behavioral distribution imbalance in long-horizon control tasks is a pervasive problem; the phase-based rebalancing approach is transferable to other settings.

Limitations & Future Work

  • Absolute success rates remain modest (47.9% in simulation, 55–60% in real environments), leaving a considerable gap to practical deployment.
  • The action space is limited to 3-DOF (linear velocity), with no handling of attitude angle control.
  • The dataset scale is relatively small (13K trajectories vs. OpenFly's 100K), and language instructions are brief (avg. 12 words).
  • The depth encoder depends on the quality of Depth Anything V2; depth estimation may degrade in extreme environments.

Comparison with Related Work

  • vs. OpenVLA: AutoFly augments OpenVLA with a pseudo-depth encoder and a navigation-specific dataset, achieving consistent improvements across all metrics.
  • vs. AerialVLN/OpenUAV: These datasets emphasize instruction following, with average instructions of 83–104 words; the AutoFly dataset uses only 12 words on average, better reflecting coarse-grained real-world guidance.
  • vs. training-free methods (VLM zero-shot): No fine-tuning is required, but these approaches lack the high-frequency reactive control needed in dense obstacle environments.

Rating

  • Novelty: ⭐⭐⭐⭐ The autonomous navigation paradigm and pseudo-depth design are original contributions.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers simulation and real environments with extensive ablations, though absolute performance remains limited.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure with detailed dataset description.
  • Value: ⭐⭐⭐⭐ An important exploration in the direction of UAV autonomous navigation.