FlySearch: Exploring how vision-language models explore¶
- Conference: NeurIPS 2025 (Datasets and Benchmarks Track)
- arXiv: 2506.02896
- Code: Available (environment, scenes, and codebase publicly released)
- Area: Multimodal VLM / Embodied Intelligence / Benchmarking
- Keywords: vision-language models, object navigation, exploration capability, 3D environments, UAV
TL;DR¶
FlySearch introduces a photorealistic 3D outdoor environment built on Unreal Engine 5 to evaluate the exploration capabilities of VLMs. Results reveal that state-of-the-art VLMs fail to reliably complete even simple search tasks, and the performance gap relative to humans widens dramatically as task difficulty increases.
Background & Motivation¶
Vision-language models (VLMs) have demonstrated strong performance on tasks such as image captioning and VQA, yet their capacity for active exploration in unstructured real-world environments remains largely unknown. Object navigation (ObjectNav) — requiring an agent to locate a specified target and navigate to it within a simulated environment — is a critical capability for evaluating embodied intelligence. However, existing benchmarks suffer from several limitations:
Environmental constraints: Most ObjectNav benchmarks (Habitat, AI2-THOR) focus exclusively on indoor scenes.
System evaluation vs. model evaluation: Prior work predominantly integrates VLMs as components within complex systems rather than directly assessing VLMs' intrinsic exploration capabilities.
Limited open-vocabulary, zero-shot evaluation: Most approaches rely on task-specific object detectors, making them ill-suited for open-world search scenarios.
Insufficient visual realism: Some benchmarks employ simplified rendering pipelines.
FlySearch addresses these gaps by placing VLMs in control of an unmanned aerial vehicle (UAV) tasked with searching for targets across large-scale outdoor scenes, thereby systematically evaluating their capacity to formulate and execute exploration strategies.
Method¶
Overall Architecture¶
The FlySearch system comprises three core components:
- Simulator: A high-fidelity rendering environment based on Unreal Engine 5, supporting dynamic lighting, wind variation, and procedural generation.
- Evaluation Controller: A Python-based module that manages scene generation, VLM communication, and result aggregation.
- Scene Generator: Procedurally generates an unlimited number of test scenarios.
Evaluation task procedure:
- The model receives a text prompt describing the target to be found.
- At each step, the model receives a 500×500-pixel top-down RGB image from the UAV (with a coordinate grid overlay).
- The model outputs a relative displacement command in the form <action>(X, Y, Z)</action>.
- Upon locating the target, the model outputs FOUND to terminate the search.
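The interaction protocol above boils down to parsing free-form VLM replies into either a displacement or a termination signal. The sketch below shows one way this parsing could look; the helper name and regex are illustrative assumptions, not taken from the released codebase.

```python
import re

# Matches the benchmark's <action>(X, Y, Z)</action> movement format.
ACTION_RE = re.compile(
    r"<action>\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)</action>"
)

def parse_action(reply: str):
    """Parse a VLM reply into a displacement tuple, 'FOUND', or None.

    Illustrative sketch: the released evaluation controller implements
    the actual parsing. A naive substring check for FOUND is used here;
    a production parser would guard against FOUND appearing in reasoning.
    """
    if "FOUND" in reply:
        return "FOUND"
    m = ACTION_RE.search(reply)
    if m is None:
        return None  # malformed output -- a common failure mode for small models
    return tuple(float(v) for v in m.groups())

# Example usage
print(parse_action("Moving north-east and descending. <action>(10, -5, 0)</action>"))
# -> (10.0, -5.0, 0.0)
```

Note that "failure to correctly format outputs" is cited later as the dominant failure mode for ≤11B models, so this parsing step is where many small-model episodes effectively end.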
Success criterion: The difference between the agent altitude \(h_{agent}\) and the highest point of the target \(h_{object}\) must be \(\leq 10\text{m}\), with the target within the field of view.
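The success criterion can be expressed as a simple predicate. The function below is a minimal sketch of that check, assuming the agent ends above the target; the released environment implements the authoritative version, including the field-of-view test, which is treated here as a boolean input.

```python
def is_success(h_agent: float, h_object: float, target_in_fov: bool,
               max_gap_m: float = 10.0) -> bool:
    """Sketch of the FlySearch success check: the gap between agent
    altitude and the target's highest point must be <= 10 m, and the
    target must be within the camera's field of view."""
    return target_in_fov and (h_agent - h_object) <= max_gap_m

# Example usage: hovering 7 m above the target with it in view succeeds.
print(is_success(h_agent=15.0, h_object=8.0, target_in_fov=True))  # True
```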
Key Designs¶
Two evaluation environment types:
- Forest environment: Based on the Unreal "Electric Dreams" demo, with procedurally generated vegetation and rocks.
- Urban environment: Based on the "City Sample" demo, representing a modern American city of approximately \(4 \times 4\) km.
Three standardized challenge sets:
| Challenge | Scenes | Step Limit | Description |
|---|---|---|---|
| FS-1 | 400 | 10 | Basic search; target visible within initial field of view |
| FS-Anomaly-1 | 200 | 10 | Search for "out-of-place" targets (e.g., a giraffe in a city) |
| FS-2 | 200 | 20 | Difficult search; target may be occluded, requiring systematic exploration |
Search targets:
- Urban: construction sites, crowds, rubbish piles, fires, vehicles
- Forest: campsites, rubbish piles, people (simulating injured hikers), forest fires, buildings
- Anomalies: UFOs, light aircraft, dinosaurs, tanks, giraffes, etc.
Loss & Training¶
FlySearch is an evaluation benchmark. For VLM fine-tuning experiments, the authors apply GRPO (Group Relative Policy Optimization) to train Qwen2.5-VL 7B in the forest environment:
- 6,750 unique scenes and 67,500 training samples are generated.
- The reward function combines reasoning quality (at least 100 tokens of reasoning) and action quality (whether the agent moves closer to the target).
- LoRA fine-tuning is applied with the visual encoder frozen.
- Training is conducted on 4 NVIDIA H100 GPUs over several hours.
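The two-part reward described above (reasoning quality plus action quality) can be sketched as a shaped reward function. Everything below — the function name, the weights, and the binary progress term — is an illustrative assumption; the paper specifies only the two signals, not their exact combination.

```python
def shaped_reward(reasoning_tokens: int, dist_before: float, dist_after: float,
                  reasoning_bonus: float = 0.5, progress_weight: float = 1.0) -> float:
    """Hypothetical GRPO reward combining the two signals described:
    a bonus for producing at least 100 tokens of reasoning, and a
    progress term for moving closer to the target. Weights are
    illustrative, not taken from the paper."""
    reward = reasoning_bonus if reasoning_tokens >= 100 else 0.0
    if dist_after < dist_before:  # did the action reduce distance to target?
        reward += progress_weight
    return reward

# Example usage: long reasoning plus a step toward the target.
print(shaped_reward(reasoning_tokens=120, dist_before=50.0, dist_after=40.0))  # 1.5
```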
Key Experimental Results¶
Main Results¶
| Model | FS-1 Overall (%) | FS-2 Overall (%) |
|---|---|---|
| Human (untrained) | 67.0 | — |
| Gemini 2.0 Flash | 42.0 | ~7.0 |
| Claude 3.5 Sonnet | ~37.0 | — |
| GPT-4o | ~36.0 | ~4.0 |
| Pixtral-Large 124B | ~30.0 | — |
| Qwen2-VL 72B | ~15.0 | — |
| LLaVA-OneVision 72B | ~12.0 | — |
| Small models (≤11B) | <4.0 | — |
| Qwen2.5-VL 7B (GRPO fine-tuned) | 21.5 (City) | 0.0 |
FS-Anomaly-1 results:
| Model | FS-Anomaly-1 Overall (%) |
|---|---|
| Gemini 2.0 Flash | 35.5 |
| GPT-4o | 27.0 |
| Claude 3.5 Sonnet | 27.5 |
| Pixtral-Large | 15.0 |
| Qwen2-VL 72B | 7.5 |
| Small models (≤11B) | <3.5 |
Ablation Study¶
| Configuration | Gemini | Pixtral |
|---|---|---|
| FS-1 baseline (10 steps) | 42.5 | 30.0 |
| FS-1 reduced to 5 steps | ~38.0 (−10%) | ~25.0 (−17%) |
| FS-1 increased to 20 steps | ~40.0 (−6%) | ~25.0 (−17%) |
| FS-1 with compass-based actions | 17.5 (Forest) | 22.0 (Forest) |
| FS-1 without grid overlay | 31.5 (Forest) | 20.0 (Forest) |
| FS-Anomaly with explicit target type | Significant gain | Significant gain |
Key Findings¶
- VLMs lag far behind humans: On FS-1, the best-performing VLM (Gemini, 42%) trails humans (67%) by 25 percentage points; on FS-2, the gap widens to a relative difference of roughly 835%.
- Small models are nearly incapable: Open-source models with ≤11B parameters achieve success rates below 4%, primarily due to failure to correctly format outputs.
- More steps can be harmful: Increasing the step budget from 10 to 20 degrades performance for both Gemini and Pixtral, indicating that VLMs struggle to maintain coherent strategies over longer horizons.
- Fine-tuning yields limited gains: GRPO fine-tuning improves Qwen's FS-1 urban performance from 1.5% to 21.5%, yet FS-2 performance remains at 0%.
- Anomaly detection is challenging: VLMs tend to flag visually salient but normal objects (e.g., yellow taxis) as anomalies while overlooking genuinely anomalous ones (e.g., a nearby tank).
- Coordinate grid overlay is critical: Removing the grid overlay causes a substantial performance drop (−26% in the forest environment), demonstrating that VLMs rely heavily on explicit spatial references.
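Since the grid overlay proves so important, it is worth making concrete what such an overlay involves. The sketch below draws a labeled coordinate grid on a 500×500 top-down frame using Pillow; the grid spacing, colors, and labeling are assumptions, as the released environment's exact overlay style may differ.

```python
from PIL import Image, ImageDraw

def add_grid_overlay(img: Image.Image, spacing: int = 100) -> Image.Image:
    """Draw a labeled coordinate grid over a top-down frame.

    Minimal sketch of the kind of spatial reference the benchmark
    provides; not the benchmark's actual rendering code.
    """
    out = img.copy()
    draw = ImageDraw.Draw(out)
    width, height = out.size
    for x in range(0, width, spacing):
        draw.line([(x, 0), (x, height)], fill="white", width=1)
        draw.text((x + 2, 2), str(x), fill="white")   # label column coordinate
    for y in range(0, height, spacing):
        draw.line([(0, y), (width, y)], fill="white", width=1)
        draw.text((2, y + 2), str(y), fill="white")   # label row coordinate
    return out

# Example usage on a blank 500x500 frame, matching the benchmark's image size.
overlaid = add_grid_overlay(Image.new("RGB", (500, 500), "darkgreen"))
```

The finding suggests VLMs cannot reliably estimate pixel-to-world correspondences on their own; the explicit grid turns a metric estimation problem into a reading problem.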
Highlights & Insights¶
- Precise diagnosis of VLM failure modes: The paper systematically categorizes failure causes into visual hallucination, contextual misinterpretation, and task planning failure.
- High-quality environment design: Near-photorealistic rendering via Unreal Engine 5 combined with procedural generation enables unlimited scenario scalability.
- Meaningful human baseline: An online user study establishes a well-grounded human reference for comparison.
- Strong reproducibility: Three standardized challenge sets ensure fair and comparable evaluation conditions.
- Reveals a fundamental gap: VLMs lack systematic exploration strategies — humans search along streets, whereas VLMs move randomly.
Limitations & Future Work¶
- Pure VLM evaluation only: More complex ObjectNav systems (e.g., SLAM-integrated methods) are not assessed.
- Simple prompting strategy: Only zero-shot prompting is employed; few-shot learning and prompt optimization are unexplored.
- Fixed top-down camera perspective: The camera always points straight down, restricting the viewpoints available for perceiving the environment.
- Collision issues in urban environments: Models frequently trigger collision avoidance (particularly in FS-2), reducing search efficiency.
- No multi-target or cooperative search: Only single-agent, single-target search is evaluated.
Related Work & Insights¶
- BALROG: Evaluates VLMs in game environments but lacks outdoor 3D scenes.
- VisualAgentBench: A multi-environment VLM benchmark, but not focused on exploration capability.
- Habitat / AI2-THOR: Standard ObjectNav environments, but restricted to indoor settings.
- AirSim: A UAV simulator built on Unreal Engine, but based on an older engine version.
- Insight: "Knowing where to look" and "knowing how to search systematically" represent two fundamentally distinct capabilities; current VLM architectures may have a structural deficiency in the latter.
Rating¶
- Novelty: ★★★★★ — First large-scale benchmark for evaluating VLM exploration capability in outdoor 3D environments.
- Technical Depth: ★★★★☆ — Carefully designed environment and evaluation framework; primary contribution lies in the benchmarking methodology.
- Experimental Thoroughness: ★★★★★ — 9 VLMs, 3 challenge sets, human baselines, and multi-dimensional ablation analysis.
- Practical Value: ★★★★☆ — Offers direct guidance for understanding and improving spatial reasoning and exploration in VLMs.
- Overall Recommendation: ★★★★☆