GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks¶
Conference: ICCV 2025 arXiv: 2411.19325 Code: https://github.com/The-AI-Alliance/GEO-Bench-VLM Area: Multimodal VLM Keywords: Vision-Language Models, Geospatial, Remote Sensing Benchmark, Multimodal Evaluation, Temporal Analysis
TL;DR¶
This paper introduces GEOBench-VLM, a comprehensive benchmark for evaluating VLMs on geospatial tasks, covering 31 sub-tasks across 8 major categories with over 10,000 manually verified instructions. The benchmark shows that current state-of-the-art VLMs, including GPT-4o, still struggle on geospatial tasks: the best model reaches only 41.7% average MCQ accuracy.
Background & Motivation¶
Existing VLM evaluation benchmarks (e.g., SEED-Bench, MMMU, MMBench) focus primarily on general vision-language tasks and fail to adequately address the unique challenges of geospatial applications:
- Temporal Change Detection: Monitoring urban development and environmental degradation requires temporal analysis capabilities.
- Large-Scale Object Counting: Remote sensing imagery demands precise counting of buildings, vehicles, and similar objects.
- Small Object Detection: Targets in satellite imagery are often tiny, and object scales vary considerably.
- Non-Optical Data Interpretation: Interpreting modalities beyond optical RGB, such as SAR and multispectral data.
Existing remote sensing VLM benchmarks (e.g., VLEO) lack coverage of temporal analysis, segmentation tasks, and non-optical data evaluation. GEOBench-VLM aims to fill this gap by providing a comprehensive geospatial evaluation framework for both general-purpose and remote sensing-specific VLMs.
Method¶
Overall Architecture¶
GEOBench-VLM is an evaluation benchmark suite rather than a novel model; its core contributions lie in data construction, task design, and the evaluation framework. A multiple-choice question (MCQ) format is adopted to enable objective and scalable automated evaluation, mitigating biases and hallucination issues associated with open-ended responses.
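To make the MCQ protocol concrete, below is a minimal sketch of how such automated scoring could work: extract an option letter from the model's free-form reply and compare it against the ground truth. The record layout and function names are illustrative assumptions, not the paper's released evaluation code.

```python
import re

# Minimal sketch of MCQ scoring (an illustrative assumption, not the
# benchmark's released code). Each record holds a prompt and a ground-truth
# option letter A-E.

def extract_choice(response: str) -> str | None:
    """Return the first standalone option letter (A-E) in a free-form reply.
    Replies with no extractable letter are scored as incorrect downstream."""
    match = re.search(r"\b([A-E])\b", response.strip().upper())
    return match.group(1) if match else None

def mcq_accuracy(records: list[dict], model_fn) -> float:
    """records: dicts with 'prompt' and 'answer' keys; model_fn maps a
    prompt string to the model's raw text reply."""
    correct = sum(
        extract_choice(model_fn(rec["prompt"])) == rec["answer"]
        for rec in records
    )
    return correct / len(records)
```

Letter extraction is the brittle step in practice: a reply such as "A large building" would false-positive on "A", which is why MCQ prompts typically instruct the model to answer with the option letter only.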
Key Designs¶
- Task taxonomy of 8 categories and 31 sub-tasks:
- Scene Understanding: Scene classification, land-use classification, crop classification.
- Object Classification: Fine-grained classification of ship types, aircraft types, etc.
- Object Localization & Counting: Referring expression detection and counting of various objects (vehicles, aircraft, buildings, water bodies, trees, marine debris, etc.).
- Event Detection: Fire risk assessment, disaster type classification.
- Caption Generation: Image captioning to assess scene and detail description capabilities.
- Semantic Segmentation: Referring expression segmentation, generating binary masks for specific targets.
- Temporal Understanding: Change detection, disaster damage assessment, long-term crop classification.
- Non-Optical Data: SAR-based ship detection, flood detection, earthquake magnitude estimation.
- Design Motivation: To cover the full spectrum of remote sensing application scenarios.
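For illustration, the taxonomy above can be encoded as a simple configuration mapping. The identifiers below are abbreviated paraphrases of the sub-tasks just listed, not the benchmark's exact task names.

```python
# Hypothetical encoding of the 8-category taxonomy; sub-task identifiers are
# illustrative paraphrases, and the full benchmark defines 31 sub-tasks.
TASK_TAXONOMY: dict[str, list[str]] = {
    "scene_understanding":       ["scene_classification", "land_use", "crop_classification"],
    "object_classification":     ["ship_type", "aircraft_type"],
    "localization_and_counting": ["referring_detection", "vehicle_counting", "building_counting"],
    "event_detection":           ["fire_risk", "disaster_type"],
    "caption_generation":        ["image_captioning"],
    "semantic_segmentation":     ["referring_segmentation"],
    "temporal_understanding":    ["change_detection", "damage_assessment", "long_term_crop"],
    "non_optical":               ["sar_ship_detection", "flood_detection", "earthquake_magnitude"],
}
```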
- Data construction pipeline:
- Open-source remote sensing datasets are integrated, with each task sampling from multiple datasets to ensure diversity.
- For classification tasks, GPT-4o generates five-option MCQs consisting of one correct answer, one semantically similar "nearest distractor" (manually verified), and three plausible distractors.
- Counting tasks convert detection annotations into questions, pairing the correct count with distractor options at ±20% and ±40% deviations (see the sketch below).
- Spatial relationship tasks are annotated by human annotators with cross-validation.
- Caption generation combines GPT-4o generation with manual refinement.
- Design Motivation: Combining automatic generation with manual verification to ensure data quality.
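As referenced above, here is a minimal sketch of how the counting options could be constructed, assuming symmetric ±20%/±40% distractors rounded to integers; the paper does not specify its rounding or collision-handling rules for small counts, so those details are assumptions.

```python
import random

def counting_options(true_count: int) -> tuple[list[str], str]:
    """Build five MCQ options for a counting question: the true count plus
    distractors at roughly ±20% and ±40% deviation. Illustrative sketch;
    rounding and collision handling are assumed, not taken from the paper."""
    deltas = [-0.4, -0.2, 0.2, 0.4]
    distractors = []
    for d in deltas:
        cand = max(0, round(true_count * (1 + d)))
        # Nudge collisions so all five options stay distinct
        # (this matters for small counts, where ±20% rounds onto the truth).
        while cand == true_count or cand in distractors:
            cand += 1
        distractors.append(cand)
    options = [true_count] + distractors
    random.shuffle(options)
    letters = "ABCDE"
    labeled = [f"{letters[i]}. {n}" for i, n in enumerate(options)]
    answer = letters[options.index(true_count)]
    return labeled, answer
```

For example, `counting_options(10)` yields shuffled options drawn from {6, 8, 10, 12, 14}; the collision nudge only activates for small counts.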
- Comprehensive VLM evaluation framework:
- Thirteen state-of-the-art VLMs are evaluated, including general-purpose models (GPT-4o, LLaVA-OneVision, Qwen2-VL, InternVL2, etc.) and remote sensing-specific models (GeoChat, RS-LLaVA, EarthDial, etc.).
- Multi-dimensional metrics are employed: MCQ accuracy, detection precision (Prec@0.5 / Prec@0.25), segmentation mIoU, and caption BERTScore (see the sketch after this list).
- Design Motivation: To simultaneously evaluate both general-purpose and domain-specific models, comprehensively revealing capability gaps.
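Below is an assumed implementation sketch of the geometry-based metrics: box IoU, precision at an IoU threshold (the Prec@0.5 / Prec@0.25 reported later for referring-expression detection, assuming one predicted box per expression), and mean IoU over binary segmentation masks. MCQ accuracy is sketched earlier; caption quality uses the standard BERTScore metric.

```python
import numpy as np

# Illustrative metric sketches (assumed implementations, not the benchmark's code).

def box_iou(a, b) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def precision_at(preds, gts, thr=0.5) -> float:
    """Fraction of (predicted box, reference box) pairs with IoU >= thr,
    e.g. Prec@0.5 / Prec@0.25 for referring-expression detection."""
    hits = sum(box_iou(p, g) >= thr for p, g in zip(preds, gts))
    return hits / len(gts)

def binary_miou(pred_masks, gt_masks) -> float:
    """Mean IoU over binary segmentation masks (numpy 0/1 or bool arrays);
    two empty masks count as a perfect match."""
    ious = []
    for p, g in zip(pred_masks, gt_masks):
        p, g = p.astype(bool), g.astype(bool)
        union = np.logical_or(p, g).sum()
        ious.append(np.logical_and(p, g).sum() / union if union else 1.0)
    return float(np.mean(ious))
```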
Loss & Training¶
As a benchmark paper, no training is involved. All VLMs are evaluated in a zero-shot inference setting directly on GEOBench-VLM.
Key Experimental Results¶
Main Results — VLM Accuracy Across Task Categories¶
| Model | Event Detection | Object Classification | Counting | Scene Understanding | Caption (BERTScore) |
|---|---|---|---|---|---|
| GPT-4o | 0.473 | 0.586 | 0.397 | 0.711 | 0.642 |
| EarthDial | 0.542 | 0.404 | 0.363 | 0.771 | 0.538 |
| Qwen2-VL | 0.464 | 0.456 | 0.402 | 0.676 | 0.590 |
| LLaVA-OneVision | 0.406 | 0.459 | 0.438 | 0.664 | 0.632 |
| InternVL-2 | 0.346 | 0.306 | 0.328 | 0.573 | 0.597 |
| GeoChat | 0.337 | 0.313 | 0.292 | 0.609 | 0.440 |
| SPHINX | 0.236 | 0.205 | 0.186 | 0.217 | 0.645 |
Referring Expression Detection Precision¶
| Model | Prec@0.5 | Prec@0.25 |
|---|---|---|
| SPHINX | 0.341 | 0.529 |
| EarthDial | 0.243 | 0.414 |
| Qwen2-VL | 0.152 | 0.252 |
| GeoChat | 0.115 | 0.210 |
| GPT-4o | 0.009 | 0.039 |
Key Findings¶
- Best-performing models remain limited: LLaVA-OneVision ranks first with an average MCQ accuracy of 41.7%, barely more than double the 20% random-guess baseline for five-option questions.
- No model achieves comprehensive dominance: GPT-4o excels at object classification, EarthDial at scene understanding and event detection, and LLaVA-OneVision at counting.
- Remote sensing-specific models do not always outperform: General-purpose models surpass domain-specific models on multiple tasks.
- Counting tasks pose major challenges: Accuracy drops substantially for all models in high-density scenes (>50 objects).
- Temporal information is underutilized: Multi-temporal data degrades performance on some tasks, indicating that current VLMs struggle to leverage temporal dependencies.
- GPT-4o performs worst on localization (Prec@0.5 of only 0.009) while excelling at object classification.
- Prompt sensitivity: GPT-4o and InternVL2 are most sensitive to prompt variations, whereas EarthDial and SkySenseGPT are comparatively stable.
Highlights & Insights¶
- Filling a critical gap: This is the first comprehensive geospatial VLM benchmark spanning 8 categories and 31 sub-tasks, including previously absent categories such as temporal analysis, non-optical data, and segmentation.
- Manual verification ensures quality: Over 10,000 instructions are manually verified, and the MCQ format reduces evaluation bias.
- In-depth analyses are informative: Analyses of object density vs. counting accuracy, prompt sensitivity, and single- vs. multi-temporal comparisons reveal deeper limitations of current VLMs.
- Open-source and extensible: The benchmark is publicly available, facilitating iterative follow-up research.
Limitations & Future Work¶
- The MCQ format limits assessment of VLMs' open-ended generation capabilities.
- Only a subset of models supports certain tasks (e.g., segmentation), restricting the scope of comparison.
- Inference latency and computational efficiency are not evaluated.
- Data sources are predominantly public remote sensing datasets, which may introduce distributional bias.
- Future work could incorporate 3D geospatial understanding and multimodal fusion tasks (e.g., joint optical and SAR reasoning).
Related Work & Insights¶
- Complements the VLEO benchmark with significant extensions in temporal analysis, segmentation, and non-optical data.
- Reveals limitations of fine-tuning remote sensing-specific VLMs, as domain-specific models do not consistently outperform general-purpose ones across tasks.
- In-depth analyses of counting and localization tasks can guide architectural improvements for remote sensing VLMs.
- Provides concrete performance targets for the development of next-generation geospatial-specific VLMs.
Rating¶
- Novelty: ⭐⭐⭐ A benchmark paper with creative task design and data construction, though methodological innovation is limited.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluates 13 VLMs across 31 tasks with rich analytical dimensions.
- Writing Quality: ⭐⭐⭐⭐ Well-structured, though some tables are data-dense and moderately difficult to read.
- Value: ⭐⭐⭐⭐ Provides the geospatial AI community with a much-needed standardized evaluation tool of high practical utility.