GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks¶
Conference: ICCV 2025 arXiv: 2411.19325 Code: https://github.com/The-AI-Alliance/GEO-Bench-VLM Area: Multimodal VLM Keywords: Vision-Language Models, Geospatial, Remote Sensing Benchmark, Multimodal Evaluation, Temporal Analysis
TL;DR¶
This paper introduces GEOBench-VLM, a comprehensive benchmark for evaluating VLMs on geospatial tasks, covering 31 sub-tasks across 8 major categories with over 10,000 manually verified instructions. The benchmark shows that current state-of-the-art VLMs, including GPT-4o, still struggle on geospatial tasks: the best model reaches only 41.7% average MCQ accuracy.
Background & Motivation¶
Existing VLM evaluation benchmarks (e.g., SEED-Bench, MMMU, MMBench) focus primarily on general vision-language tasks and fail to adequately address the unique challenges of geospatial applications:
- Temporal Change Detection: Monitoring urban development and environmental degradation requires temporal analysis capabilities.
- Large-Scale Object Counting: Remote sensing imagery demands precise counting of buildings, vehicles, and similar objects.
- Small Object Detection: Targets in satellite imagery are often tiny, and object scales vary considerably.
- Non-Optical Data Interpretation: Interpreting modalities beyond optical RGB, such as SAR and multispectral data.
Existing remote sensing VLM benchmarks (e.g., VLEO) lack coverage of temporal analysis, segmentation tasks, and non-optical data evaluation. GEOBench-VLM aims to fill this gap by providing a comprehensive geospatial evaluation framework for both general-purpose and remote sensing-specific VLMs.
Method¶
Overall Architecture¶
GEOBench-VLM is an evaluation benchmark suite rather than a novel model; its core contributions lie in data construction, task design, and the evaluation framework. A multiple-choice question (MCQ) format is adopted to enable objective and scalable automated evaluation, mitigating biases and hallucination issues associated with open-ended responses.
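To make the MCQ protocol concrete, below is a minimal sketch of how such automated scoring could work: extract an option letter from the model's free-form reply and compare it against the ground truth. The record layout and function names are illustrative assumptions, not the paper's released evaluation code.

```python
import re

# Minimal sketch of MCQ scoring (an illustrative assumption, not the
# benchmark's released code). Each record holds a prompt and a ground-truth
# option letter A-E.

def extract_choice(response: str) -> str | None:
    """Return the first standalone option letter (A-E) in a free-form reply.
    Replies with no extractable letter are scored as incorrect downstream."""
    match = re.search(r"\b([A-E])\b", response.strip().upper())
    return match.group(1) if match else None

def mcq_accuracy(records: list[dict], model_fn) -> float:
    """records: dicts with 'prompt' and 'answer' keys; model_fn maps a
    prompt string to the model's raw text reply."""
    correct = sum(
        extract_choice(model_fn(rec["prompt"])) == rec["answer"]
        for rec in records
    )
    return correct / len(records)
```

Letter extraction is the brittle step in practice: a reply such as "A large building" would false-positive on "A", which is why MCQ prompts typically instruct the model to answer with the option letter only.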
Key Designs¶
- Task taxonomy of 8 categories and 31 sub-tasks:
- Scene Understanding: Scene classification, land-use classification, crop classification.
- Object Classification: Fine-grained classification of ship types, aircraft types, etc.
- Object Localization & Counting: Referring expression detection and counting of various objects (vehicles, aircraft, buildings, water bodies, trees, marine debris, etc.).
- Event Detection: Fire risk assessment, disaster type classification.
- Caption Generation: Image captioning to assess scene and detail description capabilities.
- Semantic Segmentation: Referring expression segmentation, generating binary masks for specific targets.
- Temporal Understanding: Change detection, disaster damage assessment, long-term crop classification.
- Non-Optical Data: SAR-based ship detection, flood detection, earthquake magnitude estimation.
- Design Motivation: To cover the full spectrum of remote sensing application scenarios.
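For illustration, the taxonomy above can be encoded as a simple configuration mapping. The identifiers below are abbreviated paraphrases of the sub-tasks just listed, not the benchmark's exact task names.

```python
# Hypothetical encoding of the 8-category taxonomy; sub-task identifiers are
# illustrative paraphrases, and the full benchmark defines 31 sub-tasks.
TASK_TAXONOMY: dict[str, list[str]] = {
    "scene_understanding":       ["scene_classification", "land_use", "crop_classification"],
    "object_classification":     ["ship_type", "aircraft_type"],
    "localization_and_counting": ["referring_detection", "vehicle_counting", "building_counting"],
    "event_detection":           ["fire_risk", "disaster_type"],
    "caption_generation":        ["image_captioning"],
    "semantic_segmentation":     ["referring_segmentation"],
    "temporal_understanding":    ["change_detection", "damage_assessment", "long_term_crop"],
    "non_optical":               ["sar_ship_detection", "flood_detection", "earthquake_magnitude"],
}
```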
- Data construction pipeline:
- Open-source remote sensing datasets are integrated, with each task sampling from multiple datasets to ensure diversity.
- For classification tasks, GPT-4o generates five-option MCQs consisting of one correct answer, one semantically similar "nearest distractor" (manually verified), and three plausible distractors.
- Counting tasks convert detection annotations into questions, pairing the correct count with distractor options at ±20% and ±40% deviations (see the sketch below).
- Spatial relationship tasks are annotated by human annotators with cross-validation.
- Caption generation combines GPT-4o generation with manual refinement.
- Design Motivation: Combining automatic generation with manual verification to ensure data quality.
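As referenced above, here is a minimal sketch of how the counting options could be constructed, assuming symmetric ±20%/±40% distractors rounded to integers; the paper does not specify its rounding or collision-handling rules for small counts, so those details are assumptions.

```python
import random

def counting_options(true_count: int) -> tuple[list[str], str]:
    """Build five MCQ options for a counting question: the true count plus
    distractors at roughly ±20% and ±40% deviation. Illustrative sketch;
    rounding and collision handling are assumed, not taken from the paper."""
    deltas = [-0.4, -0.2, 0.2, 0.4]
    distractors = []
    for d in deltas:
        cand = max(0, round(true_count * (1 + d)))
        # Nudge collisions so all five options stay distinct
        # (this matters for small counts, where ±20% rounds onto the truth).
        while cand == true_count or cand in distractors:
            cand += 1
        distractors.append(cand)
    options = [true_count] + distractors
    random.shuffle(options)
    letters = "ABCDE"
    labeled = [f"{letters[i]}. {n}" for i, n in enumerate(options)]
    answer = letters[options.index(true_count)]
    return labeled, answer
```

For example, `counting_options(10)` yields shuffled options drawn from {6, 8, 10, 12, 14}; the collision nudge only activates for small counts.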
- Comprehensive VLM evaluation framework:
- Thirteen state-of-the-art VLMs are evaluated, including general-purpose models (GPT-4o, LLaVA-OneVision, Qwen2-VL, InternVL2, etc.) and remote sensing-specific models (GeoChat, RS-LLaVA, EarthDial, etc.).
- Multi-dimensional metrics are employed: MCQ accuracy, detection precision (Prec@0.5 / Prec@0.25), segmentation mIoU, and caption BERTScore (see the sketch after this list).
- Design Motivation: To simultaneously evaluate both general-purpose and domain-specific models, comprehensively revealing capability gaps.
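Below is an assumed implementation sketch of the geometry-based metrics: box IoU, precision at an IoU threshold (the Prec@0.5 / Prec@0.25 reported later for referring-expression detection, assuming one predicted box per expression), and mean IoU over binary segmentation masks. MCQ accuracy is sketched earlier; caption quality uses the standard BERTScore metric.

```python
import numpy as np

# Illustrative metric sketches (assumed implementations, not the benchmark's code).

def box_iou(a, b) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def precision_at(preds, gts, thr=0.5) -> float:
    """Fraction of (predicted box, reference box) pairs with IoU >= thr,
    e.g. Prec@0.5 / Prec@0.25 for referring-expression detection."""
    hits = sum(box_iou(p, g) >= thr for p, g in zip(preds, gts))
    return hits / len(gts)

def binary_miou(pred_masks, gt_masks) -> float:
    """Mean IoU over binary segmentation masks (numpy 0/1 or bool arrays);
    two empty masks count as a perfect match."""
    ious = []
    for p, g in zip(pred_masks, gt_masks):
        p, g = p.astype(bool), g.astype(bool)
        union = np.logical_or(p, g).sum()
        ious.append(np.logical_and(p, g).sum() / union if union else 1.0)
    return float(np.mean(ious))
```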
Loss & Training¶
As a benchmark paper, no training is involved. All VLMs are evaluated in a zero-shot inference setting directly on GEOBench-VLM.
Key Experimental Results¶
Main Results — VLM Accuracy Across Task Categories¶
| Model | Event Detection | Object Classification | Counting | Scene Understanding | Caption (BERTScore) |
|---|---|---|---|---|---|
| GPT-4o | 0.473 | 0.586 | 0.397 | 0.711 | 0.642 |
| EarthDial | 0.542 | 0.404 | 0.363 | 0.771 | 0.538 |
| Qwen2-VL | 0.464 | 0.456 | 0.402 | 0.676 | 0.590 |
| LLaVA-OneVision | 0.406 | 0.459 | 0.438 | 0.664 | 0.632 |
| InternVL-2 | 0.346 | 0.306 | 0.328 | 0.573 | 0.597 |
| GeoChat | 0.337 | 0.313 | 0.292 | 0.609 | 0.440 |
| SPHINX | 0.236 | 0.205 | 0.186 | 0.217 | 0.645 |
Referring Expression Detection Precision¶
| Model | Prec@0.5 | Prec@0.25 |
|---|---|---|
| SPHINX | 0.341 | 0.529 |
| EarthDial | 0.243 | 0.414 |
| Qwen2-VL | 0.152 | 0.252 |
| GeoChat | 0.115 | 0.210 |
| GPT-4o | 0.009 | 0.039 |
Key Findings¶
- Best-performing models remain limited: LLaVA-OneVision ranks first with an average MCQ accuracy of 41.7%, barely more than double the 20% random-guess baseline for five-option questions.
- No model achieves comprehensive dominance: GPT-4o excels at object classification, EarthDial at scene understanding and event detection, and LLaVA-OneVision at counting.
- Remote sensing-specific models do not always outperform: General-purpose models surpass domain-specific models on multiple tasks.
- Counting tasks pose major challenges: Accuracy drops substantially for all models in high-density scenes (>50 objects).
- Temporal information is underutilized: Multi-temporal data degrades performance on some tasks, indicating that current VLMs struggle to leverage temporal dependencies.
- GPT-4o performs worst on localization (Prec@0.5 of only 0.009) while excelling at object classification.
- Prompt sensitivity: GPT-4o and InternVL2 are most sensitive to prompt variations, whereas EarthDial and SkySenseGPT are comparatively stable.
Highlights & Insights¶
- Filling a critical gap: This is the first comprehensive geospatial VLM benchmark spanning 8 categories and 31 sub-tasks, including previously absent categories such as temporal analysis, non-optical data, and segmentation.
- Manual verification ensures quality: Over 10,000 instructions are manually verified, and the MCQ format reduces evaluation bias.
- In-depth analyses are informative: Analyses of object density vs. counting accuracy, prompt sensitivity, and single- vs. multi-temporal comparisons reveal deeper limitations of current VLMs.
- Open-source and extensible: The benchmark is publicly available, facilitating iterative follow-up research.
Limitations & Future Work¶
- The MCQ format limits assessment of VLMs' open-ended generation capabilities.
- Only a subset of models supports certain tasks (e.g., segmentation), restricting the scope of comparison.
- Inference latency and computational efficiency are not evaluated.
- Data sources are predominantly public remote sensing datasets, which may introduce distributional bias.
- Future work could incorporate 3D geospatial understanding and multimodal fusion tasks (e.g., joint optical and SAR reasoning).
Related Work & Insights¶
- Complements the VLEO benchmark with significant extensions in temporal analysis, segmentation, and non-optical data.
- Reveals limitations of fine-tuning remote sensing-specific VLMs, as domain-specific models do not consistently outperform general-purpose ones across tasks.
- In-depth analyses of counting and localization tasks can guide architectural improvements for remote sensing VLMs.
- Provides concrete performance targets for the development of next-generation geospatial-specific VLMs.
Rating¶
- Novelty: ⭐⭐⭐ A benchmark paper with creative task design and data construction, though methodological innovation is limited.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluates 13 VLMs across 31 tasks with rich analytical dimensions.
- Writing Quality: ⭐⭐⭐⭐ Well-structured, though some tables are data-dense and moderately difficult to read.
- Value: ⭐⭐⭐⭐ Provides the geospatial AI community with a much-needed standardized evaluation tool of high practical utility.