FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models¶
Conference: ICLR 2026 arXiv: 2512.08016 Code: knowledge-computing/FRIEDA Area: Multimodal VLM Keywords: cartographic reasoning, map VQA, spatial relations, multi-image reasoning, benchmark
TL;DR¶
This paper introduces FRIEDA, a benchmark that systematically evaluates large vision-language models (LVLMs) on multi-step, cross-map cartographic reasoning. The strongest model, Gemini-2.5-Pro, achieves only 38.20% accuracy, far below the human baseline of 84.87%.
Background & Motivation¶
- Cartographic reasoning is a core human cognitive capability, involving comprehensive interpretation of legends, scale bars, compasses, map text, and geometric features, and is indispensable in real-world applications such as urban planning and disaster response.
- Existing LVLM research typically treats maps as a special case of charts, neglecting the unique symbolic grammar and spatial relational reasoning that maps require.
- Current map VQA benchmarks exhibit significant shortcomings: (1) most cover only a subset of spatial relations (e.g., navigation or entity recognition only); (2) map styles are limited (predominantly choropleth or web basemaps); (3) cross-map reasoning is rarely addressed; (4) in-document map retrieval scenarios are absent.
- Consequently, existing benchmarks cannot comprehensively assess whether LVLMs possess human-level map reading ability.
Core Problem¶
How can a cartographic reasoning benchmark be designed so that it covers all three categories of spatial relations (topological, metric, and directional), requires multi-step reasoning and cross-map integration, and closely reflects real-world document usage scenarios?
Method¶
Task Definition¶
FRIEDA organizes questions around four core dimensions:
- Spatial Relation Reasoning: Based on the three major spatial relation categories in the GIS literature:
  - Topological relations: border (shared boundary), equal (geometric coincidence), intersect (overlap), within (containment)
  - Metric relations: distance (computing real-world distances using scale bars)
  - Directional relations: orientation (determining cardinal directions using compass roses)
- Map Element Interpretation: Requires understanding the semantics of map text, legend, map scale, and compass.
- Cross-Map Reasoning: Requires aligning shared symbols, labels, and scale bars across multiple maps and integrating multi-source evidence.
- Contextual Setting: The model must first retrieve the relevant map from multiple maps within the same document before answering.
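The four dimensions above can be captured in a single question record. The sketch below is illustrative only: the field and class names are hypothetical and do not reflect the released dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical relation taxonomy mirroring the paper's three categories
# and six relation types (names are assumptions, not the dataset schema).
RELATION_TYPES = {
    "topological": ["border", "equal", "intersect", "within"],
    "metric": ["distance"],
    "directional": ["orientation"],
}

@dataclass
class MapQuestion:
    question: str
    relation: str        # one of the six relation types above
    map_ids: list        # maps needed to answer (>1 means cross-map)
    document_maps: int   # size of the in-document retrieval pool
    answer: str

    @property
    def category(self) -> str:
        """Map the fine-grained relation back to its category."""
        for cat, rels in RELATION_TYPES.items():
            if self.relation in rels:
                return cat
        raise ValueError(f"unknown relation: {self.relation}")

    @property
    def is_multi_map(self) -> bool:
        return len(self.map_ids) > 1

q = MapQuestion("How far is site A from site B?", "distance",
                ["map_3", "map_7"], document_maps=12, answer="4.2 km")
print(q.category, q.is_multi_map)  # metric True
```

In the contextual setting, `document_maps` would exceed `len(map_ids)`, forcing the model to retrieve the relevant maps before reasoning.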
Benchmark Construction Pipeline¶
- Map Collection: Maps are collected from publicly available government reports, environmental assessment documents, geological surveys, and other sources across six thematic domains, covering 32 countries with highly diverse styles.
- Question Generation: GPT-4/GPT-o3 generates candidate questions; each question is checked to ensure it cannot be answered via a search engine or without viewing the map.
- Expert Review: Two GIS experts (with 7 and 2 years of experience, respectively) manually verify answers and resolve ambiguous questions.
- Annotation Validation: Eleven doctoral researchers (8 with cartographic expertise) conduct a four-week annotation process; only questions for which ≥2/3 of annotators agree on the gold-standard answer are retained, yielding a final set of 500 questions.
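The ≥2/3 consensus filter in the final step can be sketched as a simple majority-vote check. This is a minimal illustration of the rule as described, not the authors' actual annotation tooling.

```python
from collections import Counter

def consensus_keep(annotations, threshold=2/3):
    """Return the gold answer if at least `threshold` of annotators
    agree on a single answer; otherwise return None (question dropped)."""
    counts = Counter(annotations)
    answer, votes = counts.most_common(1)[0]
    if votes / len(annotations) >= threshold:
        return answer
    return None

print(consensus_keep(["North", "North", "North-East"]))  # North (2/3 agree)
print(consensus_keep(["A", "B", "C"]))                   # None (no consensus)
```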
Dataset Statistics¶
| Item | Count |
|---|---|
| Total questions | 500 |
| Source documents | 210 |
| Total maps | 17,030 |
| Single-map questions | 202 (40.4%) |
| Multi-map questions | 298 (59.6%) |
| Questions requiring legend | 417 (83.4%) |
| Avg. maps per contextual question | 9.5 |
Evaluation Protocol¶
Answers fall into three categories, each evaluated differently:
- Text answers: Mistral Small 3.1 is used as an LLM-as-Judge for semantic matching rather than exact string comparison.
- Distance answers: Unit-aware parsing + MAPE; answers within 20% error are considered correct.
- Directional answers: Adjacent cardinal direction tolerance is permitted (e.g., if the gold standard is North, NW and NE are also accepted).
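The distance and directional rules above are mechanical enough to sketch directly. The code below is an assumed reimplementation of the stated tolerances (20% relative error for distances, one-step adjacency on an 8-point compass rose); the paper's actual parser and unit table may differ.

```python
import re

# Assumed unit table for unit-aware parsing (metres as canonical unit).
UNIT_TO_M = {"mm": 0.001, "cm": 0.01, "m": 1.0, "km": 1000.0,
             "mi": 1609.34, "ft": 0.3048}

def parse_distance(text):
    """Extract a number and unit from free text; return metres or None."""
    m = re.search(r"([\d.]+)\s*(mm|cm|km|mi|ft|m)\b", text.lower())
    if not m:
        return None
    return float(m.group(1)) * UNIT_TO_M[m.group(2)]

def distance_correct(pred, gold, tol=0.20):
    """Correct if relative error is within 20% after unit conversion."""
    p, g = parse_distance(pred), parse_distance(gold)
    return p is not None and g is not None and abs(p - g) / g <= tol

DIRS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def direction_correct(pred, gold):
    """Adjacent-direction tolerance: at most one step on the 8-point rose."""
    i, j = DIRS.index(pred), DIRS.index(gold)
    return min((i - j) % 8, (j - i) % 8) <= 1

print(distance_correct("4.5 km", "4.2 km"))  # True (~7% error)
print(direction_correct("NW", "N"))          # True (adjacent)
print(direction_correct("W", "N"))           # False (two steps away)
```

Text answers are the one category this sketch cannot cover, since semantic matching is delegated to the LLM judge.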
Key Experimental Results¶
Overall Performance¶
| Model | Accuracy |
|---|---|
| Human average | 84.87% |
| Gemini-2.5-Pro | 38.20% |
| GPT-5-Think | 37.20% |
| Claude-Sonnet-4 | 31.60% |
| Ovis2.5-9B-Think (best open-source) | 25.80% |
| Qwen2.5-VL-72B | 25.60% |
Analysis by Spatial Relation¶
- Orientation is the category where models perform best: Gemini-2.5-Pro reaches 71.59%.
- Distance is the most challenging: the best model achieves only 27.47% (GPT-5-Think), and human performance is also relatively lower at 78.28%.
- For equal relations, GPT-5-Think (44.44%) significantly outperforms Gemini-2.5-Pro (33.33%), reflecting its advantage in multi-map reasoning.
- Claude-Sonnet-4 also performs comparatively well on distance questions, suggesting stronger scale bar interpretation.
Key Findings¶
- The accuracy gap between direct and contextual settings is minimal (88.03% question-level consistency), indicating that the primary bottleneck lies in cartographic reasoning itself rather than map retrieval.
- Model size shows no clear positive correlation with performance; training data and reasoning mechanisms are more critical.
- Enabling the Think mode improves Ovis2.5-9B by approximately 5%, primarily in directional judgment and multi-map alignment.
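The question-level consistency figure cited above compares per-question outcomes across the two settings. A minimal sketch of such a metric, under the assumption that consistency means "both settings correct or both wrong" (the paper may define it differently):

```python
def question_level_consistency(direct, contextual):
    """Fraction of questions where the direct and contextual settings
    agree (both correct or both incorrect). A high value suggests
    retrieval is not the bottleneck. Inputs are 0/1 correctness lists
    aligned by question."""
    agree = sum(d == c for d, c in zip(direct, contextual))
    return agree / len(direct)

direct =     [1, 1, 0, 0, 1, 0, 1, 1]
contextual = [1, 1, 0, 1, 1, 0, 1, 0]
print(question_level_consistency(direct, contextual))  # 0.75
```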
Error Analysis (Gemini-2.5-Pro)¶
| Error Type | Proportion |
|---|---|
| Legend misinterpretation (color/symbol mapping errors) | 25.61% |
| Cross-map interpretation failure | 23.78% |
| Spatial relation semantic confusion | 16.46% |
| Scale bar errors | 9.76% |
| Incorrect map text selection | 8.93% |
| Counting errors | 6.71% |
Highlights & Insights¶
- Comprehensive spatial relation coverage: The first map VQA benchmark to systematically cover all three major categories—topological, metric, and directional—comprising six relation types in total.
- Cross-map reasoning: 59.6% of questions require joint reasoning across multiple maps, filling a critical evaluation gap in multi-map cartographic reasoning.
- Real-world map diversity: Sourced from 210 real documents across 32 countries and six domains (geology, urban planning, environmental assessment, etc.), avoiding the simplification bias inherent in synthetic maps.
- Rigorous quality control: Expert curation, annotation by 11 doctoral researchers, and a ≥2/3 consensus filter ensure high question quality.
- Dual-mode evaluation: Direct and contextual settings decouple reasoning capability from retrieval capability.
Limitations & Future Work¶
- The dataset covers only Latin-script documents, excluding maps in Chinese, Arabic, and other languages.
- The scale of 500 questions is relatively limited, and sample sizes across spatial relation subcategories are uneven.
- Evaluation of fine-tuned models is absent, making it difficult to assess whether domain adaptation can substantially improve performance.
- The reliability of LLM-as-Judge evaluation depends on the specific evaluation model and may introduce bias.
- The effects of chain-of-thought prompting or tool augmentation (e.g., GIS API calls) on performance remain unexplored.
Related Work & Insights¶
| Dimension | MapQA/MapWise | MapEval | FRIEDA |
|---|---|---|---|
| Map types | Primarily choropleth | Web basemaps | Diverse real-document maps |
| Spatial relations | None | Partial | All three categories, six types |
| Multi-map reasoning | No | No | Yes (59.6%) |
| Document context | No | No | Yes (contextual setting) |
| Answer format | Multiple choice | MC/short answer | Open-ended |
Unlike spatial reasoning work on natural images such as SpatialVLM and SpatialRGPT, FRIEDA focuses on map-specific symbolic systems (legends, scale bars, compass roses), evaluating symbol-to-semantics mapping ability rather than spatial perception in natural scenes.
The benchmark reveals systematic deficiencies in current LVLMs' understanding of symbolic visual representations; legend misinterpretation accounts for the largest share of errors, suggesting insufficient modeling of discrete symbol-to-semantics mappings. Cross-map reasoning failures resemble alignment issues in multi-image VQA and may require explicit spatial alignment modules or attention mechanisms. Distance estimation—which demands scale bar comprehension followed by numerical computation—represents a distinctive failure mode where tool-augmented LLMs may offer a viable solution. Directional reasoning performs relatively well, indicating that models have acquired basic compass recognition, yet still fail when the compass is rotated.
Rating¶
- Novelty: ⭐⭐⭐⭐ — First benchmark to comprehensively cover multiple spatial relation categories using real-world maps
- Experimental Thoroughness: ⭐⭐⭐⭐ — 11 models + human baseline + fine-grained error analysis
- Writing Quality: ⭐⭐⭐⭐ — Task definitions are clear, and GIS theory is tightly integrated with LVLM evaluation
- Value: ⭐⭐⭐⭐ — Fills an important evaluation gap with practical significance for advancing spatial intelligence in LVLMs