SPOT: Bridging Natural Language and Geospatial Search for Investigative Journalists
- Conference: ACL 2025
- arXiv: 2506.13188
- Code: GitHub | Demo
- Area: Other
- Keywords: Natural Language Interface, OpenStreetMap, Geolocation Verification, Investigative Journalism, LLM Fine-Tuning, YAML Query
TL;DR
SPOT fine-tunes LLaMA 3 to translate natural language scene descriptions into YAML queries and pairs the model with a semantic tag bundling mechanism, giving investigative journalists reliable natural language access to OpenStreetMap data for geolocation verification.
Background & Motivation
- Core Problem: OpenStreetMap (OSM) is a crucial resource for investigative journalists conducting geolocation verification, but its query language, OverpassQL, poses a high barrier to entry for non-technical users.
- Limitations of Prior Work:
- Overpass Turbo: Requires mastery of OverpassQL syntax, making it difficult for non-technical users.
- GeoGuessr GPT: Built on ChatGPT, but it is closed-source and does not connect to the OSM database.
- GeoSpy: Only accepts image input and does not support natural language.
- EarthKit: Requires users to manually select OSM tags, still presenting a technical barrier.
- OverpassT5 (Staniek et al.): Directly generates OverpassQL but requires users to understand the OSM tag schema.
- Design Motivation: To build a fully open-source, reliable, and accurate OSM geospatial search tool for investigative journalists that supports unstructured natural language input.
Method
Overall Architecture
SPOT consists of four core components: (1) OSM tag bundle construction and indexing → (2) synthetic training data generation → (3) LLaMA 3 fine-tuning → (4) inference and post-processing. At query time, the user enters a natural language description → the model outputs a YAML query → semantic search replaces the entity concepts with OSM tags → the query runs against a PostGIS database → an interactive map displays the results.
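The post-processing step in this flow, where entity names in the tag-free query are swapped for OSM tags, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the field names of the parsed query, the contents of `TAG_BUNDLES`, and the helper names `resolve_entity` and `attach_tags`. The standard-library `difflib` matcher is only a stand-in for SPOT's semantic search engine, chosen so the example runs without extra dependencies.

```python
from difflib import get_close_matches

# Hypothetical tag bundles: one user-facing concept maps to several OSM tags.
# The rail bundle mirrors the light rail / subway / tram grouping; tags are illustrative.
TAG_BUNDLES = {
    "rail station": ["railway=station", "station=subway", "station=light_rail", "railway=tram_stop"],
    "fountain": ["amenity=fountain"],
    "pharmacy": ["amenity=pharmacy"],
}

# Parsed form of a tag-free YAML query (field names are assumptions).
QUERY = {
    "area": "Cologne",
    "entities": [
        {"id": 0, "name": "farmacy"},  # deliberate typo
        {"id": 1, "name": "fountain"},
    ],
    "relations": [{"type": "distance", "source": 0, "target": 1, "value": "100 m"}],
}


def resolve_entity(name, bundles=TAG_BUNDLES, cutoff=0.6):
    """Return the tag bundle whose concept name is closest to `name`, or []."""
    match = get_close_matches(name.lower(), list(bundles), n=1, cutoff=cutoff)
    return bundles[match[0]] if match else []


def attach_tags(query):
    """Return a copy of the query in which every entity carries its OSM tags."""
    resolved = dict(query)
    resolved["entities"] = [
        {**entity, "tags": resolve_entity(entity["name"])}
        for entity in query["entities"]
    ]
    return resolved


if __name__ == "__main__":
    for entity in attach_tags(QUERY)["entities"]:
        print(entity["name"], "->", entity["tags"])
```

Because the model output never contains OSM tags, only `TAG_BUNDLES` (or its real counterpart) has to change when the tag schema evolves; the query structure and the model stay untouched.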
Key Designs
- Multi-layer Intermediate Representation (YAML): Instead of directly generating OverpassQL, the model first generates a YAML structured query without OSM tags (containing search area, entities, properties, and spatial relationships), and then maps the entity names to OSM tag bundles via a semantic search engine. This decoupled design avoids the need to retrain the model when OSM tags are updated.
- Semantic Tag Bundling System: Groups visually similar OSM tags (e.g., light rail / subway / tram into the same bundle) and combines BM25 + SBERT hybrid retrieval to handle typos and synonyms in user queries.
- Synthetic Training Data Pipeline: Generates 43,976 training samples by randomly combining YAML field values with 7 personas and 5 writing styles and using GPT-4o to produce the natural language side, covering real-world scenarios such as typos, grammatical errors, non-Latin alphabets, and ambiguous spatial terms.
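The sampling stage of such a synthetic data pipeline might look roughly like the sketch below. It only builds the generation jobs: a randomly assembled structured query plus a persona and a writing style, formatted into a prompt for a generator LLM. The persona and style labels are placeholders (the summary gives only their counts), and the value pools, prompt wording, and function names are all hypothetical. The call to the LLM is left out on purpose.

```python
import random

# Placeholder labels: the source gives the counts (7 and 5) but not the names.
PERSONAS = [f"persona_{i}" for i in range(1, 8)]
STYLES = [f"style_{i}" for i in range(1, 6)]

# Illustrative value pools for the YAML fields.
AREAS = ["Cologne", "Nairobi", "Lviv"]
ENTITIES = ["pharmacy", "fountain", "rail station", "bridge", "mosque"]


def sample_query(rng, max_entities=3):
    """Randomly combine field values into a tag-free structured query."""
    count = rng.randint(1, max_entities)
    names = rng.sample(ENTITIES, count)
    return {
        "area": rng.choice(AREAS),
        "entities": [{"id": i, "name": name} for i, name in enumerate(names)],
    }


def build_prompt(query, persona, style):
    """Format the instruction that a generator LLM would receive."""
    names = ", ".join(entity["name"] for entity in query["entities"])
    return (
        f"Acting as {persona} and writing in {style}, describe a place in "
        f"{query['area']} that contains: {names}."
    )


def generate_jobs(n, seed=0):
    """Return n (prompt, target_query) pairs; the LLM call itself is omitted."""
    rng = random.Random(seed)
    jobs = []
    for _ in range(n):
        query = sample_query(rng)
        prompt = build_prompt(query, rng.choice(PERSONAS), rng.choice(STYLES))
        jobs.append((prompt, query))
    return jobs
```

With 7 personas and 5 styles there are 35 persona/style combinations, so the same structured query can be rendered in many surface forms. Pairing each generated text with the query that produced it yields a training sample whose target is known by construction.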
Loss & Training
Fine-tunes LLaMA 3 using LoRA (rank=32, alpha=64) with a learning rate of 1e-5, weight decay of 0.01, and early stopping patience of 10.
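To make these hyperparameters concrete, the sketch below works through what they imply under the standard LoRA formulation, where the low-rank update is scaled by alpha / rank. The 4096 x 4096 projection size is an illustrative assumption, not a figure from the paper, and the summary does not say whether patience is counted in epochs or evaluation steps. The class and function names are hypothetical and this is not the authors' training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraHyperparams:
    """Values reported in the summary above."""
    rank: int = 32
    alpha: int = 64
    learning_rate: float = 1e-5
    weight_decay: float = 0.01
    patience: int = 10

    @property
    def scaling(self):
        # Standard LoRA scales the low-rank update B @ A by alpha / rank.
        return self.alpha / self.rank


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters for one adapted matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


class EarlyStopper:
    """Stop once validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


if __name__ == "__main__":
    hp = LoraHyperparams()
    adapter = lora_param_count(4096, 4096, hp.rank)  # assumed projection size
    print(hp.scaling, adapter, adapter / (4096 * 4096))
```

For the assumed 4096 x 4096 matrix, rank 32 trains 262,144 adapter parameters against 16,777,216 frozen ones, about 1.6% of that matrix.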
Key Experimental Results
Main Results (195 Real-World User Query Benchmark)
| Model | Adaptation | Area | Entity | Entity* | Property | Relation |
|---|---|---|---|---|---|---|
| GPT-4o | Zero-shot | 88.14 | 2.28 | 90.21 | 3.03 | 9.8 |
| GPT-4o | One-shot | 89.18 | 1.13 | 92.03 | 10.96 | 11.11 |
| Mistral | Adapter | 93.33 | 82.54 | 95.01 | 56.58 | 45.45 |
| LLaMA 3 | Adapter | 92.31 | 81.41 | 96.15 | 50.00 | 48.05 |
| Qwen2.5 | Adapter | 92.31 | 82.31 | 95.69 | 51.95 | 52.60 |
| Phi | Adapter | 92.82 | 79.59 | 94.10 | 53.33 | 53.90 |
| mT5 | Adapter | 88.21 | 72.34 | 90.02 | 48.89 | 37.01 |
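As a quick reading aid, the table can be loaded into a small script to see which system leads each column. The numbers are copied from the table above; the dictionary layout and the helper name `best_per_metric` are just one convenient arrangement.

```python
METRICS = ["Area", "Entity", "Entity*", "Property", "Relation"]

# Scores copied from the main results table.
RESULTS = {
    "GPT-4o (zero-shot)": [88.14, 2.28, 90.21, 3.03, 9.8],
    "GPT-4o (one-shot)": [89.18, 1.13, 92.03, 10.96, 11.11],
    "Mistral": [93.33, 82.54, 95.01, 56.58, 45.45],
    "LLaMA 3": [92.31, 81.41, 96.15, 50.00, 48.05],
    "Qwen2.5": [92.31, 82.31, 95.69, 51.95, 52.60],
    "Phi": [92.82, 79.59, 94.10, 53.33, 53.90],
    "mT5": [88.21, 72.34, 90.02, 48.89, 37.01],
}


def best_per_metric(results=RESULTS, metrics=METRICS):
    """Map each metric to the (model, score) pair with the highest score."""
    best = {}
    for index, metric in enumerate(metrics):
        model = max(results, key=lambda name: results[name][index])
        best[metric] = (model, results[model][index])
    return best


if __name__ == "__main__":
    for metric, (model, score) in best_per_metric().items():
        print(f"{metric}: {model} ({score})")
```

No single fine-tuned model wins every column: Mistral leads on Area, Entity, and Property, LLaMA 3 on Entity*, and Phi on Relation.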
Ablation Study (Hallucination Comparison: GPT-4o vs. Fine-tuned Models)
| Model | Entity Omission | Entity Hallucination | Property Omission | Property Hallucination |
|---|---|---|---|---|
| GPT-4o (0-shot) | 48 | 37 | 53 | — |
| Fine-tuned LLMs | Significantly Reduced | Significantly Reduced | Significantly Reduced | — |
Key Findings
- Fine-tuned Small Models Far Outperform GPT-4o Zero/One-shot: On the Entity metric, zero-shot GPT-4o achieves only 2.28%, whereas the fine-tuned Mistral reaches 82.54%, indicating that the OSM tag schema requires domain adaptation. On Entity*, GPT-4o already scores 90.21%, so the gap is concentrated in the Entity, Property, and Relation columns.
- Synthetic Data Pipeline is Effective: 43K synthetic samples cover various real-world user input patterns (typos, non-Latin alphabets, ambiguous spatial terms), making the fine-tuned model robust.
- YAML Intermediate Representation Outperforms Direct OverpassQL Generation: The decoupled design allows the tag system to be updated independently, and YAML syntax is more fault-tolerant than JSON.
- Properties and Relations Remain Challenging: Even the best fine-tuned results leave significant room for improvement, with property accuracy peaking at 56.58% (Mistral) and relation accuracy at 53.90% (Phi).
Highlights & Insights
- The first full-stack open-source natural language geospatial search system oriented towards investigative journalism, which has been deployed in production.
- Innovative multi-layer intermediate representation design (YAML → semantic search → OSM tags), decoupling language understanding and tag mapping.
- The synthetic data pipeline design is derived from user research within the professional OSINT community, covering real-world scenarios such as typos, multilingual queries, and ambiguous spatial terms.
- Fine-tuned open-source small models (LLaMA 3 8B) outperform GPT-4o zero/one-shot on every reported metric, most sharply on Entity, Property, and Relation.
- Model weights and the training pipeline are fully open-sourced, enabling deployment on private infrastructure to meet the security requirements of news organizations.
Limitations & Future Work
- Recognition accuracy for properties and spatial relations remains low (best results of 56.58% and 53.90%, respectively), and complex queries involving multiple entities and relations may fail.
- Relies on the completeness and coverage of OSM data, which can be sparse in certain developing regions.
- The tag bundling list is statically hand-crafted, requiring manual maintenance and updates for newly emerging geospatial feature types.
- Evaluated on only 195 test queries, a relatively small benchmark that may not cover all real-world scenarios.
- Training data is fully generated by a synthetic pipeline, which may lead to distribution shift relative to real user queries.
- The end-to-end geolocation success rate (i.e., whether users can actually find the target location) was not evaluated.
Related Work & Insights
- Text-to-SQL: DAIL-SQL (Gao et al. 2024), MCS-SQL (Lee et al. 2025), Jang et al. 2023 (T5 adapter tuning), Zhang et al. 2024 (LLaMA adapter)
- OSM Query: OverpassT5 (Staniek et al. 2024, direct OverpassQL generation), Lawrence & Riezler 2016 (semantic parsing intermediate representation), Will 2021
- Geospatial Tools: Overpass Turbo (native OverpassQL), GeoSpy (image input, closed-source), EarthKit (semi-structured, manually selected tags), GeoGuessr GPT (ChatGPT wrapper, closed-source)
- LLM Fine-Tuning: LoRA (Hu et al.), Unsloth accelerated training, LLaMA 3 (Touvron et al.)
- Semantic Retrieval: SBERT (Reimers & Gurevych 2019), Elasticsearch hybrid retrieval (BM25 + vector)
Rating
- Novelty: ⭐⭐⭐⭐ — Novel system design combining multi-layer intermediate representation and semantic tag bundling
- Value: ⭐⭐⭐⭐⭐ — Already deployed in production and fully open-sourced, directly serving investigative journalists
- Experimental Thoroughness: ⭐⭐⭐ — Small test set, with room for improvement in some metrics
- Overall: ⭐⭐⭐⭐