GeoRC: A Benchmark for Geolocation Reasoning Chains

Conference: ACL 2026
arXiv: 2601.21278
Code: GitHub
Area: Multimodal/Geolocation
Keywords: Geolocation, Reasoning Chains, VLM Evaluation, GeoGuessr, Explainability

TL;DR

Proposes GeoRC, the first benchmark of geolocation reasoning chains written by GeoGuessr champion-level experts (800 reasoning chains over 500 scenes). It evaluates whether VLMs can generate auditable reasoning chains and finds that closed-source VLMs can match human localization accuracy but produce markedly weaker reasoning chains, while open-source VLMs score close to a pure hallucination baseline.

Background & Motivation

Background: VLMs now approach top human expert performance on global image localization: the country-level accuracy of large closed-source models (Gemini, GPT-5) is comparable to that of GeoGuessr world champions.

Limitations of Prior Work: While VLMs can locate photos, they perform poorly when explaining "why this location was chosen"—reasoning chains often contain hallucinations, miss fine-grained visual details, and exhibit tunnel-vision style post-hoc rationalization. This makes their localization decisions unauditable and unverifiable.

Key Challenge: Localization accuracy is high but explainability shows a massive gap—a VLM's "correct answer" may be based on incorrect reasoning paths, which is unacceptable in applications such as investigative journalism and OSINT that require trustworthy reasoning chains.

Goal: To build the first geolocation reasoning chain benchmark written by top experts to quantify the gap between VLM reasoning chains and human experts.

Key Insight: Invite three GeoGuessr champion-level players (including the 2025 world champion) to write detailed localization reasoning processes to establish a "gold standard" for reasoning chains.

Core Idea: Score how well VLM reasoning chains match expert reasoning chains with a precision-recall-F1 framework, using an LLM-as-judge for automated evaluation.

Method

Overall Architecture

GeoRC addresses whether VLMs can clearly explain localization decisions. It consists of three components: first, three GeoGuessr champions wrote 800 reasoning chains for 500 scenes as the human expert "gold standard"; second, three automatic evaluation methods (one-to-all LLM-as-judge, key-point guided LLM-as-judge, and VLM-as-judge) were designed to align and score candidate chains against expert chains; finally, precision/recall/F1 are used to measure reasoning chain quality, combined with country-level localization accuracy to measure the correctness of conclusions. The core of the workflow is transforming the "reasoning process" into a quantifiable and auditable evaluation target, rather than just checking if the final answer is correct.

Key Designs

1. Expert Reasoning Chain Dataset: Defining the "Gold Standard" through World Champion Thought Processes

VLMs often correctly identify a location but fail to state the underlying evidence. The problem is that there is no trustworthy reference to judge if its reasoning is sound. GeoRC invited three champion-level players (including 2025 world champion Radu Casapu) to write reasoning chains for 500 GeoGuessr locations, documenting the full range of localization logic from coarse to fine—hundreds of distinguishing scene attributes such as infrastructure, vegetation, architectural styles, vehicles, and text are utilized as clues to narrow the range. 150 locations were co-annotated by multiple experts to calculate inter-expert consistency.

Notably, reasoning chains are naturally non-exhaustive: different experts focus on different clues for the same image, and no one lists all available information. This variation is not noise but a fundamental value of the evaluation—it means that even between human experts, the F1 score is only approximately 57, providing a realistic ceiling for VLM performance.

2. One-to-all LLM-as-judge Evaluation: Splitting Reasoning Chain Alignment into Precision and Recall

With expert chains established, a scalable scoring method was required. Every step of a candidate reasoning chain is compared against all steps of the reference chain to calculate similarity: a forward traversal yields precision (how many candidate steps correspond to the reference), and a backward traversal yields recall (how many reference clues were covered by the candidate). These are combined into an F1 score. Thus, a reasoning chain is penalized by precision for "stating irrelevant attributes" and by recall for "missing key clues."
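The forward and backward traversals can be sketched as follows. This is a minimal illustration and not the benchmark's implementation: it assumes the judge returns a binary match per step pair (the actual judge may produce graded similarity), and `chain_scores` and `toy_judge` are hypothetical names, with a word-overlap check standing in for the LLM call.

```python
from typing import Callable, Sequence, Tuple

# Assumption: the judge decides whether two reasoning steps express the same clue.
Judge = Callable[[str, str], bool]


def chain_scores(
    candidate: Sequence[str], reference: Sequence[str], judge: Judge
) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) on a 0-1 scale (the paper reports 0-100)."""
    if not candidate or not reference:
        return 0.0, 0.0, 0.0
    # Forward traversal: candidate steps that correspond to any reference step.
    matched_candidate = sum(any(judge(c, r) for r in reference) for c in candidate)
    # Backward traversal: reference steps covered by any candidate step.
    matched_reference = sum(any(judge(c, r) for c in candidate) for r in reference)
    precision = matched_candidate / len(candidate)
    recall = matched_reference / len(reference)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def toy_judge(a: str, b: str) -> bool:
    """Hypothetical stand-in for the LLM judge: steps match if they share a word."""
    return bool(set(a.lower().split()) & set(b.lower().split()))


candidate = ["bollards look european", "red soil suggests tropics", "left-hand driving"]
reference = ["bollards match romania", "utility poles concrete"]
precision, recall, f1 = chain_scores(candidate, reference, toy_judge)
```

In this toy case only the bollard step aligns, so two of the three candidate steps cost precision and the missed utility-pole clue costs recall.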

The use of an LLM as a judge is justified by its agreement with human scoring: the one-to-all method has an MAE of 12.06 against human scores, slightly below the human-to-human MAE of 12.72, so the automated evaluation error falls within the range of human disagreement. Its correlation coefficient with human scores is 0.69.
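For reference, the two agreement statistics can be computed as in the sketch below. The score lists are made-up toy values, not data from the paper, and the choice of Pearson correlation is an assumption, since the summary does not say which correlation coefficient was used.

```python
from math import sqrt
from typing import Sequence


def mean_absolute_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Average absolute difference between paired scores."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def pearson_r(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation between paired scores (assumed correlation type)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need at least two paired scores")
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_a * var_b)


# Toy scores on a 0-100 scale; purely illustrative.
judge_scores = [50, 60, 30, 80]
human_scores = [40, 70, 35, 75]
mae = mean_absolute_error(judge_scores, human_scores)
r = pearson_r(judge_scores, human_scores)
```

Running the same two functions on pairs of human raters gives the human-to-human reference values against which the judge's error is compared.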

3. Multi-level Baseline Design: Calibrating VLM Levels with Three "Floor Lines"

A score alone is not intuitive—the difference between a 56 and an 18 requires context. GeoRC set three baselines to bound the reasoning chain quality: a Random baseline (using an expert chain from a different location), which scores near zero (1.90); a Hallucination baseline (telling the LLM the country/city without the image and letting it "invent" a chain), which scores approximately 18; and a Rewrite baseline (rewriting the best expert chain), which achieves a high score.

The Hallucination baseline is the most diagnostic: it represents the score achievable by relying purely on geographic common sense without looking at the image. If a VLM's score is close to this line, it suggests the model is not extracting real scene information from the image; rather, its reasoning is merely post-hoc rationalization. As shown later, open-source VLMs fall into this trap.
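The Random and Hallucination baselines could be constructed roughly as sketched below. The function names, the data layout (a dict from location id to one expert chain), and the prompt wording are all hypothetical, chosen only to illustrate the two ideas described above.

```python
import random
from typing import Dict, List


def random_baseline(
    chains_by_location: Dict[str, List[str]], seed: int = 0
) -> Dict[str, List[str]]:
    """Assign each location an expert chain written for a different location."""
    ids = list(chains_by_location)
    if len(ids) < 2:
        raise ValueError("need at least two locations to swap chains")
    rng = random.Random(seed)
    baseline = {}
    for loc in ids:
        other = rng.choice([i for i in ids if i != loc])
        baseline[loc] = chains_by_location[other]
    return baseline


def hallucination_prompt(country: str, city: str) -> str:
    """Hypothetical prompt: the model is told the answer but never sees the image."""
    return (
        f"A street-level photo was taken in {city}, {country}. "
        "Without seeing the photo, write a step-by-step reasoning chain "
        "that a geolocation expert might give to justify this location."
    )


expert_chains = {
    "loc_a": ["bollards match romania"],
    "loc_b": ["red soil suggests tropics"],
    "loc_c": ["left-hand driving"],
}
swapped = random_baseline(expert_chains)
prompt = hallucination_prompt("Romania", "Cluj-Napoca")
```

Scoring `swapped` against the true expert chains approximates the near-zero floor, while chains generated from `prompt` approximate what geographic common sense alone can achieve.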

Key Experimental Results

Main Results

Candidate	F1	Country Accuracy
Human Expert Avg	56.69	94.67%
GPT-4.1	~44	~90%
Gemini 2.5 Pro	~40	~88%
GPT-5	~42	~92%
Qwen2.5-VL-72B	~35	~70%
Llama-3.2-90B	~20	~55%
Hallucination Baseline	18.13	—
Random Baseline	1.90	—

Key Findings

The best VLM (GPT-4.1) still has an F1 gap of approximately 12 points compared to human experts.
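Open-source VLMs score close to the hallucination baseline: Llama-3.2-90B reaches ~20 against a baseline of 18.13, implying it extracts almost no useful scene information from the image, while Qwen2.5-VL-72B (~35) does better but remains well below the expert average.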
Models cluster into three distinct groups: Expert > Closed-source VLM > Open-source VLM.
Similar localization accuracy does not imply similar reasoning chain quality: GPT-5 is close to human accuracy (~92% vs 94.67%) yet trails the expert average by roughly 15 F1 points.
Qwen2.5 has higher recall than precision because its reasoning chains include many irrelevant, non-distinguishing attributes.
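The arithmetic behind these findings can be reproduced from the results table. The sketch below uses the table's values (several of them approximate) and a hypothetical helper, `position`, that places a score between the hallucination floor and the expert ceiling; that normalization is an illustration, not a metric from the paper.

```python
EXPERT_F1 = 56.69          # human expert average from the table
HALLUCINATION_F1 = 18.13   # hallucination baseline from the table

# Approximate F1 values as listed in the table.
model_f1 = {
    "GPT-4.1": 44.0,
    "GPT-5": 42.0,
    "Gemini 2.5 Pro": 40.0,
    "Qwen2.5-VL-72B": 35.0,
    "Llama-3.2-90B": 20.0,
}


def gap_to_expert(f1: float) -> float:
    """F1 points separating a model from the expert average."""
    return EXPERT_F1 - f1


def position(f1: float) -> float:
    """Hypothetical normalization: 0 = hallucination floor, 1 = expert ceiling."""
    return (f1 - HALLUCINATION_F1) / (EXPERT_F1 - HALLUCINATION_F1)


for name, f1 in model_f1.items():
    print(f"{name}: gap {gap_to_expert(f1):.2f}, position {position(f1):.2f}")
```

With these inputs GPT-4.1 sits about two thirds of the way from the floor to the expert average, while Llama-3.2-90B covers only about 5% of that distance.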

Highlights & Insights

Fills an important gap: The first geolocation reasoning chain benchmark written by genuine world-champion level experts.
Reveals a profound disparity: Accuracy \(\neq\) Reasoning ability; a VLM's "correct answer" may be based on hallucinated reasoning.
Warning from the Hallucination baseline: The quality of open-source VLM reasoning chains is close to that of LLM hallucinations produced without the image, indicating a severe deficiency in visual understanding in these models.
Practical evaluation methodology: The LLM-as-judge approach agrees with human scoring about as well as humans agree with each other (MAE 12.06 vs 12.72) and is scalable.

Limitations & Future Work

Data limited by Google Street View coverage: Insufficient coverage in certain regions (e.g., Africa, Central Asia).
Limited inter-expert consistency: The non-exhaustive nature of reasoning chains results in an expert-to-expert F1 of only ~57.
Only evaluates country-level localization: Finer-grained (city/street) evaluation is more difficult.
Future directions: Finer-grained localization evaluation, construction of training data, and human-in-the-loop localization systems.

Related Work & Insights

vs PIGEON: PIGEON claims to surpass humans but is compared on accuracy only, not on reasoning quality.
vs Traditional Geolocation methods: Methods like im2gps lack reasoning chain generation capabilities.
vs General VLM Benchmarks: Benchmarks like ChartQA focus on different dimensions, while GeoRC focuses on fine-grained visual attribute extraction.

Rating

Novelty: ⭐⭐⭐⭐⭐ First expert-level reasoning chain benchmark; unique and valuable problem definition.
Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple VLMs and evaluation methods with human verification and multi-baseline comparisons.
Writing Quality: ⭐⭐⭐⭐⭐ Reasoning chain examples and analytical charts are highly persuasive.
Value: ⭐⭐⭐⭐⭐ Opens a new dimension for VLM explainability evaluation with direct applications in OSINT and investigative journalism.