
GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization

Conference: NeurIPS 2025 arXiv: 2505.13731 Code: Available Area: Multimodal VLM Keywords: Image geolocalization, distance-aware ranking, large vision-language models, learning-to-rank, spatial reasoning

TL;DR

This paper proposes GeoRanker, a distance-aware ranking framework that leverages large vision-language models (LVLMs) to model spatial relationships between queries and candidates, achieving state-of-the-art worldwide image geolocalization via a multi-order distance loss.

Background & Motivation

Worldwide image geolocalization — predicting GPS coordinates from a single image — poses substantial challenges due to the extreme visual variability across geographic regions. Current state-of-the-art methods (e.g., G3) adopt a two-stage pipeline: first retrieve candidates, then select the best match. Two critical bottlenecks exist:

Candidate selection relies on simple heuristics: Existing methods typically encode queries and candidates independently and match them by cosine similarity, which fails to model their interactions and spatial relationships and makes it difficult to distinguish visually similar yet geographically distant scenes.

Training objectives ignore spatial structure: Point-wise similarity supervision neglects the rich spatial relationships among candidates — such as Tobler's First Law of Geography and the relative distances between candidates.

Analysis reveals that the current SOTA (G3) achieves only 16.7% accuracy at 1 km on IM2GPS3K, yet top-\(k\) retrieved candidates frequently contain better matches — indicating that the bottleneck lies in the ranking stage rather than the retrieval stage.

Method

Overall Architecture

GeoRanker consists of two stages:

  1. Data construction stage: Constructs the GeoRanking dataset, providing spatially diverse candidate sets for each query.
  2. Ranking model stage: Employs an LVLM to jointly encode query–candidate interactions and predict geographic distance scores.

Key Designs

1. GeoRanking Dataset Construction

  • Database encoding: Uses the MP16-Pro multimodal dataset; each candidate \(c_m\) is encoded by three encoders (GPS, text, image) into a feature vector \(\mathbf{v}_{c_m} = \text{concat}(\text{Enc}^{gps}, \text{Enc}^{text}, \text{Enc}^{img})\).
  • Query encoding: Query images are projected into GPS and text embedding spaces via adapter layers, made compatible with the multimodal candidate vectors.
  • Candidate retrieval: Cosine similarity is computed to select top-\(N\) candidates, where the top-\(k_1\) form the ranking candidates \(\mathcal{C}_{rc}\) and the bottom \(k_2\) serve as negative samples \(\mathcal{C}_{neg}\).
  • Dataset scale: 100K samples, comprising 2 million query–candidate pairs.
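The retrieval step above can be sketched in a few lines. This is a minimal, hedged illustration, not the paper's implementation: the encoders are abstracted away into precomputed vectors, and the function and field names (`build_candidate_sets`, `vec`) are my own.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def build_candidate_sets(query_vec, database, n=10, k1=3, k2=2):
    """Rank database entries by cosine similarity to the query,
    keep the top-n, then split into ranking candidates (top-k1)
    and negative samples (bottom-k2 of the retrieved top-n)."""
    ranked = sorted(database, key=lambda c: cosine(query_vec, c["vec"]),
                    reverse=True)
    top_n = ranked[:n]
    return top_n[:k1], top_n[-k2:]
```

In the paper, `query_vec` would be the adapter-projected query embedding and each `vec` the concatenated GPS/text/image candidate embedding; here they are just toy vectors.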

2. GeoRanker Ranking Model

Unlike conventional methods that encode queries and candidates independently, GeoRanker assembles them into a prompt and uses an LVLM to model complex interactions:

  • Input construction: \(\mathbf{x} = \text{Prompt}(q, \mathcal{C}_{rc}, \mathcal{C}_{neg}, p)\), where negative samples use text only (GPS + description) to reduce memory consumption.
  • Model architecture: Qwen2-VL-7B-Instruct serves as the LVLM backbone, with LoRA (\(r=16\), \(\alpha=32\)) inserted into the q/k/v projection layers.
  • Distance score: The hidden state at the final token position is mapped to a scalar distance score via a linear value head: \(s = \mathbf{w}^\top \mathbf{h}_{final}\).
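The value head itself is just a linear map on the final-token hidden state, i.e. a dot product. A toy sketch, with the hidden state and weight vector as plain lists (names assumed):

```python
def value_head_score(hidden_final, w):
    """Map the LVLM's final-token hidden state to a scalar
    distance score via a linear value head: s = w^T h_final."""
    return sum(wi * hi for wi, hi in zip(w, hidden_final))
```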

3. Inference Stage

At inference, retrieved candidates \(\mathcal{C}_r\) are combined with LVLM-generated candidates \(\mathcal{C}_g\) (produced via GPT-4V). A score is computed for each query–candidate pair, and the GPS coordinates of the highest-scoring candidate are used as the final prediction.
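The inference rule reduces to an argmax over per-pair scores. In this sketch, `score_fn` stands in for a full GeoRanker forward pass (prompt construction plus value head); the function name and candidate schema are assumptions for illustration:

```python
def predict_gps(query, candidates, score_fn):
    """Score each query-candidate pair with the ranking model and
    return the GPS coordinates of the highest-scoring candidate.
    `candidates` would be the union of retrieved and LVLM-generated
    candidates; `score_fn` is a stand-in for the ranking model."""
    best = max(candidates, key=lambda c: score_fn(query, c))
    return best["gps"]
```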

Loss & Training

Multi-Order Distance Optimization Objectives

First-order distance loss (partial Plackett-Luce loss): Candidates are sorted in ascending order of true geographic distance; the PL loss optimizes the ranking consistency of predicted scores:

$$\mathcal{L}_{PL}^{(1)} = -\frac{1}{K^{(1)}} \sum_{i=1}^{K^{(1)}} \log \frac{\exp(s_{\pi(i)})}{\sum_{j=i}^{k_1} \exp(s_{\pi(j)})}$$
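A minimal plain-Python sketch of this loss, assuming `scores` are the model's predicted scores (higher = predicted closer) and `distances` the true geographic distances; function and argument names are mine, not the paper's:

```python
import math

def partial_pl_loss(scores, distances, K1=1):
    """First-order partial Plackett-Luce loss: sort candidates by
    ascending true distance, then for each of the top K1 positions
    add a listwise log-softmax term over the remaining candidates,
    so the geographically closest candidate is pushed to the top."""
    order = sorted(range(len(scores)), key=lambda i: distances[i])
    s = [scores[i] for i in order]          # scores in ascending-distance order
    loss = 0.0
    for i in range(K1):
        denom = sum(math.exp(sj) for sj in s[i:])
        loss -= math.log(math.exp(s[i]) / denom)
    return loss / K1
```

With \(K^{(1)}=1\), as used in the paper, this reduces to a single softmax cross-entropy term that rewards giving the closest candidate the highest score.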

Second-order distance loss: Captures relative spatial differences among candidates — supervising the ranking of first-order distance differences, such that candidate pairs with larger geographic distance gaps receive larger score gaps:

$$\mathcal{L}_{PL}^{(2)} = -\frac{1}{K^{(2)}} \sum_{i=1}^{K^{(2)}} \log \frac{\exp(\Delta s_{(i)})}{\sum_{j=i}^{P} \exp(\Delta s_{(j)})}$$
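One plausible instantiation of this idea in plain Python (the exact pairing scheme is an assumption on my part): form score gaps \(\Delta s\) for adjacent candidates in ascending-distance order, rank the gaps by the corresponding geographic distance gaps, and apply the same partial PL loss:

```python
import math

def second_order_pl_loss(scores, distances, K2=1):
    """Second-order distance loss sketch: compute score gaps for
    adjacent candidates in ascending-distance order, sort the gaps
    by their geographic distance gaps (largest first), and apply a
    partial Plackett-Luce loss so that pairs with larger distance
    gaps are pushed toward larger score gaps."""
    order = sorted(range(len(scores)), key=lambda i: distances[i])
    pairs = []
    for a, b in zip(order, order[1:]):
        pairs.append((distances[b] - distances[a], scores[a] - scores[b]))
    pairs.sort(key=lambda p: p[0], reverse=True)  # largest distance gap first
    ds = [gap for _, gap in pairs]                # ordered score gaps
    loss = 0.0
    for i in range(K2):
        denom = sum(math.exp(x) for x in ds[i:])
        loss -= math.log(math.exp(ds[i]) / denom)
    return loss / K2
```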

Joint optimization: \(\mathcal{L}_{total} = \lambda \cdot \mathcal{L}_{PL}^{(1)} + (1-\lambda) \cdot \mathcal{L}_{PL}^{(2)}\), with \(\lambda=0.7\) and \(K^{(1)}=1\).

Training configuration: AdamW, lr=1e-4, batch size=4, 1 epoch, on 4 × NVIDIA L40S GPUs.

Key Experimental Results

Main Results

| Method | IM2GPS3K 1 km | IM2GPS3K 25 km | IM2GPS3K 200 km | YFCC4K 1 km | YFCC4K 25 km | YFCC4K 200 km |
| --- | --- | --- | --- | --- | --- | --- |
| GeoCLIP (NeurIPS'23) | 14.11 | 34.47 | 50.65 | 9.59 | 19.31 | 32.63 |
| PIGEON (CVPR'24) | 11.3 | 36.7 | 53.8 | 10.4 | 23.7 | 40.6 |
| G3 (NeurIPS'24) | 16.65 | 40.94 | 55.56 | 23.99 | 35.89 | 46.98 |
| GeoRanker | 18.79 | 45.05 | 61.49 | 32.94 | 43.54 | 54.32 |
| Relative Gain (vs. G3) | ↑12.9% | ↑10.0% | ↑10.7% | ↑37.3% | ↑21.3% | ↑15.6% |

Ablation Study (IM2GPS3K)

| Variant | 1 km | 25 km | 200 km | 750 km | 2500 km |
| --- | --- | --- | --- | --- | --- |
| w/o second-order loss | 18.48 | 44.61 | 60.96 | 75.61 | 88.28 |
| w/o negative samples | 17.35 | 44.51 | 60.82 | 76.37 | 88.28 |
| w/o candidate images | 15.58 | 41.77 | 59.15 | 75.40 | 88.35 |
| w/o generated candidates | 18.21 | 43.47 | 59.69 | 75.47 | 88.75 |
| Full model | 18.79 | 45.05 | 61.49 | 76.31 | 89.29 |

Key Findings

  1. All components contribute positively to final performance; candidate image information contributes the most (its removal causes a 3.21-point drop in 1 km accuracy, from 18.79 to 15.58).
  2. The second-order loss contributes more prominently at coarse granularities (country/continent), confirming the benefit of modeling relative spatial relationships among candidates.
  3. GeoRanker outperforms all ranking baselines (Random, Top-1 similarity, LVLM Prompting) across the board.
  4. Hyperparameter analysis shows \(\lambda=0.7\) and \(K^{(1)}=1\) are optimal; increasing \(K^{(1)}\) introduces a train–test distribution mismatch.

Highlights & Insights

  1. Paradigm shift from similarity matching to structured spatial reasoning: Rather than encoding queries and candidates independently, GeoRanker models their interactions via an LVLM.
  2. Elegant multi-order distance loss design: The first-order loss constrains who is closest, while the second-order loss constrains by how much — the two are complementary.
  3. GeoRanking dataset is the first dataset specifically designed for geographic ranking tasks, representing a meaningful contribution to the field.
  4. The 37.3% relative improvement in 1 km accuracy on YFCC4K is particularly striking.

Limitations & Future Work

  1. Inference requires an LVLM forward pass for every query–candidate pair, resulting in high computational cost.
  2. Inference relies on GPT-4V to generate candidates, introducing additional API overhead.
  3. The candidate retrieval stage still depends on conventional embedding similarity, and retrieval quality imposes an upper bound on final performance.
  4. Evaluation is limited to two geolocalization benchmarks; broader scenario assessment (e.g., indoor environments, adverse weather) is lacking.

Connections & Takeaways

  • The work draws deeply on learning-to-rank (LTR) methodology; the application of the Plackett-Luce ranking loss to spatial ranking is particularly noteworthy.
  • The RAG-style retrieve-then-generate paradigm from G3 is inherited and improved upon; the ranking module constitutes an independent contribution.
  • Inspiration: similar distance-aware ranking ideas could transfer to other spatial reasoning tasks, such as map matching and scene navigation.

Rating

  • Novelty: ⭐⭐⭐⭐ (The combination of multi-order distance loss and LVLM-based ranking is innovative)
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Comprehensive ablations, hyperparameter analysis, and ranking baseline comparisons)
  • Writing Quality: ⭐⭐⭐⭐ (Clear structure and rigorous mathematical derivations)
  • Value: ⭐⭐⭐⭐ (Significant performance gains and an open-source dataset contribution)