GeoSURGE: Geo-localization using Semantic Fusion with Hierarchy of Geographic Embeddings

Conference: CVPR 2026
arXiv: 2510.01448
Code: N/A
Area: Image Retrieval & Localization
Keywords: Visual geo-localization, semantic fusion, hierarchical geographic embeddings, contrastive learning, cross-attention

TL;DR

GeoSURGE introduces hierarchical geographic embeddings and a semantic fusion module, framing global image geo-localization as a matching problem between visual representations and learned geographic representations. The method achieves state-of-the-art performance on 22 out of 25 metrics across 5 benchmarks.

Background & Motivation

Background: Global visual geo-localization aims to determine the geographic location of an image solely from its visual content. Existing approaches fall into two main categories: retrieval-based methods (matching a query image against a large-scale geotagged image database) and classification-based methods (discretizing the Earth's surface into geographic cells and training a classifier). Recent works such as GeoCLIP, which replaces image references with GPS coordinates, and Img2Loc and G3, which leverage large vision-language models, have further advanced performance.

Limitations of Prior Work: Retrieval-based methods require large-scale similarity search at inference time, incurring substantial computational cost. Classification-based methods must trade off between spatial resolution and global coverage. More fundamentally, the low dimensionality of GPS coordinates makes it difficult to learn expressive geographic representations. While GeoCLIP partially addresses this with Random Fourier Features, the low-dimensional GPS bottleneck remains.

Key Challenge: Geographic coordinates (latitude and longitude) are just a pair of scalars, a two-dimensional input that struggles to encode rich geographic semantics. Moreover, image appearance features are sensitive to changes in illumination, weather, and viewpoint, limiting the robustness of purely RGB-based representations.

Goal: (1) Construct geographic representations expressive enough to yield discriminative features across geographic regions at multiple scales. (2) Make visual representations more robust by incorporating scene semantic information to complement appearance features.

Key Insight: The authors observe that the notion of geographic cells from classification methods can be combined with the matching paradigm of retrieval methods — rather than treating geographic cells as discrete class labels, a trainable embedding vector is learned for each cell. Simultaneously, scene structural information from semantic segmentation is used to supplement RGB appearance features.

Core Idea: Replace low-dimensional GPS coordinates with hierarchical learnable geographic embeddings as the geographic representation, and replace purely appearance-based features with a latent cross-attention fusion of semantic segmentation and RGB modalities.

Method

Overall Architecture

GeoSURGE takes an RGB image as input and predicts its latitude and longitude on Earth. The system consists of two core components: (1) Geographic representation — the Earth's surface is recursively partitioned into multi-level geographic cells using S2 Geometry, with each cell assigned a learnable embedding vector, forming a hierarchical distributed geographic representation; (2) Visual representation — RGB features are extracted via a CLIP ViT backbone, while semantic segmentation maps are generated by OneFormer, and the two are fused into a robust visual feature vector via latent cross-attention. The model is trained with an InfoNCE contrastive objective to align visual-geographic feature pairs. At inference, hierarchical matching is performed level by level, and the product of probabilities across levels yields the final prediction.
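
The level-by-level product described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the `parents` lookup table, and the toy scores are all assumptions, and the scores are taken to be image-to-cell similarities already divided by the per-level temperature.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hierarchical_predict(level_scores, parents):
    """Combine per-level probabilities along each ancestor chain.

    level_scores[l][c]: similarity of the query to cell c at level l (coarsest first).
    parents[l][c]: index of the parent (at level l - 1) of cell c; parents[0] is unused.
    Returns (index of the best finest-level cell, joint scores of all finest cells).
    """
    probs = [softmax(s) for s in level_scores]
    joint = probs[0][:]
    for l in range(1, len(probs)):
        # Multiply each cell's probability by the accumulated score of its parent.
        joint = [probs[l][c] * joint[parents[l][c]] for c in range(len(probs[l]))]
    best = max(range(len(joint)), key=joint.__getitem__)
    return best, joint

# Toy example: 2 coarse cells, 3 fine cells (fine cells 0 and 1 lie inside coarse cell 0).
level_scores = [[2.0, 0.0], [0.0, 1.0, 1.2]]
parents = [None, [0, 0, 1]]
best, joint = hierarchical_predict(level_scores, parents)
```

In this toy case the finest level alone slightly prefers fine cell 2, but its coarse parent is unlikely, so the product selects fine cell 1 instead. That is the sense in which coarse levels contribute confidence and fine levels contribute resolution.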

Key Designs

Hierarchical Geographic Embeddings
Function: Provide multi-scale distributed feature representations for the Earth's surface.
Mechanism: Google S2 Geometry projects the Earth's surface onto the six faces of a cube and recursively subdivides them. Any cell containing more than \(\tau_{max}\) samples is further split; cells with fewer than \(\tau_{min}\) samples are discarded. By varying \(\tau_{max}\) (from 25,000 to 500, yielding 7 levels), a coarse-to-fine multi-level partition is produced. Each geographic cell at each level is associated with a learnable 768-dimensional embedding vector, aligned with image features via contrastive learning during training. Embeddings at different levels are learned independently to encourage diversity; at inference, probabilities from all levels are multiplied to obtain the hierarchical inference result.
Design Motivation: GPS coordinates are only two-dimensional, so their expressiveness is limited and auxiliary components (e.g., Random Fourier Features) are needed to enhance them. Learnable embedding vectors can accumulate visual information from all training images in a region, yielding richer geographic representations. The multi-scale hierarchical design enables coarse and fine-grained information to complement each other.
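
The count-driven splitting rule can be sketched as follows. As a loudly stated assumption, this uses a plain latitude/longitude quadtree as a stand-in for S2 cells (S2 projects onto cube faces, which this sketch does not do), and the thresholds and point counts are toy values, not the paper's settings. The function name `partition` and its parameters are hypothetical.

```python
import random

def partition(points, tau_max, tau_min,
              box=(-90.0, 90.0, -180.0, 180.0), cell_id="", max_depth=20):
    """Split a lat/lon box into 4 children while it holds more than tau_max points.

    Returns {cell_id: [points]} for leaf cells that keep at least tau_min points.
    max_depth guards against endless splitting on duplicate coordinates.
    """
    if len(points) > tau_max and len(cell_id) < max_depth:
        lat0, lat1, lon0, lon1 = box
        lat_mid, lon_mid = (lat0 + lat1) / 2, (lon0 + lon1) / 2
        children = {}
        for lat, lon in points:
            q = (2 if lat >= lat_mid else 0) + (1 if lon >= lon_mid else 0)
            children.setdefault(q, []).append((lat, lon))
        cells = {}
        for q, pts in children.items():
            sub = (lat_mid if q & 2 else lat0, lat1 if q & 2 else lat_mid,
                   lon_mid if q & 1 else lon0, lon1 if q & 1 else lon_mid)
            cells.update(partition(pts, tau_max, tau_min, sub, cell_id + str(q), max_depth))
        return cells
    # Leaf cell: keep it only if it has enough samples.
    return {cell_id: points} if len(points) >= tau_min else {}

# Toy data and toy thresholds (coarse to fine); one partition per hierarchy level.
rng = random.Random(0)
points = [(rng.uniform(-90, 90), rng.uniform(-180, 180)) for _ in range(2000)]
levels = [partition(points, tau_max, tau_min=1) for tau_max in (500, 100, 20)]
```

Lowering `tau_max` refines the same partition, so each finer level nests inside the coarser one. In the method described above, every cell at every level would then receive its own learnable embedding vector.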
Semantic Fusion Module
Function: Inject semantic segmentation information into RGB appearance features to generate robust visual representations.
Mechanism: RGB patch tokens and the CLS token are extracted using a CLIP ViT-Large backbone (with all but the last few layers frozen). OneFormer then generates an ADE20K semantic segmentation map, which is projected into semantic tokens via a linear layer. Semantic tokens serve as queries, and RGB tokens serve as keys and values in latent multi-headed cross-attention (latent attention reduces memory overhead), followed by an MLP with residual connections and LayerNorm. Three fusion blocks are stacked sequentially to learn hierarchical fused features. The fused CLS token is extracted and projected via LayerNorm and a linear layer to produce the final visual feature vector.
Design Motivation: Pure RGB features are sensitive to illumination, weather, and viewpoint variations. Scene structure from semantic segmentation (buildings, vegetation, roads, etc.) is more invariant. Furthermore, semantic information can implicitly suppress localization-irrelevant regions (e.g., people, vehicles). Latent cross-attention — rather than simple concatenation — enables semantic information to selectively guide RGB feature aggregation.
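
The query/key/value roles in the fusion block can be illustrated with a stripped-down cross-attention. This sketch makes several simplifying assumptions: it is single-head, has no learned projections, and omits the latent bottleneck, the MLP, and LayerNorm that the summary mentions. It only shows semantic tokens acting as queries over RGB keys and values, with a residual connection; the function name and toy tokens are hypothetical.

```python
import math

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention with a residual connection.

    queries: semantic tokens, shape (n_q, d) as nested lists.
    keys, values: RGB tokens, shape (n_kv, d) as nested lists.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Weighted sum of RGB values, selected by the semantic query.
        attended = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]
        out.append([qi + ai for qi, ai in zip(q, attended)])
    return out

# Toy tokens: 2 semantic queries attending over 3 RGB tokens of width 2.
semantic = [[1.0, 0.0], [0.0, 1.0]]
rgb = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused = semantic
for _ in range(3):  # three stacked fusion blocks, as in the description above
    fused = cross_attention(fused, rgb, rgb)
```

Because the semantic tokens only determine the attention weights, the aggregated content still comes from the RGB tokens, which matches the "guiding" role described for the segmentation signal.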
Contrastive Training and Hierarchical Inference
Function: Align visual and geographic representations and enable multi-scale inference.
Mechanism: During training, the fused CLS token \(\mathbf{v}\) and the geographic embedding \(\mathbf{g}\) corresponding to the ground-truth location are extracted for each training sample. The InfoNCE loss \(\mathcal{L}_i = -\log \frac{\exp(\mathbf{v}_i^\top \mathbf{g}_i / \tau)}{\sum_j \exp(\mathbf{v}_i^\top \mathbf{g}_j / \tau)}\), with temperature \(\tau\), raises the similarity of each correct pair relative to the mismatched pairs in the denominator; the dot product \(\mathbf{v}_i^\top \mathbf{g}_j\) equals cosine similarity when both vectors are L2-normalized. The full training objective is the sum of losses across all hierarchy levels. At inference, softmax probabilities are computed between the query image and all embeddings at each level; for each cell at the finest level, the probabilities of all ancestor cells are multiplied to yield the final prediction.
Design Motivation: Hierarchical inference combines the high confidence of coarse-grained levels with the high resolution of fine-grained levels, analogous to a progressive geographic search.
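
The InfoNCE objective for a single hierarchy level can be written out directly from the formula above. This is a sketch under assumptions: negatives are taken to be the cell embeddings of the other samples in the batch (the summary does not say which negatives are used), vectors are L2-normalized so the dot product is a cosine similarity, and the function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(visual, geo, temperature=0.07):
    """Mean InfoNCE loss over a batch in which visual[i] is paired with geo[i]."""
    V = [l2_normalize(v) for v in visual]
    G = [l2_normalize(g) for g in geo]
    total = 0.0
    for i, v in enumerate(V):
        logits = [sum(a * b for a, b in zip(v, g)) / temperature for g in G]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # -log(exp(logit_i) / sum_j exp(logit_j)) = log-sum-exp minus the positive logit
        total += log_denominator - logits[i]
    return total / len(V)

# Toy batch: aligned pairs give a loss near zero, swapped pairs a large loss.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With several levels, the full objective would be the sum of one such term per level, each using that level's cell embeddings and its own temperature.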

Loss & Training

The model is trained with the AdamW optimizer (initial learning rate 0.0001, weight decay 0.0001) and an effective batch size of 1024. The learning rate is multiplied by gamma=0.5 after each epoch, and early stopping is triggered after 4 epochs without improvement. The temperature parameter is initialized to 0.07 and kept separate for each hierarchy level. Training runs for 21 hours on 8 A6000 GPUs. At inference, predictions are averaged over ten crops of the input image (TenCrop augmentation).
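
The schedule and stopping rule can be made concrete with a small, framework-free sketch. It assumes the decay multiplies the learning rate by 0.5 once per epoch and that early stopping monitors a validation score where higher is better; the monitored quantity is not stated in the summary, and both function names are hypothetical.

```python
def lr_at_epoch(epoch, base_lr=1e-4, gamma=0.5):
    """Learning rate after `epoch` exponential decay steps."""
    return base_lr * gamma ** epoch

def stop_epoch(val_scores, patience=4):
    """Return the epoch index at which early stopping triggers, or None.

    val_scores: one validation score per epoch (higher is better, an assumption).
    """
    best, since_best = float("-inf"), 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, since_best = score, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return None

schedule = [lr_at_epoch(e) for e in range(5)]
stopped = stop_epoch([1.0, 2.0, 3.0, 2.9, 2.8, 2.7, 2.6, 5.0])
```

Halving every epoch shrinks the step size quickly (by a factor of 16 after four epochs), so most of the learning happens in the first few epochs under this schedule.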

Key Experimental Results

Main Results

Accuracy (%) within the given distance threshold; higher is better.

Dataset	Threshold	GeoSURGE	GeoCLIP	PIGEOTTO	G3/GPT-4V
IM2GPS	Street-level (1 km)	27.0	16.5	11.8	-
IM2GPS	Continent-level (2,500 km)	93.2	88.6	91.1	-
IM2GPS3k	Street-level (1 km)	17.2	14.1	10.9	16.6
IM2GPS3k	Continent-level (2,500 km)	87.6	83.8	84.4	84.7
YFCC26k	Street-level (1 km)	17.8	11.6	10.1	-
GWS15k	Continent-level (2,500 km)	80.8	74.1	84.7†	-

GeoSURGE achieves state-of-the-art on 22 out of 25 metrics across 5 datasets. Excluding LVLM-based methods, it achieves the best performance on all 25 metrics.
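
The reported metrics are accuracies at distance thresholds: the share of test images whose predicted location falls within a given great-circle distance of the ground truth. A minimal sketch of that computation follows, assuming the haversine formula on a spherical Earth and the conventional threshold set of 1, 25, 200, 750, and 2,500 km; the function names and sample coordinates are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def accuracy_at(preds, truths, threshold_km):
    """Percentage of predictions within threshold_km of their ground truth."""
    hits = sum(haversine_km(*p, *t) <= threshold_km for p, t in zip(preds, truths))
    return 100.0 * hits / len(truths)

# Toy predictions: one about 0.56 km off, one about 111 km off.
preds = [(0.0, 0.0), (0.0, 0.0)]
truths = [(0.0, 0.005), (0.0, 1.0)]
report = {km: accuracy_at(preds, truths, km) for km in (1, 25, 200, 750, 2500)}
```

Five thresholds on each of five datasets would account for the 25 metrics counted above.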

Ablation Study

Configuration	YFCC26k 1km	YFCC26k 25km	GWS15k 1km	GWS15k 25km
Full (7 levels)	17.8	31.5	1.0	4.6
5 levels	11.1	30.0	0.4	3.5
1 level	8.9	27.5	0.1	3.1
3 fusion blocks	17.8	31.5	1.0	4.6
No fusion	13.8	30.4	0.6	4.6
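
The relative gains implied by the YFCC26k 1km column can be checked with a short calculation. The helper name `relative_gain` is hypothetical; the inputs are the table values above.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline

hierarchy_gain = relative_gain(8.9, 17.8)   # 1 level -> 7 levels
fusion_gain = relative_gain(13.8, 17.8)     # no fusion -> 3 fusion blocks
```

Going from 1 to 7 levels doubles the accuracy (a gain of about 100%), while adding the fusion blocks gives a gain of about 29% (4.0 points on a baseline of 13.8).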

Key Findings

Hierarchy depth has the greatest impact on fine-grained metrics (street/city level): 7 levels vs. 1 level yields approximately a 2× improvement on YFCC26k at 1km (17.8 vs. 8.9).
The semantic fusion module provides approximately 29% relative improvement on YFCC26k at 1km (13.8→17.8, i.e., 4.0/13.8).
Geographic embeddings (retrieval target) consistently outperform classification targets under both hierarchical and flat settings, demonstrating that embeddings and hierarchy are complementary.
LVLM-based methods have an advantage at fine-grained localization (likely due to memorization of landmark features from large-scale pretraining), but GeoSURGE comprehensively outperforms them at medium-to-coarse granularity.

Highlights & Insights

The design that unifies the geographic cells of classification methods with the feature matching of retrieval methods is particularly elegant. Treating cells as learnable embeddings rather than class labels allows each region to "automatically accumulate" its visual characteristics, capturing the advantages of both paradigms.
Using semantic segmentation as the query in cross-attention — rather than as an independent second representation — is a meaningful design choice. Semantics do not directly participate in feature representation but serve as a structural signal guiding RGB feature aggregation. This "guiding rather than replacing" paradigm is transferable to other multimodal fusion tasks.
The most significant performance gains are observed on GWS15k (globally uniform distribution), indicating strong generalization of the proposed method.

Limitations & Future Work

The method depends on OneFormer for semantic segmentation preprocessing, increasing inference overhead for large-scale image collections.
The partitioning scheme is fixed to S2 Geometry; the impact of adaptive or data-driven partitioning on performance is unexplored.
The influence of different visual backbone networks (e.g., ViT-H, DINOv2) has not been investigated.
For geographic regions with sparse training data (e.g., small islands in the open ocean), the nearest reference image can be over 800 km from the ground truth, indicating that data coverage remains a fundamental bottleneck.

Comparison with Prior Work

vs. GeoCLIP: Both use a CLIP backbone and contrastive learning, but GeoCLIP directly embeds GPS coordinates using Random Fourier Features, whereas GeoSURGE employs learnable geographic cell embeddings, avoiding the information loss of low-dimensional GPS. GeoSURGE consistently outperforms GeoCLIP under the same backbone and data.
vs. Img2Loc/G3: These methods leverage GPT-4V's large-scale world knowledge for RAG-based localization, yielding advantages at street level (likely from landmark memorization), but GeoSURGE is consistently stronger at medium-to-coarse granularity.
vs. TransLocator: TransLocator performs repeated fusion between parallel semantic and RGB backbones, whereas GeoSURGE employs latent cross-attention, which is more efficient and yields better results.

Rating

Novelty: ⭐⭐⭐⭐ The combined design of hierarchical embeddings and semantic fusion is novel, though individual components are not entirely new.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Five benchmark datasets, detailed ablation studies, and qualitative analysis — very comprehensive.
Writing Quality: ⭐⭐⭐⭐ Clear logic and well-organized structure.
Value: ⭐⭐⭐⭐ Establishes a new state-of-the-art baseline in global visual geo-localization.