HierLoc: Hyperbolic Entity Embeddings for Hierarchical Visual Geolocation¶
Conference: ICLR 2026 arXiv: 2601.23064 Code: None Area: Retrieval Keywords: Visual Geolocation, Hyperbolic Embeddings, Hierarchical Entities, Contrastive Learning, Retrieval
TL;DR¶
HierLoc reformulates visual geolocation as an image-to-entity alignment problem in hyperbolic space, replacing 5M+ image embeddings with ~240K geographic entity embeddings. It achieves a 19.5% reduction in mean geodesic error and a 43% improvement in sub-region accuracy on OSV5M.
Background & Motivation¶
Visual geolocation—inferring the capture location from image content—is a global, cross-scale challenge. Existing approaches fall into three categories: retrieval-based (requiring indexing of millions of image embeddings), classification-based (grid-cell classification that ignores geographic continuity), and generative-based (diffusion models that struggle at fine-grained scales). The root cause lies in the inherent hierarchical structure of geography (country → region → sub-region → city): the number of entities grows exponentially from country to city level, yet Euclidean distance grows only linearly, causing deep-level entities to become crowded and lose discriminability. Hyperbolic space naturally provides exponential volume growth, perfectly matching this hierarchical branching structure. HierLoc's key starting point is reframing geolocation from "image-to-image retrieval" to "image-to-entity alignment."
Method¶
Overall Architecture¶
Geographic entities and images are embedded in the Lorentz hyperbolic space. Images are encoded by a frozen visual encoder (DINOv3) and projected onto the hyperbolic manifold; entities are represented by fusing image, text, and coordinate modalities. Cross-modal attention aligns images with four-level hierarchical entities, and a pretrained Geo-Weighted Hyperbolic InfoNCE (GWH-InfoNCE) loss supervises the alignment. At inference, beam search retrieves predictions by traversing the entity hierarchy.
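The Lorentz-model operations the pipeline relies on (projecting tangent-space features onto the hyperboloid and measuring geodesic distance) can be sketched as follows. This is a minimal NumPy illustration assuming curvature −1; the function names are ours, not the paper's.

```python
import numpy as np

def lorentz_inner(x, y):
    # Lorentzian inner product: -x0*y0 + <x_spatial, y_spatial>
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def exp_map_origin(v_spatial):
    # Map a Euclidean tangent vector at the origin O = (1, 0, ..., 0)
    # onto the hyperboloid of curvature -1.
    norm = np.linalg.norm(v_spatial, axis=-1, keepdims=True)
    norm = np.clip(norm, 1e-9, None)            # avoid division by zero
    x0 = np.cosh(norm)
    xs = np.sinh(norm) * v_spatial / norm
    return np.concatenate([x0, xs], axis=-1)

def lorentz_distance(x, y):
    # Geodesic distance between two points on the hyperboloid.
    inner = np.clip(-lorentz_inner(x, y), 1.0, None)  # guard arccosh domain
    return np.arccosh(inner)
```

Points produced by `exp_map_origin` satisfy the hyperboloid constraint \(\langle x, x \rangle_\mathcal{L} = -1\), and the distance from the origin to \(\exp_0(v)\) equals \(\lVert v \rVert\), which is a convenient sanity check for any implementation.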
Key Designs¶
- Hierarchical Entity Construction and Embedding:
- Function: Compresses training metadata into ~240K hierarchical entities (233 countries, 4,946 regions, 29,214 sub-regions, 209,894 cities).
- Mechanism: Each entity is associated with three modalities—mean image embedding \(\text{Img}_i\) (averaged DINOv3 features of training images), text embedding \(\text{Text}_i\) (CLIP-encoded entity name), and coordinate embedding \(\text{Coords}_i\) (SphereM+ encoded). Anchor embeddings \(A_i\) are randomly initialized in the tangent space at the origin and mapped to the hyperboloid; the final embedding is \(H_i = \exp_0(\log_0(A_i) + \alpha_{\text{node}} \Delta_i)\), where \(\Delta_i\) is the multimodal offset fused from the three modal embeddings and \(\alpha_{\text{node}}\) scales its contribution.
- Design Motivation: Although simple, mean embeddings yield stable and discriminative prototypes at the entity level.
- Cross-Modal Attention:
- Function: Performs multi-head attention in the tangent space with image features as queries and entity features as keys/values.
- Mechanism: Eight-head attention is applied independently at each hierarchical level; context vectors from all four levels are concatenated, fused via MLP, and added back to the original image features. Only the image stream is updated; entity embeddings remain fixed—preventing overfitting to training data.
- Design Motivation: This asymmetric update strategy ensures the generalizability of entity embeddings.
- GWH-InfoNCE Loss:
- Function: Incorporates geographic structure into hyperbolic contrastive learning.
- Mechanism: Negative samples are weighted by great-circle distances \(g_{\ell,k}\) computed via the haversine formula: \(w_{\ell,k} = 1 + \lambda \exp(-g_{\ell,k}/\sigma)\). The loss is \(\mathcal{L}_\ell = -\log \frac{\exp(-d_\ell^+/\tau)}{\exp(-d_\ell^+/\tau) + \sum_k w_{\ell,k} \exp(-d_{\ell,k}^-/\tau)}\). The total loss aggregates across levels: \(\mathcal{L} = \sum_{\ell} \beta_\ell \mathcal{L}_\ell\).
- Design Motivation: Geographically proximate negatives are harder to distinguish and should receive higher weights to enhance fine-grained discrimination.
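The geo-weighted loss above can be sketched in a few lines of NumPy. The haversine distance follows the standard formula; the hyperparameter values (\(\tau\), \(\lambda\), \(\sigma\)) are illustrative placeholders, not the paper's settings.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in km between coordinate pairs given in degrees.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def gwh_infonce(d_pos, d_neg, g_neg, tau=0.1, lam=1.0, sigma=500.0):
    """GWH-InfoNCE for one hierarchy level (illustrative sketch).

    d_pos: hyperbolic distance to the positive entity (scalar).
    d_neg: hyperbolic distances to K negative entities, shape (K,).
    g_neg: great-circle distances (km) to those negatives, shape (K,).
    tau/lam/sigma are assumed values, not the paper's hyperparameters.
    """
    w = 1.0 + lam * np.exp(-g_neg / sigma)      # upweight nearby negatives
    pos = np.exp(-d_pos / tau)
    neg = np.sum(w * np.exp(-d_neg / tau))
    return -np.log(pos / (pos + neg))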
Loss & Training¶
- AdamW is used for Euclidean parameters; RiemannianAdam is used for manifold parameters.
- Batch size 16, learning rate \(2\times10^{-4}\), trained for 5 epochs on 6× L40S GPUs (~60 hours).
- Inference uses beam search (beam width 10) to progressively refine predictions across the entity hierarchy.
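The inference-time traversal can be illustrated with a toy coarse-to-fine beam search. The data structures, scoring (cumulative distance across levels), and names below are our assumptions for the sketch, not the paper's implementation.

```python
import numpy as np

def beam_search(image_emb, hierarchy, dist_fn, beam_width=10):
    """Coarse-to-fine retrieval over an entity hierarchy (illustrative).

    hierarchy: list of levels (country -> ... -> city); each level is a
        dict mapping parent entity id -> {entity_id: embedding}.
    dist_fn: distance between the image embedding and an entity embedding
        (hyperbolic in HierLoc; any metric works for the sketch).
    """
    beam = [(0.0, None)]  # (cumulative distance, entity id); None = root
    for level in hierarchy:
        candidates = []
        for score, parent in beam:
            # Expand only children of entities surviving in the beam.
            for ent_id, emb in level.get(parent, {}).items():
                candidates.append((score + dist_fn(image_emb, emb), ent_id))
        candidates.sort(key=lambda t: t[0])
        beam = candidates[:beam_width]
    return beam  # best-first list of deepest-level entities
```

Because each step only expands the children of at most `beam_width` parents, the number of distance computations is bounded by the beam width times the branching factor per level, rather than by the ~240K total entities — the sub-linear traversal the paper exploits.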
Key Experimental Results¶
Main Results (OSV5M Benchmark)¶
| Method | GeoScore↑ | Distance (km)↓ | Country% | Region% | Sub-region% | City% |
|---|---|---|---|---|---|---|
| SC Retrieval | 3597 | 1386 | 73.4 | 45.8 | 28.4 | 19.9 |
| LocDiff | - | - | 77.0 | 46.3 | - | 11.0 |
| HierLoc (DINOv3) | 3963 | 861 | 82.9 | 55.0 | 40.7 | 23.3 |
Ablation Study (Component Contributions)¶
| Configuration | GeoScore | Notes |
|---|---|---|
| Euclidean space | Baseline | Deep-level entity crowding |
| + Hyperbolic space | Improved | Exponential volume growth |
| + GWH-InfoNCE | Best | Geography-aware negative weighting |
| Laplace vs. Gaussian decay | Laplace superior | Choice of decay kernel matters |
Key Findings¶
- Relative accuracy gains: country +8.8%, region +20.1%, sub-region +43.2%, city +16.8%.
- Mean geodesic error reduced by 19.5% (1,386 km → 861 km vs. SC Retrieval).
- Search space substantially reduced by compressing ~9.6M image records to 240K entities.
- DINOv3 encoder outperforms ViT-L/14.
Highlights & Insights¶
- The "image-to-entity alignment" paradigm reduces retrieval complexity from \(O(N)\) to sub-linear via hierarchical traversal.
- The design intuition behind geographic distance-weighted negatives in GWH-InfoNCE is particularly elegant—geographically close samples are the truly hard negatives.
- The asymmetric cross-modal attention (updating only image features while keeping entity embeddings fixed) effectively prevents overfitting.
Limitations & Future Work¶
- Using mean image embeddings for city-level entities may discard visual diversity information.
- The beam search width is fixed at 10; an adaptive strategy may yield better results.
- The method requires a pre-constructed hierarchy, which may be limited for regions lacking administrative division data.
Related Work & Insights¶
- vs. PIGEON: Relies on large-scale classification with semantic fusion, but collapses to a single-level output, discarding hierarchical signals.
- vs. GeoCLIP: Directly uses coordinates as prediction targets without exploiting hierarchical structure.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First application of hyperbolic embeddings to global hierarchical visual geolocation.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation on OSV5M with validation on multiple external benchmarks.
- Writing Quality: ⭐⭐⭐⭐ Method description is detailed with clear mathematical derivations.
- Value: ⭐⭐⭐⭐⭐ Geometry-aware hierarchical embeddings offer broader inspiration for other hierarchical structure tasks.