HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition
- Conference: CVPR 2026
- arXiv: 2506.04764
- Code: https://suhan-woo.github.io/HypeVPR/ (Project Page)
- Area: Other
- Keywords: Visual Place Recognition, Hyperbolic Space, Panoramic Images, Hierarchical Embedding, Perspective-to-Equirectangular Matching
TL;DR
This paper proposes HypeVPR, a visual place recognition framework that builds hierarchical embeddings in hyperbolic space to address cross-field-of-view matching between perspective queries and equirectangular panoramic database images. By constructing multi-level descriptors, from local to global, within the Poincaré ball, HypeVPR offers a flexible balance among accuracy, efficiency, and storage, and retrieves roughly 3.5× to 4.7× faster than PanoVPR sliding-window baselines at comparable or better accuracy.
Background & Motivation
Visual Place Recognition (VPR) localizes a query by retrieving the most similar image from a database, and is a core capability for autonomous navigation and mobile robotics. Traditional P2P (perspective-to-perspective) methods require storing images from multiple orientations per location to cover all possible query viewpoints, resulting in substantial storage and retrieval overhead.
The Perspective-to-Equirectangular (P2E) framework offers a more practical alternative: the database represents each location with a single panoramic image covering all directions, while queries remain ordinary perspective images. However, P2E faces a fundamental challenge: panoramic images contain 360° information while queries cover only a limited field of view — how can one extract a descriptor from a panorama that simultaneously encodes global context and precisely matches a local viewpoint?
Existing methods (PanoVPR, Orhan et al.) either crop each panorama with a sliding window and compare the query against every crop, which costs \(O(n)\) comparisons per panorama for \(n\) windows and is slow at retrieval time, or fail to capture the structured relationships among the views within a panorama.
The paper's core insight is that visual scenes possess a natural hierarchical structure: a panorama contains multiple wide-FoV sub-views, each of which contains narrower local viewpoints. Hyperbolic space is well suited to modeling such hierarchies because its volume grows exponentially with radius, which allows tree structures to be embedded with low distortion. This makes it more appropriate than Euclidean space for organizing the multi-level descriptors of a panoramic image.
Method
Overall Architecture
HypeVPR comprises two network branches: (1) a query network \(\mathcal{F}_q\) that generates a single hyperbolic descriptor \(\mathbf{h}_q\) from a perspective image; and (2) a database network \(\mathcal{F}_d\) that decomposes a panoramic image into multi-level windows and generates a set of multi-level hyperbolic descriptors \(\mathbf{H}_d\) from local to global via a Hierarchical Aggregation Module (HAM). At retrieval time, top-level descriptors are used for fast coarse filtering, followed by re-ranking with lower-level descriptors.
Key Designs
Hierarchical Modeling of Panoramic Images:
- Function: Decompose a panoramic image into \(L\) levels of windows with decreasing horizontal field of view.
- Mechanism: The top level \(\ell=1\) is the full panorama; each subsequent level halves the horizontal FoV: \(I_d^{(\ell)} \in \mathbb{R}^{H \times \frac{W'}{2^{\ell-1}} \times C}\). The bottom-level windows match the resolution of the query image, enabling shared backbone weights. This constitutes a hierarchical tree from global to local.
- Design Motivation: Only a small portion of a panorama matches the query; the remainder is redundant. Hierarchical modeling allows the system to first determine the approximate direction globally, then refine to the matching region.
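
The window hierarchy described under Hierarchical Modeling of Panoramic Images can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_window_pyramid` is hypothetical, and the sketch assumes non-overlapping windows (so level \(\ell\) holds \(2^{\ell-1}\) windows) and a panorama width divisible by \(2^{L-1}\).

```python
import numpy as np

def build_window_pyramid(pano: np.ndarray, num_levels: int) -> list:
    """Split an equirectangular image (H, W, C) into `num_levels` levels of windows.

    Level 1 is the full panorama; level l holds 2**(l-1) windows, each of
    width W / 2**(l-1). Non-overlapping tiling is an assumption of this sketch.
    """
    _, width, _ = pano.shape
    if width % (2 ** (num_levels - 1)) != 0:
        raise ValueError("panorama width must be divisible by 2**(num_levels - 1)")
    pyramid = []
    for level in range(1, num_levels + 1):
        n_windows = 2 ** (level - 1)
        win_w = width // n_windows
        # Slice along the horizontal (longitude) axis only; height is untouched.
        pyramid.append([pano[:, k * win_w:(k + 1) * win_w] for k in range(n_windows)])
    return pyramid

pano = np.zeros((64, 512, 3), dtype=np.uint8)   # dummy panorama
pyramid = build_window_pyramid(pano, num_levels=4)
windows_per_level = [len(level) for level in pyramid]
bottom_shape = pyramid[-1][0].shape
```

With four levels, the bottom level holds eight windows whose width is one eighth of the panorama, which is the resolution a shared backbone would see for both queries and database windows.
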
Hierarchical Aggregation Module (HAM):
- Function: Progressively aggregate Euclidean features from bottom-level windows into hierarchical descriptors in hyperbolic space.
- Mechanism: Each window's Euclidean descriptor \(\mathbf{d}_d^{(\ell,j)}\) is obtained via GeM pooling followed by a linear layer, then mapped onto the Poincaré ball via the exponential map: \(\mathbf{h}_d^{(\ell,j)} = \exp_0^c(\mathbf{d}_d^{(\ell,j)})\). Adjacent windows' hyperbolic descriptors are aggregated via the Einstein midpoint: \(\mathcal{A}_{hyp}(h_1,...,h_n) = \frac{\sum_j \gamma_j h_j}{\sum_j \gamma_j}\), where \(\gamma_j = \frac{1}{\sqrt{1-c\|h_j\|^2}}\) is the Lorentz factor.
- Design Motivation: In hyperbolic space, points farther from the origin represent finer-grained concepts, while points closer to the origin represent more abstract global semantics. The Einstein midpoint respects the distance structure of hyperbolic geometry through norm-aware weighting, preventing the loss of hierarchical information during aggregation.
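
The two HAM operations, the exponential map at the origin and Einstein-midpoint aggregation, can be sketched in NumPy. The Einstein midpoint is natively defined in Klein coordinates, so this sketch converts Poincaré points to the Klein model, averages with Lorentz-factor weights, and converts back; whether HypeVPR performs exactly this round trip is an assumption. All function names are illustrative.

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-9):
    """Exponential map at the origin of the Poincaré ball with curvature -c."""
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_to_klein(p, c=1.0):
    return 2.0 * p / (1.0 + c * np.sum(p * p, axis=-1, keepdims=True))

def klein_to_poincare(k, c=1.0):
    return k / (1.0 + np.sqrt(1.0 - c * np.sum(k * k, axis=-1, keepdims=True)))

def einstein_midpoint(points, c=1.0):
    """Aggregate (n, dim) Poincaré-ball points into one (dim,) Poincaré point."""
    k = poincare_to_klein(points, c)
    # Lorentz factor: points near the boundary (fine-grained) get larger weights.
    gamma = 1.0 / np.sqrt(1.0 - c * np.sum(k * k, axis=-1, keepdims=True))
    mid_klein = np.sum(gamma * k, axis=0) / np.sum(gamma, axis=0)
    return klein_to_poincare(mid_klein, c)

rng = np.random.default_rng(0)
# Stand-ins for the Euclidean descriptors of two adjacent windows.
euclid = 0.5 * rng.normal(size=(2, 8))
children = expmap0(euclid)              # child descriptors on the ball
parent = einstein_midpoint(children)    # aggregated descriptor one level up
```

Applying `einstein_midpoint` repeatedly to adjacent pairs, from the bottom level upward, would yield one descriptor per window at every level of the hierarchy.
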
Adjustable Hierarchical Retrieval:
- Function: Flexibly adjust the accuracy-efficiency trade-off without retraining.
- Mechanism: Top-level descriptors \(\mathbf{h}_d^{(1,1)}\) are first used to retrieve the Top-\(K'\) candidates. Sub-descriptors from the selected levels \(\mathbb{L}\) then re-score each candidate by its best-matching window, \(d_\ell = \min_k d_c(\mathbf{h}_q, \mathbf{h}_d^{(\ell,k)})\), where \(d_c\) is the hyperbolic distance. The per-level distances are Z-score normalized to \(\hat{s}_\ell\) and combined as \(s = \sum_{\ell \in \{1\} \cup \mathbb{L}} w_\ell \hat{s}_\ell\).
- Design Motivation: Different deployment scenarios impose different requirements on speed and accuracy. The hierarchical structure naturally supports a cascade strategy of "coarse matching for speed, fine matching for precision," allowing users to control the trade-off by selecting which levels to activate.
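
The coarse-to-fine retrieval described under Adjustable Hierarchical Retrieval can be sketched as below. The data layout (`db_levels` as a dict of arrays), the weight values, and the convention that a lower combined score is better (scores are normalized distances) are assumptions made for illustration; Z-scores are computed over the candidate set, which is also an assumption.

```python
import numpy as np

def poincare_dist(x, y, c=1.0):
    """Geodesic distance on the Poincaré ball of curvature -c (broadcasts)."""
    sq = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - c * np.sum(x * x, axis=-1)) * (1.0 - c * np.sum(y * y, axis=-1))
    return np.arccosh(1.0 + 2.0 * c * sq / denom) / np.sqrt(c)

def zscore(v, eps=1e-9):
    return (v - v.mean()) / (v.std() + eps)

def hierarchical_retrieve(h_q, db_levels, active_levels, weights, top_k, c=1.0):
    """db_levels[l] has shape (N, n_windows_l, dim); level 1 has one window.

    Returns the indices of the top_k candidates, best match first.
    """
    coarse = poincare_dist(h_q, db_levels[1][:, 0, :], c)   # (N,)
    cand = np.argsort(coarse)[:top_k]                       # Top-K' filtering
    score = weights[1] * zscore(coarse[cand])
    for level in active_levels:
        # Best-matching window per candidate: min over windows k.
        d_level = poincare_dist(h_q, db_levels[level][cand], c).min(axis=1)
        score = score + weights[level] * zscore(d_level)
    return cand[np.argsort(score)]

rng = np.random.default_rng(0)
h_q = 0.1 * rng.normal(size=4)
db_levels = {1: 0.1 * rng.normal(size=(5, 1, 4)),
             2: 0.1 * rng.normal(size=(5, 2, 4))}
db_levels[1][3, 0] = h_q   # plant a perfect match at database index 3
db_levels[2][3, 1] = h_q
ranking = hierarchical_retrieve(h_q, db_levels, active_levels=[2],
                                weights={1: 0.5, 2: 0.5}, top_k=3)
```

Passing an empty `active_levels` list reduces this to top-level retrieval only, which corresponds to the fastest setting; adding levels trades query time for re-ranking precision without retraining.
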
Loss & Training
Training is end-to-end with two triplet-based losses: (1) a hierarchical triplet loss \(\mathcal{L}_{hier}\), computed with hyperbolic distance, in which descriptors with overlapping FoV at adjacent levels form positive pairs and non-overlapping descriptors at the same level form negative pairs; and (2) a standard triplet loss for query-database matching.
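
A triplet loss over hyperbolic distance can be sketched as follows. The hinge form \(\max(0, d(a,p) - d(a,n) + m)\) is the usual triplet formulation; the margin value and the function names are assumptions, and how anchors, positives, and negatives are sampled from the hierarchy follows the description above rather than any confirmed implementation detail.

```python
import numpy as np

def poincare_dist(x, y, c=1.0):
    """Geodesic distance on the Poincaré ball of curvature -c (broadcasts)."""
    sq = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - c * np.sum(x * x, axis=-1)) * (1.0 - c * np.sum(y * y, axis=-1))
    return np.arccosh(1.0 + 2.0 * c * sq / denom) / np.sqrt(c)

def hyperbolic_triplet_loss(anchor, positive, negative, margin=0.1, c=1.0):
    """Mean hinge loss max(0, d(a, p) - d(a, n) + margin) over (B, dim) batches."""
    d_pos = poincare_dist(anchor, positive, c)
    d_neg = poincare_dist(anchor, negative, c)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

# Toy 2-D points: `near` plays an overlapping-FoV descriptor, `far` a non-overlapping one.
anchor = np.array([[0.10, 0.00]])
near = np.array([[0.12, 0.01]])
far = np.array([[-0.60, 0.50]])
loss_satisfied = hyperbolic_triplet_loss(anchor, near, far)   # ordering already correct
loss_violated = hyperbolic_triplet_loss(anchor, far, near)    # ordering reversed
```

The loss is zero once the positive is closer than the negative by at least the margin, and grows with the size of the violation otherwise.
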
Key Experimental Results
Main Results
| Method | Backbone | Pitts250K R@1 (%) | Pitts250K R@5 (%) | YQ360 R@1 (%) | YQ360 R@5 (%) | Time per query (ms) |
|---|---|---|---|---|---|---|
| PanoVPR×16 | ConvNeXt-S | 40.3 | 63.0 | 46.0 | 83.2 | 48.6 |
| HypeVPR-L | ConvNeXt-S | 43.4 | 64.3 | 52.4 | 85.2 | 14.0 |
| Orhan et al.* | ResNet-101 | 47.0 | 66.4 | 47.6 | 79.2 | 1555.2 |
| HypeVPR-O* | ResNet-50 | 66.5 | 82.1 | 53.6 | 81.2 | 4.0 |
Ablation Study
| Configuration | Key Metric | Note |
|---|---|---|
| HypeVPR-B vs PanoVPR×8 (Swin-T) | R@1: 29.4 vs 22.0 (Pitts) | Single descriptor outperforms 8-window sliding baseline |
| HypeVPR-L vs PanoVPR×16 (Swin-T) | R@1: 32.5 vs 33.6 (Pitts) | Comparable accuracy (1.1 points lower), 3.5× faster |
| HypeVPR-B speed vs PanoVPR×8 | 3.6ms vs 17.0ms | 4.7× speedup |
| HypeVPR-L speed vs PanoVPR×16 | 14.0ms vs 48.6ms | 3.5× speedup |
| HypeVPR-O vs Orhan et al. | Time: 4.0ms vs 1555.2ms | 388× speedup, R@1 +19.5 |
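
The speedup figures in the table follow directly from the reported query times, as the short check below shows. The numbers are copied from the tables above; the dictionary layout is purely illustrative.

```python
# Query times in ms as (HypeVPR variant, baseline), copied from the tables above.
timings_ms = {
    "HypeVPR-B vs PanoVPR×8": (3.6, 17.0),
    "HypeVPR-L vs PanoVPR×16": (14.0, 48.6),
    "HypeVPR-O vs Orhan et al.": (4.0, 1555.2),
}

# Speedup = baseline time / HypeVPR time.
speedups = {name: baseline / ours for name, (ours, baseline) in timings_ms.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
# prints 4.7x, 3.5x, and 388.8x respectively

r1_gain = 66.5 - 47.0   # Pitts250K R@1, HypeVPR-O* minus Orhan et al.*
```
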
Key Findings
- With additional large-scale training data (HypeVPR-O*), Pitts250K R@1 reaches 66.5%, 19.5 points above Orhan et al.'s 47.0%, at roughly 388× lower query time (4.0 ms vs. 1555.2 ms).
- Even with fewer descriptors per panorama, HypeVPR outperforms PanoVPR: HypeVPR-B reaches R@1 29.4 versus 22.0 for PanoVPR×8 on Pitts250K (Swin-T).
- Hyperbolic space embedding preserves local-to-global hierarchical relationships within panoramas better than Euclidean space.
- The accuracy-efficiency trade-off can be flexibly controlled at inference time by selecting active levels, without retraining.
Highlights & Insights
- Introducing hyperbolic space into VPR is a highly natural and compelling innovation: the panorama → sub-view → local region nesting relationship is inherently tree-structured.
- Einstein midpoint aggregation ensures that hierarchical aggregation respects hyperbolic geometry, avoiding geometric distortion introduced by naive averaging.
- Adjustable hierarchical retrieval is a practically valuable feature: on edge devices, modest accuracy sacrifices can yield substantial speedups.
Limitations & Future Work
- The number of bottom-level windows grows exponentially with the number of levels (\(2^{L-1}\)), resulting in non-trivial memory and computational overhead for large \(L\).
- Experiments are conducted on only two datasets (Pitts250K-P2E and YQ360), leaving generalizability insufficiently validated.
- The per-level weights \(w_\ell\) in hierarchical retrieval appear to be set manually; learned or adaptive weighting could be explored.
- The shared backbone applies identical feature extraction to both perspective queries and panoramic windows, which may not be globally optimal.
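
As a rough illustration of the first limitation, assume non-overlapping windows, so that level \(\ell\) holds \(2^{\ell-1}\) windows. This is consistent with the \(2^{L-1}\) bottom-level count but is otherwise an assumption. The number of descriptors stored per panorama is then

\[
\sum_{\ell=1}^{L} 2^{\ell-1} = 2^{L} - 1,
\]

so \(L = 4\) gives \(1 + 2 + 4 + 8 = 15\) descriptors per panorama, and each additional level roughly doubles the storage.
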
Related Work & Insights
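- The key distinction from PanoVPR is that PanoVPR performs brute-force sliding-window search at the P2P level, comparing the query against all \(n\) windows of every panorama, whereas HypeVPR compares the query against a single top-level descriptor per panorama and applies finer-level descriptors only to the Top-\(K'\) candidates.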
- Hyperbolic embedding has been applied in NLP (Poincaré embeddings) and image retrieval; this paper represents its first application to P2E VPR.
- The hierarchical triplet loss design is generalizable to other visual matching problems with natural hierarchical structure.
Rating
- Novelty: ⭐⭐⭐⭐⭐ The combination of hyperbolic space, hierarchical panorama modeling, and adjustable retrieval is highly novel and well-motivated.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comparisons across multiple frameworks are thorough, though the number of datasets is limited.
- Writing Quality: ⭐⭐⭐⭐ Theoretical foundations are solid, figures are clear, and motivation is articulated convincingly.
- Value: ⭐⭐⭐⭐ A practical solution for P2E VPR; the speed advantage is of considerable value for real-world deployment.