HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition¶

Conference: CVPR 2026
arXiv: 2506.04764
Code: https://suhan-woo.github.io/HypeVPR/ (Project Page)
Area: Other
Keywords: Visual Place Recognition, Hyperbolic Space, Panoramic Images, Hierarchical Embedding, Perspective-to-Equirectangular Matching

TL;DR¶

This paper proposes HypeVPR, a visual place recognition framework based on hierarchical embedding in hyperbolic space, designed for cross-field-of-view matching between perspective (query) and equirectangular panoramic (database) images. By constructing multi-level descriptors from local to global within the Poincaré ball, HypeVPR offers a flexible balance among accuracy, efficiency, and storage, with retrieval roughly 3.5× to 4.7× faster than sliding-window baselines at comparable accuracy.

Background & Motivation¶

Visual Place Recognition (VPR) localizes a query by retrieving the most similar image from a database, and is a core capability for autonomous navigation and mobile robotics. Traditional P2P (perspective-to-perspective) methods require storing images from multiple orientations per location to cover all possible query viewpoints, resulting in substantial storage and retrieval overhead.

The Perspective-to-Equirectangular (P2E) framework offers a more practical alternative: the database represents each location with a single panoramic image covering all directions, while queries remain ordinary perspective images. However, P2E faces a fundamental challenge: panoramic images contain 360° information while queries cover only a limited field of view — how can one extract a descriptor from a panorama that simultaneously encodes global context and precisely matches a local viewpoint?

Existing methods (PanoVPR, Orhan et al.) either crop the panorama with a sliding window and compare the query against every crop (\(O(n)\) comparisons for \(n\) windows, which is slow), or fail to capture the structured internal relationships within panoramas.

The paper's core insight is that visual scenes possess a natural hierarchical structure: a panorama contains multiple wide-FoV sub-views, each of which contains narrower local viewpoints. Hyperbolic space is well suited to modeling hierarchical relationships (its volume grows exponentially with radius, which allows low-distortion embedding of tree structures), making it more appropriate than Euclidean space for organizing multi-level descriptors of panoramic images.

Method¶

Overall Architecture¶

HypeVPR comprises two network branches: (1) a query network \(\mathcal{F}_q\) that generates a single hyperbolic descriptor \(\mathbf{h}_q\) from a perspective image; and (2) a database network \(\mathcal{F}_d\) that decomposes a panoramic image into multi-level windows and generates a set of multi-level hyperbolic descriptors \(\mathbf{H}_d\) from local to global via a Hierarchical Aggregation Module (HAM). At retrieval time, top-level descriptors are used for fast coarse filtering, followed by re-ranking with lower-level descriptors.

Key Designs¶

Hierarchical Modeling of Panoramic Images:
- Function: Decompose a panoramic image into \(L\) levels of windows with decreasing horizontal field of view.
- Mechanism: The top level \(\ell=1\) is the full panorama; each subsequent level halves the horizontal FoV: \(I_d^{(\ell)} \in \mathbb{R}^{H \times \frac{W'}{2^{\ell-1}} \times C}\). The bottom-level windows match the resolution of the query image, enabling shared backbone weights. This constitutes a hierarchical tree from global to local.
- Design Motivation: Only a small portion of a panorama matches the query; the remainder is redundant. Hierarchical modeling allows the system to first determine the approximate direction globally, then refine to the matching region.
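
The level structure described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes non-overlapping windows at each level (the stride is not stated above) and a panorama width divisible by \(2^{L-1}\); `window_layout` is a hypothetical helper name.

```python
def window_layout(pano_width: int, num_levels: int) -> dict[int, list[tuple[int, int]]]:
    """Return {level: [(start_col, end_col), ...]} for a panorama of the given width.

    Level 1 is the full panorama; level l uses windows of width W / 2**(l - 1).
    Assumption: windows at each level do not overlap (stride == window width).
    """
    layout = {}
    for level in range(1, num_levels + 1):
        n_windows = 2 ** (level - 1)
        if pano_width % n_windows != 0:
            raise ValueError("pano_width must be divisible by 2**(num_levels - 1)")
        width = pano_width // n_windows
        layout[level] = [(j * width, (j + 1) * width) for j in range(n_windows)]
    return layout


layout = window_layout(pano_width=2048, num_levels=4)
for level, windows in layout.items():
    # Print the level, the number of windows, and the first window's column range.
    print(level, len(windows), windows[0])
```

Each window at level \(\ell\) covers exactly two windows at level \(\ell+1\) under this assumption, which gives the tree that the aggregation step walks from the bottom up.
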
Hierarchical Aggregation Module (HAM):
- Function: Progressively aggregate Euclidean features from bottom-level windows into hierarchical descriptors in hyperbolic space.
- Mechanism: Each window's Euclidean descriptor \(\mathbf{d}_d^{(\ell,j)}\) is obtained via GeM pooling followed by a linear layer, then mapped onto the Poincaré ball via the exponential map at the origin: \(\mathbf{h}_d^{(\ell,j)} = \exp_0^c(\mathbf{d}_d^{(\ell,j)})\). Adjacent windows' hyperbolic descriptors are aggregated via the Einstein midpoint: \(\mathcal{A}_{hyp}(\mathbf{h}_1,\dots,\mathbf{h}_n) = \frac{\sum_j \gamma_j \mathbf{h}_j}{\sum_j \gamma_j}\), where \(\gamma_j = \frac{1}{\sqrt{1-c\|\mathbf{h}_j\|^2}}\) is the Lorentz factor and \(c\) is the curvature parameter of the ball.
- Design Motivation: In hyperbolic space, points farther from the origin represent finer-grained concepts, while points closer to the origin represent more abstract global semantics. The Einstein midpoint respects the distance structure of hyperbolic geometry through norm-aware weighting, preventing the loss of hierarchical information during aggregation.
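
A minimal pure-Python sketch of the two operations named above: the exponential map at the origin, \(\exp_0^c(\mathbf{v}) = \tanh(\sqrt{c}\,\|\mathbf{v}\|)\,\mathbf{v} / (\sqrt{c}\,\|\mathbf{v}\|)\), and the Einstein midpoint applied literally as written in the Mechanism bullet. Two caveats: the Einstein midpoint is classically stated in Klein coordinates, and implementations often convert Poincaré points to Klein and back, which this sketch does not do; the `eps` clipping is a common numerical safeguard assumed here, not something described above. Function names are hypothetical.

```python
import math


def exp_map_origin(v, c=1.0, eps=1e-5):
    """Map a Euclidean (tangent) vector onto the Poincare ball of curvature -c."""
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0] * len(v)
    # Euclidean radius of the mapped point, clipped to stay strictly inside the ball.
    radius = min(math.tanh(sqrt_c * norm) / sqrt_c, (1.0 - eps) / sqrt_c)
    return [radius * x / norm for x in v]


def lorentz_factor(h, c=1.0):
    """gamma = 1 / sqrt(1 - c * ||h||^2); grows as h approaches the boundary."""
    return 1.0 / math.sqrt(1.0 - c * sum(x * x for x in h))


def einstein_midpoint(points, c=1.0):
    """Lorentz-factor-weighted average of points, as in the formula above."""
    gammas = [lorentz_factor(h, c) for h in points]
    total = sum(gammas)
    dim = len(points[0])
    return [sum(g * h[i] for g, h in zip(gammas, points)) / total for i in range(dim)]


def euclid_norm(h):
    return math.sqrt(sum(x * x for x in h))


# Toy example: two child descriptors aggregated into one parent descriptor.
children = [exp_map_origin([0.8, 0.2]), exp_map_origin([0.1, 1.5])]
parent = einstein_midpoint(children)
print([round(euclid_norm(h), 3) for h in children], round(euclid_norm(parent), 3))
```

Because the result is a convex combination of the children, the parent never lies farther from the origin than its farthest child, which fits the reading that coarser levels sit closer to the origin.
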
Adjustable Hierarchical Retrieval:
- Function: Flexibly adjust the accuracy-efficiency trade-off without retraining.
- Mechanism: Top-level descriptors \(\mathbf{h}_d^{(1,1)}\) are first used to retrieve the Top-\(K'\) candidates; sub-descriptors from the selected levels \(\mathbb{L}\) then re-score those candidates, taking the closest window at each level: \(d_\ell = \min_k d_c(\mathbf{h}_q, \mathbf{h}_d^{(\ell,k)})\), where \(d_c\) is the hyperbolic distance. The per-level scores are Z-score normalized to \(\hat{s}_\ell\) and combined as \(s = \sum_{\ell \in \{1\} \cup \mathbb{L}} w_\ell \hat{s}_\ell\).
- Design Motivation: Different deployment scenarios impose different requirements on speed and accuracy. The hierarchical structure naturally supports a cascade strategy of "coarse matching for speed, fine matching for precision," allowing users to control the trade-off by selecting which levels to activate.
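
The coarse-to-fine procedure can be sketched as below. This is an illustration under stated assumptions rather than the paper's implementation: the standard Poincaré-ball distance is used for \(d_c\), the normalized scores \(\hat{s}_\ell\) are taken to be Z-scored distances (so lower is better), Z-scores are computed over the Top-\(K'\) candidates, and the database layout (`{level: [descriptors]}` per place), the weights, and the toy vectors are all hypothetical.

```python
import math
from statistics import mean, pstdev


def poincare_distance(x, y, c=1.0):
    """Geodesic distance on the Poincare ball of curvature -c."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1.0 - c * sum(a * a for a in x)) * (1.0 - c * sum(b * b for b in y))
    return math.acosh(1.0 + 2.0 * c * sq_diff / denom) / math.sqrt(c)


def zscore(values):
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0.0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def hierarchical_retrieve(query, database, k_prime, levels, weights, c=1.0):
    """Rank database entries: coarse filter on level 1, then re-rank on `levels`."""
    # Stage 1: one comparison per place, using the top-level descriptor only.
    coarse = [poincare_distance(query, entry[1][0], c) for entry in database]
    candidates = sorted(range(len(database)), key=coarse.__getitem__)[:k_prime]

    # Stage 2: per level, keep the distance to the closest window of each candidate.
    per_level = {1: [coarse[i] for i in candidates]}
    for level in levels:
        per_level[level] = [
            min(poincare_distance(query, h, c) for h in database[i][level])
            for i in candidates
        ]

    # Combine Z-scored distances with per-level weights (lower is better).
    combined = [0.0] * len(candidates)
    for level, dists in per_level.items():
        for idx, z in enumerate(zscore(dists)):
            combined[idx] += weights[level] * z
    order = sorted(range(len(candidates)), key=combined.__getitem__)
    return [candidates[i] for i in order]


# Toy 2-D database: three places, two levels each.
database = [
    {1: [[0.0, 0.1]], 2: [[0.3, 0.3], [-0.3, 0.3]]},
    {1: [[0.1, 0.0]], 2: [[0.5, 0.0], [0.0, -0.5]]},
    {1: [[-0.6, -0.6]], 2: [[-0.7, -0.5], [-0.5, -0.7]]},
]
query = [0.5, 0.01]
ranked = hierarchical_retrieve(query, database, k_prime=2, levels=[2],
                               weights={1: 0.5, 2: 0.5})
print(ranked)
```

Passing `levels=[]` reduces the procedure to single-descriptor retrieval, which is how the sketch mirrors the idea of trading accuracy for speed without retraining.
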

Loss & Training¶

Training relies on triplet-based objectives: (1) a hierarchical triplet loss \(\mathcal{L}_{hier}\), computed with hyperbolic distance, in which descriptors with overlapping FoV at adjacent levels form positive pairs and non-overlapping descriptors at the same level form negative pairs; and (2) a standard triplet loss for query-database matching. The full model is trained end-to-end under this triplet framework.
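
One way to read the hierarchical triplet term is sketched below. The margin value, the choice of a child window as anchor (with its parent as positive and a non-overlapping sibling as negative), and the function names are assumptions made for illustration; only the use of hyperbolic distance and the positive/negative pairing rule come from the description above.

```python
import math


def poincare_distance(x, y, c=1.0):
    """Geodesic distance on the Poincare ball of curvature -c."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1.0 - c * sum(a * a for a in x)) * (1.0 - c * sum(b * b for b in y))
    return math.acosh(1.0 + 2.0 * c * sq_diff / denom) / math.sqrt(c)


def hyperbolic_triplet_loss(anchor, positive, negative, margin=0.1, c=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = poincare_distance(anchor, positive, c)
    d_neg = poincare_distance(anchor, negative, c)
    return max(0.0, d_pos - d_neg + margin)


# Toy descriptors (hypothetical values inside the unit ball).
child = [0.3, 0.3]      # anchor: a window at level l + 1
parent = [0.2, 0.2]     # positive: overlapping window at level l
sibling = [-0.3, 0.3]   # negative: non-overlapping window at level l + 1

loss_ok = hyperbolic_triplet_loss(child, parent, sibling)
loss_bad = hyperbolic_triplet_loss(child, sibling, parent)
print(loss_ok, round(loss_bad, 3))
```

The loss is zero once the overlapping pair is closer than the non-overlapping pair by at least the margin, and positive otherwise.
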

Key Experimental Results¶

Main Results¶

| Method | Backbone | Pitts250K R@1 | Pitts250K R@5 | YQ360 R@1 | YQ360 R@5 | Time/Query (ms) |
|---|---|---|---|---|---|---|
| PanoVPR×16 (ConvNeXt-S) | ConvNeXt-S | 40.3 | 63.0 | 46.0 | 83.2 | 48.6 |
| HypeVPR-L (ConvNeXt-S) | ConvNeXt-S | 43.4 | 64.3 | 52.4 | 85.2 | 14.0 |
| Orhan et al.* | ResNet-101 | 47.0 | 66.4 | 47.6 | 79.2 | 1555.2 |
| HypeVPR-O* | ResNet-50 | 66.5 | 82.1 | 53.6 | 81.2 | 4.0 |

Ablation Study¶

| Configuration | Key Metric | Note |
|---|---|---|
| HypeVPR-B vs PanoVPR×8 (Swin-T) | R@1: 29.4 vs 22.0 (Pitts) | Single descriptor outperforms 8-window sliding baseline |
| HypeVPR-L vs PanoVPR×16 (Swin-T) | R@1: 32.5 vs 33.6 (Pitts) | Comparable accuracy, 3.5× faster |
| HypeVPR-B speed vs PanoVPR×8 | 3.6 ms vs 17.0 ms | 4.7× speedup |
| HypeVPR-L speed vs PanoVPR×16 | 14.0 ms vs 48.6 ms | 3.5× speedup |
| HypeVPR-O vs Orhan et al. | Time: 4.0 ms vs 1555.2 ms | 388× speedup, R@1 +19.5 |

Key Findings¶

With additional large-scale training data (HypeVPR-O*), Pitts250K R@1 reaches 66.5%, far surpassing Orhan et al.'s 47.0%, at roughly 388× lower time per query (4.0 ms vs. 1555.2 ms).
HypeVPR outperforms PanoVPR even with fewer descriptors: the single-descriptor -B mode beats the 8-window sliding baseline (R@1 29.4 vs. 22.0 on Pitts250K with Swin-T).
Hyperbolic space embedding preserves local-to-global hierarchical relationships within panoramas better than Euclidean space.
The accuracy-efficiency trade-off can be flexibly controlled at inference time by selecting active levels, without retraining.

Highlights & Insights¶

Introducing hyperbolic space into VPR is a highly natural and compelling innovation: the panorama → sub-view → local region nesting relationship is inherently tree-structured.
Einstein midpoint aggregation ensures that hierarchical aggregation respects hyperbolic geometry, avoiding geometric distortion introduced by naive averaging.
Adjustable hierarchical retrieval is a practically valuable feature: on edge devices, modest accuracy sacrifices can yield substantial speedups.

Limitations & Future Work¶

The number of bottom-level windows grows exponentially with the number of levels (\(2^{L-1}\)), resulting in non-trivial memory and computational overhead for large \(L\).
Experiments are conducted on only two datasets (Pitts250K-P2E and YQ360), leaving generalizability insufficiently validated.
The per-level weights \(w_\ell\) in hierarchical retrieval appear to be set manually; learned or adaptive weighting could be explored.
The shared backbone applies identical feature extraction to both perspective queries and panoramic windows, which may not be optimal for both image types.
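
The storage cost behind the first limitation can be made concrete with a short count. Assuming each level tiles the panorama with non-overlapping windows (an assumption consistent with the \(2^{L-1}\) bottom-level count), level \(\ell\) holds \(2^{\ell-1}\) descriptors, so the number of descriptors stored per panorama is:

\[
N_{\text{desc}} = \sum_{\ell=1}^{L} 2^{\ell-1} = 2^{L} - 1 .
\]

For example, \(L = 4\) gives 15 descriptors per panorama, 8 of them at the bottom level.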

Related Work & Insights¶

The key distinction from PanoVPR is that PanoVPR performs brute-force sliding-window search at the P2P level, whereas HypeVPR reduces matching complexity from \(O(n)\) to \(O(\log n)\) via hierarchical embedding.
Hyperbolic embedding has been applied in NLP (Poincaré embeddings) and image retrieval; this paper represents its first application to P2E VPR.
The hierarchical triplet loss design is generalizable to other visual matching problems with natural hierarchical structure.

Rating¶

Novelty: ⭐⭐⭐⭐⭐ The combination of hyperbolic space, hierarchical panorama modeling, and adjustable retrieval is highly novel and well-motivated.
Experimental Thoroughness: ⭐⭐⭐⭐ Comparisons across multiple frameworks are thorough, though the number of datasets is limited.
Writing Quality: ⭐⭐⭐⭐ Theoretical foundations are solid, figures are clear, and motivation is articulated convincingly.
Value: ⭐⭐⭐⭐ A practical solution for P2E VPR; the speed advantage is of considerable value for real-world deployment.