Towards Test-time Efficient Visual Place Recognition via Asymmetric Query Processing¶
- Conference: AAAI 2026
- arXiv: 2512.13055
- Code: github.com/jaeyoon1603/AsymVPR
- Area: Model Compression
- Keywords: Visual Place Recognition, Asymmetric Retrieval, Geographical Memory Bank, Implicit Embedding Augmentation, Lightweight Deployment
TL;DR¶
This paper proposes AsymVPR, an efficient asymmetric framework for Visual Place Recognition (VPR). It replaces expensive k-NN precomputation with a Geographical Memory Bank and bridges the capacity gap between a lightweight query network and a high-capacity gallery network via Implicit Embedding Augmentation, achieving retrieval performance close to the full-size model with only ~8% of its FLOPs.
Background & Motivation¶
Visual Place Recognition (VPR) determines the geographic location of a query image via image retrieval, serving as a core component in navigation, AR, and SLAM. Recent foundation models such as DINOv2 have significantly advanced VPR performance, yet their substantial computational overhead precludes deployment on mobile and edge devices.
Asymmetric retrieval offers an elegant solution: the gallery side employs a high-capacity model to offline precompute features, while the query side uses a lightweight network for online processing. The key challenge lies in making the embedding spaces of these two heterogeneous networks compatible.
Limitations of existing asymmetric approaches:
- k-NN dependency: Methods such as CSD and D3still rely on k-nearest-neighbor information to transfer contextual knowledge; however, precomputing and storing k-NN information for large-scale gallery embeddings is extremely costly—particularly in VPR, which involves millions of images.
- Capacity gap: Lightweight query models struggle to fully capture the complex feature distributions and variability that high-capacity gallery models can encode.
- Underexplored in VPR: While asymmetric retrieval has been studied in other domains, it has received little attention in VPR—despite VPR possessing a unique advantage: naturally available geographic metadata.
Core Insight: VPR datasets inherently include geographic coordinates (GPS), which can be leveraged as a prior to directly construct structured feature representations, completely replacing k-NN computation.
Method¶
Overall Architecture¶
The system comprises two models:
- Gallery network \(f_g\): a high-capacity model (DINOv2-B + SALAD/BoQ) that extracts and stores gallery features \(\mathcal{F}_g\) offline.
- Query network \(f_q\): a lightweight model (MobileViTv2/EfficientViT-B2 with the same type of aggregator) trained so that its embeddings are compatible with \(\mathcal{F}_g\).
Training constraint: only the gallery dataset and precomputed features may be used; the gallery network and stored embeddings cannot be modified.
Key Designs¶
- Geographical Memory Bank
Leveraging the GPS coordinates inherent in VPR datasets, gallery features from the same geographic location are aggregated into location centroids \(\mathcal{M} = \{\mathbf{c}_j\}_{j=1}^{M} \subset \mathbb{R}^d\), where each centroid \(\mathbf{c}_j\) is the mean of all gallery features at that location.
An asymmetric contrastive loss is constructed over these centroids: \(\mathcal{L}_{\text{asym}} = -\log \frac{e^{\mathbf{q} \cdot \mathbf{g} / \tau}}{e^{\mathbf{q} \cdot \mathbf{g} / \tau} + \sum_{j \in \mathcal{N}(x)} e^{\mathbf{q} \cdot \mathbf{c}_j / \tau}}\)
The query embedding \(\mathbf{q}\) is aligned with the corresponding gallery embedding \(\mathbf{g}\) while being pushed away from centroids of different locations.
Efficiency advantage: Precomputation time is reduced from 1,392.76 minutes (CSD) to 0.26 minutes (a 5,356× speedup), as geographic centroids require only simple mean computation.
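The two ingredients above — mean-pooling gallery features per GPS location and the asymmetric contrastive loss — can be sketched in a few lines of NumPy. The function names, shapes, and the explicit `neg_centroids` argument (standing in for \(\mathcal{N}(x)\)) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def build_memory_bank(features, place_ids):
    """Average gallery features per geographic location to get centroids.

    features:  (N, d) L2-normalised gallery embeddings.
    place_ids: length-N array of location labels derived from GPS.
    (Names and shapes are illustrative, not the authors' code.)
    """
    places = np.unique(place_ids)
    centroids = np.stack([features[place_ids == p].mean(axis=0) for p in places])
    # Re-normalise so dot products stay cosine similarities.
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    return places, centroids

def asym_contrastive_loss(q, g, neg_centroids, tau=0.05):
    """L_asym: pull q toward its gallery embedding g, push it away
    from the centroids of other locations."""
    pos = np.exp(q @ g / tau)
    neg = np.exp(neg_centroids @ q / tau).sum()
    return -np.log(pos / (pos + neg))
```

Because the centroids are a simple per-location mean, building the bank is a single pass over the gallery, which is where the precomputation speedup over k-NN methods comes from.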
- Implicit Embedding Augmentation
Lightweight query models struggle to capture the rich feature variability encoded per location by high-capacity models. The proposed solution augments training with location-specific covariance information.
Explicit augmentation: Augmented embeddings are sampled from a multivariate normal distribution \(\tilde{\mathbf{g}} \sim \mathcal{N}(\mathbf{g}, \gamma \Sigma)\), where \(\Sigma\) is the feature covariance matrix at that location.
Implicit derivation: As the number of samples \(K \to \infty\), a closed-form upper bound is derived using Jensen's inequality and the moment-generating function of the multivariate Gaussian: \(\mathcal{L}_{\text{asym}^+} = -\log \frac{e^{\mathbf{q} \cdot \mathbf{g} / \tau}}{e^{\mathbf{q} \cdot \mathbf{g} / \tau} + \sum_{j \in \mathcal{N}(x)} e^{\mathbf{q} \cdot \mathbf{c}_j / \tau + (\gamma / 2\tau^2) \mathbf{q}^T \Sigma \mathbf{q}}}\)
Regularization effect: Via the eigendecomposition \(\mathbf{q}^T \Sigma \mathbf{q} = \sum_k \lambda_k (\mathbf{v}_k^T \mathbf{q})^2\), directions of high variance (typically corresponding to intra-class variations such as viewpoint and illumination changes) are penalized more strongly, guiding the query model to focus on stable, location-discriminative features.
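Under the diagonal-covariance approximation used in training, \(\mathbf{q}^T \Sigma \mathbf{q}\) collapses to a weighted sum of squared query coordinates, so the closed-form loss amounts to adding a variance penalty to each negative logit. A minimal sketch — the function and argument names are assumptions, and the log-sum-exp rewrite for numerical stability is mine, not stated in the paper:

```python
import numpy as np

def asym_plus_loss(q, g, neg_centroids, neg_var, tau=0.05, gamma=15.0):
    """L_asym+ under the diagonal-covariance approximation.

    neg_var: (J, d) diagonal of each negative location's covariance,
    so q^T Sigma_j q reduces to sum_k neg_var[j, k] * q[k]**2.
    (Function and argument names are assumptions, not the authors' code.)
    """
    pos_logit = q @ g / tau
    # Variance penalty (gamma / 2 tau^2) * q^T Sigma_j q, one per negative.
    var_term = (gamma / (2.0 * tau ** 2)) * (neg_var @ (q * q))
    neg_logits = neg_centroids @ q / tau + var_term
    # Log-sum-exp over [positive, negatives] for numerical stability.
    logits = np.concatenate(([pos_logit], neg_logits))
    m = logits.max()
    return -(pos_logit - m - np.log(np.exp(logits - m).sum()))
```

Increasing `neg_var` inflates the negative logits and hence the loss, which is exactly the regularization pressure described above: the query model is driven away from high-variance directions.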
- Symmetric Aggregator Architecture
The same type of aggregator (e.g., both SALAD or both BoQ) is used on both the query and gallery sides to ensure architectural consistency. Experiments show this is more effective than using an MLP with a larger parameter count (Table 5: a symmetric architecture with 22.3M parameters outperforms an MLP with 33.4M parameters).
Loss & Training¶
- Optimizer: AdamW, initial learning rate \(5 \times 10^{-4}\), cosine decay to \(1 \times 10^{-4}\)
- 15 epochs, batch size 64, images resized to \(322 \times 322\)
- Temperature \(\tau = 0.05\), augmentation scaling \(\gamma = 15\)
- Diagonal approximation used for the covariance matrix to reduce GPU memory overhead
- Training data: GSV-Cities dataset (560k images, 67k places)
- Hardware: single RTX 4090
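The stated learning-rate schedule (cosine decay from \(5 \times 10^{-4}\) to \(1 \times 10^{-4}\)) has a simple closed form. Per-epoch stepping over the 15 epochs is an assumption here; the paper only gives the endpoints.

```python
import math

def cosine_lr(epoch, total_epochs=15, lr_max=5e-4, lr_min=1e-4):
    """Cosine decay from lr_max at epoch 0 to lr_min at the last epoch.
    (Per-epoch stepping is an assumption; the paper states only the endpoints.)"""
    t = epoch / (total_epochs - 1)  # training progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

For example, `cosine_lr(0)` gives \(5 \times 10^{-4}\) and `cosine_lr(14)` gives \(1 \times 10^{-4}\), with a monotone decay in between.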
Key Experimental Results¶
Main Results¶
Gallery model: BoQ (DINOv2-B); Query model: EfficientViT-B2
| Method | Pitts250k R@1 | MSLS R@1 | Tokyo24/7 R@1 | Nordland R@1 | AmsterTime R@1 |
|---|---|---|---|---|---|
| BoQ (Symmetric, upper bound) | 96.6 | 93.8 | 96.5 | 81.3 | 63.0 |
| EfficientViT-BoQ (Symmetric) | 94.3 | 87.7 | 85.7 | 51.6 | 41.6 |
| CSD | 94.6 | 91.4 | 90.5 | 68.7 | 48.1 |
| D3still | 95.0 | 91.9 | 91.7 | 72.8 | 46.7 |
| Ours | 95.4 | 92.3 | 92.7 | 74.6 | 48.6 |
Query-side computational efficiency:
| Query Network | FLOPs (G) | % of Gallery | Params (M) | % of Gallery | Inference Speed |
|---|---|---|---|---|---|
| DINOv2-B (BoQ) | 49.1 | 100% | 95.2 | 100% | 1.0× |
| EfficientViT-B2 | 4.4 | 8.9% | 22.3 | 23.4% | 3.0× |
| MobileViTv2 | 4.2 | 8.5% | 12.1 | 12.7% | 3.6× |
Ablation Study¶
| Configuration | Pitts250k R@1 | Tokyo24/7 R@1 | Nordland R@1 | AmsterTime R@1 | Note |
|---|---|---|---|---|---|
| Full method | 95.4 | 92.7 | 74.6 | 48.6 | — |
| w/o Implicit Embedding Augmentation | 94.3 | 90.2 | 70.3 | 46.9 | Nordland drops 4.3 pts |
| Explicit Augmentation (K=10) | 94.7 | 91.4 | 73.3 | 47.7 | Implicit > Explicit |
| Queue-based Memory Bank | 94.2 | 90.2 | 72.0 | 47.1 | Geographic prior is effective |
Training efficiency comparison:
| Metric | CSD (k-NN-based) | Ours |
|---|---|---|
| Precomputation time | 1392.76 min | 0.26 min |
| Per-iteration training time | 1.34 sec | 0.19 sec |
| GPU memory | 23.9 GB | 17.5 GB |
Key Findings¶
- The Geographical Memory Bank achieves an over-5,000× precomputation speedup compared with k-NN-based methods while delivering superior retrieval performance.
- Implicit Embedding Augmentation improves Nordland R@1 by 4.3 percentage points (the most challenging benchmark for seasonal variation), demonstrating that covariance modeling effectively handles appearance changes.
- The implicit formulation outperforms explicit sampling, being both mathematically tighter and empirically more effective.
- The symmetric aggregator architecture is more effective than an MLP with larger parameter count (architectural consistency > parameter count).
- Grad-CAM visualizations show that the proposed method's attention patterns most closely resemble those of the high-capacity gallery model.
- The method achieves state-of-the-art results across all 5 benchmarks and all query-gallery combinations.
Highlights & Insights¶
- Elegant use of domain priors: VPR datasets inherently contain geographic metadata; incorporating this structural information into the training framework is a natural yet previously overlooked idea.
- Mathematical derivation of implicit augmentation: Starting from explicit sampling, Jensen's inequality is applied to derive a closed-form solution that provides theoretical guarantees while eliminating sampling overhead.
- Regularization effect of covariance modeling: High-variance directions receive stronger penalties, causing the query model to automatically learn to ignore viewpoint/illumination changes and focus on location-discriminative features.
- Dramatic improvement in training efficiency: Precomputation time is reduced from 23 hours to 16 seconds, making large-scale VPR training genuinely practical.
Limitations & Future Work¶
- Validation is limited to two lightweight backbones (MobileViTv2, EfficientViT-B2); more extreme compression scenarios (e.g., TinyNet) remain unexplored.
- The diagonal approximation of the covariance matrix may discard cross-dimensional correlation information.
- The Geographical Memory Bank relies on accurate GPS coordinates and requires alternative solutions in scenarios where GPS is unavailable or imprecise (e.g., indoor localization).
- Only global retrieval is evaluated; integration with local feature matching for re-ranking is not explored.
- Query and gallery network embedding dimensions must be identical, constraining further compression opportunities.
- Combining knowledge distillation with the proposed framework has not been investigated.
Related Work & Insights¶
- CSD / D3still: k-NN-based asymmetric methods serving as the primary baselines; this paper demonstrates that domain priors can fully replace k-NN.
- SALAD / BoQ: Current state-of-the-art VPR methods, both fine-tuned on DINOv2; this work enables their deployment on edge devices.
- Compatible Training literature: Primarily addresses backward compatibility during model updates; this paper extends the concept to heterogeneous-capacity settings.
- The Implicit Embedding Augmentation idea draws from person re-identification and recommender systems, and is generalized here to VPR with geographic priors.
Rating¶
- Novelty: ⭐⭐⭐⭐ — The Geographical Memory Bank is a concise yet effective novel design.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Five benchmarks, multiple query-gallery combinations, detailed ablations and efficiency analyses.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear problem motivation, complete mathematical derivations, and well-structured experimental presentation.
- Value: ⭐⭐⭐⭐ — Provides a practical solution for deploying VPR on edge devices.