VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation
Conference: CVPR2025
arXiv: 2603.12918
Code: To be confirmed
Area: Autonomous Driving
Keywords: cross-view pose estimation, geo-localization, view-invariant representation, polar transformation, positional attention
TL;DR
VIRD constructs view-invariant representations through a dual-axis transformation (polar transformation + context-enhanced positional attention) to achieve omnidirectional cross-view pose estimation without orientation priors, reducing position and orientation errors on KITTI by 50.7% and 76.5%, respectively.
Background & Motivation
- Global localization is crucial for autonomous driving and robotics, but the reliability of GNSS decreases in dense urban areas due to signal occlusion and multipath effects.
- Cross-view pose estimation (CVPE) achieves fine-grained 3-DoF localization by matching ground images with satellite images, serving as a promising alternative to GNSS.
- Early CVPE methods relied on coarse orientation priors, which are often inaccurate or unavailable in practice.
- Existing omnidirectional CVPE methods ignore the huge perspective gap between ground and satellite views, where relying solely on semantic similarity is insufficient to establish spatial correspondences.
- Polar transformation solves horizontal alignment but ignores vertical axis misalignment; projection-based transformation is sensitive to camera calibration and introduces severe artifacts at vertical structures such as buildings.
- Effectively resolving vertical axis misalignment remains an open challenge.
Method
Overall Architecture
VIRD constructs view-invariant descriptors for omnidirectional cross-view pose estimation. The workflow includes: (1) descriptor construction via dual-axis transformation; (2) view-reconstruction loss to enhance view invariance; (3) descriptor matching + residual regression to predict the final pose.
Key Designs
1. Dual-Axis Transformation
Horizontal Axis: Polar Transformation
- Applies a polar transformation to the satellite feature map \(F_s\), centered at the candidate position.
- Maps azimuth to the horizontal axis and radial distance to the vertical axis.
- Sets the transformed width to \(W_s = \frac{2\pi}{\text{HFoV}} \cdot W_g\) to ensure FoV consistency.
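The polar resampling and the width rule can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names `polar_width` and `polar_transform`, the `max_radius` parameter, nearest-neighbour sampling, the choice that azimuth 0 points up and increases clockwise, and the choice that row 0 is the nearest radius are all assumptions made for the example.

```python
import math

def polar_width(ground_width, hfov_rad):
    """Width of the polar map: W_s = (2*pi / HFoV) * W_g, with HFoV in radians."""
    return round(2 * math.pi / hfov_rad * ground_width)

def polar_transform(feat, center, out_h, out_w, max_radius):
    """Resample a 2D map around `center` = (row, col) by nearest-neighbour lookup.

    Output columns index azimuth and output rows index radial distance.
    Samples that fall outside the input map are left at 0.0.
    """
    rows, cols = len(feat), len(feat[0])
    cy, cx = center
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        # Assumed convention: row 0 is the smallest radius (bin centres).
        radius = max_radius * (i + 0.5) / out_h
        for j in range(out_w):
            # Assumed convention: theta = 0 points up, increasing clockwise.
            theta = 2 * math.pi * j / out_w
            y = int(round(cy - radius * math.cos(theta)))
            x = int(round(cx + radius * math.sin(theta)))
            if 0 <= y < rows and 0 <= x < cols:
                out[i][j] = feat[y][x]
    return out
```

With a 90° camera and a ground feature width of 512, `polar_width` gives a panorama-equivalent width of 2048, so one column spans the same angle in both views. A practical version would likely use bilinear sampling on a multi-channel feature map.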
Vertical Axis: Positional Attention (PA)
- Defines three positional encodings: a shared virtual encoding \(P_a \in \mathbb{R}^{H_Q \times d_p}\), a ground encoding \(P_g\), and a satellite encoding \(P_{s2p}\).
- Computes attention weights for each view \(v \in \{g, s2p\}\) as \(\mathcal{A}_v = \text{Softmax}\left(\frac{(P_a W_v^Q)(P_v W_v^K)^\top}{\sqrt{d_k}}\right)\).
- Establishes cross-view consistent vertical correspondences by mapping both views onto the shared virtual vertical axis.
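The attention formula above can be written out directly. The following is a minimal sketch in plain Python with lists as matrices; the helper names (`matmul`, `transpose`, `softmax_rows`, `positional_attention`, `apply_vertical`) are hypothetical, and applying the attention to one feature column at a time is an assumption about how the weights are used.

```python
import math

def matmul(a, b):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def positional_attention(p_a, p_v, w_q, w_k):
    """A_v = Softmax((P_a W_q)(P_v W_k)^T / sqrt(d_k)), shape H_Q x H_v."""
    q = matmul(p_a, w_q)           # H_Q x d_k, from the shared virtual encoding
    k = matmul(p_v, w_k)           # H_v x d_k, from the view's own encoding
    d_k = len(w_q[0])
    scores = matmul(q, transpose(k))
    return softmax_rows([[s / math.sqrt(d_k) for s in row] for row in scores])

def apply_vertical(attn, feat_col):
    """Resample one feature column (H_v x C) onto the virtual axis (H_Q x C)."""
    return matmul(attn, feat_col)
```

Because the queries come only from the shared encoding \(P_a\), both views end up with \(H_Q\) rows regardless of their original heights, which is what makes the resulting descriptors comparable row by row.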
Context-Enhanced Positional Attention (CEPA)
- Standard PA applies the same vertical transformation in every horizontal direction, so it cannot adapt to vertical structures.
- CEPA refines the attention using ground feature context: \(\mathcal{A}_{g'} = \mathcal{A}_g + \text{Softmax}(\Phi(\mathcal{A}_g \oplus F_g))\).
- This lets the model transform ground features differently in each horizontal direction, depending on the scene context.
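To make the CEPA update concrete, here is a heavily simplified sketch for a single ground image column. The summary does not specify \(\Phi\) or how \(\mathcal{A}_g\) and \(F_g\) are concatenated, so everything structural here is an assumption: \(\Phi\) is taken to be one linear map (`phi_weight`, hypothetical), the concatenation is done per ground row, and the softmax runs over ground rows.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cepa_column(attn_g, feat_col, phi_weight):
    """Refine A_g (H_Q x H_g) for one ground column with features feat_col (H_g x C).

    Assumed Phi: a linear map, applied per ground row h, from the concatenation
    [A_g[:, h] ; F_g[h]] (length H_Q + C) to H_Q logits.
    phi_weight has shape H_Q x (H_Q + C).
    """
    h_q, h_g = len(attn_g), len(attn_g[0])
    logits = [[0.0] * h_g for _ in range(h_q)]
    for h in range(h_g):
        joint = [attn_g[q][h] for q in range(h_q)] + list(feat_col[h])
        for q in range(h_q):
            logits[q][h] = sum(w * x for w, x in zip(phi_weight[q], joint))
    # A_g' = A_g + Softmax(Phi(A_g (+) F_g)), softmax taken over ground rows.
    return [[a + r for a, r in zip(attn_g[q], softmax(logits[q]))]
            for q in range(h_q)]
```

Calling `cepa_column` once per image column yields a different refined attention for each horizontal direction, which is the adaptability that plain PA lacks. Note that, following the formula as written, each refined row sums to 2 rather than 1.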
2. View-Reconstruction Loss
- Trains descriptors to reconstruct both the original view and the opposite view simultaneously.
- Uses four decoders: \(G_{g \to g}\), \(G_{s \to s}\), \(G_{s \to g}\), \(G_{g \to s}\).
- Combines original-view reconstruction \(\mathcal{L}_{\text{origin}}\) with cross-view reconstruction \(\mathcal{L}_{\text{cross}}\).
- Guides descriptors to encode vertical structural information, resolving ambiguities in visually similar road scenes.
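One way to read the two reconstruction terms is sketched below. The symbols are assumptions introduced for illustration and are not given in this summary: \(D_g\) and \(D_s\) denote the ground and satellite descriptors, \(T_g\) and \(T_s\) the reconstruction targets for each view, and \(\ell\) a generic reconstruction distance. The unweighted sums are likewise an assumption.

\[
\mathcal{L}_{\text{origin}} = \ell\big(G_{g \to g}(D_g),\, T_g\big) + \ell\big(G_{s \to s}(D_s),\, T_s\big),
\qquad
\mathcal{L}_{\text{cross}} = \ell\big(G_{s \to g}(D_s),\, T_g\big) + \ell\big(G_{g \to s}(D_g),\, T_s\big)
\]

Under this reading, the cross terms are what force each descriptor to carry enough information to render the other view.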
3. Matching and Regression
- Descriptor matching: computes cosine similarity over a grid of candidate poses, trained with an InfoNCE loss.
- Pose regression: predicts the residual \(\Delta \mathbf{p} = (\Delta x, \Delta y, \Delta \theta)\) relative to the coarsely matched pose.
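The coarse-to-fine step can be illustrated with a small plain-Python sketch. The function names, the `temperature` value, and the choice to express orientation in degrees wrapped to [0, 360) are assumptions for the example; in the actual method the residual would come from a learned regressor rather than being passed in by hand.

```python
import math

def cosine(a, b):
    """Cosine similarity between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def match_candidates(ground_desc, candidate_descs):
    """Similarity of the ground descriptor to the descriptor at every candidate pose."""
    return [cosine(ground_desc, d) for d in candidate_descs]

def info_nce(similarities, positive_index, temperature=0.1):
    """InfoNCE: negative log-softmax of the positive candidate's similarity."""
    logits = [s / temperature for s in similarities]
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[positive_index]

def refine_pose(coarse_pose, residual):
    """Add the regressed residual (dx, dy, dtheta) to the best-matching candidate."""
    x, y, theta = coarse_pose
    dx, dy, dtheta = residual
    return (x + dx, y + dy, (theta + dtheta) % 360.0)
```

At inference, the candidate with the highest similarity gives the coarse pose, and `refine_pose` applies the predicted \(\Delta \mathbf{p}\) to reach sub-grid precision.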
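Loss & Training
\(\mathcal{L} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{match}} + \mathcal{L}_{\text{reg}}\)
Here \(\mathcal{L}_{\text{recon}}\) is the view-reconstruction loss (original-view and cross-view terms), \(\mathcal{L}_{\text{match}}\) is the InfoNCE matching loss, and \(\mathcal{L}_{\text{reg}}\) supervises the pose residual. Any weighting coefficients between the terms are not listed here.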
Key Experimental Results
KITTI Dataset (No Orientation Prior, Same-Area)
| Method | Backbone | Med. Pos.(m)↓ | Med. Ori.(°)↓ | R@1m Lat.↑ |
|---|---|---|---|---|
| SliceMatch | VGG16 | 5.41 | 4.42 | 39.73% |
| CCVPE | EffNet-B0 | 3.47 | 6.12 | 53.30% |
| DenseFlow | ResNet18 | 4.26 | 0.99 | 73.87% |
| VIRD | VGG16 | 2.07 | 1.02 | 79.46% |
VIGOR Dataset (Same-Area, Unaligned)
| Method | Med. Pos.(m)↓ | Med. Ori.(°)↓ |
|---|---|---|
| SliceMatch | 5.77 | 67.37 |
| CCVPE | 4.56 | 75.86 |
| VIRD | 3.74 | 2.15 |
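Key Findings
- On KITTI, VIRD's median position error is 2.07 m versus 4.26 m for DenseFlow, and its median orientation error is 1.02° versus 4.42° for SliceMatch. These gains are quoted as 50.7% and 76.5% reductions; the table values above work out to about 51.4% and 76.9%. DenseFlow's orientation error (0.99°) remains marginally lower than VIRD's.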
- Cross-Area generalization is also significantly superior to existing methods.
- The view-reconstruction loss significantly contributes to encoding vertical structural information.
- CEPA achieves greater improvement compared to standard PA in complex urban scenes.
Highlights & Insights
- Dual-Axis Transformation Innovation: For the first time, explicitly resolves the cross-view gap along both horizontal and vertical axes separately, establishing consistent correspondences via a shared virtual axis.
- CEPA Adaptability: Dynamically adjusts vertical attention weights using ground context, so the vertical mapping can differ per horizontal direction and follow vertical structures such as buildings.
- No Camera Parameters Required: Positional attention learns vertical transformations without relying on intrinsic or extrinsic camera parameters, avoiding the calibration sensitivity of projection-based transformations.
- View-Reconstruction Regularization: Enhances the view-invariance of descriptors through reconstruction tasks, serving as an elegant self-supervised signal.
Limitations & Future Work
- The polar transformation assumes a flat ground approximation, which may fail in hilly or highly uneven terrains.
- Validation is only performed on KITTI and VIGOR; scenes with larger discrepancies in satellite image resolution or timestamps have not been tested.
- View reconstruction requires four additional decoders, which increases training cost and complexity, although they are dropped at inference.
- Negligible pitch/roll is assumed, which might not hold for real-world robotics (especially on slopes).
Related Work & Insights
- Polar transformation originates from the works of Shi & Li, and VIRD complements this with a vertical axis solution.
- CEPA can be generalized to other tasks requiring cross-domain vertical alignment (e.g., remote sensing-to-ground matching, floor-level localization).
- The concept of view-reconstruction loss can inspire regularization designs for other cross-view or cross-modal matching tasks.
- It complements LiDAR-based localization and is highly suitable for scenarios with only monocular cameras.
Rating
- Novelty: ⭐⭐⭐⭐ (Dual-Axis Transformation + CEPA + View Reconstruction)
- Experimental Thoroughness: ⭐⭐⭐⭐ (Two datasets, Same/Cross-Area, multiple backbones)
- Writing Quality: ⭐⭐⭐⭐⭐ (Clear diagrams, thorough analysis of problems)
- Value: ⭐⭐⭐⭐ (Significantly advances prior-free cross-view localization SOTA)