VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation

Conference: CVPR 2026 | arXiv: 2603.12918 | Code: To be confirmed | Area: Autonomous Driving | Keywords: cross-view pose estimation, view-invariant representation, polar transformation, positional attention, autonomous driving localization

TL;DR

This paper proposes VIRD, which constructs view-invariant representations via dual-axis transformation (polar transformation + context-enhanced positional attention) to achieve state-of-the-art cross-view pose estimation without orientation priors, reducing position and orientation errors on KITTI by 50.7% and 76.5%, respectively.

Background & Motivation

Global localization is a critical requirement: Accurate global localization is foundational for real-world navigation in autonomous driving and mobile robotics.

GNSS is unreliable in urban scenarios: In dense urban areas, GNSS signals degrade severely due to occlusion and multipath effects, causing significant drops in localization accuracy.

Cross-view pose estimation as an alternative: Estimating the 3-DoF pose of a ground camera using geo-referenced satellite imagery, known as cross-view pose estimation (CVPE), is a promising alternative, but the large viewpoint discrepancy between ground and satellite views poses a fundamental challenge.

Existing methods rely on orientation priors: Early methods assume a known coarse orientation and iteratively refine within a narrow search space; however, orientation priors are often inaccurate or unavailable in practice, leading to convergence to suboptimal solutions.

Semantic methods neglect spatial correspondence: Recent omnidirectional CVPE methods reduce the viewpoint gap via semantic similarity (cross-attention, contrastive learning), but overlook spatial correspondences, leaving the viewpoint discrepancy fundamentally unresolved.

Geometric transformation methods each have drawbacks: Polar transformation addresses only horizontal alignment while neglecting the vertical axis; projective transformation depends on camera parameters and introduces severe artifacts around vertical structures such as buildings.

Method

Overall Architecture

VIRD is an omnidirectional cross-view pose estimation framework that builds view-invariant descriptors through dual-axis transformation. The overall pipeline is as follows:

1. Feature extraction: A pretrained CNN (VGG16 / EfficientNet-B0) extracts ground features \(F_g \in \mathbb{R}^{C \times H \times W_g}\) and satellite features \(F_s \in \mathbb{R}^{C \times A \times A}\) separately.
2. Horizontal-axis alignment: Polar transformation is applied to satellite features, mapping the azimuth angle to the horizontal axis.
3. Vertical-axis alignment: A Context-Enhanced Positional Attention (CEPA) module eliminates vertical-axis misalignment.
4. Descriptor generation: Features are compressed along the vertical direction and flattened into orientation-aware 1D descriptors \(D_g\) and \(D_{s2p}\).
5. Coarse matching + fine regression: A coarse pose is obtained via cosine similarity matching (see the sketch after this list), then refined by a regression module that predicts residual corrections.
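
Because azimuth maps to the horizontal axis of the polar satellite descriptor, orientation corresponds to a circular shift, so the coarse search can be phrased as cosine-similarity matching over candidate positions and shifts. Below is a minimal PyTorch sketch of that idea; the function name, descriptor shapes, and the brute-force sweep are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def coarse_match(d_g, d_s2p):
    """Coarse (position, orientation) search by cosine similarity (a sketch).

    d_g   : (C, Wg)     orientation-aware 1D ground descriptor
    d_s2p : (N, C, Ws)  polar satellite descriptors at N candidate positions,
                        with azimuth on the last axis, so a circular shift
                        corresponds to a rotation of the camera
    """
    n, _, w_s = d_s2p.shape
    w_g = d_g.shape[1]
    scores = torch.empty(n, w_s)
    for shift in range(w_s):  # brute-force 360 deg orientation sweep
        rolled = torch.roll(d_s2p, shifts=-shift, dims=2)[..., :w_g]
        scores[:, shift] = F.cosine_similarity(
            rolled.flatten(1), d_g.flatten().unsqueeze(0), dim=1)
    best = int(scores.argmax())
    return best // w_s, best % w_s  # (candidate index, azimuth bin)

# Toy usage with random descriptors:
idx, shift = coarse_match(torch.randn(64, 32), torch.randn(100, 64, 32))
```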

Key Designs

Polar Transformation (horizontal alignment): Satellite feature maps are resampled in polar coordinates centered at a candidate location, mapping azimuth and radial distance to the horizontal and vertical axes, respectively. The transformation width \(W_s = \frac{2\pi}{\text{HFoV}} \cdot W_g\) ensures consistency across different fields of view.
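
A minimal sketch of the polar resampling step using `torch.nn.functional.grid_sample`; the sampling conventions (radial range, reference direction) and the centering at a normalized candidate location are assumptions for illustration, and only the width rule \(W_s = \frac{2\pi}{\text{HFoV}} \cdot W_g\) comes from the paper.

```python
import math
import torch
import torch.nn.functional as F

def polar_transform(f_s, center, w_g, hfov=2 * math.pi, r_max=1.0, h_out=64):
    """Resample satellite features around a candidate location into polar
    coordinates: azimuth -> horizontal axis, radial distance -> vertical axis.

    f_s    : (1, C, A, A) satellite feature map
    center : (cx, cy) candidate location in normalized [-1, 1] coordinates
    """
    w_s = round(2 * math.pi / hfov * w_g)        # width rule from the paper
    theta = torch.linspace(0, 2 * math.pi, w_s)  # azimuth samples
    r = torch.linspace(0, r_max, h_out)          # radial samples
    rr, tt = torch.meshgrid(r, theta, indexing="ij")
    # Sampling grid in grid_sample's (x, y) convention; north = -y (assumed).
    grid_x = center[0] + rr * torch.sin(tt)
    grid_y = center[1] - rr * torch.cos(tt)
    grid = torch.stack((grid_x, grid_y), dim=-1).unsqueeze(0)  # (1, H, Ws, 2)
    return F.grid_sample(f_s, grid, align_corners=False)       # (1, C, H, Ws)

# Toy usage: a 256x256 feature map resampled around the map center.
out = polar_transform(torch.randn(1, 8, 256, 256), (0.0, 0.0), w_g=128)
```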

Positional Attention (PA): Three sets of sinusoidal positional encodings are defined — a shared virtual encoding \(P_a\), a ground encoding \(P_g\), and a satellite encoding \(P_{s2p}\) — and an attention mechanism learns the mapping between vertical coordinates across views:

\[
\mathcal{A}_v = \text{Softmax}\left(\frac{(P_a W_v^Q)(P_v W_v^K)^\top}{\sqrt{d_k}}\right), \qquad v \in \{g, s2p\}
\]

The core insight is to reinterpret positional attention as learning a cross-view vertical coordinate transformation through a shared virtual axis, without requiring camera parameters.
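
A sketch of how such an encoding-only attention could be realized in PyTorch: the linear projections and the vertical remapping follow the formula above, while all dimensions and the fixed-encoding choice are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

def sinusoidal(n, d):
    """Standard fixed sinusoidal positional encoding, shape (n, d), d even."""
    pos = torch.arange(n, dtype=torch.float).unsqueeze(1)
    freqs = 10000 ** (torch.arange(0, d, 2, dtype=torch.float) / d)
    enc = torch.zeros(n, d)
    enc[:, 0::2] = torch.sin(pos / freqs)
    enc[:, 1::2] = torch.cos(pos / freqs)
    return enc

class PositionalAttention(nn.Module):
    """Vertical remapping driven purely by positional encodings (a sketch)."""

    def __init__(self, d_model, h_virtual, h_view):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.register_buffer("p_a", sinusoidal(h_virtual, d_model))  # shared axis
        self.register_buffer("p_v", sinusoidal(h_view, d_model))     # view axis

    def forward(self, feats):
        # feats: (B, C, H_view, W); attention touches only the vertical axis.
        q = self.w_q(self.p_a)                            # (H_virtual, d)
        k = self.w_k(self.p_v)                            # (H_view, d)
        attn = torch.softmax(q @ k.T / math.sqrt(q.shape[-1]), dim=-1)
        return torch.einsum("uv,bcvw->bcuw", attn, feats)  # (B, C, H_virtual, W)

# Toy usage: remap a 32-row ground feature map onto a 16-row virtual axis.
pa = PositionalAttention(d_model=64, h_virtual=16, h_view=32)
out = pa(torch.randn(2, 8, 32, 40))
```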

Context-Enhanced Positional Attention (CEPA): PA assumes a consistent vertical transformation across all horizontal directions and cannot adapt to direction-dependent variations in vertical structures. CEPA leverages local context from ground features to enhance the attention weights:

\[
\mathcal{A}_{g'} = \mathcal{A}_g + \text{Softmax}\left(\Phi(\mathcal{A}_g \oplus F_g)\right)
\]

A convolutional layer \(\Phi\) processes the concatenated (\(\oplus\)) attention weights and features, enabling the model to adaptively transform ground features according to scene context.
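
One plausible reading of this in code, sketched below: the shared attention map is broadcast along the horizontal axis, concatenated with vertically pooled ground features, and a 1D convolution (standing in for \(\Phi\)) predicts a per-direction residual. The pooling, channel sizes, and softmax axis are assumptions; the paper may wire \(\Phi\) differently.

```python
import torch
import torch.nn as nn

class CEPA(nn.Module):
    """Context-Enhanced Positional Attention (a sketch with assumed wiring)."""

    def __init__(self, c_feat, h_virtual, h_view):
        super().__init__()
        self.h_u, self.h_v = h_virtual, h_view
        # Phi: 1D conv along the horizontal axis over [attention ++ context].
        self.phi = nn.Conv1d(c_feat + h_virtual * h_view,
                             h_virtual * h_view, kernel_size=3, padding=1)

    def forward(self, attn_g, f_g):
        # attn_g: (H_u, H_v) shared PA map; f_g: (B, C, H_v, W) ground features.
        b, _, _, w = f_g.shape
        a = attn_g.reshape(1, -1, 1).expand(b, -1, w)       # broadcast over W
        ctx = f_g.mean(dim=2)                               # (B, C, W), assumed pooling
        res = self.phi(torch.cat([a, ctx], dim=1))          # (B, Hu*Hv, W)
        res = torch.softmax(res.view(b, self.h_u, self.h_v, w), dim=2)
        attn = attn_g.view(1, self.h_u, self.h_v, 1) + res  # direction-dependent
        # Per-direction vertical remapping of the ground features.
        return torch.einsum("buvw,bcvw->bcuw", attn, f_g)

# Toy usage:
cepa = CEPA(c_feat=8, h_virtual=16, h_view=32)
out = cepa(torch.randn(16, 32), torch.randn(2, 8, 32, 40))
```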

View Reconstruction Loss: Each descriptor is trained to reconstruct both the original-view and the cross-view image, via an original reconstruction loss \(\mathcal{L}_{\text{origin}}\) and a cross-reconstruction loss \(\mathcal{L}_{\text{cross}}\); this guides descriptors to encode vertical structural information and improves view invariance.

Loss & Training

The total loss consists of three terms:

\[
\mathcal{L} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{match}} + \mathcal{L}_{\text{reg}}
\]

  • \(\mathcal{L}_{\text{recon}} = \alpha_1 \mathcal{L}_{\text{origin}} + \alpha_2 \mathcal{L}_{\text{cross}}\): view reconstruction loss (\(\ell_1\)-loss)
  • \(\mathcal{L}_{\text{match}}\): InfoNCE matching loss, encouraging high similarity at ground-truth poses
  • \(\mathcal{L}_{\text{reg}}\): \(\ell_2\)-loss on pose residuals
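
A compact sketch of how the three terms might be computed, with the view reconstruction term from the Method section folded in. Formulating \(\mathcal{L}_{\text{match}}\) as cross-entropy over temperature-scaled similarities is the standard InfoNCE reduction; the weights `a1`, `a2`, the temperature `tau`, and all tensor shapes are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def vird_loss(rec_o, img_o, rec_c, img_c, sim, gt_idx, pose_res, gt_res,
              a1=1.0, a2=1.0, tau=0.1):
    """Total objective: recon + match + reg (a sketch).

    rec_o/img_o : reconstruction of the original view and its target
    rec_c/img_c : reconstruction of the cross view and its target
    sim         : (B, K) descriptor similarities over K candidate poses
    gt_idx      : (B,)   index of the ground-truth pose among the candidates
    pose_res    : (B, 3) predicted pose residuals; gt_res are the targets
    """
    l_recon = a1 * F.l1_loss(rec_o, img_o) + a2 * F.l1_loss(rec_c, img_c)
    l_match = F.cross_entropy(sim / tau, gt_idx)  # InfoNCE over candidates
    l_reg = F.mse_loss(pose_res, gt_res)          # l2 on pose residuals
    return l_recon + l_match + l_reg

# Toy usage:
loss = vird_loss(torch.rand(2, 3, 32, 64), torch.rand(2, 3, 32, 64),
                 torch.rand(2, 3, 32, 64), torch.rand(2, 3, 32, 64),
                 torch.randn(2, 10), torch.tensor([3, 7]),
                 torch.randn(2, 3), torch.randn(2, 3))
```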

Key Experimental Results

Main Results

KITTI dataset (no orientation prior, cross-area setting, EfficientNet-B0):

| Method | Position Mean (m) ↓ | Position Median (m) ↓ | Orientation Mean (°) ↓ | Orientation Median (°) ↓ |
| --- | --- | --- | --- | --- |
| HighlyAccurate | 15.50 | 16.02 | 89.84 | 89.85 |
| SliceMatch | 14.85 | 11.85 | 23.64 | 7.96 |
| CCVPE | 13.94 | 10.98 | 77.84 | 63.84 |
| FG2 | 13.58 | 11.72 | 90.12 | 90.42 |
| VIRD (Ours) | 11.12 | 5.41 | 22.03 | 1.87 |

VIGOR dataset (no orientation alignment, cross-area setting, EfficientNet-B0):

| Method | Position Mean (m) ↓ | Position Median (m) ↓ | Orientation Mean (°) ↓ | Orientation Median (°) ↓ |
| --- | --- | --- | --- | --- |
| CCVPE | 5.41 | 1.89 | 27.78 | 13.58 |
| DenseFlow | 7.67 | 3.67 | 17.63 | 2.94 |
| FG2† | 5.95 | 2.40 | 28.41 | 2.20 |
| VIRD (Ours) | 4.61 | 1.55 | 16.50 | 1.17 |

Ablation Study

Dual-axis transformation ablation (KITTI cross-area, VGG16):

| Transformation Strategy | Position Median (m) | Orientation Median (°) |
| --- | --- | --- |
| Projective S2G | 10.59 | 3.84 |
| Projective G2S | 15.20 | 5.44 |
| Polar only | 11.75 | 4.00 |
| Polar + PA | 9.76 | 3.44 |
| Polar + CEPA | 8.88 | 3.36 |

Model component ablation (KITTI cross-area, VGG16):

| Configuration | Position Median (m) | Orientation Median (°) |
| --- | --- | --- |
| Polar + CEPA | 8.88 | 3.36 |
| + \(\mathcal{L}_{\text{origin}}\) | 8.29 | 3.31 |
| + \(\mathcal{L}_{\text{cross}}\) | 8.10 | 3.21 |
| + Both | 7.90 | 3.05 |
| + Both + Regression | 7.05 | 2.22 |

Key Findings

  • Dual-axis transformation significantly outperforms all single geometric transformation baselines; the incremental introduction of PA and CEPA yields consistent improvements.
  • The view reconstruction loss contributes most to orientation estimation, reducing mean orientation error by 39.1%; cross-reconstruction proves more effective than original reconstruction alone.
  • The regression module reduces position median by 0.85 m and orientation median by 0.83°; mean orientation error increases slightly because predictions that start near the opposite direction get refined even further from the ground truth.
  • VIRD maintains low errors across varying levels of orientation noise (±10° to ±180°), demonstrating substantially better robustness than CCVPE and FG2.

Highlights & Insights

  • Dual-axis transformation strategy is elegantly designed: The cross-view matching problem is decomposed into two tractable sub-problems — horizontal (polar transformation) and vertical (positional attention) — avoiding the dependence of projective transformation on camera parameters and depth information.
  • CEPA module is intuitively motivated and effective: Positional attention is reinterpreted as a coordinate transformation through a shared virtual axis, and context enhancement enables adaptation to direction-dependent vertical structural variations.
  • View reconstruction loss is cleverly designed: Requiring descriptors to reconstruct both original and cross-view images enforces view invariance while encouraging the model to focus on structures shared across views.
  • No orientation prior required: VIRD achieves state-of-the-art performance over a full 360° search space, yielding high practical value for real-world autonomous driving localization.

Limitations & Future Work

  • Only 3-DoF pose (x, y, yaw) is addressed; pitch and roll are assumed negligible, which may not hold in complex terrain.
  • In the VIGOR setting with aligned orientations, position accuracy falls short of methods such as FG2 that exploit 3D structural information.
  • The regression module may amplify errors when the coarse matching orientation is incorrect (opposite-direction issue).
  • Validation is limited to the KITTI and VIGOR datasets; generalization to a broader range of urban environments and regions remains unknown.
  • Polar transformation must be applied at each candidate pose during training, and computational overhead may be significant as the number of candidates increases.

Related Work

  • Limited-angle CVPE: Methods such as HighlyAccurate and PIDLoc use the Levenberg-Marquardt (LM) algorithm or neural pose optimizers for iterative refinement within a narrow angular range, constrained by orientation priors.
  • Omnidirectional CVPE: SliceMatch employs content-based cross-attention; CCVPE uses contrastive learning; both emphasize semantic similarity while neglecting spatial correspondence.
  • Geometric transformations: Polar transformation (Shi, Li, et al.) addresses horizontal alignment; projective transformation (HighlyAccurate, et al.) attempts to handle both axes simultaneously but depends on camera parameters and introduces artifacts.
  • FG2: Exploits height-aware 3D point selection to narrow the viewpoint gap, but sparse point matching leads to degraded orientation estimation.

Rating

  • Novelty: ⭐⭐⭐⭐ — The dual-axis transformation strategy and CEPA module are novel; the reinterpretation of positional attention is insightful.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Comprehensive comparisons on two mainstream datasets with complete ablations, including robustness analysis and visualizations.
  • Writing Quality: ⭐⭐⭐⭐ — Clear structure, well-motivated problem formulation, and intuitive illustrations.
  • Value: ⭐⭐⭐⭐ — State-of-the-art without orientation priors, with direct applicability to real-world autonomous driving localization.