
GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction

Conference: CVPR 2026
arXiv: 2603.21943
Code: GitHub
Area: Remote Sensing / Cross-View Geolocalization
Keywords: Cross-View Geolocalization, Flow Field Regression, Iterative Refinement Sampling, Real-Time Inference, Probabilistic Displacement Prediction

TL;DR

GeoFlow is a lightweight, flow-matching-inspired framework for fine-grained cross-view geolocalization (FG-CVG). It learns probabilistic displacement fields and pairs them with an iterative refinement sampling (IRS) algorithm to achieve precise 2-DoF ground-to-satellite localization in continuous space, reaching accuracy competitive with the state of the art at a real-time 29 FPS.

Background & Motivation

Fine-grained cross-view geolocalization (FG-CVG) aims to estimate the precise 2-DoF position of a ground image relative to a satellite image, which is critically important for autonomous navigation in GPS-denied areas. Existing methods face the following dilemmas:

  1. Matching-based methods (e.g., CCVPE): Discretize the search space into a finite patch grid, essentially treating localization as classification. They suffer from quantization error bounded by the patch size and scale poorly to large search areas.
  2. Regression-based methods (e.g., HC-Net, FG2): While operating in continuous space, they often require camera intrinsics, BEV projection, or intermediate geometric estimation as priors, incurring high computational overhead that hinders real-time deployment.
  3. Accuracy-speed trade-off: High-accuracy models (e.g., FG2, 4.20 FPS) are too slow, while fast models lack sufficient accuracy.

The core motivation of GeoFlow is: can precise localization in continuous space be achieved while maintaining real-time inference speed? The authors draw inspiration from flow matching, which learns vector fields that iteratively transport samples from a prior distribution to a target distribution, a process that naturally mirrors human coarse-to-fine reasoning during localization.

Method

Overall Architecture

GeoFlow consists of three core components:

  1. Cross-view feature extraction and matching: Extracts features from ground and satellite images, fusing them into a global visual representation \(\mathbf{f}_{vis}\) via cross-attention.
  2. Probabilistic displacement regression network: Given an arbitrary initial hypothesis position \(\mathbf{q}_0\), predicts the probabilistic displacement (distance \(r\) and direction \(\theta\)) to the target position.
  3. Iterative Refinement Sampling (IRS): At inference time, generates multiple random hypotheses and iteratively refines them over multiple rounds to converge to a consistent estimate.
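The interfaces of the three components can be sketched at the shape level as follows. This is a minimal numpy stand-in with hypothetical function names and an assumed feature dimension; the actual model uses EfficientNet-B0 backbones and cross-attention, which are not reproduced here.

```python
import numpy as np

D = 64  # common feature dimension d (assumed value for illustration)

def extract_and_fuse(ground_img, sat_img):
    """Component 1 stand-in: cross-view feature extraction and fusion.
    The real model runs EfficientNet-B0 on each view, cross-attention
    (ground queries, satellite keys/values), then adaptive pooling."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(D)  # global visual representation f_vis

def displacement_head(q0, f_vis):
    """Component 2 stand-in: predicts probabilistic displacement params
    (mu_r, sigma_r^2) for distance and (mu_theta, kappa) for direction.
    Values below are placeholders, not network outputs."""
    mu_r, sigma2_r = 1.0, 0.1
    mu_theta = np.array([1.0, 0.0])  # unit mean direction on S^1
    kappa = 5.0                      # vMF concentration
    return mu_r, sigma2_r, mu_theta, kappa

def refine(q0, mu_r, mu_theta):
    """One refinement step: q1 = q0 + mu_r * mu_theta / ||mu_theta||_2."""
    return q0 + mu_r * mu_theta / np.linalg.norm(mu_theta)

f_vis = extract_and_fuse(None, None)
mu_r, sigma2_r, mu_theta, kappa = displacement_head(np.zeros(2), f_vis)
q1 = refine(np.zeros(2), mu_r, mu_theta)  # one step of length mu_r along mu_theta
```

Component 3 (IRS) simply applies `displacement_head` and `refine` to many hypotheses in parallel for several rounds, as detailed below.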

Key Designs

  1. Lightweight Cross-View Feature Extraction

    • Two separate EfficientNet-B0 backbones for ground and satellite images (deliberately choosing a lightweight architecture to validate the method itself)
    • 1×1 convolutions project both feature streams to a common dimension \(d\)
    • Fixed 2D sinusoidal positional encodings are added for spatial awareness
    • Cross-attention mechanism: Ground tokens serve as queries, satellite tokens as keys/values, allowing ground representations to incorporate satellite spatial information
    • Adaptive average pooling yields a global visual representation \(\mathbf{f}_{vis} \in \mathbb{R}^d\)
  2. Probabilistic Displacement Regression (Core Innovation)

    • Reformulates localization as learning a regression field \(\mathbf{v}^\phi(\mathbf{q}_0, \mathbf{f}_{vis})\)
    • Input: concatenation of visual representation \(\mathbf{f}_{vis}\) and initial hypothesis position \(\mathbf{q}_0\)
    • Parameterizes displacement in polar coordinates \((r, \theta)\), predicted by two heads:
    • Distance head: Predicts Gaussian distribution parameters \((\mu_r, \sigma_r^2)\), i.e., \(r \sim \mathcal{N}(\mu_r, \sigma_r^2)\)
    • Direction head: Predicts von Mises-Fisher distribution parameters \((\mu_\theta, \kappa)\), suitable for modeling directional uncertainty on the unit circle \(S^1\)
    • Refinement formula: \(\hat{\mathbf{q}}_1 = \mathbf{q}_0 + \mu_r \cdot \frac{\mu_\theta}{\|\mu_\theta\|_2}\)
    • Design Motivation: Probabilistic modeling provides not only point estimates but also uncertainty quantification, far superior to deterministic regression
  3. Iterative Refinement Sampling (IRS, Inference Algorithm)

    • Initializes \(N\) hypothesis points \(\mathcal{Q}_0 = \{\mathbf{q}_0^{(i)}\}_{i=1}^N\), uniformly sampled on the satellite image
    • Iterates for \(R\) rounds: each round calls the regression network in parallel on all hypotheses to predict displacements and update positions
    • Final position is the mean of all converged hypotheses: \(\hat{\mathbf{q}}_{final} = \text{mean}(\mathcal{Q}_R)\)
    • Key efficiency design: Visual feature extraction (EfficientNet + Cross-Attention) runs only once; IRS iterations only re-run the ultra-lightweight coordinate projection layer and MLP regression head
    • Supports inference-time scaling: \(N\) and \(R\) can be flexibly adjusted to trade off between accuracy and speed without retraining
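The IRS loop above can be sketched as follows. This is a toy numpy illustration, not the paper's implementation: the learned regression network is replaced by a hand-written stand-in field that moves each hypothesis part of the way toward a known target, and the image size and target position are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 512.0                        # satellite image size in pixels (assumed)
target = np.array([310.0, 140.0])    # "unknown" true position (toy example)

def predict_displacement(Q):
    """Stand-in for the regression heads. The real network predicts
    (mu_r, mu_theta) from (q0, f_vis); here we fake a field that moves
    each hypothesis 70% of the way toward the target."""
    delta = target - Q                            # (N, 2)
    r = np.linalg.norm(delta, axis=1, keepdims=True)
    direction = delta / np.maximum(r, 1e-8)       # unit mean direction mu_theta
    mu_r = 0.7 * r                                # predicted step length
    return mu_r, direction

def irs(N=10, R=5):
    """Iterative Refinement Sampling: N hypotheses, R refinement rounds.
    Visual features would be extracted once before this loop; each round
    only re-runs the lightweight regression head on all hypotheses."""
    Q = rng.uniform(0.0, H, size=(N, 2))          # uniform initial hypotheses
    for _ in range(R):
        mu_r, direction = predict_displacement(Q)
        Q = Q + mu_r * direction                  # q1 = q0 + mu_r * mu_theta/||mu_theta||
    return Q.mean(axis=0)                         # consensus of converged hypotheses

q_hat = irs(N=10, R=5)
residual = np.linalg.norm(q_hat - target)         # shrinks geometrically with R
```

Because only the cheap head runs inside the loop, increasing `N` or `R` costs little, which is the basis of the inference-time scaling behavior reported below.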

Loss & Training

  • During training, hypothesis positions \(\mathbf{q}_0\) are uniformly sampled on the satellite image, and displacements \(\mathbf{u}_{gt}\) to the ground truth are computed
  • Distance NLL loss: \(\mathcal{L}_r = \frac{1}{2}\left(\frac{(r_{gt} - \mu_r)^2}{\sigma_r^2} + \log \sigma_r^2\right)\)
    • Inverse-variance weighting penalizes errors under high confidence; \(\log \sigma_r^2\) regularization prevents variance collapse to zero
  • Direction AngMF loss: \(\mathcal{L}_\theta = -\log(\kappa^2+1) + \kappa \cdot \cos^{-1}(\mu_\theta^T \cdot \theta_{gt}) + \log(1+\exp(-\kappa\pi))\)
    • Directly minimizes angular error, more robust than L2 loss
  • Total loss: \(\mathcal{L} = \mathcal{L}_r + \mathcal{L}_\theta\)
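The two loss terms can be written directly from the formulas above. This is a small numpy sketch of the per-sample losses as stated in the summary (scalar inputs, no batching); it is a check of the math, not the training code.

```python
import numpy as np

def distance_nll(r_gt, mu_r, sigma2_r):
    """Gaussian NLL for the distance head (constant terms dropped):
    L_r = 0.5 * ((r_gt - mu_r)^2 / sigma_r^2 + log sigma_r^2)."""
    return 0.5 * ((r_gt - mu_r) ** 2 / sigma2_r + np.log(sigma2_r))

def angmf_loss(mu_theta, theta_gt, kappa):
    """AngMF loss as written above: penalizes the angle between the
    predicted mean direction and the ground-truth direction, scaled by
    the concentration kappa, with normalization terms in kappa."""
    cos_sim = np.clip(mu_theta @ theta_gt, -1.0, 1.0)
    angle = np.arccos(cos_sim)                     # angular error in radians
    return (-np.log(kappa ** 2 + 1.0)
            + kappa * angle
            + np.log1p(np.exp(-kappa * np.pi)))

# Sanity checks: each loss decreases as the prediction approaches ground truth.
near = distance_nll(5.0, 5.0, 0.5)                 # correct distance
far = distance_nll(5.0, 3.0, 0.5)                  # 2 m distance error
aligned = angmf_loss(np.array([1.0, 0.0]), np.array([1.0, 0.0]), 4.0)
off = angmf_loss(np.array([0.0, 1.0]), np.array([1.0, 0.0]), 4.0)  # 90° off
print(near < far, aligned < off)  # True True
```

Note how the inverse-variance weighting works: with small `sigma2_r`, the same distance error is penalized much more heavily, so the network can only claim high confidence when it is actually accurate.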

Key Experimental Results

Main Results

KITTI Dataset (Same-Area)

| Method | Mean (m) ↓ | Median (m) ↓ | FPS ↑ | Params (M) |
|---|---|---|---|---|
| FG2 | 0.75 | 0.52 | 4.20 | - |
| HC-Net | 0.80 | 0.50 | 25.00 | 11.21 |
| CCVPE | 1.22 | 0.62 | 24.00 | 57.40 |
| GeoFlow | 0.98 | 0.68 | 29.49 | 7.38 |

VIGOR Dataset

| Method | Same Mean (m) ↓ | Cross Mean (m) ↓ | FPS ↑ |
|---|---|---|---|
| FG2 | 2.18 | 2.74 | 3.60 |
| HC-Net | 2.65 | 3.35 | 20.00 |
| CCVPE | 3.60 | 4.97 | 18.00 |
| GeoFlow | 3.51 | 4.62 | 29.49 |

Ablation Study

Effect of IRS Iteration Rounds (KITTI Cross-Area, N=10)

| Rounds R | Mean (m) | Median (m) | FPS |
|---|---|---|---|
| 1 | 10.69 | 9.95 | 32.55 |
| 3 | 8.47 | 5.88 | 31.23 |
| 5 | 8.42 | 5.60 | 29.49 |
| 10 | 8.41 | 5.59 | 26.23 |

Effect of IRS Hypothesis Count (KITTI Cross-Area, R=5)

| Seeds N | Mean (m) | FPS |
|---|---|---|
| 1 | 8.58 | 30.70 |
| 10 | 8.42 | 29.49 |
| 20 | 8.41 | 28.08 |

Single Inference vs. Full IRS

| Config | Mean (m) | Median (m) | Note |
|---|---|---|---|
| N=1, R=1 | 12.47 | 11.79 | Single-inference baseline |
| N=10, R=5 | 8.42 | 5.60 | Full IRS, 32.5% error reduction |

Key Findings

  1. Inference-time scaling is genuinely effective: From R=1 to R=3, mean error drops by 20.8%, with FPS nearly unaffected (32.55→31.23)
  2. Extreme efficiency: GeoFlow has only 7.38M parameters (1/7.8 of CCVPE), 686 MiB memory (1/6.9 of CCVPE), and 7× the speed of FG2
  3. IRS is a critical component: The single inference vs. IRS comparison shows that IRS is not a marginal improvement but halves the median error

Highlights & Insights

  1. Paradigm innovation: Reformulates FG-CVG as learning probabilistic displacement fields + iterative hypothesis refinement, fundamentally different from traditional matching/regression paradigms
  2. Extreme efficiency design: Visual features are computed only once; IRS iterations run only ultra-lightweight MLPs, achieving the breakthrough of "iterative methods can also be real-time"
  3. Inference-time scaling is observed for the first time in FG-CVG, analogous to test-time compute scaling in LLMs
  4. Elegance of probabilistic modeling: Gaussian for distance and vMF for direction are more principled than deterministic regression, with uncertainty naturally learned through NLL loss
  5. Multi-hypothesis consensus mechanism: Similar to particle filtering, multi-hypothesis convergence naturally suppresses visual ambiguity

Limitations & Future Work

  1. Absolute accuracy gap remains: On VIGOR Cross-Area, GeoFlow's mean error of 4.62 m trails FG2's 2.74 m; the lightweight design trades away some accuracy
  2. Only handles 2-DoF: Does not estimate heading (θ), assuming orientation is known (e.g., from an IMU or compass)