
GeoFlow: Real-Time Fine-Grained Cross-View Geolocalization via Iterative Flow Prediction

Conference: CVPR 2026
arXiv: 2603.21943
Code: GitHub
Area: Remote Sensing / Cross-View Geolocalization
Keywords: Cross-View Geolocalization, Flow Field Regression, Iterative Refinement Sampling, Real-Time Inference, Probabilistic Displacement Prediction

TL;DR

GeoFlow is a lightweight, flow-matching-inspired framework for fine-grained cross-view geolocalization (FG-CVG). It learns a probabilistic displacement field and pairs it with an iterative refinement sampling (IRS) algorithm to localize a ground image on a satellite image with 2 degrees of freedom (2-DoF) in continuous space. It reaches accuracy competitive with the state of the art while running in real time at about 29 FPS.

Background & Motivation

Fine-grained cross-view geolocalization (FG-CVG) estimates the precise 2-DoF position of a ground image within a satellite image, which matters for autonomous navigation in GPS-denied areas. Existing methods face three problems:

- Matching-based methods (e.g., CCVPE): They discretize the search space into a finite patch grid, which effectively turns localization into classification. Their accuracy is bounded by the quantization error of the patch size, and they are difficult to scale to large search areas.
- Regression-based methods (e.g., HC-Net, FG2): They operate in continuous space but often require camera intrinsics, BEV projection, or intermediate geometric estimation as priors, and the resulting computational overhead hinders real-time deployment.
- Accuracy-speed trade-off: High-accuracy models are too slow (e.g., FG2 at 4.20 FPS), while fast models lack sufficient accuracy.

The core question behind GeoFlow is whether precise localization in continuous space can be achieved at real-time inference speed. The authors draw inspiration from flow matching, which learns a vector field that iteratively transports samples from a prior distribution to a target distribution. This process resembles the coarse-to-fine reasoning a human uses when localizing.

Method

Overall Architecture

GeoFlow consists of three core components:

- Cross-view feature extraction and matching: Extracts features from ground and satellite images, fusing them into a global visual representation \(\mathbf{f}_{vis}\) via cross-attention.
- Probabilistic displacement regression network: Given an arbitrary initial hypothesis position \(\mathbf{q}_0\), predicts the probabilistic displacement (distance \(r\) and direction \(\theta\)) to the target position.
- Iterative Refinement Sampling (IRS): At inference time, generates multiple random hypotheses and iteratively refines them over multiple rounds to converge to a consistent estimate.

Key Designs

Lightweight Cross-View Feature Extraction
- Two separate EfficientNet-B0 backbones for ground and satellite images (deliberately choosing a lightweight architecture to validate the method itself)
- 1×1 convolutions project both feature streams to a common dimension \(d\)
- Fixed 2D sinusoidal positional encodings are added for spatial awareness
- Cross-attention mechanism: Ground tokens serve as queries, satellite tokens as keys/values, allowing ground representations to incorporate satellite spatial information
- Adaptive average pooling yields a global visual representation \(\mathbf{f}_{vis} \in \mathbb{R}^d\)
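
The fusion step above (ground tokens as queries, satellite tokens as keys/values, then average pooling) can be illustrated with a minimal sketch. It is an assumption-laden simplification, not the authors' implementation: it uses a single head, omits the learned query/key/value projections, the positional encodings, and any residual connection, and the function name `cross_attention_pool` is hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_pool(ground_tokens, sat_tokens):
    """Single-head cross-attention followed by average pooling.

    Illustrative only: no learned projections, so queries are the raw
    ground tokens and keys/values are the raw satellite tokens.
    """
    d = len(ground_tokens[0])
    attended = []
    for q in ground_tokens:
        # Scaled dot-product scores of this ground query against every satellite key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in sat_tokens]
        weights = softmax(scores)
        # Weighted sum of satellite values: the ground token now carries satellite context.
        attended.append([sum(w * v[j] for w, v in zip(weights, sat_tokens)) for j in range(d)])
    # Average pooling over tokens gives one global vector of dimension d.
    n = len(attended)
    return [sum(tok[j] for tok in attended) / n for j in range(d)]

ground = [[1.0, 0.0], [0.0, 1.0]]
sat = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f_vis = cross_attention_pool(ground, sat)
print(f_vis)
```

Because the pooled vector has a fixed size \(d\) regardless of image resolution, it can be computed once and reused by every later refinement step.
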
Probabilistic Displacement Regression (Core Innovation)
- Reformulates localization as learning a regression field \(\mathbf{v}^\phi(\mathbf{q}_0, \mathbf{f}_{vis})\)
- Input: concatenation of visual representation \(\mathbf{f}_{vis}\) and initial hypothesis position \(\mathbf{q}_0\)
- Parameterizes displacement in polar coordinates \((r, \theta)\), predicted by two heads:
  - Distance head: Predicts Gaussian distribution parameters \((\mu_r, \sigma_r^2)\), i.e., \(r \sim \mathcal{N}(\mu_r, \sigma_r^2)\)
  - Direction head: Predicts von Mises-Fisher (vMF) distribution parameters \((\mu_\theta, \kappa)\), i.e., a mean direction \(\mu_\theta\) and a concentration \(\kappa\), suitable for modeling directional uncertainty on the unit circle \(S^1\)
- Refinement formula: \(\hat{\mathbf{q}}_1 = \mathbf{q}_0 + \mu_r \cdot \frac{\mu_\theta}{\|\mu_\theta\|_2}\)
- Design motivation: Probabilistic modeling provides an uncertainty estimate alongside each point estimate, which deterministic regression does not offer
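
As a worked instance of the refinement formula above, the following minimal sketch applies one update step to a single hypothesis. The function name `refine` and the sample numbers are illustrative assumptions; only the formula itself comes from the text.

```python
import math

def refine(q0, mu_r, mu_theta):
    """One refinement step: q1 = q0 + mu_r * mu_theta / ||mu_theta||_2."""
    norm = math.hypot(mu_theta[0], mu_theta[1])
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    # Normalize the predicted direction so that mu_r alone sets the step length.
    ux, uy = mu_theta[0] / norm, mu_theta[1] / norm
    return (q0[0] + mu_r * ux, q0[1] + mu_r * uy)

# Direction (3, 4) normalizes to (0.6, 0.8); a step of length 5 therefore moves by (3, 4).
q1 = refine((10.0, 20.0), mu_r=5.0, mu_theta=(3.0, 4.0))
print(q1)  # approximately (13.0, 24.0)
```

Normalizing \(\mu_\theta\) separates the two heads cleanly: the direction head only decides where to move, and the distance head only decides how far.
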
Iterative Refinement Sampling (IRS, Inference Algorithm)
- Initializes \(N\) hypothesis points \(\mathcal{Q}_0 = \{\mathbf{q}_0^{(i)}\}_{i=1}^N\), uniformly sampled on the satellite image
- Iterates for \(R\) rounds: each round calls the regression network in parallel on all hypotheses to predict displacements and update positions
- Final position is the mean of all converged hypotheses: \(\hat{\mathbf{q}}_{final} = \text{mean}(\mathcal{Q}_R)\)
- Key efficiency design: Visual feature extraction (EfficientNet + Cross-Attention) runs only once; IRS iterations only re-run the ultra-lightweight coordinate projection layer and MLP regression head
- Supports inference-time scaling: \(N\) and \(R\) can be flexibly adjusted to trade off between accuracy and speed without retraining
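
The IRS loop described above can be sketched in a few lines. This is a toy under stated assumptions, not the paper's code: `toy_displacement_head` is a hypothetical stand-in that peeks at the true target and adds noise, whereas the real head conditions on \(\mathbf{f}_{vis}\); the image size, the 0.8 undershoot factor, and the noise level are arbitrary; and the hypotheses are updated in a plain loop rather than as one batched call.

```python
import math
import random

def toy_displacement_head(q, target, rng):
    """Hypothetical stand-in for the learned regression head.

    It peeks at the true target (which the real network cannot do) and
    returns a noisy (mu_r, mu_theta) so the loop below has something to refine.
    """
    dx, dy = target[0] - q[0], target[1] - q[1]
    mu_r = 0.8 * math.hypot(dx, dy)  # undershoot on purpose: one step is not enough
    angle = math.atan2(dy, dx) + rng.gauss(0.0, 0.05)  # small direction noise (radians)
    return mu_r, (math.cos(angle), math.sin(angle))

def irs(predict, n_hyp, n_rounds, width, height, rng):
    """Iterative refinement sampling: N uniform hypotheses, R update rounds, then the mean."""
    hyps = [(rng.uniform(0.0, width), rng.uniform(0.0, height)) for _ in range(n_hyp)]
    for _ in range(n_rounds):
        updated = []
        for qx, qy in hyps:
            mu_r, (tx, ty) = predict((qx, qy))
            norm = math.hypot(tx, ty)
            updated.append((qx + mu_r * tx / norm, qy + mu_r * ty / norm))
        hyps = updated
    # Final estimate is the mean of all hypotheses after the last round.
    return (sum(q[0] for q in hyps) / n_hyp, sum(q[1] for q in hyps) / n_hyp)

rng = random.Random(0)
target = (500.0, 100.0)
predict = lambda q: toy_displacement_head(q, target, rng)

for rounds in (1, 3, 5):
    est = irs(predict, n_hyp=10, n_rounds=rounds, width=640.0, height=360.0, rng=rng)
    err = math.hypot(est[0] - target[0], est[1] - target[1])
    print(f"R={rounds}: error = {err:.3f} px")
```

Note that `n_hyp` and `n_rounds` are plain arguments to the sampler, which is why \(N\) and \(R\) can be changed at inference time without retraining.
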

Loss & Training

- During training, hypothesis positions \(\mathbf{q}_0\) are uniformly sampled on the satellite image, and the displacements \(\mathbf{u}_{gt}\) to the ground-truth position are computed as regression targets
- Distance NLL loss: \(\mathcal{L}_r = \frac{1}{2}\left(\frac{(r_{gt} - \mu_r)^2}{\sigma_r^2} + \log \sigma_r^2\right)\)
  - The inverse-variance weighting penalizes confident errors (a small \(\sigma_r^2\) with a large residual), which keeps the variance from collapsing to zero; the \(\log \sigma_r^2\) term penalizes large variance, so the network cannot discount its errors by inflating the predicted uncertainty
- Direction AngMF loss: \(\mathcal{L}_\theta = -\log(\kappa^2+1) + \kappa \cdot \cos^{-1}(\mu_\theta^T \cdot \theta_{gt}) + \log(1+\exp(-\kappa\pi))\)
  - Directly penalizes the angular error \(\cos^{-1}(\mu_\theta^T \cdot \theta_{gt})\), which is more robust than an L2 loss
- Total loss: \(\mathcal{L} = \mathcal{L}_r + \mathcal{L}_\theta\)
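
The two loss terms above can be written out directly from their formulas. The sketch below is a per-sample, scalar version for illustration only (the function names are assumptions, and a real training loop would use batched tensors with automatic differentiation). The usage lines probe how each term reacts to over- and under-confidence.

```python
import math

def gaussian_nll(r_gt, mu_r, var_r):
    """Distance loss: 0.5 * ((r_gt - mu_r)^2 / var_r + log var_r)."""
    return 0.5 * ((r_gt - mu_r) ** 2 / var_r + math.log(var_r))

def angmf_loss(mu_theta, theta_gt, kappa):
    """Direction loss: -log(k^2 + 1) + k * acos(mu^T theta_gt) + log(1 + exp(-k * pi)).

    Both directions are assumed to be unit 2D vectors.
    """
    dot = mu_theta[0] * theta_gt[0] + mu_theta[1] * theta_gt[1]
    dot = max(-1.0, min(1.0, dot))  # clamp so rounding error cannot leave acos's domain
    angle = math.acos(dot)
    return -math.log(kappa ** 2 + 1.0) + kappa * angle + math.log(1.0 + math.exp(-kappa * math.pi))

def total_loss(r_gt, mu_r, var_r, mu_theta, theta_gt, kappa):
    """Unweighted sum of the two terms, as in the total loss formula."""
    return gaussian_nll(r_gt, mu_r, var_r) + angmf_loss(mu_theta, theta_gt, kappa)

# Residual of 2.0: the loss is smallest when the variance matches the squared residual (4.0).
for var in (0.1, 4.0, 100.0):
    print(f"var={var}: L_r = {gaussian_nll(10.0, 8.0, var):.3f}")

# High concentration is rewarded when the direction is right and punished when it is wrong.
aligned, orthogonal = (1.0, 0.0), (0.0, 1.0)
for kappa in (1.0, 10.0):
    print(f"kappa={kappa}: aligned = {angmf_loss(aligned, aligned, kappa):.3f}, "
          f"90 deg off = {angmf_loss(aligned, orthogonal, kappa):.3f}")
```

For a fixed residual, setting the derivative of \(\mathcal{L}_r\) with respect to \(\sigma_r^2\) to zero gives \(\sigma_r^2 = (r_{gt} - \mu_r)^2\), which is why the middle variance value yields the lowest distance loss.
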

Key Experimental Results

Main Results

KITTI Dataset (Same-Area)

| Method | Mean (m) ↓ | Median (m) ↓ | FPS ↑ | Params (M) |
|---|---|---|---|---|
| FG2 | 0.75 | 0.52 | 4.20 | - |
| HC-Net | 0.80 | 0.50 | 25.00 | 11.21 |
| CCVPE | 1.22 | 0.62 | 24.00 | 57.40 |
| GeoFlow | 0.98 | 0.68 | 29.49 | 7.38 |

VIGOR Dataset

| Method | Same-Area Mean (m) ↓ | Cross-Area Mean (m) ↓ | FPS ↑ |
|---|---|---|---|
| FG2 | 2.18 | 2.74 | 3.60 |
| HC-Net | 2.65 | 3.35 | 20.00 |
| CCVPE | 3.60 | 4.97 | 18.00 |
| GeoFlow | 3.51 | 4.62 | 29.49 |

Ablation Study

Effect of IRS Iteration Rounds (KITTI Cross-Area, N=10)

| Rounds R | Mean (m) | Median (m) | FPS |
|---|---|---|---|
| 1 | 10.69 | 9.95 | 32.55 |
| 3 | 8.47 | 5.88 | 31.23 |
| 5 | 8.42 | 5.60 | 29.49 |
| 10 | 8.41 | 5.59 | 26.23 |

Effect of IRS Hypothesis Count (KITTI Cross-Area, R=5)

| Hypotheses N | Mean (m) | FPS |
|---|---|---|
| 1 | 8.58 | 30.70 |
| 10 | 8.42 | 29.49 |
| 20 | 8.41 | 28.08 |

Single Inference vs. Full IRS

| Config | Mean (m) | Median (m) | Note |
|---|---|---|---|
| N=1, R=1 | 12.47 | 11.79 | Single inference baseline |
| N=10, R=5 | 8.42 | 5.60 | Full IRS, 32.5% mean error reduction |

Key Findings

- Inference-time scaling is effective: From R=1 to R=3 (N=10), mean error drops by 20.8% (10.69 m to 8.47 m) while FPS barely changes (32.55 to 31.23); beyond R=5 the gains saturate (8.42 m at R=5 vs. 8.41 m at R=10)
- High efficiency: GeoFlow has only 7.38M parameters (about 1/7.8 of CCVPE's 57.40M), uses 686 MiB of memory (1/6.9 of CCVPE), and runs about 7× faster than FG2 on KITTI (29.49 vs. 4.20 FPS)
- IRS is a critical component: Compared with single inference (N=1, R=1), full IRS (N=10, R=5) cuts the mean error by 32.5% (12.47 m to 8.42 m) and more than halves the median error (11.79 m to 5.60 m)
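
The percentages and ratios quoted in these findings follow from the table values above. A short check, using only numbers that appear in the tables (the helper name `relative_reduction` is illustrative):

```python
def relative_reduction(before, after):
    """Fractional error reduction when going from `before` to `after`."""
    return (before - after) / before

r1_to_r3 = relative_reduction(10.69, 8.47)    # mean error, R=1 vs. R=3 (N=10)
irs_mean = relative_reduction(12.47, 8.42)    # mean error, single inference vs. full IRS
irs_median = relative_reduction(11.79, 5.60)  # median error, single inference vs. full IRS
param_ratio = 57.40 / 7.38                    # CCVPE parameters / GeoFlow parameters
speedup = 29.49 / 4.20                        # GeoFlow FPS / FG2 FPS on KITTI

print(f"{r1_to_r3:.1%}")     # 20.8%
print(f"{irs_mean:.1%}")     # 32.5%
print(f"{irs_median:.1%}")   # 52.5%
print(f"{param_ratio:.1f}")  # 7.8
print(f"{speedup:.1f}")      # 7.0
```
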

Highlights & Insights

- Paradigm shift: Reformulates FG-CVG as learning a probabilistic displacement field plus iterative hypothesis refinement, which differs from both the matching and the regression paradigms
- Efficiency by design: Visual features are computed only once and IRS iterations run only lightweight MLPs, showing that an iterative method can still run in real time
- Inference-time scaling is observed for the first time in FG-CVG, analogous to test-time compute scaling in LLMs
- Principled probabilistic modeling: A Gaussian for distance and a vMF for direction, with uncertainty learned through the NLL losses rather than supervised directly
- Multi-hypothesis consensus: Similar to particle filtering, averaging many converged hypotheses suppresses the effect of visual ambiguity

Limitations & Future Work

- An absolute accuracy gap remains: On VIGOR Cross-Area, the mean error is 4.62 m vs. 2.74 m for FG2, so the lightweight design costs some accuracy
- Only 2-DoF is handled: Heading estimation is not addressed; orientation is assumed known (e.g., from an IMU or compass)