PiLoT: Neural Pixel-to-3D Registration for UAV-based Ego and Target Geo-localization¶

Conference: CVPR 2026
Paper: CVF Open Access
Code: None (Project page: https://nudt-sawlab.github.io/PiLoT/)
Area: Remote Sensing / UAV Visual Localization / 3D Vision
Keywords: UAV Localization, Pixel-to-3D Registration, GNSS-denied, Pose Optimization, Target Geo-localization

TL;DR¶

PiLoT unifies "UAV ego-localization + arbitrary target geo-localization" into a single problem: "pixel-to-3D registration between real-time video frames and georeferenced 3D maps." Using a dual-thread engine, a lightweight network trained on a million-scale synthetic dataset, and a Joint Neural-Guided Optimizer (JNGO), it achieves a median error of 1.37 m and 25+ FPS on Jetson Orin under GNSS/IMU-denied conditions.

Background & Motivation¶

Background: For UAVs to "know their own location + know the location of what they see," the mainstream approach involves two decoupled pipelines: ego-localization using VIO (Visual-Inertial Odometry) fused with GNSS, and target localization using additional active sensors such as laser rangefinders (e.g., DJI Matrice 4 series).

Limitations of Prior Work: This paradigm has two critical flaws. First, it depends heavily on GNSS and fails immediately in signal-jammed or denied environments. Second, laser-based target measurement is expensive and bulky, measures only one point at a time, and cannot geo-localize arbitrary pixels in the frame. Even in pure vision scenarios, VIO/SLAM is locally smooth but accumulates drift during long-endurance flights because it lacks global references.

Key Challenge: UAV localization is constrained by an impossible triangle of "accuracy, robustness, and real-time performance." Drift-free accuracy requires alignment with global maps; robustness requires handling drastic appearance changes (day/night, seasonal) and aggressive 6-DoF maneuvers (inter-frame displacements exceeding 10 m / 10°, beyond the convergence basin of standard optimizers); and real-time performance is often hampered by learning-based dense matchers limited by onboard compute.

Goal: To simultaneously provide 6-DoF UAV poses and map any query pixel \(\mathbf{u}=(u,v)^\top\) to real-world coordinates (lon, lat, alt) without GNSS or IMU.

Key Insight: The authors advocate for a paradigm shift—moving away from sensor stacking and instead redefining ego and target localization as a single pixel-to-3D registration problem. By continuously registering real-time video streams to a global 3D map (e.g., Google Earth), the system naturally recovers the 6-DoF pose. Arbitrary pixel geo-coordinates are then derived via the pose and depth, solving both tasks at once.

Core Idea: Replace the decoupled "GNSS + Laser" sensor scheme with "registration in feature space between video frames and georeferenced 3D maps." A specific component is designed for each side of the impossible triangle (Dual-threading for real-time, synthetic data for robust generalization, and JNGO for accuracy under aggressive motion).

Method¶

Overall Architecture¶

PiLoT takes a georeferenced 3D map \(M\), a monocular video stream \(\{I^q_i\}\) with known intrinsics, and a coarse initial pose prior \(\tilde{T}_{\text{init}}\) as inputs. It outputs the 6-DoF pose \(\hat{T}_i\) per frame and 3-DoF geo-coordinates for any target pixel. The system is driven by two parallel threads: The Rendering Thread predicts a reference pose via Kalman filtering, renders a synthetic reference view at that pose, and back-projects 3D geographic anchors to form a reference bundle. The Localization Thread extracts multi-scale features for each new query frame and uses the JNGO optimizer to align the query frame with these geographic anchors in feature space to solve for a globally consistent pose, which is then fed back to the rendering thread for the next frame.

The key insight is that, once the ego-pose is solved, casting a camera ray through any pixel into the 3D map and querying the depth along that ray yields the pixel's geo-coordinates, which makes target localization a "free byproduct" of ego-localization.

%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
    A["Input: Monocular frames +<br/>Georeferenced 3D Map"] --> B["UAV-specific Feature Extraction<br/>Synthetic Training · Lightweight Net"]
    A --> C["Dual-thread Engine<br/>Rendering: KF Prediction + Synthetic View<br/>Back-project 3D Geo-anchors B_i"]
    C -->|Reference bundle B_i| D["JNGO Optimizer<br/>Rotation-aware Sampling → Parallel Refinement → Motion-constrained Selection"]
    B -->|Multi-scale Features + Uncertainty| D
    D --> E["6-DoF Ego-pose T̂"]
    E -->|Ray casting per pixel<br/>Depth query| F["Target 3-DoF Geo-coords<br/>(lon, lat, alt)"]
    E -.->|Feedback pose for B_(i+1)| C

Key Designs¶

1. Dual-thread Engine: Decoupling rendering from localization for real-time, drift-free performance

A naive approach would be a serial "render reference view \(\rightarrow\) refine pose" loop, but the strong temporal dependency between rendering and optimization creates a bottleneck. PiLoT adopts a decoupled parallel architecture: the rendering thread uses a constant-velocity Kalman filter to predict a reference pose \(\hat{T}_{i|i-1}\) from the previous estimate \(\hat{T}_{i-1}\), renders a new reference view \((I^r_i, D^r_i)\), and back-projects \(N\) depth-valid pixels into the world frame to obtain 3D geographic anchors:

\[\mathbf{P}_{i,j}^{W} = \hat{\mathbf{T}}_{i|i-1}\left( D_{i}^{r}(\mathbf{p}^r_{i,j}) \cdot \mathbf{K}^{-1}\mathbf{p}^r_{i,j} \right)\]

This bundle \(\mathcal{B}_i := (I_i^{r},\, \hat{\mathbf{T}}_{i|i-1}, \{\mathbf{P}_{i,j}^{W}\}_{j=1}^{N})\) is passed to the localization thread. Each query frame is constrained by "dynamic geo-anchors" anchored to the global map, ensuring no drift over time. Another clever design is the one-to-many strategy: instead of rendering multiple views for fast motion, PiLoT renders a single reference view and allows a "swarm" of pose hypotheses to be refined against this shared rendering, achieving a large search range with minimal rendering cost.
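As a rough illustration of the back-projection above, the following NumPy sketch lifts depth-valid reference pixels to world-frame anchors. The function name `backproject_to_world`, the camera-to-world pose convention, the pinhole intrinsics, and every numeric value are assumptions made for illustration; none of them come from the PiLoT code.

```python
import numpy as np


def backproject_to_world(pixels, depth, K, T_wc):
    """Lift pixels with known depth to world-frame 3D points.

    pixels: (N, 2) array of (u, v) coordinates in the rendered reference view.
    depth:  (N,) depth along the camera z-axis at each pixel, in metres.
    K:      (3, 3) pinhole intrinsics.
    T_wc:   (4, 4) camera-to-world pose (assumed convention).
    """
    pixels = np.asarray(pixels, dtype=float)
    depth = np.asarray(depth, dtype=float)
    ones = np.ones((pixels.shape[0], 1))
    p_h = np.hstack([pixels, ones])            # homogeneous pixels, (N, 3)
    rays = p_h @ np.linalg.inv(K).T            # K^{-1} p, one ray per row
    pts_cam = rays * depth[:, None]            # D(p) * K^{-1} p
    pts_cam_h = np.hstack([pts_cam, ones])     # homogeneous camera points, (N, 4)
    return (pts_cam_h @ T_wc.T)[:, :3]         # apply the pose, drop the w column


# Hypothetical camera: 500 px focal length, principal point at (256, 256).
K = np.array([[500.0, 0.0, 256.0],
              [0.0, 500.0, 256.0],
              [0.0, 0.0, 1.0]])
T_wc = np.eye(4)
T_wc[:3, 3] = [10.0, 20.0, 100.0]              # camera centre in the world frame

anchors = backproject_to_world([[256.0, 256.0], [356.0, 256.0]],
                               [50.0, 50.0], K, T_wc)
print(anchors)
```

Applied to a query pixel with the estimated pose instead of the predicted one, the same operation gives a target's world position, which is why target geo-localization follows directly from the ego-pose.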

2. Million-scale Synthetic Dataset: Geometric supervision for "3D-grounded" features and zero-shot sim-to-real

UAV localization requires lightweight networks that remain discriminative under perspective and lighting changes. However, existing datasets lack dense depth or precise poses for sequential geometric supervision. The authors built an automated AirSim–Cesium–Unreal pipeline to simulate UAV trajectories over photorealistic global terrain, rendering over 1.1 million RGB-D pairs (82 regions, 650 km flight) with absolute poses and per-pixel depth. A three-level feature pyramid is extracted at 1/4, 1/2, and 1 resolutions (backbone: MobileOne-S0 with a compact U-Net decoder, channel width \(C=32\)). Training follows a direct alignment paradigm, minimizing the reprojection error between ground-truth projections \(\mathbf{p}_j^q\) and estimated projections \(\tilde{\mathbf{p}}_j^q\):

\[\mathcal{L} = \sum_{j} \rho_B\left( \left\| \mathbf{p}_j^q - \tilde{\mathbf{p}}_j^q \right\|_2^2 \right)\]

where \(\rho_B(\cdot)\) is the Barron robust loss. Since the supervision signal is "geometric consistency" rather than appearance, the network learns features invariant to photometric changes and grounded in 3D structures, enabling zero-shot generalization to real UAV images after training only on synthetic data.
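To make the loss concrete, here is a minimal sketch of Barron's general robust loss applied to squared reprojection errors. The shape parameter `alpha` and scale `c` are placeholders, since the summary does not state which values PiLoT uses, and the function names are hypothetical.

```python
import numpy as np


def barron_loss(sq_err, alpha=1.0, c=1.0):
    """Barron's general robust loss, evaluated on squared errors.

    sq_err: squared residual norm ||p - p_tilde||^2 (scalar or array).
    alpha:  shape parameter (2 = L2, 1 = Charbonnier, 0 = Cauchy); assumed value.
    c:      scale of the quadratic region, in pixels; assumed value.
    """
    z = np.asarray(sq_err, dtype=float) / (c * c)
    if alpha == 2.0:
        return 0.5 * z                          # limit case: plain L2
    if alpha == 0.0:
        return np.log1p(0.5 * z)                # limit case: Cauchy / Lorentzian
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)


def reprojection_loss(p_gt, p_est, alpha=1.0, c=1.0):
    """Sum of robustified squared distances between two (N, 2) point sets."""
    diff = np.asarray(p_gt, dtype=float) - np.asarray(p_est, dtype=float)
    sq_err = np.sum(diff * diff, axis=1)
    return float(np.sum(barron_loss(sq_err, alpha, c)))


# Two anchors: one projected exactly, one off by (2, 2) pixels.
loss = reprojection_loss([[10.0, 10.0], [50.0, 50.0]],
                         [[10.0, 10.0], [52.0, 52.0]])
print(loss)
```

With `alpha=1` the penalty grows roughly linearly for large errors, so a few badly projected anchors do not dominate the sum.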

3. JNGO: Joint Neural-Guided Optimizer for large-displacement convergence under aggressive maneuvers

Aggressive UAV maneuvers cause large inter-frame displacements where traditional gradient optimizers stall in local minima. JNGO merges global exploration and local exploitation in three steps:

First, Rotation-aware Hypothesis Generation: Because pixel displacement is more sensitive to rotation than to translation, larger sampling ranges are assigned to pitch and yaw. Rotational perturbations are sampled uniformly from an anisotropic box \(\mathcal{B}_r = [-\alpha_{\text{pitch}},\alpha_{\text{pitch}}]\times[-\alpha_{\text{yaw}},\alpha_{\text{yaw}}]\), while translation perturbations are sampled from a Gaussian \(\delta\mathbf{t}_m\sim\mathcal{N}(\boldsymbol{\mu}_t,\boldsymbol{\Sigma}_t)\) based on KF predictions, generating \(M\) hypotheses (e.g., \(M=144\)).
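The sampling step can be sketched as follows. The half-widths of the pitch/yaw box, the translation mean and covariance, and the function name `sample_hypotheses` are illustrative assumptions; only \(M=144\) and the choice of distributions come from the description above.

```python
import numpy as np


def sample_hypotheses(mu_t, sigma_t, alpha_pitch_deg, alpha_yaw_deg, m=144, seed=0):
    """Draw m pose perturbations: uniform pitch/yaw, Gaussian translation.

    Returns (rot, trans) with rot of shape (m, 2) in degrees, ordered as
    (pitch, yaw), and trans of shape (m, 3) in metres.
    """
    rng = np.random.default_rng(seed)
    half = np.array([alpha_pitch_deg, alpha_yaw_deg], dtype=float)
    rot = rng.uniform(-half, half, size=(m, 2))              # anisotropic box B_r
    trans = rng.multivariate_normal(mu_t, sigma_t, size=m)   # N(mu_t, Sigma_t)
    return rot, trans


# Assumed values: a 10 x 6 degree half-width box and a 2 m isotropic spread.
rot, trans = sample_hypotheses(mu_t=np.zeros(3),
                               sigma_t=np.eye(3) * 2.0 ** 2,
                               alpha_pitch_deg=10.0,
                               alpha_yaw_deg=6.0)
print(rot.shape, trans.shape)
```

Each row pair would then be composed with the Kalman-predicted pose to form one hypothesis \(\tilde{T}_m\) for the refinement step.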

Second, Neural-Guided Parallel Refinement: Each hypothesis \(\tilde{T}_m\) undergoes coarse-to-fine Levenberg–Marquardt optimization (CUDA-parallelized). At pyramid level \(\ell\), the residual between query and reference features:

\[\mathbf{r}^{(\ell)}_{j,m} = \mathbf{f}^q_{\ell}\left( \pi\left( \mathbf{K}_{\ell}, \tilde{\mathbf{T}}_{m}^{-1}, \mathbf{P}_j^W \right) \right) - \mathbf{f}^r_{\ell}(\mathbf{p}^r_j)\]

is minimized using weights \(w_\ell(j)\) based on uncertainty, with LM iterating on \(SE(3)\): \((\mathbf{J}^\top\mathbf{W}\mathbf{J}+\lambda\mathbf{I})\Delta\boldsymbol{\xi}=-\mathbf{J}^\top\mathbf{W}\mathbf{r}\).
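A single damped update of this form can be written in a few lines of NumPy. This is a generic Levenberg-Marquardt step under the assumption that the residuals are stacked into one vector with one weight per row; it is not the CUDA-parallel implementation described in the paper, and `lm_step` is a hypothetical name.

```python
import numpy as np


def lm_step(J, r, w, lam):
    """Solve (J^T W J + lam * I) dxi = -J^T W r for the 6-DoF update dxi.

    J:   (n, 6) Jacobian of the stacked residuals w.r.t. the pose increment.
    r:   (n,) stacked residual vector.
    w:   (n,) non-negative per-residual weights, i.e. W = diag(w).
    lam: damping factor; larger values shrink the step.
    """
    JT_W = J.T * w                                   # J^T W without forming diag(w)
    H = JT_W @ J + lam * np.eye(J.shape[1])          # damped normal matrix
    g = JT_W @ r                                     # weighted gradient term
    return np.linalg.solve(H, -g)


# Toy problem: identity Jacobian, so the effect of damping is easy to read off.
J = np.eye(6)
r = np.arange(1.0, 7.0)
dxi = lm_step(J, r, w=np.ones(6), lam=1.0)
print(dxi)
```

In the toy case the undamped step would be \(-\mathbf{r}\); with \(\lambda = 1\) it is halved, which shows how damping trades step size for stability.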

Third, Motion-constrained Selection: The best hypothesis is chosen using the feature cost plus a physical motion prior that penalizes deviation from the Kalman-predicted trajectory \(\hat{T}_{\text{pred}}\):

\[\mathcal{C}_{\text{total}}^{(m)} = \mathcal{C}_{\text{photo}}^{(m,\ell=2)} + \lambda\,\|\log(\hat{\mathbf{T}}_{\text{pred}}^{-1}\tilde{\mathbf{T}}^{'}_{m})^\vee\|_2^2\]

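Here \(\lambda\) weights the motion prior; it is a separate hyperparameter from the LM damping factor of the same name above. The combination of "stochastic sampling + gradient refinement + motion prior" keeps localization stable even under inter-frame displacements of 10 m / 10°.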
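The selection rule can be sketched as below, with a closed-form \(SE(3)\) logarithm. The helper names, the value of \(\lambda\), and the toy costs are assumptions; the sketch also mixes metres and radians in one squared norm, exactly as the formula is written, and the logarithm is only intended for rotations well below 180°.

```python
import numpy as np


def hat(w):
    """Skew-symmetric matrix such that hat(w) @ v == np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])


def se3_log(T):
    """Map a 4x4 rigid transform to its 6-vector (translation part, rotation part)."""
    R, t = T[:3, :3], T[:3, 3]
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    skew = 0.5 * np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-8:                                  # first-order small-angle branch
        omega = skew
        V_inv = np.eye(3) - 0.5 * hat(omega)
    else:
        omega = skew * theta / np.sin(theta)          # axis scaled by the angle
        W = hat(omega)
        coeff = (1.0 - theta * np.sin(theta) / (2.0 * (1.0 - cos_theta))) / theta ** 2
        V_inv = np.eye(3) - 0.5 * W + coeff * (W @ W)
    return np.concatenate([V_inv @ t, omega])


def select_best(feature_costs, hypotheses, T_pred, lam):
    """Return (index, totals): feature cost plus lam * ||log(T_pred^-1 T_m)||^2."""
    T_pred_inv = np.linalg.inv(T_pred)
    totals = np.array([c + lam * np.sum(se3_log(T_pred_inv @ T) ** 2)
                       for c, T in zip(feature_costs, hypotheses)])
    return int(np.argmin(totals)), totals


# Hypothesis 1 has the lower feature cost but sits 5 m from the prediction.
T_far = np.eye(4)
T_far[:3, 3] = [3.0, 4.0, 0.0]
best, totals = select_best([1.0, 0.5], [np.eye(4), T_far], np.eye(4), lam=0.1)
print(best, totals)
```

In this toy case the prior overrides the slightly better feature cost of the distant hypothesis, which is the behaviour the motion constraint is meant to provide.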

Loss & Training¶

End-to-end training uses the geometric reprojection loss \(\mathcal{L}\) with \(N=500\) anchors. The network is optimized with Adam (lr \(=10^{-3}\)) for 30 epochs on 8× RTX 4090 GPUs. Training simulates initialization uncertainty by adding 5-15 m of translation noise and 5-15° of rotation noise. Data augmentation includes Fourier high-frequency noise and photometric jitter.
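One plausible way to generate such initialization noise is sketched below. The summary gives only the ranges, so the sampling scheme here (uniform magnitude, uniformly random direction and rotation axis) and the helper names are assumptions rather than the authors' procedure.

```python
import numpy as np


def random_unit_vector(rng):
    """Uniformly random direction on the unit sphere."""
    v = rng.normal(size=3)
    return v / np.linalg.norm(v)


def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about unit vector `axis`."""
    A = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * A + (1.0 - np.cos(angle)) * (A @ A)


def perturb_pose(T, rng, t_range=(5.0, 15.0), r_range_deg=(5.0, 15.0)):
    """Offset a 4x4 pose by a random translation (m) and rotation (deg)."""
    dt = random_unit_vector(rng) * rng.uniform(*t_range)
    angle = np.deg2rad(rng.uniform(*r_range_deg))
    dR = rodrigues(random_unit_vector(rng), angle)
    T_noisy = T.copy()
    T_noisy[:3, :3] = dR @ T[:3, :3]        # rotate the orientation
    T_noisy[:3, 3] = T[:3, 3] + dt          # shift the position
    return T_noisy


rng = np.random.default_rng(0)
T_noisy = perturb_pose(np.eye(4), rng)
print(np.linalg.norm(T_noisy[:3, 3]))
```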

Key Experimental Results¶

Main Results: Ego-localization (Tab. 2)¶

Shared map, 512 px input; median errors in m / °; R@k is recall (%) within k m and k° (only R@1 is listed below); Comp. is completion rate.

Dataset	Method	FPS↑	Med m↓	Med °↓	R@1 (m,°)↑	Comp.↑
SynthCity-6 (Synth)	Render2Loc(LoFTR)	2.0	0.49	0.04	76.5	100.0
SynthCity-6 (Synth)	Render2Loc(RoMaV2)	0.8	0.47	0.04	77.2	100.0
SynthCity-6 (Synth)	PiLoT	28.0	0.46	0.03	80.4	100.0
UAVScenes (Real, Zero-shot)	Render2Loc(LoFTR)	2.0	1.62	0.52	23.2	100.0
UAVScenes (Real, Zero-shot)	PiLoT	28.0	1.27	0.47	25.5	100.0
UAVD4L-2yr (Real, 2yr gap)	Render2Loc(RoMaV2)	0.8	1.05	0.97	43.2	100.0
UAVD4L-2yr (Real, 2yr gap)	PiLoT	28.0	0.92	0.89	45.8	100.0

PiLoT achieves the lowest median error and highest recall across all datasets while being the fastest (28 FPS). Render2Loc(LoFTR/RoMaV2), while accurate, runs at only 0.8–2 FPS. Real-world deployments were zero-shot without fine-tuning, including the UAVD4L-2yr dataset with significant seasonal/lighting gaps.

Main Results: Target Geo-localization (Tab. 3)¶

R@k is the ratio of targets with 3D distance error within k meters.

Method	Single-target (Real) R@1/3/5	Multi-target (Synth) R@1/3/5
Render2ORB	72.13 / 84.59 / 89.74	79.51 / 91.04 / 93.28
PixLoc	83.37 / 87.29 / 91.85	86.15 / 91.88 / 93.91
Render2Loc	87.62 / 92.60 / 96.25	89.03 / 93.15 / 96.07
PiLoT	90.81 / 94.32 / 96.85	93.74 / 95.56 / 98.19

Superior ego-localization accuracy directly translates to better target localization across both scenarios.
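The R@k metric used in the target table is simple to compute. The sketch below assumes predicted and ground-truth target positions are already expressed in a common metric frame (for example a local east-north-up frame), which is an assumption about preprocessing rather than something stated in the summary; `recall_at_k` is a hypothetical name.

```python
import numpy as np


def recall_at_k(pred_xyz, gt_xyz, thresholds=(1.0, 3.0, 5.0)):
    """Percentage of targets whose 3D distance error is within each threshold (m)."""
    diff = np.asarray(pred_xyz, dtype=float) - np.asarray(gt_xyz, dtype=float)
    err = np.linalg.norm(diff, axis=1)
    return {k: 100.0 * float(np.mean(err <= k)) for k in thresholds}


# Four toy targets with errors of 0.5 m, 2 m, 4 m and 10 m.
gt = np.zeros((4, 3))
pred = np.array([[0.5, 0.0, 0.0],
                 [0.0, 2.0, 0.0],
                 [0.0, 0.0, 4.0],
                 [10.0, 0.0, 0.0]])
recalls = recall_at_k(pred, gt)
print(recalls)
```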

Ablation Study (Tab. 4)¶

Recall(%) @ 1m/1° under different initialization noise budgets:

Configuration	w/ 3m,3°	w/ 5m,5°	w/ 10m,10°	Explanation
Off-the-shelf backbone	4.2	0.0	0.0	General features fail in UAV domain
+ Domain training	51.4	43.2	15.2	Synthetic training is the foundation
+ Rotation-aware sampling	83.8	78.9	70.6	Massive gain for large noise
+ Motion reg. (Full)	84.3	84.3	84.2	Equalizes performance across budgets

Training Data	3m,3°	5m,5°	10m,10°
Synth (No lighting/weather)	63.5	62.4	61.6
MegaDepth Only	69.9	69.5	68.7
Ours (Full Synth)	84.3	84.3	84.2

Key Findings¶

Domain-specific training is mandatory: Off-the-shelf backbones fail in the UAV domain (4.2% recall @ 3m,3°). Training on the proposed synthetic data raises recall to 51.4%.
Rotation-aware sampling is critical for large displacements: It raises recall from 15.2% to 70.6% under the 10m/10° budget, which supports the observation that pixel displacement is more sensitive to rotation and justifies the wider pitch/yaw sampling ranges.
Motion regularization provides stability: It levels recall at about 84% across all noise budgets, lifting the difficult 10 m / 10° case from 70.6% to 84.2%.
Photometric diversity in synth data matters: Synthetic data with diverse lighting and weather (84.3) outperforms both real MegaDepth (69.9) and simple synthetic data (63.5), suggesting that geometric supervision + visual diversity bridges the sim-to-real gap more effectively than real-world images from other domains.

Highlights & Insights¶

Task Unification: Target localization becomes a "free" operation via ray casting once the 6-DoF pose is solved, eliminating GNSS and laser rangefinder dependencies.
One-to-Many Strategy: Sharing a single rendering among multiple hypotheses shifts the search cost from "rendering" to "parallel optimization," which is key for real-time performance.
Sampling from Domain Heuristics: Using anisotropic bounding boxes for sampling based on motion sensitivity (pitch/yaw) is an insight that transfers to other registration tasks with strong motion priors.
Geometric vs. Photometric Supervision: Forcing the network to learn features grounded in underlying 3D structures via geometric consistency enables training on synthetic data with zero-shot generalization.

Limitations & Future Work¶

Performance may degrade under extreme conditions (e.g., thick fog) or significant calibration errors.
Dependence on high-fidelity 3D mesh maps: Currently limited to areas with mesh data; the authors plan to extend this to DOM (digital orthophoto map) and DEM (digital elevation model) representations for rural and urban coverage.
Zero-shot validation scope: While successful, more stress testing on camera/terrain distributions significantly different from the training set is needed.
Target localization currently assumes "point-like targets" with known 2D pixel positions; large-scale or non-rigid targets are not yet addressed.
The first frame still requires a coarse pose prior (provided via coarse GNSS/IMU or ground truth).

Comparison with Related Work¶

vs VIO/SLAM (e.g., ORB-SLAM3): SLAM is locally smooth but drifts; PiLoT is globally consistent and drift-free per frame at the cost of requiring a pre-existing map.
vs 2D Satellite Registration: Early methods are limited to 3-DoF (lat, lon, yaw) and nadir views; PiLoT recovers full 6-DoF via 3D maps.
vs Matching-based (LoFTR/RoMaV2): These are accurate but run at ~1 FPS on onboard platforms; PiLoT reaches 28 FPS via direct alignment and parallel hypotheses.
vs PixLoc: PixLoc is sensitive to initialization and generalizes poorly in the UAV domain; PiLoT closes this gap via large-scale grounded training and JNGO.

Rating¶

Novelty: ⭐⭐⭐⭐ Clear paradigm shift by unifying dual localization tasks and designing specific components for the "impossible triangle."
Experimental Thoroughness: ⭐⭐⭐⭐ Solid zero-shot validation on multiple real datasets; comprehensive ablations for components and data.
Writing Quality: ⭐⭐⭐⭐ Clear narrative linking contributions to problems; excellent use of formulas and diagrams.
Value: ⭐⭐⭐⭐ Highly practical for GNSS-denied UAV operations; proven to run on embedded platforms.