PIDLoc: Cross-View Pose Optimization Network Inspired by PID Controllers¶

Conference: CVPR 2025
arXiv: 2503.02388
Code: None
Area: Autonomous Driving / Cross-View Localization
Keywords: Cross-View Localization, PID Controller, Pose Optimization, LiDAR, Satellite Image Matching

TL;DR¶

Inspired by PID controllers, this paper proposes PIDLoc, a cross-view pose optimization network. By integrating three branches—P (proportional, local feature discrepancy), I (integral, global multi-pose candidate aggregation), and D (derivative, gradient of feature discrepancy)—with a spatial-aware pose estimator, PIDLoc achieves robust and precise localization even under large initial pose errors.

Background & Motivation¶

Accurate localization is crucial for autonomous driving, yet GNSS signals are often blocked in environments such as urban canyons. Cross-view pose optimization estimates the vehicle pose directly by matching ground-view observations against a satellite image, which avoids the resolution limit imposed by discretizing the pose space into bins.

However, prior cross-view pose optimization methods face several key challenges:

They rely solely on the feature discrepancy at the current pose (analogous to a P controller), which provides neither global context nor fine-grained adjustment.
They are prone to local optima under large initial pose errors, especially in scenes with repetitive patterns such as buildings and trees.
They estimate a pose from each feature point independently and then average the results, ignoring the spatial relationships between features and producing inconsistent pose estimates.

These issues mirror the limitations of proportional-only control: a P controller alone is susceptible to steady-state error and oscillation, whereas adding the I (integral) and D (derivative) terms improves convergence and accuracy.

Method¶

Overall Architecture¶

PIDLoc utilizes a weight-shared U-Net to extract ground and satellite view feature maps. It establishes cross-view feature correspondences through LiDAR point cloud projections, subsequently generating multi-dimensional contextual features via the three PID branches. These features are then fed into a Spatial-aware Pose Estimator (SPE) to iteratively update the pose.

Key Design 1: PID Branches¶

Function: Extract multi-level context from the cross-view feature discrepancy \(e(\mathbf{P}) = \mathbb{F}_s[\mathcal{I}_s(\mathbf{P})] - \mathbb{F}_g[\mathcal{I}_g]\)
Mechanism: The P (proportional) branch provides the local feature discrepancy at the current pose \(w_p = k_p \cdot e(\mathbf{P})\); the I (integral) branch searches multi-pose candidates over a 3-DoF spatial grid and concatenates their feature discrepancies \(w_i = \text{concat}([k_i \cdot e(\mathbf{P}')\ \text{for}\ \mathbf{P}' \in \mathcal{P}^{cand}])\) to provide global context; the D (derivative) branch computes the gradient of the feature discrepancy with respect to the pose \(w_d = k_d \|\partial e(\mathbf{P})/\partial \mathbf{p}\|_2\) to capture fine-grained variations. These three are concatenated to form \(w(\mathbf{P}) = w_p \oplus w_i \oplus w_d\)
Design Motivation: Relying solely on the P signal easily leaves the optimization trapped in local optima under repetitive patterns; the I branch compares global candidates to avoid tracking failure; the D branch leverages feature gradients to achieve sub-pixel adjustments.
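The three branches can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's implementation: `sat_feature` is a hypothetical analytic stand-in for \(\mathbb{F}_s\), the pose is a 3-DoF tuple `(x, y, yaw)`, the candidate grid is a fixed 3 x 3 x 3 neighborhood with made-up step sizes, the gains `k_p`, `k_i`, `k_d` are plain scalars, and the D branch uses finite differences instead of an analytic Jacobian chain.

```python
import math

def sat_feature(u, v):
    """Hypothetical stand-in for sampling the satellite feature map F_s at (u, v)."""
    return (math.sin(0.3 * u), math.cos(0.2 * v))

def to_satellite(pose, pt):
    """Map a vehicle-frame 2D point into satellite coordinates for pose (x, y, yaw)."""
    x, y, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    return (x + c * pt[0] - s * pt[1], y + s * pt[0] + c * pt[1])

def discrepancy(pose, points, ground_feats):
    """e(P): satellite features sampled under pose P minus ground features, flattened."""
    e = []
    for pt, f_g in zip(points, ground_feats):
        f_s = sat_feature(*to_satellite(pose, pt))
        e.extend(a - b for a, b in zip(f_s, f_g))
    return e

def candidate_poses(pose, step_xy=2.0, step_yaw=0.05):
    """3 x 3 x 3 grid of candidate poses centred on the current pose (assumed layout)."""
    x, y, yaw = pose
    offs = (-1, 0, 1)
    return [(x + i * step_xy, y + j * step_xy, yaw + k * step_yaw)
            for i in offs for j in offs for k in offs]

def pid_features(pose, points, ground_feats, k_p=1.0, k_i=1.0, k_d=1.0, h=1e-4):
    e0 = discrepancy(pose, points, ground_feats)
    # P branch: scaled discrepancy at the current pose.
    w_p = [k_p * v for v in e0]
    # I branch: concatenated discrepancies over all candidate poses.
    w_i = [k_i * v
           for cand in candidate_poses(pose)
           for v in discrepancy(cand, points, ground_feats)]
    # D branch: per-element L2 norm of the forward-difference gradient w.r.t. (x, y, yaw).
    grads = []
    for axis in range(3):
        bumped = list(pose)
        bumped[axis] += h
        e1 = discrepancy(tuple(bumped), points, ground_feats)
        grads.append([(b - a) / h for a, b in zip(e0, e1)])
    w_d = [k_d * math.sqrt(sum(g[n] ** 2 for g in grads)) for n in range(len(e0))]
    # w(P) = w_p (+) w_i (+) w_d, realised here as list concatenation.
    return w_p + w_i + w_d

true_pose = (1.0, -2.0, 0.1)
points = [(5.0, 0.0), (0.0, 5.0), (-3.0, 4.0)]
ground_feats = [sat_feature(*to_satellite(true_pose, p)) for p in points]
w = pid_features((3.0, -2.0, 0.1), points, ground_feats)
print(len(w))  # 6 (P) + 27 * 6 (I) + 6 (D) = 174
```

In this toy setup the P part of the vector vanishes at the true pose, while the I part still carries the discrepancies of the 26 surrounding candidates, which is the kind of global context the I branch is meant to supply.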

Key Design 2: Spatial-aware Pose Estimator (SPE)¶

Function: Model the spatial relationships of the PID branch features to achieve consistent pose estimation.
Mechanism: Unlike existing methods that independently estimate the pose of each feature point and then average them, SPE utilizes channel-shared MLPs to model the local spatial relationships of PID branch features and jointly predicts the pose by embedding positional encoding into satellite coordinates.
Design Motivation: Independent estimation may converge to different local optima, causing inconsistency after averaging; SPE achieves more accurate and consistent pose estimation by explicitly modeling spatial dependencies.
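The averaging failure mode described above can be shown with a toy example. This illustrates only the problem that motivates SPE, not SPE itself; the per-point corrections are made-up numbers.

```python
from statistics import mean

# Hypothetical per-point lateral corrections in metres: four points agree on
# roughly +2.0, while two points on a repetitive facade lock onto an alias near -4.0.
per_point = [2.0, 2.1, 1.9, 2.0, -4.0, -4.1]

# Independent estimation followed by averaging.
averaged = mean(per_point)

# Distance between each point's own estimate and the averaged result.
residuals = [abs(d - averaged) for d in per_point]

print(round(averaged, 3))        # close to zero: almost no correction is applied
print(round(min(residuals), 3))  # yet every point disagrees with it by about 1.9 m or more
```

The averaged correction sits between the two groups and is supported by none of the points, which is why a joint estimate that takes spatial relationships into account can be more consistent.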

Key Design 3: Cross-View Visual Feature Extraction¶

Function: Establish robust feature correspondences between ground and satellite views.
Mechanism: A weight-shared U-Net extracts the ground and satellite feature maps \(\mathbb{F}_g\) and \(\mathbb{F}_s\). Corresponding features are sampled by projecting the LiDAR point cloud onto both views: onto the ground image via the camera intrinsics, and onto the satellite image via the current pose estimate \(\mathbf{P}\), matching the \(\mathcal{I}_g\) and \(\mathcal{I}_s(\mathbf{P})\) terms in \(e(\mathbf{P})\).
Design Motivation: LiDAR provides reliable depth, which avoids the depth ambiguity of a ground-plane homography, and sparse point features are better suited to precise matching than dense features.
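The two projections can be sketched as follows. All conventions here are assumptions made for illustration: a pinhole camera with x right, y down, z forward; a satellite image with u pointing east and v pointing south; a pose `(east, north, yaw)` in metres relative to the image centre; and the hypothetical function names `project_to_ground` and `project_to_satellite`.

```python
import math

def project_to_ground(pt_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x right, y down, z forward)."""
    x, y, z = pt_cam
    if z <= 0:
        return None  # the point is behind the camera and has no valid pixel
    return (fx * x / z + cx, fy * y / z + cy)

def project_to_satellite(pt_vehicle, pose, metres_per_pixel, centre_px):
    """Orthographic projection of a vehicle-frame (forward, left) point into the satellite image."""
    east0, north0, yaw = pose
    fwd, left = pt_vehicle
    c, s = math.cos(yaw), math.sin(yaw)
    # Rotate into the east/north frame, then translate by the vehicle position.
    east = east0 + c * fwd - s * left
    north = north0 + s * fwd + c * left
    # Image rows grow southward, so north maps to a decreasing v.
    u = centre_px[0] + east / metres_per_pixel
    v = centre_px[1] - north / metres_per_pixel
    return (u, v)

# A point 10 m ahead of the camera, 1 m to the right and 0.5 m below the optical axis.
print(project_to_ground((1.0, 0.5, 10.0), fx=700.0, fy=700.0, cx=600.0, cy=180.0))

# The same kind of point seen from above: vehicle at the image centre, heading north.
print(project_to_satellite((10.0, 0.0), (0.0, 0.0, math.pi / 2), 0.2, (256.0, 256.0)))
```

Only the satellite projection depends on the pose, so updating the pose estimate moves the sampled satellite features while the sampled ground features stay fixed.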

Loss & Training¶

Supervised learning is used, with end-to-end training driven by a regression loss between the predicted pose and the ground truth.
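The exact form of the regression loss is not given here. One common choice for a 3-DoF pose, shown purely as an assumed example, is an L1 penalty on translation plus a weighted penalty on the wrapped heading error; `pose_loss`, `wrap_angle`, and `yaw_weight` are hypothetical names.

```python
import math

def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi

def pose_loss(pred, gt, yaw_weight=1.0):
    """L1 translation error plus weighted absolute heading error (assumed loss form)."""
    dx = pred[0] - gt[0]
    dy = pred[1] - gt[1]
    dyaw = wrap_angle(pred[2] - gt[2])
    return abs(dx) + abs(dy) + yaw_weight * abs(dyaw)

print(pose_loss((1.0, 2.0, 0.1), (0.0, 0.0, 0.0)))
# Headings of +3.1 rad and -3.1 rad are nearly identical once wrapped.
print(pose_loss((0.0, 0.0, 3.1), (0.0, 0.0, -3.1)))
```

Wrapping the heading difference keeps the loss small for headings on either side of the ±π boundary.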

Key Experimental Results¶

Main Results: Cross-View KITTI Dataset¶

Method	Modality	Position Error (m) ↓	Orientation Error (°) ↓	Lateral Recall @1m (%) ↑
HighlyAccurate	RGB	7.41	1.92	-
Boosting	RGB	6.39	1.55	-
SIBCL	RGB+LiDAR	5.69	0.61	46.7
VFA	RGB	6.95	0.55	40.5
PIDLoc	RGB+LiDAR	4.96	0.40	56.4

Ablation Study: Contribution of Each Branch¶

Configuration	Position Error (m)	Orientation Error (°)
P only	5.69	0.61
P + I	5.32	0.48
P + D	5.41	0.45
P + I + D	5.15	0.42
P + I + D + SPE	4.96	0.40

Key Findings¶

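The position error reduction is stated as 37.8% relative to the previous best; note that the table values give 33.1% against HighlyAccurate (7.41 m to 4.96 m) and 12.8% against SIBCL (5.69 m to 4.96 m), so the stated figure likely refers to a different baseline or setting. The orientation error is reduced by 34.4% relative to SIBCL (0.61° to 0.40°).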
The I branch contributes the most under large initial pose errors (over a \(40\text{m} \times 40\text{m}\) region), effectively preventing the optimization from getting trapped by repetitive patterns.
The D branch provides more significant assistance for orientation estimation, leveraging feature gradients to achieve fine-grained adjustment.
Compared to independent estimation followed by averaging, SPE yields an additional error reduction of approximately 4% (5.15 m to 4.96 m in position and 0.42° to 0.40° in orientation in the ablation table).
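Relative reductions can be recomputed directly from the entries of the two tables. The helper name `reduction` is hypothetical; the numbers are the table values.

```python
def reduction(baseline, new):
    """Relative error reduction in percent when going from baseline to new."""
    return 100.0 * (baseline - new) / baseline

# Main results table, position error in metres.
print(round(reduction(7.41, 4.96), 1))  # vs. HighlyAccurate: 33.1
print(round(reduction(5.69, 4.96), 1))  # vs. SIBCL: 12.8

# Main results table, orientation error in degrees.
print(round(reduction(0.61, 0.40), 1))  # vs. SIBCL: 34.4

# Ablation table, effect of adding SPE to P + I + D.
print(round(reduction(5.15, 4.96), 1))  # position: 3.7
print(round(reduction(0.42, 0.40), 1))  # orientation: 4.8
```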

Highlights & Insights¶

Excellent analogy from PID controllers to deep learning: Mapping control theory concepts into the feature space, where the P, I, and D branches each have a clear physical interpretation.
I branch addresses repetitive pattern issues: Providing global context through multiple pose candidates, which is unattainable for methods purely based on the current pose.
D branch utilizes a differentiable projection chain: Calculating feature sensitivity through the complete Jacobian chain \(\partial e / \partial \mathbf{p}\).

Limitations & Future Work¶

The grid search of the I branch increases computational overhead, requiring a balance between the number of candidates and the search range.
The method relies on LiDAR data; its extensibility to pure-vision solutions remains to be verified.
Robustness under extreme weather conditions and illumination variations has not been fully evaluated.
The PID gain coefficients are learned parameters rather than manually tuned gains as in classical PID control, so the analogy calls for deeper theoretical analysis.

Related Work & Insights¶

SIBCL: The first work to utilize LiDAR depth for cross-view matching; it is equivalent to using only the P branch of PIDLoc.
VFA: Introduces top-down feature aggregation, yet is still confined to the current pose.
The concept of PID analogy can be extended to other visual localization tasks requiring iterative optimization.

Rating¶

⭐⭐⭐⭐: The PID-controller analogy is ingenious, and both the I and D branches are clearly motivated and experimentally validated. PIDLoc substantially outperforms previous methods on the KITTI dataset, but its reliance on LiDAR limits its scope of application.