# MinCD-PnP: Learning 2D-3D Correspondences with Approximate Blind PnP
- Conference: ICCV 2025
- arXiv: 2507.15257
- Code: https://github.com/anpei96/mincd-pnp-demo
- Area: 3D Vision
- Keywords: Image-to-point cloud registration, 2D-3D correspondences, PnP, Chamfer distance, multi-task learning
## TL;DR
This paper proposes MinCD-PnP, which reduces the computationally expensive Blind PnP to a problem of minimizing the Chamfer distance between 2D-3D keypoints via a triple approximation strategy. A lightweight multi-task learning module, MinCD-Net, is designed and integrated into existing I2P registration frameworks, achieving significant improvements in inlier ratio and registration recall under cross-scene and cross-dataset settings.
## Background & Motivation
Image-to-point cloud (I2P) registration is a fundamental task in computer vision, aiming to establish 2D-3D correspondences between image pixels and 3D points in a point cloud, followed by pose estimation via PnP algorithms. This task is widely applied in visual localization, navigation, SLAM, and 3D reconstruction.
Limitations of Prior Work: Mainstream deep learning-based I2P registration methods (e.g., P2-Net, 2D3D-MATR) establish correspondences through pixel-to-point feature matching, but this purely feature-level approach ignores the projective geometric constraints of 2D-3D correspondences (i.e., a valid correspondence ⟨q,p⟩ must satisfy \(q = \pi(Tp)\)), making it difficult to effectively reject outliers.
Problem with Differentiable PnP: To introduce geometric constraints, recent methods adopt differentiable PnP (e.g., EPro-PnP) to supervise correspondence learning. However, differentiable PnP is highly sensitive to noise and outliers in predicted correspondences — when the network inevitably generates incorrect correspondences, the PnP-estimated pose becomes unreliable, in turn hindering the effectiveness of correspondence learning.
Inspiration from Blind PnP: Blind PnP achieves robustness to noise and outliers by jointly optimizing the transformation matrix \(T\) and the correspondence matrix \(C\). However, its computational complexity is prohibitive (involving a Boolean matrix search of size \(M \times N\)), rendering it incompatible with gradient backpropagation in deep learning frameworks.
Core Problem: Can the robustness of Blind PnP be retained while reducing its computational complexity to make it applicable for end-to-end correspondence learning? This paper answers affirmatively: through a triple approximation strategy, Blind PnP is reformulated as an optimizable problem of minimizing Chamfer distance.
## Method
### Overall Architecture
MinCD-PnP consists of two components: (1) a theoretical triple approximation that reformulates Blind PnP into the MinCD-PnP problem; and (2) the MinCD-Net module at the implementation level, which solves MinCD-PnP via multi-task learning. MinCD-Net can be integrated in a plug-and-play fashion into existing I2P registration architectures (e.g., 2D3D-MATR).
### Key Designs
- **Approximation I: From Inlier Maximization to Chamfer Distance Minimization**
  - Function: Eliminates the Boolean correspondence matrix \(C\) in Blind PnP, reformulating inlier-count maximization as Chamfer distance minimization.
  - Mechanism: Through inequality derivation, it is shown that \(\max_{\mathbf{T},\mathbf{C}} \kappa(\mathbf{T},\mathbf{C}) \leq \max_{\mathbf{T}} \kappa^{\star}(\mathbf{T})\), where \(\kappa^{\star}\) is an upper bound in Chamfer distance form. Leveraging this bound, the original discrete combinatorial optimization is relaxed into continuous Chamfer distance minimization: \(L_{\text{Chamfer}}(\mathbf{T}|\mathbf{S}_I, \mathbf{S}_P) = \sum_{q \in S_I} \min_{p \in S_P} \|q - \pi(Tp)\|^2 + \sum_{p \in S_P} \min_{q \in S_I} \|q - \pi(Tp)\|^2\)
  - Design Motivation: The \(M \times N\) Boolean matrix search is computationally prohibitive, whereas Chamfer distance requires only a nearest-neighbor search per point, substantially reducing complexity while remaining differentiable.
- **Approximation II: Keypoint Sampling to Reduce Chamfer Distance Computation**
  - Function: Samples representative keypoints from the full pixel set and point cloud, reducing the Chamfer distance matrix from \(M \times N\) (~\(10^{11}\)) to \(M_0 \times N_0\) (~\(10^6\)).
  - Mechanism: Keypoint sets \(K_I\) and \(K_P\) are sampled from \(S_I\) and \(S_P\) respectively, and \(L_{\text{Chamfer}}(\mathbf{T}|\mathbf{K}_I, \mathbf{K}_P)\) replaces the full computation. With 2D/3D keypoint counts on the order of \(10^3\), the matrix size is reduced by a factor of roughly \(10^5\).
  - Design Motivation: Even after eliminating the Boolean matrix, computing Chamfer distance over all pixels and points remains infeasible; keypoint sampling preserves representativeness while drastically reducing computational cost.
- **Approximation III: 2D Keypoint-Guided 3D Keypoint Learning**
  - Function: Simplifies joint learning of \(K_I\) and \(K_P\) to learning only \(K_P\), with \(K_I\) provided by a pretrained detector.
  - Mechanism: The Shi-Tomasi corner detector extracts 2D keypoints \(K_I\) in advance. For each 2D keypoint \(q\), the nearest 3D point is found via feature matching: \(p_q^{\star} = \arg\min_{p} d(\mathbf{f}_q^{2D}, \mathbf{f}_p^{3D})\), and 3D keypoint learning is supervised by an IoU-style loss \(L_{\text{key}}(q)\). A threshold \(s_{th} = e^{-0.4}\) filters low-confidence matches.
  - Design Motivation: Jointly learning 2D and 3D keypoints with sufficient inliers is difficult, whereas mature 2D keypoint detectors already provide good spatial representativeness in image space.
- **MinCD-Net Multi-Task Learning Module**
  - Function: Unifies the three approximations of MinCD-PnP into a single end-to-end trainable lightweight module.
  - Mechanism: The final optimization objective is \(\varphi^{\star} = \arg\min_\varphi \left( L_{\text{corr}} + \lambda_1 \sum_{q \in K_I} L_{\text{key}}(q) + \lambda_2 \min_T L_{\text{Chamfer}}(T|K_I, K_P) \right)\). A Point Transformer encodes 2D/3D keypoint features; global features are then passed through an MLP to predict pose \(T\) (se(3)→SE(3)).
  - Design Motivation: The three loss terms work synergistically — \(L_{\text{corr}}\) handles feature-space matching, \(L_{\text{key}}\) ensures 3D keypoints approximate 2D keypoints, and \(L_{\text{Chamfer}}\) enforces global geometric consistency.
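To make the three approximations concrete, here is a minimal NumPy sketch (illustrative only, not the paper's implementation): `project`, `chamfer_2d`, `sample_keypoints`, `match_2d_to_3d`, and `mincd_loss` are hypothetical names, camera intrinsics are assumed to be identity, and features are assumed L2-normalized.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(T, points):
    """Project 3D points with pose T (3x4 [R|t]) under identity intrinsics."""
    R, t = T[:, :3], T[:, 3]
    cam = points @ R.T + t            # (N, 3) points in the camera frame
    return cam[:, :2] / cam[:, 2:3]   # perspective division -> (N, 2)

def chamfer_2d(q2d, p3d, T):
    """Approx. I: symmetric Chamfer distance between 2D keypoints and
    projected 3D keypoints, replacing the M x N Boolean matrix search."""
    proj = project(T, p3d)
    d = np.linalg.norm(q2d[:, None, :] - proj[None, :, :], axis=-1) ** 2
    return d.min(axis=1).sum() + d.min(axis=0).sum()

def sample_keypoints(pts, k):
    """Approx. II: subsample so the Chamfer matrix is M0 x N0, not M x N."""
    idx = rng.choice(len(pts), size=min(k, len(pts)), replace=False)
    return pts[idx]

def match_2d_to_3d(feat2d, feat3d, s_th=np.exp(-0.4)):
    """Approx. III: for each 2D keypoint, pick the 3D point with the most
    similar feature; keep only matches whose similarity exceeds s_th."""
    sim = feat2d @ feat3d.T                     # cosine similarity matrix
    idx = sim.argmax(axis=1)
    keep = sim[np.arange(len(idx)), idx] > s_th
    return idx, keep

def mincd_loss(L_corr, L_key, L_chamfer, lam1=0.2, lam2=1e-4):
    """Combined objective with the weights used in the paper's training."""
    return L_corr + lam1 * L_key + lam2 * L_chamfer
```

In the actual method the Chamfer term is evaluated at a pose predicted by the Point Transformer + MLP head and backpropagated through; here `T` is simply passed in to show the computation.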
## Loss & Training
The total loss consists of three terms:
- \(L_{\text{corr}}\): Circle Loss for pixel-to-point feature matching, alleviating extreme inlier/outlier imbalance.
- \(L_{\text{key}}\): Keypoint learning loss in IoU form, ensuring learned 3D keypoints approximate 2D keypoints within a projection error threshold \(\tau\).
- \(L_{\text{Chamfer}}\): Chamfer distance loss, computed after pose prediction via Point Transformer + MLP as a reprojection-based Chamfer distance.
Training strategy: The first 20 epochs train only \(L_{\text{corr}}\) (\(\lambda_1 = \lambda_2 = 0\)); the subsequent 20 epochs incorporate \(L_{\text{key}}\) and \(L_{\text{Chamfer}}\) (\(\lambda_1 = 0.2\), \(\lambda_2 = 0.0001\)). The threshold \(\tau\) is adaptively determined based on camera intrinsics and the RR threshold.
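The two-stage weighting above can be sketched as follows (function name hypothetical):

```python
def loss_weights(epoch):
    """Two-stage schedule from the paper's training setup: the first 20
    epochs train only L_corr; the next 20 add L_key and L_Chamfer."""
    if epoch < 20:
        return 0.0, 0.0   # (lambda1, lambda2): extra terms disabled
    return 0.2, 1e-4      # weights used for epochs 20-39
```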
## Key Experimental Results
### Main Results
Cross-scene generalization (7-Scenes dataset, trained on Office, tested on other scenes):
| Method | IR (AVG) | RR (AVG) | Gain (IR) | Gain (RR) |
|---|---|---|---|---|
| P2-Net | 0.393 | 0.510 | - | - |
| MATR | 0.456 | 0.491 | - | - |
| MATR+Diff.PnP | 0.463 | 0.501 | +0.007 | +0.010 |
| MATR+BPnPNet | 0.486 | 0.578 | +0.030 | +0.087 |
| MATR+MinCD-Net | 0.568 | 0.647 | +0.112 | +0.156 |
Cross-dataset generalization (trained on Kitchen, tested on RGBD-V2):
| Method | IR (AVG) | RR@0.1 (AVG) |
|---|---|---|
| MATR | 0.291 | 0.692 |
| MATR+Diff.PnP | 0.303 | 0.708 |
| MATR+BPnPNet | 0.315 | 0.799 |
| MATR+MinCD-Net | 0.371 | 0.870 |
### Ablation Study
| Configuration | IR (AVG) | RR (AVG) | Note |
|---|---|---|---|
| MATR (baseline) | 0.456 | 0.491 | No PnP constraint |
| + Diff.PnP | 0.463 | 0.501 | Differentiable PnP, limited gain |
| + BPnPNet | 0.486 | 0.578 | Weighted Blind PnP, improved but gradient-limited |
| + MinCD-Net (\(L_{\text{key}}\) only) | ~0.52 | ~0.58 | Keypoint learning only |
| + MinCD-Net (full) | 0.568 | 0.647 | Three losses combined, best performance |
### Key Findings
- MinCD-Net improves IR by ~+0.10 and RR by ~+0.15 over Diff.PnP in the cross-scene setting, demonstrating that the robustness advantage of Blind PnP is effectively preserved.
- In the most challenging Stairs scene (severe texture scarcity), MinCD-Net improves RR from 0.226 (Diff.PnP) to 0.571, a gain of ~150%.
- MinCD-Net also achieves top performance in cross-dataset experiments, demonstrating strong generalization.
- Using the multi-dataset pretrained model MinCD-Net†, IR reaches 0.581 and RR reaches 0.914 on RGBD-V2.
## Highlights & Insights
- The paper successfully introduces Blind PnP — a classical but computationally expensive robust estimation method — into a deep learning framework via a three-level theoretical approximation, representing an excellent fusion of theoretical rigor and engineering practicality.
- The plug-and-play design of MinCD-Net allows integration into arbitrary I2P registration architectures rather than being tied to a specific network.
- The keypoint-guided strategy cleverly leverages mature 2D detectors to circumvent the difficulty of joint learning.
- Replacing exact PnP solving with Chamfer distance confers inherent robustness to noise and outliers.
## Limitations & Future Work
- 2D keypoints rely on a pretrained detector (Shi-Tomasi), which may be suboptimal in scenes with extreme texture scarcity; learned keypoint detection could be considered as an alternative.
- The theoretical bounds of the triple approximation are upper bounds rather than tight bounds; a more rigorous optimality analysis remains to be conducted.
- Evaluation is currently limited to indoor scenes; performance on large-scale outdoor environments (e.g., LiDAR-camera registration in autonomous driving) remains to be explored.
- Encoding keypoints with Point Transformer may incur increasing computational overhead as the number of keypoints grows.
## Related Work & Insights
- 2D3D-MATR: The strong baseline for I2P registration, upon which MinCD-Net is integrated to achieve improvements.
- EPro-PnP: End-to-end probabilistic PnP with some robustness to noise, but struggles with large numbers of outliers.
- BPnPNet: The only prior work applying Blind PnP to correspondence learning, but with limited gradient information due to RANSAC-based filtering.
- Insight: The strategy of "softening" classical robust estimation methods and embedding them into deep learning is broadly applicable to other geometric estimation tasks.
## Rating
- Novelty: ⭐⭐⭐⭐ (Triple approximation is novel, though it builds on existing theoretical frameworks)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (4 datasets, cross-scene + cross-dataset, comparison with multiple methods)
- Writing Quality: ⭐⭐⭐⭐ (Theoretical derivations are clear, though mathematical notation is occasionally heavy)
- Value: ⭐⭐⭐⭐ (Practical improvement for I2P registration; the plug-and-play module design is worth adopting)