The NeRFect Match: Exploring NeRF Features for Visual Localization¶

Conference: ECCV 2024
arXiv: 2403.09577
Code: Project Page
Area: 3D Vision
Keywords: Visual Localization, NeRF Features, 2D-3D Matching, Pose Estimation, Scene Representation

TL;DR¶

Proposes NeRFMatch, which explores the potential of internal NeRF features as 3D descriptors and establishes an attention-based 2D-3D matching network. It achieves competitive localization performance on Cambridge Landmarks, validating the feasibility of NeRF as a scene representation for localization.

Background & Motivation¶

Key Challenge¶

Visual localization aims to estimate the camera pose of a query image within a known 3D environment. Mainstream scene representations include point clouds with descriptors (HLoc), meshes (MeshLoc), and learned implicit representations (APR/SCR). NeRF is a compact 3D scene representation (a Mip-NeRF model is only 5.28 MB) and has been used for data augmentation, auxiliary supervision, and pose refinement, but the potential of its internal features for direct 2D-3D matching remains under-explored. Prior works (CrossFire, NeRF-Loc) require joint training with the matching task and cannot use pre-trained NeRFs.

Method¶

Overall Architecture¶

A three-step pipeline: (1) Image retrieval to find the nearest reference pose; (2) Render 3D points and features from the reference view using NeRF, and establish 2D-3D correspondences using NeRFMatch to compute the pose; (3) Optional iterative pose refinement.

Key Designs¶

NeRF Feature Rendering: Given a 3D encoder \(\Theta_x\) (with \(L\) layers) of NeRF, the \(j\)-th layer feature is extracted as \(f^j = \Theta_x^j \circ \cdots \circ \Theta_x^1(P_x(X))\). The surface points \(\hat{X}(r) = \sum w_i X_i\) and corresponding descriptors \(\hat{F}^j(r) = \sum w_i f_i^j\) are aggregated through volume rendering. Key Insight: This feature only depends on the 3D coordinates and is independent of the viewpoint, making it naturally suitable for cross-view matching.
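The aggregation above can be sketched in a few lines of plain Python. This is an illustration only: the weight formula is the standard NeRF quadrature rule, which is assumed here because the summary does not spell it out, and the names `render_weights`, `aggregate`, `sigmas`, and `deltas` are hypothetical.

```python
import math

def render_weights(sigmas, deltas):
    """Assumed standard NeRF weights: w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    weights, transmittance = [], 1.0
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this sample
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha             # light remaining after this sample
    return weights

def aggregate(weights, vectors):
    """Weighted sum over ray samples, sum_i w_i * v_i, applied per dimension."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

# Toy ray with three samples: empty space first, then a dense surface.
sigmas = [0.0, 50.0, 50.0]                       # densities (made-up values)
deltas = [0.1, 0.1, 0.1]                         # distances between samples
points = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.1], [0.0, 0.0, 1.2]]   # X_i
feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]     # f_i^j from some encoder layer j

w = render_weights(sigmas, deltas)
x_hat = aggregate(w, points)                     # rendered surface point
f_hat = aggregate(w, feats)                      # rendered 3D descriptor
```

Because the same weights are applied to both the sample positions and the layer features, the rendered descriptor is attached to the rendered surface point rather than to a pixel of the reference view.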

NeRFMatch-Mini: A lightweight version where a CNN encoder extracts \(8\times\) downsampled image features and directly performs dual-softmax matching with NeRF features, without requiring a learnable matching module.
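A minimal sketch of dual-softmax matching, assuming a precomputed score matrix between image features (rows) and rendered NeRF features (columns). The mutual-nearest-neighbour filter and the `threshold` value are assumptions added for illustration, not details taken from the paper.

```python
import math

def softmax(values):
    peak = max(values)                           # subtract max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dual_softmax(scores):
    """S(i, j) = softmax over row i evaluated at j, times softmax over column j at i."""
    rows = [softmax(r) for r in scores]
    cols = [softmax(list(c)) for c in zip(*scores)]   # cols[j][i]
    return [[rows[i][j] * cols[j][i] for j in range(len(scores[0]))]
            for i in range(len(scores))]

def mutual_matches(S, threshold=0.2):
    """Keep (i, j) pairs that are each other's best match and exceed the threshold."""
    matches = []
    for i, row in enumerate(S):
        j = max(range(len(row)), key=row.__getitem__)
        best_i = max(range(len(S)), key=lambda k: S[k][j])
        if best_i == i and row[j] > threshold:
            matches.append((i, j))
    return matches

# Toy scores: 2 image features against 3 rendered 3D points.
scores = [[5.0, 0.0, 0.0],
          [0.0, 5.0, 0.0]]
S = dual_softmax(scores)
matches = mutual_matches(S)
```

In this toy case the third 3D point has no image counterpart, so its column stays low in both softmax directions and it is left unmatched.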

NeRFMatch (Full): Incorporates coarse-to-fine matching with self-attention and cross-attention modules. The coarse matching layer uses a shared self-attention module (to pull the feature spaces of the two domains closer) and explicitly concatenates the positional encodings of 3D coordinates to enhance spatial awareness, followed by cross-attention for cross-domain interaction. The fine matching layer performs sub-pixel level matching via heatmap regression within a high-resolution local window.
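One common way to obtain a sub-pixel location from a local heatmap is a soft-argmax, that is, the expectation of pixel coordinates under a softmax over the window. The sketch below assumes that formulation (as popularised by LoFTR-style matchers) and coordinates normalised to [-1, 1]; the summary does not confirm that NeRFMatch uses exactly this form, and `soft_argmax_2d` is a hypothetical name.

```python
import math

def soft_argmax_2d(window_scores):
    """Return the expected (x, y) offset in [-1, 1] and the heatmap's total variance."""
    h, w = len(window_scores), len(window_scores[0])
    flat = [s for row in window_scores for s in row]          # row-major order
    peak = max(flat)
    exps = [math.exp(s - peak) for s in flat]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Normalised coordinates for every cell, in the same row-major order.
    xs = [-1.0 + 2.0 * c / (w - 1) for _ in range(h) for c in range(w)]
    ys = [-1.0 + 2.0 * r / (h - 1) for r in range(h) for _ in range(w)]
    ex = sum(p * x for p, x in zip(probs, xs))
    ey = sum(p * y for p, y in zip(probs, ys))
    second_moment = sum(p * (x * x + y * y) for p, x, y in zip(probs, xs, ys))
    variance = max(0.0, second_moment - (ex * ex + ey * ey))
    return (ex, ey), variance

# A flat window is maximally uncertain; a sharply peaked one is confident.
flat_xy, flat_var = soft_argmax_2d([[0.0] * 3 for _ in range(3)])
peak_xy, peak_var = soft_argmax_2d([[0.0, 0.0, 50.0],
                                    [0.0, 0.0, 0.0],
                                    [0.0, 0.0, 0.0]])
```

The variance returned here is the kind of per-match uncertainty that a variance-weighted fine loss can use to down-weight ambiguous windows.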

Pose Refinement: Two schemes: (1) Iterative matching refinement: re-matching using the estimated pose as the new reference; (2) iNeRF-style photometric optimization followed by re-matching.

Loss & Training¶

Coarse Matching Loss: \(L_c = -\frac{1}{M_{gt}} \sum \log(S(i,j))\) (log loss to maximize the dual-softmax probability of the GT positions)
Fine Matching Loss: \(L_f = \frac{1}{M_f} \sum \frac{1}{\sigma^2(i)} \|\tilde{x}_j - x_j\|_2\) (variance-weighted pixel distance loss)
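The two losses can be evaluated directly from their formulas as written above. The sketch below is a plain-Python illustration with made-up inputs; `coarse_loss` and `fine_loss` are hypothetical names, and a real training setup would compute these on tensors with automatic differentiation.

```python
import math

def coarse_loss(S, gt_matches):
    """L_c: mean negative log of the dual-softmax probability at GT positions."""
    return -sum(math.log(S[i][j]) for i, j in gt_matches) / len(gt_matches)

def fine_loss(predicted, target, variances):
    """L_f: mean pixel distance between predictions and targets, weighted by 1 / sigma^2."""
    total = 0.0
    for (px, py), (gx, gy), var in zip(predicted, target, variances):
        total += math.hypot(px - gx, py - gy) / var
    return total / len(predicted)

# Toy dual-softmax matrix with ground-truth matches on the diagonal.
S = [[0.5, 0.1],
     [0.1, 0.25]]
l_c = coarse_loss(S, [(0, 0), (1, 1)])

# One fine match that is 5 pixels off, with heatmap variance 2.0.
l_f = fine_loss([(3.0, 4.0)], [(0.0, 0.0)], [2.0])
```

Dividing by the variance means that matches with a diffuse heatmap contribute less to the fine loss than confident ones with the same pixel error.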

Key Experimental Results¶

Outdoor Localization on Cambridge Landmarks¶

Main Results¶

Median pose error (cm / °); lower is better:

Method	Scene Representation	Kings	Hospital	Shop	StMary	Court	Average
DSAC*	SCR Network	15/0.3	21/0.4	5/0.3	13/0.4	49/0.3	20.6/0.3
ACE	SCR Network	28/0.4	31/0.6	5/0.3	18/0.6	43/0.2	25/0.4
HLoc	3D+RGB	12/0.2	15/0.3	4/0.2	7/0.2	16/0.1	10.8/0.2
NeRFLoc	NeRF+RGBD	11/0.2	18/0.4	4/0.2	7/0.2	25/0.1	13/0.2
NeRFMatch-Mini	NeRF+RGB	19.0/0.3	30.2/0.6	10.3/0.5	11.3/0.4	29.1/0.2	20.0/0.4
NeRFMatch	NeRF+RGB	13.0/0.2	19.4/0.4	8.5/0.4	7.9/0.3	17.5/0.1	13.3/0.3
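As a quick sanity check on the Average column, the translation averages for the two NeRFMatch rows can be recomputed from the per-scene values in the table. The dictionary name `per_scene_cm` is just an illustrative container for numbers copied from the rows above.

```python
# Per-scene median translation errors in cm (Kings, Hospital, Shop, StMary, Court),
# copied from the table above.
per_scene_cm = {
    "NeRFMatch-Mini": [19.0, 30.2, 10.3, 11.3, 29.1],
    "NeRFMatch": [13.0, 19.4, 8.5, 7.9, 17.5],
}

# Arithmetic mean over the five scenes, rounded to one decimal like the table.
averages = {name: round(sum(vals) / len(vals), 1)
            for name, vals in per_scene_cm.items()}

for name, avg in averages.items():
    print(f"{name}: {avg} cm")
```

The recomputed means agree with the reported 20.0 cm and 13.3 cm, so the Average column is a plain mean over the five scenes.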

Model Efficiency¶

Component	Size	Inference Time
Mip-NeRF Scene	5.28 MB	-
NeRF Feature Rendering (3600 points)	-	141 ms
NeRFMatch-Mini	42.8 MB	37 ms
NeRFMatch	50.4 MB	157 ms

Key Findings¶

Using only NeRF+RGB (no depth), NeRFMatch achieves an average error of 13.3 cm, close to NeRFLoc (13.0 cm), which requires RGBD.
Using pure NeRF features (without RGB from image retrieval) still achieves 14.6 cm, indicating that NeRF's internal features are themselves high-quality 3D descriptors.
The first iteration of iterative refinement yields a significant improvement, with diminishing returns from further iterations.
Performance on the indoor 7-Scenes dataset is relatively weak, which points to a direction for future improvement.

Highlights & Insights¶

Core Discovery: The internal features that NeRF learns through view synthesis are naturally viewpoint-invariant, so they can serve as 3D descriptors without additional training.
The approach uses a NeRF model for localization without modification or retraining, so it benefits directly from continued advances in NeRF research.
The Mini version demonstrates that reasonable localization can be achieved without a learned matching function, solely by learning good feature representations.

Limitations & Future Work¶

Poor performance in indoor scenes due to low texture and motion blur.
Dependent on image retrieval to provide the initial pose range.
The 141ms latency of NeRF feature rendering limits real-time applications.

Unlike CrossFire and NeRF-Loc, this work treats NeRF as a black-box scene representation rather than a module that must be jointly trained, which makes it more versatile. The finding that NeRF's internal features work as 3D descriptors also offers insight into what NeRF learns.

Rating¶

Novelty: ⭐⭐⭐⭐
Practicality: ⭐⭐⭐⭐
Experimental Thoroughness: ⭐⭐⭐⭐⭐
Writing Quality: ⭐⭐⭐⭐