LightLoc: Learning Outdoor LiDAR Localization at Light Speed¶

Conference: CVPR 2025
arXiv: 2503.17814
Code: liw95/LightLoc
Area: Autonomous Driving
Keywords: LiDAR localization, Scene Coordinate Regression, training acceleration, sample classification guidance, redundant downsampling, SLAM

TL;DR¶

This paper proposes LightLoc, which achieves a 50x acceleration in large-scale outdoor LiDAR localization training (1 hour vs. 2 days) while attaining a state-of-the-art (SOTA) position accuracy of 0.83m. This is achieved via Sample Classification Guidance (SCG) to reduce regression ambiguity in visually similar areas, and Redundant Sample Downsampling (RSD) to discard already well-learned frames.

Background & Motivation¶

Background¶

LiDAR localization aims to estimate the 6DoF pose of a sensor, a fundamental capability for autonomous driving and robotics. Existing methods fall into two categories:

Map-based methods (retrieval/registration): require storing and transmitting 3D maps, leading to high communication overhead.
Regression-based methods:
APR (Absolute Pose Regression): e.g., DiffLoc, achieves an accuracy of around 2m but requires 145 hours of training.
SCR (Scene Coordinate Regression): e.g., LiSA, achieves an accuracy of 0.95m but requires 53 hours of training.

Key Challenge¶

Two core bottlenecks slow down training in large-scale outdoor scenes:

Large coverage area: A 2km² area contains many visually similar regions (such as similar combinations of roads and buildings), which increases the difficulty of regression learning.
Massive data: With ~150K training frames, caching features on the GPU, as ACE does, would require approximately 150GB, which is infeasible.

Method¶

Overall Architecture¶

LightLoc is based on a scene-agnostic feature backbone + scene-specific prediction head paradigm:

1. The backbone is trained in parallel with regression heads on 18 scenes in nuScenes (2 days, one-time cost) to obtain a general feature extractor.
2. For a new scene, only the lightweight prediction head is trained, accelerating the training to 1 hour through SCG and RSD.

Key Designs¶

1. Sample Classification Guidance (SCG)

Core Idea: Use a classification task to assist regression learning, reducing ambiguity of visually similar areas in large-scale scenes.

Label Generation: K-Means clustering is used to partition training locations into $k_1$ clusters, generating classification labels (zero-cost, fast).
Classification Network Training: The backbone is frozen, and a global max pooling + MLP classification head is trained for only 5 minutes.
Guided Regression: Gaussian noise ($\sigma=0.1$) is added to the class probability distribution output by the classification network; the result is normalized to the unit sphere and fed to SCR as an additional conditioning input.
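The label generation and guidance steps above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: the function names `kmeans_labels` and `guidance_feature`, the plain Lloyd iterations, and the toy positions and probabilities are all hypothetical. Only the K-Means partitioning of training locations, the Gaussian noise with $\sigma=0.1$, and the unit-sphere normalization come from the description.

```python
import numpy as np

def kmeans_labels(positions, k, iters=20, seed=0):
    """Plain Lloyd's K-Means over training positions; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct training positions (an assumption).
    centers = positions[rng.choice(len(positions), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each position to its nearest center.
        dists = np.linalg.norm(positions[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty cluster center to the mean of its members.
        for j in range(k):
            members = positions[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers.
    dists = np.linalg.norm(positions[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers

def guidance_feature(probs, sigma=0.1, rng=None):
    """Add Gaussian noise to class probabilities, then scale each row to unit L2 norm."""
    rng = np.random.default_rng(0) if rng is None else rng
    noisy = probs + rng.normal(0.0, sigma, size=probs.shape)
    norms = np.linalg.norm(noisy, axis=-1, keepdims=True)
    return noisy / np.maximum(norms, 1e-12)

# Toy data: two well-separated groups of 2D positions (hypothetical).
positions = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                      [100.0, 100.0], [101.0, 100.0], [100.0, 101.0]])
labels, centers = kmeans_labels(positions, k=2)

# Hypothetical classifier outputs for two samples over the two clusters.
probs = np.array([[0.9, 0.1], [0.2, 0.8]])
feat = guidance_feature(probs)
```

Here `labels` plays the role of the classification targets, and `feat` is the vector that would be passed to the regression head as a conditioning input.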

The classification loss uses cross-entropy with label smoothing ($\epsilon=0.1$): $$\mathcal{L}_{cls} = -\sum_{i=1}^{k_1} \left(l_i^*(1-\epsilon) + \frac{\epsilon}{k_1}\right) \log(l_i')$$
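As a worked illustration of the formula, the sketch below evaluates the smoothed cross-entropy for a single sample in plain Python. The function name `label_smoothing_ce` and the example probability vectors are assumptions made for illustration; the target weights follow the expression $l_i^*(1-\epsilon) + \epsilon/k_1$ term by term.

```python
import math

def label_smoothing_ce(pred_probs, target_index, eps=0.1):
    """Cross-entropy between predicted probabilities and a smoothed one-hot target."""
    k = len(pred_probs)
    loss = 0.0
    for i, p in enumerate(pred_probs):
        one_hot = 1.0 if i == target_index else 0.0
        # Smoothed target weight: l*_i * (1 - eps) + eps / k
        smoothed = one_hot * (1.0 - eps) + eps / k
        loss -= smoothed * math.log(p)
    return loss

# Hypothetical predictions over k = 4 clusters, true cluster index 0.
uniform_loss = label_smoothing_ce([0.25, 0.25, 0.25, 0.25], target_index=0)
confident_loss = label_smoothing_ce([0.97, 0.01, 0.01, 0.01], target_index=0)
```

Because the smoothed weights sum to 1, a uniform prediction gives exactly $\log k_1$, and setting `eps=0.0` recovers the standard cross-entropy $-\log(l'_{\text{target}})$.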

2. Redundant Sample Downsampling (RSD)

Core Idea: High-frequency LiDAR acquisition (10Hz) combined with a large perception range (100m) produces many redundant frames, so samples that are already well learned can be safely discarded.

Hierarchical downsampling strategy:

- Stage 1 ($0 \sim E_1$): Train on the full dataset, recording the median L1 loss $\mathcal{L}_m$ for each sample.
- Stage 2 ($E_1 \sim E_1+S$): Calculate the variance $\mathcal{V}$ of $\mathcal{L}_m$ within a sliding window $S$. Sort the samples in descending order of variance, keeping the top $(1-r_d)$ proportion of samples (high variance = not converged = requires more training).
- Stage 3: Repeat the above process on the reduced dataset, further downsampling to a proportion of $(1-r_d)^2$.
- Stage 4 ($E_s \sim E$): Resume training on the full dataset to ensure final convergence.
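The variance-based selection in Stages 2 and 3 can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `select_unconverged`, the dictionary layout of `loss_history`, the use of population variance, and the toy loss values are assumptions. In the toy run, Stage 3 reuses the same loss history for brevity, whereas the described procedure would use losses recorded while training on the reduced set.

```python
import statistics

def select_unconverged(loss_history, window, drop_ratio):
    """Keep the (1 - drop_ratio) fraction of samples with the highest loss variance.

    loss_history maps a sample id to its recorded per-epoch median L1 losses.
    """
    # Variance of each sample's loss over the last `window` epochs.
    variances = {
        sid: statistics.pvariance(losses[-window:])
        for sid, losses in loss_history.items()
    }
    # High variance is treated as "not yet converged", so it is ranked first.
    ranked = sorted(variances, key=variances.get, reverse=True)
    keep = max(1, int(round(len(ranked) * (1.0 - drop_ratio))))
    return ranked[:keep]

# Hypothetical loss curves for four frames.
history = {
    "frame_a": [0.90, 0.50, 0.20, 0.10],  # still changing quickly
    "frame_b": [0.30, 0.30, 0.30, 0.30],  # flat, zero variance
    "frame_c": [0.80, 0.60, 0.40, 0.20],
    "frame_d": [0.21, 0.20, 0.20, 0.19],  # nearly flat
}

stage2 = select_unconverged(history, window=4, drop_ratio=0.5)
stage3 = select_unconverged({s: history[s] for s in stage2}, window=4, drop_ratio=0.5)
```

With `drop_ratio=0.5`, the retained fraction goes from 1 to 0.5 to 0.25, matching the $(1-r_d)$ and $(1-r_d)^2$ proportions of Stages 2 and 3.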

3. SCG-Enhanced SLAM

Integrate the fast training (5 minutes) and confidence estimation capabilities of SCG into SLAM:

- Construct hierarchical classification (two-level K-Means, $k_1 \times k_2$ clusters).
- Confidence $c$ is defined as the product of the two-level classification probabilities.
- Fuse Kalman filter SLAM pose estimates with the position observations from the classification network.
- The measurement noise $V_t = I \times (1-c)$ gives greater weight to high-confidence estimates.
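To illustrate how the confidence-dependent noise $V_t = I \times (1-c)$ shifts weight toward high-confidence observations, here is a minimal Kalman measurement update in NumPy. It is a sketch under simplifying assumptions that are not taken from the paper: the state is a 2D position only, the observation matrix is the identity, and the names `confidence` and `kalman_position_update` and all numeric values are hypothetical.

```python
import numpy as np

def confidence(p_level1, p_level2):
    """Confidence c as the product of the two-level classification probabilities."""
    return p_level1 * p_level2

def kalman_position_update(x_pred, P_pred, z, c):
    """One Kalman measurement update with H = I and noise V = I * (1 - c)."""
    dim = x_pred.shape[0]
    V = np.eye(dim) * (1.0 - c)             # measurement noise shrinks as c grows
    K = P_pred @ np.linalg.inv(P_pred + V)  # Kalman gain for an identity observation model
    x_new = x_pred + K @ (z - x_pred)
    P_new = (np.eye(dim) - K) @ P_pred
    return x_new, P_new

x_pred = np.array([0.0, 0.0])  # SLAM-predicted position (toy value)
P_pred = np.eye(2)             # prediction covariance (toy value)
z = np.array([1.0, 1.0])       # position observed via the classification network (toy value)

x_high, _ = kalman_position_update(x_pred, P_pred, z, confidence(0.9, 1.0))
x_low, _ = kalman_position_update(x_pred, P_pred, z, confidence(0.5, 0.4))
```

With these toy numbers the high-confidence update moves the estimate most of the way toward `z`, while the low-confidence update stays closer to the prediction.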

Loss & Training¶

Backbone training: L1 regression loss (Eq. 1)
SCG: Label smoothing cross-entropy (Eq. 2)
SCR: L1 regression loss + SCG feature guidance

Key Experimental Results¶

QEOxford Dataset¶

Main Results¶

Method	Type	Training Time	Parameters	Mean Position Error [m]	Mean Angle Error [°]
PosePN++	APR	11h	5M	5.13	1.69
DiffLoc	APR	145h	40M	1.86	0.87
SGLoc	SCR	50h	105M	1.53	1.60
LiSA	SCR	53h	105M	0.95	1.14
LightLoc	SCR	1h	22M	0.83	1.12

Key Findings¶

vs LiSA: Training time 1h vs 53h = 53x acceleration, position accuracy improved from 0.95m to 0.83m (13% gain).
vs DiffLoc: Training time 1h vs 145h = 145x acceleration, position accuracy improved from 1.86m to 0.83m.
Parameters: 22M vs 105M (LiSA), roughly a 4.8x reduction.
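The ratios quoted above can be re-derived directly from the QEOxford table. The snippet below is only an arithmetic check on the reported numbers; the dictionary layout and variable names are illustrative.

```python
# Values copied from the QEOxford results table above.
table = {
    "DiffLoc": {"hours": 145, "params_m": 40, "pos_err_m": 1.86},
    "LiSA": {"hours": 53, "params_m": 105, "pos_err_m": 0.95},
    "LightLoc": {"hours": 1, "params_m": 22, "pos_err_m": 0.83},
}
ours = table["LightLoc"]

speedup_vs_lisa = table["LiSA"]["hours"] / ours["hours"]
speedup_vs_diffloc = table["DiffLoc"]["hours"] / ours["hours"]
# Relative reduction in mean position error with respect to LiSA.
pos_gain_vs_lisa = (table["LiSA"]["pos_err_m"] - ours["pos_err_m"]) / table["LiSA"]["pos_err_m"]
param_ratio_vs_lisa = table["LiSA"]["params_m"] / ours["params_m"]
```

The position gain comes out at about 12.6%, which rounds to the 13% quoted, and the parameter ratio is about 4.8.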

NCLT Dataset¶

LightLoc also achieves SOTA on the NCLT dataset, with a position error of 0.87m, further validating its generalization ability.

Highlights & Insights¶

Precise Problem Definition: Clearly identifies the two core bottlenecks of slow training in large-scale outdoor scenes (large coverage area + massive data) and proposes targeted solutions.
Clever Design of Classification-Guided Regression: The classification network, which takes only 5 minutes to train, not only accelerates SCR regression but also serves as an external measurement source for SLAM, killing two birds with one stone.
Generalization of RSD: The loss-variance-based redundancy detection method does not rely on specific task assumptions, making it generalizable to other data-redundant training scenarios.
Extreme Engineering Practicality: Training a new scene in 1 hour renders the system practically deployable, solving the core pain point where prior SCR methods were difficult to apply in practice due to excessively long training times.

Limitations & Future Work¶

Backbone training still requires 2 days (though this is a one-time cost). Since it is trained on 18 scenes of nuScenes, its generalization across significantly different domains (e.g., indoors) has not been verified.
The number of clusters $k_1$ in SCG and the downsampling ratio $r_d$ in RSD need to be manually set, lacking an adaptive adjustment mechanism.
Only validated on autonomous driving datasets; its performance in other LiDAR application scenarios (e.g., robotics, UAVs) remains unknown.
The improvement in angle error is not as significant as that in position error, which may be insufficient in orientation-sensitive applications.

Related Work¶

Map-based: PointNetVLAD $\rightarrow$ MinkLoc3D $\rightarrow$ LCDNet
APR Methods: PointLoc $\rightarrow$ PosePN++ $\rightarrow$ HypLiLoc $\rightarrow$ DiffLoc (SOTA APR)
SCR Methods: SGLoc $\rightarrow$ LiSA (SOTA SCR); ACE (camera-based SCR acceleration)
Training Acceleration: ACE (GPU feature caching), GLACE (multi-view optimization), but both target small scenes.

Rating¶

Novelty: 4/5 — Both SCG and RSD have sound motivational support, and their combination yields outstanding results.
Effectiveness: 5/5 — 50x acceleration combined with accuracy improvement makes the results highly convincing.
Clarity: 4/5 — The algorithmic pseudocode is clear, and the multi-dataset evaluation is comprehensive.
Significance: 5/5 — Overcomes the main barrier to transitioning SCR methods from theory to practice (namely, training time).