SafeMap: Robust HD Map Construction from Incomplete Observations¶

Conference: ICML 2025
arXiv: 2507.00861
Code: None
Area: Autonomous Driving
Keywords: HD Map Construction, BEV Perception, Sensor Failure Robustness, View Reconstruction, Knowledge Distillation

TL;DR¶

SafeMap proposes a plug-and-play robust framework for HD map construction. By utilizing two modules, Gaussian-based Perspective View Reconstruction (G-PVR) and Distillation-based BEV Correction (D-BEVC), it accurately constructs vectorized HD maps even under incomplete observations where camera views are missing.

Background & Motivation¶

Online High-Definition (HD) map construction is a critical task in autonomous driving, providing precise static environmental information for vehicle planning and navigation. Current mainstream methods rely on complete inputs from multi-view cameras. However, in real-world driving scenarios, cameras may suffer from occlusions, malfunctions, or physical damage, leading to missing image data from certain views.

Existing methods face the following challenges:

Vulnerability Exposure: MapBench evaluations demonstrate that sensor failures significantly degrade HD Map model performance, posing threats to traffic safety.

Limitations of Prior Work: Methods like MetaBEV and UniBEV address the sensor failure issue for 3D object detection but still rely on complete multi-view images. M-BEV only utilizes local crops of neighboring views for recovery, which requires a predefined cropping ratio and fails to fully utilize information from all available views.

Research Gap: Robust methods for HD Map construction under incomplete observations have not been fully explored, despite map construction being highly dependent on static environmental data captured by surrounding cameras.

SafeMap is the first robust framework specifically designed for HD Map construction under incomplete multi-view camera data.

Method¶

Overall Architecture¶

SafeMap is built upon the MapTR framework and consists of four core components:

Map Encoder: 2D feature extractor + PV-to-BEV transformation module
G-PVR Module (Gaussian-based Perspective View Reconstruction): Perspective view feature reconstruction based on Gaussian sampling
D-BEVC Module (Distillation-based BEV Correction): Distillation-based BEV feature correction
Map Decoder: MapTR-based decoder and prediction heads

During training, a Random View Masking (RVM) mechanism masks the 2D image features of one randomly selected view to simulate a camera failure; the G-PVR and D-BEVC modules then recover the missing information. At test time, the reconstruction module predicts the features of the missing views.
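
As a rough illustration of Random View Masking, the sketch below hides one randomly chosen view and records which one was hidden. The helper name `random_view_mask`, the use of `None` as the mask placeholder, and the string stand-ins for feature maps are assumptions for illustration; the source does not specify how masked features are represented.

```python
import random

def random_view_mask(view_features, rng, num_masked=1):
    """Hide `num_masked` randomly chosen views (hypothetical helper).

    Returns a new list where hidden views are replaced by None,
    together with the sorted indices of the hidden views.
    """
    hidden = sorted(rng.sample(range(len(view_features)), num_masked))
    masked = [None if i in hidden else feat
              for i, feat in enumerate(view_features)]
    return masked, hidden

# Stand-ins for per-camera 2D feature maps (six nuScenes-style views).
views = ["front", "front_right", "back_right", "back", "back_left", "front_left"]
masked_views, hidden_idx = random_view_mask(views, random.Random(0))
available = [v for v in masked_views if v is not None]
print(hidden_idx, available)
```

Only the `available` features would be passed on to the reconstruction step; the hidden view's original features remain usable as a training target.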

Key Designs¶

G-PVR: Gaussian-based Perspective View Reconstruction Module¶

This module represents the core innovation of the paper, addressing the problem of "how to reconstruct missing views from multiple available views."

Core Idea: Different views contribute differently to the reconstruction of the missing view. Neighboring views contain the most relevant information and should receive higher weights, while distant views (such as the opposite view) contain less information and should be down-weighted.

Specific Steps:

Perspective Panoramic View Concatenation: Starting from the left and right neighboring views of the missing view, all available frames are arranged sequentially by spatial distance and concatenated into a perspective panoramic view \(F_{PPV} = \text{Concat}(F_{PV}^a) \in \mathbb{R}^{H \times N_a W \times C}\)
Gaussian Reference Point Generation: Generate reference points following a Gaussian distribution on the perspective panoramic view:
- Horizontal direction: \(p_x \sim \mathcal{N}(N_a W / 2, \sigma^2)\) (centered on the middle to concentrate sampling points on the neighboring view regions)
- Vertical direction: \(p_y \sim \mathcal{U}(0, H)\) (uniformly distributed to cover the full height)
Deformable Attention Reconstruction: A learnable query \(V\) is passed through an offset network \(\theta_{\text{offset}}\) to produce sampling offsets, and the sampling function \(\phi\) extracts features at the shifted reference points: \(\Delta p = \theta_{\text{offset}}(V), \quad \hat{x} = \phi(F_{PV}^a; p + \Delta p)\). The sampled features are then projected into keys and values for the deformable attention mechanism: \(\hat{k} = \hat{x} W_k, \quad \hat{v} = \hat{x} W_v\).
MAE-style Transformer Reconstruction: Reconstruct missing view features using MAE-like Transformer blocks.
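
The first two steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: cameras are assumed to be indexed in ring order around the vehicle, `panorama_order` is a hypothetical helper showing one ordering that places the missing view's neighbors in the middle of the panorama, the feature-map size (15 x 25) is made up, \(\sigma\) is assumed to be measured in feature cells, and out-of-range samples are simply clamped. The deformable attention and Transformer steps are omitted.

```python
import random

def panorama_order(missing, num_views):
    """Order the available views so the missing view's neighbors sit in the middle.

    Assumes cameras are indexed around the vehicle in ring order (hypothetical).
    """
    split = (num_views + 1) // 2
    offsets = list(range(split, num_views)) + list(range(1, split))
    return [(missing + d) % num_views for d in offsets]

def gaussian_reference_points(num_points, height, width, num_available, sigma, rng):
    """Sample (p_x, p_y) with p_x ~ N(N_a * W / 2, sigma^2) and p_y ~ U(0, H)."""
    pano_width = num_available * width
    center = pano_width / 2.0
    points = []
    for _ in range(num_points):
        # Clamp so every point stays inside the panorama (an assumption).
        p_x = min(max(rng.gauss(center, sigma), 0.0), pano_width)
        p_y = rng.uniform(0.0, height)
        points.append((p_x, p_y))
    return points

order = panorama_order(missing=0, num_views=6)  # [3, 4, 5, 1, 2]
points = gaussian_reference_points(
    num_points=1000, height=15, width=25,
    num_available=len(order), sigma=3.0, rng=random.Random(0))
mean_x = sum(p[0] for p in points) / len(points)
print(order, round(mean_x, 1))
```

With this ordering the horizontal samples cluster around the panorama center, where the neighboring views are placed, while distant views near the edges receive few reference points.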

Advantages of G-PVR over Local PVR (M-BEV method):
- Utilizes all available views instead of only the two adjacent views
- Eliminates the need for predefined cropping ratios
- Naturally encodes the view importance prior via a Gaussian distribution
- Flexibly adapts to varying numbers of input views and spatial configurations

D-BEVC: Distillation-based BEV Correction Module¶

Following PV-level reconstruction, further calibration in the global BEV feature space is necessary. D-BEVC leverages the complete-observation BEV features \(F_{BEV}^{com}\) as a supervision signal, calibrating the incomplete-observation BEV features \(F_{BEV}^{incom}\) using MSE loss:

\[L_{\text{Cor}} = \text{MSE}(F_{BEV}^{com}, F_{BEV}^{incom})\]

This module allows the incomplete-observation BEV features to implicitly benefit from the complete-observation omnidirectional BEV features during training.
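
To make the correction loss concrete, here is a small worked example of the MSE between a complete-observation feature and an incomplete-observation feature. The flattened four-element lists and the names `bev_complete` and `bev_incomplete` are made-up stand-ins; real BEV features are multi-dimensional tensors.

```python
def mse(teacher, student):
    """Mean squared error between two equally sized flattened feature lists."""
    if len(teacher) != len(student):
        raise ValueError("feature lists must have the same length")
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)

bev_complete = [1.0, 2.0, 3.0, 4.0]    # stand-in for features from all views
bev_incomplete = [1.0, 2.0, 2.0, 2.0]  # stand-in for features with a view missing
loss_cor = mse(bev_complete, bev_incomplete)
print(loss_cor)  # 1.25
```

The squared differences are 0, 0, 1, and 4, so the mean is 5 / 4 = 1.25.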

Loss & Training¶

The overall optimization objective consists of three parts:

\[L = L_{\text{map}} + \lambda_1 L_{\text{Rec}} + \lambda_2 L_{\text{Cor}}\]

Loss Term	Definition	Weight
\(L_{\text{map}}\)	Map construction loss (classification + point-to-point + edge direction)	1.0
\(L_{\text{Rec}}\)	PV reconstruction loss: \(\|F_{PV}^{com} - F_{PV}^{incom}\|\)	\(\lambda_1 = 0.05\)
\(L_{\text{Cor}}\)	BEV correction loss: \(\text{MSE}(F_{BEV}^{com}, F_{BEV}^{incom})\)	\(\lambda_2 = 5\)
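
The weighted sum can be checked with a short example. The weights come from the table above; the three individual loss values are invented purely to show the arithmetic, and `total_loss` is a hypothetical helper name.

```python
LAMBDA_REC = 0.05  # weight of the PV reconstruction loss (lambda_1)
LAMBDA_COR = 5.0   # weight of the BEV correction loss (lambda_2)

def total_loss(loss_map, loss_rec, loss_cor,
               lambda_rec=LAMBDA_REC, lambda_cor=LAMBDA_COR):
    """Combine the three terms: L = L_map + lambda_1 * L_Rec + lambda_2 * L_Cor."""
    return loss_map + lambda_rec * loss_rec + lambda_cor * loss_cor

# Made-up loss values for illustration only.
loss = total_loss(loss_map=1.2, loss_rec=0.4, loss_cor=0.02)
print(round(loss, 4))  # 1.32
```

Here the reconstruction term contributes 0.05 x 0.4 = 0.02 and the correction term contributes 5 x 0.02 = 0.1, so the differing weights mainly rescale terms of different magnitudes.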

Training Strategy:
- Optimizer: AdamW, learning rate \(4.2 \times 10^{-4}\)
- Training: Fine-tuned for 8 epochs on nuScenes and 2 epochs on Argoverse2
- Batch size: 4 for nuScenes, 6 for Argoverse2
- Gaussian standard deviation: \(\sigma = 3\)
- Maximum of 100 map elements per frame, with 20 points per element
- BEV grid size 0.75 m, Transformer decoder with 2 layers
- A random single-view RGB image is dropped per frame during training

Key Experimental Results¶

Main Results¶

Experiments are conducted on the nuScenes (6 views) and Argoverse2 (7 views) datasets. The metric is mAP based on Chamfer distance, with thresholds of 0.5 m, 1.0 m, and 1.5 m.
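
For readers unfamiliar with the metric, the sketch below computes a symmetric Chamfer distance between a predicted and a ground-truth polyline and checks it against the three thresholds. It assumes the two directions are averaged; the exact normalization used by the benchmark code may differ, and the sample polylines are made up.

```python
import math

def chamfer_distance(pts_a, pts_b):
    """Symmetric Chamfer distance between two 2D point sets (averaged form)."""
    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(pts_a, pts_b) + one_way(pts_b, pts_a))

# A predicted lane segment offset 0.6 m sideways from the ground truth.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gt = [(0.0, 0.6), (1.0, 0.6), (2.0, 0.6)]

cd = chamfer_distance(pred, gt)
matches = {t: cd < t for t in (0.5, 1.0, 1.5)}
print(round(cd, 3), matches)
```

In this example the prediction counts as a match at the 1.0 m and 1.5 m thresholds but not at 0.5 m, which is why averaging AP over the three thresholds rewards geometrically tighter predictions.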

Dataset	Baseline Model	Scenario	mAP Gain	Explanation
nuScenes	HIMap	Various missing views	+2.4% ~ +18.2%	6 single-view missing scenarios
Argoverse2	MapTR	Complete observations	+1.0%	Performance gain also achieved under complete views
Argoverse2	MapTR	Missing front view	+4.1%	Front view has the most significant impact
nuScenes	MapTR	Sensor corruptions	mRR +1.9%, mCE +9.4%	8 real-world corruption scenarios
nuScenes	HIMap	Sensor corruptions	mRR +6.2%, mCE +16.8%	MapBench evaluation

Ablation Study¶

Configuration	Key Metric	Description
Baseline (w/o reconstruction module)	Reference	Original MapTR model only
+ G-PVR only	Significant gain	PV-level reconstruction is effective
+ D-BEVC only	Significant gain	BEV-level correction is effective
+ G-PVR + D-BEVC (Full SafeMap)	Best	The two modules are complementary
Mean-PVR (averaging all available views)	Below baseline	Simple averaging is ineffective
MAE-PVR (MAE + masked token)	Outperforms Mean	But weaker than G-PVR
Standard-PVR (uniform reference points)	Outperforms MAE	But weaker than G-PVR
Gaussian-PVR (Gaussian reference points)	Optimal	Prior knowledge is effective
D-BEVC w/ L1	Suboptimal	Manhattan distance
D-BEVC w/ L2	Optimal	Euclidean distance (default)
D-BEVC w/ KL	Weakest	KL divergence performs poorly

Key Findings¶

Front/Rear Views are Most Critical: Missing the front view (CAM_FRONT) or rear view (CAM_BACK) has the heaviest impact on performance, as these views contain the most critical map elements.
Negligible Parameter Count Increase: SafeMap adds only 0.4MB to 3.6MB of parameters, having almost no effect on inference speed and GPU memory consumption.
Tolerance to Multi-View Loss: Performance degrades gradually as the number of missing views grows from 1 to 5, but SafeMap consistently and significantly outperforms MapTR across this range and decays much more slowly.
Hyperparameter Robustness: Performance remains stable for \(\lambda_1 \in [0.01, 0.09]\) and for \(\lambda_2, \sigma \in [1, 5]\).

Highlights & Insights¶

Gaussian Sampling Encodes the View Prior: The intuition that adjacent views matter most is built directly into the placement of reference points through a Gaussian distribution, which is more natural and effective than uniform sampling or local cropping.
Dual-level Reconstruction Strategy: PV-level reconstruction (G-PVR) handles local feature recovery, while BEV-level correction (D-BEVC) ensures global consistency, making them highly complementary.
Plug-and-Play Design: Can be directly integrated into existing frameworks like MapTR and HIMap without substantial architecture adjustments.
Performance Gain Under Complete Scenarios: Although primarily targeted at incomplete observations, SafeMap also improves performance under complete observations, showing that the reconstruction module enhances the model's feature representation ability.
First Systematic Study: The first robust framework specifically targeting HD Map construction under incomplete multi-view data.

Limitations & Future Work¶

Camera Modality Only: Sensor failure scenarios under multi-sensor fusion (e.g., LiDAR-Camera fusion) are not investigated.
Single-frame Reconstruction: Temporal information is not exploited; incorporating history frames could potentially further improve reconstruction quality.
Requirement of Complete Data During Training: D-BEVC relies on complete observations as teacher signals, requiring the training dataset to contain complete multi-view images.
Primary Evaluation Focuses on Single-View Loss: Although multi-view failure experiments were conducted, major comparative experiments remain concentrated on single-view loss scenarios.
Assumed Gaussian Distribution: A single fixed \(\sigma\) is applied to all views, which may not be optimal; an adaptive scheme could yield better results.

Related Work & Insights¶

MapTR/MapTRv2: Baseline methods for SafeMap, serving as pioneering works in end-to-end vectorized HD map construction.
HIMap: An HD map construction method based on hybrid representation learning, on which SafeMap's effectiveness is also validated.
M-BEV: A masked view reconstruction method in 3D object detection. SafeMap's G-PVR represents a comprehensive improvement over its Local PVR.
MapBench: A robustness evaluation benchmark for HD map models under sensor corruptions.
MetaBEV/UniBEV: BEV perception methods addressing sensor failures, which are, however, not applicable to HD map scenarios.

Insight: The idea of using Gaussian sampling as a spatial prior can be extended to other multi-view fusion tasks (e.g., 3D detection, occupancy prediction). Similarly, the distillation-based calibration strategy can be applied to other scenarios requiring robust handling of missing inputs.

Rating¶

Dimension	Rating (1-5)	Description
Novelty	4	Gaussian sampling view prior + dual-level reconstruction is a novel combination.
Value	5	Plug-and-play, minimal parameter overhead, high practical deployment value.
Experimental Thoroughness	5	Evaluated on two datasets, multiple baselines, extensive ablations, and hyperparameter sensitivity analysis.
Writing Quality	4	Clear structure, well-articulated motivation.
Overall Evaluation	4.5	High utility with solid experimental support; a solid ICML paper.