This paper proposes HRMapNet, which maintains a low-cost global historical rasterized map to provide complementary prior information for online vectorized map perception. It enhances existing methods at two levels—BEV feature aggregation and query initialization—achieving significant improvements on nuScenes and Argoverse 2.
High-definition maps (HD maps) are crucial for autonomous driving, but traditional offline construction is extremely costly, prompting researchers to shift toward online map perception based on onboard sensors.
Online vectorized map perception methods, represented by MapTR, directly predict vectorized map elements in the BEV space. However, relying solely on the current frame's onboard sensors leads to a substantial decline in accuracy and robustness under challenging scenarios such as occlusions, adverse weather, or nighttime.
Temporal information is a natural complement. However, existing methods (e.g., StreamMapNet) only leverage short-term temporal information from the past few frames, failing to fully exploit the value of historical observations.
The core insight of this paper is that historical prediction results can be rasterized and accumulated into a global map at a low cost, serving as prior information for online perception. Rasterized maps offer advantages such as ease of merging, easy retrieval, clear semantics, and small memory footprint.
How can historical map information be maintained and utilized in a cost-effective manner to compensate for the perceptual limitations of single-frame onboard sensors in challenging scenarios?
1. Global Rasterized Map Construction and Maintenance
The online prediction results (vectorized map) of each frame are rasterized into a local raster map \(M_i^l \in \{0,1\}^{H \times W \times N}\), where \(N=3\) represents three categories of map elements: lane divider, pedestrian crossing, and road boundary.
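The rasterization step can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function name `rasterize_polylines`, the 0.3 m cell size, the ego-centred grid layout, and the dense-sampling approach are all assumptions made for illustration.

```python
import numpy as np

def rasterize_polylines(elements, H, W, num_classes=3, resolution=0.3):
    """Rasterize vectorized map elements into a binary (H, W, num_classes) grid.

    elements: list of (class_id, points), where points is a (K, 2) sequence of
    (x, y) coordinates in metres in the ego frame. The ego vehicle is assumed
    to sit at the grid centre (an illustrative convention).
    """
    raster = np.zeros((H, W, num_classes), dtype=np.uint8)
    for class_id, points in elements:
        points = np.asarray(points, dtype=np.float64)
        for start, end in zip(points[:-1], points[1:]):
            # Sample each segment at roughly half-cell spacing.
            length = np.linalg.norm(end - start)
            n = max(int(np.ceil(length / resolution)) * 2, 1) + 1
            t = np.linspace(0.0, 1.0, n)[:, None]
            samples = start + t * (end - start)
            cols = np.floor(samples[:, 0] / resolution + W / 2).astype(int)
            rows = np.floor(samples[:, 1] / resolution + H / 2).astype(int)
            # Drop samples that fall outside the local perception range.
            inside = (rows >= 0) & (rows < H) & (cols >= 0) & (cols < W)
            raster[rows[inside], cols[inside], class_id] = 1
    return raster

# Example: one lane divider (class 0) running along the x axis at y = 0.
local = rasterize_polylines([(0, [(-3.0, 0.0), (3.0, 0.0)])], H=20, W=40)
```

Each class gets its own channel, so overlapping elements of different categories do not overwrite one another.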
Based on the ego-pose, local coordinates are mapped to global coordinates, and a global map \(M^g\) is maintained using update rules similar to an occupancy grid map:
If the local prediction indicates the presence of a map element at a position, the global value increases by \(S^+\) (default: 30).
If not, the global value decreases by \(S^-\) (default: 1).
During retrieval, a local raster map is cropped from the global map based on the current ego-pose and binarized using a threshold \(S_{th}\).
The global map is stored using 8-bit unsigned integers, with a memory cost of approximately 1 MB per kilometer (the Boston map in nuScenes is only about 120 MB, whereas the BEV feature-based map in NMP requires 11 GB).
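The update and retrieval rules above can be sketched as follows. This is a hedged illustration rather than the authors' code: the helper names, the nearest-cell pose mapping, the 0.3 m cell size, and the threshold value `S_TH = 100` are assumptions (the text gives defaults only for \(S^+\) and \(S^-\)).

```python
import numpy as np

S_PLUS, S_MINUS = 30, 1  # defaults quoted in the text
S_TH = 100               # hypothetical threshold; no value is given in the text

def local_to_global_cells(H, W, ego_xy, ego_yaw, resolution, origin_xy):
    """Global (row, col) index of every local cell centre, using the ego pose."""
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Local cell centres in metres, ego at the grid centre.
    x = (cols + 0.5 - W / 2) * resolution
    y = (rows + 0.5 - H / 2) * resolution
    c, s = np.cos(ego_yaw), np.sin(ego_yaw)
    gx = c * x - s * y + ego_xy[0]
    gy = s * x + c * y + ego_xy[1]
    g_cols = np.floor((gx - origin_xy[0]) / resolution).astype(int)
    g_rows = np.floor((gy - origin_xy[1]) / resolution).astype(int)
    return g_rows, g_cols

def _in_bounds(global_map, g_rows, g_cols):
    gh, gw = global_map.shape[:2]
    return (g_rows >= 0) & (g_rows < gh) & (g_cols >= 0) & (g_cols < gw)

def update_global(global_map, local_raster, g_rows, g_cols):
    """Occupancy-grid-style update, saturating within the uint8 range."""
    ok = _in_bounds(global_map, g_rows, g_cols)
    r, c = g_rows[ok], g_cols[ok]
    old = global_map[r, c].astype(np.int16)
    delta = np.where(local_raster[ok] > 0, S_PLUS, -S_MINUS)
    # If several local cells hit the same global cell, the last write wins.
    global_map[r, c] = np.clip(old + delta, 0, 255).astype(np.uint8)

def retrieve_local(global_map, g_rows, g_cols, threshold=S_TH):
    """Crop the local view from the global map and binarize it."""
    ok = _in_bounds(global_map, g_rows, g_cols)
    local = np.zeros(g_rows.shape + (global_map.shape[2],), dtype=np.uint8)
    local[ok] = global_map[g_rows[ok], g_cols[ok]] > threshold
    return local

# Example: four consistent observations of a divider push it over the threshold.
global_map = np.zeros((200, 200, 3), dtype=np.uint8)
rows, cols = local_to_global_cells(20, 40, ego_xy=(30.0, 30.0), ego_yaw=0.0,
                                   resolution=0.3, origin_xy=(0.0, 0.0))
local = np.zeros((20, 40, 3), dtype=np.uint8)
local[10, :, 0] = 1
for _ in range(4):
    update_global(global_map, local, rows, cols)
prior = retrieve_local(global_map, rows, cols)
```

Because \(S^+\) is much larger than \(S^-\), a few positive observations establish an element quickly, while a spurious one is only erased after many frames without support.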
Existing methods extract BEV features \(F_I \in \mathbb{R}^{H \times W \times C}\) from onboard camera images.
HRMapNet additionally places BEV queries at positions with map elements in the retrieved local raster map and extracts corresponding features from the images via spatial cross-attention, obtaining the complementary BEV features \(F_M\). Positions without map elements are padded with zeros.
Final feature fusion: \(F_{BEV} = \text{Conv}(\text{Concat}(F_I + F_M, M^l))\), where the image BEV features \(F_I\) and the complementary map features \(F_M\) are summed, concatenated with the retrieved raster map \(M^l\) (which carries the semantic information directly), and then fused via convolution.
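A shape-level sketch of this fusion in NumPy may help. It assumes a 1x1 convolution (the kernel size is not stated here) and uses random arrays in place of real features; `fuse_bev` and all dimensions are hypothetical.

```python
import numpy as np

def fuse_bev(F_I, F_M, M_l, weight, bias):
    """F_BEV = Conv(Concat(F_I + F_M, M^l)), with Conv taken to be 1x1.

    F_I, F_M: (H, W, C) float arrays; M_l: (H, W, N) binary raster;
    weight: (C + N, C_out); bias: (C_out,).
    """
    stacked = np.concatenate([F_I + F_M, M_l.astype(F_I.dtype)], axis=-1)
    # A 1x1 convolution is a per-cell linear map over the channel axis.
    return stacked @ weight + bias

rng = np.random.default_rng(0)
H, W, C, N = 4, 6, 8, 3
M_l = np.zeros((H, W, N), dtype=np.uint8)
M_l[2, :, 0] = 1  # one row of lane-divider cells
has_element = M_l.any(axis=-1, keepdims=True)
F_I = rng.normal(size=(H, W, C))
# Complementary features are zero wherever the raster has no map element.
F_M = rng.normal(size=(H, W, C)) * has_element
weight = rng.normal(size=(C + N, C))
bias = np.zeros(C)
F_BEV = fuse_bev(F_I, F_M, M_l, weight, bias)
```

In cells without a map element, both \(F_M\) and \(M^l\) are zero, so the output there depends on \(F_I\) alone.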
In the DETR paradigm, learnable queries must search for map elements from random positions. The historical raster map provides a prior for where these elements are likely to exist.
For each valid position \(p\) in the raster map, a position embedding \(PE(p)\) and a semantic label embedding \(LE(p)\) are encoded and summed to obtain the map prior embedding \(ME(p) = PE(p) + LE(p)\).
Base queries interact with the map prior embeddings via cross-attention before being fed into the original decoder layers. This enables queries to locate target elements more efficiently.
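The prior embedding and the query interaction can be sketched in NumPy as below. Several choices are assumptions made for illustration: the sinusoidal form of \(PE\), a random table standing in for the learned \(LE\), one embedding per (position, class) pair, and a single attention head with a residual connection and no learned projections.

```python
import numpy as np

def sinusoidal_pe(positions, dim):
    """Sinusoidal embedding of (row, col) positions; dim must be divisible by 4."""
    positions = np.asarray(positions, dtype=np.float64)
    freqs = 1.0 / (10000 ** (np.arange(0, dim // 2, 2) / (dim // 2)))
    parts = []
    for axis in range(2):
        angles = positions[:, axis:axis + 1] * freqs  # (P, dim / 4)
        parts += [np.sin(angles), np.cos(angles)]
    return np.concatenate(parts, axis=1)  # (P, dim)

def map_prior_embeddings(raster, label_table):
    """ME(p) = PE(p) + LE(p) for every (position, class) with a map element."""
    rows, cols, labels = np.nonzero(raster)
    positions = np.stack([rows, cols], axis=1)
    return sinusoidal_pe(positions, label_table.shape[1]) + label_table[labels]

def cross_attention(queries, memory):
    """Single-head scaled dot-product attention with a residual connection."""
    if memory.shape[0] == 0:
        # No prior available (e.g. first visit to an area): queries are unchanged.
        return queries
    scores = queries @ memory.T / np.sqrt(queries.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return queries + weights @ memory

rng = np.random.default_rng(0)
dim = 16
raster = np.zeros((10, 10, 3), dtype=np.uint8)
raster[5, 2:8, 0] = 1  # six lane-divider cells
raster[1, 1, 2] = 1    # one road-boundary cell
label_table = rng.normal(size=(3, dim))  # stand-in for learned label embeddings
ME = map_prior_embeddings(raster, label_table)
base_queries = rng.normal(size=(50, dim))
init_queries = cross_attention(base_queries, ME)
```

The initialized queries would then go to the unchanged decoder layers.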
To control memory overhead, the local raster map is downsampled (default resolution of 0.6 m) before extracting the prior embeddings.
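One simple way to perform such downsampling is block-wise max pooling, sketched below. The pooling operator and the assumption that the raster starts at 0.3 m cells (so a factor of 2 gives 0.6 m) are illustrative, not taken from the paper.

```python
import numpy as np

def downsample_raster(raster, factor=2):
    """Max-pool a binary (H, W, N) raster by an integer factor.

    A coarse cell is set if any of the fine cells it covers is set, so thin
    elements such as lane dividers are not dropped.
    """
    H, W, N = raster.shape
    H2, W2 = H // factor, W // factor
    trimmed = raster[:H2 * factor, :W2 * factor]
    blocks = trimmed.reshape(H2, factor, W2, factor, N)
    return blocks.max(axis=(1, 3))

fine = np.zeros((4, 6, 3), dtype=np.uint8)
fine[1, 2, 0] = 1
coarse = downsample_raster(fine, factor=2)
```

Halving the resolution divides the number of candidate positions by four, and with it the number of prior embeddings each query attends to.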
Simple Yet Effective Concept: Rasterizing historical predictions into a global map and feeding it back into online perception is conceptually simple but practically powerful, boosting the mAP by 4 to 6 points.
Plug-and-play: The framework can be integrated with most existing vectorized map perception methods. The paper validates its effectiveness with two representative baselines, MapTRv2 and StreamMapNet.
Extremely Low Storage Cost: The global rasterized map only requires approximately 1 MB/km, which is orders of magnitude lower than BEV feature-based methods (e.g., NMP requires 11 GB).
High Robustness: The method is robust to localization errors, with performance almost unaffected when pose noise stays within typical localization accuracy.
Great Potential for Practical Applications: It is naturally suited for crowdsourced map perception scenarios, where multiple vehicles collectively maintain a global map.
The paper primarily focuses on utilizing historical raster maps. However, the quality of the raster map depends on online perception accuracy. When first driving into a new area, no historical information is available, and the system degenerates to pure online perception.
The update strategy of the global map is relatively simple (fixed incremental/decremental values), lacking a confidence-based adaptive update mechanism.
The training memory overhead of the Query Initialization module reaches 65 GB at a 0.3 m resolution, which necessitates downsampling to control resource usage and may cause a loss of fine-grained information.
The evaluation is conducted only on nuScenes and Argoverse 2. Both datasets feature geographical overlaps between their train and validation sets, so the performance on completely unseen areas warrants further verification.
Map obsolescence and scene dynamics are not addressed: information in historical maps may no longer be valid after road construction or other changes.
The idea of using rasterized maps as lightweight containers for historical information can be extended to other BEV perception tasks, such as 3D object detection and occupancy prediction.
Crowdsourced map perception is a highly valuable direction for practical application. Multi-vehicle collaborative construction and sharing of global maps can further enhance each individual vehicle's perception capabilities.
The design approach of Query Initialization (guiding DETR queries with prior info to narrow down the search space) is highly generalizable and can be applied to prior injection in other detection or segmentation tasks.
The initial map experiment (achieving 83.7 mAP with the train set map) implies that when high-quality historical maps are available, online perception accuracy can improve dramatically, which is highly significant for real-world deployment.
Novelty: ⭐⭐⭐⭐ (Although simple conceptually, the insight is solid; using raster maps as a low-cost historical information carrier is a smart engineering choice)
Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Two datasets, two baselines, comprehensive ablation, and rich auxiliary experiments on robustness and initial maps)
Writing Quality: ⭐⭐⭐⭐ (Clear logic, informative figures and tables)
Value: ⭐⭐⭐⭐ (Direct reference value for actual deployment of autonomous driving map perception)