Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention

Conference: ECCV 2024
arXiv: 2407.06683
Code: No public code
Area: Autonomous Driving / Online Map Estimation / Behavior Prediction
Keywords: BEV features, online mapping, trajectory prediction, attention mechanism, inference acceleration

TL;DR

This paper proposes to directly expose internal BEV features from online map estimation models to downstream trajectory prediction models (instead of just passing decoded vectorized maps). Through three BEV feature injection strategies, the proposed method achieves up to a 73% acceleration in inference and up to a 29% improvement in prediction accuracy.

Background & Motivation

In autonomous driving systems, perceiving the static environment (road layout, lane lines, etc.) is a key input for downstream behavior prediction and motion planning. Traditional solutions rely on pre-annotated High-Definition (HD) maps, which are expensive to maintain and scale poorly. More recently, online map estimation methods have emerged that use an "encoder-decoder" architecture: the encoder transforms multi-camera images into BEV feature grids, and the decoder predicts vectorized map elements (lane lines, boundaries, etc.) from them.

However, this sequential pipeline of "first decoding the map, then feeding it to the prediction model" suffers from two major limitations:
1. Information Bottleneck: The Transformer decoder compresses rich intermediate BEV features into sparse point sets, discarding a large amount of useful information that downstream tasks cannot access.
2. Computational Redundancy: The attention mechanism of the map decoder consumes most of the model's runtime, creating a speed bottleneck for the entire system.

Core Problem

Can the map decoding stage be bypassed, allowing the trajectory prediction model to directly access BEV features generated by the online mapping model's encoder? Can this simultaneously speed up inference and improve prediction accuracy?

Method

Overall Architecture

The core idea is straightforward: in the existing "online mapping \(\rightarrow\) trajectory prediction" pipeline, the prediction model is no longer restricted to consuming only the decoded vectorized maps. Instead, it is granted direct access to the BEV feature grid output by the encoder. Specifically, the authors propose three BEV feature injection strategies corresponding to different levels of integration depth.

The input consists of multi-view camera images, which are processed by the encoder of the online mapping model (ResNet-50 backbone + PV2BEV transformation, such as BEVFormer or LSS) to obtain BEV features \(B_t \in \mathbb{R}^{H \times W \times D}\). The BEV features are then split, in a ViT-like manner, into a sequence of \(N = HW/P^2\) patches of \(P \times P\) cells each (an \(N \times (P^2 \cdot D)\) tensor), and linearly projected to obtain \(N \times D\) patch embeddings as the input sequence for the Transformer.
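
The ViT-style patching step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the names `patchify` and `project`, the toy 4×4×2 grid, and the fixed weight matrix are all made up for the example, and in the actual model the projection is learned.

```python
def patchify(bev, patch):
    """Split an H x W x D grid (nested lists) into N flattened patches.

    Returns N = (H // patch) * (W // patch) vectors of length patch * patch * D.
    """
    height, width = len(bev), len(bev[0])
    assert height % patch == 0 and width % patch == 0, "grid must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    flat.extend(bev[row][col])
            patches.append(flat)
    return patches


def project(patches, weight):
    """Linear projection of each flattened patch with a (P*P*D) x D_out matrix."""
    out_dim = len(weight[0])
    return [
        [sum(x * w_row[j] for x, w_row in zip(vec, weight)) for j in range(out_dim)]
        for vec in patches
    ]


# Toy grid: H = W = 4, D = 2, so P = 2 gives N = 4 patches of length 8.
bev = [[[float(r * 4 + c), -float(r * 4 + c)] for c in range(4)] for r in range(4)]
patches = patchify(bev, patch=2)
# Hypothetical fixed weights (8 x 2) that sum each channel over the patch.
weight = [[1.0, 0.0], [0.0, 1.0]] * 4
embeddings = project(patches, weight)
```

Each row of `embeddings` plays the role of one \(D\)-dimensional patch token; a real implementation would use a tensor reshape and a learned linear layer instead of nested loops.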

Key Designs

Strategy 1: Agent-BEV Feature Attention (Replacing agent-lane interaction)
In the HiVT prediction model, BEV patch attention completely replaces the original agent-lane local interaction encoder. Specifically, the BEV patch corresponding to each agent's position is selected as the Query \(Q_A \in \mathbb{R}^{M \times D}\), and all BEV patches serve as Key/Value \(K_M = V_M \in \mathbb{R}^{N \times D}\). The agent-BEV embedding is obtained via multi-head attention: \(\mathbf{e}_A = \text{MHA}(Q_A, K_M, V_M)\). This completely eliminates the need for vectorized lane information and reduces the attention cost to \(O(M \times N)\) for \(M\) agents and \(N\) patches, with \(M \ll N\). Key advantage: it bypasses the time-consuming map decoding stage to achieve end-to-end acceleration.
Strategy 2: BEV Feature-Enhanced Lane Vertices (Preserving and enhancing lane info)
For prediction models like DenseTNT that rely heavily on vectorized maps, lane information cannot be discarded completely. This strategy aligns the dimensions of BEV features using a 1D CNN, locates the corresponding BEV grid position for each map vertex based on its spatial coordinates, and concatenates the raw vertex coordinate features with the BEV features. The enhanced map elements are then encoded by a VectorNet backbone (with hidden dimensions doubled to accommodate the increased feature size). The goal is to supplement vectorized maps with implicit semantics that are difficult to express, such as drivable areas and road textures.
Strategy 3: Temporal BEV Features Replacing Agent Trajectory Info
This strategy utilizes temporal BEV features from StreamMapNet (which fuse historical BEV features). Since temporal BEV features encode both static road information and implicit dynamic agent motion, the agent-BEV attention from Strategy 1 can directly replace the agent subgraph encoded by VectorNet in DenseTNT. Consequently, agent trajectory inputs are completely discarded, and the prediction model no longer requires explicit agent historical trajectories.
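
The core operations of Strategies 1 and 2 can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses a single attention head with no learned query/key/value projections, assumes the grid origin sits at coordinate (0, 0), and assumes the BEV features passed to `enhance_vertices` have already been reduced in dimension (the 1D CNN step is omitted). All function names and toy values are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def patch_index(x, y, patch_m, cols):
    """Index of the patch containing point (x, y); grid origin assumed at (0, 0)."""
    return int(y // patch_m) * cols + int(x // patch_m)


def agent_bev_attention(agent_xy, patches, patch_m, cols):
    """Strategy 1 sketch: each agent's patch is the query, all patches are keys/values."""
    dim = len(patches[0])
    outputs = []
    for x, y in agent_xy:                      # M agents
        query = patches[patch_index(x, y, patch_m, cols)]
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in patches]          # N scores per agent, so O(M * N) overall
        weights = softmax(scores)
        outputs.append([sum(w * value[d] for w, value in zip(weights, patches))
                        for d in range(dim)])
    return outputs


def enhance_vertices(vertices, bev, cell_m):
    """Strategy 2 sketch: concatenate each (x, y) vertex with the BEV feature of its cell."""
    rows, cols = len(bev), len(bev[0])
    enhanced = []
    for x, y in vertices:
        row = min(max(int(y // cell_m), 0), rows - 1)   # clamp to the grid
        col = min(max(int(x // cell_m), 0), cols - 1)
        enhanced.append([x, y] + list(bev[row][col]))
    return enhanced


# Toy example: a 2 x 2 grid of patch embeddings (D = 2) and one agent.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
embedding = agent_bev_attention([(7.0, 1.0)], patches, patch_m=6.0, cols=2)

# Toy example: a 2 x 2 x 2 BEV grid with 1 m cells and one lane vertex.
bev = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
enhanced = enhance_vertices([(1.5, 0.2)], bev, cell_m=1.0)
```

Strategy 3 reuses the same attention routine, with temporal BEV features supplying the keys and values, so it needs no separate sketch.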

Loss & Training

The map estimation model is first trained independently until convergence. The encoder is then frozen, and BEV features for all scenes are extracted and cached.
The prediction model is trained under three settings: Baseline (vectorized map only), Uncertainty (adding map uncertainty), and Ours (adding BEV features).
Hyperparameters such as learning rate, weight decay, and dropout are adjusted individually for different map-prediction combinations (e.g., in Strategy 2, the learning rate is decreased to \(10^{-4}\), weight decay is increased from 0.01 to 0.05, and dropout is increased from 0.1 to 0.2).
All models are trained on a single RTX 4090 GPU.
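
The two-stage scheme above (freeze the encoder, then cache its outputs) amounts to computing each scene's BEV features once and reusing them across epochs. A minimal sketch, assuming a hypothetical `encoder` callable keyed by a scene token; the dictionary keys in `STRATEGY2_OVERRIDES` are illustrative names, while the values are the ones quoted above.

```python
class CachedBEV:
    """Run a frozen encoder once per scene and reuse the result afterwards."""

    def __init__(self, encoder):
        self.encoder = encoder      # hypothetical callable: scene token -> BEV features
        self.cache = {}
        self.calls = 0              # number of real encoder invocations

    def __call__(self, token):
        if token not in self.cache:
            self.calls += 1
            self.cache[token] = self.encoder(token)
        return self.cache[token]


# Hyperparameter changes reported for Strategy 2 (key names are assumptions).
STRATEGY2_OVERRIDES = {"learning_rate": 1e-4, "weight_decay": 0.05, "dropout": 0.2}

# Stub encoder standing in for the frozen map-estimation encoder.
features = CachedBEV(lambda token: [[0.0, 0.0]])
for _epoch in range(3):
    for token in ["scene-a", "scene-b"]:
        features(token)
```

After three epochs over two scenes the stub encoder has run only twice, which is the saving that makes training the prediction model on a single GPU practical.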

Key Experimental Results

Dataset: nuScenes (2Hz annotation; prediction models take 2 seconds of observation to predict 3 seconds of the future).
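
For reference, the three metrics reported in the results can be computed per agent as sketched here. This follows the common definitions (best-of-K average and final displacement error, plus a miss flag); the 2.0 m miss threshold is a widely used convention and an assumption here, since the summary does not state it. The function names are hypothetical.

```python
import math


def displacement(pred, truth):
    """Per-step Euclidean distance between two trajectories of (x, y) points."""
    return [math.dist(p, t) for p, t in zip(pred, truth)]


def evaluate(candidates, truth, miss_threshold=2.0):
    """Return (minADE, minFDE, missed) for one agent with K candidate trajectories."""
    errors = [displacement(c, truth) for c in candidates]
    min_ade = min(sum(e) / len(e) for e in errors)   # best average error over K
    min_fde = min(e[-1] for e in errors)             # best final-step error over K
    return min_ade, min_fde, min_fde > miss_threshold


# 3 s of future at 2 Hz gives 6 future steps.
truth = [(float(t), 0.0) for t in range(1, 7)]
exact = list(truth)
offset = [(x, y + 3.0) for x, y in truth]

best = evaluate([exact, offset], truth)
worst = evaluate([offset], truth)
```

The miss rate (MR) is then the fraction of evaluated agents whose `missed` flag is true.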

Mapping Model + Prediction Model	Method	minADE↓	minFDE↓	MR↓
MapTR + HiVT	Baseline	0.4234	0.8900	0.0955
MapTR + HiVT	+ Uncertainty	0.4036	0.8372	0.0822
MapTR + HiVT	+ Ours	0.3617	0.7401	0.0720
StreamMapNet + HiVT	Baseline	0.4035	0.8569	0.0996
StreamMapNet + HiVT	+ Ours	0.3800	0.7709	0.0746
MapTR + DenseTNT	Baseline	1.0462	2.0661	0.3494
MapTR + DenseTNT	+ Ours	0.7608	1.4700	0.2593
StreamMapNet + DenseTNT	Baseline	0.8864	1.7050	0.2467
StreamMapNet + DenseTNT	+ Ours	0.7377	1.3661	0.1987
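
The headline "up to 29%" figure can be checked directly from the table as a relative reduction, (baseline - ours) / baseline. The snippet below is a small worked calculation using the MapTR + DenseTNT rows; the function and variable names are illustrative.

```python
def relative_gain(baseline, ours):
    """Relative error reduction in percent (positive means ours is better)."""
    return 100.0 * (baseline - ours) / baseline


# MapTR + DenseTNT rows from the table above: (baseline, ours).
maptr_densetnt = {
    "minADE": (1.0462, 0.7608),
    "minFDE": (2.0661, 1.4700),
    "MR": (0.3494, 0.2593),
}

gains = {name: relative_gain(base, ours) for name, (base, ours) in maptr_densetnt.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.1f}% lower")
```

The minFDE reduction comes out at roughly 29%, which matches the largest improvement quoted in the TL;DR.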

Inference Speed (RTX 4090, Unit: FPS):

Combination	Baseline FPS	Ours	Speedup
HiVT + MapTR	22.4	9.1 ms equivalent	42-73% faster
HiVT + MapTRv2	26.7	13 ms equivalent	35-62% faster
HiVT + StreamMapNet	33.6	29.4 ms equivalent	8-15% faster

Ablation Study

Patch Size: A size of 20×20 (corresponding to 6m×6m in the real world) is optimal. Too small (10×5) lacks spatial context, while too large (40×20) discards too much fine-grained information during linear projection.
BEV Encoder Selection: Encoders using temporal information (such as MapTR or StreamMapNet based on BEVFormer) perform significantly better than LSS-based ones (without temporal fusion, such as MapTRv2). Specifically, temporal BEV features yield relative gains of 16%/19%/24% in minADE/minFDE/MR, whereas non-temporal equivalents yield only 4%/2%/4%.
BEV Features of MapTR vs MapTRv2: MapTR BEV features prove more beneficial than those of MapTRv2, suggesting that the MapTR decoder might introduce noise, which direct BEV feature usage successfully bypasses.
Importance of Centerlines: MapTRv2-CL (which includes centerline prediction) already has a strong baseline prediction accuracy. Consequently, the marginal gain from adding BEV features is small, which indicates that centerlines are crucial for trajectory prediction.
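
The patch-size ablation can be made concrete with a little arithmetic. If a 20×20 patch covers 6 m × 6 m, each BEV cell spans 0.3 m. The 200×100 grid used below is an assumption (the summary does not state the grid size), chosen because all three ablated patch sizes tile it evenly; `patch_stats` is a hypothetical helper.

```python
CELL_M = 6.0 / 20        # metres per BEV cell, from 20 cells covering 6 m
GRID_HW = (200, 100)     # assumed BEV grid size in cells, not stated in the summary


def patch_stats(patch_hw, grid_hw=GRID_HW, cell_m=CELL_M):
    """Return (number of patches, (patch height in m, patch width in m))."""
    grid_h, grid_w = grid_hw
    patch_h, patch_w = patch_hw
    count = (grid_h // patch_h) * (grid_w // patch_w)
    return count, (patch_h * cell_m, patch_w * cell_m)


for patch_hw in [(10, 5), (20, 20), (40, 20)]:
    count, extent = patch_stats(patch_hw)
    print(patch_hw, "->", count, "patches of about", extent, "m")
```

Under this assumption the sequence length \(N\) ranges from 400 tokens for the smallest patch to 25 for the largest, which illustrates the trade-off between spatial context per token and fine-grained detail described in the ablation.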

Highlights & Insights

Simple and Elegant Concept: Instead of designing new mapping or prediction models, this work focuses on the "interface" between them, enhancing overall performance through superior information flow.
Simultaneous Acceleration and Accuracy Boost: Bypassing the map decoder saves computation and avoids decoder-induced noise, achieving a rare win-win situation for both speed and accuracy.
"Dark Knowledge" in BEV Features > Explicit Maps: Experiments confirm that intermediate BEV features contain richer information (such as drivable area semantics and road textures) than decoded vectorized maps. This provides valuable insights for end-to-end autonomous driving architectures.
Temporal BEV Encodes Agent Dynamics: Temporal BEV features from StreamMapNet implicitly model dynamic agent motion in addition to static road geometry, making it possible to completely replace agent trajectory inputs. This discovery has great potential for end-to-end driving.
Plug-and-Play: The three proposed strategies can be flexibly adopted based on the characteristics of downstream prediction models.

Limitations & Future Work

Reduced Interpretability: Replacing explicit vectorized maps with black-box BEV features reduces the interpretability of predicted behaviors, which poses safety concerns in autonomous driving.
Lack of Joint Training: The current approach uses a two-stage training scheme where the mapping model is frozen to extract BEV features before training the prediction model. End-to-end joint optimization with gradients backpropagating to the encoder is yet to be realized.
Validated only on nuScenes: The method has not been verified on larger datasets like Argoverse 2 or Waymo.
Limited Acceleration for Strategies 2 and 3: These strategies still rely on the map decoder to output vectorized lane info, with acceleration gains primarily coming from Strategy 1.
Under-analyzed BEV Feature Space: The PCA visualization is preliminary, and a systematic analysis of the BEV feature space (e.g., what semantics different channels encode) is missing.
Potential Direction: Co-training mapping and prediction models, allowing prediction losses to guide the encoder to generate more prediction-friendly BEV features.

vs. Gu et al. (CVPR 2024, [13]): The latter appends uncertainty information from decoded map elements to the prediction model. This paper extracts richer information earlier (directly from the encoder output) and achieves higher computational efficiency. Results show the proposed method outperforms the uncertainty-based method in almost all combinations.
vs. End-to-End Solutions (e.g., UniAD, VAD): These methods also share internal BEV features but treat mapping as an explicit sub-task (either dense or vectorized prediction), introducing extra computational overhead. The proposed strategies could be integrated into these architectures for further acceleration.
vs. Original HiVT / DenseTNT: The originals require HD Map inputs. This work adapts them to handle BEV features directly, expanding their applicability in HD-map-free scenarios.

Insights & Connections

Implications for End-to-End Driving: In multi-task autonomous driving stacks, the choice of interface between sub-tasks (BEV features vs. vectorized outputs vs. occupancy grids) significantly affects performance and warrants systematic study.
Temporal BEV Encoding Agent Dynamics: If temporal BEV features implicitly contain agent motion, an explicit perception-and-tracking pipeline may not be strictly required for prediction. This opens a promising, minimalist path for end-to-end driving.

Rating

Novelty: ⭐⭐⭐⭐ Intuitive yet highly novel perspective; first to systematically study the direct utilization of intermediate BEV features in prediction.
Experimental Thoroughness: ⭐⭐⭐⭐ Evaluated across 4 mapping models × 2 prediction models × 3 settings with comprehensive ablation, though limited to nuScenes.
Writing Quality: ⭐⭐⭐⭐ Well-structured with a clear flow from problem formulation to methodology and evaluation.
Value: ⭐⭐⭐⭐ Provides practical guidance for AD system design and demonstrates substantial potential for intermediate feature reuse.