RESBev: Making BEV Perception More Robust

Conference: CVPR 2026 · arXiv: 2603.09529 · Code: N/A
Area: Autonomous Driving / BEV Perception Robustness
Keywords: BEV perception, robustness, world model, adversarial attack, plug-and-play

TL;DR

This paper proposes RESBev, a plug-and-play robustness enhancement framework for BEV perception. It employs a latent-space world model to predict clean BEV semantic priors from historical frames, and an anomaly reconstructor that fuses these priors with corrupted current observations via cross-attention. On nuScenes, RESBev achieves an average improvement of 15–20 IoU points across four LSS-based models under 10 types of perturbations (including natural corruptions and adversarial attacks), and generalizes to corruption types unseen during training.

Background & Motivation

Background: BEV perception is a core representation for autonomous driving. LSS-based methods (BEVFusion, BEVFormer, FIERY, etc.) achieve strong results on benchmarks such as nuScenes. However, these models are extremely fragile in real-world deployment—under natural corruptions (fog, darkness, snow, camera failure, frame loss) or adversarial attacks (FGSM, PGD, C&W), IoU can plummet from 33 to 9.

Limitations of Prior Work: Existing defense strategies suffer from four major limitations: (1) multi-modal fusion relies on expensive LiDAR sensors and assumes the redundant sensors themselves remain reliable; (2) simple temporal aggregation cannot filter adversarial perturbations, since adversarial features are numerically nearly indistinguishable from clean features; (3) adversarial training only defends against attack types seen during training and fails to generalize; (4) most methods are tightly coupled to specific architectures.

Key Challenge: Adversarial attacks generate feature-space perturbations with extremely low MSE yet catastrophic semantic impact—simple attention-based aggregation cannot distinguish adversarially corrupted features from clean ones. A mechanism is needed to "bypass" the corrupted current observation and generate a clean prior from historical information.
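This low-MSE / high-impact asymmetry is easy to see in a toy example (all numbers are illustrative and unrelated to the paper's BEV features): an FGSM-style step of tiny per-element magnitude flips the decision of a linear scorer while the perturbation's MSE stays near zero.

```python
# Toy illustration (not from the paper): an FGSM-style perturbation with
# near-zero MSE completely flips the sign of a linear score.
n = 1000
w = [1.0] * n                    # toy linear "feature aggregator": score = sum(w_i * x_i)
x = [0.01] * n                   # clean features, score = +10
eps = 0.02                       # per-element FGSM step: x_adv = x - eps * sign(w)
x_adv = [xi - eps for xi in x]   # sign(w_i) = +1 everywhere here

mse = sum((a - b) ** 2 for a, b in zip(x_adv, x)) / n     # = eps^2 = 4e-4, numerically tiny
score_clean = sum(wi * xi for wi, xi in zip(w, x))        # ~ +10
score_adv = sum(wi * xi for wi, xi in zip(w, x_adv))      # ~ -10: decision flipped
print(mse, score_clean, score_adv)
```

A detector that only compares feature magnitudes (MSE-like criteria) cannot see the attack; only the semantic consequence reveals it, which is why the paper argues for predicting a clean prior instead of filtering the corrupted observation.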

Goal: To develop a lightweight, model-agnostic, and generalizable BEV robustness enhancement solution that can be inserted into any LSS-based model and handles both natural corruptions and adversarial attacks.

Key Insight: Driving scenes exhibit strong temporal consistency—the BEV state of the current frame can be reasonably predicted from historical frames and ego-motion. The robustness problem is thus reformulated as a temporal prediction problem: a world model generates an "expected state" from historical clean frames, which is then selectively fused with the actual (potentially corrupted) observation.

Core Idea: A latent-space world model predicts a clean BEV semantic prior for the current frame, which is then fused with the corrupted current observation via gated cross-attention to achieve adaptive recovery from arbitrary perturbation types.

Method

Overall Architecture

RESBev is inserted as a plug-and-play module at the Splat stage of the LSS pipeline. It consists of two core components: (1) a Semantic Prior Predictor that predicts a clean BEV prior for the current frame from the previous frame's reconstructed features and ego-motion, and (2) an Anomaly Reconstructor that adaptively fuses the predicted prior with the corrupted current observation via cross-attention to selectively extract useful information.

Key Designs

  1. Architecture Decisions Driven by Three-Level Spatial Analysis:

    • Function: Answers two core architectural questions—"at which stage of the LSS pipeline to intervene" and "which mechanism to employ."
    • Mechanism: Systematic experiments analyze three dimensions: (1) Spatial selection: BEV space (Splat) exhibits far greater temporal consistency than image space (Lift)—under persistent corruption, BEV features remain stable while image features fluctuate dramatically; (2) Depth selection: operating on high-dimensional semantic features (Splat) rather than low-dimensional task outputs (Shoot)—the latter compresses information such that post-recovery IoU is only 18.7 vs. 31.6 at Splat; (3) Mechanism selection: generative prediction (world model) substantially outperforms temporal attention aggregation (30.11 vs. 20.17), because adversarial perturbations are nearly imperceptible in feature space (low MSE) yet semantically catastrophic.
    • Design Motivation: These ablation experiments provide quantitative justification for each architectural decision, making the final design analytically grounded rather than heuristic.
  2. Semantic Prior Predictor (LDWM):

    • Function: Predicts the clean BEV features of the current frame from the previous frame's reconstructed features and ego-motion.
    • Mechanism: \(f_t^{pred} = D(\text{LDWM}(\text{Concat}(E_{vis}(f_{t-1}^{rec}), E_{act}(a_{t-1}))))\). A visual encoder \(E_{vis}\) projects the previous frame's reconstructed features into a compact latent space; an action encoder \(E_{act}\) encodes ego-motion (translation + rotation); the concatenated representation is fed into a Transformer-based world model (LDWM) to model state transitions; a decoder \(D\) maps back to the dense BEV feature space.
    • Design Motivation: Modeling transitions in the compact latent space rather than the high-dimensional feature space is computationally efficient. Using reconstructed features (rather than raw corrupted features) as input avoids error propagation.
  3. Anomaly Reconstructor (Gated Cross-Attention Fusion):

    • Function: Adaptively fuses the predicted prior with the corrupted current observation, preserving novel information (e.g., suddenly appearing vehicles) while rejecting noise.
    • Mechanism: \(f_t^{rec} = f_t^{pred} + \alpha \cdot \text{CrossAttn}(f_t^{pred}, \text{Concat}(f_{t-1}^{rec}, f_t^{corrupt}))\). The predicted prior \(f_t^{pred}\) serves as the Query; the concatenation of the previous reconstructed features and the current corrupted features serves as Key/Value. A learnable gating factor \(\alpha \in [0,1]\) controls information flow—when corruption is severe, \(\alpha\) decreases automatically, relying more on the historical prior; when the current observation is reliable, \(\alpha\) increases to incorporate new information.
    • Design Motivation: The predicted prior cannot handle sudden events (e.g., a vehicle appearing unexpectedly) and therefore cannot fully replace the current observation. The gated residual connection enables the model to adaptively balance "trusting the prior" and "leveraging the current observation."
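Taken together, the two equations above admit a compact PyTorch sketch. All concrete choices below (feature and latent dimensions, a 6-DoF ego-motion vector, prepending the action as an extra token, two Transformer layers, parameterizing \(\alpha\) as the sigmoid of a learnable scalar) are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SemanticPriorPredictor(nn.Module):
    """Latent-space world model (LDWM) sketch: predicts a clean BEV prior for
    frame t from the previous frame's reconstruction and ego-motion."""
    def __init__(self, dim=64, latent=32, action_dim=6, heads=4):
        super().__init__()
        self.e_vis = nn.Linear(dim, latent)         # visual encoder E_vis
        self.e_act = nn.Linear(action_dim, latent)  # action encoder E_act (translation + rotation)
        layer = nn.TransformerEncoderLayer(latent, heads, batch_first=True)
        self.ldwm = nn.TransformerEncoder(layer, num_layers=2)  # transition model
        self.dec = nn.Linear(latent, dim)           # decoder D back to dense BEV features

    def forward(self, f_prev_rec, action):
        z = self.e_vis(f_prev_rec)               # (B, N, latent)
        a = self.e_act(action).unsqueeze(1)      # (B, 1, latent), prepended as an action token
        h = self.ldwm(torch.cat([a, z], dim=1))  # jointly attend over action + visual tokens
        return self.dec(h[:, 1:])                # f_t^pred, action token dropped

class AnomalyReconstructor(nn.Module):
    """Gated cross-attention fusion: f_rec = f_pred + alpha * CrossAttn(...)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # alpha = sigmoid(gate) stays in [0, 1]

    def forward(self, f_pred, f_prev_rec, f_corrupt):
        kv = torch.cat([f_prev_rec, f_corrupt], dim=1)  # Key/Value: history + corrupted current frame
        delta, _ = self.attn(f_pred, kv, kv)            # Query: the predicted clean prior
        return f_pred + torch.sigmoid(self.gate) * delta

# BEV features flattened to (batch, H*W tokens, channels); ego-motion as a 6-DoF vector
B, N, C = 2, 100, 64
f_prev_rec = torch.randn(B, N, C)
pred = SemanticPriorPredictor()(f_prev_rec, torch.randn(B, 6))
rec = AnomalyReconstructor()(pred, f_prev_rec, torch.randn(B, N, C))
print(pred.shape, rec.shape)
```

Note that this sketch uses a single global scalar gate; if the paper's \(\alpha\) is spatially adaptive, a per-location gate predicted from the features would be the natural refinement.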

Loss & Training

Training follows an ELBO objective derived from a probabilistic graphical model, comprising three terms: (1) observation reconstruction likelihood of the predicted prior; (2) task-label likelihood of the reconstructed features; (3) a KL regularization term. The Predictor and Reconstructor are trained jointly. Few-shot fine-tuning suffices to adapt to different LSS baseline models. Training is conducted on a single A100-80GB GPU with batch size 16.
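The three-term objective might be combined as in the following hedged sketch; the concrete loss forms, weights, and the diagonal-Gaussian latent are assumptions on my part, whereas the paper derives the exact terms from its ELBO:

```python
import torch
import torch.nn.functional as F

def resbev_loss(f_pred, f_clean, task_logits, task_labels, mu, logvar,
                w_rec=1.0, w_task=1.0, w_kl=0.01):
    # (1) observation reconstruction likelihood of the predicted prior
    l_rec = F.mse_loss(f_pred, f_clean)
    # (2) task-label likelihood of the reconstructed features (BEV segmentation)
    l_task = F.binary_cross_entropy_with_logits(task_logits, task_labels)
    # (3) KL regularization pulling the latent state toward a standard normal
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return w_rec * l_rec + w_task * l_task + w_kl * l_kl

# Dummy shapes: BEV features (B, C, H, W), one segmentation channel, 16-dim latent
loss = resbev_loss(torch.randn(2, 64, 8, 8), torch.randn(2, 64, 8, 8),
                   torch.randn(2, 1, 8, 8), torch.rand(2, 1, 8, 8).round(),
                   torch.randn(2, 16), torch.randn(2, 16))
print(float(loss))
```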

Key Experimental Results

Main Results (Average over Three Severity Levels, Seen Corruptions)

Corruption      LSS Vanilla   LSS+RESBev   Gain     FIERY Vanilla   FIERY+RESBev   Gain
FGSM            10.28         28.42        +18.14   11.89           32.46          +20.57
PGD              9.17         31.47        +22.30    8.03           32.44          +24.41
Fog              9.93         28.39        +18.46   12.98           31.79          +18.81
Frame Lost      10.65         28.33        +17.68   15.62           31.62          +16.00
Overall Avg.     9.96         29.02        +19.06   12.08           31.98          +19.90

Generalization to Unseen Corruptions

Corruption       LSS Vanilla   LSS+RESBev       GaussianLSS Vanilla   GaussianLSS+RESBev
C&W (unseen)      8.78         30.80 (+22.02)    5.97                 31.24 (+25.27)
Snow (unseen)    10.26         28.35 (+18.09)   16.08                 32.10 (+16.02)
Dark (unseen)     8.11         28.36 (+20.25)   17.68                 31.96 (+14.28)
Noise (unseen)    8.64         28.27 (+19.63)   16.67                 31.43 (+14.76)
Overall Avg.      9.17         28.82 (+19.65)   13.96                 31.66 (+17.70)

Ablation Study

Configuration               LSS     SimpleBEV   GaussianLSS   FIERY
Predictor only              26.67   30.11       29.16         29.79
Predictor + Reconstructor   29.00   32.80       31.59         31.98
Relative Gain               +8.7%   +8.9%       +8.3%         +7.4%

Key Findings

  • Strongest recovery under adversarial attacks: Under PGD, FIERY recovers from 8.03 to 32.44 (+24.41), nearly reaching clean IoU.
  • Strong generalization: Gains of 17–20 IoU points are obtained on 5 unseen corruption types, indicating that the model has learned a general notion of "what a normal state should look like."
  • Stability under continuous corruption: IoU remains nearly unchanged over 10 consecutive corrupted frames (FGSM: 28.42 → 28.58), with no error accumulation.
  • Consistent Reconstructor gains: The Reconstructor yields consistent 7–9% additional improvement across all 4 baselines, demonstrating the value of selectively extracting information from current observations.
  • GraphBEV achieves the highest clean performance (61.47) but averages only 24 IoU under corruption, far below models equipped with RESBev (29–32).

Highlights & Insights

  • Reformulating robustness as temporal prediction: This perspective shift is particularly elegant—rather than repairing the corrupted current features, the framework "predicts" what the current state should be from history and selectively supplements it with new information from the current observation. This paradigm is broadly applicable to any perception system with temporal continuity.
  • Three-level analysis-driven design: Ablations along the spatial, depth, and mechanism dimensions provide quantitative justification for each architectural decision. In particular, the observation that "adversarial perturbations have extremely low MSE in feature space yet are semantically catastrophic" explains why simple aggregation fails.
  • Plug-and-play design: Effectiveness across 4 different LSS models demonstrates architectural generality. Few-shot fine-tuning enables low-cost deployment.

Limitations & Future Work

  • Assumes clean historical frames: The approach assumes the previous frame's reconstructed features are clean. If multiple consecutive frames are attacked, errors may accumulate over time (although experiments show stability within 10 steps, longer sequences are not tested).
  • Evaluation limited to nuScenes: Generalization to other autonomous driving datasets (Waymo, KITTI) is not validated.
  • BEV semantic segmentation as the sole task: The robustness enhancement effect on other BEV downstream tasks (3D object detection, motion prediction, etc.) is not verified.
  • Computational overhead not analyzed in detail: Inference latency of the world model and cross-attention is not reported, which may be a bottleneck for latency-critical autonomous driving deployment.

Comparison with Related Methods

  • vs. GraphBEV: GraphBEV enhances robustness via graph-based reasoning and achieves the strongest clean performance (61.47), but averages only 24 IoU under corruption, far below models equipped with RESBev (29–32).
  • vs. BEVFormer temporal aggregation: BEVFormer aggregates historical frames via temporal self-attention, but this aggregation cannot distinguish clean from corrupted features; RESBev bypasses current corruption through generative prediction.
  • vs. Adversarial training: Adversarial training only defends against attack types seen during training, whereas RESBev generalizes to unseen corruption types.

Rating

  • Novelty: ⭐⭐⭐⭐ — The perspective of reformulating robustness as temporal prediction is highly insightful.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 4 baseline models × 10 corruption types × 3 severity levels, with generalization to unseen corruptions and continuous corruption testing.
  • Writing Quality: ⭐⭐⭐⭐ — The three-level analytical logic is clear and the ablation design is well-structured.
  • Value: ⭐⭐⭐⭐ — Practically significant for the safe deployment of autonomous driving systems.