🚗 Autonomous Driving

🎞️ ECCV2024 · 53 paper notes

🔥 Top topics: Autonomous Driving ×10 · 3D Object Detection ×8 · Segmentation ×7 · Adversarial Robustness ×6 · Diffusion Models ×4

4D Contrastive Superflows are Dense 3D Representation Learners: This paper proposes SuperFlow, a framework that establishes 4D pre-training objectives from consecutive LiDAR-camera pairs through three modules: view consistency alignment, dense-sparse consistency regularization, and flow-based spatiotemporal contrastive learning. It outperforms prior image-to-LiDAR pre-training methods across 11 heterogeneous LiDAR datasets.
Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention: This paper proposes to directly expose internal BEV features from online map estimation models to downstream trajectory prediction models (instead of just passing decoded vectorized maps). Through three BEV feature injection strategies, the proposed method achieves up to a 73% acceleration in inference and up to a 29% improvement in prediction accuracy.
Adaptive Human Trajectory Prediction via Latent Corridors: This paper introduces prompt tuning to pedestrian trajectory prediction. By adding learnable low-rank visual prompts (termed latent corridors) to the input of a pre-trained trajectory predictor, it achieves highly parameter-efficient adaptation to scene-specific behavioral patterns with less than 0.1% extra parameters, improving ADE by up to 23.9% and 26.8% on synthetic and real-world data, respectively.
Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene: This paper proposes the LiSe method, which incorporates 2D image information into unsupervised 3D object detection. Through adaptive sampling and weak model aggregation strategies in self-paced learning, it significantly improves the detection capability for long-range and small targets.
CarFormer: Self-Driving with Learned Object-Centric Representations: CarFormer introduces object-centric representations learned with self-supervised slot attention into autonomous driving for the first time. On the CARLA Longest6 benchmark, it outperforms PlanT, which relies on precise object attributes, and it can also act as a world model that predicts future states.
CSOT: Cross-Scan Object Transfer for Semi-Supervised LiDAR Object Detection: The CSOT (Cross-Scan Object Transfer) paradigm is proposed, which predicts semantically consistent object placement locations and compatibility scores using a Transformer network. This achieves the first successful object copy-paste augmentation in semi-supervised LiDAR object detection. Combined with a spatial-aware classification loss, it matches the performance of the fully supervised baseline using only 1% of the annotated data.
Detecting As Labeling: Rethinking LiDAR-camera Fusion in 3D Object Detection: This paper derives a fundamental rule from the data labeling process, namely that "image features should not be used for regression tasks", and proposes the DAL paradigm. DAL models the detection process on the labeling process, using LiDAR features alone for regression predictions and fused features for classification predictions. Combined with a simplified training pipeline, DAL sets a new state of the art on nuScenes with 74.0 NDS (val) and 74.8 NDS (test).
DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-directional Structure Alignment: A clustering-based Local-to-Global fusion network, DVLO, is proposed to address the data structure inconsistency between vision and LiDAR through bi-directional structure alignment (image-to-pseudo-point-cloud + point-cloud-to-pseudo-image), achieving state-of-the-art (SOTA) performance on both the KITTI odometry and FlyingThings3D scene flow tasks.
DySeT: A Dynamic Masked Self-distillation Approach for Robust Trajectory Prediction: DySeT proposes a dynamic masked self-distillation approach. By leveraging reinforcement learning-driven priority sampling of informative tokens and knowledge distillation from a complete representation to a masked representation, it significantly enhances the generalization ability and robustness of trajectory prediction models in autonomous driving scenarios.
Enhancing Vectorized Map Perception with Historical Rasterized Maps: This paper proposes HRMapNet, which maintains a low-cost global historical rasterized map to provide complementary prior information for online vectorized map perception. It enhances existing methods at two levels—BEV feature aggregation and query initialization—achieving significant improvements on nuScenes and Argoverse 2.
Equivariant Spatio-Temporal Self-Supervision for LiDAR Object Detection: E-SSL3D proposes a joint spatio-temporal equivariant self-supervised pre-training framework. By jointly training the 3D feature encoder with spatial equivariance (using a classification objective for rotation, and contrastive objectives for translation/scaling/flipping) and temporal equivariance (using 3D scene flow to constrain the consistency of feature transformations between adjacent frames), the detector achieves 3D object detection performance close to training from scratch with 100% data while using only 20% labeled data in low-data scenarios.
FSD-BEV: Foreground Self-Distillation for Multi-View 3D Object Detection: This paper proposes a Foreground Self-Distillation (FSD) framework which constructs teacher-student branches sharing image features within the same model, effectively avoiding the distribution discrepancy challenge in cross-modal distillation. Combined with point cloud intensification and multi-scale foreground enhancement modules, it achieves SOTA performance on nuScenes.
Fully Sparse 3D Occupancy Prediction: This work proposes SparseOcc, the first fully sparse 3D occupancy prediction network. It achieves efficient occupancy prediction via a sparse voxel decoder and a mask-guided Mask Transformer, and it introduces the RayIoU evaluation metric to address the inconsistent penalty that traditional mIoU applies along the depth direction.
GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction: This paper proposes an object-centric 3D semantic Gaussian representation to replace traditional dense voxels. It describes driving scenes using a set of sparse 3D semantic Gaussians and generates occupancy predictions via Gaussian-to-voxel splatting, reducing memory consumption by 75%–82% while yielding comparable performance.
GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection: To address the feature misalignment caused by calibration errors between LiDAR and cameras in multi-modal BEV fusion, this paper proposes the GraphBEV framework. It introduces two modules: LocalAlign (KD-Tree-based neighborhood depth graph matching) and GlobalAlign (global alignment via learnable offsets). GraphBEV achieves 70.1% mAP on nuScenes (outperforming BEVFusion by 1.6%) and outperforms BEVFusion by 8.3% in noisy misalignment scenarios.
H-V2X: A Large Scale Highway Dataset for BEV Perception: Introduces H-V2X, the first large-scale real-world highway V2X BEV perception dataset covering over 100 km of highway segments with over 1.9 million fine-grained annotated samples. It establishes three benchmark tasks (BEV detection, tracking, and trajectory prediction) and proposes an innovative baseline method integrating vector maps.
Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion: Addressing the coarse temporal information utilization in camera-based semantic scene completion (SSC), this paper proposes a Hierarchical Temporal Context Learning (HTCL) paradigm: it first measures fine-grained correspondence between present and historical frames using Cross-frame Pattern Affinity (CPA), and then adaptively samples to compensate for incomplete observations through Affinity-guided Dynamic Refinement (ADR). HTCL ranks 1st on SemanticKITTI, and even surpasses LiDAR-based methods in mIoU on OpenOccupancy.
Improving Agent Behaviors with RL Fine-tuning for Autonomous Driving: Improves supervised-learning-trained traffic agent behavior models via closed-loop reinforcement learning fine-tuning, addressing the distribution shift issue inherent in open-loop training, and achieving state-of-the-art performance on the Waymo simulation benchmark.
ItTakesTwo: Leveraging Peer Representations for Semi-supervised LiDAR Semantic Segmentation: The proposed IT2 framework significantly improves semi-supervised LiDAR semantic segmentation by leveraging consistency learning between peer representations (range image + voxel grid) of LiDAR data as a novel form of perturbation, and introducing cross-distribution contrastive learning based on Gaussian Mixture Models (GMMs).
LiveHPS++: Robust and Coherent Motion Capture in Dynamic Free Environment: This work proposes LiveHPS++, a robust single-LiDAR-based human motion capture method. By utilizing three components—a trajectory-guided body tracker, a noise-insensitive velocity predictor, and a kinematic-aware pose optimizer—it implicitly and explicitly models the dynamics and kinematics of human motion to achieve accurate and coherent global motion capture in complex noisy environments.
MapDistill: Boosting Efficient Camera-based HD Map Construction via Camera-LiDAR Fusion Model Distillation: This paper introduces knowledge distillation into the HD map construction task for the first time, proposing the MapDistill framework. By leveraging a dual BEV transform module, cross-modal relation distillation, dual-level feature distillation, and Map Head distillation, it transfers knowledge from a camera-LiDAR fusion teacher model to a lightweight, camera-only student model. This achieves a +7.7 mAP improvement or a 4.5x speedup on nuScenes.
MapTracker: Tracking with Strided Memory Fusion for Consistent Vector HD Mapping: Redefines online vector HD mapping as a tracking task. It achieves temporally consistent HD map reconstruction through a strided memory buffer fusion mechanism with dual representations (BEV grid + road element vector), significantly outperforming existing methods on nuScenes and Argoverse2 with 76.1 and 76.9 mAP, respectively.
Monocular Occupancy Prediction for Scalable Indoor Scenes: Proposes the ISO (Indoor Scene Occupancy) method, which achieves monocular 3D occupancy prediction for indoor scenes using pre-trained depth models and a D-FLoSP (Dual-feature Line-of-Sight Projection) module, and constructs the Occ-ScanNet benchmark dataset that is 40 times larger than NYUv2.
MonoWAD: Weather-Adaptive Diffusion Model for Robust Monocular 3D Object Detection: MonoWAD is proposed to achieve robust monocular 3D object detection under various weather conditions. It learns clear-weather knowledge as a reference using a weather codebook and performs feature enhancement by modeling foggy effects as noise through a weather-adaptive diffusion model.
Navigation Instruction Generation with BEV Perception and Large Language Models: This paper proposes BEVInstructor, which integrates Bird's-Eye-View (BEV) features into multimodal large language models. Through a Perspective-BEV fusion encoder, parameter-efficient prompt tuning, and an instance-guided iterative refinement strategy, it achieves state-of-the-art performance on both indoor and outdoor navigation instruction generation tasks.
Neural Volumetric World Models for Autonomous Driving: This paper proposes NeMo (Neural Volumetric World Model), an end-to-end autonomous driving framework based on volumetric representation. It represents scenes via 3D voxels, models dynamics through a motion flow module, and integrates future predictions via temporal attention. Trained in a self-supervised manner, NeMo outperforms prior methods by over 18% in driving performance on both nuScenes and CARLA.
NeuroNCAP: Photorealistic Closed-Loop Safety Testing for Autonomous Driving: This paper proposes NeuroNCAP, a photorealistic closed-loop safety testing framework for autonomous driving based on NeRF rendering. Inspired by the Euro NCAP collision avoidance protocols, three types of safety-critical scenarios (stationary, frontal, and side collisions) are designed. It reveals that current state-of-the-art (SOTA) end-to-end planners (UniAD, VAD) fail catastrophically in closed-loop safety scenarios—with collision rates as high as 88-92%—despite their perception modules functioning accurately.
OccGen: Generative Multi-modal 3D Occupancy Prediction for Autonomous Driving: OccGen reformulates 3D semantic occupancy prediction as a generative "noise-to-occupancy" paradigm. It extracts multi-modal features via a conditional encoder and performs diffusion denoising with a progressive refinement decoder, generating occupancy maps step by step in a coarse-to-fine manner. On nuScenes-Occupancy, it improves mIoU by a relative 9.5%, 6.3%, and 13.3% under the multi-modal, LiDAR-only, and camera-only settings, respectively.
OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving: OccWorld proposes learning a world model in 3D occupancy space. It tokenizes 3D occupancy via VQ-VAE and predicts future scene evolution and ego-vehicle trajectories autoregressively using a GPT-style spatial-temporal generative Transformer, achieving competitive planning performance on nuScenes without requiring instance or HD map annotations.
OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection: OPEN predicts object center depth from pixel-specific depth priors using an Object-wise Depth Encoder (ODE) and designs an Object-wise Position Embedding (OPE) that injects this information into the Transformer decoder to generate 3D object-aware features, achieving state-of-the-art performance of 64.4% NDS on nuScenes.
Optimizing Diffusion Models for Joint Trajectory Prediction and Controllable Generation: This paper proposes two techniques, Optimal Gaussian Diffusion (OGD) and Estimated Clean Manifold (ECM) Guidance. By optimizing the diffusion prior distribution and directly injecting guidance gradients onto the clean manifold respectively, they reduce the diffusion steps for joint trajectory prediction to 1/12 and the guided sampling steps to 1/5 of the baseline, while achieving superior performance on Argoverse 2.
PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation: Proposes the first panoramic video object segmentation dataset PanoVOS (150 videos, 19K instance annotations), revealing that existing VOS models fail to handle pixel discontinuity and severe distortion in panoramic videos, and designs PSCFormer to address left-right boundary continuity using panoramic spatial consistency attention.
Progressive Pretext Task Learning for Human Trajectory Prediction: Proposes a progressive pretext task learning framework, PPT, which progressively enhances the model's ability to capture short-term dynamics and long-term dependencies through three-stage training (step-by-step next-position prediction → destination prediction → complete trajectory prediction). Together with an efficient two-step non-autoregressive Transformer predictor, it achieves SOTA on multiple pedestrian trajectory prediction benchmarks.
Random Walk on Pixel Manifolds for Anomaly Segmentation of Complex Driving Scenes: Proposes Random Walk on Pixel Manifolds (RWPM), which utilizes random walks to capture the manifold structure of pixel embeddings to correct manifold distortions caused by the diversity of driving scenes. This improves the accuracy of anomaly segmentation scoring functions and allows for plug-and-play integration into existing anomaly segmentation frameworks without requiring additional training.
RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation: This paper proposes RAPiD (Range-Aware Pointwise Distance Distribution) features, a local geometric representation for LiDAR point clouds that is invariant to rigid transformations and adaptive to changes in point density. Combined with a dual-stage nested autoencoder and channel attention-based fusion, it achieves state-of-the-art segmentation performance on SemanticKITTI (76.1 mIoU) and nuScenes (83.6 mIoU).
Reason2Drive: Towards Interpretable and Chain-Based Reasoning for Autonomous Driving: This paper constructs the Reason2Drive benchmark dataset (600K+ video-text pairs, covering perception-prediction-reasoning chain tasks), proposes ADRScore as a new metric to evaluate the correctness of chain-based reasoning, and designs a Prior Tokenizer + Instructed Vision Decoder framework to enhance the object-level perception and reasoning capabilities of VLMs, significantly outperforming all baselines on autonomous driving reasoning tasks.
Reliability in Semantic Segmentation: Can We Use Synthetic Data?: This work presents the first systematic utilization of Stable Diffusion to generate synthetic OOD data for a comprehensive reliability assessment of semantic segmentation models, encompassing robustness evaluation under covariate shift, OOD object detection, and model calibration. It demonstrates that evaluation results on synthetic data correlate highly with those on real OOD data.
Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather: Identifying two core interference patterns of adverse weather on LiDAR (geometric perturbation and point loss) through a data-centric analysis, this paper proposes two targeted data augmentation methods: Selective Jittering and Learnable Point Drop, achieving SOTA by improving the baseline by 8.1 mIoU on the SemanticKITTI→SemanticSTF benchmark.
Rethinking LiDAR Domain Generalization: Single Source as Multiple Density Domains: This work proposes a Density-Discriminative Feature Embedding (DDFE) module that leverages the inherent density diversity in a single LiDAR source domain (dense nearby and sparse far away) to learn density-aware feature representations, achieving generalization to unseen domains under different sensor configurations without requiring target domain data.
Risk-Aware Self-Consistent Imitation Learning for Trajectory Planning in Autonomous Driving: RaSc proposes a risk-aware self-consistent imitation learning framework. By introducing a Time-to-Collision (TTC) prediction branch to learn the risk-aversion motivations behind human driving behaviors and enforcing a self-consistency constraint to help the planner comprehend the physical consequences of its own actions, RaSc outperforms prior learning-based methods on both open-loop and closed-loop evaluations of the nuPlan dataset.
RoofDiffusion: Constructing Roofs from Severely Corrupted Point Data via Diffusion: RoofDiffusion proposes an end-to-end self-supervised method based on conditional diffusion probabilistic models to restore complete and clean elevation information from severely sparse (up to 99% missing), incomplete (80% area occluded), and noisy roof height maps. It significantly outperforms traditional interpolation methods and existing depth completion methods on the self-created PoznanRD dataset and BuildingNet.
Safe-Sim: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries: Safe-Sim proposes a closed-loop safety-critical simulation framework based on diffusion models. By introducing an adversarial term and a Partial Diffusion mechanism into the diffusion denoising process, it achieves fine-grained control over adversarial vehicle behavior types (collision angles, relative velocities, and collision categories), validating its effective assessment capabilities over multiple planners on nuScenes and nuPlan.
SeFlow: A Self-Supervised Scene Flow Method in Autonomous Driving: SeFlow proposes integrating traditional ray-casting-based dynamic point classification into a self-supervised scene flow learning pipeline. By utilizing tailored dynamic/static loss functions and a cluster-based object-level motion consistency constraint, it achieves state-of-the-art (SOTA) self-supervised scene flow performance on Argoverse 2 and Waymo at real-time speeds (48ms/frame), even outperforming some supervised methods.
SFPNet: Sparse Focal Point Network for Semantic Segmentation on General LiDAR Point Clouds: SFPNet proposes Sparse Focal Modulation (SFPM) to replace window-attention. By avoiding inductive bias designs targeted at specific LiDAR types through multi-level context extraction and gated adaptive aggregation, it achieves leading or competitive performance on mechanical spinning, solid-state, and hybrid solid-state LiDAR datasets. It also releases S.MID, the first hybrid solid-state LiDAR semantic segmentation dataset.
SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras: The authors propose SimPB, a unified model that concurrently performs multi-camera 2D detection and BEV-space 3D detection using a hybrid decoder (multi-view 2D decoder + 3D decoder) in a cyclic 3D→2D→3D manner, achieving excellent results on both tasks on the nuScenes dataset.
SLEDGE: Synthesizing Driving Environments with Generative Models and Rule-Based Traffic: SLEDGE proposes the first generative-model-based driving simulator. By utilizing a Raster-to-Vector Autoencoder to encode driving scenes into Rasterized Latent Maps (RLMs), and subsequently using a Diffusion Transformer to generate high-quality lane graphs and agents, it creates a simulation environment with 500x less storage (<4GB) than nuPlan. Meanwhile, it supports 500m long route evaluations, exposing a failure rate of over 40% in the SOTA planner PDM-Closed.
Stream Query Denoising for Vectorized HD-Map Construction: This paper proposes the Stream Query Denoising (SQD) strategy. By adding noise to the ground truth (GT) of the previous frame and training the network to reconstruct the current frame's GT, temporal consistency modeling in streaming HD map construction is enhanced. This approach consistently outperforms StreamMapNet on nuScenes and Argoverse2.
TOD³Cap: Towards 3D Dense Captioning in Outdoor Scenes: This work introduces the task of outdoor 3D dense captioning, constructs the million-scale dataset TOD³Cap (2.3M captions across 850 scenes), and designs an end-to-end network based on BEV features + Relation Q-Former + LLaMA-Adapter, outperforming adapted indoor-based methods by +9.6 CIDEr@0.5IoU.
Train Till You Drop: Towards Stable and Robust Source-free Unsupervised 3D Domain Adaptation: To address the performance degradation issue in the late training stage of Source-Free Unsupervised 3D Domain Adaptation (SFUDA) for 3D semantic segmentation, this paper proposes regularization strategies and validation criteria based on reference model consistency to achieve stable and robust adaptation.
UniM2AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving: This paper proposes UniM2AE, a multi-modal self-supervised pre-training framework. By projecting image and LiDAR point cloud features into a unified 3D voxel space (which retains the height dimension, unlike BEV) and designing a Multi-modal 3D Interactive Module (MMIM) for efficient cross-modal interaction, it improves 3D detection (+1.2% NDS) and BEV segmentation (+6.5% mIoU) over independent pre-training and simple concatenation baselines.
UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction: UniTraj establishes a unified framework for vehicle trajectory prediction by standardizing multiple datasets (nuScenes, Argoverse 2, WOMD), models (AutoBot, MTR, Wayformer), and evaluation strategies. The study reveals a significant drop in cross-dataset generalization for individual models, but demonstrates that scaling up training data volume and diversity substantially boosts performance, achieving 1st place on the nuScenes leaderboard via joint training.
VisionTrap: Vision-Augmented Trajectory Prediction Guided by Textual Descriptions: This work proposes VisionTrap, which introduces surround-view camera images and textual descriptions to the trajectory prediction task. Guided by a BEV visual semantic encoder and text-driven debiased contrastive learning, the model learns visual semantic cues (e.g., pedestrian poses, turn signals). It significantly improves prediction accuracy while maintaining a real-time inference speed of 53ms, and releases the nuScenes-Text dataset.
Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance: This paper proposes the VG-W3D framework, which trains a 3D object detector using only 2D annotations (without any 3D labels) through a three-level visual guidance mechanism across feature, output, and training layers. It achieves comparable performance on the KITTI dataset to methods utilizing 500 frames of 3D annotations.
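
📐 Metric reminder: Several trajectory prediction notes above (for example, the Latent Corridors entry) report gains in ADE. The sketch below shows how ADE and its companion FDE are commonly computed for a single predicted trajectory, and how a relative reduction in error is obtained. It assumes 2D positions, a single predicted mode, and that the percentages quoted in the notes are relative reductions; the function names `displacement_errors` and `relative_improvement` are hypothetical and are not taken from any of the papers.

```python
import math


def displacement_errors(pred, gt):
    """Return (ADE, FDE) for one predicted trajectory.

    pred, gt: equal-length, non-empty sequences of (x, y) positions,
    one per future timestep (illustrative format, an assumption).
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and equally long")
    # Euclidean distance between prediction and ground truth at each step.
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    ade = sum(dists) / len(dists)  # average over all timesteps
    fde = dists[-1]                # error at the final timestep only
    return ade, fde


def relative_improvement(baseline_error, new_error):
    """Fraction by which an error metric dropped relative to the baseline."""
    return (baseline_error - new_error) / baseline_error


# Toy example with made-up coordinates (not data from any paper).
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, gt)
print(ade, fde)
```

Lower is better for both metrics, so a positive value from `relative_improvement` means the new model has the smaller error.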