🚗 Autonomous Driving
🧠 NeurIPS 2025 · 50 paper notes
- 3EED: Ground Everything Everywhere in 3D
-
This paper introduces 3EED — the first large-scale multi-platform (vehicle, drone, quadruped robot), multimodal (LiDAR + RGB) outdoor 3D visual grounding benchmark, containing over 128K objects and 22K language descriptions, making it 10× larger than existing outdoor datasets. A baseline method incorporating cross-platform alignment, multi-scale sampling, and scale-adaptive fusion is also proposed, revealing substantial performance gaps in cross-platform 3D grounding.
- Aha: Predicting What Matters Next — Online Highlight Detection Without Looking Ahead
-
Aha proposes the first autoregressive framework for Online Highlight Detection (OHD), featuring a decoupled multi-objective prediction head (relevance / informativeness / uncertainty) and a novel Dynamic SinkCache memory mechanism. Under strict causal constraints with no access to future frames, Aha surpasses prior offline methods on TVSum and Mr.Hisum benchmarks by +5.9% and +8.3% mAP, respectively.
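The "attention sink" idea behind such caches can be sketched in a few lines. This is a generic sink-plus-sliding-window cache, not the paper's Dynamic SinkCache (class and parameter names here are illustrative): the first few entries are kept permanently while the rest roll through a fixed-size window, so memory stays bounded under strictly causal streaming.

```python
from collections import deque

class SinkCache:
    """Illustrative sketch (not the paper's exact Dynamic SinkCache):
    keep the first `n_sink` entries permanently as attention sinks and
    only the most recent `window` entries in a sliding window."""

    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink
        self.sinks = []                      # permanent prefix entries
        self.recent = deque(maxlen=window)   # sliding window of latest entries

    def append(self, kv):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def context(self):
        # the model attends over sinks + recent entries only (causal, no future)
        return self.sinks + list(self.recent)

cache = SinkCache(n_sink=2, window=3)
for t in range(10):
    cache.append(t)
print(cache.context())  # [0, 1, 7, 8, 9]
```

With 10 streamed entries, the cache retains the two sinks plus the last three frames, regardless of stream length.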
- AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
-
AutoVLA integrates physical action tokens directly into a pretrained VLM (Qwen2.5-VL-3B), equips the model with fast/slow dual-thinking modes via SFT, and applies GRPO reinforcement fine-tuning to enable adaptive reasoning switching and optimize planning performance. The approach achieves competitive end-to-end driving performance across four major autonomous driving benchmarks: nuPlan, Waymo, nuScenes, and CARLA.
- Availability-aware Sensor Fusion via Unified Canonical Space
-
This paper proposes ASF (Availability-aware Sensor Fusion), which maps Camera/LiDAR/4D Radar features into a shared space via Unified Canonical Projection (UCP), applies cross-sensor along-patch cross-attention (CASAP, complexity \(O(N_qN_s)\) vs. \(O(N_qN_sN_p)\)) to automatically adapt to available sensors, and employs a Sensor Combination Loss (SCL) covering all 7 sensor subsets. ASF achieves AP_3D of 73.6% on K-Radar (surpassing SOTA by 20.1%), with only a 1.7% performance drop under sensor failure.
- BayesG: Bayesian Ego-Graph Inference for Networked Multi-Agent Reinforcement Learning
-
BayesG enables each agent in networked MARL to learn the dynamic structure of its local communication graph via Bayesian variational inference — sampling edge masks with Gumbel-Softmax and jointly optimizing policy and graph structure under an ELBO objective — achieving 50%+ reward improvement over the best baseline in a 167-agent New York traffic scenario.
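The Gumbel-Softmax edge sampling at the core of this scheme can be sketched as follows. This is the standard binary Gumbel-Softmax relaxation, not BayesG's full variational objective; `logit` and `tau` are illustrative names.

```python
import math, random

def gumbel_softmax_edge(logit, tau=0.5):
    """Sample a relaxed Bernoulli edge mask for one candidate edge.
    `logit` is the learnable log-odds of keeping the edge; `tau` is the
    temperature (lower -> closer to a hard 0/1 mask)."""
    g_keep = -math.log(-math.log(random.random()))  # Gumbel(0,1) noise, "keep"
    g_drop = -math.log(-math.log(random.random()))  # Gumbel(0,1) noise, "drop"
    keep = math.exp((logit + g_keep) / tau)
    drop = math.exp((0.0 + g_drop) / tau)
    return keep / (keep + drop)                     # soft mask in (0, 1)

random.seed(0)
masks = [gumbel_softmax_edge(logit=2.0, tau=0.5) for _ in range(5)]
print([round(m, 3) for m in masks])
```

Because the sample is a differentiable function of `logit`, the edge probabilities can be optimized jointly with the policy under the ELBO.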
- Causality Meets Locality: Provably Generalizable and Scalable Policy Learning for Networked Systems
-
This paper proposes the GSAC framework, which integrates causal representation learning with meta Actor-Critic. By learning sparse causal masks from networked MARL to construct Approximate Compact Representations (ACR), GSAC achieves scalability; by conditioning policies on domain factors, it achieves cross-domain generalization. Finite-sample guarantees are provided for causal recovery, convergence, and adaptation gap.
- ChronoGraph: A Real-World Graph-Based Multivariate Time Series Dataset
-
This paper presents ChronoGraph — the first real-world microservice dataset that simultaneously provides multivariate time series, explicit service dependency graphs, and event-level anomaly labels (6 months / ~700 services / 5-dimensional metrics / 8005 timesteps). Benchmark results reveal substantial room for improvement in long-horizon forecasting and topology-aware modeling among existing methods.
- Continuous Simplicial Neural Networks
-
This paper proposes COSIMO, the first continuous simplicial neural network based on partial differential equations (PDEs), which realizes continuous information flow by defining heat diffusion dynamics on the Hodge Laplacian. COSIMO demonstrates superior stability and over-smoothing control compared to discrete SNNs.
- CuMoLoS-MAE: A Masked Autoencoder for Remote Sensing Data Reconstruction
-
This paper proposes CuMoLoS-MAE, a Masked Autoencoder combining a curriculum masking strategy with Monte Carlo stochastic ensemble inference for high-fidelity reconstruction and pixel-wise uncertainty quantification of remote sensing atmospheric profile data.
- CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation
-
This work introduces the first sketch-to-3D outdoor semantic scene generation task along with a benchmark dataset, SketchSem3D, and proposes CymbaDiff (Cylinder Mamba Diffusion), a denoising network that achieves structured spatial modeling via dual-path Mamba blocks combining cylindrical and Cartesian scanning. CymbaDiff reduces FID by 75% over 3D Latent Diffusion and 71% over 3D DiT.
- DBLoss: Decomposition-based Loss Function for Time Series Forecasting
-
This paper proposes DBLoss, a general-purpose loss function based on exponential moving average (EMA) decomposition. During loss computation, both predictions and ground-truth values are decomposed into seasonal and trend components within the forecasting horizon, and losses are computed separately for each component. DBLoss serves as a plug-and-play replacement for MSE and consistently improves deep learning forecasting models, with effectiveness validated across 8 benchmark datasets × 8 SOTA models.
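The decomposition idea can be sketched in pure Python. This is a minimal illustration under assumed choices (EMA smoothing `alpha`, component weight `w_trend`, plain MSE per component), not the paper's exact formulation:

```python
def ema_trend(series, alpha=0.3):
    """Trend via exponential moving average; seasonal = residual."""
    trend, s = [], series[0]
    for x in series:
        s = alpha * x + (1 - alpha) * s
        trend.append(s)
    return trend

def db_loss(pred, true, alpha=0.3, w_trend=0.5):
    """Sketch of a decomposition-based loss: MSE computed separately
    on the trend and seasonal components, then weighted."""
    pt, tt = ema_trend(pred, alpha), ema_trend(true, alpha)
    ps = [p - t for p, t in zip(pred, pt)]  # seasonal residual of prediction
    ts = [y - t for y, t in zip(true, tt)]  # seasonal residual of target
    mse = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return w_trend * mse(pt, tt) + (1 - w_trend) * mse(ps, ts)

print(round(db_loss([1.0, 2.0, 3.0], [1.0, 2.5, 2.5]), 4))
```

Errors in the slowly varying trend and the fast seasonal residual are penalized separately, which is what lets the loss shape gradients differently from plain MSE.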
- DINO-Foresight: Looking into the Future with DINO
-
This paper proposes DINO-Foresight, which forecasts future-frame feature evolution within the semantic feature space of a Vision Foundation Model (VFM). A self-supervised Masked Feature Transformer predicts PCA-compressed representations of multi-layer DINOv2 features. Paired with plug-and-play task-specific heads, a single model simultaneously handles semantic segmentation, instance segmentation, depth estimation, and surface normal prediction, substantially outperforming the VISTA world model while achieving 100× faster inference.
- DriveDPO: Policy Learning via Safety DPO For End-to-End Autonomous Driving
-
DriveDPO is a two-stage framework that first fuses human-imitation similarity and rule-based safety scores into a single supervised distribution via unified policy distillation, then applies Safety DPO to construct trajectory preference pairs of the form "human-like but unsafe vs. human-like and safe" for policy fine-tuning — achieving a new state-of-the-art PDMS of 90.0 on NAVSIM.
- Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation
-
This paper proposes Feature Mixing — an extremely simple multimodal outlier synthesis method that generates OOD samples by randomly swapping \(N\) dimensions across features from two modalities for training regularization. It provides theoretical guarantees that synthesized outliers reside in low-likelihood regions of the ID distribution with bounded deviation, achieves state-of-the-art performance across 8 datasets and 4 modality combinations, and runs 10×–370× faster than NP-Mix.
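The mixing operation itself is simple enough to write down directly. A minimal sketch (toy feature vectors; the actual method operates on learned features during training):

```python
import random

def feature_mixing(feat_a, feat_b, n_swap):
    """Sketch of Feature Mixing: swap n_swap randomly chosen dimensions
    between two modality feature vectors to synthesize an outlier pair."""
    assert len(feat_a) == len(feat_b)
    out_a, out_b = list(feat_a), list(feat_b)
    for i in random.sample(range(len(feat_a)), n_swap):
        out_a[i], out_b[i] = out_b[i], out_a[i]
    return out_a, out_b

random.seed(0)
cam = [0.1, 0.2, 0.3, 0.4]    # e.g. camera features (toy values)
lidar = [1.1, 1.2, 1.3, 1.4]  # e.g. LiDAR features (toy values)
oa, ob = feature_mixing(cam, lidar, n_swap=2)
print(oa, ob)
```

No generative model or density estimate is needed, which is where the reported 10×–370× speedup over NP-Mix comes from.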
- Flow Matching-Based Autonomous Driving Planning with Advanced Interactive Behavior Modeling
-
This paper proposes Flow Planner—a system combining three synergistic innovations: fine-grained trajectory tokenization, an interaction-enhanced spatiotemporal fusion architecture, and flow matching with classifier-free guidance. It is the first purely learning-based method to surpass 90 points on nuPlan Val14 (90.43), and outperforms Diffusion Planner by 8.92 points on the interaction-intensive interPlan benchmark.
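Classifier-free guidance for a flow-based planner follows the generic extrapolation rule (shown here in its standard form, not Flow Planner's specific implementation): combine the conditional and unconditional velocity fields with a guidance weight.

```python
def cfg_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance (generic form): extrapolate the conditional
    velocity field away from the unconditional one by guidance weight w."""
    return [(1 + w) * vc - w * vu for vc, vu in zip(v_cond, v_uncond)]

# w = 0 recovers the conditional field; larger w strengthens conditioning
print(cfg_velocity([1.0, 2.0], [0.5, 1.0], w=2.0))  # [2.0, 4.0]
```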
- Future-Aware End-to-End Driving: Bidirectional Modeling of Trajectory Planning and Scene Evolution
-
This paper proposes SeerDrive, which achieves state-of-the-art performance on NAVSIM and nuScenes through bidirectional modeling of scene evolution and trajectory planning (future-aware planning plus iterative interaction).
- FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
-
FSDrive enables VLAs to "think visually" — first acting as a world model to generate a unified visual CoT frame that integrates future lane lines, 3D detection boxes, and scene predictions, then acting as an inverse dynamics model to perform trajectory planning based on current observations and the visual CoT. This approach activates the visual generation capability of MLLMs using only a minimal amount of data (~0.3%).
- GSAlign: Geometric and Semantic Alignment Network for Aerial-Ground Person Re-Identification
-
This paper proposes GSAlign, a framework that addresses geometric distortion and semantic misalignment in aerial-ground person re-identification (AG-ReID) via a Learnable Thin Plate Spline (LTPS) module and a Dynamic Alignment Module (DAM), achieving +18.8% mAP and +16.8% Rank-1 improvements on the CARGO dataset under the aerial-ground protocol.
- HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
-
This paper proposes HoloLLM, the first framework to integrate rare sensing modalities — including LiDAR, infrared, mmWave radar, and WiFi — into a multimodal large language model (MLLM). Through a Universal Modality-Injection Projector (UMIP), HoloLLM achieves efficient alignment between sensing modalities and text under data-scarce conditions, improving human action QA and captioning by approximately 30% over existing MLLMs.
- How Different from the Past? Spatio-Temporal Time Series Forecasting with Self-Supervised Deviation Learning
-
This paper proposes ST-SSDL, a framework that captures dynamic deviations between current inputs and historical patterns via self-supervised deviation learning (SSDL). It discretizes the latent space using learnable prototypes and enforces relative distance consistency through a contrastive loss and a deviation loss, achieving state-of-the-art performance on six spatio-temporal benchmarks.
- L2RSI: Cross-View LiDAR-Based Place Recognition for Large-Scale Urban Scenes via Remote Sensing Imagery
-
This paper proposes L2RSI, the first framework for LiDAR-based place recognition in ultra-large-scale urban scenes (100 km²) leveraging high-resolution remote sensing imagery. It aligns LiDAR BEV representations with remote sensing semantic spaces via semantic contrastive learning, and introduces Spatio-Temporal Particle Estimation (STPE) to aggregate spatio-temporal information from consecutive queries, achieving 83.27% Top-1 accuracy within a 100 km² retrieval range.
- LabelAny3D: Label Any Object 3D in the Wild
-
This paper proposes LabelAny3D, an analysis-by-synthesis automatic 3D annotation pipeline that reconstructs complete 3D scenes from monocular images to obtain high-quality 3D bounding box annotations. Based on this pipeline, the authors construct the COCO3D benchmark covering 80 categories of everyday objects, achieving significant improvements in open-vocabulary monocular 3D detection.
- Layer-wise Modality Decomposition for Interpretable Multimodal Sensor Fusion
-
This paper proposes LMD (Layer-Wise Modality Decomposition), a post-hoc, model-agnostic interpretability method that linearizes neural network operations layer by layer to exactly decompose the predictions of multimodal fusion models into per-sensor modality contributions. LMD is the first method to achieve prediction attribution to individual input modalities in autonomous driving perception models, and its effectiveness is validated across camera-radar, camera-LiDAR, and camera-radar-LiDAR fusion settings.
- FlowScene: Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance
-
This paper proposes FlowScene, which leverages optical flow to guide temporal feature aggregation and employs occlusion masks for voxel refinement. Using only 2 historical frames as input, FlowScene achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks (mIoU 17.70 / 20.81).
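The core of flow-guided temporal aggregation is warping past features into the current frame. A minimal nearest-neighbor sketch with integer flow and zero padding (illustrative only, not FlowScene's actual operator):

```python
def warp_by_flow(feat, flow):
    """Backward-warp a 2D feature map by integer optical flow
    (nearest neighbor, zero padding out of bounds)."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            sy, sx = y + dy, x + dx  # source location in the past frame
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = feat[sy][sx]
    return out

past = [[1.0, 2.0], [3.0, 4.0]]
flow = [[(0, 1), (0, 0)], [(0, 0), (-1, 0)]]
print(warp_by_flow(past, flow))  # [[2.0, 2.0], [3.0, 2.0]]
```

Warped past features are then fused with the current frame, with occlusion masks down-weighting locations where the flow is unreliable.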
- Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation
-
This paper proposes Vireo, the first single-stage framework that unifies open-vocabulary semantic segmentation (OVSS) and domain-generalized semantic segmentation (DGSS). By introducing GeoText Query to fuse depth-geometric features with linguistic cues, Vireo achieves state-of-the-art performance under both extreme environmental conditions and on unseen categories.
- Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving
-
This paper proposes MPA, a framework that generates counterfactual trajectory data via 3DGS simulation, trains a diffusion policy adapter and a multi-principle Q-value model, and uses them at inference time to guide a pretrained E2E driving model toward improved safety and generalization in closed-loop scenarios.
- Neurosymbolic Diffusion Models
-
This paper proposes Neurosymbolic Diffusion Models (NeSyDM), which integrates discrete masked diffusion models with symbolic programs to overcome the conditional independence assumption in traditional neurosymbolic predictors. NeSyDM models inter-concept dependencies and uncertainty while maintaining scalability, achieving state-of-the-art accuracy and calibration on visual reasoning and autonomous driving benchmarks.
- OpenBox: Annotate Any Bounding Boxes in 3D
-
This paper proposes OpenBox, a two-stage automatic 3D bounding box annotation pipeline that first maps instance-level information from 2D visual foundation models to 3D point clouds via cross-modal instance alignment, then adaptively generates high-quality 3D bounding boxes based on the physical state of each object (static rigid / dynamic rigid / deformable), without requiring any self-training iterations.
- Predictive Preference Learning from Human Interventions
-
PPL leverages a trajectory prediction model to anticipate the agent's future states and "bootstraps" each human intervention signal across the predicted future horizon to construct contrastive preference data. Combined with a dual-loss training strategy of behavior cloning and preference optimization, PPL substantially reduces the number of required human interventions and demonstration data.
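The bootstrapping step can be sketched as follows. All names here are illustrative (the actual PPL construction may differ): a single takeover action is propagated across the predicted horizon, yielding one preference pair per future state.

```python
def bootstrap_preferences(predicted_states, agent_actions, expert_action, horizon):
    """Sketch: at each predicted future state, the expert's intervention
    action is preferred over the agent's own planned action."""
    pairs = []
    for t in range(min(horizon, len(predicted_states))):
        pairs.append({
            "state": predicted_states[t],
            "preferred": expert_action,        # intervention propagated forward
            "dispreferred": agent_actions[t],  # agent's own planned action
        })
    return pairs

pairs = bootstrap_preferences(
    predicted_states=["s1", "s2", "s3"],
    agent_actions=["swerve", "swerve", "swerve"],
    expert_action="brake",
    horizon=2,
)
print(len(pairs), pairs[0]["preferred"])  # 2 brake
```

One intervention thus supervises many states, which is why the number of required takeovers drops.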
- Prioritizing Perception-Guided Self-Supervision: A New Paradigm for Causal Modeling in End-to-End Autonomous Driving
-
This work addresses causal confusion in end-to-end autonomous driving by leveraging perception outputs (lane centerlines, agent trajectories) and self-supervised learning to establish causal relationships, achieving state-of-the-art performance on the Bench2Drive closed-loop benchmark (Driving Score 78.08).
- RAW2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving
-
This paper proposes RAW2Drive, the first model-based reinforcement learning (MBRL) end-to-end autonomous driving framework operating directly from raw sensor inputs to planning. Through a dual-stream world model design — first training a privileged world model, then guiding a raw-sensor world model via an alignment mechanism — RAW2Drive achieves state-of-the-art performance on CARLA v2 and Bench2Drive, substantially outperforming imitation learning (IL) methods.
- Regret Lower Bounds for Decentralized Multi-Agent Stochastic Shortest Path Problems
-
This paper establishes the first \(\Omega(\sqrt{K})\) regret lower bound for the Decentralized Multi-Agent Stochastic Shortest Path (Dec-MASSP) problem under linear function approximation. By constructing a family of hard-to-learn instances and employing a symmetry argument to identify the structure of optimal policies, the paper demonstrates that this lower bound matches existing upper bounds in terms of the number of episodes \(K\).
- SDTagNet: Leveraging Text-Annotated Navigation Maps for Online HD Map Construction
-
This paper proposes SDTagNet, the first method to encode OpenStreetMap text annotations (road names, lane counts, one-way indicators, etc.) via BERT and to unify all SD map elements (points, polylines, and relations) through a point-level graph Transformer. On long-range HD map construction, SDTagNet achieves +5.9 mAP (+45%) over prior-free baselines and +3.2 mAP (+20%) over existing SD map prior methods.
- Self-Supervised Learning of Graph Representations for Network Intrusion Detection
-
This paper proposes GraphIDS, a self-supervised intrusion detection model that unifies graph representation learning and anomaly detection via a masked autoencoder, achieving a PR-AUC of 99.98% and macro F1 of 99.61% on multiple NetFlow benchmarks, surpassing baselines by 5–25 percentage points.
- Semantic Glitch: Agency and Artistry in an Autonomous Pixel Cloud
-
This paper presents "Pixel Cloud," a low-fidelity autonomous aerial robotic art installation that deliberately forgoes conventional LiDAR/SLAM sensors and relies solely on the semantic understanding of a multimodal large language model (MLLM) for navigation. Through natural language prompting, the robot is endowed with a biologically inspired narrative persona, yielding imprecise yet characterful emergent behaviors.
- SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration
-
This paper presents SimWorld-Robotics (SWR), a large-scale urban simulation platform built on Unreal Engine 5 that supports procedural generation of unlimited photorealistic city environments. Built upon this platform, two new benchmarks are introduced — SimWorld-MMNav for multimodal navigation and SimWorld-MRS for multi-robot search — which collectively reveal critical capability gaps in current VLMs on outdoor urban tasks.
- Spatio-Temporal Graphs Beyond Grids: Benchmark for Maritime Anomaly Detection
-
This paper proposes the first graph anomaly detection benchmark for non-grid spatio-temporal systems in the maritime domain. It extends the OMTAD dataset to support node/edge/graph-level anomaly detection, and plans to employ LLM agents for trajectory synthesis and anomaly injection.
- SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation and Understanding
-
SPIRAL proposes a semantic-aware range-view LiDAR diffusion model that jointly generates depth maps, reflectance images, and semantic segmentation maps. By introducing progressive semantic prediction and a closed-loop inference mechanism to enhance cross-modal consistency, the model achieves state-of-the-art performance with a minimal parameter count of 61M.
- SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving
-
SQS presents the first query-based 3D Gaussian splatting pre-training framework for sparse perception models (SPMs). By reconstructing RGB images and depth maps in a self-supervised manner, the method learns fine-grained 3D representations, and it introduces a query interaction module to fuse pre-trained Gaussian queries with task-specific queries. SQS achieves significant improvements over existing pre-training methods on occupancy prediction (+1.3 mIoU) and 3D object detection (+1.0 NDS).
- StreamForest: Efficient Online Video Understanding with Persistent Event Memory
-
This paper proposes StreamForest, an architecture that adaptively organizes streaming video frames into multiple event-level tree structures via a "Persistent Event Memory Forest," combined with a "Fine-grained Spatiotemporal Window" to capture short-term visual cues. The method achieves 77.3% accuracy on StreamingBench and retains 96.8% of performance under extreme compression (only 1024 visual tokens).
- Towards Foundational LiDAR World Models with Efficient Latent Flow Matching
-
This paper proposes the first transferable LiDAR world model, achieving a 192× compression ratio via a Swin Transformer VAE (state-of-the-art reconstruction accuracy), replacing diffusion models with Conditional Flow Matching (CFM) for state-of-the-art semantic occupancy prediction (using only 4.38% of prior work's FLOPs), and surpassing OccWorld trained on full annotations across three domain transfer tasks using only 5% labeled data.
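The flow-matching training target takes a particularly simple form. Shown here is the standard linear-path Conditional Flow Matching construction (not this paper's exact latent-space variant): interpolate between a prior sample and a data latent, and regress the constant velocity between them.

```python
def cfm_training_pair(x0, x1, t):
    """One linear-path CFM training example: x_t = (1 - t) * x0 + t * x1,
    with velocity target v = x1 - x0 (constant along the path)."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target

noise = [0.0, 0.0]    # x0 ~ prior (toy values)
latent = [2.0, -1.0]  # x1 = encoded LiDAR latent (toy values)
x_t, v = cfm_training_pair(noise, latent, t=0.5)
print(x_t, v)  # [1.0, -0.5] [2.0, -1.0]
```

Because the target is deterministic given the endpoints, training avoids the many-step denoising objective of diffusion, consistent with the large FLOPs reduction reported.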
- Towards Physics-Informed Spatial Intelligence with Human Priors: An Autonomous Driving Perspective
-
This paper proposes the Spatial Intelligence Grid (SIG)—a structured representation inspired by the perspective grids used by Renaissance painters—that explicitly encodes object layout, directional relationships, and distance relationships in driving scenes as a grid structure. The authors further construct the SIGBench benchmark, demonstrating that SIG enables more stable and comprehensive improvements in the spatial reasoning capabilities of MLLMs under few-shot in-context learning compared to conventional VQA-based approaches.
- Towards Predicting Any Human Trajectory in Context
-
This paper proposes TrajICL, an in-context learning (ICL) framework for pedestrian trajectory prediction that achieves cross-scene adaptive prediction without fine-tuning through spatiotemporal similarity-based example selection and prediction-guided example selection, surpassing even fine-tuned baselines.
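The similarity-based selection stage can be sketched as a nearest-neighbor ranking (illustrative only; TrajICL additionally uses prediction-guided selection, which is not shown here):

```python
def select_examples(query_traj, candidates, k):
    """Rank candidate trajectories by mean squared distance to the
    query trajectory and keep the top k as in-context examples."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)
    ranked = sorted(candidates, key=lambda c: dist(query_traj, c))
    return ranked[:k]

query = [0.0, 1.0, 2.0]  # toy 1-D trajectory
cands = [[0.1, 1.1, 2.1], [5.0, 5.0, 5.0], [0.0, 0.9, 2.2]]
print(select_examples(query, cands, k=2))  # [[0.1, 1.1, 2.1], [0.0, 0.9, 2.2]]
```

The selected examples are placed in the model's context at inference time, so adaptation to a new scene requires no gradient updates.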
- Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting
-
A multi-scale bilateral grid pyramid is proposed to unify global appearance codes and pixel-level bilateral grids. A 3-level hierarchy (coarse→medium→fine) captures global/regional/pixel-level photometric variation respectively. By employing a luminance-guided slice-and-blend pipeline and adaptive regularization, the method addresses photometric inconsistency in driving scene 3DGS, achieving a 28.2% improvement in Chamfer Distance over OmniRe on Waymo.
- UniMotion: A Unified Motion Framework for Simulation, Prediction and Planning
-
UniMotion proposes a unified motion framework built on a decoder-only Transformer, supporting motion simulation, trajectory prediction, and ego-vehicle planning simultaneously through task-aware interaction patterns and training strategies. Joint training facilitates cross-task knowledge sharing, and after task-specific fine-tuning, the model achieves state-of-the-art performance across multiple tasks on the Waymo dataset.
- URB -- Urban Routing Benchmark for RL-Equipped Connected Autonomous Vehicles
-
This paper presents URB — the first large-scale MARL benchmark environment for urban mixed-traffic (human + CAV) routing, integrating 29 real-world traffic networks, the microscopic traffic simulator SUMO, and empirical travel demand patterns. Experiments reveal that current state-of-the-art MARL algorithms rarely outperform human drivers, highlighting the urgent need for algorithmic advances in this domain.
- UrbanIng-V2X: A Large-Scale Multi-Vehicle Multi-Infrastructure Dataset Across Multiple Intersections for Cooperative Perception
-
UrbanIng-V2X is the first real-world cooperative perception dataset spanning multiple vehicles, multiple infrastructure sensors, and multiple urban intersections. It provides 712K annotated instances across 13 categories in 34 scenes, and through a cross-intersection evaluation strategy (SIS) quantitatively reveals a substantial generalization gap of 14 mAP exhibited by existing cooperative perception methods on unseen intersections.
- V2X-Radar: A Multi-Modal Dataset with 4D Radar for Cooperative Perception
-
This paper presents V2X-Radar, the first large-scale real-world multi-modal vehicle-to-everything (V2X) cooperative perception dataset incorporating 4D radar, LiDAR, and multi-view camera data. The dataset covers diverse weather and lighting conditions, providing 20K LiDAR frames, 40K camera images, 20K 4D radar scans, and 350K annotated bounding boxes, along with comprehensive benchmarks across three sub-datasets.
- X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability
-
This paper presents X-Scene, a unified large-scale driving scene generation framework that supports multi-granularity control ranging from high-level text prompts to low-level BEV layouts. By jointly generating 3D semantic occupancy, multi-view images, and videos, and leveraging consistency-aware extrapolation for large-scale scene expansion, X-Scene comprehensively outperforms existing methods in generation quality (FID 11.29) and downstream tasks.