🚗 Autonomous Driving
🔬 ICLR2026 · 50 paper notes
📌 Same area in other venues: 📷 CVPR2026 (140) · 🧪 ICML2026 (8) · 🤖 AAAI2026 (56) · 🧠 NeurIPS2025 (47) · 📹 ICCV2025 (91)
🔥 Top topics: Autonomous Driving ×14 · Agents ×7 · Adversarial Robustness ×4 · Multimodal/VLM ×4 · Alignment/RLHF ×3
- Adaptive Augmentation-Aware Latent Learning for Robust LiDAR Semantic Segmentation
-
A3Point (Adaptive Augmentation-Aware Latent Learning) decouples the model's intrinsic semantic confusion from the semantic shift introduced by data augmentation. It has two core components: implicit learning of a Semantic Confusion Prior (SCP) and localization of Semantic Shift Regions (SSR). The framework adapts its optimization to varying interference levels and achieves SOTA results on multiple LiDAR segmentation benchmarks under adverse weather.
- SMART-R1: Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning
-
SMART-R1 introduces R1-style Reinforcement Fine-Tuning (RFT) to multi-agent traffic simulation for the first time, proposing the Metric-oriented Policy Optimization (MPO) algorithm and an "SFT-RFT-SFT" iterative training strategy. It achieved first place on the WOSAC 2025 leaderboard with a Realism Meta score of 0.7858.
- ARINBEV: Bird's-Eye View Layout Estimation with Conditional Autoregressive Model
-
ARINBEV treats the BEV semantic map in autonomous driving as a discretized sequence of structured tokens, replaces VQ-VAE tokenization with class encoding, and utilizes entropy-guided masked autoregressive decoding to achieve higher mIoU, fewer parameters, and faster training on nuScenes and Argoverse2.
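One plausible reading of "entropy-guided masked decoding" is that, at each step, the cells whose predicted class distribution has the lowest entropy are committed first. The sketch below illustrates only that idea. The function names, the fixed `n_commit` schedule, and the flat list of cells are assumptions made for illustration and are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decode_step(probs_per_cell, decoded, n_commit):
    """Commit the n_commit still-masked cells with the lowest entropy.

    probs_per_cell: per-cell class probabilities from the model (hypothetical input).
    decoded: list holding a class index per cell, or None while the cell is masked.
    """
    masked = [i for i, value in enumerate(decoded) if value is None]
    masked.sort(key=lambda i: entropy(probs_per_cell[i]))
    for i in masked[:n_commit]:
        p = probs_per_cell[i]
        decoded[i] = max(range(len(p)), key=p.__getitem__)  # argmax class
    return decoded


# Three BEV cells, two classes; the most uncertain cell stays masked.
probs = [[0.5, 0.5], [0.99, 0.01], [0.2, 0.8]]
state = decode_step(probs, [None, None, None], n_commit=2)
```

In a full decoder the model would be re-run on the partially filled map before the next step, so later predictions can condition on the confident cells.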
- Astra: General Interactive World Model with Autoregressive Denoising
-
This paper proposes Astra, a general interactive world model that adds action-conditioned long-range video prediction to pre-trained video diffusion models through an autoregressive denoising framework. It introduces an ACT-Adapter (action injection), noise-enhanced historical memory (alleviating visual inertia), and a Mixture of Action Experts (unifying heterogeneous action modalities), achieving SOTA fidelity and action-following capabilities across autonomous driving, robotic manipulation, and scene exploration.
- AsyncBEV: Cross-modal Flow Alignment in Asynchronous 3D Object Detection
-
Addressing the real-world issue of imperfect sensor synchronization, AsyncBEV proposes a lightweight, plug-and-play module. By defining a new task, \(\Delta\)-BEVFlow, it predicts dense 2D flow fields directly from asynchronous multimodal BEV features to warp and align delayed features to the reference timestamp. Under extreme 0.5s asynchrony, it improves the dynamic object NDS of CMT by 16.6% compared to the EMC baseline.
- AutoDrive-R²: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving
-
AutoDrive-R² cold-starts an autonomous driving VLA on four-step CoT data with self-reflection, then post-trains it using GRPO with spatial, kinetic, and temporal smoothness constraints. This enables the model to explain its driving decisions while outputting trajectories that adhere to vehicle physical constraints.
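The core of GRPO is a group-relative advantage: several responses are sampled for the same prompt, and each reward is normalized against the group's mean and standard deviation. The sketch below shows that normalization only. The weighted-sum `combined_reward` and its weights are hypothetical stand-ins for the spatial, kinetic, and smoothness terms, whose actual form is not given in this note.

```python
from statistics import mean, pstdev


def combined_reward(spatial, kinetic, smoothness, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of three per-trajectory reward terms."""
    w_s, w_k, w_t = weights
    return w_s * spatial + w_k * kinetic + w_t * smoothness


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Uses the population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled trajectories for the same scene.
rewards = [combined_reward(1.0, 0.0, 0.0),
           combined_reward(1.0, 1.0, 0.0),
           combined_reward(1.0, 1.0, 1.0)]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average get a positive advantage and are reinforced. No separate value network is needed.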
- AutoDrive-P³: Unified Chain of Perception-Prediction-Planning Thought via Reinforcement Fine-Tuning
-
AutoDrive-P³ organizes the perception, prediction, and planning of an autonomous driving VLM into a unified \(P^3\) chain of thought, and applies reinforcement fine-tuning with GRPO rewards spanning all three stages. It simultaneously improves trajectory accuracy, collision rate, and closed-loop planning score on nuScenes and NAVSIM.
- Beyond Visual Reconstruction Quality: Object Perception-aware 3D Gaussian Splatting for Autonomous Driving
-
This paper points out that the assumption "higher reconstruction fidelity leads to better reproduction of autonomous driving system (ADS) behavior" is a strong, unverified hypothesis. It proposes replacing pure visual similarity with perception stability (consistency of perception model outputs between reconstructed and ground truth images) as the optimization objective. Two plug-and-play losses—Perception Alignment Loss and Object Region Quality Loss—are introduced to significantly improve perception consistency in reconstructed scenes without sacrificing visual quality.
- Bird's-eye-view Informed Reasoning Driver (BIRDriver)
-
BIRDriver compresses the entire driving scene into a single-frame Bird's-Eye-View (BEV) top-down image fed into a VLM. The VLM outputs no more than three relative coordinate key points to express driving intentions, which are then refined into a trajectory by a motion planner. This low-cost approach grafts the VLM's commonsense reasoning onto long-tail driving scenarios.
- BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving
-
BridgeDrive proposes replacing truncated diffusion with a diffusion bridge to achieve anchor-guided trajectory planning in autonomous driving. This ensures theoretical symmetry between forward and backward processes, achieving success rates of 74.99% (PDM-Lite) and 89.25% (LEAD) in Bench2Drive closed-loop evaluations, surpassing previous SOTA by 7.72% and 2.45%, respectively.
- DecompGAIL: Learning Realistic Traffic Behaviors with Decomposed Multi-Agent Generative Adversarial Imitation Learning
-
Addressing the training instability of multi-agent GAIL in traffic simulation, this paper identifies "irrelevant interaction misguidance" (where the discriminator is misled by neighbor-neighbor interactions weakly related to ego actions) as the root cause. It proposes DecompGAIL, which explicitly decomposes realism into "ego-map" and "ego-neighbor" components alongside distance-weighted social rewards, achieving SOTA realism on the WOMD Sim Agents 2025 leaderboard.
- Detecting Temporal Misalignment Attacks in Multimodal Fusion for Autonomous Driving
-
Addressing the vulnerability of camera-LiDAR fusion’s dependence on precise time synchronization, this paper proposes AION, a lightweight plug-and-play defense. AION utilizes "Continuity-Aware Contrastive Learning" to train a shared multimodal encoder and employs Dynamic Time Warping (DTW) to track the alignment path of dual-sensor representations. Deviations from the diagonal are converted into anomaly scores, achieving an average AUROC of 0.92–0.95 against seven types of temporal misalignment attacks on KITTI/nuScenes, with an inference overhead of only ~3.26 ms.
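The DTW step can be illustrated with a toy example: align two sequences, recover the warping path, and measure how far it strays from the diagonal. This is a minimal sketch assuming scalar per-frame features and a mean absolute index offset as the score. AION works on learned multimodal embeddings, and its exact scoring function is not given in this note.

```python
def dtw_path(a, b):
    """Return the optimal DTW alignment path between two scalar sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end; ties prefer the diagonal move.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    path.reverse()
    return path


def diagonal_deviation(path):
    """Mean absolute offset between matched indices (0 for a synchronized pair)."""
    return sum(abs(i - j) for i, j in path) / len(path)


camera = [0, 0, 0, 1, 2, 3, 4, 5]
lidar = [0, 1, 2, 3, 4, 5, 5, 5]  # same signal, shifted by two frames
score = diagonal_deviation(dtw_path(camera, lidar))
```

A synchronized pair keeps the path on the diagonal and scores zero. A delayed or replayed stream bends the path and raises the score.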
- Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving
-
ReflectDrive discretizes 2D driving space into an action codebook, uses a pre-trained Diffusion Language Model (DLM) for VLA trajectory planning, and layers a gradient-free "reflection mechanism"—performing local searches on unsafe tokens to find safety anchors, followed by diffusion inpainting to regenerate surrounding trajectories. It achieves a PDMS of 91.1 (approaching the human score of 94.8) on the NAVSIM closed-loop benchmark.
- DriveAgent-R1: Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking
-
DriveAgent-R1 enables a 3B VLM to learn to "proactively invoke tools to see clearly when details are obscure" during driving planning. By implementing active perception through a visual toolkit and a hybrid thinking framework that adaptively switches between "fast text-only inference" and "slow tool-augmented inference" based on scene complexity, the agent achieves performance comparable to GPT-5 and human drivers via three-stage progressive training with cascaded RL.
- DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving
-
DriveMamba abandons the traditional serial Transformer paradigm of "perception → prediction → planning" and costly dense BEV features. It sparsifies image features and all task queries into tokens, sorts them by 3D spatial position, and feeds them into a unified Mamba decoder. This allows for simultaneous view correspondence, task relationship modeling, and long-term temporal fusion with linear complexity. The smallest Tiny version reduces the average L2 error to 0.44m and collision rate to 0.15% on nuScenes, while achieving 17.9 FPS (approximately 10x faster than UniAD).
- DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving
-
DriveVLA-W0 adds a "predict future images" world model task to an autonomous driving VLA, using dense visual self-supervision signals to fill the "supervision deficit" left by sparse action supervision. This effectively "amplifies" the data scaling law across 70M frames, allowing the model to consistently improve rather than reaching early saturation.
- EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video
-
Apple utilized Vision Pro to collect 829 hours of egocentric video paired with 3D hand joint tracking (EgoDex), covering 194 tabletop manipulation tasks. They systematically evaluated imitation learning strategies (BC/DDPM/FM + Transformer) on this dataset, providing the largest data foundation to date for the scaling and training of dexterous manipulation.
- EnvSocial-Diff: A Diffusion-Based Crowd Simulation Model with Environmental Conditioning and Individual-Group Interaction
-
Building upon the "social force + diffusion" framework of SPDiff, this work explicitly decomposes the environment into three categories of structured conditions: obstacles, Objects of Interest (OOI), and lighting. It supplements this with a graph-based "Individual-Group Interaction" (IGI) module for two-level social modeling, resulting in more realistic crowd trajectory simulations in complex outdoor scenarios.
- FlowAD: Ego-Scene Interactive Modeling for Autonomous Driving
-
FlowAD models the "feedback of ego-motion on future observations" as relative scene flow. By utilizing ego-guided scene partitioning and spatio-temporal flow prediction to learn these interaction dynamics in latent space, it achieves consistent performance gains in perception, end-to-end planning, and VLM analysis. It also introduces the FCP metric to specifically measure the speed of scene understanding.
- GaussianFusion: Unified 3D Gaussian Representation for Multi-Modal Fusion Perception
-
GaussianFusion replaces discrete BEV grids with continuous 3D Gaussian representations as a unified space for camera-LiDAR multi-modal fusion. By completing cross-modal alignment and interaction before quantization, it achieves new state-of-the-art accuracy in both 3D detection and occupancy prediction while significantly reducing memory overhead and latency.
- GT-Space: Enhancing Heterogeneous Collaborative Perception with Ground Truth Feature Space
-
GT-Space utilizes ground truth annotations (object boxes) to construct a unified BEV "common feature space" as an alignment anchor. This allows each heterogeneous agent to map its features into this space for fusion via a lightweight projector. Combined with cross-modal combinatorial contrastive loss, the 3D detection accuracy for heterogeneous collaboration on OPV2V / V2XSet / RCooper significantly outperforms existing methods that require encoder retraining or pairwise adaptation.
- Loc²: Interpretable Cross-View Localization via Depth-Lifted Local Feature Matching
-
Loc² directly learns local feature correspondences on the pixel planes of ground and aerial images, lifts matched points to BEV using monocular depth, and analytically solves for 3-DoF pose and depth scale via scale-aware Procrustes alignment. Supervised only by weak camera poses without pixel-level labels, it achieves SOTA results in challenging scenarios such as cross-area and unknown orientation, while matched points themselves serve as visual explanations for localization quality.
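The analytic pose step can be illustrated with the textbook closed-form solution for a 2D similarity transform (rotation, translation, and scale) between matched point sets. The sketch below uses complex arithmetic, where a rotation plus scale is a single complex multiplier. It is a generic least-squares Procrustes solver under the assumption of clean, equally weighted matches, and is not the paper's exact formulation.

```python
import cmath
import math


def similarity_procrustes_2d(src, dst):
    """Least-squares fit of dst ≈ s * R(theta) * src + t for matched 2D points.

    Returns (scale, theta_radians, (tx, ty)).
    """
    n = len(src)
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / n
    q_mean = sum(q) / n
    pc = [v - p_mean for v in p]
    qc = [v - q_mean for v in q]
    # Complex least squares: z = s * exp(i * theta).
    z = sum(a.conjugate() * b for a, b in zip(pc, qc)) / sum(abs(a) ** 2 for a in pc)
    t = q_mean - z * p_mean
    return abs(z), cmath.phase(z), (t.real, t.imag)


# Depth-lifted ground points (up to scale) and their aerial-view matches.
ground = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
aerial = [(3.0, -1.0), (3.0, 1.0), (1.0, -1.0)]  # scale 2, rotation 90°, shift (3, -1)
scale, theta, shift = similarity_procrustes_2d(ground, aerial)
```

Because the solution is closed-form, the recovered scale directly corrects the monocular depth scale, and the per-match residuals show which correspondences support the pose.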
- Low-Latency Neural LiDAR Compression with 2D Context Models
-
RangeCM transitions LiDAR point cloud compression from expensive 3D contexts (voxel/octree) entirely to the 2D range image domain. It uses CNNs in 2D to aggregate spatial, temporal, and camera contexts simultaneously, utilizing a unified hybrid context to predict both geometry and intensity. While achieving better BD-Rate than SOTA, it reduces codec latency to approximately 0.1 seconds and accelerates intensity compression by over 100x compared to baselines.
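Working in the range image domain relies on the standard spherical projection that maps each LiDAR point to a pixel indexed by azimuth and elevation. The sketch below shows that projection with a tiny image and a symmetric vertical field of view. The resolution, field-of-view values, and nearest-return rule are illustrative assumptions, not RangeCM's settings.

```python
import math


def to_range_image(points, height=4, width=8, fov_up_deg=15.0, fov_down_deg=-15.0):
    """Project (x, y, z) points to a height x width range image (0.0 = empty)."""
    image = [[0.0] * width for _ in range(height)]
    fov_up = math.radians(fov_up_deg)
    fov = fov_up - math.radians(fov_down_deg)
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue
        yaw = math.atan2(y, x)    # azimuth in [-pi, pi]
        pitch = math.asin(z / r)  # elevation
        col = int(0.5 * (1.0 - yaw / math.pi) * width)
        row = int((fov_up - pitch) / fov * height)
        col = min(width - 1, max(0, col))
        row = min(height - 1, max(0, row))
        # Keep the nearest return when several points share a pixel.
        if image[row][col] == 0.0 or r < image[row][col]:
            image[row][col] = r
    return image


image = to_range_image([(10.0, 0.0, 0.0), (20.0, 0.0, 0.0)])
```

Once the point cloud is in this dense 2D form, ordinary 2D convolutions can aggregate context.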
- Map as a Prompt: Learning Multi-Modal Spatial-Signal Foundation Models for Cross-scenario Wireless Localization
-
The authors propose SigMap, a method that feeds 3D maps as "soft prompts" into a wireless channel foundation model. Using cycle-adaptive masking for self-supervised pre-training and map-conditioned Graph Neural Network (GNN) prompts for parameter-efficient fine-tuning, the model achieves strong zero-shot/few-shot generalization in cross-scenario wireless localization.
- MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding
-
MARC uses a "retrieve-then-compress" strategy. A Visual Memory Retriever (VMR) selects the video segments most relevant to the query, and Compression GRPO (C-GRPO) then distills the inference capabilities of a 64-frame teacher model into a student model that uses only one frame's worth of tokens. This achieves 95% visual token compression, a 72% reduction in GPU memory, and a 23.9% reduction in inference latency with virtually no performance loss (42.20 vs. 42.21).
- Micro-Macro Coupled Koopman Modeling on Graph for Traffic Flow Prediction
-
The authors unify "microscopic vehicle trajectories" and "macroscopic traffic density" by lifting them into a linear Koopman observation space. By discretizing the Lighthill-Whitham-Richards (LWR) equations on a Lagrangian dynamic graph with vehicles as nodes, the model achieves trajectory prediction performance comparable to or better than history-dependent SOTA methods using only the current snapshot (no history required).
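For reference, the two ingredients named in this summary have standard textbook forms, stated here in generic notation rather than the paper's own: the LWR conservation law for traffic density \(\rho(x,t)\) with density-dependent velocity \(v(\rho)\), and a Koopman operator \(\mathcal{K}\) that advances observables \(g\) of the state \(x_t\) linearly.

\[
\frac{\partial \rho}{\partial t} + \frac{\partial \big(\rho\, v(\rho)\big)}{\partial x} = 0,
\qquad
g(x_{t+1}) = \mathcal{K}\, g(x_t).
\]

The appeal of the Koopman form is that nonlinear dynamics in \(x_t\) become linear in the lifted observable space. This is consistent with the summary's claim that a single current snapshot can be rolled forward without a history window.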
- Multi-Head Low-Rank Attention (MLRA)
-
Multi-Head Low-Rank Attention (MLRA) decomposes the single latent head of MLA into multiple independently shardable latent heads and sums the attention outputs of the branches. This gives native 4-way tensor parallelism support and a 2.8× decoding speedup while maintaining SOTA performance.
- NeMo-map: Neural Implicit Flow Fields for Spatio-Temporal Motion Mapping
-
The paper proposes NeMo-map, a continuous spatio-temporal dynamic map based on neural implicit functions. By directly mapping spatio-temporal coordinates to Semi-Wrapped Gaussian Mixture Model (SWGMM) parameters, it eliminates the constraints of spatial discretization and temporal segmentation in traditional methods, achieving lower NLL and smoother velocity distributions on real human tracking datasets.
- OccDriver: Future Occupancy Guided Dual-branch Trajectory Planner in Autonomous Driving
-
OccDriver adopts a dual-branch coarse-to-fine framework: a vectorized branch generates coarse trajectories, a rasterized branch acts as an occupancy flow world model to predict future scene evolution conditioned on each trajectory, and the vectorized branch then refines the trajectories accordingly. Combined with cross-branch losses and a contingency planning strategy, it achieves SOTA performance on the nuPlan closed-loop benchmark.
- Online Navigation Refinement: Achieving Lane-Level Guidance by Associating Standard-Definition and Online Perception Maps
-
This paper proposes "Online Navigation Refinement" (ONR), a new task to refine road-level routes from SD maps into lane-level guidance. A lightweight Map Association Transformer (MAT) with path-aware and spatial attention is designed to perform "map-to-map" association between heterogeneous SD maps and on-vehicle online perception maps. MAT outperforms all map-matching baselines on the self-built OMA dataset with a latency of 34ms.
- Plan-R1: Safe and Feasible Trajectory Planning as Language Modeling
-
The paper treats autonomous driving trajectory planning as "language modeling"—first pre-training a motion token predictor autoregressively on expert data to learn "driving like a human," and then applying reinforcement learning fine-tuning with rule-based rewards and an improved GRPO (VD-GRPO). This explicitly aligns the model with driving principles such as safety, comfort, and compliance, achieving SOTA on nuPlan, especially under interactive reactive settings.
- PTN: Proposal-centric Transformer Network for 3D Object Detection
-
PTN attributes the bottleneck of two-stage LiDAR detectors to poor proposal quality: geometric details are lost during pooling, and proposals are refined in isolation. The authors propose Hierarchical Attention Feature Alignment (HAFA) to recover fine-grained geometry and a Collaborative Proposal Refinement Module (CPRM) that enables context exchange between proposals via deformable attention. PTN achieves SOTA on Waymo and KITTI, with particularly large gains in pedestrian and cyclist detection in sparse point cloud and occluded scenarios.
- RAP: 3D Rasterization Augmented End-to-End Planning
-
RAP utilizes lightweight 3D rasterization to generate controllable counterfactual views and recovery scenes from real driving logs. It then stabilizes the transfer of these synthetic samples to real-image planners through feature-space Raster-to-Real alignment, significantly enhancing end-to-end planning robustness on closed-loop/long-tail benchmarks such as NAVSIM, WOD-E2E, and Bench2Drive.
- Rate-Distortion Optimized Pragmatic Communication for Collaborative Perception
-
This paper extends the classic Shannon rate-distortion theory into a "pragmatic rate-distortion theory" oriented towards multi-agent collaborative perception. It derives two necessary conditions for optimal communication strategies: transmitting only task-relevant information and avoiding information redundant with the receiver's observations. Based on these, the RDcomm framework (task-entropy discrete encoding + mutual information-driven message filtering) is designed. It achieves SOTA accuracy in 3D detection and BEV segmentation across 4 datasets while compressing communication volume by up to 108x.
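As background, the classic rate-distortion function that the paper starts from is given below in standard notation. For a source \(X\), a reconstruction \(\hat{X}\), and a distortion measure \(d\), it is the minimum mutual information achievable under a distortion budget \(D\). The pragmatic, task-oriented variant derived in the paper is not reproduced here.

\[
R(D) \;=\; \min_{p(\hat{x}\mid x)\,:\;\mathbb{E}[d(X,\hat{X})]\,\le\, D} \; I(X;\hat{X}).
\]

The two necessary conditions in the summary can be read against this baseline. They restrict what counts as distortion to task-relevant error, and they discount information the receiver already observes.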
- ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
-
ReCogDrive replaces the "trajectory as text generation" paradigm with a "Cognitive VLM + Diffusion Planner" framework. It first injects human driving cognition into a VLM through a hierarchical data pipeline, then treats VLM hidden states as conditions for a diffusion planner to output continuous trajectories. Finally, a DiffGRPO reinforcement learning stage, tailored for diffusion policies, optimizes safety and comfort within the NAVSIM simulator. This achieves SOTA performance on both NAVSIM (PDMS 90.8) and Bench2Drive, while being 3.5× faster than pure text output.
- ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving
-
ResWorld proposes a Temporal Residual World Model (TR-World) that extracts dynamic object information by calculating temporal residuals of BEV scene representations (without detection/tracking), avoiding redundant modeling of static areas. Combined with a Future-Guided Trajectory Refinement (FGTR) module, it utilizes predicted future BEV features to correct planned trajectories, achieving SOTA planning performance on nuScenes and NAVSIM.
- Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
-
This paper points out that previous experiments using driving world models for synthetic data were based on "unfair training epochs." It proposes Dream4Drive—which decomposes real videos into dense 3D-aware guidance maps and renders 3D assets into them to fine-tune a world model for multi-view edited video generation. Under fair comparison with aligned epochs, adding less than 2% synthetic samples consistently improves 3D detection and tracking.
- S2GO: Streaming Sparse Gaussian Occupancy
-
S2GO uses a set of approximately 1k sparse 3D queries to summarize driving scenes in an online streaming fashion. In each frame, queries are decoded into dense semantic Gaussians and then "splatted" into voxel occupancy. Combined with a geometric denoising and rendering pre-training task, sparse queries learn to move toward occupied regions. It achieves a 2.7 IoU improvement over GaussianWorld on nuScenes/KITTI with 4.5× faster inference (real-time 26 FPS on a single 4090).
- SceneStreamer: Continuous Scenario Generation as Next Token Group Prediction
-
SceneStreamer encodes entire driving scenarios (maps, traffic lights, agent states, motion) into a discrete token sequence, generating them by "predicting the next group of tokens" via a single autoregressive Transformer. This enables continuous traffic generation in open systems over infinite horizons with dynamic agent entry/exit, significantly enhancing the robustness and generalization of downstream RL planners as a high-fidelity simulator.
- SEAL: Segment Any Events with Language
-
This work proposes the first Open-Vocabulary Event Instance Segmentation (OV-EIS) task and introduces the SEAL framework. By utilizing Multi-modal Hierarchical Semantic Guidance (MHSG) and a lightweight multi-modal fusion network, SEAL achieves multi-granularity (instance-level + part-level) semantic segmentation of event streams using only event-image pairs (without dense annotations), significantly outperforming all baseline methods with the fastest inference speed.
- SiMO: Single-Modality-Operable Multimodal Collaborative Perception
-
This paper proposes the SiMO framework, which combines the LAMMA fusion module with the PAFR training strategy to build, for the first time, a multi-agent collaborative perception system that keeps operating under arbitrary modality loss (in particular, when LiDAR fails and only cameras are available). It functions like a parallel circuit: as long as one path exists, the system works.
- SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms
-
SimULi utilizes factorized 3D Gaussian representations to separately carry camera and LiDAR information, extending 3DGUT to the irregular sampling of spinning LiDAR, thereby achieving real-time autonomous driving sensor simulation that supports complex camera models and LiDAR scanning simultaneously.
- SPACeR: Self-Play Anchoring with Centralized Reference Models
-
SPACeR proposes a "human-like self-play" framework that utilizes a pre-trained tokenized autoregressive motion model as a centralized reference policy. Through log-likelihood rewards and KL divergence constraints, it guides decentralized self-play RL policies to align with human driving distributions. It outperforms pure self-play methods on WOSAC while achieving 10x faster inference and 50x fewer parameters than imitation learning models.
- Stability under Scrutiny: Benchmarking Representation Paradigms for Online HD Map Construction
-
This paper points out that the field of online high-definition (HD) mapping has exclusively focused on single-frame accuracy (mAP) while neglecting the issue of temporal stability (jittering/flickering) between consecutive frames. It proposes the first multi-dimensional stability evaluation framework (merging Presence, Localization, and Shape metrics into a mean Average Stability, mAS). Through large-scale evaluation of 42 models and variants, the study finds that mAP and mAS are largely independent. It systematically analyzes how design choices—such as sensors, backbones, BEV encoders, temporal fusion, and training duration—affect both accuracy and stability.
- Steerable Adversarial Scenario Generation through Test-Time Preference Alignment (SAGE)
-
SAGE reformulates adversarial scenario generation for autonomous driving as a multi-objective preference alignment problem. By training two preference expert models and performing weight interpolation at inference time, it achieves a continuous and controllable trade-off between adversariality and realism. This allows for the generation of a full spectrum of scenarios from mild to aggressive without retraining, significantly enhancing closed-loop training performance.
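Test-time weight interpolation between two expert models can be sketched as a linear blend of matching parameters. The example below assumes both experts share an architecture and stores parameters as plain lists keyed by name. The names `adversarial_expert`, `realistic_expert`, and the blending coefficient `lam` are hypothetical.

```python
def interpolate_weights(adv_params, real_params, lam):
    """Blend two parameter dicts: lam=1 is fully adversarial, lam=0 fully realistic."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if adv_params.keys() != real_params.keys():
        raise ValueError("experts must share the same parameter names")
    return {
        name: [lam * a + (1.0 - lam) * r
               for a, r in zip(adv_params[name], real_params[name])]
        for name in adv_params
    }


adversarial_expert = {"layer.weight": [1.0, 2.0]}
realistic_expert = {"layer.weight": [3.0, 6.0]}
mild = interpolate_weights(adversarial_expert, realistic_expert, lam=0.25)
```

Sweeping `lam` from 0 to 1 yields a family of generators from a single pair of trained experts, which matches the "no retraining" property described in the summary.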
- To View Transform or Not to View Transform: NeRF-based Pre-training Perspective
-
NeRP3D argues that hard-linking NeRF pre-training to discrete BEV/voxel view transformation backbones compromises the advantages of continuous radiance fields. Thus, it directly utilizes NeRF-like continuous point queries to unify reconstruction pre-training and autonomous driving 3D perception. It outperforms existing NeRF pre-training methods in reconstruction, detection, occupancy prediction, and HD mapping tasks on nuScenes.
- TrajTok: What makes for a good trajectory tokenizer in behavior generation?
-
TrajTok systematically analyzes coverage, utilization, symmetry, and robustness of trajectory tokenizers in autonomous driving behavior generation. By using "rule-based candidates + data-driven selection/expansion + spatial-aware label smoothing," it constructs a trajectory vocabulary better suited for next-token prediction, achieving first place in the Waymo Open Sim Agents Challenge 2025.
- UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction
-
UniSplat performs multi-view spatial fusion and multi-frame temporal fusion simultaneously on a unified 3D latent scaffold (a sparse voxel grid). A point-voxel dual-branch decoder generates Gaussians with dynamic attributes while a static Gaussian memory bank is maintained. It achieves feed-forward SOTA novel view synthesis in sparse surround-view, highly dynamic driving scenarios such as Waymo and nuScenes, and can even complete blind spots outside the cameras' field of view.
- VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
-
VADv2 reformulates end-to-end driving planning from "regressing a single trajectory" to "learning a probability distribution over the action space." It first discretizes the continuous action space into a 4096-word planning vocabulary via furthest trajectory sampling, then uses a NeRF-inspired probabilistic field and cascaded Transformer to predict probabilities for each candidate action, and finally samples a trajectory for vehicle control. Using only camera inputs, it achieved a Driving Score of 85.1 on CARLA Town05 and led several benchmarks including Bench2Drive, NAVSIM, and 3DGS.
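Furthest trajectory sampling can be understood as farthest point sampling applied to whole trajectories: repeatedly pick the candidate that is farthest from everything already selected. The sketch below is a minimal version under stated assumptions. It starts from the first trajectory and uses the mean pointwise Euclidean distance, neither of which is specified in this note.

```python
import math


def traj_distance(a, b):
    """Mean Euclidean distance between corresponding waypoints."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def furthest_trajectory_sampling(trajs, k):
    """Greedily select up to k mutually distant trajectories; returns their indices."""
    selected = [0]
    min_dist = [traj_distance(t, trajs[0]) for t in trajs]
    while len(selected) < min(k, len(trajs)):
        nxt = max(range(len(trajs)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, t in enumerate(trajs):
            min_dist[i] = min(min_dist[i], traj_distance(t, trajs[nxt]))
    return selected


candidates = [
    [(0.0, 0.0), (1.0, 0.0)],   # straight
    [(0.0, 0.0), (1.0, 0.1)],   # almost straight
    [(0.0, 0.0), (1.0, 5.0)],   # hard left
    [(0.0, 0.0), (1.0, -5.0)],  # hard right
]
vocabulary = furthest_trajectory_sampling(candidates, k=3)
```

The near-duplicate "almost straight" trajectory is skipped, so a small vocabulary still covers the action space.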
- WorldSplat: Gaussian-Centric Feed-Forward 4D Scene Generation for Autonomous Driving
-
WorldSplat unifies "driving video generation" and "3D/4D scene reconstruction": it first utilizes a 4D-aware latent diffusion model to generate multimodal latents containing RGB, depth, and semantics from conditions such as layout, text, and trajectories. A feed-forward decoder then produces pixel-aligned 4D Gaussian fields in a single pass, enabling the rendering of geometrically consistent multi-track novel-view videos along arbitrary custom trajectories. Finally, an enhancement diffusion model completes imperfections, achieving new SOTA performance in both driving video generation and novel view synthesis on nuScenes.