🚗 Autonomous Driving

🔬 ICLR2026 · 18 paper notes

Adaptive Augmentation-Aware Latent Learning for Robust LiDAR Semantic Segmentation: This paper proposes A3Point (Adaptive Augmentation-Aware Latent Learning), a training framework that addresses the augmentation dilemma in robust LiDAR segmentation via two core components: Semantic Confusion Prior (SCP) implicit learning and Semantic Shift Region (SSR) localization. By decoupling model-inherent semantic confusion from augmentation-induced semantic shift and adaptively optimizing across varying perturbation intensities, A3Point achieves state-of-the-art performance on multiple adverse-weather LiDAR segmentation generalization benchmarks.
SMART-R1: Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning: SMART-R1 is the first work to introduce R1-style reinforcement fine-tuning (RFT) into multi-agent traffic simulation. It proposes the Metric-oriented Policy Optimization (MPO) algorithm and an iterative "SFT-RFT-SFT" training strategy, achieving first place on the WOSAC 2025 leaderboard with a Realism Meta score of 0.7858.
Astra: General Interactive World Model with Autoregressive Denoising: This paper proposes Astra, a general interactive world model that enables action-conditioned long-horizon video prediction on top of a pretrained video diffusion model via an autoregressive denoising framework. Three key contributions are introduced: ACT-Adapter (action injection), noise-augmented history memory (to mitigate visual inertia), and Mixture of Action Experts (to unify heterogeneous action modalities). Astra achieves state-of-the-art fidelity and action-following capability across autonomous driving, robotic manipulation, and scene exploration scenarios.
BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving: BridgeDrive proposes replacing truncated diffusion with a diffusion bridge to achieve anchor-guided trajectory planning for autonomous driving, ensuring theoretical symmetry between the forward and reverse processes. On the Bench2Drive closed-loop benchmark, it achieves success rates of 74.99% (PDM-Lite) and 89.25% (LEAD), surpassing the previous SOTA by 7.72% and 2.45%, respectively.
DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving: DrivingGen introduces the first comprehensive benchmark for autonomous driving video world models, comprising a diverse evaluation dataset spanning weather/geography/time/complex scenarios and a four-dimensional metric framework (distribution, quality, temporal consistency, trajectory alignment). Evaluation of 14 SOTA models reveals a fundamental trade-off between general-purpose and driving-specific models.
EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video: Apple collected 829 hours of egocentric video paired with 3D hand joint tracking data (EgoDex) using Vision Pro, covering 194 tabletop manipulation tasks and 338K trajectories. The dataset is used to systematically benchmark imitation learning policies (BC/DDPM/FM + Transformer), providing the largest-scale data foundation to date for scaling dexterous manipulation training.
MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding: MARC adopts a "retrieve-then-compress" strategy: a Visual Memory Retriever (VMR) selects the most query-relevant video segments, and Compression GRPO (C-GRPO) distills the reasoning capability of a 64-frame teacher model into a student model that operates on only 1-frame tokens. The result is 95% visual token compression, a 72% reduction in GPU memory, and a 23.9% reduction in inference latency, with virtually no performance loss (42.20 vs. 42.21).
Multi-Head Low-Rank Attention (MLRA): This paper proposes Multi-Head Low-Rank Attention (MLRA), which decomposes the single latent head in MLA into multiple independently shardable latent heads and sums the attention outputs across branches, enabling native 4-way tensor parallelism. The method achieves a 2.8× decoding speedup while maintaining state-of-the-art performance.
NeMo-map: Neural Implicit Flow Fields for Spatio-Temporal Motion Mapping: NeMo-map is a continuous spatio-temporal dynamic map based on neural implicit functions that maps spatio-temporal coordinates directly to Semi-Wrapped Gaussian Mixture Model (SWGMM) parameters. It eliminates the spatial discretization and temporal segmentation constraints of conventional methods, achieving lower negative log-likelihood (NLL) and smoother velocity distributions on real pedestrian tracking data.
ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving: ResWorld proposes a Temporal Residual World Model (TR-World) that extracts dynamic object information by computing temporal residuals of BEV scene representations—without relying on detection or tracking—thereby avoiding redundant modeling of static regions. Combined with a Future-Guided Trajectory Refinement (FGTR) module that leverages predicted future BEV features to refine planned trajectories, ResWorld achieves state-of-the-art planning performance on nuScenes and NAVSIM.
SEAL: Segment Any Events with Language: This paper introduces the open-vocabulary event instance segmentation (OV-EIS) task for the first time, and proposes the SEAL framework. Through multimodal hierarchical semantic guidance (MHSG) and a lightweight multimodal fusion network, SEAL achieves multi-granularity (instance-level + part-level) semantic segmentation of event streams using only event–image pairs (without dense annotations), substantially outperforming all baselines while achieving the fastest inference speed.
SiMO: Single-Modality-Operable Multimodal Collaborative Perception: This paper proposes SiMO, a framework that combines the LAMMA fusion module with the PAFR training strategy. It delivers, for the first time in multi-agent collaborative perception, a multimodal system that remains operational when any modality is absent, in particular when LiDAR fails and only cameras are available. The design is analogous to a parallel circuit: the system functions as long as at least one pathway is active.
Single Pixel Image Classification using an Ultrafast Digital Light Projector: This paper presents an experimental single-pixel imaging (SPI) system based on a microLED-on-CMOS ultrafast digital light projector, combined with low-complexity machine learning models (ELM and DNN) to achieve sub-millisecond image encoding and kHz-rate image classification. The system attains >90% accuracy on the MNIST dataset and >99% AUC in binary classification scenarios.
SPACeR: Self-Play Anchoring with Centralized Reference Models: SPACeR proposes a "human-like self-play" framework that uses a pretrained tokenized autoregressive motion model as a centralized reference policy. By incorporating log-likelihood rewards and KL divergence constraints, it guides a decentralized self-play RL policy to align with the human driving distribution. SPACeR outperforms pure self-play methods on WOSAC while achieving 10× faster inference and 50× fewer parameters than imitation learning approaches.
Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis: SG-NLF is a pose-free LiDAR NeRF framework that integrates spectral information with geometric consistency. It combines a hybrid spectral-geometric representation for continuous, smooth geometry reconstruction, a confidence-aware pose graph for global pose optimization, and an adversarial learning strategy that enforces cross-frame consistency, achieving improvements of 35.8% in reconstruction quality and 68.8% in pose accuracy over the previous state of the art.
ST4VLA: Spatially Guided Training for Vision-Language-Action Models: This paper proposes ST4VLA, a two-stage spatially guided training framework (spatial grounding pre-training + spatially guided action post-training) that explicitly injects VLM spatial priors into VLA policy learning. On SimplerEnv, it improves the Google Robot success rate from 66.1% to 84.6% and WidowX from 54.7% to 73.2%, achieving state-of-the-art performance.
Steerable Adversarial Scenario Generation through Test-Time Preference Alignment (SAGE): SAGE reformulates adversarial scenario generation for autonomous driving as a multi-objective preference alignment problem. By training two preference expert models and interpolating their weights at inference time, it enables a continuous, steerable trade-off between adversariality and realism without retraining. This yields a full spectrum of scenarios, from mild to aggressive, and substantially improves closed-loop training performance.
x²-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space: x²-Fusion introduces Event Edge Space, the first edge-based isomorphic latent space, which unifies image, LiDAR, and event camera features into a shared edge-centric representation. Combined with reliability-aware adaptive fusion and cross-dimension contrastive learning, it achieves state-of-the-art joint 2D optical flow and 3D scene flow estimation under both standard and degraded conditions.
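
Illustrative sketch (not from any of the papers above): the SAGE note mentions interpolating the weights of two preference experts at inference time. The snippet below shows what a plain linear interpolation of two parameter sets could look like. The linear form, the function name `interpolate_weights`, the `lam` knob, and the toy "realism" and "adversarial" experts are all assumptions made for illustration; the paper's actual interpolation scheme may differ.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(expert_a: Weights, expert_b: Weights, lam: float) -> Weights:
    """Return (1 - lam) * expert_a + lam * expert_b, parameter by parameter.

    Assumption: simple linear interpolation; lam = 0 recovers expert_a,
    lam = 1 recovers expert_b, and values in between blend the two.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if expert_a.keys() != expert_b.keys():
        raise ValueError("experts must share the same parameter names")

    merged: Weights = {}
    for name, a_vals in expert_a.items():
        b_vals = expert_b[name]
        if len(a_vals) != len(b_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - lam) * a + lam * b for a, b in zip(a_vals, b_vals)]
    return merged


# Toy experts (made-up numbers, purely for demonstration).
realism_expert: Weights = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
adversarial_expert: Weights = {"layer.weight": [4.0, -2.0], "layer.bias": [3.0]}

# Sweeping lam moves the blended model from one expert to the other
# without any retraining step.
for lam in (0.0, 0.5, 1.0):
    print(lam, interpolate_weights(realism_expert, adversarial_expert, lam))
```

In this sketch a single scalar `lam` plays the role of the steering knob: no gradient step is taken, only a new set of weights is assembled from the two fixed experts.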