🚗 Autonomous Driving
📷 CVPR2026 · 140 paper notes
📌 Same area in other venues: 🔬 ICLR2026 (50) · 🧪 ICML2026 (8) · 🤖 AAAI2026 (56) · 🧠 NeurIPS2025 (47) · 📹 ICCV2025 (91)
🔥 Top topics: Autonomous Driving ×31 · Multimodal/VLM ×13 · Agents ×11 · Segmentation ×10 · 3D Object Detection ×9
- ActiveAD: Planning-Oriented Active Learning for End-to-End Autonomous Driving
-
ActiveAD designs a "planning-oriented" active learning strategy for end-to-end autonomous driving: it uses nearly free meta-information (weather/lighting/driving commands/speed) for diversity initialization to solve the cold-start problem, and selects the most critical scenarios using three label-free criteria: displacement error, soft collision, and agent uncertainty. Training on only 30% of the data matches the performance of SOTA models trained on 100% data in both nuScenes open-loop and CARLA closed-loop evaluations.
- AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception
-
The authors propose AdaRadar, an online adaptive radar data compression framework based on DCT spectral pruning and zeroth-order proxy gradients. It achieves over 100× compression with a loss of only about 1 percentage point in detection/segmentation performance, effectively alleviating the bandwidth bottleneck between radar sensors and compute units.
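As a rough illustration of what DCT spectral pruning means, the sketch below transforms a 1D signal with an orthonormal DCT-II, keeps only the largest-magnitude coefficients, and reconstructs the signal from them. It is a toy, pure-Python example under stated assumptions: the names `dct`, `idct`, and `prune` are hypothetical, selecting coefficients by magnitude is an assumption, and the rate adaptation driven by zeroth-order proxy gradients is not shown.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1D sequence (O(n^2), for illustration only)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            math.sqrt(2.0 / n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def prune(c, keep):
    """Keep the `keep` largest-magnitude coefficients and zero the rest.

    Magnitude-based selection is an assumption made for this sketch.
    """
    order = sorted(range(len(c)), key=lambda k: abs(c[k]), reverse=True)
    kept = set(order[:keep])
    return [c[k] if k in kept else 0.0 for k in range(len(c))]


# Toy signal that is exactly one DCT basis function, so one coefficient suffices.
n = 64
signal = [math.cos(math.pi * (i + 0.5) * 2 / n) for i in range(n)]
restored = idct(prune(dct(signal), keep=1))
max_err = max(abs(a - b) for a, b in zip(signal, restored))
```

Real radar tensors are not this sparse in the spectral domain, so the number of coefficients kept trades bandwidth against reconstruction error.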
- AMap: Distilling Future Priors for Ahead-Aware Online HD Map Construction
-
AMap identifies a safety hazard in existing temporal HD mapping methods: they "only enhance the rear area already passed and provide almost no improvement for the critical road ahead." It proposes a "distill-from-future" paradigm—using a teacher capable of seeing future frames to implicitly instill forward priors into a lightweight student observing only the current frame, significantly improving ahead-mapping accuracy (A-mAP) with zero inference overhead.
- An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving
-
This paper proposes ADMesh (a high-quality 3D model library with 15K+ assets) and CarlaOcc (a panoptic occupancy dataset with 100k frames and 0.05 m precision). It provides the first instance-level annotations and physically consistent ground truth for 3D panoptic occupancy prediction in autonomous driving, along with occupancy quality evaluation metrics and a systematic benchmark.
- BEV-CAR: Enhancing Monocular Bird's Eye View Segmentation with Context-Aware Rasterization
-
BEV-CAR introduces a "training-only, inference-removed" context-aware rasterization mechanism that rearranges decoder outputs into rays along the lines of sight. Using discrete sampling via the Bresenham algorithm and ray-wise supervision, combined with a dual-branch (depth + global) BEV feature fusion, it achieves SOTA results on nuScenes (31.5% mIoU) and Argoverse (29.9% mIoU) with zero additional inference overhead at 43.1 FPS.
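The discrete sampling step mentioned above relies on the Bresenham line algorithm, which enumerates the grid cells crossed by a line between two integer points. The sketch below is the standard integer-only formulation, not BEV-CAR's code; the function name `bresenham` and the idea of tracing a ray from an origin cell to a target cell of a BEV grid are illustrative assumptions.

```python
def bresenham(x0, y0, x1, y1):
    """Return the integer grid cells on the line from (x0, y0) to (x1, y1)."""
    cells = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy  # combined error term for both axes
    while True:
        cells.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:  # step along x
            err += dy
            x0 += sx
        if e2 <= dx:  # step along y
            err += dx
            y0 += sy
    return cells


# Cells visited by a ray from the origin cell to cell (3, 1).
ray = bresenham(0, 0, 3, 1)
```

Each returned cell could index one BEV feature along the line of sight, which is the kind of ordered sequence that ray-wise supervision operates on.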
- BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird's-Eye View Images
-
This paper proposes BEV-SLD, a LiDAR global localization method based on self-supervised Scene Landmark Detection (SLD). By decoupling detection from correspondence prediction, it achieves high-precision \((x, y, \text{azimuth})\) pose estimation across various scenarios with a compact storage footprint of only 20MB.
- Beyond Rule-Based Agents: Active Markov Games for Realistic Multi-Agent Interaction in Autonomous Driving
-
This paper models the driving environment as an "Active Markov Game" (AMG) where both state transitions and rewards depend on the current policies of all agents. By employing multi-agent co-evolutionary training, the ego policy plays against and evolves with a pool of diverse opponent strategies. This approach learns robust interactive decision-making at unsignalized intersections and in long-tail scenarios in CARLA, reducing the collision rate to 0.02 and achieving a success rate of 98%.
- BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds
-
BuildAnyPoint is proposed to achieve unified reconstruction from diverse point cloud distributions (airborne LiDAR, SfM, sparse noisy points) to structured 3D building meshes using a Loosely-coupled Cascaded Diffusion Transformer (Loca-DiT). The framework first restores the underlying point cloud distribution through hierarchical latent diffusion and subsequently generates compact polygonal meshes via an autoregressive Transformer.
- C-LaV: Conditional Latent Velocity Field Denoising for Weather-Robust LiDAR Place Recognition
-
C-LaV compensates for LiDAR degradation caused by rain, snow, and fog within the BEV latent space of a frozen DINOv2. By learning a velocity field via conditional Flow Matching and solving a probability flow ODE, it deterministically transports "weather-noisy latent representations" back to "clear-day latent representations." Using a SALAD clustering head for global descriptor retrieval, it achieves Recall@1 improvements of 17.5% on NCLT Snowy and 21.5% on real-world Boreas datasets.
- CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography
-
CARD is a multi-modal autonomous driving dataset targeting "non-flat road surfaces" (speed bumps, potholes, irregularities, and off-road sections). Through a novel multi-LiDAR fusion ground truth generation pipeline, it provides approximately 500,000 measured LiDAR depth points per frame (about 6.5 times that of KITTI Depth Completion). It also includes 2D bounding boxes for road topography, wheel-ground contact excitation trajectories, and standardized evaluation protocols, and is specifically designed to evaluate depth estimation/completion capabilities for fine-grained road geometry.
- CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception
-
CATNet targets the two major realistic challenges in V2X cooperative perception: "communication delay + multi-source noise." By cascading Spatio-Temporal Synchronous (STSync), Dual-branch Wavelet Denoising (WTDen), and Adaptive Feature Selection (AdpSel) modules, it achieves SOTA AP on OPV2V/V2XSet/DAIR-V2X datasets under noisy and delayed scenarios with only 9.95M parameters.
- CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention
-
This paper proposes CausalVAD, which parameterizes Pearl’s backdoor adjustment theory into a plug-and-play module (SCIS). By performing multi-level causal interventions across the perception, prediction, and planning stages of the VAD architecture, it eliminates spurious correlations and achieves safer and more robust end-to-end autonomous driving.
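For reference, the backdoor adjustment that the summary refers to is the standard identity below, where \(Z\) is a set of confounders satisfying the backdoor criterion for the effect of \(X\) on \(Y\). How SCIS parameterizes the sum over \(Z\) is not described here, so the symbols are generic rather than the paper's notation.

\[
P\big(Y \mid \mathrm{do}(X = x)\big) = \sum_{z} P\big(Y \mid X = x, Z = z\big)\, P(Z = z)
\]

The key difference from ordinary conditioning is that the confounder is weighted by its marginal \(P(Z = z)\) rather than by \(P(Z = z \mid X = x)\), which removes the spurious path through \(Z\).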
- CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection
-
To address the modal imbalance issue in dual-branch multi-modal 3D detectors under domain shift, the CCF framework is proposed. It systematically improves camera query utilization and cross-domain robustness through three components: decoupled loss, LiDAR-guided depth priors, and complementary cross-modal masking.
- ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data
-
The authors propose the ClimaDrive data generation framework and the ClimaOoD benchmark dataset. By combining semantic-guided multi-weather scene generation with perspective-aware anomaly object placement, they construct a 10K+ training set covering 6 weather conditions and 93 anomaly categories. When trained on this data, four SOTA methods improve AP by 3.25% on average.
- CogDriver: Integrating Cognitive Inertia for Temporally Coherent Planning in Autonomous Driving
-
CogDriver explicitly injects "cognitive inertia" (the natural persistence of human intent) into end-to-end driving systems. It utilizes a multi-view spatiotemporal MLLM to automatically label VLA datasets with continuous narratives while integrating a Sparse Temporal Consistency Module (TCM) within the agent to maintain stable internal states. This prevents decision jitter and yields a 22% increase in Driving Score on Bench2Drive and a 21% reduction in L2 error on nuScenes, setting a new SOTA.
- CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
-
The CoIn3D framework is proposed, which explicitly models the spatial prior differences in camera intrinsics, extrinsics, and array layouts through two modules: Spatial-aware Feature Modulation (SFM) and Camera-aware Data Augmentation (CDA). It achieves strong generalization of multi-camera 3D detection models from source configurations to unseen target configurations and is applicable to the three major paradigms: BEVDepth, BEVFormer, and PETR.
- ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving
-
ColaVLA proposes a unified Vision-Language-Action (VLA) framework that shifts VLM reasoning from textual Chain-of-Thought (CoT) to the latent space. Through a Cognitive Latent Reasoner and a Hierarchical Parallel Planner, it efficiently completes scene understanding and trajectory decoding in just two VLM forward passes, achieving SOTA performance on both nuScenes open-loop and closed-loop benchmarks.
- CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion
-
CoLC proposes a communication-efficient early collaborative perception framework. It reduces transmission volume through Foreground-Aware Point Sampling (FAPS), restores dense pillar representations on the ego side using VQ-based LiDAR Completion (CEEF), and ensures semantic and geometric consistency via Dense-Guided Double Alignment (DGDA). This maintains or even exceeds early fusion detection performance while significantly lowering communication bandwidth.
- CoopDiff: A Diffusion-Guided Approach for Cooperation under Corruptions
-
CoopDiff reformulates the "corruption robustness" problem in multi-agent cooperative perception as a feature-space diffusion denoising task. A quality-aware teacher generates clean supervisory features, which a dual-branch diffusion student reconstructs from noisy inputs. This approach consistently outperforms existing SOTA across six types of corruption, including fog, motion blur, and EMI.
- Counterfactual VLA: Self-Reflective Vision-Language-Action Model with Adaptive Reasoning
-
CF-VLA enables an autonomous driving VLA to first generate "time-segmented meta-actions" and then perform counterfactual reasoning on its own proposed actions ("What would happen if I follow this plan, and should I modify it?") to self-correct before outputting a trajectory. Coupled with a rollout–filter–label data pipeline that labels counterfactual traces only for difficult scenarios, the model learns "adaptive reasoning"—thinking only when necessary—improving trajectory accuracy by approximately 17.6% and safety metrics by roughly 20%.
- CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird's-Eye-View Semantic Segmentation
-
This paper proposes the CycleBEV regularization framework: an Inverse View Transformation (IVT) network is introduced during training to map BEV segmentation maps back to Perspective View (PV) segmentation maps. Existing BEV semantic segmentation models are enhanced through a cycle consistency loss, height-aware geometric regularization, and cross-view latent space alignment, with zero additional overhead during inference.
- Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation
-
DeGO introduces a "soft-rigid mask" for every 3D Gaussian in weakly-supervised camera-based occupancy prediction, allowing adaptive selection between "rigid displacement" and "non-rigid deformation." By distilling factorized cross-camera and cross-frame features from the VGGT 4D foundation model, it achieves a 10.9% improvement in overall mIoU and a 13.5% increase in human-centric metrics on Occ3D-NuScenes.
- Den-TP: A Density-Balanced Data Curation and Evaluation Framework for Trajectory Prediction
-
From a data-centric perspective, the Den-TP framework is proposed to address the long-tail density imbalance in trajectory prediction datasets through density-aware data curation and evaluation protocols. It maintains overall performance and significantly improves robustness in high-density scenarios using only 50% of the data.
- Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving
-
To address the dilemma of "frame-by-frame jitter" and "copying historical trajectories" in learned planners, DFP segments the entire trajectory into historical/current/future chunks, independently adds noise to each for joint denoising, and employs "History-Annealed CFG" during inference to controllably adjust the intensity of historical influence. It achieves SOTA among learned baselines on nuPlan closed-loop benchmarks by being both stable and scene-adaptive.
- DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving
-
DLWM is proposed as a holistic Gaussian-centric pre-training paradigm for autonomous driving via dual latent world models. Stage 1 involves self-supervised learning of 3D Gaussian scene representations (rendering multi-view semantics and depth maps). Stage 2 trains dual latent world models: a Gaussian-flow-guided model for downstream occupancy perception/prediction (+1.02/+2.68 mIoU), and an ego-trajectory-guided model for motion planning (-16% L2 error). This architecture addresses the permutation invariance challenge of Gaussian queries across frames, which previously prevented direct supervision.
- Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving
-
DMW (Drive My Way) is proposed as a personalized VLA driving framework that learns long-term driving habits via user embeddings and adapts to short-term preferences through natural language instructions, utilizing GRPO reinforcement fine-tuning and style-aware rewards to generate personalized driving behaviors.
- DriveCombo: Benchmarking Compositional Traffic Rule Reasoning in Autonomous Driving
-
DriveCombo is the first multimodal benchmark for "compositional traffic rule reasoning." It organizes 70,000 multiple-choice questions (MCQs) using a five-level cognitive ladder—ranging from single-rule understanding to rule conflict arbitration. It utilizes a Rule2Scene Agent to automatically convert textual regulations into executable 3D driving scenarios in CARLA. Evaluations of 14 mainstream MLLMs reveal a sharp drop in accuracy to 41%–44% on the highest-level conflict arbitration tasks, significantly lower than the human performance of >98%.
- DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
-
DriveMoE integrates Mixture-of-Experts (MoE) into both the perception and decision-making components of a VLA autonomous driving model. The perception side utilizes a Vision MoE to dynamically select critical camera views to save tokens, while the decision-making side employs an Action MoE to allocate dedicated experts for different driving skills. On the Bench2Drive closed-loop benchmark, it improves the Driving Score (DS) from 55.85 to 74.22 and the Success Rate (SR) from 30% to 48.64%.
- DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning
-
DrivePI utilizes a Qwen2.5 backbone of only 0.5B parameters to integrate LiDAR point clouds, multi-view images, and language instructions into a single MLLM. Through four specialized heads, it simultaneously outputs scene descriptions, 3D occupancy, occupancy flow, and planned trajectories. This allows the VLA model to possess language interaction capabilities while recovering the fine-grained spatial perception typical of VA models. End-to-end joint training enables it to surpass 7B-scale VLA and specialized VA methods.
- DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation
-
DrivePTS addresses three major pain points in controllable autonomous driving scene generation: the coupling of maps and 3D boxes, coarse textual descriptions, and blurred foreground structures. It proposes a progressive training strategy that learns roads before objects (with mutual information constraints for decoupling), VLM-generated 6D multi-view descriptions, and frequency-guided structural loss. On nuScenes, it reduces FID to 11.45, increases road mIoU to 63.95, and successfully generates rare road conditions where previous methods failed.
- DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance
-
This work proposes the first 360° omnidirectional driver attention dataset (approx. 1M frames with 19 drivers) and introduces DriverGaze360-Net. By leveraging an auxiliary semantic segmentation head to jointly learn attention maps and attended objects, the method achieves SOTA attention prediction performance on panoramic driving images.
- DriveVLN: Towards Mapless Vision-and-Language Navigation in Autonomous Driving
-
DriveVLN brings "Vision-and-Language Navigation" to autonomous driving: in scenarios without high-definition maps and given only destination-level instructions (e.g., "go to the exit/charging station"), it enables vehicles to find their way using visual cues and historical decisions. The authors reconstructed 200 real-world scenes in CARLA to create a closed-loop benchmark and established a baseline using a "Planning Module for candidates + VLM for trajectory selection + Two-stage training (SFT→GRPO RL)" approach. The resulting Driving Score of 0.67 outperforms Seed-1.6 and GPT-5.
- Driving on Registers (DrivoR)
-
DrivoR utilizes a pure transformer end-to-end driving architecture. It adds a set of learnable register tokens to each camera to compress thousands of ViT visual tokens into dozens of "scene tokens." Two decoupled decoders are then used for generating and scoring candidate trajectories. With only approximately 40M parameters, DrivoR matches or exceeds heavier baselines on NAVSIM-v1/v2 and closed-loop HUGSIM.
- Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving
-
Dr.Occ is proposed as a unified vision-only 3D occupancy prediction framework. It leverages high-quality depth priors from MoGe-2 for precise geometric alignment via a Depth-guided Dual-projected View Transformer (D2-VFormer). Furthermore, it introduces region-guided MoE/MoR expert Transformers (R-EFormer / R2-EFormer) to adaptively assign experts to specific spatial regions, addressing spatial-semantic imbalance. It improves the BEVDet4D baseline by 7.43% mIoU on Occ3D-nuScenes.
- DSERT-RoLL: Robust Multi-Modal Perception for Diverse Driving Conditions with Stereo Event-RGB-Thermal Cameras, 4D Radar, and Dual-LiDAR
-
This paper introduces DSERT-RoLL, a driving dataset that simultaneously collects stereo Event-RGB-Thermal cameras, 4D Radar, and dual LiDAR, covering extreme conditions such as rain, snow, fog, nighttime, and HDR. It proposes a multi-modal 3D detection framework that first generates initial boxes from ranging sensors, supplements semantics using voxel-centric deformable sampling with three-way camera features, and finally fuses them via camera-confidence gating, achieving the highest AP across all weather and lighting conditions.
- Efficient Equivariant Transformer for Self-Driving Agent Modeling
-
This paper proposes DriveGATr, an equivariant Transformer architecture based on 2D Projective Geometric Algebra (PGA). It achieves SE(2)-equivariance without explicit pairwise relative position encoding, reaching SOTA performance in traffic simulation tasks while significantly reducing computational costs.
- ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders
-
Based on the lightweight real-time LiDAR geometry compressor RENO, ELiC introduces a "triad" of Cross-Bit-depth Feature Propagation, Bag-of-Encoders selection, and Morton order-preserving hierarchy. By allowing sparse high bit-depth layers to reuse contextual features from dense low bit-depth layers, ELiC achieves state-of-the-art compression rates with a real-time throughput of 10 FPS on Ford and SemanticKITTI datasets.
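To make the Morton ordering concrete, the sketch below interleaves the bits of integer voxel coordinates into a single Z-order code. It is a generic illustration, not ELiC's implementation; the function name `morton3d` and the 10-bit default are assumptions. The property relevant to a bit-depth hierarchy is that dropping the three lowest bits of a code gives the code of the parent voxel at the next coarser level.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)      # x occupies bit positions 0, 3, 6, ...
        code |= ((y >> i) & 1) << (3 * i + 1)  # y occupies bit positions 1, 4, 7, ...
        code |= ((z >> i) & 1) << (3 * i + 2)  # z occupies bit positions 2, 5, 8, ...
    return code


# A voxel at a fine level and its parent one bit-depth coarser.
child = morton3d(5, 3, 6)
parent = morton3d(5 >> 1, 3 >> 1, 6 >> 1)
```

Because `child >> 3 == parent`, sorting points by Morton code keeps every voxel's children contiguous and in the same relative order as their parents across levels.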
- EventDrive: Event Cameras for Vision-Language Driving Intelligence
-
EventDrive establishes the first benchmark that integrates event streams, RGB frames, and language supervision across the entire driving chain (Perception → Understanding → Prediction → Planning, featuring 4 levels, 17 sub-tasks, and 470,000 samples). It introduces EventDrive-VLM—utilizing "multi-scale voxelization + MoE-gated dynamic temporal encoders" and an "Event Q-Former" to align asynchronous events into the LLM semantic space. Event-frame fusion consistently outperforms frame-only and event-only models across all task families, reducing L2 planning error from 4.54m to 3.66m.
- EMDUL: Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets
-
The EMDUL pipeline is proposed to significantly expand the scale and diversity of mmWave HPE datasets. It achieves this by annotating unlabeled mmWave data with pseudo-labels (using a novel Unsupervised Temporal Consistency Loss, UTCL) and employing a closed-form LiDAR→mmWave point cloud converter (featuring Flow-based Point Filtering, FPF). This approach reduces in-domain error by 15.1% and cross-domain error by 18.9%.
- Failure Modes for Deep Learning-Based Online Mapping: How to Measure and Address Them
-
This paper systematically defines and quantifies two failure modes of deep learning-based online mapping models—localization overfitting and geometric overfitting. It proposes a performance metric based on Fréchet distance and a training set sparsification strategy based on Minimum Spanning Tree (MST). Validation on nuScenes and Argoverse 2 demonstrates that geometrically diverse and balanced training sets improve model generalization.
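The exact metric is not spelled out in this summary, so the following only sketches the standard discrete Fréchet distance between two polylines, the quantity such a metric would build on. The function name `discrete_frechet` and the representation of map elements as lists of 2D points are assumptions made for illustration.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between polylines p and q (lists of points)."""
    n, m = len(p), len(q)
    # ca[i][j] is the coupling distance between the prefixes p[:i+1] and q[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[-1][-1]


# Two parallel lane segments that are one unit apart.
lane_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
lane_b = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
gap = discrete_frechet(lane_a, lane_b)
```

Unlike a point-wise Chamfer distance, the Fréchet distance respects the ordering of points along each curve, which matters when comparing the shape of lane or boundary polylines.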
- FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts
-
The FedBPrompt framework is proposed, which splits prompts into Body Part Alignment Prompts and Holistic Full Body Prompts via the Body Distribution Aware Visual Prompts Mechanism (BAPM). Coupled with the Prompt-based Fine-Tuning Strategy (PFTS), the ViT backbone is frozen while only lightweight prompts are trained (reducing communication overhead to ~1%). It achieves average gains of 3.3% mAP and 4.9% Rank-1 in FedDG-ReID tasks.
- FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision
-
FlashCap is proposed as the first motion capture system based on flashing LEDs and event cameras. By assigning different flashing frequencies to each LED for identity recognition, the authors constructed FlashMotion, the first human motion dataset with 1000Hz annotation accuracy (7.15 million frames). Furthermore, the ResPose baseline method was introduced, reducing motion timing error from ~50ms to ~5ms and improving MPJPE in pose estimation by approximately 40%.
- FoSS: Modeling Long-Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier–State Space Integration
-
FoSS introduces a frequency-time dual-branch framework that organizes Fourier spectra via Progressive Helix Reordering (HelixSort) for processing by a Selective State Space Model (SSM). Combined with a time-domain dynamic SSM and cross-attention fusion, it achieves SOTA trajectory prediction accuracy on Argoverse 1/2 while reducing parameters by over 40% and inference latency by 22%.
- GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
-
GaussianDWM utilizes "Language-enhanced 3D Gaussians" as a unified scene representation. By embedding CLIP language features into each Gaussian ellipsoid, it achieves explicit alignment between text and 3D geometry. Through task-aware sampling, compact 3D tokens are fed into an LLM for scene understanding (description/2D-3D grounding/planning), while dual-condition diffusion performs RGB-D spatiotemporal generation. On the NuInteract understanding task, the average score improved from 52.12 to 59.23; on nuScenes spatial generation, the FID for \(\pm 2m\) offset was reduced to 11.27.
- GEM: Generating LiDAR World Model via Deformable Mamba
-
GEM aligns LiDAR scan sequences with Mamba's step-by-step scanning mechanism. It utilizes a Mamba scene tokenizer to compress unordered point clouds into ordered latents, followed by unsupervised decoupling of dynamic objects and static environments modeled by a triple-path deformable Mamba. Ultimately, it establishes a new SOTA for 1s/3s future prediction on nuScenes/KITTI (reducing Chamfer Distance by 81% compared to the runner-up at 1s), while supporting autonomous rollout and BEV-controllable "what-if" generation.
- Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction
-
GPOcc proposes utilizing generalizable visual geometry priors (e.g., VGGT, DepthAnything) for monocular 3D occupancy prediction. By extending surface points inward along camera rays to generate volumetric samples, the method performs probabilistic occupancy inference using sparse Gaussian primitives. It introduces a training-free incremental update strategy for streaming inputs, achieving a +9.99 mIoU gain in monocular settings and +11.79 in streaming settings over the previous SOTA on Occ-ScanNet, while running 2.65x faster under the same depth prior.
- Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal
-
Ghost-FWL introduces the first large-scale mobile full-waveform LiDAR dataset (24K frames, 7.5 billion peak-level annotations) and designs the FWL-MAE self-supervised pre-training framework to achieve ghost detection and removal, reducing SLAM trajectory errors by over 66% and 3D detection false positive rates by a factor of 50.
- GuideFlow: Constraint-Guided Flow Matching for Planning in End-to-End Autonomous Driving
-
GuideFlow employs "Flow Matching + Energy-Based Models" for end-to-end driving planning, directly embedding safety and physical hard constraints into the generation process via three mechanisms: Constrained Velocity Field (CVF), Constrained Flow states (CF), and Refining the Flow by EBM (RFE). This approach mitigates multi-modal mode collapse in imitation learning and eliminates the need for post-optimization in generative methods, achieving a SOTA 43.0 EPDMS on NAVSIM Navhard.
- HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation
-
Addressing the severe shortage of extreme weather samples in lane detection datasets (CULane/TuSimple), this paper proposes HG-Lane—a two-stage diffusion generation framework without re-annotation. Stage-I preserves lane geometric structure through Control Information Fusion + Structure-aware Reverse Diffusion, while Stage-II adjusts lighting styles via Appearance-aware Refinement to generate 30K images across snow/rain/fog/night/dusk. The overall mF1 of CLRNet improves by +20.87%, with a +38.8% gain in snow scenarios.
- HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps
-
HOLO reformulates "fine-grained localization of surround-view images on standard-definition (SD) maps" as a homography estimation problem between BEV features and map tiles: first, semantic alignment is used to pull the two modalities into feature pairs satisfying homography constraints; then, the homography relationship guides feature fusion and constrains the pose output within a feasible solution space. This approach achieves faster convergence and higher localization accuracy than traditional "attention fusion + direct 3-DoF pose regression" methods, improving Recall@1m/2m on nuScenes by approximately 16%.
- HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
-
HorizonForge proposes a unified framework that reconstructs driving scenes into editable Gaussian Splats + Mesh representations. It achieves precise 3D manipulation through trajectory control and language-driven vehicle insertion. High-quality driving videos with spatio-temporal consistency are then generated via a video diffusion model, outperforming all baseline methods with a 91.02% user preference rate.
- Hybrid Robust Collaborative Perception with LiDAR-4D Radar Fusion under Adverse Weather Conditions
-
Targeting "multi-agent collaborative perception under adverse weather," HRCP proposes a hybrid collaboration strategy based on the physical characteristics of sensors (early collaboration for sparse 4D radar via raw point clouds; intermediate collaboration for dense LiDAR via features). It reformulates LiDAR-4D radar fusion as "jointly reconstructing a dense and reliable representation," using Bi-directional Cross-Modal Gating (BCMG) for mutual reliability verification and Adaptive Feature Enhancement (AFE) to recover information loss, outperforming SOTA on V2X-R simulation and V2X-Radar-C real-world datasets.
- HybridDriveVLA: Vision-Language-Action Model with Visual CoT reasoning and ToT Evaluation for Autonomous Driving
-
HybridDriveVLA replaces the traditional "image-to-text then CoT reasoning" in driving VLAs with direct prediction of future scenes in the visual domain (V-CoT). It employs a Tree-of-Thought multi-trajectory evaluation (ToT-Evaluation) to score candidates point-by-point across safety, progress, and comfort dimensions to select the optimal waypoint sequence, reducing the average collision rate of autoregressive VLAs to 0.17% on nuScenes.
- IntrinsicWeather: Controllable Weather Editing in Intrinsic Space
-
Diffusion models are utilized to decompose images into intrinsic maps consisting of "weather-invariant material/geometry + weather-dependent illumination." Target weather is then re-rendered in the intrinsic space using text prompts. This approach achieves fine-grained controllable weather editing while preserving scene material and geometry, outperforming SOTA in inverse rendering PSNR by over 10 dB and significantly enhancing the robustness of downstream detection/segmentation in adverse weather.
- KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System
-
The KnowVal end-to-end autonomous driving system is proposed, addressing the lack of knowledge reasoning and value alignment through three core components: (1) Retrieval-guided Open-world Perception, which integrates standard 3D detection, VL-SAMv2 for long-tail objects, and VLM for scene understanding; (2) Perception-guided Knowledge Retrieval, which fetches relevant knowledge from driving knowledge graphs (traffic laws, defensive driving, ethics); and (3) a World Model for predicting future states combined with a Value Model (trained on human preferences) to evaluate trajectory value, achieving interpretable decision-making. It achieves the lowest collision rate on nuScenes and SOTA performance on Bench2Drive and NVISIM.
- L3DR: 3D-aware LiDAR Diffusion and Rectification
-
L3DR appends a 3D residual regression network after range-view (RV) LiDAR diffusion to compute per-point offsets that correct "depth bleeding" and "wavy surface" artifacts in back-projected 3D point clouds. By using a Welsch loss to bypass high-bias hallucination regions in training pairs, it achieves SOTA geometric realism on KITTI, KITTI360, nuScenes, and Waymo with minimal computational overhead.
- LA-Pose: Latent Action Pretraining Meets Pose Estimation
-
LA-Pose repurposes the Genie-style "inverse dynamics latent action"—originally used to drive world models or robotic policies—as input features for camera pose estimation. By performing self-supervised pretraining on 10 million unlabeled driving videos to learn latent actions, followed by post-training a lightweight pose head on a minimal amount of 3D-annotated data, the method achieves over 10% higher pose accuracy than feed-forward SOTAs like VGGT on Waymo/PandaSet while using orders of magnitude less labeled data.
- LEAD: Minimizing Learner-Expert Asymmetry in End-to-End Driving
-
This paper identifies that the root cause of the "student failing to learn the privileged expert" in CARLA is not insufficient model capacity, but rather the expert's use of privileged information that is invisible or unmeasurable for the student, combined with sparse navigation intent. By constraining the expert's perception and decision-making to the student's observable range (LEAD expert + dataset) and restructuring the target point injection in the student policy (TFv6), this work achieves 95 DS on Bench2Drive and more than doubles previous SOTA performance on Longest6 v2 / Town13.
- LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization
-
LEADER achieves 24.1% and 73.9% relative reductions in positioning error on LiDAR relocalization tasks by utilizing a robust projective geometric encoder (yaw-invariant) and a truncated relative reliability loss (to suppress unreliable points).
- Learnability-Driven Submodular Optimization for Active Roadside 3D Detection
-
The LH3D framework is proposed, utilizing a three-stage submodular optimization active learning strategy—"Depth Confidence → Semantic Balance → Geometric Diversity"—to suppress the selection of inherently ambiguous samples in roadside monocular 3D detection. It significantly outperforms traditional uncertainty/diversity-based AL methods using only a 20% annotation budget.
- Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception
-
This paper proposes the MVIG attack framework, which uniformly models the vulnerabilities of various defensive collaborative perception systems as a Mutual View Information Graph. By combining temporal graph learning with entropy-aware vulnerability searching, it achieves adaptive fabrication attacks that reduce defense success rates by up to 62%.
- Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
-
This paper proposes LFG (Learning to drive is a Free Gift), a completely label-free, teacher-guided autonomous driving pretraining framework. It learns a unified geometry-, semantic-, and motion-aware pseudo-4D representation from large-scale unposed YouTube driving videos. On the NAVSIM benchmark, it outperforms multi-camera + LiDAR BEV methods (PDMS 85.2) using only a monocular front-facing camera and demonstrates exceptional data efficiency (achieving 81.4 PDMS with only 10% labels).
- Learning to Identify Out-of-Distribution Objects for 3D LiDAR Anomaly Segmentation
-
LIDO directly models the distribution of inlier classes in the feature space using a semantic head to maintain "confidence-based prototypes" and a contrastive head to push inlier features away from the hypersphere center. During inference, it fuses cosine distance, entropy, and feature norm signals to assign anomaly scores to each point, achieving SOTA in 3D LiDAR anomaly segmentation without any anomaly samples. The authors also contribute a mixed real-synthetic OoD dataset to address the lack of evaluation benchmarks in this field.
- LiDAR-to-4DRadar Diffusion Bridge via Cross-Modal Alignment and Translation in Latent Space
-
L2RLDB is the first to translate sparse 3D LiDAR into complete 4D radar tensors including the Doppler dimension. It employs a "Key Voxel-Aware VAE" to compress high-dimensional noisy radar into a low-dimensional latent space, aligns LiDAR latent codes via patch-level contrastive learning, and completes cross-modal translation using a Brownian diffusion bridge in the aligned latent space. The synthesized radar significantly improves downstream 3D detection accuracy.
- LiDAS: Lighting-driven Dynamic Active Sensing for Nighttime Perception
-
LiDAS treats High-Definition (HD) headlights as "visual actuators," utilizing a learned lighting policy network to dynamically determine where to project light in a closed loop. This enables day-trained detection/segmentation models to achieve zero-shot nighttime availability—improving performance by +10.4% mAP50 / +6.8% mIoU in synthetic scenes and +18.7% mAP50 / +5.0% mIoU in real-world closed-loop tests—while saving up to 40% power without retraining downstream models.
- Lipschitz Optimization for Formal Verification of Homographies
-
By formulating "camera 6-DOF pose perturbation → pixel values" as a closed-form homography and extending the piecewise linear + Lipschitz optimization bounds of Batten et al. from affine to non-affine projective transformations, this work performs the first formal verification of "camera motion robustness" for neural networks. Compared to prior work, it achieves up to 89% speedup and 7% tighter bounds, while revealing systematic vulnerabilities to 3D perspective perturbations in VNN-COMP networks and runway visibility classifiers.
- LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration
-
This paper proposes LiREC-Net, the first unified framework to simultaneously perform target-free extrinsic calibration for LiDAR-RGB and LiDAR-Event cameras. By utilizing a shared LiDAR representation (fusing 3D point features and projected depth features) and paired cost volumes for cross-modal alignment, it achieves calibration accuracies of 1.80cm/0.11° on KITTI, and 2.51cm/0.14° (LiDAR-RGB) and 1.18cm/0.07° (LiDAR-Event) on DSEC.
- Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
-
This paper reveals that feature misalignment in LiDAR-Camera fusion is primarily concentrated at foreground-background depth discontinuity boundaries. It proposes three collaborative modules—PGDC (Prior-Guided Depth Calibration), DAGF (Discontinuity-Aware Geometric Fusion), and SGDM (Structure-Guided Depth Modulator)—to actively correct misalignment before fusion. The method achieves SOTA performance on the nuScenes validation set with 71.5% mAP and 73.6% NDS.
- MAD: Motion Appearance Decoupling for Efficient Driving World Models
-
MAD sharply reduces the cost of turning general video diffusion models (VGMs) into driving world models: using a single backbone with two lightweight LoRAs, it first generates "pose videos" consisting only of skeletons to predict motion, then "dresses" the skeletons with textures to render RGB. By decoupling motion and appearance, it matches previous SOTA performance using only 6% of the compute required by competitors.
- MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Autonomous Driving
-
The MeanFuser end-to-end autonomous driving framework is proposed. It uses Gaussian Mixture Noise (GMN) to replace discrete trajectory vocabularies for continuous multi-modal trajectory modeling. By leveraging MeanFlow Identity, it achieves error-free one-step sampling, and an Adaptive Reconstruction Module (ARM) is designed to implicitly decide between selecting existing proposals or reconstructing new trajectories. Using only RGB input and a ResNet-34 backbone, it achieves 89.0 PDMS at 59 FPS on NAVSIM.
- MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving
-
MindDriver introduces a progressive multimodal reasoning framework that mimics the human "Perception → Imagination → Action" mechanism. It executes textual semantic understanding first, followed by imagining future scene images (bridging semantic and physical spaces), and finally predicting trajectories. Combined with a feedback-guided data annotation pipeline and progressive reinforcement fine-tuning, it achieves state-of-the-art performance in both nuScenes open-loop and Bench2Drive closed-loop evaluations.
- Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes (LegoOcc)
-
This paper proposes LegoOcc, which utilizes Language-Embedded Gaussians (LE-Gaussians) as a unified geometric-semantic intermediate representation. Combined with a Poisson-process-based Gaussian-to-Occupancy (G2O) operator and a progressive temperature decay strategy, it achieves monocular open-vocabulary occupancy prediction for indoor scenes using only binary occupancy labels (no semantic annotations), reaching 59.50 IoU / 21.05 mIoU on Occ-ScanNet.
- MTA: Multimodal Task Alignment for BEV Perception and Captioning
-
MTA builds two alignment bridges for the "BEV 3D detection + 3D dense captioning" task pair, which were previously optimized independently. BLA supervises the BEV object queries of the Q-Former using text representations of GT captions, while DCA maps detection and captioning outputs into a shared space via learnable prompts for contrastive alignment. Both modules are active only during training, incurring zero inference overhead while improving detection mAP by 4.9% and captioning CIDEr by 9.2%.
- Neural Distribution Prior for LiDAR Out-of-Distribution Detection
-
NDP proposes a learnable neural distribution prior module to model the distribution structure of network predictions. Combined with pseudo-OOD samples generated via Perlin noise and a soft outlier exposure strategy, it achieves 61.31% AP on the STU benchmark, exceeding previous state-of-the-art results by over 10x.
- NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
-
NoRD demonstrates that autonomous driving VLAs do not require large-scale reasoning annotations or massive datasets. By identifying that the root cause of GRPO's failure on weak SFT policies is difficulty bias (where learning signals from high-variance rollout groups are suppressed), the authors adopt Dr. GRPO instead of standard GRPO for RL post-training. With <60% of data, no reasoning annotations, and 3× fewer tokens, it achieves performance competitive with reasoning-based VLAs on NAVSIM (85.6 PDMS) and WaymoE2E (7.709 RFS).
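The difference between the two advantage estimators can be sketched as follows. This is a minimal illustration, assuming scalar per-rollout rewards within one group: standard GRPO divides the centered reward by the group standard deviation, while Dr. GRPO drops that division (Dr. GRPO also removes response-length normalization, which is not shown here). Function names and reward values are hypothetical.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO: center by the group mean, then divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dr_grpo_advantages(rewards):
    """Dr. GRPO: center by the group mean only, with no std normalization."""
    mu = mean(rewards)
    return [r - mu for r in rewards]

# Two rollout groups with very different reward spread.
high_variance = [0.0, 1.0, 0.0, 1.0]
low_variance = [0.45, 0.55, 0.45, 0.55]
```

With std normalization both groups end up with advantages of roughly ±1, so the high-variance group loses its larger signal relative to the low-variance one; without it, the advantages stay at ±0.5 and ±0.05 respectively.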
- OccAny: Generalized Unconstrained Urban 3D Occupancy
-
OccAny proposes the first generalized unconstrained urban 3D occupancy prediction framework, capable of predicting metric-scale occupancy voxels from monocular, sequential, or multi-view images in uncalibrated, out-of-distribution scenes. Through two key designs—Segmentation Forcing and Novel View Rendering—it outperforms all visual-geometric baselines on KITTI and nuScenes.
- OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
-
OccuFly introduces the first real-world camera-based Semantic Scene Completion (SSC) benchmark from an aerial perspective. It contains over 20,000 samples and 21 semantic categories, covering urban, industrial, and rural scenes across various seasons and altitudes, while revealing the fundamental limitations of current vision foundation models in aerial environments.
- OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera
-
OneOcc is a vision-only panoramic semantic occupancy prediction framework designed for legged/humanoid robots. By integrating dual-projection fusion, dual-grid voxelization, gait displacement compensation, and a hierarchical mixture-of-experts decoder, it achieves 360° semantic scene completion using only a single panoramic camera, outperforming LiDAR baselines on real-world quadruped and simulated humanoid datasets.
- Open-Ended Instruction Realization with LLM-Enabled Multi-Planner Scheduling in Autonomous Vehicles
-
Addressing the neglected requirement for "passengers issuing maneuver-level instructions in natural language" in L4-L5 autonomous driving (AD), this paper proposes a "scheduling-centric" framework. It utilizes an LLM to parse open-ended instructions into a sequence of driving behaviors and generate a scheduling script in a single pass. Multiple MPC motion planners then execute these behaviors sequentially under real-time feedback. While maintaining full-link traceability from language to control, this approach improves the instruction realization success rate by 64%–200% over baselines with only one LLM query.
- Open-Vocabulary Domain Generalization in Urban-Scene Segmentation
-
A new OVDG-SS setting is proposed to unify the handling of unseen domains and unseen classes in semantic segmentation. An S2-Corr module based on State Space Models (SSM) is designed to repair the degradation of text-image correlation caused by domain shifts, achieving efficient and robust cross-domain open-vocabulary segmentation in autonomous driving scenarios.
- OptiMVMap: Offline Vectorized Map Construction via Optimal Multi-vehicle Perspectives
-
OptiMVMap extends offline vectorized HD map construction from "single-vehicle trajectories" to "multi-vehicle collaboration." It proposes a plug-and-play "select-then-fuse" framework: an uncertainty-guided OVS module selects the 2–5 most complementary helper vehicles, which are then fused at the BEV level after pose-tolerant alignment (CVA) and semantic noise filtering (SNF). It improves MapTRv2 by +10.5 and +9.3 mAP on nuScenes and Argoverse2, respectively.
- PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving
-
This paper presents the first study on "Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation (mm-3DPS)." It proposes PanDA: an asymmetric multimodal dropout (AMD) strategy within a mean-teacher framework to simulate single-modal degradation on the source domain for learning domain-invariant features, and a dual-expert pseudo-label refinement (DualRefine) mechanism utilizing 3D geometric superpoints and 2D vision foundation models (VFMs) to repair incomplete or misclassified target domain pseudo-labels. PanDA significantly outperforms 3D semantic segmentation UDA baselines across domain shifts in time, weather, location, and sensors.
- ParkGaussian: Surround-view 3D Gaussian Splatting for Autonomous Parking
-
Targeting "crowded, GPS-denied, and low-light" underground garage scenarios, this work establishes ParkRecon3D, the first 3D reconstruction benchmark for parking (four-way surround-view fisheye + 60,000 parking slot annotations). It then proposes ParkGaussian, which adapts 3DGS to fisheye cameras via UT projection, converts rendering results to Bird's-Eye-View (BEV) using differentiable IPM, and employs a frozen parking slot detector for teacher-student guidance to perform "parking-aware reconstruction." This ensures the reconstruction is not only visually high-fidelity but also perceptually consistent for downstream parking slot detection.
- Perceiving the Near, Reasoning the Distant: Coherent Long-Horizon Trajectory Prediction for Autonomous Driving
-
NDPNet decouples long-horizon trajectory prediction into two specialized decoding paths: "inertia-based near" and "semantic-based distant." These paths are smoothly connected via a temporal bridge module, further enhanced by a Motion-Aware Consistency (MAC) loss that incorporates kinematic priors into training targets. It achieves SOTA on Argoverse 2 and WOMD, marking the first time minFDE6 has been reduced below 1.75 for 8-second predictions.
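For reference, minFDE6 is conventionally the smallest final-point Euclidean error among six predicted trajectories. The sketch below computes that quantity for an arbitrary number of candidates; the function name and the toy trajectories are illustrative only.

```python
import math

def min_fde(candidates, ground_truth):
    """Minimum final displacement error.

    candidates: list of K trajectories, each a list of (x, y) points.
    ground_truth: list of (x, y) points covering the same horizon.
    """
    gx, gy = ground_truth[-1]
    return min(math.hypot(traj[-1][0] - gx, traj[-1][1] - gy) for traj in candidates)

gt = [(0.0, 0.0), (10.0, 0.0)]
preds = [
    [(0.0, 0.0), (7.0, 4.0)],   # endpoint is 5.0 away from the ground truth
    [(0.0, 0.0), (10.0, 1.5)],  # endpoint is 1.5 away from the ground truth
]
```

Here `min_fde(preds, gt)` picks the second candidate, since only the best of the K endpoints counts toward the metric.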
- Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule
-
This paper proposes Perception Characteristics Distance (PCD), a new metric to quantify the reliable detection capability of perception systems at different distances. By statistically modeling the changes in mean and variance of detection confidence relative to distance, PCD defines the maximum reliable detection distance of a perception system, addressing the limitations of traditional static metrics like AP/IoU that fail to reflect distance dependency and stochasticity.
- Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species
-
This paper constructs TPC-268, the first large-scale counting dataset integrating plant taxonomy. It contains 10,000 images, 678,050 point annotations, and 268 countable classes (covering 242 species). Full hierarchical information is annotated according to the Linnaean system, and a comprehensive benchmark is conducted under the Class-Agnostic Counting (CAC) paradigm.
- Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors
-
Points-to-3D encodes partial point clouds from visible regions into the sparse structure (SS) latent space of TRELLIS and completes invisible regions via a mask-aware inpainting network. By integrating a two-stage sampling strategy (structure completion followed by boundary refinement), the method achieves high-fidelity 3D asset and scene generation with explicit geometric controllability, reaching an F-Score of 0.964 on Toys4K (0.998 for visible areas).
- Probabilistic Discrepancy Learning for Roadside LiDAR Scene Completion
-
PDL reformulates the challenge of "severe occlusion due to fixed viewpoints in roadside LiDAR" as a probabilistic inference problem. It first aligns noisy poses from visual detectors into high-precision pseudo-ground truth (pseudo-GT) via Probabilistic Pose Discrepancy Minimization (PPDM), then performs full-scene completion using a Scenario Discrepancy Learning (SDL) diffusion model conditioned on these pseudo-GTs. With dual-path regional/global discrepancy losses and confidence-adaptive CFG inference, PDL reduces Chamfer Distance (CD) by an average of 14.5% and 3D JSD by 6% on V2X-Seq and TUMTraf-V2X.
- ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction
-
This paper proposes the ProOOD framework, which for the first time treats long-tail recognition and Out-of-Distribution (OOD) detection in 3D occupancy prediction from a unified perspective of voxel prototype guidance. Through Prototype-Guided Semantic Infilling (PGSI), Prototype-Guided Tail Mining (PGTM), and the training-free EchoOOD scoring mechanism, it achieves a +3.57% mIoU improvement on SemanticKITTI (+24.80% for tail classes) and a +19.34 increase in OOD detection AuPRCr on VAA-KITTI.
- PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency
-
This paper proposes PTC-Depth, a monocular depth estimation framework combining optical flow triangulation and wheel odometry. By tracking the metric scale of depth foundation models through recursive Bayesian updates, it achieves temporally consistent metric depth prediction and demonstrates strong generalization across multiple datasets including KITTI, TartanAir, and thermal infrared.
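Recursive Bayesian tracking of a scalar metric scale can be sketched as a one-dimensional Gaussian (Kalman-style) filter. The state model, noise variances, and the way scale measurements are obtained are assumptions made for this illustration; PTC-Depth's actual formulation may differ.

```python
def predict(mean, var, process_var):
    """Time update: the scale is assumed constant, so only uncertainty grows."""
    return mean, var + process_var

def update(mean, var, measurement, measurement_var):
    """Measurement update: fuse the prior with one noisy scale observation."""
    gain = var / (var + measurement_var)
    new_mean = mean + gain * (measurement - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var

def track_scale(measurements, init_mean=1.0, init_var=1.0,
                process_var=1e-4, measurement_var=0.05):
    """Run the filter over a sequence of per-frame scale measurements."""
    mean, var = init_mean, init_var
    for z in measurements:
        mean, var = predict(mean, var, process_var)
        mean, var = update(mean, var, z, measurement_var)
    return mean, var
```

Each new measurement pulls the estimate toward it in proportion to the current uncertainty, which is what keeps the tracked scale stable across frames rather than jumping with every noisy observation.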
- R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection
-
R4Det is proposed to systematically address three major challenges in 4D radar-camera fusion—inaccurate depth estimation, pose-less temporal fusion, and small object detection—via three plug-and-play BEV modules: Panoramic Depth Fusion (PDF), Deformable Gated Temporal Fusion (DGTF), and Instance-Guided Dynamic Refinement (IGDR). It achieves 47.29% 3D mAP (+5.47%) on TJ4DRadSet and 66.69% mAP on VoD.
- RAG-TP: A General Framework for Vehicle Trajectory Prediction via Retrieval-Augmented Generation
-
This work reformulates vehicle trajectory prediction from "dependency on online perception priors" to a Retrieval-Augmented Generation (RAG) problem that retrieves historical experiences from large-scale offline knowledge bases. By dynamically fusing retrieved priors into the decoder using a retrieval-driven MoE, RAG-TP matches map-based SOTA, outperforms map-free methods on Argoverse/WOMD, and demonstrates significant advantages in zero-shot cross-domain transfer.
- Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals
-
Rascene is proposed as an Integrated Sensing and Communication (ISAC) framework that utilizes mmWave OFDM communication signals (e.g., 5G/Wi-Fi) for high-fidelity 3D scene imaging. It achieves geometrically consistent recovery from sparse, multipath-interfered RF observations through confidence-weighted multi-frame fusion.
- Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction
-
This paper proposes PRF, a progressive retrospective framework that gradually aligns features of incomplete observations to complete ones through cascaded retrospective units. It significantly improves variable-length trajectory prediction performance and is plug-and-play compatible with existing methods.
- Reliable Policy Transfer for Safety-Aware End-to-End Driving with Deep Reinforcement Learning
-
This paper proposes an end-to-end (E2E) driving deep reinforcement learning (DRL) framework organized around a "reliability interface at the control layer." A single normalized uncertainty signal \(\bar{\sigma}\) simultaneously drives ego-centric relational attention, gated policy entropy, and regularizes cross-domain transfer alignment. Experiments in CARLA under adverse weather and cross-city closed-loop tests demonstrate significant improvements in success rate, reduced violation rates, and better lane keeping compared to strong baselines.
- ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection
-
To address the issue where "2D-to-3D lifting collapses (forming dips, bumps, or twists) due to a lack of geometric invariants" in monocular 3D lane detection, this paper proposes the road manifold hypothesis: "roads are smooth 2D manifolds in \(\mathbb{R}^3\), and lanes are 1D submanifolds embedded upon them." Lane geometry is encoded as Riemannian Gaussian descriptors on Symmetric Positive Definite (SPD) manifolds and integrated into visual features via gated fusion. A sliced 3D tunnel lane IoU loss is introduced, achieving an +8.2% F1 improvement over the baseline and +1.8% over the previous SOTA on OpenLane.
- ReMoT: Reinforcement Learning with Motion Contrast Triplets
-
This paper proposes ReMoT, a unified training paradigm that constructs a 16.5K motion contrast triplet dataset (ReMoT-16K) through rule-driven multi-expert collaboration. Combined with GRPO reinforcement learning optimization featuring logical consistency rewards and length regularization, it systematically addresses fine-grained spatio-temporal reasoning deficiencies of VLMs in scenarios such as navigation, robotic manipulation, and autonomous driving.
- ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving
-
ResAD reformulates trajectory prediction in end-to-end driving from "direct future trajectory prediction" to "predicting normalized residuals relative to an inertial reference trajectory." By using perturbed inertial references for multi-modal generation, diffusion decoding, and trajectory ranking, it achieves SOTA results on NAVSIM v1/v2 with 88.8 PDMS / 85.5 EPDMS using only 2 denoising steps.
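The residual formulation can be sketched as follows, assuming a constant-velocity rollout as the inertial reference and a per-timestep scale for normalization. Both choices, along with all function names, are assumptions for illustration rather than ResAD's exact definitions.

```python
def inertial_reference(position, velocity, dt, steps):
    """Constant-velocity rollout (assumed form of the inertial reference)."""
    x, y = position
    vx, vy = velocity
    return [(x + vx * dt * k, y + vy * dt * k) for k in range(1, steps + 1)]

def to_normalized_residual(future, reference, scales):
    """Training target: deviation from the reference, divided by a per-step scale."""
    return [((fx - rx) / s, (fy - ry) / s)
            for (fx, fy), (rx, ry), s in zip(future, reference, scales)]

def from_normalized_residual(residuals, reference, scales):
    """Inverse mapping: recover a trajectory from predicted residuals."""
    return [(rx + dx * s, ry + dy * s)
            for (dx, dy), (rx, ry), s in zip(residuals, reference, scales)]

reference = inertial_reference((0.0, 0.0), (2.0, 0.0), dt=0.5, steps=3)
future = [(1.0, 0.1), (2.0, 0.4), (3.0, 0.9)]  # gentle drift to the left
scales = [0.1, 0.2, 0.3]                        # hypothetical per-step scales
residuals = to_normalized_residual(future, reference, scales)
```

Normalizing per timestep keeps far-horizon deviations, which are naturally larger, from dominating the regression target.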
- ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
-
This paper defines and formalizes the task of temporally sparse 4D indoor semantic instance segmentation (4DSIS). The proposed ReScene4D method extends 3D instance segmentation architectures to 4D through three temporal information sharing strategies: spatio-temporal contrastive loss, spatio-temporal mask pooling, and spatio-temporal serialization. It achieves SOTA on the 3RScan dataset and introduces the new t-mAP metric to jointly evaluate segmentation quality and temporal identity consistency.
- RLFTSim: Realistic and Controllable Multi-Agent Traffic Simulation via Reinforcement Learning Fine-Tuning
-
A pre-trained imitation learning traffic simulation model (SMART) is fine-tuned in a closed-loop setting using Reinforcement Learning. By utilizing Waymo's Realism Meta-metric (RMM) as the reward and transforming it into a low-variance, dense per-rollout reward via a Leave-One-Out modification (MLOO), the method achieves SOTA realism on WOMD. Furthermore, the ability to "controllably generate specific scenarios" is distilled using goal-conditioning and Hindsight Experience Replay (HER).
- RPGFusion: 4D Radar Prior-Guided Multi-Modal Fusion for 3D Detection
-
RPGFusion injects physical priors from 4D radar (confidence and depth maps) into the image-to-BEV transformation process. Simultaneously, it performs robust encoding and densification of sparse, noisy radar point clouds, followed by spatial alignment and semantic fusion to obtain a consistent Bird's Eye View (BEV) representation. This approach achieves SOTA results for radar-camera 3D detection on VoD (69.31% mAP in Entire Annotated Area) and TJ4DRadSet.
- SABER: Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors
-
This paper proposes SABER, the first non-intrusive, 3D-consistent universal adversarial object generation framework for BEV 3D detectors. By placing optimized 3D meshes in the scene to interfere with multi-view and multi-frame detection, it reveals the over-reliance of BEV models on environmental context priors.
- Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
-
The MOSAIC framework is proposed, which achieves efficient data selection for end-to-end autonomous driving models by clustering data, fitting scaling laws for each domain relative to evaluation metrics, and iteratively selecting data cluster samples with the maximum marginal gain. This method reaches or exceeds baseline performance with 80% less data.
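The "fit a scaling law per cluster, then pick the largest marginal gain" loop can be sketched as below. It assumes each cluster's metric gain follows a power law \(a \cdot n^{b}\) with already-fitted coefficients and ignores cluster size limits; the functional form, coefficients, and names are hypothetical, not taken from MOSAIC.

```python
def greedy_allocate(laws, budget, batch):
    """Greedily assign sample batches to clusters by predicted marginal gain.

    laws: dict mapping cluster name -> (a, b) for an assumed gain curve a * n**b.
    budget: total number of samples to select.
    batch: number of samples added per greedy step.
    """
    counts = {name: 0 for name in laws}

    def gain(name, n):
        a, b = laws[name]
        return a * n ** b

    for _ in range(budget // batch):
        # Pick the cluster whose next batch is predicted to help the metric most.
        best = max(laws, key=lambda k: gain(k, counts[k] + batch) - gain(k, counts[k]))
        counts[best] += batch
    return counts

fitted = {"highway": (1.0, 0.5), "rare": (3.0, 0.5)}  # hypothetical fitted laws
allocation = greedy_allocate(fitted, budget=400, batch=100)
```

Because the assumed curves are concave, a cluster's marginal gain shrinks as it receives more samples, so the budget spreads across clusters instead of going entirely to the one with the largest coefficient.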
- SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving
-
SearchAD constructs the first large-scale rare image retrieval dataset for autonomous driving, containing 420k+ frames, 510k+ bounding boxes, and 90 rare categories. It supports text-to-image and image-to-image retrieval and reveals the deficiencies of current multimodal retrieval models in rare object retrieval through comprehensive evaluation.
- Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
-
Addressing the lack of long-tail data in autonomous driving, this paper utilizes 4DGS to inverse-render real AV logs into "dashcam-style" videos to self-generate paired data. A conditional diffusion model is then trained to convert monocular dashcam videos into the complete multi-view camera + LiDAR sensor suite of a target vehicle. It achieves an FID of 6.47 and reduces the Chamfer distance by 13.4% compared to X-Drive, enabling the "translation" of long-tail accident/night videos from the internet into usable multi-modal AV logs.
- SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving
-
SGDrive explicitly injects a hierarchical world knowledge set of "scene geometry-key agents-short-term goals" into a Vision-Language Model (VLM). It uses a set of trainable <world> queries to predict current and future world states, then translates this knowledge into trajectories via a DiT diffusion planner, achieving SOTA on the NAVSIM camera-only track (PDMS 87.4, 91.1 after RL).
- SG-NLF: Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis
-
SG-NLF proposes a LiDAR NeRF framework that does not require precise poses. It solves geometric holes caused by sparse LiDAR data through a hybrid spectral-geometric representation, achieves global pose optimization via a confidence-aware graph, and strengthens cross-frame consistency using adversarial learning. Reconstruction quality and pose accuracy improve by 35.8% and 68.8% respectively compared to SOTA on nuScenes.
- SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting
-
This paper proposes SHARP, a motion forecasting framework based on short-window streaming inference. It explicitly maintains and updates agent latent representations across time steps via an instance-aware context stream module. Combined with a dual-objective training strategy, it achieves SOTA on the Argoverse 2 multi-agent benchmark for streaming inference while maintaining extremely low latency.
- ShelfOcc: Native 3D Supervision beyond LiDAR for Vision-Based Occupancy Estimation
-
ShelfOcc departs from using 2D rendering losses to supervise occupancy networks. Instead, it utilizes geometric foundation models (MapAnything) and semantic segmentation foundation models (GroundedSAM) to generate metric-consistent 3D semantic voxel pseudo-labels from pure multi-view video as "native 3D supervision." This achieves up to a 34% relative improvement in weakly/shelf-supervised occupancy estimation on Occ3D-nuScenes without any reliance on LiDAR.
- SimScale: Learning to Drive via Real-World Simulation at Scale
-
The authors propose SimScale, a framework that generates large-scale, high-fidelity simulation data by applying trajectory perturbation to existing driving logs, followed by reactive environment simulation and neural rendering. Combined with pseudo-expert trajectory supervision and a sim-real co-training strategy, the end-to-end planner achieves significant improvements on NAVSIM v2 (+8.6 EPDMS on navhard), with performance scaling smoothly with the volume of simulated data.
- SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
-
SpaceDrive replaces the conventional practice in VLM-based end-to-end driving—generating coordinates digit-by-digit as text—with a unified 3D Position Encoding (PE). The same sine-cosine PE is superimposed on visual tokens, used to replace coordinate tokens in text, and used to encode ego-states. Finally, a regressive PE decoder outputs trajectory coordinates directly. It achieves SOTA among VLM methods in nuScenes open-loop benchmarks and a second-best score of 78.02 in Bench2Drive closed-loop testing.
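A sine-cosine encoding of a 3D position can be sketched as follows. The number of frequencies, the base, and the per-axis concatenation layout are assumptions chosen for illustration; SpaceDrive's exact encoding is not specified here.

```python
import math

def sincos_pe_3d(x, y, z, num_freqs=4, base=10000.0):
    """Encode a 3D point as concatenated sin/cos features per axis.

    Returns a list of length 3 * 2 * num_freqs.
    """
    encoding = []
    for coord in (x, y, z):
        for i in range(num_freqs):
            freq = 1.0 / (base ** (i / num_freqs))  # geometrically spaced frequencies
            encoding.append(math.sin(coord * freq))
            encoding.append(math.cos(coord * freq))
    return encoding
```

The same vector could in principle be added to a visual token, substituted for a coordinate token in text, or used to encode the ego state, which is what makes a single shared encoding attractive compared with emitting coordinates digit by digit.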
- SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model
-
This paper proposes SparseWorld-TC, a pure-attention sparse occupancy world model that bypasses VAE discretization and BEV intermediate representations. It predicts trajectory-conditioned multi-frame future occupancy end-to-end, directly from raw image features, significantly outperforming existing methods on nuScenes.
- Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion
-
VoxSAMNet is a monocular semantic scene completion framework that explicitly models voxel sparsity and semantic imbalance. It utilizes a Dummy Shortcut to skip empty voxels and combines Foreground Dropout with a Text-Guided Image Filter to mitigate long-tail overfitting, achieving a SOTA mIoU of 18.19% on SemanticKITTI (outperforming existing monocular and stereo methods).
- Spe-BEVHead: Rethinking the Detection Head Design for Bird's-Eye-View Object Detection
-
To address the issues of "geometric mismatch in Gaussian kernels / performance collapse after removing NMS / sparse supervision signals" caused by the long-standing use of 2D center-based detection heads in autonomous driving BEV 3D detection, this paper proposes Spe-BEVHead. This plug-and-play detection head employs a Rotated Box Kernel (RBK), a Local Response Refinement Module (LRRM), and a dual-branch structure. It achieves performance gains on nuScenes by simply replacing the head and maintains competitiveness in end-to-end (NMS-free) settings.
- SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
-
SToRe3D introduces a "planning-aligned" joint sparsity framework for ViT-based multi-view 3D detectors. It utilizes a lightweight relevance head to simultaneously score 2D image tokens and 3D object queries. Low-relevance items are stored in a buffer rather than discarded and are reactivated in the final layer. This achieves up to 3× inference speedup with negligible precision loss, and near-zero loss in particular for "planning-critical agents."
- StreamVLO: Streaming Visual-LiDAR Odometry with Cumulative Drift Compensation
-
StreamVLO unifies the spatial fusion of vision and LiDAR with multi-frame temporal modeling into a Mamba-based MMG module. It utilizes a differentiable "Cumulative Drift Compensation" (CDC) to backtrack historical frames and learn residual corrections online. Without relying on mapping or loop closure, it significantly reduces long-range drift, achieving a 19%/22% reduction in \(t_{rel}/r_{rel}\) on KITTI and an 18%/16% reduction in ATE/RPE on Argoverse, with a single-frame inference latency of only 74 ms.
- STRNet: Visual Navigation with Spatio-Temporal Representation through Dynamic Graph Aggregation
-
STRNet proposes a unified spatio-temporal representation framework for visual navigation. By utilizing a graph reasoning module to model intra-frame spatial topology and combining hybrid temporal shifts with multi-resolution differential convolutions for temporal dynamics, it significantly improves the success rate of goal-conditioned navigation (a 70% increase over NoMaD).
- Structure-to-Intensity Diffusion for Adverse-Weather LiDAR Generation
-
SiD explicitly decomposes the denoising process of adverse-weather LiDAR generation into two branches at each step: "reconstruct geometric structure first, then denoise reflectance intensity conditioned on the structure." Combined with the RPWS module that synthesizes degraded data using real sensor statistics, SiD significantly reduces multiple distribution metrics for fog, rain, and snow point cloud generation compared to previous SOTA models of similar scale.
- TACO: Task-Aware Contrastive Learning for Joint LiDAR Localization and 3D Object Detection
-
TACO utilizes a single shared backbone to simultaneously perform LiDAR localization and 3D object detection. Through three contrastive learning modules, it explicitly decouples and mutually complements "static geographic features" and "dynamic object features." On the self-constructed OxfoLD dataset, it reduces localization error from a 0.95m baseline to 0.72m, while achieving detection accuracy superior to single-task models.
- TerraSeg: Self-Supervised Ground Segmentation for Any LiDAR
-
This paper proposes TerraSeg, the first self-supervised domain-agnostic LiDAR ground segmentation model. By constructing a unified large-scale OmniLiDAR dataset (12 public benchmarks, 15 sensors, nearly 22 million scans) and an innovative PseudoLabeler self-supervised pseudo-label generation module, it achieves SOTA results on nuScenes, SemanticKITTI, and Waymo without using any manual annotations.
- TT-Occ: Test-Time 3D Occupancy Prediction
-
This paper proposes TT-Occ, a pre-training-free test-time 3D occupancy prediction framework that incrementally constructs, optimizes, and voxelizes time-aware 3D Gaussians by integrating Vision Foundation Models (VFMs) at inference time. On Occ3D-nuScenes and nuCraft, it outperforms all self-supervised methods that require extensive training.
- Test-Time Training for LiDAR Semantic Segmentation under Corruption via Geometric Inlier Discrimination
-
This paper proposes GeoID, a test-time training framework for robust LiDAR semantic segmentation under corruption. By injecting "off-manifold" synthetic noise points into point clouds, the model is tasked with a self-supervised objective of distinguishing between "geometrically consistent real inliers" and "manually displaced synthetic outliers" to adapt to the target domain. Combined with Bidirectional Unreliable Point Filtering (BiUPF) to remove ambiguous regions, GeoID improves mIoU from 42.33/51.25 to 46.96/56.73 on SemanticKITTI-C / nuScenes-C, consistently outperforming existing TTA baselines.
- TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations
-
TopoHR transforms "centerline detection" and "topology reasoning" from a serial cascade into a cyclic mutual enhancement structure. By introducing a "point query + instance query" hierarchical centerline representation, it enables topology reasoning to utilize both fine-grained Point-to-Instance (P2I) and global Instance-to-Instance (I2I) relations. This achieves significant improvements on OpenLane-V2 metrics (subset_A +5.4 TOP\(_{ll}\), subset_B +7.9 TOP\(_{ll}\)).
- Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation
-
This paper proposes a Shapley value-based modality contribution assessment and an Adaptive Weight Constraint (AWC) regularization weighted by the Fisher Information Matrix (FIM) to address modality imbalance in multi-modal (RGB/LiDAR/mmWave/WiFi) 3D human pose estimation, achieving balanced optimization without additional learnable parameters.
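As a rough illustration of how a Shapley value attributes a joint score to individual modalities, the sketch below computes exact Shapley values over the four modalities named above. The per-modality strengths and the `coalition_score` function are made-up stand-ins for "model quality when only this subset of modalities is available"; the paper's actual value function is not described here.

```python
from itertools import combinations
from math import factorial

MODALITIES = ("rgb", "lidar", "mmwave", "wifi")

# Hypothetical per-modality strengths in [0, 1]; not taken from the paper.
STRENGTH = {"rgb": 0.6, "lidar": 0.5, "mmwave": 0.2, "wifi": 0.1}


def coalition_score(subset):
    """Toy value function: the score saturates as more modalities are added."""
    miss = 1.0
    for name in subset:
        miss *= 1.0 - STRENGTH[name]
    return 1.0 - miss


def shapley_values(players, value_fn):
    """Exact Shapley values: weighted average marginal gain over all coalitions."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                total += weight * (value_fn(coalition + (p,)) - value_fn(coalition))
        phi[p] = total
    return phi


phi = shapley_values(MODALITIES, coalition_score)
for name in MODALITIES:
    print(f"{name}: {phi[name]:.3f}")
```

The values sum to the score of the full modality set, so a dominant modality shows up as a disproportionately large share, which is the kind of imbalance signal a regularizer could then act on.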
- TrafficAlign: Aligning Large Language Models for Traffic Scenario Generation
-
TrafficAlign automatically synthesizes traffic scenario descriptions from real-world driving videos, performs semantic verification and self-refinement using a Domain-Specific Language (DSL), and fine-tunes (aligns) an LLM with this data. This enables the LLM to generate scenarios reflecting the actual traffic distribution of specific geographical regions. The generated scenarios induce 10.8% more collisions than the previous SOTA across three autonomous driving models, and fine-tuning these models on the generated scenarios reduces collision rates by 36.1%.
- U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences
-
This paper proposes U4D, the first uncertainty-aware 4D LiDAR world modeling framework. It adopts a "hard-to-easy" two-stage diffusion generation strategy, first reconstructing high-uncertainty regions and then conditionally completing the entire scene. A MoST module adaptively fuses spatio-temporal features to ensure temporal consistency.
- Unifying Language-Action Understanding and Generation for Autonomous Driving
-
LinkVLA integrates language instructions and driving trajectories into a unified discrete vocabulary and enforces language-action alignment through an "action understanding" task (inferring instructions from trajectories). It replaces point-by-point autoregression with a two-step coarse-to-fine decoding, achieving a driving score of 91.01 on the CARLA closed-loop benchmark while reducing inference latency from 361ms to 48ms (saving 86%).
- Unleashing VLA Potentials in Autonomous Driving via Explicit Learning from Failures
-
ELF-VLA enables autonomous driving VLA models to overcome performance plateaus during Reinforcement Learning (RL). When sparse rewards in long-tail scenarios fail to provide guidance (where all rollouts receive zero scores), a teacher VLM generates a three-layer structured failure diagnosis ("planning/reasoning/execution"). This guides the student to resample high-score corrected trajectories, which are reinjected into the GRPO training batch. This approach breaks the performance bottleneck, achieving a new SOTA PDMS of 91.0 on NAVSIM.
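Why an all-zero rollout group stalls training can be seen from the group-relative advantage that GRPO-style methods use. The sketch below is illustrative only: the normalization by population standard deviation, the `eps` constant, the reward values, and the choice to swap one failed rollout for a corrected one are assumptions, not details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Long-tail scenario where every rollout fails: all advantages are zero,
# so the policy gradient carries no learning signal for this group.
failed_group = [0.0, 0.0, 0.0, 0.0]
print(group_relative_advantages(failed_group))

# Hypothetical group after one corrected, high-score trajectory is injected:
# the corrected rollout gets a positive advantage, the failures a negative one.
mixed_group = [0.0, 0.0, 0.0, 0.9]
print(group_relative_advantages(mixed_group))
```

With the corrected trajectory in the group, rewards are no longer identical, so the normalized advantages become informative again.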
- Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
-
Unposed-to-3D reconstructs "simulation-ready" 3D vehicles from real-world driving images using pure image supervision (without any 3D ground truth or camera pose annotations). By utilizing a camera prediction head to estimate poses and backpropagating image reconstruction losses through differentiable rendering to the geometry—combined with scale prediction and lighting harmonization modules—the reconstructed vehicles can be directly inserted into driving scenes with correct orientation, physical scale, and consistent lighting. This improves downstream 3D detection AP by approximately 1 point.
- Unsupervised Multi-agent and Single-agent Perception from Cooperative Views
-
UMS uses the cooperative perspective provided by Vehicle-to-Vehicle (V2V) communication to develop a pseudo-label refinement framework (PPF Filtering + PPS Stabilizing + CCL Cross-view Consistency) that requires no manual annotation. Building on the observations that "dense multi-vehicle point clouds make classification easier" and "cooperative views can supervise single-vehicle detection," it is the first to train both multi-agent and single-agent 3D detectors, and it significantly surpasses existing unsupervised methods.
- V2U4Real: A Real-world Large-scale Dataset for Vehicle-to-UAV Cooperative Perception
-
V2U4Real is the first real-world, large-scale, multi-modal dataset for Vehicle-to-UAV (V2U) cooperative perception. Collected by a ground vehicle and a UAV equipped with multi-beam LiDAR and RGB cameras, it provides 56k LiDAR frames, 56k images, and 700k manual 3D bounding box annotations. Benchmarks for single-agent/cooperative 3D detection and tracking demonstrate that the bird's-eye view (BEV) from UAVs significantly enhances perception robustness in long-range and occluded scenarios.
- VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving
-
The authors propose the VGGDrive framework, which empowers VLMs with cross-view geometric awareness via a frozen 3D vision foundation model (VGGT). By designing a plug-and-play CVGE module, 3D features are hierarchically and adaptively injected into the 2D visual embeddings of each VLM layer, achieving significant performance gains across five autonomous driving benchmarks.
- VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation
-
VIRD constructs view-invariant representations through a dual-axis transformation (polar transformation + context-enhanced positional attention). It achieves SOTA cross-view pose estimation without orientation priors, reducing position and orientation errors on KITTI by 50.7% and 76.5%, respectively.
- W2W: Language-Model-Based Trajectory Prediction with Reinforcement Learning
-
W2W reformulates pedestrian trajectory prediction as a "parsable language generation" task. Multi-pedestrian coordinates and interaction relationships (companion/following/obstacle) are translated into fixed-format text prompts. T5-Small undergoes full-parameter SFT to learn the output format, followed by reinforcement learning alignment using PPO+LoRA with an "ADE error + boundary penalty" reward. This achieves ADE/FDE comparable to recent LM-based and deep learning baselines on ETH/UCY and SDD while maintaining the interpretability of language models.
- WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation
-
This paper proposes WalkGPT, the first pixel-grounded large vision-language model for pedestrian accessibility navigation. It unifies conversational reasoning, segmentation masks, and depth estimation in a single architecture and is accompanied by the 41k-scale PAVE dataset.
- WAM-Flow: Parallel Coarse-to-Fine Motion Planning via Discrete Flow Matching for Autonomous Driving
-
WAM-Flow reformulates trajectory planning for end-to-end autonomous driving as "Discrete Flow Matching (DFM) in a discrete token space." By replacing autoregressive token-by-token decoding with fully parallel, bidirectional denoising, it achieves "coarse-to-fine" planning with an adjustable number of steps—obtaining 89.1 PDMS with a single denoising step (approx. 4.67× faster than autoregressive baselines) and refining to 90.3 PDMS with 5 steps, outperforming autoregressive and diffusion-based VLA baselines on NAVSIM-v1.
- WhisperNet: A Scalable Solution for Bandwidth-Efficient Collaboration
-
WhisperNet flips the collaborative perception communication strategy from "senders choosing spatial regions" to "receiver-centric global scheduling." Based on lightweight metadata reported by all parties, the receiver simultaneously determines "where (spatial)" and "what (channels)" to transmit, improving detection AP by 2.4% while using only 0.5% of the bandwidth on OPV2V.
- WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-tail Scenarios
-
Waymo extracted 4,021 long-tail driving segments (approx. 12 hours) with an occurrence frequency below 0.03% from 6.4 million miles of real-world road tests to create the WOD-E2E dataset. It proposes the RFS (Rater Feedback Score), an open-loop metric based on human expert preference scores, to replace ADE (which only measures distance error against a single future trajectory). This allows for a fair evaluation of vision-based end-to-end models in safety-critical scenarios where "multiple reasonable trajectories coexist."
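The limitation of ADE noted above can be shown with a toy example. The coordinates below are invented for illustration (x is forward distance, y is lateral offset, both in meters); the scenario assumes an obstacle on the lane center that the logged driver passed on the left. RFS itself is not implemented here, since its scoring details are not given in this summary.

```python
from math import hypot


def ade(pred, gt):
    """Average displacement error: mean Euclidean distance per waypoint."""
    return sum(hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)) / len(gt)


# Hypothetical logged trajectory: the human driver swerves left around the obstacle.
logged = [(5.0, 1.0), (10.0, 2.0), (15.0, 1.0), (20.0, 0.0)]

# Equally reasonable alternative: pass the obstacle on the right.
pass_right = [(5.0, -1.0), (10.0, -2.0), (15.0, -1.0), (20.0, 0.0)]

# Unsafe plan: drive straight through the obstacle's position.
straight = [(5.0, 0.0), (10.0, 0.0), (15.0, 0.0), (20.0, 0.0)]

print(ade(pass_right, logged))  # 2.0
print(ade(straight, logged))    # 1.0
```

In this toy case ADE ranks the unsafe straight-line plan above the safe right-hand pass, simply because it stays closer to the single logged trajectory. That is the kind of failure a preference-based metric is meant to avoid.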
- WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World
-
WorldLens proposes a full-spectrum evaluation benchmark for driving world models covering five dimensions—"generation, reconstruction, action-following, downstream tasks, and human preference"—with a total of 24 fine-grained metrics. Along with the WorldLens-26K human-annotated dataset and the distilled interpretable auto-evaluator WorldLens-Agent, it systematically reveals that current world models "look real but behave unreal"; no single model leads across all dimensions simultaneously.
- x2-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space
-
This paper proposes x2-Fusion, which constructs a unified Event Edge Space anchored by the spatio-temporal edge signals of event cameras. By aligning Image/LiDAR/Event features into a homogeneous edge space, the method performs reliability-aware adaptive fusion and cross-dimensional contrastive learning to simultaneously estimate 2D optical flow and 3D scene flow, achieving SOTA results on both synthetic and real-world datasets.