🚗 Autonomous Driving

📷 CVPR2026 · 105 paper notes

A Prediction-as-Perception Framework for 3D Object Detection: Inspired by the brain's predictive perception mechanism, this paper proposes the PAP framework, which injects trajectory prediction outputs from previous frames as queries into the current frame's perception module, achieving a 10% improvement in tracking accuracy and a 15% speedup in inference on UniAD.
A Prediction-as-Perception Framework for 3D Object Detection: Inspired by the human cognitive pattern of "anticipating target locations before focusing attention," this work converts trajectory predictions from the previous frame into detection queries for the current frame, forming an iterative prediction-perception closed loop. Applied to UniAD, the framework achieves simultaneous improvements of +10% in tracking accuracy and +15% in inference speed.
AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception: This paper proposes AdaRadar, an online adaptive radar data compression framework based on DCT spectral pruning and zeroth-order surrogate gradients. It achieves over 100× compression with only about 1 percentage point of degradation in detection and segmentation performance, alleviating the bandwidth bottleneck between radar sensors and computing units.
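
As a rough illustration of the DCT spectral pruning idea in the AdaRadar note above, the following minimal sketch transforms a 1D signal with an orthonormal DCT-II, keeps only the largest-magnitude coefficients, and reconstructs the signal. Everything here is an assumption made for illustration: the function names, the magnitude-based selection rule, and the fixed `keep` budget are hypothetical, and the online rate adaptation and zeroth-order surrogate gradients described in the note are not modeled.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1D sequence (O(n^2), for illustration only)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def prune_spectrum(x, keep):
    """Keep the `keep` largest-magnitude DCT coefficients and zero the rest.

    Assumption: magnitude-based selection with a fixed budget; the note does
    not specify the pruning rule.
    """
    coeffs = dct(x)
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(order[:keep])
    return [c if k in kept else 0.0 for k, c in enumerate(coeffs)]


# Toy signal built from exactly two DCT basis functions (k = 3 and k = 10),
# so two coefficients are enough to reconstruct it.
n = 64
signal = [
    math.cos(math.pi * (i + 0.5) * 3 / n) + 0.5 * math.cos(math.pi * (i + 0.5) * 10 / n)
    for i in range(n)
]
sparse = prune_spectrum(signal, keep=2)
recon = idct(sparse)
max_err = max(abs(a - b) for a, b in zip(signal, recon))
nonzero = sum(1 for c in sparse if c != 0.0)
print(f"kept {nonzero} of {n} coefficients, max reconstruction error {max_err:.2e}")
```

Real radar tensors are not this sparse, so in practice the number of retained coefficients trades bandwidth against downstream accuracy; that trade-off is what a rate-adaptive scheme would tune.
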
An Instance-Centric Panoptic Occupancy Prediction Benchmark for Autonomous Driving: This paper proposes ADMesh (a library of 15K+ high-quality 3D models) and CarlaOcc (a panoptic occupancy dataset with 100K frames at 0.05m resolution), providing for the first time instance-level annotations and physically consistent ground truth for 3D panoptic occupancy prediction in autonomous driving, along with occupancy quality evaluation metrics and a systematic benchmark.
BEV-SLD: Self-Supervised Scene Landmark Detection for Global Localization with LiDAR Bird's-Eye View Images: This paper proposes BEV-SLD, a self-supervised scene landmark detection (SLD)-based method for LiDAR global localization. By decoupling detection from correspondence prediction, the approach achieves high-accuracy \((x, y, \text{azimuth})\) pose estimation across diverse environments using only 20 MB of storage.
BuildAnyPoint: 3D Building Structured Abstraction from Diverse Point Clouds: This paper proposes BuildAnyPoint, which employs a loosely-coupled cascaded diffusion Transformer (Loca-DiT) to achieve unified reconstruction from diverse point cloud distributions (airborne LiDAR, SfM, sparse noisy point clouds) to structured 3D building meshes — first recovering the underlying point cloud distribution via hierarchical latent diffusion, then generating compact polygonal meshes via an autoregressive Transformer.
C2T: LLM-Aligned Common-Sense Reward Learning for Traffic-Vehicle Coordination: This paper proposes the C2T framework, which converts traffic states into structured captions, leverages LLMs for offline preference judgments, and distills these judgments into an intrinsic reward function. This approach replaces hand-crafted rewards for traffic signal control (TSC) and achieves improvements in efficiency, safety, and energy consumption across multiple real-world urban networks on the CityFlow benchmark.
CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention: CausalVAD parameterizes Pearl's backdoor adjustment as a plug-and-play module (SCIS) that performs multi-level causal intervention across the perception–prediction–planning pipeline of the VAD architecture, eliminating spurious correlations for safer, more robust end-to-end autonomous driving.
CCF: Complementary Collaborative Fusion for Domain Generalized Multi-Modal 3D Object Detection: To address the modality imbalance problem in dual-branch multi-modal 3D detectors under domain shift, this paper proposes the CCF framework, which systematically improves camera query utilization and cross-domain robustness through three components: Query Decoupled Loss, LiDAR-Guided Depth Prior, and Complementary Cross-Modal Masking.
ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data: This paper proposes ClimaDrive, a data generation framework, and ClimaOoD, a benchmark dataset. By combining semantically guided multi-weather scene generation with perspective-aware anomaly object placement, the framework constructs a 10K+ training set covering 6 weather conditions × 93 anomaly categories. Training on this dataset yields an average AP improvement of 3.25% across four state-of-the-art methods.
CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection: CoIn3D explicitly models the spatial prior discrepancies arising from camera intrinsics, extrinsics, and array layouts through two modules, Spatial-aware Feature Modulation (SFM) and Camera-aware Data Augmentation (CDA), so that multi-camera 3D detection models trained on source configurations generalize to unseen target configurations. The framework is plug-and-play compatible with three mainstream paradigms: BEVDepth, BEVFormer, and PETR.
ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving: ColaVLA proposes a unified vision-language-action (VLA) framework that transfers VLM reasoning from textual chain-of-thought to latent space. Through a Cognitive Latent Reasoner and a Hierarchical Parallel Planner, the framework completes scene understanding and trajectory decoding with only two VLM forward passes, achieving state-of-the-art performance on both nuScenes open-loop and closed-loop benchmarks.
CoLC: Communication-Efficient Collaborative Perception with LiDAR Completion: CoLC proposes a communication-efficient early fusion framework for collaborative perception. It reduces transmission volume via Foreground-Aware Point Sampling (FAPS), reconstructs dense pillar representations on the ego side through VQ-based LiDAR completion (CEEF), and ensures semantic and geometric consistency via Dense-Guided Dual Alignment (DGDA). The framework achieves detection performance on par with or superior to full early fusion while significantly reducing communication bandwidth.
Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation: This paper proposes CompoSIA, a compositional driving video simulator that injects three control factors — scene structure, object identity, and ego-vehicle action — through independent pathways into a Flow Matching DiT. It supports both individual and compositional editing, enabling systematic adversarial scenario synthesis. CompoSIA achieves a 17% FVD improvement on identity editing, 30%/47% reduction in rotation/translation error for action control, and an average 173% increase in collision rate for downstream planners.
CompoSIA: Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation: This paper proposes CompoSIA, a framework that achieves composable adversarial driving scene generation via disentangled control over three factors — Structure, Identity, and Action — built upon a video diffusion model. The approach reduces FVD for identity editing by 17% and increases the collision rate of downstream planners by 173%, effectively exposing hidden failure modes in autonomous driving systems.
CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird's-Eye-View Semantic Segmentation: This paper proposes CycleBEV, a training-time regularization framework that introduces an Inverse View Transformation (IVT) network to map BEV segmentation maps back to perspective-view (PV) segmentation maps. The framework enhances existing BEV semantic segmentation models via a cycle consistency loss, a height-aware geometric regularization objective, and a cross-view latent space alignment objective, with zero additional inference overhead.
Den-TP: A Density-Balanced Data Curation and Evaluation Framework for Trajectory Prediction: Den-TP is a data-centric framework that addresses the long-tail density imbalance in trajectory prediction datasets through density-aware data curation and evaluation protocols. Using only 50% of the training data, it maintains overall performance while significantly improving robustness in high-density scenarios.
DLWM: Dual Latent World Models enable Holistic Gaussian-centric Pre-training in Autonomous Driving: This paper proposes DLWM, a two-stage Gaussian-centric self-supervised pre-training paradigm. Stage 1 learns 3D Gaussian representations by reconstructing depth and semantic maps. Stage 2 trains dual latent world models — a Gaussian-flow-guided temporal prediction model (for occupancy perception/prediction) and an ego-planning-guided temporal prediction model (for motion planning) — achieving significant performance gains across all three core tasks.
Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving: This paper proposes DMW (Drive My Way), a personalized VLA driving framework that learns long-term driving habits via user embeddings and adapts to short-term preferences through natural language instructions. Personalized driving behavior is generated using GRPO-based reinforcement fine-tuning and style-aware rewards.
DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance: This paper introduces the first 360° panoramic driver attention dataset (~1M frames / 19 drivers) and proposes DriverGaze360-Net, which jointly learns attention maps and attended objects via an auxiliary semantic segmentation head, achieving state-of-the-art attention prediction performance on panoramic driving images.
Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving: Dr.Occ proposes a unified 3D occupancy prediction framework with depth guidance and region guidance. It employs D2-VFormer to leverage high-quality depth priors from MoGe-2 for accurate 2D→3D geometric mapping, and R/R2-EFormer to adaptively assign region-specific experts inspired by MoE/MoR for handling spatial semantic anisotropy, achieving a +7.43% mIoU improvement over the BEVDet4D baseline.
Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving: This paper proposes Dr.Occ, a unified camera-only 3D occupancy prediction framework. It introduces a Depth-guided Dual-projection View Former (D2-VFormer) that leverages high-quality depth priors from MoGe-2 for accurate geometric alignment, and a Region-guided Expert Transformer (R-EFormer / R2-EFormer) that adaptively assigns spatial region experts to address semantic imbalance. Dr.Occ improves the BEVDet4D baseline by 7.43% mIoU on Occ3D-nuScenes.
Efficient Equivariant Transformer for Self-Driving Agent Modeling: This paper proposes DriveGATr, an equivariant Transformer architecture based on 2D Projective Geometric Algebra (PGA) that achieves SE(2)-equivariance without explicit pairwise relative position encoding (RPE), attaining state-of-the-art performance on traffic simulation tasks while substantially reducing computational cost.
EMDUL: Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets: This paper proposes EMDUL, a pipeline that expands mmWave HPE datasets in scale and diversity by (1) annotating unlabeled mmWave data via pseudo-labels with a novel unsupervised temporal consistency loss (UTCL), and (2) converting LiDAR datasets to mmWave point clouds through a closed-form converter with flow-based point filtering (FPF). The approach reduces in-domain error by 15.1% and cross-domain error by 18.9%.
F3DGS: Federated 3D Gaussian Splatting for Decentralized Multi-Agent World Modeling: This paper proposes F3DGS, the first method to apply a federated learning framework to 3DGS, enabling decentralized multi-agent 3D reconstruction through frozen geometry and visibility-aware aggregation without sharing raw data.
Failure Modes for Deep Learning-Based Online Mapping: How to Measure and Address Them: This paper systematically defines and quantifies two failure modes of deep learning-based online mapping models — localization overfitting and map geometry overfitting — proposes a Fréchet distance-based performance metric and a minimum spanning tree (MST)-based training set sparsification strategy, and validates on nuScenes and Argoverse 2 that geometrically diverse and balanced training sets improve model generalization.
FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts: FedBPrompt introduces learnable visual prompts partitioned into body part alignment prompts (with constrained local attention to handle viewpoint misalignment) and holistic full-body prompts (to suppress background interference), coupled with a prompt-only federated fine-tuning strategy that transmits only prompt parameters (~0.46M vs. ~86M for the full model), achieving consistent improvements on FedDG-ReID benchmarks.
FedBPrompt: Federated Domain Generalization Person Re-Identification via Body Distribution Aware Visual Prompts: This paper proposes FedBPrompt, a framework that introduces a Body Distribution Aware Visual Prompts Mechanism (BAPM) dividing prompts into Body Part Alignment Prompts and Holistic Full Body Prompts, paired with a Prompt-based Fine-Tuning Strategy (PFTS) that freezes the ViT backbone and trains only lightweight prompt parameters (reducing communication to ~1%), achieving average mAP gains of 3.3% and Rank-1 gains of 4.9% on FedDG-ReID benchmarks.
FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision: FlashCap is proposed as the first motion capture system combining flashing LEDs with event cameras, where each LED is assigned a unique flashing frequency for identity recognition. The system enables the construction of FlashMotion, the first human motion dataset with 1000Hz annotation precision (7.15 million frames), and introduces the ResPose baseline, reducing motion timing error from ~50ms to ~5ms and lowering pose estimation MPJPE by approximately 40%.
FoSS: Modeling Long-Range Dependencies and Multimodal Uncertainty in Trajectory Prediction via Fourier–State Space Integration: FoSS proposes a frequency-domain–time-domain dual-branch framework that organizes Fourier spectra via progressive spiral reordering (HelixSort) before feeding them into a selective state space model (SSM), and combines a temporal dynamic SSM with cross-attention fusion to achieve state-of-the-art trajectory prediction on Argoverse 1/2 while reducing parameter count by over 40% and inference latency by 22%.
Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction: GPOcc proposes leveraging generalizable visual geometry priors (e.g., VGGT, DepthAnything) for monocular 3D occupancy prediction. Surface points predicted by these priors are extended inward along camera rays to generate volumetric samples, which serve as centers of sparse Gaussian primitives for probabilistic occupancy inference. A training-free incremental update strategy handles streaming input. On Occ-ScanNet, GPOcc surpasses the previous SOTA by +9.99 mIoU (monocular) and +11.79 mIoU (streaming), while running 2.65× faster under the same depth prior.
Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal: Ghost-FWL introduces the first large-scale mobile full-waveform LiDAR dataset (24K frames, 7.5 billion peak-level annotations) and proposes FWL-MAE, a self-supervised pretraining framework for ghost detection and removal, reducing SLAM trajectory error by over 66% and cutting 3D detection false positive rates by 50×.
HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation: To address the severe scarcity of adverse-weather samples in lane detection datasets (CULane/TuSimple), this paper proposes HG-Lane — a two-stage diffusion-based generation framework requiring no re-annotation. Stage-I employs Control Information Fusion and Structure-aware Reverse Diffusion to preserve lane geometry, while Stage-II applies Appearance-aware Refinement to adjust illumination style. The framework generates 30K images across snow/rain/fog/night/dusk conditions. CLRNet achieves an overall mF1 improvement of +20.87%, with +38.8% in snow scenarios.
HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles: HorizonForge proposes a unified framework that reconstructs driving scenes as editable Gaussian Splats combined with Mesh representations, enabling fine-grained 3D manipulation via trajectory control and language-driven vehicle insertion. A video diffusion model then renders spatiotemporally consistent, high-quality driving videos. The method achieves a user preference rate of 91.02%, decisively outperforming all baselines.
IGASA: Integrated Geometry-Aware and Skip-Attention Modules for Enhanced Point Cloud Registration: This paper proposes the IGASA framework, which employs a three-stage pipeline consisting of a Hierarchical Pyramid Architecture (HPA), Hierarchical Cross-Layer Attention (HCLA), and Iterative Geometry-Aware Refinement (IGAR) to bridge the semantic gap across multi-scale features and dynamically suppress outliers, achieving state-of-the-art performance on four benchmarks: 3DMatch, 3DLoMatch, KITTI, and nuScenes.
IGASA: Integrated Geometry-Aware and Skip-Attention Modules for Enhanced Point Cloud Registration: IGASA is a point cloud registration framework that combines a Hierarchical Pyramid Architecture (HPA), Hierarchical Cross-Layer Attention (HCLA) with skip-attention fusion, and Iterative Geometry-Aware Refinement (IGAR) with dynamic consistency weighting. It achieves 94.6% Registration Recall (RR) on 3DMatch (SOTA) and 100% RR on KITTI, with a total inference time of 2.763 s.
InCaRPose: In-Cabin Relative Camera Pose Estimation Model and Dataset: This paper presents InCaRPose, an in-cabin relative camera pose estimation model built upon a frozen ViT backbone and a Transformer decoder. Trained exclusively on synthetic data, it generalizes to real in-cabin environments, achieving metric-scale translation prediction and real-time inference (>45 FPS). The authors also release an accompanying real-world, high-distortion in-cabin test dataset, In-Cabin-Pose.
KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System: KnowVal is an end-to-end autonomous driving system that addresses two fundamental deficiencies (knowledge reasoning and value alignment) through three core components: (1) Retrieval-guided Open-world Perception, which integrates standard 3D detection, VL-SAMv2-based long-tail object recognition, and VLM-based scene understanding; (2) Perception-guided Knowledge Retrieval, which queries a driving knowledge graph covering traffic regulations, defensive driving, and ethical norms; and (3) a World Model for future state prediction combined with a human-preference-trained Value Model for trajectory evaluation. The system achieves the lowest collision rate on nuScenes and state-of-the-art performance on Bench2Drive and NAVSIM.
Le MuMo JEPA: Multi-Modal Self-Supervised Representation Learning with Learnable Fusion Tokens: This work extends the LeJEPA self-supervised framework to a multi-modal setting by introducing learnable fusion tokens as a Perceiver-style latent bottleneck within a shared Transformer, enabling efficient fusion of RGB with companion modalities (LiDAR depth / thermal infrared). A pruning strategy reduces attention overhead by approximately 9×. On Waymo, CenterNet 3D detection mAP XY reaches 23.6 (+4.3 over RGB-only LeJEPA) and Depth MAE improves from 4.704 to 2.860.
LEADER: Learning Reliable Local-to-Global Correspondences for LiDAR Relocalization: LEADER achieves 24.1% and 73.9% relative reductions in position error on LiDAR relocalization benchmarks via a robust projection-based geometric encoder (yaw-invariant) and a truncated relative reliability loss (suppressing unreliable points).
Learnability-Driven Submodular Optimization for Active Roadside 3D Detection: This paper proposes LH3D, an active learning framework that employs a three-stage hierarchical submodular optimization strategy—depth confidence → semantic balancing → geometric diversity—to suppress the selection of inherently ambiguous samples in roadside monocular 3D detection. With only 20% of the annotation budget, LH3D significantly outperforms conventional uncertainty- and diversity-based AL methods.
Learning Geometric and Photometric Features from Panoramic LiDAR Scans for Outdoor Place Categorization: This paper uses panoramic depth and reflectance images derived from 3D LiDAR point clouds as CNN inputs, constructs a large-scale outdoor scene categorization dataset (MPO), and proposes two architectural improvements—Horizontal Circular Convolution (HCC) and Row-Wise Max Pooling (RWMP)—to achieve high-accuracy classification (up to 97.87%) across six outdoor scene categories, substantially outperforming traditional handcrafted feature methods.
Learning Geometric and Photometric Features from Panoramic LiDAR Scans for Outdoor Place Categorization: This paper proposes a method for outdoor scene categorization using LiDAR panoramic depth maps and reflectance maps as CNN inputs. The authors construct the large-scale MPO outdoor 3D dataset (6 scene categories, 34,200 frames), and address the ring topology of panoramic images via Horizontal Circular Convolution (HCC) and Row-Wise Max Pooling (RWMP). The proposed multimodal fusion approach achieves 97.47% classification accuracy.
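
The two operators named in the notes above, Horizontal Circular Convolution (HCC) and Row-Wise Max Pooling (RWMP), can be illustrated with a small sketch. It assumes that HCC means wrapping the left and right image borders around the 360° seam (with zero padding vertically) and that RWMP takes the maximum over each row; the function names and these details are assumptions for illustration, not the authors' implementation.

```python
def circular_conv2d(image, kernel):
    """Same-size 2D cross-correlation: wrap horizontally, zero-pad vertically."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                rr = r + i - ph
                if rr < 0 or rr >= h:
                    continue  # zero padding above and below the panorama
                for j in range(kw):
                    cc = (c + j - pw) % w  # wrap around the 360-degree seam
                    acc += kernel[i][j] * image[rr][cc]
            out[r][c] = acc
    return out


def row_wise_max_pool(feature):
    """Collapse the horizontal axis by taking the maximum of each row."""
    return [max(row) for row in feature]


def roll_columns(image, shift):
    """Rotate every row to the right by `shift` columns (a sensor yaw change)."""
    return [row[-shift:] + row[:-shift] for row in image]


# Toy 3x6 "panoramic" image and a 3x3 kernel (values are arbitrary).
image = [
    [1, 2, 0, 4, 5, 3],
    [0, 1, 7, 2, 1, 0],
    [3, 0, 2, 1, 6, 2],
]
kernel = [
    [1, 0, -1],
    [2, 0, -2],
    [1, 0, -1],
]

features = circular_conv2d(image, kernel)
rolled_features = circular_conv2d(roll_columns(image, 2), kernel)

# Rotating the input only rotates the feature map, so the pooled descriptor
# does not depend on which direction the sensor was facing.
print(rolled_features == roll_columns(features, 2))
print(row_wise_max_pool(rolled_features) == row_wise_max_pool(features))
```

With ordinary zero padding on the left and right borders, neither equality would hold in general, because objects crossing the seam would be cut in two.
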
Learning Mutual View Information Graph for Adaptive Adversarial Collaborative Perception: This paper proposes MVIG, an adversarial attack framework that unifies the vulnerability modeling of diverse defense-equipped collaborative perception systems into a Mutual View Information Graph (MVIG). By combining temporal graph learning with entropy-aware vulnerability search, MVIG enables adaptive fabrication attacks that reduce the defense success rate by up to 62%.
Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos: This paper proposes LFG (Learning to drive is a Free Gift), a fully label-free, teacher-guided pretraining framework for autonomous driving. LFG learns a unified pseudo-4D representation of geometry, semantics, and motion from large-scale unposed YouTube driving videos. On the NAVSIM benchmark, using only a monocular front-facing camera, LFG surpasses multi-camera + LiDAR BEV methods (PDMS 85.2), and demonstrates strong data efficiency (81.4 PDMS with only 10% labels).
LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration: This paper proposes LiREC-Net, the first unified framework for simultaneously performing target-free extrinsic calibration between LiDAR–RGB and LiDAR–Event camera pairs. Through a shared LiDAR representation that fuses 3D point features with projected depth features, and pairwise cost volumes for cross-modal alignment, LiREC-Net achieves calibration accuracies of 1.80 cm/0.11° on KITTI, and 2.51 cm/0.14° (LiDAR–RGB) and 1.18 cm/0.07° (LiDAR–Event) on DSEC.
Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection: This work identifies that feature misalignment in LiDAR-Camera fusion is concentrated at foreground-background depth discontinuity boundaries, and proposes three synergistic modules — PGDC (2D Prior-Guided Depth Calibration), DAGF (Discontinuity-Aware Geometric Fusion), and SGDM (Structural Guidance Depth Modulator) — to proactively correct misalignment prior to fusion, achieving state-of-the-art mAP of 71.5% and NDS of 73.6% on the nuScenes validation set.
LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction: LR-SGS proposes a structure-aware Salient Gaussian representation guided by LiDAR reflectance. By calibrating LiDAR intensity into an illumination-invariant reflectance channel appended to each Gaussian, initializing structured Salient Gaussians from geometric and reflectance feature points, and enforcing RGB–reflectance cross-modal gradient consistency, the method surpasses OmniRe by 1.18 dB PSNR on complex-lighting scenes of the Waymo dataset while using fewer Gaussians and shorter training time.
LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction: This paper proposes LR-SGS, which calibrates LiDAR intensity into an illumination-invariant reflectance channel attached to 3D Gaussians, and introduces a structure-aware Salient Gaussian representation (initialized from LiDAR geometry and reflectance feature points) with improved densification control and a salient transform strategy. LR-SGS achieves higher-fidelity reconstruction than OmniRe on complex Waymo autonomous driving scenes while using fewer Gaussians and requiring less training time.
M²-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs: M²-Occ addresses real-world scenarios where camera failures cause missing views by proposing MMR (reconstructing missing view representations in feature space using adjacent camera FoV overlaps) and FMM (refining ambiguous voxel features via a learnable semantic prototype memory bank). On the SurroundOcc baseline, it achieves +4.93% IoU when the rear camera is missing, maintains 18.36% IoU under five missing cameras (versus a baseline collapse to 13.35%), and does not compromise performance under complete-view inputs.
M²-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs: To address incomplete inputs caused by camera failures in autonomous driving, M²-Occ introduces a Multi-view Masked Reconstruction (MMR) module that exploits the overlapping fields of view between adjacent cameras to recover missing view features, and a Feature Memory Module (FMM) that refines voxel representations using class-level semantic prototypes. The framework achieves a 4.93% IoU gain when the rear camera is missing, without degrading full-view performance.
MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction: This paper proposes MapGCLR, a geospatial contrastive learning strategy that enforces consistent BEV feature representations for overlapping regions across different traversals. Operating within a semi-supervised framework, the method achieves 13%–42% relative performance gains on online vectorized HD map construction using only 5%–20% of labeled data.
MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction: MapGCLR proposes a semi-supervised training scheme based on geospatial contrastive learning: it exploits the geospatial overlap between BEV feature grids produced from multiple traversals of the same location, constructing an InfoNCE contrastive loss to enforce geographic consistency in the BEV feature space. On Argoverse 2, using only 5% labeled data, it achieves 18.9 mAP (vs. 13.3 for the fully supervised baseline), a relative improvement of 42%—roughly equivalent to doubling the amount of labeled data.
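
To make the InfoNCE objective mentioned in the MapGCLR notes above concrete, here is a minimal sketch. It assumes the BEV feature grids from two traversals have already been reduced to lists of per-cell feature vectors, where index `i` in both lists refers to the same geographic cell (the positive pair) and all other cells act as negatives. The function names, the temperature value, and the one-directional form of the loss are assumptions for illustration only.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; anchors[i] and positives[i] form the positive pair."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Toy features for three overlapping BEV cells seen in two traversals.
traversal_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
traversal_b_consistent = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
traversal_b_inconsistent = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]

loss_consistent = info_nce(traversal_a, traversal_b_consistent)
loss_inconsistent = info_nce(traversal_a, traversal_b_inconsistent)
print(loss_consistent < loss_inconsistent)
```

The loss is small when features of the same location agree across traversals and large when they do not, which is the geographic consistency signal the notes describe.
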
MeanFuser: Fast One-Step Multi-Modal Trajectory Generation and Adaptive Reconstruction via MeanFlow for End-to-End Autonomous Driving: MeanFuser is an end-to-end autonomous driving framework that replaces discrete trajectory vocabulary with Gaussian mixture noise for continuous multi-modal trajectory modeling, leverages the MeanFlow Identity for error-free one-step sampling, and introduces an Adaptive Reconstruction Module (ARM) that implicitly decides between selecting an existing proposal and reconstructing a new trajectory. On NAVSIM, using only RGB input with a ResNet-34 backbone, it achieves 89.0 PDMS at 59 FPS.
MetaDAT: Generalizable Trajectory Prediction via Meta Pre-training and Data-Adaptive Test-Time Updating: This paper proposes MetaDAT, a framework that obtains an initialization amenable to online adaptation via meta pre-training, and further employs dynamic learning rate optimization (DLO) and hard-sample-driven updates (HSD) at test time to achieve trajectory prediction adaptation under cross-dataset distribution shifts. MetaDAT consistently outperforms existing test-time training (TTT) methods across diverse cross-domain configurations on nuScenes, Lyft, and Waymo.
MetaDAT: Generalizable Trajectory Prediction via Meta Pre-training and Data-Adaptive Test-Time Updating: This paper proposes the MetaDAT framework, which obtains a model initialization amenable to online adaptation via meta pre-training, and achieves data-adaptive model adjustment at test time through dynamic learning rate optimization (DLO) and hard-sample-driven updates (HSD). MetaDAT surpasses all existing TTT methods under cross-dataset distribution shift settings across nuScenes, Lyft, and Waymo.
Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks: This paper proposes the dCAP framework, which achieves real-time 6-DoF relative pose estimation between the tractor and trailer in articulated autonomous trucks via Transformer-based cross-view and temporal attention mechanisms. The framework is integrated into BEVFormer to improve 3D object detection performance under articulated motion, achieving a translation error of 0.452 m and a rotation error of 0.042 rad.
MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving: This paper proposes MindDriver, a progressive multimodal reasoning framework that emulates the human "perception→imagination→action" cognitive process. The model first performs textual semantic understanding, then imagines future scene images (bridging semantic and physical spaces), and finally predicts trajectories. Combined with feedback-guided automatic data annotation and progressive reinforcement fine-tuning, MindDriver achieves state-of-the-art performance on both the nuScenes open-loop and Bench2Drive closed-loop benchmarks.
Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes (LegoOcc): This paper proposes LegoOcc, which leverages Language-Embedded Gaussians (LE-Gaussians) as a unified geometric-semantic intermediate representation. Combined with a Poisson-process-based Gaussian-to-Occupancy (G2O) operator and a progressive temperature decay strategy, LegoOcc achieves monocular open-vocabulary occupancy prediction for indoor scenes using only binary occupancy labels (without semantic annotations), attaining 59.50 IoU / 21.05 mIoU on Occ-ScanNet.
Neural Distribution Prior for LiDAR Out-of-Distribution Detection: NDP introduces a learnable neural distribution prior module to model the distributional structure of network predictions. Combined with Perlin-noise-based pseudo-OOD sample generation and a soft anomaly exposure strategy, NDP achieves 61.31% AP on the STU benchmark, surpassing the previous best result by more than 10×.
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning: NoRD demonstrates that autonomous driving VLAs require neither large-scale reasoning annotations nor massive datasets. By identifying the root cause of GRPO failure on weak SFT policies as difficulty bias — wherein learning signals from high-variance rollout groups are suppressed — it replaces standard GRPO with Dr. GRPO for RL post-training. Using less than 60% of the data, no reasoning annotations, and 3× fewer tokens, NoRD achieves competitive performance against reasoning-based VLAs on NAVSIM (85.6 PDMS) and WaymoE2E (7.709 RFS).
O3N: Omnidirectional Open-Vocabulary Occupancy Prediction: O3N is the first work to introduce the omnidirectional open-vocabulary occupancy prediction task and proposes a purely vision-based end-to-end framework. Polar-spiral Mamba (PsM) models panoramic geometric continuity via spiral scanning in polar coordinate space; Occupancy Cost Aggregation (OCA) constructs a voxel-text matching cost volume to avoid overfitting from direct feature alignment; Natural Modality Alignment (NMA) aligns pixel-voxel-text tri-modal embeddings through gradient-free random walk. The method achieves 16.54 mIoU / 21.16 Novel mIoU on QuadOcc (SOTA), substantially outperforming the OVO baseline.
O3N: Omnidirectional Open-Vocabulary Occupancy Prediction: O3N is the first purely vision-based, end-to-end omnidirectional open-vocabulary occupancy prediction framework. Through three core modules—Polar Spiral Mamba (PsM), Occupancy Cost Aggregation (OCA), and Natural Modality Alignment (NMA)—it achieves open-vocabulary 3D occupancy prediction under 360° panoramic image input that surpasses closed-set supervised methods.
OccAny: Generalized Unconstrained Urban 3D Occupancy: OccAny proposes the first generalized unconstrained urban 3D occupancy prediction framework, capable of predicting metric-scale occupancy voxels from monocular, sequential, or surround-view images in calibration-free, out-of-domain scenes. Through two key designs—Segmentation Forcing and Novel View Rendering—it surpasses all visual geometry baselines on KITTI and nuScenes.
OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective: OccuFly introduces the first real-world camera-based Semantic Scene Completion (SSC) benchmark from the aerial perspective, comprising 20,000+ samples across 21 semantic categories, spanning multi-season and multi-altitude urban, industrial, and rural scenes. It further reveals fundamental limitations of current visual foundation models in aerial settings.
On the Feasibility and Opportunity of Autoregressive 3D Object Detection: This paper proposes AutoReg3D, the first framework that formulates LiDAR 3D object detection as autoregressive sequence generation. By adopting a near-to-far ordering and parameter-specific vocabularies to discretize bounding boxes into token sequences, AutoReg3D achieves competitive performance against mainstream methods without anchors or NMS, while unlocking new capabilities such as RL fine-tuning and cascading refinement.
OneOcc: Semantic Occupancy Prediction for Legged Robots with a Single Panoramic Camera: This paper proposes OneOcc, a vision-only panoramic semantic occupancy prediction framework for legged/humanoid robots. Through dual-projection fusion, dual-grid voxelization, gait displacement compensation, and a hierarchical mixture-of-experts decoder, OneOcc achieves 360° semantic scene completion using only a single panoramic camera, surpassing LiDAR baselines on both real quadruped and simulated humanoid datasets.
Open-Vocabulary Domain Generalization in Urban-Scene Segmentation: This paper proposes OVDG-SS, a new problem setting that unifies unseen-domain and unseen-category challenges in semantic segmentation, and introduces S2-Corr, a state space model-based module that repairs text-image correlation degradation caused by domain shift, enabling efficient and robust cross-domain open-vocabulary segmentation in autonomous driving scenarios.
Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots: This paper introduces PanoMMOcc, the first panoramic multimodal (RGB + thermal + polarization + LiDAR) semantic occupancy dataset for quadruped robots, and proposes VoxelHound, a framework achieving robust 3D occupancy prediction via Vertical Jitter Compensation (VJC) and Multimodal Information Prompt Fusion (MIPF) modules, attaining 23.34% mIoU (+4.16%).
Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots: This paper introduces PanoMMOcc, the first panoramic multimodal semantic occupancy prediction dataset for quadruped robots, along with the VoxelHound framework. By incorporating a Vertical Jitter Compensation (VJC) module and a Multimodal Information Prompt Fusion (MIPF) module, VoxelHound achieves 23.34% mIoU under a four-modality setup (panoramic RGB + thermal + polarization + LiDAR), surpassing existing methods by +4.16%.
Perception Characteristics Distance: Measuring Stability and Robustness of Perception System in Dynamic Conditions under a Certain Decision Rule: This paper proposes the Perception Characteristics Distance (PCD), a novel metric that quantifies the maximum reliable detection range of a perception system by statistically modeling how the mean and variance of detection confidence evolve with distance. Given a detection quality threshold \(y^{\text{thres}}\) and a probability threshold \(p^{\text{thres}}\), PCD identifies the furthest distance at which reliability requirements are satisfied, addressing the inability of conventional static metrics such as AP and IoU to capture distance-dependent behavior and stochastic variation.
Plant Taxonomy Meets Plant Counting: A Fine-Grained, Taxonomic Dataset for Counting Hundreds of Plant Species: This paper introduces TPC-268, the first large-scale plant counting dataset integrating plant taxonomy, comprising 10,000 images, 678,050 point annotations, and 268 countable categories (covering 242 species), with complete Linnaean taxonomic hierarchy annotations, and provides comprehensive benchmarking under the class-agnostic counting (CAC) paradigm.
Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors: This paper proposes Points-to-3D, which encodes visible-region point clouds into TRELLIS's sparse structure latent (SS latent) and completes unobserved regions via a mask-aware inpainting network. Combined with a two-stage sampling strategy of structure completion followed by boundary refinement, the method achieves geometry-controllable, high-fidelity 3D asset/scene generation, attaining an F-Score of 0.964 on Toys4K (0.998 for visible regions).
ProOOD: Prototype-Guided Out-of-Distribution 3D Occupancy Prediction: This paper proposes ProOOD, the first framework to unify long-tail recognition and out-of-distribution (OOD) detection in 3D occupancy prediction from a voxel prototype-guided perspective. Through prototype-guided semantic inpainting (PGSI), tail-class enhancement (PGTM), and the training-free EchoOOD scoring mechanism, it achieves +3.57% mIoU (tail classes +24.80%) on SemanticKITTI and +19.34 AuPRCr on VAA-KITTI for OOD detection.
PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency: This paper proposes PTC-Depth, a monocular depth estimation framework that combines optical flow triangulation with wheel odometry. It tracks the metric scale of a depth foundation model via recursive Bayesian updates, achieving temporally consistent metric depth prediction with strong generalization across KITTI, TartanAir, and thermal infrared datasets.
R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection: This paper proposes R4Det, which systematically addresses three core challenges in 4D radar-camera fusion—inaccurate depth estimation, pose-free temporal fusion, and small object detection—through three plug-and-play BEV modules: Panoramic Depth Fusion (PDF), Deformable Gated Temporal Fusion (DGTF), and Instance-Guided Dynamic Refinement (IGDR). R4Det achieves 47.29% 3D mAP (+5.47%) on TJ4DRadSet and 66.69% mAP on VoD.
Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals: This paper proposes Rascene, an Integrated Sensing and Communication (ISAC) framework for high-fidelity 3D scene imaging using mmWave OFDM communication signals (5G/Wi-Fi). It achieves geometrically consistent recovery from sparse, multipath-corrupted RF observations via confidence-weighted multi-frame fusion.
Recover to Predict: Progressive Retrospective Learning for Variable-Length Trajectory Prediction: This paper proposes the Progressive Retrospective Framework (PRF), which employs cascaded retrospective units to progressively align features from incomplete observations to those of complete observations, substantially improving variable-length trajectory prediction performance in a plug-and-play manner compatible with existing methods.
ReMoT: Reinforcement Learning with Motion Contrast Triplets: This paper proposes ReMoT, a unified training paradigm that automatically constructs a 16.5K motion contrast triplet dataset (ReMoT-16K) via a rule-driven multi-expert collaborative pipeline, and combines GRPO reinforcement learning with a composite reward (logical consistency + length regularization) to systematically address the fundamental deficiencies of VLMs in spatiotemporal consistency reasoning, achieving a 25.1% performance improvement.
RESBev: Making BEV Perception More Robust: This paper proposes RESBev, a plug-and-play robustness enhancement framework for BEV perception. It employs a latent-space world model to predict clean BEV semantic priors from historical frames, and an anomaly reconstructor that fuses these priors with corrupted current observations via cross-attention. On nuScenes, RESBev achieves an average improvement of 15–20 IoU points across four LSS-based models under 10 types of perturbations (including natural corruptions and adversarial attacks), and generalizes to corruption types unseen during training.
ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes: This paper formally defines the temporally sparse 4D indoor semantic instance segmentation (4DSIS) task and proposes ReScene4D, which extends a 3D instance segmentation architecture to the 4D domain via three temporal information sharing strategies—spatio-temporal contrastive loss, spatio-temporal mask pooling, and spatio-temporal decoder serialization. The method achieves state-of-the-art performance on the 3RScan dataset and introduces a new t-mAP metric that jointly evaluates segmentation quality and temporal identity consistency.
SABER: Spatially Consistent 3D Universal Adversarial Objects for BEV Detectors: This paper proposes SABER, the first non-invasive, spatially consistent universal adversarial object generation framework targeting BEV 3D detectors. By placing optimized 3D meshes in the scene, SABER disrupts multi-view multi-frame detection and reveals BEV models' over-reliance on learned environmental context priors.
Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems: This paper proposes MOSAIC, a framework that clusters training data into domains, fits per-domain scaling laws over evaluation metrics, and greedily selects samples with the highest marginal gain, enabling efficient data selection for end-to-end autonomous driving models that matches or surpasses baseline performance with 80% less data.
SearchAD: Large-Scale Rare Image Retrieval Dataset for Autonomous Driving: SearchAD introduces the first large-scale rare image retrieval dataset for autonomous driving, comprising 420K+ frames, 510K+ annotated bounding boxes, and 90 rare categories. It supports both text-to-image and image-to-image retrieval, and through comprehensive evaluation reveals the deficiencies of current multimodal retrieval models in retrieving rare objects.
SG-NLF: Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis: SG-NLF proposes a pose-free LiDAR NeRF framework that addresses the geometric hole problem arising from LiDAR sparsity via a spectral-geometric hybrid representation, achieves global pose optimization through a confidence-aware pose graph, and enforces cross-frame consistency via adversarial learning. On nuScenes, it outperforms the previous state of the art by 35.8% in reconstruction quality and 68.8% in pose accuracy.
SHARP: Short-Window Streaming for Accurate and Robust Prediction in Motion Forecasting: This paper proposes SHARP, a motion prediction framework based on short-window streaming inference. It explicitly maintains and updates agent latent representations across time steps via an instance-aware context streamer module, and employs a dual-objective training strategy to achieve state-of-the-art streaming performance on the Argoverse 2 multi-agent benchmark while maintaining minimal latency.
SimScale: Learning to Drive via Real-World Simulation at Scale: This paper proposes the SimScale framework, which generates large-scale, high-fidelity simulation data by applying trajectory perturbations to existing driving logs, simulating reactive environment responses, and synthesizing sensor observations via neural rendering. Combined with pseudo-expert trajectory supervision and a sim-real co-training strategy, SimScale achieves substantial gains on NAVSIM v2 (navhard +8.6 EPDMS), with performance scaling smoothly with the volume of simulation data.
Single Pixel Image Classification using an Ultrafast Digital Light Projector: An ultrafast microLED-on-CMOS digital light projector (330 kfps global shutter) is employed for single-pixel imaging. Twelve-by-twelve Hadamard patterns are projected onto MNIST digits, and a single-pixel photodetector acquires a time series of aggregated light intensities. Image reconstruction is entirely bypassed; an ELM or DNN directly classifies the time series. At 1.2 kfps, the system achieves greater than 90% multi-class accuracy and greater than 99% AUC for binary classification (anomaly detection).
Single Pixel Image Classification using an Ultrafast Digital Light Projector: This paper employs a microLED-on-CMOS digital light projector to realize ultrafast single-pixel imaging (SPI), and combines low-complexity machine learning models (ELM and DNN) to achieve >90% classification accuracy on MNIST handwritten digits at a frame rate of 1.2 kHz, entirely bypassing image reconstruction.
SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model: This paper proposes SparseWorld-TC, a pure attention-based sparse occupancy world model that bypasses VAE discretization and BEV intermediate representations, directly predicting trajectory-conditioned multi-frame future occupancy end-to-end from raw image features, achieving substantial improvements over existing methods on nuScenes.
Sparsity-Aware Voxel Attention and Foreground Modulation for 3D Semantic Scene Completion: This paper proposes VoxSAMNet, a monocular semantic scene completion (SSC) framework that explicitly models voxel sparsity and semantic imbalance. It employs a Dummy Shortcut to bypass empty voxels, and Foreground Dropout combined with a Text-Guided Image Filter (TGIF) to mitigate long-tail overfitting. VoxSAMNet achieves a state-of-the-art 18.19% mIoU on SemanticKITTI, surpassing existing monocular and stereo methods.
Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis: This paper proposes SG-NLF, a framework that achieves pose-free LiDAR novel view synthesis via a hybrid spectral-geometric representation, combined with a confidence-aware pose graph and adversarial learning strategy. It significantly outperforms state-of-the-art methods on KITTI-360 and nuScenes (Chamfer Distance reduced by 35.8%, ATE reduced by 68.8%).
TerraSeg: Self-Supervised Ground Segmentation for Any LiDAR: This paper proposes TerraSeg, the first self-supervised, domain-agnostic LiDAR ground segmentation model. By constructing the large-scale unified OmniLiDAR dataset (12 public benchmarks, 15 sensor types, ~22 million scans) and a novel PseudoLabeler self-supervised pseudo-label generation module, TerraSeg achieves state-of-the-art performance on nuScenes, SemanticKITTI, and Waymo without any human annotation.
TT-Occ: Test-Time 3D Occupancy Prediction: This paper proposes TT-Occ, a training-free test-time 3D occupancy prediction framework that integrates vision foundation models (VFMs) at inference time to incrementally construct, refine, and voxelize temporally-aware 3D Gaussians. TT-Occ surpasses all self-supervised methods requiring extensive training on both Occ3D-nuScenes and nuCraft benchmarks.
TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding: This paper proposes TopoMaskV3, which upgrades the mask-based road topology understanding paradigm from a 2D auxiliary module to a standalone 3D centerline predictor by introducing dense offset fields and dense height maps as additional prediction heads. The work also introduces, for the first time in road topology evaluation, a geographically non-overlapping split and a long-range benchmark, exposing performance inflation caused by geographic overlap in existing benchmarks. TopoMaskV3 achieves state-of-the-art 28.5 OLS on the geographically non-overlapping benchmark.
Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation: This paper proposes Adaptive Weight Constraint (AWC) regularization, combining Shapley-value-based modality contribution assessment and Fisher Information Matrix (FIM) weighted parameter penalties, to address modality imbalance in multi-modal (RGB/LiDAR/mmWave/WiFi) 3D human pose estimation. Balanced optimization is achieved without introducing any additional learnable parameters.
Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation: To address the modality imbalance problem in multi-modal 3D human pose estimation (3D HPE), this paper proposes a Shapley-value-based modality contribution evaluation algorithm and an Adaptive Weight Constraint (AWC) regularization method based on the Fisher information matrix. The approach achieves balanced optimization across modalities without introducing additional parameters, and comprehensively outperforms existing balancing methods on the MM-Fi dataset.
Towards Balanced Multi-Modal Learning in 3D Human Pose Estimation: This paper proposes a modality contribution assessment algorithm based on Shapley values and Pearson correlation coefficients, along with a Fisher Information Matrix (FIM)-guided Adaptive Weight Constraint (AWC) regularization method. The approach addresses modality imbalance in end-to-end fusion of four modalities (RGB/LiDAR/mmWave/WiFi), achieving a 2.71 mm reduction in MPJPE on the MM-Fi dataset without introducing additional learnable parameters.
Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model: This paper proposes TTSG, a training-free modular framework that generates realistic traffic scenes directly from free-form natural language descriptions. It employs LLM-driven prompt analysis, road retrieval, agent planning, and a plan-aware road ranking algorithm, requiring no predefined routes or spawn points, and achieves a minimum average collision rate of 3.5% on SafeBench.
Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model: This paper proposes TTSG, a modular framework that leverages LLMs to convert free-form text descriptions into executable traffic scenarios. Through prompt analysis, road retrieval, agent planning, and a novel plan-aware road ranking algorithm, TTSG generates diverse scenes and achieves a minimum average collision rate of 3.5% on SafeBench.
U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences: This paper proposes U4D, the first uncertainty-aware 4D LiDAR world modeling framework. It adopts a "hard-first, easy-second" two-stage diffusion generation strategy that first reconstructs high-uncertainty regions and then conditionally completes the entire scene. A MoST module is designed to adaptively fuse spatio-temporal features for temporal consistency.
VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation: This paper proposes VIRD, which constructs view-invariant representations via dual-axis transformation (polar transformation + context-enhanced positional attention) to achieve state-of-the-art cross-view pose estimation without orientation priors, reducing position and orientation errors on KITTI by 50.7% and 76.5%, respectively.
Learning Vision-Language-Action World Models for Autonomous Driving: VLA-World unifies the predictive imagination of world models with the reflective reasoning of VLA models in a single framework. By generating future frames and reasoning over them, the method improves trajectory planning, achieving state-of-the-art collision rates and FID scores.
WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation: WalkGPT is proposed as the first pixel-grounded large vision-language model for pedestrian accessibility navigation, unifying conversational reasoning, segmentation masks, and depth estimation within a single architecture, accompanied by the 41k-scale PAVE dataset.
x2-Fusion: Cross-Modality and Cross-Dimension Flow Estimation in Event Edge Space: This paper proposes x2-Fusion, which constructs a unified Event Edge Space anchored on spatiotemporal edge signals from event cameras. Image, LiDAR, and event features are aligned into this homogeneous edge space, followed by reliability-aware adaptive fusion and cross-dimension contrastive learning to jointly estimate 2D optical flow and 3D scene flow, achieving state-of-the-art performance on both synthetic and real-world datasets.
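
As a rough illustration of the Perception Characteristics Distance (PCD) entry above, the sketch below computes the furthest distance at which the probability of detection confidence exceeding the quality threshold still meets the probability threshold. It is a minimal sketch under stated assumptions, not the paper's implementation: confidence at each distance is assumed to be Gaussian with a known mean and standard deviation, "furthest distance" is read as the largest sampled distance that satisfies the requirement, and the function names `reliability` and `pcd` as well as the toy mean/std curves are hypothetical.

```python
import math


def reliability(mean, std, y_thres):
    """P(y >= y_thres) for y ~ Normal(mean, std**2) (Gaussian assumption)."""
    if std <= 0:
        # Degenerate case: no spread, so the outcome is deterministic.
        return 1.0 if mean >= y_thres else 0.0
    z = (y_thres - mean) / std
    # 1 - Phi(z), written with the error function.
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))


def pcd(distances, means, stds, y_thres, p_thres):
    """Largest sampled distance whose reliability is at least p_thres.

    Returns None when no sampled distance meets the requirement.
    """
    ok = [
        d
        for d, m, s in zip(distances, means, stds)
        if reliability(m, s, y_thres) >= p_thres
    ]
    return max(ok) if ok else None


# Hypothetical toy curves: mean confidence drops and spread grows with distance.
distances = list(range(10, 101, 10))              # metres
means = [0.95 - 0.005 * d for d in distances]     # assumed mean confidence
stds = [0.02 + 0.001 * d for d in distances]      # assumed standard deviation

result = pcd(distances, means, stds, y_thres=0.5, p_thres=0.9)
print(result)
```

With these toy curves the reliability falls below 0.9 between 60 m and 70 m, so the sketch reports 60. A stricter reading of the metric (requiring every closer distance to pass as well) would replace the `max` with a scan that stops at the first failing distance.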