CVPR2025 Autonomous Driving AI paper notes paper summaries 3D Object Detection Diffusion Models 3D Gaussian Splatting Point Cloud Segmentation

🚗 Autonomous Driving

📷 CVPR2025 · 89 paper notes

📌 Same area in other venues: 📷 CVPR2026 (157) · 🔬 ICLR2026 (50) · 🧪 ICML2026 (8) · 🤖 AAAI2026 (56) · 🧠 NeurIPS2025 (47) · 📹 ICCV2025 (91)

🔥 Top topics: Autonomous Driving ×8 · 3D Object Detection ×6 · Diffusion Models ×5 · 3D Gaussian Splatting ×5 · Point Cloud ×5

3D-AVS: LiDAR-based 3D Auto-Vocabulary Segmentation: This paper proposes 3D-AVS, the first auto-vocabulary segmentation method specifically tailored for LiDAR point clouds. Without requiring users to specify target categories, the system automatically identifies semantic entities in the scene from both images and point clouds to generate a vocabulary, and then performs point-wise semantic segmentation with an open-vocabulary segmenter. It demonstrates the capability to generate fine-grained semantic categories on nuScenes and ScanNet200.
ProtoOcc: 3D Occupancy Prediction with Low-Resolution Queries via Prototype-aware View Transformation: This paper proposes ProtoOcc, which enhances the contextual information of low-resolution voxels by mapping 2D image clustering prototypes into the 3D voxel query space via prototype-aware view transformation. Together with a multi-perspective occupancy decoding strategy, it reconstructs high-resolution 3D occupancy scenes from the enhanced voxels. It achieves competitive performance compared to high-resolution methods (Occ3D mIoU 37.80 vs. PanoOcc 38.11) while using a 75% smaller voxel resolution.
A Dataset for Semantic Segmentation in the Presence of Unknowns: This paper proposes the ISSU anomaly segmentation dataset, which represents the first benchmark to simultaneously support the joint evaluation of known classes (closed-set) and unknown anomalies (open-set). It is twice the size of existing anomaly segmentation datasets, covers multiple domains, sensors, and lighting conditions, and its benchmarks reveal significant deficiencies in state-of-the-art (SOTA) methods regarding domain generalization and the segmentation of large/small objects.
A Neuro-Symbolic Framework Combining Inductive and Deductive Reasoning for Autonomous Driving Planning: This paper proposes the first neuro-symbolic framework that directly embeds ASP symbolic reasoning decisions as learnable embeddings into the trajectory decoding of an end-to-end planner. It dynamically extracts scene rules using LLMs, performs logical arbitration via the Clingo solver, generates physically feasible trajectories via a differentiable KBM, and refines them with neural residuals. On nuScenes, it comprehensively outperforms MomAD with an L₂ error of 0.57m, a collision rate of 0.075%, and a TPC of 0.47m.
PAP: A Prediction-as-Perception Framework for 3D Object Detection: Inspired by the brain's "predictive perception," PAP uses the trajectory prediction results of the previous frame as query inputs for the perception module of the current frame to replace some random queries. This achieves a 10% improvement in AMOTA (0.359 to 0.395), a 15% increase in inference speed (14 to 16 FPS), and a 14% reduction in training time on UniAD.
CAWM-Mamba: A Unified Model for Infrared-Visible Image Fusion and Compound Adverse Weather Restoration: CAWM-Mamba proposes the first end-to-end unified framework that simultaneously addresses infrared-visible image fusion and compound adverse weather restoration (e.g., fog + rain, rain + snow). Using weather-aware preprocessing, cross-modal feature interaction, and a wavelet-domain frequency-SSM that decouples multi-frequency degradations, it comprehensively outperforms SOTA models on the AWMM-100K and standard fusion datasets.
Certified Human Trajectory Prediction: This work introduces randomized smoothing certification to human trajectory prediction for the first time. By leveraging mean/median aggregation functions and a diffusion denoiser, it provides certified robustness for trajectory prediction models, guaranteeing that the output remains within a certified bound for any input perturbation within a radius \(R\).
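As a rough illustration of the smoothing step in the Certified Human Trajectory Prediction entry, the sketch below aggregates a predictor's outputs over Gaussian-perturbed copies of the observed trajectory. The `constant_velocity` predictor, `sigma`, and the sample count are hypothetical choices for illustration; the diffusion denoiser and the computation of the certified bound itself are omitted.

```python
import numpy as np

def constant_velocity(history, horizon=3):
    """Hypothetical stand-in predictor: extrapolate the last observed velocity."""
    velocity = history[-1] - history[-2]
    steps = np.arange(1, horizon + 1)[:, None]
    return history[-1] + steps * velocity

def smoothed_predict(predictor, history, sigma=0.1, n_samples=200, agg="median", seed=0):
    """Aggregate predictions over Gaussian-perturbed copies of the input."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_samples, *history.shape))
    preds = np.stack([predictor(h) for h in history[None] + noise])
    # Median and mean are the two aggregation functions named in the summary.
    return np.median(preds, axis=0) if agg == "median" else preds.mean(axis=0)

history = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
smoothed = smoothed_predict(constant_velocity, history)
```

The median variant is less sensitive to outlier predictions than the mean, which is one plausible reason to offer both.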
ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate: This work constructs the first large-scale multi-modal rock climbing motion dataset, AscendMotion (412K frames, RGB+LiDAR+IMU), and proposes ClimbingCap, a method that accurately recovers the 3D motions of climbers in the world coordinate system through separate coordinate decoding, post-processing optimization, and semi-supervised training.

Closed-Loop Supervised Fine-Tuning of Tokenized Traffic Models

CompoSIA: Composing Driving Worlds through Disentangled Control for Adversarial Scenario Generation: CompoSIA proposes a compositional driving video generation framework based on Flow Matching DiT. By disentangling the injection of three types of control signals—structure (3D bboxes), identity (a single reference image), and ego-motion (camera trajectories)—it achieves fine-grained independent control and compositional editing for systematically synthesizing adversarial driving scenarios, resulting in a 17% improvement in FVD and a 173% increase in collision rate.
Cubify Anything: Scaling Indoor 3D Object Detection: This paper proposes the Cubify Anything 1M (CA-1M) dataset—the first large-scale indoor 3D detection dataset with exhaustive annotations of all objects on laser scans (440K objects / 1K scenes / 3.5K captures / 13M frames / pixel-perfect projection), and introduces CuTR, a pure Transformer detector, demonstrating that without 3D inductive biases (point clouds/voxels), 3D detection can outperform point-cloud-based methods when data is abundant.
DecoupledGaussian: Object-Scene Decoupling for Physics-Based Interaction: Decouples objects from the background in 3DGS scenes to support physical simulations (such as collisions and grasping) while maintaining high-quality rendering.
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving: This paper proposes DiffusionDrive, which successfully applies diffusion models to real-time multimodal trajectory planning in end-to-end autonomous driving for the first time. By introducing a truncated diffusion policy (reducing denoising steps from 20 to 2) and a cascade diffusion decoder, it achieves a record-breaking 88.1 PDMS on the NAVSIM dataset while maintaining a real-time speed of 45 FPS.
Distilling Monocular Foundation Model for Fine-grained Depth Completion: This paper proposes DMD3C, a two-stage knowledge distillation framework that transfers geometric knowledge from monocular depth foundation models (such as Depth Anything V2) to depth completion networks. The first stage performs pre-training using synthesized training data, and the second stage fine-tunes on real-world data utilizing a scale-shift invariant loss (SSI Loss), achieving first place on the KITTI depth completion leaderboard.
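To make the scale-shift invariant loss mentioned in the DMD3C entry more concrete, here is a minimal sketch of one common formulation: fit a least-squares scale and shift that aligns one depth map to the other, then measure the remaining absolute error. The function name `ssi_loss` and the exact error term are assumptions; the paper's own definition may differ.

```python
import numpy as np

def ssi_loss(pred, target, mask=None):
    """Align pred to target with a least-squares scale and shift, then take the mean absolute error."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    if mask is not None:
        keep = np.asarray(mask, dtype=bool).ravel()
        pred, target = pred[keep], target[keep]
    # Solve min over (scale, shift) of ||scale * pred + shift - target||^2.
    design = np.stack([pred, np.ones_like(pred)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.mean(np.abs(scale * pred + shift - target)))

target = np.array([1.0, 2.0, 4.0, 8.0])
affine_copy = 0.5 * target - 3.0          # differs from target only by scale and shift
loss_affine = ssi_loss(affine_copy, target)
loss_other = ssi_loss([1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0])
```

Because the alignment absorbs any global scale and offset, a relative-depth prediction can supervise (or be compared against) metric depth without knowing the true scale.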
Distilling Multi-modal Large Language Models for Autonomous Driving: This paper proposes the DiMA framework, which performs knowledge distillation between a Multi-modal Large Language Model (MLLM) and a visual end-to-end planner through joint training. It designs three surrogate tasks—masked reconstruction, future prediction, and scene editing—to enrich scene representations. During inference, the LLM can be discarded, utilizing only the visual planner. This achieves a 37% reduction in L2 trajectory error and an 80% reduction in collision rate on nuScenes.
Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map: This paper defines for the first time the task of integrating traffic sign regulations into online vectorized high-definition (HD) maps. It constructs the MapDR dataset, which contains over 10,000 video clips and more than 18,000 lane-level regulations, and proposes two baseline solutions: a modular approach (VLE-MEE) and an end-to-end approach (RuleVLM), with RuleVLM achieving a 64.2% overall F1 score.
DrivingSphere: Building a High-fidelity 4D World for Closed-loop Simulation: Proposes a high-fidelity closed-loop driving simulation framework based on 4D occupancy grids. It generates static scene occupancy from BEVs using OccDreamer, composes dynamic objects using Actor Bank, and generates multi-view videos conditioned on occupancy using VideoDreamer, reducing FVD by 44% and improving object detection mAP by 33%.
EV-3DOD: Pushing the Temporal Boundaries of 3D Object Detection with Event Cameras: This work introduces event cameras to 3D object detection for the first time, proposing Virtual 3D Event Fusion (V3D-EF) to project asynchronous events into a 3D voxel space for fusion with LiDAR features. It enables continuous object detection at 100 FPS during the inter-frame "blind time," filling the ~100 ms sensing gap between sensor frames.
EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis: This paper proposes EVolSplat, a feed-forward 3D Gaussian Splatting method for urban scenes based on sparse 3D convolutions. Instead of pixel-aligned predictions, it predicts Gaussian parameters from a globally unified voxel grid. Combined with occlusion-aware image-based rendering (IBR) coloring, it achieves 23.26dB PSNR and 83.81 FPS on KITTI-360.
Exploring Scene Affinity for Semi-Supervised LiDAR Semantic Segmentation: This paper proposes the AIScene framework, which leverages intra-scene consistency (point erasure strategy) and inter-scene affinity (MixPatch + InsFill cross-scene augmentation) to improve semi-supervised LiDAR segmentation by 1.9 mIoU on SemanticKITTI using only 1% labels.
ForestLPR: LiDAR Place Recognition in Forests Attentioning Multiple BEV Density Images: This paper proposes ForestLPR, which slices point clouds at different heights to generate multiple BEV density maps. It leverages ViT to extract local features, followed by a multi-BEV interaction module to adaptively attend to discriminative features at different heights, achieving robust LiDAR place recognition in forest environments and significantly outperforming prior SOTA methods on multiple datasets.
FreeSim: Toward Free-Viewpoint Camera Simulation in Driving Scenes: This paper proposes FreeSim, which reformulates the challenging off-trajectory novel view generation problem as a generative image enhancement task. Combined with training data construction via piece-wise Gaussian reconstruction and a progressive view expansion strategy, it achieves high-quality free-viewpoint rendering with more than 3 meters of lateral offset in driving scenes for the first time.
GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D Occupancy Prediction: This paper proposes GaussianFormer-2, reinterpreting 3D semantic Gaussians from a probabilistic perspective: each Gaussian represents the occupancy probability distribution of its neighborhood. By aggregating geometric predictions via probability multiplication and normalizing semantic predictions using a Gaussian Mixture Model (GMM), it completely eliminates the issues of Gaussians describing empty regions and redundant overlaps, achieving SOTA performance with only 8.9% of the Gaussians.
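The "probability multiplication" in the GaussianFormer-2 entry can be read as follows: if each Gaussian gives an independent probability that a point is occupied, the point is empty only when every Gaussian says so. The sketch below implements that reading as p(x) = 1 - prod_i (1 - alpha_i(x)); the function name, array shapes, and the unnormalized Gaussian form of alpha_i are assumptions for illustration, and the GMM-based semantic normalization is not shown.

```python
import numpy as np

def occupancy_probability(points, means, covariances):
    """p(x) = 1 - prod_i (1 - alpha_i(x)), with alpha_i an unnormalized Gaussian in [0, 1]."""
    diff = points[:, None, :] - means[None, :, :]              # (P, G, 3)
    precisions = np.linalg.inv(covariances)                     # (G, 3, 3)
    mahalanobis = np.einsum("pgi,gij,pgj->pg", diff, precisions, diff)
    alpha = np.exp(-0.5 * mahalanobis)                          # (P, G)
    return 1.0 - np.prod(1.0 - alpha, axis=1)

means = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
covariances = np.stack([0.04 * np.eye(3)] * 2)
points = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])
prob = occupancy_probability(points, means, covariances)
```

Under this form, a point far from every Gaussian gets a probability near zero, so no Gaussian needs to be spent on describing empty space.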
GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction: This paper proposes GaussianWorld, which reformulates 3D occupancy prediction as a 4D occupancy prediction problem conditioned on current sensor inputs. By decomposing scene evolution into three factors — ego-motion alignment, dynamic object motion, and new region completion — the proposed method explicitly models scene changes in the 3D Gaussian space via a world model. Without introducing extra computational overhead, it improves the mIoU of single-frame methods by over 2% on nuScenes.
GDFusion: Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction: This paper proposes GDFusion, which reinterprets RNN updates as gradient descent in the feature space in order to fuse four types of heterogeneous temporal information (voxel-level, scene-level, motion, geometry) in VisionOcc under one formulation, achieving a 1.4%-4.8% mIoU improvement on Occ3D while reducing GPU memory by 27%-72%.
Generating Multimodal Driving Scenes via Next-Scene Prediction: This paper proposes UMGen, a unified multimodal driving scene generation framework. It tokenizes four modalities—ego-vehicle action, map, traffic participants (agents), and images—and generates scenes step-by-step using a two-stage strategy: temporal autoregression (TAR) across frames and ordered autoregression (OAR) within each frame. Additionally, it introduces an Action-aware Map Alignment (AMA) module to maintain consistency between ego-motion and the map, enabling the autonomous generation of coherent driving sequences up to 60 seconds long.
Generative Gaussian Splatting for Unbounded 3D City Generation: Proposes GaussianCity, the first framework to apply 3D Gaussian Splatting to unbounded 3D city generation. By introducing a compact intermediate representation called BEV-Point, GPU memory consumption is decoupled from the scene scale (remaining constant). Additionally, a Point Serializer is designed to convert unordered BEV points into ordered sequences to capture structural and contextual features. This achieves state-of-the-art (SOTA) performance in both drone and street-view city generation, with rendering speeds 60 times faster than CityDreamer (which is based on NeRF).
GLane3D: Detecting Lanes with Graph of 3D Keypoints: This paper proposes GLane3D, a keypoint-based 3D lane detection method. It constructs a graph structure by detecting lane keypoints and predicting directed connections between them. After removing redundant keypoint proposals using PointNMS, Dijkstra's shortest path algorithm is employed to extract lane instances. It achieves state-of-the-art (SOTA) F1 scores on OpenLane and Apollo datasets with superior generalization capability.
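The lane-extraction step in the GLane3D entry relies on Dijkstra's shortest path over a directed keypoint graph. A minimal sketch of that search is shown below; the graph layout, the node names, and the idea of using edge costs derived from predicted connection scores are illustrative assumptions rather than the paper's exact setup.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra over a directed graph given as {node: [(neighbor, cost), ...]}."""
    best = {start: 0.0}
    parent = {}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            # Walk the parent links back to the start to recover the path.
            path = [node]
            while node in parent:
                node = parent[node]
                path.append(node)
            return path[::-1], cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                parent[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    return None, float("inf")

# Hypothetical keypoint graph: low cost stands for a confident connection.
keypoints = {"a": [("b", 0.1), ("c", 0.5)], "b": [("d", 0.2)], "c": [("d", 0.1)]}
lane, lane_cost = shortest_path(keypoints, "a", "d")
```

Here the path through `b` is preferred because its total cost is lower than the path through `c`.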
InteractionMap: Improving Online Vectorized HDMap Construction with Interaction: This paper proposes InteractionMap, which comprehensively enhances information interaction in online vectorized HD map construction through three modules: point-level and instance-level relation embedding, keyframe-based hierarchical temporal fusion, and geometry-aware classification-localization alignment. It achieves state-of-the-art (SOTA) performance on both nuScenes (71.8 mAP) and Argoverse2 (74.7 mAP).
Learning to Detect Objects from Multi-Agent LiDAR Scans without Manual Labels: This paper proposes DOtA (Detect Objects from Multi-Agent), a multi-agent LiDAR 3D object detection method that requires no manual annotations. By leveraging the shared ego-pose and ego-shape within cooperative agents to initialize the detector, it encodes complementary observations across agents at multiple scales. It then decodes high- and low-quality pseudo-labels to guide feature learning, achieving high-quality 3D object detection in a completely unsupervised manner.
LiDAR-RT: Gaussian-based Ray Tracing for Dynamic LiDAR Re-Simulation: This paper proposes LiDAR-RT, which integrates 3D Gaussian primitives with NVIDIA OptiX hardware-accelerated ray tracing to achieve real-time and physically accurate LiDAR re-simulation in dynamic driving scenes for the first time. It achieves a rendering speed of 30 FPS and requires only 2 hours of training, significantly surpassing NeRF-based approaches (0.2 FPS and 15 hours).
LightLoc: Learning Outdoor LiDAR Localization at Light Speed: This paper proposes LightLoc, which achieves a 50x acceleration in large-scale outdoor LiDAR localization training (1 hour vs. 2 days) while attaining a state-of-the-art (SOTA) position accuracy of 0.83m. This is achieved via Sample Classification Guidance (SCG) to reduce regression ambiguity in visually similar areas, and Redundant Sample Downsampling (RSD) to discard already well-learned frames.
LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes: This paper proposes LiMoE, which fuses three complementary LiDAR representations (range images, sparse voxels, and raw point clouds) using a Mixture of Experts (MoE) mechanism. Through three-stage training (image-to-LiDAR pre-training -> contrastive mixture learning -> semantic mixture supervision), it achieves 51.4% mIoU on nuScenes segmentation and generalizes across 7 datasets.
LiSu: A Dataset and Method for LiDAR Surface Normal Estimation: This paper proposes LiSu, the first large-scale synthetic LiDAR point cloud surface normal dataset, and designs a spatiotemporal regularization method to enhance normal estimation accuracy, effectively suppressing pseudo-label noise during self-training and achieving robust synthetic-to-real domain adaptation.
LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction: LR-SGS proposes a robust LiDAR-reflectance-guided salient Gaussian splatting method, introducing structure-aware salient Gaussian representations (initialized by LiDAR geometry and reflectance feature points) and an illumination-invariant reflectance channel as extra constraints. On the challenging scenes (complex lighting) of the Waymo dataset, its PSNR outperforms OmniRe by 1.18 dB.
M²-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs: M²-Occ addresses the problem of semantic occupancy prediction under incomplete multi-camera inputs. It proposes a Multi-view Masked Reconstruction (MMR) module to recover missing view features using overlapping regions of adjacent cameras, and a Feature Memory Module (FMM) to refine uncertain voxel features through class-level semantic prototypes, improving IoU by 4.93% under the missing rear-view setup. The significance of this work lies not only in the method itself but also in providing the first systematic study of occupancy prediction behavior under sensor failures.
MapGCLR: Geospatial Contrastive Learning of Representations for Online Vectorized HD Map Construction: MapGCLR proposes a geospatial contrastive learning method that improves the BEV encoder for online vectorized HD map construction by enforcing BEV feature consistency in geospatially overlapping regions across multiple drives, achieving a 42% relative mAP improvement with only 5% labeled data.
MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction: This paper combines the MAE-style mask reconstruction task with the diffusion generation process to propose the MaskGWM driving world model. Through three innovative designs—diffusion-related mask tokens, row-wise mask attention, and a row-wise cross-view module—it significantly outperforms current SOTA methods in both long-term prediction and multi-view generation scenarios.
MITracker: Multi-View Integration for Visual Object Tracking: This paper proposes a multi-view object tracking dataset MVTrack (234K frames, 27 target classes) and a method named MITracker. By projecting 2D features into 3D feature volumes and compressing them into a BEV (Bird's-Eye View) plane for cross-view fusion, combined with spatially-augmented attention to refine individual view tracking results, the method achieves rapid tracking recovery from occlusions.
ModeSeq: Taming Sparse Multimodal Motion Prediction with Sequential Mode Modeling: Proposes ModeSeq—a novel paradigm that models trajectory modes as a sequence. By decoding multimodal trajectories progressively (instead of one-step parallel decoding), it explicitly captures inter-mode correlations. Combined with the Early-Match-Take-All (EMTA) training strategy, it significantly improves trajectory diversity and confidence calibration in sparse multimodal motion prediction, without relying on dense mode predictions or heuristic post-processing.
Neural Inverse Rendering from Propagating Light: The first method for physics-based inverse rendering from multi-view time-resolved LiDAR measurements (time-of-flight photon detection). It replaces recursive path tracing with a time-resolved radiance cache to model direct and indirect light transport, reducing normal MAE on synthetic scenes from 22.80° (FWP++) to 8.45°, while supporting novel view synthesis and relighting.
O3N: Omnidirectional Open-Vocabulary Occupancy Prediction: O3N is the first to propose a purely vision-based end-to-end omnidirectional open-vocabulary occupancy prediction framework. It models omnidirectional spatial continuity via Polar-spiral Mamba (PsM), unifies geometric and semantic supervision via Occupancy Cost Aggregation (OCA), and bridges the pixel-voxel-text modality gap via gradient-free Natural Modality Alignment (NMA), achieving SOTA performance on QuadOcc and Human360Occ.
OccMamba: Semantic Occupancy Prediction with State Space Models: OccMamba introduces SSM/Mamba into outdoor semantic occupancy prediction. It serializes 3D voxels into 1D sequences via a height-prioritized 2D Hilbert flattening strategy, and uses a hierarchical Mamba structure coupled with a local context processor to model both global and local contexts. It achieves state-of-the-art (SOTA) results on OpenOccupancy, SemanticKITTI, and SemanticPOSS, with GPU memory consumption far lower than Transformer-based approaches.
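For the voxel serialization described in the OccMamba entry, the sketch below orders the BEV cells along a 2D Hilbert curve (the standard index computation) and sweeps the height axis within each cell. Reading "height-prioritized" as "visit all z values of a column before moving to the next column" is an assumption, as are the function names and the power-of-two grid size.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def serialize_voxels(n, height):
    """Order voxels (x, y, z): Hilbert order in the BEV plane, sweeping z within each column."""
    columns = sorted(
        ((x, y) for x in range(n) for y in range(n)),
        key=lambda c: hilbert_index(n, c[0], c[1]),
    )
    return [(x, y, z) for x, y in columns for z in range(height)]

sequence = serialize_voxels(4, 2)
```

A Hilbert ordering keeps consecutive sequence elements spatially adjacent in the BEV plane, which is the locality property a 1D sequence model benefits from.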
Online Video Understanding: OVBench and VideoChat-Online: This work advances online video understanding from three perspectives: evaluation benchmarks, model architectures, and training strategies. It proposes OVBench (an online video QA benchmark containing 16 subtasks across 6 task types), designs a Pyramid Memory Bank (PMB) to efficiently compress streaming video information, and builds the 4B-parameter VideoChat-Online model through progressive offline-to-online training, outperforming a 7B offline model by 4.2% on OVBench.
Open-Canopy: Towards Very High Resolution Forest Monitoring: Open-Canopy proposes the first open-access, nation-wide, very high resolution (\(1.5\text{m}\)) canopy height estimation benchmark dataset covering over \(87,000\text{ km}^2\) in France, combining SPOT satellite imagery and airborne LiDAR data. It also introduces a benchmark task for canopy height change detection, Open-Canopy-\(\Delta\), establishing a comprehensive experimental baseline across a series of SOTA models.
Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots: Proposes VoxelHound, the first panoramic multimodal semantic occupancy prediction framework designed for quadruped robots. It introduces the PanoMMOcc dataset (panoramic RGB + thermal + polarization + LiDAR) and achieves 23.34% mIoU through the Vertical Jitter Compensation (VJC) and Multimodal Information Prompt Fusion (MIPF) modules.
PanSplat: 4K Panorama Synthesis with Feed-Forward Gaussian Splatting: PanSplat proposes a feed-forward panorama view synthesis method. By designing a spherical 3D Gaussian pyramid, Fibonacci lattice arrangement, and hierarchical spherical cost volume, it achieves high-efficiency 4K resolution (2048×4096) panoramic generation for the first time, trainable on a single A100 GPU.
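The Fibonacci lattice mentioned in the PanSplat entry is a standard way to place points almost uniformly on a sphere, avoiding the pole crowding of an equirectangular grid. The sketch below shows the usual golden-angle construction; how PanSplat attaches Gaussians to these positions is not covered, and the function name is hypothetical.

```python
import numpy as np

def fibonacci_sphere(n):
    """Return n near-uniform unit vectors on the sphere via the golden-angle lattice."""
    i = np.arange(n)
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))
    z = 1.0 - (2.0 * i + 1.0) / n            # evenly spaced heights in (-1, 1)
    radius = np.sqrt(1.0 - z * z)            # radius of the circle at each height
    theta = golden_angle * i                 # rotate by the golden angle per point
    return np.stack([radius * np.cos(theta), radius * np.sin(theta), z], axis=1)

directions = fibonacci_sphere(1000)
```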
Physical Plausibility-aware Trajectory Prediction via Locomotion Embodiment: This paper proposes the Locomotion Embodiment framework, which utilizes humanoid locomotion generation in a physical simulator to evaluate the physical plausibility of trajectories. It constructs a differentiable LocoVal function to replace the non-differentiable physical simulator during trajectory prediction network training, and filters implausible trajectories at inference time.
PIDLoc: Cross-View Pose Optimization Network Inspired by PID Controllers: Inspired by PID controllers, this paper proposes PIDLoc, a cross-view pose optimization network. By integrating three branches—P (proportional, local feature discrepancy), I (integral, global multi-pose candidate aggregation), and D (derivative, gradient of feature discrepancy)—with a spatial-aware pose estimator, PIDLoc achieves robust and precise localization even under large initial pose errors.
Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting: This paper identifies that point-to-point (P2P) matching in semi-supervised crowd counting leads to model over-activation on unlabeled data (visualized via PSAM gradient diagnosis). To address this, the authors propose point-to-region (P2R) matching, which expands each GT/pseudo-labeled point into a local region and propagates confidence. On ShanghaiTech-A with 5% labeled data, it achieves an MAE of 69.9 (vs. the previous SOTA of 83.7) while running 68 times faster than P2P.
PAR: Poly-Autoregressive Prediction for Modeling Interactions: PAR (Poly-Autoregressive) proposes a simple and unified multi-agent behavior prediction framework. By conditioning on the state sequences of other agents during interactions, paired with the next-timestep prediction of the same agent and learned agent ID embeddings, it outperforms single-agent autoregressive baselines across three distinct tasks: social behavior prediction, autonomous driving trajectory prediction, and hand-object interaction.
Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation: Prompt Depth Anything introduces the "prompting" paradigm to depth foundation models for the first time. Using low-cost LiDAR (such as iPhone LiDAR) as metric prompts, it guides the Depth Anything model to output accurate metric depth through a concise multi-scale prompt fusion architecture, achieving high-quality depth estimation at up to 4K resolution.
PSA-SSL: Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds: PSA-SSL is proposed to preserve object pose and size information by incorporating a self-supervised bounding box regression pre-training task into contrastive learning, integrated with LiDAR Pattern Augmentation to achieve cross-sensor generalization, significantly outperforming SOTA self-supervised methods on 3D semantic segmentation and object detection.
RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion: This work proposes RaCFormer, a query-based radar-camera fusion framework. By simultaneously sampling features from both the image perspective and the BEV perspective, and incorporating modules such as circular query initialization, radar-aware depth prediction, and an implicit dynamic catcher, it achieves 64.9% mAP and 70.2% NDS on nuScenes.
RC-AutoCalib: An End-to-End Radar-Camera Automatic Calibration Network: RC-AutoCalib is proposed as the first end-to-end online automatic geometric calibration method for 3D Radar and Camera. By utilizing a dual-perspective (front view + bird's-eye view) feature representation, a selective fusion mechanism, and a noise-resistant matcher, it effectively addresses the sparsity and high uncertainty of Radar data, significantly outperforming existing LiDAR-Camera calibration methods on the nuScenes dataset.
ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration: This paper proposes ReconDreamer, which enhances driving scene reconstruction by incrementally integrating world model knowledge. The core components are DriveRestorer (a fine-tuned world model that restores rendering artifacts online) and a Progressive Data Updating Strategy (PDUS). It achieves high-quality novel trajectory rendering under large maneuvers (e.g., a multi-lane change of over 6 meters) for the first time, achieving a 24.87% improvement in NTA-IoU over the baseline.
RENO: Real-Time Neural Compression for 3D LiDAR Point Clouds: RENO proposes Sparse Occupancy Codes and a one-time inference strategy, achieving the first real-time neural compression of 3D LiDAR point clouds (10fps@14-bit). With a model size of only 1MB, it outperforms the G-PCC standard by 12.25% in bitrate savings.
Rethinking Lanes and Points in Complex Scenarios for Monocular 3D Lane Detection: Reveals the inherent truncation defect at endpoints in existing sparse lane representations (losing up to 20m). Proposes an Endpoint Patching strategy (EP-head) and a geometry-prior-infused PL-attention, improving the F1-score of Persformer, Anchor3DLane, and LATR by 4.4, 3.2, and 2.8 points, respectively.
GDFusion: Rethinking Temporal Fusion with a Unified Gradient Descent View for 3D Semantic Occupancy Prediction: This paper proposes GDFusion, which re-interprets RNNs as gradient descent steps to unify the fusion of three temporal cues (scene-level, motion, and geometry). It improves performance by 1.4-4.8% mIoU over non-temporal baselines on Occ3D while reducing inference memory by 27-72%, demonstrating superior efficiency compared to multi-frame methods like SOLOFusion.
Scenario Dreamer: Vectorized Latent Diffusion for Generating Driving Simulation Environments: Scenario Dreamer is proposed to decompose the generation of autonomous driving simulation environments into three parts: a vectorized latent diffusion model to generate initial scenarios (lanes and agents), reward-conditioned CtRL-Sim for closed-loop behavior generation, and scene inpainting for unbounded environment expansion. It achieves a Frechet Distance of 0.67 on nuPlan (compared to the SLEDGE baseline of 1.44) with a generation time of only 0.16 seconds.
SceneCrafter: Controllable Multi-View Driving Scene Editing: SceneCrafter proposes a driving scene editing framework based on multi-view diffusion models. Through a teacher-student two-stage training paradigm, it generates high-quality synthetic paired data, supporting global editing of weather/time and local editing of foreground object addition/deletion while maintaining 3D geometric consistency across cameras.
SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model: SceneDiffuser++ is proposed, an end-to-end city-scale traffic simulation diffusion model that handles agent spawning and despawning in sparse tensors via soft clipping, achieving over 60 seconds of trip-level traffic simulation with a combined JS divergence of 0.2423 on WOMD-XLMap.
SDGOcc: Semantic and Depth-Guided BEV Transformation for 3D Multimodal Occupancy Prediction: This paper proposes SDG-OCC, a multimodal 3D semantic occupancy prediction framework. By replacing the traditional LSS pipeline with a semantic and depth-guided view transformation (which utilizes LiDAR depth and image semantic segmentation masks to construct virtual points), and combining it with a fusion-to-occupancy-driven active distillation module, the method achieves SOTA performance on Occ3D-nuScenes while maintaining real-time inference speed.
Segment Anything, Even Occluded: Proposes SAMEO, which adapts EfficientSAM as an amodal segmentation decoder for occluded objects. Combined with a newly constructed 300K-image Amodal-LVIS dataset, it achieves zero-shot amodal segmentation performance on COCOA-cls and D2SA that outperforms supervised methods.
Single Pixel Image Classification using an Ultrafast Digital Light Projector: Demonstrates single-pixel imaging (SPI)-based MNIST image classification using an ultrafast microLED-on-CMOS digital light projector, reaching \(>90\%\) classification accuracy at a frame rate of 1.2 kfps. The approach bypasses image reconstruction entirely and classifies directly from temporal optical signals.
SocialMOIF: Multi-Order Intention Fusion for Pedestrian Trajectory Prediction: SocialMOIF proposes a multi-order intention fusion model that comprehensively captures social intentions through a first-order direct interaction layer and a high-order neighbor indirect interaction layer. Combined with a trajectory distribution approximator based on the squeeze theorem and a global trajectory optimizer introducing KANs for the first time, it achieves SOTA performance on multiple datasets including ETH/UCY, SDD, NBA, and NuScenes.
SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving: The paper proposes SOLVE, which achieves feature-level synergy between VLM and end-to-end (E2E) driving models via a shared SQ-Former vision encoder. By employing Trajectory Chain-of-Thought (T-CoT), it utilizes the long-range trajectories from the VLM as prior initialization for the E2E model, achieving a state-of-the-art average L2 error of 0.28m on nuScenes.
SparseAlign: A Fully Sparse Framework for Cooperative Object Detection: SparseAlign proposes the first fully sparse framework for cooperative object detection. By resolving the problems of center feature missing and isolated convolution domains via coordinate-expandable sparse convolution, it outperforms dense BEV-based state-of-the-art methods while reducing communication bandwidth by 98%.
Spatiotemporal Decoupling for Efficient Vision-Based Occupancy Forecasting: EfficientOCF addresses the spatial and temporal biases in occupancy forecasting through spatial decoupling (decomposing 3D occupancy into 2D BEV occupancy plus height values) and temporal decoupling (step-by-step forecasting that associates instances via optical flow, instead of end-to-end prediction). It achieves SOTA 3D occupancy forecasting performance with a fast inference time of 82.33ms.
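The spatial decoupling idea in the EfficientOCF entry (3D occupancy expressed as a 2D BEV mask plus per-cell height values) can be illustrated with a minimal sketch. The function names, the `occ[x][y][z]` layout, and the "height = highest occupied voxel index + 1" convention are assumptions made for illustration, not the paper's implementation.

```python
def decouple_occupancy(occ):
    """Split a binary 3D grid occ[x][y][z] into a 2D BEV mask and a height map.

    Assumption: height is the index of the highest occupied voxel plus one,
    so an empty column has height 0.
    """
    bev, height = [], []
    for plane in occ:
        bev_row, h_row = [], []
        for column in plane:
            top = max((z + 1 for z, v in enumerate(column) if v), default=0)
            bev_row.append(1 if top > 0 else 0)
            h_row.append(top)
        bev.append(bev_row)
        height.append(h_row)
    return bev, height


def recompose_occupancy(bev, height, depth):
    """Rebuild a 3D grid by filling each occupied BEV cell from z=0 up to its height."""
    return [
        [[1 if b and z < h else 0 for z in range(depth)]
         for b, h in zip(bev_row, h_row)]
        for bev_row, h_row in zip(bev, height)
    ]


# Toy 2 x 2 x 3 grid (hypothetical data)
occ = [
    [[1, 1, 0], [0, 0, 0]],
    [[1, 0, 0], [1, 1, 1]],
]
bev, height = decouple_occupancy(occ)
rebuilt = recompose_occupancy(bev, height, depth=3)
```

Under this convention the decomposition is lossy for columns with gaps (for example, an overhanging object above free space), since recomposition fills every voxel below the recorded height.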
Spectral-Geometric Neural Fields for Pose-Free LiDAR View Synthesis: SG-NLF proposes a pose-free LiDAR NeRF framework. By reconstructing smooth geometry with a hybrid spectral-geometric representation, achieving global alignment via a confidence-aware pose graph, and enhancing cross-frame consistency with adversarial learning, it outperforms the state-of-the-art by 35.8% in reconstruction quality and 68.8% in pose accuracy under low-frequency LiDAR scenarios.
SuperPC: A Single Diffusion Model for Point Cloud Completion, Upsampling, Denoising, and Colorization: SuperPC is proposed as the first framework to unify point cloud completion, upsampling, denoising, and colorization within a single conditional diffusion model. It effectively fuses image and point cloud modalities using a three-level conditioning (TLC) mechanism (raw/local/global) and a spatial mixed fusion (SMF) strategy.
T²SG: Traffic Topology Scene Graph for Topology Reasoning in Autonomous Driving: Defines a unified Traffic Topology Scene Graph (T²SG) to explicitly model lanes, traffic signal control relationships, and topology connections among lanes. It also proposes TopoFormer, which achieves precise topology reasoning using a Lane Aggregation Layer (LAL) and a Counterfactual Intervention Layer (CIL), reaching a SOTA of 46.3 OLS on OpenLane-V2.
TacoDepth: Towards Efficient Radar-Camera Depth Estimation with One-Stage Fusion: TacoDepth proposes the first one-stage radar-camera fusion depth estimation framework. By utilizing a graph-based radar structure extractor and a pyramid-based radar fusion module, it bypasses the need for intermediate quasi-dense depth maps, improving accuracy by 12.8% and speed by 91.8%, achieving real-time performance at 37+ FPS.
Temporal Action Detection Model Compression by Progressive Block Drop: A Progressive Block Drop method is proposed to compress Temporal Action Detection (TAD) models from the depth dimension. By progressively removing redundant blocks and utilizing a parameter-efficient cross-depth alignment strategy to recover performance, this method achieves a 25% computation reduction with no performance degradation, and even exhibits performance gains.
Toward Real-World BEV Perception: Depth Uncertainty Estimation via Gaussian Splatting: GaussianLSS introduces depth uncertainty modeling into the classic Lift-Splat-Shoot (LSS) framework. It computes the variance of the depth distribution, converts it into a 3D Gaussian representation, and rasterizes that representation efficiently with Gaussian Splatting to generate uncertainty-aware BEV features. The method achieves state-of-the-art (SOTA) performance among unprojection methods on nuScenes, while being 2.5\(\times\) faster and saving 70% of GPU memory compared to projection methods.
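As a rough illustration of the first step in the GaussianLSS entry (the variance of a depth distribution), the sketch below takes the first two moments of a categorical distribution over discrete depth bins, the kind of per-pixel output LSS-style models predict. The function name, the logits, and the bin centers are hypothetical, and the later conversion to 3D Gaussians and the splatting step are not shown.

```python
import math


def depth_mean_and_variance(logits, bin_centers):
    """Softmax per-pixel depth logits, then compute the mean and variance of depth."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    mean = sum(p * d for p, d in zip(probs, bin_centers))
    var = sum(p * (d - mean) ** 2 for p, d in zip(probs, bin_centers))
    return mean, var


# Uniform distribution over three bins at 1 m, 2 m, 3 m (hypothetical values)
mean, var = depth_mean_and_variance([0.0, 0.0, 0.0], [1.0, 2.0, 3.0])
```

A peaked distribution gives a small variance (a confident depth), while a flat one gives a large variance, which is the quantity an uncertainty-aware lifting step could use to spread a feature along the ray.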
Towards Autonomous Micromobility through Scalable Urban Simulation: This paper proposes URBAN-SIM (a high-performance urban robot learning simulation platform) and URBAN-BENCH (a benchmark with 8 micromobility tasks). By incorporating three core modules (hierarchical urban scene generation, interactive dynamics generation, and asynchronous scene sampling), the framework enables training and evaluation of embodied agents in large-scale, diverse urban environments, offering a systematic simulation approach for advancing autonomous micromobility.
Towards Satellite Image Road Graph Extraction: A Global-Scale Dataset and A Novel Method: This paper constructs Global-Scale, a large-scale global satellite road graph extraction dataset (approximately 20 times larger than the largest existing public dataset), and proposes the SAM-Road++ method. The method uses a node-guided resampling strategy to resolve the training-inference mismatch and an "extended-line" strategy to mitigate road fragmentation caused by occlusions, achieving SOTA performance across multiple datasets.
Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning: This paper proposes Tra-MoE, which uses a sparse gated Mixture-of-Experts (MoE) architecture to train a trajectory prediction model. It effectively fuses large-scale out-of-domain action-free video data with small-scale in-domain robot demonstrations. It also designs an adaptive policy conditioning technique to explicitly align 2D trajectories with visual observations, significantly improving the success rate of robot manipulation in both simulation and real-world scenarios.
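The sparse gated Mixture-of-Experts routing mentioned in the Tra-MoE entry can be sketched as follows. This assumes the common top-k softmax gating formulation; the entry does not say which gating variant Tra-MoE uses, and the function name, gate weights, and toy experts here are hypothetical.

```python
import math


def sparse_moe(x, gate_weights, experts, k=2):
    """Route input x to the k highest-scoring experts and mix their outputs."""
    # One gate logit per expert: dot product of a gate row with the input.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    # Keep only the top-k experts (this is what makes the layer sparse).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only.
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    out = None
    for i in top:
        y = experts[i](x)  # unselected experts are never evaluated
        weighted = [exps[i] / z * yi for yi in y]
        out = weighted if out is None else [o + w for o, w in zip(out, weighted)]
    return out, top


# Three toy experts that scale the input by 1, 2, and 3 (hypothetical)
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
out, top = sparse_moe([1.0, 2.0], gate_weights, experts, k=2)
```

Because only the selected experts run, compute per input stays roughly constant as more experts are added, which is one reason such layers are often used when mixing heterogeneous data sources.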
Trajectory Mamba: Efficient Attention-Mamba Forecasting Model Based on Selective SSM: Proposes Trajectory Mamba (Tamba), which redesigns the self-attention mechanism based on selective State Space Models (SSMs) to achieve linear-time-complexity trajectory forecasting. By utilizing a joint polyline encoding strategy and a cross-state space decoder, it maintains prediction accuracy while reducing parameters by over 40% and decreasing FLOPs by 4x.
Uncertainty-Instructed Structure Injection for Generalizable HD Map Construction: This paper proposes UIGenMap, which obtains explicit structural features through an uncertainty-aware perspective view (PV) detection branch, constructs PV prompts weighted by uncertainty and injects them into the BEV map decoder, and incorporates Mimic Query distillation for real-time inference. It achieves a +5.7 mAP improvement in generalization performance on geographically disjoint data splits.
UniScene: Unified Occupancy-centric Driving Scene Generation: This paper proposes UniScene, a two-stage driving scene generation framework that uses occupancy grids as the unified intermediate representation. An Occupancy Diffusion Transformer generates semantic occupancy from BEV layouts, which is then rendered into semantic and depth maps via Gaussian Splatting to condition dual diffusion models for generating video and LiDAR. UniScene achieves an FVD of 71.94 (compared to 122.70 for the previous SOTA, Drive-WM) and improves downstream 3D detection by 3.62% mAP via data augmentation.
Unlocking Generalization Power in LiDAR Point Cloud Registration: This work proposes the UGP framework, which significantly improves the generalization capability of LiDAR point cloud registration in cross-range and cross-dataset scenarios by eliminating cross-attention and introducing progressive self-attention and BEV feature fusion.
V2X-R: Cooperative LiDAR-4D Radar Fusion with Denoising Diffusion for 3D Object Detection: This paper constructs V2X-R, the first V2X simulation dataset containing three modalities (LiDAR, camera, and 4D radar). It proposes a cooperative LiDAR-4D radar fusion pipeline and a Multi-modal Denoising Diffusion (MDD) module. By leveraging weather-robust 4D radar features to guide a diffusion model in denoising noisy LiDAR features, the approach improves detection performance by up to 5.73%/6.70% in foggy/snowy conditions with almost no impact on normal weather performance.
VIRD: View-Invariant Representation through Dual-Axis Transformation for Cross-View Pose Estimation: VIRD constructs view-invariant representations through a dual-axis transformation (polar transformation + context-enhanced positional attention) to achieve omnidirectional cross-view pose estimation without orientation priors, reducing position and orientation errors on KITTI by 50.7% and 76.5%, respectively.
VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving: This paper proposes VisionPAD, a vision-centric self-supervised pre-training framework. It replaces volume rendering with anchor-based 3D Gaussian Splatting to reconstruct multi-view images, and introduces self-supervised voxel velocity estimation combined with multi-frame photometric consistency constraints to learn motion cues and 3D geometry. Completely independent of LiDAR depth supervision, it significantly outperforms existing pre-training methods on three downstream tasks: 3D detection, occupancy prediction, and map segmentation.
VoteFlow: Enforcing Local Rigidity in Self-Supervised Scene Flow: VoteFlow incorporates local rigid motion constraints as an inductive bias into self-supervised scene flow estimation models by introducing a lightweight module based on differentiable voting within the network architecture. It outperforms previous state-of-the-art self-supervised methods on the Argoverse 2 and Waymo datasets with extremely low computational overhead.
WeatherGen: A Unified Diverse Weather Generator for LiDAR Point Clouds via Spider Mamba Diffusion: This paper proposes WeatherGen, the first unified diffusion generation framework for diverse adverse weather LiDAR data. By preserving the physical structure of LiDAR through the Spider Mamba generator and achieving controllable weather generation via a contrastive learning-based controller, it significantly outperforms physics-based simulation methods in both data fidelity and downstream detection performance.
Zero-Shot 4D Lidar Panoptic Segmentation: This paper proposes SAL-4D (Segment Anything in Lidar-4D), which utilizes a multimodal sensor setup as a bridge to distill Video Object Segmentation (VOS) models and CLIP vision-language features into the LiDAR space. This achieves zero-shot 4D LiDAR panoptic segmentation, outperforming prior methods on 3D zero-shot LPS by \(5+\) PQ.
ZeroVO: Visual Odometry with Minimal Assumptions: This paper proposes ZeroVO, a Transformer-based monocular visual odometry method. Through a calibration-free geometry-aware network architecture, language prior integration, and a semi-supervised training paradigm, it achieves over 30% improvement in zero-shot generalization performance across KITTI, nuScenes, Argoverse 2, and a self-built GTA dataset.