
🚗 Autonomous Driving

📹 ICCV2025 · 98 paper notes

3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation

This paper proposes PGA, the first physical adversarial attack framework based on 3D Gaussian Splatting (3DGS). Through fast and accurate target reconstruction, resolution of mutual occlusion and self-occlusion among Gaussians to ensure cross-viewpoint consistency, and a min-max background adversarial optimization strategy that filters out non-robust adversarial features, PGA generates cross-view robust physical adversarial camouflage and substantially outperforms state-of-the-art methods in both the digital and physical domains.
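
As a rough illustration of the min-max idea, the sketch below alternates gradient ascent on the background (the worst case for the attack) with gradient descent on the camouflage texture. Here `render_fn` and `det_conf` are hypothetical stand-ins for the paper's 3DGS renderer and detector confidence score, not its actual API.

```python
import torch

def pga_minmax_step(texture, background, render_fn, det_conf,
                    lr_bg=1e-2, lr_tex=1e-2, inner_steps=5):
    # Inner maximization: perturb the background so detection becomes easiest,
    # i.e. the hardest case for the camouflage.
    bg = background.clone().requires_grad_(True)
    for _ in range(inner_steps):
        conf = det_conf(render_fn(texture.detach(), bg))
        grad, = torch.autograd.grad(conf, bg)
        bg = (bg + lr_bg * grad.sign()).detach().requires_grad_(True)
    # Outer minimization: update the texture against that hardest background.
    tex = texture.clone().requires_grad_(True)
    conf = det_conf(render_fn(tex, bg.detach()))
    grad, = torch.autograd.grad(conf, tex)
    return (tex - lr_tex * grad.sign()).detach(), bg.detach()
```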

3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views

This paper presents 3DRealCar, the first large-scale real-world 3D vehicle dataset, comprising high-resolution (1920×1440) 360-degree RGB-D scans of 2,500 vehicles from 100+ brands (averaging roughly 200 views per car), captured under three lighting conditions (standard, reflective, and low-light). The dataset provides rich annotations, including point clouds and 13-category vehicle parsing maps, supports tasks such as 3D reconstruction, detection, and generation, and benchmarks multiple 3D reconstruction methods, revealing significant reconstruction challenges under reflective and low-light conditions.

4DSegStreamer: Streaming 4D Panoptic Segmentation via Dual Threads

This paper proposes 4DSegStreamer, a streaming 4D panoptic segmentation framework built upon a dual-thread system (predictive thread + inference thread). It achieves real-time, high-quality 4D panoptic segmentation through geometric and motion memory maintenance, ego-pose prediction, and inverse forward flow iteration.
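
The dual-thread split can be pictured with ordinary Python threads: a predictive thread keeps the geometric/motion memory and a forecast ego pose fresh, while the inference thread reads a consistent snapshot and never blocks on per-frame memory updates. All `update_*`/`segment` functions below are placeholder stubs, not the paper's implementation.

```python
import threading
import queue

# Placeholder stubs standing in for the paper's components (not its actual API):
update_geometry = lambda mem, frame: frame      # fuse frame into geometric memory
update_motion   = lambda mem, frame: frame      # fuse frame into motion memory
forecast_pose   = lambda motion: motion         # predict ego pose from motion memory
segment         = lambda snap: f"panoptic({snap['geo']})"

frames = queue.Queue()
memory, lock = {"geo": None, "motion": None, "pose": None}, threading.Lock()

def predictive_loop():
    # Slow path: runs in the background, keeping memory up to date.
    while (frame := frames.get()) is not None:
        with lock:
            memory["geo"] = update_geometry(memory["geo"], frame)
            memory["motion"] = update_motion(memory["motion"], frame)
            memory["pose"] = forecast_pose(memory["motion"])

worker = threading.Thread(target=predictive_loop)
worker.start()
frames.put("frame-0")
frames.put(None)
worker.join()

with lock:                   # Fast path: inference grabs a snapshot of memory
    snapshot = dict(memory)  # instead of waiting for new-frame processing.
print(segment(snapshot))
```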

6DOPE-GS: Online 6D Object Pose Estimation using Gaussian Splatting

This paper proposes 6DOPE-GS, a CAD-model-free online method that jointly optimizes 6D object pose and 3D reconstruction by leveraging the efficient differentiable rendering of 2D Gaussian Splatting (2DGS). Through dynamic keyframe selection and opacity-percentile-based density control, it achieves an approximately 5× speedup over BundleSDF while maintaining state-of-the-art accuracy.

A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds

This paper proposes an SfM-free constrained optimization framework that jointly optimizes camera parameters and 3DGS scene reconstruction from coarse poses and noisy point clouds produced by multi-camera SLAM systems, via camera pose decomposition, sensitivity-based pre-conditioning, log-barrier constraints, and geometric constraints.

ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation

This paper proposes ACAM-KD, an adaptive student-teacher cooperative attention masking framework for knowledge distillation. Two modules, Student-Teacher Cross-Attention Feature Fusion (STCA-FF) and Adaptive Spatial-Channel Masking (ASCM), dynamically adjust the distillation focus to the student's evolving learning state. On COCO detection, RetinaNet-R50 distilled from R101 reaches 41.2 mAP (+1.4 over the prior SOTA); on Cityscapes segmentation, DeepLabV3-MBV2 improves mIoU by 3.09.

AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving

This paper proposes AD-GS, a self-supervised autonomous driving scene rendering framework based on 3D Gaussian Splatting. Its core innovation is modeling dynamic object motion by combining locally-aware learnable B-spline curves with globally-aware trigonometric functions, coupled with simplified binary pseudo-2D segmentation for robust scene decomposition. Without relying on manual 3D annotations, AD-GS substantially outperforms existing self-supervised methods and approaches the performance of annotation-dependent approaches.
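
The local-global motion model is easy to picture: an object's position at time t is the sum of a learnable B-spline (local, non-periodic motion) and a low-order trigonometric series (global, periodic motion). Below is a minimal NumPy sketch with made-up shapes; in the paper the control points, amplitudes, frequencies, and phases would be learned jointly with the Gaussians.

```python
import numpy as np

def cubic_bspline(ctrl, t):
    """Evaluate a uniform cubic B-spline at t in [0, 1] (local motion term)."""
    n = len(ctrl) - 3
    i = min(int(t * n), n - 1)
    u = t * n - i
    B = np.array([(1 - u)**3,
                  3*u**3 - 6*u**2 + 4,
                  -3*u**3 + 3*u**2 + 3*u + 1,
                  u**3]) / 6.0
    return B @ ctrl[i:i + 4]

def object_position(t, ctrl, amps, freqs, phases):
    """Hypothetical AD-GS-style motion: B-spline (local) + trig series (global)."""
    local = cubic_bspline(ctrl, t)
    periodic = sum(a * np.sin(2 * np.pi * f * t + p)
                   for a, f, p in zip(amps, freqs, phases))
    return local + periodic

# Toy usage: 7 control points in 3D plus one sinusoidal component.
ctrl = np.random.randn(7, 3)
pos = object_position(0.3, ctrl, amps=[0.1 * np.ones(3)], freqs=[1.0], phases=[0.0])
```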

AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving

AdaDrive presents the first LLM-augmented autonomous driving framework with an adaptive slow-fast architecture. Two adaptive connectors dynamically determine when to activate the LLM (Connector-W) and how much the LLM contributes (Connector-H), achieving SOTA performance on language-grounded driving benchmarks (driving score 80.9%) while reducing inference latency to 189 ms and GPU memory to 6.79 GB.

Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts

This paper proposes DUO (Dual Uncertainty Optimization), the first test-time adaptation framework that jointly minimizes semantic uncertainty and geometric uncertainty, achieving robust monocular 3D object detection via conjugate focal loss and normal field constraints.

AGO: Adaptive Grounding for Open World 3D Occupancy Prediction

This paper proposes the AGO framework, which handles known categories via noise-augmented grounding training and unknown categories via a modality adapter for adaptive alignment. An information entropy-based open-world recognizer dynamically selects the optimal features at inference time. AGO surpasses VEON by 4.09 mIoU on the Occ3D-nuScenes self-supervised benchmark while exhibiting open-world zero-shot/few-shot transfer capability.

ALOcc: Adaptive Lifting-Based 3D Semantic Occupancy and Cost Volume-Based Flow Predictions

This paper proposes ALOcc, a framework that achieves state-of-the-art performance on multiple 3D semantic occupancy and occupancy flow prediction benchmarks through three innovations: an occlusion-aware adaptive lifting mechanism, a semantic-prototype-based occupancy head, and a BEV cost-volume-based flow prediction module. ALOcc maintains high inference speed and offers model variants ranging from real-time to high-accuracy configurations.

Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations

LiMA proposes a long-horizon image-to-LiDAR memory aggregation framework that explicitly leverages spatiotemporal cues in LiDAR sequences via three modules—cross-view aggregation, long-term feature propagation, and cross-sequence memory alignment—to enhance LiDAR representation learning, achieving substantial improvements over existing pre-training methods on semantic segmentation and 3D object detection.

CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting

To address cross-view semantic inconsistency caused by occlusion, blur, and viewpoint variation in 2D-prior-based 3D semantic field reconstruction, this paper proposes the CCL-LGS framework, which employs a zero-shot tracker for cross-view mask association and a Contrastive Codebook Learning (CCL) module to distill semantically compact intra-class and discriminative inter-class features.

CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving

CoDa-4DGS augments the 4D Gaussian Splatting (4DGS) framework with context awareness (self-supervised 4D semantic features from 2D foundation models) and temporal deformation awareness (tracking per-Gaussian deformation between adjacent frames). By jointly encoding semantic and deformation features as dynamic compensation cues for each Gaussian, the method captures finer-grained details in autonomous driving dynamic scenes and surpasses existing self-supervised approaches.

CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving

This paper presents CoLMDriver, the first end-to-end LLM-based cooperative driving system. Through an Actor-Critic language negotiation module and an intention-guided trajectory generator, it achieves an 11% higher success rate than existing methods across diverse V2V interaction scenarios.

Controllable 3D Outdoor Scene Generation via Scene Graphs

This work proposes the first method to use scene graphs as control signals for large-scale 3D outdoor scene generation. A GNN encodes sparse scene graphs into BEV embedding maps, which are then fed into a cascaded 2D→3D discrete diffusion model to generate semantic 3D scenes. An accompanying interactive system allows users to directly edit scene graphs to control the generation.

CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception

This paper proposes CoopTrack, the first fully instance-level end-to-end cooperative 3D multi-object tracking framework. It achieves cross-agent instance matching and fusion via a learnable graph attention association module and multi-dimensional feature extraction, reaching state-of-the-art performance on V2X-Seq.

Counting Stacked Objects

The paper decomposes the stacked object counting problem into two sub-problems — volume estimation and occupancy ratio estimation — solving the former via multi-view 3D reconstruction and the latter via a depth-map-driven neural network that infers interior occupancy from visible surfaces. This is the first method to accurately count largely invisible stacked objects, significantly outperforming humans.
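
The decomposition reduces counting to one line of arithmetic once the two estimates are in hand; the numbers below are invented purely for illustration.

```python
# count ≈ container volume × occupancy ratio ÷ single-object volume
total_volume = 0.048      # m^3, from multi-view 3D reconstruction of the stack
occupancy    = 0.62       # fraction of that volume actually filled (network output)
obj_volume   = 3.3e-4     # m^3, volume of one object (e.g., a small box)

count = total_volume * occupancy / obj_volume
print(round(count))       # ≈ 90 objects
```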

CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection

This paper proposes CVFusion — the first two-stage 4D radar-camera fusion network for 3D object detection. Stage 1 generates high-recall proposals via a Radar-Guided Iterative (RGIter) BEV fusion module, while Stage 2 refines each proposal by aggregating heterogeneous multi-view features through Point-Guided Fusion (PGF) and Grid-Guided Fusion (GGF). CVFusion achieves mAP improvements of 9.10% and 3.68% on VoD and TJ4DRadSet, respectively.

DAMap: Distance-aware MapNet for High Quality HD Map Construction

This paper identifies two inherent deficiencies in current HD map construction methods regarding high-quality prediction — inappropriate classification labels and suboptimal task-specific features — and proposes DAMap (comprising three components: DAFL, HLS, and TMDA) to systematically address task misalignment, achieving consistent gains of 2–3 mAP across multiple baselines on nuScenes and Argoverse 2.

DCHM: Depth-Consistent Human Modeling for Multiview Detection

This paper proposes DCHM, a depth-consistent human modeling framework that requires no 3D annotations. It generates pseudo depth labels via superpixel-level Gaussian splatting to fine-tune a monocular depth estimation network, and combines multiview label matching to achieve high-accuracy pedestrian detection under sparse-view and heavily occluded scenarios. DCHM achieves 84.2% MODA on Wildtrack and improves MODP by 31.2% over UMPD.

Decoupled Diffusion Sparks Adaptive Scene Generation

This paper proposes Nexus, an adaptive driving scene generation framework based on decoupled diffusion. By assigning independent noise states to individual tokens, Nexus unifies goal-oriented and reactive generation, reduces displacement error by 40%, and introduces Nexus-Data, a dataset comprising 540 hours of safety-critical driving scenarios.
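
The key mechanical difference from standard diffusion is that the noise timestep is sampled per token rather than per scene, so some tokens (e.g., goals) can be held noise-free as conditions while others are denoised from pure noise. A minimal sketch, with a toy cosine schedule and hypothetical shapes:

```python
import torch

T = 1000
tokens = torch.randn(8, 16)                 # 8 agent tokens, 16-dim each
t = torch.randint(0, T, (8,))               # independent timestep per token
t[0] = 0                                    # token 0 is a fixed goal: kept clean

alpha_bar = torch.cos(t / T * torch.pi / 2) ** 2   # toy cosine noise schedule
noise = torch.randn_like(tokens)
noisy = (alpha_bar.sqrt()[:, None] * tokens
         + (1 - alpha_bar).sqrt()[:, None] * noise)
# The denoiser would be conditioned on the per-token t, e.g. eps = model(noisy, t).
```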

Detect Anything 3D in the Wild

DetAny3D is a promptable 3D detection foundation model that transfers prior knowledge from two 2D foundation models—SAM and depth-pretrained DINO—via a proposed 2D Aggregator and Zero-Embedding Mapping (ZEM) mechanism, enabling stable 2D-to-3D knowledge transfer. Using only monocular images, it achieves zero-shot 3D object detection across arbitrary scenes and camera configurations, surpassing baselines by up to 21% AP3D on novel categories.

DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation

This paper proposes DiST-4D, the first feed-forward 4D driving scene generation framework. By disentangling temporal prediction (DiST-T) and spatial novel view synthesis (DiST-S) into two separate diffusion processes, with metric depth serving as a geometric bridge, the method achieves state-of-the-art temporal video generation (FVD 22.67) and spatial NVS (FID 10.12) on nuScenes simultaneously, without any per-scene optimization.

Distilling Diffusion Models to Efficient 3D LiDAR Scene Completion

This paper proposes ScoreLiDAR, a diffusion model distillation framework for 3D LiDAR scene completion. By incorporating scene-wise and point-wise structural losses to guide distillation, it reduces completion time from 30.55 seconds to 5.37 seconds (>5× speedup) while surpassing all state-of-the-art methods on SemanticKITTI.

DONUT: A Decoder-Only Model for Trajectory Prediction

DONUT draws inspiration from the decoder-only architecture of LLMs and proposes a unified autoregressive model for processing both historical and future trajectories, coupled with an overprediction strategy to improve anticipation of the distant future. It achieves state-of-the-art performance on the Argoverse 2 benchmark.
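
A decoder-only trajectory model reduces to next-waypoint prediction under a causal mask, rolled out autoregressively at inference. The sketch below is a generic PyTorch stand-in; the layer sizes and raw 2D-waypoint tokenization are assumptions, not DONUT's actual architecture.

```python
import torch
import torch.nn as nn

class TrajDecoder(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.embed = nn.Linear(2, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, 2)

    def forward(self, xy):  # xy: (B, T, 2) waypoints
        mask = nn.Transformer.generate_square_subsequent_mask(xy.size(1))
        h = self.backbone(self.embed(xy), mask=mask)   # causal self-attention
        return self.head(h)                            # next-step prediction

model = TrajDecoder()
traj = torch.randn(1, 10, 2)                 # 10 observed waypoints
for _ in range(6):                           # autoregressive future rollout
    nxt = model(traj)[:, -1:]                # predict the next waypoint
    traj = torch.cat([traj, nxt], dim=1)
```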

DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving

This paper proposes DriveX, a self-supervised world model framework that learns transferable general scene representations in a BEV latent space via Omni Scene Modeling (OSM)—jointly supervising 3D point cloud prediction, 2D semantic representation, and image generation. A Future Spatial Attention (FSA) paradigm is designed to seamlessly integrate predicted future states into downstream tasks such as occupancy prediction, flow estimation, and end-to-end driving, achieving state-of-the-art performance across multiple tasks.

DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic

This paper proposes DuET, a framework that, for the first time, addresses both class-incremental and domain-incremental object detection simultaneously (Dual Incremental Object Detection, DuIOD) via exemplar-free Task Arithmetic model merging. It introduces a Directional Consistency Loss to mitigate sign conflicts, achieving substantial improvements over existing methods on the Pascal Series and Diverse Weather Series benchmarks.

EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding

This paper proposes EmbodiedOcc, a framework that leverages 3D semantic Gaussians as a global memory to enable online indoor 3D occupancy prediction from monocular visual input through progressive exploration and local updating.

EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting

This paper proposes the Explicit Motion Decomposition (EMD) module, which models the motion characteristics of each Gaussian primitive via learnable motion embeddings and a dual-scale deformation framework. As a plug-and-play module, EMD integrates seamlessly into both self-supervised and supervised street-view Gaussian splatting methods, achieving state-of-the-art performance under the self-supervised setting on the Waymo and KITTI datasets.
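
Conceptually, each Gaussian carries a learnable motion embedding that two decoders turn into displacements at different scales. A minimal sketch with hypothetical module sizes (the paper's dual-scale deformation design is more elaborate):

```python
import torch
import torch.nn as nn

N, E = 1024, 16
embed = nn.Parameter(torch.randn(N, E))            # per-Gaussian motion embeddings
coarse = nn.Sequential(nn.Linear(E + 1, 32), nn.ReLU(), nn.Linear(32, 3))
fine   = nn.Sequential(nn.Linear(E + 1, 32), nn.ReLU(), nn.Linear(32, 3))

def deform(xyz, t):                                # xyz: (N, 3), t in [0, 1]
    z = torch.cat([embed, torch.full((N, 1), t)], dim=1)
    return xyz + coarse(z) + fine(z)               # dual-scale displacement

xyz = torch.randn(N, 3)
xyz_t = deform(xyz, 0.5)                           # positions at time t = 0.5
```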

Epona: Autoregressive Diffusion World Model for Autonomous Driving

This paper proposes Epona, an autoregressive diffusion world model that achieves a unified framework for high-resolution long-horizon driving video generation and real-time trajectory planning through decoupled spatiotemporal modeling and asynchronous multimodal generation.

ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models

This paper proposes ETA, a dual-system framework that shifts large-model computation from the current frame to preceding time steps and applies batch inference, enabling large-model features to be available at every frame. ETA achieves a driving score of 69.53 on Bench2Drive with a 50 ms latency, improving the state of the art by 8%.

EVT: Efficient View Transformation for Multi-Modal 3D Object Detection

This paper proposes EVT, a framework that achieves efficient LiDAR-guided view transformation via Adaptive Sampling and Adaptive Projection (ASAP), combined with group-wise mixed query selection and geometry-aware cross-attention, attaining state-of-the-art performance of 75.3% NDS on the nuScenes test set at real-time inference speed.

Extrapolated Urban View Synthesis Benchmark

This paper presents the first Extrapolated Urban View Synthesis (EUVS) benchmark, which leverages publicly available multi-traversal/multi-vehicle/multi-camera datasets to systematically evaluate the generalization ability of 3DGS and NeRF methods under extrapolation settings, revealing that current methods severely overfit to training viewpoints.

Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics

This paper proposes a "First Reasoning, Then Forecasting" strategy that infers reward distributions over driving intentions via Query-centric Inverse Reinforcement Learning (QIRL), and couples this with a Bi-Mamba-enhanced DETR-style trajectory decoder to significantly improve prediction confidence and accuracy.

Free-running vs. Synchronous: Single-Photon Lidar for High-flux 3D Imaging

This paper systematically compares free-running and synchronous single-photon lidar (SPL) operating modes for depth imaging under high-flux conditions, proposes an efficient joint maximum likelihood estimator and a score-based depth regularization algorithm SSDR, and demonstrates that the free-running mode consistently outperforms the synchronous mode across diverse flux levels and signal-to-background ratios.

SDKD: Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting

This paper proposes SDKD (Spectral Decoupled Knowledge Distillation), a framework that leverages a frequency-aware teacher model and a frequency-aligned distillation strategy to transfer multi-scale spectral knowledge from complex spatiotemporal forecasting models to lightweight student networks, achieving up to 81.3% MSE reduction on the Navier-Stokes dataset.

Future-Aware Interaction Network For Motion Forecasting

This paper proposes FINet, which incorporates latent future trajectories into the scene encoding stage for joint optimization, while introducing the Mamba architecture as a replacement for Transformers in spatiotemporal modeling, achieving efficient and accurate motion forecasting.

GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow

This paper proposes GaussianFlowOcc, which replaces dense voxel grids with sparse 3D Gaussian distributions for occupancy estimation. A Gaussian Transformer is introduced for efficient scene modeling, and a Temporal Module estimates per-Gaussian 3D temporal flow to handle dynamic objects. The method substantially outperforms existing approaches on nuScenes under weak supervision (51%+ mIoU improvement) while achieving 50× faster inference.

GaussRender: Learning 3D Occupancy with Gaussian Rendering

This paper proposes GaussRender, a plug-and-play differentiable Gaussian rendering module that projects predicted and ground-truth 3D occupancy onto 2D views and enforces semantic and depth consistency constraints, thereby eliminating visual artifacts such as floating voxels. The approach achieves significant improvements in geometric fidelity across multiple benchmarks, with particularly pronounced gains on surface-sensitive metrics such as RayIoU.

Generative Active Learning for Long-tail Trajectory Prediction via Controllable Diffusion Model

This paper proposes GALTraj, the first method to apply generative active learning to trajectory prediction. During training, it dynamically identifies tail samples on which the model fails, and employs a controllable diffusion model to synthesize new samples that preserve tail-behavior characteristics while complying with traffic rules. This effectively alleviates long-tail data imbalance, improving both tail-case performance and overall prediction accuracy.

GM-MoE: Low-Light Enhancement with Gated-Mechanism Mixture-of-Experts

This paper is the first to introduce Mixture-of-Experts (MoE) networks into low-light image enhancement (LLIE), employing three specialized sub-expert networks to handle color restoration, detail enhancement, and high-level feature enhancement respectively. A dynamic gating mechanism adaptively adjusts the contribution of each expert, achieving state-of-the-art PSNR performance on five benchmark datasets.

GS-LIVM: Real-Time Photo-Realistic LiDAR-Inertial-Visual Mapping with Gaussian Splatting

This paper proposes GS-LIVM, the first real-time photo-realistic LiDAR-inertial-visual mapping framework designed for large-scale unbounded outdoor scenes. It addresses the problem of sparse and non-uniform LiDAR point clouds via voxel-level Gaussian Process Regression (Voxel-GPR), and leverages a covariance-centric design to rapidly initialize 3D Gaussian parameters. The method achieves state-of-the-art mapping efficiency and rendering quality across multiple outdoor datasets.

GS-Occ3D: Scaling Vision-only Occupancy Reconstruction with Gaussian Splatting

This paper proposes GS-Occ3D, a scalable vision-only occupancy reconstruction framework that achieves full-dataset auto-labeling on Waymo through Octree-based Gaussian Surfel representation and a three-layer decomposed modeling of ground, static background, and dynamic objects. The resulting labels enable downstream occupancy prediction models to achieve zero-shot generalization comparable to or better than LiDAR-based annotations.

Hermes: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation

This paper proposes Hermes, the first unified driving world model that simultaneously performs 3D scene understanding (VQA/captioning) and future scene generation (point cloud prediction). By leveraging BEV representations and world queries to inject LLM world knowledge into future scene generation, Hermes reduces 3s point cloud generation error by 32.4% and improves scene understanding CIDEr by 8.0%.

IGL-Nav: Incremental 3D Gaussian Localization for Image-goal Navigation

This paper proposes IGL-Nav, a system that builds a renderable scene memory via incremental 3D Gaussian representations and efficiently solves the image-goal navigation problem through a coarse-to-fine localization strategy, while supporting a free-view setting with arbitrary camera viewpoints.

INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception

This paper proposes INSTINCT, a LiDAR-based instance-level collaborative perception framework that achieves state-of-the-art performance across multiple datasets through three core modules — quality-aware filtering, dual-branch detection routing, and cross-agent local instance fusion — while reducing communication bandwidth to approximately 1/264–1/281 of that required by existing methods.

LangTraj: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation

LangTraj is proposed as the first diffusion-based trajectory simulator that incorporates natural language as a training-time condition. It is accompanied by the InterDrive dataset, containing 150K human-annotated interaction behaviors, enabling language-controllable multi-agent interaction simulation and safety-critical scenario generation.

Language Driven Occupancy Prediction (LOcc)

This paper proposes LOcc, an effective and generalizable open-vocabulary occupancy (OVO) prediction framework. Its core contribution is a semantic transitive labeling pipeline (LVLM + OV-Seg → LiDAR → voxel) that generates dense, fine-grained 3D language occupancy pseudo-GT, replacing the noisy and sparse intermediate feature distillation used in prior work. LOcc comprehensively surpasses state-of-the-art methods on Occ3D-nuScenes.

Leveraging 2D Priors and SDF Guidance for Dynamic Urban Scene Rendering

This paper proposes UGSDF, a method that jointly learns an SDF network and 3D Gaussian Splatting to model dynamic objects in urban scenes. Using only 2D priors (a depth network and a point tracker), UGSDF achieves state-of-the-art rendering quality without requiring LiDAR data, 3D motion annotations, or human body templates.

SkyDiffusion: Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis

This paper proposes SkyDiffusion, which combines a Curved-BEV transformation with a BEV-guided diffusion model to achieve high-quality cross-view synthesis from ground-level street view images to aerial/satellite imagery, and introduces the Ground2Aerial-3 multi-scene dataset.

LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal

This paper proposes LightsOut, a diffusion-based image outpainting framework that enhances existing single-image flare removal (SIFR) methods by predicting and reconstructing out-of-frame light sources. It serves as a plug-and-play preprocessing module that improves arbitrary SIFR models without requiring additional training.

Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation

This paper proposes InfGen, a unified autoregressive next-token prediction model that interleaves closed-loop motion simulation with scenario generation (dynamic agent insertion and removal), achieving for the first time stable long-term (30-second) traffic simulation. InfGen reaches state-of-the-art performance on short-term benchmarks and significantly outperforms all existing methods on long-term tasks.

LookOut: Real-World Humanoid Egocentric Navigation

LookOut proposes predicting future 6D head pose sequences (translation + rotation) over a 4.5-second horizon from first-person video with known poses. It backprojects DINOv2 features into 3D space and compresses them into a BEV representation to capture scene geometry and semantics. Trained on a self-collected 4-hour real-world dynamic scene dataset, the model learns human-like navigation behaviors, including waiting, detouring, and looking left and right before crossing the street.

MAESTRO: Task-Relevant Optimization via Adaptive Feature Enhancement and Suppression for Multi-task 3D Perception

This paper proposes the MAESTRO framework, which generates task-specific features and suppresses inter-task interference in multi-task 3D perception through three modules: Class-wise Prototype Generator (CPG), Task-Specific Feature Generator (TSFG), and Scene Prototype Aggregator (SPA). MAESTRO simultaneously surpasses single-task models on 3D object detection, BEV map segmentation, and 3D occupancy prediction.

MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding

This paper proposes MCAM, which constructs a causal structure between visual and language modalities via a Driving State Directed Acyclic Graph (DSDAG), combined with a multi-level feature extractor and a causal analysis module, for behavior description and causal reasoning in ego-vehicle-level driving video understanding.

MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion

This paper proposes MGSfM, a global Structure-from-Motion (SfM) framework for multi-camera systems. By exploiting multi-camera rigid constraints through two core modules — Decoupled Multi-camera Rotation Averaging (DMRA) and Multi-camera Geometry driven Position estimation (MGP) — MGSfM achieves accuracy comparable to or better than incremental SfM on large-scale scenes while being approximately 10× faster.

Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration

Mixed Signals is the first real-world V2X dataset featuring heterogeneous LiDAR configurations (varying mounting heights and tilt angles), collected by 3 autonomous vehicles and one roadside unit. It provides 45,100 point cloud frames and 240,600 annotated bounding boxes, and is also the first V2X dataset collected in a left-hand traffic country (Australia).

MonoSOWA: Scalable Monocular 3D Object Detector Without Human Annotations

This paper proposes the first monocular 3D object detection method that requires no human annotations of any kind (neither 2D nor 3D). A novel Local Object Motion Model (LOMM) is introduced to disentangle inter-frame motion sources, enabling auto-labeling at a speed ~700× faster than prior work. A Canonical Object Space (COS) is further proposed to enable multi-dataset training across heterogeneous camera configurations.

Occupancy Learning with Spatiotemporal Memory

This paper proposes ST-Occ, a scene-level spatiotemporal occupancy representation learning framework. Through a Unified Temporal Modeling paradigm, it employs a spatiotemporal memory bank defined in scene coordinates along with an uncertainty- and dynamics-aware memory attention mechanism. ST-Occ outperforms the prior state of the art by 3 mIoU on the Occ3D benchmark while reducing temporal inconsistency by 29%.

OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving

This paper proposes OD-RASE, a framework that constructs a road traffic expert knowledge ontology to filter infrastructure improvement proposals generated by LVLMs, enabling proactive identification of accident-prone road structures and generation of improvement recommendations.

Passing the Driving Knowledge Test

This paper introduces DriveQA — the first large-scale text-and-visual dual-modality driving knowledge test benchmark (26K text QA + 448K image QA) — to systematically evaluate LLMs/MLLMs on driving knowledge including traffic regulations, sign recognition, and right-of-way judgment. The benchmark reveals significant deficiencies in numerical reasoning and complex right-of-way scenarios, and demonstrates that DriveQA pretraining yields generalization gains on downstream driving tasks.

PBCAT: Patch-Based Composite Adversarial Training against Physically Realizable Attacks on Object Detection

This paper proposes PBCAT (Patch-Based Composite Adversarial Training), which combines small-area gradient-guided adversarial patches with global imperceptible perturbations for adversarial training, providing unified defense against multiple physically realizable attacks (adversarial patches and adversarial textures). PBCAT achieves a 29.7% AP improvement over the previous SOTA defense on pedestrian detection tasks.

ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation

Building upon ReconDreamer, ReconDreamer++ introduces a Novel Trajectory Deformation Network (NTDNet) to bridge the domain gap between generated data and real observations, and independently models the ground plane to preserve geometric priors. On Waymo, the method achieves performance on par with Street Gaussians on original trajectories while improving NTA-IoU by 6.1% and FID by 23.0% on novel trajectories.

Referring Expression Comprehension for Small Objects

This work proposes the SOREC dataset (100K referring expression–bounding box pairs for small objects) and the PIZA adapter module (Progressive-Iterative Zooming Adapter), enabling pretrained models such as GroundingDINO to autoregressively zoom in on extremely small targets, achieving substantial accuracy gains for small-object REC in autonomous driving scenarios.

RESCUE: Crowd Evacuation Simulation via Controlling SDM-United Characters

This paper proposes RESCUE, the first online SDM (Sensing–Decision–Motion) unified 3D evacuation simulation framework, integrating a 3D adaptive social force model and a personalized gait controller to achieve real-time personalized evacuation simulation for hundreds of agents.

Resonance: Learning to Predict Social-Aware Pedestrian Trajectories as Co-Vibrations

This paper proposes Resonance (Re), a physics-inspired model that decomposes pedestrian trajectory prediction into a superposition of multiple independent "vibration" components (a linear base, a self-bias, and a resonance bias), each representing the agent's response to a single cause. Social interactions are learned by exploiting spectral similarity between trajectories to simulate "resonance" phenomena, enhancing interpretability. The method is validated on the ETH-UCY, SDD, NBA, and nuScenes benchmarks.
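
The superposition view is simple to state in code: the prediction is a constant-velocity linear base plus learned bias terms. In the sketch below the self-bias and resonance bias are placeholder arrays where the model would plug in network outputs:

```python
import numpy as np

def predict(obs, self_bias, resonance_bias, horizon=12):
    """Superpose a linear base with learned bias components (placeholders here)."""
    v = (obs[-1] - obs[0]) / (len(obs) - 1)              # mean observed velocity
    base = obs[-1] + np.outer(np.arange(1, horizon + 1), v)  # constant-velocity base
    return base + self_bias + resonance_bias

obs = np.cumsum(np.tile([0.4, 0.1], (8, 1)), axis=0)     # 8 observed 2D positions
pred = predict(obs, np.zeros((12, 2)), np.zeros((12, 2)))
```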

RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case

This paper proposes RoboTron-Sim, a framework that constructs a hard-case simulation dataset (HASS) and introduces Scenario-aware Prompt Engineering (SPE) and an Image-to-Ego encoder (I2E), enabling MLLMs to effectively leverage simulated hard cases to improve real-world driving performance. On nuScenes hard scenarios, it reduces L2 distance by ~48% and collision rate by ~46%, establishing state-of-the-art open-loop planning performance.

Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs

This paper proposes the Probabilistic Point Cloud (PPC) representation, which attaches measurement confidence derived from raw single-photon LiDAR timing histograms as a probability attribute to each 3D point. Combined with a lightweight NPD filter and FPPS sampling strategy, PPC enables robust 3D object detection under low signal-to-background ratio (SBR) conditions, substantially outperforming point cloud denoising baselines on SUN RGB-D and KITTI with negligible computational overhead.

RTMap: Real-Time Recursive Mapping with Change Detection and Localization

RTMap is proposed as the first end-to-end framework that simultaneously addresses three core challenges in multi-traversal online HD map construction: prior-map-based localization, road structure change detection, and probabilistic crowdsourced map fusion. It achieves improvements in both map quality and localization accuracy on TbV and nuScenes.

SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World

SA-Occ is proposed as the first method to leverage satellite imagery to assist onboard cameras in 3D occupancy prediction. Three modules—Dynamic-Decoupling Fusion, 3D Projection Guidance, and Uniform Sampling Alignment—address cross-view perception challenges, achieving 39.05% mIoU (+6.97%) on Occ3D-nuScenes with only 6.93 ms additional latency.

Saliency-Aware Quantized Imitation Learning for Efficient Robotic Control

This paper proposes SQIL (Saliency-Aware Quantized Imitation Learning), which identifies task-critical states via saliency scoring and applies weighted distillation during quantization-aware training. SQIL recovers full-precision performance for 4-bit quantized VLA policy models in robotic manipulation and autonomous driving, while achieving 2.5–3.7× inference speedup.
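
The weighting idea can be written as a two-line loss: per-sample distillation error scaled by a saliency score. The exact objective below is an assumption for illustration, not the paper's formulation:

```python
import torch

def sqil_loss(student_actions, teacher_actions, saliency):
    # saliency: (B,) in [0, 1], e.g. from gradient- or attention-based scoring;
    # task-critical states contribute more to matching the full-precision teacher.
    per_sample = ((student_actions - teacher_actions) ** 2).mean(dim=-1)
    return (saliency * per_sample).mean()

B, A = 32, 7                                  # batch of 7-DoF action vectors
loss = sqil_loss(torch.randn(B, A), torch.randn(B, A), torch.rand(B))
```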

SAM4D: Segment Anything in Camera and LiDAR Streams

This paper presents SAM4D, the first promptable multimodal segmentation foundation model for camera and LiDAR streams. It introduces Unified Multimodal Positional Encoding (UMPE) to enable cross-modal prompting and interaction, Motion-aware Cross-Modal Attention (MCMA) for temporal consistency, and constructs the Waymo-4DSeg dataset containing 300K+ masklets, demonstrating strong capabilities in cross-modal segmentation and data annotation.

Self-Supervised Sparse Sensor Fusion for Long Range Perception

This paper proposes LRS4Fusion, a long-range LiDAR-camera fusion framework based on sparse voxel representations, combined with a self-supervised pretraining strategy via sparse occupancy and velocity field reconstruction, achieving state-of-the-art performance within a 250-meter perception range: a 26.6% improvement in object detection mAP and a 30.5% reduction in LiDAR prediction Chamfer Distance.

Semantic Causality-Aware Vision-Based 3D Occupancy Prediction

This paper analyzes semantic ambiguity in 2D-to-3D transformation for vision-based 3D occupancy prediction from a causal perspective, proposes a Causal Loss for end-to-end semantic consistency supervision, and designs the SCAT module (channel-grouped lifting, learnable camera offsets, normalized convolution) to significantly improve occupancy prediction accuracy and robustness to camera perturbations.

SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions

Inspired by the human sketching process, this work models lane topology as a chain of sequential graph expansions, incrementally constructing directed lane graphs via an autoregressive transformer, thereby overcoming the inability of DAG-based methods to represent cycles and bidirectional lanes.

SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection

This paper proposes SparseLaneSTP, which integrates lane geometry priors (parallelism, continuity) and temporal information into a sparse Transformer architecture. Through Catmull-Rom spline representation, spatio-temporal attention mechanisms, and temporal regularization, the method achieves state-of-the-art performance on multiple 3D lane detection benchmarks.
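
A Catmull-Rom spline represents a lane with a handful of control points while passing through them exactly, which is what makes it attractive as a compact lane representation. A minimal evaluation of one segment (uniform parameterization assumed; not the paper's code):

```python
import numpy as np

def catmull_rom(p0, p1, p2, p3, u):
    """Uniform Catmull-Rom segment between p1 and p2, u in [0, 1]."""
    return 0.5 * ((2 * p1)
                  + (-p0 + p2) * u
                  + (2*p0 - 5*p1 + 4*p2 - p3) * u**2
                  + (-p0 + 3*p1 - 3*p2 + p3) * u**3)

# A 3D lane segment sampled from 4 control points.
pts = np.array([[0, 0, 0], [4, 0.2, 0], [8, 0.8, 0.1], [12, 1.8, 0.1]], float)
lane = np.stack([catmull_rom(*pts, u) for u in np.linspace(0, 1, 10)])
```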

Splat-LOAM: Gaussian Splatting LiDAR Odometry and Mapping

This paper presents the first LiDAR odometry and mapping pipeline built entirely on 2D Gaussian primitives, simultaneously achieving high-accuracy pose estimation and lightweight scene reconstruction via a spherical-projection-driven differentiable rasterizer.

SRefiner: Soft-Braid Attention for Multi-Agent Trajectory Refinement

This paper proposes Soft-Braid Attention, which explicitly models spatiotemporal topological relationships between trajectories and between trajectories and lanes via "soft crossing points" to guide multi-agent trajectory refinement. The method achieves significant improvements over four baseline methods on both Argoverse v2 and INTERACTION datasets, establishing a new state of the art for trajectory refinement.

TARS: Traffic-Aware Radar Scene Flow Estimation

This paper proposes TARS, a traffic-aware radar scene flow estimation method that constructs a Traffic Vector Field (TVF) via joint object detection, capturing rigid-body motion at the traffic level rather than the instance level. TARS surpasses the state of the art by 15% and 23% on VoD and a proprietary dataset, respectively.

Towards Open-World Generation of Stereo Images and Unsupervised Matching

This paper proposes GenStereo, a diffusion-based stereo image generation framework that achieves both high visual quality and geometric accuracy through disparity-aware coordinate embedding, cross-view attention, and an adaptive fusion mechanism, while advancing unsupervised stereo matching to a new state of the art.

TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking

TrackAny3D is the first work to transfer large-scale pretrained 3D models to category-agnostic 3D single object tracking. By introducing a dual-path adapter, a Mixture of Geometry Experts (MoGE) module, and a temporal context optimization strategy, it achieves state-of-the-art performance on cross-category unified tracking within a single model.

TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes

This paper proposes TrafficLoc, a coarse-to-fine image-to-point cloud registration method that achieves high-accuracy localization of traffic surveillance cameras in 3D reference maps via Geometry-guided Attention Loss (GAL), Inter- and Intra-modal Contrastive Learning (ICL), and Dense Training Alignment (DTA). On the self-constructed Carla Intersection dataset, it outperforms the previous state of the art by up to 86%.

UAVScenes: A Multi-Modal Dataset for UAVs

UAVScenes is the first large-scale multi-modal UAV dataset that simultaneously provides per-frame semantic annotations for both images and LiDAR point clouds along with accurate 6-DoF poses. It contains over 120,000 annotated frames and supports six perception tasks including semantic segmentation, depth estimation, localization, scene recognition, and novel view synthesis.

UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving

This paper proposes UniOcc, the first unified benchmark for 2D/3D occupancy prediction and forecasting, integrating four data sources — nuScenes, Waymo, CARLA, and OpenCOOD — while introducing per-voxel flow annotations and ground-truth-free evaluation metrics. Large-scale experiments reveal the significant value of voxel-level flow information and cross-domain training for occupancy tasks.

Unleashing the Temporal Potential of Stereo Event Cameras for Continuous-Time 3D Perception

This paper proposes the first 3D object detection framework relying solely on stereo event cameras. Through a semantic-geometric dual filtering module and object-centric ROI alignment, it enables continuous-time 3D detection during blind time periods, significantly outperforming methods that depend on synchronized sensors (Ev-3DOD) in dynamic large-motion scenarios. Its pedestrian AP3D even surpasses methods that use LiDAR+RGB+Event.

Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving

This paper proposes SceneCrafter, a unified 3DGS-based simulation framework that simultaneously supports synthetic data generation and closed-loop evaluation via an adaptive kinematic model and bidirectional interactive agent control. Experiments demonstrate that synthetic data significantly improves the generalization of end-to-end autonomous driving models, yielding up to 18% improvement in Route Completion.

Wavelet Policy: Lifting Scheme for Policy Learning in Long-Horizon Tasks

Wavelet Policy is the first work to introduce wavelet analysis into embodied intelligence policy learning. It proposes a multi-scale policy network based on a learnable lifting scheme, decomposing observation sequences into different frequency components and synthesizing action sequences layer by layer. The method achieves superior or comparable performance to baselines across five long-horizon tasks, including autonomous driving (CARLA), robotic manipulation, and multi-robot collaboration.
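
A lifting step splits a sequence into even/odd samples, predicts the odd half from the even half (the residual is the high-frequency band), and updates the even half into a smoothed low-frequency band; recursing on the low band gives a multi-scale pyramid. Below is a fixed-coefficient 5/3-style lifting step with periodic boundaries; in Wavelet Policy the fixed predict/update operators would be replaced by learnable ones.

```python
import numpy as np

def lifting_forward(x, predict=0.5, update=0.25):
    """One lifting step: split -> predict -> update (periodic boundaries)."""
    even, odd = x[::2], x[1::2]
    detail = odd - predict * (even + np.roll(even, -1))     # high-frequency residual
    approx = even + update * (detail + np.roll(detail, 1))  # smoothed low band
    return approx, detail

sig = np.sin(np.linspace(0, 4 * np.pi, 64)) + 0.05 * np.random.randn(64)
low, high = lifting_forward(sig)      # two half-rate sub-bands
low2, high2 = lifting_forward(low)    # recurse for a multi-scale decomposition
```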

Where am I? Cross-View Geo-localization with Natural Language Descriptions

This paper introduces a novel task of cross-view geo-localization via natural language descriptions, constructs the CVG-Text multimodal dataset covering 30,000+ coordinates across 3 cities (street-view + satellite + OSM + text), and proposes CrossText2Loc — a method employing Extended Positional Embedding for long-text handling and an Explainable Retrieval Module for localization rationale, achieving over 10% improvement in Top-1 Recall.

Where, What, Why: Towards Explainable Driver Attention Prediction

This paper proposes a new paradigm of "explainable driver attention prediction," introducing the first large-scale W³DA dataset and the LLada framework, which unifies spatial attention prediction (Where), semantic parsing (What), and cognitive reasoning (Why) within a single end-to-end large language model-driven architecture.

World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model

World4Drive constructs an intention-aware latent world model that leverages spatial-semantic priors from visual foundation models to achieve annotation-free end-to-end planning, reducing L2 error by 18.1% and collision rate by 46.7%.