🚗 Autonomous Driving
🤖 AAAI2026 · 58 paper notes
- A Data-Driven Model Predictive Control Framework for Multi-Aircraft TMA Routing Under Travel Time Uncertainty
  - A closed-loop MPC framework is proposed for conflict-free multi-aircraft routing and scheduling within the 50 NM Terminal Maneuvering Area (TMA) of Changi Airport. The framework integrates XGBoost-based TMA boundary arrival time prediction, MILP optimization (incorporating route selection, speed adjustment, holding control, and separation constraints), and a receding-horizon simulator. In peak-congestion scenarios of 36 aircraft/hour, it achieves a 7× computational speedup and significantly outperforms the Dijkstra baseline in feasibility under Monte Carlo robustness validation.
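The receding-horizon idea above is generic enough to sketch. The following is a minimal illustration of the closed loop (predict arrival times, solve, apply only the first decision, repeat); all function names (`predict_arrival_times`, `solve_milp`, `step`) are hypothetical placeholders, not the paper's API.

```python
def receding_horizon(initial_state, horizon, n_steps,
                     predict_arrival_times, solve_milp, step):
    """Generic receding-horizon control loop (illustrative sketch)."""
    state = initial_state
    applied = []
    for _ in range(n_steps):
        eta = predict_arrival_times(state)      # e.g. learned ETA predictions
        plan = solve_milp(state, eta, horizon)  # routes/speeds/holds over horizon
        action = plan[0]                        # apply only the first decision
        state = step(state, action)             # plant advances; uncertainty realizes
        applied.append(action)
    return applied
```

Re-solving at every step is what lets the controller absorb travel-time uncertainty: only the first decision of each plan is ever executed.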
- AI-based Traffic Modeling for Network Security and Privacy: Challenges Ahead
  - A survey and position paper on AI-based traffic modeling for Network Security & Privacy (NetS&P) tasks. It systematically reviews AI approaches for anomaly detection, attack classification, IoT device identification, and website fingerprinting attacks, and provides an in-depth discussion of four frontier challenges: data quality, practical deployment, explainability, and foundation models.
- Backdoor Attacks on Open Vocabulary Object Detectors via Multi-Modal Prompt Tuning
  - This paper presents the first study on backdoor attacks against open-vocabulary object detectors (OVODs), proposing TrAP (Trigger-Aware Prompt tuning), which jointly optimizes learnable prompts in both visual and textual branches alongside a learnable trigger to inject high-success-rate backdoors without modifying any model weights.
- Beta Distribution Learning for Reliable Roadway Crash Risk Assessment
  - A geospatial deep learning framework based on Beta distribution learning is proposed, which leverages multi-scale satellite imagery to predict the full probability distribution of fatal crash risk (rather than point estimates), achieving a 17–23% improvement in recall while naturally expressing uncertainty through the distribution shape.
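Predicting a full Beta distribution rather than a point estimate amounts to minimizing the Beta negative log-likelihood of the observed risk; a minimal sketch, assuming the network outputs shape parameters α, β > 0 (the loss form is standard, not quoted from the paper):

```python
from math import lgamma, log

def beta_nll(alpha, beta, y):
    """Negative log-likelihood of y in (0, 1) under Beta(alpha, beta).
    Minimizing this trains a model to output a full risk distribution
    rather than a point estimate."""
    log_b = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)  # log Beta function
    return -((alpha - 1.0) * log(y) + (beta - 1.0) * log(1.0 - y) - log_b)
```

Uncertainty then falls out of the predicted shape: the variance αβ / ((α+β)²(α+β+1)) shrinks as α+β grows, so confident predictions are sharply peaked.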
- Bridging Day and Night: Target-Class Hallucination Suppression in Unpaired Image Translation
  - This paper is the first to systematically address the "target-class hallucination" problem in unpaired day-to-night image translation. By combining a dual-head discriminator (style head + SAM2 pseudo-label segmentation head) for hallucination detection and class-prototype contrastive learning for suppression, the method improves mAP from 15.08 to 17.40 (+15.5%) on BDD100K day-to-night domain adaptation detection, with traffic light AP improving by 31.7%.
- CaTFormer: Causal Temporal Transformer with Dynamic Contextual Fusion for Driving Intention Prediction
  - CaTFormer is proposed to explicitly model causal interactions between driver behavior and environmental context via a causal temporal Transformer, achieving state-of-the-art performance of 98.6% F1 on the Brain4Cars dataset.
- CompTrack: Information Bottleneck-Guided Low-Rank Dynamic Token Compression for Point Cloud Tracking
  - CompTrack is proposed as the first framework to simultaneously address dual redundancy in LiDAR point clouds: SFP filters background noise via information entropy analysis to resolve spatial redundancy, while IB-DTC estimates effective rank via online SVD and adaptively determines the compression ratio to compress the foreground into low-rank proxy tokens, resolving information redundancy. It achieves state-of-the-art performance on nuScenes (61.04% Success) at 90 FPS.
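The paper's exact estimator is not reproduced here, but a common entropy-based notion of effective rank computed from singular values (in the spirit of Roy & Vetterli) can be sketched as:

```python
from math import exp, log

def effective_rank(singular_values):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular-value distribution. Equals the matrix rank for
    flat spectra, and is much smaller for rapidly decaying spectra."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * log(p) for p in probs)
    return exp(entropy)
```

A compression module can then set its token budget proportional to this continuous rank estimate instead of a fixed ratio.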
- Debiased Dual-Invariant Defense for Adversarially Robust Person Re-Identification
  - This work systematically identifies two unique challenges in adversarial defense for person ReID — model bias and composite generalization requirements — and proposes a Debiased Dual-Invariant Defense framework. The data balancing stage employs a diffusion model for resampling to mitigate bias, while the dual adversarial self-meta defense stage achieves dual generalization to unseen IDs and unseen attacks via Farthest Negative Example Softening (FNES)-based metric adversarial training and adversarially-enhanced self-meta learning.
- AdaptiveAD: Decoupling Scene Perception and Ego Status for End-to-End Autonomous Driving
  - This paper identifies the architectural root cause of ego-status over-reliance in end-to-end autonomous driving—namely, the premature fusion of ego status within the BEV encoder—and proposes AdaptiveAD, a dual-branch architecture consisting of a scene-driven branch (with ego status removed) and a self-driven branch that independently generate planning decisions. A scene-aware fusion module then adaptively integrates the two branches. Complemented by path attention, BEV unidirectional distillation, and an autoregressive online mapping auxiliary task, AdaptiveAD achieves state-of-the-art planning performance on nuScenes.
- SAML: A Differentiable Semantic Meta-Learning Framework for Long-Tail Motion Prediction
  - SAML is proposed as the first framework to provide a differentiable semantic definition of "long-tailedness" in motion prediction — quantifying rarity via five intrinsic/interactive attributes, fusing them into a continuous Tail Index through a Bayesian Tail Perceiver, and driving MAML-based meta-learning adaptation. On the nuScenes worst-case top 1% subset, SAML achieves a minADE 17.2% lower than the second-best method.
- Difficulty-Aware Label-Guided Denoising for Monocular 3D Object Detection
  - This paper proposes MonoDLGD, which provides explicit geometric supervision for monocular 3D detection by adaptively perturbing and reconstructing ground-truth labels according to instance-level detection difficulty, achieving state-of-the-art performance on KITTI.
- DiffRefiner: Coarse to Fine Trajectory Planning via Diffusion Refinement with Semantic Interaction for End to End Autonomous Driving
  - This paper proposes DiffRefiner, a coarse-to-fine two-stage framework that first employs a discriminative Proposal Decoder to generate coarse trajectory proposals, then iteratively refines them via a diffusion model, combined with a fine-grained semantic interaction module. The method achieves state-of-the-art performance on both NAVSIM v2 and Bench2Drive benchmarks.
- Drive As You Like: Strategy-Level Motion Planning Based on A Multi-Head Diffusion Model
  - This paper proposes M-Diffusion Planner, a strategy-level motion planning framework built upon a multi-head diffusion model and GRPO post-training, enabling users to switch among driving styles such as aggressive, conservative, and comfortable via natural language, while maintaining state-of-the-art planning performance.
- DriveFlow: Rectified Flow Adaptation for Robust 3D Object Detection in Autonomous Driving
  - DriveFlow is a rectified flow adaptation method built upon pretrained T2I Flow models. Through frequency decomposition, it preserves high-frequency foreground content while applying dual-frequency optimization to the background, enabling training-free driving scene image editing for data augmentation and significantly improving the OOD robustness of vision-based 3D detectors.
- DriveSuprim: Towards Precise Trajectory Selection for End-to-End Planning
  - This paper proposes DriveSuprim, which addresses three key bottlenecks in selection-based end-to-end planning — difficulty distinguishing similar trajectories, directional bias, and hard-label instability — through a coarse-to-fine trajectory selection paradigm, rotation-based data augmentation, and a self-distillation soft-label framework, achieving state-of-the-art performance on NAVSIM v1/v2 and Bench2Drive.
- Dual-branch Spatial-Temporal Self-supervised Representation for Enhanced Road Network Learning
  - This paper proposes DST (Dual-branch Spatial-Temporal), a road network representation learning framework that jointly models spatial heterogeneity and temporal dynamics via a spatial branch (mix-hop transition matrix + graph–hypergraph contrastive learning) and a temporal branch (Transformer encoder + next-token prediction + weekday/weekend classification). DST achieves state-of-the-art performance on three downstream tasks across three cities.
- ExpertAD: Enhancing Autonomous Driving Systems with Mixture of Experts
  - ExpertAD introduces a Mixture-of-Experts (MoE) architecture into the perception and prediction modules of end-to-end autonomous driving systems. A Perception Adapter dynamically re-weights BEV features to amplify task-critical semantics, while a Mixture of Sparse Experts employs a router to selectively activate relevant driving task experts and uses sparse attention to reduce computation. The framework reduces inference latency by approximately 25% while maintaining or improving planning performance.
- Exploring Surround-View Fisheye Camera 3D Object Detection
  - This paper systematically investigates 3D object detection with surround-view fisheye cameras. It constructs the Fisheye3DOD benchmark dataset containing both pinhole and fisheye camera data, and proposes two frameworks—FisheyeBEVDet and FisheyePETR—that embed fisheye geometric modeling into mainstream detection paradigms via spherical feature representations, achieving up to 6.2 FDS improvement over rectification-based baselines.
- FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning
  - FastDriveVLA trains a lightweight plug-and-play ReconPruner module (only 0.07B parameters) via MAE-style foreground pixel reconstruction. By employing an adversarial foreground-background reconstruction strategy, the method prioritizes the retention of foreground tokens critical for driving decisions. It achieves state-of-the-art performance across all pruning ratios on the nuScenes open-loop planning benchmark, and can be transferred to different VLA models sharing the same visual encoder without retraining.
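Score-based token pruning of this kind can be sketched generically; here `scores` stands in for the reconstruction-based foreground scores, and the function is an illustration, not FastDriveVLA's interface:

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the top-k tokens by relevance score (e.g. a foreground-
    reconstruction score), preserving their original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(order[:k])            # restore original token order
    return [tokens[i] for i in keep]
```

Because the scorer is external to the VLM, the same pruner can front any model that shares the visual encoder producing the tokens.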
- Fine-Grained Representation for Lane Topology Reasoning
  - This paper proposes TopoFG, a framework that replaces conventional single-query lane modeling with fine-grained queries (each lane represented by multiple spatially-aware queries), combined with hierarchical prior extraction, region-focused decoding, and boundary-point-based robust topology reasoning, achieving new state-of-the-art results of 48.0% OLS (subset_A) and 45.4% OLS (subset_B) on OpenLane-V2.
- FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection
  - This work presents the first fully INT8-quantized deployment of PETR-series 3D detectors. It introduces three key components: a quantization-friendly LiDAR-ray position encoding (QFPE) to resolve multi-modal feature magnitude mismatch, a dual-lookup table (DULUT) for efficient approximation of nonlinear operators, and quantization after numerical stabilization (QANS) to prevent softmax attention distortion. Across PETR/StreamPETR/PETRv2/MV2D, W8A8 quantization incurs less than 1% mAP loss while reducing latency by 75% (3.9× speedup).
- Generalising Traffic Forecasting to Regions without Traffic Observations
  - This paper proposes GenCast, which achieves generalization of traffic forecasting from sensor-covered regions to unobserved continuous regions via three key innovations: a physics-informed neural network (incorporating the LWR traffic equation as a soft constraint), dynamic external weather signal fusion, and a spatial grouping module. GenCast consistently outperforms existing state-of-the-art methods across five real-world datasets.
- Global-Lens Transformers: Adaptive Token Mixing for Dynamic Link Prediction
  - This paper proposes GLFormer, a lightweight attention-free Transformer framework for dynamic graph link prediction. It replaces self-attention with an adaptive token mixer conditioned on interaction order and temporal intervals, and employs a hierarchical aggregation mechanism to enlarge the temporal receptive field. GLFormer achieves performance on par with or superior to Transformer baselines across 6 benchmarks while substantially reducing computational complexity.
- HD2-SSC: High-Dimension High-Density Semantic Scene Completion for Autonomous Driving
  - This paper proposes the HD2-SSC framework, which addresses the 2D→3D input–output dimension gap via a High-dimensional Semantic Decoupling (HSD) module (expanding pixel features along a pseudo-dimension and orthogonally decoupling them), and addresses the annotation–reality density gap via a High-density Occupancy Refinement (HOR) module (a "detection–refinement" paradigm that aligns geometrically and semantically critical voxels). The method achieves state-of-the-art performance on SemanticKITTI and SSCBench-KITTI-360.
- Hierarchical Prompt Learning for Image- and Text-Based Person Re-Identification
  - This paper proposes HPL, a unified framework that decouples I2I and T2I tasks via a Task-Routed Transformer (dual classification tokens), and employs hierarchical prompt learning (identity-level + instance-level pseudo-text tokens) combined with cross-modal prompt regularization, achieving simultaneous state-of-the-art performance on both image-to-image and text-to-image person re-identification within a single model for the first time.
- I-INR: Iterative Implicit Neural Representations
  - This paper proposes I-INR (Iterative Implicit Neural Representations), a plug-and-play iterative refinement framework that introduces lightweight FeedbackNet and FuseNet modules (adding only 0.5–2% parameters) to perform progressive multi-step signal reconstruction, effectively alleviating the spectral bias of INRs. I-INR consistently outperforms baselines across image fitting, super-resolution, denoising, and 3D occupancy prediction tasks.
- Invisible Triggers, Visible Threats! Road-Style Adversarial Creation Attack for Visual 3D Detection in Autonomous Driving
  - This paper proposes AdvRoad, a two-stage framework (Road-Style Adversary Generation + Scenario-Associated Adaptation) that generates diverse adversarial posters with road-surface texture styles. These posters induce "ghost objects" (false positives) in visual 3D detectors for autonomous driving while remaining inconspicuous to human drivers due to their natural appearance, significantly improving the stealthiness and defensive resistance of FP attacks.
- LiDAR-GS++: Improving LiDAR Gaussian Reconstruction via Diffusion Priors
  - This paper proposes LiDAR-GS++, which introduces a controllable LiDAR diffusion generative model as a prior to perform extended reconstruction of a neural 2DGS field. The method addresses the severe degradation in reconstruction quality under extrapolated viewpoints (e.g., lane-change scenarios) encountered in single-pass LiDAR scanning, achieving state-of-the-art performance on both interpolated and extrapolated views across multiple public benchmarks.
- LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences
  - This paper proposes LiDARCrafter, the first 4D generative world model targeting LiDAR, which achieves controllable 4D LiDAR sequence generation and editing through a pipeline of text → scene graph → three-branch diffusion layout → range-image diffusion generation → autoregressive temporal extension, comprehensively surpassing existing methods on nuScenes.
- LiNeXt: Revisiting LiDAR Completion with Efficient Non-Diffusion Architectures
  - This paper proposes LiNeXt, a lightweight non-diffusion network for LiDAR 3D scene completion. Through a Distance-aware Selective Repetition (DSR) strategy, a Noise-to-Coarse (N2C) module, and a Refine module, LiNeXt directly reconstructs complete point clouds. On SemanticKITTI, it achieves 199.8× faster inference than LiDiff, reduces Chamfer Distance by 50.7%, and uses only 6.1% of LiDiff's parameters.
- LUCID: Learning-Enabled Uncertainty-Aware Certification of Stochastic Dynamical Systems
  - This paper proposes LUCID, the first verification engine capable of providing quantified safety guarantees for black-box stochastic dynamical systems. By combining data-driven control barrier certificates, conditional mean embeddings, and finite Fourier kernel expansions, LUCID reformulates a semi-infinite non-convex optimization problem into a tractable linear program.
- MambaSeg: Harnessing Mamba for Accurate and Efficient Image-Event Semantic Segmentation
  - MambaSeg employs a dual-branch parallel Mamba encoder that processes RGB images and event streams separately, with a Dual-Dimension Interaction Module (DDIM) for fine-grained cross-modal fusion along both spatial and temporal dimensions. It achieves state-of-the-art performance of 77.56%/75.10% mIoU on DDD17 and DSEC with only 25.44M parameters, offering substantially better efficiency than Transformer-based approaches.
- Meta Dynamic Graph for Traffic Flow Prediction
  - This paper proposes MetaDG, a framework that generates dynamic node representations at each time step and enhances them via spatio-temporal correlation, extending dynamism modeling beyond merely updating the adjacency matrix to simultaneously generating meta-parameters, adjacency matrices, and edge-weight adjustment matrices. This enables unified spatio-temporal heterogeneity modeling (ST-unification) and achieves state-of-the-art performance on four benchmark datasets: PEMS03/04/07/08.
- Minimum-Cost Network Flow with Dual Predictions
  - This paper presents the first learning-augmented algorithm for minimum-cost network flow (MCF) based on dual predictions. By warm-starting the classical ε-relaxation algorithm with machine-learned dual solutions, the proposed method ties its complexity bound to the \(\ell_\infty\)-norm of the prediction error (achieving both consistency and robustness), and demonstrates average speedups of 12.74× on traffic networks and 1.64× on chip escape routing.
- MOBA: A Material-Oriented Backdoor Attack against LiDAR-based 3D Object Detection
  - This paper proposes MOBA (Material-Oriented Backdoor Attack), the first physically realizable backdoor attack framework grounded in material reflectance modeling. It systematically selects titanium dioxide (TiO₂) as the trigger material and employs an angle-independent approximation of the Oren-Nayar BRDF model for LiDAR intensity simulation, achieving an attack success rate (ASR) of 93.50% on real physical data—more than 41% above existing methods.
- Multimodal Data Fusion to Capture Dynamic Interactions between Built Environment and Vulnerable Older Adults
  - This paper proposes a multimodal data fusion framework that integrates eye-tracking, motion sensors (IMU), physiological monitoring (EDA/HRV), GPS, and video recording to dynamically characterize interactions between vulnerable older adults (with knee osteoarthritis or fall history) and the urban built environment. Through AI-driven data fusion, the framework identifies urban street segments that significantly influence walking behavior and perception at a microscopic scale, providing evidence-based support for age-friendly urban planning.
- SPARC: OOD Generalization for Controlling 100 Unseen Vehicles with a Single Policy
  - This paper proposes SPARC (Single-Phase Adaptation for Robust Control), which unifies the two-phase context encoding and history-based adaptation of RMA into a single-phase training procedure. Using a single policy in the high-fidelity Gran Turismo 7 racing simulator, SPARC achieves state-of-the-art OOD generalization performance across 100+ unseen vehicles.
- PriorDrive: Enhancing Online HD Map Construction with Unified Vector Priors
  - This paper proposes PriorDrive, a framework that encodes multiple types of vectorized prior maps (SD maps, outdated HD maps, historical prediction maps) into a unified representation via a Unified Vector Encoder (UVE) and Hybrid Prior Representation (HPQuery), and integrates them into various online mapping models. It achieves a +14.3 mAP improvement on nuScenes and is compatible with both query-based and non-query-based mapping architectures.
- RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis
  - This paper presents RacketVision—the first large-scale benchmark covering three racket sports (table tennis, tennis, and badminton)—which introduces racket pose annotations for the first time and defines three interconnected tasks: ball tracking, racket pose estimation, and ball trajectory prediction. The work reveals the critical role of cross-attention fusion in multimodal trajectory prediction.
- RadarMP: Motion Perception for 4D mmWave Radar in Autonomous Driving
  - This paper proposes RadarMP — the first unified architecture that jointly addresses mmWave radar object detection and scene flow estimation. It leverages energy flow consistency across adjacent-frame radar echo signals (tesseracts) for self-supervised training, achieving a detection probability of 69.5% (far exceeding the prior best of 44.1%) while enabling accurate 3D scene motion perception.
- RAST: A Retrieval Augmented Spatio-Temporal Framework for Traffic Prediction
  - This work introduces the RAG paradigm into spatio-temporal forecasting by maintaining a dual-dimensional memory bank to store historical spatio-temporal patterns and retrieving them at inference time for fusion. The resulting general-purpose retrieval-augmented spatio-temporal prediction framework, RAST, achieves state-of-the-art performance on six traffic datasets while requiring only 1/12 the GPU memory of competing methods.
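The retrieval step of such a framework reduces to nearest-neighbor search over stored patterns. A minimal sketch, assuming cosine similarity and a plain list as the memory bank (the paper's actual bank structure and similarity are not specified here):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query, memory_bank, top_k=2):
    """Return the top_k stored patterns most similar to the query."""
    ranked = sorted(memory_bank, key=lambda m: cosine(query, m), reverse=True)
    return ranked[:top_k]
```

The retrieved patterns are then fused with the current encoding, which is what lets a small model lean on a large history without carrying it in its weights.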
- ReflexDiffusion: Reflexion-Enhanced Trajectory Planning for High Lateral Acceleration in Autonomous Driving
  - This paper proposes ReflexDiffusion, which introduces a physics-aware reflection mechanism during the inference stage of diffusion models. By injecting gradients to enforce curvature-velocity-acceleration coupling constraints (\(a_y = \kappa v^2\)), the method achieves a 14.1% improvement in driving score on nuPlan high-lateral-acceleration long-tail scenarios. The architecture-agnostic design allows direct deployment on existing diffusion planners.
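The coupling constraint \(a_y = \kappa v^2\) can be enforced as a soft penalty whose gradient is injected at sampling time. A hedged sketch with an assumed comfort bound `a_max` (the paper's actual guidance formulation may differ):

```python
def lateral_accel_penalty(kappa, v, a_max=3.0):
    """Quadratic soft penalty for violating a_y = kappa * v**2 <= a_max.
    a_max is an assumed comfort bound, in m/s^2."""
    a_y = kappa * v * v
    excess = max(0.0, a_y - a_max)
    return excess * excess

def penalty_grad_v(kappa, v, a_max=3.0, eps=1e-6):
    """Finite-difference gradient of the penalty w.r.t. speed, for
    illustration; a guidance step would subtract a multiple of this."""
    return (lateral_accel_penalty(kappa, v + eps, a_max)
            - lateral_accel_penalty(kappa, v - eps, a_max)) / (2 * eps)
```

The penalty is zero inside the feasible region, so the guidance only activates on trajectories that actually violate the physical coupling.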
- Rethinking the Spatio-Temporal Alignment of End-to-End 3D Perception
  - This paper proposes HAT (multiple Hypotheses spAtio-Temporal alignment), a plug-and-play spatio-temporal alignment module that generates alignment hypotheses via multiple explicit motion models and adaptively decodes the optimal alignment using motion cues latent in queries. HAT consistently improves multiple 3D temporal detectors and trackers on nuScenes, and reduces collision rates by 32–48% in end-to-end autonomous driving.
- RoadSceneVQA: Benchmarking Visual Question Answering in Roadside Perception Systems for Intelligent Transportation System
  - This paper introduces RoadSceneVQA—the first large-scale visual question answering dataset for roadside perception scenarios (34,736 QA pairs)—and proposes the RoadMind model, which significantly improves lightweight MLLM performance on traffic scene reasoning through CogniAnchor Fusion (CAF) and Assisted Decoupled Chain-of-Thought (AD-CoT), enabling a 0.9B-parameter model to surpass 8B-parameter counterparts.
- SparseCoop: Cooperative Perception with Kinematic-Grounded Queries
  - This paper proposes SparseCoop—the first fully sparse cooperative perception framework—which abandons dense BEV representations entirely through kinematic-grounded queries (KGQ), a coarse-to-fine aggregation module, and a cooperative instance denoising strategy. SparseCoop achieves state-of-the-art performance on V2X-Seq and Griffin datasets with minimal communication overhead and maximum computational efficiency (AP 0.530, transmission cost only 3.17×10⁴ BPS).
- STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes
  - This paper constructs STRIDE-QA, the largest spatiotemporal reasoning VQA dataset in autonomous driving (270K frames, 16M QA pairs), defines three categories of spatiotemporal reasoning tasks (object-centric spatial / ego-centric spatial / ego-centric spatiotemporal), and demonstrates that fine-tuning a VLM raises localization success rate from near zero to 55% and temporal localization consistency from 0 to 28.4%.
- Task Prototype-Based Knowledge Retrieval for Multi-Task Learning from Partially Annotated Data
  - This paper proposes a task prototype-based knowledge retrieval framework that employs learnable Task Prototypes to encode task characteristics and quantify inter-task affinities, and a Knowledge Retrieval Transformer to adaptively refine feature representations based on task-affinity scores. The framework addresses multi-task learning from partially annotated data (MTPSL) without relying on predictions from unannotated tasks, achieving state-of-the-art performance on PASCAL-Context and NYUD-v2.
- TawPipe: Topology-Aware Weight Pipeline Parallelism for Accelerating Long-Context Large Models Training
  - This paper proposes TawPipe—a topology-aware weight pipeline parallelism framework comprising three components: group-based weight scheduling, device-bound storage, and communication-computation overlap. By exploiting the hierarchical bandwidth characteristics of distributed clusters, TawPipe achieves throughput improvements of 11.8%/23.6%/44.1% over WeiPipe/1F1B/FSDP respectively when training LLaMA models on 24 GPUs, while reducing communication time by 82.1%.
- TimeBill: Time-Budgeted Inference for Large Language Models
  - This paper proposes TimeBill, a framework that adaptively adjusts the KV cache eviction ratio under a given time budget via a fine-grained Response Length Predictor (RLP) and a workload-guided Execution Time Estimator (ETE), simultaneously maximizing LLM response quality while guaranteeing inference completion rate.
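The budget-to-ratio decision can be sketched as a search over candidate ratios, preferring the least eviction that still fits the budget; here `estimate_time` and `predicted_len` are stand-ins for the paper's ETE and RLP outputs, and the grid is illustrative:

```python
def choose_eviction_ratio(budget_s, prompt_len, predicted_len,
                          estimate_time, grid=None):
    """Pick the smallest KV-cache eviction ratio whose estimated
    inference time fits the budget (illustrative sketch, not
    TimeBill's actual policy)."""
    if grid is None:
        grid = [0.0, 0.1, 0.2, 0.3, 0.5, 0.7]  # candidate eviction ratios
    for ratio in grid:                          # evict as little as possible
        if estimate_time(prompt_len, predicted_len, ratio) <= budget_s:
            return ratio
    return grid[-1]                             # budget infeasible: evict the most
```

Evicting less preserves more context (hence response quality), so the search direction encodes the quality/latency trade-off directly.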
- Towards 3D Object-Centric Feature Learning for Semantic Scene Completion
  - This paper proposes Ocean, a framework that leverages instance masks extracted by MobileSAM to guide 3D object-centric feature learning. Through Semantic Group Attention (SGA3D) and Global Similarity-Guided Attention (GSGA), Ocean achieves instance-level feature aggregation in 3D space, and refines scene representations via an Instance-aware Local Diffusion (ILD) module, attaining state-of-the-art performance on SemanticKITTI and SSCBench-KITTI360.
- TSBOW: Traffic Surveillance Benchmark for Occluded Vehicles Under Various Weather Conditions
  - This paper presents TSBOW — a large-scale CCTV-based traffic surveillance dataset comprising 198 videos, over 32 hours of real-world traffic data, and 3.2 million frames, covering all four seasons and four weather conditions (clear/haze/rain/snow, including extreme disaster scenarios), spanning 8 categories of traffic participants, with a focus on addressing the challenge of occluded vehicle detection under adverse weather conditions.
- Understanding Dynamic Scenes in Egocentric 4D Point Clouds
  - This work introduces EgoDynamic4D — the first egocentric QA benchmark targeting highly dynamic 4D scenes (927K QA pairs, 12 task types) — and proposes an end-to-end spatiotemporal reasoning framework that compresses large-scale 4D scenes into LLM-processable token sequences via instance-aware feature encoding, temporal encoding, camera encoding, and adaptive downsampling.
- Unleashing Semantic and Geometric Priors for 3D Scene Completion
  - This paper proposes FoundationSSC, a framework that unleashes the semantic and geometric priors of Vision Foundation Models through a dual-level decoupling design at both the source level and pathway level. Combined with an Axis-Aware Fusion module for integrating complementary 3D features, the method achieves state-of-the-art performance of 19.32 mIoU / 48.12 IoU on SemanticKITTI.
- Unlocking Efficient Vehicle Dynamics Modeling via Analytic World Models
  - This paper proposes Analytic World Models (AWMs), which exploit the differentiability of differentiable simulators to design three world modeling tasks—relative odometry, optimal planners, and inverse optimal state estimation—enabling end-to-end efficient training of state predictors without trial-and-error search. The approach is validated on the Waymax autonomous driving simulator.
- Vision-Only Gaussian Splatting for Collaborative Semantic Occupancy Prediction
  - This paper proposes the first vision-only semantic occupancy prediction framework that uses sparse 3D semantic Gaussian primitives as the communication medium for collaborative perception. Through ROI cropping, rigid transformation of Gaussians, and a neighborhood fusion module to suppress noise and redundancy, the method achieves +8.42 mIoU over the single-agent baseline and +3.28 mIoU over the baseline collaborative method.
- Walking Further: Semantic-aware Multimodal Gait Recognition Under Long-Range Conditions
  - This paper introduces LRGait — the first LiDAR-Camera multimodal gait dataset targeting long-range (10–50m) cross-distance scenarios — and proposes EMGaitNet, an end-to-end framework that achieves 2D-3D cross-modal feature fusion via Semantic Mining (SeMi), Semantic-Guided Alignment (SGA), and Symmetric Cross-Attention Fusion (SCAF) modules, reaching state-of-the-art performance on multiple benchmarks.
- When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework
  - This paper presents EvReID, the first large-scale RGB-Event person re-identification dataset (1,200 identities / 118,988 image pairs), and proposes TriPro-ReID, a three-stage contrastive learning framework guided by pedestrian attributes. The framework leverages positive-negative attribute prompts and cross-modal prompt fusion to integrate RGB and Event modality features, achieving 69.3% mAP.
- WorldRFT: Latent World Model Planning with Reinforcement Fine-Tuning for Autonomous Driving
  - WorldRFT is a planning-oriented latent world model framework that integrates VGGT-based spatial encoding, hierarchical planning decomposition with local-aware iterative refinement, and GRPO-based collision-aware reinforcement fine-tuning. It reduces collision rate by 83% on nuScenes (0.30% → 0.05%) and achieves near-LiDAR SOTA performance using camera only on NavSim (87.8 vs. 88.1 PDMS).