
UAVScenes: A Multi-Modal Dataset for UAVs

Conference: ICCV 2025
arXiv: 2507.22412
Code: https://github.com/sijieaaa/UAVScenes
Area: Autonomous Driving
Keywords: UAV perception, multi-modal dataset, semantic segmentation, depth estimation, LiDAR point cloud

TL;DR

UAVScenes is the first large-scale multi-modal UAV dataset that simultaneously provides per-frame semantic annotations for both images and LiDAR point clouds along with accurate 6-DoF poses. It contains over 120,000 annotated frames and supports six perception tasks: image semantic segmentation, LiDAR semantic segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis.

Background & Motivation

The Demand for UAV Perception

With the rapid growth of the low-altitude economy, UAVs are widely deployed for aerial taxis, low-altitude logistics, agriculture, inspection, and emergency response. Unlike ground vehicles, UAVs operate free of ground-level constraints, yet they still depend on high-quality training datasets to achieve reliable perception.

Systematic Deficiencies in Existing UAV Datasets

The authors conducted a systematic survey of existing UAV datasets and identified three levels of problems:

Level 1: Single-Modality Only
  • Many datasets (UAVDT, VisDrone, UAVid, FloodNet, etc.) contain only camera images without 3D LiDAR data.
  • 3D scene understanding and high-accuracy multi-modal fusion are therefore infeasible.

Level 2: Multi-Modal but Lacking Per-Frame Annotations
  • NTU VIRAL, GrAco, FIReStereo, and MUN-FRL provide camera + LiDAR data but are primarily designed for SLAM or 3D reconstruction.
  • UrbanScene3D and Hessigheim 3D annotate only reconstructed 3D maps rather than individual frames.
  • GauU-Scene uses encrypted DJI-L1 point clouds, making per-frame LiDAR data inaccessible.

Level 3: The Core Gap
  • No existing multi-modal UAV dataset provides both per-frame image annotations and per-frame LiDAR point cloud annotations simultaneously.
  • This directly impedes research on advanced perception tasks such as per-frame semantic segmentation, depth estimation, and precise localization.

Positioning and Contributions

UAVScenes extends the MARS-LVIG dataset (originally a multi-modal UAV dataset designed only for SLAM) through three major contributions:
  1. Adding 19-class semantic annotations (16 static + 2 dynamic + 1 background) to per-frame images.
  2. Adding semantic annotations to per-frame LiDAR point clouds.
  3. Reconstructing accurate 6-DoF poses (the original dataset provides only 4-DoF RTK poses).

Method

Overall Architecture

The construction pipeline of UAVScenes consists of three stages: 3D reconstruction for 6-DoF pose estimation → image semantic annotation → LiDAR point cloud semantic annotation. Each stage includes rigorous quality control and manual review.

Key Designs

1. 6-DoF Pose Reconstruction

  • Function: Upgrades MARS-LVIG's 4-DoF RTK poses to complete 6-DoF poses.
  • Mechanism:
    • LiDAR-inertial-visual SLAM methods (FAST-LIVO, R3LIVE) were attempted first but yielded poor reconstruction quality due to LiDAR degeneracy caused by the downward-facing sensor configuration during flight.
    • Structure-from-Motion (SfM) was adopted instead; COLMAP, RealityCapture, Metashape, and DJI Terra were evaluated.
    • DJI Terra was ultimately selected as it accepts GNSS coordinates for initialization and is specifically designed for UAV scenes, yielding the best reconstruction quality.
    • The entire MARS-LVIG dataset was divided into 8 splits based on environment and lighting conditions; SfM reconstruction was performed independently for each split (3–10 hours per split).
  • Design Motivation: Accurate 6-DoF poses are the foundation for all downstream tasks, particularly novel view synthesis and precise localization. 4-DoF poses encode only 3D position and yaw angle, which is insufficient for fine-grained evaluation.
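
The distinction between the original 4-DoF RTK poses and the reconstructed 6-DoF poses can be made concrete with a small sketch. The snippet below is illustrative only (not from the UAVScenes codebase; function names and conventions are assumptions): it builds both pose representations as SE(3) matrices, and the 4-DoF version implicitly assumes zero roll and pitch, which is exactly the orientation information that SfM reconstruction recovers.

```python
# Illustrative sketch: 4-DoF (x, y, z, yaw) vs. full 6-DoF SE(3) poses.
import numpy as np

def pose_4dof(x, y, z, yaw):
    """SE(3) matrix from position + yaw; roll and pitch are implicitly zero."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    T[:3, 3] = [x, y, z]
    return T

def pose_6dof(x, y, z, roll, pitch, yaw):
    """SE(3) matrix from position + ZYX Euler angles (full orientation)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = [x, y, z]
    return T

# Even a few degrees of unmodeled pitch shifts projected ground points by
# several meters at typical UAV altitudes, which is why 4-DoF poses are
# insufficient for novel view synthesis and fine-grained localization.
```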

2. Image Semantic Annotation Pipeline

  • Function: Provides 19-class pixel-level semantic annotations for 120,000+ frames.
  • Mechanism (two-step process):
    • Static class annotation (16 classes): Manual semantic labeling is performed on the reconstructed 3D point cloud map and then rendered back to the corresponding camera views to obtain 2D semantic masks. 3D consistency ensures cross-frame annotation coherence.
    • Dynamic class annotation (2 classes: cars and trucks): Instance-level manual annotation is performed on individual frames. The tracking functionality of X-AnyLabeling partially accelerates this process, but tracking instability necessitates extensive manual verification and correction. Over 280,000 dynamic instances were annotated in total.
    • Static and dynamic annotations are merged into complete per-frame labels.
  • Design Motivation: Annotating static classes on the 3D map guarantees cross-frame consistency (i.e., the same building receives consistent labels across frames), which is difficult to ensure with conventional frame-by-frame annotation.
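
As a rough illustration of the "annotate once in 3D, render to every frame" idea, the sketch below projects a labeled 3D map into one camera view to produce a 2D semantic mask. It assumes a pinhole camera with intrinsics K and a world-to-camera pose T_wc (both available once the 6-DoF poses are reconstructed) and is a simplification of the actual rendering step, which must also handle occlusion and hole filling before manual review.

```python
# Minimal sketch (not the authors' pipeline): render labels of a 3D map
# into a camera view. points_w: (N, 3) map points, labels: (N,) class ids,
# K: 3x3 intrinsics, T_wc: 4x4 world-to-camera pose, hw: image size.
import numpy as np

def render_semantic_mask(points_w, labels, K, T_wc, hw, ignore_label=255):
    H, W = hw
    # World frame -> camera frame.
    pts_h = np.hstack([points_w, np.ones((len(points_w), 1))])
    pts_c = (T_wc @ pts_h.T).T[:, :3]
    front = pts_c[:, 2] > 0.1                        # keep points in front of the camera
    pts_c, lab = pts_c[front], labels[front]
    # Pinhole projection to pixel coordinates.
    uv = (K @ pts_c.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, lab, depth = u[inside], v[inside], lab[inside], pts_c[inside, 2]
    # Crude z-buffer: paint far-to-near so the nearest point wins per pixel.
    order = np.argsort(-depth)
    mask = np.full((H, W), ignore_label, dtype=np.uint8)
    mask[v[order], u[order]] = lab[order]
    return mask
```

Because every frame is rendered from the same labeled map, the same building or road segment automatically keeps the same class across all views.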

3. LiDAR Point Cloud Semantic Annotation

  • Function: Provides semantic annotations for per-frame Livox-Avia LiDAR point clouds.
  • Mechanism:
    • Camera–LiDAR hardware synchronization and calibration are leveraged to project image semantic annotations onto the corresponding LiDAR point clouds.
    • Automatic projection is followed by manual consistency checking and correction.
    • Only the open Livox-Avia point clouds are used, as the DJI-L1 output is encrypted and inaccessible.
  • Design Motivation: Image-to-point-cloud projection efficiently produces initial annotations, which are then refined through manual correction to ensure quality.
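
The image-to-point-cloud transfer can be sketched similarly: each LiDAR point is projected into the synchronized image through the camera-LiDAR calibration and inherits the label of the pixel it lands on. The snippet below is illustrative only (the variable names and the LiDAR-to-camera extrinsic T_cl are assumptions); points that fall outside the image or behind the camera keep an ignore label, and the result is what the manual consistency check then corrects.

```python
# Illustrative sketch: transfer per-pixel semantic labels onto LiDAR points.
# points_l: (N, 3) points in the LiDAR frame, sem_mask: (H, W) label image,
# K: 3x3 intrinsics, T_cl: 4x4 LiDAR-to-camera extrinsic.
import numpy as np

def label_lidar_from_mask(points_l, sem_mask, K, T_cl, ignore_label=255):
    H, W = sem_mask.shape
    pts_h = np.hstack([points_l, np.ones((len(points_l), 1))])
    pts_c = (T_cl @ pts_h.T).T[:, :3]                # LiDAR frame -> camera frame
    z = pts_c[:, 2]
    uv = (K @ pts_c.T).T
    uv = uv[:, :2] / np.maximum(z[:, None], 1e-6)
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    labels = np.full(len(points_l), ignore_label, dtype=np.uint8)
    ok = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels[ok] = sem_mask[v[ok], u[ok]]              # inherit the pixel's class
    return labels
```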

Loss & Training

As this is a dataset paper, no new loss function or training procedure is proposed. The benchmark experiments follow the standard training protocols of each respective task.

Key Experimental Results

Main Results (Image Semantic Segmentation)

Params Architecture Model mIoU ↑
22M Transformer DeiT3-s 67.6
38M Transformer DeiT3-m 68.3
22M Transformer ViT-s 63.9
5M Transformer ViT-t 62.8
25M CNN ResNet-50 61.3
44M CNN ResNet-101 60.7
21M CNN ResNet-34 59.9
28M CNN ConvNeXt-t 55.3
48M CNN MambaOut-s 51.8

All models use UperNet as the segmentation head. Transformer-based models consistently outperform CNN-based models, with DeiT3-m achieving the best mIoU of 68.3%.
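
For reference, the mIoU values in both benchmark tables follow the standard per-class intersection-over-union definition, averaged over classes. A minimal sketch (not the authors' evaluation code) is given below.

```python
# Standard mIoU from a confusion matrix; illustrative, assumes integer label arrays.
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_label=255):
    keep = gt != ignore_label
    pred, gt = pred[keep], gt[keep]
    conf = np.bincount(num_classes * gt + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)
    return iou[union > 0].mean()                     # average over classes that appear
```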

Benchmark Results (LiDAR Semantic Segmentation)

Params Model mIoU ↑ Note
38M MinkUNet 32.7 Voxelized point cloud method
39M SPUNet 34.4 Sparse convolution method
11M PTv2 33.2 Point Transformer method
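
As context for the "Note" column: MinkUNet- and SPUNet-style models first quantize the raw point cloud into sparse voxels and then apply sparse 3D convolutions, whereas PTv2 operates on points directly. A minimal voxelization sketch (illustrative only; the voxel size is an assumption) is shown below.

```python
# Illustrative voxelization step used by sparse-convolution segmentation models:
# quantize points to a voxel grid and keep one representative point per voxel.
import numpy as np

def voxelize(points, labels, voxel_size=0.2):
    coords = np.floor(points / voxel_size).astype(np.int64)   # integer voxel coords
    _, first = np.unique(coords, axis=0, return_index=True)   # one index per voxel
    return coords[first], points[first], labels[first]
```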

LiDAR segmentation mIoU is substantially lower than image segmentation (~33% vs. ~68%), indicating that semantic segmentation of aerial LiDAR point clouds is highly challenging.

Key Findings

  • Transformers outperform CNNs: Even with fewer parameters (e.g., ViT-t at only 5M), Transformer architectures surpass large-parameter CNNs on UAV semantic segmentation.
  • LiDAR segmentation is much harder than image segmentation: The mIoU gap is approximately 35 percentage points, likely attributable to low point cloud density and domain shift from ground-level patterns.
  • Challenges of dynamic object annotation: IoU scores for cars and trucks are notably lower than for static categories, reflecting the difficulty of dynamic object detection in UAV scenes.
  • Small objects such as Solar Panels and Umbrellas: IoU values are extremely low (<10%), representing a primary direction for future improvement.
  • Multi-traversal coverage: The dataset includes multiple passes over the same scenes, enabling temporal tasks such as scene change detection.

Highlights & Insights

  • Uniqueness: The first real-world UAV dataset to provide 6-DoF poses, per-frame image annotations, and per-frame LiDAR annotations simultaneously.
  • Annotation pipeline innovation: Annotating static classes on the 3D map and rendering back to 2D ensures cross-frame consistency while reducing annotation cost.
  • Comprehensive task coverage: A single dataset supports six distinct tasks, providing a unified evaluation platform for UAV perception research.
  • Scale advantage: 120,000+ annotated frames substantially exceed most existing UAV datasets.
  • Open-source and reproducible: Both data and code are publicly available.

Limitations & Future Work

  • Limited environmental diversity: Based on MARS-LVIG, the dataset covers only town, valley, airport, and island scenes, lacking extreme weather, nighttime, and other challenging conditions.
  • DJI-L1 encryption restriction: High-quality DJI-L1 LiDAR data cannot be utilized; only Livox-Avia data are available.
  • Limited dynamic object categories: Only cars and trucks are annotated; pedestrians, bicycles, and other dynamic objects are absent.
  • SfM reconstruction accuracy: Reliance on the commercial DJI Terra tool may yield lower reconstruction accuracy than SLAM under ideal conditions.
  • Annotation scalability: Despite the semi-automatic pipeline, labeling 280,000+ dynamic instances requires substantial human effort, posing a cost challenge for scaling to larger datasets.
Related Work

  • SemanticKITTI: The benchmark dataset for LiDAR semantic segmentation in ground-level autonomous driving; its annotation methodology directly inspired this work.
  • nuScenes / Waymo: Representative multi-modal autonomous driving datasets, both limited to ground-level perspectives.
  • MARS-LVIG: The foundation of this dataset, originally designed solely for SLAM.
  • Key insight: The dataset construction methodology from ground-level autonomous driving is transferred to the UAV domain, leveraging 3D-to-2D annotation rendering to ensure cross-frame consistency.

Rating

  • Novelty: ⭐⭐⭐⭐ — Fills a clear gap in the dataset landscape.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-task benchmarks are comprehensive, though individual task experiments are relatively small in scale.
  • Writing Quality: ⭐⭐⭐⭐ — Comparison tables are clear and related work is thoroughly surveyed.
  • Value: ⭐⭐⭐⭐⭐ — As an infrastructure-level contribution, it holds high value for UAV perception research.