
UAVScenes: A Multi-Modal Dataset for UAVs

Conference: ICCV 2025
arXiv: 2507.22412
Code: https://github.com/sijieaaa/UAVScenes
Area: Autonomous Driving
Keywords: UAV perception, multi-modal dataset, semantic segmentation, depth estimation, LiDAR point cloud

TL;DR

UAVScenes is the first large-scale multi-modal UAV dataset that simultaneously provides per-frame semantic annotations for both images and LiDAR point clouds along with accurate 6-DoF poses. It contains over 120,000 annotated frames and supports six perception tasks: image semantic segmentation, LiDAR semantic segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis.

Background & Motivation

The Demand for UAV Perception

With the rapid growth of the low-altitude economy, UAVs are widely deployed for aerial taxis, low-altitude logistics, agriculture, inspection, and emergency response. Unlike ground vehicles, UAVs operate free of ground-level constraints, yet they still depend on high-quality training datasets to achieve reliable perception.

Systematic Deficiencies in Existing UAV Datasets

The authors conducted a systematic survey of existing UAV datasets and identified three levels of problems:

Level 1: Single-Modality Only
  • Many datasets (UAVDT, VisDrone, UAVid, FloodNet, etc.) contain only camera images without 3D LiDAR data.
  • 3D scene understanding and high-accuracy multi-modal fusion are therefore infeasible.

Level 2: Multi-Modal but Lacking Per-Frame Annotations
  • NTU VIRAL, GrAco, FIReStereo, and MUN-FRL provide camera + LiDAR data but are primarily designed for SLAM or 3D reconstruction.
  • UrbanScene3D and Hessigheim 3D annotate only reconstructed 3D maps rather than individual frames.
  • GauU-Scene uses encrypted DJI-L1 point clouds, making per-frame LiDAR data inaccessible.

Level 3: The Core Gap
  • No existing multi-modal UAV dataset provides both per-frame image annotations and per-frame LiDAR point cloud annotations simultaneously.
  • This directly impedes research on advanced perception tasks such as per-frame semantic segmentation, depth estimation, and precise localization.

Positioning and Contributions

UAVScenes extends the MARS-LVIG dataset (originally a multi-modal UAV dataset designed only for SLAM) through three major contributions:
  1. Adding 19-class semantic annotations (16 static + 2 dynamic + 1 background) to per-frame images.
  2. Adding semantic annotations to per-frame LiDAR point clouds.
  3. Reconstructing accurate 6-DoF poses (the original dataset provides only 4-DoF RTK poses).

Method

Overall Architecture

The construction pipeline of UAVScenes consists of three stages: 3D reconstruction for 6-DoF pose estimation → image semantic annotation → LiDAR point cloud semantic annotation. Each stage includes rigorous quality control and manual review.

Key Designs

1. 6-DoF Pose Reconstruction

  • Function: Upgrades MARS-LVIG's 4-DoF RTK poses to complete 6-DoF poses.
  • Mechanism:
    • LiDAR-inertial-visual SLAM methods (FAST-LIVO, R3LIVE) were attempted first but yielded poor reconstruction quality due to LiDAR degeneracy caused by the downward-facing sensor configuration during flight.
    • Structure-from-Motion (SfM) was adopted instead; COLMAP, RealityCapture, Metashape, and DJI Terra were evaluated.
    • DJI Terra was ultimately selected as it accepts GNSS coordinates for initialization and is specifically designed for UAV scenes, yielding the best reconstruction quality.
    • The entire MARS-LVIG dataset was divided into 8 splits based on environment and lighting conditions; SfM reconstruction was performed independently for each split (3–10 hours per split).
  • Design Motivation: Accurate 6-DoF poses are the foundation for all downstream tasks, particularly novel view synthesis and precise localization. 4-DoF poses encode only 3D position and yaw angle, which is insufficient for fine-grained evaluation.
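
The distinction between the original 4-DoF RTK poses and the reconstructed 6-DoF poses can be made concrete with a small sketch. The snippet below is illustrative only (not from the UAVScenes codebase; function names and conventions are assumptions): it builds both pose representations as SE(3) matrices, and the 4-DoF version implicitly assumes zero roll and pitch, which is exactly the orientation information that SfM reconstruction recovers.

```python
# Illustrative sketch: 4-DoF (x, y, z, yaw) vs. full 6-DoF SE(3) poses.
import numpy as np

def pose_4dof(x, y, z, yaw):
    """SE(3) matrix from position + yaw; roll and pitch are implicitly zero."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    T[:3, 3] = [x, y, z]
    return T

def pose_6dof(x, y, z, roll, pitch, yaw):
    """SE(3) matrix from position + ZYX Euler angles (full orientation)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = [x, y, z]
    return T

# Even a few degrees of unmodeled pitch shifts projected ground points by
# several meters at typical UAV altitudes, which is why 4-DoF poses are
# insufficient for novel view synthesis and fine-grained localization.
```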

2. Image Semantic Annotation Pipeline

  • Function: Provides 19-class pixel-level semantic annotations for 120,000+ frames.
  • Mechanism (two-step process):
    • Static class annotation (16 classes): Manual semantic labeling is performed on the reconstructed 3D point cloud map and then rendered back to the corresponding camera views to obtain 2D semantic masks. 3D consistency ensures cross-frame annotation coherence.
    • Dynamic class annotation (2 classes: cars and trucks): Instance-level manual annotation is performed on individual frames. The tracking functionality of X-AnyLabeling partially accelerates this process, but tracking instability necessitates extensive manual verification and correction. Over 280,000 dynamic instances were annotated in total.
    • Static and dynamic annotations are merged into complete per-frame labels.
  • Design Motivation: Annotating static classes on the 3D map guarantees cross-frame consistency (i.e., the same building receives consistent labels across frames), which is difficult to ensure with conventional frame-by-frame annotation.
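
As a rough illustration of the "annotate once in 3D, render to every frame" idea, the sketch below projects a labeled 3D map into one camera view to produce a 2D semantic mask. It assumes a pinhole camera with intrinsics K and a world-to-camera pose T_wc (both available once the 6-DoF poses are reconstructed) and is a simplification of the actual rendering step, which must also handle occlusion and hole filling before manual review.

```python
# Minimal sketch (not the authors' pipeline): render labels of a 3D map
# into a camera view. points_w: (N, 3) map points, labels: (N,) class ids,
# K: 3x3 intrinsics, T_wc: 4x4 world-to-camera pose, hw: image size.
import numpy as np

def render_semantic_mask(points_w, labels, K, T_wc, hw, ignore_label=255):
    H, W = hw
    # World frame -> camera frame.
    pts_h = np.hstack([points_w, np.ones((len(points_w), 1))])
    pts_c = (T_wc @ pts_h.T).T[:, :3]
    front = pts_c[:, 2] > 0.1                        # keep points in front of the camera
    pts_c, lab = pts_c[front], labels[front]
    # Pinhole projection to pixel coordinates.
    uv = (K @ pts_c.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, lab, depth = u[inside], v[inside], lab[inside], pts_c[inside, 2]
    # Crude z-buffer: paint far-to-near so the nearest point wins per pixel.
    order = np.argsort(-depth)
    mask = np.full((H, W), ignore_label, dtype=np.uint8)
    mask[v[order], u[order]] = lab[order]
    return mask
```

Because every frame is rendered from the same labeled map, the same building or road segment automatically keeps the same class across all views.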

3. LiDAR Point Cloud Semantic Annotation

  • Function: Provides semantic annotations for per-frame Livox-Avia LiDAR point clouds.
  • Mechanism:
    • Camera–LiDAR hardware synchronization and calibration are leveraged to project image semantic annotations onto the corresponding LiDAR point clouds.
    • Automatic projection is followed by manual consistency checking and correction.
    • Only the open Livox-Avia point clouds are used, as the DJI-L1 output is encrypted and inaccessible.
  • Design Motivation: Image-to-point-cloud projection efficiently produces initial annotations, which are then refined through manual correction to ensure quality.
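
The image-to-point-cloud transfer can be sketched similarly: each LiDAR point is projected into the synchronized image through the camera-LiDAR calibration and inherits the label of the pixel it lands on. The snippet below is illustrative only (the variable names and the LiDAR-to-camera extrinsic T_cl are assumptions); points that fall outside the image or behind the camera keep an ignore label, and the result is what the manual consistency check then corrects.

```python
# Illustrative sketch: transfer per-pixel semantic labels onto LiDAR points.
# points_l: (N, 3) points in the LiDAR frame, sem_mask: (H, W) label image,
# K: 3x3 intrinsics, T_cl: 4x4 LiDAR-to-camera extrinsic.
import numpy as np

def label_lidar_from_mask(points_l, sem_mask, K, T_cl, ignore_label=255):
    H, W = sem_mask.shape
    pts_h = np.hstack([points_l, np.ones((len(points_l), 1))])
    pts_c = (T_cl @ pts_h.T).T[:, :3]                # LiDAR frame -> camera frame
    z = pts_c[:, 2]
    uv = (K @ pts_c.T).T
    uv = uv[:, :2] / np.maximum(z[:, None], 1e-6)
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    labels = np.full(len(points_l), ignore_label, dtype=np.uint8)
    ok = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels[ok] = sem_mask[v[ok], u[ok]]              # inherit the pixel's class
    return labels
```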

Loss & Training

As this is a dataset paper, no new loss function or training procedure is proposed. The benchmark experiments follow the standard training protocols of each respective task.

Key Experimental Results

Main Results (Image Semantic Segmentation)

Params Architecture Model mIoU ↑
22M Transformer DeiT3-s 67.6
38M Transformer DeiT3-m 68.3
22M Transformer ViT-s 63.9
5M Transformer ViT-t 62.8
25M CNN ResNet-50 61.3
44M CNN ResNet-101 60.7
21M CNN ResNet-34 59.9
28M CNN ConvNeXt-t 55.3
48M CNN MambaOut-s 51.8

All models use UperNet as the segmentation head. Transformer-based models consistently outperform CNN-based models, with DeiT3-m achieving the best mIoU of 68.3%.
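
For reference, the mIoU values in both benchmark tables follow the standard per-class intersection-over-union definition, averaged over classes. A minimal sketch (not the authors' evaluation code) is given below.

```python
# Standard mIoU from a confusion matrix; illustrative, assumes integer label arrays.
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_label=255):
    keep = gt != ignore_label
    pred, gt = pred[keep], gt[keep]
    conf = np.bincount(num_classes * gt + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)
    return iou[union > 0].mean()                     # average over classes that appear
```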

Benchmark Results (LiDAR Semantic Segmentation)

Params Model mIoU ↑ Note
38M MinkUNet 32.7 Voxelized point cloud method
39M SPUNet 34.4 Sparse convolution method
11M PTv2 33.2 Point Transformer method
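
As context for the "Note" column: MinkUNet- and SPUNet-style models first quantize the raw point cloud into sparse voxels and then apply sparse 3D convolutions, whereas PTv2 operates on points directly. A minimal voxelization sketch (illustrative only; the voxel size is an assumption) is shown below.

```python
# Illustrative voxelization step used by sparse-convolution segmentation models:
# quantize points to a voxel grid and keep one representative point per voxel.
import numpy as np

def voxelize(points, labels, voxel_size=0.2):
    coords = np.floor(points / voxel_size).astype(np.int64)   # integer voxel coords
    _, first = np.unique(coords, axis=0, return_index=True)   # one index per voxel
    return coords[first], points[first], labels[first]
```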

LiDAR segmentation mIoU is substantially lower than image segmentation (~33% vs. ~68%), indicating that semantic segmentation of aerial LiDAR point clouds is highly challenging.

Key Findings

  • Transformers outperform CNNs: Even with fewer parameters (e.g., ViT-t at only 5M), Transformer architectures surpass large-parameter CNNs on UAV semantic segmentation.
  • LiDAR segmentation is much harder than image segmentation: The mIoU gap is approximately 35 percentage points, likely attributable to low point cloud density and domain shift from ground-level patterns.
  • Challenges of dynamic object annotation: IoU scores for cars and trucks are notably lower than for static categories, reflecting the difficulty of dynamic object detection in UAV scenes.
  • Small objects such as Solar Panels and Umbrellas: IoU values are extremely low (<10%), representing a primary direction for future improvement.
  • Multi-traversal coverage: The dataset includes multiple passes over the same scenes, enabling temporal tasks such as scene change detection.

Highlights & Insights

  • Uniqueness: The first real-world UAV dataset to provide 6-DoF poses, per-frame image annotations, and per-frame LiDAR annotations simultaneously.
  • Annotation pipeline innovation: Annotating static classes on the 3D map and rendering back to 2D ensures cross-frame consistency while reducing annotation cost.
  • Comprehensive task coverage: A single dataset supports six distinct tasks, providing a unified evaluation platform for UAV perception research.
  • Scale advantage: 120,000+ annotated frames substantially exceed most existing UAV datasets.
  • Open-source and reproducible: Both data and code are publicly available.

Limitations & Future Work

  • Limited environmental diversity: Based on MARS-LVIG, the dataset covers only town, valley, airport, and island scenes, lacking extreme weather, nighttime, and other challenging conditions.
  • DJI-L1 encryption restriction: High-quality DJI-L1 LiDAR data cannot be utilized; only Livox-Avia data are available.
  • Limited dynamic object categories: Only cars and trucks are annotated; pedestrians, bicycles, and other dynamic objects are absent.
  • SfM reconstruction accuracy: Reliance on the commercial DJI Terra tool may yield lower reconstruction accuracy than SLAM under ideal conditions.
  • Annotation scalability: Despite the semi-automatic pipeline, labeling 280,000+ dynamic instances requires substantial human effort, posing a cost challenge for scaling to larger datasets.
Related Work

  • SemanticKITTI: The benchmark dataset for LiDAR semantic segmentation in ground-level autonomous driving; its annotation methodology directly inspired this work.
  • nuScenes / Waymo: Representative multi-modal autonomous driving datasets, both limited to ground-level perspectives.
  • MARS-LVIG: The foundation of this dataset, originally designed solely for SLAM.
  • Key insight: The dataset construction methodology from ground-level autonomous driving is transferred to the UAV domain, leveraging 3D-to-2D annotation rendering to ensure cross-frame consistency.

Rating

  • Novelty: ⭐⭐⭐⭐ — Fills a clear gap in the dataset landscape.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Multi-task benchmarks are comprehensive, though individual task experiments are relatively small in scale.
  • Writing Quality: ⭐⭐⭐⭐ — Comparison tables are clear and related work is thoroughly surveyed.
  • Value: ⭐⭐⭐⭐⭐ — As an infrastructure-level contribution, it holds high value for UAV perception research.