Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark¶
- Conference: AAAI 2026
- arXiv: 2503.06983
- Code: https://github.com/wang-jh18-SVM/Griffin
- Area: 3D Vision / Collaborative Perception
- Keywords: aerial-ground cooperative perception, UAV-vehicle collaboration, 3D object detection, multi-object tracking, collaborative perception dataset
TL;DR¶
This paper presents Griffin, the first aerial-ground cooperative (AGC) 3D perception dataset and benchmark framework, comprising 250+ dynamic scenes (37K+ frames) generated via CARLA-AirSim joint simulation. Griffin features realistic UAV dynamics, variable cruise altitudes (20–60 m), occlusion-aware annotations, and a systematic robustness evaluation protocol.
Background & Motivation¶
State of the Field¶
Collaborative perception has emerged as a critical direction for overcoming the limitations of single-agent systems (occlusion, limited field of view). Major paradigms include:

- V2V (vehicle-to-vehicle): OPV2V, V2V4Real, etc.
- V2I (vehicle-to-infrastructure): DAIR-V2X, V2X-Seq, etc.
- UAV collaboration: CoPerception-UAVs, UAV3D, etc.
Root Cause¶
V2V/V2I systems require large-scale infrastructure investment and widespread vehicular networking, imposing high economic barriers. Aerial-ground cooperation (AGC)—pairing UAVs with ground vehicles—offers a more flexible and cost-effective alternative that can be deployed on demand and provides an unobstructed bird's-eye view. However, AGC perception research is hindered by the lack of high-quality public datasets and benchmarks.
Limitations of Prior Work¶
| Issue | Affected Datasets |
|---|---|
| Idealized communication and localization (noise-free) | UAV3D, AeroCollab3D, Air-Co-Pred, AirV2X |
| Simplified UAV models (fixed orientation/altitude) | V2U-COO, UAV3D, Air-Co-Pred |
| No occlusion-aware annotations | CoPerception-UAVs, UAV3D, AeroCollab3D, AirV2X |
| No tracking IDs | AGC-Drive |
| 2D annotations only | CoPeD |
Critical gap: No existing AGC dataset simultaneously provides occlusion-aware annotations, realistic noise simulation, multi-altitude support, and tracking IDs.
Starting Point¶
This paper constructs Griffin—the first AGC perception dataset to simultaneously support occlusion-aware 3D annotations, realistic UAV dynamics, multi-altitude settings, and simulation of communication interference and localization errors—alongside a unified detection and tracking benchmark framework.
Method¶
Overall Architecture¶
Griffin consists of three components:

1. Dataset: CARLA-AirSim joint simulation → multi-sensor data collection → occlusion-aware annotation
2. Benchmark framework: standardized implementations of four fusion paradigms (early / intermediate BEV-level / intermediate instance-level / late)
3. Evaluation protocol: accuracy + communication efficiency + robustness (latency, packet loss, localization error)
Key Designs¶
1. Data Collection and Scene Diversity¶
Function: Generate synchronized multi-agent data via CARLA-AirSim joint simulation.
Sensor configuration:

- Ground vehicle: 4 wide-FOV RGB cameras (108.8°, 1920×1080) + 80-line LiDAR (10 Hz, vertical FOV −25° to 15°)
- Aerial UAV: 5 downward-facing cameras (SWaP-constrained; no LiDAR)
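For concreteness, the sensor suite can be summarized as a plain configuration dict; the field names below are illustrative, not Griffin's actual schema, and the values follow the description above:

```python
# Illustrative summary of Griffin's sensor suite (hypothetical field names;
# values taken from the paper's description).
SENSORS = {
    "vehicle": {
        "cameras": {"count": 4, "fov_deg": 108.8, "resolution": (1920, 1080)},
        "lidar": {"channels": 80, "rate_hz": 10, "vertical_fov_deg": (-25, 15)},
    },
    "uav": {
        "cameras": {"count": 5, "orientation": "downward"},
        "lidar": None,  # SWaP constraints: small UAVs carry no LiDAR
    },
}
```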
Scene diversity:

- 4 CARLA maps (2 urban + 2 suburban)
- Weather: sunny / rainy / foggy × noon / sunset / night × wind speed 0–9 m/s
- Altitude: Griffin-Random (20–60 m); Griffin-25m/40m/55m (each ±2 m)
- 255 scene clips, ~15 s each; 37.7K frames, 339.3K images, 914.8K 3D annotations in total
UAV dynamics realism: AirSim's physics engine simulates UAV motion, so pitch/roll angles follow smooth distributions centered near zero rather than staying fixed, reflecting the continuous micro-adjustments and wind-resistance corrections of real UAVs.
Design Motivation:

- CARLA provides rich environments and traffic flows; AirSim provides a realistic UAV physics model
- The LiDAR-free UAV configuration mirrors practical deployments (e.g., BYD-DJI solutions, where small UAVs carry payloads <1 kg)
- Variable altitudes and weather conditions test method generalization
2. Occlusion-Aware Annotation¶
Function: Quantify the visibility ratio of each target to each agent and filter out invisible targets.
Mechanism:

1. Collect RGB and instance segmentation images (perfectly aligned)
2. Sample points within each 3D bounding box and project them onto the segmentation image
3. Compare the semantic class and instance ID of projected pixels against those of the target
4. Compute the per-agent visibility ratio
5. Collaborative perception GT: retain targets visible to at least one agent
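A minimal sketch of steps 2–4 and the step-5 GT rule, assuming a pinhole camera model and a simulator-provided instance-ID image; function and variable names are illustrative, not Griffin's actual code:

```python
import numpy as np

def visibility_ratio(box_corners, instance_id, K, T_cam_world, id_image, n_samples=200):
    """Fraction of points sampled in a target's 3D box that project onto
    pixels carrying the target's own instance ID (steps 2-4)."""
    rng = np.random.default_rng(0)
    # Step 2: sample points inside the box (simplified to its axis-aligned span).
    lo, hi = box_corners.min(axis=0), box_corners.max(axis=0)
    pts = rng.uniform(lo, hi, size=(n_samples, 3))
    # Project world points into the camera: world -> camera -> pixel coordinates.
    pts_h = np.hstack([pts, np.ones((n_samples, 1))])
    pts_cam = (T_cam_world @ pts_h.T).T[:, :3]           # T_cam_world: 4x4 extrinsic
    valid = pts_cam[:, 2] > 0.1                          # in front of the camera
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)
    u, v = uv[:, 0].astype(np.int64), uv[:, 1].astype(np.int64)
    h, w = id_image.shape
    valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)     # inside the image
    # Step 3: a sample is visible iff its pixel carries the target's instance ID.
    hits = valid & (id_image[np.clip(v, 0, h - 1), np.clip(u, 0, w - 1)] == instance_id)
    # Step 4: visibility ratio = visible samples / all samples.
    return hits.sum() / n_samples

# Step 5: collaborative GT keeps a target if its ratio exceeds a threshold tau
# for at least one agent, e.g. any(visibility_ratio(...) > tau for each camera).
```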
Design Motivation:

- Many datasets filter annotations by distance only, retaining heavily occluded targets that no agent can actually see
- Skipping occlusion filtering introduces annotation noise from invisible targets, degrading model training quality
- Experimental validation: training without occlusion filtering reduces Early Fusion AP from 0.607 to 0.586
3. Benchmark Framework¶
Function: Implement four fusion paradigms on a unified backbone (BEVFormer + ResNet-50).
Four fusion strategies:
| Fusion Type | Representative Methods | Communication (BPS) | Characteristics |
|---|---|---|---|
| Early Fusion | Raw image transmission | 3.11×10⁸ | Performance upper bound; extremely high bandwidth |
| BEV-level intermediate | V2X-ViT, Where2comm | 3.3–8.0×10⁵ | Scene-level BEV features transmitted after compression |
| Instance-level intermediate | UniV2X, CoopTrack | 0.56–1.17×10⁵ | Sparse object queries; lower bandwidth |
| Late Fusion | Detection result transmission | 1.56×10³ | Extremely low bandwidth; limited performance |
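As a sanity check on the early-fusion figure, transmitting the UAV's five raw 1920×1080 RGB frames at 10 Hz reproduces the reported bandwidth almost exactly (assuming uncompressed 3-byte pixels):

```python
# Back-of-envelope check of the early-fusion bandwidth, assuming the UAV
# transmits 5 uncompressed 1920x1080 RGB images (3 bytes/pixel) at 10 Hz.
bytes_per_image = 1920 * 1080 * 3        # 6,220,800 bytes
bps = bytes_per_image * 5 * 10           # 5 cameras x 10 Hz
print(f"{bps:.2e}")                      # 3.11e+08 -> matches the table
```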
Evaluation protocol:

- Accuracy: nuScenes AP and AMOTA
- Communication efficiency: bytes per second (BPS)
- Robustness: communication latency (0–400 ms), packet loss rate (0–50%), localization error (translation 0–2.5 m, rotation 0–5°)
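A minimal sketch of how the localization perturbation might be injected, assuming Gaussian noise on a collaborator's 4×4 pose with yaw-only rotation error (a simplification; Griffin's exact noise model may differ):

```python
import numpy as np

def perturb_pose(T_world_agent, t_std=1.5, r_std_deg=3.0, seed=None):
    """Inject Gaussian localization noise into a 4x4 agent pose:
    x/y translation error (meters) and yaw error (degrees)."""
    rng = np.random.default_rng(seed)
    noise = np.eye(4)
    noise[:2, 3] = rng.normal(0.0, t_std, size=2)    # translation error
    yaw = np.deg2rad(rng.normal(0.0, r_std_deg))     # heading error
    c, s = np.cos(yaw), np.sin(yaw)
    noise[:2, :2] = [[c, -s], [s, c]]
    return T_world_agent @ noise                      # noisy pose used when warping features
```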
Loss & Training¶
- AdamW optimizer, learning rate \(2\times10^{-4}\), batch size 8
- Distributed training on 4× NVIDIA RTX 3090 GPUs
- Input images downsampled from 1920×1080 to 960×540
- Objects merged into 3 categories (car, pedestrian, two-wheeler)
- Perception range: 102.4 m × 102.4 m region centered on the ego vehicle
Key Experimental Results¶
Main Results¶
Per-Method Performance Across Altitude Datasets¶
| Method | Griffin-25m AP/AMOTA | Griffin-55m AP/AMOTA | Communication (BPS) |
|---|---|---|---|
| No Fusion | 0.375/0.365 | 0.335/0.359 | 0 |
| Early Fusion | 0.607/0.670 | 0.483/0.522 | 3.11×10⁸ |
| V2X-ViT | 0.465/0.508 | 0.350/0.379 | 8.00×10⁵ |
| Where2comm | 0.396/0.406 | 0.317/0.353 | 3.30×10⁵ |
| CoopTrack | 0.479/0.488 | 0.364/0.402 | 1.17×10⁵ |
| UniV2X | 0.419/0.456 | 0.323/0.349 | 5.58×10⁴ |
| Late Fusion | 0.378/0.377 | 0.306/0.332 | 1.56×10³ |
Griffin-Random (Mixed Altitude 20–60 m)¶
| Method | AP | vs. No Fusion |
|---|---|---|
| No Fusion | 0.459 | — |
| Early Fusion | 0.583 | +0.124 |
| V2X-ViT | 0.400 | −0.059 |
| Where2comm | 0.406 | −0.053 |
| CoopTrack | 0.468 | +0.009 |
| UniV2X | 0.402 | −0.057 |
Ablation Study¶
Impact of Occlusion-Aware Annotation (Griffin-25m)¶
| Model | Annotation Type | AP | AMOTA |
|---|---|---|---|
| Early Fusion | Occlusion-aware (baseline) | 0.607 | 0.670 |
| Early Fusion | No filtering | 0.586 (↓) | 0.636 (↓) |
| Vehicle Side | Occlusion-aware | 0.477 | 0.457 |
| Vehicle Side | No filtering | 0.412 (↓) | 0.433 (↓) |
Communication Robustness¶
| Latency (ms) | Early Fusion AP Drop | Intermediate Fusion vs. No Fusion |
|---|---|---|
| 100 | ~10% | Still outperforms No Fusion |
| 200 | ~20% | Detection only marginally better; tracking still clearly improved |
| 400 | >30% | Tracking still maintains an advantage |
Localization Robustness¶
| Translation Error std (m) | V2X-ViT | UniV2X |
|---|---|---|
| 0.5 | Largely unaffected | Largely unaffected |
| 1.5 | Below No Fusion | Still outperforms No Fusion |
| 2.5 | Severe degradation | Still maintains advantage |
Key Findings¶
- Altitude variation has a profound impact on collaborative perception: cooperative gain is greatest at 25 m and degrades with increasing altitude; under mixed altitudes (20–60 m), most intermediate fusion methods perform worse than the single-agent baseline.
- Instance-level fusion is more robust than BEV-level fusion: CoopTrack is the only intermediate fusion method to maintain positive gain on Griffin-Random, as instance-level methods decouple geometric transformations from semantic features, making them more robust to viewpoint inconsistency.
- Where2comm and UniV2X perform poorly in AGC scenarios: UAV bird's-eye views yield sparse object distributions, leaving Where2comm's spatial confidence maps and UniV2X's detection-driven sparse queries under-trained.
- Packet loss has less impact than latency: packet loss only causes information absence (reducing gain), whereas latency introduces spatial misalignment (see the worked example after this list).
- UniV2X is most robust to localization errors: selective fusion and instance-level filtering down-weight unreliable signals.
- Occlusion-aware annotation matters: omitting filtering degrades both collaborative and single-agent model performance.
- Tracking is more robust to latency than detection: temporal information helps mitigate inter-frame alignment issues.
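A quick worked illustration of the latency point above (the speed is hypothetical, chosen for round numbers): a target moving at \(10\) m/s shifts by \(0.4\,\text{s} \times 10\,\text{m/s} = 4\) m during a \(400\) ms delay, so stale collaborator features are fused at the wrong location, whereas a dropped packet merely reverts that frame to single-agent perception.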
Highlights & Insights¶
- "Altitude variation invalidates collaborative perception" is a profound AGC-specific finding: this issue does not arise in V2V/V2I settings but is critical for AGC.
- The occlusion-aware annotation method is concise and effective: leveraging the simulator's instance segmentation ground truth to quantify visibility ratios avoids the cost of manual annotation.
- The robustness evaluation covers aggressive ranges (2.5 m translation / 5° rotation / 400 ms latency / 50% packet loss): far exceeding standard evaluation ranges, revealing true failure boundaries of each method.
- The CARLA-AirSim joint simulation framework cleverly exploits the complementary strengths of both simulators (CARLA's environments + AirSim's UAV physics).
- In-depth comparison of BEV-level vs. instance-level fusion: provides clear guidance on fusion strategy selection for AGC scenarios.
Limitations & Future Work¶
- Simulation-to-real domain gap: despite efforts to approximate reality (LiDAR-free UAVs, noise injection), the sim-to-real gap persists.
- Only car category evaluated: results for pedestrians and two-wheelers are not presented in the main paper.
- Fixed backbone (ResNet-50 BEVFormer): stronger single-agent detectors may alter the relative ranking of fusion methods.
- Differential impact of weather on individual methods is not analyzed: although the dataset covers diverse weather conditions, the analysis does not group results by weather type.
- Height-adaptive and scale-aware fusion mechanisms should be developed to address the core challenge of altitude-induced viewpoint and scale inconsistency.
- More advanced late fusion strategies could be explored to achieve better cost-effectiveness under extremely low bandwidth constraints.
Related Work & Insights¶
- OPV2V (Xu et al., ICRA 2022): pioneering work in V2V collaborative perception; Griffin fills the AGC gap.
- DAIR-V2X (Yu et al., CVPR 2022): real-world V2I dataset, but with cameras mounted at fixed heights (20–25 m).
- BEVFormer (Li et al., 2022): unified backbone used for all baselines.
- V2X-ViT, Where2comm: representative BEV-level intermediate fusion methods; this paper reveals their limitations in AGC scenarios.
- CoopTrack (Zhong et al., ICCV 2025): instance-level fusion proves more robust under altitude variation.
- Insight: AGC scenarios require entirely new fusion design philosophies—V2V/V2I methods cannot be directly transferred; scale and viewpoint inconsistency induced by altitude variation is the core challenge.
Rating¶
- Novelty: ⭐⭐⭐⭐ — AGC datasets are not a wholly new concept, but occlusion-aware annotation and systematic robustness evaluation are notable contributions.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 6 methods × 4 altitudes × 3 perturbation types; evaluation is exceptionally comprehensive.
- Writing Quality: ⭐⭐⭐⭐⭐ — Dataset construction details and experimental analysis are both highly thorough.
- Value: ⭐⭐⭐⭐⭐ — Fills a critical data gap in AGC perception research; the "altitude robustness" finding provides important guidance for future work.