ForestPersons: A Large-Scale Dataset for Under-Canopy Missing Person Detection

  • Conference: ICLR 2026
  • arXiv: 2603.02541
  • Code: https://huggingface.co/datasets/etri/ForestPersons
  • Area: Object Detection
  • Keywords: person detection, forest search and rescue, UAV, occlusion-awareness, dataset

TL;DR

ForestPersons is the first large-scale benchmark dataset specifically designed for under-canopy missing person detection in forest environments (96,482 images + 204,078 annotations). By simulating the low-altitude flight perspective of micro aerial vehicles (MAVs) at 1.5–2.0 meters, the dataset covers multi-season, multi-weather, multi-pose, and multi-occlusion-level conditions representative of real search-and-rescue (SAR) scenarios, providing a solid foundation for training and evaluating under-canopy person detection models.
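The Code link above points to a Hugging Face dataset repository. As a quick orientation, here is a minimal loading sketch assuming the dataset exposes a default configuration through the standard `datasets` API; the split name and record fields are assumptions, so inspect one record for the actual schema.

```python
# Minimal sketch: loading ForestPersons via the Hugging Face datasets API.
# Assumes a default configuration and a "train" split; adjust if the
# repository defines different configurations or split names.
from datasets import load_dataset

ds = load_dataset("etri/ForestPersons", split="train")
print(ds[0])  # inspect one record to discover the actual field names
```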

Background & Motivation

Background: UAVs have been widely deployed in SAR operations, enabling rapid coverage of large open areas. With advances in hardware miniaturization and SLAM technology, MAVs are now capable of safe navigation and exploration in GPS-denied forest environments.

Limitations of Prior Work:

  1. Viewpoint Limitation: Existing SAR datasets (HERIDAL, WiSARD, SARD, VTSaR) are collected from high-altitude nadir or oblique perspectives, where dense forest canopies cause persons to occupy only a few pixels in the image, making detection extremely difficult.
  2. Scene Bias: Ground-level person detection datasets (COCO, CrowdHuman, CityPersons) primarily cover standing or walking individuals in urban environments, which differs substantially from forest SAR scenarios — cases such as lying down, sitting, and vegetation occlusion are rarely represented.
  3. Missing Annotations: No existing dataset simultaneously provides occlusion-level and pose annotations, preventing systematic evaluation of detection capability under varying difficulty conditions.

Key Challenge: The missing persons most critical to detect in forest SAR operations are precisely those in scenarios least covered by existing datasets — beneath the forest canopy, occluded by vegetation, and in non-standing poses.

Goal: Construct ForestPersons, the first large-scale benchmark dataset focused on under-canopy person detection, simulating the low-altitude MAV perspective and supplemented with semantic annotations for pose and visibility, to support the development of detection models suited to real SAR scenarios.

Method

Overall Architecture

The construction pipeline of ForestPersons proceeds as follows: forest environment video capture → frame sampling → bounding box annotation → pose and visibility attribute annotation → face anonymization → difficulty-aware data splitting. The entire pipeline is designed around the core principle of faithfully reproducing real SAR conditions.
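To make the first two pipeline stages concrete, below is a minimal frame-sampling sketch using OpenCV; the stride value is illustrative, as the paper's actual sampling rate is not restated here.

```python
import cv2

def sample_frames(video_path: str, stride: int = 10):
    """Yield every `stride`-th frame from a forest clip (stride is illustrative)."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            yield idx, frame
        idx += 1
    cap.release()
```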

Key Design 1: Multi-Dimensional Data Collection Strategy

Data collection aims to replicate the complexity of real SAR scenarios as closely as possible:

  • Viewpoint Simulation: Handheld or tripod-mounted cameras capture footage at 1.5–2.0 meters height, simulating the low-altitude ground-level perspective of MAVs flying beneath the forest canopy.
  • Pose Diversity: Volunteers enact states of fatigue or disorientation, adopting three poses — standing, sitting, and lying on the ground — while subject to natural occlusion by vegetation, branches, and terrain.
  • Environmental Coverage: Covers four seasons (dense summer canopy vs. leafless winter with snow), multiple weather conditions (sunny/cloudy/light rain), and different times of day (afternoon/dusk).
  • Total Scale: 96,482 images and 204,078 annotated instances sampled from 377 video clips.

Key Design 2: Three-Dimensional Semantic Annotation Scheme

In addition to bounding boxes, each person instance is annotated with two SAR-relevant semantic attributes:

Pose Categories (3 classes):

| Class | Description | SAR Significance |
|---|---|---|
| Standing | Upright | Mild condition / conscious |
| Sitting | Seated | Fatigued / waiting |
| Lying | Prone/supine | Injured / unconscious |

Visibility Levels (4 levels):

| Level | Description | Occlusion |
|---|---|---|
| 100 | Fully visible | No occlusion |
| 70 | Slightly occluded | Most of body clearly visible |
| 40 | Partially occluded | Person identifiable but notably occluded |
| 20 | Heavily occluded | Barely identifiable |

When pose is difficult to determine due to occlusion, annotators refer to adjacent video frames for decision-making.
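For concreteness, a single annotated instance might look like the following COCO-style record; the attribute keys ("pose", "visibility") are assumptions for illustration, so consult the released annotation files for the actual schema.

```python
# One hypothetical annotated instance combining all three annotation
# dimensions. Key names are illustrative, not the released schema.
annotation = {
    "image_id": 12345,
    "category_id": 1,                      # person
    "bbox": [412.0, 180.0, 96.0, 220.0],   # [x, y, width, height] in pixels
    "pose": "lying",                       # standing | sitting | lying
    "visibility": 40,                      # 100 | 70 | 40 | 20
}
```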

Key Design 3: Model-Driven Difficulty-Aware Data Splitting

Data is split at the video-sequence level to prevent temporal leakage between adjacent frames across splits. The splitting strategy is based on model-driven difficulty estimation (a minimal sketch follows the list):

  1. A COCO-pretrained Faster R-CNN computes \(AP_{50}\) on each video sequence.
  2. The difficulty score is defined as \(1 - AP_{50}\).
  3. Sequences are grouped into three difficulty tiers: easy (\(< 0.45\)), medium (\(0.45 \le \text{score} < 0.75\)), and hard (\(\ge 0.75\)).
  4. Sequences are proportionally distributed across train/validation/test splits.
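A minimal sketch of this tiering-and-splitting logic, assuming hypothetical per-sequence AP₅₀ values and illustrative split ratios (the paper's exact ratios are implied only by the final split sizes below):

```python
import random
from collections import defaultdict

# Hypothetical per-sequence AP50 values; in practice these come from
# running a COCO-pretrained Faster R-CNN on each video sequence.
seq_ap50 = {"clip_001": 0.62, "clip_002": 0.18, "clip_003": 0.41}

def tier(ap50: float) -> str:
    """Map a sequence's difficulty score (1 - AP50) to a tier."""
    score = 1.0 - ap50
    if score < 0.45:
        return "easy"
    if score < 0.75:
        return "medium"
    return "hard"

def split_sequences(seq_ap50, ratios=(0.7, 0.2, 0.1), seed=0):
    """Proportionally distribute each difficulty tier across train/val/test."""
    rng = random.Random(seed)
    tiers = defaultdict(list)
    for seq, ap in seq_ap50.items():
        tiers[tier(ap)].append(seq)
    splits = {"train": [], "val": [], "test": []}
    for seqs in tiers.values():
        rng.shuffle(seqs)
        n_train = round(ratios[0] * len(seqs))
        n_val = round(ratios[1] * len(seqs))
        splits["train"] += seqs[:n_train]
        splits["val"] += seqs[n_train:n_train + n_val]
        splits["test"] += seqs[n_train + n_val:]
    return splits
```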

Final split: training set — 67,686 images + 145,816 annotations; validation set — 18,243 images + 37,395 annotations; test set — 10,553 images + 20,867 annotations.

Key Experimental Results

Main Results: Transfer Performance of Existing Datasets to Under-Canopy Scenarios

Faster R-CNN models trained on different datasets are evaluated on the ForestPersons test set to quantify the inadequacy of existing data:

| Training Data | Type | Own Test AP | ForestPersons AP | ForestPersons AP₅₀ |
|---|---|---|---|---|
| SARD | SAR / nadir | 58.6 | 3.0 | 7.8 |
| HERIDAL | SAR / nadir | 35.0 | 0.2 | 0.3 |
| WiSARD | SAR / oblique | 18.5 | 11.3 | 29.0 |
| COCO-Person | ground / urban | 54.0 | 40.8 | 66.9 |
| CrowdHuman | ground / urban | 39.4 | 31.9 | 58.8 |
| CityPersons | ground / urban | 38.7 | 5.9 | 15.1 |

All SAR datasets achieve AP below 12% on ForestPersons, confirming that high-altitude perspective data cannot generalize to under-canopy scenarios. Among ground-level datasets, COCO performs best (AP = 40.8), yet still exhibits significant performance degradation, underscoring the domain gap between urban and forest environments.
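For reference, AP and AP₅₀ numbers like these are conventionally computed with pycocotools; a minimal evaluation sketch follows (file names are placeholders, not the paper's actual evaluation code):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholders: a COCO-format ground-truth file and a detector's results.
coco_gt = COCO("forestpersons_test.json")
coco_dt = coco_gt.loadRes("detections.json")

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()
ap, ap50 = ev.stats[0], ev.stats[1]  # AP@[.50:.95] and AP50
```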

Baseline Results: Multi-Detector Performance on ForestPersons

| Detection Model | Backbone | AP | AP₅₀ | AP₇₅ | AR |
|---|---|---|---|---|---|
| SSD | MobileNetV2 | 45.0 | 83.6 | 43.1 | 53.7 |
| YOLOv3 | YOLO | 50.2 | 86.5 | 53.9 | 58.6 |
| YOLOX | YOLO | 51.0 | 89.0 | 54.4 | 58.2 |
| DETR | Transformer | 53.9 | 88.7 | 59.4 | 67.9 |
| RetinaNet | ResNet-50 | 64.2 | 93.9 | 74.4 | 70.9 |
| Faster R-CNN | ResNet-50 | 64.4 | 92.7 | 75.4 | 70.0 |
| DINO | Transformer | 65.3 | 94.0 | 76.2 | 77.7 |
| YOLOv11 | YOLO | 65.6 | 93.4 | 75.6 | 71.7 |
| CZ Det | Cascade-Zoom | 65.6 | 96.1 | 77.9 | 71.6 |
| Deformable R-CNN | ResNet-50 | 66.3 | 93.4 | 77.5 | 71.3 |

Deformable R-CNN achieves the highest overall AP (66.3), though the best-performing model varies by metric: DINO leads in AR (77.7), the more critical metric for recall-oriented SAR operations, while CZ Det achieves the highest AP₅₀ and AP₇₅.

Ablation Study: Impact of Attributes on Detection Performance

| Training Poses | Standing AP | Sitting AP | Lying AP |
|---|---|---|---|
| Standing only | 45.3–60.1 | 30.0–44.5 | 31.7–46.0 |
| All poses | 49.3–65.5 | 50.6–65.7 | 47.5–65.1 |

Training exclusively on Standing data leads to severe performance degradation for Sitting and Lying detection (approximately −20 AP). Training with all pose categories yields substantial improvements across all three pose classes, confirming the necessity of multi-pose data coverage.
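A minimal sketch of how such a pose-restricted training subset could be built from a COCO-style annotation file; the "pose" key is an assumption, as above.

```python
import json

# Build a "Standing only" training subset from a COCO-style annotation file.
# The attribute key "pose" is an assumption; adapt to the released schema.
def filter_by_pose(ann_path: str, keep=("standing",)) -> dict:
    with open(ann_path) as f:
        coco = json.load(f)
    coco["annotations"] = [
        a for a in coco["annotations"] if a.get("pose") in keep
    ]
    kept_ids = {a["image_id"] for a in coco["annotations"]}
    coco["images"] = [im for im in coco["images"] if im["id"] in kept_ids]
    return coco
```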

Correlation between Visibility Level and Detection Performance: Detection accuracy increases consistently with visibility level (from level 20 to level 100), validating that the difficulty gradient design in ForestPersons aligns with realistic SAR conditions.

Highlights & Insights

Strengths

  1. Filling a Critical Gap: ForestPersons is the first large-scale person detection dataset focused on the under-canopy perspective; its 96K+ image scale exceeds the largest prior SAR dataset (WiSARD, 44K) by more than twofold.
  2. Comprehensive Annotation: The three-dimensional annotation scheme — bounding boxes, pose, and visibility — provides a unique foundation for systematically studying occlusion robustness.
  3. Thorough Experiments: Beyond multi-detector benchmarking, cross-dataset transfer experiments quantitatively demonstrate the necessity of the proposed dataset.

Limitations & Future Work

  1. Data collection relies on volunteers simulating SAR scenarios, which may introduce distributional bias relative to the actual appearance and pose of real missing persons.
  2. Only RGB data is provided (a thermal infrared variant, ForestPersonsIR, is briefly mentioned in the appendix), leaving the potential of multimodal fusion largely unexplored.
  3. The best detector achieves only 66.3% AP, indicating substantial room for improvement, yet the paper proposes no targeted detection method to address this challenge.

Rating

⭐⭐⭐⭐ — As a dataset paper, ForestPersons excels in problem definition clarity, data scale, and annotation quality, opening an important research direction for computer vision in forest search-and-rescue operations.