
SmartWilds: Multimodal Wildlife Monitoring Dataset

Conference: NeurIPS 2025
arXiv: 2509.18894
Code: N/A (dataset available on HuggingFace under a CC0-1.0 license)
Area: Dataset, Multimodal Learning, Wildlife Monitoring, Conservation Ecology
Keywords: Multimodal dataset, sensor fusion, drone imagery, camera traps, bioacoustics, wildlife monitoring

TL;DR

This work introduces SmartWilds, the first synchronously collected multimodal wildlife monitoring dataset, integrating three complementary modalities — drone imagery, camera traps, and bioacoustics — comprising 101 GB of data. Cross-modal alignment is achieved via GPS coordinates and timestamps. The dataset establishes a reproducible standard protocol for conservation monitoring, filling the gap in comprehensive multi-sensor fusion benchmarks for ecosystem-scale ecological research.

Background & Motivation

Background: Conservation biology is undergoing an automated monitoring revolution driven by multi-sensor systems, edge AI, and computer vision. Drones, camera traps, and bioacoustic recorders are each advancing rapidly, and datasets such as MammAlps (multi-view video + audio), BuckTales (multi-drone tracking), and KABR (drone-based behavior recognition) have been established, though each centers on a single platform or task.

Limitations of Prior Work: Existing datasets predominantly focus on specific taxonomic groups or single behavioral tasks (e.g., species identification or behavior classification in isolation). Synchronized cross-modal data are largely absent — camera traps cover only fixed points, acoustic sensors detect only vocalizing species, and drones are limited by battery endurance. No single sensor can comprehensively characterize an ecosystem, nor can current resources support the development and evaluation of multimodal fusion algorithms.

Key Challenge: Conservation management requires comprehensive, multi-scale ecosystem understanding spanning spatial coverage, temporal continuity, species diversity, and behavioral detail. However, the field lacks a standard benchmark dataset that spatiotemporally synchronizes multiple sensing modalities, leaving multimodal fusion methods without a validation platform for conservation scenarios.

Goal: To construct a synchronously collected multimodal wildlife monitoring dataset comprising three complementary modalities — drones, camera traps, and bioacoustics — accompanied by complete GPS/timestamp metadata for cross-modal association, along with a reproducible and scalable deployment protocol.

Key Insight: The Wilds conservation center was selected as the pilot site, specifically a 220-acre enclosure housing a GPS-tagged endangered elk population. This site combines the ecological complexity of a real conservation scenario with the data quality assurance afforded by a controlled environment, and lays the groundwork for future integration of GPS tracking data.

Core Idea: By synchronously deploying three complementary sensing modalities (drone + camera trap + acoustics) within the same spatiotemporal context, and linking them into a unified multimodal dataset via GPS and timestamp metadata, the work fills the multi-sensor fusion benchmark gap in conservation AI through a standardized collection protocol.

Method

Overall Architecture

SmartWilds adopts a "three-modality synchronized deployment with metadata association" data collection architecture. Within the 220-acre enclosure at The Wilds conservation center, four camera traps (fixed visual monitoring), four bioacoustic monitors (continuous/scheduled audio recording), and a Parrot ANAFI drone (flexible aerial surveying) were simultaneously deployed over four consecutive days (June 30 – July 3, 2025). All sensors share GPS coordinates and precise timestamps; cross-modal temporal calibration is further achieved through dedicated synchronization flights in which the drone operates within camera trap fields of view. The resulting dataset contains 20K+ files totaling 101 GB, encompassing photographs, videos, audio recordings, and comprehensive environmental/deployment metadata.
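To make the metadata-driven design concrete, here is a minimal sketch of how a per-file record might be normalized for cross-modal lookup. The field names and coordinate values are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SensorRecord:
    """One file from any modality, normalized for cross-modal lookup.

    Field names are illustrative; the actual metadata schema is defined
    by the dataset release, not reproduced here."""
    modality: str          # "camera_trap" | "bioacoustics" | "drone"
    site_id: str           # e.g. one of the eight sites TW01..TW08
    lat: float             # GPS latitude of the deployment
    lon: float             # GPS longitude of the deployment
    start: datetime        # capture start timestamp
    path: str              # relative path to the media file

# Example: two records from different modalities that a fusion method
# could associate because they overlap in space and time.
# Coordinates and paths below are made up for illustration.
cam = SensorRecord("camera_trap", "TW03", 39.829, -81.733,
                   datetime(2025, 7, 1, 6, 15), "cam/TW03/IMG_0412.JPG")
mic = SensorRecord("bioacoustics", "TW03", 39.829, -81.733,
                   datetime(2025, 7, 1, 6, 14), "audio/TW03/0614.wav")
```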

Key Designs

  1. Three-Modality Complementary Sensor Network

    • Function: Achieves comprehensive ecosystem monitoring through complementarity across three dimensions — spatial coverage, temporal continuity, and species detectability.
    • Mechanism: Camera traps excel at fixed-point species identification (~30 m field of view, event-triggered, operable for weeks); bioacoustic monitors excel at temporally continuous coverage and detection of cryptic species (~100 m range, 48 kHz high-quality recordings); drones excel at landscape-scale, high-resolution behavioral observation (sub-meter resolution, 30–60 fps video). The three modalities form a complementary matrix across spatial range, temporal granularity, and species detectability.
    • Design Motivation: Each individual sensor has inherent blind spots — camera traps have fixed fields of view and limited video information, acoustics cannot capture visual behavior, and drones are constrained by battery life. Only by synchronously fusing all three modalities can a complete ecosystem portrait be constructed, which is a prerequisite for multimodal fusion research.
  2. GPS/Timestamp-Based Cross-Modal Alignment Mechanism

    • Function: Enables precise spatiotemporal association across sensors, allowing the same wildlife event to be cross-validated and jointly analyzed across modalities.
    • Mechanism: All devices record GPS coordinates and deployment timestamps. The drone executes dedicated synchronization flights within camera trap fields of view, such that camera trap scenes are simultaneously visible in drone footage, enabling temporally aligned pixel-level correspondence. Comprehensive metadata (habitat type, environmental conditions, researcher field observations) provides semantic context for fusion.
    • Design Motivation: Accurate alignment is the foundation of multimodal fusion: temporal offsets between sensors would prevent fusion algorithms from correctly associating the same event. The synchronization flights provide a dual visual-temporal verification channel, the critical mechanism for turning three independent data streams into a genuinely multimodal dataset (a minimal alignment sketch follows this list).
  3. Standardized Deployment Protocol for Reproducibility and Scalability

    • Function: Establishes systematic standards for site selection, sensor configuration, and metadata recording, enabling the protocol to be replicated at other conservation sites.
    • Mechanism: Site selection is based on animal activity patterns and habitat diversity (camera traps prioritized at high-activity areas such as water sources and salt licks; acoustic monitors covering varied acoustic environments). Sensor configuration is documented with explicit specifications (GardePro T5NG, Song Meter Mini parameter settings). The metadata framework includes complete environmental descriptions and deployment rationale (each of the eight sites, TW01–TW08, is documented with detailed site selection justification and habitat characteristics).
    • Design Motivation: The value of a pilot study lies not only in the data collected but in the transferability of the methodology. Only through standardized protocols can data collection be extended across multiple sites and seasons, and can other research groups replicate the conservation monitoring network.
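Continuing the SensorRecord sketch above, here is a hedged sketch of the kind of spatiotemporal matching that the GPS/timestamp metadata enables. The 30-second and 100-meter tolerances are arbitrary illustrative choices, not values from the paper.

```python
from datetime import timedelta
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two GPS points."""
    R = 6_371_000.0
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 \
        + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

def candidate_matches(records, anchor,
                      max_dt=timedelta(seconds=30), max_m=100.0):
    """Return records from *other* modalities plausibly covering the same
    event as `anchor`: within max_dt in time and max_m meters in space.
    Assumes the SensorRecord dataclass from the earlier sketch."""
    return [
        r for r in records
        if r.modality != anchor.modality
        and abs((r.start - anchor.start).total_seconds())
            <= max_dt.total_seconds()
        and haversine_m(r.lat, r.lon, anchor.lat, anchor.lon) <= max_m
    ]
```

In practice the dedicated synchronization flights would let researchers verify (and if needed correct) clock offsets before running this kind of window-based matching.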

Loss & Training

This paper presents a dataset contribution and does not involve model training. Key design decisions in the data collection strategy include:

  • Camera traps: Motion-triggered hybrid photo/video mode, preferentially deployed in high-activity areas.
  • Bioacoustics: Dual sampling strategy; half of the devices record 5 minutes per hour (to capture ungulate vocalizations throughout the day), while the other half record during dawn/dusk windows (to capture bird call diversity). A scheduling sketch follows this list.
  • Drone: Mixed mission mode combining systematic area surveys and opportunistic behavioral tracking, supplemented by dedicated cross-modal synchronization calibration flights.
  • Future plans: A human-in-the-loop annotation pipeline supported by active learning and citizen science.
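A minimal sketch of the dual acoustic sampling schedule described above. The 5-minutes-per-hour rule comes from the paper; the specific dawn/dusk window times are assumptions for illustration.

```python
from datetime import datetime, time

def should_record(device_group: str, now: datetime) -> bool:
    """Approximate the dual acoustic sampling strategy.

    Group "hourly" records the first 5 minutes of every hour; group
    "crepuscular" records during dawn/dusk windows. The exact window
    times below are assumptions, not taken from the paper."""
    if device_group == "hourly":
        return now.minute < 5
    if device_group == "crepuscular":
        dawn = (time(5, 0), time(8, 0))    # assumed dawn window
        dusk = (time(19, 0), time(22, 0))  # assumed dusk window
        t = now.time()
        return dawn[0] <= t <= dawn[1] or dusk[0] <= t <= dusk[1]
    raise ValueError(f"unknown device group: {device_group}")

# e.g. should_record("hourly", datetime(2025, 7, 1, 14, 3)) -> True
```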

Key Experimental Results

Main Results

Dataset overview:

| Modality | Data Type | Total Files | Size (GB) |
|---|---|---|---|
| Camera Traps | Photos and videos | 20,014 | 49 |
| Bioacoustics | Audio recordings | 311 | 6 |
| Drone | Aerial video + metadata | 20 | 46 |
| Total | All modalities | ~20K | 101 |
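Since the release is hosted on HuggingFace (per the header), fetching it would presumably look like the following. The repository id is a placeholder, not the actual location; check the official dataset page.

```python
# pip install huggingface_hub
from huggingface_hub import snapshot_download

# Repository id below is a hypothetical placeholder -- consult the
# paper / HuggingFace page for the dataset's actual repo id.
local_dir = snapshot_download(
    repo_id="<org>/SmartWilds",
    repo_type="dataset",
)
print(local_dir)  # path to the downloaded 101 GB snapshot
```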

Ablation Study

Sensor modality performance comparison (Table 2, qualitative evaluation):

| Metric | Camera Traps | Bioacoustics | Drone | GPS Tags (Future) |
|---|---|---|---|---|
| Spatial Range | Fixed, ~30 m radius | Fixed, ~100 m radius | Mobile, ~2 km | Whole domain |
| Spatial Resolution | High within FOV | Moderate, directional | Sub-meter | 1–10 m |
| Temporal Range | Weeks–months | Weeks–months | Hours/mission | Months–years |
| Temporal Resolution | Event-triggered, <1 s | Continuous/scheduled | 30–60 fps | Hourly |
| Species Detectability | Large visible species | Cryptic/vocalizing species | Large mammals | Tagged individuals only |
| Behavioral Detail | Limited within-frame interactions | Vocal behavior | High: posture/interaction | Movement patterns only |
| Deployment Effort | Low–medium | Low–medium | High (active operation) | Low (post-deployment) |
| Data Volume | Moderate | Moderate–high | High | Low |
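As a usage illustration of the sensor-selection decision framework this matrix supports, a few of its dimensions could be encoded programmatically. The numeric ranges below are the table's nominal values; the encoding itself is an assumption for illustration, not an artifact shipped with the dataset.

```python
# Qualitative capability matrix (subset of Table 2), encoded so a
# planner can filter modalities by a coverage requirement.
CAPABILITIES = {
    "camera_traps": {"spatial_range_m": 30,
                     "temporal_range": "weeks-months",
                     "behavioral_detail": "limited"},
    "bioacoustics": {"spatial_range_m": 100,
                     "temporal_range": "weeks-months",
                     "behavioral_detail": "vocal only"},
    "drone":        {"spatial_range_m": 2000,
                     "temporal_range": "hours/mission",
                     "behavioral_detail": "high"},
}

def sensors_covering(min_range_m: float) -> list[str]:
    """Pick modalities whose nominal spatial range meets a coverage need."""
    return [name for name, cap in CAPABILITIES.items()
            if cap["spatial_range_m"] >= min_range_m]

# e.g. sensors_covering(100) -> ["bioacoustics", "drone"]
```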

Key Findings

  • Minimal animal disturbance: Elk initially exhibited curiosity toward the drone but overall behavioral disruption was negligible, validating the feasibility of non-invasive monitoring.
  • Modality complementarity confirmed: Camera traps excel at species identification, acoustics provide temporally continuous coverage, and drones offer landscape-scale perspectives — the three modalities collectively span distinct monitoring dimensions.
  • Limited utility of camera trap video: The pilot revealed that camera trap video provides substantially less information than drone footage, prompting protocol adjustments in subsequent deployments (including co-located acoustic and camera trap pairs for capability comparison).
  • Clear behavioral patterns during breeding season: Territorial males vocalized frequently and herds congregated at water sources during hot weather, demonstrating the capacity of multimodal data to capture complex ecological behaviors.
  • Technical challenges identified: Limited GPS signal in remote areas affected synchronization precision for some data; weather conditions impacted acoustic quality; the absence of ready-made mounting infrastructure required innovative attachment solutions.

Highlights & Insights

  1. Filling the multimodal conservation dataset gap: SmartWilds is the first ecological dataset to synchronously collect three sensing modalities within the same spatiotemporal context and link them through metadata, providing the first standard benchmark for multimodal fusion in conservation AI.
  2. Systematic complementarity analysis framework: An eight-dimension qualitative comparison of the capability matrix across four sensing technologies provides researchers with an evidence-based decision framework for selecting and combining sensors.
  3. Practice-driven iterative design: Issues identified during the pilot (e.g., limited utility of camera trap video) were directly fed back into protocol refinements, demonstrating a scientific engineering cycle of deploy–evaluate–improve.
  4. Scalable architecture oriented toward the future: The pilot site was selected to include a GPS-tagged elk population, preserving an interface for future releases integrating individual tracking data, reflecting forward-looking planning for long-term dataset development.

Limitations & Future Work

  • Limited data scale: Only four days of collection at a single enclosure and single site; spatiotemporal coverage is insufficient for training large-scale models or capturing seasonal variation.
  • No annotations: The current release contains no object detection, classification, or behavioral annotations; supervised learning applications require additional annotation effort.
  • No baseline models: The release is purely a dataset contribution with no model baseline performance reported (detection/classification/fusion), reducing its immediate usability as a benchmark.
  • Modality imbalance: With only 311 acoustic files and 20 drone video clips compared to 20K+ camera trap files, multimodal learning may face severe sample imbalance.
  • Limited synchronization precision: Coarse-grained alignment relying on GPS and timestamps is subject to GPS signal instability in remote areas; frame-accurate cross-modal alignment remains challenging.
  • Future directions: Integration of GPS tracking data, citizen science imagery, and satellite/weather data; expansion to multiple habitats, seasons, and sites; development of real-time edge AI systems with adaptive sampling.
Related Dataset Comparison

  • MammAlps: Multi-view video with synchronized audio for behavioral analysis of Alpine wild mammals, demonstrating the value of multimodal approaches in behavioral research.
  • BuckTales: Multi-drone tracking for individual identification and re-identification of wild ungulates, advancing individual-level animal monitoring.
  • KABR: Drone-based behavior recognition of Kenyan wildlife, contributing to conservation AI applications in African ecosystems.
  • PanAf-FGBG: Investigation of how environmental background influences wildlife behavior recognition, providing insights into foreground–background interaction.
  • Unique positioning of this work: Rather than targeting a single task, SmartWilds designs a comprehensive multi-sensor synchronized collection protocol and data framework oriented toward a "conservation digital twin."

Rating

  • Novelty: ⭐⭐⭐⭐ — The first synchronously collected three-modality conservation ecology dataset, filling the gap in multimodal wildlife monitoring benchmarks with clear conceptual differentiation.
  • Technical Depth: ⭐⭐⭐ — The core contribution is the dataset and collection protocol; technical analysis remains primarily at the level of qualitative modality comparison, without fusion algorithm design or model baselines.
  • Experimental Thoroughness: ⭐⭐⭐ — The pilot-stage data are limited in scale (4 days, single site), with no annotations or baselines, representing a relatively modest contribution for the NeurIPS dataset track.
  • Writing Quality: ⭐⭐⭐⭐ — Dataset documentation is clear and complete, with thorough justification for site selection and sensor configuration details, ensuring strong reproducibility.