Web-Scale Collection of Video Data for 4D Animal Reconstruction¶
- Conference: NeurIPS 2025
- arXiv: 2511.01169
- Code: https://github.com/briannlongzhao/Animal-in-Motion
- Area: Video Understanding / 3D Vision
- Keywords: 4D animal reconstruction, data pipeline, YouTube video mining, benchmark dataset, single-view reconstruction
TL;DR¶
This paper proposes a fully automated large-scale video data collection pipeline that mines and processes 30K animal videos (2M frames) from YouTube, establishes the first 4D quadruped animal reconstruction benchmark Animal-in-Motion (230 sequences / 11K frames), and introduces a baseline method 4D-Fauna that achieves model-free 4D reconstruction via sequence-level optimization.
Background & Motivation¶
Visual analysis of animal morphology and motion has important applications in wildlife conservation, biomechanics, and robotics. Traditional approaches rely on expensive multi-view controlled environments or marker-based systems. Recent single-view methods (pose estimation, tracking, 3D/4D reconstruction) have made progress, but are severely constrained by data scale.
Existing animal video datasets suffer from three critical issues: (1) extremely small scale—the largest, APT-36K, contains only 2.4K clips of 15 frames each; (2) lack of object-centric crops—raw videos may contain multiple overlapping animals with no segmentation masks; (3) lack of essential preprocessing—no auxiliary annotations (keypoints, optical flow, depth, etc.) prepared for 3D/4D reconstruction tasks. The only dataset genuinely suitable for 4D animal reconstruction, BADJA, contains merely 11 videos.
The root cause lies in the tension between the data demands of data-driven methods and the prohibitive cost of collecting and annotating animal videos. The paper resolves this by leveraging the vast video resources on YouTube to build a fully automated collection–processing–annotation pipeline.
Method¶
Overall Architecture¶
The pipeline consists of four stages: (1) searching and downloading raw videos from YouTube; (2) video preprocessing (shot segmentation, CLIP filtering); (3) animal detection and tracking (Grounded-SAM-2) to generate object-centric crops; and (4) feature extraction (keypoints, DINO features, optical flow, depth maps, occlusion boundaries). The entire pipeline is coordinated through a central database and supports multi-process parallelism.
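The database-driven coordination could be sketched as follows. This is a minimal illustration assuming an SQLite table that records each video's current stage; the stage names, schema, and helper functions are hypothetical, not the authors' implementation.

```python
# Sketch of a database-coordinated stage pipeline (hypothetical schema).
import sqlite3

# Assumed stage names; the paper only describes four stages at a high level.
STAGES = ["downloaaded", "preprocessed", "tracked", "features_extracted"]
STAGES = ["downloaded", "preprocessed", "tracked", "features_extracted"]

def init_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE videos (id TEXT PRIMARY KEY, stage TEXT)")
    return db

def claim_next(db, stage):
    """Fetch one video waiting at the stage before `stage` (simplified; a
    real multi-process setup would need claiming/locking semantics)."""
    prev = STAGES[STAGES.index(stage) - 1]
    row = db.execute("SELECT id FROM videos WHERE stage=?", (prev,)).fetchone()
    return row[0] if row else None

def advance(db, video_id, stage):
    db.execute("UPDATE videos SET stage=? WHERE id=?", (stage, video_id))

db = init_db()
db.execute("INSERT INTO videos VALUES ('yt_abc123', 'downloaded')")
vid = claim_next(db, "preprocessed")
advance(db, vid, "preprocessed")
```

In the paper's setting, each worker process would repeatedly claim a video at its stage, process it, and advance its status, which is what lets the stages run in parallel.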
Key Designs¶
- Intelligent Search Query Generation:
- Given an animal category (e.g., "horse"), GPT is used to generate sub-breeds (Clydesdale, Mustang) and contextual phrases (racing competition, in a farm).
- These are randomly combined into diverse search queries to maximize video diversity.
- Selenium and pytube are used for search and download.
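The random-combination step might look like the following sketch, where `sub_breeds` and `contexts` stand in for hypothetical GPT outputs for the category "horse":

```python
import random

# Hypothetical outputs of the GPT prompting step for the category "horse".
sub_breeds = ["Clydesdale", "Mustang", "Arabian horse"]
contexts = ["racing competition", "in a farm", "running in a field"]

def make_queries(category, breeds, contexts, n, seed=0):
    """Randomly pair a breed (or the base category) with a context phrase."""
    rng = random.Random(seed)
    pool = [category] + breeds
    return [f"{rng.choice(pool)} {rng.choice(contexts)}" for _ in range(n)]

queries = make_queries("horse", sub_breeds, contexts, n=5)
```

Each resulting query (e.g. "Clydesdale racing competition") targets a different slice of YouTube, which is what drives the diversity of the collected videos.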
- Multi-Level Filtering and Tracking:
- Shot segmentation: PySceneDetect is used to detect scene transitions via pixel-level changes, preventing cross-shot tracking confusion.
- CLIP filtering: CLIPScore between frames and category text is computed; low-scoring clips are discarded.
- Grounded-SAM-2 tracking: grounding is re-applied iteratively during tracking so that instance identities persist over long clips.
- Multi-level filtering:
- Overlapping instance filtering (IoU threshold to remove frames with multiple overlapping animals)
- Low-resolution filtering (bounding box area < 1/4 of crop size)
- Truncated instance filtering (bounding box near frame boundary)
- Inconsistent trajectory filtering (abrupt IoU changes between adjacent frames to detect identity switches)
- Object-centric crops: Square crops centered on the bounding box, smoothed via moving average.
- GPT-based final visual verification to remove false detections and severely occluded instances.
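The geometric filters above can be sketched as a single per-frame check. The thresholds below are illustrative placeholders, not the paper's values:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def keep_frame(boxes, frame_w, frame_h, crop_size,
               overlap_thr=0.1, margin=5, min_area_frac=0.25):
    """Apply the overlap / low-resolution / truncation filters to one frame.
    All thresholds are illustrative, not the paper's values."""
    for i, b in enumerate(boxes):
        # Truncated instance: box touches the frame boundary.
        if (b[0] < margin or b[1] < margin
                or b[2] > frame_w - margin or b[3] > frame_h - margin):
            return False
        # Low resolution: box area below a fraction of the crop area.
        if (b[2] - b[0]) * (b[3] - b[1]) < min_area_frac * crop_size ** 2:
            return False
        # Overlapping instances: any pair of boxes with high IoU.
        for b2 in boxes[i + 1:]:
            if iou(b, b2) > overlap_thr:
                return False
    return True
```

The trajectory-consistency filter would be the temporal analogue: compare each box with the previous frame's box for the same track and drop frames where the IoU drops abruptly.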
- Comprehensive Feature Extraction Module:
- ViTPose++: animal keypoint estimation
- DINOv2: image features
- SEA-RAFT: optical flow estimation
- Depth Anything V2: depth estimation
- Occlusion boundaries: computed from depth discontinuities at mask boundaries
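The occlusion-boundary step could be approximated as follows: mark mask-boundary pixels where the predicted depth changes sharply. This is a hedged sketch of the idea, not the paper's exact procedure; the threshold is a placeholder.

```python
import numpy as np

def occlusion_boundary(depth, mask, depth_thr=0.1):
    """Mark mask-boundary pixels whose depth gradient exceeds a threshold.
    `depth` is an (H, W) depth map, `mask` an (H, W) instance mask."""
    m = mask.astype(bool)
    # Mask boundary: mask pixels with at least one background 4-neighbor.
    pad = np.pad(m, 1)
    nb_bg = (~pad[:-2, 1:-1]) | (~pad[2:, 1:-1]) | (~pad[1:-1, :-2]) | (~pad[1:-1, 2:])
    boundary = m & nb_bg
    # Depth discontinuity: finite-difference gradient magnitude.
    gy, gx = np.gradient(depth)
    grad = np.hypot(gx, gy)
    return boundary & (grad > depth_thr)
```

Pixels flagged this way indicate where the animal occludes (or is occluded by) the background, a cue the reconstruction losses can use to ignore unreliable silhouette edges.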
- 4D-Fauna Baseline Method:
- Built upon 3D-Fauna (a model-free approach) with sequence-level optimization.
- Keypoint supervision: 2D keypoints introduced as part-level constraints to resolve leg ordering ambiguity.
- Temporal smoothing loss: Regularization penalizing frame-to-frame changes in camera pose parameters and the joint velocities of the animal pose.
- Efficient per-sequence overfitting: Camera pose and joint parameters are directly optimized per frame, initialized from pretrained network outputs.
Loss & Training¶
4D-Fauna uses the original inverse rendering losses from 3D-Fauna (mask IoU + DINO feature matching), augmented with a keypoint reprojection loss and temporal smoothing regularization terms on camera pose and joint velocity. Optimization is performed per sequence on top of the pretrained model.
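A schematic of the combined objective is given below. The weights `w_kp`, `w_cam`, `w_joint` and the exact loss forms are placeholders; the paper's formulation is not reproduced here, and `recon_loss` stands in for 3D-Fauna's mask-IoU and DINO feature terms.

```python
import numpy as np

def keypoint_loss(pred_kps, gt_kps, vis):
    """Mean reprojection error over visible 2D keypoints (sketch)."""
    err = np.linalg.norm(pred_kps - gt_kps, axis=-1)
    return (err * vis).sum() / max(vis.sum(), 1)

def temporal_smoothness(params):
    """Penalize frame-to-frame changes in a (T, D) parameter sequence,
    e.g. camera pose parameters or joint angles."""
    diffs = np.diff(params, axis=0)
    return (diffs ** 2).mean()

def total_loss(recon_loss, pred_kps, gt_kps, vis, cam_params, joint_params,
               w_kp=1.0, w_cam=0.1, w_joint=0.1):
    # recon_loss: inverse-rendering terms inherited from 3D-Fauna.
    return (recon_loss
            + w_kp * keypoint_loss(pred_kps, gt_kps, vis)
            + w_cam * temporal_smoothness(cam_params)
            + w_joint * temporal_smoothness(joint_params))
```

Because the smoothness terms couple adjacent frames, this objective must be minimized over the whole sequence at once, which matches the paper's per-sequence optimization setup.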
Key Experimental Results¶
Main Results¶
| Method | IoU↑ | PCK@0.1↑ | PCK@0.05↑ | KT-PCK@0.1↑ | MPJVE↓ | Type |
|---|---|---|---|---|---|---|
| SMALify | 0.867 | 0.954 | 0.787 | 0.623 | 0.023 | Model-based |
| AniMer | 0.677 | 0.537 | 0.199 | 0.566 | 0.038 | Model-based |
| 3D-Fauna | 0.670 | 0.470 | 0.177 | 0.329 | 0.058 | Model-free |
| 4D-Fauna | 0.814 | 0.664 | 0.317 | 0.418 | 0.044 | Model-free |
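For reference, PCK@α counts a visible keypoint as correct when the prediction lies within α times a normalization length of the ground truth. A minimal sketch, assuming bounding-box-size normalization (the benchmark's exact normalization is not specified here):

```python
import numpy as np

def pck(pred, gt, vis, bbox_size, alpha=0.1):
    """Fraction of visible keypoints within alpha * bbox_size of ground truth.
    pred, gt: (K, 2) arrays of 2D keypoints; vis: (K,) visibility flags."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    correct = (dist <= alpha * bbox_size) & vis.astype(bool)
    return correct.sum() / max(vis.sum(), 1)
```

Tightening α from 0.1 to 0.05 halves the tolerance radius, which is why the PCK@0.05 column drops so sharply for all methods.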
Ablation Study¶
| Configuration | IoU | PCK@0.1 | Notes |
|---|---|---|---|
| 3D-Fauna (direct inference) | 0.670 | 0.470 | Baseline model-free method |
| + Sequence optimization + Keypoints | 0.814 | 0.664 | Keypoints resolve leg ordering |
| + Temporal smoothing | ↑ | ↑ | Reduces inter-frame jitter |
Key Findings¶
- Misleading 2D metrics: SMALify achieves the best quantitative scores across all metrics, yet qualitative inspection reveals frequent unnatural 3D shapes (depth-elongated bodies, abnormal leg bending, distorted frontal-view geometry), exposing the limitations of 2D projection metrics.
- Advantages of model-free methods: 3D-Fauna and 4D-Fauna produce more naturally plausible 3D shapes and poses, despite yielding lower 2D metric scores.
- Necessity of sequence optimization: Feed-forward inference with 3D-Fauna causes abrupt leg switching between frames; 4D-Fauna effectively addresses this via keypoint constraints and temporal smoothing.
- The data pipeline successfully collects 30K videos / 2M frames spanning 23 animal categories.
Highlights & Insights¶
- End-to-end automation: The pipeline is fully automatic from search query generation to final feature extraction; only benchmark verification requires limited human effort.
- Revealing metric deficiencies: The paper clearly demonstrates the inconsistency between 2D projection metrics and 3D reconstruction quality, underscoring the need for 3D-aware evaluation metrics.
- Elegant model adaptation: Rather than retraining, 4D-Fauna uses feed-forward model outputs as optimization initialization, combining the generalization of model-free methods with the precision of model-based ones.
Limitations & Future Work¶
- Annotations from the automated pipeline are not entirely clean; benchmark evaluation still requires manual verification.
- The benchmark relies solely on 2D projection metrics, lacking true 3D ground truth.
- 4D-Fauna offers limited modeling of temporal consistency—future work could explore autoregressive models to capture inter-frame dynamics.
- Original RGB video frames are not released due to copyright concerns; only derived data are provided.
- The pipeline currently focuses on quadruped animals; extension to birds, fish, and other categories requires adaptation.
Related Work & Insights¶
- vs. APT-36K: Scale increased by 12.5× (30K vs. 2.4K), with comprehensive preprocessed features included.
- vs. BADJA: Expanded from 11 videos to 230 benchmark sequences, establishing the first benchmark genuinely targeting 4D reconstruction.
- vs. 3D-Fauna: 4D-Fauna adds keypoint and temporal constraints, yielding consistent improvements across all metrics.
- vs. SMALify: Quantitatively superior yet questionable in 3D quality, highlighting fundamental issues with the current evaluation paradigm.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Significant engineering contribution in the pipeline; 4D-Fauna method is incrementally innovative but practically effective.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Impressive data scale and comprehensive benchmark, though lacking 3D quantitative evaluation.
- Writing Quality: ⭐⭐⭐⭐ — Pipeline description is thorough and clear, with insightful analysis.
- Value: ⭐⭐⭐⭐⭐ — The dataset and pipeline offer substantial value to the community and are poised to advance the field of 4D animal reconstruction.