Sapiens: Foundation for Human Vision Models
Conference: ECCV 2024
arXiv: 2408.12569
Code: https://github.com/facebookresearch/sapiens
Area: 3D Vision
Keywords: Human vision foundation models, self-supervised pre-training, Vision Transformer, high-resolution inference, human perception
TL;DR
Sapiens is a family of human-centric vision foundation models (0.3B to 2B parameters) pre-trained on 300 million human images with masked autoencoder (MAE) self-supervised learning. The models natively support \(1024 \times 1024\) high-resolution inference and outperform prior state-of-the-art methods on four human vision tasks: 2D pose estimation, body part segmentation, depth estimation, and surface normal prediction.
Background & Motivation
- Background: Methods for generating realistic humans (2D/3D) have made significant progress in recent years. However, these methods rely heavily on robust human perception assets (2D keypoints, body segmentation, depth, and normals), the precise estimation of which remains an active area of research.
- Limitations of Prior Work: (1) Existing methods for individual tasks are highly specialized and architecturally complex, which hinders generalization; (2) Labeled in-the-wild data is extremely difficult to obtain at scale; (3) Current vision foundation models are trained primarily on general images and are not optimized for the human domain.
- Key Challenge: General-purpose vision foundation models are not necessarily optimal for human-centric tasks, whereas human-specific models lack large-scale pre-training.
- Key Insight: Use domain-specific large-scale pre-training. Collect 300 million human images for MAE pre-training, then fine-tune on high-quality (and even synthetic) annotations to combine generalization, broad applicability, and high fidelity.
- Core Idea: Under the same computational budget, self-supervised pre-training on human-centric datasets significantly outperforms pre-training on general-purpose datasets.
Method
Overall Architecture
Sapiens adopts a pretrain-then-finetune paradigm:
1. Pre-training Phase: ViT models are pre-trained with the self-supervised Masked Autoencoder (MAE) approach on the Humans-300M dataset, at a native input resolution of \(1024 \times 1024\) with a patch size of 16.
2. Fine-tuning Phase: Each of the four downstream tasks is fine-tuned separately using a unified encoder-decoder architecture. The encoder is initialized with the pre-trained weights, the decoder is randomly initialized, and the whole model is then fine-tuned end-to-end.
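
To make the pre-training setup concrete, here is a minimal sketch of the token arithmetic and the random patch masking used in MAE-style pre-training. It is not the authors' code: the helper names `num_patches`, `patch_area_pct`, and `random_mask` are hypothetical, and the 75% masking ratio in the usage lines is the common MAE default, assumed here for illustration only.

```python
import random

def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    assert image_size % patch_size == 0, "image size must be divisible by patch size"
    side = image_size // patch_size
    return side * side

def patch_area_pct(image_size: int, patch_size: int) -> float:
    """Share of the image area covered by a single patch token, in percent."""
    return 100.0 * (patch_size ** 2) / (image_size ** 2)

def random_mask(n_tokens: int, mask_ratio: float, seed: int = 0):
    """Split token indices into (visible, masked) lists, MAE-style.

    Only the visible tokens would be fed to the encoder; the decoder is
    trained to reconstruct the masked ones.
    """
    rng = random.Random(seed)
    order = list(range(n_tokens))
    rng.shuffle(order)
    n_keep = int(round(n_tokens * (1.0 - mask_ratio)))
    return sorted(order[:n_keep]), sorted(order[n_keep:])

n_tokens = num_patches(1024, 16)           # 64 x 64 grid of patches
coverage = patch_area_pct(1024, 16)        # roughly 0.024 percent per token
visible, masked = random_mask(n_tokens, mask_ratio=0.75)  # assumed ratio
print(n_tokens, len(visible), len(masked), round(coverage, 4))
```

At \(1024 \times 1024\) with 16-pixel patches the encoder sees a 64 by 64 token grid, which is where the fine per-token coverage comes from.
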
Key Designs
- Humans-300M Dataset: A collection of 300 million human images curated from approximately 1 billion in-the-wild images. Curation discards images with watermarks, text, or artistic illustrations, then keeps only images where a human detector fires with detection score \(> 0.9\) and bounding box size \(> 300\) pixels. Over 248 million of the images contain multiple people. This domain-specific data strategy is what distinguishes Sapiens from general-purpose pre-training.
- High-Resolution MAE Pre-training: Unlike existing ViTs pre-trained at \(224 \times 224\), Sapiens is pre-trained at \(1024 \times 1024\) resolution, consuming approximately four times the FLOPs of the largest existing ViT. Each patch token covers only 0.02% of the image area (compared to 0.4% in a standard ViT), which enables much finer-grained reasoning across tokens. Even with a masking ratio as high as 95%, the model can still plausibly reconstruct human anatomy.
- Model Scaling Strategy: Four model scales are provided (0.3B, 0.6B, 1B, and 2B parameters), with scaling prioritized by width rather than depth. The largest model has 2B parameters and requires approximately 8.7T FLOPs. All models are pre-trained on 1.2 trillion tokens.
- High-Quality Annotations:
  - Pose Estimation: Introduces 308 whole-body keypoints (including 243 facial keypoints), annotated on 1 million 4K images captured in an indoor multi-view system.
  - Body Segmentation: A 28-class vocabulary (including fine-grained classes such as upper/lower lips, teeth, and tongue), annotated on 100,000 4K images.
  - Depth/Normals: Uses synthetic data rendered from 600 high-resolution scans of real human subjects.
- Layer-wise Learning Rate Decay: During fine-tuning, the encoder uses layer-wise learning rate decay (factor 0.85), so layers closer to the input are updated with smaller learning rates. This helps preserve the generalization capability developed during pre-training.
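
The layer-wise learning rate decay can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the function name `layerwise_lrs`, the base learning rate of `1e-4`, and the 24-layer encoder are assumptions chosen for the example, and real training code often also assigns separate groups to the patch embedding and the decoder head.

```python
def layerwise_lrs(base_lr: float, num_layers: int, decay: float = 0.85):
    """Learning rate for each encoder layer, index 0 = closest to the input.

    The last layer keeps base_lr; each earlier layer is scaled by one
    additional factor of `decay`, so early layers change the least.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# Illustrative values only (base_lr and num_layers are assumptions).
lrs = layerwise_lrs(base_lr=1e-4, num_layers=24)
for i in (0, 12, 23):
    print(f"layer {i:2d}: lr = {lrs[i]:.3e}")
```

In a framework such as PyTorch, each of these values would typically become the `lr` of one optimizer parameter group.
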
Loss & Training
- Pose Estimation: MSE loss (heatmap regression)
- Body Segmentation: Weighted Cross-Entropy loss
- Depth Estimation: Scale-invariant log depth loss (see Equations 1-3 in the paper)
- Normal Estimation: L1 loss + cosine similarity loss (\(1 - n \cdot \hat{n}\))
- Training utilizes the AdamW optimizer with a linear warmup followed by cosine annealing or linear decay.
- The 2B model was pre-trained on 1,024 A100 GPUs for 18 days.
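
The depth and normal losses listed above can be sketched in plain Python. This is a hedged illustration, not the paper's code: the function names are hypothetical, the depth loss follows the widely used scale-invariant log form with a weighting `lam` (the default of 0.5 is an assumption; the paper's Equations 1-3 give the exact definition), and the normal loss normalizes both vectors before the dot product so that the cosine term matches \(1 - n \cdot \hat{n}\) for unit normals.

```python
import math

def scale_invariant_log_loss(pred, target, lam=0.5):
    """sqrt(mean(d^2) - lam * mean(d)^2), with d = log(target) - log(pred).

    `pred` and `target` are sequences of positive depths for valid pixels.
    With lam = 1.0 a global rescaling of `pred` leaves the loss at zero.
    """
    d = [math.log(t) - math.log(p) for p, t in zip(pred, target)]
    mean_d = sum(d) / len(d)
    mean_d2 = sum(x * x for x in d) / len(d)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(mean_d2 - lam * mean_d ** 2, 0.0))

def normal_loss(pred, target):
    """Mean over pixels of L1 distance plus (1 - cosine similarity).

    Each element is a non-zero 3-vector; vectors are normalized for the
    cosine term, which equals 1 - n . n_hat when they are already unit length.
    """
    total = 0.0
    for n_hat, n in zip(pred, target):
        l1 = sum(abs(a - b) for a, b in zip(n_hat, n))
        dot = sum(a * b for a, b in zip(n_hat, n))
        norms = math.sqrt(sum(a * a for a in n_hat)) * math.sqrt(sum(b * b for b in n))
        total += l1 + (1.0 - dot / norms)
    return total / len(pred)

print(scale_invariant_log_loss([1.0, 2.0, 4.0], [1.1, 1.9, 4.2]))
print(normal_loss([(0.0, 0.0, 1.0)], [(0.0, 0.6, 0.8)]))
```
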
Key Experimental Results
Main Results
| Dataset | Metric | Ours (2B) | Prev. SOTA | Reported gain |
|---|---|---|---|---|
| Humans-5K (Pose) | Whole-body AP ↑ | 61.1 | DWPose-l: 53.1 | +7.6 |
| Humans-2K (Seg) | mIoU ↑ | 81.2 | DeepLabV3+: 64.1 | +17.1 |
| Hi4D (Depth) | RMSE ↓ | 0.114 | DepthAnything-L: 0.147 | -22.4% (relative) |
| THuman2.0 (Normal) | Mean Angular Error ↓ | Drastically reduced | PIFuHD: 30.51° | -53.5% (relative) |
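
The percentage entries in the table are relative changes against the baseline, while the pose and segmentation entries are absolute differences. A minimal sketch of the relative calculation, applied to the depth row (the helper name `relative_change_pct` is hypothetical):

```python
def relative_change_pct(ours: float, baseline: float) -> float:
    """Relative change of `ours` with respect to `baseline`, in percent.

    Negative values mean a reduction, which is an improvement for error
    metrics such as RMSE.
    """
    return 100.0 * (ours - baseline) / baseline

depth_change = relative_change_pct(0.114, 0.147)  # Hi4D RMSE, values from the table
print(round(depth_change, 1))  # -22.4
```
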
Ablation Study
| Configuration | Key Metric | Description |
|---|---|---|
| Sapiens-0.3B vs ViTPose+-L | +5.6 AP (Pose) | Same parameter scale, domain-specific pre-training wins by a large margin |
| Sapiens-0.6B vs ViTPose+-H | +7.9 AP (Pose) | Continues to widen the gap |
| 0.3B → 0.6B → 1B → 2B | Monotonic AP increase | Increasing model size consistently yields performance gains |
| Sapiens-0.3B vs Mask2Former (Seg) | +12.6 mIoU | High-resolution + human pre-training vs general segmentation |
| Fine-tuned using synthetic data only (Depth) | 0.008 RMSE (Face) | With synthetic fine-tuning data alone, surpasses DepthAnything, which is trained on real data |
Key Findings
- Under the same computational budget, domain-specific pre-training yields significantly larger improvements for human-centric tasks than general-purpose pre-training.
- High-quality/synthetic annotations + domain-specific pre-training = remarkable generalization capability in-the-wild.
- Model performance correlates positively with parameter size, showing no signs of saturation.
- Fine-tuning solely on indoor multi-view annotations still generalizes well to various in-the-wild scenes.
Highlights & Insights
- The Power of Data Strategy: Compared to DINOv2 (142M general images) and AIM (2B general images), Sapiens achieves superior performance on human-centric tasks using "fewer but more focused" human data. This validates the hypothesis that "domain-specific data > sheer data volume."
- Simple and Unified Architecture: All four tasks share the same encoder-decoder framework, merely swapping the decoder output head. This demonstrates that powerful pre-trained representations are sufficient to support diverse downstream tasks.
- Native High-Resolution Support: Pre-training at 1K resolution is an intuitive but rarely pursued choice due to the massive computational cost (FLOPs), yet it proves crucial for fine-grained human perception.
- The Surprise of Synthetic Data: For depth and normal estimation, fine-tuning solely on synthetic data achieves state-of-the-art in-the-wild performance.
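
The "same encoder, different head" design can be summarized as a small configuration table. This is an illustrative sketch, not the repository's actual config format: the `TaskHead` class and the loss identifiers are hypothetical, the 308 and 28 output channels come from the keypoint and class counts in this summary, and the 1 and 3 channels for depth and normals are assumptions based on the natural dimensionality of those outputs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskHead:
    """Per-task decoder settings; the encoder is shared across all tasks."""
    name: str
    out_channels: int
    loss: str

TASK_HEADS = {
    "pose": TaskHead("pose", 308, "mse_heatmap"),            # one heatmap per keypoint
    "seg": TaskHead("seg", 28, "weighted_cross_entropy"),    # one logit map per class
    "depth": TaskHead("depth", 1, "scale_invariant_log"),    # assumed single channel
    "normal": TaskHead("normal", 3, "l1_plus_cosine"),       # assumed xyz channels
}

def describe(task: str) -> str:
    """Human-readable summary of what changes when switching tasks."""
    head = TASK_HEADS[task]
    return f"{head.name}: shared encoder -> decoder with {head.out_channels} output channels ({head.loss})"

for task in TASK_HEADS:
    print(describe(task))
```
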
Limitations & Future Work
- The pre-training dataset (Humans-300M) is proprietary and not publicly available, which limits reproducibility.
- The inference cost of the 2B model reaches 8.7T FLOPs, limiting practical deployment.
- The work only addresses four human vision tasks, leaving other tasks such as body shape estimation and hand gesture recognition unexplored.
- No direct ablation compares the pre-training data against general-purpose models such as DINOv2 on human-centric tasks.
- The top-down paradigm relies on an external human detector, meaning that performance in multi-person scenes can be bottlenecked by the quality of the detector.
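
The detector dependency in the last point can be illustrated with a minimal top-down loop. Everything here is a hypothetical stand-in: `detect` and `estimate` are placeholder callables supplied by the caller, the image is a plain nested list, and the `min_score` threshold of 0.5 is an arbitrary example value.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates

def top_down(image: List[list], detect: Callable, estimate: Callable,
             min_score: float = 0.5):
    """Run a per-person model on each detected person crop.

    A person the detector misses, or scores below `min_score`, produces no
    output at all, which is how detector quality bounds the whole pipeline.
    """
    results = []
    for box, score in detect(image):
        if score < min_score:
            continue
        x0, y0, x1, y1 = box
        crop = [row[x0:x1] for row in image[y0:y1]]
        results.append((box, estimate(crop)))
    return results

# Toy usage with stub components (not real models).
image = [[0] * 8 for _ in range(6)]
stub_detect = lambda img: [((1, 1, 5, 4), 0.9), ((0, 0, 2, 2), 0.2)]
stub_estimate = lambda crop: (len(crop), len(crop[0]))  # returns crop height, width
out = top_down(image, stub_detect, stub_estimate)
print(out)
```
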
Related Work & Insights
- Compared with general-purpose pre-training models like DINOv2, AIM, and MAWS, this work validates the value of domain-specific pre-training.
- Compared with highly specialized methods like ViTPose+ and DWPose, it proves that a "simple architecture + strong pre-training" combination can outperform complex, task-specific designs.
- Inspiration: Can other specialized domains (such as autonomous driving or medical image analysis) yield similar benefits through large-scale domain-specific pre-training?
Rating
- Novelty: ⭐⭐⭐⭐ The core idea (domain-specific pre-training) is not entirely new, but its systematic validation in human vision is exceptionally thorough.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Extremely solid evaluation across four tasks, multiple datasets, and comprehensive scaling experiments.
- Writing Quality: ⭐⭐⭐⭐⭐ Exceptionally clear structure, concise presentation, and highly informative tables and figures.
- Value: ⭐⭐⭐⭐⭐ Delivers a highly practical family of human vision foundation models that is of significant value for advancing work in the research community.