PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation
Conference: CVPR 2025
arXiv: 2505.17475
Code: https://github.com/uyoung-jeong/PoseBH
Area: Human Understanding / Pose Estimation
Keywords: Multi-Dataset Training, Keypoint Prototype, Cross-Skeleton Transfer, Sinkhorn Clustering, Self-Supervised Learning
TL;DR
Proposes PoseBH, which achieves unified training across datasets with different skeleton definitions (e.g., humans, animals, hands) via non-parametric keypoint prototypes (Sinkhorn-Knopp online clustering) and cross-type self-supervision (CSS). It improves upon ViTPose++ by 11.2 AP on the APT-36K animal video dataset, demonstrating the effectiveness of cross-type knowledge transfer.
Background & Motivation
Background: Pose estimation datasets often have different skeleton definitions—COCO has 17 human keypoints, AP-10K has animal keypoints, and InterHand has hand keypoints. The standard approach is to train models independently for each skeleton, which discards shared knowledge across datasets.
Limitations of Prior Work: Multi-dataset joint training faces two critical challenges: (1) different skeletons have varying numbers and semantics of keypoints, making it impossible to share a prediction head; (2) keypoints labeled in one dataset are unlabeled in others, leading to missing labels.
Key Challenge: Skeletons of different entities appear completely different, yet joint types share substantial commonalities—for example, "bending joints" and "end joints" exist across humans, animals, and hands.
Key Insight: Prototype learning can be utilized to discover shared prototypes across skeletons in the embedding space, letting clustering automatically find correspondences without manual definition.
Core Idea: Non-parametric keypoint prototypes + cross-type self-supervision = unified pose estimation across all skeleton types.
Method
Key Designs
- Non-parametric keypoint prototype:
  - Function: Learn keypoint representations shared across datasets in the embedding space.
  - Mechanism: Maintain a \(J \times M \times F\) prototype tensor, i.e. \(M\) prototypes of dimension \(F\) for each of the \(J\) keypoint types, and update the prototypes via online Sinkhorn-Knopp clustering. At prediction time, each pixel is classified by the distance between its feature and the prototypes.
  - Design Motivation: Eliminates the need to manually define relationships like "human elbow = cat front knee." Prototypes cluster automatically during training.
- Cross-type Self-Supervision (CSS):
  - Function: Perform self-supervised learning on keypoint types that are unlabeled in a given sample, using the model's own predictions for them.
  - Mechanism: For each sample in a mixed batch, predictions are made with both the keypoint head and the embedding head. Highly confident predictions from one head serve as pseudo-labels for the other head via weighted averaging.
  - Design Motivation: Animal keypoints are unlabeled in human datasets, but the model can still predict them from the "joint" concepts learned on animal data, and vice versa.
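The prototype mechanism above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the Sinkhorn-Knopp routine follows the commonly used online-clustering recipe, and the function names (`sinkhorn_knopp`, `update_prototypes`, `classify_pixels`), the temperature `epsilon`, the momentum update, and the max-over-prototypes classification rule are all hypothetical choices made for the example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def sinkhorn_knopp(scores, epsilon=0.05, n_iters=3):
    """Balanced soft assignment of N features to M prototypes.

    scores: (N, M) similarities. Returns (N, M); each row sums to 1.
    epsilon and n_iters are assumed values, not taken from the paper.
    """
    q = np.exp((scores - scores.max()) / epsilon).T  # (M, N), shifted for stability
    q /= q.sum()
    n_protos, n_samples = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True)  # equalize mass per prototype
        q /= n_protos
        q /= q.sum(axis=0, keepdims=True)  # equalize mass per sample
        q /= n_samples
    return (q * n_samples).T

def update_prototypes(prototypes, feats, momentum=0.99):
    """Momentum update of one keypoint type's (M, F) prototypes (assumed rule)."""
    assign = sinkhorn_knopp(feats @ prototypes.T).argmax(axis=1)
    updated = prototypes.copy()
    for m in range(prototypes.shape[0]):
        members = feats[assign == m]
        if len(members) > 0:
            updated[m] = momentum * prototypes[m] + (1 - momentum) * members.mean(axis=0)
    return l2_normalize(updated)

def classify_pixels(pixel_feats, prototypes):
    """Assign each pixel the keypoint type of its most similar prototype.

    pixel_feats: (P, F); prototypes: (J, M, F). Returns (P,) indices in [0, J).
    """
    sims = np.einsum("pf,jmf->pjm", pixel_feats, prototypes)
    return sims.max(axis=2).argmax(axis=1)

rng = np.random.default_rng(0)
J, M, F = 17, 4, 32
prototypes = l2_normalize(rng.normal(size=(J, M, F)))
feats = l2_normalize(rng.normal(size=(64, F)))

soft_assign = sinkhorn_knopp(feats @ prototypes[0].T)
prototypes[0] = update_prototypes(prototypes[0], feats)
labels = classify_pixels(feats, prototypes)
```

The balancing step is what keeps all \(M\) prototypes of a keypoint type in use rather than letting them collapse onto a single one, which is the usual reason for choosing Sinkhorn-Knopp over plain nearest-prototype assignment.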
Loss & Training
\(\mathcal{L}_{MDT} = \mathcal{L}_{KPL} + \mathcal{L}_{CSS}\). The keypoint loss \(\mathcal{L}_{KPL}\) includes the pixel-prototype contrastive loss (\(\mathcal{L}_{PPC}\)) and the pixel-prototype distance loss (\(\mathcal{L}_{PPD}\)). Training proceeds progressively in three stages.
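A rough sketch of how these terms could be computed is given below. It rests on several assumptions that the summary does not confirm: \(\mathcal{L}_{PPC}\) is written as a temperature-scaled cross-entropy over all prototypes, \(\mathcal{L}_{PPD}\) as a squared cosine distance to the assigned prototype (the usual forms in prototype-based segmentation), the two are summed with unit weights, and the CSS term is a confidence-weighted average of the two heads' heatmaps used as a shared target. The function names, `tau`, and `conf_thresh` are hypothetical.

```python
import numpy as np

def ppc_loss(feat, prototypes, pos_index, tau=0.1):
    """Pixel-prototype contrastive loss (assumed form).

    feat: (F,) unit vector; prototypes: (K, F) unit vectors (all J*M flattened);
    pos_index: index of the prototype this pixel is assigned to.
    """
    logits = prototypes @ feat / tau
    logits = logits - logits.max()  # numerical stability
    return -(logits[pos_index] - np.log(np.exp(logits).sum()))

def ppd_loss(feat, prototypes, pos_index):
    """Pixel-prototype distance loss (assumed form): squared cosine distance."""
    return (1.0 - prototypes[pos_index] @ feat) ** 2

def css_loss(hm_kpt, hm_emb, conf_thresh=0.5):
    """Cross-type self-supervision term (assumed form).

    hm_kpt, hm_emb: (J, H, W) heatmaps from the keypoint and embedding heads
    for keypoint types that have no ground truth in this sample.
    """
    conf_k = hm_kpt.reshape(len(hm_kpt), -1).max(axis=1)
    conf_e = hm_emb.reshape(len(hm_emb), -1).max(axis=1)
    w = conf_k / (conf_k + conf_e + 1e-12)  # relative confidence of keypoint head
    # In an autograd framework this target would be detached (no gradient).
    target = w[:, None, None] * hm_kpt + (1 - w)[:, None, None] * hm_emb
    keep = np.maximum(conf_k, conf_e) > conf_thresh  # only confident keypoints
    if not keep.any():
        return 0.0
    err = (hm_kpt - target) ** 2 + (hm_emb - target) ** 2
    return float(err[keep].mean())

def mdt_loss(l_ppc, l_ppd, l_css):
    """L_MDT = L_KPL + L_CSS, with L_KPL = L_PPC + L_PPD (unit weights assumed)."""
    return (l_ppc + l_ppd) + l_css

protos = np.eye(4)   # four orthonormal toy prototypes
feat = protos[0]     # a pixel feature that coincides with prototype 0
matched = ppc_loss(feat, protos, 0) + ppd_loss(feat, protos, 0)
mismatched = ppc_loss(feat, protos, 1) + ppd_loss(feat, protos, 1)
```

In this toy setup the matched case yields a much smaller loss than the mismatched one, which is the behavior both prototype terms are meant to encourage.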
Key Experimental Results
Main Results
| Dataset | Metric | PoseBH (ViT-B) | ViTPose++ | Gain |
|---|---|---|---|---|
| COCO (human) | AP | 77.3 | 76.5 | +0.8 |
| AP-10K (animal) | AP | 75.0 | 74.1 | +0.9 |
| APT-36K (animal video) | AP | 87.2 | 76.0 | +11.2 |
| InterHand (hand) | AUC | 87.1 | 86.2 | +0.9 |
Key Findings
- APT-36K achieves the largest gain (+11.2 AP). Since APT-36K is a video dataset, multi-dataset pretraining appears to provide superior temporal understanding.
- Prototypical clustering successfully discovers shared structures across different types.
- CSS contributes +0.2 to the average score (totaling +2.4 compared to the baseline).
Highlights & Insights
- Philosophy of "Beyond Human Pose Estimation": not just a better human pose estimator, but a unified joint detector for all species and objects.
- Substantial gain of +11.2 AP on APT-36K: demonstrates that cross-type knowledge transfer holds massive value for data-scarce domains like animal video.
Limitations & Future Work
- Few-shot prototype learning is still required; fully zero-shot cross-skeleton transfer is not yet feasible.
- CSS requires similar dataset distributions.
- The 3D domain remains unexplored.
Rating
- Novelty: ⭐⭐⭐⭐ Novel prototype-driven cross-skeleton unified learning.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across four categories of datasets: human, animal, hand, and video.
- Writing Quality: ⭐⭐⭐⭐ Clear.
- Value: ⭐⭐⭐⭐ Provides a scalable framework for unified pose estimation.