PoseBH: Prototypical Multi-Dataset Training Beyond Human Pose Estimation¶

Conference: CVPR 2025
arXiv: 2505.17475
Code: https://github.com/uyoung-jeong/PoseBH
Area: Human Understanding / Pose Estimation
Keywords: Multi-Dataset Training, Keypoint Prototype, Cross-Skeleton Transfer, Sinkhorn Clustering, Self-Supervised Learning

TL;DR¶

Proposes PoseBH, which achieves unified training across datasets with different skeleton definitions (e.g., humans, animals, hands) via non-parametric keypoint prototypes (Sinkhorn-Knopp online clustering) and cross-type self-supervision (CSS). It improves upon ViTPose++ by 11.2 AP on the APT-36K animal video dataset, demonstrating the effectiveness of cross-type knowledge transfer.

Background & Motivation¶

Background¶

Background: Pose estimation datasets use different skeleton definitions: COCO annotates 17 human keypoints, AP-10K annotates animal keypoints, and InterHand annotates hand keypoints. The standard approach trains a separate model for each skeleton, which discards knowledge shared across datasets.

Limitations of Prior Work: Multi-dataset joint training faces two critical challenges: (1) different skeletons have varying numbers and semantics of keypoints, making it impossible to share a prediction head; (2) keypoints labeled in one dataset are unlabeled in others, leading to missing labels.

Key Challenge: Skeletons of different entities appear completely different, yet joint types share substantial commonalities—for example, "bending joints" and "end joints" exist across humans, animals, and hands.

Key Insight: Prototype learning can be utilized to discover shared prototypes across skeletons in the embedding space, letting clustering automatically find correspondences without manual definition.

Core Idea: Non-parametric keypoint prototypes + cross-type self-supervision = unified pose estimation across all skeleton types.

Method¶

Key Designs¶

Non-parametric keypoint prototype:
- Function: Learn keypoint representations shared across datasets in the embedding space
- Mechanism: Maintain a \(J \times M \times F\) prototype matrix (\(J\) keypoints \(\times\) \(M\) prototypes \(\times\) \(F\)-dimensional features) for each keypoint type, and update prototypes via online Sinkhorn-Knopp clustering. Classification is performed during prediction using the distance between pixel features and prototypes.
- Design Motivation: Eliminates the need to manually define relationships like "human elbow = cat front knee." Prototypes cluster automatically during training.
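
The prototype mechanism described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names, the temperature `epsilon`, the momentum update, and the use of cosine similarity on unit-norm features are all assumptions made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def sinkhorn_knopp(scores, epsilon=0.05, n_iters=3):
    """Turn an N x M similarity matrix into a balanced soft assignment.

    Rows are pixel embeddings, columns are prototypes. The result has rows
    that sum to 1 and columns that each receive roughly N / M of the mass.
    epsilon and n_iters are assumed values, not taken from the paper.
    """
    n, m = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract the max for stability
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every prototype receives a total mass of 1 / M.
        col = [sum(q[i][j] for i in range(n)) for j in range(m)]
        q = [[q[i][j] / (col[j] * m) for j in range(m)] for i in range(n)]
        # Row step: every pixel distributes a total mass of 1 / N.
        row = [sum(r) for r in q]
        q = [[q[i][j] / (row[i] * n) for j in range(m)] for i in range(n)]
    return [[v * n for v in r] for r in q]  # rescale so each row sums to 1

def update_prototypes(prototypes, features, momentum=0.9):
    """One online clustering step for the M prototypes of one keypoint type.

    Assumption: prototypes move toward the mean of their hard-assigned
    features with an exponential moving average.
    """
    scores = [[dot(f, p) for p in prototypes] for f in features]
    q = sinkhorn_knopp(scores)
    hard = [max(range(len(row)), key=row.__getitem__) for row in q]
    updated = []
    for j, proto in enumerate(prototypes):
        members = [f for f, h in zip(features, hard) if h == j]
        if not members:  # no feature assigned: keep the prototype unchanged
            updated.append(proto)
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        mixed = [momentum * a + (1 - momentum) * b for a, b in zip(proto, mean)]
        updated.append(normalize(mixed))
    return updated

def classify(feature, bank):
    """bank[j][m] is prototype m of keypoint type j (a J x M x F nested list).

    Returns the keypoint type whose closest prototype is most similar.
    """
    sims = [max(dot(feature, p) for p in protos) for protos in bank]
    return max(range(len(sims)), key=sims.__getitem__)
```

The balancing step is what keeps all \(M\) prototypes of a keypoint in use: without it, every feature could collapse onto a single prototype.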

Cross-type Self-Supervision (CSS):
- Function: Perform self-supervised learning using predictions of unlabeled keypoint types.
- Mechanism: For samples in each mixed batch, predictions are made using both the keypoint head and the embedding head. Highly confident predictions from one head serve as pseudo-labels for the other head via weighted averaging.
- Design Motivation: Animal keypoints are not labeled in human datasets, but a model trained on human data can still predict them from the "joint" concepts it has learned, and vice versa.
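
One way to picture the cross-head supervision is the sketch below. It is a hypothetical simplification: heatmaps are flattened to lists, confidence is taken to be the peak heatmap value, and the `threshold` value and function names are assumptions rather than details from the paper. In a real training loop the pseudo-label side would also be excluded from gradient computation.

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def one_way_loss(teacher, student, threshold):
    """Confidence-weighted average of per-keypoint MSE.

    Only keypoints whose teacher heatmap peaks at or above `threshold`
    contribute; the teacher heatmap acts as the pseudo-label.
    """
    weighted, total_weight = 0.0, 0.0
    for t_map, s_map in zip(teacher, student):
        conf = max(t_map)  # assumed confidence score: the heatmap peak
        if conf < threshold:
            continue
        weighted += conf * mse(t_map, s_map)
        total_weight += conf
    return weighted / total_weight if total_weight else 0.0

def css_loss(keypoint_head, embedding_head, threshold=0.5):
    """Symmetric loss: each head supervises the other where it is confident."""
    return (one_way_loss(keypoint_head, embedding_head, threshold)
            + one_way_loss(embedding_head, keypoint_head, threshold))
```

Here `keypoint_head` and `embedding_head` each hold one flattened heatmap per unlabeled keypoint type; low-confidence keypoints are skipped so that noisy predictions do not become training targets.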

Loss & Training¶

\(\mathcal{L}_{MDT} = \mathcal{L}_{KPL} + \mathcal{L}_{CSS}\). The keypoint loss includes the pixel-prototype contrastive loss (\(\mathcal{L}_{PPC}\)) and the pixel-prototype distance loss (\(\mathcal{L}_{PPD}\)). Three-stage progressive training is used.
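
The two prototype losses are only named here, so the sketch below assumes the forms commonly used in prototype-based dense prediction: a temperature-scaled cross-entropy over prototype similarities for \(\mathcal{L}_{PPC}\) and a squared cosine distance to the assigned prototype for \(\mathcal{L}_{PPD}\). The temperature, the weights `w_ppc` and `w_ppd`, and the function names are hypothetical; the paper's exact definitions may differ.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ppc_loss(pixel, prototypes, target, temperature=0.1):
    """Contrastive term: pull the pixel toward prototypes[target] and push it
    away from the others (cross-entropy over similarity logits)."""
    logits = [dot(pixel, p) / temperature for p in prototypes]
    top = max(logits)  # log-sum-exp with max subtraction for stability
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[target]

def ppd_loss(pixel, prototypes, target):
    """Distance term: squared cosine distance to the assigned prototype,
    assuming unit-norm pixel and prototype vectors."""
    return (1.0 - dot(pixel, prototypes[target])) ** 2

def keypoint_loss(pixel, prototypes, target, w_ppc=1.0, w_ppd=1.0):
    """Assumed combination of the two terms with hypothetical weights."""
    return (w_ppc * ppc_loss(pixel, prototypes, target)
            + w_ppd * ppd_loss(pixel, prototypes, target))
```

A pixel that coincides with its assigned prototype has zero distance loss and a contrastive loss close to zero, while a pixel assigned to the wrong prototype is penalized by both terms.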

Key Experimental Results¶

Main Results¶

Dataset	Metric	PoseBH (ViT-B)	ViTPose++	Gain
COCO	AP	77.3	76.5	+0.8
AP-10K (animal)	AP	75.0	74.1	+0.9
APT-36K (animal video)	AP	87.2	76.0	+11.2
InterHand (hand)	AUC	87.1	86.2	+0.9

Key Findings¶

APT-36K achieves the largest gain (+11.2 AP); because it is a video dataset, the gain is attributed to multi-dataset pretraining providing better temporal understanding.
Prototypical clustering successfully discovers shared structures across different types.
CSS adds +0.2 to the average score, bringing the total gain over the baseline to +2.4.

Highlights & Insights¶

Philosophy of "Beyond Human Pose Estimation": not just a better human pose estimator, but a unified joint detector across species and object types.
Substantial gain of +11.2 AP on APT-36K: shows that cross-type knowledge transfer is especially valuable for data-scarce domains such as animal video.

Limitations & Future Work¶

Few-shot prototype learning is still required; fully zero-shot cross-skeleton transfer is not yet feasible.
CSS requires similar dataset distributions.
The 3D domain remains unexplored.

Rating¶

Novelty: ⭐⭐⭐⭐ Novel prototype-driven cross-skeleton unified learning.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Evaluated across four categories of datasets: human, animal, hand, and video.
Writing Quality: ⭐⭐⭐⭐ Clear.
Value: ⭐⭐⭐⭐ Provides a scalable framework for unified pose estimation.