
Probabilistic Prompt Distribution Learning for Animal Pose Estimation

Conference: CVPR 2025
arXiv: 2503.16120
Code: GitHub
Area: Human/Animal Understanding
Keywords: Animal Pose Estimation, Probabilistic Prompt Learning, Multimodal Fusion, Cross-species Generalization, CLIP

TL;DR

This paper proposes PPAP (Probabilistic Prompt for Animal Pose), a multi-species animal pose estimation method based on probabilistic prompt distribution learning. It constructs multiple learnable attribute prompts for each keypoint, models each prompt as a Gaussian distribution, and combines a diversity loss with cross-modal fusion strategies. The paper reports state-of-the-art (SOTA) performance under both supervised and zero-shot settings.

Background & Motivation

  • Multi-species animal pose estimation (APE) faces significant challenges due to the massive visual diversity and uncertainty across species.
  • Directly applying human pose estimation methods to animals suffers from significant domain shifts.
  • Category-Agnostic Pose Estimation (CAPE) methods require extra support sets and category prior knowledge, which limits their practical utility.
  • Pure visual APE methods struggle with long-tail distributions in cross-species scenarios as they solely rely on visual cues.
  • Existing multimodal APE methods (such as CLAMP, X-Pose) use fixed text templates (deterministic prompts), which lack rich textual descriptions.
  • A single textual description cannot cover all the subtle features (such as color, position, and shape) of keypoints.
  • The complexity of wild scenarios and multi-species characteristics introduce uncertain statistical shifts.
  • Compared with deterministic prompts, probabilistic prompt learning is more adaptable to cross-species challenges, but existing methods have limited effectiveness in modeling distributions in the input space.

Method

Overall Architecture

PPAP builds on the CLAMP framework and retains CLIP's text encoder and image encoder. For each keypoint it creates \(N_p\) learnable attribute prompt templates, which are embedded by the text encoder. A text decoder then predicts the mean and a visual-text decoder predicts the variance, so that each prompt is modeled as an independent Gaussian distribution. Prompt representations sampled from these distributions are aligned with visual features at the spatial level by one of three cross-modal fusion strategies (heuristic, ensemble, or attention) to generate keypoint heatmaps.
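
The spatial alignment step can be pictured as comparing one prompt embedding against every pixel feature. The sketch below is illustrative only: it assumes cosine similarity as the comparison and uses tiny nested lists in place of real CLIP features. The function names `l2_normalize` and `score_map` are hypothetical and do not come from the paper or its code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def score_map(prompt_embedding, feature_map):
    """Cosine similarity between one prompt embedding and each pixel feature.

    feature_map is an H x W grid of D-dimensional feature vectors.
    Cosine similarity is an assumption made for this sketch.
    """
    z = l2_normalize(prompt_embedding)
    return [
        [sum(a * b for a, b in zip(z, l2_normalize(pixel))) for pixel in row]
        for row in feature_map
    ]


# Toy 2 x 2 feature map with 2-dimensional features.
features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]
prompt = [2.0, 0.0]
scores = score_map(prompt, features)
```

The pixel whose feature points in the same direction as the prompt receives the highest score, which is the behavior a keypoint heatmap needs.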

Key Designs

Design 1: Diverse Prompt Construction + Diversity Loss

  • Function: Provides rich, multi-perspective, and multi-attribute textual descriptions for each keypoint.
  • Mechanism: Creates \(N_p\) attribute templates \(p_i^t = \{a_1^t, \ldots, a_L^t | k_i\}\) for the \(i\)-th keypoint, where \(\{a_l^t\}\) represent learnable attribute tokens. A Generalized Keypoint Placement (GKP) strategy is adopted to allow the keypoint name to be placed at random positions within the templates. A diversity loss \(\mathcal{L}_{div} = \frac{1}{K}\sum_{i=1}^{K}\|\tilde{P}_i\tilde{P}_i^T - \mathbb{I}\|_2^2\) is designed to maintain the orthogonality of attribute representations.
  • Design Motivation: A single prompt cannot capture the full semantic information of keypoints, whereas multiple different attribute prompts can provide complementary information from multiple perspectives such as color and spatial location. The diversity loss prevents the learned attributes from degenerating into identical representations.
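
A minimal sketch of the two ideas in Design 1, under stated assumptions: \(\tilde{P}_i\) is taken to be the row-normalized matrix of the \(N_p\) prompt embeddings for keypoint \(i\), and the norm is read as a squared Frobenius norm. The helper names (`place_keypoint`, `diversity_loss`) and the token strings are hypothetical.

```python
import math
import random


def place_keypoint(attribute_tokens, keypoint_name, rng):
    """GKP-style template: insert the keypoint name at a random slot."""
    position = rng.randint(0, len(attribute_tokens))  # both ends inclusive
    return attribute_tokens[:position] + [keypoint_name] + attribute_tokens[position:]


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def diversity_loss(prompt_embeddings):
    """Mean over keypoints of the squared Frobenius norm of (P P^T - I).

    prompt_embeddings: K keypoints x N_p prompts x D dimensions.
    Rows are L2-normalized first (an assumption about the tilde notation).
    """
    total = 0.0
    for prompts in prompt_embeddings:
        normed = [l2_normalize(p) for p in prompts]
        for i, p_i in enumerate(normed):
            for j, p_j in enumerate(normed):
                gram = sum(a * b for a, b in zip(p_i, p_j))
                target = 1.0 if i == j else 0.0
                total += (gram - target) ** 2
    return total / len(prompt_embeddings)


template = place_keypoint(["a1", "a2", "a3"], "left_eye", random.Random(0))

orthogonal = [[[1.0, 0.0], [0.0, 1.0]]]  # one keypoint, two orthogonal prompts
collapsed = [[[1.0, 0.0], [1.0, 0.0]]]   # one keypoint, two identical prompts
loss_orthogonal = diversity_loss(orthogonal)
loss_collapsed = diversity_loss(collapsed)
```

Orthogonal prompts incur zero penalty, while identical prompts are penalized through the off-diagonal entries of the Gram matrix, which is how the loss discourages degenerate attributes.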

Design 2: Probabilistic Prompt Distribution Modeling

  • Function: Models the uncertainty of prompts using Gaussian distributions to enhance generalization to unseen categories.
  • Mechanism: Each attribute prompt is modeled as an independent Gaussian distribution \(\mathcal{G}(z_i^t|p_i^t) \sim \mathcal{N}(\mu_i^t, \sigma_i^t\mathbf{I})\). The mean is computed by a text decoder (self-attention + MLP), and the variance is estimated by a visual-text decoder (cross-attention + MLP) using visual features. Sampling is performed via the reparameterization trick \(\hat{z}_i^t = \mu(p_i^t) + \epsilon \cdot \sigma(p_i^t)\), while a KL divergence regularization term prevents variance collapse.
  • Design Motivation: Deterministic prompt representations are fixed and cannot adapt to the large-variance distribution of animal data. Probabilistic modeling allows simulating statistical variations across different species and synthesizing new feature statistics to enhance robustness.
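
The reparameterization step can be illustrated in a few lines. This is a sketch rather than the authors' implementation: it treats \(\sigma\) as a per-dimension standard deviation, uses fixed toy values where the paper uses decoder outputs, and the name `sample_prompt` is hypothetical.

```python
import random


def sample_prompt(mu, sigma, rng):
    """Reparameterized sample: z = mu + eps * sigma, with eps ~ N(0, 1).

    The randomness lives entirely in eps, so in a framework with automatic
    differentiation the gradient can flow through mu and sigma.
    """
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mu, sigma)]


rng = random.Random(0)
mu = [0.5, -1.0, 2.0]      # toy stand-in for the text decoder output
sigma = [0.1, 0.0, 1.0]    # toy stand-in for the visual-text decoder output

samples = [sample_prompt(mu, sigma, rng) for _ in range(5000)]
empirical_mean = [
    sum(sample[d] for sample in samples) / len(samples) for d in range(len(mu))
]
```

With `sigma` set to zero in a dimension, every sample equals the mean there, which recovers a deterministic prompt as a special case.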

Design 3: Three Cross-Modal Fusion Strategies

  • Function: Aligns probabilistic prompt representations with visual features at the spatial level.
  • Mechanism: (1) Heuristic Selection: selects the score map most similar to the target from \(N_s\) sampled score maps; (2) Ensemble Selection: concatenates all score maps and performs convolutional fusion \(S = \text{Conv}(\text{Concat}(S'))\); (3) Attention Selection: introduces learnable queries that learn the optimal fusion of the sampled prompts through an attention module.
  • Design Motivation: Different fusion strategies suit different scenarios, and attention selection achieves the best trade-off between flexibility (degrees of freedom) and information utilization.
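
The three strategies can be contrasted with a toy sketch. Several choices here are assumptions made for illustration, not details from the paper: similarity to the target is measured by mean squared error, the convolution is reduced to a 1x1 kernel (a per-pixel weighted sum), and attention uses a single query with scaled dot-product weights. All function names are hypothetical.

```python
import math


def mse(map_a, map_b):
    diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(map_a, map_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


def heuristic_select(score_maps, target):
    """Pick the sampled score map closest to the target (MSE is assumed)."""
    return min(score_maps, key=lambda m: mse(m, target))


def ensemble_fuse(score_maps, weights, bias=0.0):
    """1x1 convolution over concatenated maps: a per-pixel weighted sum."""
    height, width = len(score_maps[0]), len(score_maps[0][0])
    return [
        [bias + sum(w * m[y][x] for w, m in zip(weights, score_maps))
         for x in range(width)]
        for y in range(height)
    ]


def attention_fuse(query, sampled_prompts):
    """Single-query scaled dot-product attention over sampled prompts."""
    scale = math.sqrt(len(query))
    logits = [sum(q * p for q, p in zip(query, prompt)) / scale
              for prompt in sampled_prompts]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    weights = [e / sum(exps) for e in exps]
    dim = len(sampled_prompts[0])
    fused = [sum(w * prompt[d] for w, prompt in zip(weights, sampled_prompts))
             for d in range(dim)]
    return fused, weights


maps = [[[0.9, 0.1], [0.0, 0.0]], [[0.2, 0.8], [0.0, 0.0]]]
target = [[1.0, 0.0], [0.0, 0.0]]
best = heuristic_select(maps, target)
averaged = ensemble_fuse(maps, weights=[0.5, 0.5])
fused_prompt, attn_weights = attention_fuse([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Heuristic selection keeps one sample and discards the rest, the ensemble uses every sample with fixed learned weights, and attention computes the weights from the samples themselves.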

Loss & Training

The total loss is defined as \(\mathcal{L}_{total} = \mathcal{L}_{pred} + \mathcal{L}_{spatial} + \gamma \cdot \mathcal{L}_{feature} + \beta \cdot \mathcal{L}_{prompt}\), where \(\mathcal{L}_{pred}\) is the MSE loss for heatmap prediction, \(\mathcal{L}_{spatial}\) is the spatial adaptation MSE loss, \(\mathcal{L}_{feature}\) is the contrastive feature alignment loss, and \(\mathcal{L}_{prompt} = \mathcal{L}_{div} + \text{KL}(\mathcal{G}\|\mathcal{N}(\mathbf{0},\mathbf{I}))\).
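
For a diagonal Gaussian, the KL term has a well-known closed form, \(\frac{1}{2}\sum_d(\sigma_d^2 + \mu_d^2 - 1 - \ln\sigma_d^2)\). The sketch below uses it and then combines the loss terms as in the formula above. It assumes \(\sigma\) is a standard deviation, and the numeric loss values and the weights `gamma` and `beta` are placeholders, not values from the paper.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ).

    The -log(sigma^2) term grows without bound as sigma approaches zero,
    which is what discourages the variance from collapsing.
    """
    return 0.5 * sum(
        s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma)
    )


def total_loss(pred, spatial, feature, diversity, kl, gamma, beta):
    """L_total = L_pred + L_spatial + gamma * L_feature + beta * L_prompt."""
    prompt = diversity + kl
    return pred + spatial + gamma * feature + beta * prompt


kl_matched = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
kl_shifted = kl_to_standard_normal([1.0], [1.0])
kl_narrow = kl_to_standard_normal([0.0], [0.01])
kl_moderate = kl_to_standard_normal([0.0], [0.5])

# Placeholder values chosen only to exercise the formula.
loss = total_loss(pred=1.0, spatial=0.5, feature=2.0,
                  diversity=0.1, kl=0.3, gamma=0.5, beta=0.1)
```

The KL term is zero only when the prompt distribution already matches the standard normal prior, and it rises sharply for very narrow distributions.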

Key Experimental Results

Main Results: AP-10K Dataset (AP Metrics)

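| Method | Backbone | AP | AP.50 | AP.75 | AR |
|---|---|---|---|---|---|
| HRNet | HRNet-W48 | 74.4 | 95.9 | 80.7 | - |
| ViTPose++ | ViT-Base | 74.5 | 94.9 | 82.2 | 70.0 |
| X-Pose-V | Swin-Large | 79.0 | 95.7 | 86.8 | - |
| CLAMP | ViT-Base | 74.7 | 95.3 | 81.2 | 77.4 |
| PPAP (Ours) | ViT-Base | 77.2 | 96.0 | 84.0 | 79.7 |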

Ablation Study: Contribution of Components (AP-10K, AP Metrics)

| Setting | AP | Description (cumulative gain over baseline) |
|---|---|---|
| Baseline (CLAMP) | 74.7 | Single Prompt + Deterministic |
| +Multi-attribute Prompts | 75.6 | +0.9 |
| +Probabilistic Modeling | 76.4 | +1.7 |
| +Attention Fusion | 77.0 | +2.3 |
| +Diversity Loss | 77.2 | +2.5 (Full) |
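
The gains listed in the ablation are measured against the baseline, so the step-by-step contribution of each component has to be derived by subtraction. A short sketch of that arithmetic follows, assuming the rows are stacked in the order listed (each row adds one component to the row before it).

```python
# (setting, AP) pairs copied from the ablation rows above.
ablation = [
    ("Baseline (CLAMP)", 74.7),
    ("+Multi-attribute Prompts", 75.6),
    ("+Probabilistic Modeling", 76.4),
    ("+Attention Fusion", 77.0),
    ("+Diversity Loss", 77.2),
]

# Difference between each row and the row before it, rounded to one decimal
# to hide floating-point noise.
increments = {
    name: round(ap - previous_ap, 1)
    for (_, previous_ap), (name, ap) in zip(ablation, ablation[1:])
}

total_gain = round(ablation[-1][1] - ablation[0][1], 1)
```

Under that assumption the per-step gains are 0.9, 0.8, 0.6, and 0.2 AP, which sum to the 2.5 AP total.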

Key Findings

  • PPAP achieves 77.2 AP on AP-10K using the ViT-Base backbone, outperforming CLAMP with the same backbone by 2.5 points.
  • In the ablation, the row that adds probabilistic modeling reaches 76.4 AP, which is +1.7 over the deterministic CLAMP baseline and +0.8 over the preceding multi-attribute row.
  • It performs exceptionally well under the zero-shot setting (P3) of AnimalKingdom, demonstrating strong generalization to unseen species.
  • The attention fusion strategy consistently outperforms heuristic and ensemble strategies.
  • Estimating variance via the visual-text decoder (cross-attention) is superior to estimating it from text alone.

Highlights & Insights

  1. Generality of Probabilistic Prompts: Introducing probability distributions into prompt learning naturally models data variation across species.
  2. Simplicity of Diversity Loss: An orthogonality constraint maintains attribute diversity, which makes for a simple yet effective design.
  3. GKP Strategy: Keypoint names can be placed at random positions in the templates, which is more flexible than the fixed-position strategy of ProDA.
  4. Vision-Guided Variance Estimation: Visual features regulate the variance of the text distributions, giving a deeper vision-language interaction than text-only estimation.

Limitations & Future Work

  • It still relies on the pretrained knowledge of CLIP, which may limit its effectiveness on animal species rarely seen by CLIP.
  • Probabilistic sampling introduces extra computational overhead, requiring multiple samples during inference.
  • Presently, only 2D keypoint estimation has been validated, while 3D animal pose estimation remains unexplored.
  • Future work can explore extending probabilistic prompt learning to other cross-domain vision tasks.

Related Work & Insights

  • Unlike ProDA, which models a single distribution in the output embedding space, PPAP models an independent Gaussian distribution for each prompt.
  • Unlike PPL, which constructs Gaussian mixture distributions, PPAP keeps each attribute independent and orthogonal.
  • Probabilistic prompt methods offer inspiration for other vision tasks that need to handle large cross-domain discrepancies.

Rating

⭐⭐⭐⭐ — The probabilistic prompt learning framework is well-designed, with experiments thoroughly covering supervised and zero-shot scenarios. The novelty of the method makes a solid contribution to the field of prompt learning.