# FSMC-Pose: Frequency and Spatial Fusion with Multiscale Self-calibration for Cattle Mounting Pose Estimation
Conference: CVPR 2026 · arXiv: 2603.16596 · Code: GitHub · Area: Human/Animal Understanding · Keywords: cattle pose estimation, mounting detection, estrus recognition, frequency-spatial fusion, lightweight model
## TL;DR
This paper proposes FSMC-Pose, a lightweight top-down framework for cattle mounting pose estimation in dense, cluttered farm environments. It couples the frequency-spatial fusion backbone CattleMountNet with the multiscale self-calibration head SC2Head, attaining 89.0% AP with only 2.698M parameters.
## Background & Motivation
- Background: Cattle mounting behavior is a key visual indicator of estrus, and its accurate recognition is critical for livestock production efficiency.
- Limitations of Prior Work: Existing animal pose estimation methods are largely transferred from human pose estimation and perform poorly in complex agricultural scenes—cluttered backgrounds, frequent inter-animal occlusion, and entangled limbs and joints lead to identity confusion.
- Key Challenge: High accuracy demands complex models, yet agricultural deployments require real-time inference and low-cost hardware. Furthermore, no publicly available mounting dataset exists.
- Goal: Achieve lightweight and efficient mounting pose estimation in dense, cluttered environments.
- Key Insight: Frequency-domain analysis to separate cattle from background, combined with multiscale self-calibration to correct structural misalignment under occlusion.
- Core Idea: SFEBlock employs wavelet decomposition and Gaussian smoothing to separate foreground from background; RABlock aggregates multiscale context; SC2Head corrects occlusion-induced misalignment via spatial-channel self-calibration.
## Method
### Overall Architecture
A top-down paradigm: detect individual cattle → crop → backbone feature extraction → prediction head outputs keypoint heatmaps. The backbone is CattleMountNet and the prediction head is SC2Head.
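The top-down flow above can be sketched as follows. This is a minimal illustration, not the paper's code: `detect_cattle` and `estimate_heatmaps` are hypothetical stand-ins for the detector and the CattleMountNet/SC2Head stages, and decoding uses the simplest per-channel argmax scheme.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Decode (K, H, W) keypoint heatmaps to (K, 2) integer (x, y)
    coordinates via per-channel argmax."""
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1).argmax(axis=1)
    return np.stack([flat % w, flat // w], axis=1)

def top_down_pose(frame, detect_cattle, estimate_heatmaps):
    """Top-down paradigm: detect each animal, crop, run the pose model on
    the crop, then map keypoints back to full-frame coordinates."""
    poses = []
    for x0, y0, x1, y1 in detect_cattle(frame):
        crop = frame[y0:y1, x0:x1]
        keypoints = decode_heatmaps(estimate_heatmaps(crop))
        poses.append(keypoints + np.array([x0, y0]))  # crop -> frame coords
    return poses
```

Because each crop is processed independently, runtime grows with the number of detected animals, which is why a lightweight backbone matters in dense scenes.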
### Key Designs
- SFEBlock (Spatial-Frequency Enhancement Block): Applies wavelet decomposition to separate high- and low-frequency components, and Gaussian smoothing to suppress background noise, enhancing the separability between cattle and cluttered backgrounds.
- RABlock (Receptive Field Aggregation Block): Aggregates contextual information across scales via multi-scale dilated convolutions, adapting to body parts of varying sizes.
- SC2Head (Spatial-Channel Self-Calibration Head): Attends to spatial and channel dependencies; a self-calibration branch corrects structural misalignment caused by occlusion.
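As a minimal illustration of the frequency-separation idea behind SFEBlock (an assumption-laden sketch, not the paper's implementation), a one-level Haar decomposition splits an image into a low-frequency approximation plus high-frequency detail bands; Gaussian smoothing of the low band models the background, while detail energy highlights animal contours:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def haar_dwt2(x):
    """One-level 2D Haar decomposition of an (H, W) array with even H, W."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 4.0  # low-frequency approximation (smooth regions)
    lh = (a - b + c - d) / 4.0  # horizontal detail
    hl = (a + b - c - d) / 4.0  # vertical detail
    hh = (a - b - c + d) / 4.0  # diagonal detail
    return ll, (lh, hl, hh)

def frequency_enhance(x, sigma=1.0):
    """Subtract a Gaussian-smoothed low-frequency background estimate and
    boost detail-rich (likely foreground) regions of x."""
    ll, (lh, hl, hh) = haar_dwt2(x)
    background = gaussian_filter(ll, sigma=sigma)   # smoothed low-freq band
    detail = np.abs(lh) + np.abs(hl) + np.abs(hh)   # edge/texture energy
    up_bg = np.kron(background, np.ones((2, 2)))    # upsample to input size
    weight = np.kron(detail / (detail.max() + 1e-8), np.ones((2, 2)))
    return (x - up_bg) * (1.0 + weight)             # bg-suppressed, detail-boosted
```

In the actual SFEBlock these operations act on learned feature maps inside the backbone rather than on raw pixels, but the separation principle is the same.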
### Loss & Training
Standard keypoint heatmap MSE loss: each keypoint is supervised by a 2D Gaussian target heatmap, and the predicted heatmaps are regressed to these targets with mean squared error.
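A minimal sketch of this supervision (standard practice for heatmap-based pose estimation; the Gaussian width `sigma` is an assumption, not a value from the paper):

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Render an (h, w) target heatmap: a 2D Gaussian centered on the
    keypoint at (cx, cy), peaking at 1.0."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def heatmap_mse(pred, target):
    """Mean squared error between predicted and target heatmap stacks."""
    return float(np.mean((pred - target) ** 2))
```

The soft Gaussian target, unlike a one-hot peak, gives gradient signal in a neighborhood around each keypoint, which stabilizes training.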
## Key Experimental Results
### Main Results
| Method | AP | AP75 | AR | Params | GFLOPs |
|---|---|---|---|---|---|
| FSMC-Pose | 89.0% | 92.5% | 89.9% | 2.698M | 4.41 |
| RTMPose | 87.6% | 89.5% | 89.0% | 13.5M | - |
### Key Findings
- Parameter count reduced by 80% relative to RTMPose, with a 1.4% AP improvement.
- Frequency-spatial fusion yields the largest gain in cluttered-background scenarios (+2.8% AP).
- The MOUNT-Cattle dataset contains 1,176 mounting instances, filling a critical gap in the field.
### Ablation Study
| Configuration | AP | Params |
|---|---|---|
| Full FSMC-Pose | 89.0% | 2.698M |
| w/o SFEBlock | 86.2% | 2.1M |
| w/o RABlock | 87.1% | 2.3M |
| w/o SC2Head | 87.5% | 2.4M |
| Spatial only (w/o frequency) | 86.8% | 2.5M |
## Highlights & Insights
- This is the first dataset and method dedicated to cattle mounting pose estimation, directly serving smart livestock farming.
- The strategy of using frequency-domain analysis to separate foreground from background is generalizable to other dense animal pose scenarios (e.g., poultry flocks, pig herds).
## Limitations & Future Work
- The dataset is relatively small (1,176 instances), posing a risk of overfitting.
- Validation is limited to mounting behavior; other behaviors (e.g., feeding, resting, rumination) are not addressed.
- Night/low-light conditions are insufficiently validated, despite significant illumination variation in outdoor farms.
- The computational overhead of wavelet decomposition may partially offset the lightweight advantage.
- The top-down paradigm depends on detector quality; pose estimation fails when detection fails.
- Video-level temporal modeling is not explored; the current approach operates on single frames only.
- Body shape variation across cattle breeds may affect generalization.
- No comparison is made against recent VLM-based animal pose estimation methods.
## Related Work & Insights
- vs. DeepLabCut: DeepLabCut performs poorly under occlusion; FSMC-Pose's self-calibration mechanism alleviates this issue.
- vs. RTMPose: RTMPose is general-purpose but parameter-heavy; FSMC-Pose is optimized for cattle scenarios and is substantially more lightweight.
## Supplementary Discussion
- The core innovation lies in moving from purely spatial analysis to a joint frequency-spatial representation, which gives the model a more separable view of foreground cattle versus cluttered background.
- The experimental design covers diverse scenarios and baseline comparisons, with gains that are consistent across settings.
- The modular design facilitates extension to related tasks and new datasets.
- Open-sourcing the code and data is of significant value for community reproduction and follow-up research.
- Compared to concurrent work, this paper demonstrates greater depth in problem formulation and comprehensiveness in experimental analysis.
- The paper's logical flow—from problem definition to method design to experimental validation—forms a complete and coherent narrative.
- The computational overhead is reasonable, rendering the method deployable in practical applications.
- Future work may consider fusion with additional modalities (e.g., audio, 3D point clouds).
- Validating scalability on larger data and models is an important subsequent direction.
- Combining the proposed method with reinforcement learning for end-to-end optimization is worth exploring.
- Cross-domain transfer is a direction worth investigating; the method's generalizability requires further validation.
- For edge computing and mobile deployment scenarios, a further-lightweighted variant of the method warrants investigation.
## Rating
- Novelty: ⭐⭐⭐ Component-level innovations are incremental, though the scenario and dataset are novel.
- Experimental Thoroughness: ⭐⭐⭐ Ablations are thorough, but scenarios are limited.
- Writing Quality: ⭐⭐⭐⭐ Problem definition is clear and well-structured.
- Value: ⭐⭐⭐⭐ Demonstrates direct application value for smart livestock farming.