Multivariate Gaussian Representation Learning for Medical Action Evaluation¶

Conference: AAAI 2026
arXiv: 2511.10060
Code: https://github.com/HaoxianLiu/GaussMedAct
Area: Medical Imaging / Action Recognition
Keywords: CPR Assessment, Gaussian Mixture Model, Skeleton-based Action Recognition, Spatiotemporal Representation, Medical Dataset

TL;DR¶

This paper proposes GaussMedAct, a framework that models joint motion trajectories as multivariate Gaussian mixture distributions and combines them with a dual-stream encoding of Cartesian coordinates and bone vectors. It achieves 92.1% Top-1 accuracy on the newly constructed CPREval-6k dataset at roughly 10% of ST-GCN's computational cost (4.45 vs. 43.76 GFLOPs).

Background & Motivation¶

Background: CPR quality directly affects survival rates in cardiac arrest. Manual assessment achieves only 74.8% accuracy, and existing vision-based systems struggle to capture centimeter-level positional deviations and millisecond-level frequency variations.

Limitations of Prior Work:
- RGB-based methods (e.g., TimeSformer) impose high computational costs and lack anatomical modeling.
- Skeleton-based methods (e.g., ST-GCN) discard motion semantics through rigid temporal pooling and are sensitive to noise.
- No suitable CPR assessment dataset exists: existing datasets are small-scale with coarse-grained annotations.

Key Insight: Inspired by 3D Gaussian Splatting, which efficiently represents dense point clouds with a compact set of Gaussian primitives, this work treats joint motion trajectories as spatiotemporal point sets and employs Gaussian mixture models for compact, noise-robust representation.

Method¶

Overall Architecture¶

Input skeleton sequence → Dual-stream spatial encoding (Cartesian coordinates + bone vectors) → Independent multivariate Gaussian representation per stream → Feature fusion → Downstream tasks (classification / report generation)

Key Designs¶

Multivariate Gaussian Representation (MGR):
- Function: Models the spatiotemporal trajectory of each joint \(\mathcal{X}_i = \{(x, y, \alpha \cdot t)\}\) as a mixture of \(K\) Gaussian distributions, with parameters estimated via the EM algorithm.
- Mechanism: Each Gaussian component produces a 10-dimensional action token \(\mathbf{f}_{i,k} = [\boldsymbol{\mu}; \mathbf{s}; \mathbf{q}] \in \mathbb{R}^{10}\), where \(\boldsymbol{\mu} \in \mathbb{R}^{3}\) is the mean (average position), \(\mathbf{s} \in \mathbb{R}^{3}\) the scale (motion amplitude), and \(\mathbf{q} \in \mathbb{R}^{4}\) the rotation quaternion (motion direction).
- Design Motivation: The anisotropic covariance of Gaussian distributions naturally encodes motion direction and amplitude while providing robustness to pose estimation noise.
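
The MGR steps above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: it assumes the covariance is factored in the 3D Gaussian Splatting style, \(\Sigma = R\,\mathrm{diag}(\mathbf{s}^2)\,R^\top\), via an eigendecomposition, and that \(R\) is then converted to a unit quaternion. The function names (`fit_gmm`, `gaussian_tokens`, `rotmat_to_quat`) and all settings (\(K = 3\), \(\alpha = 1\), 50 EM iterations, the regularizer) are hypothetical.

```python
import numpy as np

def rotmat_to_quat(R):
    """Convert a proper 3x3 rotation matrix to a unit quaternion (w, x, y, z)."""
    tr = np.trace(R)
    q = np.zeros(4)
    if tr > 0:
        s = 2.0 * np.sqrt(tr + 1.0)
        q[:] = [0.25 * s, (R[2, 1] - R[1, 2]) / s,
                (R[0, 2] - R[2, 0]) / s, (R[1, 0] - R[0, 1]) / s]
    else:
        i = int(np.argmax(np.diag(R)))          # largest diagonal entry
        j, k = (i + 1) % 3, (i + 2) % 3
        s = 2.0 * np.sqrt(1.0 + R[i, i] - R[j, j] - R[k, k])
        q[0] = (R[k, j] - R[j, k]) / s
        q[1 + i] = 0.25 * s
        q[1 + j] = (R[j, i] + R[i, j]) / s
        q[1 + k] = (R[k, i] + R[i, k]) / s
    return q / np.linalg.norm(q)

def fit_gmm(X, K, n_iter=50, reg=1e-6, seed=0):
    """Plain EM for a full-covariance Gaussian mixture on points X of shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]     # initialise means from data points
    cov = np.stack([np.cov(X.T) + reg * np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities from per-component log densities
        logp = np.empty((N, K))
        for k in range(K):
            L = np.linalg.cholesky(cov[k])
            z = np.linalg.solve(L, (X - mu[k]).T)
            logp[:, k] = (np.log(pi[k]) - 0.5 * np.sum(z ** 2, axis=0)
                          - np.log(np.diag(L)).sum() - 0.5 * D * np.log(2 * np.pi))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, covariances
        Nk = r.sum(axis=0) + 1e-12
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            cov[k] = (r[:, k, None] * d).T @ d / Nk[k] + reg * np.eye(D)
    return pi, mu, cov

def gaussian_tokens(X, K):
    """Return a (K, 10) array of [mean(3); scale(3); quaternion(4)] tokens."""
    _, mu, cov = fit_gmm(X, K)
    tokens = []
    for k in range(K):
        evals, evecs = np.linalg.eigh(cov[k])   # cov = evecs @ diag(evals) @ evecs.T
        if np.linalg.det(evecs) < 0:            # force a proper rotation (det = +1)
            evecs[:, 0] *= -1
        scale = np.sqrt(np.maximum(evals, 0.0))
        tokens.append(np.concatenate([mu[k], scale, rotmat_to_quat(evecs)]))
    return np.stack(tokens)

# Synthetic single-joint trajectory (x, y, alpha * t) with alpha = 1; not real data.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 120)
traj = np.stack([0.5 + 0.01 * rng.standard_normal(120),
                 0.5 + 0.05 * np.sin(4 * np.pi * t) + 0.005 * rng.standard_normal(120),
                 1.0 * t], axis=1)
tokens = gaussian_tokens(traj, K=3)
print(tokens.shape)  # (3, 10)
```

Under this parameterization, a trajectory of any length is summarized by \(K \times 10\) numbers per joint, which is what makes the representation fixed-size.
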
Hybrid Spatial Encoding (HSE):
- Function: The Cartesian coordinate stream captures global trajectory consistency, while the bone vector stream encodes motion transitions.
- Mechanism: The joint stream uses absolute coordinates \((x, y)\); the bone stream uses difference vectors between adjacent joints \((\Delta x, \Delta y)\). Both streams are processed independently through MGR and then fused.
- Design Motivation: Psychological studies show that sparse 2D light points suffice to convey action impressions; absolute position and relative kinematics are thus complementary.
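
The two HSE streams described above could be built along these lines. This is a sketch only: the bone list `BONES` is a hypothetical 5-joint chain rather than the skeleton topology used in the paper, "adjacent joints" is assumed to mean joints connected by a bone, and `hybrid_streams` and its `alpha` argument are illustrative names.

```python
import numpy as np

# Hypothetical (parent, child) joint pairs for a 5-joint chain; the real
# topology depends on the pose estimator and is not given in this summary.
BONES = [(0, 1), (1, 2), (2, 3), (3, 4)]

def hybrid_streams(joints, bones=BONES, alpha=1.0):
    """Build joint and bone point sets from joints of shape (T, J, 2).

    Returns (T, J, 3) and (T, B, 3) arrays whose last axis is
    (x, y, alpha * t) and (dx, dy, alpha * t) respectively.
    """
    T = joints.shape[0]
    t = alpha * np.linspace(0.0, 1.0, T)            # scaled, normalised time
    parents = [p for p, _ in bones]
    children = [c for _, c in bones]
    bone_vecs = joints[:, children] - joints[:, parents]   # (T, B, 2)

    def with_time(a):
        tt = np.broadcast_to(t[:, None, None], a.shape[:2] + (1,))
        return np.concatenate([a, tt], axis=-1)

    return with_time(joints), with_time(bone_vecs)

# Toy input: 4 frames, 5 joints, 2D coordinates.
joints = np.arange(4 * 5 * 2, dtype=float).reshape(4, 5, 2)
joint_stream, bone_stream = hybrid_streams(joints)
print(joint_stream.shape, bone_stream.shape)  # (4, 5, 3) (4, 4, 3)
```

Each slice `joint_stream[:, j]` or `bone_stream[:, b]` is then a spatiotemporal point set that can be passed to the Gaussian fitting step independently.
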
CPREval-6k Dataset:
- 6,372 expert-annotated CPR videos with 22 clinical labels.
- Hierarchical annotation: one primary error label plus multiple secondary error labels per video.
- Association rule mining reveals error propagation chains (e.g., unstable hand position → positional drift, confidence 77.6%).
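
For readers unfamiliar with the "confidence" figure quoted for association rules: the confidence of a rule A → B is the fraction of samples containing A that also contain B. The sketch below computes it on made-up toy labels; the data, the label strings, and the helper name `rule_confidence` are illustrative assumptions and do not come from CPREval-6k.

```python
def rule_confidence(label_sets, antecedent, consequent):
    """confidence(A -> B) = count(A and B) / count(A)."""
    with_a = [labels for labels in label_sets if antecedent in labels]
    if not with_a:
        return 0.0
    return sum(consequent in labels for labels in with_a) / len(with_a)

# Toy per-video error label sets (invented for illustration only).
toy_videos = [
    {"unstable_hand_position", "positional_drift"},
    {"unstable_hand_position", "positional_drift"},
    {"unstable_hand_position", "positional_drift"},
    {"unstable_hand_position"},
    {"positional_drift"},
    set(),
]
conf = rule_confidence(toy_videos, "unstable_hand_position", "positional_drift")
print(f"{conf:.2f}")  # 0.75
```

Read this way, the reported 77.6% means that, among videos labeled with unstable hand position, about three in four also carry the positional drift label.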

Loss & Training¶

- Label smoothing loss for improved generalization.
- Early stopping with a maximum of 300 epochs.
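
As a reference for the loss mentioned above, here is a minimal NumPy sketch of label-smoothed cross-entropy in its common form, where the one-hot target is mixed with a uniform distribution. The smoothing factor `eps=0.1` is an assumed value (this summary does not state the one used), and `label_smoothing_ce` is a hypothetical helper name.

```python
import numpy as np

def label_smoothing_ce(logits, labels, eps=0.1):
    """Mean cross-entropy against targets (1 - eps) * one_hot + eps / C."""
    logits = np.asarray(logits, dtype=float)
    n, c = logits.shape
    z = logits - logits.max(axis=1, keepdims=True)      # stabilise the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    target = np.full((n, c), eps / c)                    # uniform share
    target[np.arange(n), labels] += 1.0 - eps            # true-class share
    return float(-(target * log_probs).sum(axis=1).mean())

# Toy batch: 2 samples over 22 classes (22 matches the number of clinical labels).
logits = np.zeros((2, 22))
labels = np.array([0, 5])
loss = label_smoothing_ce(logits, labels)
```

With `eps=0` the function reduces to ordinary cross-entropy; larger values penalize over-confident predictions more strongly.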

Key Experimental Results¶

Main Results (CPREval-6k)¶

Method	Modality	Top-1 Acc	Top-5 Acc	GFLOPs
TimeSformer	RGB	91.65%	99.07%	393.96
SlowFast	RGB	87.54%	98.23%	65.70
ST-GCN	Skeleton	86.22%	97.81%	43.76
CTR-GCN	Skeleton	89.38%	98.06%	6.73
MGR-only	Skeleton	89.54%	98.27%	~2
GaussMedAct	Skeleton	92.12%	99.13%	4.45

Cross-Dataset Evaluation (Coach Dataset, 14 Classes)¶

Method	Modality	Top-1 Acc
TSN-pretrained	RGB	90.67%
STGCN-best	Skeleton	92.46%
PoseC3D	Skeleton	92.08%
GaussMedAct	Skeleton	95.24%

Ablation Study¶

Configuration	Top-1 Acc	GFLOPs	Notes
MGR only	89.54%	~2	Gaussian representation alone surpasses all skeleton baselines
HSE only	87.21%	~3	Spatial encoding alone
Full (MGR + HSE)	92.12%	4.45	Synergistic gain of +2.58%

Key Findings¶

- Skeleton-based methods require on average 6.13× fewer FLOPs than RGB methods; GaussMedAct surpasses all RGB methods with only 4.45 GFLOPs.
- MGR alone is highly competitive (89.54%), validating the effectiveness of Gaussian distribution modeling for motion trajectories.
- Cross-dataset generalization improves by +2.78% (95.24% vs. 92.46%), demonstrating representational robustness.
- Deployment in real CPR training scenarios yielded a 32% improvement in trainee practical assessment scores.
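
Several of the headline figures in this summary follow directly from the reported tables. The short check below re-derives them; the numbers are copied from the tables in this document, and the variable names are illustrative.

```python
# Values copied from the CPREval-6k, ablation, and Coach tables above.
gflops = {"TimeSformer": 393.96, "SlowFast": 65.70, "ST-GCN": 43.76,
          "CTR-GCN": 6.73, "GaussMedAct": 4.45}

cost_vs_stgcn = gflops["GaussMedAct"] / gflops["ST-GCN"]   # compute relative to ST-GCN
synergy_gain = 92.12 - 89.54    # Full (MGR + HSE) vs. MGR only, in points
cross_gain = 95.24 - 92.46      # GaussMedAct vs. STGCN-best on Coach, in points

print(f"{cost_vs_stgcn:.1%}")   # 10.2%
print(f"{synergy_gain:.2f}")    # 2.58
print(f"{cross_gain:.2f}")      # 2.78
```

Note that the gains are differences in accuracy, so they are best read as percentage points rather than relative improvements.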

Highlights & Insights¶

- Cross-domain transfer from 3D Gaussian Splatting to action recognition: The conceptual transfer from representing dense point clouds with Gaussian primitives to representing motion trajectories is both elegant and effective.
- Extremely compact 10-dimensional action token: mean (3) + scale (3) + quaternion (4) = 10 dimensions, compressing variable-length temporal sequences into a fixed-size representation with high computational efficiency.
- Dataset contribution: The hierarchical error annotations and association rule analysis in CPREval-6k reveal CPR error propagation chains, offering independent research value.

Limitations & Future Work¶

- The number of Gaussian components \(K\) is a hyperparameter; different actions may require different values of \(K\).
- The EM algorithm is non-differentiable, precluding end-to-end training of MGR parameters.
- Validation is limited to CPR, a highly repetitive motion; applicability to complex free-form actions remains unverified.
- 2D skeleton input discards depth information; 3D skeleton combined with MGR could yield further improvements.

- vs. ST-GCN: ST-GCN combines graph convolution with temporal convolution at a cost of 43.76 GFLOPs; GaussMedAct achieves higher accuracy (92.12% vs. 86.22%) with only 4.45 GFLOPs via Gaussian mixture modeling and dual-stream encoding.
- vs. TimeSformer: Although RGB Transformers capture rich features, they lack anatomical modeling; GaussMedAct's skeleton-based foundation makes it naturally suited for medical action evaluation.

Rating¶

- Novelty: ⭐⭐⭐⭐ Gaussian representation for action recognition constitutes a novel cross-domain transfer.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers new dataset, cross-dataset evaluation, ablation study, and real-world deployment.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clear, though notation is dense.
- Value: ⭐⭐⭐⭐ Both the dataset and method have practical value for medical action evaluation.