Multivariate Gaussian Representation Learning for Medical Action Evaluation
Conference: AAAI 2026
arXiv: 2511.10060
Code: https://github.com/HaoxianLiu/GaussMedAct
Area: Medical Imaging / Action Recognition
Keywords: CPR Assessment, Gaussian Mixture Model, Skeleton-based Action Recognition, Spatiotemporal Representation, Medical Dataset
TL;DR
This paper proposes GaussMedAct, a framework that models joint motion trajectories as multivariate Gaussian mixture distributions and combines them with a dual-stream encoding of Cartesian coordinates and bone vectors. It achieves 92.1% Top-1 accuracy on the newly constructed CPREval-6k dataset at roughly 10% of the computational cost of ST-GCN (4.45 vs. 43.76 GFLOPs).
Background & Motivation
Background: CPR quality directly affects survival rates in cardiac arrest. Manual assessment achieves only 74.8% accuracy, and existing vision-based systems struggle to capture centimeter-level positional deviations and millisecond-level frequency variations.
Limitations of Prior Work:
- RGB-based methods (e.g., TimeSformer) impose high computational costs and lack anatomical modeling.
- Skeleton-based methods (e.g., ST-GCN) discard motion semantics through rigid temporal pooling and are sensitive to noise.
- No suitable CPR assessment dataset exists; existing datasets are small-scale with coarse-grained annotations.
Key Insight: Inspired by 3D Gaussian Splatting, which efficiently represents dense point clouds with a compact set of Gaussian primitives, this work treats joint motion trajectories as spatiotemporal point sets and employs Gaussian mixture models for compact, noise-robust representation.
Method
Overall Architecture
Input skeleton sequence → Dual-stream spatial encoding (Cartesian coordinates + bone vectors) → Independent multivariate Gaussian representation per stream → Feature fusion → Downstream tasks (classification / report generation)
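The first stage of this pipeline, building the two input streams, can be sketched as follows. This is a minimal illustration, not the authors' code: the 5-joint chain in `EDGES` and the function name `dual_stream` are hypothetical, and a real pose estimator defines its own joint topology.

```python
import numpy as np

# Hypothetical (child, parent) pairs for a 5-joint chain; assumption for illustration only.
EDGES = [(1, 0), (2, 1), (3, 2), (4, 3)]

def dual_stream(joints, edges=EDGES):
    """Split a 2D skeleton sequence into a joint stream and a bone stream.

    joints: array of shape (T, J, 2) holding (x, y) per frame and joint.
    Returns (joint_stream, bone_stream) with shapes (T, J, 2) and (T, len(edges), 2).
    """
    joints = np.asarray(joints, dtype=float)
    child = np.array([c for c, _ in edges])
    parent = np.array([p for _, p in edges])
    # Bone vector = child position minus parent position, i.e. (dx, dy) per bone.
    bones = joints[:, child, :] - joints[:, parent, :]
    return joints, bones

# Usage: 3 frames of random keypoints for the 5-joint chain.
demo_joints = np.random.default_rng(0).random((3, 5, 2))
joint_stream, bone_stream = dual_stream(demo_joints)
```

Each stream would then be passed to the Gaussian representation stage separately before fusion.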
Key Designs
- Multivariate Gaussian Representation (MGR):
  - Function: Models the spatiotemporal trajectory of each joint \(\mathcal{X}_i = \{(x, y, \alpha \cdot t)\}\), where \(\alpha\) scales the time axis relative to the spatial axes, as a mixture of \(K\) Gaussian distributions, with parameters estimated via the EM algorithm.
  - Mechanism: Each Gaussian component produces a 10-dimensional action token \(\mathbf{f}_{i,k} = [\boldsymbol{\mu}; \mathbf{s}; \mathbf{q}] \in \mathbb{R}^{10}\), where \(\boldsymbol{\mu} \in \mathbb{R}^{3}\) is the mean (average position), \(\mathbf{s} \in \mathbb{R}^{3}\) the scale (motion amplitude), and \(\mathbf{q} \in \mathbb{R}^{4}\) the quaternion rotation (motion direction).
  - Design Motivation: The anisotropic covariance of Gaussian distributions naturally encodes motion direction and amplitude while providing robustness to pose estimation noise.
- Hybrid Spatial Encoding (HSE):
  - Function: The Cartesian coordinate stream captures global trajectory consistency, while the bone vector stream encodes motion transitions.
  - Mechanism: The joint stream uses absolute coordinates \((x, y)\); the bone stream uses difference vectors between adjacent joints \((\Delta x, \Delta y)\). Both streams are processed independently through MGR and then fused.
  - Design Motivation: Psychological studies show that a sparse set of moving 2D light points suffices for observers to perceive an action, so 2D joint data already carries the relevant signal; absolute position and relative kinematics offer complementary views of it.
- CPREval-6k Dataset:
  - 6,372 expert-annotated CPR videos with 22 clinical labels.
  - Hierarchical annotation: one primary error label plus multiple secondary error labels per video.
  - Association rule mining reveals error propagation chains (e.g., unstable hand position → positional drift, confidence 77.6%).
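The MGR token described above can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' implementation: it assumes the scale and quaternion come from an eigendecomposition of each component covariance (the factorization used in 3D Gaussian Splatting), and the names `fit_gmm`, `rotation_to_quaternion`, and `gaussian_tokens`, the iteration count, and the regularizer are all hypothetical choices.

```python
import numpy as np

def fit_gmm(points, k, n_iter=50, seed=0, reg=1e-6):
    """Plain EM for a full-covariance Gaussian mixture (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    means = points[rng.choice(n, size=k, replace=False)]
    covs = np.tile(np.cov(points.T) + reg * np.eye(d), (k, 1, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: per-point log responsibilities for each component.
        log_r = np.empty((n, k))
        for j in range(k):
            diff = points - means[j]
            inv = np.linalg.inv(covs[j])
            _, logdet = np.linalg.slogdet(covs[j])
            maha = np.einsum("ni,ij,nj->n", diff, inv, diff)
            log_r[:, j] = np.log(weights[j]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and covariances.
        nk = r.sum(axis=0) + 1e-12
        weights = nk / n
        means = (r.T @ points) / nk[:, None]
        for j in range(k):
            diff = points - means[j]
            covs[j] = (r[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
    return weights, means, covs

def rotation_to_quaternion(R):
    """Convert a proper 3x3 rotation matrix to a unit quaternion (w, x, y, z)."""
    q = np.zeros(4)
    trace = np.trace(R)
    if trace > 0:
        s = 2.0 * np.sqrt(trace + 1.0)
        q[:] = [0.25 * s, (R[2, 1] - R[1, 2]) / s, (R[0, 2] - R[2, 0]) / s, (R[1, 0] - R[0, 1]) / s]
    else:
        i = int(np.argmax(np.diag(R)))
        j, k = (i + 1) % 3, (i + 2) % 3
        s = 2.0 * np.sqrt(1.0 + R[i, i] - R[j, j] - R[k, k])
        q[0] = (R[k, j] - R[j, k]) / s
        q[1 + i] = 0.25 * s
        q[1 + j] = (R[j, i] + R[i, j]) / s
        q[1 + k] = (R[k, i] + R[i, k]) / s
    return q / np.linalg.norm(q)

def gaussian_tokens(points, k):
    """Return a (k, 10) array of [mean (3); scale (3); quaternion (4)] tokens."""
    _, means, covs = fit_gmm(points, k)
    tokens = []
    for mu, cov in zip(means, covs):
        eigvals, eigvecs = np.linalg.eigh(cov)   # cov = V diag(eigvals) V^T
        if np.linalg.det(eigvecs) < 0:
            eigvecs[:, 0] *= -1                  # flip one axis to get a proper rotation
        scale = np.sqrt(np.clip(eigvals, 0.0, None))
        tokens.append(np.concatenate([mu, scale, rotation_to_quaternion(eigvecs)]))
    return np.stack(tokens)

# Usage: a synthetic compression-like trajectory (x, y, alpha * t) with alpha = 0.5.
t = np.linspace(0.0, 1.0, 200)
noise = 0.01 * np.random.default_rng(0).standard_normal(200)
trajectory = np.stack([0.1 * np.sin(12 * t) + noise, np.cos(12 * t), 0.5 * t], axis=1)
tokens = gaussian_tokens(trajectory, k=3)
```

Whatever the sequence length, each joint yields exactly `K` tokens of 10 values, which is what makes the representation fixed-size.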
Loss & Training
- Label smoothing loss for improved generalization.
- Early stopping with a maximum of 300 epochs.
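For reference, a label-smoothed cross-entropy can be written as below. This is a generic sketch, not the paper's training code: the smoothing factor `eps=0.1` is an assumed value (the summary does not state it), and the function name is hypothetical. It uses the common convention of spreading `eps` uniformly over all classes.

```python
import numpy as np

def label_smoothing_cross_entropy(logits, targets, eps=0.1):
    """Mean cross-entropy against smoothed one-hot targets.

    logits: (N, C) array of unnormalized scores; targets: (N,) integer class ids.
    eps=0.1 is an assumed default, not a value reported in the paper.
    """
    logits = np.asarray(logits, dtype=float)
    n, c = logits.shape
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Smoothed target: eps / C on every class, plus (1 - eps) on the true class.
    smooth = np.full((n, c), eps / c)
    smooth[np.arange(n), targets] += 1.0 - eps
    return float(-(smooth * log_probs).sum(axis=1).mean())

# Usage: two samples, three classes.
demo_logits = np.array([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]])
demo_targets = np.array([0, 2])
loss = label_smoothing_cross_entropy(demo_logits, demo_targets)
```

With `eps=0` this reduces to standard cross-entropy; a positive `eps` penalizes over-confident predictions.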
Key Experimental Results
Main Results (CPREval-6k)
| Method | Modality | Top-1 Acc | Top-5 Acc | GFLOPs |
|---|---|---|---|---|
| TimeSformer | RGB | 91.65% | 99.07% | 393.96 |
| SlowFast | RGB | 87.54% | 98.23% | 65.70 |
| ST-GCN | Skeleton | 86.22% | 97.81% | 43.76 |
| CTR-GCN | Skeleton | 89.38% | 98.06% | 6.73 |
| MGR-only | Skeleton | 89.54% | 98.27% | ~2 |
| GaussMedAct | Skeleton | 92.12% | 99.13% | 4.45 |
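The "10% of ST-GCN" figure can be checked directly from the GFLOPs column above. The snippet below only re-uses numbers from the table (the approximate MGR-only entry is omitted); the variable names are arbitrary.

```python
# GFLOPs copied from the main results table above.
gflops = {
    "TimeSformer": 393.96,
    "SlowFast": 65.70,
    "ST-GCN": 43.76,
    "CTR-GCN": 6.73,
    "GaussMedAct": 4.45,
}

# Cost of GaussMedAct relative to each method (1.0 means equal cost).
relative_cost = {name: gflops["GaussMedAct"] / cost for name, cost in gflops.items()}

for name, ratio in relative_cost.items():
    print(f"{name:12s} {ratio:6.1%}")
```

The ST-GCN entry comes out at about 10.2%, matching the headline claim.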
Cross-Dataset Evaluation (Coach Dataset, 14 Classes)
| Method | Modality | Top-1 Acc |
|---|---|---|
| TSN-pretrained | RGB | 90.67% |
| STGCN-best | Skeleton | 92.46% |
| PoseC3D | Skeleton | 92.08% |
| GaussMedAct | Skeleton | 95.24% |
Ablation Study
| Configuration | Top-1 Acc | GFLOPs | Notes |
|---|---|---|---|
| MGR only | 89.54% | ~2 | Gaussian representation alone surpasses all skeleton baselines |
| HSE only | 87.21% | ~3 | Spatial encoding alone |
| Full (MGR + HSE) | 92.12% | 4.45 | Synergistic gain of +2.58 percentage points over MGR only |
Key Findings
- Skeleton-based methods require on average 6.13× fewer FLOPs than RGB methods; GaussMedAct surpasses all RGB methods with only 4.45 GFLOPs.
- MGR alone is highly competitive (89.54%), validating the effectiveness of Gaussian distribution modeling for motion trajectories.
- Cross-dataset accuracy on the Coach dataset exceeds the best baseline by 2.78 percentage points (95.24% vs. 92.46% for STGCN-best), demonstrating representational robustness.
- Deployment in real CPR training scenarios yielded a 32% improvement in trainee practical assessment scores.
Highlights & Insights
- Cross-domain transfer from 3D Gaussian Splatting to action recognition: The conceptual transfer from representing dense point clouds with Gaussian primitives to representing motion trajectories is both elegant and effective.
- Extremely compact 10-dimensional action token: mean (3) + scale (3) + quaternion (4) = 10 dimensions, compressing variable-length temporal sequences into a fixed-size representation with high computational efficiency.
- Dataset contribution: The hierarchical error annotations and association rule analysis in CPREval-6k reveal CPR error propagation chains, offering independent research value.
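The confidence reported for an error propagation rule is the standard association-rule quantity, confidence(A → B) = count(A and B) / count(A). A minimal sketch follows; the label strings and the toy annotations are invented for illustration and are not CPREval-6k data.

```python
def rule_confidence(records, antecedent, consequent):
    """confidence(A -> B): fraction of records containing A that also contain B."""
    with_antecedent = [labels for labels in records if antecedent in labels]
    if not with_antecedent:
        return 0.0
    return sum(consequent in labels for labels in with_antecedent) / len(with_antecedent)

# Toy per-video error label sets (hypothetical).
videos = [
    {"unstable_hand_position", "positional_drift"},
    {"unstable_hand_position", "positional_drift"},
    {"unstable_hand_position"},
    {"positional_drift"},
    {"insufficient_depth"},
]

conf = rule_confidence(videos, "unstable_hand_position", "positional_drift")
```

Here two of the three videos with the antecedent also show the consequent, so the rule's confidence is 2/3.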
Limitations & Future Work
- The number of Gaussian components \(K\) is a hyperparameter; different actions may require different values of \(K\).
- The EM algorithm is non-differentiable, precluding end-to-end training of MGR parameters.
- Validation is limited to CPR, a highly repetitive motion; applicability to complex free-form actions remains unverified.
- 2D skeleton input discards depth information; 3D skeleton combined with MGR could yield further improvements.
Related Work & Insights
- vs. ST-GCN: ST-GCN employs graph convolution with temporal convolution at 43.76 GFLOPs; GaussMedAct achieves higher accuracy using only 4.45 GFLOPs via Gaussian mixture modeling and dual-stream encoding.
- vs. TimeSformer: Although RGB Transformers capture rich features, they lack anatomical modeling; GaussMedAct's skeleton-based foundation makes it naturally suited for medical action evaluation.
Rating
- Novelty: ⭐⭐⭐⭐ Gaussian representation for action recognition constitutes a novel cross-domain transfer.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers new dataset, cross-dataset evaluation, ablation study, and real-world deployment.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clear, though notation is dense.
- Value: ⭐⭐⭐⭐ Both the dataset and method have practical value for medical action evaluation.