Multivariate Gaussian Representation Learning for Medical Action Evaluation

Conference: AAAI 2026
arXiv: 2511.10060
Code: https://github.com/HaoxianLiu/GaussMedAct
Area: Medical Imaging / Action Recognition
Keywords: CPR Assessment, Gaussian Mixture Model, Skeleton-based Action Recognition, Spatiotemporal Representation, Medical Dataset

TL;DR

This paper proposes GaussMedAct, a framework that models joint motion trajectories as multivariate Gaussian mixtures and pairs this representation with a dual-stream encoding of Cartesian joint coordinates and bone vectors. It achieves 92.1% Top-1 accuracy on the newly constructed CPREval-6k dataset while requiring roughly 10% of the computation of ST-GCN (4.45 vs. 43.76 GFLOPs).

Background & Motivation

Background: CPR quality directly affects survival rates in cardiac arrest. Manual assessment achieves only 74.8% accuracy, and existing vision-based systems struggle to capture centimeter-level positional deviations and millisecond-level frequency variations.

Limitations of Prior Work:

  • RGB-based methods (e.g., TimeSformer) impose high computational costs and lack anatomical modeling.
  • Skeleton-based methods (e.g., ST-GCN) discard motion semantics through rigid temporal pooling and are sensitive to noise.
  • No suitable CPR assessment dataset exists: existing datasets are small-scale with coarse-grained annotations.

Key Insight: Inspired by 3D Gaussian Splatting, which efficiently represents dense point clouds with a compact set of Gaussian primitives, this work treats joint motion trajectories as spatiotemporal point sets and employs Gaussian mixture models for compact, noise-robust representation.

Method

Overall Architecture

Input skeleton sequence → Dual-stream spatial encoding (Cartesian coordinates + bone vectors) → Independent multivariate Gaussian representation per stream → Feature fusion → Downstream tasks (classification / report generation)
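
The dual-stream spatial encoding step of this pipeline can be illustrated with a minimal sketch. The joint indices, the `BONES` list, and the helper names are hypothetical (this summary does not specify the skeleton layout), so this is an illustration of the idea rather than the GaussMedAct implementation.

```python
# Hypothetical 4-joint arm chain: shoulder(0) -> elbow(1) -> wrist(2) -> hand(3).
# Each bone is (child, parent); the real skeleton layout is not specified here.
BONES = [(1, 0), (2, 1), (3, 2)]

def bone_vectors(frame, bones=BONES):
    """Bone stream for one frame: (dx, dy) from each parent joint to its child."""
    return [(frame[c][0] - frame[p][0], frame[c][1] - frame[p][1]) for c, p in bones]

def dual_stream(sequence, bones=BONES):
    """Split a list of frames (each a list of (x, y) joints) into the two streams."""
    joint_stream = [list(frame) for frame in sequence]                # absolute (x, y)
    bone_stream = [bone_vectors(frame, bones) for frame in sequence]  # relative (dx, dy)
    return joint_stream, bone_stream

# Two toy frames: the whole arm shifts down by one unit between frames.
sequence = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.5, 0.0)],
    [(0.0, -1.0), (1.0, -1.0), (2.0, -1.0), (2.5, -1.0)],
]
joint_stream, bone_stream = dual_stream(sequence)
print(joint_stream[1][0])  # absolute position changes with the shift
print(bone_stream[0] == bone_stream[1])  # bone vectors are unchanged by translation
```

In this toy case a pure translation changes every entry of the joint stream but leaves the bone stream identical, which is one way to see why the two streams carry complementary information.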

Key Designs

  1. Multivariate Gaussian Representation (MGR):

    • Function: Models the spatiotemporal trajectory of each joint \(i\), the point set \(\mathcal{X}_i = \{(x, y, \alpha \cdot t)\}\) in which \(\alpha\) scales the time axis against the spatial coordinates, as a mixture of \(K\) Gaussians whose parameters are estimated with the EM algorithm.
    • Mechanism: Each Gaussian component \(k\) yields a 10-dimensional action token \(\mathbf{f}_{i,k} = [\boldsymbol{\mu}; \mathbf{s}; \mathbf{q}] \in \mathbb{R}^{10}\), where \(\boldsymbol{\mu} \in \mathbb{R}^3\) is the mean (average position), \(\mathbf{s} \in \mathbb{R}^3\) the scale (motion amplitude), and \(\mathbf{q} \in \mathbb{R}^4\) the rotation quaternion (motion direction).
    • Design Motivation: The anisotropic covariance of Gaussian distributions naturally encodes motion direction and amplitude while providing robustness to pose estimation noise.
  2. Hybrid Spatial Encoding (HSE):

    • Function: The Cartesian coordinate stream captures global trajectory consistency, while the bone vector stream encodes motion transitions.
    • Mechanism: The joint stream uses absolute coordinates \((x, y)\); the bone stream uses difference vectors between adjacent joints \((\Delta x, \Delta y)\). Both streams are processed independently through MGR and then fused.
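    • Design Motivation: Psychological studies show that a sparse set of moving 2D light points is enough for observers to perceive an action, which supports a skeleton-only input. Within that input, absolute position (joint stream) and relative kinematics (bone stream) provide complementary cues.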
  3. CPREval-6k Dataset:

    • 6,372 expert-annotated CPR videos with 22 clinical labels.
    • Hierarchical annotation: one primary error label plus multiple secondary error labels per video.
    • Association rule mining reveals error propagation chains (e.g., unstable hand position → positional drift, confidence 77.6%).
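
The MGR token from item 1 can be sketched as follows: fit a small Gaussian mixture to one joint's \((x, y, \alpha \cdot t)\) points with EM, then turn each component into `[mean, scale, quaternion]`. Several things here are assumptions rather than details from the paper: the covariance factorization \(\Sigma = R\,\mathrm{diag}(\mathbf{s})^2 R^\top\) (borrowed from 3D Gaussian Splatting), the value of `alpha`, the synthetic trajectory, and all function names. The EM loop is a plain textbook version written in NumPy.

```python
import numpy as np

def log_gauss(X, mu, cov):
    """Log-density of N(mu, cov) evaluated at each row of X."""
    d = X - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ni,ni->n", d, np.linalg.solve(cov, d.T).T)
    return -0.5 * (X.shape[1] * np.log(2 * np.pi) + logdet + maha)

def fit_gmm(X, K, iters=50, seed=0, reg=1e-6):
    """Textbook EM for a full-covariance GMM (illustrative, not the paper's code)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)].copy()
    cov = np.stack([np.cov(X.T) + reg * np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities r[n, k], computed stably in log space.
        logp = np.stack(
            [np.log(pi[k]) + log_gauss(X, mu[k], cov[k]) for k in range(K)], axis=1
        )
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and covariances.
        Nk = r.sum(axis=0) + 1e-12
        pi = Nk / Nk.sum()
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            cov[k] = (r[:, k, None] * d).T @ d / Nk[k] + reg * np.eye(D)
    return pi, mu, cov

def rotmat_to_quat(R):
    """Proper rotation matrix (det = +1) -> unit quaternion (w, x, y, z)."""
    tr = np.trace(R)
    if tr > 0:
        s = 2.0 * np.sqrt(tr + 1.0)
        q = [0.25 * s, (R[2, 1] - R[1, 2]) / s, (R[0, 2] - R[2, 0]) / s, (R[1, 0] - R[0, 1]) / s]
    else:
        i = int(np.argmax(np.diag(R)))
        j, k = (i + 1) % 3, (i + 2) % 3
        s = 2.0 * np.sqrt(1.0 + R[i, i] - R[j, j] - R[k, k])
        q = [0.0] * 4
        q[0] = (R[k, j] - R[j, k]) / s
        q[i + 1] = 0.25 * s
        q[j + 1] = (R[j, i] + R[i, j]) / s
        q[k + 1] = (R[k, i] + R[i, k]) / s
    q = np.array(q)
    return q / np.linalg.norm(q)

def gaussian_to_token(mu, cov):
    """10-dim token [mu(3); s(3); q(4)], assuming cov = R diag(s)^2 R^T."""
    evals, evecs = np.linalg.eigh(cov)
    if np.linalg.det(evecs) < 0:  # flip one axis so evecs is a proper rotation
        evecs[:, 0] *= -1.0
    scale = np.sqrt(np.clip(evals, 0.0, None))
    return np.concatenate([mu, scale, rotmat_to_quat(evecs)])

# Synthetic single-joint trajectory: a 2 Hz vertical oscillation with jitter.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0, 200)
alpha = 0.5  # hypothetical time-scaling factor
x = 0.02 * rng.standard_normal(t.size)
y = 0.05 * np.sin(2 * np.pi * 2.0 * t) + 0.005 * rng.standard_normal(t.size)
X = np.column_stack([x, y, alpha * t])

pi, mu, cov = fit_gmm(X, K=2)
tokens = np.stack([gaussian_to_token(mu[k], cov[k]) for k in range(2)])
print(tokens.shape)  # one 10-dim token per Gaussian component
```

Whatever the number of frames, the joint is summarized by `K` fixed-size tokens, which is the source of the compactness claimed for MGR. Because EM is run here as a separate fitting step, gradients do not flow through it.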

Loss & Training

  • Label smoothing loss for improved generalization.
  • Early stopping with a maximum of 300 epochs.
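
For reference, a label-smoothed cross-entropy can be written in a few lines. This is the standard formulation with a uniform smoothing distribution; the smoothing factor `eps=0.1` is an assumed value, since the summary does not state the one used in the paper.

```python
import math

def label_smoothing_ce(logits, target, eps=0.1):
    """Cross-entropy against a smoothed target.

    The target distribution puts (1 - eps) on the true class and spreads
    eps uniformly over all C classes. eps=0.1 is an assumed default.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))  # stable log-sum-exp
    log_probs = [v - log_z for v in logits]
    C = len(logits)
    smooth = [eps / C + (1.0 - eps) * (i == target) for i in range(C)]
    return -sum(q * lp for q, lp in zip(smooth, log_probs))

# A confident correct prediction is penalized slightly more with smoothing,
# which discourages over-confident logits.
logits = [4.0, 0.0, 0.0]
print(label_smoothing_ce(logits, target=0, eps=0.0))
print(label_smoothing_ce(logits, target=0, eps=0.1))
```

Setting `eps=0.0` recovers ordinary cross-entropy, so the function can be checked against a known baseline.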

Key Experimental Results

Main Results (CPREval-6k)

| Method | Modality | Top-1 Acc | Top-5 Acc | GFLOPs |
|---|---|---|---|---|
| TimeSformer | RGB | 91.65% | 99.07% | 393.96 |
| SlowFast | RGB | 87.54% | 98.23% | 65.70 |
| ST-GCN | Skeleton | 86.22% | 97.81% | 43.76 |
| CTR-GCN | Skeleton | 89.38% | 98.06% | 6.73 |
| MGR-only | Skeleton | 89.54% | 98.27% | ~2 |
| GaussMedAct | Skeleton | 92.12% | 99.13% | 4.45 |

Cross-Dataset Evaluation (Coach Dataset, 14 Classes)

| Method | Modality | Top-1 Acc |
|---|---|---|
| TSN-pretrained | RGB | 90.67% |
| STGCN-best | Skeleton | 92.46% |
| PoseC3D | Skeleton | 92.08% |
| GaussMedAct | Skeleton | 95.24% |

Ablation Study

| Configuration | Top-1 Acc | GFLOPs | Notes |
|---|---|---|---|
| MGR only | 89.54% | ~2 | Gaussian representation alone surpasses the skeleton baselines (ST-GCN, CTR-GCN) |
| HSE only | 87.21% | ~3 | Spatial encoding alone |
| Full (MGR + HSE) | 92.12% | 4.45 | Gain of +2.58 percentage points over MGR only |

Key Findings

  • Skeleton-based methods require on average 6.13× fewer FLOPs than RGB methods; GaussMedAct surpasses all RGB methods with only 4.45 GFLOPs.
  • MGR alone is highly competitive (89.54%), validating the effectiveness of Gaussian distribution modeling for motion trajectories.
  • On the Coach dataset, GaussMedAct exceeds the best baseline by 2.78 percentage points (95.24% vs. 92.46% for STGCN-best), suggesting that the representation transfers beyond CPREval-6k.
  • Deployment in real CPR training scenarios yielded a 32% improvement in trainee practical assessment scores.
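
The headline ratios can be re-derived from the tables in this summary with a few lines of arithmetic. The snippet only restates reported numbers; the variable names are illustrative.

```python
# Top-1 accuracy (%) and GFLOPs copied from the tables in this summary.
gauss_acc, gauss_flops = 92.12, 4.45
stgcn_flops = 43.76
timesformer_flops = 393.96
mgr_only_acc = 89.54
coach_gauss_acc, coach_stgcn_acc = 95.24, 92.46

cost_vs_stgcn = gauss_flops / stgcn_flops                 # fraction of ST-GCN compute
reduction_vs_timesformer = timesformer_flops / gauss_flops
hse_gain_pts = gauss_acc - mgr_only_acc                   # percentage points
coach_gain_pts = coach_gauss_acc - coach_stgcn_acc        # percentage points

print(f"{cost_vs_stgcn:.1%} of ST-GCN GFLOPs")            # 10.2% of ST-GCN GFLOPs
print(f"{reduction_vs_timesformer:.1f}x fewer GFLOPs than TimeSformer")  # 88.5x
print(f"+{hse_gain_pts:.2f} points from adding HSE to MGR")              # +2.58
print(f"+{coach_gain_pts:.2f} points on the Coach dataset")              # +2.78
```

Note that the accuracy gains are differences in percentage points, not relative improvements.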

Highlights & Insights

  • Cross-domain transfer from 3D Gaussian Splatting to action recognition: The conceptual transfer from representing dense point clouds with Gaussian primitives to representing motion trajectories is both elegant and effective.
  • Extremely compact 10-dimensional action token: mean (3) + scale (3) + quaternion (4) = 10 dimensions, compressing variable-length temporal sequences into a fixed-size representation with high computational efficiency.
  • Dataset contribution: The hierarchical error annotations and association rule analysis in CPREval-6k reveal CPR error propagation chains, offering independent research value.
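
The confidence of an association rule such as "unstable hand position → positional drift" is the fraction of videos containing the antecedent that also contain the consequent. The sketch below computes it on a tiny invented label set; the videos and label strings are toy data for illustration only and are not drawn from CPREval-6k.

```python
def rule_confidence(transactions, antecedent, consequent):
    """confidence(A -> B) = count(A and B) / count(A) over a list of label sets."""
    a = set(antecedent)
    ab = a | set(consequent)
    n_a = sum(a <= labels for labels in transactions)
    n_ab = sum(ab <= labels for labels in transactions)
    return n_ab / n_a if n_a else 0.0

# Toy per-video error label sets (hypothetical, NOT real dataset annotations).
videos = [
    {"unstable_hand_position", "positional_drift"},
    {"unstable_hand_position", "positional_drift", "shallow_compression"},
    {"unstable_hand_position", "positional_drift"},
    {"unstable_hand_position"},
    {"positional_drift"},
]

conf = rule_confidence(videos, ["unstable_hand_position"], ["positional_drift"])
print(conf)  # 0.75: 3 of the 4 videos with the antecedent also show the consequent
```

Confidence is asymmetric: the reverse rule on the same toy data has the same count in the numerator but a different denominator, which is why such rules can be read as directed propagation chains.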

Limitations & Future Work

  • The number of Gaussian components \(K\) is a hyperparameter; different actions may require different values of \(K\).
  • The EM algorithm is non-differentiable, precluding end-to-end training of MGR parameters.
  • Validation is limited to CPR, a highly repetitive motion; applicability to complex free-form actions remains unverified.
  • 2D skeleton input discards depth information; 3D skeleton combined with MGR could yield further improvements.

Comparison with Prior Work

  • vs. ST-GCN: ST-GCN combines graph convolution with temporal convolution at 43.76 GFLOPs; GaussMedAct achieves higher accuracy at only 4.45 GFLOPs via Gaussian mixture modeling and dual-stream encoding.
  • vs. TimeSformer: Although RGB Transformers capture rich appearance features, they lack anatomical modeling; GaussMedAct's skeleton-based foundation makes it naturally suited to medical action evaluation.

Rating

  • Novelty: ⭐⭐⭐⭐ Gaussian representation for action recognition constitutes a novel cross-domain transfer.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Covers new dataset, cross-dataset evaluation, ablation study, and real-world deployment.
  • Writing Quality: ⭐⭐⭐⭐ Motivation is clear, though notation is dense.
  • Value: ⭐⭐⭐⭐ Both the dataset and method have practical value for medical action evaluation.