MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition
Conference: NeurIPS 2025 | arXiv: 2505.20744 | Code: Available | Area: Interpretability
Keywords: Motion Primitives, Transformer, Wearable Sensors, Activity Recognition, Temporal Decomposition
TL;DR
MoPFormer decomposes wearable-sensor signals into sequences of motion primitives and models their temporal dependencies with a Transformer, surpassing state-of-the-art methods on multiple HAR benchmarks while remaining lightweight.
Background & Motivation
Background: Human activity recognition (HAR) using wearable sensors is widely applied in health monitoring and sports analysis; mainstream approaches employ CNNs or RNNs to process raw signals directly.
Limitations of Prior Work: (1) raw signals contain substantial noise, and varying sampling rates hinder generalization; (2) CNNs struggle to capture long-range temporal dependencies; (3) standard Transformers rely on tokenization strategies that are ill-suited to continuous HAR signals.
Key Challenge: The continuous nature of sensor signals conflicts with the discrete token format required by Transformers.
Key Insight: Human activities can be naturally decomposed into motion primitives (e.g., "raise arm," "take a step"), which serve as more semantically meaningful tokens for the Transformer.
Method
Overall Architecture
Input IMU signals → Motion primitive extraction (learned segmentation + encoding) → Primitive-sequence Transformer → Activity classification.
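This pipeline can be sketched in a few lines of pure Python. The stand-in components below are illustrative assumptions, not the authors' implementation: the paper learns both the segmentation and the encoding, whereas this sketch uses fixed-stride boundaries and summary statistics.

```python
# Hypothetical sketch of the MoPFormer pipeline stages.
# Fixed-stride segmentation and statistic-based encoding are stand-ins
# for the learned segmentation network and primitive encoder.

def segment_signal(signal, stride=25):
    """Stand-in for learned segmentation: cut the stream every `stride` samples."""
    return [signal[i:i + stride] for i in range(0, len(signal), stride)]

def encode_primitive(segment):
    """Stand-in for the primitive encoder: a fixed-length summary embedding."""
    n = len(segment)
    mean = sum(segment) / n
    energy = sum(x * x for x in segment) / n
    return [mean, energy, min(segment), max(segment)]

signal = [0.1 * (i % 50) for i in range(200)]            # toy 1-axis IMU stream
primitives = [encode_primitive(s) for s in segment_signal(signal)]
print(len(primitives))  # 200 raw samples -> 8 primitive tokens
```

The resulting short token sequence (here 8 tokens instead of 200 samples) is what the downstream Transformer consumes.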
Key Designs
- Motion Primitive Extraction
  - Function: Automatically decomposes continuous sensor signals into discrete primitive sequences.
  - Mechanism: A learned segmentation network identifies primitive boundaries in the signal; an encoder maps each segment to a fixed-length primitive embedding.
  - Design Motivation: Primitives constitute more natural and stable representation units that are invariant to sampling-rate variations.
- Primitive Transformer
  - Function: Models temporal dependencies among motion primitives.
  - Mechanism: Standard Transformer encoder with positional encoding, multi-head self-attention, and feed-forward layers.
  - Design Motivation: The primitive sequence is far shorter than the raw signal (10–20 primitives vs. hundreds of sample points), yielding high computational efficiency.
- Lightweight Design
  - Function: Keeps the model compact for deployment on edge devices.
  - Mechanism: Small embedding dimensions (64–128), shallow Transformer (2–4 layers), and parameter sharing.
  - Design Motivation: Wearable devices have limited computational resources.
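To make the dependency-modeling step concrete, here is a minimal single-head scaled dot-product self-attention over primitive embeddings, in pure Python. This is a sketch only: the paper's encoder additionally uses learned Q/K/V projections, multiple heads, positional encoding, and feed-forward layers.

```python
import math

def self_attention(tokens):
    """Single-head scaled dot-product self-attention over primitive tokens.
    Cost grows as O(n^2 * d) in the number of tokens n, which is why attending
    over 10-20 primitives is far cheaper than over hundreds of raw samples."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # score this query against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)                        # numerically stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [x / z for x in w]
        # output is the attention-weighted mix of the tokens (values = tokens here)
        out.append([sum(wi * k[j] for wi, k in zip(w, tokens)) for j in range(d)])
    return out

mixed = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

With n ≈ 16 primitive tokens the attention matrix has 256 entries, versus 160,000 for a 400-sample raw window, which is the efficiency argument behind the primitive tokenization.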
Loss & Training
Cross-entropy classification loss combined with an auxiliary primitive segmentation loss. The model is trained end-to-end.
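The combined objective has the shape L = L_CE + λ·L_seg. A hypothetical pure-Python version is below; the trade-off weight `lam` and the exact form of the segmentation loss are assumptions, since the summary does not specify them.

```python
import math

def total_loss(class_probs, label, boundary_probs, boundary_labels, lam=0.1):
    """Cross-entropy classification loss plus an auxiliary segmentation term.
    `lam` is a hypothetical trade-off weight; `boundary_probs` holds per-step
    probability distributions over {no-boundary, boundary}."""
    ce = -math.log(class_probs[label])                      # classification CE
    seg = -sum(math.log(p[t]) for p, t in zip(boundary_probs, boundary_labels))
    seg /= len(boundary_labels)                             # mean per-step CE
    return ce + lam * seg
```

Because both terms are differentiable, gradients from the classification head flow back through the segmentation network, which is what end-to-end training means here.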
Key Experimental Results
Main Results
| Method | UCI-HAR Acc↑ | PAMAP2 F1↑ | Opportunity F1↑ | Params↓ |
|---|---|---|---|---|
| DeepConvLSTM | 93.2% | 89.5% | 85.3% | 2.1M |
| InceptionTime | 94.1% | 90.8% | 86.7% | 3.5M |
| HAR-Transformer | 94.5% | 91.2% | 87.1% | 4.2M |
| MoPFormer | 95.8% | 92.7% | 89.3% | 0.8M |
Ablation Study
| Configuration | UCI-HAR Acc | Note |
|---|---|---|
| Fixed-window tokenization | 93.8% | Standard sliding-window segmentation |
| Learned segmentation, no Transformer | 94.2% | Primitives + MLP |
| Learned segmentation + Transformer | 95.8% | Full model |
Key Findings
- MoPFormer outperforms all baselines on every benchmark while achieving the smallest parameter count (0.8M vs. 4.2M).
- Motion primitive decomposition contributes most significantly — even replacing the Transformer with an MLP yields better results than standard methods.
- Strong cross-sampling-rate generalization — performance drops only 1.2% when transferring from 50 Hz to 25 Hz, compared to 5%+ for standard methods.
Highlights & Insights
- Naturalness of Motion Primitives: Human motion is inherently composed of elemental primitives, making this representation more aligned with kinematic reality. The approach is transferable to robot action recognition, sign language recognition, and related domains.
- Efficiency Advantage: Primitive sequences are substantially shorter than raw signals, drastically reducing the computational cost of the Transformer while simultaneously improving accuracy.
- Robustness: Resilience to sampling rate variation is a critical practical advantage for real-world deployment.
Limitations & Future Work
- The interpretability of learned primitive segmentations remains to be validated — it is unclear whether the discovered primitives correspond to genuine motion units.
- Evaluation is limited to IMU data; extension to other sensor modalities (EMG, pressure) is necessary.
- Primitive decomposition may be inaccurate for complex transitional activities (e.g., "sitting down immediately followed by standing up").
Related Work & Insights
- vs. HAR-Transformer: Applies fixed-window tokenization directly to raw signals, ignoring the natural structure of motion; MoPFormer's learned segmentation provides a more principled alternative.
- vs. DeepConvLSTM: A CNN-RNN hybrid with a large parameter count and insufficient capacity to capture long-range dependencies.
Rating
- Novelty: ⭐⭐⭐⭐ The use of motion primitives as tokens is intuitive and effective.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets with comprehensive ablation studies.
- Writing Quality: ⭐⭐⭐⭐ Clear and well-structured presentation.
- Value: ⭐⭐⭐⭐ High practical value for real-world wearable HAR applications.