Spatiotemporal-Untrammelled Mixture of Experts for Multi-Person Motion Prediction
- Conference: AAAI 2026
- arXiv: 2512.21707
- Code: https://github.com/alanyz106/ST-MoE
- Area: Human Understanding
- Keywords: Multi-person motion prediction, Mixture of Experts, Mamba, spatiotemporal modeling, efficient inference
TL;DR
This paper proposes ST-MoE, the first framework to combine Mixture of Experts (MoE) with bidirectional spatiotemporal Mamba for multi-person motion prediction. Four heterogeneous spatiotemporal experts flexibly capture complex spatiotemporal dependencies, achieving state-of-the-art accuracy while reducing parameter count by 41.38% and accelerating training by 3.6×.
Background & Motivation
Multi-Person Motion Prediction (MPMP) aims to forecast future joint positions of multiple individuals from historical motion sequences, with important applications in human-robot interaction, autonomous driving, and surveillance systems.
Existing methods suffer from two core limitations:
Inflexible spatiotemporal representations:
- MRT employs fixed-pattern spatiotemporal positional encodings, lacking flexibility.
- TBIFormer introduces trajectory-aware relative positional encodings to enhance spatial awareness, but its body-part connectivity operations increase sequence length.
- IAFormer uses self-attention to explore spatiotemporal features in interaction information, yielding strong performance but poor efficiency.
High computational cost:
- Transformer-based methods incur large computational overhead due to the quadratic complexity of self-attention.
- Computational cost grows rapidly as the number of persons increases.
Core motivation: Can a novel paradigm be designed that is both flexible and efficient, comprehensively capturing spatiotemporal dependencies in human motion?
The authors' insight is threefold: (1) the dynamic activation mechanism of MoE enables flexible sub-network selection; (2) Mamba's linear complexity can replace quadratic-complexity attention; (3) combining both simultaneously addresses flexibility and efficiency.
Method
Overall Architecture
Input motion sequences → DCT + Multi-Pose Encoder → gated router assigns features to 4 spatiotemporal experts → expert outputs are aggregated via weighted summation → Multi-Pose Decoder + iDCT → predicted future motion.
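As a rough sketch of how these stages compose, assuming each stage is a standard PyTorch module (the class and argument names below are hypothetical placeholders, not the authors' code):

```python
import torch.nn as nn

class STMoESketch(nn.Module):
    """Hypothetical skeleton of the described pipeline (not the official code)."""
    def __init__(self, encoder, router, experts, decoder):
        super().__init__()
        self.encoder = encoder                 # Multi-Pose Encoder (applied after DCT)
        self.router = router                   # gating network -> expert mixture weights
        self.experts = nn.ModuleList(experts)  # the ST / TT / TS / SS experts
        self.decoder = decoder                 # Multi-Pose Decoder (followed by iDCT)

    def forward(self, x_dct):                  # x_dct: (B, D, T) in the DCT domain
        f = self.encoder(x_dct)
        w = self.router(f)                     # (B, num_experts) gating weights
        # weighted summation of expert outputs; inactive experts get ~0 weight
        out = sum(w[:, i, None, None] * expert(f)
                  for i, expert in enumerate(self.experts))
        return self.decoder(out)               # iDCT maps back to joint coordinates
```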
Key Designs
1. Problem Formulation and Input Processing
The historical motion sequence of the \(i\)-th person is defined as \(\textbf{P}_{1:t}^i \in \mathbb{R}^{D \times t}\), and the goal is to predict future motion \(\textbf{P}_{t+1:T}^i \in \mathbb{R}^{D \times (T-t)}\), where \(D = J \times 3\) denotes 3D coordinates of \(J\) joints.
Input padding: The last observed frame is replicated \(T-t\) times and concatenated to the observation sequence, forming \(\textbf{P}_{\text{input}}^i \in \mathbb{R}^{D \times T}\).
Multi-Pose Encoder: A 3-layer GCN encoder from IAFormer is adopted, with a DCT transform applied beforehand to improve representation compactness: \(\textbf{F}_{\text{input}} = \text{Encoder}(\text{DCT}(\textbf{P}_{\text{input}}))\).
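A minimal sketch of the padding step and an orthonormal DCT-II along the temporal axis (sizes are illustrative; the paper's exact transform settings are not reproduced here):

```python
import torch

def pad_last_frame(p, T):
    """Replicate the last observed frame so the input spans the full horizon."""
    d, t = p.shape
    return torch.cat([p, p[:, -1:].expand(d, T - t)], dim=1)   # (D, t) -> (D, T)

def dct_matrix(T):
    """Orthonormal DCT-II basis matrix (rows = frequencies)."""
    n = torch.arange(T, dtype=torch.float32)
    m = torch.cos(torch.pi * (2 * n + 1) * n.view(-1, 1) / (2 * T))
    m[0] *= 2 ** -0.5
    return m * (2 / T) ** 0.5

p = torch.randn(45, 50)        # D = 15 joints x 3 coords, t = 50 observed frames
x = pad_last_frame(p, 75)      # T = 75 total frames (illustrative)
C = dct_matrix(75)
x_dct = x @ C.T                # DCT along the temporal axis, fed to the encoder
x_back = x_dct @ C             # iDCT: orthonormality makes the transpose the inverse
```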
2. Mixture of Spatiotemporal Mamba Experts (MoSTME)
This is the core contribution. The encoded features \(\textbf{F}_{\text{input}} \in \mathbb{R}^{B \times D \times T}\) are fed simultaneously into the expert pool and the router.
Gated routing mechanism: \(\textbf{W} = \text{Softmax}\big(\text{TopK}(g(\textbf{F}_{\text{input}}), k)\big)\),
where \(g(\cdot)\) is an MLP-based gating function. TopK retains the original values of the top-\(k\) entries and sets the rest to \(-\infty\); after softmax, inactive experts receive near-zero weights, achieving sparse activation. Experiments confirm that activating all experts (\(k=4\)) yields the best performance.
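A minimal PyTorch sketch of this gating; the temporal pooling and MLP width below are assumptions, since the paper only specifies an MLP gate \(g(\cdot)\):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Sketch of the described gating: MLP scores -> TopK -> Softmax.
    Entries outside the top-k are set to -inf, so softmax drives their
    weights to ~0 (sparse activation)."""
    def __init__(self, dim, num_experts=4, k=4):
        super().__init__()
        # the paper only specifies an MLP gate g(.); this shape is an assumption
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, num_experts))
        self.k = k

    def forward(self, f):                       # f: (B, D, T) encoded features
        scores = self.gate(f.mean(dim=-1))      # pool over time (assumed) -> (B, E)
        topk = scores.topk(self.k, dim=-1)
        masked = torch.full_like(scores, float('-inf'))
        masked.scatter_(-1, topk.indices, topk.values)
        return F.softmax(masked, dim=-1)        # (B, E) expert mixture weights
```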
Four heterogeneous experts — each composed of a different combination of bidirectional spatial Mamba and bidirectional temporal Mamba:
| Expert Type | Processing Order | Captured Pattern |
|---|---|---|
| Spatial-Temporal (ST) | Spatial → Temporal | Spatial-then-temporal dependencies |
| Temporal-Temporal (TT) | Temporal → Temporal | Strong temporal dependencies |
| Temporal-Spatial (TS) | Temporal → Spatial | Temporal-then-spatial dependencies |
| Spatial-Spatial (SS) | Spatial → Spatial | Strong spatial dependencies |
Taking the ST expert as an example: \(\textbf{F}_{\text{ST}} = \text{TM}\big(\text{SM}(\textbf{F}_{\text{input}})\big)\), where \(\text{SM}(\cdot)\) and \(\text{TM}(\cdot)\) denote the bidirectional spatial and temporal Mamba blocks, respectively.
Key design: All experts share the same set of bidirectional temporal Mamba and bidirectional spatial Mamba parameters; only the combination order differs, substantially reducing parameter count.
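A compact sketch of this sharing scheme, assuming two shared modules (`spatial_mamba`, `temporal_mamba`) that each map a feature tensor to a tensor of the same shape:

```python
def make_experts(spatial_mamba, temporal_mamba):
    """Build the four heterogeneous experts from the SAME two shared modules;
    only the composition order differs (sketch of the sharing scheme)."""
    s, t = spatial_mamba, temporal_mamba
    return {
        "ST": lambda f: t(s(f)),   # spatial -> temporal
        "TT": lambda f: t(t(f)),   # temporal -> temporal
        "TS": lambda f: s(t(f)),   # temporal -> spatial
        "SS": lambda f: s(s(f)),   # spatial -> spatial
    }
```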
3. Bidirectional Spatiotemporal Mamba
The unidirectional modeling of vanilla Mamba limits global dependency capture. A bidirectional scanning mechanism is therefore introduced: each sequence is scanned in both the forward and backward directions (the backward pass flips the sequence along the scan axis), and the outputs of the two passes are fused.
Feature representations are then enhanced via LayerNorm + FFN + residual connection: \(\textbf{F}' = \textbf{F} + \text{FFN}(\text{LayerNorm}(\textbf{F}))\).
Spatial Mamba scans along the pose dimension \(D\); Temporal Mamba scans along the temporal dimension \(T\). Both follow the standard Selective SSM architecture with discretization and input-dependent \(A\), \(B\), \(C\) matrices.
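A sketch of a bidirectional block following these descriptions; whether the two directions share SSM parameters, and how their outputs are merged (summation below), are assumptions:

```python
import torch
import torch.nn as nn

class BiMambaBlock(nn.Module):
    """Bidirectional scanning sketch: run a causal sequence model on the
    sequence and on its time-reversed copy, merge the two passes, then apply
    the LayerNorm + FFN + residual refinement described above. `ssm` is any
    module mapping (B, L, C) -> (B, L, C), e.g. a Mamba block."""
    def __init__(self, ssm, dim):
        super().__init__()
        self.ssm = ssm                       # shared by both directions (assumed)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                    # x: (B, L, C)
        fwd = self.ssm(x)                    # forward scan
        bwd = torch.flip(self.ssm(torch.flip(x, dims=[1])), dims=[1])  # backward scan
        h = x + fwd + bwd                    # fuse directions (summation assumed)
        return h + self.ffn(self.norm(h))    # LayerNorm + FFN + residual
```

A temporal expert would run this with the scan axis of length \(T\); a spatial expert transposes the feature map first so the scan runs along the pose dimension \(D\).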
Loss & Training
Spatial loss \(L_s\): constrains joint positions in both the observed and predicted intervals
Temporal consistency loss \(L_t\): mitigates temporal jitter in predicted motion
Total loss: \(L = \alpha L_s + \beta L_t\), with \(\alpha=1, \beta=1, \lambda=0.1\)
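A minimal sketch of plausible forms for the two losses: an L2 position error for \(L_s\) and a first-difference velocity match for \(L_t\). Both forms are assumptions, since the paper's exact formulas are not reproduced here:

```python
import torch

def spatial_loss(pred, gt):
    """Sketch of L_s: mean per-joint L2 position error over all frames
    (observed + predicted intervals); the exact norm is an assumption."""
    b, d, t = pred.shape
    diff = (pred - gt).view(b, d // 3, 3, t)   # (B, J, 3, T)
    return diff.norm(dim=2).mean()

def temporal_loss(pred, gt):
    """Sketch of L_t: match frame-to-frame displacements to damp jitter
    (a common temporal-consistency form; assumed, not the paper's exact one)."""
    vel_p = pred[..., 1:] - pred[..., :-1]
    vel_g = gt[..., 1:] - gt[..., :-1]
    return (vel_p - vel_g).abs().mean()

pred = torch.randn(2, 45, 75)                  # (B, D = J*3, T)
gt = torch.randn(2, 45, 75)
loss = 1.0 * spatial_loss(pred, gt) + 1.0 * temporal_loss(pred, gt)  # alpha = beta = 1
```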
Training configuration: batch size 96, Adam optimizer, initial learning rate 0.01 decayed by a factor of \(0.1^{1/50}\) per epoch (i.e., a 10× reduction every 50 epochs), trained on a single RTX 3090 GPU.
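This decay schedule corresponds to PyTorch's `ExponentialLR`; a minimal sketch with a placeholder model:

```python
import torch

model = torch.nn.Linear(8, 8)                  # stand-in for the full ST-MoE model
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# multiply lr by 0.1**(1/50) each epoch => exactly one decade of decay per 50 epochs
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.1 ** (1 / 50))

for epoch in range(3):
    # ... iterate over batches of size 96 and call optimizer.step() here ...
    scheduler.step()
```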
Key Experimental Results
Main Results
CMU-Mocap (UMPM) dataset — JPE (mm):
| Method | 0.2s | 0.6s | 1.0s | Avg |
|---|---|---|---|---|
| MRT | 36 | 115 | 193 | 114 |
| TBIFormer | 30 | 109 | 182 | 107 |
| JRFormer | 32 | 104 | 161 | 99 |
| IAFormer | 32 | 96 | 159 | 96 |
| ST-MoE (Ours) | 31 | 95 | 158 | 95 |
CHI3D dataset — JPE (mm):
| Method | 0.2s | 0.4s | 0.6s | 0.8s | 1.0s | Avg |
|---|---|---|---|---|---|---|
| TBIFormer | 45 | 95 | 145 | 192 | 233 | 142 |
| IAFormer | 39 | 83 | 129 | 176 | 218 | 129 |
| ST-MoE (Ours) | 44 | 79 | 123 | 161 | 200 | 121 |
On CHI3D, average JPE is reduced by 8 mm relative to IAFormer and by 21 mm relative to TBIFormer.
Efficiency comparison: 41.38% fewer parameters and 3.6× training speedup (vs. IAFormer).
Ablation Study
Effectiveness of heterogeneous experts (CMU-Mocap UMPM):
| Configuration | Avg JPE (↓) | Avg APE (↓) | Note |
|---|---|---|---|
| Baseline (Encoder/Decoder only) | 111.1 | 73.3 | No experts |
| +ST expert ×4 | 104.5 | 70.7 | Spatial-temporal only |
| +TT expert ×4 | 98.1 | 66.4 | Temporal-temporal only |
| +TS expert ×4 | 100.1 | 68.7 | Temporal-spatial only |
| +SS expert ×4 | 98.3 | 68.2 | Spatial-spatial only |
| +All (one of each) | 95.0 | 65.4 | Heterogeneous combination is optimal |
Effectiveness of bidirectional scanning:
| Scanning Strategy | Avg JPE | Avg APE |
|---|---|---|
| Forward only | 99.3 | 67.5 |
| Backward only | 98.9 | 67.0 |
| Bidirectional | 95.0 | 65.4 |
Relative to forward-only scanning, the bidirectional variant reduces JPE by 4.3 mm and APE by 2.1 mm (3.9 mm and 1.6 mm relative to backward-only).
Key Findings
- Activating all experts is optimal: Experiments show that activating all 4 experts yields the best performance, with JPE/APE consistently decreasing as more experts are activated.
- Single MoE layer is best: Stacking additional MoE layers leads to overfitting.
- Heterogeneous outperforms homogeneous: The combination of four distinct expert types significantly surpasses using four copies of any single expert type.
- t-SNE visualization confirms that the four experts learn distinct feature distributions, forming clearly separated clusters.
- Adaptive gating weight visualization: TT/ST experts tend to capture approximately static motion patterns, while SS/TS experts tend to capture spatially dynamic patterns.
Highlights & Insights
- Elegant combination of MoE and Mamba: MoE provides flexible expert selection, while Mamba's linear complexity replaces the quadratic complexity of attention — two orthogonal improvements that reinforce each other.
- The parameter-sharing design is ingenious — all four experts share the same Mamba parameters and differ only in combination order, achieving diverse functionality with minimal parameters.
- Qualitative analysis is convincing: t-SNE and gating weight visualizations intuitively demonstrate the mechanism by which different experts capture distinct motion patterns (static vs. dynamic, spatial vs. temporal).
- The framework is general and can be extended to other sequence prediction tasks requiring spatiotemporal modeling.
Limitations & Future Work
- Deterministic prediction only: The current method outputs a single deterministic trajectory; future work should extend to stochastic multi-person motion prediction.
- Scene limitations: Experiments are conducted primarily in laboratory settings with few persons (2–10); performance in dense crowd scenarios remains unvalidated.
- Single MoE layer limitation: The authors find that multiple MoE layers cause overfitting, suggesting a need for improved regularization strategies.
- Fixed number of four experts: The design space of expert types warrants further exploration (e.g., introducing cross-person interaction experts).
Related Work & Insights
- IAFormer is the primary baseline, using attention to learn spatiotemporal interaction information with strong performance but poor efficiency.
- Mamba proposes a selective scanning mechanism for long-range dependency modeling with linear inference complexity.
- MoE-Mamba alternately stacks MoE and Mamba layers; the proposed approach — embedding Mamba inside experts — is more lightweight.
- The idea of applying heterogeneous experts to motion prediction can inspire other spatiotemporal modeling tasks such as traffic flow prediction and action recognition.
Rating
- Novelty: ⭐⭐⭐⭐ — The heterogeneous expert design combining MoE and Mamba is novel
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — 4 datasets, extensive ablations, and visualization analysis
- Writing Quality: ⭐⭐⭐⭐ — Clear structure and well-designed figures
- Value: ⭐⭐⭐⭐ — A benchmark work for the efficiency–accuracy trade-off