High-Resolution Spatiotemporal Modeling with Global-Local State Space Models for Video-Based Human Pose Estimation

Conference: ICCV 2025 arXiv: 2510.11017 Code: N/A Area: Human Understanding Keywords: Video Human Pose Estimation, Mamba, State Space Model, Spatiotemporal Modeling, High-Resolution

TL;DR

This paper proposes GLSMamba, the first pure-Mamba framework for video-based human pose estimation (VHPE). It models global dynamic context via a Global Spatiotemporal Mamba (GSM) module—featuring 6D selective space-time scanning and spatiotemporal-modulated scan merging—and captures local keypoint details via a Local Refinement Mamba (LRM) with windowed spatiotemporal scanning. The method achieves state-of-the-art performance on four benchmarks with linear computational complexity.

Background & Motivation

Video human pose estimation (VHPE) requires dense spatiotemporal analysis. The key challenge is capturing both of the following simultaneously:

  • Global dynamic context: overall body motion patterns and trends
  • Local motion details: high-frequency variations at individual keypoints

Inherent limitations of existing approaches:

CNN-based methods (e.g., TDMI): Fixed receptive fields limit global reasoning, leading to large errors under occlusion and motion blur.

Transformer-based methods (e.g., DiffPose): Capable of capturing global dependencies but neglect local high-frequency details, and incur quadratic complexity on high-resolution sequences; operating directly on \(\frac{1}{4}\)-resolution features across \(T\) frames (15,360 tokens) runs out of memory (OOM).

Existing video Mamba methods (e.g., VideoMamba): Perform only per-frame bidirectional scanning by flattening spatial tokens, which increases the distance between temporally adjacent tokens and lacks dedicated designs for local details.

Core observation: There is a need for an architecture that (1) performs global modeling on high-resolution spatiotemporal sequences with linear complexity, and (2) simultaneously enhances local keypoint motion details.

Method

Overall Architecture

Input video sequence → Visual encoder (ViTPose, frozen) extracts high-resolution features → Global Spatiotemporal Mamba (GSM, 4 blocks) → Local Refinement Mamba (LRM, 2 blocks) → Detection head → Pose heatmaps.

Features are kept at \(\frac{1}{4}\) input resolution, i.e., \(\frac{H}{4} \times \frac{W}{4}\) tokens per frame across \(T\) frames. For a 5-frame sequence (\(\delta=2\): two preceding and two following frames plus the current frame), the total token count reaches 15,360.
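
The 15,360-token figure can be reproduced with simple arithmetic. The sketch below assumes a 256×192 input crop (a common pose-estimation size that is consistent with the stated total, though not explicitly given in this summary):

```python
# Token count for GLSMamba's high-resolution spatiotemporal sequence.
# Assumes a 256x192 input crop; the summary only states the 1/4-resolution
# feature map and the 15,360-token total.
H_in, W_in = 256, 192
stride = 4                               # features kept at 1/4 input resolution
T = 5                                    # delta = 2 -> 2 past + 2 future + current

h, w = H_in // stride, W_in // stride    # 64 x 48 tokens per frame
tokens_per_frame = h * w                 # 3,072
total_tokens = tokens_per_frame * T      # 15,360

print(h, w, tokens_per_frame, total_tokens)  # -> 64 48 3072 15360
```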

Key Designs

  1. Global Spatiotemporal Mamba (GSM):

    • Sequential Channel Attention: Concatenates the feature sequence and applies GAP → MLPs → sigmoid to produce per-frame channel attention weights, adaptively activating important spatiotemporal information.
    • 6D Selective Space-Time Scan (STS6D): Flattens the feature sequence into 1D along six spatiotemporal scanning paths and feeds each into an S6 block. Specifically, multi-frame features are stacked into a panoramic spatiotemporal representation; horizontal/vertical traversals yield \(\tilde{\mathbf{y}}_1, \tilde{\mathbf{y}}_4\) (unified scan, capturing high-level spatiotemporal representations); per-frame spatial traversals yield \(\tilde{\mathbf{y}}_2, \tilde{\mathbf{y}}_5\) (spatial scan, capturing complete body spatial context); pixel-wise temporal traversals yield \(\tilde{\mathbf{y}}_3, \tilde{\mathbf{y}}_6\) (temporal scan, capturing dense motion trends).
    • Spatial- and Temporal-Modulated scan Merging (STMM): First merges bidirectional scan results by type (\(\tilde{\mathbf{y}}_u, \tilde{\mathbf{y}}_s, \tilde{\mathbf{y}}_t\)), then applies spatial and temporal modulation compensation via Deformable Convolution to adaptively aggregate scan knowledge from different semantic streams.

Design Motivation: Adapts 1D Mamba to video spatiotemporal modeling; the 6-direction scanning fully exploits information across all dimensions, and DCN-based adaptive fusion avoids the information loss caused by simple summation.

  2. Local Refinement Mamba (LRM):

    • Windowed Space-Time Scan (WSTS): Partitions the feature sequence into non-overlapping 3D temporal tube windows (e.g., \(8\times6\times T\)), performing forward and backward scanning within each window and feeding the results into S6 blocks.
    • Enhances local details while maintaining a sequence-length-scale receptive field.
    • Compared to a GSM block, the LRM block removes Sequential Channel Attention and replaces STS6D/STMM with WSTS.

Design Motivation: While GSM focuses on global understanding, it lacks the local high-frequency details of individual keypoints. LRM complements this with fine-grained motion information via dense scanning within local windows.
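
The window partitioning can be sketched as a reshape/transpose, assuming the paper's 1/4-resolution 64×48 feature map and the stated 8×6×T window (channel count here is illustrative):

```python
import numpy as np

# Sketch of WSTS window partitioning: non-overlapping 3D temporal tubes,
# each flattened into a short sequence and scanned bidirectionally.
T, H, W, C = 5, 64, 48, 16
wh, ww = 8, 6                                  # window height / width
F = np.random.randn(T, H, W, C)

# (T, H, W, C) -> (num_windows, T*wh*ww, C)
windows = (F.reshape(T, H // wh, wh, W // ww, ww, C)
             .transpose(1, 3, 0, 2, 4, 5)
             .reshape((H // wh) * (W // ww), T * wh * ww, C))

forward = windows                    # scanned front-to-back per window
backward = windows[:, ::-1]          # and back-to-front
assert windows.shape == (64, 240, C)  # 64 tubes of 5*8*6 = 240 tokens each
```

Each tube is only 240 tokens, so the per-window scan is cheap, yet the recurrence still spans the full temporal extent of the clip.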

  3. Dual-stream gating design: In each GSM block, the main stream passes through STS6D+STMM to produce global features \(\tilde{\mathcal{F}}\); a parallel stream applies depthwise convolution + LayerNorm + SiLU to produce gating attention \(\bar{\mathcal{A}}\). The two streams are multiplied element-wise before passing through an FFN.
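
A toy version of the gating stream is sketched below. The stand-in "global stream" is a placeholder (the real main stream is STS6D + STMM), and all shapes are illustrative:

```python
import numpy as np

# Toy sketch of dual-stream gating in a GSM block.
# Gate stream: depthwise 3x3 conv -> LayerNorm -> SiLU; then element-wise
# multiplication with the global features before the FFN.
def depthwise_conv3x3(x, w):
    # x: (H, W, C), w: (3, 3, C); zero padding, stride 1, one filter per channel.
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += xp[di:di + x.shape[0], dj:dj + x.shape[1]] * w[di, dj]
    return out

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
H, W, C = 8, 6, 4
x = rng.standard_normal((H, W, C))
w_dw = rng.standard_normal((3, 3, C))

global_feat = layer_norm(x)                  # placeholder for STS6D + STMM
gate = silu(layer_norm(depthwise_conv3x3(x, w_dw)))
fused = global_feat * gate                   # element-wise gating, then FFN
assert fused.shape == (H, W, C)
```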

Loss & Training

  • Standard heatmap estimation loss, summed over keypoints \(i\) and frames \(t\): \(\mathcal{L}_H = \|\hat{\mathbf{H}}^i_t - \mathbf{H}^i_t\|_2^2\)
  • Initialized with ViTPose pretrained weights (on COCO); the backbone is kept frozen during training.
  • AdamW optimizer; initial lr \(1\times10^{-4}\), decayed to \(1\times10^{-5}\) at epoch 6, and \(1\times10^{-6}\) at epoch 12.
  • Data augmentation: random rotation/scaling, cropping, and flipping.
  • Temporal span \(\delta=2\) (5 frames total); trained for 20 epochs on a single TITAN RTX GPU.
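
The heatmap supervision can be sketched as follows. Sigma and heatmap size are illustrative choices, not values stated in this summary:

```python
import numpy as np

# Minimal sketch of heatmap supervision: a Gaussian target per keypoint and
# the L2 loss L_H = || H_hat - H ||_2^2 (averaged over keypoints here).
def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def heatmap_loss(pred, target):
    return np.mean(np.sum((pred - target) ** 2, axis=(1, 2)))

K, h, w = 17, 64, 48                       # COCO-style 17 keypoints, 1/4-res map
target = np.stack([gaussian_heatmap(h, w, 10 + k, 20) for k in range(K)])
pred = target + 0.01                        # a near-perfect prediction
print(heatmap_loss(pred, target))           # small positive value
```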

Key Experimental Results

Main Results (Tables)

PoseTrack2017 Validation Set (mAP):

| Method | Backbone | Mean mAP |
| --- | --- | --- |
| PoseWarper | HRNet-W48 | 81.2 |
| DCPose | HRNet-W48 | 82.8 |
| FAMI-Pose | HRNet-W48 | 84.8 |
| TDMI | HRNet-W48 | 85.7 |
| DiffPose | ViT-B | 86.4 |
| DSTA | ViT-H | 85.6 |
| GLSMamba-B | ViT-B | 86.9 |
| GLSMamba-H | ViT-H | 88.0 |

PoseTrack2018 / PoseTrack21 / Sub-JHMDB:

| Dataset | GLSMamba-B | GLSMamba-H | Prev. SOTA |
| --- | --- | --- | --- |
| PoseTrack2018 | 84.2 | 84.9 | 83.5 (TDMI/DSTA) |
| PoseTrack21 | 84.1 | 84.7 | 83.5 (TDMI/DSTA) |
| Sub-JHMDB | 97.9 | – | 96.0 (FAMI-Pose) |

Ablation Study (Tables)

Component Ablation (PoseTrack2017):

| Setting | GSM | LRM | mAP |
| --- | --- | --- | --- |
| Backbone only | – | – | 74.2 |
| + GSM | ✓ | – | 86.0 (+11.8) |
| + GSM + LRM (full) | ✓ | ✓ | 86.9 (+0.9) |

STS6D Scanning Direction Ablation:

| Scanning Directions | #Params | GFLOPs | mAP |
| --- | --- | --- | --- |
| Unified scan | 9.1M | 137.4 | 85.8 |
| + Spatial scan | 9.4M | 138.1 | 86.5 |
| + Spatial + Temporal scan (full STS6D) | 9.8M | 138.9 | 86.9 |
| Full STS6D w/o STMM | 9.1M | 137.4 | 86.2 |

Resolution Impact and Computational Efficiency:

| Method | Resolution | #Tokens | #Params | GFLOPs | mAP |
| --- | --- | --- | --- | --- | --- |
| GLSMamba-B | \(\frac{1}{4}\times T\) | 15,360 | 9.8M | 138.9 | 86.9 |
| GLSMamba-BLR | \(\frac{1}{16}\times T\) | 960 | 9.8M | 85.1 | 85.7 |
| TransLR | \(\frac{1}{16}\times T\) | 960 | 46.3M | 125.7 | 84.2 |
| TransNR | \(\frac{1}{8}\times T\) | 3,840 | 47M | 315.2 | 84.8 |
| TransHR | \(\frac{1}{4}\times T\) | 15,360 | – | – | OOM |

Key Findings

  1. GSM contributes most: Introducing GSM alone raises mAP from 74.2 to 86.0 (+11.8), demonstrating the critical importance of global spatiotemporal modeling for VHPE.
  2. 6-direction scanning yields incremental gains: Unified → +Spatial → +Temporal scanning improves mAP from 85.8 → 86.5 → 86.9 with negligible additional computation.
  3. STMM outperforms simple summation by 0.7 mAP: Adaptive fusion of semantically distinct scan results is important.
  4. High resolution is significantly beneficial: \(\frac{1}{4}\) resolution outperforms \(\frac{1}{16}\) by 1.2 mAP, while Transformer-based architectures OOM at the same resolution.
  5. Extremely parameter-efficient: Only 9.8M trainable parameters (86.2% fewer than methods that fine-tune the backbone), with 138.9 GFLOPs (66% of PoseWarper).
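
The quadratic-versus-linear gap can be made concrete with a back-of-envelope count of token interactions at the paper's sequence lengths. These are raw interaction counts, not FLOPs, so they only illustrate the scaling trend:

```python
# Self-attention scales with N^2 token pairs; a selective scan visits each
# token once per path (STS6D runs 6 directional scans).
def attention_pairs(n):
    return n * n

def scan_steps(n, num_paths=6):
    return n * num_paths

for n in (960, 3_840, 15_360):
    ratio = attention_pairs(n) / scan_steps(n)
    print(n, attention_pairs(n), scan_steps(n), round(ratio))
```

At the full 15,360-token resolution the pairwise count is over 2,500× the scan-step count, which is consistent with the Transformer baselines going OOM while GLSMamba does not.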

Highlights & Insights

  • First pure-Mamba VHPE framework: Demonstrates the substantial potential of SSMs for dense prediction tasks in computer vision.
  • Design philosophy of decoupled global-local modeling: GSM and LRM serve distinct roles, proving more effective than a unified architecture.
  • Linear complexity for high-resolution sequences: While Transformers OOM at 15,360 tokens, Mamba operates at only 138.9 GFLOPs.
  • Elegant multi-direction design of STS6D: The three scan types—unified, spatial, and temporal—each capture distinct semantics and are strongly complementary.
  • Extremely low training cost: Frozen backbone with only 9.8M trainable parameters, trainable on a single TITAN RTX GPU.

Limitations & Future Work

  1. The backbone weights are fully frozen, which may limit adaptability in domain-specific settings.
  2. The temporal span is fixed at \(\delta=2\) (5 frames); a longer temporal range may further improve performance.
  3. On Sub-JHMDB, a gap remains relative to the post-processing method DeciWatch (98.8 vs. 97.9); however, post-processing methods operate in pose coordinate space and are thus not directly comparable.
  4. The local window size (\(8\times6\times T\)) is fixed; adaptive window sizes may be more effective.
  5. Other dense spatiotemporal tasks, such as 3D human pose estimation and video segmentation, remain unexplored.

Related Work & Context

  • SSM/Mamba family: The evolution from S4 to Mamba; this work extends 1D Mamba to high-resolution video spatiotemporal modeling.
  • Limitations of CNN vs. Transformer: CNNs have limited receptive fields; Transformers suffer from quadratic complexity; Mamba achieves the best trade-off at high resolution.
  • Comparison with VideoMamba: VideoMamba performs only simple per-frame flattening, whereas the proposed 6D scanning combined with windowed local scanning provides a more comprehensive treatment.
  • The proposed approach provides a reference for adapting Mamba to other spatiotemporal tasks such as tracking and action recognition.

Rating

  • Novelty: ⭐⭐⭐⭐ First pure-Mamba VHPE framework; STS6D multi-direction scanning and STMM fusion are novel designs.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive validation on four benchmarks with detailed ablations covering components, scanning directions, resolution, and computational efficiency.
  • Writing Quality: ⭐⭐⭐⭐ Clear structure, complete mathematical derivations, and rich visualizations (activation maps, comparisons).
  • Value: ⭐⭐⭐⭐ Opens a new direction for Mamba in dense spatiotemporal prediction tasks, with a prominent computational efficiency advantage.