Towards Efficient General Feature Prediction in Masked Skeleton Modeling¶
Conference: ICCV 2025
arXiv: 2509.03609
Code: None
Area: Skeleton Action Recognition / Self-Supervised Learning
Keywords: masked skeleton modeling, high-level semantic prediction, target generation network, self-supervised learning, action recognition
TL;DR¶
This paper proposes GFP (General Feature Prediction), a framework that elevates the reconstruction target in masked skeleton modeling from low-level joint coordinates to multi-scale high-level semantic feature prediction. Coupled with a lightweight Target Generation Network and an information maximization constraint, GFP achieves a 6.2× training speedup while attaining state-of-the-art performance.
Background & Motivation¶
- Masked skeleton modeling follows the MAE paradigm, randomly masking joints and reconstructing missing coordinates.
- Existing methods suffer from two critical issues:
- Heavy decoder computation: A 90% masking ratio combined with a Transformer decoder produces extremely long decoding sequences (750 low-level targets), resulting in slow training.
- Lack of semantic supervision: Low-level coordinate reconstruction lacks high-level spatiotemporal semantic guidance, creating a semantic gap with downstream tasks.
- Although S-JEPA uses model-generated features as targets, its patch-level prediction requires a large EMA encoder (28.32G FLOPs) and converges very slowly (requiring 1200 epochs).
- Core Idea: Replace the large number of low-level targets (750) with fewer high-level semantic targets (251), simultaneously improving feature quality and training efficiency.
Method¶
Overall Architecture¶
GFP establishes a bidirectional learning paradigm between an encoder–decoder architecture and a Target Generation Network (TGN). The encoder processes visible joint features; the decoder progressively predicts multi-scale high-level features (from short-term motion patterns to global action semantics); and the TGN provides online supervision through consistency learning. Variance–covariance regularization prevents representation collapse.
Key Designs¶
- High-Level Feature Prediction: Instead of conventional per-patch coordinate reconstruction, a hierarchical prediction objective is designed:
- The learning target is elevated from \(\mathcal{L}_p = \frac{1}{N}\|f(E_N) - X_e\|^2_F\) (low-level) to \(\mathcal{L}_p = \frac{1}{M}\|f(E_N) - g(X)\|^2_F\) (high-level).
- Decoder inputs are progressively downsampled along the temporal axis via temporal average pooling (kernel sizes \(t_1, t_2, \ldots\)), building a pyramid structure through cascaded Transformer decoders.
- Four hierarchical feature levels: \(t_1=5\) (5-frame local motion), \(t_2=10\) (10-frame mid-term dynamics), \(t_3=30\) (30-frame long-range cycles), and global semantics.
- The total number of prediction targets is reduced from 750 to 251, and decoder FLOPs drop from 17.70G to 1.57G.
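The 750 → 251 reduction follows directly from the pooling pyramid. A minimal sketch of the arithmetic, assuming the decoder operates on 30 temporal patches × 25 joints (a common NTU-style layout; the exact token grid is an assumption, but it reproduces the counts quoted above):

```python
# Counting prediction targets in the hierarchical pyramid (sketch).
T, V = 30, 25                        # temporal patches x joints (assumed layout)
low_level_targets = T * V            # per-patch coordinate targets

kernels = [5, 10, 30]                # temporal pooling windows t1, t2, t3
# Each pooling level keeps (T // k) temporal slots per joint; +1 is the global token.
high_level_targets = sum((T // k) * V for k in kernels) + 1

print(low_level_targets, high_level_targets)  # → 750 251
```

The decoder FLOPs drop (17.70G → 1.57G) is then roughly what one expects from shortening the Transformer decoding sequence by ~3× and cascading shallower stages.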
- Target Generation Network (TGN): A lightweight multi-MLP structure that provides online supervision at each hierarchical level.
- Each level employs an independent MLP: a 3-layer MLP (512 hidden units) for local semantic extraction and a 3-layer MLP (2048 hidden units) for global semantic extraction.
- TGN takes motion features (inter-frame differences) as input, \(\Delta X_e = X_e[1:] - X_e[:-1]\), avoiding the bias introduced by sharing the same input as the encoder.
- Computational cost is only 0.64G FLOPs, compared to 28.32G for S-JEPA's EMA encoder.
- Bidirectional learning: the decoder predicts TGN-generated targets, while TGN simultaneously adapts to the decoder's feature representations.
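A minimal numpy sketch of the TGN idea: independent 3-layer MLPs per level (512 hidden for local, 2048 for global, as stated above) fed with frame differences. Input/output dimensions, activations, and pooling choices here are assumptions for illustration, not the paper's exact configuration:

```python
import numpy as np

def mlp(dims, rng):
    """Build a simple ReLU MLP with random weights; returns a forward function."""
    Ws = [rng.standard_normal((a, b)) * 0.02 for a, b in zip(dims[:-1], dims[1:])]
    def forward(x):
        for i, W in enumerate(Ws):
            x = x @ W                       # applied along the last (channel) axis
            if i < len(Ws) - 1:
                x = np.maximum(x, 0.0)      # ReLU on hidden layers (assumption)
        return x
    return forward

rng = np.random.default_rng(0)
C = 256                                      # encoder embedding dim (from the notes)
local_tgn = mlp([C, 512, 512, C], rng)       # 3-layer local semantic extractor
global_tgn = mlp([C, 2048, 2048, C], rng)    # 3-layer global semantic extractor

x = rng.standard_normal((8, 30, C))          # toy joint features (B, T, C)
motion = x[:, 1:] - x[:, :-1]                # frame differences as TGN input
local_targets = local_tgn(motion)            # per-step local targets (B, T-1, C)
global_targets = global_tgn(motion.mean(axis=1))  # pooled global target (B, C)
```

Because these MLPs touch only channel dimensions, their cost stays far below running a full EMA encoder over the sequence, which is the source of the 0.64G vs. 28.32G FLOPs gap.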
- Information Maximization Constraint: Prevents the bidirectional learning from collapsing to a trivial solution.
- Variance regularization: Ensures that the batch standard deviation of each feature dimension exceeds a threshold \(\gamma=1\): \(\mathcal{L}_{var} = \frac{1}{C_t}\sum_{i=1}^{C_t}\max(0, \gamma - \sqrt{\text{Var}(Z_{t_g}[:,i])})\)
- Covariance regularization: Drives the covariance between feature dimensions toward zero, eliminating redundancy: \(\mathcal{L}_{cov} = \frac{1}{C_t}\sum_{i \neq j}[\text{Cov}(Z_{t_g})]^2_{i,j}\)
- Inspired by VICReg's information-maximizing representation learning.
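The two regularizers can be sketched directly from the formulas above (VICReg-style); the epsilon inside the square root and the toy batch shapes are assumptions:

```python
import numpy as np

def var_cov_losses(Z, gamma=1.0, eps=1e-4):
    """Z: (N, C) batch of features for one hierarchy level."""
    N, C = Z.shape
    # Variance term: hinge pushing each dimension's std above gamma.
    std = np.sqrt(Z.var(axis=0) + eps)
    l_var = np.mean(np.maximum(0.0, gamma - std))
    # Covariance term: mean of squared off-diagonal covariance entries.
    Zc = Z - Z.mean(axis=0)
    cov = (Zc.T @ Zc) / (N - 1)
    off_diag = cov - np.diag(np.diag(cov))
    l_cov = (off_diag ** 2).sum() / C
    return l_var, l_cov

Z_collapsed = np.ones((64, 16))   # degenerate features: every sample identical
Z_healthy = np.random.default_rng(0).standard_normal((64, 16))
print(var_cov_losses(Z_collapsed), var_cov_losses(Z_healthy))
```

A collapsed batch triggers a large variance penalty (near \(\gamma\)) and zero covariance penalty, while well-spread features incur almost none of either, which is exactly the behavior needed to keep the decoder–TGN loop from converging to a constant.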
Loss & Training¶
- Total loss: \(\mathcal{L}_{total} = \lambda \mathcal{L}_{pred} + \mathcal{L}_{reg}\)
- Regularization loss: \(\mathcal{L}_{reg} = \sum_{j \in \mathcal{J}}(\alpha \mathcal{L}_{cov}(Z_{t_j}) + \beta \mathcal{L}_{var}(Z_{t_j}))\)
- Hyperparameters: \(\lambda=5,\ \alpha=5,\ \beta=1\)
- Pre-training for 400 epochs; AdamW (\(\beta_1=0.9,\ \beta_2=0.95\), weight decay 0.05)
- Learning rate: cosine annealing after 20 warmup epochs (\(1\text{e-}3 \to 5\text{e-}4\))
- Encoder: 8 layers, 8-head attention with 256-dim embeddings and 1024-dim FFN
- Motion-aware masking strategy with segment length \(l=4\)
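The schedule above can be sketched as warmup-then-cosine; treating 5e-4 as the cosine floor (rather than a restart value) is an assumption about the notes' shorthand:

```python
import math

def lr_at(epoch, total=400, warmup=20, peak=1e-3, floor=5e-4):
    """Linear warmup to `peak` over `warmup` epochs, then cosine decay to `floor`."""
    if epoch < warmup:
        return peak * (epoch + 1) / warmup
    t = (epoch - warmup) / (total - warmup)   # progress through the cosine phase
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * t))

print(lr_at(0), lr_at(19), lr_at(399))
```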
Key Experimental Results¶
Main Results (Tables)¶
NTU-60 Skeleton Action Recognition (comparison with MAE-based methods, single RTX 4090):
| Method | Target Type | Enc FLOPs | Dec FLOPs | TGN FLOPs | Training Time | Speedup | x-sub | x-view |
|---|---|---|---|---|---|---|---|---|
| SkeletonMAE | Joint | 1.97G | 17.70G | - | 20h27m | 1× | 74.8 | 77.7 |
| MAMP | Motion | 1.97G | 17.70G | - | 20h27m | 1× | 84.9 | 89.1 |
| S-JEPA | Patch-level | 1.97G | 17.70G | 28.32G | 90h57m | 0.2× | 85.3 | 89.8 |
| GFP | Hierarchical | 1.97G | 1.57G | 0.64G | 3h14m | 6.2× | 85.9 | 92.0 |
NTU-120 + PKU-MMD II:
| Method | NTU-120 x-sub | NTU-120 x-setup | PKU-II x-sub |
|---|---|---|---|
| MAMP | 78.6 | 79.1 | 53.8 |
| S-JEPA | 79.6 | 79.9 | 53.5 |
| GFP | 79.1 | 80.3 | 56.2 |
Ablation Study (Tables)¶
TGN Input Ablation (NTU-60):
| TGN Input | x-sub | x-view |
|---|---|---|
| Joint coordinates | 85.0 | 90.9 |
| Masked joints | 84.2 | 90.3 |
| Motion features (frame differences) | 85.9 | 92.0 |
Semi-Supervised Learning (NTU-60, 1%/10% labels):
| Method | x-sub 1% | x-sub 10% | x-view 1% | x-view 10% |
|---|---|---|---|---|
| HaLP | 46.6 | 72.6 | 48.7 | 77.1 |
| USDRL | 57.3 | 80.2 | 60.7 | 84.0 |
| MAMP | 66.0 | 88.0 | 68.7 | 91.5 |
| S-JEPA | 67.5 | 88.4 | 69.1 | 91.4 |
| GFP | 71.8 | 88.7 | 72.9 | 92.1 |
Key Findings¶
- Significant training efficiency gains: GFP completes pre-training in only 3h14m, 6.2× faster than SkeletonMAE and 28× faster than S-JEPA.
- High-level semantic targets outperform low-level reconstruction targets: removing TGN and directly predicting low-level targets (e.g., flattened vectors) leads to a notable performance drop.
- Complementarity of hierarchical features: removing either the global or the local objectives degrades performance (Figure 3).
- Motion features (frame differences) are optimal as TGN input; masked inputs are detrimental.
- Action retrieval improvements are particularly pronounced: GFP outperforms MAMP by 17.1% on x-view (87.1 vs. 70.0), validating the value of high-level semantic targets for global representation learning.
- TGN architecture is robust: variants with 2/3/4 MLP layers show negligible performance differences.
Highlights & Insights¶
- Precise problem diagnosis: the paper identifies computational redundancy and semantic insufficiency of low-level reconstruction as the current bottleneck, and addresses both issues directly and effectively.
- The hierarchical target design (5-frame → 10-frame → 30-frame → global) elegantly balances local motion details and global action understanding.
- Incorporating VICReg's information maximization constraint cleanly resolves the collapse problem in joint training.
- The lightweight TGN (0.64G FLOPs, 3-layer MLP) is far more efficient than S-JEPA's EMA encoder (28.32G FLOPs).
Limitations & Future Work¶
- The hierarchical granularity (5/10/30 frames) is manually specified; adaptive determination strategies remain unexplored.
- Validation is limited to skeleton data and has not been extended to masked modeling on RGB video.
- The computational overhead of covariance regularization under large batch sizes warrants attention.
- Comparisons with methods that jointly combine contrastive learning and masked modeling are absent.
Related Work & Insights¶
- This work demonstrates the principle that "target quality matters more than quantity" in masked modeling.
- The paradigm shift from low-level reconstruction to high-level semantic prediction is generalizable to masked modeling in other modalities.
- The lightweight online target generation scheme of TGN offers a viable alternative to large EMA-updated encoders.
Rating¶
- Novelty: ⭐⭐⭐⭐ The idea of replacing low-level reconstruction with high-level semantic prediction is clear and effective, though the core components (pyramid decoding, VICReg) are adapted from prior work.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three datasets, multiple tasks (recognition, retrieval, semi-supervised learning), and comprehensive ablations (target type, TGN input, architecture, projector depth).
- Writing Quality: ⭐⭐⭐⭐ Problem motivation is well articulated; efficiency comparisons are intuitive (Table 1 includes both FLOPs and training time).
- Value: ⭐⭐⭐⭐⭐ The combination of 6.2× speedup and state-of-the-art performance offers substantial practical value and establishes a new paradigm for skeleton self-supervised learning.