STaRFormer: Semi-Supervised Task-Informed Representation Learning via Dynamic Attention-Based Regional Masking

Conference: NeurIPS 2025
arXiv: 2504.10097
Code: https://star-former.github.io
Area: Self-Supervised Learning
Keywords: Time-series classification, contrastive learning, dynamic masking, non-stationarity, irregular sampling

TL;DR

STaRFormer employs Dynamic Attention-based Regional Masking (DAReM) to identify task-critical regions and perturb them via masking, coupled with intra-batch and intra-class semi-supervised contrastive learning that embeds task information into the latent representations. It consistently outperforms state-of-the-art baselines across 56 datasets spanning non-stationary, irregularly sampled, classification, anomaly detection, and regression settings.

Background & Motivation

Background: Time-series modeling methods typically assume complete, stationary, and uniformly sampled data. Self-supervised contrastive learning approaches (e.g., TS2Vec, TimesURL) are decoupled from downstream tasks.

Limitations of Prior Work: Real-world sensor data frequently exhibit non-stationarity and irregular sampling (e.g., UWB ranging: 79% non-stationary); pretrained contrastive learning methods are insufficiently coupled with downstream tasks.

Key Challenge: Contrastive learning requires effective augmentation strategies, yet conventional random augmentations disregard task relevance. Masking task-critical regions is necessary to compel the model to learn robust representations.

Goal: Design a framework that couples representation learning with downstream tasks while handling non-stationarity and irregular sampling.

Core Idea: DAReM dynamically identifies task-critical regions via attention → masking these regions perturbs their statistical properties and forces reconstruction → semi-supervised (intra-batch + intra-class) contrastive learning couples the representations to the downstream task = task-aware, robust time-series representations.

Method

Overall Architecture

A Siamese architecture is adopted: one branch processes the original sequence for the downstream task, while the other processes the masked sequence for reconstruction. Both branches share encoder parameters and produce unmasked/masked latent representations, which are aligned via a semi-supervised contrastive loss; a minimal sketch of the forward pass follows.
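
A minimal PyTorch-style sketch of the dual-branch forward pass; `encoder` and `head` are generic placeholders, not the paper's exact modules:

```python
import torch
import torch.nn as nn

class SiameseSTaR(nn.Module):
    """Sketch of the shared-weight dual-branch forward pass."""

    def __init__(self, encoder: nn.Module, head: nn.Module):
        super().__init__()
        self.encoder = encoder  # single instance => both branches share parameters
        self.head = head        # downstream task head (classification/regression)

    def forward(self, x: torch.Tensor, x_masked: torch.Tensor):
        z = self.encoder(x)                # unmasked latent representation
        z_masked = self.encoder(x_masked)  # masked latent, same parameters
        y_hat = self.head(z)               # downstream prediction (unmasked branch)
        return y_hat, z, z_masked          # (z, z_masked) feed the contrastive loss
```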

Key Designs

  1. DAReM:

    • Function: Dynamically identifies and masks task-critical regions.
    • Mechanism: Collects attention weights → attention rollout → computes global importance scores → applies regional masking to high-importance regions (see the sketch after this list).
    • Design Motivation: Masking critical regions prevents the model from relying on any single feature, thereby improving robustness.
  2. Semi-Supervised Contrastive Learning:

    • Function: Integrates self-supervised (intra-batch) and supervised (intra-class) contrastive objectives.
    • Mechanism: Intra-batch — masked and unmasked representations of the same sequence form positive pairs; intra-class — sequences of the same class also form positive pairs (see the loss sketch under Loss & Training below).
    • Design Motivation: The semi-supervised formulation strikes a balance between purely self-supervised and purely supervised contrastive learning.
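
Below is a hedged PyTorch sketch of the DAReM mechanism, assuming standard attention rollout (Abnar & Zuidema, 2020). The region-boundary rule (\(\gamma\)) is simplified to a top-k selection here, so this illustrates the pipeline rather than the exact implementation:

```python
import torch

def attention_rollout(attn_layers):
    """attn_layers: list of per-layer attention maps, each [B, H, T, T].
    Averages heads, adds the residual path, and multiplies across layers
    (attention rollout, Abnar & Zuidema, 2020)."""
    rollout = None
    for attn in attn_layers:
        a = attn.mean(dim=1)                            # average heads -> [B, T, T]
        a = a + torch.eye(a.size(-1), device=a.device)  # account for residual connections
        a = a / a.sum(dim=-1, keepdim=True)             # re-normalize rows
        rollout = a if rollout is None else a @ rollout # propagate through layers
    return rollout                                      # [B, T, T]

def darem_mask(attn_layers, phi=0.3, zeta=0.5):
    """Hedged DAReM-style masking sketch: score timesteps by the attention they
    receive under rollout, then mark the highest-scoring positions for masking.
    phi (max masking ratio) and zeta (score threshold) follow the paper's
    hyperparameter names; the region-boundary rule (gamma) is omitted here."""
    scores = attention_rollout(attn_layers).mean(dim=1)  # attention received per step [B, T]
    lo = scores.min(dim=-1, keepdim=True).values
    hi = scores.max(dim=-1, keepdim=True).values
    scores = (scores - lo) / (hi - lo + 1e-8)            # min-max normalize to [0, 1]
    B, T = scores.shape
    k = max(int(phi * T), 1)                             # cap the number of masked steps
    top_idx = scores.topk(k, dim=-1).indices
    mask = torch.zeros(B, T, dtype=torch.bool, device=scores.device)
    mask.scatter_(1, top_idx, True)                      # candidate positions
    return mask & (scores > zeta)                        # True = timestep to mask
```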

Loss & Training

  • \(\mathfrak{L} = \mathfrak{L}_{Task} + \lambda_{CL}\left[\lambda_{fuse}\,\mathfrak{L}_{bw} + (1-\lambda_{fuse})\,\mathfrak{L}_{cw}\right]\), where \(\mathfrak{L}_{bw}\) is the batch-wise (intra-batch) contrastive term, \(\mathfrak{L}_{cw}\) the class-wise (intra-class) term, \(\lambda_{CL}\) weights the overall contrastive contribution, and \(\lambda_{fuse}\) balances the two terms.
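
A minimal sketch of how these terms could be composed, assuming an NT-Xent-style intra-batch loss and a SupCon-style (Khosla et al., 2020) intra-class loss; the \(\lambda\) defaults below are illustrative placeholders, while \(\tau = 0.1\) matches the setting reported under Supplementary Technical Details:

```python
import torch
import torch.nn.functional as F

def intra_batch_loss(z, z_masked, tau=0.1):
    """Batch-wise term (L_bw): the masked and unmasked views of the same
    sequence are positives; other sequences in the batch are negatives.
    NT-Xent-style sketch, not necessarily the paper's exact formulation."""
    z = F.normalize(z, dim=-1)
    z_masked = F.normalize(z_masked, dim=-1)
    logits = z @ z_masked.t() / tau                  # [B, B] scaled cosine similarities
    targets = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, targets)          # diagonal entries are the positives

def intra_class_loss(z, z_masked, labels, tau=0.1):
    """Class-wise term (L_cw): representations sharing a label are positives,
    in the spirit of supervised contrastive learning (Khosla et al., 2020)."""
    feats = F.normalize(torch.cat([z, z_masked]), dim=-1)
    labels = torch.cat([labels, labels])
    sim = feats @ feats.t() / tau
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=sim.device)
    pos = (labels[:, None] == labels[None, :]).float()
    pos.fill_diagonal_(0)                            # exclude self-pairs from positives
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")),
                                     dim=1, keepdim=True)
    return -((pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)).mean()

def total_loss(task_loss, z, z_masked, labels, lambda_cl=1.0, lambda_fuse=0.5, tau=0.1):
    """L = L_Task + lambda_CL * [lambda_fuse * L_bw + (1 - lambda_fuse) * L_cw]."""
    l_bw = intra_batch_loss(z, z_masked, tau)
    l_cw = intra_class_loss(z, z_masked, labels, tau)
    return task_loss + lambda_cl * (lambda_fuse * l_bw + (1 - lambda_fuse) * l_cw)
```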

Key Experimental Results

Main Results (56 Datasets)

| Data Type | Task | STaRFormer | Best Baseline | Gain |
|---|---|---|---|---|
| Non-stationary + spatiotemporal (DKT) | Classification | 0.852 | 0.849 (Transformer) | +0.3% |
| Non-stationary + spatiotemporal (GeoLife) | Classification | 0.932 | 0.913 (ST-GRU) | +2.1% |
| Irregular sampling (P19/P12/PAM) | Classification | AUROC/Acc improved | multiple baselines | significant |
| 30 UEA datasets | Classification | best on 14/30 | TARNet et al. | overall best |

Ablation Study

| Configuration | Effect | Note |
|---|---|---|
| w/o DAReM | Degraded | Masking task-critical regions is more effective than random masking |
| w/o intra-class contrastive | Degraded | The supervised signal is important for representation quality |
| w/o intra-batch contrastive | Degraded | The self-supervised objective provides additional regularization |

Key Findings

  • DAReM is particularly effective on non-stationary data.
  • Semi-supervised > purely self-supervised > purely supervised.
  • Generality demonstrated across 56 datasets.

Highlights & Insights

  • Masking task-critical regions is substantially more informative than masking random regions.
  • Task-coupled contrastive learning more directly benefits downstream tasks.

Limitations & Future Work

  • DAReM hyperparameters require tuning.
  • Validation is conducted primarily on Transformer backbones.

Comparison with Related Methods

  • vs. TARNet: TARNet performs point masking with reconstruction; STaRFormer generalizes this to regional masking combined with semi-supervised contrastive learning.
  • vs. TS2Vec/TimesURL: These methods learn representations through decoupled, purely self-supervised pretraining; STaRFormer couples representation learning with the downstream task.

Rating

  • Novelty: ⭐⭐⭐⭐ The combination of DAReM and semi-supervised contrastive learning is novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ 56 datasets × 3 task types.
  • Writing Quality: ⭐⭐⭐⭐ The methodology is clearly articulated.
  • Value: ⭐⭐⭐⭐ Provides important reference for time-series representation learning.

Supplementary Technical Details

  • DAReM involves three hyperparameters: \(\varphi\) (maximum masking ratio), \(\zeta\) (attention score threshold), and \(\gamma\) (regional boundary).
  • The contrastive learning temperature \(\tau\) is set to 0.1 across all experiments.
  • Validation on the BMW industrial dataset (Digital Key Trajectories) demonstrates practical deployment value.
  • For regression tasks, k-means clustering is used to generate pseudo-labels, where \(k\) is a hyperparameter requiring optimization (a minimal sketch follows this list).
  • For anomaly detection tasks, element-wise contrastive learning is employed, incorporating both intra-class and inter-class positive pair combinations.
  • Experiments in large-scale federated learning settings have been conducted in a parallel work [forstenhausler_leveraging_2025].
  • Achieving the best performance on 14 out of 30 datasets in the UEA multivariate benchmark demonstrates the generality of the framework.
  • The method is particularly effective on irregularly sampled medical datasets (P19/P12/PAM).
  • Comparisons with ViTST indicate that modeling time series directly is more effective than converting them to images for ViT-based processing.
  • State-of-the-art F1 scores are achieved on several of the five anomaly detection datasets.
  • Regression tasks are validated in industrial equipment predictive maintenance scenarios.
  • Because the Siamese branches share parameters, the extra masked-branch forward pass adds only approximately 30% computational overhead.
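
The regression pseudo-labeling step could look like the following sketch. Clustering the continuous targets (rather than latent representations) is an assumption made here for illustration; the summary only states that k-means is used and that \(k\) must be tuned:

```python
import numpy as np
from sklearn.cluster import KMeans

def regression_pseudo_labels(y: np.ndarray, k: int = 8, seed: int = 0) -> np.ndarray:
    """Cluster continuous regression targets into k discrete pseudo-classes so the
    intra-class contrastive term has labels to work with. Clustering the targets
    is an assumption of this sketch; k is the hyperparameter to optimize."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return km.fit_predict(y.reshape(-1, 1))  # one pseudo-label per sequence

# Usage (hypothetical): labels = regression_pseudo_labels(train_targets, k=8)
```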