TSPulse: Tiny Pre-Trained Models with Disentangled Representations for Rapid Time Series¶
Conference: ICLR 2026 arXiv: 2505.13033 Code: https://huggingface.co/ibm-granite/granite-timeseries-tspulse-r1 Area: Time Series Keywords: Time Series Pre-trained Model, Disentangled Representations, Dual-Space Reconstruction, Anomaly Detection, Tiny Model
TL;DR¶
This paper proposes TSPulse, an ultra-lightweight time series pre-trained model with only 1M parameters, which surpasses models 10–100× larger on four tasks — classification (+5–16%), anomaly detection (+20%), imputation (+50%), and similarity retrieval (+25%) — through dual-space masked reconstruction and dual-embedding disentanglement.
Background & Motivation¶
Time series analysis encompasses diverse downstream tasks including forecasting, anomaly detection, imputation, classification, and retrieval. Inspired by successes in NLP and CV, the time series community has increasingly explored large-scale pre-trained models:
Task-specific models: TimesFM, Chronos, and Moirai focus on forecasting tasks.
General-purpose models: Moment and UniTS extend to classification, anomaly detection, and imputation.
Cross-domain models: Time-LLM and GPT4TS attempt to adapt LLMs to time series.
Core problem: Existing pre-trained models have massive parameter counts (hundreds of millions to billions), leading to prohibitively high deployment and fine-tuning costs. TTM demonstrates that compact models with 1–5M parameters can achieve competitive performance on forecasting, but this advantage is limited to that task.
Research gap: Can a ~1M-parameter pre-trained model be constructed that simultaneously achieves state-of-the-art performance across multiple non-forecasting diagnostic tasks?
Method¶
Overall Architecture¶
TSPulse is built on the lightweight TSMixer architecture. The core pipeline is:
- Input \(\mathbf{X} \in \mathbb{R}^{S \times C}\) → masking → dual-space encoding (time domain + frequency domain) → TSMixer backbone → mini decoder → multi-objective output heads
Key Designs¶
- **Dual-Space Masked Reconstruction** (sketch below)
- Reconstruction of masked inputs is performed simultaneously in both the time domain and frequency domain.
- Core intuition: certain patterns are more readily detected in the time domain (e.g., spikes), while others are more prominent in the frequency domain (e.g., periodicity).
- The time-domain masked input \(\mathbf{X}_m\) is transformed via FFT to obtain the frequency-domain representation \(\mathbf{X}^f_m\).
- Key design: the frequency domain is not explicitly masked; instead, the time-domain masked signal is directly fed into FFT, naturally propagating masking into the frequency domain.
- After encoding, representations are concatenated: \(\mathbf{Input}_E = [\mathbf{Time}_E; \mathbf{FFT}_E; \mathbf{Reg}_E] \in \mathbb{R}^{C \times K \times D}\)
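A minimal PyTorch sketch of this encoding stage (masked input → time/frequency patch embeddings → concatenated tokens). The values of `seq_len`, `patch_len`, `d_model`, `n_registers` and the use of a log-amplitude spectrum are illustrative assumptions, and the TSMixer backbone and mini decoder are omitted; this is not the released implementation.

```python
import torch
import torch.nn as nn

class DualSpaceEncoder(nn.Module):
    """Sketch: masked series -> time-domain + frequency-domain patch tokens + register tokens."""

    def __init__(self, seq_len=512, patch_len=8, d_model=64, n_registers=4):
        super().__init__()
        assert seq_len % patch_len == 0
        self.patch_len = patch_len
        self.time_embed = nn.Linear(patch_len, d_model)    # Time_E: per-patch time-domain embedding
        self.freq_embed = nn.Linear(patch_len, d_model)    # FFT_E: per-patch frequency-domain embedding
        self.registers = nn.Parameter(torch.zeros(n_registers, d_model))  # Reg_E: register tokens

    def forward(self, x_masked):
        # x_masked: (B, C, S) with masking already applied in the time domain.
        b, c, s = x_masked.shape
        t_patches = x_masked.reshape(b, c, -1, self.patch_len)            # (B, C, N, pl)

        # No explicit frequency masking: the masked signal itself goes through the FFT,
        # so the masking propagates into the spectrum automatically.
        spectrum = torch.fft.fft(x_masked, dim=-1).abs().log1p()          # log-amplitude view (illustrative)
        f_patches = spectrum.reshape(b, c, -1, self.patch_len)            # (B, C, N, pl)

        time_e = self.time_embed(t_patches)                               # (B, C, N, D)
        fft_e = self.freq_embed(f_patches)                                # (B, C, N, D)
        reg_e = self.registers.expand(b, c, -1, -1)                       # (B, C, R, D)

        # Input_E = [Time_E; FFT_E; Reg_E] with K = 2N + R tokens per channel,
        # which the TSMixer backbone and mini decoder consume downstream.
        return torch.cat([time_e, fft_e, reg_e], dim=2)                   # (B, C, K, D)


tokens = DualSpaceEncoder()(torch.randn(2, 3, 512))   # torch.Size([2, 3, 132, 64]); K = 2*64 + 4
```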
- **Dual-Embedding Disentanglement** (sketch below)
- Detailed embeddings (first \(2N\) patch embeddings): used for full-signal reconstruction, capturing fine-grained time-domain and frequency-domain patterns.
- Semantic embeddings (last \(R\) register embeddings): used for high-level semantic reconstruction, encoding global features.
- Semantic embeddings are supervised through two tasks:
- Frequency signature prediction: \(\mathcal{L}_{prob} = \text{CE}(\mathbf{X}^f_{prob}, \mathbf{Y}^f_{prob})\) (softmax distribution over log-amplitude spectra)
- Short-term forecasting: \(\mathcal{L}_{future} = \text{MSE}(\mathbf{X}_{future}, \mathbf{Y}_{future})\)
- Design motivation: different downstream tasks require different levels of information — classification benefits from semantic embeddings, while imputation relies on detailed embeddings.
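A sketch of how the disentanglement could be wired: split the backbone output into detailed and register tokens, then supervise a summary of the register tokens with the two auxiliary objectives. The mean pooling, head shapes, and forecast horizon are assumptions for illustration, not the paper's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def semantic_losses(backbone_out, x_unmasked, x_future, n_patches, prob_head, future_head):
    """backbone_out: (B, C, K, D) with K = 2N + R; the last R tokens are the register embeddings."""
    detailed = backbone_out[:, :, : 2 * n_patches]          # fine-grained tokens -> reconstruction heads
    semantic = backbone_out[:, :, 2 * n_patches :]          # register tokens -> global / semantic heads
    pooled = semantic.mean(dim=2)                           # (B, C, D) summary of the register tokens

    # Frequency-signature prediction: cross-entropy against a softmax distribution
    # over the log-amplitude spectrum of the unmasked input (soft targets).
    target_prob = F.softmax(torch.fft.rfft(x_unmasked, dim=-1).abs().log1p(), dim=-1)   # Y^f_prob
    pred_logits = prob_head(pooled)                         # X^f_prob logits, (B, C, S//2 + 1)
    loss_prob = F.cross_entropy(pred_logits.flatten(0, 1), target_prob.flatten(0, 1))

    # Short-term forecasting from the same semantic summary.
    loss_future = F.mse_loss(future_head(pooled), x_future)
    return detailed, loss_prob, loss_future


# Toy shapes: B=2, C=3, S=512, N=64, D=64, horizon H=16.
B, C, S, N, D, H = 2, 3, 512, 64, 64, 16
detailed, l_prob, l_future = semantic_losses(
    torch.randn(B, C, 2 * N + 4, D), torch.randn(B, C, S), torch.randn(B, C, H),
    n_patches=N, prob_head=nn.Linear(D, S // 2 + 1), future_head=nn.Linear(D, H))
```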
- **TSLens (Classification Fine-Tuning Component)** (sketch below)
- A learned mechanism that replaces standard pooling and adaptively extracts relevant features from the dual embeddings.
- Pipeline: mini decoder (initialized from pre-trained weights + channel mixing) → dimensionality-reduction projection → flatten → linear classification head.
- Dynamically focuses on the most informative features in both local and global representations.
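A sketch of the TSLens idea, under the assumption that it can be approximated by a learned per-token relevance score followed by projection, flatten, and a linear head in place of avg/max pooling; the exact mechanism in the paper may differ, and all shapes here are illustrative.

```python
import torch
import torch.nn as nn

class TSLensHead(nn.Module):
    """Sketch of a learned pooling replacement: score tokens, re-weight, project, flatten, classify."""

    def __init__(self, n_tokens, n_channels, d_model=64, d_proj=16, n_classes=5):
        super().__init__()
        self.score = nn.Linear(d_model, 1)                   # learned relevance score per token
        self.proj = nn.Linear(d_model, d_proj)               # dimensionality-reduction projection
        self.classifier = nn.Linear(n_channels * n_tokens * d_proj, n_classes)

    def forward(self, tokens):
        # tokens: (B, C, K, D) from the fine-tuned mini decoder (detailed + register embeddings).
        weights = torch.softmax(self.score(tokens), dim=2)   # (B, C, K, 1): focus on informative tokens
        z = self.proj(tokens * weights)                      # re-weight, then reduce dimensionality
        return self.classifier(z.flatten(1))                 # flatten -> linear classification head


logits = TSLensHead(n_tokens=132, n_channels=3)(torch.randn(2, 3, 132, 64))   # -> (2, 5)
```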
- **Multi-Head Triangulation for Anomaly Detection** (sketch below)
- Three prediction heads detect anomalies from complementary perspectives:
- \(\text{Head}_{time}\): time-domain reconstruction deviation → detects spike anomalies
- \(\text{Head}_{fft}\): frequency-domain reconstruction deviation → detects periodic anomalies
- \(\text{Head}_{future}\): short-term prediction deviation → detects trend anomalies
- Fusion strategies: \(\text{Head}_{ensemble}\) (max-value fusion) or \(\text{Head}_{triang.}\) (selecting the best head based on a small validation set).
- This represents the first pre-trained model to unify multi-space outputs for triangulation within a single lightweight framework.
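A NumPy sketch of the scoring and fusion logic, assuming per-point reconstruction/forecast errors serve as anomaly scores; the normalization, toy labels, and metric are illustrative stand-ins for the paper's procedure.

```python
import numpy as np

def anomaly_scores(x, recon_time, recon_from_fft, forecast):
    """Per-point deviations from the three heads (simplified to squared error in the time domain)."""
    return {
        "time":   (x - recon_time) ** 2,       # Head_time: time-domain reconstruction error
        "fft":    (x - recon_from_fft) ** 2,   # Head_fft: error of the reconstruction mapped back from frequency space
        "future": (x - forecast) ** 2,         # Head_future: short-horizon prediction error
    }

def fuse_max(scores):
    """Head_ensemble: element-wise max over normalized head scores."""
    normed = [(s - s.mean()) / (s.std() + 1e-8) for s in scores.values()]
    return np.maximum.reduce(normed)

def triangulate(scores, val_labels, metric):
    """Head_triang.: keep the single head that scores best on a small labeled validation set."""
    best = max(scores, key=lambda k: metric(val_labels, scores[k]))
    return best, scores[best]


# Toy usage: a spike anomaly that the time head catches most clearly.
rng = np.random.default_rng(0)
x = rng.normal(size=1000); x[500] += 8.0
scores = anomaly_scores(x, np.full_like(x, x.mean()), x * 0.9, np.roll(x, 1))
fused = fuse_max(scores)                                             # peaks around index 500
best_head, _ = triangulate(scores, val_labels=np.abs(x) > 4, metric=lambda y, s: s[y].mean())
```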
- **Hybrid Masking Strategy** (sketch below)
- Conventional block masking is ill-suited for real-world imputation, where missing values are irregular.
- Hybrid strategy: simultaneously masks complete patches and partial point-level positions.
- Key design: the mask token \(\mathbf{M} \in \mathbb{R}^{1 \times pl}\) is defined at the original patch level (not in embedding space), enabling flexible partial masking.
- Ablations show that removing hybrid pre-training causes a 79% performance drop under hybrid-masking evaluation.
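A sketch of generating and applying a hybrid mask with a patch-level mask token \(\mathbf{M} \in \mathbb{R}^{1 \times pl}\); the mask ratios and the zero-valued token are illustrative assumptions (in practice the token would be learned).

```python
import torch

def hybrid_mask(x, patch_len=8, patch_ratio=0.3, point_ratio=0.1, mask_token=None):
    """x: (B, C, S). Mask whole patches plus scattered individual points, at the raw-signal level."""
    b, c, s = x.shape
    n_patches = s // patch_len
    if mask_token is None:
        mask_token = torch.zeros(1, patch_len)                            # M in R^{1 x pl}; learned in practice

    # Patch-level (block) masking: drop entire patches.
    patch_mask = torch.rand(b, c, n_patches) < patch_ratio                # (B, C, N)
    point_mask = patch_mask.repeat_interleave(patch_len, dim=-1)          # expand patch flags to point level

    # Point-level masking: additionally drop irregular individual positions.
    point_mask |= torch.rand(b, c, s) < point_ratio                       # (B, C, S) boolean mask

    # Substitute masked positions with values from the patch-level mask token.
    token_values = mask_token.repeat(1, n_patches).expand(b, c, s)        # tile M across the sequence
    return torch.where(point_mask, token_values, x), point_mask


x_masked, mask = hybrid_mask(torch.randn(2, 3, 512))   # full patches plus isolated points are masked
```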
- **Identity Initialization for Channel Mixing** (sketch below)
- Pre-training employs a channel-independent mode.
- Channel mixing is enabled during fine-tuning, but newly added mixing layers are initialized with identity weights.
- This prevents random initialization from causing activation discontinuities and gradient instability between pre-trained layers.
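A sketch of the identity trick: the newly added channel-mixing layer starts as an exact pass-through, so pre-trained activations and gradients are undisturbed at the start of fine-tuning. The layer placement and shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class IdentityInitChannelMixer(nn.Module):
    """Linear mixing across the channel axis, initialized to be an exact no-op."""

    def __init__(self, n_channels):
        super().__init__()
        self.mix = nn.Linear(n_channels, n_channels)
        with torch.no_grad():
            self.mix.weight.copy_(torch.eye(n_channels))    # identity weights
            self.mix.bias.zero_()                           # zero bias -> pass-through at initialization

    def forward(self, tokens):
        # tokens: (B, C, K, D); mix information across the channel axis (dim=1).
        z = self.mix(tokens.permute(0, 2, 3, 1))            # move channels last, mix them
        return z.permute(0, 3, 1, 2)                        # back to (B, C, K, D)


# At the start of fine-tuning the output equals the input, so no activation/gradient shock.
mixer = IdentityInitChannelMixer(n_channels=3)
t = torch.randn(2, 3, 132, 64)
assert torch.allclose(mixer(t), t, atol=1e-6)
```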
Loss & Training¶
Joint minimization of a multi-objective weighted loss:
- \(\mathcal{L}_{time1} = \text{MSE}(\mathbf{X}, \mathbf{Y})\): time-domain reconstruction (masked positions only)
- \(\mathcal{L}_{time2} = \text{MSE}(\mathbf{X}, \mathbf{Y}')\): time-domain reconstruction via inverse FFT from frequency space
- \(\mathcal{L}_{fft} = \text{MSE}(\mathbf{X}^f, \mathbf{Y}^f)\): frequency-domain reconstruction
- \(\mathcal{L}_{prob} = \text{CE}(\mathbf{X}^f_{prob}, \mathbf{Y}^f_{prob})\): frequency signature
- \(\mathcal{L}_{future} = \text{MSE}(\mathbf{X}_{future}, \mathbf{Y}_{future})\): short-term forecasting
Task-specialized pre-training is achieved by re-weighting the loss-head priorities (e.g., anomaly detection retains all heads, while classification emphasizes the time-domain and probability heads); a sketch of this weighting follows below.
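A sketch of the weighted objective and the task-specialized re-weighting; the weight values and dictionary layout are placeholders, not the paper's settings.

```python
import torch.nn.functional as F

# Illustrative head weights; the real values are tuned per task-specialized variant.
WEIGHTS = {
    "default":        {"time1": 1.0, "time2": 1.0, "fft": 1.0, "prob": 1.0, "future": 1.0},
    "classification": {"time1": 1.0, "time2": 0.5, "fft": 0.5, "prob": 1.0, "future": 0.5},  # emphasize time + prob
}

def tspulse_loss(out, tgt, mask, variant="default"):
    """out/tgt: dicts of head outputs/targets; mask: boolean mask over masked time positions."""
    w = WEIGHTS[variant]
    losses = {
        "time1":  F.mse_loss(out["time"][mask], tgt["time"][mask]),       # L_time1: masked positions only
        "time2":  F.mse_loss(out["time_from_fft"], tgt["time"]),          # L_time2: inverse-FFT reconstruction
        "fft":    F.mse_loss(out["fft"], tgt["fft"]),                     # L_fft: frequency-domain reconstruction
        "prob":   F.cross_entropy(out["prob_logits"], tgt["prob"]),       # L_prob: frequency signature (soft targets)
        "future": F.mse_loss(out["future"], tgt["future"]),               # L_future: short-term forecast
    }
    return sum(w[k] * v for k, v in losses.items())
```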
Pre-training covers ~1B time series samples and requires only one day on 8×A100 GPUs.
Key Experimental Results¶
Anomaly Detection (TSB-AD Leaderboard, Figure 4)¶
| Method | Univariate VUS-PR | Multivariate VUS-PR |
|---|---|---|
| Sub-PCA (Prev. SOTA) | 0.42 | - |
| CNN (Prev. SOTA) | - | 0.31* |
| MOMENT (ZS) | 0.38 | - |
| TSPulse (ZS) | 0.48 (+14%) | 0.36 (+16%) |
| TSPulse (FT) | 0.52 (+24%) | 0.36 (+26%*) |
*TSPulse simultaneously ranks first on both univariate and multivariate TSB-AD leaderboards.
Classification (UEA 29 Datasets, Figure 5)¶
| Method | Parameters | Mean Accuracy |
|---|---|---|
| VQShape | ~37M | 0.701 |
| MOMENT | ~110M | 0.675 |
| UniTS | ~10M | 0.634 |
| TSPulse | ~1M | 0.733 (+5–16%) |
Imputation (6 LTSF Benchmarks, Figure 6 — Hybrid Masking)¶
| Method | Setting | Mean MSE↓ |
|---|---|---|
| MOMENT | ZS | 0.276 |
| UniTS (PMT) | PMT | 0.170 |
| TSPulse | ZS | 0.074 (+56–73%) |
| TimesNet | FT | 0.080 |
| TSPulse | FT | 0.039 (+49–51%) |
Ablation Study (Table 1)¶
Classification Ablation:
| Variant | Accuracy | Drop |
|---|---|---|
| TSPulse (Full) | 0.747 | - |
| w/o Short Embedding | 0.689 | -8% |
| w/o Long Embedding | 0.681 | -10% |
| w/o Masking | 0.691 | -8% |
| w/o CM Identity Init | 0.685 | -9% |
| w/o TSLens (Avg-Pool) | 0.675 | -11% |
| w/o TSLens (Max-Pool) | 0.645 | -16% |
| w/o Dual-space | 0.696 | -7% |
Efficiency Comparison (Table 23)¶
| Model | Params (M) | GPU Inference (ms) | CPU Inference (s) | Memory (GB) |
|---|---|---|---|---|
| TSPulse | 1.06 | 7.16 | 0.06 | 0.39 |
| MOMENT (small) | 35.34 (33×) | 32.57 (5×) | 2.74 (46×) | 0.56 |
| MOMENT (large) | 341.24 (322×) | 405.42 (57×) | 21.98 (366×) | 2.30 |
| Chronos (tiny) | 8.39 (8×) | 39.81 (6×) | 66.15 (1103×) | 2.91 |
Key Findings¶
- A 1M-parameter model outperforms models 10–100× larger: model size is not the sole determinant of performance; architectural design is equally critical.
- Dual-space learning is essential: removing the frequency-domain branch causes a 7% drop in classification and an 8% drop in imputation.
- Hybrid masking pre-training is the key to imputation performance: pure block masking leads to a 79% collapse under hybrid-masking evaluation.
- TSLens significantly outperforms standard pooling: drops of -11% (avg-pool) and -16% (max-pool) validate the value of learned attention.
- Semantic embeddings from register tokens are robust to distortion: insensitive to noise, amplitude variation, and temporal shift, while remaining sensitive to frequency and shape.
Highlights & Insights¶
- "Small but mighty" philosophy: 1M parameters suffices — the key lies in elegant architectural design (dual-space, dual-embedding, multi-head triangulation).
- Value of disentangled representations: the separation of fine-grained embeddings from semantic embeddings allows different tasks to select the most suitable representations.
- Elegance of multi-head triangulation: different reconstruction heads are naturally suited to detecting different anomaly types; their fusion outperforms any single perspective.
- Zero-shot surpasses trained models: TSPulse's zero-shot anomaly detection outperforms all models trained on the target data.
- CPU-friendly: a 0.06-second CPU inference time enables GPU-free deployment.
- IBM Granite series: open-sourced on HuggingFace, offering strong practical utility.
Limitations & Future Work¶
- Forecasting tasks are currently not addressed, though the viability of compact models for forecasting has been established by TTM.
- Pre-training data primarily covers specific domains (energy, transportation, etc.); transferability to other domains remains to be validated.
- The two-stage design of univariate pre-training followed by multivariate fine-tuning may not be optimal.
- Incremental learning capability is absent: the model cannot continuously update without forgetting prior knowledge.
- Few-shot classification capability warrants further exploration.
- Cross-modal fusion (e.g., time series + text) is a promising direction for future work.
Related Work & Insights¶
- TTM (Tiny Time Mixers): a pioneer in compact time series pre-trained models, but limited to forecasting tasks.
- MOMENT: a general-purpose time series foundation model based on a T5-encoder architecture, with 35–341M parameters.
- Chronos: a T5-style encoder-decoder focused on forecasting, with variants ranging from roughly 8M (tiny) to 710M (large) parameters.
- UniTS: a prompt-tuned multi-task model.
- TSMixer: the backbone network of TSPulse, adopting the MLP-Mixer paradigm as an alternative to Transformers.
- Insight: compact models + task-specialized pre-training + carefully designed post-processing components constitute an efficient and powerful foundation model design paradigm.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (dual-space dual-embedding disentanglement + multi-head triangulation + hybrid masking — a combination of multiple innovations)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (75+ datasets, 4 major tasks, comprehensive ablations, efficiency analysis, embedding sensitivity analysis)
- Writing Quality: ⭐⭐⭐⭐ (thorough content, clear logic, exceptionally rich appendix)
- Value: ⭐⭐⭐⭐⭐ (1M-parameter model surpasses models 100× larger; open-sourced and deployment-friendly)