Multi-Scale Finetuning for Encoder-based Time Series Foundation Models

Conference: NeurIPS 2025
arXiv: 2506.14087
Code: https://github.com/zqiao11/MSFT
Area: Time Series / Foundation Model Fine-tuning
Keywords: Time series foundation models, multi-scale modeling, fine-tuning, causal inference, parameter-efficient fine-tuning

TL;DR

This paper proposes MSFT (Multi-Scale FineTuning). A causal analysis reveals that naive fine-tuning suffers from scale confounding; MSFT therefore builds a multi-scale modeling framework for efficiently fine-tuning encoder-based time series foundation models, significantly outperforming both naive fine-tuning and state-of-the-art deep learning methods trained from scratch.

Background & Motivation

Time series foundation models (TSFMs) demonstrate strong zero-shot forecasting performance, yet how to efficiently fine-tune them for downstream tasks remains an underexplored problem. Existing fine-tuning strategies (full fine-tuning, linear probing) suffer from two core issues:

The multi-scale nature of time series is neglected: Time series data exhibits different temporal patterns at different sampling scales. For example, hourly energy consumption reflects local usage patterns, while daily aggregation captures macro-level trends. Naive fine-tuning learns only at the original scale and is prone to overfitting to single-scale patterns.

The multi-scale capability of TSFMs is underutilized: TSFMs are pre-trained on multi-scale datasets and inherently possess multi-scale forecasting ability, yet naive fine-tuning restricts the model to learning at a single scale, wasting the multi-scale knowledge acquired during pre-training.

The authors analyze this from a causal inference perspective: scale \(S\) acts as a confounder that simultaneously affects the input \(X\) (temporal patterns at different scales) and the knowledge activated in the model \(M\) (scale-specific pre-trained knowledge), introducing a backdoor path \(X \leftarrow S \rightarrow M \rightarrow Y\) that induces spurious correlations. It is therefore necessary to eliminate the confounding effect via do-calculus intervention \(P(Y|do(X))\).
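
For completeness, the standard backdoor-adjustment derivation under this graph (a textbook argument, not reproduced verbatim from the paper) reads:

```latex
% Backdoor adjustment over the scale confounder S:
\begin{aligned}
P(Y \mid do(X))
  &= \sum_{s} P\big(Y \mid do(X), S = s\big)\, P\big(S = s \mid do(X)\big) \\
  &= \sum_{s} P\big(Y \mid X, S = s\big)\, P(s)
\end{aligned}
```

The second equality holds because intervening on \(X\) leaves \(P(s)\) unchanged and conditioning on \(S\) blocks the backdoor path \(X \leftarrow S \rightarrow M \rightarrow Y\). MSFT instantiates the summand with the scale-activated knowledge \(M = g(X, s)\), yielding the adjustment formula used in the Method section below.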

Method

Overall Architecture

MSFT operates under a causal intervention framework, realizing backdoor adjustment as \(P(Y|do(X)) = \sum_s P(Y|X,S=s,M=g(X,s))P(s)\). The pipeline proceeds as follows:

  1. Downsample the original time series to generate multi-scale sequences (by factors of \(2^k\)).
  2. Independently tokenize and encode each scale.
  3. Concatenate the multi-scale token sequences and capture both within-scale and cross-scale dependencies via decoupled dependency modeling.
  4. Aggregate the multi-scale predictions by weighted summation.
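
To make step 1 concrete, here is a minimal sketch of multi-scale input construction, assuming non-overlapping average pooling as the downsampling operator (the paper's fixed choice, per the limitations below); the function name and shapes are illustrative:

```python
import torch

def build_multiscale_inputs(x: torch.Tensor, num_scales: int) -> list[torch.Tensor]:
    """Return [x, x/2, x/4, ..., x/2^K]: the input plus K average-pooled copies.

    x: (batch, length). Each pass halves the temporal resolution with a
    non-overlapping window-2 mean, i.e. AvgPool with kernel = stride = 2.
    """
    scales = [x]
    for _ in range(num_scales):
        prev = scales[-1]
        b, t = prev.shape
        t = t - (t % 2)  # drop one trailing step if the current length is odd
        scales.append(prev[:, :t].reshape(b, t // 2, 2).mean(dim=-1))
    return scales
```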

Key Designs

  1. Scale-specific knowledge activation: Pre-trained parameters are frozen; independent linear adapters (input projection layers) and independent LoRA modules (attention layers) are introduced for each scale. This avoids interference caused by differing token resolutions across scales and realizes \(M=g(X,s)\) in the formulation, activating scale-specific TSFM knowledge (see the first sketch after this list).

  2. Decoupled token dependency modeling: Composed of two components (both sketched after this list):

     • In-scale attention: A mask \(\mathbf{M}_{in}\) ensures that tokens attend only to tokens within the same scale, preventing spurious attention caused by temporal index misalignment across scales.

     • Cross-scale aggregator: Bidirectional (coarse-to-fine, C2F, and fine-to-coarse, F2C) layer-wise fusion of adjacent-scale information. Tokens are mapped to a shared space via linear projections \(\phi_{i,j}^l\) and then fused according to temporal alignment: C2F uses Repeat upsampling, and F2C uses AvgPool downsampling.

  3. Multi-scale mixing output: Each scale produces an independent prediction \(\hat{Y}_i\); the training objective is the weighted loss \(\mathcal{L}_{pred} = \sum_i w_i \mathcal{L}_{pred,i}\), where the weights \(w_i\) are learned via softmax. At inference, predictions from each scale are upsampled to the original resolution and combined by weighted summation, producing an ensemble effect that mitigates overfitting.
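
A minimal sketch of scale-specific knowledge activation, using a frozen stand-in linear layer plus one independent LoRA pair per scale; the class and parameter names are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class ScaleSpecificLoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus an independent low-rank update per scale."""

    def __init__(self, dim: int, num_scales: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim)  # stands in for a pre-trained projection
        for p in self.base.parameters():
            p.requires_grad = False      # pre-trained weights stay frozen
        # One (A_s, B_s) pair per scale; B starts at zero so training begins
        # at the pre-trained function (the usual LoRA initialization).
        self.A = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(rank, dim)) for _ in range(num_scales)]
        )
        self.B = nn.ParameterList(
            [nn.Parameter(torch.zeros(dim, rank)) for _ in range(num_scales)]
        )

    def forward(self, x: torch.Tensor, scale_idx: int) -> torch.Tensor:
        # x: (batch, tokens, dim). Only this scale's factors are used,
        # i.e. scale s activates its own adapted weights, M = g(X, s).
        delta = x @ self.A[scale_idx].T @ self.B[scale_idx].T
        return self.base(x) + delta
```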
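
The in-scale attention constraint amounts to a block-diagonal mask over the concatenated multi-scale token sequence. A sketch, assuming `token_counts[i]` tokens at scale \(i\) and a boolean convention where `True` allows attention:

```python
import torch

def in_scale_mask(token_counts: list[int]) -> torch.Tensor:
    """Block-diagonal mask: token i may attend to token j only if both
    belong to the same scale. True = attention allowed."""
    total = sum(token_counts)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in token_counts:
        mask[start:start + n, start:start + n] = True
        start += n
    return mask

# Three scales with 8, 4, and 2 tokens give a 14x14 block-diagonal mask:
# in_scale_mask([8, 4, 2])
```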
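
The cross-scale aggregator's temporal alignment can be sketched as follows; the shared-space projections \(\phi_{i,j}^l\) and the fusion step itself are omitted, so this shows only the Repeat/AvgPool alignment between adjacent scales:

```python
import torch
import torch.nn.functional as F

def align_c2f(coarse: torch.Tensor, factor: int = 2) -> torch.Tensor:
    # Coarse-to-fine: Repeat each coarse token so it lines up with the
    # `factor` fine-scale tokens covering the same time span.
    return coarse.repeat_interleave(factor, dim=1)  # (b, n, d) -> (b, n*factor, d)

def align_f2c(fine: torch.Tensor, factor: int = 2) -> torch.Tensor:
    # Fine-to-coarse: AvgPool fine tokens along the token axis down to the
    # coarser resolution before fusing with the coarse-scale tokens.
    return F.avg_pool1d(fine.transpose(1, 2), kernel_size=factor).transpose(1, 2)
```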

Loss & Training

The original prediction loss of each TSFM (MSE or NLL) is used, weighted and summed according to the learned scale weights. Pre-trained parameters are frozen; only the adapters, LoRA modules, cross-scale aggregator, and scale weights are trained.
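
A sketch of this objective, assuming the per-scale losses have already been computed; the softmax-parameterized weights are the only learnable part here, and at inference the same weights combine the upsampled per-scale forecasts:

```python
import torch
import torch.nn as nn

class ScaleWeightedObjective(nn.Module):
    """L_pred = sum_i w_i * L_pred,i with w = softmax(logits) learned end-to-end."""

    def __init__(self, num_scales: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_scales))  # uniform weights at init

    def weights(self) -> torch.Tensor:
        return torch.softmax(self.logits, dim=0)

    def forward(self, per_scale_losses: torch.Tensor) -> torch.Tensor:
        # per_scale_losses: (num_scales,), each entry the TSFM's own
        # prediction loss (MSE or NLL) computed at that scale.
        return (self.weights() * per_scale_losses).sum()
```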

Key Experimental Results

Main Results (Long-term Forecasting, MSE, lower is better)

| Dataset | Metric | MSFT (Moirai-Base) | Full FT (Moirai-Base) | TimeMixer | SimpleTM | Gain |
|---|---|---|---|---|---|---|
| ETTm1 | MSE | 0.332 | 0.368 | 0.381 | 0.381 | −9.8% vs FT |
| ETTm2 | MSE | 0.247 | 0.258 | 0.275 | 0.275 | −4.3% vs FT |
| Weather | MSE | 0.213 | 0.232 | 0.240 | 0.243 | −8.2% vs FT |
| Electricity | MSE | 0.169 | 0.173 | 0.182 | 0.166 | −2.3% vs FT |

MSFT consistently outperforms naive fine-tuning, LoRA, AdaLoRA, and other parameter-efficient fine-tuning methods across three TSFM backbones (Moirai, Moment, UniTS).

Ablation Study

| Configuration | ETTm1 MSE | Weather MSE | Note |
|---|---|---|---|
| MSFT (full) | 0.332 | 0.213 | All components |
| w/o cross-scale aggregator | 0.340 | 0.220 | Cross-scale fusion is important |
| w/o scale-specific LoRA | 0.345 | 0.222 | Scale-specific knowledge activation is important |
| w/o multi-scale mixing | 0.338 | 0.218 | Weighted fusion is beneficial |
| Single scale (K=0) | 0.361 | 0.230 | Multi-scale modeling is the core |

Key Findings

  • Multi-scale modeling is the primary source of improvement; each component (scale-specific adapters, decoupled dependency modeling, weighted mixing) contributes independently.
  • MSFT surpasses not only various fine-tuning baselines but also from-scratch SOTA deep learning methods (TimeMixer, SimpleTM, etc.).
  • The framework is equally applicable to probabilistic forecasting tasks (leveraging Moirai's probabilistic forecasting capability).

Highlights & Insights

  • The causal analysis perspective provides a theoretically novel and compelling diagnosis of naive fine-tuning—scale acts as a confounder that introduces spurious correlations.
  • The framework design is clean and general, compatible with different encoder-based TSFM architectures.
  • Freezing pre-trained parameters and employing lightweight adapters yields high parameter efficiency while avoiding catastrophic forgetting.

Limitations & Future Work

  • The framework targets only encoder-based TSFMs; decoder-only (e.g., TimesFM) and encoder-decoder (e.g., Chronos) architectures are not explored.
  • The number of scales \(K\) requires manual selection; no adaptive mechanism is provided.
  • The downsampling strategy is fixed as average pooling; alternative approaches (e.g., wavelet transforms) merit exploration.

Comparison with Related Methods

  • vs TimeMixer: TimeMixer trains a multi-scale model from scratch, whereas MSFT leverages the multi-scale capability of pre-trained TSFMs for fine-tuning; different approaches sharing the same objective.
  • vs LoRA: Standard LoRA does not differentiate between scales; MSFT's scale-specific LoRA more effectively activates the corresponding pre-trained knowledge.
  • vs Scaleformer: Scaleformer refines predictions iteratively from coarse to fine, while MSFT employs bidirectional fusion for greater flexibility.

Rating

  • Novelty: ⭐⭐⭐⭐ The causal perspective on fine-tuning is novel, though multi-scale modeling itself is not entirely new.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three backbones, multiple datasets, comparisons with diverse fine-tuning methods, and comprehensive ablations.
  • Writing Quality: ⭐⭐⭐⭐⭐ Logically clear, with causal analysis and method design organically integrated.
  • Value: ⭐⭐⭐⭐ Provides a practical and theoretically grounded general framework for TSFM fine-tuning.