# Multi-Scale Finetuning for Encoder-based Time Series Foundation Models
**Conference:** NeurIPS 2025 · **arXiv:** 2506.14087 · **Code:** https://github.com/zqiao11/MSFT
**Area:** Time Series / Foundation Model Fine-tuning
**Keywords:** time series foundation models, multi-scale modeling, fine-tuning, causal inference, parameter-efficient fine-tuning
## TL;DR
This paper proposes MSFT (Multi-Scale FineTuning), which uses a causal analysis to show that naive fine-tuning suffers from scale confounding, and introduces a multi-scale framework for efficiently fine-tuning encoder-based time series foundation models that significantly outperforms both naive fine-tuning and from-scratch SOTA methods.
## Background & Motivation
Time series foundation models (TSFMs) demonstrate strong zero-shot forecasting performance, yet how to efficiently fine-tune them for downstream tasks remains an underexplored problem. Existing fine-tuning strategies (full fine-tuning, linear probing) suffer from two core issues:
1. **The multi-scale nature of time series is neglected.** Time series exhibit different temporal patterns at different sampling scales. For example, hourly energy consumption reflects local usage patterns, while daily aggregation captures macro-level trends. Naive fine-tuning learns only at the original scale and is prone to overfitting to single-scale patterns.
2. **The multi-scale capability of TSFMs is underutilized.** TSFMs are pre-trained on datasets spanning many scales and thus inherently possess multi-scale forecasting ability, yet naive fine-tuning restricts the model to a single scale, wasting the multi-scale knowledge acquired during pre-training.
The authors analyze this from a causal inference perspective: scale \(S\) acts as a confounder that simultaneously affects the input \(X\) (temporal patterns at different scales) and the knowledge activated in the model \(M\) (scale-specific pre-trained knowledge), introducing a backdoor path \(X \leftarrow S \rightarrow M \rightarrow Y\) that induces spurious correlations. It is therefore necessary to eliminate the confounding effect via do-calculus intervention \(P(Y|do(X))\).
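Concretely, since the knowledge activated in the model is a deterministic function of the input and scale, \(M = g(X, s)\), the backdoor adjustment over \(S\) yields the formula used in the Method section below (a standard do-calculus derivation):

```latex
\begin{aligned}
P(Y \mid \mathrm{do}(X))
  &= \sum_{s} P(Y \mid X, S=s)\, P(S=s)
     && \text{(backdoor adjustment over the confounder } S\text{)} \\
  &= \sum_{s} P\big(Y \mid X, S=s, M=g(X,s)\big)\, P(s)
     && \text{(} M \text{ is deterministic given } X \text{ and } s\text{)}
\end{aligned}
```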
## Method
### Overall Architecture
MSFT operates under a causal intervention framework, realizing the backdoor adjustment as \(P(Y|do(X)) = \sum_s P(Y|X,S=s,M=g(X,s))P(s)\). The pipeline proceeds as follows:

1. Downsample the original time series into multi-scale sequences (by factors of \(2^k\)).
2. Tokenize and encode each scale independently.
3. Concatenate the multi-scale token sequences and capture both within-scale and cross-scale dependencies via decoupled dependency modeling.
4. Aggregate the multi-scale predictions by weighted summation.
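A minimal sketch of step 1, assuming average-pooling downsampling by factors of \(2^k\) (the function and variable names are illustrative, not taken from the released code):

```python
import torch
import torch.nn.functional as F

def build_multiscale_inputs(x: torch.Tensor, num_scales: int) -> list[torch.Tensor]:
    """Downsample a batch of series (batch, length) into K+1 scales via avg-pooling by 2**k."""
    scales = [x]  # k = 0: the original resolution
    for k in range(1, num_scales + 1):
        # avg_pool1d expects (batch, channels, length)
        pooled = F.avg_pool1d(x.unsqueeze(1), kernel_size=2 ** k, stride=2 ** k).squeeze(1)
        scales.append(pooled)
    return scales

# Example: 8 series of length 512 become scales of length 512, 256, and 128
series = torch.randn(8, 512)
print([s.shape[-1] for s in build_multiscale_inputs(series, num_scales=2)])  # [512, 256, 128]
```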
### Key Designs
- **Scale-specific knowledge activation:** Pre-trained parameters are frozen; each scale receives its own linear adapter (input projection layer) and its own LoRA modules (attention layers). This avoids interference caused by differing token resolutions across scales and realizes \(M=g(X,s)\) in the formulation, activating scale-specific TSFM knowledge.
- **Decoupled token dependency modeling:** Composed of two components (code sketches for both follow this list):
  - **In-scale attention:** A mask \(\mathbf{M}_{in}\) ensures that tokens attend only to tokens within the same scale, preventing spurious attention caused by temporal index misalignment across scales.
  - **Cross-scale aggregator:** Bidirectional (coarse-to-fine, C2F, and fine-to-coarse, F2C) layer-wise fusion of adjacent-scale information. Tokens are mapped to a shared space via linear projections \(\phi_{i,j}^l\) and then fused according to temporal alignment: C2F uses Repeat upsampling, F2C uses AvgPool downsampling.
- **Multi-scale mixing output:** Each scale produces an independent prediction \(\hat{Y}_i\); the training objective is the weighted loss \(\mathcal{L}_{pred} = \sum_i w_i \mathcal{L}_{pred,i}\), where the weights \(w_i\) are learned via softmax (see the sketch under Loss & Training). At inference, each scale's prediction is upsampled to the original resolution and combined by weighted summation, an ensemble effect that mitigates overfitting.
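To make the in-scale attention concrete, here is a minimal sketch of a block-diagonal mask over the concatenated token sequence, reconstructed from the description above (the official implementation may build \(\mathbf{M}_{in}\) differently):

```python
import torch

def in_scale_mask(tokens_per_scale: list[int]) -> torch.Tensor:
    """Boolean attention mask (True = may attend) restricting attention to same-scale tokens."""
    total = sum(tokens_per_scale)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in tokens_per_scale:
        mask[start:start + n, start:start + n] = True  # block-diagonal region for this scale
        start += n
    return mask

# Example: three scales tokenized into 64, 32, and 16 tokens
print(in_scale_mask([64, 32, 16]).shape)  # torch.Size([112, 112])
```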
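And a sketch of the bidirectional fusion between two adjacent scales, again a reconstruction rather than the paper's exact aggregator: C2F upsamples coarse tokens by repetition, F2C downsamples fine tokens by average pooling, each after a linear projection standing in for \(\phi_{i,j}^l\):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleFuse(nn.Module):
    """Fuse two adjacent scales: coarse-to-fine via Repeat, fine-to-coarse via AvgPool."""
    def __init__(self, dim: int, factor: int = 2):
        super().__init__()
        self.factor = factor
        self.proj_c2f = nn.Linear(dim, dim)  # project coarse tokens into the fine scale's space
        self.proj_f2c = nn.Linear(dim, dim)  # project fine tokens into the coarse scale's space

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor):
        # fine: (B, N * factor, D); coarse: (B, N, D)
        c2f = self.proj_c2f(coarse).repeat_interleave(self.factor, dim=1)  # upsample by repetition
        f2c = self.proj_f2c(fine).transpose(1, 2)                          # (B, D, N * factor)
        f2c = F.avg_pool1d(f2c, kernel_size=self.factor).transpose(1, 2)   # (B, N, D)
        return fine + c2f, coarse + f2c  # residual fusion of temporally aligned tokens

fuse = CrossScaleFuse(dim=64)
new_fine, new_coarse = fuse(torch.randn(2, 32, 64), torch.randn(2, 16, 64))
print(new_fine.shape, new_coarse.shape)  # (2, 32, 64) (2, 16, 64)
```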
### Loss & Training
The original prediction loss of each TSFM (MSE or NLL) is used, weighted and summed according to the learned scale weights. Pre-trained parameters are frozen; only the adapters, LoRA modules, cross-scale aggregator, and scale weights are trained.
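A minimal sketch of the weighted objective \(\mathcal{L}_{pred} = \sum_i w_i \mathcal{L}_{pred,i}\) with softmax-parameterized scale weights (the exact parameterization in the paper may differ):

```python
import torch
import torch.nn as nn

class MultiScaleLoss(nn.Module):
    """Weighted sum of per-scale prediction losses with learned softmax weights."""
    def __init__(self, num_scales: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_scales))  # trainable scale-weight logits

    def forward(self, per_scale_losses: torch.Tensor) -> torch.Tensor:
        # per_scale_losses: (num_scales,), e.g. torch.stack([mse(pred_i, y_i) for each scale i])
        w = torch.softmax(self.logits, dim=0)  # weights w_i, summing to 1
        return (w * per_scale_losses).sum()

criterion = MultiScaleLoss(num_scales=3)
print(criterion(torch.tensor([0.35, 0.30, 0.28])))  # scalar weighted loss
```

The same learned weights are reused at inference to combine the upsampled per-scale predictions.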
## Key Experimental Results
### Main Results (Long-term Forecasting, MSE, lower is better)
| Dataset | MSFT (Moirai-Base) | Full FT (Moirai-Base) | TimeMixer | SimpleTM | MSE gain vs Full FT |
|---|---|---|---|---|---|
| ETTm1 | 0.332 | 0.368 | 0.381 | 0.381 | −9.8% |
| ETTm2 | 0.247 | 0.258 | 0.275 | 0.275 | −4.3% |
| Weather | 0.213 | 0.232 | 0.240 | 0.243 | −8.2% |
| Electricity | 0.169 | 0.173 | 0.182 | 0.166 | −2.3% |
MSFT consistently outperforms naive fine-tuning, LoRA, AdaLoRA, and other parameter-efficient fine-tuning methods across three TSFM backbones (Moirai, Moment, UniTS).
### Ablation Study
| Configuration | ETTm1 MSE | Weather MSE | Note |
|---|---|---|---|
| MSFT (full) | 0.332 | 0.213 | All components |
| w/o cross-scale aggregator | 0.340 | 0.220 | Cross-scale fusion is important |
| w/o scale-specific LoRA | 0.345 | 0.222 | Scale-specific knowledge activation is important |
| w/o multi-scale mixing | 0.338 | 0.218 | Weighted fusion is beneficial |
| Single scale (K=0) | 0.361 | 0.230 | Multi-scale modeling is the core |
### Key Findings
- Multi-scale modeling is the primary source of improvement; each component (scale-specific adapters, decoupled dependency modeling, weighted mixing) contributes independently.
- MSFT surpasses not only various fine-tuning baselines but also from-scratch SOTA deep learning methods (TimeMixer, SimpleTM, etc.).
- The framework is equally applicable to probabilistic forecasting tasks (leveraging Moirai's probabilistic forecasting capability).
## Highlights & Insights
- The causal analysis perspective provides a theoretically novel and compelling diagnosis of naive fine-tuning—scale acts as a confounder that introduces spurious correlations.
- The framework design is clean and general, compatible with different encoder-based TSFM architectures.
- Freezing pre-trained parameters and employing lightweight adapters yields high parameter efficiency while avoiding catastrophic forgetting.
## Limitations & Future Work
- The framework targets only encoder-based TSFMs; decoder-only (e.g., TimesFM) and encoder-decoder (e.g., Chronos) architectures are not explored.
- The number of scales \(K\) requires manual selection; no adaptive mechanism is provided.
- The downsampling strategy is fixed as average pooling; alternative approaches (e.g., wavelet transforms) merit exploration.
## Related Work & Insights
- vs TimeMixer: TimeMixer trains a multi-scale model from scratch, whereas MSFT leverages the multi-scale capability of pre-trained TSFMs for fine-tuning—different approaches sharing the same objective.
- vs LoRA: Standard LoRA does not differentiate between scales; MSFT's scale-specific LoRA more effectively activates the corresponding pre-trained knowledge.
- vs Scaleformer: Scaleformer refines predictions iteratively from coarse to fine, while MSFT employs bidirectional fusion for greater flexibility.
## Rating
- Novelty: ⭐⭐⭐⭐ The causal perspective on fine-tuning is novel, though multi-scale modeling itself is not entirely new.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Three backbones, multiple datasets, comparisons with diverse fine-tuning methods, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐⭐ Logically clear, with causal analysis and method design organically integrated.
- Value: ⭐⭐⭐⭐ Provides a practical and theoretically grounded general framework for TSFM fine-tuning.