ProbFM: Probabilistic Time Series Foundation Model with Uncertainty Decomposition¶
Conference: AAAI2026
arXiv: 2601.10591
Code: To be confirmed
Area: Time Series
Keywords: time series, foundation model, uncertainty quantification, deep evidential regression, financial forecasting
TL;DR¶
This work is the first to introduce Deep Evidential Regression (DER) with a Normal-Inverse-Gamma prior into a time series foundation model architecture, enabling epistemic-aleatoric uncertainty decomposition in a single forward pass. The practical value of uncertainty-aware trading strategies is validated on cryptocurrency forecasting.
Background & Motivation¶
- Time Series Foundation Models (TSFMs) demonstrate strong zero-shot forecasting performance, yet lack principled uncertainty quantification in high-stakes scenarios such as finance.
- Limitations of prior work:
  - Mixture models (MOIRAI): rely on predefined distributional components and cannot distinguish epistemic from aleatoric uncertainty.
  - Student-t distribution (Lag-Llama): imposes strong distributional assumptions that may not generalize across diverse time series characteristics.
  - Conformal prediction (TimeGPT): post-hoc calibration that is not integrated into the learning process.
- Architectural heterogeneity makes it difficult to attribute performance gains to uncertainty quantification methods versus architectural advantages.
Core Problem¶
- How to achieve principled epistemic-aleatoric uncertainty decomposition within TSFMs?
- How to provide complete uncertainty quantification without sacrificing point forecast accuracy?
- How to fairly evaluate the contribution of uncertainty quantification strategies in isolation from architectural differences?
Method¶
Overall Architecture¶
ProbFM = Adaptive Patching + Transformer Backbone + DER Head, comprising six components: input processing, Transformer representation learning, DER uncertainty estimation, composite loss, single-stage training with evidence annealing, and single-pass inference.
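A minimal PyTorch sketch of this pipeline under stated assumptions: the patch length, model width, layer count, and the names `ProbFMSketch`, `DERHead`, and `decompose_uncertainty` are illustrative choices, not the paper's configuration; only the Softplus constraints and the closed-form decomposition follow the key designs listed in the next subsection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EPS = 1e-6  # small constant added after Softplus to keep parameters strictly valid


class DERHead(nn.Module):
    """Project encoder features to the four NIG parameters (mu, lambda, alpha, beta)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, 4)

    def forward(self, h: torch.Tensor):
        mu, lam, alpha, beta = self.proj(h).unbind(dim=-1)
        lam = F.softplus(lam) + EPS            # lambda > 0
        alpha = F.softplus(alpha) + 1.0 + EPS  # alpha > 1
        beta = F.softplus(beta) + EPS          # beta > 0
        return mu, lam, alpha, beta


def decompose_uncertainty(lam, alpha, beta):
    """Closed-form decomposition from the NIG parameters."""
    aleatoric = beta / (alpha - 1.0)           # E[sigma^2]: intrinsic data noise
    epistemic = beta / ((alpha - 1.0) * lam)   # Var[mu]: reducible model uncertainty
    return aleatoric, epistemic


class ProbFMSketch(nn.Module):
    """Patching -> Transformer encoder -> DER head (sizes are illustrative)."""

    def __init__(self, patch_len: int = 16, d_model: int = 64,
                 n_layers: int = 2, n_heads: int = 4):
        super().__init__()
        self.patch_len = patch_len
        self.embed = nn.Linear(patch_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = DERHead(d_model)

    def forward(self, x: torch.Tensor):
        # x: (batch, context_length); split into non-overlapping patches
        patches = x.unfold(-1, self.patch_len, self.patch_len)  # (batch, n_patches, patch_len)
        z = self.encoder(self.embed(patches))                   # (batch, n_patches, d_model)
        return self.head(z[:, -1])                              # single-step forecast from last token


model = ProbFMSketch()
mu, lam, alpha, beta = model(torch.randn(8, 128))
aleatoric, epistemic = decompose_uncertainty(lam, alpha, beta)
```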
Key Designs¶
1. Normal-Inverse-Gamma (NIG) Prior
- Models a distribution over the parameters of the predictive distribution rather than parameterizing it directly: \(p(\mu, \sigma^2) = \text{NIG}(\mu, \lambda, \alpha, \beta)\), with the NIG location reused as the point forecast \(\mu\).
- Explicit uncertainty decomposition:
- Aleatoric: \(\mathbb{U}_{\text{aleatoric}} = \frac{\beta}{\alpha - 1}\) (intrinsic data noise)
- Epistemic: \(\mathbb{U}_{\text{epistemic}} = \frac{\beta}{(\alpha-1)\lambda}\) (model uncertainty, reducible with more data)
2. DER Head Parameter Projection
- The Transformer output \(h\) is projected to four NIG parameters: \(\mu\) is unconstrained; \(\lambda, \beta\) are kept positive via Softplus + \(\epsilon\); \(\alpha\) is constrained to \(> 1\) via Softplus + 1 + \(\epsilon\).
3. Augmented Loss Function
- Evidential loss: \(\mathcal{L}_{\text{EDL}} = \mathcal{L}_{\text{NLL}} + \lambda_{\text{evd}} \mathcal{L}_{\text{reg}}\)
- Coverage loss: \(\mathcal{L}_{\text{coverage}} = |\text{PICP}_{\text{target}} - \text{PICP}_{\text{actual}}|\), where PICP is the prediction interval coverage probability; this term directly optimizes interval coverage.
- Full objective: \(\mathcal{L}_{\text{ProbFM}} = \mathcal{L}_{\text{EDL}} + \lambda_{\text{coverage}} \cdot \mathcal{L}_{\text{coverage}} + \lambda_{\text{wd}} \|\theta\|_2^2\) (a sketch combining these terms with the annealing schedule below appears after this list).
4. Evidence Annealing
- \(\text{evidence\_scale}(t) = \min(1.0, t / T_{\text{anneal}})\), preventing overconfidence in early training.
- Unlike the KL regularization annealing of Sensoy et al., this approach directly controls the evidence accumulation process.
5. Controlled Experimental Design
- All compared methods employ the same backbone: a 1-layer LSTM with 32 hidden units.
- Only the loss function and output head are varied, isolating the contribution of uncertainty quantification strategies.
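A sketch of the composite objective under several assumptions: the evidential NLL and regularizer are written in the standard Deep Evidential Regression form (Amini et al.); the coverage term builds the interval from a Gaussian approximation with the total (aleatoric + epistemic) variance and relaxes the hard coverage indicator with a sigmoid so it has gradients; evidence annealing is read as scaling the virtual evidence \(\lambda\) and \(\alpha - 1\) early in training; and the weight-decay term is assumed to be handled by the optimizer. The paper's exact formulations may differ, and the weights `lambda_evd` and `lambda_coverage` are placeholders.

```python
import math
import torch


def nig_nll(y, mu, lam, alpha, beta):
    """Negative log-likelihood of the Student-t predictive induced by the NIG prior
    (standard DER form; the paper's exact variant may differ)."""
    omega = 2.0 * beta * (1.0 + lam)
    return (0.5 * torch.log(math.pi / lam)
            - alpha * torch.log(omega)
            + (alpha + 0.5) * torch.log(lam * (y - mu) ** 2 + omega)
            + torch.lgamma(alpha) - torch.lgamma(alpha + 0.5))


def evidential_regularizer(y, mu, lam, alpha):
    """Penalize accumulating evidence on wrong predictions: |y - mu| * total evidence."""
    return torch.abs(y - mu) * (2.0 * lam + alpha)


def coverage_loss(y, mu, lam, alpha, beta, target_picp=0.9, temperature=50.0):
    """|PICP_target - PICP_actual| with a sigmoid relaxation of the coverage indicator.
    The interval uses a Gaussian approximation with total variance (an assumption)."""
    total_var = beta / (alpha - 1.0) + beta / ((alpha - 1.0) * lam)
    z = torch.distributions.Normal(0.0, 1.0).icdf(torch.tensor(0.5 * (1.0 + target_picp)))
    half_width = z * torch.sqrt(total_var)
    inside = torch.sigmoid(temperature * (half_width - torch.abs(y - mu)))  # soft indicator
    return torch.abs(target_picp - inside.mean())


def evidence_scale(step, t_anneal):
    """Linear annealing factor: min(1, t / T_anneal)."""
    return min(1.0, step / t_anneal)


def probfm_loss(y, mu, lam, alpha, beta, step, t_anneal=1000,
                lambda_evd=0.1, lambda_coverage=1.0):
    """Composite objective; L2 weight decay is left to the optimizer (e.g. AdamW)."""
    # One plausible reading of evidence annealing: shrink the virtual evidence
    # (lambda and alpha - 1) early in training so the model cannot be overconfident.
    s = max(evidence_scale(step, t_anneal), 1e-3)  # small floor keeps lambda > 0
    lam_s = s * lam
    alpha_s = 1.0 + s * (alpha - 1.0)

    loss_edl = (nig_nll(y, mu, lam_s, alpha_s, beta)
                + lambda_evd * evidential_regularizer(y, mu, lam_s, alpha_s)).mean()
    loss_cov = coverage_loss(y, mu, lam_s, alpha_s, beta)
    return loss_edl + lambda_coverage * loss_cov
```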
Key Experimental Results¶
| Method | RMSE | MAE | Characteristics |
|---|---|---|---|
| MSE Baseline | 0.044 | 0.030 | No probabilistic output |
| Gaussian NLL | 0.044 | 0.029 | Total variance only |
| Student-t NLL | 0.045 | 0.030 | Heavy-tail modeling |
| Quantile Loss | 0.044 | 0.029 | Quantile-based intervals |
| Evidential (ProbFM) | 0.045 | 0.030 | Epistemic + aleatoric decomposition |
- Forecast accuracy: DER is on par with the other heads (RMSE 0.045 vs. 0.044 for the MSE baseline), showing that uncertainty quantification does not come at the cost of accuracy.
- Uncertainty-aware trading: filtering out high-uncertainty predictions via epistemic/aleatoric thresholds improves risk-adjusted returns (see the sketch after this list).
- Portfolio optimization: uncertainty-based position sizing outperforms the equal-weight baseline.
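The exact thresholds and position-sizing rule are not spelled out in these notes; the snippet below is a hypothetical illustration of the idea: trade only where both uncertainty components fall below chosen percentile cutoffs, and size positions inversely to the total uncertainty.

```python
import numpy as np


def uncertainty_filtered_positions(pred_returns, epistemic, aleatoric,
                                   epi_pct=70, ale_pct=70):
    """Take a position in the direction of the forecast only when both uncertainty
    components are below their percentile thresholds; size inversely to total
    uncertainty. Thresholds and sizing are illustrative, not the paper's choices."""
    epi_thr = np.percentile(epistemic, epi_pct)
    ale_thr = np.percentile(aleatoric, ale_pct)
    tradable = (epistemic <= epi_thr) & (aleatoric <= ale_thr)

    total_unc = epistemic + aleatoric
    raw_size = 1.0 / (total_unc + 1e-8)                # shrink positions as uncertainty grows
    weights = np.where(tradable, np.sign(pred_returns) * raw_size, 0.0)
    gross = np.abs(weights).sum()
    return weights / gross if gross > 0 else weights   # normalize gross exposure to 1


# Example with synthetic forecasts and uncertainties
rng = np.random.default_rng(0)
w = uncertainty_filtered_positions(rng.normal(size=100),
                                   rng.uniform(0.1, 1.0, size=100),
                                   rng.uniform(0.1, 1.0, size=100))
```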
Highlights & Insights¶
- First application of DER + NIG prior to the TSFM architecture, filling the gap in uncertainty decomposition for time series foundation models.
- The controlled experimental design (fixed LSTM backbone) rigorously isolates the contribution of uncertainty quantification methods.
- Coverage loss directly optimizes prediction interval coverage without requiring post-hoc calibration.
- Complete uncertainty quantification is obtained in a single forward pass, offering high computational efficiency.
- Financial application validation (trading filtering + portfolio optimization) demonstrates practical decision-making value.
Limitations & Future Work¶
- Validation is limited to cryptocurrency daily return data; multi-domain (energy, transportation, weather) and multi-frequency experiments are absent.
- The controlled experiment uses a 1-layer LSTM (32 dims) with minimal model capacity; validation at true foundation model scale has not been conducted.
- Only univariate single-step forecasting is supported; multi-step and multivariate extensions (via a Normal-Inverse-Wishart prior) are mentioned only as future work.
- Although evidence annealing mitigates the evidence collapse problem in DER, theoretical guarantees remain insufficient.
- End-to-end comparisons against real TSFMs such as MOIRAI and Lag-Llama are not performed.
Related Work & Insights¶
- vs MOIRAI: MOIRAI employs a 4-component mixture distribution requiring multi-component sampling; ProbFM uses a single forward pass but lacks MOIRAI's multivariate multi-step capability.
- vs Lag-Llama: Lag-Llama assumes a Student-t distribution as a single distributional family; ProbFM learns a distribution over distribution parameters via NIG.
- vs TimeGPT: TimeGPT's conformal prediction performs post-hoc calibration; ProbFM integrates uncertainty into the training process.
- vs Standard Bayesian / MC Dropout: ProbFM uses a single forward pass, avoiding the overhead of multiple sampling runs.
Further Implications¶
- The epistemic-aleatoric decomposition of DER has a natural application in active learning: samples with high epistemic uncertainty should be prioritized for annotation.
- The coverage loss approach is generalizable to calibration in any probabilistic forecasting model.
- The evidence annealing strategy provides a useful reference for other evidential learning tasks (classification, object detection).
- The direction of TSFM + uncertainty decomposition warrants further exploration at larger scales.
Rating¶
- Novelty: ⭐⭐⭐⭐ (First application of DER to TSFMs, though DER itself is not a novel method)
- Experimental Thoroughness: ⭐⭐⭐ (Controlled experimental design is sound, but data and model scale are insufficient)
- Writing Quality: ⭐⭐⭐⭐ (Methodological exposition is clear with a solid theoretical foundation)
- Value: ⭐⭐⭐ (The direction is meaningful, but the experimental scale limits persuasiveness)