
📈 Time Series

🧪 ICML2026 · 45 paper notes

📌 Same area in other venues: 📷 CVPR2026 (7) · 🔬 ICLR2026 (121) · 💬 ACL2026 (8) · 🤖 AAAI2026 (31) · 🧠 NeurIPS2025 (54) · 📹 ICCV2025 (4)

🔥 Top topics: Time-Series Forecasting ×33 · Adversarial Robustness ×3 · Reasoning ×2 · Anomaly Detection ×2 · Self-Supervised Learning ×2

Adaptive Time Series Reasoning via Segment Selection: This paper proposes ARTIST, which frames time series question answering (TSQA) as a sequential decision-making problem of "reasoning while selecting segments." Through a controller-reasoner architecture and hierarchical self-play RL, the model selectively reads task-relevant temporal segments, thereby improving reasoning accuracy.
AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection: AnomSeer formalizes statistical evidence from classical time-series anomaly detection into expert reasoning trajectories and reinforces Multimodal LLMs (MLLMs) via TimerPO. This enables the model to simultaneously perform anomaly type classification, interval localization, and fine-grained explanation based on line chart inputs.
Beyond Extrapolation: Knowledge Utilization Paradigm with Bidirectional Inspiration for Time Series Forecasting: The KUP-BI framework is proposed, which constructs a "post-target continuation" knowledge base from the training set. It retrieves continuation patterns of similar historical trajectories through ratio-based transformations to generate a continuation-style auxiliary stream. This stream is fused with backbone network features via a gating mechanism, consistently improving long-term forecasting performance across 6 datasets and 4 backbone architectures.
Building Social World Models with Large Language Models: This paper proposes the "Social World Model" (SWM), which treats collective beliefs as states and social events as exogenous actions. It utilizes an LLM as a transition engine to learn an event-conditioned state transition distribution \(P_\theta(\mathbf s_{t+1}\mid\mathbf s_t,e_t)\). By utilizing a frozen "hindsight posterior attributor" to provide pseudo-labels, it bypasses the challenge of missing "event \(\rightarrow\) belief change" annotations. SWM significantly outperforms time-series foundation models and strong baselines like GPT-5.5 on SWM-Bench, a benchmark constructed from real prediction markets (Kalshi/Polymarket).
CombinationTS: A Modular Framework for Understanding Time-Series Forecasting Models: CombinationTS decouples time-series forecasting models into five orthogonal modules: Input Transformation, Embedding, Encoder, Decoder, and Output Transformation. By performing paired Monte Carlo sampling on a shared "Evaluation Condition Space," it replaces fragile single-point MSE with marginal performance \(\mu\) and stability \(\sigma\). The primary conclusion is that with a well-designed data view (Embedding), a parameter-free Identity Encoder can match or even outperform complex Transformers, suggesting that "SOTA gains" in time-series forecasting largely stem from data representation rather than modeling capacity.
DAG: A Dual Correlation Network for Time Series Forecasting with Exogenous Variables: For Time Series Forecasting with known future covariates (TSF-X), DAG designs a dual-pathway network: one pathway captures "historical exogenous → future exogenous" attention patterns along the temporal dimension and injects them into "historical endogenous → future endogenous" predictions, while the other captures "historical exogenous → historical endogenous" patterns along the channel dimension and injects them into "future exogenous → future endogenous" predictions. DAG achieves the best MSE on 10/12 public/newly released TSF-X datasets, significantly outperforming TimeXer, TFT, TiDE, CrossLinear, and PatchTST.
DistMatch: Adaptive Binning via Distribution Matching for Robust Sequential Conformal: DistMatch proposes a recursive binning method based on KS statistics—by grouping residuals into approximately exchangeable leaf nodes, it discards weight reassignment, providing effective conformal prediction intervals under distribution shift. It achieves the smallest interval widths across five datasets while maintaining valid coverage.
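The two ingredients named in the DistMatch summary, a KS statistic for comparing groups of residuals and a conformal interval built from residuals, can be illustrated with a minimal sketch. The function names `ks_statistic` and `conformal_halfwidth` are hypothetical, and the code shows only the textbook two-sample KS statistic and the standard split-conformal quantile. It does not reproduce the paper's recursive binning procedure.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The largest gap between two empirical CDFs occurs at an observed point.
    for x in a_sorted + b_sorted:
        f_a = bisect_right(a_sorted, x) / len(a_sorted)
        f_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(f_a - f_b))
    return gap


def conformal_halfwidth(residuals, alpha=0.1):
    """Split-conformal half-width: the ceil((n+1)(1-alpha))-th smallest |residual|.

    Assumption: when that rank exceeds n the interval is formally unbounded;
    this sketch clips to the largest score instead.
    """
    scores = sorted(abs(r) for r in residuals)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return scores[min(k, n) - 1]
```

A small KS value between two groups of residuals suggests they can plausibly share one calibration set; a large value suggests they should be calibrated separately.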
Divide and Contrast: Learning Robust Temporal Features Without Augmentation: Di-COT efficiently learns robust time series representations without data augmentation by randomly partitioning sequences into overlapping sub-blocks for contrastive learning. Compared to existing methods, it is 2.5 times faster with higher accuracy, validated comprehensively across 6 large-scale datasets + 124 UCR + 28 UEA.
Do Time Series Foundation Model Benchmarks Hide Regime-Dependent Failures? Evidence from Traffic Speed Forecasting: This paper argues that Time Series Foundation Models (TSFMs) exhibit a phenomenon of "good average metrics but failure at critical moments" in traffic speed forecasting. By employing regime-stratified evaluation based on traffic states, the authors expose catastrophic failures masked by aggregate metrics and propose BMA (Bimodal Mixture Augmentation), a post-processing method that requires no retraining, to bring prediction interval coverage in "transition regimes" back to levels near historical baselines.
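Regime-stratified evaluation, as described in the traffic-speed entry above, amounts to computing a metric per regime instead of over the whole test set. The sketch below does this for prediction-interval coverage. The function name `coverage_by_regime` and the regime labels are illustrative assumptions, and the BMA post-processing step is not shown.

```python
from collections import defaultdict


def coverage_by_regime(lower, upper, actual, regimes):
    """Fraction of observations inside [lower, upper], reported per regime."""
    hits, counts = defaultdict(int), defaultdict(int)
    for lo, hi, y, regime in zip(lower, upper, actual, regimes):
        counts[regime] += 1
        hits[regime] += int(lo <= y <= hi)
    return {regime: hits[regime] / counts[regime] for regime in counts}


# Hypothetical toy data: aggregate coverage is 0.75, but the
# "transition" regime is covered only half of the time.
result = coverage_by_regime(
    lower=[0, 0, 0, 0],
    upper=[1, 1, 1, 1],
    actual=[0.5, 2.0, 0.5, 0.5],
    regimes=["free", "transition", "free", "transition"],
)
```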
Doubly Outlier-Robust Online Infinite Hidden Markov Model: This paper proposes BR-iHMM, which combines "robust observation updates (WoLF)" with "batched state inference (degenerate sticky HDP prior)." It provides bounded Posterior Influence Functions (PIFs) in both the observation and state spaces for online infinite Hidden Markov Models. On streaming data containing outliers—including financial order books, electricity loads, and synthetic regressions—it reduces one-step-ahead prediction RMSE by up to 67%.
Dynamic-TMoE: A Drift-Aware Dynamic Mixture of Experts Framework for Non-Stationary Time Series: By utilizing MMD to detect distribution drifts and dynamically expanding a heterogeneous expert pool combined with a Temporal Memory Router to ensure selection consistency, Dynamic-TMoE achieves new SOTA results across nine time-series benchmarks—reducing MSE by 10.4% and MAE by 7.8% on average compared to all baselines.
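The MMD-based drift check mentioned in the Dynamic-TMoE summary can be sketched as follows. This is the standard biased estimate of squared Maximum Mean Discrepancy with an RBF kernel on 1-D samples. The bandwidth, the function names, and any threshold applied to the result are assumptions, and the expert-pool expansion and routing are not shown.

```python
import math


def rbf(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))


def mmd2(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two 1-D samples."""

    def mean_kernel(a, b):
        return sum(rbf(u, v, bandwidth) for u in a for v in b) / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


reference = [0.0, 0.1, -0.1, 0.2]
shifted = [x + 3.0 for x in reference]  # hypothetical drifted window
```

Here `mmd2(reference, reference)` is zero up to rounding, while `mmd2(reference, shifted)` is clearly positive; a detector would compare this value against a chosen threshold.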
Ellipsoidal Time Series Forecasting: Fern reformulates long-term time series forecasting (LTSF) as "optimal transport from a fixed Gaussian source to a data-dependent ellipsoid." By leveraging Brenier’s Theorem, it restricts the search space to the Symmetric Positive Definite (SPD) Jacobian class. Utilizing low-rank spectral decomposition via Householder reflections, it reduces computational complexity from \(O(n^3)\) to \(O(Rn)\) and achieves up to a 790× stability improvement over baselines like DLinear and Koopa in non-stationary shock scenarios.
Embedding Hybrid Systems into Continuous Latent Vector Fields: This paper first proves an existence theorem—stating that as long as the latent space dimension \(m>2n\), an essentially discontinuous \(n\)-dimensional hybrid system can be embedded into \(m\)-dimensional Euclidean space with a continuous vector field on its image. Based on this, it designs the latent Neural ODE framework CHyLL++, which recovers hybrid system flows across various geometries and topologies with high precision from time series data alone.
Exposure Bias as Epistemic Underidentification in Recursive Forecasting: This paper provides a theoretical reinterpretation of "exposure bias" in recursive multi-step forecasting: it is not merely a distribution shift between training (teacher forcing) and deployment (self-feeding rollouts). Under partial observability or state truncation, it becomes a problem of epistemic underidentification. One-step supervision identifies model behavior only on the observed context, leaving it undetermined what the rollout should output on self-generated "induced states." The authors formalize this using "induced state \(Z\) + provenance variable \(P\)," providing an error decomposition and experimental validation.
FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models: FactoryNet is the first large-scale industrial time-series dataset with a unified control-loop structure—51 million data points / 23k end-to-end task executions (13.3k real + 9,800 simulated) across 6 machine entities, aligning all signals according to the Setpoint-Effort-Feedback-Context (S-E-F-C) cybernetic classification; 27 types of labeled anomalies + health baselines + counterfactual pairs enable zero-shot cross-entity transfer and parameter-efficient anomaly detection.
FRACTAL: State Space Model with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences: This paper generalizes the probability measures behind the HiPPO framework to fractional power-law measures with an adjustable singular index \(\alpha\), achieving "full history retention + recency sensitivity + scale invariance" for the first time. This theory is implemented as an LTI diagonalized SSM—FRACTAL—which ties S5 with an 87.11% average score on Long Range Arena and achieves 61.85% on ListOps.
From Observations to States: Latent Time Series Forecasting: The authors discover that existing TSF models, despite high prediction accuracy, often exhibit "Latent Chaos" in their latent spaces. They propose LatentTSF—which first compresses observations into a high-dimensional latent state space using an AutoEncoder, then allows any mainstream backbone to perform future prediction within this space (using a dual Pred + Align loss), and finally decodes back to the observation space. This approach consistently reduces MSE/MAE across six standard benchmarks and restores the temporal locality and spectral structure of latent representations.
Generalizing Multi-scale Time-Series Modeling with a Single Operator: The Sigma framework unifies existing discrete multi-scale operators by learning Learnable Discrete Gaussian (LDG) kernels with continuous, distance-aware scale parameters. It achieves SOTA performance on both long-term and short-term forecasting tasks while significantly reducing computational costs (5.3× faster training, 3.8× less VRAM).
HELIX: Hybrid Encoding with Learnable Identity and Cross-dimensional Synthesis for Time Series Imputation: HELIX learns a "feature identity embedding" for each feature as a persistent semantic anchor. Combined with time-feature double-helix attention, it ranks first in all 21 missing-data scenarios across 5 public multivariate time series datasets, with an MAE reduction of over 25% compared to the runner-up ImputeFormer on datasets such as ETT-h1.
HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series: HEPA learns predictable dynamics in time series through horizon-conditioned JEPA self-supervised pre-training. By freezing the encoder and fine-tuning only the predictor, it outperforms multiple SOTA methods across 14 benchmarks in 11 domains using a single architecture and fixed hyperparameters, achieving 92% performance with only 2% labeled data.
HiPPO Zoo: Explicit Memory Mechanisms for Interpretable State Space Models: This work explicates the implicit memory mechanisms in modern SSMs (such as Mamba) by extending the HiPPO framework into the "HiPPO Zoo" (consisting of 5 variants). Each variant implements specific modern SSM capabilities—non-linearity, adaptive memory, associative memory, multiscale representation, and predictive target constraints—using interpretable polynomial representations, achieving 100% accuracy on selective copying and associative recall tasks.
IMPACT: Influence Modeling for Open-Set Time Series Anomaly Detection: IMPACT utilizes "influence functions" simultaneously as a searchlight and a scalpel—first training an initial model with a multi-channel deviation loss to calculate the influence score of each training sample on validation risk. Under theoretical guarantees of risk reduction, it flips high-influence contaminated unlabeled samples into labeled anomalies and perturbs "boundary normal samples" (those with minimal risk contribution) along the gradient direction to generate "unseen pseudo-anomalies." Finally, a dual-head network learns both seen and unseen anomaly categories, consistently surpassing over ten unsupervised and open-set baselines across 8 real-world time-series benchmarks.
Incremental Transformer Neural Processes: By incorporating causal masking and KV caching—mechanisms common in Large Language Models—into Transformer Neural Processes (TNP), the update cost for each new observation in streaming scenarios is reduced from \(\mathcal{O}(N^2)\) to \(\mathcal{O}(N)\). Combined with a "dense autoregressive training" strategy that covers all context lengths in a single forward pass, incTNP maintains or exceeds the performance of standard TNP, while its "implicit Bayesianness" (prediction consistency) remains comparable to permutation-invariant TNPs.
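The cost argument in the incTNP summary rests on a general property of causal attention: because past positions never attend to later ones, their keys and values can be cached, and a new observation only needs one attention pass over the cache. The sketch below checks that on a toy single-head attention with identity projections (an assumption made for brevity). The class name `CachedAttention` is hypothetical, and this is not the TNP architecture itself.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def full_causal(tokens):
    """Recompute every position from scratch: O(N^2) work for N tokens."""
    return [attend(tokens[i], tokens[: i + 1], tokens[: i + 1]) for i in range(len(tokens))]


class CachedAttention:
    """Keep past keys/values so each new observation costs O(N)."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token):
        self.keys.append(token)
        self.values.append(token)
        return attend(token, self.keys, self.values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = CachedAttention()
incremental = [cache.step(t) for t in tokens]
recomputed = full_causal(tokens)
```

Both lists hold the same outputs; the cached version simply avoids redoing the work for earlier positions each time an observation arrives.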
Interpretability in Deep Time Series Models Demands Semantic Alignment: This position paper argues that deep time series models should enforce semantic alignment: a model's internal variables and mechanisms should correspond to a domain expert's reasoning, rather than explanations merely describing internal computations. Its core contribution is a definition of persistence constraints that keep semantic alignment intact under temporal evolution (a challenge unique to time series).
It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks: TIME is a next-generation benchmark for Time Series Foundation Models (TSFMs). It overcomes four major pain points—data reuse, quality issues, improper task configurations, and low evaluation granularity—through human annotation + LLM-driven data cleaning, context-aligned task design, and a pattern-level evaluation perspective. It includes 50 entirely new datasets, 98 tasks, and evaluations of 12 TSFMs.
Latent Laplace Diffusion for Irregular Multivariate Time Series: LLapDiff is a generative framework that performs diffusion in latent space. By parameterizing stable modal evolution with learnable complex-conjugate poles in the Laplace domain, it achieves long-term forecasting and missing value imputation for irregular time series without step-by-step physical time integration, achieving an average rank of 2.1±1.7 across 7 datasets.
Learning Long Range Spatio-Temporal Representations over Continuous Time Dynamic Graphs with State Space Models: CTDG-SSM introduces a topology-aware HiPPO projection and State Space Models to simultaneously capture multi-hop Long-Range Spatial (LRS) and Long-Range Temporal (LRT) dependencies in dynamic graphs. It outperforms the previous SOTA in link prediction and node classification while using only 1/10 of the parameters of competing methods.
Learning Manifold and Itô Dynamics with Branched Neural Rough Differential Equations: Neural Rough Differential Equations (NRDE) can only handle Stratonovich dynamics due to their reliance on shuffle algebra. This paper replaces the log-ODE step of NRDE with geometric numerical integration on Hopf algebras: using Grossman–Larson rooted tree algebra for Euclidean Itô, Munthe–Kaas–Wright planar rooted tree algebra for ordered covariant derivatives on manifolds, and reserving shuffle algebra for classical Stratonovich. This generalizes signature methods to Itô and manifold-valued dynamics for the first time, complemented by a branched signature kernel objective that makes quadratic variation terms visible during training.
Mix, Don't Pick: Why Synthetic Corpus Composition Matters for Time Series Foundation Model Pretraining: This paper performs a systematic comparative study using 11 synthetic time series generators and 2 time series foundation models trained from scratch. It finds that generator rankings are unstable across different architectures, and the forecasting error gap between the best and worst generators can be as large as 2. Rather than solving the difficult selection problem, simply mixing all generators with equal weights (Mixed11) can match or exceed the best single generator. Combining this with real data yields the strongest corpus. The study concludes that synthetic pretraining is a "corpus composition" problem rather than a "generator selection" problem, and composition strategies must be validated for each specific model architecture.
Nested Spatio-Temporal Time Series Forecasting: NeST treats "future macro-region trends" as top-down guidance. Combined with semantic regions constructed via spectral clustering and bidirectional cross-scale cross-attention, it achieves comprehensive improvements in accuracy, long-range stability, and near-linear complexity for node-level spatio-temporal forecasting on large-scale traffic networks.
OLIVIA: Harmonizing Time Series Foundation Models with Power Spectral Density: OLIVIA significantly improves the pre-training of time series foundation models on heterogeneous data by introducing a Power Spectral Density (PSD)-driven coordination mechanism—comprising the Harmonizer (orthogonal second-order coordination based on Householder reflections) and HarmonicAttention (low-dimensional interaction via resonators)—achieving SOTA performance across TSLib Zero-shot, GIFT-Eval, and GluonTS benchmarks.
Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation: Targeting scenarios where "multiple interacting systems must be predicted simultaneously" (e.g., exchange rates of 16 countries, new COVID-19 cases in hundreds of counties), this paper proposes Equilibrium State Estimation (ESE). It first estimates the "equilibrium state proportions" of all systems in a single step, then forecasts all systems in one pass based on the direction in which the current state deviates from equilibrium. This replaces the \(O(n)\) separate training runs of per-system prediction with a single linear-time inference pass, matching SOTA accuracy while running 10–70× faster, and it can wrap any existing predictor as a plug-and-play module.
Parametric Prior Mapping Framework for Non-stationary Probabilistic Time Series Forecasting: PPM uses a lightweight encoder to infer context-aware Gaussian priors from historical sequences, then "pushes forward" this prior into a full predictive distribution using a two-layer MLP. Trained jointly with KDE-NLL and mean MSE, PPM outperforms probabilistic baselines such as DeepAR and the diffusion-based NsDiff across seven time-series benchmarks while achieving \(2 \times\) to \(100 \times\) faster inference.
PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering: For Time Series Question Answering (TSQA), PATRA explicitly decomposes sequences into full / trend / season patterns at the representation level and performs deep cross-modal alignment via three sets of learnable alignment tokens. During training, it uses a two-phase recipe (SFT followed by GRPO-based RL), mapping rewards from discriminative and generative tasks into a unified \([0,2]\) range to resolve difficulty imbalance. It outperforms text-only LLMs and multimodal TS-LLMs like ChatTS across four categories of TSQA tasks.
Position: Current Benchmarking Hinders Real Progress in Deep Learning for Time Series: This position paper argues that the core problem with current time series forecasting benchmarks is that differences in design choices (global/local parameters, preprocessing, exogenous variables, temporal and spatial processing) are dismissed as "implementation details," leading to unfair comparisons between papers. Through controlled experiments across 44 datasets, 7 SOTA models, and multiple reference architectures, it shows that the impact of these differences (5-15%) often exceeds the contribution of specific sequence modeling layers (1-3%).
QuITE: Query-based Irregular Time Series Embedding: QuITE is a plug-and-play embedding module that aggregates irregular observations into fixed-dimensional representations using learnable query tokens via self-attention. It adapts arbitrary multivariate time series (MTS) models to irregular MTS (IMTS) without architectural modifications or artificial value generation, achieving an average relative improvement of 54.7% on iTransformer + QuITE.
Self-Supervised Dynamical System Representations for Physiological Time-Series: PULSE treats physiological time-series as being generated by "transferable system parameters + non-transferable sample-specific noise." It proposes a cross-reconstruction objective—where a system representation inferred from one window is used to reconstruct another independent sample from the same system—forcing the encoder to retain only shared dynamics while discarding initial conditions and noise, thereby learning more transferable representations for clinical semantics.
Semantics-Enhanced Retrieval-Augmented Time Series Forecasting: SERAF adds a "semantic retrieval" path to retrieval-augmented time series forecasting: it automatically translates each historical time series segment into a structured text description (season/trend/volatility). By retrieving two sets of "similar past + corresponding future" based on both numerical and text semantic similarity and adaptively fusing them, the model can identify historical patterns that are "numerically dissimilar but inherently isomorphic" in non-stationary series. It outperforms pure numerical retrieval SOTAs across seven real-world datasets.
Simulation-Augmented Multi-Step Split Conformal Prediction for Aggregated Forecasts: Addressing aggregated forecast targets such as "annual totals" and "Year-over-Year (Y-o-Y) growth rates," this paper proposes SA-MSCP. The method collects residuals via expanding window cross-validation and simulates numerous future paths using block bootstrap. Prediction intervals are then constructed from the empirical quantiles of the aggregated trajectories. On M4 and a private dataset, this approach significantly improves empirical coverage for aggregated targets, albeit with a noticeable increase in interval width.
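The simulate-then-aggregate idea in the SA-MSCP summary can be sketched with a plain block bootstrap. The sketch assumes the residuals have already been collected (the paper uses expanding window cross-validation for that step, which is not shown), treats them as additive errors on the point forecasts, and targets a simple sum over the horizon. The function name and all default parameters are illustrative.

```python
import random


def simulate_total_interval(point_forecasts, residuals, block=3,
                            n_paths=2000, alpha=0.1, seed=0):
    """Interval for the horizon total from block-bootstrapped residual paths."""
    rng = random.Random(seed)
    horizon = len(point_forecasts)
    totals = []
    for _ in range(n_paths):
        noise = []
        # Draw contiguous residual blocks to preserve short-range dependence.
        while len(noise) < horizon:
            start = rng.randrange(len(residuals) - block + 1)
            noise.extend(residuals[start:start + block])
        # zip truncates the noise to the forecast horizon.
        path = [f + e for f, e in zip(point_forecasts, noise)]
        totals.append(sum(path))  # aggregate each simulated trajectory
    totals.sort()
    lo = totals[int((alpha / 2) * (n_paths - 1))]
    hi = totals[int((1 - alpha / 2) * (n_paths - 1))]
    return lo, hi
```

Because the interval is read off the quantiles of the aggregated trajectories rather than summed from per-step intervals, it reflects how errors accumulate across the horizon.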
Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases: Addressing the new problem of "querying morphological intent using natural language on massive Time Series Databases (TSDB)," this paper proposes the Sonar-TS neuro-symbolic framework. Much like active sonar, it first "pings" to coarsely filter candidate windows using SQL on multi-scale feature indices, then "locks on" to raw signals for precise verification using LLM-generated Python programs (Search-Then-Verify). Accompanied by NLQTSBench, the first benchmark for library-level long histories, Sonar-TS significantly outperforms traditional Text-to-SQL and Time-series Foundation Models on complex queries (average 0.61 vs. 0.16 for the strongest baseline).
Spatiotemporal Imputation with Graph-Informed Flow Matching: To address the issues of "error accumulation in iterative RNN/GNN propagation" and "problem-agnostic Gaussian priors and slow sampling in diffusion models" for spatiotemporal imputation, this paper proposes GiFlow. By constructing a "Graph Prior" through spatiotemporal filtering of observed signals to replace the Gaussian prior, the starting point of Flow Matching is moved closer to the target distribution with a shorter transport path. Combined with a hybrid vector field integrating spatial/temporal attention and spatiotemporal propagation, GiFlow consistently outperforms SOTA on synthetic and real-world datasets (air quality, traffic).
The Cost of Learning Under Multiple Change Points: This paper proposes the Anytime Tracking CUSUM (ATC) algorithm, which utilizes a time-varying adaptive threshold and the "selective detection" principle to achieve near minimax-optimal dynamic regret \(O(\sigma^2 (S+1) \log T)\) without any detectability assumptions (such as minimum spacing or minimum jump size). It also provides the first formal quantification of the logarithmic degradation bound due to "endogenous confounding from missed detections" in multi-change point scenarios.
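For readers unfamiliar with CUSUM, the sketch below shows the textbook two-sided version with a fixed threshold and a known pre-change mean. It is not the ATC algorithm: ATC's time-varying adaptive threshold and selective detection rule are not reproduced, and the `drift` and `threshold` values are arbitrary assumptions.

```python
def cusum_detect(xs, target_mean, drift=0.5, threshold=5.0):
    """Return the first index at which a mean shift is flagged, or None."""
    pos = neg = 0.0
    for t, x in enumerate(xs):
        # Accumulate evidence for an upward and a downward shift separately;
        # the drift term keeps small fluctuations from building up.
        pos = max(0.0, pos + (x - target_mean - drift))
        neg = max(0.0, neg + (target_mean - x - drift))
        if pos > threshold or neg > threshold:
            return t
    return None


stream = [0.0] * 20 + [2.0] * 20  # hypothetical stream with a jump at index 20
alarm = cusum_detect(stream, target_mean=0.0)
```

With these settings the statistic gains 1.5 per step after the jump, so the alarm fires at index 23, three steps after the change.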
Time-series Forecasting Through the Lens of Dynamics: The authors utilize Allen's Interval Algebra to propose the PRO-DYN nomenclature, decomposing any time-series forecasting (TSF) model into three stages: "Pre-processing (PRO) → Dynamics (DYN) → Post-processing (PRO)." Two empirical laws are identified: (i) the DYN component must be learnable and complete to outperform LTSF-Linear, and (ii) the DYN component must be positioned at the end of the pipeline (PRE-DYN configuration) to benefit from long lookback windows. These laws are validated by enhancing Informer/FEDformer/MICN/FiLM with a linear DYN layer to consistently improve performance and by shifting the DYN component to the front-end for iTransformer/PatchTST/Crossformer, which leads to performance degradation.
TimeOmni-VL: Unified Models for Time Series Understanding and Generation: TimeOmni-VL achieves state-of-the-art performance in forecasting and imputation by converting time series into high-fidelity images (Bi-TSI) and introducing an understanding-guided generation mechanism (CoT as diffusion conditioning). The authors present it as the first unified multimodal framework that handles both time series understanding and generation tasks.
U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecasting: U-Cast combines a simple U-Net backbone, a two-stage training curriculum (MAE pre-training → CRPS fine-tuning), and MC-Dropout to achieve probabilistic weather forecasting skill comparable to complex specialized models such as GenCast, while reducing training compute and inference latency by 10×. This challenges the assumption that "frontier performance must be complex."
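The CRPS objective used in U-Cast's fine-tuning stage has a simple empirical form when the forecast is an ensemble, for example a set of MC-Dropout samples. The sketch below is the standard ensemble estimator, CRPS = E|X - y| - 0.5 E|X - X'|, for a single scalar target. The function name is hypothetical and this is not the paper's training code.

```python
def crps_ensemble(members, observation):
    """Empirical CRPS: E|X - y| - 0.5 * E|X - X'| over ensemble members."""
    m = len(members)
    # Average distance from the ensemble to the observed value.
    skill = sum(abs(x - observation) for x in members) / m
    # Average distance between pairs of ensemble members.
    spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return skill - 0.5 * spread
```

With a single member the spread term vanishes and the score reduces to absolute error, which is one way to see why MAE pre-training followed by CRPS fine-tuning forms a natural curriculum.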