Routing Channel-Patch Dependencies in Time Series Forecasting with Graph Spectral Decomposition¶
Conference: ICLR 2026 arXiv: 2603.13702 Code: GitHub Area: Time Series Forecasting Keywords: Channel dependency, graph spectral decomposition, frequency-aware, MoE routing, plug-and-play
TL;DR¶
This paper proposes xCPD, a plug-and-play module that refines the modeling unit of multivariate time series from "channels" to "channel-patches." It constructs spectral embeddings via a shared graph Fourier basis, groups nodes into low/mid/high frequency bands based on their spectral energy responses, and applies dynamic MoE routing to adaptively select frequency-specific filter experts. xCPD can be seamlessly integrated into any existing CI/CD model to consistently improve both long- and short-term forecasting performance, and supports zero-shot transfer.
Background & Motivation¶
Background: Multivariate time series forecasting (MTSF) is a core AI task with broad applications in traffic, finance, energy, and meteorology. Recent advances have proceeded along two main axes: model architecture (Linear/CNN/Transformer/MLP/GNN) and channel strategy (CI/CD/CP), with the latter emerging as a key performance bottleneck.
Limitations of the Three Channel Paradigms: (1) CI (Channel-Independent), e.g., DLinear/PatchTST, models each channel independently—robust but ignores inter-channel relations; (2) CD (Channel-Dependent), e.g., TSMixer/TimesNet, aggregates all channels and may introduce irrelevant information leading to over-smoothing; (3) CP (Channel-Partiality), e.g., DUET/CCM/TimeFilter, attempts to balance the two but suffers from two fundamental shortcomings, described next.
Coarse-Granularity Bottleneck in CP Methods: Existing CP methods operate at the channel level, treating entire channels as relational units and failing to model local patch-level interactions. For instance, channel A may exhibit a smooth seasonal trend in segment \(T_1\) and sharp anomalies in segment \(T_2\), yet channel-level models produce only a single averaged weight and cannot differentiate the distinct interaction patterns across segments.
Frequency Coupling Problem: CD/CP models compute attention weights in the time domain, where low-frequency trends, mid-frequency fluctuations, and high-frequency noise are entangled in the same embedding. A high attention score between two channels may simultaneously reflect meaningful low-frequency seasonal dependencies and irrelevant high-frequency noise correlations—the model cannot disentangle them, leading to spurious correlations.
Key Insight: The modeling unit is refined from "channels" to "channel-patches" (with patches as graph nodes). Dependency modeling is conducted in the graph spectral domain (rather than the time domain), followed by frequency-energy-based grouping and MoE routing of frequency-specific filter experts, achieving frequency-decoupled fine-grained channel-patch dependency modeling.
Practical Value: xCPD is designed as a post-processing plugin that requires no retraining of the backbone model. With linear computational complexity, it can be directly embedded into existing forecasting pipelines, making it suitable for large-scale real-time scenarios.
Method¶
Overall Architecture¶
xCPD consists of three core modules: (A) Spectral Channel-Patch Embedding → (B) Channel-Patch Grouping → (C) Channel-Patch Routing with MoE. The input is the backbone model's prediction output \(\hat{X}^{\text{model}} \in \mathbb{R}^{C \times T'}\); after processing through xCPD, the refined prediction \(\hat{X}^{\text{predict}}\) is produced. The overall pipeline is: patching → linear embedding → channel-patch graph construction → shared graph Fourier transform → spectral energy-based grouping → MoE routing for frequency expert selection → adaptive graph learning → gated residual correction for output.
Key Design 1: Spectral Channel-Patch Embedding¶
- Function: Divides the backbone output into patches, applies linear embedding, constructs a channel-patch graph, and projects node embeddings into the spectral domain via a shared graph Fourier basis.
- Mechanism: The prediction output \(\hat{X}^{\text{model}}\) is divided into \(N = \lceil T'/P \rceil\) non-overlapping patches and linearly mapped to \(d\)-dimensional embeddings \(X^{\text{emb}} \in \mathbb{R}^{n \times d}\) (where \(n = C \times N\)). An adjacency matrix is constructed using cosine similarity: \(A_{ij}^t = \cos(X_i^{\text{emb},t}, X_j^{\text{emb},t})\). The normalized graph Laplacian \(L = I - D^{-1/2}AD^{-1/2}\) is computed, and eigendecomposition yields the shared Fourier basis \(U\). The spectral embedding is \(X^{\text{spc}} = U^\top X^{\text{emb}}\).
- Design Motivation: (1) Cosine similarity-based graph construction is scale-invariant with respect to variable magnitudes, suitable for multivariate settings. (2) Per-batch eigendecomposition produces inconsistent Fourier bases across batches, making cross-batch comparison infeasible. A shared graph Fourier basis (Theorem 4.1) is therefore learned from the averaged Laplacian, ensuring all time steps are mapped to a consistent spectral domain. The theoretical guarantee \(\|U^t - UR^t\|_F \leq C\|L^t - L_{\text{avg}}\|_F\) establishes that the shared basis provides a linear approximation to each per-batch basis.
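The embedding step above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the embedding weights are random placeholders, and the eigendecomposition is done on the single input graph, whereas the paper learns a shared basis from the averaged Laplacian across batches.

```python
import numpy as np

def spectral_channel_patch_embedding(x_model, patch_len, d, rng=None):
    """Sketch of spectral channel-patch embedding (illustrative only).

    x_model: backbone prediction, shape (C, T'). For simplicity we assume
    patch_len divides T'; the paper pads to N = ceil(T'/P) patches.
    Returns eigenvalues, Fourier basis U, node embeddings, and the
    spectral embedding X_spc = U^T X_emb.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    C, T = x_model.shape
    N = T // patch_len
    # (C*N, P): each channel-patch becomes one graph node
    patches = x_model.reshape(C, N, patch_len).reshape(C * N, patch_len)
    W = rng.standard_normal((patch_len, d)) / np.sqrt(patch_len)  # placeholder linear embedding
    x_emb = patches @ W                                           # (n, d), n = C * N

    # Cosine-similarity adjacency (scale-invariant across variables)
    unit = x_emb / (np.linalg.norm(x_emb, axis=1, keepdims=True) + 1e-8)
    A = unit @ unit.T
    np.fill_diagonal(A, 0.0)
    A = np.clip(A, 0.0, None)          # keep non-negative edge weights

    # Normalized Laplacian L = I - D^{-1/2} A D^{-1/2}
    deg = A.sum(axis=1) + 1e-8
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(C * N) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

    # Graph Fourier basis: eigenvectors of L, ascending eigenvalues
    eigvals, U = np.linalg.eigh(L)
    x_spc = U.T @ x_emb                # spectral embedding
    return eigvals, U, x_emb, x_spc
```

Because `U` is orthogonal, the spectral embedding is lossless: `U @ x_spc` recovers `x_emb` exactly, and total Frobenius energy is preserved.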
Key Design 2: Spectral Channel-Patch Grouping¶
- Function: Assigns channel-patch nodes to three frequency groups—low, mid, and high—based on each node's spectral energy response, and constructs ego-graph subgraphs for frequency-aware message passing.
- Mechanism: Learnable boundaries \(\tau_1, \tau_2\) define three frequency bands; sigmoid-based soft partitioning computes per-frequency membership weights \(\alpha_j^{\text{low/mid/high}}\). The spectral energy response is defined as \(S_{i,j} = \|U_{i,j} \cdot X_{j,:}^{\text{spc}}\|_2^2\) (Theorem 4.2 guarantees energy conservation: \(\sum_j S_{i,j} = \|X_{i,:}^{\text{emb}}\|_2^2\)). Nodes are then assigned to the frequency band with the highest energy via softmax. For each node, an ego-graph is constructed using \(k\)-NN neighbor selection, and intra-group subgraphs are built based on frequency labels.
- Design Motivation: (1) Learnable frequency boundaries adapt to the frequency structure of different datasets. (2) Spectral energy responses directly quantify each node's sensitivity to different frequency components, enabling precise grouping. (3) Ego-graphs reduce noise by retaining only dependencies relevant to the center node. (4) Frequency-based subgraphs ensure message passing occurs between nodes in the same frequency band, preventing mixing of trend nodes and noise nodes.
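The energy-based grouping can be sketched as follows. Assumptions for illustration: the boundaries `tau1`, `tau2` and the sigmoid temperature are fixed here rather than learned, and nodes are hard-assigned via argmax rather than the paper's softmax-weighted soft assignment.

```python
import numpy as np

def frequency_grouping(eigvals, U, x_spc, tau1=0.6, tau2=1.2, temp=10.0):
    """Sketch of spectral-energy grouping (fixed boundaries; illustrative).

    S[i, j] = ||U[i, j] * x_spc[j, :]||^2 is node i's energy response at
    graph frequency eigvals[j]. Sigmoid masks split the eigenvalue axis
    (range [0, 2] for a normalized Laplacian) into low/mid/high bands;
    each node joins the band holding most of its energy.
    """
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    # (n, n) per-node, per-frequency energy responses
    S = (U ** 2) * (np.linalg.norm(x_spc, axis=1) ** 2)[None, :]

    # Soft band memberships over graph frequencies
    low = sig(temp * (tau1 - eigvals))
    high = sig(temp * (eigvals - tau2))
    mid = np.clip(1.0 - low - high, 0.0, None)
    bands = np.stack([low, mid, high], axis=0)     # (3, n)

    band_energy = S @ bands.T                      # (n, 3) energy per band
    labels = band_energy.argmax(axis=1)            # 0=low, 1=mid, 2=high
    return S, band_energy, labels
```

Row sums of `S` recover each node's total embedding energy, which is the energy-conservation property Theorem 4.2 formalizes.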
Key Design 3: Dynamic MoE Routing (Spectral Channel-Patch Routing with DyMoE)¶
- Function: Dynamically selects a variable number of frequency-specific filter experts (low/mid/high-frequency filters) for each ego-graph, constructs a sparse adjacency matrix, and performs graph learning.
- Mechanism: Three frequency filters construct adjacency matrices from low/mid/high spectral components, respectively. The routing network computes \(\psi(x_i) = \text{Linear}_c(x_i) + \epsilon \cdot \text{Softplus}(\text{Linear}_n(x_i))\) (comprising deterministic and stochastic noise components). Experts are selected by choosing the minimum number required such that their cumulative probability \(\geq \tau\) (Eq. 7), differing from fixed Top-K selection. The selected experts' edge sets are merged (Eq. 8) to construct a sparse adjacency matrix, followed by \(L\)-layer graph learning (Eqs. 9–10) for neighborhood aggregation. The final output is produced via gated dual-path residual correction: \(\hat{X}^{\text{predict}} = \hat{X}^{\text{model}} + \sigma(g_{\text{GNN}}) \odot \delta_{\text{GNN}} + \sigma(g_{\text{Lin}}) \odot \delta_{\text{Lin}}\).
- Design Motivation: (1) Three frequency-specific experts separately capture smooth trends (low), local fluctuations (mid), and abrupt changes/anomalies (high), enabling frequency-decoupled modeling. (2) DyMoE dynamically allocates experts—different inputs receive different combinations—offering greater flexibility than fixed Top-K. (3) The gated residual design degrades gracefully to the original backbone prediction when gate values approach zero, ensuring safe integration. (4) Entropy loss \(\mathcal{L}_{\text{Entropy}}\) and balance loss \(\mathcal{L}_{\text{Balance}}\) are added to the training objective to prevent expert collapse.
Key Design 4: Gated Dual-Path Residual Correction and Optimization¶
- Function: Combines a GNN path (cross-channel spectral dependencies) and a Linear path (CI-style refinement), with learnable gates determining each path's contribution.
- Mechanism: \(\delta_{\text{GNN}} = W_{\text{proj}} H^{(L)}\) captures cross-variable spectral dependencies; \(\delta_{\text{Lin}} = f_{\text{lin}}(\hat{X}^{\text{model}})\) preserves channel-independent refinement. Gate parameters \(g_{\text{GNN}}, g_{\text{Lin}} \in \mathbb{R}^C\) operate per channel. Total loss: \(\mathcal{L} = \mathcal{L}_{\text{MSE}} + \mu\mathcal{L}_{\text{Entropy}} + \beta\mathcal{L}_{\text{Balance}}\).
- Design Motivation: The dual-path design simultaneously leverages the strengths of CD (GNN path) and CI (Linear path), enabling adaptive balancing. Per-channel gating allows different variables to select different degrees of dependency modeling.
Key Experimental Results¶
Table 1: Long-Term Forecasting Main Results (9 datasets, 4 backbones, MSE↓)¶
| Setting | TSMixer → +xCPD | DLinear → +xCPD | PatchTST → +xCPD | TimesNet → +xCPD |
|---|---|---|---|---|
| ETTh1 avg | 0.412 → 0.401 | 0.456 → 0.445 | 0.469 → 0.455 | 0.458 → 0.447 |
| Weather avg | 0.234 → 0.221 | 0.265 → 0.253 | 0.259 → 0.248 | 0.259 → 0.249 |
| Electricity avg | 0.167 → 0.158 | 0.212 → 0.197 | 0.205 → 0.194 | 0.192 → 0.175 |
| Traffic avg | 0.408 → 0.394 | 0.625 → 0.606 | 0.482 → 0.467 | 0.620 → 0.558 |
xCPD achieves consistent improvements across nearly all 144 experimental configurations, with the most significant gains on high-dimensional datasets (Electricity: 321 variables; Traffic: 862 variables).
Table 2: Comparison with LIFT and CCM Baselines (TSMixer/DLinear backbone)¶
| Dataset | TSMixer+LIFT | TSMixer+CCM | TSMixer+xCPD | DLinear+LIFT | DLinear+CCM | DLinear+xCPD |
|---|---|---|---|---|---|---|
| ETTh2 | 0.351 | 0.351 | 0.345 | 0.553 | 0.552 | 0.507 |
| Weather | 0.231 | 0.225 | 0.221 | 0.262 | 0.262 | 0.253 |
| Traffic | 0.405 | 0.396 | 0.394 | 0.620 | 0.614 | 0.606 |
xCPD outperforms both LIFT and CCM on all 9 datasets.
Table 3: Comparison with 5 CP Baselines under General Settings¶
| Dataset + Architecture | +PRReg | +LIFT | +PCD | +CCM | +xCPD |
|---|---|---|---|---|---|
| ETTm1 Transformer | 0.349 | 0.356 | 0.404 | 0.300 | 0.289 |
| Exchange Linear | 0.048 | 0.050 | — | 0.045 | 0.042 |
| Weather Transformer | 0.180 | 0.178 | 0.198 | 0.164 | 0.161 |
xCPD achieves the best performance across all 10 configurations.
Key Findings¶
- High-dimensional datasets benefit most: The performance gain from xCPD increases with the number of channels (Electricity: 321; Traffic: 862), as spectral frequency decoupling more effectively suppresses irrelevant channel noise in high-dimensional settings.
- CI models benefit more from xCPD: In zero-shot experiments, CI models (DLinear: 12.0%, PatchTST: 15.2%) exhibit significantly larger improvements than CD models (TSMixer: 6.7%, TimesNet: 11.1%), demonstrating that xCPD injects the cross-channel interaction capability that CI models lack.
- Larger gains at longer forecast horizons: In zero-shot settings, performance improvements increase with forecast horizon length, suggesting that frequency knowledge transfer is more effective for long-range dependencies.
- Linear computational complexity: Time complexity \(\mathcal{O}(nkd + Lnkd)\) and space complexity \(\mathcal{O}(nd + nk)\) introduce only 9%–11% training-time overhead, far below that of CCM, whose complexity is quadratic.
- Ablation study: Removing any individual component—shared Fourier basis, frequency partitioning, node grouping, or filters—degrades performance. Replacing DyMoE with Top-K, Random-K, RegionTop-K, or TimeFilter also underperforms the full xCPD, validating the necessity of each component.
Highlights & Insights¶
- Dependency modeling in the graph spectral domain: xCPD is the first method to model channel interactions entirely in the graph spectral domain (rather than the time domain). In the spectral domain, low/mid/high-frequency components are naturally decoupled, eliminating the spurious correlations caused by frequency coupling in time-domain attention—this is the core innovation distinguishing xCPD from all prior CP methods including LIFT, CCM, PCD, and TimeFilter.
- Granularity improvement from channels to channel-patches: Different temporal segments of the same channel may interact differently with other channels; xCPD is the first to capture this segment-level heterogeneity through patch-level modeling.
- Dynamic expert allocation via DyMoE: Unlike fixed Top-K, DyMoE adaptively selects 1–3 experts based on a cumulative probability threshold, routing smooth segments to low-frequency experts and abrupt segments to high-frequency experts—enabling input-aware fine-grained modeling.
- Visualization validates theory: Figure 3 demonstrates the correspondence between spectral energy and time-domain patterns—nodes with high low-frequency energy indeed correspond to smooth trends, while nodes with high high-frequency energy correspond to rapid fluctuations—confirming the energy conservation guarantee of Theorem 4.2.
- Plug-and-play and zero-shot transfer: As a post-processing plugin, xCPD requires no backbone retraining, and the learned frequency filtering knowledge transfers across datasets (zero-shot improvements across 48 configurations).
Limitations & Future Work¶
- Zero-shot transfer evaluated only on ETT series: Cross-domain transfer (e.g., Weather → Traffic) is not validated; effectiveness between domains with substantially different frequency structures remains uncertain.
- Quadratic cost in graph construction: Although overall complexity is linear, computing the cosine-similarity adjacency matrix remains \(\mathcal{O}(n^2 d)\), which may become a bottleneck when both the number of channels and the number of patches are large.
- Prior assumption of three frequency groups: The fixed partition into low/mid/high frequency bands may be suboptimal for certain datasets that require finer or coarser divisions; a mechanism for adaptively determining the number of groups is absent.
- Validation limited to forecasting tasks: Generalizability to other time series tasks (classification, anomaly detection, imputation) is not assessed.
- Approximation error of the shared graph Fourier basis: The approximation bound in Theorem 4.1 depends on the spectral gap of \(L_{\text{avg}}\); when data distribution shifts dramatically, the gap may be small, degrading approximation quality.
Related Work & Insights¶
| Dimension | xCPD (Ours) | CCM (Chen et al., NeurIPS 2024) | TimeFilter (Hu et al., 2025) |
|---|---|---|---|
| Modeling granularity | Channel-patch level | Channel level (channel clustering) | Patch level but in time domain |
| Modeling domain | Graph spectral domain | Time domain | Time domain |
| Adaptivity | Frequency-specific MoE routing | Similarity-based clustering | Spatiotemporal attention filtering |
| Frequency decoupling | ✓ Spectral energy grouping | ✗ Frequency coupling | ✗ Frequency coupling |
| Plugin compatibility | ✓ Post-processing, no retraining | ✓ But quadratic complexity | Requires integration into specific architecture |
| Zero-shot transfer | ✓ Frequency knowledge transferable | Not validated | Not validated |
Rating¶
- Novelty: ⭐⭐⭐⭐ Spectral-domain channel-patch dependency modeling combined with DyMoE represents a new perspective, with simultaneous innovation along three axes (granularity / domain / adaptivity)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 9 datasets × 4 backbones × 144 configurations, plus short-term / zero-shot / efficiency / ablation / visualization analyses—extremely comprehensive
- Writing Quality: ⭐⭐⭐⭐ Clear method description, rigorous theoretical derivations (two theorems), rich figures and tables
- Value: ⭐⭐⭐⭐ As a general plug-and-play plugin, xCPD offers direct practical value to the time series forecasting community; linear complexity makes it deployment-friendly