STAMP: Spatial-Temporal Adapter with Multi-Head Pooling¶
- Conference: NeurIPS 2025
- arXiv: 2511.10848
- Code: https://github.com/autonlab/STAMP
- Area: EEG Signals / Foundation Model Adaptation
- Keywords: EEG Classification, TSFM Adapter, Spatial-Temporal Encoding, Multi-Head Pooling, Parameter Efficiency
TL;DR¶
STAMP introduces a lightweight spatial-temporal adapter with only 750K parameters for Time Series Foundation Models (TSFMs). Through three sets of positional encodings (token/spatial/temporal), cross-gated MLP mixing, and multi-head attention pooling, it enables a frozen TSFM (e.g., MOMENT, 385M parameters) to match or surpass EEG-specific models such as the 29M-parameter CBraMod across 8 EEG datasets, reaching 1.93× CBraMod's Kappa on BCIC-IV-2a.
Background & Motivation¶
Background: TSFMs (e.g., MOMENT, Chronos), pretrained across multiple domains, have demonstrated strong general-purpose representation capabilities. EEG-specific foundation models (CBraMod, LaBraM) perform well on EEG classification but require large parameter counts (29M/5.8M) and EEG-specific pretraining.
Limitations of Prior Work: TSFMs process univariate time series, whereas EEG is spatiotemporal data with 64 channels × 1000+ time steps, so direct application cannot capture the spatial dimension. Feeding each channel independently into a TSFM discards inter-channel spatial relationships.
Key Challenge: TSFMs possess strong temporal representations but lack spatial understanding; EEG models understand spatial structure but require large-scale EEG pretraining. The core challenge is enabling TSFMs to understand EEG's spatial dimension at minimal cost.
Goal: Design a lightweight adapter that enables general-purpose TSFMs to efficiently process spatiotemporal EEG data.
Key Insight: Freeze the TSFM and train only a 750K-parameter adapter — three sets of positional encodings inject spatial information, cross-gated MLP mixes spatial and temporal features, and multi-head attention pooling aggregates representations.
Core Idea: Triple positional encoding (token + spatial + temporal) + cross-gated MLP spatiotemporal mixing + multi-head pooling = 750K parameters enabling a frozen TSFM to process EEG spatiotemporal data.
Method¶
Overall Architecture¶
EEG data (\(S\) channels × \(T\) time steps) → frozen TSFM (e.g., MOMENT-L 385M) encodes to \(S \times T' \times D\) (downsampled 1024→128) → Positional Encoding (\(\tilde{e}_{ij} = e'_{ij} + p_{ij} + s_i + t_j\)) → CC-GMLP (decoupled spatial and temporal gated mixing) → MHAP (multi-head attention pooling to fixed-length vector) → Classification Head
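The pipeline above can be traced as a shape-level sketch. All dimensions here are illustrative, and the CC-GMLP and MHAP stages are replaced by simple placeholders (a linear layer and mean pooling) just to show the data flow, not the paper's actual modules:

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumed, not the paper's exact configuration):
# B batch, S channels, T' temporal patches after downsampling, D hidden dim.
B, S, Tp, D, n_classes = 2, 22, 128, 64, 4

e = torch.randn(B, S, Tp, D)                     # frozen-TSFM output e'_{ij}

# Triple positional encoding: token-wise p_{ij}, spatial s_i, temporal t_j,
# broadcast-added as in tilde{e}_{ij} = e'_{ij} + p_{ij} + s_i + t_j.
p = nn.Parameter(torch.zeros(S, Tp, D))
s = nn.Parameter(torch.zeros(S, 1, D))
t = nn.Parameter(torch.zeros(1, Tp, D))
tokens = e + p + s + t

mix = nn.Linear(D, D)                            # placeholder for CC-GMLP mixing
pooled = mix(tokens).flatten(1, 2).mean(dim=1)   # placeholder for MHAP -> (B, D)
logits = nn.Linear(D, n_classes)(pooled)         # classification head
print(logits.shape)                              # torch.Size([2, 4])
```

The key structural point is that the frozen TSFM sees each channel as an independent univariate series; everything that reintroduces spatial structure happens after it, in the adapter.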
Key Designs¶
- Three-Set Positional Encoding (PE-NST):
- Function: Injects spatial-temporal positional information into TSFM output tokens.
- Mechanism: Token-wise PE \(p_{ij} \in \mathbb{R}^D\) provides an independent embedding for each (channel, time) position; Spatial PE \(s_i\) encodes channel identity (e.g., C3/C4/Oz); Temporal PE \(t_j\) encodes temporal position. All three are summed.
- Design Motivation: Ablation studies demonstrate that all three PE components are necessary — token PE alone is insufficient (lacking general spatial/temporal structure), and spatial + temporal PE alone is insufficient (lacking position-specific embeddings).
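A quick back-of-the-envelope on the three PE sets (channel count, patch count, and embedding size are assumed values, not the paper's): the token-wise PE dominates the parameter budget, while the spatial and temporal PEs add shared, generalizable structure almost for free.

```python
# Assumed sizes for illustration: S=22 channels, T'=128 temporal patches, D=64.
S, Tp, D = 22, 128, 64

token_pe    = S * Tp * D   # p_{ij}: one vector per (channel, time) position
spatial_pe  = S * D        # s_i: one vector per channel
temporal_pe = Tp * D       # t_j: one vector per time step

print(token_pe, spatial_pe, temporal_pe)  # 180224 1408 8192
```

This asymmetry is consistent with the ablation result: the cheap spatial/temporal sets carry structure that the expensive token-wise set alone does not, and vice versa.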
- Cross-Channel GMLP (CC-GMLP):
- Function: Performs feature mixing along spatial and temporal dimensions respectively.
- Mechanism: Spatial gating \(g_S(Z) = Z_1 \odot (W \cdot Z_2)\) (mixing along the spatial dimension); temporal gating operates analogously along the temporal dimension. Both operate independently to preserve spatiotemporal disentanglement.
- Design Motivation: Transformers are parameter-heavy for spatiotemporal sequences; GMLP is more efficient, and the CC (cross-channel) variant further reduces parameters (0.74M vs. GMLP 0.79M) while achieving better performance.
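The spatial gating can be sketched as follows. The split-and-gate layout and all dimensions are assumptions based on the \(g_S(Z) = Z_1 \odot (W \cdot Z_2)\) formula above, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class SpatialGate(nn.Module):
    """Sketch of g_S(Z) = Z_1 ⊙ (W · Z_2): split the feature dim in half,
    mix one half along the channel axis with a learned W, and use it to
    gate the other half elementwise (assumed layout)."""
    def __init__(self, n_channels: int):
        super().__init__()
        self.mix = nn.Linear(n_channels, n_channels)  # W acts across channels

    def forward(self, z):                # z: (B, S, T', D)
        z1, z2 = z.chunk(2, dim=-1)      # split features: (B, S, T', D/2) each
        g = self.mix(z2.transpose(1, 3)).transpose(1, 3)  # mix along S
        return z1 * g                    # elementwise gate

B, S, Tp, D = 2, 22, 16, 8
out = SpatialGate(S)(torch.randn(B, S, Tp, D))
print(out.shape)                         # torch.Size([2, 22, 16, 4])
```

The temporal gate would be the mirror image, with the linear mixing applied along the T' axis instead of S; keeping the two separate is what preserves the spatial/temporal decoupling and the parameter savings.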
- Multi-Head Attention Pooling (MHAP):
- Function: Aggregates variable-length spatiotemporal tokens into a fixed-length classification vector.
- Mechanism: Multiple learnable query vectors aggregate token information via attention weights. Final classification: \(\hat{y} = \text{softmax}(W(\lambda z + (1-\lambda)\hat{e}))\)
- Design Motivation: More flexible than mean pooling — capable of learning to attend to different temporal segments and spatial regions.
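A minimal sketch of attention pooling with learnable queries: H query vectors attend over the flattened (channel × time) tokens and their outputs are concatenated into one fixed-length vector. The head count, query count, and single-head attention are illustrative choices, not the paper's exact MHAP:

```python
import torch
import torch.nn as nn

class MHAP(nn.Module):
    """Learnable-query attention pooling (assumed formulation)."""
    def __init__(self, d: int, n_queries: int):
        super().__init__()
        self.q = nn.Parameter(torch.randn(n_queries, d))
        self.attn = nn.MultiheadAttention(d, num_heads=1, batch_first=True)

    def forward(self, tokens):                  # tokens: (B, N, D), N = S * T'
        q = self.q.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)   # (B, n_queries, D)
        return out.flatten(1)                   # fixed length: (B, n_queries * D)

B, N, D = 2, 22 * 16, 8
z = MHAP(D, n_queries=4)(torch.randn(B, N, D))
print(z.shape)                                  # torch.Size([2, 32])
```

Unlike mean pooling, each query can learn its own attention pattern, e.g. one attending to early temporal segments and another to motor-cortex channels.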
Loss & Training¶
- Standard cross-entropy classification loss
- MOMENT-Large (385M) is frozen; adapter contains 750K trainable parameters
- Compatible with multiple TSFMs (MOMENT S/B/L, Chronos, TSPulse)
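The freeze-and-adapt setup can be sketched as below; the backbone and adapter here are tiny stand-ins (a single transformer layer and two linear layers), not MOMENT-L or STAMP's actual modules:

```python
import torch.nn as nn

# Tiny stand-ins: the backbone plays the role of the frozen TSFM (MOMENT-L in
# the paper), the adapter the role of STAMP's 750K trainable parameters.
backbone = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
adapter = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 4))

for p in backbone.parameters():
    p.requires_grad = False          # freeze the TSFM: no gradient updates

trainable = sum(p.numel() for p in adapter.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in backbone.parameters() if not p.requires_grad)
print(trainable)                     # 4420 = (64*64 + 64) + (64*4 + 4)
```

Only the adapter's parameters are passed to the optimizer, so memory and compute for backpropagation scale with the adapter (750K) rather than the backbone (385M).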
Key Experimental Results¶
Main Results (8 EEG Datasets)¶
| Dataset | STAMP (750K) | CBraMod (29M) | LaBraM (5.8M) | Result |
|---|---|---|---|---|
| SHU-MI | 0.660 AUC | 0.657 | 0.660 | Tie |
| MentalArith | 0.811 | 0.749 | 0.772 | STAMP wins |
| BCIC-IV-2a | 0.409 Kappa | 0.212 | 0.316 | STAMP wins (1.93×) |
| TUEV | 0.662 | 0.618 | 0.664 | Tie |
| SEED-V | 0.208 | 0.259 | 0.239 | CBraMod wins |
| FACED | 0.278 | 0.508 | 0.470 | CBraMod wins |
STAMP is competitive or superior on 6 of the 8 datasets (the table shows a representative subset), with relative weakness on emotion recognition tasks.
Ablation Study¶
| Variant | Description |
|---|---|
| PE-NST (all three sets) | Optimal |
| PE-ST (no token PE) | Performance degrades |
| CC-GMLP vs. Transformer | CC-GMLP outperforms on all 4 datasets with fewer parameters |
| MHAP vs. Mean Pooling | MHAP substantially better on BCIC-IV-2a; comparable elsewhere |
| Different TSFM backbones | MOMENT L > B > S; Chronos slightly better on emotion tasks; TSPulse stronger on event tasks |
Key Findings¶
- A 750K-parameter adapter enables a general-purpose TSFM to match a 29M-parameter EEG-specific model on most tasks — 39× improvement in parameter efficiency.
- Emotion recognition (SEED-V, FACED) is a weakness for TSFMs, likely due to the absence of emotion-relevant features in TSFM pretraining.
- CC-GMLP is more efficient than Transformer for spatiotemporal mixing — suggesting EEG spatiotemporal relationships are relatively simple and do not require full attention.
- The choice of TSFM backbone has limited impact — architectural design matters more than pretraining data.
Highlights & Insights¶
- Exceptional Parameter Efficiency: The combination of a 750K adapter and a frozen 385M TSFM outperforms a 29M EEG-specific model, validating that general temporal representations + lightweight spatiotemporal adaptation constitute a more efficient paradigm.
- Elegant CC-GMLP Design: Decoupled spatial and temporal gated mixing avoids the curse of dimensionality while preserving spatiotemporal interaction.
- Discovery of TSFM Capability Boundaries: Failure on emotion recognition reveals that TSFM pretraining signals lack affective semantics — EEG-specific pretraining may be necessary for such tasks.
Limitations & Future Work¶
- Poor performance on emotion recognition tasks — TSFM pretraining data does not contain emotion-related neural signal patterns.
- Evaluation is limited to classification; temporal forecasting and generation tasks — the primary use case of TSFMs — are not assessed, as EEG prediction tasks remain scarce.
- The complete design, with its three sets of positional encodings, introduces additional hyperparameters requiring tuning for different EEG devices.
- Despite being lightweight, the 750K adapter still requires separate training per dataset — zero-shot cross-dataset generalization is not validated.
- CC-GMLP assumes separability of spatial and temporal dimensions, which may be insufficient for tasks requiring joint spatiotemporal modeling (e.g., rapid-response brain-computer interfaces).
- Dependence on patch-level feature extraction from the TSFM may result in information loss for fine-grained events in raw EEG waveforms.
Related Work & Insights¶
- vs. CBraMod: A 29M-parameter EEG-specific model; STAMP achieves comparable performance with a 750K adapter and a general-purpose TSFM.
- vs. LaBraM: A 5.8M-parameter model; STAMP matches or exceeds it on most tasks.
- Insight: The paradigm of freezing large pretrained models and training lightweight adapters, already successful in NLP and CV, is validated here for the EEG domain.
Rating¶
- Novelty: ⭐⭐⭐⭐ First work to adapt TSFMs for EEG spatiotemporal data
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 8 datasets + multiple TSFMs + comprehensive ablations
- Writing Quality: ⭐⭐⭐⭐ Method description is clear and well-organized
- Value: ⭐⭐⭐⭐ Validates the feasibility of general-purpose TSFM + lightweight adaptation for EEG