MEGState: Phoneme Decoding from Magnetoencephalography Signals

Conference: NeurIPS 2025 | arXiv: 2512.17978 | Code: None | Area: Brain-Computer Interface / Speech Decoding | Keywords: MEG, phoneme decoding, state space model, multi-resolution convolution, brain-computer interface

TL;DR

This paper proposes MEGState, an architecture combining multi-resolution convolution and sensor-wise state space models (SSMs) for decoding phonemes from magnetoencephalography (MEG) signals, achieving substantial improvements over baseline methods on the LibriBrain dataset.

Background & Motivation

Limitations of Prior Work

Background: Decoding speech representations from brain activity is critical for restoring communication in individuals with paralysis or severe speech disorders. Invasive brain-computer interfaces (e.g., ECoG) have demonstrated continuous speech reconstruction with word error rates below 5%, but their reliance on neurosurgical implantation limits scalability and clinical viability.

Magnetoencephalography (MEG), as a non-invasive alternative, offers a safe and repeatable means of detecting speech-related neural activity, yet faces three major challenges:

Low signal-to-noise ratio: MEG fields are extremely weak relative to environmental and physiological noise.

High dimensionality: MEG's high temporal resolution yields long, high-dimensional time series that are costly to model.

Sparse neural representations: Speech information is sparsely encoded in MEG signals.

Goal: The paper is motivated by the need to design an architecture capable of simultaneously capturing fine-grained local temporal dynamics and long-range temporal dependencies to overcome the above challenges.

Method

Overall Architecture

MEGState takes MEG signals \(\mathbf{X} \in \mathbb{R}^{M \times T}\) (\(M\) sensors, \(T\) time steps) as input, sequentially passes them through a multi-resolution convolution module to extract local temporal features, then through a sensor-wise SSM to model global temporal dependencies, and finally applies average pooling followed by a fully connected layer to output phoneme classification probabilities.
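
To pin down the tensor shapes, here is a dummy-tensor walk-through of this pipeline (the window length \(T\), feature width \(F\), and batch size are placeholders; pooling over both sensors and time is my reading of "average pooling"):

```python
import torch
import torch.nn as nn

# Dummy-tensor walk-through of the pipeline's shapes (B = batch, M = 306
# sensors, T = time steps, F = concatenated conv features, P = 39 phonemes).
B, M, T, F, P = 8, 306, 250, 64, 39
X = torch.randn(B, M, T)            # raw MEG input
H = torch.randn(B, F, M, T)         # after the multi-resolution conv module
Z = torch.randn(B, F, M, T)         # after the L sensor-wise SSM blocks
z = Z.mean(dim=(2, 3))              # average pooling -> (B, F)
p = nn.Linear(F, P)(z).softmax(-1)  # fully connected head -> (B, P)
```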

Key Designs

  1. Multi-Resolution Convolution Module: Four parallel one-dimensional convolutional layers with kernel sizes of \(f_{sample}/2\), \(f_{sample}/4\), \(f_{sample}/8\), and \(f_{sample}/16\) (where \(f_{sample}\) denotes the sampling rate). Different kernel sizes capture cortical response features at different temporal scales. The four outputs are concatenated to yield \(\mathbf{H} \in \mathbb{R}^{F \times M \times T}\). Design Motivation: Cortical responses to different phonemes exhibit distinct characteristics at different temporal scales; the multi-resolution design enables comprehensive capture of these differences. (A PyTorch sketch follows this list.)

  2. Sensor-wise SSM: A state space model based on the S5 variant, modeling global temporal dependencies on a per-sensor basis. The module comprises \(L\) hierarchically stacked blocks, each consisting of an SSM layer + LayerNorm + residual connection + FFN. S5 employs the HiPPO-N matrix as the state matrix \(\mathbf{A}\), achieving efficient recurrence via diagonalization and zero-order hold discretization. Design Motivation: SSMs are well-suited for modeling long-range dependencies in continuous signals; sensor-wise processing preserves spatial information. (A simplified sketch follows this list.)

  3. Training Data Sampling Strategy: A novel combination of smoothing and data augmentation is introduced to improve SNR and alleviate phoneme label imbalance. At each training step, two phoneme labels \(y_1, y_2\) are uniformly sampled; for each label, \(N'\) samples are drawn and averaged into a class-conditional prototype (for denoising); the two prototypes and their labels are then mixed with a mixup coefficient \(\alpha\), i.e., \(\tilde{\mathbf{x}} = \alpha \mathbf{x}_1 + (1-\alpha)\mathbf{x}_2\), and likewise for the labels. The paper uses \(N' = 100\) and \(\alpha = 0.5\).
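
The three designs above can be made concrete with short sketches. First, a minimal PyTorch take on the multi-resolution convolution, assuming each sensor is processed as an independent single-channel series and each branch contributes an equal share of the \(F\) output features (the branch width and padding are my choices, not the paper's):

```python
import torch
import torch.nn as nn

class MultiResConv(nn.Module):
    """Four parallel 1-D convolutions with sampling-rate-tied kernel sizes,
    applied to every sensor independently (illustrative sketch)."""
    def __init__(self, f_sample: int = 250, feats_per_branch: int = 16):
        super().__init__()
        kernel_sizes = [f_sample // d for d in (2, 4, 8, 16)]  # 125, 62, 31, 15
        self.branches = nn.ModuleList(
            nn.Conv1d(1, feats_per_branch, k, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, M, T); treat each sensor as its own single-channel series
        B, M, T = x.shape
        x = x.reshape(B * M, 1, T)
        h = torch.cat([branch(x)[..., :T] for branch in self.branches], dim=1)
        return h.reshape(B, M, -1, T).permute(0, 2, 1, 3)  # (B, F, M, T)
```

Next, a simplified diagonal-SSM block following the S5 recipe above (zero-order-hold discretization; SSM + LayerNorm + residual + FFN). It substitutes random diagonal initialization and a plain sequential scan for the HiPPO-N matrix and S5's parallel scan, so it illustrates the mechanism rather than reproducing the authors' implementation:

```python
import torch
import torch.nn as nn

class DiagSSM(nn.Module):
    """Diagonal state space layer with zero-order-hold (ZOH) discretization."""
    def __init__(self, d_model: int, d_state: int = 64):
        super().__init__()
        self.log_neg_a = nn.Parameter(torch.zeros(d_state))   # A = -exp(.) < 0
        self.B = nn.Parameter(torch.randn(d_state, d_model) * 0.1)
        self.C = nn.Parameter(torch.randn(d_model, d_state) * 0.1)
        self.log_dt = nn.Parameter(torch.tensor(-4.6))        # dt ~ 0.01

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, T, d_model) -> y: (batch, T, d_model)
        a = -torch.exp(self.log_neg_a)                        # stable poles
        dt = torch.exp(self.log_dt)
        a_bar = torch.exp(dt * a)                             # ZOH: exp(dt*A)
        b_bar = ((a_bar - 1.0) / a).unsqueeze(-1) * self.B    # ZOH input matrix
        s = u.new_zeros(u.shape[0], a.shape[0])
        ys = []
        for t in range(u.shape[1]):     # sequential scan; S5 parallelizes this
            s = a_bar * s + u[:, t] @ b_bar.T
            ys.append(s @ self.C.T)
        return torch.stack(ys, dim=1)

class SSMBlock(nn.Module):
    """One of the L stacked blocks: SSM + LayerNorm + residual + FFN."""
    def __init__(self, d_model: int):
        super().__init__()
        self.ssm = DiagSSM(d_model)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.ssm(self.norm1(x))    # pre-norm residual around the SSM
        return x + self.ffn(self.norm2(x))
```

Finally, the prototype-plus-mixup sampling step in NumPy, with labels assumed to be integers \(0 \ldots K-1\) (the paper fixes \(N' = 100\) and \(\alpha = 0.5\)):

```python
import numpy as np

def sample_mixed_prototype(X, y, n_classes, n_proto=100, alpha=0.5, rng=None):
    """Draw one training example: average N' random samples per class into a
    denoised prototype, then mix two prototypes and their soft labels."""
    rng = rng or np.random.default_rng()
    c1, c2 = rng.choice(n_classes, size=2, replace=False)  # uniform over labels
    protos = []
    for c in (c1, c2):
        idx = rng.choice(np.flatnonzero(y == c), size=n_proto, replace=True)
        protos.append(X[idx].mean(axis=0))      # class-conditional prototype
    x_mix = alpha * protos[0] + (1 - alpha) * protos[1]
    y_mix = np.zeros(n_classes)
    y_mix[c1], y_mix[c2] = alpha, 1 - alpha     # soft mixup label
    return x_mix, y_mix
```

Sampling the two labels uniformly (rather than sampling raw trials) is what counters phoneme frequency imbalance, and averaging \(N'\) trials attenuates independent noise by roughly a factor of \(\sqrt{N'}\).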

Loss & Training

  • Cross-entropy loss is used for training.
  • AdamW optimizer with \(\beta_1=0.9\), \(\beta_2=0.999\), learning rate \(10^{-4}\).
  • Batch size 32, trained for 50 epochs.
  • Number of SSM blocks \(L=2\).
  • Class-conditional prototype averaging combined with mixup augmentation simultaneously addresses low SNR and class imbalance.
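
Putting these settings together, a minimal runnable loop (the model and the window length \(T\) are toy placeholders; only the optimizer settings, loss, batch size, and epoch count come from the paper):

```python
import torch
import torch.nn as nn

M, T, N_PHONEMES = 306, 250, 39            # T is a placeholder window length
model = nn.Sequential(nn.Flatten(), nn.Linear(M * T, N_PHONEMES))  # toy stand-in
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
criterion = nn.CrossEntropyLoss()          # also accepts soft (mixup) targets

x = torch.randn(32, M, T)                  # one synthetic batch (batch size 32)
y = torch.rand(32, N_PHONEMES).softmax(dim=-1)   # soft labels, as after mixup
for epoch in range(50):                    # 50 epochs, per the paper
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```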

Key Experimental Results

Main Results

Dataset: LibriBrain, comprising MEG recordings from a single subject listening to audiobooks, with 306 sensors at 1 kHz sampling rate, downsampled to 250 Hz after preprocessing, totaling 52.32 hours of data across 39 ARPAbet phoneme labels.

| Model | Accuracy↑ | Cohen's Kappa↑ | Macro-F1↑ | Leaderboard F1↑ |
| --- | --- | --- | --- | --- |
| Baseline | 38.80±2.40 | 45.71±0.74 | 34.82±1.92 | n/a |
| w/o Multi-Resol Conv | 40.25±3.15 | 37.90±2.83 | 34.77±1.83 | n/a |
| w/o Sensor-wise SSM | 40.37±3.20 | 49.60±6.25 | 37.18±4.44 | n/a |
| MEGState (Full) | 45.53±1.88 | 54.19±2.42 | 41.11±2.20 | 55.74 (68.41 w/ ensemble) |

Ablation Study

| Configuration | Accuracy↑ | Kappa↑ | Macro-F1↑ | Notes |
| --- | --- | --- | --- | --- |
| Full MEGState | 45.53 | 54.19 | 41.11 | Best with both modules |
| w/o Multi-Resol Conv | 40.25 (-5.28) | 37.90 (-16.29) | 34.77 (-6.34) | Largest Kappa drop |
| w/o Sensor-wise SSM | 40.37 (-5.16) | 49.60 (-4.59) | 37.18 (-3.93) | More moderate impact |

Key Findings

  • MEGState significantly outperforms the baseline on all metrics (p<0.05): Accuracy +6.73, Kappa +8.48, Macro-F1 +6.29.
  • Both modules are indispensable: removing the multi-resolution convolution has a larger impact (Kappa drops sharply from 54.19 to 37.90).
  • Phoneme-level analysis shows MEGState outperforms the baseline on 19 of 39 phonemes, with 10 reaching statistical significance.
  • A 5-model ensemble on the leaderboard further improves Macro-F1 to 68.41% (see the sketch after this list).
  • Phonemes with more training samples generally yield better decoding performance, though some low-frequency phonemes are also decoded effectively.
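
The ensembling scheme itself is not described; one common choice consistent with a 5-model ensemble is soft voting over phoneme probabilities, sketched here purely as an assumption:

```python
import torch
import torch.nn as nn

def ensemble_predict(models: list[nn.Module], x: torch.Tensor) -> torch.Tensor:
    """Soft-voting ensemble (assumed scheme, not confirmed by the paper):
    average the models' softmax outputs, then take the argmax phoneme."""
    with torch.no_grad():
        probs = torch.stack([m(x).softmax(dim=-1) for m in models]).mean(dim=0)
    return probs.argmax(dim=-1)
```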

Highlights & Insights

  • Effective training data sampling strategy: Class-conditional prototype averaging for denoising combined with mixup for class imbalance addresses two problems simultaneously, serving as a practical technique for low-SNR neural signal decoding.
  • The design rationale for multi-resolution convolution is well-motivated: different phonemes exhibit distinct cortical response patterns at different temporal scales, consistent with neuroscientific knowledge of speech processing.
  • SSMs scale linearly with sequence length, versus the quadratic cost of self-attention, making them better suited than Transformers to high-sampling-rate continuous signals such as 250 Hz MEG.
  • The substantial leaderboard gain from 5-model ensembling (55.74 → 68.41) indicates considerable complementarity among individual models.

Limitations & Future Work

  • Validation is limited to a single subject; cross-subject generalization remains unexplored.
  • The paper is relatively brief (workshop paper format), with concise method descriptions and limited in-depth analysis.
  • Evaluation is restricted to phoneme classification; end-to-end speech reconstruction and word-level decoding are not attempted.
  • The multi-resolution convolution kernel sizes are hard-coupled to the sampling rate, requiring adjustment for different sampling rates.
  • No direct comparison is made with other MEG decoding methods (e.g., MAD-MEG, NeuSpeech).
  • SSM variants such as Mamba may yield further improvements.
  • While MEGState's ~45% accuracy substantially exceeds the ~39% baseline, it remains far from the level required for practical brain-computer interfaces.

Related Work & Context

  • Invasive BCI methods (ECoG-DCNet, Cortical-SSM) establish upper bounds for speech decoding performance.
  • SSM variants including S4, S5, and Mamba provide a theoretical foundation for modeling continuous neural signals.
  • Multi-resolution processing is a common strategy in both speech recognition (e.g., the wav2vec family) and EEG/MEG analysis.
  • The training strategy of prototype averaging combined with mixup is generalizable to other low-SNR biosignal decoding tasks.

Rating

  • Novelty: ⭐⭐⭐
  • Experimental Thoroughness: ⭐⭐⭐
  • Writing Quality: ⭐⭐⭐
  • Value: ⭐⭐⭐⭐