MEG-XL: Data-Efficient Brain-to-Text via Long-Context Pre-Training
Conference: ICML 2026
arXiv: 2602.02494
Code: Open-sourced (paper states release code + weights)
Area: Brain-Computer Interface / Neural Decoding / Foundation Models
Keywords: brain-to-text, long-context pre-training, MEG, criss-cross attention, masked token prediction
TL;DR
MEG-XL performs masked token pre-training on 2.5 minutes (191k tokens) of MEG context (5–300× longer than prior work), then fine-tunes on a 50-word brain-to-text task. With only 1 hour of data, it matches the decoding accuracy of SOTA supervised methods trained on 50 hours, and significantly outperforms all brain foundation models.
Background & Motivation
Background: Brain-to-text (B2T) decoding is a core direction in brain-computer interfaces (BCI). Approaches are either invasive (cortical electrodes, e.g., Moses 2021, Willett 2023, Card 2024, which have reached usable accuracy) or non-invasive (MEG/EEG, with a lower barrier to use but weaker signals). Non-invasive representatives include Défossez et al. (2022), who decode speech from 1s MEG, and d'Ascoli et al. (2025), who extend context to sentence-level (150s) for word decoding. Brain foundation models (LaBraM, BIOT, EEGPT, BrainOmni, CBraMod) perform masked pre-training on short windows of 5–30 seconds.
Limitations of Prior Work: (1) Supervised methods require ~50 hours of training data per subject, which is impractical for paralyzed patients who cannot provide long training recordings. (2) Existing brain foundation models are almost all pre-trained on ≤10s short windows, which is severely mismatched with the long-range neural linguistic structures (phrases, sentences, discourse) needed downstream; recent analysis (Yang 2026) finds these FMs underperform supervised methods in low-data regimes. (3) Extending context is bottlenecked by computation: standard transformer attention is \(\mathcal{O}((CT')^2)\), and multi-channel + long sequence quickly exhausts GPU memory.
Key Challenge: Neural activity contains language-related structures spanning tens of seconds to minutes (phrase aggregation, syntax, discourse coherence), but short-window pre-trained models cannot see or learn to exploit such long-range dependencies. Meanwhile, the clinical deployment scenario most in need of "fast adaptation with little data" is precisely the blind spot of short-context FMs.
Goal: (i) Build a framework for masked pre-training on minute-scale MEG context without exhausting GPU memory; (ii) Verify whether long-context pre-training can truly outperform SOTA supervised methods and all existing FMs in low-data downstream scenarios (especially contextual word decoding); (iii) Explain why long context is useful—does it really learn selective and hierarchical attention?
Key Insight: Inspired by Transformer-XL, the authors view "neural data = long documents" and argue that only pre-training on long context, as in language modeling, can capture long-range statistical priors. The computational bottleneck is addressed by criss-cross factorized attention (Wang 2025), which decouples attention along time and channel dimensions for parallelization.
Core Idea: Each channel is independently tokenized using BioCodec (12× temporal compression) and fed into an 8-layer criss-cross transformer. Within a 2.5-minute MEG window, 3-second blocks are masked until 40% of tokens are covered and must be predicted, forcing the model to learn minute-scale neural dependencies. Fine-tuning on word decoding yields a data-efficient B2T model.
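To make the sequence length concrete, the arithmetic below recomputes the token count of one 2.5-minute window and compares the number of pairwise attention scores under full attention, \(\mathcal{O}((CT')^2)\), with the factorized cost \(\mathcal{O}(C\cdot T'^2+T'\cdot C^2)\) of criss-cross attention. It is a back-of-the-envelope sketch: the channel count of 306 is an assumption (the text only says "hundreds of channels"), chosen because it reproduces the reported ~191k tokens.

```python
# Token budget for one 2.5-minute MEG window, using figures quoted in the text.
seconds = 150            # 2.5 minutes of context
sample_rate_hz = 50      # sampling rate after downsampling
downsample = 12          # BioCodec temporal compression factor
channels = 306           # ASSUMPTION: not stated in the text ("hundreds of channels")

time_tokens = seconds * sample_rate_hz // downsample   # T' = 7500 / 12
total_tokens = channels * time_tokens                  # C * T'

# Pairwise attention scores per layer, ignoring constants and feature width.
full_cost = total_tokens ** 2
factorized_cost = channels * time_tokens ** 2 + time_tokens * channels ** 2

print(f"T' = {time_tokens}, tokens = {total_tokens:,}")
print(f"full / factorized = {full_cost / factorized_cost:.1f}x")
```

Under these assumptions the window holds 625 time tokens per channel and 191,250 tokens in total, and factorizing attention cuts the number of attention scores by a factor of roughly 205.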
Method
Overall Architecture
Pre-training: Raw MEG \(\mathbf{X}\in\mathbb{R}^{C\times T}\) (multi-channel, downsampled to 50 Hz) is passed channel by channel through a frozen BioCodec (6 RVQ layers, vocabulary size 256, 12× temporal downsampling) to obtain \(\mathbf{Z}\in\{0,...,255\}^{C\times T'\times 6}\). Each token embedding concatenates the 6 codebook vectors and projects them to \(d_{model}\); sensor position (Fourier features), orientation, and type embeddings are added, and the result is processed by an 8-layer criss-cross transformer. 3-second blocks are sampled uniformly at random and masked until 40% of tokens are covered, and the model predicts all 6 RVQ code levels at each masked position.
Fine-tuning: Adopts d'Ascoli's task: 50 words × 3s MEG windows are concatenated into a 150s input, and the model predicts the T5 word embedding for each word's time segment. An MLP head is trained with SigLIP contrastive loss; inference uses nearest neighbor retrieval.
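The criss-cross layer at the heart of this architecture can be illustrated with a small pure-Python sketch. It is not the authors' implementation: it assumes a single head with identity query/key/value projections, and it leaves out RoPE, residual connections, normalization, and the FFN. All function names are hypothetical. What it does show is the factorization itself: half of each feature vector attends across channels at a fixed time step, the other half attends across time within a fixed channel, and the two halves are concatenated back together.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(tokens):
    """Single-head self-attention over a list of vectors (identity Q/K/V, an assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def criss_cross(x):
    """x[c][t] is a feature vector of even length d; the output has the same shape."""
    C, T, d = len(x), len(x[0]), len(x[0][0])
    half = d // 2
    # Spatial half: at each time step, attend across channels (T' problems of size C).
    spatial = [[None] * T for _ in range(C)]
    for t in range(T):
        column = attend([x[c][t][:half] for c in range(C)])
        for c in range(C):
            spatial[c][t] = column[c]
    # Temporal half: within each channel, attend across time (C problems of size T').
    temporal = [attend([x[c][t][half:] for t in range(T)]) for c in range(C)]
    # Concatenate the two halves along the feature axis.
    return [[spatial[c][t] + temporal[c][t] for t in range(T)] for c in range(C)]

random.seed(0)
C, T, d = 3, 4, 6
x = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(T)] for _ in range(C)]
y = criss_cross(x)
print(len(y), len(y[0]), len(y[0][0]))  # 3 4 6
```

One consequence of the factorization is visible in this sketch: within a single layer, a token at (channel c, time t) only receives information from tokens that share its channel or its time step, so information from other (channel, time) pairs has to propagate through stacked layers.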
Key Designs
- 2.5-Minute Ultra-Long Context Pre-Training + Criss-Cross Attention:
  - Function: Enables masked prediction on 191k-token sequences without exhausting GPU memory.
  - Mechanism: Standard attention's \(\mathcal{O}((CT')^2)\) complexity is infeasible for multi-channel, long sequences. Criss-cross splits the features in half: one half uses SpatialAttn (per time step, cross-channel attention, \(\mathcal{O}(T'\cdot C^2)\)), the other uses TemporalAttn (per channel, cross-time attention, \(\mathcal{O}(C\cdot T'^2)\)). Temporal attention includes RoPE time encoding. The two halves are concatenated along the feature dimension, followed by residual connections, RMSNorm, and a SELU FFN. Total complexity drops from \(\mathcal{O}((CT')^2)\) to \(\mathcal{O}(C\cdot T'^2+T'\cdot C^2)\), allowing a 191k-token sequence (2.5 minutes × hundreds of channels at 50Hz, with 12× temporal compression) to fit on a single GPU.
  - Design Motivation: Neural linguistic structures span seconds to minutes, so short-window models are structurally incapable of capturing them; models need a "long enough field of view." Full attention is the computational bottleneck. Criss-cross relies on the physical intuition that temporal and spatial correlations are approximately separable, which is a good approximation for brain signals that show temporal correlation within each sensor and spatial correlation within each time step.
- Multi-Channel Independent RVQ Tokenization (BioCodec) + Residual Codebook Input Embedding:
  - Function: Compresses continuous MEG signals into discrete token sequences, reducing sequence length and providing masked prediction targets.
  - Mechanism: BioCodec (a neural audio codec-style tokenizer trained on EEG) independently applies RVQ to each channel: \(Q=6\) residual quantization levels, each with vocab \(V=256\). 12× temporal downsampling compresses 50Hz × 150s × hundreds of channels into 191k tokens. Input embeddings look up each codebook \(\mathbf{e}^{(q)}_{z_{c,q,t}}\), concatenate, and project \(\mathbf{h}^{(0)}_{c,t}=\mathbf{W}_{proj}[\mathbf{e}^{(1)};...;\mathbf{e}^{(Q)}]\); sensor position Fourier features \(\gamma(\mathbf{v})=[\cos(2\pi\mathbf{Bv}),\sin(2\pi\mathbf{Bv})]\), orientation, and type embeddings are added.
  - Design Motivation: Unlike BrainTokenizer, which compresses both time and channel, this approach compresses only time, yielding better reconstruction quality and avoiding loss of task-relevant information during tokenization. RVQ outperforms single VQ for high-frequency temporal data, with multi-level residuals capturing both slow dynamics and high-frequency details.
- Large Block Masking (3s) + Synchronized Masking Across All Channels:
  - Function: Forces the model to learn temporal dependencies across seconds, not just simple cross-channel interpolation.
  - Mechanism: Randomly selects 3-second time blocks until 40% of tokens are masked; all channels are masked synchronously at the selected time steps and replaced with mask embeddings. The model predicts the RVQ codes at each masked position: \(p(z_{c,q,t}\mid\mathbf{X}_{\backslash\mathcal{M}})=\text{softmax}(\mathbf{W}_q\mathbf{h}^{(L)}_{c,t})\), with loss \(\mathcal{L}=-\frac{1}{|\mathcal{M}|CQ}\sum_{t\in\mathcal{M}}\sum_{c=1}^{C}\sum_{q=1}^{Q}\log p(z_{c,q,t}\mid\mathbf{X}_{\backslash\mathcal{M}})\), where \(\mathcal{M}\) is the set of masked time steps. The 3s block size is intentionally long, covering the typical neural response duration for words (Kutas & Federmeier 2011).
  - Design Motivation: MEG is highly temporally autocorrelated; short masks allow the model to "cheat" via neighbor interpolation, failing to learn long-range dependencies. 3s blocks + synchronized masking across channels eliminate these shortcuts, forcing genuine long-term modeling. The 40% mask rate is well above BERT's 15% but below wav2vec 2.0's 49% and MAE's 75%; it was tuned empirically.
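The block-masking scheme from the last design can be sketched as follows. This is a minimal illustration under stated assumptions, not the released code: the 3-second block is rounded down to 12 tokens (3 s at 50 Hz with 12× compression is 12.5 tokens), blocks are allowed to overlap, each position holds one toy code instead of 6 RVQ codes, and `MASK_ID`, `sample_block_mask`, and `apply_mask` are hypothetical names.

```python
import math
import random

MASK_ID = -1  # hypothetical sentinel standing in for a learned mask embedding

def sample_block_mask(num_steps, block_len, target_ratio, rng):
    """Pick random contiguous blocks of time steps until the target ratio is covered."""
    masked = [False] * num_steps
    target = math.ceil(target_ratio * num_steps)
    while sum(masked) < target:
        start = rng.randrange(num_steps - block_len + 1)
        for t in range(start, start + block_len):
            masked[t] = True  # overlapping blocks are allowed in this sketch
    return masked

def apply_mask(tokens, masked):
    """tokens[c][t] -> same shape, with masked time steps replaced in every channel."""
    return [[MASK_ID if masked[t] else tok for t, tok in enumerate(row)] for row in tokens]

rng = random.Random(0)
num_steps, block_len = 625, 12  # 150 s of tokens; block length is an assumed rounding
masked = sample_block_mask(num_steps, block_len, 0.40, rng)
tokens = [[rng.randrange(256) for _ in range(num_steps)] for _ in range(4)]  # 4 toy channels
corrupted = apply_mask(tokens, masked)
print(f"masked fraction: {sum(masked) / num_steps:.3f}")
```

Because the same time-step mask is applied to every channel, a masked position cannot be filled in from another sensor at the same instant, and because each block spans several seconds, it cannot be filled in from its immediate temporal neighbors either.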
Loss & Training
Pre-training: ~300 hours of MEG (CamCAN + MOUS + SMN4Lang), covering rest, movement, speech, etc., across hundreds of subjects; masked token prediction with cross-entropy loss. Different MEG systems have different channel counts, which is handled by padding the channel axis and masking the padded channels. Fine-tuning: SigLIP contrastive loss + word embedding regression head; end-to-end fine-tuning of transformer + MLP head; inference uses cosine similarity for nearest neighbor word retrieval.
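The fine-tuning objective and the retrieval step can be sketched in a few lines. The loss below follows the pairwise sigmoid formulation from the SigLIP paper: each predicted embedding is scored against every target word embedding in the batch, with matching pairs as positives and all other pairs as negatives. The temperature and bias values are the SigLIP initialization defaults and are assumptions here, as are the function names; the notes above do not state how MEG-XL sets them.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))  # assumes a non-zero vector
    return [x / norm for x in v]

def siglip_loss(pred, target, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss; pred[i] and target[i] form the positive pair."""
    p = [normalize(v) for v in pred]
    t = [normalize(v) for v in target]
    n = len(p)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = temperature * dot(p[i], t[j]) + bias
            total += math.log1p(math.exp(-label * logit))  # equals -log(sigmoid(label * logit))
    return total / n

def retrieve(pred_vec, vocab_embeddings):
    """Index of the vocabulary embedding with the highest cosine similarity."""
    p = normalize(pred_vec)
    sims = [dot(p, normalize(e)) for e in vocab_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)

vocab = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy word embeddings
shuffled = vocab[1:] + vocab[:1]                              # every prediction is wrong
print(siglip_loss(vocab, vocab) < siglip_loss(shuffled, vocab))  # True
print(retrieve([0.9, 0.1, 0.0], vocab))                          # 0
```

In the real setting the targets would be T5 word embeddings and the predictions would come from the MLP head; the toy vectors only demonstrate that aligned pairs give a lower loss and that retrieval picks the closest word.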
Key Experimental Results
Main Results
| Model | Params | MEG-MASC (13% data) | Armeni (13% data) | LibriBrain (13% data) | MEG-MASC (100% data) | Armeni (100% data) | LibriBrain (100% data) |
|---|---|---|---|---|---|---|---|
| BioCodec baseline | 1.0M | 19.8 | 20.0 | 19.9 | 31.2 | 37.1 | 41.9 |
| EEGPT | 4.7M | 19.6 | 20.3 | 20.3 | 26.3 | 20.8 | 22.9 |
| BIOT | 3.2M | 20.0 | 20.2 | 20.6 | 31.3 | 35.7 | 45.6 |
| BBL | 15M | 21.5 | 22.3 | 32.1 | 35.9 | 39.1 | 49.9 |
| BrainOmni | 8.4M | 18.7 | 21.0 | 29.7 | 19.1 | 62.3 | 63.0 |
| LaBraM | 5.8M | 33.2 | 26.3 | 40.3 | 31.1 | 42.0 | 47.7 |
| MEG-XL (Ours) | 20M | 47.0 | 54.9 | 57.3 | 46.4 | 61.2 | 63.0 |
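In low-data (13%) settings, MEG-XL outperforms the next-best model, LaBraM, by about 14–29 points (13.8 on MEG-MASC, 28.6 on Armeni, 17.0 on LibriBrain). With full data, it is on par with BrainOmni on LibriBrain (63.0 vs 63.0) and Armeni (61.2 vs 62.3) and far ahead on MEG-MASC (46.4 vs 19.1). BrainOmni's collapse to 19.1% on MEG-MASC (shallow multi-subject) suggests that existing FMs fail in the clinically critical "shallow data, many subjects" scenario.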
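The low-data margins can be recomputed directly from the table. The snippet below copies the 13% columns and, for each dataset, finds the strongest baseline and its gap to MEG-XL; the variable names are illustrative only.

```python
# 13%-data scores copied from the table: (MEG-MASC, Armeni, LibriBrain).
datasets = ("MEG-MASC", "Armeni", "LibriBrain")
low_data = {
    "BioCodec baseline": (19.8, 20.0, 19.9),
    "EEGPT": (19.6, 20.3, 20.3),
    "BIOT": (20.0, 20.2, 20.6),
    "BBL": (21.5, 22.3, 32.1),
    "BrainOmni": (18.7, 21.0, 29.7),
    "LaBraM": (33.2, 26.3, 40.3),
    "MEG-XL": (47.0, 54.9, 57.3),
}

margins = {}
for i, name in enumerate(datasets):
    baselines = {m: s[i] for m, s in low_data.items() if m != "MEG-XL"}
    best = max(baselines, key=baselines.get)          # strongest baseline on this dataset
    margins[name] = (best, round(low_data["MEG-XL"][i] - baselines[best], 1))

for name, (best, gap) in margins.items():
    print(f"{name}: next best = {best}, margin = {gap} points")
```

LaBraM is the strongest baseline on all three datasets at 13% data, with margins of 13.8, 28.6, and 17.0 points.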
Ablation Study / Long-Context Effects
| Configuration | Result |
|---|---|
| Random init MEG-XL (no pre-training) | Performance similar to supervised baseline, showing gains come from pre-training, not architecture |
| Pre-training context 5s → 30s → 100s → 150s | Word decoding linear probe improves monotonically, saturates after ~100s |
| Full-context vs Matched-context inference | Nearly identical → longer context at inference is useless unless pre-trained for it |
| Masked prediction (zero-shot) | Performance improves monotonically from 5s to 150s context, not saturated—longer may help further |
| Attention analysis | Long-context models show local attention in early layers + global integration in deeper layers + lower attention entropy |
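The "attention entropy" in the last row can be read as the Shannon entropy of an attention distribution: a query that spreads its weight evenly over all keys has high entropy, while a selective query has low entropy. The sketch below computes it for a single attention row; how the paper aggregates this over heads, layers, and queries is not stated in these notes, so the function is only an illustration of the quantity.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention row; weights should sum to 1."""
    return -sum(p * math.log(p) for p in weights if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]      # attends everywhere equally
selective = [0.97, 0.01, 0.01, 0.01]    # focuses on one key

print(attention_entropy(uniform) > attention_entropy(selective))  # True
```

For a uniform row over n keys the entropy equals log(n), which is the maximum; a one-hot row has entropy 0.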
Key Findings
- Dramatic data efficiency: MEG-XL achieves the accuracy of supervised SOTA with 1 hour of data, which otherwise requires 50 hours (~50× data efficiency).
- Long-context models learn "when to look far / when to look near" via selective, hierarchical attention—short-context models attend uniformly from the first layer and never learn this stratification.
- On "deep single-subject" data like LibriBrain, d'Ascoli's supervised method overtakes MEG-XL when data is abundant (after 2.5 hours), indicating the boundary where pre-training can substitute for subject-specific data.
- In low-data settings, supervised methods are decisively outperformed by MEG-XL (over +25 points on MEG-MASC), which is the most important scenario for BCI clinical deployment.
Highlights & Insights
- "Long context is a learned ability, not a given one" is the most important insight—providing long context only at inference is useless; it must be fed during pre-training. This echoes the length generalization literature in LMs, extending this principle to neural decoding.
- The success of criss-cross attention on brain signals suggests that highly structured spatiotemporal data can generally use "spatiotemporal factorized attention" to avoid quadratic complexity; this idea applies to fMRI, ECoG, sensor networks, and any \(C\times T\) high-dimensional signals.
- In clinical deployment, "cross-subject pre-training replacing deep within-subject training" is a true paradigm shift—reducing BCI training from "50 hours per new user" to "1–2 hours," especially important for paralyzed patients.
- The 3-second large block + synchronized masking across all channels is a clever design—it simultaneously blocks "temporal neighbor interpolation" and "channel neighbor interpolation" shortcuts, forcing genuine semantic-level modeling.
Limitations & Future Work
- Only tested on perceived speech (subjects listening to audiobooks), not the more challenging imagined speech; the latter is the actual use case for paralyzed BCI users.
- The retrieval vocabulary is only 50 words (top-250 shows similar trends), still one to two orders of magnitude short of the open vocabulary (thousands of words) needed clinically.
- The interpretability gains from long context (hierarchical attention) remain at the statistical description level; no clear evidence yet mapping to specific linguistic structures (syllable/word/phrase).
- GPU memory is still the limit—150s is the VRAM ceiling; unable to verify whether even longer context continues to help.
- Pre-training data comes from healthy research subjects; there may be domain shift with real MEG signals from paralyzed patients.
Related Work & Insights
- vs d'Ascoli et al. (2025) (Supervised SOTA): They first extended MEG input to sentence-level 150s, but via supervised training with high data demand; MEG-XL inherits the same context length but uses self-supervised pre-training to reduce data needs by one or two orders of magnitude.
- vs LaBraM / EEGPT / BIOT / BrainOmni: These brain FMs use ≤30s short-window pre-training and collapse in low-data settings; MEG-XL's minute-scale context brings qualitative change.
- vs CBraMod / BrainOmni (criss-cross origin): BrainOmni also uses criss-cross attention but with a 30s window; this work shows that only at 150s does factorized attention realize its full value.
- vs Transformer-XL (homage in naming): Directly transfers the LM long-context paradigm to neural data and demonstrates similar principles apply.
Rating
- Novelty: ⭐⭐⭐⭐ First to combine minute-scale context + RVQ tokenization + criss-cross attention for B2T, with clear rationale and significant effect.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ 3 MEG datasets + 6 FM baselines + supervised SOTA + linear probe + zero-shot prediction + attention analysis, with a very complete evidence chain.
- Writing Quality: ⭐⭐⭐⭐⭐ Exceptionally well-structured—drawing analogy from "LM long-context success" to neural data, integrating theoretical framework, empirical results, and mechanism analysis; formulas and figures are precise.
- Value: ⭐⭐⭐⭐⭐ Substantially advances the clinical feasibility of non-invasive BCI; the "long context is a learned ability" principle is methodologically significant for the entire neural foundation model field.