
Language Reconstruction with Brain Predictive Coding from fMRI Data

Conference: ACL 2026 · arXiv: 2405.11597 · Code: None
Area: Brain-Computer Interface / Language Decoding
Keywords: fMRI language reconstruction, predictive coding, brain signal decoding, neurolinguistics, side network

TL;DR

This paper proposes PredFT, an end-to-end fMRI-to-Text decoding model comprising a main network (language decoding) and a side network (brain predictive coding representations). By extracting prospective semantic representations from prediction-related brain regions (PTO areas) and integrating them into the decoding process, PredFT achieves a BLEU-1 of 34.95% on the LeBel dataset (Sub-1), outperforming the strongest baseline MapGuide by 7.84 percentage points.

Background & Motivation

Background: Reconstructing natural language from fMRI signals provides an important window into understanding the mechanisms of language formation in the human brain. Recent studies have leveraged pretrained language models to achieve open-vocabulary fMRI-to-Text decoding: Tang et al. employ GPT to generate semantic candidates and select matching content via brain signals, while Xi et al. cast the problem as sequence-to-sequence translation.

Limitations of Prior Work: Existing research focuses on model architecture design and language model utilization, while neglecting a critical neuroscientific foundation — how natural language is encoded in the human brain. Specifically, the brain naturally performs multi-timescale predictions of future content while perceiving current speech stimuli (predictive coding theory), yet this information has never been exploited to guide language reconstruction.

Key Challenge: Brain signals contain rich prospective predictive information, but existing decoding models utilize only the current-moment brain activity representations, failing to leverage the predictive signals naturally generated by the brain.

Goal: (1) Validate the feasibility of predictive coding theory in fMRI-to-Text decoding; (2) design a decoding model that effectively utilizes brain predictive representations; (3) analyze the effects of different brain regions, prediction distances, and prediction lengths on decoding performance.

Key Insight: Predictive coding theory posits that the brain naturally predicts upcoming words upon hearing speech. Caucheteux et al. have demonstrated that constructing language model representations from predicted content strengthens the linear mapping between language model activations and brain responses. This motivates the question: can predictive representations be extracted from brain signals to assist language reconstruction?

Core Idea: A dual-network architecture is designed — the main network performs standard fMRI-to-Text decoding, while the side network extracts prospective representations from prediction-related brain regions (PTO areas). A Predictive Coding Attention mechanism integrates the predictive information into the decoding process.

Method

Overall Architecture

PredFT is an end-to-end model consisting of a main network \(\mathcal{M}_\theta\) (encoder-decoder) and a side network \(\mathcal{M}_\phi\) (encoder-decoder). The main network encodes fMRI sequences into spatiotemporal features and generates text via a Transformer decoder; the side network extracts representations from prediction-related ROIs and fuses them through self-attention, with its encoder output \(H_{\phi_\text{Enc}}^M\) injected into the main network as predictive representations. The side network decoder is discarded at inference time.
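As a rough illustration of this dual-network layout, here is a minimal PyTorch-style sketch. All module names (`PredFTSketch`, `main_encoder`, `side_encoder`), layer counts, and dimensions are our own assumptions for exposition; the paper releases no code, and the real model injects the side representations through PC-Attention inside each decoder layer rather than through a plain decoder call.

```python
import torch
import torch.nn as nn

class PredFTSketch(nn.Module):
    """Illustrative dual-network layout: a main encoder-decoder for
    fMRI-to-Text plus a side encoder over prediction-related ROIs."""

    def __init__(self, d_model=256, vocab_size=32000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.main_encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.main_decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.side_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # The side decoder is used only during training (future-word
        # targets) and discarded at inference.
        self.side_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)  # shared embedding
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, fmri_feats, roi_feats, tgt_tokens):
        h_main = self.main_encoder(fmri_feats)  # spatiotemporal features
        h_side = self.side_encoder(roi_feats)   # predictive representations
        tgt = self.embed(tgt_tokens)
        # In the full model, h_side feeds PC-Attention in every decoder
        # layer; this sketch only decodes against the main memory.
        out = self.main_decoder(tgt, h_main)
        return self.lm_head(out)
```

The sketch mirrors the key structural decision: two encoder-decoder pairs with a shared embedding, where only the side network's encoder output survives to inference.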

Key Designs

  1. Main Network Encoder (fMRI Feature Extraction + Temporal Modeling):

    • Function: Extract spatial-temporal features from raw fMRI signals.
    • Mechanism: For 4D volumetric fMRI images \(F_{i,j} \in \mathbb{R}^{w \times h \times d \times (k+1)}\), \(L\) layers of 3D-CNN (with group normalization, ReLU, and residual connections) progressively reduce the input to a one-dimensional vector \(x_{i,j}^t \in \mathbb{R}^{d_m}\); for 2D surface fMRI, a linear layer performs dimensionality reduction directly. An FIR model \(g_t\) is then applied to compensate for BOLD signal delays, followed by concatenation of \(k-k^*\) future frames and linear fusion. Temporal positional encodings are added before feeding into a Transformer encoder to capture temporal dependencies.
    • Design Motivation: The BOLD signal of fMRI has an approximately 4–6 second delay; temporal compensation via the FIR model is critical for correctly aligning brain activity with speech.
  2. Side Network (Brain Predictive Representation Extraction):

    • Function: Extract prospective semantic representations from prediction-related brain regions.
    • Mechanism: The side network encoder \(\mathcal{M}_{\phi_\text{Enc}}\) receives sequences \(R_{i,j}\) from prediction-related ROIs (concatenated from regions including STS, IFG, SMG, and Angular Gyrus), applies a fully connected layer for dimensionality reduction, FIR compensation, and positional encoding, and feeds the result into a Transformer encoder to output predictive representations \(H_{\phi_\text{Enc}}^M\). The side network decoder takes future words \(V_j\) (prediction targets at distance \(d\) and length \(l\)) as input and trains the encoder via cross-entropy loss to learn predictive representations. The decoder is discarded at inference.
    • Design Motivation: Predictive coding validation experiments demonstrate that prediction scores in PTO regions are significantly higher than those over the whole brain or random ROIs, confirming that selecting the correct brain regions is essential for extracting effective predictive signals.
  3. Predictive Coding Attention (PC-Attention):

    • Function: Integrate predictive representations from the side network into the main network decoder.
    • Mechanism: A PC-Attention module is added to each layer of the main network Transformer decoder, using the decoder hidden states \(H_{\theta_\text{Dec}}^l\) as queries and the side network encoder output \(H_{\phi_\text{Enc}}^M\) as keys and values. The attention mask is designed such that each token in text segment \(u_j^t\) is allowed to attend to all predictive representations from time steps after \(t\), while representations from earlier steps are masked — since predictive information should originate from brain activity following the current moment.
    • Design Motivation: The masking design enforces causality — decoding of the current word leverages only future predictive information, consistent with the forward-looking nature of predictive coding.
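The PC-Attention mask described above can be sketched in a few lines. This assumes PyTorch-style boolean mask semantics (True marks positions that may not be attended to); whether the current step \(t\) itself is masked, as opposed to only strictly earlier steps, is our reading of "time steps after \(t\)" and may differ from the authors' implementation.

```python
import torch

def pc_attention_mask(num_text_steps: int, num_brain_steps: int) -> torch.Tensor:
    """Boolean PC-Attention mask: True marks disallowed positions.

    The token at text step t may attend only to predictive
    representations from brain time steps strictly after t;
    representations at or before t are masked out.
    """
    t = torch.arange(num_text_steps).unsqueeze(1)   # (T, 1)
    s = torch.arange(num_brain_steps).unsqueeze(0)  # (1, S)
    return s <= t  # mask everything at or before the current step

# Example: 4 text steps attending over 6 brain time steps.
mask = pc_attention_mask(4, 6)
```

Note that this is the mirror image of a standard causal language-model mask: instead of hiding the future, it hides the past, because the predictive signal by definition lives in brain activity that follows the current word.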

Loss & Training

The model is trained end-to-end with a joint loss \(\mathcal{L} = \mathcal{L}_\text{Main} + \lambda \mathcal{L}_\text{Side}\). Both networks share an embedding layer (updated only by gradients from the main network) and are each trained with a left-to-right autoregressive cross-entropy loss. At inference, the side network decoder is discarded; only its encoder is retained to provide predictive representations.
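The joint objective above can be sketched as follows. The weight `lam` is a hypothetical placeholder (the paper's \(\lambda\) value is not reproduced here), and the stop-gradient trick in the comment is one simple way to realize "the shared embedding is updated only by main-network gradients".

```python
import torch
import torch.nn.functional as F

def joint_loss(main_logits, side_logits, main_targets, side_targets, lam=0.5):
    """Joint objective L = L_Main + lambda * L_Side, where both terms are
    standard left-to-right (next-token) cross-entropy losses."""
    l_main = F.cross_entropy(main_logits.flatten(0, 1), main_targets.flatten())
    l_side = F.cross_entropy(side_logits.flatten(0, 1), side_targets.flatten())
    return l_main + lam * l_side

# One way to keep the shared embedding updated only by the main network:
# detach it on the side-network path, e.g.
#   side_in = shared_embed(future_tokens).detach()
```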

Key Experimental Results

Main Results

LeBel Dataset Within-Subject Decoding (10 frames = 20 seconds)

Model | BLEU-1 | BLEU-4 | ROUGE1-F | BERTScore
Tang's | 22.25 | 0.00 | 19.44 | 80.84
BrainLLM | 24.18 | 1.11 | 21.16 | 83.26
MapGuide | 27.11 | 1.54 | 24.83 | 82.66
PredFT w/o SideNet | 27.91 | 1.29 | 26.82 | 81.35
PredFT | 34.95 | 1.78 | 32.03 | 82.92

Narratives Dataset Cross-Subject Decoding

Length | Model | BLEU-1 | ROUGE1-F | BERTScore
10 frames | UniCoRN | 20.64 | 19.23 | 75.35
10 frames | PredFT | 24.73 | 19.53 | 78.52
40 frames | UniCoRN | 21.76 | 25.30 | 74.40
40 frames | PredFT | 27.80 | 25.96 | 78.63

Ablation Study

Effect of ROI Selection on Decoding Performance (LeBel Dataset)

ROI Type | Description | Relative Performance
BPC (prediction-related regions) | STS, IFG, SMG, Angular Gyrus | Best
Whole (whole brain) | Entire cortex | Second
Random | Randomly selected regions | Worst

Key Findings

  • The side network contributes substantially: PredFT improves BLEU-1 on Sub-1 from 27.91 to 34.95 (+7.04) compared to the w/o SideNet variant, demonstrating the practical utility of brain predictive information for decoding.
  • ROI selection is critical: BPC regions (PTO) consistently outperform whole-brain and random ROIs, validating the region-specificity of predictive coding.
  • An optimal range exists for prediction length and distance: excessively short (\(l=1,2\)) or long (\(l=11,12\)) prediction lengths are suboptimal; moderate lengths (\(l=6,7,8\)) combined with appropriate distances (\(d=3\) to \(5\)) yield the best results.
  • Within-subject decoding substantially outperforms cross-subject decoding; long-text generation (BLEU-3/4) remains challenging for all models.

Highlights & Insights

  • Directly translating the neuroscientific theory of predictive coding into model design represents an elegant interdisciplinary innovation — the side network's "auxiliary training, inference-time discard" strategy is reminiscent of knowledge distillation.
  • The causal masking in PC-Attention is concise yet powerful — restricting each current token to attend only to future predictive representations perfectly mirrors the forward-looking nature of predictive coding.
  • The predictive coding validation experiments carry independent scientific value, systematically characterizing the interactions among brain regions, prediction distances, and prediction lengths.

Limitations & Future Work

  • Validation is limited to fMRI data; applicability to other brain signal modalities (MEG, EEG) remains unexplored.
  • Unexpected content may interfere with the brain's predictive function, potentially degrading decoding performance.
  • Substantial room for improvement remains in precise long-text generation across all models (BLEU-4 is generally below 2%).
  • The approach could be extended to brain signal decoding under visual stimulation.

Comparison with Related Methods

  • vs Tang's: Tang et al. employ GPT beam search to generate candidates followed by selection, whereas PredFT performs end-to-end decoding and additionally exploits brain predictive information.
  • vs UniCoRN: UniCoRN adopts a three-stage training framework based on BART; PredFT introduces a predictive coding prior via the side network and improves BLEU-1 by more than 6 percentage points in the cross-subject setting.
  • vs BrainLLM: BrainLLM concatenates fMRI embeddings with word embeddings for Llama2 fine-tuning; PredFT provides more targeted auxiliary signals through a dedicated predictive network.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First systematic application of predictive coding theory to fMRI-to-Text decoding, with prominent interdisciplinary innovation.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across two datasets, multiple subjects, ROI analysis, and prediction parameter analysis; human evaluation is absent.
  • Writing Quality: ⭐⭐⭐⭐ The logical progression from predictive coding validation to model design is clear and well-structured.
  • Value: ⭐⭐⭐⭐ Provides a theoretically grounded novel method for the brain-computer interface field and validates the practical utility of brain predictive information.