Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamics¶
Conference: CVPR 2026
arXiv: 2603.24176
Code: None
Area: 3D Vision
Keywords: EEG-to-fMRI, Diffusion Models, Spatiotemporal Modeling, Intermediate Frame Reconstruction, Visual Decoding
TL;DR¶
The paper proposes an EEG-conditioned diffusion Transformer for fMRI reconstruction that models brain activity as a spatiotemporal sequence of neural frames rather than as independent snapshots. The method reconstructs fMRI at cortical vertex-level resolution with spatiotemporal consistency, supports intermediate-frame interpolation via null-space constrained sampling, and demonstrates on downstream visual decoding tasks that the reconstructions preserve functionally meaningful information.
Background & Motivation¶
- Background: fMRI offers high spatial resolution cortical representations but is costly to acquire; EEG provides millisecond-level temporal resolution but with low spatial precision. EEG-to-fMRI translation aims to leverage the complementarity of both modalities by inferring fMRI-level spatial patterns from EEG.
- Limitations of Prior Work: (1) ROI-level methods (e.g., NeuroBOLT) can model temporal continuity but suffer from low spatial resolution; (2) voxel/cortical-level methods (CNN-TC, CATD, etc.) achieve high spatial fidelity but reconstruct frames independently, lacking temporal consistency; (3) evaluation relies solely on low-level metrics such as MSE/SSIM, which cannot assess whether reconstructed fMRI preserves functionally meaningful neural information.
- Key Challenge: High spatial resolution and temporal continuity are difficult to achieve simultaneously — independent reconstruction ensures spatial accuracy but introduces inter-frame artifacts, while sequence modeling ensures temporal continuity but is constrained by spatial granularity.
- Goal: To reconstruct temporally continuous and consistent fMRI frame sequences at the high spatial resolution of 91,282 cortical vertices.
- Key Insight: Brain activity is modeled as evolving spatiotemporal neural frames (rather than independent snapshots), using a diffusion Transformer to jointly capture vertex-level spatial detail and inter-frame temporal dependencies.
- Core Idea: An EEG-guided diffusion Transformer generates spatiotemporally consistent fMRI sequences, with null-space constrained sampling enabling intermediate frame reconstruction.
Method¶
Overall Architecture¶
Input: A temporally aligned EEG window sequence \(\mathbf{S}\) (64 channels, 1000 Hz, with a 4s hemodynamic delay relative to fMRI). Output: An fMRI sequence of \(K_w\) frames \(\mathbf{X} \in \mathbb{R}^{K_w \times N_v}\) (\(N_v = 91{,}282\) cortical vertices). Core pipeline: EEG features are extracted via a temporal encoder → a linear fMRI autoencoder compresses the spatial dimension → a diffusion Transformer performs EEG-conditioned denoising in the low-dimensional space → decoding recovers vertex-level fMRI. At inference, two modes are supported: direct reconstruction and null-space constrained intermediate frame reconstruction (InterRecon).
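To make the pipeline concrete, here is a minimal PyTorch sketch. Module names (LinearAE, EEGEncoder, DiTBlock, Denoiser) and the convolutional kernel/stride choices are hypothetical; the dimensions (64 EEG channels, \(N_v = 91{,}282\) vertices, 1024-dim latent, 6 layers, 8 heads) follow the summary above. For brevity, each compressed frame is treated as a single token rather than tokenizing at the vertex level as the paper describes.

```python
import torch
import torch.nn as nn

K_W, N_V, D = 10, 91_282, 1024  # frames per window, cortical vertices, latent dim

class LinearAE(nn.Module):
    """Linear encoder/decoder; linearity keeps the range/null-space split exact."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(N_V, D, bias=False)
        self.dec = nn.Linear(D, N_V, bias=False)

class EEGEncoder(nn.Module):
    """Temporal-convolutional encoder for 64-channel, 1000 Hz EEG windows."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(64, 256, kernel_size=25, stride=5), nn.GELU(),
            nn.Conv1d(256, D, kernel_size=25, stride=5), nn.GELU(),
        )

    def forward(self, eeg):                    # eeg: (B, 64, T_samples)
        return self.net(eeg).transpose(1, 2)   # (B, T_feat, D) condition tokens

class DiTBlock(nn.Module):
    """Self-attention across frame tokens + cross-attention to EEG features."""
    def __init__(self, heads=8):
        super().__init__()
        self.sa = nn.MultiheadAttention(D, heads, batch_first=True)
        self.ca = nn.MultiheadAttention(D, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))
        self.ln1, self.ln2, self.ln3 = nn.LayerNorm(D), nn.LayerNorm(D), nn.LayerNorm(D)

    def forward(self, x, h_eeg):               # x: (B, K_W, D) noisy frame latents
        h = self.ln1(x)
        x = x + self.sa(h, h, h)[0]            # inter-frame temporal dependencies
        x = x + self.ca(self.ln2(x), h_eeg, h_eeg)[0]  # EEG condition injection
        return x + self.ff(self.ln3(x))

class Denoiser(nn.Module):
    """6-layer conditional Transformer predicting the noise eps_theta."""
    def __init__(self, n_layers=6):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, K_W, D))  # temporal positional encoding
        self.t_emb = nn.Sequential(nn.Linear(1, D), nn.SiLU(), nn.Linear(D, D))
        self.blocks = nn.ModuleList([DiTBlock() for _ in range(n_layers)])

    def forward(self, x_n, n, h_eeg):          # n: (B,) diffusion timesteps
        x = x_n + self.pos + self.t_emb(n.float().view(-1, 1)).unsqueeze(1)
        for blk in self.blocks:
            x = blk(x, h_eeg)
        return x                               # predicted noise, (B, K_W, D)
```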
Key Designs¶
- Spatiotemporal Tokenization and EEG Condition Injection:
- Function: Joint modeling of spatial and temporal dimensions.
- Mechanism: The \(K_w\)-frame fMRI sequence is tokenized into \((K_w \times N_v)\) vertex-level tokens, each appended with a temporal positional encoding to distinguish different frames. EEG features extracted by a temporal convolutional encoder are injected into vertex tokens via cross-attention at each Transformer layer. This enables the model to simultaneously account for spatial structure and EEG-guided temporal patterns during denoising.
- Design Motivation: Independent per-frame modeling leads to inter-frame inconsistency. Treating multiple frames as a unified sequence allows self-attention to naturally capture inter-frame temporal dependencies.
- Null-Space Sampling for InterRecon:
- Function: Reconstructing arbitrary intermediate frames from sparse anchor frames without retraining.
- Mechanism: Sparse observations are modeled as linear measurements \(\mathbf{y} = \mathbf{A}\mathbf{X}\), where \(\mathbf{A} = \text{diag}(m_1,...,m_{K_w})\) indicates known frames. At each reverse diffusion step, the denoised estimate is decomposed into a range-space component (enforcing consistency with anchor frames) and a null-space component (preserving generative freedom): \(\hat{\mathbf{x}}_{0|n} = \mathbf{A}^\dagger \mathbf{y} + (\mathbf{I} - \mathbf{A}^\dagger \mathbf{A})\mathbf{x}_{0|n}\). This guarantees exact anchor frame matching while allowing intermediate frames to be freely generated.
- Design Motivation: Missing or corrupted frames are common in real fMRI acquisition. The null-space approach decouples observation constraints from generative freedom, enabling adaptation to diverse interpolation scenarios without retraining; it also serves as an intrinsic evaluation of temporal consistency. A minimal sketch of the projection step follows this list.
- Linear fMRI Autoencoder:
- Function: Dimensionality reduction for efficient diffusion modeling while preserving the null-space decomposition property.
- Mechanism: A linear MLP maps each \(N_v = 91{,}282\)-dimensional fMRI frame to a 1024-dimensional latent representation. Both encoder and decoder are linear transformations, trained end-to-end with the diffusion model.
- Design Motivation: Linearity ensures that the null-space projection \((\mathbf{I} - \mathbf{A}^\dagger \mathbf{A})\) remains exact in the compressed space. A nonlinear autoencoder would break the range–null-space decomposition.
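As noted above, here is a minimal sketch of the null-space projection applied at each reverse step. It assumes the frame mask \(\mathbf{A} = \text{diag}(m_1,\dots,m_{K_w})\) acts per frame in the latent space (valid because the per-frame linear autoencoder commutes with a frame-level mask); the function name `null_space_step` is hypothetical. Since \(\mathbf{A}\) is a binary diagonal matrix, \(\mathbf{A}^\dagger = \mathbf{A}\), and the projection reduces to keeping anchor frames from \(\mathbf{y}\) and everything else from the model estimate.

```python
import torch

def null_space_step(x0_pred, y, mask):
    """Range/null-space projection: x_hat = A^+ y + (I - A^+ A) x0_pred.

    x0_pred: (B, K_w, D) denoised estimate x_{0|n} from the diffusion model
    y:       (B, K_w, D) observed anchor frames (zeros where unobserved)
    mask:    (K_w,) binary; m_k = 1 if frame k is a known anchor
    """
    m = mask.view(1, -1, 1).to(x0_pred.dtype)
    return m * y + (1.0 - m) * x0_pred  # anchors kept exactly, rest generated

# Example: 10-frame window with anchors at frames 0, 5, and 9.
B, K_w, D = 2, 10, 1024
mask = torch.zeros(K_w)
mask[[0, 5, 9]] = 1.0
x_true = torch.randn(B, K_w, D)             # stand-in for encoded ground truth
y = mask.view(1, -1, 1) * x_true            # linear measurements y = A X
x0_pred = torch.randn(B, K_w, D)            # model's current denoised estimate
x_hat = null_space_step(x0_pred, y, mask)
assert torch.allclose(x_hat[:, [0, 5, 9]], x_true[:, [0, 5, 9]])  # exact anchors
```

Because the projection touches only the sampling loop, the same trained checkpoint serves both direct reconstruction and InterRecon.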
Loss & Training¶
- Denoising score matching loss: \(\mathcal{L}_{\text{diff}} = \mathbb{E}[\|\epsilon - \epsilon_\theta(\mathbf{x}^{(n)}, n, \mathbf{h}_\text{EEG})\|^2]\)
- Diffusion parameters: 1000 timesteps, linear noise schedule; DDIM with 50 steps at inference (a training-step sketch follows this list).
- Architecture: 6-layer Transformer, 8-head attention, hidden dimension 1024.
- Training: AdamW, lr \(= 1\times10^{-4}\), batch size 32, 200 epochs, single A100 GPU.
- Within-subject training with 80/20 split; test set contains unseen video clips.
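A minimal sketch of one training step under these settings. The noise-schedule endpoints (1e-4 to 2e-2) are assumed standard values, not taken from the paper; `Denoiser` refers to the hypothetical module sketched earlier.

```python
import torch

N_STEPS = 1000
betas = torch.linspace(1e-4, 2e-2, N_STEPS)     # linear schedule (assumed endpoints)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, x0, h_eeg):
    """L_diff = E[ || eps - eps_theta(x_n, n, h_EEG) ||^2 ]."""
    B = x0.shape[0]
    n = torch.randint(0, N_STEPS, (B,), device=x0.device)
    eps = torch.randn_like(x0)
    ab = alphas_bar.to(x0.device)[n].view(B, 1, 1)
    x_n = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps  # forward noising of latents
    return ((eps - denoiser(x_n, n, h_eeg)) ** 2).mean()

# Optimizer as reported: AdamW with lr = 1e-4.
# opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)
```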
Key Experimental Results¶
Main Results¶
Dynamic fMRI frame reconstruction (averaged over 6 subjects, sequence length 10, whole-brain):
| Method | MSE ↓ | r ↑ | Cos ↑ |
|---|---|---|---|
| CNN-TC | 0.315 | 0.804 | 0.824 |
| CNN-TAG | 0.309 | 0.810 | 0.829 |
| E2FNet | 0.297 | 0.819 | 0.836 |
| E2FGAN | 0.290 | 0.822 | 0.839 |
| Ours | 0.277 | 0.824 | 0.849 |
Visual cortex (V1) sub-region, 10 frames: MSE 0.193, r 0.834, Cos 0.887.
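For reference, a sketch of how the three metrics in these tables might be computed per window; the exact averaging convention (per frame vs. per vertex, then over subjects) is an assumption, as the summary does not specify it.

```python
import torch
import torch.nn.functional as F

def frame_metrics(pred, target):
    """MSE (lower is better), Pearson r and cosine similarity (higher is better).

    pred, target: (K_w, N_v) reconstructed vs. ground-truth vertex activations.
    """
    mse = ((pred - target) ** 2).mean()
    pc = pred - pred.mean(dim=1, keepdim=True)      # center over vertices
    tc = target - target.mean(dim=1, keepdim=True)
    r = (pc * tc).sum(dim=1) / (pc.norm(dim=1) * tc.norm(dim=1))
    cos = F.cosine_similarity(pred, target, dim=1)  # uncentered counterpart of r
    return mse.item(), r.mean().item(), cos.mean().item()
```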
Ablation Study¶
Intermediate frame reconstruction (InterRecon) comparison:
| Method | MSE ↓ | r ↑ | Cos ↑ |
|---|---|---|---|
| Linear interpolation | 0.280 | 0.830 | 0.851 |
| Ours w/o null space | 0.272 | 0.839 | 0.852 |
| Ours w/ null space | 0.250 | 0.852 | 0.865 |
Null-space constraints yield consistent improvements over the variant without them: MSE reduced by 8.1% (0.272 → 0.250), r improved by 1.5%, Cos improved by 1.5%.
Key Findings¶
- Strong temporal robustness: As sequence length increases from 3 to 30 frames, the proposed method's whole-brain MSE changes only marginally from 0.282 to 0.281, whereas CNN-TC degrades from 0.302 to 0.322. This demonstrates that joint spatiotemporal modeling effectively captures long-range temporal dependencies.
- Superior performance in functional regions: Reconstruction metrics in the visual and auditory cortices substantially exceed the whole-brain average, consistent with neuroscientific expectations that these regions are strongly driven during a movie-watching task.
- Downstream visual decoding validation: Applying the CineSync-f decoder to reconstructed fMRI recovers coarse semantic structures of scenes (characters, poses, scene layout), confirming that the reconstructions preserve functional neural representations.
- Null-space sampling requires no retraining — it uses the identical model checkpoint as direct reconstruction, modifying only the sampling strategy.
Highlights & Insights¶
- Paradigm shift to spatiotemporal frames: Reframing fMRI reconstruction from "independent per-frame prediction" to "sequence modeling" represents a conceptually significant advance. Unified self-attention across temporal and spatial dimensions captures more complete neural dynamics than prior purely spatial approaches.
- Dual value of null-space sampling: It serves both as a practical tool for missing frame imputation (a common need in real fMRI preprocessing) and as an intrinsic evaluation of temporal consistency — verifying whether the model has learned genuine temporal dependencies without requiring additional metrics.
- Elegant constraint of the linear autoencoder: Trading expressive power for preserved mathematical properties is an elegant engineering decision, ensuring that the null-space decomposition holds exactly in the latent space.
Limitations & Future Work¶
- Within-subject training: Models are currently trained per subject and cannot generalize across subjects. Cross-subject modeling requires stronger anatomical or functional alignment methods.
- Fixed EEG-fMRI delay: A fixed 4s delay is assumed, whereas real hemodynamic delays vary across brain regions and time. A learnable alignment module could be explored.
- Linear autoencoder expressiveness: The expressiveness of the linear autoencoder is limited; mildly nonlinear designs that preserve the null-space decomposition property could be investigated.
- Downstream evaluation: Quantitative evaluation of downstream visual decoding is limited, with only qualitative visualizations provided.
Related Work & Insights¶
- vs. NeuroBOLT: ROI-level modeling naturally achieves temporal consistency but at low spatial resolution (hundreds of regions vs. 91,282 vertices). This work fills the gap of "high spatial resolution + temporal consistency."
- vs. CATD: A cortical-level fMRI translation model that reconstructs frames independently. This work demonstrates that sequence modeling outperforms per-frame methods on both whole-brain and functional region metrics.
- vs. image diffusion models: Drawing on ideas from DiT and null-space diffusion sampling, this work applies them to high-dimensional spatiotemporal data in neuroscience, showcasing the potential of diffusion models for scientific data modeling.
Rating¶
- Novelty: ⭐⭐⭐⭐ Reframes fMRI reconstruction as spatiotemporal sequence generation; null-space sampling enables intermediate frame reconstruction without retraining.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive validation across multiple sequence lengths, brain regions, baselines, InterRecon, and downstream decoding; limited to 6 subjects on a single dataset.
- Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are rigorous and the method is clearly described, though the presentation may skew toward machine learning readers over neuroscience audiences.
- Value: ⭐⭐⭐⭐ Introduces a new paradigm for joint multimodal neuroimaging modeling, though the application domain is relatively niche.