Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamics

Conference: CVPR 2026
arXiv: 2603.24176
Code: None
Area: 3D Vision
Keywords: EEG-to-fMRI, Diffusion Models, Spatiotemporal Modeling, Intermediate Frame Reconstruction, Visual Decoding

TL;DR

The paper proposes an EEG-conditioned diffusion Transformer framework for fMRI reconstruction that models brain activity as a spatiotemporal sequence of neural frames rather than as independent snapshots. The method produces spatiotemporally consistent fMRI reconstructions at cortical vertex-level resolution, supports intermediate frame interpolation via null-space constrained sampling, and validates that the reconstructions preserve functional information on downstream visual decoding tasks.

Background & Motivation

  1. Background: fMRI offers high spatial resolution cortical representations but is costly to acquire; EEG provides millisecond-level temporal resolution but with low spatial precision. EEG-to-fMRI translation aims to leverage the complementarity of both modalities by inferring fMRI-level spatial patterns from EEG.
  2. Limitations of Prior Work: (1) ROI-level methods (e.g., NeuroBOLT) can model temporal continuity but suffer from low spatial resolution; (2) voxel/cortical-level methods (CNN-TC, CATD, etc.) achieve high spatial fidelity but reconstruct frames independently, lacking temporal consistency; (3) evaluation relies solely on low-level metrics such as MSE/SSIM, which cannot assess whether reconstructed fMRI preserves functionally meaningful neural information.
  3. Key Challenge: High spatial resolution and temporal continuity are difficult to achieve simultaneously — independent reconstruction ensures spatial accuracy but introduces inter-frame artifacts, while sequence modeling ensures temporal continuity but is constrained by spatial granularity.
  4. Goal: To reconstruct temporally continuous and consistent fMRI frame sequences at the high spatial resolution of 91,282 cortical vertices.
  5. Key Insight: Brain activity is modeled as evolving spatiotemporal neural frames (rather than independent snapshots), using a diffusion Transformer to jointly capture vertex-level spatial detail and inter-frame temporal dependencies.
  6. Core Idea: An EEG-guided diffusion Transformer generates spatiotemporally consistent fMRI sequences, with null-space constrained sampling enabling intermediate frame reconstruction.

Method

Overall Architecture

  • Input: a temporally aligned EEG window sequence \(\mathbf{S}\) (64 channels, 1000 Hz, with a 4 s hemodynamic delay relative to fMRI).
  • Output: an fMRI sequence of \(K_w\) frames, \(\mathbf{X} \in \mathbb{R}^{K_w \times N_v}\) with \(N_v = 91{,}282\) cortical vertices.
  • Core pipeline: EEG features are extracted by a temporal encoder → a linear fMRI autoencoder compresses the spatial dimension → a diffusion Transformer performs EEG-conditioned denoising in the low-dimensional latent space → the decoder recovers vertex-level fMRI.
  • Inference: two modes are supported, direct reconstruction and null-space constrained intermediate frame reconstruction (InterRecon).
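
To make the data flow concrete, here is a shape-level PyTorch sketch under stated assumptions: the EEG window length, the placeholder convolutional encoder, and all names are illustrative, not the authors' implementation (no code is released). For brevity each frame is treated as a single latent token; the paper's vertex-level tokenization is discussed under Key Designs.

```python
import torch
import torch.nn as nn

# Dimensions from the summary: K_w frames, N_v = 91,282 vertices, 1024-d latent,
# 64 EEG channels at 1000 Hz. The EEG window length and encoder are assumptions.
K_w, N_v, d = 10, 91_282, 1024
C_eeg, T_eeg = 64, 2_000                      # e.g. 2 s of EEG per frame (assumed)

eeg = torch.randn(1, C_eeg, K_w * T_eeg)      # EEG aligned with a 4 s hemodynamic delay
x = torch.randn(1, K_w, N_v)                  # target fMRI frame sequence

eeg_encoder = nn.Conv1d(C_eeg, d, kernel_size=25, stride=500)  # placeholder temporal encoder
enc = nn.Linear(N_v, d, bias=False)           # linear fMRI encoder (spatial compression)
dec = nn.Linear(d, N_v, bias=False)           # linear fMRI decoder

h_eeg = eeg_encoder(eeg).transpose(1, 2)      # (1, T_e, d) EEG condition tokens
z = enc(x)                                    # (1, K_w, d) latent frame tokens
# ... the diffusion Transformer denoises z conditioned on h_eeg (see Key Designs) ...
x_hat = dec(z)                                # (1, K_w, N_v) vertex-level reconstruction
```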

Key Designs

  1. Spatiotemporal Tokenization and EEG Condition Injection:

    • Function: Joint modeling of spatial and temporal dimensions.
    • Mechanism: The \(K_w\)-frame fMRI sequence is tokenized into \((K_w \times N_v)\) vertex-level tokens, each appended with a temporal positional encoding to distinguish different frames. EEG features extracted by a temporal convolutional encoder are injected into vertex tokens via cross-attention at each Transformer layer. This enables the model to simultaneously account for spatial structure and EEG-guided temporal patterns during denoising.
    • Design Motivation: Independent per-frame modeling leads to inter-frame inconsistency; treating multiple frames as a unified sequence allows self-attention to naturally capture inter-frame temporal dependencies (a minimal sketch of this layer appears after this list).
  2. Null-Space Sampling for InterRecon:

    • Function: Reconstructing arbitrary intermediate frames from sparse anchor frames without retraining.
    • Mechanism: Sparse observations are modeled as linear measurements \(\mathbf{y} = \mathbf{A}\mathbf{X}\), where \(\mathbf{A} = \text{diag}(m_1,...,m_{K_w})\) indicates known frames. At each reverse diffusion step, the denoised estimate is decomposed into a range-space component (enforcing consistency with anchor frames) and a null-space component (preserving generative freedom): \(\hat{\mathbf{x}}_{0|n} = \mathbf{A}^\dagger \mathbf{y} + (\mathbf{I} - \mathbf{A}^\dagger \mathbf{A})\mathbf{x}_{0|n}\). This guarantees exact anchor frame matching while allowing intermediate frames to be freely generated.
    • Design Motivation: Missing or corrupted frames are common in real fMRI acquisition. The null-space approach decouples observation constraints from generative freedom, enabling adaptation to diverse interpolation scenarios without retraining. It also serves as an intrinsic evaluation tool for temporal consistency (see the sampling sketch after this list).
  3. Linear fMRI Autoencoder:

    • Function: Dimensionality reduction for efficient diffusion modeling while preserving the null-space decomposition property.
    • Mechanism: A single linear layer (an MLP with no nonlinear activations) maps each \(N_v = 91{,}282\)-dimensional fMRI frame to a 1024-dimensional latent representation. Both encoder and decoder are linear transformations, trained end-to-end with the diffusion model.
    • Design Motivation: Linearity ensures that the null-space projection \((\mathbf{I} - \mathbf{A}^\dagger \mathbf{A})\) remains exact in the compressed space; a nonlinear autoencoder would break the range–null-space decomposition. The second sketch after this list includes a toy check of this property.
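
First, the layer sketch referenced in design 1: a minimal EEG-conditioned Transformer block. The hidden width (1024) and head count (8) follow the training details below; the pre-norm layout, MLP ratio, and all module names are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class EEGConditionedBlock(nn.Module):
    """Self-attention over the K_w noisy latent frame tokens (capturing
    inter-frame dependencies), then cross-attention to EEG condition tokens."""

    def __init__(self, d=1024, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x, h_eeg):
        # x: (B, K_w, d) noisy latent frames; temporal positional encoding added upstream
        # h_eeg: (B, T_e, d) EEG features from the temporal convolutional encoder
        q = self.n1(x)
        x = x + self.self_attn(q, q, q)[0]                    # inter-frame temporal dependencies
        x = x + self.cross_attn(self.n2(x), h_eeg, h_eeg)[0]  # EEG condition injection
        return x + self.mlp(self.n3(x))
```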
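
Second, the sampling sketch referenced in designs 2 and 3: the range/null-space replacement applied at each reverse step, written in the latent space. Because \(\mathbf{A} = \text{diag}(m_1, \ldots, m_{K_w})\) has binary entries, \(\mathbf{A}^\dagger = \mathbf{A}\) and the decomposition reduces to a per-frame blend; the toy check at the end shows why a bias-free linear encoder lets frame masking commute with compression. Names and dimensions are assumptions.

```python
import torch

def interrecon_step(x0_pred, y, mask):
    """Range/null-space replacement on the denoised estimate x_{0|n}.

    x0_pred: (K_w, d) model estimate of the clean latent sequence
    y:       (K_w, d) encoded anchor frames (zeros where unobserved)
    mask:    (K_w, 1) binary m_k; with A = diag(m_k), A^+ = A, so
             A^+ y + (I - A^+ A) x0_pred is a per-frame blend.
    """
    return mask * y + (1.0 - mask) * x0_pred

# Toy check (small stand-in dimensions): with a bias-free linear encoder,
# masking frames commutes with encoding, so the constraint imposed in the
# latent space still matches the anchor frames exactly after decoding.
W = torch.randn(500, 32)                      # stand-in linear encoder weights
X = torch.randn(10, 500)                      # 10 frames x 500 "vertices"
m = (torch.rand(10, 1) < 0.3).float()         # anchor-frame indicator
assert torch.allclose((m * X) @ W, m * (X @ W), atol=1e-5)
```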

Loss & Training

  • Denoising score matching loss: \(\mathcal{L}_{\text{diff}} = \mathbb{E}[\|\epsilon - \epsilon_\theta(\mathbf{x}^{(n)}, n, \mathbf{h}_\text{EEG})\|^2]\)
  • Diffusion parameters: 1000 timesteps, linear noise schedule; DDIM with 50 steps at inference.
  • Architecture: 6-layer Transformer, 8-head attention, hidden dimension 1024.
  • Training: AdamW, lr \(= 1\times10^{-4}\), batch size 32, 200 epochs, single A100 GPU.
  • Within-subject training with 80/20 split; test set contains unseen video clips.
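
A minimal training-step sketch implementing \(\mathcal{L}_{\text{diff}}\) with the stated 1000-step linear schedule; the β range follows common DDPM defaults and, like the `model` interface, is an assumption.

```python
import torch
import torch.nn.functional as F

# 1000-step linear noise schedule, as stated above (beta range assumed).
N = 1000
betas = torch.linspace(1e-4, 0.02, N)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model, z0, h_eeg):
    """L_diff = E[ ||eps - eps_theta(x^(n), n, h_EEG)||^2 ].

    z0:    (B, K_w, d) clean latent fMRI sequence
    h_eeg: (B, T_e, d) EEG condition tokens
    model: stand-in for the EEG-conditioned diffusion Transformer
    """
    B = z0.shape[0]
    n = torch.randint(0, N, (B,))                 # random diffusion step per sample
    a = alphas_bar[n].view(B, 1, 1)
    eps = torch.randn_like(z0)
    zn = a.sqrt() * z0 + (1.0 - a).sqrt() * eps   # forward noising to x^(n)
    return F.mse_loss(model(zn, n, h_eeg), eps)
```

At inference, the stated 50-step DDIM sampler would replace ancestral sampling.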

Key Experimental Results

Main Results

Dynamic fMRI frame reconstruction (averaged over 6 subjects, sequence length 10, whole-brain):

Method       MSE ↓    r ↑      Cos ↑
CNN-TC       0.315    0.804    0.824
CNN-TAG      0.309    0.810    0.829
E2FNet       0.297    0.819    0.836
E2FGAN       0.290    0.822    0.839
Ours         0.277    0.824    0.849

Visual cortex (V1) sub-region, 10 frames: MSE 0.193, r 0.834, Cos 0.887.

Ablation Study

Intermediate frame reconstruction (InterRecon) comparison:

Method                  MSE ↓    r ↑      Cos ↑
Linear interpolation    0.280    0.830    0.851
Ours w/o null space     0.272    0.839    0.852
Ours w/ null space      0.250    0.852    0.865

Relative to the variant without null-space constraints, null-space sampling reduces MSE by 8.1% and improves r and Cos by 1.5% each.

Key Findings

  • Strong temporal robustness: As sequence length increases from 3 to 30 frames, the proposed method's whole-brain MSE changes only marginally from 0.282 to 0.281, whereas CNN-TC degrades from 0.302 to 0.322. This demonstrates that joint spatiotemporal modeling effectively captures long-range temporal dependencies.
  • Superior performance in functional regions: Reconstruction metrics in the visual and auditory cortices substantially exceed the whole-brain average, consistent with neuroscientific expectations that these regions are strongly driven during a movie-watching task.
  • Downstream visual decoding validation: Applying the CineSync-f decoder to reconstructed fMRI recovers coarse semantic structures of scenes (characters, poses, scene layout), confirming that the reconstructions preserve functional neural representations.
  • Null-space sampling requires no retraining: it uses the same model checkpoint as direct reconstruction and changes only the sampling strategy.

Highlights & Insights

  • Paradigm shift to spatiotemporal frames: Reframing fMRI reconstruction from "independent per-frame prediction" to "sequence modeling" represents a conceptually significant advance. Unified self-attention across temporal and spatial dimensions captures more complete neural dynamics than prior purely spatial approaches.
  • Dual value of null-space sampling: It serves both as a practical tool for missing frame imputation (a common need in real fMRI preprocessing) and as an intrinsic evaluation of temporal consistency — verifying whether the model has learned genuine temporal dependencies without requiring additional metrics.
  • Elegant constraint of the linear autoencoder: Trading expressive power for preserved mathematical properties is an elegant engineering decision, ensuring that the null-space decomposition holds exactly in the latent space.

Limitations & Future Work

  • Within-subject training: Models are currently trained per subject and cannot generalize across subjects. Cross-subject modeling requires stronger anatomical or functional alignment methods.
  • Fixed EEG-fMRI delay: A fixed 4s delay is assumed, whereas real hemodynamic delays vary across brain regions and time. A learnable alignment module could be explored.
  • The expressiveness of the linear autoencoder is limited; mildly nonlinear designs that preserve the null-space decomposition property could be investigated.
  • Quantitative evaluation of downstream visual decoding is limited, with only qualitative visualizations provided.

Comparison with Related Work

  • vs. NeuroBOLT: ROI-level modeling naturally achieves temporal consistency but at low spatial resolution (hundreds of regions vs. 91,282 vertices). This work fills the gap of "high spatial resolution + temporal consistency."
  • vs. CATD: A cortical-level fMRI translation model that reconstructs frames independently. This work demonstrates that sequence modeling outperforms per-frame methods on both whole-brain and functional region metrics.
  • vs. image diffusion models: Drawing on ideas from DiT and null-space diffusion sampling, this work applies them to high-dimensional spatiotemporal data in neuroscience, showcasing the potential of diffusion models for scientific data modeling.

Rating

  • Novelty: ⭐⭐⭐⭐ Reframes fMRI reconstruction as spatiotemporal sequence generation; null-space sampling enables intermediate frame reconstruction without retraining.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive validation across multiple sequence lengths, brain regions, baselines, InterRecon, and downstream decoding; limited to 6 subjects on a single dataset.
  • Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are rigorous and the method is clearly described, though the presentation may skew toward machine learning readers over neuroscience audiences.
  • Value: ⭐⭐⭐⭐ Introduces a new paradigm for joint multimodal neuroimaging modeling, though the application domain is relatively niche.