Modeling Spatiotemporal Neural Frames for High Resolution Brain Dynamics¶
Conference: CVPR 2026
arXiv: 2603.24176
Code: None
Area: 3D Vision
Keywords: EEG-to-fMRI, Diffusion Models, Spatiotemporal Modeling, Intermediate Frame Reconstruction, Visual Decoding
TL;DR¶
The paper proposes an EEG-conditioned diffusion Transformer for fMRI reconstruction that models brain activity as a spatiotemporal sequence of neural frames rather than as independent snapshots. The method reconstructs fMRI at cortical vertex-level resolution with spatiotemporal consistency, supports intermediate-frame interpolation via null-space constrained sampling, and demonstrates on downstream visual decoding tasks that the reconstructions preserve functionally meaningful information.
Background & Motivation¶
- Background: fMRI offers high spatial resolution cortical representations but is costly to acquire; EEG provides millisecond-level temporal resolution but with low spatial precision. EEG-to-fMRI translation aims to leverage the complementarity of both modalities by inferring fMRI-level spatial patterns from EEG.
- Limitations of Prior Work: (1) ROI-level methods (e.g., NeuroBOLT) can model temporal continuity but suffer from low spatial resolution; (2) voxel/cortical-level methods (CNN-TC, CATD, etc.) achieve high spatial fidelity but reconstruct frames independently, lacking temporal consistency; (3) evaluation relies solely on low-level metrics such as MSE/SSIM, which cannot assess whether reconstructed fMRI preserves functionally meaningful neural information.
- Key Challenge: High spatial resolution and temporal continuity are difficult to achieve simultaneously — independent reconstruction ensures spatial accuracy but introduces inter-frame artifacts, while sequence modeling ensures temporal continuity but is constrained by spatial granularity.
- Goal: To reconstruct temporally continuous and consistent fMRI frame sequences at the high spatial resolution of 91,282 cortical vertices.
- Key Insight: Brain activity is modeled as evolving spatiotemporal neural frames (rather than independent snapshots), using a diffusion Transformer to jointly capture vertex-level spatial detail and inter-frame temporal dependencies.
- Core Idea: An EEG-guided diffusion Transformer generates spatiotemporally consistent fMRI sequences, with null-space constrained sampling enabling intermediate frame reconstruction.
Method¶
Overall Architecture¶
Input: A temporally aligned EEG window sequence \(\mathbf{S}\) (64 channels, 1000 Hz, with a 4s hemodynamic delay relative to fMRI). Output: An fMRI sequence of \(K_w\) frames \(\mathbf{X} \in \mathbb{R}^{K_w \times N_v}\) (\(N_v = 91{,}282\) cortical vertices). Core pipeline: EEG features are extracted via a temporal encoder → a linear fMRI autoencoder compresses the spatial dimension → a diffusion Transformer performs EEG-conditioned denoising in the low-dimensional space → decoding recovers vertex-level fMRI. At inference, two modes are supported: direct reconstruction and null-space constrained intermediate frame reconstruction (InterRecon).
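To make the pipeline concrete, here is a minimal PyTorch sketch. Module names (LinearAE, EEGEncoder, DiTBlock, Denoiser) and the convolutional kernel/stride choices are hypothetical; the dimensions (64 EEG channels, \(N_v = 91{,}282\) vertices, 1024-dim latent, 6 layers, 8 heads) follow the summary above. For brevity, each compressed frame is treated as a single token rather than tokenizing at the vertex level as the paper describes.

```python
import torch
import torch.nn as nn

K_W, N_V, D = 10, 91_282, 1024  # frames per window, cortical vertices, latent dim

class LinearAE(nn.Module):
    """Linear encoder/decoder; linearity keeps the range/null-space split exact."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(N_V, D, bias=False)
        self.dec = nn.Linear(D, N_V, bias=False)

class EEGEncoder(nn.Module):
    """Temporal-convolutional encoder for 64-channel, 1000 Hz EEG windows."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(64, 256, kernel_size=25, stride=5), nn.GELU(),
            nn.Conv1d(256, D, kernel_size=25, stride=5), nn.GELU(),
        )

    def forward(self, eeg):                    # eeg: (B, 64, T_samples)
        return self.net(eeg).transpose(1, 2)   # (B, T_feat, D) condition tokens

class DiTBlock(nn.Module):
    """Self-attention across frame tokens + cross-attention to EEG features."""
    def __init__(self, heads=8):
        super().__init__()
        self.sa = nn.MultiheadAttention(D, heads, batch_first=True)
        self.ca = nn.MultiheadAttention(D, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))
        self.ln1, self.ln2, self.ln3 = nn.LayerNorm(D), nn.LayerNorm(D), nn.LayerNorm(D)

    def forward(self, x, h_eeg):               # x: (B, K_W, D) noisy frame latents
        h = self.ln1(x)
        x = x + self.sa(h, h, h)[0]            # inter-frame temporal dependencies
        x = x + self.ca(self.ln2(x), h_eeg, h_eeg)[0]  # EEG condition injection
        return x + self.ff(self.ln3(x))

class Denoiser(nn.Module):
    """6-layer conditional Transformer predicting the noise eps_theta."""
    def __init__(self, n_layers=6):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, K_W, D))  # temporal positional encoding
        self.t_emb = nn.Sequential(nn.Linear(1, D), nn.SiLU(), nn.Linear(D, D))
        self.blocks = nn.ModuleList([DiTBlock() for _ in range(n_layers)])

    def forward(self, x_n, n, h_eeg):          # n: (B,) diffusion timesteps
        x = x_n + self.pos + self.t_emb(n.float().view(-1, 1)).unsqueeze(1)
        for blk in self.blocks:
            x = blk(x, h_eeg)
        return x                               # predicted noise, (B, K_W, D)
```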
Key Designs¶
- Spatiotemporal Tokenization and EEG Condition Injection:
- Function: Joint modeling of spatial and temporal dimensions.
- Mechanism: The \(K_w\)-frame fMRI sequence is tokenized into \((K_w \times N_v)\) vertex-level tokens, each appended with a temporal positional encoding to distinguish different frames. EEG features extracted by a temporal convolutional encoder are injected into vertex tokens via cross-attention at each Transformer layer. This enables the model to simultaneously account for spatial structure and EEG-guided temporal patterns during denoising.
- Design Motivation: Independent per-frame modeling leads to inter-frame inconsistency. Treating multiple frames as a unified sequence allows self-attention to naturally capture inter-frame temporal dependencies.
- Null-Space Sampling for InterRecon:
- Function: Reconstructing arbitrary intermediate frames from sparse anchor frames without retraining.
- Mechanism: Sparse observations are modeled as linear measurements \(\mathbf{y} = \mathbf{A}\mathbf{X}\), where \(\mathbf{A} = \text{diag}(m_1,...,m_{K_w})\) indicates known frames. At each reverse diffusion step, the denoised estimate is decomposed into a range-space component (enforcing consistency with anchor frames) and a null-space component (preserving generative freedom): \(\hat{\mathbf{x}}_{0|n} = \mathbf{A}^\dagger \mathbf{y} + (\mathbf{I} - \mathbf{A}^\dagger \mathbf{A})\mathbf{x}_{0|n}\). This guarantees exact anchor frame matching while allowing intermediate frames to be freely generated.
- Design Motivation: Missing or corrupted frames are common in real fMRI acquisition. The null-space approach decouples observation constraints from generative freedom, enabling adaptation to diverse interpolation scenarios without retraining; it also serves as an intrinsic evaluation of temporal consistency. A minimal sketch of the projection step follows this list.
- Linear fMRI Autoencoder:
- Function: Dimensionality reduction for efficient diffusion modeling while preserving the null-space decomposition property.
- Mechanism: A linear MLP maps each \(N_v = 91{,}282\)-dimensional fMRI frame to a 1024-dimensional latent representation. Both encoder and decoder are linear transformations, trained end-to-end with the diffusion model.
- Design Motivation: Linearity ensures that the null-space projection \((\mathbf{I} - \mathbf{A}^\dagger \mathbf{A})\) remains exact in the compressed space. A nonlinear autoencoder would break the range–null-space decomposition.
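As noted above, here is a minimal sketch of the null-space projection applied at each reverse step. It assumes the frame mask \(\mathbf{A} = \text{diag}(m_1,\dots,m_{K_w})\) acts per frame in the latent space (valid because the per-frame linear autoencoder commutes with a frame-level mask); the function name `null_space_step` is hypothetical. Since \(\mathbf{A}\) is a binary diagonal matrix, \(\mathbf{A}^\dagger = \mathbf{A}\), and the projection reduces to keeping anchor frames from \(\mathbf{y}\) and everything else from the model estimate.

```python
import torch

def null_space_step(x0_pred, y, mask):
    """Range/null-space projection: x_hat = A^+ y + (I - A^+ A) x0_pred.

    x0_pred: (B, K_w, D) denoised estimate x_{0|n} from the diffusion model
    y:       (B, K_w, D) observed anchor frames (zeros where unobserved)
    mask:    (K_w,) binary; m_k = 1 if frame k is a known anchor
    """
    m = mask.view(1, -1, 1).to(x0_pred.dtype)
    return m * y + (1.0 - m) * x0_pred  # anchors kept exactly, rest generated

# Example: 10-frame window with anchors at frames 0, 5, and 9.
B, K_w, D = 2, 10, 1024
mask = torch.zeros(K_w)
mask[[0, 5, 9]] = 1.0
x_true = torch.randn(B, K_w, D)             # stand-in for encoded ground truth
y = mask.view(1, -1, 1) * x_true            # linear measurements y = A X
x0_pred = torch.randn(B, K_w, D)            # model's current denoised estimate
x_hat = null_space_step(x0_pred, y, mask)
assert torch.allclose(x_hat[:, [0, 5, 9]], x_true[:, [0, 5, 9]])  # exact anchors
```

Because the projection touches only the sampling loop, the same trained checkpoint serves both direct reconstruction and InterRecon.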
Loss & Training¶
- Denoising score matching loss: \(\mathcal{L}_{\text{diff}} = \mathbb{E}[\|\epsilon - \epsilon_\theta(\mathbf{x}^{(n)}, n, \mathbf{h}_\text{EEG})\|^2]\)
- Diffusion parameters: 1000 timesteps, linear noise schedule; DDIM with 50 steps at inference (a training-step sketch follows this list).
- Architecture: 6-layer Transformer, 8-head attention, hidden dimension 1024.
- Training: AdamW, lr \(= 1\times10^{-4}\), batch size 32, 200 epochs, single A100 GPU.
- Within-subject training with 80/20 split; test set contains unseen video clips.
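A minimal sketch of one training step under these settings. The noise-schedule endpoints (1e-4 to 2e-2) are assumed standard values, not taken from the paper; `Denoiser` refers to the hypothetical module sketched earlier.

```python
import torch

N_STEPS = 1000
betas = torch.linspace(1e-4, 2e-2, N_STEPS)     # linear schedule (assumed endpoints)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, x0, h_eeg):
    """L_diff = E[ || eps - eps_theta(x_n, n, h_EEG) ||^2 ]."""
    B = x0.shape[0]
    n = torch.randint(0, N_STEPS, (B,), device=x0.device)
    eps = torch.randn_like(x0)
    ab = alphas_bar.to(x0.device)[n].view(B, 1, 1)
    x_n = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps  # forward noising of latents
    return ((eps - denoiser(x_n, n, h_eeg)) ** 2).mean()

# Optimizer as reported: AdamW with lr = 1e-4.
# opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)
```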
Key Experimental Results¶
Main Results¶
Dynamic fMRI frame reconstruction (averaged over 6 subjects, sequence length 10, whole-brain):
| Method | MSE ↓ | r ↑ | Cos ↑ |
|---|---|---|---|
| CNN-TC | 0.315 | 0.804 | 0.824 |
| CNN-TAG | 0.309 | 0.810 | 0.829 |
| E2FNet | 0.297 | 0.819 | 0.836 |
| E2FGAN | 0.290 | 0.822 | 0.839 |
| Ours | 0.277 | 0.824 | 0.849 |
Visual cortex (V1) sub-region, 10 frames: MSE 0.193, r 0.834, Cos 0.887.
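For reference, a sketch of how the three metrics in these tables might be computed per window; the exact averaging convention (per frame vs. per vertex, then over subjects) is an assumption, as the summary does not specify it.

```python
import torch
import torch.nn.functional as F

def frame_metrics(pred, target):
    """MSE (lower is better), Pearson r and cosine similarity (higher is better).

    pred, target: (K_w, N_v) reconstructed vs. ground-truth vertex activations.
    """
    mse = ((pred - target) ** 2).mean()
    pc = pred - pred.mean(dim=1, keepdim=True)      # center over vertices
    tc = target - target.mean(dim=1, keepdim=True)
    r = (pc * tc).sum(dim=1) / (pc.norm(dim=1) * tc.norm(dim=1))
    cos = F.cosine_similarity(pred, target, dim=1)  # uncentered counterpart of r
    return mse.item(), r.mean().item(), cos.mean().item()
```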
Ablation Study¶
Intermediate frame reconstruction (InterRecon) comparison:
| Method | MSE ↓ | r ↑ | Cos ↑ |
|---|---|---|---|
| Linear interpolation | 0.280 | 0.830 | 0.851 |
| Ours w/o null space | 0.272 | 0.839 | 0.852 |
| Ours w/ null space | 0.250 | 0.852 | 0.865 |
Null-space constraints yield consistent improvements over the variant without them: MSE reduced by 8.1% (0.272 → 0.250), r improved by 1.5%, Cos improved by 1.5%.
Key Findings¶
- Strong temporal robustness: As sequence length increases from 3 to 30 frames, the proposed method's whole-brain MSE changes only marginally from 0.282 to 0.281, whereas CNN-TC degrades from 0.302 to 0.322. This demonstrates that joint spatiotemporal modeling effectively captures long-range temporal dependencies.
- Superior performance in functional regions: Reconstruction metrics in the visual and auditory cortices substantially exceed the whole-brain average, consistent with neuroscientific expectations that these regions are strongly driven during a movie-watching task.
- Downstream visual decoding validation: Applying the CineSync-f decoder to reconstructed fMRI recovers coarse semantic structures of scenes (characters, poses, scene layout), confirming that the reconstructions preserve functional neural representations.
- Null-space sampling requires no retraining — it uses the identical model checkpoint as direct reconstruction, modifying only the sampling strategy.
Highlights & Insights¶
- Paradigm shift to spatiotemporal frames: Reframing fMRI reconstruction from "independent per-frame prediction" to "sequence modeling" represents a conceptually significant advance. Unified self-attention across temporal and spatial dimensions captures more complete neural dynamics than prior purely spatial approaches.
- Dual value of null-space sampling: It serves both as a practical tool for missing frame imputation (a common need in real fMRI preprocessing) and as an intrinsic evaluation of temporal consistency — verifying whether the model has learned genuine temporal dependencies without requiring additional metrics.
- Elegant constraint of the linear autoencoder: Trading expressive power for preserved mathematical properties is an elegant engineering decision, ensuring that the null-space decomposition holds exactly in the latent space.
Limitations & Future Work¶
- Within-subject training: Models are currently trained per subject and cannot generalize across subjects. Cross-subject modeling requires stronger anatomical or functional alignment methods.
- Fixed EEG-fMRI delay: A fixed 4s delay is assumed, whereas real hemodynamic delays vary across brain regions and time. A learnable alignment module could be explored.
- Linear autoencoder expressiveness: The expressiveness of the linear autoencoder is limited; mildly nonlinear designs that preserve the null-space decomposition property could be investigated.
- Downstream evaluation: Quantitative evaluation of downstream visual decoding is limited, with only qualitative visualizations provided.
Related Work & Insights¶
- vs. NeuroBOLT: ROI-level modeling naturally achieves temporal consistency but at low spatial resolution (hundreds of regions vs. 91,282 vertices). This work fills the gap of "high spatial resolution + temporal consistency."
- vs. CATD: A cortical-level fMRI translation model that reconstructs frames independently. This work demonstrates that sequence modeling outperforms per-frame methods on both whole-brain and functional region metrics.
- vs. image diffusion models: Drawing on ideas from DiT and null-space diffusion sampling, this work applies them to high-dimensional spatiotemporal data in neuroscience, showcasing the potential of diffusion models for scientific data modeling.
Rating¶
- Novelty: ⭐⭐⭐⭐ Reframes fMRI reconstruction as spatiotemporal sequence generation; null-space sampling enables intermediate frame reconstruction without retraining.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive validation across multiple sequence lengths, brain regions, baselines, InterRecon, and downstream decoding; limited to 6 subjects on a single dataset.
- Writing Quality: ⭐⭐⭐⭐ Mathematical derivations are rigorous and the method is clearly described, though the presentation may skew toward machine learning readers over neuroscience audiences.
- Value: ⭐⭐⭐⭐ Introduces a new paradigm for joint multimodal neuroimaging modeling, though the application domain is relatively niche.