NeuroFlow: Toward Unified Visual Encoding and Decoding from Neural Activity
Conference: CVPR 2026
arXiv: 2604.09817
Code: https://michaelmaiii.github.io/NeuroFlow-S (Project Page)
Area: Medical Imaging / Brain-Computer Interface / fMRI Visual Decoding
Keywords: Visual Encoding & Decoding, fMRI, Flow Matching, Cross-modal Alignment, Variational Autoencoder
TL;DR
NeuroFlow unifies "image-to-brain" (encoding) and "brain-to-image" (decoding) into a single flow model. It utilizes a variational backbone, NeuroVAE, to compress fMRI into a semantically structured latent space, followed by Cross-modal Flow Matching (XFM) to learn a reversible continuous flow between visual and neural latent distributions. By integrating forward (encoding) and backward (decoding) paths, it achieves SOTA or comparable performance on both tasks with only approximately 25% of the parameters of MindEye2.
Background & Motivation
Background: Understanding neural encoding (external stimulus \(\to\) neural activity) and decoding (neural activity \(\to\) stimulus) is central to neuroscience and Brain-Computer Interfaces (BCI). fMRI is a mainstream measurement due to its high spatial resolution. Currently, encoding and decoding are treated as two independent tasks: encoding models (e.g., SynBrain, MindSimulator) use pretrained visual features and voxel regression to predict brain responses, while decoding models (e.g., MindEye, MindEye2, BrainDiffuser) map brain signals to visual-language embedding spaces like CLIP to reconstruct images via generative models.
Limitations of Prior Work: Even efforts that bridge both directions typically use two independent networks. These either map at fine granularity in pixel-voxel space, which yields blurry reconstructions lacking semantics, or bridge the neural and visual latent spaces but still fit two independent linear regressions, one for encoding and one for decoding. Without a shared representation, the two complementary processes cannot be trained jointly, and their consistency is never modeled.
Key Challenge: Encoding and decoding are essentially the forward and inverse directions of the same mapping: in Bayesian terms, encoding models the likelihood of neural activity given the visual representation, and decoding infers the posterior over visual representations given neural activity. Current paradigms decouple them. A deeper issue lies in the prevalent "conditional noise-to-data diffusion" strategy, which only establishes a stochastic mapping between an empirical distribution and Gaussian noise and suffers from a training-inference distribution gap.
Goal: To unify encoding and decoding within a single model satisfying two essential properties: (i) Shared Latent Space, where both processes are jointly optimized; and (ii) Encoding-Decoding Consistency, ensuring synthesized neural signals can be reversibly reconstructed into coherent images.
Core Idea: Rewrite encoding and decoding as a time-dependent, reversible flow within a shared latent space. Forward integration (\(z_v \to z_n\)) performs encoding, and backward integration (\(z_n \to z_v\)) performs decoding. The tasks are distinguished solely by the direction of time, abandoning the noise-started, conditional guidance paradigm of diffusion.
Method
Overall Architecture
NeuroFlow aims to use a single model to generate brain signals from images and reconstruct images from brain signals. It consists of three components: a frozen visual backbone (CLIP encoding + UnCLIP decoding) for transitions between images and semantic latents; a trainable neural backbone (NeuroVAE) that projects fMRI into a semantically aligned probabilistic latent space; and the Cross-modal Flow Matching (XFM), which learns a reversible flow between the visual latent distribution \(z_v\) and the neural latent distribution \(z_n\).
Training occurs in two stages, followed by an inference stage: Stage-1 trains NeuroVAE to establish a structured neural latent space; Stage-2 freezes both backbones to train the XFM flow; Stage-3 uses the same XFM flow for inference—forward in time for encoding and backward for decoding.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
IMG["Visual Stimulus Image"] --> VB["Frozen Visual Backbone<br/>CLIP Encoder → Visual Latent z_v"]
FMRI["fMRI Voxel Signal"] --> NV["NeuroVAE<br/>Variational Backbone<br/>fMRI → Neural Latent z_n / z_c"]
VB --> XFM["Cross-modal Flow Matching XFM<br/>Reversible Continuous Flow z_v ↔ z_n"]
NV --> XFM
XFM -->|"Forward Integration Δt>0<br/>z_v→z_n (Encoding)"| ENC["Synthetic fMRI Activity"]
XFM -->|"Backward Integration Δt<0<br/>z_n→z_v (Decoding)"| DEC["UnCLIP Reconstructed Image"]
Key Designs
1. NeuroVAE: Compressing fMRI into a Semantically Structured Neural Latent Space
To share a latent space, fMRI must be represented in a clean, semantically aligned, and sampleable form rather than as raw, noisy voxel-level signals. NeuroVAE is a Variational Autoencoder in which the neural encoder \(E_n\) estimates a posterior \(q(z_n \mid x_{\text{fMRI}})\) over latents \(z_n \in \mathbb{R}^{m\times d}\) from the fMRI signal \(x_{\text{fMRI}}\). A linear projection aggregates the channel dimension (\(m\)) into a compact latent vector \(z_c \in \mathbb{R}^{1\times d}\), while the decoder \(D_n\) reconstructs \(\hat{x}_{\text{fMRI}}\). Keeping two latents, \(z_n\) and \(z_c\), decouples the encoding path from the decoding process. Probabilistic modeling captures the one-to-many relationship of brain responses to the same stimulus, while the Gaussian prior keeps the latent space smooth enough for continuous flow matching.
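The sampling and aggregation steps described above can be sketched in a few lines. This is only an illustration in plain Python under stated assumptions: the function names `reparameterize` and `aggregate`, the toy sizes, and the uniform aggregation weights are all hypothetical, and the standard VAE reparameterization trick is assumed for drawing \(z_n\) from the posterior.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element by element."""
    return [
        [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu_row, lv_row)]
        for mu_row, lv_row in zip(mu, log_var)
    ]


def aggregate(z_n, weights):
    """Linear projection over the m axis: (m x d) latent -> one d-dim vector z_c."""
    d = len(z_n[0])
    return [sum(w * row[j] for w, row in zip(weights, z_n)) for j in range(d)]


rng = random.Random(0)
m, d = 4, 3                                   # hypothetical toy sizes
mu = [[float(i + j) for j in range(d)] for i in range(m)]
log_var = [[-2.0] * d for _ in range(m)]      # stand-in for the encoder's log-variance
z_n = reparameterize(mu, log_var, rng)        # one stochastic sample of the m x d latent
z_c = aggregate(z_n, [1.0 / m] * m)           # uniform weights stand in for learned ones
print(len(z_n), len(z_n[0]), len(z_c))
```

Sampling twice from the same `mu` and `log_var` gives different \(z_n\), which is how a probabilistic latent can express several plausible brain responses to one stimulus.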
2. Cross-modal Flow Matching (XFM): Encoding and Decoding as a Reversible Flow
This is the core innovation, addressing the unidirectional nature and training-inference gap of diffusion. XFM learns a time-dependent vector field \(v_\theta(z,t)\) directly between the visual latent distribution \(z_v\) and the neural latent distribution \(z_n\), satisfying the ODE:
\[\frac{dz_t}{dt} = v_\theta(z_t, t)\]
with the visual latent \(z_v\) at \(t=0\) and the neural latent \(z_n\) at \(t=1\).
The vector field is parameterized by a Scalable Interpolant Transformer (SiT). Intermediate states are defined by cosine interpolation \(z_t = \alpha_t z_0 + \sigma_t z_1\), with \(\alpha_t = \cos^2(\frac{\pi}{2}t)\) and \(\sigma_t = \sin^2(\frac{\pi}{2}t)\). Reversibility is guaranteed by the uniqueness of the ODE solution: forward integration (\(\Delta t > 0\)) implements \(z_v \to z_n\) (encoding), and backward integration (\(\Delta t < 0\)) implements \(z_n \to z_v\) (decoding).
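To make the interpolation schedule concrete, the sketch below evaluates \(\alpha_t\), \(\sigma_t\), and \(z_t\) for a toy pair of latents. It assumes \(z_0\) is the visual latent and \(z_1\) the neural latent, which follows from forward integration implementing \(z_v \to z_n\); the vectors and the helper name `interpolate` are hypothetical.

```python
import math


def alpha(t):
    return math.cos(0.5 * math.pi * t) ** 2


def sigma(t):
    return math.sin(0.5 * math.pi * t) ** 2


def interpolate(z0, z1, t):
    """Intermediate state z_t = alpha_t * z0 + sigma_t * z1."""
    a, s = alpha(t), sigma(t)
    return [a * x0 + s * x1 for x0, x1 in zip(z0, z1)]


z_v = [1.0, -2.0, 0.5]   # toy visual latent (t = 0 endpoint)
z_n = [0.0, 4.0, 1.5]    # toy neural latent (t = 1 endpoint)
for t in (0.0, 0.5, 1.0):
    print(t, [round(x, 6) for x in interpolate(z_v, z_n, t)])
```

Because \(\alpha_t + \sigma_t = 1\) for every \(t\), each intermediate state is a convex combination of the two endpoints: the path starts at \(z_v\), passes through their average at \(t = 0.5\), and ends at \(z_n\).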
3. Contrastive + Cycle-consistency Alignment: Coarse Distribution Alignment
XFM requires initial coarse alignment between distributions to learn a vector field efficiently. NeuroVAE employs two contrastive objectives: \(\mathcal{L}_{\text{clip}} = \text{SoftCLIP}(z_n, z_v)\) to align neural latents with visual semantics, and \(\mathcal{L}_{\text{cyc}} = \text{SoftCLIP}(\hat{z}_n, z_v)\) (where \(\hat{z}_n = E_n(\hat{x}_{\text{fMRI}})\)) to ensure reconstructed fMRI signals maintain semantic consistency. Ablations show that without these, retrieval accuracy drops from 86.4% to 0.3%, indicating that this coarse alignment is a prerequisite for XFM.
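For intuition, here is a minimal sketch of a soft contrastive loss of the kind the name SoftCLIP suggests. It is modeled on the soft-target formulation popularized by MindEye, where the targets are a softmax over visual-visual similarities instead of one-hot labels. Whether NeuroFlow uses exactly this form is an assumption, and the temperature value, function name, and flattened toy latents are hypothetical.

```python
import math


def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _sim(a, b, temp):
    return [[sum(x * y for x, y in zip(u, v)) / temp for v in b] for u in a]


def _softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    return [x / sum(e) for x in e]


def _log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def soft_clip_loss(z_n, z_v, temp=0.125):
    """Symmetric cross-entropy between neural-visual logits and soft visual targets."""
    z_n = [_normalize(v) for v in z_n]
    z_v = [_normalize(v) for v in z_v]
    targets = [_softmax(r) for r in _sim(z_v, z_v, temp)]   # soft labels, not one-hot
    logits = _sim(z_n, z_v, temp)
    logits_t = [list(col) for col in zip(*logits)]

    def cross_entropy(lg):
        rows = zip(targets, lg)
        return -sum(sum(t * lp for t, lp in zip(tr, _log_softmax(lr))) for tr, lr in rows) / len(lg)

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits_t))


z_v = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]                 # toy visual latents
aligned = soft_clip_loss(z_v, z_v)                          # neural latents match their images
shuffled = soft_clip_loss([z_v[1], z_v[2], z_v[0]], z_v)    # neural latents mismatched
print(aligned < shuffled)
```

Under this sketch, the cycle term would reuse the same function with \(\hat{z}_n\) in place of \(z_n\).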
Loss & Training
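Stage-1 (NeuroVAE) uses a composite objective:
\[\mathcal{L}_{\text{NeuroVAE}} = \mathcal{L}_{\text{mse}} + \alpha\,\mathcal{L}_{\text{kl}} + \beta\,\mathcal{L}_{\text{clip}} + \lambda\,\mathcal{L}_{\text{cyc}}\]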
\(\mathcal{L}_{\text{mse}}\) maintains voxel fidelity, \(\mathcal{L}_{\text{kl}}\) regularizes the posterior toward \(\mathcal{N}(0,I)\), while \(\mathcal{L}_{\text{clip}}\)/\(\mathcal{L}_{\text{cyc}}\) handle semantic alignment. Weights are set to \(\alpha=0.001\), \(\beta=\lambda=1000\).
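The weighting can be illustrated with a small numeric sketch. It assumes that \(\alpha\) scales the KL term while \(\beta\) and \(\lambda\) scale the two contrastive terms (the pairing is inferred from the order in which the terms and weights are listed), and it uses the standard closed-form KL divergence between a diagonal Gaussian and \(\mathcal{N}(0,I)\). The function names and the input loss values are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def stage1_loss(l_mse, l_kl, l_clip, l_cyc, alpha=0.001, beta=1000.0, lam=1000.0):
    """Weighted sum of the four Stage-1 terms (weight pairing is an assumption)."""
    return l_mse + alpha * l_kl + beta * l_clip + lam * l_cyc


l_kl = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])   # a unit shift in one dimension
print(l_kl)
print(stage1_loss(l_mse=0.5, l_kl=10.0, l_clip=0.002, l_cyc=0.001))
```

With these weights a KL value of 10 contributes only 0.01 to the total, whereas a contrastive value of 0.002 contributes 2, so the semantic alignment terms dominate the regularizer by design.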
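Stage-2 (XFM) minimizes the flow matching error under uniform time sampling:
\[\mathcal{L}_{\text{XFM}} = \mathbb{E}_{t \sim \mathcal{U}(0,1),\,(z_0, z_1)} \left\| v_\theta(z_t, t) - \left(\dot{\alpha}_t z_0 + \dot{\sigma}_t z_1\right) \right\|^2\]
where \(z_t = \alpha_t z_0 + \sigma_t z_1\) and the regression target is the time derivative of the interpolant.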
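A Monte Carlo estimate of this objective can be sketched as follows. The sketch assumes the standard interpolant form of flow matching, in which the target is the time derivative of \(z_t\); for the \(\cos^2/\sin^2\) schedule this derivative simplifies to \(\frac{\pi}{2}\sin(\pi t)\,(z_1 - z_0)\). The names `xfm_loss`, `oracle`, and `zero_field` are hypothetical, and the toy fields stand in for the SiT network.

```python
import math
import random


def z_t_and_target(z0, z1, t):
    """Return the interpolant z_t and its time derivative (the regression target)."""
    a, s = math.cos(0.5 * math.pi * t) ** 2, math.sin(0.5 * math.pi * t) ** 2
    ds = 0.5 * math.pi * math.sin(math.pi * t)   # d(sigma_t)/dt; d(alpha_t)/dt = -ds
    z_t = [a * x0 + s * x1 for x0, x1 in zip(z0, z1)]
    target = [ds * (x1 - x0) for x0, x1 in zip(z0, z1)]
    return z_t, target


def xfm_loss(v_theta, pairs, n_samples, rng):
    """Monte Carlo estimate of E || v_theta(z_t, t) - target ||^2."""
    total = 0.0
    for _ in range(n_samples):
        z0, z1 = rng.choice(pairs)
        t = rng.random()                          # uniform time sampling on [0, 1)
        z_t, target = z_t_and_target(z0, z1, t)
        pred = v_theta(z_t, t)
        total += sum((p - q) ** 2 for p, q in zip(pred, target))
    return total / n_samples


pair = ([1.0, -2.0], [0.0, 4.0])                  # toy (z_v, z_n) pair


def oracle(z, t):
    """The exact velocity for the single toy pair."""
    return [0.5 * math.pi * math.sin(math.pi * t) * (b - a) for a, b in zip(*pair)]


def zero_field(z, t):
    return [0.0 for _ in z]


print(xfm_loss(oracle, [pair], 200, random.Random(0)))
print(xfm_loss(zero_field, [pair], 200, random.Random(0)))
```

The oracle field reaches zero loss because a single pair has a unique path; with many pairs the minimizer is instead the conditional average of the per-pair velocities at each \((z_t, t)\).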
Inference uses Euler integration: \(z_{t+\Delta t} = z_t + \Delta t\, v_\theta(z_t,t)\). The model was trained on a single A100-40G GPU within 5 hours.
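The Euler update can be run in both time directions with the same routine, as the sketch below shows. The vector field here is a hypothetical 2D rotation standing in for the trained \(v_\theta\), and the step count of 1000 is an arbitrary choice for the demonstration, not a setting reported for NeuroFlow.

```python
import math


def euler_integrate(v, z, t_start, t_end, n_steps):
    """Integrate dz/dt = v(z, t) from t_start to t_end with fixed-step Euler.

    t_end > t_start gives dt > 0 (encoding direction);
    t_end < t_start gives dt < 0 (decoding direction).
    """
    dt = (t_end - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        vel = v(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, vel)]
        t += dt
    return z


def toy_field(z, t):
    """Hypothetical stand-in for v_theta: a constant-speed rotation in 2D."""
    return [-z[1], z[0]]


z_start = [1.0, 0.0]
z_fwd = euler_integrate(toy_field, z_start, 0.0, 1.0, 1000)   # forward in time
z_back = euler_integrate(toy_field, z_fwd, 1.0, 0.0, 1000)    # backward in time
print(z_fwd, z_back)
```

The round trip returns close to `z_start` but not exactly: the exact ODE is reversible, while fixed-step Euler adds a discretization error that shrinks as the number of steps grows.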
Key Experimental Results
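The Natural Scenes Dataset (NSD) was used with 4 subjects. Evaluation metrics, all reported as percentages, include two-way identification accuracy computed on Inception features (Incep) and on CLIP features (CLIP), plus retrieval accuracy from raw fMRI (Raw) and from synthetic fMRI (Syn).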
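As a rough illustration of how a retrieval score of this kind is typically computed, the sketch below measures top-1 accuracy by cosine similarity: each brain-derived latent must pick out its own image latent from a gallery. The protocol details (gallery size, which latent is compared, number of repetitions) are not given here, so the function name and the toy vectors are assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top1_retrieval_accuracy(queries, gallery):
    """Fraction of queries whose most similar gallery item shares their index."""
    hits = 0
    for i, q in enumerate(queries):
        scores = [cosine(q, g) for g in gallery]
        best = max(range(len(gallery)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(queries)


gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # toy image latents
queries = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.6, 0.0, 0.5]]   # toy brain latents
print(top1_retrieval_accuracy(queries, gallery))
```

In this toy case the third query lies closer to the first gallery item than to its own, so two of the three queries are retrieved correctly.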
Main Results
| Method | Type | Decoding Incep↑ | Decoding CLIP↑ | Encoding Incep↑ | Encoding CLIP↑ | Retrieval Raw↑ | Retrieval Syn↑ |
|---|---|---|---|---|---|---|---|
| MindSimulator | E | - | - | 93.1% | 91.2% | - | - |
| SynBrain | E | - | - | 95.7% | 94.3% | 84.8% | 92.5% |
| BrainDiffuser | D | 91.3% | 90.9% | - | - | 18.8% | - |
| MindEye | D | 94.6% | 93.3% | - | - | 90.0% | - |
| MindEye2 | D | 95.4% | 93.0% | - | - | 98.8% | - |
| NeuroFlow | E&D | 95.6% | 94.2% | 98.6% | 98.7% | 80.6% | 97.0% |
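NeuroFlow is the only unified E&D model in the comparison. It achieves the best decoding scores (Incep 95.6%, CLIP 94.2%) and outperforms the specialized encoding models (encoding CLIP 98.7% vs. 94.3% for SynBrain), while trailing MindEye and MindEye2 on raw-fMRI retrieval (80.6% vs. 90.0% and 98.8%). Notably, decoding from synthetic fMRI outperforms decoding from raw fMRI, suggesting the model distills task-relevant semantics.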
Efficiency Comparison
| Method | Type | Pretrained | Architecture | Params |
|---|---|---|---|---|
| SynBrain | E | No | VAE+Transformer | 690M |
| MindEye2 | D | Yes | Linear+MLP+DP | 2.60B |
| NeuroFlow | E&D | No | VAE+XFM | 660M |
Ablation Study (Subject 1)
| Configuration | Decoding CLIP↑ | Encoding CLIP↑ | Retrieval Raw↑ | Retrieval Syn↑ |
|---|---|---|---|---|
| Full NeuroFlow | 95.0% | 98.7% | 86.4% | 96.4% |
| w/o \(\mathcal{L}_{\text{XFM}}\) | 83.7% | 58.1% | 86.4% | 14.1% |
| w/o CLIP/Cyc | 58.8% | 51.3% | 0.3% | 0.5% |
Key Findings
- Alignment is foundational: Without contrastive alignment, retrieval collapses (86.4% \(\to\) 0.3% on raw fMRI), suggesting that XFM cannot learn a flow between distributions that are too far apart.
- XFM is the unification engine: Removing it causes a sharp drop in encoding CLIP (98.7% \(\to\) 58.1%), indicating that it is the key module bridging the two directions.
- Synthetic signals outperform raw signals: NeuroVAE acts as a de-noiser and semantic distiller, producing neural signals that yield better decoding results than the original noisy fMRI.
- Brain-consistent modeling: Synthesized fMRI preserves cortical functional selectivity (e.g., activating FFA for faces), matching known cortical organization.
Highlights & Insights
- Reversible ODE Flow: Unifying E&D via "time direction = task" is a mathematically elegant paradigm where consistency is intrinsically guaranteed by the uniqueness of ODE solutions.
- Bypassing Noise-to-Data: Creating a flow directly between two empirical distributions avoids the training-inference gap and allows decoding to start from a "semantic sketch" rather than pure noise.
- Semantic Distillation: The observation that synthetic signals are superior for decoding provides a useful insight for handling low-SNR biological data.
Limitations & Future Work
- Raw Retrieval Cost: Unified modeling sacrifices some raw fMRI retrieval accuracy compared to MindEye2.
- Semantic Focus: Metrics are primarily semantic; explicit capture and evaluation of low-level visual structures remain limited.
- Scaling: Validation is currently restricted to the NSD dataset and fMRI modality.
Related Work & Insights
- vs. MindEye2: NeuroFlow performs both encoding and decoding with roughly 25% of the parameters (660M vs. 2.60B) and no pretraining requirement, though it trails in raw retrieval (80.6% vs. 98.8%).
- vs. SynBrain: NeuroFlow replaces conditional noise-to-data diffusion with XFM, improving encoding CLIP from 94.3% to 98.7%.
Rating
- Novelty: ⭐⭐⭐⭐⭐
- Experimental Thoroughness: ⭐⭐⭐⭐
- Writing Quality: ⭐⭐⭐⭐⭐
- Value: ⭐⭐⭐⭐⭐