Zebra: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual Decoding¶
Conference: NeurIPS 2025 · arXiv: 2510.27128 · Code: GitHub · Area: Other · Keywords: fMRI decoding, zero-shot generalization, adversarial training, representation disentanglement, brain visual decoding
TL;DR¶
This paper proposes Zebra, the first zero-shot brain visual decoding framework. Zebra disentangles fMRI representations into subject-invariant and semantic-specific components via adversarial training and residual decomposition, enabling cross-subject visual reconstruction on unseen subjects without any fine-tuning.
Background & Motivation¶
fMRI-to-image reconstruction is a frontier direction at the intersection of computational neuroscience and computer vision, aiming to reverse-engineer BOLD signals from the visual cortex into the images that evoked them. However, existing methods face a critical challenge: the inability to generalize across individuals.
Current approaches (MindEye2, MindTuner, etc.) typically adopt a two-stage paradigm: pretraining a unified model on multi-subject data, followed by subject-specific fine-tuning. This paradigm suffers from severe limitations: (1) every new patient requires expert intervention for fine-tuning; (2) fine-tuning is time-consuming (roughly one day), hindering real-time brain-computer interface applications; (3) no universal feature space exists for learning neural representations across human subjects.
Core argument: Despite inter-individual variability in brain activity, the human cortex encodes semantic information in a topographically organized, cross-subject consistent manner (supported by neuroscientific evidence). Therefore, zero-shot generalization can be achieved by explicitly separating subject-invariant components from semantic-specific ones.
Existing methods all fail in the zero-shot setting: MindTuner's subject-specific design cannot be applied to unseen subjects at all, and NeuroPictor, despite mapping fMRI from different subjects into a unified shape, remains sensitive to subject-specific noise and fails to learn invariant representations.
Method¶
Overall Architecture¶
Zebra builds upon a baseline framework (fMRI-PTE encoder + unCLIP diffusion prior + SDXL decoder), augmented with two core modules: Subject-Invariant Feature Extraction (SIFE) and Semantic-Specific Feature Extraction (SSFE). Training is performed once on training-set subjects, and inference on unseen subjects is performed directly without fine-tuning.
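A minimal sketch of the zero-shot inference path implied by this architecture; every function name below is illustrative rather than the paper's actual API:

```python
import torch

@torch.no_grad()
def zebra_decode(fmri, encoder, sife, ssfe, diffusion_prior, sdxl_unclip):
    """Zero-shot inference sketch: the same trained weights are applied
    to an unseen subject, with no fine-tuning step."""
    E = encoder(fmri)            # fMRI-PTE: subject signals -> unified tokens
    E_i = sife(E)                # keep the subject-invariant part; E - E_i is
                                 # the subject-specific residual, unused here
    F_s = ssfe(E_i)              # project into the CLIP visual space
    z = diffusion_prior(F_s)     # unCLIP prior: brain-CLIP -> image embedding
    return sdxl_unclip(z)        # SDXL unCLIP decoder -> reconstructed image
```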
Key Designs¶
- Subject-Invariant Feature Extraction (SIFE): Subject-invariant features are separated via residual decomposition and adversarial training. A self-attention module \(\mathcal{F}_i\) extracts invariant features \(\bm{E}_i = \mathcal{F}_i(\bm{E})\), with the residual yielding subject-specific features \(\bm{E}_s = \bm{E} - \bm{E}_i\).
Adversarial training ensures \(\bm{E}_i\) contains no subject identity information—a subject discriminator \(\mathcal{D}_{dis}\) attempts to identify the subject from \(\bm{E}_i\), while the invariant extractor \(\mathcal{F}_i\) is trained to prevent such identification:
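(The paper's exact formula is not reproduced here; the following standard min-max form, with the symbol \(\mathcal{L}_{dis}^{\bm{E}}\) and the cross-entropy discriminator as assumptions, is a plausible reconstruction.)

\[
\min_{\mathcal{D}_{dis}}\;\max_{\mathcal{F}_i}\;\mathcal{L}_{dis}^{\bm{E}}, \qquad
\mathcal{L}_{dis}^{\bm{E}} = \mathrm{CE}\big(\mathcal{D}_{dis}(\mathcal{F}_i(\bm{E})),\, y\big),
\]

where \(y\) is the subject label: the discriminator minimizes the cross-entropy while the invariant extractor maximizes it, typically via alternating updates or a gradient reversal layer.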
A classifier \(\mathcal{D}_{cls}\) is simultaneously trained to retain subject identity in \(\bm{E}_s\) (via \(\mathcal{L}_{cls}^{\bm{E}}\)), forming a complementary constraint.
- Representation Preservation Anchor: Adversarial training may distort the original feature space. An auxiliary fMRI reconstruction task is introduced to preserve the informational integrity of the feature space:
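(The exact anchor loss is not shown here; a typical choice consistent with the description, with the MSE form and the symbol \(\mathcal{L}_{rec}\) as assumptions, is the following.)

\[
\mathcal{L}_{rec} = \big\lVert \mathcal{H}(\bm{E}) - \bm{x} \big\rVert_2^2,
\]

where \(\bm{x}\) is the input fMRI signal and \(\mathcal{H}\) is the reconstruction head described next.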
A two-layer deconvolution network with a linear prediction head reconstructs the input signal, ensuring that \(\bm{E}\) retains biological fidelity and semantic coherence under adversarial training (a code sketch of this head follows the list below).
- Semantic-Specific Feature Extraction (SSFE): Semantic information is further extracted from \(\bm{E}_i\). Vision projectors map brain features into the CLIP visual space, yielding semantic-specific features \(\bm{F}_s = \mathcal{P}_s(\bm{E}_i)\) and semantic-invariant features \(\bm{F}_i = \mathcal{P}_i(\bm{E}_s)\). A BiMixCo loss aligns \(\bm{F}_s\) with OpenCLIP embeddings (\(\mathcal{L}_{spe}^{\bm{F}}\)), while a gradient reversal layer (GRL) prevents \(\bm{F}_i\) from aligning with CLIP features (\(\mathcal{L}_{inv}^{\bm{F}}\)), forcing more semantic information to flow into \(\bm{F}_s\) (see the GRL sketch below).
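To make the GRL mechanism concrete, here is a minimal PyTorch sketch; the projector name `proj_i`, the CLIP target `clip_feat`, and the cosine-based alignment loss are assumptions, not the paper's exact formulation:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer (GRL): identity in the forward pass,
    negated (and scaled) gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Upstream layers receive the reversed gradient, so they learn to
        # *increase* whatever loss is attached downstream of the GRL.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Assumed usage in SSFE (`proj_i` and `clip_feat` are hypothetical names):
#   F_i = proj_i(E_s)
#   loss_inv = 1 - torch.cosine_similarity(grad_reverse(F_i), clip_feat, dim=-1).mean()
# Minimizing loss_inv through the GRL pushes F_i away from CLIP space,
# funneling semantic content into F_s instead.
```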
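And for the representation preservation anchor above, a minimal PyTorch sketch of a two-layer deconvolution network with a linear prediction head; all layer sizes, the 1-D token layout, and the MSE objective are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReconstructionAnchor(nn.Module):
    """Auxiliary fMRI reconstruction head: two deconvolution layers that
    upsample along the token axis, then a linear head predicting voxels."""

    def __init__(self, embed_dim=512, hidden_dim=256, num_tokens=64, voxel_dim=4096):
        super().__init__()
        self.deconv = nn.Sequential(
            # Each layer doubles the token length: N -> 2N -> 4N.
            nn.ConvTranspose1d(embed_dim, hidden_dim, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
            nn.ConvTranspose1d(hidden_dim, hidden_dim, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
        )
        self.head = nn.Linear(hidden_dim * num_tokens * 4, voxel_dim)

    def forward(self, E):                      # E: (B, N, D) encoder tokens
        x = self.deconv(E.transpose(1, 2))     # (B, hidden_dim, 4N)
        return self.head(x.flatten(1))         # (B, voxel_dim)

# anchor = ReconstructionAnchor()
# loss_rec = F.mse_loss(anchor(E), fmri_input)   # assumed MSE anchor loss
```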
Loss & Training¶
The total loss integrates seven components:
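(The paper's exact expression is not reproduced here; the following grouping, with assumed symbols for the adversarial term \(\mathcal{L}_{dis}^{\bm{E}}\) and the anchor term \(\mathcal{L}_{rec}\), is a plausible reconstruction: its four explicit terms plus the three inside \(\mathcal{L}_{sem}\) account for the seven components.)

\[
\mathcal{L} = \mathcal{L}_{dis}^{\bm{E}} + \mathcal{L}_{rec} + \mathcal{L}_{spe}^{\bm{F}} + \mathcal{L}_{inv}^{\bm{F}} + \lambda\,\mathcal{L}_{sem},
\]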
where \(\mathcal{L}_{sem} = \mathcal{L}_{cls} + \mathcal{L}_{\text{CLIP}_v} + \mathcal{L}_{\text{CLIP}_t}\) and \(\lambda=30\). Training runs for 60 epochs on 8 H800 GPUs with batch size 128, AdamW optimizer, and learning rate 1e-4. Inference uses a two-stage SDXL unCLIP decoding pipeline.
Key Experimental Results¶
Main Results (NSD dataset, average over subjects 1/2/5/7)¶
| Method | Training | PixCorr↑ | SSIM↑ | Alex(2)↑ | Alex(5)↑ | Incep↑ | CLIP↑ |
|---|---|---|---|---|---|---|---|
| NeuroPictor⋆ | Zero-shot | 0.057 | 0.297 | 71.4% | 74.7% | 62.5% | 66.0% |
| Our baseline | Zero-shot | 0.074 | 0.316 | 70.8% | 74.0% | 63.5% | 62.5% |
| Zebra | Zero-shot | 0.131 | 0.375 | 74.6% | 81.2% | 72.2% | 71.5% |
| MindEye2 | Few-shot (1h) | 0.195 | 0.419 | 84.2% | 90.6% | 81.2% | 79.2% |
| MindTuner | Full fine-tune | 0.322 | 0.421 | 95.8% | 98.8% | 95.6% | 93.8% |
Zebra substantially outperforms other zero-shot methods (e.g., +0.074 PixCorr and +9.7% Incep over NeuroPictor), with some metrics approaching those of fully fine-tuned models.
Ablation Study¶
| Baseline | SIFE Adv. | SIFE Anchor | SSFE Adv. | SSFE Anchor | PixCorr↑ | Alex(5)↑ | CLIP↑ |
|---|---|---|---|---|---|---|---|
| ✓ | | | | | 0.089 | 74.7% | 63.2% |
| ✓ | ✓ | | | | 0.129 | 77.4% | 66.8% |
| ✓ | ✓ | ✓ | | | 0.134 | 78.3% | 69.3% |
| ✓ | ✓ | ✓ | ✓ | | 0.142 | 79.6% | 70.8% |
| ✓ | ✓ | ✓ | ✓ | ✓ | 0.153 | 81.8% | 72.3% |
Key Findings¶
- All metrics improve steadily as the number of training subjects increases from 4 to 7 (CLIP: 63.7% → 72.3%), indicating that more subject data benefits generalization.
- UMAP/t-SNE visualizations confirm that \(\bm{E}_i\) is highly mixed across subjects (no subject-specific clustering), while \(\bm{E}_s\) clearly clusters by subject identity.
- Zero-shot inference takes approximately 1 second per image, compared to over 12 hours required by conventional fine-tuning pipelines.
- Zebra's advantage is largest on low-level perceptual metrics, while its semantic accuracy remains below that of few-shot methods.
Highlights & Insights¶
- Pioneering problem formulation: This work is the first to define zero-shot brain visual decoding, advancing fMRI decoding from subject-specific fine-tuning towards a plug-and-play paradigm.
- Neuroscience-driven design: The architecture is grounded in neuroscientific evidence that the cortex encodes semantics consistently across individuals, achieving representation disentanglement via adversarial training and residual decomposition.
- Elegant representation preservation anchor: This design addresses the classical problem of adversarial training corrupting feature spaces, using fMRI reconstruction as an anchor to maintain informational integrity.
- Practical value: Zero-shot decoding holds significant value for clinical applications such as brain-computer interfaces and neural rehabilitation.
Limitations & Future Work¶
- Semantic fidelity remains inferior to few-shot methods, with degraded performance on rare object categories.
- Validation is limited to the NSD dataset with only 8 subjects, constraining the scale of evaluation.
- The work focuses solely on image reconstruction, leaving more complex modalities such as text and video unexplored.
- Additional subjects and fMRI recordings are needed to comprehensively capture real-world visual experiences.
Related Work & Insights¶
Compared to methods requiring fine-tuning such as MindEye2 and MindTuner, Zebra requires no test-subject data whatsoever. Compared to NeuroPictor's unified brain encoding, Zebra effectively removes subject noise through explicit disentanglement. Key insight: in biomedical scenarios with large inter-individual variability, adversarial disentanglement may serve as a general strategy for achieving zero-shot generalization.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ First zero-shot brain visual decoding work; pioneering in both problem formulation and methodology.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive quantitative, qualitative, ablation, and visualization analyses, though dataset and subject scale are limited.
- Writing Quality: ⭐⭐⭐⭐⭐ Motivation is clearly articulated, methodology is presented intuitively, and experiments are well organized.
- Value: ⭐⭐⭐⭐⭐ Significant practical implications for brain-computer interfaces and clinical neuroscience.