UMBRAE: Unified Multimodal Brain Decoding
Conference: ECCV 2024
arXiv: 2404.07202
Code: https://weihaox.github.io/UMBRAE
Area: Multimodal VLM
Keywords: Brain signal decoding, fMRI, Cross-subject training, Multimodal LLM, Brain-visual alignment
TL;DR
This paper proposes UMBRAE, which uses a universal brain encoder to align fMRI signals with image features and feeds the aligned features into a frozen MLLM for multimodal brain decoding (description, grounding, retrieval, and visual reconstruction). It also introduces a cross-subject training strategy that lets a single model serve multiple subjects and outperform single-subject models.
Background & Motivation
Background: Brain signal decoding research has made progress in recent years, decoding fMRI signals into images (such as MindEye), videos, or text, but remains limited to single-modality outputs.
Limitations of Prior Work: (a) Single-modality decoding yields lossy representations—text lacks spatial location information, while image reconstruction is an underdetermined problem and lacks explicit representation of scene structures; (b) each subject requires a separately trained model due to structural and functional differences in activation patterns across different brains.
Key Challenge: Brain signals contain rich multimodal information (semantic concepts + spatial locations + object relations), but existing methods can only decode into a single modality. Subject-specific training cannot leverage the synergy of multi-subject data.
Goal: (a) Achieve unified decoding from brain signals to multimodal representations; (b) train a cross-subject universal model.
Key Insight: Align brain signals with intermediate features of a pre-trained image encoder, and then leverage the multitasking capability of MLLMs to achieve decoding at different levels of granularity.
Core Idea: Once brain signals are aligned to the image feature space, the multimodal understanding capabilities of the MLLM can be directly reused.
Method
Overall Architecture
UMBRAE consists of three components: (1) A brain encoder (subject-specific tokenizer + universal perceive encoder) that maps fMRI signals to fixed-length brain tokens; (2) a multimodal alignment module that aligns brain tokens with CLIP image features; (3) a frozen MLLM (e.g., Shikra/LLaVA) that executes different tasks through a prompt interface.
Key Designs
1. Brain Encoder Architecture
- Function: Encodes variable-length fMRI signals from different subjects into a unified, fixed-length token sequence.
- Mechanism: A lightweight tokenizer per subject + a shared perceive encoder (Transformer cross-attention); each subject has learnable subject tokens (5x1024).
- Design Motivation: Structural brain differences among subjects are processed by dedicated tokenizers, while universal cognitive patterns are captured by the shared encoder.
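To make the "variable-length in, fixed-length out" idea concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the class name `ToyBrainEncoder`, the toy sizes, the random weights, and the single projection-free attention step are all hypothetical, and the learnable subject tokens are omitted.

```python
import math
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyBrainEncoder:
    """Hypothetical toy: per-subject tokenizer followed by shared cross-attention."""

    def __init__(self, voxels_per_subject, n_in=6, n_out=4, dim=8, seed=0):
        rng = random.Random(seed)
        self.n_in, self.dim = n_in, dim
        # One tokenizer per subject: maps that subject's voxel count to n_in * dim values.
        self.tokenizers = {
            subject: rand_matrix(n_in * dim, n_voxels, rng)
            for subject, n_voxels in voxels_per_subject.items()
        }
        # Shared queries: their count fixes the output length for every subject.
        self.queries = rand_matrix(n_out, dim, rng)

    def encode(self, subject, voxels):
        flat = matvec(self.tokenizers[subject], voxels)
        tokens = [flat[i * self.dim:(i + 1) * self.dim] for i in range(self.n_in)]
        out = []
        for query in self.queries:
            # Scaled dot-product attention with keys = values = tokens (simplification).
            scores = [
                sum(q * t for q, t in zip(query, token)) / math.sqrt(self.dim)
                for token in tokens
            ]
            weights = softmax(scores)
            out.append([
                sum(w * token[j] for w, token in zip(weights, tokens))
                for j in range(self.dim)
            ])
        return out


# Two toy subjects with different voxel counts.
encoder = ToyBrainEncoder({"S1": 30, "S2": 25})
rng = random.Random(1)
tokens_s1 = encoder.encode("S1", [rng.random() for _ in range(30)])
tokens_s2 = encoder.encode("S2", [rng.random() for _ in range(25)])
print(len(tokens_s1), len(tokens_s1[0]), len(tokens_s2), len(tokens_s2[0]))
```

Both toy subjects yield a token grid of the same shape even though their voxel counts differ, which is what lets a single downstream model consume either one.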
2. Cross-Subject Training Strategy
- Function: Jointly trains on data from multiple subjects within a single model.
- Mechanism: Within each batch, 50% of the data comes from the same subject, while the remainder is uniformly sampled from the other subjects.
- Design Motivation: Key discovery: the cross-subject model actually outperforms single-subject models, indicating the presence of transferable brain activity patterns.
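The batch composition described for cross-subject training can be sketched as follows. This is a hedged illustration: the function name `sample_cross_subject_batch`, the subject labels, the toy data, and sampling with replacement are assumptions made for the example, not details taken from the paper.

```python
import random


def sample_cross_subject_batch(data_by_subject, anchor, batch_size,
                               same_ratio=0.5, rng=None):
    """Draw `same_ratio` of the batch from `anchor`, the rest uniformly from other subjects.

    Sampling is with replacement, which is a simplification for this sketch.
    """
    rng = rng or random.Random()
    others = [s for s in data_by_subject if s != anchor]
    n_same = round(batch_size * same_ratio)
    batch = [(anchor, rng.choice(data_by_subject[anchor])) for _ in range(n_same)]
    for _ in range(batch_size - n_same):
        subject = rng.choice(others)  # uniform over the remaining subjects
        batch.append((subject, rng.choice(data_by_subject[subject])))
    rng.shuffle(batch)
    return batch


# Toy data: four placeholder subjects, each holding 100 sample indices.
data = {name: list(range(100)) for name in ("S1", "S2", "S3", "S4")}
batch = sample_cross_subject_batch(data, "S1", batch_size=256, rng=random.Random(0))
n_anchor = sum(1 for subject, _ in batch if subject == "S1")
print(len(batch), n_anchor)
```

Each element carries its subject label so that a training loop could route the sample through the matching subject-specific tokenizer.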
3. Alignment with Intermediate Image Features
- Function: Aligns brain features with the second-to-last layer features of CLIP ViT-L/14 (16x16x1024).
- Mechanism: Simple MSE reconstruction loss \(\mathcal{L}_{\text{rec}} = \mathbb{E}[\|V(v) - B(b)\|^2]\).
- Design Motivation: Intermediate features preserve both semantic and spatial information, which can be directly fed into the MLLM adapter.
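As a small worked example of the reconstruction loss, the sketch below computes an element-wise mean squared error between two equally shaped token grids. The function name `mse_alignment_loss` and the 2x2 toy values are hypothetical. Averaging over elements differs from the squared norm in the formula only by a constant factor, which is an assumption about normalization rather than something stated in the source.

```python
def mse_alignment_loss(brain_tokens, image_tokens):
    """Mean of squared differences between two equally shaped token grids."""
    if len(brain_tokens) != len(image_tokens):
        raise ValueError("token counts differ")
    total, count = 0.0, 0
    for brain_row, image_row in zip(brain_tokens, image_tokens):
        if len(brain_row) != len(image_row):
            raise ValueError("feature dimensions differ")
        for b, v in zip(brain_row, image_row):
            total += (v - b) ** 2
            count += 1
    return total / count


# Toy 2-token, 2-dimensional grids standing in for B(b) and V(v).
brain = [[0.0, 1.0], [2.0, 3.0]]
image = [[0.0, 2.0], [2.0, 1.0]]
loss = mse_alignment_loss(brain, image)
print(loss)  # squared differences are 0, 1, 0, 4, so the mean is 1.25
```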
4. Brain Prompting Interface
- Function: Uses different prompt templates for different tasks—"Describe this image" for captioning, "Locate
5. Weakly Supervised Adaptation to New Subjects
- Function: Rapidly adapts to new subject data with a small amount of training samples.
- Mechanism: Freezes the perceive encoder and only trains the new subject's tokenizer.
- Design Motivation: Cross-subject training has already learned universal patterns; new subjects only need to learn to "translate" their format.
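The freeze-and-adapt idea behind the weakly supervised adaptation can be shown with a deliberately tiny scalar toy. Everything here is hypothetical (the function `adapt_new_subject`, the one-weight "tokenizer", the one-weight "shared encoder", the data, and the learning rate). It only illustrates that gradient updates touch the subject-specific parameter while the shared one stays fixed.

```python
def adapt_new_subject(xs, ys, shared_w, lr=0.05, steps=50):
    """Fit only the new subject's tokenizer weight; `shared_w` is never updated.

    Toy model: prediction = shared_w * (tok_w * x), trained with MSE by gradient descent.
    """
    tok_w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to tok_w only.
        grad = sum(
            2.0 * (shared_w * tok_w * x - y) * shared_w * x
            for x, y in zip(xs, ys)
        ) / n
        tok_w -= lr * grad
    return tok_w


shared_w = 2.0                   # stands in for the frozen shared encoder
xs = [0.5, 1.0, 1.5, 2.0]        # a handful of samples from the "new subject"
ys = [6.0 * x for x in xs]       # targets reachable with tok_w = 3.0
tok_w = adapt_new_subject(xs, ys, shared_w)
print(round(tok_w, 6), shared_w)
```

Because the shared weight already captures part of the mapping, only one parameter has to be learned from the few available samples, which mirrors the motivation given above.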
Loss & Training
- Training Loss: MSE reconstruction loss aligning brain features with image features.
- Optimizer: AdamW, \(\beta_1=0.9\), \(\beta_2=0.95\), weight decay = 0.01.
- Learning Rate: One-cycle scheduler, initial learning rate of 3e-4.
- Training Scale: Single A100 GPU, 240 epochs, batch size 256, approximately 12 hours.
- Data: NSD dataset, 24,980 training samples and 982 testing samples for each of the 4 subjects.
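For reference, the hyperparameters listed above can be collected in a small config object. The class `TrainConfig` and its field names are hypothetical. The step count additionally assumes that one epoch is a single pass over the pooled training data of all four subjects, which the source does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the reported training settings."""
    optimizer: str = "AdamW"
    beta1: float = 0.9
    beta2: float = 0.95
    weight_decay: float = 0.01
    initial_lr: float = 3e-4
    lr_schedule: str = "one-cycle"
    epochs: int = 240
    batch_size: int = 256
    num_subjects: int = 4
    train_samples_per_subject: int = 24_980
    test_samples_per_subject: int = 982

    def steps_per_epoch(self):
        # ASSUMPTION: an epoch covers the pooled training data of all subjects once.
        pooled = self.num_subjects * self.train_samples_per_subject
        return math.ceil(pooled / self.batch_size)


cfg = TrainConfig()
total_steps = cfg.steps_per_epoch() * cfg.epochs
print(cfg.steps_per_epoch(), total_steps)
```

Under that assumption the run would take 391 optimizer steps per epoch and 93,840 steps in total; a different definition of an epoch would change both numbers.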
Key Experimental Results
Main Results
| Method | BLEU1 | METEOR | CIDEr | SPICE | CLIP-S |
|---|---|---|---|---|---|
| SDRecon | 36.21 | 10.03 | 13.83 | 5.02 | 61.07 |
| OneLLM | 47.04 | 13.55 | 22.99 | 6.26 | 54.80 |
| BrainCap | 55.96 | 16.68 | 41.30 | 9.06 | 64.31 |
| UMBRAE-S1 | 57.63 | Best | Best | Best | 65.00+ |
| UMBRAE | Best | Best | Best | Best | Best |
Ablation Study
| Setting | Performance |
|---|---|
| Single-subject training (UMBRAE-S1) | Baseline |
| Cross-subject training (UMBRAE) | Outperforms single-subject, without increasing training time |
| Weakly supervised adaptation (10% data) | Still achieves reasonable performance |
| 7B vs 13B LLM | 13B yields further improvements |
Key Findings
- Cross-subject outperforms single-subject: Joint training leverages shared neural patterns across subjects, enhancing generalization.
- UMBRAE is the first to perform grounding directly from brain signals (brain grounding), approaching the baseline that uses ground-truth images while being over 10 times faster.
- Simple MSE alignment to intermediate image features successfully recovers semantic and spatial information without requiring contrastive learning or a diffusion prior.
- Weakly supervised adaptation can scale to new subjects with only a minimal amount of data.
Highlights & Insights
- Counter-intuitive conclusion on cross-subject training: Neuroscience emphasizes significant individual differences between brains, yet the experiments show that shared training performs better, hinting at deeper universal patterns in human cognition.
- Extreme simplicity of the alignment target: Aligning brain features with intermediate image features purely via MSE is sufficient to unlock the full potential of the MLLM.
- BrainHub benchmark: The first comprehensive brain understanding evaluation benchmark, extending NSD to support multiple tasks: captioning, grounding, and retrieval.
- Model-agnostic design: The proposed method can be combined with any image encoder, LLM, and MLLM.
Limitations & Future Work
- fMRI equipment is expensive and non-portable, limiting practical applications.
- The NSD dataset scale is limited (~25K per subject); whether larger-scale data can yield further improvements remains to be explored.
- Current grounding precision is limited by the spatial resolution of fMRI.
- The study only utilizes 4 subjects, and scalability to larger cohorts remains to be validated.
- The quality of visual reconstruction depends on the downstream generative model.
Related Work & Insights
- MindEye/BrainDiffuser: Single-modality visual reconstruction; UMBRAE extends this to multimodal decoding.
- OneLLM: A unified multimodal encoder that includes brain signals, but requires massive data and computational resources.
- Shikra/LLaVA: Serve as the MLLM base; UMBRAE demonstrates that brain signals can "disguise" themselves as image features to be understood by the MLLM.
- Insight: Brain signals are essentially an "encoding" of natural images; a good decoder only needs to learn the mapping to an existing representation space.
Rating
- Novelty: ⭐⭐⭐⭐⭐ (First unified multimodal brain decoding + cross-subject training)
- Technical Depth: ⭐⭐⭐⭐ (Reasonable architectural design with theoretically motivated cross-subject sampling strategies)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Comprehensive evaluation across multiple tasks + BrainHub benchmark + thorough ablation studies)
- Writing Quality: ⭐⭐⭐⭐ (Clear motivation and an intuitive presentation of the method)
- Value: ⭐⭐⭐⭐ (Significant value at the intersection of brain-computer interfaces and multimodal learning)