UMBRAE: Unified Multimodal Brain Decoding

Conference: ECCV 2024
arXiv: 2404.07202
Code: https://weihaox.github.io/UMBRAE
Area: Multimodal VLM
Keywords: Brain signal decoding, fMRI, Cross-subject training, Multimodal LLM, Brain-visual alignment

TL;DR

This paper proposes UMBRAE, which aligns fMRI signals with image features through a universal brain encoder and feeds the result into a frozen MLLM for multimodal brain decoding (captioning, grounding, retrieval, and visual reconstruction). Its cross-subject training strategy lets a single model serve multiple subjects and outperform single-subject models.

Background & Motivation

Background: Brain signal decoding research has made progress in recent years, decoding fMRI signals into images (such as MindEye), videos, or text, but remains limited to single-modality outputs.

Limitations of Prior Work: (a) Single-modality decoding yields lossy representations—text lacks spatial location information, while image reconstruction is an underdetermined problem and lacks explicit representation of scene structures; (b) each subject requires a separately trained model due to structural and functional differences in activation patterns across different brains.

Key Challenge: Brain signals contain rich multimodal information (semantic concepts + spatial locations + object relations), but existing methods can only decode into a single modality. Subject-specific training cannot leverage the synergy of multi-subject data.

Goal: (a) Achieve unified decoding from brain signals to multimodal representations; (b) train a cross-subject universal model.

Key Insight: Align brain signals with intermediate features of a pre-trained image encoder, and then leverage the multitasking capability of MLLMs to achieve decoding at different levels of granularity.

Core Idea: Once brain signals are aligned to the image feature space, the multimodal understanding capabilities of the MLLM can be directly reused.
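
The reuse described in the core idea can be pictured as a token splice: wherever the MLLM prompt would normally carry image tokens, the aligned brain tokens are inserted instead. The following is a minimal sketch under stated assumptions; the `<image>` placeholder, `embed_text`, and `build_prompt_embeddings` are hypothetical names for illustration and do not reflect the actual Shikra or LLaVA interfaces.

```python
IMAGE_SLOT = "<image>"  # hypothetical placeholder marking where visual tokens go


def embed_text(token):
    """Stand-in for the MLLM's text-embedding lookup (hypothetical)."""
    return ("text", token)


def build_prompt_embeddings(prompt_tokens, visual_tokens):
    """Replace the image placeholder with a sequence of visual (or brain) tokens."""
    embeddings = []
    for token in prompt_tokens:
        if token == IMAGE_SLOT:
            embeddings.extend(visual_tokens)  # brain tokens take the image slot
        else:
            embeddings.append(embed_text(token))
    return embeddings


# Four toy brain tokens stand in for the aligned brain features.
brain_tokens = [("brain", i) for i in range(4)]
prompt = [IMAGE_SLOT, "Describe", "this", "image"]
sequence = build_prompt_embeddings(prompt, brain_tokens)
print(len(sequence))  # 7: four brain tokens plus three text tokens
```

Because the MLLM itself is untouched, switching tasks in this picture only changes the text tokens around the slot.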

Method

Overall Architecture

UMBRAE consists of three components: (1) A brain encoder (subject-specific tokenizer + universal perceive encoder) that maps fMRI signals to fixed-length brain tokens; (2) a multimodal alignment module that aligns brain tokens with CLIP image features; (3) a frozen MLLM (e.g., Shikra/LLaVA) that executes different tasks through a prompt interface.
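
To make the first component concrete, here is a minimal NumPy sketch of how per-subject tokenizers and a shared cross-attention encoder could turn inputs of different lengths into a fixed-length token sequence. Everything specific in it is an assumption for illustration: the toy sizes, the voxel counts, the single-head attention, and the use of learned queries are not taken from the paper, and the class names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions, far smaller than the real model).
D = 64          # token width (1024 in the summary)
N_SUBJ_TOK = 5  # learnable subject tokens per subject
N_VOX_TOK = 8   # hypothetical number of tokens a tokenizer emits
N_OUT = 16      # fixed number of output brain tokens


class SubjectTokenizer:
    """Per-subject projection from that subject's voxel count to tokens."""

    def __init__(self, n_voxels):
        self.w = rng.normal(0, n_voxels ** -0.5, size=(n_voxels, N_VOX_TOK * D))
        self.subject_tokens = rng.normal(0, 0.02, size=(N_SUBJ_TOK, D))

    def __call__(self, fmri):
        tokens = (fmri @ self.w).reshape(N_VOX_TOK, D)
        # Prepend the subject tokens to the projected signal tokens.
        return np.concatenate([self.subject_tokens, tokens], axis=0)


class PerceiveEncoder:
    """Shared single-head cross-attention: learned queries attend to the tokens."""

    def __init__(self):
        self.queries = rng.normal(0, 0.02, size=(N_OUT, D))
        self.wk = rng.normal(0, D ** -0.5, size=(D, D))
        self.wv = rng.normal(0, D ** -0.5, size=(D, D))

    def __call__(self, tokens):
        k, v = tokens @ self.wk, tokens @ self.wv
        scores = self.queries @ k.T / np.sqrt(D)
        scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
        attn = np.exp(scores)
        attn /= attn.sum(axis=1, keepdims=True)
        return attn @ v  # shape (N_OUT, D) regardless of the input length


# Two subjects with different (hypothetical) voxel counts share one encoder.
tokenizers = {"S1": SubjectTokenizer(1500), "S2": SubjectTokenizer(1400)}
encoder = PerceiveEncoder()

out1 = encoder(tokenizers["S1"](rng.normal(size=1500)))
out2 = encoder(tokenizers["S2"](rng.normal(size=1400)))
print(out1.shape, out2.shape)  # (16, 64) (16, 64)
```

The point of the sketch is only the shape contract: subject-specific parameters absorb the differing input sizes, while the shared encoder always returns the same number of tokens.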

Key Designs

1. Brain Encoder Architecture
   - Function: Encodes variable-length fMRI signals from different subjects into a unified, fixed-length token sequence.
   - Mechanism: A lightweight tokenizer per subject plus a shared perceive encoder (Transformer cross-attention); each subject has learnable subject tokens (5×1024).
   - Design Motivation: Structural brain differences among subjects are handled by the dedicated tokenizers, while universal cognitive patterns are captured by the shared encoder.

2. Cross-Subject Training Strategy
   - Function: Jointly trains on data from multiple subjects within a single model.
   - Mechanism: Within each batch, 50% of the data comes from the same subject, while the remainder is uniformly sampled from the other subjects.
   - Design Motivation: Key finding: the cross-subject model actually outperforms single-subject models, indicating the presence of transferable brain activity patterns.

3. Alignment with Intermediate Image Features
   - Function: Aligns brain features with the second-to-last layer features of CLIP ViT-L/14 (16×16×1024).
   - Mechanism: Simple MSE reconstruction loss \(\mathcal{L}_{\text{rec}} = \mathbb{E}[\|V(v) - B(b)\|^2]\).
   - Design Motivation: Intermediate features preserve both semantic and spatial information and can be fed directly into the MLLM adapter.

4. Brain Prompting Interface
   - Function: Uses different prompt templates for different tasks: "Describe this image" for captioning, "Locate " for grounding.
   - Mechanism: Brain features replace image features in the prompt embeddings.
   - Design Motivation: The instruction-following capability of MLLMs naturally supports multi-task switching.

5. Weakly Supervised Adaptation to New Subjects
   - Function: Rapidly adapts to a new subject with a small number of training samples.
   - Mechanism: Freezes the perceive encoder and trains only the new subject's tokenizer.
   - Design Motivation: Cross-subject training has already learned universal patterns; a new subject only needs to learn to "translate" its own format.
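
The batch composition in design 2 can be sketched with the standard library alone. This is an illustrative reading of the 50% rule, not the authors' data loader: the function name, the subject labels, sampling with replacement, and the way the anchor subject is chosen are all assumptions.

```python
import random


def sample_cross_subject_batch(data_by_subject, anchor, batch_size,
                               same_ratio=0.5, rng=random):
    """Draw a batch where `same_ratio` of the items come from `anchor`
    and the rest are drawn uniformly from the other subjects."""
    n_same = int(round(batch_size * same_ratio))
    others = [s for s in data_by_subject if s != anchor]

    # Items from the anchor subject (sampled with replacement for simplicity).
    batch = [(anchor, rng.choice(data_by_subject[anchor])) for _ in range(n_same)]

    # Remaining items: pick another subject uniformly, then one of its samples.
    for _ in range(batch_size - n_same):
        subject = rng.choice(others)
        batch.append((subject, rng.choice(data_by_subject[subject])))

    rng.shuffle(batch)
    return batch


# Toy data: four subjects with 100 sample indices each (hypothetical labels).
data = {name: list(range(100)) for name in ("S1", "S2", "S3", "S4")}
batch = sample_cross_subject_batch(data, "S1", batch_size=256, rng=random.Random(0))
n_anchor = sum(1 for subject, _ in batch if subject == "S1")
print(n_anchor, len(batch))  # 128 256
```

Each item keeps its subject label so that, in a full pipeline, it could be routed to the matching subject-specific tokenizer.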

Loss & Training

Training Loss: MSE reconstruction loss aligning brain features with image features.
Optimizer: AdamW, \(\beta_1=0.9\), \(\beta_2=0.95\), weight decay = 0.01.
Learning Rate: One-cycle scheduler, initial learning rate of 3e-4.
Training Scale: Single A100 GPU, 240 epochs, batch size 256, approximately 12 hours.
Data: NSD dataset, 24,980 training samples and 982 testing samples for each of the 4 subjects.
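
As a rough illustration of the training loss listed above, the sketch below computes an MSE between brain tokens and image tokens of matching shape. The arrays are random stand-ins for \(B(b)\) and \(V(v)\), the function name is hypothetical, and averaging over all elements (rather than summing the squared norm) is an assumption that only changes the loss by a constant factor.

```python
import numpy as np


def alignment_loss(brain_tokens, image_tokens):
    """Mean squared error between predicted brain tokens and target image tokens."""
    if brain_tokens.shape != image_tokens.shape:
        raise ValueError("brain and image tokens must have the same shape")
    return float(np.mean((brain_tokens - image_tokens) ** 2))


rng = np.random.default_rng(0)
n_tokens, width = 16 * 16, 1024  # 256 tokens of width 1024, as in the summary

image_tokens = rng.normal(size=(n_tokens, width))                        # stand-in for V(v)
brain_tokens = image_tokens + 0.1 * rng.normal(size=(n_tokens, width))   # stand-in for B(b)

# With noise of standard deviation 0.1, the loss should be close to 0.1**2.
print(alignment_loss(brain_tokens, image_tokens))
```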

Key Experimental Results

Main Results

Method	BLEU1	METEOR	CIDEr	SPICE	CLIP-S
SDRecon	36.21	10.03	13.83	5.02	61.07
OneLLM	47.04	13.55	22.99	6.26	54.80
BrainCap	55.96	16.68	41.30	9.06	64.31
UMBRAE-S1	57.63	Best	Best	Best	65.00+
UMBRAE	Best	Best	Best	Best	Best

Ablation Study

Setting	Performance
Single-subject training (UMBRAE-S1)	Baseline
Cross-subject training (UMBRAE)	Outperforms single-subject, without increasing training time
Weakly supervised adaptation (10% data)	Still achieves reasonable performance
7B vs 13B LLM	13B yields further improvements

Key Findings

Cross-subject outperforms single-subject: Joint training leverages shared neural patterns across subjects, enhancing generalization.
UMBRAE is the first to achieve grounding directly from brain signals (brain grounding), approaching the baseline that uses ground-truth images while being over 10 times faster.
Simple MSE alignment to intermediate image features successfully recovers semantic and spatial information without requiring contrastive learning or a diffusion prior.
Weakly supervised adaptation can scale to new subjects with only a minimal amount of data.
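
The tokenizer-only adaptation behind the last finding can be illustrated with a toy linear model: a frozen shared map plays the role of the perceive encoder, and gradient descent updates only the new subject's tokenizer. This is a sketch under strong simplifying assumptions (linear layers, synthetic data, plain gradient descent, hypothetical sizes), not the paper's training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_vox, hidden, out = 32, 20, 8, 16  # hypothetical toy sizes

# Shared encoder: treated as already trained on other subjects, so it stays frozen.
shared_encoder = rng.normal(size=(hidden, out)) / np.sqrt(hidden)
frozen_copy = shared_encoder.copy()

# New subject's tokenizer: the only trainable parameters.
tokenizer = 0.1 * rng.normal(size=(n_vox, hidden))

# Synthetic fMRI inputs and target image features for the new subject.
x = rng.normal(size=(n, n_vox))
target = x @ rng.normal(size=(n_vox, hidden)) @ shared_encoder


def mse(tok):
    return float(np.mean((x @ tok @ shared_encoder - target) ** 2))


loss_before = mse(tokenizer)
for _ in range(200):
    residual = x @ tokenizer @ shared_encoder - target
    # Gradient of the mean squared error with respect to the tokenizer only.
    grad_tok = x.T @ (2.0 * residual / residual.size) @ shared_encoder.T
    tokenizer -= 0.1 * grad_tok  # shared_encoder is never updated
loss_after = mse(tokenizer)

print(loss_after < loss_before)  # True
```

Only `tokenizer` changes during the loop, which mirrors the idea that a new subject needs to learn its own input mapping while the shared representation is reused as is.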

Highlights & Insights

Counter-intuitive conclusion on cross-subject training: Although neuroscience emphasizes significant individual differences between brains, the experiments show that shared training performs better, hinting at deeper patterns in human cognition that are common across subjects.
Extreme simplicity of the alignment target: Aligning brain features with intermediate image features purely via MSE is sufficient for the frozen MLLM to operate on brain input.
BrainHub benchmark: The first comprehensive brain understanding evaluation benchmark, extending NSD to support captioning, grounding, and retrieval tasks.
Model-agnostic design: The proposed method can be combined with any image encoder, LLM, and MLLM.

Limitations & Future Work

fMRI equipment is expensive and non-portable, limiting practical applications.
The NSD dataset scale is limited (~25K per subject); whether larger-scale data can yield further improvements remains to be explored.
Current grounding precision is limited by the spatial resolution of fMRI.
The study only utilizes 4 subjects, and scalability to larger cohorts remains to be validated.
The quality of visual reconstruction depends on the downstream generative model.

Related Work & Insights

MindEye/BrainDiffuser: Single-modality visual reconstruction; UMBRAE extends this to multimodal decoding.
OneLLM: A unified multimodal encoder that includes brain signals, but requires massive data and computational resources.
Shikra/LLaVA: Serve as the MLLM base; UMBRAE demonstrates that brain signals can "disguise" themselves as image features to be understood by the MLLM.
Insight: Brain signals are essentially an "encoding" of natural images; a good decoder only needs to learn the mapping to an existing representation space.

Rating

Novelty: ⭐⭐⭐⭐⭐ (First unified multimodal brain decoding + cross-subject training)
Technical Depth: ⭐⭐⭐⭐ (Reasonable architectural design with theoretically motivated cross-subject sampling strategies)
Experimental Thoroughness: ⭐⭐⭐⭐⭐ (Comprehensive evaluation across multiple tasks + BrainHub benchmark + thorough ablation studies)
Writing Quality: ⭐⭐⭐⭐ (Clear motivation and intuitive presentation of the method)
Value: ⭐⭐⭐⭐ (Significant value at the intersection of brain-computer interfaces and multimodal learning)