Neuro-3D: Towards 3D Visual Decoding from EEG Signals
Conference: CVPR 2025
arXiv: 2411.12248
Code: https://github.com/gzq17/neuro-3D
Area: 3D Vision
Keywords: EEG decoding, Brain-Computer Interface (BCI), 3D point cloud reconstruction, Dynamic-static fusion, CLIP alignment
TL;DR
Neuro-3D is the first work to reconstruct colored 3D point clouds from electroencephalography (EEG) signals. It introduces the EEG-3D dataset (12 subjects, 72 Objaverse object categories, dynamic video + static image stimuli) and achieves cross-modal 3D visual decoding through a dynamic-static EEG fusion encoder, CLIP-aligned contrastive learning, and diffusion-based point cloud generation with color prediction.
Background & Motivation
- Background: Visual decoding from brain signals began with fMRI, where 2D image reconstruction has already been achieved (e.g., MindEye, Brain-Diffuser). EEG has attracted attention for its portability and high temporal resolution, but existing EEG decoding is limited to 2D images or category classification.
- Limitations of Prior Work: (1) No prior work decodes 3D content from EEG: 3D reconstruction requires recovering an object's shape and appearance, while EEG signals are highly noisy; (2) no dataset pairs EEG recordings with 3D ground truth; (3) existing neural datasets (such as Things-EEG and GOD) lack 3D annotations and dynamic video stimuli.
- Key Challenge: The signal-to-noise ratio of EEG is extremely low (due to non-invasive acquisition), whereas 3D reconstruction requires detailed shape and color information—creating a significant gap between signal quality and target complexity.
- Goal: Establish the EEG-3D dataset and design a complete decoding pipeline from EEG to 3D point clouds.
- Key Insight: Dynamic video stimuli (object rotation) provide 3D viewpoint variation information, while static image stimuli provide stable appearance information—the fusion of both allows EEG signals to capture a more complete 3D perception.
- Core Idea: Dynamic-static EEG fusion \(\rightarrow\) CLIP alignment (contrastive learning) \(\rightarrow\) shape generation (diffusion point cloud) + color prediction (single-step coloring).
Method
Overall Architecture
Dynamic EEG \(e_d\) (viewing rotating videos) + static EEG \(e_s\) (viewing images) \(\rightarrow\) dynamic-static fusion encoder (adaptive aggregation via cross-attention) \(\rightarrow\) decoupling into geometric features \(f_g\) and appearance features \(f_a\) \(\rightarrow\) CLIP-aligned contrastive learning \(\rightarrow\) \(f_g\)-conditioned diffusion generation of an 8192-point 3D point cloud \(\rightarrow\) \(f_a\)-conditioned single-step color prediction \(\rightarrow\) colored 3D point cloud.
Key Designs
- Dynamic-Static EEG Fusion Encoder
  - Function: Adaptively fuse dynamic (temporally rich) and static (high signal-to-noise ratio) EEG signals.
  - Mechanism: Static EEG is encoded as \(z_s = E_s(e_s)\), and dynamic EEG is encoded as \(z_d = E_d(e_d)\) (incorporating temporal self-attention). Adaptive neural aggregator: \(z_{sd} = \text{Softmax}(QK^T/\sqrt{d})V\), where \(Q\) is derived from the static representation, and \(K/V\) are derived from the dynamic representation.
  - Design Motivation: Dynamic videos provide multi-angle information but elicit complex EEG responses; static images have a higher signal-to-noise ratio but lack 3D perspectives. Cross-attention allows the static representation to guide (\(Q\)) while the dynamic representation complements (\(K/V\)).
- Geometry-Appearance Decoupled Learning
  - Function: Decompose the EEG representation into separate shape and color branches.
  - Mechanism: The fused features are mapped via two MLPs into \(f_g\) (geometry) and \(f_a\) (appearance), which are individually aligned with CLIP visual features: \(\mathcal{L}_{align} = \alpha \cdot \text{CLIP}(f, f_v) + (1-\alpha) \cdot \text{MSE}(f, f_v)\), coupled with a category classification loss \(\mathcal{L}_c\).
  - Design Motivation: 3D shape and color are independent attributes, since different colors can exist for the same shape; decoupling them enables more efficient learning for each branch.
- Diffusion Point Cloud Generation + Majority Voting Coloring
  - Function: Generate 3D shapes from EEG features and apply color.
  - Mechanism: A Point-Voxel Network (PVN) is used as a denoiser, generating 8192 3D points via a Markov diffusion process conditioned on \(f_g\). Coloring is simplified using majority voting: the dominant color of the object is predicted rather than point-wise colors, which reduces prediction complexity.
  - Design Motivation: EEG signals are too noisy to support accurate point-wise color prediction; majority voting provides a reasonable holistic color.
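The aggregator formula \(z_{sd} = \text{Softmax}(QK^T/\sqrt{d})V\) can be illustrated with a minimal, dependency-free sketch. It assumes that \(Q\), \(K\), and \(V\) are already-projected token matrices; the learned projection layers, multi-head splitting, and the token shapes used here are illustrative assumptions and not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q, k, v):
    """Scaled dot-product cross-attention.

    q: n_q x d   queries (here: static-EEG tokens, an assumption)
    k: n_k x d   keys    (here: dynamic-EEG tokens)
    v: n_k x d_v values  (here: dynamic-EEG tokens)
    Returns an n_q x d_v list of lists.
    """
    d = len(q[0])
    out = []
    for q_row in q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d)
            for k_row in k
        ]
        weights = softmax(scores)
        # Weighted average of the value rows.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out


# Toy example: 2 static tokens attend over 3 dynamic tokens.
z_s = [[1.0, 0.0], [0.0, 1.0]]
z_d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z_sd = cross_attention(z_s, z_d, z_d)
```

Each output row is a convex combination of the dynamic tokens, weighted by how strongly the corresponding static token matches each one, which is the sense in which the static branch "guides" and the dynamic branch "complements".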
Loss & Training
\(\mathcal{L} = \mathcal{L}_{align}(f_g, f_v) + \mathcal{L}_{align}(f_a, f_v) + \gamma \mathcal{L}_c\). The feature dimension is 1024, and videos are downsampled to \(n=4\) frames for CLIP alignment.
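As a rough illustration of \(\mathcal{L}_{align} = \alpha \cdot \text{CLIP}(f, f_v) + (1-\alpha) \cdot \text{MSE}(f, f_v)\), the sketch below assumes that the "CLIP" term is the usual symmetric contrastive (InfoNCE-style) loss over a batch of paired EEG and visual features. The temperature value, the default \(\alpha\), and all function names are assumptions chosen for illustration; the summary does not specify them.

```python
import math


def normalize(v):
    """L2-normalize a vector (guarding against a zero norm)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax_at(row, idx):
    """log-softmax of `row`, evaluated at position `idx`."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - log_z


def contrastive_loss(f, f_v, temperature=0.07):
    """Symmetric InfoNCE loss; row i of f is paired with row i of f_v."""
    f_n = [normalize(x) for x in f]
    v_n = [normalize(x) for x in f_v]
    n = len(f_n)
    logits = [
        [sum(a * b for a, b in zip(fi, vj)) / temperature for vj in v_n]
        for fi in f_n
    ]
    # EEG -> visual direction: rows of the logit matrix.
    eeg_to_vis = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Visual -> EEG direction: columns of the logit matrix.
    vis_to_eeg = -sum(
        log_softmax_at([logits[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (eeg_to_vis + vis_to_eeg)


def mse_loss(f, f_v):
    """Mean squared error over all feature entries."""
    total = sum((a - b) ** 2 for x, y in zip(f, f_v) for a, b in zip(x, y))
    count = sum(len(x) for x in f)
    return total / count


def align_loss(f, f_v, alpha=0.5, temperature=0.07):
    """alpha * contrastive + (1 - alpha) * MSE (alpha is an assumed value)."""
    return alpha * contrastive_loss(f, f_v, temperature) + (1 - alpha) * mse_loss(f, f_v)


# Toy batch: correctly paired features give a much lower loss than swapped ones.
matched = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = align_loss(matched, matched)
loss_swapped = align_loss(matched, swapped)
```

Under these assumptions, the full objective would call `align_loss` once for \(f_g\) and once for \(f_a\), then add the classification term weighted by \(\gamma\).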
Key Experimental Results
Main Results
| Task | Metric | Neuro-3D | Baseline |
|---|---|---|---|
| Object Category Classification (72 classes) | top-1 | Significantly outperforms | DeepNet 3.70%, Random 1.39% |
| Color Category Classification (6 classes) | top-1 | Significantly outperforms | DeepNet 20.95%, Random 16.67% |
| 3D Reconstruction | 2-way top-1 | Effectively discriminates | - |
| 3D Reconstruction | Chamfer Distance | Reasonably generates | - |
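For readers unfamiliar with the Chamfer Distance metric listed above, here is a minimal sketch. Definitions vary between papers (squared versus unsquared distances, sums versus means); this version assumes the symmetric mean of squared nearest-neighbor distances, and it is not confirmed to match the exact variant used in the paper.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two point clouds.

    For each point, find its nearest neighbor in the other cloud,
    average the squared distances per direction, then add both directions.
    """
    pred_to_target = sum(
        min(squared_distance(p, q) for q in target) for p in pred
    ) / len(pred)
    target_to_pred = sum(
        min(squared_distance(q, p) for p in pred) for q in target
    ) / len(target)
    return pred_to_target + target_to_pred


# Toy example: two predicted points against a single ground-truth point.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0)]
cd = chamfer_distance(pred, target)
```

This brute-force loop is quadratic in the number of points, so for clouds of 8192 points a vectorized or KD-tree-based implementation would be the practical choice.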
Ablation Study
| Configuration | Effect | Description |
|---|---|---|
| Static EEG only | Classification drops | Lacks 3D perspective information |
| Dynamic EEG only | Classification drops | Insufficient signal-to-noise ratio |
| Dynamic + static fusion | Optimal | Complementary information |
| w/o CLIP alignment | Reconstruction degrades | Semantic alignment serves as a bridge |
| w/o Decoupling | Shape/color confusion | Decoupling assists individual learning |
Key Findings
- Dynamic-static fusion performs better than using either modality alone—confirming that the two stimuli provide complementary information.
- Although 3D point cloud reconstruction from EEG is coarse, it is recognizable at the category level—marking the first step in this direction.
- The EEG-3D dataset is the first benchmark that simultaneously contains EEG recordings, 3D ground truths, and color information.
Highlights & Insights
- Pioneering Problem Definition: The paper is the first to propose the task of decoding 3D visual content from EEG.
- Long-term Value of the EEG-3D Dataset: 12 subjects \(\times\) 72 classes \(\times\) multi-modal annotations, which can support a range of future work.
- Innovation of Dynamic Video Stimuli: Previous EEG datasets only featured static images—rotating videos provide crucial cues for 3D perception.
Limitations & Future Work
- The low signal-to-noise ratio of EEG results in coarse reconstruction quality; future work could consider other recording modalities such as fNIRS or electrocorticography (ECoG).
- Color prediction is simplified to majority voting—point-wise color prediction requires more effective signal decoding.
- Only 12 subjects are included; generalization to a larger population requires validation.
- The 3D reconstruction is mainly recognizable at the category level, with limited ability for fine-grained discrimination within the same category.
Related Work & Insights
- vs MindEye/Brain-Diffuser: These methods reconstruct 2D images from fMRI. Neuro-3D extends decoding to 3D and uses the more portable EEG.
- vs Mind-3D: Also features 3D annotations but lacks color information. EEG-3D includes color annotations for the first time.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First 3D point cloud reconstruction from EEG.
- Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across datasets, methods, classification, and reconstruction.
- Writing Quality: ⭐⭐⭐⭐ Clear.
- Value: ⭐⭐⭐⭐ Pioneering work and dataset contribution.