Neuro-3D: Towards 3D Visual Decoding from EEG Signals

Conference: CVPR 2025
arXiv: 2411.12248
Code: https://github.com/gzq17/neuro-3D
Area: 3D Vision
Keywords: EEG decoding, Brain-Computer Interface (BCI), 3D point cloud reconstruction, Dynamic-static fusion, CLIP alignment

TL;DR

Neuro-3D is the first work to reconstruct colored 3D point clouds from electroencephalography (EEG) signals. It introduces the EEG-3D dataset (12 subjects, 72 Objaverse object categories, dynamic video + static image stimuli) and achieves cross-modal 3D visual decoding through a dynamic-static EEG fusion encoder, CLIP-aligned contrastive learning, and diffusion-based point cloud generation with color prediction.

Background & Motivation

Background: Visual decoding from brain signals began with fMRI, where 2D image reconstruction has already been achieved (e.g., MindEye, Brain-Diffuser). EEG has attracted attention for its portability and high temporal resolution, but existing EEG decoding is limited to 2D images or category classification.
Limitations of Prior Work: (1) No prior work decodes 3D content from EEG: 3D reconstruction requires recovering both the shape and the appearance of objects, while EEG signals are highly noisy; (2) no dataset pairs EEG recordings with 3D ground truth; (3) existing brain-signal datasets (such as Things-EEG and GOD) lack 3D annotations and dynamic video stimuli.
Key Challenge: The signal-to-noise ratio of EEG is extremely low (due to non-invasive acquisition), whereas 3D reconstruction requires detailed shape and color information—creating a significant gap between signal quality and target complexity.
Goal: Establish the EEG-3D dataset and design a complete decoding pipeline from EEG to 3D point clouds.
Key Insight: Dynamic video stimuli (object rotation) provide 3D viewpoint variation information, while static image stimuli provide stable appearance information—the fusion of both allows EEG signals to capture a more complete 3D perception.
Core Idea: Dynamic-static EEG fusion \(\rightarrow\) CLIP alignment (contrastive learning) \(\rightarrow\) shape generation (diffusion point cloud) + color prediction (single-step coloring).

Method

Overall Architecture

Dynamic EEG \(e_d\) (viewing rotating videos) + static EEG \(e_s\) (viewing images) \(\rightarrow\) dynamic-static fusion encoder (adaptive aggregation via cross-attention) \(\rightarrow\) decoupling into geometric features \(f_g\) and appearance features \(f_a\) \(\rightarrow\) CLIP-aligned contrastive learning \(\rightarrow\) \(f_g\)-conditioned diffusion generation of an 8192-point 3D point cloud \(\rightarrow\) \(f_a\)-conditioned single-step color prediction \(\rightarrow\) colored 3D point cloud.

Key Designs

Dynamic-Static EEG Fusion Encoder
- Function: Adaptively fuse dynamic (temporally rich) and static (high signal-to-noise ratio) EEG signals.
- Mechanism: Static EEG is encoded as \(z_s = E_s(e_s)\), and dynamic EEG is encoded as \(z_d = E_d(e_d)\) (incorporating temporal self-attention). Adaptive neural aggregator: \(z_{sd} = \text{Softmax}(QK^T/\sqrt{d})V\), where \(Q\) is derived from the static representation, and \(K/V\) are derived from the dynamic representation.
- Design Motivation: Dynamic videos provide multi-angle information but elicit complex EEG responses; static images have a higher signal-to-noise ratio but lack 3D perspectives. Cross-attention allows the static representation to guide (\(Q\)) while the dynamic representation complements (\(K/V\)).
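The aggregation step can be illustrated with a minimal sketch of scaled dot-product cross-attention, where static tokens act as queries and dynamic tokens act as keys and values. This is only an illustration under simplifying assumptions: the learned projections for \(Q\), \(K\), and \(V\) are replaced by the identity, a single head is used, and the function names `softmax` and `cross_attention` are hypothetical, not taken from the Neuro-3D code.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(z_s, z_d):
    """Static tokens (queries) attend over dynamic tokens (keys and values).

    z_s: list of static token vectors, each of length d.
    z_d: list of dynamic token vectors, each of length d.
    Assumption: learned projections W_Q, W_K, W_V are omitted (identity).
    """
    d = len(z_s[0])
    fused_tokens = []
    for q in z_s:
        # Scaled dot-product score of this query against every dynamic token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in z_d]
        weights = softmax(scores)
        # Weighted sum of the dynamic values gives the fused token.
        fused = [sum(w * v[j] for w, v in zip(weights, z_d)) for j in range(d)]
        fused_tokens.append(fused)
    return fused_tokens

# Toy example: 2 static tokens, 3 dynamic tokens, d = 2.
z_s = [[1.0, 0.0], [0.0, 1.0]]
z_d = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
z_sd = cross_attention(z_s, z_d)
```

Each output token is a convex combination of the dynamic tokens, so the output has one fused vector per static query. In this toy case the first query weights the first dynamic token most heavily.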
Geometry-Appearance Decoupled Learning
- Function: Decompose the EEG representation into separate shape and color branches.
- Mechanism: Two MLPs map the fused features to \(f_g\) (geometry) and \(f_a\) (appearance). Each is aligned with the CLIP visual features \(f_v\) through \(\mathcal{L}_{align}(f, f_v) = \alpha \cdot \text{CLIP}(f, f_v) + (1-\alpha) \cdot \text{MSE}(f, f_v)\), where \(f \in \{f_g, f_a\}\), \(\text{CLIP}(\cdot, \cdot)\) denotes the CLIP-style contrastive loss, and \(\alpha\) balances the two terms. A category classification loss \(\mathcal{L}_c\) is added alongside.
- Design Motivation: 3D shape and color are independent attributes—different colors can exist for the same shape; decoupling them enables more efficient learning for each branch.
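To make the alignment objective concrete, the following sketch combines a contrastive term with an MSE term, as in \(\mathcal{L}_{align}\). Several details are assumptions rather than facts from the paper: the contrastive term is written as a symmetric InfoNCE loss over a batch, the temperature of 0.07 and the default \(\alpha = 0.9\) are placeholder values, and all function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def log_softmax_at(logits, idx):
    """Log-probability of entry idx under a softmax over logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[idx] - lse

def clip_loss(f, f_v, temperature=0.07):
    """Symmetric InfoNCE: row i of f should match row i of f_v.

    Assumption: the temperature value is a placeholder.
    """
    f = [normalize(x) for x in f]
    f_v = [normalize(x) for x in f_v]
    n = len(f)
    sim = [[sum(a * b for a, b in zip(f[i], f_v[j])) / temperature
            for j in range(n)] for i in range(n)]
    eeg_to_img = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    img_to_eeg = -sum(log_softmax_at([sim[i][j] for i in range(n)], j)
                      for j in range(n)) / n
    return 0.5 * (eeg_to_img + img_to_eeg)

def mse_loss(f, f_v):
    """Mean squared error over all entries of the batch."""
    total = sum((a - b) ** 2 for x, y in zip(f, f_v) for a, b in zip(x, y))
    return total / (len(f) * len(f[0]))

def align_loss(f, f_v, alpha=0.9):
    """alpha * contrastive + (1 - alpha) * MSE; alpha is a placeholder value."""
    return alpha * clip_loss(f, f_v) + (1 - alpha) * mse_loss(f, f_v)

# Toy batch: correctly paired features vs. a shuffled pairing.
f_v = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matched = [row[:] for row in f_v]
shuffled = f_v[1:] + f_v[:1]
loss_matched = align_loss(matched, f_v)
loss_shuffled = align_loss(shuffled, f_v)
```

In this toy batch the correctly paired features yield a loss near zero, while the shuffled pairing is penalized by both terms. The same function would be applied once to \(f_g\) and once to \(f_a\).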
Diffusion Point Cloud Generation + Majority Voting Coloring
- Function: Generate 3D shapes from EEG features and apply color.
- Mechanism: A Point-Voxel Network (PVN) serves as the denoiser and generates 8192 3D points via a Markov diffusion process conditioned on \(f_g\). Coloring is simplified with majority voting: the model predicts the object's dominant color rather than point-wise colors, which reduces prediction complexity.
- Design Motivation: EEG signals are too noisy to support accurate point-wise color prediction; majority voting provides a reasonable holistic color.
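The majority-voting idea can be sketched as follows: quantize each point's color to a small palette, take the most frequent palette entry as the object's dominant color, and paint every generated point with it. The six-entry palette, the nearest-color quantization, and the function names are all hypothetical choices made for illustration; the actual color classes and procedure used by the authors may differ.

```python
from collections import Counter

# Hypothetical six-color palette (RGB in [0, 1]); not taken from the paper.
PALETTE = {
    "red": (1.0, 0.0, 0.0),
    "green": (0.0, 1.0, 0.0),
    "blue": (0.0, 0.0, 1.0),
    "yellow": (1.0, 1.0, 0.0),
    "white": (1.0, 1.0, 1.0),
    "black": (0.0, 0.0, 0.0),
}

def nearest_color(rgb):
    """Name of the palette entry closest to rgb in squared Euclidean distance."""
    return min(PALETTE, key=lambda name: sum((a - b) ** 2
                                             for a, b in zip(rgb, PALETTE[name])))

def dominant_color(point_colors):
    """Majority vote over the quantized per-point colors."""
    votes = Counter(nearest_color(c) for c in point_colors)
    name, _count = votes.most_common(1)[0]
    return name, PALETTE[name]

def colorize(points, rgb):
    """Attach one shared color to every (x, y, z) point."""
    return [(x, y, z, *rgb) for (x, y, z) in points]

# Toy object: three reddish points and one bluish point.
colors = [(0.9, 0.1, 0.1), (0.8, 0.2, 0.1), (0.95, 0.0, 0.05), (0.1, 0.1, 0.9)]
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
name, rgb = dominant_color(colors)
colored_cloud = colorize(points, rgb)
```

Reducing color to a single class per object turns a dense regression problem into a small classification problem, which is a better match for a noisy conditioning signal.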

Loss & Training

\(\mathcal{L} = \mathcal{L}_{align}(f_g, f_v) + \mathcal{L}_{align}(f_a, f_v) + \gamma \mathcal{L}_c\). The feature dimension is 1024, and videos are downsampled to \(n=4\) frames for CLIP alignment.
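As a small worked example of how the three terms combine, the sketch below adds two precomputed alignment losses to a weighted cross-entropy classification loss. The value \(\gamma = 0.1\) and the numeric alignment losses are placeholders chosen for illustration, and `cross_entropy` and `total_loss` are hypothetical helper names.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of a single example: log-sum-exp minus the target logit."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]

def total_loss(align_g, align_a, class_logits, target, gamma=0.1):
    """L = L_align(f_g, f_v) + L_align(f_a, f_v) + gamma * L_c.

    Assumption: gamma = 0.1 is a placeholder, not the paper's setting.
    """
    return align_g + align_a + gamma * cross_entropy(class_logits, target)

# An untrained classifier with uniform logits over 72 classes has L_c = ln(72).
uniform_logits = [0.0] * 72
loss = total_loss(0.5, 0.4, uniform_logits, target=3, gamma=0.1)
```

With uniform logits the classification term equals \(\ln 72 \approx 4.28\), so it contributes about 0.43 to the total at this placeholder weight.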

Key Experimental Results

Main Results

Task	Metric	Neuro-3D	Baselines / chance level
Object Category Classification (72 classes)	top-1	Significantly outperforms	DeepNet 3.70%, Random 1.39%
Color Category Classification (6 classes)	top-1	Significantly outperforms	DeepNet 20.95%, Random 16.67%
3D Reconstruction	2-way top-1	Effectively discriminates	-
3D Reconstruction	Chamfer Distance	Reasonably generates	-
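For readers unfamiliar with the reconstruction metric, here is a minimal sketch of a symmetric Chamfer distance between two point sets. Conventions vary (squared vs. unsquared distances, sum vs. mean), and this sketch assumes the mean of squared nearest-neighbor distances in both directions; the paper's exact variant is not specified in these notes.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a and b.

    Assumption: mean of squared nearest-neighbor distances, summed over
    both directions. Brute force, O(len(a) * len(b)).
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

# Toy example: the second cloud has one point displaced by 1 along z.
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd_same = chamfer_distance(cloud_a, cloud_a)
cd_diff = chamfer_distance(cloud_a, cloud_b)
```

Identical clouds give a distance of zero, and the displaced point raises the value. For 8192-point clouds a brute-force loop like this is slow, so a vectorized or KD-tree implementation would normally be used.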

Ablation Study

Configuration	Effect	Description
Static EEG only	Classification drops	Lacks 3D perspective information
Dynamic EEG only	Classification drops	Insufficient signal-to-noise ratio
Dynamic + static fusion	Optimal	Complementary information
w/o CLIP alignment	Reconstruction degrades	Semantic alignment serves as a bridge
w/o Decoupling	Shape/color confusion	Decoupling assists individual learning

Key Findings

Dynamic-static fusion performs better than using either modality alone—confirming that the two stimuli provide complementary information.
Although 3D point cloud reconstruction from EEG is coarse, it is recognizable at the category level—marking the first step in this direction.
The EEG-3D dataset is the first benchmark that simultaneously contains EEG recordings, 3D ground truths, and color information.

Highlights & Insights

Pioneering Problem Definition: The paper is the first to pose visual decoding from EEG to 3D as a task.
Long-term Value of the EEG-3D Dataset: 12 subjects \(\times\) 72 classes \(\times\) multi-modal annotations, which can support various future works.
Innovation of Dynamic Video Stimuli: Previous EEG datasets only featured static images—rotating videos provide crucial cues for 3D perception.

Limitations & Future Work

The low signal-to-noise ratio of EEG results in coarse reconstruction quality. Future work could consider other recording modalities, such as fNIRS or invasive electrocorticography (ECoG), which offers a higher signal-to-noise ratio than scalp EEG.
Color prediction is simplified to majority voting—point-wise color prediction requires more effective signal decoding.
Only 12 subjects are included; generalization to a larger population requires validation.
The 3D reconstruction is mainly recognizable at the category level, with limited ability for fine-grained discrimination within the same category.

vs. MindEye/Brain-Diffuser: These methods reconstruct 2D images from fMRI. Neuro-3D extends decoding to 3D and uses the more portable EEG.
vs. Mind-3D: Its data also provide 3D annotations but lack color information, whereas EEG-3D is the first to include color annotations.

Rating

Novelty: ⭐⭐⭐⭐⭐ First 3D point cloud reconstruction from EEG.
Experimental Thoroughness: ⭐⭐⭐⭐ Comprehensive evaluation across datasets, methods, classification, and reconstruction.
Writing Quality: ⭐⭐⭐⭐ Clear.
Value: ⭐⭐⭐⭐ Pioneering work and dataset contribution.