BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation

Conference: ACL 2025
arXiv: 2410.14971
Code: None
Area: Model Compression
Keywords: Brain signal decoding, EEG/MEG-to-Text, Vector Quantization, Mel-spectrogram reconstruction, Whisper

TL;DR

This paper proposes BrainECHO, a three-stage framework (Autoencoding-Alignment-Finetuning) that maps brain signals to the Mel-spectrogram space via vector-quantized discrete representations, enabling high-quality non-invasive brain-to-text decoding with Whisper.

Background & Motivation

1. Background

Decoding text from electroencephalography (EEG) and magnetoencephalography (MEG) signals is a frontier topic in brain-computer interfaces (BCIs). Recently, pre-trained models such as BART (a language model) and Whisper (a speech recognition model) have enabled open-vocabulary brain-to-text decoding.

2. Limitations of Prior Work

  • Teacher-forcing reliance: BART-based methods (e.g., EEG-to-Text, DeWave) rely on ground-truth target text prefixes during inference; performance drops drastically once teacher-forcing is removed.
  • Sensitivity to session noise: EEG/MEG signals are susceptible to muscle movements, ocular artifacts, and electrode impedance changes, making generalization across subjects/sessions difficult.
  • Modality alignment imbalance: Pre-trained language models excessively dominate the decoding process, leading to insufficient alignment between brain signals and linguistic representations.

3. Key Challenge

Directly mapping continuous brain signals to discrete text tokens faces "distribution shift"—the continuous-to-discrete end-to-end mapping is prone to spurious correlations, which is further exacerbated by noise in the brain signals.

4. Goal

How to achieve robust, high-quality EEG/MEG-to-text decoding without relying on teacher-forcing?

5. Key Insight

Introduce discrete representation learning: use Vector Quantization (VQ) to compress brain signals into a discrete codebook space shared with Mel-spectrograms. The quantization step inherently filters noise, and Whisper's speech recognition capability then completes the text decoding.

6. Core Idea

Using the discrete codebook of Mel-spectrograms as a bridge, continuous representations of brain signals are compressed into discrete tokens. High-quality brain-to-spectrogram-to-text decoding is achieved through a three-stage decoupled training paradigm.

Method

Overall Architecture

BrainECHO adopts a three-stage training paradigm:

  1. Stage 1: Mel-Spectrogram Autoencoding (Autoencoding)
  2. Stage 2: Brain-to-Audio Latent Alignment (Alignment)
  3. Stage 3: Whisper Finetuning (Finetuning)

Key Designs

Stage 1: Discrete Autoencoding

A Mel-spectrogram \(m \in \mathbb{R}^{T_m \times F_m}\) is encoded into a feature map \(z_m\) via an audio encoder. Then, a vector quantizer \(Q\) replaces each latent variable with the nearest vector from a codebook \(\mathbb{C} \in \mathbb{R}^{N \times D}\):

\[Q(z_m^{ij}) = z_q^{ij} = c_k, \quad k = \arg\min_{k \in \{1,...,N\}} \|z_m^{ij} - c_k\|_2\]

The training objective is:

\[L_1 = \|m - \hat{m}\|_2^2 + \alpha\|sg(z_m) - z_q\|_2^2 + \beta_1\|z_m - sg(z_q)\|_2^2\]

where \(sg(\cdot)\) denotes the stop-gradient operator. The encoder and decoder adopt a ResUNet structure, with a codebook size of \(N=2048\) and dimension \(D=8\).
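The nearest-code lookup and the three terms of \(L_1\) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the helper names (`sq_dist`, `quantize`, `stage1_loss`), the toy codebook, and the weights `alpha=1.0` and `beta1=0.25` are assumptions, since the loss weights are not given here. Because \(sg(\cdot)\) only changes which parameters receive gradients, the codebook term and the commitment term have the same numeric value in this forward-only sketch.

```python
def sq_dist(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z, codebook):
    """Return (k, c_k): index and value of the codebook vector nearest to z."""
    k = min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))
    return k, codebook[k]


def stage1_loss(m, m_hat, z_m, z_q, alpha=1.0, beta1=0.25):
    """Forward value of L_1; alpha and beta1 are placeholder weights (assumed)."""
    recon = sum(sq_dist(row, row_hat) for row, row_hat in zip(m, m_hat))
    # ||sg(z_m) - z_q||^2 and ||z_m - sg(z_q)||^2 share one numeric value;
    # stop-gradient only decides whether the codebook or the encoder is updated.
    vq = sum(sq_dist(z, q) for z, q in zip(z_m, z_q))
    return recon + alpha * vq + beta1 * vq


# Toy codebook with N=3 entries of dimension D=2 (the note reports N=2048, D=8).
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
z_m = [[0.9, 1.2], [0.1, -0.2]]            # two latent vectors from the encoder
pairs = [quantize(z, codebook) for z in z_m]
indices = [k for k, _ in pairs]            # discrete tokens
z_q = [c for _, c in pairs]                # quantized latents fed to the decoder

m = [[1.0, 2.0]]                           # toy target spectrogram
m_hat = [[1.0, 1.0]]                       # toy reconstruction
loss = stage1_loss(m, m_hat, z_m, z_q)
```

In an autodiff framework the stop-gradient would typically be expressed with something like `detach()` in PyTorch or `jax.lax.stop_gradient` in JAX.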

Stage 2: Frozen Alignment

Freeze the quantizer and decoder trained in Stage 1. Train a Conformer-based brain signal encoder to transform the raw EEG/MEG signal \(\varepsilon\) into a latent representation \(z_\varepsilon\), then reuse the frozen quantizer and decoder to reconstruct the Mel-spectrogram:

\[L_2 = \|m - Dec(Q(z_\varepsilon))\|_2^2 + \gamma\|z_m - z_\varepsilon\|_2^2 + \beta_2\|z_\varepsilon - sg(Q(z_\varepsilon))\|_2^2\]

Key design point: Use a unified codebook—the same discrete space represents both audio and brain signals. The quantization process acts as a "sparsity-inducing filter" that naturally filters out task-irrelevant noise.
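The filtering effect of quantization can be illustrated with a toy example: latents perturbed by small noise still snap to the same codebook entries, so the frozen decoder receives identical input. This is only a sketch under stated assumptions: the helper names, the identity `toy_decode`, the 3-entry codebook, and the weights `gamma=1.0` and `beta2=0.25` are all hypothetical, and whether real EEG/MEG noise is absorbed this way depends on its magnitude relative to the spacing between codes.

```python
import random


def sq_dist(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(z, codebook):
    """Index of the codebook vector closest to z."""
    return min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))


def stage2_loss(m, z_m, z_eps, codebook, decode, gamma=1.0, beta2=0.25):
    """Forward value of L_2 with a frozen codebook and decoder (weights assumed)."""
    z_q = [codebook[nearest_index(z, codebook)] for z in z_eps]
    m_hat = decode(z_q)
    recon = sum(sq_dist(a, b) for a, b in zip(m, m_hat))      # spectrogram term
    align = sum(sq_dist(a, b) for a, b in zip(z_m, z_eps))    # latent alignment
    commit = sum(sq_dist(a, b) for a, b in zip(z_eps, z_q))   # commitment term
    return recon + gamma * align + beta2 * commit


def toy_decode(z_q):
    """Hypothetical stand-in for the frozen Stage 1 decoder (identity map)."""
    return [list(z) for z in z_q]


codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
z_m = [[1.0, 1.0], [0.0, 0.0], [2.0, 0.0]]     # audio latents (toy values)
rng = random.Random(0)
# Brain latents modeled as the audio latents plus bounded noise.
z_eps = [[v + rng.uniform(-0.2, 0.2) for v in z] for z in z_m]

clean_ids = [nearest_index(z, codebook) for z in z_m]
noisy_ids = [nearest_index(z, codebook) for z in z_eps]

m = toy_decode(z_m)
loss = stage2_loss(m, z_m, z_eps, codebook, toy_decode)
```

Here `noisy_ids` equals `clean_ids`, so the reconstruction term is zero and the remaining loss comes only from the alignment and commitment terms, which pull \(z_\varepsilon\) toward \(z_m\) and toward its assigned code.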

Stage 3: Whisper Finetuning

Feed the reconstructed Mel-spectrogram into the Whisper-base model to decode text. Fine-tune the Whisper encoder using AdaLoRA to minimize the cross-entropy loss. This stage bridges the gap between the brain-reconstructed spectrograms and the Whisper pre-training distribution.

Brain Signal Encoder

Raw signals pass through a spatio-temporal convolutional network, then a Conformer (4 Transformer layers with 8-head attention), and finally linear layers and 2D convolutions that map the output to the same shape as \(z_m\).

Loss & Training

  • Three-stage decoupled training to reduce resource consumption at each step.
  • An L2 loss (rather than a CLIP-style contrastive loss) is used to favor high-fidelity spectrogram reconstruction.
  • Beam search (beam size = 5) + repetition penalty (penalty = 5.0, no-repeat 2-gram).
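The two decoding constraints listed above can be sketched as follows. The note does not say which implementation was used, so this is an assumption: it follows a common formulation in which the logit of every already-generated token is divided by the penalty when positive and multiplied by it when negative, and in which a token is banned if it would complete a 2-gram that already occurred. The function names and toy values are illustrative only.

```python
def apply_repetition_penalty(logits, generated, penalty=5.0):
    """Down-weight tokens that already appear in the generated sequence."""
    out = list(logits)
    for tok in set(generated):
        # Positive logits shrink, negative logits grow more negative.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def banned_by_no_repeat_bigram(generated):
    """Tokens that would repeat an existing 2-gram if emitted next."""
    if not generated:
        return set()
    last = generated[-1]
    # Any token that previously followed `last` would recreate that bigram.
    return {
        generated[i + 1]
        for i in range(len(generated) - 1)
        if generated[i] == last
    }


generated = [3, 1, 2, 1]                 # token ids produced so far (toy values)
logits = [2.0, -1.0, 4.0, 0.5]           # next-token scores over a 4-token vocab

penalized = apply_repetition_penalty(logits, generated)
banned = banned_by_no_repeat_bigram(generated)

# Mask banned tokens before the beam search picks candidates.
masked = [float("-inf") if t in banned else s for t, s in enumerate(penalized)]
best = max(range(len(masked)), key=lambda t: masked[t])
```

With these toy values the bigram (1, 2) has already occurred, so token 2 is banned after the trailing 1, and the unpenalized token 0 becomes the top candidate.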

Key Experimental Results

Main Results (Brennan EEG Dataset)

| Method | Input | BLEU-1 | BLEU-4 | ROUGE-1 F | WER↓ |
|---|---|---|---|---|---|
| EEG-to-Text | EEG Features | 8.82 | 1.44 | 13.12 | 233.99 |
| NeuSpeech | EEG | 85.31 | 83.75 | 82.64 | 16.97 |
| MAD | EEG | 80.34 | 78.15 | 83.79 | 42.14 |
| BrainECHO | EEG | 89.78 | 88.55 | 87.13 | 11.72 |
| BrainECHO (Noise) | Noise | 4.75 | 0 | 8.52 | 105.27 |
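Several WER values in this table exceed 100. That is possible because word error rate divides the word-level edit distance (substitutions, deletions, and insertions) by the reference length, so a hypothesis with many inserted words can accumulate more errors than the reference has words. A minimal sketch of the standard definition follows; the function name and the example sentences are illustrative and not taken from the paper, and the paper's exact text normalization is not specified here.

```python
def wer(reference, hypothesis):
    """Word error rate in percent: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)


exact = wer("the cat sat", "the cat sat")
# Four insertions plus one substitution against a two-word reference.
over = wer("the cat", "a dog ran over the hill")
```

In this toy case `over` is 250.0, which shows how a long, mostly wrong hypothesis produces a WER above 100.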

GWilliams MEG Dataset

| Method | Split | BLEU-4 | WER↓ |
|---|---|---|---|
| NeuSpeech | Random | 47.78 | 56.63 |
| MAD | Random | 0 | 105.33 |
| BrainECHO | Random | 72.42 | 31.44 |
| BrainECHO | Session | 74.27 | 29.59 |
| BrainECHO | Subject | 74.14 | 29.80 |

Ablation Study (Three-Stage Training)

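| Configuration | BLEU-4 |
|---|---|
| All three stages (Autoencoding, Alignment, Finetuning) | 88.55 |
| Without Autoencoding | 85.74 (-3.17%) |
| (stage marks for this row not recoverable) | 86.38 |
| Without Finetuning | 28.32 |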

Key Findings

  1. BLEU-4 reaches 88.55 (Brennan) and 72.42 (GWilliams, random split), outperforming the previous SOTA NeuSpeech (83.75 and 47.78) by 5.73% and 51.57% in relative terms.
  2. Noise Test: When Gaussian noise is fed in place of EEG, BLEU-4 drops to 0, which suggests that the model relies on the link between brain signals and text rather than simply memorizing text templates.
  3. Robustness Across Splits: The performance difference among Subject, Session, and Sentence splits is minimal, removing the need for external subject identifiers.
  4. The discrete representation space provided by the autoencoding stage matters: without it, BLEU-4 falls from 88.55 to 85.74, a 3.17% relative drop.
  5. The finetuning stage is crucial—without finetuning Whisper, BLEU-4 drops dramatically from 88.55 to 28.32.
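The percentages quoted in these findings are relative changes, not absolute BLEU point differences. The short check below recomputes them from the table values; the helper name `relative_change` is illustrative.

```python
def relative_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


# BrainECHO vs. NeuSpeech, BLEU-4 on Brennan and on GWilliams (random split).
brennan_gain = round(relative_change(88.55, 83.75), 2)
gwilliams_gain = round(relative_change(72.42, 47.78), 2)

# Ablation: removing the autoencoding stage, relative to the full model.
no_autoencoding = round(relative_change(85.74, 88.55), 2)

# Absolute gaps in BLEU-4 points, for comparison.
brennan_points = round(88.55 - 83.75, 2)
gwilliams_points = round(72.42 - 47.78, 2)
```

The relative gains come out as 5.73 and 51.57, and the ablation change as -3.17, while the absolute gaps are 4.8 and 24.64 BLEU-4 points.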

Highlights & Insights

  • Elegant Three-Stage Decoupled Design: The discrete codebook serves as both a cross-modal bridge and a noise filter, killing two birds with one stone.
  • Breaking the Teacher-Forcing Bottleneck: Previous BART-based methods barely worked without teacher-forcing; BrainECHO achieves true autoregressive decoding.
  • Spectrogram Duration Extension: Expanded from 3 seconds to 10+ seconds, supporting sentence-level rather than segment-level decoding while preserving complete semantics.
  • Shared Representation of the Unified Codebook: Brain and audio signals share the same discrete space, elegantly solving the modality alignment problem.

Limitations & Future Work

  1. Evaluated on only 2 relatively small-scale datasets (140/661 sentences); generalizability to larger datasets needs further exploration.
  2. Relies on the auditory paradigm—participants must listen to speech; visual reading or inner speech scenarios are not yet validated.
  3. Spectrogram reconstruction quality heavily influences final text decoding, but the relationship between reconstruction loss and decoding quality is not deeply analyzed.
  4. Whisper-base is relatively small; using larger Whisper variants could potentially boost performance.
  5. Training involves multiple stages; the end-to-end efficiency in practical deployment needs optimizing.

Related Work & Insights

  • NeuSpeech / MAD: Pioneers of Whisper-based MEG-to-Text; BrainECHO introduces discrete representations on top of this.
  • VQ-VAE: Vector Quantization technology is widely used in speech and image generation; this paper cleverly applies it to brain signals.
  • DeWave: A BART-based method using discrete EEG encoding but still relying on teacher-forcing.
  • Insight: Using discrete representations as a cross-modal bridge can be extended to other sensory signals (e.g., fNIRS, electromyography) decoding tasks.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — The design of using a three-stage decoupled approach + VQ codebook as a cross-modal bridge is highly innovative.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Conducted on two datasets with multiple splitting strategies and detailed ablation studies, though the data scale is relatively small.
  • Writing Quality: ⭐⭐⭐⭐ — Clear framework diagram and detailed method descriptions.
  • Value: ⭐⭐⭐⭐⭐ — Breakthrough progress in the field of brain signal decoding, providing a new paradigm for BCI-based text input.