Autoregressive Visual Decoding from EEG Signals

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=TKjfzuVLX4
Code: https://github.com/ddicee/avde
Area: Brain-Computer Interface / EEG Visual Decoding / Autoregressive Image Generation
Keywords: EEG-to-Image, Visual Decoding, LaBraM, Next-Scale Prediction (VAR), Contrastive Learning, BCI

TL;DR

AVDE reformulates EEG-to-image decoding as a lightweight, two-stage autoregressive pipeline: first, it aligns EEG to the CLIP image space by fine-tuning the pre-trained EEG foundation model LaBraM with contrastive learning; then, it uses "Next-Scale Prediction" from the VAR framework to generate images progressively from EEG embeddings. With roughly 11% of the parameters (425.3M vs. 3818.1M), it outperforms previous SOTA models that rely on large diffusion models in both retrieval and reconstruction tasks.

Background & Motivation

  • Background: Decoding visual content from non-invasive brain signals is at the intersection of cognitive science and generative AI. While fMRI offers high precision, it is slow, expensive, and hardware-constrained. EEG, with its millisecond temporal resolution, portability, and low cost, has become a more deployable alternative. Recent works (ATM, NICE, etc.) have demonstrated potential in image retrieval and reconstruction.
  • Limitations of Prior Work: Mainstream EEG visual decoding follows the unCLIP paradigm—a multi-stage pipeline (Figure 1 shows five stages) consisting of an EEG encoder → diffusion prior → Stable Diffusion. This has three major flaws: ① Serial multi-stage processes accumulate errors at each step, damaging reconstruction fidelity; ② EEG encoders are often trained from scratch, making it difficult to extract good features from noisy signals with limited image-EEG pairs; ③ Diffusion models often exceed 3 billion parameters, making the computational/memory overhead impractical for real-time BCI.
  • Key Challenge: EEG is a noisy, low-information-density 1D time-series signal, while images are structured high-dimensional visual content, so the distribution gap is large. Bridging it usually requires complex multi-stage pipelines, but added complexity reduces controllability and deployability. The result is a trade-off between fidelity and simplicity/deployability.
  • Goal: Replace the multi-stage diffusion pipeline with a direct, coherent, and lightweight framework that maintains a direct mapping between EEG and images while reducing parameters and inference costs to deployable levels.
  • Core Idea: [Transfer Pre-trained EEG Representations] Replace encoders trained from scratch with LaBraM, pre-trained on 2000+ hours of EEG data. [Autoregressive Next-Scale Prediction instead of Diffusion] Use the VAR framework to treat EEG embeddings as the "coarsest scale" of an image, using a transformer to generate images from coarse to fine, naturally aligning the generation process with the hierarchical visual perception of the human brain.

Method

Overall Architecture

AVDE is a clear two-stage process. Stage one is "Alignment": encode EEG using pre-trained LaBraM, freeze CLIP to encode images, and pull EEG into the image representation space using a joint contrastive + regression objective to obtain information-rich EEG embeddings. Stage two is "Generation": treat the EEG embedding as the starting token [s] for the coarsest scale, feed it into a decoder-only transformer to autoregressively predict VQ-VAE multi-scale residual maps via "next-scale prediction," accumulate them from coarse to fine into a complete feature map, and finally reconstruct the image via the VQ-VAE decoder.

flowchart LR
    EEG[EEG Signal<br/>C×T] --> LaBraM[LaBraM Encoder<br/>Pre-trained + Contrastive Fine-tuning]
    IMG[Image] --> CLIP[Frozen CLIP] 
    LaBraM -. Stage 1 CLIP+MSE Alignment .- CLIP
    LaBraM --> EMB[EEG Embedding e]
    EMB --> PROJ["Project to Starting Token [s]"]
    PROJ --> VAR[VAR Transformer<br/>Next-Scale Prediction]
    VAR --> R["Hierarchical Residuals R1→R2→…→RK"]
    R --> VQ[VQ-VAE Decoder]
    VQ --> OUT[Reconstructed Image]

Key Designs

1. Replacing randomly initialized encoders with pre-trained LaBraM: Standing on the shoulders of giants. EEG visual decoding has long been hindered by limited paired data and noisy signals, and encoders trained from scratch struggle to converge. AVDE uses LaBraM, a model pre-trained on 2000+ hours of multi-subject, multi-condition EEG data. The encoding process involves: partitioning \(X \in \mathbb{R}^{C\times T}\) into patches along the time dimension using non-overlapping windows of length \(w\); extracting local temporal features \(e_{c_j,k}\in\mathbb{R}^d\) (channel \(c_j\), window \(k\)) via 1D convolution, group normalization, and GELU; injecting spatio-temporal context with trainable temporal embeddings \(te_k\) and spatial embeddings \(se_j\); and integrating dependencies using a Transformer encoder. This transfer learning allows the encoder to generalize across subjects and extract semantically meaningful features.
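The patching step described above can be illustrated with a minimal sketch. It uses plain Python lists to stay dependency-free; the function name `patchify` and the choice to drop trailing samples that do not fill a window are assumptions made for illustration, not details taken from the paper or from the LaBraM code. The convolutional feature extractor and the embeddings \(te_k\), \(se_j\) are omitted.

```python
def patchify(eeg, w):
    """Split a C x T recording into non-overlapping time windows of length w.

    eeg is a list of C channels, each a list of T samples.
    Returns patches such that patches[c][k] holds the w samples of window k
    on channel c.
    """
    # Assumption: samples past the last full window are dropped.
    n_patches = len(eeg[0]) // w
    return [
        [channel[k * w:(k + 1) * w] for k in range(n_patches)]
        for channel in eeg
    ]


# Toy recording: C = 2 channels, T = 6 samples, window length w = 3.
eeg = [[0, 1, 2, 3, 4, 5],
       [10, 11, 12, 13, 14, 15]]
patches = patchify(eeg, w=3)
```

Each `patches[c][k]` would then be mapped to a feature vector \(e_{c_j,k}\) before the temporal and spatial embeddings are added.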

2. Combined Contrastive + Regression Alignment: Structural alignment plus point-to-point precision. Since LaBraM was pre-trained on clinical EEG rather than visual stimuli, it must be fine-tuned. Given EEG-image pairs, LaBraM encodes EEG into \(e\) and frozen CLIP encodes images into \(z\). A bi-directional contrastive loss pulls matching pairs together: \(L_{CLIP} = -\frac{1}{B}\sum_i \big(\log\frac{\exp(s(e_i,z_i)/\tau)}{\sum_j \exp(s(e_i,z_j)/\tau)} + \log\frac{\exp(s(e_i,z_i)/\tau)}{\sum_k \exp(s(e_k,z_i)/\tau)}\big)\), where \(B\) is the batch size, \(s\) is cosine similarity, and \(\tau\) is a learnable temperature. The first term ranks the correct image among all images for a given EEG sample; the second ranks the correct EEG sample among all EEG samples for a given image. To fix absolute position in the embedding space rather than only relative ranking, a regression term is added: \(L_{Combined} = \lambda L_{CLIP} + (1-\lambda) L_{MSE}\) (\(\lambda=0.8\)). The contrastive loss places brain signals structurally within the image manifold, while MSE pins them near the corresponding image embeddings.
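To make the combined objective concrete, here is a small dependency-free sketch of \(L_{CLIP}\), \(L_{MSE}\), and their weighted sum. The weight `lam=0.8` comes from the text; the fixed temperature `tau=0.07` is only a placeholder assumption (the text says \(\tau\) is learnable), and averaging the MSE over all embedding elements is likewise an assumption. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity s(u, v) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_softmax_at(logits, idx):
    """log(exp(logits[idx]) / sum_j exp(logits[j])), computed stably."""
    m = max(logits)
    return logits[idx] - m - math.log(sum(math.exp(x - m) for x in logits))


def clip_loss(E, Z, tau):
    """Bi-directional contrastive loss over a batch of paired embeddings."""
    B = len(E)
    # S[i][j] = s(e_i, z_j) / tau
    S = [[cosine(e, z) / tau for z in Z] for e in E]
    total = 0.0
    for i in range(B):
        row = S[i]                         # EEG i scored against every image
        col = [S[k][i] for k in range(B)]  # image i scored against every EEG
        total += log_softmax_at(row, i) + log_softmax_at(col, i)
    return -total / B


def mse_loss(E, Z):
    """Mean squared error over all embedding elements (assumed reduction)."""
    n = sum(len(e) for e in E)
    return sum((a - b) ** 2 for e, z in zip(E, Z) for a, b in zip(e, z)) / n


def combined_loss(E, Z, tau=0.07, lam=0.8):
    """lam * L_CLIP + (1 - lam) * L_MSE."""
    return lam * clip_loss(E, Z, tau) + (1 - lam) * mse_loss(E, Z)


# Toy batch: perfectly aligned pairs versus deliberately mismatched pairs.
E = [[1.0, 0.0], [0.0, 1.0]]
aligned = combined_loss(E, [[1.0, 0.0], [0.0, 1.0]])
shuffled = combined_loss(E, [[0.0, 1.0], [1.0, 0.0]])
```

On this toy batch the aligned pairing yields a much smaller loss than the shuffled one, which is the behaviour the objective is meant to reward.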

3. Autoregressive Generation with Next-Scale Prediction: VAR as a natural metaphor for brain vision. Moving away from diffusion, AVDE leverages the VAR approach: a pre-trained VQ-VAE quantizes images into \(K\) multi-scale residual maps \((R_1,\dots,R_K)\) with increasing resolution. Feature maps are reconstructed via cumulative upsampling: \(F_k = \sum_{i=1}^k \mathrm{up}(R_i,(h,w))\). A decoder-only transformer predicts these residuals conditioned on the EEG embedding \(e\): \(p(R_1,\dots,R_K)=\prod_{k=1}^K p(R_k\mid R_1,\dots,R_{k-1},e)\). Specifically, \(e\) is projected to a starting token [s] to initiate generation. For each scale \(k\), a downsampled version of the previous cumulative map \(\tilde F_{k-1}=\mathrm{down}(F_{k-1},(h_k,w_k))\) acts as input. Training uses a block-wise causal attention mask and cross-entropy supervision. The EEG embedding acts as the "coarsest layer," ensuring the generation chain is directly linked to the input, mirroring the human brain's hierarchical perception from V1 (edges/colors) → V2/V4 (contours/structure) → IT (global objects).
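Two pieces of the description above lend themselves to a short sketch: the cumulative upsampling \(F_k = \sum_{i=1}^k \mathrm{up}(R_i,(h,w))\) and the block-wise causal attention mask. The code below is illustrative only. It treats each residual map as a 2D grid of scalars and uses nearest-neighbour resizing as a stand-in for \(\mathrm{up}(\cdot)\); both are simplifying assumptions, and the function names are hypothetical rather than taken from the AVDE or VAR code. The starting token [s] is not modelled.

```python
def upsample_nearest(grid, size):
    """Nearest-neighbour resize of a 2D list to (h, w); stands in for up(.)."""
    h, w = size
    src_h, src_w = len(grid), len(grid[0])
    return [
        [grid[r * src_h // h][c * src_w // w] for c in range(w)]
        for r in range(h)
    ]


def accumulate(residuals, size):
    """Return [F_1, ..., F_K] with F_k = sum over i <= k of up(R_i, size)."""
    h, w = size
    total = [[0.0] * w for _ in range(h)]
    maps = []
    for R in residuals:
        up = upsample_nearest(R, size)
        # Build a new grid each step so earlier F_k are not modified.
        total = [[total[r][c] + up[r][c] for c in range(w)] for r in range(h)]
        maps.append(total)
    return maps


def blockwise_causal_mask(scale_sizes):
    """mask[q][k] is True when query token q may attend to key token k.

    Tokens attend to every token of their own scale and of all coarser
    (earlier) scales, never to finer (later) scales.
    """
    scale_of = [k for k, (h, w) in enumerate(scale_sizes) for _ in range(h * w)]
    return [[sk <= sq for sk in scale_of] for sq in scale_of]


# Toy example with K = 2 scales: R_1 is 1x1, R_2 is 2x2.
residuals = [[[1.0]], [[0.0, 1.0], [2.0, 3.0]]]
F = accumulate(residuals, (2, 2))
mask = blockwise_causal_mask([(1, 1), (2, 2)])
```

Here `F[0]` is the coarse map broadcast to full resolution and `F[1]` adds the finer residual on top, which mirrors the coarse-to-fine accumulation; the mask lets the four tokens of the second scale see one another and the single coarse token, while the coarse token sees only itself.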

Key Experimental Results

Datasets: THINGS-EEG (10 subjects, 1654 concepts training / 200 concepts test, RSVP paradigm, 63 channels) as primary, EEG-ImageNet as supplementary. Tasks: 200-way zero-shot retrieval + image reconstruction.

Main Results (200-way Zero-shot Retrieval, Average Top-1/Top-5)

| Method | Within-Subject Top-1 | Within-Subject Top-5 | Cross-Subject Top-1 | Cross-Subject Top-5 |
|---|---|---|---|---|
| EEGNetV4 | 0.186 | 0.441 | 0.089 | 0.224 |
| NICE | 0.242 | 0.512 | 0.113 | 0.273 |
| ATM (Li et al. 2024) | 0.269 | 0.548 | 0.115 | 0.280 |
| AVDE (Ours) | 0.300 | 0.582 | 0.143 | 0.329 |
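For readers unfamiliar with the metric, the Top-1/Top-5 numbers above can be understood through the following minimal sketch of top-k retrieval accuracy. It assumes a square similarity matrix in which the correct image for query `i` sits at index `i`; this layout and the function name are assumptions for illustration, not the paper's evaluation code.

```python
def top_k_accuracy(similarity, k):
    """Fraction of queries whose correct candidate ranks within the top k.

    similarity[i][j] is the score between EEG query i and candidate image j.
    Assumption: the ground-truth image for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(similarity)


# Toy 3-way retrieval: query 1 ranks the wrong image first.
sim = [[0.9, 0.1, 0.3],
       [0.8, 0.5, 0.2],
       [0.1, 0.2, 0.7]]
top1 = top_k_accuracy(sim, k=1)
top2 = top_k_accuracy(sim, k=2)
```

In the 200-way setting each row would hold 200 candidate scores, so chance level is 0.005 for Top-1 and 0.025 for Top-5.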

Reconstruction Quality (Subject-08):

| Method | PixCorr↑ | SSIM↑ | AlexNet(5)↑ | Inception↑ | CLIP↑ | SwAV↓ |
|---|---|---|---|---|---|---|
| Li et al. 2024 | 0.160 | 0.345 | 0.866 | 0.734 | 0.786 | 0.582 |
| CognitionCapturer | 0.175 | 0.366 | 0.610 | 0.721 | 0.744 | 0.577 |
| AVDE | 0.188 | 0.396 | 0.889 | 0.765 | 0.795 | 0.557 |

Efficiency Comparison (Single A100, batch=1):

| Method | Params (M) | FLOPs (G) | Inference (ms) | Memory (MB) |
|---|---|---|---|---|
| Li et al. 2024 | 3818.1 | 8738.6 | 310.4 | 4826.7 |
| AVDE | 425.3 | 1350.5 | 91.2 | 1809.6 |
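The headline efficiency ratios follow directly from the two rows above. The short calculation below reproduces them; the dictionary keys are names chosen here for illustration, and the numbers are copied from the table.

```python
# Values copied from the efficiency table (single A100, batch = 1).
baseline = {"params_M": 3818.1, "flops_G": 8738.6,
            "latency_ms": 310.4, "memory_MB": 4826.7}
avde = {"params_M": 425.3, "flops_G": 1350.5,
        "latency_ms": 91.2, "memory_MB": 1809.6}

# How many times smaller/faster AVDE is on each axis.
ratios = {key: baseline[key] / avde[key] for key in baseline}

# Fraction of baseline parameters removed.
param_reduction = 1 - avde["params_M"] / baseline["params_M"]

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
print(f"parameter reduction: {param_reduction:.1%}")
```

This gives roughly 9.0x fewer parameters (an 88.9% reduction), 6.5x fewer FLOPs, 3.4x lower latency, and 2.7x lower memory use.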

Ablation Study (Average Reconstruction Across Subjects)

| Configuration | PixCorr↑ | SSIM↑ | CLIP↑ | SwAV↓ |
|---|---|---|---|---|
| LaBraM+VAR (Full) | 0.147 | 0.366 | 0.747 | 0.586 |
| ATM+VAR (Encoder swap) | 0.141 | 0.351 | 0.731 | 0.601 |
| EEGNet+VAR (Encoder swap) | 0.132 | 0.323 | 0.712 | 0.627 |
| LaBraM+Li et al. (unCLIP swap) | 0.138 | 0.346 | 0.726 | 0.606 |
| LaBraM+LDM-4 (Diffusion swap) | 0.139 | 0.343 | 0.731 | 0.609 |
| LaBraM+DiT-XL/2 (Diffusion swap) | 0.143 | 0.354 | 0.735 | 0.594 |

Key Findings

  • About 89% fewer parameters (425.3M vs. 3818.1M), 3.4x faster inference, and 2.7x lower memory use, while exceeding SOTA in retrieval and reconstruction. This supports the value of lightweight autoregressive routes for BCI.
  • Ablations show that both the encoder and generation framework are critical: swapping LaBraM or reverting to unCLIP/Diffusion causes performance drops, indicating that the "pre-trained encoder + autoregressive generation" strategy is synergistic.
  • Interpretability: Visualization of intermediate reconstructions across 10 scales shows edges/colors in early stages (V1), structural contours in middle stages (V2/V4), and semantically complete objects in late stages (IT). Region-scale correlation analysis further supports this correspondence with human hierarchical visual perception.

Highlights & Insights

  • The shift from "multi-stage diffusion" to "single-chain autoregression" is the core insight: treating EEG embeddings as the coarsest scale keeps brain signals directly linked to the generation process, eliminating cross-stage error accumulation.
  • The alignment between "Next-Scale Prediction" and neuroscience is intrinsic: the coarse-to-fine generation naturally maps to visual pathways (V1→V2/V4→IT), providing both efficiency and neural interpretability.
  • Leveraging foundation models (LaBraM + VAR with pre-trained initializations) is key to handling small-data, high-noise scenarios, offering a replicable solution for EEG data scarcity.

Limitations & Future Work

  • Reconstruction results are reported mainly for Subject-08 (following convention), and cross-subject generalization remains weak in absolute terms (average Top-1 of 0.143 in 200-way retrieval).
  • Dependence on VQ-VAE discrete tokens and external components like CLIP/CFG means generation diversity and detail limits are constrained by these modules; parameters like top-k=900 and CFG=4.0 require tuning.
  • The focus is on object concept images; applicability to complex scenes, dynamic visuals, or real-time online BCI decoding still needs verification.

Related Work & Insights

  • EEG Visual Decoding (unCLIP family): ATM, NICE, CognitionCapturer, and GeoCap typically use EEG encoder → diffusion prior → Stable Diffusion. AVDE targets the computational and multi-stage overhead of these methods.
  • Pre-trained EEG Models: LaBraM provides universal EEG representations, serving as the foundation for feature extraction, echoing the trend of using large-scale pre-training in vision and language.
  • Visual Autoregression (VAR): AVDE extends the VAR paradigm from pure image generation to conditional brain signal decoding, providing an interesting expansion of VAR's applications.
  • Insight: When a task is dominated by multi-stage pipelines with heavy error accumulation, returning to an "end-to-end chain + strong pre-training" often yields simultaneous improvements in performance and efficiency.

Rating

  • Novelty: ⭐⭐⭐⭐ First use of VAR "Next-Scale Prediction" for EEG visual decoding, treating EEG as the coarsest scale; elegant paradigm shift aligned with brain science.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multiple datasets, dual tasks (retrieval/reconstruction), dual ablations (encoder/framework), and interpretability analysis are comprehensive; slight deduction for focus on a single best subject for reconstruction.
  • Writing Quality: ⭐⭐⭐⭐ Clear logic from motivation to experiment; formulas and diagrams are well-placed.
  • Value: ⭐⭐⭐⭐ Significant reduction in parameter/inference costs with superior results; highly relevant for deployable BCI and cognitive science tools.