Autoregressive Visual Decoding from EEG Signals

Conference: ICLR 2026
OpenReview: https://openreview.net/forum?id=TKjfzuVLX4
Code: https://github.com/ddicee/avde
Area: Brain-Computer Interface / EEG Visual Decoding / Autoregressive Image Generation
Keywords: EEG-to-Image, Visual Decoding, LaBraM, Next-Scale Prediction (VAR), Contrastive Learning, BCI

TL;DR

AVDE reformulates EEG-to-image decoding as a lightweight two-stage autoregressive pipeline: first, it aligns EEG to the CLIP image space by fine-tuning the pre-trained EEG foundation model LaBraM with contrastive learning; then, it uses "Next-Scale Prediction" from the VAR framework to generate images progressively from EEG embeddings. With roughly one-tenth of the parameters (425.3M vs. 3818.1M), it outperforms previous SOTA models that rely on large diffusion models in both retrieval and reconstruction tasks.

Background & Motivation

Background: Decoding visual content from non-invasive brain signals is at the intersection of cognitive science and generative AI. While fMRI offers high precision, it is slow, expensive, and hardware-constrained. EEG, with its millisecond temporal resolution, portability, and low cost, has become a more deployable alternative. Recent works (ATM, NICE, etc.) have demonstrated potential in image retrieval and reconstruction.
Limitations of Prior Work: Mainstream EEG visual decoding follows the unCLIP paradigm—a multi-stage pipeline (Figure 1 shows five stages) consisting of an EEG encoder → diffusion prior → Stable Diffusion. This has three major flaws: ① Serial multi-stage processes accumulate errors at each step, damaging reconstruction fidelity; ② EEG encoders are often trained from scratch, making it difficult to extract good features from noisy signals with limited image-EEG pairs; ③ Diffusion models often exceed 3 billion parameters, making the computational/memory overhead impractical for real-time BCI.
Key Challenge: EEG is a noisy, low-information-density multi-channel time series, while images are structured high-dimensional visual content. The distribution gap is massive. Bridging it usually requires complex multi-stage pipelines, but added complexity reduces controllability and deployability, creating a trade-off between fidelity and simplicity.
Goal: Replace the multi-stage diffusion pipeline with a coherent, lightweight framework that maps EEG to images directly while reducing parameter count and inference cost to deployable levels.
Core Idea: [Transfer Pre-trained EEG Representations] Replace encoders trained from scratch with LaBraM, pre-trained on 2000+ hours of EEG data. [Autoregressive Next-Scale Prediction instead of Diffusion] Use the VAR framework to treat EEG embeddings as the "coarsest scale" of an image, using a transformer to generate images from coarse to fine, naturally aligning the generation process with the hierarchical visual perception of the human brain.

Method

Overall Architecture

AVDE is a clear two-stage process. Stage one is "Alignment": encode EEG using pre-trained LaBraM, freeze CLIP to encode images, and pull EEG into the image representation space using a joint contrastive + regression objective to obtain information-rich EEG embeddings. Stage two is "Generation": treat the EEG embedding as the starting token [s] for the coarsest scale, feed it into a decoder-only transformer to autoregressively predict VQ-VAE multi-scale residual maps via "next-scale prediction," accumulate them from coarse to fine into a complete feature map, and finally reconstruct the image via the VQ-VAE decoder.

flowchart LR
    EEG[EEG Signal<br/>C×T] --> LaBraM[LaBraM Encoder<br/>Pre-trained + Contrastive Fine-tuning]
    IMG[Image] --> CLIP[Frozen CLIP] 
    LaBraM -. Stage 1 CLIP+MSE Alignment .- CLIP
    LaBraM --> EMB[EEG Embedding e]
    EMB --> PROJ["Project to Starting Token [s]"]
    PROJ --> VAR[VAR Transformer<br/>Next-Scale Prediction]
    VAR --> R["Hierarchical Residuals R1→R2→…→RK"]
    R --> VQ[VQ-VAE Decoder]
    VQ --> OUT[Reconstructed Image]
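
The Stage 1 objective (symmetric contrastive loss plus MSE regression) can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation: the function names are hypothetical, the temperature is fixed at `tau=0.07` here although the paper learns it, and the MSE is taken on the raw (un-normalized) embeddings, which is an assumption. The weight `lam=0.8` follows the value reported under Key Designs.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along one axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def combined_loss(e, z, tau=0.07, lam=0.8):
    """Return (lam * L_CLIP + (1 - lam) * L_MSE, L_CLIP, L_MSE).

    e: (B, d) EEG embeddings, z: (B, d) frozen CLIP image embeddings.
    tau is fixed here for simplicity (assumption; it is learnable in the paper).
    """
    e_n = e / np.linalg.norm(e, axis=1, keepdims=True)
    z_n = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = e_n @ z_n.T / tau                  # logits[i, j] = s(e_i, z_j) / tau
    idx = np.arange(len(e))
    eeg_to_img = log_softmax(logits, axis=1)[idx, idx]  # normalize over images j
    img_to_eeg = log_softmax(logits, axis=0)[idx, idx]  # normalize over EEG trials k
    l_clip = float(-(eeg_to_img + img_to_eeg).mean())
    l_mse = float(np.mean((e - z) ** 2))        # assumption: MSE on raw embeddings
    return lam * l_clip + (1 - lam) * l_mse, l_clip, l_mse

# Perfectly aligned, mutually orthogonal pairs: the regression term vanishes.
e = np.eye(4)
total, l_clip, l_mse = combined_loss(e, e)
print(l_mse)  # 0.0
```

With matched pairs the contrastive term is close to zero and the regression term is exactly zero; permuting the images so that no pair matches raises both terms, which is the behaviour the combined objective is meant to penalize.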

Key Designs

1. Replacing randomly initialized encoders with pre-trained LaBraM: Standing on the shoulders of giants. EEG visual decoding has long been hindered by limited paired data and noisy signals, so encoders trained from scratch struggle to converge. AVDE instead uses LaBraM, a model pre-trained on 2000+ hours of multi-subject, multi-condition EEG data. The encoding process involves: partitioning \(X \in \mathbb{R}^{C\times T}\) into patches along the time dimension using non-overlapping windows of length \(w\); extracting a local temporal feature \(e_{c_j,k}\in\mathbb{R}^d\) for channel \(c_j\) and patch \(k\) via 1D convolution, group normalization, and GELU; injecting spatio-temporal context with trainable temporal embeddings \(te_k\) and spatial embeddings \(se_j\); and integrating dependencies using a Transformer encoder. This transfer learning allows the encoder to generalize across subjects and extract semantically meaningful features.

2. Combined Contrastive + Regression Alignment: Structural alignment plus point-to-point precision. Since LaBraM was pre-trained on clinical EEG rather than visual stimuli, it must be fine-tuned. Given paired EEG-image samples, LaBraM encodes EEG into \(e\) and frozen CLIP encodes images into \(z\). A bi-directional contrastive loss pulls matching pairs together: \(L_{CLIP} = -\frac{1}{B}\sum_i \big(\log\frac{\exp(s(e_i,z_i)/\tau)}{\sum_j \exp(s(e_i,z_j)/\tau)} + \log\frac{\exp(s(e_i,z_i)/\tau)}{\sum_k \exp(s(e_k,z_i)/\tau)}\big)\), where \(B\) is the batch size, \(s\) is cosine similarity, and \(\tau\) is a learnable temperature. The first term normalizes over candidate images for a given EEG trial; the second normalizes over EEG trials for a given image. To fix absolute position in the embedding space rather than only relative ranking, a regression term \(L_{MSE}\), the mean squared error between \(e\) and \(z\), is added: \(L_{Combined} = \lambda L_{CLIP} + (1-\lambda) L_{MSE}\) (\(\lambda=0.8\)). Contrastive loss places brain signals structurally within the image manifold, while MSE pins them near the corresponding image embeddings.

3. Autoregressive Generation with Next-Scale Prediction: VAR as a natural metaphor for brain vision. Moving away from diffusion, AVDE leverages the VAR approach: a pre-trained VQ-VAE quantizes images into \(K\) multi-scale residual maps \((R_1,\dots,R_K)\) with increasing resolution. Feature maps are reconstructed via cumulative upsampling: \(F_k = \sum_{i=1}^k \mathrm{up}(R_i,(h,w))\), where \(\mathrm{up}\) interpolates each residual to the full feature-map resolution \((h,w)\). A decoder-only transformer predicts these residuals conditioned on the EEG embedding \(e\): \(p(R_1,\dots,R_K)=\prod_{k=1}^K p(R_k\mid R_1,\dots,R_{k-1},e)\). Specifically, \(e\) is projected to a starting token [s] to initiate generation. For each later scale \(k\), a downsampled version of the previous cumulative map, \(\tilde F_{k-1}=\mathrm{down}(F_{k-1},(h_k,w_k))\) with \((h_k,w_k)\) the resolution of scale \(k\), acts as input. Training uses a block-wise causal attention mask, so tokens attend only to their own and coarser scales, with cross-entropy supervision on the VQ-VAE token indices. The EEG embedding acts as the "coarsest layer," ensuring the generation chain is directly linked to the input, mirroring the human brain's hierarchical perception from V1 (edges/colors) → V2/V4 (contours/structure) → IT (global objects).
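
The bookkeeping behind next-scale prediction (cumulative maps \(F_k\), the downsampled inputs \(\tilde F_{k-1}\), and the block-wise causal mask) can be illustrated without a transformer. The sketch below is an assumption-laden toy: it uses nearest-neighbour resizing for both `up` and `down` (VAR itself uses smoother interpolation), a hypothetical three-scale schedule, and random arrays in place of real VQ-VAE residuals. All function names are illustrative.

```python
import numpy as np

def resize_nearest(x, size):
    """Nearest-neighbour resize of an (h, w, d) map to size=(H, W)."""
    h, w = x.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return x[rows][:, cols]

def cumulative_maps(residuals, full_size):
    """Return [F_1, ..., F_K] with F_k = sum_{i<=k} up(R_i, full_size)."""
    f = np.zeros(tuple(full_size) + residuals[0].shape[2:])
    maps = []
    for r in residuals:
        f = f + resize_nearest(r, full_size)
        maps.append(f)
    return maps

def next_scale_inputs(maps, scales):
    """Inputs for scales 2..K: down(F_{k-1}, (h_k, w_k))."""
    return [resize_nearest(maps[k - 1], scales[k]) for k in range(1, len(scales))]

def blockwise_causal_mask(scales):
    """mask[q, t] is True when token q may attend to token t.

    Tokens are ordered scale by scale; a token sees its own scale and all
    coarser ones, never finer ones.
    """
    scale_id = np.concatenate([np.full(h * w, k) for k, (h, w) in enumerate(scales)])
    return scale_id[:, None] >= scale_id[None, :]

scales = [(1, 1), (2, 2), (4, 4)]          # hypothetical scale schedule
rng = np.random.default_rng(0)
residuals = [rng.standard_normal((h, w, 8)) for h, w in scales]

maps = cumulative_maps(residuals, scales[-1])
inputs = next_scale_inputs(maps, scales)
mask = blockwise_causal_mask(scales)

print([m.shape for m in inputs])            # [(2, 2, 8), (4, 4, 8)]
print(mask.sum(axis=1)[[0, 1, 5]].tolist()) # [1, 5, 21]
```

The mask row sums show the block structure: the single coarsest token sees only itself, the four tokens of the second scale see 1 + 4 positions, and the sixteen tokens of the finest scale see all 21.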

Key Experimental Results

Datasets: THINGS-EEG (10 subjects, 1654 training concepts / 200 test concepts, RSVP paradigm, 63 channels) as primary, EEG-ImageNet as supplementary. Tasks: 200-way zero-shot retrieval + image reconstruction.

Main Results (200-way Zero-shot Retrieval, Average Top-1/Top-5)

Method	Within-Subject Top-1	Within-Subject Top-5	Cross-Subject Top-1	Cross-Subject Top-5
EEGNetV4	0.186	0.441	0.089	0.224
NICE	0.242	0.512	0.113	0.273
ATM (Li et al. 2024)	0.269	0.548	0.115	0.280
AVDE (Ours)	0.300	0.582	0.143	0.329
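
For readers unfamiliar with the metric, Top-k retrieval accuracy over a candidate set can be computed as in the sketch below. It assumes that row `i` of the EEG embeddings corresponds to candidate image `i` and that candidates are ranked by cosine similarity; the paper's exact evaluation protocol (for example, how repeated trials are averaged) is not reproduced here, and the random data is purely illustrative.

```python
import numpy as np

def topk_accuracy(eeg_emb, img_emb, k):
    """Fraction of EEG embeddings whose true image is among the k most similar.

    Assumes eeg_emb[i] is paired with img_emb[i]. Ties are resolved in favour
    of the true image because only strictly higher scores count against it.
    """
    e = eeg_emb / np.linalg.norm(eeg_emb, axis=1, keepdims=True)
    z = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    sim = e @ z.T                                   # (N, N) cosine similarities
    idx = np.arange(len(sim))
    target = sim[idx, idx]                          # similarity to the true image
    rank = (sim > target[:, None]).sum(axis=1)      # 0 means ranked first
    return float((rank < k).mean())

# Toy 200-way setup: image embeddings plus heavy noise stand in for EEG.
rng = np.random.default_rng(0)
img = rng.standard_normal((200, 64))
eeg = img + 2.0 * rng.standard_normal((200, 64))
print(topk_accuracy(eeg, img, 1), topk_accuracy(eeg, img, 5))
```

Chance level in a 200-way task is 0.005 for Top-1 and 0.025 for Top-5, which puts the values in the table above in context.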

Reconstruction Quality (Subject-08):

Method	PixCorr↑	SSIM↑	AlexNet(5)↑	Inception↑	CLIP↑	SwAV↓
Li et al. 2024	0.160	0.345	0.866	0.734	0.786	0.582
CognitionCapturer	0.175	0.366	0.610	0.721	0.744	0.577
AVDE	0.188	0.396	0.889	0.765	0.795	0.557
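
Of the metrics in this table, PixCorr is the simplest to state: it is commonly defined as the Pearson correlation between the flattened pixels of a reconstruction and its ground-truth image, averaged over images. The sketch below follows that common definition; any resizing or preprocessing the paper applies beforehand is not specified here, and the random arrays are placeholders for real images.

```python
import numpy as np

def pixcorr(recon, target):
    """Mean per-image Pearson correlation between flattened pixel values.

    recon, target: arrays of shape (N, H, W, C) with matching shapes.
    """
    r = recon.reshape(len(recon), -1).astype(float)
    t = target.reshape(len(target), -1).astype(float)
    r = r - r.mean(axis=1, keepdims=True)
    t = t - t.mean(axis=1, keepdims=True)
    corr = (r * t).sum(axis=1) / (np.linalg.norm(r, axis=1) * np.linalg.norm(t, axis=1))
    return float(corr.mean())

rng = np.random.default_rng(0)
target = rng.random((4, 32, 32, 3))
recon = 0.5 * target + 0.5 * rng.random((4, 32, 32, 3))   # partially correct reconstruction
print(round(pixcorr(target, target), 6))  # 1.0
print(pixcorr(recon, target))
```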

Efficiency Comparison (Single A100, batch=1):

Method	Params(M)	FLOPs(G)	Inference(ms)	Memory(MB)
Li et al. 2024	3818.1	8738.6	310.4	4826.7
AVDE	425.3	1350.5	91.2	1809.6
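
The headline efficiency ratios follow directly from this table. The short calculation below only re-derives them from the reported numbers; the dictionary keys are illustrative names.

```python
# Values copied from the efficiency table (single A100, batch=1).
baseline = {"params_m": 3818.1, "flops_g": 8738.6, "latency_ms": 310.4, "memory_mb": 4826.7}
avde = {"params_m": 425.3, "flops_g": 1350.5, "latency_ms": 91.2, "memory_mb": 1809.6}

# How many times larger the baseline is on each axis.
ratios = {key: baseline[key] / avde[key] for key in baseline}
param_reduction = 1 - avde["params_m"] / baseline["params_m"]

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
print(f"parameter reduction: {param_reduction:.1%}")
```

This gives roughly 9.0x fewer parameters (an 88.9% reduction), 6.5x fewer FLOPs, 3.4x lower latency, and 2.7x lower memory.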

Ablation Study (Average Reconstruction Across Subjects)

Configuration	PixCorr↑	SSIM↑	CLIP↑	SwAV↓
LaBraM+VAR (Full)	0.147	0.366	0.747	0.586
ATM+VAR (Encoder swap)	0.141	0.351	0.731	0.601
EEGNet+VAR (Encoder swap)	0.132	0.323	0.712	0.627
LaBraM+Li et al. (unCLIP swap)	0.138	0.346	0.726	0.606
LaBraM+LDM-4 (Diffusion swap)	0.139	0.343	0.731	0.609
LaBraM+DiT-XL/2 (Diffusion swap)	0.143	0.354	0.735	0.594

Key Findings

Roughly 89% fewer parameters (425.3M vs. 3818.1M), 6.5x fewer FLOPs, 3.4x faster inference, and 2.7x lower memory use, while exceeding SOTA in retrieval and reconstruction, supporting the value of lightweight autoregressive routes for BCI.
Ablations show that both the encoder and generation framework are critical: swapping LaBraM or reverting to unCLIP/Diffusion causes performance drops, indicating that the "pre-trained encoder + autoregressive generation" strategy is synergistic.
Interpretability: Visualization of intermediate reconstructions across 10 scales shows edges/colors in early stages (V1), structural contours in middle stages (V2/V4), and semantically complete objects in late stages (IT). Region-scale correlation analysis further supports this correspondence with human hierarchical visual perception.

Highlights & Insights

The shift from "multi-stage diffusion" to "single-chain autoregression" is the core insight: treating EEG embeddings as the coarsest scale keeps brain signals directly linked to the generation process, eliminating cross-stage error accumulation.
The alignment between "Next-Scale Prediction" and neuroscience is intrinsic: the coarse-to-fine generation naturally maps to visual pathways (V1→V2/V4→IT), providing both efficiency and neural interpretability.
Leveraging foundation models (LaBraM + VAR with pre-trained initializations) is key to handling small-data, high-noise scenarios, offering a replicable solution for EEG data scarcity.

Limitations & Future Work

The main reconstruction comparison is reported for Subject-08 (following convention), and cross-subject retrieval remains weak in absolute terms (Top-1 of 0.143 vs. 0.300 within-subject).
Dependence on VQ-VAE discrete tokens and on CLIP as the alignment target means generation diversity and fine detail are bounded by these modules; sampling settings such as top-k=900 and a classifier-free guidance (CFG) scale of 4.0 require tuning.
The focus is on object concept images; applicability to complex scenes, dynamic visuals, or real-time online BCI decoding still needs verification.
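
To make the two sampling knobs mentioned above concrete, the sketch below shows one common way classifier-free guidance and top-k filtering are combined when sampling a single token. It is an illustration under stated assumptions, not the paper's sampler: the guidance form `uncond + cfg * (cond - uncond)` is one of several conventions, how the unconditional logits are obtained for EEG conditioning is not specified here, and the vocabulary size of 4096 is hypothetical.

```python
import numpy as np

def sample_token(cond_logits, uncond_logits, cfg=4.0, top_k=900, rng=None):
    """Sample one token index from guided, top-k filtered logits (1-D arrays)."""
    if rng is None:
        rng = np.random.default_rng()
    # Classifier-free guidance: push logits away from the unconditional prediction.
    guided = uncond_logits + cfg * (cond_logits - uncond_logits)
    k = min(top_k, guided.size)
    kth_largest = np.partition(guided, -k)[-k]
    # Keep the k highest logits (ties at the threshold are also kept).
    filtered = np.where(guided >= kth_largest, guided, -np.inf)
    probs = np.exp(filtered - filtered.max())
    probs /= probs.sum()
    return int(rng.choice(guided.size, p=probs))

rng = np.random.default_rng(0)
vocab = 4096                                  # hypothetical codebook size
cond = rng.standard_normal(vocab)
uncond = rng.standard_normal(vocab)
print(sample_token(cond, uncond, cfg=4.0, top_k=900, rng=rng))
```

A larger `cfg` sharpens the distribution around tokens favoured by the EEG condition, while a smaller `top_k` removes low-probability tokens entirely; both trade diversity for fidelity, which is why they need tuning.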

Related Work & Insights

EEG Visual Decoding (unCLIP family): ATM, NICE, CognitionCapturer, and GeoCap typically use EEG encoder → diffusion prior → Stable Diffusion. AVDE targets the computational and multi-stage overhead of these methods.
Pre-trained EEG Models: LaBraM provides universal EEG representations, serving as the foundation for feature extraction, echoing the trend of using large-scale pre-training in vision and language.
Visual Autoregression (VAR): AVDE extends the VAR paradigm from pure image generation to conditional brain signal decoding, providing an interesting expansion of VAR's applications.
Insight: When a task is dominated by multi-stage pipelines with heavy error accumulation, returning to an "end-to-end chain + strong pre-training" often yields simultaneous improvements in performance and efficiency.

Rating

Novelty: ⭐⭐⭐⭐ First use of VAR "Next-Scale Prediction" for EEG visual decoding, treating EEG as the coarsest scale; elegant paradigm shift aligned with brain science.
Experimental Thoroughness: ⭐⭐⭐⭐ Multiple datasets, dual tasks (retrieval/reconstruction), dual ablations (encoder/framework), and interpretability analysis are comprehensive; slight deduction for focus on a single best subject for reconstruction.
Writing Quality: ⭐⭐⭐⭐ Clear logic from motivation to experiment; formulas and diagrams are well-placed.
Value: ⭐⭐⭐⭐ Significant reduction in parameter/inference costs with superior results; highly relevant for deployable BCI and cognitive science tools.