LaVCa: LLM-assisted Visual Cortex Captioning¶
Conference: ICLR 2026 arXiv: 2502.13606 Code: https://github.com/suyamat/LaVCa Area: 3D Vision / Neuroscience Keywords: visual cortex, voxel selectivity, LLM, fMRI encoding model, brain activity prediction
TL;DR¶
This paper proposes LaVCa, a method that leverages LLMs to generate natural language captions for individual voxels in the human visual cortex. Through a four-step pipeline—encoding model construction, optimal image selection, MLLM-based captioning, and LLM-driven keyword extraction with sentence composition—LaVCa reveals voxel-level visual selectivity more accurately and with greater semantic diversity than the prior method BrainSCUBA.
Background & Motivation¶
Background: fMRI encoding models are the standard tool for studying visual representations in the brain. Early approaches relied on hand-crafted features or one-hot semantic labels, offering interpretability at the cost of granularity; modern methods leverage DNN features (e.g., CLIP) to substantially improve prediction accuracy, but DNNs are black boxes that offer little insight into why individual voxels are activated.
Limitations of Prior Work: Existing data-driven captioning methods such as BrainSCUBA directly apply image captioning models (ClipCap) to generate voxel captions, resulting in limited vocabulary and semantic diversity. SASC concatenates short n-gram phrases but suffers from insufficient expressiveness. Both approaches lack the semantic richness needed to precisely characterize voxel selectivity.
Key Challenge: The fundamental tension lies in maintaining interpretability (concise captions) without losing the rich information present in the optimal image set.
Goal: To generate precise, concise, and semantically rich natural language descriptions for each voxel—descriptions that can accurately predict brain activity while revealing both inter-voxel and intra-voxel diversity.
Key Insight: The pipeline is decoupled into four interpretable steps, separating image selection from captioning and exploiting the open-vocabulary capacity of LLMs for keyword extraction and sentence composition.
Core Idea: An LLM first extracts common keywords from the optimal image set of a voxel and then composes them into a caption, achieving high accuracy and high semantic diversity in voxel-level visual cortex description.
Method¶
Overall Architecture¶
LaVCa four-step pipeline:

- Input: fMRI brain activity recorded while subjects viewed images from the NSD dataset
- Step 1: Construct an encoding model for each voxel from CLIP-Vision embeddings via ridge regression: \(\mathbf{y}_i = \mathbf{W}\mathbf{x}_i + \boldsymbol{\varepsilon}_i\)
- Step 2: Compute the encoding model's predicted responses over 1.7 million external OpenImages and select the top-N optimal images
- Step 3: Generate a description for each optimal image using an MLLM (MiniCPM-V)
- Step 4: Extract keywords from the descriptions using an LLM (GPT-4o) → filter via CLIP-Text cosine similarity → compose the final caption using the MeaCap Sentence Composer
- Output: One natural language caption per voxel
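To make Steps 1–2 concrete, here is a minimal sketch assuming per-voxel ridge regression on L2-normalized CLIP-Vision embeddings and a precomputed CLIP feature matrix for the external image pool; the array names, the single shared penalty `alpha`, and the scikit-learn implementation are illustrative choices, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import Ridge

# --- Step 1: per-voxel linear encoding model y = W x + eps ----------------
# train_feats:     (n_images, d)  L2-normalized CLIP-Vision embeddings
# voxel_responses: (n_images, v)  fMRI responses, one column per voxel
def fit_encoding_model(train_feats, voxel_responses, alpha=1.0):
    """Ridge regression from CLIP features to voxel activity; returns W of shape (v, d)."""
    model = Ridge(alpha=alpha, fit_intercept=True)
    model.fit(train_feats, voxel_responses)
    return model.coef_  # one encoding weight vector per voxel

# --- Step 2: optimal-image retrieval from an external pool ----------------
# pool_feats: (n_pool, d) CLIP embeddings of external images (e.g., OpenImages)
def top_n_images(W, pool_feats, voxel_idx, n=5):
    """Predicted response is the inner product <w_voxel, x>; return indices of the top-N images."""
    scores = pool_feats @ W[voxel_idx]   # (n_pool,) predicted responses
    return np.argsort(scores)[::-1][:n]  # N most activating pool images
```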
Key Designs¶
- Encoding Model Construction (Step 1):
    - Function: Build a linear predictive model mapping images to brain activity for each voxel
    - Mechanism: Extract L2-normalized CLIP-Vision projection embeddings and fit encoding weights \(\mathbf{W} \in \mathbb{R}^{v \times d}\) via ridge regression
    - Design Motivation: Linear models are interpretable; CLIP features live in a joint vision-language space, facilitating subsequent text-based evaluation
- Optimal Image Set Retrieval (Step 2):
    - Function: Identify the set of images that most strongly activates a given voxel
    - Mechanism: Compute the inner product between the voxel's encoding weights and the CLIP embeddings of 1.7 million external images; select the top-N
    - Design Motivation: Using large-scale external data (outside the training set) mitigates overfitting; \(N\) is tunable; OpenImages-v6 provides broad coverage
- LLM Keyword Extraction and Sentence Composition (Step 4):
    - Function: Distill common keywords from the captions of multiple optimal images and synthesize a voxel caption
    - Mechanism: GPT-4o extracts keywords via in-context learning → keywords are filtered by the CLIP-Text cosine similarity between each keyword and the encoding weight, using a softmax threshold (see the sketch after this list) → the MeaCap Sentence Composer combines the surviving keywords into a sentence, substituting encoding weights for the original image features
    - Design Motivation: More concise and interpretable than directly concatenating captions; covers a broader vocabulary than the end-to-end BrainSCUBA approach; keyword filtering ensures relevance
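A hedged sketch of the keyword-filtering part of Step 4, referenced above: each candidate keyword is embedded with CLIP-Text, scored by cosine similarity against the voxel's encoding weight, and kept only if its softmax probability clears a cutoff. The open_clip model choice, the 0.2 cutoff, and the function name are assumptions; the GPT-4o extraction and MeaCap composition steps are omitted.

```python
import numpy as np
import torch
import open_clip

# Illustrative CLIP variant; in practice the encoding model must use the same CLIP
# so that text embeddings and encoding weights share one embedding space.
model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def filter_keywords(keywords, w_voxel, cutoff=0.2):
    """Keep keywords whose CLIP-Text similarity to the voxel's encoding weight
    survives a softmax cutoff (0.2 is an assumed value, not the paper's)."""
    with torch.no_grad():
        text_emb = model.encode_text(tokenizer(keywords)).float().numpy()  # (k, d)
    text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
    w = w_voxel / np.linalg.norm(w_voxel)
    sims = text_emb @ w                        # cosine similarity per keyword
    probs = np.exp(sims) / np.exp(sims).sum()  # softmax over the keyword set
    return [kw for kw, p in zip(keywords, probs) if p >= cutoff]
```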
Evaluation Protocol¶
- Sentence-level prediction: Voxel activity is predicted as the cosine similarity between the caption (encoded via Sentence-BERT) and captions of NSD images; accuracy is measured by Spearman correlation
- Image-level prediction: Images are generated from the voxel captions with FLUX.1-schnell, and their CLIP-Vision embeddings are compared with those of the NSD images, removing language as a confound
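A minimal sketch of the sentence-level evaluation, assuming one reference caption per NSD stimulus and the `sentence_transformers` package; the Sentence-BERT variant and data layout are illustrative, not the paper's exact configuration.

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative Sentence-BERT variant

def sentence_level_score(voxel_caption, nsd_captions, voxel_activity):
    """Predict voxel activity as the cosine similarity between the voxel caption and
    each stimulus caption, then score with Spearman correlation against measured activity."""
    v = sbert.encode([voxel_caption], normalize_embeddings=True)  # (1, d)
    s = sbert.encode(nsd_captions, normalize_embeddings=True)     # (n_images, d)
    predicted = (s @ v.T).ravel()                                 # cosine similarities
    rho, _ = spearmanr(predicted, voxel_activity)
    return rho
```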
Key Experimental Results¶
Main Results¶
Sentence-level brain activity prediction accuracy (top-5,000 voxels, mean ± std per subject):
| Method | #Keywords | Sentence Composer | subj01 | subj02 | subj05 | subj07 |
|---|---|---|---|---|---|---|
| Shuffled | - | - | 0.007±0.199 | 0.058±0.223 | 0.068±0.243 | 0.009±0.175 |
| BrainSCUBA | - | - | 0.207±0.062 | 0.251±0.071 | 0.264±0.084 | 0.182±0.065 |
| LaVCa | 1 | ✗ | 0.205±0.068 | 0.250±0.075 | 0.272±0.086 | 0.186±0.072 |
| LaVCa | 5 | ✓ | 0.246±0.066 | 0.287±0.075 | 0.306±0.084 | 0.218±0.073 |
Image-level prediction accuracy similarly shows that LaVCa (5 keywords + SC) outperforms BrainSCUBA across all subjects (e.g., subj01: 0.213 vs. 0.188).
Diversity Analysis¶
| Configuration | Inter-voxel Vocabulary Size | Semantic Variance | Dims for 90% PCA Variance |
|---|---|---|---|
| BrainSCUBA | 3,193 | 0.0588 | 127 |
| Top-1 MLLM caption | 13,959 | 0.0638 | 210 |
| LaVCa | 16,922 | 0.0642 | 219 |
Intra-ROI shuffle test (captions shuffled among voxels within an ROI, validating inter-voxel diversity):
| ROI | Original | Shuffled | Ratio |
|---|---|---|---|
| OFA (face region) | 0.095 | 0.028 | 3.3× |
| PPA (scene region) | 0.213 | 0.151 | 1.4× |
| EBA (body region) | 0.157 | 0.018 | 8.7× |
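For concreteness, a hedged sketch of how such an intra-ROI shuffle control could be computed, reusing a per-voxel scoring function like `sentence_level_score` above; permuting captions among voxels of the same ROI and the simple averaging are assumptions, not necessarily the paper's exact protocol.

```python
import numpy as np

def intra_roi_shuffle_ratio(captions, activity, roi_voxels, nsd_captions, score_fn, seed=0):
    """Compare mean caption-based prediction accuracy with voxel-caption pairs intact
    versus captions permuted among voxels of the same ROI."""
    rng = np.random.default_rng(seed)
    original = np.mean([score_fn(captions[i], nsd_captions, activity[:, i])
                        for i in roi_voxels])
    permuted = rng.permutation(roi_voxels)
    shuffled = np.mean([score_fn(captions[j], nsd_captions, activity[:, i])
                        for i, j in zip(roi_voxels, permuted)])
    return original, shuffled, original / shuffled
```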
Key Findings¶
- The combination of 5 keywords and Sentence Composer significantly outperforms BrainSCUBA and single-keyword variants across all subjects
- LaVCa's vocabulary is 5.3× larger than BrainSCUBA's (16,922 vs. 3,193), with higher semantic diversity
- Even within ROIs traditionally considered selective for a single category (e.g., faces in OFA, scenes in PPA), LaVCa reveals rich multi-concept encoding—individual voxels can simultaneously encode multiple distinct concepts
- Cross-subject analysis confirms that intra-ROI diversity is reproducible
Highlights & Insights¶
- Elegant decoupled design: Decomposing voxel captioning into four steps—image selection, captioning, keyword extraction, and sentence composition—makes each component independently replaceable (any VLM/LLM), yielding greater flexibility and interpretability than end-to-end approaches. This modular paradigm is transferable to other tasks requiring interpretable feature descriptions from data
- Encoding weights as image feature surrogates: Using encoding weights in place of image features to guide MeaCap Sentence Composer cleverly connects neuroscience signals directly into an NLP pipeline
- Revealing diversity within canonical ROIs: Prior work assumed OFA encodes only "faces," yet LaVCa finds voxels encoding "tongue," "smile," "animals," and other concepts—a conceptual advance enabled by methodological improvement
Limitations & Future Work¶
- Dependence on CLIP feature space: Both the encoding model and keyword filtering rely on CLIP; concepts poorly represented in CLIP (e.g., subtle textures, abstract notions) may limit LaVCa's expressiveness
- Linear encoding assumption: Ridge regression assumes a linear relationship between voxel responses and CLIP features, which may be an oversimplification for higher-level visual areas
- LLM hallucination risk: GPT-4o may introduce hallucinated keywords during extraction; CLIP-based filtering mitigates but does not eliminate this risk
- Restricted to visual cortex: The method has not yet been extended to other brain regions (e.g., auditory or language areas); similar approaches could be explored for studying language encoding
Related Work & Insights¶
- vs. BrainSCUBA: BrainSCUBA applies ClipCap end-to-end to generate voxel captions, with vocabulary constrained by ClipCap's training data. LaVCa's decoupled design exploits the open vocabulary of LLMs, yielding a 5× vocabulary increase and superior accuracy
- vs. SASC: SASC concatenates short n-gram phrases, retaining minimal information. LaVCa preserves richer semantic content through multi-image keyword extraction
- vs. brain decoding work: Brain decoding asks "what did the subject see?", while encoding models ask "what does this voxel represent?"—these are complementary research directions
Rating¶
- Novelty: ⭐⭐⭐⭐ The method is an ingenious combination of existing components, but the decoupled design and integration of LLMs into neuroscience represent a genuinely new direction
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Sentence-level and image-level evaluation, diversity analysis, intra-ROI shuffle tests, and cross-subject validation are all conducted—highly comprehensive
- Writing Quality: ⭐⭐⭐⭐⭐ Logically clear with well-designed figures and precisely described methodology
- Value: ⭐⭐⭐⭐ Offers an important contribution to understanding visual cortex representations, though the scope is relatively narrow (neuroscience-focused)