SATTC: Structure-Aware Label-Free Test-Time Calibration for Cross-Subject EEG-to-Image Retrieval¶
Conference: CVPR 2026
arXiv: 2603.20738
Code: https://github.com/QunjieHuang/SATTC-CVPR2026
Area: Time Series
Keywords: EEG decoding, Cross-subject retrieval, Label-free calibration, hubness mitigation, Similarity matrix
TL;DR¶
SATTC is a label-free test-time calibration head. It fuses two experts in a Product-of-Experts (PoE) scheme: a geometric expert (subject-adaptive whitening + adaptive CSLS) and a structural expert (mutual nearest neighbors + bidirectional top-k ranking + category popularity). It operates directly on the similarity matrix of frozen EEG and image encoders. On THINGS-EEG with the ATM encoder, it raises Top-1 accuracy from 9.2% (standardized baseline) to 14.8% and alleviates the hubness effect in cross-subject EEG-to-image retrieval.
Background & Motivation¶
- Background: EEG-to-image retrieval maps EEG signals into a shared embedding space, utilizing nearest neighbor search to find corresponding images. Recent studies (e.g., ATM) have developed robust EEG encoders through contrastive learning, achieving competitive zero-shot retrieval performance on the THINGS-EEG benchmark.
- Limitations of Prior Work: Current pipelines have three test-time limitations: (1) No structure-aware, label-free calibration, so inference reduces to plain nearest neighbor search; (2) No subject-adaptive, density-aware hubness mitigation, because a globally fixed CSLS neighborhood size cannot accommodate local density variations across queries and categories; (3) Underuse of structural cues, such as mutual nearest neighbors and bidirectional ranking, to diagnose and correct small-k shortlists.
- Key Challenge: During cross-subject deployment, the EEG feature distributions (mean, variance, and covariance structure) shift substantially between subjects (subject shift). High-dimensional embedding spaces add the hubness effect, in which a few "popular" images dominate the top-k lists of most queries. Together these make small-k shortlists highly unreliable, a critical failure mode in practical neural decoding applications.
- Goal: Under the strict constraints of frozen encoders and no target domain labels, the objective is to calibrate retrieval rankings solely by manipulating the EEG-image similarity matrix.
- Key Insight: Cross-subject retrieval is redefined as a "similarity matrix calibration" problem. Instead of modifying encoder weights, the similarity structure itself is adjusted. This is approached from two complementary perspectives: the geometric perspective (density-aware local scaling) and the structural perspective (consistency patterns within ranking relationships).
- Core Idea: A geometric expert is used to mitigate hubness caused by non-uniform density, while a structural expert identifies high-confidence matches and penalizes popular hub categories. The two are fused via product fusion to generate calibrated retrieval scores.
Method¶
Overall Architecture¶
This paper addresses the "unreliable test-time ranking" problem in cross-subject EEG-to-image retrieval. When a new subject is introduced, it is impractical to re-label data or fine-tune networks. SATTC redefines the task as calibrating a similarity matrix. Given a frozen EEG encoder \(f_{\text{eeg}}\) and an image encoder \(f_{\text{img}}\), a similarity matrix \(S_{\text{new}}\) of size \(|Q| \times |C|\) is generated (rows represent queries, columns represent candidate image categories). SATTC acts as an operator \(F: S_{\text{new}} \mapsto S_{\text{final}}\) on this matrix without modifying any network weights.
The pipeline has three stages. First, Subject-Adaptive Whitening (SAW) aligns the EEG features of different subjects into a shared statistical coordinate system. Next, a "geometric expert" performs adaptive CSLS scaling based on local density, while a "structural expert" extracts reliable matches and identifies spurious hubs from ranking relationships. Finally, the two experts are fused additively in logit space, \(S_{\text{final}} = \alpha S_{\text{geom}} + \beta S_{\text{struct}}\), which corresponds to a weighted product of the experts' score distributions after exponentiation (hence Product-of-Experts). The whole process is label-free, training-free, and agnostic to the underlying encoders.
%%{init: {'flowchart': {'rankSpacing': 24, 'nodeSpacing': 28, 'padding': 6, 'wrappingWidth': 400}}}%%
flowchart TD
A["Frozen EEG / Image Encoders<br/>Generate Similarity Matrix S_new"] --> B["Subject-Adaptive Whitening (SAW)<br/>Estimate μ_s, Σ_s per subject → Whiten to shared sphere"]
B --> C["Geometric Expert: Adaptive CSLS<br/>Select k based on row/column local density to suppress hubness"]
B --> D["Structural Expert<br/>MNN anchors + Bidirectional top-L + Hub penalty"]
C --> E["Product-of-Experts Fusion<br/>S_final = α·S_geom + β·S_struct"]
D --> E
E --> F["Calibrated Retrieval Scores<br/>Nearest Neighbor Retrieval"]
Key Designs¶
1. Subject-Adaptive Whitening (SAW): Aligning different subjects to the same sphere without labels
The primary obstacle in cross-subject retrieval is subject shift, where the feature distributions (mean, variance, covariance) of different individuals vary significantly. SAW estimates the mean \(\mu_s\) and covariance \(\Sigma_s\) for each subject \(s\) independently and constructs a regularized whitening transform \(W_s = (\Sigma_s + \lambda I)^{-1/2}\). Subject-specific EEG embeddings are centered, whitened by \(W_s\), and L2-normalized. An optional global whitening is applied to the image side. Post-whitening, features for each subject approximate zero mean, unit covariance, and unit norm, effectively mapping them to a shared sphere. This geometrically eliminates distribution shifts without requiring target domain labels. SAW provides the largest single performance gain, increasing Top-5 accuracy from 30.5% to 36.4%.
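The whitening step described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names, the regularizer value `lam`, and the use of an eigendecomposition for the inverse square root are assumptions.

```python
import numpy as np

def whitening_transform(X, lam=1e-3):
    """Estimate mu_s and W_s = (Sigma_s + lam*I)^(-1/2) for ONE subject.

    X: (n_trials, d) EEG embeddings. `lam` is an assumed value, not from the paper.
    """
    mu = X.mean(axis=0, keepdims=True)
    Xc = X - mu
    sigma = Xc.T @ Xc / max(len(X) - 1, 1)            # sample covariance Sigma_s
    # Symmetric inverse square root via eigendecomposition of the regularized covariance.
    evals, evecs = np.linalg.eigh(sigma + lam * np.eye(X.shape[1]))
    W = (evecs / np.sqrt(evals)) @ evecs.T
    return mu, W

def apply_saw(X, mu, W, eps=1e-12):
    """Center, whiten, and L2-normalize the embeddings of one subject."""
    Z = (X - mu) @ W
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    return Z / np.maximum(norms, eps)
```

Because the statistics are estimated from unlabeled embeddings of the target subject alone, the step needs no labels; each subject gets its own `mu` and `W`.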
2. Adaptive CSLS Geometric Expert: Tailoring neighborhoods to local density to suppress hubness
High-dimensional embedding spaces suffer from the hubness effect, where a few "hub" images dominate the top-k results of many queries. Classic CSLS uses a fixed global neighborhood size \(k\) for its density penalty, but EEG embeddings are highly non-uniform, so a fixed \(k\) over-penalizes sparse regions and under-penalizes dense hubs. The adaptive version maps row density \(\rho_{\text{row}}(q)\) to a query-specific \(k_{\text{row}}(q) \in [k_{\min}, k_{\max}]\) and column density \(\rho_{\text{col}}(c)\) to \(k_{\text{col}}(c)\). The CSLS score keeps its standard form:
\[
\text{CSLS}(q, c) = 2\,S(q, c) - r_q - r_c
\]
Here \(r_q\) is the mean similarity between query \(q\) and its nearest candidates, and \(r_c\) is the mean similarity between candidate \(c\) and its nearest queries. The difference from classic CSLS is that both neighborhood means are computed with the adaptive sizes \(k_{\text{row}}(q)\) and \(k_{\text{col}}(c)\). This removes the need to tune a single global \(k\) for the entire dataset.
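A minimal NumPy sketch of CSLS with per-query and per-candidate neighborhood sizes follows. The summary does not specify how density is measured or how it is mapped to \(k\), so `density_to_k` (rank-normalized density, linear interpolation, denser points get larger \(k\)) and the density proxy (mean of the top-`k_min` similarities) are assumptions made purely for illustration, as are the default values of `k_min` and `k_max`.

```python
import numpy as np

def topk_mean(S, k, axis):
    """Mean of the k largest entries along `axis`; k is an int array, one value per slice."""
    srt = -np.sort(-S, axis=axis)                      # descending along `axis`
    csum = np.cumsum(srt, axis=axis)
    if axis == 1:
        return csum[np.arange(S.shape[0]), k - 1] / k
    return csum[k - 1, np.arange(S.shape[1])] / k

def density_to_k(rho, k_min, k_max):
    """ASSUMED mapping (not from the paper): rank-normalize density, interpolate linearly."""
    ranks = np.argsort(np.argsort(rho)) / max(len(rho) - 1, 1)
    return np.rint(k_min + ranks * (k_max - k_min)).astype(int)

def adaptive_csls(S, k_min=5, k_max=20):
    """CSLS(q, c) = 2*S(q, c) - r_q - r_c with adaptive neighborhood sizes."""
    Q, C = S.shape
    k_max = min(k_max, Q, C)
    k_min = min(k_min, k_max)
    # Density proxy (assumption): mean similarity to the k_min nearest neighbors.
    rho_row = topk_mean(S, np.full(Q, k_min), axis=1)
    rho_col = topk_mean(S, np.full(C, k_min), axis=0)
    k_row = density_to_k(rho_row, k_min, k_max)
    k_col = density_to_k(rho_col, k_min, k_max)
    r_q = topk_mean(S, k_row, axis=1)                  # shape (Q,)
    r_c = topk_mean(S, k_col, axis=0)                  # shape (C,)
    return 2 * S - r_q[:, None] - r_c[None, :]
```

Setting `k_min == k_max` recovers fixed-\(k\) CSLS, which makes the sketch easy to compare against the fixed-neighborhood baseline.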
3. Structural Expert: Reinforcing reliable matches and suppressing false hubs from ranking consistency
While the geometric expert focuses on density, the structural expert analyzes ranking relationships. It computes row/column rankings from \(S_{\text{new}}\) to identify three types of signals: (1) Anchors, defined as Mutual Nearest Neighbors (MNN@1) where \(r_{\text{row}}(q,c)=r_{\text{col}}(c,q)=1\), receive a positive bias \(+\lambda_{\text{anchor}}\); (2) Bidirectional top-L pairs serve as relaxed consistency matches; (3) Hub candidates, which have low row rank but high column rank (appearing frequently in top-k lists), receive a negative penalty \(-\lambda_{\text{pen}} h(c)\) based on a normalized hubness score \(h(c)\). Mutual nearest neighbors are highly reliable in cross-domain retrieval as both samples identify each other as the best match. Conversely, categories appearing excessively in various top-k lists are likely hubs and are actively suppressed. The structural matrix is computed once and remains fixed to prevent feedback loops where hubs might reinforce themselves.
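The three structural signals can be illustrated with a small rank-based sketch. It is a simplification under stated assumptions: the weights `lam_anchor`, `lam_topl`, `lam_pen`, the choice of `L` and `k`, and the definition of \(h(c)\) as top-k occurrence counts divided by their maximum are all hypothetical, and the hub penalty is applied to every entry of a column rather than only to selected hub candidates.

```python
import numpy as np

def structural_expert(S, L=5, k=5, lam_anchor=1.0, lam_topl=0.5, lam_pen=0.5):
    """Build a structural bias matrix from ranking relationships in S (queries x candidates)."""
    # Rank 1 = best. r_row[q, c]: rank of c in q's list; r_col[q, c]: rank of q in c's list.
    r_row = np.argsort(np.argsort(-S, axis=1), axis=1) + 1
    r_col = np.argsort(np.argsort(-S, axis=0), axis=0) + 1

    anchors = (r_row == 1) & (r_col == 1)              # mutual nearest neighbors (MNN@1)
    bidir = (r_row <= L) & (r_col <= L) & ~anchors     # relaxed bidirectional top-L pairs

    # Hubness score h(c) (assumed form): how often c appears in a query's top-k, scaled to [0, 1].
    counts = (r_row <= k).sum(axis=0)
    h = counts / max(counts.max(), 1)

    return lam_anchor * anchors + lam_topl * bidir - lam_pen * h[None, :]
```

The matrix is computed once from the input similarities and not recomputed from its own output, in line with the fixed structural matrix described above.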
Loss & Training¶
SATTC does not involve training; all operations are performed at test time. The underlying EEG encoder is trained using the AdamW optimizer with a batch size of 1024, a learning rate of \(5 \times 10^{-4}\), and a temperature \(\tau=1.0\). Fusion is controlled by a scalar \(\beta\) (default 1.9), with \(\alpha\) fixed at 1.
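The additive fusion can be checked numerically against its Product-of-Experts reading. The sketch below uses the stated defaults (\(\alpha = 1\), \(\beta = 1.9\)); the function names and the row-wise softmax used to turn scores into distributions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=1):
    """Numerically stable softmax along `axis`."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(S_geom, S_struct, alpha=1.0, beta=1.9):
    """Additive fusion in logit space: S_final = alpha * S_geom + beta * S_struct."""
    return alpha * S_geom + beta * S_struct

def product_of_experts(S_geom, S_struct, alpha=1.0, beta=1.9):
    """Weighted product of the two experts' row-wise distributions, renormalized."""
    p = softmax(S_geom) ** alpha * softmax(S_struct) ** beta
    return p / p.sum(axis=1, keepdims=True)
```

For any score matrices, `softmax(fuse(A, B))` and `product_of_experts(A, B)` agree up to floating-point error, which is why summing logits counts as a product of experts. The ranking within each row depends only on the fused logits, so the softmax is needed only for this interpretation, not for retrieval.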
Key Experimental Results¶
Main Results¶
200-way cross-subject retrieval on the THINGS-EEG dataset (LOSO protocol, averaged over all folds and 3 seeds):
| Method | Top-5 (%)↑ | Top-1 (%)↑ |
|---|---|---|
| ATM (Original) | 20.0 | 5.5 |
| Standardized Baseline (cosine+L2+CW) | 30.5 | 9.2 |
| + SAW | 36.4 | 13.7 |
| + SAW + CW | 36.8 | 13.5 |
| + SAW + CW + CSLS (fixed k=12) | 38.1 | 14.1 |
| + SAW + CW + Ada-CSLS | 38.8 | 13.9 |
| SATTC (Full) | 38.4 | 14.8 |
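For reference, Top-k accuracy on a similarity matrix can be computed as in the following sketch (the function name and the integer-label convention are assumptions). In a 200-way setting, chance level is 0.5% for Top-1 and 2.5% for Top-5.

```python
import numpy as np

def topk_accuracy(S, labels, k):
    """Fraction of queries whose true candidate is among the k highest-scoring columns.

    S: (n_queries, n_candidates) scores; labels: (n_queries,) true column indices.
    """
    labels = np.asarray(labels)
    # Indices of the k largest scores per row (order within the top-k is irrelevant).
    topk = np.argpartition(-S, k - 1, axis=1)[:, :k]
    return float((topk == labels[:, None]).any(axis=1).mean())
```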
Plug-and-play generalization across encoders (SATTC as a general calibration layer):
| Encoder | Top-5 Baseline → +SATTC | Top-1 Baseline → +SATTC |
|---|---|---|
| ATM | 30.5 → 38.4 (+7.9) | 9.2 → 14.8 (+5.6) |
| EEGNetV4 | 20.5 → 34.8 (+14.3) | 5.4 → 10.8 (+5.4) |
| EEGConformer | 11.6 → 23.2 (+11.6) | 2.5 → 6.9 (+4.4) |
| ShallowFBCSPNet | 14.6 → 30.8 (+16.2) | 3.5 → 11.1 (+7.6) |
Ablation Study¶
| Configuration | Top-5 (%) | Top-1 (%) | Description |
|---|---|---|---|
| Standardized Baseline | 30.5 | 9.2 | cosine+L2+CW |
| + SAW | 36.4 | 13.7 | Largest single gain (+5.9/+4.5) |
| + SAW + CW | 36.8 | 13.5 | Limited extra gain from CW |
| + Ada-CSLS | 38.8 | 13.9 | Geometric calibration |
| + Structural PoE (SATTC) | 38.4 | 14.8 | Significant Top-1 improvement |
Key Findings¶
- SAW is the primary performance contributor, with a 5.9 percentage point absolute increase in Top-5 (30.5 → 36.4), indicating that cross-subject statistical shift is the main barrier.
- The structural expert mainly improves Top-1 (13.9 → 14.8) at the cost of a slight Top-5 drop (38.8 → 38.4), indicating that it sharpens the very top of the ranking rather than the broader shortlist.
- Adaptive CSLS is comparable to fixed CSLS in accuracy but produces a more uniform hubness distribution (flatter category popularity curves).
- SATTC is effective across four encoders (ATM, EEGNetV4, EEGConformer, ShallowFBCSPNet) spanning CSP-, CNN-, and Transformer-style architectures, supporting its encoder-agnostic design.
- \(\beta\) is stable over a wide range; the default 1.9 is within 0.1% of the optimal setting.
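One common way to quantify the hubness behavior mentioned in these findings is the k-occurrence count of each candidate and its skewness. The summary does not state which hubness statistic the paper reports, so the sketch below is a generic illustration with hypothetical function names.

```python
import numpy as np

def k_occurrence(S, k=5):
    """N_k(c): number of queries that have candidate c in their top-k list."""
    topk = np.argpartition(-S, k - 1, axis=1)[:, :k]
    return np.bincount(topk.ravel(), minlength=S.shape[1])

def skewness(x):
    """Population skewness; larger positive values indicate a few dominant hubs."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    s = d.std()
    return float((d ** 3).mean() / s ** 3) if s > 0 else 0.0
```

A flatter popularity curve corresponds to k-occurrence counts that are closer to uniform, and therefore to skewness nearer zero.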
Highlights & Insights¶
- Problem Reformulation: Recasting cross-subject retrieval as a "test-time similarity matrix calibration" problem, rather than a training problem, decouples the method from the encoder. It can be applied on top of any existing frozen encoder without retraining.
- Complementary Expert Design: The geometric expert addresses hubness via density, while the structural expert addresses it via ranking consistency. Their fusion in logit space is simple yet effective.
- Rigorous Experimental Design: The use of nested LOSO prevents data leakage. The hyperparameter selection strategy (using specific subject tiers) avoids overfitting and demonstrates true generalization across encoders.
Limitations & Future Work¶
- Validation is currently limited to the THINGS-EEG dataset; generalization to other EEG-image datasets remains to be confirmed.
- The structural expert relies on manually designed heuristics (ranking, MNN, popularity); learnable enhancements could be explored.
- The current implementation requires pre-computing the full similarity matrix, hindering online streaming inference (though SAW and CSLS components could be adapted).
- Lack of integration with training-time domain adaptation (e.g., adversarial training), which might provide complementary gains.
- Absolute Top-1 accuracy (14.8%) remains low, highlighting the inherent difficulty of EEG-to-image retrieval.
Related Work & Insights¶
- vs ATM: ATM uses non-standardized dot-product similarity. Simply switching to cosine, L2, and whitening improves Top-5 from 20% to 30.5%, showing that inference pipeline standardization is often overlooked.
- vs Standard CSLS (Lample et al., 2018): Used for cross-lingual word embedding alignment with fixed neighborhoods; SATTC’s adaptive version removes the need to tune \(k\).
- vs Training-time DA (e.g., MS-MDA): While DA aligns distributions during training, SATTC calibrates at test time. The two are complementary.
Rating¶
- Novelty: ⭐⭐⭐⭐ Combines retrieval calibration and cross-subject EEG perspectives, though individual components (whitening, CSLS, MNN) are established.
- Experimental Thoroughness: ⭐⭐⭐⭐ Extensive multi-encoder validation and ablation, though tested on only one dataset.
- Writing Quality: ⭐⭐⭐⭐ Clear derivations and logical contrastive experiments demonstrate the contribution of each component.
- Value: ⭐⭐⭐⭐ An encoder-agnostic plug-and-play calibration layer is highly practical, despite the niche application area.