NeuroGaze-Distill: Brain-informed Distillation and Depression-Inspired Geometric Priors for Robust Facial Emotion Recognition¶
Conference: ICLR 2026 · arXiv: 2509.11916 · Code: GitHub (minimal reproduction repository) · Area: Human Understanding · Keywords: facial emotion recognition, knowledge distillation, EEG prototypes, depression-inspired prior, cross-dataset robustness
TL;DR¶
This paper proposes NeuroGaze-Distill, a cross-modal distillation framework that extracts static Valence-Arousal prototypes from an EEG-trained teacher model and injects them into a purely visual student model via Proto-KD and depression-inspired geometric priors (D-Geo), improving cross-dataset robustness for facial expression recognition without requiring paired EEG-face data.
Background & Motivation¶
- Appearance-based facial expression recognition (FER) models perform well within-domain but generalize poorly across datasets — differences in demographics, capture conditions, and annotation conventions induce severe distribution shift.
- Facial appearance is an indirect and biased proxy for emotion, whereas physiological signals (e.g., EEG) encode emotional dynamics that are decoupled from appearance.
- Collecting large-scale paired EEG-face data is impractical and incompatible with deploying purely visual systems.
- Core idea: learn static neuro-informed prototypes in the continuous Valence-Arousal (V/A) space and distill them into a pure image-based student model.
- Inspired by affective neuroscience: depression-related research has observed anhedonia — attenuated emotional responses in high-valence regions.
- This observation is encoded as a lightweight geometric regularization term (D-Geo) that softly shapes the geometry of the embedding space.
Method¶
Overall Architecture¶
Four-stage pipeline:
1. Train a teacher network on EEG data (DREAMER + MAHNOB-HCI) to regress V/A values.
2. Aggregate teacher validation-set embeddings into a 5×5 V/A prototype grid (25 static prototypes), then freeze and reuse.
3. Train a ResNet-18/50 student model on FERPlus with a joint objective: CE + KD + Proto-KD + D-Geo.
4. At inference, only the visual model is required — no EEG or other non-visual signals are needed.
Key Designs¶
Design 1: Static Neuro-informed Prototypes
- Function: Construct 25 V/A prototypes from the EEG teacher's validation-set embeddings.
- Mechanism: Discretize the V/A space into a 5×5 grid (bin centers from −0.8 to 0.8) and average L2-normalized penultimate-layer teacher features within each bin. Empty bins are filled with the mean of the nearest non-empty bin.
- Design Motivation: The 5×5 grid balances coverage and statistical stability — denser grids (e.g., 7×7) result in sparse bins and collapse. Prototypes are constructed once, frozen, and reused, requiring no paired EEG-face data.
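The binning-and-averaging step can be sketched in NumPy as follows; function and variable names are illustrative and not taken from the authors' code:

```python
import numpy as np

def build_prototypes(feats, valence, arousal, grid=5, lo=-0.8, hi=0.8):
    """Average L2-normalized teacher features in each V/A bin (a sketch)."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    centers = np.linspace(lo, hi, grid)              # bin centers -0.8 .. 0.8
    edges = (centers[:-1] + centers[1:]) / 2         # midpoints as bin edges
    vi = np.digitize(valence, edges)                 # valence bin index 0..grid-1
    ai = np.digitize(arousal, edges)                 # arousal bin index 0..grid-1
    protos = np.full((grid, grid, feats.shape[1]), np.nan)
    for v in range(grid):
        for a in range(grid):
            mask = (vi == v) & (ai == a)
            if mask.any():
                protos[v, a] = feats[mask].mean(axis=0)
    # fill empty bins from the nearest non-empty bin
    filled = [(v, a) for v in range(grid) for a in range(grid)
              if not np.isnan(protos[v, a, 0])]
    for v in range(grid):
        for a in range(grid):
            if np.isnan(protos[v, a, 0]):
                nv, na = min(filled, key=lambda c: (c[0] - v) ** 2 + (c[1] - a) ** 2)
                protos[v, a] = protos[nv, na]
    return protos.reshape(grid * grid, -1)           # 25 static prototypes
```

The resulting 25×D bank is computed once on the teacher's validation set, then frozen and reused across all student runs.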
Design 2: Prototype Knowledge Distillation (Proto-KD, Cosine)
- Function: Align student features with the static prototypes.
- Mechanism: Compute cosine similarities \(s_k = \cos(f(x), p_k)\) between the student feature \(f(x)\) and all prototypes \(p_k\), construct a soft distribution \(q^{stu} = \text{softmax}(s/\tau)\) (\(\tau=0.90\)), and minimize \(D_{KL}(q^{pro} \| q^{stu})\).
- Design Motivation: Enables the visual student to implicitly inherit the affective spatial structure captured by the EEG teacher without requiring paired training data.
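A minimal PyTorch sketch of the Proto-KD term, assuming the teacher-side distribution \(q^{pro}\) over the 25 prototypes is given (its exact construction is not reproduced here):

```python
import torch
import torch.nn.functional as F

def proto_kd_loss(student_feat, prototypes, q_pro, tau=0.90):
    """Proto-KD sketch. Assumed shapes: student_feat [B, D],
    prototypes [K, D] (K = 25), q_pro [B, K] teacher-side soft assignment."""
    f = F.normalize(student_feat, dim=1)
    p = F.normalize(prototypes, dim=1)
    s = f @ p.t()                                  # cosine similarities s_k
    log_q_stu = F.log_softmax(s / tau, dim=1)      # student distribution, tau = 0.90
    # F.kl_div expects log-probabilities as input => D_KL(q_pro || q_stu)
    return F.kl_div(log_q_stu, q_pro, reduction="batchmean")
```

Note that `F.kl_div` takes the student side in log-space; passing raw probabilities there is a common bug.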
Design 3: Depression-Inspired Geometric Prior (D-Geo)
- Function: Regularize the geometry of the embedding space.
- Mechanism: (1) Apply an intra-class variance upper-bound cap to high-valence categories (happiness, surprise); (2) globally encourage inter-class margins to maintain separability. A cosine ramp delays activation (epochs 20→60), and the regularization weight is kept small.
- Design Motivation: Motivated by anhedonia findings in depression research — more compact representations of high-valence regions may improve robustness. t-SNE visualizations confirm that D-Geo yields more compact high-valence clusters while preserving separation of low-valence classes.
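The two D-Geo terms and the delayed cosine ramp can be sketched as below; the class indices for happiness/surprise, the variance cap, and the margin value are illustrative assumptions, not values from the paper:

```python
import math
import torch
import torch.nn.functional as F

def dgeo_weight(epoch, start=20, end=60, w_max=1.0):
    """Cosine ramp delaying D-Geo activation over epochs start..end."""
    if epoch <= start:
        return 0.0
    if epoch >= end:
        return w_max
    t = (epoch - start) / (end - start)
    return w_max * 0.5 * (1.0 - math.cos(math.pi * t))

def dgeo_loss(feats, labels, high_valence=(1, 5), var_cap=0.5, margin=0.3):
    """D-Geo sketch: cap intra-class variance for high-valence classes and
    keep class centroids apart (hypothetical hyperparameter values)."""
    loss = feats.new_zeros(())
    centroids = []
    for c in labels.unique():
        fc = feats[labels == c]
        mu = fc.mean(dim=0)
        centroids.append(mu)
        if int(c) in high_valence and fc.size(0) > 1:
            var = ((fc - mu) ** 2).sum(dim=1).mean()
            loss = loss + torch.relu(var - var_cap)   # penalize only excess variance
    C = F.normalize(torch.stack(centroids), dim=1)
    sim = C @ C.t()
    k = sim.size(0)
    if k > 1:
        off_diag = sim[~torch.eye(k, dtype=torch.bool)]
        # push centroid cosine similarity below (1 - margin)
        loss = loss + torch.relu(off_diag - (1.0 - margin)).mean()
    return loss
```

In training, the term would be added as `dgeo_weight(epoch) * lambda_geo * dgeo_loss(...)`, so it contributes nothing before epoch 20 and reaches full (small) weight by epoch 60.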
Loss & Training¶
Overall loss:

$$\mathcal{L} = \underbrace{\mathcal{L}_{CE}}_{\text{label smoothing 0.055 + class weights}} + \lambda_{kd} \underbrace{\mathcal{L}_{KD}}_{\text{MSE/KL, } T=5.0} + \lambda_{proto} \underbrace{D_{KL}(q^{pro} \,\|\, q^{stu})}_{\text{Proto-KD, } \tau=0.90} + \lambda_{geo} \underbrace{\mathcal{L}_{D\text{-}Geo}}_{\text{delayed activation}}$$
Training details: AdamW optimizer, cosine learning rate decay, base LR \(2 \times 10^{-4}\), weight decay 0.05, batch size 128, mixed precision (AMP), gradient clipping 1.0. Student EMA is disabled (Mean-Teacher-style EMA degrades performance on this task).
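These training details map directly onto standard PyTorch utilities. A single illustrative step, with a linear layer standing in for the ResNet student and the total epoch count assumed:

```python
import torch
import torch.nn.functional as F

dev = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 8).to(dev)        # stands in for the ResNet-18/50 student
opt = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.05)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)  # epoch count assumed
scaler = torch.cuda.amp.GradScaler(enabled=(dev == "cuda"))         # AMP mixed precision

x = torch.randn(128, 512, device=dev)          # batch size 128
y = torch.randint(0, 8, (128,), device=dev)
with torch.autocast(dev, enabled=(dev == "cuda")):
    loss = F.cross_entropy(model(x), y, label_smoothing=0.055)
scaler.scale(loss).backward()
scaler.unscale_(opt)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping 1.0
scaler.step(opt)
scaler.update()
opt.zero_grad()
sched.step()                                   # cosine LR decay, per epoch
```

Consistent with the paper's finding, no EMA copy of the student is maintained here.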
Key Experimental Results¶
Main Results¶
FERPlus validation set ablation (8-way):
| Variant | Acc (%) | Macro-F1 (%) | bACC (%) |
|---|---|---|---|
| B0: CE only | 78.22 | 51.29 | 49.77 |
| B1: +KD | 82.31 | 63.56 | 59.28 |
| B2: +KD+Proto | 81.48 | 64.21 | 60.30 |
| B3: Full (+D-Geo) | 83.06 | 64.74 | 59.90 |
| Full (T=1) | 83.66 | 65.39 | 61.32 |
Cross-dataset evaluation (A3_full, present-only):
| Dataset | Acc (%) | Macro-F1 (%) | bACC (%) |
|---|---|---|---|
| FERPlus (valid) | 83.06 | 64.74 | 59.90 |
| AffectNet-mini | 76.30 | 75.60 | 75.77 |
| CK+ | 64.93 | 49.33 | 52.46 |
Ablation Study¶
Component contribution analysis:
| Component | Key Effect |
|---|---|
| KD (B0→B1) | Large Macro-F1 gain (+12.27); accelerates early convergence |
| Proto-KD (B1→B2) | Improves class-balanced bACC (+1.02); stabilizes prototype alignment |
| D-Geo (B2→B3) | Maintains best Macro-F1; shapes more compact high-valence clusters |
Hyperparameter sensitivity:
| Parameter | Optimal Value | Notes |
|---|---|---|
| \(\lambda_{proto}\) | 0.12 | Stable in range 0.10–0.15 |
| D-Geo activation | epoch 20→60 | Starting from epoch 0 harms early separability |
| V/A grid size | 5×5 | 7×7 leads to sparse bin collapse |
Key Findings¶
- KD is the single largest contributing factor (Macro-F1 +12.27); Proto-KD and D-Geo provide complementary late-stage improvements.
- Proto-KD reduces anger/sadness confusion (on AffectNet-mini); D-Geo further improves high-valence cluster purity.
- t-SNE visualizations show progressive improvement from B0 to B3: increasing inter-class margins and more compact high-valence clusters.
- The present-only Macro-F1 on AffectNet-mini (75.60%) substantially outperforms the 8-way result on CK+ (39.23%), highlighting label-set mismatch as a critical issue in cross-dataset evaluation.
- Student EMA (Mean-Teacher style) degrades performance on this task and is disabled.
- All artifacts (prototype bank, checkpoints, metrics JSON) are verified with SHA-256 hashes, ensuring strong reproducibility.
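Such hash verification needs only a few lines of stdlib Python; the manifest layout (filename mapped to hex digest in a JSON file) is an assumption, not the repository's actual format:

```python
import hashlib
import json
import pathlib

def verify_artifacts(manifest_path):
    """Check artifact files against recorded SHA-256 digests (a sketch).
    Assumed manifest format: {"path/to/file": "hexdigest", ...}."""
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    bad = []
    for name, expected in manifest.items():
        digest = hashlib.sha256(pathlib.Path(name).read_bytes()).hexdigest()
        if digest != expected:
            bad.append(name)
    return bad  # empty list => all artifacts verified
```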
Highlights & Insights¶
- Novel cross-modal distillation pathway: EEG → V/A prototypes → visual student, elegantly circumventing the need for paired data.
- D-Geo, the depression-inspired geometric prior, offers a highly original regularization strategy — encoding neuroscientific insights as embedding-space constraints.
- The 5×5 prototype design is simple and effective: constructed once, frozen, and reused with no additional inference overhead.
- The present-only evaluation protocol fairly addresses label-set mismatch and is worth broader adoption in the FER community.
- The responsible research statement is thorough: it explicitly clarifies that D-Geo is a non-diagnostic, non-clinical geometric bias, and provides a reuse checklist.
Limitations & Future Work¶
- The FERPlus validation accuracy (83.06%) is not state-of-the-art in FER, serving primarily as a proof of concept for the framework's generalizability.
- The gains from D-Geo are modest (Macro-F1 from 64.21 to 64.74), potentially constrained by the small regularization weight.
- The teacher network relies on limited EEG data (DREAMER + MAHNOB-HCI); prototype quality is bounded by the scale and diversity of the EEG corpus.
- 8-way performance on CK+ remains low (Acc 55.86%), indicating that extreme domain-shift scenarios remain challenging.
- Only ResNet-18/50 backbones are evaluated; stronger visual architectures (e.g., ViT, ConvNeXt) are not explored.
- The "Gaze" component of "NeuroGaze" is disabled in the final experiments, creating a slight disconnect between the name and the actual method.
Related Work & Insights¶
- Knowledge Distillation (Hinton et al., 2015): foundational logit-based soft-target distillation method.
- Prototype Learning (Snell et al., 2017; Li et al., 2020): prototypes represent class/region structure; Proto-KD extends this paradigm to the V/A space.
- EEG-based Emotion Recognition: most prior work is EEG-only or requires paired multimodal training; this paper is the first to realize unidirectional distillation (EEG → V/A → vision).
- Inspiration: the cross-modal prototype distillation paradigm may generalize to other "expensive signal → cheap inference" scenarios (e.g., fMRI → behavioral prediction).
Rating¶
- Novelty: ⭐⭐⭐⭐ The EEG → V/A prototype distillation concept is novel, and D-Geo offers a distinctive, cross-disciplinary perspective.
- Experimental Thoroughness: ⭐⭐⭐ Ablations are sufficient, but comprehensive comparison with FER state-of-the-art methods is lacking, and data scale is limited.
- Writing Quality: ⭐⭐⭐⭐ Algorithmic pseudocode is clear; the responsible research statement is thorough and professional, though some descriptions are redundant.
- Value: ⭐⭐⭐ The framework concept is thought-provoking, but practical FER performance gains are limited; the effectiveness of D-Geo requires further validation.