# egoEMOTION: Egocentric Vision and Physiological Signals for Emotion and Personality Recognition in Real-World Tasks
- Conference: NeurIPS 2025
- arXiv: 2510.22129
- Code: Available (open-source dataset + baseline implementation)
- Area: Affective Computing / Egocentric Vision / Multimodal Dataset
- Keywords: egocentric vision, emotion recognition, personality, physiological signals, Project Aria
## TL;DR
This paper introduces egoEMOTION, the first dataset to combine egocentric vision (from Meta Project Aria glasses) with physiological signals for emotion and personality recognition. The dataset comprises 43 participants, over 50 hours of recordings, and 16 tasks, and the paper demonstrates that egocentric vision signals (particularly eye-tracking features) outperform conventional physiological signals for emotion prediction in real-world scenarios.
## Background & Motivation
Background: Egocentric vision has established large-scale benchmarks (Ego4D, EPIC-KITCHENS), while emotion recognition has relied on physiological signals collected in laboratory settings (DEAP, AMIGOS, etc.).
Limitations of Prior Work: (a) Egocentric vision benchmarks ignore participants' emotional states, assuming affective neutrality; (b) existing emotion datasets are confined to laboratory settings with low ecological validity; (c) the only publicly available mobile eye-tracking emotion dataset, eSEE-d, covers only 4 emotion categories, provides no physiological signals, and requires a fixed head position.
Key Challenge: Emotion and personality are intrinsic drivers of behavior, yet egocentric vision systems cannot model these internal states.
Goal: To construct a high-ecological-validity multimodal affective dataset and demonstrate that egocentric vision signals alone are sufficient for emotion prediction.
Core Idea: Egocentric glasses signals — especially eye-tracking — are more effective than conventional physiological signals for emotion prediction in real-world settings.
## Method
### Overall Architecture
43 participants wore Project Aria glasses and physiological sensors while completing 16 tasks (9 emotion-eliciting videos + 7 naturalistic activities). Three benchmarks are defined: (1) binary classification along the continuous emotion dimensions of valence, arousal, and dominance (V/A/D); (2) 9-class discrete emotion recognition; and (3) binary classification of the Big Five personality traits.
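Benchmark (1) requires turning each 7-point Emoti-SAM rating into a binary low/high label. The exact split rule is not reproduced in this summary, so the midpoint split in the sketch below is an assumption; a per-participant median split would be an equally plausible choice.

```python
import numpy as np

def binarize_vad(ratings, midpoint=4):
    """Map 7-point Emoti-SAM ratings (1-7) to binary low/high labels.
    The midpoint split is an illustrative assumption, not the paper's rule."""
    ratings = np.asarray(ratings, dtype=float)
    return (ratings > midpoint).astype(int)  # 1 = high, 0 = low

# e.g. valence ratings from five trials
print(binarize_vad([2, 5, 7, 4, 3]))  # -> [0 1 1 0 0]
```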
### Key Designs
- 16-Task Protocol:
- Session A: 9 video clips (~48s each) corresponding to 8 emotions from Mikels' Wheel plus a neutral condition.
- Session B: playing Flappy Bird (frustration), tasting unpleasant gummies (disgust), playing Jenga (social tension), drawing while listening to music (relaxation), writing a sad letter (sadness), playing the Slenderman horror game (fear), and telling jokes (amusement).
- Multi-Level Annotation Scheme (see the label sketch after this list):
- Emoti-SAM 7-point scale for collecting V/A/D ratings.
- Weighted Mikels' Wheel: 100% distributed across 9 emotion categories (in 10% increments) to capture the relative intensity of mixed emotions.
- BFI-2 personality questionnaire.
- Sensor Configuration:
- Aria glasses: eye-tracking video (640×480@90fps), POV camera (1408×1408@10fps), dual IMU (1000Hz + 800Hz), nose-pad PPG (128Hz).
- External sensors: ECG (1024Hz), EDA (256Hz), RSP (400Hz).
- 612-Dimensional Feature Extraction:
- ECG/PPG: 77 features; EDA: 31 features; RSP: 14 features.
- Egocentric-derived features: pupil size, pixel intensity, Fisherface, gaze direction, blink detection, LBP-TOP micro-expressions.
- 15 statistical descriptors per signal.
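As a concrete illustration of the weighted Mikels' Wheel labels described above, here is a minimal sketch of how one annotation could be represented and validated; the category names follow Mikels' Wheel plus neutral, while the `validate_wheel` helper and the example values are illustrative rather than the dataset's actual schema.

```python
from typing import Dict

# Eight Mikels' Wheel emotions plus a neutral category (the Session A conditions).
MIKELS_CATEGORIES = [
    "amusement", "awe", "contentment", "excitement",
    "anger", "disgust", "fear", "sadness", "neutral",
]

def validate_wheel(weights: Dict[str, int]) -> None:
    """Check that a rating distributes exactly 100% over the 9 categories
    in 10% increments, as the annotation protocol requires."""
    assert set(weights) <= set(MIKELS_CATEGORIES), "unknown emotion category"
    assert all(w % 10 == 0 and 0 <= w <= 100 for w in weights.values())
    assert sum(weights.values()) == 100, "weights must sum to 100%"

# Example: a mixed response, mostly fear with some excitement.
annotation = {"fear": 70, "excitement": 20, "neutral": 10}
validate_wheel(annotation)
```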
### Baseline Methods
- Continuous emotion: SVM-RBF + LOSO cross-validation.
- Discrete emotion: Random Forest + SelectKBest (top-10) + LOSO.
- Personality: Random Forest + SelectKBest + LOSO.
- Deep learning: CNN and WER (Transformer), 5-fold cross-validation.
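A minimal sketch of the leave-one-subject-out (LOSO) protocol behind the discrete-emotion baseline, assuming scikit-learn; the `loso_f1` helper, the Random Forest hyperparameters, and macro-averaged F1 are illustrative choices rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline

def loso_f1(X, y, subject_ids, k=10):
    """LOSO evaluation of a SelectKBest (top-k) + Random Forest baseline;
    each fold holds out all samples from one participant."""
    logo = LeaveOneGroupOut()
    scores = []
    for train_idx, test_idx in logo.split(X, y, groups=subject_ids):
        clf = make_pipeline(
            SelectKBest(f_classif, k=k),
            RandomForestClassifier(n_estimators=200, random_state=0),
        )
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], pred, average="macro"))
    return float(np.mean(scores))

# X: (n_samples, 612) feature matrix; y: 9-class labels; subject_ids: 43 participant IDs.
```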
## Key Experimental Results
### Main Results — Modality Comparison (F1 Score)
| Benchmark | Wearable (ECG/EDA/RSP) | Egocentric Glasses | Full Fusion | Chance Baseline |
|---|---|---|---|---|
| Continuous Emotion (V/A/D avg.) | 0.70 | 0.74 | 0.75 | 0.59 |
| Discrete Emotion (9-class avg.) | 0.24 | 0.46 | 0.46 | 0.11 |
| Personality (Big Five avg.) | 0.50 | 0.57 | 0.59 | 0.53 |
### Classical Methods vs. Deep Learning
| Benchmark | Classical (All) | CNN (All) | Transformer WER (All) |
|---|---|---|---|
| Continuous Emotion | 0.75 | 0.68 | 0.60 |
| Discrete Emotion | 0.46 | 0.22 | 0.21 |
| Personality | 0.59 | — | 0.47 |
### Key Findings
- Egocentric glasses signals consistently outperform conventional physiological sensors: the advantage is most pronounced for discrete emotion recognition (0.46 vs. 0.24).
- Gaze features are most informative for continuous emotion; pixel intensity is most effective for discrete emotion; IMU features generalize best across tasks.
- Classical ML substantially outperforms deep learning — deep models severely overfit given the small dataset (43 participants × 16 tasks).
- Personality prediction is the most difficult benchmark, approaching the random baseline.
## Highlights & Insights
- Weighted Mikels' Wheel Annotation: enables quantification of the relative intensity of mixed emotions, providing richer labels than simple multi-class selection. This scheme is transferable to large-scale video affective annotation pipelines.
- Eye-tracking video > conventional physiological signals: future affective recognition systems may not require contact sensors such as ECG/EDA — an eye-tracking-equipped glasses device may suffice.
- Fisherface features: applying PCA+LDA to eye-tracking video frames as a low-cost visual descriptor yields competitive performance.
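A minimal sketch of such a Fisherface-style descriptor over flattened eye-camera frames, assuming scikit-learn; `fit_fisherface`, the PCA component count, and the whitening step are illustrative assumptions rather than the paper's exact pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline

def fit_fisherface(frames: np.ndarray, labels: np.ndarray, n_pca: int = 50):
    """Fisherface-style descriptor: PCA decorrelates the flattened eye-camera
    frames, then LDA projects them onto class-discriminative axes."""
    X = frames.reshape(len(frames), -1).astype(np.float32)  # flatten H x W pixels
    model = make_pipeline(PCA(n_components=n_pca, whiten=True),
                          LinearDiscriminantAnalysis())
    model.fit(X, labels)
    return model

# Descriptors for new frames: the LDA projection (at most n_classes - 1 dims).
# model = fit_fisherface(train_frames, train_labels)
# feats = model.transform(test_frames.reshape(len(test_frames), -1))
```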
## Limitations & Future Work
- The sample of 43 participants (predominantly university students) is relatively small.
- Deep learning performance is substantially inferior to classical methods; few-shot or meta-learning strategies for small-sample settings warrant exploration.
- Annotations are collected only after each task concludes rather than as continuous time-series labels.
- Task design minimizes body movement, limiting generalization to unrestricted daily-life scenarios.
- The representational capacity of large visual foundation models (e.g., VideoMAE, InternVideo) remains unexplored.
- The experimenter's presence (seated behind a curtain) may have influenced participants' natural behavior.
- All participants are healthy individuals; generalizability to clinical populations (e.g., anxiety, depression) is unknown.
## Related Work & Insights
- vs. DEAP/AMIGOS/ASCERTAIN: These canonical affective datasets are collected in laboratory settings using stationary EEG/ECG equipment, resulting in low ecological validity. egoEMOTION is the first to use a mobile egocentric device in semi-naturalistic settings, more closely approximating real-world use.
- vs. Ego4D/EPIC-KITCHENS: Large-scale egocentric vision datasets that lack affective annotations. egoEMOTION addresses this gap but is substantially smaller in scale.
- vs. eSEE-d: The only publicly available mobile eye-tracking emotion dataset, yet limited to 4 emotion categories, requiring a chin rest and providing no physiological signals. egoEMOTION significantly extends coverage along every dimension.
- vs. K-EmoCon/EmoPairCompete: Emotions are elicited naturalistically in social contexts, but egocentric vision and eye-tracking data are absent. egoEMOTION offers more comprehensive sensor coverage.
- The annotation methodology of egoEMOTION could be applied to extend large-scale datasets such as Ego4D with affective labels.
- The effectiveness of eye-tracking features suggests that built-in eye trackers in mixed reality (MR) devices may already be sufficient to support real-time affective inference.
## Rating
- Novelty: ⭐⭐⭐⭐ — First multimodal dataset combining egocentric vision and affective signals, filling an important gap.
- Experimental Thoroughness: ⭐⭐⭐⭐ — 612-dimensional features, cross-modal comparisons, classical vs. deep learning analysis; limited by small participant count.
- Writing Quality: ⭐⭐⭐⭐⭐ — Exceptionally clear for a dataset paper, with detailed descriptions of experimental design.
- Value: ⭐⭐⭐⭐ — The open-source dataset and baselines will advance future research, though the scale constrains direct application.