Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy

Conference: CVPR 2026
arXiv: 2603.02123
Code: https://github.com/waHAHJIAHAO/Nano-EmoX
Area: Multimodal VLM
Keywords: Affective Computing, Multimodal Language Model, Cognitive Hierarchy, Emotion Recognition, Empathetic Interaction

TL;DR

Nano-EmoX organizes affective tasks into a cognition-inspired three-level hierarchy (Perception → Understanding → Interaction). It is presented as the first multimodal language model to unify six core affective tasks within a compact 2.2B-parameter framework, trained with a P2E progressive paradigm that builds capabilities from basic perception up to high-level empathy.

Background & Motivation

Background: The development of affective multimodal language models (MLMs) is constrained by the gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization.
Limitations of Prior Work: (i) Existing models are predominantly single-level specialists, each dedicated to only one of perception (emotion recognition), understanding (cause reasoning), or interaction (empathetic response), with no unification across levels; (ii) large model scales (7–9B parameters) make practical deployment difficult.
Key Challenge: Emotional intelligence constitutes a continuum from perception to empathy, yet existing methods decompose it into isolated tasks, precluding cross-level knowledge transfer.
Goal: Design a compact unified model (<3B parameters) capable of handling six core affective tasks across three cognitive levels: perception, understanding, and interaction.
Key Insight: Inspired by perception-action models, affective tasks are organized by cognitive depth and trained progressively from low to high levels.
Core Idea: Full-modality encoders (enhanced facial encoder + fusion encoder) combined with a P2E progressive training framework (Perception → Fusion → Multi-task Instruction Tuning).

Method

Overall Architecture

Four modality branches (visual, speech, facial, fusion) + heterogeneous adapters + a lightweight language model (Qwen2.5 2.2B). Six tasks: Multimodal Sentiment Analysis (MSA), Multimodal Emotion Recognition (MER), Open-Vocabulary MER (OV-MER), Emotion Cause Reasoning (ERI), Multimodal Intent Recognition (MIR), and Empathetic Response Generation (ERG).

Key Designs

Enhanced Facial Encoder:
Function: Extracts fine-grained, identity-agnostic facial affective representations.
Mechanism: A FaceXFormer encoder extracts multi-scale facial features \(E_f\) from video frames. A Temporal Modeling module reconstructs inter-frame temporal relationships via cross-attention \(E_f^c = \text{CrossAttention}(Q, E_f^K, E_f^V)\), where \(Q\) denotes learnable temporal query tokens. A two-layer fully connected network then projects the output to the language model dimension.
Design Motivation: Facial expressions are critical visual cues for affective perception, yet general-purpose visual encoders (e.g., SigLIP) lack sufficient granularity. A dedicated facial encoder with temporal modeling captures the dynamic evolution of expressions.
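The temporal modeling step described above, in which learnable query tokens attend over per-frame facial features, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single attention head, uses the frame features directly as keys and values (no learned projections), omits the two-layer projection to the language model dimension, and fills the queries with random values in place of trained parameters. All function and variable names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_cross_attention(queries, frame_feats):
    """Single-head cross-attention of temporal queries over frame features.

    queries:     num_queries x d (stand-in for learnable temporal query tokens Q)
    frame_feats: num_frames x d  (stand-in for facial features E_f, used as K and V)
    Returns the attended features (num_queries x d) and the attention weights.
    """
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*frame_feats)]  # transpose: d x num_frames
    scores = matmul(queries, keys_t)                   # num_queries x num_frames
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, frame_feats), weights


# Toy example: 16 frames and 4 temporal queries, feature dimension 8.
rng = random.Random(0)
frames = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(16)]
queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
attended, attn = temporal_cross_attention(queries, frames)
print(len(attended), len(attended[0]))  # 4 8
```

Because the number of queries is fixed, the output length is independent of the number of frames, which is one plausible reason to use query tokens for summarizing variable-length clips.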
Cross-Modal Hierarchical Expert Fusion Encoder:
Function: Adaptively fuses complementary affective information from visual and speech modalities.
Mechanism: Three fusion experts with independent weights each perform cross-attention fusion over features extracted from different layers of the visual and speech encoders (speech layers 16/18/22, visual layers 12/16/22), producing \(E_{mf}^i\) for \(i = 1, 2, 3\). A gating network then produces weights \(G_1, G_2, G_3\) that dynamically scale each expert's contribution, yielding the final fusion embedding \(E_{mf} = G_1 \odot E_{mf}^1 + G_2 \odot E_{mf}^2 + G_3 \odot E_{mf}^3\).
Design Motivation: Tasks at different cognitive levels require feature fusion at different representational depths (e.g., low-level features suit prosodic perception, high-level features suit semantic reasoning). Hierarchical experts with dynamic gating enable task-adaptive fusion.
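The gated combination \(E_{mf} = G_1 \odot E_{mf}^1 + G_2 \odot E_{mf}^2 + G_3 \odot E_{mf}^3\) can be illustrated with a small sketch. Two simplifications here are assumptions, not details from the paper: each gate is a single scalar per expert (so \(\odot\) reduces to scaling), and the gates are normalized with a softmax over hypothetical gate logits. The expert outputs are toy vectors standing in for the cross-attention results.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(expert_outputs, gate_logits):
    """Weighted sum of expert embeddings.

    expert_outputs: list of equally sized vectors, one per fusion expert.
    gate_logits:    one raw score per expert (assumed output of a gating network).
    Returns the fused vector and the normalized gate weights.
    """
    gates = softmax(gate_logits)
    dim = len(expert_outputs[0])
    fused = [sum(g * e[i] for g, e in zip(gates, expert_outputs)) for i in range(dim)]
    return fused, gates


# Toy example: three experts, two-dimensional embeddings, equal gate logits.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, gates = gated_fusion(experts, [0.0, 0.0, 0.0])
print(gates)
print(fused)
```

With equal logits every expert receives weight 1/3; raising one logit shifts the fused embedding toward that expert's depth of representation, which is the task-adaptive behavior the design motivation describes.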
P2E Progressive Training Framework:
Function: Cultivates the model's affective intelligence incrementally according to cognitive depth.
Mechanism: A three-phase curriculum. (1) Phase 1, fundamental modality alignment: only the modality-specific adapters are trained (visual + facial on FERV39K/CAER; speech on CREMA-D/M3ED). (2) Phase 2, cross-modal fusion pre-training: the fusion encoder is activated and trained on MIntRec/MIntRec2.0. (3) Phase 3, multi-task instruction fine-tuning: LoRA is activated to fine-tune the LM, and the tasks are trained jointly with a data mixture ratio of MER:OV-MER:MIR:ERI:ERG = 18:28:5:31:18. The ratio covers five of the six tasks; MSA is not explicitly trained and is acquired implicitly from the related tasks.
Design Motivation: Training proceeds from shallow to deep, following the principles of cognitive development — first establishing perceptual foundations, then cultivating cross-modal fusion, and finally developing higher-order reasoning and empathy.
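To make the curriculum concrete, the sketch below records which component each phase newly activates and turns the Phase 3 mixture ratio (18:28:5:31:18) into sampling probabilities. The module names are hypothetical labels, and drawing each training example's task independently from the ratio is an assumed reading of "data mixture ratio"; the paper may instead fix dataset sizes in that proportion.

```python
import random
from collections import Counter

# Components newly activated in each phase, as described in the text.
# The string labels are illustrative, not identifiers from the released code.
NEWLY_ACTIVATED = {
    1: ["visual_adapter", "facial_adapter", "speech_adapter"],
    2: ["fusion_encoder"],
    3: ["lm_lora"],
}

# Phase 3 data mixture ratio from the text (MSA is not explicitly trained).
MIXTURE = {"MER": 18, "OV-MER": 28, "MIR": 5, "ERI": 31, "ERG": 18}


def mixture_probs(ratio):
    """Normalize integer ratio weights into sampling probabilities."""
    total = sum(ratio.values())
    return {task: weight / total for task, weight in ratio.items()}


def sample_tasks(ratio, n, seed=0):
    """Draw n task labels with probability proportional to the ratio."""
    rng = random.Random(seed)
    tasks = list(ratio)
    return rng.choices(tasks, weights=[ratio[t] for t in tasks], k=n)


probs = mixture_probs(MIXTURE)
print(probs)
print(Counter(sample_tasks(MIXTURE, 10_000)))
```

The weights sum to 100, so each number can be read directly as a percentage of the Phase 3 training examples: ERI and OV-MER together account for 59%, while MIR contributes 5%.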

Loss & Training

A unified maximum likelihood estimation objective: \(\theta^{MLE} = \arg\max_\theta \sum \log P(Y|T;\theta)\). Different modules are progressively unfrozen across the three training phases.
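
As a hedged expansion of this objective: the note does not define its symbols, so assume \(T\) is the full conditioning input (instruction plus multimodal embeddings), \(Y = (y_1, \dots, y_L)\) is the target token sequence, and \(\mathcal{D}\) is the training set of the current phase (notation introduced here for illustration). Since the language model is autoregressive, the log-likelihood factorizes token by token:

\[
\theta^{MLE} = \arg\max_\theta \sum_{(T, Y) \in \mathcal{D}} \sum_{t=1}^{L} \log P(y_t \mid y_{<t}, T; \theta)
\]

Under this reading, the same objective is used in all three phases; what changes is \(\mathcal{D}\) and which subset of \(\theta\) is unfrozen and therefore receives gradients.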

Key Experimental Results

Main Results

Task	Nano-EmoX (2.2B)	AffectGPT (8.3B)	EmoLLMs (7B)	Notes
MSA	Competitive	SOTA	—	Implicitly learned
MER	SOTA / Competitive	Runner-up	Runner-up	Core perception task
OV-MER	SOTA	Runner-up	N/A	Open-vocabulary
ERI	SOTA / Competitive	Runner-up	N/A	Cause reasoning
MIR	SOTA	N/A	N/A	Intent recognition
ERG	SOTA / Competitive	N/A	N/A	Empathetic response

Ablation Study

Configuration	Key Metric	Notes
Full Nano-EmoX	Best	Complete framework
w/o Facial Encoder	Degraded	Facial cues are critical for emotion perception
w/o Fusion Encoder	Degraded	Cross-modal fusion is effective
w/o P2E (direct multi-task)	Significantly degraded	Progressive training is essential

Key Findings

2.2B parameters suffice to match or surpass 7–9B models across six tasks, demonstrating the effectiveness of the architecture and training strategy.
P2E progressive training yields substantial gains over direct multi-task training, validating the value of cognition-level curriculum design.
Adding the dedicated facial encoder contributes more to emotion perception than augmenting the general-purpose visual encoder.

Highlights & Insights

The three-level cognitive hierarchy framework serves not only as a task organization principle but also as a guiding design for the training strategy.
First unification of six affective tasks under <3B parameters, achieving an outstanding balance between efficiency and capability.
The Phase 2 design positioning intent recognition as a bridge between perception and reasoning is theoretically grounded — intent inference requires cross-modal integration.

Limitations & Future Work

Small-scale models may still underperform larger models on complex reasoning tasks.
Training data primarily covers English and Chinese; multilingual generalization remains unverified.
MSA is not explicitly trained but implicitly acquired from related tasks, which may be suboptimal.

Comparison with Prior Models

vs. AffectGPT: AffectGPT supports four tasks with 8.3B parameters; Nano-EmoX supports six tasks with 2.2B parameters at comparable or superior performance.
vs. EmoLLMs: EmoLLMs operates only on text-level affective tasks; Nano-EmoX extends to complete multimodal affective intelligence.

Rating

Novelty: ⭐⭐⭐⭐ The cognitive hierarchy framework and P2E training strategy are genuinely novel.
Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across six tasks with thorough ablation studies.
Writing Quality: ⭐⭐⭐⭐ Clear framework presentation with a solid cognitive-theoretic foundation.
Value: ⭐⭐⭐⭐⭐ A compact and efficient unified affective AI system with high practical deployment value.