Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy

Conference: CVPR 2026
arXiv: 2603.02123
Code: https://github.com/waHAHJIAHAO/Nano-EmoX
Area: Multimodal VLM
Keywords: Affective Computing, Multimodal Language Model, Cognitive Hierarchy, Emotion Recognition, Empathetic Interaction

TL;DR

Nano-EmoX proposes a cognition-inspired three-level hierarchy of emotional tasks (Perception → Understanding → Interaction) and is the first multimodal language model to unify six core affective tasks within a compact 2.2B-parameter framework. A P2E progressive training paradigm cultivates its capabilities from basic perception up to high-level empathy.

Background & Motivation

  1. Background: The development of affective multimodal language models (MLMs) is constrained by the gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization.
  2. Limitations of Prior Work: (i) Existing models are predominantly single-level specialists — dedicated to either perception (emotion recognition), understanding (cause reasoning), or interaction (empathetic response), lacking unification; (ii) large model scales (7–9B) render practical deployment difficult.
  3. Key Challenge: Emotional intelligence constitutes a continuum from perception to empathy, yet existing methods decompose it into isolated tasks, precluding cross-level knowledge transfer.
  4. Goal: Design a compact unified model (<3B parameters) capable of handling six core affective tasks across three cognitive levels: perception, understanding, and interaction.
  5. Key Insight: Inspired by perception-action models, affective tasks are organized by cognitive depth and trained progressively from low to high levels.
  6. Core Idea: Full-modality encoders (enhanced facial encoder + fusion encoder) combined with a P2E progressive training framework (Perception → Fusion → Multi-task Instruction Tuning).

Method

Overall Architecture

Four modality branches (visual, speech, facial, fusion) + heterogeneous adapters + a lightweight language model (Qwen2.5 2.2B). Six tasks: Multimodal Sentiment Analysis (MSA), Multimodal Emotion Recognition (MER), Open-Vocabulary MER (OV-MER), Emotion Cause Reasoning (ERI), Multimodal Intent Recognition (MIR), and Empathetic Response Generation (ERG).
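
A minimal sketch of how the four branches might be wired into a shared token sequence for the language model; the module names, adapter design, and all dimensions below are illustrative assumptions rather than the paper's exact configuration:

```python
# Hypothetical skeleton: four modality branches -> per-modality adapters -> one token
# sequence in the LM embedding space. Encoder outputs are faked with random tensors.
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Two-layer MLP that projects encoder features into the LM embedding space."""
    def __init__(self, in_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

class NanoEmoXSkeleton(nn.Module):
    """Concatenates visual / speech / facial / fusion tokens before the LM."""
    def __init__(self, dims: dict, lm_dim: int = 2048):
        super().__init__()
        self.adapters = nn.ModuleDict({m: ModalityAdapter(d, lm_dim) for m, d in dims.items()})

    def forward(self, feats: dict) -> torch.Tensor:
        # feats[m]: (batch, tokens_m, dims[m]); concatenate along the token axis
        tokens = [self.adapters[m](feats[m]) for m in ("visual", "speech", "facial", "fusion")]
        return torch.cat(tokens, dim=1)  # prepended to text embeddings before the LM

# Example with random features standing in for encoder outputs (dims are illustrative).
dims = {"visual": 1152, "speech": 1024, "facial": 512, "fusion": 1024}
model = NanoEmoXSkeleton(dims)
feats = {m: torch.randn(2, 8, d) for m, d in dims.items()}
print(model(feats).shape)  # torch.Size([2, 32, 2048])
```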

Key Designs

  1. Enhanced Facial Encoder
     • Function: Extracts fine-grained, identity-agnostic facial affective representations.
     • Mechanism: A FaceXFormer encoder extracts multi-scale facial features \(E_f\) from video frames. A Temporal Modeling module then recovers inter-frame temporal relationships via cross-attention, \(E_f^c = \text{CrossAttention}(Q, E_f^K, E_f^V)\), where \(Q\) denotes learnable temporal query tokens and \(E_f^K, E_f^V\) are the key/value projections of \(E_f\). A two-layer fully connected network projects the output to the language model dimension (a minimal sketch follows this item).
     • Design Motivation: Facial expressions are critical visual cues for affective perception, yet general-purpose visual encoders (e.g., SigLIP) lack sufficient granularity for them. A dedicated facial encoder with temporal modeling captures the dynamic evolution of expressions.
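
A minimal sketch of the temporal modeling step, assuming per-frame FaceXFormer features are already extracted; the query count, hidden sizes, and projection widths are illustrative:

```python
# Learnable temporal queries cross-attend over per-frame facial features E_f,
# then a two-layer FC projects the result to the LM dimension.
import torch
import torch.nn as nn

class TemporalFacialAdapter(nn.Module):
    def __init__(self, face_dim: int = 512, lm_dim: int = 2048,
                 num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, face_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(face_dim, num_heads, batch_first=True)
        self.proj = nn.Sequential(nn.Linear(face_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))

    def forward(self, e_f: torch.Tensor) -> torch.Tensor:
        # e_f: (batch, num_frames, face_dim) facial features per sampled frame
        q = self.queries.unsqueeze(0).expand(e_f.size(0), -1, -1)
        e_f_c, _ = self.cross_attn(query=q, key=e_f, value=e_f)  # E_f^c in the notation above
        return self.proj(e_f_c)  # (batch, num_queries, lm_dim)

frames = torch.randn(2, 32, 512)                  # 32 sampled frames per clip
print(TemporalFacialAdapter()(frames).shape)      # torch.Size([2, 16, 2048])
```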

  2. Cross-Modal Hierarchical Expert Fusion Encoder
     • Function: Adaptively fuses complementary affective information from the visual and speech modalities.
     • Mechanism: Three fusion experts (with independent weights) each perform cross-attention fusion over features tapped from different layers of the visual and speech encoders (speech layers 16/18/22 paired with visual layers 12/16/22), producing \(E_{mf}^i\). A gating network dynamically weights each expert's contribution \(G_1, G_2, G_3\), yielding the final fusion embedding \(E_{mf} = G_1 \odot E_{mf}^1 + G_2 \odot E_{mf}^2 + G_3 \odot E_{mf}^3\) (see the sketch after this item).
     • Design Motivation: Tasks at different cognitive levels require feature fusion at different representational depths (e.g., low-level features suit prosodic perception, high-level features suit semantic reasoning). Hierarchical experts with dynamic gating enable task-adaptive fusion.
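
A minimal sketch of the gated three-expert fusion, assuming the listed encoder layers have already been tapped and projected to a common dimension; the query/key roles, the gating input (pooled speech features), and all dimensions are assumptions:

```python
# Three cross-attention experts over different (speech layer, visual layer) pairs,
# mixed by a softmax gating network: E_mf = G_1*E_mf^1 + G_2*E_mf^2 + G_3*E_mf^3.
import torch
import torch.nn as nn

class FusionExpert(nn.Module):
    """One expert: speech tokens attend over visual tokens from one layer pair."""
    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, speech: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        fused, _ = self.cross_attn(query=speech, key=visual, value=visual)
        return fused  # E_mf^i

class HierarchicalExpertFusion(nn.Module):
    def __init__(self, dim: int = 1024, num_experts: int = 3):
        super().__init__()
        self.experts = nn.ModuleList([FusionExpert(dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)  # gating network over a pooled summary

    def forward(self, speech_layers: list, visual_layers: list) -> torch.Tensor:
        # speech_layers / visual_layers: one (batch, tokens, dim) tensor per tapped layer
        outputs = [e(s, v) for e, s, v in zip(self.experts, speech_layers, visual_layers)]
        pooled = speech_layers[-1].mean(dim=1)                 # (batch, dim) gate input
        gates = torch.softmax(self.gate(pooled), dim=-1)       # (batch, 3) -> G_1, G_2, G_3
        stacked = torch.stack(outputs, dim=1)                  # (batch, 3, tokens, dim)
        return (gates[:, :, None, None] * stacked).sum(dim=1)  # E_mf

speech_taps = [torch.randn(2, 50, 1024) for _ in (16, 18, 22)]   # speech layers 16/18/22
visual_taps = [torch.randn(2, 24, 1024) for _ in (12, 16, 22)]   # visual layers 12/16/22
print(HierarchicalExpertFusion()(speech_taps, visual_taps).shape)  # torch.Size([2, 50, 1024])
```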

  3. P2E Progressive Training Framework
     • Function: Cultivates the model's affective intelligence incrementally, in order of cognitive depth.
     • Mechanism: A three-phase curriculum (sketched below): (1) Phase 1, fundamental modality alignment: train only the modality-specific adapters (visual + facial on FERV39K/CAER; speech on CREMA-D/M3ED); (2) Phase 2, cross-modal fusion pre-training: activate and train the fusion encoder on MIntRec/MIntRec2.0; (3) Phase 3, multi-task instruction fine-tuning: activate LoRA to fine-tune the LM and jointly train all six tasks with a carefully designed data mixture ratio (MER:OV-MER:MIR:ERI:ERG = 18:28:5:31:18).
     • Design Motivation: Training proceeds from shallow to deep, following the principles of cognitive development: first establishing perceptual foundations, then cultivating cross-modal fusion, and finally developing higher-order reasoning and empathy.
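
A minimal sketch of how the P2E freeze/unfreeze schedule and the Phase 3 task mixture could be expressed in code; the module names are placeholders, LoRA is abstracted to a single trainable sub-module, and which components stay trainable in each phase is an assumption beyond what the description above states:

```python
# Per-phase trainability schedule plus the Phase-3 sampling weights implied by the
# stated data mixture ratio.
import torch.nn as nn

def configure_phase(modules: dict, phase: int) -> None:
    """Freeze everything, then unfreeze only what the given P2E phase trains."""
    trainable = {
        1: {"adapters"},                   # Phase 1: modality alignment (adapters only)
        2: {"fusion_encoder"},             # Phase 2: cross-modal fusion pre-training
        3: {"fusion_encoder", "lm_lora"},  # Phase 3: multi-task tuning, LoRA on the LM
    }[phase]
    for name, module in modules.items():
        for p in module.parameters():
            p.requires_grad = name in trainable

# Phase-3 task sampling weights implied by MER:OV-MER:MIR:ERI:ERG = 18:28:5:31:18
MIXTURE = {"MER": 18, "OV-MER": 28, "MIR": 5, "ERI": 31, "ERG": 18}
SAMPLING_PROBS = {task: w / sum(MIXTURE.values()) for task, w in MIXTURE.items()}

# Placeholder modules standing in for the real components.
modules = {name: nn.Linear(8, 8)
           for name in ("adapters", "fusion_encoder", "lm_lora", "visual_encoder")}
configure_phase(modules, phase=1)
print([n for n, m in modules.items() if next(m.parameters()).requires_grad])  # ['adapters']
```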

Loss & Training

A unified maximum likelihood estimation objective is used: \(\theta^{\mathrm{MLE}} = \arg\max_{\theta} \sum \log P(Y \mid T; \theta)\), where \(T\) is the multimodal instruction input and \(Y\) the target response. Different modules are progressively unfrozen across the three training phases.
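
Expanded under the standard autoregressive factorization (our elaboration; the summary above gives only the compact form), the same objective reads

\[
\theta^{\mathrm{MLE}} = \arg\max_{\theta} \sum_{(T,\,Y)} \sum_{t=1}^{|Y|} \log P\!\left(y_t \mid y_{<t},\, T;\ \theta\right),
\]

i.e., a token-level cross-entropy loss on the target response \(Y\) conditioned on the multimodal input \(T\).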

Key Experimental Results

Main Results

| Task | Nano-EmoX (2.2B) | AffectGPT (8.3B) | EmoLLMs (7B) | Notes |
|------|------------------|------------------|--------------|-------|
| MSA | Competitive | SOTA | | Implicitly learned |
| MER | SOTA / Competitive | Runner-up | Runner-up | Core perception task |
| OV-MER | SOTA | Runner-up | N/A | Open-vocabulary |
| ERI | SOTA / Competitive | Runner-up | N/A | Cause reasoning |
| MIR | SOTA | N/A | N/A | Intent recognition |
| ERG | SOTA / Competitive | N/A | N/A | Empathetic response |

Ablation Study

| Configuration | Key Metric | Notes |
|---------------|------------|-------|
| Full Nano-EmoX | Best | Complete framework |
| w/o Facial Encoder | Degraded | Facial cues are critical for emotion perception |
| w/o Fusion Encoder | Degraded | Cross-modal fusion is effective |
| w/o P2E (direct multi-task) | Significantly degraded | Progressive training is essential |

Key Findings

  • 2.2B parameters suffice to match or surpass 7–9B models across six tasks, demonstrating the effectiveness of the architecture and training strategy.
  • P2E progressive training yields substantial gains over direct multi-task training, validating the value of cognition-level curriculum design.
  • A dedicated facial encoder contributes more to emotion perception than simply augmenting a general-purpose visual encoder.

Highlights & Insights

  • The three-level cognitive hierarchy framework serves not only as a task organization principle but also as a guiding design for the training strategy.
  • First unification of six affective tasks under <3B parameters, achieving an outstanding balance between efficiency and capability.
  • The Phase 2 design positioning intent recognition as a bridge between perception and reasoning is theoretically grounded — intent inference requires cross-modal integration.

Limitations & Future Work

  • Small-scale models may still underperform larger models on complex reasoning tasks.
  • Training data primarily covers English and Chinese; multilingual generalization remains unverified.
  • MSA is not explicitly trained but implicitly acquired from related tasks, which may be suboptimal.
  • vs. AffectGPT: AffectGPT supports four tasks at 8.3B parameters; Nano-EmoX supports six tasks with 2.2B parameters at comparable or superior performance.
  • vs. EmoLLMs: EmoLLMs operates only on text-level affective tasks; Nano-EmoX extends to complete multimodal affective intelligence.

Rating

  • Novelty: ⭐⭐⭐⭐ The cognitive hierarchy framework and P2E training strategy are genuinely novel.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ Comprehensive evaluation across six tasks with thorough ablation studies.
  • Writing Quality: ⭐⭐⭐⭐ Clear framework presentation with a solid cognitive-theoretic foundation.
  • Value: ⭐⭐⭐⭐⭐ A compact and efficient unified affective AI system with high practical deployment value.