# Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning

## Paper Information
- Conference: ICLR 2026
- arXiv: 2602.00971
- Code: https://HitEmotion.github.io/
- Area: Multimodal Affective Computing / Theory of Mind / Reinforcement Learning / Large Models
- Keywords: Theory of Mind, Emotion Reasoning, MLLM, Hierarchical Benchmark, GRPO, Chain-of-Thought Optimization
## TL;DR
This paper constructs HitEmotion, a hierarchical multimodal emotion understanding benchmark grounded in Theory of Mind (ToM), and proposes the TMPO framework, which leverages intermediate mental states as process-level supervision to enhance the emotion reasoning capabilities of MLLMs.
## Background & Motivation

### Core Problem
Despite strong performance across various tasks, multimodal large language models (MLLMs) exhibit notable deficiencies in deep emotion understanding, stemming from three core issues:
- Lack of a unified cognitive framework: Existing benchmarks provide only coarse-grained scores and cannot pinpoint where model reasoning breaks down.
- Unfaithful reasoning chains: CoT reasoning appears coherent but is essentially template matching, with no genuine tracking of mental states.
- Emotion hallucination: Models produce distorted emotion attributions when faced with conflicting cross-modal cues.
### Limitations of Prior Work
- EQ-Bench and EmoBench cover only the text modality.
- EmoBench-M and EmotionHallucer are multimodal but feature scattered task designs not organized by cognitive depth.
- No existing benchmark provides both reasoning-chain and rationale evaluation.
## Method

### HitEmotion Benchmark: Three-Level Cognitive Hierarchy
- Level 1 — Emotion Perception and Recognition (EPR): 10 tasks
  - Maps multimodal signals to predefined emotion categories.
  - Includes tasks such as facial expression recognition and multimodal sentiment recognition.
- Level 2 — Emotion Understanding and Analysis (EUA): 8 tasks
  - Requires context-aware and relational reasoning.
  - Includes tasks such as humor understanding, sarcasm detection, and multi-party dialogue emotion analysis.
- Level 3 — Emotion Cognition and Reasoning (ECR): 6 tasks
  - Demands causal reasoning and second-order mental inference.
  - Includes tasks such as emotion elicitation reasoning, emotion explanation, and irony comprehension.
In total: 24 tasks, 20,114 instances, covering both video and image modalities.
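For reference, a minimal rendering of this hierarchy as a lookup table (task lists abbreviated to the examples named above):

```python
# The three-level HitEmotion hierarchy as a simple lookup table; task
# lists are abbreviated to the examples named in the section above.
HITEMOTION_LEVELS = {
    "EPR": {"depth": 1, "num_tasks": 10,
            "examples": ["facial expression recognition",
                         "multimodal sentiment recognition"]},
    "EUA": {"depth": 2, "num_tasks": 8,
            "examples": ["humor understanding", "sarcasm detection",
                         "multi-party dialogue emotion analysis"]},
    "ECR": {"depth": 3, "num_tasks": 6,
            "examples": ["emotion elicitation reasoning",
                         "emotion explanation", "irony comprehension"]},
}
assert sum(v["num_tasks"] for v in HITEMOTION_LEVELS.values()) == 24
```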
### TMPO Training Framework

#### Stage 1: ToM-Aligned Supervised Fine-Tuning (SFT)

A structured reasoning template is employed, wrapping intermediate reasoning steps within `<think></think>` tags and the final output within `<answer></answer>` tags.
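A minimal illustrative instance of such a template; the three-step wording inside `<think>` is an assumption for illustration, not the paper's exact prompt:

```python
# Illustrative instance of the structured SFT template; the step wording
# inside <think> is an assumption, not the paper's exact prompt.
RESPONSE_TEMPLATE = (
    "<think>\n"
    "1. Perceive: describe the observable facial, vocal, and textual cues.\n"
    "2. Attribute: infer the character's beliefs, desires, and intentions.\n"
    "3. Conclude: state the emotion these mental states imply.\n"
    "</think>\n"
    "<answer>{final_emotion}</answer>"
)

print(RESPONSE_TEMPLATE.format(final_emotion="sadness"))
```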
Gold-standard reasoning chains are constructed via a four-step pipeline: LLM generation → filtering → augmentation → correction.
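A skeletal sketch of that four-step pipeline; all helper names and heuristics below are hypothetical, not the paper's implementation:

```python
# Sketch of the gold-chain construction pipeline
# (LLM generation -> filtering -> augmentation -> correction).

def generate_chain(llm, sample: dict) -> str:
    # Step 1: prompt an LLM to draft a ToM-style reasoning chain.
    return llm(f"Reason step by step about the emotion in: {sample['input']}")

def passes_filter(chain: str, gold_label: str) -> bool:
    # Step 2: discard drafts whose conclusion contradicts the gold label
    # (a real filter would be far stricter than this keyword check).
    return gold_label.lower() in chain.lower()

def augment(chain: str, sample: dict) -> str:
    # Step 3: enrich the chain with explicit multimodal cues from the sample.
    return f"Observed cues: {sample.get('cues', 'n/a')}\n{chain}"

def correct(llm, chain: str, gold_label: str) -> str:
    # Step 4: ask the LLM to repair residual logical or factual errors.
    return llm(f"Revise this chain so it faithfully supports '{gold_label}':\n{chain}")

def build_gold_chain(llm, sample: dict, gold_label: str) -> str | None:
    chain = generate_chain(llm, sample)
    if not passes_filter(chain, gold_label):
        return None
    return correct(llm, augment(chain, sample), gold_label)
```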
#### Stage 2: GRPO-Based ToM Preference Optimization
For each input, \(N\) candidate outputs are sampled and evaluated through a multi-dimensional reward function:
Four reward components (a sketch follows the list):
- Structure Reward: Correct ordering of reasoning steps.
- Content Reward: Correctness of the final answer.
- Process Reward: Use of domain-specific mental-state language.
- Consistency Reward: Penalty for logical and factual inconsistencies.
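A hedged sketch of how these components might be combined into a scalar reward; the component heuristics and weights below are assumptions for illustration, not the paper's values:

```python
# Sketch of a combined scalar reward; heuristics and weights are assumed.
import re

def structure_reward(y: str) -> float:
    # Well-ordered output: a <think> block followed by an <answer> block.
    return 1.0 if re.search(r"<think>.*</think>\s*<answer>.*</answer>", y, re.S) else 0.0

def content_reward(y: str, gold: str) -> float:
    # Correct final answer inside the <answer> tags.
    m = re.search(r"<answer>(.*?)</answer>", y, re.S)
    return 1.0 if m and m.group(1).strip().lower() == gold.lower() else 0.0

def process_reward(y: str) -> float:
    # Use of mental-state (domain-specific) vocabulary in the chain.
    terms = ("believes", "intends", "desires", "feels", "expects")
    return min(1.0, sum(t in y.lower() for t in terms) / 3)

def consistency_penalty(y: str) -> float:
    # Stub: a real implementation would detect logical/factual
    # inconsistencies, e.g., with an NLI model.
    return 0.0

def total_reward(y: str, gold: str) -> float:
    w_struct, w_content, w_process, w_consist = 0.2, 0.5, 0.2, 0.1  # assumed weights
    return (w_struct * structure_reward(y) + w_content * content_reward(y, gold)
            + w_process * process_reward(y) - w_consist * consistency_penalty(y))
```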
GRPO optimization objective:

$$
\max_{\pi_\theta}\; \mathbb{E}_{y_i \sim \pi_{\text{old}}}\!\left[ \frac{\pi_\theta(y_i)}{\pi_{\text{old}}(y_i)}\, A_i \right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)
$$

where the advantage \(A_i\) is group-relative: each candidate's total reward is normalized by the mean and standard deviation of the rewards of the \(N\) candidates sampled for the same input.
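A minimal PyTorch sketch of this update for one group of \(N\) candidates; the KL estimator and the value of `beta` are illustrative assumptions:

```python
# Minimal PyTorch sketch of one GRPO step over a group of N candidates.
import torch

def grpo_loss(logp_new: torch.Tensor,   # (N,) log-probs under current policy
              logp_old: torch.Tensor,   # (N,) log-probs under sampling policy
              logp_ref: torch.Tensor,   # (N,) log-probs under frozen reference
              rewards: torch.Tensor,    # (N,) scalar rewards per candidate
              beta: float = 0.04) -> torch.Tensor:
    # Group-relative advantage: normalize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Importance ratio pi_theta(y_i) / pi_old(y_i).
    ratio = torch.exp(logp_new - logp_old)
    surrogate = (ratio * adv).mean()
    # Crude per-sample KL penalty against the reference policy.
    kl = (logp_new - logp_ref).mean()
    return -(surrogate - beta * kl)  # negate: optimizers minimize
```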
### ToM-Style Prompting Mechanism
Prompt designs are tailored to the three levels of cognitive complexity (an illustrative sketch follows the list):
- Level 1: First-order mental state attribution — integrating observable signals to infer emotions.
- Level 2: Relational and contextual mental modeling — associating emotions with specific entities or communicative intentions.
- Level 3: Causal attribution and second-order reasoning — explaining the origins of emotions and their social interpretations.
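Illustrative paraphrases of the three prompt levels; the wording is assumed, not copied from the paper:

```python
# Illustrative paraphrases of the three ToM prompt levels; wording is
# an assumption, not the paper's exact prompts.
TOM_PROMPTS = {
    1: "First-order attribution: given the facial, vocal, and textual cues "
       "you observe, what emotion is the person feeling?",
    2: "Relational modeling: considering who is speaking to whom and with "
       "what communicative intent, whose emotion is expressed and toward "
       "which entity or intention is it directed?",
    3: "Second-order reasoning: explain what caused this emotion and how "
       "other participants are likely to interpret it.",
}
```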
## Experiments

### Baseline Model Evaluation (EPR Level 1)

Representative open-source baselines on the seven Level-1 (EPR) tasks (higher is better):

| Model | FESD | ISA | MESA | MER | MSA | OSA | SIA |
|---|---|---|---|---|---|---|---|
| VideoLLaMA3-7B | 61.78 | 46.85 | 21.60 | 52.18 | 64.62 | 67.89 | 35.20 |
| LLaVA-One-Vision-7B | 63.44 | 49.19 | 17.05 | 39.50 | 65.40 | 63.00 | 27.00 |
### Key Findings
- State-of-the-art models exhibit inconsistent performance on higher-level cognitive tasks: Even the strongest closed-source models show significant deficiencies at the ECR level.
- ToM reasoning chains, used alone as a prompting strategy, substantially improve closed-source model performance: This validates the effectiveness of ToM as a reasoning scaffold.
- TMPO optimization yields consistent gains: It outperforms baselines across all evaluation tasks, generating reasoning chains that are markedly superior in faithfulness and logical coherence.
- From "general emergence" to "domain acquisition": TMPO transforms reasoning ability from a general property into a cognitively specialized skill.
## Highlights & Insights
- The first evaluation framework to unify psychological theory with MLLM reasoning processes and rationale generation.
- Elegant ToM prompting mechanism design: Three levels of cognitive hierarchy correspond to three reasoning templates of increasing depth.
- Innovative combination of GRPO and process-level rewards: Intermediate mental states serve simultaneously as supervision signals and reward sources.
- Scale: A comprehensive benchmark spanning 24 tasks and 20K+ instances.
## Limitations & Future Work
- Gold-standard reasoning chains rely on LLM generation, potentially introducing inherent LLM biases.
- The benchmark is built upon reconstructed existing datasets, whose original annotation quality varies.
- GRPO training incurs considerable computational cost.
- Evaluation is primarily conducted in single-turn QA settings; emotion reasoning in multi-turn interactions remains underexplored.
## Related Work & Insights
- Multimodal Affective Computing: Fusion strategies such as SALV and PAD have evolved from early/late fusion to intermediate interaction approaches.
- Affective Intelligence Evaluation: A progression from EQ-Bench → EmoBench-M → EmotionHallucer.
- ToM Reasoning: Works from ToMBench to MMToM-QA reveal ToM deficiencies in MLLMs.
- Reasoning Optimization: The success of DeepSeek-R1's GRPO approach in text-based reasoning.
## Rating
- Novelty: ⭐⭐⭐⭐ — Deep integration of the ToM cognitive framework with MLLM evaluation and training.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across 24 tasks.
- Writing Quality: ⭐⭐⭐⭐ — Clear problem formulation with well-motivated methodology.
- Value: ⭐⭐⭐⭐ — Provides both an evaluation toolkit and an optimization method.